Database: the Journal of Biological... Aug 2022
Reproducibility of research is essential for science. However, in the way modern computational biology research is done, it is easy to lose track of small but extremely critical details. Key details, such as the specific version of the software used or the iteration of a genome, can easily be lost in the shuffle or perhaps not noted at all. Much work is being done on the database and storage side of things, ensuring that there exists a space to store experiment-specific details, but current mechanisms for recording details are cumbersome for scientists to use. We propose a new metadata description language, named MEtaData Format for Open Reef Data (MEDFORD), in which scientists can record all details relevant to their research. Being human-readable, easily editable and templatable, MEDFORD serves as a collection point for all notes that a researcher could find relevant to their research, be it for internal use or for future replication. MEDFORD has been applied to coral research, documenting research ranging from RNA-seq analyses to photo collections.
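The abstract describes MEDFORD as a human-readable, line-oriented, templatable format but does not spell out its concrete syntax. As a minimal sketch, assuming a hypothetical @-tag syntax (major tag, optional minor tag, value) in the spirit described, such a file could be parsed like this:

```python
# Minimal parser for a hypothetical @-tag metadata format in the spirit of
# MEDFORD; the real MEDFORD syntax may differ, this is only an illustration.
# Each line is "@Major value" (opens a record) or "@Major-Minor value"
# (adds a detail to the most recently opened record).

def parse_tags(text):
    """Parse @-tag lines into {major: [record, ...]} dictionaries."""
    records = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("@"):
            continue  # freeform notes are simply ignored in this sketch
        tag, _, value = line[1:].partition(" ")
        major, _, minor = tag.partition("-")
        if not minor:
            current = {"value": value}
            records.setdefault(major, []).append(current)
        elif current is not None:
            # Attach the minor detail to the record opened most recently
            current[minor] = value
    return records

example = """\
@Contributor Jane Doe
@Contributor-Role data curation
@Software samtools
@Software-Version 1.15
"""
print(parse_tags(example))
```

The point of such a format is exactly what the abstract argues: the software version sits next to the software name in a file a human can read and a script can check.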
Topics: Computational Biology; Humans; Language; Metadata; Reproducibility of Results; Software
PubMed: 35976727
DOI: 10.1093/database/baac065

Database: the Journal of Biological... Jan 2016
Enormous amounts of biomedical data have been and are being produced by investigators all over the world. However, one crucial and limiting factor in data reuse is accurate, structured and complete description of the data, or data about the data, defined as metadata. We propose a framework to predict structured metadata terms from unstructured metadata for improving the quality and quantity of metadata, using the Gene Expression Omnibus (GEO) microarray database. Our framework consists of classifiers trained using term frequency-inverse document frequency (TF-IDF) features and a second approach based on topics modeled using a Latent Dirichlet Allocation (LDA) model to reduce the dimensionality of the unstructured data. Our results on the GEO database show that structured metadata terms can be most accurately predicted using the TF-IDF approach, followed by LDA, with both outperforming the majority-vote baseline. While some accuracy is lost through the dimensionality reduction of LDA, the difference is small for elements with few possible values, and there is a large improvement over the majority-classifier baseline. Overall, this is a promising approach for metadata prediction that is likely to be applicable to other datasets, with implications for researchers interested in biomedical metadata curation and metadata prediction.
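A toy version of the TF-IDF idea can be written in pure Python: vectorize free-text sample descriptions with TF-IDF and predict a structured term by nearest class centroid. The sample texts, labels, and the nearest-centroid classifier below are invented stand-ins (the paper does not specify its classifiers), not the authors' actual pipeline or GEO data:

```python
import math
from collections import Counter

# Toy TF-IDF pipeline: unstructured metadata text -> TF-IDF vectors ->
# nearest-centroid prediction of a structured term. All texts and labels
# are invented for illustration; they are not drawn from GEO.

def tfidf(docs):
    """Return one TF-IDF vector (dict: term -> weight) per document."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d.split()))
    vecs = []
    for d in docs:
        tf = Counter(d.split())
        vecs.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vecs

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def centroid(vectors):
    total = Counter()
    for v in vectors:
        total.update(v)
    return {t: w / len(vectors) for t, w in total.items()}

train_texts = [
    "total rna extracted from liver tissue of adult mice",
    "liver biopsy rna profiled on affymetrix array",
    "rna from whole brain cortex of adult mice",
    "brain cortex samples hybridized to affymetrix array",
]
train_labels = ["liver", "liver", "brain", "brain"]

vecs = tfidf(train_texts)
centroids = {lab: centroid([v for v, l in zip(vecs, train_labels) if l == lab])
             for lab in set(train_labels)}

def predict(text):
    # Re-vectorize with the query appended so it shares the corpus IDF
    v = tfidf(train_texts + [text])[-1]
    return max(centroids, key=lambda lab: cosine(v, centroids[lab]))

print(predict("fresh liver tissue rna sample"))
```

The majority-vote baseline the paper compares against would simply return the most common training label regardless of the text, which is why even this crude vectorization beats it.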
Topics: Computational Biology; Data Mining; Databases, Genetic; Gene Expression Profiling; Humans; Metadata; Models, Statistical
PubMed: 28637268
DOI: 10.1093/database/baw080

Nature Communications Nov 2022
There are currently >1.3 million human -omics samples that are publicly available. This valuable resource remains acutely underused because discovering particular samples from this ever-growing data collection remains a significant challenge. The major impediment is that sample attributes are routinely described using varied terminologies written in unstructured natural language. We propose a natural-language-processing-based machine learning approach (NLP-ML) to infer tissue and cell-type annotations for genomics samples based only on their free-text metadata. NLP-ML works by creating numerical representations of sample descriptions and using these representations as features in a supervised learning classifier that predicts tissue/cell-type terms. Our approach significantly outperforms an advanced graph-based reasoning annotation method (MetaSRA) and a baseline exact string matching method (TAGGER). Model similarities between related tissues demonstrate that NLP-ML models capture biologically meaningful signals in text. Additionally, these models correctly classify tissue-associated biological processes and diseases based on their text descriptions alone. NLP-ML models are nearly as accurate as models based on gene-expression profiles in predicting sample tissue annotations but have the distinct capability to classify samples irrespective of the genomics experiment type based on their text metadata. Python NLP-ML prediction code and trained tissue models are available at https://github.com/krishnanlab/txt2onto.
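The exact-string-matching baseline the paper compares against is easy to sketch: scan the free-text description for literal occurrences of ontology term names. The three-term tissue vocabulary below is a tiny invented stand-in, not TAGGER's actual dictionary (the UBERON identifiers shown are the real ones for these tissues):

```python
# Sketch of an exact-string-matching annotation baseline: map free-text
# sample metadata to tissue terms only when the term name appears verbatim.
# The vocabulary is a tiny illustrative subset, not TAGGER's dictionary.

TISSUE_TERMS = {
    "liver": "UBERON:0002107",
    "brain": "UBERON:0000955",
    "blood": "UBERON:0000178",
}

def tag_sample(description):
    """Return {term: ontology_id} for every vocabulary term found verbatim."""
    text = description.lower()
    return {t: oid for t, oid in TISSUE_TERMS.items() if t in text}

print(tag_sample("RNA-seq of adult human liver, whole blood control"))
```

The weakness is visible immediately: a description saying "hepatic tissue" would match nothing, which is precisely the synonym/terminology gap that learned text representations close.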
Topics: Humans; Metadata; Natural Language Processing; Machine Learning; Genomics; Language
PubMed: 36347858
DOI: 10.1038/s41467-022-34435-x

F1000Research 2022
The web tool Adamant has been developed to systematically collect research metadata as early as the conception of the experiment. Adamant enables a continuous, consistent, and transparent research data management (RDM) process, which is a key element of good scientific practice ensuring the path to Findable, Accessible, Interoperable, Reusable (FAIR) research data. It simplifies the creation of on-demand metadata schemas and the collection of metadata according to established or new standards. The approach is based on JavaScript Object Notation (JSON) Schema, where any valid schema can be presented as an interactive web form. Furthermore, Adamant eases the integration of numerous available RDM methods and software tools into everyday research activities, especially for small independent laboratories. A programming interface allows programmatic integration with other software tools such as electronic lab books or repositories. The user interface (UI) of Adamant is designed to be as user-friendly as possible. Each UI element is self-explanatory and intuitive to use, which makes it accessible to users who have little to no experience with the JSON format and programming in general. Several examples of research data management workflows that can be implemented using Adamant are introduced. Adamant (client-only version) is available from: https://plasma-mds.github.io/adamant.
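The core mechanism, a JSON Schema driving both a form and validation, can be illustrated with a minimal validator covering only the `type`, `required`, and `properties` keywords. The example schema is invented for illustration and is not taken from Adamant or the Plasma-MDS:

```python
import json

# Minimal validator for a small subset of JSON Schema ("type", "required",
# "properties"), illustrating how one declarative schema can drive both a
# web form and metadata validation. The schema below is invented.

TYPES = {"string": str, "number": (int, float), "integer": int,
         "boolean": bool, "object": dict, "array": list}

def validate(instance, schema):
    """Return a list of error messages; an empty list means valid."""
    errors = []
    expected = schema.get("type")
    if expected and not isinstance(instance, TYPES[expected]):
        errors.append(f"expected {expected}, got {type(instance).__name__}")
        return errors
    if expected == "object":
        for key in schema.get("required", []):
            if key not in instance:
                errors.append(f"missing required property: {key}")
        for key, subschema in schema.get("properties", {}).items():
            if key in instance:
                errors.extend(f"{key}: {e}"
                              for e in validate(instance[key], subschema))
    return errors

schema = json.loads("""{
  "type": "object",
  "required": ["title", "date"],
  "properties": {
    "title": {"type": "string"},
    "date": {"type": "string"},
    "pressure_pa": {"type": "number"}
  }
}""")

print(validate({"title": "Plasma run 7", "pressure_pa": 101325}, schema))
```

A form generator walks the same `properties` map to render one input per element, which is why any valid schema can double as an interactive form definition.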
Topics: Data Management; Humans; Metadata; Software; Workflow
PubMed: 35707001
DOI: 10.12688/f1000research.110875.2

Studies in Health Technology and... Sep 2019
Interoperability is a growing demand in healthcare, caused by heterogeneous sources that complicate information transfer. These interoperability issues can be addressed by metadata repositories, which help ensure syntactic interoperability, such as compatible data formats or value ranges; semantic interoperability, however, remains especially challenging. Semantic annotation through standardized terminologies and classifications helps foster semantic interoperability. This work aims to interconnect Samply.MDR and the Portal of Medical Data Models (MDM-Portal) to facilitate semantic annotation with UMLS. To this end, Samply.MDR was extended to store semantic information. While creating a data element, a request is sent to MDM, which returns possible UMLS codes. The user can then adopt the most suitable code and select a link type between the code and the element itself. A successful enrichment of data elements with UMLS codes was demonstrated by interconnecting Samply.MDR and MDM-Portal.
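The annotation workflow, create a data element, receive suggested UMLS codes from a terminology service, adopt one code with a link type, can be sketched as a small data model. All class, field, and code values below are invented for illustration and do not mirror the actual Samply.MDR or MDM-Portal data models:

```python
from dataclasses import dataclass, field

# Sketch of the described workflow: a metadata-repository data element is
# enriched with a UMLS concept code plus a user-chosen link type. Names and
# CUIs here are hypothetical placeholders, not the real systems' models.

@dataclass
class SemanticLink:
    umls_cui: str   # UMLS concept unique identifier (placeholder values below)
    link_type: str  # relation of code to element, e.g. "equal" or "broader"

@dataclass
class DataElement:
    name: str
    links: list = field(default_factory=list)

    def annotate(self, suggested, chosen_cui, link_type):
        """Adopt one code from the suggestions a terminology service returned."""
        if chosen_cui not in suggested:
            raise ValueError(f"{chosen_cui} was not among the suggested codes")
        self.links.append(SemanticLink(chosen_cui, link_type))

elem = DataElement("systolic_blood_pressure")
suggestions = ["C0000001", "C0000002"]  # hypothetical service response
elem.annotate(suggestions, "C0000001", "equal")
print(elem.links[0])
```

Keeping the link type alongside the code is what lets a later consumer distinguish an exact concept match from a merely related one.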
Topics: Metadata; Semantics
PubMed: 31483259
DOI: 10.3233/SHTI190810

BMC Research Notes Mar 2022
Centralized project-specific metadata platforms: toolkit provides new perspectives on open data management within multi-institution and multidisciplinary research projects.
Open science and open data within scholarly research programs are growing both in popularity and by requirement from grant funding agencies and journal publishers. A central component of open data management, especially on collaborative, multidisciplinary, and multi-institutional science projects, is documentation of complete and accurate metadata, workflow, and source code, in addition to access to raw data and data products, to uphold FAIR (Findable, Accessible, Interoperable, Reusable) principles. Although best practice in data/metadata management is to use established, internationally accepted metadata schemata, many of these standards are discipline-specific, making it difficult to catalog multidisciplinary data and data products in a way that is easily findable and accessible. Consequently, scattered and incompatible metadata records create a barrier to scientific innovation, as researchers are burdened to find and link multidisciplinary datasets. One possible solution to increase data findability, accessibility, interoperability, reproducibility, and integrity within multi-institutional and interdisciplinary projects is a centralized and integrated data management platform. Overall, this type of interoperable framework supports reproducible open science and its dissemination to various stakeholders and the public in a FAIR manner by providing direct access to raw data and by linking protocols, metadata and supporting workflow materials.
Topics: Data Management; Interdisciplinary Research; Metadata; Reproducibility of Results; Software
PubMed: 35303952
DOI: 10.1186/s13104-022-05996-3

Scientific Data Apr 2022
The genomes of thousands of individuals are profiled within Dutch healthcare and research each year. However, this valuable genomic data, associated clinical data and consent are captured in different ways and stored across many systems and organizations. This makes it difficult to discover rare disease patients, reuse data for personalized medicine and establish research cohorts based on specific parameters. FAIR Genomes aims to enable NGS data reuse by developing metadata standards for the data descriptions needed to FAIRify genomic data while also addressing ELSI issues. We developed a semantic schema of essential data elements harmonized with international FAIR initiatives. The FAIR Genomes schema v1.1 contains 110 elements in 9 modules. It reuses common ontologies such as NCIT, DUO and EDAM, only introducing new terms when necessary. The schema is represented by a YAML file that can be transformed into templates for data entry software (EDC) and programmatic interfaces (JSON, RDF) to ease genomic data sharing in research and healthcare. The schema, documentation and MOLGENIS reference implementation are available at https://fairgenomes.org.
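The transformation idea, one declarative module/element schema generating downstream artefacts, can be sketched as follows. The paper's schema lives in YAML; a plain Python dict stands in here, and every module, element, and ontology reference below is invented, not copied from the FAIR Genomes v1.1 schema:

```python
import json

# Sketch of a modular metadata schema (modules containing elements, each
# optionally bound to an ontology) and one derived artefact: a blank
# data-entry template. Names are illustrative, not FAIR Genomes v1.1.

schema = {
    "name": "ExampleGenomicsSchema",
    "modules": [
        {"name": "Sample",
         "elements": [
             {"name": "tissue", "ontology": "NCIT", "required": True},
             {"name": "collection_date", "type": "date", "required": False},
         ]},
        {"name": "Sequencing",
         "elements": [
             {"name": "platform", "ontology": "EDAM", "required": True},
         ]},
    ],
}

def blank_template(schema):
    """Derive an empty data-entry record (one dict per module) from the schema."""
    return {m["name"]: {e["name"]: None for e in m["elements"]}
            for m in schema["modules"]}

print(json.dumps(blank_template(schema), indent=2))
```

The same walk over `modules`/`elements` could just as well emit RDF triples or an EDC form definition, which is the point of keeping a single schema as the source of truth.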
Topics: Delivery of Health Care; Genomics; High-Throughput Nucleotide Sequencing; Humans; Metadata; Software
PubMed: 35418585
DOI: 10.1038/s41597-022-01265-x

Scientific Data Dec 2023
Metadata from epidemiological studies, including chronic disease outcome metadata (CDOM), need to be findable to allow interpretability and reusability. We propose a comprehensive metadata schema and used it to assess the public availability and findability of CDOM from German population-based observational studies participating in the consortium National Research Data Infrastructure for Personal Health Data (NFDI4Health). Additionally, principal investigators from the included studies completed a checklist evaluating consistency with the FAIR principles (Findability, Accessibility, Interoperability, Reusability) within their studies. Overall, six of sixteen studies had complete, publicly available CDOM. The most frequent CDOM source was scientific publications, and the most frequently missing metadata were International Classification of Diseases, Tenth Revision (ICD-10) codes. Principal investigators' main perceived barriers to consistency with the FAIR principles were limited human and financial resources. Our results reveal that CDOM from German population-based studies have incomplete availability and limited findability. There is a need to make CDOM publicly available in searchable platforms or metadata catalogues to improve their FAIRness, which requires human and financial resources.
Topics: Humans; Metadata; Publications; Chronic Disease
PubMed: 38052810
DOI: 10.1038/s41597-023-02726-7

Scientific Data Jan 2021
Ancient DNA and RNA are valuable data sources for a wide range of disciplines. Within the field of ancient metagenomics, the number of published genetic datasets has risen dramatically in recent years, and tracking this data for reuse is particularly important for large-scale ecological and evolutionary studies of individual taxa and communities of both microbes and eukaryotes. AncientMetagenomeDir (archived at https://doi.org/10.5281/zenodo.3980833) is a collection of annotated metagenomic sample lists derived from published studies that provide basic, standardised metadata and accession numbers to allow rapid data retrieval from online repositories. These tables are community-curated and span multiple sub-disciplines to ensure adequate breadth and consensus in metadata definitions, as well as longevity of the database. Internal guidelines and automated checks facilitate compatibility with established sequence-read archives and term ontologies, and ensure consistency and interoperability for future meta-analyses. This collection will also assist in standardising metadata reporting for future ancient metagenomic studies.
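An automated check of the kind described can be sketched as a row validator: require a fixed set of columns and verify that accessions match the INSDC/SRA run-accession pattern (ERR/SRR/DRR plus digits). The column names and sample values below are illustrative, not AncientMetagenomeDir's exact schema:

```python
import re

# Sketch of an automated metadata check for community-contributed sample
# tables: required columns present, and run accessions shaped like INSDC
# run identifiers ((E|D|S)RR followed by digits). Column names are
# illustrative, not the project's exact schema.

REQUIRED = {"project_name", "sample_name", "archive_accession"}
RUN_ACCESSION = re.compile(r"^[EDS]RR\d{6,}$")

def check_row(row):
    """Return a list of problems for one sample row (empty if it passes)."""
    problems = [f"missing column: {c}" for c in sorted(REQUIRED - row.keys())]
    acc = row.get("archive_accession", "")
    if acc and not RUN_ACCESSION.match(acc):
        problems.append(f"malformed run accession: {acc}")
    return problems

good = {"project_name": "ExampleProject2020", "sample_name": "calculus_B61",
        "archive_accession": "ERR1000001"}
bad = {"project_name": "ExampleProject2020", "archive_accession": "XYZ123"}

print(check_row(good))   # []
print(check_row(bad))
```

Running such checks in continuous integration is what keeps a community-curated table consistent as contributions accumulate.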
Topics: Databases, Genetic; Humans; Metadata; Metagenome; Metagenomics; Publications
PubMed: 33500403
DOI: 10.1038/s41597-021-00816-y

Scientific Data Dec 2020
Metadata that are structured using principled schemas and that use terms from ontologies are essential to making biomedical data findable and reusable for downstream analyses. The largest source of metadata that describes the experimental protocol, funding, and scientific leadership of clinical studies is ClinicalTrials.gov. We evaluated whether values in 302,091 trial records adhere to expected data types and use terms from biomedical ontologies, whether records contain fields required by government regulations, and whether structured elements could replace free-text elements. Contact information, outcome measures, and study design are frequently missing or underspecified. Important fields for search, such as condition and intervention, are not restricted to ontologies, and almost half of the conditions are not denoted by MeSH terms, as recommended. Eligibility criteria are stored as semi-structured free text. Enforcing the presence of all required elements, requiring values for certain fields to be drawn from ontologies, and creating a structured eligibility criteria element would improve the reusability of data from ClinicalTrials.gov in systematic reviews, meta-analyses, and matching of eligible patients to trials.
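The "semi-structured free text" of eligibility criteria can be made structured with a simple parse, since ClinicalTrials.gov records conventionally use "Inclusion Criteria" / "Exclusion Criteria" headings with bulleted items. This is a sketch of the idea only; real records vary far more than this toy example:

```python
# Sketch of structuring the semi-structured eligibility-criteria element:
# split on the conventional Inclusion/Exclusion headings and itemize the
# bullet lines. Real ClinicalTrials.gov text is messier than this example.

def structure_eligibility(text):
    """Return {'inclusion': [...], 'exclusion': [...]} from free text."""
    sections = {"inclusion": [], "exclusion": []}
    current = None
    for line in text.splitlines():
        line = line.strip()
        low = line.lower()
        if low.startswith("inclusion criteria"):
            current = "inclusion"
        elif low.startswith("exclusion criteria"):
            current = "exclusion"
        elif line.startswith("-") and current:
            sections[current].append(line.lstrip("- ").strip())
    return sections

raw = """Inclusion Criteria:
- Age 18 years or older
- Histologically confirmed diagnosis
Exclusion Criteria:
- Prior systemic therapy
"""
print(structure_eligibility(raw))
```

Once itemized, each criterion becomes an addressable element that could in turn be bound to ontology terms, which is the structured element the authors argue for.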
Topics: Clinical Trials as Topic; Databases, Factual; Datasets as Topic; Metadata; Research Design
PubMed: 33339830
DOI: 10.1038/s41597-020-00780-z