-
Plant Science : An International... Feb 2018More than 15 petabases of raw RNAseq data is now accessible through public repositories. Acquisition of other 'omics data types is expanding, though most lack a... (Review)
Review
More than 15 petabases of raw RNAseq data is now accessible through public repositories. Acquisition of other 'omics data types is expanding, though most lack a centralized archival repository. Data-reuse provides tremendous opportunity to extract new knowledge from existing experiments, and offers a unique opportunity for robust, multi-'omics analyses by merging metadata (information about experimental design, biological samples, protocols) and data from multiple experiments. We illustrate how predictive research can be accelerated by meta-analysis with a study of orphan (species-specific) genes. Computational predictions are critical to infer orphan function because their coding sequences provide very few clues. The metadata in public databases is often confusing; a test case with Zea mays mRNA seq data reveals a high proportion of missing, misleading or incomplete metadata. This metadata morass significantly diminishes the insight that can be extracted from these data. We provide tips for data submitters and users, including specific recommendations to improve metadata quality by more use of controlled vocabulary and by metadata reviews. Finally, we advocate for a unified, straightforward metadata submission and retrieval system.
Topics: Base Sequence; Databases, Factual; Metadata; Plant Proteins; RNA, Messenger; Zea mays
PubMed: 29362097
DOI: 10.1016/j.plantsci.2017.10.014 -
IEEE Journal of Biomedical and Health... Aug 2023Recently, artificial intelligence has been widely used in intelligent disease diagnosis and has achieved great success. However, most of the works mainly rely on the...
Recently, artificial intelligence has been widely used in intelligent disease diagnosis and has achieved great success. However, most of the works mainly rely on the extraction of image features but ignore the use of clinical text information of patients, which may limit the diagnosis accuracy fundamentally. In this paper, we propose a metadata and image features co-aware personalized federated learning scheme for smart healthcare. Specifically, we construct an intelligent diagnosis model, by which users can obtain fast and accurate diagnosis services. Meanwhile, a personalized federated learning scheme is designed to utilize the knowledge learned from other edge nodes with larger contributions and customize high-quality personalized classification models for each edge node. Subsequently, a Naïve Bayes classifier is devised for classifying patient metadata. And then the image and metadata diagnosis results are jointly aggregated by different weights to improve the accuracy of intelligent diagnosis. Finally, the simulation results illustrate that, compared with the existing methods, our proposed algorithm achieves better classification accuracy, reaching about 97.16% on PAD-UFES-20 dataset.
Topics: Humans; Artificial Intelligence; Bayes Theorem; Metadata; Algorithms; Computer Simulation
PubMed: 37220032
DOI: 10.1109/JBHI.2023.3279096 -
GigaScience Dec 2022Scientists employing omics in life science studies face challenges such as the modeling of multiassay studies, recording of all relevant parameters, and managing many...
Scientists employing omics in life science studies face challenges such as the modeling of multiassay studies, recording of all relevant parameters, and managing many samples with their metadata. They must manage many large files that are the results of the assays or subsequent computation. Users with diverse backgrounds, ranging from computational scientists to wet-lab scientists, have dissimilar needs when it comes to data access, with programmatic interfaces being favored by the former and graphical ones by the latter. We introduce SODAR, the system for omics data access and retrieval. SODAR is a software package that addresses these challenges by providing a web-based graphical user interface for managing multiassay studies and describing them using the ISA (Investigation, Study, Assay) data model and the ISA-Tab file format. Data storage is handled using the iRODS data management system, which handles large quantities of files and substantial amounts of data. SODAR also offers programmable APIs and command-line access for metadata and file storage. SODAR supports complex omics integration studies and can be easily installed. The software is written in Python 3 and freely available at https://github.com/bihealth/sodar-server under the MIT license.
Topics: Multiomics; Metadata; Software; Information Storage and Retrieval; Data Management
PubMed: 37498129
DOI: 10.1093/gigascience/giad052 -
Lab Animal Mar 2024Although biomedical research is experiencing a data explosion, the accumulation of vast quantities of data alone does not guarantee a primary objective for science:... (Review)
Review
Although biomedical research is experiencing a data explosion, the accumulation of vast quantities of data alone does not guarantee a primary objective for science: building upon existing knowledge. Data collected that lack appropriate metadata cannot be fully interrogated or integrated into new research projects, leading to wasted resources and missed opportunities for data repurposing. This issue is particularly acute for research using animals, where concerns regarding data reproducibility and ensuring animal welfare are paramount. Here, to address this problem, we propose a minimal metadata set (MNMS) designed to enable the repurposing of in vivo data. MNMS aligns with an existing validated guideline for reporting in vivo data (ARRIVE 2.0) and contributes to making in vivo data FAIR-compliant. Scenarios where MNMS should be implemented in diverse research environments are presented, highlighting opportunities and challenges for data repurposing at different scales. We conclude with a 'call for action' to key stakeholders in biomedical research to adopt and apply MNMS to accelerate both the advancement of knowledge and the betterment of animal welfare.
Topics: Animals; Metadata; Reproducibility of Results; Biomedical Research; Animal Welfare
PubMed: 38438748
DOI: 10.1038/s41684-024-01335-0 -
PLoS Biology May 2021Single-cell RNA sequencing (scRNA-seq) provides an unprecedented view of cellular diversity of biological systems. However, across the thousands of publications and...
Single-cell RNA sequencing (scRNA-seq) provides an unprecedented view of cellular diversity of biological systems. However, across the thousands of publications and datasets generated using this technology, we estimate that only a minority (<25%) of studies provide cell-level metadata information containing identified cell types and related findings of the published dataset. Metadata omission hinders reproduction, exploration, validation, and knowledge transfer and is a common problem across journals, data repositories, and publication dates. We encourage investigators, reviewers, journals, and data repositories to improve their standards and ensure proper documentation of these valuable datasets.
Topics: Animals; Computational Biology; Gene Expression Profiling; Humans; Meta-Analysis as Topic; Metadata; Sequence Analysis, RNA; Single-Cell Analysis; Software
PubMed: 33945522
DOI: 10.1371/journal.pbio.3001077 -
Studies in Health Technology and... May 2021Metadata management is an essential condition to follow the FAIR principles. Therefore, metadata management was one asset of an accompanying project within a funding...
Metadata management is an essential condition to follow the FAIR principles. Therefore, metadata management was one asset of an accompanying project within a funding scheme for registries in health services research. The metadata of the funded projects were acquired, combined in a database compatible with the metamodel of ISO/IEC 11179 "Information technology - Metadata registries" third edition (ISO/IEC 11179-3), and analyzed in order to support the development and the operation of the registries. In the second phase of the funding scheme, six registries delivered a complete update of their metadata. The mean number of data elements increased from 245.7 to 473.5 and the mean number of values from 569.5 to 1,306.0. The conceptual core of the database had to be extended by one third to cover the new elements. The reason for this increase remained unclear. Constraints from the grant might be causal, a deviation from an evidence-based development process as well. It is questionable, whether the revealed quality of the metadata is sufficient to fulfill the FAIR principles. The extension of the metamodel of ISO/IEC 11179-3 is in agreement with the literature. However, further research is needed to find workable solutions for metadata management.
Topics: Databases, Factual; Health Services Research; Information Technology; Metadata; Registries
PubMed: 34042697
DOI: 10.3233/SHTI210112 -
Bioinformatics (Oxford, England) Jun 2023Biological data repositories are an invaluable source of publicly available research evidence. Unfortunately, the lack of convergence of the scientific community on a...
SUMMARY
Biological data repositories are an invaluable source of publicly available research evidence. Unfortunately, the lack of convergence of the scientific community on a common metadata annotation strategy has resulted in large amounts of data with low FAIRness (Findable, Accessible, Interoperable and Reusable). The possibility of generating high-quality insights from their integration relies on data curation, which is typically an error-prone process while also being expensive in terms of time and human labour. Here, we present ESPERANTO, an innovative framework that enables a standardized semi-supervised harmonization and integration of toxicogenomics metadata and increases their FAIRness in a Good Laboratory Practice-compliant fashion. The harmonization across metadata is guaranteed with the definition of an ad hoc vocabulary. The tool interface is designed to support the user in metadata harmonization in a user-friendly manner, regardless of the background and the type of expertise.
AVAILABILITY AND IMPLEMENTATION
ESPERANTO and its user manual are freely available for academic purposes at https://github.com/fhaive/esperanto. The input and the results showcased in Supplementary File S1 are available at the same link.
Topics: Humans; Software; Metadata; Toxicogenetics; Language; Data Curation
PubMed: 37354497
DOI: 10.1093/bioinformatics/btad405 -
Conservation Biology : the Journal of... Aug 2023Genetic diversity within species represents a fundamental yet underappreciated level of biodiversity. Because genetic diversity can indicate species resilience to...
Genetic diversity within species represents a fundamental yet underappreciated level of biodiversity. Because genetic diversity can indicate species resilience to changing climate, its measurement is relevant to many national and global conservation policy targets. Many studies produce large amounts of genome-scale genetic diversity data for wild populations, but most (87%) do not include the associated spatial and temporal metadata necessary for them to be reused in monitoring programs or for acknowledging the sovereignty of nations or Indigenous peoples. We undertook a distributed datathon to quantify the availability of these missing metadata and to test the hypothesis that their availability decays with time. We also worked to remediate missing metadata by extracting them from associated published papers, online repositories, and direct communication with authors. Starting with 848 candidate genomic data sets (reduced representation and whole genome) from the International Nucleotide Sequence Database Collaboration, we determined that 561 contained mostly samples from wild populations. We successfully restored spatiotemporal metadata for 78% of these 561 data sets (n = 440 data sets with data on 45,105 individuals from 762 species in 17 phyla). Examining papers and online repositories was much more fruitful than contacting 351 authors, who replied to our email requests 45% of the time. Overall, 23% of our email queries to authors unearthed useful metadata. The probability of retrieving spatiotemporal metadata declined significantly as age of the data set increased. There was a 13.5% yearly decrease in metadata associated with published papers or online repositories and up to a 22% yearly decrease in metadata that were only available from authors. This rapid decay in metadata availability, mirrored in studies of other types of biological data, should motivate swift updates to data-sharing policies and researcher practices to ensure that the valuable context provided by metadata is not lost to conservation science forever.
Topics: Humans; Conservation of Natural Resources; Metadata; Biodiversity; Probability; Genetic Variation
PubMed: 36704891
DOI: 10.1111/cobi.14061 -
Database : the Journal of Biological... Apr 2021High-quality metadata annotations for data hosted in large public repositories are essential for research reproducibility and for conducting fast, powerful and scalable...
High-quality metadata annotations for data hosted in large public repositories are essential for research reproducibility and for conducting fast, powerful and scalable meta-analyses. Currently, a majority of sequencing samples in the National Center for Biotechnology Information's Sequence Read Archive (SRA) are missing metadata across several categories. In an effort to improve the metadata coverage of these samples, we leveraged almost 44 million attribute-value pairs from SRA BioSample to train a scalable, recurrent neural network that predicts missing metadata via named entity recognition (NER). The network was first trained to classify short text phrases according to 11 metadata categories and achieved an overall accuracy and area under the receiver operating characteristic curve of 85.2% and 0.977, respectively. We then applied our classifier to predict 11 metadata categories from the longer TITLE attribute of samples, evaluating performance on a set of samples withheld from model training. Prediction accuracies were high when extracting sample Genus/Species (94.85%), Condition/Disease (95.65%) and Strain (82.03%) from TITLEs, with lower accuracies and lack of predictions for other categories highlighting multiple issues with the current metadata annotations in BioSample. These results indicate the utility of recurrent neural networks for NER-based metadata prediction and the potential for models such as the one presented here to increase metadata coverage in BioSample while minimizing the need for manual curation. Database URL: https://github.com/cartercompbio/PredictMEE.
Topics: Deep Learning; High-Throughput Nucleotide Sequencing; Metadata; Reproducibility of Results; Software
PubMed: 33914028
DOI: 10.1093/database/baab021 -
GigaScience Dec 2022The life sciences are one of the biggest suppliers of scientific data. Reusing and connecting these data can uncover hidden insights and lead to new concepts. Efficient...
BACKGROUND
The life sciences are one of the biggest suppliers of scientific data. Reusing and connecting these data can uncover hidden insights and lead to new concepts. Efficient reuse of these datasets is strongly promoted when they are interlinked with a sufficient amount of machine-actionable metadata. While the FAIR (Findable, Accessible, Interoperable, Reusable) guiding principles have been accepted by all stakeholders, in practice, there are only a limited number of easy-to-adopt implementations available that fulfill the needs of data producers.
FINDINGS
We developed the FAIR Data Station, a lightweight application written in Java, that aims to support researchers in managing research metadata according to the FAIR principles. It implements the ISA metadata framework and uses minimal information metadata standards to capture experiment metadata. The FAIR Data Station consists of 3 modules. Based on the minimal information model(s) selected by the user, the "form generation module" creates a metadata template Excel workbook with a header row of machine-actionable attribute names. The Excel workbook is subsequently used by the data producer(s) as a familiar environment for sample metadata registration. At any point during this process, the format of the recorded values can be checked using the "validation module." Finally, the "resource module" can be used to convert the set of metadata recorded in the Excel workbook in RDF format, enabling (cross-project) (meta)data searches and, for publishing of sequence data, in an European Nucleotide Archive-compatible XML metadata file.
CONCLUSIONS
Turning FAIR into reality requires the availability of easy-to-adopt data FAIRification workflows that are also of direct use for data producers. As such, the FAIR Data Station provides, in addition to the means to correctly FAIRify (omics) data, the means to build searchable metadata databases of similar projects and can assist in ENA metadata submission of sequence data. The FAIR Data Station is available at https://fairbydesign.nl.
Topics: Metadata; Biological Science Disciplines; Databases, Factual; Nucleotides; Publishing
PubMed: 36879493
DOI: 10.1093/gigascience/giad014