AMIA Annual Symposium Proceedings 2017
In biomedicine, high-quality metadata are crucial for finding experimental datasets, for understanding how experiments were performed, and for reproducing those experiments. Despite the recent focus on metadata, the quality of metadata available in public repositories continues to be extremely poor. A key difficulty is that the typical metadata acquisition process is time-consuming and error prone, with weak or nonexistent support for linking metadata to ontologies. There is a pressing need for methods and tools to speed up the metadata acquisition process and to increase the quality of metadata that are entered. In this paper, we describe a methodology and set of associated tools that we developed to address this challenge. A core component of this approach is a value recommendation framework that uses analysis of previously entered metadata and ontology-based metadata specifications to help users rapidly and accurately enter their metadata. We performed an initial evaluation of this approach using metadata from a public metadata repository.
Topics: Biological Ontologies; Biomedical Research; Data Accuracy; Data Analysis; Metadata; Methods
PubMed: 29854196
DOI: No ID Found
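The value-recommendation approach described above can be sketched as a simple co-occurrence model: values already entered in other fields of a metadata record condition the suggestions for the current field. The Python sketch below is an illustrative approximation, not the paper's actual framework; all field names and records are invented.

```python
from collections import Counter, defaultdict

def build_recommender(records):
    """Learn value co-occurrence from previously entered metadata records.

    Each record is a dict of field -> value. For every (context field,
    context value, target field) triple we count which target values
    co-occurred, then rank suggestions by those counts.
    """
    cooc = defaultdict(Counter)
    for rec in records:
        for target_field, target_value in rec.items():
            for ctx_field, ctx_value in rec.items():
                if ctx_field != target_field:
                    cooc[(ctx_field, ctx_value, target_field)][target_value] += 1

    def recommend(context, field, k=3):
        """Suggest up to k values for `field`, given already-filled fields."""
        scores = Counter()
        for ctx_field, ctx_value in context.items():
            scores.update(cooc[(ctx_field, ctx_value, field)])
        return [value for value, _ in scores.most_common(k)]

    return recommend

# Invented example records; a real system would mine a metadata repository.
records = [
    {"organism": "Homo sapiens", "tissue": "liver"},
    {"organism": "Homo sapiens", "tissue": "brain"},
    {"organism": "Homo sapiens", "tissue": "liver"},
    {"organism": "Mus musculus", "tissue": "liver"},
]
recommend = build_recommender(records)
print(recommend({"organism": "Homo sapiens"}, "tissue"))  # ['liver', 'brain']
```

Ranking by simple co-occurrence counts already captures the core idea: the more often a value was entered alongside the current context in past records, the higher it is suggested.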
Nucleic Acids Research Jan 2019
Sharing of research data in public repositories has become best practice in academia. With the accumulation of massive data, network bandwidth and storage requirements are rapidly increasing. The ProteomeXchange (PX) consortium implements a model of centralized metadata and distributed raw data management, which promotes effective data sharing. To facilitate open access to proteome data worldwide, we have developed the integrated proteome resource iProX (http://www.iprox.org) as a public platform for collecting and sharing raw data, analysis results and metadata obtained from proteomics experiments. The iProX repository employs a web-based proteome data submission process and open sharing of mass spectrometry-based proteomics datasets. It also deploys extensive controlled vocabularies and ontologies to annotate proteomics datasets. Through a GUI, users can provide and access data via a fast Aspera-based transfer tool. iProX is a full member of the PX consortium; all released datasets are freely accessible to the public. iProX is built on a high-availability architecture and has been deployed as part of the proteomics infrastructure of China, ensuring long-term and stable resource support. iProX will facilitate worldwide analysis and sharing of proteomics experiments.
Topics: Animals; Computational Biology; Databases, Protein; Humans; Information Storage and Retrieval; Internet; Metadata; Proteome; Proteomics; User-Computer Interface
PubMed: 30252093
DOI: 10.1093/nar/gky869
Nucleic Acids Research Jan 2021
The International Nucleotide Sequence Database Collaboration (INSDC; http://www.insdc.org/) has been the core infrastructure for collecting and providing nucleotide sequence data and metadata for more than 30 years. Three partner organizations maintain the INSDC collaboratively: the DNA Data Bank of Japan (DDBJ) at the National Institute of Genetics in Mishima, Japan; the European Nucleotide Archive (ENA) at the European Molecular Biology Laboratory's European Bioinformatics Institute (EMBL-EBI) in Hinxton, UK; and GenBank at the National Center for Biotechnology Information (NCBI), National Library of Medicine, National Institutes of Health, in Bethesda, Maryland, USA. They do so for the benefit not only of science but of communities of all types worldwide.
Topics: Academies and Institutes; Base Sequence; Databases, Nucleic Acid; Europe; High-Throughput Nucleotide Sequencing; Humans; International Cooperation; Japan; Metadata; Nucleotides; Sequence Analysis, DNA; Sequence Analysis, RNA; United States
PubMed: 33166387
DOI: 10.1093/nar/gkaa967
BMC Bioinformatics Apr 2022
BACKGROUND
Heterogeneous omics data, increasingly collected through high-throughput technologies, can contain hidden answers to very important and still unsolved biomedical questions. Their integration and processing are crucial mostly for tertiary analysis of Next Generation Sequencing data, although suitable big data strategies still address mainly primary and secondary analysis. Hence, there is a pressing need for algorithms specifically designed to explore big omics datasets, capable of ensuring scalability and interoperability, possibly relying on high-performance computing infrastructures.
RESULTS
We propose RGMQL, an R/Bioconductor package that provides a set of specialized functions to extract, combine, process and compare omics datasets and their metadata from different, and differently localized, sources. RGMQL is built on the GenoMetric Query Language (GMQL) data management and computational engine, and can leverage its open curated repository as well as its cloud-based resources, with the possibility of outsourcing computational tasks to GMQL remote services. Furthermore, it overcomes the limits of the GMQL declarative syntax by offering a procedural approach to omics data within the R/Bioconductor environment. Most importantly, it provides full interoperability with other packages of the R/Bioconductor framework and extensibility over the most widely used genomic data structures and processing functions.
CONCLUSIONS
RGMQL combines the query expressiveness and computational efficiency of GMQL with a complete processing flow in the R environment, as a fully integrated extension of the R/Bioconductor framework. Here we provide three fully reproducible example use cases of biological relevance that illustrate its flexibility of use and its interoperability with other R/Bioconductor packages. They show how RGMQL can easily scale up from local to parallel and cloud computing while combining and analyzing heterogeneous omics data from local or remote datasets, both public and private, in a way that is completely transparent to the user.
Topics: Big Data; Cloud Computing; Genomics; Metadata; Software
PubMed: 35392801
DOI: 10.1186/s12859-022-04648-4
Nucleic Acids Research Jan 2022
The European Genome-phenome Archive (EGA - https://ega-archive.org/) is a resource for long-term secure archiving of all types of potentially identifiable genetic, phenotypic, and clinical data resulting from biomedical research projects. Its mission is to foster reuse of the hosted data, enable reproducibility, and accelerate biomedical and translational research in line with the FAIR principles. Launched in 2008, the EGA has grown quickly and currently archives over 4,500 studies from nearly one thousand institutions. The EGA operates a distributed data access model in which requests are made to the data controller, not to the EGA; the submitter therefore retains control over who has access to the data and under which conditions. Given the size and value of the data hosted, the EGA is constantly improving its value chain, that is, how the EGA can enhance the value of human health data by facilitating its submission, discovery, access, and distribution, as well as by leading the design and implementation of the standards and methods necessary to deliver that value chain. The EGA has become a key GA4GH Driver Project, leading multiple development efforts and implementing new standards and tools, and has been appointed an ELIXIR Core Data Resource.
Topics: Confidentiality; Datasets as Topic; Genome, Human; Genotype; History, 20th Century; History, 21st Century; Humans; Information Dissemination; Metadata; Phenomics; Phenotype; Translational Research, Biomedical
PubMed: 34791407
DOI: 10.1093/nar/gkab1059
Metabolites Aug 2023
Metabolomics has advanced to an extent where it is desired to standardize and compare data across individual studies. While past work in standardization has focused on data acquisition, data processing, and data storage aspects, metabolomics databases are useless without ontology-based descriptions of biological samples and study designs. We introduce here a user-centric tool to automatically standardize sample metadata. Using such a tool in frontends for metabolomic databases will dramatically increase the FAIRness (Findability, Accessibility, Interoperability, and Reusability) of data, specifically for data reuse and for finding datasets that share comparable sets of metadata, e.g., study meta-analyses, cross-species analyses or large scale metabolomic atlases. SMetaS (Sample Metadata Standardizer) combines a classic database with an API and frontend and is provided in a containerized environment. The tool has two user-centric components. In the first component, the user designs a sample metadata matrix and fills the cells using natural language terminology. In the second component, the tool transforms the completed matrix by replacing freetext terms with terms from fixed vocabularies. This transformation process is designed to maximize simplicity and is guided by, among other strategies, synonym matching and typographical fixing in an n-grams/nearest neighbors model approach. The tool enables downstream analysis of submitted studies and samples via string equality for FAIR retrospective use.
PubMed: 37623884
DOI: 10.3390/metabo13080941
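The synonym-matching and typo-fixing strategy described for SMetaS can be approximated with a character n-gram nearest-neighbor model: freetext terms are first resolved through a synonym table, then matched to the closest fixed-vocabulary term. The sketch below is a hypothetical illustration, not the SMetaS code; the vocabulary, synonym table, and threshold are invented.

```python
from collections import Counter

def ngrams(term, n=3):
    """Character n-gram profile of a lowercased term, padded at the edges."""
    t = f"  {term.lower().strip()}  "
    return Counter(t[i:i + n] for i in range(len(t) - n + 1))

def similarity(a, b):
    """Cosine-style overlap between two n-gram profiles, in [0, 1]."""
    inter = sum((a & b).values())
    denom = (sum(a.values()) * sum(b.values())) ** 0.5
    return inter / denom if denom else 0.0

def standardize(freetext, vocabulary, synonyms=None, threshold=0.5):
    """Map a freetext term to the closest fixed-vocabulary term.

    Synonyms are resolved first; otherwise the nearest neighbor in
    n-gram space is returned if it clears the similarity threshold,
    which also repairs small typographical errors.
    """
    synonyms = synonyms or {}
    key = freetext.lower().strip()
    if key in synonyms:
        return synonyms[key]
    profile = ngrams(freetext)
    best, score = None, 0.0
    for term in vocabulary:
        s = similarity(profile, ngrams(term))
        if s > score:
            best, score = term, s
    return best if score >= threshold else None

# Invented vocabulary and synonym table for illustration.
vocab = ["Homo sapiens", "Mus musculus", "Rattus norvegicus"]
syn = {"human": "Homo sapiens", "mouse": "Mus musculus"}
print(standardize("human", vocab, syn))         # 'Homo sapiens' (via synonym)
print(standardize("Homo sapeins", vocab, syn))  # 'Homo sapiens' (typo repaired)
```

Once every cell has been standardized this way, downstream retrieval can rely on plain string equality, which is exactly the property the abstract highlights.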
Radiology Feb 2022
BACKGROUND
Lack of standardization in CT protocol choice contributes to radiation dose variation.
PURPOSE
To create a framework to assess radiation doses within broad CT categories defined according to body region and clinical imaging indication, and to cluster indications according to the dose required for sufficient image quality.
MATERIALS AND METHODS
This was a retrospective study using Digital Imaging and Communications in Medicine metadata. CT examinations in adults from January 1, 2016 to December 31, 2019 from the University of California San Francisco International CT Dose Registry were grouped into 19 categories according to body region and required radiation dose level. Five body regions had a single dose range (extremities, neck, thoracolumbar spine, combined chest and abdomen, and combined thoracolumbar spine). Five additional regions were subdivided according to dose: head, chest, cardiac, and abdomen each had low, routine, and high dose categories; combined head and neck had routine and high dose categories. For each category, the median and 75th percentile (the diagnostic reference level [DRL]) of the dose-length product were determined, and the variation in dose within categories versus across categories was calculated and compared using analysis of variance. Relative median and DRL doses (with 95% CIs) comparing high dose versus low dose categories were calculated.
RESULTS
Among 4.5 million examinations, the median and DRL doses varied approximately 10 times more between categories than between indications within categories. For head, chest, abdomen, and cardiac examinations (3 266 546 examinations [72%]), the relative median doses were higher in examinations assigned to the high dose categories than in those assigned to the low dose categories, suggesting that the assignment of indications to the broad categories is valid (head, 3.4-fold higher [95% CI: 3.4, 3.5]; chest, 9.6 [95% CI: 9.3, 10.0]; abdomen, 2.4 [95% CI: 2.4, 2.5]; cardiac, 18.1 [95% CI: 17.7, 18.6]). Results were similar for DRL doses (all P < .001).
CONCLUSION
Broad categories based on image quality requirements are a suitable framework for simplifying radiation dose assessment, given the expected variation between and within categories. © RSNA, 2021. See also the editorial by Mahesh in this issue.
Topics: Adult; Aged; Female; Humans; Male; Metadata; Middle Aged; Radiation Dosage; Retrospective Studies; Tomography, X-Ray Computed
PubMed: 34751618
DOI: 10.1148/radiol.2021210591
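The study's central quantities, the per-category median and 75th-percentile (DRL) dose-length product, reduce to a straightforward percentile summary. The sketch below uses hypothetical DLP values and category names; it illustrates only the calculation, not the registry's actual pipeline.

```python
import statistics

def dose_reference_levels(exams):
    """Per-category median and 75th-percentile (DRL) dose-length product.

    `exams` is a list of (category, dlp_mGy_cm) pairs. The DRL is
    conventionally the 75th percentile of dose within a category.
    """
    by_cat = {}
    for category, dlp in exams:
        by_cat.setdefault(category, []).append(dlp)
    summary = {}
    for category, doses in by_cat.items():
        doses.sort()
        # quartiles with the inclusive method; q[2] is the 75th percentile
        q = statistics.quantiles(doses, n=4, method="inclusive")
        summary[category] = {"median": statistics.median(doses), "drl": q[2]}
    return summary

# Hypothetical DLP values (mGy*cm) for two invented categories.
exams = [("head_routine", d) for d in (800, 900, 950, 1000, 1100)] + \
        [("chest_low", d) for d in (60, 80, 100, 120, 140)]
print(dose_reference_levels(exams))
```

Grouping examinations by broad category before taking percentiles is the whole point of the framework: the DRL is then directly comparable across institutions that use the same category definitions.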
Nucleic Acids Research Jul 2022
Millions of transcriptome samples were generated by the Library of Integrated Network-based Cellular Signatures (LINCS) program. When these data are processed into searchable signatures along with signatures extracted from Genotype-Tissue Expression (GTEx) and Gene Expression Omnibus (GEO), connections between drugs, genes, pathways and diseases can be illuminated. SigCom LINCS is a webserver that serves over a million gene expression signatures processed, analyzed, and visualized from LINCS, GTEx, and GEO. SigCom LINCS is built with Signature Commons, a cloud-agnostic skeleton Data Commons with a focus on serving searchable signatures. SigCom LINCS provides a rapid signature similarity search for mimickers and reversers given sets of up and down genes, a gene set, a single gene, or any search term. Additionally, users of SigCom LINCS can perform a metadata search to find and analyze subsets of signatures and find information about genes and drugs. SigCom LINCS is findable, accessible, interoperable, and reusable (FAIR) with metadata linked to standard ontologies and vocabularies. In addition, all the data and signatures within SigCom LINCS are available via a well-documented API. In summary, SigCom LINCS, available at https://maayanlab.cloud/sigcom-lincs, is a rich webserver resource for accelerating drug and target discovery in systems pharmacology.
Topics: Transcriptome; Metadata; Search Engine
PubMed: 35524556
DOI: 10.1093/nar/gkac328
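A signature similarity search of the kind SigCom LINCS offers (given sets of up and down genes, find mimickers and reversers) can be illustrated with a toy rank-based connectivity score. This is a deliberately simplified stand-in, not the SigCom LINCS algorithm or API; the gene names and scores are invented.

```python
def connectivity_score(signature, up_genes, down_genes):
    """Toy two-sided similarity score for a ranked gene signature.

    `signature` maps gene -> differential-expression score (positive =
    up-regulated). A mimicker (up genes near the top, down genes near
    the bottom) scores near +1; a reverser scores near -1.
    """
    ranked = sorted(signature, key=signature.get, reverse=True)
    n = len(ranked)

    def mean_rank(genes):
        hits = [ranked.index(g) for g in genes if g in ranked]
        return sum(hits) / len(hits) if hits else (n - 1) / 2

    # Reward up genes near rank 0 and down genes near rank n-1.
    up_term = 1 - 2 * mean_rank(up_genes) / (n - 1)
    down_term = 2 * mean_rank(down_genes) / (n - 1) - 1
    return (up_term + down_term) / 2

# Invented signature: gene -> differential-expression score.
sig = {"TP53": 2.1, "EGFR": 1.4, "MYC": 0.2, "CDK4": -1.1, "BRCA1": -2.0}
print(connectivity_score(sig, up_genes={"TP53", "EGFR"},
                         down_genes={"BRCA1"}))  # 0.875 (mimicker-leaning)
```

A production system precomputes ranks for millions of signatures so that each query is a fast lookup rather than a fresh sort, which is what makes interactive search over the LINCS corpus feasible.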
GigaScience Dec 2022
BACKGROUND
The life sciences are one of the biggest suppliers of scientific data. Reusing and connecting these data can uncover hidden insights and lead to new concepts. Efficient reuse of these datasets is greatly facilitated when they are interlinked with sufficient machine-actionable metadata. While the FAIR (Findable, Accessible, Interoperable, Reusable) guiding principles have been accepted by all stakeholders, in practice only a limited number of easy-to-adopt implementations are available that fulfill the needs of data producers.
FINDINGS
We developed the FAIR Data Station, a lightweight application written in Java that aims to support researchers in managing research metadata according to the FAIR principles. It implements the ISA metadata framework and uses minimal information metadata standards to capture experiment metadata. The FAIR Data Station consists of three modules. Based on the minimal information model(s) selected by the user, the "form generation module" creates a metadata template Excel workbook with a header row of machine-actionable attribute names. The data producer(s) then use the Excel workbook as a familiar environment for sample metadata registration. At any point during this process, the format of the recorded values can be checked using the "validation module". Finally, the "resource module" can convert the metadata recorded in the Excel workbook into RDF, enabling (cross-project) (meta)data searches, and, for publishing sequence data, into a European Nucleotide Archive-compatible XML metadata file.
CONCLUSIONS
Turning FAIR into reality requires the availability of easy-to-adopt data FAIRification workflows that are also of direct use for data producers. As such, the FAIR Data Station provides, in addition to the means to correctly FAIRify (omics) data, the means to build searchable metadata databases of similar projects and can assist in ENA metadata submission of sequence data. The FAIR Data Station is available at https://fairbydesign.nl.
Topics: Metadata; Biological Science Disciplines; Databases, Factual; Nucleotides; Publishing
PubMed: 36879493
DOI: 10.1093/gigascience/giad014
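The "resource module" step, converting a tabular metadata sheet into RDF, can be illustrated by emitting N-Triples from a header row of attribute names plus one row per sample. The base URI, predicate scheme, and data below are placeholders of my own, not the FAIR Data Station's actual mapping (which follows the ISA framework).

```python
def rows_to_ntriples(header, rows, base="https://example.org/sample/"):
    """Serialize a sample-metadata table as N-Triples.

    `header` is the row of machine-actionable attribute names;
    each entry of `rows` holds one sample's values. Every cell becomes
    one (sample, attribute, value) triple. The base URI is a placeholder.
    """
    triples = []
    for i, row in enumerate(rows):
        subject = f"<{base}{i}>"
        for attr, value in zip(header, row):
            predicate = f"<{base}attr/{attr.replace(' ', '_')}>"
            triples.append(f'{subject} {predicate} "{value}" .')
    return "\n".join(triples)

# Invented sheet contents for illustration.
header = ["sample name", "organism", "collection date"]
rows = [["S1", "Homo sapiens", "2021-05-01"]]
print(rows_to_ntriples(header, rows))
```

Once metadata live as triples, cross-project searches reduce to graph queries over a shared attribute vocabulary, which is what enables the searchable metadata databases the conclusion describes.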
Nature Methods Dec 2021
Topics: Cell Nucleus; Chromatin; Humans; Image Processing, Computer-Assisted; Medical Informatics; Metadata; Microscopy; Software
PubMed: 34654919
DOI: 10.1038/s41592-021-01290-5