Nucleic Acids Research Jan 2021
The International Nucleotide Sequence Database Collaboration (INSDC; http://www.insdc.org/) has been the core infrastructure for collecting and providing nucleotide sequence data and metadata for more than 30 years. Three partner organizations, the DNA Data Bank of Japan (DDBJ) at the National Institute of Genetics in Mishima, Japan; the European Nucleotide Archive (ENA) at the European Molecular Biology Laboratory's European Bioinformatics Institute (EMBL-EBI) in Hinxton, UK; and GenBank at the National Center for Biotechnology Information (NCBI), National Library of Medicine, National Institutes of Health in Bethesda, Maryland, USA, have collaboratively maintained the INSDC for the benefit not only of science but of communities of all types worldwide.
Topics: Academies and Institutes; Base Sequence; Databases, Nucleic Acid; Europe; High-Throughput Nucleotide Sequencing; Humans; International Cooperation; Japan; Metadata; Nucleotides; Sequence Analysis, DNA; Sequence Analysis, RNA; United States
PubMed: 33166387
DOI: 10.1093/nar/gkaa967
BMC Bioinformatics Apr 2022
BACKGROUND
Heterogeneous omics data, increasingly collected through high-throughput technologies, can contain hidden answers to very important and still unsolved biomedical questions. Their integration and processing are crucial mostly for tertiary analysis of Next Generation Sequencing data, although suitable big data strategies still address mainly primary and secondary analysis. Hence, there is a pressing need for algorithms specifically designed to explore big omics datasets, capable of ensuring scalability and interoperability, possibly relying on high-performance computing infrastructures.
RESULTS
We propose RGMQL, an R/Bioconductor package conceived to provide a set of specialized functions to extract, combine, process, and compare omics datasets and their metadata from different, and differently located, sources. RGMQL is built over the GenoMetric Query Language (GMQL) data management and computational engine, and can leverage its open curated repository as well as its cloud-based resources, with the possibility of outsourcing computational tasks to GMQL remote services. Furthermore, it overcomes the limits of the GMQL declarative syntax by providing a procedural approach to omics data within the R/Bioconductor environment. Most importantly, it provides full interoperability with other packages of the R/Bioconductor framework and extensibility over the most widely used genomic data structures and processing functions.
CONCLUSIONS
RGMQL combines the query expressiveness and computational efficiency of GMQL with a complete processing flow in the R environment, being a fully integrated extension of the R/Bioconductor framework. Here we provide three fully reproducible example use cases of biological relevance that are particularly illustrative of its flexibility of use and its interoperability with other R/Bioconductor packages. They show how RGMQL can easily scale up from local to parallel and cloud computing while combining and analyzing heterogeneous omics data from local or remote datasets, both public and private, in a way that is completely transparent to the user.
Topics: Big Data; Cloud Computing; Genomics; Metadata; Software
PubMed: 35392801
DOI: 10.1186/s12859-022-04648-4
Nucleic Acids Research Jan 2022
The European Genome-phenome Archive (EGA - https://ega-archive.org/) is a resource for long-term secure archiving of all types of potentially identifiable genetic, phenotypic, and clinical data resulting from biomedical research projects. Its mission is to foster reuse of the hosted data, enable reproducibility, and accelerate biomedical and translational research in line with the FAIR principles. Launched in 2008, the EGA has grown quickly and currently archives over 4,500 studies from nearly one thousand institutions. The EGA operates a distributed data access model in which requests are made to the data controller, not to the EGA; therefore, the submitter retains control over who has access to the data and under which conditions. Given the size and value of the data hosted, the EGA is constantly improving its value chain, that is, how the EGA can enhance the value of human health data by facilitating its submission, discovery, access, and distribution, as well as by leading the design and implementation of the standards and methods necessary to deliver that value chain. The EGA has become a key GA4GH Driver Project, leading multiple development efforts and implementing new standards and tools, and has been appointed an ELIXIR Core Data Resource.
Topics: Confidentiality; Datasets as Topic; Genome, Human; Genotype; History, 20th Century; History, 21st Century; Humans; Information Dissemination; Metadata; Phenomics; Phenotype; Translational Research, Biomedical
PubMed: 34791407
DOI: 10.1093/nar/gkab1059
Metabolites Aug 2023
Metabolomics has advanced to an extent where it is desirable to standardize and compare data across individual studies. While past standardization work has focused on data acquisition, data processing, and data storage, metabolomics databases are useless without ontology-based descriptions of biological samples and study designs. We introduce here a user-centric tool to automatically standardize sample metadata. Using such a tool in the frontends of metabolomics databases will dramatically increase the FAIRness (Findability, Accessibility, Interoperability, and Reusability) of data, specifically for data reuse and for finding datasets that share comparable sets of metadata, e.g., for study meta-analyses, cross-species analyses, or large-scale metabolomic atlases. SMetaS (Sample Metadata Standardizer) combines a classic database with an API and frontend and is provided in a containerized environment. The tool has two user-centric components. In the first component, the user designs a sample metadata matrix and fills the cells using natural-language terminology. In the second component, the tool transforms the completed matrix by replacing freetext terms with terms from fixed vocabularies. This transformation process is designed to maximize simplicity and is guided by, among other strategies, synonym matching and typographical-error correction using an n-gram/nearest-neighbor approach. The tool enables downstream analysis of submitted studies and samples via string equality for FAIR retrospective use.
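A minimal Python sketch of the kind of freetext-to-vocabulary matching described above, using character n-grams and a nearest-neighbor lookup; the vocabulary, input terms, and parameters are illustrative assumptions, not SMetaS's actual implementation.

```python
# Hypothetical sketch of n-gram/nearest-neighbor term standardization,
# in the spirit of the strategy described in the abstract.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

vocabulary = ["liver", "plasma", "serum", "skeletal muscle"]  # fixed terms
freetext = ["Liver tissue", "blood plasma", "serumm"]         # user input

# Character 3-grams make the lookup tolerant of typos such as "serumm".
vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 3))
X = vec.fit_transform(vocabulary)

nn = NearestNeighbors(n_neighbors=1, metric="cosine").fit(X)
dist, idx = nn.kneighbors(vec.transform(freetext))

for term, d, i in zip(freetext, dist[:, 0], idx[:, 0]):
    print(f"{term!r} -> {vocabulary[i]!r} (cosine distance {d:.2f})")
```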
PubMed: 37623884
DOI: 10.3390/metabo13080941
Radiology Feb 2022
BACKGROUND
Lack of standardization in CT protocol choice contributes to radiation dose variation.
PURPOSE
To create a framework to assess radiation doses within broad CT categories defined according to body region and clinical imaging indication, and to cluster indications according to the dose required for sufficient image quality.
MATERIALS AND METHODS
This was a retrospective study using Digital Imaging and Communications in Medicine (DICOM) metadata. CT examinations in adults from January 1, 2016 to December 31, 2019 from the University of California San Francisco International CT Dose Registry were grouped into 19 categories according to body region and required radiation dose level. Five body regions had a single dose range (i.e., extremities, neck, thoracolumbar spine, combined chest and abdomen, and combined thoracolumbar spine). Five additional regions were subdivided according to dose: head, chest, cardiac, and abdomen each had low, routine, and high dose categories; combined head and neck had routine and high dose categories. For each category, the median and the 75th percentile (i.e., the diagnostic reference level [DRL]) of the dose-length product were determined, and the variation in dose within categories versus across categories was calculated and compared using analysis of variance. Relative median and DRL (95% CI) doses comparing high dose versus low dose categories were calculated.
RESULTS
Among 4.5 million examinations, the median and DRL doses varied approximately 10-fold more between categories than between indications within categories. For head, chest, abdomen, and cardiac (3,266,546 examinations [72%]), relative median doses were higher in examinations assigned to the high dose categories than in those assigned to the low dose categories, suggesting that the assignment of indications to the broad categories is valid (head, 3.4-fold higher [95% CI: 3.4, 3.5]; chest, 9.6-fold [95% CI: 9.3, 10.0]; abdomen, 2.4-fold [95% CI: 2.4, 2.5]; cardiac, 18.1-fold [95% CI: 17.7, 18.6]). Results were similar for DRL doses (all P < .001).
CONCLUSION
Broad categories based on image quality requirements are a suitable framework for simplifying radiation dose assessment, given the expected variation between and within categories.
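As a rough sketch of the dose summary the study describes, the snippet below derives the median and the 75th percentile (the DRL) of dose-length product per category from tabular examination metadata; the column names and example values are assumptions, not the registry's actual schema.

```python
# Illustrative only: per-category median and DRL (75th percentile) of
# dose-length product (DLP); not the registry's actual pipeline.
import pandas as pd

exams = pd.DataFrame({
    "category": ["head_routine", "head_routine", "head_high", "chest_low"],
    "dlp_mgy_cm": [900.0, 1100.0, 2400.0, 150.0],  # hypothetical values
})

summary = exams.groupby("category")["dlp_mgy_cm"].agg(
    median="median",
    drl=lambda s: s.quantile(0.75),  # DRL = 75th percentile
)
print(summary)
```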
Topics: Adult; Aged; Female; Humans; Male; Metadata; Middle Aged; Radiation Dosage; Retrospective Studies; Tomography, X-Ray Computed
PubMed: 34751618
DOI: 10.1148/radiol.2021210591
Nucleic Acids Research Jul 2022
Millions of transcriptome samples were generated by the Library of Integrated Network-based Cellular Signatures (LINCS) program. When these data are processed into searchable signatures along with signatures extracted from Genotype-Tissue Expression (GTEx) and Gene Expression Omnibus (GEO), connections between drugs, genes, pathways and diseases can be illuminated. SigCom LINCS is a webserver that serves over a million gene expression signatures processed, analyzed, and visualized from LINCS, GTEx, and GEO. SigCom LINCS is built with Signature Commons, a cloud-agnostic skeleton Data Commons with a focus on serving searchable signatures. SigCom LINCS provides a rapid signature similarity search for mimickers and reversers given sets of up and down genes, a gene set, a single gene, or any search term. Additionally, users of SigCom LINCS can perform a metadata search to find and analyze subsets of signatures and find information about genes and drugs. SigCom LINCS is findable, accessible, interoperable, and reusable (FAIR) with metadata linked to standard ontologies and vocabularies. In addition, all the data and signatures within SigCom LINCS are available via a well-documented API. In summary, SigCom LINCS, available at https://maayanlab.cloud/sigcom-lincs, is a rich webserver resource for accelerating drug and target discovery in systems pharmacology.
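To make the idea of searching for mimickers and reversers concrete, here is a toy score over up- and down-gene sets that rewards concordant overlaps and penalizes discordant ones; it is a conceptual sketch, not SigCom LINCS's actual similarity statistic, and the gene sets are invented.

```python
# Toy directional similarity: positive -> mimicker, negative -> reverser.
def direction_score(query_up, query_dn, sig_up, sig_dn):
    concordant = len(query_up & sig_up) + len(query_dn & sig_dn)
    discordant = len(query_up & sig_dn) + len(query_dn & sig_up)
    return (concordant - discordant) / (len(query_up) + len(query_dn))

query_up, query_dn = {"EGFR", "MYC"}, {"TP53"}
sig_up, sig_dn = {"MYC", "KRAS"}, {"TP53", "PTEN"}
print(direction_score(query_up, query_dn, sig_up, sig_dn))  # ~0.67, a mimicker
```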
Topics: Transcriptome; Metadata; Search Engine
PubMed: 35524556
DOI: 10.1093/nar/gkac328
GigaScience Dec 2022
BACKGROUND
The life sciences are one of the biggest suppliers of scientific data. Reusing and connecting these data can uncover hidden insights and lead to new concepts. Efficient reuse of these datasets is strongly promoted when they are interlinked with a sufficient amount of machine-actionable metadata. While the FAIR (Findable, Accessible, Interoperable, Reusable) guiding principles have been accepted by all stakeholders, in practice, there are only a limited number of easy-to-adopt implementations available that fulfill the needs of data producers.
FINDINGS
We developed the FAIR Data Station, a lightweight application written in Java that aims to support researchers in managing research metadata according to the FAIR principles. It implements the ISA metadata framework and uses minimal information metadata standards to capture experiment metadata. The FAIR Data Station consists of three modules. Based on the minimal information model(s) selected by the user, the "form generation module" creates a metadata template Excel workbook with a header row of machine-actionable attribute names. The Excel workbook is subsequently used by the data producer(s) as a familiar environment for sample metadata registration. At any point during this process, the format of the recorded values can be checked using the "validation module." Finally, the "resource module" can be used to convert the set of metadata recorded in the Excel workbook into RDF, enabling (cross-project) (meta)data searches, and, for the publication of sequence data, into a European Nucleotide Archive-compatible XML metadata file.
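To make the "resource module" idea concrete, here is a minimal sketch that turns tabular sample metadata into RDF triples with rdflib; the namespace, attribute names, and example rows are assumptions, not the FAIR Data Station's actual schema.

```python
# Minimal sketch: tabular sample metadata -> RDF triples (illustrative).
from rdflib import Graph, Literal, Namespace, RDF

EX = Namespace("http://example.org/fairds/")  # hypothetical namespace
rows = [
    {"sample_id": "S1", "organism": "Escherichia coli", "ph": "7.0"},
    {"sample_id": "S2", "organism": "Bacillus subtilis", "ph": "6.5"},
]

g = Graph()
for row in rows:
    sample = EX[row["sample_id"]]
    g.add((sample, RDF.type, EX.Sample))
    for attr, value in row.items():
        if attr != "sample_id":
            g.add((sample, EX[attr], Literal(value)))

print(g.serialize(format="turtle"))
```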
CONCLUSIONS
Turning FAIR into reality requires easy-to-adopt data FAIRification workflows that are also of direct use to data producers. As such, the FAIR Data Station provides, in addition to the means to correctly FAIRify (omics) data, the means to build searchable metadata databases of similar projects, and it can assist in ENA metadata submission of sequence data. The FAIR Data Station is available at https://fairbydesign.nl.
Topics: Metadata; Biological Science Disciplines; Databases, Factual; Nucleotides; Publishing
PubMed: 36879493
DOI: 10.1093/gigascience/giad014
Nature Methods Dec 2021
Topics: Cell Nucleus; Chromatin; Humans; Image Processing, Computer-Assisted; Medical Informatics; Metadata; Microscopy; Software
PubMed: 34654919
DOI: 10.1038/s41592-021-01290-5
Journal of the American Medical Informatics Association Aug 2021
OBJECTIVE
PubMed has suffered from the author ambiguity problem for many years. Existing studies on author name disambiguation (AND) for PubMed have used only internal metadata for development. However, some of these metadata are incomplete (e.g., a large number of names are only abbreviated, with full names unavailable) or only weakly discriminative. To this end, we present AggAND, a new disambiguation method that aggregates information from external databases.
MATERIALS AND METHODS
We address this issue by exploring Microsoft Academic Graph, Semantic Scholar, and the PubMed Knowledge Graph to enhance the built-in name metadata, and by extending the internal metadata with external, more discriminative metadata.
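As a hedged illustration of this aggregation step, the sketch below joins built-in abbreviated PubMed author names with full names from an external source on PMID and author position, falling back to the abbreviated name when no match exists; the column names and values are hypothetical, not AggAND's actual data model.

```python
# Hypothetical join of internal and external author metadata.
import pandas as pd

pubmed = pd.DataFrame({
    "pmid": [101, 101, 202],
    "author_pos": [1, 2, 1],
    "name": ["Smith J", "Lee K", "Smith J"],
})
external = pd.DataFrame({
    "pmid": [101, 202],
    "author_pos": [1, 1],
    "full_name": ["John A. Smith", "Jane B. Smith"],
})

enhanced = pubmed.merge(external, on=["pmid", "author_pos"], how="left")
# Keep the built-in abbreviated name where no external match exists.
enhanced["name"] = enhanced["full_name"].fillna(enhanced["name"])
print(enhanced[["pmid", "author_pos", "name"]])
```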
RESULTS
Experimental results on the enhanced name metadata demonstrate performance comparable to that of three author-identifier systems and superior to that of the original name metadata. More importantly, our method, AggAND, which incorporates both the enhanced name metadata and the extended metadata, yields F1 scores of 95.80% and 93.71% on two datasets, outperforming the state-of-the-art method by a large margin (3.61% and 6.55%, respectively).
CONCLUSIONS
The feasibility and good performance of our methods not only demonstrate the importance of external databases for disambiguation, but also point to a promising direction for future AND studies, in which information aggregated from multiple bibliographic databases can effectively improve disambiguation performance. The methodology shown here can be generalized to bibliographic databases beyond PubMed. Our code and data are available online (https://github.com/carmanzhang/PubMed-AND-method).
Topics: Databases, Bibliographic; Databases, Factual; Metadata; PubMed; Semantics
PubMed: 34180522
DOI: 10.1093/jamia/ocab095
BMC Genomic Data Nov 2022
While data sharing increases, most open data remain difficult to reuse or to identify because of a lack of associated metadata. In this editorial, I discuss the importance of such metadata in the context of genomics, and why they are essential to the success of data sharing.
Topics: Metadata; Information Dissemination; Genomics
PubMed: 36371151
DOI: 10.1186/s12863-022-01095-1