-
Bioinformatics (Oxford, England) Nov 2019
SUMMARY
Sequencing data resources have increased exponentially in recent years, as has interest in large-scale meta-analyses of integrated next-generation sequencing datasets. However, curation of integrated datasets that match a user's particular research priorities is currently a time-intensive and imprecise task. MetaSeek is a sequencing data discovery tool that enables users to flexibly search and filter on any metadata field to quickly find the sequencing datasets that meet their needs. MetaSeek automatically scrapes metadata from all publicly available datasets in the Sequence Read Archive, cleans and parses messy, user-provided metadata into a structured, standard-compliant database and predicts missing fields where possible. MetaSeek provides a web-based graphical user interface and interactive visualization dashboard, as well as a programmatic API to rapidly search, filter, visualize, save, share and download matching sequencing metadata.
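The "search and filter on any metadata field" idea can be illustrated with a minimal Python sketch. It does not call the MetaSeek API; the records, the field names (`library_source`, `avg_read_length`, `latitude`) and the `filter_records` helper are all hypothetical and exist only for illustration.

```python
# Hypothetical metadata records; field names are illustrative, not MetaSeek's schema.
RECORDS = [
    {"run_id": "run-1", "library_source": "METAGENOMIC", "avg_read_length": 150, "latitude": 42.3},
    {"run_id": "run-2", "library_source": "GENOMIC", "avg_read_length": 100, "latitude": None},
    {"run_id": "run-3", "library_source": "METAGENOMIC", "avg_read_length": 250, "latitude": -12.0},
]


def filter_records(records, **rules):
    """Keep records that satisfy every rule.

    A rule is either an exact value or a predicate (callable) applied to the
    field. Missing (None) values never satisfy a predicate.
    """
    def matches(record):
        for field_name, rule in rules.items():
            value = record.get(field_name)
            if callable(rule):
                if value is None or not rule(value):
                    return False
            elif value != rule:
                return False
        return True

    return [record for record in records if matches(record)]


# Metagenomic runs with reads of at least 200 bases.
hits = filter_records(
    RECORDS,
    library_source="METAGENOMIC",
    avg_read_length=lambda n: n >= 200,
)
print([record["run_id"] for record in hits])
```

Treating missing values as non-matching is a design choice of this sketch; a real tool might instead expose them as a separate filter option.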
AVAILABILITY AND IMPLEMENTATION
The MetaSeek online interface is available at https://www.metaseek.cloud/. The MetaSeek database can also be accessed via API to programmatically search, filter and download all metadata. MetaSeek source code, metadata scrapers and documents are available at https://github.com/MetaSeek-Sequencing-Data-Discovery/metaseek/.
Topics: Databases, Factual; High-Throughput Nucleotide Sequencing; Metadata; Software
PubMed: 31225863
DOI: 10.1093/bioinformatics/btz499
-
BMC Bioinformatics Apr 2022
BACKGROUND
Heterogeneous omics data, increasingly collected through high-throughput technologies, can contain hidden answers to very important and still unsolved biomedical questions. Their integration and processing are crucial mostly for tertiary analysis of Next Generation Sequencing data, although suitable big data strategies still address mainly primary and secondary analysis. Hence, there is a pressing need for algorithms specifically designed to explore big omics datasets, capable of ensuring scalability and interoperability, possibly relying on high-performance computing infrastructures.
RESULTS
We propose RGMQL, an R/Bioconductor package conceived to provide a set of specialized functions to extract, combine, process and compare omics datasets and their metadata from different and differently localized sources. RGMQL is built over the GenoMetric Query Language (GMQL) data management and computational engine, and can leverage its open curated repository as well as its cloud-based resources, with the possibility of outsourcing computational tasks to GMQL remote services. Furthermore, it overcomes the limits of the GMQL declarative syntax by guaranteeing a procedural approach in dealing with omics data within the R/Bioconductor environment. Above all, it provides full interoperability with other packages of the R/Bioconductor framework and extensibility over the most used genomic data structures and processing functions.
CONCLUSIONS
RGMQL is able to combine the query expressiveness and computational efficiency of GMQL with a complete processing flow in the R environment, being a fully integrated extension of the R/Bioconductor framework. Here we provide three fully reproducible example use cases of biological relevance that are particularly explanatory of its flexibility of use and interoperability with other R/Bioconductor packages. They show how RGMQL can easily scale up from local to parallel and cloud computing while it combines and analyzes heterogeneous omics data from local or remote datasets, both public and private, in a completely transparent way to the user.
Topics: Big Data; Cloud Computing; Genomics; Metadata; Software
PubMed: 35392801
DOI: 10.1186/s12859-022-04648-4
-
BMC Bioinformatics Mar 2021
BACKGROUND
Women are at more than 1.5-fold higher risk for clinically relevant adverse drug events. While this higher prevalence is partially due to gender-related effects, biological sex differences likely also impact drug response. Publicly available gene expression databases provide a unique opportunity for examining drug response at a cellular level. However, missingness and heterogeneity of metadata prevent large-scale identification of drug exposure studies and limit assessments of sex bias. To address this, we trained organism-specific models to infer sample sex from gene expression data, and used entity normalization to map metadata cell line and drug mentions to existing ontologies. Using this method, we inferred sex labels for 450,371 human and 245,107 mouse microarray and RNA-seq samples from refine.bio.
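The idea of inferring sample sex from expression and comparing it with the metadata label can be sketched in Python as follows. This is not the authors' trained, organism-specific models: it swaps in a much simpler marker-gene heuristic for human samples (XIST is expressed mainly in female cells, RPS4Y1 is Y-linked). The `margin` threshold, the sample values and both helper names are assumptions made for illustration.

```python
def infer_sex(expression, margin=1.0):
    """Call sex from two marker genes on a log-expression scale.

    Returns "female", "male", or "unknown" when a marker is missing or the
    two markers are too close to call. The margin is an arbitrary assumption.
    """
    xist = expression.get("XIST")
    y_gene = expression.get("RPS4Y1")
    if xist is None or y_gene is None:
        return "unknown"
    diff = xist - y_gene
    if diff > margin:
        return "female"
    if diff < -margin:
        return "male"
    return "unknown"


def flag_misannotated(samples):
    """Return IDs whose metadata label contradicts a confident inferred call."""
    flagged = []
    for sample_id, (label, expression) in samples.items():
        inferred = infer_sex(expression)
        if label in ("female", "male") and inferred != "unknown" and inferred != label:
            flagged.append(sample_id)
    return flagged


# Hypothetical samples: (metadata sex label or None, log-scale expression).
SAMPLES = {
    "s1": ("female", {"XIST": 9.2, "RPS4Y1": 1.1}),
    "s2": ("male", {"XIST": 8.7, "RPS4Y1": 0.9}),   # label disagrees with expression
    "s3": (None, {"XIST": 0.5, "RPS4Y1": 7.8}),     # missing label, can be imputed
    "s4": ("male", {"XIST": 4.0, "RPS4Y1": 4.3}),   # ambiguous, left uncalled
}

print(flag_misannotated(SAMPLES))
```

Leaving ambiguous samples as "unknown" rather than forcing a call keeps the mismatch flags conservative, which matters when they are used to estimate misannotation rates.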
RESULTS
Overall, we find a slight female bias (52.1%) in human samples and a male bias (62.5%) in mouse samples; this corresponds to a majority of mixed sex studies in humans and single sex studies in mice, split between female-only and male-only (25.8% vs. 18.9% in human and 21.6% vs. 31.1% in mouse, respectively). In drug studies, we find limited evidence for sex-sampling bias overall; however, specific categories of drugs, including human cancer and mouse nervous system drugs, are enriched in female-only and male-only studies, respectively. We leverage our expression-based sex labels to further examine the complexity of cell line sex and assess the frequency of metadata sex label misannotations (2-5%).
CONCLUSIONS
Our results demonstrate limited overall sex bias, while highlighting high bias in specific subfields and underscoring the importance of including sex labels to better understand the underlying biology. We make our inferred and normalized labels, along with flags for misannotated samples, publicly available to catalyze the routine use of sex as a study variable in future analyses.
Topics: Animals; Bias; Databases, Factual; Female; Gene Expression; Male; Metadata; Mice; Neoplasms; Sex Factors
PubMed: 33784977
DOI: 10.1186/s12859-021-04070-2
-
American Journal of Biological... Apr 2024
Review
OBJECTIVES
Ancient human dental calculus is a unique, nonrenewable biological resource encapsulating key information about the diets, lifestyles, and health conditions of past individuals and populations. With compounding calls for its destructive analysis, it is imperative to refine the ways in which the scientific community documents, samples, and analyzes dental calculus so as to maximize its utility to the public and scientific community.
MATERIALS AND METHODS
Our research team conducted an IRB-approved survey of dental calculus researchers with diverse academic backgrounds, research foci, and analytical specializations.
RESULTS
This survey reveals variation in how metadata is collected and utilized across different subdisciplines and highlights how these differences have profound implications for dental calculus research. Moreover, the survey suggests the need for more communication between those who excavate, curate, and analyze biomolecular data from dental calculus.
DISCUSSION
Challenges in cross-disciplinary communication limit researchers' ability to effectively utilize samples in rigorous and reproducible ways. Specifically, the lack of standardized skeletal and dental metadata recording and contamination avoidance procedures hinders downstream anthropological applications, as well as the pursuit of broader paleodemographic and paleoepidemiological inquiries that rely on more complete information about the individuals sampled. To provide a path forward toward more ethical and standardized dental calculus sampling and documentation approaches, we review the current methods by which skeletal and dental metadata are recorded. We also describe trends in sampling and contamination-control approaches. Finally, we use that information to suggest new guidelines for ancient dental calculus documentation and sampling strategies that will improve research practices in the future.
Topics: Humans; Dental Calculus; Metadata; Anthropology; Communication; Documentation
PubMed: 37994571
DOI: 10.1002/ajpa.24871
-
Science (New York, N.Y.) Sep 2018
Topics: Metadata; Periodicals as Topic; Research; Science
PubMed: 30237336
DOI: 10.1126/science.361.6408.1178
-
Database : the Journal of Biological... May 2022
Despite progress in the development of standards for describing and exchanging scientific information, the lack of easy-to-use standards for mapping between different representations of the same or similar objects in different databases poses a major impediment to data integration and interoperability. Mappings often lack the metadata needed to be correctly interpreted and applied. For example, are two terms equivalent or merely related? Are they narrow or broad matches? Or are they associated in some other way? Such relationships between the mapped terms are often not documented, which leads to incorrect assumptions and makes them hard to use in scenarios that require a high degree of precision (such as diagnostics or risk prediction). Furthermore, the lack of descriptions of how mappings were done makes it hard to combine and reconcile mappings, particularly curated and automated ones. We have developed the Simple Standard for Sharing Ontological Mappings (SSSOM) which addresses these problems by: (i) Introducing a machine-readable and extensible vocabulary to describe metadata that makes imprecision, inaccuracy and incompleteness in mappings explicit. (ii) Defining an easy-to-use simple table-based format that can be integrated into existing data science pipelines without the need to parse or query ontologies, and that integrates seamlessly with Linked Data principles. (iii) Implementing open and community-driven collaborative workflows that are designed to evolve the standard continuously to address changing requirements and mapping practices. (iv) Providing reference tools and software libraries for working with the standard. In this paper, we present the SSSOM standard, describe several use cases in detail and survey some of the existing work on standardizing the exchange of mappings, with the goal of making mappings Findable, Accessible, Interoperable and Reusable (FAIR). The SSSOM specification can be found at http://w3id.org/sssom/spec. 
Database URL: http://w3id.org/sssom/spec.
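The table-based format described in point (ii) can be read with nothing more than a standard TSV parser, as the following Python sketch suggests. The column names (`subject_id`, `predicate_id`, `object_id`, `mapping_justification`, `confidence`) follow the SSSOM specification as far as we are aware; the identifiers, the `example.org` prefixes, the confidence threshold and the helper names are hypothetical.

```python
import csv

# A tiny SSSOM-style table: commented metadata header, then tab-separated rows.
# All identifiers and prefixes below are made up for illustration.
SSSOM_TSV = """\
# curie_map:
#   A: http://example.org/a/
#   B: http://example.org/b/
subject_id\tpredicate_id\tobject_id\tmapping_justification\tconfidence
A:001\tskos:exactMatch\tB:101\tsemapv:ManualMappingCuration\t0.95
A:002\tskos:broadMatch\tB:102\tsemapv:LexicalMatching\t0.60
A:003\tskos:exactMatch\tB:103\tsemapv:LexicalMatching\t0.40
"""


def read_mappings(text):
    """Parse the table, skipping the '#'-prefixed metadata header lines."""
    rows = [line for line in text.splitlines() if line and not line.startswith("#")]
    return list(csv.DictReader(rows, delimiter="\t"))


def precise_mappings(mappings, min_confidence=0.9):
    """Keep only exact matches at or above a confidence threshold."""
    return [
        m for m in mappings
        if m["predicate_id"] == "skos:exactMatch"
        and float(m["confidence"]) >= min_confidence
    ]


mappings = read_mappings(SSSOM_TSV)
for m in precise_mappings(mappings):
    print(m["subject_id"], "->", m["object_id"], "via", m["mapping_justification"])
```

Because the predicate and justification travel with every row, a precision-sensitive consumer can discard broad or low-confidence mappings without parsing or querying the underlying ontologies.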
Topics: Data Management; Databases, Factual; Metadata; Semantic Web; Workflow
PubMed: 35616100
DOI: 10.1093/database/baac035
-
Journal of Integrative Bioinformatics Oct 2021
This special issue of the Journal of Integrative Bioinformatics contains updated specifications of COMBINE standards in systems and synthetic biology. The 2021 special issue presents four updates of standards: Synthetic Biology Open Language Visual Version 2.3, Synthetic Biology Open Language Visual Version 3.0, Simulation Experiment Description Markup Language Level 1 Version 4, and OMEX Metadata specification Version 1.2. This document can also be consulted to identify the latest specifications of all COMBINE standards.
Topics: Computational Biology; Computer Simulation; Metadata; Programming Languages; Software; Synthetic Biology
PubMed: 34674411
DOI: 10.1515/jib-2021-0026
-
Nature Methods Mar 2023
The design of biocatalytic reaction systems is highly complex owing to the dependency of the estimated kinetic parameters on the enzyme, the reaction conditions, and the modeling method. Consequently, reproducibility of enzymatic experiments and reusability of enzymatic data are challenging. We developed the XML-based markup language EnzymeML to enable storage and exchange of enzymatic data such as reaction conditions, the time course of the substrate and the product, kinetic parameters and the kinetic model, thus making enzymatic data findable, accessible, interoperable and reusable (FAIR). The feasibility and usefulness of the EnzymeML toolbox is demonstrated in six scenarios, for which data and metadata of different enzymatic reactions are collected and analyzed. EnzymeML serves as a seamless communication channel between experimental platforms, electronic lab notebooks, tools for modeling of enzyme kinetics, publication platforms and enzymatic reaction databases. EnzymeML is open and transparent, and invites the community to contribute. All documents and codes are freely available at https://enzymeml.org .
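To make concrete what "reaction conditions, time course, kinetic parameters and kinetic model" might look like as data, here is a minimal Python sketch. It does not use the EnzymeML format or toolbox; the `EnzymeAssay` record and `simulate_substrate` helper are hypothetical, and Michaelis-Menten kinetics is assumed only because it is a common kinetic model, not because the abstract names it.

```python
from dataclasses import dataclass


@dataclass
class EnzymeAssay:
    """Hypothetical record pairing reaction conditions with fitted parameters."""
    temperature_c: float
    ph: float
    vmax: float  # maximum rate, e.g. mM/min
    km: float    # Michaelis constant, e.g. mM

    def rate(self, substrate: float) -> float:
        """Michaelis-Menten rate: v = Vmax * S / (Km + S)."""
        return self.vmax * substrate / (self.km + substrate)


def simulate_substrate(assay, s0, minutes, dt=0.01):
    """Forward-Euler time course of substrate depletion, one value per step."""
    substrate = s0
    course = [substrate]
    for _ in range(int(round(minutes / dt))):
        substrate = max(substrate - assay.rate(substrate) * dt, 0.0)
        course.append(substrate)
    return course


assay = EnzymeAssay(temperature_c=25.0, ph=7.5, vmax=2.0, km=0.5)
course = simulate_substrate(assay, s0=1.0, minutes=1.0)
print(f"rate at S = Km: {assay.rate(assay.km):.2f}")
print(f"substrate after 1 min: {course[-1]:.3f}")
```

Keeping the conditions and the parameters in one record reflects the abstract's point that estimated kinetic parameters depend on the enzyme, the conditions and the modeling method.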
Topics: Reproducibility of Results; Data Management; Metadata; Databases, Factual; Kinetics
PubMed: 36759590
DOI: 10.1038/s41592-022-01763-1
-
Journal of Biomedical Semantics Mar 2022
BACKGROUND
Health data from different specialties or domains generally have diverse formats and meanings, which can cause semantic communication barriers when these data are exchanged among heterogeneous systems. As such, this study is intended to develop a national health concept data model (HCDM) and develop a corresponding system to facilitate healthcare data standardization and centralized metadata management.
METHODS
Based on 55 data sets (4640 data items) from 7 health business domains in China, a bottom-up approach was employed to build the structure and metadata for HCDM by referencing HL7 RIM. According to ISO/IEC 11179, a top-down approach was used to develop and standardize the data elements.
RESULTS
HCDM adopted a three-level architecture of class, attribute and data type, and consisted of 6 classes and 15 sub-classes. Each class had a set of descriptive attributes and every attribute was assigned a data type. 100 initial data elements (DEs) were extracted from HCDM and 144 general DEs were derived from corresponding initial DEs. Domain DEs were transformed by specializing general DEs using 12 controlled vocabularies which were developed from HL7 vocabularies and actual health demands. A model-based system was successfully established to evaluate and manage the NHDD.
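The three-level architecture of class, attribute and data type can be pictured with a small Python sketch. The class names, attribute names and data types below are hypothetical and are not taken from HCDM; the sketch only shows how classes, sub-classes, attributes and data types might nest.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass(frozen=True)
class DataType:
    name: str


@dataclass
class Attribute:
    name: str
    data_type: DataType  # every attribute is assigned a data type


@dataclass
class ModelClass:
    name: str
    attributes: List[Attribute] = field(default_factory=list)
    sub_classes: List["ModelClass"] = field(default_factory=list)

    def walk(self) -> Iterator["ModelClass"]:
        """Yield this class and all nested sub-classes, depth first."""
        yield self
        for sub in self.sub_classes:
            yield from sub.walk()


# Hypothetical content, for illustration only.
CODED = DataType("CodedValue")
DATETIME = DataType("DateTime")

encounter = ModelClass(
    "Encounter",
    attributes=[Attribute("start_time", DATETIME)],
    sub_classes=[
        ModelClass("OutpatientVisit", attributes=[Attribute("department", CODED)]),
        ModelClass("InpatientStay", attributes=[Attribute("ward", CODED)]),
    ],
)

for model_class in encounter.walk():
    for attribute in model_class.attributes:
        print(model_class.name, attribute.name, attribute.data_type.name)
```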
CONCLUSIONS
HCDM provided a unified metadata reference for multi-source data standardization and management. This approach of defining health data elements was a feasible solution in healthcare information standardization to enable healthcare interoperability in China.
Topics: Delivery of Health Care; Metadata; Semantics; Vocabulary, Controlled
PubMed: 35303946
DOI: 10.1186/s13326-022-00265-5
-
GigaScience Nov 2022
The Common Fund Data Ecosystem (CFDE) has created a flexible system of data federation that enables researchers to discover datasets from across the US National Institutes of Health Common Fund without requiring that data owners move, reformat, or rehost those data. This system is centered on a catalog that integrates detailed descriptions of biomedical datasets from individual Common Fund Programs' Data Coordination Centers (DCCs) into a uniform metadata model that can then be indexed and searched from a centralized portal. This Crosscut Metadata Model (C2M2) supports the wide variety of data types and metadata terms used by individual DCCs and can readily describe nearly all forms of biomedical research data. We detail its use to ingest and index data from 11 DCCs.
Topics: Ecosystem; Metadata; Financial Management
PubMed: 36409836
DOI: 10.1093/gigascience/giac105