-
GigaScience Dec 2022Scientists employing omics in life science studies face challenges such as the modeling of multiassay studies, recording of all relevant parameters, and managing many...
Scientists employing omics in life science studies face challenges such as the modeling of multiassay studies, recording of all relevant parameters, and managing many samples with their metadata. They must manage many large files that are the results of the assays or subsequent computation. Users with diverse backgrounds, ranging from computational scientists to wet-lab scientists, have dissimilar needs when it comes to data access, with programmatic interfaces being favored by the former and graphical ones by the latter. We introduce SODAR, the system for omics data access and retrieval. SODAR is a software package that addresses these challenges by providing a web-based graphical user interface for managing multiassay studies and describing them using the ISA (Investigation, Study, Assay) data model and the ISA-Tab file format. Data storage is handled using the iRODS data management system, which handles large quantities of files and substantial amounts of data. SODAR also offers programmable APIs and command-line access for metadata and file storage. SODAR supports complex omics integration studies and can be easily installed. The software is written in Python 3 and freely available at https://github.com/bihealth/sodar-server under the MIT license.
Topics: Multiomics; Metadata; Software; Information Storage and Retrieval; Data Management
PubMed: 37498129
DOI: 10.1093/gigascience/giad052 -
Computational and Structural... 2023In the fast-evolving landscape of biomedical research, the emergence of big data has presented researchers with extraordinary opportunities to explore biological... (Review)
Review
In the fast-evolving landscape of biomedical research, the emergence of big data has presented researchers with extraordinary opportunities to explore biological complexities. In biomedical research, big data imply also a big responsibility. This is not only due to genomics data being sensitive information but also due to genomics data being shared and re-analysed among the scientific community. This saves valuable resources and can even help to find new insights . To fully use these opportunities, detailed and correct metadata are imperative. This includes not only the availability of metadata but also their correctness. Metadata integrity serves as a fundamental determinant of research credibility, supporting the reliability and reproducibility of data-driven findings. Ensuring metadata availability, curation, and accuracy are therefore essential for bioinformatic research. Not only must metadata be readily available, but they must also be meticulously curated and ideally error-free. Motivated by an accidental discovery of a critical metadata error in patient data published in two high-impact journals, we aim to raise awareness for the need of correct, complete, and curated metadata. We describe how the metadata error was found, addressed, and present examples for metadata-related challenges in omics research, along with supporting measures, including tools for checking metadata and software to facilitate various steps from data analysis to published research.
PubMed: 37860229
DOI: 10.1016/j.csbj.2023.10.006 -
GigaScience Sep 2021Recent technological advances have resulted in an unprecedented increase in publicly available biomedical data, yet the reuse of the data is often precluded by...
BACKGROUND
Recent technological advances have resulted in an unprecedented increase in publicly available biomedical data, yet the reuse of the data is often precluded by experimental bias and a lack of annotation depth and consistency. Missing annotations makes it impossible for researchers to find datasets specific to their needs.
FINDINGS
Here, we investigate RNA-sequencing metadata prediction based on gene expression values. We present a deep-learning-based domain adaptation algorithm for the automatic annotation of RNA-sequencing metadata. We show, in multiple experiments, that our model is better at integrating heterogeneous training data compared with existing linear regression-based approaches, resulting in improved tissue type classification. By using a model architecture similar to Siamese networks, the algorithm can learn biases from datasets with few samples.
CONCLUSION
Using our novel domain adaptation approach, we achieved metadata annotation accuracies up to 15.7% better than a previously published method. Using the best model, we provide a list of >10,000 novel tissue and sex label annotations for 8,495 unique SRA samples. Our approach has the potential to revive idle datasets by automated annotation making them more searchable.
Topics: Algorithms; Bias; Metadata; Molecular Sequence Annotation; RNA; Sequence Analysis, RNA
PubMed: 34553213
DOI: 10.1093/gigascience/giab064 -
Bioinformatics (Oxford, England) Jan 2023Several genomic databases host data and metadata for an ever-growing collection of sequence datasets. While these databases have a shared hierarchical structure, there...
MOTIVATION
Several genomic databases host data and metadata for an ever-growing collection of sequence datasets. While these databases have a shared hierarchical structure, there are no tools specifically designed to leverage it for metadata extraction.
RESULTS
We present a command-line tool, called ffq, for querying user-generated data and metadata from sequence databases. Given an accession or a paper's DOI, ffq efficiently fetches metadata and links to raw data in JSON format. ffq's modularity and simplicity make it extensible to any genomic database exposing its data for programmatic access.
AVAILABILITY AND IMPLEMENTATION
ffq is free and open source, and the code can be found here: https://github.com/pachterlab/ffq.
Topics: Software; Metadata; Databases, Nucleic Acid
PubMed: 36610997
DOI: 10.1093/bioinformatics/btac667 -
Scientific Data Jun 2022With the rapid development of high-throughput sequencing technology, the amount of metagenomic data (including both 16S and whole-genome sequencing data) in public...
With the rapid development of high-throughput sequencing technology, the amount of metagenomic data (including both 16S and whole-genome sequencing data) in public repositories is increasing exponentially. However, owing to the large and decentralized nature of the data, it is still difficult for users to mine, compare, and analyze the data. The animal metagenome database (AnimalMetagenome DB) integrates metagenomic sequencing data with host information, making it easier for users to find data of interest. The AnimalMetagenome DB is designed to contain all public metagenomic data from animals, and the data are divided into domestic and wild animal categories. Users can browse, search, and download animal metagenomic data of interest based on different attributes of the metadata such as animal species, sample site, study purpose, and DNA extraction method. The AnimalMetagenome DB version 1.0 includes metadata for 82,097 metagenomes from 4 domestic animals (pigs, bovines, horses, and sheep) and 540 wild animals. These metagenomes cover 15 years of experiments, 73 countries, 1,044 studies, 63,214 amplicon sequencing data, and 10,672 whole genome sequencing data. All data in the database are hosted and available in figshare https://doi.org/10.6084/m9.figshare.19728619 .
Topics: Animals; Cattle; Databases, Factual; High-Throughput Nucleotide Sequencing; Horses; Metadata; Metagenome; Metagenomics; Sheep; Swine
PubMed: 35710683
DOI: 10.1038/s41597-022-01444-w -
Journal of Medical Internet Research Mar 2023Data provenance refers to the origin, processing, and movement of data. Reliable and precise knowledge about data provenance has great potential to improve... (Review)
Review
BACKGROUND
Data provenance refers to the origin, processing, and movement of data. Reliable and precise knowledge about data provenance has great potential to improve reproducibility as well as quality in biomedical research and, therefore, to foster good scientific practice. However, despite the increasing interest on data provenance technologies in the literature and their implementation in other disciplines, these technologies have not yet been widely adopted in biomedical research.
OBJECTIVE
The aim of this scoping review was to provide a structured overview of the body of knowledge on provenance methods in biomedical research by systematizing articles covering data provenance technologies developed for or used in this application area; describing and comparing the functionalities as well as the design of the provenance technologies used; and identifying gaps in the literature, which could provide opportunities for future research on technologies that could receive more widespread adoption.
METHODS
Following a methodological framework for scoping studies and the PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses Extension for Scoping Reviews) guidelines, articles were identified by searching the PubMed, IEEE Xplore, and Web of Science databases and subsequently screened for eligibility. We included original articles covering software-based provenance management for scientific research published between 2010 and 2021. A set of data items was defined along the following five axes: publication metadata, application scope, provenance aspects covered, data representation, and functionalities. The data items were extracted from the articles, stored in a charting spreadsheet, and summarized in tables and figures.
RESULTS
We identified 44 original articles published between 2010 and 2021. We found that the solutions described were heterogeneous along all axes. We also identified relationships among motivations for the use of provenance information, feature sets (capture, storage, retrieval, visualization, and analysis), and implementation details such as the data models and technologies used. The important gap that we identified is that only a few publications address the analysis of provenance data or use established provenance standards, such as PROV.
CONCLUSIONS
The heterogeneity of provenance methods, models, and implementations found in the literature points to the lack of a unified understanding of provenance concepts for biomedical data. Providing a common framework, a biomedical reference, and benchmarking data sets could foster the development of more comprehensive provenance solutions.
Topics: Humans; Biomedical Research; Metadata; PubMed; Reproducibility of Results; Software
PubMed: 36972116
DOI: 10.2196/42289 -
Progress in Biophysics and Molecular... Jan 2022Advancements in neuroscience research have led to steadily accelerating data production and sharing. The online community repository of neural reconstructions...
Advancements in neuroscience research have led to steadily accelerating data production and sharing. The online community repository of neural reconstructions NeuroMorpho.Org grew from fewer than 1000 digitally traced neurons in 2006 to more than 140,000 cells today, including glia that now constitute 10.1% of the content. Every reconstruction consists of a detailed 3D representation of branch geometry and connectivity in a standardized format, from which a collection of morphometric features is extracted and stored. Moreover, each entry in the database is accompanied by rich metadata annotation describing the animal subject, anatomy, and experimental details. The rapid expansion of this resource in the past decade was accompanied by a parallel rise in the complexity of the available information, creating both opportunities and challenges for knowledge mining. Here, we introduce a new summary reporting functionality, allowing NeuroMorpho.Org users to efficiently download digests of metadata and morphometrics from multiple groups of similar cells for further analysis. We demonstrate the capabilities of the tool for both glia and neurons and present an illustrative statistical analysis of the resulting data.
Topics: Animals; Databases, Factual; Metadata; Neurons; Neurosciences
PubMed: 34022302
DOI: 10.1016/j.pbiomolbio.2021.05.005 -
GigaScience Dec 2022Contamination detection is a important step that should be carefully considered in early stages when designing and performing microbiome studies to avoid biased...
BACKGROUND
Contamination detection is a important step that should be carefully considered in early stages when designing and performing microbiome studies to avoid biased outcomes. Detecting and removing true contaminants is challenging, especially in low-biomass samples or in studies lacking proper controls. Interactive visualizations and analysis platforms are crucial to better guide this step, to help to identify and detect noisy patterns that could potentially be contamination. Additionally, external evidence, like aggregation of several contamination detection methods and the use of common contaminants reported in the literature, could help to discover and mitigate contamination.
RESULTS
We propose GRIMER, a tool that performs automated analyses and generates a portable and interactive dashboard integrating annotation, taxonomy, and metadata. It unifies several sources of evidence to help detect contamination. GRIMER is independent of quantification methods and directly analyzes contingency tables to create an interactive and offline report. Reports can be created in seconds and are accessible for nonspecialists, providing an intuitive set of charts to explore data distribution among observations and samples and its connections with external sources. Further, we compiled and used an extensive list of possible external contaminant taxa and common contaminants with 210 genera and 627 species reported in 22 published articles.
CONCLUSION
GRIMER enables visual data exploration and analysis, supporting contamination detection in microbiome studies. The tool and data presented are open source and available at https://gitlab.com/dacs-hpi/grimer.
Topics: Microbiota; Biomass; Metadata
PubMed: 36994872
DOI: 10.1093/gigascience/giad017 -
Science Advances Oct 2022Integrating structural information and metadata, such as gender, social status, or interests, enriches networks and enables a better understanding of the large-scale...
Integrating structural information and metadata, such as gender, social status, or interests, enriches networks and enables a better understanding of the large-scale structure of complex systems. However, existing approaches to augment networks with metadata for community detection only consider immediately adjacent nodes and cannot exploit the nonlocal relationships between metadata and large-scale network structure present in many spatial and social systems. Here, we develop a flow-based community detection framework based on the map equation that integrates network information and metadata of distant nodes and reveals more complex relationships. We analyze social and spatial networks and find that our methodology can detect functional metadata-informed communities distinct from those derived solely from network information or metadata. For example, in a mobility network of London, we identify communities that reflect the heterogeneity of income distribution, and in a European power grid network, we identify communities that capture relationships between geography and energy prices beyond country borders.
PubMed: 36306360
DOI: 10.1126/sciadv.abn7558 -
Nucleic Acids Research Jan 2022Virus infections are huge threats to living organisms and cause many diseases, such as COVID-19 caused by SARS-CoV-2, which has led to millions of deaths. To develop...
Virus infections are huge threats to living organisms and cause many diseases, such as COVID-19 caused by SARS-CoV-2, which has led to millions of deaths. To develop effective strategies to control viral infection, we need to understand its molecular events in host cells. Virus related functional genomic datasets are growing rapidly, however, an integrative platform for systematically investigating host responses to viruses is missing. Here, we developed a user-friendly multi-omics portal of viral infection named as MVIP (https://mvip.whu.edu.cn/). We manually collected available high-throughput sequencing data under viral infection, and unified their detailed metadata including virus, host species, infection time, assay, and target, etc. We processed multi-layered omics data of more than 4900 viral infected samples from 77 viruses and 33 host species with standard pipelines, including RNA-seq, ChIP-seq, and CLIP-seq, etc. In addition, we integrated these genome-wide signals into customized genome browsers, and developed multiple dynamic charts to exhibit the information, such as time-course dynamic and differential gene expression profiles, alternative splicing changes and enriched GO/KEGG terms. Furthermore, we implemented several tools for efficiently mining the virus-host interactions by virus, host and genes. MVIP would help users to retrieve large-scale functional information and promote the understanding of virus-host interactions.
Topics: Animals; Chromatin Immunoprecipitation Sequencing; Databases, Factual; Gene Ontology; Genome, Viral; High-Throughput Nucleotide Sequencing; Host Microbial Interactions; Humans; Metadata; Sequence Analysis, RNA; Software; Transcriptome; User-Computer Interface; Virus Diseases; Web Browser
PubMed: 34718748
DOI: 10.1093/nar/gkab958