IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2022
The integration of genomic metadata is, at the same time, an important, difficult, and well-recognized challenge. It is important because a wealth of public data repositories is available to drive biological and clinical research, and combining information from various heterogeneous and widely dispersed sources is paramount to a number of biological discoveries. It is difficult because the domain is complex and there is no agreement among the various metadata definitions, which refer to different vocabularies and ontologies. It is well recognized in the bioinformatics community because, in common practice, repositories are accessed one by one, their specific metadata definitions are learned through long and tedious effort, and such practice is error-prone. In this paper, we describe META-BASE, an architecture for integrating metadata extracted from a variety of genomic data sources, based upon a structured transformation process. We present a variety of innovative techniques for data extraction, cleaning, normalization and enrichment. We propose a general, open and extensible pipeline that can easily incorporate any number of new data sources, and we present the resulting repository, which already integrates several important sources and is exposed by means of practical user interfaces to respond to biological researchers' needs.
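The normalization step described above can be illustrated with a minimal sketch: heterogeneous source attributes are mapped onto a unified vocabulary and values are cleaned. The mapping rules and attribute names below are invented for illustration, not META-BASE's actual transformation rules.

```python
# Minimal sketch of metadata normalization: source-specific keys are
# mapped to a unified schema and values are cleaned. Rules are
# illustrative only, not those used by META-BASE.

# Hypothetical mapping from source-specific keys to unified attributes
KEY_MAP = {
    "cell_line": "biosample",
    "Cell Line": "biosample",
    "assay_type": "technique",
    "Assay": "technique",
}

def normalize(record: dict) -> dict:
    """Map source-specific keys to unified ones and clean values."""
    out = {}
    for key, value in record.items():
        unified = KEY_MAP.get(key, key.lower().replace(" ", "_"))
        out[unified] = value.strip().lower()
    return out

raw = {"Cell Line": " K562 ", "Assay": "ChIP-seq"}
print(normalize(raw))  # {'biosample': 'k562', 'technique': 'chip-seq'}
```

The same idea scales to per-source rule tables, so that adding a new data source only requires adding its key mappings.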
Topics: Computational Biology; Genomics; Information Storage and Retrieval; Metadata
PubMed: 32750853
DOI: 10.1109/TCBB.2020.2998954

AJOB Neuroscience, 2023
It has recently been suggested that if the Extended Mind thesis is true, mental privacy might be under serious threat. In this paper, I look into the details of this claim and propose that one way of dealing with this emerging threat requires that data ontology be enriched with an additional kind of data, namely mental data. I explore how mental data relate to both data and metadata and suggest that, arguably, and by contrast with these existing categories of informational content, mental data should not be merely legally protected. Rather, if we value mental privacy as we know it, technological measures should be employed to ensure that one's mental data are impossible, and not just legally impermissible, for others to obtain.
Topics: Privacy; Metadata; Technology
PubMed: 36537997
DOI: 10.1080/21507740.2022.2148772

PLoS ONE, 2017
The Encyclopedia of DNA Elements (ENCODE) project is an ongoing collaborative effort, initiated shortly after the completion of the Human Genome Project, to create a comprehensive catalog of functional elements. The current database exceeds 6500 experiments across more than 450 cell lines and tissues using a wide array of experimental techniques to study the chromatin structure, regulatory and transcriptional landscape of the H. sapiens and M. musculus genomes. All ENCODE experimental data, metadata, and associated computational analyses are submitted to the ENCODE Data Coordination Center (DCC) for validation, tracking, storage, unified processing, and distribution to community resources and the scientific community. As the volume of data increases, the identification and organization of experimental details becomes increasingly intricate and demands careful curation. The ENCODE DCC has created a general-purpose software system, known as SnoVault, that supports metadata and file submission, a database used for metadata storage, web pages for displaying the metadata, and a robust API for querying the metadata. The software is fully open source; code and installation instructions can be found at http://github.com/ENCODE-DCC/snovault/ (for the generic database) and http://github.com/ENCODE-DCC/encoded/ (for storing genomic data in the manner of ENCODE). The core database engine, SnoVault (which is completely independent of ENCODE, genomic data, or bioinformatic data), has been released as a separate Python package.
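The validation step that the DCC performs on submitted metadata can be illustrated with a toy required-field check in the same spirit; the schema below is invented for illustration and is not an actual ENCODE or SnoVault schema.

```python
# Toy illustration of schema-validated metadata submission, in the
# spirit of SnoVault's checks. The schema and field names below are
# hypothetical, not an actual ENCODE schema.

SCHEMA = {
    "required": ["accession", "assay_term_name", "biosample"],
}

def validate(metadata: dict, schema: dict) -> list:
    """Return the list of missing required fields (empty if valid)."""
    return [f for f in schema["required"] if f not in metadata]

submission = {"accession": "TSTEXP000001", "assay_term_name": "ChIP-seq"}
print(validate(submission, SCHEMA))  # ['biosample']
```

In practice SnoVault uses full JSON schemas rather than a bare required-field list, but the submit-validate-store loop follows this shape.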
Topics: Animals; DNA; Databases, Genetic; Genome; Genomics; Humans; Metadata; Mice; Software
PubMed: 28403240
DOI: 10.1371/journal.pone.0175310

The Journal of the Acoustical Society of America, Feb 2022
Knowledge of hearing ability, as represented in audiograms, is essential for understanding how animals acoustically perceive their environment, predicting and counteracting the effects of anthropogenic noise, and managing wildlife. Audiogram data and relevant background information are currently only available embedded in the text of individual scientific publications in various unstandardized formats. This heterogeneity makes it hard to access, compare, and integrate audiograms. The Animal Audiogram Database (https://animalaudiograms.org) assembles published audiogram data, metadata about the corresponding experiments, and links to the original publications in a consistent format. The database content is the result of an extensive survey of the scientific literature and manual curation of the audiometric data found therein. As of November 1, 2021, the database contains 306 audiogram datasets from 34 animal species. The scope and format of the provided metadata and the design of the database interface were established through active research community involvement. Options to compare audiograms and download datasets in structured formats are provided. With the focus currently on vertebrates and hearing in underwater environments, the database is designed as a free and open resource for facilitating the review and correction of the contained data and collaborative extension with audiogram data from any taxonomic group and habitat.
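A structured audiogram format makes comparisons like the ones the database offers straightforward. As a minimal sketch, an audiogram can be represented as (frequency, threshold) pairs and queried at an arbitrary frequency by interpolation on a log-frequency axis; the data below are invented and the representation is an assumption, not the database's actual schema.

```python
# Sketch: an audiogram as (frequency_hz, threshold_db) pairs, with the
# threshold at an arbitrary frequency estimated by linear interpolation
# on a log-frequency axis. Example data are invented.
import math

def threshold_at(audiogram, freq_hz):
    """Interpolate the hearing threshold (dB) at freq_hz."""
    pts = sorted(audiogram)
    for (f1, t1), (f2, t2) in zip(pts, pts[1:]):
        if f1 <= freq_hz <= f2:
            w = (math.log(freq_hz) - math.log(f1)) / (math.log(f2) - math.log(f1))
            return t1 + w * (t2 - t1)
    raise ValueError("frequency outside measured range")

# Invented example audiogram (underwater thresholds in dB re 1 uPa)
audiogram = [(1000, 80.0), (4000, 60.0), (16000, 70.0)]
print(round(threshold_at(audiogram, 2000), 1))  # 70.0
```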
Topics: Animals; Audiometry; Hearing; Hearing Tests; Metadata; Noise
PubMed: 35232080
DOI: 10.1121/10.0009402

Particle and Fibre Toxicology, Jan 2022
Review
BACKGROUND
Assessing the safety of engineered nanomaterials (ENMs) is an interdisciplinary and complex process producing huge amounts of information and data. To make such data and metadata reusable for researchers, manufacturers, and regulatory authorities, there is an urgent need to record and provide this information in a structured, harmonized, and digitized way.
RESULTS
This study aimed to identify appropriate description standards and quality criteria for use specifically in nanosafety. Many existing standards and guidelines are designed for collecting data and metadata, ranging from regulatory guidelines to specific databases, but most are incomplete or not specifically designed for ENM research. However, by merging the content of several existing standards and guidelines, a basic catalogue of descriptive information and quality criteria was generated. In an iterative process, our interdisciplinary team identified deficits and added missing information into a comprehensive schema. Subsequently, this overview was externally evaluated by a panel of experts during a workshop. This whole process resulted in a minimum information table (MIT), specifying the necessary minimum information to be provided along with experimental results on effects of ENMs in the biological context, in a flexible and modular manner. The MIT is divided into six modules: general information, material information, biological model information, exposure information, endpoint read out information, and analysis and statistics. These modules are further partitioned into subdivisions that hold more detailed information. A comparison with existing ontologies that also aim to electronically collect data and metadata on nanosafety studies showed that the newly developed MIT exhibits a higher level of detail than those existing schemas, making it better suited to preventing gaps in the communication of information.
CONCLUSION
Implementing the requirements of the MIT into, for example, electronic lab notebooks (ELNs) would make the collection of all necessary data and metadata a daily routine and would thereby improve the reproducibility and reusability of experiments. Furthermore, this approach is particularly beneficial in view of the rapidly expanding development and application of novel non-animal alternative testing methods.
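The six MIT modules named in the abstract lend themselves to a simple completeness check, which is essentially what an ELN integration would enforce. A minimal sketch (the record contents are invented; the real MIT specifies subdivisions within each module):

```python
# Sketch of checking a metadata record against the six MIT modules
# named in the abstract. Record contents are invented examples.

MIT_MODULES = [
    "general information",
    "material information",
    "biological model information",
    "exposure information",
    "endpoint read out information",
    "analysis and statistics",
]

def missing_modules(record: dict) -> list:
    """Return the MIT modules absent from a submitted record."""
    return [m for m in MIT_MODULES if m not in record]

record = {
    "general information": {"title": "ENM toxicity study"},
    "material information": {"material": "TiO2 nanoparticles"},
}
print(missing_modules(record))
# ['biological model information', 'exposure information',
#  'endpoint read out information', 'analysis and statistics']
```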
Topics: Databases, Factual; Metadata; Reproducibility of Results; Research Design
PubMed: 34983569
DOI: 10.1186/s12989-021-00442-x

Scientific Data, Nov 2022
Recent advances in high-throughput experiments and systems biology approaches have resulted in hundreds of publications identifying "immune signatures". Unfortunately, these are often described within text, figures, or tables in a format not amenable to computational processing, thus severely hampering our ability to fully exploit this information. Here we present a data model to represent immune signatures, along with the Human Immunology Project Consortium (HIPC) Dashboard (www.hipc-dashboard.org), a web-enabled application to facilitate signature access and querying. The data model captures the biological response components (e.g., genes, proteins, cell types or metabolites) and metadata describing the context under which the signature was identified, using standardized terms from established resources (e.g., HGNC, Protein Ontology, Cell Ontology). For the current release, we have manually curated a collection of >600 immune signatures from >60 published studies profiling human vaccination responses. The system will aid in building a broader understanding of the human immune response to stimuli by enabling researchers to easily access and interrogate published immune signatures.
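A data model like the one described pairs response components with context metadata. The sketch below is an assumption modelled on the abstract's description, not the actual HIPC schema; the gene list and study context are invented examples.

```python
# Sketch of an immune-signature data model: response components plus
# standardized context metadata. Field names are assumptions based on
# the abstract, not the actual HIPC Dashboard schema.
from dataclasses import dataclass

@dataclass
class ImmuneSignature:
    response_components: list  # e.g. HGNC gene symbols
    component_type: str        # "gene", "protein", "cell type", ...
    tissue: str                # context metadata, ontology term
    exposure: str              # e.g. vaccine or stimulus
    source: str                # provenance, e.g. a publication ID

sig = ImmuneSignature(
    response_components=["IFI27", "IFI44L", "RSAD2"],  # invented example
    component_type="gene",
    tissue="whole blood",
    exposure="influenza vaccination",
    source="placeholder",
)
print(len(sig.response_components))  # 3
```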
Topics: Humans; Metadata; Software; Systems Biology; Vaccination
PubMed: 36347894
DOI: 10.1038/s41597-022-01558-1

Bioinformatics (Oxford, England), Mar 2023
MOTIVATION
The Gene Expression Omnibus has become an important source of biological data for secondary analysis. However, there is no simple, programmatic way to download data and metadata from Gene Expression Omnibus (GEO) in a standardized annotation format.
RESULTS
To address this, we present GEOfetch, a command-line tool that downloads and organizes data and metadata from GEO and SRA. GEOfetch formats the downloaded metadata as a Portable Encapsulated Project, providing a universal format for the reanalysis of public data.
AVAILABILITY AND IMPLEMENTATION
GEOfetch is available on Bioconda and the Python Package Index (PyPI).
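A Portable Encapsulated Project (PEP), the output format mentioned above, is essentially a YAML config pointing at a sample table. The sketch below illustrates that layout with invented values; consult the PEP specification and the GEOfetch documentation for the real schema and output.

```python
# Sketch of the PEP layout a tool like GEOfetch emits: a YAML config
# pointing at a sample table CSV. Values are invented illustrations.
import csv
import io

config_yaml = """\
pep_version: 2.0.0
sample_table: sample_table.csv
"""

sample_table_csv = """\
sample_name,protocol,data_source
sample1,RNA-seq,SRA
sample2,RNA-seq,SRA
"""

# Any downstream tool can read the sample table with plain CSV parsing
samples = list(csv.DictReader(io.StringIO(sample_table_csv)))
print([s["sample_name"] for s in samples])  # ['sample1', 'sample2']
```

Because the format is this simple, reanalysis pipelines can consume GEO-derived projects without GEO-specific parsing code.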
Topics: Metadata; Gene Expression; Computational Biology
PubMed: 36857584
DOI: 10.1093/bioinformatics/btad069

eLife, Oct 2022
The neurophysiology of cells and tissues is monitored electrophysiologically and optically in diverse experiments and species, ranging from flies to humans. Understanding the brain requires integration of data across this diversity, and thus these data must be findable, accessible, interoperable, and reusable (FAIR). This requires a standard language for data and metadata that can coevolve with neuroscience. We describe design and implementation principles for a language for neurophysiology data. Our open-source software (Neurodata Without Borders, NWB) defines and modularizes the interdependent, yet separable, components of a data language. We demonstrate NWB's impact through the unified description of neurophysiology data across diverse modalities and species. NWB exists in an ecosystem that includes data management, analysis, visualization, and archive tools. Thus, the NWB data language enables reproduction, interchange, and reuse of diverse neurophysiology data. More broadly, the design principles of NWB are generally applicable to enhance discovery across biology through data FAIRness.
Topics: Data Science; Ecosystem; Humans; Metadata; Neurophysiology; Software
PubMed: 36193886
DOI: 10.7554/eLife.78362

Behavior Research Methods, Apr 2021
A consensus on the importance of open data and reproducible code is emerging. How should data and code be shared to maximize the key desiderata of reproducibility, permanence, and accessibility? Research assets should be stored persistently in formats that are not software-restrictive, and documented so that others can reproduce and extend the required computations. The sharing method should be easy to adopt by already busy researchers. We suggest the R package standard as a solution for creating, curating, and communicating research assets. The R package standard, with extensions discussed herein, provides a format for assets and metadata that satisfies the above desiderata and facilitates reproducibility, open access, and sharing of materials through online platforms like GitHub and the Open Science Framework. We discuss a stack of R resources that help users create reproducible collections of research assets, from experiments to manuscripts, in the RStudio interface. We created an R package, vertical, to help researchers incorporate these tools into their workflows, and discuss its functionality at length in an online supplement. Together, these tools may increase the reproducibility and openness of psychological science.
Topics: Humans; Metadata; Reproducibility of Results; Software; Workflow
PubMed: 32875401
DOI: 10.3758/s13428-020-01436-x

BMC Bioinformatics, Sep 2022
BACKGROUND
Population variant analysis is of great importance for gathering insights into the links between human genotype and phenotype. The 1000 Genomes Project established a valuable reference for human genetic variation; however, the integrative use of the corresponding data with other datasets within existing repositories and pipelines is not fully supported. Particularly, there is a pressing need for flexible and fast selection of population partitions based on their variant and metadata-related characteristics.
RESULTS
Here, we target general germline or somatic mutation data sources for their seamless inclusion within an interoperable-format repository, supporting integration among them and with other genomic data, as well as their integrated use within bioinformatic workflows. In addition, we provide VarSum, a data summarization service working on sub-populations of interest selected using filters on population metadata and/or variant characteristics. The service is developed as an optimized computational framework with an Application Programming Interface (API) that can be called from within any existing computing pipeline or programming script. Provided example use cases of biological interest show the relevance, power and ease of use of the API functionalities.
CONCLUSIONS
The proposed data integration pipeline and the dataset extraction and summarization API pave the way for solid computational infrastructures that quickly process cumbersome variation data and allow biologists and bioinformaticians to easily perform scalable analyses on user-defined partitions of large cohorts from increasingly available genetic variation studies. With the current trend toward large nation-wide and cross-national sequencing and variation initiatives, we expect an ever-growing need for the kind of computational support proposed here.
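The population-filter idea at the core of VarSum can be sketched as composing a request that combines metadata filters with a genomic region of interest. The endpoint shape, field names, and values below are hypothetical; see the VarSum API documentation for the real interface.

```python
# Sketch of composing a population-filter request for a summarization
# service like VarSum. Field names and values are hypothetical, not
# the actual VarSum API schema.
import json

def build_request(metadata_filters: dict, region: dict) -> str:
    """Serialize filters on population metadata plus a genomic region."""
    return json.dumps({"meta": metadata_filters, "region": region},
                      sort_keys=True)

payload = build_request(
    {"population": "EUR", "gender": "female"},
    {"chrom": "1", "start": 1000000, "stop": 2000000},
)
print(json.loads(payload)["meta"]["population"])  # EUR
```

In a real pipeline, a payload like this would be POSTed to the service from within any script, which is what makes the API callable from existing workflows.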
Topics: Computational Biology; Genomics; Genotype; Humans; Metadata; Software
PubMed: 36175857
DOI: 10.1186/s12859-022-04927-0