IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2022
The integration of genomic metadata is, at the same time, an important, difficult, and well-recognized challenge. It is important because a wealth of public data repositories is available to drive biological and clinical research, and combining information from various heterogeneous and widely dispersed sources is paramount to a number of biological discoveries. It is difficult because the domain is complex and there is no agreement among the various metadata definitions, which refer to different vocabularies and ontologies. It is well recognized in the bioinformatics community because, in common practice, repositories are accessed one by one, their specific metadata definitions are learned through long and tedious effort, and such practice is error-prone. In this paper, we describe META-BASE, an architecture for integrating metadata extracted from a variety of genomic data sources, based upon a structured transformation process. We present a variety of innovative techniques for data extraction, cleaning, normalization and enrichment. We propose a general, open and extensible pipeline that can easily incorporate any number of new data sources, and we present the resulting repository, which already integrates several important sources and is exposed by means of practical user interfaces to respond to biological researchers' needs.
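The normalization step described above can be illustrated with a minimal sketch: heterogeneous source attributes are mapped onto a unified vocabulary and values are cleaned. The mapping rules and attribute names below are invented for illustration, not META-BASE's actual transformation rules.

```python
# Minimal sketch of metadata normalization: source-specific keys are
# mapped to a unified schema and values are cleaned. Rules are
# illustrative only, not those used by META-BASE.

# Hypothetical mapping from source-specific keys to unified attributes
KEY_MAP = {
    "cell_line": "biosample",
    "Cell Line": "biosample",
    "assay_type": "technique",
    "Assay": "technique",
}

def normalize(record: dict) -> dict:
    """Map source-specific keys to unified ones and clean values."""
    out = {}
    for key, value in record.items():
        unified = KEY_MAP.get(key, key.lower().replace(" ", "_"))
        out[unified] = value.strip().lower()
    return out

raw = {"Cell Line": " K562 ", "Assay": "ChIP-seq"}
print(normalize(raw))  # {'biosample': 'k562', 'technique': 'chip-seq'}
```

The same idea scales to per-source rule tables, so that adding a new data source only requires adding its key mappings.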
Topics: Computational Biology; Genomics; Information Storage and Retrieval; Metadata
PubMed: 32750853
DOI: 10.1109/TCBB.2020.2998954

AJOB Neuroscience, 2023
It has recently been suggested that if the Extended Mind thesis is true, mental privacy might be under serious threat. In this paper, I look into the details of this claim and propose that one way of dealing with this emerging threat requires that data ontology be enriched with an additional kind of data, namely mental data. I explore how mental data relate to both data and metadata and suggest that, arguably, and by contrast with these existing categories of informational content, mental data should not be merely legally protected. Rather, if we value mental privacy as we know it, technological measures should be employed to ensure that one's mental data are impossible, and not just legally impermissible, for others to obtain.
Topics: Privacy; Metadata; Technology
PubMed: 36537997
DOI: 10.1080/21507740.2022.2148772

PLoS ONE, 2017
The Encyclopedia of DNA Elements (ENCODE) project is an ongoing collaborative effort, initiated shortly after the completion of the Human Genome Project, to create a comprehensive catalog of functional elements. The current database exceeds 6500 experiments across more than 450 cell lines and tissues using a wide array of experimental techniques to study the chromatin structure, regulatory and transcriptional landscape of the H. sapiens and M. musculus genomes. All ENCODE experimental data, metadata, and associated computational analyses are submitted to the ENCODE Data Coordination Center (DCC) for validation, tracking, storage, unified processing, and distribution to community resources and the scientific community. As the volume of data increases, the identification and organization of experimental details becomes increasingly intricate and demands careful curation. The ENCODE DCC has created a general-purpose software system, known as SnoVault, that supports metadata and file submission, a database used for metadata storage, web pages for displaying the metadata, and a robust API for querying the metadata. The software is fully open source; code and installation instructions can be found at http://github.com/ENCODE-DCC/snovault/ (for the generic database) and http://github.com/ENCODE-DCC/encoded/ (for storing genomic data in the manner of ENCODE). The core database engine, SnoVault (which is completely independent of ENCODE, genomic data, or bioinformatic data), has been released as a separate Python package.
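The validation step that the DCC performs on submitted metadata can be illustrated with a toy required-field check in the same spirit; the schema below is invented for illustration and is not an actual ENCODE or SnoVault schema.

```python
# Toy illustration of schema-validated metadata submission, in the
# spirit of SnoVault's checks. The schema and field names below are
# hypothetical, not an actual ENCODE schema.

SCHEMA = {
    "required": ["accession", "assay_term_name", "biosample"],
}

def validate(metadata: dict, schema: dict) -> list:
    """Return the list of missing required fields (empty if valid)."""
    return [f for f in schema["required"] if f not in metadata]

submission = {"accession": "TSTEXP000001", "assay_term_name": "ChIP-seq"}
print(validate(submission, SCHEMA))  # ['biosample']
```

In practice SnoVault uses full JSON schemas rather than a bare required-field list, but the submit-validate-store loop follows this shape.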
Topics: Animals; DNA; Databases, Genetic; Genome; Genomics; Humans; Metadata; Mice; Software
PubMed: 28403240
DOI: 10.1371/journal.pone.0175310

The Journal of the Acoustical Society of America, Feb 2022
Knowledge of hearing ability, as represented in audiograms, is essential for understanding how animals acoustically perceive their environment, predicting and counteracting the effects of anthropogenic noise, and managing wildlife. Audiogram data and relevant background information are currently only available embedded in the text of individual scientific publications in various unstandardized formats. This heterogeneity makes it hard to access, compare, and integrate audiograms. The Animal Audiogram Database (https://animalaudiograms.org) assembles published audiogram data, metadata about the corresponding experiments, and links to the original publications in a consistent format. The database content is the result of an extensive survey of the scientific literature and manual curation of the audiometric data found therein. As of November 1, 2021, the database contains 306 audiogram datasets from 34 animal species. The scope and format of the provided metadata and the design of the database interface were established through active research community involvement. Options to compare audiograms and download datasets in structured formats are provided. With the focus currently on vertebrates and hearing in underwater environments, the database is designed as a free and open resource for facilitating the review and correction of the contained data and collaborative extension with audiogram data from any taxonomic group and habitat.
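A structured audiogram format makes comparisons like the ones the database offers straightforward. As a minimal sketch, an audiogram can be represented as (frequency, threshold) pairs and queried at an arbitrary frequency by interpolation on a log-frequency axis; the data below are invented and the representation is an assumption, not the database's actual schema.

```python
# Sketch: an audiogram as (frequency_hz, threshold_db) pairs, with the
# threshold at an arbitrary frequency estimated by linear interpolation
# on a log-frequency axis. Example data are invented.
import math

def threshold_at(audiogram, freq_hz):
    """Interpolate the hearing threshold (dB) at freq_hz."""
    pts = sorted(audiogram)
    for (f1, t1), (f2, t2) in zip(pts, pts[1:]):
        if f1 <= freq_hz <= f2:
            w = (math.log(freq_hz) - math.log(f1)) / (math.log(f2) - math.log(f1))
            return t1 + w * (t2 - t1)
    raise ValueError("frequency outside measured range")

# Invented example audiogram (underwater thresholds in dB re 1 uPa)
audiogram = [(1000, 80.0), (4000, 60.0), (16000, 70.0)]
print(round(threshold_at(audiogram, 2000), 1))  # 70.0
```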
Topics: Animals; Audiometry; Hearing; Hearing Tests; Metadata; Noise
PubMed: 35232080
DOI: 10.1121/10.0009402

Particle and Fibre Toxicology, Jan 2022
Review
BACKGROUND
Assessing the safety of engineered nanomaterials (ENMs) is an interdisciplinary and complex process producing huge amounts of information and data. To make such data and metadata reusable for researchers, manufacturers, and regulatory authorities, there is an urgent need to record and provide this information in a structured, harmonized, and digitized way.
RESULTS
This study aimed to identify appropriate description standards and quality criteria for use specifically in nanosafety. Many existing standards and guidelines are designed for collecting data and metadata, ranging from regulatory guidelines to specific databases, but most are incomplete or not specifically designed for ENM research. However, by merging the content of several existing standards and guidelines, a basic catalogue of descriptive information and quality criteria was generated. In an iterative process, our interdisciplinary team identified deficits and added missing information into a comprehensive schema. Subsequently, this overview was externally evaluated by a panel of experts during a workshop. This whole process resulted in a minimum information table (MIT), specifying the necessary minimum information to be provided along with experimental results on effects of ENMs in the biological context, in a flexible and modular manner. The MIT is divided into six modules: general information, material information, biological model information, exposure information, endpoint read out information, and analysis and statistics. These modules are further partitioned into subdivisions that hold more detailed information. A comparison with existing ontologies that also aim to electronically collect data and metadata on nanosafety studies showed that the newly developed MIT exhibits a higher level of detail than those existing schemas, making it better suited to preventing gaps in the communication of information.
CONCLUSION
Implementing the requirements of the MIT into, for example, electronic lab notebooks (ELNs) would make the collection of all necessary data and metadata a daily routine and would thereby improve the reproducibility and reusability of experiments. Furthermore, this approach is particularly beneficial in view of the rapidly expanding development and application of novel non-animal alternative testing methods.
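The six MIT modules named in the abstract lend themselves to a simple completeness check, which is essentially what an ELN integration would enforce. A minimal sketch (the record contents are invented; the real MIT specifies subdivisions within each module):

```python
# Sketch of checking a metadata record against the six MIT modules
# named in the abstract. Record contents are invented examples.

MIT_MODULES = [
    "general information",
    "material information",
    "biological model information",
    "exposure information",
    "endpoint read out information",
    "analysis and statistics",
]

def missing_modules(record: dict) -> list:
    """Return the MIT modules absent from a submitted record."""
    return [m for m in MIT_MODULES if m not in record]

record = {
    "general information": {"title": "ENM toxicity study"},
    "material information": {"material": "TiO2 nanoparticles"},
}
print(missing_modules(record))
# ['biological model information', 'exposure information',
#  'endpoint read out information', 'analysis and statistics']
```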
Topics: Databases, Factual; Metadata; Reproducibility of Results; Research Design
PubMed: 34983569
DOI: 10.1186/s12989-021-00442-x

Scientific Data, Nov 2022
Recent advances in high-throughput experiments and systems biology approaches have resulted in hundreds of publications identifying "immune signatures". Unfortunately, these are often described within text, figures, or tables in a format not amenable to computational processing, thus severely hampering our ability to fully exploit this information. Here we present a data model to represent immune signatures, along with the Human Immunology Project Consortium (HIPC) Dashboard (www.hipc-dashboard.org), a web-enabled application to facilitate signature access and querying. The data model captures the biological response components (e.g., genes, proteins, cell types or metabolites) and metadata describing the context under which the signature was identified, using standardized terms from established resources (e.g., HGNC, Protein Ontology, Cell Ontology). For the current release, we have manually curated a collection of >600 immune signatures from >60 published studies profiling human vaccination responses. The system will aid in building a broader understanding of the human immune response to stimuli by enabling researchers to easily access and interrogate published immune signatures.
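A data model like the one described pairs response components with context metadata. The sketch below is an assumption modelled on the abstract's description, not the actual HIPC schema; the gene list and study context are invented examples.

```python
# Sketch of an immune-signature data model: response components plus
# standardized context metadata. Field names are assumptions based on
# the abstract, not the actual HIPC Dashboard schema.
from dataclasses import dataclass

@dataclass
class ImmuneSignature:
    response_components: list  # e.g. HGNC gene symbols
    component_type: str        # "gene", "protein", "cell type", ...
    tissue: str                # context metadata, ontology term
    exposure: str              # e.g. vaccine or stimulus
    source: str                # provenance, e.g. a publication ID

sig = ImmuneSignature(
    response_components=["IFI27", "IFI44L", "RSAD2"],  # invented example
    component_type="gene",
    tissue="whole blood",
    exposure="influenza vaccination",
    source="placeholder",
)
print(len(sig.response_components))  # 3
```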
Topics: Humans; Metadata; Software; Systems Biology; Vaccination
PubMed: 36347894
DOI: 10.1038/s41597-022-01558-1

Bioinformatics (Oxford, England), Mar 2023
MOTIVATION
The Gene Expression Omnibus has become an important source of biological data for secondary analysis. However, there is no simple, programmatic way to download data and metadata from Gene Expression Omnibus (GEO) in a standardized annotation format.
RESULTS
To address this, we present GEOfetch, a command-line tool that downloads and organizes data and metadata from GEO and SRA. GEOfetch formats the downloaded metadata as a Portable Encapsulated Project, providing a universal format for the reanalysis of public data.
AVAILABILITY AND IMPLEMENTATION
GEOfetch is available on Bioconda and the Python Package Index (PyPI).
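A Portable Encapsulated Project (PEP), the output format mentioned above, is essentially a YAML config pointing at a sample table. The sketch below illustrates that layout with invented values; consult the PEP specification and the GEOfetch documentation for the real schema and output.

```python
# Sketch of the PEP layout a tool like GEOfetch emits: a YAML config
# pointing at a sample table CSV. Values are invented illustrations.
import csv
import io

config_yaml = """\
pep_version: 2.0.0
sample_table: sample_table.csv
"""

sample_table_csv = """\
sample_name,protocol,data_source
sample1,RNA-seq,SRA
sample2,RNA-seq,SRA
"""

# Any downstream tool can read the sample table with plain CSV parsing
samples = list(csv.DictReader(io.StringIO(sample_table_csv)))
print([s["sample_name"] for s in samples])  # ['sample1', 'sample2']
```

Because the format is this simple, reanalysis pipelines can consume GEO-derived projects without GEO-specific parsing code.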
Topics: Metadata; Gene Expression; Computational Biology
PubMed: 36857584
DOI: 10.1093/bioinformatics/btad069

eLife, Oct 2022
The neurophysiology of cells and tissues is monitored electrophysiologically and optically in diverse experiments and species, ranging from flies to humans. Understanding the brain requires integration of data across this diversity, and thus these data must be findable, accessible, interoperable, and reusable (FAIR). This requires a standard language for data and metadata that can coevolve with neuroscience. We describe design and implementation principles for a language for neurophysiology data. Our open-source software (Neurodata Without Borders, NWB) defines and modularizes the interdependent, yet separable, components of a data language. We demonstrate NWB's impact through the unified description of neurophysiology data across diverse modalities and species. NWB exists in an ecosystem that includes data management, analysis, visualization, and archive tools. Thus, the NWB data language enables reproduction, interchange, and reuse of diverse neurophysiology data. More broadly, the design principles of NWB are generally applicable to enhance discovery across biology through data FAIRness.
Topics: Data Science; Ecosystem; Humans; Metadata; Neurophysiology; Software
PubMed: 36193886
DOI: 10.7554/eLife.78362

Behavior Research Methods, Apr 2021
A consensus on the importance of open data and reproducible code is emerging. How should data and code be shared to maximize the key desiderata of reproducibility, permanence, and accessibility? Research assets should be stored persistently in formats that are not software-restrictive, and documented so that others can reproduce and extend the required computations. The sharing method should be easy to adopt by already busy researchers. We suggest the R package standard as a solution for creating, curating, and communicating research assets. The R package standard, with extensions discussed herein, provides a format for assets and metadata that satisfies the above desiderata and facilitates reproducibility, open access, and sharing of materials through online platforms like GitHub and the Open Science Framework. We discuss a stack of R resources that help users create reproducible collections of research assets, from experiments to manuscripts, in the RStudio interface. We created an R package, vertical, to help researchers incorporate these tools into their workflows, and discuss its functionality at length in an online supplement. Together, these tools may increase the reproducibility and openness of psychological science.
Topics: Humans; Metadata; Reproducibility of Results; Software; Workflow
PubMed: 32875401
DOI: 10.3758/s13428-020-01436-x

BMC Bioinformatics, Sep 2022
BACKGROUND
Population variant analysis is of great importance for gathering insights into the links between human genotype and phenotype. The 1000 Genomes Project established a valuable reference for human genetic variation; however, the integrative use of the corresponding data with other datasets within existing repositories and pipelines is not fully supported. Particularly, there is a pressing need for flexible and fast selection of population partitions based on their variant and metadata-related characteristics.
RESULTS
Here, we target general germline or somatic mutation data sources for their seamless inclusion within an interoperable-format repository, supporting integration among them and with other genomic data, as well as their integrated use within bioinformatic workflows. In addition, we provide VarSum, a data summarization service working on sub-populations of interest selected using filters on population metadata and/or variant characteristics. The service is developed as an optimized computational framework with an Application Programming Interface (API) that can be called from within any existing computing pipeline or programming script. Provided example use cases of biological interest show the relevance, power and ease of use of the API functionalities.
CONCLUSIONS
The proposed data integration pipeline and the dataset extraction and summarization API pave the way for solid computational infrastructures that quickly process cumbersome variation data and allow biologists and bioinformaticians to easily perform scalable analyses on user-defined partitions of large cohorts from increasingly available genetic variation studies. With the current trend toward large nation-wide and cross-national sequencing and variation initiatives, we expect an ever-growing need for the kind of computational support proposed here.
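The population-filter idea at the core of VarSum can be sketched as composing a request that combines metadata filters with a genomic region of interest. The endpoint shape, field names, and values below are hypothetical; see the VarSum API documentation for the real interface.

```python
# Sketch of composing a population-filter request for a summarization
# service like VarSum. Field names and values are hypothetical, not
# the actual VarSum API schema.
import json

def build_request(metadata_filters: dict, region: dict) -> str:
    """Serialize filters on population metadata plus a genomic region."""
    return json.dumps({"meta": metadata_filters, "region": region},
                      sort_keys=True)

payload = build_request(
    {"population": "EUR", "gender": "female"},
    {"chrom": "1", "start": 1000000, "stop": 2000000},
)
print(json.loads(payload)["meta"]["population"])  # EUR
```

In a real pipeline, a payload like this would be POSTed to the service from within any script, which is what makes the API callable from existing workflows.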
Topics: Computational Biology; Genomics; Genotype; Humans; Metadata; Software
PubMed: 36175857
DOI: 10.1186/s12859-022-04927-0