Proceedings. IEEE Computer Society..., Jun 2021
Batch Normalization (BN) and its variants have delivered tremendous success in combating the covariate shift induced by the training step of deep learning methods. While these techniques normalize feature distributions by standardizing with batch statistics, they do not correct the influence on features from extraneous variables or multiple distributions. Such extra variables, referred to as metadata here, may create bias or confounding effects (e.g., race when classifying gender from face images). We introduce the Metadata Normalization (MDN) layer, a new batch-level operation which can be used end-to-end within the training framework, to correct the influence of metadata on feature distributions. MDN adopts a regression analysis technique traditionally used for preprocessing to remove (regress out) the metadata effects on model features during training. We utilize a metric based on distance correlation to quantify the distribution bias from the metadata and demonstrate that our method successfully removes metadata effects on four diverse settings: one synthetic, one 2D image, one video, and one 3D medical image dataset.
PubMed: 34776724
DOI: 10.1109/cvpr46437.2021.01077
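The MDN layer described above adopts the classic "regress out" correction from regression analysis: fit the features against the metadata and keep only the residuals. A minimal sketch of that preprocessing step (illustrative only; the function and variable names are my own, and the published layer applies this at the batch level inside training rather than as one-off preprocessing):

```python
import numpy as np

def regress_out(features, metadata):
    """Remove the linear effect of the metadata columns from each feature.

    Fits features ~ metadata by ordinary least squares and returns the
    residuals, i.e. the part of the features not explained by the metadata.
    """
    # Include an intercept column so each feature is also mean-centered.
    M = np.column_stack([np.ones(len(metadata)), metadata])
    beta, *_ = np.linalg.lstsq(M, features, rcond=None)
    return features - M @ beta

rng = np.random.default_rng(0)
meta = rng.normal(size=(200, 2))                    # e.g. age, scanner site
signal = rng.normal(size=(200, 5))                  # metadata-free signal
features = signal + meta @ rng.normal(size=(2, 5))  # confounded features
clean = regress_out(features, meta)
```

Because OLS residuals are orthogonal to the regressors, the corrected features are (numerically) uncorrelated with every metadata column.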
Scientific Data, Sep 2022
Community-developed minimum information checklists are designed to drive the rich and consistent reporting of metadata, underpinning the reproducibility and reuse of the data. These reporting guidelines, however, are usually in the form of narratives intended for human consumption. Modular and reusable machine-readable versions are also needed: firstly, to provide the necessary quantitative and verifiable measures of the degree to which the metadata descriptors meet these community requirements, a requirement of the FAIR Principles; secondly, to encourage the creation of standards-driven templates for metadata authoring, especially when describing complex experiments that require multiple reporting guidelines to be used in combination or extended. We present new functionalities to support the creation and improvement of machine-readable models. We apply the approach to an exemplar set of reporting guidelines in Life Science and discuss the challenges. Our work, targeted at developers of standards and those familiar with standards, promotes the concept of compositional metadata elements and encourages the creation of community standards which are modular and interoperable from the outset.
Topics: Biological Science Disciplines; Humans; Metadata; Reproducibility of Results
PubMed: 36180441
DOI: 10.1038/s41597-022-01707-6
Trends in Cancer, Apr 2021 (Review)
Genomic data sharing accelerates research. Data are most valuable when they are accompanied by detailed metadata. To date, metadata are often human-annotated descriptions of samples and their handling. We discuss how machine learning-derived elements complement such descriptions to enhance the research ecosystem around genomic data.
Topics: Genomics; Humans; Machine Learning; Metadata; Neoplasms
PubMed: 33229213
DOI: 10.1016/j.trecan.2020.10.011
Journal of Medical Internet Research, Jan 2022 (Review)
BACKGROUND
Metadata are created to describe the corresponding data in a detailed and unambiguous way and are used for various applications in different research areas, for example, data identification and classification. However, a clear definition of metadata is crucial for further use. Unfortunately, extensive experience with the processing and management of metadata has shown that the term "metadata" and its use are not always unambiguous.
OBJECTIVE
This study aimed to understand the definition of metadata and the challenges resulting from metadata reuse.
METHODS
A systematic literature search was performed in this study following the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines for reporting on systematic reviews. Five research questions were identified to streamline the review process, addressing metadata characteristics, metadata standards, use cases, and problems encountered. This review was preceded by a harmonization process to achieve a general understanding of the terms used.
RESULTS
The harmonization process resulted in a clear set of definitions for metadata processing focusing on data integration. The following literature review was conducted by 10 reviewers with different backgrounds and using the harmonized definitions. This study included 81 peer-reviewed papers from the last decade after applying various filtering steps to identify the most relevant papers. The 5 research questions could be answered, resulting in a broad overview of the standards, use cases, problems, and corresponding solutions for the application of metadata in different research areas.
CONCLUSIONS
Metadata can be a powerful tool for identifying, describing, and processing information, but its meaningful creation is costly and challenging. This review process uncovered many standards, use cases, problems, and solutions for dealing with metadata. The presented harmonized definitions and the new schema have the potential to improve the classification and generation of metadata by creating a shared understanding of metadata and its context.
Topics: Humans; Metadata; Publications; Reference Standards
PubMed: 35014967
DOI: 10.2196/25440
Trends in Parasitology, Dec 2021 (Review)
Genomic epidemiology, which links pathogen genomes with associated metadata to understand disease transmission, has become a key component of outbreak response. Decreasing costs of genome sequencing and increasing computational power provide opportunities to generate and analyse large viral genomic datasets that aim to uncover the spatial scales of transmission and the demographics contributing to transmission patterns, and to forecast epidemic trends. Emerging sources of genomic data and associated metadata provide new opportunities to further unravel transmission patterns. Key challenges include how to integrate genomic data with metadata from multiple sources, how to develop efficient computational algorithms that cope with large datasets, and how to establish sampling frameworks that enable robust conclusions.
Topics: Disease Outbreaks; Genome, Viral; Genomics
PubMed: 34620561
DOI: 10.1016/j.pt.2021.08.007
Genome Biology, Jul 2021
Topics: Cell Lineage; Computational Biology; Datasets as Topic; High-Throughput Nucleotide Sequencing; Humans; Metadata; Reproducibility of Results; Sequence Analysis, RNA; Single-Cell Analysis; Transcriptome
PubMed: 34311752
DOI: 10.1186/s13059-021-02422-y
Sensors (Basel, Switzerland), Apr 2022 (Review)
The work proposes a novel approach for automatically identifying all instruments present in an audio excerpt using a set of individual convolutional neural networks (CNNs), one per tested instrument. The paper opens with background material, i.e., a metadata description and a review of related work on musical instrument identification, covering the tasks performed, input types, algorithms employed, and metrics used. This is followed by a description of the dataset prepared for the experiment and its division into training, validation, and evaluation subsets. The architecture of the neural network model is then presented. Based on the described model, training is performed, and several quality metrics are determined for the training and validation sets. Results of evaluating the trained network on a separate set are reported, with detailed values for precision, recall, and the numbers of true and false positive and negative detections. The model efficiency is high, with metric values ranging from 0.86 for the guitar to 0.99 for drums. A discussion and summary of the results conclude the paper.
Topics: Algorithms; Benchmarking; Deep Learning; Metadata; Neural Networks, Computer
PubMed: 35459018
DOI: 10.3390/s22083033
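The per-instrument precision and recall figures reported in the abstract above follow the standard definitions computed from true/false positive and negative detection counts. A minimal sketch with hypothetical counts (not taken from the paper):

```python
def precision_recall(tp, fp, fn):
    """Per-class detection metrics from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical counts for one instrument detector.
p, r = precision_recall(tp=95, fp=5, fn=10)
# p = 95/100 = 0.95, r = 95/105 ≈ 0.905
```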
Patterns (New York, N.Y.), Apr 2020 (Review)
Entropy is the natural tendency for decline toward disorder over time. Information entropy is the decline in data, information, and understanding that occurs after data are used and results are published. As time passes, the information slowly fades into obscurity. Data discovery alone is not enough to slow this process. High-quality metadata that support understanding and reuse across domains are a critical antidote to information entropy, particularly as they support reuse of the data, adding to community knowledge and wisdom. Ensuring the creation and preservation of these metadata is a responsibility shared across the entire data life cycle, from creation through analysis and publication to archiving and reuse. Repositories can play an important role in this process by augmenting metadata over time with persistent identifiers and the connections they facilitate. Data providers need to work with repositories to encourage metadata evolution as new capabilities and connections emerge.
PubMed: 33205081
DOI: 10.1016/j.patter.2020.100004
Journal of Integrative Bioinformatics, Oct 2021
A standardized approach to annotating computational biomedical models and their associated files can facilitate model reuse and reproducibility among research groups, enhance search and retrieval of models and data, and enable semantic comparisons between models. Motivated by these potential benefits and guided by consensus across the COmputational Modeling in BIology NEtwork (COMBINE) community, we have developed a specification for encoding annotations in Open Modeling and EXchange (OMEX)-formatted archives. This document details version 1.2 of the specification, which builds on version 1.0, published last year in this journal. In particular, this version includes a set of initial model-level annotations (whereas version 1.0 described annotations exclusively at a smaller scale). Additionally, this version adopts best practices for namespaces and introduces omex-library.org as a common root for all annotations. Distributing modeling projects within an OMEX archive is a best practice established by COMBINE, and the OMEX metadata specification presented here provides a harmonized, community-driven approach for annotating a variety of standardized model representations. This specification acts as a technical guideline for developing software tools that can support this standard, and thereby encourages broad advances in model reuse, discovery, and semantic analyses.
Topics: Computational Biology; Metadata; Reproducibility of Results; Semantics; Software
PubMed: 34668356
DOI: 10.1515/jib-2021-0020
GigaScience, Sep 2021
BACKGROUND
Recent technological advances have resulted in an unprecedented increase in publicly available biomedical data, yet the reuse of the data is often precluded by experimental bias and a lack of annotation depth and consistency. Missing annotations make it impossible for researchers to find datasets specific to their needs.
FINDINGS
Here, we investigate RNA-sequencing metadata prediction based on gene expression values. We present a deep-learning-based domain adaptation algorithm for the automatic annotation of RNA-sequencing metadata. We show, in multiple experiments, that our model is better at integrating heterogeneous training data compared with existing linear regression-based approaches, resulting in improved tissue type classification. By using a model architecture similar to Siamese networks, the algorithm can learn biases from datasets with few samples.
CONCLUSION
Using our novel domain adaptation approach, we achieved metadata annotation accuracies up to 15.7% better than a previously published method. Using the best model, we provide a list of >10,000 novel tissue and sex label annotations for 8,495 unique SRA samples. Our approach has the potential to revive idle datasets by automated annotation, making them more searchable.
Topics: Algorithms; Bias; Metadata; Molecular Sequence Annotation; RNA; Sequence Analysis, RNA
PubMed: 34553213
DOI: 10.1093/gigascience/giab064
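The "model architecture similar to Siamese networks" mentioned in the findings above rests on one structural idea: both inputs of a pair are mapped through the same shared-weight embedding, so the comparison is performed in a common learned space. A minimal sketch of that shared-weight structure (illustrative only; the layer sizes and names here are assumptions, not the published model):

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(20, 8))  # one weight matrix shared by both branches

def embed(x):
    """Shared embedding applied to each input of the pair (the Siamese idea)."""
    return np.tanh(x @ W)

def pair_distance(x1, x2):
    """Distance between the shared embeddings of two input profiles."""
    return float(np.linalg.norm(embed(x1) - embed(x2)))

a = rng.normal(size=20)  # e.g. one expression profile
b = rng.normal(size=20)  # another profile
# Identical inputs necessarily map to the same embedding (zero distance);
# training would adjust W so that distance reflects label similarity.
```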