Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2021
Batch Normalization (BN) and its variants have delivered tremendous success in combating the covariate shift induced by the training step of deep learning methods. While these techniques normalize feature distributions by standardizing with batch statistics, they do not correct the influence on features from extraneous variables or multiple distributions. Such extra variables, referred to as metadata here, may create bias or confounding effects (e.g., race when classifying gender from face images). We introduce the Metadata Normalization (MDN) layer, a new batch-level operation which can be used end-to-end within the training framework, to correct the influence of metadata on feature distributions. MDN adopts a regression analysis technique traditionally used for preprocessing to remove (regress out) the metadata effects on model features during training. We utilize a metric based on distance correlation to quantify the distribution bias from the metadata and demonstrate that our method successfully removes metadata effects on four diverse settings: one synthetic, one 2D image, one video, and one 3D medical image dataset.
PubMed: 34776724
DOI: 10.1109/cvpr46437.2021.01077
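To make the regress-out idea in the abstract above concrete, here is a minimal per-batch sketch in NumPy, not the authors' released MDN implementation: the metadata effect on features is estimated by ordinary least squares and subtracted, and distance correlation quantifies whatever dependence remains. All names and the toy data are illustrative.

    import numpy as np

    def regress_out(features, metadata):
        """Remove the component of a feature batch explained linearly by metadata.

        features: (N, D) array of model features for one batch
        metadata: (N, M) array of extraneous variables to correct for
        """
        X = np.column_stack([np.ones(len(metadata)), metadata])  # add intercept
        beta, *_ = np.linalg.lstsq(X, features, rcond=None)      # OLS fit
        return features - X @ beta + beta[0]  # remove metadata effect, keep mean

    def distance_correlation(x, y):
        """Sample distance correlation between x of shape (N, Dx) and y of (N, Dy)."""
        a = np.linalg.norm(x[:, None] - x[None, :], axis=-1)     # pairwise distances
        b = np.linalg.norm(y[:, None] - y[None, :], axis=-1)
        A = a - a.mean(0) - a.mean(1)[:, None] + a.mean()        # double centering
        B = b - b.mean(0) - b.mean(1)[:, None] + b.mean()
        dcov2 = (A * B).mean()                                   # squared distance covariance
        return np.sqrt(dcov2 / np.sqrt((A * A).mean() * (B * B).mean()))

    rng = np.random.default_rng(0)
    meta = rng.normal(size=(64, 2))                # e.g., age and a site indicator
    feats = meta @ rng.normal(size=(2, 16)) + 0.1 * rng.normal(size=(64, 16))
    resid = regress_out(feats, meta)
    print(distance_correlation(feats, meta), distance_correlation(resid, meta))

In the paper the operation is a network layer applied end-to-end during training, so the regression estimate is maintained across batches; the sketch above is the one-shot, single-batch version.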
GigaScience, May 2018
BACKGROUND
Here, we present the Scientific Filesystem (SCIF), an organizational format that supports exposure of executables and metadata for discoverability of scientific applications. The format includes a known filesystem structure, a definition for a set of environment variables describing it, and functions for generation of the variables and interaction with the libraries, metadata, and executables located within. SCIF makes it easy to expose metadata, multiple environments, installation steps, files, and entry points to render scientific applications consistent, modular, and discoverable. A SCIF can be installed on a traditional host or in a container technology such as Docker or Singularity. We start by reviewing the background and rationale for the SCIF, followed by an overview of the specification and the different levels of internal modules ("apps") that the organizational format affords. Finally, we demonstrate that SCIF is useful by implementing and discussing several use cases that improve user interaction and understanding of scientific applications. SCIF is released along with a client and integration in the Singularity 2.4 software to quickly install and interact with SCIF. When used inside of a reproducible container, a SCIF is a recipe for reproducibility and introspection of the functions and users that it serves.
RESULTS
We use SCIF to evaluate container software, provide metrics, serve scientific workflows, and execute a primary function under different contexts. To encourage collaboration and sharing of applications, we developed tools along with an open source, version-controlled, tested, and programmatically accessible web infrastructure. SCIF and associated resources are available at https://sci-f.github.io. The ease of using SCIF, especially in the context of containers, offers promise for scientists' work to be self-documenting and programmatically parseable for maximum reproducibility. SCIF provides an abstraction over underlying programming languages and packaging logic for working with scientific applications, opening up new opportunities for scientific software development.
Topics: Information Storage and Retrieval; Metadata; Programming Languages; Science; Software; Workflow
PubMed: 29718213
DOI: 10.1093/gigascience/giy023
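The discoverability claim is easy to make concrete: because SCIF fixes a filesystem layout, a client can enumerate apps by walking it. A minimal sketch follows; the directory layout (/scif/apps, /scif/data) and environment-variable names follow the published SCIF format, but this is an illustration, not the SCIF client itself.

    import os

    SCIF_BASE = "/scif"                  # known root of a Scientific Filesystem

    def discover_apps(base=SCIF_BASE):
        """Map each installed app to the environment variables SCIF defines for it."""
        apps_root = os.path.join(base, "apps")
        if not os.path.isdir(apps_root):
            return {}
        apps = {}
        for name in sorted(os.listdir(apps_root)):
            app_root = os.path.join(apps_root, name)
            if not os.path.isdir(app_root):
                continue
            apps[name] = {
                "SCIF_APPNAME": name,
                "SCIF_APPROOT": app_root,
                "SCIF_APPBIN": os.path.join(app_root, "bin"),
                "SCIF_APPDATA": os.path.join(base, "data", name),
            }
        return apps

    for name, env in discover_apps().items():
        print(name, "->", env["SCIF_APPROOT"])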
Trends in Cancer, Apr 2021
Review
Genomic data sharing accelerates research. Data are most valuable when they are accompanied by detailed metadata. To date, metadata are often human-annotated descriptions of samples and their handling. We discuss how machine learning-derived elements complement such descriptions to enhance the research ecosystem around genomic data.
Topics: Genomics; Humans; Machine Learning; Metadata; Neoplasms
PubMed: 33229213
DOI: 10.1016/j.trecan.2020.10.011
Scientific Data, Sep 2022
Community-developed minimum information checklists are designed to drive rich and consistent reporting of metadata, underpinning the reproducibility and reuse of the data. These reporting guidelines, however, usually take the form of narratives intended for human consumption. Modular and reusable machine-readable versions are also needed: first, to provide the quantitative and verifiable measures of the degree to which metadata descriptors meet community requirements that the FAIR Principles call for; and second, to encourage the creation of standards-driven templates for metadata authoring, especially when describing complex experiments that require multiple reporting guidelines to be used in combination or extended. We present new functionalities to support the creation and improvement of machine-readable models, apply the approach to an exemplar set of reporting guidelines in the life sciences, and discuss the challenges. Our work, targeted at developers of standards and those familiar with them, promotes the concept of compositional metadata elements and encourages the creation of community standards that are modular and interoperable from the outset.
Topics: Biological Science Disciplines; Humans; Metadata; Reproducibility of Results
PubMed: 36180441
DOI: 10.1038/s41597-022-01707-6
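The shift from narrative to machine-readable guidelines is simple to illustrate: once required descriptors are encoded as data, compliance becomes a measurable number rather than a judgment. A minimal sketch follows, with an invented checklist standing in for a real minimum information model; none of it reflects the authors' actual tooling.

    import re

    CHECKLIST = {  # hypothetical minimum information model, for illustration only
        "organism":        {"required": True,  "pattern": r"^[A-Z][a-z]+ [a-z]+$"},
        "sample_id":       {"required": True,  "pattern": r"^[A-Za-z0-9_-]+$"},
        "collection_date": {"required": True,  "pattern": r"^\d{4}-\d{2}-\d{2}$"},
        "geo_location":    {"required": False, "pattern": r".+"},
    }

    def compliance(record, checklist=CHECKLIST):
        """Return (score, problems) for a metadata record against the checklist."""
        required = [k for k, v in checklist.items() if v["required"]]
        problems = []
        for key in required:
            value = record.get(key)
            if value is None:
                problems.append("missing required field: " + key)
            elif not re.match(checklist[key]["pattern"], str(value)):
                problems.append("malformed value for " + key + ": " + repr(value))
        return 1 - len(problems) / len(required), problems

    print(compliance({"organism": "Homo sapiens", "sample_id": "S-001"}))

Combining guidelines then reduces to merging such dictionaries, which is the compositional, modular behavior the abstract argues for.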
Journal of Medical Internet Research, Jan 2022
Review
BACKGROUND
Metadata are created to describe the corresponding data in a detailed and unambiguous way and are used for various applications in different research areas, for example, data identification and classification. However, a clear definition of metadata is crucial for further use. Unfortunately, extensive experience with the processing and management of metadata has shown that the term "metadata" and its use are not always unambiguous.
OBJECTIVE
This study aimed to understand the definition of metadata and the challenges resulting from metadata reuse.
METHODS
A systematic literature search was performed in this study following the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines for reporting on systematic reviews. Five research questions were identified to streamline the review process, addressing metadata characteristics, metadata standards, use cases, and problems encountered. This review was preceded by a harmonization process to achieve a general understanding of the terms used.
RESULTS
The harmonization process resulted in a clear set of definitions for metadata processing focusing on data integration. The following literature review was conducted by 10 reviewers with different backgrounds and using the harmonized definitions. This study included 81 peer-reviewed papers from the last decade after applying various filtering steps to identify the most relevant papers. The 5 research questions could be answered, resulting in a broad overview of the standards, use cases, problems, and corresponding solutions for the application of metadata in different research areas.
CONCLUSIONS
Metadata can be a powerful tool for identifying, describing, and processing information, but its meaningful creation is costly and challenging. This review process uncovered many standards, use cases, problems, and solutions for dealing with metadata. The presented harmonized definitions and the new schema have the potential to improve the classification and generation of metadata by creating a shared understanding of metadata and its context.
Topics: Humans; Metadata; Publications; Reference Standards
PubMed: 35014967
DOI: 10.2196/25440
GigaScience, Dec 2022
Scientists employing omics in life science studies face challenges such as the modeling of multiassay studies, recording of all relevant parameters, and managing many samples with their metadata. They must manage many large files that are the results of the assays or subsequent computation. Users with diverse backgrounds, ranging from computational scientists to wet-lab scientists, have dissimilar needs when it comes to data access, with programmatic interfaces being favored by the former and graphical ones by the latter. We introduce SODAR, the system for omics data access and retrieval. SODAR is a software package that addresses these challenges by providing a web-based graphical user interface for managing multiassay studies and describing them using the ISA (Investigation, Study, Assay) data model and the ISA-Tab file format. Data storage is delegated to the iRODS data management system, which copes well with large numbers of files and substantial volumes of data. SODAR also offers programmatic APIs and command-line access for metadata and file storage. SODAR supports complex omics integration studies and can be easily installed. The software is written in Python 3 and freely available at https://github.com/bihealth/sodar-server under the MIT license.
Topics: Multiomics; Metadata; Software; Information Storage and Retrieval; Data Management
PubMed: 37498129
DOI: 10.1093/gigascience/giad052
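Because SODAR describes studies in ISA-Tab, basic programmatic access to the metadata needs nothing beyond a TSV reader. A minimal sketch, assuming a study sample table s_study.txt exported from SODAR; the file name is hypothetical, while the column headers shown are standard ISA-Tab.

    import pandas as pd

    # Read the study-level sample table; ISA-Tab files are tab-separated text.
    study = pd.read_csv("s_study.txt", sep="\t")

    # Sample annotations live in "Characteristics[...]" columns in ISA-Tab.
    characteristics = [c for c in study.columns
                       if c.startswith("Characteristics[")]
    print(study[["Source Name", "Sample Name"] + characteristics].head())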
Journal of Chemical Information and Modeling, Apr 2022
The layout of portable document format (PDF) files is fixed regardless of screen, and the metadata within them are latent compared with markup languages such as HTML and XML. PDF files usually provide no semantic tags and are not designed to be edited or to have their data interpreted by software. However, data held in PDF files need to be extracted in order to comply with open-source data requirements that are now government-regulated. In the chemical domain, related chemical and property data also need to be found, and their correlations need to be exploited to enable data science in areas such as data-driven materials discovery. Such relationships may be recovered using text-mining software such as the "chemistry-aware" natural-language-processing tool ChemDataExtractor; however, that tool has limited data-extraction capabilities from PDF files. This study presents the PDFDataExtractor tool, which can act as a plug-in to ChemDataExtractor. It outperforms other PDF-extraction tools for the chemical literature by coupling its functionality to the chemical named-entity-recognition capabilities of ChemDataExtractor, much improving the intrinsic PDF-reading abilities of ChemDataExtractor. The system features a template-based architecture, which enables semantic information to be extracted from the PDF files of scientific articles in order to reconstruct their logical structure. While other PDF-extraction tools focus on quantity mining, this template-based system focuses on quality mining across different layouts. PDFDataExtractor outputs information in JSON and plain text, including the metadata of a PDF file: paper title, authors, affiliation, email, abstract, keywords, journal, year, digital object identifier (DOI), references, and issue number. On a self-created evaluation article set, PDFDataExtractor achieved promising precision for all key assessed metadata areas of the document text.
Topics: Data Mining; Metadata; Natural Language Processing; Reading; Software
PubMed: 35349259
DOI: 10.1021/acs.jcim.1c01198
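Since PDFDataExtractor emits metadata as JSON, downstream use can be as simple as the following sketch, which builds a DOI-keyed article index. The field names and file path are assumptions based on the metadata fields listed in the abstract, not the tool's exact output schema.

    import json

    with open("article_metadata.json") as fh:
        records = json.load(fh)                 # assume a list of per-article dicts

    index = {}
    for rec in records:
        doi = rec.get("doi")
        if doi:                                 # key each article by its DOI
            index[doi] = {field: rec.get(field)
                          for field in ("title", "authors", "journal", "year")}

    print(len(index), "articles indexed")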
Patterns (New York, N.Y.), Apr 2020
Review
Entropy is the natural tendency for decline toward disorder over time. Information entropy is the decline in data, information, and understanding that occurs after data are used and results are published. As time passes, the information slowly fades into obscurity. Data discovery is not enough to slow this process. High-quality metadata that support understanding and reuse across domains are a critical antidote to information entropy, particularly as they support reuse of the data, adding to community knowledge and wisdom. Ensuring the creation and preservation of these metadata is a responsibility shared across the entire data life cycle, from creation through analysis and publication to archiving and reuse. Repositories can play an important role in this process by augmenting metadata over time with persistent identifiers and the connections they facilitate. Data providers need to work with repositories to encourage metadata evolution as new capabilities and connections emerge.
PubMed: 33205081
DOI: 10.1016/j.patter.2020.100004
GigaScience, Dec 2022
BACKGROUND
The life sciences are one of the biggest suppliers of scientific data. Reusing and connecting these data can uncover hidden insights and lead to new concepts. Efficient reuse of these datasets is strongly promoted when they are interlinked with a sufficient amount of machine-actionable metadata. While the FAIR (Findable, Accessible, Interoperable, Reusable) guiding principles have been accepted by all stakeholders, in practice, there are only a limited number of easy-to-adopt implementations available that fulfill the needs of data producers.
FINDINGS
We developed the FAIR Data Station, a lightweight application written in Java, which aims to support researchers in managing research metadata according to the FAIR principles. It implements the ISA metadata framework and uses minimal information metadata standards to capture experiment metadata. The FAIR Data Station consists of 3 modules. Based on the minimal information model(s) selected by the user, the "form generation module" creates a metadata template Excel workbook with a header row of machine-actionable attribute names. The Excel workbook is subsequently used by the data producer(s) as a familiar environment for sample metadata registration. At any point during this process, the format of the recorded values can be checked using the "validation module." Finally, the "resource module" can be used to convert the set of metadata recorded in the Excel workbook into RDF, enabling (cross-project) (meta)data searches, and, for publishing of sequence data, into a European Nucleotide Archive-compatible XML metadata file.
CONCLUSIONS
Turning FAIR into reality requires the availability of easy-to-adopt data FAIRification workflows that are also of direct use for data producers. As such, the FAIR Data Station provides, in addition to the means to correctly FAIRify (omics) data, the means to build searchable metadata databases of similar projects and can assist in ENA metadata submission of sequence data. The FAIR Data Station is available at https://fairbydesign.nl.
Topics: Metadata; Biological Science Disciplines; Databases, Factual; Nucleotides; Publishing
PubMed: 36879493
DOI: 10.1093/gigascience/giad014
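A minimal sketch of the workflow the abstract describes (an Excel sheet with machine-actionable headers, validation of recorded values, conversion to RDF), using pandas and rdflib. The file name, column names, and namespace are illustrative assumptions, not the FAIR Data Station's actual schema.

    import re
    import pandas as pd
    from rdflib import Graph, Literal, Namespace, URIRef

    sheet = pd.read_excel("metadata_template.xlsx")       # one row per sample

    # Validation step: recorded values must match the expected format.
    bad = [v for v in sheet["collection_date"].astype(str)
           if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", v)]
    if bad:
        raise ValueError("invalid collection_date values: %r" % bad)

    # Conversion step: one triple per attribute, keyed by sample identifier.
    EX = Namespace("https://example.org/terms/")          # hypothetical namespace
    g = Graph()
    for _, row in sheet.iterrows():
        subject = URIRef("https://example.org/sample/" + str(row["sample_id"]))
        for column, value in row.items():
            g.add((subject, EX[column], Literal(value)))
    print(g.serialize(format="turtle"))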
Journal of Digital Imaging, Aug 2018
Review
Imaging is increasingly being used in dermatology for documentation, diagnosis, and management of cutaneous disease. The lack of standards for dermatologic imaging is an impediment to clinical uptake. Standardization can occur in image acquisition, terminology, interoperability, and metadata. This paper presents the International Skin Imaging Collaboration position on standardization of metadata for dermatologic imaging. Metadata are essential to ensure that dermatologic images are properly managed and interpreted. There are two standards-based approaches to recording and storing metadata in dermatologic imaging. The first uses standard consumer image file formats; the second uses the file format and metadata model developed for the Digital Imaging and Communications in Medicine (DICOM) standard. DICOM appears to provide an advantage over consumer image file formats for metadata, as it includes all the patient, study, and technical metadata necessary to use images clinically. Consumer image file formats, in contrast, include only technical metadata and must be used in conjunction with another actor, for example an electronic medical record, to supply the patient and study metadata. The use of DICOM may have ancillary benefits in dermatologic imaging, including leveraging DICOM network and workflow services, interoperability of images and metadata, leveraging existing enterprise imaging infrastructure, greater patient safety, and better compliance with legislative requirements for image retention.
Topics: Dermatology; Dermoscopy; Diagnostic Imaging; Humans; Internationality; Metadata; Radiology Information Systems; Reproducibility of Results; Skin Diseases; United States
PubMed: 29344752
DOI: 10.1007/s10278-017-0045-8
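The trade-off the abstract describes is easy to see in code: with pydicom, patient, study, and technical metadata all live inside the DICOM dataset itself, whereas a consumer format such as JPEG carries only the technical part and must rely on an external system, such as an electronic medical record, for the rest. A minimal sketch with invented values:

    from pydicom.dataset import Dataset
    from pydicom.uid import generate_uid

    ds = Dataset()
    ds.PatientName = "Doe^Jane"                # patient metadata, inside the file
    ds.PatientID = "DERM-0001"
    ds.StudyInstanceUID = generate_uid()       # study metadata
    ds.StudyDescription = "Dermoscopy, left forearm"
    ds.Modality = "XC"                         # external-camera photography
    ds.BodyPartExamined = "ARM"
    ds.Rows, ds.Columns = 1024, 1024           # technical metadata
    print(ds)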