Database: The Journal of Biological Databases and Curation, Jan 2019
Many valuable resources developed by world-wide research institutions and consortia describe genomic datasets that are both open and available for secondary research, but their metadata search interfaces are heterogeneous, not interoperable, and sometimes very limited in their capabilities. We implemented GenoSurf, a multi-ontology semantic search system providing access to a consolidated collection of metadata attributes found in the most relevant genomic datasets; the values of 10 attributes are semantically enriched using the most suitable available ontologies. The user of GenoSurf provides the search terms as input, sets the desired level of ontological enrichment and obtains as output the identity of matching data files at the various sources. Search is facilitated by drop-down lists of matching values; aggregate counts describing the resulting files are updated in real time as search terms are progressively added. In addition to the consolidated attributes, users can perform keyword-based searches on the original (raw) metadata, which are also imported; GenoSurf supports the interplay of attribute-based and keyword-based search through well-defined interfaces. Currently, GenoSurf integrates about 40 million metadata entries from several major data sources, including three providers of clinical and experimental data (TCGA, ENCODE and Roadmap Epigenomics) and two sources of annotation data (GENCODE and RefSeq); it can be used as a standalone resource for targeting the genomic datasets at their original sources (identified by their accession IDs and URLs), or as part of an integrated query answering system for performing complex queries over genomic regions and metadata.
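The interplay of attribute-based and keyword-based search described above can be illustrated with a minimal sketch. This is not the GenoSurf API; the field names ("assay", "tissue") and records are invented for illustration only.

```python
# Minimal sketch (not the GenoSurf API): combining attribute-based search on
# consolidated metadata with keyword-based search on the original raw metadata.
def search(records, attributes=None, keywords=None):
    """Return records matching all attribute constraints and all keywords."""
    attributes = attributes or {}
    keywords = keywords or []
    hits = []
    for rec in records:
        if any(rec.get(k) != v for k, v in attributes.items()):
            continue  # attribute-based filter on consolidated metadata
        raw = " ".join(str(v) for v in rec.get("raw", {}).values()).lower()
        if any(kw.lower() not in raw for kw in keywords):
            continue  # keyword-based filter on the original (raw) metadata
        hits.append(rec)
    return hits

records = [
    {"assay": "ChIP-seq", "tissue": "liver",
     "raw": {"lab": "ENCODE consortium", "target": "CTCF"}},
    {"assay": "RNA-seq", "tissue": "liver",
     "raw": {"lab": "Roadmap Epigenomics"}},
]

matches = search(records, attributes={"tissue": "liver"}, keywords=["ctcf"])
```

In a real system the attribute filter would run against consolidated, ontology-enriched columns while the keyword filter would run against a full-text index of the imported raw metadata; the sketch only mirrors that division of labour.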
Topics: Databases, Genetic; Female; Humans; Knowledge Bases; Metadata; Semantics; Software; User-Computer Interface
PubMed: 31820804
DOI: 10.1093/database/baz132

Journal of Proteome Research, Oct 2020
Metadata is essential in proteomics data repositories and is crucial to interpret and reanalyze the deposited data sets. For every proteomics data set, we should capture at least three levels of metadata: (i) the data set description, (ii) the sample-to-data-file relationships, and (iii) standard data file formats (e.g., mzIdentML, mzML, or mzTab). While the data set description and standard data file formats are supported by all ProteomeXchange partners, the information linking samples to data files is mostly missing. Recently, members of the European Bioinformatics Community for Mass Spectrometry (EuBIC) have created an open-source project called Sample to Data file format for Proteomics (https://github.com/bigbio/proteomics-metadata-standard/) to enable the standardization of sample metadata of public proteomics data sets. Here, the project is presented to the proteomics community, and we call for contributors, including researchers, journals, and consortia, to provide feedback about the format. We believe this work will improve reproducibility and facilitate the development of new tools dedicated to proteomics data analysis.
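A sample-to-data-file mapping of the kind this project standardizes is essentially a tab-separated table linking each sample to its characteristics and its raw files. The sketch below is only in the spirit of such SDRF-style tables; the column names are illustrative examples, not the normative specification.

```python
import csv
import io

# Illustrative sample-to-data-file table; column names are examples only,
# not the project's normative specification.
rows = [
    {"source name": "sample 1",
     "characteristics[organism]": "Homo sapiens",
     "comment[data file]": "run01.raw"},
    {"source name": "sample 2",
     "characteristics[organism]": "Homo sapiens",
     "comment[data file]": "run02.raw"},
]

# Serialize as a tab-separated table, the common carrier for such metadata.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=list(rows[0]), delimiter="\t")
writer.writeheader()
writer.writerows(rows)

# Reading the table back recovers the sample-to-file mapping.
parsed = list(csv.DictReader(io.StringIO(buf.getvalue()), delimiter="\t"))
```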
Topics: Mass Spectrometry; Metadata; Proteomics; Reproducibility of Results; Software
PubMed: 32786688
DOI: 10.1021/acs.jproteome.0c00376

F1000Research, 2021
Many types of data from genomic analyses can be represented as genomic tracks, features linked to the genomic coordinates of a reference genome. Examples of such data are epigenetic DNA methylation data, ChIP-seq peaks, germline or somatic DNA variants, as well as RNA-seq expression levels. Researchers often face difficulties in locating, accessing and combining relevant tracks from external sources, as well as locating the raw data, reducing the value of the generated information. We propose to advance the application of FAIR data principles (Findable, Accessible, Interoperable, and Reusable) to produce searchable metadata for genomic tracks. Findability and Accessibility of metadata can then be ensured by a track search service that integrates globally identifiable metadata from various track hubs in the Track Hub Registry and other relevant repositories. Interoperability and Reusability need to be ensured by the specification and implementation of a basic set of recommendations for metadata. We have tested this concept by developing such a specification in a JSON Schema, called FAIRtracks, and have integrated it into a novel track search service, called TrackFind. We demonstrate practical usage by importing datasets through TrackFind into existing examples of relevant analytical tools for genomic tracks: EPICO and the GSuite HyperBrowser. We here provide a first iteration of a draft standard for genomic track metadata, as well as the accompanying software ecosystem. It can easily be adapted or extended to future needs of the research community regarding data, methods and tools, balancing the requirements of both data submitters and analytical end-users.
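A JSON Schema such as FAIRtracks enforces, among other things, that required metadata fields are present and correctly typed. The hand-rolled sketch below illustrates that style of validation with stdlib Python only; the required fields are invented for illustration and are not the actual FAIRtracks schema.

```python
# Hand-rolled sketch of JSON-Schema-style validation for track metadata;
# the required fields below are illustrative, not the FAIRtracks schema.
REQUIRED = {
    "identifier": str,
    "genome_assembly": str,
    "file_url": str,
}

def validate(doc):
    """Return a list of validation errors (an empty list means valid)."""
    errors = []
    for field, ftype in REQUIRED.items():
        if field not in doc:
            errors.append(f"missing required field: {field}")
        elif not isinstance(doc[field], ftype):
            errors.append(f"field {field} must be {ftype.__name__}")
    return errors

good = {"identifier": "trk-001", "genome_assembly": "GRCh38",
        "file_url": "https://example.org/peaks.bed"}
bad = {"identifier": "trk-002"}
```

In practice one would validate against the published schema with a JSON Schema validator rather than hand-written checks; the point here is only the shape of the contract between data submitters and consuming services such as TrackFind.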
Topics: Ecosystem; Genome; Genomics; Metadata; Software
PubMed: 34249331
DOI: 10.12688/f1000research.28449.1

Bioinformatics (Oxford, England), Oct 2022
MOTIVATION
Computational systems biology analyses typically make use of multiple software packages and their dependencies, which are often run across heterogeneous compute environments. This can introduce differences in performance and reproducibility. Capturing metadata (e.g. package versions, GPU model) currently requires repetitious code and is difficult to store centrally for analysis. Even where virtual environments and containers are used, updates over time mean that versioning metadata should still be captured within analysis pipelines to guarantee reproducibility.
RESULTS
Microbench is a simple and extensible Python package to automate metadata capture to a file or Redis database. Captured metadata can include execution time, software package versions, environment variables, hardware information, Python version and more, with plugins. We present three case studies demonstrating Microbench usage to benchmark code execution and examine environment metadata for reproducibility purposes.
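The kind of per-run metadata capture Microbench automates can be sketched with a stdlib decorator. This is an illustration of the idea only, not the Microbench API; see the package documentation for its actual interface and plugins.

```python
import functools
import io
import json
import platform
import sys
import time

# Stdlib sketch of decorator-based run-metadata capture, illustrating the
# idea behind Microbench (not its actual API): each call appends one JSON
# line with timing and environment details to an output stream.
def capture_metadata(outfile):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.time()
            result = func(*args, **kwargs)
            record = {
                "function": func.__name__,
                "runtime_s": time.time() - start,
                "python_version": sys.version.split()[0],
                "platform": platform.platform(),
            }
            outfile.write(json.dumps(record) + "\n")  # one JSON line per run
            return result
        return wrapper
    return decorator

log = io.StringIO()

@capture_metadata(log)
def add(a, b):
    return a + b

result = add(2, 3)
```

Writing one JSON line per invocation makes the captured metadata trivial to aggregate centrally, e.g. by loading the lines into a dataframe or pushing them to a Redis list.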
AVAILABILITY AND IMPLEMENTATION
Install from the Python Package Index using pip install microbench. Source code is available from https://github.com/alubbock/microbench.
SUPPLEMENTARY INFORMATION
Supplementary data are available at Bioinformatics online.
Topics: Benchmarking; Metadata; Reproducibility of Results; Software; Systems Biology
PubMed: 36000837
DOI: 10.1093/bioinformatics/btac580

BMC Bioinformatics, Jan 2019
BACKGROUND
The development of high-throughput sequencing and analysis has accelerated multi-omics studies of thousands of microbial species, metagenomes, and infectious disease pathogens. Omics studies are enabling genotype-phenotype association studies that identify genetic determinants of pathogen virulence and drug resistance, as well as phylogenetic studies designed to track the origin and spread of disease outbreaks. These omics studies are complex and often employ multiple assay technologies, including genomics, metagenomics, transcriptomics, proteomics, and metabolomics. To maximize the impact of omics studies, it is essential that data be accompanied by detailed contextual metadata (e.g., specimen, spatial-temporal, and phenotypic characteristics) in clear, organized, and consistent formats. Over the years, many metadata standards have arisen from various standardization initiatives, such as the Genomic Standards Consortium's minimal information standards (MIxS) and the GSCID/BRC Project and Sample Application Standard. Some tools exist for tracking metadata, but they do not provide event-based capabilities to configure, collect, validate, and distribute metadata. To address this gap in the scientific community, an event-based, data-driven application, OMeta, was created that allows users to quickly configure, collect, validate, distribute, and integrate metadata.
RESULTS
A data-driven web application, OMeta, has been developed for use by researchers, consisting of a browser-based interface, a command-line interface (CLI), and server-side components that provide an intuitive platform for configuring, capturing, viewing, and sharing metadata. Project and sample metadata can be set based on existing standards or on project goals. Recorded information includes details on the biological samples, procedures, protocols, and experimental technologies. This information can be organized by events, including sample collection, sample quantification, sequencing assay, and analysis results. OMeta supports several field presentation types (checkbox, file, drop-down, and ontology), and fields can be configured to use the National Center for Biomedical Ontology (NCBO), a biomedical ontology server. Furthermore, OMeta maintains a complete audit trail of all changes made by users and allows metadata export in comma-separated value (CSV) format for convenient deposition of data into public databases.
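Event-based metadata tracking with an audit trail and CSV export can be sketched in a few lines. The event types and field names below are invented for illustration and are not OMeta's actual data model.

```python
import csv
import io
from datetime import datetime, timezone

# Sketch of event-based metadata tracking in the spirit of OMeta; event
# and field names are illustrative only. Appending (never mutating) events
# naturally preserves a complete audit trail.
events = []

def record_event(sample_id, event_type, user, **fields):
    """Append one timestamped, attributed metadata event."""
    events.append({
        "sample_id": sample_id,
        "event": event_type,
        "user": user,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        **fields,
    })

record_event("S1", "sample_collection", "alice", site="nasal swab")
record_event("S1", "sequencing_assay", "bob", platform="Illumina MiSeq")

# Export as CSV for deposition into public databases; the header is the
# union of all fields seen, and absent fields are left blank.
fieldnames = sorted({k for e in events for k in e})
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(events)
```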
CONCLUSIONS
We present OMeta, a web-based software application built on data-driven principles for configuring and customizing data standards and for capturing, curating, and sharing metadata.
Topics: Biological Ontologies; Databases, Factual; Metadata; Metagenomics; Phylogeny; Software; User-Computer Interface; Whole Genome Sequencing
PubMed: 30612540
DOI: 10.1186/s12859-018-2580-9

Studies in Health Technology and Informatics, May 2021
The data produced during a research project are too often collected for the sole purpose of the study, hindering profitable reuse in similar contexts. The growing need to counteract this trend has recently led to the formalization of the FAIR principles, which aim to make (meta)data Findable, Accessible, Interoperable and Reusable, for humans and machines. Since their introduction, efforts have been ongoing to encourage the adoption of the FAIR principles and to implement solutions based on them. This paper reports on the FAIR-compliant registry we developed to collect and serve metadata describing clinical trials. The design of the registry is based on the FAIR Data Point (FDP) specifications, the state-of-the-art reference for FAIRified metadata sharing. To map the metadata relevant to our use case, we extended the DCAT-based semantic model of the FDP, adopting well-established ontologies in the biomedical and clinical domain, such as the Semanticscience Integrated Ontology (SIO). The current implementation is based on the Molgenis software and provides both a user interface and a REST API for metadata discovery. At present, the registry is being loaded with the metadata of the 18 clinical studies included in the 'I FAIR Program', a project aimed at disseminating FAIR best practices among clinical researchers in Sardinia (Italy). After a testing phase, the registry will be made publicly available, while the new model and the source code will be released as open source.
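A DCAT-based metadata record of the kind a FAIR Data Point serves is, at its core, a JSON-LD document describing a dataset and its distributions. The sketch below is illustrative: the property selection is a minimal subset, and the study title and URLs are invented placeholders.

```python
import json

# Sketch of a DCAT-style JSON-LD record for a clinical study, of the kind a
# FAIR Data Point serves. Property choices are a minimal illustrative subset;
# the title and URLs are invented placeholders.
study = {
    "@context": {"dcat": "http://www.w3.org/ns/dcat#",
                 "dct": "http://purl.org/dc/terms/"},
    "@type": "dcat:Dataset",
    "dct:identifier": "study-001",
    "dct:title": "Example clinical study",
    "dct:publisher": "Example research centre",
    "dcat:distribution": [{
        "@type": "dcat:Distribution",
        "dcat:accessURL": "https://registry.example.org/study-001",
    }],
}

serialized = json.dumps(study, indent=2)
```

Because the record is plain JSON-LD over stable vocabularies (DCAT, Dublin Core), both humans browsing a UI and machines crawling a REST API can resolve and interpret it, which is what makes the metadata Findable and Accessible in the FAIR sense.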
Topics: Biomedical Research; Humans; Italy; Metadata; Registries; Software
PubMed: 34042684
DOI: 10.3233/SHTI210281

European Journal of Radiology, Sep 2023
PURPOSE
The ever-increasing volume of medical imaging data and the growing interest in Big Data research bring challenges to data organization, categorization, and retrieval. Although the radiological value chain is almost entirely digital, data structuring has largely been performed pragmatically, with naming and metadata standards insufficient for the stringent needs of image analysis. To enable automated data management independent of naming and metadata, this study focused on developing a convolutional neural network (CNN) that classifies medical images based solely on voxel data.
METHOD
A 3D CNN (3D-ResNet18) was trained using a dataset of 31,602 prostate MRI volumes covering 10 different sequence types from 1243 patients. A five-fold cross-validation approach with patient-based splits was chosen for training and testing. Training was repeated with a gradual reduction in training data, assessing classification accuracies to determine the minimum training data required for sufficient performance. The trained model and developed method were tested on three external datasets.
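The key point of patient-based splits is that all volumes from one patient land in the same fold, so no patient appears in both training and test data. A minimal stdlib sketch (the patient IDs are illustrative; the study's actual split procedure is not specified beyond being patient-based):

```python
# Sketch of five-fold cross-validation with patient-based splits: folds are
# built over patient IDs, not volumes, so every volume of a patient stays in
# the same fold and no patient leaks between training and test sets.
def patient_folds(patient_ids, k=5):
    """Assign each unique patient to one of k folds, round-robin."""
    folds = [[] for _ in range(k)]
    for i, pid in enumerate(sorted(set(patient_ids))):
        folds[i % k].append(pid)
    return folds

patients = [f"P{i:04d}" for i in range(1243)]  # 1243 patients, as in the study
folds = patient_folds(patients, k=5)

# Fold 0 as the held-out test set; the rest is training data.
test_fold = set(folds[0])
train = [p for p in patients if p not in test_fold]
```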
RESULTS
The model achieved an overall accuracy of 99.88 % ± 0.13 % in classifying typical prostate MRI sequence types. When being trained with approximately 10 % of the original cohort (112 patients), the CNN still achieved an accuracy of 97.43 % ± 2.10 %. In external testing the model achieved sensitivities of > 90 % for 10/15 tested sequence types.
CONCLUSIONS
The herein developed CNN enabled automatic and reliable sequence identification in prostate MRI. Ultimately, such CNN models for voxel-based sequence identification could substantially enhance the management of medical imaging data, improve workflow efficiency and data quality, and allow for robust clinical AI workflows.
Topics: Male; Humans; Prostate; Metadata; Magnetic Resonance Imaging; Neural Networks, Computer; Image Processing, Computer-Assisted
PubMed: 37453274
DOI: 10.1016/j.ejrad.2023.110964

IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2022
The integration of genomic metadata is at once an important, difficult, and well-recognized challenge. It is important because a wealth of public data repositories is available to drive biological and clinical research, and combining information from various heterogeneous and widely dispersed sources is paramount to a number of biological discoveries. It is difficult because the domain is complex and there is no agreement among the various metadata definitions, which refer to different vocabularies and ontologies. It is well recognized in the bioinformatics community because, in common practice, repositories are accessed one by one, learning their specific metadata definitions through long and tedious effort, and such practice is error-prone. In this paper, we describe META-BASE, an architecture for integrating metadata extracted from a variety of genomic data sources, based upon a structured transformation process. We present a variety of innovative techniques for data extraction, cleaning, normalization and enrichment. We propose a general, open and extensible pipeline that can easily incorporate any number of new data sources, and present the resulting repository, which already integrates several important sources and is exposed through practical user interfaces that respond to biological researchers' needs.
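The normalization and enrichment steps of such a pipeline can be sketched as mapping heterogeneous raw values onto a controlled vocabulary and attaching an ontology term. This is only in the spirit of META-BASE, not its implementation; the synonym table and ontology ID below are illustrative examples.

```python
# Sketch of the normalization + enrichment step of a metadata integration
# pipeline in the spirit of META-BASE: raw source-specific values are
# cleaned, mapped to a canonical term, and enriched with an ontology ID.
# The synonym table and ontology IDs are illustrative examples only.
SYNONYMS = {
    "homo sapiens": "Homo sapiens",
    "human": "Homo sapiens",
    "h. sapiens": "Homo sapiens",
}
ONTOLOGY_IDS = {"Homo sapiens": "NCBITaxon:9606"}

def normalize(raw_value):
    """Clean a raw value, map it to a canonical term, and enrich it."""
    cleaned = " ".join(raw_value.strip().lower().split())  # cleaning step
    canonical = SYNONYMS.get(cleaned)                      # normalization
    if canonical is None:
        return {"value": raw_value, "normalized": False}
    return {"value": canonical, "normalized": True,
            "ontology_id": ONTOLOGY_IDS.get(canonical)}    # enrichment

result = normalize("  HUMAN ")
```

Keeping the pipeline table-driven like this is what makes it extensible: supporting a new source mostly means adding its value mappings, not new code.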
Topics: Computational Biology; Genomics; Information Storage and Retrieval; Metadata
PubMed: 32750853
DOI: 10.1109/TCBB.2020.2998954

Studies in Health Technology and Informatics, Sep 2019
The utilisation of metadata repositories increasingly promotes secondary use of routinely collected data. However, this has not yet solved the problem of data exchange across organisational boundaries: for flawless data exchange, the local description of a metadata set must itself be exchangeable. In previous work, a metadata exchange language, QL4MDR, was developed. This work aimed to examine the applicability of that exchange language. For this purpose, existing MDR implementations were identified, systematically inspected, and roughly divided into two categories to distinguish between data integration and query integration. It was shown that all of the implementations can be adapted to QL4MDR. The integration of metadata is an important first step; it enables the exchange of information that is urgently needed for the further processing of instance data, from metadata mappings to transformation rules.
Topics: Metadata
PubMed: 31483257
DOI: 10.3233/SHTI190808

Studies in Health Technology and Informatics, Aug 2019
Secondary use of electronic health record (EHR) data requires a detailed description of metadata, especially when data collection and data re-use are organizationally and technically far apart. This paper describes the concept of the SMITH consortium that includes conventions, processes, and tools for describing and managing metadata using common standards for semantic interoperability. It deals in particular with the chain of processing steps of data from existing information systems and provides an overview of the planned use of metadata, medical terminologies, and semantic services in the consortium.
Topics: Data Collection; Electronic Health Records; Germany; Information Systems; Metadata; Semantics
PubMed: 31438215
DOI: 10.3233/SHTI190518