Neuroinformatics Jan 2022
Results of computational analyses require transparent disclosure of their supporting resources, while the analyses themselves can often be very large in scale and involve multiple processing steps separated in time. Evidence for the correctness of any analysis should include not only a textual description, but also a formal record of the computations which produced the result, including accessible data and software with runtime parameters, environment, and personnel involved. This article describes FAIRSCAPE, a reusable computational framework enabling simplified access to modern scalable cloud-based components. FAIRSCAPE fully implements the FAIR data principles and extends them to provide fully FAIR Evidence, including machine-interpretable provenance of datasets, software, and computations, as metadata for all computed results. The FAIRSCAPE microservices framework creates a complete Evidence Graph for every computational result, including persistent identifiers with metadata resolvable to the software, computations, and datasets used in the computation, and stores a URI to the root of the graph in the result's metadata. An ontology for Evidence Graphs, EVI (https://w3id.org/EVI), supports inferential reasoning over the evidence. FAIRSCAPE can run nested or disjoint workflows and preserves provenance across them. It can run Apache Spark jobs, scripts, workflows, or user-supplied containers. All objects, including software, are assigned persistent IDs. All results are annotated with FAIR metadata using the evidence graph model for access, validation, reproducibility, and re-use of archived data and software.
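The evidence-graph idea described above can be sketched in a few lines of Python. This is a minimal illustration using plain dictionaries; the identifiers and property names (the ark:/... IDs, usedDataset, usedSoftware, generatedBy) are hypothetical placeholders, not the exact EVI ontology terms.

```python
# Minimal sketch of an evidence-graph record linking a computed result
# to the dataset, software, and computation that produced it.
# All identifiers and property names here are illustrative placeholders,
# not the actual EVI ontology vocabulary.

def make_node(node_id, node_type, **props):
    """Build a graph node as a plain dict with an id and a type."""
    node = {"@id": node_id, "@type": node_type}
    node.update(props)
    return node

dataset = make_node("ark:/data/input-001", "Dataset", name="raw counts")
software = make_node("ark:/software/norm-v1", "Software", name="normalizer")
computation = make_node(
    "ark:/computation/run-42", "Computation",
    usedDataset=dataset["@id"], usedSoftware=software["@id"],
)
result = make_node(
    "ark:/data/result-001", "Dataset",
    generatedBy=computation["@id"],  # URI stored in the result's metadata
)

# Walking back from the result recovers its full provenance chain.
index = {n["@id"]: n for n in (dataset, software, computation, result)}
run = index[result["generatedBy"]]
```

Starting from any result node, a consumer can follow the generatedBy link to the computation and, from there, to every dataset and software artifact it used.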
Topics: Metadata; Reproducibility of Results; Software; Workflow
PubMed: 34264488
DOI: 10.1007/s12021-021-09529-4
F1000Research 2022
BACKGROUND
Knowing the needs of the bioimaging community with respect to research data management (RDM) is essential for identifying measures that enable adoption of the FAIR (findable, accessible, interoperable, reusable) principles for microscopy and bioimage analysis data across disciplines. As an initiative within Germany's National Research Data Infrastructure, we conducted this community survey in summer 2021 to assess the state of the art of bioimaging RDM and the community's needs.
METHODS
An online survey was conducted with a mixed question-type design. We created a questionnaire tailored to relevant topics of the bioimaging community, including specific questions on bioimaging methods and bioimage analysis, as well as more general questions on RDM principles and tools. 203 survey entries were included in the analysis, covering the perspectives of various life and biomedical science disciplines and of participants at different career levels.
RESULTS
The results highlight the importance and value of bioimaging RDM and data sharing. However, the practical implementation of FAIR practices is impeded by technical hurdles, lack of knowledge, and insecurity about the legal aspects of data sharing. The survey participants request metadata guidelines and annotation tools and endorse the use of image data management platforms. At present, OMERO (Open Microscopy Environment Remote Objects) is the best-known and most widely used platform. Most respondents rely on image processing and analysis, which they regard as the most time-consuming step of the bioimage data workflow. While knowledge about and implementation of electronic lab notebooks and data management plans is limited, respondents acknowledge their potential value for data handling and publication.
CONCLUSIONS
The bioimaging community acknowledges and endorses the value of RDM and data sharing. Still, there is a need for information, guidance, and standardization to foster the adoption of FAIR data handling. This survey may help inspire targeted measures to close this gap.
Topics: Humans; Data Management; Metadata; Information Dissemination; Surveys and Questionnaires; Workflow
PubMed: 36405555
DOI: 10.12688/f1000research.121714.2
Studies in Health Technology and... Jun 2023
Review
Learning Health Systems (LHS) and integrated care are challenged by a fragmented health data landscape. An information model is agnostic to the underlying data structures and can potentially help mitigate some of the gaps. In a research project, Valkyrie, we are exploring how metadata can be organized and used to promote service coordination and interoperability across levels of care. An information model is viewed as central in this context and as a future integrated LHS support. We examined the literature regarding property requirements for data, information, and knowledge models in the context of semantic interoperability and an LHS. The requirements were elicited and synthesized into five guiding principles as a vocabulary to inform the information model design of Valkyrie. Further research on requirements and guiding principles for information model design and evaluation is welcomed.
Topics: Learning Health System; Knowledge; Metadata; Research Design
PubMed: 37387108
DOI: 10.3233/SHTI230574
Metabolomics: Official Journal of the... Nov 2022
INTRODUCTION
The structural identification of metabolites represents one of the current bottlenecks in non-targeted liquid chromatography-mass spectrometry (LC-MS) based metabolomics. The Metabolomics Standards Initiative has developed a multilevel system to report confidence in metabolite identification, which involves the use of MS, MS/MS, and orthogonal data. Limitations due to similar or identical fragmentation patterns (e.g., of isomeric compounds) can be overcome by the additional orthogonal information of the retention time (RT), since it is a system property that differs for each chromatographic setup.
OBJECTIVES
In contrast to MS data, sharing of RT data is not as widespread. The quality of data and its (re-)usability depend very much on the quality of the metadata. We aimed to evaluate the coverage and quality of this metadata in public metabolomics repositories.
METHODS
We compiled an overview of the current reporting of chromatographic separation conditions. For this purpose, we defined the following information as essential details that must be provided: column name and dimensions, flow rate, temperature, and composition of eluents and gradient.
RESULTS
We found that 70% of descriptions of the chromatographic setups are incomplete (according to our definition) and an additional 10% of the descriptions contained ambiguous and/or incorrect information. Accordingly, only about 20% of the descriptions allow further (re-)use of the data, e.g. for RT prediction. Therefore, we have started to develop a unified and standardized notation for chromatographic metadata with detailed and specific description of eluents, columns and gradients.
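The essential details listed in the methods can be captured in a structured record; the following is a hypothetical sketch of such a record, with field names of our own choosing, not the standardized notation the authors are developing.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a structured chromatography-metadata record
# covering the details flagged as essential above (column, flow rate,
# temperature, eluents, gradient). Field names are illustrative, not
# the standardized notation proposed in the paper.

@dataclass
class GradientStep:
    time_min: float    # time point in minutes
    percent_b: float   # percentage of eluent B at that time point

@dataclass
class LCMethod:
    column_name: str
    column_dimensions_mm: tuple   # (length, inner diameter)
    flow_rate_ml_min: float
    temperature_c: float
    eluent_a: str
    eluent_b: str
    gradient: list = field(default_factory=list)

    def is_complete(self):
        """A record is reusable (e.g., for RT prediction) only if every
        essential field is filled in."""
        return all([self.column_name, self.column_dimensions_mm,
                    self.flow_rate_ml_min, self.eluent_a, self.eluent_b,
                    self.gradient])

method = LCMethod(
    column_name="ACME C18",               # made-up column for illustration
    column_dimensions_mm=(100, 2.1),
    flow_rate_ml_min=0.3,
    temperature_c=40.0,
    eluent_a="water + 0.1% formic acid",
    eluent_b="acetonitrile + 0.1% formic acid",
    gradient=[GradientStep(0.0, 5.0), GradientStep(10.0, 95.0)],
)
```

A completeness check like this makes the 70%-incomplete finding mechanically reproducible: a description missing any essential field fails the check.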
CONCLUSION
Reporting of chromatographic metadata is currently not unified. Our recommendations for metadata reporting will enable greater standardization and automation in future reporting.
Topics: Metadata; Metabolomics; Tandem Mass Spectrometry; Chromatography, Liquid; Temperature
PubMed: 36436113
DOI: 10.1007/s11306-022-01956-x
F1000Research 2018
Publishing peer review materials alongside research articles promises to make the peer review process more transparent, as well as making it easier to recognise these contributions and give credit to peer reviewers. Traditionally, the peer review reports, editors' letters, and author responses are shared only among the small number of people in those roles prior to publication, but there is a growing interest in making some or all of these materials available. A small number of journals have been publishing peer review materials for some time, others have begun this practice more recently, and significantly more are now considering how they might begin. This article outlines the outcomes of a recent workshop among journals with experience in publishing peer review materials, in which the specific operation of these workflows, and the challenges, were discussed. Here, we provide a draft proposal for how to represent these materials in the JATS and Crossref data models to facilitate the coordination and discoverability of peer review materials, and seek feedback on these initial recommendations.
Topics: Authorship; Metadata; Peer Review, Research; Publishing
PubMed: 30416719
DOI: 10.12688/f1000research.16460.1
Bioinformatics (Oxford, England) Dec 2021
SUMMARY
As the use of single-cell technologies has grown, so has the need for tools to explore these large, complicated datasets. The UCSC Cell Browser is a tool that allows scientists to visualize gene expression and metadata annotation distribution throughout a single-cell dataset or multiple datasets.
AVAILABILITY AND IMPLEMENTATION
We provide the UCSC Cell Browser as a free website where scientists can explore a growing collection of single-cell datasets, and as a freely available Python package for scientists to create stable, self-contained visualizations for their own single-cell datasets. Learn more at https://cells.ucsc.edu.
SUPPLEMENTARY INFORMATION
Supplementary data are available at Bioinformatics online.
Topics: Genomics; Software; Databases, Genetic; Metadata
PubMed: 34244710
DOI: 10.1093/bioinformatics/btab503
Medical Physics Nov 2020
PURPOSE
One of the most frequently cited radiomics investigations showed that features automatically extracted from routine clinical images could be used in prognostic modeling. These images have been made publicly accessible via The Cancer Imaging Archive (TCIA). There have been numerous requests for additional explanatory metadata on the following datasets: RIDER, Interobserver, Lung1, and Head-Neck1. To support repeatability, reproducibility, generalizability, and transparency in radiomics research, we publish the subjects' clinical data, extracted radiomics features, and Digital Imaging and Communications in Medicine (DICOM) headers of these four datasets with descriptive metadata, in order to be more compliant with findable, accessible, interoperable, and reusable (FAIR) data management principles.
ACQUISITION AND VALIDATION METHODS
Overall survival time intervals were updated using a national citizens' registry after internal ethics board approval. Spatial offsets of the primary gross tumor volume (GTV) regions of interest (ROIs) associated with the Lung1 CT series were improved on the TCIA. GTV radiomics features were extracted using the open-source Ontology-Guided Radiomics Analysis Workflow (O-RAW). We reshaped the output of O-RAW to map features and extraction settings to the latest version of the Radiomics Ontology, so as to be consistent with the Image Biomarker Standardization Initiative (IBSI). DICOM metadata were extracted using a research version of Semantic DICOM (SOHARD GmbH, Fuerth, Germany). Subjects' clinical data were described with metadata using the Radiation Oncology Ontology. All of the above were published in the Resource Description Framework (RDF), that is, as triples. Example SPARQL queries are shared with the reader for use on the online triples archive, and are intended to illustrate how to exploit this data submission.
DATA FORMAT
The accumulated RDF data are publicly accessible through a SPARQL endpoint where the triples are archived. The endpoint is remotely queried through a graph database web application at http://sparql.cancerdata.org. SPARQL queries are intrinsically federated, such that we can efficiently cross-reference clinical, DICOM, and radiomics data within a single query, while being agnostic to the original data format and coding system. The federated queries work in the same way even if the RDF data were partitioned across multiple servers and dispersed physical locations.
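A query against such an endpoint can be issued from Python with only the standard library. In this sketch, the endpoint base URL comes from the paper, but the prefix and predicate names in the query are placeholders of our own invention; consult the paper's example SPARQL queries for the actual ontology terms.

```python
import urllib.parse
import urllib.request

# Sketch of querying a public SPARQL endpoint such as the paper's archive
# at http://sparql.cancerdata.org. The ex: prefix and predicate names are
# placeholders, not the real clinical/radiomics ontology terms.
QUERY = """\
PREFIX ex: <http://example.org/>  # placeholder namespace
SELECT ?patient ?survival ?feature
WHERE {
  ?patient ex:hasSurvivalDays ?survival .
  ?patient ex:hasRadiomicFeature ?feature .
}
LIMIT 10
"""

def sparql_request(endpoint, query):
    """Build (but do not send) an HTTP GET request for a SPARQL SELECT,
    asking for JSON-formatted results per the SPARQL 1.1 protocol."""
    params = urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        endpoint + "?" + params,
        headers={"Accept": "application/sparql-results+json"},
    )

req = sparql_request("http://sparql.cancerdata.org", QUERY)
# urllib.request.urlopen(req) would send it; omitted here (needs network).
```

Because the query addresses everything through URIs, the same request shape works whether the clinical, DICOM, and radiomics triples live on one server or are federated across several.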
POTENTIAL APPLICATIONS
The public availability of these data resources is intended to support radiomics features replication, repeatability, and reproducibility studies by the academic community. The example SPARQL queries may be freely used and modified by readers depending on their research question. Data interoperability and reusability are supported by referencing existing public ontologies. The RDF data are readily findable and accessible through the aforementioned link. Scripts used to create the RDF are made available at a code repository linked to this submission: https://gitlab.com/UM-CDS/FAIR-compliant_clinical_radiomics_and_DICOM_metadata.
Topics: Databases, Factual; Germany; Humans; Metadata; Reproducibility of Results; Workflow
PubMed: 32521049
DOI: 10.1002/mp.14322
Journal of Proteome Research Sep 2021
The volume of proteomics and mass spectrometry data available in public repositories continues to grow at a rapid pace as more researchers embrace open science practices. Open access to the data behind scientific discoveries has become critical to validate published findings and develop new computational tools. Here, we present ppx, a Python package that provides easy, programmatic access to the data stored in ProteomeXchange repositories, such as PRIDE and MassIVE. The ppx package can be used as either a command line tool or a Python package to retrieve the files and metadata associated with a project when provided its identifier. To demonstrate how ppx enhances reproducible research, we used ppx within a Snakemake workflow to reanalyze a published data set with the open modification search tool ANN-SoLo and compared our reanalysis to the original results. We show that ppx readily integrates into workflows, and our reanalysis produced results consistent with the original analysis. We envision that ppx will be a valuable tool for creating reproducible analyses, providing tool developers easy access to data for development, testing, and benchmarking, and enabling the use of mass spectrometry data in data-intensive analyses. The ppx package is freely available and open source under the MIT license at https://github.com/wfondrie/ppx.
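Access by project identifier might look like the following sketch. The looks_like_accession helper is our own illustration; the calls inside fetch_remote_file_list follow ppx's documented interface but should be verified against the package documentation before use.

```python
import re

# Sketch of programmatic repository access in the style of ppx.
# looks_like_accession is our own illustrative helper; the ppx calls in
# fetch_remote_file_list follow ppx's documented interface (verify
# against https://github.com/wfondrie/ppx before relying on them).

def looks_like_accession(identifier):
    """Rough local check for ProteomeXchange (PXD...) or MassIVE (MSV...)
    project identifiers."""
    return re.fullmatch(r"(PXD|MSV)\d+", identifier) is not None

def fetch_remote_file_list(accession):
    """List a project's files without downloading them (requires network)."""
    import ppx  # pip install ppx
    project = ppx.find_project(accession)  # resolves the repository from the ID
    return project.remote_files()

# Validate an identifier locally before hitting the network.
ok = looks_like_accession("PXD000001")
```

Putting the retrieval behind a single function keyed on the accession is what makes it easy to drop into a Snakemake rule, as in the reanalysis described above.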
Topics: Mass Spectrometry; Metadata; Proteomics; Search Engine; Software
PubMed: 34342226
DOI: 10.1021/acs.jproteome.1c00454
GigaScience Oct 2019
Review
Increasingly sophisticated experiments, coupled with large-scale computational models, have the potential to systematically test biological hypotheses to drive our understanding of multicellular systems. In this short review, we explore key challenges that must be overcome to achieve robust, repeatable data-driven multicellular systems biology. If these challenges can be solved, we can grow beyond the current state of isolated tools and datasets to a community-driven ecosystem of interoperable data, software utilities, and computational modeling platforms. Progress is within our grasp, but it will take community (and financial) commitment.
Topics: Big Data; Metadata; Systems Biology
PubMed: 31648301
DOI: 10.1093/gigascience/giz127
BMC Bioinformatics Jan 2019
BACKGROUND
The development of high-throughput sequencing and analysis has accelerated multi-omics studies of thousands of microbial species, metagenomes, and infectious disease pathogens. Omics studies are enabling genotype-phenotype association studies which identify genetic determinants of pathogen virulence and drug resistance, as well as phylogenetic studies designed to track the origin and spread of disease outbreaks. These omics studies are complex and often employ multiple assay technologies including genomics, metagenomics, transcriptomics, proteomics, and metabolomics. To maximize the impact of omics studies, it is essential that data be accompanied by detailed contextual metadata (e.g., specimen, spatial-temporal, and phenotypic characteristics) in clear, organized, and consistent formats. Over the years, many metadata standards have arisen from various standards initiatives, for example the Genomic Standards Consortium's minimal information standards (MIxS) and the GSCID/BRC Project and Sample Application Standard. Some tools exist for tracking metadata, but they do not provide event-based capabilities to configure, collect, validate, and distribute metadata. To address this gap in the scientific community, an event-based, data-driven application, OMeta, was created that allows users to quickly configure, collect, validate, distribute, and integrate metadata.
RESULTS
A data-driven web application, OMeta, has been developed for use by researchers, consisting of a browser-based interface, a command-line interface (CLI), and server-side components that provide an intuitive platform for configuring, capturing, viewing, and sharing metadata. Project and sample metadata can be set based on existing standards or on project goals. Recorded information includes details on the biological samples, procedures, protocols, experimental technologies, and so on. This information can be organized based on events, including sample collection, sample quantification, sequencing assay, and analysis results. OMeta supports configuration with various presentation types (checkbox, file, drop-box, ontology), and fields can be configured to use the National Center for Biomedical Ontology (NCBO), a biomedical ontology server. Furthermore, OMeta maintains a complete audit trail of all changes made by users and allows metadata export in comma-separated value (CSV) format for convenient deposition of data into public databases.
CONCLUSIONS
We present OMeta, a web-based software application built on data-driven principles for configuring and customizing data standards and for capturing, curating, and sharing metadata.
Topics: Biological Ontologies; Databases, Factual; Metadata; Metagenomics; Phylogeny; Software; User-Computer Interface; Whole Genome Sequencing
PubMed: 30612540
DOI: 10.1186/s12859-018-2580-9