mSphere Jun 2022
The availability of public genomics data has become essential for modern life sciences research, yet the quality, traceability, and curation of these data have significant impacts on a broad range of microbial genomics research. While microbial genome databases such as NCBI's RefSeq database leverage the scalability of crowdsourcing for growth, genomics data provenance and the authenticity of the source materials used to produce data are not strict requirements. Here, we describe the assembly of 1,113 bacterial genome references produced from authenticated materials sourced from the American Type Culture Collection (ATCC), each with full genomics data provenance relating to bioinformatics methods, quality control, and passage history. Comparative genomics analysis of ATCC standard reference genomes (ASRGs) revealed significant issues in NCBI's RefSeq bacterial genome assemblies related to completeness, mutations, structure, strain metadata, and gaps in traceability to the original biological source materials. Nearly half of RefSeq assemblies lack details on sample source information, sequencing technology, or bioinformatics methods. Deep curation of these records is not within the scope of NCBI's core mission in supporting open science, which aims to collect sequence records submitted by the public. Nonetheless, we propose that gaps in metadata accuracy and data provenance represent an "elephant in the room" for microbial genomics research. Effectively addressing these issues will require raising the level of accountability for data depositors and acknowledging the need for higher expectations of quality among the researchers whose work depends on accurate and attributable reference genome data.
IMPORTANCE
The traceability of microbial genomics data to authenticated physical biological materials is not a requirement for depositing these data into public genome databases. This creates significant risks for the reliability and data provenance of these important genomics research resources, the impact of which is not well understood. We sought to investigate this by carrying out a comparative genomics study of 1,113 ATCC standard reference genomes (ASRGs) produced by ATCC from authenticated and traceable materials using the latest sequencing technologies. We found widespread discrepancies in genome assembly quality, genetic variability, and the quality and completeness of the associated metadata among hundreds of reference genomes for ATCC strains found in NCBI's RefSeq database. We present a comparative analysis of de novo-assembled ASRGs, their respective metadata, and variant analysis using RefSeq genomes as a reference. Although assembly quality in RefSeq has generally improved over time, significant quality issues remain, especially as related to genomic data and metadata provenance. Our work highlights the importance of data authentication and provenance for the microbial genomics community and underscores the risks of ignoring this issue in the future.
Topics: Databases, Genetic; Genome, Bacterial; Genome, Microbial; Genomics; Reproducibility of Results
PubMed: 35491842
DOI: 10.1128/msphere.00077-22
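As a minimal sketch of the kind of metadata-completeness audit this study describes, one could query NCBI with Biopython's Entrez module and flag empty fields. The accession and the specific summary fields checked below are illustrative assumptions, not the authors' actual pipeline:

```python
# Sketch: flag missing provenance metadata in a RefSeq assembly record.
from Bio import Entrez

Entrez.email = "you@example.org"  # NCBI asks for a contact address

def audit_assembly(accession):
    """Fetch an assembly's Entrez summary and flag missing metadata fields."""
    handle = Entrez.esearch(db="assembly", term=accession)
    uids = Entrez.read(handle)["IdList"]
    handle.close()
    if not uids:
        return {"accession": accession, "status": "not found"}
    handle = Entrez.esummary(db="assembly", id=uids[0])
    docsum = Entrez.read(handle, validate=False)
    handle.close()
    summary = docsum["DocumentSummarySet"]["DocumentSummary"][0]
    # Fields a provenance audit might require; empty values count as gaps.
    fields = ["SubmitterOrganization", "AssemblyStatus", "Coverage", "BioSampleAccn"]
    return {f: summary.get(f) or "MISSING" for f in fields}

print(audit_assembly("GCF_000005845.2"))  # E. coli K-12 MG1655, for example
```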
Bioinformatics (Oxford, England) Jul 2014
UNLABELLED
The Galaxy platform has developed into a fully featured collaborative workbench, with goals of inherently capturing provenance to enable reproducible data analysis, and of making it straightforward to run one's own server. However, many Galaxy platform tools rely on the presence of reference data, such as alignment indexes, to function efficiently. Until now, the building of this cache of data for Galaxy has been an error-prone manual process lacking reproducibility and provenance. The Galaxy Data Manager framework is an enhancement that changes the management of Galaxy's built-in data cache from a manual procedure to an automated, graphical user interface (GUI)-driven process that provides the same openness, reproducibility, and provenance afforded to Galaxy's analysis tools. Data Manager tools allow the Galaxy administrator to download, create, and install additional datasets for any type of reference data in real time.
AVAILABILITY AND IMPLEMENTATION
The Galaxy Data Manager framework is implemented in Python and has been integrated as part of the core Galaxy platform. Individual Data Manager tools can be defined locally or installed from a ToolShed, allowing the Galaxy community to define additional Data Manager tools as needed, with full versioning and dependency support.
Topics: Humans; Reproducibility of Results; Software
PubMed: 24585771
DOI: 10.1093/bioinformatics/btu119
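To make the hand-off concrete, here is a minimal sketch of the JSON contract at the heart of a Data Manager tool: Galaxy invokes the tool with a JSON parameters file, and the tool writes back the data-table rows to register. The file layout and the "all_fasta" table name follow common Galaxy conventions, but the script is a hedged illustration, not the framework's full API:

```python
# Sketch: the JSON hand-off performed by a Galaxy Data Manager tool.
import json
import os
import sys

def main():
    params_path = sys.argv[1]            # JSON parameters file from Galaxy
    with open(params_path) as fh:
        params = json.load(fh)
    target_dir = params["output_data"][0]["extra_files_path"]
    os.makedirs(target_dir, exist_ok=True)

    # ... download or build the reference data into target_dir here ...
    fasta_path = os.path.join(target_dir, "mygenome.fa")

    # Tell Galaxy which rows to add to its built-in data table.
    entry = {"value": "mygenome", "dbkey": "mygenome",
             "name": "My genome build", "path": fasta_path}
    with open(params_path, "w") as fh:
        json.dump({"data_tables": {"all_fasta": [entry]}}, fh)

if __name__ == "__main__":
    main()
```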
Patterns (New York, N.Y.) May 2020
Data provenance is a machine-readable summary of the collection and computational history of a dataset. Data provenance confers or adds value to a dataset, helps reproduce computational analyses, or validates scientific conclusions. The people of the End-to-End Provenance Project are a community of professionals who have developed software tools to collect and use data provenance.
PubMed: 33205093
DOI: 10.1016/j.patter.2020.100016
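Read literally, that definition ("collection and computational history of a dataset") suggests a small, concrete artifact. A minimal sketch of such a machine-readable record, with field names loosely borrowed from the W3C PROV vocabulary rather than taken from the project's own tools, might look like this:

```python
# Sketch: a minimal machine-readable provenance record for one dataset.
import hashlib
import json
from datetime import datetime, timezone

def provenance_record(data_path, collected_by, steps):
    with open(data_path, "rb") as fh:
        digest = hashlib.sha256(fh.read()).hexdigest()
    return {
        "entity": {"path": data_path, "sha256": digest},
        "wasAttributedTo": collected_by,   # collection history
        "activities": steps,               # computational history
        "generatedAtTime": datetime.now(timezone.utc).isoformat(),
    }

# Create a tiny example input so the sketch runs end to end.
open("reads.fastq", "w").write("@r1\nACGT\n+\nFFFF\n")
record = provenance_record(
    "reads.fastq",
    collected_by="Example Lab",
    steps=[{"tool": "fastqc", "version": "0.12.1", "command": "fastqc reads.fastq"}],
)
print(json.dumps(record, indent=2))
```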
JMIR Research Protocols Nov 2021
BACKGROUND
Provenance supports the understanding of data genesis, and it is a key factor to ensure the trustworthiness of digital objects containing (sensitive) scientific data. Provenance information contributes to a better understanding of scientific results and fosters collaboration on existing data as well as data sharing. This encompasses defining comprehensive concepts and standards for transparency and traceability, reproducibility, validity, and quality assurance during clinical and scientific data workflows and research.
OBJECTIVE
The aim of this scoping review is to investigate existing evidence regarding approaches and criteria for provenance tracking as well as disclosing current knowledge gaps in the biomedical domain. This review covers modeling aspects as well as metadata frameworks for meaningful and usable provenance information during creation, collection, and processing of (sensitive) scientific biomedical data. This review also covers the examination of quality aspects of provenance criteria.
METHODS
This scoping review will follow the methodological framework by Arksey and O'Malley. Relevant publications will be obtained by querying PubMed and Web of Science. All papers in the English language published between January 1, 2006, and March 23, 2021, will be included. Data retrieval will be accompanied by a manual search for grey literature. Potential publications will then be exported into a reference management software, and duplicates will be removed. Afterwards, the obtained set of papers will be transferred into a systematic review management tool. All publications will be screened, extracted, and analyzed: title and abstract screening will be carried out by 4 independent reviewers. A majority vote is required to establish the eligibility of papers based on the defined inclusion and exclusion criteria. Full-text reading will be performed independently by 2 reviewers, and in the last step, key information will be extracted onto a pretested template. If agreement cannot be reached, the conflict will be resolved by a domain expert. Charted data will be analyzed by categorizing and summarizing the individual data items based on the research questions. Tabular or graphical overviews will be given, if applicable.
RESULTS
The reporting follows the extension of the Preferred Reporting Items for Systematic reviews and Meta-Analyses statements for Scoping Reviews. Electronic database searches in PubMed and Web of Science resulted in 469 matches after deduplication. As of September 2021, the scoping review is in the full-text screening stage. The data extraction using the pretested charting template will follow the full-text screening stage. We expect the scoping review report to be completed by February 2022.
CONCLUSIONS
Information about the origin of healthcare data has a major impact on the quality and reusability of scientific results as well as on follow-up activities. This protocol outlines plans for a scoping review that will provide information about current approaches to, challenges with, and knowledge gaps in provenance tracking in the biomedical sciences.
INTERNATIONAL REGISTERED REPORT IDENTIFIER (IRRID)
DERR1-10.2196/31750.
PubMed: 34813494
DOI: 10.2196/31750
The FEBS Journal Jan 2022
The FEBS Journal, a leading multidisciplinary journal in the life sciences, publishes high-impact papers on diverse topics relating to the molecular mechanisms underpinning biological processes. Here, Editor-in-Chief Seamus Martin discusses the critical importance of data provenance and data integrity to the scientific method and reflects on some of the highlights from 2021 at The FEBS Journal.
PubMed: 34982855
DOI: 10.1111/febs.16332
Patterns (New York, N.Y.) Aug 2020
Deep learning, a set of approaches using artificial neural networks, has generated rapid recent advancements in machine learning. Deep learning does, however, have the potential to reduce the reproducibility of scientific results. Model outputs are critically dependent on the data and processing approach used to initially generate the model, but this provenance information is usually lost during model training. To avoid a future reproducibility crisis, we need to improve our deep-learning model management. The FAIR principles for data stewardship and software/workflow implementation give excellent high-level guidance on ensuring effective reuse of data and software. We suggest some specific guidelines for the generation and use of deep-learning models in science and explain how these relate to the FAIR principles. We then present dtoolAI, a Python package that we have developed to implement these guidelines. The package implements automatic capture of provenance information during model training and simplifies model distribution.
PubMed: 33205122
DOI: 10.1016/j.patter.2020.100073
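As a rough illustration of the guideline (capture provenance at training time and ship it with the model), not of dtoolAI's actual API, a training script could persist a provenance sidecar next to the model file; every name below is an assumption:

```python
# Sketch: save a model together with a provenance sidecar file.
import hashlib
import json
import sys
from datetime import datetime, timezone

def save_model_with_provenance(model_bytes, model_path, dataset_uri, params):
    with open(model_path, "wb") as fh:
        fh.write(model_bytes)
    provenance = {
        "model_sha256": hashlib.sha256(model_bytes).hexdigest(),
        "training_dataset": dataset_uri,     # where the data came from
        "hyperparameters": params,           # how the model was produced
        "python_version": sys.version.split()[0],
        "trained_at": datetime.now(timezone.utc).isoformat(),
    }
    with open(model_path + ".provenance.json", "w") as fh:
        json.dump(provenance, fh, indent=2)  # travels with the model file

save_model_with_provenance(b"\x00fake-weights", "model.bin",
                           "https://example.org/data/v1",
                           {"epochs": 10, "lr": 1e-3})
```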
Frontiers in Genetics 2022
Fair and equitable benefit sharing of genetic resources is an expectation of the Nagoya Protocol. Although the Nagoya Protocol does not yet formally apply to Digital Sequence Information ("DSI"), discussions about whether to include such data are currently underway in ongoing Convention on Biological Diversity ("CBD") negotiations. While Indigenous Peoples and Local Communities ("IPLC") expect the value generated from genomic data to be subject to benefit sharing arrangements, a range of views are currently being expressed by Nation States, IPLC, and other stakeholders. The use of DSI gives rise to unique considerations, creating a gray area as to how it should be treated under the Nagoya Protocol's Access and Benefit Sharing ("ABS") principles. One way to enhance benefit sharing is to connect data to proper provenance information. A significant development is the use of digital labeling systems to ensure that the origin of samples is appropriately disclosed. The Traditional Knowledge and Biocultural Labels initiative offers a practical option for data provided to genomic databases. In particular, the BioCultural Labels ("BC Labels") are a mechanism for Indigenous communities to identify and maintain provenance, origin, and authority over biocultural material and data generated from Indigenous land and waters and held in research and cultural institutions and data repositories. This form of cultural metadata adds value to the research endeavor, and the creation of Indigenous fields within databases adds transparency and accountability to the research environment.
PubMed: 36212139
DOI: 10.3389/fgene.2022.1014044
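As a purely hypothetical sketch of what such an Indigenous provenance field might look like on a sequence database record (the structure below is illustrative and is not the Local Contexts schema):

```python
# Sketch: a database record carrying a hypothetical BC Label metadata field.
import json

record = {
    "accession": "EXAMPLE000001",
    "organism": "Example species",
    "bc_labels": [{
        "label": "BC Provenance",                      # assumed label name
        "community": "Example Indigenous community",
        "notice_uri": "https://example.org/labels/example-project",
    }],
}
print(json.dumps(record, indent=2))
```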
Journal of Grid Computing 2022
In scientific collaboration, data sharing and the exchange of ideas and results are essential to knowledge construction and the development of science. Hence, we must guarantee interoperability, privacy, traceability (reinforcing transparency), and trust. Provenance has been widely recognized for providing a history of the steps taken in scientific experiments; consequently, it supports traceability and assists in the reproducibility of scientific results. One technology that can enhance trust in collaborative scientific experimentation is blockchain. This work proposes an architecture, named BlockFlow, based on blockchain, provenance, and cloud infrastructure to bring trust and traceability to the execution of collaborative scientific experiments. The proposed architecture is implemented on Hyperledger, and a scenario involving the genomic sequencing of the SARS-CoV-2 coronavirus is used to evaluate it, illustrating the benefits of providing traceability and trust in collaborative scientific experimentation. Furthermore, the architecture addresses the heterogeneity of shared data, facilitating the interpretation and analysis of such data by geographically distributed researchers. Through a blockchain-based architecture with built-in provenance capture, we can enhance data sharing, traceability, and trust in collaborative scientific experiments.
PubMed: 36246518
DOI: 10.1007/s10723-022-09626-x
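The trust property rests on hash chaining: each ledger entry commits to the one before it, so altering history breaks the chain. A simplified in-memory sketch of that idea, not the authors' Hyperledger implementation, follows:

```python
# Sketch: a tamper-evident, hash-chained provenance ledger.
import hashlib
import json

class ProvenanceLedger:
    def __init__(self):
        self.blocks = []

    def append(self, step):
        prev_hash = self.blocks[-1]["hash"] if self.blocks else "0" * 64
        payload = json.dumps({"step": step, "prev": prev_hash}, sort_keys=True)
        self.blocks.append({"step": step, "prev": prev_hash,
                            "hash": hashlib.sha256(payload.encode()).hexdigest()})

    def verify(self):
        """Recompute every hash; any tampering with history breaks the chain."""
        prev = "0" * 64
        for block in self.blocks:
            payload = json.dumps({"step": block["step"], "prev": prev}, sort_keys=True)
            if block["prev"] != prev or \
               block["hash"] != hashlib.sha256(payload.encode()).hexdigest():
                return False
            prev = block["hash"]
        return True

ledger = ProvenanceLedger()
ledger.append({"tool": "bwa mem", "input": "SARS-CoV-2 reads", "site": "lab-A"})
ledger.append({"tool": "ivar consensus", "input": "aligned.bam", "site": "lab-B"})
print(ledger.verify())  # True until any recorded step is altered
```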
Expert Review of Molecular Diagnostics Mar 2017
Review
The emergence and mass utilization of high-throughput (HT) technologies, including sequencing technologies (genomics) and mass spectrometry (proteomics, metabolomics, lipids), has allowed geneticists, biologists, and biostatisticians to bridge the gap between genotype and phenotype on a massive scale. These new technologies have brought rapid advances in our understanding of cell biology, evolutionary history, and microbial environments, and are increasingly providing new insights and applications towards clinical care and personalized medicine.
AREAS COVERED
The very success of this industry also translates into daunting big data challenges for researchers and institutions that extend beyond the traditional academic focus of algorithms and tools. The main obstacles revolve around analysis provenance, data management of massive datasets, ease of use of software, and interpretability and reproducibility of results.
EXPERT COMMENTARY
The authors review the challenges associated with implementing bioinformatics best practices in a large-scale setting, and highlight the opportunity for establishing bioinformatics pipelines that incorporate data tracking and auditing, enabling greater consistency and reproducibility for basic research, translational, or clinical settings.
Topics: Computational Biology; Genetic Research; Genomics
PubMed: 28092471
DOI: 10.1080/14737159.2017.1282822
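One lightweight way to realize such data tracking and auditing, sketched here as an assumption rather than the authors' design, is to wrap each pipeline step so that every run leaves an audit record of its inputs, outputs, tool version, and parameters:

```python
# Sketch: a decorator that leaves an audit trail for each pipeline step.
import functools
import hashlib
import json
from datetime import datetime, timezone

AUDIT_LOG = "pipeline_audit.jsonl"  # assumed append-only audit log

def sha256_of(path):
    with open(path, "rb") as fh:
        return hashlib.sha256(fh.read()).hexdigest()

def audited(tool_version):
    """Wrap a pipeline step so every run appends an audit record."""
    def decorator(step):
        @functools.wraps(step)
        def wrapper(in_path, out_path, **kwargs):
            result = step(in_path, out_path, **kwargs)
            entry = {"step": step.__name__, "tool_version": tool_version,
                     "input_sha256": sha256_of(in_path),
                     "output_sha256": sha256_of(out_path),
                     "params": kwargs,
                     "ran_at": datetime.now(timezone.utc).isoformat()}
            with open(AUDIT_LOG, "a") as fh:
                fh.write(json.dumps(entry) + "\n")
            return result
        return wrapper
    return decorator

@audited(tool_version="trim-0.1")
def trim_reads(in_path, out_path, min_quality=20):
    ...  # call the real trimming tool here; it must write out_path
```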
The Journal of Consumer Affairs 2022
This article advances the riveting discussion on how this special issue contributes to the consumer well-being literature. Specifically, it endeavors to present an eclectic account of how pandemics have had a lasting impact on consumer well-being, of the provenance of that impact, and of future research priorities for academia and practice. First, it briefly discusses the origin and relevance of the evolving issue of consumer well-being during pandemics. Second, it presents several directions for future research, and third, it offers key insights for policymakers. It includes multiple research priorities that present vastly contrasting manifestations of consumer well-being. This article argues that future research will need to examine the drivers of consumer well-being during pandemics, the mechanisms that underlie the influence of pandemics on consumer well-being, and the boundary conditions that accentuate or mitigate the influence of pandemic-induced factors.
PubMed: 35603324
DOI: 10.1111/joca.12445