F1000Research, 2018
Publishing peer review materials alongside research articles promises to make the peer review process more transparent, as well as making it easier to recognise these contributions and give credit to peer reviewers. Traditionally, the peer review reports, editors' letters, and author responses are shared only among the small number of people in those roles prior to publication, but there is growing interest in making some or all of these materials available. A small number of journals have been publishing peer review materials for some time, others have begun this practice more recently, and significantly more are now considering how they might begin. This article outlines the outcomes of a recent workshop among journals with experience in publishing peer review materials, in which the specific operation of these workflows, and their challenges, were discussed. Here, we provide a draft proposal for representing these materials in the JATS and Crossref data models to facilitate the coordination and discoverability of peer review materials, and we seek feedback on these initial recommendations.
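To make the kind of metadata under discussion concrete, the sketch below bundles a peer-review item with a link back to the reviewed article. The field names are illustrative stand-ins only; they do not reproduce the actual JATS or Crossref peer-review schemas, which are exactly what the workshop's recommendations aim to pin down.

```python
# Illustrative sketch only: field names are hypothetical, not the
# actual JATS or Crossref peer-review vocabularies.
import json

def review_record(article_doi, review_type, recommendation, reviewer=None):
    """Bundle a peer-review item with a link back to the reviewed article."""
    return {
        "related-article-doi": article_doi,
        "review-type": review_type,          # e.g. "referee-report", "author-comment"
        "recommendation": recommendation,    # e.g. "accept", "major-revision"
        "reviewer": reviewer or "anonymous", # reviewers may remain unnamed
    }

record = review_record("10.12688/f1000research.16460.1", "referee-report", "accept")
print(json.dumps(record, indent=2))
```

Whatever the final schema, the core design question is the same as in this toy: every review artefact needs a typed link to the article it belongs to, plus room for anonymity.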
Topics: Authorship; Metadata; Peer Review, Research; Publishing
PubMed: 30416719
DOI: 10.12688/f1000research.16460.1
Current Protocols in Bioinformatics, Dec 2019
The Encyclopedia of DNA Elements (ENCODE) web portal hosts genomic data generated by the ENCODE Consortium, Genomics of Gene Regulation, the NIH Roadmap Epigenomics Consortium, and the modENCODE and modERN projects. The goal of the ENCODE project is to build a comprehensive map of the functional elements of the human and mouse genomes. Currently, the portal database stores over 500 TB of raw and processed data from over 15,000 experiments spanning assays that measure gene expression, DNA accessibility, DNA and RNA binding, DNA methylation, and 3D chromatin structure across numerous cell lines, tissue types, and differentiation states with selected genetic and molecular perturbations. The ENCODE portal provides unrestricted access to these data and the relevant metadata as a service to the scientific community. The metadata model captures the details of the experiments, raw and processed data files, and processing pipelines in human- and machine-readable form, and enables the user to search for specific data either using a web browser or programmatically via a REST API. Furthermore, ENCODE data can be freely visualized or downloaded for additional analyses. © 2019 The Authors. Basic Protocol: Query the portal. Support Protocol 1: Batch downloading. Support Protocol 2: Using the cart to download files. Support Protocol 3: Visualize data. Alternate Protocol: Query building and programmatic access.
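The programmatic access mentioned above works by adding `format=json` to portal search URLs; the `/search/` endpoint then returns JSON whose hits sit under the `"@graph"` key. The sketch below only builds such a URL (the specific filter values are illustrative); actually fetching it requires network access.

```python
# Sketch of a programmatic ENCODE portal search URL. The /search/
# endpoint and format=json parameter are part of the portal's REST API;
# the filter values used here are illustrative.
from urllib.parse import urlencode

BASE = "https://www.encodeproject.org/search/"

def encode_search_url(**filters):
    """Build a portal search URL that returns JSON instead of HTML."""
    params = dict(filters, format="json", limit="10")
    return BASE + "?" + urlencode(params)

url = encode_search_url(type="Experiment", assay_title="ATAC-seq")
# Fetching this URL (e.g. with urllib.request) yields JSON whose hits
# appear under "@graph", each carrying an "accession" identifier.
print(url)
```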
Topics: Animals; Chromatin; DNA; DNA Methylation; Databases, Genetic; Epigenomics; Genome, Human; Humans; Internet; Metadata; Mice; Software
PubMed: 31751002
DOI: 10.1002/cpbi.89
Bioinformatics (Oxford, England), Mar 2023
MOTIVATION
The Gene Expression Omnibus (GEO) has become an important source of biological data for secondary analysis. However, there is no simple, programmatic way to download data and metadata from GEO in a standardized annotation format.
RESULTS
To address this, we present GEOfetch, a command-line tool that downloads and organizes data and metadata from GEO and SRA. GEOfetch formats the downloaded metadata as a Portable Encapsulated Project, providing a universal format for the reanalysis of public data.
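A Portable Encapsulated Project (PEP) pairs a YAML project config with a CSV sample table. The toy below reads an in-memory sample table of the general shape such tools emit; aside from `sample_name`, the column names are illustrative assumptions rather than GEOfetch's exact output.

```python
# Toy reader for a PEP-style sample table. Column names other than
# 'sample_name' are illustrative, not GEOfetch's exact schema.
import csv
import io

sample_table = io.StringIO(
    "sample_name,protocol,srr_accession\n"
    "GSM100001,RNA-seq,SRR0000001\n"
    "GSM100002,RNA-seq,SRR0000002\n"
)

samples = list(csv.DictReader(sample_table))
rna_seq = [s["sample_name"] for s in samples if s["protocol"] == "RNA-seq"]
print(rna_seq)  # ['GSM100001', 'GSM100002']
```

Because the sample table is plain CSV keyed by sample name, any downstream workflow engine can consume it without GEO-specific parsing, which is the point of standardizing on PEP.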
AVAILABILITY AND IMPLEMENTATION
GEOfetch is available on Bioconda and the Python Package Index (PyPI).
Topics: Metadata; Gene Expression; Computational Biology
PubMed: 36857584
DOI: 10.1093/bioinformatics/btad069
Journal of Proteome Research, Sep 2021
The volume of proteomics and mass spectrometry data available in public repositories continues to grow at a rapid pace as more researchers embrace open science practices. Open access to the data behind scientific discoveries has become critical to validate published findings and develop new computational tools. Here, we present ppx, a Python package that provides easy, programmatic access to the data stored in ProteomeXchange repositories, such as PRIDE and MassIVE. The ppx package can be used as either a command line tool or a Python package to retrieve the files and metadata associated with a project when provided its identifier. To demonstrate how ppx enhances reproducible research, we used ppx within a Snakemake workflow to reanalyze a published data set with the open modification search tool ANN-SoLo and compared our reanalysis to the original results. We show that ppx readily integrates into workflows, and our reanalysis produced results consistent with the original analysis. We envision that ppx will be a valuable tool for creating reproducible analyses, providing tool developers easy access to data for development, testing, and benchmarking, and enabling the use of mass spectrometry data in data-intensive analyses. The ppx package is freely available and open source under the MIT license at https://github.com/wfondrie/ppx.
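One piece of logic a ProteomeXchange client like ppx must perform is repository dispatch: the identifier prefix determines whether a project lives in PRIDE or MassIVE. The standalone toy below mirrors only that prefix logic; it is not ppx's actual implementation or API.

```python
# Sketch of ProteomeXchange repository dispatch by identifier prefix
# (an illustration of the idea, not ppx's real code).
def repository_for(identifier):
    """Map a ProteomeXchange identifier to its hosting repository."""
    prefixes = {"PXD": "PRIDE", "MSV": "MassIVE"}
    for prefix, repo in prefixes.items():
        if identifier.upper().startswith(prefix):
            return repo
    raise ValueError(f"Unrecognized ProteomeXchange identifier: {identifier}")

print(repository_for("PXD000001"))    # PRIDE
print(repository_for("MSV000080544"))  # MassIVE
```

Resolving the repository up front lets a single project identifier drive both file retrieval and metadata lookup, which is what makes identifier-driven tools convenient inside workflow systems such as Snakemake.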
Topics: Mass Spectrometry; Metadata; Proteomics; Search Engine; Software
PubMed: 34342226
DOI: 10.1021/acs.jproteome.1c00454
GigaScience, Oct 2019 (Review)
Increasingly sophisticated experiments, coupled with large-scale computational models, have the potential to systematically test biological hypotheses to drive our understanding of multicellular systems. In this short review, we explore key challenges that must be overcome to achieve robust, repeatable data-driven multicellular systems biology. If these challenges can be solved, we can grow beyond the current state of isolated tools and datasets to a community-driven ecosystem of interoperable data, software utilities, and computational modeling platforms. Progress is within our grasp, but it will take community (and financial) commitment.
Topics: Big Data; Metadata; Systems Biology
PubMed: 31648301
DOI: 10.1093/gigascience/giz127
GigaScience, Feb 2022
BACKGROUND
The Public Health Alliance for Genomic Epidemiology (PHA4GE) (https://pha4ge.org) is a global coalition that is actively working to establish consensus standards, document and share best practices, improve the availability of critical bioinformatics tools and resources, and advocate for greater openness, interoperability, accessibility, and reproducibility in public health microbial bioinformatics. In the face of the current pandemic, PHA4GE has identified a need for a fit-for-purpose, open-source SARS-CoV-2 contextual data standard.
RESULTS
As such, we have developed a SARS-CoV-2 contextual data specification package based on harmonizable, publicly available community standards. The specification can be implemented via a collection template, as well as an array of protocols and tools to support both the harmonization and submission of sequence data and contextual information to public biorepositories.
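Harmonized submission of contextual data ultimately reduces to checking records against a set of required fields before deposition. The sketch below shows that validation pass; the field names are plausible stand-ins, not a faithful copy of the PHA4GE specification's term list.

```python
# Toy required-field check for a SARS-CoV-2 contextual-data record.
# The field names are illustrative stand-ins, not the actual PHA4GE
# specification terms.
REQUIRED = {
    "specimen_collector_sample_id",
    "sample_collection_date",
    "geo_loc_name",
    "organism",
}

def missing_fields(record):
    """Return required fields that are absent or empty, sorted."""
    return sorted(f for f in REQUIRED if not record.get(f))

record = {
    "specimen_collector_sample_id": "hCoV-19/example/2021",
    "organism": "Severe acute respiratory syndrome coronavirus 2",
}
print(missing_fields(record))  # ['geo_loc_name', 'sample_collection_date']
```

In practice this kind of check is embedded in collection templates and submission tooling so that incomplete records are caught before they reach a public biorepository such as NCBI BioSample.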
CONCLUSIONS
Well-structured, rich contextual data add value, promote reuse, and enable aggregation and integration of disparate datasets. Adoption of the proposed standard and practices will better enable interoperability between datasets and systems, improve the consistency and utility of generated data, and ultimately facilitate novel insights and discoveries in SARS-CoV-2 and COVID-19. The package is now supported by the NCBI's BioSample database.
Topics: COVID-19; Genomics; Humans; Metadata; Public Health; Reproducibility of Results; SARS-CoV-2
PubMed: 35169842
DOI: 10.1093/gigascience/giac003
BMC Bioinformatics, Jan 2019
BACKGROUND
The development of high-throughput sequencing and analysis has accelerated multi-omics studies of thousands of microbial species, metagenomes, and infectious disease pathogens. Omics studies are enabling genotype-phenotype association studies that identify genetic determinants of pathogen virulence and drug resistance, as well as phylogenetic studies designed to track the origin and spread of disease outbreaks. These omics studies are complex and often employ multiple assay technologies including genomics, metagenomics, transcriptomics, proteomics, and metabolomics. To maximize the impact of omics studies, it is essential that data be accompanied by detailed contextual metadata (e.g., specimen, spatial-temporal, and phenotypic characteristics) in clear, organized, and consistent formats. Over the years, many metadata standards have arisen from various standards initiatives, such as the Genomic Standards Consortium's minimal information standards (MIxS) and the GSCID/BRC Project and Sample Application Standard. Some tools exist for tracking metadata, but they do not provide event-based capabilities to configure, collect, validate, and distribute metadata. To address this gap in the scientific community, we created OMeta, an event-based, data-driven application that allows users to quickly configure, collect, validate, distribute, and integrate metadata.
RESULTS
A data-driven web application, OMeta, has been developed for researchers; it consists of a browser-based interface, a command-line interface (CLI), and server-side components that together provide an intuitive platform for configuring, capturing, viewing, and sharing metadata. Project and sample metadata can be set based on existing standards or on project goals. Recorded information includes details on the biological samples, procedures, protocols, and experimental technologies. This information can be organized by events, including sample collection, sample quantification, sequencing assay, and analysis results. OMeta supports several field presentation types (checkbox, file, drop-down, ontology), and fields can be configured to use the National Center for Biomedical Ontology (NCBO), a biomedical ontology server. Furthermore, OMeta maintains a complete audit trail of all changes made by users and allows metadata export in comma-separated value (CSV) format for convenient deposition of data into public databases.
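The event-based capture, audit trail, and CSV export described above can be sketched in a few lines. Class and field names here are illustrative inventions in the spirit of OMeta, not its actual schema or API.

```python
# Minimal sketch of event-based metadata capture with an audit trail
# and CSV export. Names and fields are illustrative, not OMeta's
# actual data model.
import csv
import io
from datetime import datetime, timezone

class SampleMetadata:
    def __init__(self, sample_id):
        self.sample_id = sample_id
        self.attributes = {}
        self.audit = []  # every change is appended, never silently overwritten

    def record_event(self, event, user, **attrs):
        """Apply an event's attribute updates and log who made them, when."""
        self.attributes.update(attrs)
        timestamp = datetime.now(timezone.utc).isoformat()
        self.audit.append((timestamp, event, user, attrs))

    def to_csv(self):
        """Export current attributes as one CSV row for deposition."""
        buf = io.StringIO()
        writer = csv.writer(buf)
        keys = sorted(self.attributes)
        writer.writerow(["sample_id"] + keys)
        writer.writerow([self.sample_id] + [self.attributes[k] for k in keys])
        return buf.getvalue()

s = SampleMetadata("S001")
s.record_event("sample_collection", "alice", collection_site="nasopharynx")
s.record_event("sequencing_assay", "bob", platform="Illumina NovaSeq")
print(s.to_csv())
```

Keeping the audit log separate from the current attribute state is what lets an event-based system answer both "what is this sample's metadata now?" and "who changed it, and when?".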
CONCLUSIONS
We present OMeta, a web-based software application built on data-driven principles for configuring and customizing data standards and for capturing, curating, and sharing metadata.
Topics: Biological Ontologies; Databases, Factual; Metadata; Metagenomics; Phylogeny; Software; User-Computer Interface; Whole Genome Sequencing
PubMed: 30612540
DOI: 10.1186/s12859-018-2580-9
The Bone & Joint Journal, Dec 2017 (Review)
'Big data' is a term for data sets that are so large or complex that traditional data processing applications are inadequate. Billions of dollars have been spent on attempts to build predictive tools from large sets of poorly controlled healthcare metadata. Companies often sell reports at a physician or facility level based on various flawed data sources, and comparative websites of 'publicly reported data' purport to educate the public. Physicians should be aware of the concerns and pitfalls in data definitions, data clarity, data relevance, data sources, and data cleaning when evaluating analytic reports built from metadata in health care. Cite this article: Bone Joint J 2017;99-B:1571-6.
Topics: Data Mining; Datasets as Topic; Delivery of Health Care; Humans; Metadata
PubMed: 29212678
DOI: 10.1302/0301-620X.99B12.BJJ-2017-0939
Nature Communications, May 2022
DNA-based data storage platforms traditionally encode information only in the nucleotide sequence of the molecule. Here we report on a two-dimensional molecular data storage system that records information in both the sequence and the backbone structure of DNA and performs nontrivial joint data encoding, decoding, and processing. Our 2DDNA method efficiently stores images in synthetic DNA and embeds pertinent metadata as nicks in the DNA backbone. To avoid costly worst-case redundancy for correcting sequencing/rewriting errors and to mitigate issues associated with mismatched decoding parameters, we develop machine learning techniques for automatic discoloration detection and image inpainting. The 2DDNA platform is experimentally tested by reconstructing a library of images with undetectable or small visual degradation after readout processing, and by erasing and rewriting copyright metadata encoded in nicks. Our results demonstrate that DNA can serve both as a write-once and rewritable memory for heterogeneous data and that data can be erased in a permanent, privacy-preserving manner. Moreover, the storage system can be made robust to degrading channel qualities while avoiding global error-correction redundancy.
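The two-dimensional idea can be illustrated with a toy codec: primary data lives in the base sequence, while extra metadata bits are recorded as the presence or absence of a nick at fixed backbone positions. This is a didactic sketch only, not the 2DDNA encoding.

```python
# Toy two-dimensional storage sketch: metadata bits as nick positions
# along the DNA backbone (didactic, not the 2DDNA codec).
def encode_nicks(metadata_bits, spacing=10):
    """Place a nick at position i*spacing wherever bit i is 1."""
    return [i * spacing for i, bit in enumerate(metadata_bits, start=1) if bit]

def decode_nicks(nick_positions, n_bits, spacing=10):
    """Recover the bit string from observed nick positions."""
    nicks = set(nick_positions)
    return [1 if i * spacing in nicks else 0 for i in range(1, n_bits + 1)]

bits = [1, 0, 1, 1]
nicks = encode_nicks(bits)        # [10, 30, 40]
assert decode_nicks(nicks, 4) == bits
# Repairing the backbone erases the nick-encoded metadata while leaving
# the sequence-encoded data untouched, which is what makes the metadata
# layer rewritable.
```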
Topics: DNA; Gene Library; Information Storage and Retrieval; Machine Learning; Metadata
PubMed: 35624096
DOI: 10.1038/s41467-022-30140-x
Database: the Journal of Biological..., Jan 2016
Enormous amounts of biomedical data have been, and are being, produced by investigators all over the world. However, one crucial and limiting factor in data reuse is accurate, structured, and complete description of the data, or data about the data, defined as metadata. We propose a framework to predict structured metadata terms from unstructured metadata to improve the quality and quantity of metadata, using the Gene Expression Omnibus (GEO) microarray database. Our framework consists of classifiers trained using term frequency-inverse document frequency (TF-IDF) features and a second approach based on topics modeled using Latent Dirichlet Allocation (LDA) to reduce the dimensionality of the unstructured data. Our results on the GEO database show that structured metadata terms are predicted most accurately using the TF-IDF approach, followed by LDA, with both outperforming the majority-vote baseline. While some accuracy is lost through the dimensionality reduction of LDA, the difference is small for elements with few possible values, and there is a large improvement over the majority-classifier baseline. Overall, this is a promising approach for metadata prediction that is likely to be applicable to other datasets, with implications for researchers interested in biomedical metadata curation and metadata prediction.
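The TF-IDF features feeding the classifiers above weight each term by its frequency in a document, discounted by how many documents contain it. The from-scratch toy below computes those weights; a real study would typically use a library implementation, and the example tokens are invented.

```python
# Bare-bones TF-IDF computation: term frequency times inverse document
# frequency, computed from scratch over toy token lists.
import math
from collections import Counter

def tf_idf(docs):
    """Return per-document TF-IDF weights for a list of token lists."""
    n = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append(
            {t: (tf[t] / len(doc)) * math.log(n / df[t]) for t in tf}
        )
    return weights

docs = [
    ["liver", "tissue", "human"],
    ["liver", "cell", "line"],
    ["mouse", "tissue"],
]
w = tf_idf(docs)
# 'liver' appears in 2 of 3 documents, so it is down-weighted relative
# to terms unique to a single document, such as 'human'.
```

This down-weighting of ubiquitous terms is exactly why TF-IDF features discriminate well between metadata categories: boilerplate words shared across all records contribute little, while category-specific vocabulary dominates.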
Topics: Computational Biology; Data Mining; Databases, Genetic; Gene Expression Profiling; Humans; Metadata; Models, Statistical
PubMed: 28637268
DOI: 10.1093/database/baw080