- Neuroinformatics Jul 2019
Knowledge discovery via an informatics resource is constrained by the completeness of the resource, both in terms of the amount of data it contains and in terms of the metadata that exists to describe the data. Increasing completeness in one of these categories risks reducing completeness in the other because manually curating metadata is time-consuming and is restricted by familiarity with both the data and the metadata annotation scheme. The diverse interests of a research community may drive a resource to have hundreds of metadata tags with few examples for each, making it challenging for humans or machine learning algorithms to learn how to assign metadata tags properly. We demonstrate with ModelDB, a computational neuroscience model discovery resource, that manually curated regular-expression-based rules can overcome this challenge by parsing existing texts from data providers during user data entry to suggest metadata annotations and to prompt providers to suggest other related metadata annotations, rather than leaving the task to a curator. In the ModelDB implementation, analyzing the abstract identified 6.4 metadata tags per abstract at 79% precision. Using the full text produced higher recall but low precision (41%), and the title alone produced few (1.3) metadata annotations per entry; we thus recommend data providers use their abstract during upload. Grouping the possible metadata annotations into categories (e.g. cell type, biological topic) revealed that precision and recall for the different text sources vary by category. Given this proof-of-concept, other bioinformatics resources can likewise improve the quality of their metadata by adopting our approach of prompting data uploaders with relevant metadata at the minimal cost of formalizing rules for each potential metadata annotation.
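The rule-based suggestion step described in this abstract can be illustrated with a minimal Python sketch. The tag names and regular expressions below are hypothetical examples invented for illustration; they are not ModelDB's actual rules, and the precision and recall helpers simply compare suggested tags against a curated set.

```python
import re

# Hypothetical rules (assumption): tag name -> compiled regular expression.
RULES = {
    "Hippocampus CA1 pyramidal cell": re.compile(r"\bCA1\b.*\bpyramidal\b", re.IGNORECASE),
    "Calcium dynamics": re.compile(r"\bcalcium\b", re.IGNORECASE),
    "Synaptic plasticity": re.compile(r"\b(LTP|LTD|synaptic plasticity)\b", re.IGNORECASE),
}


def suggest_tags(text):
    """Return the set of tags whose rule matches anywhere in the text."""
    return {tag for tag, pattern in RULES.items() if pattern.search(text)}


def precision(suggested, curated):
    """Fraction of suggested tags that also appear in the curated set."""
    if not suggested:
        return 0.0
    return len(suggested & curated) / len(suggested)


def recall(suggested, curated):
    """Fraction of curated tags that were suggested."""
    if not curated:
        return 0.0
    return len(suggested & curated) / len(curated)
```

Running `suggest_tags` on a title, an abstract, and a full text in turn, and scoring each against curated tags, would allow the kind of per-source comparison the abstract reports.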
Topics: Animals; Computational Biology; Data Analysis; Humans; Machine Learning; Metadata
PubMed: 30382537
DOI: 10.1007/s12021-018-9403-z
- Nucleic Acids Research Jan 2022
Single-cell bisulfite sequencing methods are widely used to assess epigenomic heterogeneity in cell states. Over the past few years, large amounts of data have been generated, facilitating a deeper understanding of the epigenetic regulation of many key biological processes including early embryonic development, cell differentiation and tumor progression. There is an urgent need to build a functional resource platform for this massive amount of data. Here, we present scMethBank, the first open access and comprehensive database dedicated to the collection, integration, analysis and visualization of single-cell DNA methylation data and metadata. The current release of scMethBank includes processed single-cell bisulfite sequencing data and curated metadata of 8328 samples derived from 15 public single-cell datasets, involving two species (human and mouse), 29 cell types and two diseases. In summary, scMethBank aims to assist researchers who are interested in cell heterogeneity to explore and utilize whole genome methylation data at single-cell level by providing browse, search, visualization and download functions and user-friendly online tools. The database is accessible at: https://ngdc.cncb.ac.cn/methbank/scm/.
Topics: Animals; Chromosome Mapping; DNA Methylation; Databases, Genetic; Datasets as Topic; Epigenesis, Genetic; Genome; Humans; Internet; Metadata; Mice; Molecular Sequence Annotation; Single-Cell Analysis; Software; Whole Genome Sequencing
PubMed: 34570235
DOI: 10.1093/nar/gkab833
- Journal of the American Medical... Jul 2020
OBJECTIVE
Ubiquitous technologies can be leveraged to construct ecologically relevant metrics that complement traditional psychological assessments. This study aims to determine the feasibility of smartphone-derived real-world keyboard metadata to serve as digital biomarkers of mood.
MATERIALS AND METHODS
BiAffect, a real-world observation study based on a freely available iPhone app, allowed the unobtrusive collection of typing metadata through a custom virtual keyboard that replaces the default keyboard. User demographics and self-reports for depression severity (Patient Health Questionnaire-8) were also collected. Using >14 million keypresses from 250 users who reported demographic information and a subset of 147 users who additionally completed at least 1 Patient Health Questionnaire, we employed hierarchical growth curve mixed-effects models to capture the effects of mood, demographics, and time of day on keyboard metadata.
RESULTS
We analyzed 86 541 typing sessions associated with a total of 543 Patient Health Questionnaires. Results showed that more severe depression relates to more variable typing speed (P < .001), shorter session duration (P < .001), and lower accuracy (P < .05). Additionally, typing speed and variability exhibit a diurnal pattern, being fastest and least variable at midday. Older users exhibit slower and more variable typing, as well as more pronounced slowing in the evening. The effects of aging and time of day did not impact the relationship of mood to typing variables and were recapitulated in the 250-user group.
CONCLUSIONS
Keystroke dynamics, unobtrusively collected in the real world, are significantly associated with mood despite diurnal patterns and effects of age, and thus could serve as a foundation for constructing digital biomarkers.
Topics: Adult; Affect; Aged; Aging; Biomarkers; Circadian Rhythm; Depressive Disorder; Female; Humans; Linear Models; Male; Metadata; Middle Aged; Smartphone; Telemedicine
PubMed: 32467973
DOI: 10.1093/jamia/ocaa057
- Bioinformatics (Oxford, England) Oct 2022
MOTIVATION
Computational systems biology analyses typically make use of multiple software packages and their dependencies, which are often run across heterogeneous compute environments. This can introduce differences in performance and reproducibility. Capturing metadata (e.g. package versions, GPU model) currently requires repetitious code and is difficult to store centrally for analysis. Even where virtual environments and containers are used, updates over time mean that versioning metadata should still be captured within analysis pipelines to guarantee reproducibility.
RESULTS
Microbench is a simple and extensible Python package to automate metadata capture to a file or Redis database. Captured metadata can include execution time, software package versions, environment variables, hardware information, Python version and more, with plugins. We present three case studies demonstrating Microbench usage to benchmark code execution and examine environment metadata for reproducibility purposes.
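To make the idea of automated metadata capture concrete, here is a minimal standard-library sketch. It is not Microbench's API: the decorator name `capture_metadata`, its parameters, and the record fields are all assumptions chosen for illustration. It wraps a function and appends one JSON record per call to a file.

```python
import json
import os
import platform
import sys
import time
from functools import wraps


def capture_metadata(outfile, env_vars=()):
    """Hypothetical decorator factory: append one JSON record per call to `outfile`."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # The record is written even if the wrapped function raises.
                record = {
                    "function": func.__name__,
                    "run_seconds": time.perf_counter() - start,
                    "python_version": platform.python_version(),
                    "hostname": platform.node(),
                    "platform": sys.platform,
                    "env": {name: os.environ.get(name) for name in env_vars},
                }
                with open(outfile, "a", encoding="utf-8") as fh:
                    fh.write(json.dumps(record) + "\n")
        return wrapper
    return decorator
```

Decorating an analysis function with `@capture_metadata("bench.jsonl", env_vars=("HOME",))` would produce a JSON-lines file that can later be loaded for comparison across runs; for real use, the package's own documentation describes its supported classes and plugins.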
AVAILABILITY AND IMPLEMENTATION
Install from the Python Package Index using pip install microbench. Source code is available from https://github.com/alubbock/microbench.
SUPPLEMENTARY INFORMATION
Supplementary data are available at Bioinformatics online.
Topics: Benchmarking; Metadata; Reproducibility of Results; Software; Systems Biology
PubMed: 36000837
DOI: 10.1093/bioinformatics/btac580
- Ecology and Evolution Aug 2022
Data support knowledge development and theory advances in ecology and evolution. We are increasingly reusing data within our teams and projects and through the global, openly archived datasets of others. Metadata can be challenging to write and interpret, but it is always crucial for reuse. The value of metadata cannot be overstated, even as a relatively independent research object, because it describes the work that has been done in a structured format. We advance a new perspective and classify methods for metadata curation and development with tables. Tables with templates can be effectively used to capture all components of an experiment or project in a single, easy-to-read file familiar to most scientists. If coupled with the R programming language, metadata from tables can then be rapidly and reproducibly converted to publication formats including extensible markup language files suitable for data repositories. Tables can also be used to summarize existing metadata and store metadata across many datasets. A case study is provided and the added benefits of tables for metadata, a priori, are developed to ensure a more streamlined publishing process for many data repositories used in ecology, evolution, and the environmental sciences. In ecology and evolution, researchers are often highly tabular thinkers from experimental data collection in the lab and/or field, and representations of metadata as a table will provide novel research and reuse insights.
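The table-to-XML conversion mentioned in this abstract is done by the authors in R. The Python sketch below only illustrates the general idea under stated assumptions: a two-column table with hypothetical `field` and `value` headers, placeholder values, and a flat XML layout that does not follow any particular repository schema.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Placeholder metadata table (assumption): one row per metadata field.
EXAMPLE_TABLE = """field,value
title,Example dataset
creator,Example Author
pubDate,2022
"""


def table_to_xml(csv_text, root_tag="dataset"):
    """Convert a two-column (field, value) metadata table to an XML string.

    Each `field` entry must be a valid XML element name.
    """
    root = ET.Element(root_tag)
    for row in csv.DictReader(io.StringIO(csv_text)):
        child = ET.SubElement(root, row["field"])
        child.text = row["value"]
    return ET.tostring(root, encoding="unicode")
```

Calling `table_to_xml(EXAMPLE_TABLE)` returns a single XML string with one child element per table row; a real workflow would map the rows onto the target repository's metadata schema instead.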
PubMed: 36035265
DOI: 10.1002/ece3.9245
- BMC Research Notes May 2021
OBJECTIVE
The SARS-CoV-2 pandemic has prompted one of the most extensive and expeditious genomic sequencing efforts in history. Each viral genome is accompanied by a set of metadata which supplies important information such as the geographic origin of the sample, age of the host, and the lab at which the sample was sequenced, and is integral to epidemiological efforts and public health direction. Here, we interrogate some shortcomings of metadata within the GISAID database to raise awareness of common errors and inconsistencies that may affect data-driven analyses and provide possible avenues for resolutions.
RESULTS
Our analysis reveals a startling prevalence of spelling errors and inconsistent naming conventions, which together occur in an estimated ~9.8% and ~11.6% of "originating lab" and "submitting lab" GISAID metadata entries, respectively. We also find numerous ambiguous entries which provide very little information about the actual source of a sample and could easily associate with multiple sources worldwide. Importantly, all of these issues can impair the ability and accuracy of association studies by deceptively causing a group of samples to identify with multiple sources when they truly all identify with one source, or vice versa.
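One possible way to surface spelling variants of the kind described here is approximate string matching. The sketch below is an illustration only, not the authors' method: it normalizes names and greedily groups near-identical ones using Python's standard `difflib`. The similarity cutoff of 0.9 is an arbitrary assumption, and any lab names passed in would need manual review of the resulting groups.

```python
import difflib
import re
from collections import defaultdict


def normalize(name):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    name = re.sub(r"[^\w\s]", " ", name.lower())
    return " ".join(name.split())


def group_variants(names, cutoff=0.9):
    """Greedily group raw names whose normalized forms are near-identical.

    Returns a dict mapping a canonical normalized form to the raw names
    assigned to it. The cutoff is an assumed threshold, not a tuned value.
    """
    groups = defaultdict(list)
    for raw in names:
        key = normalize(raw)
        match = difflib.get_close_matches(key, list(groups), n=1, cutoff=cutoff)
        groups[match[0] if match else key].append(raw)
    return dict(groups)
```

Groups containing more than one distinct raw spelling are candidates for a naming inconsistency; a greedy pass like this depends on input order and can merge genuinely different labs with similar names, so it is a screening aid rather than a correction step.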
Topics: COVID-19; Genome, Viral; Genomics; Humans; Metadata; Phylogeny; SARS-CoV-2
PubMed: 34001211
DOI: 10.1186/s13104-021-05605-9
- Nucleic Acids Research Jan 2022
The BioSamples database at EMBL-EBI is the central institutional repository for sample metadata storage and connection to EMBL-EBI archives and other resources. The technical improvements to our infrastructure described in our last update have enabled us to scale and accommodate an increasing number of communities, resulting in a higher number of submissions and more heterogeneous data. The BioSamples database now has a valuable set of features and processes to improve data quality, in particular by enriching metadata content and following FAIR principles. In this manuscript, we describe how BioSamples in 2021 handles requirements from our community of users through exemplar use cases: how increased findability of samples and improved data management practices support the goals of the ReSOLUTE project, and how the plant community benefits from being able to link genotypic to phenotypic information. We also highlight how, cumulatively, those improvements contribute to more complex multi-omics data integration supporting COVID-19 research. Finally, we present underlying technical features used as pillars throughout those use cases and how they are reused for expanded engagement with communities such as FAIRplus and the Global Alliance for Genomics and Health. Availability: The BioSamples database is freely available at http://www.ebi.ac.uk/biosamples. Content is distributed under the EMBL-EBI Terms of Use available at https://www.ebi.ac.uk/about/terms-of-use. The BioSamples code is available at https://github.com/EBIBioSamples/biosamples-v4 and distributed under the Apache 2.0 license.
Topics: COVID-19; Databases, Factual; Gene Expression Profiling; Genomics; Host-Pathogen Interactions; Humans; Metadata; Phenotype; Plant Physiological Phenomena; SARS-CoV-2
PubMed: 34747489
DOI: 10.1093/nar/gkab1046
- BMC Bioinformatics Mar 2021
BACKGROUND
Women are at more than 1.5-fold higher risk for clinically relevant adverse drug events. While this higher prevalence is partially due to gender-related effects, biological sex differences likely also impact drug response. Publicly available gene expression databases provide a unique opportunity for examining drug response at a cellular level. However, missingness and heterogeneity of metadata prevent large-scale identification of drug exposure studies and limit assessments of sex bias. To address this, we trained organism-specific models to infer sample sex from gene expression data, and used entity normalization to map metadata cell line and drug mentions to existing ontologies. Using this method, we inferred sex labels for 450,371 human and 245,107 mouse microarray and RNA-seq samples from refine.bio.
RESULTS
Overall, we find slight female bias (52.1%) in human samples and male bias (62.5%) in mouse samples; this corresponds to a majority of mixed sex studies in humans and single sex studies in mice, split between female-only and male-only (25.8% vs. 18.9% in human and 21.6% vs. 31.1% in mouse, respectively). In drug studies, we find limited evidence for sex-sampling bias overall; however, specific categories of drugs, including human cancer and mouse nervous system drugs, are enriched in female-only and male-only studies, respectively. We leverage our expression-based sex labels to further examine the complexity of cell line sex and assess the frequency of metadata sex label misannotations (2-5%).
CONCLUSIONS
Our results demonstrate limited overall sex bias, while highlighting high bias in specific subfields and underscoring the importance of including sex labels to better understand the underlying biology. We make our inferred and normalized labels, along with flags for misannotated samples, publicly available to catalyze the routine use of sex as a study variable in future analyses.
Topics: Animals; Bias; Databases, Factual; Female; Gene Expression; Male; Metadata; Mice; Neoplasms; Sex Factors
PubMed: 33784977
DOI: 10.1186/s12859-021-04070-2
- BMC Bioinformatics Jan 2019
BACKGROUND
The biomedical literature is expanding at ever-increasing rates, and it has become extremely challenging for researchers to keep abreast of new data and discoveries even in their own domains of expertise. We introduce PaperBot, a configurable, modular, open-source crawler to automatically find and efficiently index peer-reviewed publications based on periodic full-text searches across publisher web portals.
RESULTS
PaperBot may operate stand-alone or it can be easily integrated with other software platforms and knowledge bases. Without user interactions, PaperBot retrieves and stores the bibliographic information (full reference, corresponding email contact, and full-text keyword hits) based on pre-set search logic from a wide range of sources including Elsevier, Wiley, Springer, PubMed/PubMedCentral, Nature, and Google Scholar. Although different publishing sites require different search configurations, the common interface of PaperBot unifies the process from the user perspective. Once saved, all information becomes web accessible allowing efficient triage of articles based on their actual relevance and seamless annotation of suitable metadata content. The platform allows the agile reconfiguration of all key details, such as the selection of search portals, keywords, and metadata dimensions. The tool also provides a one-click option for adding articles manually via digital object identifier or PubMed ID. The microservice architecture of PaperBot implements these capabilities as a loosely coupled collection of distinct modules devised to work separately, as a whole, or to be integrated with or replaced by additional software. All metadata is stored in a schema-less NoSQL database designed to scale efficiently in clusters by minimizing the impedance mismatch between relational model and in-memory data structures.
CONCLUSIONS
As a testbed, we deployed PaperBot to help identify and manage peer-reviewed articles pertaining to digital reconstructions of neuronal morphology in support of the NeuroMorpho.Org data repository. PaperBot enabled the custom definition of both general and neuroscience-specific metadata dimensions, such as animal species, brain region, neuron type, and digital tracing system. Since deployment, PaperBot helped NeuroMorpho.Org more than quintuple the yearly volume of processed information while maintaining a stable personnel workforce.
Topics: Biomedical Research; Databases, Bibliographic; Information Storage and Retrieval; Internet; Metadata; Publications; Software; User-Computer Interface
PubMed: 30678631
DOI: 10.1186/s12859-019-2613-z
- Nucleic Acids Research Jan 2024
Plasmids are mobile genetic elements found in many clades of Archaea and Bacteria. They drive horizontal gene transfer, impacting ecological and evolutionary processes within microbial communities, and hold substantial importance in human health and biotechnology. To support plasmid research and provide scientists with data on an unprecedented diversity of plasmid sequences, we introduce the IMG/PR database, a new resource encompassing 699 973 plasmid sequences derived from genomes, metagenomes and metatranscriptomes. IMG/PR is the first database to provide data on plasmids that were systematically identified from diverse microbiome samples. IMG/PR plasmids are associated with rich metadata that includes geographical and ecosystem information, host taxonomy, similarity to other plasmids, functional annotation, and the presence of genes involved in conjugation and antibiotic resistance. The database offers diverse methods for exploring its extensive plasmid collection, enabling users to navigate plasmids through metadata-centric queries, plasmid comparisons and BLAST searches. The web interface for IMG/PR is accessible at https://img.jgi.doe.gov/pr. Plasmid metadata and sequences can be downloaded from https://genome.jgi.doe.gov/portal/IMG_PR.
Topics: Humans; Metagenome; Metadata; Software; Databases, Genetic; Plasmids; Microbiota
PubMed: 37930866
DOI: 10.1093/nar/gkad964