The Bone & Joint Journal, Dec 2017 (Review)
'Big data' is a term for data sets that are so large or complex that traditional data processing applications are inadequate. Billions of dollars have been spent on attempts to build predictive tools from large sets of poorly controlled healthcare metadata. Companies often sell reports at a physician or facility level based on various flawed data sources, and comparative websites of 'publicly reported data' purport to educate the public. Physicians evaluating analytic reports built from healthcare metadata should be aware of the pitfalls in data definitions, data clarity, data relevance, data sources and data cleaning. Cite this article: Bone Joint J 2017;99-B:1571-6.
Topics: Data Mining; Datasets as Topic; Delivery of Health Care; Humans; Metadata
PubMed: 29212678
DOI: 10.1302/0301-620X.99B12.BJJ-2017-0939
Nature Communications, May 2022
DNA-based data storage platforms traditionally encode information only in the nucleotide sequence of the molecule. Here we report on a two-dimensional molecular data storage system that records information in both the sequence and the backbone structure of DNA and performs nontrivial joint data encoding, decoding and processing. Our 2DDNA method efficiently stores images in synthetic DNA and embeds pertinent metadata as nicks in the DNA backbone. To avoid costly worst-case redundancy for correcting sequencing/rewriting errors and to mitigate issues associated with mismatched decoding parameters, we develop machine learning techniques for automatic discoloration detection and image inpainting. The 2DDNA platform is experimentally tested by reconstructing a library of images with undetectable or small visual degradation after readout processing, and by erasing and rewriting copyright metadata encoded in nicks. Our results demonstrate that DNA can serve both as a write-once and rewritable memory for heterogeneous data and that data can be erased in a permanent, privacy-preserving manner. Moreover, the storage system can be made robust to degrading channel qualities while avoiding global error-correction redundancy.
Topics: DNA; Gene Library; Information Storage and Retrieval; Machine Learning; Metadata
PubMed: 35624096
DOI: 10.1038/s41467-022-30140-x
Database: The Journal of Biological..., Jan 2016
Enormous amounts of biomedical data have been and are being produced by investigators all over the world. However, one crucial and limiting factor in data reuse is accurate, structured and complete description of the data, or data about the data, defined as metadata. We propose a framework to predict structured metadata terms from unstructured metadata, improving the quality and quantity of metadata, using the Gene Expression Omnibus (GEO) microarray database. Our framework consists of classifiers trained using term frequency-inverse document frequency (TF-IDF) features, and a second approach based on topics modeled with Latent Dirichlet Allocation (LDA) to reduce the dimensionality of the unstructured data. Our results on the GEO database show that structured metadata terms are predicted most accurately by the TF-IDF approach, followed by LDA, with both outperforming the majority-vote baseline. While some accuracy is lost through the dimensionality reduction of LDA, the difference is small for elements with few possible values, and there is still a large improvement over the majority-classifier baseline. Overall, this is a promising approach for metadata prediction that is likely to be applicable to other datasets, with implications for researchers interested in biomedical metadata curation and prediction.
Topics: Computational Biology; Data Mining; Databases, Genetic; Gene Expression Profiling; Humans; Metadata; Models, Statistical
PubMed: 28637268
DOI: 10.1093/database/baw080
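The TF-IDF side of the framework can be sketched in a few lines: weight each token by term frequency times inverse document frequency, then predict a structured term for a new record from its most similar training record. This is a minimal nearest-neighbour sketch, not the paper's actual classifiers; the example documents and labels are invented.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """TF-IDF weight vectors for a list of tokenised documents."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))                      # document frequency per term
    idf = {t: math.log(n / df[t]) for t in df}
    vecs = []
    for doc in docs:
        tf = Counter(doc)
        vecs.append({t: (c / len(doc)) * idf[t] for t, c in tf.items()})
    return vecs, idf

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def predict(train_docs, train_labels, query_tokens):
    """Assign the structured term of the most TF-IDF-similar training record."""
    vecs, idf = tfidf_vectors(train_docs)
    tf = Counter(query_tokens)
    qv = {t: (c / len(query_tokens)) * idf.get(t, 0.0) for t, c in tf.items()}
    best = max(range(len(vecs)), key=lambda i: cosine(qv, vecs[i]))
    return train_labels[best]
```

A majority-vote baseline, by contrast, would return the most frequent label regardless of the query, which is the comparison the abstract reports both approaches beating.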
Nucleic Acids Research, Jan 2019
The BioSamples database at EMBL-EBI provides a central hub for sample metadata storage and linkage to other EMBL-EBI resources. BioSamples has recently undergone major changes, both in terms of data content and supporting infrastructure. The data content has more than doubled from around 2 million samples in 2014 to just over 5 million samples in 2018. Fast, reciprocal data exchange was fully established between sister Biosample databases and other INSDC partners, enabling a worldwide common representation and centralization of sample metadata. The BioSamples platform has been upgraded to accommodate anticipated increases in the number of submissions via GA4GH driver projects such as the Human Cell Atlas and the EGA, as well as from mirroring of NCBI dbGaP data. The BioSamples database is now the authoritative repository for all INSDC sample metadata, an ELIXIR Deposition Database for Biomolecular Data and the EMBL-EBI sample metadata hub. To support faster turnaround for sample submission, and to increase scalability and resilience, we have upgraded the BioSamples database backend storage, APIs and user interface. Finally, the website has been redesigned to allow search and retrieval of records based on specific filters, such as 'disease' or 'organism'. These changes are targeted at answering current use cases as well as providing functionalities for future emerging and anticipated developments. Availability: The BioSamples database is freely available at http://www.ebi.ac.uk/biosamples. Content is distributed under the EMBL-EBI Terms of Use available at https://www.ebi.ac.uk/about/terms-of-use.
Topics: Biological Specimen Banks; Computational Biology; Databases, Genetic; Databases, Nucleic Acid; Genomics; Humans; Information Storage and Retrieval; Internet; Metadata; User-Computer Interface
PubMed: 30407529
DOI: 10.1093/nar/gky1061
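Records retrieved from the BioSamples API arrive as JSON in which each attribute under `characteristics` maps to a list of objects with a `text` value (and optionally ontology terms). A small sketch of flattening such a record is shown below; the nested layout reflects the BioSamples JSON as commonly returned, but treat the exact field names as an assumption to verify against the current API, and note the accession used here is a made-up placeholder.

```python
import json

def flatten_characteristics(record: dict) -> dict:
    """Flatten a BioSamples-style record into accession + attribute lists."""
    flat = {"accession": record.get("accession")}
    for attr, values in record.get("characteristics", {}).items():
        flat[attr] = [v.get("text") for v in values]
    return flat

# Hypothetical record in the BioSamples JSON shape (accession is invented).
sample = json.loads("""{
  "accession": "SAMEA0000001",
  "characteristics": {
    "organism": [{"text": "Homo sapiens"}],
    "disease":  [{"text": "none reported"}]
  }
}""")
```

Flat attribute lists like these are what make the website's filter-based search (e.g. on 'disease' or 'organism') straightforward to serve.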
Studies in Health Technology and..., Aug 2019
Semantic standards and human language technologies are key enablers for semantic interoperability across heterogeneous document and data collections in clinical information systems. Data provenance is receiving increasing attention, and it is especially critical where clinical data are automatically extracted from original documents, e.g. by text mining. This paper demonstrates how the output of a commercial clinical text-mining tool can be harmonised with FHIR, the leading clinical information model standard. Character ranges that indicate the origin of an annotation, and machine-generated confidence values, were identified as the crucial elements of data provenance needed to enrich text-mining results. We have specified and requested the necessary extensions to the FHIR standard and demonstrated how, as a result, important metadata describing the processes that generate FHIR instances from clinical narratives can be embedded.
Topics: Data Mining; Delivery of Health Care; Electronic Health Records; Humans; Metadata; Semantics
PubMed: 31437890
DOI: 10.3233/SHTI190188
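The shape of such provenance metadata can be sketched with FHIR's standard extension mechanism (nested `url` + `value[x]` elements). The extension URL and sub-field names below are hypothetical placeholders, not the extensions the authors actually registered; only the nesting pattern follows the FHIR specification.

```python
def annotate_with_provenance(resource: dict, start: int, end: int,
                             confidence: float) -> dict:
    """Attach character-range and confidence provenance as a FHIR extension.

    The extension URL is a made-up example, not a registered StructureDefinition.
    """
    resource.setdefault("extension", []).append({
        "url": "http://example.org/fhir/StructureDefinition/nlp-provenance",
        "extension": [
            {"url": "offsetStart", "valueInteger": start},
            {"url": "offsetEnd", "valueInteger": end},
            {"url": "confidence", "valueDecimal": confidence},
        ],
    })
    return resource

# Hypothetical Condition extracted by a text-mining tool from a narrative.
condition = annotate_with_provenance(
    {"resourceType": "Condition",
     "code": {"text": "type 2 diabetes mellitus"}},
    start=112, end=136, confidence=0.94)
```

Keeping the character range on the resource lets a reviewer jump from the structured FHIR instance back to the exact span of the source document that produced it.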
F1000Research, 2021
Many types of data from genomic analyses can be represented as genomic tracks, features linked to the genomic coordinates of a reference genome. Examples of such data are epigenetic DNA methylation data, ChIP-seq peaks, germline or somatic DNA variants, as well as RNA-seq expression levels. Researchers often face difficulties in locating, accessing and combining relevant tracks from external sources, as well as locating the raw data, reducing the value of the generated information. We propose to advance the application of FAIR data principles (Findable, Accessible, Interoperable, and Reusable) to produce searchable metadata for genomic tracks. Findability and Accessibility of metadata can then be ensured by a track search service that integrates globally identifiable metadata from various track hubs in the Track Hub Registry and other relevant repositories. Interoperability and Reusability need to be ensured by the specification and implementation of a basic set of recommendations for metadata. We have tested this concept by developing such a specification in a JSON Schema, called FAIRtracks, and have integrated it into a novel track search service, called TrackFind. We demonstrate practical usage by importing datasets through TrackFind into existing examples of relevant analytical tools for genomic tracks: EPICO and the GSuite HyperBrowser. We here provide a first iteration of a draft standard for genomic track metadata, as well as the accompanying software ecosystem. It can easily be adapted or extended to future needs of the research community regarding data, methods and tools, balancing the requirements of both data submitters and analytical end-users.
Topics: Ecosystem; Genome; Genomics; Metadata; Software
PubMed: 34249331
DOI: 10.12688/f1000research.28449.1
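The role of a schema like FAIRtracks can be illustrated with a minimal, hand-rolled check of a track-metadata record against a JSON-Schema-style specification (required fields plus controlled vocabularies). The field names and allowed values below are illustrative placeholders, not the actual FAIRtracks schema, which is published as a full JSON Schema document.

```python
# Hypothetical mini-spec in the spirit of a JSON Schema for track metadata.
SCHEMA = {
    "required": ["track_id", "genome_assembly", "experiment_type", "file_url"],
    "enums": {"genome_assembly": {"GRCh37", "GRCh38", "GRCm39"}},
}

def validate(record: dict, schema: dict = SCHEMA) -> list:
    """Return a list of violations; an empty list means the record passes."""
    errors = [f"missing required field: {f}"
              for f in schema["required"] if f not in record]
    for field, allowed in schema["enums"].items():
        if field in record and record[field] not in allowed:
            errors.append(f"{field}: {record[field]!r} not in {sorted(allowed)}")
    return errors
```

Validation of this kind is what lets a service such as TrackFind index metadata from heterogeneous track hubs with confidence that every record carries the fields a search needs.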
Neuroinformatics, Jul 2019
Knowledge discovery via an informatics resource is constrained by the completeness of the resource, both in terms of the amount of data it contains and in terms of the metadata that exists to describe the data. Increasing completeness in one of these categories risks reducing completeness in the other, because manually curating metadata is time consuming and is restricted by familiarity with both the data and the metadata annotation scheme. The diverse interests of a research community may drive a resource to have hundreds of metadata tags with few examples of each, making it challenging for humans or machine learning algorithms to learn how to assign metadata tags properly. We demonstrate with ModelDB, a computational neuroscience model discovery resource, that manually curated, regular-expression-based rules can overcome this challenge: existing texts from data providers are parsed during user data entry to suggest metadata annotations, and the providers are prompted to suggest other related annotations, rather than leaving the task to a curator. In the ModelDB implementation, analyzing the abstract identified 6.4 metadata tags per abstract at 79% precision. Using the full text produced higher recall but low precision (41%), and the title alone produced few (1.3) metadata annotations per entry; we thus recommend data providers use their abstract during upload. Grouping the possible metadata annotations into categories (e.g. cell type, biological topic) revealed that precision and recall for the different text sources vary by category. Given this proof of concept, other bioinformatics resources can likewise improve the quality of their metadata by adopting our approach of prompting data uploaders with relevant metadata, at the minimal cost of formalizing rules for each potential metadata annotation.
Topics: Animals; Computational Biology; Data Analysis; Humans; Machine Learning; Metadata
PubMed: 30382537
DOI: 10.1007/s12021-018-9403-z
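The rule-based approach can be sketched directly: each metadata tag is paired with a manually curated regular expression, and every rule that matches the submitted abstract yields a suggested annotation for the uploader to confirm or reject. The rules below are invented illustrations, not ModelDB's actual rule set.

```python
import re

# Hypothetical tag -> regular-expression rules (not ModelDB's real rules).
RULES = {
    "Cell type: pyramidal cell": re.compile(r"\bpyramidal\b", re.I),
    "Ion channel: sodium": re.compile(r"\b(sodium|Na\+?)\s+(channel|current)", re.I),
    "Topic: synaptic plasticity": re.compile(r"\b(LTP|LTD|synaptic plasticity)\b", re.I),
}

def suggest_tags(abstract: str) -> list:
    """Return every tag whose rule matches the submitted abstract text."""
    return [tag for tag, rx in RULES.items() if rx.search(abstract)]
```

Because the suggestions are only prompts, a false match costs the uploader one click to reject, which is why the reported 79% precision on abstracts is workable in practice.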
Studies in Health Technology and..., Aug 2019
A database of full-body CT scans and associated lifestyle and health data from decedents who underwent an autopsy at the Office of the Medical Investigator (OMI) is under construction. The dataset has 68 metadata fields containing data from the OMI's database and interviews with next of kin. Some metadata fields could be mapped to existing standards, but the majority of fields required some modifications to current standards or the creation of new standards.
Topics: Databases, Factual; Metadata
PubMed: 31438164
DOI: 10.3233/SHTI190467
Nucleic Acids Research, Jan 2022
Single-cell bisulfite sequencing methods are widely used to assess epigenomic heterogeneity across cell states. Over the past few years, large amounts of data have been generated, facilitating a deeper understanding of the epigenetic regulation of many key biological processes, including early embryonic development, cell differentiation and tumor progression. There is an urgent need for a functional resource platform built on this massive amount of data. Here, we present scMethBank, the first open-access and comprehensive database dedicated to the collection, integration, analysis and visualization of single-cell DNA methylation data and metadata. The current release of scMethBank includes processed single-cell bisulfite sequencing data and curated metadata for 8328 samples derived from 15 public single-cell datasets, involving two species (human and mouse), 29 cell types and two diseases. In summary, scMethBank aims to assist researchers interested in cell heterogeneity in exploring and utilizing whole-genome methylation data at the single-cell level by providing browse, search, visualization and download functions, along with user-friendly online tools. The database is accessible at: https://ngdc.cncb.ac.cn/methbank/scm/.
Topics: Animals; Chromosome Mapping; DNA Methylation; Databases, Genetic; Datasets as Topic; Epigenesis, Genetic; Genome; Humans; Internet; Metadata; Mice; Molecular Sequence Annotation; Single-Cell Analysis; Software; Whole Genome Sequencing
PubMed: 34570235
DOI: 10.1093/nar/gkab833
Animal Genetics, Dec 2018
The Functional Annotation of ANimal Genomes (FAANG) project aims, through a coordinated international effort, to provide high quality functional annotation of animal genomes with an initial focus on farmed and companion animals. A key goal of the initiative is to ensure high quality and rich supporting metadata to describe the project's animals, specimens, cell cultures and experimental assays. By defining rich sample and experimental metadata standards and promoting best practices in data description, deposition and openness, FAANG champions higher quality and reusability of published datasets. FAANG has established a Data Coordination Centre, which sits at the heart of the Metadata and Data Sharing Committee. It continues to evolve the metadata standards, support submissions and, crucially, create powerful and accessible tools to support deposition and validation of metadata. FAANG conforms to the findable, accessible, interoperable, and reusable (FAIR) data principles, with high quality, open access and functionally interlinked data. In addition to data generated by FAANG members and specific FAANG projects, existing datasets that meet the main standards, or the more permissive legacy standards, are incorporated into a central, focused, functional data resource portal for the entire farmed and companion animal community. Through clear and effective metadata standards, validation and conversion software, combined with promotion of best practices in metadata implementation, FAANG aims to maximise the effectiveness and inter-comparability of assay data. This supports the community in creating a rich genome-to-phenotype resource and promotes continuing improvements in animal data standards as a whole.
Topics: Animals; Data Curation; Genomics; Livestock; Metadata; Pets; Software
PubMed: 30311252
DOI: 10.1111/age.12736
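The conversion software mentioned above can be sketched as a field-renaming pass that lifts records meeting a legacy standard up to the current field names. The mappings here are hypothetical examples for illustration, not FAANG's published ruleset.

```python
# Hypothetical legacy-to-current field mappings (not FAANG's real ruleset).
LEGACY_FIELD_MAP = {
    "animal_breed": "breed",
    "sampling_date": "specimen_collection_date",
    "organ": "organism_part",
}

def convert_legacy(record: dict) -> dict:
    """Rename legacy keys to current standard names; other keys pass through."""
    return {LEGACY_FIELD_MAP.get(k, k): v for k, v in record.items()}
```

Converting legacy records at ingest, rather than maintaining two parallel standards in the portal, is what keeps datasets from different eras inter-comparable for end users.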