Ecology and Evolution, Jul 2021
Metadata plays an essential role in the long-term preservation, reuse, and interoperability of data. Nevertheless, creating useful metadata can be sufficiently difficult, and weakly enough incentivized, that many datasets may be accompanied by little or no metadata. One key challenge is, therefore, how to make metadata creation easier and more valuable. We present a solution that involves creating domain-specific metadata schemes that are as complex as necessary and as simple as possible. These goals are achieved by co-development between a metadata expert and the researchers (i.e., the data creators). The final product is a bespoke metadata scheme into which researchers can enter information (and validate it) via the simplest of interfaces: a web browser application and a spreadsheet. We provide the R package dmdScheme (dmdScheme: An R package for working with domain specific MetaData schemes (Version v0.9.22), 2019) for creating a template domain-specific scheme. We describe how to create a domain-specific scheme from this template, including the iterative co-development process, simple methods for using the scheme, and simple methods for quality assessment, improvement, and validation. The process of developing a metadata scheme following the outlined approach was successful, resulting in a metadata scheme that is now used for the data generated in our research group. The validation quickly identifies forgotten as well as inconsistent metadata, thereby improving metadata quality. Multiple output formats are available, including XML. Making the provision of metadata easier while also ensuring high quality must be a priority for data curation initiatives. We show how both objectives are achieved by close collaboration between metadata experts and researchers to create domain-specific schemes. A near-future priority is to provide methods to interface domain-specific schemes with general metadata schemes, such as the Ecological Metadata Language, to increase interoperability.
PubMed: 34306613
DOI: 10.1002/ece3.7764
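To make the workflow in the entry above concrete, here is a minimal Python sketch of scheme-based validation: a bespoke scheme lists required fields and allowed values, and validation reports anything missing or inconsistent. dmdScheme itself is an R package; the code, scheme, and field names below are hypothetical illustrations, not its API.

```python
# Hypothetical sketch of scheme-based metadata validation in the spirit of
# dmdScheme. Scheme contents and field names are illustrative, not the package's.

REQUIRED = {"title", "author", "date_collected"}
ALLOWED_LICENSES = {"CC0", "CC-BY", "CC-BY-NC"}

def validate(record: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the record passes."""
    problems = [f"missing required field: {f}" for f in REQUIRED - record.keys()]
    if record.get("license") not in ALLOWED_LICENSES:
        problems.append(f"license must be one of {sorted(ALLOWED_LICENSES)}")
    return problems

record = {"title": "Microcosm temperatures", "author": "J. Doe", "license": "CC-BY"}
print(validate(record))  # -> ['missing required field: date_collected']
```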
Nucleic Acids Research, Jan 2019
Sharing of research data in public repositories has become best practice in academia. With the accumulation of massive data, network bandwidth and storage requirements are rapidly increasing. The ProteomeXchange (PX) consortium implements a model of centralized metadata and distributed raw data management, which promotes effective data sharing. To facilitate open access to proteome data worldwide, we have developed the integrated proteome resource iProX (http://www.iprox.org) as a public platform for collecting and sharing raw data, analysis results, and metadata obtained from proteomics experiments. The iProX repository employs a web-based proteome data submission process and open sharing of mass spectrometry-based proteomics datasets. It also deploys extensive controlled vocabularies and ontologies to annotate proteomics datasets. Users can submit and access data through a GUI, with fast file transfer provided by an Aspera-based tool. iProX is a full member of the PX consortium; all released datasets are freely accessible to the public. iProX is built on a high-availability architecture and has been deployed as part of the proteomics infrastructure of China, ensuring long-term and stable resource support. iProX will facilitate worldwide analysis and sharing of proteomics experiments.
Topics: Animals; Computational Biology; Databases, Protein; Humans; Information Storage and Retrieval; Internet; Metadata; Proteome; Proteomics; User-Computer Interface
PubMed: 30252093
DOI: 10.1093/nar/gky869
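The controlled-vocabulary annotation that iProX describes can be sketched as mapping free-text submission values onto CV terms. The vocabulary snippet and matching logic below are illustrative stand-ins (the accessions mimic PSI-MS style), not iProX's actual implementation.

```python
# Illustrative sketch of controlled-vocabulary annotation for a proteomics
# submission. The vocabulary entries below are stand-ins, not iProX's actual CV.

PSI_MS_INSTRUMENTS = {
    "MS:1002732": "Orbitrap Fusion Lumos",
    "MS:1001911": "Q Exactive",
}

def annotate(free_text: str) -> tuple[str, str] | None:
    """Map a free-text instrument name to an (accession, preferred label) CV term."""
    for accession, label in PSI_MS_INSTRUMENTS.items():
        if free_text.strip().lower() == label.lower():
            return accession, label
    return None  # unmapped terms would be flagged for manual curation

print(annotate("q exactive"))  # -> ('MS:1001911', 'Q Exactive')
```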
Nucleic Acids Research, Jan 2021
The International Nucleotide Sequence Database Collaboration (INSDC; http://www.insdc.org/) has been the core infrastructure for collecting and providing nucleotide sequence data and metadata for >30 years. Three partner organizations, the DNA Data Bank of Japan (DDBJ) at the National Institute of Genetics in Mishima, Japan; the European Nucleotide Archive (ENA) at the European Molecular Biology Laboratory's European Bioinformatics Institute (EMBL-EBI) in Hinxton, UK; and GenBank at the National Center for Biotechnology Information (NCBI), National Library of Medicine, National Institutes of Health in Bethesda, Maryland, USA, have been collaboratively maintaining the INSDC for the benefit not only of science but of communities of all types worldwide.
Topics: Academies and Institutes; Base Sequence; Databases, Nucleic Acid; Europe; High-Throughput Nucleotide Sequencing; Humans; International Cooperation; Japan; Metadata; Nucleotides; Sequence Analysis, DNA; Sequence Analysis, RNA; United States
PubMed: 33166387
DOI: 10.1093/nar/gkaa967
Metabolites, Aug 2023
Metabolomics has advanced to an extent where it is desirable to standardize and compare data across individual studies. While past standardization work has focused on data acquisition, data processing, and data storage, metabolomics databases are useless without ontology-based descriptions of biological samples and study designs. We introduce here a user-centric tool to automatically standardize sample metadata. Using such a tool in the frontends of metabolomics databases will dramatically increase the FAIRness (Findability, Accessibility, Interoperability, and Reusability) of data, specifically for data reuse and for finding datasets that share comparable sets of metadata, e.g., for study meta-analyses, cross-species analyses, or large-scale metabolomic atlases. SMetaS (Sample Metadata Standardizer) combines a classic database with an API and frontend and is provided in a containerized environment. The tool has two user-centric components. In the first, the user designs a sample metadata matrix and fills the cells using natural-language terminology. In the second, the tool transforms the completed matrix by replacing free-text terms with terms from fixed vocabularies. This transformation is designed to maximize simplicity and is guided by, among other strategies, synonym matching and typographical-error correction using an n-gram/nearest-neighbor approach. The tool enables downstream analysis of submitted studies and samples via simple string equality, supporting FAIR retrospective use.
PubMed: 37623884
DOI: 10.3390/metabo13080941
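The core matching step SMetaS describes (replacing free-text terms with fixed-vocabulary terms, tolerant of typos, via n-grams and nearest neighbors) can be sketched in a few lines. The toy vocabulary and the Jaccard similarity choice below are illustrative; the real pipeline also uses synonym lists and other strategies.

```python
# Minimal sketch of n-gram/nearest-neighbor term standardization: map a
# free-text term to the closest entry in a fixed vocabulary, so that typos
# still resolve to the canonical term. Vocabulary and metric are illustrative.

VOCAB = ["Homo sapiens", "Mus musculus", "Rattus norvegicus"]

def ngrams(s: str, n: int = 3) -> set[str]:
    s = f"  {s.lower()} "  # pad so short strings still yield n-grams
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def standardize(free_text: str) -> str:
    """Return the vocabulary term with the highest Jaccard n-gram similarity."""
    def jaccard(a: set[str], b: set[str]) -> float:
        return len(a & b) / len(a | b)
    query = ngrams(free_text)
    return max(VOCAB, key=lambda term: jaccard(query, ngrams(term)))

print(standardize("homo sapeins"))  # typo still resolves to 'Homo sapiens'
```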
Nucleic Acids Research, Jan 2022
The European Genome-phenome Archive (EGA; https://ega-archive.org/) is a resource for the long-term, secure archiving of all types of potentially identifiable genetic, phenotypic, and clinical data resulting from biomedical research projects. Its mission is to foster reuse of the hosted data, enable reproducibility, and accelerate biomedical and translational research in line with the FAIR principles. Launched in 2008, the EGA has grown quickly and currently archives over 4,500 studies from nearly one thousand institutions. The EGA operates a distributed data access model in which requests are made to the data controller, not to the EGA; the submitter therefore keeps control over who has access to the data and under which conditions. Given the size and value of the data hosted, the EGA is constantly improving its value chain, that is, how the EGA can enhance the value of human health data by facilitating its submission, discovery, access, and distribution, and by leading the design and implementation of the standards and methods necessary to deliver that value chain. The EGA has become a key GA4GH Driver Project, leading multiple development efforts and implementing new standards and tools, and has been appointed an ELIXIR Core Data Resource.
Topics: Confidentiality; Datasets as Topic; Genome, Human; Genotype; History, 20th Century; History, 21st Century; Humans; Information Dissemination; Metadata; Phenomics; Phenotype; Translational Research, Biomedical
PubMed: 34791407
DOI: 10.1093/nar/gkab1059
Radiology, Feb 2022
BACKGROUND
Lack of standardization in CT protocol choice contributes to radiation dose variation.
PURPOSE
To create a framework for assessing radiation doses within broad CT categories defined by body region and clinical imaging indication, and to cluster indications according to the dose required for sufficient image quality.
MATERIALS AND METHODS
This was a retrospective study using Digital Imaging and Communications in Medicine (DICOM) metadata. CT examinations in adults from January 1, 2016 to December 31, 2019 from the University of California San Francisco International CT Dose Registry were grouped into 19 categories according to body region and required radiation dose level. Five body regions had a single dose range (extremities, neck, thoracolumbar spine, combined chest and abdomen, and combined thoracolumbar spine). Five additional regions were subdivided according to dose: head, chest, cardiac, and abdomen each had low, routine, and high dose categories; combined head and neck had routine and high dose categories. For each category, the median and the 75th percentile (the diagnostic reference level [DRL]) of the dose-length product were determined, and the variation in dose within categories versus across categories was calculated and compared using analysis of variance. Relative median and DRL doses (with 95% CIs) comparing high dose with low dose categories were calculated.
RESULTS
Among 4.5 million examinations, median and DRL doses varied approximately 10-fold more between categories than between indications within the same category. For head, chest, abdomen, and cardiac (3,266,546 examinations [72%]), relative median doses were higher in examinations assigned to the high dose categories than in those assigned to the low dose categories, suggesting that the assignment of indications to the broad categories is valid (head, 3.4-fold higher [95% CI: 3.4, 3.5]; chest, 9.6 [95% CI: 9.3, 10.0]; abdomen, 2.4 [95% CI: 2.4, 2.5]; cardiac, 18.1 [95% CI: 17.7, 18.6]). Results were similar for DRL doses (all P < .001).
CONCLUSION
Broad categories based on image quality requirements are a suitable framework for simplifying radiation dose assessment, given the expected variation between and within categories.
Topics: Adult; Aged; Female; Humans; Male; Metadata; Middle Aged; Radiation Dosage; Retrospective Studies; Tomography, X-Ray Computed
PubMed: 34751618
DOI: 10.1148/radiol.2021210591
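The dose summary at the heart of this framework is simple to state: within each broad category, compute the median dose-length product and its 75th percentile (the DRL). A minimal Python sketch with clearly fabricated example values and hypothetical category names:

```python
# Sketch of the per-category dose summary described above: median and 75th
# percentile (DRL) of the dose-length product. All values below are fabricated.

from statistics import median, quantiles

dlp_by_category = {  # dose-length product, mGy*cm (illustrative numbers only)
    "head_routine": [870, 920, 790, 1010, 880, 940],
    "chest_low":    [110, 95, 140, 120, 105, 130],
}

for category, dlps in dlp_by_category.items():
    drl = quantiles(dlps, n=4)[2]  # third quartile = 75th percentile
    print(f"{category}: median={median(dlps):.0f}, DRL={drl:.0f} mGy*cm")
```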
Nature Communications, Oct 2021 (Review)
The amount of public proteomics data is rapidly increasing but there is no standardized format to describe the sample metadata and their relationship with the dataset files in a way that fully supports their understanding or reanalysis. Here we propose to develop the transcriptomics data format MAGE-TAB into a standard representation for proteomics sample metadata. We implement MAGE-TAB-Proteomics in a crowdsourcing project to manually curate over 200 public datasets. We also describe tools and libraries to validate and submit sample metadata-related information to the PRIDE repository. We expect that these developments will improve the reproducibility and facilitate the reanalysis and integration of public proteomics datasets.
Topics: Big Data; Data Analysis; Databases, Protein; Humans; Metadata; Proteomics; Reproducibility of Results; Software; Transcriptome
PubMed: 34615866
DOI: 10.1038/s41467-021-26111-3
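MAGE-TAB represents sample metadata as tab-separated tables, so a basic conformance check reduces to verifying required columns. The column names below follow the SDRF style but are an illustrative subset, not the full MAGE-TAB-Proteomics specification.

```python
# Minimal sketch of checking a MAGE-TAB-style sample-to-data table (SDRF-like
# TSV). The required-column list is an illustrative subset of the real spec.

import csv, io

REQUIRED_COLUMNS = ["source name", "characteristics[organism]", "comment[data file]"]

sdrf_tsv = (
    "source name\tcharacteristics[organism]\tcomment[data file]\n"
    "sample 1\tHomo sapiens\trun01.raw\n"
)

rows = list(csv.DictReader(io.StringIO(sdrf_tsv), delimiter="\t"))
header = rows[0].keys() if rows else []
missing = [c for c in REQUIRED_COLUMNS if c not in header]
print("missing columns:", missing or "none")
```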
Journal of Digital Imaging, Aug 2018 (Review)
Imaging is increasingly being used in dermatology for documentation, diagnosis, and management of cutaneous disease. The lack of standards for dermatologic imaging is an impediment to clinical uptake. Standardization can occur in image acquisition, terminology, interoperability, and metadata. This paper presents the International Skin Imaging Collaboration position on standardization of metadata for dermatologic imaging. Metadata is essential to ensure that dermatologic images are properly managed and interpreted. There are two standards-based approaches to recording and storing metadata in dermatologic imaging: the first uses standard consumer image file formats, and the second uses the file format and metadata model developed for the Digital Imaging and Communications in Medicine (DICOM) standard. DICOM appears to offer an advantage over consumer image file formats for metadata, as it includes all the patient, study, and technical metadata necessary to use images clinically. Consumer image file formats, by contrast, include only technical metadata and must be used in conjunction with another actor (for example, an electronic medical record) to supply the patient and study metadata. The use of DICOM may have ancillary benefits in dermatologic imaging, including leveraging DICOM network and workflow services, interoperability of images and metadata, reuse of existing enterprise imaging infrastructure, greater patient safety, and better compliance with legislative requirements for image retention.
Topics: Dermatology; Dermoscopy; Diagnostic Imaging; Humans; Internationality; Metadata; Radiology Information Systems; Reproducibility of Results; Skin Diseases; United States
PubMed: 29344752
DOI: 10.1007/s10278-017-0045-8
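The paper's central contrast (DICOM objects carry patient and study metadata internally, while consumer formats carry only technical metadata) is easy to see with the pydicom library. The file path below is hypothetical.

```python
# Sketch: a DICOM object carries patient/study metadata alongside the pixel
# data, whereas a JPEG/PNG would need an external system (e.g., an EMR) to
# supply them. The file path is hypothetical.

import pydicom

ds = pydicom.dcmread("lesion_overview.dcm")  # hypothetical dermatology image

# Patient and study metadata travel inside the object itself:
print(ds.get("PatientID"), ds.get("StudyDate"), ds.get("Modality"))

# Technical metadata of the kind a consumer format would also carry:
print(ds.get("Rows"), ds.get("Columns"), ds.get("PhotometricInterpretation"))
```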
Patterns, Apr 2020 (Review)
Entropy is the natural tendency toward decline into disorder over time. Information entropy is the decline in data, information, and understanding that occurs after data are used and results are published. As time passes, the information slowly fades into obscurity. Data discovery alone is not enough to slow this process. High-quality metadata that support understanding and reuse across domains are a critical antidote to information entropy, particularly as they support reuse of the data, adding to community knowledge and wisdom. Ensuring the creation and preservation of these metadata is a responsibility shared across the entire data life cycle, from creation through analysis and publication to archiving and reuse. Repositories can play an important role in this process by augmenting metadata over time with persistent identifiers and the connections they facilitate. Data providers need to work with repositories to encourage metadata evolution as new capabilities and connections emerge.
PubMed: 33205081
DOI: 10.1016/j.patter.2020.100004
Journal of the American Medical Informatics Association, Aug 2021
OBJECTIVE
PubMed has suffered from the author ambiguity problem for many years. Existing studies on author name disambiguation (AND) for PubMed have used only internal metadata for development. However, some of these metadata are incomplete (e.g., a large number of names are only abbreviated and their full names are not available) or insufficiently discriminative. To address this, we present a new disambiguation method, AggAND, which aggregates information from external databases.
MATERIALS AND METHODS
We explore Microsoft Academic Graph, Semantic Scholar, and the PubMed Knowledge Graph to enhance the built-in name metadata, and we extend the internal metadata with external, more discriminative metadata.
RESULTS
Experimental results on the enhanced name metadata demonstrate performance comparable to 3 author identifier systems and superior to the original name metadata. More importantly, our method, AggAND, which incorporates both the enhanced name metadata and the extended metadata, yields F1 scores of 95.80% and 93.71% on 2 datasets, outperforming the state-of-the-art method by a large margin (3.61% and 6.55%, respectively).
CONCLUSIONS
The feasibility and good performance of our methods not only underscore the importance of external databases for disambiguation but also point to a promising direction for future AND studies, in which information aggregated from multiple bibliographic databases can be effective in improving disambiguation performance. The methodology shown here can be generalized to bibliographic databases beyond PubMed. Our code and data are available online (https://github.com/carmanzhang/PubMed-AND-method).
Topics: Databases, Bibliographic; Databases, Factual; Metadata; PubMed; Semantics
PubMed: 34180522
DOI: 10.1093/jamia/ocab095
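A toy sketch of the underlying disambiguation idea: after blocking papers by an ambiguous name, link records whose enriched metadata (affiliations, coauthors) overlap. Everything below (records, features, linking rule) is fabricated for illustration; AggAND's actual aggregation and model are far richer.

```python
# Toy author-name-disambiguation sketch: link records for one ambiguous name
# when their enriched metadata overlap. All records and rules are fabricated.

records = [
    {"pmid": "1", "affiliation": "Univ A", "coauthors": {"Lee", "Kim"}},
    {"pmid": "2", "affiliation": "Univ A", "coauthors": {"Kim", "Park"}},
    {"pmid": "3", "affiliation": "Univ B", "coauthors": {"Garcia"}},
]

def same_author(a: dict, b: dict) -> bool:
    """Link two records if they share an affiliation or any coauthor."""
    return a["affiliation"] == b["affiliation"] or bool(a["coauthors"] & b["coauthors"])

pairs = [(x["pmid"], y["pmid"])
         for i, x in enumerate(records) for y in records[i + 1:]
         if same_author(x, y)]
print(pairs)  # -> [('1', '2')]
```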