- GigaScience, Sep 2021
BACKGROUND
The Investigation/Study/Assay (ISA) Metadata Framework is an established and widely used set of open source community specifications and software tools for enabling discovery, exchange, and publication of metadata from experiments in the life sciences. The original ISA software suite provided a set of user-facing Java tools for creating and manipulating the information structured in ISA-Tab, a now widely used tabular format. To make the ISA framework more accessible to machines and enable programmatic manipulation of experiment metadata, the JSON serialization ISA-JSON was developed.
RESULTS
In this work, we present the ISA API, a Python library for creating, editing, parsing, and validating the ISA-Tab and ISA-JSON formats using a common data model engineered as Python object classes. We describe the ISA API feature set, early adopters, and its growing user community.
CONCLUSIONS
The ISA API provides users with rich programmatic metadata-handling functionality to support automation, a common interface, and an interoperable medium between the 2 ISA formats, as well as with other life science data formats required for depositing data in public databases.
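The abstract's central idea, one common data model with two serializations (ISA-Tab and ISA-JSON), can be illustrated with a minimal stdlib sketch. The dataclasses below mirror the Investigation/Study/Assay concepts but are not the real isatools API; all names are illustrative:

```python
from dataclasses import dataclass, field, asdict
import csv, io, json

# Illustrative only: these classes mirror the ISA concepts
# (Investigation > Study > Assay) but are NOT the isatools API.
@dataclass
class Assay:
    measurement_type: str
    technology_type: str

@dataclass
class Study:
    identifier: str
    title: str
    assays: list = field(default_factory=list)

@dataclass
class Investigation:
    identifier: str
    title: str
    studies: list = field(default_factory=list)

def to_isajson(inv: Investigation) -> str:
    """Serialize the shared model as JSON (the ISA-JSON idea)."""
    return json.dumps(asdict(inv), indent=2)

def to_isatab(inv: Investigation) -> str:
    """Serialize the same model as tab-separated rows (the ISA-Tab idea)."""
    buf = io.StringIO()
    w = csv.writer(buf, delimiter="\t")
    w.writerow(["INVESTIGATION", inv.identifier, inv.title])
    for s in inv.studies:
        w.writerow(["STUDY", s.identifier, s.title])
        for a in s.assays:
            w.writerow(["ASSAY", a.measurement_type, a.technology_type])
    return buf.getvalue()

inv = Investigation("i1", "Example investigation",
                    [Study("s1", "Example study",
                           [Assay("metabolite profiling", "mass spectrometry")])])
json_doc = to_isajson(inv)
tab_doc = to_isatab(inv)
```

Because both serializers consume the same in-memory objects, a round-trip between formats never touches the files directly, which is the interoperability property the abstract describes.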
Topics: Biological Science Disciplines; Databases, Factual; Metadata; Software
PubMed: 34528664
DOI: 10.1093/gigascience/giab060
- Frontiers in Cellular and Infection..., 2018
Review
Eukaryotic parasites and pathogens continue to cause some of the most detrimental and difficult-to-treat diseases (or disease states) in both humans and animals, while also continuously expanding into non-endemic countries. Combined with the ever-growing number of reports on drug resistance and the lack of effective treatment programs for many metazoan diseases, the impact that these organisms will have on quality of life remains a global challenge. Vaccination as an effective prophylactic treatment has been demonstrated for well over 200 years for bacterial and viral diseases. From the earliest variolation procedures to the cutting-edge technologies employed today, many protective preparations have been successfully developed for use in both medical and veterinary applications. In spite of the successes of these applications in the discovery of subunit vaccines against prokaryotic pathogens, not many targets have been successfully developed into vaccines directed against metazoan parasites. With the current increase in -omics technologies and metadata for eukaryotic parasites, target discovery for vaccine development can be expedited. However, a good understanding of the host/vector/pathogen interface is needed to understand the underlying biological, biochemical, and immunological components that will confer a protective response in the host animal. Therefore, systems biology is rapidly coming of age in the pursuit of effective parasite vaccines. Despite the difficulties, a number of approaches have been developed and applied to parasitic helminths and arthropods. This review will focus on key aspects of vaccine development that require attention in the battle against these metazoan parasites, as well as successes in the field of vaccine development for helminthiases and ectoparasites. Lastly, we propose future directions for applying these successes in the pursuit of next-generation vaccines.
Topics: Animals; Antigens, Protozoan; Arthropods; Drug Discovery; Drug Resistance; Helminths; Host-Parasite Interactions; Metadata; Parasites; Parasitic Diseases, Animal; Protozoan Vaccines; Systems Biology; Vaccination
PubMed: 29594064
DOI: 10.3389/fcimb.2018.00067
- Medical Physics, Nov 2020
PURPOSE
One of the most frequently cited radiomics investigations showed that features automatically extracted from routine clinical images could be used in prognostic modeling. These images have been made publicly accessible via The Cancer Imaging Archive (TCIA). There have been numerous requests for additional explanatory metadata on the following datasets: RIDER, Interobserver, Lung1, and Head-Neck1. To support repeatability, reproducibility, generalizability, and transparency in radiomics research, we publish the subjects' clinical data, extracted radiomics features, and Digital Imaging and Communications in Medicine (DICOM) headers of these four datasets with descriptive metadata, in order to be more compliant with the findable, accessible, interoperable, and reusable (FAIR) data management principles.
ACQUISITION AND VALIDATION METHODS
Overall survival time intervals were updated using a national citizens registry after internal ethics board approval. Spatial offsets of the primary gross tumor volume (GTV) regions of interest (ROIs) associated with the Lung1 CT series were improved on the TCIA. GTV radiomics features were extracted using the open-source Ontology-Guided Radiomics Analysis Workflow (O-RAW). We reshaped the output of O-RAW to map features and extraction settings to the latest version of the Radiomics Ontology, so as to be consistent with the Image Biomarker Standardization Initiative (IBSI). DICOM metadata were extracted using a research version of Semantic DICOM (SOHARD GmbH, Fuerth, Germany). Subjects' clinical data were described with metadata using the Radiation Oncology Ontology. All of the above were published as Resource Description Framework (RDF) data, that is, triples. Example SPARQL queries, intended to illustrate how to exploit this data submission, are shared with the reader for use on the online triples archive.
DATA FORMAT
The accumulated RDF data are publicly accessible through a SPARQL endpoint where the triples are archived. The endpoint is remotely queried through a graph database web application at http://sparql.cancerdata.org. SPARQL queries are intrinsically federated, such that we can efficiently cross-reference clinical, DICOM, and radiomics data within a single query, while being agnostic to the original data format and coding system. The federated queries work in the same way even if the RDF data were partitioned across multiple servers and dispersed physical locations.
POTENTIAL APPLICATIONS
The public availability of these data resources is intended to support radiomics feature replication, repeatability, and reproducibility studies by the academic community. The example SPARQL queries may be freely used and modified by readers depending on their research question. Data interoperability and reusability are supported by referencing existing public ontologies. The RDF data are readily findable and accessible through the aforementioned link. Scripts used to create the RDF are made available at a code repository linked to this submission: https://gitlab.com/UM-CDS/FAIR-compliant_clinical_radiomics_and_DICOM_metadata.
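A federated query of the kind described, cross-referencing clinical, DICOM-derived, and radiomics triples in a single SELECT, might look like the sketch below. The prefixes and predicate names are hypothetical placeholders, not the exact terms used in the published triple store; only the endpoint URL comes from the text:

```python
from textwrap import dedent

# Endpoint named in the text; the query itself uses invented predicates.
ENDPOINT = "http://sparql.cancerdata.org"

query = dedent("""\
    PREFIX roo: <http://www.cancerdata.org/roo/>
    PREFIX ro:  <http://www.radiomics.org/RO/>
    SELECT ?patient ?survival ?featureValue
    WHERE {
      ?patient roo:hasSurvivalTimeDays ?survival .   # clinical triple
      ?patient roo:hasScan ?scan .                   # DICOM-derived triple
      ?scan    ro:hasFeatureValue ?featureValue .    # radiomics triple
    }
    LIMIT 10
    """)
```

The point of the pattern is that the three graph patterns join on shared variables (`?patient`, `?scan`), so one query spans all three data domains regardless of where the triples physically live.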
Topics: Databases, Factual; Germany; Humans; Metadata; Reproducibility of Results; Workflow
PubMed: 32521049
DOI: 10.1002/mp.14322
- Nucleic Acids Research, Jan 2024
MetaboLights is a global database for metabolomics studies including the raw experimental data and the associated metadata. The database is cross-species and cross-technique and covers metabolite structures and their reference spectra as well as their biological roles and locations where available. MetaboLights is the recommended metabolomics repository for a number of leading journals and ELIXIR, the European infrastructure for life science information. In this article, we describe the continued growth and diversity of submissions and the significant developments in recent years. In particular, we highlight MetaboLights Labs, our new Galaxy Project instance with repository-scale standardized workflows, and how data made public on MetaboLights are being reused by the community. Metabolomics resources and data are available under the EMBL-EBI's Terms of Use at https://www.ebi.ac.uk/metabolights and under Apache 2.0 at https://github.com/EBI-Metabolights.
Topics: Metabolomics; Metadata; Databases, Genetic; Internet
PubMed: 37971328
DOI: 10.1093/nar/gkad1045
- Sensors (Basel, Switzerland), Apr 2022
Review
This work proposes a novel approach for automatically identifying all instruments present in an audio excerpt using sets of individual convolutional neural networks (CNNs), one per tested instrument. The paper starts with background, i.e., a metadata description and a review of related work on musical instrument identification, focusing on the tasks performed, input types, algorithms employed, and metrics used. This is followed by a description of the dataset prepared for the experiment and its division into training, validation, and evaluation subsets. Then, the analyzed architecture of the neural network model is presented. Based on the described model, training is performed, and several quality metrics are determined for the training and validation sets. The results of evaluating the trained network on a separate set are then shown, with detailed values for precision, recall, and the numbers of true and false positive and negative detections. The model's efficiency is high, with metric values ranging from 0.86 for the guitar to 0.99 for drums. Finally, a discussion and a summary of the results follow.
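The evaluation scheme described (one binary detector per instrument, scored with precision and recall) can be sketched as follows. The detector outputs are hard-coded stand-ins for real CNN predictions, and the label values are invented:

```python
# One binary detector per instrument: each excerpt gets a 0/1 label and a
# 0/1 prediction per instrument, scored independently.

def precision_recall(y_true, y_pred):
    """Precision and recall from parallel lists of 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Ground truth and predictions per instrument over five audio excerpts
# (stand-ins for per-instrument CNN outputs).
labels = {"guitar": [1, 0, 1, 1, 0], "drums": [1, 1, 0, 1, 1]}
preds  = {"guitar": [1, 0, 0, 1, 0], "drums": [1, 1, 0, 1, 1]}

metrics = {inst: precision_recall(labels[inst], preds[inst]) for inst in labels}
```

Scoring each instrument's detector separately, rather than as one multi-class classifier, is what lets the paper report per-instrument metrics such as 0.86 for guitar and 0.99 for drums.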
Topics: Algorithms; Benchmarking; Deep Learning; Metadata; Neural Networks, Computer
PubMed: 35459018
DOI: 10.3390/s22083033
- Scientific Data, Mar 2021
Using brain atlases to localize regions of interest is a requirement for making neuroscientifically valid statistical inferences. These atlases, represented in volumetric or surface coordinate spaces, can describe brain topology from a variety of perspectives. Although many human brain atlases have circulated in the field over the past fifty years, limited effort has been devoted to their standardization. Standardization can facilitate consistency and transparency with respect to orientation, resolution, labeling scheme, file storage format, and coordinate space designation. Our group has worked to consolidate an extensive selection of popular human brain atlases into a single, curated, open-source library, where they are stored following a standardized protocol with accompanying metadata, which can serve as the basis for future atlases. The repository containing the atlases, the specification, and the relevant transformation functions is available in the neuroparc OSF registered repository or at https://github.com/neurodata/neuroparc.
Topics: Brain; Brain Mapping; Humans; Image Processing, Computer-Assisted; Metadata
PubMed: 33686079
DOI: 10.1038/s41597-021-00849-3
- Animal Genetics, Dec 2018
The Functional Annotation of ANimal Genomes (FAANG) project aims, through a coordinated international effort, to provide high quality functional annotation of animal genomes with an initial focus on farmed and companion animals. A key goal of the initiative is to ensure high quality and rich supporting metadata to describe the project's animals, specimens, cell cultures and experimental assays. By defining rich sample and experimental metadata standards and promoting best practices in data descriptions, deposition and openness, FAANG champions higher quality and reusability of published datasets. FAANG has established a Data Coordination Centre, which sits at the heart of the Metadata and Data Sharing Committee. It continues to evolve the metadata standards, support submissions and, crucially, create powerful and accessible tools to support deposition and validation of metadata. FAANG conforms to the findable, accessible, interoperable, and reusable (FAIR) data principles, with high quality, open access and functionally interlinked data. In addition to data generated by FAANG members and specific FAANG projects, existing datasets that meet the main (or more permissive legacy) standards are incorporated into a central, focused, functional data resource portal for the entire farmed and companion animal community. Through clear and effective metadata standards, validation and conversion software, combined with promotion of best practices in metadata implementation, FAANG aims to maximise effectiveness and inter-comparability of assay data. This supports the community to create a rich genome-to-phenotype resource and promotes continuing improvements in animal data standards as a whole.
Topics: Animals; Data Curation; Genomics; Livestock; Metadata; Pets; Software
PubMed: 30311252
DOI: 10.1111/age.12736
- Conservation Biology : the Journal of..., Aug 2023
Genetic diversity within species represents a fundamental yet underappreciated level of biodiversity. Because genetic diversity can indicate species resilience to changing climate, its measurement is relevant to many national and global conservation policy targets. Many studies produce large amounts of genome-scale genetic diversity data for wild populations, but most (87%) do not include the associated spatial and temporal metadata necessary for them to be reused in monitoring programs or for acknowledging the sovereignty of nations or Indigenous peoples. We undertook a distributed datathon to quantify the availability of these missing metadata and to test the hypothesis that their availability decays with time. We also worked to remediate missing metadata by extracting them from associated published papers, online repositories, and direct communication with authors. Starting with 848 candidate genomic data sets (reduced representation and whole genome) from the International Nucleotide Sequence Database Collaboration, we determined that 561 contained mostly samples from wild populations. We successfully restored spatiotemporal metadata for 78% of these 561 data sets (n = 440 data sets with data on 45,105 individuals from 762 species in 17 phyla). Examining papers and online repositories was much more fruitful than contacting 351 authors, who replied to our email requests 45% of the time. Overall, 23% of our email queries to authors unearthed useful metadata. The probability of retrieving spatiotemporal metadata declined significantly as the age of the data set increased: a 13.5% yearly decrease in metadata associated with published papers or online repositories and up to a 22% yearly decrease in metadata that were only available from authors. This rapid decay in metadata availability, mirrored in studies of other types of biological data, should motivate swift updates to data-sharing policies and researcher practices to ensure that the valuable context provided by metadata is not lost to conservation science forever.
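Assuming a constant yearly loss rate, the reported decreases imply roughly exponential decay of metadata recoverability. The sketch below is an illustrative model built on that assumption, not the paper's fitted curve:

```python
# Reported rates: 13.5%/year loss for metadata in papers/repositories,
# up to 22%/year for metadata held only by authors. Under a constant
# yearly loss rate, the recoverable fraction after n years is (1-rate)**n.

def recoverable_fraction(rate_per_year: float, years: int) -> float:
    """Fraction of metadata still recoverable after `years` years."""
    return (1 - rate_per_year) ** years

papers_10y = recoverable_fraction(0.135, 10)   # papers/repositories
authors_10y = recoverable_fraction(0.22, 10)   # author-held only
```

Under this simple model, after a decade only about a quarter of paper/repository metadata and well under a tenth of author-held metadata would remain recoverable, which is the urgency the abstract's conclusion conveys.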
Topics: Humans; Conservation of Natural Resources; Metadata; Biodiversity; Probability; Genetic Variation
PubMed: 36704891
DOI: 10.1111/cobi.14061
- PloS One, 2022
OBJECTIVES
To adopt the FAIR principles (Findable, Accessible, Interoperable, Reusable) to enhance data sharing, the Cure Sickle Cell Initiative (CureSCi) MetaData Catalog (MDC) was developed to make Sickle Cell Disease (SCD) study datasets more Findable by curating study metadata and making them available through an open-access web portal.
METHODS
Study metadata, including the study protocol, data collection forms, and data dictionaries, describe a study's patient-level data. We curated key metadata of 16 SCD studies in a three-tiered conceptual framework of category, subcategory, and data element, using ontologies and controlled vocabularies to organize the study variables. We developed the CureSCi MDC by indexing study metadata to enable effective browse and search capabilities at three levels: study, Patient-Reported Outcome (PRO) Measure, and data element.
RESULTS
The CureSCi MDC offers several browse and search tools to discover studies by study level, PRO Measures, and data elements. The "Browse Studies," "Browse Studies by PRO Measures," and "Browse Studies by Data Elements" tools allow users to identify studies through pre-defined conceptual categories. "Search by Keyword" and "Search Data Element by Concept Category" can be used separately or in combination to provide more granularity to refine the search results. This resource helps investigators find information about specific data elements across studies using public browsing/search tools, before going through data request procedures to access controlled datasets. The MDC makes SCD studies more Findable through browsing/searching study information, PRO Measures, and data elements, aiding in the reuse of existing SCD data.
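The three-tier browse/search idea (category > subcategory > data element, plus keyword search across studies) can be sketched in a few lines. The study names and data elements below are invented examples, not CureSCi MDC content:

```python
# Toy catalog: each study exposes data elements tagged with a
# category/subcategory, mirroring the three-tier framework.
catalog = [
    {"study": "STUDY-A",
     "elements": [
         {"category": "Laboratory", "subcategory": "Hematology",
          "element": "hemoglobin level"},
         {"category": "Patient-Reported Outcome", "subcategory": "Pain",
          "element": "pain intensity score"},
     ]},
    {"study": "STUDY-B",
     "elements": [
         {"category": "Laboratory", "subcategory": "Hematology",
          "element": "reticulocyte count"},
     ]},
]

def search_by_keyword(keyword: str):
    """'Search by Keyword': study IDs whose elements mention the keyword."""
    kw = keyword.lower()
    return sorted({s["study"] for s in catalog
                   for e in s["elements"] if kw in e["element"].lower()})

def browse_by_category(category: str):
    """'Browse by Data Elements': (study, element) pairs under a category."""
    return [(s["study"], e["element"]) for s in catalog
            for e in s["elements"] if e["category"] == category]
```

Indexing elements under shared conceptual categories is what lets a keyword or category query surface studies the user did not know to name, which is the Findability gain the abstract describes.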
Topics: Humans; Metadata; Information Dissemination; Anemia, Sickle Cell
PubMed: 36508412
DOI: 10.1371/journal.pone.0256248
- Journal of Digital Imaging, Oct 2019
Review
In recent decades, the amount of medical imaging studies and associated metadata has been rapidly increasing. Despite being mostly used to support medical diagnosis and treatment, many recent initiatives advocate the use of medical imaging studies not only in clinical research scenarios but also to improve the business practices of medical institutions. However, the continuous production of medical imaging studies, coupled with the tremendous amount of associated data, makes real-time analysis of medical imaging repositories difficult with conventional tools and methodologies. Those archives contain not only the image data itself but also a wide range of valuable metadata describing all the stakeholders involved in the examination. Exploiting these data will increase the efficiency and quality of medical practice. In major centers, this represents a big data scenario where Business Intelligence (BI) and Data Analytics (DA) are rare and implemented through data warehousing approaches. This article proposes an Extract, Transform, Load (ETL) framework for medical imaging repositories able to feed, in real time, a purpose-built BI application. The solution was designed to provide the necessary environment for leading research on top of live institutional repositories without requiring the creation of a data warehouse. It features an extensible dashboard with customizable charts and reports and an intuitive web-based interface that supports the use of novel data mining techniques, namely a variety of data cleansing tools, filters, and clustering functions. Therefore, the user is not required to master the programming skills commonly needed by data analysts and scientists, such as Python and R.
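The warehouse-free ETL idea can be sketched as a generator pipeline that normalizes records from a live archive and feeds an in-memory store a dashboard could read. All field names and records below are invented for illustration; this is not the article's implementation:

```python
# Toy ETL sketch: extract raw study records, normalize them, and
# aggregate counts a dashboard chart could display -- no data warehouse.

def extract(source):
    """Extract: iterate over raw study records as the archive produces them."""
    yield from source

def transform(record):
    """Transform: normalize modality names and keep only dashboard fields."""
    return {"study_id": record["StudyInstanceUID"],
            "modality": record["Modality"].upper(),
            "institution": record.get("InstitutionName", "unknown")}

def load(records, store):
    """Load: aggregate studies per modality into an in-memory store."""
    for r in records:
        store[r["modality"]] = store.get(r["modality"], 0) + 1
    return store

# Invented sample records standing in for a live DICOM index.
raw = [{"StudyInstanceUID": "1.2.3", "Modality": "ct"},
       {"StudyInstanceUID": "1.2.4", "Modality": "MR"},
       {"StudyInstanceUID": "1.2.5", "Modality": "ct", "InstitutionName": "X"}]

counts = load((transform(r) for r in extract(raw)), {})
```

Because the pipeline is lazy (generators end to end), records stream through one at a time, which is what makes real-time feeding of a dashboard plausible without first materializing a warehouse.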
Topics: Data Mining; Data Warehousing; Humans; Metadata; Radiology Information Systems
PubMed: 31201587
DOI: 10.1007/s10278-019-00184-5