Nucleic Acids Research Jul 2022
Millions of transcriptome samples were generated by the Library of Integrated Network-based Cellular Signatures (LINCS) program. When these data are processed into searchable signatures along with signatures extracted from Genotype-Tissue Expression (GTEx) and Gene Expression Omnibus (GEO), connections between drugs, genes, pathways and diseases can be illuminated. SigCom LINCS is a webserver that serves over a million gene expression signatures processed, analyzed, and visualized from LINCS, GTEx, and GEO. SigCom LINCS is built with Signature Commons, a cloud-agnostic skeleton Data Commons with a focus on serving searchable signatures. SigCom LINCS provides a rapid signature similarity search for mimickers and reversers given sets of up and down genes, a gene set, a single gene, or any search term. Additionally, users of SigCom LINCS can perform a metadata search to find and analyze subsets of signatures and find information about genes and drugs. SigCom LINCS is findable, accessible, interoperable, and reusable (FAIR) with metadata linked to standard ontologies and vocabularies. In addition, all the data and signatures within SigCom LINCS are available via a well-documented API. In summary, SigCom LINCS, available at https://maayanlab.cloud/sigcom-lincs, is a rich webserver resource for accelerating drug and target discovery in systems pharmacology.
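The mimicker/reverser idea behind the similarity search can be illustrated with a toy overlap score. This is a simplified stand-in for illustration only; the abstract does not specify SigCom LINCS's actual similarity metric, and the function and gene names here are invented:

```python
def connectivity_score(query_up, query_down, sig_up, sig_down):
    """Toy two-sided score: near +1 for mimickers, near -1 for reversers."""
    query_up, query_down = set(query_up), set(query_down)
    sig_up, sig_down = set(sig_up), set(sig_down)
    # Genes moving in the same direction in query and signature are concordant;
    # genes moving in opposite directions are discordant.
    concordant = len(query_up & sig_up) + len(query_down & sig_down)
    discordant = len(query_up & sig_down) + len(query_down & sig_up)
    total = len(query_up) + len(query_down)
    return (concordant - discordant) / total

# A signature matching the query's directions scores positive (mimicker)...
mimic = connectivity_score(["A", "B"], ["C", "D"], ["A", "B"], ["C", "D"])
# ...while one with the directions flipped scores negative (reverser).
reverse = connectivity_score(["A", "B"], ["C", "D"], ["C", "D"], ["A", "B"])
```

Ranking a signature library by such a score, descending for mimickers and ascending for reversers, is the general shape of an up/down gene-set query.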
Topics: Transcriptome; Metadata; Search Engine
PubMed: 35524556
DOI: 10.1093/nar/gkac328 -
GigaScience Dec 2022
BACKGROUND
The life sciences are one of the biggest suppliers of scientific data. Reusing and connecting these data can uncover hidden insights and lead to new concepts. Efficient reuse of these datasets is strongly promoted when they are interlinked with a sufficient amount of machine-actionable metadata. While the FAIR (Findable, Accessible, Interoperable, Reusable) guiding principles have been accepted by all stakeholders, in practice, there are only a limited number of easy-to-adopt implementations available that fulfill the needs of data producers.
FINDINGS
We developed the FAIR Data Station, a lightweight application written in Java, that aims to support researchers in managing research metadata according to the FAIR principles. It implements the ISA metadata framework and uses minimal information metadata standards to capture experiment metadata. The FAIR Data Station consists of 3 modules. Based on the minimal information model(s) selected by the user, the "form generation module" creates a metadata template Excel workbook with a header row of machine-actionable attribute names. The Excel workbook is subsequently used by the data producer(s) as a familiar environment for sample metadata registration. At any point during this process, the format of the recorded values can be checked using the "validation module." Finally, the "resource module" can be used to convert the set of metadata recorded in the Excel workbook into RDF format, enabling (cross-project) (meta)data searches, and, for publishing of sequence data, into a European Nucleotide Archive-compatible XML metadata file.
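The validate-then-convert flow of the last two modules can be sketched in a few lines. This is a minimal Python illustration, not the tool itself (which is written in Java); the attribute names, regex rules, and URIs below are invented for the example:

```python
import re

# Hypothetical minimal-information model: attribute name -> value format rule
MODEL = {
    "sample_id": r"^[A-Za-z0-9_]+$",
    "collection_date": r"^\d{4}-\d{2}-\d{2}$",
}

def validate(record):
    """Validation module sketch: return attributes whose value fails its rule."""
    return [k for k, rx in MODEL.items() if not re.match(rx, record.get(k, ""))]

def to_ntriples(record, base="http://example.org/"):
    """Resource module sketch: emit one RDF (N-Triples) line per attribute."""
    subj = f"<{base}sample/{record['sample_id']}>"
    return [f'{subj} <{base}attr/{k}> "{v}" .' for k, v in record.items()]

row = {"sample_id": "S001", "collection_date": "2022-06-01"}
errors = validate(row)        # empty list: every value matches the model
triples = to_ntriples(row)    # machine-actionable triples for a triple store
```

Keeping validation separate from conversion means producers can check their workbook repeatedly during data entry, long before any RDF is generated.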
CONCLUSIONS
Turning FAIR into reality requires the availability of easy-to-adopt data FAIRification workflows that are also of direct use for data producers. As such, the FAIR Data Station provides, in addition to the means to correctly FAIRify (omics) data, the means to build searchable metadata databases of similar projects and can assist in ENA metadata submission of sequence data. The FAIR Data Station is available at https://fairbydesign.nl.
Topics: Metadata; Biological Science Disciplines; Databases, Factual; Nucleotides; Publishing
PubMed: 36879493
DOI: 10.1093/gigascience/giad014 -
Conservation Biology: the Journal of... Aug 2023
Genetic diversity within species represents a fundamental yet underappreciated level of biodiversity. Because genetic diversity can indicate species resilience to changing climate, its measurement is relevant to many national and global conservation policy targets. Many studies produce large amounts of genome-scale genetic diversity data for wild populations, but most (87%) do not include the associated spatial and temporal metadata necessary for them to be reused in monitoring programs or for acknowledging the sovereignty of nations or Indigenous peoples. We undertook a distributed datathon to quantify the availability of these missing metadata and to test the hypothesis that their availability decays with time. We also worked to remediate missing metadata by extracting them from associated published papers, online repositories, and direct communication with authors. Starting with 848 candidate genomic data sets (reduced representation and whole genome) from the International Nucleotide Sequence Database Collaboration, we determined that 561 contained mostly samples from wild populations. We successfully restored spatiotemporal metadata for 78% of these 561 data sets (n = 440 data sets with data on 45,105 individuals from 762 species in 17 phyla). Examining papers and online repositories was much more fruitful than contacting 351 authors, who replied to our email requests 45% of the time. Overall, 23% of our email queries to authors unearthed useful metadata. The probability of retrieving spatiotemporal metadata declined significantly as age of the data set increased. There was a 13.5% yearly decrease in metadata associated with published papers or online repositories and up to a 22% yearly decrease in metadata that were only available from authors. 
This rapid decay in metadata availability, mirrored in studies of other types of biological data, should motivate swift updates to data-sharing policies and researcher practices to ensure that the valuable context provided by metadata is not lost to conservation science forever.
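The reported yearly percentages imply compounding loss. A back-of-envelope sketch, assuming a constant fractional loss per year (geometric decay, which is what a fixed "yearly decrease" suggests, though the paper's exact model may differ):

```python
def retrieval_probability(p0, yearly_decrease, years):
    """Probability of still recovering metadata after `years`, assuming a
    constant fractional loss per year (geometric decay)."""
    return p0 * (1 - yearly_decrease) ** years

# At the paper's 13.5%/year loss for papers and repositories, metadata that
# is fully recoverable today drops to roughly half after five years:
p5 = retrieval_probability(1.0, 0.135, 5)
```

At the steeper 22%/year author-only rate, the same five-year window leaves well under a third of the metadata recoverable, which is the practical force behind the call for prompt archiving.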
Topics: Humans; Conservation of Natural Resources; Metadata; Biodiversity; Probability; Genetic Variation
PubMed: 36704891
DOI: 10.1111/cobi.14061 -
PloS One 2022
OBJECTIVES
To adopt the FAIR principles (Findable, Accessible, Interoperable, Reusable) to enhance data sharing, the Cure Sickle Cell Initiative (CureSCi) MetaData Catalog (MDC) was developed to make Sickle Cell Disease (SCD) study datasets more Findable by curating study metadata and making them available through an open-access web portal.
METHODS
Study metadata, including study protocol, data collection forms, and data dictionaries, describe information about study patient-level data. We curated key metadata of 16 SCD studies in a three-tiered conceptual framework of category, subcategory, and data element using ontologies and controlled vocabularies to organize the study variables. We developed the CureSCi MDC by indexing study metadata to enable effective browse and search capabilities at three levels: study, Patient-Reported Outcome (PRO) Measures, and data element levels.
RESULTS
The CureSCi MDC offers several browse and search tools to discover studies by study level, PRO Measures, and data elements. The "Browse Studies," "Browse Studies by PRO Measures," and "Browse Studies by Data Elements" tools allow users to identify studies through pre-defined conceptual categories. "Search by Keyword" and "Search Data Element by Concept Category" can be used separately or in combination to provide more granularity to refine the search results. This resource helps investigators find information about specific data elements across studies using public browsing/search tools, before going through data request procedures to access controlled datasets. The MDC makes SCD studies more Findable through browsing/searching study information, PRO Measures, and data elements, aiding in the reuse of existing SCD data.
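The separate-or-combined behavior of the keyword and concept-category searches can be sketched as two composable filters. The element names and categories below are invented for illustration and are not the MDC's actual data elements:

```python
# Hypothetical catalog of data elements: (element name, concept category)
ELEMENTS = [
    ("pain_frequency", "Patient-Reported Outcome"),
    ("hemoglobin_level", "Laboratory"),
    ("pain_medication", "Treatment"),
]

def search(keyword=None, category=None):
    """Apply keyword and concept-category filters separately or combined."""
    hits = ELEMENTS
    if keyword:
        hits = [e for e in hits if keyword.lower() in e[0].lower()]
    if category:
        hits = [e for e in hits if e[1] == category]
    return [name for name, _ in hits]

# Keyword alone is broad; adding a category refines the result set.
broad = search(keyword="pain")
combined = search(keyword="pain", category="Treatment")
```

Because each filter narrows the previous result set, combining them yields the "more granularity" the abstract describes without needing a separate combined index.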
Topics: Humans; Metadata; Information Dissemination; Anemia, Sickle Cell
PubMed: 36508412
DOI: 10.1371/journal.pone.0256248 -
Journal of Digital Imaging Oct 2019
Review
In recent decades, the amount of medical imaging studies and associated metadata has been growing rapidly. Although these studies are mostly used to support medical diagnosis and treatment, many recent initiatives advocate their use in clinical research and for improving the business practices of medical institutions. However, the continuous production of medical imaging studies, coupled with the tremendous amount of associated data, makes real-time analysis of medical imaging repositories difficult with conventional tools and methodologies. These archives contain not only the image data itself but also a wide range of valuable metadata describing all the stakeholders involved in the examination. Exploiting these data can increase the efficiency and quality of medical practice. In major centers, this represents a big data scenario in which Business Intelligence (BI) and Data Analytics (DA) are rare and, where present, implemented through data warehousing approaches. This article proposes an Extract, Transform, Load (ETL) framework for medical imaging repositories able to feed a BI application in real time. The solution was designed to support research on top of live institutional repositories without requiring the creation of a data warehouse. It features an extensible dashboard with customizable charts and reports and an intuitive web-based interface that enables the use of data mining techniques, namely a variety of data cleansing tools, filters, and clustering functions, so the user is not required to master the programming skills commonly needed by data analysts and scientists, such as Python and R.
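The extract-transform-load pattern over imaging metadata can be sketched minimally. The field names and values here are illustrative DICOM-like attributes, not the framework's actual schema:

```python
from collections import Counter

# Extract: a pretend stream of study metadata pulled from a live archive.
studies = [
    {"modality": "ct", "institution": "A", "body_part": " chest"},
    {"modality": "MR", "institution": "A", "body_part": "HEAD"},
    {"modality": "CT", "institution": "B", "body_part": "CHEST "},
]

def transform(records):
    """Transform: cleanse values (trim whitespace, normalize case)."""
    return [{k: v.strip().upper() for k, v in r.items()} for r in records]

def load(records):
    """Load: aggregate into a BI-style summary rather than a data warehouse."""
    return Counter(r["modality"] for r in records)

summary = load(transform(studies))  # e.g. modality counts for a dashboard
```

The point of the sketch is the ordering: cleansing happens before aggregation, so inconsistent source values ("ct", "CT") roll up into one dashboard count.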
Topics: Data Mining; Data Warehousing; Humans; Metadata; Radiology Information Systems
PubMed: 31201587
DOI: 10.1007/s10278-019-00184-5 -
PloS One 2022
Comparative Study
Crowdfunding platforms allow entrepreneurs to publish projects and raise funds to realize them. The question of what influences a project's fundraising success is therefore very important. Previous studies examined various factors, such as project goals and project duration, that may influence the outcomes of fundraising campaigns. We present a novel model for predicting the success of crowdfunding projects in meeting their funding goals. Our model focuses on semantic features only, yet its performance is comparable to that of previous models. In an additional model, we examine both project metadata and project semantics, delivering a comprehensive study of the factors influencing crowdfunding success. Further, we analyze a dataset of crowdfunding projects larger than any previously reported. Finally, we show that when combining semantics and metadata, we achieve an F1 score of 96.2%. We compare our model's accuracy to that of previous research models by applying their methods to our dataset and demonstrate the higher accuracy of ours. In addition to our scientific contribution, we provide practical recommendations that may increase a project's chances of funding success.
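As a reminder of what the headline number means (this illustrates the metric itself, not the authors' pipeline; the precision/recall values below are invented):

```python
def f1_score(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# An F1 near 0.962 requires BOTH precision and recall to be high; the
# harmonic mean is dragged down sharply by whichever is lower.
f1 = f1_score(0.97, 0.954)
```

Because F1 balances false positives against false negatives, it is a more informative summary than raw accuracy for success/failure prediction when the two classes are imbalanced.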
Topics: Algorithms; Crowdsourcing; Fund Raising; Humans; Metadata; Models, Theoretical; Semantics
PubMed: 35148341
DOI: 10.1371/journal.pone.0263891 -
Bioinformatics (Oxford, England) Dec 2021
SUMMARY
As the use of single-cell technologies has grown, so has the need for tools to explore these large, complicated datasets. The UCSC Cell Browser is a tool that allows scientists to visualize gene expression and metadata annotation distribution throughout a single-cell dataset or multiple datasets.
AVAILABILITY AND IMPLEMENTATION
We provide the UCSC Cell Browser as a free website where scientists can explore a growing collection of single-cell datasets, and as a freely available Python package with which scientists can create stable, self-contained visualizations of their own single-cell datasets. Learn more at https://cells.ucsc.edu.
SUPPLEMENTARY INFORMATION
Supplementary data are available at Bioinformatics online.
Topics: Genomics; Software; Databases, Genetic; Metadata
PubMed: 34244710
DOI: 10.1093/bioinformatics/btab503 -
BMC Genomic Data Nov 2022
Although data sharing is increasing, most open data are difficult to reuse, or even to identify, owing to a lack of associated metadata. In this editorial, I discuss the importance of such metadata in the context of genomics and why they are essential to the success of data sharing.
Topics: Metadata; Information Dissemination; Genomics
PubMed: 36371151
DOI: 10.1186/s12863-022-01095-1 -
Social Studies of Science Oct 2019
'Metadata' has received a fraction of the attention that 'data' has received in sociological studies of scientific research. A neglect of 'metadata' reduces the attention paid to a number of critical aspects of scientific work processes, including documentary work, accountability relations, and collaboration routines. Metadata processes and products are essential components of the work needed to practically accomplish day-to-day scientific research tasks, and are central to ensuring that research findings and products meet externally driven standards or requirements. This article is an attempt to open up the discussion on and conceptualization of metadata within the sociology of science and the sociology of data. It presents ethnographic research on metadata creation within everyday scientific practice, focusing on how researchers document, describe, annotate, organize and manage their data, both for their own use and for the use of researchers outside of their project. In particular, this article argues that the role and significance of metadata within scientific research contexts are intimately tied to the nature of evidence and accountability within particular social situations. Studying metadata can (1) provide insight into the production of evidence, that is, how something we might call 'data' becomes able to serve an evidentiary role, and (2) provide a mechanism for revealing what people in research contexts are held accountable for, and what they achieve.
Topics: Metadata; Research Design
PubMed: 31354073
DOI: 10.1177/0306312719863494 -
Neuroinformatics Jan 2022
Results of computational analyses require transparent disclosure of their supporting resources, while the analyses themselves often can be very large scale and involve multiple processing steps separated in time. Evidence for the correctness of any analysis should include not only a textual description, but also a formal record of the computations which produced the result, including accessible data and software with runtime parameters, environment, and personnel involved. This article describes FAIRSCAPE, a reusable computational framework, enabling simplified access to modern scalable cloud-based components. FAIRSCAPE fully implements the FAIR data principles and extends them to provide fully FAIR Evidence, including machine-interpretable provenance of datasets, software and computations, as metadata for all computed results. The FAIRSCAPE microservices framework creates a complete Evidence Graph for every computational result, including persistent identifiers with metadata, resolvable to the software, computations, and datasets used in the computation; and stores a URI to the root of the graph in the result's metadata. An ontology for Evidence Graphs, EVI ( https://w3id.org/EVI ), supports inferential reasoning over the evidence. FAIRSCAPE can run nested or disjoint workflows and preserves provenance across them. It can run Apache Spark jobs, scripts, workflows, or user-supplied containers. All objects are assigned persistent IDs, including software. All results are annotated with FAIR metadata using the evidence graph model for access, validation, reproducibility, and re-use of archived data and software.
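The evidence-graph idea can be illustrated with a toy in-memory graph and a provenance walk. This is a simplified sketch, not the EVI ontology or FAIRSCAPE's API; the `ark:` identifiers and property names are invented stand-ins for its persistent IDs:

```python
# Toy evidence graph: a result points to the computation that produced it,
# which points to the software and input datasets it used.
graph = {
    "ark:/result/1":      {"generatedBy": "ark:/computation/1"},
    "ark:/computation/1": {"usedSoftware": "ark:/software/1",
                           "usedDataset": ["ark:/data/1", "ark:/data/2"]},
}

def provenance(node, g):
    """Walk from a result back through every piece of supporting evidence."""
    seen, stack = [], [node]
    while stack:
        n = stack.pop()
        if n in seen:
            continue
        seen.append(n)
        for v in g.get(n, {}).values():
            stack.extend(v if isinstance(v, list) else [v])
    return seen

trail = provenance("ark:/result/1", graph)  # result, computation, inputs
```

Storing a resolvable URI to the root of such a graph in a result's metadata is what lets a reader recover the full computational evidence from the result alone.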
Topics: Metadata; Reproducibility of Results; Software; Workflow
PubMed: 34264488
DOI: 10.1007/s12021-021-09529-4