Journal of Digital Imaging Oct 2019
Review
In recent decades, the amount of medical imaging studies and associated metadata has been rapidly increasing. Despite being mostly used to support medical diagnosis and treatment, many recent initiatives advocate using medical imaging studies not only in clinical research scenarios but also to improve the business practices of medical institutions. However, the continuous production of medical imaging studies, coupled with the tremendous amount of associated data, makes the real-time analysis of medical imaging repositories difficult using conventional tools and methodologies. Those archives contain not only the image data itself but also a wide range of valuable metadata describing all the stakeholders involved in the examination. The exploration of such technologies will increase the efficiency and quality of medical practice. In major centers, it represents a big data scenario where Business Intelligence (BI) and Data Analytics (DA) are rare and implemented through data warehousing approaches. This article proposes an Extract, Transform, Load (ETL) framework for medical imaging repositories able to feed, in real time, a dedicated BI application. The solution was designed to provide the necessary environment for leading research on top of live institutional repositories without requiring the creation of a data warehouse. It features an extensible dashboard with customizable charts and reports, with an intuitive web-based interface that supports novel data mining techniques, namely, a variety of data cleansing tools, filters, and clustering functions. Therefore, the user is not required to master the programming skills commonly expected of data analysts and scientists, such as Python and R.
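The abstract gives no implementation details; purely to illustrate the extract-transform-load flow over imaging metadata, here is a minimal stdlib sketch. The field names loosely follow DICOM attribute names (PatientID, Modality, StudyDate), but the records and pseudonymization scheme are invented for the example.

```python
# Minimal ETL sketch over medical-imaging metadata records.
# Field names loosely follow DICOM attribute names; records are illustrative.

def extract(records):
    """Extract: pull only the metadata fields the analysis needs."""
    wanted = ("PatientID", "Modality", "StudyDate")
    for rec in records:
        yield {k: rec.get(k) for k in wanted}

def transform(rec):
    """Transform: pseudonymize the patient ID and normalize the modality."""
    out = dict(rec)
    out["PatientID"] = "anon-" + str(abs(hash(rec["PatientID"])) % 10_000)
    out["Modality"] = (rec["Modality"] or "").upper()
    return out

def load(records, index):
    """Load: group studies by modality for dashboard-style aggregation."""
    for rec in records:
        index.setdefault(rec["Modality"], []).append(rec)
    return index

studies = [
    {"PatientID": "P001", "Modality": "ct", "StudyDate": "20190301", "Extra": "x"},
    {"PatientID": "P002", "Modality": "MR", "StudyDate": "20190302"},
]
index = load((transform(r) for r in extract(studies)), {})
print(sorted(index))  # modalities seen in the repository
```

In a live deployment the extract step would read from the imaging archive itself rather than an in-memory list, which is what allows the dashboard to run without an intermediate data warehouse.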
Topics: Data Mining; Data Warehousing; Humans; Metadata; Radiology Information Systems
PubMed: 31201587
DOI: 10.1007/s10278-019-00184-5
PLoS ONE 2022
Comparative Study
Crowdfunding platforms allow entrepreneurs to publish projects and raise funds for realizing them. Hence, the question of what influences projects' fundraising success is very important. Previous studies examined various factors, such as project goals and project duration, that may influence the outcomes of fundraising campaigns. We present a novel model for predicting the success of crowdfunding projects in meeting their funding goals. Our model focuses on semantic features only, whose performance is comparable to that of previous models. In an additional model we developed, we examine both project metadata and project semantics, delivering a comprehensive study of factors influencing crowdfunding success. Further, we analyze a dataset of crowdfunding projects larger than any previously reported. Finally, we show that when combining semantics and metadata, we achieve an F1 score of 96.2%. We compare our model's accuracy to that of previous research models by applying their methods on our dataset, and demonstrate higher accuracy of our model. In addition to our scientific contribution, we provide practical recommendations that may increase project funding success chances.
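The 96.2% figure is an F1 score, the harmonic mean of precision and recall over the classifier's confusion counts. As a quick reference (the counts below are invented for the example, not the paper's data):

```python
# F1 score from confusion-matrix counts: the harmonic mean of
# precision (TP / (TP + FP)) and recall (TP / (TP + FN)).

def f1_score(tp: int, fp: int, fn: int) -> float:
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Illustrative counts only: 950 funded projects correctly predicted,
# 40 false alarms, 35 misses.
print(round(f1_score(950, 40, 35), 3))  # -> 0.962
```

Equivalently, F1 = 2·TP / (2·TP + FP + FN), which makes it insensitive to the number of true negatives, a useful property when successful and failed projects are imbalanced.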
Topics: Algorithms; Crowdsourcing; Fund Raising; Humans; Metadata; Models, Theoretical; Semantics
PubMed: 35148341
DOI: 10.1371/journal.pone.0263891
Bioinformatics (Oxford, England) Dec 2021
SUMMARY
As the use of single-cell technologies has grown, so has the need for tools to explore these large, complicated datasets. The UCSC Cell Browser is a tool that allows scientists to visualize gene expression and metadata annotation distribution throughout a single-cell dataset or multiple datasets.
AVAILABILITY AND IMPLEMENTATION
We provide the UCSC Cell Browser as a free website where scientists can explore a growing collection of single-cell datasets, and as a freely available Python package with which scientists can create stable, self-contained visualizations for their own single-cell datasets. Learn more at https://cells.ucsc.edu.
SUPPLEMENTARY INFORMATION
Supplementary data are available at Bioinformatics online.
Topics: Genomics; Software; Databases, Genetic; Metadata
PubMed: 34244710
DOI: 10.1093/bioinformatics/btab503
Neuroinformatics Jan 2022
Results of computational analyses require transparent disclosure of their supporting resources, while the analyses themselves often can be very large scale and involve multiple processing steps separated in time. Evidence for the correctness of any analysis should include not only a textual description, but also a formal record of the computations which produced the result, including accessible data and software with runtime parameters, environment, and personnel involved. This article describes FAIRSCAPE, a reusable computational framework, enabling simplified access to modern scalable cloud-based components. FAIRSCAPE fully implements the FAIR data principles and extends them to provide fully FAIR Evidence, including machine-interpretable provenance of datasets, software and computations, as metadata for all computed results. The FAIRSCAPE microservices framework creates a complete Evidence Graph for every computational result, including persistent identifiers with metadata, resolvable to the software, computations, and datasets used in the computation, and stores a URI to the root of the graph in the result's metadata. An ontology for Evidence Graphs, EVI (https://w3id.org/EVI), supports inferential reasoning over the evidence. FAIRSCAPE can run nested or disjoint workflows and preserves provenance across them. It can run Apache Spark jobs, scripts, workflows, or user-supplied containers. All objects are assigned persistent IDs, including software. All results are annotated with FAIR metadata using the evidence graph model for access, validation, reproducibility, and re-use of archived data and software.
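FAIRSCAPE itself is microservice-based; purely to illustrate the evidence-graph idea (every dataset, software item, and computation carries a persistent identifier, and a result resolves to everything used to produce it), here is a self-contained sketch. The identifiers and the edge names `used`/`generatedBy` are invented for the example.

```python
# Toy evidence graph with invented ARK-style identifiers. A result node
# points at the computation that generated it; the computation points at
# the datasets and software it used.

graph = {
    "ark:/99999/dataset-a":  {"type": "Dataset"},
    "ark:/99999/cleaner-v1": {"type": "Software"},
    "ark:/99999/run-42":     {"type": "Computation",
                              "used": ["ark:/99999/dataset-a",
                                       "ark:/99999/cleaner-v1"]},
    "ark:/99999/result-1":   {"type": "Dataset",
                              "generatedBy": "ark:/99999/run-42"},
}

def evidence(node_id, graph):
    """Walk provenance edges and collect every node supporting a result."""
    seen, stack = set(), [node_id]
    while stack:
        nid = stack.pop()
        if nid in seen:
            continue
        seen.add(nid)
        node = graph[nid]
        if "generatedBy" in node:
            stack.append(node["generatedBy"])
        stack.extend(node.get("used", []))
    return seen

print(len(evidence("ark:/99999/result-1", graph)))  # result + run + two inputs
```

Because the traversal follows edges transitively, nested workflows fall out naturally: a result of a second computation that `used` result-1 would resolve to the full chain.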
Topics: Metadata; Reproducibility of Results; Software; Workflow
PubMed: 34264488
DOI: 10.1007/s12021-021-09529-4
F1000Research 2022
BACKGROUND
Knowing the needs of the bioimaging community with respect to research data management (RDM) is essential for identifying measures that enable adoption of the FAIR (findable, accessible, interoperable, reusable) principles for microscopy and bioimage analysis data across disciplines. As an initiative within Germany's National Research Data Infrastructure, we conducted this community survey in summer 2021 to assess the state of the art of bioimaging RDM and the community needs.
METHODS
An online survey was conducted with a mixed question-type design. We created a questionnaire tailored to relevant topics of the bioimaging community, including specific questions on bioimaging methods and bioimage analysis, as well as more general questions on RDM principles and tools. 203 survey entries were included in the analysis, covering the perspectives from various life and biomedical science disciplines and from participants at different career levels.
RESULTS
The results highlight the importance and value of bioimaging RDM and data sharing. However, the practical implementation of FAIR practices is impeded by technical hurdles, lack of knowledge, and insecurity about the legal aspects of data sharing. The survey participants request metadata guidelines and annotation tools and endorse the usage of image data management platforms. At present, OMERO (Open Microscopy Environment Remote Objects) is the best known and most widely used platform. Most respondents rely on image processing and analysis, which they regard as the most time-consuming step of the bioimage data workflow. While knowledge about and implementation of electronic lab notebooks and data management plans is limited, respondents acknowledge their potential value for data handling and publication.
CONCLUSIONS
The bioimaging community acknowledges and endorses the value of RDM and data sharing. Still, there is a need for information, guidance, and standardization to foster the adoption of FAIR data handling. This survey may help inspire targeted measures to close this gap.
Topics: Humans; Data Management; Metadata; Information Dissemination; Surveys and Questionnaires; Workflow
PubMed: 36405555
DOI: 10.12688/f1000research.121714.2
Scientific Data Dec 2020
Metadata that are structured using principled schemas and that use terms from ontologies are essential to making biomedical data findable and reusable for downstream analyses. The largest source of metadata that describes the experimental protocol, funding, and scientific leadership of clinical studies is ClinicalTrials.gov. We evaluated whether values in 302,091 trial records adhere to expected data types and use terms from biomedical ontologies, whether records contain fields required by government regulations, and whether structured elements could replace free-text elements. Contact information, outcome measures, and study design are frequently missing or underspecified. Important fields for search, such as condition and intervention, are not restricted to ontologies, and almost half of the conditions are not denoted by MeSH terms, as recommended. Eligibility criteria are stored as semi-structured free text. Enforcing the presence of all required elements, requiring values for certain fields to be drawn from ontologies, and creating a structured eligibility criteria element would improve the reusability of data from ClinicalTrials.gov in systematic reviews, meta-analyses, and matching of eligible patients to trials.
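The kind of completeness audit the authors describe reduces, at its simplest, to a checklist over each record. A minimal sketch follows; the field names and the required-field list are invented for illustration and are not the ClinicalTrials.gov data dictionary.

```python
# Illustrative completeness check on a trial record. Field names and the
# required-field list are invented, not the ClinicalTrials.gov schema.

REQUIRED = ["overall_contact", "primary_outcome", "study_design", "condition"]

def missing_fields(record: dict) -> list:
    """Return required fields that are absent or empty."""
    return [f for f in REQUIRED if not record.get(f)]

trial = {
    "condition": "Type 2 Diabetes",
    "primary_outcome": "Change in HbA1c at 24 weeks",
    "study_design": "",          # present but empty, i.e. underspecified
    # overall_contact missing entirely
}
print(missing_fields(trial))  # -> ['overall_contact', 'study_design']
```

A real validator would additionally check data types and that controlled fields (e.g. condition) resolve to ontology terms such as MeSH, which is exactly the gap the study quantifies.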
Topics: Clinical Trials as Topic; Databases, Factual; Datasets as Topic; Metadata; Research Design
PubMed: 33339830
DOI: 10.1038/s41597-020-00780-z
PLoS ONE 2017
The Encyclopedia of DNA Elements (ENCODE) project is an ongoing collaborative effort to create a comprehensive catalog of functional elements, initiated shortly after the completion of the Human Genome Project. The current database exceeds 6500 experiments across more than 450 cell lines and tissues using a wide array of experimental techniques to study the chromatin structure, regulatory and transcriptional landscape of the H. sapiens and M. musculus genomes. All ENCODE experimental data, metadata, and associated computational analyses are submitted to the ENCODE Data Coordination Center (DCC) for validation, tracking, storage, unified processing, and distribution to community resources and the scientific community. As the volume of data increases, the identification and organization of experimental details becomes increasingly intricate and demands careful curation. The ENCODE DCC has created a general-purpose software system, known as SnoVault, that supports metadata and file submission, a database used for metadata storage, web pages for displaying the metadata, and a robust API for querying the metadata. The software is fully open source; code and installation instructions can be found at http://github.com/ENCODE-DCC/snovault/ (for the generic database) and http://github.com/ENCODE-DCC/encoded/ (for the ENCODE-specific application that stores genomic data). The core database engine, SnoVault (which is completely independent of ENCODE, genomic data, or bioinformatic data), has been released as a separate Python package.
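The metadata API is queried over HTTPS with JSON responses. Without making a live request, here is a sketch of constructing such a query URL against the public ENCODE portal's search endpoint; `format=json` is a documented portal feature, but the specific filter fields below are illustrative and may not match the portal's exact facet names.

```python
from urllib.parse import urlencode

# Build a metadata query URL for the ENCODE portal's search endpoint.
# Filter field names below are illustrative examples.
BASE = "https://www.encodeproject.org/search/"

def search_url(**filters) -> str:
    """Encode filters as query parameters, requesting a JSON response."""
    params = dict(filters, format="json")
    return BASE + "?" + urlencode(params)

url = search_url(type="Experiment", organism="Homo sapiens")
print(url)
```

Fetching the resulting URL (e.g. with `urllib.request`) would return the matching metadata objects as JSON for downstream processing.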
Topics: Animals; DNA; Databases, Genetic; Genome; Genomics; Humans; Metadata; Mice; Software
PubMed: 28403240
DOI: 10.1371/journal.pone.0175310
Scientific Data Oct 2020
Our understanding of the evolution of the COVID-19 pandemic is built upon data concerning confirmed cases and deaths. This data, however, can only be meaningfully interpreted alongside an accurate understanding of the extent of virus testing in different countries. This new database brings together official data on the extent of PCR testing over time for 94 countries. We provide a time series for the daily number of tests performed, or people tested, together with metadata describing data quality and comparability issues needed for the interpretation of the time series. The database is updated regularly through a combination of automated scraping and manual collection and verification, and is entirely replicable, with sources provided for each observation. In providing accessible cross-country data on testing output, it aims to facilitate the incorporation of this crucial information into epidemiological studies, as well as track a key component of countries' responses to COVID-19.
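Consuming such a testing time series typically means parsing dated, per-country observations and deriving daily counts from cumulative totals. A stdlib-only sketch over inline sample rows (the column names and values are illustrative, not the database's exact schema):

```python
import csv
import io

# Parse a small inline sample of a per-country testing time series and
# compute daily new tests from cumulative totals. Column names and
# figures are illustrative rather than the published dataset's headers.
SAMPLE = """country,date,cumulative_tests
Iceland,2020-04-01,20000
Iceland,2020-04-02,21500
Iceland,2020-04-03,23050
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
daily = [
    int(b["cumulative_tests"]) - int(a["cumulative_tests"])
    for a, b in zip(rows, rows[1:])
    if a["country"] == b["country"]          # don't difference across countries
]
print(daily)  # -> [1500, 1550]
```

The per-country metadata the authors provide (e.g. whether a series counts tests performed or people tested) determines whether two countries' `daily` series are actually comparable.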
Topics: Betacoronavirus; COVID-19; COVID-19 Testing; Clinical Laboratory Techniques; Coronavirus Infections; Data Accuracy; Databases, Factual; Humans; Internationality; Metadata; Pandemics; Pneumonia, Viral; SARS-CoV-2
PubMed: 33033256
DOI: 10.1038/s41597-020-00688-8
BMC Bioinformatics Dec 2017
BACKGROUND
NCBI's Gene Expression Omnibus (GEO) is a rich community resource containing millions of gene expression experiments from human, mouse, rat, and other model organisms. However, information about each experiment (metadata) is in the format of an open-ended, non-standardized textual description provided by the depositor. Thus, classification of experiments for meta-analysis by factors such as gender, age of the sample donor, and tissue of origin is not feasible without assigning labels to the experiments. Automated approaches are preferable for this, primarily because of the size and volume of the data to be processed, but also because they ensure standardization and consistency. While some of these labels can be extracted directly from the textual metadata, many of the data available do not contain explicit text informing the researcher about the age and gender of the subjects in the study. To bridge this gap, machine-learning methods can be trained to use the gene expression patterns associated with the text-derived labels to refine label-prediction confidence.
RESULTS
Our analysis shows that only 26% of metadata text contains information about gender and 21% about age. In order to ameliorate the lack of available labels for these data sets, we first extract labels from the textual metadata for each GEO RNA dataset and evaluate the performance against a gold standard of manually curated labels. We then use machine-learning methods to predict labels based upon gene expression of the samples, and compare this to the text-based method.
CONCLUSION
Here we present an automated method to extract labels for age, gender, and tissue from textual metadata and GEO data using both a heuristic approach as well as machine learning. We show the two methods together improve accuracy of label assignment to GEO samples.
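The text-based half of such a pipeline is essentially pattern matching over free-text descriptions. A minimal, hedged sketch follows; the two regexes are deliberately simple illustrations and far less robust than the patterns a production curation pipeline would need.

```python
import re

# Heuristic extraction of age and gender labels from a free-text sample
# description. These regexes are illustrative only; real GEO metadata
# requires many more patterns plus validation against curated labels.

AGE_RE = re.compile(r"\bage[:\s]+(\d{1,3})\b", re.IGNORECASE)
GENDER_RE = re.compile(r"\b(male|female)\b", re.IGNORECASE)

def extract_labels(text: str) -> dict:
    """Return whichever of {age, gender} can be read from the text."""
    labels = {}
    if m := AGE_RE.search(text):
        labels["age"] = int(m.group(1))
    if m := GENDER_RE.search(text):
        labels["gender"] = m.group(1).lower()
    return labels

desc = "Liver biopsy from female donor, age: 54, RNA-seq"
print(extract_labels(desc))  # -> {'age': 54, 'gender': 'female'}
```

Samples where this heuristic returns an empty dict are exactly the ones the paper's expression-based classifier is meant to cover.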
Topics: Age Factors; Algorithms; Animals; Automation; Databases, Genetic; Female; Gene Expression; Gene Ontology; Humans; Machine Learning; Male; Metadata; Middle Aged; Molecular Sequence Annotation; Rats; Reference Standards
PubMed: 29297276
DOI: 10.1186/s12859-017-1888-1
Metabolomics: Official Journal of the... Nov 2022
INTRODUCTION
The structural identification of metabolites represents one of the current bottlenecks in non-targeted liquid chromatography-mass spectrometry (LC-MS) based metabolomics. The Metabolomics Standard Initiative has developed a multilevel system to report confidence in metabolite identification, which involves the use of MS, MS/MS and orthogonal data. Limitations due to similar or same fragmentation pattern (e.g. isomeric compounds) can be overcome by the additional orthogonal information of the retention time (RT), since it is a system property that is different for each chromatographic setup.
OBJECTIVES
In contrast to MS data, sharing of RT data is not as widespread. The quality of data and its (re-)usability depend very much on the quality of the metadata. We aimed to evaluate the coverage and quality of this metadata from public metabolomics repositories.
METHODS
We acquired an overview on the current reporting of chromatographic separation conditions. For this purpose, we defined the following information as important details that have to be provided: column name and dimension, flow rate, temperature, composition of eluents and gradient.
RESULTS
We found that 70% of descriptions of the chromatographic setups are incomplete (according to our definition) and an additional 10% of the descriptions contained ambiguous and/or incorrect information. Accordingly, only about 20% of the descriptions allow further (re-)use of the data, e.g. for RT prediction. Therefore, we have started to develop a unified and standardized notation for chromatographic metadata with detailed and specific description of eluents, columns and gradients.
CONCLUSION
Reporting of chromatographic metadata is currently not unified. Our recommended suggestions for metadata reporting will enable more standardization and automation in future reporting.
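The completeness rule defined under Methods (column name and dimension, flow rate, temperature, eluent composition, and gradient) translates directly into a checklist. A sketch with an invented record format (the dictionary keys are not a published standard):

```python
# Checklist-style completeness check for chromatographic metadata,
# mirroring the fields the authors define as required. The record
# format (key names) is invented for illustration.

REQUIRED = ["column_name", "column_dimension", "flow_rate",
            "temperature", "eluents", "gradient"]

def completeness(setup: dict):
    """Classify a setup description and list the missing required fields."""
    missing = [f for f in REQUIRED if not setup.get(f)]
    return ("complete" if not missing else "incomplete"), missing

setup = {
    "column_name": "C18",
    "flow_rate": "0.3 mL/min",
    "eluents": "A: water + 0.1% FA, B: acetonitrile",
    # column_dimension, temperature, and gradient not reported
}
print(completeness(setup))
```

Under the paper's 70%-incomplete finding, most public setups would fail a check like this; detecting the additional 10% of *ambiguous or incorrect* entries would require validating field values, not just their presence.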
Topics: Metadata; Metabolomics; Tandem Mass Spectrometry; Chromatography, Liquid; Temperature
PubMed: 36436113
DOI: 10.1007/s11306-022-01956-x