-
International Journal of Epidemiology Oct 2023
BACKGROUND
Mendelian randomization (MR) studies are susceptible to metadata errors (e.g. incorrect specification of the effect allele column) and other analytical issues that can introduce substantial bias into analyses. We developed a quality control (QC) pipeline for the Fatty Acids in Cancer Mendelian Randomization Collaboration (FAMRC) that can be used to identify and correct for such errors.
METHODS
We collated summary association statistics from fatty acid and cancer genome-wide association studies (GWAS) and subjected the collated data to a comprehensive QC pipeline. We identified metadata errors through comparison of study-specific statistics to external reference data sets (the National Human Genome Research Institute-European Bioinformatics Institute GWAS catalogue and 1000 genome super populations) and other analytical issues through comparison of reported to expected genetic effect sizes. Comparisons were based on three sets of genetic variants: (i) GWAS hits for fatty acids, (ii) GWAS hits for cancer and (iii) a 1000 genomes reference set.
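One way to picture the comparison against a reference panel is an allele frequency check: if a study's reported effect allele frequency sits close to one minus the reference frequency, the effect allele column may have been mis-specified. The sketch below is illustrative only and is not taken from the CheckSumStats package; the function name, the field names (`rsid`, `eaf`, `ref_eaf`), the tolerance and the example values are all hypothetical.

```python
def flag_effect_allele_conflicts(variants, tolerance=0.1):
    """Return IDs of variants whose reported effect allele frequency
    looks like it belongs to the other allele.

    Each variant is a dict with hypothetical keys:
      rsid    - variant identifier
      eaf     - effect allele frequency reported by the study
      ref_eaf - frequency of the same allele in a reference panel
    """
    flagged = []
    for v in variants:
        reported = v["eaf"]
        reference = v["ref_eaf"]
        # Frequencies near 0.5 cannot distinguish the two alleles, so skip them.
        if abs(reference - 0.5) < tolerance:
            continue
        matches_direct = abs(reported - reference) < tolerance
        matches_flipped = abs(reported - (1 - reference)) < tolerance
        if matches_flipped and not matches_direct:
            flagged.append(v["rsid"])
    return flagged


# Made-up example values, for illustration only.
study = [
    {"rsid": "rs_a", "eaf": 0.12, "ref_eaf": 0.10},  # consistent with reference
    {"rsid": "rs_b", "eaf": 0.81, "ref_eaf": 0.20},  # looks flipped
    {"rsid": "rs_c", "eaf": 0.48, "ref_eaf": 0.52},  # too close to 0.5 to judge
]
flagged = flag_effect_allele_conflicts(study)
```

If most variants in a study are flagged, that would point to a study-wide metadata error rather than to isolated strand or population differences.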
RESULTS
We collated summary data from 6 fatty acid and 54 cancer GWAS. Metadata errors and analytical issues with the potential to introduce substantial bias were identified in seven studies (11.6%). After resolving metadata errors and analytical issues, we created a data set of 219 842 genetic associations with 90 cancer types, generated in analyses of 566 665 cancer cases and 1 622 374 controls.
CONCLUSIONS
In this large MR collaboration, 11.6% of included studies were affected by a substantial metadata error or analytical issue. By increasing the integrity of collated summary data prior to their analysis, our protocol can be used to increase the reliability of downstream MR analyses. Our pipeline is available to other researchers via the CheckSumStats package (https://github.com/MRCIEU/CheckSumStats).
Topics: Humans; Genome-Wide Association Study; Mendelian Randomization Analysis; Reproducibility of Results; Fatty Acids; Quality Control; Neoplasms
PubMed: 38587501
DOI: 10.1093/ije/dyad018
-
Metabolites Aug 2023
Metabolomics has advanced to an extent where it is desired to standardize and compare data across individual studies. While past work in standardization has focused on data acquisition, data processing, and data storage aspects, metabolomics databases are useless without ontology-based descriptions of biological samples and study designs. We introduce here a user-centric tool to automatically standardize sample metadata. Using such a tool in frontends for metabolomic databases will dramatically increase the FAIRness (Findability, Accessibility, Interoperability, and Reusability) of data, specifically for data reuse and for finding datasets that share comparable sets of metadata, e.g., study meta-analyses, cross-species analyses or large scale metabolomic atlases. SMetaS (Sample Metadata Standardizer) combines a classic database with an API and frontend and is provided in a containerized environment. The tool has two user-centric components. In the first component, the user designs a sample metadata matrix and fills the cells using natural language terminology. In the second component, the tool transforms the completed matrix by replacing freetext terms with terms from fixed vocabularies. This transformation process is designed to maximize simplicity and is guided by, among other strategies, synonym matching and typographical fixing in an n-grams/nearest neighbors model approach. The tool enables downstream analysis of submitted studies and samples via string equality for FAIR retrospective use.
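The second component, replacing free-text terms with fixed-vocabulary terms, can be sketched roughly as follows. This is a minimal illustration of synonym lookup followed by character n-gram nearest-neighbor matching, not the SMetaS implementation; the function names, the Jaccard similarity measure, the threshold and the tiny vocabulary are assumptions made for the example.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams of a padded, lowercased string."""
    padded = f" {text.lower().strip()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity between two sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def standardize(term, vocabulary, synonyms=None, min_score=0.3):
    """Map a free-text term to a vocabulary term, or None if nothing is close.

    Exact synonyms are tried first; otherwise the vocabulary term with the
    most similar character n-gram set is used, which tolerates small typos.
    """
    synonyms = synonyms or {}
    key = term.lower().strip()
    if key in synonyms:
        return synonyms[key]
    if not vocabulary:
        return None
    grams = char_ngrams(key)
    best = max(vocabulary, key=lambda v: jaccard(grams, char_ngrams(v)))
    if jaccard(grams, char_ngrams(best)) >= min_score:
        return best
    return None


# Hypothetical vocabulary and synonym table, for illustration only.
vocabulary = ["liver", "kidney", "blood plasma"]
synonyms = {"hepatic tissue": "liver"}

from_synonym = standardize("Hepatic tissue", vocabulary, synonyms)
from_typo = standardize("kidny", vocabulary, synonyms)
no_match = standardize("xyz", vocabulary, synonyms)
```

Once every cell holds a vocabulary term, two samples can be compared with plain string equality, which is the property the abstract relies on for retrospective reuse.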
PubMed: 37623884
DOI: 10.3390/metabo13080941
-
MedRxiv : the Preprint Server For... Sep 2023
BACKGROUND
A scalable approach for the sharing and reuse of human-readable and computer-executable phenotype definitions can facilitate the reuse of electronic health records for cohort identification and research studies.
DESCRIPTION
We developed a tool called Sharephe for the Informatics for Integrating Biology and the Bedside (i2b2) platform. Sharephe consists of a plugin for i2b2 and a cloud-based searchable repository of computable phenotypes, has the functionality to import to and export from the repository, and has the ability to link to supporting metadata.
DISCUSSION
The i2b2 platform enables researchers to create, evaluate, and implement phenotypes without knowing complex query languages. In an initial evaluation, two sites on the Evolve to Next-Gen ACT (ENACT) network used Sharephe to successfully create, share, and reuse phenotypes.
CONCLUSION
The combination of a cloud-based computable repository and an i2b2 plugin for accessing the repository enables investigators to store and retrieve phenotypes from anywhere and at any time and to collaborate across sites in a research network.
PubMed: 37790390
DOI: 10.1101/2023.09.17.23295681
-
Viruses Aug 2023
Review
Viruses are abundant and diverse entities that have important roles in public health, ecology, and agriculture. The identification and surveillance of viruses rely on an understanding of their genome organization, sequences, and replication strategy. Despite technological advancements in sequencing methods, our current understanding of virus diversity remains incomplete, highlighting the need to explore undiscovered viruses. Virus databases play a crucial role in providing access to sequences, annotations and other metadata, and analysis tools for studying viruses. However, there has not been a comprehensive review of virus databases in the last five years. This study aimed to fill this gap by identifying 24 active virus databases and included an extensive evaluation of their content, functionality and compliance with the FAIR principles. In this study, we thoroughly assessed the search capabilities of five database catalogs, which serve as comprehensive repositories housing a diverse array of databases and offering essential metadata. Moreover, we conducted a comprehensive review of different types of errors, encompassing taxonomy, names, missing information, sequences, sequence orientation, and chimeric sequences, with the intention of empowering users to effectively tackle these challenges. We expect this review to aid users in selecting suitable virus databases and other resources, and to help databases manage errors and improve their adherence to the FAIR principles. The databases listed here represent the current knowledge of viruses and will help users find databases of interest based on content, functionality, and scope. The use of virus databases is integral to gaining new insights into the biology, evolution, and transmission of viruses, and developing new strategies to manage virus outbreaks and preserve global health.
PubMed: 37766241
DOI: 10.3390/v15091834
-
Trends in Biotechnology Feb 2024
Review
DNA is an intelligent data storage medium due to its stability and high density. It has been used by nature for over 3.5 billion years. Compared with traditional methods, DNA offers better compression and physical density. DNA can retain information for thousands of years. However, challenges exist in scalability, standardization, metadata gathering, biocybersecurity, and specialized tools. Addressing these challenges is crucial for widespread implementation. Collaboration among experts, as well as keeping the future in mind, is needed to unlock the full potential of DNA data storage, which promises low energy costs, high-density storage, and long-term stability.
Topics: Information Storage and Retrieval; DNA
PubMed: 37673693
DOI: 10.1016/j.tibtech.2023.08.001
-
PloS One 2023
Considerable scientific work involves locating, analyzing, systematizing, and synthesizing other publications, often with the help of online scientific publication databases and search engines. However, use of online sources suffers from a lack of repeatability and transparency, as well as from technical restrictions. Alexandria3k is a Python software package and an associated command-line tool that can populate embedded relational databases with slices from the complete set of several open publication metadata sets. These can then be employed for reproducible processing and analysis through versatile and performant queries. We demonstrate the software's utility by visualizing the evolution of publications in diverse scientific fields and relationships among them, by outlining scientometric facts associated with COVID-19 research, and by replicating commonly-used bibliometric measures and findings regarding scientific productivity, impact, and disruption.
Topics: Databases, Factual; Search Engine; Bibliometrics; Metadata; Research Design
PubMed: 38032908
DOI: 10.1371/journal.pone.0294946
-
Nature Methods Nov 2023
The increasing generation of population-level single-cell atlases has the potential to link sample metadata with cellular data. Constructing such references requires integration of heterogeneous cohorts with varying metadata. Here we present single-cell population level integration (scPoli), an open-world learner that incorporates generative models to learn sample and cell representations for data integration, label transfer and reference mapping. We applied scPoli on population-level atlases of lung and peripheral blood mononuclear cells, the latter consisting of 7.8 million cells across 2,375 samples. We demonstrate that scPoli can explain sample-level biological and technical variations using sample embeddings revealing genes associated with batch effects and biological effects. scPoli is further applicable to single-cell sequencing assay for transposase-accessible chromatin and cross-species datasets, offering insights into chromatin accessibility and comparative genomics. We envision scPoli becoming an important tool for population-level single-cell data integration facilitating atlas use but also interpretation by means of multi-scale analyses.
Topics: Humans; Leukocytes, Mononuclear; Genomics; Chromatin; Single-Cell Analysis
PubMed: 37813989
DOI: 10.1038/s41592-023-02035-2
-
Frontiers in Immunology 2023
BACKGROUND
The surge in the number of publications on psoriasis has posed significant challenges for researchers in effectively managing the vast amount of information. However, due to the lack of tools to process metadata, no comprehensive bibliometric analysis has been conducted.
OBJECTIVES
This study aims to evaluate the trends and current hotspots of psoriasis research from a macroscopic perspective through a bibliometric analysis assisted by machine-learning-based semantic analysis.
METHODS
Publications indexed under the Medical Subject Headings (MeSH) term "Psoriasis" from 2003 to 2022 were extracted from PubMed. The generative statistical algorithm latent Dirichlet allocation (LDA) was applied to identify specific topics and trends based on abstracts. The unsupervised Louvain algorithm was used to establish a network identifying relationships between topics.
RESULTS
A total of 28,178 publications were identified. The publications were derived from 176 countries, with the United States, China, and Italy being the top three countries. For the term "psoriasis", 9,183 MeSH terms appeared 337,545 times. Among them, the MeSH terms "Severity of illness index", "Treatment outcome", and "Dermatologic agents" occurred most frequently. A total of 21,928 publications were included in the LDA analysis, which identified three main areas and 50 branched topics, with "Molecular pathogenesis", "Clinical trials", and "Skin inflammation" being the topics that increased the most. LDA networks showed that "Skin inflammation" was tightly associated with "Molecular pathogenesis" and "Biological agents". "Nail psoriasis" and "Epidemiological study" have emerged as new research hotspots, and attention to comorbidity topics, including "Cardiovascular comorbidities", "Psoriatic arthritis", "Obesity" and "Psychological disorders", has increased gradually.
CONCLUSIONS
Research on psoriasis is flourishing, with molecular pathogenesis, skin inflammation, and clinical trials being the current hotspots. The strong association between skin inflammation and biologic agents indicates effective translation between basic research and clinical application in psoriasis. In addition, nail psoriasis, epidemiological studies and comorbidities of psoriasis are drawing increased attention.
Topics: Humans; United States; Psoriasis; Arthritis, Psoriatic; Bibliometrics; Dermatitis; Machine Learning; Inflammation
PubMed: 37954610
DOI: 10.3389/fimmu.2023.1272080
-
Scientific Data Sep 2023
The Two Weeks in the World research project has resulted in a dataset of 3087 clinically relevant bacterial genomes with associated metadata, collected from 59 diagnostic units in 35 countries around the world during 2020. A relational database is available with metadata and summary data from selected bioinformatic analyses, such as species prediction and identification of acquired resistance genes.
Topics: Bacteria; Computational Biology; Databases, Factual; Genome, Bacterial; Metadata
PubMed: 37717051
DOI: 10.1038/s41597-023-02502-7
-
American Journal of Ophthalmology Aug 2023
PURPOSE
To develop a multimodal artificial intelligence (AI) system, EE-Explorer, to triage eye emergencies and assist in primary diagnosis using metadata and ocular images.
DESIGN
A diagnostic, cross-sectional, validity and reliability study.
METHODS
EE-Explorer consists of 2 models. The triage model was developed from metadata (events, symptoms, and medical history) and ocular surface images via smartphones from 2038 patients presenting to Zhongshan Ophthalmic Center (ZOC) to output 3 classifications: urgent, semiurgent, and nonurgent. The primary diagnostic model was developed from the paired metadata and slitlamp images of 2405 patients from ZOC. Both models were externally tested on 103 participants from 4 other hospitals. A pilot test was conducted in Guangzhou to evaluate the hierarchical referral service pattern assisted by EE-Explorer for unspecialized health care facilities.
RESULTS
A high overall accuracy, as indicated by an area under the receiver operating characteristic curve (AUC) of 0.982 (95% CI, 0.966-0.998), was obtained using the triage model, which outperformed the triage nurses (P < .001). In the primary diagnostic model, the diagnostic classification accuracy (CA) and Hamming loss (HL) in the internal testing were 0.808 (95% CI 0.776-0.840) and 0.016 (95% CI 0.006-0.026), respectively. In the external testing, model performance was robust for both triage (average AUC, 0.988, 95% CI 0.967-1.000) and primary diagnosis (CA, 0.718, 95% CI 0.644-0.792; and HL, 0.023, 95% CI 0.000-0.048). In the pilot test in the hierarchical referral settings, EE-Explorer demonstrated consistently robust performance and broad participant acceptance.
CONCLUSION
The EE-Explorer system showed robust performance in both triage and primary diagnosis for ophthalmic emergency patients. EE-Explorer can provide patients with acute ophthalmic symptoms access to remote self-triage and assist in primary diagnosis in unspecialized health care facilities to achieve rapid and effective treatment strategies.
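For readers unfamiliar with the Hamming loss reported for the primary diagnostic model, the sketch below shows how it is computed for multi-label predictions: it is the fraction of individual label slots that are predicted wrongly, so lower is better. The abstract does not define how classification accuracy was calculated; the exact-match version shown here is an assumption, and the label matrices are made-up examples rather than study data.

```python
def hamming_loss(y_true, y_pred):
    """Fraction of individual label slots that differ between truth and prediction."""
    total = 0
    wrong = 0
    for true_row, pred_row in zip(y_true, y_pred):
        if len(true_row) != len(pred_row):
            raise ValueError("label rows must have the same length")
        for t, p in zip(true_row, pred_row):
            total += 1
            wrong += int(t != p)
    if total == 0:
        raise ValueError("no labels to score")
    return wrong / total


def exact_match_accuracy(y_true, y_pred):
    """Fraction of samples whose full label vector is predicted correctly.

    Assumed definition; the source abstract does not specify its CA metric.
    """
    if not y_true:
        raise ValueError("no samples to score")
    matches = sum(list(t) == list(p) for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


# Three hypothetical patients, four hypothetical diagnosis labels each.
y_true = [[1, 0, 0, 0], [0, 1, 1, 0], [0, 0, 0, 1]]
y_pred = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1]]

hl = hamming_loss(y_true, y_pred)          # 1 wrong slot out of 12
ca = exact_match_accuracy(y_true, y_pred)  # 2 of 3 patients fully correct
```

The example also shows why the two metrics can diverge: one missed label changes a single slot out of twelve for the Hamming loss but counts as a whole patient misclassified under exact-match accuracy.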
Topics: Humans; Triage; Artificial Intelligence; Reproducibility of Results; Cross-Sectional Studies; Emergency Service, Hospital
PubMed: 37142171
DOI: 10.1016/j.ajo.2023.04.007