-
Bioinformatics (Oxford, England) Jun 2024
MOTIVATION
In many biomedical applications, we are confronted with paired groups of samples, such as treated versus control. The aim is to detect discriminating features, i.e. biomarkers, based on high-dimensional (omics-) data. This problem can be phrased more generally as a two-sample problem requiring statistical significance testing to establish differences, and interpretations to identify distinguishing features. The multivariate maximum mean discrepancy (MMD) test quantifies group-level differences, whereas statistically significantly associated features are usually found by univariate feature selection. Currently, few general-purpose methods simultaneously perform multivariate feature selection and two-sample testing.
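As a rough illustration of the two-sample testing idea mentioned above, the following sketch computes a biased estimate of the squared MMD with a Gaussian kernel and derives a permutation p-value. It is not the SpInOpt-MMD method (it has no sparsity or kernel optimization); the function names, the `gamma` bandwidth parameter, and the toy data are assumptions made for this example.

```python
import math
import random


def rbf_kernel(x, y, gamma=1.0):
    """Gaussian (RBF) kernel between two feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def mmd2_biased(xs, ys, gamma=1.0):
    """Biased estimate of squared MMD between samples xs and ys."""
    n, m = len(xs), len(ys)
    kxx = sum(rbf_kernel(a, b, gamma) for a in xs for b in xs) / (n * n)
    kyy = sum(rbf_kernel(a, b, gamma) for a in ys for b in ys) / (m * m)
    kxy = sum(rbf_kernel(a, b, gamma) for a in xs for b in ys) / (n * m)
    return kxx + kyy - 2.0 * kxy


def permutation_pvalue(xs, ys, n_perm=200, gamma=1.0, seed=0):
    """Permutation p-value: how often a random relabeling of the pooled
    samples gives an MMD at least as large as the observed one."""
    rng = random.Random(seed)
    observed = mmd2_biased(xs, ys, gamma)
    pooled = list(xs) + list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if mmd2_biased(pooled[:len(xs)], pooled[len(xs):], gamma) >= observed:
            hits += 1
    # Add-one correction keeps the p-value strictly positive.
    return (hits + 1) / (n_perm + 1)


# Hypothetical toy data: two clearly separated one-dimensional groups.
control = [[0.0], [0.1], [0.2], [0.05], [0.15], [0.12]]
treated = [[5.0], [5.1], [4.9], [5.2], [5.05], [4.95]]
p_value = permutation_pvalue(control, treated)
```

A small p-value here only indicates that the groups differ; identifying which features drive the difference is the part the abstract attributes to the sparse, optimized variant.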
RESULTS
We introduce a sparse, interpretable, and optimized MMD test (SpInOpt-MMD) that enables two-sample testing and feature selection in the same experiment. SpInOpt-MMD is a versatile method and we demonstrate its application to a variety of synthetic and real-world data types including images, gene expression measurements, and text data. SpInOpt-MMD is effective in identifying relevant features in small sample sizes and outperforms other feature selection methods such as SHapley Additive exPlanations and univariate association analysis in several experiments.
AVAILABILITY AND IMPLEMENTATION
The code and links to our public data are available at https://github.com/BorgwardtLab/spinoptmmd.
Topics: Biomarkers; Humans; Algorithms; Computational Biology
PubMed: 38940158
DOI: 10.1093/bioinformatics/btae251 -
Bioinformatics (Oxford, England) Jun 2024
MOTIVATION
Profiling of gene expression and chromatin accessibility by single-cell multi-omics approaches can help to systematically decipher how transcription factors (TFs) regulate target gene expression via cis-region interactions. However, integrating information from different modalities to discover regulatory associations is challenging, in part because motif scanning approaches miss many likely TF binding sites.
RESULTS
We develop REUNION, a framework for predicting genome-wide TF binding and cis-region-TF-gene "triplet" regulatory associations using single-cell multi-omics data. The first component of REUNION, Unify, utilizes information theory-inspired complementary score functions that incorporate TF expression, chromatin accessibility, and target gene expression to identify regulatory associations. The second component, Rediscover, takes Unify estimates as input for pseudo semi-supervised learning to predict TF binding in accessible genomic regions that may or may not include detected TF motifs. Rediscover leverages latent chromatin accessibility and sequence feature spaces of the genomic regions, without requiring chromatin immunoprecipitation data for model training. Applied to peripheral blood mononuclear cell data, REUNION outperforms alternative methods in TF binding prediction on average performance. In particular, it recovers missing region-TF associations from regions lacking detected motifs, which circumvents the reliance on motif scanning and facilitates discovery of novel associations involving potential co-binding transcriptional regulators. Newly identified region-TF associations, even in regions lacking a detected motif, improve the prediction of target gene expression in regulatory triplets, and are thus likely to genuinely participate in the regulation.
AVAILABILITY AND IMPLEMENTATION
All source code is available at https://github.com/yangymargaret/REUNION.
Topics: Transcription Factors; Humans; Single-Cell Analysis; Binding Sites; Chromatin; Genomics; Software; Computational Biology; Protein Binding; Algorithms; Leukocytes, Mononuclear; Multiomics
PubMed: 38940155
DOI: 10.1093/bioinformatics/btae234 -
Bioinformatics (Oxford, England) Jun 2024
SUMMARY
Spatial omics technologies are increasingly leveraged to characterize how disease disrupts tissue organization and cellular niches. While multiple methods to analyze spatial variation within a sample have been published, statistical and computational approaches to compare cell spatial organization across samples or conditions are mostly lacking. We present GraphCompass, a comprehensive set of omics-adapted graph analysis methods to quantitatively evaluate and compare the spatial arrangement of cells in samples representing diverse biological conditions. GraphCompass builds upon the Squidpy spatial omics toolbox and encompasses various statistical approaches to perform cross-condition analyses at the level of individual cell types, niches, and samples. Additionally, GraphCompass provides custom visualization functions that enable effective communication of results. We demonstrate how GraphCompass can be used to address key biological questions, such as how cellular organization and tissue architecture differ across various disease states and which spatial patterns correlate with a given pathological condition. GraphCompass can be applied to various popular omics techniques, including, but not limited to, spatial proteomics (e.g. MIBI-TOF), spot-based transcriptomics (e.g. 10× Genomics Visium), and single-cell resolved transcriptomics (e.g. Stereo-seq). In this work, we showcase the capabilities of GraphCompass through its application to three different studies that may also serve as benchmark datasets for further method development. With its easy-to-use implementation, extensive documentation, and comprehensive tutorials, GraphCompass is accessible to biologists with varying levels of computational expertise. By facilitating comparative analyses of cell spatial organization, GraphCompass promises to be a valuable asset in advancing our understanding of tissue function in health and disease.
Topics: Humans; Software; Proteomics; Computational Biology; Genomics; Animals; Transcriptome; Single-Cell Analysis
PubMed: 38940138
DOI: 10.1093/bioinformatics/btae242 -
Frontiers in Bioscience (Landmark... Jun 2024
BACKGROUND
The incidence rate of oropharyngeal squamous cell carcinoma (OPSCC) worldwide is alarming. In the clinical community, there is a pressing necessity to comprehend the etiology of the OPSCC to facilitate the administration of effective treatments.
METHODS
This study presents an integrative genomics approach for identifying key oncogenic drivers involved in OPSCC pathogenesis. The dataset contains RNA-Sequencing (RNA-Seq) samples of 46 human papillomavirus-positive head and neck squamous cell carcinoma cases and 25 normal uvulopalatopharyngoplasty cases. Differential marker selection between the groups, with a log2 fold change (FC) threshold of 2 and an adjusted P-value < 0.01, yielded 714 genes. The Particle Swarm Optimization (PSO) algorithm then selects a candidate gene subset, reducing the size to 73. State-of-the-art machine learning algorithms are trained with the differentially expressed genes and with the PSO candidate subset.
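The differential marker screen described above can be sketched as a simple threshold filter. This is a minimal illustration, assuming the log2 fold change cutoff of 2 is applied to the absolute value and that adjusted P-values are already available; the record layout and the gene identifiers (`GENE_A` and so on) are hypothetical and do not come from the study.

```python
def select_markers(results, lfc_cutoff=2.0, padj_cutoff=0.01):
    """Keep genes passing both the fold-change and adjusted P-value cutoffs.

    `results` is an iterable of (gene, log2_fold_change, adjusted_p) tuples.
    Using the absolute fold change is an assumption of this sketch.
    """
    return [
        gene
        for gene, log2_fc, padj in results
        if abs(log2_fc) >= lfc_cutoff and padj < padj_cutoff
    ]


# Hypothetical toy records, not taken from the study.
toy_results = [
    ("GENE_A", 3.1, 0.0004),   # passes both cutoffs
    ("GENE_B", -2.6, 0.002),   # passes (strongly down-regulated)
    ("GENE_C", 1.2, 0.0001),   # fails the fold-change cutoff
    ("GENE_D", 4.0, 0.2),      # fails the adjusted P-value cutoff
]
selected = select_markers(toy_results)
```

In the study, a screen of this kind is followed by PSO-based subset selection, which this sketch does not attempt to reproduce.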
RESULTS
The analysis of predictive models using Shapley Additive exPlanations revealed that seven genes significantly contribute to the model's performance. These include , , and , which predominantly influence differentiating between sample groups. They were followed in importance by , , , and . The Random Forest and Bayes Net algorithms also achieved perfect validation scores when using PSO features. Furthermore, gene set enrichment analysis, protein-protein interactions, and disease ontology mining revealed a significant association between these genes and the target condition. As indicated by Shapley Additive exPlanations (SHAPs), the survival analysis of three key genes unveiled strong over-expression in the samples from "The Cancer Genome Atlas".
CONCLUSIONS
Our findings elucidate critical oncogenic drivers in OPSCC, offering vital insights for developing targeted therapies and enhancing understanding of its pathogenesis.
Topics: Humans; Oropharyngeal Neoplasms; Biomarkers, Tumor; Papillomavirus Infections; Artificial Intelligence; Gene Expression Regulation, Neoplastic; Squamous Cell Carcinoma of Head and Neck; Algorithms; Sequence Analysis, RNA; Machine Learning; Papillomaviridae; Carcinoma, Squamous Cell
PubMed: 38940026
DOI: 10.31083/j.fbl2906220 -
Health Care Science Aug 2023
BACKGROUND
The association between cancer and venous thromboembolism (VTE) is well-established with cancer patients accounting for approximately 20% of all VTE incidents. In this paper, we have performed a comparison of machine learning (ML) methods to traditional clinical scoring models for predicting the occurrence of VTE in a cancer patient population, identified important features (clinical biomarkers) for ML model predictions, and examined how different approaches to reducing the number of features used in the model impact model performance.
METHODS
We have developed an ML pipeline including three separate feature selection processes and applied it to routine patient care data from the electronic health records of 1910 cancer patients at the University of California Davis Medical Center.
RESULTS
Our ML-based prediction model achieved an area under the receiver operating characteristic curve of 0.778 ± 0.006 (mean ± SD) when trained on a set of 15 features. This result is comparable with the model performance when trained on all features in our feature pool [0.779 ± 0.006 (mean ± SD) with 29 features]. Our result surpasses the most validated clinical scoring system for VTE risk assessment in cancer patients by 16.1%. We additionally found cancer stage information to be a useful predictor after all performed feature selection processes despite not being used in existing score-based approaches.
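For readers unfamiliar with the metric reported above, the area under the receiver operating characteristic curve can be computed as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The sketch below is a generic illustration of that definition; the function name and the toy labels and scores are assumptions, not data from the study.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties contribute half a win
    return wins / (len(positives) * len(negatives))


# Hypothetical toy predictions (1 = VTE event, 0 = no event).
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = auroc(toy_labels, toy_scores)
```

The pairwise form is quadratic in the number of samples, which is acceptable for a sketch; rank-based implementations are the usual choice for large cohorts.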
CONCLUSION
From these findings, we observe that ML can offer new insights and a significant improvement over the most validated clinical VTE risk scoring systems in cancer patients. The results of this study also allowed us to draw insight into our feature pool and identify the features that could have the most utility in the context of developing an efficient ML classifier. While a model trained on our entire feature pool of 29 features significantly outperformed the traditionally used clinical scoring system, we were able to achieve an equivalent performance using a subset of only 15 features through strategic feature selection methods. These results are encouraging for potential applications of ML to predicting cancer-associated VTE in clinical settings such as in bedside decision support systems where feature availability may be limited.
PubMed: 38939521
DOI: 10.1002/hcs2.55 -
JACC. Advances Aug 2023
BACKGROUND
Detection of heart failure with preserved ejection fraction (HFpEF) involves integration of multiple imaging and clinical features which are often discordant or indeterminate.
OBJECTIVES
The authors applied artificial intelligence (AI) to analyze a single apical 4-chamber transthoracic echocardiogram video clip to detect HFpEF.
METHODS
A 3-dimensional convolutional neural network was developed and trained on apical 4-chamber video clips to classify patients with HFpEF (diagnosis of heart failure, ejection fraction ≥50%, and echocardiographic evidence of increased filling pressure; cases) vs without HFpEF (ejection fraction ≥50%, no diagnosis of heart failure, normal filling pressure; controls). Model outputs were classified as HFpEF, no HFpEF, or nondiagnostic (high uncertainty). Performance was assessed in an independent multisite data set and compared to previously validated clinical scores.
RESULTS
Training and validation included 2,971 cases and 3,785 controls (validation holdout, 16.8% patients), and demonstrated excellent discrimination (area under receiver-operating characteristic curve: 0.97 [95% CI: 0.96-0.97] and 0.95 [95% CI: 0.93-0.96] in training and validation, respectively). In independent testing (646 cases, 638 controls), 94 (7.3%) were nondiagnostic; sensitivity (87.8%; 95% CI: 84.5%-90.9%) and specificity (81.9%; 95% CI: 78.2%-85.6%) were maintained in clinically relevant subgroups, with high repeatability and reproducibility. Of 701 and 776 indeterminate outputs from the Heart Failure Association-Pretest Assessment, Echocardiographic and Natriuretic Peptide Score, Functional Testing, and Final Etiology (HFA-PEFF) and the Heavy, Hypertensive, Atrial Fibrillation, Pulmonary Hypertension, Elder, and Filling Pressure (H2FPEF) scores, the AI HFpEF model correctly reclassified 73.5% and 73.6%, respectively. During follow-up (median: 2.3 [IQR: 0.5-5.6] years), 444 (34.6%) patients died; mortality was higher in patients classified as HFpEF by AI (HR: 1.9 [95% CI: 1.5-2.4]).
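The proportions quoted above follow from simple ratios, sketched below. The test-set sizes (646 cases, 638 controls), the 94 nondiagnostic outputs, and the 444 deaths are taken from the abstract; the confusion-matrix counts passed to `sensitivity_specificity` are hypothetical, because the abstract reports only percentages.

```python
def percent(part, whole):
    """Share of `part` in `whole`, as a percentage rounded to one decimal."""
    return round(100.0 * part / whole, 1)


def sensitivity_specificity(tp, fn, tn, fp):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    return tp / (tp + fn), tn / (tn + fp)


# Figures reported in the abstract for the independent test set.
n_test = 646 + 638
nondiagnostic_pct = percent(94, n_test)
mortality_pct = percent(444, n_test)

# Hypothetical confusion-matrix counts, for illustration only.
sens, spec = sensitivity_specificity(tp=90, fn=10, tn=80, fp=20)
```

With these inputs, the nondiagnostic and mortality shares agree with the 7.3% and 34.6% stated in the abstract.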
CONCLUSIONS
An AI HFpEF model based on a single, routinely acquired echocardiographic video demonstrated excellent discrimination of patients with vs without HFpEF, more often than clinical scores, and identified patients with higher mortality.
PubMed: 38939447
DOI: 10.1016/j.jacadv.2023.100452 -
Journal of Child & Adolescent Trauma Jun 2024
The literature suggests that there is a significant overlap in definition, measurement, and outcomes between trauma and bullying victimization, but the relative impact on current emotional distress of these events has not been explored. The goal of the current study was to explore whether traditional and cyber bullying victimization has a similar negative impact on current emotional distress as other adverse childhood experiences (ACEs) which may also lead to a traumatic response. In addition, this study examined whether the association between bullying victimization and emotional distress is exacerbated when individuals also experience additional ACEs. Retrospective reports from a diverse sample of 576 adults were collected via an online survey. When ranked against other ACEs such as viewing family mental health problems or substance abuse, or verbal, physical, emotional, and sexual victimization not from peers, nearly 30% of participants ranked bullying victimization as having the most negative impact on their levels of emotional distress. Multi-group path analyses indicated that experiencing additional ACEs seems to exacerbate distress caused by bullying and cyber bullying victimization. The current study suggests that bullying victimization may be just as detrimental as other types of ACEs that occur in childhood.
PubMed: 38938969
DOI: 10.1007/s40653-023-00567-5 -
Biomedical Reports Aug 2024
Type 2 diabetes mellitus (T2DM) is a major global health problem. Response to first-line therapy is variable. This is partially due to interindividual variability across those genes codifying transport, metabolising, and drug activation proteins involved in first-line pharmacological treatment. Single nucleotide polymorphisms (SNPs) of genes and affect metformin therapeutic response in patients with T2DM. The present study investigated allelic and genotypic frequencies of organic cation transporter (OCT)1, OCT2, and OCT3 polymorphisms among metformin-treated patients with T2DM. It also reports the association between clinical and genetic variables with glycated haemoglobin (HbA1c) control in 59 patients with T2DM. Patients were genotyped through real-time PCR (TaqMan assays). Metformin plasmatic levels were determined by mass spectrometry. Neither the analysis of HbA1c control by SNPs in , and , nor the dominant genotypic model analysis yielded statistical significance between genotypes in polymorphisms rs72552763 (P=0.467), rs622342 (P=0.221), rs316019 (P=0.220) and rs2076828 (P=0.215). HbA1c levels were different in rs72552763 [GAT/GAT=6.0 (5.7-6.6), GAT/del=6.5 (6.2-9.0), del/del=6.5 (6.4-6.8); P=0.022] and rs622342 [A/A=6.0 (5.8-6.5), A/C=6.4 (6.1-7.7), C/C=6.8 (6.4-9.3); P=0.009] genotypes. The dominant genotypic model found the lowest HbA1c levels in GAT/GAT (P=0.005) and A/A (P=0.010), in rs72552763 (GAT/GAT vs. GAT/del + del/del) and rs622342 (A/A vs. A/C + C/C), respectively. There was a significant correlation between HbA1c levels and metformin dosage amongst del allele carriers in rs72552763 (β=0.14, P<0.001, r=0.387), as opposed to GAT/GAT in rs72552763. There were no differences between HbA1c values in the test set and those predicted by machine learning models employing a simple linear regression based on metformin dosage.
Therefore, rs72552763 and rs622342 polymorphisms in may affect metformin response determined by HbA1c levels in patients with T2DM. The del allele of SNP rs72552763 may serve as a metformin response biomarker.
PubMed: 38938740
DOI: 10.3892/br.2024.1806 -
Journal of Extracellular Biology Jan 2024
Extracellular vesicles (EVs) are membranous structures released by cells into the extracellular space and are thought to be involved in cell-to-cell communication. While EVs and their cargo are promising biomarker candidates, sorting mechanisms of proteins to EVs remain unclear. In this study, we ask if it is possible to determine EV association based on the protein sequence. Additionally, we ask what the most important determinants are for EV association. We answer these questions with explainable AI models, using human proteome data from EV databases to train and validate the model. It is essential to correct the datasets for contaminants introduced by coarse EV isolation workflows and for experimental bias caused by mass spectrometry. In this study, we show that it is indeed possible to predict EV association from the protein sequence: a simple sequence-based model for predicting EV proteins achieved an area under the curve of 0.77 ± 0.01, which increased further to 0.84 ± 0.00 when incorporating curated post-translational modification (PTM) annotations. Feature analysis shows that EV-associated proteins are stable, polar, and structured with low isoelectric point compared to non-EV proteins. PTM annotations emerged as the most important features for correct classification; specifically, palmitoylation is one of the most prevalent EV sorting mechanisms for unique proteins. Palmitoylation and nitrosylation sites are especially prevalent in EV proteins that are determined by very strict isolation protocols, indicating they could potentially serve as quality control criteria for future studies. This computational study offers an effective sequence-based predictor of EV associated proteins with extensive characterisation of the human EV proteome that can explain for individual proteins which factors contribute to their EV association.
PubMed: 38938677
DOI: 10.1002/jex2.120 -
JACC. Advances Dec 2023
PubMed: 38938477
DOI: 10.1016/j.jacadv.2023.100682