-
Bioinformatics (Oxford, England) Jun 2024
MOTIVATION
Electronic health records (EHRs) represent a comprehensive resource of a patient's medical history. EHRs are essential for utilizing advanced technologies such as deep learning (DL), enabling healthcare providers to analyze extensive data, extract valuable insights, and make precise and data-driven clinical decisions. DL methods such as recurrent neural networks (RNN) have been utilized to analyze EHRs to model disease progression and predict diagnosis. However, these methods do not address some inherent irregularities in EHR data, such as irregular time intervals between clinical visits. Furthermore, most DL models are not interpretable. In this study, we propose two interpretable DL architectures based on RNN, namely time-aware RNN (TA-RNN) and TA-RNN-autoencoder (TA-RNN-AE), to predict a patient's clinical outcome from EHR data at the next visit and multiple visits ahead, respectively. To mitigate the impact of irregular time intervals, we propose incorporating time embedding of the elapsed times between visits. For interpretability, we propose employing a dual-level attention mechanism that operates between visits and between features within each visit.
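To make the idea of embedding elapsed times concrete, here is a minimal sketch in plain Python. It assumes a sinusoidal embedding that is added to each visit's feature vector; the abstract does not specify the embedding used in TA-RNN, so the function names, the sinusoidal form, and the `max_period` parameter are all illustrative assumptions rather than the authors' implementation.

```python
import math

def time_embedding(elapsed, dim=8, max_period=10000.0):
    """Sinusoidal embedding of one elapsed time (assumed form, dim must be even)."""
    emb = []
    for k in range(dim // 2):
        # Each pair of dimensions uses a different frequency.
        freq = 1.0 / (max_period ** (2 * k / dim))
        emb.append(math.sin(elapsed * freq))
        emb.append(math.cos(elapsed * freq))
    return emb

def add_time_embedding(visits, elapsed_times, dim=8):
    """Add the embedding of each gap to the matching visit vector of length dim."""
    out = []
    for visit, gap in zip(visits, elapsed_times):
        emb = time_embedding(gap, dim)
        out.append([v + e for v, e in zip(visit, emb)])
    return out

# Three visits with hypothetical gaps of 0, 6, and 18 months since the previous visit.
visits = [[0.0] * 8, [0.0] * 8, [0.0] * 8]
gaps = [0.0, 6.0, 18.0]
embedded = add_time_embedding(visits, gaps)
```

With such an embedding, two visits with identical features but different gaps reach the recurrent layer as different inputs, which is the property a time-aware model needs.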
RESULTS
The results of the experiments conducted on the Alzheimer's Disease Neuroimaging Initiative (ADNI) and National Alzheimer's Coordinating Center (NACC) datasets indicated the superior performance of the proposed models for predicting Alzheimer's Disease (AD) compared to state-of-the-art and baseline approaches in terms of F2 score and sensitivity. Additionally, TA-RNN showed superior performance on the Medical Information Mart for Intensive Care (MIMIC-III) dataset for mortality prediction. In our ablation study, we observed enhanced predictive performance by incorporating time embedding and attention mechanisms. Finally, investigating attention weights helped identify influential visits and features in predictions.
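For readers unfamiliar with the F2 score, the sketch below computes the general F-beta score from confusion-matrix counts. With beta = 2, recall (sensitivity) is weighted more heavily than precision. The function name and the example counts are illustrative and are not taken from the paper.

```python
def f_beta(tp, fp, fn, beta=2.0):
    """F-beta score from true positives, false positives, and false negatives."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical counts: precision 0.5, recall 0.8.
f2 = f_beta(tp=8, fp=8, fn=2, beta=2.0)
f1 = f_beta(tp=8, fp=8, fn=2, beta=1.0)
```

When recall exceeds precision, as in this example, F2 is higher than F1, which is why F2 suits screening tasks where missed cases are costlier than false alarms.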
AVAILABILITY AND IMPLEMENTATION
https://github.com/bozdaglab/TA-RNN.
Topics: Electronic Health Records; Neural Networks, Computer; Humans; Deep Learning; Alzheimer Disease
PubMed: 38940180
DOI: 10.1093/bioinformatics/btae264
-
Bioinformatics (Oxford, England) Jun 2024
MOTIVATION
In drug discovery, it is crucial to assess the drug-target binding affinity (DTA). Although molecular docking is widely used, computational efficiency limits its application in large-scale virtual screening. Deep learning-based methods learn virtual scoring functions from labeled datasets and can quickly predict affinity. However, these methods have three limitations. First, existing methods only consider the atom-bond graph or one-dimensional sequence representations of compounds, ignoring the information about functional groups (pharmacophores) with specific biological activities. Second, models that rely on limited labeled datasets fail to learn comprehensive embedding representations of compounds and proteins, resulting in poor generalization performance in complex scenarios. Third, existing feature fusion methods cannot adequately capture contextual interaction information.
RESULTS
Therefore, we propose a novel DTA prediction method named HeteroDTA. Specifically, a multi-view compound feature extraction module is constructed to model the atom-bond graph and pharmacophore graph. The residue contact graph and protein sequence are also utilized to model protein structure and function. Moreover, to enhance the generalization capability and reduce the dependence on task-specific labeled data, pre-trained models are utilized to initialize the atomic features of the compounds and the embedding representations of the protein sequence. A context-aware nonlinear feature fusion method is also proposed to learn interaction patterns between compounds and proteins. Experimental results on public benchmark datasets show that HeteroDTA significantly outperforms existing methods. In addition, HeteroDTA shows excellent generalization performance in cold-start experiments and superiority in the representation learning ability of drug-target pairs. Finally, the effectiveness of HeteroDTA is demonstrated in a real-world drug discovery study.
AVAILABILITY AND IMPLEMENTATION
The source code and data are available at https://github.com/daydayupzzl/HeteroDTA.
Topics: Drug Discovery; Molecular Docking Simulation; Proteins; Deep Learning; Pharmacophore
PubMed: 38940179
DOI: 10.1093/bioinformatics/btae240
-
Bioinformatics (Oxford, England) Jun 2024
MOTIVATION
The World Health Organization estimates that there were over 10 million cases of tuberculosis (TB) worldwide in 2019, resulting in over 1.4 million deaths, with a worrisome increasing trend yearly. The disease is caused by Mycobacterium tuberculosis (MTB) through airborne transmission. Treatment of TB is estimated to be 85% successful; however, this drops to 57% if MTB exhibits multiple antimicrobial resistance (AMR), for which fewer treatment options are available.
RESULTS
We develop a robust machine-learning classifier using both linear and nonlinear models (i.e. LASSO logistic regression (LR) and random forests (RF)) to predict the phenotypic resistance of Mycobacterium tuberculosis (MTB) to a broad range of antibiotic drugs. We train our classifier on data from the CRyPTIC consortium, which consist of whole genome sequencing and antibiotic susceptibility testing (AST) phenotypic data for 13 different antibiotics. To train our model, we assemble the sequence data into genomic contigs, identify all unique 31-mers in the set of contigs, and build a feature matrix M, where M[i, j] is equal to the number of times the ith 31-mer occurs in the jth genome. Due to the size of this feature matrix (over 350 million unique 31-mers), we build and use a sparse matrix representation. Our method, which we refer to as MTB++, leverages compact data structures and iterative methods to allow for the screening of all the 31-mers in the development of both LASSO LR and RF. MTB++ achieves high discrimination (F-1 > 80%) for the first-line antibiotics. Moreover, MTB++ had the highest F-1 score in all but three classes and was the most comprehensive, since it had an F-1 score > 75% for all but four (rare) antibiotic drugs. We use our feature selection to contextualize the 31-mers that are used for the prediction of phenotypic resistance, leading to some insights about sequence similarity to genes in MEGARes. Lastly, we give an estimate of the amount of data needed to provide accurate predictions.
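The feature matrix described above can be sketched as follows. This is a toy illustration that stores only the non-zero counts in a dictionary keyed by (i, j). It is not the MTB++ implementation, which relies on compact data structures to handle hundreds of millions of 31-mers. The function names are hypothetical, and the sketch counts k-mers as they appear on the given strand without merging reverse complements.

```python
from collections import Counter

def kmers(seq, k):
    """Yield every length-k substring of seq (none if seq is shorter than k)."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))

def build_sparse_kmer_matrix(genomes, k=31):
    """Build M as a dict of non-zero entries.

    genomes: list of genomes, each a list of contig strings.
    Returns (kmer_index, entries), where kmer_index maps k-mer -> row i and
    entries[(i, j)] is the count of the i-th k-mer in the j-th genome.
    """
    kmer_index = {}
    entries = {}
    for j, contigs in enumerate(genomes):
        counts = Counter()
        for contig in contigs:
            counts.update(kmers(contig, k))
        for kmer, count in counts.items():
            i = kmer_index.setdefault(kmer, len(kmer_index))
            entries[(i, j)] = count
    return kmer_index, entries

# Toy example with k=3 instead of 31 so the counts are easy to check by hand.
genomes = [["ACGTACG"], ["ACGAAA", "CGT"]]
kmer_index, entries = build_sparse_kmer_matrix(genomes, k=3)
```

Because most k-mers are absent from most genomes, storing only the non-zero entries keeps memory proportional to the number of observed (k-mer, genome) pairs rather than to the full matrix size.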
AVAILABILITY
The models and source code are publicly available on GitHub at https://github.com/M-Serajian/MTB-Pipeline.
Topics: Mycobacterium tuberculosis; Machine Learning; Drug Resistance, Bacterial; Microbial Sensitivity Tests; Anti-Bacterial Agents; Whole Genome Sequencing; Genome, Bacterial; Humans
PubMed: 38940175
DOI: 10.1093/bioinformatics/btae243
-
Bioinformatics (Oxford, England) Jun 2024
MOTIVATION
Genetic perturbations (e.g. knockouts, variants) have laid the foundation for our understanding of many diseases, implicating pathogenic mechanisms and indicating therapeutic targets. However, experimental assays are fundamentally limited by the number of measurable perturbations. Computational methods can fill this gap by predicting perturbation effects under novel conditions, but accurately predicting the transcriptional responses of cells to unseen perturbations remains a significant challenge.
RESULTS
We address this by developing a novel attention-based neural network, AttentionPert, which accurately predicts gene expression under multiplexed perturbations and generalizes to unseen conditions. AttentionPert integrates global and local effects in a multi-scale model, representing both the nonuniform system-wide impact of the genetic perturbation and the localized disturbance in a network of gene-gene similarities, enhancing its ability to predict nuanced transcriptional responses to both single and multi-gene perturbations. In comprehensive experiments, AttentionPert demonstrates superior performance across multiple datasets, outperforming the state-of-the-art method in predicting differential gene expression and revealing novel gene regulations. AttentionPert marks a significant improvement over current methods, particularly in handling the diversity of gene perturbations and in predicting out-of-distribution scenarios.
AVAILABILITY AND IMPLEMENTATION
Code is available at https://github.com/BaiDing1234/AttentionPert.
Topics: Computational Biology; Humans; Gene Regulatory Networks; Neural Networks, Computer; Gene Expression Profiling
PubMed: 38940174
DOI: 10.1093/bioinformatics/btae244
-
CODEX: COunterfactual Deep learning for the in silico EXploration of cancer cell line perturbations.
Bioinformatics (Oxford, England) Jun 2024
MOTIVATION
High-throughput screens (HTS) provide a powerful tool to decipher the causal effects of chemical and genetic perturbations on cancer cell lines. Their ability to evaluate a wide spectrum of interventions, from single drugs to intricate drug combinations and CRISPR-interference, has established them as an invaluable resource for the development of novel therapeutic approaches. Nevertheless, the combinatorial complexity of potential interventions makes a comprehensive exploration intractable. Hence, prioritizing interventions for further experimental investigation becomes of utmost importance.
RESULTS
We propose CODEX (COunterfactual Deep learning for the in silico EXploration of cancer cell line perturbations) as a general framework for the causal modeling of HTS data, linking perturbations to their downstream consequences. CODEX relies on a stringent causal modeling strategy based on counterfactual reasoning. As such, CODEX predicts drug-specific cellular responses, comprising cell survival and molecular alterations, and facilitates the in silico exploration of drug combinations. This is achieved for both bulk and single-cell HTS. We further show that CODEX provides a rationale to explore complex genetic modifications from CRISPR-interference in silico in single cells.
AVAILABILITY AND IMPLEMENTATION
Our implementation of CODEX is publicly available at https://github.com/sschrod/CODEX. All data used in this article are publicly available.
Topics: Humans; Deep Learning; Cell Line, Tumor; Computer Simulation; High-Throughput Screening Assays; Neoplasms; Computational Biology; Software; Antineoplastic Agents
PubMed: 38940173
DOI: 10.1093/bioinformatics/btae261
-
Bioinformatics (Oxford, England) Jun 2024
SUMMARY
Cis-acting mRNA elements play a key role in the regulation of mRNA stability and translation efficiency. Revealing the interactions of these elements and their impact is crucial for understanding the regulation of the mRNA translation process, which supports the development of mRNA-based medicines or vaccines. Deep neural networks (DNN) can learn complex cis-regulatory codes from RNA sequences. However, extracting these cis-regulatory codes efficiently from DNN remains a significant challenge. Here, we propose a method based on our toolkit NeuronMotif and motif mutagenesis, which not only enables the discovery of diverse and high-quality motifs but also efficiently reveals motif interactions. By interpreting deep-learning models, we have discovered several crucial motifs that impact mRNA translation efficiency and stability, as well as some unknown motifs or motif syntax, offering novel insights for biologists. Furthermore, we note that it is challenging to enrich motif syntax in datasets composed of randomly generated sequences, which may not contain sufficient biological signals.
AVAILABILITY AND IMPLEMENTATION
The source code and data used to produce the results and analyses presented in this manuscript are available from GitHub (https://github.com/WangLabTHU/combmotif).
Topics: RNA, Messenger; Deep Learning; Neural Networks, Computer; Nucleotide Motifs; Computational Biology; Humans
PubMed: 38940172
DOI: 10.1093/bioinformatics/btae262
-
Bioinformatics (Oxford, England) Jun 2024
MOTIVATION
Currently used methods for estimating branch support in phylogenetic analyses often rely on the classic Felsenstein's bootstrap, parametric tests, or their approximations. As these branch support scores are widely used in phylogenetic analyses, having accurate, fast, and interpretable scores is of high importance.
RESULTS
Here, we employed a data-driven approach to estimate branch support values with a probabilistic interpretation. To this end, we simulated thousands of realistic phylogenetic trees and the corresponding multiple sequence alignments. Each of the obtained alignments was used to infer the phylogeny using state-of-the-art phylogenetic inference software, which was then compared to the true tree. Using these extensive data, we trained machine-learning algorithms to estimate branch support values for each bipartition within the maximum-likelihood trees obtained by each software. Our results demonstrate that our model provides fast and more accurate probability-based branch support values than commonly used procedures. We demonstrate the applicability of our approach on empirical datasets.
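The comparison between an inferred tree and the true tree can be illustrated with a small sketch: each internal branch induces a bipartition of the leaves, and a branch of the inferred tree is labeled correct when the same bipartition occurs in the true tree. The nested-tuple tree encoding and the function names below are assumptions made for illustration. The abstract does not describe the authors' data structures or the features their models use.

```python
def collect_clades(tree, out):
    """Return the leaf set of tree and append the leaf set of every internal node to out."""
    if isinstance(tree, str):
        return frozenset([tree])
    leaves = frozenset()
    for child in tree:
        leaves |= collect_clades(child, out)
    out.append(leaves)
    return leaves

def bipartitions(tree):
    """Non-trivial bipartitions of a tree given as nested tuples of leaf names."""
    clades = []
    all_leaves = collect_clades(tree, clades)
    splits = set()
    for clade in clades:
        other = all_leaves - clade
        # Skip splits that separate a single leaf or nothing at all.
        if len(clade) > 1 and len(other) > 1:
            splits.add(frozenset([clade, other]))
    return splits

def label_branches(inferred_tree, true_tree):
    """Map each bipartition of the inferred tree to True if it is in the true tree."""
    true_splits = bipartitions(true_tree)
    return {split: split in true_splits for split in bipartitions(inferred_tree)}

true_tree = ((("A", "B"), "C"), ("D", "E"))
inferred_tree = ((("A", "C"), "B"), ("D", "E"))
labels = label_branches(inferred_tree, true_tree)
```

Labels of this kind, computed over many simulated trees, are the sort of binary targets a classifier could be trained on so that its predicted probability serves as a branch support value.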
AVAILABILITY AND IMPLEMENTATION
The data supporting this work are available in the Figshare repository at https://doi.org/10.6084/m9.figshare.25050554.v1, and the underlying code is accessible via GitHub at https://github.com/noaeker/bootstrap_repo.
Topics: Phylogeny; Machine Learning; Software; Algorithms; Sequence Alignment; Computational Biology; Likelihood Functions
PubMed: 38940166
DOI: 10.1093/bioinformatics/btae255
-
Bioinformatics (Oxford, England) Jun 2024
MOTIVATION
Acute kidney injury (AKI) is a syndrome that affects a large fraction of all critically ill patients, and early diagnosis, which is needed for patients to receive adequate treatment, is as imperative as it is challenging. Consequently, machine learning approaches have been developed to predict AKI ahead of time. However, the prevalence of AKI is often underestimated in state-of-the-art approaches, as they rely on an AKI event annotation solely based on creatinine, ignoring urine output.
UNLABELLED
We construct and evaluate early warning systems for AKI in a multi-disciplinary ICU setting, using the complete KDIGO definition of AKI. We propose several variants of gradient-boosted decision tree (GBDT)-based models, including a novel time-stacking based approach. A state-of-the-art LSTM-based model previously proposed for AKI prediction, which has not yet been specifically evaluated in ICU settings, is used for comparison.
RESULTS
We find that optimal performance is achieved by using GBDT with the time-based stacking technique (AUPRC = 65.7%, compared with the LSTM-based model's AUPRC = 62.6%), which is motivated by the high relevance of time since ICU admission for this task. Both models show mildly reduced performance in the limited training data setting, perform fairly across different subcohorts, and exhibit no issues in gender transfer.
UNLABELLED
Following the official KDIGO definition substantially increases the number of annotated AKI events. In our study, GBDTs outperform LSTM models for AKI prediction. Generally, we find that both model types are robust in a variety of challenging settings arising for ICU data.
AVAILABILITY AND IMPLEMENTATION
The code to reproduce the findings of our manuscript can be found at: https://github.com/ratschlab/AKI-EWS.
Topics: Acute Kidney Injury; Intensive Care Units; Humans; Machine Learning; Male; Female; Decision Trees; Aged; Middle Aged
PubMed: 38940165
DOI: 10.1093/bioinformatics/btae212
-
Bioinformatics (Oxford, England) Jun 2024
MOTIVATION
Human epidermal growth factor receptor 2 (HER2) status identification enables physicians to assess the prognosis risk and determine the treatment schedule for patients. In clinical practice, pathological slides serve as the gold standard, offering morphological information on cellular structure and tumoral regions. Computational analysis of pathological images has the potential to discover morphological patterns associated with HER2 molecular targets and achieve precise status prediction. However, pathological images typically have very high resolution, and HER2 expression in breast cancer (BC) images often exhibits intratumoral heterogeneity.
RESULTS
We present a phenotype-informed weakly supervised multiple instance learning architecture (PhiHER2) for the prediction of the HER2 status from pathological images of BC. Specifically, a hierarchical prototype clustering module is designed to identify representative phenotypes across whole slide images. These phenotype embeddings are then integrated into a cross-attention module, enhancing feature interaction and aggregation on instances. This yields a phenotype-based feature space that leverages the intratumoral morphological heterogeneity for HER2 status prediction. Extensive results demonstrate that PhiHER2 captures a better WSI-level representation by the typical phenotype guidance and significantly outperforms existing methods on real-world datasets. Additionally, interpretability analyses of both phenotypes and WSIs provide explicit insights into the heterogeneity of morphological patterns associated with molecular HER2 status.
AVAILABILITY AND IMPLEMENTATION
Our model is available at https://github.com/lyotvincent/PhiHER2.
Topics: Humans; Receptor, ErbB-2; Breast Neoplasms; Phenotype; Female; Supervised Machine Learning; Computational Biology
PubMed: 38940163
DOI: 10.1093/bioinformatics/btae236
-
Bioinformatics (Oxford, England) Jun 2024
MOTIVATION
In many biomedical applications, we are confronted with paired groups of samples, such as treated versus control. The aim is to detect discriminating features, i.e. biomarkers, based on high-dimensional (omics-) data. This problem can be phrased more generally as a two-sample problem requiring statistical significance testing to establish differences, and interpretations to identify distinguishing features. The multivariate maximum mean discrepancy (MMD) test quantifies group-level differences, whereas statistically significantly associated features are usually found by univariate feature selection. Currently, few general-purpose methods simultaneously perform multivariate feature selection and two-sample testing.
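As background on the MMD statistic, the sketch below computes the standard biased estimate of the squared MMD between two samples with a Gaussian (RBF) kernel. It is the textbook estimator and not the sparse, optimized variant proposed in the paper. The function names, the kernel choice, and the `gamma` bandwidth parameter are assumptions made for illustration.

```python
import math

def rbf_kernel(x, y, gamma=1.0):
    """Gaussian kernel exp(-gamma * ||x - y||^2) for two equal-length vectors."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def mmd2_biased(X, Y, gamma=1.0):
    """Biased estimate of squared MMD: mean k(x,x') + mean k(y,y') - 2 * mean k(x,y)."""
    kxx = sum(rbf_kernel(a, b, gamma) for a in X for b in X) / (len(X) ** 2)
    kyy = sum(rbf_kernel(a, b, gamma) for a in Y for b in Y) / (len(Y) ** 2)
    kxy = sum(rbf_kernel(a, b, gamma) for a in X for b in Y) / (len(X) * len(Y))
    return kxx + kyy - 2.0 * kxy

# Hypothetical two-feature samples: the groups differ only in the first feature.
control = [[0.0, 1.0], [0.2, 0.9], [0.1, 1.1]]
treated = [[2.0, 1.0], [2.2, 0.9], [2.1, 1.1]]
same = mmd2_biased(control, control)
diff = mmd2_biased(control, treated)
```

In a two-sample test, the observed statistic is typically compared against its distribution under random relabeling of the samples (a permutation test). Note that the plain statistic says that the groups differ but not which features drive the difference.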
RESULTS
We introduce a sparse, interpretable, and optimized MMD test (SpInOpt-MMD) that enables two-sample testing and feature selection in the same experiment. SpInOpt-MMD is a versatile method and we demonstrate its application to a variety of synthetic and real-world data types including images, gene expression measurements, and text data. SpInOpt-MMD is effective in identifying relevant features in small sample sizes and outperforms other feature selection methods such as SHapley Additive exPlanations and univariate association analysis in several experiments.
AVAILABILITY AND IMPLEMENTATION
The code and links to our public data are available at https://github.com/BorgwardtLab/spinoptmmd.
Topics: Biomarkers; Humans; Algorithms; Computational Biology
PubMed: 38940158
DOI: 10.1093/bioinformatics/btae251