Bioinformatics (Oxford, England), Jun 2024
MOTIVATION
Electronic health records (EHRs) represent a comprehensive resource of a patient's medical history. EHRs are essential for utilizing advanced technologies such as deep learning (DL), enabling healthcare providers to analyze extensive data, extract valuable insights, and make precise and data-driven clinical decisions. DL methods such as recurrent neural networks (RNN) have been utilized to analyze EHRs to model disease progression and predict diagnoses. However, these methods do not address some inherent irregularities in EHR data, such as irregular time intervals between clinical visits. Furthermore, most DL models are not interpretable. In this study, we propose two interpretable DL architectures based on RNNs, namely time-aware RNN (TA-RNN) and TA-RNN-autoencoder (TA-RNN-AE), to predict a patient's clinical outcome in EHRs at the next visit and multiple visits ahead, respectively. To mitigate the impact of irregular time intervals, we propose incorporating a time embedding of the elapsed times between visits. For interpretability, we propose employing a dual-level attention mechanism that operates between visits and features within each visit.
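The elapsed-time embedding described above can be sketched roughly as follows. This is a generic sinusoidal encoding of the gap between visits, not the paper's exact formulation, and the dimension `dim` is an arbitrary choice here:

```python
import math

def time_embedding(elapsed_days: float, dim: int = 8) -> list[float]:
    """Sinusoidal embedding of the elapsed time between two visits.

    Each pair of dimensions uses a different frequency, so short and
    long gaps between visits map to distinguishable vectors that an
    RNN can combine with the visit's feature vector.
    """
    emb = []
    for i in range(dim // 2):
        freq = 1.0 / (10000 ** (2 * i / dim))
        emb.append(math.sin(elapsed_days * freq))
        emb.append(math.cos(elapsed_days * freq))
    return emb
```

In a time-aware RNN, such a vector would typically be concatenated with (or added to) each visit's input before the recurrent update.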
RESULTS
The results of the experiments conducted on Alzheimer's Disease Neuroimaging Initiative (ADNI) and National Alzheimer's Coordinating Center (NACC) datasets indicated the superior performance of proposed models for predicting Alzheimer's Disease (AD) compared to state-of-the-art and baseline approaches based on F2 and sensitivity. Additionally, TA-RNN showed superior performance on the Medical Information Mart for Intensive Care (MIMIC-III) dataset for mortality prediction. In our ablation study, we observed enhanced predictive performance by incorporating time embedding and attention mechanisms. Finally, investigating attention weights helped identify influential visits and features in predictions.
AVAILABILITY AND IMPLEMENTATION
https://github.com/bozdaglab/TA-RNN.
Topics: Electronic Health Records; Neural Networks, Computer; Humans; Deep Learning; Alzheimer Disease
PubMed: 38940180
DOI: 10.1093/bioinformatics/btae264
Bioinformatics (Oxford, England), Jun 2024
MOTIVATION
In drug discovery, it is crucial to assess the drug-target binding affinity (DTA). Although molecular docking is widely used, computational efficiency limits its application in large-scale virtual screening. Deep learning-based methods learn virtual scoring functions from labeled datasets and can quickly predict affinity. However, there are three limitations. First, existing methods only consider the atom-bond graph or one-dimensional sequence representations of compounds, ignoring the information about functional groups (pharmacophores) with specific biological activities. Second, relying on limited labeled datasets fails to learn comprehensive embedding representations of compounds and proteins, resulting in poor generalization performance in complex scenarios. Third, existing feature fusion methods cannot adequately capture contextual interaction information.
RESULTS
Therefore, we propose a novel DTA prediction method named HeteroDTA. Specifically, a multi-view compound feature extraction module is constructed to model the atom-bond graph and pharmacophore graph. The residue concat graph and protein sequence are also utilized to model protein structure and function. Moreover, to enhance the generalization capability and reduce the dependence on task-specific labeled data, pre-trained models are utilized to initialize the atomic features of the compounds and the embedding representations of the protein sequence. A context-aware nonlinear feature fusion method is also proposed to learn interaction patterns between compounds and proteins. Experimental results on public benchmark datasets show that HeteroDTA significantly outperforms existing methods. In addition, HeteroDTA shows excellent generalization performance in cold-start experiments and superiority in the representation learning ability of drug-target pairs. Finally, the effectiveness of HeteroDTA is demonstrated in a real-world drug discovery study.
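The "context-aware nonlinear feature fusion" idea can be illustrated with a minimal gated fusion, where a gate computed from both representations decides, per dimension, how much of each to keep. This shows only the general pattern; HeteroDTA's actual fusion module is more elaborate, and `gate_weights` here is a hypothetical stand-in for parameters that would be learned:

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(drug_vec, prot_vec, gate_weights):
    """Per-dimension gated fusion of compound and protein embeddings.

    The gate is conditioned on both inputs (the "context"), so the
    mixing ratio is nonlinear in the pair rather than fixed.
    """
    fused = []
    for d, p, w in zip(drug_vec, prot_vec, gate_weights):
        g = sigmoid(w * (d + p))      # gate depends on both contexts
        fused.append(g * d + (1 - g) * p)
    return fused
```

With a zero gate weight the gate is 0.5 and the fusion degenerates to a plain average, which is why a learned, context-dependent gate is more expressive than simple concatenation or averaging.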
AVAILABILITY AND IMPLEMENTATION
The source code and data are available at https://github.com/daydayupzzl/HeteroDTA.
Topics: Drug Discovery; Molecular Docking Simulation; Proteins; Deep Learning; Pharmacophore
PubMed: 38940179
DOI: 10.1093/bioinformatics/btae240
Bioinformatics (Oxford, England), Jun 2024
UNLABELLED
Automated protein function prediction is a crucial and widely studied problem in bioinformatics. Computationally, protein function prediction is a multilabel classification problem where only positive samples are defined and there is a large number of unlabeled annotations. Most existing methods rely on the assumption that the unlabeled set of protein function annotations consists of negatives, inducing the false negative issue, where potential positive samples are trained as negatives. We introduce a novel approach named PU-GO, wherein we address function prediction as a positive-unlabeled ranking problem. We apply empirical risk minimization, i.e. we minimize the classification risk of a classifier where class priors are obtained from the Gene Ontology hierarchical structure. We show that our approach is more robust than other state-of-the-art methods on similarity-based and time-based benchmark datasets.
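The positive-unlabeled risk minimization described above follows the general non-negative PU pattern: the negative risk is estimated from the unlabeled set, corrected by the class prior, and clipped at zero. A minimal sketch with a squared loss (PU-GO's own loss and its GO-derived class priors may differ):

```python
def pu_risk(pos_scores, unl_scores, prior):
    """Non-negative positive-unlabeled risk estimate.

    pos_scores: classifier outputs on labeled positives.
    unl_scores: classifier outputs on unlabeled samples.
    prior: assumed fraction of true positives among the unlabeled.
    """
    def loss(s, y):
        return (s - y) ** 2  # simple squared loss for illustration

    r_pos = sum(loss(s, 1) for s in pos_scores) / len(pos_scores)
    r_pos_as_neg = sum(loss(s, 0) for s in pos_scores) / len(pos_scores)
    r_unl = sum(loss(s, 0) for s in unl_scores) / len(unl_scores)
    # Negative risk estimated from unlabeled data, corrected by the
    # prior; clipped at 0 so it cannot go negative due to overfitting.
    r_neg = max(0.0, r_unl - prior * r_pos_as_neg)
    return prior * r_pos + r_neg
```

A classifier that scores positives near 1 and unlabeled samples near 0 drives this risk toward zero, which is the behavior training would reward.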
AVAILABILITY AND IMPLEMENTATION
Data and code are available at https://github.com/bio-ontology-research-group/PU-GO.
Topics: Proteins; Computational Biology; Gene Ontology; Databases, Protein; Algorithms
PubMed: 38940168
DOI: 10.1093/bioinformatics/btae237
Bioinformatics (Oxford, England), Jun 2024
MOTIVATION
Quantitative dynamical models facilitate the understanding of biological processes and the prediction of their dynamics. The parameters of these models are commonly estimated from experimental data. Yet, experimental data generated from different techniques do not provide direct information about the state of the system but a nonlinear (monotonic) transformation of it. For such semi-quantitative data, when this transformation is unknown, it is not apparent how the model simulations and the experimental data can be compared.
RESULTS
We propose a versatile spline-based approach for the integration of a broad spectrum of semi-quantitative data into parameter estimation. We derive analytical formulas for the gradients of the hierarchical objective function and show that this substantially increases the estimation efficiency. Subsequently, we demonstrate that the method allows for the reliable discovery of unknown measurement transformations. Furthermore, we show that this approach can significantly improve the parameter inference based on semi-quantitative data in comparison to available methods.
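The paper fits an unknown monotonic measurement transformation with splines. As a simpler stand-in that conveys the same monotonicity constraint, the pool-adjacent-violators algorithm computes the best non-decreasing fit to a sequence of semi-quantitative readings (this is isotonic regression, not the paper's spline method):

```python
def isotonic_fit(y):
    """Pool-adjacent-violators: best non-decreasing L2 fit to y.

    Adjacent values that violate monotonicity are pooled into blocks
    and replaced by the block mean, which is the L2-optimal repair.
    """
    blocks = []  # each block is [sum, count]
    for v in y:
        blocks.append([v, 1])
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    out = []
    for s, c in blocks:
        out.extend([s / c] * c)
    return out
```

Given model simulations sorted by state value, such a monotone fit to the paired semi-quantitative measurements recovers a discrete estimate of the unknown transformation.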
AVAILABILITY AND IMPLEMENTATION
Modelers can easily apply our method by using our implementation in the open-source Python Parameter EStimation TOolbox (pyPESTO) available at https://github.com/ICB-DCM/pyPESTO.
Topics: Models, Biological; Software; Algorithms; Computer Simulation; Computational Biology
PubMed: 38940161
DOI: 10.1093/bioinformatics/btae210
Bioinformatics (Oxford, England), Jun 2024
MOTIVATION
Wikipedia is a vital open educational resource in computational biology. The quality of computational biology coverage in English-language Wikipedia has improved steadily in recent years. However, there is an increasingly large 'knowledge gap' between computational biology resources in English-language Wikipedia, and Wikipedias in non-English languages. Reducing this knowledge gap by providing educational resources in non-English languages would reduce language barriers which disadvantage non-native English speaking learners across multiple dimensions in computational biology.
RESULTS
Here, we provide a comprehensive assessment of computational biology coverage in Spanish-language Wikipedia, the second most accessed Wikipedia worldwide. Using Spanish-language Wikipedia as a case study, we generate quantitative and qualitative data before and after a targeted educational event, specifically, a Spanish-focused student editing competition. Our data demonstrates how such events and activities can narrow the knowledge gap between English and non-English educational resources, by improving existing articles and creating new articles. Finally, based on our analysis, we suggest ways to prioritize future initiatives to improve open educational resources in other languages.
AVAILABILITY AND IMPLEMENTATION
Scripts for data analysis are available at: https://github.com/ISCBWikiTeam/spanish.
Topics: Computational Biology; Humans; Language; Internet
PubMed: 38940154
DOI: 10.1093/bioinformatics/btae247
Bioinformatics (Oxford, England), Jun 2024
MOTIVATION
Insertions and deletions (indels) influence the genetic code in fundamentally distinct ways from substitutions, significantly impacting gene product structure and function. Despite their influence, the evolutionary history of indels is often neglected in phylogenetic tree inference and ancestral sequence reconstruction, hindering efforts to comprehend biological diversity determinants and engineer variants for medical and industrial applications.
RESULTS
We frame determining the optimal history of indel events as a single Mixed-Integer Programming (MIP) problem across all branch points in a phylogenetic tree adhering to topological constraints, and all sites implied by a given set of aligned, extant sequences. By disentangling the impact on ancestral sequences at each branch point, this approach identifies the minimal indel events that jointly explain the diversity in sequences mapped to the tips of that tree. MIP can recover alternate optimal indel histories, if available. We evaluated MIP for indel inference on a dataset comprising 15 real phylogenetic trees associated with protein families ranging from 165 to 2000 extant sequences, and on 60 synthetic trees at comparable scales of data and reflecting realistic rates of mutation. Across relevant metrics, MIP outperformed alternative parsimony-based approaches and reported the fewest indel events, on par with or below their occurrence in the synthetic datasets. MIP offers a rational justification for indel patterns in extant sequences; importantly, it uniquely identifies global optima on complex protein datasets without making unrealistic assumptions of independence or evolutionary underpinnings, promising a deeper understanding of molecular evolution and aiding novel protein design.
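For contrast with the global MIP formulation, the classic per-column approach it is compared against can be sketched as Fitch small parsimony on a binary presence/absence character. This baseline treats each alignment column independently, which is exactly the independence assumption the MIP avoids:

```python
def fitch(tree, leaf_states):
    """Fitch small parsimony for one binary indel character.

    tree: nested tuples, e.g. (("A", "B"), "C"); leaves are names.
    leaf_states: leaf name -> {0} or {1} (column absent/present).
    Returns (candidate state set at this node, minimum event count).
    """
    if isinstance(tree, str):           # leaf: known state, no events
        return leaf_states[tree], 0
    left, right = tree
    ls, lc = fitch(left, leaf_states)
    rs, rc = fitch(right, leaf_states)
    inter = ls & rs
    if inter:                           # children agree: no new event
        return inter, lc + rc
    return ls | rs, lc + rc + 1         # disagreement: one indel event
```

Running this over every column of an alignment gives a per-site minimum event count; the MIP instead optimizes all sites and branch points jointly.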
AVAILABILITY AND IMPLEMENTATION
The implementation is available via GitHub at https://github.com/santule/indelmip.
Topics: Phylogeny; INDEL Mutation; Evolution, Molecular; Algorithms; Computational Biology
PubMed: 38940131
DOI: 10.1093/bioinformatics/btae254
Annals of Agricultural and Environmental Medicine, Jun 2024
INTRODUCTION AND OBJECTIVE
Snow cover serves as a unique indicator of environmental pollution in both urban and rural areas. As a seasonal cover, it accumulates various pollutants emitted into the atmosphere, thus providing insight into air pollution types and the relative contributions of different pollution sources. The aim of the study is to analyze the distribution of trace elements in snow cover to assess the anthropogenic influence on pollution levels, and better understand ecological threats.
MATERIAL AND METHODS
The study was conducted in rural areas around the village of Wólka in the Lublin Province of eastern Poland, and in urban districts of the city of Lublin, the capital of the Province. Samples were analyzed using Inductively Coupled Plasma-Mass Spectrometry; the Enrichment Factor (EF) and ecological risk indices (RI) were calculated to evaluate the contamination and potential ecological risks posed by the metals.
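The Enrichment Factor is conventionally the ratio of a metal's concentration, normalized to a conservative reference element, in the sample versus a natural background. A minimal sketch; the study's choice of reference element and background values are not given here, so those inputs are placeholders:

```python
def enrichment_factor(c_metal, c_ref, bg_metal, bg_ref):
    """EF = (C_metal / C_ref)_sample / (C_metal / C_ref)_background.

    c_ref is a conservative reference element (commonly Al or Fe);
    bg_* are background values, e.g. upper-crust or local soil averages.
    """
    return (c_metal / c_ref) / (bg_metal / bg_ref)
```

EF values near 1 suggest a crustal origin, while values well above that point to anthropogenic input, as reported here for sodium, zinc, and cadmium.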
RESULTS
The findings indicate higher concentrations of metals like sodium and iron in urban areas, likely due to road salt use and industrial activity, respectively. Enrichment factors showed significant anthropogenic contributions, particularly for metals like sodium, zinc, and cadmium, which had EF values substantially above natural levels. The potential ecological risk assessment highlighted a considerable ecological threat in urban areas compared to rural settings, primarily due to higher concentrations of metals.
CONCLUSIONS
The variation in metal concentrations between urban and rural snow covers reflects the impact of human activities on local environments. Urban areas showed higher pollution levels, suggesting the need for targeted pollution control policies to mitigate the adverse ecological impacts. This study underscores the importance of continuous monitoring and comprehensive risk assessments to effectively manage environmental pollution.
Topics: Snow; Poland; Environmental Monitoring; Risk Assessment; Metals; Humans; Air Pollutants; Cities; Rural Population
PubMed: 38940104
DOI: 10.26444/aaem/190317
Current Research in Food Science, 2024
Discriminant analysis of similar food samples is an important aspect of achieving food quality control. The effective combination of Raman spectroscopy and machine learning algorithms has become an extremely attractive approach to develop intelligent discrimination techniques. Feature spectral analysis can help researchers gain a deeper understanding of the data patterns in food quality discrimination. Herein, this work takes the discrimination of three brands of dairy products as an example to investigate Raman spectral features based on the support vector machine (SVM), extreme learning machine (ELM), and convolutional neural network (CNN) algorithms. The results show that there are certain differences in the optimal spectral feature interval corresponding to different machine learning algorithms. Selecting the appropriate spectral feature interval can maintain high recognition accuracy and improve the computational efficiency of the algorithm. For example, the SVM algorithm has a recognition accuracy of 100% in the 890-980 cm⁻¹ and 1410-1500 cm⁻¹ fused spectral range, which takes about 200 s. The ELM algorithm also has a recognition accuracy of 100% in the 890-980 cm⁻¹ and 1410-1500 cm⁻¹ fused spectral range, which takes less than 0.3 s. The CNN algorithm has a recognition accuracy of 100% in the 890-980 cm⁻¹, 1050-1180 cm⁻¹, and 1410-1500 cm⁻¹ fused spectral range, which takes about 80 s. In addition, by analyzing the distribution of spectral feature intervals based on Euclidean distance, the distribution of experimental samples based on feature spectra is visually displayed. Through the spectral feature analysis process of similar samples, a set of analysis strategies is provided to deeply reveal the data foundation of classification algorithms, which can provide a reference for the analysis of related discrimination research patterns.
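The interval-fusion preprocessing (e.g. keeping only the 890-980 cm⁻¹ and 1410-1500 cm⁻¹ bands) and the Euclidean distance used to visualize sample distribution can be sketched as below; the classifiers themselves (SVM, ELM, CNN) are omitted:

```python
import math

def select_intervals(wavenumbers, intensities, intervals):
    """Keep only intensities whose wavenumber (cm^-1) falls inside
    a fused set of feature intervals, e.g. [(890, 980), (1410, 1500)]."""
    return [
        inten
        for wn, inten in zip(wavenumbers, intensities)
        if any(lo <= wn <= hi for lo, hi in intervals)
    ]

def euclidean(a, b):
    """Euclidean distance between two feature-spectrum vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

Restricting the spectra to a few informative bands shrinks the feature vectors substantially, which is why the reported runtimes drop while accuracy stays at 100%.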
PubMed: 38939610
DOI: 10.1016/j.crfs.2024.100782
JACC: Advances, Feb 2024
BACKGROUND
With an increasing interest in using large claims databases in medical practice and research, it is a meaningful and essential step to efficiently identify patients with the disease of interest.
OBJECTIVES
This study aims to establish a machine learning (ML) approach to identify patients with congenital heart disease (CHD) in large claims databases.
METHODS
We harnessed data from the Quebec claims and hospitalization databases from 1983 to 2000. The study included 19,187 patients. Of them, 3,784 were labeled as true CHD patients using a clinician-developed algorithm with manual audits, considered the gold standard. To establish an accurate ML-empowered automated CHD classification system, we evaluated ML methods including Gradient Boosting Decision Tree, Support Vector Machine, and Decision Tree, and compared them to regularized logistic regression. The Area Under the Precision Recall Curve was used as the evaluation metric. External validation was conducted on a dataset updated to 2010 with different subjects.
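The evaluation metric, the area under the precision-recall curve, can be sketched as average precision computed from ranked classifier scores (a common discrete approximation; the study's exact computation may differ):

```python
def average_precision(scores, labels):
    """Area under the precision-recall curve (average precision).

    scores: classifier scores; labels: 1 for true CHD, 0 otherwise.
    Precision is accumulated at each rank where a positive is found.
    """
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    n_pos = sum(labels)
    tp = 0
    ap = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            tp += 1
            ap += tp / rank   # precision at this recall step
    return ap / n_pos
```

Unlike ROC AUC, this metric stays informative on imbalanced data such as the roughly 1-in-5 CHD prevalence in this cohort.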
RESULTS
Among the ML methods we evaluated, Gradient Boosting Decision Tree led the performance in identifying true CHD patients with 99.3% Area Under the Precision Recall Curve, 98.0% for sensitivity, and 99.7% for specificity. External validation returned similar statistics on model performance.
CONCLUSIONS
This study shows that tedious and time-consuming clinical inspection for CHD patient identification can be replaced by a highly efficient ML algorithm in large claims databases. Our findings demonstrate that ML methods can be used to automate complicated algorithms to identify patients with complex diseases.
PubMed: 38939385
DOI: 10.1016/j.jacadv.2023.100801
Frontiers in Oncology, 2024
INTRODUCTION
Infections represent one of the most frequent causes of death of higher-risk MDS patients, as reported previously also by our group. The Azacitidine Infection Risk Model (AIR), based on red blood cell (RBC) transfusion dependency, neutropenia <0.8 × 10⁹/L, platelet count <50 × 10⁹/L, albumin <35 g/L, and ECOG performance status ≥2, has been proposed based on retrospective data to estimate the risk of infection in azacitidine-treated patients.
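A sketch that simply counts the five AIR risk factors with the cut-offs quoted above; the model's actual point weights and the threshold separating lower- from higher-risk groups are not given in this abstract, so treat both as assumptions:

```python
def air_risk_factors(rbc_dependent, anc, platelets, albumin, ecog):
    """Count how many AIR risk factors a patient meets.

    anc, platelets in x 10^9/L; albumin in g/L; ecog is the ECOG
    performance status. Equal weighting is an assumption here.
    """
    factors = [
        rbc_dependent,       # RBC transfusion dependency
        anc < 0.8,           # neutropenia < 0.8 x 10^9/L
        platelets < 50,      # platelet count < 50 x 10^9/L
        albumin < 35,        # albumin < 35 g/L
        ecog >= 2,           # ECOG performance status >= 2
    ]
    return sum(factors)
```

A higher count would place a patient toward the higher-risk AIR group, which in this cohort carried roughly double the infection rate.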
METHODS
This prospective, non-interventional study aimed to identify factors predisposing to infection, validate the AIR score, and assess the impact of antimicrobial prophylaxis on the outcome of azacitidine-treated MDS/AML and CMML patients.
RESULTS
We collected data on 307 patients (57.6% males) treated with azacitidine: AML (37.8%), MDS (55.0%), and CMML (7.1%). The median age at azacitidine treatment commencement was 71 (range, 18-95) years. 200 (65%) patients were assigned to the higher-risk AIR group. Antibacterial, antifungal, and antiviral prophylaxis was used in 66.0%, 29.3%, and 25.7% of patients, respectively. In total, 169 infectious episodes (IE) were recorded in 118 (38.4%) patients within the first three azacitidine cycles. In multivariate analysis, ECOG status, RBC transfusion dependency, IPSS-R score, and CRP concentration were statistically significant for infection development (P < 0.05). The occurrence of infection within the first three azacitidine cycles was significantly higher in the higher-risk AIR group (47.0%) than in the lower-risk group (22.4%) (odds ratio (OR) 3.06; 95% CI 1.82-5.30, P < 0.05). Administration of antimicrobial prophylaxis did not have a significant impact on all-infection occurrence in multivariate analysis: antibacterial prophylaxis (OR 0.93; 95% CI 0.41-2.05, P = 0.87), antifungal (OR 1.24; 95% CI 0.54-2.85, P = 0.59), antiviral (OR 1.24; 95% CI 0.53-2.82, P = 0.60).
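The reported odds ratio can be reproduced (up to rounding) directly from the two infection proportions:

```python
def odds_ratio(p1, p2):
    """Odds ratio for an event with probability p1 vs probability p2."""
    return (p1 / (1 - p1)) / (p2 / (1 - p2))

# Infection in 47.0% of higher-risk vs 22.4% of lower-risk patients
# gives odds_ratio(0.470, 0.224) ≈ 3.07, consistent with the reported
# OR of 3.06 (the small difference comes from rounding the percentages).
```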
DISCUSSION
The AIR Model effectively discriminates infection-risk patients during azacitidine treatment. Antimicrobial prophylaxis does not decrease the infection rate.
PubMed: 38939343
DOI: 10.3389/fonc.2024.1404322