Bioinformatics (Oxford, England) Jun 2024
Artificial intelligence (AI) is increasingly used in genomics research and practice, and generative AI has garnered significant recent attention. In clinical applications of generative AI, aspects of the underlying datasets can impact results, and confounders should be studied and mitigated. One example involves the facial expressions of people with genetic conditions. Stereotypically, Williams (WS) and Angelman (AS) syndromes are associated with a "happy" demeanor, including a smiling expression. Clinical geneticists may be more likely to identify these conditions in images of smiling individuals. To study the impact of facial expression, we analyzed publicly available facial images of approximately 3500 individuals with genetic conditions. Using a deep learning (DL) image classifier, we found that WS and AS images with non-smiling expressions had significantly lower prediction probabilities for the correct syndrome labels than those with smiling expressions. This was not seen for 22q11.2 deletion and Noonan syndromes, which are not associated with a smiling expression. To further explore the effect of facial expressions, we computationally altered the facial expressions for these images. We trained HyperStyle, a GAN-inversion technique compatible with StyleGAN2, to determine the vector representations of our images. Then, following the concept of InterfaceGAN, we edited these vectors to recreate the original images in a phenotypically accurate way but with a different facial expression. Through online surveys and an eye-tracking experiment, we examined how altered facial expressions affect the performance of human experts. Overall, we found that facial expression is variably associated with diagnostic accuracy across different genetic conditions.
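The latent-editing step described above can be sketched in a few lines. This is a minimal illustration of the InterfaceGAN idea of moving a latent code along a semantic direction; the vectors, the "smile" direction, and the step size below are invented for illustration, not taken from the paper:

```python
# Toy InterfaceGAN-style edit: move a GAN latent code along a semantic
# direction (e.g. the normal of a "smiling" decision boundary).
def edit_latent(w, direction, alpha):
    """Return w shifted by alpha along the given direction."""
    return [wi + alpha * di for wi, di in zip(w, direction)]

w = [0.2, -1.0, 0.5]         # inverted latent code (illustrative values)
smile_dir = [0.0, 1.0, 0.0]  # hypothetical "smile" direction
w_neutral = edit_latent(w, smile_dir, alpha=-2.0)  # reduce the expression
```

In the paper's pipeline, the latent code would come from HyperStyle inversion of a real face image, and the edited code would be rendered back through StyleGAN2 to produce the altered expression.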
Topics: Humans; Facial Expression; Deep Learning; Artificial Intelligence; Genetics, Medical; Williams Syndrome
PubMed: 38940144
DOI: 10.1093/bioinformatics/btae239
Bioinformatics (Oxford, England) Jun 2024
MOTIVATION
Molecular core structures and R-groups are essential concepts in drug development. Integrating these concepts into conventional graph pre-training approaches can promote a deeper understanding of molecules. We propose MolPLA, a novel pre-training framework that employs masked graph contrastive learning to understand the underlying decomposable parts in molecules that implicate their core structure and peripheral R-groups. Furthermore, we formulate an additional framework that grants MolPLA the ability to help chemists find replaceable R-groups in lead optimization scenarios.
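As a rough sketch of the core/R-group decomposition concept (not MolPLA's learned decomposition), a toy molecular graph can be separated into a ring "core" and peripheral substituents by iteratively pruning leaf atoms; the adjacency list below is illustrative:

```python
# Crude core/periphery split: repeatedly strip atoms with at most one
# remaining neighbor; what survives (rings) is treated as the "core".
def core_atoms(adj):
    alive = set(adj)
    while True:
        leaves = {a for a in alive
                  if sum(b in alive for b in adj[a]) <= 1}
        if leaves == alive or not leaves:
            return alive
        alive -= leaves

# Six-membered ring (atoms 0-5) with a two-atom substituent (atoms 6-7).
adj = {0: [1, 5, 6], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4, 0],
       6: [0, 7], 7: [6]}
```

Here `core_atoms(adj)` keeps the ring and discards the substituent, a stand-in for the decomposable-parts notion the framework learns from data.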
RESULTS
Experimental results on molecular property prediction show that MolPLA exhibits predictability comparable to current state-of-the-art models. Qualitative analysis indicates that MolPLA is capable of distinguishing core and R-group sub-structures, identifying decomposable regions in molecules, and contributing to lead optimization scenarios by rationally suggesting R-group replacements given various query core templates.
AVAILABILITY AND IMPLEMENTATION
The code implementation for MolPLA and its pre-trained model checkpoint is available at https://github.com/dmis-lab/MolPLA.
Topics: Software; Machine Learning; Molecular Structure; Algorithms; Drug Development
PubMed: 38940143
DOI: 10.1093/bioinformatics/btae256
Bioinformatics (Oxford, England) Jun 2024
MOTIVATION
High-resolution Hi-C contact matrices reveal the detailed three-dimensional architecture of the genome, but high-coverage experimental Hi-C data are expensive to generate. Simultaneously, chromatin structure analyses struggle with extremely sparse contact matrices. To address this problem, computational methods to enhance low-coverage contact matrices have been developed, but existing methods are largely based on resolution-enhancement methods for natural images and hence often employ models that do not distinguish between biologically meaningful contacts, such as loops, and other stochastic contacts.
RESULTS
We present Capricorn, a machine learning model for Hi-C resolution enhancement that incorporates small-scale chromatin features as additional views of the input Hi-C contact matrix and leverages a diffusion probability model backbone to generate a high-coverage matrix. We show that Capricorn outperforms the state of the art in a cross-cell-line setting, improving on existing methods by 17% in mean squared error and 26% in F1 score for chromatin loop identification from the generated high-coverage data. We also demonstrate that Capricorn performs well in the cross-chromosome setting and cross-chromosome, cross-cell-line setting, improving the downstream loop F1 score by 14% relative to existing methods. We further show that our multiview idea can also be used to improve several existing methods, HiCARN and HiCNN, indicating the wide applicability of this approach. Finally, we use DNA sequence to validate discovered loops and find that the fraction of CTCF-supported loops from Capricorn is similar to those identified from the high-coverage data. Capricorn is a powerful Hi-C resolution enhancement method that enables scientists to find chromatin features that cannot be identified in the low-coverage contact matrix.
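The evaluation criteria quoted above reduce to standard definitions; below is a minimal sketch of mean squared error between contact matrices and F1 over called loop coordinates (toy inputs, not the paper's data or evaluation code):

```python
# Mean squared error between two dense contact matrices (nested lists).
def mse(a, b):
    n = sum(len(row) for row in a)
    return sum((x - y) ** 2
               for ra, rb in zip(a, b) for x, y in zip(ra, rb)) / n

# F1 score between predicted and true loop calls (sets of (i, j) bin pairs).
def f1(pred, true):
    tp = len(pred & true)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(true)
    return 2 * precision * recall / (precision + recall)
```

The reported percentages are relative improvements of these two quantities over competing enhancement methods.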
AVAILABILITY AND IMPLEMENTATION
Implementation of Capricorn and source code for reproducing all figures in this paper are available at https://github.com/CHNFTQ/Capricorn.
Topics: Chromatin; Machine Learning; Humans; Computational Biology; Algorithms; Software
PubMed: 38940142
DOI: 10.1093/bioinformatics/btae211
Bioinformatics (Oxford, England) Jun 2024
MOTIVATION
Tandem mass spectrometry (MS/MS) is a crucial technology for large-scale proteomic analysis. Protein database search and spectral library search are commonly used for peptide identification from MS/MS spectra; however, both may face challenges due to experimental variations between replicated spectra and similar fragmentation patterns among distinct peptides. To address these challenges, we present SpecEncoder, a deep metric learning approach that transforms MS/MS spectra into robust and sensitive embedding vectors in a latent space. The SpecEncoder model can also embed predicted MS/MS spectra of peptides, enabling a hybrid search approach that combines spectral library and protein database searches for peptide identification.
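The embedding-based search idea can be illustrated with a toy nearest-neighbor lookup; the two-dimensional embeddings and peptide names below are invented (the real model produces high-dimensional vectors from a trained network):

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def best_match(query_emb, library):
    """Peptide whose (experimental or predicted) embedding is closest to the query."""
    return max(library, key=lambda pep: cosine(query_emb, library[pep]))

library = {"PEPTIDEK": [1.0, 0.0], "ELVISLIVESK": [0.0, 1.0]}
hit = best_match([0.9, 0.1], library)
```

The hybrid search described above amounts to populating such a library with embeddings of both measured and predicted spectra before the lookup.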
RESULTS
We evaluated SpecEncoder on three large human proteomics datasets, and the results showed a consistent improvement in peptide identification. For spectral library search, SpecEncoder identifies 1%-2% more unique peptides (and PSMs) than SpectraST. For protein database search, it identifies 6%-15% more unique peptides than MSGF+ enhanced by Percolator. Furthermore, SpecEncoder identified 6%-12% additional unique peptides when utilizing a combined library of experimental and predicted spectra. SpecEncoder also identifies more peptides than deep-learning-enhanced methods (MSFragger boosted by MSBooster). These results demonstrate SpecEncoder's potential to enhance peptide identification for proteomic data analyses.
AVAILABILITY AND IMPLEMENTATION
The source code and scripts for SpecEncoder and peptide identification are available on GitHub at https://github.com/lkytal/SpecEncoder. Contact: [email protected].
Topics: Proteomics; Peptides; Humans; Tandem Mass Spectrometry; Databases, Protein; Deep Learning; Software
PubMed: 38940141
DOI: 10.1093/bioinformatics/btae220
Bioinformatics (Oxford, England) Jun 2024
MOTIVATION
Cell-cell interactions (CCIs) consist of cells exchanging signals with themselves and neighboring cells by expressing ligand and receptor molecules and play a key role in cellular development, tissue homeostasis, and other critical biological functions. Since direct measurement of CCIs is challenging, multiple methods have been developed to infer CCIs by quantifying correlations between the gene expression of the ligands and receptors that mediate CCIs, originally from bulk RNA-sequencing data and more recently from single-cell or spatially resolved transcriptomics (SRT) data. SRT has a particular advantage over single-cell approaches, since ligand-receptor correlations can be computed between cells or spots that are physically close in the tissue. However, the transcript counts of individual ligands and receptors in SRT data are generally low, complicating the inference of CCIs from expression correlations.
RESULTS
We introduce Copulacci, a count-based model for inferring CCIs from SRT data. Copulacci uses a Gaussian copula to model dependencies between the expression of ligands and receptors from nearby spatial locations even when the transcript counts are low. On simulated data, Copulacci outperforms existing CCI inference methods based on the standard Spearman and Pearson correlation coefficients. Using several real SRT datasets, we show that Copulacci discovers biologically meaningful ligand-receptor interactions that are lowly expressed and undiscoverable by existing CCI inference methods.
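The Gaussian-copula intuition is that ranks of counts are mapped to latent normal scores before measuring dependence. Below is a simplified normal-scores transform (Copulacci itself fits count marginal distributions rather than using raw ranks, so this is only a stand-in):

```python
from statistics import NormalDist

def normal_scores(counts):
    """Map counts to latent Gaussian scores via their ranks."""
    n = len(counts)
    order = sorted(range(n), key=lambda i: counts[i])
    z = [0.0] * n
    for rank, i in enumerate(order):
        z[i] = NormalDist().inv_cdf((rank + 0.5) / n)  # rank -> N(0,1) quantile
    return z

ligand = [0, 2, 5, 1]    # toy spot-level transcript counts
z = normal_scores(ligand)
```

Dependence between a ligand and a receptor would then be estimated on the two latent score vectors, e.g. with a Pearson correlation, rather than on the sparse raw counts.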
AVAILABILITY AND IMPLEMENTATION
Copulacci is implemented in Python and available at https://github.com/raphael-group/copulacci.
Topics: Transcriptome; Cell Communication; Humans; Gene Expression Profiling; Single-Cell Analysis; Algorithms; Computational Biology; Ligands
PubMed: 38940134
DOI: 10.1093/bioinformatics/btae219
Bioinformatics (Oxford, England) Jun 2024
MOTIVATION
Many tasks in sequence analysis ask to identify biologically related sequences in a large set. The edit distance, being a sensible model for both evolution and sequencing error, is widely used as a measure in these tasks. The resulting computational problem, recognizing all pairs of sequences within a small edit distance, turns out to be exceedingly difficult, since the edit distance is notoriously expensive to compute and all-versus-all comparison is simply not feasible with millions or billions of sequences. Among many attempts to meet this challenge, we recently proposed locality-sensitive bucketing (LSB) functions. Formally, a (d1,d2)-LSB function sends sequences into multiple buckets with the guarantee that pairs of sequences of edit distance at most d1 can be found within the same bucket, while those of edit distance at least d2 do not share any. LSB functions generalize locality-sensitive hashing (LSH) functions and admit favorable properties, a notable highlight being that optimal LSB functions exist for certain (d1,d2). LSB functions hold the potential of solving the above problems optimally, but the existence of LSB functions for more general (d1,d2) remains unclear, let alone constructing them for practical use.
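Two of the ingredients above are easy to make concrete. Below is the standard edit-distance dynamic program, plus a toy deletion-based bucketing satisfying a (1,3)-style LSB guarantee (any pair within edit distance 1 shares a bucket; pairs at distance 3 or more never do). This illustrates the definition only, not the learned functions from the paper:

```python
def edit_distance(s, t):
    # O(|s|*|t|) dynamic program; this cost is why all-versus-all
    # comparison is infeasible for huge sequence sets.
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (cs != ct)))   # (mis)match
        prev = cur
    return prev[-1]

def buckets(s):
    # The sequence itself plus all single-character deletions: two strings
    # sharing a bucket differ by at most 2 edits, and any pair within
    # edit distance 1 is guaranteed to share one.
    return {s} | {s[:i] + s[i + 1:] for i in range(len(s))}
```

Candidate pairs are then found by comparing only sequences whose bucket sets intersect, avoiding the quadratic all-versus-all scan.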
RESULTS
In this work, we aim to utilize machine learning techniques to train LSB functions. With the development of a novel loss function and insights into neural network structures that can potentially extend beyond this specific task, we obtained LSB functions that exhibit nearly perfect accuracy for certain (d1,d2), matching our theoretical results, and high accuracy for many others. Compared to the state-of-the-art LSH method Order Min Hash, the trained LSB functions achieve a 2- to 5-fold improvement in the sensitivity of recognizing similar sequences. An experiment on analyzing erroneous cell barcode data is also included to demonstrate the application of the trained LSB functions.
AVAILABILITY AND IMPLEMENTATION
The code for the training process and the structure of trained models are freely available at https://github.com/Shao-Group/lsb-learn.
Topics: Algorithms; Computational Biology; Machine Learning
PubMed: 38940133
DOI: 10.1093/bioinformatics/btae228
Bioinformatics (Oxford, England) Jun 2024
MOTIVATION
One of the core problems in the analysis of protein tandem mass spectrometry data is the peptide assignment problem: determining, for each observed spectrum, the peptide sequence that was responsible for generating the spectrum. Two primary classes of methods are used to solve this problem: database search and de novo peptide sequencing. State-of-the-art methods for de novo sequencing use machine learning methods, whereas most database search engines use hand-designed score functions to evaluate the quality of a match between an observed spectrum and a candidate peptide from the database. We hypothesized that machine learning models for de novo sequencing implicitly learn a score function that captures the relationship between peptides and spectra, and thus may be re-purposed as a score function for database search. Because this score function is trained from massive amounts of mass spectrometry data, it could potentially outperform existing, hand-designed database search tools.
RESULTS
To test this hypothesis, we re-engineered Casanovo, which has been shown to provide state-of-the-art de novo sequencing capabilities, to assign scores to given peptide-spectrum pairs. We then evaluated the statistical power of this Casanovo score function, Casanovo-DB, to detect peptides on a benchmark of three mass spectrometry runs from three different species. In addition, we show that re-scoring with the Percolator post-processor benefits Casanovo-DB more than other score functions, further increasing the number of detected peptides.
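The abstract does not spell out its confidence-estimation details, but peptide "detection" counts in this setting are conventionally controlled by target-decoy competition; the following is a generic sketch of that idea (my assumption, not the paper's exact procedure):

```python
def tdc_fdr(matches):
    """Estimate FDR down a ranked list of (score, is_target) matches as
    (#decoys so far) / (#targets so far)."""
    out, targets, decoys = [], 0, 0
    for score, is_target in sorted(matches, reverse=True):
        if is_target:
            targets += 1
        else:
            decoys += 1
        out.append((score, decoys / max(targets, 1)))
    return out
```

Under this scheme, a better score function (such as one repurposed from a de novo model) separates targets from decoys more cleanly and so yields more detections at a fixed FDR threshold.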
Topics: Databases, Protein; Peptides; Machine Learning; Mass Spectrometry; Algorithms; Sequence Analysis, Protein; Tandem Mass Spectrometry
PubMed: 38940129
DOI: 10.1093/bioinformatics/btae218
Journal of Integrative Neuroscience Jun 2024
BACKGROUND
Neurofeedback is a non-invasive brain training technique used to enhance and treat hyperactivity disorder by altering patterns of brain activity. Nonetheless, the extent of enhancement from neurofeedback varies among individuals/patients, and many are unresponsive to this treatment technique. Therefore, several studies have been conducted to predict the effectiveness of neurofeedback training, including the theta/beta protocol, with a specific emphasis on slow cortical potential (SCP) before initiating treatment, as well as examining SCP criteria according to age and sex in diverse populations. While some of these studies failed to make accurate predictions, others have demonstrated low success rates. This study explores functional connections within various brain lobes across different frequency bands of electroencephalogram (EEG) signals and uses the phase locking value to predict the potential effectiveness of neurofeedback treatment before its initiation.
METHODS
This study utilized EEG data from the Mendelian database. In this database, EEG signals were recorded during neurofeedback sessions involving 60 hyperactive students aged 7-14 years, irrespective of sex. These students were categorized into treatable and non-treatable groups. The proposed method includes a five-step algorithm. Initially, the data underwent preprocessing to reduce noise using a multi-stage filtering process. The second step involved extracting alpha and beta frequency bands from the preprocessed EEG signals, with a particular emphasis on the EEG recorded during sessions 10 to 20 of neurofeedback therapy. In the third step, the method assessed the disparity in brain signals between the two groups by evaluating functional relationships in different brain lobes using the phase locking value, a crucial data characteristic. The fourth step focused on reducing the feature space and identifying the most effective and optimal electrodes for neurofeedback treatment. Two methods, the probability index (p-value) via a t-test and the genetic algorithm, were employed. These methods showed that the optimal electrodes were in the frontal lobe and central cerebral cortex, notably channels C3, FZ, F4, CZ, C4, and F3, as they exhibited significant differences between the two groups. Finally, in the fifth step, machine learning classifiers were applied, and the results were combined to generate treatable and non-treatable labels for each dataset.
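The phase locking value used in the third step has a compact definition: the magnitude of the time-averaged unit phasor of the phase difference between two channels. A minimal sketch with illustrative instantaneous-phase series (real pipelines extract phases via the Hilbert transform of band-filtered EEG):

```python
import cmath

def phase_locking_value(phases_a, phases_b):
    """PLV = |mean_t exp(i * (phi_a(t) - phi_b(t)))|; 1 means perfectly
    phase-locked channels, values near 0 mean no consistent relationship."""
    n = len(phases_a)
    return abs(sum(cmath.exp(1j * (a - b))
                   for a, b in zip(phases_a, phases_b)) / n)

# A constant phase lag between two channels gives PLV = 1.
plv = phase_locking_value([0.0, 1.0, 2.0], [0.5, 1.5, 2.5])
```

Computing this value for each electrode pair and frequency band yields the functional-connectivity features that the later feature-selection and classification steps operate on.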
RESULTS
Among the classifiers, the support vector machine and the boosting method demonstrated the highest accuracy when combined. Consequently, the proposed algorithm successfully predicted the treatability of individuals with hyperactivity in a short time and with limited data, achieving an accuracy of 90.6% in the neurofeedback method. Additionally, it effectively identified key electrodes in neurofeedback treatment, reducing their number from 32 to 6.
CONCLUSIONS
This study introduces an algorithm with a 90.6% accuracy for predicting neurofeedback treatment outcomes in hyperactivity disorder, significantly enhancing treatment efficiency by identifying optimal electrodes and reducing their number from 32 to 6. The proposed method enables the prediction of patient responsiveness to neurofeedback therapy without the need for numerous sessions, thus conserving time and financial resources.
Topics: Humans; Neurofeedback; Attention Deficit Disorder with Hyperactivity; Adolescent; Male; Female; Child; Electroencephalography; Cerebral Cortex; Brain Waves; Treatment Outcome
PubMed: 38940096
DOI: 10.31083/j.jin2306121
Frontiers in Bioscience (Landmark... Jun 2024
BACKGROUND
The worldwide incidence rate of oropharyngeal squamous cell carcinoma (OPSCC) is alarming. There is a pressing need in the clinical community to understand the etiology of OPSCC to facilitate the administration of effective treatments.
METHODS
This study presents an integrative genomics approach for identifying key oncogenic drivers involved in OPSCC pathogenesis. The dataset contains RNA-Sequencing (RNA-Seq) samples of 46 human papillomavirus-positive head and neck squamous cell carcinoma cases and 25 normal uvulopalatopharyngoplasty cases. Differential marker selection was performed between the groups with a log2FoldChange (FC) score of 2 and an adjusted p-value < 0.01, screening 714 genes. The Particle Swarm Optimization (PSO) algorithm selects the candidate gene subset, reducing the size to 73. State-of-the-art machine learning algorithms are trained with the differentially expressed genes and candidate subsets of PSO.
RESULTS
The analysis of predictive models using Shapley Additive exPlanations (SHAP) revealed that seven genes significantly contribute to the model's performance. These include , , and , which predominantly influence differentiating between sample groups. They were followed in importance by , , , and . The Random Forest and Bayes Net algorithms also achieved perfect validation scores when using PSO features. Furthermore, gene set enrichment analysis, protein-protein interactions, and disease ontology mining revealed a significant association between these genes and the target condition. As indicated by SHAP values, the survival analysis of three key genes unveiled strong over-expression in the samples from The Cancer Genome Atlas.
CONCLUSIONS
Our findings elucidate critical oncogenic drivers in OPSCC, offering vital insights for developing targeted therapies and enhancing understanding of its pathogenesis.
Topics: Humans; Oropharyngeal Neoplasms; Biomarkers, Tumor; Papillomavirus Infections; Artificial Intelligence; Gene Expression Regulation, Neoplastic; Squamous Cell Carcinoma of Head and Neck; Algorithms; Sequence Analysis, RNA; Machine Learning; Papillomaviridae; Carcinoma, Squamous Cell
PubMed: 38940026
DOI: 10.31083/j.fbl2906220
Plant Phenomics (Washington, D.C.) 2024
Grape cluster architecture and compactness are complex traits influencing disease susceptibility, fruit quality, and yield. Evaluation methods for these traits include visual scoring, manual methodologies, and computer vision, with the latter being the most scalable approach. Most existing computer vision approaches for processing cluster images rely on conventional segmentation or machine learning with extensive training and limited generalization. The Segment Anything Model (SAM), a novel foundation model trained on a massive image dataset, enables automated object segmentation without additional training. This study demonstrates SAM's high out-of-the-box accuracy in identifying individual berries in 2-dimensional (2D) cluster images. Using this model, we segmented approximately 3,500 cluster images, generating over 150,000 berry masks, each linked with spatial coordinates within its cluster. The correlation between human-identified berries and SAM predictions was very strong (Pearson's r = 0.96). Although the visible berry count in images typically underestimates the actual cluster berry count due to visibility issues, we demonstrated that this discrepancy could be adjusted using a linear regression model (adjusted R² = 0.87). We emphasized the critical importance of the angle at which the cluster is imaged, noting its substantial effect on berry counts and architecture. We proposed different approaches in which berry location information facilitated the calculation of complex features related to cluster architecture and compactness. Finally, we discussed SAM's potential integration into currently available pipelines for image generation and processing in vineyard conditions.
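The visible-to-actual berry-count adjustment is an ordinary least-squares fit; here is a self-contained sketch with invented counts (the paper's fitted coefficients are not given in the abstract):

```python
# Ordinary least squares for y ~ a + b*x, standing in for the paper's
# visible-count -> true-count linear adjustment.
def fit_line(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    return my - b * mx, b       # intercept, slope

visible = [40, 55, 70]          # berries counted in 2D images (illustrative)
actual = [50, 68, 86]           # hypothetical manual ground-truth counts
a, b = fit_line(visible, actual)
predicted = a + b * 62          # adjust a new visible count
```

In practice the regression would be fit on clusters with manual ground-truth counts and then applied to SAM's visible-berry counts across the full image set.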
PubMed: 38939746
DOI: 10.34133/plantphenomics.0202