Journal of Chemical Information and..., Apr 2021
Small molecules play a critical role in modulating biological systems. Knowledge of chemical-protein interactions helps address fundamental and practical questions in biology and medicine. However, with the rapid emergence of newly sequenced genes, the endogenous or surrogate ligands of a vast number of proteins remain unknown. Homology modeling and machine learning are two major methods for assigning new ligands to a protein, but both mostly fail when sequence homology between an unannotated protein and proteins with known functions or structures is low. In this study, we develop a new deep learning framework to predict chemical binding to evolutionarily divergent unannotated proteins, whose ligands cannot be reliably predicted by existing methods. By incorporating evolutionary information into self-supervised learning on unlabeled protein sequences, we develop a novel method, distilled sequence alignment embedding (DISAE), for protein sequence representation. DISAE can utilize all protein sequences and their multiple sequence alignments (MSAs) to capture functional relationships between proteins without knowledge of their structure and function. Following DISAE pretraining, we devise a module-based fine-tuning strategy for the supervised learning of chemical-protein interactions. In benchmark studies, DISAE significantly improves the generalizability of machine learning models and outperforms state-of-the-art methods by a large margin. Comprehensive ablation studies suggest that the use of MSAs, sequence distillation, and triplet pretraining contributes critically to the success of DISAE. Interpretability analysis suggests that DISAE learns biologically meaningful information. We further use DISAE to assign ligands to human orphan G-protein coupled receptors (GPCRs) and to cluster the human GPCRome by integrating phylogenetic and ligand relationships. The promising results of DISAE open an avenue for exploring the chemical landscape of entire sequenced genomes.
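DISAE's actual distillation procedure is described in the paper itself; purely as an illustration of the general idea of distilling an MSA down to its informative positions, one could keep only well-conserved columns. The `min_conservation` threshold and the majority-residue voting scheme below are assumptions for illustration, not DISAE's algorithm:

```python
from collections import Counter

def distill_msa(msa, ref_index=0, min_conservation=0.5):
    """Keep only MSA columns whose most frequent non-gap residue appears
    in at least `min_conservation` of the sequences; return the reference
    sequence restricted to those columns (gaps dropped)."""
    kept = []
    for col in range(len(msa[0])):
        residues = [seq[col] for seq in msa if seq[col] != "-"]
        if not residues:
            continue  # all-gap column carries no signal
        top_count = Counter(residues).most_common(1)[0][1]
        if top_count / len(msa) >= min_conservation:
            kept.append(col)
    return "".join(msa[ref_index][c] for c in kept if msa[ref_index][c] != "-")
```

For example, `distill_msa(["MKV-A", "MKI-A", "MRVLA"], 0, 0.6)` drops the mostly-gap fourth column and returns `"MKVA"`. Real distillation schemes differ in how they score conservation and handle gaps; this only sketches the column-selection idea.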
Topics: Amino Acid Sequence; Computational Biology; Humans; Ligands; Machine Learning; Phylogeny; Sequence Alignment
PubMed: 33757283
DOI: 10.1021/acs.jcim.0c01285

The Journal of Physical Chemistry..., Aug 2022
A self-consistent analytical solution for binodal concentrations of the two-component Flory-Huggins phase separation model is derived. We show that this form extends the validity of the Ginzburg-Landau expansion away from the critical point to cover the whole phase space. Furthermore, this analytical solution reveals an exponential scaling law of the dilute phase binodal concentration as a function of the interaction strength and chain length. We demonstrate explicitly the power of this approach by fitting experimental protein liquid-liquid phase separation boundaries to determine the effective chain length and solute-solvent interaction energies. Moreover, we demonstrate that this strategy allows us to resolve differences in interaction energy contributions of individual amino acids. This analytical framework can serve as a new way to decode the protein sequence grammar for liquid-liquid phase separation.
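The paper's self-consistent binodal solution is derived there; the standard Flory-Huggins ingredients it builds on are textbook results and can be sketched directly. For a chain of length N at volume fraction phi, the mixing free energy per site (in units of kT) and the critical point are:

```python
import math

def fh_free_energy(phi, N, chi):
    """Flory-Huggins mixing free energy per lattice site, in units of kT,
    for a polymer of length N at volume fraction phi with interaction chi."""
    return ((phi / N) * math.log(phi)
            + (1 - phi) * math.log(1 - phi)
            + chi * phi * (1 - phi))

def critical_point(N):
    """Textbook Flory-Huggins critical point:
    phi_c = 1 / (1 + sqrt(N)),  chi_c = (1 + 1/sqrt(N))**2 / 2."""
    s = math.sqrt(N)
    return 1 / (1 + s), 0.5 * (1 + 1 / s) ** 2
```

At the critical point the spinodal condition f''(phi) = 1/(N phi) + 1/(1 - phi) - 2 chi = 0 holds exactly; the paper's analytical binodal extends beyond this neighborhood, which is not reproduced here.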
Topics: Amino Acid Sequence; Proteins; Solutions; Solvents; Thermodynamics
PubMed: 35977086
DOI: 10.1021/acs.jpclett.2c01986

Current Opinion in Structural Biology, Feb 2021 (Review)
The grand challenge of protein design is a general method for producing a polypeptide with arbitrary functionality, conformation, and biochemical properties. To that end, a wide variety of methods have been developed for the improvement of native proteins, the design of ideal proteins de novo, and the redesign of suboptimal proteins with better-performing substructures. These methods employ informatic comparisons of function-structure-sequence relationships as well as knowledge-based evaluation of protein properties to narrow the immense protein sequence search space down to an enumerable and often manually evaluable set of structures that meet specified criteria. While arbitrary manipulation of protein-protein interfaces and molecular catalysis remains an unsolved problem, and no protein shape or behavior manipulation algorithm is universally applicable, the promising results thus far are a strong indicator that a general approach to the arbitrary manipulation of polypeptides is within reach.
Topics: Algorithms; Amino Acid Sequence; Catalysis; Protein Conformation; Protein Folding; Proteins
PubMed: 33276237
DOI: 10.1016/j.sbi.2020.10.015

Bioinformatics (Oxford, England), Jun 2023
MOTIVATION
Millions of protein sequences have been generated by numerous genome and transcriptome sequencing projects. However, experimentally determining protein function is still a time-consuming, low-throughput, and expensive process, leaving a large protein sequence-function gap. It is therefore important to develop computational methods that accurately predict protein function to fill this gap. Although many methods use protein sequences as input to predict function, far fewer leverage protein structures, because accurate structures were unavailable for most proteins until recently.
RESULTS
We developed TransFun, a method that uses a transformer-based protein language model and 3D-equivariant graph neural networks to distill information from both protein sequences and structures to predict protein function. It extracts feature embeddings from protein sequences using a pre-trained protein language model (ESM) via transfer learning and combines them with 3D structures of proteins predicted by AlphaFold2 through equivariant graph neural networks. Benchmarked on the CAFA3 test dataset and a new test dataset, TransFun outperforms several state-of-the-art methods, indicating that language models and 3D-equivariant graph neural networks are effective at leveraging protein sequences and structures to improve protein function prediction. Combining TransFun predictions with sequence similarity-based predictions can further increase prediction accuracy.
AVAILABILITY AND IMPLEMENTATION
The source code of TransFun is available at https://github.com/jianlin-cheng/TransFun.
Topics: Amino Acid Sequence; Benchmarking; Language; Neural Networks, Computer; Software
PubMed: 37387145
DOI: 10.1093/bioinformatics/btad208

Bioinformatics (Oxford, England), Sep 2022
MOTIVATION
Computational methods for compound-protein affinity and contact (CPAC) prediction aim at facilitating rational drug discovery by simultaneous prediction of the strength and the pattern of compound-protein interactions. Although the desired outputs are highly structure-dependent, the lack of protein structures often makes structure-free methods rely on protein sequence inputs alone. The scarcity of compound-protein pairs with affinity and contact labels further limits the accuracy and the generalizability of CPAC models.
RESULTS
To overcome the aforementioned challenges of structure naivety and labeled-data scarcity, we introduce cross-modality and self-supervised learning, respectively, for structure-aware and task-relevant protein embedding. Specifically, protein data are available in two modalities, 1D amino-acid sequences and predicted 2D contact maps, which are embedded separately with recurrent and graph neural networks, respectively, as well as jointly with two cross-modality schemes. Furthermore, both protein modalities are pre-trained under various self-supervised learning strategies by leveraging massive amounts of unlabeled protein data. Our results indicate that individual protein modalities differ in their strengths at predicting affinities or contacts. Proper cross-modality protein embedding combined with self-supervised learning improves model generalizability when predicting both affinities and contacts for unseen proteins.
AVAILABILITY AND IMPLEMENTATION
Data and source codes are available at https://github.com/Shen-Lab/CPAC.
SUPPLEMENTARY INFORMATION
Supplementary data are available at Bioinformatics online.
Topics: Amino Acid Sequence; Drug Discovery; Neural Networks, Computer; Proteins; Software
PubMed: 36124802
DOI: 10.1093/bioinformatics/btac470

Cell Systems, Jun 2021
Language models have recently emerged as a powerful machine-learning approach for distilling information from massive protein sequence databases. From readily available sequence data alone, these models discover evolutionary, structural, and functional organization across protein space. Using language models, we can encode amino-acid sequences into distributed vector representations that capture their structural and functional properties, as well as evaluate the evolutionary fitness of sequence variants. We discuss recent advances in protein language modeling and their applications to downstream protein property prediction problems. We then consider how these models can be enriched with prior biological knowledge and introduce an approach for encoding protein structural knowledge into the learned representations. The knowledge distilled by these models allows us to improve downstream function prediction through transfer learning. Deep protein language models are revolutionizing protein biology. They suggest new ways to approach protein and therapeutic design. However, further developments are needed to encode strong biological priors into protein language models and to increase their accessibility to the broader community.
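Encoding a sequence into a distributed vector representation typically means taking per-residue embeddings from the model and pooling them into one fixed-length protein-level vector for downstream prediction. The mean-pooling step, a common recipe not tied to any particular model, can be sketched as:

```python
def mean_pool(residue_embeddings):
    """Average a list of L per-residue vectors (each of dimension d)
    into a single fixed-size protein-level vector of dimension d."""
    L = len(residue_embeddings)
    d = len(residue_embeddings[0])
    return [sum(vec[j] for vec in residue_embeddings) / L for j in range(d)]
```

The resulting vector has the same dimension regardless of sequence length, which is what lets variable-length proteins feed a fixed-input downstream predictor; alternatives such as using a special classification token or attention pooling are also common.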
Topics: Amino Acid Sequence; Databases, Protein; Language; Machine Learning; Proteins
PubMed: 34139171
DOI: 10.1016/j.cels.2021.05.017

Scientific Reports, May 2022
Proteins are essential biological macromolecules required to perform nearly all biological processes and cellular functions. Proteins rarely carry out their tasks in isolation; instead, they interact with other proteins in their surroundings (protein-protein interaction) to complete biological activities. Knowledge of protein-protein interactions (PPIs) unravels cellular behavior and functionality. Computational methods automate the prediction of PPIs and are less expensive than experimental methods in terms of resources and time. So far, most work on PPIs has focused mainly on sequence information. Here, we use a graph convolutional network (GCN) and a graph attention network (GAT) to predict interactions between proteins by utilizing a protein's structural information and sequence features. We build protein graphs from PDB files, which contain the 3D coordinates of atoms. The protein graph represents the amino acid network, also known as the residue contact network, where each node is a residue. Two nodes are connected if they have a pair of atoms (one from each node) within a threshold distance. To extract node/residue features, we use a protein language model: the input is the protein sequence, and the output is a feature vector for each amino acid of the underlying sequence. We validate the predictive capability of the proposed graph-based approach on two PPI datasets, Human and S. cerevisiae. The obtained results demonstrate the effectiveness of the proposed approach, as it outperforms previous leading methods. The source code and training data are available at https://github.com/JhaKanchan15/PPI_GNN.git.
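A residue contact network of this kind can be sketched in a few lines. The paper connects residues when any atom pair falls within the cutoff; the version below simplifies to C-alpha coordinates only, and the 8 Å threshold is an assumed value for illustration, not taken from the paper:

```python
import math

def contact_edges(ca_coords, threshold=8.0):
    """Edges (i, j), i < j, between residues whose C-alpha atoms lie
    within `threshold` angstroms of each other. Using only C-alpha
    positions is a simplification of any-atom-pair contact criteria."""
    edges = []
    n = len(ca_coords)
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(ca_coords[i], ca_coords[j]) <= threshold:
                edges.append((i, j))
    return edges
```

For example, three residues at x = 0, 3, and 20 Å yield the single edge (0, 1). In practice the coordinates would come from a PDB parser, and the edge list would be handed to the GNN as the graph's connectivity.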
Topics: Amino Acid Sequence; Amino Acids; Humans; Neural Networks, Computer; Proteins; Saccharomyces cerevisiae
PubMed: 35589837
DOI: 10.1038/s41598-022-12201-9

Briefings in Bioinformatics, Sep 2023
Protein function annotation is one of the most important research topics for revealing the essence of life at the molecular level in the post-genome era. Current research shows that integrating multisource data can effectively improve the performance of protein function prediction models. However, heavy reliance on complex feature engineering and model-integration methods limits the development of existing approaches. Moreover, deep learning models often use only the labeled data in a given dataset to extract sequence features, ignoring the large amount of existing unlabeled sequence data. Here, we propose an end-to-end protein function annotation model named HNetGO, which uses a heterogeneous network to integrate protein sequence similarity and protein-protein interaction network information, and applies a pretrained model to extract semantic features of the protein sequence. In addition, we design an attention-based graph neural network model that can effectively extract node-level features from heterogeneous networks and predict protein function by measuring the similarity between protein nodes and gene ontology term nodes. Comparative experiments on the human dataset show that HNetGO achieves state-of-the-art performance on the cellular component and molecular function branches.
Topics: Humans; Amino Acid Sequence; Gene Ontology; Molecular Sequence Annotation; Neural Networks, Computer; Protein Interaction Maps
PubMed: 37861172
DOI: 10.1093/bib/bbab556

Proceedings of the National Academy of..., Apr 2021
Intracellular phase separation of proteins into biomolecular condensates is increasingly recognized as a process with a key role in cellular compartmentalization and regulation. Different hypotheses about the parameters that determine the tendency of proteins to form condensates have been proposed, with some of them probed experimentally through the use of constructs generated by sequence alterations. To broaden the scope of these observations, we established an in silico strategy for understanding on a global level the associations between protein sequence and phase behavior and further constructed machine-learning models for predicting protein liquid-liquid phase separation (LLPS). Our analysis highlighted that LLPS-prone proteins are more disordered, less hydrophobic, and of lower Shannon entropy than sequences in the Protein Data Bank or the Swiss-Prot database and that they show a fine balance in their relative content of polar and hydrophobic residues. To further learn in a hypothesis-free manner the sequence features underpinning LLPS, we trained a neural network-based language model and found that a classifier constructed on such embeddings learned the underlying principles of phase behavior at a comparable accuracy to a classifier that used knowledge-based features. By combining knowledge-based features with unsupervised embeddings, we generated an integrated model that distinguished LLPS-prone sequences both from structured proteins and from unstructured proteins with a lower LLPS propensity and further identified such sequences from the human proteome at a high accuracy. These results provide a platform rooted in molecular principles for understanding protein phase behavior. The predictor, termed DeePhase, is accessible from https://deephase.ch.cam.ac.uk/.
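Among the knowledge-based features named above is the Shannon entropy of a sequence. Computed over single-residue composition (the study's exact feature definition may differ), it is:

```python
import math
from collections import Counter

def sequence_entropy(seq):
    """Shannon entropy (bits) of the amino-acid composition of a sequence.
    Low entropy indicates compositional bias, e.g. low-complexity regions."""
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in Counter(seq).values())
```

A homopolymer like "AAAA" scores 0 bits, while a sequence using four residues equally, such as "ACDE", scores 2 bits; the 20-letter alphabet caps the value at log2(20), about 4.32 bits.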
Topics: Amino Acid Sequence; Animals; Humans; Hydrophobic and Hydrophilic Interactions; Machine Learning; Sequence Analysis, Protein
PubMed: 33827920
DOI: 10.1073/pnas.2019053118

Current Opinion in Structural Biology, Oct 2023 (Review)
Ancestral sequence reconstruction (ASR) provides insight into the changes within a protein sequence across evolution. More specifically, it can illustrate how specific amino acid changes give rise to different phenotypes within a protein family. Over the last few decades, it has established itself as a powerful technique for revealing molecular common denominators that govern enzyme function. Here, we describe the strength of ASR in unveiling catalytic mechanisms and emerging phenotypes for a range of different proteins, also highlighting biotechnological applications the methodology can provide.
Topics: Phylogeny; Evolution, Molecular; Proteins; Amino Acid Sequence; Phenotype
PubMed: 37544113
DOI: 10.1016/j.sbi.2023.102669