Journal of Chemical Information and..., Apr 2021
Small molecules play a critical role in modulating biological systems. Knowledge of chemical-protein interactions helps address fundamental and practical questions in biology and medicine. However, with the rapid emergence of newly sequenced genes, the endogenous or surrogate ligands of a vast number of proteins remain unknown. Homology modeling and machine learning are two major methods for assigning new ligands to a protein, but both mostly fail when sequence homology between an unannotated protein and proteins with known functions or structures is low. In this study, we develop a new deep learning framework to predict chemical binding to evolutionarily divergent unannotated proteins, whose ligands cannot be reliably predicted by existing methods. By incorporating evolutionary information into self-supervised learning on unlabeled protein sequences, we develop a novel method, distilled sequence alignment embedding (DISAE), for protein sequence representation. DISAE can utilize all protein sequences and their multiple sequence alignments (MSAs) to capture functional relationships between proteins without knowledge of their structure and function. Following DISAE pretraining, we devise a module-based fine-tuning strategy for the supervised learning of chemical-protein interactions. In benchmark studies, DISAE significantly improves the generalizability of machine learning models and outperforms state-of-the-art methods by a large margin. Comprehensive ablation studies suggest that the use of MSAs, sequence distillation, and triplet pretraining contributes critically to the success of DISAE. Interpretability analysis suggests that DISAE learns biologically meaningful information. We further use DISAE to assign ligands to human orphan G-protein coupled receptors (GPCRs) and to cluster the human GPCRome by integrating phylogenetic and ligand relationships. The promising results of DISAE open an avenue for exploring the chemical landscape of entire sequenced genomes.
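DISAE's actual distillation procedure is described in the paper itself; purely as an illustration of the general idea of distilling an MSA down to its informative positions, one could keep only well-conserved columns. The `min_conservation` threshold and the majority-residue voting scheme below are assumptions for illustration, not DISAE's algorithm:

```python
from collections import Counter

def distill_msa(msa, ref_index=0, min_conservation=0.5):
    """Keep only MSA columns whose most frequent non-gap residue appears
    in at least `min_conservation` of the sequences; return the reference
    sequence restricted to those columns (gaps dropped)."""
    kept = []
    for col in range(len(msa[0])):
        residues = [seq[col] for seq in msa if seq[col] != "-"]
        if not residues:
            continue  # all-gap column carries no signal
        top_count = Counter(residues).most_common(1)[0][1]
        if top_count / len(msa) >= min_conservation:
            kept.append(col)
    return "".join(msa[ref_index][c] for c in kept if msa[ref_index][c] != "-")
```

For example, `distill_msa(["MKV-A", "MKI-A", "MRVLA"], 0, 0.6)` drops the mostly-gap fourth column and returns `"MKVA"`. Real distillation schemes differ in how they score conservation and handle gaps; this only sketches the column-selection idea.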
Topics: Amino Acid Sequence; Computational Biology; Humans; Ligands; Machine Learning; Phylogeny; Sequence Alignment
PubMed: 33757283
DOI: 10.1021/acs.jcim.0c01285

The Journal of Physical Chemistry..., Aug 2022
A self-consistent analytical solution for binodal concentrations of the two-component Flory-Huggins phase separation model is derived. We show that this form extends the validity of the Ginzburg-Landau expansion away from the critical point to cover the whole phase space. Furthermore, this analytical solution reveals an exponential scaling law of the dilute phase binodal concentration as a function of the interaction strength and chain length. We demonstrate explicitly the power of this approach by fitting experimental protein liquid-liquid phase separation boundaries to determine the effective chain length and solute-solvent interaction energies. Moreover, we demonstrate that this strategy allows us to resolve differences in interaction energy contributions of individual amino acids. This analytical framework can serve as a new way to decode the protein sequence grammar for liquid-liquid phase separation.
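The paper's self-consistent binodal solution is derived there; the standard Flory-Huggins ingredients it builds on are textbook results and can be sketched directly. For a chain of length N at volume fraction phi, the mixing free energy per site (in units of kT) and the critical point are:

```python
import math

def fh_free_energy(phi, N, chi):
    """Flory-Huggins mixing free energy per lattice site, in units of kT,
    for a polymer of length N at volume fraction phi with interaction chi."""
    return ((phi / N) * math.log(phi)
            + (1 - phi) * math.log(1 - phi)
            + chi * phi * (1 - phi))

def critical_point(N):
    """Textbook Flory-Huggins critical point:
    phi_c = 1 / (1 + sqrt(N)),  chi_c = (1 + 1/sqrt(N))**2 / 2."""
    s = math.sqrt(N)
    return 1 / (1 + s), 0.5 * (1 + 1 / s) ** 2
```

At the critical point the spinodal condition f''(phi) = 1/(N phi) + 1/(1 - phi) - 2 chi = 0 holds exactly; the paper's analytical binodal extends beyond this neighborhood, which is not reproduced here.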
Topics: Amino Acid Sequence; Proteins; Solutions; Solvents; Thermodynamics
PubMed: 35977086
DOI: 10.1021/acs.jpclett.2c01986

Current Opinion in Structural Biology, Feb 2021 (Review)
The grand challenge of protein design is a general method for producing a polypeptide with arbitrary functionality, conformation, and biochemical properties. To that end, a wide variety of methods have been developed for the improvement of native proteins, the design of ideal proteins de novo, and the redesign of suboptimal proteins with better-performing substructures. These methods employ informatic comparisons of function-structure-sequence relationships as well as knowledge-based evaluation of protein properties to narrow the immense protein sequence search space down to an enumerable and often manually evaluable set of structures that meet specified criteria. While arbitrary manipulation of protein-protein interfaces and molecular catalysis remains an unsolved problem, and no protein shape or behavior manipulation algorithm is universally applicable, the promising results thus far are a strong indicator that a general approach to the arbitrary manipulation of polypeptides is within reach.
Topics: Algorithms; Amino Acid Sequence; Catalysis; Protein Conformation; Protein Folding; Proteins
PubMed: 33276237
DOI: 10.1016/j.sbi.2020.10.015

Bioinformatics (Oxford, England), Jun 2023
MOTIVATION
Millions of protein sequences have been generated by numerous genome and transcriptome sequencing projects. However, experimentally determining protein function is still a time-consuming, low-throughput, and expensive process, leaving a large protein sequence-function gap. It is therefore important to develop computational methods that accurately predict protein function to fill this gap. Although many methods use protein sequences as input to predict function, far fewer leverage protein structures, because accurate structures were unavailable for most proteins until recently.
RESULTS
We developed TransFun, a method that uses a transformer-based protein language model and 3D-equivariant graph neural networks to distill information from both protein sequences and structures to predict protein function. It extracts feature embeddings from protein sequences using a pre-trained protein language model (ESM) via transfer learning and combines them with 3D structures of proteins predicted by AlphaFold2 through equivariant graph neural networks. Benchmarked on the CAFA3 test dataset and a new test dataset, TransFun outperforms several state-of-the-art methods, indicating that language models and 3D-equivariant graph neural networks are effective at leveraging protein sequences and structures to improve protein function prediction. Combining TransFun predictions with sequence similarity-based predictions can further increase prediction accuracy.
AVAILABILITY AND IMPLEMENTATION
The source code of TransFun is available at https://github.com/jianlin-cheng/TransFun.
Topics: Amino Acid Sequence; Benchmarking; Language; Neural Networks, Computer; Software
PubMed: 37387145
DOI: 10.1093/bioinformatics/btad208

Bioinformatics (Oxford, England), Sep 2022
MOTIVATION
Computational methods for compound-protein affinity and contact (CPAC) prediction aim at facilitating rational drug discovery by simultaneous prediction of the strength and the pattern of compound-protein interactions. Although the desired outputs are highly structure-dependent, the lack of protein structures often makes structure-free methods rely on protein sequence inputs alone. The scarcity of compound-protein pairs with affinity and contact labels further limits the accuracy and the generalizability of CPAC models.
RESULTS
To overcome the aforementioned challenges of structure naivety and labeled-data scarcity, we introduce cross-modality and self-supervised learning, respectively, for structure-aware and task-relevant protein embedding. Specifically, protein data are available in two modalities, 1D amino-acid sequences and predicted 2D contact maps, which are embedded separately with recurrent and graph neural networks, respectively, as well as jointly with two cross-modality schemes. Furthermore, both protein modalities are pre-trained under various self-supervised learning strategies by leveraging massive amounts of unlabeled protein data. Our results indicate that individual protein modalities differ in their strengths at predicting affinities or contacts. Proper cross-modality protein embedding combined with self-supervised learning improves model generalizability when predicting both affinities and contacts for unseen proteins.
AVAILABILITY AND IMPLEMENTATION
Data and source codes are available at https://github.com/Shen-Lab/CPAC.
SUPPLEMENTARY INFORMATION
Supplementary data are available at Bioinformatics online.
Topics: Amino Acid Sequence; Drug Discovery; Neural Networks, Computer; Proteins; Software
PubMed: 36124802
DOI: 10.1093/bioinformatics/btac470

Cell Systems, Jun 2021
Language models have recently emerged as a powerful machine-learning approach for distilling information from massive protein sequence databases. From readily available sequence data alone, these models discover evolutionary, structural, and functional organization across protein space. Using language models, we can encode amino-acid sequences into distributed vector representations that capture their structural and functional properties, as well as evaluate the evolutionary fitness of sequence variants. We discuss recent advances in protein language modeling and their applications to downstream protein property prediction problems. We then consider how these models can be enriched with prior biological knowledge and introduce an approach for encoding protein structural knowledge into the learned representations. The knowledge distilled by these models allows us to improve downstream function prediction through transfer learning. Deep protein language models are revolutionizing protein biology. They suggest new ways to approach protein and therapeutic design. However, further developments are needed to encode strong biological priors into protein language models and to increase their accessibility to the broader community.
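Encoding a sequence into a distributed vector representation typically means taking per-residue embeddings from the model and pooling them into one fixed-length protein-level vector for downstream prediction. The mean-pooling step, a common recipe not tied to any particular model, can be sketched as:

```python
def mean_pool(residue_embeddings):
    """Average a list of L per-residue vectors (each of dimension d)
    into a single fixed-size protein-level vector of dimension d."""
    L = len(residue_embeddings)
    d = len(residue_embeddings[0])
    return [sum(vec[j] for vec in residue_embeddings) / L for j in range(d)]
```

The resulting vector has the same dimension regardless of sequence length, which is what lets variable-length proteins feed a fixed-input downstream predictor; alternatives such as using a special classification token or attention pooling are also common.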
Topics: Amino Acid Sequence; Databases, Protein; Language; Machine Learning; Proteins
PubMed: 34139171
DOI: 10.1016/j.cels.2021.05.017

Scientific Reports, May 2022
Proteins are essential biological macromolecules required to perform nearly all biological processes and cellular functions. Proteins rarely carry out their tasks in isolation; instead, they interact with other proteins in their surroundings (protein-protein interaction) to complete biological activities. Knowledge of protein-protein interactions (PPIs) unravels cellular behavior and functionality. Computational methods automate the prediction of PPIs and are less expensive than experimental methods in terms of resources and time. So far, most work on PPIs has focused mainly on sequence information. Here, we use a graph convolutional network (GCN) and a graph attention network (GAT) to predict interactions between proteins by utilizing a protein's structural information and sequence features. We build protein graphs from PDB files, which contain the 3D coordinates of atoms. The protein graph represents the amino acid network, also known as the residue contact network, where each node is a residue. Two nodes are connected if they have a pair of atoms (one from each node) within a threshold distance. To extract node/residue features, we use a protein language model: the input is the protein sequence, and the output is a feature vector for each amino acid of the underlying sequence. We validate the predictive capability of the proposed graph-based approach on two PPI datasets, Human and S. cerevisiae. The obtained results demonstrate the effectiveness of the proposed approach, as it outperforms previous leading methods. The source code and training data are available at https://github.com/JhaKanchan15/PPI_GNN.git.
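A residue contact network of this kind can be sketched in a few lines. The paper connects residues when any atom pair falls within the cutoff; the version below simplifies to C-alpha coordinates only, and the 8 Å threshold is an assumed value for illustration, not taken from the paper:

```python
import math

def contact_edges(ca_coords, threshold=8.0):
    """Edges (i, j), i < j, between residues whose C-alpha atoms lie
    within `threshold` angstroms of each other. Using only C-alpha
    positions is a simplification of any-atom-pair contact criteria."""
    edges = []
    n = len(ca_coords)
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(ca_coords[i], ca_coords[j]) <= threshold:
                edges.append((i, j))
    return edges
```

For example, three residues at x = 0, 3, and 20 Å yield the single edge (0, 1). In practice the coordinates would come from a PDB parser, and the edge list would be handed to the GNN as the graph's connectivity.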
Topics: Amino Acid Sequence; Amino Acids; Humans; Neural Networks, Computer; Proteins; Saccharomyces cerevisiae
PubMed: 35589837
DOI: 10.1038/s41598-022-12201-9

Briefings in Bioinformatics, Sep 2023
Protein function annotation is one of the most important research topics for revealing the essence of life at the molecular level in the post-genome era. Current research shows that integrating multisource data can effectively improve the performance of protein function prediction models. However, heavy reliance on complex feature engineering and model-integration methods limits the development of existing approaches. Moreover, deep learning models often use only the labeled data in a given dataset to extract sequence features, ignoring the large amount of existing unlabeled sequence data. Here, we propose an end-to-end protein function annotation model named HNetGO, which uses a heterogeneous network to integrate protein sequence similarity and protein-protein interaction network information, and applies a pretrained model to extract semantic features of the protein sequence. In addition, we design an attention-based graph neural network model that can effectively extract node-level features from heterogeneous networks and predict protein function by measuring the similarity between protein nodes and gene ontology term nodes. Comparative experiments on the human dataset show that HNetGO achieves state-of-the-art performance on the cellular component and molecular function branches.
Topics: Humans; Amino Acid Sequence; Gene Ontology; Molecular Sequence Annotation; Neural Networks, Computer; Protein Interaction Maps
PubMed: 37861172
DOI: 10.1093/bib/bbab556

Proceedings of the National Academy of..., Apr 2021
Intracellular phase separation of proteins into biomolecular condensates is increasingly recognized as a process with a key role in cellular compartmentalization and regulation. Different hypotheses about the parameters that determine the tendency of proteins to form condensates have been proposed, with some of them probed experimentally through the use of constructs generated by sequence alterations. To broaden the scope of these observations, we established an in silico strategy for understanding on a global level the associations between protein sequence and phase behavior and further constructed machine-learning models for predicting protein liquid-liquid phase separation (LLPS). Our analysis highlighted that LLPS-prone proteins are more disordered, less hydrophobic, and of lower Shannon entropy than sequences in the Protein Data Bank or the Swiss-Prot database and that they show a fine balance in their relative content of polar and hydrophobic residues. To further learn in a hypothesis-free manner the sequence features underpinning LLPS, we trained a neural network-based language model and found that a classifier constructed on such embeddings learned the underlying principles of phase behavior at a comparable accuracy to a classifier that used knowledge-based features. By combining knowledge-based features with unsupervised embeddings, we generated an integrated model that distinguished LLPS-prone sequences both from structured proteins and from unstructured proteins with a lower LLPS propensity and further identified such sequences from the human proteome at a high accuracy. These results provide a platform rooted in molecular principles for understanding protein phase behavior. The predictor, termed DeePhase, is accessible from https://deephase.ch.cam.ac.uk/.
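Among the knowledge-based features named above is the Shannon entropy of a sequence. Computed over single-residue composition (the study's exact feature definition may differ), it is:

```python
import math
from collections import Counter

def sequence_entropy(seq):
    """Shannon entropy (bits) of the amino-acid composition of a sequence.
    Low entropy indicates compositional bias, e.g. low-complexity regions."""
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in Counter(seq).values())
```

A homopolymer like "AAAA" scores 0 bits, while a sequence using four residues equally, such as "ACDE", scores 2 bits; the 20-letter alphabet caps the value at log2(20), about 4.32 bits.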
Topics: Amino Acid Sequence; Animals; Humans; Hydrophobic and Hydrophilic Interactions; Machine Learning; Sequence Analysis, Protein
PubMed: 33827920
DOI: 10.1073/pnas.2019053118

Current Opinion in Structural Biology, Oct 2023 (Review)
Ancestral sequence reconstruction (ASR) provides insight into the changes within a protein sequence across evolution. More specifically, it can illustrate how specific amino acid changes give rise to different phenotypes within a protein family. Over the last few decades, it has established itself as a powerful technique for revealing molecular common denominators that govern enzyme function. Here, we describe the strength of ASR in unveiling catalytic mechanisms and emerging phenotypes for a range of different proteins, also highlighting biotechnological applications the methodology can provide.
Topics: Phylogeny; Evolution, Molecular; Proteins; Amino Acid Sequence; Phenotype
PubMed: 37544113
DOI: 10.1016/j.sbi.2023.102669