-
Current Opinion in Structural Biology Oct 2023 (Review)
Ancestral sequence reconstruction (ASR) provides insight into the changes within a protein sequence across evolution. More specifically, it can illustrate how specific amino acid changes give rise to different phenotypes within a protein family. Over the last few decades it has established itself as a powerful technique for revealing molecular common denominators that govern enzyme function. Here, we describe the strength of ASR in unveiling catalytic mechanisms and emerging phenotypes for a range of different proteins, also highlighting biotechnological applications the methodology can provide.
Topics: Phylogeny; Evolution, Molecular; Proteins; Amino Acid Sequence; Phenotype
PubMed: 37544113
DOI: 10.1016/j.sbi.2023.102669 -
Briefings in Bioinformatics Mar 2022
Spatial structures of proteins are closely related to protein functions. Integrating protein structures improves the performance of protein-protein interaction (PPI) prediction. However, the limited quantity of known protein structures restricts the application of structure-based prediction methods. Utilizing the predicted protein structure information is a promising method to improve the performance of sequence-based prediction methods. We propose a novel end-to-end framework, TAGPPI, to predict PPIs using protein sequence alone. TAGPPI extracts multi-dimensional features by employing 1D convolution operation on protein sequences and graph learning method on contact maps constructed from AlphaFold. A contact map contains abundant spatial structure information, which is difficult to obtain from 1D sequence data directly. We further demonstrate that the spatial information learned from contact maps improves the ability of TAGPPI in PPI prediction tasks. We compare the performance of TAGPPI with those of nine state-of-the-art sequence-based methods, and TAGPPI outperforms such methods in all metrics. To the best of our knowledge, this is the first method to use the predicted protein topology structure graph for sequence-based PPI prediction. More importantly, our proposed architecture could be extended to other prediction tasks related to proteins.
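As a rough illustration of the contact-map idea mentioned in the abstract above, the sketch below turns a list of residue coordinates into a binary contact map and a graph edge list. It is not the TAGPPI implementation: the use of one coordinate per residue (for example the C-alpha atom of a predicted structure), the 8 Å cutoff, and the function names `contact_map` and `edge_list` are all assumptions made for this example.

```python
import math

def contact_map(coords, threshold=8.0):
    """Binary contact map from one 3D coordinate per residue.

    Assumption: two residues are "in contact" when their coordinates lie
    within `threshold` angstroms. Self-contacts on the diagonal are left at 0.
    """
    n = len(coords)
    cmap = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) <= threshold:
                cmap[i][j] = cmap[j][i] = 1  # contacts are symmetric
    return cmap

def edge_list(cmap):
    """Undirected edges (i, j) with i < j, the form a graph learner would consume."""
    n = len(cmap)
    return [(i, j) for i in range(n) for j in range(i + 1, n) if cmap[i][j]]

# Toy coordinates: three residues spaced 3.8 angstroms apart, one far away.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (20.0, 0.0, 0.0)]
cmap = contact_map(coords)
edges = edge_list(cmap)
```

The edge list is where 1D sequence features and spatial information would meet in a graph model: residues that are distant in sequence but close in space become neighbours in the graph.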
Topics: Amino Acid Sequence; Machine Learning; Proteins
PubMed: 35018418
DOI: 10.1093/bib/bbab558 -
Cell Systems Nov 2023
Attention-based models trained on protein sequences have demonstrated incredible success at classification and generation tasks relevant for artificial-intelligence-driven protein design. However, we lack a sufficient understanding of how very large-scale models and data play a role in effective protein model development. We introduce a suite of protein language models, named ProGen2, that are scaled up to 6.4B parameters and trained on different sequence datasets drawn from over a billion proteins from genomic, metagenomic, and immune repertoire databases. ProGen2 models show state-of-the-art performance in capturing the distribution of observed evolutionary sequences, generating novel viable sequences, and predicting protein fitness without additional fine-tuning. As large model sizes and raw numbers of protein sequences continue to become more widely accessible, our results suggest that a growing emphasis needs to be placed on the data distribution provided to a protein sequence model. Our models and code are open sourced for widespread adoption in protein engineering. A record of this paper's Transparent Peer Review process is included in the supplemental information.
Topics: Proteins; Artificial Intelligence; Amino Acid Sequence; Language; Databases, Factual
PubMed: 37909046
DOI: 10.1016/j.cels.2023.10.002 -
PLoS Computational Biology Aug 2022
The unprecedented performance of DeepMind's AlphaFold2 in predicting protein structure in CASP XIV and the creation of a database of structures for multiple proteomes and protein sequence repositories is reshaping structural biology. However, because this database returns a single structure, it brought into question AlphaFold's ability to capture the intrinsic conformational flexibility of proteins. Here we present a general approach to drive AlphaFold2 to model alternate protein conformations through simple manipulation of the multiple sequence alignment via in silico mutagenesis. The approach is grounded in the hypothesis that the multiple sequence alignment must also encode protein structural heterogeneity, thus its rational manipulation will enable AlphaFold2 to sample alternate conformations. A systematic modeling pipeline is benchmarked against canonical examples of protein conformational flexibility and applied to interrogate the conformational landscape of membrane proteins. This work broadens the applicability of AlphaFold2 by generating multiple protein conformations to be tested biologically, biochemically, biophysically, and for use in structure-based drug design.
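To make "manipulation of the multiple sequence alignment via in silico mutagenesis" concrete, here is a minimal sketch that overwrites chosen alignment columns with a single residue. It does not reproduce the authors' pipeline: the choice of alanine as the replacement, the rule that gaps are preserved, the `keep_query` option, and the function name `mutate_msa_columns` are all assumptions for illustration.

```python
def mutate_msa_columns(msa, columns, residue="A", keep_query=False):
    """Replace the given 0-based alignment columns with `residue` in each row.

    Assumptions: `msa` is a list of equal-length aligned strings with the
    query first; gap characters ('-') are never overwritten; when
    `keep_query` is True the first row is returned unchanged.
    """
    cols = set(columns)
    mutated = []
    for row_index, row in enumerate(msa):
        if keep_query and row_index == 0:
            mutated.append(row)
            continue
        mutated.append("".join(
            residue if (j in cols and ch != "-") else ch
            for j, ch in enumerate(row)
        ))
    return mutated

msa = ["MKTLV", "MRT-V", "MKSLI"]
all_rows = mutate_msa_columns(msa, [1, 3])
homologs_only = mutate_msa_columns(msa, [1, 3], keep_query=True)
```

The mutated alignment would then be supplied to the structure predictor in place of the original; which columns to mutate, and in what window sizes, is the part such a pipeline would have to explore systematically.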
Topics: Amino Acid Sequence; Drug Design; Protein Conformation; Proteins; Sequence Alignment
PubMed: 35994486
DOI: 10.1371/journal.pcbi.1010483 -
Trends in Genetics : TIG Jan 2015 (Review)
The factors that determine the tempo and mode of protein evolution continue to be a central question in molecular evolution. Traditionally, studies of protein evolution focused on the rates of amino acid substitutions. More recently, with the availability of sequence data and advanced experimental techniques, the focus of attention has shifted toward the study of evolutionary trajectories and the overall layout of protein fitness landscapes. In this review we describe the effect of epistasis on the topology of evolutionary pathways that are likely to be found in fitness landscapes and develop a simple theory to connect the number of maladapted genotypes to the topology of fitness landscapes with epistatic interactions. Finally, we review recent studies that have probed the extent of epistatic interactions and have begun to chart the fitness landscapes in protein sequence space.
Topics: Amino Acid Sequence; Animals; Epistasis, Genetic; Genetic Fitness; Humans
PubMed: 25438718
DOI: 10.1016/j.tig.2014.09.009 -
Scientific Reports Apr 2024
Evolution shapes protein sequences for their functions. Here, we studied the moonlighting functions of the N-linked sequon NXS/T, where X is not P, in human nucleocytosolic proteins. By comparing membrane and secreted proteins in which sequons are well known for N-glycosylation, we discovered that cyto-sequons can participate in nucleic acid binding, particularly in zinc finger proteins. Our global studies further discovered that sequon occurrence is largely proportional to protein length. The contribution of sequons to protein functions, including both N-glycosylation and nucleic acid binding, can be regulated through their density as well as the biased usage between NXS and NXT. In proteins where other PTMs or structural features are rich, such as phosphorylation, transmembrane α-helices, and disulfide bridges, sequon occurrence is scarce. The information acquired here should help understand the relationship between protein sequence and function and assist future protein design and engineering.
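The sequon pattern described above (N, then any residue except P, then S or T) maps directly onto a regular expression. The sketch below is an illustration only, not the study's code; the helper names `find_sequons`, `sequon_density`, and `nxs_nxt_counts` are hypothetical, and density is assumed here to mean sequons per residue.

```python
import re

# N-X-S/T with X != P. The zero-width lookahead lets overlapping sequons
# (for example "NNTS", which contains both NNT and NTS) all be reported.
SEQUON = re.compile(r"(?=(N[^P][ST]))")

def find_sequons(seq):
    """Return (0-based position, tripeptide) for every sequon in `seq`."""
    return [(m.start(), m.group(1)) for m in SEQUON.finditer(seq.upper())]

def sequon_density(seq):
    """Sequons per residue (assumed definition); 0.0 for an empty sequence."""
    return len(find_sequons(seq)) / len(seq) if seq else 0.0

def nxs_nxt_counts(seq):
    """Count NXS and NXT sequons separately, to inspect usage bias."""
    hits = find_sequons(seq)
    nxs = sum(1 for _, tri in hits if tri[2] == "S")
    return nxs, len(hits) - nxs

example = "MNGSANPTNNTS"  # toy sequence; NPT is excluded because X is P
hits = find_sequons(example)
```

Dividing the count by sequence length gives a density that can be compared across proteins of different sizes, which matters given the reported proportionality between sequon occurrence and protein length.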
Topics: Humans; Proteins; Glycosylation; Amino Acid Sequence; Phosphorylation; Nucleic Acids
PubMed: 38565583
DOI: 10.1038/s41598-024-57334-1 -
Bioinformatics (Oxford, England) Nov 2021
MOTIVATION
Modeling of protein family sequence distribution from homologous sequence data has recently received considerable attention, in particular for structure and function predictions, as well as for protein design. Notably, direct coupling analysis, a method to infer effective pairwise interactions between residues, was shown to capture important structural constraints and to successfully generate functional protein sequences. Building on this and other graphical models, we introduce a new framework to assess the quality of the secondary structures of the generated sequences with respect to reference structures for the family.
RESULTS
We introduce two scoring functions, called Dot Product and Pattern Matching, that characterize the likelihood that the secondary structure of a protein sequence matches a reference structure. We test these scores on published experimental protein mutagenesis and design datasets, and show improvement in the detection of nonfunctional sequences. We also show that use of these scores helps reject nonfunctional sequences generated by graphical models (Restricted Boltzmann Machines) learned from homologous sequence alignments.
AVAILABILITY AND IMPLEMENTATION
Data and code available at https://github.com/CyrilMa/ssqa.
SUPPLEMENTARY INFORMATION
Supplementary data are available at Bioinformatics online.
Topics: Proteins; Amino Acid Sequence; Sequence Alignment; Protein Structure, Secondary; Mutagenesis
PubMed: 34117879
DOI: 10.1093/bioinformatics/btab442 -
Computers in Biology and Medicine Dec 2023
Protein sequence classification is a crucial research field in bioinformatics, playing a vital role in facilitating functional annotation, structure prediction, and gaining a deeper understanding of protein function and interactions. With the rapid development of high-throughput sequencing technologies, a vast amount of unknown protein sequence data is being generated and accumulated, leading to an increasing demand for protein classification and annotation. Existing machine learning methods still have limitations in protein sequence classification, such as low accuracy and precision of classification models, rendering them less valuable in practical applications. Additionally, these models often lack strong generalization capabilities and cannot be widely applied to various types of proteins. Therefore, accurately classifying and predicting proteins remains a challenging task. In this study, we propose a protein sequence classifier called Multi-Laplacian Regularized Random Vector Functional Link (MLapRVFL). By incorporating Multi-Laplacian and L regularization terms into the basic Random Vector Functional Link (RVFL) method, we improve the model's generalization performance and enhance the robustness and accuracy of the classification model. The experimental results on two commonly used datasets demonstrate that MLapRVFL outperforms popular machine learning methods and achieves superior predictive performance compared to previous studies. In conclusion, the proposed MLapRVFL method makes significant contributions to protein sequence prediction.
Topics: Machine Learning; Amino Acid Sequence; Proteins; Algorithms
PubMed: 37925912
DOI: 10.1016/j.compbiomed.2023.107618 -
Nature Biotechnology Feb 2017
Topics: Amino Acid Sequence; Humans; Proteins
PubMed: 28178247
DOI: 10.1038/nbt.3786 -
BMC Genomics Jun 2022
BACKGROUND
Affinity prediction between molecule and protein is an important step of virtual screening, which is usually called drug-target affinity (DTA) prediction. Its accuracy directly influences the progress of drug development. Sequence-based drug-target affinity prediction can predict the affinity according to protein sequence, which is fast and can be applied to large datasets. However, due to the lack of protein structure information, the accuracy needs to be improved.
RESULTS
The proposed model, called WGNN-DTA, can handle both drug-target affinity (DTA) and compound-protein interaction (CPI) prediction tasks. Experiments designed to verify the performance of the proposed method in different scenarios show that WGNN-DTA has the advantages of simplicity and high accuracy. Moreover, because it does not need complex steps such as multiple sequence alignment (MSA), it executes quickly and is suitable for screening large databases.
CONCLUSION
We construct protein and molecular graphs from sequences and SMILES strings that can effectively reflect their structures. To utilize the detailed contact information of proteins, a graph neural network is used to extract features and predict the binding affinity based on the graphs; the resulting model is called the weighted graph neural networks drug-target affinity predictor (WGNN-DTA). The proposed method has the advantages of simplicity and high accuracy.
Topics: Amino Acid Sequence; Drug Development; Neural Networks, Computer; Proteins; Sequence Alignment
PubMed: 35715739
DOI: 10.1186/s12864-022-08648-9