-
Science (New York, N.Y.) Oct 2022
Although deep learning has revolutionized protein structure prediction, almost all experimentally characterized de novo protein designs have been generated using physically based approaches such as Rosetta. Here, we describe a deep learning-based protein sequence design method, ProteinMPNN, that has outstanding performance in both in silico and experimental tests. On native protein backbones, ProteinMPNN has a sequence recovery of 52.4% compared with 32.9% for Rosetta. The amino acid sequence at different positions can be coupled between single or multiple chains, enabling application to a wide range of current protein design challenges. We demonstrate the broad utility and high accuracy of ProteinMPNN using x-ray crystallography, cryo-electron microscopy, and functional studies by rescuing previously failed designs, which were made using Rosetta or AlphaFold, of protein monomers, cyclic homo-oligomers, tetrahedral nanoparticles, and target-binding proteins.
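The sequence recovery figures quoted above (52.4% versus 32.9%) are usually understood as the fraction of positions at which the designed residue matches the native one. A minimal sketch of that metric follows, assuming two pre-aligned, equal-length sequences; the function name and the example sequences are hypothetical and are not taken from the ProteinMPNN code.

```python
def sequence_recovery(native: str, designed: str) -> float:
    """Fraction of positions where the designed residue equals the native one.

    Assumes the two sequences are already aligned and of equal length.
    """
    if len(native) != len(designed):
        raise ValueError("sequences must have the same length")
    if not native:
        raise ValueError("sequences must be non-empty")
    matches = sum(a == b for a, b in zip(native, designed))
    return matches / len(native)


# Hypothetical 10-residue example: two substitutions (T->S, I->L).
native = "MKTAYIAKQR"
designed = "MKSAYLAKQR"
recovery = sequence_recovery(native, designed)
print(f"{recovery:.1%}")
```

Benchmarks typically average this value over many backbones; the sketch scores a single pair only.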
Topics: Amino Acid Sequence; Cryoelectron Microscopy; Crystallography, X-Ray; Deep Learning; Protein Conformation; Protein Engineering; Proteins
PubMed: 36108050
DOI: 10.1126/science.add2187
-
Bioinformatics (Oxford, England) Apr 2022
SUMMARY
Self-supervised deep language modeling has shown unprecedented success across natural language tasks, and has recently been repurposed to biological sequences. However, existing models and pretraining methods are designed and optimized for text analysis. We introduce ProteinBERT, a deep language model specifically designed for proteins. Our pretraining scheme combines language modeling with a novel task of Gene Ontology (GO) annotation prediction. We introduce novel architectural elements that make the model highly efficient and flexible to long sequences. The architecture of ProteinBERT consists of both local and global representations, allowing end-to-end processing of these types of inputs and outputs. ProteinBERT obtains near state-of-the-art performance, and sometimes exceeds it, on multiple benchmarks covering diverse protein properties (including protein structure, post-translational modifications and biophysical attributes), despite using a far smaller and faster model than competing deep-learning methods. Overall, ProteinBERT provides an efficient framework for rapidly training protein predictors, even with limited labeled data.
AVAILABILITY AND IMPLEMENTATION
Code and pretrained model weights are available at https://github.com/nadavbra/protein_bert.
SUPPLEMENTARY INFORMATION
Supplementary data are available at Bioinformatics online.
Topics: Deep Learning; Amino Acid Sequence; Proteins; Language; Natural Language Processing
PubMed: 35020807
DOI: 10.1093/bioinformatics/btac020
-
Current Protocols in Bioinformatics Jun 2016
Comparative protein structure modeling predicts the three-dimensional structure of a given protein sequence (target) based primarily on its alignment to one or more proteins of known structure (templates). The prediction process consists of fold assignment, target-template alignment, model building, and model evaluation. This unit describes how to calculate comparative models using the program MODELLER and how to use the ModBase database of such models, and discusses all four steps of comparative modeling, frequently observed errors, and some applications. Modeling lactate dehydrogenase from Trichomonas vaginalis (TvLDH) is described as an example. The download and installation of the MODELLER software is also described. © 2016 by John Wiley & Sons, Inc.
Topics: Amino Acid Sequence; Chemistry Techniques, Analytical; L-Lactate Dehydrogenase; Models, Molecular; Protein Conformation; Proteins; Sequence Alignment; Software; Trichomonas vaginalis
PubMed: 27322406
DOI: 10.1002/cpbi.3
-
Current Opinion in Structural Biology Feb 2022
Review
Machine-learning models that learn from data to predict how protein sequence encodes function are emerging as a useful protein engineering tool. However, when using these models to suggest new protein designs, one must deal with the vast combinatorial complexity of protein sequences. Here, we review how to use a sequence-to-function machine-learning surrogate model to select sequences for experimental measurement. First, we discuss how to select sequences through a single round of machine-learning optimization. Then, we discuss sequential optimization, where the goal is to discover optimized sequences and improve the model across multiple rounds of training, optimization, and experimental measurement.
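The sequential optimization loop described above (train a surrogate, choose sequences, measure them, retrain) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the four-letter alphabet, the hidden target that stands in for an experiment, the additive surrogate, and the single-mutant proposal step. None of it is taken from the review.

```python
import random

ALPHABET = "ACDE"      # toy alphabet (assumption)
TARGET = "ACDEACDE"    # hidden optimum of the toy assay (assumption)


def measure(seq):
    """Stand-in for an experiment: count positions matching the hidden target."""
    return sum(a == b for a, b in zip(seq, TARGET))


def fit_surrogate(data):
    """Additive surrogate: mean measured fitness per (position, residue) pair."""
    sums, counts = {}, {}
    for seq, y in data.items():
        for key in enumerate(seq):
            sums[key] = sums.get(key, 0.0) + y
            counts[key] = counts.get(key, 0) + 1
    overall = sum(data.values()) / len(data)

    def predict(seq):
        # Unseen (position, residue) pairs fall back to the overall mean.
        return sum(sums[k] / counts[k] if k in counts else overall
                   for k in enumerate(seq))

    return predict


def propose(data, rng, n_candidates=50):
    """Random single mutants of the best sequence measured so far."""
    best = max(data, key=data.get)
    candidates = set()
    for _ in range(n_candidates):
        pos = rng.randrange(len(best))
        candidates.add(best[:pos] + rng.choice(ALPHABET) + best[pos + 1:])
    return sorted(c for c in candidates if c not in data)


def sequential_optimization(rounds=5, batch=4, seed=0):
    """Run several train / select / measure rounds; return best fitness per round."""
    rng = random.Random(seed)
    start = ["".join(rng.choice(ALPHABET) for _ in TARGET) for _ in range(batch)]
    data = {s: measure(s) for s in start}
    history = [max(data.values())]
    for _ in range(rounds):
        predict = fit_surrogate(data)                       # retrain on all data
        ranked = sorted(propose(data, rng), key=predict, reverse=True)
        for seq in ranked[:batch]:                          # "measure" top picks
            data[seq] = measure(seq)
        history.append(max(data.values()))
    return history


history = sequential_optimization()
print(history)
```

Because every measurement is kept, the best observed fitness never decreases from one round to the next. How quickly it improves depends on the surrogate and the proposal scheme, both of which are deliberately simplistic here.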
Topics: Amino Acid Sequence; Machine Learning; Protein Engineering; Proteins
PubMed: 34896756
DOI: 10.1016/j.sbi.2021.11.002
-
Protein Science : a Publication of the... May 2016
Review
While ab initio modeling of protein structures is not routine, certain types of proteins are more straightforward to model than others. Proteins with short repetitive sequences typically exhibit repetitive structures. These repetitive sequences can be more amenable to modeling if some information is known about the predominant secondary structure or other key features of the protein sequence. We have successfully built models of a number of repetitive structures with novel folds using knowledge of the consensus sequence within the sequence repeat and an understanding of the likely secondary structures that these may adopt. Our methods for achieving this success are reviewed here.
Topics: Models, Molecular; Molecular Dynamics Simulation; Protein Folding; Protein Structure, Secondary; Proteins; Repetitive Sequences, Amino Acid
PubMed: 26914323
DOI: 10.1002/pro.2907
-
Nucleic Acids Research Apr 2022
Alternative splicing is frequently involved in the diversification of protein function and can also be modulated for therapeutic purposes. Here we develop a predictive model, called Exon ByPASS (predicting Exon skipping Based on Protein amino acid SequenceS), to assess the criticality of exon inclusion based solely on information contained in the amino acid sequence upstream and downstream of the exon junctions. By focusing on protein sequence, Exon ByPASS predicts exon skipping independent of tissue and species in the absence of any intronic information. We validate model predictions using transcriptomic and proteomic data and show that the model can capture exon skipping in different tissues and species. Additionally, we reveal potential therapeutic opportunities by predicting synthetically skippable exons and neo-junctions arising in cancer cells.
Topics: Alternative Splicing; Amino Acid Sequence; Exons; Introns; Proteomics
PubMed: 35286381
DOI: 10.1093/nar/gkac155
-
BMC Research Notes Feb 2024
OBJECTIVE
The superfamily of protein kinases features a common Protein Kinase-like (PKL) three-dimensional fold. Proteins with PKL structure can also possess enzymatic activities other than protein phosphorylation, such as AMPylation or glutamylation. PKL proteins play a vital role in the world of living organisms, contributing to the survival of pathogenic bacteria inside host cells, as well as being involved in carcinogenesis and neurological diseases in humans. The superfamily of PKL proteins is constantly growing. Therefore, it is crucial to gather new information about PKL families.
RESULTS
To this end, the KINtaro database (http://bioinfo.sggw.edu.pl/kintaro/) has been created as a resource for collecting and sharing such information. KINtaro combines protein sequence information and additional annotations for more than 70 PKL families, including 32 families not associated with the PKL superfamily in established protein domain databases. KINtaro is searchable by keywords and by protein sequence and provides family descriptions, sequences, sequence alignments, HMM models, 3D structure models, experimental structures with PKL domain annotations and sequence logos with catalytic residue annotations.
Topics: Humans; Protein Kinases; Proteins; Phosphorylation; Amino Acid Sequence; Sequence Alignment; Databases, Protein
PubMed: 38365785
DOI: 10.1186/s13104-024-06713-y
-
Bioinformatics (Oxford, England) Jan 2023
MOTIVATION
As more data on experimentally determined protein structures become available, data-driven models to describe protein sequence-structure relationships become more feasible. Within this space, the amino acid sequence design of protein-protein interactions is still a rather challenging subproblem with very low success rates, yet it is central to most biological processes.
RESULTS
We developed an attention-based deep learning model inspired by algorithms used for image-caption assignments to design peptides or protein fragment sequences. Our trained model can be applied for the redesign of natural protein interfaces or the designed protein interaction fragments. Here, we validate the potential by recapitulating naturally occurring protein-protein interactions including antibody-antigen complexes. The designed interfaces accurately capture essential native interactions and have comparable native-like binding affinities in silico. Furthermore, our model does not need a precise backbone location, making it an attractive tool for working with de novo design of protein-protein interactions.
AVAILABILITY AND IMPLEMENTATION
The source code of the method is available at https://github.com/strauchlab/iNNterfaceDesign.
SUPPLEMENTARY INFORMATION
Supplementary data are available at Bioinformatics online.
Topics: Amino Acid Sequence; Deep Learning; Proteins; Algorithms; Peptides; Software
PubMed: 36377772
DOI: 10.1093/bioinformatics/btac733
-
International Journal of Molecular... Feb 2023
Derived from natural language processing (NLP) algorithms, protein language models enable the encoding of protein sequences, which are widely diverse in length and amino acid composition, in fixed-size numerical vectors (embeddings). We surveyed representative embedding models such as Esm, Esm1b, ProtT5, and SeqVec, along with their derivatives (GoPredSim and PLAST), to conduct the following tasks in computational biology: embedding the Saccharomyces cerevisiae proteome, gene ontology (GO) annotation of the uncharacterized proteins of this organism, relating variants of human proteins to disease status, correlating mutants of beta-lactamase TEM-1 from Escherichia coli with experimentally measured antimicrobial resistance, and analyzing diverse fungal mating factors. We discuss the advances and shortcomings, differences, and concordance of the models. Of note, all of the models revealed that the uncharacterized proteins in yeast tend to be less than 200 amino acids long, contain fewer aspartates and glutamates, and are enriched for cysteine. Less than half of these proteins can be annotated with GO terms with high confidence. The distribution of the cosine similarity scores of benign and pathogenic mutations to the reference human proteins shows a statistically significant difference. The differences in embeddings of the reference TEM-1 and mutants have low to no correlation with minimal inhibitory concentrations (MIC).
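The cosine similarity score mentioned above compares a variant embedding with the reference embedding by the angle between the two vectors. A minimal sketch follows, assuming the embeddings are already available as equal-length lists of floats; the three-dimensional vectors are hypothetical placeholders, not output from any of the surveyed models.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical embeddings (real models produce hundreds to thousands of dims).
reference = [0.2, -1.0, 0.5]
variant = [0.1, -0.9, 0.7]
print(cosine_similarity(reference, variant))
```

A score near 1 means the variant embedding points in almost the same direction as the reference. The abstract reports only that the score distributions for benign and pathogenic mutations differ, so no threshold is implied here.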
Topics: Humans; Algorithms; Amino Acid Sequence; Amino Acids; Computational Biology; Proteins; Saccharomyces cerevisiae; Proteomics
PubMed: 36835188
DOI: 10.3390/ijms24043775
-
Current Opinion in Chemical Biology Dec 2021
Review
Protein engineering seeks to identify protein sequences with optimized properties. When guided by machine learning, protein sequence generation methods can draw on prior knowledge and experimental efforts to improve this process. In this review, we highlight recent applications of machine learning to generate protein sequences, focusing on the emerging field of deep generative methods.
Topics: Amino Acid Sequence; Machine Learning; Protein Engineering
PubMed: 34051682
DOI: 10.1016/j.cbpa.2021.04.004