protein sequence - OpenMD.com Journal Search

Robust deep learning-based protein sequence design using ProteinMPNN.

Science (New York, N.Y.) Oct 2022

Authors: J Dauparas, I Anishchenko, N Bennett...

Although deep learning has revolutionized protein structure prediction, almost all experimentally characterized de novo protein designs have been generated using physically based approaches such as Rosetta. Here, we describe a deep learning-based protein sequence design method, ProteinMPNN, that has outstanding performance in both in silico and experimental tests. On native protein backbones, ProteinMPNN has a sequence recovery of 52.4% compared with 32.9% for Rosetta. The amino acid sequence at different positions can be coupled between single or multiple chains, enabling application to a wide range of current protein design challenges. We demonstrate the broad utility and high accuracy of ProteinMPNN using x-ray crystallography, cryo-electron microscopy, and functional studies by rescuing previously failed designs, which were made using Rosetta or AlphaFold, of protein monomers, cyclic homo-oligomers, tetrahedral nanoparticles, and target-binding proteins.
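The sequence recovery figures quoted in this abstract are usually understood as the fraction of backbone positions at which the designed residue matches the native residue. The sketch below illustrates that definition only; the function name and the toy sequences are hypothetical, and it is not taken from the ProteinMPNN code.

```python
def sequence_recovery(native: str, designed: str) -> float:
    """Fraction of positions where the designed residue equals the native one.

    Assumes the two sequences are already aligned position by position
    (same backbone, same length), which is an assumption of this sketch.
    """
    if len(native) != len(designed):
        raise ValueError("sequences must have equal length")
    if not native:
        raise ValueError("sequences must be non-empty")
    matches = sum(n == d for n, d in zip(native, designed))
    return matches / len(native)


# Toy sequences, invented for illustration only.
native = "MKTAYIAKQR"
designed = "MKSAYLAKQK"
print(f"recovery = {sequence_recovery(native, designed):.1%}")
```

Averaging this value over a set of native backbones would give a benchmark number comparable in kind to the percentages reported above.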

Topics: Amino Acid Sequence; Cryoelectron Microscopy; Crystallography, X-Ray; Deep Learning; Protein Conformation; Protein Engineering; Proteins

PubMed: 36108050
DOI: 10.1126/science.add2187

ProteinBERT: a universal deep-learning model of protein sequence and function.

Bioinformatics (Oxford, England) Apr 2022

Authors: Nadav Brandes, Dan Ofer, Yam Peleg...

SUMMARY

Self-supervised deep language modeling has shown unprecedented success across natural language tasks, and has recently been repurposed to biological sequences. However, existing models and pretraining methods are designed and optimized for text analysis. We introduce ProteinBERT, a deep language model specifically designed for proteins. Our pretraining scheme combines language modeling with a novel task of Gene Ontology (GO) annotation prediction. We introduce novel architectural elements that make the model highly efficient and flexible to long sequences. The architecture of ProteinBERT consists of both local and global representations, allowing end-to-end processing of these types of inputs and outputs. ProteinBERT obtains near state-of-the-art performance, and sometimes exceeds it, on multiple benchmarks covering diverse protein properties (including protein structure, post-translational modifications and biophysical attributes), despite using a far smaller and faster model than competing deep-learning methods. Overall, ProteinBERT provides an efficient framework for rapidly training protein predictors, even with limited labeled data.

AVAILABILITY AND IMPLEMENTATION

Code and pretrained model weights are available at https://github.com/nadavbra/protein_bert.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.

Topics: Deep Learning; Amino Acid Sequence; Proteins; Language; Natural Language Processing

PubMed: 35020807
DOI: 10.1093/bioinformatics/btac020

Comparative Protein Structure Modeling Using MODELLER.

Current Protocols in Bioinformatics Jun 2016

Authors: Benjamin Webb, Andrej Sali

Comparative protein structure modeling predicts the three-dimensional structure of a given protein sequence (target) based primarily on its alignment to one or more proteins of known structure (templates). The prediction process consists of fold assignment, target-template alignment, model building, and model evaluation. This unit describes how to calculate comparative models using the program MODELLER and how to use the ModBase database of such models, and discusses all four steps of comparative modeling, frequently observed errors, and some applications. Modeling lactate dehydrogenase from Trichomonas vaginalis (TvLDH) is described as an example. The download and installation of the MODELLER software is also described. © 2016 by John Wiley & Sons, Inc.

Topics: Amino Acid Sequence; Chemistry Techniques, Analytical; L-Lactate Dehydrogenase; Models, Molecular; Protein Conformation; Proteins; Sequence Alignment; Software; Trichomonas vaginalis

PubMed: 27322406
DOI: 10.1002/cpbi.3

Adaptive machine learning for protein engineering.

Current Opinion in Structural Biology Feb 2022

Review

Authors: Brian L Hie, Kevin K Yang

Machine-learning models that learn from data to predict how protein sequence encodes function are emerging as a useful protein engineering tool. However, when using these models to suggest new protein designs, one must deal with the vast combinatorial complexity of protein sequences. Here, we review how to use a sequence-to-function machine-learning surrogate model to select sequences for experimental measurement. First, we discuss how to select sequences through a single round of machine-learning optimization. Then, we discuss sequential optimization, where the goal is to discover optimized sequences and improve the model across multiple rounds of training, optimization, and experimental measurement.
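The sequential optimization described in this abstract (train a surrogate, pick sequences, measure them, retrain) can be sketched as a short loop. Everything below is a toy assumption made for illustration: the four-letter alphabet, the hidden target used as a stand-in for an experiment, the additive site-wise surrogate, and the greedy top-k selection are not methods taken from the review.

```python
import random
from collections import defaultdict

ALPHABET = "ACDE"   # toy alphabet (assumption, kept small for brevity)
TARGET = "ADCEA"    # hidden optimum of the toy "experiment" (assumption)


def measure(seq: str) -> float:
    """Stand-in for an experimental assay: counts matches to the hidden target."""
    return float(sum(a == b for a, b in zip(seq, TARGET)))


def fit_surrogate(data: dict):
    """Additive surrogate: mean measured fitness per (position, residue)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for seq, y in data.items():
        for i, aa in enumerate(seq):
            sums[(i, aa)] += y
            counts[(i, aa)] += 1
    overall = sum(data.values()) / len(data)

    def predict(seq: str) -> float:
        # Unseen (position, residue) pairs fall back to the overall mean.
        total = 0.0
        for i, aa in enumerate(seq):
            n = counts.get((i, aa), 0)
            total += sums[(i, aa)] / n if n else overall
        return total

    return predict


def single_mutants(seq: str) -> set:
    """All sequences one substitution away from seq."""
    return {seq[:i] + aa + seq[i + 1:]
            for i in range(len(seq)) for aa in ALPHABET if aa != seq[i]}


def run_rounds(n_rounds: int = 4, batch: int = 5, seed: int = 0):
    rng = random.Random(seed)
    start = ["".join(rng.choice(ALPHABET) for _ in TARGET) for _ in range(batch)]
    data = {s: measure(s) for s in start}          # round 0: random library
    best_per_round = [max(data.values())]
    for _ in range(n_rounds):
        predict = fit_surrogate(data)              # 1. train on all measurements
        best_seq = max(data, key=data.get)
        candidates = sorted(s for s in single_mutants(best_seq) if s not in data)
        candidates.sort(key=predict, reverse=True)  # 2. rank by the surrogate
        for seq in candidates[:batch]:             # 3. "measure" the top batch
            data[seq] = measure(seq)
        best_per_round.append(max(data.values()))
    return data, best_per_round


data, best_per_round = run_rounds()
print(best_per_round)
```

A greedy top-k choice is the simplest selection rule; in practice one would likely trade off predicted fitness against model uncertainty, which this sketch does not attempt.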

Topics: Amino Acid Sequence; Machine Learning; Protein Engineering; Proteins

PubMed: 34896756
DOI: 10.1016/j.sbi.2021.11.002

Modeling repetitive, non-globular proteins.

Protein Science : a Publication of the... May 2016

Review

Authors: Koli Basu, Robert L Campbell, Shuaiqi Guo...

While ab initio modeling of protein structures is not routine, certain types of proteins are more straightforward to model than others. Proteins with short repetitive sequences typically exhibit repetitive structures. These repetitive sequences can be more amenable to modeling if some information is known about the predominant secondary structure or other key features of the protein sequence. We have successfully built models of a number of repetitive structures with novel folds using knowledge of the consensus sequence within the sequence repeat and an understanding of the likely secondary structures that these may adopt. Our methods for achieving this success are reviewed here.

Topics: Models, Molecular; Molecular Dynamics Simulation; Protein Folding; Protein Structure, Secondary; Proteins; Repetitive Sequences, Amino Acid

PubMed: 26914323
DOI: 10.1002/pro.2907

Predicting exon criticality from protein sequence.

Nucleic Acids Research Apr 2022

Authors: Jigar Desai, Christopher Francis, Kenneth Longo...

Alternative splicing is frequently involved in the diversification of protein function and can also be modulated for therapeutic purposes. Here we develop a predictive model, called Exon ByPASS (predicting Exon skipping Based on Protein amino acid SequenceS), to assess the criticality of exon inclusion based solely on information contained in the amino acid sequence upstream and downstream of the exon junctions. By focusing on protein sequence, Exon ByPASS predicts exon skipping independent of tissue and species in the absence of any intronic information. We validate model predictions using transcriptomic and proteomic data and show that the model can capture exon skipping in different tissues and species. Additionally, we reveal potential therapeutic opportunities by predicting synthetically skippable exons and neo-junctions arising in cancer cells.

Topics: Alternative Splicing; Amino Acid Sequence; Exons; Introns; Proteomics

PubMed: 35286381
DOI: 10.1093/nar/gkac155

KINtaro: protein kinase-like database.

BMC Research Notes Feb 2024

Authors: Bartosz Baranowski, Marianna Krysińska, Marcin Gradowski...

OBJECTIVE

The superfamily of protein kinases features a common Protein Kinase-like (PKL) three-dimensional fold. Proteins with PKL structure can also possess enzymatic activities other than protein phosphorylation, such as AMPylation or glutamylation. PKL proteins play a vital role in the world of living organisms, contributing to the survival of pathogenic bacteria inside host cells, as well as being involved in carcinogenesis and neurological diseases in humans. The superfamily of PKL proteins is constantly growing. Therefore, it is crucial to gather new information about PKL families.

RESULTS

To this end, the KINtaro database (http://bioinfo.sggw.edu.pl/kintaro/) has been created as a resource for collecting and sharing such information. KINtaro combines protein sequence information and additional annotations for more than 70 PKL families, including 32 families not associated with the PKL superfamily in established protein domain databases. KINtaro is searchable by keywords and by protein sequence and provides family descriptions, sequences, sequence alignments, HMM models, 3D structure models, experimental structures with PKL domain annotations, and sequence logos with catalytic residue annotations.

Topics: Humans; Protein Kinases; Proteins; Phosphorylation; Amino Acid Sequence; Sequence Alignment; Databases, Protein

PubMed: 38365785
DOI: 10.1186/s13104-024-06713-y

Deep learning of protein sequence design of protein-protein interactions.

Bioinformatics (Oxford, England) Jan 2023

Authors: Raulia Syrlybaeva, Eva-Maria Strauch

MOTIVATION

As more data of experimentally determined protein structures are becoming available, data-driven models to describe protein sequence-structure relationships become more feasible. Within this space, the amino acid sequence design of protein-protein interactions is still a rather challenging subproblem with very low success rates, yet it is central to most biological processes.

RESULTS

We developed an attention-based deep learning model inspired by algorithms used for image-caption assignments to design peptides or protein fragment sequences. Our trained model can be applied for the redesign of natural protein interfaces or the designed protein interaction fragments. Here, we validate the potential by recapitulating naturally occurring protein-protein interactions including antibody-antigen complexes. The designed interfaces accurately capture essential native interactions and have comparable native-like binding affinities in silico. Furthermore, our model does not need a precise backbone location, making it an attractive tool for working with de novo design of protein-protein interactions.

AVAILABILITY AND IMPLEMENTATION

The source code of the method is available at https://github.com/strauchlab/iNNterfaceDesign.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.

Topics: Amino Acid Sequence; Deep Learning; Proteins; Algorithms; Peptides; Software

PubMed: 36377772
DOI: 10.1093/bioinformatics/btac733

Survey of Protein Sequence Embedding Models.

International Journal of Molecular... Feb 2023

Authors: Chau Tran, Siddharth Khadkikar, Aleksey Porollo...

Derived from the natural language processing (NLP) algorithms, protein language models enable the encoding of protein sequences, which are widely diverse in length and amino acid composition, in fixed-size numerical vectors (embeddings). We surveyed representative embedding models such as Esm, Esm1b, ProtT5, and SeqVec, along with their derivatives (GoPredSim and PLAST), to conduct the following tasks in computational biology: embedding the Saccharomyces cerevisiae proteome, gene ontology (GO) annotation of the uncharacterized proteins of this organism, relating variants of human proteins to disease status, correlating mutants of beta-lactamase TEM-1 from Escherichia coli with experimentally measured antimicrobial resistance, and analyzing diverse fungal mating factors. We discuss the advances and shortcomings, differences, and concordance of the models. Of note, all of the models revealed that the uncharacterized proteins in yeast tend to be less than 200 amino acids long, contain fewer aspartates and glutamates, and are enriched for cysteine. Less than half of these proteins can be annotated with GO terms with high confidence. The distribution of the cosine similarity scores of benign and pathogenic mutations to the reference human proteins shows a statistically significant difference. The differences in embeddings of the reference TEM-1 and mutants have low to no correlation with minimal inhibitory concentrations (MIC).
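The cosine similarity scores mentioned in this abstract compare a variant's embedding with the embedding of the reference protein. The sketch below shows the standard formula only; the vectors are invented four-dimensional stand-ins, whereas real embeddings from the surveyed models have far more dimensions and would come from the models themselves.

```python
import math


def cosine_similarity(u, v) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Made-up embeddings for illustration; not output of any real model.
reference = [0.2, -0.1, 0.7, 0.4]
variant = [0.1, -0.2, 0.6, 0.5]
print(round(cosine_similarity(reference, variant), 3))
```

A score near 1 means the variant's embedding points in nearly the same direction as the reference; comparing the score distributions of benign and pathogenic variants is the kind of analysis the abstract reports.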

Topics: Humans; Algorithms; Amino Acid Sequence; Amino Acids; Computational Biology; Proteins; Saccharomyces cerevisiae; Proteomics

PubMed: 36835188
DOI: 10.3390/ijms24043775

Protein sequence design with deep generative models.

Current Opinion in Chemical Biology Dec 2021

Review

Authors: Zachary Wu, Kadina E Johnston, Frances H Arnold...

Protein engineering seeks to identify protein sequences with optimized properties. When guided by machine learning, protein sequence generation methods can draw on prior knowledge and experimental efforts to improve this process. In this review, we highlight recent applications of machine learning to generate protein sequences, focusing on the emerging field of deep generative methods.

Topics: Amino Acid Sequence; Machine Learning; Protein Engineering

PubMed: 34051682
DOI: 10.1016/j.cbpa.2021.04.004