- Science (New York, N.Y.) Oct 2022
Although deep learning has revolutionized protein structure prediction, almost all experimentally characterized de novo protein designs have been generated using physically based approaches such as Rosetta. Here, we describe a deep learning-based protein sequence design method, ProteinMPNN, that has outstanding performance in both in silico and experimental tests. On native protein backbones, ProteinMPNN has a sequence recovery of 52.4% compared with 32.9% for Rosetta. The amino acid sequence at different positions can be coupled between single or multiple chains, enabling application to a wide range of current protein design challenges. We demonstrate the broad utility and high accuracy of ProteinMPNN using x-ray crystallography, cryo-electron microscopy, and functional studies by rescuing previously failed designs, which were made using Rosetta or AlphaFold, of protein monomers, cyclic homo-oligomers, tetrahedral nanoparticles, and target-binding proteins.
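The sequence recovery figures quoted above (52.4% versus 32.9%) can be read as the fraction of positions at which a designed sequence matches the native one. The abstract does not state how the metric is computed, so the following is a minimal sketch assuming a simple per-position identity over two equal-length sequences. The function name and the example sequences are illustrative and do not come from the paper.

```python
def sequence_recovery(native: str, designed: str) -> float:
    """Fraction of positions where the designed residue equals the native one.

    Assumption: both sequences describe the same backbone, so they are
    compared position by position without any alignment step.
    """
    if len(native) != len(designed):
        raise ValueError("sequences must have the same length")
    if not native:
        raise ValueError("sequences must be non-empty")
    matches = sum(n == d for n, d in zip(native, designed))
    return matches / len(native)


# Made-up 10-residue example: the two sequences differ at positions 3 and 6.
native = "MKTAYIAKQR"
designed = "MKSAYLAKQR"
print(f"{sequence_recovery(native, designed):.1%}")  # 8 of 10 positions match
```

In practice the value would be averaged over many backbones; the sketch only shows the per-sequence calculation.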
Topics: Amino Acid Sequence; Cryoelectron Microscopy; Crystallography, X-Ray; Deep Learning; Protein Conformation; Protein Engineering; Proteins
PubMed: 36108050
DOI: 10.1126/science.add2187
- Bioinformatics (Oxford, England) Apr 2022
SUMMARY
Self-supervised deep language modeling has shown unprecedented success across natural language tasks, and has recently been repurposed to biological sequences. However, existing models and pretraining methods are designed and optimized for text analysis. We introduce ProteinBERT, a deep language model specifically designed for proteins. Our pretraining scheme combines language modeling with a novel task of Gene Ontology (GO) annotation prediction. We introduce novel architectural elements that make the model highly efficient and flexible to long sequences. The architecture of ProteinBERT consists of both local and global representations, allowing end-to-end processing of these types of inputs and outputs. ProteinBERT obtains near state-of-the-art performance, and sometimes exceeds it, on multiple benchmarks covering diverse protein properties (including protein structure, post-translational modifications and biophysical attributes), despite using a far smaller and faster model than competing deep-learning methods. Overall, ProteinBERT provides an efficient framework for rapidly training protein predictors, even with limited labeled data.
AVAILABILITY AND IMPLEMENTATION
Code and pretrained model weights are available at https://github.com/nadavbra/protein_bert.
SUPPLEMENTARY INFORMATION
Supplementary data are available at Bioinformatics online.
Topics: Deep Learning; Amino Acid Sequence; Proteins; Language; Natural Language Processing
PubMed: 35020807
DOI: 10.1093/bioinformatics/btac020
- Nature Methods Dec 2019
Rational protein engineering requires a holistic understanding of protein function. Here, we apply deep learning to unlabeled amino-acid sequences to distill the fundamental features of a protein into a statistical representation that is semantically rich and structurally, evolutionarily and biophysically grounded. We show that the simplest models built on top of this unified representation (UniRep) are broadly applicable and generalize to unseen regions of sequence space. Our data-driven approach predicts the stability of natural and de novo designed proteins, and the quantitative function of molecularly diverse mutants, competitively with the state-of-the-art methods. UniRep further enables two orders of magnitude efficiency improvement in a protein engineering task. UniRep is a versatile summary of fundamental protein features that can be applied across protein engineering informatics.
Topics: Amino Acid Sequence; Deep Learning; Mutation; Protein Engineering; Protein Stability
PubMed: 31636460
DOI: 10.1038/s41592-019-0598-1
- Nature Biotechnology Jul 2022
Signal peptides (SPs) are short amino acid sequences that control protein secretion and translocation in all living organisms. SPs can be predicted from sequence data, but existing algorithms are unable to detect all known types of SPs. We introduce SignalP 6.0, a machine learning model that detects all five SP types and is applicable to metagenomic data.
Topics: Algorithms; Amino Acid Sequence; Language; Protein Sorting Signals; Proteins
PubMed: 34980915
DOI: 10.1038/s41587-021-01156-3
- Current Opinion in Structural Biology Feb 2022
Review
Machine-learning models that learn from data to predict how protein sequence encodes function are emerging as a useful protein engineering tool. However, when using these models to suggest new protein designs, one must deal with the vast combinatorial complexity of protein sequences. Here, we review how to use a sequence-to-function machine-learning surrogate model to select sequences for experimental measurement. First, we discuss how to select sequences through a single round of machine-learning optimization. Then, we discuss sequential optimization, where the goal is to discover optimized sequences and improve the model across multiple rounds of training, optimization, and experimental measurement.
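The sequential optimization described above (train a surrogate, select sequences, measure them, retrain) can be sketched as a short loop. Everything here is a toy assumption rather than the review's method: `measure` is a mock assay standing in for an experiment, the surrogate is a site-wise additive model, the alphabet has four letters, and candidates are chosen greedily by predicted score, which is only one of many possible selection rules.

```python
import random
from collections import defaultdict

ALPHABET = "ACDE"      # toy alphabet, not the 20 amino acids
TARGET = "ACDEACDE"    # hidden optimum, known only to the mock assay


def measure(seq: str) -> float:
    """Mock experiment: number of positions that match TARGET."""
    return float(sum(a == b for a, b in zip(seq, TARGET)))


def fit_surrogate(data: dict[str, float]):
    """Site-wise additive surrogate: mean measured score per (position, residue)."""
    totals, counts = defaultdict(float), defaultdict(int)
    for seq, score in data.items():
        for pos, res in enumerate(seq):
            totals[(pos, res)] += score
            counts[(pos, res)] += 1
    overall = sum(data.values()) / len(data)

    def predict(seq: str) -> float:
        total = 0.0
        for pos, res in enumerate(seq):
            n = counts.get((pos, res), 0)
            # Unseen (position, residue) pairs fall back to the overall mean.
            total += totals[(pos, res)] / n if n else overall
        return total

    return predict


def propose(parents: list[str], rng: random.Random, attempts: int) -> list[str]:
    """Single-point mutants of the parents, deduplicated and sorted."""
    out = set()
    for _ in range(attempts):
        parent = rng.choice(parents)
        pos = rng.randrange(len(parent))
        out.add(parent[:pos] + rng.choice(ALPHABET) + parent[pos + 1:])
    return sorted(out)


def optimize(rounds: int = 5, batch: int = 4, seed: int = 0):
    rng = random.Random(seed)
    data: dict[str, float] = {}
    for _ in range(batch):  # round 0: measure a small random library
        seq = "".join(rng.choice(ALPHABET) for _ in TARGET)
        data[seq] = measure(seq)
    history = [max(data.values())]
    for _ in range(rounds):
        predict = fit_surrogate(data)  # retrain on all data so far
        parents = sorted(data, key=data.get, reverse=True)[:3]
        candidates = [s for s in propose(parents, rng, 50) if s not in data]
        for seq in sorted(candidates, key=predict, reverse=True)[:batch]:
            data[seq] = measure(seq)  # "experimental" round
        history.append(max(data.values()))
    return data, history


data, history = optimize()
print("best measured score after each round:", history)
```

Setting `rounds=1` gives the single-round case. A purely greedy rule exploits the current model, so a real campaign would usually add some exploration or an uncertainty estimate.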
Topics: Amino Acid Sequence; Machine Learning; Protein Engineering; Proteins
PubMed: 34896756
DOI: 10.1016/j.sbi.2021.11.002
- Nature Oct 2023
We are now entering a new era in protein sequence and structure annotation, with hundreds of millions of predicted protein structures made available through the AlphaFold database. These models cover nearly all proteins that are known, including those challenging to annotate for function or putative biological role using standard homology-based approaches. In this study, we examine the extent to which the AlphaFold database has structurally illuminated this 'dark matter' of the natural protein universe at high predicted accuracy. We further describe the protein diversity that these models cover as an annotated interactive sequence similarity network, accessible at https://uniprot3d.org/atlas/AFDB90v4. By searching for novelties from sequence, structure and semantic perspectives, we uncovered the β-flower fold, added several protein families to the Pfam database and experimentally demonstrated that one of these belongs to a new superfamily of translation-targeting toxin-antitoxin systems, TumE-TumA. This work underscores the value of large-scale efforts in identifying, annotating and prioritizing new protein families. By leveraging the recent deep learning revolution in protein bioinformatics, we can now shed light on uncharted areas of the protein universe at an unprecedented scale, paving the way to innovations in life sciences and biotechnology.
Topics: Amino Acid Sequence; Databases, Protein; Deep Learning; Internet; Molecular Sequence Annotation; Protein Folding; Proteins; Structural Homology, Protein
PubMed: 37704037
DOI: 10.1038/s41586-023-06622-3
- Genomics, Proteomics & Bioinformatics Feb 2021
Review
Phase separation is an important mechanism that mediates the compartmentalization of proteins in cells. Proteins that can undergo phase separation in cells share certain typical sequence features, such as intrinsically disordered regions (IDRs) and multiple modular domains. Sequence-based analysis tools are commonly used to screen for these proteins. However, current phase separation predictors are mostly designed for IDR-containing proteins and thus inevitably overlook phase-separating proteins with relatively low IDR content. Features other than amino acid sequence could provide crucial information for identifying possible phase-separating proteins: protein-protein interaction (PPI) networks show the multivalent interactions that underlie the phase separation process; post-translational modifications (PTMs) are crucial in the regulation of phase separation behavior; and spherical structures revealed in immunofluorescence (IF) images indicate condensed droplets formed by phase-separating proteins, distinguishing these proteins from non-phase-separating proteins. Here, we summarize the sequence-based tools for predicting phase-separating proteins and highlight the importance of incorporating PPIs, PTMs, and IF images into phase separation prediction in future studies.
Topics: Amino Acid Sequence; Intrinsically Disordered Proteins; Protein Processing, Post-Translational
PubMed: 33610793
DOI: 10.1016/j.gpb.2020.11.003
- Nucleic Acids Research Apr 2022
Alternative splicing is frequently involved in the diversification of protein function and can also be modulated for therapeutic purposes. Here we develop a predictive model, called Exon ByPASS (predicting Exon skipping Based on Protein amino acid SequenceS), to assess the criticality of exon inclusion based solely on information contained in the amino acid sequence upstream and downstream of the exon junctions. By focusing on protein sequence, Exon ByPASS predicts exon skipping independent of tissue and species in the absence of any intronic information. We validate model predictions using transcriptomic and proteomic data and show that the model can capture exon skipping in different tissues and species. Additionally, we reveal potential therapeutic opportunities by predicting synthetically skippable exons and neo-junctions arising in cancer cells.
Topics: Alternative Splicing; Amino Acid Sequence; Exons; Introns; Proteomics
PubMed: 35286381
DOI: 10.1093/nar/gkac155
- BMC Research Notes Feb 2024
OBJECTIVE
The superfamily of protein kinases features a common Protein Kinase-like (PKL) three-dimensional fold. Proteins with PKL structure can also possess enzymatic activities other than protein phosphorylation, such as AMPylation or glutamylation. PKL proteins play a vital role in the world of living organisms, contributing to the survival of pathogenic bacteria inside host cells, as well as being involved in carcinogenesis and neurological diseases in humans. The superfamily of PKL proteins is constantly growing. Therefore, it is crucial to gather new information about PKL families.
RESULTS
To this end, the KINtaro database (http://bioinfo.sggw.edu.pl/kintaro/) has been created as a resource for collecting and sharing such information. KINtaro combines protein sequence information and additional annotations for more than 70 PKL families, including 32 families not associated with the PKL superfamily in established protein domain databases. KINtaro is searchable by keywords and by protein sequence and provides family descriptions, sequences, sequence alignments, HMM models, 3D structure models, experimental structures with PKL domain annotations and sequence logos with catalytic residue annotations.
Topics: Humans; Protein Kinases; Proteins; Phosphorylation; Amino Acid Sequence; Sequence Alignment; Databases, Protein
PubMed: 38365785
DOI: 10.1186/s13104-024-06713-y
- Bioinformatics (Oxford, England) Jan 2023
MOTIVATION
As more experimentally determined protein structures become available, data-driven models of protein sequence-structure relationships become more feasible. Within this space, the amino acid sequence design of protein-protein interactions remains a challenging subproblem with very low success rates, yet such interactions are central to most biological processes.
RESULTS
We developed an attention-based deep learning model inspired by algorithms used for image-caption assignments to design peptides or protein fragment sequences. Our trained model can be applied for the redesign of natural protein interfaces or the designed protein interaction fragments. Here, we validate the potential by recapitulating naturally occurring protein-protein interactions including antibody-antigen complexes. The designed interfaces accurately capture essential native interactions and have comparable native-like binding affinities in silico. Furthermore, our model does not need a precise backbone location, making it an attractive tool for working with de novo design of protein-protein interactions.
AVAILABILITY AND IMPLEMENTATION
The source code of the method is available at https://github.com/strauchlab/iNNterfaceDesign.
SUPPLEMENTARY INFORMATION
Supplementary data are available at Bioinformatics online.
Topics: Amino Acid Sequence; Deep Learning; Proteins; Algorithms; Peptides; Software
PubMed: 36377772
DOI: 10.1093/bioinformatics/btac733