- Current Opinion in Chemical Biology Dec 2021 (Review)
Protein engineering seeks to identify protein sequences with optimized properties. When guided by machine learning, protein sequence generation methods can draw on prior knowledge and experimental efforts to improve this process. In this review, we highlight recent applications of machine learning to generate protein sequences, focusing on the emerging field of deep generative methods.
Topics: Amino Acid Sequence; Machine Learning; Protein Engineering
PubMed: 34051682
DOI: 10.1016/j.cbpa.2021.04.004
- Nucleic Acids Research Jul 2019
Here, we describe a web server that integrates structural alignments with the MAFFT multiple sequence alignment (MSA) tool. For this purpose, we have prepared a web-based Database of Aligned Structural Homologs (DASH), which provides structural alignments at the domain and chain levels for all proteins in the Protein Data Bank (PDB), and can be queried interactively or by a simple REST-like API. MAFFT-DASH integration can be invoked with a single flag on either the web (https://mafft.cbrc.jp/alignment/server/) or command-line versions of MAFFT. In our benchmarks using 878 cases from the BAliBase, HomFam, OXFam, Mattbench and SISYPHUS datasets, MAFFT-DASH showed 10-20% improvement over standard MAFFT for MSA problems with weak similarity, in terms of Sum-of-Pairs (SP), a measure of how well a program succeeds at aligning input sequences in comparison to a reference alignment. When MAFFT alignments were supplemented with homologous sequences, further improvement was observed. Potential applications of DASH beyond MSA enrichment include functional annotation through detection of remote homology and assembly of template libraries for homology modeling.
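The Sum-of-Pairs measure mentioned in this abstract can be illustrated with a short sketch. It assumes the common definition (the fraction of residue pairs aligned in the reference that are also aligned in the test alignment) and is not the benchmark code used by the authors; the function names and toy sequences are hypothetical.

```python
from itertools import combinations


def aligned_pairs(alignment):
    """Return the set of residue pairs that share a column.

    `alignment` is a list of equal-length gapped strings ('-' marks a gap).
    Each residue is identified as (sequence index, ungapped residue index).
    """
    n_cols = len(alignment[0])
    if any(len(row) != n_cols for row in alignment):
        raise ValueError("all rows must have the same length")
    counters = [0] * len(alignment)  # next ungapped index per sequence
    pairs = set()
    for col in range(n_cols):
        present = []
        for seq_idx, row in enumerate(alignment):
            if row[col] != "-":
                present.append((seq_idx, counters[seq_idx]))
                counters[seq_idx] += 1
        pairs.update(combinations(present, 2))
    return pairs


def sum_of_pairs_score(test, reference):
    """Fraction of reference-aligned residue pairs recovered by `test`."""
    ref_pairs = aligned_pairs(reference)
    if not ref_pairs:
        return 0.0
    return len(aligned_pairs(test) & ref_pairs) / len(ref_pairs)


# Toy example: the same two sequences aligned in two different ways.
reference = ["AC-GT", "ACTGT"]
test = ["ACG-T", "ACTGT"]
score = sum_of_pairs_score(test, reference)
```

Both alignments must list the same sequences in the same order; the score ranges from 0 (no reference pair recovered) to 1 (all recovered).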
Topics: Algorithms; Amino Acid Sequence; Databases, Protein; Humans; Proteins; Sequence Alignment; Sequence Analysis, Protein; Sequence Analysis, RNA; Sequence Homology; Software
PubMed: 31062021
DOI: 10.1093/nar/gkz342
- BMC Bioinformatics Feb 2024
PURPOSE
Despite considerable progress in alignment algorithms, aligning divergent protein sequences with less than 20-35% pairwise identity (the so-called "twilight zone") remains a difficult problem. Many alignment algorithms have relied on substitution matrices since their introduction in the 1970s; however, these matrices score alignments poorly within the twilight zone. We developed Protein Embedding based Alignments, or PEbA, to better align sequences with low pairwise identity. Like the traditional Smith-Waterman algorithm, PEbA uses dynamic programming, but the match score for a pair of amino acids is based on the similarity of their embeddings from a protein language model.
METHODS
We tested PEbA on over twelve thousand benchmark pairwise alignments from BAliBASE, each extracted from one of its multiple sequence alignments. Five BAliBASE references were used, each with different sequence identities, motifs, and lengths, so that PEbA could be assessed under a range of alignment conditions.
RESULTS
PEbA greatly outperformed BLOSUM substitution matrix-based pairwise alignments, with the size of the improvement in alignment quality depending on the similarity of the sequence pair (more than fourfold for pairs of sequences with <10% identity). We also compared PEbA with embeddings generated by different protein language models (ProtT5 and ESM-2) and found that ProtT5-XL-U50 produced the most useful embeddings for aligning protein sequences. PEbA also outperformed DEDAL and vcMSA, two recently developed protein language model embedding-based alignment methods.
CONCLUSION
Our results suggest that general-purpose protein language models provide useful contextual information for generating more accurate protein alignments than commonly used methods.
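The idea described in this abstract, Smith-Waterman dynamic programming with an embedding-based match score, can be sketched as follows. This is a minimal illustration and not the PEbA implementation: the cosine-similarity scoring, the `scale` and `offset` values, the linear gap penalty, and the toy one-hot "embeddings" are all assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def embedding_local_align(emb_a, emb_b, gap=1.0, scale=2.0, offset=0.5):
    """Smith-Waterman local alignment over per-residue embeddings.

    Match score (assumed form): scale * (cosine - offset), so that
    dissimilar residues score negatively. Gaps use a linear penalty.
    Returns (best score, list of aligned (index_a, index_b) pairs).
    """
    n, m = len(emb_a), len(emb_b)
    H = [[0.0] * (m + 1) for _ in range(n + 1)]
    # Traceback pointers: 0 = stop, 1 = diagonal, 2 = up, 3 = left.
    ptr = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0.0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = scale * (cosine(emb_a[i - 1], emb_b[j - 1]) - offset)
            candidates = [
                (0.0, 0),
                (H[i - 1][j - 1] + match, 1),
                (H[i - 1][j] - gap, 2),
                (H[i][j - 1] - gap, 3),
            ]
            H[i][j], ptr[i][j] = max(candidates, key=lambda c: c[0])
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Walk back from the best-scoring cell until a stop pointer is reached.
    pairs = []
    i, j = best_pos
    while ptr[i][j] != 0:
        if ptr[i][j] == 1:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif ptr[i][j] == 2:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return best, pairs


# Toy stand-ins for language-model embeddings: one-hot vectors.
x, a, b, c, y = ([1.0 if k == d else 0.0 for k in range(5)] for d in range(5))
best_score, aligned = embedding_local_align([x, a, b, c], [a, b, c, y])
```

In practice the per-residue vectors would come from a protein language model such as ProtT5 or ESM-2, and the scoring parameters would need tuning; the values here only make the toy example behave like a match/mismatch scheme.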
Topics: Proteins; Boronic Acids; Amino Acid Sequence; Sequence Alignment; Algorithms
PubMed: 38413857
DOI: 10.1186/s12859-024-05699-5
- Nucleic Acids Research Jul 1993
PIR-International is an association of macromolecular sequence data collection centers dedicated to fostering international cooperation as an essential element in the development of scientific databases. PIR-International is best known for the Protein Sequence Database. This database originated in the early 1960s with the pioneering work of the late Margaret Dayhoff as a research tool for the study of protein evolution and intersequence relationships; it is maintained as a scientific resource, organized by biological concepts, using sequence homology as a guiding principle. PIR-International also maintains a number of other genomic, protein sequence, and sequence-related databases. The databases of PIR-International are made widely available. This paper briefly describes the architecture of the Protein Sequence Database, a number of other PIR-International databases, and the mechanisms for accessing and distributing these databases.
Topics: Amino Acid Sequence; CD-ROM; Databases, Factual; Escherichia coli; Information Storage and Retrieval; Protein Conformation; Proteins; Sequence Homology, Amino Acid
PubMed: 8332528
DOI: 10.1093/nar/21.13.3089
- Nature Communications Feb 2022
The task of protein sequence design is central to nearly all rational protein engineering problems, and enormous effort has gone into the development of energy functions to guide design. Here, we investigate the capability of a deep neural network model to automate design of sequences onto protein backbones, having learned directly from crystal structure data and without any human-specified priors. The model generalizes to native topologies not seen during training, producing experimentally stable designs. We evaluate the generalizability of our method to a de novo TIM-barrel scaffold. The model produces novel sequences, and high-resolution crystal structures of two designs show excellent agreement with in silico models. Our findings demonstrate the tractability of an entirely learned method for protein sequence design.
Topics: Amino Acid Sequence; Computer Simulation; Crystallography, X-Ray; Deep Learning; Models, Molecular; Protein Domains; Protein Engineering; Protein Folding
PubMed: 35136054
DOI: 10.1038/s41467-022-28313-9
- Bioinformatics (Oxford, England) Mar 2024
MOTIVATION
Protein sequence database search and multiple sequence alignment generation are fundamental tasks in many bioinformatics analyses. As the volume of sequence data continues to grow rapidly, there is an increasing need for efficient and scalable multiple sequence query algorithms that can search very large databases without prohibitive time and computational costs.
RESULTS
We introduce Chorus, a novel protein sequence query system that leverages a parallel model and a heterogeneous computing architecture to enable users to query thousands of protein sequences concurrently against large protein databases on a desktop workstation. Chorus achieves over 100× speedup over BLASTP without sacrificing sensitivity. We demonstrate the utility of Chorus through a case study in which a ∼1.5-TB metagenomic dataset was analyzed for novel CRISPR-Cas protein discovery within 30 min.
AVAILABILITY AND IMPLEMENTATION
Chorus is open-source and its code repository is available at https://github.com/Bio-Acc/Chorus.
Topics: Software; Algorithms; Amino Acid Sequence; Proteins; Databases, Protein
PubMed: 38547405
DOI: 10.1093/bioinformatics/btae151
- Current Opinion in Biotechnology Jun 2022 (Review)
Machine learning (ML) is revolutionizing our ability to understand and predict the complex relationships between protein sequence, structure, and function. Predictive sequence-function models are enabling protein engineers to efficiently search the sequence space for useful proteins with broad applications in biotechnology. In this review, we highlight the recent advances in applying ML to protein engineering. We discuss supervised learning methods that infer the sequence-function mapping from experimental data and new sequence representation strategies for data-efficient modeling. We then describe the various ways in which ML can be incorporated into protein engineering workflows, including purely in silico searches, ML-assisted directed evolution, and generative models that can learn the underlying distribution of the protein function in a sequence space. ML-driven protein engineering will become increasingly powerful with continued advances in high-throughput data generation, data science, and deep learning.
Topics: Amino Acid Sequence; Biotechnology; Machine Learning; Protein Engineering; Proteins
PubMed: 35413604
DOI: 10.1016/j.copbio.2022.102713
- Methods in Molecular Biology (Clifton,... 2023
The ability to predict the three-dimensional structure of a protein from its amino acid sequence has improved considerably in recent years. This progress is propelled by the improved accuracy of deep learning-based inter-residue contact map predictors coupled with the rapid growth of protein sequence databases. A contact map encodes interatomic interaction information that can be exploited for highly accurate prediction of protein structures via contact map threading, even for query proteins that are not amenable to direct homology modeling. As such, contact-assisted threading has garnered considerable research effort. In this chapter, we provide an overview of existing contact-assisted threading methods while highlighting recent advances and discussing some of the current limitations and future prospects in the application of contact-assisted threading for improving the accuracy of low-homology protein modeling.
Topics: Algorithms; Sequence Analysis, Protein; Proteins; Software; Amino Acid Sequence; Databases, Protein; Protein Conformation; Protein Folding
PubMed: 36959441
DOI: 10.1007/978-1-0716-2974-1_3
- Journal of Proteome Research Feb 2023
Protein structure defines protein function and plays an extremely important role in protein characterization. Recently, two groups of researchers from DeepMind and the Baker lab independently published protein structure prediction tools that can provide predicted protein structures for the whole human proteome. This enabled us to visualize the entire human proteome using predicted 3D structures for the first time. To help other researchers make the best use of these protein structure predictions in proteomics experiments, we present the Sequence Coverage Visualizer (SCV), http://scv.lab.gy, a web application for 3D visualization of protein sequence coverage. Here we show a few possible uses of the SCV, including the labeling of post-translational modifications and isotope labeling experiments. These results highlight the usefulness of such 3D visualization for proteomics experiments and how SCV can turn the output of a regular proteomics experiment (a list of identified peptides) into structural insights. Furthermore, we demonstrate that, when used together with limited proteolysis, SCV can help to compare protein structures from different sources, including predicted ones and existing PDB entries. We hope our tool can help in the process of improving protein structure prediction accuracy. Overall, SCV is a convenient and powerful tool for visualizing proteomics results in 3D.
Topics: Humans; Proteome; Imaging, Three-Dimensional; Amino Acid Sequence; Peptides; Proteomics; Software
PubMed: 36511722
DOI: 10.1021/acs.jproteome.2c00358
- PloS One 2023
With the growing interest in using phages to combat antimicrobial resistance, computational methods for predicting phage-host interactions have been explored to help shortlist candidate phages. Most existing models consider entire proteomes and rely on manual feature engineering, which makes it difficult to select the most informative sequence properties to serve as input to the model. In this paper, we frame phage-host interaction prediction as a multiclass classification problem that takes as input the embeddings of a phage's receptor-binding proteins, which are known to be the key machinery for host recognition, and predicts the host genus. We explore different protein language models to automatically encode these protein sequences into dense embeddings without the need for additional alignment or structural information. We show that using embeddings of receptor-binding proteins improves on handcrafted genomic and protein sequence features. The highest performance was obtained using the transformer-based protein language model ProtT5, resulting in a 3% to 4% increase in weighted F1 and recall scores across different prediction confidence thresholds, compared to using selected handcrafted sequence features.
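For readers unfamiliar with the weighted F1 score reported in this abstract, the sketch below shows the standard support-weighted definition for a multiclass problem. It is an illustration only, not the authors' evaluation code; the function name and the genus labels in the toy example are hypothetical.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores.

    Per-class F1 is 2*TP / (2*TP + FP + FN); each class is weighted by
    the number of true instances of that class.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        return 0.0
    support = Counter(y_true)
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    total = 0.0
    for label, n in support.items():
        denom = 2 * tp[label] + fp[label] + fn[label]
        total += n * (2 * tp[label] / denom if denom else 0.0)
    return total / len(y_true)


# Toy example with made-up host-genus predictions.
y_true = ["Escherichia", "Escherichia", "Escherichia", "Klebsiella"]
y_pred = ["Escherichia", "Escherichia", "Klebsiella", "Klebsiella"]
f1 = weighted_f1(y_true, y_pred)
```

Here the per-class F1 scores are 0.8 and 2/3, and weighting by class support (3 and 1) gives roughly 0.767.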
Topics: Amino Acid Sequence; Proteome; Bacteriophages; Differential Threshold; Mental Recall
PubMed: 37486915
DOI: 10.1371/journal.pone.0289030