-
Current Opinion in Chemical Biology, Dec 2021
Review
Protein engineering seeks to identify protein sequences with optimized properties. When guided by machine learning, protein sequence generation methods can draw on prior knowledge and experimental efforts to improve this process. In this review, we highlight recent applications of machine learning to generate protein sequences, focusing on the emerging field of deep generative methods.
Topics: Amino Acid Sequence; Machine Learning; Protein Engineering
PubMed: 34051682
DOI: 10.1016/j.cbpa.2021.04.004 -
Current Protein & Peptide Science, 2023
Review
Most of the currently available knowledge about protein structure and function has been obtained from laboratory experiments. As a complement to this classical knowledge discovery activity, bioinformatics-assisted sequence analysis, which relies primarily on biological data manipulation, is becoming an indispensable option for the modern discovery of new knowledge, especially when large amounts of protein-encoding sequences can be easily identified from the annotation of high-throughput genomic data. Here, we review the advances in bioinformatics-assisted protein sequence analysis to highlight how bioinformatics analysis will aid in understanding protein structure and function. We first discuss the analyses with individual protein sequences as input, from which some basic parameters of proteins (e.g., amino acid composition, MW, and PTM) can be predicted. In addition to these basic parameters that can be directly predicted by analyzing a protein sequence alone, many predictions are based on principles drawn from knowledge of many well-studied proteins, with multiple sequence comparisons as input. This category includes identification of conserved sites by comparing multiple homologous sequences, prediction of the folding, structure, or function of uncharacterized proteins, construction of phylogenies of related sequences, analysis of the contribution of conserved related sites to protein function by SCA or DCA, elucidation of the significance of codon usage, and extraction of functional units from protein sequences and coding spaces. We then discuss the revolutionary invention of the "QTY code", which can be applied to convert membrane proteins into water-soluble proteins at the cost of marginal structural and functional changes. As in other scientific fields, machine learning has profoundly impacted protein sequence analysis.
In summary, we have highlighted the relevance of the bioinformatics-assisted analysis for protein research as a valuable guide for laboratory experiments.
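As an illustration of the single-sequence parameters mentioned in this abstract, the following is a minimal sketch of computing amino acid composition and an approximate molecular weight from one sequence. The function names are hypothetical, and the mass table holds commonly tabulated average residue masses, which should be treated as approximate; the sketch does not account for PTMs or non-standard residues.

```python
from collections import Counter

# Approximate average residue masses in daltons (assumed values from
# standard tables; each is the free amino acid mass minus one water).
AVG_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167,
    "V": 99.1326, "T": 101.1051, "C": 103.1388, "L": 113.1594,
    "I": 113.1594, "N": 114.1038, "D": 115.0886, "Q": 128.1307,
    "K": 128.1741, "E": 129.1155, "M": 131.1926, "H": 137.1411,
    "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_MASS = 18.01528  # one water is added back for the free termini


def composition(seq):
    """Return the fraction of each residue type present in the sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    counts = Counter(seq)
    return {aa: counts[aa] / len(seq) for aa in sorted(counts)}


def molecular_weight(seq):
    """Sum average residue masses and add one water for the chain termini."""
    return sum(AVG_RESIDUE_MASS[aa] for aa in seq.upper()) + WATER_MASS


if __name__ == "__main__":
    peptide = "GAG"
    print(composition(peptide))
    print(round(molecular_weight(peptide), 2))
```

A lookup failure (KeyError) on an unknown letter is left deliberate here, so that ambiguous codes such as X or B are not silently ignored.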
Topics: Proteins; Amino Acid Sequence; Sequence Analysis, Protein; Amino Acids; Computational Biology
PubMed: 37287293
DOI: 10.2174/1389203724666230509124300 -
Current Opinion in Structural Biology, Jun 1996
Review
Protein sequence motifs are signatures of protein families and can often be used as tools for the prediction of protein function. The generalization and modification of already known motifs are becoming major trends in the literature, even though new motifs are still being discovered at an approximately linear rate. The emphasis of motif analysis appears to be shifting from metabolic enzymes, in which motifs are associated with catalytic functions and thus often readily recognizable, to structural and regulatory proteins, which contain more divergent motifs. The consideration of structural information increasingly contributes to the identification of motifs and their sensitivity. Genome sequencing provides the basis for a systematic analysis of all motifs that are present in a particular organism. A systematically derived motif database is therefore feasible, allowing the classification of the majority of the newly appearing protein sequences into known families.
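To make the idea of a sequence motif as a family signature concrete, here is a small sketch that converts a simple PROSITE-style pattern into a regular expression and scans a sequence for it. The helper names are hypothetical, only a subset of the pattern syntax is handled (literals, `x`, `[..]`, `{..}`, and repeat counts), and the example uses the well-known N-glycosylation pattern `N-{P}-[ST]-{P}`; it is not taken from the review itself.

```python
import re


def prosite_to_regex(pattern):
    """Translate a simple PROSITE-style pattern into a Python regex string.

    Supported elements: literal residues, x (any residue), [ABC] (allowed
    set), {ABC} (forbidden set), and repeats such as x(2) or x(2,4).
    """
    parts = []
    for element in pattern.strip(".").split("-"):
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = m.group(1), m.group(2), m.group(3)
        if core == "x":
            core = "."                        # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"    # forbidden residues
        if low and high:
            core += "{" + low + "," + high + "}"
        elif low:
            core += "{" + low + "}"
        parts.append(core)
    return "".join(parts)


def find_motif(seq, pattern):
    """Return 0-based start positions of all (overlapping) motif matches."""
    regex = prosite_to_regex(pattern)
    # A lookahead lets overlapping occurrences be reported as well.
    return [m.start() for m in re.finditer("(?=(" + regex + "))", seq)]


if __name__ == "__main__":
    print(prosite_to_regex("N-{P}-[ST]-{P}"))
    print(find_motif("AANGSAANPTA", "N-{P}-[ST]-{P}"))
```

Short patterns like this one match frequently by chance, which is one reason motif hits are usually treated as candidates for further checking rather than as conclusive evidence of function.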
Topics: Amino Acid Sequence; Base Sequence; Databases, Factual; Molecular Sequence Data; Protein Conformation; Proteins; Sequence Alignment
PubMed: 8804823
DOI: 10.1016/s0959-440x(96)80057-1 -
Computational Biology and Chemistry, Aug 2022
Profiles are used to model protein families and domains. They are built from multiple sequence alignments obtained by mapping a query sequence against a database to generate a profile based on the substitution scoring matrix. Profile applications depend heavily on the alignment algorithm and the scoring system for amino acid substitution. However, sometimes the database contains no sequences similar to the query sequence under the scoring schema. In these cases, it is not possible to make a profile. This paper proposes a method named PA_SPP, based on the pre-trained ProtAlbert transformer, to predict the profile for a single protein sequence without alignment. The performance of transformers on natural languages is impressive. Because protein sequences can be viewed as a language, we can benefit from these models. We analyze the attention heads in different layers of ProtAlbert to show that the transformer can capture five essential protein characteristics of a single sequence. This assessment shows that ProtAlbert considers some protein properties when suggesting amino acids for each position in the sequence. In other words, transformers can be considered an appropriate alternative to alignment and scoring schema for predicting a profile. We evaluate PA_SPP on the CASP13 dataset, including 55 proteins. In addition, one thermophilic and two mesophilic proteins are used as case studies. The results display high similarity between the predicted profiles and HSSP profiles.
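For context on the alignment-based approach that this abstract contrasts with PA_SPP, the sketch below derives a simple per-position frequency profile from a multiple sequence alignment. It is a deliberately simplified stand-in: it uses raw residue frequencies without a substitution matrix, sequence weighting, or pseudocounts, and it is neither the PA_SPP method nor the HSSP procedure. The function name is hypothetical.

```python
from collections import Counter


def frequency_profile(alignment, gap="-"):
    """Compute per-column residue frequencies from aligned sequences.

    `alignment` is a list of equal-length strings. Gap characters are
    ignored, and a column containing only gaps yields an empty dict.
    """
    if len({len(seq) for seq in alignment}) != 1:
        raise ValueError("alignment must be non-empty with equal-length rows")
    profile = []
    for column in zip(*alignment):
        counts = Counter(aa for aa in column if aa != gap)
        total = sum(counts.values())
        if total:
            profile.append({aa: n / total for aa, n in counts.items()})
        else:
            profile.append({})
    return profile


if __name__ == "__main__":
    msa = ["ACD", "ACE", "A-E"]
    for position, freqs in enumerate(frequency_profile(msa), start=1):
        print(position, freqs)
```

When no homologous sequences are found, the alignment has a single row and every column collapses to one residue with frequency 1, which is the situation the abstract describes as being unable to make a useful profile.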
Topics: Algorithms; Amino Acid Sequence; Databases, Factual; Proteins; Sequence Alignment
PubMed: 35802991
DOI: 10.1016/j.compbiolchem.2022.107717 -
Nature Communications, Feb 2022
The task of protein sequence design is central to nearly all rational protein engineering problems, and enormous effort has gone into the development of energy functions to guide design. Here, we investigate the capability of a deep neural network model to automate design of sequences onto protein backbones, having learned directly from crystal structure data and without any human-specified priors. The model generalizes to native topologies not seen during training, producing experimentally stable designs. We evaluate the generalizability of our method to a de novo TIM-barrel scaffold. The model produces novel sequences, and high-resolution crystal structures of two designs show excellent agreement with in silico models. Our findings demonstrate the tractability of an entirely learned method for protein sequence design.
Topics: Amino Acid Sequence; Computer Simulation; Crystallography, X-Ray; Deep Learning; Models, Molecular; Protein Domains; Protein Engineering; Protein Folding
PubMed: 35136054
DOI: 10.1038/s41467-022-28313-9 -
Cell Systems, Jan 2021
Machine learning can infer how protein sequence maps to function without requiring a detailed understanding of the underlying physical or biological mechanisms. It is challenging to apply existing supervised learning frameworks to large-scale experimental data generated by deep mutational scanning (DMS) and related methods. DMS data often contain high-dimensional and correlated sequence variables, experimental sampling error and bias, and missing data. Notably, most DMS data do not contain examples of negative sequences, making it challenging to directly estimate how sequence affects function. Here, we develop a positive-unlabeled (PU) learning framework to infer sequence-function relationships from large-scale DMS data. Our PU learning method displays excellent predictive performance across ten large-scale sequence-function datasets, representing proteins of different folds, functions, and library types. The estimated parameters pinpoint key residues that dictate protein structure and function. Finally, we apply our statistical sequence-function model to design highly stabilized enzymes.
Topics: Amino Acid Sequence; Machine Learning; Proteins
PubMed: 33212013
DOI: 10.1016/j.cels.2020.10.007 -
Advances in Protein Chemistry 2000
Review
Topics: Amino Acid Sequence; Computational Biology; Databases, Factual; Molecular Sequence Data; Proteins; Sequence Analysis, Protein
PubMed: 10829224
DOI: 10.1016/s0065-3233(00)54002-9 -
Methods in Molecular Biology (Clifton,... 2008
Review
Protein sequence alignment is the task of identifying evolutionarily or structurally related positions in a collection of amino acid sequences. Although the protein alignment problem has been studied for several decades, many recent studies have demonstrated considerable progress in improving the accuracy or scalability of multiple and pairwise alignment tools, or in expanding the scope of tasks handled by an alignment program. In this chapter, we review state-of-the-art protein sequence alignment and provide practical advice for users of alignment tools.
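As a concrete reference point for the pairwise alignment task described here, the following is a minimal sketch of global alignment by dynamic programming (the classic Needleman-Wunsch scheme). The scoring values (match +1, mismatch -1, linear gap -2) are illustrative assumptions; practical tools use substitution matrices and affine gap penalties, and this sketch is not drawn from the chapter itself.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align strings a and b; return (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)

    def sub(i, j):
        # Substitution score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align two residues
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(needleman_wunsch("ACDE", "ADE"))
```

The table costs time and memory proportional to the product of the two sequence lengths, which is part of why scalability is a recurring concern for alignment tools.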
Topics: Algorithms; Amino Acid Sequence; Databases, Protein; Molecular Sequence Data; Proteins; Sequence Alignment; Software
PubMed: 18592193
DOI: 10.1007/978-1-59745-398-1_25 -
Bioinformatics (Oxford, England), Jan 2023
MOTIVATION
As more data on experimentally determined protein structures become available, data-driven models to describe protein sequence-structure relationships become more feasible. Within this space, the amino acid sequence design of protein-protein interactions is still a rather challenging subproblem with very low success rates, even though such interactions are central to most biological processes.
RESULTS
We developed an attention-based deep learning model inspired by algorithms used for image-caption assignments to design peptides or protein fragment sequences. Our trained model can be applied for the redesign of natural protein interfaces or the designed protein interaction fragments. Here, we validate the potential by recapitulating naturally occurring protein-protein interactions including antibody-antigen complexes. The designed interfaces accurately capture essential native interactions and have comparable native-like binding affinities in silico. Furthermore, our model does not need a precise backbone location, making it an attractive tool for working with de novo design of protein-protein interactions.
AVAILABILITY AND IMPLEMENTATION
The source code of the method is available at https://github.com/strauchlab/iNNterfaceDesign.
SUPPLEMENTARY INFORMATION
Supplementary data are available at Bioinformatics online.
Topics: Amino Acid Sequence; Deep Learning; Proteins; Algorithms; Peptides; Software
PubMed: 36377772
DOI: 10.1093/bioinformatics/btac733 -
Nucleic Acids Research, Apr 2022
Alternative splicing is frequently involved in the diversification of protein function and can also be modulated for therapeutic purposes. Here we develop a predictive model, called Exon ByPASS (predicting Exon skipping Based on Protein amino acid SequenceS), to assess the criticality of exon inclusion based solely on information contained in the amino acid sequence upstream and downstream of the exon junctions. By focusing on protein sequence, Exon ByPASS predicts exon skipping independent of tissue and species in the absence of any intronic information. We validate model predictions using transcriptomic and proteomic data and show that the model can capture exon skipping in different tissues and species. Additionally, we reveal potential therapeutic opportunities by predicting synthetically skippable exons and neo-junctions arising in cancer cells.
Topics: Alternative Splicing; Amino Acid Sequence; Exons; Introns; Proteomics
PubMed: 35286381
DOI: 10.1093/nar/gkac155