Nucleic Acids Research 2004
Comparative Study
We describe MUSCLE, a new computer program for creating multiple alignments of protein sequences. Elements of the algorithm include fast distance estimation using kmer counting, progressive alignment using a new profile function we call the log-expectation score, and refinement using tree-dependent restricted partitioning. The speed and accuracy of MUSCLE are compared with T-Coffee, MAFFT and CLUSTALW on four test sets of reference alignments: BAliBASE, SABmark, SMART and a new benchmark, PREFAB. MUSCLE achieves the highest, or joint highest, rank in accuracy on each of these sets. Without refinement, MUSCLE achieves average accuracy statistically indistinguishable from T-Coffee and MAFFT, and is the fastest of the tested methods for large numbers of sequences, aligning 5000 sequences of average length 350 in 7 min on a current desktop computer. The MUSCLE program, source code and PREFAB test data are freely available at http://www.drive5.com/muscle.
Topics: Algorithms; Amino Acid Motifs; Amino Acid Sequence; Internet; Molecular Sequence Data; Reproducibility of Results; Sequence Alignment; Sequence Analysis, Protein; Software; Time Factors
PubMed: 15034147
DOI: 10.1093/nar/gkh340
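The kmer-counting distance estimate mentioned in the MUSCLE abstract above can be illustrated with a minimal sketch. The abstract does not give the formula, so the function names, the choice of k = 3 and the normalisation used here are assumptions for illustration only, not MUSCLE's actual implementation.

```python
from collections import Counter


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_distance(a: str, b: str, k: int = 3) -> float:
    """Alignment-free distance in [0, 1] from shared k-mer counts.

    Assumption: normalise by the number of k-mers in the shorter
    sequence. This is an illustrative choice, not the published formula.
    """
    if min(len(a), len(b)) < k:
        raise ValueError("both sequences must contain at least k residues")
    counts_a, counts_b = kmer_counts(a, k), kmer_counts(b, k)
    # For each k-mer, the shared count is the smaller of the two counts.
    shared = sum(min(n, counts_b[kmer]) for kmer, n in counts_a.items())
    max_shared = min(len(a), len(b)) - k + 1
    return 1.0 - shared / max_shared
```

Because no alignment is computed, each pairwise distance costs time roughly linear in the sequence lengths, which is why this kind of estimate suits large sequence sets.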
Current Opinion in Structural Biology Feb 2022
Review
Machine-learning models that learn from data to predict how protein sequence encodes function are emerging as a useful protein engineering tool. However, when using these models to suggest new protein designs, one must deal with the vast combinatorial complexity of protein sequences. Here, we review how to use a sequence-to-function machine-learning surrogate model to select sequences for experimental measurement. First, we discuss how to select sequences through a single round of machine-learning optimization. Then, we discuss sequential optimization, where the goal is to discover optimized sequences and improve the model across multiple rounds of training, optimization, and experimental measurement.
Topics: Amino Acid Sequence; Machine Learning; Protein Engineering; Proteins
PubMed: 34896756
DOI: 10.1016/j.sbi.2021.11.002
Nucleic Acids Research Apr 2022
Alternative splicing is frequently involved in the diversification of protein function and can also be modulated for therapeutic purposes. Here we develop a predictive model, called Exon ByPASS (predicting Exon skipping Based on Protein amino acid SequenceS), to assess the criticality of exon inclusion based solely on information contained in the amino acid sequence upstream and downstream of the exon junctions. By focusing on protein sequence, Exon ByPASS predicts exon skipping independent of tissue and species in the absence of any intronic information. We validate model predictions using transcriptomic and proteomic data and show that the model can capture exon skipping in different tissues and species. Additionally, we reveal potential therapeutic opportunities by predicting synthetically skippable exons and neo-junctions arising in cancer cells.
Topics: Alternative Splicing; Amino Acid Sequence; Exons; Introns; Proteomics
PubMed: 35286381
DOI: 10.1093/nar/gkac155
Bioinformatics (Oxford, England) Jan 2023
MOTIVATION
As more data on experimentally determined protein structures become available, data-driven models to describe protein sequence-structure relationships become more feasible. Within this space, the amino acid sequence design of protein-protein interactions is still a rather challenging subproblem with very low success rates, yet it is central to most biological processes.
RESULTS
We developed an attention-based deep learning model inspired by algorithms used for image-caption assignments to design peptides or protein fragment sequences. Our trained model can be applied for the redesign of natural protein interfaces or the designed protein interaction fragments. Here, we validate the potential by recapitulating naturally occurring protein-protein interactions including antibody-antigen complexes. The designed interfaces accurately capture essential native interactions and have comparable native-like binding affinities in silico. Furthermore, our model does not need a precise backbone location, making it an attractive tool for working with de novo design of protein-protein interactions.
AVAILABILITY AND IMPLEMENTATION
The source code of the method is available at https://github.com/strauchlab/iNNterfaceDesign.
SUPPLEMENTARY INFORMATION
Supplementary data are available at Bioinformatics online.
Topics: Amino Acid Sequence; Deep Learning; Proteins; Algorithms; Peptides; Software
PubMed: 36377772
DOI: 10.1093/bioinformatics/btac733
Current Opinion in Chemical Biology Dec 2021
Review
Protein engineering seeks to identify protein sequences with optimized properties. When guided by machine learning, protein sequence generation methods can draw on prior knowledge and experimental efforts to improve this process. In this review, we highlight recent applications of machine learning to generate protein sequences, focusing on the emerging field of deep generative methods.
Topics: Amino Acid Sequence; Machine Learning; Protein Engineering
PubMed: 34051682
DOI: 10.1016/j.cbpa.2021.04.004
Pacific Symposium on Biocomputing.... 2021
Viruses such as the novel coronavirus SARS-CoV-2, which is wreaking havoc on the world, depend on interactions of their own proteins with those of human host cells. Relatively small changes in sequence, such as those between SARS-CoV and SARS-CoV-2, can dramatically change clinical phenotypes of the virus, including transmission rates and severity of the disease. On the other hand, highly dissimilar virus families such as Coronaviridae, Ebola, and HIV overlap in function. In this work we aim to analyze the role of protein sequence in the binding of SARS-CoV-2 virus proteins to human proteins and compare it to that of the other viruses above. We build supervised machine learning models, using Generalized Additive Models to predict interactions based on sequence features, and find that our models perform well, with an AUC-PR of 0.65 at a class skew of 1:10. Analysis of the novel predictions using an independent dataset showed statistically significant enrichment. We further map the importance of specific amino-acid sequence features in predicting binding and summarize which combinations of sequences from the virus and the host are correlated with an interaction. By analyzing the sequence-based embeddings of the interactomes from different viruses and clustering them together, we find some functionally similar proteins from different viruses. For example, vif protein from HIV-1, vp24 from Ebola and orf3b from SARS-CoV all function as interferon antagonists. Furthermore, we can differentiate the functions of similar viruses; for example, orf3a's interactions are more diverged than orf7b's interactions when comparing SARS-CoV and SARS-CoV-2.
Topics: Amino Acid Sequence; COVID-19; Computational Biology; Humans; Proteins; SARS-CoV-2
PubMed: 33691013
DOI: No ID Found
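To put the AUC-PR of 0.65 reported in the abstract above in context, the following sketch computes the score expected from a classifier that ranks pairs at random. It assumes the 1:10 class skew means one interacting pair per ten non-interacting pairs, which the abstract does not state explicitly. The function name is hypothetical.

```python
def random_baseline_auc_pr(positives: int, negatives: int) -> float:
    """Expected area under the precision-recall curve for random ranking.

    For a random ranking, precision equals the positive-class prevalence
    at every recall level, so the area equals the prevalence.
    """
    return positives / (positives + negatives)


# Assumption: "1:10" means 1 positive for every 10 negatives.
baseline = random_baseline_auc_pr(1, 10)

# Ratio of the reported AUC-PR (0.65, from the abstract) to the baseline.
lift = 0.65 / baseline
```

Under this reading the random baseline is 1/11, about 0.09, so the reported 0.65 is roughly seven times what chance would give.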
Scientific Reports Jul 2022
Bio-sequence comparators are among the most basic and significant methods for assessing biological data, and, due to the importance of proteins, protein sequence comparators are particularly crucial. On the other hand, the complexity of the problem, the growing number of extracted protein sequences, and the growth of studies and data analysis applications addressing protein sequences have necessitated the development of a rapid and accurate approach to account for the complexities in this field. As a result, we propose a protein sequence comparison approach, called PCV, which improves comparison accuracy by producing vectors that encode sequence data as well as physicochemical properties of the amino acids. At the same time, by partitioning long protein sequences into fixed-length blocks and providing an encoding vector for each block, this method allows for parallel and fast implementation. To evaluate the performance of PCV, as with other alignment-free methods, we used 12 benchmark datasets including classes with homologous sequences, which may require a simple preprocessing search tool to select the homologous data. We then compared the protein sequence comparison outcomes to those of alternative alignment-based and alignment-free methods, using various evaluation criteria. These results indicate that our method provides significant improvement in sequence classification accuracy compared to the alternative alignment-free methods and has an average correlation of about 94% with the ClustalW method, our reference method, while considerably reducing the processing time.
Topics: Algorithms; Amino Acid Sequence; Amino Acids; Proteins; Sequence Alignment
PubMed: 35778592
DOI: 10.1038/s41598-022-15266-8
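The block-wise encoding idea described in the PCV abstract above can be sketched as follows. The abstract does not specify the encoding, so this example is a hypothetical stand-in: it uses the Kyte-Doolittle hydropathy scale as the single physicochemical property, and the function names and block length are assumptions.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
# Assumption: used here only as an example physicochemical property.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def split_blocks(seq: str, block_len: int) -> list[str]:
    """Partition seq into consecutive blocks; the last may be shorter."""
    return [seq[i:i + block_len] for i in range(0, len(seq), block_len)]


def encode_block(block: str) -> tuple[int, float]:
    """Encode one block as (length, mean hydropathy).

    Hypothetical encoding, much simpler than a real PCV vector.
    """
    mean_hydropathy = sum(HYDROPATHY[aa] for aa in block) / len(block)
    return (len(block), mean_hydropathy)


def encode_sequence(seq: str, block_len: int = 4) -> list[tuple[int, float]]:
    """Encode every block independently of the others."""
    return [encode_block(block) for block in split_blocks(seq, block_len)]
```

Since each block is encoded without reference to its neighbours, the per-block step could be distributed across workers, which is the property the abstract credits for fast, parallel implementation.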
Nature Communications Feb 2022
The task of protein sequence design is central to nearly all rational protein engineering problems, and enormous effort has gone into the development of energy functions to guide design. Here, we investigate the capability of a deep neural network model to automate design of sequences onto protein backbones, having learned directly from crystal structure data and without any human-specified priors. The model generalizes to native topologies not seen during training, producing experimentally stable designs. We evaluate the generalizability of our method to a de novo TIM-barrel scaffold. The model produces novel sequences, and high-resolution crystal structures of two designs show excellent agreement with in silico models. Our findings demonstrate the tractability of an entirely learned method for protein sequence design.
Topics: Amino Acid Sequence; Computer Simulation; Crystallography, X-Ray; Deep Learning; Models, Molecular; Protein Domains; Protein Engineering; Protein Folding
PubMed: 35136054
DOI: 10.1038/s41467-022-28313-9
Current Opinion in Biotechnology Jun 2022
Review
Machine learning (ML) is revolutionizing our ability to understand and predict the complex relationships between protein sequence, structure, and function. Predictive sequence-function models are enabling protein engineers to efficiently search the sequence space for useful proteins with broad applications in biotechnology. In this review, we highlight the recent advances in applying ML to protein engineering. We discuss supervised learning methods that infer the sequence-function mapping from experimental data and new sequence representation strategies for data-efficient modeling. We then describe the various ways in which ML can be incorporated into protein engineering workflows, including purely in silico searches, ML-assisted directed evolution, and generative models that can learn the underlying distribution of the protein function in a sequence space. ML-driven protein engineering will become increasingly powerful with continued advances in high-throughput data generation, data science, and deep learning.
Topics: Amino Acid Sequence; Biotechnology; Machine Learning; Protein Engineering; Proteins
PubMed: 35413604
DOI: 10.1016/j.copbio.2022.102713
Biomolecules Jan 2022
Protein-peptide interactions (PpIs) are a subset of the overall protein-protein interaction (PPI) network in the living cell and are pivotal for the majority of cell processes and functions. High-throughput methods to detect PpIs and PPIs usually require time and costs that are not always affordable. Therefore, reliable in silico predictions represent a valid and effective alternative. In this work, a new algorithm is described, implemented in a freely available tool, "PepThreader", to carry out PPI and PpI prediction and analysis. PepThreader threads multiple fragments derived from a full-length protein sequence (or from a peptide library) onto a second template peptide, in complex with a protein target, "spotting" the potential binding peptides and ranking them according to a sequence-based and structure-based threading score. The threading algorithm first makes use of a scoring function that is based on peptide sequence similarity. Then, the initial hits are reranked according to structure-based scoring functions. PepThreader has been benchmarked on a dataset of 292 protein-peptide complexes that were collected from existing databases of experimentally determined protein-peptide interactions. An accuracy of 80% was achieved when considering the top 25 predicted hits, which is comparable to other state-of-the-art tools in PPI and PpI modeling. Nonetheless, PepThreader is unique in that it is able at the same time to spot a binding peptide within a full-length sequence involved in a PPI and model its structure within the receptor. Therefore, PepThreader adds to the already-available tools supporting the experimental identification and characterization of PPIs and PpIs.
Topics: Amino Acid Sequence; Peptide Library; Peptides; Protein Interaction Mapping; Software
PubMed: 35204702
DOI: 10.3390/biom12020201
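The first, sequence-based stage described in the PepThreader abstract above (cutting a full-length sequence into fragments and ranking them against a template peptide) can be sketched roughly as below. The abstract does not define the scoring function, so plain fractional identity is used here as a stand-in; the function names are hypothetical, and the structure-based reranking stage is not modelled.

```python
def fragments(protein: str, length: int) -> list[tuple[int, str]]:
    """Return every overlapping fragment of the given length with its start index."""
    return [(i, protein[i:i + length]) for i in range(len(protein) - length + 1)]


def identity(a: str, b: str) -> float:
    """Fraction of positions where two equal-length peptides match.

    Assumption: a stand-in for the tool's sequence-similarity score.
    """
    return sum(x == y for x, y in zip(a, b)) / len(a)


def rank_fragments(protein: str, template: str, top: int = 25) -> list[tuple[float, int, str]]:
    """Score each fragment against the template and keep the best hits.

    Ties are broken by start position so the ordering is deterministic.
    """
    scored = [
        (identity(frag, template), start, frag)
        for start, frag in fragments(protein, len(template))
    ]
    scored.sort(key=lambda hit: (-hit[0], hit[1]))
    return scored[:top]
```

The default of 25 hits mirrors the cutoff at which the abstract reports accuracy. In the described workflow these initial hits would then be reranked with structure-based scores.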