protein sequence - OpenMD.com Journal Search

MUSCLE: multiple sequence alignment with high accuracy and high throughput.

Nucleic Acids Research 2004

Comparative Study

Authors: Robert C Edgar

We describe MUSCLE, a new computer program for creating multiple alignments of protein sequences. Elements of the algorithm include fast distance estimation using kmer counting, progressive alignment using a new profile function we call the log-expectation score, and refinement using tree-dependent restricted partitioning. The speed and accuracy of MUSCLE are compared with T-Coffee, MAFFT and CLUSTALW on four test sets of reference alignments: BAliBASE, SABmark, SMART and a new benchmark, PREFAB. MUSCLE achieves the highest, or joint highest, rank in accuracy on each of these sets. Without refinement, MUSCLE achieves average accuracy statistically indistinguishable from T-Coffee and MAFFT, and is the fastest of the tested methods for large numbers of sequences, aligning 5000 sequences of average length 350 in 7 min on a current desktop computer. The MUSCLE program, source code and PREFAB test data are freely available at http://www.drive5.com/muscle.
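
The kmer counting step named in the abstract can be illustrated with a small sketch. The distance used here (one minus the fraction of shared k-mers) is a simplified stand-in chosen for illustration and is not claimed to be the exact measure MUSCLE uses; the function names and the default k=3 are assumptions.

```python
from collections import Counter


def kmer_counts(seq: str, k: int = 3) -> Counter:
    """Count every overlapping k-mer in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_distance(a: str, b: str, k: int = 3) -> float:
    """Return 1 - (shared k-mers / k-mers in the shorter sequence).

    Illustrative measure only (an assumption, not MUSCLE's formula):
    0.0 means every k-mer of the shorter sequence is matched,
    1.0 means no k-mers are shared.
    """
    shortest = min(len(a), len(b)) - k + 1
    if shortest <= 0:
        raise ValueError("sequences must be at least k residues long")
    # Counter '&' is a multiset intersection (minimum of the counts).
    shared = sum((kmer_counts(a, k) & kmer_counts(b, k)).values())
    return 1.0 - shared / shortest


if __name__ == "__main__":
    # Two 10-residue toy sequences differing at one position.
    print(kmer_distance("MKVLAAGIVG", "MKVLAAGLVG"))
```

Because no alignment is computed, a measure of this kind can be evaluated for every pair of sequences cheaply, which is the reason such estimates are attractive as a first step before progressive alignment.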

Topics: Algorithms; Amino Acid Motifs; Amino Acid Sequence; Internet; Molecular Sequence Data; Reproducibility of Results; Sequence Alignment; Sequence Analysis, Protein; Software; Time Factors

PubMed: 15034147
DOI: 10.1093/nar/gkh340

Adaptive machine learning for protein engineering.

Current Opinion in Structural Biology Feb 2022

Review

Authors: Brian L Hie, Kevin K Yang

Machine-learning models that learn from data to predict how protein sequence encodes function are emerging as a useful protein engineering tool. However, when using these models to suggest new protein designs, one must deal with the vast combinatorial complexity of protein sequences. Here, we review how to use a sequence-to-function machine-learning surrogate model to select sequences for experimental measurement. First, we discuss how to select sequences through a single round of machine-learning optimization. Then, we discuss sequential optimization, where the goal is to discover optimized sequences and improve the model across multiple rounds of training, optimization, and experimental measurement.
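
One round of surrogate-guided selection, as described above, can be sketched as follows. The upper-confidence-bound rule (predicted mean plus a multiple of predicted uncertainty) is one common acquisition choice and is used here as an assumption, not as the review's recommendation; `select_batch`, `predict`, and `beta` are hypothetical names.

```python
def select_batch(candidates, predict, batch_size=2, beta=1.0):
    """Pick the candidates with the highest mean + beta * std.

    `predict` is any surrogate returning (mean, std) for a sequence.
    beta = 0 selects purely on predicted function; larger beta favours
    sequences the model is uncertain about.
    """
    scored = []
    for seq in candidates:
        mean, std = predict(seq)
        scored.append((mean + beta * std, seq))
    # Stable sort: ties keep their original candidate order.
    scored.sort(key=lambda item: item[0], reverse=True)
    return [seq for _, seq in scored[:batch_size]]


if __name__ == "__main__":
    # Made-up surrogate predictions for three toy sequences.
    toy = {"AAA": (1.0, 0.0), "AAC": (0.5, 1.0), "AAD": (0.2, 0.1)}
    print(select_batch(list(toy), toy.get, batch_size=2, beta=1.0))
```

In a sequential setting, the selected sequences would be measured experimentally, added to the training data, and the surrogate refitted before the next call.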

Topics: Amino Acid Sequence; Machine Learning; Protein Engineering; Proteins

PubMed: 34896756
DOI: 10.1016/j.sbi.2021.11.002

Predicting exon criticality from protein sequence.

Nucleic Acids Research Apr 2022

Authors: Jigar Desai, Christopher Francis, Kenneth Longo...

Alternative splicing is frequently involved in the diversification of protein function and can also be modulated for therapeutic purposes. Here we develop a predictive model, called Exon ByPASS (predicting Exon skipping Based on Protein amino acid SequenceS), to assess the criticality of exon inclusion based solely on information contained in the amino acid sequence upstream and downstream of the exon junctions. By focusing on protein sequence, Exon ByPASS predicts exon skipping independent of tissue and species in the absence of any intronic information. We validate model predictions using transcriptomic and proteomic data and show that the model can capture exon skipping in different tissues and species. Additionally, we reveal potential therapeutic opportunities by predicting synthetically skippable exons and neo-junctions arising in cancer cells.

Topics: Alternative Splicing; Amino Acid Sequence; Exons; Introns; Proteomics

PubMed: 35286381
DOI: 10.1093/nar/gkac155

Deep learning of protein sequence design of protein-protein interactions.

Bioinformatics (Oxford, England) Jan 2023

Authors: Raulia Syrlybaeva, Eva-Maria Strauch

MOTIVATION

As more data of experimentally determined protein structures are becoming available, data-driven models to describe protein sequence-structure relationships become more feasible. Within this space, the amino acid sequence design of protein-protein interactions is still a rather challenging subproblem with very low success rates, yet it is central to most biological processes.

RESULTS

We developed an attention-based deep learning model inspired by algorithms used for image-caption assignments to design peptides or protein fragment sequences. Our trained model can be applied for the redesign of natural protein interfaces or the designed protein interaction fragments. Here, we validate the potential by recapitulating naturally occurring protein-protein interactions including antibody-antigen complexes. The designed interfaces accurately capture essential native interactions and have comparable native-like binding affinities in silico. Furthermore, our model does not need a precise backbone location, making it an attractive tool for working with de novo design of protein-protein interactions.

AVAILABILITY AND IMPLEMENTATION

The source code of the method is available at https://github.com/strauchlab/iNNterfaceDesign.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.

Topics: Amino Acid Sequence; Deep Learning; Proteins; Algorithms; Peptides; Software

PubMed: 36377772
DOI: 10.1093/bioinformatics/btac733

Protein sequence design with deep generative models.

Current Opinion in Chemical Biology Dec 2021

Review

Authors: Zachary Wu, Kadina E Johnston, Frances H Arnold...

Protein engineering seeks to identify protein sequences with optimized properties. When guided by machine learning, protein sequence generation methods can draw on prior knowledge and experimental efforts to improve this process. In this review, we highlight recent applications of machine learning to generate protein sequences, focusing on the emerging field of deep generative methods.

Topics: Amino Acid Sequence; Machine Learning; Protein Engineering

PubMed: 34051682
DOI: 10.1016/j.cbpa.2021.04.004

Protein sequence models for prediction and comparative analysis of the SARS-CoV-2-human interactome.

Pacific Symposium on Biocomputing 2021

Authors: Meghana Kshirsagar, Nure Tasnina, Michael D Ward...

Viruses such as the novel coronavirus SARS-CoV-2, which is wreaking havoc on the world, depend on interactions of their own proteins with those of the human host cells. Relatively small changes in sequence, such as between SARS-CoV and SARS-CoV-2, can dramatically change clinical phenotypes of the virus, including transmission rates and severity of the disease. On the other hand, highly dissimilar virus families such as Coronaviridae, Ebola, and HIV have overlap in functions. In this work we aim to analyze the role of protein sequence in the binding of SARS-CoV-2 virus proteins towards human proteins and compare it to that of the other viruses above. We build supervised machine learning models, using Generalized Additive Models to predict interactions based on sequence features, and find that our models perform well with an AUC-PR of 0.65 in a class-skew of 1:10. Analysis of the novel predictions using an independent dataset showed statistically significant enrichment. We further map the importance of specific amino-acid sequence features in predicting binding and summarize what combinations of sequences from the virus and the host are correlated with an interaction. By analyzing the sequence-based embeddings of the interactomes from different viruses and clustering them together, we find some functionally similar proteins from different viruses. For example, vif protein from HIV-1, vp24 from Ebola and orf3b from SARS-CoV all function as interferon antagonists. Furthermore, we can differentiate the functions of similar viruses; for example, orf3a's interactions are more diverged than orf7b interactions when comparing SARS-CoV and SARS-CoV-2.
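
To put an AUC-PR of 0.65 at a 1:10 class skew in context, note that a classifier ranking pairs at random has an expected precision equal to the positive fraction, 1/11 (about 0.09). The sketch below computes average precision, one common estimator of AUC-PR, from scratch; the function names are assumptions and the toy labels and scores are made up, not data from the paper.

```python
def average_precision(labels, scores):
    """Mean of the precision values measured at each positive's rank."""
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision at this cut-off
    if hits == 0:
        raise ValueError("at least one positive label is required")
    return total / hits


def random_baseline(n_pos, n_neg):
    """Expected precision of a random ranking: the positive fraction."""
    return n_pos / (n_pos + n_neg)


if __name__ == "__main__":
    labels = [1, 0, 1, 0]            # toy ground truth
    scores = [0.9, 0.8, 0.7, 0.1]    # toy model scores
    print(average_precision(labels, scores))
    print(random_baseline(1, 10))    # baseline at a 1:10 skew
```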

Topics: Amino Acid Sequence; COVID-19; Computational Biology; Humans; Proteins; SARS-CoV-2

PubMed: 33691013

An accurate alignment-free protein sequence comparator based on physicochemical properties of amino acids.

Scientific Reports Jul 2022

Authors: Saeedeh Akbari Rokn Abadi, Azam Sadat Abdosalehi, Faezeh Pouyamehr...

Bio-sequence comparators are one of the most basic and significant methods for assessing biological data, and so, due to the importance of proteins, protein sequence comparators are particularly crucial. On the other hand, the complexity of the problem, the growing number of extracted protein sequences, and the growth of studies and data analysis applications addressing protein sequences have necessitated the development of a rapid and accurate approach to account for the complexities in this field. As a result, we propose a protein sequence comparison approach, called PCV, which improves comparison accuracy by producing vectors that encode sequence data as well as physicochemical properties of the amino acids. At the same time, by partitioning the long protein sequences into fixed-length blocks and providing an encoding vector for each block, this method allows for parallel and fast implementation. To evaluate the performance of PCV, like other alignment-free methods, we used 12 benchmark datasets including classes with homologous sequences which may require a simple preprocessing search tool to select the homologous data. We then compared the protein sequence comparison outcomes to those of alternative alignment-based and alignment-free methods, using various evaluation criteria. These results indicate that our method provides significant improvement in sequence classification accuracy compared to the alternative alignment-free methods and has an average correlation of about 94% with the ClustalW method as our reference method, while considerably reducing the processing time.
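
The block-wise encoding idea can be sketched as follows. This is not PCV's actual vector construction: as an assumption for illustration, each fixed-length block is summarised by a single physicochemical property, the mean Kyte-Doolittle hydropathy of its residues. The name `block_vectors` and the block length are hypothetical.

```python
# Kyte-Doolittle hydropathy index, keyed by one-letter amino acid code.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def block_vectors(seq, block_len=4):
    """Split seq into fixed-length blocks and encode each block.

    Returns one value per block (mean hydropathy). The final block may
    be shorter than block_len. Non-standard residues raise KeyError.
    """
    blocks = [seq[i:i + block_len] for i in range(0, len(seq), block_len)]
    return [sum(KD[aa] for aa in block) / len(block) for block in blocks]


if __name__ == "__main__":
    print(block_vectors("MKVLAAGIVGLL", block_len=4))
```

Each block is encoded independently of the others, so the list comprehension could be distributed across workers, which is the property the abstract points to when it mentions parallel implementation.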

Topics: Algorithms; Amino Acid Sequence; Amino Acids; Proteins; Sequence Alignment

PubMed: 35778592
DOI: 10.1038/s41598-022-15266-8

Protein sequence design with a learned potential.

Nature Communications Feb 2022

Authors: Namrata Anand, Raphael Eguchi, Irimpan I Mathews...

The task of protein sequence design is central to nearly all rational protein engineering problems, and enormous effort has gone into the development of energy functions to guide design. Here, we investigate the capability of a deep neural network model to automate design of sequences onto protein backbones, having learned directly from crystal structure data and without any human-specified priors. The model generalizes to native topologies not seen during training, producing experimentally stable designs. We evaluate the generalizability of our method to a de novo TIM-barrel scaffold. The model produces novel sequences, and high-resolution crystal structures of two designs show excellent agreement with in silico models. Our findings demonstrate the tractability of an entirely learned method for protein sequence design.

Topics: Amino Acid Sequence; Computer Simulation; Crystallography, X-Ray; Deep Learning; Models, Molecular; Protein Domains; Protein Engineering; Protein Folding

PubMed: 35136054
DOI: 10.1038/s41467-022-28313-9

Machine learning to navigate fitness landscapes for protein engineering.

Current Opinion in Biotechnology Jun 2022

Review

Authors: Chase R Freschlin, Sarah A Fahlberg, Philip A Romero...

Machine learning (ML) is revolutionizing our ability to understand and predict the complex relationships between protein sequence, structure, and function. Predictive sequence-function models are enabling protein engineers to efficiently search the sequence space for useful proteins with broad applications in biotechnology. In this review, we highlight the recent advances in applying ML to protein engineering. We discuss supervised learning methods that infer the sequence-function mapping from experimental data and new sequence representation strategies for data-efficient modeling. We then describe the various ways in which ML can be incorporated into protein engineering workflows, including purely in silico searches, ML-assisted directed evolution, and generative models that can learn the underlying distribution of the protein function in a sequence space. ML-driven protein engineering will become increasingly powerful with continued advances in high-throughput data generation, data science, and deep learning.

Topics: Amino Acid Sequence; Biotechnology; Machine Learning; Protein Engineering; Proteins

PubMed: 35413604
DOI: 10.1016/j.copbio.2022.102713

Prediction and Modeling of Protein-Protein Interactions Using "Spotted" Peptides with a Template-Based Approach.

Biomolecules Jan 2022

Authors: Chiara Gasbarri, Serena Rosignoli, Giacomo Janson...

Protein-peptide interactions (PpIs) are a subset of the overall protein-protein interaction (PPI) network in the living cell and are pivotal for the majority of cell processes and functions. High-throughput methods to detect PpIs and PPIs usually require time and costs that are not always affordable. Therefore, reliable in silico predictions represent a valid and effective alternative. In this work, a new algorithm is described, implemented in a freely available tool, "PepThreader", to carry out PPI and PpI prediction and analysis. PepThreader threads multiple fragments derived from a full-length protein sequence (or from a peptide library) onto a second template peptide, in complex with a protein target, "spotting" the potential binding peptides and ranking them according to a sequence-based and structure-based threading score. The threading algorithm first makes use of a scoring function that is based on peptide sequence similarity. Then, the initial hits are reranked according to structure-based scoring functions. PepThreader has been benchmarked on a dataset of 292 protein-peptide complexes that were collected from existing databases of experimentally determined protein-peptide interactions. An accuracy of 80% was achieved when considering the top 25 predicted hits, which is comparable to the other state-of-the-art tools in PPI and PpI modeling. Nonetheless, PepThreader is unique in that it is able at the same time to spot a binding peptide within a full-length sequence involved in a PPI and model its structure within the receptor. Therefore, PepThreader adds to the already-available tools supporting experimental PPI and PpI identification and characterization.
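
The first, sequence-based stage described above can be sketched as a sliding window over the full-length sequence. Fractional identity to the template peptide is used here as a stand-in score; it is an assumption for illustration, and neither PepThreader's actual similarity function nor its structure-based rerank is reproduced. The name `rank_fragments` is hypothetical.

```python
def rank_fragments(protein, template, top_n=25):
    """Return up to top_n (score, start, fragment) tuples, best first.

    Every window of len(template) residues in `protein` is scored by
    its fraction of positions identical to the template peptide.
    """
    width = len(template)
    if width == 0:
        raise ValueError("template peptide must not be empty")
    hits = []
    for start in range(len(protein) - width + 1):
        fragment = protein[start:start + width]
        identical = sum(a == b for a, b in zip(fragment, template))
        hits.append((identical / width, start, fragment))
    # Highest score first; ties broken by earliest start position.
    hits.sort(key=lambda h: (-h[0], h[1]))
    return hits[:top_n]


if __name__ == "__main__":
    # Toy protein and template, not taken from the benchmark set.
    for hit in rank_fragments("MKTAYIAKQR", "AYIA", top_n=3):
        print(hit)
```

In a two-stage scheme like the one in the abstract, only the fragments returned by this step would be passed on to the more expensive structure-based scoring.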

Topics: Amino Acid Sequence; Peptide Library; Peptides; Protein Interaction Mapping; Software

PubMed: 35204702
DOI: 10.3390/biom12020201