-
Nucleic Acids Research 2004
Comparative Study
We describe MUSCLE, a new computer program for creating multiple alignments of protein sequences. Elements of the algorithm include fast distance estimation using kmer counting, progressive alignment using a new profile function we call the log-expectation score, and refinement using tree-dependent restricted partitioning. The speed and accuracy of MUSCLE are compared with T-Coffee, MAFFT and CLUSTALW on four test sets of reference alignments: BAliBASE, SABmark, SMART and a new benchmark, PREFAB. MUSCLE achieves the highest, or joint highest, rank in accuracy on each of these sets. Without refinement, MUSCLE achieves average accuracy statistically indistinguishable from T-Coffee and MAFFT, and is the fastest of the tested methods for large numbers of sequences, aligning 5000 sequences of average length 350 in 7 min on a current desktop computer. The MUSCLE program, source code and PREFAB test data are freely available at http://www.drive5.com/muscle.
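The "fast distance estimation using kmer counting" mentioned above can be illustrated with a minimal sketch. This is not MUSCLE's actual implementation: the function names, the default k = 3, and the normalisation by the shorter sequence's k-mer count are assumptions chosen for illustration.

```python
from collections import Counter


def kmer_counts(seq: str, k: int = 3) -> Counter:
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_distance(a: str, b: str, k: int = 3) -> float:
    """Illustrative alignment-free distance: 1 - (shared k-mers / k-mers in the shorter sequence).

    Hypothetical normalisation, chosen for simplicity; it is not taken from the MUSCLE paper.
    """
    denom = min(len(a), len(b)) - k + 1
    if denom <= 0:
        raise ValueError("both sequences must be at least k residues long")
    # Counter intersection keeps the minimum count of each k-mer present in both.
    shared = sum((kmer_counts(a, k) & kmer_counts(b, k)).values())
    return 1.0 - shared / denom
```

Because no alignment is computed, each pairwise distance costs time roughly linear in sequence length, which is what makes this kind of estimate attractive for thousands of sequences.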
Topics: Algorithms; Amino Acid Motifs; Amino Acid Sequence; Internet; Molecular Sequence Data; Reproducibility of Results; Sequence Alignment; Sequence Analysis, Protein; Software; Time Factors
PubMed: 15034147
DOI: 10.1093/nar/gkh340
-
Current Opinion in Structural Biology Feb 2022
Review
Machine-learning models that learn from data to predict how protein sequence encodes function are emerging as a useful protein engineering tool. However, when using these models to suggest new protein designs, one must deal with the vast combinatorial complexity of protein sequences. Here, we review how to use a sequence-to-function machine-learning surrogate model to select sequences for experimental measurement. First, we discuss how to select sequences through a single round of machine-learning optimization. Then, we discuss sequential optimization, where the goal is to discover optimized sequences and improve the model across multiple rounds of training, optimization, and experimental measurement.
Topics: Amino Acid Sequence; Machine Learning; Protein Engineering; Proteins
PubMed: 34896756
DOI: 10.1016/j.sbi.2021.11.002
-
Bioinformatics (Oxford, England) Mar 2023
MOTIVATION
Computational protein sequence design has been widely applied in rational protein engineering and increasing the design accuracy and efficiency is highly desired.
RESULTS
Here, we present ProDESIGN-LE, an accurate and efficient approach to protein sequence design. ProDESIGN-LE adopts a concise but informative representation of the residue's local environment and trains a transformer to learn the correlation between local environment of residues and their amino acid types. For a target backbone structure, ProDESIGN-LE uses the transformer to assign an appropriate residue type for each position based on its local environment within this structure, eventually acquiring a designed sequence with all residues fitting well with their local environments. We applied ProDESIGN-LE to design sequences for 68 naturally occurring and 129 hallucinated proteins within 20 s per protein on average. The designed proteins have their predicted structures perfectly resembling the target structures with a state-of-the-art average TM-score exceeding 0.80. We further experimentally validated ProDESIGN-LE by designing five sequences for an enzyme, chloramphenicol O-acetyltransferase type III (CAT III), and recombinantly expressing the proteins in Escherichia coli. Of these proteins, three exhibited excellent solubility, and one yielded monomeric species with circular dichroism spectra consistent with the natural CAT III protein.
AVAILABILITY AND IMPLEMENTATION
The source code of ProDESIGN-LE is available at https://github.com/bigict/ProDESIGN-LE.
Topics: Amino Acid Sequence; Proteins; Software
PubMed: 36916746
DOI: 10.1093/bioinformatics/btad122
-
Pacific Symposium on Biocomputing.... 2021
Viruses such as the novel coronavirus SARS-CoV-2, which is wreaking havoc on the world, depend on interactions of their own proteins with those of the human host cells. Relatively small changes in sequence, such as between SARS-CoV and SARS-CoV-2, can dramatically change clinical phenotypes of the virus, including transmission rates and severity of the disease. On the other hand, highly dissimilar virus families such as Coronaviridae, Ebola, and HIV have overlapping functions. In this work we aim to analyze the role of protein sequence in the binding of SARS-CoV-2 virus proteins to human proteins and compare it to that of the other viruses above. We build supervised machine learning models, using Generalized Additive Models to predict interactions based on sequence features, and find that our models perform well, with an AUC-PR of 0.65 at a class skew of 1:10. Analysis of the novel predictions using an independent dataset showed statistically significant enrichment. We further map the importance of specific amino-acid sequence features in predicting binding and summarize which combinations of sequences from the virus and the host are correlated with an interaction. By analyzing the sequence-based embeddings of the interactomes from different viruses and clustering them together, we find some functionally similar proteins from different viruses. For example, vif protein from HIV-1, vp24 from Ebola and orf3b from SARS-CoV all function as interferon antagonists. Furthermore, we can differentiate the functions of similar viruses; for example, orf3a's interactions are more diverged than orf7b's interactions when comparing SARS-CoV and SARS-CoV-2.
Topics: Amino Acid Sequence; COVID-19; Computational Biology; Humans; Proteins; SARS-CoV-2
PubMed: 33691013
DOI: No ID Found
-
Scientific Reports Jul 2022
Bio-sequence comparators are among the most basic and significant methods for assessing biological data, and, given the importance of proteins, protein sequence comparators are particularly crucial. At the same time, the complexity of the problem, the growing number of extracted protein sequences, and the growth of studies and data analysis applications addressing protein sequences have necessitated the development of a rapid and accurate approach. We therefore propose a protein sequence comparison approach, called PCV, which improves comparison accuracy by producing vectors that encode sequence data as well as physicochemical properties of the amino acids. By partitioning long protein sequences into fixed-length blocks and providing an encoding vector for each block, the method also allows for parallel and fast implementation. To evaluate the performance of PCV, as with other alignment-free methods, we used 12 benchmark datasets including classes with homologous sequences, which may require a simple preprocessing search tool to select the homologous data. We then compared the protein sequence comparison outcomes to those of alternative alignment-based and alignment-free methods, using various evaluation criteria. These results indicate that our method provides a significant improvement in sequence classification accuracy compared to the alternative alignment-free methods and has an average correlation of about 94% with the ClustalW method as our reference method, while considerably reducing the processing time.
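The idea of partitioning a sequence into fixed-length blocks and encoding each block independently can be sketched as follows. This is not the PCV encoding: the block length, the two-component vector (mean Kyte-Doolittle hydropathy and fraction of charged residues), and all function names are assumptions chosen only to show why per-block encoding parallelises naturally.

```python
from typing import List, Tuple

# Kyte-Doolittle hydropathy index; used here as a stand-in physicochemical property.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
    "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
    "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
CHARGED = set("DEKR")


def split_blocks(seq: str, size: int) -> List[str]:
    """Cut seq into consecutive blocks of at most `size` residues."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]


def encode_block(block: str) -> Tuple[float, float]:
    """Hypothetical block vector: (mean hydropathy, fraction of charged residues)."""
    mean_hydropathy = sum(HYDROPATHY[r] for r in block) / len(block)
    charged_fraction = sum(r in CHARGED for r in block) / len(block)
    return (mean_hydropathy, charged_fraction)


def encode_sequence(seq: str, size: int = 4) -> List[Tuple[float, float]]:
    """Encode each block on its own; blocks share no state, so they could be mapped in parallel."""
    return [encode_block(b) for b in split_blocks(seq, size)]
```

Since each block is encoded without reference to its neighbours, the list comprehension in `encode_sequence` could be swapped for a parallel map without changing the result.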
Topics: Algorithms; Amino Acid Sequence; Amino Acids; Proteins; Sequence Alignment
PubMed: 35778592
DOI: 10.1038/s41598-022-15266-8
-
Bioinformatics (Oxford, England) Nov 2022
MOTIVATION
The advent of massive DNA sequencing technologies is producing a huge number of human single-nucleotide polymorphisms occurring in protein-coding regions and possibly changing their sequences. Discriminating harmful protein variations from neutral ones is one of the crucial challenges in precision medicine. Computational tools based on artificial intelligence provide models for protein sequence encoding, bypassing database searches for evolutionary information. We leverage the new encoding schemes for an efficient annotation of protein variants.
RESULTS
E-SNPs&GO is a novel method that, given an input protein sequence and a single amino acid variation, can predict whether the variation is related to diseases or not. The proposed method adopts an input encoding completely based on protein language models and embedding techniques, specifically devised to encode protein sequences and GO functional annotations. We trained our model on a newly generated dataset of 101 146 human protein single amino acid variants in 13 661 proteins, derived from public resources. When tested on a blind set comprising 10 266 variants, our method compares well with recent approaches in the literature for the same task, reaching a Matthews Correlation Coefficient score of 0.72. We propose E-SNPs&GO as a suitable, efficient and accurate large-scale annotator of protein variant datasets.
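For readers unfamiliar with the metric reported above, the Matthews Correlation Coefficient for a binary classifier is computed from the four confusion-matrix counts. The sketch below uses the standard definition; the function name and the convention of returning 0.0 when the denominator vanishes are choices made here, not details from the paper.

```python
import math


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Common convention when a row or column of the confusion matrix is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom
```

The score ranges from -1 (total disagreement) through 0 (no better than chance) to 1 (perfect prediction), and it remains informative when the two classes are imbalanced.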
AVAILABILITY AND IMPLEMENTATION
The method is available as a webserver at https://esnpsandgo.biocomp.unibo.it. Datasets and predictions are available at https://esnpsandgo.biocomp.unibo.it/datasets.
SUPPLEMENTARY INFORMATION
Supplementary data are available at Bioinformatics online.
Topics: Humans; Polymorphism, Single Nucleotide; Artificial Intelligence; Amino Acid Sequence; Proteins; Amino Acids; Computational Biology; Molecular Sequence Annotation
PubMed: 36227117
DOI: 10.1093/bioinformatics/btac678
-
Briefings in Bioinformatics Jan 2023
Protein language modeling is a fast-emerging deep learning method in bioinformatics with diverse applications such as structure prediction and protein design. However, application toward estimating sequence conservation for functional site prediction has not been systematically explored. Here, we present a method for the alignment-free estimation of sequence conservation using sequence embeddings generated from protein language models. Comprehensive benchmarks across publicly available protein language models reveal that ESM2 models provide the best performance to computational cost ratio for conservation estimation. Applying our method to full-length protein sequences, we demonstrate that embedding-based methods are not sensitive to the order of conserved elements: conservation scores can be calculated for multidomain proteins in a single run, without the need to separate individual domains. Our method can also identify conserved functional sites within fast-evolving sequence regions (such as domain inserts), which we demonstrate through the identification of conserved phosphorylation motifs in variable insert segments in protein kinases. Overall, embedding-based conservation analysis is a broadly applicable method for identifying potential functional sites in any full-length protein sequence and estimating conservation in an alignment-free manner. To run this on your protein sequence of interest, try our scripts at https://github.com/esbgkannan/kibby.
Topics: Amino Acid Sequence; Proteins; Computational Biology; Conserved Sequence
PubMed: 36631405
DOI: 10.1093/bib/bbac599
-
Nucleic Acids Research Apr 2022
Alternative splicing is frequently involved in the diversification of protein function and can also be modulated for therapeutic purposes. Here we develop a predictive model, called Exon ByPASS (predicting Exon skipping Based on Protein amino acid SequenceS), to assess the criticality of exon inclusion based solely on information contained in the amino acid sequence upstream and downstream of the exon junctions. By focusing on protein sequence, Exon ByPASS predicts exon skipping independent of tissue and species in the absence of any intronic information. We validate model predictions using transcriptomic and proteomic data and show that the model can capture exon skipping in different tissues and species. Additionally, we reveal potential therapeutic opportunities by predicting synthetically skippable exons and neo-junctions arising in cancer cells.
Topics: Alternative Splicing; Amino Acid Sequence; Exons; Introns; Proteomics
PubMed: 35286381
DOI: 10.1093/nar/gkac155
-
Bioinformatics (Oxford, England) Jan 2023
MOTIVATION
As more data of experimentally determined protein structures are becoming available, data-driven models to describe protein sequence-structure relationships become more feasible. Within this space, the amino acid sequence design of protein-protein interactions is still a rather challenging subproblem with very low success rates, yet it is central to most biological processes.
RESULTS
We developed an attention-based deep learning model inspired by algorithms used for image-caption assignments to design peptides or protein fragment sequences. Our trained model can be applied for the redesign of natural protein interfaces or the designed protein interaction fragments. Here, we validate the potential by recapitulating naturally occurring protein-protein interactions including antibody-antigen complexes. The designed interfaces accurately capture essential native interactions and have comparable native-like binding affinities in silico. Furthermore, our model does not need a precise backbone location, making it an attractive tool for working with de novo design of protein-protein interactions.
AVAILABILITY AND IMPLEMENTATION
The source code of the method is available at https://github.com/strauchlab/iNNterfaceDesign.
SUPPLEMENTARY INFORMATION
Supplementary data are available at Bioinformatics online.
Topics: Amino Acid Sequence; Deep Learning; Proteins; Algorithms; Peptides; Software
PubMed: 36377772
DOI: 10.1093/bioinformatics/btac733
-
International Journal of Molecular... Feb 2023
Derived from natural language processing (NLP) algorithms, protein language models enable the encoding of protein sequences, which are widely diverse in length and amino acid composition, in fixed-size numerical vectors (embeddings). We surveyed representative embedding models such as Esm, Esm1b, ProtT5, and SeqVec, along with their derivatives (GoPredSim and PLAST), to conduct the following tasks in computational biology: embedding the Saccharomyces cerevisiae proteome, gene ontology (GO) annotation of the uncharacterized proteins of this organism, relating variants of human proteins to disease status, correlating mutants of beta-lactamase TEM-1 from Escherichia coli with experimentally measured antimicrobial resistance, and analyzing diverse fungal mating factors. We discuss the advances and shortcomings, differences, and concordance of the models. Of note, all of the models revealed that the uncharacterized proteins in yeast tend to be less than 200 amino acids long, contain fewer aspartates and glutamates, and are enriched for cysteine. Less than half of these proteins can be annotated with GO terms with high confidence. The distribution of the cosine similarity scores of benign and pathogenic mutations to the reference human proteins shows a statistically significant difference. The differences in embeddings of the reference TEM-1 and mutants have low to no correlation with minimal inhibitory concentrations (MIC).
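The cosine similarity score used above to compare a variant's embedding with that of the reference protein can be sketched in a few lines. The function name and the plain-list representation of embeddings are assumptions for illustration; real embeddings from the surveyed models are typically arrays with hundreds to thousands of dimensions.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two fixed-size embedding vectors."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)
```

Because the score depends only on direction and not on vector length, it suits embeddings whose overall magnitude carries no biological meaning.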
Topics: Humans; Algorithms; Amino Acid Sequence; Amino Acids; Computational Biology; Proteins; Saccharomyces cerevisiae; Proteomics
PubMed: 36835188
DOI: 10.3390/ijms24043775