protein sequence - OpenMD.com Journal Search

MUSCLE: multiple sequence alignment with high accuracy and high throughput.

Nucleic Acids Research 2004

Comparative Study

Authors: Robert C Edgar

We describe MUSCLE, a new computer program for creating multiple alignments of protein sequences. Elements of the algorithm include fast distance estimation using kmer counting, progressive alignment using a new profile function we call the log-expectation score, and refinement using tree-dependent restricted partitioning. The speed and accuracy of MUSCLE are compared with T-Coffee, MAFFT and CLUSTALW on four test sets of reference alignments: BAliBASE, SABmark, SMART and a new benchmark, PREFAB. MUSCLE achieves the highest, or joint highest, rank in accuracy on each of these sets. Without refinement, MUSCLE achieves average accuracy statistically indistinguishable from T-Coffee and MAFFT, and is the fastest of the tested methods for large numbers of sequences, aligning 5000 sequences of average length 350 in 7 min on a current desktop computer. The MUSCLE program, source code and PREFAB test data are freely available at http://www.drive5.com/muscle.

Topics: Algorithms; Amino Acid Motifs; Amino Acid Sequence; Internet; Molecular Sequence Data; Reproducibility of Results; Sequence Alignment; Sequence Analysis, Protein; Software; Time Factors

PubMed: 15034147
DOI: 10.1093/nar/gkh340
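
The "fast distance estimation using kmer counting" mentioned in the MUSCLE abstract can be illustrated with a small sketch. This is a generic fractional shared k-mer measure, not the program's exact formula; the function names `kmer_counts` and `kmer_distance` and the default `k = 3` are assumptions made for illustration.

```python
from collections import Counter


def kmer_counts(seq: str, k: int = 3) -> Counter:
    """Count overlapping k-mers in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_distance(a: str, b: str, k: int = 3) -> float:
    """Distance = 1 - (shared k-mers / k-mers in the shorter sequence).

    Illustrative measure only; assumed for this sketch, not taken
    from the MUSCLE source code.
    """
    denom = min(len(a), len(b)) - k + 1
    if denom <= 0:
        raise ValueError("both sequences must be at least k residues long")
    # Counter intersection keeps the minimum count of each k-mer.
    shared = sum((kmer_counts(a, k) & kmer_counts(b, k)).values())
    return 1.0 - shared / denom


if __name__ == "__main__":
    print(kmer_distance("MKVLAT", "MKVLGT"))
```

No alignment is computed here, which is why k-mer based estimates of this kind are cheap enough to apply to every pair of sequences before a guide tree is built.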

Adaptive machine learning for protein engineering.

Current Opinion in Structural Biology Feb 2022

Review

Authors: Brian L Hie, Kevin K Yang

Machine-learning models that learn from data to predict how protein sequence encodes function are emerging as a useful protein engineering tool. However, when using these models to suggest new protein designs, one must deal with the vast combinatorial complexity of protein sequences. Here, we review how to use a sequence-to-function machine-learning surrogate model to select sequences for experimental measurement. First, we discuss how to select sequences through a single round of machine-learning optimization. Then, we discuss sequential optimization, where the goal is to discover optimized sequences and improve the model across multiple rounds of training, optimization, and experimental measurement.

Topics: Amino Acid Sequence; Machine Learning; Protein Engineering; Proteins

PubMed: 34896756
DOI: 10.1016/j.sbi.2021.11.002
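
The sequential optimization described in the review above (repeated rounds of training, optimization, and measurement) can be sketched as a loop. Everything here is a toy assumption: the additive per-position surrogate, the purely greedy selection rule, and the names `fit_additive`, `predict`, and `run_rounds` are hypothetical and do not come from the review. The `measure` callback stands in for a wet-lab assay.

```python
import random


def fit_additive(measured):
    """Toy surrogate: mean measured value per (position, residue) pair."""
    sums, counts = {}, {}
    for seq, y in measured.items():
        for key in enumerate(seq):  # key is (position, residue)
            sums[key] = sums.get(key, 0.0) + y
            counts[key] = counts.get(key, 0) + 1
    return {key: sums[key] / counts[key] for key in sums}


def predict(model, seq, default=0.0):
    """Average the per-position scores; unseen pairs score `default`."""
    return sum(model.get(key, default) for key in enumerate(seq)) / len(seq)


def run_rounds(candidates, measure, n_rounds=3, batch=4, seed=0):
    """Alternate between fitting the surrogate and measuring its top picks."""
    rng = random.Random(seed)
    # Round 0: measure a random starting batch.
    measured = {s: measure(s) for s in rng.sample(candidates, batch)}
    for _ in range(n_rounds):
        model = fit_additive(measured)
        pool = [s for s in candidates if s not in measured]
        if not pool:
            break
        # Greedy choice: take the highest predicted, not yet measured.
        pool.sort(key=lambda s: predict(model, s), reverse=True)
        for s in pool[:batch]:
            measured[s] = measure(s)
    return measured


if __name__ == "__main__":
    cands = [a + b + c for a in "AC" for b in "AC" for c in "AC"]
    results = run_rounds(cands, lambda s: s.count("A"))
    print(max(results, key=results.get))
```

The greedy rule only exploits the current model. A selection rule that also rewards uncertain candidates would trade some immediate gain for a better model in later rounds.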

Accurate and efficient protein sequence design through learning concise local environment of residues.

Bioinformatics (Oxford, England) Mar 2023

Authors: Bin Huang, Tingwen Fan, Kaiyue Wang...

MOTIVATION

Computational protein sequence design has been widely applied in rational protein engineering and increasing the design accuracy and efficiency is highly desired.

RESULTS

Here, we present ProDESIGN-LE, an accurate and efficient approach to protein sequence design. ProDESIGN-LE adopts a concise but informative representation of the residue's local environment and trains a transformer to learn the correlation between local environment of residues and their amino acid types. For a target backbone structure, ProDESIGN-LE uses the transformer to assign an appropriate residue type for each position based on its local environment within this structure, eventually acquiring a designed sequence with all residues fitting well with their local environments. We applied ProDESIGN-LE to design sequences for 68 naturally occurring and 129 hallucinated proteins within 20 s per protein on average. The designed proteins have their predicted structures perfectly resembling the target structures with a state-of-the-art average TM-score exceeding 0.80. We further experimentally validated ProDESIGN-LE by designing five sequences for an enzyme, chloramphenicol O-acetyltransferase type III (CAT III), and recombinantly expressing the proteins in Escherichia coli. Of these proteins, three exhibited excellent solubility, and one yielded monomeric species with circular dichroism spectra consistent with the natural CAT III protein.

AVAILABILITY AND IMPLEMENTATION

The source code of ProDESIGN-LE is available at https://github.com/bigict/ProDESIGN-LE.

Topics: Amino Acid Sequence; Proteins; Software

PubMed: 36916746
DOI: 10.1093/bioinformatics/btad122

Protein sequence models for prediction and comparative analysis of the SARS-CoV-2-human interactome.

Pacific Symposium on Biocomputing.... 2021

Authors: Meghana Kshirsagar, Nure Tasnina, Michael D Ward...

Viruses such as the novel coronavirus, SARS-CoV-2, that is wreaking havoc on the world, depend on interactions of their own proteins with those of the human host cells. Relatively small changes in sequence such as between SARS-CoV and SARS-CoV-2 can dramatically change clinical phenotypes of the virus, including transmission rates and severity of the disease. On the other hand, highly dissimilar virus families such as Coronaviridae, Ebola, and HIV have overlap in functions. In this work we aim to analyze the role of protein sequence in the binding of SARS-CoV-2 virus proteins towards human proteins and compare it to that of the above other viruses. We build supervised machine learning models, using Generalized Additive Models to predict interactions based on sequence features and find that our models perform well with an AUC-PR of 0.65 in a class-skew of 1:10. Analysis of the novel predictions using an independent dataset showed statistically significant enrichment. We further map the importance of specific amino-acid sequence features in predicting binding and summarize what combinations of sequences from the virus and the host are correlated with an interaction. By analyzing the sequence-based embeddings of the interactomes from different viruses and clustering them together we find some functionally similar proteins from different viruses. For example, vif protein from HIV-1, vp24 from Ebola and orf3b from SARS-CoV all function as interferon antagonists. Furthermore, we can differentiate the functions of similar viruses, for example orf3a's interactions are more diverged than orf7b interactions when comparing SARS-CoV and SARS-CoV-2.

Topics: Amino Acid Sequence; COVID-19; Computational Biology; Humans; Proteins; SARS-CoV-2

PubMed: 33691013
DOI: No ID Found
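
To put the AUC-PR of 0.65 at a class skew of 1:10 reported above in context, the sketch below computes average precision (one common estimate of the area under the precision-recall curve) and the chance-level precision for a given skew. The function names `average_precision` and `chance_level` are assumptions for illustration, and the paper may have computed its AUC-PR differently.

```python
def average_precision(labels, scores):
    """Mean of the precision values at the rank of each true positive.

    labels: 0/1 ground truth; scores: higher means more likely positive.
    Tied scores keep their input order (Python's sort is stable).
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("at least one positive example is required")
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            total += hits / rank  # precision at this recall step
    return total / n_pos


def chance_level(n_pos, n_neg):
    """Expected precision of a random ranking equals the positive rate."""
    return n_pos / (n_pos + n_neg)


if __name__ == "__main__":
    print(round(chance_level(1, 10), 3))
    print(average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]))
```

With one positive for every ten negatives, a random ranking is expected to reach a precision of about 1/11, so scores on this metric should be read against that baseline rather than against 0.5.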

An accurate alignment-free protein sequence comparator based on physicochemical properties of amino acids.

Scientific Reports Jul 2022

Authors: Saeedeh Akbari Rokn Abadi, Azam Sadat Abdosalehi, Faezeh Pouyamehr...

Bio-sequence comparators are one of the most basic and significant methods for assessing biological data, and so, due to the importance of proteins, protein sequence comparators are particularly crucial. On the other hand, the complexity of the problem, the growing number of extracted protein sequences, and the growth of studies and data analysis applications addressing protein sequences have necessitated the development of a rapid and accurate approach to account for the complexities in this field. As a result, we propose a protein sequence comparison approach, called PCV, which improves comparison accuracy by producing vectors that encode sequence data as well as physicochemical properties of the amino acids. At the same time, by partitioning the long protein sequences into fixed-length blocks and providing an encoding vector for each block, this method allows for parallel and fast implementation. To evaluate the performance of PCV, like other alignment-free methods, we used 12 benchmark datasets including classes with homologous sequences which may require a simple preprocessing search tool to select the homologous data. We then compared the protein sequence comparison outcomes to those of alternative alignment-based and alignment-free methods, using various evaluation criteria. These results indicate that our method provides significant improvement in sequence classification accuracy compared to the alternative alignment-free methods, and has an average correlation of about 94% with the ClustalW method as our reference method, while considerably reducing the processing time.

Topics: Algorithms; Amino Acid Sequence; Amino Acids; Proteins; Sequence Alignment

PubMed: 35778592
DOI: 10.1038/s41598-022-15266-8
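
The block-wise encoding idea in the PCV abstract above (fixed-length blocks, one vector per block, combining sequence data with physicochemical properties) can be sketched as follows. This is not the PCV encoding itself: the choice of amino acid composition plus mean Kyte-Doolittle hydropathy as features, the block length of 50, and the function names are all assumptions for illustration.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy index, used here as one example property.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def split_blocks(seq, block_len):
    """Cut a sequence into consecutive blocks; the last may be shorter."""
    return [seq[i:i + block_len] for i in range(0, len(seq), block_len)]


def encode_block(block):
    """Return 20 composition fractions followed by mean hydropathy."""
    composition = [block.count(aa) / len(block) for aa in AMINO_ACIDS]
    # Non-standard residues contribute 0.0 to the hydropathy sum.
    hydropathy = sum(KD.get(aa, 0.0) for aa in block) / len(block)
    return composition + [hydropathy]


def encode_sequence(seq, block_len=50):
    """Encode each block independently, so blocks could run in parallel."""
    return [encode_block(b) for b in split_blocks(seq.upper(), block_len)]


if __name__ == "__main__":
    vectors = encode_sequence("MKVLATGLLA" * 12, block_len=50)
    print(len(vectors), len(vectors[0]))
```

Because each block is encoded without reference to the others, the per-block work can be distributed across workers, which is the property the abstract credits for fast implementation.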

E-SNPs&GO: embedding of protein sequence and function improves the annotation of human pathogenic variants.

Bioinformatics (Oxford, England) Nov 2022

Authors: Matteo Manfredi, Castrense Savojardo, Pier Luigi Martelli...

MOTIVATION

The advent of massive DNA sequencing technologies is producing a huge number of human single-nucleotide polymorphisms occurring in protein-coding regions and possibly changing their sequences. Discriminating harmful protein variations from neutral ones is one of the crucial challenges in precision medicine. Computational tools based on artificial intelligence provide models for protein sequence encoding, bypassing database searches for evolutionary information. We leverage the new encoding schemes for an efficient annotation of protein variants.

RESULTS

E-SNPs&GO is a novel method that, given an input protein sequence and a single amino acid variation, can predict whether the variation is related to diseases or not. The proposed method adopts an input encoding completely based on protein language models and embedding techniques, specifically devised to encode protein sequences and GO functional annotations. We trained our model on a newly generated dataset of 101 146 human protein single amino acid variants in 13 661 proteins, derived from public resources. When tested on a blind set comprising 10 266 variants, our method well compares to recent approaches released in literature for the same task, reaching a Matthews Correlation Coefficient score of 0.72. We propose E-SNPs&GO as a suitable, efficient and accurate large-scale annotator of protein variant datasets.

AVAILABILITY AND IMPLEMENTATION

The method is available as a webserver at https://esnpsandgo.biocomp.unibo.it. Datasets and predictions are available at https://esnpsandgo.biocomp.unibo.it/datasets.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.

Topics: Humans; Polymorphism, Single Nucleotide; Artificial Intelligence; Amino Acid Sequence; Proteins; Amino Acids; Computational Biology; Molecular Sequence Annotation

PubMed: 36227117
DOI: 10.1093/bioinformatics/btac678
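
The Matthews Correlation Coefficient quoted in the E-SNPs&GO abstract above is computed from the four cells of a binary confusion matrix. The sketch below uses the standard formula; the function name `mcc` and the convention of returning 0.0 when the denominator is zero are choices made for this example.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts.

    tp/tn: true positives/negatives; fp/fn: false positives/negatives.
    Ranges from -1 (total disagreement) to +1 (perfect prediction).
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    # Convention assumed here: undefined cases are reported as 0.0.
    return numerator / denominator if denominator else 0.0


if __name__ == "__main__":
    print(mcc(tp=25, tn=25, fp=25, fn=25))
```

Unlike plain accuracy, the coefficient uses all four cells, so a predictor that labels every variant as neutral scores 0 even when neutral variants dominate the dataset.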

Alignment-free estimation of sequence conservation for identifying functional sites using protein sequence embeddings.

Briefings in Bioinformatics Jan 2023

Authors: Wayland Yeung, Zhongliang Zhou, Sheng Li...

Protein language modeling is a fast-emerging deep learning method in bioinformatics with diverse applications such as structure prediction and protein design. However, application toward estimating sequence conservation for functional site prediction has not been systematically explored. Here, we present a method for the alignment-free estimation of sequence conservation using sequence embeddings generated from protein language models. Comprehensive benchmarks across publicly available protein language models reveal that ESM2 models provide the best performance to computational cost ratio for conservation estimation. Applying our method to full-length protein sequences, we demonstrate that embedding-based methods are not sensitive to the order of conserved elements: conservation scores can be calculated for multidomain proteins in a single run, without the need to separate individual domains. Our method can also identify conserved functional sites within fast-evolving sequence regions (such as domain inserts), which we demonstrate through the identification of conserved phosphorylation motifs in variable insert segments in protein kinases. Overall, embedding-based conservation analysis is a broadly applicable method for identifying potential functional sites in any full-length protein sequence and estimating conservation in an alignment-free manner. To run this on your protein sequence of interest, try our scripts at https://github.com/esbgkannan/kibby.

Topics: Amino Acid Sequence; Proteins; Computational Biology; Conserved Sequence

PubMed: 36631405
DOI: 10.1093/bib/bbac599

Predicting exon criticality from protein sequence.

Nucleic Acids Research Apr 2022

Authors: Jigar Desai, Christopher Francis, Kenneth Longo...

Alternative splicing is frequently involved in the diversification of protein function and can also be modulated for therapeutic purposes. Here we develop a predictive model, called Exon ByPASS (predicting Exon skipping Based on Protein amino acid SequenceS), to assess the criticality of exon inclusion based solely on information contained in the amino acid sequence upstream and downstream of the exon junctions. By focusing on protein sequence, Exon ByPASS predicts exon skipping independent of tissue and species in the absence of any intronic information. We validate model predictions using transcriptomic and proteomic data and show that the model can capture exon skipping in different tissues and species. Additionally, we reveal potential therapeutic opportunities by predicting synthetically skippable exons and neo-junctions arising in cancer cells.

Topics: Alternative Splicing; Amino Acid Sequence; Exons; Introns; Proteomics

PubMed: 35286381
DOI: 10.1093/nar/gkac155

Deep learning of protein sequence design of protein-protein interactions.

Bioinformatics (Oxford, England) Jan 2023

Authors: Raulia Syrlybaeva, Eva-Maria Strauch

MOTIVATION

As more data of experimentally determined protein structures are becoming available, data-driven models to describe protein sequence-structure relationships become more feasible. Within this space, the amino acid sequence design of protein-protein interactions is still a rather challenging subproblem with very low success rates, yet it is central to most biological processes.

RESULTS

We developed an attention-based deep learning model inspired by algorithms used for image-caption assignments to design peptides or protein fragment sequences. Our trained model can be applied for the redesign of natural protein interfaces or the designed protein interaction fragments. Here, we validate the potential by recapitulating naturally occurring protein-protein interactions including antibody-antigen complexes. The designed interfaces accurately capture essential native interactions and have comparable native-like binding affinities in silico. Furthermore, our model does not need a precise backbone location, making it an attractive tool for working with de novo design of protein-protein interactions.

AVAILABILITY AND IMPLEMENTATION

The source code of the method is available at https://github.com/strauchlab/iNNterfaceDesign.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.

Topics: Amino Acid Sequence; Deep Learning; Proteins; Algorithms; Peptides; Software

PubMed: 36377772
DOI: 10.1093/bioinformatics/btac733

Survey of Protein Sequence Embedding Models.

International Journal of Molecular... Feb 2023

Authors: Chau Tran, Siddharth Khadkikar, Aleksey Porollo...

Derived from the natural language processing (NLP) algorithms, protein language models enable the encoding of protein sequences, which are widely diverse in length and amino acid composition, in fixed-size numerical vectors (embeddings). We surveyed representative embedding models such as Esm, Esm1b, ProtT5, and SeqVec, along with their derivatives (GoPredSim and PLAST), to conduct the following tasks in computational biology: embedding the Saccharomyces cerevisiae proteome, gene ontology (GO) annotation of the uncharacterized proteins of this organism, relating variants of human proteins to disease status, correlating mutants of beta-lactamase TEM-1 from Escherichia coli with experimentally measured antimicrobial resistance, and analyzing diverse fungal mating factors. We discuss the advances and shortcomings, differences, and concordance of the models. Of note, all of the models revealed that the uncharacterized proteins in yeast tend to be less than 200 amino acids long, contain fewer aspartates and glutamates, and are enriched for cysteine. Less than half of these proteins can be annotated with GO terms with high confidence. The distribution of the cosine similarity scores of benign and pathogenic mutations to the reference human proteins shows a statistically significant difference. The differences in embeddings of the reference TEM-1 and mutants have low to no correlation with minimal inhibitory concentrations (MIC).

Topics: Humans; Algorithms; Amino Acid Sequence; Amino Acids; Computational Biology; Proteins; Saccharomyces cerevisiae; Proteomics

PubMed: 36835188
DOI: 10.3390/ijms24043775
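
The cosine similarity scores used in the survey above to compare variant and reference embeddings follow a standard definition, sketched below on plain Python lists. The function name and the toy three-dimensional vectors are assumptions for illustration. Real protein embeddings are produced by the language models named in the abstract and have far more dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    reference = [0.2, -0.1, 0.7]   # toy stand-in for a reference embedding
    variant = [0.25, -0.05, 0.6]   # toy stand-in for a variant embedding
    print(cosine_similarity(reference, variant))
```

The measure depends only on direction, not magnitude, so two embeddings that differ by a scale factor are treated as identical.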