-
Journal of Proteome Research Feb 2022
Topics: Alzheimer Disease; Amino Acid Sequence; Aspartic Acid; Humans; Isomerism
PubMed: 35114789
DOI: 10.1021/acs.jproteome.2c00016 -
Neural networks to learn protein sequence-function relationships from deep mutational scanning data.
Proceedings of the National Academy of... Nov 2021
The mapping from protein sequence to function is highly complex, making it challenging to predict how sequence changes will affect a protein's behavior and properties. We present a supervised deep learning framework to learn the sequence-function mapping from deep mutational scanning data and make predictions for new, uncharacterized sequence variants. We test multiple neural network architectures, including a graph convolutional network that incorporates protein structure, to explore how a network's internal representation affects its ability to learn the sequence-function mapping. Our supervised learning approach displays superior performance over physics-based and unsupervised prediction methods. We find that networks that capture nonlinear interactions and share parameters across sequence positions are important for learning the relationship between sequence and function. Further analysis of the trained models reveals the networks' ability to learn biologically meaningful information about protein structure and mechanism. Finally, we demonstrate the models' ability to navigate sequence space and design new proteins beyond the training set. We applied the protein G B1 domain (GB1) models to design a sequence that binds to immunoglobulin G with substantially higher affinity than wild-type GB1.
Topics: Algorithms; Amino Acid Sequence; Biochemical Phenomena; Deep Learning; Machine Learning; Mutation; Neural Networks, Computer; Proteins; Sequence Analysis, Protein; Structure-Activity Relationship
PubMed: 34815338
DOI: 10.1073/pnas.2104878118 -
Biochemistry Jun 2017
Review
Every amino acid exhibits a different propensity for distinct structural conformations. Hence, decoding how the primary amino acid sequence undergoes the transition to a defined secondary structure and its final three-dimensional fold is presently considered predictable with reasonable certainty. However, protein sequences that defy the first principles of secondary structure prediction (they attain two different folds) have recently been discovered. Such proteins, aptly named metamorphic proteins, decrease the conformational constraint by increasing flexibility in the secondary structure and thereby result in efficient functionality. In this review, we discuss the major factors driving the conformational switch related both to protein sequence and to structure using illustrative examples. We discuss the concept of an evolutionary transition in sequence and structure, the functional impact of the tertiary fold, and the pressure of intrinsic and external factors that give rise to metamorphic proteins. We mainly focus on the major components of protein architecture, namely, the α-helix and β-sheet segments, which are involved in conformational switching within the same or highly similar sequences. These chameleonic sequences are widespread in both cytosolic and membrane proteins, and these folds are equally important for protein structure and function. We discuss the implications of metamorphic proteins and chameleonic peptide sequences in de novo peptide design.
Topics: Amino Acid Sequence; Humans; Membrane Proteins; Protein Conformation; Protein Folding
PubMed: 28570055
DOI: 10.1021/acs.biochem.7b00375 -
Computational and Mathematical Methods... 2022
Many effective methods extract and fuse different protein features to study the relationship between protein sequence, structure, and function, but individual methods suit different structure and function problems, so designing more effective prediction methods requires selecting the features that contribute most. This work focused on feature selection methods in the study of protein structure and function, and systematically compared and analyzed the efficiency of different feature selection methods in the prediction of protein structures, protein disorders, protein molecular chaperones, and protein solubility. The results show that the feature selection method based on nonlinear SVM performs best in protein structure prediction, protein molecular chaperone prediction, and protein solubility prediction. After selection, the accuracy achieved with the features improves by 13.16% to 71%, especially for the Kmer and PSSM features of proteins.
Topics: Humans; Databases, Protein; Proteins; Amino Acid Sequence
PubMed: 36267316
DOI: 10.1155/2022/1650693 -
Bioinformatics (Oxford, England) Sep 2019
MOTIVATION
Due to the risk of inducing an immediate Type I (IgE-mediated) allergic response, proteins intended for use in consumer products must be investigated for their allergenic potential before introduction into the marketplace. The FAO/WHO guidelines for computational assessment of allergenic potential of proteins based on short peptide hits and linear sequence window identity thresholds misclassify many proteins as allergens.
RESULTS
We developed AllerCatPro, which predicts the allergenic potential of proteins based on the similarity of their 3D protein structure as well as their amino acid sequence to a data set of known protein allergens comprising 4180 unique allergenic protein sequences derived from the union of the major databases Food Allergy Research and Resource Program, Comprehensive Protein Allergen Resource, WHO/International Union of Immunological Societies, UniProtKB and Allergome. We extended the hexamer hit rule by removing peptides with a high probability of random occurrence, measured by sequence entropy, and by requiring 3 or more hexamer hits consistent with natural linear epitope patterns in known allergens. This is complemented with a gluten-like repeat pattern detection. We also switched from a linear sequence window similarity to a B-cell epitope-like 3D surface similarity window, which became possible through extensive 3D structure modeling covering the majority (74%) of allergens. If no structure similarity is found, the decision workflow reverts to the old linear sequence window rule. The overall accuracy of AllerCatPro is 84%, compared with other current methods, which range from 51 to 73%. Both the FAO/WHO rules and AllerCatPro achieve the highest sensitivity, but AllerCatPro provides a 37-fold increase in specificity.
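The extended hexamer hit rule described above can be illustrated with a minimal Python sketch. It is not the AllerCatPro implementation: the function names, the entropy cutoff of 1.5 bits, and the comparison against a single allergen sequence (rather than the full allergen data set) are assumptions made for illustration, and the additional check that hits follow natural linear epitope patterns is not modeled.

```python
import math
from collections import Counter


def shannon_entropy(peptide):
    """Shannon entropy (bits) of the residue composition of a peptide."""
    counts = Counter(peptide)
    n = len(peptide)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def hexamers(seq, k=6):
    """All overlapping k-mers of a sequence, in order of occurrence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def count_hexamer_hits(query, allergen, min_entropy=1.5):
    """Count query hexamers found in the allergen, skipping low-entropy ones.

    min_entropy is a hypothetical cutoff, not a value from the abstract.
    """
    allowed = {h for h in hexamers(allergen) if shannon_entropy(h) >= min_entropy}
    return sum(1 for h in hexamers(query) if h in allowed)


def passes_hexamer_rule(query, allergen, min_hits=3, min_entropy=1.5):
    """Apply the '3 or more hexamer hits' threshold stated in the abstract."""
    return count_hexamer_hits(query, allergen, min_entropy) >= min_hits
```

Under these assumptions, a query sharing a ten-residue stretch with the allergen yields five overlapping hexamer hits and passes the rule, whereas a poly-alanine stretch has zero entropy and contributes no hits at all.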
AVAILABILITY AND IMPLEMENTATION
https://allercatpro.bii.a-star.edu.sg/.
SUPPLEMENTARY INFORMATION
Supplementary data are available at Bioinformatics online.
Topics: Allergens; Amino Acid Sequence; Databases, Protein; Food Hypersensitivity; Humans; Proteins; Sequence Alignment
PubMed: 30657872
DOI: 10.1093/bioinformatics/btz029 -
Combinatorial Chemistry & High... 2018
AIM AND OBJECTIVE
The rapid increase in the amount of protein sequence data available leads to an urgent need for novel computational algorithms to analyze and compare these sequences. This study is undertaken to develop an efficient computational approach for the timely encoding of protein sequences and the extraction of hidden information.
METHODS
Based on two physicochemical properties of amino acids, a protein primary sequence was converted into a three-letter sequence, and then a graph without loops and multiple edges and its geometric line adjacency matrix were obtained. A generalized PseAAC (pseudo amino acid composition) model was thus constructed to characterize a protein sequence numerically.
RESULTS
By using the proposed mathematical descriptor of a protein sequence, similarity comparisons were made among β-globin proteins of 17 species and among 72 spike proteins of coronaviruses. The resulting clusters agreed well with the established taxonomic groups. In addition, a generalized PseAAC based SVM (support vector machine) model was developed to identify DNA-binding proteins. Experimental results showed that our method performed better than DNAbinder, DNA-Prot, iDNA-Prot and enDNA-Prot by 3.29-10.44% in terms of ACC, 0.056-0.206 in terms of MCC, and 1.45-15.76% in terms of F1M. When the benchmark dataset was expanded with negative samples, the presented approach outperformed the four previous methods with improvement in the range of 2.49-19.12% in terms of ACC, 0.05-0.32 in terms of MCC, and 3.82-33.85% in terms of F1M.
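For readers unfamiliar with the reported metrics, the following Python sketch computes accuracy (ACC), the Matthews correlation coefficient (MCC), and the F1 score from the counts of a binary confusion matrix. It assumes that F1M in the abstract denotes the standard F1 measure; the function name and the convention of returning 0.0 when a denominator vanishes are illustrative choices, not taken from the paper.

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """Return ACC, MCC and F1 for a binary classifier's confusion counts."""
    total = tp + tn + fp + fn
    acc = (tp + tn) / total

    # MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    # F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall
    f1_denom = 2 * tp + fp + fn
    f1 = 2 * tp / f1_denom if f1_denom else 0.0

    return {"ACC": acc, "MCC": mcc, "F1": f1}
```

ACC and F1 lie between 0 and 1, while MCC ranges from -1 to 1, which is why the abstract reports ACC and F1M improvements as percentages and MCC improvements as absolute differences.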
CONCLUSION
These results suggested that the generalized PseAAC model was very efficient for comparison and analysis of protein sequences, and very competitive in identifying DNA-binding proteins.
Topics: Algorithms; Amino Acid Sequence; Amino Acids; Computational Biology; Computer Graphics; DNA-Binding Proteins; Datasets as Topic; Phylogeny; Sequence Homology, Amino Acid; Support Vector Machine
PubMed: 29380690
DOI: 10.2174/1386207321666180130100838 -
IEEE/ACM Transactions on Computational... 2020
Review
As the first step of machine-learning-based protein structure and function prediction, amino acid encoding plays a fundamental role in the final success of those methods. Unlike protein sequence encoding, amino acid encoding can be used in both residue-level and sequence-level prediction of protein properties by combining it with different algorithms. However, it has not attracted enough attention in the past decades, and there are no comprehensive reviews and assessments of encoding methods so far. In this article, we make a systematic classification and present a comprehensive review and assessment of various amino acid encoding methods. Those methods are grouped into five categories according to their information sources and information extraction methodologies: binary encoding, physicochemical properties encoding, evolution-based encoding, structure-based encoding, and machine-learning encoding. Then, 16 representative methods from the five categories are selected and compared on protein secondary structure prediction and protein fold recognition tasks using large-scale benchmark datasets. The results show that the evolution-based position-dependent encoding method PSSM achieved the best performance, and that the structure-based and machine-learning encoding methods also show some potential for further application; the neural network based distributed representation of amino acids in particular may bring new light to this area. We hope that the review and assessment are useful for future studies in amino acid encoding.
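As an illustration of the simplest of the five categories, binary encoding, the Python sketch below maps each residue to a 20-dimensional one-hot vector. The alphabet ordering, the function name, and the choice to encode unknown residues as all-zero vectors are assumptions made here; the review may define its binary encodings differently.

```python
# The 20 standard amino acids in alphabetical one-letter order (an assumed ordering).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence):
    """Encode a protein sequence as a list of 20-dimensional 0/1 vectors.

    Residues outside the standard alphabet (e.g. 'X') become all-zero rows.
    """
    encoded = []
    for residue in sequence.upper():
        row = [0] * len(AMINO_ACIDS)
        index = AA_INDEX.get(residue)
        if index is not None:
            row[index] = 1
        encoded.append(row)
    return encoded
```

Because the output has one row per residue, the same matrix can feed residue-level predictors directly or be pooled into a fixed-length vector for sequence-level prediction.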
Topics: Algorithms; Amino Acid Sequence; Amino Acids; Computational Biology; Protein Folding; Protein Structure, Secondary; Proteins; Sequence Analysis, Protein
PubMed: 30998480
DOI: 10.1109/TCBB.2019.2911677 -
Mass Spectrometry Reviews Mar 2022
Review
The present review covers available results on the application of FT-MS for the de novo sequencing of natural peptides of various animals: cones, bees, snakes, amphibians, scorpions, and so forth. As these peptides are usually bioactive, the animals efficiently use them as a weapon against microorganisms or higher animals, including predators. These peptides are of definite interest as drugs of future generations, since their mechanism of activity is completely different from that of modern antibiotics. Using those peptides as antibiotics could eliminate the problem of bacterial resistance development. Sequence elucidation of these bioactive peptides becomes even more challenging when the species genome is not available and little is known about the protein origin and other properties of the peptides under study. De novo sequencing may be the only option to obtain sequence information. The benefits of FT-MS for top-down peptide sequencing, the general approaches to de novo sequencing, and the difficult cases involving sequence coverage, isobaric and isomeric amino acids, cyclization of short peptides, and the presence of posttranslational modifications will be discussed in the review.
Topics: Amino Acid Sequence; Animals; Mass Spectrometry; Peptides; Proteins; Sequence Analysis, Protein
PubMed: 33347655
DOI: 10.1002/mas.21678 -
Combinatorial Chemistry & High... 2022
AIM AND OBJECTIVE
Similarity comparison of biological sequences is an important task in bioinformatics. Methods for comparing biological sequences fall into two classes: sequence alignment methods and alignment-free methods. The graphical representation of biological sequences is a kind of alignment-free method, which provides a tool for analyzing and visualizing biological sequences. In this article, a generalized iterative map of protein sequences is proposed to analyze the similarities of biological sequences.
MATERIALS AND METHODS
Based on the normalized physicochemical indexes of 20 amino acids, each amino acid can be mapped to a point in 5D space. A generalized iterative function system was introduced to define a generalized iterative map of protein sequences, which can not only reflect various physicochemical properties of amino acids but also incorporate different compression ratios for the components of the map. Several properties were proved to illustrate the advantage of the generalized iterative map. A mathematical description of the generalized iterative map was proposed to compare the similarities and dissimilarities of protein sequences. Based on this method, similarities/dissimilarities were compared among ND5 protein sequences, as well as among ND6 protein sequences, of ten different species.
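The abstract does not state the map's formula, so the Python sketch below only illustrates the general idea of an iterated function system with a separate compression ratio per coordinate: each step moves the current point toward the point assigned to the next residue. The update rule, the starting point, the function name, and the toy two-letter alphabet in the usage note are all assumptions; real use would supply the 5D normalized physicochemical indexes of the 20 amino acids.

```python
def iterative_map(sequence, points, ratios, start=None):
    """Trace a sequence through a per-coordinate contraction map.

    points: dict mapping each residue to its coordinates (e.g. a 5D tuple).
    ratios: one compression ratio per coordinate, each between 0 and 1.
    Assumed update: x_new[j] = x[j] + ratios[j] * (points[residue][j] - x[j]).
    """
    dim = len(ratios)
    current = list(start) if start is not None else [0.5] * dim
    trajectory = []
    for residue in sequence:
        target = points[residue]
        current = [x + r * (t - x) for x, r, t in zip(current, ratios, target)]
        trajectory.append(tuple(current))
    return trajectory
```

With toy points `{"A": (0,)*5, "B": (1,)*5}` and all ratios set to 0.5, the sequence "AB" starting from 0.5 moves to 0.25 and then to 0.625 in every coordinate. Two sequences can then be compared through a distance between their trajectories or between summary descriptors derived from them.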
RESULTS
By correlation analysis, the ClustalW results were compared with our similarity/dissimilarity results and other graphical representation results to show the utility of our approach. The comparison results show that our approach has better correlations with ClustalW for all species than other approaches and illustrate the effectiveness of our approach.
CONCLUSION
Two examples show that our method not only has good performances and effects in the similarity/dissimilarity analysis of protein sequences but also does not require complex computation.
Topics: Algorithms; Amino Acid Sequence; Computational Biology; Proteins; Sequence Alignment; Sequence Analysis, Protein
PubMed: 33045963
DOI: 10.2174/1386207323666201012142318 -
Briefings in Bioinformatics Jan 2023
Protein language modeling is a fast-emerging deep learning method in bioinformatics with diverse applications such as structure prediction and protein design. However, application toward estimating sequence conservation for functional site prediction has not been systematically explored. Here, we present a method for the alignment-free estimation of sequence conservation using sequence embeddings generated from protein language models. Comprehensive benchmarks across publicly available protein language models reveal that ESM2 models provide the best performance to computational cost ratio for conservation estimation. Applying our method to full-length protein sequences, we demonstrate that embedding-based methods are not sensitive to the order of conserved elements: conservation scores can be calculated for multidomain proteins in a single run, without the need to separate individual domains. Our method can also identify conserved functional sites within fast-evolving sequence regions (such as domain inserts), which we demonstrate through the identification of conserved phosphorylation motifs in variable insert segments in protein kinases. Overall, embedding-based conservation analysis is a broadly applicable method for identifying potential functional sites in any full-length protein sequence and estimating conservation in an alignment-free manner. To run this on your protein sequence of interest, try our scripts at https://github.com/esbgkannan/kibby.
Topics: Amino Acid Sequence; Proteins; Computational Biology; Conserved Sequence
PubMed: 36631405
DOI: 10.1093/bib/bbac599