-
PLoS Computational Biology Apr 2022Connecting protein sequence to function is becoming increasingly relevant since high-throughput sequencing studies accumulate large amounts of genomic data. In order to...
Connecting protein sequence to function is becoming increasingly relevant since high-throughput sequencing studies accumulate large amounts of genomic data. In order to go beyond the existing database annotation, it is fundamental to understand the mechanisms underlying functional inheritance and divergence. If the homology relationship between proteins is known, can we determine whether the function diverged? In this work, we analyze different possibilities of protein sequence evolution after gene duplication and identify "inter-paralog inversions", i.e., sites where the relationship between the ancestry and the functional signal is decoupled. The amino acids in these sites are masked from being recognized by other prediction tools. Still, they play a role in functional divergence and could indicate a shift in protein function. We develop a method to specifically recognize inter-paralog amino acid inversions in a phylogeny and test it on real and simulated datasets. In a dataset built from the Epidermal Growth Factor Receptor (EGFR) sequences found in 88 fish species, we identify 19 amino acid sites that went through inversion after gene duplication, mostly located at the ligand-binding extracellular domain. Our work uncovers an outcome of protein duplications with direct implications in protein functional annotation and sequence evolution. The developed method is optimized to work with large protein datasets and can be readily included in a targeted protein analysis pipeline.
Topics: Amino Acid Sequence; Amino Acids; Animals; Chromosome Inversion; Evolution, Molecular; Gene Duplication; Phylogeny; Proteins
PubMed: 35377869
DOI: 10.1371/journal.pcbi.1010016 -
Cell Reports Methods May 2022
Topics: Protein Structure, Secondary; Amino Acid Sequence
PubMed: 35637914
DOI: 10.1016/j.crmeth.2022.100223 -
Journal of Proteome Research Feb 2022
Topics: Alzheimer Disease; Amino Acid Sequence; Aspartic Acid; Humans; Isomerism
PubMed: 35114789
DOI: 10.1021/acs.jproteome.2c00016 -
Neural networks to learn protein sequence-function relationships from deep mutational scanning data.Proceedings of the National Academy of... Nov 2021The mapping from protein sequence to function is highly complex, making it challenging to predict how sequence changes will affect a protein's behavior and properties....
The mapping from protein sequence to function is highly complex, making it challenging to predict how sequence changes will affect a protein's behavior and properties. We present a supervised deep learning framework to learn the sequence-function mapping from deep mutational scanning data and make predictions for new, uncharacterized sequence variants. We test multiple neural network architectures, including a graph convolutional network that incorporates protein structure, to explore how a network's internal representation affects its ability to learn the sequence-function mapping. Our supervised learning approach displays superior performance over physics-based and unsupervised prediction methods. We find that networks that capture nonlinear interactions and share parameters across sequence positions are important for learning the relationship between sequence and function. Further analysis of the trained models reveals the networks' ability to learn biologically meaningful information about protein structure and mechanism. Finally, we demonstrate the models' ability to navigate sequence space and design new proteins beyond the training set. We applied the protein G B1 domain (GB1) models to design a sequence that binds to immunoglobulin G with substantially higher affinity than wild-type GB1.
Topics: Algorithms; Amino Acid Sequence; Biochemical Phenomena; Deep Learning; Machine Learning; Mutation; Neural Networks, Computer; Proteins; Sequence Analysis, Protein; Structure-Activity Relationship
PubMed: 34815338
DOI: 10.1073/pnas.2104878118 -
Computational and Mathematical Methods... 2022Many effective methods extract and fuse different protein features to study the relationship between protein sequence, structure, and function, but different methods...
Many effective methods extract and fuse different protein features to study the relationship between protein sequence, structure, and function, but different methods have preferences in solving the research of protein structure and function, which requires selecting valuable and contributing features to design more effective prediction methods. This work mainly focused on the feature selection methods in the study of protein structure and function, and systematically compared and analyzed the efficiency of different feature selection methods in the prediction of protein structures, protein disorders, protein molecular chaperones, and protein solubility. The results show that the feature selection method based on nonlinear SVM performs best in protein structure prediction, protein solubility prediction, protein molecular chaperone prediction, and protein solubility prediction. After selection, the accuracy of features is improved by 13.16% ~71%, especially the Kmer features and PSSM features of proteins.
Topics: Humans; Databases, Protein; Proteins; Amino Acid Sequence
PubMed: 36267316
DOI: 10.1155/2022/1650693 -
Bioinformatics (Oxford, England) Sep 2019Due to the risk of inducing an immediate Type I (IgE-mediated) allergic response, proteins intended for use in consumer products must be investigated for their...
MOTIVATION
Due to the risk of inducing an immediate Type I (IgE-mediated) allergic response, proteins intended for use in consumer products must be investigated for their allergenic potential before introduction into the marketplace. The FAO/WHO guidelines for computational assessment of allergenic potential of proteins based on short peptide hits and linear sequence window identity thresholds misclassify many proteins as allergens.
RESULTS
We developed AllerCatPro which predicts the allergenic potential of proteins based on similarity of their 3D protein structure as well as their amino acid sequence compared with a data set of known protein allergens comprising of 4180 unique allergenic protein sequences derived from the union of the major databases Food Allergy Research and Resource Program, Comprehensive Protein Allergen Resource, WHO/International Union of Immunological Societies, UniProtKB and Allergome. We extended the hexamer hit rule by removing peptides with high probability of random occurrence measured by sequence entropy as well as requiring 3 or more hexamer hits consistent with natural linear epitope patterns in known allergens. This is complemented with a Gluten-like repeat pattern detection. We also switched from a linear sequence window similarity to a B-cell epitope-like 3D surface similarity window which became possible through extensive 3D structure modeling covering the majority (74%) of allergens. In case no structure similarity is found, the decision workflow reverts to the old linear sequence window rule. The overall accuracy of AllerCatPro is 84% compared with other current methods which range from 51 to 73%. Both the FAO/WHO rules and AllerCatPro achieve highest sensitivity but AllerCatPro provides a 37-fold increase in specificity.
AVAILABILITY AND IMPLEMENTATION
https://allercatpro.bii.a-star.edu.sg/.
SUPPLEMENTARY INFORMATION
Supplementary data are available at Bioinformatics online.
Topics: Allergens; Amino Acid Sequence; Databases, Protein; Food Hypersensitivity; Humans; Proteins; Sequence Alignment
PubMed: 30657872
DOI: 10.1093/bioinformatics/btz029 -
Bioinformatics (Oxford, England) Aug 2018Machine-learning models trained on protein sequences and their measured functions can infer biological properties of unseen sequences without requiring an understanding...
MOTIVATION
Machine-learning models trained on protein sequences and their measured functions can infer biological properties of unseen sequences without requiring an understanding of the underlying physical or biological mechanisms. Such models enable the prediction and discovery of sequences with optimal properties. Machine-learning models generally require that their inputs be vectors, and the conversion from a protein sequence to a vector representation affects the model's ability to learn. We propose to learn embedded representations of protein sequences that take advantage of the vast quantity of unmeasured protein sequence data available. These embeddings are low-dimensional and can greatly simplify downstream modeling.
RESULTS
The predictive power of Gaussian process models trained using embeddings is comparable to those trained on existing representations, which suggests that embeddings enable accurate predictions despite having orders of magnitude fewer dimensions. Moreover, embeddings are simpler to obtain because they do not require alignments, structural data, or selection of informative amino-acid properties. Visualizing the embedding vectors shows meaningful relationships between the embedded proteins are captured.
AVAILABILITY AND IMPLEMENTATION
The embedding vectors and code to reproduce the results are available at https://github.com/fhalab/embeddings_reproduction/.
SUPPLEMENTARY INFORMATION
Supplementary data are available at Bioinformatics online.
Topics: Amino Acid Sequence; Bacteria; Computational Biology; Eukaryota; Humans; Machine Learning; Models, Biological; Proteins; Sequence Analysis, Protein; Software
PubMed: 29584811
DOI: 10.1093/bioinformatics/bty178 -
Advanced Science (Weinheim,... Aug 2023Proteins are the building blocks of life, carrying out fundamental functions in biology. In computational biology, an effective protein representation facilitates many...
Proteins are the building blocks of life, carrying out fundamental functions in biology. In computational biology, an effective protein representation facilitates many important biological quantifications. Most existing protein representation methods are derived from self-supervised language models designed for text analysis. Proteins, however, are more than linear sequences of amino acids. Here, a multimodal deep learning framework for incorporating ≈1 million protein sequence, structure, and functional annotation (MASSA) is proposed. A multitask learning process with five specific pretraining objectives is presented to extract a fine-grained protein-domain feature. Through pretraining, multimodal protein representation achieves state-of-the-art performance in specific downstream tasks such as protein properties (stability and fluorescence), protein-protein interactions (shs27k/shs148k/string/skempi), and protein-ligand interactions (kinase, DUD-E), while achieving competitive results in secondary structure and remote homology tasks. Moreover, a novel optimal-transport-based metric with rich geometry awareness is introduced to quantify the dynamic transferability from the pretrained representation to the related downstream tasks, which provides a panoramic view of the step-by-step learning process. The pairwise distances between these downstream tasks are also calculated, and a strong correlation between the inter-task feature space distributions and adaptability is observed.
Topics: Algorithms; Proteins; Amino Acid Sequence; Amino Acids
PubMed: 37249398
DOI: 10.1002/advs.202301223 -
Bioinformatics (Oxford, England) Mar 2024Reliable prediction of protein thermostability from its sequence is valuable for both academic and industrial research. This prediction problem can be tackled using...
MOTIVATION
Reliable prediction of protein thermostability from its sequence is valuable for both academic and industrial research. This prediction problem can be tackled using machine learning and by taking advantage of the recent blossoming of deep learning methods for sequence analysis. These methods can facilitate training on more data and, possibly, enable the development of more versatile thermostability predictors for multiple ranges of temperatures.
RESULTS
We applied the principle of transfer learning to predict protein thermostability using embeddings generated by protein language models (pLMs) from an input protein sequence. We used large pLMs that were pre-trained on hundreds of millions of known sequences. The embeddings from such models allowed us to efficiently train and validate a high-performing prediction method using over one million sequences that we collected from organisms with annotated growth temperatures. Our method, TemStaPro (Temperatures of Stability for Proteins), was used to predict thermostability of CRISPR-Cas Class II effector proteins (C2EPs). Predictions indicated sharp differences among groups of C2EPs in terms of thermostability and were largely in tune with previously published and our newly obtained experimental data.
AVAILABILITY AND IMPLEMENTATION
TemStaPro software and the related data are freely available from https://github.com/ievapudz/TemStaPro and https://doi.org/10.5281/zenodo.7743637.
Topics: Proteins; Machine Learning; Software; Amino Acid Sequence; Language
PubMed: 38507682
DOI: 10.1093/bioinformatics/btae157 -
PloS One 2018The sequence space of five protein superfamilies was investigated by constructing sequence networks. The nodes represent individual sequences, and two nodes are...
The sequence space of five protein superfamilies was investigated by constructing sequence networks. The nodes represent individual sequences, and two nodes are connected by an edge if the global sequence identity of two sequences exceeds a threshold. The networks were characterized by their degree distribution (number of nodes with a given number of neighbors) and by their fractal network dimension. Although the five protein families differed in sequence length, fold, and domain arrangement, their network properties were similar. The fractal network dimension Df was distance-dependent: a high dimension for single and double mutants (Df = 4.0), which dropped to Df = 0.7-1.0 at 90% sequence identity, and increased to Df = 3.5-4.5 below 70% sequence identity. The distance dependency of the network dimension is consistent with evolutionary constraints for functional proteins. While random single and double mutations often result in a functional protein, the accumulation of more than ten mutations is dominated by epistasis. The networks of the five protein families were highly inhomogeneous with few highly connected communities ("hub sequences") and a large number of smaller and less connected communities. The degree distributions followed a power-law distribution with similar scaling exponents close to 1. Because the hub sequences have a large number of functional neighbors, they are expected to be robust toward possible deleterious effects of mutations. Because of their robustness, hub sequences have the potential of high innovability, with additional mutations readily inducing new functions. Therefore, they form hotspots of evolution and are promising candidates as starting points for directed evolution experiments in biotechnology.
Topics: Amino Acid Sequence; Fractals; Models, Molecular; Mutation; Protein Domains; Protein Folding; Proteins; Sequence Alignment
PubMed: 30067815
DOI: 10.1371/journal.pone.0200815