-
Briefings in Bioinformatics Jan 2022
Experimental protein function annotation does not scale with the fast-growing sequence databases. Only a tiny fraction (<0.1%) of protein sequences has experimentally determined functional annotations. Computational methods may predict protein function very quickly, but their accuracy is not very satisfactory. Based upon recent breakthroughs in protein structure prediction and protein language models, we develop GAT-GO, a graph attention network (GAT) method that may substantially improve protein function prediction by leveraging predicted structure information and protein sequence embedding. Our experimental results show that GAT-GO greatly outperforms the latest sequence- and structure-based deep learning methods. On the PDB-mmseqs testset where the train and test proteins share <15% sequence identity, our GAT-GO yields Fmax (maximum F-score) 0.508, 0.416, 0.501, and area under the precision-recall curve (AUPRC) 0.427, 0.253, 0.411 for the MFO, BPO, CCO ontology domains, respectively, much better than the homology-based method BLAST (Fmax 0.117, 0.121, 0.207 and AUPRC 0.120, 0.120, 0.163) that does not use any structure information. On the PDB-cdhit testset where the training and test proteins are more similar, although using predicted structure information, our GAT-GO obtains Fmax 0.637, 0.501, 0.542 for the MFO, BPO, CCO ontology domains, respectively, and AUPRC 0.662, 0.384, 0.481, significantly exceeding the just-published method DeepFRI that uses experimental structures, which has Fmax 0.542, 0.425, 0.424 and AUPRC only 0.313, 0.159, 0.193.
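The abstract reports Fmax without defining it beyond "maximum F-score". A minimal sketch of one common way to compute it follows, assuming the protein-centric convention used in CAFA-style evaluations: precision is averaged over proteins with at least one prediction at the threshold, recall over all annotated proteins. The function name `fmax`, the threshold grid, and the term identifiers are illustrative assumptions, not taken from GAT-GO.

```python
def fmax(y_true, y_score, thresholds=None):
    """Protein-centric Fmax (assumed CAFA-style definition).

    y_true:  list of sets, the true terms for each protein
    y_score: list of dicts, term -> predicted score in [0, 1]
    """
    if thresholds is None:
        # Hypothetical grid; real evaluations choose their own step size.
        thresholds = [i / 100 for i in range(1, 100)]

    best = 0.0
    for t in thresholds:
        precisions, recalls = [], []
        for truth, scores in zip(y_true, y_score):
            predicted = {term for term, s in scores.items() if s >= t}
            tp = len(predicted & truth)
            # Precision only counts proteins with at least one prediction.
            if predicted:
                precisions.append(tp / len(predicted))
            # Recall counts every protein that has true annotations.
            if truth:
                recalls.append(tp / len(truth))
        if not precisions or not recalls:
            continue
        p = sum(precisions) / len(precisions)
        r = sum(recalls) / len(recalls)
        if p + r > 0:
            best = max(best, 2 * p * r / (p + r))
    return best


# Toy data with made-up term names.
truth = [{"term_a", "term_b"}, {"term_c"}]
scores = [
    {"term_a": 0.9, "term_b": 0.4, "term_c": 0.2},
    {"term_c": 0.8, "term_a": 0.3},
]
print(fmax(truth, scores))
```

Because the threshold is chosen to maximize F, Fmax is an optimistic summary of a predictor; the AUPRC values quoted alongside it summarize the whole precision-recall curve instead.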
Topics: Amino Acid Sequence; Area Under Curve; Computational Biology; Databases, Protein; Molecular Sequence Annotation; Proteins
PubMed: 34882195
DOI: 10.1093/bib/bbab502 -
Proceedings of the National Academy of... Mar 2020
Frameshifts in protein coding sequences are widely perceived as resulting in either nonfunctional or even deleterious protein products. Indeed, frameshifts typically lead to markedly altered protein sequences and premature stop codons. By analyzing complete proteomes from all three domains of life, we demonstrate that, in contrast, several key physicochemical properties of protein sequences exhibit significant robustness against +1 and -1 frameshifts. In particular, we show that hydrophobicity profiles of many protein sequences remain largely invariant upon frameshifting. For example, over 2,900 human proteins exhibit a Pearson's correlation coefficient R between the hydrophobicity profiles of the original and the +1-frameshifted variants greater than 0.7, despite an average sequence identity between the two of only 6.5% in this group. We observe a similar effect for protein sequence profiles of affinity for certain nucleobases as well as protein sequence profiles of intrinsic disorder. Finally, analysis of significance and optimality demonstrates that frameshift stability is embedded in the structure of the universal genetic code and may have contributed to shaping it. Our results suggest that frameshifting may be a powerful evolutionary mechanism for creating new proteins with vastly different sequences, yet similar physicochemical properties to the proteins from which they originate.
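The comparison described above, correlating the hydrophobicity profile of a protein with that of its +1-frameshifted variant, can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' pipeline: it uses the Kyte-Doolittle scale (the abstract does not name a scale), treats a +1 frameshift as reading the same DNA from an offset of one nucleotide, and drops positions where either frame has a stop or unknown codon. All function names are hypothetical.

```python
import math

# Standard genetic code in the usual T, C, A, G ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# Kyte-Doolittle hydropathy values (an assumed choice of scale).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def residue_values(dna, offset):
    """Hydropathy per codon read from `offset`; None for stop/unknown codons."""
    dna = dna.upper()
    values = []
    for i in range(offset, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[i:i + 3])
        values.append(None if aa is None or aa == "*" else KD[aa])
    return values


def smooth(values, window):
    """Sliding-window mean; window=1 returns the values unchanged."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined for a constant profile")
    return sxy / math.sqrt(sxx * syy)


def frameshift_correlation(dna, offset=1, window=1):
    """Correlate hydropathy profiles of frame 0 and the shifted frame."""
    pairs = [(x, y)
             for x, y in zip(residue_values(dna, 0), residue_values(dna, offset))
             if x is not None and y is not None]
    xs = smooth([p[0] for p in pairs], window)
    ys = smooth([p[1] for p in pairs], window)
    return pearson(xs, ys)


# Made-up coding sequence, for illustration only.
print(frameshift_correlation("ATGGCTAAAGTTCTGGATGAAATCCGTTGGCACTTT"))
```

A larger `window` smooths the profiles before correlating them, which is how hydrophobicity profiles are usually compared; the window size is a free parameter in this sketch.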
Topics: Amino Acid Sequence; Chemical Phenomena; Evolution, Molecular; Frameshift Mutation; Genetic Code; Humans; Hydrophobic and Hydrophilic Interactions; Open Reading Frames; Proteins
PubMed: 32127487
DOI: 10.1073/pnas.1911203117 -
MSystems Dec 2022
A protein's function depends on functional residues that determine its binding specificity or its catalytic activity, but these residues are typically not considered when annotating a protein's function. To help biologists investigate the functional residues of proteins, we developed two interactive web-based tools, SitesBLAST and Sites on a Tree. Given a protein sequence, SitesBLAST finds homologs that have known functional residues and shows whether the functional residues are conserved. Sites on a Tree shows how functional residues vary across a protein family by showing them on a phylogenetic tree. These tools are available at http://papers.genomics.lbl.gov/sites. For most microbes of interest, a genome sequence is available, but the function of its proteins is not known. Instead, proteins' functions are predicted from their similarity to other protein sequences. Within a protein's sequence, a few key residues are most important for function, such as catalyzing a chemical reaction or determining what it binds. But most function prediction tools do not take these key residues into account. We developed interactive tools for identifying functional residues in a protein sequence by comparing it to proteins with known functional residues. Our tools also make it easy to compare key residues across many similar proteins. This should help biologists check if a protein's function is predicted correctly, or to predict if groups of similar proteins have conserved functions.
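The core check described above, whether a homolog's known functional residues are conserved in the query, can be illustrated with a short sketch. It assumes a pairwise alignment is already available as two gapped strings and that functional sites are given as 1-based positions in the ungapped homolog; the function name and return layout are hypothetical and do not reflect how SitesBLAST is implemented.

```python
def map_functional_sites(aligned_query, aligned_subject, subject_sites):
    """Report the query residue aligned to each functional site of the subject.

    aligned_query / aligned_subject: equal-length strings with '-' for gaps
    subject_sites: 1-based positions in the ungapped subject sequence
    """
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have equal length")

    wanted = set(subject_sites)
    q_pos = s_pos = 0  # running 1-based positions in the ungapped sequences
    report = {}
    for q_char, s_char in zip(aligned_query, aligned_subject):
        if q_char != "-":
            q_pos += 1
        if s_char != "-":
            s_pos += 1
        if s_char != "-" and s_pos in wanted:
            aligned = q_char != "-"
            report[s_pos] = {
                "subject_residue": s_char,
                "query_residue": q_char if aligned else None,
                "query_position": q_pos if aligned else None,
                "conserved": q_char == s_char,
            }
    return report


# Made-up alignment: the homolog has functional sites at positions 4, 5 and 7.
result = map_functional_sites("MKT-HSDG", "MRTAHSEG", [4, 5, 7])
for site, info in sorted(result.items()):
    print(site, info)
```

Strict identity is the simplest possible notion of conservation; a real tool might also treat chemically similar substitutions differently from dissimilar ones.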
Topics: Phylogeny; Computational Biology; Proteins; Amino Acid Sequence; Data Interpretation, Statistical
PubMed: 36374048
DOI: 10.1128/msystems.00705-22 -
Biomolecular Concepts Feb 2022
Accurate prediction of protein structure is one of the most challenging goals of biology. The most recent achievement is AlphaFold, a machine learning method that has claimed to have solved the structure of almost all human proteins. This technological breakthrough has been compared to the sequencing of the human genome. However, this triumphal statement should be treated with caution, as we identified serious flaws in some AlphaFold models. Disordered regions are often represented by large loops that clash with the overall protein geometry, leading to unrealistic structures, especially for membrane proteins. In fact, AlphaFold comes up against the notion that protein folding is not solely determined by genomic information. We suggest that all parameters controlling the structure of a protein without being strictly encoded in its amino acid sequence should be coined "epigenetic dimension of protein structure." Such parameters include for instance protein solvation by membrane lipids, or the structuration of disordered proteins upon ligand binding, but exclude sequence-encoded sites of post-translational modifications such as glycosylation. In our view, this paradigm is necessary to reconcile two opposite properties of living systems: beyond rigorous biological coding, evolution has given way to a certain level of uncertainty and anarchy.
Topics: Amino Acid Sequence; Epigenesis, Genetic; Humans; Membrane Proteins; Protein Conformation; Protein Folding
PubMed: 35189052
DOI: 10.1515/bmc-2022-0006 -
PLoS Computational Biology Apr 2022
Connecting protein sequence to function is becoming increasingly relevant since high-throughput sequencing studies accumulate large amounts of genomic data. In order to go beyond the existing database annotation, it is fundamental to understand the mechanisms underlying functional inheritance and divergence. If the homology relationship between proteins is known, can we determine whether the function diverged? In this work, we analyze different possibilities of protein sequence evolution after gene duplication and identify "inter-paralog inversions", i.e., sites where the relationship between the ancestry and the functional signal is decoupled. The amino acids in these sites are masked from being recognized by other prediction tools. Still, they play a role in functional divergence and could indicate a shift in protein function. We develop a method to specifically recognize inter-paralog amino acid inversions in a phylogeny and test it on real and simulated datasets. In a dataset built from the Epidermal Growth Factor Receptor (EGFR) sequences found in 88 fish species, we identify 19 amino acid sites that went through inversion after gene duplication, mostly located at the ligand-binding extracellular domain. Our work uncovers an outcome of protein duplications with direct implications in protein functional annotation and sequence evolution. The developed method is optimized to work with large protein datasets and can be readily included in a targeted protein analysis pipeline.
Topics: Amino Acid Sequence; Amino Acids; Animals; Chromosome Inversion; Evolution, Molecular; Gene Duplication; Phylogeny; Proteins
PubMed: 35377869
DOI: 10.1371/journal.pcbi.1010016 -
Cell Reports Methods May 2022
Topics: Protein Structure, Secondary; Amino Acid Sequence
PubMed: 35637914
DOI: 10.1016/j.crmeth.2022.100223 -
BMC Bioinformatics Oct 2021
BACKGROUND
Optimization of DNA and protein sequences based on Machine Learning models is becoming a powerful tool for molecular design. Activation maximization offers a simple design strategy for differentiable models: one-hot coded sequences are first approximated by a continuous representation, which is then iteratively optimized with respect to the predictor oracle by gradient ascent. While elegant, the current version of the method suffers from vanishing gradients and may cause predictor pathologies leading to poor convergence.
RESULTS
Here, we introduce Fast SeqProp, an improved activation maximization method that combines straight-through approximation with normalization across the parameters of the input sequence distribution. Fast SeqProp overcomes bottlenecks in earlier methods arising from input parameters becoming skewed during optimization. Compared to prior methods, Fast SeqProp results in up to 100-fold faster convergence while also finding improved fitness optima for many applications. We demonstrate Fast SeqProp's capabilities by designing DNA and protein sequences for six deep learning predictors, including a protein structure predictor.
CONCLUSIONS
Fast SeqProp offers a reliable and efficient method for general-purpose sequence optimization through a differentiable fitness predictor. As demonstrated on a variety of deep learning models, the method is widely applicable, and can incorporate various regularization techniques to maintain confidence in the sequence designs. As a design tool, Fast SeqProp may aid in the development of novel molecules, drug therapies and vaccines.
Topics: Algorithms; Amino Acid Sequence; Machine Learning
PubMed: 34670493
DOI: 10.1186/s12859-021-04437-5 -
PLoS Computational Biology Mar 2023
Directed laboratory evolution applies iterative rounds of mutation and selection to explore the protein fitness landscape and provides rich information regarding the underlying relationships between protein sequence, structure, and function. Laboratory evolution data consist of protein sequences sampled from evolving populations over multiple generations and this data type does not fit into established supervised and unsupervised machine learning approaches. We develop a statistical learning framework that models the evolutionary process and can infer the protein fitness landscape from multiple snapshots along an evolutionary trajectory. We apply our modeling approach to dihydrofolate reductase (DHFR) laboratory evolution data and the resulting landscape parameters capture important aspects of DHFR structure and function. We use the resulting model to understand the structure of the fitness landscape and find numerous examples of epistasis but an overall global peak that is evolutionarily accessible from most starting sequences. Finally, we use the model to perform an in silico extrapolation of the DHFR laboratory evolution trajectory and computationally design proteins from future evolutionary rounds.
Topics: Genetic Fitness; Proteins; Mutation; Tetrahydrofolate Dehydrogenase; Amino Acid Sequence; Evolution, Molecular; Models, Genetic; Epistasis, Genetic
PubMed: 36857380
DOI: 10.1371/journal.pcbi.1010956 -
Journal of Proteome Research Feb 2022
Topics: Alzheimer Disease; Amino Acid Sequence; Aspartic Acid; Humans; Isomerism
PubMed: 35114789
DOI: 10.1021/acs.jproteome.2c00016 -
Neural networks to learn protein sequence-function relationships from deep mutational scanning data.
Proceedings of the National Academy of... Nov 2021
The mapping from protein sequence to function is highly complex, making it challenging to predict how sequence changes will affect a protein's behavior and properties. We present a supervised deep learning framework to learn the sequence-function mapping from deep mutational scanning data and make predictions for new, uncharacterized sequence variants. We test multiple neural network architectures, including a graph convolutional network that incorporates protein structure, to explore how a network's internal representation affects its ability to learn the sequence-function mapping. Our supervised learning approach displays superior performance over physics-based and unsupervised prediction methods. We find that networks that capture nonlinear interactions and share parameters across sequence positions are important for learning the relationship between sequence and function. Further analysis of the trained models reveals the networks' ability to learn biologically meaningful information about protein structure and mechanism. Finally, we demonstrate the models' ability to navigate sequence space and design new proteins beyond the training set. We applied the protein G B1 domain (GB1) models to design a sequence that binds to immunoglobulin G with substantially higher affinity than wild-type GB1.
Topics: Algorithms; Amino Acid Sequence; Biochemical Phenomena; Deep Learning; Machine Learning; Mutation; Neural Networks, Computer; Proteins; Sequence Analysis, Protein; Structure-Activity Relationship
PubMed: 34815338
DOI: 10.1073/pnas.2104878118