-
Biomacromolecules Feb 2023
Random heteropolymers (RHPs) have been computationally designed and experimentally shown to recapitulate protein-like phase behavior and function. However, unlike proteins, RHP sequences are only statistically defined and cannot be sequenced. Recent developments in reversible-deactivation radical polymerization allowed simulated polymer sequences based on the well-established Mayo-Lewis equation to more accurately reflect ground-truth sequences that are experimentally synthesized. This led to opportunities to perform bioinformatics-inspired analysis on simulated sequences to guide the design, synthesis, and interpretation of RHPs. We compared batches on the order of 10,000 simulated RHP sequences that vary by synthetically controllable and measurable RHP characteristics such as chemical heterogeneity and average degree of polymerization. Our analysis spans three levels: segments along a single chain, sequences within a batch, and batch-averaged statistics. We discuss simulator fidelity and highlight the importance of robust segment definition. Examples are presented that demonstrate the use of simulated sequence analysis for in-silico iterative design to mimic protein hydrophobic/hydrophilic segment distributions in RHPs, and to compare RHP and protein sequence segments in order to explain experimental results of RHPs that mimic protein function. To facilitate community use of this workflow, the simulator and analysis modules have been made available through an open-source toolkit, the RHPapp.
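As a rough illustration of what Mayo-Lewis-based sequence simulation can look like, the sketch below grows chains monomer by monomer with the standard terminal-model transition probabilities. It is not the RHPapp implementation. It assumes only two monomers (A and B), a constant feed fraction `f1` (no compositional drift), and a fixed chain length; the function names and parameters are hypothetical and chosen for illustration only.

```python
import random


def mayo_lewis_f1_polymer(f1, r1, r2):
    """Instantaneous copolymer fraction of monomer A (Mayo-Lewis equation).

    f1 is the feed fraction of A (0 < f1 < 1); r1 and r2 are reactivity ratios.
    """
    f2 = 1.0 - f1
    numerator = r1 * f1 * f1 + f1 * f2
    denominator = r1 * f1 * f1 + 2.0 * f1 * f2 + r2 * f2 * f2
    return numerator / denominator


def simulate_chain(length, f1, r1, r2, rng):
    """Grow one chain with terminal-model probabilities at a constant feed."""
    f2 = 1.0 - f1
    # Probability that the next monomer is A, given the current chain end.
    p_a_after_a = r1 * f1 / (r1 * f1 + f2)
    p_a_after_b = f1 / (r2 * f2 + f1)
    # Assumption: the first monomer is drawn from the feed composition.
    chain = ["A" if rng.random() < f1 else "B"]
    while len(chain) < length:
        p_a = p_a_after_a if chain[-1] == "A" else p_a_after_b
        chain.append("A" if rng.random() < p_a else "B")
    return "".join(chain)


def simulate_batch(n_chains, length, f1, r1, r2, seed=0):
    """Simulate a batch of chains; a fixed seed makes the batch reproducible."""
    rng = random.Random(seed)
    return [simulate_chain(length, f1, r1, r2, rng) for _ in range(n_chains)]
```

For long chains, the fraction of A produced by this Markov chain converges to the Mayo-Lewis value returned by `mayo_lewis_f1_polymer`, which gives a simple consistency check on a simulated batch. Segment-level, sequence-level, and batch-averaged statistics of the kind described in the abstract could then be computed on the returned strings.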
Topics: Proteins; Polymers; Amino Acid Sequence; Polymerization
PubMed: 36638823
DOI: 10.1021/acs.biomac.2c01036 -
Journal of Chemical Information and... Apr 2021
Small molecules play a critical role in modulating biological systems. Knowledge of chemical-protein interactions helps address fundamental and practical questions in biology and medicine. However, with the rapid emergence of newly sequenced genes, the endogenous or surrogate ligands of a vast number of proteins remain unknown. Homology modeling and machine learning are two major methods for assigning new ligands to a protein but mostly fail when sequence homology between an unannotated protein and those with known functions or structures is low. In this study, we develop a new deep learning framework to predict chemical binding to evolutionarily divergent unannotated proteins, whose ligands cannot be reliably predicted by existing methods. By incorporating evolutionary information into self-supervised learning of unlabeled protein sequences, we develop a novel method, distilled sequence alignment embedding (DISAE), for protein sequence representation. DISAE can utilize all protein sequences and their multiple sequence alignment (MSA) to capture functional relationships between proteins without knowledge of their structure and function. Following DISAE pretraining, we devise a module-based fine-tuning strategy for the supervised learning of chemical-protein interactions. In the benchmark studies, DISAE significantly improves the generalizability of machine learning models and outperforms the state-of-the-art methods by a large margin. Comprehensive ablation studies suggest that the use of MSA, sequence distillation, and triplet pretraining contributes critically to the success of DISAE. The interpretability analysis of DISAE suggests that it learns biologically meaningful information. We further use DISAE to assign ligands to human orphan G-protein coupled receptors (GPCRs) and to cluster the human GPCRome by integrating their phylogenetic and ligand relationships.
The promising results of DISAE open an avenue for exploring the chemical landscape of entire sequenced genomes.
Topics: Amino Acid Sequence; Computational Biology; Humans; Ligands; Machine Learning; Phylogeny; Sequence Alignment
PubMed: 33757283
DOI: 10.1021/acs.jcim.0c01285 -
The Journal of Physical Chemistry... Aug 2022
A self-consistent analytical solution for binodal concentrations of the two-component Flory-Huggins phase separation model is derived. We show that this form extends the validity of the Ginzburg-Landau expansion away from the critical point to cover the whole phase space. Furthermore, this analytical solution reveals an exponential scaling law of the dilute phase binodal concentration as a function of the interaction strength and chain length. We demonstrate explicitly the power of this approach by fitting experimental protein liquid-liquid phase separation boundaries to determine the effective chain length and solute-solvent interaction energies. Moreover, we demonstrate that this strategy allows us to resolve differences in interaction energy contributions of individual amino acids. This analytical framework can serve as a new way to decode the protein sequence grammar for liquid-liquid phase separation.
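For orientation, the standard two-component Flory-Huggins model that the abstract refers to can be written out as follows. This is the textbook form, not the paper's analytical solution, which is not reproduced here. The notation is an assumption: $\phi$ is the polymer volume fraction, $N$ the chain length, $\chi$ the interaction parameter, and $\phi'$, $\phi''$ the coexisting dilute and dense phase concentrations.

```latex
% Free energy of mixing per lattice site, in units of k_B T
\[
  f(\phi) = \frac{\phi}{N}\ln\phi + (1-\phi)\ln(1-\phi) + \chi\,\phi\,(1-\phi)
\]

% Binodal (common-tangent) conditions for the coexisting concentrations
\[
  f'(\phi') = f'(\phi'') = \frac{f(\phi'') - f(\phi')}{\phi'' - \phi'}
\]

% Critical point, from f''(\phi) = 0 and f'''(\phi) = 0
\[
  \phi_c = \frac{1}{1+\sqrt{N}}, \qquad
  \chi_c = \frac{\left(1+\sqrt{N}\right)^{2}}{2N}
\]
```

The common-tangent conditions have no general closed-form solution in elementary functions, which is why a self-consistent analytical form valid away from the critical point is of interest.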
Topics: Amino Acid Sequence; Proteins; Solutions; Solvents; Thermodynamics
PubMed: 35977086
DOI: 10.1021/acs.jpclett.2c01986 -
Current Opinion in Structural Biology Feb 2021
Review
The grand challenge of protein design is a general method for producing a polypeptide with arbitrary functionality, conformation, and biochemical properties. To that end, a wide variety of methods have been developed for the improvement of native proteins, the design of ideal proteins de novo, and the redesign of suboptimal proteins with better-performing substructures. These methods employ informatic comparisons of function-structure-sequence relationships as well as knowledge-based evaluation of protein properties to narrow the immense protein sequence search space down to an enumerable and often manually evaluable set of structures that meet specified criteria. While arbitrary manipulation of protein-protein interfaces and molecular catalysis remains an unsolved problem, and no protein shape or behavior manipulation algorithm is universally applicable, the promising results thus far are a strong indicator that a general approach to the arbitrary manipulation of polypeptides is within reach.
Topics: Algorithms; Amino Acid Sequence; Catalysis; Protein Conformation; Protein Folding; Proteins
PubMed: 33276237
DOI: 10.1016/j.sbi.2020.10.015 -
Methods in Enzymology 2023
N-myristoyltransferases (NMTs) are members of the large family of GCN5-related N-acetyltransferases (GNATs). NMTs mainly catalyze eukaryotic protein myristoylation, an essential modification that tags protein N-termini and allows subsequent subcellular membrane targeting. NMTs use myristoyl-CoA (C14:0) as the major acyl donor. NMTs were recently found to react with unexpected substrates including lysine side-chains and acetyl-CoA. This chapter details the kinetic approaches that have allowed the characterization of the unique catalytic features of NMTs in vitro.
Topics: Amino Acid Sequence; Acyltransferases
PubMed: 37230588
DOI: 10.1016/bs.mie.2023.02.018 -
Bioinformatics (Oxford, England) Jun 2023
MOTIVATION
Millions of protein sequences have been generated by numerous genome and transcriptome sequencing projects. However, experimentally determining the function of the proteins is still a time-consuming, low-throughput, and expensive process, leading to a large protein sequence-function gap. Therefore, it is important to develop computational methods to accurately predict protein function to fill the gap. Even though many methods have been developed to use protein sequences as input to predict function, far fewer methods leverage protein structures in protein function prediction because there was a lack of accurate protein structures for most proteins until recently.
RESULTS
We developed TransFun, a method using a transformer-based protein language model and 3D-equivariant graph neural networks to distill information from both protein sequences and structures to predict protein function. It extracts feature embeddings from protein sequences using a pre-trained protein language model (ESM) via transfer learning and combines them with 3D structures of proteins predicted by AlphaFold2 through equivariant graph neural networks. Benchmarked on the CAFA3 test dataset and a new test dataset, TransFun outperforms several state-of-the-art methods, indicating that the language model and 3D-equivariant graph neural networks are effective methods to leverage protein sequences and structures to improve protein function prediction. Combining TransFun predictions and sequence similarity-based predictions can further increase prediction accuracy.
AVAILABILITY AND IMPLEMENTATION
The source code of TransFun is available at https://github.com/jianlin-cheng/TransFun.
Topics: Amino Acid Sequence; Benchmarking; Language; Neural Networks, Computer; Software
PubMed: 37387145
DOI: 10.1093/bioinformatics/btad208 -
Bioinformatics (Oxford, England) Sep 2022
MOTIVATION
Computational methods for compound-protein affinity and contact (CPAC) prediction aim at facilitating rational drug discovery by simultaneous prediction of the strength and the pattern of compound-protein interactions. Although the desired outputs are highly structure-dependent, the lack of protein structures often makes structure-free methods rely on protein sequence inputs alone. The scarcity of compound-protein pairs with affinity and contact labels further limits the accuracy and the generalizability of CPAC models.
RESULTS
To overcome the aforementioned challenges of structure naivety and labeled-data scarcity, we introduce cross-modality and self-supervised learning, respectively, for structure-aware and task-relevant protein embedding. Specifically, protein data are available in two modalities, 1D amino-acid sequences and predicted 2D contact maps, which are separately embedded with recurrent and graph neural networks, respectively, as well as jointly embedded with two cross-modality schemes. Furthermore, both protein modalities are pre-trained under various self-supervised learning strategies, by leveraging massive amounts of unlabeled protein data. Our results indicate that individual protein modalities differ in their strengths in predicting affinities or contacts. Proper cross-modality protein embedding combined with self-supervised learning improves model generalizability when predicting both affinities and contacts for unseen proteins.
AVAILABILITY AND IMPLEMENTATION
Data and source codes are available at https://github.com/Shen-Lab/CPAC.
SUPPLEMENTARY INFORMATION
Supplementary data are available at Bioinformatics online.
Topics: Amino Acid Sequence; Drug Discovery; Neural Networks, Computer; Proteins; Software
PubMed: 36124802
DOI: 10.1093/bioinformatics/btac470 -
Current Opinion in Chemical Biology Aug 2023
Review
The phenomenon of protein phase separation, which underlies the formation of biomolecular condensates, has been associated with numerous cellular functions. Recent studies indicate that the amino acid sequences of most proteins may harbour not only the code for folding into the native state but also for condensing into the liquid-like droplet state and the solid-like amyloid state. Here we review the current understanding of the principles for sequence-based methods for predicting the propensity of proteins for phase separation. A guiding concept is that entropic contributions are generally more important to stabilise the droplet state than they are for the native and amyloid states. Although estimating these entropic contributions has proven difficult, we describe some progress that has been recently made in this direction. To conclude, we discuss the challenges ahead to extend sequence-based prediction methods of protein phase separation to include quantitative in vivo characterisations of this process.
Topics: Amyloid; Amino Acid Sequence; Cell Physiological Phenomena
PubMed: 37207400
DOI: 10.1016/j.cbpa.2023.102317 -
Current Opinion in Genetics &... Oct 2019
Review
Many functions of eukaryotic cells are compartmentalized within membrane-bound organelles. One or more cis-encoded signals within a polypeptide sequence typically govern protein targeting to and within destination organelles. Perhaps unexpectedly, organelle targeting does not occur with high specificity, but instead is characterized by considerable degeneracy and inefficiency. Indeed, the same peptide signals can target proteins to more than one location, randomized sequences can easily direct proteins to organelles, and many enzymes appear to traverse different subcellular settings across eukaryotic phylogeny. We discuss the potential benefits provided by flexibility in organelle targeting, with a special emphasis on horizontally transferred and de novo proteins. Moreover, we consider how these new organelle residents can be protected and maintained before they contribute to the needs of the cell and promote fitness.
Topics: Amino Acid Sequence; Amoeba; Endoplasmic Reticulum; Eukaryota; Evolution, Molecular; Gene Transfer, Horizontal; Mitochondria; Molecular Chaperones; Phylogeny; Protein Sorting Signals; Protein Transport
PubMed: 31476715
DOI: 10.1016/j.gde.2019.07.012 -
Cell Systems Jun 2021
Language models have recently emerged as a powerful machine-learning approach for distilling information from massive protein sequence databases. From readily available sequence data alone, these models discover evolutionary, structural, and functional organization across protein space. Using language models, we can encode amino-acid sequences into distributed vector representations that capture their structural and functional properties, as well as evaluate the evolutionary fitness of sequence variants. We discuss recent advances in protein language modeling and their applications to downstream protein property prediction problems. We then consider how these models can be enriched with prior biological knowledge and introduce an approach for encoding protein structural knowledge into the learned representations. The knowledge distilled by these models allows us to improve downstream function prediction through transfer learning. Deep protein language models are revolutionizing protein biology. They suggest new ways to approach protein and therapeutic design. However, further developments are needed to encode strong biological priors into protein language models and to increase their accessibility to the broader community.
Topics: Amino Acid Sequence; Databases, Protein; Language; Machine Learning; Proteins
PubMed: 34139171
DOI: 10.1016/j.cels.2021.05.017