-
Biophysical Journal Jun 2024
Diffusion determines the turnover of biomolecules in liquid-liquid phase-separated condensates. We considered the mean square displacement and thus the diffusion constant for simple model systems of peptides GGGGG, GGQGG, and GGVGG in aqueous solutions after phase separation by simulating atomic-level models. These solutions readily separate into aqueous and peptide-rich droplet phases. We noted the effect of the peptides being in a solvated, surface, or droplet state on the peptide's diffusion coefficients. Both sequence and peptide conformational distribution were found to influence diffusion and condensate turnover in these systems, with sequence dominating the magnitude of the differences. We found that the most compact structures for each sequence diffused the fastest in the peptide-rich condensate phase. This model result may have implications for turnover dynamics in signaling systems.
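The link between mean square displacement (MSD) and the diffusion constant mentioned above can be illustrated with the Einstein relation, MSD(t) = 2·d·D·t for d spatial dimensions. The following is a minimal sketch under that assumption, not the authors' analysis code. The function names, the time-averaged MSD estimator, and the through-origin least-squares fit are illustrative choices.

```python
def mean_square_displacement(traj, lag):
    """Time-averaged MSD at one lag for a list of (x, y, z) positions."""
    n = len(traj) - lag
    if lag <= 0 or n <= 0:
        raise ValueError("lag must be between 1 and len(traj) - 1")
    total = 0.0
    for i in range(n):
        # Squared displacement between frame i and frame i + lag.
        total += sum((b - a) ** 2 for a, b in zip(traj[i], traj[i + lag]))
    return total / n


def diffusion_coefficient(traj, dt, max_lag):
    """Estimate D from the MSD slope, assuming MSD(t) = 2 * d * D * t."""
    dims = len(traj[0])
    times = [lag * dt for lag in range(1, max_lag + 1)]
    msds = [mean_square_displacement(traj, lag) for lag in range(1, max_lag + 1)]
    # Least-squares slope of a line constrained to pass through the origin.
    slope = sum(t * m for t, m in zip(times, msds)) / sum(t * t for t in times)
    return slope / (2 * dims)
```

In practice the fit would be restricted to the lag range where the MSD is linear in time, since short-time ballistic motion and poor statistics at long lags both distort the slope.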
Topics: Diffusion; Peptides; Biomolecular Condensates; Amino Acid Sequence; Water; Models, Molecular; Protein Conformation
PubMed: 38751116
DOI: 10.1016/j.bpj.2024.05.009 -
Biochemistry Nov 2023
That sequence determines structure, and that structure in turn determines function, is a fundamental principle of protein chemistry. In the genomics era, the paradigm of mining protein functionality and evolutionary insights through sequence analysis has led to remarkable achievements. However, protein sequences often mutate faster than their structural counterparts during evolution. For protein sets characterized by highly divergent sequences, sequence-based analysis is often inadequate, whereas direct extraction of implicit information from the structures appears to be a more effective strategy.
Topics: Evolution, Molecular; Proteins; Amino Acid Sequence; Genomics
PubMed: 37950690
DOI: 10.1021/acs.biochem.3c00547 -
Briefings in Bioinformatics Jan 2024
Lysine lactylation (Kla) is a newly discovered posttranslational modification that is involved in important life activities, such as glycolysis-related cell function, macrophage polarization and nervous system regulation, and has received widespread attention due to the Warburg effect in tumor cells. In this work, we first design a natural language processing method to automatically extract the 3D structural features of Kla sites, avoiding potential biases caused by manually designed structural features. Then, we establish two Kla prediction frameworks, the attention-based feature fusion Kla model (ABFF-Kla) and the embedding-based feature fusion Kla model (EBFF-Kla), to integrate the sequence features and the structure features based on the attention layer and embedding layer, respectively. The results indicate that ABFF-Kla and EBFF-Kla, which fuse features from protein sequences and spatial structures, have better predictive performance than models that use only sequence features. Our work provides an approach for the automatic extraction of protein structural features, as well as a flexible framework for Kla prediction. The source code and the training data of ABFF-Kla and EBFF-Kla are publicly deposited at: https://github.com/ispotato/Lactylation_model.
Topics: Amino Acid Sequence; Lysine; Natural Language Processing; Protein Domains; Protein Processing, Post-Translational
PubMed: 38385873
DOI: 10.1093/bib/bbad539 -
Briefings in Bioinformatics Sep 2023
Protein crystallization is crucial for biology, but the steps involved are complex and demanding in terms of external factors and internal structure. To save on experimental costs and time, the tendency of proteins to crystallize can be initially determined and screened by modeling. This study therefore created a new pipeline that uses protein sequence to predict protein crystallization propensity in the protein material production stage, the purification stage and the crystal production stage. The pipeline proposes a new feature selection method that combines Chi-square ($\chi^2$) and recursive feature elimination, together with the 12 selected features, followed by linear discriminant analysis for dimensionality reduction. Finally, a support vector machine algorithm with hyperparameter tuning and 10-fold cross-validation is used to train the model and test the results. This new pipeline has been tested on three different datasets, and the accuracy rates are higher than those of the existing pipelines. In conclusion, our model provides a new solution for predicting multistage protein crystallization propensity, which is a big challenge in computational biology.
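The Chi-square filtering step named above can be sketched as follows. This is only an illustration of the general technique for a binary feature scored against a binary class label. It is not the published pipeline, and it omits the recursive feature elimination, linear discriminant analysis and support vector machine stages. The function names and the binary encoding are assumptions.

```python
def chi_square_score(feature, labels):
    """Pearson chi-square statistic for a binary feature vs. binary labels."""
    n = len(feature)
    if n == 0 or n != len(labels):
        raise ValueError("feature and labels must be non-empty and equal in length")
    # 2x2 contingency table of observed counts.
    observed = {(f, c): 0 for f in (0, 1) for c in (0, 1)}
    for f, c in zip(feature, labels):
        observed[(f, c)] += 1
    row = {f: observed[(f, 0)] + observed[(f, 1)] for f in (0, 1)}
    col = {c: observed[(0, c)] + observed[(1, c)] for c in (0, 1)}
    score = 0.0
    for (f, c), obs in observed.items():
        expected = row[f] * col[c] / n
        if expected > 0:
            score += (obs - expected) ** 2 / expected
    return score


def rank_features(columns, labels):
    """Return feature indices ordered from most to least class-associated."""
    return sorted(
        range(len(columns)),
        key=lambda j: chi_square_score(columns[j], labels),
        reverse=True,
    )
```

A higher score indicates a stronger association between the feature and the class, so a filter of this kind would keep the top-ranked features before passing them to a wrapper method.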
Topics: Crystallization; Machine Learning; Amino Acid Sequence; Algorithms; Computational Biology
PubMed: 37649385
DOI: 10.1093/bib/bbad319 -
Scientific Reports Aug 2023
Deep learning models are seeing increased use as methods to predict mutational effects or allowed mutations in proteins. The models commonly used for these purposes include large language models (LLMs) and 3D Convolutional Neural Networks (CNNs). These two model types have very different architectures and are commonly trained on different representations of proteins. LLMs make use of the transformer architecture and are trained purely on protein sequences whereas 3D CNNs are trained on voxelized representations of local protein structure. While comparable overall prediction accuracies have been reported for both types of models, it is not known to what extent these models make comparable specific predictions and/or generalize protein biochemistry in similar ways. Here, we perform a systematic comparison of two LLMs and two structure-based models (CNNs) and show that the different model types have distinct strengths and weaknesses. The overall prediction accuracies are largely uncorrelated between the sequence- and structure-based models. Overall, the two structure-based models are better at predicting buried aliphatic and hydrophobic residues whereas the two LLMs are better at predicting solvent-exposed polar and charged amino acids. Finally, we find that a combined model that takes the individual model predictions as input can leverage these individual model strengths and results in significantly improved overall prediction accuracy.
Topics: Amino Acid Sequence; Amino Acids; Antifibrinolytic Agents; Electric Power Supplies; Language
PubMed: 37587128
DOI: 10.1038/s41598-023-40247-w -
Chembiochem : a European Journal of... Dec 2023 (Review)
Stapled peptides have rapidly established themselves as a powerful technique to mimic α-helical interactions with a short peptide sequence. There are many examples of stapled peptides that successfully disrupt α-helix-mediated protein-protein interactions, with an example currently in clinical trials. DNA-protein interactions are also often mediated by α-helices and are involved in all transcriptional regulation processes. Unlike DNA-binding small molecules, which typically lack DNA sequence selectivity, DNA-binding proteins bind with high affinity and high selectivity. These are ideal candidates for the design of DNA-binding stapled peptides. Despite the parallel to protein-protein interaction disrupting stapled peptides and the need for sequence-specific DNA binders, there are very few DNA-binding stapled peptides. In this review we examine all the known DNA-binding stapled peptides. Their design concepts are compared to those of stapled peptides that disrupt protein-protein interactions and, based on the few examples in the literature, DNA-binding stapled peptide trends are discussed.
Topics: Peptides; Amino Acid Sequence; Gene Expression Regulation; DNA
PubMed: 37750576
DOI: 10.1002/cbic.202300594 -
Computers in Biology and Medicine Mar 2024
The classification and prediction of T-cell receptor (TCR) protein sequences are of significant interest in understanding the immune system and developing personalized immunotherapies. In this study, we propose a novel approach using Pseudo Amino Acid Composition (PseAAC) protein encoding for accurate TCR protein sequence classification. The PseAAC2Vec encoding method captures the physicochemical properties of amino acids and their local sequence information, enabling the representation of protein sequences as fixed-length feature vectors. By incorporating physicochemical properties such as hydrophobicity, polarity, charge, molecular weight, and solvent accessibility, PseAAC2Vec provides a comprehensive and informative characterization of TCR protein sequences. To evaluate the effectiveness of the proposed PseAAC2Vec encoding approach, we assembled a large dataset of TCR protein sequences with annotated classes. We applied the PseAAC2Vec encoding scheme to each sequence and generated feature vectors based on a specified window size. Subsequently, we employed state-of-the-art machine learning algorithms, such as support vector machines (SVM) and random forests (RF), to classify the TCR protein sequences. Experimental results on the benchmark dataset demonstrated the superior performance of the PseAAC2Vec-based approach compared to existing methods. The PseAAC2Vec encoding effectively captures the discriminative patterns in TCR protein sequences, leading to improved classification accuracy and robustness. Furthermore, the encoding scheme showed promising results across different window sizes, indicating its adaptability to varying sequence contexts.
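To make the idea of a fixed-length pseudo amino acid composition vector concrete, here is a simplified sketch in the style of the classic PseAAC formulation. It uses a single property, the Kyte-Doolittle hydropathy index. It is not the PseAAC2Vec method described above, whose details are not given here. The function name `pseaac`, the parameters `lam` and `weight`, and the choice of a single property are assumptions made for illustration.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy index (the single property assumed in this sketch).
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def _standardize(scale):
    """Rescale property values to zero mean and unit standard deviation."""
    mean = sum(scale.values()) / len(scale)
    sd = (sum((v - mean) ** 2 for v in scale.values()) / len(scale)) ** 0.5
    return {aa: (v - mean) / sd for aa, v in scale.items()}


def pseaac(seq, lam=3, weight=0.05):
    """Return a (20 + lam)-dimensional vector for any sequence longer than lam."""
    seq = seq.upper()
    if any(aa not in HYDROPATHY for aa in seq):
        raise ValueError("sequence contains a non-standard residue")
    if len(seq) <= lam:
        raise ValueError("sequence must be longer than lam")
    h = _standardize(HYDROPATHY)
    # Conventional amino acid composition (20 frequencies summing to 1).
    freqs = [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]
    # Sequence-order correlation factors for residue gaps 1..lam.
    thetas = []
    for k in range(1, lam + 1):
        pairs = len(seq) - k
        thetas.append(sum((h[seq[i]] - h[seq[i + k]]) ** 2 for i in range(pairs)) / pairs)
    denom = 1.0 + weight * sum(thetas)
    return [f / denom for f in freqs] + [weight * t / denom for t in thetas]
```

For example, `pseaac("CASSLGQAYEQYF")` (an illustrative sequence, not taken from the study) yields a 23-element vector, and any other valid sequence longer than `lam` yields a vector of the same length, which is what allows standard classifiers such as SVM or RF to consume sequences of varying length.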
Topics: Computational Biology; Proteins; Amino Acid Sequence; Amino Acids; Algorithms; Support Vector Machine; Sequence Analysis, Protein; Databases, Protein
PubMed: 38217977
DOI: 10.1016/j.compbiomed.2024.107956 -
Briefings in Functional Genomics Nov 2023
Cyclin proteins are a group of proteins that activate the cell cycle by forming complexes with cyclin-dependent kinases. Identifying cyclins correctly can provide key clues to understanding the function of cyclins. However, due to the low similarity between cyclin protein sequences, the development of a machine learning-based approach to identify cyclins is urgently needed. In this study, cyclin protein sequence features were extracted using the profile-based auto-cross covariance method. Then the features were ranked and selected with maximum relevance-maximum distance (MRMD) 1.0 and MRMD2.0. Finally, the prediction model was assessed through 10-fold cross-validation. The computational experiments showed that the best protein sequence features generated by MRMD1.0 could correctly predict 98.2% of cyclins using the random forest (RF) classifier, whereas seven-dimensional key protein sequence features identified with MRMD2.0 could correctly predict 96.1% of cyclins, which was superior to previous studies on the same dataset in terms of both dimensionality and performance. Therefore, our work provides a valuable tool for identifying cyclins. The model data can be downloaded from https://github.com/YUshunL/cyclin.
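The 10-fold cross-validation used for assessment above can be sketched as a simple index splitter. This is a generic illustration of the technique, not the authors' evaluation code. The function name `k_fold_indices`, the shuffling seed and the round-robin fold assignment are assumptions.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Shuffle once with a fixed seed so the split is reproducible.
    random.Random(seed).shuffle(indices)
    # Round-robin assignment keeps fold sizes within one sample of each other.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

Each sample appears in exactly one test fold, so averaging a metric over the k rounds uses every sample for testing once and for training k - 1 times.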
Topics: Cyclins; Amino Acid Sequence; Proteins; Cyclin-Dependent Kinases; Cell Cycle
PubMed: 37118891
DOI: 10.1093/bfgp/elad014 -
Proteomics Nov 2023 (Review)
Prediction of protein-protein interactions (PPIs) commonly involves a significant computational component. Rapid recent advances in the power of computational methods for protein interaction prediction motivate a review of the state-of-the-art. We review the major approaches, organized according to the primary source of data utilized: protein sequence, protein structure, and protein co-abundance. The advent of deep learning (DL) has brought with it significant advances in interaction prediction, and we show how DL is used for each source data type. We review the literature taxonomically, present example case studies in each category, and conclude with observations about the strengths and weaknesses of machine learning methods in the context of the principal sources of data for protein interaction prediction.
Topics: Protein Interaction Mapping; Proteins; Machine Learning; Amino Acid Sequence; Computational Biology
PubMed: 37401192
DOI: 10.1002/pmic.202200292 -
Current Opinion in Chemical Biology Aug 2023 (Review)
The phenomenon of protein phase separation, which underlies the formation of biomolecular condensates, has been associated with numerous cellular functions. Recent studies indicate that the amino acid sequences of most proteins may harbour not only the code for folding into the native state but also for condensing into the liquid-like droplet state and the solid-like amyloid state. Here we review the current understanding of the principles for sequence-based methods for predicting the propensity of proteins for phase separation. A guiding concept is that entropic contributions are generally more important to stabilise the droplet state than they are for the native and amyloid states. Although estimating these entropic contributions has proven difficult, we describe some progress that has been recently made in this direction. To conclude, we discuss the challenges ahead to extend sequence-based prediction methods of protein phase separation to include quantitative in vivo characterisations of this process.
Topics: Amyloid; Amino Acid Sequence; Cell Physiological Phenomena
PubMed: 37207400
DOI: 10.1016/j.cbpa.2023.102317