-
Scientific Reports May 2022
Proteins are the essential biological macromolecules required to perform nearly all biological processes and cellular functions. Proteins rarely carry out their tasks in isolation; they interact with other proteins in their surroundings (protein-protein interaction) to complete biological activities. Knowledge of protein-protein interactions (PPIs) helps unravel cellular behavior and functionality. Computational methods automate the prediction of PPIs and are less expensive than experimental methods in terms of resources and time. So far, most work on PPI prediction has focused on sequence information. Here, we use a graph convolutional network (GCN) and a graph attention network (GAT) to predict the interaction between proteins by utilizing the proteins' structural information and sequence features. We build the graphs of proteins from their PDB files, which contain the 3D coordinates of atoms. The protein graph represents the amino acid network, also known as the residue contact network, where each node is a residue. Two nodes are connected if they have a pair of atoms (one from each node) within the threshold distance. To extract the node/residue features, we use a protein language model. The input to the language model is the protein sequence, and the output is a feature vector for each amino acid of the underlying sequence. We validate the predictive capability of the proposed graph-based approach on two PPI datasets: Human and S. cerevisiae. The results demonstrate the effectiveness of the proposed approach, as it outperforms the previous leading methods. The source code and the data used to train the model are available at https://github.com/JhaKanchan15/PPI_GNN.git.
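The graph construction rule described in this abstract (two residues are connected when any pair of their atoms lies within a threshold distance) can be illustrated with a minimal sketch. This is not the authors' code: the function name `residue_contact_edges`, the toy coordinates, and the 6.0 Å default threshold are all assumptions made for illustration, since the abstract does not state the threshold value.

```python
from itertools import combinations
from math import dist


def residue_contact_edges(residues, threshold=6.0):
    """Return the set of undirected edges (i, j), i < j, between residues.

    `residues` is a list in which each item holds the (x, y, z) coordinates
    of one residue's atoms. Two residues are connected if at least one atom
    pair (one atom from each residue) is within `threshold`.
    The default threshold is a hypothetical value, not taken from the paper.
    """
    edges = set()
    for (i, atoms_i), (j, atoms_j) in combinations(enumerate(residues), 2):
        if any(dist(a, b) <= threshold for a in atoms_i for b in atoms_j):
            edges.add((i, j))
    return edges


# Toy input: three residues with made-up atom coordinates.
toy_residues = [
    [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)],
    [(4.0, 0.0, 0.0)],
    [(20.0, 0.0, 0.0)],
]
edges = residue_contact_edges(toy_residues, threshold=6.0)
print(edges)
```

In a real pipeline the coordinates would be parsed from a PDB file, and the all-pairs loop would likely be replaced by a spatial index for large proteins; the sketch only shows the connection rule itself.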
Topics: Amino Acid Sequence; Amino Acids; Humans; Neural Networks, Computer; Proteins; Saccharomyces cerevisiae
PubMed: 35589837
DOI: 10.1038/s41598-022-12201-9 -
Briefings in Bioinformatics Sep 2023
Protein function annotation is one of the most important research topics for revealing the essence of life at the molecular level in the post-genome era. Current research shows that integrating multisource data can effectively improve the performance of protein function prediction models. However, heavy reliance on complex feature engineering and model integration limits the development of existing methods. Moreover, deep learning models extract sequence features using only the labeled data in a given dataset, ignoring the large amount of existing unlabeled sequence data. Here, we propose an end-to-end protein function annotation model named HNetGO, which uses a heterogeneous network to integrate protein sequence similarity and protein-protein interaction network information and combines it with a pretrained model to extract the semantic features of the protein sequence. In addition, we design an attention-based graph neural network model that can effectively extract node-level features from heterogeneous networks and predict protein function by measuring the similarity between protein nodes and gene ontology term nodes. Comparative experiments on the human dataset show that HNetGO achieves state-of-the-art performance on the cellular component and molecular function branches.
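The final prediction step described here, scoring a protein against gene ontology terms by the similarity of their node representations, could look roughly like the sketch below. The abstract does not specify the similarity function, so the dot product followed by a sigmoid is an assumption, as are the function name `score_terms`, the term labels, and the toy vectors.

```python
from math import exp


def sigmoid(x):
    """Map a real-valued similarity to the (0, 1) range."""
    return 1.0 / (1.0 + exp(-x))


def score_terms(protein_vec, term_vecs):
    """Score each term by the dot product with the protein embedding.

    Dot product plus sigmoid is an illustrative choice, not the
    similarity measure confirmed by the source.
    """
    return {
        term: sigmoid(sum(p * t for p, t in zip(protein_vec, vec)))
        for term, vec in term_vecs.items()
    }


# Hypothetical node embeddings, as a graph neural network might produce.
protein = [0.5, -1.0, 2.0]
go_terms = {
    "term_a": [1.0, 0.0, 1.0],
    "term_b": [-1.0, 1.0, -1.0],
}
scores = score_terms(protein, go_terms)
print(scores)
```

Terms whose score exceeds a chosen cutoff would then be assigned to the protein; the cutoff itself is a tuning decision the abstract does not discuss.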
Topics: Humans; Amino Acid Sequence; Gene Ontology; Molecular Sequence Annotation; Neural Networks, Computer; Protein Interaction Maps
PubMed: 37861172
DOI: 10.1093/bib/bbab556 -
Proceedings of the National Academy of... Apr 2021
Intracellular phase separation of proteins into biomolecular condensates is increasingly recognized as a process with a key role in cellular compartmentalization and regulation. Different hypotheses about the parameters that determine the tendency of proteins to form condensates have been proposed, with some of them probed experimentally through the use of constructs generated by sequence alterations. To broaden the scope of these observations, we established an in silico strategy for understanding on a global level the associations between protein sequence and phase behavior and further constructed machine-learning models for predicting protein liquid-liquid phase separation (LLPS). Our analysis highlighted that LLPS-prone proteins are more disordered, less hydrophobic, and of lower Shannon entropy than sequences in the Protein Data Bank or the Swiss-Prot database and that they show a fine balance in their relative content of polar and hydrophobic residues. To further learn in a hypothesis-free manner the sequence features underpinning LLPS, we trained a neural network-based language model and found that a classifier constructed on such embeddings learned the underlying principles of phase behavior at a comparable accuracy to a classifier that used knowledge-based features. By combining knowledge-based features with unsupervised embeddings, we generated an integrated model that distinguished LLPS-prone sequences both from structured proteins and from unstructured proteins with a lower LLPS propensity and further identified such sequences from the human proteome at a high accuracy. These results provide a platform rooted in molecular principles for understanding protein phase behavior. The predictor, termed DeePhase, is accessible from https://deephase.ch.cam.ac.uk/.
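One of the sequence features named in this abstract, Shannon entropy, can be computed from the amino acid composition of a sequence. The sketch below uses the standard definition with a base-2 logarithm over the whole sequence; the abstract does not say which base or windowing the authors used, so both are assumptions, and the example sequences are made up.

```python
from collections import Counter
from math import log2


def shannon_entropy(sequence):
    """Shannon entropy (in bits) of the residue composition of `sequence`."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    total = len(sequence)
    counts = Counter(sequence)
    return -sum((c / total) * log2(c / total) for c in counts.values())


# A repetitive sequence has lower entropy than a compositionally diverse one.
low_complexity = "GGSGGSGGSGGS"
diverse = "ACDEFGHIKLMN"
print(shannon_entropy(low_complexity), shannon_entropy(diverse))
```

Under this definition, a lower value means the sequence draws on fewer residue types or uses them unevenly, which is the sense in which the abstract describes LLPS-prone proteins as having lower entropy.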
Topics: Amino Acid Sequence; Animals; Humans; Hydrophobic and Hydrophilic Interactions; Machine Learning; Sequence Analysis, Protein
PubMed: 33827920
DOI: 10.1073/pnas.2019053118 -
Current Opinion in Structural Biology Oct 2023 (Review)
Ancestral sequence reconstruction (ASR) provides insight into the changes within a protein sequence across evolution. More specifically, it can illustrate how specific amino acid changes give rise to different phenotypes within a protein family. Over the last few decades it has established itself as a powerful technique for revealing molecular common denominators that govern enzyme function. Here, we describe the strength of ASR in unveiling catalytic mechanisms and emerging phenotypes for a range of different proteins, also highlighting biotechnological applications the methodology can provide.
Topics: Phylogeny; Evolution, Molecular; Proteins; Amino Acid Sequence; Phenotype
PubMed: 37544113
DOI: 10.1016/j.sbi.2023.102669 -
Journal of Structural Biology Sep 2023
Biomaterials for tissue regeneration must mimic the biophysical properties of the native physiological environment. A protein engineering approach allows the generation of protein hydrogels with specific and customised biophysical properties designed to suit a particular physiological environment. Herein, repetitive engineered proteins were successfully designed to form covalent molecular networks with defined physical characteristics able to sustain cell phenotype. Our hydrogel design was made possible by the incorporation of the SpyTag (ST) peptide and multiple repetitive units of the SpyCatcher (SC) protein, which spontaneously formed covalent crosslinks upon mixing. Changing the ratios of the protein building blocks (ST:SC) allowed the viscoelastic properties and gelation speeds of the hydrogels to be altered and controlled. The physical properties of the hydrogels could readily be altered further to suit different environments by tuning the key features in the repetitive protein sequence. The resulting hydrogels were designed to allow cell attachment and encapsulation of liver-derived cells. Biocompatibility of the hydrogels was assayed using a HepG2 cell line constitutively expressing GFP. The cells remained viable and continued to express GFP whilst attached or encapsulated within the hydrogel. Our results demonstrate how this genetically encoded approach using repetitive proteins could be applied to bridge engineering biology with nanotechnology, creating a level of biomaterial customisation previously inaccessible.
Topics: Protein Array Analysis; Hydrogels; Proteins; Biocompatible Materials; Amino Acid Sequence
PubMed: 37245604
DOI: 10.1016/j.jsb.2023.107981 -
Scientific Reports Apr 2023
As synthetic biology becomes increasingly capable and accessible, it is likewise increasingly critical to be able to make accurate biosecurity determinations regarding the pathogenicity or toxicity of particular nucleic acid or amino acid sequences. At present, this is typically done using the BLAST algorithm to determine the best match with sequences in the NCBI nucleic acid and protein databases. Neither BLAST nor any of the NCBI databases, however, are actually designed for biosafety determination. Critically, taxonomic errors or ambiguities in the NCBI nucleic acid and protein databases can also cause errors in BLAST-based taxonomic categorization. With heavily studied taxa and frequently used biotechnology tools, even low frequency taxonomic categorization issues can lead to high rates of errors in biosecurity decision-making. Here we focus on the implications for false positives, finding that BLAST against NCBI's protein database will now incorrectly categorize a number of commonly used biotechnology tool sequences as the pathogens or toxins with which they have been used. Paradoxically, this implies that problems are expected to be most acute for the pathogens and toxins of highest interest and for the most widely used biotechnology tools. We thus conclude that biosecurity tools should shift away from BLAST against general purpose databases and towards new methods that are specifically tailored for biosafety purposes.
Topics: Sequence Alignment; Databases, Protein; Amino Acid Sequence; Biotechnology; Software
PubMed: 37012314
DOI: 10.1038/s41598-023-32481-z -
BMC Genomics Aug 2022
BACKGROUND
Protein-protein interaction (PPI) is very important for many biochemical processes. Therefore, accurate prediction of PPI can help us better understand the role of proteins in biochemical processes. Although there are many methods to predict PPI in biology, they are time-consuming and lack accuracy, so it is necessary to build an efficient and accurate computational model for PPI prediction.
RESULTS
We present a novel sequence-based computational approach called DCSE (Double-Channel-Siamese-Ensemble) to predict potential PPI. In the encoding layer, we treat each amino acid as a word and map it into an N-dimensional vector. In the feature extraction layer, we extract features from local and global perspectives with a Multilayer Convolutional Neural Network (MCN) and a Multilayer Bidirectional Gated Recurrent Unit with Convolutional Neural Networks (MBC). Finally, the output of the feature extraction layer is fed into the prediction layer, which outputs whether the input protein pair will interact with each other. The MCN and MBC are siamese and ensemble-based networks, which can effectively improve the performance of the model. To demonstrate our model's performance, we compare it with four machine learning-based and three deep learning-based models. The results show that our method outperforms the other models on every evaluation criterion except Precision. The Accuracy, Precision, F1, Recall and MCC of our model are 0.9303, 0.9091, 0.9268, 0.9452 and 0.8609, respectively. For the other seven models, the highest Accuracy, Precision, F1, Recall and MCC are 0.9288, 0.9243, 0.9246, 0.9250 and 0.8572, respectively. We also test our model on an imbalanced dataset and transfer it to another species. The results show that our model performs well in both settings.
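The evaluation criteria listed above follow standard definitions over a binary confusion matrix, sketched below. The reported Precision (0.9091) and Recall (0.9452) imply an F1 of about 0.9268 under the usual harmonic-mean formula, which matches the third value reported for the model. The function names and the confusion-matrix counts are made up for illustration; only the two reported values come from the abstract.

```python
from math import sqrt


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def binary_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall, F1 and MCC from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    mcc = (tp * tn - fp * fn) / sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1_score(precision, recall),
        "mcc": mcc,
    }


# Consistency check using the Precision and Recall reported in the abstract.
print(round(f1_score(0.9091, 0.9452), 4))  # → 0.9268

# Hypothetical counts, only to show the function in use.
example = binary_metrics(tp=8, fp=2, tn=7, fn=3)
print(example)
```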
CONCLUSION
Compared with seven other models, our model achieves the best performance. The NLP-based encoding method works well for the PPI prediction task. MCN and MBC extract protein sequence features from local and global perspectives, and these two feature extraction layers are based on siamese and ensemble network structures. The siamese-based network structure keeps the features consistent, and the ensemble-based network structure effectively improves the accuracy of the model.
Topics: Amino Acid Sequence; Machine Learning; Neural Networks, Computer; Proteins
PubMed: 35922751
DOI: 10.1186/s12864-022-08772-6 -
Bioinformatics (Oxford, England) Apr 2022
MOTIVATION
The interactions between proteins and other molecules are essential to many biological and cellular processes. Experimental identification of interface residues is a time-consuming, costly and challenging task, while protein sequence data are ubiquitous. Consequently, many computational and machine learning approaches have been developed over the years to predict such interface residues from sequence. However, the effectiveness of different Deep Learning (DL) architectures and learning strategies for protein-protein, protein-nucleotide and protein-small molecule interface prediction has not yet been investigated in great detail. Therefore, we here explore the prediction of protein interface residues using six DL architectures and various learning strategies with sequence-derived input features.
RESULTS
We constructed a large dataset dubbed BioDL, comprising protein-protein interactions from the PDB, and DNA/RNA and small molecule interactions from the BioLip database. We also constructed six DL architectures, and evaluated them on the BioDL benchmarks. This shows that no single architecture performs best on all instances. An ensemble architecture, which combines all six architectures, does consistently achieve peak prediction accuracy. We confirmed these results on the published benchmark set by Zhang and Kurgan (ZK448), and on our own existing curated homo- and heteromeric protein interaction dataset. Our PIPENN sequence-based ensemble predictor outperforms current state-of-the-art sequence-based protein interface predictors on ZK448 on all interaction types, achieving an AUC-ROC of 0.718 for protein-protein, 0.823 for protein-nucleotide and 0.842 for protein-small molecule.
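The two ideas in this paragraph, combining several predictors into an ensemble and scoring per-residue predictions by AUC-ROC, can be sketched as follows. The abstract does not say how the six architectures are combined, so the unweighted mean used here is an assumption, as are the function names and the toy labels and scores. The AUC is computed from its rank interpretation: the fraction of positive/negative pairs in which the positive example scores higher, with ties counted as half.

```python
def ensemble_mean(prob_lists):
    """Average per-residue probabilities across models (assumed combiner)."""
    return [sum(column) / len(column) for column in zip(*prob_lists)]


def auc_roc(labels, scores):
    """AUC-ROC as the probability that a positive outranks a negative."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Toy data: 1 marks an interface residue, 0 a non-interface residue.
labels = [1, 0, 1, 0]
model_a = [0.9, 0.6, 0.4, 0.2]
model_b = [0.5, 0.2, 0.8, 0.6]
combined = ensemble_mean([model_a, model_b])

print(auc_roc(labels, model_a), auc_roc(labels, model_b), auc_roc(labels, combined))
```

On this toy input the averaged scores rank better than either model alone because the two models err on different residues. That is a property of the made-up numbers, not evidence about the published predictor.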
AVAILABILITY AND IMPLEMENTATION
Source code and datasets are available at https://github.com/ibivu/pipenn/.
SUPPLEMENTARY INFORMATION
Supplementary data are available at Bioinformatics online.
Topics: Proteins; Machine Learning; Software; Amino Acid Sequence; Nucleotides; Computational Biology
PubMed: 35150231
DOI: 10.1093/bioinformatics/btac071 -
Biomolecules Aug 2023
With the development of accurate protein structure prediction algorithms, artificial intelligence (AI) has emerged as a powerful tool in the field of structural biology. AI-based algorithms have been used to analyze large amounts of protein sequence data including the human proteome, complementing experimental structure data found in resources such as the Protein Data Bank. The EBI AlphaFold Protein Structure Database (for example) contains over 200 million structures. In this study, these data have been analyzed to find all human proteins containing (or predicted to contain) the cytosolic glutathione transferase (cGST) fold. A total of 39 proteins were found, including the alpha-, mu-, pi-, sigma-, zeta- and omega-class GSTs, intracellular chloride channels, metaxins, multisynthetase complex components, elongation factor 1 complex components and others. Three broad themes emerge: cGST domains as enzymes, as chloride ion channels and as protein-protein interaction mediators. As the majority of cGSTs are dimers, the AI-based structure prediction algorithm AlphaFold-multimer was used to predict structures of all pairwise combinations of these cGST domains. Potential homo- and heterodimers are described. Experimental biochemical and structural data are used to highlight the strengths and limitations of AI-predicted structures.
Topics: Humans; Glutathione Transferase; Genome, Human; Artificial Intelligence; Algorithms; Amino Acid Sequence
PubMed: 37627305
DOI: 10.3390/biom13081240 -
Bioinformatics (Oxford, England) Aug 2023
MOTIVATION
Protein thermostability is of great interest, both in theory and in practice.
RESULTS
This study compared orthologous proteins with different cellular thermostability. A large number of physicochemical properties of the proteins were calculated and used to develop a series of machine learning models for predicting cellular thermostability differences between orthologous proteins. Most of the important features in these models are also highly correlated with relative cellular thermostability. Comparing the present study with previous comparisons of orthologous proteins from thermophilic and mesophilic organisms shows that most of the highly correlated features are consistent across these studies, suggesting they may be important to protein thermostability.
AVAILABILITY AND IMPLEMENTATION
Data freely available for download at https://github.com/fangj3/cellular-protein-thermostability-dataset.
Topics: Amino Acid Sequence; Proteins
PubMed: 37572303
DOI: 10.1093/bioinformatics/btad504