-
BMC Genomics Jun 2022
BACKGROUND
Affinity prediction between molecule and protein is an important step of virtual screening, which is usually called drug-target affinity (DTA) prediction. Its accuracy directly influences the progress of drug development. Sequence-based drug-target affinity prediction can predict the affinity according to protein sequence, which is fast and can be applied to large datasets. However, due to the lack of protein structure information, the accuracy needs to be improved.
RESULTS
The proposed model, WGNN-DTA, performs competitively on both drug-target affinity (DTA) and compound-protein interaction (CPI) prediction tasks. Experiments designed to test the method in different scenarios indicate that WGNN-DTA is both simple and accurate. Moreover, because it does not need complex steps such as multiple sequence alignment (MSA), it runs quickly and is suitable for screening large databases.
CONCLUSION
We construct protein and molecular graphs from sequences and SMILES strings that can effectively reflect their structures. To utilize the detailed contact information of the protein, a graph neural network is used to extract features and predict the binding affinity based on the graphs; the resulting model is called the weighted graph neural networks drug-target affinity predictor (WGNN-DTA). The proposed method has the advantages of simplicity and high accuracy.
Topics: Amino Acid Sequence; Drug Development; Neural Networks, Computer; Proteins; Sequence Alignment
PubMed: 35715739
DOI: 10.1186/s12864-022-08648-9 -
Bioinformatics (Oxford, England) Sep 2021
MOTIVATION
Facing the increasing gap between high-throughput sequence data and limited functional insights, computational protein function annotation provides a high-throughput alternative to experimental approaches. However, current methods can have limited applicability while relying on protein data besides sequences, or lack generalizability to novel sequences, species and functions.
RESULTS
To overcome the aforementioned barriers in applicability and generalizability, we propose a novel deep learning model using only sequence information for proteins, named Transformer-based protein function Annotation through joint sequence-Label Embedding (TALE). For generalizability to novel sequences, we use self-attention-based transformers to capture global patterns in sequences. For generalizability to unseen or rarely seen functions (tail labels), we embed protein function labels (hierarchical GO terms on directed graphs) together with inputs/features (1D sequences) in a joint latent space. Combining TALE and a sequence similarity-based method, TALE+ outperformed competing methods when only sequence input is available. It even outperformed a state-of-the-art method using network information besides sequence, in two of the three gene ontologies. Furthermore, TALE and TALE+ showed superior generalizability to proteins of low similarity, new species, or rarely annotated functions compared to training data, revealing deep insights into the protein sequence-function relationship. Ablation studies elucidated the contributions of algorithmic components toward accuracy and generalizability, and a GO term-centric analysis was also provided.
AVAILABILITY AND IMPLEMENTATION
The data, source codes and models are available at https://github.com/Shen-Lab/TALE.
SUPPLEMENTARY INFORMATION
Supplementary data are available at Bioinformatics online.
Topics: Proteins; Software; Amino Acid Sequence; Molecular Sequence Annotation; Gene Ontology
PubMed: 33755048
DOI: 10.1093/bioinformatics/btab198 -
Chembiochem : a European Journal of... Feb 2020 (Review)
A common interpretation of Anfinsen's hypothesis states that one amino acid sequence should fold into a single, native, ordered state, or a highly similar set thereof, coinciding with the global minimum in the folding-energy landscape, which, in turn, is responsible for the function of the protein. However, this classical view is challenged by many proteins and peptide sequences, which can adopt exchangeable, significantly dissimilar conformations that even fulfill different biological roles. The similarities and differences of concepts related to these proteins, mainly chameleon sequences, metamorphic proteins, and switch peptides, which are all denoted herein "turncoat" polypeptides, are reviewed. As well as adding a twist to the conventional view of protein folding, the lack of structural definition adds clear versatility to the activity of proteins and can be used as a tool for protein design and further application in biotechnology and biomedicine.
Topics: Amino Acid Sequence; Models, Molecular; Peptides; Protein Conformation; Protein Folding; Proteins; Thermodynamics
PubMed: 31456307
DOI: 10.1002/cbic.201900446 -
MBio Mar 2024
Endosomal sorting complexes required for transport (ESCRT) play key roles in protein sorting between membrane-bounded compartments of eukaryotic cells. Homologs of many ESCRT components are identifiable in various groups of archaea, especially in Asgardarchaeota, the archaeal phylum that is currently considered to include the closest relatives of eukaryotes, but not in bacteria. We performed a comprehensive search for ESCRT protein homologs in archaea and reconstructed ESCRT evolution using the phylogenetic tree of Vps4 ATPase (ESCRT IV) as a scaffold and using sensitive protein sequence analysis and comparison of structural models to identify previously unknown ESCRT proteins. Several distinct groups of ESCRT systems in archaea outside of Asgard were identified, including proteins structurally similar to ESCRT-I and ESCRT-II, and several other domains involved in protein sorting in eukaryotes, suggesting an early origin of these components. Additionally, distant homologs of CdvA proteins were identified in Thermoproteales which are likely components of the uncharacterized cell division system in these archaea. We propose an evolutionary scenario for the origin of eukaryotic and Asgard ESCRT complexes from ancestral building blocks, namely, the Vps4 ATPase, ESCRT-III components, wH (winged helix-turn-helix fold) and possibly also coiled-coil, and Vps28-like domains. The last archaeal common ancestor likely encompassed a complex ESCRT system that was involved in protein sorting. Subsequent evolution involved either simplification, as in the TACK superphylum, where ESCRT was co-opted for cell division, or complexification as in Asgardarchaeota. In Asgardarchaeota, the connection between ESCRT and the ubiquitin system that was previously considered a eukaryotic signature was already established.
IMPORTANCE
All eukaryotic cells possess complex intracellular membrane organization. Endosomal sorting complexes required for transport (ESCRT) play a central role in membrane remodeling which is essential for cellular functionality in eukaryotes. Recently, it has been shown that Asgard archaea, the archaeal phylum that includes the closest known relatives of eukaryotes, encode homologs of many components of the ESCRT systems. We employed protein sequence and structure comparisons to reconstruct the evolution of ESCRT systems in archaea and identified several previously unknown homologs of ESCRT subunits, some of which can be predicted to participate in cell division. The results of this reconstruction indicate that the last archaeal common ancestor already encoded a complex ESCRT system that was involved in protein sorting. In Asgard archaea, ESCRT systems evolved toward greater complexity, and in particular, the connection between ESCRT and the ubiquitin system that was previously considered a eukaryotic signature was established.
Topics: Endosomal Sorting Complexes Required for Transport; Phylogeny; Amino Acid Sequence; Archaea; Adenosine Triphosphatases; Ubiquitins
PubMed: 38380930
DOI: 10.1128/mbio.00335-24 -
Proceedings of the National Academy of... May 2023
With the recent success in calculating protein structures from amino acid sequences using artificial intelligence-based algorithms, an important next step is to decipher how dynamics is encoded by the primary protein sequence so as to better predict function. Such dynamics information is critical for protein design, where strategies could then focus not only on sequences that fold into particular structures that perform a given task, but would also include low-lying excited protein states that could influence the function of the designed protein. Herein, we illustrate the importance of dynamics in modulating the function of C34, a designed α/β protein that captures β-strands of target ligands and is a member of a family of proteins designed to sequester β-strands and β hairpins of aggregation-prone molecules that lead to a variety of pathologies. Using a strategy to "see" regions of C34 that are invisible to NMR spectroscopy as a result of pervasive conformational exchange, as well as a mutagenesis approach whereby C34 molecules are stabilized into a single conformer, we determine the structures of the predominant conformations that are sampled by C34 and show that these attenuate the affinity for cognate peptide. Subsequently, the observed motion is exploited to develop an allosterically regulated peptide binder whose binding affinity can be controlled through the addition of a second molecule. Our study emphasizes the unique role that NMR can play in directing the design process and in the construction of new molecules with more complex functionality.
Topics: Protein Conformation; Artificial Intelligence; Proteins; Amino Acid Sequence; Peptides; Ligands
PubMed: 37094170
DOI: 10.1073/pnas.2303149120 -
Molecular Biology and Evolution Apr 2023
Low complexity sequences (LCRs) are well known within coding as well as non-coding sequences. A low complexity region within a protein must be encoded by the underlying DNA sequence. Here, we examine the relationship between the entropy of the protein sequence and that of the DNA sequence which encodes it. We show that they are poorly correlated whether starting with a low complexity region within the protein and comparing it to the corresponding sequence in the DNA or by finding a low complexity region within coding DNA and comparing it to the corresponding sequence in the protein. We show this is the case within the proteomes of five model organisms: Homo sapiens, Saccharomyces cerevisiae, Drosophila melanogaster, Caenorhabditis elegans, and Arabidopsis thaliana. We also report a significant bias against mononucleic codons in LCR encoding sequences. By comparison with simulated proteomes, we show that highly repetitive LCRs may be explained by neutral, slippage-based evolution, but compositionally biased LCRs with cryptic repeats are not. We demonstrate that other biological biases and forces must be acting to create and maintain these LCRs. Uncovering these forces will improve our understanding of protein LCR evolution.
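The comparison between protein-level and DNA-level entropy can be illustrated with a minimal sketch. It assumes plain Shannon entropy over symbol composition, which may differ from the exact measure and windowing used in the study. The function name `shannon_entropy` and the example sequences are hypothetical, chosen only to show how a low-entropy protein stretch can be encoded by DNA that is less uniform.

```python
import math
from collections import Counter


def shannon_entropy(sequence: str) -> float:
    """Shannon entropy (bits per symbol) of a sequence's symbol composition."""
    if not sequence:
        return 0.0
    total = len(sequence)
    counts = Counter(sequence)
    # Sum p * log2(1/p) over the observed symbols.
    return sum((n / total) * math.log2(total / n) for n in counts.values())


# Hypothetical example: a poly-glutamine stretch and one possible encoding of it.
protein_lcr = "QQQQQQQQ"
coding_dna = "CAGCAACAGCAACAGCAACAGCAA"  # CAG and CAA both encode glutamine (Q)

protein_entropy = shannon_entropy(protein_lcr)  # single symbol, so zero entropy
dna_entropy = shannon_entropy(coding_dna)       # three nucleotides at unequal frequencies
```

Because the protein and DNA alphabets differ in size (20 versus 4 symbols), raw values like these are not directly comparable without normalization. The sketch only shows how the two quantities can diverge for the same region.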
Topics: Animals; Proteome; Drosophila melanogaster; DNA; Amino Acid Sequence; Saccharomyces cerevisiae
PubMed: 37036379
DOI: 10.1093/molbev/msad084 -
BMC Bioinformatics Jul 2022
BACKGROUND
Protein-protein interaction (PPI) is vital for life processes, disease treatment, and drug discovery. The computational prediction of PPI is relatively inexpensive and efficient when compared to traditional wet-lab experiments. Given a new protein, one may wish to find whether the protein has any PPI relationship with other existing proteins. Current computational PPI prediction methods usually compare the new protein to existing proteins one by one in a pairwise manner. This is time consuming.
RESULTS
In this work, we propose a more efficient model, called deep hash learning protein-and-protein interaction (DHL-PPI), to predict all-against-all PPI relationships in a database of proteins. First, DHL-PPI encodes a protein sequence into a binary hash code based on deep features extracted from the protein sequences using deep learning techniques. This encoding scheme enables us to turn the PPI discrimination problem into a much simpler searching problem. The binary hash code for a protein sequence can be regarded as a number. Thus, in the pre-screening stage of DHL-PPI, the string matching problem of comparing a protein sequence against a database with M proteins can be transformed into a much simpler problem: to find a number inside a sorted array of length M. This pre-screening process narrows down the search to a much smaller set of candidate proteins for further confirmation. As a final step, DHL-PPI uses the Hamming distance to verify the final PPI relationship.
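The two primitives named above, looking up a number in a sorted array and comparing codes by Hamming distance, can be sketched as follows. This is only an illustration under stated assumptions: the 8-bit codes, the protein names, and the helper names `find_code` and `hamming_distance` are hypothetical. How DHL-PPI learns its hash codes and maps interacting pairs onto them is not described here, so none of that is modeled.

```python
from bisect import bisect_left
from typing import Sequence


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two integer-encoded hash codes differ."""
    return bin(a ^ b).count("1")


def find_code(sorted_codes: Sequence[int], query: int) -> int:
    """Binary search for query in a sorted array; returns its index or -1."""
    i = bisect_left(sorted_codes, query)
    if i < len(sorted_codes) and sorted_codes[i] == query:
        return i
    return -1


# Hypothetical 8-bit hash codes for three proteins (real codes would be learned).
database = {"P1": 0b10110010, "P2": 0b01001101, "P3": 0b10110011}
sorted_codes = sorted(database.values())

query = 0b10110010
hit_index = find_code(sorted_codes, query)                # O(log M) lookup
distance_to_p3 = hamming_distance(query, database["P3"])  # codes differ in one bit
distance_to_p2 = hamming_distance(query, database["P2"])  # codes differ in every bit
```

Treating each code as an integer means a single lookup costs a binary search over the sorted array rather than a pairwise comparison against every stored protein.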
CONCLUSIONS
The experimental results confirmed that DHL-PPI is feasible and effective. Using a dataset with strictly negative PPI examples of four species, DHL-PPI is shown to be superior or competitive when compared to the other state-of-the-art methods in terms of precision, recall or F1 score. Furthermore, in the prediction stage, the proposed DHL-PPI reduced the time complexity from [Formula: see text] to [Formula: see text] for performing an all-against-all PPI prediction for a database with M proteins. With the proposed approach, a protein database can be preprocessed and stored for later search using the proposed encoding scheme. This can provide a more efficient way to cope with the rapidly increasing volume of protein datasets.
Topics: Amino Acid Sequence; Databases, Protein; Drug Discovery; Proteins
PubMed: 35804303
DOI: 10.1186/s12859-022-04811-x -
Genome Research Jul 2023
Multiple sequence alignment (MSA) is a critical step in the study of protein sequence and function. Typically, MSA algorithms progressively align pairs of sequences and combine these alignments with the aid of a guide tree. These alignment algorithms use scoring systems based on substitution matrices to measure amino acid similarities. Although successful, standard methods struggle on sets of proteins with low sequence identity: the so-called twilight zone of protein alignment. For these difficult cases, another source of information is needed. Protein language models are a powerful new approach that leverages massive sequence data sets to produce high-dimensional contextual embeddings for each amino acid in a sequence. These embeddings have been shown to reflect physicochemical and higher-order structural and functional attributes of amino acids within proteins. Here, we present a novel approach to MSA, based on clustering and ordering amino acid contextual embeddings. Our method for aligning semantically consistent groups of proteins circumvents the need for many standard components of MSA algorithms, avoiding initial guide tree construction, intermediate pairwise alignments, gap penalties, and substitution matrices. The added information from contextual embeddings leads to higher accuracy alignments for structurally similar proteins with low amino-acid similarity. We anticipate that protein language models will become a fundamental component of the next generation of algorithms for generating MSAs.
Topics: Sequence Alignment; Proteins; Amino Acid Sequence; Algorithms; Amino Acids; Language
PubMed: 37414576
DOI: 10.1101/gr.277675.123 -
Briefings in Bioinformatics Jan 2023
Protein language models, trained on millions of biologically observed sequences, generate feature-rich numerical representations of protein sequences. These representations, called sequence embeddings, can infer structure-functional properties, despite protein language models being trained on primary sequence alone. While sequence embeddings have been applied toward tasks such as structure and function prediction, applications toward alignment-free sequence classification have been hindered by the lack of studies to derive, quantify and evaluate relationships between protein sequence embeddings. Here, we develop workflows and visualization methods for the classification of protein families using sequence embedding derived from protein language models. A benchmark of manifold visualization methods reveals that Neighbor Joining (NJ) embedding trees are highly effective in capturing global structure while achieving similar performance in capturing local structure compared with popular dimensionality reduction techniques such as t-SNE and UMAP. The statistical significance of hierarchical clusters on a tree is evaluated by resampling embeddings using a variational autoencoder (VAE). We demonstrate the application of our methods in the classification of two well-studied enzyme superfamilies, phosphatases and protein kinases. Our embedding-based classifications remain consistent with and extend upon previously published sequence alignment-based classifications. We also propose a new hierarchical classification for the S-Adenosyl-L-Methionine (SAM) enzyme superfamily which has been difficult to classify using traditional alignment-based approaches. Beyond applications in sequence classification, our results further suggest NJ trees are a promising general method for visualizing high-dimensional data sets.
Topics: Amino Acid Sequence; Cluster Analysis; Proteins; Sequence Alignment
PubMed: 36642409
DOI: 10.1093/bib/bbac619 -
Nucleic Acids Research Jan 2022
Proteome-pI 2.0 is an update of an online database containing predicted isoelectric points and pKa dissociation constants of proteins and peptides. The isoelectric point, the pH at which a particular molecule carries no net electrical charge, is an important parameter for many analytical biochemistry and proteomics techniques. Additionally, it can be obtained directly from the pKa values of individual charged residues of the protein. The Proteome-pI 2.0 database includes data for over 61 million protein sequences from 20 115 proteomes (three to four times more than the previous release). The isoelectric point for proteins is predicted by 21 methods, whereas pKa values are inferred by one method. To facilitate bottom-up proteomics analysis, individual proteomes were digested in silico with the five most commonly used proteases (trypsin, chymotrypsin, trypsin + LysC, LysN, ArgC), and the peptides' isoelectric points and molecular weights were calculated. The database enables the retrieval of virtual 2D-PAGE plots and customized fractions of a proteome based on the isoelectric point and molecular weight. In addition, isoelectric points for proteins in NCBI non-redundant (nr), UniProt, SwissProt, and Protein Data Bank are available in both CSV and FASTA formats. The database can be accessed at http://isoelectricpointdb2.org.
Topics: Amino Acid Sequence; Computational Biology; Databases, Protein; Electrophoresis, Gel, Two-Dimensional; Isoelectric Point; Molecular Weight; Peptides; Proteome; Proteomics
PubMed: 34718696
DOI: 10.1093/nar/gkab944