-
Journal of Proteome Research, Feb 2023
Protein structure defines protein function and plays an extremely important role in protein characterization. Recently, two groups of researchers from DeepMind and the Baker lab have independently published protein structure prediction tools that can help us obtain predicted protein structures for the whole human proteome. This enabled us to visualize the entire human proteome using predicted 3D structures for the first time. To help other researchers best utilize these protein structure predictions in proteomics experiments, we present the Sequence Coverage Visualizer (SCV), http://scv.lab.gy, a web application for protein sequence coverage 3D visualization. Here we showed a few possible usages of the SCV, including the labeling of post-translational modifications and isotope labeling experiments. These results highlight the usefulness of such 3D visualization for proteomics experiments and how SCV can turn a regular proteomics experiment (identified peptide list) into structural insights. Furthermore, when used together with limited proteolysis, we demonstrated that SCV can help to compare different protein structures from different sources, including predicted ones and existing PDB entries. We hope our tool can provide help in the process of improving protein structure prediction accuracy. Overall, SCV is a convenient and powerful tool for visualizing proteomics results in 3D.
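As an illustration of the idea behind sequence coverage, the following is a minimal sketch, not SCV's actual implementation. It assumes peptides are matched to the protein by exact substring search; the function names `coverage_mask` and `coverage_fraction` are hypothetical. The resulting per-residue mask is the kind of information that could then be painted onto a 3D structure.

```python
def coverage_mask(protein: str, peptides: list[str]) -> list[bool]:
    """Mark every residue of `protein` that lies inside an identified peptide.

    Assumption: peptides match the protein sequence exactly (no modifications
    encoded in the peptide string, no I/L ambiguity handling).
    """
    mask = [False] * len(protein)
    for pep in peptides:
        if not pep:
            continue
        start = protein.find(pep)
        while start != -1:
            for i in range(start, start + len(pep)):
                mask[i] = True
            # Continue searching so repeated occurrences are also marked.
            start = protein.find(pep, start + 1)
    return mask


def coverage_fraction(mask: list[bool]) -> float:
    """Fraction of residues covered by at least one peptide."""
    return sum(mask) / len(mask) if mask else 0.0


if __name__ == "__main__":
    mask = coverage_mask("MKTAYIAKQR", ["KTAY", "KQR"])
    print(coverage_fraction(mask))
```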
Topics: Humans; Proteome; Imaging, Three-Dimensional; Amino Acid Sequence; Peptides; Proteomics; Software
PubMed: 36511722
DOI: 10.1021/acs.jproteome.2c00358 -
PloS One, 2023
With the growing interest in using phages to combat antimicrobial resistance, computational methods for predicting phage-host interactions have been explored to help shortlist candidate phages. Most existing models consider entire proteomes and rely on manual feature engineering, which poses difficulty in selecting the most informative sequence properties to serve as input to the model. In this paper, we framed phage-host interaction prediction as a multiclass classification problem that takes as input the embeddings of a phage's receptor-binding proteins, which are known to be the key machinery for host recognition, and predicts the host genus. We explored different protein language models to automatically encode these protein sequences into dense embeddings without the need for additional alignment or structural information. We show that the use of embeddings of receptor-binding proteins presents improvements over handcrafted genomic and protein sequence features. The highest performance was obtained using the transformer-based protein language model ProtT5, resulting in a 3% to 4% increase in weighted F1 and recall scores across different prediction confidence thresholds, compared to using selected handcrafted sequence features.
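The evaluation described above can be sketched roughly as follows. This is an illustrative sketch only: the abstract does not say how predictions below the confidence threshold are handled, so the fallback label `"other"` is an assumption, and the helper names `predict_with_threshold` and `weighted_f1` are hypothetical. The class probabilities would come from a classifier trained on receptor-binding protein embeddings, which is not shown here.

```python
from collections import Counter


def predict_with_threshold(probs, genera, threshold, fallback="other"):
    """Return the most probable host genus, or `fallback` if the model's
    confidence is below `threshold` (the fallback label is an assumption)."""
    best = max(range(len(probs)), key=probs.__getitem__)
    return genera[best] if probs[best] >= threshold else fallback


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights proportional to class support."""
    support = Counter(y_true)
    total = 0.0
    for cls, n in support.items():
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * n
    return total / len(y_true)


if __name__ == "__main__":
    genera = ["Escherichia", "Klebsiella", "Salmonella"]
    print(predict_with_threshold([0.2, 0.7, 0.1], genera, threshold=0.6))
    print(weighted_f1(["A", "A", "B"], ["A", "B", "B"]))
```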
Topics: Amino Acid Sequence; Proteome; Bacteriophages; Differential Threshold; Mental Recall
PubMed: 37486915
DOI: 10.1371/journal.pone.0289030 -
BMC Bioinformatics, Oct 2021
BACKGROUND
Protein-protein interactions (PPIs) are essential to most biological processes. Predicting PPIs helps in understanding protein functions and is therefore useful for pathological analysis, disease diagnosis, and drug design. As the amount of protein data grows rapidly in the post-genomic era, high-throughput experimental methods remain expensive and time-consuming for identifying PPIs. Computational methods have therefore attracted researchers' attention in recent years, and a large number of them have been proposed based on different protein sequence encoders.
RESULTS
Notably, the confidence score of a protein sequence pair could be regarded as a kind of measurement to PPIs. The higher the confidence score for one protein pair is, the more likely the protein pair interacts. Thus in this paper, a deep learning framework, called ordinal regression and recurrent convolutional neural network (OR-RCNN) method, is introduced to predict PPIs from the perspective of confidence score. It mainly contains two parts: the encoder part of protein sequence pair and the prediction part of PPIs by confidence score. In the first part, two recurrent convolutional neural networks (RCNNs) with shared parameters are applied to construct two protein sequence embedding vectors, which can automatically extract robust local features and sequential information from the protein pairs. Based on it, the two embedding vectors are encoded into one novel embedding vector by element-wise multiplication. By taking the ordinal information behind confidence score into consideration, ordinal regression is used to construct multiple sub-classifiers in the second part. The results of multiple sub-classifiers are aggregated to obtain the final confidence score. Following that, the existence of PPIs is determined by the confidence score. We set a threshold [Formula: see text], and say the interaction exists between the protein pair if its confidence score is bigger than [Formula: see text].
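The two-stage pipeline described above can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the OR-RCNN implementation: the RCNN encoder is omitted and the embeddings are taken as given, each ordinal sub-classifier is assumed to output the probability that the confidence exceeds its level, and averaging those probabilities is only one plausible aggregation rule (the abstract does not specify it). All function names are hypothetical.

```python
def combine_embeddings(u, v):
    """Merge two protein embeddings by element-wise multiplication."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same length")
    return [a * b for a, b in zip(u, v)]


def aggregate_ordinal(sub_probs):
    """Aggregate the ordinal sub-classifier outputs into one score in [0, 1].

    Assumption: sub_probs[k] is P(confidence > level k); the mean is used
    here as the final confidence score.
    """
    if not sub_probs:
        raise ValueError("at least one sub-classifier output is required")
    return sum(sub_probs) / len(sub_probs)


def interacts(sub_probs, threshold):
    """Call an interaction when the aggregated score exceeds the threshold."""
    return aggregate_ordinal(sub_probs) > threshold


if __name__ == "__main__":
    pair_vector = combine_embeddings([1.0, 2.0], [3.0, 0.5])
    print(pair_vector)
    print(interacts([1.0, 0.5, 0.0], threshold=0.4))
```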
CONCLUSIONS
We applied our method to predict PPIs on S. cerevisiae and Homo sapiens data sets. Experimental verification shows that our method outperforms state-of-the-art PPI prediction models.
Topics: Amino Acid Sequence; Humans; Neural Networks, Computer; Proteins; Saccharomyces cerevisiae
PubMed: 34625020
DOI: 10.1186/s12859-021-04369-0 -
International Journal of Molecular... Jan 2023 (Review)
Review
This review explains the origin of the LIV-1 family of zinc transporters, paying attention to how this family of nine human proteins was originally discovered. Structural and functional differences between these nine human LIV-1 family members and the five other ZIP transporters are examined. These differences relate to aspects of the protein sequence, to the conservation of important motifs, and to the effect these may have on overall function. The LIV-1 family depends on various post-translational modifications, such as phosphorylation and cleavage, which play an important role in its members' ability to transport zinc. These modifications and their implications are discussed in detail. Some of these proteins have been implicated in cancer, and this involvement is also examined. Finally, additional areas of potentially fruitful discovery are suggested as worthy of future examination.
Topics: Humans; Carrier Proteins; Membrane Transport Proteins; Zinc; Amino Acid Sequence
PubMed: 36674777
DOI: 10.3390/ijms24021255 -
BMC Bioinformatics, Feb 2024
BACKGROUND
To explore the evolutionary history of sequences, a sequence alignment is a first and necessary step, and its quality is crucial. In the context of the study of the proximal origins of SARS-CoV-2 coronavirus, we wanted to construct an alignment of genomes closely related to SARS-CoV-2 using both coding and non-coding sequences. To our knowledge, there is no tool that can be used to construct this type of alignment, which motivated the creation of CNCA.
RESULTS
CNCA is a web tool that aligns annotated genomes from GenBank files. It generates a nucleotide alignment that is then updated based on the protein sequence alignment. The output final nucleotide alignment matches the protein alignment and guarantees no frameshift. CNCA was designed to align closely related small genome sequences up to 50 kb (typically viruses) for which the gene order is conserved.
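The core idea of keeping a nucleotide alignment consistent with a protein alignment can be illustrated with a codon-level back-translation. This is a minimal sketch of the general technique, not CNCA's code: it handles a single coding sequence, ignores non-coding regions entirely, and the function name `codon_align` is hypothetical. Because every protein gap becomes a three-nucleotide gap, the output length is always a multiple of three and no frameshift can be introduced.

```python
def codon_align(aligned_protein: str, cds: str) -> str:
    """Thread an ungapped coding sequence onto a gapped protein alignment row.

    Each residue consumes one codon from `cds`; each '-' in the protein row
    is expanded to '---' in the nucleotide row.
    """
    residues = aligned_protein.replace("-", "")
    if len(cds) != 3 * len(residues):
        raise ValueError("CDS length must be exactly 3x the number of residues")
    out = []
    pos = 0
    for ch in aligned_protein:
        if ch == "-":
            out.append("---")
        else:
            out.append(cds[pos:pos + 3])
            pos += 3
    return "".join(out)


if __name__ == "__main__":
    print(codon_align("M-K", "ATGAAA"))
```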
CONCLUSIONS
CNCA constructs multiple alignments of small genomes by integrating both coding and non-coding sequences. This preserves regions traditionally ignored in conventional back-translation methods, such as non-coding regions.
Topics: Genome; Sequence Alignment; Proteins; Amino Acid Sequence; Nucleotides
PubMed: 38424511
DOI: 10.1186/s12859-024-05700-1 -
Bioinformatics (Oxford, England), Mar 2023
MOTIVATION
Computational protein sequence design has been widely applied in rational protein engineering and increasing the design accuracy and efficiency is highly desired.
RESULTS
Here, we present ProDESIGN-LE, an accurate and efficient approach to protein sequence design. ProDESIGN-LE adopts a concise but informative representation of the residue's local environment and trains a transformer to learn the correlation between local environment of residues and their amino acid types. For a target backbone structure, ProDESIGN-LE uses the transformer to assign an appropriate residue type for each position based on its local environment within this structure, eventually acquiring a designed sequence with all residues fitting well with their local environments. We applied ProDESIGN-LE to design sequences for 68 naturally occurring and 129 hallucinated proteins within 20 s per protein on average. The designed proteins have their predicted structures perfectly resembling the target structures with a state-of-the-art average TM-score exceeding 0.80. We further experimentally validated ProDESIGN-LE by designing five sequences for an enzyme, chloramphenicol O-acetyltransferase type III (CAT III), and recombinantly expressing the proteins in Escherichia coli. Of these proteins, three exhibited excellent solubility, and one yielded monomeric species with circular dichroism spectra consistent with the natural CAT III protein.
AVAILABILITY AND IMPLEMENTATION
The source code of ProDESIGN-LE is available at https://github.com/bigict/ProDESIGN-LE.
Topics: Amino Acid Sequence; Proteins; Software
PubMed: 36916746
DOI: 10.1093/bioinformatics/btad122 -
Biomolecules, Jan 2022
Protein-peptide interactions (PpIs) are a subset of the overall protein-protein interaction (PPI) network in the living cell and are pivotal for the majority of cell processes and functions. High-throughput methods to detect PpIs and PPIs usually require time and costs that are not always affordable. Therefore, reliable in silico predictions represent a valid and effective alternative. In this work, a new algorithm is described and implemented in a freely available tool, "PepThreader", to carry out PPI and PpI prediction and analysis. PepThreader threads multiple fragments derived from a full-length protein sequence (or from a peptide library) onto a second template peptide, in complex with a protein target, "spotting" the potential binding peptides and ranking them according to a sequence-based and structure-based threading score. The threading algorithm first uses a scoring function based on peptide sequence similarity. The initial hits are then reranked according to structure-based scoring functions. PepThreader has been benchmarked on a dataset of 292 protein-peptide complexes that were collected from existing databases of experimentally determined protein-peptide interactions. An accuracy of 80% was achieved when considering the top 25 predicted hits, which is comparable to other state-of-the-art tools in PPI and PpI modeling. Nonetheless, PepThreader is unique in that it can simultaneously spot a binding peptide within a full-length sequence involved in a PPI and model its structure within the receptor. Therefore, PepThreader adds to the available tools supporting the experimental identification and characterization of PPIs and PpIs.
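The first, sequence-based stage of such a threading procedure can be sketched as a sliding-window scan. This is an illustrative sketch only and does not reproduce PepThreader: fractional sequence identity stands in for the tool's actual similarity score, the structure-based reranking step is not shown, and the names `fragments`, `identity`, and `top_hits` are hypothetical.

```python
def fragments(protein: str, length: int):
    """All overlapping fragments of `length` residues, with start positions."""
    return [(i, protein[i:i + length]) for i in range(len(protein) - length + 1)]


def identity(a: str, b: str) -> float:
    """Fraction of identical positions between two equal-length peptides."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def top_hits(protein: str, template: str, n: int = 25):
    """Rank fragments of `protein` by similarity to the template peptide.

    Returns up to `n` tuples (score, start, fragment), best first; ties are
    broken by start position. A structure-based rerank would follow here.
    """
    if not template:
        raise ValueError("template peptide must not be empty")
    scored = [
        (identity(frag, template), start, frag)
        for start, frag in fragments(protein, len(template))
    ]
    scored.sort(key=lambda t: (-t[0], t[1]))
    return scored[:n]


if __name__ == "__main__":
    for hit in top_hits("AAPPLPAA", "PPLP", n=3):
        print(hit)
```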
Topics: Amino Acid Sequence; Peptide Library; Peptides; Protein Interaction Mapping; Software
PubMed: 35204702
DOI: 10.3390/biom12020201 -
Scientific Reports, Jul 2022
Bio-sequence comparators are among the most basic and significant methods for assessing biological data, and, given the importance of proteins, protein sequence comparators are particularly crucial. At the same time, the complexity of the problem, the growing number of extracted protein sequences, and the growth of studies and data analysis applications addressing protein sequences call for a rapid and accurate approach. We therefore propose a protein sequence comparison approach, called PCV, which improves comparison accuracy by producing vectors that encode sequence data as well as physicochemical properties of the amino acids. By partitioning long protein sequences into fixed-length blocks and providing an encoding vector for each block, the method also allows for a parallel and fast implementation. To evaluate the performance of PCV, as with other alignment-free methods, we used 12 benchmark datasets including classes with homologous sequences, which may require a simple preprocessing search tool to select the homologous data. We then compared the protein sequence comparison outcomes to those of alternative alignment-based and alignment-free methods, using various evaluation criteria. The results indicate that our method provides a significant improvement in sequence classification accuracy compared to the alternative alignment-free methods and has an average correlation of about 94% with ClustalW, our reference method, while considerably reducing the processing time.
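The block-wise encoding idea can be illustrated as follows. This is a rough sketch under stated assumptions rather than the PCV encoding itself: the abstract does not specify which physicochemical properties or vector layout are used, so the Kyte-Doolittle hydropathy scale and the two-component block vector here are stand-ins, and the function names are hypothetical. Because each block is encoded independently, the per-block work could be distributed across workers.

```python
# Kyte-Doolittle hydropathy index, used here as an example property only.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def split_blocks(sequence: str, block_len: int) -> list[str]:
    """Partition a sequence into consecutive fixed-length blocks.

    The last block may be shorter than `block_len`.
    """
    if block_len <= 0:
        raise ValueError("block_len must be positive")
    return [sequence[i:i + block_len] for i in range(0, len(sequence), block_len)]


def encode_block(block: str) -> list[float]:
    """Encode one block as [mean hydropathy, fraction of hydrophobic residues].

    Residues missing from the table are treated as neutral (0.0).
    """
    values = [HYDROPATHY.get(aa, 0.0) for aa in block]
    mean = sum(values) / len(values)
    hydrophobic = sum(v > 0 for v in values) / len(values)
    return [mean, hydrophobic]


def encode_sequence(sequence: str, block_len: int) -> list[list[float]]:
    """One encoding vector per block; blocks are independent of each other."""
    return [encode_block(b) for b in split_blocks(sequence, block_len)]


if __name__ == "__main__":
    print(encode_sequence("AIKD", block_len=2))
```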
Topics: Algorithms; Amino Acid Sequence; Amino Acids; Proteins; Sequence Alignment
PubMed: 35778592
DOI: 10.1038/s41598-022-15266-8 -
PLoS Computational Biology, Mar 2016 (Review)
Review
Nuclear magnetic resonance (NMR) spectroscopy provides a unique toolbox of experimental probes for studying dynamic processes on a wide range of timescales, ranging from picoseconds to milliseconds and beyond. Along with NMR hardware developments, recent methodological advancements have enabled the characterization of allosteric proteins at unprecedented detail, revealing intriguing aspects of allosteric mechanisms and increasing the proportion of the conformational ensemble that can be observed by experiment. Here, we present an overview of NMR spectroscopic methods for characterizing equilibrium fluctuations in free and bound states of allosteric proteins that have been most influential in the field. By combining NMR experimental approaches with molecular simulations, atomistic-level descriptions of the mechanisms by which allosteric phenomena take place are now within reach.
Topics: Allosteric Regulation; Allosteric Site; Amino Acid Sequence; Enzyme Activation; Enzymes; Magnetic Resonance Spectroscopy; Models, Chemical; Molecular Dynamics Simulation; Molecular Sequence Data; Protein Binding; Sequence Analysis, Protein
PubMed: 26964042
DOI: 10.1371/journal.pcbi.1004620 -
Briefings in Bioinformatics, Jan 2023
Protein language modeling is a fast-emerging deep learning method in bioinformatics with diverse applications such as structure prediction and protein design. However, its application to estimating sequence conservation for functional site prediction has not been systematically explored. Here, we present a method for the alignment-free estimation of sequence conservation using sequence embeddings generated from protein language models. Comprehensive benchmarks across publicly available protein language models reveal that ESM2 models provide the best ratio of performance to computational cost for conservation estimation. Applying our method to full-length protein sequences, we demonstrate that embedding-based methods are not sensitive to the order of conserved elements: conservation scores can be calculated for multidomain proteins in a single run, without the need to separate individual domains. Our method can also identify conserved functional sites within fast-evolving sequence regions (such as domain inserts), which we demonstrate through the identification of conserved phosphorylation motifs in variable insert segments in protein kinases. Overall, embedding-based conservation analysis is a broadly applicable method for identifying potential functional sites in any full-length protein sequence and estimating conservation in an alignment-free manner. To run this on your protein sequence of interest, try our scripts at https://github.com/esbgkannan/kibby.
Topics: Amino Acid Sequence; Proteins; Computational Biology; Conserved Sequence
PubMed: 36631405
DOI: 10.1093/bib/bbac599