-
Molecular & Cellular Proteomics : MCP Aug 2023The human proteome comprises of all of the proteins produced by the sequences translated from the human genome with additional modifications in both sequence and... (Review)
Review
The human proteome comprises of all of the proteins produced by the sequences translated from the human genome with additional modifications in both sequence and function caused by nonsynonymous variants and posttranslational modifications including cleavage of the initial transcript into smaller peptides and polypeptides. The UniProtKB database (www.uniprot.org) is the world's leading high-quality, comprehensive and freely accessible resource of protein sequence and functional information and presents a summary of experimentally verified, or computationally predicted, functional information added by our expert biocuration team for each protein in the proteome. Researchers in the field of mass spectrometry-based proteomics both consume and add to the body of data available in UniProtKB, and this review highlights the information we provide to this community and the knowledge we in turn obtain from groups via deposition of large-scale datasets in public domain databases.
Topics: Humans; Proteomics; Proteome; Databases, Protein; Amino Acid Sequence; Peptides
PubMed: 37301379
DOI: 10.1016/j.mcpro.2023.100591 -
Journal of Proteome Research Dec 2023Top-down proteomics (TDP) aims to identify and profile intact protein forms (proteoforms) extracted from biological samples. True proteoform characterization requires... (Review)
Review
Top-down proteomics (TDP) aims to identify and profile intact protein forms (proteoforms) extracted from biological samples. True proteoform characterization requires that both the base protein sequence be defined and any mass shifts identified, ideally localizing their positions within the protein sequence. Being able to fully elucidate proteoform profiles lends insight into characterizing proteoform-unique roles, and is a crucial aspect of defining protein structure-function relationships and the specific roles of different (combinations of) protein modifications. However, defining and pinpointing protein post-translational modifications (PTMs) on intact proteins remains a challenge. Characterization of (heavily) modified proteins (>∼30 kDa) remains problematic, especially when they exist in a population of similarly modified, or kindred, proteoforms. This issue is compounded as the number of modifications increases, and thus the number of theoretical combinations. Here, we present our perspective on the challenges of analyzing kindred proteoform populations, focusing on annotation of protein modifications on an "average" protein. Furthermore, we discuss the technical requirements to obtain high quality fragmentation spectral data to robustly define site-specific PTMs, and the fact that this is tempered by the time requirements necessary to separate proteoforms in advance of mass spectrometry analysis.
Topics: Tandem Mass Spectrometry; Proteomics; Proteins; Protein Processing, Post-Translational; Amino Acid Sequence; Proteome
PubMed: 37937372
DOI: 10.1021/acs.jproteome.3c00416 -
Bioinformatics (Oxford, England) Nov 2023High-quality computational structural models are now precomputed and available for nearly every protein in UniProt. However, the best way to leverage these models to...
MOTIVATION
High-quality computational structural models are now precomputed and available for nearly every protein in UniProt. However, the best way to leverage these models to predict which pairs of proteins interact in a high-throughput manner is not immediately clear. The recent Foldseek method of van Kempen et al. encodes the structural information of distances and angles along the protein backbone into a linear string of the same length as the protein string, using tokens from a 21-letter discretized structural alphabet (3Di).
RESULTS
We show that using both the amino acid sequence and the 3Di sequence generated by Foldseek as inputs to our recent deep-learning method, Topsy-Turvy, substantially improves the performance of predicting protein-protein interactions cross-species. Thus TT3D (Topsy-Turvy 3D) presents a way to reuse all the computational effort going into producing high-quality structural models from sequence, while being sufficiently lightweight so that high-quality binary protein-protein interaction predictions across all protein pairs can be made genome-wide.
AVAILABILITY AND IMPLEMENTATION
TT3D is available at https://github.com/samsledje/D-SCRIPT. An archived version of the code at time of submission can be found at https://zenodo.org/records/10037674.
Topics: Amino Acid Sequence; Software; Proteins
PubMed: 37897686
DOI: 10.1093/bioinformatics/btad663 -
PloS One 2023With the growing interest in using phages to combat antimicrobial resistance, computational methods for predicting phage-host interactions have been explored to help...
With the growing interest in using phages to combat antimicrobial resistance, computational methods for predicting phage-host interactions have been explored to help shortlist candidate phages. Most existing models consider entire proteomes and rely on manual feature engineering, which poses difficulty in selecting the most informative sequence properties to serve as input to the model. In this paper, we framed phage-host interaction prediction as a multiclass classification problem that takes as input the embeddings of a phage's receptor-binding proteins, which are known to be the key machinery for host recognition, and predicts the host genus. We explored different protein language models to automatically encode these protein sequences into dense embeddings without the need for additional alignment or structural information. We show that the use of embeddings of receptor-binding proteins presents improvements over handcrafted genomic and protein sequence features. The highest performance was obtained using the transformer-based protein language model ProtT5, resulting in a 3% to 4% increase in weighted F1 and recall scores across different prediction confidence thresholds, compared to using selected handcrafted sequence features.
Topics: Amino Acid Sequence; Proteome; Bacteriophages; Differential Threshold; Mental Recall
PubMed: 37486915
DOI: 10.1371/journal.pone.0289030 -
PloS One 2023The structure and sequence of proteins strongly influence their biological functions. New models and algorithms can help researchers in understanding how the evolution...
The structure and sequence of proteins strongly influence their biological functions. New models and algorithms can help researchers in understanding how the evolution of sequences and structures is related to changes in functions. Recently, studies of SARS-CoV-2 Spike (S) protein structures have been performed to predict binding receptors and infection activity in COVID-19, hence the scientific interest in the effects of virus mutations due to sequence, structure and vaccination arises. However, there is the need for models and tools to study the links between the evolution of S protein sequence, structure and functions, and virus transmissibility and the effects of vaccination. As studies on S protein have been generated a large amount of relevant information, we propose in this work to use Protein Contact Networks (PCNs) to relate protein structures with biological properties by means of network topology properties. Topological properties are used to compare the structural changes with sequence changes. We find that both node centrality and community extraction analysis can be used to relate protein stability and functionality with sequence mutations. Starting from this we compare structural evolution to sequence changes and study mutations from a temporal perspective focusing on virus variants. Finally by applying our model to the Omicron variant we report a timeline correlation between Omicron and the vaccination campaign.
Topics: Humans; SARS-CoV-2; COVID-19; Amino Acid Sequence; Mutation; Spike Glycoprotein, Coronavirus
PubMed: 37471335
DOI: 10.1371/journal.pone.0283400 -
BMC Bioinformatics Feb 2024To explore the evolutionary history of sequences, a sequence alignment is a first and necessary step, and its quality is crucial. In the context of the study of the...
BACKGROUND
To explore the evolutionary history of sequences, a sequence alignment is a first and necessary step, and its quality is crucial. In the context of the study of the proximal origins of SARS-CoV-2 coronavirus, we wanted to construct an alignment of genomes closely related to SARS-CoV-2 using both coding and non-coding sequences. To our knowledge, there is no tool that can be used to construct this type of alignment, which motivated the creation of CNCA.
RESULTS
CNCA is a web tool that aligns annotated genomes from GenBank files. It generates a nucleotide alignment that is then updated based on the protein sequence alignment. The output final nucleotide alignment matches the protein alignment and guarantees no frameshift. CNCA was designed to align closely related small genome sequences up to 50 kb (typically viruses) for which the gene order is conserved.
CONCLUSIONS
CNCA constructs multiple alignments of small genomes by integrating both coding and non-coding sequences. This preserves regions traditionally ignored in conventional back-translation methods, such as non-coding regions.
Topics: Genome; Sequence Alignment; Proteins; Amino Acid Sequence; Nucleotides
PubMed: 38424511
DOI: 10.1186/s12859-024-05700-1 -
Briefings in Bioinformatics Jan 2024Lysine lactylation (Kla) is a newly discovered posttranslational modification that is involved in important life activities, such as glycolysis-related cell function,...
Lysine lactylation (Kla) is a newly discovered posttranslational modification that is involved in important life activities, such as glycolysis-related cell function, macrophage polarization and nervous system regulation, and has received widespread attention due to the Warburg effect in tumor cells. In this work, we first design a natural language processing method to automatically extract the 3D structural features of Kla sites, avoiding potential biases caused by manually designed structural features. Then, we establish two Kla prediction frameworks, Attention-based feature fusion Kla model (ABFF-Kla) and EBFF-Kla, to integrate the sequence features and the structure features based on the attention layer and embedding layer, respectively. The results indicate that ABFF-Kla and Embedding-based feature fusion Kla model (EBFF-Kla), which fuse features from protein sequences and spatial structures, have better predictive performance than that of models that use only sequence features. Our work provides an approach for the automatic extraction of protein structural features, as well as a flexible framework for Kla prediction. The source code and the training data of the ABFF-Kla and the EBFF-Kla are publicly deposited at: https://github.com/ispotato/Lactylation_model.
Topics: Amino Acid Sequence; Lysine; Natural Language Processing; Protein Domains; Protein Processing, Post-Translational
PubMed: 38385873
DOI: 10.1093/bib/bbad539 -
Scientific Reports Aug 2023Deep learning models are seeing increased use as methods to predict mutational effects or allowed mutations in proteins. The models commonly used for these purposes...
Deep learning models are seeing increased use as methods to predict mutational effects or allowed mutations in proteins. The models commonly used for these purposes include large language models (LLMs) and 3D Convolutional Neural Networks (CNNs). These two model types have very different architectures and are commonly trained on different representations of proteins. LLMs make use of the transformer architecture and are trained purely on protein sequences whereas 3D CNNs are trained on voxelized representations of local protein structure. While comparable overall prediction accuracies have been reported for both types of models, it is not known to what extent these models make comparable specific predictions and/or generalize protein biochemistry in similar ways. Here, we perform a systematic comparison of two LLMs and two structure-based models (CNNs) and show that the different model types have distinct strengths and weaknesses. The overall prediction accuracies are largely uncorrelated between the sequence- and structure-based models. Overall, the two structure-based models are better at predicting buried aliphatic and hydrophobic residues whereas the two LLMs are better at predicting solvent-exposed polar and charged amino acids. Finally, we find that a combined model that takes the individual model predictions as input can leverage these individual model strengths and results in significantly improved overall prediction accuracy.
Topics: Amino Acid Sequence; Amino Acids; Antifibrinolytic Agents; Electric Power Supplies; Language
PubMed: 37587128
DOI: 10.1038/s41598-023-40247-w -
Nature Biotechnology Dec 2023An average shotgun proteomics experiment detects approximately 10,000 human proteins from a single sample. However, individual proteins are typically identified by...
An average shotgun proteomics experiment detects approximately 10,000 human proteins from a single sample. However, individual proteins are typically identified by peptide sequences representing a small fraction of their total amino acids. Hence, an average shotgun experiment fails to distinguish different protein variants and isoforms. Deeper proteome sequencing is therefore required for the global discovery of protein isoforms. Using six different human cell lines, six proteases, deep fractionation and three tandem mass spectrometry fragmentation methods, we identify a million unique peptides from 17,717 protein groups, with a median sequence coverage of approximately 80%. Direct comparison with RNA expression data provides evidence for the translation of most nonsynonymous variants. We have also hypothesized that undetected variants likely arise from mutation-induced protein instability. We further observe comparable detection rates for exon-exon junction peptides representing constitutive and alternative splicing events. Our dataset represents a resource for proteoform discovery and provides direct evidence that most frame-preserving alternatively spliced isoforms are translated.
Topics: Humans; Proteome; Protein Isoforms; Alternative Splicing; Peptides; Amino Acid Sequence
PubMed: 36959352
DOI: 10.1038/s41587-023-01714-x -
Bioinformatics (Oxford, England) Jun 2023Millions of protein sequences have been generated by numerous genome and transcriptome sequencing projects. However, experimentally determining the function of the...
MOTIVATION
Millions of protein sequences have been generated by numerous genome and transcriptome sequencing projects. However, experimentally determining the function of the proteins is still a time consuming, low-throughput, and expensive process, leading to a large protein sequence-function gap. Therefore, it is important to develop computational methods to accurately predict protein function to fill the gap. Even though many methods have been developed to use protein sequences as input to predict function, much fewer methods leverage protein structures in protein function prediction because there was lack of accurate protein structures for most proteins until recently.
RESULTS
We developed TransFun-a method using a transformer-based protein language model and 3D-equivariant graph neural networks to distill information from both protein sequences and structures to predict protein function. It extracts feature embeddings from protein sequences using a pre-trained protein language model (ESM) via transfer learning and combines them with 3D structures of proteins predicted by AlphaFold2 through equivariant graph neural networks. Benchmarked on the CAFA3 test dataset and a new test dataset, TransFun outperforms several state-of-the-art methods, indicating that the language model and 3D-equivariant graph neural networks are effective methods to leverage protein sequences and structures to improve protein function prediction. Combining TransFun predictions and sequence similarity-based predictions can further increase prediction accuracy.
AVAILABILITY AND IMPLEMENTATION
The source code of TransFun is available at https://github.com/jianlin-cheng/TransFun.
Topics: Amino Acid Sequence; Benchmarking; Language; Neural Networks, Computer; Software
PubMed: 37387145
DOI: 10.1093/bioinformatics/btad208