protein sequence - OpenMD.com Journal Search

UniProt and Mass Spectrometry-Based Proteomics: A 2-Way Working Relationship.

Molecular & Cellular Proteomics : MCP Aug 2023

Review

Authors: E H Bowler-Barnett, J Fan, J Luo...

The human proteome comprises all of the proteins produced by the sequences translated from the human genome, with additional modifications in both sequence and function caused by nonsynonymous variants and posttranslational modifications, including cleavage of the initial transcript into smaller peptides and polypeptides. The UniProtKB database (www.uniprot.org) is the world's leading high-quality, comprehensive and freely accessible resource of protein sequence and functional information and presents a summary of experimentally verified, or computationally predicted, functional information added by our expert biocuration team for each protein in the proteome. Researchers in the field of mass spectrometry-based proteomics both consume and add to the body of data available in UniProtKB, and this review highlights the information we provide to this community and the knowledge we in turn obtain from groups via deposition of large-scale datasets in public domain databases.

Topics: Humans; Proteomics; Proteome; Databases, Protein; Amino Acid Sequence; Peptides

PubMed: 37301379
DOI: 10.1016/j.mcpro.2023.100591

Top-Down Proteomics and the Challenges of True Proteoform Characterization.

Journal of Proteome Research Dec 2023

Review

Authors: Allen Po, Claire E Eyers

Top-down proteomics (TDP) aims to identify and profile intact protein forms (proteoforms) extracted from biological samples. True proteoform characterization requires that both the base protein sequence be defined and any mass shifts identified, ideally localizing their positions within the protein sequence. Being able to fully elucidate proteoform profiles lends insight into characterizing proteoform-unique roles, and is a crucial aspect of defining protein structure-function relationships and the specific roles of different (combinations of) protein modifications. However, defining and pinpointing protein post-translational modifications (PTMs) on intact proteins remains a challenge. Characterization of (heavily) modified proteins (>∼30 kDa) remains problematic, especially when they exist in a population of similarly modified, or kindred, proteoforms. This issue is compounded as the number of modifications increases, and thus the number of theoretical combinations. Here, we present our perspective on the challenges of analyzing kindred proteoform populations, focusing on annotation of protein modifications on an "average" protein. Furthermore, we discuss the technical requirements to obtain high quality fragmentation spectral data to robustly define site-specific PTMs, and the fact that this is tempered by the time requirements necessary to separate proteoforms in advance of mass spectrometry analysis.
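To make the combinatorial growth mentioned in this abstract concrete, here is a minimal sketch in Python. It assumes, purely for illustration, that modification sites are independent and that each site has a fixed number of alternative states; the function name `count_proteoforms` is hypothetical and does not come from the paper.

```python
from math import prod


def count_proteoforms(states_per_site):
    """Count theoretical proteoforms for independent modification sites.

    states_per_site[i] is the number of alternative states at site i,
    including the unmodified state (so 2 means "modified or not").
    An empty list yields 1: the single unmodified protein.
    """
    return prod(states_per_site)


# Ten sites, each either modified or unmodified.
print(count_proteoforms([2] * 10))
```

Under these assumptions the count doubles with every additional binary site, which is why populations of kindred proteoforms become hard to resolve as modifications accumulate.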

Topics: Tandem Mass Spectrometry; Proteomics; Proteins; Protein Processing, Post-Translational; Amino Acid Sequence; Proteome

PubMed: 37937372
DOI: 10.1021/acs.jproteome.3c00416

TT3D: Leveraging precomputed protein 3D sequence models to predict protein-protein interactions.

Bioinformatics (Oxford, England) Nov 2023

Authors: Samuel Sledzieski, Kapil Devkota, Rohit Singh...

MOTIVATION

High-quality computational structural models are now precomputed and available for nearly every protein in UniProt. However, the best way to leverage these models to predict which pairs of proteins interact in a high-throughput manner is not immediately clear. The recent Foldseek method of van Kempen et al. encodes the structural information of distances and angles along the protein backbone into a linear string of the same length as the protein string, using tokens from a 21-letter discretized structural alphabet (3Di).

RESULTS

We show that using both the amino acid sequence and the 3Di sequence generated by Foldseek as inputs to our recent deep-learning method, Topsy-Turvy, substantially improves the performance of predicting protein-protein interactions cross-species. Thus TT3D (Topsy-Turvy 3D) presents a way to reuse all the computational effort going into producing high-quality structural models from sequence, while being sufficiently lightweight so that high-quality binary protein-protein interaction predictions across all protein pairs can be made genome-wide.

AVAILABILITY AND IMPLEMENTATION

TT3D is available at https://github.com/samsledje/D-SCRIPT. An archived version of the code at time of submission can be found at https://zenodo.org/records/10037674.

Topics: Amino Acid Sequence; Software; Proteins

PubMed: 37897686
DOI: 10.1093/bioinformatics/btad663

Protein embeddings improve phage-host interaction prediction.

PloS One 2023

Authors: Mark Edward M Gonzales, Jennifer C Ureta, Anish M S Shrestha...

With the growing interest in using phages to combat antimicrobial resistance, computational methods for predicting phage-host interactions have been explored to help shortlist candidate phages. Most existing models consider entire proteomes and rely on manual feature engineering, which poses difficulty in selecting the most informative sequence properties to serve as input to the model. In this paper, we framed phage-host interaction prediction as a multiclass classification problem that takes as input the embeddings of a phage's receptor-binding proteins, which are known to be the key machinery for host recognition, and predicts the host genus. We explored different protein language models to automatically encode these protein sequences into dense embeddings without the need for additional alignment or structural information. We show that the use of embeddings of receptor-binding proteins presents improvements over handcrafted genomic and protein sequence features. The highest performance was obtained using the transformer-based protein language model ProtT5, resulting in a 3% to 4% increase in weighted F1 and recall scores across different prediction confidence thresholds, compared to using selected handcrafted sequence features.
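The weighted F1 score reported in this abstract can be illustrated with a short sketch. The code below is a generic, dependency-free implementation of support-weighted F1 for multiclass labels, not the authors' evaluation code; the function name and the toy genus labels are assumptions made for the example.

```python
def weighted_f1(y_true, y_pred):
    """Support-weighted F1: per-class F1 averaged by each class's true count."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    total = len(y_true)
    score = 0.0
    for label in set(y_true):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        support = tp + fn  # number of true instances of this class
        score += f1 * support / total
    return score


# Toy host-genus labels (hypothetical data).
truth = ["Escherichia", "Escherichia", "Klebsiella", "Klebsiella"]
predicted = ["Escherichia", "Klebsiella", "Klebsiella", "Klebsiella"]
print(weighted_f1(truth, predicted))
```

Weighting by support means that frequently observed host genera contribute more to the score than rare ones, which matters when class sizes are imbalanced.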

Topics: Amino Acid Sequence; Proteome; Bacteriophages; Differential Threshold; Mental Recall

PubMed: 37486915
DOI: 10.1371/journal.pone.0289030

SARS-CoV-2 protein structure and sequence mutations: Evolutionary analysis and effects on virus variants.

PloS One 2023

Authors: Ugo Lomoio, Barbara Puccio, Giuseppe Tradigo...

The structure and sequence of proteins strongly influence their biological functions. New models and algorithms can help researchers understand how the evolution of sequences and structures is related to changes in function. Recent studies of SARS-CoV-2 Spike (S) protein structures have sought to predict binding receptors and infection activity in COVID-19, prompting scientific interest in the effects of virus mutations related to sequence, structure and vaccination. However, there is a need for models and tools to study the links between the evolution of S protein sequence, structure and functions, and virus transmissibility and the effects of vaccination. As studies on S protein have generated a large amount of relevant information, we propose in this work to use Protein Contact Networks (PCNs) to relate protein structures with biological properties by means of network topology properties. Topological properties are used to compare the structural changes with sequence changes. We find that both node centrality and community extraction analysis can be used to relate protein stability and functionality with sequence mutations. Starting from this, we compare structural evolution to sequence changes and study mutations from a temporal perspective focusing on virus variants. Finally, by applying our model to the Omicron variant, we report a timeline correlation between Omicron and the vaccination campaign.
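As a rough illustration of what a Protein Contact Network is, the sketch below builds a residue graph from 3D coordinates and computes degree centrality. It is not the authors' pipeline: the 7.0 Å cutoff, the use of one coordinate per residue, and the function names are all assumptions chosen for the example.

```python
from math import dist


def contact_network(coords, cutoff=7.0):
    """Link residues i and j when their coordinates lie within `cutoff`.

    coords: one (x, y, z) tuple per residue (for example C-alpha atoms).
    cutoff: distance threshold in angstroms (7.0 is an assumed value).
    """
    n = len(coords)
    neighbors = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if dist(coords[i], coords[j]) <= cutoff:
                neighbors[i].add(j)
                neighbors[j].add(i)
    return neighbors


def degree_centrality(neighbors):
    """Fraction of the other residues that each residue contacts."""
    n = len(neighbors)
    if n < 2:
        return {i: 0.0 for i in neighbors}
    return {i: len(nb) / (n - 1) for i, nb in neighbors.items()}


# Three residues spaced 5 angstroms apart along one axis (toy data).
toy = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
print(degree_centrality(contact_network(toy)))
```

Comparing such centrality values between a reference structure and a mutated one is one simple way to relate a sequence change to a change in network topology.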

Topics: Humans; SARS-CoV-2; COVID-19; Amino Acid Sequence; Mutation; Spike Glycoprotein, Coronavirus

PubMed: 37471335
DOI: 10.1371/journal.pone.0283400

CNCA aligns small annotated genomes.

BMC Bioinformatics Feb 2024

Authors: Jean-Noël Lorenzi, François Graner, Virginie Courtier-Orgogozo...

BACKGROUND

To explore the evolutionary history of sequences, a sequence alignment is a first and necessary step, and its quality is crucial. In the context of the study of the proximal origins of SARS-CoV-2 coronavirus, we wanted to construct an alignment of genomes closely related to SARS-CoV-2 using both coding and non-coding sequences. To our knowledge, there is no tool that can be used to construct this type of alignment, which motivated the creation of CNCA.

RESULTS

CNCA is a web tool that aligns annotated genomes from GenBank files. It generates a nucleotide alignment that is then updated based on the protein sequence alignment. The output final nucleotide alignment matches the protein alignment and guarantees no frameshift. CNCA was designed to align closely related small genome sequences up to 50 kb (typically viruses) for which the gene order is conserved.

CONCLUSIONS

CNCA constructs multiple alignments of small genomes by integrating both coding and non-coding sequences. This preserves regions traditionally ignored in conventional back-translation methods, such as non-coding regions.
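For context, the conventional back-translation step that the abstract contrasts CNCA with can be sketched as follows. This is a generic illustration, not CNCA's code: the function name is hypothetical, and the sketch assumes the coding sequence is in frame and has no trailing stop codon.

```python
def back_translate_alignment(aligned_protein, coding_seq):
    """Thread a coding sequence onto its gapped protein alignment.

    aligned_protein: protein sequence with '-' for alignment gaps.
    coding_seq: the ungapped nucleotide sequence encoding that protein
                (assumed in frame, without a stop codon).
    Each residue maps to its codon and each gap to '---', so the
    result cannot introduce a frameshift.
    """
    residues = aligned_protein.replace("-", "")
    if len(coding_seq) != 3 * len(residues):
        raise ValueError("coding sequence length must be 3x the residue count")
    codons = []
    pos = 0
    for symbol in aligned_protein:
        if symbol == "-":
            codons.append("---")
        else:
            codons.append(coding_seq[pos:pos + 3])
            pos += 3
    return "".join(codons)


print(back_translate_alignment("M-K", "ATGAAA"))
```

Because this approach only handles coding regions, non-coding sequence between genes is left out, which is the limitation the abstract says CNCA addresses.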

Topics: Genome; Sequence Alignment; Proteins; Amino Acid Sequence; Nucleotides

PubMed: 38424511
DOI: 10.1186/s12859-024-05700-1

Lactylation prediction models based on protein sequence and structural feature fusion.

Briefings in Bioinformatics Jan 2024

Authors: Ye-Hong Yang, Jun-Tao Yang, Jiang-Feng Liu...

Lysine lactylation (Kla) is a newly discovered posttranslational modification that is involved in important life activities, such as glycolysis-related cell function, macrophage polarization and nervous system regulation, and has received widespread attention due to the Warburg effect in tumor cells. In this work, we first design a natural language processing method to automatically extract the 3D structural features of Kla sites, avoiding potential biases caused by manually designed structural features. Then, we establish two Kla prediction frameworks, the Attention-based feature fusion Kla model (ABFF-Kla) and the Embedding-based feature fusion Kla model (EBFF-Kla), to integrate the sequence features and the structure features based on the attention layer and embedding layer, respectively. The results indicate that ABFF-Kla and EBFF-Kla, which fuse features from protein sequences and spatial structures, have better predictive performance than models that use only sequence features. Our work provides an approach for the automatic extraction of protein structural features, as well as a flexible framework for Kla prediction. The source code and the training data of ABFF-Kla and EBFF-Kla are publicly deposited at: https://github.com/ispotato/Lactylation_model.

Topics: Amino Acid Sequence; Lysine; Natural Language Processing; Protein Domains; Protein Processing, Post-Translational

PubMed: 38385873
DOI: 10.1093/bib/bbad539

Two sequence- and two structure-based ML models have learned different aspects of protein biochemistry.

Scientific Reports Aug 2023

Authors: Anastasiya V Kulikova, Daniel J Diaz, Tianlong Chen...

Deep learning models are seeing increased use as methods to predict mutational effects or allowed mutations in proteins. The models commonly used for these purposes include large language models (LLMs) and 3D Convolutional Neural Networks (CNNs). These two model types have very different architectures and are commonly trained on different representations of proteins. LLMs make use of the transformer architecture and are trained purely on protein sequences whereas 3D CNNs are trained on voxelized representations of local protein structure. While comparable overall prediction accuracies have been reported for both types of models, it is not known to what extent these models make comparable specific predictions and/or generalize protein biochemistry in similar ways. Here, we perform a systematic comparison of two LLMs and two structure-based models (CNNs) and show that the different model types have distinct strengths and weaknesses. The overall prediction accuracies are largely uncorrelated between the sequence- and structure-based models. Overall, the two structure-based models are better at predicting buried aliphatic and hydrophobic residues whereas the two LLMs are better at predicting solvent-exposed polar and charged amino acids. Finally, we find that a combined model that takes the individual model predictions as input can leverage these individual model strengths and results in significantly improved overall prediction accuracy.

Topics: Amino Acid Sequence; Amino Acids; Antifibrinolytic Agents; Electric Power Supplies; Language

PubMed: 37587128
DOI: 10.1038/s41598-023-40247-w

Global detection of human variants and isoforms by deep proteome sequencing.

Nature Biotechnology Dec 2023

Authors: Pavel Sinitcyn, Alicia L Richards, Robert J Weatheritt...

An average shotgun proteomics experiment detects approximately 10,000 human proteins from a single sample. However, individual proteins are typically identified by peptide sequences representing a small fraction of their total amino acids. Hence, an average shotgun experiment fails to distinguish different protein variants and isoforms. Deeper proteome sequencing is therefore required for the global discovery of protein isoforms. Using six different human cell lines, six proteases, deep fractionation and three tandem mass spectrometry fragmentation methods, we identify a million unique peptides from 17,717 protein groups, with a median sequence coverage of approximately 80%. Direct comparison with RNA expression data provides evidence for the translation of most nonsynonymous variants. We have also hypothesized that undetected variants likely arise from mutation-induced protein instability. We further observe comparable detection rates for exon-exon junction peptides representing constitutive and alternative splicing events. Our dataset represents a resource for proteoform discovery and provides direct evidence that most frame-preserving alternatively spliced isoforms are translated.
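Sequence coverage, as used in this abstract, is the fraction of a protein's residues that fall inside at least one identified peptide. The sketch below shows one simple way to compute it by exact string matching; the function name and the toy sequence are assumptions for illustration and do not reflect the authors' software.

```python
def sequence_coverage(protein, peptides):
    """Fraction of residues in `protein` covered by at least one peptide.

    Peptides are located by exact substring match, and every occurrence
    (including overlapping ones) marks its residues as covered.
    """
    if not protein:
        return 0.0
    covered = [False] * len(protein)
    for peptide in peptides:
        if not peptide:
            continue
        start = protein.find(peptide)
        while start != -1:
            for i in range(start, start + len(peptide)):
                covered[i] = True
            start = protein.find(peptide, start + 1)
    return sum(covered) / len(protein)


# Toy 10-residue protein with three identified peptides.
print(sequence_coverage("MKTAYIAKQR", ["MKT", "TAY", "KQR"]))
```

A median over many proteins, as reported in the abstract, could then be taken with `statistics.median` over the per-protein values.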

Topics: Humans; Proteome; Protein Isoforms; Alternative Splicing; Peptides; Amino Acid Sequence

PubMed: 36959352
DOI: 10.1038/s41587-023-01714-x

Combining protein sequences and structures with transformers and equivariant graph neural networks to predict protein function.

Bioinformatics (Oxford, England) Jun 2023

Authors: Frimpong Boadu, Hongyuan Cao, Jianlin Cheng...

MOTIVATION

Millions of protein sequences have been generated by numerous genome and transcriptome sequencing projects. However, experimentally determining the function of the proteins is still a time-consuming, low-throughput, and expensive process, leading to a large protein sequence-function gap. Therefore, it is important to develop computational methods to accurately predict protein function to fill the gap. Even though many methods have been developed to use protein sequences as input to predict function, far fewer methods leverage protein structures in protein function prediction because there was a lack of accurate protein structures for most proteins until recently.

RESULTS

We developed TransFun, a method using a transformer-based protein language model and 3D-equivariant graph neural networks to distill information from both protein sequences and structures to predict protein function. It extracts feature embeddings from protein sequences using a pre-trained protein language model (ESM) via transfer learning and combines them with 3D structures of proteins predicted by AlphaFold2 through equivariant graph neural networks. Benchmarked on the CAFA3 test dataset and a new test dataset, TransFun outperforms several state-of-the-art methods, indicating that the language model and 3D-equivariant graph neural networks are effective methods to leverage protein sequences and structures to improve protein function prediction. Combining TransFun predictions and sequence similarity-based predictions can further increase prediction accuracy.

AVAILABILITY AND IMPLEMENTATION

The source code of TransFun is available at https://github.com/jianlin-cheng/TransFun.

Topics: Amino Acid Sequence; Benchmarking; Language; Neural Networks, Computer; Software

PubMed: 37387145
DOI: 10.1093/bioinformatics/btad208