- Mathematical Biosciences and... Oct 2021
Protein S-nitrosylation is one of the most important post-translational modifications, and a well-grounded understanding of it is significant because it plays a key role in a variety of biological processes. For an uncharacterized protein sequence, it is valuable for both basic research and drug development to first identify whether it is an S-nitrosylation protein and then predict the specific S-nitrosylation site(s). This work proposes two models for identifying S-nitrosylation proteins and their PTM sites. First, three kinds of features are extracted from the protein sequence: KNN scoring of functional domain annotation, PseAAC, and bag-of-words based on the physical and chemical properties of amino acids. Second, the synthetic minority oversampling technique is used to balance the data sets, and several state-of-the-art classifiers and feature fusion strategies are applied to the balanced data sets. In five-fold cross-validation for predicting S-nitrosylation proteins, the Accuracy (ACC), Matthews correlation coefficient (MCC) and area under the ROC curve (AUC) are 81.84%, 0.5178 and 0.8635, respectively. Finally, a model for predicting S-nitrosylation sites is constructed on the basis of tripeptide composition (TPC) and the composition of k-spaced amino acid pairs (CKSAAP). To eliminate redundant information and improve efficiency, elastic nets are employed for feature selection. The five-fold cross-validation tests indicate promising success rates for the proposed model. For the convenience of related researchers, a web server named "RF-SNOPS" has been established at http://www.jci-bioinfo.cn/RF-SNOPS.
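As an illustration of one of the features named in this abstract, the composition of k-spaced amino acid pairs (CKSAAP) can be sketched as below. This is a minimal sketch, not the authors' implementation: the function name `cksaap`, the `k_max` parameter, and the normalization by the number of residue pairs at each gap are assumptions based on the common definition of the descriptor.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def cksaap(sequence, k_max=2):
    """Return CKSAAP frequencies for gaps k = 0..k_max as one flat list.

    For each gap k, every residue pair (sequence[i], sequence[i + k + 1])
    is counted and divided by the number of such pairs (assumed
    normalization). Each gap contributes 400 values, one per ordered pair.
    """
    pairs = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
    features = []
    for k in range(k_max + 1):
        counts = dict.fromkeys(pairs, 0)
        n_windows = len(sequence) - k - 1
        for i in range(max(n_windows, 0)):
            pair = sequence[i] + sequence[i + k + 1]
            if pair in counts:  # skip non-standard residues such as "X"
                counts[pair] += 1
        total = max(n_windows, 1)  # avoid division by zero on short input
        features.extend(counts[p] / total for p in pairs)
    return features


# Toy usage: gaps k = 0 and k = 1 give a vector of 2 * 400 values.
vec = cksaap("ACACA", k_max=1)
```

A vector of this kind is typically concatenated with other descriptors (here, TPC) before feature selection, which is where a method such as the elastic net mentioned in the abstract would be applied.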
Topics: Algorithms; Amino Acid Sequence; Amino Acids; Area Under Curve; Computational Biology; Protein Processing, Post-Translational; Proteins
PubMed: 34814339
DOI: 10.3934/mbe.2021450
- ELife Feb 2023
Computational models starting from large ensembles of evolutionarily related protein sequences capture a representation of protein families and learn constraints associated with protein structure and function. They thus open the possibility of generating novel sequences belonging to protein families. Protein language models trained on multiple sequence alignments, such as MSA Transformer, are highly attractive candidates for this purpose. We propose and test an iterative method that directly employs the masked language modeling objective to generate sequences using MSA Transformer. We demonstrate that the resulting sequences score as well as natural sequences on homology, coevolution, and structure-based measures. For large protein families, our synthetic sequences have similar or better properties compared to sequences generated by Potts models, including experimentally validated ones. Moreover, for small protein families, our generation method based on MSA Transformer outperforms Potts models. Our method also reproduces the higher-order statistics and the distribution of sequences in sequence space of natural data more accurately than Potts models. MSA Transformer is thus a strong candidate for protein sequence generation and protein design.
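The idea of iterating a masked language modeling objective to generate sequences can be sketched in miniature as follows. This is a toy illustration only: it works on a single sequence rather than a multiple sequence alignment, and `fill_masked` and `toy_fill` are hypothetical stand-ins for a masked language model, not the MSA Transformer API. The masking probability, the number of iterations, and the rule that only masked positions are updated are assumptions.

```python
import random

MASK = "#"  # hypothetical mask token for this toy example


def iterative_masked_generation(seq, fill_masked, mask_prob=0.1,
                                n_iter=10, seed=0):
    """Repeatedly mask random positions and let a model refill them."""
    rng = random.Random(seed)
    current = list(seq)
    for _ in range(n_iter):
        # Mask each position independently with probability mask_prob.
        masked = [MASK if rng.random() < mask_prob else aa for aa in current]
        # The model proposes a residue for every position.
        filled = fill_masked(masked)
        # Only positions that were masked are overwritten.
        current = [f if m == MASK else c
                   for c, m, f in zip(current, masked, filled)]
    return "".join(current)


def toy_fill(masked):
    """Hypothetical stand-in for a model: fills each mask with 'A'."""
    return ["A" if tok == MASK else tok for tok in masked]


generated = iterative_masked_generation("MKTLLV", toy_fill, mask_prob=0.3)
```

In a real setting `fill_masked` would return the model's prediction for each masked token given the unmasked context, so the sequence drifts gradually away from its natural starting point over the iterations.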
Topics: Sequence Alignment; Proteins; Amino Acid Sequence
PubMed: 36734516
DOI: 10.7554/eLife.79854
- MAbs 2022
Review
A key step in therapeutic and endogenous humoral antibody characterization is identifying the amino acid sequence. So far, this task has been mainly tackled through sequencing of B-cell receptor (BCR) repertoires at the nucleotide level. Mass spectrometry (MS) has emerged as an alternative tool for obtaining sequence information directly at the most relevant level, that of the protein. Although several MS methods are now well established, analysis of recombinant and endogenous antibodies comes with a specific set of challenges, requiring approaches beyond the conventional proteomics workflows. Here, we review the challenges in MS-based sequencing of both recombinant and endogenous humoral antibodies and outline state-of-the-art methods attempting to overcome these obstacles. We highlight recent examples and discuss remaining challenges. We foresee a great future for these approaches, making de novo antibody sequencing and discovery by MS-based techniques feasible even for complex clinical samples from endogenous sources such as serum and other liquid biopsies.
Topics: Amino Acid Sequence; Antibodies; Peptides; Proteomics; Sequence Analysis, Protein; Tandem Mass Spectrometry
PubMed: 35699511
DOI: 10.1080/19420862.2022.2079449
- Journal of Visualized Experiments : JoVE Jul 2017
We demonstrate the usage of Bio3D-web for the interactive analysis of biomolecular structure data. The Bio3D-web application provides online functionality for: (1) the identification of related protein structure sets to user-specified thresholds of similarity; (2) their multiple alignment and structure superposition; (3) sequence and structure conservation analysis; (4) inter-conformer relationship mapping with principal component analysis; and (5) comparison of predicted internal dynamics via ensemble normal mode analysis. This integrated functionality provides a complete online workflow for investigating sequence-structure-dynamic relationships within protein families and superfamilies.
Topics: Amino Acid Sequence; Data Interpretation, Statistical; Programming Languages; Proteins; Sequence Alignment
PubMed: 28745621
DOI: 10.3791/55640
- Protein Science : a Publication of the... Dec 2022
Atomic interactions play essential roles in protein folding, structure stabilization, and function. Recent advances in deep learning-based methods have achieved impressive success not only in protein structure prediction, but also in protein sequence design. However, highly efficient and accurate protein side-chain prediction methods that can give detailed atomic interactions are still lacking. In the present study, we developed a deep learning-based method, GeoPacker, that uses geometric deep learning coupled with ResNet for protein side-chain modeling. GeoPacker explicitly represents atomic interactions with rotational and translational invariance for information extraction of relative locations. GeoPacker outperformed the state-of-the-art energy function-based methods in side-chain structure prediction accuracy and runs about 10 and 700 times faster than the deep learning-based methods DLPacker and OPUS-rota4, respectively, with comparable prediction accuracy. The performance of GeoPacker does not depend on the secondary structures that the residues belong to. GeoPacker gives highly accurate predictions for buried residues in the protein core as well as at protein-protein interfaces, making it a useful tool for protein structure modeling and for protein and interaction design.
Topics: Deep Learning; Algorithms; Proteins; Protein Structure, Secondary; Amino Acid Sequence; Protein Conformation
PubMed: 36309961
DOI: 10.1002/pro.4484
- Bioinformatics (Oxford, England) Oct 2023
MOTIVATION
In recent years, there has been a breakthrough in protein structure prediction, and the AlphaFold2 model of the DeepMind team has improved the accuracy of protein structure prediction to the atomic level. Currently, deep learning-based protein function prediction models usually extract features from protein sequences and combine them with protein-protein interaction networks to achieve good results. However, for newly sequenced proteins that are not in the protein-protein interaction network, such models cannot make effective predictions. To address this, this article proposes the Struct2GO model, which combines protein structure and sequence data to enhance the precision of protein function prediction and the generality of the model.
RESULTS
We obtain amino acid residue embeddings in protein structure through graph representation learning, utilize the graph pooling algorithm based on a self-attention mechanism to obtain the whole graph structure features, and fuse them with sequence features obtained from the protein language model. The results demonstrate that compared with the traditional protein sequence-based function prediction model, the Struct2GO model achieves better results.
AVAILABILITY AND IMPLEMENTATION
The data underlying this article are available at https://github.com/lyjps/Struct2GO.
Topics: Neural Networks, Computer; Proteins; Algorithms; Amino Acid Sequence; Amino Acids
PubMed: 37847755
DOI: 10.1093/bioinformatics/btad637
- Nature Communications Nov 2021
Potts models and variational autoencoders (VAEs) have recently gained popularity as generative protein sequence models (GPSMs) to explore fitness landscapes and predict mutation effects. Despite encouraging results, current model evaluation metrics leave unclear whether GPSMs faithfully reproduce the complex multi-residue mutational patterns observed in natural sequences due to epistasis. Here, we develop a set of sequence statistics to assess the "generative capacity" of three current GPSMs: the pairwise Potts Hamiltonian, the VAE, and the site-independent model. We show that the Potts model's generative capacity is largest, as the higher-order mutational statistics generated by the model agree with those observed for natural sequences, while the VAE's lies between the Potts and site-independent models. Importantly, our work provides a new framework for evaluating and interpreting GPSM accuracy which emphasizes the role of higher-order covariation and epistasis, with broader implications for probabilistic sequence models in general.
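To make the difference between the pairwise Potts Hamiltonian and the site-independent model concrete, the following sketch evaluates a Potts-style energy for one sequence. It assumes the common convention E(s) = -sum_i h_i(s_i) - sum_{i<j} J_ij(s_i, s_j); sign and gauge conventions vary between papers. The dictionary layout, the function name `potts_energy`, and all parameter values are hypothetical toy choices, not fitted parameters.

```python
def potts_energy(seq, fields, couplings):
    """Energy of a sequence under a pairwise Potts model.

    fields[i] maps a residue to its single-site field h_i.
    couplings[(i, j)] maps a residue pair to the coupling J_ij, for i < j.
    Missing entries are treated as zero.
    """
    energy = 0.0
    length = len(seq)
    # Single-site terms: the only terms in a site-independent model.
    for i in range(length):
        energy -= fields[i].get(seq[i], 0.0)
    # Pairwise terms: these capture covariation between positions.
    for i in range(length):
        for j in range(i + 1, length):
            pair_table = couplings.get((i, j), {})
            energy -= pair_table.get((seq[i], seq[j]), 0.0)
    return energy


# Toy parameters for a length-3 sequence (illustrative values only).
fields = [{"A": 1.0}, {"C": 0.5}, {}]
couplings = {(0, 1): {("A", "C"): 2.0}}

full_energy = potts_energy("ACD", fields, couplings)  # fields and couplings
independent_energy = potts_energy("ACD", fields, {})  # fields only
```

Passing an empty coupling table reduces the calculation to the site-independent case, so the gap between the two energies isolates the contribution of pairwise covariation for that sequence.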
Topics: Algorithms; Amino Acid Sequence; Computer Simulation; Databases, Protein; Humans; Models, Statistical; Mutation; Protein Structural Elements; Proteins; Sequence Alignment; Structure-Activity Relationship
PubMed: 34728624
DOI: 10.1038/s41467-021-26529-9
- Bioinformatics (Oxford, England) Feb 2015
MOTIVATION
DNA and protein patterns are usefully represented by sequence logos. However, the methods for logo generation in common use lack a proper statistical basis, and are non-optimal for recognizing functionally relevant alignment columns.
RESULTS
We redefine the information at a logo position as a per-observation multiple alignment log-odds score. Such scores are positive or negative, depending on whether a column's observations are better explained as arising from relatedness or chance. Within this framework, we propose distinct normalized maximum likelihood and Bayesian measures of column information. We illustrate these measures on High Mobility Group B (HMGB) box proteins and a dataset of enzyme alignments. Particularly in the context of protein alignments, our measures improve the discrimination of biologically relevant positions.
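For context, the column information used by conventional sequence logos is usually the Shannon-based quantity log2(alphabet size) minus the column entropy. The sketch below computes that classical baseline only. It is not the log-odds, normalized maximum likelihood, or Bayesian measure proposed in this abstract, and it omits the small-sample correction that some logo programs apply. The function name `column_information` is an illustrative choice.

```python
import math
from collections import Counter


def column_information(column, alphabet_size=20):
    """Classical logo information (bits) for one alignment column.

    column is a string of the residues observed in that column.
    Uses a uniform background and no small-sample correction.
    """
    counts = Counter(column)
    n = len(column)
    # Shannon entropy of the observed residue frequencies.
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return math.log2(alphabet_size) - entropy


# Toy DNA columns (alphabet size 4).
conserved = column_information("AAAA", alphabet_size=4)
mixed = column_information("AACC", alphabet_size=4)
uniform = column_information("ACGT", alphabet_size=4)
```

This classical measure is never negative, whereas the log-odds scores described above can be positive or negative depending on whether relatedness or chance better explains a column.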
AVAILABILITY AND IMPLEMENTATION
Our new measures are implemented in an open-source Web-based logo generation program, which is available at http://www.ncbi.nlm.nih.gov/CBBresearch/Yu/logoddslogo/index.html. A stand-alone version of the program is also available from this site.
SUPPLEMENTARY INFORMATION
Supplementary data are available at Bioinformatics online.
Topics: Amino Acid Sequence; Bayes Theorem; Humans; Molecular Sequence Annotation; Molecular Sequence Data; Position-Specific Scoring Matrices; Sequence Alignment; Sequence Analysis, DNA; Sequence Analysis, Protein; Sequence Homology, Amino Acid; Software
PubMed: 25294922
DOI: 10.1093/bioinformatics/btu634
- Protein Science : a Publication of the... Nov 2023
Predicting the effects of mutations on protein function and stability is an outstanding challenge. Here, we assess the performance of a variant of RoseTTAFold jointly trained for sequence and structure recovery, RF, for mutation effect prediction. Without any further training, RF predicts mutation effects for a diverse set of protein families with accuracy comparable to both another zero-shot model (MSA Transformer) and a model that requires specific training on a particular protein family for mutation effect prediction (DeepSequence). Thus, although the architecture of RF was developed to address the protein design problem of scaffolding functional motifs, RF acquired an understanding of the mutational landscapes of proteins during model training that is equivalent to that of recently developed large protein language models. The ability to simultaneously reason over protein structure and sequence could enable even more precise mutation effect predictions following supervised training on the task. These results suggest that RF has a broad understanding of protein sequence-structure landscapes and can be viewed as a joint model for protein sequence and structure that could be broadly useful for protein modeling.
Topics: Proteins; Mutation; Amino Acid Sequence; Protein Stability
PubMed: 37695922
DOI: 10.1002/pro.4780
- Briefings in Bioinformatics Jan 2022
Review
In this article, we review two challenging computational questions in protein science: neoantigen prediction and protein structure prediction. Both topics have seen significant leaps forward by deep learning within the past five years, which immediately unlocked new developments of drugs and immunotherapies. We show that deep learning models offer unique advantages, such as representation learning and multi-layer architecture, which make them an ideal choice to leverage a huge amount of protein sequence and structure data to address those two problems. We also discuss the impact and future possibilities enabled by those two applications, especially how the data-driven approach by deep learning shall accelerate the progress towards personalized biomedicine.
Topics: Amino Acid Sequence; Deep Learning; Immunotherapy; Proteins
PubMed: 34891158
DOI: 10.1093/bib/bbab493