-
Scientific Reports Oct 2023
High-throughput proteomic analysis of archaeological skeletal remains provides information about past fauna community compositions and species dispersals in time and space. Archaeological skeletal remains are a finite resource, however, and therefore it becomes relevant to optimize methods of skeletal proteome extraction. Ancient proteins in bone specimens can be highly degraded and consequently, extraction methods for well-preserved or modern bone might be unsuitable for the processing of highly degraded skeletal proteomes. In this study, we compared six proteomic extraction methods on Late Pleistocene remains with variable levels of proteome preservation. We tested the accuracy of species identification, protein sequence coverage, deamidation, and the number of post-translational modifications per method. We find striking differences in obtained proteome complexity and sequence coverage, highlighting that simple acid-insoluble proteome extraction methods perform better in highly degraded contexts. For well-preserved specimens, the approach using EDTA demineralization and protease-mix proteolysis yielded a higher number of identified peptides. The protocols presented here allowed protein extraction from ancient bone with a minimum number of working steps and equipment and yielded protein extracts within three working days. We expect further development along this route to benefit large-scale screening applications of relevance to archaeological and human evolution research.
Topics: Humans; Proteome; Proteomics; Body Remains; Peptides; Amino Acid Sequence
PubMed: 37884544
DOI: 10.1038/s41598-023-44885-y -
BMC Bioinformatics Jul 2023
BACKGROUND
Protein engineering aims to improve the functional properties of existing proteins to meet people's needs. Current deep learning-based models have captured evolutionary, functional, and biochemical features contained in amino acid sequences. However, the existing generative models need to be improved when capturing the relationship between amino acid sites on longer sequences. At the same time, the distribution of protein sequences in the homologous family has a specific positional relationship in the latent space. We want to use this relationship to search for new variants directly from the vicinity of better-performing varieties.
RESULTS
To improve the representation learning ability of the model for longer sequences and the similarity between the generated sequences and the original sequences, we propose a temporal variational autoencoder (T-VAE) model. T-VAE consists of an encoder and a decoder. The encoder expands the receptive field of neurons in the network structure by dilated causal convolution, thereby improving the encoding representation ability of longer sequences. The decoder decodes the sampled data into variants closely resembling the original sequence.
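The effect of dilation on the receptive field can be illustrated with a minimal sketch. It is not the T-VAE encoder itself: the kernel size, the dilation schedule, and the function names `receptive_field` and `causal_dilated_conv` are all assumptions chosen for illustration.

```python
def receptive_field(kernel_size: int, dilations: list[int]) -> int:
    """Number of input positions that can influence one output position
    after stacking one causal convolution per entry in `dilations`."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


def causal_dilated_conv(x: list[float], weights: list[float], dilation: int) -> list[float]:
    """1-D causal convolution: out[t] uses only x[t], x[t-d], x[t-2d], ..."""
    out = []
    for t in range(len(x)):
        total = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation
            if idx >= 0:  # positions before the sequence start act as zero padding
                total += w * x[idx]
        out.append(total)
    return out


# Hypothetical schedule: four layers, kernel size 2, dilation doubling per layer.
print(receptive_field(2, [1, 2, 4, 8]))
print(causal_dilated_conv([1.0, 2.0, 3.0, 4.0], [1.0, 1.0], dilation=2))
```

With a doubling schedule the receptive field grows exponentially with depth while the number of weights grows only linearly, which is the usual motivation for using dilation on long sequences.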
CONCLUSION
Compared to other models, the Pearson correlation coefficient between the predicted values of protein fitness obtained by T-VAE and the true values was higher, and the mean absolute deviation was lower. In addition, the T-VAE model has a better representation learning ability for longer sequences when comparing the encoding of protein sequences of different lengths. These results show that our model has more advantages in representation learning for longer sequences. To verify the model's generative effect, we also calculate the sequence identity between the generated data and the input data. The sequence identity obtained by T-VAE improved by 12.9% compared to the baseline model.
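The evaluation quantities named in this abstract can be sketched as follows. Two points are assumptions rather than statements from the paper: "mean absolute deviation" is read here as the mean absolute difference between predicted and measured values, and sequence identity is computed position by position on equal-length sequences without alignment. All function names are hypothetical.

```python
import math


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def mean_abs_difference(predicted: list[float], measured: list[float]) -> float:
    """Average of |prediction - measurement| (assumed meaning of the abstract's metric)."""
    return sum(abs(p - m) for p, m in zip(predicted, measured)) / len(predicted)


def sequence_identity(a: str, b: str) -> float:
    """Fraction of positions with the same residue; assumes equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(ra == rb for ra, rb in zip(a, b)) / len(a)
```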
Topics: Humans; Mutant Proteins; Amino Acid Sequence; Amino Acids; Biological Evolution; Learning
PubMed: 37480001
DOI: 10.1186/s12859-023-05415-9 -
Molecular Plant-microbe Interactions :... Dec 2023
Cytoplasmic effectors with an Arg-any amino acid-Arg-Leu (RxLR) motif are encoded by hundreds of genes within the genomes of oomycete spp. and downy mildew pathogens. There has been a dramatic increase in our understanding of the evolution, function, and recognition of these effectors. Host proteins with a wide range of subcellular localizations and functions are targeted by RxLR effectors. Many processes are manipulated, including transcription, post-translational modifications, such as phosphorylation and ubiquitination, secretion, and intracellular trafficking. This involves an array of RxLR effector modes-of-action, including stabilization or destabilization of protein targets, altering or disrupting protein complexes, inhibition or utility of target enzyme activities, and changing the location of protein targets. Interestingly, approximately 50% of identified host proteins targeted by RxLR effectors are negative regulators of immunity. Avirulence RxLR effectors may be directly or indirectly detected by nucleotide-binding leucine-rich repeat resistance (NLR) proteins. Direct recognition by a single NLR of RxLR effector orthologues conserved across multiple pathogens may provide wide protection of diverse crops. Failure of RxLR effectors to interact with or appropriately manipulate target proteins in nonhost plants has been shown to restrict host range. This knowledge can potentially be exploited to alter host targets to prevent effector interaction, providing a barrier to host infection. Finally, recent evidence suggests that RxLR effectors, like cytoplasmic effectors from fungal pathogen , may enter host cells via clathrin-mediated endocytosis. Copyright © 2023 The Author(s). This is an open access article distributed under the CC BY-NC-ND 4.0 International license.
Topics: Phytophthora infestans; Amino Acid Sequence; Amino Acid Motifs; Proteins; Crops, Agricultural; Plant Diseases
PubMed: 37750829
DOI: 10.1094/MPMI-05-23-0054-CR -
Molecules (Basel, Switzerland) Oct 2023
Protein structure prediction represents a significant challenge in the field of bioinformatics, with the prediction of protein structures using backbone dihedral angles recently achieving significant progress due to the rise of deep neural network research. However, there is a trend in protein structure prediction research to employ increasingly complex neural networks and contributions from multiple models. This study, on the other hand, explores how a single model transparently behaves using sequence data only and what can be expected from the predicted angles. To this end, the current paper presents data acquisition, deep learning model definition, and training toward the final protein backbone angle prediction. The method applies a simple fully connected neural network (FCNN) model that takes only the primary structure of the protein with a sliding window of size 21 as input to predict protein backbone ϕ and ψ dihedral angles. Despite its simplicity, the model shows surprising accuracy for the ϕ angle prediction and somewhat lower accuracy for the ψ angle prediction. Moreover, this study demonstrates that protein secondary structure prediction is also possible with simple neural networks that take in only the protein amino-acid residue sequence, but more complex models are required for higher accuracies.
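The sliding-window input described in this abstract can be sketched as follows. The window size of 21 comes from the abstract; the padding symbol, the one-hot encoding, and the names `windows` and `one_hot` are assumptions, since the abstract does not state how residues are encoded or how sequence ends are handled.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
PAD = "-"                             # assumed padding symbol for sequence ends


def windows(sequence: str, size: int = 21) -> list[str]:
    """One window per residue, centred on that residue and padded at both ends."""
    half = size // 2
    padded = PAD * half + sequence + PAD * half
    return [padded[i:i + size] for i in range(len(sequence))]


def one_hot(window: str) -> list[int]:
    """Flatten a window into a 0/1 vector of length len(window) * 20.
    Padding positions are encoded as all zeros."""
    vector = []
    for residue in window:
        row = [0] * len(AMINO_ACIDS)
        if residue in AMINO_ACIDS:
            row[AMINO_ACIDS.index(residue)] = 1
        vector.extend(row)
    return vector


# Each residue yields one 420-element input vector (21 positions x 20 residue types);
# a network would map it to the phi and psi angles of the central residue.
features = [one_hot(w) for w in windows("ACDEFGHIK")]
```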
Topics: Deep Learning; Proteins; Amino Acid Sequence; Neural Networks, Computer; Protein Structure, Secondary
PubMed: 37894526
DOI: 10.3390/molecules28207046 -
Briefings in Bioinformatics May 2024
Protein-protein interactions (PPIs) are the basis of many important biological processes, with protein complexes being the key forms implementing these interactions. Understanding protein complexes and their functions is critical for elucidating mechanisms of life processes, disease diagnosis and treatment and drug development. However, experimental methods for identifying protein complexes have many limitations. Therefore, it is necessary to use computational methods to predict protein complexes. Protein sequences can indicate the structure and biological functions of proteins, while also determining their binding abilities with other proteins, influencing the formation of protein complexes. Integrating these characteristics to predict protein complexes is very promising, but currently there is no effective framework that can utilize both protein sequence and PPI network topology for complex prediction. To address this challenge, we have developed HyperGraphComplex, a method based on hypergraph variational autoencoder that can capture expressive features from protein sequences without feature engineering, while also considering topological properties in PPI networks, to predict protein complexes. Experimental results demonstrated that HyperGraphComplex achieves satisfactory predictive performance when compared with state-of-the-art methods. Further bioinformatics analysis shows that the predicted protein complexes have similar attributes to known ones. Moreover, case studies corroborated the remarkable predictive capability of our model in identifying protein complexes, including 3 that were not only experimentally validated by recent studies but also exhibited high-confidence structural predictions from AlphaFold-Multimer.
We believe that the HyperGraphComplex algorithm and our provided proteome-wide high-confidence protein complex prediction dataset will help elucidate how proteins regulate cellular processes in the form of complexes, and facilitate disease diagnosis and treatment and drug development. Source codes are available at https://github.com/LiDlab/HyperGraphComplex.
Topics: Computational Biology; Protein Interaction Mapping; Proteins; Algorithms; Protein Interaction Maps; Databases, Protein; Humans; Sequence Analysis, Protein; Amino Acid Sequence
PubMed: 38851299
DOI: 10.1093/bib/bbae274 -
Scientific Reports Oct 2023
Three-dimensional protein structures are invaluable sources of information for the functional annotation of protein molecules. Describing the function of a protein sequence is one of the most common problems in biology. Generally, this problem can be facilitated by studying the tertiary structure of proteins. In the absence of protein structures, comparative modeling often provides a useful three-dimensional model of the protein associated with at least one known protein structure. Comparative modeling predicts the tertiary structure of a certain protein sequence (target) mainly based on its sequence homology to one or more proteins with known structures (templates). MODELLER is one of the most widely used tools for homology or comparative modeling of three-dimensional protein structures. However, most users find it challenging to start with MODELLER as it is command-line based and requires knowledge of basic Python scripting to use it efficiently. In this study, a web-based interface has been designed to predict the tertiary structure of proteins based on MODELLER, which does the comparative modeling automatically, and uses PHP and Python programming languages. This tool is called "EasyModel" and is available at http://bioinf.modares.ac.ir/software/easymodel/. EasyModel provides a straightforward graphical interface for MODELLER that can be used in only one browser.
Topics: Software; Proteins; Programming Languages; Amino Acid Sequence; Internet; User-Computer Interface
PubMed: 37821634
DOI: 10.1038/s41598-023-44505-9 -
PloS One 2024
Convolutional neural networks (CNNs) are currently among the most widely-used deep neural network (DNN) architectures available and achieve state-of-the-art performance for many problems. Originally applied to computer vision tasks, CNNs work well with any data with a spatial relationship, besides images, and have been applied to different fields. However, recent works have highlighted numerical stability challenges in DNNs, which also relate to their known sensitivity to noise injection. These challenges can jeopardise their performance and reliability. This paper investigates DeepGOPlus, a CNN that predicts protein function. DeepGOPlus has achieved state-of-the-art performance and can successfully take advantage of and annotate the abundant protein sequences emerging in proteomics. We determine the numerical stability of the model's inference stage by quantifying the numerical uncertainty resulting from perturbations of the underlying floating-point data. In addition, we explore the opportunity to use reduced-precision floating-point formats for DeepGOPlus inference, to reduce memory consumption and latency. This is achieved by instrumenting DeepGOPlus' execution using Monte Carlo Arithmetic, a technique that experimentally quantifies floating-point operation errors, and VPREC, a tool that emulates results with customizable floating-point precision formats. Focus is placed on the inference stage as it is the primary deliverable of the DeepGOPlus model, widely applicable across different environments. All in all, our results show that although the DeepGOPlus CNN is very stable numerically, it can only be selectively implemented with lower-precision floating-point formats. We conclude that predictions obtained from the pre-trained DeepGOPlus model are very reliable numerically, and use existing floating-point formats efficiently.
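The idea of quantifying numerical uncertainty by random perturbation can be illustrated with a rough pure-Python emulation. This is not the instrumentation used in the study: the noise model, the `virtual_precision` parameter, and all function names are assumptions. The estimate of significant decimal digits as -log10(sigma / |mu|) over repeated noisy runs is a commonly used convention and is assumed here.

```python
import math
import random
import statistics


def perturb(value: float, virtual_precision: int, rng: random.Random) -> float:
    """Add uniform noise at the scale of the last bit kept at `virtual_precision` bits."""
    if value == 0.0:
        return 0.0
    exponent = math.frexp(value)[1]  # value = m * 2**exponent with 0.5 <= |m| < 1
    noise = rng.uniform(-0.5, 0.5) * 2.0 ** (exponent - virtual_precision)
    return value + noise


def noisy_dot(xs, ws, virtual_precision: int, rng: random.Random) -> float:
    """Dot product in which every multiply and add result is randomly perturbed."""
    total = 0.0
    for x, w in zip(xs, ws):
        product = perturb(x * w, virtual_precision, rng)
        total = perturb(total + product, virtual_precision, rng)
    return total


def significant_digits(samples) -> float:
    """Estimated number of decimal digits shared across the noisy samples."""
    mu = statistics.mean(samples)
    sigma = statistics.stdev(samples)
    if sigma == 0.0:
        return float("inf")
    if mu == 0.0:
        return 0.0
    return -math.log10(sigma / abs(mu))


rng = random.Random(0)
xs, ws = [0.5, 1.25, 2.0, 3.5], [0.1, 0.2, 0.3, 0.4]
for bits in (24, 11):  # roughly single- and half-precision significand widths
    runs = [noisy_dot(xs, ws, bits, rng) for _ in range(50)]
    print(bits, round(significant_digits(runs), 2))
```

Comparing the digit estimate at different virtual precisions gives a sense of how much precision a computation can lose before its result degrades, which is the kind of question the abstract raises for reduced-precision inference.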
Topics: Reproducibility of Results; Neural Networks, Computer; Amino Acid Sequence; Proteins; Monte Carlo Method
PubMed: 38285635
DOI: 10.1371/journal.pone.0296725 -
Biomolecules Jul 2023
Tandem repeats in proteins are patterns of residues repeated directly adjacent to each other. The evolution of these repeats can be assessed by using groups of homologous sequences, which can help point to events of unit duplication or deletion. High pressure in a protein family for variation of a given type of repeat might point to their function. Here, we propose the analysis of protein families to calculate protein short tandem repeats (pSTRs) in each protein sequence and assess their variability within the family in terms of number of units. To facilitate this analysis, we developed the pSTR tool, a method to analyze the evolution of protein short tandem repeats in a given protein family by pairwise comparisons between evolutionarily related protein sequences. We evaluated pSTR unit number variation in protein families of 12 complete metazoan proteomes. We hypothesize that families with more dynamic ensembles of repeats could reflect particular roles of these repeats in processes that require more adaptability.
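The notion of counting repeat units and comparing them between related sequences can be sketched as below. This is a naive exact-match scan and not the algorithm of the pSTR tool; the function names, the fixed unit length, and the example sequences are hypothetical.

```python
def find_tandem_repeats(sequence: str, unit_len: int, min_units: int = 2):
    """Return (start, unit, count) for each run of a unit repeated back to back.
    Greedy left-to-right scan; only exact, uninterrupted repeats are reported."""
    hits = []
    i = 0
    while i + unit_len <= len(sequence):
        unit = sequence[i:i + unit_len]
        count = 1
        j = i + unit_len
        while sequence[j:j + unit_len] == unit:
            count += 1
            j += unit_len
        if count >= min_units:
            hits.append((i, unit, count))
            i = j  # continue after the run
        else:
            i += 1
    return hits


def max_units(sequence: str, unit: str) -> int:
    """Longest run (in units) of `unit` found by the scan above, or 0 if none."""
    runs = [c for _, u, c in find_tandem_repeats(sequence, len(unit)) if u == unit]
    return max(runs, default=0)


# Two made-up homologous fragments that differ in the number of "QA" units.
seq_a = "MKQAQAQAQALP"
seq_b = "MKQAQALP"
unit_difference = max_units(seq_a, "QA") - max_units(seq_b, "QA")
```

A pairwise difference in unit count like `unit_difference` is one simple way to express the unit number variation the abstract refers to; a real analysis would also need to handle approximate repeats and alignment between homologs.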
Topics: Animals; Amino Acid Sequence; Proteome; Microsatellite Repeats; Evolution, Molecular
PubMed: 37509152
DOI: 10.3390/biom13071116 -
Nucleic Acids Research Jan 2024
The OpenProt proteogenomic resource (https://www.openprot.org/) provides users with a complete and freely accessible set of non-canonical or alternative open reading frames (AltORFs) within the transcriptome of various species, as well as functional annotations of the corresponding protein sequences not found in standard databases. Enhancements in this update are largely the result of user feedback and include the prediction of structure, subcellular localization, and intrinsic disorder, using cutting-edge algorithms based on machine learning techniques. The mass spectrometry pipeline now integrates a machine learning-based peptide rescoring method to improve peptide identification. We continue to help users explore this cryptic proteome by providing OpenCustomDB, a tool that enables users to build their own customized protein databases, and OpenVar, a genomic annotator including genetic variants within AltORFs and protein sequences. A new interface improves the visualization of all functional annotations, including a spectral viewer and the prediction of multicoding genes. All data on OpenProt are freely available and downloadable. Overall, OpenProt continues to establish itself as an important resource for the exploration and study of new proteins.
Topics: Amino Acid Sequence; Databases, Protein; Genomics; Internet; Peptides; Proteome; Proteomics; Humans
PubMed: 37956315
DOI: 10.1093/nar/gkad1050 -
PLoS Computational Biology Nov 2023
Intrinsically disordered proteins (IDPs) and regions (IDRs) are a class of functionally important proteins and regions that lack stable three-dimensional structures under the native physiologic conditions. They participate in critical biological processes and thus are associated with the pathogenesis of many severe human diseases. Identifying the IDPs/IDRs and their functions will be helpful for a comprehensive understanding of protein structures and functions, and inform studies of rational drug design. Over the past decades, the exponential growth in the number of proteins with sequence information has deepened the gap between uncharacterized and annotated disordered sequences. Protein language models have recently demonstrated their powerful abilities to capture complex structural and functional information from the enormous quantity of unlabelled protein sequences, providing opportunities to apply protein language models to uncover the intrinsic disorders and their biological properties from the amino acid sequences. In this study, we proposed a computational predictor called IDP-LM for predicting intrinsic disorder and disorder functions by leveraging the pre-trained protein language models. IDP-LM takes the embeddings extracted from three pre-trained protein language models as the exclusive inputs, including ProtBERT, ProtT5 and a disorder specific language model (IDP-BERT). The ablation analysis showed that the IDP-BERT provided fine-grained feature representations of disorder, and the combination of three language models is the key to the performance improvement of IDP-LM. The evaluation results on independent test datasets demonstrated that the IDP-LM provided high-quality prediction results for intrinsic disorder and four common disordered functions.
Topics: Humans; Intrinsically Disordered Proteins; Amino Acid Sequence; Language; Drug Design; Protein Conformation
PubMed: 37992088
DOI: 10.1371/journal.pcbi.1011657