-
BMC Genomics Jan 2020
BACKGROUND
Impaired proteostatic regulation of proteins with prion-like domains (PrLDs) is associated with a variety of human diseases including neurodegenerative disorders, myopathies, and certain forms of cancer. For many of these disorders, current models suggest a prion-like molecular mechanism of disease, whereby proteins aggregate and spread to neighboring cells in an infectious manner. The development of prion prediction algorithms has facilitated the large-scale identification of PrLDs among "reference" proteomes for various organisms. However, the degree to which intraspecies protein sequence diversity influences predicted prion propensity has not been systematically examined.
RESULTS
Here, we explore protein sequence variation introduced at genetic, post-transcriptional, and post-translational levels, and its influence on predicted aggregation propensity for human PrLDs. We find that sequence variation is relatively common among PrLDs and in some cases can result in relatively large differences in predicted prion propensity. Sequence variation introduced at the post-transcriptional level (via alternative splicing) also commonly affects predicted aggregation propensity, often by direct inclusion or exclusion of a PrLD. Finally, analysis of a database of sequence variants associated with human disease reveals a number of mutations within PrLDs that are predicted to increase prion propensity.
CONCLUSIONS
Our analyses expand the list of candidate human PrLDs, quantitatively estimate the effects of sequence variation on the aggregation propensity of PrLDs, and suggest the involvement of prion-like mechanisms in additional human diseases.
Topics: Algorithms; Alternative Splicing; Amino Acid Sequence; Humans; Mutation; Neurodegenerative Diseases; Prion Proteins; Prions; Protein Aggregates; Protein Domains; Proteome
PubMed: 31914925
DOI: 10.1186/s12864-019-6425-3 -
PLoS Computational Biology Aug 2022
The unprecedented performance of DeepMind's AlphaFold2 in predicting protein structure in CASP XIV, together with the creation of a database of structures for multiple proteomes and protein sequence repositories, is reshaping structural biology. However, because this database returns a single structure, it has called into question AlphaFold2's ability to capture the intrinsic conformational flexibility of proteins. Here we present a general approach to drive AlphaFold2 to model alternate protein conformations through simple manipulation of the multiple sequence alignment via in silico mutagenesis. The approach is grounded in the hypothesis that the multiple sequence alignment must also encode protein structural heterogeneity, so that its rational manipulation will enable AlphaFold2 to sample alternate conformations. A systematic modeling pipeline is benchmarked against canonical examples of protein conformational flexibility and applied to interrogate the conformational landscape of membrane proteins. This work broadens the applicability of AlphaFold2 by generating multiple protein conformations to be tested biologically, biochemically, and biophysically, and for use in structure-based drug design.
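As a rough illustration of what "in silico mutagenesis" of a multiple sequence alignment can look like, the sketch below overwrites chosen alignment columns with a single residue in every row. The abstract does not state which columns are mutated, which residue is substituted, or how gaps are treated, so the function name `mutate_columns`, the default of alanine (`"A"`), and the choice to leave gap characters untouched are all assumptions made for illustration, not the authors' pipeline.

```python
from typing import Iterable, List


def mutate_columns(msa: List[str], columns: Iterable[int], residue: str = "A") -> List[str]:
    """Return a copy of the alignment with the given 0-based columns
    replaced by `residue` in every sequence.

    Assumption (not from the abstract): gap characters ('-') are kept,
    so the alignment geometry is unchanged.
    """
    targets = set(columns)
    mutated = []
    for row in msa:
        new_row = [
            residue if (i in targets and ch != "-") else ch
            for i, ch in enumerate(row)
        ]
        mutated.append("".join(new_row))
    return mutated


# Hypothetical toy alignment: two aligned sequences of length 5.
toy_msa = ["MKTAY", "MR-AY"]
print(mutate_columns(toy_msa, columns=[1, 2]))
```

The mutated alignment would then be passed to the structure predictor in place of the original one; that step depends on the prediction software and is not shown here.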
Topics: Amino Acid Sequence; Drug Design; Protein Conformation; Proteins; Sequence Alignment
PubMed: 35994486
DOI: 10.1371/journal.pcbi.1010483 -
Proceedings of the National Academy of... Aug 2023
Metabolite levels shape cellular physiology and disease susceptibility, yet the general principles governing metabolome evolution are largely unknown. Here, we introduce a measure of conservation of individual metabolite levels among related species. By analyzing multispecies tissue metabolome datasets in phylogenetically diverse mammals and fruit flies, we show that conservation varies extensively across metabolites. Three major functional properties (metabolite abundance, essentiality, and association with human diseases) predict conservation, highlighting a striking parallel between the evolutionary forces driving metabolome and protein sequence conservation. Metabolic network simulations recapitulated these general patterns and revealed that abundant metabolites are highly conserved due to their strong coupling to key metabolic fluxes in the network. Finally, we show that biomarkers of metabolic diseases can be distinguished from other metabolites simply based on evolutionary conservation, without requiring any prior clinical knowledge. Overall, this study uncovers simple rules that govern metabolic evolution in animals and implies that most tissue metabolome differences between species are permitted, rather than favored, by natural selection. More broadly, our work paves the way toward using evolutionary information to identify biomarkers, as well as to detect pathogenic metabolome alterations in individual patients.
Topics: Animals; Humans; Metabolome; Amino Acid Sequence; Drosophila; Knowledge; Mammals
PubMed: 37603743
DOI: 10.1073/pnas.2302147120 -
Molecules (Basel, Switzerland) May 2022 (Review)
Protein folding is a complicated phenomenon spanning various time scales (μs to several s), and various structural indices are required to analyze it. The methodologies used to study it are also diverse, employing a range of experimental and computational techniques. Thus, simple speculation is not sufficient to understand the folding mechanism of a protein. In the present review, we discuss recent studies conducted by the author and colleagues to decode amino acid sequences to obtain information on protein folding. We investigate globin-like proteins, ferredoxin-like fold proteins, IgG-like beta-sandwich fold proteins, lysozyme-like fold proteins and β-trefoil-like fold proteins. Our techniques are based on statistics relating to the inter-residue average distance, and our studies performed so far indicate that the information obtained from these analyses includes data on the protein folding mechanism. The relationships between our results and the actual protein folding phenomena are also discussed.
Topics: Amino Acid Sequence; Models, Molecular; Protein Folding; Proteins; Staphylococcal Protein A
PubMed: 35566370
DOI: 10.3390/molecules27093020 -
Bioinformatics (Oxford, England) Mar 2024
MOTIVATION
Protein sequence database search and multiple sequence alignment generation are fundamental tasks in many bioinformatics analyses. As the volume of sequence data continues to grow rapidly, there is an increasing need for efficient and scalable multiple sequence query algorithms that can search very large databases without prohibitive time and computational costs.
RESULTS
We introduce Chorus, a novel protein sequence query system that leverages a parallel model and a heterogeneous computing architecture to enable users to query thousands of protein sequences concurrently against large protein databases on a desktop workstation. Chorus achieves over 100× speedup over BLASTP without sacrificing sensitivity. We demonstrate the utility of Chorus through a case study in which a ∼1.5-TB large-scale metagenomic dataset is analyzed for novel CRISPR-Cas protein discovery within 30 min.
AVAILABILITY AND IMPLEMENTATION
Chorus is open-source and its code repository is available at https://github.com/Bio-Acc/Chorus.
Topics: Software; Algorithms; Amino Acid Sequence; Proteins; Databases, Protein
PubMed: 38547405
DOI: 10.1093/bioinformatics/btae151 -
Scientific Reports Apr 2024
Evolution shapes protein sequences for their functions. Here, we studied the moonlighting functions of the N-linked sequon NXS/T, where X is not P, in human nucleocytosolic proteins. By comparison with membrane and secreted proteins, in which sequons are well known for N-glycosylation, we discovered that cyto-sequons can participate in nucleic acid binding, particularly in zinc finger proteins. Our global studies further discovered that sequon occurrence is largely proportional to protein length. The contribution of sequons to protein functions, including both N-glycosylation and nucleic acid binding, can be regulated through their density as well as the biased usage between NXS and NXT. In proteins that are rich in other PTMs or structural features, such as phosphorylation, transmembrane α-helices, and disulfide bridges, sequon occurrence is scarce. The information acquired here should help in understanding the relationship between protein sequence and function and assist future protein design and engineering.
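The sequon pattern described above (N, then any residue except P, then S or T) maps directly onto a regular expression. The following is a minimal sketch using Python's standard `re` module; the function name `find_sequons` and the toy sequence are hypothetical, and the pattern encodes only the rule stated in the abstract, not any further filtering the authors may have applied.

```python
import re
from typing import List, Tuple

# N, then any character other than P, then S or T.
# The lookahead lets overlapping sequons (e.g. "NNTS") all be reported.
SEQUON = re.compile(r"(?=(N[^P][ST]))")


def find_sequons(sequence: str) -> List[Tuple[int, str]]:
    """Return (0-based start position, motif) for every NXS/T sequon."""
    return [(m.start(), m.group(1)) for m in SEQUON.finditer(sequence.upper())]


def sequon_density(sequence: str) -> float:
    """Sequons per residue; 0.0 for an empty sequence."""
    return len(find_sequons(sequence)) / len(sequence) if sequence else 0.0


# Hypothetical toy sequence: "NPT" is skipped because X is proline.
print(find_sequons("MNGSANPTNNTS"))
```

Splitting the hits by their last residue (S versus T) would give the NXS and NXT counts needed to look at usage bias.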
Topics: Humans; Proteins; Glycosylation; Amino Acid Sequence; Phosphorylation; Nucleic Acids
PubMed: 38565583
DOI: 10.1038/s41598-024-57334-1 -
Bioinformatics (Oxford, England) Nov 2021
MOTIVATION
Modeling of protein family sequence distributions from homologous sequence data has recently received considerable attention, in particular for structure and function prediction, as well as for protein design. Notably, direct coupling analysis, a method to infer effective pairwise interactions between residues, was shown to capture important structural constraints and to successfully generate functional protein sequences. Building on this and other graphical models, we introduce a new framework to assess the quality of the secondary structures of the generated sequences with respect to reference structures for the family.
RESULTS
We introduce two scoring functions, called Dot Product and Pattern Matching, that characterize the likelihood that the secondary structure of a protein sequence matches a reference structure. We test these scores on published experimental protein mutagenesis and design datasets, and show improvement in the detection of nonfunctional sequences. We also show that use of these scores helps reject nonfunctional sequences generated by graphical models (Restricted Boltzmann Machines) learned from homologous sequence alignments.
AVAILABILITY AND IMPLEMENTATION
Data and code available at https://github.com/CyrilMa/ssqa.
SUPPLEMENTARY INFORMATION
Supplementary data are available at Bioinformatics online.
Topics: Proteins; Amino Acid Sequence; Sequence Alignment; Protein Structure, Secondary; Mutagenesis
PubMed: 34117879
DOI: 10.1093/bioinformatics/btab442 -
Nature Communications Jan 2023
To better understand how amino acid sequence encodes protein structure, we engineered mutational pathways that connect three common folds (3α, β-grasp, and α/β-plait). The structures of proteins at high sequence-identity intersections in the pathways (nodes) were determined using NMR spectroscopy and analyzed for stability and function. To generate nodes, the amino acid sequence encoding a smaller fold is embedded in the structure of an ~50% larger fold and a new sequence compatible with two sets of native interactions is designed. This generates protein pairs with a 3α or β-grasp fold in the smaller form but an α/β-plait fold in the larger form. Further, embedding smaller antagonistic folds creates critical states in the larger folds such that single amino acid substitutions can switch both their fold and function. The results help explain the underlying ambiguity in the protein folding code and show that new protein structures can evolve via abrupt fold switching.
Topics: Proteins; Amino Acid Sequence; Protein Folding; Staphylococcal Protein A; Mutation
PubMed: 36702827
DOI: 10.1038/s41467-023-36065-3 -
BMC Bioinformatics Oct 2021
BACKGROUND
Protein subcellular localization prediction plays an important role in biology research. Since traditional methods are laborious and time-consuming, many machine learning-based prediction methods have been proposed. However, most of the proposed methods ignore the evolutionary information of proteins. To improve prediction accuracy, we present a deep learning-based method to predict protein subcellular locations.
RESULTS
Our method utilizes not only the amino acid sequences but also the evolution matrices of proteins. It uses a bidirectional long short-term memory network that processes the entire protein sequence and a convolutional neural network that extracts features from protein sequences. The position-specific scoring matrix is used as a supplement to protein sequences. Our method was trained and tested on two benchmark datasets. The experimental results show that our method yields accurate results on the two datasets, with an average precision of 0.7901, a ranking loss of 0.0758 and a coverage of 1.2848.
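Ranking loss and coverage are multi-label ranking metrics, and a small worked example may help in reading the reported numbers. The sketch below uses common textbook definitions: ranking loss is the fraction of (true, false) label pairs that are ordered incorrectly, and coverage is how far down the ranked label list one must go to include every true label. Whether the paper uses exactly these conventions (for instance, how ties are counted, or whether coverage is offset by one) is an assumption; the function names and toy data are hypothetical.

```python
from typing import List, Sequence


def ranking_loss(scores: Sequence[Sequence[float]], labels: Sequence[Sequence[int]]) -> float:
    """Mean fraction of (true, false) label pairs where the false label
    scores at least as high as the true one (ties count as errors)."""
    per_sample: List[float] = []
    for s, y in zip(scores, labels):
        pos = [s[i] for i, t in enumerate(y) if t]
        neg = [s[i] for i, t in enumerate(y) if not t]
        if not pos or not neg:
            continue  # undefined for this sample; skip it
        misordered = sum(1 for p in pos for n in neg if p <= n)
        per_sample.append(misordered / (len(pos) * len(neg)))
    return sum(per_sample) / len(per_sample) if per_sample else 0.0


def coverage(scores: Sequence[Sequence[float]], labels: Sequence[Sequence[int]]) -> float:
    """Mean number of top-ranked labels needed to cover all true labels."""
    depths: List[int] = []
    for s, y in zip(scores, labels):
        pos = [s[i] for i, t in enumerate(y) if t]
        if not pos:
            continue
        lowest_true = min(pos)
        depths.append(sum(1 for v in s if v >= lowest_true))
    return sum(depths) / len(depths) if depths else 0.0


# Hypothetical toy data: 2 proteins, 3 candidate locations each.
toy_scores = [[0.9, 0.2, 0.6], [0.1, 0.8, 0.3]]
toy_labels = [[1, 0, 1], [1, 0, 0]]
print(ranking_loss(toy_scores, toy_labels), coverage(toy_scores, toy_labels))
```

Under these definitions, lower is better for both metrics, and the best possible coverage for a sample equals its number of true labels.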
CONCLUSION
The experimental results show that our method outperforms five currently available methods. Based on these experiments, our method is an acceptable alternative for predicting protein subcellular location.
Topics: Amino Acid Sequence; Computational Biology; Databases, Protein; Deep Learning; Position-Specific Scoring Matrices; Proteins
PubMed: 34686152
DOI: 10.1186/s12859-021-04404-0 -
Biomolecules Sep 2022
Galectins constitute a protein family of soluble and non-glycosylated animal lectins that show a β-galactoside-binding activity via a conserved sequence of approximately 130-140 amino acids located in the carbohydrate recognition domain (CRD) [...].
Topics: Amino Acid Sequence; Amino Acids; Animals; Carbohydrates; Galectins; Neoplasms
PubMed: 36139094
DOI: 10.3390/biom12091255