-
Proceedings of the National Academy of... Aug 2023
Metabolite levels shape cellular physiology and disease susceptibility, yet the general principles governing metabolome evolution are largely unknown. Here, we introduce a measure of conservation of individual metabolite levels among related species. By analyzing multispecies tissue metabolome datasets in phylogenetically diverse mammals and fruit flies, we show that conservation varies extensively across metabolites. Three major functional properties, metabolite abundance, essentiality, and association with human diseases predict conservation, highlighting a striking parallel between the evolutionary forces driving metabolome and protein sequence conservation. Metabolic network simulations recapitulated these general patterns and revealed that abundant metabolites are highly conserved due to their strong coupling to key metabolic fluxes in the network. Finally, we show that biomarkers of metabolic diseases can be distinguished from other metabolites simply based on evolutionary conservation, without requiring any prior clinical knowledge. Overall, this study uncovers simple rules that govern metabolic evolution in animals and implies that most tissue metabolome differences between species are permitted, rather than favored by natural selection. More broadly, our work paves the way toward using evolutionary information to identify biomarkers, as well as to detect pathogenic metabolome alterations in individual patients.
Topics: Animals; Humans; Metabolome; Amino Acid Sequence; Drosophila; Knowledge; Mammals
PubMed: 37603743
DOI: 10.1073/pnas.2302147120
-
Scientific Reports Apr 2024
Evolution shapes protein sequences for their functions. Here, we studied the moonlighting functions of the N-linked sequon NXS/T, where X is not P, in human nucleocytosolic proteins. By comparing membrane and secreted proteins in which sequons are well known for N-glycosylation, we discovered that cyto-sequons can participate in nucleic acid binding, particularly in zinc finger proteins. Our global studies further discovered that sequon occurrence is largely proportional to protein length. The contribution of sequons to protein functions, including both N-glycosylation and nucleic acid binding, can be regulated through their density as well as the biased usage between NXS and NXT. In proteins where other PTMs or structural features are rich, such as phosphorylation, transmembrane α-helices, and disulfide bridges, sequon occurrence is scarce. The information acquired here should help understand the relationship between protein sequence and function and assist future protein design and engineering.
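The sequon definition in this abstract (N, then any residue except P, then S or T) maps directly onto a regular expression. The following is a minimal sketch, not the authors' pipeline: the function names, the example sequence, and the "per 100 residues" density measure are illustrative assumptions.

```python
import re

# Lookahead so that overlapping sequons (e.g. "NNTS") are all reported.
# Note: [^P] accepts any character other than P, including non-standard letters.
SEQUON = re.compile(r"(?=(N[^P][ST]))")


def find_sequons(seq):
    """Return (0-based position, motif) for every N-X-S/T sequon with X != P."""
    return [(m.start(), m.group(1)) for m in SEQUON.finditer(seq.upper())]


def sequon_summary(seq):
    """Count sequons, split them into NXS vs NXT, and report a simple density."""
    hits = find_sequons(seq)
    nxs = sum(1 for _, motif in hits if motif.endswith("S"))
    nxt = len(hits) - nxs
    # Density per 100 residues is an assumed normalization, not the paper's.
    density = 100.0 * len(hits) / len(seq) if seq else 0.0
    return {"count": len(hits), "NXS": nxs, "NXT": nxt, "per_100_residues": density}


if __name__ == "__main__":
    example = "MNGSANPTNKT"  # made-up sequence for illustration only
    print(find_sequons(example))
    print(sequon_summary(example))
```

In the made-up example, the `NPT` stretch is skipped because the middle residue is proline, which is the exclusion the abstract states.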
Topics: Humans; Proteins; Glycosylation; Amino Acid Sequence; Phosphorylation; Nucleic Acids
PubMed: 38565583
DOI: 10.1038/s41598-024-57334-1
-
International Journal of Molecular... Mar 2024
Tandem repeats (TRs) in protein sequences are consecutive, highly similar sequence motifs. Some types of TRs fold into structural units that pack together in ensembles, forming either an (open) elongated domain or a (closed) propeller, where the last unit of the ensemble packs against the first one. Here, we examine TR proteins (TRPs) to see how their sequence, structure, and evolutionary properties favor them for a function as mediators of protein interactions. Our observations suggest that TRPs bind other proteins using large, structured surfaces like globular domains; in particular, open-structured TR ensembles are favored by flexible termini and the possibility to tightly coil against their targets. While, intuitively, open ensembles of TRs seem prone to evolve due to their potential to accommodate insertions and deletions of units, these evolutionary events are unexpectedly rare, suggesting that they are advantageous for the emergence of the ancestral sequence but are early fixed. We hypothesize that their flexibility makes it easier for further proteins to adapt to interact with them, which would explain their large number of protein interactions. We provide insight into the properties of open TR ensembles, which make them scaffolds for alternative protein complexes to organize genes, RNA and proteins.
Topics: Proteins; Tandem Repeat Sequences; Amino Acid Sequence
PubMed: 38474241
DOI: 10.3390/ijms25052994
-
Computers in Biology and Medicine Dec 2023
Protein sequence classification is a crucial research field in bioinformatics, playing a vital role in facilitating functional annotation, structure prediction, and gaining a deeper understanding of protein function and interactions. With the rapid development of high-throughput sequencing technologies, a vast amount of unknown protein sequence data is being generated and accumulated, leading to an increasing demand for protein classification and annotation. Existing machine learning methods still have limitations in protein sequence classification, such as low accuracy and precision of classification models, rendering them less valuable in practical applications. Additionally, these models often lack strong generalization capabilities and cannot be widely applied to various types of proteins. Therefore, accurately classifying and predicting proteins remains a challenging task. In this study, we propose a protein sequence classifier called Multi-Laplacian Regularized Random Vector Functional Link (MLapRVFL). By incorporating Multi-Laplacian and L regularization terms into the basic Random Vector Functional Link (RVFL) method, we effectively improve the model's generalization performance, enhance the robustness and accuracy of the classification model. The experimental results on two commonly used datasets demonstrate that MLapRVFL outperforms popular machine learning methods and achieves superior predictive performance compared to previous studies. In conclusion, the proposed MLapRVFL method makes significant contributions to protein sequence prediction.
Topics: Machine Learning; Amino Acid Sequence; Proteins; Algorithms
PubMed: 37925912
DOI: 10.1016/j.compbiomed.2023.107618
-
Sensors (Basel, Switzerland) Nov 2023
Protein is one of the primary biochemical macromolecular regulators in the compartmental cellular structure, and the subcellular locations of proteins can therefore provide information on the function of subcellular structures and physiological environments. Recently, data-driven systems have been developed to predict the subcellular location of proteins based on protein sequence, immunohistochemistry (IHC) images, or immunofluorescence (IF) images. However, the research on the fusion of multiple protein signals has received little attention. In this study, we developed a dual-signal computational protocol by incorporating IHC images into protein sequences to learn protein subcellular localization. Three major steps can be summarized as follows in this protocol: first, a benchmark database that includes 281 proteins sorted out from 4722 proteins of the Human Protein Atlas (HPA) and Swiss-Prot database, which is involved in the endoplasmic reticulum (ER), Golgi apparatus, cytosol, and nucleoplasm; second, discriminative feature operators were first employed to quantitate protein image-sequence samples that include IHC images and protein sequence; finally, the feature subspace of different protein signals is absorbed to construct multiple sub-classifiers via dimensionality reduction and binary relevance (BR), and multiple confidence derived from multiple sub-classifiers is adopted to decide subcellular location by the centralized voting mechanism at the decision layer. The experimental results indicated that the dual-signal model embedded IHC images and protein sequences outperformed the single-signal models with accuracy, precision, and recall of 75.41%, 80.38%, and 74.38%, respectively. It is enlightening for further research on protein subcellular location prediction under multi-signal fusion of protein.
Topics: Humans; Immunohistochemistry; Proteins; Amino Acid Sequence; Cell Nucleus; Databases, Protein; Subcellular Fractions
PubMed: 38005402
DOI: 10.3390/s23229014
-
Genes Dec 2023
Protein structure analysis is essential in various bioinformatics domains such as drug discovery, disease diagnosis, and evolutionary studies. Within structural biology, the classification of protein structures is pivotal, employing machine learning algorithms to categorize structures based on data from databases like the Protein Data Bank (PDB). To predict protein functions, embeddings based on protein sequences have been employed. Creating numerical embeddings that preserve vital information while considering protein structure and sequence presents several challenges. The existing literature lacks a comprehensive and effective approach that combines structural and sequence-based features to achieve efficient protein classification. While large language models (LLMs) have exhibited promising outcomes for protein function prediction, their focus primarily lies on protein sequences, disregarding the 3D structures of proteins. The quality of embeddings heavily relies on how well the geometry of the embedding space aligns with the underlying data structure, posing a critical research question. Traditionally, Euclidean space has served as a widely utilized framework for embeddings. In this study, we propose a novel method for designing numerical embeddings in Euclidean space for proteins by leveraging 3D structure information, specifically employing the concept of contact maps. These embeddings are synergistically combined with features extracted from LLMs and traditional feature engineering techniques to enhance the performance of embeddings in supervised protein analysis. Experimental results on benchmark datasets, including PDB Bind and STCRDAB, demonstrate the superior performance of the proposed method for protein function prediction.
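The abstract names contact maps as the source of 3D structure information but does not say how they are computed or turned into embeddings. As a hedged illustration of the general concept only, the sketch below builds a binary contact map from per-residue coordinates and flattens its upper triangle into a feature vector. The 8 Å cutoff, the use of one coordinate per residue (for example Cα atoms), and both function names are assumptions, not details from the paper.

```python
import math


def contact_map(coords, cutoff=8.0):
    """Binary contact map: entry [i][j] is 1 when residues i and j lie within
    `cutoff` (assumed to be in angstroms) of each other, else 0."""
    n = len(coords)
    return [
        [1 if math.dist(coords[i], coords[j]) <= cutoff else 0 for j in range(n)]
        for i in range(n)
    ]


def flatten_upper(cmap):
    """Flatten the strict upper triangle into a feature vector.
    The map is symmetric, so the lower triangle and diagonal add nothing."""
    n = len(cmap)
    return [cmap[i][j] for i in range(n) for j in range(i + 1, n)]


if __name__ == "__main__":
    # Toy coordinates (one point per residue), invented for illustration.
    toy = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (20.0, 0.0, 0.0)]
    cmap = contact_map(toy)
    print(cmap)
    print(flatten_upper(cmap))
```

A flattened map like this has a length that depends on protein size, so a real pipeline would need some fixed-length summarization step before combining it with sequence-derived features; that step is not specified in the abstract.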
Topics: Amino Acid Sequence; Algorithms; Benchmarking; Databases, Protein; Language
PubMed: 38254915
DOI: 10.3390/genes15010025
-
Briefings in Bioinformatics Sep 2023
Most life activities in organisms are regulated through protein complexes, which are mainly controlled via Protein-Protein Interactions (PPIs). Discovering new interactions between proteins and revealing their biological functions are of great significance for understanding the molecular mechanisms of biological processes and identifying potential targets in drug discovery. Current experimental methods capture only stable protein interactions, which leads to limited coverage. In addition, they are expensive and time-consuming. In recent years, various computational methods have been successfully developed for predicting PPIs based only on protein homology, primary protein sequences, or gene ontology information. Computational efficiency and data complexity are still the main bottlenecks for algorithm generalization. In this study, we propose a novel computational framework, HNSPPI, to predict PPIs. As a hybrid supervised learning model, HNSPPI comprehensively characterizes the intrinsic relationship between two proteins by integrating amino acid sequence information and connection properties of the PPI network. The experimental results show that HNSPPI works very well on six benchmark datasets. Moreover, comparative analysis showed that our model significantly outperforms five other existing algorithms. Finally, we used the HNSPPI model to explore the SARS-CoV-2-Human interaction system and found several potential regulations. In summary, HNSPPI is a promising model for predicting new protein interactions from known PPI data.
Topics: Humans; COVID-19; SARS-CoV-2; Algorithms; Amino Acid Sequence; Benchmarking
PubMed: 37480553
DOI: 10.1093/bib/bbad261
-
MBio Mar 2024
Endosomal sorting complexes required for transport (ESCRT) play key roles in protein sorting between membrane-bounded compartments of eukaryotic cells. Homologs of many ESCRT components are identifiable in various groups of archaea, especially in Asgardarchaeota, the archaeal phylum that is currently considered to include the closest relatives of eukaryotes, but not in bacteria. We performed a comprehensive search for ESCRT protein homologs in archaea and reconstructed ESCRT evolution using the phylogenetic tree of Vps4 ATPase (ESCRT IV) as a scaffold and using sensitive protein sequence analysis and comparison of structural models to identify previously unknown ESCRT proteins. Several distinct groups of ESCRT systems in archaea outside of Asgard were identified, including proteins structurally similar to ESCRT-I and ESCRT-II, and several other domains involved in protein sorting in eukaryotes, suggesting an early origin of these components. Additionally, distant homologs of CdvA proteins were identified in Thermoproteales which are likely components of the uncharacterized cell division system in these archaea. We propose an evolutionary scenario for the origin of eukaryotic and Asgard ESCRT complexes from ancestral building blocks, namely, the Vps4 ATPase, ESCRT-III components, wH (winged helix-turn-helix fold) and possibly also coiled-coil, and Vps28-like domains. The last archaeal common ancestor likely encompassed a complex ESCRT system that was involved in protein sorting. Subsequent evolution involved either simplification, as in the TACK superphylum, where ESCRT was co-opted for cell division, or complexification as in Asgardarchaeota. In Asgardarchaeota, the connection between ESCRT and the ubiquitin system that was previously considered a eukaryotic signature was already established.
IMPORTANCE: All eukaryotic cells possess complex intracellular membrane organization. Endosomal sorting complexes required for transport (ESCRT) play a central role in membrane remodeling which is essential for cellular functionality in eukaryotes. Recently, it has been shown that Asgard archaea, the archaeal phylum that includes the closest known relatives of eukaryotes, encode homologs of many components of the ESCRT systems. We employed protein sequence and structure comparisons to reconstruct the evolution of ESCRT systems in archaea and identified several previously unknown homologs of ESCRT subunits, some of which can be predicted to participate in cell division. The results of this reconstruction indicate that the last archaeal common ancestor already encoded a complex ESCRT system that was involved in protein sorting. In Asgard archaea, ESCRT systems evolved toward greater complexity, and in particular, the connection between ESCRT and the ubiquitin system that was previously considered a eukaryotic signature was established.
Topics: Endosomal Sorting Complexes Required for Transport; Phylogeny; Amino Acid Sequence; Archaea; Adenosine Triphosphatases; Ubiquitins
PubMed: 38380930
DOI: 10.1128/mbio.00335-24
-
The Journal of Physical Chemistry. B Jul 2023
Using tools developed to study the dynamic bioinformatics of proteins, we are able to study the dynamic characteristics of very large numbers of protein sequences simultaneously. We study herein the distribution of protein sequences in a space determined by sequence mobility. It is shown that there are statistically significant differences in mobility distribution between folded sequences of different structural classes and between those and sequences of intrinsically disordered proteins. It is also shown that the several regions of mobility space differ significantly with respect to structural makeup. Helical proteins are shown to have distinctive dynamic characteristics at both extremes of the mobility spectrum.
Topics: Intrinsically Disordered Proteins; Amino Acid Sequence; Protein Conformation; Protein Folding
PubMed: 37368985
DOI: 10.1021/acs.jpcb.3c02609
-
ACS Biomaterials Science & Engineering Jul 2023 (Review)
Elastin is a structural protein with outstanding mechanical properties (e.g., elasticity and resilience) and biologically relevant functions (e.g., triggering responses like cell adhesion or chemotaxis). It is formed from its precursor tropoelastin, a 60-72 kDa water-soluble and temperature-responsive protein that coacervates at physiological temperature, undergoing a phenomenon termed lower critical solution temperature (LCST). Inspired by this behavior, many scientists and engineers are developing recombinantly produced elastin-inspired biopolymers, usually termed elastin-like polypeptides (ELPs). These ELPs are generally comprised of repetitive motifs with the sequence VPGXG, which corresponds to repeats of a small part of the tropoelastin sequence, X being any amino acid except proline. ELPs display LCST and mechanical properties similar to tropoelastin, which renders them promising candidates for the development of elastic and stimuli-responsive protein-based materials. Unveiling the structure-property relationships of ELPs can aid in the development of these materials by establishing the connections between the ELP amino acid sequence and the macroscopic properties of the materials. Here we present a review of the structure-property relationships of ELPs and ELP-based materials, with a focus on LCST and mechanical properties and how experimental and computational studies have aided in their understanding.
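The VPGXG repeat described in this abstract, with guest residue X being any amino acid except proline, can be written as a small sequence builder. This is an illustrative sketch only: the function names and the choice of guest residues in the example are assumptions, and nothing here models LCST or mechanical behavior.

```python
import re

# One pentapeptide repeat: V-P-G-X-G, where the guest residue X is not proline.
ELP_REPEAT = re.compile(r"VPG[^P]G")


def build_elp(guest_residues):
    """Concatenate one VPGXG repeat per guest residue (one-letter codes)."""
    repeats = []
    for x in guest_residues:
        if len(x) != 1 or not x.isalpha() or x.upper() == "P":
            raise ValueError(f"invalid guest residue: {x!r}")
        repeats.append(f"VPG{x.upper()}G")
    return "".join(repeats)


def count_repeats(seq):
    """Count non-overlapping VPGXG repeats in a sequence."""
    return len(ELP_REPEAT.findall(seq.upper()))


if __name__ == "__main__":
    elp = build_elp("VAG")  # guest residues chosen arbitrarily for illustration
    print(elp)
    print(count_repeats(elp))
```

Rejecting proline in `build_elp` mirrors the constraint stated in the abstract; which guest residues are actually useful for a given material is a design question the code does not address.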
Topics: Tropoelastin; Peptides; Amino Acid Sequence; Temperature
PubMed: 34251181
DOI: 10.1021/acsbiomaterials.1c00145