-
Scientific Reports Apr 2024
Evolution shapes protein sequences for their functions. Here, we studied the moonlighting functions of the N-linked sequon NXS/T, where X is not P, in human nucleocytosolic proteins. By comparing membrane and secreted proteins in which sequons are well known for N-glycosylation, we discovered that cyto-sequons can participate in nucleic acid binding, particularly in zinc finger proteins. Our global studies further discovered that sequon occurrence is largely proportional to protein length. The contribution of sequons to protein functions, including both N-glycosylation and nucleic acid binding, can be regulated through their density as well as the biased usage between NXS and NXT. In proteins where other PTMs or structural features are rich, such as phosphorylation, transmembrane α-helices, and disulfide bridges, sequon occurrence is scarce. The information acquired here should help understand the relationship between protein sequence and function and assist future protein design and engineering.
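The sequon definition in this abstract (N-X-S/T with X not proline), together with the density and NXS-versus-NXT usage it mentions, can be illustrated with a short sketch. This is not the authors' pipeline: the function names, the toy sequence, and the choice to define density as sequons per residue are assumptions made only for illustration.

```python
import re
from collections import Counter

# A lookahead is used so that overlapping sequons are all reported
# (for example, "NNTS" contains both NNT and NTS).
# The middle character class lists the 19 standard residues other than proline.
SEQUON = re.compile(r"(?=(N[ACDEFGHIKLMNQRSTVWY][ST]))")


def find_sequons(seq):
    """Return (0-based start, motif) for every N-X-S/T sequon with X != P."""
    return [(m.start(), m.group(1)) for m in SEQUON.finditer(seq.upper())]


def sequon_summary(seq):
    """Count sequons, split them into NXS and NXT, and compute a per-residue density."""
    hits = find_sequons(seq)
    kinds = Counter("NXS" if motif[2] == "S" else "NXT" for _, motif in hits)
    # Assumption: density is taken here as sequons per residue.
    density = len(hits) / len(seq) if seq else 0.0
    return {
        "count": len(hits),
        "NXS": kinds["NXS"],
        "NXT": kinds["NXT"],
        "density": density,
    }


if __name__ == "__main__":
    toy = "ANGSNPTNNTSA"  # made-up toy sequence, not real protein data
    print(find_sequons(toy))
    print(sequon_summary(toy))
```

In the toy sequence, the NPT triplet is skipped because its middle residue is proline, while the overlapping NNT and NTS are both counted.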
Topics: Humans; Proteins; Glycosylation; Amino Acid Sequence; Phosphorylation; Nucleic Acids
PubMed: 38565583
DOI: 10.1038/s41598-024-57334-1 -
International Journal of Molecular... Mar 2024
Tandem repeats (TRs) in protein sequences are consecutive, highly similar sequence motifs. Some types of TRs fold into structural units that pack together in ensembles, forming either an (open) elongated domain or a (closed) propeller, where the last unit of the ensemble packs against the first one. Here, we examine TR proteins (TRPs) to see how their sequence, structure, and evolutionary properties favor them for a function as mediators of protein interactions. Our observations suggest that TRPs bind other proteins using large, structured surfaces like globular domains; in particular, open-structured TR ensembles are favored by flexible termini and the possibility to tightly coil against their targets. While, intuitively, open ensembles of TRs seem prone to evolve due to their potential to accommodate insertions and deletions of units, these evolutionary events are unexpectedly rare, suggesting that they are advantageous for the emergence of the ancestral sequence but are fixed early. We hypothesize that their flexibility makes it easier for further proteins to adapt to interact with them, which would explain their large number of protein interactions. We provide insight into the properties of open TR ensembles, which make them scaffolds for alternative protein complexes to organize genes, RNA and proteins.
Topics: Proteins; Tandem Repeat Sequences; Amino Acid Sequence
PubMed: 38474241
DOI: 10.3390/ijms25052994 -
Genes Dec 2023
Protein structure analysis is essential in various bioinformatics domains such as drug discovery, disease diagnosis, and evolutionary studies. Within structural biology, the classification of protein structures is pivotal, employing machine learning algorithms to categorize structures based on data from databases like the Protein Data Bank (PDB). To predict protein functions, embeddings based on protein sequences have been employed. Creating numerical embeddings that preserve vital information while considering protein structure and sequence presents several challenges. The existing literature lacks a comprehensive and effective approach that combines structural and sequence-based features to achieve efficient protein classification. While large language models (LLMs) have exhibited promising outcomes for protein function prediction, their focus primarily lies on protein sequences, disregarding the 3D structures of proteins. The quality of embeddings heavily relies on how well the geometry of the embedding space aligns with the underlying data structure, posing a critical research question. Traditionally, Euclidean space has served as a widely utilized framework for embeddings. In this study, we propose a novel method for designing numerical embeddings in Euclidean space for proteins by leveraging 3D structure information, specifically employing the concept of contact maps. These embeddings are synergistically combined with features extracted from LLMs and traditional feature engineering techniques to enhance the performance of embeddings in supervised protein analysis. Experimental results on benchmark datasets, including PDB Bind and STCRDAB, demonstrate the superior performance of the proposed method for protein function prediction.
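The abstract names contact maps as the source of its structure-based embeddings without giving details. As a hedged illustration of the general idea only, the sketch below builds a binary contact map from 3D coordinates and flattens it into a feature vector. The 8 Å cutoff, the use of one point per residue, the upper-triangle flattening, and the function names are all assumptions, not the method proposed in the paper.

```python
import math


def contact_map(coords, cutoff=8.0):
    """Binary contact map: entry [i][j] is 1 when points i and j lie within `cutoff`.

    `coords` is a list of (x, y, z) tuples, e.g. one point per residue.
    The 8.0 default is a commonly used threshold, assumed here for illustration.
    """
    n = len(coords)
    return [
        [1 if math.dist(coords[i], coords[j]) <= cutoff else 0 for j in range(n)]
        for i in range(n)
    ]


def flatten_upper(cmap):
    """Flatten the strict upper triangle of a symmetric map into a feature vector.

    This is one simple (hypothetical) way to turn a map into a fixed-order vector.
    """
    n = len(cmap)
    return [cmap[i][j] for i in range(n) for j in range(i + 1, n)]


if __name__ == "__main__":
    # Made-up toy coordinates, not taken from any real structure.
    toy = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (20.0, 0.0, 0.0)]
    cmap = contact_map(toy)
    print(cmap)
    print(flatten_upper(cmap))
```

A vector produced this way has a length that depends on the number of residues, so a real pipeline would still need some step to bring proteins of different lengths to a common dimensionality before concatenating with other features.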
Topics: Amino Acid Sequence; Algorithms; Benchmarking; Databases, Protein; Language
PubMed: 38254915
DOI: 10.3390/genes15010025 -
MBio Mar 2024
Endosomal sorting complexes required for transport (ESCRT) play key roles in protein sorting between membrane-bounded compartments of eukaryotic cells. Homologs of many ESCRT components are identifiable in various groups of archaea, especially in Asgardarchaeota, the archaeal phylum that is currently considered to include the closest relatives of eukaryotes, but not in bacteria. We performed a comprehensive search for ESCRT protein homologs in archaea and reconstructed ESCRT evolution using the phylogenetic tree of Vps4 ATPase (ESCRT IV) as a scaffold and using sensitive protein sequence analysis and comparison of structural models to identify previously unknown ESCRT proteins. Several distinct groups of ESCRT systems in archaea outside of Asgard were identified, including proteins structurally similar to ESCRT-I and ESCRT-II, and several other domains involved in protein sorting in eukaryotes, suggesting an early origin of these components. Additionally, distant homologs of CdvA proteins were identified in Thermoproteales which are likely components of the uncharacterized cell division system in these archaea. We propose an evolutionary scenario for the origin of eukaryotic and Asgard ESCRT complexes from ancestral building blocks, namely, the Vps4 ATPase, ESCRT-III components, wH (winged helix-turn-helix fold) and possibly also coiled-coil, and Vps28-like domains. The last archaeal common ancestor likely encompassed a complex ESCRT system that was involved in protein sorting. Subsequent evolution involved either simplification, as in the TACK superphylum, where ESCRT was co-opted for cell division, or complexification as in Asgardarchaeota. In Asgardarchaeota, the connection between ESCRT and the ubiquitin system that was previously considered a eukaryotic signature was already established.
IMPORTANCE
All eukaryotic cells possess complex intracellular membrane organization. Endosomal sorting complexes required for transport (ESCRT) play a central role in membrane remodeling which is essential for cellular functionality in eukaryotes. Recently, it has been shown that Asgard archaea, the archaeal phylum that includes the closest known relatives of eukaryotes, encode homologs of many components of the ESCRT systems. We employed protein sequence and structure comparisons to reconstruct the evolution of ESCRT systems in archaea and identified several previously unknown homologs of ESCRT subunits, some of which can be predicted to participate in cell division. The results of this reconstruction indicate that the last archaeal common ancestor already encoded a complex ESCRT system that was involved in protein sorting. In Asgard archaea, ESCRT systems evolved toward greater complexity, and in particular, the connection between ESCRT and the ubiquitin system that was previously considered a eukaryotic signature was established.
Topics: Endosomal Sorting Complexes Required for Transport; Phylogeny; Amino Acid Sequence; Archaea; Adenosine Triphosphatases; Ubiquitins
PubMed: 38380930
DOI: 10.1128/mbio.00335-24 -
Genome Research Jul 2023
Multiple sequence alignment (MSA) is a critical step in the study of protein sequence and function. Typically, MSA algorithms progressively align pairs of sequences and combine these alignments with the aid of a guide tree. These alignment algorithms use scoring systems based on substitution matrices to measure amino acid similarities. Although successful, standard methods struggle on sets of proteins with low sequence identity: the so-called twilight zone of protein alignment. For these difficult cases, another source of information is needed. Protein language models are a powerful new approach that leverages massive sequence data sets to produce high-dimensional contextual embeddings for each amino acid in a sequence. These embeddings have been shown to reflect physicochemical and higher-order structural and functional attributes of amino acids within proteins. Here, we present a novel approach to MSA, based on clustering and ordering amino acid contextual embeddings. Our method for aligning semantically consistent groups of proteins circumvents the need for many standard components of MSA algorithms, avoiding initial guide tree construction, intermediate pairwise alignments, gap penalties, and substitution matrices. The added information from contextual embeddings leads to higher accuracy alignments for structurally similar proteins with low amino-acid similarity. We anticipate that protein language models will become a fundamental component of the next generation of algorithms for generating MSAs.
Topics: Sequence Alignment; Proteins; Amino Acid Sequence; Algorithms; Amino Acids; Language
PubMed: 37414576
DOI: 10.1101/gr.277675.123 -
ACS Biomaterials Science & Engineering Jul 2023 (Review)
Elastin is a structural protein with outstanding mechanical properties (e.g., elasticity and resilience) and biologically relevant functions (e.g., triggering responses like cell adhesion or chemotaxis). It is formed from its precursor tropoelastin, a 60-72 kDa water-soluble and temperature-responsive protein that coacervates at physiological temperature, undergoing a phenomenon termed lower critical solution temperature (LCST). Inspired by this behavior, many scientists and engineers are developing recombinantly produced elastin-inspired biopolymers, usually termed elastin-like polypeptides (ELPs). These ELPs are generally comprised of repetitive motifs with the sequence VPGXG, which corresponds to repeats of a small part of the tropoelastin sequence, X being any amino acid except proline. ELPs display LCST and mechanical properties similar to tropoelastin, which renders them promising candidates for the development of elastic and stimuli-responsive protein-based materials. Unveiling the structure-property relationships of ELPs can aid in the development of these materials by establishing the connections between the ELP amino acid sequence and the macroscopic properties of the materials. Here we present a review of the structure-property relationships of ELPs and ELP-based materials, with a focus on LCST and mechanical properties and how experimental and computational studies have aided in their understanding.
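The VPGXG repeat rule stated in this abstract (X being any amino acid except proline) can be made concrete with a small sketch that assembles an ELP sequence from a list of guest residues and checks whether a sequence consists only of such repeats. The function names and the example guest residues are hypothetical and chosen only for illustration.

```python
import re

# The 19 standard residues other than proline are allowed at the guest position X.
GUEST_RESIDUES = set("ACDEFGHIKLMNQRSTVWY")
ELP_PATTERN = re.compile(r"(?:VPG[ACDEFGHIKLMNQRSTVWY]G)+")


def build_elp(guests):
    """Concatenate one VPGXG pentapeptide per guest residue in `guests`."""
    for g in guests:
        if g not in GUEST_RESIDUES:
            raise ValueError(f"guest residue {g!r} is not allowed (proline or non-standard)")
    return "".join(f"VPG{g}G" for g in guests)


def is_elp(seq):
    """True when `seq` is made up entirely of one or more VPGXG repeats."""
    return ELP_PATTERN.fullmatch(seq) is not None


if __name__ == "__main__":
    seq = build_elp("VAG")  # arbitrary example guests: valine, alanine, glycine
    print(seq, is_elp(seq))
```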
Topics: Tropoelastin; Peptides; Amino Acid Sequence; Temperature
PubMed: 34251181
DOI: 10.1021/acsbiomaterials.1c00145 -
BMC Bioinformatics Nov 2023
BACKGROUND
Determining a protein's quaternary state, i.e. the number of monomers in a functional unit, is a critical step in protein characterization. Many proteins form multimers for their activity, and over 50% are estimated to naturally form homomultimers. Experimental quaternary state determination can be challenging and require extensive work. To complement these efforts, a number of computational tools have been developed for quaternary state prediction, often utilizing experimentally validated structural information. Recently, dramatic advances have been made in the field of deep learning for predicting protein structure and other characteristics. Protein language models, such as ESM-2, that apply computational natural-language models to proteins successfully capture secondary structure, protein cell localization and other characteristics, from a single sequence. Here we hypothesize that information about the protein quaternary state may be contained within protein sequences as well, allowing us to benefit from these novel approaches in the context of quaternary state prediction.
RESULTS
We generated ESM-2 embeddings for a large dataset of proteins with quaternary state labels from the curated QSbio dataset. We trained a model for quaternary state classification and assessed it on a non-overlapping set of distinct folds (ECOD family level). Our model, named QUEEN (QUaternary state prediction using dEEp learNing), performs worse than approaches that include information from solved crystal structures. However, it successfully learned to distinguish multimers from monomers, and predicts the specific quaternary state with moderate success, better than simple sequence similarity-based annotation transfer. Our results demonstrate that complex, quaternary state related information is included in such embeddings.
CONCLUSIONS
QUEEN is the first to investigate the power of embeddings for the prediction of the quaternary state of proteins. As such, it lays out strengths as well as limitations of a sequence-based protein language model approach, compared to structure-based approaches. Since it does not require any structural information and is fast, we anticipate that it will be of wide use both for in-depth investigation of specific systems, as well as for studies of large sets of protein sequences. A simple colab implementation is available at: https://colab.research.google.com/github/Furman-Lab/QUEEN/blob/main/QUEEN_prediction_notebook.ipynb.
Topics: Proteins; Amino Acid Sequence; Language; Protein Structure, Secondary; Protein Transport
PubMed: 37964216
DOI: 10.1186/s12859-023-05549-w -
Advanced Science (Weinheim,... Aug 2023
Proteins are the building blocks of life, carrying out fundamental functions in biology. In computational biology, an effective protein representation facilitates many important biological quantifications. Most existing protein representation methods are derived from self-supervised language models designed for text analysis. Proteins, however, are more than linear sequences of amino acids. Here, a multimodal deep learning framework for incorporating ≈1 million protein sequence, structure, and functional annotation (MASSA) is proposed. A multitask learning process with five specific pretraining objectives is presented to extract a fine-grained protein-domain feature. Through pretraining, multimodal protein representation achieves state-of-the-art performance in specific downstream tasks such as protein properties (stability and fluorescence), protein-protein interactions (shs27k/shs148k/string/skempi), and protein-ligand interactions (kinase, DUD-E), while achieving competitive results in secondary structure and remote homology tasks. Moreover, a novel optimal-transport-based metric with rich geometry awareness is introduced to quantify the dynamic transferability from the pretrained representation to the related downstream tasks, which provides a panoramic view of the step-by-step learning process. The pairwise distances between these downstream tasks are also calculated, and a strong correlation between the inter-task feature space distributions and adaptability is observed.
Topics: Algorithms; Proteins; Amino Acid Sequence; Amino Acids
PubMed: 37249398
DOI: 10.1002/advs.202301223 -
Sheng Wu Gong Cheng Xue Bao = Chinese... Nov 2023 (Review)
Yeast surface display (YSD) is a technology that fuses the exogenous target protein gene sequence with a specific vector gene sequence, followed by introduction into yeast cells. The target protein is then expressed and localized on the yeast cell surface by the intracellular protein transport mechanism of yeast cells. The most widely used YSD system is the α-agglutinin expression system. Yeast cells possess the eukaryotic post-translational modification mechanism, which helps the target protein fold correctly. This mechanism can be used to display various eukaryotic proteins, including antibodies, receptors, enzymes, and antigenic peptides. YSD has become a powerful protein engineering tool in biotechnology and biomedicine, and has been used to improve a broad range of protein properties including affinity, specificity, enzymatic function, and stability. This review summarizes recent advances in the application of YSD technology from the aspects of library construction and screening, antibody engineering, protein engineering, enzyme engineering and vaccine development.
Topics: Saccharomyces cerevisiae; Protein Engineering; Biotechnology; Antibodies; Amino Acid Sequence
PubMed: 38013172
DOI: 10.13345/j.cjb.230085 -
Bioinformatics (Oxford, England) Mar 2024
MOTIVATION
Reliable prediction of protein thermostability from its sequence is valuable for both academic and industrial research. This prediction problem can be tackled using machine learning and by taking advantage of the recent blossoming of deep learning methods for sequence analysis. These methods can facilitate training on more data and, possibly, enable the development of more versatile thermostability predictors for multiple ranges of temperatures.
RESULTS
We applied the principle of transfer learning to predict protein thermostability using embeddings generated by protein language models (pLMs) from an input protein sequence. We used large pLMs that were pre-trained on hundreds of millions of known sequences. The embeddings from such models allowed us to efficiently train and validate a high-performing prediction method using over one million sequences that we collected from organisms with annotated growth temperatures. Our method, TemStaPro (Temperatures of Stability for Proteins), was used to predict thermostability of CRISPR-Cas Class II effector proteins (C2EPs). Predictions indicated sharp differences among groups of C2EPs in terms of thermostability and were largely in tune with previously published and our newly obtained experimental data.
AVAILABILITY AND IMPLEMENTATION
TemStaPro software and the related data are freely available from https://github.com/ievapudz/TemStaPro and https://doi.org/10.5281/zenodo.7743637.
Topics: Proteins; Machine Learning; Software; Amino Acid Sequence; Language
PubMed: 38507682
DOI: 10.1093/bioinformatics/btae157