Briefings in Bioinformatics Sep 2023
Protein function annotation is one of the most important research topics for revealing the essence of life at the molecular level in the post-genome era. Current research shows that integrating multisource data can effectively improve the performance of protein function prediction models. However, the heavy reliance on complex feature engineering and model integration methods limits the development of existing methods. Besides, models based on deep learning only use the labeled data in a given dataset to extract sequence features, thus ignoring a large amount of existing unlabeled sequence data. Here, we propose an end-to-end protein function annotation model named HNetGO, which innovatively uses a heterogeneous network to integrate protein sequence similarity and protein-protein interaction network information and incorporates a pretrained model to extract the semantic features of the protein sequence. In addition, we design an attention-based graph neural network model, which can effectively extract node-level features from heterogeneous networks and predict protein function by measuring the similarity between protein nodes and gene ontology term nodes. Comparative experiments on the human dataset show that HNetGO achieves state-of-the-art performance on the cellular component and molecular function branches.
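The final step described in the abstract, predicting function from the similarity between protein nodes and gene ontology term nodes, can be illustrated with a minimal sketch. It assumes that node embeddings are already available and that similarity is a sigmoid-squashed dot product; the abstract does not state which similarity measure HNetGO uses, and the function names and embedding values below are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    """Map a raw similarity to a score in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def score_go_terms(protein_vec, go_term_vecs):
    """Score every GO term for one protein by embedding similarity.

    Assumption: dot product followed by a sigmoid. This is an
    illustrative choice, not the measure reported for HNetGO.
    """
    return {
        go_id: sigmoid(dot(protein_vec, vec))
        for go_id, vec in go_term_vecs.items()
    }


# Toy embeddings (hypothetical values, not learned by any model).
protein = [0.5, -1.0, 2.0]
go_terms = {
    "GO:0005634": [1.0, 0.0, 1.0],   # nucleus
    "GO:0005737": [-1.0, 1.0, 0.0],  # cytoplasm
}

scores = score_go_terms(protein, go_terms)
# Rank terms from most to least similar to the protein node.
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

In a setup like this, each GO term is scored independently, so a protein can receive several annotations at once, which matches the multi-label nature of function prediction.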
Topics: Humans; Amino Acid Sequence; Gene Ontology; Molecular Sequence Annotation; Neural Networks, Computer; Protein Interaction Maps
PubMed: 37861172
DOI: 10.1093/bib/bbab556

Current Opinion in Structural Biology Oct 2023
Review
Ancestral sequence reconstruction (ASR) provides insight into the changes within a protein sequence across evolution. More specifically, it can illustrate how specific amino acid changes give rise to different phenotypes within a protein family. Over the last few decades it has established itself as a powerful technique for revealing molecular common denominators that govern enzyme function. Here, we describe the strength of ASR in unveiling catalytic mechanisms and emerging phenotypes for a range of different proteins, also highlighting biotechnological applications the methodology can provide.
Topics: Phylogeny; Evolution, Molecular; Proteins; Amino Acid Sequence; Phenotype
PubMed: 37544113
DOI: 10.1016/j.sbi.2023.102669

Journal of Structural Biology Sep 2023
Biomaterials for tissue regeneration must mimic the biophysical properties of the native physiological environment. A protein engineering approach allows the generation of protein hydrogels with specific and customised biophysical properties designed to suit a particular physiological environment. Herein, repetitive engineered proteins were successfully designed to form covalent molecular networks with defined physical characteristics able to sustain cell phenotype. Our hydrogel design was made possible by the incorporation of the SpyTag (ST) peptide and multiple repetitive units of the SpyCatcher (SC) protein that spontaneously formed covalent crosslinks upon mixing. Changing the ratios of the protein building blocks (ST:SC) allowed the viscoelastic properties and gelation speeds of the hydrogels to be altered and controlled. The physical properties of the hydrogels could readily be altered further to suit different environments by tuning the key features in the repetitive protein sequence. The resulting hydrogels were designed with a view to allowing cell attachment and encapsulation of liver-derived cells. Biocompatibility of the hydrogels was assayed using a HepG2 cell line constitutively expressing GFP. The cells remained viable and continued to express GFP whilst attached or encapsulated within the hydrogel. Our results demonstrate how this genetically encoded approach using repetitive proteins could be applied to bridge engineering biology with nanotechnology, creating a level of biomaterial customisation previously inaccessible.
Topics: Protein Array Analysis; Hydrogels; Proteins; Biocompatible Materials; Amino Acid Sequence
PubMed: 37245604
DOI: 10.1016/j.jsb.2023.107981

Biomolecules Aug 2023
With the development of accurate protein structure prediction algorithms, artificial intelligence (AI) has emerged as a powerful tool in the field of structural biology. AI-based algorithms have been used to analyze large amounts of protein sequence data including the human proteome, complementing experimental structure data found in resources such as the Protein Data Bank. The EBI AlphaFold Protein Structure Database (for example) contains over 230 million structures. In this study, these data have been analyzed to find all human proteins containing (or predicted to contain) the cytosolic glutathione transferase (cGST) fold. A total of 39 proteins were found, including the alpha-, mu-, pi-, sigma-, zeta- and omega-class GSTs, intracellular chloride channels, metaxins, multisynthetase complex components, elongation factor 1 complex components and others. Three broad themes emerge: cGST domains as enzymes, as chloride ion channels and as protein-protein interaction mediators. As the majority of cGSTs are dimers, the AI-based structure prediction algorithm AlphaFold-multimer was used to predict structures of all pairwise combinations of these cGST domains. Potential homo- and heterodimers are described. Experimental biochemical and structure data is used to highlight the strengths and limitations of AI-predicted structures.
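To give a sense of the scale of the pairwise AlphaFold-multimer screen mentioned above, the following sketch enumerates all unordered pairs, including self-pairs, for 39 items. It assumes one cGST domain per protein, which the abstract does not state, and the domain names are placeholders rather than real identifiers.

```python
from itertools import combinations_with_replacement

# 39 proteins are reported in the abstract; treating each as a single
# cGST domain is an assumption made only for this illustration.
n_domains = 39
domains = [f"cGST_{i:02d}" for i in range(1, n_domains + 1)]  # placeholder names

# Unordered pairs with repetition: (A, A) is a candidate homodimer,
# (A, B) with A != B is a candidate heterodimer.
pairs = list(combinations_with_replacement(domains, 2))
homodimers = [p for p in pairs if p[0] == p[1]]
heterodimers = [p for p in pairs if p[0] != p[1]]

print(len(pairs), len(homodimers), len(heterodimers))  # 780 39 741
```

Under that assumption the screen covers n(n+1)/2 = 780 predictions, of which 39 are homodimers and 741 are heterodimers.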
Topics: Humans; Glutathione Transferase; Genome, Human; Artificial Intelligence; Algorithms; Amino Acid Sequence
PubMed: 37627305
DOI: 10.3390/biom13081240

Bioinformatics (Oxford, England) Mar 2024
MOTIVATION
Protein sequence database search and multiple sequence alignment generation is a fundamental task in many bioinformatics analyses. As the data volume of sequences continues to grow rapidly, there is an increasing need for efficient and scalable multiple sequence query algorithms for super-large databases without expensive time and computational costs.
RESULTS
We introduce Chorus, a novel protein sequence query system that leverages a parallel model and a heterogeneous computation architecture to enable users to query thousands of protein sequences concurrently against large protein databases on a desktop workstation. Chorus achieves over 100× speedup over BLASTP without sacrificing sensitivity. We demonstrate the utility of Chorus through a case study analyzing a ∼1.5-TB large-scale metagenomic dataset for novel CRISPR-Cas protein discovery within 30 min.
AVAILABILITY AND IMPLEMENTATION
Chorus is open-source and its code repository is available at https://github.com/Bio-Acc/Chorus.
Topics: Software; Algorithms; Amino Acid Sequence; Proteins; Databases, Protein
PubMed: 38547405
DOI: 10.1093/bioinformatics/btae151

Bioinformatics (Oxford, England) Aug 2023
MOTIVATION
Protein thermostability is of great interest, both in theory and in practice.
RESULTS
This study compared orthologous proteins with different cellular thermostability. A large number of physicochemical properties of the proteins were calculated and used to develop a series of machine learning models for predicting cellular thermostability differences between orthologous proteins. Most of the important features in these models are also highly correlated with relative cellular thermostability. A comparison of the present study with previous comparisons of orthologous proteins from thermophilic and mesophilic organisms found that the most highly correlated features are consistent across these studies, suggesting they may be important to protein thermostability.
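As a rough illustration of the kind of input such models take, the sketch below computes a few simple sequence-derived properties for two orthologs and returns their differences. The choice of features (mean Kyte-Doolittle hydropathy, charged-residue fraction, length), the function names, and the example sequences are assumptions for illustration; the abstract does not list the properties the study actually used.

```python
# Kyte-Doolittle hydropathy scale for the 20 standard amino acids.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(seq: str) -> float:
    """Mean hydropathy; raises KeyError on non-standard residues."""
    return sum(KD[aa] for aa in seq) / len(seq)


def charged_fraction(seq: str) -> float:
    """Fraction of residues that are D, E, K or R."""
    return sum(aa in "DEKR" for aa in seq) / len(seq)


def features(seq: str) -> dict:
    """A small, illustrative feature vector for one sequence."""
    return {
        "gravy": gravy(seq),
        "charged_fraction": charged_fraction(seq),
        "length": len(seq),
    }


def feature_differences(seq_a: str, seq_b: str) -> dict:
    """Per-feature difference (A minus B) for a pair of orthologs."""
    fa, fb = features(seq_a), features(seq_b)
    return {name: fa[name] - fb[name] for name in fa}


# Hypothetical ortholog fragments, not taken from the study's dataset.
diffs = feature_differences("MKVLIEEAR", "MKTLSDGAQ")
print(diffs)
```

Difference vectors of this form, one per ortholog pair, could then serve as the input rows of a regression or classification model, with the measured thermostability difference as the target.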
AVAILABILITY AND IMPLEMENTATION
Data freely available for download at https://github.com/fangj3/cellular-protein-thermostability-dataset.
Topics: Amino Acid Sequence; Proteins
PubMed: 37572303
DOI: 10.1093/bioinformatics/btad504

Microbial Cell Factories Sep 2023
Review
In the post-genomic era, the demand for faster and more efficient protein production has increased, both in public laboratories and industry. In addition, with the expansion of protein sequences in databases, the range of possible enzymes of interest for a given application is also increasing. Faced with peer competition, budgetary, and time constraints, companies and laboratories must find ways to develop a robust manufacturing process for recombinant protein production. In this review, we explore high-throughput technologies for recombinant protein expression and present a holistic high-throughput process development strategy that spans from genes to proteins. We discuss the challenges that come with this task, the limitations of previous studies, and future research directions.
Topics: Cloning, Molecular; Amino Acid Sequence; Genomics; Laboratories; Recombinant Proteins
PubMed: 37715258
DOI: 10.1186/s12934-023-02184-1

Current Opinion in Structural Biology Feb 2024
Review
Relating the native fold of a protein to its amino acid sequence remains a fundamental problem in biology. While computer algorithms have recently demonstrated their prowess in predicting the structure into which a particular amino acid sequence will fold, an understanding of how and why a specific protein fold is achieved remains elusive. A major challenge is to define the role of conformational heterogeneity during protein folding. Recent experimental studies, utilizing time-resolved FRET, hydrogen exchange coupled to mass spectrometry, and single-molecule force spectroscopy, often in conjunction with simulation, have begun to reveal how conformational heterogeneity evolves during folding, and whether an intermediate ensemble of defined free energy consists of different sub-populations of molecules that may differ significantly in conformation, energy and entropy.
Topics: Protein Folding; Proteins; Amino Acid Sequence; Entropy; Computer Simulation; Protein Conformation
PubMed: 38041993
DOI: 10.1016/j.sbi.2023.102738

Cell Systems Nov 2023
Attention-based models trained on protein sequences have demonstrated incredible success at classification and generation tasks relevant for artificial-intelligence-driven protein design. However, we lack a sufficient understanding of how very large-scale models and data play a role in effective protein model development. We introduce a suite of protein language models, named ProGen2, that are scaled up to 6.4B parameters and trained on different sequence datasets drawn from over a billion proteins from genomic, metagenomic, and immune repertoire databases. ProGen2 models show state-of-the-art performance in capturing the distribution of observed evolutionary sequences, generating novel viable sequences, and predicting protein fitness without additional fine-tuning. As large model sizes and raw numbers of protein sequences continue to become more widely accessible, our results suggest that a growing emphasis needs to be placed on the data distribution provided to a protein sequence model. Our models and code are open sourced for widespread adoption in protein engineering. A record of this paper's Transparent Peer Review process is included in the supplemental information.
Topics: Proteins; Artificial Intelligence; Amino Acid Sequence; Language; Databases, Factual
PubMed: 37909046
DOI: 10.1016/j.cels.2023.10.002

Proceedings of the National Academy of... Aug 2023
Metabolite levels shape cellular physiology and disease susceptibility, yet the general principles governing metabolome evolution are largely unknown. Here, we introduce a measure of conservation of individual metabolite levels among related species. By analyzing multispecies tissue metabolome datasets in phylogenetically diverse mammals and fruit flies, we show that conservation varies extensively across metabolites. Three major functional properties (metabolite abundance, essentiality, and association with human diseases) predict conservation, highlighting a striking parallel between the evolutionary forces driving metabolome and protein sequence conservation. Metabolic network simulations recapitulated these general patterns and revealed that abundant metabolites are highly conserved due to their strong coupling to key metabolic fluxes in the network. Finally, we show that biomarkers of metabolic diseases can be distinguished from other metabolites simply based on evolutionary conservation, without requiring any prior clinical knowledge. Overall, this study uncovers simple rules that govern metabolic evolution in animals and implies that most tissue metabolome differences between species are permitted, rather than favored, by natural selection. More broadly, our work paves the way toward using evolutionary information to identify biomarkers, as well as to detect pathogenic metabolome alterations in individual patients.
Topics: Animals; Humans; Metabolome; Amino Acid Sequence; Drosophila; Knowledge; Mammals
PubMed: 37603743
DOI: 10.1073/pnas.2302147120