-
PLoS Computational Biology Nov 2023
Generative models of protein sequence families are an important tool in the repertoire of protein scientists and engineers alike. However, state-of-the-art generative approaches face inference, accuracy, and overfitting-related obstacles when modeling moderately sized to large proteins and/or protein families with low sequence coverage. Here, we present a simple-to-learn, tunable, and accurate generative model, GENERALIST: GENERAtive nonLInear tenSor-factorizaTion for protein sequences. GENERALIST accurately captures several high-order summary statistics of amino acid covariation. GENERALIST also predicts conservative locally optimal sequences that are likely to fold into stable 3D structures. Importantly, unlike current methods, the density of sequences in GENERALIST-modeled sequence ensembles closely resembles that of the corresponding natural ensembles. Finally, GENERALIST embeds protein sequences in an informative latent space. GENERALIST will be an important tool to study protein sequence variability.
Topics: Proteins; Amino Acid Sequence; Amino Acids
PubMed: 38011273
DOI: 10.1371/journal.pcbi.1011655
-
BMC Bioinformatics Oct 2023
BACKGROUND
The relationship between the sequence of a protein, its structure, and the resulting connection between its structure and function, is a foundational principle in biological science. Only recently has the computational prediction of protein structure based only on protein sequence been addressed effectively by AlphaFold, a neural network approach that can predict the majority of protein structures with X-ray crystallographic accuracy. A question that is now of acute relevance is the "inverse protein folding problem": predicting the sequence of a protein that folds into a specified structure. This will be of immense value in protein engineering and biotechnology, and will allow the design and expression of recombinant proteins that can, for instance, fold into specified structures as a scaffold for the attachment of recombinant antigens, or enzymes with modified or novel catalytic activities. Here we describe the development of SeqPredNN, a feed-forward neural network trained with X-ray crystallographic structures from the RCSB Protein Data Bank to predict the identity of amino acids in a protein structure using only the relative positions, orientations, and backbone dihedral angles of nearby residues.
RESULTS
We predict the sequence of a protein expected to fold into a specified structure and assess the accuracy of the prediction using both AlphaFold and RoseTTAFold to computationally generate the fold of the derived sequence. We show that the sequences predicted by SeqPredNN fold into a structure with a median TM-score of 0.638 when compared to the crystal structure according to AlphaFold predictions, yet these sequences are unique and only 28.4% identical to the sequence of the crystallized protein.
CONCLUSIONS
We propose that SeqPredNN will be a valuable tool to generate proteins of defined structure for the design of novel biomaterials, pharmaceuticals, catalysts, and reporter systems. The low sequence identity of its predictions compared to the native sequence could prove useful for developing proteins with modified physical properties, such as water solubility and thermal stability. The speed and ease of use of SeqPredNN offers a significant advantage over physics-based protein design methods.
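The 28.4% sequence identity reported in the results above is a standard pairwise measure. As a minimal sketch of how such a figure can be computed, the snippet below counts matching residues over the aligned columns where neither sequence has a gap. The function name, the choice of denominator (gap-free columns), and the toy sequences are assumptions made for illustration; the abstract does not say which identity definition or alignment procedure SeqPredNN's evaluation used.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over aligned columns where neither sequence has a gap.

    Assumes the two sequences are already aligned (equal length, gaps as '-').
    Other conventions divide by the full alignment length or by the shorter
    sequence length; this denominator is an illustrative choice.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only columns where both sequences carry a residue.
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != gap and b != gap]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)


# Toy, made-up aligned sequences (not taken from the paper).
native = "MKT-AYIAKQR"
designed = "MRTLAYLGK-R"
print(round(percent_identity(native, designed), 1))
```

A low value from such a calculation, combined with a high TM-score for the predicted fold, is what the abstract describes: sequences that differ from the native one yet are predicted to adopt a similar structure.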
Topics: Amino Acid Sequence; Neural Networks, Computer; Proteins; Amino Acids; Protein Folding
PubMed: 37789284
DOI: 10.1186/s12859-023-05498-4
-
Journal of Nanobiotechnology Nov 2023
Review
Elastin-like polypeptides (ELPs) are thermally responsive biopolymers derived from natural elastin. These peptides exhibit lower critical solution temperature (LCST) phase behavior and can be used to prepare stimuli-responsive biomaterials. Through genetic engineering, biomaterials prepared from ELPs can have unique and customizable properties. By adjusting the amino acid sequence and length of ELPs, nanostructures such as micelles and nanofibers can be formed. Accordingly, ELPs have been used to improve drug stability and prolong drug-release time. Furthermore, ELPs are widely used in tissue repair because of their biocompatibility and biodegradability. This review summarizes the basic properties and composition of ELPs and the methods for modulating their phase transition properties, discusses their applications in drug delivery systems and tissue repair, and clarifies the current challenges and future directions of ELP applications.
Topics: Elastin; Peptides; Drug Delivery Systems; Amino Acid Sequence; Biocompatible Materials
PubMed: 37951928
DOI: 10.1186/s12951-023-02184-8
-
Current Opinion in Structural Biology Aug 2023
Review
Recently, the prediction of structural/functional motifs in protein sequences has taken advantage of powerful machine learning-based approaches. Protein encoding now adopts protein language models, which surpass standard procedures. Different combinations of machine learning methods and encoding schemes are available for predicting different structural/functional motifs. Particularly interesting is the adoption of protein language models to encode proteins in addition to evolutionary information and physicochemical parameters. A thorough analysis of recent predictors developed for annotating transmembrane regions, sorting signals, and lipidation and phosphorylation sites makes it possible to survey the state of the art, focusing on the relevance of protein language models for the different tasks. This analysis highlights that more experimental data are necessary to exploit the powerful machine learning methods now available.
Topics: Deep Learning; Amino Acid Sequence; Proteins; Machine Learning
PubMed: 37385080
DOI: 10.1016/j.sbi.2023.102641
-
Virology Journal Dec 2023
BACKGROUND
The family Genomoviridae was recently established, and only a few of its mycoviruses have been described and characterized; almost all of them (Sclerotinia sclerotiorum hypovirulence-associated DNA virus 1, Fusarium graminearum gemyptripvirus 1 and Botrytis cinerea gemydayirivirus 1) induce hypovirulence in their hosts. Botrytis cinerea ssDNA virus 1 (BcssDV1), a tetrasegmented single-stranded DNA virus infecting the fungus Botrytis cinerea, has been molecularly characterized in this work.
METHODS
BcssDV1 was detected in Spanish and Italian B. cinerea field isolates obtained from grapevine. The genomes of BcssDV1 variants were molecularly characterized via NGS and Sanger sequencing. Nucleotide and amino acid sequences were used for diversity and phylogenetic analyses. Protein tertiary structures and putative associated functions were predicted using AlphaFold2 and DALI.
RESULTS
BcssDV1 is a tetrasegmented single-stranded DNA virus. The mycovirus is composed of four genomic segments of approximately 1.7 kb each: DNA-A, DNA-B, DNA-C, and DNA-D, which encode, respectively, the rolling-circle replication initiation protein (Rep), the capsid protein (CP), and two hypothetical proteins. BcssDV1 was present in several Italian and Spanish regions with high incidence and low variability among the different viral variants. DNA-A and DNA-D were found to be the most conserved genomic segments among variants, while DNA-B and DNA-C were the most variable. The tertiary structures of the proteins encoded by each segment suggested specific functions associated with each of them.
CONCLUSIONS
This study presented the first complete sequencing and characterization of a tetrasegmented ssDNA mycovirus, its incidence in Spain and Italy, its presence in other countries and its high conservation among regions.
Topics: RNA Viruses; DNA, Single-Stranded; Phylogeny; Amino Acid Sequence; Botrytis; Genome, Viral; Fungal Viruses
PubMed: 38114992
DOI: 10.1186/s12985-023-02256-z
-
ACS Synthetic Biology Oct 2023
Epitopes are specific regions on an antigen's surface that the immune system recognizes. Epitopes are usually protein regions on foreign immune-stimulating entities such as viruses and bacteria, and in some cases, endogenous proteins may act as antigens. Identifying epitopes is crucial for accelerating the development of vaccines and immunotherapies. However, mapping epitopes in pathogen proteomes is challenging using conventional methods. Screening artificial neoepitope libraries against antibodies can overcome this issue. Here, we applied conventional sequence analysis and methods inspired by natural language processing to reveal specific sequence patterns in the linear epitopes deposited in the Immune Epitope Database (www.iedb.org) that can serve as building blocks for the design of universal epitope libraries. Our results reveal that amino acid frequency in annotated linear epitopes differs from that in the human proteome. Aromatic residues are overrepresented, while cysteines are practically absent from epitopes. Byte pair encoding tokenization shows high frequencies of tryptophan in tokens of 5, 6, and 7 amino acids, corroborating the findings of the conventional sequence analysis. These results can be applied to reduce the diversity of linear epitope libraries by orders of magnitude.
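The frequency comparison described above (epitope residues versus a background proteome) can be illustrated with a small sketch. The code below computes per-residue frequencies with a pseudocount and reports a log2 enrichment ratio. The function names, the pseudocount, and the toy peptide lists are hypothetical and chosen only for illustration; they are not the authors' pipeline or data, and the byte pair encoding step is not reproduced here.

```python
from collections import Counter
from math import log2

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def residue_frequencies(sequences, pseudocount=1.0):
    """Relative frequency of each standard residue, smoothed by a pseudocount."""
    counts = Counter()
    for seq in sequences:
        # Ignore any character that is not one of the 20 standard residues.
        counts.update(ch for ch in seq.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}


def log2_enrichment(foreground, background):
    """log2(foreground frequency / background frequency) for each residue."""
    fg = residue_frequencies(foreground)
    bg = residue_frequencies(background)
    return {aa: log2(fg[aa] / bg[aa]) for aa in AMINO_ACIDS}


# Made-up toy sequences: an aromatic-rich "epitope" set and a generic background.
epitopes = ["WYFWNE", "FWKYTW", "YWDPFW"]
background = ["MKTLLCAGS", "ACDEGHIKL", "MNPQRSTVC"]

scores = log2_enrichment(epitopes, background)
# Print the three most enriched residues in the toy foreground set.
for aa in sorted(scores, key=scores.get, reverse=True)[:3]:
    print(aa, round(scores[aa], 2))
```

Positive scores mark residues that are overrepresented in the foreground set and negative scores mark depleted ones; the pseudocount keeps the ratio defined for residues that never occur in one of the sets.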
Topics: Humans; Epitopes; Amino Acid Sequence; Epitope Mapping; Viruses; Proteome; Amino Acids
PubMed: 37703075
DOI: 10.1021/acssynbio.3c00201
-
Bioinformatics (Oxford, England) Dec 2023
MOTIVATION
Protein sequences can be broadly categorized into two classes: those that adopt stable secondary structure and fold into a domain (i.e. globular proteins), and those that do not. The sequences belonging to this latter class are conformationally heterogeneous and are described as being intrinsically disordered. Decades of investigation into the structure and function of globular proteins have resulted in a suite of computational tools that enable their sub-classification by domain type, an approach that has revolutionized how we understand and predict protein functionality. Conversely, it is unknown whether sequences of disordered protein regions are subject to broadly generalizable organizational principles that would enable their sub-classification.
RESULTS
Here, we report the development of a statistical approach that quantifies linear variance in amino acid composition across a sequence. With multiple examples, we provide evidence that intrinsically disordered regions are organized into statistically non-random modules of unique compositional bias. Modularity is observed for both low and high-complexity sequences and, in some cases, we find that modules are organized in repetitive patterns. These data demonstrate that disordered sequences are non-randomly organized into modular architectures and motivate future experiments to comprehensively classify module types and to determine the degree to which modules constitute functionally separable units analogous to the domains of globular proteins.
AVAILABILITY AND IMPLEMENTATION
The source code, documentation, and data to reproduce all figures are freely available at https://github.com/MWPlabUTSW/Chi-Score-Analysis.git. The analysis is also available as a Google Colab Notebook (https://colab.research.google.com/github/MWPlabUTSW/Chi-Score-Analysis/blob/main/ChiScore_Analysis.ipynb).
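To give a rough sense of what comparing amino acid composition between parts of a sequence can look like, the sketch below computes a Pearson chi-square statistic on the residue counts of two segments. This is a generic contingency-table calculation offered as an assumption-laden illustration; it is not the authors' Chi-Score implementation, which is available in the repository linked above. The function name and test segments are hypothetical.

```python
from collections import Counter


def composition_chi2(seg_a: str, seg_b: str) -> float:
    """Pearson chi-square statistic for a 2 x k table of residue counts.

    Rows are the two segments and columns are the residue types seen in either.
    A value of 0 means identical composition; larger values mean the two
    segments differ more in composition.
    """
    n_a, n_b = len(seg_a), len(seg_b)
    if n_a == 0 or n_b == 0:
        raise ValueError("both segments must be non-empty")
    counts_a, counts_b = Counter(seg_a), Counter(seg_b)
    total = n_a + n_b
    chi2 = 0.0
    for residue in set(counts_a) | set(counts_b):
        column_total = counts_a[residue] + counts_b[residue]
        for observed, row_total in ((counts_a[residue], n_a), (counts_b[residue], n_b)):
            # Expected count if both segments shared one composition.
            expected = row_total * column_total / total
            chi2 += (observed - expected) ** 2 / expected
    return chi2


# Made-up segments: identical composition versus fully distinct composition.
print(composition_chi2("SGSG", "GSGS"))
print(composition_chi2("SSSS", "GGGG"))
```

Scanning such a statistic across candidate boundary positions along a sequence is one plausible way to look for compositionally distinct modules, but the exact procedure and significance testing should be taken from the authors' code and documentation.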
Topics: Intrinsically Disordered Proteins; Protein Domains; Amino Acid Sequence; Amino Acids; Software
PubMed: 38039154
DOI: 10.1093/bioinformatics/btad732
-
Science Advances Jun 2024
Unlike aquaporins or potassium channels, ammonium transporters (Amts) uniquely discriminate ammonium from potassium and water. This feature has certainly contributed to their repurposing as ammonium receptors during evolution. Here, we describe the ammonium receptor Sd-Amt1, where an Amt module connects to a cytoplasmic diguanylate cyclase transducer module via an HAMP domain. Structures of the protein with and without bound ammonium were determined to 1.7- and 1.9-Ångstrom resolution, depicting the ON and OFF states of the receptor and confirming the presence of a binding site for two ammonium cations that is pivotal for signal perception and receptor activation. The transducer domain was disordered in the crystals, and an AlphaFold2 prediction suggests that the helices linking both domains are flexible. While the sensor domain retains the trimeric fold formed by all Amt family members, the HAMP domains interact as pairs and serve to dimerize the transducer domain upon activation.
Topics: Ammonium Compounds; Cation Transport Proteins; Signal Transduction; Models, Molecular; Binding Sites; Crystallography, X-Ray; Protein Domains; Protein Binding; Amino Acid Sequence
PubMed: 38838143
DOI: 10.1126/sciadv.adm9441
-
BioRxiv : the Preprint Server For... Dec 2023
Nature has likely sampled only a fraction of all protein sequences and structures allowed by the laws of biophysics. However, the combinatorial scale of amino-acid sequence-space has traditionally precluded substantive study of the full protein sequence-structure map. In particular, it remains unknown how much of the vast uncharted landscape of far-from-natural sequences consists of alternate ways to encode the familiar ensemble of natural folds; proteins in this category also represent an opportunity to diversify candidates for downstream applications. Here, we characterize sequence-structure mapping in far-from-natural regions of sequence-space guided by the capacity of protein language models (pLMs) to explore sequences outside their natural training data through generation. We demonstrate that pretrained generative pLMs sample a limited structural snapshot of the natural protein universe, including >350 common (sub)domain elements. Incorporating pLM, structure prediction, and structure-based search techniques, we surpass this limitation by developing a novel "foldtuning" strategy that pushes a pretrained pLM into a generative regime that maintains structural similarity to a target protein fold (e.g. TIM barrel, thioredoxin, etc) while maximizing dissimilarity to natural amino-acid sequences. We apply "foldtuning" to build a library of pLMs for >700 naturally-abundant folds in the SCOP database, accessing swaths of proteins that take familiar structures yet lie far from known sequences, spanning targets that include enzymes, immune ligands, and signaling proteins. By revealing protein sequence-structure information at scale outside of the context of evolution, we anticipate that this work will enable future systematic searches for wholly novel folds and facilitate more immediate protein design goals in catalysis and medicine.
PubMed: 38187750
DOI: 10.1101/2023.12.22.573145
-
Nucleic Acids Research Jan 2024
Tumorigenic functions arising from the formation of fusion genes have been targeted for cancer therapeutics (e.g. kinase inhibitors). However, many fusion proteins involved in various cellular processes have not been studied for targeted therapeutics, because the lack of complete fusion protein sequences and their whole 3D structures has made it challenging to develop new therapeutic strategies. To fill these critical gaps, we developed a computational pipeline and a resource of human fusion proteins named FusionPDB, available at https://compbio.uth.edu/FusionPDB. FusionPDB is organized into four levels: 43K fusion protein sequences (14.7K in-frame fusion genes, Level 1), over 2300 + 1267 fusion protein 3D structures (from 2300 recurrent and 266 manually curated in-frame fusion genes, Level 2), pLDDT score analysis for the 1267 fusion proteins from 266 manually curated fusion genes (Level 3), and virtual screening outcomes for 68 selected fusion proteins from 266 manually curated fusion genes (Level 4). FusionPDB is the only resource providing whole 3D structures of fusion proteins and comprehensive knowledge of human fusion proteins. It will be regularly updated until it covers all human fusion proteins.
Topics: Humans; Amino Acid Sequence; Knowledge Bases; Neoplasms; Databases, Protein; Protein Conformation
PubMed: 37870473
DOI: 10.1093/nar/gkad920