-
ACS Synthetic Biology Sep 2023Engineered proteins have emerged as novel diagnostics, therapeutics, and catalysts. Often, poor protein developability─quantified by expression, solubility, and...
Engineered proteins have emerged as novel diagnostics, therapeutics, and catalysts. Often, poor protein developability─quantified by expression, solubility, and stability─hinders utility. The ability to predict protein developability from amino acid sequence would reduce the experimental burden when selecting candidates. Recent advances in screening technologies enabled a high-throughput (HT) developability dataset for 10 of 10 possible variants of protein ligand scaffold Gp2. In this work, we evaluate the ability of neural networks to learn a developability representation from a HT dataset and transfer this knowledge to predict recombinant expression beyond observed sequences. The model convolves learned amino acid properties to predict expression levels 44% closer to the experimental variance compared to a non-embedded control. Analysis of learned amino acid embeddings highlights the uniqueness of cysteine, the importance of hydrophobicity and charge, and the unimportance of aromaticity, when aiming to improve the developability of small proteins. We identify clusters of similar sequences with increased recombinant expression through nonlinear dimensionality reduction and we explore the inferred expression landscape via nested sampling. The analysis enables the first direct visualization of the fitness landscape and highlights the existence of evolutionary bottlenecks in sequence space giving rise to competing subpopulations of sequences with different developability. The work advances applied protein engineering efforts by predicting and interpreting protein scaffold expression from a limited dataset. Furthermore, our statistical mechanical treatment of the problem advances foundational efforts to characterize the structure of the protein fitness landscape and the amino acid characteristics that influence protein developability.
Topics: Amino Acids; Amino Acid Sequence; Cysteine; Neural Networks, Computer; Protein Engineering
PubMed: 37642646
DOI: 10.1021/acssynbio.3c00196 -
Briefings in Bioinformatics Sep 2023The advanced language models have enabled us to recognize protein-protein interactions (PPIs) and interaction sites using protein sequences or structures. Here, we...
The advanced language models have enabled us to recognize protein-protein interactions (PPIs) and interaction sites using protein sequences or structures. Here, we trained the MindSpore ProteinBERT (MP-BERT) model, a Bidirectional Encoder Representation from Transformers, using protein pairs as inputs, making it suitable for identifying PPIs and their respective interaction sites. The pretrained model (MP-BERT) was fine-tuned as MPB-PPI (MP-BERT on PPI) and demonstrated its superiority over the state-of-the-art models on diverse benchmark datasets for predicting PPIs. Moreover, the model's capability to recognize PPIs among various organisms was evaluated on multiple organisms. An amalgamated organism model was designed, exhibiting a high level of generalization across the majority of organisms and attaining an accuracy of 92.65%. The model was also customized to predict interaction site propensity by fine-tuning it with PPI site data as MPB-PPISP. Our method facilitates the prediction of both PPIs and their interaction sites, thereby illustrating the potency of transfer learning in dealing with the protein pair task.
Topics: Proteins; Amino Acid Sequence; Machine Learning
PubMed: 37870286
DOI: 10.1093/bib/bbad376 -
Bioinformatics (Oxford, England) Nov 2023High-quality computational structural models are now precomputed and available for nearly every protein in UniProt. However, the best way to leverage these models to...
MOTIVATION
High-quality computational structural models are now precomputed and available for nearly every protein in UniProt. However, the best way to leverage these models to predict which pairs of proteins interact in a high-throughput manner is not immediately clear. The recent Foldseek method of van Kempen et al. encodes the structural information of distances and angles along the protein backbone into a linear string of the same length as the protein string, using tokens from a 21-letter discretized structural alphabet (3Di).
RESULTS
We show that using both the amino acid sequence and the 3Di sequence generated by Foldseek as inputs to our recent deep-learning method, Topsy-Turvy, substantially improves the performance of predicting protein-protein interactions cross-species. Thus TT3D (Topsy-Turvy 3D) presents a way to reuse all the computational effort going into producing high-quality structural models from sequence, while being sufficiently lightweight so that high-quality binary protein-protein interaction predictions across all protein pairs can be made genome-wide.
AVAILABILITY AND IMPLEMENTATION
TT3D is available at https://github.com/samsledje/D-SCRIPT. An archived version of the code at time of submission can be found at https://zenodo.org/records/10037674.
Topics: Amino Acid Sequence; Software; Proteins
PubMed: 37897686
DOI: 10.1093/bioinformatics/btad663 -
BioEssays : News and Reviews in... Sep 2023Fold-switching proteins, which remodel their secondary and tertiary structures in response to cellular stimuli, suggest a new view of protein fold space. For decades,...
Fold-switching proteins, which remodel their secondary and tertiary structures in response to cellular stimuli, suggest a new view of protein fold space. For decades, experimental evidence has indicated that protein fold space is discrete: dissimilar folds are encoded by dissimilar amino acid sequences. Challenging this assumption, fold-switching proteins interconnect discrete groups of dissimilar protein folds, making protein fold space fluid. Three recent observations support the concept of fluid fold space: (1) some amino acid sequences interconvert between folds with distinct secondary structures, (2) some naturally occurring sequences have switched folds by stepwise mutation, and (3) fold switching is evolutionarily selected and likely confers advantage. These observations indicate that minor amino acid sequence modifications can transform protein structure and function. Consequently, proteomic structural and functional diversity may be expanded by alternative splicing, small nucleotide polymorphisms, post-translational modifications, and modified translation rates.
Topics: Protein Folding; Proteomics; Models, Molecular; Proteins; Amino Acid Sequence
PubMed: 37431685
DOI: 10.1002/bies.202300057 -
PloS One 2023The structure and sequence of proteins strongly influence their biological functions. New models and algorithms can help researchers in understanding how the evolution...
The structure and sequence of proteins strongly influence their biological functions. New models and algorithms can help researchers in understanding how the evolution of sequences and structures is related to changes in functions. Recently, studies of SARS-CoV-2 Spike (S) protein structures have been performed to predict binding receptors and infection activity in COVID-19, hence the scientific interest in the effects of virus mutations due to sequence, structure and vaccination arises. However, there is the need for models and tools to study the links between the evolution of S protein sequence, structure and functions, and virus transmissibility and the effects of vaccination. As studies on S protein have been generated a large amount of relevant information, we propose in this work to use Protein Contact Networks (PCNs) to relate protein structures with biological properties by means of network topology properties. Topological properties are used to compare the structural changes with sequence changes. We find that both node centrality and community extraction analysis can be used to relate protein stability and functionality with sequence mutations. Starting from this we compare structural evolution to sequence changes and study mutations from a temporal perspective focusing on virus variants. Finally by applying our model to the Omicron variant we report a timeline correlation between Omicron and the vaccination campaign.
Topics: Humans; SARS-CoV-2; COVID-19; Amino Acid Sequence; Mutation; Spike Glycoprotein, Coronavirus
PubMed: 37471335
DOI: 10.1371/journal.pone.0283400 -
Journal of Chemical Theory and... Jan 2024With the ongoing development of peptide self-assembling materials, there is growing interest in exploring novel functional peptide sequences. From short peptides to long... (Review)
Review
With the ongoing development of peptide self-assembling materials, there is growing interest in exploring novel functional peptide sequences. From short peptides to long polypeptides, as the functionality increases, the sequence space is also expanding exponentially. Consequently, attempting to explore all functional sequences comprehensively through experience and experiments alone has become impractical. By utilizing computational methods, especially artificial intelligence enhanced molecular dynamics (MD) simulation and peptide design, there has been a significant expansion in the exploration of sequence space. Through these methods, a variety of supramolecular functional materials, including fibers, two-dimensional arrays, nanocages, , have been designed by meticulously controlling the inter- and intramolecular interactions. In this review, we first provide a brief overview of the current main computational methods and then focus on the computational design methods for various self-assembled peptide materials. Additionally, we introduce some representative protein self-assemblies to offer guidance for the design of self-assembling peptides.
Topics: Artificial Intelligence; Peptides; Amino Acid Sequence; Proteins; Molecular Dynamics Simulation
PubMed: 38206800
DOI: 10.1021/acs.jctc.3c01054 -
Cell Systems Nov 2023The rapid progress in the field of deep learning has had a significant impact on protein design. Deep learning methods have recently produced a breakthrough in protein... (Review)
Review
The rapid progress in the field of deep learning has had a significant impact on protein design. Deep learning methods have recently produced a breakthrough in protein structure prediction, leading to the availability of high-quality models for millions of proteins. Along with novel architectures for generative modeling and sequence analysis, they have revolutionized the protein design field in the past few years remarkably by improving the accuracy and ability to identify novel protein sequences and structures. Deep neural networks can now learn and extract the fundamental features of protein structures, predict how they interact with other biomolecules, and have the potential to create new effective drugs for treating disease. As their applicability in protein design is rapidly growing, we review the recent developments and technology in deep learning methods and provide examples of their performance to generate novel functional proteins.
Topics: Deep Learning; Neural Networks, Computer; Proteins; Amino Acid Sequence
PubMed: 37972559
DOI: 10.1016/j.cels.2023.10.006 -
PloS One 2023With the growing interest in using phages to combat antimicrobial resistance, computational methods for predicting phage-host interactions have been explored to help...
With the growing interest in using phages to combat antimicrobial resistance, computational methods for predicting phage-host interactions have been explored to help shortlist candidate phages. Most existing models consider entire proteomes and rely on manual feature engineering, which poses difficulty in selecting the most informative sequence properties to serve as input to the model. In this paper, we framed phage-host interaction prediction as a multiclass classification problem that takes as input the embeddings of a phage's receptor-binding proteins, which are known to be the key machinery for host recognition, and predicts the host genus. We explored different protein language models to automatically encode these protein sequences into dense embeddings without the need for additional alignment or structural information. We show that the use of embeddings of receptor-binding proteins presents improvements over handcrafted genomic and protein sequence features. The highest performance was obtained using the transformer-based protein language model ProtT5, resulting in a 3% to 4% increase in weighted F1 and recall scores across different prediction confidence thresholds, compared to using selected handcrafted sequence features.
Topics: Amino Acid Sequence; Proteome; Bacteriophages; Differential Threshold; Mental Recall
PubMed: 37486915
DOI: 10.1371/journal.pone.0289030 -
Journal of Medical Virology Dec 2023Alphavirus is a type of arbovirus that can infect both humans and animals. The amino acid sequence of the 6K protein, being one of the structural proteins of the...
Alphavirus is a type of arbovirus that can infect both humans and animals. The amino acid sequence of the 6K protein, being one of the structural proteins of the alphavirus, is not conserved. Deletion of this protein will result in varying effects on different alphaviruses. Our study focuses on the function of the Getah virus (GETV) 6K protein in infected cells and mice. We successfully constructed infectious clone plasmids and created resulting viruses (rGETV and rGETV-Δ6K). Our comprehensive microscopic analysis revealed that the 6 K protein mainly stays in the endoplasmic reticulum. In addition, rGETV-Δ6K has lower thermal stability and sensitivity to temperature than GETV. Although the deletion of the 6K protein does not reduce virion production in ST cells, it affects the release of virions from host cells by inhibiting the process of E2 protein transportation to the plasma membrane. Subsequent in vivo testing demonstrated that neonatal mice infected with rGETV-Δ6K had a lower virus content, less significant pathological changes in tissue slices, and milder disease than those infected with the wild-type virus. Our results indicate that the 6K protein effectively reduces the viral titer by influencing the release of viral particles. Furthermore, the 6K protein play a role in the clinical manifestation of GETV disease.
Topics: Humans; Animals; Mice; Alphavirus; Virulence; Viral Proteins; Virus Replication; Amino Acid Sequence
PubMed: 38084773
DOI: 10.1002/jmv.29302 -
BMC Bioinformatics Feb 2024To explore the evolutionary history of sequences, a sequence alignment is a first and necessary step, and its quality is crucial. In the context of the study of the...
BACKGROUND
To explore the evolutionary history of sequences, a sequence alignment is a first and necessary step, and its quality is crucial. In the context of the study of the proximal origins of SARS-CoV-2 coronavirus, we wanted to construct an alignment of genomes closely related to SARS-CoV-2 using both coding and non-coding sequences. To our knowledge, there is no tool that can be used to construct this type of alignment, which motivated the creation of CNCA.
RESULTS
CNCA is a web tool that aligns annotated genomes from GenBank files. It generates a nucleotide alignment that is then updated based on the protein sequence alignment. The output final nucleotide alignment matches the protein alignment and guarantees no frameshift. CNCA was designed to align closely related small genome sequences up to 50 kb (typically viruses) for which the gene order is conserved.
CONCLUSIONS
CNCA constructs multiple alignments of small genomes by integrating both coding and non-coding sequences. This preserves regions traditionally ignored in conventional back-translation methods, such as non-coding regions.
Topics: Genome; Sequence Alignment; Proteins; Amino Acid Sequence; Nucleotides
PubMed: 38424511
DOI: 10.1186/s12859-024-05700-1