-
Biomacromolecules Feb 2023
Random heteropolymers (RHPs) have been computationally designed and experimentally shown to recapitulate protein-like phase behavior and function. However, unlike proteins, RHP sequences are only statistically defined and cannot be sequenced. Recent developments in reversible-deactivation radical polymerization allowed simulated polymer sequences based on the well-established Mayo-Lewis equation to more accurately reflect ground-truth sequences that are experimentally synthesized. This led to opportunities to perform bioinformatics-inspired analysis on simulated sequences to guide the design, synthesis, and interpretation of RHPs. We compared batches on the order of 10,000 simulated RHP sequences that vary by synthetically controllable and measurable RHP characteristics such as chemical heterogeneity and average degree of polymerization. Our analysis spans three levels: segments along a single chain, sequences within a batch, and batch-averaged statistics. We discuss simulator fidelity and highlight the importance of robust segment definition. Examples are presented that demonstrate the use of simulated sequence analysis for in-silico iterative design to mimic protein hydrophobic/hydrophilic segment distributions in RHPs, and to compare RHP and protein sequence segments in order to explain experimental results of RHPs that mimic protein function. To facilitate community use of this workflow, the simulator and analysis modules have been made available through an open-source toolkit, the RHPapp.
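As a rough illustration of what sequence simulation from the Mayo-Lewis equation can look like, the sketch below grows two-monomer chains with the terminal model, where the probability of the next monomer depends only on the current chain end. It is not the RHPapp implementation. The function names, the fixed feed composition (no composition drift), the fixed chain length, and the restriction to two monomers are all simplifying assumptions made here for brevity.

```python
import random


def transition_probs(f1, r1, r2):
    """Terminal-model addition probabilities for a two-monomer feed.

    f1 is the mole fraction of monomer 1 in the feed; r1 and r2 are the
    reactivity ratios (r1 = k11/k12, r2 = k22/k21).
    """
    f2 = 1.0 - f1
    p11 = r1 * f1 / (r1 * f1 + f2)  # chain end 1 adds another monomer 1
    p22 = r2 * f2 / (r2 * f2 + f1)  # chain end 2 adds another monomer 2
    return p11, p22


def simulate_chain(length, f1, r1, r2, rng):
    """Grow one chain as a list of monomer labels (1 or 2)."""
    p11, p22 = transition_probs(f1, r1, r2)
    # Assumption: the first monomer is drawn from the feed composition.
    chain = [1 if rng.random() < f1 else 2]
    while len(chain) < length:
        if chain[-1] == 1:
            chain.append(1 if rng.random() < p11 else 2)
        else:
            chain.append(2 if rng.random() < p22 else 1)
    return chain


def simulate_batch(n_chains, length, f1, r1, r2, seed=0):
    """Simulate a batch of chains with a reproducible random seed."""
    rng = random.Random(seed)
    return [simulate_chain(length, f1, r1, r2, rng) for _ in range(n_chains)]


def mean_fraction(batch):
    """Batch-averaged fraction of monomer 1."""
    total = sum(len(chain) for chain in batch)
    return sum(chain.count(1) for chain in batch) / total


if __name__ == "__main__":
    batch = simulate_batch(n_chains=1000, length=100, f1=0.5, r1=0.5, r2=2.0)
    print(mean_fraction(batch))
```

Per-chain, per-batch, and batch-averaged statistics of the kind the abstract describes can then be computed from the returned lists. A more faithful simulator would also update the feed composition as monomers are consumed.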
Topics: Proteins; Polymers; Amino Acid Sequence; Polymerization
PubMed: 36638823
DOI: 10.1021/acs.biomac.2c01036 -
The Journal of Physical Chemistry... Aug 2022
A self-consistent analytical solution for binodal concentrations of the two-component Flory-Huggins phase separation model is derived. We show that this form extends the validity of the Ginzburg-Landau expansion away from the critical point to cover the whole phase space. Furthermore, this analytical solution reveals an exponential scaling law of the dilute phase binodal concentration as a function of the interaction strength and chain length. We demonstrate explicitly the power of this approach by fitting experimental protein liquid-liquid phase separation boundaries to determine the effective chain length and solute-solvent interaction energies. Moreover, we demonstrate that this strategy allows us to resolve differences in interaction energy contributions of individual amino acids. This analytical framework can serve as a new way to decode the protein sequence grammar for liquid-liquid phase separation.
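For orientation, the standard textbook form of the two-component Flory-Huggins model is sketched below. This is not the paper's analytical binodal solution, which the abstract does not reproduce. The symbols are assumptions introduced here: $\phi$ is the solute volume fraction, $N$ the chain length, $\chi$ the interaction parameter, and $\phi_{\mathrm{dil}}$, $\phi_{\mathrm{den}}$ the dilute-phase and dense-phase binodal concentrations.

```latex
% Free energy of mixing per lattice site, in units of k_B T
\frac{f(\phi)}{k_B T} = \frac{\phi}{N}\ln\phi + (1-\phi)\ln(1-\phi) + \chi\,\phi(1-\phi)

% Binodal: common-tangent construction between the two coexisting phases
f'(\phi_{\mathrm{dil}}) = f'(\phi_{\mathrm{den}})
  = \frac{f(\phi_{\mathrm{den}}) - f(\phi_{\mathrm{dil}})}{\phi_{\mathrm{den}} - \phi_{\mathrm{dil}}}

% Critical point, from f''(\phi) = 0 and f'''(\phi) = 0
\phi_c = \frac{1}{1+\sqrt{N}}, \qquad
\chi_c = \frac{\left(1+\sqrt{N}\right)^2}{2N}
```

The common-tangent conditions have no elementary closed-form solution in general, which is why binodals are usually obtained numerically or by expansion near the critical point.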
Topics: Amino Acid Sequence; Proteins; Solutions; Solvents; Thermodynamics
PubMed: 35977086
DOI: 10.1021/acs.jpclett.2c01986 -
Current Opinion in Structural Biology Feb 2021 (Review)
The grand challenge of protein design is a general method for producing a polypeptide with arbitrary functionality, conformation, and biochemical properties. To that end, a wide variety of methods have been developed for the improvement of native proteins, the design of ideal proteins de novo, and the redesign of suboptimal proteins with better-performing substructures. These methods employ informatic comparisons of function-structure-sequence relationships as well as knowledge-based evaluation of protein properties to narrow the immense protein sequence search space down to an enumerable and often manually evaluable set of structures that meet specified criteria. While arbitrary manipulation of protein-protein interfaces and molecular catalysis remains an unsolved problem, and no protein shape or behavior manipulation algorithm is universally applicable, the promising results thus far are a strong indicator that a general approach to the arbitrary manipulation of polypeptides is within reach.
Topics: Algorithms; Amino Acid Sequence; Catalysis; Protein Conformation; Protein Folding; Proteins
PubMed: 33276237
DOI: 10.1016/j.sbi.2020.10.015 -
BMC Evolutionary Biology Dec 2011 (Comparative Study)
BACKGROUND
Protein sequence evolution is constrained by the biophysics of folding and function, causing interdependence between interacting sites in the sequence. However, current site-independent models of sequence evolution do not take this into account. Recent attempts to integrate the influence of structure and biophysics into phylogenetic models via statistical/informational approaches have not resulted in expected improvements in model performance. This suggests that further innovations are needed for progress in this field.
RESULTS
Here we develop a coarse-grained physics-based model of protein folding and binding function, and compare it to a popular informational model. We find that both models violate the assumption of the native sequence being close to a thermodynamic optimum, causing directional selection away from the native state. Sampling and simulation show that the physics-based model is more specific for fold-defining interactions that vary less among residue types. The informational model diffuses further in sequence space with fewer barriers and tends to provide less support for an invariant sites model, although amino acid substitutions are generally conservative. Both approaches produce sequences with natural features like dN/dS < 1 and gamma-distributed rates across sites.
CONCLUSIONS
Simple coarse-grained models of protein folding can describe some natural features of evolving proteins but are currently not accurate enough to use in evolutionary inference. This is partly due to improper packing of the hydrophobic core. We suggest possible improvements on the representation of structure, folding energy, and binding function, as regards both native and non-native conformations, and describe a large number of possible applications for such a model.
Topics: Amino Acid Sequence; Biophysics; Computer Simulation; Evolution, Molecular; Models, Molecular; Molecular Sequence Data; Protein Conformation; Protein Folding; Proteins; Thermodynamics
PubMed: 22171550
DOI: 10.1186/1471-2148-11-361 -
Journal of Molecular Graphics &... Nov 2019
The protein sequence-structure gap results from the contrast between rapid, low-cost deep sequencing, and slow, expensive experimental structure determination techniques. Comparative homology modelling may have the potential to close this gap by predicting protein structure in target sequences using existing experimentally solved structures as templates. This paper presents the first use of force-directed graphs for the visualization of sequence space in two dimensions, and applies them to the choice of suitable RNA-dependent RNA polymerase (RdRP) target-template pairs within human-infective RNA virus genera. Measures of centrality in protein sequence space for each genus were also derived and used to identify centroid nearest-neighbour sequences (CNNs) potentially useful for production of homology models most representative of their genera. Homology modelling was then carried out for target-template pairs in different species, different genera and different families, and model quality assessed using several metrics. Reconstructed ancestral RdRP sequences for individual genera were also used as templates for the production of ancestral RdRP homology models. High quality ancestral RdRP models were consistently produced, as were good quality models for target-template pairs in the same genus. Homology modelling between genera in the same family produced mixed results and inter-family modelling was unreliable. We present a protocol for the production of optimal RdRP homology models for use in further experiments, e.g. docking to discover novel anti-viral compounds.
Topics: Algorithms; Amino Acid Sequence; Humans; Models, Molecular; Molecular Dynamics Simulation; Proteins
PubMed: 31377535
DOI: 10.1016/j.jmgm.2019.07.014 -
Cell Systems Jun 2021
Language models have recently emerged as a powerful machine-learning approach for distilling information from massive protein sequence databases. From readily available sequence data alone, these models discover evolutionary, structural, and functional organization across protein space. Using language models, we can encode amino-acid sequences into distributed vector representations that capture their structural and functional properties, as well as evaluate the evolutionary fitness of sequence variants. We discuss recent advances in protein language modeling and their applications to downstream protein property prediction problems. We then consider how these models can be enriched with prior biological knowledge and introduce an approach for encoding protein structural knowledge into the learned representations. The knowledge distilled by these models allows us to improve downstream function prediction through transfer learning. Deep protein language models are revolutionizing protein biology. They suggest new ways to approach protein and therapeutic design. However, further developments are needed to encode strong biological priors into protein language models and to increase their accessibility to the broader community.
Topics: Amino Acid Sequence; Databases, Protein; Language; Machine Learning; Proteins
PubMed: 34139171
DOI: 10.1016/j.cels.2021.05.017 -
Advances in Protein Chemistry and... 2011 (Review)
Graphical representation and numerical characterization (GRANCH) of nucleotide and protein sequences is a new field that is showing a lot of promise in analysis of such sequences. While formulation and applications of GRANCH techniques for DNA/RNA sequences started just over a decade ago, analyses of protein sequences by these techniques are of more recent origin. The emphasis is still on developing the underlying technique, but significant results have been achieved in using these methods for protein phylogeny, mass spectral data of proteins and protein serum profiles in parasites, toxicoproteomics, determination of different indices for use in QSAR studies, among others. We briefly mention these in this chapter, with some details on protein phylogeny and viral diseases. In particular, we cover a systematic method developed in GRANCH to determine conserved surface exposed peptide segments in selected viral proteins that can be used for drug and vaccine targeting. The new GRANCH techniques and applications for DNAs and proteins are covered briefly to provide an overview to this nascent field.
Topics: Amino Acid Sequence; Quantitative Structure-Activity Relationship; Sequence Analysis, Protein; Viral Proteins
PubMed: 21570664
DOI: 10.1016/B978-0-12-381262-9.00001-X -
Current Opinion in Structural Biology Oct 2023 (Review)
Ancestral sequence reconstruction (ASR) provides insight into the changes within a protein sequence across evolution. More specifically, it can illustrate how specific amino acid changes give rise to different phenotypes within a protein family. Over the last few decades it has established itself as a powerful technique for revealing molecular common denominators that govern enzyme function. Here, we describe the strength of ASR in unveiling catalytic mechanisms and emerging phenotypes for a range of different proteins, also highlighting biotechnological applications the methodology can provide.
Topics: Phylogeny; Evolution, Molecular; Proteins; Amino Acid Sequence; Phenotype
PubMed: 37544113
DOI: 10.1016/j.sbi.2023.102669 -
Genes Sep 2022
The classification of protein sequences provides valuable insights into bioinformatics. Most existing methods are based on sequence alignment algorithms, which become time-consuming as the size of the database increases. Therefore, there is a need to develop an improved method for effectively classifying protein sequences. In this paper, we propose a novel accumulated natural vector method to cluster protein sequences at a lower time cost without reducing accuracy. Our method projects each protein sequence as a point in a 250-dimensional space according to its amino acid distribution. Thus, the biological distance between any two proteins can be easily measured by the Euclidean distance between the corresponding points in the 250-dimensional space. The convex hull analysis and classification perform robustly on virus and bacteria datasets, effectively verifying our method.
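To make the idea of mapping a sequence to a point and comparing points by Euclidean distance concrete, the sketch below computes the classic 60-dimensional natural vector (per-residue count, mean position, and normalized second central moment). This is a simpler relative of the method, not the 250-dimensional accumulated natural vector proposed in the abstract, whose construction is not given here. The function names and the choice to skip non-standard residue letters are assumptions made for this example.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def natural_vector(seq):
    """Return the 60-dimensional natural vector of a protein sequence.

    Layout: 20 counts, then 20 mean positions, then 20 normalized second
    central moments. Letters outside AMINO_ACIDS are ignored (assumption).
    """
    n = len(seq)
    counts, means, moments = [], [], []
    for aa in AMINO_ACIDS:
        # 1-based positions at which this residue occurs
        positions = [i + 1 for i, ch in enumerate(seq) if ch == aa]
        n_k = len(positions)
        if n_k == 0:
            counts.append(0.0)
            means.append(0.0)
            moments.append(0.0)
            continue
        mu_k = sum(positions) / n_k
        d2_k = sum((p - mu_k) ** 2 for p in positions) / (n_k * n)
        counts.append(float(n_k))
        means.append(mu_k)
        moments.append(d2_k)
    return counts + means + moments


def sequence_distance(seq_a, seq_b):
    """Euclidean distance between the natural vectors of two sequences."""
    return math.dist(natural_vector(seq_a), natural_vector(seq_b))


if __name__ == "__main__":
    print(sequence_distance("MKTAYIAKQR", "MKTAYIAKQK"))
```

Because each sequence is reduced to a fixed-length vector once, pairwise comparison avoids alignment entirely, which is the source of the time savings the abstract refers to.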
Topics: Phylogeny; Amino Acid Sequence; Sequence Alignment; Amino Acids; Bacteria
PubMed: 36292629
DOI: 10.3390/genes13101744 -
Proteins Sep 2018
Proteins employ the information stored in the genetic code and translated into their sequences to carry out well-defined functions in the cellular environment. The possibility of encoding such functions is controlled by the balance between the amount of information supplied by the sequence and the amount left after the protein has folded into its structure. We study the amount of information necessary to specify the protein structure, providing an estimate that takes into account the thermodynamic properties of protein folding. We thus show that the information remaining in the protein sequence after encoding its structure (the 'information gap') is very close to what is needed to encode its function and interactions. Then, by predicting the information gap directly from the protein sequence, we show that it may be possible to use these insights from information theory to discriminate between ordered and disordered proteins, to identify unknown functions, and to optimize artificially designed protein sequences.
Topics: Amino Acid Sequence; Computational Biology; Models, Molecular; Protein Conformation; Protein Folding; Proteins; Thermodynamics
PubMed: 29790601
DOI: 10.1002/prot.25527