-
Biomolecules Mar 2023
The inverse protein folding problem, also known as protein sequence design, seeks to predict an amino acid sequence that folds into a specific structure and performs a specific function. Recent advancements in machine learning techniques have been successful in generating functional sequences, outperforming previous energy function-based methods. However, these machine learning methods are limited in their interoperability and robustness, especially when designing proteins that must function under non-ambient conditions, such as high temperature, extreme pH, or in various ionic solvents. To address this issue, we propose a new Physics-Informed Neural Networks (PINNs)-based protein sequence design approach. Our approach combines all-atom molecular dynamics simulations, a PINNs MD surrogate model, and a relaxation of binary programming to solve the protein design task while optimizing both energy and the structural stability of proteins. We demonstrate the effectiveness of our design framework in designing proteins that can function under non-ambient conditions.
Topics: Proteins; Neural Networks, Computer; Amino Acid Sequence; Molecular Dynamics Simulation; Physics
PubMed: 36979392
DOI: 10.3390/biom13030457 -
Current Opinion in Structural Biology Apr 2016 (Review)
Design of proteins has far-reaching potential in diverse areas that span repurposing of the protein scaffold for reactions and substrates that they were not naturally meant for, to catching a glimpse of the ephemeral proteins that nature might have sampled during evolution. These non-natural proteins, in either synthesized or virtual form, have opened the scope for the design of entities that not only rival their natural counterparts but also offer a chance to visualize the protein space continuum that might help to relate proteins and understand their associations. Here, we review the recent advances in protein engineering and design, in multiple areas, with a view to drawing attention to their future potential.
Topics: Amino Acid Sequence; Nanotechnology; Protein Folding; Proteins
PubMed: 26773478
DOI: 10.1016/j.sbi.2015.12.004 -
Current Protocols Feb 2021
Protein evolution and protein engineering techniques are of great interest in basic science and industrial applications such as pharmacology, medicine, or biotechnology. Ancestral sequence reconstruction (ASR) is a powerful technique for probing evolutionary relationships and engineering robust proteins with good thermostability and broad substrate specificity. The following protocol describes the setting up and execution of an automated FireProt workflow using a dedicated web site. The service allows for inference of ancestral proteins automatically, from a single protein sequence. Once a protein sequence is submitted, the server will build a dataset of homologous sequences, perform a multiple sequence alignment (MSA), build a phylogenetic tree, and reconstruct ancestral nodes. The protocol is also highly flexible and allows for multiple forms of input, advanced settings, and the ability to start jobs from: (i) a single sequence, (ii) a set of homologous sequences, (iii) an MSA, and (iv) a phylogenetic tree. This approach automates all necessary steps and offers a way for novices with limited exposure to ASR techniques to improve the properties of a protein of interest. The technique can even be used to introduce catalytic promiscuity into an enzyme. A web server for accessing the fully automated workflow is freely accessible at https://loschmidt.chemi.muni.cz/fireprotasr/. Basic Protocol: ASR using the Web Server FireProt.
Topics: Amino Acid Sequence; Evolution, Molecular; Phylogeny; Proteins; Sequence Alignment
PubMed: 33524240
DOI: 10.1002/cpz1.30 -
Cell Systems Jan 2021
Machine learning can infer how protein sequence maps to function without requiring a detailed understanding of the underlying physical or biological mechanisms. It is challenging to apply existing supervised learning frameworks to large-scale experimental data generated by deep mutational scanning (DMS) and related methods. DMS data often contain high-dimensional and correlated sequence variables, experimental sampling error and bias, and the presence of missing data. Notably, most DMS data do not contain examples of negative sequences, making it challenging to directly estimate how sequence affects function. Here, we develop a positive-unlabeled (PU) learning framework to infer sequence-function relationships from large-scale DMS data. Our PU learning method displays excellent predictive performance across ten large-scale sequence-function datasets, representing proteins of different folds, functions, and library types. The estimated parameters pinpoint key residues that dictate protein structure and function. Finally, we apply our statistical sequence-function model to design highly stabilized enzymes.
Topics: Amino Acid Sequence; Machine Learning; Proteins
PubMed: 33212013
DOI: 10.1016/j.cels.2020.10.007 -
Current Opinion in Structural Biology Jun 2016 (Review)
Protein design is still a challenging undertaking, often requiring multiple attempts or iterations for success. Typically, the source of failure is unclear, and scoring metrics appear similar between successful and failed cases. Nevertheless, the use of sequence statistics, modularity and symmetry from natural proteins, combined with computational design both at the coarse-grained and atomistic levels is propelling a new wave of design efforts to success. Here we highlight recent examples of design, showing how the wealth of natural protein sequence and topology data may be leveraged to reduce the search space and increase the likelihood of achieving desired outcomes.
Topics: Amino Acid Sequence; Computational Biology; Protein Engineering; Proteins
PubMed: 27270240
DOI: 10.1016/j.sbi.2016.05.007 -
Journal of Structural Biology Nov 2019
Low complexity regions (LCRs) in protein sequences have special properties that are very different from those of globular proteins. The rules that define secondary structure elements do not apply when the distribution of amino acids becomes biased. While there is a tendency towards structural disorder in LCRs, various examples, and particularly homorepeats of single amino acids, suggest that very short repeats could adopt structures that are very difficult to predict. These structures are possibly variable and dependent on the context of intra- or inter-molecular interactions. In general, short repeats in LCRs can induce structure. This could explain the observation that very short (non-perfect) repeats are widespread and many define regions with a function in protein interactions. For these reasons, we have developed an algorithm to quickly analyze local repeatability along protein sequences, that is, how close a protein fragment is to a perfect repeat. Using this algorithm we identified that the proteins of the yeast Saccharomyces cerevisiae are depleted in short repeats (approximate or not) of odd length, while the human proteins are not, that the fish Danio rerio has many proteins with repeats of length two, and that the plant Arabidopsis thaliana has an unusually large number of repeats of length seven. Our method (REpeatability Scanner, RES, accessible at http://cbdm-01.zdv.uni-mainz.de/~munoz/res/) finds regions with approximate short repeats in protein sequences, and helps to characterize the variable use of LCRs and compositional bias in different organisms.
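The notion of local repeatability in this abstract can be pictured with a small Python sketch. It is an illustrative stand-in only: the abstract does not describe how RES scores a fragment, so the match-fraction score, the sliding window, and the names `repeat_score` and `scan` are all assumptions made for this example.

```python
def repeat_score(fragment, period):
    """Fraction of positions whose residue equals the residue `period` steps later.

    A perfect repeat of the given period scores 1.0. This scoring rule is an
    assumption for illustration, not the published RES score.
    """
    pairs = len(fragment) - period
    if period < 1 or pairs < 1:
        raise ValueError("period must be >= 1 and shorter than the fragment")
    matches = sum(fragment[i] == fragment[i + period] for i in range(pairs))
    return matches / pairs


def scan(sequence, window, period):
    """Score every window of length `window` along the sequence (hypothetical helper)."""
    return [
        repeat_score(sequence[start:start + window], period)
        for start in range(len(sequence) - window + 1)
    ]
```

Under these assumptions, `repeat_score("QAQAQAQA", 2)` is 1.0 because the fragment is a perfect repeat of length two, while a single substitution lowers the score without sending it to zero, which is the sense in which a fragment can be "close to" a perfect repeat.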
Topics: Algorithms; Amino Acid Sequence; Databases, Protein; Evolution, Molecular; Humans; Proteins; Repetitive Sequences, Amino Acid; Sequence Alignment; Sequence Analysis, Protein
PubMed: 31408700
DOI: 10.1016/j.jsb.2019.08.003 -
Computers in Biology and Medicine Mar 2024
The classification and prediction of T-cell receptor (TCR) protein sequences are of significant interest in understanding the immune system and developing personalized immunotherapies. In this study, we propose a novel approach using Pseudo Amino Acid Composition (PseAAC) protein encoding for accurate TCR protein sequence classification. The PseAAC2Vec encoding method captures the physicochemical properties of amino acids and their local sequence information, enabling the representation of protein sequences as fixed-length feature vectors. By incorporating physicochemical properties such as hydrophobicity, polarity, charge, molecular weight, and solvent accessibility, PseAAC2Vec provides a comprehensive and informative characterization of TCR protein sequences. To evaluate the effectiveness of the proposed PseAAC2Vec encoding approach, we assembled a large dataset of TCR protein sequences with annotated classes. We applied the PseAAC2Vec encoding scheme to each sequence and generated feature vectors based on a specified window size. Subsequently, we employed state-of-the-art machine learning algorithms, such as support vector machines (SVM) and random forests (RF), to classify the TCR protein sequences. Experimental results on the benchmark dataset demonstrated the superior performance of the PseAAC2Vec-based approach compared to existing methods. The PseAAC2Vec encoding effectively captures the discriminative patterns in TCR protein sequences, leading to improved classification accuracy and robustness. Furthermore, the encoding scheme showed promising results across different window sizes, indicating its adaptability to varying sequence contexts.
Topics: Computational Biology; Proteins; Amino Acid Sequence; Amino Acids; Algorithms; Support Vector Machine; Sequence Analysis, Protein; Databases, Protein
PubMed: 38217977
DOI: 10.1016/j.compbiomed.2024.107956 -
Biomacromolecules Feb 2023
Random heteropolymers (RHPs) have been computationally designed and experimentally shown to recapitulate protein-like phase behavior and function. However, unlike proteins, RHP sequences are only statistically defined and cannot be sequenced. Recent developments in reversible-deactivation radical polymerization allowed simulated polymer sequences based on the well-established Mayo-Lewis equation to more accurately reflect ground-truth sequences that are experimentally synthesized. This led to opportunities to perform bioinformatics-inspired analysis on simulated sequences to guide the design, synthesis, and interpretation of RHPs. We compared batches on the order of 10000 simulated RHP sequences that vary by synthetically controllable and measurable RHP characteristics such as chemical heterogeneity and average degree of polymerization. Our analysis spans across 3 levels: segments along a single chain, sequences within a batch, and batch-averaged statistics. We discuss simulator fidelity and highlight the importance of robust segment definition. Examples are presented that demonstrate the use of simulated sequence analysis for in-silico iterative design to mimic protein hydrophobic/hydrophilic segment distributions in RHPs and compare RHP and protein sequence segments to explain experimental results of RHPs that mimic protein function. To facilitate the community use of this workflow, the simulator and analysis modules have been made available through an open source toolkit, the RHPapp.
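As a rough picture of how sequences can be simulated from the Mayo-Lewis equation, the Python sketch below uses the two-monomer terminal model with a fixed feed composition. It is not the RHPapp simulator: real RHPs typically involve more than two monomers and feed drift during polymerization, and the function names, the "A"/"B" labels, and the rule for choosing the first monomer are assumptions made for this example.

```python
import random


def mayo_lewis_f1(f1, r1, r2):
    """Instantaneous fraction of monomer 1 in the copolymer (Mayo-Lewis equation).

    f1 is the feed fraction of monomer 1 (0 < f1 < 1); r1, r2 are reactivity ratios.
    """
    f2 = 1.0 - f1
    return (r1 * f1 * f1 + f1 * f2) / (r1 * f1 * f1 + 2.0 * f1 * f2 + r2 * f2 * f2)


def simulate_chain(length, f1, r1, r2, rng):
    """Grow one chain under the terminal model, assuming the feed stays constant."""
    f2 = 1.0 - f1
    p_a_after_a = r1 * f1 / (r1 * f1 + f2)  # chain ends in A, next unit is A
    p_a_after_b = f1 / (f1 + r2 * f2)       # chain ends in B, next unit is A
    # Assumption: the first unit is drawn directly from the feed composition.
    chain = ["A" if rng.random() < f1 else "B"]
    while len(chain) < length:
        p_a = p_a_after_a if chain[-1] == "A" else p_a_after_b
        chain.append("A" if rng.random() < p_a else "B")
    return "".join(chain)
```

Calling `simulate_chain(50, 0.5, 0.0, 0.0, random.Random(0))` illustrates the limiting case where both reactivity ratios are zero: each chain end can only add the other monomer, so the sequence strictly alternates. A batch of such chains could then be analyzed segment by segment, in the spirit of the workflow the abstract describes.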
Topics: Proteins; Polymers; Amino Acid Sequence; Polymerization
PubMed: 36638823
DOI: 10.1021/acs.biomac.2c01036 -
The Journal of Physical Chemistry... Aug 2022
A self-consistent analytical solution for binodal concentrations of the two-component Flory-Huggins phase separation model is derived. We show that this form extends the validity of the Ginzburg-Landau expansion away from the critical point to cover the whole phase space. Furthermore, this analytical solution reveals an exponential scaling law of the dilute phase binodal concentration as a function of the interaction strength and chain length. We demonstrate explicitly the power of this approach by fitting experimental protein liquid-liquid phase separation boundaries to determine the effective chain length and solute-solvent interaction energies. Moreover, we demonstrate that this strategy allows us to resolve differences in interaction energy contributions of individual amino acids. This analytical framework can serve as a new way to decode the protein sequence grammar for liquid-liquid phase separation.
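For orientation, the textbook two-component Flory-Huggins model can be sketched in Python as follows. The sketch covers only the standard critical point and spinodal, which follow from the free energy f(phi) = (phi/N) ln phi + (1 - phi) ln(1 - phi) + chi phi (1 - phi); it does not reproduce the analytical binodal solution derived in the paper, which the abstract does not state. The function names and the choice of a monomeric solvent are assumptions for this example.

```python
import math


def critical_point(n):
    """Return (phi_c, chi_c) for a chain of n segments in a monomeric solvent."""
    root_n = math.sqrt(n)
    return 1.0 / (1.0 + root_n), 0.5 * (1.0 + 1.0 / root_n) ** 2


def curvature(phi, chi, n):
    """Second derivative of the Flory-Huggins free energy per site (units of kT)."""
    return 1.0 / (n * phi) + 1.0 / (1.0 - phi) - 2.0 * chi


def spinodal(chi, n):
    """Return (dilute, dense) spinodal volume fractions, or None if chi < chi_c.

    Setting the curvature to zero gives the quadratic
    2*chi*n*phi**2 - (2*chi*n - n + 1)*phi + 1 = 0.
    """
    if chi < critical_point(n)[1]:
        return None
    b = 2.0 * chi * n - n + 1.0
    root = math.sqrt(max(b * b - 8.0 * chi * n, 0.0))  # clamp rounding noise at chi_c
    return (b - root) / (4.0 * chi * n), (b + root) / (4.0 * chi * n)
```

The binodal concentrations discussed in the abstract lie outside the spinodal pair returned here, so this sketch only brackets the unstable region. Fitting measured phase boundaries, as the paper does, would require the binodal itself.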
Topics: Amino Acid Sequence; Proteins; Solutions; Solvents; Thermodynamics
PubMed: 35977086
DOI: 10.1021/acs.jpclett.2c01986 -
Current Opinion in Structural Biology Feb 2021 (Review)
The grand challenge of protein design is a general method for producing a polypeptide with arbitrary functionality, conformation, and biochemical properties. To that end, a wide variety of methods have been developed for the improvement of native proteins, the design of ideal proteins de novo, and the redesign of suboptimal proteins with better-performing substructures. These methods employ informatic comparisons of function-structure-sequence relationships as well as knowledge-based evaluation of protein properties to narrow the immense protein sequence search space down to an enumerable and often manually evaluable set of structures that meet specified criteria. While arbitrary manipulation of protein-protein interfaces and molecular catalysis remains an unsolved problem, and no protein shape or behavior manipulation algorithm is universally applicable, the promising results thus far are a strong indicator that a general approach to the arbitrary manipulation of polypeptides is within reach.
Topics: Algorithms; Amino Acid Sequence; Catalysis; Protein Conformation; Protein Folding; Proteins
PubMed: 33276237
DOI: 10.1016/j.sbi.2020.10.015