- Nature Reviews. Molecular Cell Biology, Mar 2024
Review
Intrinsically disordered protein regions exist in a collection of dynamic interconverting conformations that lack a stable 3D structure. These regions are structurally heterogeneous, ubiquitous and found across all kingdoms of life. Despite the absence of a defined 3D structure, disordered regions are essential for cellular processes ranging from transcriptional control and cell signalling to subcellular organization. Through their conformational malleability and adaptability, disordered regions extend the repertoire of macromolecular interactions and are readily tunable by their structural and chemical context, making them ideal responders to regulatory cues. Recent work has led to major advances in understanding the link between protein sequence and conformational behaviour in disordered regions, yet the link between sequence and molecular function is less well defined. Here we consider the biochemical and biophysical foundations that underlie how and why disordered regions can engage in productive cellular functions, provide examples of emerging concepts and discuss how protein disorder contributes to intracellular information processing and regulation of cellular function.
Topics: Intrinsically Disordered Proteins; Protein Conformation; Amino Acid Sequence; Macromolecular Substances
PubMed: 37957331
DOI: 10.1038/s41580-023-00673-0
- Methods in Enzymology, 2020
Review
Directed evolution and rational design are powerful strategies in protein engineering to tailor enzyme properties to meet the demands of academia and industry. Traditional approaches for enzyme engineering and directed evolution are often experimentally driven, in particular when the protein structure-function relationship is not available. Though they have been successfully applied to engineer many enzymes, these methods still face significant challenges due to the tremendous size of the protein sequence space and the combinatorial problem. Current experimental and computational techniques may never be able to sample the entire protein sequence space or exploit nature's full potential for generating better enzymes. With advancements in next-generation sequencing, high-throughput screening methods, the growth of protein databases and artificial intelligence, especially machine learning (ML), data-driven enzyme engineering is emerging as a promising solution to these challenges. To date, ML-assisted approaches have efficiently and accurately determined the quantitative structure-property/activity relationship for the prediction of diverse enzyme properties. In addition, enzyme engineering can be accelerated considerably by combining experimental library generation with ML-based prediction. In this chapter, we review recent progress in ML-assisted enzyme engineering and highlight several successful examples (e.g., to enhance activity, enantioselectivity, or thermostability). We explain enzyme engineering strategies that combine random or (semi-)rational approaches with ML methods and allow an effective reengineering of enzymes to improve targeted properties. We further discuss the main challenges to solve in order to realize the full potential of ML methods in enzyme engineering. Finally, we describe the current limitations of ML-assisted enzyme engineering, and our perspective on future opportunities in this growing field.
Topics: Amino Acid Sequence; Artificial Intelligence; Directed Molecular Evolution; High-Throughput Screening Assays; Machine Learning; Protein Engineering
PubMed: 32896285
DOI: 10.1016/bs.mie.2020.05.005
- Current Protein & Peptide Science, 2023
Review
Most of the currently available knowledge about protein structure and function has been obtained from laboratory experiments. As a complement to this classical knowledge discovery activity, bioinformatics-assisted sequence analysis, which relies primarily on biological data manipulation, is becoming an indispensable option for the modern discovery of new knowledge, especially when large amounts of protein-encoding sequences can be easily identified from the annotation of high-throughput genomic data. Here, we review the advances in bioinformatics-assisted protein sequence analysis to highlight how bioinformatics analysis will aid in understanding protein structure and function. We first discuss the analyses with individual protein sequences as input, from which some basic parameters of proteins (e.g., amino acid composition, MW and PTM) can be predicted. In addition to these basic parameters that can be directly predicted by analyzing a protein sequence alone, many predictions are based on principles drawn from knowledge of many well-studied proteins, with multiple sequence comparisons as input. This category includes identification of conserved sites by comparing multiple homologous sequences, prediction of the folding, structure or function of uncharacterized proteins, construction of phylogenies of related sequences, analysis of the contribution of conserved related sites to protein function by SCA or DCA, elucidation of the significance of codon usage, and extraction of functional units from protein sequences and coding spaces. We then discuss the revolutionary invention of the "QTY code", which can be applied to convert membrane proteins into water-soluble proteins at the cost of marginal structural and functional changes. As in other scientific fields, machine learning has profoundly impacted protein sequence analysis. In summary, we have highlighted the relevance of bioinformatics-assisted analysis for protein research as a valuable guide for laboratory experiments.
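The single-sequence parameters mentioned in this abstract (amino acid composition and MW) can be illustrated with a minimal Python sketch. The function names are hypothetical, and the mass table uses approximate average residue masses from standard reference values, not figures taken from the reviewed article.

```python
from collections import Counter

# Approximate average residue masses in daltons (standard reference values,
# an assumption of this sketch; not taken from the reviewed article).
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_MASS = 18.0153  # one water is added for the free N- and C-termini


def composition(seq):
    """Return the fraction of each residue type present in `seq`."""
    counts = Counter(seq.upper())
    total = sum(counts.values())
    return {aa: n / total for aa, n in sorted(counts.items())}


def molecular_weight(seq):
    """Estimate the average molecular weight of an unmodified peptide chain.

    Raises KeyError for letters outside the 20 standard amino acids.
    """
    return sum(RESIDUE_MASS[aa] for aa in seq.upper()) + WATER_MASS
```

For example, `molecular_weight("GG")` gives roughly 132.12 Da, in line with the known mass of glycylglycine. Post-translational modifications (PTM) are not modelled here.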
Topics: Proteins; Amino Acid Sequence; Sequence Analysis, Protein; Amino Acids; Computational Biology
PubMed: 37287293
DOI: 10.2174/1389203724666230509124300
- Cell Systems, Aug 2023
Review
Discovery and evolution of new and improved proteins have empowered molecular therapeutics, diagnostics, and industrial biotechnology. Discovery and evolution both require efficient screens and effective libraries, although they differ in their challenges because of the absence or presence, respectively, of an initial protein variant with the desired function. A host of high-throughput technologies, both experimental and computational, enable efficient screens to identify performant protein variants. In partnership, an informed search of sequence space is needed to overcome the immensity, sparsity, and complexity of the sequence-performance landscape. Early in the historical trajectory of protein engineering, these elements aligned with distinct approaches to identify the most performant sequence: selection from large, randomized combinatorial libraries versus rational computational design. Substantial advances have now emerged from the synergy of these perspectives. Rational design of combinatorial libraries aids the experimental search of sequence space, and high-throughput, high-integrity experimental data inform computational design. At the core of the collaborative interface, efficient protein characterization (rather than mere selection of optimal variants) maps sequence-performance landscapes. Such quantitative maps elucidate the complex relationships between protein sequence and performance (e.g., binding, catalytic efficiency, biological activity, and developability), thereby advancing fundamental protein science and facilitating protein discovery and evolution.
Topics: Directed Molecular Evolution; Protein Engineering; Biotechnology; Proteins; Amino Acid Sequence
PubMed: 37494931
DOI: 10.1016/j.cels.2023.06.009
- Current Opinion in Structural Biology, Feb 2022
Review
Machine-learning models that learn from data to predict how protein sequence encodes function are emerging as a useful protein engineering tool. However, when using these models to suggest new protein designs, one must deal with the vast combinatorial complexity of protein sequences. Here, we review how to use a sequence-to-function machine-learning surrogate model to select sequences for experimental measurement. First, we discuss how to select sequences through a single round of machine-learning optimization. Then, we discuss sequential optimization, where the goal is to discover optimized sequences and improve the model across multiple rounds of training, optimization, and experimental measurement.
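The sequential optimization described in this abstract (train a surrogate, select sequences, measure, repeat) can be sketched as a loop. This is only an illustrative toy under stated assumptions: the additive per-position surrogate, the greedy top-k selection, the restriction to single mutants, and all function names (including the caller-supplied `assay`) are hypothetical and are not methods taken from the review.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def single_mutants(parent):
    """Yield every sequence that differs from `parent` at exactly one position."""
    for i, wild_type in enumerate(parent):
        for aa in ALPHABET:
            if aa != wild_type:
                yield parent[:i] + aa + parent[i + 1:]


def fit_additive_model(measured):
    """Toy surrogate: mean measured value for each (position, residue) pair."""
    sums, counts = {}, {}
    for seq, value in measured.items():
        for key in enumerate(seq):
            sums[key] = sums.get(key, 0.0) + value
            counts[key] = counts.get(key, 0) + 1
    return {key: sums[key] / counts[key] for key in sums}


def predict(model, seq, default=0.0):
    """Average the per-position estimates; unseen pairs score `default`."""
    return sum(model.get(key, default) for key in enumerate(seq)) / len(seq)


def select_batch(model, candidates, measured, k):
    """Greedy selection: the k unmeasured candidates with the highest prediction."""
    unmeasured = [s for s in candidates if s not in measured]
    return sorted(unmeasured, key=lambda s: predict(model, s), reverse=True)[:k]


def optimize(parent, assay, rounds=3, k=5):
    """Alternate model fitting, batch selection, and measurement via `assay`."""
    measured = {parent: assay(parent)}
    for _ in range(rounds):
        model = fit_additive_model(measured)
        best_so_far = max(measured, key=measured.get)
        for seq in select_batch(model, single_mutants(best_so_far), measured, k):
            measured[seq] = assay(seq)
    return max(measured, key=measured.get), measured
```

Calling `optimize("GGGG", assay=lambda s: s.count("A"))` stands in for a real experiment with a toy scoring function. A purely greedy selection rule like this one ignores model uncertainty, which is one reason practical approaches often balance exploration against exploitation.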
Topics: Amino Acid Sequence; Machine Learning; Protein Engineering; Proteins
PubMed: 34896756
DOI: 10.1016/j.sbi.2021.11.002
- Nucleic Acids Research, Apr 2022
Alternative splicing is frequently involved in the diversification of protein function and can also be modulated for therapeutic purposes. Here we develop a predictive model, called Exon ByPASS (predicting Exon skipping Based on Protein amino acid SequenceS), to assess the criticality of exon inclusion based solely on information contained in the amino acid sequence upstream and downstream of the exon junctions. By focusing on protein sequence, Exon ByPASS predicts exon skipping independent of tissue and species in the absence of any intronic information. We validate model predictions using transcriptomic and proteomic data and show that the model can capture exon skipping in different tissues and species. Additionally, we reveal potential therapeutic opportunities by predicting synthetically skippable exons and neo-junctions arising in cancer cells.
Topics: Alternative Splicing; Amino Acid Sequence; Exons; Introns; Proteomics
PubMed: 35286381
DOI: 10.1093/nar/gkac155
- BMC Research Notes, Feb 2024
OBJECTIVE
The superfamily of protein kinases features a common Protein Kinase-like (PKL) three-dimensional fold. Proteins with PKL structure can also possess enzymatic activities other than protein phosphorylation, such as AMPylation or glutamylation. PKL proteins play a vital role in the world of living organisms, contributing to the survival of pathogenic bacteria inside host cells, as well as being involved in carcinogenesis and neurological diseases in humans. The superfamily of PKL proteins is constantly growing. Therefore, it is crucial to gather new information about PKL families.
RESULTS
To this end, the KINtaro database (http://bioinfo.sggw.edu.pl/kintaro/) has been created as a resource for collecting and sharing such information. KINtaro combines protein sequence information and additional annotations for more than 70 PKL families, including 32 families not associated with the PKL superfamily in established protein domain databases. KINtaro is searchable by keywords and by protein sequence and provides family descriptions, sequences, sequence alignments, HMM models, 3D structure models, experimental structures with PKL domain annotations and sequence logos with catalytic residue annotations.
Topics: Humans; Protein Kinases; Proteins; Phosphorylation; Amino Acid Sequence; Sequence Alignment; Databases, Protein
PubMed: 38365785
DOI: 10.1186/s13104-024-06713-y
- Bioinformatics (Oxford, England), Jan 2023
MOTIVATION
As more experimentally determined protein structures become available, data-driven models to describe protein sequence-structure relationships become more feasible. Within this space, the amino acid sequence design of protein-protein interactions is still a rather challenging subproblem with very low success rates, yet it is central to most biological processes.
RESULTS
We developed an attention-based deep learning model inspired by algorithms used for image-caption assignments to design peptides or protein fragment sequences. Our trained model can be applied for the redesign of natural protein interfaces or the designed protein interaction fragments. Here, we validate the potential by recapitulating naturally occurring protein-protein interactions including antibody-antigen complexes. The designed interfaces accurately capture essential native interactions and have comparable native-like binding affinities in silico. Furthermore, our model does not need a precise backbone location, making it an attractive tool for working with de novo design of protein-protein interactions.
AVAILABILITY AND IMPLEMENTATION
The source code of the method is available at https://github.com/strauchlab/iNNterfaceDesign.
SUPPLEMENTARY INFORMATION
Supplementary data are available at Bioinformatics online.
Topics: Amino Acid Sequence; Deep Learning; Proteins; Algorithms; Peptides; Software
PubMed: 36377772
DOI: 10.1093/bioinformatics/btac733
- Current Opinion in Structural Biology, Aug 2021
Review
Machine learning (ML) can expedite directed evolution by allowing researchers to move expensive experimental screens in silico. Gathering sequence-function data for training ML models, however, can still be costly. In contrast, raw protein sequence data is widely available. Recent advances in ML approaches use protein sequences to augment limited sequence-function data for directed evolution. We highlight contributions in a growing effort to use sequences to reduce or eliminate the amount of sequence-function data needed for effective in silico screening. We also highlight approaches that use ML models trained on sequences to generate new functional sequence diversity, focusing on strategies that use these generative models to efficiently explore vast regions of protein space.
Topics: Amino Acid Sequence; Computer Simulation; Machine Learning; Proteins
PubMed: 33647531
DOI: 10.1016/j.sbi.2021.01.008
- International Journal of Molecular... Feb 2023
Derived from natural language processing (NLP) algorithms, protein language models enable the encoding of protein sequences, which are widely diverse in length and amino acid composition, in fixed-size numerical vectors (embeddings). We surveyed representative embedding models such as Esm, Esm1b, ProtT5, and SeqVec, along with their derivatives (GoPredSim and PLAST), to conduct the following tasks in computational biology: embedding the Saccharomyces cerevisiae proteome, gene ontology (GO) annotation of the uncharacterized proteins of this organism, relating variants of human proteins to disease status, correlating mutants of beta-lactamase TEM-1 from Escherichia coli with experimentally measured antimicrobial resistance, and analyzing diverse fungal mating factors. We discuss the advances and shortcomings, differences, and concordance of the models. Of note, all of the models revealed that the uncharacterized proteins in yeast tend to be less than 200 amino acids long, contain fewer aspartates and glutamates, and are enriched for cysteine. Less than half of these proteins can be annotated with GO terms with high confidence. The distribution of the cosine similarity scores of benign and pathogenic mutations to the reference human proteins shows a statistically significant difference. The differences in embeddings of the reference TEM-1 and mutants have low to no correlation with minimal inhibitory concentrations (MIC).
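The cosine similarity score used to compare mutant and reference embeddings can be written out in a few lines of Python. This is a minimal sketch: the function names are hypothetical, and mean pooling is shown only as one common way to obtain a fixed-size vector from per-residue vectors; it is an assumption here, not a description of how the surveyed models were applied.

```python
import math


def mean_pool(per_residue_vectors):
    """Average per-residue vectors into one fixed-size embedding.

    Assumption of this sketch: the sequence is non-empty and every residue
    vector has the same dimension.
    """
    n = len(per_residue_vectors)
    dim = len(per_residue_vectors[0])
    return [sum(vec[d] for vec in per_residue_vectors) / n for d in range(dim)]


def cosine_similarity(u, v):
    """Cosine of the angle between two embeddings of equal dimension."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)
```

A score near 1 means the two embeddings point in nearly the same direction, while a score near 0 means they are close to orthogonal. Because pooling yields the same dimension regardless of sequence length, proteins of different lengths can be compared directly.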
Topics: Humans; Algorithms; Amino Acid Sequence; Amino Acids; Computational Biology; Proteins; Saccharomyces cerevisiae; Proteomics
PubMed: 36835188
DOI: 10.3390/ijms24043775