-
Nature Biotechnology Aug 2023
Deep-learning language models have shown promise in various biotechnological applications, including protein design and engineering. Here we describe ProGen, a language model that can generate protein sequences with a predictable function across large protein families, akin to generating grammatically and semantically correct natural language sentences on diverse topics. The model was trained on 280 million protein sequences from >19,000 families and is augmented with control tags specifying protein properties. ProGen can be further fine-tuned to curated sequences and tags to improve controllable generation performance of proteins from families with sufficient homologous samples. Artificial proteins fine-tuned to five distinct lysozyme families showed similar catalytic efficiencies as natural lysozymes, with sequence identity to natural proteins as low as 31.4%. ProGen is readily adapted to diverse protein families, as we demonstrate with chorismate mutase and malate dehydrogenase.
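The "sequence identity to natural proteins" figure quoted above is a pairwise comparison metric. As a minimal sketch of how such a number can be computed, the function below takes two already-aligned sequences and reports the percentage of matching residues. The abstract does not say how identity was defined, so the choice of denominator (columns where neither sequence has a gap) is an assumption, and the function name and the toy sequences are hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over aligned columns where neither sequence has a gap.

    Assumption: both inputs come from the same pairwise alignment and use '-'
    for gaps. Other conventions (e.g. dividing by the shorter sequence length)
    would give different values.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # skip gap columns
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


# Toy aligned pair, invented for illustration (not real lysozyme sequences).
natural = "MKAL-IVLGA"
designed = "MRALQIV-GS"
identity = percent_identity(natural, designed)
print(f"{identity:.1f}% identity")
```

In practice the alignment itself would come from a dedicated alignment tool; this sketch covers only the counting step.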
Topics: Estrogens, Conjugated (USP); Amino Acid Sequence; Proteins; Chorismate Mutase; Language
PubMed: 36702895
DOI: 10.1038/s41587-022-01618-2
-
Nature Reviews. Molecular Cell Biology Mar 2024 (Review)
Intrinsically disordered protein regions exist in a collection of dynamic interconverting conformations that lack a stable 3D structure. These regions are structurally heterogeneous, ubiquitous and found across all kingdoms of life. Despite the absence of a defined 3D structure, disordered regions are essential for cellular processes ranging from transcriptional control and cell signalling to subcellular organization. Through their conformational malleability and adaptability, disordered regions extend the repertoire of macromolecular interactions and are readily tunable by their structural and chemical context, making them ideal responders to regulatory cues. Recent work has led to major advances in understanding the link between protein sequence and conformational behaviour in disordered regions, yet the link between sequence and molecular function is less well defined. Here we consider the biochemical and biophysical foundations that underlie how and why disordered regions can engage in productive cellular functions, provide examples of emerging concepts and discuss how protein disorder contributes to intracellular information processing and regulation of cellular function.
Topics: Intrinsically Disordered Proteins; Protein Conformation; Amino Acid Sequence; Macromolecular Substances
PubMed: 37957331
DOI: 10.1038/s41580-023-00673-0
-
Nature Feb 2024
Intrinsically disordered proteins and regions (collectively, IDRs) are pervasive across proteomes in all kingdoms of life, help to shape biological functions and are involved in numerous diseases. IDRs populate a diverse set of transiently formed structures and defy conventional sequence-structure-function relationships. Developments in protein science have made it possible to predict the three-dimensional structures of folded proteins at the proteome scale. By contrast, there is a lack of knowledge about the conformational properties of IDRs, partly because the sequences of disordered proteins are poorly conserved and also because only a few of these proteins have been characterized experimentally. The inability to predict structural properties of IDRs across the proteome has limited our understanding of the functional roles of IDRs and how evolution shapes them. As a supplement to previous structural studies of individual IDRs, we developed an efficient molecular model to generate conformational ensembles of IDRs and thereby to predict their conformational properties from sequences. Here we use this model to simulate nearly all of the IDRs in the human proteome. Examining conformational ensembles of 28,058 IDRs, we show how chain compaction is correlated with cellular function and localization. We provide insights into how sequence features relate to chain compaction and, using a machine-learning model trained on our simulation data, show the conservation of conformational properties across orthologues. Our results recapitulate observations from previous studies of individual protein systems and exemplify how to link, at the proteome scale, conformational ensembles with cellular function and localization, amino acid sequence, evolutionary conservation and disease variants. Our freely available database of conformational properties will encourage further experimental investigation and enable the generation of hypotheses about the biological roles and evolution of IDRs.
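Chain compaction, the property this abstract relates to function and localization, is commonly summarized by the radius of gyration of each conformation, averaged over an ensemble. The sketch below is illustrative only and is not the authors' model: it assumes one equal-mass bead per residue, coordinates in nanometres, and a hypothetical 0.38 nm bead spacing for the toy conformations.

```python
import math


def radius_of_gyration(coords):
    """Radius of gyration of one conformation, treating beads as equal masses."""
    n = len(coords)
    cx = sum(p[0] for p in coords) / n
    cy = sum(p[1] for p in coords) / n
    cz = sum(p[2] for p in coords) / n
    mean_sq = sum(
        (x - cx) ** 2 + (y - cy) ** 2 + (z - cz) ** 2 for x, y, z in coords
    ) / n
    return math.sqrt(mean_sq)


def ensemble_rg(ensemble):
    """Root-mean-square radius of gyration over a list of conformations."""
    mean_sq = sum(radius_of_gyration(c) ** 2 for c in ensemble) / len(ensemble)
    return math.sqrt(mean_sq)


# Toy five-bead conformations (invented coordinates, nm).
extended = [(0.38 * i, 0.0, 0.0) for i in range(5)]
compact = [
    (0.0, 0.0, 0.0),
    (0.38, 0.0, 0.0),
    (0.38, 0.38, 0.0),
    (0.0, 0.38, 0.0),
    (0.0, 0.0, 0.38),
]

rg_extended = radius_of_gyration(extended)
rg_compact = radius_of_gyration(compact)
rg_ensemble = ensemble_rg([extended, compact])
print(f"extended: {rg_extended:.3f} nm, compact: {rg_compact:.3f} nm")
```

A smaller value for the same chain length indicates a more compact chain, which is the sense in which compaction can be compared across sequences.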
Topics: Humans; Amino Acid Sequence; Intrinsically Disordered Proteins; Models, Molecular; Protein Conformation; Proteome; Structure-Activity Relationship; Evolution, Molecular; Disease
PubMed: 38297118
DOI: 10.1038/s41586-023-07004-5
-
Cell Systems Aug 2023 (Review)
Discovery and evolution of new and improved proteins has empowered molecular therapeutics, diagnostics, and industrial biotechnology. Discovery and evolution both require efficient screens and effective libraries, although they differ in their challenges because of the absence or presence, respectively, of an initial protein variant with the desired function. A host of high-throughput technologies, experimental and computational, enable efficient screens to identify performant protein variants. In partnership, an informed search of sequence space is needed to overcome the immensity, sparsity, and complexity of the sequence-performance landscape. Early in the historical trajectory of protein engineering, these elements aligned with distinct approaches to identify the most performant sequence: selection from large, randomized combinatorial libraries versus rational computational design. Substantial advances have now emerged from the synergy of these perspectives. Rational design of combinatorial libraries aids the experimental search of sequence space, and high-throughput, high-integrity experimental data inform computational design. At the core of the collaborative interface, efficient protein characterization (rather than mere selection of optimal variants) maps sequence-performance landscapes. Such quantitative maps elucidate the complex relationships between protein sequence and performance (e.g., binding, catalytic efficiency, biological activity, and developability), thereby advancing fundamental protein science and facilitating protein discovery and evolution.
Topics: Directed Molecular Evolution; Protein Engineering; Biotechnology; Proteins; Amino Acid Sequence
PubMed: 37494931
DOI: 10.1016/j.cels.2023.06.009
-
Cell Systems Aug 2023 (Review)
Machine learning is transforming antibody engineering by enabling the generation of drug-like monoclonal antibodies with unprecedented efficiency. Unsupervised algorithms trained on massive and diverse protein sequence datasets facilitate the prediction of panels of antibody variants with native-like intrinsic properties (e.g., high stability), greatly reducing the amount of subsequent experimentation needed to identify specific candidates that also possess desired extrinsic properties (e.g., high affinity). Additionally, supervised algorithms, which are trained on deep sequencing datasets obtained after enrichment of in vitro antibody libraries for one or more specific extrinsic properties, enable the prediction of antibody variants with desired combinations of extrinsic properties without the need for additional screening. Here we review recent advances using both machine learning approaches and how they are impacting the field of antibody engineering as well as key outstanding challenges and opportunities for these paradigm-changing methods.
Topics: Antibodies, Monoclonal; Algorithms; Amino Acid Sequence; Engineering; Machine Learning
PubMed: 37591204
DOI: 10.1016/j.cels.2023.04.009
-
Nature Oct 2023
We are now entering a new era in protein sequence and structure annotation, with hundreds of millions of predicted protein structures made available through the AlphaFold database. These models cover nearly all proteins that are known, including those challenging to annotate for function or putative biological role using standard homology-based approaches. In this study, we examine the extent to which the AlphaFold database has structurally illuminated this 'dark matter' of the natural protein universe at high predicted accuracy. We further describe the protein diversity that these models cover as an annotated interactive sequence similarity network, accessible at https://uniprot3d.org/atlas/AFDB90v4 . By searching for novelties from sequence, structure and semantic perspectives, we uncovered the β-flower fold, added several protein families to Pfam database and experimentally demonstrated that one of these belongs to a new superfamily of translation-targeting toxin-antitoxin systems, TumE-TumA. This work underscores the value of large-scale efforts in identifying, annotating and prioritizing new protein families. By leveraging the recent deep learning revolution in protein bioinformatics, we can now shed light into uncharted areas of the protein universe at an unprecedented scale, paving the way to innovations in life sciences and biotechnology.
Topics: Amino Acid Sequence; Databases, Protein; Deep Learning; Internet; Molecular Sequence Annotation; Protein Folding; Proteins; Structural Homology, Protein
PubMed: 37704037
DOI: 10.1038/s41586-023-06622-3
-
Briefings in Bioinformatics Sep 2023 (Review)
The widespread adoption of high-throughput omics technologies has exponentially increased the amount of protein sequence data involved in many salient disease pathways and their respective therapeutics and diagnostics. Despite the availability of large-scale sequence data, the lack of experimental fitness annotations underpins the need for self-supervised and unsupervised machine learning (ML) methods. These techniques leverage the meaningful features encoded in abundant unlabeled sequences to accomplish complex protein engineering tasks. Proficiency in the rapidly evolving fields of protein engineering and generative AI is required to realize the full potential of ML models as a tool for protein fitness landscape navigation. Here, we support this work by (i) providing an overview of the architecture and mathematical details of the most successful ML models applicable to sequence data (e.g. variational autoencoders, autoregressive models, generative adversarial neural networks, and diffusion models), (ii) guiding how to effectively implement these models on protein sequence data to predict fitness or generate high-fitness sequences and (iii) highlighting several successful studies that implement these techniques in protein engineering (from paratope regions and subcellular localization prediction to high-fitness sequences and protein design rules generation). By providing a comprehensive survey of model details, novel architecture developments, comparisons of model applications, and current challenges, this study intends to provide structured guidance and robust framework for delivering a prospective outlook in the ML-driven protein engineering field.
Topics: Amino Acid Sequence; Exercise; Neural Networks, Computer; Proteins; Unsupervised Machine Learning
PubMed: 37864295
DOI: 10.1093/bib/bbad358
-
Nature Biotechnology Feb 2024 (Review)
Recent breakthroughs in AI coupled with the rapid accumulation of protein sequence and structure data have radically transformed computational protein design. New methods promise to escape the constraints of natural and laboratory evolution, accelerating the generation of proteins for applications in biotechnology and medicine. To make sense of the exploding diversity of machine learning approaches, we introduce a unifying framework that classifies models on the basis of their use of three core data modalities: sequences, structures and functional labels. We discuss the new capabilities and outstanding challenges for the practical design of enzymes, antibodies, vaccines, nanomachines and more. We then highlight trends shaping the future of this field, from large-scale assays to more robust benchmarks, multimodal foundation models, enhanced sampling strategies and laboratory automation.
Topics: Proteins; Machine Learning; Biotechnology; Amino Acid Sequence; Antibodies
PubMed: 38361074
DOI: 10.1038/s41587-024-02127-0
-
BMC Research Notes Feb 2024
OBJECTIVE
The superfamily of protein kinases features a common Protein Kinase-like (PKL) three-dimensional fold. Proteins with PKL structure can also possess enzymatic activities other than protein phosphorylation, such as AMPylation or glutamylation. PKL proteins play a vital role in the world of living organisms, contributing to the survival of pathogenic bacteria inside host cells, as well as being involved in carcinogenesis and neurological diseases in humans. The superfamily of PKL proteins is constantly growing. Therefore, it is crucial to gather new information about PKL families.
RESULTS
To this end, the KINtaro database ( http://bioinfo.sggw.edu.pl/kintaro/ ) has been created as a resource for collecting and sharing such information. KINtaro combines protein sequence information and additional annotations for more than 70 PKL families, including 32 families not associated with PKL superfamily in established protein domain databases. KINtaro is searchable by keywords and by protein sequence and provides family descriptions, sequences, sequence alignments, HMM models, 3D structure models, experimental structures with PKL domain annotations and sequence logos with catalytic residue annotations.
Topics: Humans; Protein Kinases; Proteins; Phosphorylation; Amino Acid Sequence; Sequence Alignment; Databases, Protein
PubMed: 38365785
DOI: 10.1186/s13104-024-06713-y
-
Molecular Cell Jul 2023 (Review)
A fundamental challenge in biology is understanding the molecular details of protein function. How mutations alter protein activity, regulation, and response to drugs is of critical importance to human health. Recent years have seen the emergence of pooled base editor screens for in situ mutational scanning: the interrogation of protein sequence-function relationships by directly perturbing endogenous proteins in live cells. These studies have revealed the effects of disease-associated mutations, discovered novel drug resistance mechanisms, and generated biochemical insights into protein function. Here, we discuss how this "base editor scanning" approach has been applied to diverse biological questions, compare it with alternative techniques, and describe the emerging challenges that must be addressed to maximize its utility. Given its broad applicability toward profiling mutations across the proteome, base editor scanning promises to revolutionize the investigation of proteins in their native contexts.
Topics: Humans; Gene Editing; CRISPR-Cas Systems; Mutation; Proteome; Amino Acid Sequence
PubMed: 37390819
DOI: 10.1016/j.molcel.2023.06.009