-
Nature Biotechnology Aug 2023
Deep-learning language models have shown promise in various biotechnological applications, including protein design and engineering. Here we describe ProGen, a language model that can generate protein sequences with a predictable function across large protein families, akin to generating grammatically and semantically correct natural language sentences on diverse topics. The model was trained on 280 million protein sequences from >19,000 families and is augmented with control tags specifying protein properties. ProGen can be further fine-tuned to curated sequences and tags to improve controllable generation performance of proteins from families with sufficient homologous samples. Artificial proteins fine-tuned to five distinct lysozyme families showed catalytic efficiencies similar to those of natural lysozymes, with sequence identity to natural proteins as low as 31.4%. ProGen is readily adapted to diverse protein families, as we demonstrate with chorismate mutase and malate dehydrogenase.
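To make the "sequence identity" figure above concrete, here is a minimal sketch of one common way to compute percent identity from two already-aligned sequences. The abstract does not say how the 31.4% value was calculated; the function name `percent_identity`, the toy sequences, and the choice of denominator (columns where neither sequence has a gap) are all assumptions for illustration, and other conventions exist.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over columns where both sequences have a residue.

    Assumes the two strings come from the same pairwise alignment and
    therefore have equal length. The denominator convention is an assumption.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only columns in which neither sequence has a gap character.
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if a != gap and b != gap]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)


# Hypothetical toy alignment: 6 gap-free columns, 5 of them identical.
toy_identity = percent_identity("MKT-LLV", "MRTALLV")
print(round(toy_identity, 1))  # 83.3
```

Excluding gapped columns tends to give a higher value than dividing by the full alignment length, so identity figures from different sources are only comparable when the convention is known.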
Topics: Estrogens, Conjugated (USP); Amino Acid Sequence; Proteins; Chorismate Mutase; Language
PubMed: 36702895
DOI: 10.1038/s41587-022-01618-2
-
Nature Aug 2023
An outstanding mystery in biology is why some species, such as the axolotl, can regenerate tissues whereas mammals cannot. Here, we demonstrate that rapid activation of protein synthesis is a unique feature of the injury response critical for limb regeneration in the axolotl (Ambystoma mexicanum). By applying polysome sequencing, we identify hundreds of transcripts, including antioxidants and ribosome components that are selectively activated at the level of translation from pre-existing messenger RNAs in response to injury. By contrast, protein synthesis is not activated in response to non-regenerative digit amputation in the mouse. We identify the mTORC1 pathway as a key upstream signal that mediates tissue regeneration and translational control in the axolotl. We discover unique expansions in mTOR protein sequence among urodele amphibians. By engineering an axolotl mTOR (axmTOR) in human cells, we show that these changes create a hypersensitive kinase that allows axolotls to maintain this pathway in a highly labile state primed for rapid activation. This change renders axolotl mTOR more sensitive to nutrient sensing, and inhibition of amino acid transport is sufficient to inhibit tissue regeneration. Together, these findings highlight the unanticipated impact of the translatome on orchestrating the early steps of wound healing in a highly regenerative species and provide a missing link in our understanding of vertebrate regenerative potential.
Topics: Animals; Humans; Mice; Ambystoma mexicanum; Amino Acid Sequence; Extremities; Regeneration; RNA, Messenger; TOR Serine-Threonine Kinases; Wound Healing; Mechanistic Target of Rapamycin Complex 1; Biological Evolution; Species Specificity; Protein Biosynthesis; Antioxidants; Nutrients; Polyribosomes
PubMed: 37495694
DOI: 10.1038/s41586-023-06365-1
-
Nature Communications Oct 2023
Most eukaryotic proteins are N-terminally acetylated, but the functional impact on a global scale has remained obscure. Using genome-wide CRISPR knockout screens in human cells, we reveal a strong genetic dependency between a major N-terminal acetyltransferase and specific ubiquitin ligases. Biochemical analyses uncover that both the ubiquitin ligase complex UBR4-KCMF1 and the acetyltransferase NatC recognize proteins bearing an unacetylated N-terminal methionine followed by a hydrophobic residue. NatC KO-induced protein degradation and phenotypes are reversed by UBR knockdown, demonstrating the central cellular role of this interplay. We reveal that loss of Drosophila NatC is associated with male sterility, reduced longevity, and age-dependent loss of motility due to developmental muscle defects. Remarkably, muscle-specific overexpression of UbcE2M, one of the proteins targeted for NatC KO-mediated degradation, suppresses defects of NatC deletion. In conclusion, NatC-mediated N-terminal acetylation acts as a protective mechanism against protein degradation, which is relevant for increased longevity and motility.
Topics: Male; Humans; Amino Acid Sequence; Acetylation; Longevity; Protein Processing, Post-Translational; Ubiquitins; Ubiquitin-Protein Ligases
PubMed: 37891180
DOI: 10.1038/s41467-023-42342-y
-
Nature Oct 2023
We are now entering a new era in protein sequence and structure annotation, with hundreds of millions of predicted protein structures made available through the AlphaFold database. These models cover nearly all proteins that are known, including those challenging to annotate for function or putative biological role using standard homology-based approaches. In this study, we examine the extent to which the AlphaFold database has structurally illuminated this 'dark matter' of the natural protein universe at high predicted accuracy. We further describe the protein diversity that these models cover as an annotated interactive sequence similarity network, accessible at https://uniprot3d.org/atlas/AFDB90v4. By searching for novelties from sequence, structure and semantic perspectives, we uncovered the β-flower fold, added several protein families to the Pfam database and experimentally demonstrated that one of these belongs to a new superfamily of translation-targeting toxin-antitoxin systems, TumE-TumA. This work underscores the value of large-scale efforts in identifying, annotating and prioritizing new protein families. By leveraging the recent deep learning revolution in protein bioinformatics, we can now shed light on uncharted areas of the protein universe at an unprecedented scale, paving the way to innovations in life sciences and biotechnology.
Topics: Amino Acid Sequence; Databases, Protein; Deep Learning; Internet; Molecular Sequence Annotation; Protein Folding; Proteins; Structural Homology, Protein
PubMed: 37704037
DOI: 10.1038/s41586-023-06622-3
-
BMC Research Notes Feb 2024
OBJECTIVE
The superfamily of protein kinases features a common Protein Kinase-like (PKL) three-dimensional fold. Proteins with PKL structure can also possess enzymatic activities other than protein phosphorylation, such as AMPylation or glutamylation. PKL proteins play a vital role in the world of living organisms, contributing to the survival of pathogenic bacteria inside host cells, as well as being involved in carcinogenesis and neurological diseases in humans. The superfamily of PKL proteins is constantly growing. Therefore, it is crucial to gather new information about PKL families.
RESULTS
To this end, the KINtaro database (http://bioinfo.sggw.edu.pl/kintaro/) has been created as a resource for collecting and sharing such information. KINtaro combines protein sequence information and additional annotations for more than 70 PKL families, including 32 families not associated with the PKL superfamily in established protein domain databases. KINtaro is searchable by keywords and by protein sequence and provides family descriptions, sequences, sequence alignments, HMM models, 3D structure models, experimental structures with PKL domain annotations and sequence logos with catalytic residue annotations.
Topics: Humans; Protein Kinases; Proteins; Phosphorylation; Amino Acid Sequence; Sequence Alignment; Databases, Protein
PubMed: 38365785
DOI: 10.1186/s13104-024-06713-y
-
Trends in Cell Biology Aug 2023
Review
Liquid-liquid phase separation (LLPS) is emerging as a mechanism of spatiotemporal regulation that could answer long-standing questions about how order is achieved in biochemical signaling. In this review we discuss how LLPS orchestrates kinase signaling, either by creating condensate structures that are sensed by kinases or by direct LLPS of kinases, cofactors, and substrates - thereby acting as a mechanism to compartmentalize kinase-substrate relationships, and in some cases also sequestering the kinase away from inhibitory factors. We also examine the possibility that selective pressure promotes genomic rearrangements that fuse pro-growth kinases to LLPS-prone protein sequences, which in turn drives aberrant kinase activation through LLPS.
Topics: Humans; Intrinsically Disordered Proteins; Amino Acid Sequence
PubMed: 36528418
DOI: 10.1016/j.tcb.2022.11.009
-
BMC Bioinformatics Feb 2024
PURPOSE
Despite considerable progress in alignment algorithms, aligning divergent protein sequences with less than 20-35% pairwise identity (the so-called "twilight zone") remains a difficult problem. Many alignment algorithms have used substitution matrices since their creation in the 1970s to generate alignments; however, these matrices do not work well for scoring alignments within the twilight zone. We developed Protein Embedding based Alignments, or PEbA, to better align sequences with low pairwise identity. Like the traditional Smith-Waterman algorithm, PEbA uses dynamic programming, but the matching score of amino acids is based on the similarity of their embeddings from a protein language model.
METHODS
We tested PEbA on over twelve thousand benchmark pairwise alignments from BAliBASE, each one extracted from one of their multiple sequence alignments. Five different BAliBASE references were used, each with different sequence identities, motifs, and lengths, allowing PEbA to showcase how well it aligns under different circumstances.
RESULTS
PEbA greatly outperformed BLOSUM substitution matrix-based pairwise alignments, achieving varying degrees of improvement in alignment quality for sequence pairs at different levels of similarity (performing over four times as well for pairs of sequences with <10% identity). We also compared PEbA with embeddings generated by different protein language models (ProtT5 and ESM-2) and found that ProtT5-XL-U50 produced the most useful embeddings for aligning protein sequences. PEbA also outperformed DEDAL and vcMSA, two recently developed protein language model embedding-based alignment methods.
CONCLUSION
Our results suggested that general purpose protein language models provide useful contextual information for generating more accurate protein alignments than typically used methods.
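The idea described in this abstract, Smith-Waterman dynamic programming with a match score taken from embedding similarity instead of a substitution matrix, can be sketched roughly as follows. This is not the PEbA implementation. The cosine-similarity score, the linear gap penalty of 0.5, the function names, and the toy one-hot "embeddings" are all assumptions for illustration; a real run would use per-residue embeddings from a protein language model such as ProtT5.

```python
import numpy as np


def cosine_matrix(emb_a: np.ndarray, emb_b: np.ndarray) -> np.ndarray:
    """Cosine similarity between every residue embedding of A and of B."""
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    return a @ b.T


def local_align(emb_a: np.ndarray, emb_b: np.ndarray, gap: float = 0.5):
    """Smith-Waterman local alignment scored by embedding similarity.

    Returns (best_score, pairs), where pairs lists the aligned
    (index_in_A, index_in_B) positions. The linear gap penalty is an
    assumption made to keep the sketch short.
    """
    sim = cosine_matrix(emb_a, emb_b)
    n, m = sim.shape
    H = np.zeros((n + 1, m + 1))
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            H[i, j] = max(
                0.0,                                  # start a new local alignment
                H[i - 1, j - 1] + sim[i - 1, j - 1],  # align residue i with residue j
                H[i - 1, j] - gap,                    # gap in B
                H[i, j - 1] - gap,                    # gap in A
            )
    # Trace back from the highest-scoring cell until the score drops to zero.
    i, j = (int(k) for k in np.unravel_index(np.argmax(H), H.shape))
    best = float(H[i, j])
    pairs = []
    while i > 0 and j > 0 and H[i, j] > 0:
        if np.isclose(H[i, j], H[i - 1, j - 1] + sim[i - 1, j - 1]):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif np.isclose(H[i, j], H[i - 1, j] - gap):
            i -= 1
        else:
            j -= 1
    return best, pairs[::-1]


# Toy stand-in for language-model embeddings: orthogonal unit vectors,
# so identical "residues" score 1 and different ones score 0.
basis = np.eye(4)
emb_a = basis[[0, 1, 2]]       # sequence A: 3 residues
emb_b = basis[[3, 0, 1, 2]]    # sequence B: one extra leading residue
score, pairs = local_align(emb_a, emb_b)
print(score, pairs)  # 3.0 [(0, 1), (1, 2), (2, 3)]
```

Because the score comes from context-dependent embeddings, two positions holding the same amino acid can score differently, which a fixed substitution matrix cannot express.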
Topics: Proteins; Boronic Acids; Amino Acid Sequence; Sequence Alignment; Algorithms
PubMed: 38413857
DOI: 10.1186/s12859-024-05699-5
-
ACS Nano Sep 2023
Review
Biotechnological innovations have vastly improved the capacity to perform large-scale protein studies, while the methods we have for identifying and quantifying individual proteins are still inadequate to perform protein sequencing at the single-molecule level. Nanopore-inspired systems devoted to understanding how single molecules behave have been extensively developed for applications in genome sequencing. These nanopore systems are emerging as prominent tools for protein identification, detection, and analysis, suggesting realistic prospects for novel protein sequencing. This review summarizes recent advances in biological nanopore sensors toward protein sequencing, from the identification of individual amino acids to the controlled translocation of peptides and proteins, with attention focused on device and algorithm development and the delineation of molecular mechanisms with the aid of simulations. Specifically, the review aims to offer recommendations for the advancement of nanopore-based protein sequencing from an engineering perspective, highlighting the need for collaborative efforts across multiple disciplines. These efforts should include chemical conjugation, protein engineering, molecular simulation, machine-learning-assisted identification, and electronic device fabrication to enable practical implementation in real-world scenarios.
Topics: Amino Acid Sequence; Peptides; Proteins; Base Sequence; Amino Acids; Nanopores
PubMed: 37490313
DOI: 10.1021/acsnano.3c05628
-
Briefings in Bioinformatics Sep 2023
Review
The widespread adoption of high-throughput omics technologies has exponentially increased the amount of protein sequence data involved in many salient disease pathways and their respective therapeutics and diagnostics. Despite the availability of large-scale sequence data, the lack of experimental fitness annotations underpins the need for self-supervised and unsupervised machine learning (ML) methods. These techniques leverage the meaningful features encoded in abundant unlabeled sequences to accomplish complex protein engineering tasks. Proficiency in the rapidly evolving fields of protein engineering and generative AI is required to realize the full potential of ML models as a tool for protein fitness landscape navigation. Here, we support this work by (i) providing an overview of the architecture and mathematical details of the most successful ML models applicable to sequence data (e.g. variational autoencoders, autoregressive models, generative adversarial neural networks, and diffusion models), (ii) guiding how to effectively implement these models on protein sequence data to predict fitness or generate high-fitness sequences and (iii) highlighting several successful studies that implement these techniques in protein engineering (from paratope regions and subcellular localization prediction to high-fitness sequences and protein design rules generation). By providing a comprehensive survey of model details, novel architecture developments, comparisons of model applications, and current challenges, this study intends to provide structured guidance and a robust framework for delivering a prospective outlook in the ML-driven protein engineering field.
Topics: Amino Acid Sequence; Exercise; Neural Networks, Computer; Proteins; Unsupervised Machine Learning
PubMed: 37864295
DOI: 10.1093/bib/bbad358
-
Channels (Austin, Tex.) Dec 2023
Review
Voltage-gated sodium channels initiate action potentials in nerve and muscle, and voltage-gated calcium channels couple depolarization of the plasma membrane to intracellular events such as secretion, contraction, synaptic transmission, and gene expression. In this Review and Perspective article, I summarize early work that led to identification, purification, functional reconstitution, and determination of the amino acid sequence of the protein subunits of sodium and calcium channels and showed that their pore-forming subunits are closely related. Decades of study by antibody mapping, site-directed mutagenesis, and electrophysiological recording led to detailed two-dimensional structure-function maps of the amino acid residues involved in voltage-dependent activation and inactivation, ion permeation and selectivity, and pharmacological modulation. Most recently, high-resolution three-dimensional structure determination by X-ray crystallography and cryogenic electron microscopy has revealed the structural basis for sodium and calcium channel function and pharmacological modulation at the atomic level. These studies now define the chemical basis for electrical signaling and provide templates for future development of new therapeutic agents for a range of neurological and cardiovascular diseases.
Topics: Calcium Channels; Sodium; Voltage-Gated Sodium Channels; Amino Acid Sequence; Action Potentials; Calcium
PubMed: 37983307
DOI: 10.1080/19336950.2023.2281714