-
Briefings in Bioinformatics Sep 2023
Review
The widespread adoption of high-throughput omics technologies has exponentially increased the amount of protein sequence data involved in many salient disease pathways and their respective therapeutics and diagnostics. Despite the availability of large-scale sequence data, the lack of experimental fitness annotations underpins the need for self-supervised and unsupervised machine learning (ML) methods. These techniques leverage the meaningful features encoded in abundant unlabeled sequences to accomplish complex protein engineering tasks. Proficiency in the rapidly evolving fields of protein engineering and generative AI is required to realize the full potential of ML models as a tool for protein fitness landscape navigation. Here, we support this work by (i) providing an overview of the architecture and mathematical details of the most successful ML models applicable to sequence data (e.g. variational autoencoders, autoregressive models, generative adversarial neural networks, and diffusion models), (ii) offering guidance on how to implement these models effectively on protein sequence data to predict fitness or generate high-fitness sequences and (iii) highlighting several successful studies that implement these techniques in protein engineering (from prediction of paratope regions and subcellular localization to generation of high-fitness sequences and protein design rules). By providing a comprehensive survey of model details, novel architecture developments, comparisons of model applications, and current challenges, this study intends to provide structured guidance and a robust framework for delivering a prospective outlook in the ML-driven protein engineering field.
Topics: Amino Acid Sequence; Exercise; Neural Networks, Computer; Proteins; Unsupervised Machine Learning
PubMed: 37864295
DOI: 10.1093/bib/bbad358
-
BMC Bioinformatics Feb 2024
PURPOSE
Despite considerable progress in alignment algorithms, aligning divergent protein sequences with less than 20-35% pairwise identity (the so-called "twilight zone") remains a difficult problem. Many alignment algorithms have used substitution matrices since their creation in the 1970s to generate alignments; however, these matrices do not work well for scoring alignments within the twilight zone. We developed Protein Embedding based Alignments, or PEbA, to better align sequences with low pairwise identity. Similar to the traditional Smith-Waterman algorithm, PEbA uses a dynamic programming algorithm, but the matching score of amino acids is based on the similarity of their embeddings from a protein language model.
METHODS
We tested PEbA on over twelve thousand benchmark pairwise alignments from BAliBASE, each one extracted from one of their multiple sequence alignments. Five different BAliBASE references were used, each with different sequence identities, motifs, and lengths, allowing PEbA to showcase how well it aligns under different circumstances.
RESULTS
PEbA greatly outperformed BLOSUM substitution matrix-based pairwise alignments, with the size of the improvement in alignment quality depending on the similarity of the sequence pair (it performed over four times as well for pairs of sequences with <10% identity). We also compared PEbA with embeddings generated by different protein language models (ProtT5 and ESM-2) and found that ProtT5-XL-U50 produced the most useful embeddings for aligning protein sequences. PEbA also outperformed DEDAL and vcMSA, two recently developed protein language model embedding-based alignment methods.
CONCLUSION
Our results suggest that general-purpose protein language models provide useful contextual information for generating more accurate protein alignments than typically used methods.
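The scoring idea described in this abstract can be illustrated with a minimal sketch. It is not the PEbA implementation: it assumes cosine similarity as the embedding similarity, a simple linear gap penalty, and toy embedding arrays in place of real protein language model output. The function names, the `gap` and `scale` parameters, and their default values are hypothetical.

```python
import numpy as np


def cosine_similarity_matrix(emb_a, emb_b):
    """Cosine similarity between every residue embedding in A and in B."""
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    return a @ b.T


def local_align(emb_a, emb_b, gap=-1.0, scale=1.0):
    """Smith-Waterman-style local alignment scored by embedding similarity.

    Assumptions (not taken from the source): linear gap penalty `gap`,
    and match score = scale * cosine similarity.
    Returns the best local score and the aligned (i, j) residue index pairs.
    """
    sim = scale * cosine_similarity_matrix(emb_a, emb_b)
    n, m = sim.shape
    H = np.zeros((n + 1, m + 1))
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            H[i, j] = max(
                0.0,                                   # start a new local alignment
                H[i - 1, j - 1] + sim[i - 1, j - 1],   # align residue i with residue j
                H[i - 1, j] + gap,                     # gap in sequence B
                H[i, j - 1] + gap,                     # gap in sequence A
            )
    # Trace back from the highest-scoring cell until the score drops to zero.
    i, j = (int(x) for x in np.unravel_index(np.argmax(H), H.shape))
    best = float(H[i, j])
    pairs = []
    while i > 0 and j > 0 and H[i, j] > 0:
        if np.isclose(H[i, j], H[i - 1, j - 1] + sim[i - 1, j - 1]):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif np.isclose(H[i, j], H[i - 1, j] + gap):
            i -= 1
        else:
            j -= 1
    return best, pairs[::-1]


# Toy "embeddings": one-hot rows stand in for per-residue model output.
toy_a = np.eye(4)
toy_b = np.eye(4)[1:3]
score, aligned = local_align(toy_a, toy_b)
```

In practice the per-residue embeddings would come from a protein language model, and the gap scheme and similarity scaling would need tuning; the sketch only shows where the embedding similarity replaces a substitution-matrix lookup.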
Topics: Proteins; Boronic Acids; Amino Acid Sequence; Sequence Alignment; Algorithms
PubMed: 38413857
DOI: 10.1186/s12859-024-05699-5
-
ACS Nano Sep 2023
Review
Biotechnological innovations have vastly improved the capacity to perform large-scale protein studies, while the methods we have for identifying and quantifying individual proteins are still inadequate to perform protein sequencing at the single-molecule level. Nanopore-inspired systems devoted to understanding how single molecules behave have been extensively developed for applications in genome sequencing. These nanopore systems are emerging as prominent tools for protein identification, detection, and analysis, suggesting realistic prospects for novel protein sequencing. This review summarizes recent advances in biological nanopore sensors toward protein sequencing, from the identification of individual amino acids to the controlled translocation of peptides and proteins, with attention focused on device and algorithm development and the delineation of molecular mechanisms with the aid of simulations. Specifically, the review aims to offer recommendations for the advancement of nanopore-based protein sequencing from an engineering perspective, highlighting the need for collaborative efforts across multiple disciplines. These efforts should include chemical conjugation, protein engineering, molecular simulation, machine-learning-assisted identification, and electronic device fabrication to enable practical implementation in real-world scenarios.
Topics: Amino Acid Sequence; Peptides; Proteins; Base Sequence; Amino Acids; Nanopores
PubMed: 37490313
DOI: 10.1021/acsnano.3c05628
-
Channels (Austin, Tex.) Dec 2023
Review
Voltage-gated sodium channels initiate action potentials in nerve and muscle, and voltage-gated calcium channels couple depolarization of the plasma membrane to intracellular events such as secretion, contraction, synaptic transmission, and gene expression. In this Review and Perspective article, I summarize early work that led to identification, purification, functional reconstitution, and determination of the amino acid sequence of the protein subunits of sodium and calcium channels and showed that their pore-forming subunits are closely related. Decades of study by antibody mapping, site-directed mutagenesis, and electrophysiological recording led to detailed two-dimensional structure-function maps of the amino acid residues involved in voltage-dependent activation and inactivation, ion permeation and selectivity, and pharmacological modulation. Most recently, high-resolution three-dimensional structure determination by X-ray crystallography and cryogenic electron microscopy has revealed the structural basis for sodium and calcium channel function and pharmacological modulation at the atomic level. These studies now define the chemical basis for electrical signaling and provide templates for future development of new therapeutic agents for a range of neurological and cardiovascular diseases.
Topics: Calcium Channels; Sodium; Voltage-Gated Sodium Channels; Amino Acid Sequence; Action Potentials; Calcium
PubMed: 37983307
DOI: 10.1080/19336950.2023.2281714
-
Molecular & Cellular Proteomics : MCP Aug 2023
Review
The human proteome comprises all of the proteins produced by the sequences translated from the human genome, with additional modifications in both sequence and function caused by nonsynonymous variants and posttranslational modifications, including cleavage of the initial transcript into smaller peptides and polypeptides. The UniProtKB database (www.uniprot.org) is the world's leading high-quality, comprehensive and freely accessible resource of protein sequence and functional information and presents a summary of experimentally verified, or computationally predicted, functional information added by our expert biocuration team for each protein in the proteome. Researchers in the field of mass spectrometry-based proteomics both consume and add to the body of data available in UniProtKB, and this review highlights the information we provide to this community and the knowledge we in turn obtain from groups via deposition of large-scale datasets in public domain databases.
Topics: Humans; Proteomics; Proteome; Databases, Protein; Amino Acid Sequence; Peptides
PubMed: 37301379
DOI: 10.1016/j.mcpro.2023.100591
-
Science (New York, N.Y.) Mar 2024
Many clinically used drugs are derived from or inspired by bacterial natural products that often are produced through nonribosomal peptide synthetases (NRPSs), megasynthetases that activate and join individual amino acids in an assembly line fashion. In this work, we describe a detailed phylogenetic analysis of several bacterial NRPSs that led to the identification of yet undescribed recombination sites within the thiolation (T) domain that can be used for NRPS engineering. We then developed an evolution-inspired "eXchange Unit between T domains" (XUT) approach, which allows the assembly of NRPS fragments over a broad range of GC contents, protein similarities, and extender unit specificities, as demonstrated for the specific production of a proteasome inhibitor designed and assembled from five different NRPS fragments.
Topics: Peptide Synthases; Phylogeny; Protein Engineering; Evolution, Molecular; Amino Acid Sequence; Bacterial Proteins; Sequence Analysis, Protein
PubMed: 38513038
DOI: 10.1126/science.adg4320
-
Bioanalysis Aug 2023
Topics: Mass Spectrometry; Amino Acid Sequence; Protein Processing, Post-Translational
PubMed: 37584366
DOI: 10.4155/bio-2023-0139
-
IEEE/ACM Transactions on Computational... 2023
The knowledge of protein-protein interaction (PPI) helps us to understand proteins' functions, the causes and growth of several diseases, and can aid in designing new drugs. The majority of existing PPI research has relied mainly on sequence-based approaches. With the availability of multi-omics datasets (sequence, 3D structure) and advancements in deep learning techniques, it is feasible to develop a deep multi-modal framework that fuses the features learned from different sources of information to predict PPI. In this work, we propose a multi-modal approach utilizing protein sequence and 3D structure. To extract features from the 3D structure of proteins, we use a pre-trained vision transformer model that has been fine-tuned on the structural representation of proteins. The protein sequence is encoded into a feature vector using a pre-trained language model. The feature vectors extracted from the two modalities are fused and then fed to the neural network classifier to predict the protein interactions. To showcase the effectiveness of the proposed methodology, we conduct experiments on two popular PPI datasets, namely, the human dataset and the S. cerevisiae dataset. Our approach outperforms the existing methodologies to predict PPI, including multi-modal approaches. We also evaluate the contributions of each modality by designing uni-modal baselines. We perform experiments with three modalities as well, having gene ontology as the third modality.
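The fuse-then-classify step described in this abstract can be sketched as follows. This is an illustration only: it assumes fusion by plain concatenation and a one-hidden-layer classifier with untrained random weights. The feature dimensions, function names, and weight shapes are hypothetical, and the random vectors stand in for the outputs of the pre-trained language and vision transformer models.

```python
import numpy as np


def fuse_pair(seq_a, struct_a, seq_b, struct_b):
    """Fuse sequence and structure features of both proteins.

    Assumption: fusion is simple concatenation; the source does not
    specify the fusion operator.
    """
    return np.concatenate([seq_a, struct_a, seq_b, struct_b])


def interaction_probability(x, w1, b1, w2, b2):
    """One-hidden-layer classifier: ReLU hidden layer, sigmoid output."""
    hidden = np.maximum(0.0, w1 @ x + b1)
    logit = float(w2 @ hidden + b2)
    return 1.0 / (1.0 + np.exp(-logit))


rng = np.random.default_rng(0)
SEQ_DIM, STRUCT_DIM, HIDDEN = 8, 6, 16   # hypothetical sizes

# Placeholder features; real ones would come from the pre-trained encoders.
features = fuse_pair(
    rng.normal(size=SEQ_DIM), rng.normal(size=STRUCT_DIM),
    rng.normal(size=SEQ_DIM), rng.normal(size=STRUCT_DIM),
)

# Untrained weights, for shape illustration only.
w1 = rng.normal(size=(HIDDEN, features.size)) * 0.1
b1 = np.zeros(HIDDEN)
w2 = rng.normal(size=HIDDEN) * 0.1
b2 = 0.0

prob = interaction_probability(features, w1, b1, w2, b2)
```

A uni-modal baseline of the kind mentioned in the abstract would correspond to dropping either the sequence or the structure vectors before concatenation.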
Topics: Humans; Saccharomyces cerevisiae; Neural Networks, Computer; Proteins; Amino Acid Sequence; Multiomics
PubMed: 37027644
DOI: 10.1109/TCBB.2023.3248797
-
Journal of Proteome Research Dec 2023
Review
Top-down proteomics (TDP) aims to identify and profile intact protein forms (proteoforms) extracted from biological samples. True proteoform characterization requires that both the base protein sequence be defined and any mass shifts identified, ideally localizing their positions within the protein sequence. Being able to fully elucidate proteoform profiles lends insight into characterizing proteoform-unique roles, and is a crucial aspect of defining protein structure-function relationships and the specific roles of different (combinations of) protein modifications. However, defining and pinpointing protein post-translational modifications (PTMs) on intact proteins remains a challenge. Characterization of (heavily) modified proteins (>∼30 kDa) remains problematic, especially when they exist in a population of similarly modified, or kindred, proteoforms. This issue is compounded as the number of modifications increases, and thus the number of theoretical combinations. Here, we present our perspective on the challenges of analyzing kindred proteoform populations, focusing on annotation of protein modifications on an "average" protein. Furthermore, we discuss the technical requirements to obtain high quality fragmentation spectral data to robustly define site-specific PTMs, and the fact that this is tempered by the time requirements necessary to separate proteoforms in advance of mass spectrometry analysis.
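The growth in theoretical combinations mentioned in this abstract can be made concrete with a small counting sketch. It assumes the simplest possible model, which is not taken from the source: a single modification type, and each candidate site either modified or unmodified. The site counts used are hypothetical.

```python
from math import comb


def positional_isomers(n_sites, n_mods):
    """Ways to place n_mods identical modifications on n_sites candidate sites."""
    return comb(n_sites, n_mods)


def total_proteoforms(n_sites):
    """All occupancy patterns when each site is independently modified or not."""
    return 2 ** n_sites


# Hypothetical protein with 10 candidate sites.
isomers_with_3_mods = positional_isomers(10, 3)
all_patterns = total_proteoforms(10)
```

Under this model, proteoforms carrying the same number of modifications share the same intact mass, so they can only be told apart by fragmentation data that localizes each site. Allowing several modification types per site makes the count grow faster still.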
Topics: Tandem Mass Spectrometry; Proteomics; Proteins; Protein Processing, Post-Translational; Amino Acid Sequence; Proteome
PubMed: 37937372
DOI: 10.1021/acs.jproteome.3c00416
-
ACS Synthetic Biology Sep 2023
Engineered proteins have emerged as novel diagnostics, therapeutics, and catalysts. Often, poor protein developability, quantified by expression, solubility, and stability, hinders utility. The ability to predict protein developability from amino acid sequence would reduce the experimental burden when selecting candidates. Recent advances in screening technologies enabled a high-throughput (HT) developability dataset for 10 of 10 possible variants of protein ligand scaffold Gp2. In this work, we evaluate the ability of neural networks to learn a developability representation from a HT dataset and transfer this knowledge to predict recombinant expression beyond observed sequences. The model convolves learned amino acid properties to predict expression levels 44% closer to the experimental variance compared to a non-embedded control. Analysis of learned amino acid embeddings highlights the uniqueness of cysteine, the importance of hydrophobicity and charge, and the unimportance of aromaticity, when aiming to improve the developability of small proteins. We identify clusters of similar sequences with increased recombinant expression through nonlinear dimensionality reduction and we explore the inferred expression landscape via nested sampling. The analysis enables the first direct visualization of the fitness landscape and highlights the existence of evolutionary bottlenecks in sequence space giving rise to competing subpopulations of sequences with different developability. The work advances applied protein engineering efforts by predicting and interpreting protein scaffold expression from a limited dataset. Furthermore, our statistical mechanical treatment of the problem advances foundational efforts to characterize the structure of the protein fitness landscape and the amino acid characteristics that influence protein developability.
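The phrase "convolves learned amino acid properties" can be illustrated with a rough sketch of an embedding lookup followed by a 1D convolution. It is not the authors' model: the embedding size, kernel width, pooling step, and linear read-out are assumptions, all weights are random and untrained, and the example sequence is arbitrary.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def embed(seq, table):
    """Look up a property vector (one row of `table`) for each residue."""
    return table[[AMINO_ACIDS.index(aa) for aa in seq]]


def conv1d_valid(x, kernel):
    """Slide a (k, d, f) kernel along a (L, d) sequence; output is (L-k+1, f)."""
    k = kernel.shape[0]
    windows = range(x.shape[0] - k + 1)
    return np.stack([np.einsum("kd,kdf->f", x[i:i + k], kernel) for i in windows])


def predict_expression(seq, table, kernel, w_out):
    """Embed, convolve, ReLU, max-pool over positions, then a linear read-out."""
    features = np.maximum(0.0, conv1d_valid(embed(seq, table), kernel))
    return float(features.max(axis=0) @ w_out)


rng = np.random.default_rng(0)
EMB_DIM, KERNEL_WIDTH, N_FILTERS = 4, 3, 5               # hypothetical sizes
table = rng.normal(size=(len(AMINO_ACIDS), EMB_DIM))     # would be learned in training
kernel = rng.normal(size=(KERNEL_WIDTH, EMB_DIM, N_FILTERS))
w_out = rng.normal(size=N_FILTERS)

score = predict_expression("ACDKLMW", table, kernel, w_out)   # arbitrary sequence
```

After training, the rows of `table` are the kind of learned amino acid embeddings the abstract analyses; the non-embedded control it mentions would presumably use a fixed encoding in their place.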
Topics: Amino Acids; Amino Acid Sequence; Cysteine; Neural Networks, Computer; Protein Engineering
PubMed: 37642646
DOI: 10.1021/acssynbio.3c00196