-
Current Opinion in Chemical Biology, Dec 2021
Review
Protein engineering seeks to identify protein sequences with optimized properties. When guided by machine learning, protein sequence generation methods can draw on prior knowledge and experimental efforts to improve this process. In this review, we highlight recent applications of machine learning to generate protein sequences, focusing on the emerging field of deep generative methods.
Topics: Amino Acid Sequence; Machine Learning; Protein Engineering
PubMed: 34051682
DOI: 10.1016/j.cbpa.2021.04.004 -
Briefings in Bioinformatics, Sep 2023
Review
The widespread adoption of high-throughput omics technologies has exponentially increased the amount of protein sequence data involved in many salient disease pathways and their respective therapeutics and diagnostics. Despite the availability of large-scale sequence data, the lack of experimental fitness annotations underpins the need for self-supervised and unsupervised machine learning (ML) methods. These techniques leverage the meaningful features encoded in abundant unlabeled sequences to accomplish complex protein engineering tasks. Proficiency in the rapidly evolving fields of protein engineering and generative AI is required to realize the full potential of ML models as a tool for protein fitness landscape navigation. Here, we support this work by (i) providing an overview of the architecture and mathematical details of the most successful ML models applicable to sequence data (e.g. variational autoencoders, autoregressive models, generative adversarial neural networks, and diffusion models), (ii) offering guidance on how to implement these models effectively on protein sequence data to predict fitness or generate high-fitness sequences, and (iii) highlighting several successful studies that implement these techniques in protein engineering (from paratope regions and subcellular localization prediction to high-fitness sequences and protein design rules generation). By providing a comprehensive survey of model details, novel architecture developments, comparisons of model applications, and current challenges, this study intends to provide structured guidance and a robust framework for delivering a prospective outlook in the ML-driven protein engineering field.
Topics: Amino Acid Sequence; Exercise; Neural Networks, Computer; Proteins; Unsupervised Machine Learning
PubMed: 37864295
DOI: 10.1093/bib/bbad358 -
BMC Bioinformatics, Feb 2024
PURPOSE
Despite much progress in alignment algorithms, aligning divergent protein sequences with less than 20-35% pairwise identity (the so-called "twilight zone") remains a difficult problem. Many alignment algorithms have used substitution matrices to generate alignments since these matrices were created in the 1970s; however, the matrices do not work well for scoring alignments within the twilight zone. We developed Protein Embedding based Alignments, or PEbA, to better align sequences with low pairwise identity. Similar to the traditional Smith-Waterman algorithm, PEbA uses a dynamic programming algorithm, but the matching score of amino acids is based on the similarity of their embeddings from a protein language model.
METHODS
We tested PEbA on over twelve thousand benchmark pairwise alignments from BAliBASE, each one extracted from one of their multiple sequence alignments. Five different BAliBASE references were used, each with different sequence identities, motifs, and lengths, allowing PEbA to showcase how well it aligns under different circumstances.
RESULTS
PEbA greatly outperformed BLOSUM substitution matrix-based pairwise alignments, with the size of the improvement in alignment quality depending on the similarity of the sequence pair (over four times better for pairs of sequences with <10% identity). We also compared PEbA with embeddings generated by different protein language models (ProtT5 and ESM-2) and found that ProtT5-XL-U50 produced the most useful embeddings for aligning protein sequences. PEbA also outperformed DEDAL and vcMSA, two recently developed protein language model embedding-based alignment methods.
CONCLUSION
Our results suggested that general purpose protein language models provide useful contextual information for generating more accurate protein alignments than typically used methods.
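The idea described under PURPOSE, a Smith-Waterman-style dynamic program whose match score comes from embedding similarity rather than a substitution matrix, can be illustrated with a minimal sketch. This is not the PEbA implementation: the cosine-similarity score, the linear gap penalty, the function names, and the toy two-dimensional "embeddings" are all assumptions made for illustration. In practice the per-residue vectors would come from a protein language model such as ProtT5 or ESM-2.

```python
import math


def cosine(u, v):
    """Cosine similarity between two per-residue embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def local_align(emb_a, emb_b, gap=-1.0):
    """Local alignment of two sequences given as lists of embeddings.

    Assumption: linear gap penalty and raw cosine similarity as the
    match score (hypothetical choices, not taken from the paper).
    Returns (best_score, list of aligned index pairs).
    """
    n, m = len(emb_a), len(emb_b)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    # Pointers for traceback: 0 = stop, 1 = diagonal, 2 = up, 3 = left.
    ptr = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0.0, (0, 0)

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            candidates = [
                (0.0, 0),
                (score[i - 1][j - 1] + cosine(emb_a[i - 1], emb_b[j - 1]), 1),
                (score[i - 1][j] + gap, 2),
                (score[i][j - 1] + gap, 3),
            ]
            score[i][j], ptr[i][j] = max(candidates, key=lambda c: c[0])
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best-scoring cell until a "stop" pointer.
    pairs = []
    i, j = best_pos
    while ptr[i][j] != 0:
        if ptr[i][j] == 1:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif ptr[i][j] == 2:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return best, pairs


# Toy example with made-up 2-D vectors standing in for real embeddings.
seq_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
seq_b = [[0.0, 1.0], [1.0, 1.0]]
best_score, aligned = local_align(seq_a, seq_b)
```

In the toy example, residues 1 and 2 of the first sequence are paired with residues 0 and 1 of the second, because those embedding pairs point in the same direction.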
Topics: Proteins; Boronic Acids; Amino Acid Sequence; Sequence Alignment; Algorithms
PubMed: 38413857
DOI: 10.1186/s12859-024-05699-5 -
Acta Pharmaceutica Hungarica, 2016
Review
Mass spectrometry is a highly sensitive, highly selective, high-throughput analytical technique. It is well suited to characterizing polar, high-mass molecules. It is one of the prime analytical techniques for studying proteins and determining their molecular mass, amino acid sequence and post-translational modifications. The objective of the present article is to introduce the most important mass spectrometry-based methods relevant to protein analysis, such as ionization techniques, mass analyzers and tandem mass spectrometry. We also introduce "top-down" and "bottom-up" protein sequencing, as well as protein quantitation.
Topics: Amino Acid Sequence; Animals; Humans; Mass Spectrometry; Proteomics; Sequence Analysis, Protein
PubMed: 29873964
DOI: No ID Found -
Molecular & Cellular Proteomics : MCP, Aug 2023
Review
The human proteome comprises all of the proteins produced by the sequences translated from the human genome, with additional modifications in both sequence and function caused by nonsynonymous variants and posttranslational modifications, including cleavage of the initial transcript into smaller peptides and polypeptides. The UniProtKB database (www.uniprot.org) is the world's leading high-quality, comprehensive and freely accessible resource of protein sequence and functional information and presents a summary of experimentally verified, or computationally predicted, functional information added by our expert biocuration team for each protein in the proteome. Researchers in the field of mass spectrometry-based proteomics both consume and add to the body of data available in UniProtKB, and this review highlights the information we provide to this community and the knowledge we in turn obtain from groups via deposition of large-scale datasets in public domain databases.
Topics: Humans; Proteomics; Proteome; Databases, Protein; Amino Acid Sequence; Peptides
PubMed: 37301379
DOI: 10.1016/j.mcpro.2023.100591 -
Science Signaling, Feb 2023
Engineering protein-protein interactions to generate new functions presents a challenge with great potential for many applications, ranging from therapeutics to synthetic biology. To avoid unwanted cross-talk with preexisting protein interaction networks in a cell, the specificity and selectivity of newly engineered proteins must be controlled. Here, we developed a computational strategy that mimics gene duplication and the divergence of preexisting interacting protein pairs to design new interactions. We used the bacterial PhoQ-PhoP two-component system as a model system to demonstrate the feasibility of this strategy and validated the approach with known experimental results. The designed protein pairs are predicted to exclusively interact with each other and to be insulated from potential cross-talk with their native partners. Thus, our approach enables exploration of uncharted regions of the protein sequence space and the design of new interacting protein pairs.
Topics: Amino Acid Sequence; Models, Biological; Protein Interaction Maps; Synthetic Biology
PubMed: 36853962
DOI: 10.1126/scisignal.abm4484 -
Computational Biology and Chemistry, Aug 2022
Profiles are used to model protein families and domains. They are built from multiple sequence alignments obtained by mapping a query sequence against a database, generating a profile based on the substitution scoring matrix. Profile applications depend heavily on the alignment algorithm and the scoring system for amino acid substitution. However, sometimes the database contains no sequences similar to the query sequence under the scoring schema. In these cases, it is not possible to build a profile. This paper proposes a method named PA_SPP, based on the pre-trained ProtAlbert transformer, to predict the profile for a single protein sequence without alignment. The performance of transformers on natural languages is impressive. Because protein sequences can be viewed as a language, we can benefit from these models. We analyze the attention heads in different layers of ProtAlbert to show that the transformer can capture five essential protein characteristics of a single sequence. This assessment shows that ProtAlbert considers some protein properties when suggesting amino acids for each position in the sequence. In other words, transformers can be considered an appropriate alternative to alignment and scoring schemas for predicting a profile. We evaluate PA_SPP on the CASP13 dataset, which includes 55 proteins. In addition, one thermophilic and two mesophilic proteins are used as case studies. The results show high similarity between the predicted profiles and HSSP profiles.
Topics: Algorithms; Amino Acid Sequence; Databases, Factual; Proteins; Sequence Alignment
PubMed: 35802991
DOI: 10.1016/j.compbiolchem.2022.107717 -
IEEE/ACM Transactions on Computational... 2023
The knowledge of protein-protein interaction (PPI) helps us to understand proteins' functions, the causes and growth of several diseases, and can aid in designing new drugs. The majority of existing PPI research has relied mainly on sequence-based approaches. With the availability of multi-omics datasets (sequence, 3D structure) and advancements in deep learning techniques, it is feasible to develop a deep multi-modal framework that fuses the features learned from different sources of information to predict PPI. In this work, we propose a multi-modal approach utilizing protein sequence and 3D structure. To extract features from the 3D structure of proteins, we use a pre-trained vision transformer model that has been fine-tuned on the structural representation of proteins. The protein sequence is encoded into a feature vector using a pre-trained language model. The feature vectors extracted from the two modalities are fused and then fed to the neural network classifier to predict the protein interactions. To showcase the effectiveness of the proposed methodology, we conduct experiments on two popular PPI datasets, namely, the human dataset and the S. cerevisiae dataset. Our approach outperforms the existing methodologies to predict PPI, including multi-modal approaches. We also evaluate the contributions of each modality by designing uni-modal baselines. We perform experiments with three modalities as well, having gene ontology as the third modality.
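The fusion step described above (feature vectors from two modalities are combined and passed to a neural network classifier) can be sketched as follows. This is only an illustration under stated assumptions: fusion by simple concatenation, a single hidden layer, random untrained weights, and all function names and vector sizes are hypothetical and are not taken from the paper. Real inputs would be the outputs of the pre-trained language model and the fine-tuned vision transformer.

```python
import math
import random


def fuse(seq_vec, struct_vec):
    """Late fusion by concatenation (an assumed fusion scheme)."""
    return list(seq_vec) + list(struct_vec)


def pair_features(protein_a, protein_b):
    """Build one input vector for a protein pair.

    Each protein is a (sequence_vector, structure_vector) tuple; the
    pair is represented by concatenating both fused vectors.
    """
    return fuse(*protein_a) + fuse(*protein_b)


def mlp_predict(x, w1, b1, w2, b2):
    """One-hidden-layer classifier: ReLU hidden units, sigmoid output."""
    hidden = [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(w1, b1)
    ]
    logit = sum(w * h for w, h in zip(w2, hidden)) + b2
    return 1.0 / (1.0 + math.exp(-logit))


# Toy vectors standing in for real sequence and structure features.
rng = random.Random(0)
prot_a = ([0.2, -0.1, 0.5], [0.7, 0.3])   # (sequence, structure)
prot_b = ([0.1, 0.4, -0.3], [0.0, 0.9])
x = pair_features(prot_a, prot_b)

# Untrained random weights, only to show the shapes involved.
n_hidden = 4
w1 = [[rng.uniform(-1, 1) for _ in x] for _ in range(n_hidden)]
b1 = [0.0] * n_hidden
w2 = [rng.uniform(-1, 1) for _ in range(n_hidden)]
b2 = 0.0

interaction_probability = mlp_predict(x, w1, b1, w2, b2)
```

A uni-modal baseline of the kind mentioned in the abstract would correspond to passing only the sequence vectors (or only the structure vectors) to the same classifier.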
Topics: Humans; Saccharomyces cerevisiae; Neural Networks, Computer; Proteins; Amino Acid Sequence; Multiomics
PubMed: 37027644
DOI: 10.1109/TCBB.2023.3248797 -
Nature Communications, Feb 2022
The task of protein sequence design is central to nearly all rational protein engineering problems, and enormous effort has gone into the development of energy functions to guide design. Here, we investigate the capability of a deep neural network model to automate design of sequences onto protein backbones, having learned directly from crystal structure data and without any human-specified priors. The model generalizes to native topologies not seen during training, producing experimentally stable designs. We evaluate the generalizability of our method to a de novo TIM-barrel scaffold. The model produces novel sequences, and high-resolution crystal structures of two designs show excellent agreement with in silico models. Our findings demonstrate the tractability of an entirely learned method for protein sequence design.
Topics: Amino Acid Sequence; Computer Simulation; Crystallography, X-Ray; Deep Learning; Models, Molecular; Protein Domains; Protein Engineering; Protein Folding
PubMed: 35136054
DOI: 10.1038/s41467-022-28313-9 -
Bioinformatics (Oxford, England), May 2022
MOTIVATION
A high-quality sequence alignment (SA) is the most important input feature for accurate protein structure prediction. For a protein sequence, there are many methods to generate an SA. However, when given a choice of more than one SA for a protein sequence, there are no methods to predict which SA may lead to more accurate models without actually building the models. In this work, we describe a method to predict the quality of a protein's SA.
RESULTS
We created our own dataset by generating a variety of SAs for a set of 1351 representative proteins and investigated various deep learning architectures to predict the local distance difference test (lDDT) scores of distance maps predicted with SAs as the input. These lDDT scores serve as indicators of the quality of the SAs. Using two independent test datasets consisting of CASP13 and CASP14 targets, we show that our method is effective for scoring and ranking SAs when a pool of SAs is available for a protein sequence. With an example, we further discuss how SA selection using our method can lead to improved structure prediction.
AVAILABILITY AND IMPLEMENTATION
Code and the data underlying this article are available at https://github.com/ba-lab/Alignment-Score/.
SUPPLEMENTARY INFORMATION
Supplementary data are available at Bioinformatics online.
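For readers unfamiliar with the lDDT score mentioned under RESULTS, the sketch below computes a global lDDT-style score between a reference and a predicted distance map. It follows the commonly used definition (residue pairs closer than a 15 Å inclusion radius in the reference, with tolerance thresholds of 0.5, 1, 2 and 4 Å), but the function name and the toy maps are hypothetical, and the exact variant used in the paper may differ.

```python
def lddt_from_distance_maps(ref, pred, radius=15.0,
                            thresholds=(0.5, 1.0, 2.0, 4.0)):
    """Global lDDT-style score between two square distance maps.

    ref and pred are lists of lists of inter-residue distances in
    angstroms. Only pairs closer than `radius` in the reference are
    scored. For each threshold, a pair counts as preserved when the
    predicted distance deviates from the reference by less than that
    threshold; the score is the preserved fraction averaged over all
    thresholds.
    """
    n = len(ref)
    scored_pairs = 0
    preserved = 0
    for i in range(n):
        for j in range(i + 1, n):
            if ref[i][j] < radius:
                scored_pairs += 1
                deviation = abs(pred[i][j] - ref[i][j])
                preserved += sum(deviation < t for t in thresholds)
    if scored_pairs == 0:
        return 0.0
    return preserved / (scored_pairs * len(thresholds))


# Toy 3-residue example with made-up distances (angstroms).
reference = [
    [0.0, 4.0, 8.0],
    [4.0, 0.0, 20.0],   # pair (1, 2) lies outside the inclusion radius
    [8.0, 20.0, 0.0],
]
predicted = [
    [0.0, 4.2, 9.5],
    [4.2, 0.0, 12.0],
    [9.5, 12.0, 0.0],
]
toy_score = lddt_from_distance_maps(reference, predicted)
```

In the toy example, pair (0, 1) passes all four thresholds and pair (0, 2) passes two, so the score is 6 / 8 = 0.75. Pair (1, 2) is ignored despite its large error because it is beyond the inclusion radius in the reference.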
Topics: Sequence Alignment; Deep Learning; Computational Biology; Proteins; Amino Acid Sequence
PubMed: 35385080
DOI: 10.1093/bioinformatics/btac210