protein sequence - OpenMD.com Journal Search

Protein sequence design with deep generative models.

Current Opinion in Chemical Biology Dec 2021

Review

Authors: Zachary Wu, Kadina E Johnston, Frances H Arnold...

Protein engineering seeks to identify protein sequences with optimized properties. When guided by machine learning, protein sequence generation methods can draw on prior knowledge and experimental efforts to improve this process. In this review, we highlight recent applications of machine learning to generate protein sequences, focusing on the emerging field of deep generative methods.

Topics: Amino Acid Sequence; Machine Learning; Protein Engineering

PubMed: 34051682
DOI: 10.1016/j.cbpa.2021.04.004

Generative models for protein sequence modeling: recent advances and future directions.

Briefings in Bioinformatics Sep 2023

Review

Authors: Mehrsa Mardikoraem, Zirui Wang, Nathaniel Pascual...

The widespread adoption of high-throughput omics technologies has exponentially increased the amount of protein sequence data involved in many salient disease pathways and their respective therapeutics and diagnostics. Despite the availability of large-scale sequence data, the lack of experimental fitness annotations underpins the need for self-supervised and unsupervised machine learning (ML) methods. These techniques leverage the meaningful features encoded in abundant unlabeled sequences to accomplish complex protein engineering tasks. Proficiency in the rapidly evolving fields of protein engineering and generative AI is required to realize the full potential of ML models as a tool for protein fitness landscape navigation. Here, we support this work by (i) providing an overview of the architecture and mathematical details of the most successful ML models applicable to sequence data (e.g. variational autoencoders, autoregressive models, generative adversarial neural networks, and diffusion models), (ii) offering guidance on how to implement these models effectively on protein sequence data to predict fitness or generate high-fitness sequences and (iii) highlighting several successful studies that implement these techniques in protein engineering (from paratope regions and subcellular localization prediction to high-fitness sequences and protein design rules generation). By providing a comprehensive survey of model details, novel architecture developments, comparisons of model applications, and current challenges, this study intends to provide structured guidance and a robust framework for delivering a prospective outlook in the ML-driven protein engineering field.

Topics: Amino Acid Sequence; Exercise; Neural Networks, Computer; Proteins; Unsupervised Machine Learning

PubMed: 37864295
DOI: 10.1093/bib/bbad358

Protein embedding based alignment.

BMC Bioinformatics Feb 2024

Authors: Benjamin Giovanni Iovino, Yuzhen Ye

PURPOSE

Despite much progress in alignment algorithms, aligning divergent protein sequences with less than 20-35% pairwise identity (the so-called "twilight zone") remains a difficult problem. Many alignment algorithms have used substitution matrices to generate alignments since their creation in the 1970s; however, these matrices do not work well for scoring alignments within the twilight zone. We developed Protein Embedding based Alignments, or PEbA, to better align sequences with low pairwise identity. Similar to the traditional Smith-Waterman algorithm, PEbA uses a dynamic programming algorithm, but the matching score of amino acids is based on the similarity of their embeddings from a protein language model.

METHODS

We tested PEbA on over twelve thousand benchmark pairwise alignments from BAliBASE, each one extracted from one of their multiple sequence alignments. Five different BAliBASE references were used, each with different sequence identities, motifs, and lengths, allowing PEbA to showcase how well it aligns under different circumstances.

RESULTS

PEbA greatly outperformed BLOSUM substitution matrix-based pairwise alignments, with the size of the improvement in alignment quality varying with the similarity of the sequence pair (over four times as well for pairs of sequences with <10% identity). We also compared PEbA with embeddings generated by different protein language models (ProtT5 and ESM-2) and found that ProtT5-XL-U50 produced the most useful embeddings for aligning protein sequences. PEbA also outperformed DEDAL and vcMSA, two recently developed protein language model embedding-based alignment methods.

CONCLUSION

Our results suggested that general purpose protein language models provide useful contextual information for generating more accurate protein alignments than typically used methods.
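
The scoring idea described under PURPOSE (a Smith-Waterman-style dynamic program whose match score comes from embedding similarity rather than a substitution matrix) can be sketched roughly as follows. This is an illustrative sketch only, not the PEbA implementation: the function names, the use of cosine similarity, the linear gap penalty and the toy 2-dimensional "embeddings" are all assumptions made for the example. In practice the per-residue vectors would come from a protein language model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def embedding_local_align(emb_a, emb_b, gap=-1.0):
    """Local alignment of two sequences given per-residue embeddings.

    Hypothetical helper: the match score is the cosine similarity of the two
    residue embeddings, and gaps use a simple linear penalty (an assumption).
    Returns (best_score, aligned_pairs) with 0-based residue index pairs.
    """
    n, m = len(emb_a), len(emb_b)
    # Similarity matrix computed once and reused during traceback.
    sim = [[cosine(a, b) for b in emb_b] for a in emb_a]
    # H[i][j] holds the best local score ending at residues i-1 and j-1.
    H = [[0.0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0.0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            H[i][j] = max(
                0.0,                               # start a new local alignment
                H[i - 1][j - 1] + sim[i - 1][j - 1],  # align residue i with j
                H[i - 1][j] + gap,                 # gap in sequence B
                H[i][j - 1] + gap,                 # gap in sequence A
            )
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the score drops to zero.
    pairs = []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        if math.isclose(H[i][j], H[i - 1][j - 1] + sim[i - 1][j - 1]):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif math.isclose(H[i][j], H[i - 1][j] + gap):
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return best, pairs


# Toy example with made-up 2-dimensional "embeddings".
toy_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_b = [[0.0, 1.0], [1.0, 1.0]]
toy_score, toy_pairs = embedding_local_align(toy_a, toy_b)
```

Because cosine similarity lies between -1 and 1, a real scoring scheme would likely need to rescale or shift it so that unrelated residues score negatively; that tuning is left out of this sketch.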

Topics: Proteins; Boronic Acids; Amino Acid Sequence; Sequence Alignment; Algorithms

PubMed: 38413857
DOI: 10.1186/s12859-024-05699-5

Application of Mass Spectrometry in Proteomics.

Acta Pharmaceutica Hungarica 2016

Review

Authors: Kiraly Marton, Dalmadine Kiss Rorbala, Drahos Laszlo...

Mass spectrometry is a highly sensitive, highly selective, high-throughput analytical technique. It is well suited to characterizing polar, high-mass molecules. It is one of the prime analytical techniques for studying proteins and determining their molecular mass, amino acid sequence and post-translational modifications. The objective of the present article is to introduce the most important mass spectrometry-based methods relevant to protein analysis, such as ionization techniques, mass analyzers and tandem mass spectrometry. We also introduce "top-down" and "bottom-up" protein sequencing, as well as protein quantitation.

Topics: Amino Acid Sequence; Animals; Humans; Mass Spectrometry; Proteomics; Sequence Analysis, Protein

PubMed: 29873964
DOI: No ID Found

UniProt and Mass Spectrometry-Based Proteomics-A 2-Way Working Relationship.

Molecular & Cellular Proteomics : MCP Aug 2023

Review

Authors: E H Bowler-Barnett, J Fan, J Luo...

The human proteome comprises all of the proteins produced by the sequences translated from the human genome, with additional modifications in both sequence and function caused by nonsynonymous variants and posttranslational modifications, including cleavage of the initial transcript into smaller peptides and polypeptides. The UniProtKB database (www.uniprot.org) is the world's leading high-quality, comprehensive and freely accessible resource of protein sequence and functional information and presents a summary of experimentally verified, or computationally predicted, functional information added by our expert biocuration team for each protein in the proteome. Researchers in the field of mass spectrometry-based proteomics both consume and add to the body of data available in UniProtKB, and this review highlights the information we provide to this community and the knowledge we in turn obtain from groups via deposition of large-scale datasets in public domain databases.

Topics: Humans; Proteomics; Proteome; Databases, Protein; Amino Acid Sequence; Peptides

PubMed: 37301379
DOI: 10.1016/j.mcpro.2023.100591

Data-driven design of orthogonal protein-protein interactions.

Science Signaling Feb 2023

Authors: Duccio Malinverni, M Madan Babu

Engineering protein-protein interactions to generate new functions presents a challenge with great potential for many applications, ranging from therapeutics to synthetic biology. To avoid unwanted cross-talk with preexisting protein interaction networks in a cell, the specificity and selectivity of newly engineered proteins must be controlled. Here, we developed a computational strategy that mimics gene duplication and the divergence of preexisting interacting protein pairs to design new interactions. We used the bacterial PhoQ-PhoP two-component system as a model system to demonstrate the feasibility of this strategy and validated the approach with known experimental results. The designed protein pairs are predicted to exclusively interact with each other and to be insulated from potential cross-talk with their native partners. Thus, our approach enables exploration of uncharted regions of the protein sequence space and the design of new interacting protein pairs.

Topics: Amino Acid Sequence; Models, Biological; Protein Interaction Maps; Synthetic Biology

PubMed: 36853962
DOI: 10.1126/scisignal.abm4484

Protein sequence profile prediction using ProtAlbert transformer.

Computational Biology and Chemistry Aug 2022

Authors: Armin Behjati, Fatemeh Zare-Mirakabad, Seyed Shahriar Arab...

Profiles are used to model protein families and domains. They are built from multiple sequence alignments obtained by mapping a query sequence against a database, generating a profile based on the substitution scoring matrix. Profile applications depend heavily on the alignment algorithm and on the scoring system for amino acid substitution. However, sometimes the database contains no sequences similar to the query under the scoring schema. In these cases, it is not possible to make a profile. This paper proposes a method named PA_SPP, based on the pre-trained ProtAlbert transformer, to predict the profile for a single protein sequence without alignment. The performance of transformers on natural languages is impressive. Because protein sequences can be viewed as a language, we can benefit from these models. We analyze the attention heads in different layers of ProtAlbert to show that the transformer can capture five essential protein characteristics of a single sequence. This assessment shows that ProtAlbert considers some protein properties when suggesting amino acids for each position in the sequence. In other words, transformers can be considered an appropriate alternative to alignment and scoring schema for predicting a profile. We evaluate PA_SPP on the CASP13 dataset, including 55 proteins. In addition, one thermophilic and two mesophilic proteins are used as case studies. The results display high similarity between the predicted profiles and HSSP profiles.
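
For readers unfamiliar with profiles, the conventional alignment-based construction that the abstract contrasts with PA_SPP can be sketched as a table of per-column amino acid frequencies. This is a simplified illustration, not the method of the paper: the function name `msa_profile`, the optional pseudocount and the toy alignment are assumptions, and real profile builders typically also apply sequence weighting and substitution-matrix-based scoring.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def msa_profile(msa, pseudocount=0.0):
    """Build a frequency profile from a multiple sequence alignment.

    `msa` is a list of equal-length aligned strings; gap characters ('-')
    and any non-standard symbols are ignored. Returns one dict per column
    mapping each of the 20 amino acids to its relative frequency.
    """
    if not msa:
        raise ValueError("alignment is empty")
    length = len(msa[0])
    if any(len(row) != length for row in msa):
        raise ValueError("all aligned sequences must have the same length")

    profile = []
    for col in range(length):
        counts = Counter(row[col] for row in msa if row[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        if total == 0:
            # Column contains only gaps and no pseudocount was requested.
            profile.append({aa: 0.0 for aa in AMINO_ACIDS})
        else:
            profile.append(
                {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}
            )
    return profile


# Toy alignment of four made-up sequences.
toy_msa = ["ACD", "ACE", "A-E", "GCE"]
toy_profile = msa_profile(toy_msa)
```

When no homologous sequences are found, `msa` would hold only the query and every column would collapse to a single residue, which is the situation the alignment-free prediction in the abstract is meant to address.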

Topics: Algorithms; Amino Acid Sequence; Databases, Factual; Proteins; Sequence Alignment

PubMed: 35802991
DOI: 10.1016/j.compbiolchem.2022.107717

Prediction of Protein-Protein Interactions Using Vision Transformer and Language Model.

IEEE/ACM Transactions on Computational... 2023

Authors: Kanchan Jha, Sriparna Saha, Sourav Karmakar...

The knowledge of protein-protein interaction (PPI) helps us to understand proteins' functions, the causes and growth of several diseases, and can aid in designing new drugs. The majority of existing PPI research has relied mainly on sequence-based approaches. With the availability of multi-omics datasets (sequence, 3D structure) and advancements in deep learning techniques, it is feasible to develop a deep multi-modal framework that fuses the features learned from different sources of information to predict PPI. In this work, we propose a multi-modal approach utilizing protein sequence and 3D structure. To extract features from the 3D structure of proteins, we use a pre-trained vision transformer model that has been fine-tuned on the structural representation of proteins. The protein sequence is encoded into a feature vector using a pre-trained language model. The feature vectors extracted from the two modalities are fused and then fed to the neural network classifier to predict the protein interactions. To showcase the effectiveness of the proposed methodology, we conduct experiments on two popular PPI datasets, namely, the human dataset and the S. cerevisiae dataset. Our approach outperforms the existing methodologies to predict PPI, including multi-modal approaches. We also evaluate the contributions of each modality by designing uni-modal baselines. We perform experiments with three modalities as well, having gene ontology as the third modality.
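
The fusion step described above (per-protein feature vectors from two modalities are combined and passed to a classifier) might look roughly like the sketch below. The abstract does not say how the vectors are fused, so concatenation is an assumption here, and a single logistic unit with hypothetical weights stands in for the neural network classifier. All names and numbers are illustrative only.

```python
import math


def fuse(seq_vec, struct_vec):
    """Fuse two modality vectors for one protein by concatenation (assumed)."""
    return list(seq_vec) + list(struct_vec)


def pair_features(protein_a, protein_b):
    """Feature vector for a protein pair.

    Each protein is a (sequence_vector, structure_vector) tuple.
    """
    return fuse(*protein_a) + fuse(*protein_b)


def predict_interaction(features, weights, bias=0.0):
    """Stand-in classifier: one logistic unit returning P(interaction)."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))


# Made-up vectors; real ones would come from the pre-trained language model
# (sequence) and the fine-tuned vision transformer (structure).
toy_protein_a = ([0.2, 0.1], [0.5])
toy_protein_b = ([0.4, 0.3], [0.9])
toy_features = pair_features(toy_protein_a, toy_protein_b)
toy_prob = predict_interaction(toy_features, weights=[0.0] * len(toy_features))
```

A uni-modal baseline of the kind mentioned in the abstract would correspond to building `toy_features` from only one of the two vectors per protein.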

Topics: Humans; Saccharomyces cerevisiae; Neural Networks, Computer; Proteins; Amino Acid Sequence; Multiomics

PubMed: 37027644
DOI: 10.1109/TCBB.2023.3248797

Protein sequence design with a learned potential.

Nature Communications Feb 2022

Authors: Namrata Anand, Raphael Eguchi, Irimpan I Mathews...

The task of protein sequence design is central to nearly all rational protein engineering problems, and enormous effort has gone into the development of energy functions to guide design. Here, we investigate the capability of a deep neural network model to automate design of sequences onto protein backbones, having learned directly from crystal structure data and without any human-specified priors. The model generalizes to native topologies not seen during training, producing experimentally stable designs. We evaluate the generalizability of our method to a de novo TIM-barrel scaffold. The model produces novel sequences, and high-resolution crystal structures of two designs show excellent agreement with in silico models. Our findings demonstrate the tractability of an entirely learned method for protein sequence design.

Topics: Amino Acid Sequence; Computer Simulation; Crystallography, X-Ray; Deep Learning; Models, Molecular; Protein Domains; Protein Engineering; Protein Folding

PubMed: 35136054
DOI: 10.1038/s41467-022-28313-9

Scoring protein sequence alignments using deep learning.

Bioinformatics (Oxford, England) May 2022

Authors: Bikash Shrestha, Badri Adhikari

MOTIVATION

A high-quality sequence alignment (SA) is the most important input feature for accurate protein structure prediction. For a protein sequence, there are many methods to generate an SA. However, when given a choice of more than one SA for a protein sequence, there are no methods to predict which SA may lead to more accurate models without actually building the models. In this work, we describe a method to predict the quality of a protein's SA.

RESULTS

We created our own dataset by generating a variety of SAs for a set of 1351 representative proteins and investigated various deep learning architectures to predict the local distance difference test (lDDT) scores of distance maps predicted with SAs as the input. These lDDT scores serve as indicators of the quality of the SAs. Using two independent test datasets consisting of CASP13 and CASP14 targets, we show that our method is effective for scoring and ranking SAs when a pool of SAs is available for a protein sequence. With an example, we further discuss that SA selection using our method can lead to improved structure prediction.

AVAILABILITY AND IMPLEMENTATION

Code and the data underlying this article are available at https://github.com/ba-lab/Alignment-Score/.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.
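
To make the quantity under RESULTS more concrete, the sketch below computes a simplified, residue-level lDDT-style score between a reference distance map and a predicted one, and then ranks a pool of SAs by score. It is an illustration under stated assumptions, not the authors' code (which is at the URL above): the paper predicts these scores with deep learning rather than computing them from a known structure, and the 15 Å inclusion radius, the 0.5/1/2/4 Å tolerance thresholds, the function names and the toy maps are all choices made for this example.

```python
def lddt_from_distance_maps(true_d, pred_d, radius=15.0,
                            thresholds=(0.5, 1.0, 2.0, 4.0)):
    """Simplified lDDT-style score between two square distance maps.

    For every residue pair closer than `radius` in the reference map, count
    how many tolerance thresholds the predicted distance satisfies, then
    normalise to the range [0, 1].
    """
    n = len(true_d)
    preserved = 0
    considered = 0
    for i in range(n):
        for j in range(i + 1, n):
            if true_d[i][j] < radius:
                considered += 1
                diff = abs(true_d[i][j] - pred_d[i][j])
                preserved += sum(1 for t in thresholds if diff < t)
    if considered == 0:
        return 0.0
    return preserved / (considered * len(thresholds))


def rank_alignments(scores):
    """Return SA identifiers sorted from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


# Toy 3-residue reference map and a slightly perturbed prediction.
toy_true = [[0.0, 4.0, 8.0], [4.0, 0.0, 4.0], [8.0, 4.0, 0.0]]
toy_pred = [[0.0, 5.5, 8.0], [5.5, 0.0, 4.3], [8.0, 4.3, 0.0]]
toy_lddt = lddt_from_distance_maps(toy_true, toy_pred)
toy_ranking = rank_alignments({"sa_1": 0.41, "sa_2": 0.87, "sa_3": 0.63})
```

In the setting described by the abstract, the values passed to `rank_alignments` would be the predicted lDDT scores for each candidate SA, so that the best-ranked SA can be chosen before any structure model is built.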

Topics: Sequence Alignment; Deep Learning; Computational Biology; Proteins; Amino Acid Sequence

PubMed: 35385080
DOI: 10.1093/bioinformatics/btac210