-
Current Opinion in Biotechnology Jun 2022Machine learning (ML) is revolutionizing our ability to understand and predict the complex relationships between protein sequence, structure, and function. Predictive... (Review)
Review
Machine learning (ML) is revolutionizing our ability to understand and predict the complex relationships between protein sequence, structure, and function. Predictive sequence-function models are enabling protein engineers to efficiently search the sequence space for useful proteins with broad applications in biotechnology. In this review, we highlight the recent advances in applying ML to protein engineering. We discuss supervised learning methods that infer the sequence-function mapping from experimental data and new sequence representation strategies for data-efficient modeling. We then describe the various ways in which ML can be incorporated into protein engineering workflows, including purely in silico searches, ML-assisted directed evolution, and generative models that can learn the underlying distribution of the protein function in a sequence space. ML-driven protein engineering will become increasingly powerful with continued advances in high-throughput data generation, data science, and deep learning.
Topics: Amino Acid Sequence; Biotechnology; Machine Learning; Protein Engineering; Proteins
PubMed: 35413604
DOI: 10.1016/j.copbio.2022.102713 -
Methods in Molecular Biology (Clifton,... 2023The ability to successfully predict the three-dimensional structure of a protein from its amino acid sequence has made considerable progress in the recent past. The...
The ability to successfully predict the three-dimensional structure of a protein from its amino acid sequence has made considerable progress in the recent past. The progress is propelled by the improved accuracy of deep learning-based inter-residue contact map predictors coupled with the rising growth of protein sequence databases. Contact map encodes interatomic interaction information that can be exploited for highly accurate prediction of protein structures via contact map threading even for the query proteins that are not amenable to direct homology modeling. As such, contact-assisted threading has garnered considerable research effort. In this chapter, we provide an overview of existing contact-assisted threading methods while highlighting the recent advances and discussing some of the current limitations and future prospects in the application of contact-assisted threading for improving the accuracy of low-homology protein modeling.
Topics: Algorithms; Sequence Analysis, Protein; Proteins; Software; Amino Acid Sequence; Databases, Protein; Protein Conformation; Protein Folding
PubMed: 36959441
DOI: 10.1007/978-1-0716-2974-1_3 -
Journal of Proteome Research Feb 2023Protein structure defines protein function and plays an extremely important role in protein characterization. Recently, two groups of researchers from DeepMind and the...
Protein structure defines protein function and plays an extremely important role in protein characterization. Recently, two groups of researchers from DeepMind and the Baker lab have independently published protein structure prediction tools that can help us obtain predicted protein structures for the whole human proteome. This enabled us to visualize the entire human proteome using predicted 3D structures for the first time. To help other researchers best utilize these protein structure predictions in proteomics experiments, we present the Sequence Coverage Visualizer (SCV), http://scv.lab.gy, a web application for protein sequence coverage 3D visualization. Here we showed a few possible usages of the SCV, including the labeling of post-translational modifications and isotope labeling experiments. These results highlight the usefulness of such 3D visualization for proteomics experiments and how SCV can turn a regular proteomics experiment (identified peptide list) into structural insights. Furthermore, when used together with limited proteolysis, we demonstrated that SCV can help to compare different protein structures from different sources, including predicted ones and existing PDB entries. We hope our tool can provide help in the process of improving protein structure prediction accuracy. Overall, SCV is a convenient and powerful tool for visualizing proteomics results in 3D.
Topics: Humans; Proteome; Imaging, Three-Dimensional; Amino Acid Sequence; Peptides; Proteomics; Software
PubMed: 36511722
DOI: 10.1021/acs.jproteome.2c00358 -
Journal of Molecular Biology May 2021Information on the protein flexibility is essential to understand crucial molecular mechanisms such as protein stability, interactions with other molecules and protein...
Information on the protein flexibility is essential to understand crucial molecular mechanisms such as protein stability, interactions with other molecules and protein functions in general. B-factor obtained in the X-ray crystallography experiments is the most common flexibility descriptor available for the majority of the resolved protein structures. Since the gap between the number of the resolved protein structures and available protein sequences is continuously growing, it is important to provide computational tools for protein flexibility prediction from amino acid sequence. In the current study, we report a Deep Learning based protein flexibility prediction tool MEDUSA (https://www.dsimb.inserm.fr/MEDUSA). MEDUSA uses evolutionary information extracted from protein homologous sequences and amino acid physico-chemical properties as input for a convolutional neural network to assign a flexibility class to each protein sequence position. Trained on a non-redundant dataset of X-ray structures, MEDUSA provides flexibility prediction in two, three and five classes. MEDUSA is freely available as a web-server providing a clear visualization of the prediction results as well as a standalone utility (https://github.com/DSIMB/medusa). Analysis of the MEDUSA output allows a user to identify the potentially highly deformable protein regions and general dynamic properties of the protein.
Topics: Amino Acid Sequence; Computational Biology; Databases, Protein; Proteins; Software
PubMed: 33972018
DOI: 10.1016/j.jmb.2021.166882 -
Expert Review of Proteomics 2020Proteins are crucial for every cellular activity and unraveling their sequence and structure is a crucial step to fully understand their biology. Early methods of... (Review)
Review
INTRODUCTION
Proteins are crucial for every cellular activity and unraveling their sequence and structure is a crucial step to fully understand their biology. Early methods of protein sequencing were mainly based on the use of enzymatic or chemical degradation of peptide chains. With the completion of the human genome project and with the expansion of the information available for each protein, various databases containing this sequence information were formed.
AREAS COVERED
protein sequencing, shotgun proteomics and other mass-spectrometric techniques, along with the various software are currently available for proteogenomic analysis. Emphasis is placed on the methods for sequencing, together with potential and shortcomings using databases for interpretation of protein sequence data.
EXPERT OPINION
As mass-spectrometry sequencing performance is improving with better software and hardware optimizations, combined with user-friendly interfaces, protein sequencing becomes imperative in shotgun proteomic studies. Issues regarding unknown or mutated peptide sequences, as well as, unexpected post-translational modifications (PTMs) and their identification through false discovery rate searches using the target/decoy strategy need to be addressed. Ideally, it should become integrated in standard proteomic workflows as an add-on to conventional database search engines, which then would be able to provide improved identification.
Topics: Amino Acid Sequence; Computational Biology; Humans; Protein Processing, Post-Translational; Proteins; Proteomics; Sequence Analysis, Protein; Software; Tandem Mass Spectrometry
PubMed: 33016158
DOI: 10.1080/14789450.2020.1831387 -
Briefings in Functional Genomics Mar 2021Deep learning has been increasingly used in bioinformatics, especially in sequence-based protein prediction tasks, as large amounts of biological data are available and...
Deep learning has been increasingly used in bioinformatics, especially in sequence-based protein prediction tasks, as large amounts of biological data are available and deep learning techniques have been developed rapidly in recent years. For sequence-based protein prediction tasks, the selection of a suitable model architecture is essential, whereas sequence data representation is a major factor in controlling model performance. Here, we summarized all the main approaches that are used to represent protein sequence data (amino acid sequence encoding or embedding), which include end-to-end embedding methods, non-contextual embedding methods and embedding methods that use transfer learning and others that are applied for some specific tasks (such as protein sequence embedding based on extracted features for protein structure predictions and graph convolutional network-based embedding for drug discovery tasks). We have also reviewed the architectures of various types of embedding models theoretically and the development of these types of sequence embedding approaches to facilitate researchers and users in selecting the model that best suits their requirements.
Topics: Amino Acid Sequence; Computational Biology; Deep Learning; Neural Networks, Computer; Proteins
PubMed: 33527980
DOI: 10.1093/bfgp/elaa030 -
Nature Methods Jan 2023Protein sequence alignment is a key component of most bioinformatics pipelines to study the structures and functions of proteins. Aligning highly divergent sequences...
Protein sequence alignment is a key component of most bioinformatics pipelines to study the structures and functions of proteins. Aligning highly divergent sequences remains, however, a difficult task that current algorithms often fail to perform accurately, leaving many proteins or open reading frames poorly annotated. Here we leverage recent advances in deep learning for language modeling and differentiable programming to propose DEDAL (deep embedding and differentiable alignment), a flexible model to align protein sequences and detect homologs. DEDAL is a machine learning-based model that learns to align sequences by observing large datasets of raw protein sequences and of correct alignments. Once trained, we show that DEDAL improves by up to two- or threefold the alignment correctness over existing methods on remote homologs and better discriminates remote homologs from evolutionarily unrelated sequences, paving the way to improvements on many downstream tasks relying on sequence alignment in structural and functional genomics.
Topics: Amino Acid Sequence; Proteins; Sequence Alignment; Algorithms; Genomics
PubMed: 36522501
DOI: 10.1038/s41592-022-01700-2 -
Current Opinion in Structural Biology Jun 2016The availability of vast amounts of protein sequence data facilitates detection of subtle statistical correlations due to imposed structural and functional constraints.... (Review)
Review
The availability of vast amounts of protein sequence data facilitates detection of subtle statistical correlations due to imposed structural and functional constraints. Recent breakthroughs using Direct Coupling Analysis (DCA) and related approaches have tapped into correlations believed to be due to compensatory mutations. This has yielded some remarkable results, including substantially improved prediction of protein intra- and inter-domain 3D contacts, of membrane and globular protein structures, of substrate binding sites, and of protein conformational heterogeneity. A complementary approach is Bayesian Partitioning with Pattern Selection (BPPS), which partitions related proteins into hierarchically-arranged subgroups based on correlated residue patterns. These correlated patterns are presumably due to structural and functional constraints associated with evolutionary divergence rather than to compensatory mutations. Hence joint application of DCA- and BPPS-based approaches should help sort out the structural and functional constraints contributing to sequence correlations.
Topics: Amino Acid Sequence; Computational Biology; Interatrial Block; Models, Molecular; Proteins; Sequence Alignment
PubMed: 27179293
DOI: 10.1016/j.sbi.2016.04.006 -
PloS One 2023With the growing interest in using phages to combat antimicrobial resistance, computational methods for predicting phage-host interactions have been explored to help...
With the growing interest in using phages to combat antimicrobial resistance, computational methods for predicting phage-host interactions have been explored to help shortlist candidate phages. Most existing models consider entire proteomes and rely on manual feature engineering, which poses difficulty in selecting the most informative sequence properties to serve as input to the model. In this paper, we framed phage-host interaction prediction as a multiclass classification problem that takes as input the embeddings of a phage's receptor-binding proteins, which are known to be the key machinery for host recognition, and predicts the host genus. We explored different protein language models to automatically encode these protein sequences into dense embeddings without the need for additional alignment or structural information. We show that the use of embeddings of receptor-binding proteins presents improvements over handcrafted genomic and protein sequence features. The highest performance was obtained using the transformer-based protein language model ProtT5, resulting in a 3% to 4% increase in weighted F1 and recall scores across different prediction confidence thresholds, compared to using selected handcrafted sequence features.
Topics: Amino Acid Sequence; Proteome; Bacteriophages; Differential Threshold; Mental Recall
PubMed: 37486915
DOI: 10.1371/journal.pone.0289030 -
Current Opinion in Structural Biology Feb 2017Aggregation can be thought of as a form of protein folding in which intermolecular associations lead to the formation of large, insoluble assemblies. Various types of... (Review)
Review
Aggregation can be thought of as a form of protein folding in which intermolecular associations lead to the formation of large, insoluble assemblies. Various types of aggregates can be differentiated by their internal structures and gross morphologies (e.g., fibrillar or amorphous), and the ability to accurately predict the likelihood of their formation by a given polypeptide is of great practical utility in the fields of biology (including the study of disease), biotechnology, and biomaterials research. Here we review aggregation/solubility prediction methods and selected applications thereof. The development of increasingly sophisticated methods that incorporate knowledge of conformations possibly adopted by aggregating polypeptide monomers and predict the internal structure of aggregates is improving the accuracy of the predictions and continually expanding the range of applications.
Topics: Amino Acid Sequence; Humans; Models, Molecular; Protein Structure, Quaternary; Protein Structure, Tertiary; Proteins; Solubility
PubMed: 28160724
DOI: 10.1016/j.sbi.2017.01.004