protein sequence - OpenMD.com Journal Search

LambdaPP: Fast and accessible protein-specific phenotype predictions.

Protein Science : a Publication of the... Jan 2023

Authors: Tobias Olenyi, Céline Marquet, Michael Heinzinger...

The availability of accurate and fast artificial intelligence (AI) solutions predicting aspects of proteins is revolutionizing experimental and computational molecular biology. The webserver LambdaPP aspires to supersede PredictProtein, which in 1992 became the first internet server to make AI protein predictions available. Given a protein sequence as input, LambdaPP provides easily accessible visualizations of protein 3D structure, along with predictions at the protein level (GeneOntology, subcellular location) and the residue level (binding to metal ions, small molecules, and nucleotides; conservation; intrinsic disorder; secondary structure; alpha-helical and beta-barrel transmembrane segments; signal-peptides; variant effect) in seconds. The structure prediction provided by LambdaPP (leveraging ColabFold and computed in minutes) is based on MMseqs2 multiple sequence alignments. All other feature prediction methods are based on the protein language model (pLM) ProtT5. Queried by a protein sequence, LambdaPP computes protein and residue predictions almost instantly for various phenotypes, including 3D structure and aspects of protein function. LambdaPP is freely available for everyone to use at embed.predictprotein.org; the interactive results for the case study can be found at https://embed.predictprotein.org/o/Q9NZC2. The frontend of LambdaPP can be found on GitHub (github.com/sacdallago/embed.predictprotein.org), and can be freely used and distributed under the academic free use license (AFL-2). For high-throughput applications, all methods can be executed locally via the bio-embeddings (bioembeddings.com) Python package or the Docker image at ghcr.io/bioembeddings/bio_embeddings, which also includes the backend of LambdaPP.

Topics: Artificial Intelligence; Proteins; Amino Acid Sequence; Protein Structure, Secondary; Sequence Alignment; Software

PubMed: 36454227
DOI: 10.1002/pro.4524

Understanding and controlling amyloid aggregation with chirality.

Current Opinion in Chemical Biology Oct 2021

Review

Authors: Alejandro R Foley, Jevgenij A Raskatov

Amyloid aggregation and human disease are inextricably linked. Examples include Alzheimer disease, Parkinson disease, and type II diabetes. While seminal advances in the mechanistic understanding of these diseases have been made over the last decades, controlling amyloid fibril formation still represents a challenge and is a subject of active research. In this regard, chiral modifications have increasingly proved to be a particularly well-suited approach for accessing previously unknown aggregation pathways and for providing novel insights into the biological mechanisms of action of amyloidogenic peptides and proteins. Here, we summarize recent advances in how the use of mirror-image peptides/proteins and d-amino acid incorporations has helped modulate amyloid aggregation, offered new mechanistic tools to study cellular interactions, and allowed us to identify key positions within the peptide/protein sequence that influence amyloid fibril growth and toxicity.

Topics: Amino Acid Sequence; Amyloid; Amyloid beta-Peptides; Diabetes Mellitus, Type 2; Humans; Peptides

PubMed: 33610939
DOI: 10.1016/j.cbpa.2021.01.003

TT3D: Leveraging precomputed protein 3D sequence models to predict protein-protein interactions.

Bioinformatics (Oxford, England) Nov 2023

Authors: Samuel Sledzieski, Kapil Devkota, Rohit Singh...

MOTIVATION

High-quality computational structural models are now precomputed and available for nearly every protein in UniProt. However, the best way to leverage these models to predict which pairs of proteins interact in a high-throughput manner is not immediately clear. The recent Foldseek method of van Kempen et al. encodes the structural information of distances and angles along the protein backbone into a linear string of the same length as the protein string, using tokens from a 21-letter discretized structural alphabet (3Di).

RESULTS

We show that using both the amino acid sequence and the 3Di sequence generated by Foldseek as inputs to our recent deep-learning method, Topsy-Turvy, substantially improves the performance of cross-species protein-protein interaction prediction. Thus, TT3D (Topsy-Turvy 3D) presents a way to reuse all the computational effort going into producing high-quality structural models from sequence, while being sufficiently lightweight that high-quality binary protein-protein interaction predictions can be made genome-wide across all protein pairs.
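Because the 3Di string has the same length as the protein string, the two can be combined position by position. The following is a minimal sketch of that idea only, not the Topsy-Turvy or TT3D architecture: the function names, the one-hot encoding, and the 21-character placeholder alphabet `DI_ALPHABET` are assumptions made for illustration, and the 3Di string in the example is invented rather than produced by Foldseek.

```python
# Standard 20-letter amino acid alphabet.
AA_ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
# ASSUMPTION: placeholder 21-symbol structural alphabet, used only so the
# sketch runs; consult Foldseek for the real 3Di token set.
DI_ALPHABET = "ACDEFGHIKLMNPQRSTVWYX"


def one_hot(token: str, alphabet: str) -> list[int]:
    """Return a one-hot vector for `token` over `alphabet`."""
    if token not in alphabet:
        raise ValueError(f"unknown token {token!r}")
    vec = [0] * len(alphabet)
    vec[alphabet.index(token)] = 1
    return vec


def joint_features(aa_seq: str, di_seq: str) -> list[list[int]]:
    """Concatenate per-residue one-hot encodings of both strings.

    Both inputs must have the same length, one token per residue.
    """
    if len(aa_seq) != len(di_seq):
        raise ValueError("amino acid and 3Di strings must have equal length")
    return [
        one_hot(aa, AA_ALPHABET) + one_hot(di, DI_ALPHABET)
        for aa, di in zip(aa_seq, di_seq)
    ]


# Toy example: three residues, each encoded as a 20 + 21 = 41-dim vector.
features = joint_features("MKV", "DPV")
print(len(features), len(features[0]))
```

Each residue then carries both sequence and structural information in a single vector, which is the property that makes a joint input to a downstream model straightforward.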

AVAILABILITY AND IMPLEMENTATION

TT3D is available at https://github.com/samsledje/D-SCRIPT. An archived version of the code at time of submission can be found at https://zenodo.org/records/10037674.

Topics: Amino Acid Sequence; Software; Proteins

PubMed: 37897686
DOI: 10.1093/bioinformatics/btad663

AlphaFold2 models indicate that protein sequence determines both structure and dynamics.

Scientific Reports Jun 2022

Authors: Hao-Bo Guo, Alexander Perminov, Selemon Bekele...

AlphaFold 2 (AF2) has placed molecular biology in a new era where we can visualize, analyze and interpret the structures and functions of all proteins solely from their primary sequences. We performed AF2 structure predictions for various protein systems, including globular proteins, a multi-domain protein, an intrinsically disordered protein (IDP), a randomized protein, two larger proteins (>1000 AA), a heterodimer and a homodimer protein complex. Our results show that, along with the three-dimensional (3D) structures, AF2 also decodes protein sequences into residue flexibilities via both the predicted local distance difference test (pLDDT) scores of the models and the predicted aligned error (PAE) maps. We show that PAE maps from AF2 are correlated with the distance variation (DV) matrices from molecular dynamics (MD) simulations, which reveals that the PAE maps can predict the dynamical nature of protein residues. Here, we introduce the AF2-scores, which are simply derived from pLDDT scores and lie in the range [0, 1]. We found that for most protein models, including large proteins and protein complexes, the AF2-scores are highly correlated with the root mean square fluctuations (RMSF) calculated from MD simulations. However, for an IDP and a randomized protein, the AF2-scores do not correlate with the RMSF from MD, especially for the IDP. Our results indicate that the protein structures predicted by AF2 also convey information about residue flexibility, i.e., protein dynamics.
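The comparison described above, a pLDDT-derived score in [0, 1] correlated against per-residue RMSF, can be sketched as follows. The abstract does not give the AF2-score formula, so `flexibility_proxy` below uses 1 - pLDDT/100 purely as a stand-in assumption; the function names and the toy numbers are likewise invented for illustration.

```python
from math import sqrt


def flexibility_proxy(plddt: list[float]) -> list[float]:
    """Map per-residue pLDDT (0-100) to [0, 1]; higher means more flexible.

    ASSUMPTION: 1 - pLDDT/100 is a placeholder, not the paper's AF2-score.
    """
    return [1.0 - min(max(p, 0.0), 100.0) / 100.0 for p in plddt]


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Invented toy values: confident core residues, less confident termini.
plddt = [55.0, 82.0, 95.0, 93.0, 78.0, 48.0]
rmsf = [2.9, 1.1, 0.5, 0.6, 1.4, 3.3]  # hypothetical MD fluctuations
print(round(pearson(flexibility_proxy(plddt), rmsf), 3))
```

With real data, the pLDDT values would come from the AF2 model output and the RMSF values from an MD trajectory of the same protein, aligned residue by residue.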

Topics: Amino Acid Sequence; Furylfuramide; Intrinsically Disordered Proteins; Molecular Dynamics Simulation; Protein Conformation

PubMed: 35739160
DOI: 10.1038/s41598-022-14382-9

Machine learning to navigate fitness landscapes for protein engineering.

Current Opinion in Biotechnology Jun 2022

Review

Authors: Chase R Freschlin, Sarah A Fahlberg, Philip A Romero...

Machine learning (ML) is revolutionizing our ability to understand and predict the complex relationships between protein sequence, structure, and function. Predictive sequence-function models are enabling protein engineers to efficiently search the sequence space for useful proteins with broad applications in biotechnology. In this review, we highlight the recent advances in applying ML to protein engineering. We discuss supervised learning methods that infer the sequence-function mapping from experimental data and new sequence representation strategies for data-efficient modeling. We then describe the various ways in which ML can be incorporated into protein engineering workflows, including purely in silico searches, ML-assisted directed evolution, and generative models that can learn the underlying distribution of the protein function in a sequence space. ML-driven protein engineering will become increasingly powerful with continued advances in high-throughput data generation, data science, and deep learning.

Topics: Amino Acid Sequence; Biotechnology; Machine Learning; Protein Engineering; Proteins

PubMed: 35413604
DOI: 10.1016/j.copbio.2022.102713

Contact-Assisted Threading in Low-Homology Protein Modeling.

Methods in Molecular Biology (Clifton,... 2023

Authors: Sutanu Bhattacharya, Rahmatullah Roche, Md Hossain Shuvo...

The ability to successfully predict the three-dimensional structure of a protein from its amino acid sequence has improved considerably in the recent past. This progress is propelled by the improved accuracy of deep learning-based inter-residue contact map predictors, coupled with the rapid growth of protein sequence databases. A contact map encodes interatomic interaction information that can be exploited for highly accurate prediction of protein structures via contact map threading, even for query proteins that are not amenable to direct homology modeling. As such, contact-assisted threading has garnered considerable research effort. In this chapter, we provide an overview of existing contact-assisted threading methods, highlighting recent advances and discussing some of the current limitations and future prospects in the application of contact-assisted threading for improving the accuracy of low-homology protein modeling.

Topics: Algorithms; Sequence Analysis, Protein; Proteins; Software; Amino Acid Sequence; Databases, Protein; Protein Conformation; Protein Folding

PubMed: 36959441
DOI: 10.1007/978-1-0716-2974-1_3

Sequence Coverage Visualizer: A Web Application for Protein Sequence Coverage 3D Visualization.

Journal of Proteome Research Feb 2023

Authors: Xinhao Shao, Christopher Grams, Yu Gao...

Protein structure defines protein function and plays an extremely important role in protein characterization. Recently, two groups of researchers from DeepMind and the Baker lab independently published protein structure prediction tools that can help us obtain predicted protein structures for the whole human proteome. This has enabled us to visualize the entire human proteome using predicted 3D structures for the first time. To help other researchers make the best use of these protein structure predictions in proteomics experiments, we present the Sequence Coverage Visualizer (SCV), http://scv.lab.gy, a web application for protein sequence coverage 3D visualization. Here, we show a few possible uses of SCV, including the labeling of post-translational modifications and isotope labeling experiments. These results highlight the usefulness of such 3D visualization for proteomics experiments and how SCV can turn a regular proteomics experiment (identified peptide list) into structural insights. Furthermore, we demonstrate that, when used together with limited proteolysis, SCV can help to compare protein structures from different sources, including predicted ones and existing PDB entries. We hope our tool can help in the process of improving protein structure prediction accuracy. Overall, SCV is a convenient and powerful tool for visualizing proteomics results in 3D.
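Sequence coverage, the quantity SCV maps onto 3D structures, can be computed from an identified peptide list by marking which residues of the protein fall inside at least one peptide. The sketch below illustrates that calculation under simple assumptions (exact substring matching, no modifications); it is not SCV's implementation, and the function names and toy sequences are hypothetical.

```python
def coverage_mask(protein: str, peptides: list[str]) -> list[bool]:
    """Mark each residue covered by at least one exactly matching peptide."""
    covered = [False] * len(protein)
    for peptide in peptides:
        if not peptide:
            continue  # skip empty strings, which would match everywhere
        start = protein.find(peptide)
        while start != -1:
            for i in range(start, start + len(peptide)):
                covered[i] = True
            # Keep searching so repeated occurrences are also marked.
            start = protein.find(peptide, start + 1)
    return covered


def coverage_fraction(mask: list[bool]) -> float:
    """Fraction of residues covered; 0.0 for an empty protein."""
    return sum(mask) / len(mask) if mask else 0.0


# Toy example with an invented 10-residue protein and two peptides.
mask = coverage_mask("MKTAYIAKQR", ["MKT", "AKQR"])
print(coverage_fraction(mask))
```

The per-residue mask is the piece a 3D viewer would need: covered residues can be colored on the predicted structure, while the fraction gives the familiar single-number summary.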

Topics: Humans; Proteome; Imaging, Three-Dimensional; Amino Acid Sequence; Peptides; Proteomics; Software

PubMed: 36511722
DOI: 10.1021/acs.jproteome.2c00358

Propagation, detection and correction of errors using the sequence database network.

Briefings in Bioinformatics Nov 2022

Review

Authors: Benjamin Goudey, Nicholas Geard, Karin Verspoor...

Nucleotide and protein sequences stored in public databases are the cornerstone of many bioinformatics analyses. The records containing these sequences are prone to a wide range of errors, including incorrect functional annotation, sequence contamination and taxonomic misclassification. One source of information that can help to detect errors is the strong interdependency between records. Novel sequences in one database draw their annotations from existing records, may generate new records in multiple other locations and will have varying degrees of similarity with existing records across a range of attributes. A network perspective of these relationships between sequence records, within and across databases, offers new opportunities to detect, or even correct, erroneous entries and, more broadly, to make inferences about record quality. Here, we describe this novel perspective of sequence database records as a rich network, which we call the sequence database network, and illustrate the opportunities this perspective offers for quantification of database quality and detection of spurious entries. We provide an overview of the relevant databases and describe how the interdependencies between sequence records across these databases can be exploited by network analyses. We review the process of sequence annotation and provide a classification of sources of error, highlighting propagation as a major source. We illustrate the value of a network perspective through three case studies that use network analysis to detect errors, and explore the quality and quantity of critical relationships that would inform such network analyses. This systematic description of a network perspective of sequence database records provides a novel direction to combat the proliferation of errors within these critical bioinformatics resources.

Topics: Databases, Nucleic Acid; Amino Acid Sequence; Computational Biology

PubMed: 36266246
DOI: 10.1093/bib/bbac416

Automatic Gene Function Prediction in the 2020's.

Genes Oct 2020

Review

Authors: Stavros Makrodimitris, Roeland C H J van Ham, Marcel J T Reinders...

The current rate at which new DNA and protein sequences are being generated is too fast to experimentally discover the functions of those sequences, emphasizing the need for accurate Automatic Function Prediction (AFP) methods. AFP has been an active and growing research field for decades and has made considerable progress in that time. However, it is certainly not solved. In this paper, we describe challenges that the AFP field still has to overcome in the future to increase its applicability. The challenges we consider are how to: (1) include condition-specific functional annotation, (2) predict functions for non-model species, (3) include new informative data sources, (4) deal with the biases of Gene Ontology (GO) annotations, and (5) maximally exploit the GO to obtain performance gains. We also provide recommendations for addressing those challenges, by adapting (1) the way we represent proteins and genes, (2) the way we represent gene functions, and (3) the algorithms that perform the prediction from gene to function. Together, we show that AFP is still a vibrant research area that can benefit from continuing advances in machine learning, with which AFP in the 2020s can again take a large step forward, reinforcing the power of computational biology.

Topics: Algorithms; Amino Acid Sequence; Computational Biology; Electronic Data Processing; Gene Ontology; Machine Learning; Models, Biological; Molecular Sequence Annotation; Proteins

PubMed: 33120976
DOI: 10.3390/genes11111264

Protein embeddings improve phage-host interaction prediction.

PloS One 2023

Authors: Mark Edward M Gonzales, Jennifer C Ureta, Anish M S Shrestha...

With the growing interest in using phages to combat antimicrobial resistance, computational methods for predicting phage-host interactions have been explored to help shortlist candidate phages. Most existing models consider entire proteomes and rely on manual feature engineering, which poses difficulty in selecting the most informative sequence properties to serve as input to the model. In this paper, we framed phage-host interaction prediction as a multiclass classification problem that takes as input the embeddings of a phage's receptor-binding proteins, which are known to be the key machinery for host recognition, and predicts the host genus. We explored different protein language models to automatically encode these protein sequences into dense embeddings without the need for additional alignment or structural information. We show that the use of embeddings of receptor-binding proteins presents improvements over handcrafted genomic and protein sequence features. The highest performance was obtained using the transformer-based protein language model ProtT5, resulting in a 3% to 4% increase in weighted F1 and recall scores across different prediction confidence thresholds, compared to using selected handcrafted sequence features.
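The weighted F1 score reported above averages per-class F1 scores, weighting each class (here, each host genus) by its number of true instances. A minimal sketch of that metric follows; it is a generic implementation for illustration, not the authors' evaluation code, and the genus labels in the example are arbitrary toy data.

```python
from collections import Counter


def weighted_f1(y_true: list[str], y_pred: list[str]) -> float:
    """Support-weighted mean of per-class F1 scores."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("need two non-empty label lists of equal length")
    support = Counter(y_true)
    total = 0.0
    for label, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n_true - tp
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when there are no hits.
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += n_true * f1
    return total / len(y_true)


# Toy multiclass example: predicted host genus per phage.
y_true = ["Escherichia", "Escherichia", "Klebsiella", "Klebsiella"]
y_pred = ["Escherichia", "Klebsiella", "Klebsiella", "Klebsiella"]
print(round(weighted_f1(y_true, y_pred), 4))
```

Weighting by support matters when host genera are imbalanced, since an unweighted (macro) average would give rare genera the same influence as common ones.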

Topics: Amino Acid Sequence; Proteome; Bacteriophages; Differential Threshold; Mental Recall

PubMed: 37486915
DOI: 10.1371/journal.pone.0289030