protein sequence - OpenMD.com Journal Search

DCSE:Double-Channel-Siamese-Ensemble model for protein protein interaction prediction.

BMC Genomics Aug 2022

Authors: Wenqi Chen, Shuang Wang, Tao Song...

BACKGROUND

Protein-protein interaction (PPI) is very important for many biochemical processes. Therefore, accurate prediction of PPI can help us better understand the role of proteins in biochemical processes. Although there are many methods to predict PPI in biology, they are time-consuming and lack accuracy, so it is necessary to build an efficient and accurate computational model in the field of PPI prediction.

RESULTS

We present a novel sequence-based computational approach called DCSE (Double-Channel-Siamese-Ensemble) to predict potential PPI. In the encoding layer, we treat each amino acid as a word and map it to an N-dimensional vector. In the feature extraction layer, we extract features from local and global perspectives using a Multilayer Convolutional Neural Network (MCN) and a Multilayer Bidirectional Gated Recurrent Unit with Convolutional Neural Networks (MBC). The output of the feature extraction layer is then fed into the prediction layer, which outputs whether the input protein pair will interact with each other. The MCN and MBC are Siamese- and ensemble-based networks, which can effectively improve the performance of the model. To demonstrate our model's performance, we compare it with four machine learning-based and three deep learning-based models. The results show that our method outperforms other models in all evaluation criteria. The Accuracy, Precision, F1, Recall and MCC of our model are 0.9303, 0.9091, 0.9268, 0.9452 and 0.8609, respectively. For the other seven models, the highest Accuracy, Precision, F1, Recall and MCC are 0.9288, 0.9243, 0.9246, 0.9250 and 0.8572, respectively. We also test our model on an imbalanced dataset and transfer it to another species. The results show that our model also performs well in these settings.
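
The encoding step and the Siamese idea described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the embedding dimension, the random initialisation, and the helper names `build_embedding` and `encode` are assumptions made for this example, and a real model would learn the lookup table during training.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def build_embedding(dim, seed=0):
    """Return a lookup table mapping each residue to a dim-dimensional vector.

    Hypothetical helper: the values are random here, whereas a trained
    model would learn them.
    """
    rng = random.Random(seed)
    return {aa: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for aa in AMINO_ACIDS}


def encode(sequence, table, dim):
    """Treat each residue as a 'word' and map it to its vector.

    Unknown residues (e.g. 'X') map to a zero vector, an assumption made
    only for this sketch.
    """
    zero = [0.0] * dim
    return [table.get(aa, zero) for aa in sequence.upper()]


dim = 8
table = build_embedding(dim)

# Siamese idea: both proteins of a pair pass through the SAME table,
# so identical residues always receive identical vectors.
protein_a = encode("MKTAYIAK", table, dim)
protein_b = encode("MKVLA", table, dim)
print(len(protein_a), len(protein_a[0]))  # residues in protein A, vector size
```

Sharing one table between the two inputs is what keeps the representations of the two proteins comparable; the convolutional and recurrent layers that follow in the paper are not reproduced here.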

CONCLUSION

Compared with seven other models, our model achieves the best performance. The NLP-based encoding method is effective for the PPI prediction task. MCN and MBC extract protein sequence features from local and global perspectives, and both feature extraction layers are based on Siamese and ensemble network structures. The Siamese-based network structure keeps the features consistent, and the ensemble-based network structure effectively improves the accuracy of the model.

Topics: Amino Acid Sequence; Machine Learning; Neural Networks, Computer; Proteins

PubMed: 35922751
DOI: 10.1186/s12864-022-08772-6

PIPENN: protein interface prediction from sequence with an ensemble of neural nets.

Bioinformatics (Oxford, England) Apr 2022

Authors: Bas Stringer, Hans de Ferrante, Sanne Abeln...

MOTIVATION

The interactions between proteins and other molecules are essential to many biological and cellular processes. Experimental identification of interface residues is a time-consuming, costly and challenging task, while protein sequence data are ubiquitous. Consequently, many computational and machine learning approaches have been developed over the years to predict such interface residues from sequence. However, the effectiveness of different Deep Learning (DL) architectures and learning strategies for protein-protein, protein-nucleotide and protein-small molecule interface prediction has not yet been investigated in great detail. Therefore, we here explore the prediction of protein interface residues using six DL architectures and various learning strategies with sequence-derived input features.

RESULTS

We constructed a large dataset dubbed BioDL, comprising protein-protein interactions from the PDB, and DNA/RNA and small molecule interactions from the BioLip database. We also constructed six DL architectures, and evaluated them on the BioDL benchmarks. This shows that no single architecture performs best on all instances. An ensemble architecture, which combines all six architectures, does consistently achieve peak prediction accuracy. We confirmed these results on the published benchmark set by Zhang and Kurgan (ZK448), and on our own existing curated homo- and heteromeric protein interaction dataset. Our PIPENN sequence-based ensemble predictor outperforms current state-of-the-art sequence-based protein interface predictors on ZK448 on all interaction types, achieving an AUC-ROC of 0.718 for protein-protein, 0.823 for protein-nucleotide and 0.842 for protein-small molecule.
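
As a rough illustration of combining several predictors into one ensemble, the sketch below averages per-residue interface probabilities from multiple models. Plain averaging, the 0.5 decision threshold, and the function name `ensemble_average` are assumptions for this example; the abstract does not say how PIPENN combines its six architectures, and the published method may differ.

```python
def ensemble_average(per_model_scores):
    """Average per-residue interface probabilities across models.

    per_model_scores: list of lists, one inner list per model, all of the
    same length (one probability per residue).
    """
    if not per_model_scores:
        raise ValueError("need at least one model")
    n_residues = len(per_model_scores[0])
    if any(len(scores) != n_residues for scores in per_model_scores):
        raise ValueError("all models must score the same residues")
    n_models = len(per_model_scores)
    return [sum(column) / n_models for column in zip(*per_model_scores)]


# Three hypothetical models scoring a four-residue protein.
scores = [
    [0.9, 0.2, 0.4, 0.1],
    [0.7, 0.4, 0.6, 0.1],
    [0.8, 0.3, 0.2, 0.1],
]
averaged = ensemble_average(scores)

# Residues whose averaged probability reaches the (assumed) threshold
# are called interface residues.
predicted_interface = [p >= 0.5 for p in averaged]
```

Averaging smooths out cases where a single architecture is confidently wrong, which is one common motivation for ensembles when no single model wins on every instance.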

AVAILABILITY AND IMPLEMENTATION

Source code and datasets are available at https://github.com/ibivu/pipenn/.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.

Topics: Proteins; Machine Learning; Software; Amino Acid Sequence; Nucleotides; Computational Biology

PubMed: 35150231
DOI: 10.1093/bioinformatics/btac071

SPEACH_AF: Sampling protein ensembles and conformational heterogeneity with Alphafold2.

PLoS Computational Biology Aug 2022

Authors: Richard A Stein, Hassane S Mchaourab

The unprecedented performance of Deepmind's Alphafold2 in predicting protein structure in CASP XIV and the creation of a database of structures for multiple proteomes and protein sequence repositories is reshaping structural biology. However, because this database returns a single structure, it brought into question Alphafold's ability to capture the intrinsic conformational flexibility of proteins. Here we present a general approach to drive Alphafold2 to model alternate protein conformations through simple manipulation of the multiple sequence alignment via in silico mutagenesis. The approach is grounded in the hypothesis that the multiple sequence alignment must also encode for protein structural heterogeneity, thus its rational manipulation will enable Alphafold2 to sample alternate conformations. A systematic modeling pipeline is benchmarked against canonical examples of protein conformational flexibility and applied to interrogate the conformational landscape of membrane proteins. This work broadens the applicability of Alphafold2 by generating multiple protein conformations to be tested biologically, biochemically, biophysically, and for use in structure-based drug design.
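
The idea of manipulating a multiple sequence alignment by in silico mutagenesis can be sketched as a column-wise substitution. This is an illustration under stated assumptions, not the SPEACH_AF pipeline: the choice of alanine as the replacement residue, the window boundaries, and the function name `mutate_msa_columns` are all hypothetical, and the step of passing the modified alignment to Alphafold2 is not shown.

```python
def mutate_msa_columns(msa, start, end, replacement="A"):
    """Replace residues in alignment columns [start, end) with `replacement`.

    Gap characters ('-') are left untouched so the alignment stays in register.
    """
    mutated = []
    for seq in msa:
        window = "".join(
            c if c == "-" else replacement for c in seq[start:end]
        )
        mutated.append(seq[:start] + window + seq[end:])
    return mutated


# A toy three-sequence alignment (made-up sequences).
msa = [
    "MKTLLVAG",
    "MK-LLIAG",
    "MRTLMVSG",
]
mutated = mutate_msa_columns(msa, start=2, end=5)
for row in mutated:
    print(row)
```

Sliding such a window along the alignment would yield a series of modified alignments, each of which could be submitted as a separate prediction to look for alternate conformations.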

Topics: Amino Acid Sequence; Drug Design; Protein Conformation; Proteins; Sequence Alignment

PubMed: 35994486
DOI: 10.1371/journal.pcbi.1010483

Rapid multiple protein sequence search by parallel and heterogeneous computation.

Bioinformatics (Oxford, England) Mar 2024

Authors: Jiefu Li, Ziyuan Wang, Xuwei Fan...

MOTIVATION

Protein sequence database search and multiple sequence alignment generation is a fundamental task in many bioinformatics analyses. As the data volume of sequences continues to grow rapidly, there is an increasing need for efficient and scalable multiple sequence query algorithms for super-large databases without expensive time and computational costs.

RESULTS

We introduce Chorus, a novel protein sequence query system that leverages a parallel model and a heterogeneous computation architecture to enable users to query thousands of protein sequences concurrently against large protein databases on a desktop workstation. Chorus achieves over 100× speedup over BLASTP without sacrificing sensitivity. We demonstrate the utility of Chorus through a case study analyzing a ∼1.5-TB large-scale metagenomic dataset for novel CRISPR-Cas protein discovery within 30 min.
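
To make the idea of querying many sequences concurrently more concrete, here is a toy sketch that dispatches a batch of queries to a worker pool. It is not Chorus's algorithm: the shared k-mer scoring, the thread pool, and the names `kmers`, `best_hit` and `search_batch` are assumptions chosen for brevity, and Python threads will not reproduce the speedups of a heterogeneous (for example CPU plus accelerator) implementation.

```python
from concurrent.futures import ThreadPoolExecutor


def kmers(seq, k=3):
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def best_hit(query, database, k=3):
    """Return (name, shared_kmer_count) for the entry sharing the most k-mers."""
    query_kmers = kmers(query, k)
    scored = (
        (name, len(query_kmers & kmers(seq, k))) for name, seq in database.items()
    )
    return max(scored, key=lambda item: item[1])


def search_batch(queries, database, workers=4):
    """Run many queries concurrently; results keep the order of `queries`."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda q: best_hit(q, database), queries))


# Made-up database and queries for illustration only.
database = {
    "protA": "MKTAYIAKQRQISFVK",
    "protB": "GSHMLEDPVTGQWERT",
}
queries = ["TAYIAKQ", "LEDPVTG"]
hits = search_batch(queries, database)
print(hits)
```

The point of the sketch is the batching pattern: each query is independent, so the work can be spread across whatever compute units are available.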

AVAILABILITY AND IMPLEMENTATION

Chorus is open-source and its code repository is available at https://github.com/Bio-Acc/Chorus.

Topics: Software; Algorithms; Amino Acid Sequence; Proteins; Databases, Protein

PubMed: 38547405
DOI: 10.1093/bioinformatics/btae151

A quest for cytosolic sequons and their functions.

Scientific Reports Apr 2024

Authors: Manthan Desai, Syed Rafid Chowdhury, Bingyun Sun...

Evolution shapes protein sequences for their functions. Here, we studied the moonlighting functions of the N-linked sequon NXS/T, where X is not P, in human nucleocytosolic proteins. By comparison with membrane and secreted proteins, in which sequons are well known for N-glycosylation, we discovered that cyto-sequons can participate in nucleic acid binding, particularly in zinc finger proteins. Our global studies further discovered that sequon occurrence is largely proportional to protein length. The contribution of sequons to protein functions, including both N-glycosylation and nucleic acid binding, can be regulated through their density as well as the biased usage between NXS and NXT. In proteins where other PTMs or structural features are rich, such as phosphorylation, transmembrane α-helices, and disulfide bridges, sequon occurrence is scarce. The information acquired here should help understand the relationship between protein sequence and function and assist future protein design and engineering.
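
The sequon definition given above (N, then any residue except P, then S or T) maps directly onto a regular expression. The sketch below is one way to locate sequons and compute a length-normalised density; the function names and the per-100-residue normalisation are assumptions for this example, not the authors' analysis code.

```python
import re

# N, then any residue except P, then S or T. The lookahead lets
# overlapping sequons (e.g. in "NNTS") each be reported.
SEQUON = re.compile(r"N(?=[^P][ST])")


def find_sequons(sequence):
    """Return (0-based position, three-letter motif) for every NXS/T sequon."""
    seq = sequence.upper()
    return [(m.start(), seq[m.start():m.start() + 3]) for m in SEQUON.finditer(seq)]


def sequon_density(sequence):
    """Sequons per 100 residues, a simple length-normalised measure."""
    return 100.0 * len(find_sequons(sequence)) / len(sequence) if sequence else 0.0


# "NPT" is skipped because X is proline; "NNT" and "NTS" overlap.
print(find_sequons("MNGSANPTKNNTS"))  # [(1, 'NGS'), (9, 'NNT'), (10, 'NTS')]
```

Counting NXS and NXT motifs separately from the returned list would give the usage bias mentioned in the abstract.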

Topics: Humans; Proteins; Glycosylation; Amino Acid Sequence; Phosphorylation; Nucleic Acids

PubMed: 38565583
DOI: 10.1038/s41598-024-57334-1

Improving sequence-based modeling of protein families using secondary-structure quality assessment.

Bioinformatics (Oxford, England) Nov 2021

Authors: Cyril Malbranke, David Bikard, Simona Cocco...

MOTIVATION

Modeling of protein family sequence distribution from homologous sequence data has recently received considerable attention, in particular for structure and function predictions, as well as for protein design. Notably, direct coupling analysis, a method to infer effective pairwise interactions between residues, was shown to capture important structural constraints and to successfully generate functional protein sequences. Building on this and other graphical models, we introduce a new framework to assess the quality of the secondary structures of the generated sequences with respect to reference structures for the family.

RESULTS

We introduce two scoring functions, called Dot Product and Pattern Matching, that characterize the likelihood that the secondary structure of a protein sequence matches a reference structure. We test these scores on published experimental protein mutagenesis and design datasets, and show improvement in the detection of nonfunctional sequences. We also show that use of these scores helps reject nonfunctional sequences generated by graphical models (Restricted Boltzmann Machines) learned from homologous sequence alignments.
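
One plausible reading of a dot-product style score is sketched below: at each position, take the dot product between the predicted secondary-structure probabilities and the one-hot encoded reference state, then average over positions. The abstract does not give the exact formula, so this definition, the three-state alphabet, and the name `dot_product_score` are assumptions and may differ from the published Dot Product score.

```python
STATES = "HEC"  # helix, strand, coil: a three-state alphabet assumed for this sketch


def dot_product_score(predicted, reference):
    """Mean per-position dot product between predicted state probabilities
    and the one-hot encoded reference secondary structure.

    predicted: list of dicts mapping each state in STATES to a probability.
    reference: string over STATES with the same length as `predicted`.
    """
    if len(predicted) != len(reference):
        raise ValueError("prediction and reference must have equal length")
    if not reference:
        raise ValueError("empty input")
    # The dot product with a one-hot vector just picks out the probability
    # assigned to the reference state.
    total = sum(probs[state] for probs, state in zip(predicted, reference))
    return total / len(reference)


# Made-up predictions for a three-residue sequence.
predicted = [
    {"H": 0.8, "E": 0.1, "C": 0.1},
    {"H": 0.6, "E": 0.1, "C": 0.3},
    {"H": 0.2, "E": 0.1, "C": 0.7},
]
good = dot_product_score(predicted, "HHC")  # reference agrees with the prediction
poor = dot_product_score(predicted, "EEE")  # reference disagrees
```

A generated sequence whose predicted structure scores low against the family's reference would be a candidate for rejection as likely nonfunctional.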

AVAILABILITY AND IMPLEMENTATION

Data and code available at https://github.com/CyrilMa/ssqa.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.

Topics: Proteins; Amino Acid Sequence; Sequence Alignment; Protein Structure, Secondary; Mutagenesis

PubMed: 34117879
DOI: 10.1093/bioinformatics/btab442

MAFFT-DASH: integrated protein sequence and structural alignment.

Nucleic Acids Research Jul 2019

Authors: John Rozewicki, Songling Li, Karlou Mar Amada...

Here, we describe a web server that integrates structural alignments with the MAFFT multiple sequence alignment (MSA) tool. For this purpose, we have prepared a web-based Database of Aligned Structural Homologs (DASH), which provides structural alignments at the domain and chain levels for all proteins in the Protein Data Bank (PDB), and can be queried interactively or by a simple REST-like API. MAFFT-DASH integration can be invoked with a single flag on either the web (https://mafft.cbrc.jp/alignment/server/) or command-line versions of MAFFT. In our benchmarks using 878 cases from the BAliBase, HomFam, OXFam, Mattbench and SISYPHUS datasets, MAFFT-DASH showed 10-20% improvement over standard MAFFT for MSA problems with weak similarity, in terms of Sum-of-Pairs (SP), a measure of how well a program succeeds at aligning input sequences in comparison to a reference alignment. When MAFFT alignments were supplemented with homologous sequences, further improvement was observed. Potential applications of DASH beyond MSA enrichment include functional annotation through detection of remote homology and assembly of template libraries for homology modeling.
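
The Sum-of-Pairs (SP) measure mentioned above can be sketched as the fraction of residue pairs aligned in the reference that are also aligned in the test alignment. This is a minimal version for illustration; the helper names are hypothetical, and benchmark suites may restrict scoring to core regions or apply other conventions not shown here.

```python
def aligned_pairs(alignment):
    """Return the set of residue pairs placed in the same column.

    Each pair is ((seq_i, residue_index_i), (seq_j, residue_index_j)) with
    i < j; gaps ('-') are never paired.
    """
    counters = [0] * len(alignment)
    pairs = set()
    for column in zip(*alignment):
        present = []
        for seq_idx, char in enumerate(column):
            if char != "-":
                present.append((seq_idx, counters[seq_idx]))
                counters[seq_idx] += 1
        for a in range(len(present)):
            for b in range(a + 1, len(present)):
                pairs.add((present[a], present[b]))
    return pairs


def sum_of_pairs(test, reference):
    """Fraction of reference residue pairs that the test alignment reproduces."""
    ref_pairs = aligned_pairs(reference)
    if not ref_pairs:
        return 0.0
    return len(ref_pairs & aligned_pairs(test)) / len(ref_pairs)


# Same two sequences, aligned two different ways (toy example).
reference = ["AC-GT", "ACTGT"]
test = ["ACG-T", "ACTGT"]
print(sum_of_pairs(test, reference))  # 0.75
```

In the toy example the test alignment misplaces one G, so three of the four reference pairs are recovered.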

Topics: Algorithms; Amino Acid Sequence; Databases, Protein; Humans; Proteins; Sequence Alignment; Sequence Analysis, Protein; Sequence Analysis, RNA; Sequence Homology; Software

PubMed: 31062021
DOI: 10.1093/nar/gkz342

Predicting subcellular location of protein with evolution information and sequence-based deep learning.

BMC Bioinformatics Oct 2021

Authors: Zhijun Liao, Gaofeng Pan, Chao Sun...

BACKGROUND

Protein subcellular localization prediction plays an important role in biology research. Since traditional methods are laborious and time-consuming, many machine learning-based prediction methods have been proposed. However, most of the proposed methods ignore the evolution information of proteins. In order to improve the prediction accuracy, we present a deep learning-based method to predict protein subcellular locations.

RESULTS

Our method utilizes not only the amino acid sequence composition but also the evolution matrices of proteins. Our method uses a bidirectional long short-term memory network that processes the entire protein sequence and a convolutional neural network that extracts features from protein sequences. The position specific scoring matrix is used as a supplement to protein sequences. Our method was trained and tested on two benchmark datasets. The experimental results show that our method yields accurate results on the two datasets, with an average precision of 0.7901, ranking loss of 0.0758 and coverage of 1.2848.
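
For readers unfamiliar with position specific scoring matrices, the sketch below derives a simple log-odds matrix from a set of aligned sequences. It is illustrative only: the uniform background, the pseudocount, and the name `build_pssm` are assumptions, and the abstract does not say which tool or settings the authors used to generate their matrices.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned_sequences, pseudocount=1.0):
    """Return one dict per alignment column mapping residue -> log2 odds score.

    Uses a uniform background of 1/20, a simplification; real tools use
    empirical background frequencies and sequence weighting.
    """
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for column in zip(*aligned_sequences):
        residues = [c for c in column if c in AMINO_ACIDS]  # skip gaps/unknowns
        total = len(residues) + pseudocount * len(AMINO_ACIDS)
        scores = {}
        for aa in AMINO_ACIDS:
            freq = (residues.count(aa) + pseudocount) / total
            scores[aa] = math.log2(freq / background)
        pssm.append(scores)
    return pssm


# Toy alignment of three made-up homologs.
alignment = ["MKV", "MRV", "MKI"]
pssm = build_pssm(alignment)
print(round(pssm[0]["M"], 3), round(pssm[0]["A"], 3))
```

Each column of the matrix summarises which residues evolution tolerates at that position, which is the kind of information a raw sequence alone does not carry.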

CONCLUSION

The experimental results show that our method outperforms five currently available methods. These experiments indicate that our method is an acceptable alternative for predicting protein subcellular location.

Topics: Amino Acid Sequence; Computational Biology; Databases, Protein; Deep Learning; Position-Specific Scoring Matrices; Proteins

PubMed: 34686152
DOI: 10.1186/s12859-021-04404-0

Exploring protein sequence-function landscapes.

Nature Biotechnology Feb 2017

Authors: Tyler N Starr, Joseph W Thornton

Topics: Amino Acid Sequence; Humans; Proteins

PubMed: 28178247
DOI: 10.1038/nbt.3786

General overview on structure prediction of twilight-zone proteins.

Theoretical Biology & Medical Modelling Sep 2015

Review

Authors: Bee Yin Khor, Gee Jun Tye, Theam Soon Lim...

Protein structure prediction from amino acid sequence has been one of the most challenging aspects in computational structural biology, despite significant progress in recent years shown by the critical assessment of protein structure prediction (CASP) experiments. When experimentally determined structures are unavailable, predicted structures may serve as starting points to study a protein. If the target protein contains a homologous region, a high-resolution (typically <1.5 Å) model can be built via comparative modelling. However, when the target protein has low sequence similarity (a so-called twilight-zone protein, whose sequence identity with available templates is less than 30%), protein structure prediction has to be initiated from scratch. Traditionally, twilight-zone proteins can be predicted via threading or ab initio methods. Based on the current trend, a combination of different methods brings improved success in the prediction of twilight-zone proteins. In this mini review, the methods, progress and challenges for the prediction of twilight-zone proteins are discussed.
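
The 30% identity cutoff used above to define twilight-zone proteins can be made concrete with a small sketch. It assumes the target and template have already been aligned; the function names, and the choice to count only columns where both sequences have a residue, are assumptions for this example, since other denominators are also in common use.

```python
TWILIGHT_THRESHOLD = 30.0  # percent identity, the cutoff quoted in the text


def percent_identity(aligned_a, aligned_b):
    """Percent identity over columns where both sequences have a residue."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    if not compared:
        return 0.0
    matches = sum(1 for a, b in compared if a == b)
    return 100.0 * matches / len(compared)


def is_twilight_zone(aligned_target, aligned_template):
    """True when identity to the template falls below the 30% cutoff."""
    return percent_identity(aligned_target, aligned_template) < TWILIGHT_THRESHOLD


# Made-up aligned pair: 2 identical residues out of 9 compared columns.
target = "MKT-AYIAKQ"
template = "MRSLGWLGKE"
print(round(percent_identity(target, template), 1), is_twilight_zone(target, template))
```

A target that falls below the cutoff against every available template would, following the review, be routed to threading or ab initio methods rather than comparative modelling.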

Topics: Amino Acid Sequence; Computational Biology; Models, Molecular; Molecular Sequence Data; Proteins; Sequence Analysis, Protein

PubMed: 26338054
DOI: 10.1186/s12976-015-0014-1