protein sequence - OpenMD.com Journal Search

Selecting the Right Similarity-Scoring Matrix.

Current Protocols in Bioinformatics 2013

Authors: William R Pearson

Protein sequence similarity searching programs like BLASTP, SSEARCH (UNIT 3.10), and FASTA use scoring matrices that are designed to identify distant evolutionary relationships (BLOSUM62 for BLAST, BLOSUM50 for SSEARCH and FASTA). Different similarity scoring matrices are most effective at different evolutionary distances. "Deep" scoring matrices like BLOSUM62 and BLOSUM50 target alignments with 20-30% identity, while "shallow" scoring matrices (e.g., VTML10-VTML80) target alignments that share 90-50% identity, reflecting much less evolutionary change. While "deep" matrices provide very sensitive similarity searches, they also require longer sequence alignments and can sometimes produce alignment overextension into non-homologous regions. Shallower scoring matrices are more effective when searching for short protein domains, or when the goal is to limit the scope of the search to sequences that are likely to be orthologous between recently diverged organisms. Likewise, in DNA searches, the match and mismatch parameters set evolutionary look-back times and domain boundaries. In this unit, we will discuss the theoretical foundations that drive practical choices of protein and DNA similarity scoring matrices and gap penalties. Deep scoring matrices (BLOSUM62 and BLOSUM50) should be used for sensitive searches with full-length protein sequences, but short domains or restricted evolutionary look-back require shallower scoring matrices.
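The effect of match and mismatch parameters on alignment boundaries can be illustrated with a minimal Smith-Waterman sketch. This is not code from the unit: the function name, the toy DNA sequences, and the parameter values (match, mismatch, and a linear gap penalty) are assumptions chosen only for illustration.

```python
def local_alignment_score(a, b, match=1, mismatch=-3, gap=-5):
    """Best Smith-Waterman local alignment score with a linear gap penalty.

    The default parameter values are illustrative assumptions, not
    recommendations taken from the unit.
    """
    prev = [0] * (len(b) + 1)  # scores for the previous row of the DP matrix
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            # Local alignment: a cell never drops below zero.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


# Two toy sequences that differ at a single central position.
query, subject = "ACGTACGT", "ACGTTCGT"

# A harsh mismatch penalty stops the alignment at the mismatch.
strict = local_alignment_score(query, subject, match=1, mismatch=-3)

# A mild mismatch penalty lets the alignment extend through it.
permissive = local_alignment_score(query, subject, match=1, mismatch=-1)
```

With the harsh penalty the best local alignment covers only one four-base half; with the mild penalty it spans all eight positions. This is a toy version of how scoring parameters move alignment boundaries.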

Topics: Amino Acid Sequence; Amino Acid Substitution; DNA; Molecular Sequence Data; Position-Specific Scoring Matrices; Sequence Alignment; Sequence Homology, Amino Acid

PubMed: 24509512
DOI: 10.1002/0471250953.bi0305s43

DNCON2_Inter: predicting interchain contacts for homodimeric and homomultimeric protein complexes using multiple sequence alignments of monomers and deep learning.

Scientific Reports Jun 2021

Authors: Farhan Quadir, Raj S Roy, Randal Halfmann...

Deep learning methods that achieved great success in predicting intrachain residue-residue contacts have been applied to predict interchain contacts between proteins. However, these methods require multiple sequence alignments (MSAs) of a pair of interacting proteins (dimers) as input, which are often difficult to obtain because there are not many known protein complexes available to generate MSAs of sufficient depth for a pair of proteins. In recognizing that multiple sequence alignments of a monomer that forms homomultimers contain the co-evolutionary signals of both intrachain and interchain residue pairs in contact, we applied DNCON2 (a deep learning-based protein intrachain residue-residue contact predictor) to predict both intrachain and interchain contacts for homomultimers using multiple sequence alignment (MSA) and other co-evolutionary features of a single monomer followed by discrimination of interchain and intrachain contacts according to the tertiary structure of the monomer. We name this tool DNCON2_Inter. Allowing true-positive predictions within two residue shifts, the best average precision was obtained for the Top-L/10 predictions of 22.9% for homodimers and 17.0% for higher-order homomultimers. In some instances, especially where interchain contact densities are high, DNCON2_Inter predicted interchain contacts with 100% precision. We also developed Con_Complex, a complex structure reconstruction tool that uses predicted contacts to produce the structure of the complex. Using Con_Complex, we show that the predicted contacts can be used to accurately construct the structure of some complexes. Our experiment demonstrates that monomeric multiple sequence alignments can be used with deep learning to predict interchain contacts of homomeric proteins.
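The Top-L/10 precision with a shift tolerance can be sketched as follows. This is a hedged illustration, not the authors' evaluation script: the function name and data layout are hypothetical, and "within two residue shifts" is interpreted here as each residue index being allowed to differ from a true contact by at most two positions, which is an assumption.

```python
def top_l_over_k_precision(predicted, true_contacts, seq_len, k=10, shift=2):
    """Precision of the top L/k scored contacts with a residue-shift tolerance.

    predicted: iterable of (i, j, score) tuples (hypothetical layout).
    true_contacts: iterable of (i, j) residue index pairs.
    A prediction counts as correct if some true contact lies within
    `shift` positions of it on both indices (assumed interpretation).
    """
    n_top = max(1, seq_len // k)
    ranked = sorted(predicted, key=lambda c: c[2], reverse=True)[:n_top]
    if not ranked:
        return 0.0
    truth = set(true_contacts)
    offsets = range(-shift, shift + 1)
    hits = sum(
        1
        for i, j, _ in ranked
        if any((i + di, j + dj) in truth for di in offsets for dj in offsets)
    )
    return hits / len(ranked)


# Toy example: L = 20, so Top-L/10 keeps the two highest-scoring pairs.
preds = [(1, 10, 0.9), (3, 15, 0.8), (5, 8, 0.1)]
truth = [(2, 12), (7, 7)]
relaxed = top_l_over_k_precision(preds, truth, seq_len=20)          # one of two hits
exact = top_l_over_k_precision(preds, truth, seq_len=20, shift=0)   # no exact hits
```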

Topics: Algorithms; Amino Acid Sequence; Computational Biology; Deep Learning; Protein Conformation; Proteins; Sequence Alignment; Sequence Analysis, Protein; Software

PubMed: 34112907
DOI: 10.1038/s41598-021-91827-7

CMsearch: simultaneous exploration of protein sequence space and structure space improves not only protein homology detection but also protein structure prediction.

Bioinformatics (Oxford, England) Jun 2016

Authors: Xuefeng Cui, Zhiwu Lu, Sheng Wang...

MOTIVATION

Protein homology detection, a fundamental problem in computational biology, is an indispensable step toward predicting protein structures and understanding protein functions. Despite the advances in recent decades on sequence alignment, threading and alignment-free methods, protein homology detection remains a challenging open problem. Recently, network methods that try to find transitive paths in the protein structure space demonstrate the importance of incorporating network information of the structure space. Yet, current methods merge the sequence space and the structure space into a single space, and thus introduce inconsistency in combining different sources of information.

METHOD

We present a novel network-based protein homology detection method, CMsearch, based on cross-modal learning. Instead of exploring a single network built from the mixture of sequence and structure space information, CMsearch builds two separate networks to represent the sequence space and the structure space. It then learns sequence-structure correlation by simultaneously taking sequence information, structure information, sequence space information and structure space information into consideration.

RESULTS

We tested CMsearch on two challenging tasks, protein homology detection and protein structure prediction, by querying all 8332 PDB40 proteins. Our results demonstrate that CMsearch is insensitive to the similarity metrics used to define the sequence and the structure spaces. By using HMM-HMM alignment as the sequence similarity metric, CMsearch clearly outperforms state-of-the-art homology detection methods and the CASP-winning template-based protein structure prediction methods.

AVAILABILITY AND IMPLEMENTATION

Our program is freely available for download from http://sfb.kaust.edu.sa/Pages/Software.aspx

CONTACT

: [email protected]

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.

Topics: Algorithms; Amino Acid Sequence; Computational Biology; Proteins; Sequence Alignment; Sequence Analysis, Protein; Software

PubMed: 27307635
DOI: 10.1093/bioinformatics/btw271

Calling the amino acid sequence of a protein/peptide from the nanospectrum produced by a sub-nanometer diameter pore.

Scientific Reports Oct 2022

Authors: Xiaowen Liu, Zhuxin Dong, Gregory Timp...

The blockade current that develops when a protein translocates across a thin membrane through a sub-nanometer diameter pore informs with extreme sensitivity on the sequence of amino acids that constitute the protein. The current blockade signals measured during the translocation are called a nanospectrum of the protein. Whereas mass spectrometry (MS) is still the dominant technology for protein identification, it suffers from limitations. In proteome-wide studies, MS identifies proteins by database search but often fails to provide high protein sequence coverage. It is also not very sensitive, requiring about a femtomole for protein identification. Compared with MS, a sub-nanometer diameter pore (i.e., a sub-nanopore) directly reads the amino acids constituting a single protein molecule, but efficient computational tools are still required for processing and interpreting nanospectra. Here, we delineate computational methods for processing sub-nanopore nanospectra and predicting theoretical nanospectra from protein sequences, which are essential for protein identification.

Topics: Amino Acid Sequence; Proteome; Peptides; Nanopores; Amino Acids

PubMed: 36284132
DOI: 10.1038/s41598-022-22305-x

Prediction of Protein Pairs Sharing Common Active Ligands Using Protein Sequence, Structure, and Ligand Similarity.

Journal of Chemical Information and... Sep 2016

Authors: Yu-Chen Chen, Robert Tolbert, Alex M Aronov...

We benchmarked the ability of comparative computational approaches to correctly discriminate protein pairs sharing a common active ligand (positive protein pairs) from protein pairs with no common active ligands (negative protein pairs). Since the target and the off-targets of a drug share at least a common ligand, i.e., the drug itself, the prediction of positive protein pairs may help identify off-targets. We evaluated representative protein-centric and ligand-centric approaches, including (1) 2D and 3D ligand similarity, (2) several measures of protein sequence similarity in conjunction with different sequence sources (e.g., full protein sequence versus binding site residues), and (3) a newly described pocket shape similarity and alignment program called SiteHopper. While the sequence-based alignment of pocket residues achieved the best overall performance, SiteHopper outperformed sequence-based approaches for unrelated proteins with only 20-30% pocket residue identity. Analogously, among ligand-centric approaches, path-based fingerprints achieved the best overall performance, but ROCS-based ligand shape similarity outperformed path-based fingerprints for structurally dissimilar ligands (Tanimoto 25%-40%). A significant drop in recognition performance was observed for ligand-centric approaches when PDB ligands were used instead of ChEMBL ligands. Finally, we analyzed the relationship between pocket shape and ligand shape in our data set and found that similar ligands tend to bind to similar pockets while similar pockets may accept a range of different-shaped ligands.
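The Tanimoto similarity mentioned for ligand fingerprints can be sketched as below. This is a minimal illustration on sets of "on" bit indices; the function name and the toy fingerprints are hypothetical, and the sketch does not reproduce the path-based fingerprints or ROCS shape scoring used in the study.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    a, b = set(fp_a), set(fp_b)
    union = a | b
    if not union:
        return 0.0  # convention assumed here for two empty fingerprints
    return len(a & b) / len(union)


# Hypothetical on-bit indices for two ligands.
ligand_1 = {1, 2, 3, 4}
ligand_2 = {3, 4, 5, 6}
similarity = tanimoto(ligand_1, ligand_2)  # 2 shared bits out of 6 distinct bits
```

A value of about 0.33, as in this toy pair, would fall inside the 25%-40% range the abstract describes as structurally dissimilar.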

Topics: Amino Acid Sequence; Benchmarking; Computational Biology; Ligands; Models, Molecular; Protein Conformation; Proteins

PubMed: 27559831
DOI: 10.1021/acs.jcim.6b00118

qNABpredict: Quick, accurate, and taxonomy-aware sequence-based prediction of content of nucleic acid binding amino acids.

Protein Science : a Publication of the... Jan 2023

Authors: Zhonghua Wu, Sushmita Basu, Xuantai Wu...

Protein sequence-based predictors of nucleic acid (NA)-binding include methods that predict NA-binding proteins and NA-binding residues. The residue-level tools produce more details but suffer high computational cost since they must predict every amino acid in the input sequence and rely on multiple sequence alignments. We propose an alternative approach that predicts content (fraction) of the NA-binding residues, offering more information than the protein-level prediction and much shorter runtime than the residue-level tools. Our first-of-its-kind content predictor, qNABpredict, relies on a small, rationally designed and fast-to-compute feature set that represents relevant characteristics extracted from the input sequence and a well-parametrized support vector regression model. We provide two versions of qNABpredict, a taxonomy-agnostic model that can be used for proteins of unknown taxonomic origin and more accurate taxonomy-aware models that are tailored to specific taxonomic kingdoms: archaea, bacteria, eukaryota, and viruses. Empirical tests on a low-similarity test dataset show that qNABpredict is 100 times faster and generates statistically more accurate content predictions when compared to the content extracted from results produced by the residue-level predictors. We also show that qNABpredict's content predictions can be used to improve results generated by the residue-level predictors. We release qNABpredict as a convenient webserver and source code at http://biomine.cs.vcu.edu/servers/qNABpredict/. This new tool should be particularly useful to predict details of protein-NA interactions for large protein families and proteomes.
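The "content" quantity can be made concrete with a small sketch of how it would be extracted from residue-level output. This is not qNABpredict itself, which predicts content directly from sequence features; the function name and the binary label format are assumptions for illustration.

```python
def binding_content(residue_labels):
    """Fraction of residues labelled as nucleic-acid binding.

    residue_labels: sequence of truthy/falsy per-residue predictions
    (hypothetical format, e.g. thresholded residue-level scores).
    """
    if not residue_labels:
        raise ValueError("residue_labels must not be empty")
    return sum(1 for label in residue_labels if label) / len(residue_labels)


# Toy protein of five residues, two of them predicted as NA-binding.
content = binding_content([1, 0, 0, 1, 0])
```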

Topics: Amino Acids; Nucleic Acids; Databases, Protein; Amino Acid Sequence; Proteome; Computational Biology

PubMed: 36519304
DOI: 10.1002/pro.4544

CoLiDe: Combinatorial Library Design tool for probing protein sequence space.

Bioinformatics (Oxford, England) May 2021

Authors: Vyacheslav Tretyachenko, Václav Voráček, Radko Souček...

MOTIVATION

Current techniques of protein engineering focus mostly on re-designing small targeted regions or defined structural scaffolds rather than constructing combinatorial libraries of versatile compositions and lengths. This is a missed opportunity because combinatorial libraries are emerging as a vital source of novel functional proteins and are of interest in diverse research areas.

RESULTS

Here, we present a computational tool for Combinatorial Library Design (CoLiDe) offering precise control over protein sequence composition, length and diversity. The algorithm uses an evolutionary approach to provide solutions to combinatorial libraries of degenerate DNA templates. We demonstrate its performance and precision using four different input alphabet distributions on different sequence lengths. In addition, a model design and experimental pipeline for protein library expression and purification is presented, providing a proof-of-concept that our protocol can be used to prepare purified protein library samples of up to 10^11-10^12 unique sequences. CoLiDe presents a composition-centric approach to protein design towards different functional phenomena.
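What a degenerate DNA template encodes can be illustrated by expanding a single degenerate codon into its amino acid distribution. This sketch is not CoLiDe's algorithm: it only enumerates the standard genetic code under IUPAC ambiguity symbols, assumes every allowed base at a position is equally likely, and uses a hypothetical function name.

```python
from collections import Counter
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def degenerate_codon_distribution(codon):
    """Amino acid frequencies encoded by one degenerate codon.

    Assumes each base permitted at a position is incorporated with
    equal probability; '*' denotes a stop codon.
    """
    options = [IUPAC[symbol] for symbol in codon.upper()]
    counts = Counter(CODON_TABLE["".join(bases)] for bases in product(*options))
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()}


# NNK expands to 32 codons covering all 20 amino acids plus one stop.
nnk = degenerate_codon_distribution("NNK")
```

A library design tool has to search over combinations of such codons so that the resulting distribution approaches a target composition; that search is outside the scope of this sketch.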

AVAILABILITY AND IMPLEMENTATION

CoLiDe is implemented in Python and freely available at https://github.com/voracva1/CoLiDe.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.

Topics: Algorithms; Amino Acid Sequence; Gene Library; Protein Engineering; Proteins; Software

PubMed: 32956450
DOI: 10.1093/bioinformatics/btaa804

Detecting protein-protein interactions with a novel matrix-based protein sequence representation and support vector machines.

BioMed Research International 2015

Authors: Zhu-Hong You, Jianqiang Li, Xin Gao...

Proteins and their interactions lie at the heart of most underlying biological processes. Consequently, correct detection of protein-protein interactions (PPIs) is of fundamental importance for understanding the molecular mechanisms in biological systems. Although advances in high-throughput experimental technology make it possible to detect a large number of PPIs, the data generated by these methods are unreliable and may not include all possible PPIs. To address this problem, this study develops a novel computational approach to effectively detect protein interactions. The approach is based on a novel matrix-based representation of protein sequence combined with a support vector machine (SVM), and it fully considers the sequence order and dipeptide information of the protein primary sequence. When applied to yeast PPI datasets, the proposed method reaches 90.06% prediction accuracy with 94.37% specificity at a sensitivity of 85.74%, indicating that this predictor is a useful tool to predict PPIs. The results also demonstrate that our approach can be a helpful supplement to the interactions that have been detected experimentally.
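The three reported figures (accuracy, specificity, sensitivity) all derive from the same confusion matrix. The sketch below shows the standard definitions; the function name and the toy counts are illustrative assumptions and are not data from the study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity and specificity from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
    }


# Toy counts: 10 interacting pairs and 10 non-interacting pairs.
metrics = classification_metrics(tp=8, fp=1, tn=9, fn=2)
```

When positives and negatives are equally numerous, accuracy is the mean of sensitivity and specificity. The abstract's figures fit that pattern, since (85.74% + 94.37%) / 2 is about 90.06%, which would be consistent with a balanced test set, although the abstract does not state this.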

Topics: Amino Acid Sequence; Computational Biology; Databases, Protein; Helicobacter pylori; Humans; Protein Interaction Mapping; Proteins; Saccharomyces cerevisiae; Sequence Analysis, Protein; Support Vector Machine

PubMed: 26000305
DOI: 10.1155/2015/867516

GENERALIST: A latent space based generative model for protein sequence families.

PLoS Computational Biology Nov 2023

Authors: Hoda Akl, Brooke Emison, Xiaochuan Zhao...

Generative models of protein sequence families are an important tool in the repertoire of protein scientists and engineers alike. However, state-of-the-art generative approaches face inference, accuracy, and overfitting-related obstacles when modeling moderately sized to large proteins and/or protein families with low sequence coverage. Here, we present a simple-to-learn, tunable, and accurate generative model, GENERALIST: GENERAtive nonLInear tenSor-factorizaTion for protein sequences. GENERALIST accurately captures several high-order summary statistics of amino acid covariation. GENERALIST also predicts conservative locally optimal sequences that are likely to fold into stable 3D structures. Importantly, unlike current methods, the density of sequences in GENERALIST-modeled sequence ensembles closely resembles the corresponding natural ensembles. Finally, GENERALIST embeds protein sequences in an informative latent space. GENERALIST will be an important tool to study protein sequence variability.

Topics: Proteins; Amino Acid Sequence; Amino Acids

PubMed: 38011273
DOI: 10.1371/journal.pcbi.1011655

Design of a Protein with Improved Thermal Stability by an Evolution-Based Generative Model.

Angewandte Chemie (International Ed. in... Dec 2022

Authors: Pengfei Tian, Adrien Lemaire, Fabien Sénéchal...

Efficient design of functional proteins with higher thermal stability remains challenging, especially for highly diverse sequence variants. Considering the evolutionary pressure on protein folds, sequence design that optimizes evolutionary fitness could help design folds with higher stability. Using a generative evolution fitness model trained to capture variation patterns in natural sequences, we designed artificial sequences of a proteinaceous inhibitor of pectin methylesterase enzymes. These inhibitors are of considerable industrial interest for avoiding phase separation in fruit juice manufacturing or reducing methanol in distillates, averting chromatographic passages that trigger unwanted aroma loss. Six out of seven designs with up to 30% divergence from other inhibitor sequences are functional, and two have improved thermal stability. This method can improve protein stability while expanding functional protein sequence space, with traits valuable for industrial applications and scientific research.

Topics: Amino Acid Sequence; Proteins; Protein Stability

PubMed: 36259321
DOI: 10.1002/anie.202202711