-
Current Protocols in Bioinformatics 2013
Protein sequence similarity searching programs like BLASTP, SSEARCH (UNIT 3.10), and FASTA use scoring matrices that are designed to identify distant evolutionary relationships (BLOSUM62 for BLAST, BLOSUM50 for SSEARCH and FASTA). Different similarity scoring matrices are most effective at different evolutionary distances. "Deep" scoring matrices like BLOSUM62 and BLOSUM50 target alignments with 20-30% identity, while "shallow" scoring matrices (e.g., VTML10-VTML80) target alignments that share 90-50% identity, reflecting much less evolutionary change. While "deep" matrices provide very sensitive similarity searches, they also require longer sequence alignments and can sometimes produce alignment overextension into non-homologous regions. Shallower scoring matrices are more effective when searching for short protein domains, or when the goal is to limit the scope of the search to sequences that are likely to be orthologous between recently diverged organisms. Likewise, in DNA searches, the match and mismatch parameters set evolutionary look-back times and domain boundaries. In this unit, we will discuss the theoretical foundations that drive practical choices of protein and DNA similarity scoring matrices and gap penalties. Deep scoring matrices (BLOSUM62 and BLOSUM50) should be used for sensitive searches with full-length protein sequences, but short domains or restricted evolutionary look-back require shallower scoring matrices.
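The contrast between "deep" and "shallow" matrices can be illustrated with the standard log-odds form of a substitution score, s(a,b) = scale * log2(q_ab / (p_a * p_b)), where q_ab is the frequency of the aligned pair in homologous alignments at the target distance and p_a, p_b are background frequencies. The sketch below is illustrative only: the function name, the scale factor, and all frequencies are hypothetical and are not taken from BLOSUM, VTML, or any other published matrix.

```python
import math


def log_odds_score(q_ab: float, p_a: float, p_b: float, scale: float = 2.0) -> int:
    """Log-odds substitution score, rounded to an integer.

    q_ab  : frequency of the aligned pair (a, b) among homologs (hypothetical here)
    p_a/b : background frequencies of residues a and b
    scale : units per bit (2.0 gives half-bit units; an assumption for this sketch)
    """
    return round(scale * math.log2(q_ab / (p_a * p_b)))


# Hypothetical numbers for one identical residue pair (a aligned with a).
background = 0.05
shallow = log_odds_score(q_ab=0.045, p_a=background, p_b=background)  # residue mostly conserved
deep = log_odds_score(q_ab=0.010, p_a=background, p_b=background)     # residue often substituted

print(shallow, deep)  # 8 4
```

With these made-up frequencies the identity scores twice as much under the shallow model, so a short, highly identical alignment accumulates a significant score quickly, whereas the deep model needs more aligned positions to reach the same total. A pair observed less often than chance (q_ab below p_a * p_b) receives a negative score.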
Topics: Amino Acid Sequence; Amino Acid Substitution; DNA; Molecular Sequence Data; Position-Specific Scoring Matrices; Sequence Alignment; Sequence Homology, Amino Acid
PubMed: 24509512
DOI: 10.1002/0471250953.bi0305s43
-
Scientific Reports Jun 2021
Deep learning methods that achieved great success in predicting intrachain residue-residue contacts have been applied to predict interchain contacts between proteins. However, these methods require multiple sequence alignments (MSAs) of a pair of interacting proteins (dimers) as input, which are often difficult to obtain because there are not many known protein complexes available to generate MSAs of sufficient depth for a pair of proteins. Recognizing that the MSA of a monomer that forms homomultimers contains the co-evolutionary signals of both intrachain and interchain residue pairs in contact, we applied DNCON2 (a deep learning-based protein intrachain residue-residue contact predictor) to predict both intrachain and interchain contacts for homomultimers, using the MSA and other co-evolutionary features of a single monomer, followed by discrimination of interchain and intrachain contacts according to the tertiary structure of the monomer. We name this tool DNCON2_Inter. Allowing true-positive predictions within two residue shifts, the best average precision, obtained for the Top-L/10 predictions, was 22.9% for homodimers and 17.0% for higher-order homomultimers. In some instances, especially where interchain contact densities are high, DNCON2_Inter predicted interchain contacts with 100% precision. We also developed Con_Complex, a complex structure reconstruction tool that uses predicted contacts to produce the structure of the complex. Using Con_Complex, we show that the predicted contacts can be used to accurately construct the structure of some complexes. Our experiment demonstrates that monomeric multiple sequence alignments can be used with deep learning to predict interchain contacts of homomeric proteins.
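The Top-L/10 precision with a two-residue shift tolerance can be sketched as follows. This is a minimal illustration, not the evaluation code of DNCON2_Inter: the function name, the input layout, and the interpretation of "within two residue shifts" (each index of a predicted pair may differ by at most two positions from a true contact) are assumptions.

```python
def top_precision(predicted, true_contacts, seq_len, fraction=10, max_shift=2):
    """Precision of the top L/fraction scored contact predictions.

    predicted     : list of (i, j, score) tuples
    true_contacts : set of (i, j) residue index pairs
    seq_len       : sequence length L of the monomer
    max_shift     : tolerated offset per index (assumed meaning of "residue shift")
    """
    n_top = max(1, seq_len // fraction)
    ranked = sorted(predicted, key=lambda c: c[2], reverse=True)[:n_top]
    if not ranked:
        return 0.0

    def is_hit(i, j):
        # A prediction counts as correct if some true contact lies within the tolerance.
        return any(abs(i - ti) <= max_shift and abs(j - tj) <= max_shift
                   for ti, tj in true_contacts)

    hits = sum(is_hit(i, j) for i, j, _ in ranked)
    return hits / len(ranked)


# Toy example with hypothetical predictions for a 30-residue chain (top 3 are evaluated).
preds = [(1, 20, 0.9), (5, 25, 0.8), (10, 12, 0.7), (3, 8, 0.1)]
truth = {(2, 21), (5, 28)}
precision = top_precision(preds, truth, seq_len=30)
```

In this toy case only the first prediction falls within two positions of a true contact, so the precision is 1/3; with `max_shift=0` it drops to 0, which shows how much the tolerance can affect the reported number.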
Topics: Algorithms; Amino Acid Sequence; Computational Biology; Deep Learning; Protein Conformation; Proteins; Sequence Alignment; Sequence Analysis, Protein; Software
PubMed: 34112907
DOI: 10.1038/s41598-021-91827-7
-
Bioinformatics (Oxford, England) Jun 2016
MOTIVATION
Protein homology detection, a fundamental problem in computational biology, is an indispensable step toward predicting protein structures and understanding protein functions. Despite the advances in recent decades on sequence alignment, threading and alignment-free methods, protein homology detection remains a challenging open problem. Recently, network methods that try to find transitive paths in the protein structure space demonstrate the importance of incorporating network information of the structure space. Yet, current methods merge the sequence space and the structure space into a single space, and thus introduce inconsistency in combining different sources of information.
METHOD
We present a novel network-based protein homology detection method, CMsearch, based on cross-modal learning. Instead of exploring a single network built from the mixture of sequence and structure space information, CMsearch builds two separate networks to represent the sequence space and the structure space. It then learns sequence-structure correlation by simultaneously taking sequence information, structure information, sequence space information and structure space information into consideration.
RESULTS
We tested CMsearch on two challenging tasks, protein homology detection and protein structure prediction, by querying all 8332 PDB40 proteins. Our results demonstrate that CMsearch is insensitive to the similarity metrics used to define the sequence and the structure spaces. By using HMM-HMM alignment as the sequence similarity metric, CMsearch clearly outperforms state-of-the-art homology detection methods and the CASP-winning template-based protein structure prediction methods.
AVAILABILITY AND IMPLEMENTATION
Our program is freely available for download from http://sfb.kaust.edu.sa/Pages/Software.aspx
CONTACT
SUPPLEMENTARY INFORMATION
Supplementary data are available at Bioinformatics online.
Topics: Algorithms; Amino Acid Sequence; Computational Biology; Proteins; Sequence Alignment; Sequence Analysis, Protein; Software
PubMed: 27307635
DOI: 10.1093/bioinformatics/btw271
-
Scientific Reports Oct 2022
The blockade current that develops when a protein translocates across a thin membrane through a sub-nanometer diameter pore informs with extreme sensitivity on the sequence of amino acids that constitute the protein. The current blockade signals measured during the translocation are called a nanospectrum of the protein. Whereas mass spectrometry (MS) is still the dominant technology for protein identification, it suffers from limitations. In proteome-wide studies, MS identifies proteins by database search but often fails to provide high protein sequence coverage. It is also not very sensitive, requiring about a femtomole for protein identification. Compared with MS, a sub-nanometer diameter pore (i.e., a sub-nanopore) directly reads the amino acids constituting a single protein molecule, but efficient computational tools are still required for processing and interpreting nanospectra. Here, we delineate computational methods for processing sub-nanopore nanospectra and predicting theoretical nanospectra from protein sequences, which are essential for protein identification.
Topics: Amino Acid Sequence; Proteome; Peptides; Nanopores; Amino Acids
PubMed: 36284132
DOI: 10.1038/s41598-022-22305-x
-
Journal of Chemical Information and... Sep 2016
We benchmarked the ability of comparative computational approaches to correctly discriminate protein pairs sharing a common active ligand (positive protein pairs) from protein pairs with no common active ligands (negative protein pairs). Since the target and the off-targets of a drug share at least a common ligand, i.e., the drug itself, the prediction of positive protein pairs may help identify off-targets. We evaluated representative protein-centric and ligand-centric approaches, including (1) 2D and 3D ligand similarity, (2) several measures of protein sequence similarity in conjunction with different sequence sources (e.g., full protein sequence versus binding site residues), and (3) a newly described pocket shape similarity and alignment program called SiteHopper. While the sequence-based alignment of pocket residues achieved the best overall performance, SiteHopper outperformed sequence-based approaches for unrelated proteins with only 20-30% pocket residue identity. Analogously, among ligand-centric approaches, path-based fingerprints achieved the best overall performance, but ROCS-based ligand shape similarity outperformed path-based fingerprints for structurally dissimilar ligands (Tanimoto 25%-40%). A significant drop in recognition performance was observed for ligand-centric approaches when PDB ligands were used instead of ChEMBL ligands. Finally, we analyzed the relationship between pocket shape and ligand shape in our data set and found that similar ligands tend to bind to similar pockets while similar pockets may accept a range of different-shaped ligands.
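For readers unfamiliar with the Tanimoto coefficient quoted above for ligand similarity, the sketch below computes it for two fingerprints represented as sets of "on" bit positions. It is a generic illustration: the fingerprints are hypothetical and are not path-based or ROCS-derived descriptors from the study, and returning 0.0 for two empty fingerprints is a convention chosen here.

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto (Jaccard) coefficient: shared on-bits divided by total distinct on-bits."""
    if not fp_a and not fp_b:
        return 0.0  # convention for this sketch; some tools define this case differently
    return len(fp_a & fp_b) / len(fp_a | fp_b)


# Hypothetical fingerprints given as sets of on-bit indices.
ligand_a = {1, 2, 3, 4}
ligand_b = {3, 4, 5, 6, 7, 8}
similarity = tanimoto(ligand_a, ligand_b)
print(similarity)  # 0.25
```

Two shared bits out of eight distinct bits give 0.25, the lower edge of the 25%-40% range that the abstract describes as structurally dissimilar ligands.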
Topics: Amino Acid Sequence; Benchmarking; Computational Biology; Ligands; Models, Molecular; Protein Conformation; Proteins
PubMed: 27559831
DOI: 10.1021/acs.jcim.6b00118
-
Protein Science : a Publication of the... Jan 2023
Protein sequence-based predictors of nucleic acid (NA)-binding include methods that predict NA-binding proteins and NA-binding residues. The residue-level tools produce more details but suffer from high computational cost since they must predict every amino acid in the input sequence and rely on multiple sequence alignments. We propose an alternative approach that predicts content (fraction) of the NA-binding residues, offering more information than the protein-level prediction and much shorter runtime than the residue-level tools. Our first-of-its-kind content predictor, qNABpredict, relies on a small, rationally designed and fast-to-compute feature set that represents relevant characteristics extracted from the input sequence and a well-parametrized support vector regression model. We provide two versions of qNABpredict: a taxonomy-agnostic model that can be used for proteins of unknown taxonomic origin, and more accurate taxonomy-aware models that are tailored to specific taxonomic kingdoms: archaea, bacteria, eukaryota, and viruses. Empirical tests on a low-similarity test dataset show that qNABpredict is 100 times faster and generates statistically more accurate content predictions when compared to the content extracted from results produced by the residue-level predictors. We also show that qNABpredict's content predictions can be used to improve results generated by the residue-level predictors. We release qNABpredict as a convenient webserver and source code at http://biomine.cs.vcu.edu/servers/qNABpredict/. This new tool should be particularly useful to predict details of protein-NA interactions for large protein families and proteomes.
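The "content extracted from results produced by the residue-level predictors" can be pictured as the fraction of residues whose predicted binding propensity passes a cutoff. The sketch below assumes per-residue scores between 0 and 1 and a 0.5 threshold; both the function name and the threshold are hypothetical and say nothing about how qNABpredict or any specific residue-level tool works internally.

```python
def binding_content(residue_scores, threshold=0.5):
    """Fraction of residues predicted as NA-binding.

    residue_scores : per-residue propensities in [0, 1] (hypothetical predictor output)
    threshold      : cutoff used to call a residue binding (an assumption)
    """
    if not residue_scores:
        raise ValueError("empty sequence")
    n_binding = sum(1 for score in residue_scores if score >= threshold)
    return n_binding / len(residue_scores)


# Hypothetical propensities for a 10-residue fragment.
scores = [0.9, 0.8, 0.1, 0.2, 0.05, 0.6, 0.3, 0.1, 0.0, 0.4]
content = binding_content(scores)
print(content)  # 0.3
```

A content predictor estimates this single number directly from the sequence, which is why it can avoid the per-residue computation that makes residue-level tools slow.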
Topics: Amino Acids; Nucleic Acids; Databases, Protein; Amino Acid Sequence; Proteome; Computational Biology
PubMed: 36519304
DOI: 10.1002/pro.4544
-
Bioinformatics (Oxford, England) May 2021
MOTIVATION
Current techniques of protein engineering focus mostly on re-designing small targeted regions or defined structural scaffolds rather than constructing combinatorial libraries of versatile compositions and lengths. This is a missed opportunity because combinatorial libraries are emerging as a vital source of novel functional proteins and are of interest in diverse research areas.
RESULTS
Here, we present a computational tool for Combinatorial Library Design (CoLiDe) offering precise control over protein sequence composition, length and diversity. The algorithm uses an evolutionary approach to provide solutions to combinatorial libraries of degenerate DNA templates. We demonstrate its performance and precision using four different input alphabet distributions on different sequence lengths. In addition, a model design and experimental pipeline for protein library expression and purification is presented, providing a proof-of-concept that our protocol can be used to prepare purified protein library samples of up to 10^11-10^12 unique sequences. CoLiDe presents a composition-centric approach to protein design towards different functional phenomena.
AVAILABILITY AND IMPLEMENTATION
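To make the notion of a degenerate DNA template concrete, the sketch below expands one degenerate codon, written with IUPAC ambiguity codes, into the amino acids it encodes under the standard genetic code. This is not the CoLiDe algorithm, only an illustration of the search space such a tool works in; the function name `encoded_residues` is hypothetical.

```python
from itertools import product

# IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

# Standard genetic code, codons ordered with bases T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def encoded_residues(degenerate_codon: str) -> dict:
    """Map each encoded amino acid ('*' = stop) to the number of codons producing it."""
    counts = {}
    for codon in product(*(IUPAC[base] for base in degenerate_codon.upper())):
        residue = CODON_TABLE["".join(codon)]
        counts[residue] = counts.get(residue, 0) + 1
    return counts


nnk = encoded_residues("NNK")
print(len(nnk), sum(nnk.values()), nnk["*"])  # 21 32 1
```

The NNK codon expands to 32 codons covering all 20 amino acids plus a single stop codon (TAG), with unequal codon counts per residue. Choosing degenerate codons whose expansion matches a desired amino acid distribution at each position is the kind of optimization problem a composition-centric library design has to solve.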
CoLiDe is implemented in Python and freely available at https://github.com/voracva1/CoLiDe.
SUPPLEMENTARY INFORMATION
Supplementary data are available at Bioinformatics online.
Topics: Algorithms; Amino Acid Sequence; Gene Library; Protein Engineering; Proteins; Software
PubMed: 32956450
DOI: 10.1093/bioinformatics/btaa804
-
BioMed Research International 2015
Proteins and their interactions lie at the heart of most underlying biological processes. Consequently, correct detection of protein-protein interactions (PPIs) is of fundamental importance to understand the molecular mechanisms in biological systems. Although advances in high-throughput experimental technology make it possible to detect a large number of PPIs, the data generated by these methods are unreliable and may not include all possible PPIs. To address this problem, this study develops a novel computational approach to effectively detect protein interactions. The approach is based on a novel matrix-based representation of protein sequence combined with a support vector machine (SVM), and fully considers the sequence order and dipeptide information of the protein primary sequence. When applied to yeast PPI datasets, the proposed method reaches 90.06% prediction accuracy with 94.37% specificity at a sensitivity of 85.74%, indicating that this predictor is a useful tool to predict PPIs. The results also demonstrate that our approach can be a helpful supplement to the interactions that have been detected experimentally.
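The three reported figures are linked through the confusion matrix. The sketch below computes sensitivity, specificity, and accuracy from counts of true and false positives and negatives. The counts are hypothetical and assume a balanced set of 10,000 interacting and 10,000 non-interacting pairs, chosen only to reproduce the reported sensitivity and specificity; the study's actual dataset sizes are not given in the abstract.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Sensitivity, specificity, and accuracy from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),          # fraction of true PPIs recovered
        "specificity": tn / (tn + fp),          # fraction of non-interacting pairs rejected
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


# Hypothetical balanced dataset: 10,000 positives and 10,000 negatives.
metrics = classification_metrics(tp=8574, fn=1426, tn=9437, fp=563)
print(metrics)
```

With balanced classes, accuracy is the mean of sensitivity and specificity: (85.74% + 94.37%) / 2 = 90.055%, which is in line with the reported 90.06%. On an unbalanced dataset the same sensitivity and specificity would yield a different accuracy.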
Topics: Amino Acid Sequence; Computational Biology; Databases, Protein; Helicobacter pylori; Humans; Protein Interaction Mapping; Proteins; Saccharomyces cerevisiae; Sequence Analysis, Protein; Support Vector Machine
PubMed: 26000305
DOI: 10.1155/2015/867516
-
PLoS Computational Biology Nov 2023
Generative models of protein sequence families are an important tool in the repertoire of protein scientists and engineers alike. However, state-of-the-art generative approaches face inference, accuracy, and overfitting-related obstacles when modeling moderately sized to large proteins and/or protein families with low sequence coverage. Here, we present a simple-to-learn, tunable, and accurate generative model, GENERALIST: GENERAtive nonLInear tenSor-factorizaTion for protein sequences. GENERALIST accurately captures several high-order summary statistics of amino acid covariation. GENERALIST also predicts conservative locally optimal sequences, which are likely to fold into stable 3D structures. Importantly, unlike current methods, the density of sequences in GENERALIST-modeled sequence ensembles closely resembles that of the corresponding natural ensembles. Finally, GENERALIST embeds protein sequences in an informative latent space. GENERALIST will be an important tool to study protein sequence variability.
Topics: Proteins; Amino Acid Sequence; Amino Acids
PubMed: 38011273
DOI: 10.1371/journal.pcbi.1011655
-
Angewandte Chemie (International Ed. in... Dec 2022
Efficient design of functional proteins with higher thermal stability remains challenging, especially for highly diverse sequence variants. Considering the evolutionary pressure on protein folds, sequence design optimizing evolutionary fitness could help design folds with higher stability. Using a generative evolution fitness model trained to capture variation patterns in natural sequences, we designed artificial sequences of a proteinaceous inhibitor of pectin methylesterase enzymes. These inhibitors have considerable industrial interest to avoid phase separation in fruit juice manufacturing or reduce methanol in distillates, averting chromatographic passages triggering unwanted aroma loss. Six out of seven designs with up to 30% divergence to other inhibitor sequences are functional and two have improved thermal stability. This method can improve protein stability while expanding functional protein sequence space, with traits valuable for industrial applications and scientific research.
Topics: Amino Acid Sequence; Proteins; Protein Stability
PubMed: 36259321
DOI: 10.1002/anie.202202711