-
Bioinformatics (Oxford, England) Mar 2023Computational protein sequence design has been widely applied in rational protein engineering and increasing the design accuracy and efficiency is highly desired.
MOTIVATION
Computational protein sequence design has been widely applied in rational protein engineering and increasing the design accuracy and efficiency is highly desired.
RESULTS
Here, we present ProDESIGN-LE, an accurate and efficient approach to protein sequence design. ProDESIGN-LE adopts a concise but informative representation of the residue's local environment and trains a transformer to learn the correlation between local environment of residues and their amino acid types. For a target backbone structure, ProDESIGN-LE uses the transformer to assign an appropriate residue type for each position based on its local environment within this structure, eventually acquiring a designed sequence with all residues fitting well with their local environments. We applied ProDESIGN-LE to design sequences for 68 naturally occurring and 129 hallucinated proteins within 20 s per protein on average. The designed proteins have their predicted structures perfectly resembling the target structures with a state-of-the-art average TM-score exceeding 0.80. We further experimentally validated ProDESIGN-LE by designing five sequences for an enzyme, chloramphenicol O-acetyltransferase type III (CAT III), and recombinantly expressing the proteins in Escherichia coli. Of these proteins, three exhibited excellent solubility, and one yielded monomeric species with circular dichroism spectra consistent with the natural CAT III protein.
AVAILABILITY AND IMPLEMENTATION
The source code of ProDESIGN-LE is available at https://github.com/bigict/ProDESIGN-LE.
Topics: Amino Acid Sequence; Proteins; Software
PubMed: 36916746
DOI: 10.1093/bioinformatics/btad122 -
Advances in Experimental Medicine and... 2015Short, linear motifs (SLiMs) in proteins are functional microdomains consisting of contiguous residue segments along the protein sequence, typically not more than 10... (Review)
Review
Short, linear motifs (SLiMs) in proteins are functional microdomains consisting of contiguous residue segments along the protein sequence, typically not more than 10 consecutive amino acids in length with less than 5 defined positions. Many positions are 'degenerate' thus offering flexibility in terms of the amino acid types allowed at those positions. Their short length and degenerate nature confers evolutionary plasticity meaning that SLiMs often evolve convergently. Further, SLiMs have a propensity to occur within intrinsically unstructured protein segments and this confers versatile functionality to unstructured regions of the proteome. SLiMs mediate multiple types of protein interactions based on domain-peptide recognition and guide functions including posttranslational modifications, subcellular localization of proteins, and ligand binding. SLiMs thus behave as modular interaction units that confer versatility to protein function and SLiM-mediated interactions are increasingly being recognized as therapeutic targets. In this chapter we start with a brief description about the properties of SLiMs and their interactions and then move on to discuss algorithms and tools including several web-based methods that enable the discovery of novel SLiMs (de novo motif discovery) as well as the prediction of novel occurrences of known SLiMs. Both individual amino acid sequences as well as sets of protein sequences can be scanned using these methods to obtain statistically overrepresented sequence patterns. Lists of putatively functional SLiMs are then assembled based on parameters such as evolutionary sequence conservation, disorder scores, structural data, gene ontology terms and other contextual information that helps to assess the functional credibility or significance of these motifs. These bioinformatics methods should certainly guide experiments aimed at motif discovery.
Topics: Algorithms; Amino Acid Motifs; Amino Acid Sequence; Computational Biology; Intrinsically Disordered Proteins; Molecular Sequence Data; Protein Conformation; Sequence Homology, Amino Acid
PubMed: 26387106
DOI: 10.1007/978-3-319-20164-1_9 -
Biomolecules Jan 2022Protein-peptide interactions (PpIs) are a subset of the overall protein-protein interaction (PPI) network in the living cell and are pivotal for the majority of cell...
Protein-peptide interactions (PpIs) are a subset of the overall protein-protein interaction (PPI) network in the living cell and are pivotal for the majority of cell processes and functions. High-throughput methods to detect PpIs and PPIs usually require time and costs that are not always affordable. Therefore, reliable in silico predictions represent a valid and effective alternative. In this work, a new algorithm is described, implemented in a freely available tool, i.e., "PepThreader", to carry out PPIs and PpIs prediction and analysis. PepThreader threads multiple fragments derived from a full-length protein sequence (or from a peptide library) onto a second template peptide, in complex with a protein target, "spotting" the potential binding peptides and ranking them according to a sequence-based and structure-based threading score. The threading algorithm first makes use of a scoring function that is based on peptides sequence similarity. Then, a rerank of the initial hits is performed, according to structure-based scoring functions. PepThreader has been benchmarked on a dataset of 292 protein-peptide complexes that were collected from existing databases of experimentally determined protein-peptide interactions. An accuracy of 80%, when considering the top predicted 25 hits, was achieved, which performs in a comparable way with the other state-of-art tools in PPIs and PpIs modeling. Nonetheless, PepThreader is unique in that it is able at the same time to spot a binding peptide within a full-length sequence involved in PPI and model its structure within the receptor. Therefore, PepThreader adds to the already-available tools supporting the experimental PPIs and PpIs identification and characterization.
Topics: Amino Acid Sequence; Peptide Library; Peptides; Protein Interaction Mapping; Software
PubMed: 35204702
DOI: 10.3390/biom12020201 -
BMC Bioinformatics Feb 2024To explore the evolutionary history of sequences, a sequence alignment is a first and necessary step, and its quality is crucial. In the context of the study of the...
BACKGROUND
To explore the evolutionary history of sequences, a sequence alignment is a first and necessary step, and its quality is crucial. In the context of the study of the proximal origins of SARS-CoV-2 coronavirus, we wanted to construct an alignment of genomes closely related to SARS-CoV-2 using both coding and non-coding sequences. To our knowledge, there is no tool that can be used to construct this type of alignment, which motivated the creation of CNCA.
RESULTS
CNCA is a web tool that aligns annotated genomes from GenBank files. It generates a nucleotide alignment that is then updated based on the protein sequence alignment. The output final nucleotide alignment matches the protein alignment and guarantees no frameshift. CNCA was designed to align closely related small genome sequences up to 50 kb (typically viruses) for which the gene order is conserved.
CONCLUSIONS
CNCA constructs multiple alignments of small genomes by integrating both coding and non-coding sequences. This preserves regions traditionally ignored in conventional back-translation methods, such as non-coding regions.
Topics: Genome; Sequence Alignment; Proteins; Amino Acid Sequence; Nucleotides
PubMed: 38424511
DOI: 10.1186/s12859-024-05700-1 -
Pacific Symposium on Biocomputing.... 2021Viruses such as the novel coronavirus, SARS-CoV-2, that is wreaking havoc on the world, depend on interactions of its own proteins with those of the human host cells....
Viruses such as the novel coronavirus, SARS-CoV-2, that is wreaking havoc on the world, depend on interactions of its own proteins with those of the human host cells. Relatively small changes in sequence such as between SARS-CoV and SARS-CoV-2 can dramatically change clinical phenotypes of the virus, including transmission rates and severity of the disease. On the other hand, highly dissimilar virus families such as Coronaviridae, Ebola, and HIV have overlap in functions. In this work we aim to analyze the role of protein sequence in the binding of SARS-CoV-2 virus proteins towards human proteins and compare it to that of the above other viruses. We build supervised machine learning models, using Generalized Additive Models to predict interactions based on sequence features and find that our models perform well with an AUC-PR of 0.65 in a class-skew of 1:10. Analysis of the novel predictions using an independent dataset showed statistically significant enrichment. We further map the importance of specific amino-acid sequence features in predicting binding and summarize what combinations of sequences from the virus and the host is correlated with an interaction. By analyzing the sequence-based embeddings of the interactomes from different viruses and clustering them together we find some functionally similar proteins from different viruses. For example, vif protein from HIV-1, vp24 from Ebola and orf3b from SARS-CoV all function as interferon antagonists. Furthermore, we can differentiate the functions of similar viruses, for example orf3a's interactions are more diverged than orf7b interactions when comparing SARS-CoV and SARS-CoV-2.
Topics: Amino Acid Sequence; COVID-19; Computational Biology; Humans; Proteins; SARS-CoV-2
PubMed: 33691013
DOI: No ID Found -
IEEE Transactions on Nanobioscience Jul 2021Protein structure prediction (PSP) predicts the native conformation for a given protein sequence. Classically, the problem has been shown to belong to the NP-complete...
Protein structure prediction (PSP) predicts the native conformation for a given protein sequence. Classically, the problem has been shown to belong to the NP-complete complexity class. Its applications range from physics, through bioinformatics to medicine and quantum biology. It is possible however to speed it up with quantum computational methods, as we show in this paper. Here we develop a fast quantum algorithm for PSP in three-dimensional hydrophobic-hydrophilic model on body-centered cubic lattice with quadratic speedup over its classical counterparts. Given a protein sequence of n amino acids, our algorithm reduces the temporal and spatial complexities to, respectively, [Formula: see text] and O(n logn) . With respect to oracle-related quantum algorithms for the NP-complete problems, we identify our algorithm as optimal. To justify the feasibility of the proposed algorithm we successfully solve the problem on IBM quantum simulator involving 21 and 25 qubits. We confirm the experimentally obtained high probability of success in finding the desired conformation by calculating the theoretical probability estimations.
Topics: Algorithms; Amino Acid Sequence; Computational Biology; Protein Conformation; Proteins
PubMed: 33690123
DOI: 10.1109/TNB.2021.3065051 -
Briefings in Bioinformatics Sep 2023Protein crystallization is crucial for biology, but the steps involved are complex and demanding in terms of external factors and internal structure. To save on...
Protein crystallization is crucial for biology, but the steps involved are complex and demanding in terms of external factors and internal structure. To save on experimental costs and time, the tendency of proteins to crystallize can be initially determined and screened by modeling. As a result, this study created a new pipeline aimed at using protein sequence to predict protein crystallization propensity in the protein material production stage, purification stage and production of crystal stage. The newly created pipeline proposed a new feature selection method, which involves combining Chi-square (${\chi }^{2}$) and recursive feature elimination together with the 12 selected features, followed by a linear discriminant analysisfor dimensionality reduction and finally, a support vector machine algorithm with hyperparameter tuning and 10-fold cross-validation is used to train the model and test the results. This new pipeline has been tested on three different datasets, and the accuracy rates are higher than the existing pipelines. In conclusion, our model provides a new solution to predict multistage protein crystallization propensity which is a big challenge in computational biology.
Topics: Crystallization; Machine Learning; Amino Acid Sequence; Algorithms; Computational Biology
PubMed: 37649385
DOI: 10.1093/bib/bbad319 -
Bioinformatics (Oxford, England) Jan 2023Ever increasing amounts of protein structure data, combined with advances in machine learning, have led to the rapid proliferation of methods available for...
SUMMARY
Ever increasing amounts of protein structure data, combined with advances in machine learning, have led to the rapid proliferation of methods available for protein-sequence design. In order to utilize a design method effectively, it is important to understand the nuances of its performance and how it varies by design target. Here, we present PDBench, a set of proteins and a number of standard tests for assessing the performance of sequence-design methods. PDBench aims to maximize the structural diversity of the benchmark, compared with previous benchmarking sets, in order to provide useful biological insight into the behaviour of sequence-design methods, which is essential for evaluating their performance and practical utility. We believe that these tools are useful for guiding the development of novel sequence design algorithms and will enable users to choose a method that best suits their design target.
AVAILABILITY AND IMPLEMENTATION
https://github.com/wells-wood-research/PDBench.
SUPPLEMENTARY INFORMATION
Supplementary data are available at Bioinformatics online.
Topics: Software; Algorithms; Proteins; Amino Acid Sequence; Benchmarking; Computational Biology
PubMed: 36637198
DOI: 10.1093/bioinformatics/btad027 -
Bioinformatics (Oxford, England) Nov 2022The advent of massive DNA sequencing technologies is producing a huge number of human single-nucleotide polymorphisms occurring in protein-coding regions and possibly...
MOTIVATION
The advent of massive DNA sequencing technologies is producing a huge number of human single-nucleotide polymorphisms occurring in protein-coding regions and possibly changing their sequences. Discriminating harmful protein variations from neutral ones is one of the crucial challenges in precision medicine. Computational tools based on artificial intelligence provide models for protein sequence encoding, bypassing database searches for evolutionary information. We leverage the new encoding schemes for an efficient annotation of protein variants.
RESULTS
E-SNPs&GO is a novel method that, given an input protein sequence and a single amino acid variation, can predict whether the variation is related to diseases or not. The proposed method adopts an input encoding completely based on protein language models and embedding techniques, specifically devised to encode protein sequences and GO functional annotations. We trained our model on a newly generated dataset of 101 146 human protein single amino acid variants in 13 661 proteins, derived from public resources. When tested on a blind set comprising 10 266 variants, our method well compares to recent approaches released in literature for the same task, reaching a Matthews Correlation Coefficient score of 0.72. We propose E-SNPs&GO as a suitable, efficient and accurate large-scale annotator of protein variant datasets.
AVAILABILITY AND IMPLEMENTATION
The method is available as a webserver at https://esnpsandgo.biocomp.unibo.it. Datasets and predictions are available at https://esnpsandgo.biocomp.unibo.it/datasets.
SUPPLEMENTARY INFORMATION
Supplementary data are available at Bioinformatics online.
Topics: Humans; Polymorphism, Single Nucleotide; Artificial Intelligence; Amino Acid Sequence; Proteins; Amino Acids; Computational Biology; Molecular Sequence Annotation
PubMed: 36227117
DOI: 10.1093/bioinformatics/btac678 -
Biological Reviews of the Cambridge... Feb 2023Proteins form arguably the most significant link between genotype and phenotype. Understanding the relationship between protein sequence and structure, and applying this... (Review)
Review
Proteins form arguably the most significant link between genotype and phenotype. Understanding the relationship between protein sequence and structure, and applying this knowledge to predict function, is difficult. One way to investigate these relationships is by considering the space of protein folds and how one might move from fold to fold through similarity, or potential evolutionary relationships. The many individual characterisations of fold space presented in the literature can tell us a lot about how well the current Protein Data Bank represents protein fold space, how convergence and divergence may affect protein evolution, how proteins affect the whole of which they are part, and how proteins themselves function. A synthesis of these different approaches and viewpoints seems the most likely way to further our knowledge of protein structure evolution and thus, facilitate improved protein structure design and prediction.
Topics: Proteins; Amino Acid Sequence
PubMed: 36210328
DOI: 10.1111/brv.12905