-
International Journal of Molecular... Dec 2021
Protein solubility is an important thermodynamic parameter that is critical for the characterization of a protein's function, and a key determinant for the production yield of a protein in both the research setting and within industrial (e.g., pharmaceutical) applications. Experimental approaches to predict protein solubility are costly, time-consuming, and frequently offer only low success rates. To reduce cost and expedite the development of therapeutic and industrially relevant proteins, a highly accurate computational tool for predicting protein solubility from protein sequence is sought. While a number of in silico prediction tools exist, they suffer from relatively low prediction accuracy, bias toward the soluble proteins, and limited applicability for various classes of proteins. In this study, we developed a novel deep learning sequence-based solubility predictor, DSResSol, that takes advantage of the integration of squeeze excitation residual networks with dilated convolutional neural networks and outperforms all existing protein solubility prediction models. This model captures the frequently occurring amino acid k-mers and their local and global interactions and highlights the importance of identifying long-range interaction information between amino acid k-mers to achieve improved accuracy, using only protein sequence as input. DSResSol outperforms all available sequence-based solubility predictors by at least 5% in terms of accuracy when evaluated by two different independent test sets. Compared to existing predictors, DSResSol not only reduces prediction bias for insoluble proteins but also predicts soluble proteins within the test sets with an accuracy that is at least 13% higher than existing models. We derive the key amino acids, dipeptides, and tripeptides contributing to protein solubility, identifying glutamic acid and serine as critical amino acids for protein solubility prediction. 
Overall, DSResSol can be used for the fast, reliable, and inexpensive prediction of a protein's solubility to guide experimental design.
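The amino acid, dipeptide, and tripeptide features mentioned in the abstract are all k-mers (k = 1, 2, 3). The following is a minimal sketch of how overlapping k-mer frequencies could be computed from a sequence. It is not the DSResSol model or its feature pipeline; the function name `kmer_frequencies` and the toy sequence are assumptions made for illustration only.

```python
from collections import Counter


def kmer_frequencies(sequence: str, k: int) -> dict[str, float]:
    """Return the relative frequency of each overlapping k-mer in a sequence."""
    if k < 1:
        raise ValueError("k must be at least 1")
    seq = sequence.upper()
    total = len(seq) - k + 1  # number of overlapping windows of width k
    if total <= 0:
        return {}  # sequence shorter than k: no k-mers to count
    counts = Counter(seq[i:i + k] for i in range(total))
    return {kmer: n / total for kmer, n in counts.items()}


# Hypothetical toy sequence, not taken from the paper's datasets.
toy = "MSEESLKE"
singles = kmer_frequencies(toy, 1)      # amino acid composition
dipeptides = kmer_frequencies(toy, 2)   # dipeptide frequencies
tripeptides = kmer_frequencies(toy, 3)  # tripeptide frequencies
```

In this toy example glutamic acid (E) accounts for 3 of 8 residues, so `singles["E"]` is 0.375. A real predictor would learn from such features across many labelled sequences rather than inspect a single one.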
Topics: Amino Acid Sequence; Computational Biology; Deep Learning; Models, Chemical; Proteins; Solubility
PubMed: 34948354
DOI: 10.3390/ijms222413555 -
MAbs Jan 2019
Review
Amino acid sequence variation in protein therapeutics requires close monitoring during cell line and cell culture process development. A cross-functional team of Pfizer colleagues from the Analytical and Bioprocess Development departments worked closely together for over 6 years to formulate and communicate a practical, reliable sequence variant (SV) testing strategy with state-of-the-art techniques that did not necessitate more resources or lengthen project timelines. The final Pfizer SV screening strategy relies on next-generation sequencing (NGS) and amino acid analysis (AAA) as frontline techniques to identify mammalian cell clones with genetic mutations and recognize cell culture process media/feed conditions that induce misincorporations, respectively. Mass spectrometry (MS)-based techniques had previously been used to monitor secreted therapeutic products for SVs, but we found NGS and AAA to be equally informative, faster, less cumbersome screening approaches. MS resources could then be used for other purposes, such as the in-depth characterization of product quality in the final stages of commercial-ready cell line and culture process development. Once an industry-wide challenge, sequence variation is now routinely monitored and controlled at Pfizer (and other biopharmaceutical companies) through increased awareness, dedicated cross-line efforts, smart comprehensive strategies, and advances in instrumentation/software, resulting in even higher product quality standards for biopharmaceutical products.
Topics: Amino Acid Sequence; Animals; Genetic Variation; High-Throughput Screening Assays; Humans; Sequence Analysis, Protein
PubMed: 30303443
DOI: 10.1080/19420862.2018.1531965 -
Scientific Reports Oct 2023
Three-dimensional protein structures are invaluable sources of information for the functional annotation of protein molecules. Describing the function of a protein sequence is one of the most common problems in biology. Generally, this problem can be facilitated by studying the tertiary structure of proteins. In the absence of protein structures, comparative modeling often provides a useful three-dimensional model of the protein associated with at least one known protein structure. Comparative modeling predicts the tertiary structure of a certain protein sequence (target) mainly based on its sequence homology to one or more proteins with known structures (templates). MODELLER is one of the most widely used tools for homology or comparative modeling of three-dimensional protein structures. However, most users find it challenging to start with MODELLER as it is command-line based and requires knowledge of basic Python scripting to use it efficiently. In this study, a web-based interface has been designed to predict the tertiary structure of proteins based on MODELLER, which does the comparative modeling automatically, and uses PHP and Python programming languages. This tool is called "EasyModel" and is available at http://bioinf.modares.ac.ir/software/easymodel/ . EasyModel provides a straightforward graphical interface for MODELLER that can be used with only a web browser.
Topics: Software; Proteins; Programming Languages; Amino Acid Sequence; Internet; User-Computer Interface
PubMed: 37821634
DOI: 10.1038/s41598-023-44505-9 -
Bioinformatics (Oxford, England) Feb 2020
MOTIVATION
The accuracy and success rate of de novo protein design remain limited, mainly due to the parameter over-fitting of current energy functions and their inability to discriminate incorrect designs from correct designs.
RESULTS
We developed an extended energy function, EvoEF2, for efficient de novo protein sequence design, based on a previously proposed physical energy function, EvoEF. Remarkably, EvoEF2 recovered 32.5%, 47.9% and 22.3% of all, core and surface residues for 148 test monomers, and was generally applicable to protein-protein interaction design, as it recapitulated 30.9%, 42.4%, 31.3% and 21.4% of all, core, interface and surface residues for 88 test dimers, significantly outperforming EvoEF on the native sequence recapitulation. We further used I-TASSER to evaluate the foldability of the 148 designed monomer sequences, where all of them were predicted to fold into structures with high fold- and atomic-level similarity to their corresponding native structures, as demonstrated by the fact that 87.8% of the predicted structures shared a root-mean-square-deviation less than 2 Å to their native counterparts. The study also demonstrated that the usefulness of physical energy functions is highly correlated with the parameter optimization processes, and EvoEF2, with parameters optimized using sequence recapitulation, is more suitable for computational protein sequence design than EvoEF, which was optimized on thermodynamic mutation data.
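The recovery percentages above measure native sequence recapitulation: the fraction of positions at which the designed residue matches the native one, optionally restricted to a subset such as core or surface positions. The following is a minimal sketch of that metric, not EvoEF2 code; the function name `sequence_recovery`, the toy sequences, and the "core" position list are hypothetical.

```python
def sequence_recovery(native: str, designed: str, positions=None) -> float:
    """Fraction of the given positions where the designed residue equals the native one.

    If positions is None, every position in the sequence is evaluated.
    """
    if len(native) != len(designed):
        raise ValueError("native and designed sequences must have equal length")
    indices = list(range(len(native))) if positions is None else list(positions)
    if not indices:
        raise ValueError("no positions to evaluate")
    matches = sum(native[i] == designed[i] for i in indices)
    return matches / len(indices)


# Hypothetical toy sequences and a hypothetical set of "core" positions (0-based).
native_seq = "MKTAYIAK"
designed_seq = "MKSAYLAR"
core_positions = [3, 4, 5]

overall = sequence_recovery(native_seq, designed_seq)
core = sequence_recovery(native_seq, designed_seq, core_positions)
```

Here five of eight positions match overall (0.625) and two of the three assumed core positions match. In practice, which residues count as core, interface, or surface would come from a structural criterion that this sketch does not define.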
AVAILABILITY AND IMPLEMENTATION
The source code of EvoEF2 and the benchmark datasets are freely available at https://zhanglab.ccmb.med.umich.edu/EvoEF.
SUPPLEMENTARY INFORMATION
Supplementary data are available at Bioinformatics online.
Topics: Algorithms; Amino Acid Sequence; Computational Biology; Proteins; Software
PubMed: 31588495
DOI: 10.1093/bioinformatics/btz740 -
Biomolecules Jul 2023
Tandem repeats in proteins are patterns of residues repeated directly adjacent to each other. The evolution of these repeats can be assessed by using groups of homologous sequences, which can help point to events of unit duplication or deletion. High pressure in a protein family for variation of a given type of repeat might point to its function. Here, we propose the analysis of protein families to calculate protein short tandem repeats (pSTRs) in each protein sequence and assess their variability within the family in terms of number of units. To facilitate this analysis, we developed the pSTR tool, a method to analyze the evolution of protein short tandem repeats in a given protein family by pairwise comparisons between evolutionarily related protein sequences. We evaluated pSTR unit number variation in protein families of 12 complete metazoan proteomes. We hypothesize that families with more dynamic ensembles of repeats could reflect particular roles of these repeats in processes that require more adaptability.
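To make the idea of unit number variation concrete, the following sketch finds short tandem repeats in a sequence and compares unit counts between two related sequences. It is not the pSTR tool's algorithm. The function names, the maximum unit length of 3, the minimum of 2 units, and the toy sequences are all assumptions, and the comparison matches repeats by unit only, without the alignment a real analysis of homologs would likely need.

```python
def is_primitive(unit: str) -> bool:
    """True if the unit is not itself a repetition of a shorter unit (e.g. 'QQ')."""
    return (unit + unit).find(unit, 1) == len(unit)


def find_tandem_repeats(seq: str, max_unit: int = 3, min_repeats: int = 2):
    """Return (start, unit, count) for each run of directly adjacent identical units."""
    hits = []
    for n in range(1, max_unit + 1):
        i = 0
        while i + n <= len(seq):
            unit = seq[i:i + n]
            count = 1
            # Extend the run while the next window of width n repeats the unit.
            while seq[i + count * n:i + (count + 1) * n] == unit:
                count += 1
            if count >= min_repeats and is_primitive(unit):
                hits.append((i, unit, count))
                i += count * n  # skip past the whole run
            else:
                i += 1
    return hits


def unit_counts(seq: str) -> dict[str, int]:
    """Largest number of consecutive units observed for each repeat unit."""
    best: dict[str, int] = {}
    for _, unit, count in find_tandem_repeats(seq):
        best[unit] = max(best.get(unit, 0), count)
    return best


def unit_number_differences(seq_a: str, seq_b: str) -> dict[str, tuple[int, int]]:
    """Units whose repeat count differs between two sequences (0 means absent)."""
    a, b = unit_counts(seq_a), unit_counts(seq_b)
    return {u: (a.get(u, 0), b.get(u, 0))
            for u in sorted(set(a) | set(b)) if a.get(u, 0) != b.get(u, 0)}


# Hypothetical homologous toy sequences.
differences = unit_number_differences("MASASASQQQK", "MASASQQQQQK")
```

For these toy sequences the `AS` unit occurs three times in the first sequence and twice in the second, while the `Q` run has three versus five units.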
Topics: Animals; Amino Acid Sequence; Proteome; Microsatellite Repeats; Evolution, Molecular
PubMed: 37509152
DOI: 10.3390/biom13071116 -
Biopolymers Mar 2023
Coevolution between protein residues is normally interpreted as direct contact. However, the evolutionary record of a protein sequence contains rich information that may include long-range functional couplings, couplings that report on homo-oligomeric states or even conformational changes. Due to the complexity of the sequence space and the lack of structural information on various members of a protein family, it has been difficult to effectively mine the additional information encoded in a multiple sequence alignment (MSA). Here, taking advantage of the recent release of the AlphaFold (AF) database, we attempt to identify coevolutionary couplings that cannot be explained simply by spatial proximity. We propose a simple computational method that performs direct coupling analysis on an MSA and searches for couplings that are not satisfied in any of the AF models of members of the identified protein family. Application of this method to 2012 protein families suggests that ~12% of the total identified coevolving residue pairs are spatially distant and more likely to be disordered than their contacting counterparts. We expect that this analysis will help improve the quality of coevolutionary distance restraints used for structure determination and will be useful in identifying potentially functional/allosteric cross-talk between distant residues.
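The filtering step described above, keeping only couplings that are not satisfied in any model, can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation. The coupled pairs are taken as given (no direct coupling analysis is performed here), each model is represented as a hypothetical mapping from alignment position to one 3D coordinate per residue, and the 8.0 Å contact cutoff is an assumed value.

```python
import math


def unexplained_couplings(pairs, models, cutoff=8.0):
    """Return the coupled pairs that are not within `cutoff` in any model.

    pairs  : iterable of (i, j) alignment positions reported as coevolving
    models : list of dicts mapping position -> (x, y, z) coordinate
    cutoff : assumed contact distance in angstroms
    """
    distant = []
    for i, j in pairs:
        in_contact = any(
            i in coords and j in coords
            and math.dist(coords[i], coords[j]) <= cutoff
            for coords in models
        )
        # Note: a pair with no coordinates in any model is also reported here.
        if not in_contact:
            distant.append((i, j))
    return distant


# Hypothetical toy models for two members of one family.
model_a = {1: (0.0, 0.0, 0.0), 2: (3.8, 0.0, 0.0), 10: (30.0, 0.0, 0.0)}
model_b = {1: (0.0, 0.0, 0.0), 2: (3.8, 0.0, 0.0), 10: (25.0, 0.0, 0.0)}

coupled_pairs = [(1, 2), (1, 10)]
distant_pairs = unexplained_couplings(coupled_pairs, [model_a, model_b])
```

With these toy coordinates the pair (1, 2) is in contact in both models, while (1, 10) stays at least 25 Å apart in every model and is therefore the only pair returned.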
Topics: Evolution, Molecular; Proteins; Amino Acid Sequence; Protein Conformation
PubMed: 36752285
DOI: 10.1002/bip.23530 -
Cell Systems Oct 2020
Protein structure and function are determined by the arrangement of the linear sequence of amino acids in 3D space. We show that a deep graph neural network, ProteinSolver, can precisely design sequences that fold into a predetermined shape by phrasing this challenge as a constraint satisfaction problem (CSP), akin to Sudoku puzzles. We trained ProteinSolver on over 70,000,000 real protein sequences corresponding to over 80,000 structures. We show that our method rapidly designs new protein sequences, and we benchmark them in silico using energy-based scores, molecular dynamics, and structure prediction methods. As a proof-of-principle validation, we use ProteinSolver to generate sequences that match the structure of serum albumin, then synthesize the top-scoring design and validate it in vitro using circular dichroism. ProteinSolver is freely available at http://design.proteinsolver.org and https://gitlab.com/ostrokach/proteinsolver. A record of this paper's transparent peer review process is included in the Supplemental Information.
Topics: Algorithms; Amino Acid Sequence; Computer Simulation; Databases, Protein; Neural Networks, Computer; Protein Engineering; Proteins; Sequence Analysis, Protein; Software
PubMed: 32971019
DOI: 10.1016/j.cels.2020.08.016 -
BMC Bioinformatics Mar 2014
BACKGROUND
Amino acid sequences and features extracted from such sequences have been used to predict many protein properties, such as subcellular localization or solubility, using classifier algorithms. Although software tools are available for both feature extraction and classifier construction, their application is not straightforward, requiring users to install various packages and to convert data into different formats. This lack of easily accessible software hampers quick, explorative use of sequence-based classification techniques by biologists.
RESULTS
We have developed the web-based software tool SPiCE for exploring sequence-based features of proteins in predefined classes. It offers data upload/download, sequence-based feature calculation, data visualization and protein classifier construction and testing in a single integrated, interactive environment. To illustrate its use, two example datasets are included showing the identification of differences in amino acid composition between proteins yielding low and high production levels in fungi and low and high expression levels in yeast, respectively.
CONCLUSIONS
SPiCE is an easy-to-use online tool for extracting and exploring sequence-based features of sets of proteins, allowing non-experts to apply advanced classification techniques. The tool is available at http://helix.ewi.tudelft.nl/spice.
Topics: Algorithms; Amino Acid Sequence; Aspergillus niger; Internet; Molecular Sequence Data; Proteins; Saccharomyces cerevisiae; Sequence Analysis, Protein; Software Design
PubMed: 24685258
DOI: 10.1186/1471-2105-15-93 -
PLoS Computational Biology Apr 2019
It is well known that, in order to preserve its structure and function, a protein cannot change its sequence at random, but only by mutations occurring preferentially at specific locations. We here investigate quantitatively the amount of variability that is allowed in protein sequence evolution, by computing the intrinsic dimension (ID) of the sequences belonging to a selection of protein families. The ID is a measure of the number of independent directions that evolution can take starting from a given sequence. We find that the ID is practically constant for sequences belonging to the same family, and moreover it is very similar in different families, with values ranging between 6 and 12. These values are significantly smaller than the raw number of amino acids, confirming the importance of correlations between mutations in different sites. However, we demonstrate that correlations are not sufficient to explain the small value of the ID we observe in protein families. Indeed, we show that the ID of a set of protein sequences generated by maximum entropy models, an approach in which correlations are accounted for, is typically significantly larger than the value observed in natural protein families. We further prove that a critical factor to reproduce the natural ID is to take into consideration the phylogeny of sequences.
Topics: Amino Acid Sequence; Computational Biology; Databases, Protein; Evolution, Molecular; Models, Molecular; Mutation; Phylogeny; Protein Conformation; Protein Folding; Proteins; Sequence Homology, Amino Acid; Structural Homology, Protein
PubMed: 30958823
DOI: 10.1371/journal.pcbi.1006767 -
BMC Biology Aug 2017
Review
Strong DNA conservation among divergent species is an indicator of enduring functionality. With weaker sequence conservation we enter a vast 'twilight zone' in which sequence subject to transient or lower constraint cannot be distinguished easily from neutrally evolving, non-functional sequence. Twilight zone functional sequence is illuminated instead by principles of selective constraint and positive selection using genomic data acquired from within a species' population. Application of these principles reveals that despite being biochemically active, most twilight zone sequence is not functional.
Topics: Amino Acid Sequence; Conserved Sequence; Evolution, Molecular; Sequence Analysis, Protein
PubMed: 28814299
DOI: 10.1186/s12915-017-0411-5