-
Current Opinion in Structural Biology, Jun 2016
Review
Protein design is still a challenging undertaking, often requiring multiple attempts or iterations for success. Typically, the source of failure is unclear, and scoring metrics appear similar between successful and failed cases. Nevertheless, the use of sequence statistics, modularity and symmetry from natural proteins, combined with computational design both at the coarse-grained and atomistic levels is propelling a new wave of design efforts to success. Here we highlight recent examples of design, showing how the wealth of natural protein sequence and topology data may be leveraged to reduce the search space and increase the likelihood of achieving desired outcomes.
Topics: Amino Acid Sequence; Computational Biology; Protein Engineering; Proteins
PubMed: 27270240
DOI: 10.1016/j.sbi.2016.05.007 -
Journal of Structural Biology, Nov 2019
Low complexity regions (LCRs) in protein sequences have special properties that are very different from those of globular proteins. The rules that define secondary structure elements do not apply when the distribution of amino acids becomes biased. While there is a tendency towards structural disorder in LCRs, various examples, particularly homorepeats of single amino acids, suggest that very short repeats could adopt structures that are very difficult to predict. These structures are possibly variable and dependent on the context of intra- or inter-molecular interactions. In general, short repeats in LCRs can induce structure. This could explain the observation that very short (non-perfect) repeats are widespread and that many define regions with a function in protein interactions. For these reasons, we have developed an algorithm to quickly analyze local repeatability along protein sequences, that is, how close a protein fragment is to a perfect repeat. Using this algorithm we found that the proteins of the yeast Saccharomyces cerevisiae are depleted in short repeats (approximate or not) of odd length, while human proteins are not; that the fish Danio rerio has many proteins with repeats of length two; and that the plant Arabidopsis thaliana has an unusually large number of repeats of length seven. Our method (REpeatability Scanner, RES, accessible at http://cbdm-01.zdv.uni-mainz.de/~munoz/res/) finds regions with approximate short repeats in protein sequences and helps to characterize the variable use of LCRs and compositional bias in different organisms.
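The idea of local repeatability, how close a fragment is to a perfect repeat, can be illustrated with a minimal sketch. This is not the published RES algorithm: the scoring rule (fraction of residues identical to the residue one period downstream), the function names, and the window and period parameters are all assumptions chosen for illustration.

```python
def repeatability(fragment: str, period: int) -> float:
    """Fraction of positions i where fragment[i] equals fragment[i + period].

    A perfect repeat of the given period scores 1.0. This scoring rule is an
    illustrative assumption, not the scoring used by RES.
    """
    if period <= 0 or period >= len(fragment):
        raise ValueError("period must be between 1 and len(fragment) - 1")
    pairs = len(fragment) - period
    matches = sum(fragment[i] == fragment[i + period] for i in range(pairs))
    return matches / pairs


def scan(sequence: str, window: int, period: int) -> list[float]:
    """Repeatability score for every window start along the sequence."""
    return [
        repeatability(sequence[start:start + window], period)
        for start in range(len(sequence) - window + 1)
    ]
```

Scanning a sequence for several periods and comparing the resulting profiles would show, under these assumptions, which repeat lengths dominate in a given region.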
Topics: Algorithms; Amino Acid Sequence; Databases, Protein; Evolution, Molecular; Humans; Proteins; Repetitive Sequences, Amino Acid; Sequence Alignment; Sequence Analysis, Protein
PubMed: 31408700
DOI: 10.1016/j.jsb.2019.08.003 -
Combinatorial Chemistry & High... 2021
Comparative Study
AIMS
Based on protein sequence information, a simple and effective method was used to analyze protein sequence similarity and predict DNA-binding proteins.
BACKGROUND
Low-complexity computational methods are needed to accurately infer protein structure, function, and evolution from the rapidly growing volume of available molecular biology data.
OBJECTIVE
It is important to develop novel computational algorithms for analyzing and comparing protein sequences, given the rapidly growing volume of available molecular biology data.
METHODS
The numerical characteristics of protein sequences were obtained from a global position representation based on Fermat spiral curves, a local position representation based on the normalized moments of inertia of those curves, and the composition of the 20 amino acids.
RESULTS
The method was applied to analyze the similarity/dissimilarity of nine ND5 proteins, and the results are consistent with biological evolution theory. Furthermore, we used logistic regression with 5-fold cross-validation to build a prediction model for DNA-binding proteins, which outperformed DNAbinder, iDNA-prot, DNA-prot and gDNA-prot by 0.0069-0.609 in F-measure and by 0.293-0.898 in MCC on an unbalanced dataset.
CONCLUSION
These results show that our method, FermatS, is effective for comparing, recognizing and predicting protein sequences.
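For reference, the two metrics quoted in the results, F-measure and the Matthews correlation coefficient (MCC), can be computed from a binary confusion matrix as sketched below. The function names are illustrative, and returning 0.0 when a denominator vanishes is a convention assumed here, not something stated in the abstract.

```python
import math


def f_measure(tp: int, fp: int, fn: int) -> float:
    """F1 score: harmonic mean of precision and recall."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient, ranging from -1 to 1."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: report 0.0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

MCC uses all four cells of the confusion matrix, which is why it is often reported alongside F-measure on unbalanced datasets such as the one mentioned above.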
Topics: Algorithms; Amino Acid Sequence; Computational Biology; DNA-Binding Proteins; Databases, Protein; Protein Conformation
PubMed: 33208064
DOI: 10.2174/1386207323999201117111738 -
Computers in Biology and Medicine, Mar 2024
The classification and prediction of T-cell receptor (TCR) protein sequences are of significant interest for understanding the immune system and developing personalized immunotherapies. In this study, we propose a novel approach using Pseudo Amino Acid Composition (PseAAC) protein encoding for accurate TCR protein sequence classification. The PseAAC2Vec encoding method captures the physicochemical properties of amino acids and their local sequence information, enabling the representation of protein sequences as fixed-length feature vectors. By incorporating physicochemical properties such as hydrophobicity, polarity, charge, molecular weight, and solvent accessibility, PseAAC2Vec provides a comprehensive and informative characterization of TCR protein sequences. To evaluate the effectiveness of the proposed PseAAC2Vec encoding approach, we assembled a large dataset of TCR protein sequences with annotated classes. We applied the PseAAC2Vec encoding scheme to each sequence and generated feature vectors based on a specified window size. Subsequently, we employed state-of-the-art machine learning algorithms, such as support vector machines (SVM) and random forests (RF), to classify the TCR protein sequences. Experimental results on the benchmark dataset demonstrated the superior performance of the PseAAC2Vec-based approach compared to existing methods. The PseAAC2Vec encoding effectively captures the discriminative patterns in TCR protein sequences, leading to improved classification accuracy and robustness. Furthermore, the encoding scheme showed promising results across different window sizes, indicating its adaptability to varying sequence contexts.
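To make the idea of a fixed-length pseudo amino acid composition vector concrete, the sketch below follows the commonly cited PseAAC form: 20 composition terms plus `lam` sequence-order correlation terms. It is not the PseAAC2Vec implementation. It assumes a single property (the Kyte-Doolittle hydropathy scale), a hypothetical weight of 0.05, and ignores the windowing described in the abstract.

```python
from statistics import mean, pstdev

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy values; using only this one property is an assumption.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
    "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
    "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def _standardize(scale: dict[str, float]) -> dict[str, float]:
    """Rescale a property to zero mean and unit variance over the 20 residues."""
    mu, sd = mean(scale.values()), pstdev(scale.values())
    return {aa: (value - mu) / sd for aa, value in scale.items()}


def pse_aac(sequence: str, lam: int = 3, weight: float = 0.05) -> list[float]:
    """Return a (20 + lam)-dimensional vector whose entries sum to 1."""
    if any(aa not in KD for aa in sequence):
        raise ValueError("sequence contains a non-standard residue")
    n = len(sequence)
    if n <= lam:
        raise ValueError("sequence must be longer than lam")
    h = _standardize(KD)
    freqs = [sequence.count(aa) / n for aa in AMINO_ACIDS]
    # theta_k: mean squared property difference between residues k apart.
    thetas = [
        sum((h[sequence[i]] - h[sequence[i + k]]) ** 2 for i in range(n - k)) / (n - k)
        for k in range(1, lam + 1)
    ]
    denom = sum(freqs) + weight * sum(thetas)
    return [f / denom for f in freqs] + [weight * t / denom for t in thetas]
```

Vectors of this kind have the same length for every sequence, so they can be passed directly to classifiers such as the SVM and random forest models mentioned in the abstract.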
Topics: Computational Biology; Proteins; Amino Acid Sequence; Amino Acids; Algorithms; Support Vector Machine; Sequence Analysis, Protein; Databases, Protein
PubMed: 38217977
DOI: 10.1016/j.compbiomed.2024.107956 -
Biomacromolecules, Feb 2023
Random heteropolymers (RHPs) have been computationally designed and experimentally shown to recapitulate protein-like phase behavior and function. However, unlike proteins, RHP sequences are only statistically defined and cannot be sequenced. Recent developments in reversible-deactivation radical polymerization allowed simulated polymer sequences based on the well-established Mayo-Lewis equation to more accurately reflect ground-truth sequences that are experimentally synthesized. This led to opportunities to perform bioinformatics-inspired analysis on simulated sequences to guide the design, synthesis, and interpretation of RHPs. We compared batches on the order of 10,000 simulated RHP sequences that vary by synthetically controllable and measurable RHP characteristics such as chemical heterogeneity and average degree of polymerization. Our analysis spans three levels: segments along a single chain, sequences within a batch, and batch-averaged statistics. We discuss simulator fidelity and highlight the importance of robust segment definition. Examples are presented that demonstrate the use of simulated sequence analysis for in-silico iterative design to mimic protein hydrophobic/hydrophilic segment distributions in RHPs, and that compare RHP and protein sequence segments to explain experimental results of RHPs that mimic protein function. To facilitate community use of this workflow, the simulator and analysis modules have been made available through an open source toolkit, the RHPapp.
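A minimal sketch of sequence simulation under the Mayo-Lewis (terminal) model is given below. It is not the RHPapp simulator. It assumes only two monomers labeled "A" and "B", a constant feed composition (no compositional drift), hypothetical reactivity ratios `r1` and `r2`, and a first monomer drawn from the feed fraction.

```python
import random
from itertools import groupby


def simulate_chain(length: int, f1: float, r1: float, r2: float,
                   rng: random.Random) -> str:
    """Grow one chain; the next monomer depends only on the current chain end."""
    if not 0.0 <= f1 <= 1.0 or r1 <= 0 or r2 <= 0:
        raise ValueError("need 0 <= f1 <= 1 and positive reactivity ratios")
    f2 = 1.0 - f1
    p_a_after_a = r1 * f1 / (r1 * f1 + f2)  # P(add A | chain ends in A)
    p_a_after_b = f1 / (f1 + r2 * f2)       # P(add A | chain ends in B)
    chain = ["A" if rng.random() < f1 else "B"]
    while len(chain) < length:
        p_a = p_a_after_a if chain[-1] == "A" else p_a_after_b
        chain.append("A" if rng.random() < p_a else "B")
    return "".join(chain)


def simulate_batch(n_chains: int, length: int, f1: float, r1: float,
                   r2: float, seed: int = 0) -> list[str]:
    """Simulate a batch of chains with a reproducible random seed."""
    rng = random.Random(seed)
    return [simulate_chain(length, f1, r1, r2, rng) for _ in range(n_chains)]


def segment_lengths(chain: str, monomer: str) -> list[int]:
    """Lengths of consecutive runs of one monomer along a single chain."""
    return [len(list(run)) for label, run in groupby(chain) if label == monomer]
```

Run-length distributions collected with `segment_lengths` over a batch correspond to the segment-level and batch-level views described in the abstract, under the simplifications stated above.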
Topics: Proteins; Polymers; Amino Acid Sequence; Polymerization
PubMed: 36638823
DOI: 10.1021/acs.biomac.2c01036 -
The Journal of Physical Chemistry... Aug 2022
A self-consistent analytical solution for binodal concentrations of the two-component Flory-Huggins phase separation model is derived. We show that this form extends the validity of the Ginzburg-Landau expansion away from the critical point to cover the whole phase space. Furthermore, this analytical solution reveals an exponential scaling law of the dilute phase binodal concentration as a function of the interaction strength and chain length. We demonstrate explicitly the power of this approach by fitting experimental protein liquid-liquid phase separation boundaries to determine the effective chain length and solute-solvent interaction energies. Moreover, we demonstrate that this strategy allows us to resolve differences in interaction energy contributions of individual amino acids. This analytical framework can serve as a new way to decode the protein sequence grammar for liquid-liquid phase separation.
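For orientation, the standard two-component Flory-Huggins free energy of mixing and the textbook expressions for its spinodal and critical point are sketched below. This does not reproduce the analytical binodal solution derived in the paper. The symbols (`phi` for solute volume fraction, `n` for chain length, `chi` for the interaction parameter) and the function names are choices made here.

```python
import math


def free_energy(phi: float, n: float, chi: float) -> float:
    """Flory-Huggins free energy of mixing per lattice site, in units of kT."""
    return (phi / n) * math.log(phi) + (1 - phi) * math.log(1 - phi) \
        + chi * phi * (1 - phi)


def chi_spinodal(phi: float, n: float) -> float:
    """Interaction parameter at which the curvature of free_energy vanishes."""
    return 0.5 * (1 / (n * phi) + 1 / (1 - phi))


def critical_point(n: float) -> tuple[float, float]:
    """Critical volume fraction and critical interaction parameter."""
    root_n = math.sqrt(n)
    phi_c = 1 / (1 + root_n)
    chi_c = 0.5 * (1 + 1 / root_n) ** 2
    return phi_c, chi_c
```

For long chains the critical volume fraction moves toward zero, which is the regime in which the dilute-phase binodal discussed in the abstract becomes very small.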
Topics: Amino Acid Sequence; Proteins; Solutions; Solvents; Thermodynamics
PubMed: 35977086
DOI: 10.1021/acs.jpclett.2c01986 -
Current Opinion in Structural Biology, Feb 2021
Review
The grand challenge of protein design is a general method for producing a polypeptide with arbitrary functionality, conformation, and biochemical properties. To that end, a wide variety of methods have been developed for the improvement of native proteins, the design of ideal proteins de novo, and the redesign of suboptimal proteins with better-performing substructures. These methods employ informatic comparisons of function-structure-sequence relationships as well as knowledge-based evaluation of protein properties to narrow the immense protein sequence search space down to an enumerable and often manually evaluable set of structures that meet specified criteria. While arbitrary manipulation of protein-protein interfaces and molecular catalysis remains an unsolved problem, and no protein shape or behavior manipulation algorithm is universally applicable, the promising results thus far are a strong indicator that a general approach to the arbitrary manipulation of polypeptides is within reach.
Topics: Algorithms; Amino Acid Sequence; Catalysis; Protein Conformation; Protein Folding; Proteins
PubMed: 33276237
DOI: 10.1016/j.sbi.2020.10.015 -
Progress in Biophysics and Molecular... Sep 2022
Review
Because of the increase in different types of diseases in human habitats, the demand for designing various types of drugs is also increasing. Proteins and their structures play a very important role in drug design. Researchers from different areas such as mathematics, medicine, and computer science are therefore teaming up to find better solutions in this field. In this paper, we discuss different methods of secondary and tertiary protein structure prediction (PSP), along with the limitations of the different approaches. The types of datasets used in PSP are also discussed, as are the performance measures used to evaluate the prediction accuracy of PSP methods. Various software packages and servers that predict the structure of an input protein sequence are available for download. These tools also help to compare the performance of any new algorithm with other available methods, and their details are given in this paper.
Topics: Algorithms; Amino Acid Sequence; Humans; Protein Structure, Tertiary; Proteins; Software
PubMed: 35588858
DOI: 10.1016/j.pbiomolbio.2022.05.002 -
Cell Systems, Jun 2021
Language models have recently emerged as a powerful machine-learning approach for distilling information from massive protein sequence databases. From readily available sequence data alone, these models discover evolutionary, structural, and functional organization across protein space. Using language models, we can encode amino-acid sequences into distributed vector representations that capture their structural and functional properties, as well as evaluate the evolutionary fitness of sequence variants. We discuss recent advances in protein language modeling and their applications to downstream protein property prediction problems. We then consider how these models can be enriched with prior biological knowledge and introduce an approach for encoding protein structural knowledge into the learned representations. The knowledge distilled by these models allows us to improve downstream function prediction through transfer learning. Deep protein language models are revolutionizing protein biology. They suggest new ways to approach protein and therapeutic design. However, further developments are needed to encode strong biological priors into protein language models and to increase their accessibility to the broader community.
Topics: Amino Acid Sequence; Databases, Protein; Language; Machine Learning; Proteins
PubMed: 34139171
DOI: 10.1016/j.cels.2021.05.017 -
Scientific Reports, May 2022
Proteins are the essential biological macromolecules required to perform nearly all biological processes and cellular functions. Proteins rarely carry out their tasks in isolation but interact with other proteins (known as protein-protein interaction) present in their surroundings to complete biological activities. Knowledge of protein-protein interactions (PPIs) unravels cellular behavior and functionality. Computational methods automate the prediction of PPIs and are less expensive than experimental methods in terms of resources and time. So far, most work on PPIs has focused mainly on sequence information. Here, we use a graph convolutional network (GCN) and a graph attention network (GAT) to predict the interaction between proteins by utilizing the proteins' structural information and sequence features. We build the graphs of proteins from their PDB files, which contain the 3D coordinates of atoms. The protein graph represents the amino acid network, also known as the residue contact network, where each node is a residue. Two nodes are connected if they have a pair of atoms (one from each node) within the threshold distance. To extract the node/residue features, we use a protein language model. The input to the language model is the protein sequence, and the output is the feature vector for each amino acid of the underlying sequence. We validate the predictive capability of the proposed graph-based approach on two PPI datasets: Human and S. cerevisiae. The results demonstrate the effectiveness of the proposed approach, as it outperforms the previous leading methods. The source code for training and the data used to train the model are available at https://github.com/JhaKanchan15/PPI_GNN.git.
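The edge rule described above (two residues are connected if any pair of their atoms lies within a threshold distance) can be sketched as follows. This is not the code from the linked repository. The input format (a dictionary mapping residue identifiers to atom coordinates), the function name, and the threshold value are assumptions, and PDB parsing is left out.

```python
import math
from itertools import combinations

Coord = tuple[float, float, float]


def residue_contact_graph(residues: dict[int, list[Coord]],
                          threshold: float) -> set[tuple[int, int]]:
    """Return undirected edges between residues with any atom pair in contact."""
    edges = set()
    for (res_a, atoms_a), (res_b, atoms_b) in combinations(residues.items(), 2):
        # One atom from each residue within the threshold is enough for an edge.
        if any(math.dist(p, q) <= threshold for p in atoms_a for q in atoms_b):
            edges.add((res_a, res_b))
    return edges


# Toy coordinates in angstroms; residue 3 is far from the other two.
toy = {
    1: [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)],
    2: [(4.0, 0.0, 0.0)],
    3: [(20.0, 0.0, 0.0)],
}
edges = residue_contact_graph(toy, threshold=5.0)
```

The all-pairs loop is quadratic in the number of residues; a spatial index would be a natural replacement for large structures, but the simple form keeps the edge definition easy to read.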
Topics: Amino Acid Sequence; Amino Acids; Humans; Neural Networks, Computer; Proteins; Saccharomyces cerevisiae
PubMed: 35589837
DOI: 10.1038/s41598-022-12201-9