protein sequence - OpenMD.com Journal Search

Using natural sequences and modularity to design common and novel protein topologies.

Current Opinion in Structural Biology Jun 2016

Review

Authors: Aron Broom, Kyle Trainor, Duncan Ws MacKenzie...

Protein design is still a challenging undertaking, often requiring multiple attempts or iterations for success. Typically, the source of failure is unclear, and scoring metrics appear similar between successful and failed cases. Nevertheless, the use of sequence statistics, modularity and symmetry from natural proteins, combined with computational design both at the coarse-grained and atomistic levels is propelling a new wave of design efforts to success. Here we highlight recent examples of design, showing how the wealth of natural protein sequence and topology data may be leveraged to reduce the search space and increase the likelihood of achieving desired outcomes.

Topics: Amino Acid Sequence; Computational Biology; Protein Engineering; Proteins

PubMed: 27270240
DOI: 10.1016/j.sbi.2016.05.007

Repeatability in protein sequences.

Journal of Structural Biology Nov 2019

Authors: Mohamed Kamel, Pablo Mier, Abdelkamel Tari...

Low complexity regions (LCRs) in protein sequences have special properties that are very different from those of globular proteins. The rules that define secondary structure elements do not apply when the distribution of amino acids becomes biased. While there is a tendency towards structural disorder in LCRs, various examples, and particularly homorepeats of single amino acids, suggest that very short repeats could adopt structures that are very difficult to predict. These structures are possibly variable and dependent on the context of intra- or inter-molecular interactions. In general, short repeats in LCRs can induce structure. This could explain the observation that very short (non-perfect) repeats are widespread and that many define regions with a function in protein interactions. For these reasons, we have developed an algorithm to quickly analyze local repeatability along protein sequences, that is, how close a protein fragment is to a perfect repeat. Using this algorithm we found that the proteins of the yeast Saccharomyces cerevisiae are depleted in short repeats (approximate or not) of odd length, while human proteins are not; that the fish Danio rerio has many proteins with repeats of length two; and that the plant Arabidopsis thaliana has an unusually large number of repeats of length seven. Our method (REpeatability Scanner, RES, accessible at http://cbdm-01.zdv.uni-mainz.de/~munoz/res/) makes it possible to find regions with approximate short repeats in protein sequences, and helps to characterize the variable use of LCRs and compositional bias in different organisms.

Topics: Algorithms; Amino Acid Sequence; Databases, Protein; Evolution, Molecular; Humans; Proteins; Repetitive Sequences, Amino Acid; Sequence Alignment; Sequence Analysis, Protein

PubMed: 31408700
DOI: 10.1016/j.jsb.2019.08.003
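
The notion of local repeatability in the abstract above (how close a fragment is to a perfect repeat) can be illustrated with a minimal Python sketch. This is not the RES algorithm, whose scoring is not described in the abstract. The score used here (the fraction of positions that match the residue one period downstream) and the names `repeatability` and `scan` are assumptions made for illustration only.

```python
def repeatability(fragment: str, period: int) -> float:
    """Fraction of positions i where fragment[i] == fragment[i + period].

    Hypothetical score: 1.0 means a perfect repeat of the given period,
    lower values mean an approximate repeat. Not the RES scoring scheme.
    """
    if period <= 0 or period >= len(fragment):
        raise ValueError("period must be between 1 and len(fragment) - 1")
    pairs = len(fragment) - period
    matches = sum(fragment[i] == fragment[i + period] for i in range(pairs))
    return matches / pairs


def scan(sequence: str, window: int, period: int) -> list[float]:
    """Score every window of the given length along the sequence."""
    return [
        repeatability(sequence[start:start + window], period)
        for start in range(len(sequence) - window + 1)
    ]


if __name__ == "__main__":
    # A perfect period-2 repeat followed by a non-repetitive tail.
    for start, score in enumerate(scan("QAQAQAQAKLMNDE", window=8, period=2)):
        print(start, round(score, 2))
```

Scanning several periods over the same sequence and keeping the best score per window would give a rough profile of which repeat lengths dominate a protein, in the spirit of the per-organism comparison the abstract reports.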

FermatS: A Novel Numerical Representation for Protein Sequence Comparison and DNA-binding Protein Identification.

Combinatorial Chemistry & High... 2021

Comparative Study

Authors: Yanping Zhang, Ya Gao, Jianwei Ni...

AIMS

Based on protein sequence information, a simple and effective method was used to analyze protein sequence similarity and predict DNA-binding proteins.

BACKGROUND

It is essential to develop low-complexity computational methods that accurately infer protein structure, function, and evolution from the rapidly growing volume of available molecular biology data.

OBJECTIVE

Given the rapidly growing volume of available molecular biology data, it is important to develop novel computational algorithms for analyzing and comparing protein sequences.

METHODS

The numerical characteristics of protein sequences were obtained from a global position representation based on Fermat spiral curves, a local position representation based on the normalized moments of inertia of those curves, and the composition of the 20 amino acids.

RESULTS

The method was applied to analyze the similarity/dissimilarity of nine ND5 proteins, and the results are consistent with biological evolution theory. Furthermore, we used logistic regression with 5-fold cross-validation to build a DNA-binding protein prediction model, which outperformed DNAbinder, iDNA-prot, DNA-prot and gDNA-prot by 0.0069-0.609 in terms of F-measure and by 0.293-0.898 in terms of MCC on an unbalanced dataset.

CONCLUSION

These results show that our method, namely FermatS, is effective for comparing, recognizing, and predicting protein sequences.

Topics: Algorithms; Amino Acid Sequence; Computational Biology; DNA-Binding Proteins; Databases, Protein; Protein Conformation

PubMed: 33208064
DOI: 10.2174/1386207323999201117111738
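
The FermatS results above are reported as F-measure and MCC (Matthews correlation coefficient) on an unbalanced dataset. As a reminder of how these two scores are computed from a binary confusion matrix, here is a small Python sketch using the standard definitions. The function names and the example counts are hypothetical and are not taken from the paper.

```python
import math


def f_measure(tp: int, fp: int, fn: int) -> float:
    """F1 score: harmonic mean of precision and recall."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient, in the range [-1, 1].

    Returns 0.0 when any marginal total is zero (a common convention).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


if __name__ == "__main__":
    # Hypothetical unbalanced test set: 10 positives, 90 negatives.
    tp, fn, fp, tn = 8, 2, 5, 85
    print("F-measure:", round(f_measure(tp, fp, fn), 3))
    print("MCC:", round(mcc(tp, tn, fp, fn), 3))
```

MCC uses all four cells of the confusion matrix, which is why it is often preferred over accuracy when one class is much rarer than the other.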

PseAAC2Vec protein encoding for TCR protein sequence classification.

Computers in Biology and Medicine Mar 2024

Authors: Zahra Tayebi, Sarwan Ali, Taslim Murad...

The classification and prediction of T-cell receptor (TCR) protein sequences are of significant interest in understanding the immune system and developing personalized immunotherapies. In this study, we propose a novel approach using Pseudo Amino Acid Composition (PseAAC) protein encoding for accurate TCR protein sequence classification. The PseAAC2Vec encoding method captures the physicochemical properties of amino acids and their local sequence information, enabling the representation of protein sequences as fixed-length feature vectors. By incorporating physicochemical properties such as hydrophobicity, polarity, charge, molecular weight, and solvent accessibility, PseAAC2Vec provides a comprehensive and informative characterization of TCR protein sequences. To evaluate the effectiveness of the proposed PseAAC2Vec encoding approach, we assembled a large dataset of TCR protein sequences with annotated classes. We applied the PseAAC2Vec encoding scheme to each sequence and generated feature vectors based on a specified window size. Subsequently, we employed state-of-the-art machine learning algorithms, such as support vector machines (SVM) and random forests (RF), to classify the TCR protein sequences. Experimental results on the benchmark dataset demonstrated the superior performance of the PseAAC2Vec-based approach compared to existing methods. The PseAAC2Vec encoding effectively captures the discriminative patterns in TCR protein sequences, leading to improved classification accuracy and robustness. Furthermore, the encoding scheme showed promising results across different window sizes, indicating its adaptability to varying sequence contexts.

Topics: Computational Biology; Proteins; Amino Acid Sequence; Amino Acids; Algorithms; Support Vector Machine; Sequence Analysis, Protein; Databases, Protein

PubMed: 38217977
DOI: 10.1016/j.compbiomed.2024.107956
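
To make the idea of a fixed-length pseudo amino acid composition vector concrete, the following Python sketch implements a classic Chou-style PseAAC with a single physicochemical property. It is not the PseAAC2Vec encoding from the paper, whose exact windowing and property set are not given in the abstract. The choice of the Kyte-Doolittle hydrophobicity scale, the parameters `lam` and `weight`, and the function name `pseaac` are assumptions made for this illustration.

```python
from statistics import mean, pstdev

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydrophobicity values (assumed property for this sketch).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Standardize the property to zero mean and unit variance over the 20 residues.
_values = list(KYTE_DOOLITTLE.values())
_mu, _sd = mean(_values), pstdev(_values)
PROPERTY = {aa: (v - _mu) / _sd for aa, v in KYTE_DOOLITTLE.items()}


def pseaac(sequence: str, lam: int = 3, weight: float = 0.05) -> list[float]:
    """Return a (20 + lam)-dimensional Chou-style PseAAC vector.

    The first 20 entries are the normalized amino acid composition; the last
    lam entries are sequence-order correlation factors at lags 1..lam.
    """
    seq = sequence.upper()
    n = len(seq)
    if n <= lam:
        raise ValueError("sequence must be longer than lam")
    if any(aa not in PROPERTY for aa in seq):
        raise ValueError("sequence contains non-standard residues")

    freqs = [seq.count(aa) / n for aa in AMINO_ACIDS]
    thetas = [
        sum((PROPERTY[seq[i]] - PROPERTY[seq[i + k]]) ** 2 for i in range(n - k))
        / (n - k)
        for k in range(1, lam + 1)
    ]
    denom = sum(freqs) + weight * sum(thetas)
    return [f / denom for f in freqs] + [weight * t / denom for t in thetas]


if __name__ == "__main__":
    # Arbitrary illustrative peptide, not taken from any dataset.
    vector = pseaac("CASSLGQAYEQYF", lam=3)
    print(len(vector), [round(x, 3) for x in vector[-3:]])
```

Because every sequence maps to the same number of features regardless of its length, vectors like this can be passed directly to classifiers such as SVMs or random forests, which is the property the abstract relies on.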

Sequence Design of Random Heteropolymers as Protein Mimics.

Biomacromolecules Feb 2023

Authors: Ivan Jayapurna, Zhiyuan Ruan, Marco Eres...

Random heteropolymers (RHPs) have been computationally designed and experimentally shown to recapitulate protein-like phase behavior and function. However, unlike proteins, RHP sequences are only statistically defined and cannot be sequenced. Recent developments in reversible-deactivation radical polymerization allowed simulated polymer sequences based on the well-established Mayo-Lewis equation to more accurately reflect ground-truth sequences that are experimentally synthesized. This led to opportunities to perform bioinformatics-inspired analysis on simulated sequences to guide the design, synthesis, and interpretation of RHPs. We compared batches on the order of 10000 simulated RHP sequences that vary by synthetically controllable and measurable RHP characteristics such as chemical heterogeneity and average degree of polymerization. Our analysis spans across 3 levels: segments along a single chain, sequences within a batch, and batch-averaged statistics. We discuss simulator fidelity and highlight the importance of robust segment definition. Examples are presented that demonstrate the use of simulated sequence analysis for in-silico iterative design to mimic protein hydrophobic/hydrophilic segment distributions in RHPs and compare RHP and protein sequence segments to explain experimental results of RHPs that mimic protein function. To facilitate the community use of this workflow, the simulator and analysis modules have been made available through an open source toolkit, the RHPapp.

Topics: Proteins; Polymers; Amino Acid Sequence; Polymerization

PubMed: 36638823
DOI: 10.1021/acs.biomac.2c01036
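
The abstract above mentions simulating polymer sequences from the Mayo-Lewis equation. A minimal way to see what that means is the two-monomer terminal model, where the probability of adding a monomer depends only on the current chain end, the feed composition, and the reactivity ratios. The Python sketch below is an illustration under strong simplifying assumptions (two monomers, constant feed, no composition drift). It is not the RHPapp simulator, and all function names are hypothetical.

```python
import random


def transition_probs(f1: float, r1: float, r2: float) -> tuple[float, float]:
    """Terminal-model probabilities (P11, P22) for feed fraction f1 of monomer A.

    P11 is the chance that a chain ending in A adds another A;
    P22 is the chance that a chain ending in B adds another B.
    """
    if not 0.0 < f1 < 1.0:
        raise ValueError("f1 must be strictly between 0 and 1")
    f2 = 1.0 - f1
    p11 = r1 * f1 / (r1 * f1 + f2)
    p22 = r2 * f2 / (r2 * f2 + f1)
    return p11, p22


def instantaneous_composition(f1: float, r1: float, r2: float) -> float:
    """Mayo-Lewis equation: mole fraction of A in copolymer formed at feed f1."""
    f2 = 1.0 - f1
    numerator = r1 * f1 ** 2 + f1 * f2
    return numerator / (r1 * f1 ** 2 + 2 * f1 * f2 + r2 * f2 ** 2)


def simulate_chain(length: int, f1: float, r1: float, r2: float,
                   rng: random.Random) -> str:
    """Grow one chain as a first-order Markov process at constant feed."""
    if length < 1:
        raise ValueError("length must be at least 1")
    p11, p22 = transition_probs(f1, r1, r2)
    chain = ["A" if rng.random() < f1 else "B"]
    while len(chain) < length:
        if chain[-1] == "A":
            chain.append("A" if rng.random() < p11 else "B")
        else:
            chain.append("B" if rng.random() < p22 else "A")
    return "".join(chain)


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so the batch is reproducible
    batch = [simulate_chain(50, f1=0.5, r1=0.4, r2=2.0, rng=rng) for _ in range(5)]
    for chain in batch:
        print(chain, round(chain.count("A") / len(chain), 2))
```

Generating a large batch of such chains and then measuring run lengths of each monomer type is one simple form of the segment-level and batch-level analysis the abstract describes.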

Analytical Solution to the Flory-Huggins Model.

The Journal of Physical Chemistry... Aug 2022

Authors: Daoyuan Qian, Thomas C T Michaels, Tuomas P J Knowles...

A self-consistent analytical solution for binodal concentrations of the two-component Flory-Huggins phase separation model is derived. We show that this form extends the validity of the Ginzburg-Landau expansion away from the critical point to cover the whole phase space. Furthermore, this analytical solution reveals an exponential scaling law of the dilute phase binodal concentration as a function of the interaction strength and chain length. We demonstrate explicitly the power of this approach by fitting experimental protein liquid-liquid phase separation boundaries to determine the effective chain length and solute-solvent interaction energies. Moreover, we demonstrate that this strategy allows us to resolve differences in interaction energy contributions of individual amino acids. This analytical framework can serve as a new way to decode the protein sequence grammar for liquid-liquid phase separation.

Topics: Amino Acid Sequence; Proteins; Solutions; Solvents; Thermodynamics

PubMed: 35977086
DOI: 10.1021/acs.jpclett.2c01986
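
The exponential scaling of the dilute-phase binodal mentioned above can be seen numerically in the simplest special case of the Flory-Huggins model. The Python sketch below assumes the symmetric case (both components with chain length 1), where the coexistence condition reduces to ln(phi / (1 - phi)) + chi (1 - 2 phi) = 0 and phase separation requires chi > 2. It solves this by bisection. This is a numerical illustration only, not the analytical solution derived in the paper, and the function name `dilute_binodal` is hypothetical.

```python
import math


def dilute_binodal(chi: float, iterations: int = 200) -> float:
    """Dilute-phase binodal volume fraction for symmetric Flory-Huggins (N = 1).

    Solves ln(phi / (1 - phi)) + chi * (1 - 2 * phi) = 0 on (0, 0.5).
    """
    if chi <= 2.0:
        raise ValueError("phase separation requires chi > 2 in the symmetric case")

    def g(phi: float) -> float:
        return math.log(phi / (1.0 - phi)) + chi * (1.0 - 2.0 * phi)

    # Bracket the root: g is negative at lo and positive at the spinodal point hi.
    lo = 0.5 * math.exp(-chi)
    hi = 0.5 * (1.0 - math.sqrt(1.0 - 2.0 / chi))
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if g(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    for chi in (2.5, 3.0, 4.0, 6.0):
        phi = dilute_binodal(chi)
        # For large chi the ratio phi / exp(-chi) approaches 1.
        print(chi, phi, phi / math.exp(-chi))
```

In this special case the dilute-phase concentration approaches exp(-chi) as the interaction strength grows, which gives a feel for the kind of exponential dependence on interaction strength that the abstract reports for the general two-component model.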

Toward complete rational control over protein structure and function through computational design.

Current Opinion in Structural Biology Feb 2021

Review

Authors: Jared Adolf-Bryfogle, Frank D Teets, Christopher D Bahl...

The grand challenge of protein design is a general method for producing a polypeptide with arbitrary functionality, conformation, and biochemical properties. To that end, a wide variety of methods have been developed for the improvement of native proteins, the design of ideal proteins de novo, and the redesign of suboptimal proteins with better-performing substructures. These methods employ informatic comparisons of function-structure-sequence relationships as well as knowledge-based evaluation of protein properties to narrow the immense protein sequence search space down to an enumerable and often manually evaluable set of structures that meet specified criteria. While arbitrary manipulation of protein-protein interfaces and molecular catalysis remains an unsolved problem, and no protein shape or behavior manipulation algorithm is universally applicable, the promising results thus far are a strong indicator that a general approach to the arbitrary manipulation of polypeptides is within reach.

Topics: Algorithms; Amino Acid Sequence; Catalysis; Protein Conformation; Protein Folding; Proteins

PubMed: 33276237
DOI: 10.1016/j.sbi.2020.10.015

Different methods, techniques and their limitations in protein structure prediction: A review.

Progress in Biophysics and Molecular... Sep 2022

Review

Authors: Vrushali Bongirwar, A S Mokhade

Because of the increase in different types of diseases in human habitats, the demand for designing various types of drugs is also increasing. Proteins and their structures play a very important role in drug design. Researchers from areas such as mathematics, medicine, and computer science are therefore teaming up to find better solutions in this field. In this paper, we discuss different methods of secondary and tertiary protein structure prediction (PSP), along with the limitations of the different approaches. The types of datasets used in PSP are also discussed, as are the performance measures used to evaluate the prediction accuracy of PSP methods. Various software packages and servers are available for download that predict the protein structure for an input protein sequence. These tools also help to compare the performance of any new algorithm with other available methods. Details of these tools are also given in this paper.

Topics: Algorithms; Amino Acid Sequence; Humans; Protein Structure, Tertiary; Proteins; Software

PubMed: 35588858
DOI: 10.1016/j.pbiomolbio.2022.05.002

Learning the protein language: Evolution, structure, and function.

Cell Systems Jun 2021

Authors: Tristan Bepler, Bonnie Berger

Language models have recently emerged as a powerful machine-learning approach for distilling information from massive protein sequence databases. From readily available sequence data alone, these models discover evolutionary, structural, and functional organization across protein space. Using language models, we can encode amino-acid sequences into distributed vector representations that capture their structural and functional properties, as well as evaluate the evolutionary fitness of sequence variants. We discuss recent advances in protein language modeling and their applications to downstream protein property prediction problems. We then consider how these models can be enriched with prior biological knowledge and introduce an approach for encoding protein structural knowledge into the learned representations. The knowledge distilled by these models allows us to improve downstream function prediction through transfer learning. Deep protein language models are revolutionizing protein biology. They suggest new ways to approach protein and therapeutic design. However, further developments are needed to encode strong biological priors into protein language models and to increase their accessibility to the broader community.

Topics: Amino Acid Sequence; Databases, Protein; Language; Machine Learning; Proteins

PubMed: 34139171
DOI: 10.1016/j.cels.2021.05.017

Prediction of protein-protein interaction using graph neural networks.

Scientific Reports May 2022

Authors: Kanchan Jha, Sriparna Saha, Hiteshi Singh...

Proteins are the essential biological macromolecules required to perform nearly all biological processes and cellular functions. Proteins rarely carry out their tasks in isolation but interact with other proteins present in their surroundings (known as protein-protein interaction) to complete biological activities. Knowledge of protein-protein interactions (PPIs) helps to unravel cellular behavior and functionality. Computational methods automate the prediction of PPIs and are less expensive than experimental methods in terms of resources and time. So far, most work on PPI prediction has focused mainly on sequence information. Here, we use a graph convolutional network (GCN) and a graph attention network (GAT) to predict the interaction between proteins by utilizing the proteins' structural information and sequence features. We build the graphs of proteins from their PDB files, which contain the 3D coordinates of atoms. The protein graph represents the amino acid network, also known as the residue contact network, where each node is a residue. Two nodes are connected if they have a pair of atoms (one from each node) within the threshold distance. To extract the node/residue features, we use a protein language model. The input to the language model is the protein sequence, and the output is a feature vector for each amino acid of the underlying sequence. We validate the predictive capability of the proposed graph-based approach on two PPI datasets: Human and S. cerevisiae. The results demonstrate the effectiveness of the proposed approach, as it outperforms the previous leading methods. The source code for training and the data used to train the model are available at https://github.com/JhaKanchan15/PPI_GNN.git.

Topics: Amino Acid Sequence; Amino Acids; Humans; Neural Networks, Computer; Proteins; Saccharomyces cerevisiae

PubMed: 35589837
DOI: 10.1038/s41598-022-12201-9
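
The residue contact network described in the abstract above (residues as nodes, an edge when any atom pair from two residues lies within a threshold distance) can be sketched in a few lines of Python. This is an illustration of the stated rule only, not code from the PPI_GNN repository. The input format (a list of per-residue atom coordinate lists), the function name `contact_edges`, and the 6.0 Å default threshold are assumptions, since the abstract does not state the threshold value.

```python
import math
from itertools import combinations

Atom = tuple[float, float, float]


def contact_edges(residues: list[list[Atom]],
                  threshold: float = 6.0) -> set[tuple[int, int]]:
    """Return undirected edges (i, j), i < j, between residues in contact.

    Two residues are in contact if at least one atom of residue i lies within
    `threshold` (same length unit as the coordinates) of an atom of residue j.
    """
    edges = set()
    for (i, atoms_i), (j, atoms_j) in combinations(enumerate(residues), 2):
        if any(math.dist(a, b) <= threshold for a in atoms_i for b in atoms_j):
            edges.add((i, j))
    return edges


if __name__ == "__main__":
    # Toy coordinates, not taken from a real PDB entry.
    toy_residues = [
        [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)],
        [(5.0, 0.0, 0.0)],
        [(20.0, 0.0, 0.0)],
    ]
    print(sorted(contact_edges(toy_residues, threshold=4.0)))
```

The all-pairs loop is quadratic in the number of residues and atoms, which is acceptable for a sketch. For large structures a spatial index (for example a k-d tree) would be the usual way to find neighbors within the cutoff.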