-
Molecular Biology and Evolution Apr 2023
Low complexity sequences (LCRs) are well known within coding as well as non-coding sequences. A low complexity region within a protein must be encoded by the underlying DNA sequence. Here, we examine the relationship between the entropy of the protein sequence and that of the DNA sequence which encodes it. We show that they are poorly correlated whether starting with a low complexity region within the protein and comparing it to the corresponding sequence in the DNA or by finding a low complexity region within coding DNA and comparing it to the corresponding sequence in the protein. We show this is the case within the proteomes of five model organisms: Homo sapiens, Saccharomyces cerevisiae, Drosophila melanogaster, Caenorhabditis elegans, and Arabidopsis thaliana. We also report a significant bias against mononucleic codons in LCR encoding sequences. By comparison with simulated proteomes, we show that highly repetitive LCRs may be explained by neutral, slippage-based evolution, but compositionally biased LCRs with cryptic repeats are not. We demonstrate that other biological biases and forces must be acting to create and maintain these LCRs. Uncovering these forces will improve our understanding of protein LCR evolution.
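The comparison described above, entropy of a protein segment versus entropy of the DNA that encodes it, can be illustrated with a minimal sketch. It assumes plain Shannon entropy over symbol composition; the abstract does not specify the exact complexity measure or windowing used, and the sequences below are hypothetical examples, not data from the study.

```python
from collections import Counter
from math import log2


def shannon_entropy(seq: str) -> float:
    """Shannon entropy (bits per symbol) of the symbol composition of seq."""
    if not seq:
        return 0.0
    counts = Counter(seq)
    n = len(seq)
    # sum of p * log2(1/p), written as (c/n) * log2(n/c)
    return sum((c / n) * log2(n / c) for c in counts.values())


# Hypothetical example: a poly-glutamine run encoded by a mix of CAG and CAA codons.
protein = "QQQQQQ"
coding_dna = "CAGCAACAGCAGCAACAG"

protein_entropy = shannon_entropy(protein)   # one symbol only, so zero entropy
dna_entropy = shannon_entropy(coding_dna)    # three nucleotides, so entropy above zero

print(f"protein: {protein_entropy:.3f} bits, DNA: {dna_entropy:.3f} bits")
```

Note that the two values are on different scales (a 20-letter versus a 4-letter alphabet), so a real comparison would likely normalize by the maximum possible entropy of each alphabet.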
Topics: Animals; Proteome; Drosophila melanogaster; DNA; Amino Acid Sequence; Saccharomyces cerevisiae
PubMed: 37036379
DOI: 10.1093/molbev/msad084
-
Genome Research Jul 2023
Multiple sequence alignment (MSA) is a critical step in the study of protein sequence and function. Typically, MSA algorithms progressively align pairs of sequences and combine these alignments with the aid of a guide tree. These alignment algorithms use scoring systems based on substitution matrices to measure amino acid similarities. Although successful, standard methods struggle on sets of proteins with low sequence identity: the so-called twilight zone of protein alignment. For these difficult cases, another source of information is needed. Protein language models are a powerful new approach that leverages massive sequence data sets to produce high-dimensional contextual embeddings for each amino acid in a sequence. These embeddings have been shown to reflect physicochemical and higher-order structural and functional attributes of amino acids within proteins. Here, we present a novel approach to MSA, based on clustering and ordering amino acid contextual embeddings. Our method for aligning semantically consistent groups of proteins circumvents the need for many standard components of MSA algorithms, avoiding initial guide tree construction, intermediate pairwise alignments, gap penalties, and substitution matrices. The added information from contextual embeddings leads to higher accuracy alignments for structurally similar proteins with low amino-acid similarity. We anticipate that protein language models will become a fundamental component of the next generation of algorithms for generating MSAs.
Topics: Sequence Alignment; Proteins; Amino Acid Sequence; Algorithms; Amino Acids; Language
PubMed: 37414576
DOI: 10.1101/gr.277675.123
-
Briefings in Bioinformatics Jan 2023
Protein language models, trained on millions of biologically observed sequences, generate feature-rich numerical representations of protein sequences. These representations, called sequence embeddings, can infer structure-functional properties, despite protein language models being trained on primary sequence alone. While sequence embeddings have been applied toward tasks such as structure and function prediction, applications toward alignment-free sequence classification have been hindered by the lack of studies to derive, quantify and evaluate relationships between protein sequence embeddings. Here, we develop workflows and visualization methods for the classification of protein families using sequence embedding derived from protein language models. A benchmark of manifold visualization methods reveals that Neighbor Joining (NJ) embedding trees are highly effective in capturing global structure while achieving similar performance in capturing local structure compared with popular dimensionality reduction techniques such as t-SNE and UMAP. The statistical significance of hierarchical clusters on a tree is evaluated by resampling embeddings using a variational autoencoder (VAE). We demonstrate the application of our methods in the classification of two well-studied enzyme superfamilies, phosphatases and protein kinases. Our embedding-based classifications remain consistent with and extend upon previously published sequence alignment-based classifications. We also propose a new hierarchical classification for the S-Adenosyl-L-Methionine (SAM) enzyme superfamily which has been difficult to classify using traditional alignment-based approaches. Beyond applications in sequence classification, our results further suggest NJ trees are a promising general method for visualizing high-dimensional data sets.
Topics: Amino Acid Sequence; Cluster Analysis; Proteins; Sequence Alignment
PubMed: 36642409
DOI: 10.1093/bib/bbac619
-
Sheng Wu Gong Cheng Xue Bao = Chinese... Nov 2021
Review
The accumulation of protein sequence and structure data allows researchers to obtain a large amount of descriptive information; at the same time, it creates an urgent need to extract information from existing data efficiently and apply it to downstream tasks. Protein design enables the development of novel proteins that are no longer restricted by experimental conditions, which is of great significance for drug target prediction, drug discovery, and material design. As an efficient method for data feature extraction, deep learning can be used to model protein data and to incorporate a priori information into the design of novel proteins. Therefore, protein design based on deep learning has become a promising approach despite many challenges. This review summarizes deep learning-based modeling and design methods for protein sequence and structure data, highlighting their strategies, principles, scope of application, and case studies, with the aim of providing a valuable reference for researchers in the field.
Topics: Amino Acid Sequence; Deep Learning; Drug Development; Proteins
PubMed: 34841791
DOI: 10.13345/j.cjb.210393
-
Proteins Apr 2023
The flexibility of protein structure is related to various biological processes, such as molecular recognition, allosteric regulation, catalytic activity, and protein stability. At the molecular level, protein dynamics and flexibility are important factors in understanding protein function. DNA-binding proteins and Coronavirus proteins are of great interest and are relatively unique proteins. However, exploring their flexibility through experiments or calculations is difficult. Because protein dihedral rotational motion can be used to predict protein structural changes, it provides key information about local protein conformation. This paper therefore introduces DihProFle (Prediction of DNA-binding proteins and Coronavirus proteins flexibility introduces the calculated dihedral Angle information), a method to improve the accuracy of protein flexibility prediction. Based on protein dihedral angle information, protein evolution information, and the physicochemical properties of amino acids, DihProFle predicts protein flexibility for two cases, DNA-binding proteins and Coronavirus proteins, and assigns a flexibility class to each position in the protein sequence. Compared with flexibility prediction using only sequence evolution information and physicochemical properties of amino acids, adding protein dihedral angle information improved prediction accuracy by 2.2% and 3.1% under the nonstrict and strict conditions, respectively. DihProFle also achieves better performance than previous methods for protein flexibility analysis. In addition, we analyzed the correlation of amino acid properties and protein dihedral angles with residue flexibility.
The results show that charged hydrophilic residues make up a higher proportion of the flexible region, and that rigid regions tend to fall within particular ranges of the protein dihedral angles (for example, residues with a ψ angle in the range 91°-120° are more often flexible than rigid). The results therefore indicate that hydrophilic residues and protein dihedral angle information play an important role in protein flexibility.
Topics: DNA-Binding Proteins; Coronavirus; Protein Conformation; Amino Acids; Amino Acid Sequence
PubMed: 36321218
DOI: 10.1002/prot.26443
-
Bioinformatics (Oxford, England) Feb 2020
MOTIVATION
Protein-protein interactions (PPIs) play important roles in many biological processes. Conventional biological experiments for identifying PPI sites are costly and time-consuming. Thus, many computational approaches have been proposed to predict PPI sites. Existing computational methods usually use local contextual features to predict PPI sites. However, global features of protein sequences are also critical for PPI site prediction.
RESULTS
We propose DeepPPISP, a new end-to-end deep learning framework that combines local contextual and global sequence features for PPI site prediction. For local contextual features, we use a sliding window to capture features of the neighbors of a target amino acid, as in previous studies. For global sequence features, a text convolutional neural network is applied to extract features from the whole protein sequence. The local contextual and global sequence features are then combined to predict PPI sites. By integrating both kinds of features, DeepPPISP achieves state-of-the-art performance, outperforming the competing methods. To investigate whether global sequence features are helpful in our deep learning model, we remove or change some components of DeepPPISP. Detailed analyses show that global sequence features play important roles in DeepPPISP.
AVAILABILITY AND IMPLEMENTATION
The DeepPPISP web server is available at http://bioinformatics.csu.edu.cn/PPISP/. The source code can be obtained from https://github.com/CSUBioGroup/DeepPPISP.
SUPPLEMENTARY INFORMATION
Supplementary data are available at Bioinformatics online.
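The sliding window mentioned in the RESULTS section above can be sketched as follows. This is only an illustration of the general technique: the window radius and the padding symbol are assumptions, the function name is hypothetical, and the actual DeepPPISP feature encoding is not described in this abstract.

```python
def sliding_windows(sequence: str, radius: int = 3, pad: str = "X") -> list[str]:
    """Return one window of length 2*radius + 1 centred on each residue.

    Positions beyond either end of the sequence are filled with `pad`
    (an assumed placeholder symbol) so every residue gets a full window.
    """
    padded = pad * radius + sequence + pad * radius
    width = 2 * radius + 1
    return [padded[i:i + width] for i in range(len(sequence))]


# Hypothetical short sequence; each window holds the target residue in the middle.
for residue, window in zip("MKVLA", sliding_windows("MKVLA", radius=1)):
    print(residue, window)
```

In a full model, each window would be converted to numeric features for its residues before being combined with features extracted from the whole sequence.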
Topics: Amino Acid Sequence; Neural Networks, Computer; Protein Interaction Domains and Motifs; Proteins; Software
PubMed: 31593229
DOI: 10.1093/bioinformatics/btz699
-
MSystems Dec 2022
A protein's function depends on functional residues that determine its binding specificity or its catalytic activity, but these residues are typically not considered when annotating a protein's function. To help biologists investigate the functional residues of proteins, we developed two interactive web-based tools, SitesBLAST and Sites on a Tree. Given a protein sequence, SitesBLAST finds homologs that have known functional residues and shows whether the functional residues are conserved. Sites on a Tree shows how functional residues vary across a protein family by showing them on a phylogenetic tree. These tools are available at http://papers.genomics.lbl.gov/sites. For most microbes of interest, a genome sequence is available, but the function of its proteins is not known. Instead, proteins' functions are predicted from their similarity to other protein sequences. Within a protein's sequence, a few key residues are most important for function, such as catalyzing a chemical reaction or determining what it binds. But most function prediction tools do not take these key residues into account. We developed interactive tools for identifying functional residues in a protein sequence by comparing it to proteins with known functional residues. Our tools also make it easy to compare key residues across many similar proteins. This should help biologists check if a protein's function is predicted correctly, or to predict if groups of similar proteins have conserved functions.
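The idea of checking whether known functional residues are conserved in a query protein can be sketched as below. This is not the SitesBLAST implementation: it assumes a pairwise alignment is already available as two gapped strings of equal length, and the function name, the example sequences, and the site positions are all hypothetical.

```python
def site_conservation(aligned_query: str, aligned_homolog: str,
                      homolog_sites: list[int]) -> list[tuple[int, str, str, bool]]:
    """Report, for each functional site of the homolog, the aligned query residue.

    homolog_sites holds 1-based positions in the ungapped homolog sequence.
    Each result is (position, homolog_residue, query_residue, conserved).
    """
    if len(aligned_query) != len(aligned_homolog):
        raise ValueError("aligned sequences must have equal length")
    wanted = set(homolog_sites)
    results = []
    position = 0  # running 1-based index into the ungapped homolog
    for q, h in zip(aligned_query, aligned_homolog):
        if h == "-":
            continue  # gap in the homolog: no homolog position is consumed
        position += 1
        if position in wanted:
            results.append((position, h, q, q == h))
    return results


# Hypothetical alignment with functional sites at homolog positions 4 and 5.
report = site_conservation("MK-HDS", "MKAHES", [4, 5])
print(report)
```

A query gap at a functional site is reported as not conserved, which is one reasonable convention among several.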
Topics: Phylogeny; Computational Biology; Proteins; Amino Acid Sequence; Data Interpretation, Statistical
PubMed: 36374048
DOI: 10.1128/msystems.00705-22
-
PLoS Computational Biology Apr 2022
Connecting protein sequence to function is becoming increasingly relevant since high-throughput sequencing studies accumulate large amounts of genomic data. In order to go beyond the existing database annotation, it is fundamental to understand the mechanisms underlying functional inheritance and divergence. If the homology relationship between proteins is known, can we determine whether the function diverged? In this work, we analyze different possibilities of protein sequence evolution after gene duplication and identify "inter-paralog inversions", i.e., sites where the relationship between the ancestry and the functional signal is decoupled. The amino acids in these sites are masked from being recognized by other prediction tools. Still, they play a role in functional divergence and could indicate a shift in protein function. We develop a method to specifically recognize inter-paralog amino acid inversions in a phylogeny and test it on real and simulated datasets. In a dataset built from the Epidermal Growth Factor Receptor (EGFR) sequences found in 88 fish species, we identify 19 amino acid sites that went through inversion after gene duplication, mostly located at the ligand-binding extracellular domain. Our work uncovers an outcome of protein duplications with direct implications in protein functional annotation and sequence evolution. The developed method is optimized to work with large protein datasets and can be readily included in a targeted protein analysis pipeline.
Topics: Amino Acid Sequence; Amino Acids; Animals; Chromosome Inversion; Evolution, Molecular; Gene Duplication; Phylogeny; Proteins
PubMed: 35377869
DOI: 10.1371/journal.pcbi.1010016
-
Advances in Experimental Medicine and... 2019
Review
Matrix-Assisted Laser Desorption Ionization In-Source Decay (MALDI-ISD) Mass Spectrometry is a very powerful tool for providing terminal sequence information of biomolecules with minimal sample preparation. According to the proposed mechanism, fragmentation in the MALDI-ISD process is induced at the position where a hydrogen radical transfers from the matrix to the analyte. Uniform fragmentation in MALDI-ISD generates relatively simple ion spectra of readable sequence ladders with labile modifications retained, which is an advantage over other fragmentation methods such as collision-induced dissociation (CID) for characterizing modifications. MALDI-ISD has been applied to de novo sequencing of a 13.6 kDa protein and to fully validating the sequences of therapeutic antibodies, showing its promise for examining reference sequences of biotherapeutics unambiguously. It has also been successfully applied to the analysis of modifications such as post-translational modifications (PTMs) and PEGylation. Here we discuss the applications of MALDI-ISD in protein sequencing and modification analysis by featuring representative studies in detail.
Topics: Amino Acid Sequence; Hydrogen; Proteins; Sequence Analysis, Protein; Spectrometry, Mass, Matrix-Assisted Laser Desorption-Ionization
PubMed: 31347041
DOI: 10.1007/978-3-030-15950-4_3
-
Methods in Molecular Biology (Clifton,... 2019
Evolutionary domains are protein regions with observable sequence similarity to other known domains. Here we describe how to use common sequence and profile alignment algorithms (i.e., BLAST, HHsearch) to delineate putative domains in novel protein sequences, given a reference library of protein domains. In this case, we use our database of evolutionary domains (ECOD) as a reference, but other domain sequence libraries could be used (e.g., SCOP, CATH). We describe our domain partition algorithm along with specific notes on how to avoid domain indexing errors when working with multiple data sources and software algorithms with differing outputs.
Topics: Algorithms; Amino Acid Sequence; Databases, Protein; Protein Structure, Tertiary; Proteins; Sequence Alignment; Sequence Analysis, Protein; Sequence Homology, Amino Acid; Software
PubMed: 30298403
DOI: 10.1007/978-1-4939-8736-8_15