-
PloS One 2017
The currently known protein sequences are not distributed equally in sequence space, but cluster into families. Analyzing the cluster size distribution gives a glimpse of the large and unknown extant protein sequence space, which has been explored during evolution. For six protein superfamilies with different folds and functions, the cluster size distributions followed a power law with slopes between 2.4 and 3.3, which represent upper limits to the cluster distribution of extant sequences. The power law distribution of cluster sizes is in accordance with percolation theory and strongly supports connectedness of extant sequence space. Percolation of extant sequence space has three major consequences: (1) It changes our view of sequence space to that of a highly connected network in which each sequence has multiple neighbors and each pair of sequences is connected by many different paths. A high degree of connectedness is a necessary condition of efficient evolution, because it overcomes the possible blockage by sign epistasis and reciprocal sign epistasis. (2) The Fisher exponent is an indicator of connectedness and saturation of sequence space of each protein superfamily. (3) All clusters are expected to be connected by extant sequences that become apparent as a higher portion of extant sequence space becomes known. Being linked to biochemically distinct homologous families, bridging sequences are promising enzyme candidates for applications in biotechnology because they are expected to have substrate ambiguity or catalytic promiscuity.
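As a rough illustration of how a power-law slope could be estimated from a list of cluster sizes, the sketch below uses the standard continuous maximum-likelihood estimator for p(s) ~ s^(-tau). This is an assumption for illustration only; the abstract does not state how the slopes were fitted, and the function names and the cutoff `s_min` are hypothetical. Real cluster sizes are integers, so the continuous estimator is only an approximation for small `s_min`.

```python
import math
import random


def fit_power_law_exponent(cluster_sizes, s_min=1.0):
    """Continuous maximum-likelihood estimate of tau for p(s) ~ s^(-tau), s >= s_min."""
    tail = [s for s in cluster_sizes if s >= s_min]
    log_sum = sum(math.log(s / s_min) for s in tail)
    if log_sum == 0:
        raise ValueError("need at least one cluster larger than s_min")
    return 1.0 + len(tail) / log_sum


def sample_power_law(n, tau, s_min, rng):
    """Draw n synthetic sizes from a continuous power law by inverse-transform sampling."""
    return [s_min * (1.0 - rng.random()) ** (-1.0 / (tau - 1.0)) for _ in range(n)]


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic data only: sizes drawn with a known exponent of 2.5.
    sizes = sample_power_law(20000, tau=2.5, s_min=1.0, rng=rng)
    print(round(fit_power_law_exponent(sizes), 2))
```

Running the estimator on synthetic sizes with a known exponent is a quick sanity check before applying it to real cluster counts.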
Topics: Amino Acid Sequence; Cluster Analysis; Protein Folding; Proteins
PubMed: 29261740
DOI: 10.1371/journal.pone.0189646 -
BMC Bioinformatics Feb 2022
BACKGROUND
Over the past decades, benefiting from the rapid growth of protein sequence data in public databases, many machine learning methods have been developed to predict physicochemical properties or functions of proteins using amino acid sequence features. However, the prediction performance often suffers from the lack of labeled data. In recent years, pre-training methods have been widely studied to address the small-sample issue in computer vision and natural language processing, whereas few pre-training techniques specific to protein sequences exist.
RESULTS
In this paper, we propose a pre-training platform for representing protein sequences, called ProtPlat, which uses the Pfam database to train a three-layer neural network, and then uses specific training data from downstream tasks to fine-tune the model. ProtPlat can learn good representations for amino acids, and at the same time achieve efficient classification. We conduct experiments on three protein classification tasks, including the identification of type III secreted effectors, the prediction of subcellular localization, and the recognition of signal peptides. The experimental results show that pre-training can enhance model performance effectively and that ProtPlat is competitive with state-of-the-art predictors, especially for small datasets. We implement the ProtPlat platform as a web service (https://compbio.sjtu.edu.cn/protplat) that is accessible to the public.
CONCLUSIONS
To enhance the feature representation of protein amino acid sequences and improve the performance of sequence-based classification tasks, we develop ProtPlat, a general platform for the pre-training of protein sequences, which features large-scale supervised training on the Pfam database and an efficient learning model, FastText. The experimental results of three downstream classification tasks demonstrate the efficacy of ProtPlat.
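To make the idea of supervised training on labeled sequences with a FastText-style model more concrete, the sketch below converts a protein sequence into a line of overlapping k-mer "words" prefixed with a class label. The `__label__` prefix is the default label marker in fastText's supervised input format; the choice of overlapping 3-mers, the function names, and the family label shown are assumptions for illustration and are not taken from the ProtPlat paper.

```python
def sequence_to_kmers(sequence, k=3):
    """Split a protein sequence into overlapping k-mers (assumed tokenization)."""
    seq = sequence.strip().upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def to_fasttext_line(label, sequence, k=3):
    """Format one labeled sequence as a line of fastText supervised training input."""
    return f"__label__{label} " + " ".join(sequence_to_kmers(sequence, k))


if __name__ == "__main__":
    # "PF00001" is used here only as a placeholder family label.
    print(to_fasttext_line("PF00001", "MKTAYIAK"))
```

A file of such lines, one per sequence, is the kind of input a fastText-style supervised classifier expects; the same tokenization would then be reused when fine-tuning on a downstream task.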
Topics: Amino Acid Sequence; Machine Learning; Natural Language Processing; Neural Networks, Computer; Proteins
PubMed: 35148686
DOI: 10.1186/s12859-022-04604-2 -
BMC Bioinformatics Sep 2022
BACKGROUND
Due to the ever-expanding gap between the number of proteins being discovered and their functional characterization, protein function inference remains a fundamental challenge in computational biology. Currently, known protein annotations are organized in human-curated ontologies; however, not all possible protein functions may be organized accurately. Meanwhile, recent advancements in natural language processing and machine learning have produced models that embed amino acid sequences as vectors in n-dimensional space. So far, these embeddings have primarily been used to classify protein sequences using manually constructed protein classification schemes.
RESULTS
In this work, we describe the use of amino acid sequence embeddings as a systematic framework for studying protein ontologies. Using a sequence embedding, we show that the bacterial carbohydrate metabolism class within the SEED annotation system contains 48 clusters of embedded sequences despite this class containing 29 functional labels. Furthermore, by embedding Bacillus amino acid sequences with unknown functions, we show that these unknown sequences form clusters that are likely to have similar biological roles.
CONCLUSIONS
This study demonstrates that amino acid sequence embeddings may be a powerful tool for developing more robust ontologies for annotating protein sequence data. In addition, embeddings may be beneficial for clustering protein sequences with unknown functions and selecting optimal candidate proteins to characterize experimentally.
Topics: Amino Acid Sequence; Bacteria; Computational Biology; Humans; Machine Learning; Molecular Sequence Annotation; Proteins
PubMed: 36151519
DOI: 10.1186/s12859-022-04930-5 -
Analytical Chemistry Sep 2023
Membrane proteins are often challenging targets for native top-down mass spectrometry experimentation. The requisite use of membrane mimetics to solubilize such proteins necessitates the application of supplementary activation methods to liberate protein ions prior to sequencing, which typically limits the sequence coverage achieved. Recently, infrared photoactivation has emerged as an alternative to collisional activation for the liberation of membrane proteins from surfactant micelles. However, much remains unknown regarding the mechanism by which IR activation liberates membrane protein ions from such micelles, the extent to which such methods can improve membrane protein sequence coverage, and the degree to which such approaches can be extended to support native proteomics. Here, we describe experiments designed to evaluate and probe infrared photoactivation for membrane protein sequencing, proteoform identification, and native proteomics applications. Our data reveal that infrared photoactivation can dissociate micelles composed of a variety of detergent classes, without the need for a strong IR chromophore by leveraging the relatively weak association energies of such detergent clusters in the gas phase. Additionally, our data illustrate how IR photoactivation can be extended to include membrane mimetics beyond micelles and liberate proteins from nanodiscs, liposomes, and bicelles. Finally, our data quantify the improvements in membrane protein sequence coverage produced through the use of IR photoactivation, which typically leads to membrane protein sequence coverage values ranging from 40 to 60%.
Topics: Detergents; Micelles; Membrane Proteins; Amino Acid Sequence; Mass Spectrometry
PubMed: 37610409
DOI: 10.1021/acs.analchem.3c02788 -
Molecular Biology and Evolution Aug 2017
Recently described stochastic models of protein evolution have demonstrated that the inclusion of structural information in addition to amino acid sequences leads to a more reliable estimation of evolutionary parameters. We present a generative, evolutionary model of protein structure and sequence that is valid on a local length scale. The model concerns the local dependencies between sequence and structure evolution in a pair of homologous proteins. The evolutionary trajectory between the two structures in the protein pair is treated as a random walk in dihedral angle space, which is modeled using a novel angular diffusion process on the two-dimensional torus. Coupling sequence and structure evolution in our model allows for modeling both "smooth" conformational changes and "catastrophic" conformational jumps, conditioned on the amino acid changes. The model has interpretable parameters and is comparatively more realistic than previous stochastic models, providing new insights into the relationship between sequence and structure evolution. For example, using the trained model we were able to identify an apparent sequence-structure evolutionary motif present in a large number of homologous protein pairs. The generative nature of our model enables us to evaluate its validity and its ability to simulate aspects of protein evolution conditioned on an amino acid sequence, a related amino acid sequence, a related structure or any combination thereof.
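The notion of a random walk in dihedral angle space on a two-dimensional torus can be sketched as follows. This is a plain wrapped Gaussian walk written for illustration, not the angular diffusion process of the paper; the step size `sigma`, the starting angles, and the function names are all hypothetical.

```python
import math
import random


def wrap_angle(angle):
    """Map any angle in radians onto the interval [-pi, pi)."""
    return (angle + math.pi) % (2.0 * math.pi) - math.pi


def torus_random_walk(phi, psi, n_steps, sigma, rng):
    """Simulate a (phi, psi) path on the torus with independent Gaussian steps."""
    path = [(wrap_angle(phi), wrap_angle(psi))]
    for _ in range(n_steps):
        # Each angle takes a Gaussian step and is wrapped back onto the circle.
        phi = wrap_angle(phi + rng.gauss(0.0, sigma))
        psi = wrap_angle(psi + rng.gauss(0.0, sigma))
        path.append((phi, psi))
    return path


if __name__ == "__main__":
    walk = torus_random_walk(-1.0, 2.5, n_steps=5, sigma=0.4, rng=random.Random(1))
    for phi, psi in walk:
        print(f"{phi:+.3f} {psi:+.3f}")
```

Wrapping after every step keeps both angles on the torus, so a trajectory that crosses +pi reappears near -pi instead of drifting off the angular range.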
Topics: Amino Acid Sequence; Computer Simulation; Evolution, Molecular; Models, Genetic; Models, Molecular; Protein Conformation; Protein Structural Elements; Proteins; Sequence Alignment; Sequence Analysis, Protein
PubMed: 28453724
DOI: 10.1093/molbev/msx137 -
Nucleic Acids Research 2005
We present a profile-profile multiple alignment strategy that uses database searching to collect homologues for each sequence in a given set, in order to enrich their available evolutionary information for the alignment. For each of the alignment sequences, the putative homologous sequences that score above a pre-defined threshold are incorporated into a position-specific pre-alignment profile. The enriched position-specific profile is used for standard progressive alignment, thereby more accurately describing the characteristic features of the given sequence set. We show that owing to the incorporation of the pre-alignment information into a standard progressive multiple alignment routine, the alignment quality between distant sequences increases significantly and outperforms state-of-the-art methods, such as T-COFFEE and MUSCLE. We also show that although entirely sequence-based, our novel strategy is better at aligning distant sequences when compared with a recent contact-based alignment method. Therefore, our pre-alignment profile strategy should be advantageous for applications that rely on high alignment accuracy such as local structure prediction, comparative modelling and threading.
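A position-specific profile of the kind described above can be illustrated with a minimal sketch that turns a set of already aligned homologues into per-column residue frequencies and compares two profile columns with a dot product. The frequency-only profile and the dot-product score are simplifying assumptions; the abstract does not specify the authors' weighting or scoring scheme, and the function names are hypothetical.

```python
from collections import Counter


def build_profile(aligned_seqs, gap="-"):
    """Return one {residue: frequency} dict per alignment column, ignoring gaps."""
    if not aligned_seqs:
        raise ValueError("no sequences given")
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("aligned sequences must have equal length")
    profile = []
    for col in range(length):
        counts = Counter(s[col] for s in aligned_seqs if s[col] != gap)
        total = sum(counts.values())
        profile.append({aa: c / total for aa, c in counts.items()} if total else {})
    return profile


def column_score(col_a, col_b):
    """Dot product of two frequency columns: a simple profile-profile match score."""
    return sum(p * col_b.get(aa, 0.0) for aa, p in col_a.items())


if __name__ == "__main__":
    # Toy pre-alignment of a query with two hypothetical homologues.
    profile = build_profile(["ACD", "ACE", "A-D"])
    print(profile)
    print(column_score(profile[2], profile[2]))
```

In a progressive alignment, scores like `column_score` would replace single-residue substitution scores, so that information from the collected homologues influences how distant sequences are matched.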
Topics: Algorithms; Amino Acid Sequence; Molecular Sequence Data; Sequence Alignment; Sequence Analysis, Protein; Software
PubMed: 15699183
DOI: 10.1093/nar/gki233 -
New amino acid substitution matrix brings sequence alignments into agreement with structure matches. Proteins Jun 2021
Protein sequence matching presently fails to identify many structures that are highly similar, even when they are known to have the same function. The high packing densities in globular proteins lead to interdependent substitutions, which have not previously been considered for amino acid similarities. At present, sequence matching compares sequences based only upon the similarities of single amino acids, ignoring the fact that in densely packed proteins there are additional conservative substitutions representing exchanges between two interacting amino acids, such as a small-large pair changing to a large-small pair: substitutions that are not individually so conservative. Here we show that including information for such pairs of substitutions yields improved sequence matches, and that these yield significant gains in the agreement between sequence alignments and structure matches of the same protein pair. The results show sequence segments matched where structure segments are aligned. There are gains for all 2002 collected cases where the sequence alignments were not previously congruent with the structure matches. Our results also demonstrate a significant gain in detecting homology for "twilight zone" protein sequences. The amino acid substitution metrics derived have many other potential applications, for annotations, protein design, mutagenesis design, and empirical potential derivation.
Topics: Algorithms; Amino Acid Sequence; Amino Acid Substitution; Amino Acids; Databases, Protein; Datasets as Topic; Humans; Models, Molecular; Protein Engineering; Proteins; Sequence Alignment; Sequence Homology, Amino Acid
PubMed: 33469973
DOI: 10.1002/prot.26050 -
Bioinformatics (Oxford, England) Jul 2018
MOTIVATION
The effective representation of proteins is a crucial task that directly affects the performance of many bioinformatics problems. Related proteins usually bind to similar ligands. Chemical characteristics of ligands are known to capture the functional and mechanistic properties of proteins, suggesting that a ligand-based approach can be utilized in protein representation. In this study, we propose SMILESVec, a Simplified Molecular Input Line Entry System (SMILES)-based method to represent ligands, and a novel method to compute the similarity of proteins by describing them based on their ligands. The proteins are defined utilizing the word embeddings of the SMILES strings of their ligands. The performance of the proposed protein description method is evaluated in a protein clustering task using the TransClust and MCL algorithms. Two other protein representation methods that utilize protein sequence, Basic Local Alignment Search Tool (BLAST) and ProtVec, and two compound fingerprint-based protein representation methods are compared.
RESULTS
We showed that ligand-based protein representation, which uses only SMILES strings of the ligands that proteins bind to, performs as well as protein sequence-based representation methods in protein clustering. The results suggest that ligand-based protein description can be an alternative to the traditional sequence or structure-based representation of proteins and this novel approach can be applied to different bioinformatics problems such as prediction of new protein-ligand interactions and protein function annotation.
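A ligand-based protein representation of this kind can be sketched by averaging the embedding vectors of a protein's ligands and comparing proteins by cosine similarity. The averaging step, the similarity measure, the function names, and the two-dimensional toy vectors below are assumptions for illustration; they are not values or code from SMILESVec.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def protein_vector(ligand_smiles, smiles_embedding):
    """Describe a protein by the mean embedding of the ligands it binds."""
    return mean_vector([smiles_embedding[s] for s in ligand_smiles])


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Toy embeddings keyed by SMILES string; the numbers are made up.
    toy_embedding = {"CCO": [1.0, 0.0], "CC(=O)O": [0.0, 1.0], "c1ccccc1": [1.0, 1.0]}
    protein_a = protein_vector(["CCO", "CC(=O)O"], toy_embedding)
    protein_b = protein_vector(["c1ccccc1"], toy_embedding)
    print(cosine_similarity(protein_a, protein_b))
```

A pairwise similarity matrix built this way is the kind of input that graph-clustering algorithms such as TransClust or MCL operate on.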
AVAILABILITY AND IMPLEMENTATION
https://github.com/hkmztrk/SMILESVecProteinRepresentation.
SUPPLEMENTARY INFORMATION
Supplementary data are available at Bioinformatics online.
Topics: Algorithms; Amino Acid Sequence; Cluster Analysis; Computational Biology; Ligands; Models, Molecular; Protein Binding; Proteins; Sequence Analysis, Protein
PubMed: 29949957
DOI: 10.1093/bioinformatics/bty287 -
BMC Bioinformatics Jun 2021
BACKGROUND
Feature extraction of protein sequences is widely used in various research areas related to protein analysis, such as protein similarity analysis and prediction of protein functions or interactions.
RESULTS
In this study, we introduce FEGS (Feature Extraction based on Graphical and Statistical features), a novel feature extraction model of protein sequences, by developing a new technique for graphical representation of protein sequences based on the physicochemical properties of amino acids and effectively employing the statistical features of protein sequences. By fusing the graphical and statistical features, FEGS transforms a protein sequence into a 578-dimensional numerical vector. When FEGS is applied to phylogenetic analysis on five protein sequence data sets, its performance is notably better than all of the other compared methods.
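To give a sense of what statistical features of a protein sequence can look like, the sketch below computes two common ones: amino acid composition (20 values) and dipeptide composition (400 values). Treating these as the statistical part of a feature vector is an assumption for illustration; the abstract does not list the exact features FEGS uses, and the graphical features are not reproduced here.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(seq):
    """Fraction of each of the 20 standard amino acids in the sequence."""
    if not seq:
        raise ValueError("empty sequence")
    return [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]


def dipeptide_composition(seq):
    """Fraction of each of the 400 ordered amino acid pairs among adjacent residues."""
    if len(seq) < 2:
        raise ValueError("sequence must have at least two residues")
    pairs = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = len(seq) - 1
    return [pairs[a + b] / total for a, b in product(AMINO_ACIDS, repeat=2)]


def statistical_features(seq):
    """Concatenate both compositions into one 420-dimensional vector."""
    return amino_acid_composition(seq) + dipeptide_composition(seq)


if __name__ == "__main__":
    features = statistical_features("MKTAYIAKQR")
    print(len(features))
```

Fixed-length vectors like this allow sequences of different lengths to be compared with ordinary distance measures, which is what makes them usable for similarity or phylogenetic analysis.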
CONCLUSION
The FEGS method is carefully designed and practically powerful for extracting features of protein sequences. The current version of FEGS is developed to be user-friendly and is expected to play a crucial role in related studies of protein sequence analysis.
Topics: Algorithms; Amino Acid Sequence; Amino Acids; Phylogeny; Proteins; Sequence Analysis, Protein
PubMed: 34078264
DOI: 10.1186/s12859-021-04223-3 -
Proteins Jan 2020
Computational design of binding sites in proteins remains difficult, in part due to limitations in our current ability to sample backbone conformations that enable precise and accurate geometric positioning of side chains during sequence design. Here we present a benchmark framework for comparison between flexible-backbone design methods applied to binding interactions. We quantify the ability of different flexible backbone design methods in the widely used protein design software Rosetta to recapitulate observed protein sequence profiles assumed to represent functional protein/protein and protein/small molecule binding interactions. The CoupledMoves method, which combines backbone flexibility and sequence exploration into a single acceptance step during the sampling trajectory, better recapitulates observed sequence profiles than the BackrubEnsemble and FastDesign methods, which separate backbone flexibility and sequence design into separate acceptance steps during the sampling trajectory. Flexible-backbone design with the CoupledMoves method is a powerful strategy for reducing sequence space to generate targeted libraries for experimental screening and selection.
Topics: Algorithms; Amino Acid Sequence; Binding Sites; Biophysical Phenomena; Computational Biology; Humans; Models, Molecular; Protein Binding; Protein Conformation; Protein Engineering; Protein Interaction Mapping; Proteins; Software
PubMed: 31344278
DOI: 10.1002/prot.25790