-
Bioinformatics (Oxford, England), Jul 2023
MOTIVATION
Being able to artificially design novel proteins of desired function is pivotal in many biological and biomedical applications. Generative statistical modeling has recently emerged as a new paradigm for designing amino acid sequences, including in particular models and embedding methods borrowed from natural language processing (NLP). However, most approaches target single proteins or protein domains, and do not take into account any functional specificity or interaction with the context. To extend beyond current computational strategies, we develop a method for generating protein domain sequences intended to interact with another protein domain. Using data from natural multidomain proteins, we cast the problem as a translation problem from a given interactor domain to the new domain to be generated, i.e. we generate artificial partner sequences conditional on an input sequence. We also show in an example that the same procedure can be applied to interactions between distinct proteins.
RESULTS
Evaluating our model's quality using diverse metrics, in part related to distinct biological questions, we show that our method outperforms state-of-the-art shallow autoregressive strategies. We also explore the possibility of fine-tuning pretrained large language models for the same task and of using AlphaFold 2 for assessing the quality of sampled sequences.
AVAILABILITY AND IMPLEMENTATION
Data and code are available at https://github.com/barthelemymp/Domain2DomainProteinTranslation.
Topics: Amino Acid Sequence; Proteins; Language; Protein Domains
PubMed: 37399105
DOI: 10.1093/bioinformatics/btad401
-
Scientific Reports, Nov 2023
The conformational flexibility of natural proteins makes the relationship between structure and function both complex and difficult to understand. The prediction of intrinsically disordered proteins focuses primarily on identifying the structurally flexible regions that are involved in relevant biological functions and various diseases. The order of amino acids in a protein sequence determines its possible conformations, folding flexibility and biological function. Although many methods provide information on intrinsically disordered proteins (IDPs), their results are mainly limited to locating the disordered regions, without knowledge of the possible folding conformations. Here, the developed protein folding fingerprint adopts the protein folding variation matrix (PFVM) to reveal all possible folding patterns for an intrinsically disordered protein along its sequence. The PFVM presents the disordered regions, the degree of disorder and the folding patterns of an intrinsically disordered protein in an integrated way. The PFVM not only provides rich information on IDPs but may also promote the study of the protein folding problem.
Topics: Intrinsically Disordered Proteins; Protein Folding; Amino Acid Sequence; Amino Acids; Protein Conformation
PubMed: 37990040
DOI: 10.1038/s41598-023-45969-5
-
BMC Genomics, Aug 2023
BACKGROUND
The nematode Caenorhabditis briggsae has been used as a model in comparative genomics studies with Caenorhabditis elegans because of their striking morphological and behavioral similarities. However, the potential of C. briggsae for comparative studies is limited by the quality of its genome resources. The genome resources for the C. briggsae laboratory strain AF16 have not been developed to the same extent as those for C. elegans. The recent publication of a new chromosome-level reference genome for QX1410, a C. briggsae wild strain closely related to AF16, has provided the first step to bridge the gap between C. elegans and C. briggsae genome resources. Currently, the QX1410 gene models consist of software-derived gene predictions that contain numerous errors in their structure and coding sequences. In this study, a team of researchers manually inspected over 21,000 gene models and underlying transcriptomic data to repair software-derived errors.
RESULTS
We designed a detailed workflow to train a team of nine students to manually curate gene models using RNA read alignments. We manually inspected the gene models, proposed corrections to the coding sequences of over 8,000 genes, and modeled thousands of putative isoforms and untranslated regions. We exploited the conservation of protein sequence length between C. briggsae and C. elegans to quantify the improvement in protein-coding gene model quality and showed that manual curation led to substantial improvements in the protein sequence length accuracy of QX1410 genes. Additionally, collinear alignment analysis between the QX1410 and AF16 genomes revealed over 1,800 genes affected by spurious duplications and inversions in the AF16 genome that are now resolved in the QX1410 genome.
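The length-based quality check described in this paragraph can be illustrated with a minimal sketch. It assumes that one-to-one ortholog pairs are already known and that a gene model counts as length-accurate when its protein length is within 5% of its C. elegans ortholog. The tolerance, the function names and the toy lengths are hypothetical and are not taken from the study.

```python
def length_ratio(query_len, ref_len):
    """Ratio of a query protein length to the length of its ortholog."""
    if ref_len <= 0:
        raise ValueError("reference length must be positive")
    return query_len / ref_len


def fraction_length_accurate(pairs, tolerance=0.05):
    """Fraction of (query_len, ref_len) pairs whose ratio lies within 1 +/- tolerance."""
    if not pairs:
        return 0.0
    hits = sum(1 for q, r in pairs if abs(length_ratio(q, r) - 1.0) <= tolerance)
    return hits / len(pairs)


# Hypothetical protein lengths: (C. briggsae gene model, C. elegans ortholog)
before = [(300, 400), (510, 500), (120, 250), (805, 800)]
after = [(398, 400), (510, 500), (248, 250), (805, 800)]

print(fraction_length_accurate(before))  # 0.5
print(fraction_length_accurate(after))   # 1.0
```

Comparing the fraction before and after curation gives a single number for the improvement; a real analysis would also need to handle genes without a clear ortholog.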
CONCLUSIONS
Community-based, manual curation using transcriptome data is an effective approach to improve the quality of software-derived protein-coding genes. The detailed protocols provided in this work can be useful for future large-scale manual curation projects in other species. Our manual curation efforts have brought the QX1410 gene models to a level of quality comparable to that of the extensively curated AF16 gene models. The improved genome resources for C. briggsae provide reliable tools for the study of Caenorhabditis biology and other related nematodes.
Topics: Humans; Animals; Caenorhabditis; Caenorhabditis elegans; Exons; Amino Acid Sequence; Gene Expression Profiling
PubMed: 37626289
DOI: 10.1186/s12864-023-09582-0
-
General and Comparative Endocrinology, Jul 2024
G protein-coupled receptor 84 (GPR84) was cloned as an orphan receptor, and medium-chain fatty acids were then revealed as endogenous ligands. GPR84 is expressed in immune cells and is believed to protect liver function from lipotoxicity caused by overeating and high-fat diet intake. This study aimed to present the molecular characterization of GPR84 in domestic cats. cDNA cloning of feline GPR84 showed that its deduced amino acid sequence has high sequence homology (83-89%) with the orthologues from other mammals. Remarkably high mRNA expression was observed in the bone marrow by Q-PCR analysis. Inhibition of intracellular cAMP concentration was observed in cells transfected with feline GPR84 and treated with medium-chain fatty acids. Immunostaining of GPR84 and free fatty acid receptor 2 (FFAR2)/GPR43 in the bone marrow, where high mRNA expression was observed, showed reactions in macrophages and myeloid cells. To clarify whether the receptor undergoes homo- or heteromerization, GPR84 and FFARs were analyzed using NanoLuc binary technology and NanoLuc bioluminescence resonance energy transfer technology, which revealed that GPR84 formed more heteromers with FFAR2 than homomers with itself. In addition, when GPR84 and FFAR2/GPR43 were cotransfected into cells, their localization on the cell membrane was reduced compared with that observed when each receptor was transfected alone. These results indicate that GPR84 is a functional receptor protein that is expressed in cat tissues and may have a protein-protein interaction with FFAR2/GPR43 on the cell membrane.
Topics: Animals; Receptors, G-Protein-Coupled; Cats; Amino Acid Sequence
PubMed: 38641150
DOI: 10.1016/j.ygcen.2024.114520
-
Genes, Dec 2023
[…] is the most widely distributed freshwater shrimp in China, with important economic value and great potential for development. The forkhead box L2 ([…]) gene has been found to be involved in the reproductive development of many crustaceans. To understand the role of the […] gene in the gonad development of […], we designed CDS-specific primers for the […] ([…]) gene and cloned its CDS sequence using RT-PCR. The nucleotide and protein sequence information was then analyzed using bioinformatics. The expression and subcellular localization of […] in various tissues were detected using qRT-PCR and in situ hybridization. The effects of […] knockdown on gonad development were investigated using RNA interference. The results showed that the CDS of the […] gene was 1614 bp long and encoded 537 amino acids. Protein sequence comparison and phylogenetic analysis showed that […] was most closely related to crayfish. qRT-PCR analysis indicated that the expression level of […] in the testis was significantly higher (>40-fold) than that in the ovary (P < 0.01). The in situ hybridization results showed that […] was expressed in both the cytoplasm and the nucleus of egg cells, and that the expression was strongest in egg cells at the early stage of yolk synthesis and weak in the secondary oocytes. The positive signal was strongest in the spermatocyte nucleolus, while only a trace signal was observed in the cytoplasm. After interfering with the […] gene using dsRNA, the expression of […] in the RNA interference group was significantly lower than that in the control group, and this interference effect lasted for one week. Moreover, the gonad index of the experimental group was significantly lower than that of the control group (P < 0.05) after 10 days of cultivation following […] knockdown. The expression levels of the […] and […] genes, which are related to gonad development, decreased significantly after gene interference.
The results suggest that the […] gene is involved in the growth and development of gonads, particularly in the development of the testis, and is related to the early development of oocytes. This study provides a theoretical basis for the artificial breeding of […].
Topics: Male; Animals; Female; Astacoidea; Phylogeny; Amino Acid Sequence; Polymerase Chain Reaction; Cloning, Molecular
PubMed: 38137012
DOI: 10.3390/genes14122190
-
Molecules (Basel, Switzerland), Sep 2023
The folded structures of proteins can be accurately predicted by deep learning algorithms from their amino-acid sequences. By contrast, in spite of decades of research studies, the prediction of folding pathways and the unfolded and misfolded states of proteins, which are intimately related to diseases, remains challenging. A two-state (folded/unfolded) description of protein folding dynamics hides the complexity of the unfolded and misfolded microstates. Here, we focus on the development of simplified order parameters to decipher the complexity of disordered protein structures. First, we show that any connected, undirected, and simple graph can be associated with a linear chain of atoms in thermal equilibrium. This analogy provides an interpretation of the usual topological descriptors of a graph, namely the Kirchhoff index and Randić resistance, in terms of effective force constants of a linear chain. We derive an exact relation between the Kirchhoff index and the average shortest path length for a linear graph and define the free energies of a graph using an Einstein model. Second, we represent the three-dimensional protein structures by connected, undirected, and simple graphs. As a proof of concept, we compute the topological descriptors and the graph free energies for an all-atom molecular dynamics trajectory of folding/unfolding events of the proteins Trp-cage and HP-36 and for the ensemble of experimental NMR models of Trp-cage. The present work shows that the local, nonlocal, and global force constants and free energies of a graph are promising tools to quantify unfolded/disordered protein states and folding/unfolding dynamics. In particular, they allow the detection of transient misfolded rigid states.
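The abstract does not spell out the relation between the Kirchhoff index and the average shortest path length, so the following is only a sketch of a standard identity that is consistent with it. In any tree, including a linear chain, the resistance distance between two nodes equals their shortest-path distance, so the Kirchhoff index equals the number of node pairs times the average shortest path length, which for a chain of n nodes is n(n² − 1)/6. The function names are illustrative and this is not the authors' code.

```python
import numpy as np


def path_laplacian(n):
    """Graph Laplacian of a linear chain (path graph) with n nodes."""
    adj = np.zeros((n, n))
    idx = np.arange(n - 1)
    adj[idx, idx + 1] = 1.0
    adj[idx + 1, idx] = 1.0
    return np.diag(adj.sum(axis=1)) - adj


def kirchhoff_index(laplacian):
    """Kirchhoff index: n times the sum of inverse nonzero Laplacian eigenvalues."""
    n = laplacian.shape[0]
    eigvals = np.linalg.eigvalsh(laplacian)
    nonzero = eigvals[eigvals > 1e-9]  # drop the single zero mode of a connected graph
    return n * np.sum(1.0 / nonzero)


def mean_shortest_path_linear(n):
    """Average shortest path length over all node pairs of a path graph."""
    total = sum(d * (n - d) for d in range(1, n))  # n - d pairs lie at distance d
    return total / (n * (n - 1) / 2)


n = 10
kf = kirchhoff_index(path_laplacian(n))
n_pairs = n * (n - 1) / 2
# All three values agree for a linear chain.
print(kf, n_pairs * mean_shortest_path_linear(n), n * (n**2 - 1) / 6)
```

For graphs with cycles, such as residue contact graphs of folded proteins, the resistance distance is shorter than the shortest-path distance, so the two quantities no longer coincide.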
Topics: Proteins; Protein Folding; Amino Acid Sequence; Molecular Dynamics Simulation
PubMed: 37764437
DOI: 10.3390/molecules28186659
-
Analytical Chemistry, Sep 2023
Membrane proteins are often challenging targets for native top-down mass spectrometry experimentation. The requisite use of membrane mimetics to solubilize such proteins necessitates the application of supplementary activation methods to liberate protein ions prior to sequencing, which typically limits the sequence coverage achieved. Recently, infrared photoactivation has emerged as an alternative to collisional activation for the liberation of membrane proteins from surfactant micelles. However, much remains unknown regarding the mechanism by which IR activation liberates membrane protein ions from such micelles, the extent to which such methods can improve membrane protein sequence coverage, and the degree to which such approaches can be extended to support native proteomics. Here, we describe experiments designed to evaluate and probe infrared photoactivation for membrane protein sequencing, proteoform identification, and native proteomics applications. Our data reveal that infrared photoactivation can dissociate micelles composed of a variety of detergent classes, without the need for a strong IR chromophore by leveraging the relatively weak association energies of such detergent clusters in the gas phase. Additionally, our data illustrate how IR photoactivation can be extended to include membrane mimetics beyond micelles and liberate proteins from nanodiscs, liposomes, and bicelles. Finally, our data quantify the improvements in membrane protein sequence coverage produced through the use of IR photoactivation, which typically leads to membrane protein sequence coverage values ranging from 40 to 60%.
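The abstract reports sequence coverage without defining it. A common definition in top-down mass spectrometry is the percentage of inter-residue backbone positions explained by at least one matched fragment ion, and the sketch below assumes that definition. The function name and the fragment lists are hypothetical.

```python
def sequence_coverage(seq_len, n_term_fragments, c_term_fragments):
    """Percent of backbone cleavage sites explained by at least one fragment.

    n_term_fragments: residue counts of N-terminal fragments (e.g. b- or c-type ions).
    c_term_fragments: residue counts of C-terminal fragments (e.g. y- or z-type ions).
    Site k (1 <= k <= seq_len - 1) lies between residues k and k + 1.
    """
    if seq_len < 2:
        raise ValueError("need at least two residues")
    sites = set()
    for length in n_term_fragments:
        if 1 <= length <= seq_len - 1:
            sites.add(length)
    for length in c_term_fragments:
        if 1 <= length <= seq_len - 1:
            sites.add(seq_len - length)  # map a C-terminal fragment to its cleavage site
    return 100.0 * len(sites) / (seq_len - 1)


# Hypothetical 11-residue protein with 10 possible cleavage sites.
cov = sequence_coverage(11, n_term_fragments=[2, 3, 5], c_term_fragments=[6, 4])
print(cov)  # 40.0
```

The C-terminal fragment of 6 residues maps to the same site as the N-terminal fragment of 5 residues, so it is counted only once.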
Topics: Detergents; Micelles; Membrane Proteins; Amino Acid Sequence; Mass Spectrometry
PubMed: 37610409
DOI: 10.1021/acs.analchem.3c02788
-
BMC Biology, Sep 2023
BACKGROUND
Intrinsically disordered regions (IDRs) are widely distributed in proteins and related to many important biological functions. Accurately identifying IDRs is of great significance for protein structure and function analysis. Because long disordered regions (LDRs) and short disordered regions (SDRs) have different characteristics, existing predictors fail to achieve good and stable performance on datasets with different ratios of LDRs to SDRs. There are two main reasons. First, existing predictors construct their network structures based on their designers' experience, for example using convolutional neural networks (CNNs) to extract features of neighboring residues and long short-term memory (LSTM) networks to extract long-distance dependencies between residues. These networks cannot capture the hidden length-dependent features between residues. Second, although many deep learning algorithms have been proposed, the complementarity of the existing predictors has not been fully explored or exploited.
RESULTS
In this study, the neural architecture search (NAS) algorithm was employed to automatically construct the network structures so as to capture the hidden features in protein sequences. In order to stably predict both the LDRs and SDRs, the model constructed by NAS was combined with length-dependent models for capturing the unique features of SDRs or LDRs and general models for capturing the common features between LDRs and SDRs. A new predictor called IDP-Fusion was proposed.
CONCLUSIONS
Experimental results showed that IDP-Fusion can achieve more stable performance than the other existing predictors on independent test sets with different ratios between SDRs and LDRs.
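To make the LDR/SDR distinction concrete, the sketch below splits per-residue disorder labels into regions and sorts them by length. The abstract does not state a cutoff; the 30-residue threshold used here is an assumption based on a commonly used convention, and the function names and label string are illustrative only.

```python
from itertools import groupby


def disordered_regions(labels):
    """Return (start, end) pairs, end exclusive, for each run of '1' in a label string."""
    regions = []
    pos = 0
    for value, run in groupby(labels):
        length = len(list(run))
        if value == "1":
            regions.append((pos, pos + length))
        pos += length
    return regions


def split_by_length(regions, long_cutoff=30):
    """Split regions into short (SDR) and long (LDR) lists using an assumed cutoff."""
    short = [r for r in regions if r[1] - r[0] < long_cutoff]
    long_ = [r for r in regions if r[1] - r[0] >= long_cutoff]
    return short, long_


# Hypothetical per-residue labels: '1' = disordered, '0' = ordered.
labels = "0" * 5 + "1" * 8 + "0" * 10 + "1" * 35 + "0" * 4
sdrs, ldrs = split_by_length(disordered_regions(labels))
print(sdrs)  # [(5, 13)]
print(ldrs)  # [(23, 58)]
```

The ratio of LDRs to SDRs in a test set can then be computed directly from these two lists.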
Topics: Algorithms; Amino Acid Sequence; Memory, Long-Term; Protein Domains
PubMed: 37674132
DOI: 10.1186/s12915-023-01672-5
-
Memorias Do Instituto Oswaldo Cruz, 2023
BACKGROUND
In 2022, an outbreak of mpox that started in European countries spread worldwide through human-to-human transmission. Cases have been mostly mild, but severe clinical presentations have been reported. In these cases, tecovirimat has been the drug of choice to treat patients with aggravated disease.
OBJECTIVES
Here we investigated the tecovirimat susceptibility of 18 clinical isolates of monkeypox virus (MPXV) obtained from different regions of Brazil.
METHODS
Different concentrations of tecovirimat were added to cell monolayers infected with each MPXV isolate. After 72 hours, cells were fixed and stained for plaque visualization, counting, and measurement. The ortholog of the F13L gene from each MPXV isolate was amplified by polymerase chain reaction (PCR) and sequenced, and the predicted protein sequences were analyzed.
FINDINGS
The eighteen MPXV isolates generated plaques of different sizes. Although all isolates were highly sensitive to the drug, two showed different response curves and IC50 values. However, the target protein of tecovirimat, F13 (VP37), was 100% conserved in all MPXV isolates and therefore does not explain the difference in sensitivity.
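The abstract does not say how the IC50 values were derived from the response curves. Dose-response data are often fitted with a four-parameter logistic model; the sketch below uses a simpler stand-in, log-linear interpolation between the two doses that bracket a 50% response. The function name, concentrations, units and response values are all hypothetical.

```python
import math


def ic50_by_interpolation(concentrations, percent_of_control):
    """Estimate IC50 by log-linear interpolation between bracketing doses.

    concentrations: increasing drug concentrations (all > 0).
    percent_of_control: response at each dose, e.g. plaque count as % of untreated wells.
    Returns None if the response never crosses 50%.
    """
    points = list(zip(concentrations, percent_of_control))
    for (c_lo, r_lo), (c_hi, r_hi) in zip(points, points[1:]):
        if r_lo >= 50.0 >= r_hi and r_lo != r_hi:
            frac = (r_lo - 50.0) / (r_lo - r_hi)
            log_ic50 = math.log10(c_lo) + frac * (math.log10(c_hi) - math.log10(c_lo))
            return 10 ** log_ic50
    return None


# Hypothetical dose-response data (concentrations in nM).
doses_nm = [1.0, 10.0, 100.0, 1000.0]
response = [90.0, 70.0, 30.0, 5.0]
ic50 = ic50_by_interpolation(doses_nm, response)
print(round(ic50, 2))  # 31.62
```

Interpolation only uses two data points, so it is more sensitive to noise than a full curve fit; it is shown here because it needs no external libraries.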
MAIN CONCLUSIONS
Our results support screening different MPXV isolates for tecovirimat susceptibility as an important tool for making better use of the limited number of tecovirimat doses available in low-income countries to treat patients with mpox.
Topics: Humans; Mpox (monkeypox); Monkeypox virus; Amino Acid Sequence; Benzamides
PubMed: 37436275
DOI: 10.1590/0074-02760230056
-
Journal of Structural Biology, Sep 2023
Coiled coils are a widespread and well understood protein fold. Their short and simple repeats underpin considerable structural and functional diversity. The vast majority of coiled coils consist of 7-residue (heptad) sequence repeats, but in essence most combinations of 3- and 4-residue segments, each starting with a residue of the hydrophobic core, are compatible with coiled-coil structure. The most frequent among these other repeat patterns are 11-residue (hendecad, 3 + 4 + 4) repeats. Hendecads are frequently found in low copy number, interspersed between heptads, but some proteins consist largely or entirely of hendecad repeats. Here we describe the first large-scale survey of these proteins in the proteome of life. For this, we scanned the protein sequence database for sequences with 11-residue periodicity that lacked β-strand prediction. We then clustered these by pairwise similarity to construct a map of potential hendecad coiled-coil families. Here we discuss these according to their structural properties, their potential cellular roles, and the evolutionary mechanisms shaping their diversity. We note in particular the continuous amplification of hendecads, both within existing proteins and de novo from previously non-coding sequence, as a powerful mechanism in the genesis of new coiled-coil forms.
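The abstract does not describe how 11-residue periodicity was detected. As a rough illustration only, the sketch below scores how often the hydrophobic or polar class of a residue matches that of the residue a fixed number of positions later. A sequence built from hendecad repeats should score higher at lag 11 than at lag 7. The hydrophobic alphabet, the function name and the toy repeat unit are assumptions and not part of the survey.

```python
HYDROPHOBIC = set("AILMFVWY")  # assumed hydrophobic alphabet


def lag_match_score(seq, lag):
    """Fraction of positions whose hydrophobic/polar class matches the residue `lag` away."""
    if lag <= 0 or lag >= len(seq):
        raise ValueError("lag must be between 1 and len(seq) - 1")
    classes = [aa in HYDROPHOBIC for aa in seq]
    matches = sum(a == b for a, b in zip(classes, classes[lag:]))
    return matches / (len(seq) - lag)


# Toy hendecad unit (3 + 4 + 4): hydrophobic residues at positions 1, 4 and 8.
unit = "LKKLEEKLKEE"
seq = unit * 6

score_11 = lag_match_score(seq, 11)
score_7 = lag_match_score(seq, 7)
print(score_11, score_7)
```

A real survey would need a statistically grounded periodicity measure and a secondary-structure filter, as the abstract indicates by excluding sequences with β-strand prediction.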
Topics: Proteome; Amino Acid Sequence; Protein Domains; Protein Conformation
PubMed: 37524272
DOI: 10.1016/j.jsb.2023.108007