BMC Bioinformatics, Apr 2012
BACKGROUND
A challenging issue in designing computational methods for predicting the exon-intron structure of a gene from a cluster of transcript (EST, mRNA) sequences is guaranteeing accuracy as well as efficiency in time and space when large clusters of more than 20,000 ESTs and genes longer than 1 Mb are processed. Traditionally, the problem has been tackled by combining different tools not specifically designed for this task.
RESULTS
We propose a fast method based on ad hoc procedures for solving the problem. Our method combines two ideas: a novel algorithm of proved small time complexity for computing spliced alignments of a transcript against a genome, and an efficient algorithm that exploits the inherent redundancy of information in a cluster of transcripts to select, among all possible factorizations of EST sequences, those that allow the inference of splice site junctions largely confirmed by the input data. The EST alignment procedure is based on the construction of maximal embeddings, which are sequences obtained from paths of a graph structure, called the embedding graph, whose vertices are the maximal pairings of a genomic sequence T and an EST P. The procedure runs in time linear in the length of P and T and in the size of the output. The method was implemented in the PIntron package. PIntron requires as input a genomic sequence or region and a set of EST and/or mRNA sequences. Besides the prediction of the full-length transcript isoforms potentially expressed by the gene, the PIntron package includes a module for the CDS annotation of the predicted transcripts.
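To make the notion of a pairing between an EST P and a genomic sequence T more concrete, here is a minimal sketch. It assumes that a maximal pairing is a triple (p, t, l) such that P[p:p+l] equals T[t:t+l] and the match cannot be extended to the left or to the right; the abstract does not give the formal definition, so this reading is an assumption. The function name `maximal_pairings` and the `min_len` parameter are hypothetical. The sketch uses a naive quadratic scan and is not the linear-time procedure described above.

```python
def maximal_pairings(P, T, min_len=1):
    """Naively list (p, t, length) triples where P[p:p+length] == T[t:t+length]
    and the match cannot be extended on either side.

    Assumed definition for illustration only; runs in O(|P| * |T| * length).
    """
    pairings = []
    for p in range(len(P)):
        for t in range(len(T)):
            if P[p] != T[t]:
                continue
            # Skip starts that could be extended to the left: they are not maximal.
            if p > 0 and t > 0 and P[p - 1] == T[t - 1]:
                continue
            # Extend to the right as far as the two sequences agree.
            length = 0
            while (p + length < len(P) and t + length < len(T)
                   and P[p + length] == T[t + length]):
                length += 1
            if length >= min_len:
                pairings.append((p, t, length))
    return pairings


# Toy sequences, not real data.
result = maximal_pairings("ACGT", "ACGTTACG", min_len=3)
```

In an embedding graph these triples would play the role of vertices; how edges and paths are built from them is not specified in the abstract and is left out here.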
CONCLUSIONS
PIntron, the software tool implementing our methodology, is available at http://www.algolab.eu/PIntron under GNU AGPL. PIntron has been shown to outperform state-of-the-art methods, and to quickly process some critical genes. At the same time, PIntron exhibits high accuracy (sensitivity and specificity) when benchmarked with ENCODE annotations.
Topics: Algorithms; Alternative Splicing; Animals; Exons; Expressed Sequence Tags; Genomics; Humans; Introns; Sequence Alignment; Software
PubMed: 22537006
DOI: 10.1186/1471-2105-13-S5-S2

Molecular and Cellular Biology, Nov 1985
Comparative Study
DNA fragments located 10 kilobases apart in the genome and containing, respectively, the first myosin light chain 1 (MLC1f) and the first myosin light chain 3 (MLC3f) specific exon of the rat myosin light chain 1 and 3 gene, together with several hundred base pairs of upstream flanking sequences, have been shown in runoff in vitro transcription assays to direct initiation of transcription at the cap sites of MLC1f and MLC3f mRNAs used in vivo. These results establish the presence of two separate, functional promoters within that gene. A comparison of the nucleotide sequence of the rat MLC1f/3f gene with the corresponding sequences from mouse and chicken shows that: the MLC1f promoter regions have been highly conserved up to position -150 from the cap site while the MLC3f promoter regions display a very poor degree of homology and even the absence or poor conservation of typical eucaryotic promoter elements such as TATA and CAT boxes; the exon/intron structure of this gene has been completely conserved in the three species; and corresponding exons, except for the regions encoding most of the 5' and 3' untranslated sequences, show greater than 75% homology while corresponding introns are similar in size but considerably divergent in sequence. The above findings indicate that the overall structure of the MLC1f/3f genes has been maintained between avian and mammalian species and that these genes contain two functional and widely spaced promoters. The fact that the structures of the alkali light chain gene from Drosophila melanogaster and of other related genes of the troponin C supergene family resemble a MLC3f gene without an upstream promoter and first exon strongly suggests that the present-day MLC1f/3f genes of higher vertebrates arose from a primordial alkali light chain gene through the addition of a far-upstream MLC1f-specific promoter and first exon. 
The two promoters have evolved at different rates, with the MLC1f promoter being more conserved than the MLC3f promoter. This discrepant evolutionary rate might reflect different mechanisms of promoter activation for the transcription of MLC1f and MLC3f RNA.
Topics: Amino Acid Sequence; Animals; Base Sequence; Chickens; DNA Restriction Enzymes; Drosophila melanogaster; Endonucleases; Genes; Genes, Regulator; Mice; Myosin Subfragments; Myosins; Peptide Fragments; Promoter Regions, Genetic; Rats; Single-Strand Specific DNA and RNA Endonucleases; Species Specificity; Templates, Genetic; Transcription, Genetic
PubMed: 3018505
DOI: 10.1128/mcb.5.11.3168-3182.1985

The Biochemical Journal, Jul 1986
Comparative Study
The nucleotide sequences of two segments of DNA (2250 and 2921 base-pairs) containing the functionally related fumarase (fumC) and aspartase (aspA) genes of Escherichia coli K12 were determined. The fumC structural gene comprises 1398 base-pairs (466 codons, excluding the initiation codon), and it encodes a polypeptide of Mr 50353 that resembles the fumarases of Bacillus subtilis 168 (citG-gene product), rat liver and pig heart. The fumC gene starts 140 base-pairs downstream of the structurally-unrelated fumA gene, but there is no evidence that both genes form part of the same operon. The aspA structural gene comprises 1431 base-pairs (477 codons excluding the initiation codon), and it encodes a polypeptide of Mr 52190, similar to that predicted from maxicell studies and for the enzyme from E. coli W. Remarkable homologies were found between the primary structures of the fumarase (fumC and citG) and aspartase (aspA) genes and their products, suggesting close structural and evolutionary relationships.
Topics: Amino Acid Sequence; Amino Acids; Ammonia-Lyases; Aspartate Ammonia-Lyase; Base Sequence; Cloning, Molecular; Codon; DNA, Bacterial; Escherichia coli; Fumarate Hydratase; Genes
PubMed: 3541901
DOI: 10.1042/bj2370547

BMC Biotechnology, Apr 2009
BACKGROUND
To improve efficiency in high throughput protein structure determination, we have developed a database software package, Gene Composer, which facilitates the information-rich design of protein constructs and their codon engineered synthetic gene sequences. With its modular workflow design and numerous graphical user interfaces, Gene Composer enables researchers to perform all common bio-informatics steps used in modern structure guided protein engineering and synthetic gene engineering.
RESULTS
An interactive Alignment Viewer allows the researcher to simultaneously visualize sequence conservation in the context of known protein secondary structure, ligand contacts, water contacts, crystal contacts, B-factors, solvent accessible area, residue property type and several other useful property views. The Construct Design Module enables the facile design of novel protein constructs with altered N- and C-termini, internal insertions or deletions, point mutations, and desired affinity tags. The modifications can be combined and permuted into multiple protein constructs, and then virtually cloned in silico into defined expression vectors. The Gene Design Module uses a protein-to-gene algorithm that automates the back-translation of a protein amino acid sequence into a codon engineered nucleic acid gene sequence according to a selected codon usage table with minimal codon usage threshold, defined G:C% content, and desired sequence features achieved through synonymous codon selection that is optimized for the intended expression system. The gene-to-oligo algorithm of the Gene Design Module plans out all of the required overlapping oligonucleotides and mutagenic primers needed to synthesize the desired gene constructs by PCR, and for physically cloning them into selected vectors by the most popular subcloning strategies.
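The back-translation step described above can be illustrated with a minimal sketch. It assumes a codon usage table given as relative frequencies per amino acid and simply picks the most frequent codon that meets a minimum usage threshold. The table values, the name `back_translate`, and the `min_usage` parameter are made up for illustration; this is not Gene Composer's algorithm, which also accounts for G:C content and other sequence features.

```python
# Hypothetical codon usage table: amino acid -> {codon: relative frequency}.
# The numbers are illustrative placeholders, not measured values.
CODON_USAGE = {
    "M": {"ATG": 1.00},
    "K": {"AAA": 0.74, "AAG": 0.26},
    "L": {"CTG": 0.47, "TTA": 0.14, "CTA": 0.04},
    "*": {"TAA": 0.61, "TGA": 0.30, "TAG": 0.09},
}


def back_translate(protein, usage=CODON_USAGE, min_usage=0.10):
    """Back-translate a protein into DNA, choosing for each residue the most
    frequent codon whose usage is at least min_usage."""
    codons = []
    for aa in protein:
        # Discard rare codons that fall below the usage threshold.
        candidates = {c: f for c, f in usage[aa].items() if f >= min_usage}
        if not candidates:
            raise ValueError(f"no codon for {aa!r} meets the usage threshold")
        codons.append(max(candidates, key=candidates.get))
    return "".join(codons)


gene = back_translate("MKL*")
```

Always taking the top codon is the simplest possible policy; a design tool would typically choose among the remaining synonymous codons to steer G:C content or avoid unwanted sequence features.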
CONCLUSION
We present a complete description of Gene Composer functionality, and an efficient PCR-based synthetic gene assembly procedure with mis-match specific endonuclease error correction in combination with PIPE cloning. In a sister manuscript we present data on how Gene Composer designed genes and protein constructs can result in improved protein production for structural studies.
Topics: Algorithms; Cloning, Molecular; Codon; Computational Biology; Databases, Genetic; Genes, Synthetic; Protein Engineering; Sequence Alignment; Software; User-Computer Interface
PubMed: 19383142
DOI: 10.1186/1472-6750-9-36

BMC Genomics, Jan 2009
BACKGROUND
The origin and importance of exon-intron architecture remain one of the unsolved mysteries of gene evolution. Several studies have investigated the variations of intron length, GC content, ordinal position in a gene and divergence. However, few studies have examined the structural variation of exons and introns together.
RESULTS
We investigated the length, GC content, ordinal position and divergence of both exons and introns in 13 eukaryotic genomes, representing plants and animals. Our analyses revealed that three basic patterns of exon-intron variation were present in nearly all analyzed genomes (P < 0.001 in most cases): an ordinal reduction of length and divergence in both exons and introns, a co-variation between an exon and its flanking introns in their length, GC content and divergence, and a decrease of average exon (or intron) length, GC content and divergence as the total number of exons in a gene increased. In addition, we observed that the shorter introns had either low or high GC content, and the GC content of long introns was intermediate.
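The per-feature measurements mentioned above (length, GC content, ordinal position) can be sketched as follows. The function names `gc_content` and `summarize_by_ordinal` are hypothetical, the input is a toy list of sequences ordered by position in the gene, and ambiguous bases are simply ignored; the abstract does not describe the pipeline the authors actually used.

```python
def gc_content(seq):
    """Fraction of G or C among the unambiguous A/C/G/T bases of seq."""
    seq = seq.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        return 0.0  # no unambiguous bases to measure
    return (seq.count("G") + seq.count("C")) / counted


def summarize_by_ordinal(features):
    """Map each 1-based ordinal position to (length, GC content) for the
    exon or intron found at that position."""
    return {i: (len(s), gc_content(s)) for i, s in enumerate(features, start=1)}


exons = ["ATGGCGCC", "GATTACA", "GGGCCC"]  # toy sequences, not real data
summary = summarize_by_ordinal(exons)
```

Collecting such tuples across all genes of a genome would give the raw table from which trends by ordinal position or by exon count could be examined.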
CONCLUSION
Although the factors contributing to these patterns have not been identified, our results provide three important clues: common factor(s) exist and may shape both exons and introns; the ordinal reduction patterns may reflect a time-orderly evolution; and the larger first and last exons may be splicing-required. These clues provide a framework for elucidating mechanisms involved in the organization of eukaryotic genomes and particularly in building exon-intron structures.
Topics: Animals; Base Composition; Evolution, Molecular; Exons; Genetic Variation; Genome; Humans; Introns; Plants; Sequence Alignment; Sequence Analysis, DNA; Species Specificity
PubMed: 19166620
DOI: 10.1186/1471-2164-10-47

Nucleic Acids Research, Mar 2021
Precise identification of correct exon-intron boundaries is a prerequisite for analyzing the location and structure of genes. The existing framework for genomic signals delineating exons and introns in a genomic segment seems insufficient, predominantly due to poor sequence consensus as well as limitations of training on available experimental data sets. We present here a novel concept for characterizing exon-intron boundaries in genomic segments on the basis of structural and energetic properties. We analyzed boundary junctions on both sides of all the exons (328,368) of protein-coding genes from the human genome (GENCODE database) using 28 structural and three energy parameters. Study of sequence conservation at these sites shows very poor consensus. It is observed that DNA adopts a unique structural and energy state at the boundary junctions. Also, signals are somewhat different for housekeeping and tissue-specific genes. Clustering of the 31 parameters into four derived vectors gives some additional insights into the physical mechanisms involved in this biological process. Sites of structural and energy signals correlate well with the positions playing important roles in pre-mRNA splicing.
Topics: Exons; Genes, Essential; Genome, Human; Genomics; Humans; Introns; RNA Splice Sites
PubMed: 33621338
DOI: 10.1093/nar/gkab098

Nucleic Acids Research, Dec 1981
Review
The tryptophan (trp) operon of Escherichia coli has become the basic reference structure for studies on tryptophan metabolism. Within the past five years the application of recombinant DNA and sequencing methodologies has permitted the characterization of the structural and functional elements in this gene cluster at the molecular level. In this summary report we present the complete nucleotide sequence for the five structural genes of the trp operon of E. coli together with the internal and flanking regions of regulatory information.
Topics: Base Sequence; Biological Evolution; Codon; DNA, Bacterial; Escherichia coli; Genes; Genes, Bacterial; Operon; Peptides; Tryptophan
PubMed: 7038627
DOI: 10.1093/nar/9.24.6647

Nucleic Acids Research, Jun 2021
Many non-coding RNAs with known functions are structurally conserved: their intramolecular secondary and tertiary interactions are maintained across evolutionary time. Consequently, the presence of conserved structure in multiple sequence alignments can be used to identify candidate functional non-coding RNAs. Here, we present a bioinformatics method that couples iterative homology search with covariation analysis to assess whether a genomic region has evidence of conserved RNA structure. We used this method to examine all unannotated regions of five well-studied fungal genomes (Saccharomyces cerevisiae, Candida albicans, Neurospora crassa, Aspergillus fumigatus, and Schizosaccharomyces pombe). We identified 17 novel structurally conserved non-coding RNA candidates, which include four H/ACA box small nucleolar RNAs, four intergenic RNAs and nine RNA structures located within the introns and untranslated regions (UTRs) of mRNAs. For the two structures in the 3' UTRs of the metabolic genes GLY1 and MET13, we performed experiments that provide evidence against them being eukaryotic riboswitches.
Topics: 3' Untranslated Regions; Computational Biology; Genome, Fungal; Introns; Lysine-tRNA Ligase; Markov Chains; Nucleic Acid Conformation; RNA, Fungal; RNA, Small Nucleolar; RNA, Untranslated; Ribosomal Proteins; Riboswitch; Sequence Alignment; Thioredoxins
PubMed: 34086938
DOI: 10.1093/nar/gkab355

BMC Genomics, Jan 2014
BACKGROUND
Polyploid species contribute to Oryza diversity. However, the mechanisms underlying gene and genome evolution in Oryza polyploids remain largely unknown. The allotetraploid Oryza minuta, which is estimated to have formed less than one million years ago, along with its putative diploid progenitors (O. punctata and O. officinalis), are quite suitable for the study of polyploid genome evolution using a comparative genomics approach.
RESULTS
Here, we performed a comparative study of a large genomic region surrounding the Shattering4 locus in O. minuta, as well as in O. punctata and O. officinalis. Duplicated genomes in O. minuta have maintained the diploid genome organization, except for several structural variations mediated by transposon movement. Tandem duplicated gene clusters are prevalent in the Sh4 region, and segmental duplication followed by random deletion is illustrated to explain the gene gain-and-loss process. Both copies of most duplicated genes still persist in O. minuta. Molecular evolution analysis suggested that these duplicated genes are equally evolved and mostly manipulated by purifying selection. However, cDNA-SSCP analysis revealed that the expression patterns were dramatically altered between duplicated genes: nine of 29 duplicated genes exhibited expression divergence in O. minuta. We further detected one gene silencing event that was attributed to gene structural variation, but most gene silencing could not be related to sequence changes. We identified one case in which DNA methylation differences within promoter regions that were associated with the insertion of one hAT element were probably responsible for gene silencing, suggesting a potential epigenetic gene silencing pathway triggered by TE movement.
CONCLUSIONS
Our study revealed both genetic and epigenetic mechanisms involved in duplicated gene silencing in the allotetraploid O. minuta.
Topics: DNA Methylation; Epigenesis, Genetic; Evolution, Molecular; Gene Silencing; Genes, Duplicate; Genes, Plant; Genetic Loci; Genomics; Homologous Recombination; Multigene Family; Oryza; Plant Proteins; Promoter Regions, Genetic; Tetraploidy
PubMed: 24393121
DOI: 10.1186/1471-2164-15-11

BMC Biotechnology, Apr 2009
BACKGROUND
With the goal of improving yield and success rates of heterologous protein production for structural studies we have developed the database and algorithm software package Gene Composer. This freely available electronic tool facilitates the information-rich design of protein constructs and their engineered synthetic gene sequences, as detailed in the accompanying manuscript.
RESULTS
In this report, we compare heterologous protein expression levels from native sequences with those from codon engineered synthetic gene constructs designed by Gene Composer. A test set of proteins including a human kinase (P38alpha), a viral polymerase (HCV NS5B), and a bacterial structural protein (FtsZ) was expressed in both E. coli and a cell-free wheat germ translation system. We also compare the protein expression levels in E. coli for a set of 11 different proteins with greatly varied G:C content and codon bias.
CONCLUSION
The results consistently demonstrate that protein yields from codon engineered Gene Composer designs are as good as or better than those achieved from the synonymous native genes. Moreover, structure guided N- and C-terminal deletion constructs designed with the aid of Gene Composer can lead to greater success in gene to structure work as exemplified by the X-ray crystallographic structure determination of FtsZ from Bacillus subtilis. These results validate the Gene Composer algorithms, and suggest that using a combination of synthetic gene and protein construct engineering tools can improve the economics of gene to structure research.
Topics: Algorithms; Base Composition; Cell-Free System; Codon; Escherichia coli; Gene Expression; Genes, Synthetic; Humans; Protein Engineering; Protein Structure, Tertiary; Sequence Alignment; Software; User-Computer Interface
PubMed: 19383143
DOI: 10.1186/1472-6750-9-37