- Nature Communications, Oct 2023
The short lengths of short-read sequencing reads challenge the analysis of paralogous genomic regions in exome and genome sequencing data. Most genetic variants within these homologous regions therefore remain unidentified in standard analyses. Here, we present a method (Chameleolyser) that accurately identifies single nucleotide variants and small insertions/deletions (SNVs/Indels), copy number variants and ectopic gene conversion events in duplicated genomic regions using whole-exome sequencing data. Application to a cohort of 41,755 exome samples yields 20,432 rare homozygous deletions and 2,529,791 rare SNVs/Indels, of which we show that 338,084 are due to gene conversion events. None of the SNVs/Indels are detectable using regular analysis techniques. Validation by high-fidelity long-read sequencing in 20 samples confirms >88% of called variants. Focusing on variation in known disease genes leads to a direct molecular diagnosis in 25 previously undiagnosed patients. Our method can readily be applied to existing exome data.
Topics: Humans; Exome; Polymorphism, Single Nucleotide; INDEL Mutation; DNA Copy Number Variations; Systems Analysis; High-Throughput Nucleotide Sequencing
PubMed: 37891200
DOI: 10.1038/s41467-023-42531-9
- Bioinformatics (Oxford, England), Jul 2023
SUMMARY
With recent advances in sequencing technologies, it is now possible to obtain near-perfect complete bacterial chromosome assemblies cheaply and efficiently by combining a long-read-first assembly approach with short-read polishing. However, existing methods for assembling bacterial plasmids from long-read-first assemblies often misassemble or even miss bacterial plasmids entirely and accordingly require manual curation. Plassembler was developed to provide a tool that automatically assembles and outputs bacterial plasmids using a hybrid assembly approach. It achieves increased accuracy and computational efficiency compared to the existing gold standard tool Unicycler by removing chromosomal reads from the input read sets using a mapping approach.
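The read-depletion idea described in the summary can be illustrated with a minimal Python sketch. This is not Plassembler's actual implementation: the function name, the input format (a dictionary of read ID to the contig it mapped to), and the choice to keep unmapped reads are all assumptions made for illustration.

```python
def split_reads_by_chromosome(alignments, chromosome_contigs):
    """Separate reads that map to chromosomal contigs from all other reads.

    alignments: dict of read ID -> contig name, or None if the read is unmapped
                (hypothetical input format, e.g. parsed from a mapper's output).
    chromosome_contigs: set of contig names treated as chromosomal.
    """
    chromosomal, retained = [], []
    for read_id, contig in alignments.items():
        if contig in chromosome_contigs:
            chromosomal.append(read_id)
        else:
            # Assumption: reads mapping elsewhere, or not at all, are kept
            # as candidate plasmid reads for a later assembly step.
            retained.append(read_id)
    return chromosomal, retained


# Toy data; the read and contig names are made up.
alignments = {
    "read1": "chromosome",
    "read2": "plasmid_A",
    "read3": None,
    "read4": "chromosome",
}
chromosomal, retained = split_reads_by_chromosome(alignments, {"chromosome"})
print(retained)
```

Removing chromosomal reads before assembly shrinks the input to the plasmid assembly step, which is consistent with the efficiency gain the summary reports.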
AVAILABILITY AND IMPLEMENTATION
Plassembler is implemented in Python and is installable as a bioconda package using 'conda install -c bioconda plassembler'. The source code is available on GitHub at https://github.com/gbouras13/plassembler. The full benchmarking pipeline can be found at https://github.com/gbouras13/plassembler_simulation_benchmarking, while the benchmarking input FASTQ and output files can be found at https://doi.org/10.5281/zenodo.7996690.
Topics: Sequence Analysis, DNA; High-Throughput Nucleotide Sequencing; Software; Plasmids; Benchmarking
PubMed: 37369026
DOI: 10.1093/bioinformatics/btad409
- PLoS Computational Biology, Oct 2023
Long-read RNA sequencing has arisen as a counterpart to short-read sequencing, with the potential to capture full-length isoforms, albeit at the cost of lower depth. Yet this potential is not fully realized due to inherent limitations of current long-read assembly methods and underdeveloped approaches to integrate short-read data. Here, we critically compare the existing methods and develop a new integrative approach to characterize a particularly challenging pool of low-abundance long noncoding RNA (lncRNA) transcripts from short- and long-read sequencing in two distinct cell lines. Our analysis reveals severe limitations in each of the sequencing platforms. For short-read assemblies, coverage declines at transcript termini resulting in ambiguous ends, and uneven low coverage results in segmentation of a single transcript into multiple transcripts. Conversely, long-read sequencing libraries lack depth and strand-of-origin information in cDNA-based methods, culminating in erroneous assembly and quantitation of transcripts. We also discover a cDNA synthesis artifact in long-read datasets that markedly impacts the identity and quantitation of assembled transcripts. Towards remediating these problems, we develop a computational pipeline to "strand" long-read cDNA libraries that rectifies inaccurate mapping and assembly of long-read transcripts. Leveraging the strengths of each platform and our computational stranding, we also present and benchmark a hybrid assembly approach that drastically increases the sensitivity and accuracy of full-length transcript assembly on the correct strand and improves detection of biological features of the transcriptome. When applied to a challenging set of under-annotated and cell-type variable lncRNA, our method resolves the segmentation problem of short-read sequencing and the depth problem of long-read sequencing, resulting in the assembly of coherent transcripts with precise 5' and 3' ends. 
Our workflow can be applied to existing datasets for superior demarcation of transcript ends and refined isoform structure, which can enable better differential gene expression analyses and molecular manipulations of transcripts.
Topics: DNA, Complementary; RNA, Long Noncoding; Sequence Analysis, RNA; Transcriptome; Gene Library; Protein Isoforms; High-Throughput Nucleotide Sequencing
PubMed: 37883581
DOI: 10.1371/journal.pcbi.1011576
- Nature Genetics, Dec 2023
Short-read sequencing is the workhorse of cancer genomics yet is thought to miss many structural variants (SVs), particularly large chromosomal alterations. To characterize missing SVs in short-read whole genomes, we analyzed 'loose ends', local violations of mass balance between adjacent DNA segments. In the landscape of loose ends across 1,330 high-purity cancer whole genomes, most large (>10-kb) clonal SVs were fully resolved by short reads in the 87% of the human genome where copy number could be reliably measured. Some loose ends represent neotelomeres, which we propose as a hallmark of the alternative lengthening of telomeres phenotype. These pan-cancer findings were confirmed by long-molecule profiles of 38 breast cancer and melanoma cases. Our results indicate that aberrant homologous recombination is unlikely to drive the majority of large cancer SVs. Furthermore, analysis of mass balance in short-read whole genome data provides a surprisingly complete picture of cancer chromosomal structure.
Topics: Humans; Female; Genomics; Sequence Analysis, DNA; Genome, Human; Chromosome Aberrations; High-Throughput Nucleotide Sequencing; Breast Neoplasms; Genomic Structural Variation
PubMed: 37945902
DOI: 10.1038/s41588-023-01540-6
- Journal of Clinical Microbiology, Oct 2023
Cytomegalovirus (CMV) is a significant cause of morbidity and mortality among immunocompromised hosts, including transplant recipients. Antiviral prophylaxis or treatment is used to reduce the incidence of CMV disease in this patient population; however, there is concern about increasing antiviral resistance. Detection of antiviral resistance in CMV was traditionally accomplished using Sanger sequencing of UL97 and UL54 genes, in which specific mutations may result in reduced antiviral activity. In this study, a novel next-generation sequencing (NGS) method was developed and validated to detect mutations in UL97/UL54 associated with antiviral resistance. Plasma samples (n = 27) submitted for antiviral resistance testing by Sanger sequencing were also analyzed using the NGS method. When compared to Sanger sequencing, the NGS assay demonstrated 100% (27/27) overall agreement for determining antiviral resistance/susceptibility and 88% (22/25) agreement at the level of resistance-associated mutations. The limit of detection of the NGS method was determined to be 500 IU/mL, and the lower threshold for detecting mutations associated with resistance was established at 15%. The NGS assay represents a novel laboratory tool that assists healthcare providers in treating patients who are infected with CMV harboring resistance-associated mutations and who may benefit from tailored antiviral therapy.
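The 15% reporting threshold and the agreement figures quoted above can be illustrated with a short Python sketch. It is not the assay's actual pipeline: the function name, the read-count inputs, the placeholder mutation labels, and the choice to treat the threshold as inclusive are assumptions.

```python
def call_resistance_mutations(variant_reads, total_reads, min_fraction=0.15):
    """Report mutations whose variant read fraction reaches the threshold.

    variant_reads: dict of mutation label -> reads supporting the mutation.
    total_reads:   dict of mutation label -> total reads covering that position.
    min_fraction:  reporting threshold (0.15 mirrors the 15% in the abstract;
                   an inclusive comparison is an assumption).
    """
    calls = {}
    for mutation, supporting in variant_reads.items():
        depth = total_reads[mutation]
        if depth == 0:
            continue  # no coverage, so no call is possible
        fraction = supporting / depth
        if fraction >= min_fraction:
            calls[mutation] = fraction
    return calls


def percent_agreement(concordant, compared):
    """Percentage of compared results on which two methods agree."""
    return 100.0 * concordant / compared


# Toy read counts; "mutA" and "mutB" are placeholder labels, not real mutations.
calls = call_resistance_mutations(
    {"mutA": 300, "mutB": 50},
    {"mutA": 1000, "mutB": 1000},
)
print(calls)
print(percent_agreement(27, 27), percent_agreement(22, 25))
```

With these toy counts, only the placeholder mutation at 30% of reads is reported, while the one at 5% falls below the threshold.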
Topics: Humans; Cytomegalovirus; Antiviral Agents; Cytomegalovirus Infections; Mutation; High-Throughput Nucleotide Sequencing; Drug Resistance, Viral
PubMed: 37750719
DOI: 10.1128/jcm.00429-23
- Annual Review of Biomedical Data Science, Aug 2023 (Review)
The human microbiome is complex, variable from person to person, essential for health, and related to both the risk for disease and the efficacy of our treatments. There are robust techniques to describe microbiota with high-throughput sequencing, and there are hundreds of thousands of already-sequenced specimens in public archives. The promise remains to use the microbiome both as a prognostic factor and as a target for precision medicine. However, when used as an input in biomedical data science modeling, the microbiome presents unique challenges. Here, we review the most common techniques used to describe microbial communities, explore these unique challenges, and discuss the more successful approaches for biomedical data scientists seeking to use the microbiome as an input in their studies.
Topics: Humans; Microbiota; Precision Medicine; High-Throughput Nucleotide Sequencing
PubMed: 37159872
DOI: 10.1146/annurev-biodatasci-020722-043017
- Cold Spring Harbor Protocols, Oct 2023
Transposon mutagenesis has been the method of choice for genetic screens and selections in bacteria by virtue of the transposon being linked to the disrupted gene, simplifying its identification. Transposon sequencing (Tn-seq) is a high-throughput version of transposon mutant screening, in which massively parallel sequencing is used to simultaneously follow the fitness of all mutants in a complex library. In a single experiment, one can use Tn-seq to interrogate the contribution of all genes of a bacterium to fitness under a condition of interest. Here, we introduce a method to construct a saturating transposon insertion library in Gram-negative bacteria, to capture the transposon junctions, and to identify essential genes and conditional genes using massively parallel sequencing. The accompanying protocol was developed as part of Cold Spring Harbor's Advanced Bacterial Genetics course.
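The logic of reading gene essentiality from a saturating insertion library can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the protocol's analysis: the gene coordinates, insertion positions, and the rule "no insertions means candidate essential" are assumptions, and real analyses account for gene length, library saturation, and statistics.

```python
def insertions_per_gene(insertion_sites, genes):
    """Count transposon insertion sites falling inside each gene.

    insertion_sites: iterable of genomic positions of transposon junctions.
    genes: dict of gene name -> (start, end), inclusive coordinates.
    """
    counts = {name: 0 for name in genes}
    for position in insertion_sites:
        for name, (start, end) in genes.items():
            if start <= position <= end:
                counts[name] += 1
    return counts


def candidate_essential(counts, max_insertions=0):
    """Genes with at most max_insertions hits (simplified, assumed criterion)."""
    return [name for name, n in counts.items() if n <= max_insertions]


# Toy genome with three hypothetical genes and five insertion sites.
genes = {"geneA": (1, 1000), "geneB": (1001, 2000), "geneC": (2001, 3000)}
counts = insertions_per_gene([50, 400, 900, 2100, 2500], genes)
print(counts)
print(candidate_essential(counts))
```

In a saturated library, a gene that tolerates no insertions is a candidate essential gene; comparing counts between conditions would point to conditional genes in the same way.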
Topics: Mutagenesis, Insertional; DNA Transposable Elements; High-Throughput Nucleotide Sequencing; Gene Library
PubMed: 36931734
DOI: 10.1101/pdb.top107867
- Microbial Biotechnology, Jan 2024 (Review)
The human microbiome plays a crucial role in maintaining health, with advances in high-throughput sequencing technology and reduced sequencing costs triggering a surge in microbiome research. Microbiome studies generally incorporate five key phases: design, sampling, sequencing, analysis, and reporting, with sequencing strategy being a crucial step offering numerous options. Present mainstream sequencing strategies include Amplicon sequencing, Metagenomic Next-Generation Sequencing (mNGS), and Targeted Next-Generation Sequencing (tNGS). Two innovative technologies that recently emerged, namely MobiMicrobe high-throughput microbial single-cell genome sequencing technology and 2bRAD-M simplified metagenomic sequencing technology, compensate for the limitations of mainstream technologies, each boasting unique core strengths. This paper reviews the basic principles and processes of these three mainstream and two novel microbiological technologies, aiding readers in understanding the benefits and drawbacks of different technologies, thereby guiding the selection of the most suitable method for their research endeavours.
Topics: Humans; Microbiota; Metagenome; High-Throughput Nucleotide Sequencing; Metagenomics; Technology
PubMed: 37929823
DOI: 10.1111/1751-7915.14364
- Molecular Ecology Resources, Aug 2023
Although plastid genome (plastome) structure is highly conserved across most seed plants, investigations during the past two decades have revealed several disparately related lineages that experienced substantial rearrangements. Most plastomes contain a large inverted repeat and two single-copy regions, and a few dispersed repeats; however, the plastomes of some taxa harbour long repeat sequences (>300 bp). These long repeats make it challenging to assemble complete plastomes using short-read data, leading to misassemblies and consensus sequences with spurious rearrangements. Single-molecule, long-read sequencing has the potential to overcome these challenges, yet there is no consensus on the most effective method for accurately assembling plastomes using long-read data. We generated a pipeline, plastid Genome Assembly Using Long-read data (ptGAUL), to address the problem of plastome assembly using long-read data from Oxford Nanopore Technologies (ONT) or Pacific Biosciences platforms. We demonstrated the efficacy of the ptGAUL pipeline using 16 published long-read data sets. We showed that ptGAUL quickly produces accurate and unbiased assemblies using only ~50× coverage of plastome data. Additionally, we deployed ptGAUL to assemble four new Juncus (Juncaceae) plastomes using ONT long reads. Our results revealed many long repeats and rearrangements in Juncus plastomes compared with basal lineages of Poales. The ptGAUL pipeline is available on GitHub: https://github.com/Bean061/ptgaul.
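To make the ~50× coverage figure concrete, the following Python sketch shows the standard depth arithmetic (total sequenced bases divided by genome length). It is not part of ptGAUL: the function names, the 150,000 bp plastome length, and the 10,000 bp mean read length are illustrative assumptions.

```python
import math


def mean_coverage(read_lengths, genome_length):
    """Average sequencing depth: total sequenced bases / genome length."""
    return sum(read_lengths) / genome_length


def reads_needed(target_coverage, genome_length, mean_read_length):
    """Number of reads of a given mean length needed to reach a target depth."""
    return math.ceil(target_coverage * genome_length / mean_read_length)


# Assumed values for illustration only.
plastome_length = 150_000   # bp, hypothetical plastome size
mean_read_length = 10_000   # bp, hypothetical long-read mean length

n_reads = reads_needed(50, plastome_length, mean_read_length)
print(n_reads)
print(mean_coverage([mean_read_length] * n_reads, plastome_length))
```

Under these assumed sizes, a few hundred long reads already reach the target depth, which helps explain why a pipeline working at this coverage can run quickly.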
Topics: Genome, Plastid; Repetitive Sequences, Nucleic Acid; Gene Rearrangement; Plastids; High-Throughput Nucleotide Sequencing; Sequence Analysis, DNA
PubMed: 36939021
DOI: 10.1111/1755-0998.13787
- Molecular Aspects of Medicine, Apr 2024 (Review)
Massively parallel sequencing technologies have long been used in both basic research and clinical routine. The recent introduction of digital sequencing has made previously challenging applications possible by significantly improving sensitivity and specificity to now allow detection of rare sequence variants, even at single molecule level. Digital sequencing utilizes unique molecular identifiers (UMIs) to minimize sequencing-induced errors and quantification biases. Here, we discuss the principles of UMIs and how they are used in digital sequencing. We outline the properties of different UMI types and the consequences of various UMI approaches in relation to experimental protocols and bioinformatics. Finally, we describe how digital sequencing can be applied in specific research fields, focusing on cancer management where it can be used in screening of asymptomatic individuals, diagnosis, treatment prediction, prognostication, monitoring treatment efficacy and early detection of treatment resistance as well as relapse.
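How UMIs suppress sequencing-induced errors and quantification biases can be shown with a minimal Python sketch of consensus building. It is a simplified illustration, not a method from the review: it assumes reads within a UMI family are already aligned and of equal length, uses a plain per-position majority vote, and ignores errors in the UMI itself; the function name and toy reads are hypothetical.

```python
from collections import Counter, defaultdict


def umi_consensus(reads):
    """Collapse reads sharing a UMI into one consensus sequence per UMI.

    reads: iterable of (umi, sequence) pairs; sequences within one UMI family
           are assumed to be aligned and of equal length.
    """
    families = defaultdict(list)
    for umi, sequence in reads:
        families[umi].append(sequence)

    consensus = {}
    for umi, sequences in families.items():
        # Majority vote at each position across the reads of the family.
        consensus[umi] = "".join(
            Counter(column).most_common(1)[0][0] for column in zip(*sequences)
        )
    return consensus


# Toy data: five reads from two original molecules; one read carries an error.
reads = [
    ("AAGT", "ACGTAC"),
    ("AAGT", "ACGTAC"),
    ("AAGT", "ACTTAC"),  # sequencing error at the third base
    ("CCTA", "ACGAAC"),
    ("CCTA", "ACGAAC"),
]
consensus = umi_consensus(reads)
print(consensus)
print(len(consensus))
```

The majority vote removes the isolated error in the first family, and counting UMI families rather than raw reads gives a molecule count that is not inflated by amplification duplicates.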
Topics: Humans; High-Throughput Nucleotide Sequencing; Computational Biology; Sensitivity and Specificity
PubMed: 38367531
DOI: 10.1016/j.mam.2024.101253