Bioinformatics (Oxford, England), May 2023
SUMMARY
Increases in cohort size in long-read sequencing projects necessitate more efficient software for quality assessment and processing of sequencing data from Oxford Nanopore Technologies and Pacific Biosciences. Here, we describe novel tools for summarizing experiments, filtering datasets, and visualizing phased alignment results, as well as updates to the NanoPack software suite.
AVAILABILITY AND IMPLEMENTATION
The cramino, chopper, kyber, and phasius tools are written in Rust and available as executable binaries without requiring installation or managing dependencies. Binaries built on musl are available for broad compatibility. NanoPlot and NanoComp are written in Python3. Links to the separate tools and their documentation can be found at https://github.com/wdecoster/nanopack. All tools are compatible with Linux, Mac OS, and the MS Windows Subsystem for Linux and are released under the MIT license. The repositories include test data, and the tools are continuously tested using GitHub Actions and can be installed with the conda dependency manager.
Topics: Humans; Software; Sequence Analysis, DNA; High-Throughput Nucleotide Sequencing; Nanopores; Documentation
PubMed: 37171891
DOI: 10.1093/bioinformatics/btad311 -
Journal of Molecular Evolution, Jun 2023
Review
Random DNA barcodes are a versatile tool for tracking cell lineages, with applications ranging from development to cancer to evolution. Here, we review and critically evaluate barcode designs as well as methods of barcode sequencing and initial processing of barcode data. We first demonstrate how various barcode design decisions affect data quality and propose a new design that balances all considerations that we are currently aware of. We then discuss various options for the preparation of barcode sequencing libraries, including inline indices and Unique Molecular Identifiers (UMIs). Finally, we test the performance of several established and new bioinformatic pipelines for the extraction of barcodes from raw sequencing reads and for error correction. We find that both alignment and regular expression-based approaches work well for barcode extraction, and that error-correction pipelines designed specifically for barcode data are superior to generic ones. Overall, this review will help researchers to approach their barcoding experiments in a deliberate and systematic way.
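As a toy illustration of the regular-expression-based extraction strategy the review finds effective, the sketch below pulls a random barcode out of a read by matching its fixed flanking sequences. The adapter sequences, function name, and 20-nt barcode length are placeholder assumptions, not taken from the review.

```python
import re

# Hypothetical library layout: a 20-nt random barcode flanked by fixed
# adapter sequences. These adapters are invented for illustration only.
UPSTREAM = "GACTGCA"
DOWNSTREAM = "TGCAGTC"
BARCODE_RE = re.compile(UPSTREAM + r"([ACGT]{20})" + DOWNSTREAM)

def extract_barcode(read):
    """Return the barcode if both flanking adapters match, else None."""
    m = BARCODE_RE.search(read)
    return m.group(1) if m else None

read = "TTT" + UPSTREAM + "A" * 20 + DOWNSTREAM + "CCC"
print(extract_barcode(read))  # prints the 20-nt barcode
```

In practice a pipeline would also tolerate mismatches in the adapters (e.g. via fuzzy matching or alignment), which is where the alignment-based approaches discussed in the review come in.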
Topics: DNA Barcoding, Taxonomic; DNA; Sequence Analysis, DNA; Computational Biology; High-Throughput Nucleotide Sequencing
PubMed: 36651964
DOI: 10.1007/s00239-022-10083-z -
Medical Principles and Practice :..., 2024
Review
The success in determining the whole genome sequence of a bacterial pathogen was first achieved in 1995 by determining the complete nucleotide sequence of Haemophilus influenzae Rd using the chain-termination method established by Sanger et al. in 1977 and automated by Hood et al. in 1987. However, this technology was laborious, costly, and time-consuming. Since 2004, high-throughput next-generation sequencing technologies have been developed, which are highly efficient, require less time, and are cost-effective for whole genome sequencing (WGS) of all organisms, including bacterial pathogens. In recent years, the data obtained using WGS technologies coupled with bioinformatics analyses of the sequenced genomes have been projected to revolutionize clinical bacteriology. WGS technologies have been used in the identification of bacterial species, strains, and genotypes from cultured organisms and directly from clinical specimens. WGS has also helped in determining resistance to antibiotics by the detection of antimicrobial resistance genes and point mutations. Furthermore, WGS data have helped in the epidemiological tracking and surveillance of pathogenic bacteria in healthcare settings as well as in communities. This review focuses on the applications of WGS in clinical bacteriology.
Topics: Humans; Whole Genome Sequencing; Genome, Bacterial; Drug Resistance, Bacterial; High-Throughput Nucleotide Sequencing
PubMed: 38402870
DOI: 10.1159/000538002 -
Journal of Computational Biology : a..., Dec 2023
Review
Processing large data sets has become an essential part of computational genomics. Greatly increased availability of sequence data from multiple sources has fueled breakthroughs in genomics and related fields but has led to computational challenges in processing large sequencing experiments. The minimizer sketch is a popular method for sequence sketching that underlies core steps in computational genomics such as read mapping, sequence assembly, k-mer counting, and more. In most applications, minimizer sketches are constructed using one of a few classical approaches. More recently, efforts have been put into building minimizer sketches with desirable properties compared with the classical constructions. In this survey, we review the history of the minimizer sketch, the theories developed around the concept, and the plethora of applications taking advantage of such sketches. We aim to provide readers with a comprehensive picture of the research landscape involving minimizer sketches, in anticipation of better fusion of theory and application in the future.
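For readers unfamiliar with the concept, a minimal sketch of the classical (w, k)-minimizer scheme follows: slide a window of w consecutive k-mers along the sequence and keep the lexicographically smallest k-mer in each window. This illustrates the idea only; it is not code from any tool discussed in the survey, and real implementations use hash orderings rather than lexicographic order.

```python
def minimizers(seq, k, w):
    """Return the set of (position, k-mer) minimizers of seq."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        # tie-break on the leftmost occurrence of the smallest k-mer
        best = min(range(w), key=lambda j: (window[j], j))
        picked.add((start + best, window[best]))
    return picked

print(sorted(minimizers("ACGTACGTGG", k=3, w=3)))
# → [(0, 'ACG'), (1, 'CGT'), (4, 'ACG'), (5, 'CGT')]
```

The key property visible here is that adjacent windows often share a minimizer, so the sketch is much smaller than the full k-mer set while still guaranteeing at least one selected k-mer per window.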
Topics: Algorithms; Sequence Analysis, DNA; Genomics; High-Throughput Nucleotide Sequencing; Software
PubMed: 37646787
DOI: 10.1089/cmb.2023.0094 -
Bioinformatics (Oxford, England), May 2022
MOTIVATION
Regulatory elements (REs), such as enhancers and promoters, are regulatory sequences that function in a heterogeneous regulatory network to control gene expression by recruiting transcription regulators, and they carry genetic variants in a context-specific way. Annotating these REs relies on costly and labor-intensive next-generation sequencing and RNA-guided editing technologies in many cellular contexts.
RESULTS
We propose a systematic Gene Ontology Annotation method for Regulatory Elements (RE-GOA) by leveraging the powerful word embedding techniques of natural language processing. We first assemble a heterogeneous network by integrating context-specific regulations, protein-protein interactions, and gene ontology (GO) terms. Then we perform network embedding and associate regulatory elements with GO terms by assessing their similarity in a low-dimensional vector space. With three applications, we show that RE-GOA outperforms existing methods in annotating transcription factor (TF) binding sites from ChIP-seq data, in functional enrichment analysis of differentially accessible peaks from ATAC-seq data, and in revealing genetic correlation among phenotypes from their GWAS summary statistics data.
AVAILABILITY AND IMPLEMENTATION
The source code and the systematic RE annotation for human and mouse are available at https://github.com/AMSSwanglab/RE-GOA.
SUPPLEMENTARY INFORMATION
Supplementary data are available at Bioinformatics online.
Topics: Animals; Chromatin Immunoprecipitation Sequencing; High-Throughput Nucleotide Sequencing; Mice; Molecular Sequence Annotation; Promoter Regions, Genetic; Regulatory Sequences, Nucleic Acid
PubMed: 35561169
DOI: 10.1093/bioinformatics/btac185 -
Current Protocols in Human Genetics, Sep 2020
Review
Profiling genetic variants, including single nucleotide variants, small insertions and deletions, copy number variations, and structural variations (SVs), from both healthy individuals and individuals with disease is a key component of genetic and biomedical research. SVs are large-scale changes in the genome and involve breakage and rejoining of DNA fragments. They may affect thousands to millions of nucleotides and can lead to loss, gain, and reshuffling of genes and regulatory elements. SVs are known to impact gene expression and potentially result in altered phenotypes and diseases. Therefore, identifying SVs from human genomes is particularly important. In this review, I describe advantages and disadvantages of the available high-throughput assays for the discovery of SVs, which are the most challenging genetic alterations to detect. A practical guide is offered to suggest the most suitable strategies for discovering different types of SVs, including common germline, rare, somatic, and complex variants. I also discuss factors to be considered, such as cost and performance, for different strategies when designing experiments. Last, I present several approaches to identify potential SV artifacts caused by samples, experimental procedures, and computational analysis. © 2020 Wiley Periodicals LLC.
Topics: DNA Copy Number Variations; Genome, Human; Genomics; High-Throughput Nucleotide Sequencing; Humans; Mutation; Sequence Analysis, DNA
PubMed: 32813322
DOI: 10.1002/cphg.103 -
BMC Bioinformatics, May 2022
BACKGROUND
De novo genome assembly typically produces a set of contigs instead of the complete genome. Thus, additional data such as genetic linkage maps, optical maps, or Hi-C data are needed to resolve the complete structure of the genome. Most of the previous work uses the additional data to order and orient contigs.
RESULTS
Here we introduce a framework to guide genome assembly with additional data. Our approach is based on clustering the reads, such that each read in each cluster originates from nearby positions in the genome according to the additional data. These sets are then assembled independently and the resulting contigs are further assembled in a hierarchical manner. We implemented our approach for genetic linkage maps in a tool called HGGA.
CONCLUSIONS
Our experiments on simulated and real Pacific Biosciences long reads and genetic linkage maps show that HGGA produces a more contiguous assembly with fewer contigs and a 1.2 to 9.8 times higher NGA50 or N50 than a plain assembly of the reads, and a 1.03 to 6.5 times higher NGA50 or N50 than a previous approach integrating genetic linkage maps with contig assembly. Furthermore, the correctness of the assembly remains similar or improves compared to an assembly using only the read data.
Topics: Genome; High-Throughput Nucleotide Sequencing; Sequence Analysis, DNA
PubMed: 35525918
DOI: 10.1186/s12859-022-04701-2 -
Genomics, May 2022
Review
Phasing, and in particular polyploid phasing, has been a challenging problem, held back by the limited read length of high-throughput short-read sequencing methods, which cannot span the distance between heterozygous sites, and by the high labor cost of alternative methods such as the physical separation of chromosomes. Recently developed single-molecule long-read sequencing methods provide much longer reads, which overcome this previous limitation. Here we review alignment-based methods of polyploid phasing that rely on four main strategies: population inference methods, which leverage the genetic information of several individuals to phase a sample; objective function minimization methods, which minimize a function such as the Minimum Error Correction (MEC); graph partitioning methods, which represent the read data as a graph and split it into k haplotype subgraphs; and cluster building methods, which iteratively grow clusters of similar reads into a final set of clusters representing the haplotypes. We discuss the advantages and limitations of these methods and the metrics used to assess their performance, proposing that accuracy and contiguity are the most meaningful metrics. Finally, we propose that the field of alignment-based polyploid phasing would greatly benefit from a well-designed benchmarking dataset with appropriate evaluation metrics. We consider that significant improvements can still be achieved toward more accurate and contiguous polyploid phasing results that reflect the complexity of polyploid genome architectures.
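To make the MEC objective mentioned above concrete, the toy sketch below scores a diploid bipartition of reads over heterozygous sites: it counts the minimum number of allele flips needed so that each group of reads is consistent with a single haplotype. The read matrix, partition, and function name are invented for illustration; real MEC solvers search over partitions rather than scoring a fixed one.

```python
def mec(reads, partition):
    """MEC score of a diploid read bipartition.

    reads: list of allele vectors over heterozygous sites
           (0/1, or None where the read does not cover the site).
    partition: assignment of each read to haplotype group 0 or 1.
    """
    score = 0
    for side in (0, 1):
        group = [r for r, p in zip(reads, partition) if p == side]
        if not group:
            continue
        for site in range(len(group[0])):
            alleles = [r[site] for r in group if r[site] is not None]
            # cheapest correction at this site: flip the minority allele
            if alleles:
                score += min(alleles.count(0), alleles.count(1))
    return score

reads = [
    [0, 0, 1, None],
    [0, 0, None, 1],
    [1, 1, 0, 0],
    [1, None, 0, 0],
    [0, 1, 1, 1],   # carries one sequencing error at site 1
]
print(mec(reads, [0, 0, 1, 1, 0]))  # → 1
```

Minimizing this count over all possible partitions is NP-hard in general, which is why the heuristic strategies surveyed in the review exist; the polyploid case generalizes the bipartition to k groups.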
Topics: Humans; Genome, Human; Sequence Analysis, DNA; Haplotypes; Algorithms; Polyploidy; High-Throughput Nucleotide Sequencing
PubMed: 35483655
DOI: 10.1016/j.ygeno.2022.110369 -
International Journal of Environmental..., Apr 2022
Advances in Next Generation Sequencing technologies allow us to inspect and unlock the genome to a level of detail that was unimaginable only a few decades ago. Omics-based studies are casting a light on the patterns and determinants of disease conditions in populations, as well as on the influence of microbial communities on human health, just to name a few. Through increasing volumes of sequencing information, for example, it is possible to compare genomic features and analyze the modulation of the transcriptome under different environmental stimuli. Although protocols for NGS preparation are intended to leave little to no space for contamination of any kind, a noticeable fraction of sequencing reads still may not uniquely represent what was intended to be sequenced in the first place. While the usual purpose of a sequencing sample is to assess the presence of features of interest by mapping the obtained reads to a reference genome, it is sometimes useful to determine the fraction of reads that do not map, or that map discordantly, and store this information in a new file for subsequent analyses. Here we propose a new mapper, which we called Squid, that among other accessory functionalities finds and returns sequencing reads that match or do not match a reference sequence database in any orientation. We encourage the use of Squid prior to any quantification pipeline to assess, for instance, the presence of contaminants, especially in RNA-Seq experiments.
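The kind of post-mapping triage described above, separating reads that did or did not map, can be sketched over SAM records using flag bit 0x4 (read unmapped). This is a generic illustration of the idea, not Squid's actual implementation; the function name and SAM lines are fabricated for the example.

```python
def split_by_mapping(sam_lines):
    """Split SAM records into mapped and unmapped reads via FLAG bit 0x4."""
    mapped, unmapped = [], []
    for line in sam_lines:
        if line.startswith("@"):          # skip header records
            continue
        flag = int(line.split("\t")[1])   # FLAG is the second SAM column
        (unmapped if flag & 0x4 else mapped).append(line)
    return mapped, unmapped

sam = [
    "@SQ\tSN:chr1\tLN:248956422",
    "read1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tFFFF",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tTTTT\tFFFF",
]
m, u = split_by_mapping(sam)
print(len(m), len(u))  # → 1 1
```

Writing the unmapped fraction to its own file, as the abstract suggests, then allows downstream screening of those reads against contaminant databases.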
Topics: Animals; Decapodiformes; Genomics; High-Throughput Nucleotide Sequencing; Humans; RNA-Seq; Sequence Analysis, RNA; Software; Transcriptome
PubMed: 35564837
DOI: 10.3390/ijerph19095442 -
Journal of Vascular and Interventional..., Aug 2023
Review
The discovery of increasing numbers of actionable molecular and gene targets for cancer treatment has driven the demand for tissue sampling for next-generation sequencing (NGS). Requirements for sequencing can be very specific, and inadequate sampling leads to delays in management and decision making. It is important that interventional radiologists be aware of NGS technologies and their common applications and be cognizant of the factors that contribute to successful sample sequencing. This review summarizes the fundamentals of cancer tissue collection and processing for NGS. It elaborates on sequencing technologies and their applications with the aim of providing readers with a working understanding that can enhance their clinical practice. It then describes imaging, tumor, biopsy, and sample collection factors that improve the chances of NGS success. Finally, it discusses future practice, highlighting the problem of undersampling in both clinical and research settings and the opportunities within interventional radiology to address this.
Topics: Humans; Neoplasms; Biopsy; High-Throughput Nucleotide Sequencing
PubMed: 36977432
DOI: 10.1016/j.jvir.2023.03.012