-
Microbiome Feb 2019
BACKGROUND
Shotgun metagenome data sets of microbial communities are highly diverse, not only due to the natural variation of the underlying biological systems, but also due to differences in laboratory protocols, replicate numbers, and sequencing technologies. Accordingly, to effectively assess the performance of metagenomic analysis software, a wide range of benchmark data sets are required.
RESULTS
We describe the CAMISIM microbial community and metagenome simulator. The software can model different microbial abundance profiles, multi-sample time series, and differential abundance studies, includes real and simulated strain-level diversity, and generates second- and third-generation sequencing data from taxonomic profiles or de novo. Gold standards are created for sequence assembly, genome binning, taxonomic binning, and taxonomic profiling. CAMISIM generated the benchmark data sets of the first CAMI challenge. For two simulated multi-sample data sets of the human and mouse gut microbiomes, we observed high functional congruence to the real data. As further applications, we investigated the effect of varying evolutionary genome divergence, sequencing depth, and read error profiles on two popular metagenome assemblers, MEGAHIT and metaSPAdes, using several thousand small data sets generated with CAMISIM.
CONCLUSIONS
CAMISIM can simulate a wide variety of microbial communities and metagenome data sets together with standards of truth for method evaluation. All data sets and the software are freely available at https://github.com/CAMI-challenge/CAMISIM.
Topics: Algorithms; Animals; Computer Simulation; Gastrointestinal Microbiome; Humans; Metagenome; Metagenomics; Mice; Models, Biological; Sequence Analysis, DNA; Software
PubMed: 30736849
DOI: 10.1186/s40168-019-0633-6 -
Bioinformatics (Oxford, England) Sep 2022
MOTIVATION
Despite recent advancements in sequencing technologies and assembly methods, obtaining high-quality microbial genomes from metagenomic samples is still not a trivial task. Current metagenomic binners do not take full advantage of assembly graphs and are not optimized for long-read assemblies. Deep graph learning algorithms have been proposed in other fields to deal with complex graph data structures. The graph structure generated during the assembly process could be integrated with contig features to obtain better bins with deep learning.
RESULTS
We propose GraphMB, which uses graph neural networks to incorporate the assembly graph into the binning process. We test GraphMB on long-read datasets of different complexities, and compare the performance with other binners in terms of the number of High Quality (HQ) genome bins obtained. With our approach, we were able to obtain unique bins on all real datasets, and obtain more bins on most datasets. In particular, we obtained on average 17.5% more HQ bins when compared with state-of-the-art binners and 13.7% when aggregating the results of our binner with the others. These results indicate that a deep learning model can integrate contig-specific and graph-structure information to improve metagenomic binning.
AVAILABILITY AND IMPLEMENTATION
GraphMB is available from https://github.com/MicrobialDarkMatter/GraphMB.
SUPPLEMENTARY INFORMATION
Supplementary data are available at Bioinformatics online.
Topics: Sequence Analysis, DNA; Metagenomics; Metagenome; Genome, Microbial; Algorithms
PubMed: 35972375
DOI: 10.1093/bioinformatics/btac557 -
Journal of Biomolecular Techniques : JBT Apr 2017
Topics: DNA; High-Throughput Nucleotide Sequencing; Humans; Metagenome; Metagenomics; RNA
PubMed: 28400709
DOI: 10.7171/jbt.17-2801-010 -
Communications Biology Oct 2023
Assembly of reads from metagenomic samples is a hard problem, often resulting in highly fragmented genome assemblies. Metagenomic binning allows us to reconstruct genomes by re-grouping the sequences by their organism of origin, thus representing a crucial processing step when exploring the biological diversity of metagenomic samples. Here we present Adversarial Autoencoders for Metagenomics Binning (AAMB), an ensemble deep learning approach that integrates sequence co-abundances and tetranucleotide frequencies into a common denoised space that enables precise clustering of sequences into microbial genomes. When benchmarked, AAMB presented similar or better results compared with the state-of-the-art reference-free binner VAMB, reconstructing ~7% more near-complete (NC) genomes across simulated and real data. In addition, genomes reconstructed using AAMB had higher completeness and greater taxonomic diversity compared with VAMB. Finally, we implemented a pipeline integrating VAMB and AAMB that enabled improved binning, recovering 20% and 29% more simulated and real NC genomes, respectively, compared to VAMB, with moderate additional runtime.
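As a rough illustration of one of the input features named above, the following Python sketch computes tetranucleotide frequencies for a single contig. It is a minimal example, not the AAMB or VAMB implementation: the function name `tetranucleotide_frequencies` and the 256-dimensional layout are assumptions made here, and production binners typically also merge reverse-complement k-mers and normalize further.

```python
from collections import Counter
from itertools import product

# All 256 possible 4-mers over the DNA alphabet, in a fixed order.
KMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetranucleotide_frequencies(seq):
    """Return the relative frequency of each 4-mer in `seq` (hypothetical helper).

    Overlapping windows are counted; windows containing characters
    other than A, C, G, T (for example N) are ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    total = sum(counts[k] for k in KMERS)
    if total == 0:
        # Sequence too short or without any valid 4-mer.
        return [0.0] * len(KMERS)
    return [counts[k] / total for k in KMERS]


# Toy contig: the 4-mers are ACGT, CGTA, GTAC, TACG, ACGT.
tnf = tetranucleotide_frequencies("ACGTACGT")
```

A vector like `tnf` would be combined with per-sample abundance values before clustering; how the two are weighted and encoded is specific to each tool and is not shown here.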
Topics: Metagenome; Genome, Microbial; Metagenomics; Cluster Analysis; Benchmarking
PubMed: 37865678
DOI: 10.1038/s42003-023-05452-3 -
Microbiome Mar 2016
Review
Assembly of metagenomic sequence data into microbial genomes is of fundamental value to improving our understanding of microbial ecology and metabolism by elucidating the functional potential of hard-to-culture microorganisms. Here, we provide a synthesis of available methods to bin metagenomic contigs into species-level groups and highlight how genetic diversity, sequencing depth, and coverage influence binning success. Despite the computational cost of applying them to deeply sequenced complex metagenomes (e.g., soil), covarying patterns of contig coverage across multiple datasets significantly improve the binning process. We also discuss and compare current genome validation methods and reveal how these methods tackle the problem of chimeric genome bins, i.e., bins containing sequences from multiple species. Finally, we explore how population genome assembly can be used to uncover biogeographic trends and to characterize the effect of in situ functional constraints on genome-wide evolution.
Topics: Contig Mapping; Datasets as Topic; Genome, Microbial; Metagenome; Metagenomics; Sequence Analysis, DNA
PubMed: 26951112
DOI: 10.1186/s40168-016-0154-5 -
Scientific Data Feb 2023
Common culturing techniques and priorities bias our discovery towards specific traits that may not be representative of microbial diversity in nature. So far, these biases have not been systematically examined. To address this gap, here we use 116,884 publicly available metagenome-assembled genomes (MAGs, completeness ≥80%) from 203 surveys worldwide as a culture-independent sample of bacterial and archaeal diversity, and compare these MAGs to the popular RefSeq genome database, which heavily relies on cultures. We compare the distribution of 12,454 KEGG gene orthologs (used as trait proxies) in the MAGs and RefSeq genomes, while controlling for environment type (ocean, soil, lake, bioreactor, human, and other animals). Using statistical modeling, we then determine the conditional probabilities that a species is represented in RefSeq depending on its genetic repertoire. We find that the majority of examined genes are significantly biased for or against in RefSeq. Our systematic estimates of gene prevalences across bacteria and archaea in nature and of gene-specific biases in reference genomes constitute a resource for addressing these issues in the future.
Topics: Animals; Archaea; Bacteria; Genome, Microbial; Metagenome; Metagenomics
PubMed: 36759614
DOI: 10.1038/s41597-023-01994-7 -
Microbial Genomics Apr 2024
Review
The ever-decreasing cost of sequencing and the growing potential applications of metagenomics have led to an unprecedented surge in data generation. One of the most prevalent applications of metagenomics is the study of microbial environments, such as the human gut. The gut microbiome plays a crucial role in human health, providing vital information for patient diagnosis and prognosis. However, analysing metagenomic data remains challenging due to several factors, including reference catalogues, sparsity and compositionality. Deep learning (DL) enables novel and promising approaches that complement state-of-the-art microbiome pipelines. DL-based methods can address almost all aspects of microbiome analysis, including novel pathogen detection, sequence classification, patient stratification and disease prediction. Beyond generating predictive models, a key aspect of these methods is also their interpretability. This article reviews DL approaches in metagenomics, including convolutional networks, autoencoders and attention-based models. These methods aggregate contextualized data and pave the way for improved patient care and a better understanding of the microbiome's key role in our health.
Topics: Humans; Deep Learning; Microbiota; Metagenome; Gastrointestinal Microbiome; Metagenomics
PubMed: 38630611
DOI: 10.1099/mgen.0.001231 -
Genomics, Proteomics & Bioinformatics Dec 2018
Review
Metagenomes from uncultured microorganisms are rich resources for novel enzyme genes. The methods used to screen the metagenomic libraries fall into two categories, which are based on sequence or function of the enzymes. The sequence-based approaches rely on the known sequences of the target gene families. In contrast, the function-based approaches do not involve the incorporation of metagenomic sequencing data and, therefore, may lead to the discovery of novel gene sequences with desired functions. In this review, we discuss the function-based screening strategies that have been used in the identification of enzymes from metagenomes. Because of its simplicity, agar plate screening is most commonly used in the identification of novel enzymes with diverse functions. Other screening methods with higher sensitivity are also employed, such as microtiter plate screening. Furthermore, several ultra-high-throughput methods were developed to deal with large metagenomic libraries. Among these are the FACS-based screening, droplet-based screening, and the in vivo reporter-based screening methods. The application of these novel screening strategies has increased the chance for the discovery of novel enzyme genes.
Topics: Animals; Bacteria; Enzymes; Gene Library; High-Throughput Screening Assays; Metagenome; Metagenomics; Plants
PubMed: 30597257
DOI: 10.1016/j.gpb.2018.01.002 -
STAR Protocols Sep 2022
Homology-based search is commonly used to uncover mobile genetic elements (MGEs) from metagenomes, but it heavily relies on reference genomes in the database. Here we introduce a protocol to extract CRISPR-targeted sequences from the assembled human gut metagenomic sequences without using a reference database. We describe the assembling of metagenome contigs, the extraction of CRISPR direct repeats and spacers, the discovery of protospacers, and the extraction of protospacer-enriched regions using the graph-based approach. This protocol could extract numerous characterized/uncharacterized MGEs. For complete details on the use and execution of this protocol, please refer to Sugimoto et al. (2021).
Topics: Base Sequence; Clustered Regularly Interspaced Short Palindromic Repeats; Humans; Metagenome; Metagenomics
PubMed: 35780428
DOI: 10.1016/j.xpro.2022.101525 -
Microbiology Spectrum Feb 2023
Lower respiratory infection (LRI) is the most fatal communicable disease, with only a few pathogens identified. Metagenomic next-generation sequencing (mNGS), as an unbiased, hypothesis-free, and culture-independent method, theoretically enables the detection of all pathogens in a single test. In this study, we developed and validated a DNA-based mNGS method for the diagnosis of LRIs from bronchoalveolar lavage fluid (BALF). We prepared simulated data sets and published raw data sets from patients to evaluate the performance of our in-house bioinformatics pipeline and compared it with the popular metagenomics pipeline Kraken2-Bracken. In addition, a series of biological microbial communities were used to comprehensively validate the performance of our mNGS assay. Sixty-nine clinical BALF samples were used for clinical validation to determine the accuracy. The in-house bioinformatics pipeline validation showed a recall of 88.03%, precision of 99.14%, and F1 score of 92.26% via single-genome simulated data. Mock microbial community and clinical metagenomic data showed that the in-house pipeline has a stricter cutoff value than Kraken2-Bracken, which could prevent false-positive detection by the bioinformatics pipeline. The validation for the whole mNGS pipeline revealed that overwhelming human DNA, long-term storage at 4°C, and repeated freezing-thawing reduced the analytical sensitivity of the assay. The mNGS assay showed a sensitivity of 95.18% and specificity of 91.30% for pathogen detection from BALF samples. This study comprehensively demonstrated the analytical performance of this laboratory-developed mNGS assay for pathogen detection from BALF, which contributed to the standardization of this technology. To our knowledge, this study is the first to comprehensively validate the mNGS assay for the diagnosis of LRIs from BALF. 
This study exhibited a ready-made example for clinical laboratories to prepare reference materials and develop comprehensive validation schemes for their in-house mNGS assays, which would accelerate the standardization of mNGS testing.
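The validation metrics quoted above (recall or sensitivity, precision, specificity, and F1 score) all derive from a confusion matrix. The Python sketch below shows the standard definitions. The function name `classification_metrics` and the counts in the usage example are hypothetical and are not taken from the study. The abstract does not say whether its figures were pooled or averaged per data set, so this sketch is not meant to reproduce them.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute standard detection metrics from confusion-matrix counts.

    tp/fp/tn/fn are true positives, false positives, true negatives,
    and false negatives. Ratios with a zero denominator are reported
    as 0.0 instead of raising an error.
    """
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)          # also called sensitivity
    specificity = ratio(tn, tn + fp)
    # F1 is the harmonic mean of precision and recall.
    f1 = ratio(2 * precision * recall, precision + recall)
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }


# Hypothetical counts, chosen only to illustrate the arithmetic.
metrics = classification_metrics(tp=90, fp=10, tn=80, fn=10)
```

Because F1 is a harmonic mean, it is pulled toward the lower of precision and recall, which is why a pipeline with a strict cutoff can show very high precision and still a noticeably lower F1.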
Topics: Humans; Metagenome; Respiratory Tract Infections; Microbiota; High-Throughput Nucleotide Sequencing; Metagenomics
PubMed: 36507666
DOI: 10.1128/spectrum.03812-22