Briefings in Bioinformatics Nov 2022
Pan-genome analyses of metagenome-assembled genomes (MAGs) may suffer from the known issues with MAGs: fragmentation, incompleteness and contamination. Here, we conducted a critical assessment of pan-genomics of MAGs by comparing pan-genome analysis results of complete bacterial genomes and simulated MAGs. We found that incompleteness led to significant core gene (CG) loss. The CG loss remained when using different pan-genome analysis tools (Roary, BPGA, Anvi'o) and when using a mixture of MAGs and complete genomes. Contamination had little effect on core genome size (except for Roary, due to its gene clustering issue) but had a major influence on accessory genomes. Importantly, the CG loss was partially alleviated by lowering the CG threshold and using gene prediction algorithms that consider fragmented genes, but to a lesser degree when incompleteness was higher than 5%. The CG loss also led to incorrect pan-genome functional predictions and inaccurate phylogenetic trees. Our main findings were supported by a study of real MAG-isolate genome data. We conclude that lowering the CG threshold and predicting genes in metagenome mode (as Anvi'o does with Prodigal) are necessary in pan-genome analysis of MAGs. Development of new pan-genome analysis tools specifically for MAGs is needed in future studies.
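The effect of the CG threshold described in this abstract can be illustrated with a minimal sketch. It assumes a gene presence/absence table is already available; the function name `core_genes`, the toy genomes, and the 0.75 threshold are hypothetical and are not taken from the paper or from Roary, BPGA, or Anvi'o.

```python
from typing import Dict, Set


def core_genes(presence: Dict[str, Set[str]], threshold: float = 1.0) -> Set[str]:
    """Return gene families found in at least `threshold` (fraction) of genomes."""
    n_genomes = len(presence)
    counts: Dict[str, int] = {}
    for genes in presence.values():
        for gene in genes:
            counts[gene] = counts.get(gene, 0) + 1
    return {gene for gene, count in counts.items() if count >= threshold * n_genomes}


# Toy data (hypothetical): geneA is present in every isolate,
# but the incomplete MAG failed to recover it.
genomes = {
    "isolate_1": {"geneA", "geneB", "geneC"},
    "isolate_2": {"geneA", "geneB"},
    "isolate_3": {"geneA", "geneB", "geneD"},
    "mag_1": {"geneB"},
}

strict_core = core_genes(genomes, threshold=1.0)    # geneA is lost from the core
relaxed_core = core_genes(genomes, threshold=0.75)  # geneA is recovered
```

With a strict 100% threshold, a single incomplete MAG removes `geneA` from the core set; relaxing the threshold restores it, which is the kind of partial alleviation the abstract reports.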
Topics: Metagenome; Phylogeny; Genome, Bacterial; Genomics; Sequence Analysis, DNA; Metagenomics
PubMed: 36124775
DOI: 10.1093/bib/bbac413

Nucleic Acids Research Aug 2022
Genome binning has been essential for characterization of bacteria, archaea, and even eukaryotes from metagenomes. Yet, few approaches exist for viruses. We developed vRhyme, a fast and precise software for construction of viral metagenome-assembled genomes (vMAGs). vRhyme utilizes single- or multi-sample coverage effect size comparisons between scaffolds and employs supervised machine learning to identify nucleotide feature similarities, which are compiled into iterations of weighted networks and refined bins. To refine bins, vRhyme utilizes unique features of viral genomes, namely a protein redundancy scoring mechanism based on the observation that viruses seldom encode redundant genes. Using simulated viromes, we demonstrated superior performance of vRhyme compared to available binning tools in constructing more complete and uncontaminated vMAGs. When applied to 10,601 viral scaffolds from human skin, vRhyme advanced our understanding of resident viruses, highlighted by identification of a Herelleviridae vMAG composed of 22 scaffolds, and another vMAG encoding a nitrate reductase metabolic gene, representing near-complete genomes post-binning. vRhyme will enable a convention of binning uncultivated viral genomes and has the potential to transform metagenome-based viral ecology.
Topics: Genome, Viral; High-Throughput Nucleotide Sequencing; Humans; Metagenome; Metagenomics; Sequence Analysis, DNA; Software
PubMed: 35544285
DOI: 10.1093/nar/gkac341

Nucleic Acids Research Jul 2022
Despite recent methodology and reference database improvements for taxonomic profiling tools, metagenomic assembly and genomic binning remain important pillars of metagenomic analysis workflows. When reference information is lacking, genomic binning is considered to be a state-of-the-art method in mixed culture metagenomic data analysis. In this light, our previously published tool BusyBee Web implements a composition-based binning method efficient enough to function as a rapid online utility. Handling assembled contigs and long nanopore-generated reads alike, the webserver provides a wide range of supplementary annotations and visualizations. Half a decade after the initial publication, we revisited existing functionality, added comprehensive visualizations, and increased the number of data analysis customization options for further experimentation. The webserver now allows for visualization-supported differential analysis of samples, which is computationally expensive and typically only performed in coverage-based binning methods. Further, users may now optionally check their uploaded samples for plasmid sequences using PLSDB as a reference database. Lastly, a new application programming interface with a supporting Python package was implemented to allow power users fully automated access to the resource and integration into existing workflows. The webserver is freely available at: https://www.ccb.uni-saarland.de/busybee.
Topics: Algorithms; Metagenome; Software; Metagenomics; Workflow; Sequence Analysis, DNA
PubMed: 35489067
DOI: 10.1093/nar/gkac298

Frontiers in Cellular and Infection... 2023
INTRODUCTION
The species diversity of microbiomes is a cutting-edge concept in metagenomic research. In this study, we propose a multifractal analysis for metagenomic research.
METHOD AND RESULTS
First, we visualized the chaos game representation (CGR) of simulated and real metagenomes and found that the resulting plots exhibit self-similarity. We then defined and calculated the multifractal dimension of the CGR plots of the simulated and real metagenomes. Analyzing the Pearson correlation coefficients between the multifractal dimension and traditional species diversity indices, we found that the correlation with the species richness index and the Shannon diversity index reached its maximum at q = 0, 1, and that the correlation with the Simpson diversity index reached its maximum at q = 5. Finally, we applied our method to real gut microbiota metagenomes of 100 infants sampled as newborns and at 4 and 12 months of age. The results show that the multifractal dimensions of infants' gut microbiomes can distinguish age differences.
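The two ingredients named in this section, a CGR plot and a q-dependent multifractal dimension, can be sketched as follows. This is only an illustration under stated assumptions: the corner assignment for A, C, G, T is one common convention, the dimension is the standard box-counting estimate of the generalized (Rényi) dimension D_q, and the grid levels are arbitrary. The study's own definition and parameters are not given in the abstract and may differ.

```python
import math
import random
from collections import Counter

# Assumed corner convention for the unit square (conventions vary).
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq):
    """Map a DNA sequence to chaos game representation points."""
    x, y = 0.5, 0.5
    points = []
    for base in seq.upper():
        if base not in CORNERS:
            continue  # skip ambiguous bases such as N
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2.0, (y + cy) / 2.0  # move halfway to the corner
        points.append((x, y))
    return points


def box_probabilities(points, level):
    """Fraction of points in each occupied box of a 2^level x 2^level grid."""
    n = 2 ** level
    counts = Counter(
        (min(int(x * n), n - 1), min(int(y * n), n - 1)) for x, y in points
    )
    return [c / len(points) for c in counts.values()]


def generalized_dimension(points, q, levels=(1, 2, 3, 4)):
    """Estimate D_q as a least-squares slope against log(box size)."""
    log_sizes, values = [], []
    for level in levels:
        probs = box_probabilities(points, level)
        log_sizes.append(-level * math.log(2.0))
        if q == 1:
            values.append(sum(p * math.log(p) for p in probs))
        else:
            values.append(math.log(sum(p ** q for p in probs)) / (q - 1))
    mean_x = sum(log_sizes) / len(log_sizes)
    mean_y = sum(values) / len(values)
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(log_sizes, values))
    den = sum((a - mean_x) ** 2 for a in log_sizes)
    return num / den


random.seed(0)
sequence = "".join(random.choice("ACGT") for _ in range(20000))
points = cgr_points(sequence)
d0 = generalized_dimension(points, 0)  # box-counting dimension
d2 = generalized_dimension(points, 2)  # correlation dimension
```

For a uniformly random sequence the CGR fills the square, so both estimates should be close to 2; structured sequences give lower, q-dependent values, which is the kind of variation the spectrum captures.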
CONCLUSION AND DISCUSSION
There is self-similarity among the CGRs of WGS of metagenomes, and the multifractal spectrum is an important characteristic of metagenomes. The traditional diversity indicators can be unified under the framework of multifractal analysis. These results agree with similar results in macrobial ecology. The multifractal spectrum of infants' gut microbiomes is related to the development of the infants.
Topics: Humans; Infant; Infant, Newborn; Metagenome; Microbiota; Gastrointestinal Microbiome; Metagenomics; Ecology
PubMed: 36779183
DOI: 10.3389/fcimb.2023.1117421

Microbiology Spectrum Apr 2022
The reproductive tract metagenome plays a significant role in various reproductive system functions, including reproductive cycles, health, and fertility. One of the major challenges in bovine vaginal metagenome studies is host DNA contamination, which limits the sequencing capacity for metagenomic content and reduces the accuracy of untargeted shotgun metagenomic profiling. This is the first study comparing the effectiveness of different host depletion and DNA extraction methods for bovine vaginal metagenomic samples. The host depletion methods evaluated were slow centrifugation (Soft-spin), NEBNext Microbiome DNA Enrichment kit (NEBNext), and propidium monoazide (PMA) treatment, while the extraction methods were DNeasy Blood and Tissue extraction (DNeasy) and QIAamp DNA Microbiome extraction (QIAamp). Soft-spin and QIAamp were the most effective host depletion and extraction methods, respectively, in reducing the amount of cattle genomic content in bovine vaginal samples. The reduced host-to-microbe ratio in the extracted DNA increased the sequencing depth for microbial reads in untargeted shotgun sequencing. Bovine vaginal samples extracted with QIAamp presented taxonomical profiles which closely resembled the mock microbial composition, especially for the recovery of Gram-positive bacteria. Additionally, samples extracted with QIAamp presented extensive functional profiles with deep coverage. Overall, a combination of Soft-spin and QIAamp provided the most robust representation of the vaginal microbial community in cattle while minimizing host DNA contamination.
In addition to the host tissue collected during the sampling process, bovine vaginal samples are saturated with large amounts of extracellular DNA and secreted proteins that are essential for physiological purposes, including the reproductive cycle and immune defense. Due to the high host-to-microbe genome ratio, which hampers the sequencing efficacy for metagenome samples and the recovery of the actual metagenomic profiles, bovine vaginal samples cannot benefit from the full potential of shotgun sequencing. This is the first investigation into the most effective host depletion and extraction methods for bovine vaginal metagenomic samples. This study demonstrated an effective combination of host depletion and extraction methods, which harvested higher percentages of 16S rRNA genes and microbial reads and subsequently led to a taxonomical profile that resembled the actual community and a functional profile with deeper coverage. A representative metagenomic profile is essential for investigating the role of the bovine vaginal metagenome in both reproductive function and susceptibility to infections.
Topics: Animals; Cattle; DNA; Female; Metagenome; Metagenomics; RNA, Ribosomal, 16S; Sequence Analysis, DNA
PubMed: 35404108
DOI: 10.1128/spectrum.00412-21

Bioinformatics (Oxford, England) Sep 2019
MOTIVATION
Metagenomics is the study of genetic materials directly sampled from natural habitats. It has the potential to reveal previously hidden diversity of microscopic life, largely due to the existence of highly parallel and low-cost next-generation sequencing technology. Conventional approaches align metagenomic reads onto known reference genomes to identify microbes in the sample. Since such a collection of reference genomes is very large, the approach often needs high-end computing machines with large memory, which are often unavailable to researchers. Alternative approaches follow an alignment-free methodology where the presence of a microbe is predicted using the information about the unique k-mers present in the microbial genomes. However, such approaches suffer from high false-positive rates because the value of k must be traded off against computational resources. In this article, we propose a highly efficient metagenomic sequence classification (MSC) algorithm that is a hybrid of both approaches. Instead of aligning reads to the full genomes, MSC aligns reads onto a set of carefully chosen, shorter and highly discriminating model sequences built from the unique k-mers of each of the reference sequences.
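The notion of k-mers that are unique to one reference can be made concrete with a small sketch. It covers only the alignment-free half of the idea: it finds k-mers that occur in exactly one reference and assigns a read by counting hits. It does not build model sequences or align reads as MSC does, it ignores reverse complements, and the names `unique_kmers`, `classify_read`, and `min_hits` as well as the toy sequences are hypothetical.

```python
from collections import Counter
from typing import Dict, Optional, Set


def kmers(seq: str, k: int) -> Set[str]:
    """All distinct k-mers of a sequence (forward strand only)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def unique_kmers(references: Dict[str, str], k: int) -> Dict[str, Set[str]]:
    """For each reference, keep the k-mers found in no other reference."""
    per_ref = {name: kmers(seq, k) for name, seq in references.items()}
    owners = Counter(kmer for kmer_set in per_ref.values() for kmer in kmer_set)
    return {
        name: {kmer for kmer in kmer_set if owners[kmer] == 1}
        for name, kmer_set in per_ref.items()
    }


def classify_read(read: str, unique: Dict[str, Set[str]], k: int,
                  min_hits: int = 1) -> Optional[str]:
    """Assign the read to the reference with the most unique k-mer hits."""
    read_kmers = kmers(read, k)
    hits = {name: len(read_kmers & kmer_set) for name, kmer_set in unique.items()}
    best = max(hits, key=hits.get)
    return best if hits[best] >= min_hits else None


# Toy references sharing the prefix ACGTACGT (hypothetical data).
references = {"taxon_x": "ACGTACGTTTGACCA", "taxon_y": "ACGTACGTGGCATTA"}
unique = unique_kmers(references, k=5)

call_specific = classify_read("GTTTGACC", unique, k=5)  # from taxon_x's tail
call_shared = classify_read("ACGTACG", unique, k=5)     # from the shared prefix
```

The read drawn from the shared prefix has no unique k-mers and is left unassigned, which shows why the choice of k and of discriminating sequences drives the false-positive behavior the abstract discusses.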
RESULTS
Microbiome researchers are generally interested in two objectives of a taxonomic classifier: (i) to detect prevalence, i.e. the taxa present in a sample, and (ii) to estimate their relative abundances. MSC is primarily designed to detect prevalence and experimental results show that MSC is indeed a more effective and efficient algorithm compared to the other state-of-the-art algorithms in terms of accuracy, memory and runtime. Moreover, MSC outputs an approximate estimate of the abundances.
AVAILABILITY AND IMPLEMENTATION
The implementations are freely available for non-commercial purposes. They can be downloaded from https://drive.google.com/open?id=1XirkAamkQ3ltWvI1W1igYQFusp9DHtVl.
Topics: Algorithms; High-Throughput Nucleotide Sequencing; Metagenome; Metagenomics; Sequence Analysis, DNA
PubMed: 30649204
DOI: 10.1093/bioinformatics/bty1071

Bioinformatics (Oxford, England) Jul 2020
MOTIVATION
Metagenomics studies have provided key insights into the composition and structure of microbial communities found in different environments. Among the techniques used to analyse metagenomic data, binning is considered a crucial step to characterize the different species of micro-organisms present. The use of short-read data in most binning tools poses several limitations, such as insufficient species-specific signal, and the emergence of long-read sequencing technologies offers us opportunities to surmount them. However, most current metagenomic binning tools have been developed for short reads. The few tools that can process long reads either do not scale with increasing input size or require a database with reference genomes that are often unknown. In this article, we present MetaBCC-LR, a scalable reference-free binning method which clusters long reads directly based on their k-mer coverage histograms and oligonucleotide composition.
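The two read-level features named here, oligonucleotide composition and k-mer coverage histograms, might be computed along the following lines. This is a sketch under assumptions, not MetaBCC-LR's implementation: the trinucleotide size, the merging of reverse complements, the 4-mer used for coverage, and the bin settings are all illustrative choices, and the helper names are hypothetical.

```python
from collections import Counter
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def composition_vector(read, k=3):
    """Normalized canonical k-mer frequencies (32 values for k=3)."""
    keys = sorted({canonical("".join(p)) for p in product("ACGT", repeat=k)})
    counts = Counter(
        canonical(read[i:i + k])
        for i in range(len(read) - k + 1)
        if set(read[i:i + k]) <= set("ACGT")  # skip windows with N etc.
    )
    total = sum(counts.values())
    return [counts[key] / total if total else 0.0 for key in keys]


def coverage_histogram(read, dataset_counts, k, n_bins, bin_width):
    """Histogram of how often each k-mer of the read occurs in the dataset."""
    hist = [0] * n_bins
    for i in range(len(read) - k + 1):
        coverage = dataset_counts.get(canonical(read[i:i + k]), 0)
        hist[min(coverage // bin_width, n_bins - 1)] += 1
    total = sum(hist)
    return [h / total if total else 0.0 for h in hist]


# Toy reads (hypothetical); real long reads are thousands of bases long.
reads = ["ACGTTGCAAC", "GGGTTTAAAC"]
dataset_counts = Counter(
    canonical(r[i:i + 4]) for r in reads for i in range(len(r) - 3)
)
composition = composition_vector(reads[0])
histogram = coverage_histogram(reads[0], dataset_counts, k=4, n_bins=4, bin_width=1)
```

Reads from the same genome tend to share both a composition profile and a coverage profile, so feature vectors like these can be clustered without reference genomes.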
RESULTS
We evaluate MetaBCC-LR on multiple simulated and real metagenomic long-read datasets with varying coverages and error rates. Our experiments demonstrate that MetaBCC-LR substantially outperforms state-of-the-art reference-free binning tools, achieving ∼13% improvement in F1-score and ∼30% improvement in ARI compared to the best previous tools. Moreover, we show that using MetaBCC-LR before long-read assembly helps to enhance the assembly quality while significantly reducing the assembly cost in terms of time and memory usage. The efficiency and accuracy of MetaBCC-LR pave the way for more effective long-read-based metagenomics analyses to support a wide range of applications.
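For readers unfamiliar with the ARI reported above, the following sketch computes the standard adjusted Rand index between a ground-truth labeling and a predicted binning. It uses the usual pair-counting formula; how the paper assigns ground truth to reads and handles unbinned reads is not stated in the abstract, so the example labels are hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(truth, predicted):
    """Adjusted Rand index between two labelings of the same items."""
    n = len(truth)
    if n < 2:
        return 1.0  # no pairs to disagree on
    pair_counts = Counter(zip(truth, predicted))
    truth_counts = Counter(truth)
    pred_counts = Counter(predicted)

    sum_pairs = sum(comb(c, 2) for c in pair_counts.values())
    sum_truth = sum(comb(c, 2) for c in truth_counts.values())
    sum_pred = sum(comb(c, 2) for c in pred_counts.values())

    expected = sum_truth * sum_pred / comb(n, 2)  # agreement expected by chance
    max_index = (sum_truth + sum_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both labelings
    return (sum_pairs - expected) / (max_index - expected)


# Hypothetical species labels for four reads and two candidate binnings.
truth = ["sp1", "sp1", "sp2", "sp2"]
perfect_bins = [1, 1, 0, 0]  # same grouping, different bin names
mixed_bins = [0, 1, 0, 1]    # every bin mixes both species

ari_perfect = adjusted_rand_index(truth, perfect_bins)
ari_mixed = adjusted_rand_index(truth, mixed_bins)
```

Because the index is corrected for chance, a binning that splits every species across bins can score below zero, while a relabeled but otherwise perfect binning scores 1.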
AVAILABILITY AND IMPLEMENTATION
The source code is freely available at: https://github.com/anuradhawick/MetaBCC-LR.
SUPPLEMENTARY INFORMATION
Supplementary data are available at Bioinformatics online.
Topics: Algorithms; Metagenome; Metagenomics; Sequence Analysis, DNA; Software
PubMed: 32657364
DOI: 10.1093/bioinformatics/btaa441

Bioinformatics (Oxford, England) Jan 2022
MOTIVATION
With a large number of metagenomic datasets becoming available, eukaryotic metagenomics has emerged as a new challenge. The proper classification of eukaryotic nuclear and organellar genomes is an essential step toward a better understanding of eukaryotic diversity.
RESULTS
We developed Tiara, a deep-learning-based approach for the identification of eukaryotic sequences in metagenomic datasets. Its two-step classification process enables the classification of nuclear and organellar eukaryotic fractions and subsequently divides organellar sequences into plastidial and mitochondrial. Using the test dataset, we have shown that Tiara performed similarly to EukRep for prokaryote classification and outperformed it for eukaryote classification, with a shorter computation time. In tests on real data, Tiara performed better than EukRep in analyzing a small dataset representing a eukaryotic cell microbiome and a large dataset from the pelagic zone of oceans. Tiara is also the only available tool correctly classifying organellar sequences, which was confirmed by the recovery of nearly complete plastid and mitochondrial genomes from the test data and real metagenomic data.
AVAILABILITY AND IMPLEMENTATION
Tiara is implemented in Python 3.8, available at https://github.com/ibe-uw/tiara and tested on Unix-based systems. It is released under an open-source MIT license and documentation is available at https://ibe-uw.github.io/tiara. Version 1.0.1 of Tiara has been used for all benchmarks.
SUPPLEMENTARY INFORMATION
Supplementary data are available at Bioinformatics online.
Topics: Software; Deep Learning; Eukaryota; Eukaryotic Cells; Metagenomics; Metagenome
PubMed: 34570171
DOI: 10.1093/bioinformatics/btab672

Microbiome May 2022
BACKGROUND
For many environments, biome-specific microbial gene catalogues are being recovered using shotgun metagenomics followed by assembly and gene calling on the assembled contigs. The assembly is typically conducted either by individually assembling each sample or by co-assembling reads from all the samples. The co-assembly approach can potentially recover genes that display too low abundance to be assembled from individual samples. On the other hand, combining samples increases the risk of mixing data from closely related strains, which can hamper the assembly process. In this respect, assembly on individual samples followed by clustering of (near) identical genes is preferable. Thus, both approaches have potential pros and cons, but it remains to be evaluated which assembly strategy is most effective. Here, we have evaluated three assembly strategies for generating gene catalogues from metagenomes using a dataset of 124 samples from the Baltic Sea: (1) assembly on individual samples followed by clustering of the resulting genes, (2) co-assembly on all samples, and (3) mix assembly, combining individual and co-assembly.
RESULTS
The mix-assembly approach resulted in a more extensive nonredundant gene set than the other approaches, with more genes predicted to be complete and more genes that could be functionally annotated. The mix assembly consists of 67 million genes (Baltic Sea gene set, BAGS) that have been functionally and taxonomically annotated. The majority of the BAGS genes are dissimilar (< 95% amino acid identity) to the Tara Oceans gene dataset, and hence, BAGS represents a valuable resource for brackish water research.
CONCLUSION
The mix-assembly approach is a feasible way to increase the information obtained from metagenomic samples.
Topics: Algorithms; Cluster Analysis; Ecosystem; Metagenome; Metagenomics
PubMed: 35524337
DOI: 10.1186/s40168-022-01259-2

PeerJ 2022
As the size of reference sequence databases and high-throughput sequencing datasets continues to grow, it is becoming computationally infeasible to use traditional alignment to large genome databases for taxonomic classification of metagenomic reads. Exact matching approaches can rapidly assign taxonomy and summarize the composition of microbial communities, but they sacrifice accuracy and can lead to false positives. Full alignment tools provide higher confidence assignments and can assign sequences from genomes that diverge from reference sequences; however, full alignment tools are computationally intensive. To address this, we designed MTSv specifically for alignment-based taxonomic assignment in metagenomic analysis. This tool implements an FM-index assisted q-gram filter and SIMD accelerated Smith-Waterman algorithm to find alignments. However, unlike traditional aligners, MTSv will not attempt to make additional alignments to a TaxID once an alignment of sufficient quality has been found. This improves efficiency when many reference sequences are available per taxon. MTSv was designed to be flexible and can be modified to run on either memory or processor constrained systems. Although MTSv cannot compete with the speeds of exact k-mer matching approaches, it is reasonably fast and has higher precision than popular exact matching approaches. Because MTSv performs a full alignment, it can classify reads even when the genomes share low similarity with reference sequences, and it provides a tool for high confidence pathogen detection with low off-target assignments to near neighbor species.
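The filter-then-align strategy with an early exit per TaxID can be sketched in a few lines. The sketch deliberately swaps in simpler techniques: a plain set intersection stands in for the FM-index assisted q-gram filter, and an unvectorized Smith-Waterman scorer stands in for the SIMD implementation. The function names, scoring scheme, thresholds, TaxIDs, and sequences are all hypothetical and are not taken from MTSv.

```python
from typing import Dict, List, Set, Tuple


def qgrams(seq: str, q: int) -> Set[str]:
    """All distinct substrings of length q."""
    return {seq[i:i + q] for i in range(len(seq) - q + 1)}


def smith_waterman_score(a: str, b: str, match: int = 1,
                         mismatch: int = -1, gap: int = -1) -> int:
    """Best local alignment score with a linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def assign_taxids(read: str, references: Dict[int, List[str]], q: int = 4,
                  min_shared: int = 2, min_score: int = 8) -> Tuple[Set[int], int]:
    """Return the TaxIDs hit by the read and the number of alignments run."""
    read_qgrams = qgrams(read, q)
    hits: Set[int] = set()
    alignments = 0
    for taxid, sequences in references.items():
        for ref in sequences:
            if len(read_qgrams & qgrams(ref, q)) < min_shared:
                continue  # cheap filter: too few shared q-grams to align
            alignments += 1
            if smith_waterman_score(read, ref) >= min_score:
                hits.add(taxid)
                break  # good enough: skip this TaxID's remaining references
    return hits, alignments


# Toy database (hypothetical): two references for TaxID 101, one for 202.
references = {
    101: ["TTACGTACGTGATT", "ACGTACGTGACC"],
    202: ["GGGGCCCCGGGGCC"],
}
taxids, n_alignments = assign_taxids("ACGTACGTGA", references)
```

In this toy run only one alignment is performed: the second reference for TaxID 101 is skipped after the first hit, and the reference for TaxID 202 never passes the filter. That is the efficiency gain the abstract describes when many references exist per taxon.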
Topics: Sequence Analysis, DNA; Algorithms; Metagenome; Databases, Nucleic Acid; Metagenomics
PubMed: 36389404
DOI: 10.7717/peerj.14292