-
Genome Biology Apr 2021
Review
High-quality chromosome-scale haplotype sequences of diploid genomes, polyploid genomes, and metagenomes provide important insights into genetic variation associated with disease and biodiversity. However, whole-genome short-read sequencing does not directly yield haplotype information spanning whole chromosomes. Haplotype reconstruction therefore requires computational assembly of shorter haplotype fragments, which can be challenging owing to limited fragment lengths and high haplotype and repeat variability across genomes. Recent advancements in long-read and chromosome-scale sequencing technologies, alongside computational innovations, are improving the reconstruction of haplotypes at the level of whole chromosomes. Here, we review and discuss recent methodological progress and perspectives in these areas.
Topics: Chromosomes; Computational Biology; Diploidy; Genomics; Haplotypes; High-Throughput Nucleotide Sequencing; Humans; Metagenome; Metagenomics; Polyploidy; Sequence Analysis, DNA
PubMed: 33845884
DOI: 10.1186/s13059-021-02328-9 -
Philosophical Transactions of the Royal... Dec 2020
Review
Today massive amounts of sequenced metagenomic and metatranscriptomic data from different ecological niches and environmental locations are available. Scientific progress depends critically on methods that allow extracting useful information from the various types of sequence data. Here, we will first discuss types of information contained in the various flavours of biological sequence data, and how this information can be interpreted to increase our scientific knowledge and understanding. We argue that a mechanistic understanding of biological systems analysed from different perspectives is required to consistently interpret experimental observations, and that this understanding is greatly facilitated by the generation and analysis of dynamic mathematical models. We conclude that, in order to construct mathematical models and to test mechanistic hypotheses, time-series data are of critical importance. We review diverse techniques to analyse time-series data and discuss various approaches by which time-series of biological sequence data have been successfully used to derive and test mechanistic hypotheses. Analysing the bottlenecks of current strategies in the extraction of knowledge and understanding from data, we conclude that combined experimental and theoretical efforts should be implemented as early as possible during the planning phase of individual experiments and scientific research projects. This article is part of the theme issue 'Integrative research perspectives on marine conservation'.
Topics: Conservation of Natural Resources; Ecosystem; Gene Expression Profiling; Metagenome; Metagenomics; Models, Biological; Transcriptome
PubMed: 33131436
DOI: 10.1098/rstb.2019.0448 -
Genome Research Mar 2020
Review
Genomes are an integral component of the biological information about an organism; thus, the more complete the genome, the more informative it is. Historically, bacterial and archaeal genomes were reconstructed from pure (monoclonal) cultures, and the first reported sequences were manually curated to completion. However, the bottleneck imposed by the requirement for isolates precluded genomic insights for the vast majority of microbial life. Shotgun sequencing of microbial communities, referred to initially as community genomics and subsequently as genome-resolved metagenomics, can circumvent this limitation by obtaining metagenome-assembled genomes (MAGs); but gaps, local assembly errors, chimeras, and contamination by fragments from other genomes limit the value of these genomes. Here, we discuss genome curation to improve and, in some cases, achieve complete (circularized, no gaps) MAGs (CMAGs). To date, few CMAGs have been generated, although notably some are from very complex systems such as soil and sediment. Through analysis of about 7000 published complete bacterial isolate genomes, we verify the value of cumulative GC skew in combination with other metrics to establish bacterial genome sequence accuracy. The analysis of cumulative GC skew identified potential misassemblies in some reference genomes of isolated bacteria and the repeat sequences that likely gave rise to them. We discuss methods that could be implemented in bioinformatic approaches for curation to ensure that metabolic and evolutionary analyses can be based on very high-quality genomes.
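The cumulative GC skew named in this abstract can be illustrated with a minimal Python sketch. It assumes the common windowed definition, (G - C) / (G + C) per window summed along the sequence; the window size and the function names `cumulative_gc_skew` and `skew_extremes` are illustrative choices, not taken from the paper. In many bacterial genomes the cumulative curve tends to reach its minimum near the replication origin and its maximum near the terminus, so an irregular curve can hint at a misassembly.

```python
def cumulative_gc_skew(sequence, window=1000):
    """Return the cumulative GC skew, one value per non-overlapping window.

    Per-window skew is (G - C) / (G + C); windows with no G or C add 0.
    """
    seq = sequence.upper()
    cumulative = []
    total = 0.0
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        g = chunk.count("G")
        c = chunk.count("C")
        if g + c:
            total += (g - c) / (g + c)
        cumulative.append(total)
    return cumulative


def skew_extremes(cumulative):
    """Return (index of minimum, index of maximum) of the cumulative curve."""
    indices = range(len(cumulative))
    min_idx = min(indices, key=cumulative.__getitem__)
    max_idx = max(indices, key=cumulative.__getitem__)
    return min_idx, max_idx


if __name__ == "__main__":
    # Toy sequence only: a C-rich half followed by a G-rich half.
    toy = "C" * 4 + "G" * 4
    curve = cumulative_gc_skew(toy, window=2)
    print(curve, skew_extremes(curve))
```

A real curation workflow would combine such a curve with other metrics, as the abstract notes; this sketch only shows the arithmetic.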
Topics: Data Curation; Genome, Archaeal; Genome, Bacterial; Metagenome; Metagenomics
PubMed: 32188701
DOI: 10.1101/gr.258640.119 -
Current Microbiology Nov 2020
Review
Next-generation sequencing (NGS) technologies, together with improved access to computing power, have made genome sequencing cost-effective over the past several years. This has allowed researchers to exploit the full potential of genomic and metagenomic analyses to better elucidate two-way interactions between host cells and the microbiome, both at steady state and in pathological conditions. Experimental research involving metagenomics shows that skin-resident microbes can influence cutaneous pathophysiology. Here, we review metagenomic approaches to studying the microbiota at this barrier site. We also describe the consequences of changes in the burden and composition of the skin microbiota, mostly revealed by these technologies, for the development of common inflammatory skin diseases.
Topics: High-Throughput Nucleotide Sequencing; Humans; Metagenome; Metagenomics; Microbiota; Skin Diseases
PubMed: 32813091
DOI: 10.1007/s00284-020-02163-4 -
Scientific Data Jun 2022
With the rapid development of high-throughput sequencing technology, the amount of metagenomic data (including both 16S and whole-genome sequencing data) in public repositories is increasing exponentially. However, because the data are large and decentralized, it is still difficult for users to mine, compare, and analyze them. The animal metagenome database (AnimalMetagenome DB) integrates metagenomic sequencing data with host information, making it easier for users to find data of interest. The AnimalMetagenome DB is designed to contain all public metagenomic data from animals, and the data are divided into domestic and wild animal categories. Users can browse, search, and download animal metagenomic data of interest based on different attributes of the metadata, such as animal species, sample site, study purpose, and DNA extraction method. The AnimalMetagenome DB version 1.0 includes metadata for 82,097 metagenomes from 4 domestic animals (pigs, bovines, horses, and sheep) and 540 wild animals. These metagenomes span 15 years of experiments, 73 countries, and 1,044 studies, and include 63,214 amplicon sequencing datasets and 10,672 whole-genome sequencing datasets. All data in the database are hosted on and available from figshare (https://doi.org/10.6084/m9.figshare.19728619).
Topics: Animals; Cattle; Databases, Factual; High-Throughput Nucleotide Sequencing; Horses; Metadata; Metagenome; Metagenomics; Sheep; Swine
PubMed: 35710683
DOI: 10.1038/s41597-022-01444-w -
Nature Biotechnology Nov 2023
Metagenomic assembly enables new organism discovery from microbial communities, but it can capture only a few abundant organisms from most metagenomes. Here we present MetaPhlAn 4, which integrates information from metagenome assemblies and microbial isolate genomes for more comprehensive metagenomic taxonomic profiling. From a curated collection of 1.01 M prokaryotic reference and metagenome-assembled genomes, we define unique marker genes for 26,970 species-level genome bins, 4,992 of them taxonomically unidentified at the species level. MetaPhlAn 4 explains ~20% more reads in most international human gut microbiomes and >40% in less-characterized environments such as the rumen microbiome, and it proves more accurate than available alternatives on synthetic evaluations while also reliably quantifying organisms with no cultured isolates. Application of the method to >24,500 metagenomes highlights previously undetected species as strong biomarkers for host conditions and lifestyles in human and mouse microbiomes and shows that even previously uncharacterized species can be genetically profiled at the resolution of single microbial strains.
Topics: Humans; Animals; Mice; Metagenome; Microbiota; Gastrointestinal Microbiome; Metagenomics; Phylogeny
PubMed: 36823356
DOI: 10.1038/s41587-023-01688-w -
Cardiovascular Research Feb 2021
Topics: Metagenome; Metagenomics; Microbiota
PubMed: 32569375
DOI: 10.1093/cvr/cvaa175 -
Microbiology Spectrum Aug 2023
Petabases of environmental metagenomic data are publicly available, presenting an opportunity to characterize complex environments and discover novel lineages of life. Metagenome coassembly, in which many metagenomic samples from an environment are simultaneously analyzed to infer the underlying genomes' sequences, is an essential tool for achieving this goal. We applied MetaHipMer2, a distributed metagenome assembler that runs on supercomputing clusters, to coassemble 3.4 terabases (Tbp) of metagenome data from a tropical soil in the Luquillo Experimental Forest (LEF), Puerto Rico. The resulting coassembly yielded 39 high-quality (>90% complete, <5% contaminated, with predicted 23S, 16S, and 5S rRNA genes and ≥18 tRNAs) metagenome-assembled genomes (MAGs), including two from a candidate phylum. Another 268 medium-quality (≥50% complete, <10% contaminated) MAGs were extracted, including members of three candidate phyla. In total, 307 medium- or higher-quality MAGs were assigned to 23 phyla, compared to 294 MAGs assigned to nine phyla in the same samples individually assembled. The low-quality (<50% complete, <10% contaminated) MAGs from the coassembly revealed a 49% complete rare biosphere microbe from the candidate phylum FCPU426 among other low-abundance microbes, an 81% complete fungal genome from the phylum Ascomycota, and 30 partial eukaryotic MAGs with ≥10% completeness, possibly representing protist lineages. A total of 22,254 viruses, many of them low abundance, were identified. Estimation of metagenome coverage and diversity indicates that we may have characterized ≥87.5% of the sequence diversity in this humid tropical soil and points to the value of future terabase-scale sequencing and coassembly of complex environments.
Petabases of reads are being produced by environmental metagenome sequencing. An essential step in analyzing these data is metagenome assembly, the computational reconstruction of genome sequences from microbial communities. "Coassembly" of metagenomic sequence data, in which multiple samples are assembled together, enables more complete detection of microbial genomes in an environment than "multiassembly," in which samples are assembled individually. To demonstrate the potential for coassembling terabases of metagenome data to drive biological discovery, we applied MetaHipMer2, a distributed metagenome assembler that runs on supercomputing clusters, to coassemble 3.4 Tbp of reads from a humid tropical soil environment. The resulting coassembly, its functional annotation, and analysis are presented here. The coassembly yielded more, and phylogenetically more diverse, microbial, eukaryotic, and viral genomes than the multiassembly of the same data. Our resource may facilitate the discovery of novel microbial biology in tropical soils and demonstrates the value of terabase-scale metagenome sequencing.
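The quality tiers quoted in this abstract can be written as a small classifier. The sketch below is an illustration in Python that uses only the thresholds stated in the abstract; the `Mag` record, its field names, the `quality_tier` function, and the "unclassified" label for MAGs at or above 10% contamination are hypothetical and not taken from the paper. It also assumes that "≥18 tRNAs" can be checked against a single count.

```python
from dataclasses import dataclass


@dataclass
class Mag:
    """Hypothetical record of the per-MAG statistics named in the abstract."""
    completeness: float    # percent, 0-100
    contamination: float   # percent
    has_23s: bool = False
    has_16s: bool = False
    has_5s: bool = False
    trna_count: int = 0


def quality_tier(mag):
    """Assign a tier using the thresholds quoted in the abstract."""
    if mag.contamination >= 10:
        # The abstract defines no tier here; this label is an assumption.
        return "unclassified"
    has_all_rrna = mag.has_23s and mag.has_16s and mag.has_5s
    if (mag.completeness > 90 and mag.contamination < 5
            and has_all_rrna and mag.trna_count >= 18):
        return "high"
    if mag.completeness >= 50:
        return "medium"
    return "low"


if __name__ == "__main__":
    example = Mag(95.0, 2.0, True, True, True, 20)
    print(quality_tier(example))
```

Under these rules, a nearly complete MAG that lacks one of the rRNA genes falls back to the medium tier, which is why the gene checks are separate from the completeness thresholds.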
Topics: Soil; Microbiota; Bacteria; Metagenome; Genome, Viral; Metagenomics
PubMed: 37310219
DOI: 10.1128/spectrum.00200-23 -
MSphere Nov 2020
Continued influx of metagenome-derived proteins with misannotated taxonomy into conventional databases, including RefSeq, threatens to eliminate the value of taxonomy identifiers. To prevent this, urgent efforts should be undertaken by submitters of metagenomic data sets as well as by database managers.
Topics: Algorithms; Databases, Genetic; Metagenome; Metagenomics; Proteins
PubMed: 33148820
DOI: 10.1128/mSphere.00854-20 -
Nature Jan 2022
Microbial genes encode the majority of the functional repertoire of life on Earth. However, despite increasing efforts in metagenomic sequencing of various habitats, little is known about the distribution of genes across the global biosphere, with implications for human and planetary health. Here we constructed a non-redundant gene catalogue of 303 million species-level genes (clustered at 95% nucleotide identity) from 13,174 publicly available metagenomes across 14 major habitats and used it to show that most genes are specific to a single habitat. The small fraction of genes found in multiple habitats is enriched in antibiotic-resistance genes and markers for mobile genetic elements. By further clustering these species-level genes into 32 million protein families, we observed that a small fraction of these families contains the majority of the genes (0.6% of families account for 50% of the genes). The majority of species-level genes and protein families are rare. Furthermore, species-level genes, and in particular the rare ones, show low rates of positive (adaptive) selection, supporting a model in which most genetic variability observed within each protein family is neutral or nearly neutral.
Topics: Anti-Bacterial Agents; Drug Resistance, Microbial; Ecosystem; Humans; Metagenome; Metagenomics
PubMed: 34912116
DOI: 10.1038/s41586-021-04233-4