-
Nature Methods Feb 2021
Haplotype-resolved de novo assembly is the ultimate solution to the study of sequence variations in a genome. However, existing algorithms either collapse heterozygous alleles into one consensus copy or fail to cleanly separate the haplotypes to produce high-quality phased assemblies. Here we describe hifiasm, a de novo assembler that takes advantage of long high-fidelity sequence reads to faithfully represent the haplotype information in a phased assembly graph. Unlike other graph-based assemblers that only aim to maintain the contiguity of one haplotype, hifiasm strives to preserve the contiguity of all haplotypes. This feature enables the development of a graph trio binning algorithm that greatly advances over standard trio binning. On three human and five nonhuman datasets, including California redwood with a ~30-Gb hexaploid genome, we show that hifiasm frequently delivers better assemblies than existing tools and consistently outperforms others on haplotype-resolved assembly.
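For orientation, the "standard trio binning" that the abstract contrasts with can be sketched as follows. This is a toy illustration and not hifiasm's graph trio binning: it assigns each child read to a parent by counting parent-specific k-mers. The function names, the tie rule, and the tiny reads are all assumptions made for the example, and reverse complements and sequencing errors are ignored.

```python
def kmers(seq, k):
    """Return the set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def parent_specific_kmers(paternal_reads, maternal_reads, k):
    """Return (paternal-only, maternal-only) k-mer sets."""
    pat = set().union(*(kmers(r, k) for r in paternal_reads))
    mat = set().union(*(kmers(r, k) for r in maternal_reads))
    return pat - mat, mat - pat


def bin_read(read, pat_only, mat_only, k):
    """Assign a child read to the parent whose specific k-mers it shares most."""
    ks = kmers(read, k)
    p = len(ks & pat_only)
    m = len(ks & mat_only)
    if p > m:
        return "paternal"
    if m > p:
        return "maternal"
    return "unassigned"  # no informative k-mers, or a tie


# Hypothetical toy data: the parents differ at a single base.
pat_only, mat_only = parent_specific_kmers(["ACGTACGTAA"], ["ACGTTCGTAA"], 4)
for read in ["GTACGT", "GTTCGT", "ACGT"]:
    print(read, bin_read(read, pat_only, mat_only, 4))
```

Reads that carry no parent-specific k-mers stay unassigned, which is one reason binning whole reads before assembly can lose information in homozygous regions.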
Topics: Algorithms; Genome; Haplotypes; Sequence Analysis, DNA
PubMed: 33526886
DOI: 10.1038/s41592-020-01056-5
-
Plant Biotechnology Journal Jun 2022
Review
Genome phasing is a recently developed assembly method that separates heterozygous eukaryotic genomic regions and builds haplotype-resolved assemblies. Because differences between haplotypes are ignored in most published de novo genomes, assemblies are available as consensus genomes consisting of haplotype mixtures, thus increasing the need for genome phasing. Here, we review the operating principles and characteristics of several freely available and widely used phasing tools (TrioCanu, FALCON-Phase, and ALLHiC). An examination of downstream analyses using haplotype-resolved genome assemblies in plants indicated significant differences among haplotypes regarding chromosomal rearrangements, sequence insertions, and expression of specific alleles that contribute to the acquisition of the biological characteristics of plant species. Finally, we suggest directions for addressing the limitations of current genome-phasing methods. This review provides insights into the current progress, limitations, and future directions of de novo genome phasing, which will enable researchers to easily access and utilize genome phasing in studies involving highly heterozygous complex plant genomes.
Topics: Alleles; Genome, Plant; Genomics; Haplotypes; Plants; Sequence Analysis, DNA
PubMed: 35332665
DOI: 10.1111/pbi.13815
-
Molecular Ecology Mar 2023
The term "haplotype block" is commonly used in the developing field of haplotype-based inference methods. We argue that the term should be defined based on the structure of the Ancestral Recombination Graph (ARG), which contains complete information on the ancestry of a sample. We use simulated examples to demonstrate key features of the relationship between haplotype blocks and ancestral structure, emphasizing the stochasticity of the processes that generate them. Even the simplest cases of neutrality or of a "hard" selective sweep produce a rich structure, often missed by commonly used statistics. We highlight a number of novel methods for inferring haplotype structure, based on the full ARG, or on a sequence of trees, and illustrate how they can be used to define haplotype blocks using an empirical data set. While the advent of new, computationally efficient methods makes it possible to apply these concepts broadly, they (and additional new methods) could benefit from adding features to explore haplotype blocks, as we define them. Understanding and applying the concept of the haplotype block will be essential to fully exploit long and linked-read sequencing technologies.
Topics: Haplotypes; Algorithms; Models, Genetic
PubMed: 36433653
DOI: 10.1111/mec.16793
-
Trends in Ecology & Evolution Mar 2020
Review
The particular combinations of alleles that define haplotypes along individual chromosomes can be determined with increasing ease and accuracy by using current sequencing technologies. Beyond allele frequencies, haplotype data collected in population samples contain information about the history of allelic associations in gene genealogies, and this is of tremendous potential for conservation genomics. We provide an overview of how haplotype information can be used to assess historical demography, gene flow, selection, and the evolutionary outcomes of hybridization across different timescales relevant to conservation issues. We address technical aspects of applying such approaches to nonmodel species. We conclude that there is much to be gained by integrating haplotype-based analyses in future conservation genomics studies.
Topics: Alleles; Gene Flow; Gene Frequency; Genomics; Haplotypes
PubMed: 31810774
DOI: 10.1016/j.tree.2019.10.012
-
The Plant Journal : For Cell and... Jan 2023
To improve our understanding of genetic mechanisms underlying complex traits in plants, a comprehensive analysis of gene variants is required. Eucalyptus is an important forest plantation genus that is highly outbred. Trait dissection and molecular breeding in eucalypts currently rely on biallelic single-nucleotide polymorphism (SNP) markers. These markers fail to capture the large amount of haplotype diversity in these species, and thus multi-allelic markers are required. We aimed to develop a gene-based haplotype mining panel for Eucalyptus species. We generated 17 999 oligonucleotide probe sets for targeted sequencing of selected regions of 6293 genes implicated in growth and wood properties, pest and disease resistance, and abiotic stress responses. We identified and phased 195 834 SNPs using a read-based phasing approach to reveal SNP-based haplotypes. A total of 8915 target regions (at 4637 gene loci) passed tests for Mendelian inheritance. We evaluated the haplotype panel in four Eucalyptus species (E. grandis, E. urophylla, E. dunnii and E. nitens) to determine its ability to capture diversity across eucalypt species. This revealed an average of 3.13-4.52 haplotypes per target region in each species, and 33.36% of the identified haplotypes were shared by at least two species. This haplotype mining panel will enable the analysis of haplotype diversity within and between species, and provide multi-allelic markers that can be used for genome-wide association studies and gene-based breeding approaches.
Topics: Haplotypes; Eucalyptus; Genome-Wide Association Study; Plant Breeding; Phenotype; Polymorphism, Single Nucleotide
PubMed: 36394447
DOI: 10.1111/tpj.16026
-
Medical Principles and Practice :... 2020
OBJECTIVE
The aim of this study was to assess the HLA haplotype frequencies and genetic profiles of the Kuwaiti population.
MATERIALS AND METHODS
Whole venous blood was obtained from 595 healthy, unrelated Kuwaiti volunteers. The study population was genotyped for HLA class I (HLA-A, HLA-B, and HLA-C) and class II (HLA-DRB1 and HLA-DQB1) loci using sequence-specific oligonucleotide (SSO) probe-based hybridization and high-resolution HLA genotyping. Haplotype frequencies were estimated using an implementation of the expectation maximization algorithm that resolves both phase and allelic ambiguity. The Kuwaiti population was compared with other populations from the US National Marrow Donor Program (NMDP), by running a principal component analysis (PCA) on the relevant haplotype frequencies.
RESULTS
The most common HLA class I alleles in Kuwait were HLA-A*02:01g, HLA-C*06:02g, and HLA-B*50:01g with frequencies of 16, 14, and 12%, respectively. The most common HLA class II alleles in Kuwait were HLA-DQB1*02:01g and HLA-DRB1*07:01 with frequencies of 29.7 and 16.5%, respectively. The most common Kuwaiti haplotype observed was HLA-A*02:01g∼HLA-C*06:02g∼HLA-B*50:01g∼HLA-DRB1*07:01∼HLA-DQB1*02:01g at a frequency of 2.3%. The PCA demonstrated close genetic proximity of the Kuwaiti population with Middle Eastern, Southeast Asian, and North African populations in the NMDP.
CONCLUSION
Identifying the haplotype diversity in the Kuwaiti population will contribute to the selection of an HLA-match for HSCT, disease associations, pharmacogenomics, and knowledge of population HLA diversity.
Topics: Genetic Profile; Genetic Variation; HLA Antigens; Haplotypes; Humans; Kuwait
PubMed: 30870850
DOI: 10.1159/000499593
-
Genetics Sep 2019
The population-genetic statistic F_ST is used widely to describe allele frequency distributions in subdivided populations. The increasing availability of DNA sequence data has recently enabled computations of F_ST from sequence-based "haplotype loci." At the same time, theoretical work has revealed that F_ST has a strong dependence on the underlying genetic diversity of a locus from which it is computed, with high diversity constraining values of F_ST to be low. In the case of haplotype loci, for which two haplotypes that are distinct over a specified length along a chromosome are treated as distinct alleles, genetic diversity is influenced by haplotype length: longer haplotype loci have the potential for greater genetic diversity. Here, we study the dependence of F_ST on haplotype length. Using a model in which a haplotype locus is sequentially incremented by one biallelic locus at a time, we show that increasing the length of the haplotype locus can either increase or decrease the value of F_ST, and usually decreases it. We compute F_ST on haplotype loci in human populations, finding a close correspondence between the observed values and our theoretical predictions. We conclude that effects of haplotype length are valuable to consider when interpreting F_ST calculated on haplotypic data.
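The effect described above can be illustrated with a small calculation. The sketch below assumes the statistic in question is F_ST and uses the common heterozygosity form F_ST = (H_T - H_S) / H_T with equally weighted subpopulations; the paper's exact estimator may differ. The allele frequencies are invented for the example.

```python
def heterozygosity(freqs):
    """Expected heterozygosity 1 - sum(p_i^2) for one population."""
    return 1.0 - sum(p * p for p in freqs)


def fst(subpop_freqs):
    """F_ST = (H_T - H_S) / H_T for equally weighted subpopulations.

    subpop_freqs: one allele-frequency list per subpopulation,
    all listing the alleles in the same order.
    """
    k = len(subpop_freqs)
    n_alleles = len(subpop_freqs[0])
    mean = [sum(pop[a] for pop in subpop_freqs) / k for a in range(n_alleles)]
    h_s = sum(heterozygosity(pop) for pop in subpop_freqs) / k  # within
    h_t = heterozygosity(mean)                                  # total
    return (h_t - h_s) / h_t


# One biallelic SNP with strongly diverged frequencies (hypothetical values).
one_snp = [[0.9, 0.1], [0.1, 0.9]]

# The same SNP extended by a second SNP at frequency 0.5 in both populations,
# in linkage equilibrium: four haplotype "alleles" instead of two.
two_snp = [[0.45, 0.45, 0.05, 0.05], [0.05, 0.05, 0.45, 0.45]]

print(round(fst(one_snp), 2))  # 0.64
print(round(fst(two_snp), 2))  # 0.21
```

In this toy case, lengthening the haplotype locus adds diversity within each population without adding differentiation, so the value drops, which matches the "usually decreases it" case in the abstract.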
Topics: Gene Frequency; Genome-Wide Association Study; Haplotypes; Humans; Linkage Disequilibrium; Models, Genetic; Polymorphism, Single Nucleotide
PubMed: 31285255
DOI: 10.1534/genetics.119.302430
-
Human Immunology Sep 2019
Review
The highly polymorphic classical human leukocyte antigen (HLA) genes display strong linkage disequilibrium (LD) that results in conserved multi-locus haplotypes. For unrelated individuals in defined populations, HLA haplotype frequencies can be estimated using the expectation-maximization (EM) method. Haplotypes can also be constructed using HLA allele segregation from nuclear families. It is straightforward to identify many HLA genotyping inconsistencies by visually reviewing HLA allele segregation in family members. It is also possible to identify potential crossover events when two or more children are available in a nuclear family. This process of visual inspection can be unwieldy, and we developed the "HaplObserve" program to standardize the process and automatically build haplotypes using family-based HLA allele segregation. HaplObserve facilitates systematically building haplotypes, and reporting potential crossover events. HLA Haplotype Validator (HLAHapV) is a program originally developed to impute chromosomal phase from genotype data using reference haplotype data. We updated and adapted HLAHapV to systematically compare observed and estimated haplotypes. We also used HLAHapV to identify haplotypes when uninformative HLA genotypes are present in families. Finally, we developed "pould", an R package that calculates haplotype frequencies, and estimates standard measures of global (locus-level) LD from both observed and estimated haplotypes.
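The expectation-maximization idea mentioned above can be sketched for the simplest possible case. This is a minimal illustration, assuming two biallelic loci and unphased genotypes coded as alternate-allele counts (0, 1, 2); real HLA loci are highly multi-allelic, and none of the programs named in the abstract are reproduced here. The function name and the toy genotypes are hypothetical.

```python
def em_haplotype_freqs(genotypes, n_iter=100):
    """Estimate frequencies of haplotypes 00, 01, 10, 11 by EM.

    genotypes: list of (g1, g2) pairs, each the count of the alternate
    allele (0, 1 or 2) at locus 1 and locus 2 for one individual.
    """
    freqs = [0.25] * 4  # index = 2 * allele_at_locus1 + allele_at_locus2
    n = len(genotypes)
    for _ in range(n_iter):
        counts = [0.0] * 4
        for g1, g2 in genotypes:
            if g1 == 1 and g2 == 1:
                # E-step: a double heterozygote is either 00/11 or 01/10;
                # split it according to the current frequency estimates.
                cis = freqs[0] * freqs[3]
                trans = freqs[1] * freqs[2]
                w = cis / (cis + trans) if cis + trans > 0 else 0.5
                counts[0] += w
                counts[3] += w
                counts[1] += 1 - w
                counts[2] += 1 - w
            else:
                # At most one locus is heterozygous, so phase is known.
                a = [0, 0] if g1 == 0 else [1, 1] if g1 == 2 else [0, 1]
                b = [0, 0] if g2 == 0 else [1, 1] if g2 == 2 else [0, 1]
                for x, y in zip(a, b):
                    counts[2 * x + y] += 1
        # M-step: each individual contributes two haplotypes.
        freqs = [c / (2 * n) for c in counts]
    return freqs


# Hypothetical sample: homozygotes carry only 00 or 11, so the two double
# heterozygotes are resolved almost entirely as 00/11.
sample = [(0, 0)] * 4 + [(2, 2)] * 4 + [(1, 1)] * 2
print([round(f, 3) for f in em_haplotype_freqs(sample)])
```

Family-based segregation, as described in the abstract, avoids this statistical step wherever the parents' genotypes determine phase directly.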
Topics: Alleles; Child; Gene Frequency; Genetic Loci; HLA Antigens; Haplotypes; Heterozygote; Humans; Linkage Disequilibrium; Nuclear Family; Pedigree; Software
PubMed: 30735756
DOI: 10.1016/j.humimm.2019.01.010
-
Genetics Jul 2014
A novel haplotype association method is presented, and its power is demonstrated. Relying on a statistical model for linkage disequilibrium (LD), the method first infers ancestral haplotypes and their loadings at each marker for each individual. The loadings are then used to quantify local haplotype sharing between individuals at each marker. A statistical model was developed to link the local haplotype sharing and phenotypes to test for association. We devised a novel method to fit the LD model, reducing the complexity from putatively quadratic to linear (in the number of ancestral haplotypes). Therefore, the LD model can be fitted to all study samples simultaneously, and, consequently, our method is applicable to big data sets. Compared to existing haplotype association methods, our method integrated out phase uncertainty, avoided arbitrariness in specifying haplotypes, and had the same number of tests as the single-SNP analysis. We applied our method to data from the Wellcome Trust Case Control Consortium and discovered eight novel associations between seven gene regions and five disease phenotypes. Among these, GRIK4, which encodes a protein that belongs to the glutamate-gated ionic channel family, is strongly associated with both coronary artery disease and rheumatoid arthritis. A software package implementing methods described in this article is freely available at http://www.haplotype.org.
Topics: Algorithms; Alleles; Bayes Theorem; Case-Control Studies; Computer Simulation; Databases, Genetic; Genetic Association Studies; Genetic Predisposition to Disease; Haplotypes; Humans; Linkage Disequilibrium; Models, Genetic; Phenotype; Polymorphism, Single Nucleotide
PubMed: 24812308
DOI: 10.1534/genetics.114.164814
-
Bioinformatics (Oxford, England) Jan 2020
MOTIVATION
The variation graph toolkit (VG) represents genetic variation as a graph. Although each path in the graph is a potential haplotype, most paths are non-biological, unlikely recombinations of true haplotypes.
RESULTS
We augment the VG model with haplotype information to identify which paths are more likely to exist in nature. For this purpose, we develop a scalable implementation of the graph extension of the positional Burrows-Wheeler transform. We demonstrate the scalability of the new implementation by building a whole-genome index of the 5008 haplotypes of the 1000 Genomes Project, and an index of all 108 070 Trans-Omics for Precision Medicine Freeze 5 chromosome 17 haplotypes. We also develop an algorithm for simplifying variation graphs for k-mer indexing without losing any k-mers in the haplotypes.
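As background for the transform named above, the ordinary (non-graph) positional Burrows-Wheeler transform can be sketched in a few lines. This is not the graph extension the authors implement, and it is not their code: it assumes a plain matrix of biallelic haplotypes and only shows how haplotypes are re-sorted site by site so that those sharing a long recent match end up adjacent. The function name is hypothetical.

```python
def pbwt(haplotypes):
    """Build positional prefix orders and permuted columns.

    haplotypes: list of equal-length lists of 0/1 alleles.
    Returns (orders, columns), where orders[k] lists haplotype indices
    sorted by their reversed prefix before site k, and columns[k] is
    site k read in that order.
    """
    m = len(haplotypes)
    n_sites = len(haplotypes[0]) if m else 0
    order = list(range(m))
    orders = [order]
    columns = []
    for k in range(n_sites):
        columns.append([haplotypes[i][k] for i in order])
        # Stable partition: zeros first, ones after, keeping relative order.
        zeros = [i for i in order if haplotypes[i][k] == 0]
        ones = [i for i in order if haplotypes[i][k] == 1]
        order = zeros + ones
        orders.append(order)
    return orders, columns


# Four hypothetical haplotypes over three sites.
haps = [[0, 1, 0], [1, 1, 0], [0, 0, 1], [1, 1, 1]]
orders, columns = pbwt(haps)
print(orders)   # [[0, 1, 2, 3], [0, 2, 1, 3], [2, 0, 1, 3], [0, 1, 2, 3]]
print(columns)  # [[0, 1, 0, 1], [1, 0, 1, 1], [1, 0, 0, 1]]
```

Because similar haplotypes cluster together, the permuted columns tend to contain long runs of identical values, which is what makes the representation compressible at the scale of thousands of haplotypes.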
AVAILABILITY AND IMPLEMENTATION
Our software is available at https://github.com/vgteam/vg, https://github.com/jltsiren/gbwt and https://github.com/jltsiren/gcsa2.
SUPPLEMENTARY INFORMATION
Supplementary data are available at Bioinformatics online.
Topics: Algorithms; Genome; Haplotypes; Sequence Analysis, DNA; Software
PubMed: 31406990
DOI: 10.1093/bioinformatics/btz575