We
study haplotype reconstruction under the Mendelian law of
inheritance and the minimum recombination principle on pedigree
data. We prove that the problem of finding a minimum-recombinant
haplotype configuration (MRHC) is in general NP-hard. This is the first
complexity result concerning the problem to our knowledge. An iterative
algorithm based on blocks of consecutive resolved marker loci (called block-extension)
is proposed. It is very efficient and can be used for large pedigrees with a
large number of markers, especially for those data sets requiring few
recombinants (or recombination events). A polynomial-time exact algorithm for
haplotype reconstruction without recombinants is also presented. This algorithm
first identifies all the necessary constraints based on the Mendelian law and
the zero recombinant assumption, and represents them using a system of linear
equations over the cyclic group Z2. By using a simple method based
on Gaussian elimination, we could obtain all possible feasible haplotype
configurations. A C++ implementation of the block-extension algorithm, called PedPhase,
has been tested on both simulated data and real data. The results show that the
program performs very well on both types of data and will be useful for large
scale haplotype inference projects. Due to space limit, the references, tables
and figures are in a separate file. The full
version of our paper is accepted by Journal of
Bioinformatics & Computational Biology and is available at our website http://www.cs.ucr.edu/~jili/haplotyping.html
. An extended abstract is to appear in RECOMB03 .
Genetic fine-mapping for complex diseases (such as
cancers, diabetes, osteoporoses etc.) is currently a great challenge for
geneticists and will continue to be so in the near future. With the
availability of single nucleotide polymorphisms (SNPs) information,
researchers see new potentials of genetic mapping. The ongoing genetic
variation and haplotype map projects at the National Human Genome Research
Institute (NHGRI) of the USA and the international HapMap project are focused
on the discovery and typing of SNPs and development of high-resolution maps of
genetic variation and haplotypes for human [4].
But, the use of haplotype maps has been limited due to the fact that the human
genome is a diploid and, in practice, genotype data instead
of haplotype data are collected directly, especially in large scale
sequencing projects, because of cost considerations. Efficient and accurate
computational methods for haplotype reconstruction from genotype data are still
highly demanded.
We are interested in
rule-based haplotype reconstruction methods on pedigrees. In a very recent
paper [7], Qian and Beckmann
proposed a rule-based algorithm to reconstruct haplotype configurations for
pedigree data, based on the minimum recombination principle. (From now on, we
call their algorithm MRH.) Given a pedigree and the genotype information about
each member of the pedigree (with possibly missing data), the authors are
interested in finding the haplotype configurations for each member such that
the total number of recombinants (or recombination events) in the whole
pedigree is minimized. We call the problem the Minimum-Recombinant
Haplotype Configuration (MRHC) problem. Although the algorithm MRH
in [7] performs very well for small
pedigrees, its effectiveness scales very poorly because it runs extremely
slowly on data of even moderate sizes, especially for data with biallelic
markers. This is regrettable since large SNP data sets on pedigrees are
becoming increasingly interesting and SNP markers are biallelic.
In this paper, we first show that the MRHC problem is in general NP-hard and then devise an efficient iterative (heuristic) algorithm (called block-extension) for MRHC. We also consider the special case of haplotype reconstruction where no recombinants are assumed. This special case is interesting not only because its solution may be useful for solving the general MRHC problem as a subroutine but also because the frequency of recombinants is expected to be close to 0 when a small (or moderately large) region of genomic DNA is considered in real data set [3]. Previously, only an exponential-time algorithm based on exhaustive enumeration was known [5]. A C++ implementation of the block-extension algorithm, called PedPhase, has been tested on both simulated data and real data. The results show that the program performs very well on both types of data and will be useful for large scale haplotype inference projects.
In
this section, we first give a formal definition of the MRHC problem, and then
prove that the problem is NP-hard even if the input pedigree data contains only
two loci.
Definition 1 A pedigree graph is a connected directed acyclic graph (DAG) G =
{V, E}, where V = M ÈF ÈN, M stands for the male nodes, F stands for the
female nodes, N stands for the mating nodes, and E = {e = (u,v): u Î M ÈF and v Î N or u Î N and v Î M ÈF}. M ÈF are called the individual nodes. The in-degree of
each individual node is at most 1. The in-degree of a mating node must be 2,
with one edge starting from a male node (called father) and the other edge from
a female node (called mother), and the out-degree of a mating node must be
larger than zero.
In a pedigree, the
individual nodes adjacent to a mating node (i.e. they have
edges from the mating node) are called the children of the two
individual nodes adjacent from the mating node (i.e. the father
and mother nodes, which have edges to the mating node). The individual nodes
that have no parents (in-degree is zero) are called founders. For each
mating node, the induced subgraph containing the father, mother, mating, and
child nodes is called a nuclear family. A parents-offspring trio
consists of two parents and one of their children. A mating loop is a
cycle in the graph if the directions of edges are ignored. Figure 1 shows an illustration of an example pedigree that highlights mating
nodes on top of a conventional drawing of the pedigree (where the mating nodes
are omitted). For convenience, we will use conventional drawings of pedigrees
throughout the paper. Figure 2 illustrates an example where the paternal haplotype of member 3 is the
result of a recombinant. 3
We use PS(parental
source) to indicate which allele comes from which parent at each locus and
we use GS (grand-parental source) to indicate if an allele at a
PS-resolved locus comes from a grand-paternal allele or a grand-maternal
allele.
Definition 2 A haplotype configuration of a pedigree is an assignment of the PS
of each locus and the GS of each allele for each member of the pedigree that is
consistent with the Mendelian law.
Definition 3 Given a pedigree graph and genotype information for each member of
the pedigree, find a haplotype configuration for the pedigree that requires the
minimum number of recombinants.
Unfortunately, we can prove
that MRHC is NP-hard and thus does not have a polynomial-time algorithm, unless
P = NP.
Theorem 1 MRHC is NP-hard.4
Because
of the NP-hardness of MRHC, it is unlikely to find an efficient algorithm to
solve MRHC exactly. We thus propose an iterative heuristic algorithm for MRHC,
called the block-extension algorithm. Here, a block means a
consecutive sequence of resolved loci of some individual. The basic idea of the
algorithm is to partition the loci in each member of the pedigree into blocks
after applying the Mendelian law and some simple greedy strategy (such as
avoiding double recombinants within a small region of loci). It then repeatedly
uses the longest haplotype block in the pedigree to extend other blocks (in
other members) by resolving unresolved loci based on the minimum recombination
principle. This approach may potentially be more efficient than the algorithm
MRH from [7], especially when the number of required recombinants is small,
because multiple loci are resolved simultaneously instead of separately. More
precisely, the block-extension algorithm is based on the following
observations.
1) Long haplotype blocks are common in human genomes [2,3,4]. Few or zero recombinants
are expected within these blocks.
2) Shared haplotype blocks
among siblings are strong evidence that no recombination should occur on those
siblings based on the assumption that recombinants are rare in a pedigree.
3) Double recombinants are
very rare within a small region.
4) It is sometimes possible
to determine that some individuals must involve recombinants. We may be better
off to leave these individuals until the last.
The algorithm has 5 steps.
The main steps of the algorithm are also illustrated by an example.
Step 1: Missing genotype imputing by the Mendelian law.
Step 2: PS and GS assignments by the Mendelian law.
Step 3: Greedy assignment of GS values.
Step 4: Block-extension.
Step 5: Finishing the remaining gaps.
Figure 3 illustrates the main steps
of the above algorithm using a simple example. Since the input pedigree and
genotype information, as given in (A), have no missing data, we skip step 1.
(B) shows the result of step 2. In step 3, since we know the GS values of the
alleles at the first locus of member 3-3, we assign the adjacent alleles (at
loci 2 to 4) the same GS value. This forces locus 2 at both parents of member
3-3 to be PS-resolved. The result of step 3 is shown in (C). In step 4, we use
the longest block (loci 1 to 4 in member 3-1) to help resolve the same region
in member 3-4 (by consulting the corresponding region in member 3-2). This
results in a longer block in member 3-4 as shown in (D). We then use the
longest block in member 3-4 to help resolve loci 5 and 6 in members 3-1 and 3-2
and the blocks in members 3-1 and 3-2 to help resolve locus 5 in member 3-3, as
shown in (E). In step 5, we consider the two alternative PS assignments for
locus 7 in member 3-1. For each such assignment, we calculate the best PS
assignments for locus 7 in the other members of the same nuclear family (i.e.
members 3-2, 3-3, and 3-4) in order to minimize the number of recombinants
required. The assignment shown in (F) turns out to be the best choice, and the
final haplotype configuration is shown in (G), which happens to require one
recombinant.
Theorem 2 Let n denote the size of the pedigree, m the number of loci, and d
the largest number of children in a nuclear family. The block-extension
algorithm runs in O(dmn) time.
As
a heuristic, the above block-extension algorithm does not always compute an
optimal solution for MRHC. For a special case (i.e. data can be
interpreted with zero recombinants, which are common for organisms like human),
we have an efficient (polynomial-time) algorithm to compute all possible
haplotype assignments involving no recombinants. Observe that finding a
haplotype configuration is equivalent to reducing the degree of freedom in the
PS/GS assignment for each locus/allele in every individual. We may thus find
all necessary constraints on the PS/GS assignments first so that the
freedom left is really free (i.e. the choices will always lead
to 0-recombinant haplotype configurations). Then we can simply enumerate all
haplotype configurations satisfying the constraints.
We define four levels of
constraints. The first level of constraints are the strongest and specify
values for the involved alleles (i.e. these alleles are
GS-resolved). The second level of constraints specify values for the involved loci
(i.e. these loci are PS-resolved). The last two levels of constraints
are concerned with two loci. The third level of constraints describe
what alleles should be on the same haplotype in one member without
specifying the actual PS values. For example, in Figure 4 (left), the PS-resolved
loci in member 3 forces the parents have the same haplotype grouping in order
to obtain a solution without recombinants, although we do not know the actual
PS of the two haplotypes in the grouping. The forth level constraint are the
weakest. Each level 4 constraint is concerned with the relationship between the
haplotype groupings in some different individuals (i.e.
parent-child or mates) at two loci, although it does not specify the actual
haplotype grouping. For example, in Figure 4 (right), the three members of the parents-offspring trio should all
have the same haplotype grouping in order not to incur any recombinant in the
trio. But, we do not have information to determine which case must hold. All
possible types of level 3 and level 4 constraints will be listed below.
For each locus i in member
j, we define a binary variable xi,j to represent the PS value of the
locus. The basic idea of our algorithm is to identify all the level 2-4
constraints based on the Mendelian law and the 0-recombinant assumption by
examining every parents-offspring trio, and represent the constraints as linear
equations on xi,j's over the cyclic group Z2. It then
finds all feasible 0-recombinant solutions by solving the equations. More
specifically, the level 2 constraints are collected locus by locus by examining
every parents-offspring trio, which is the same as step 2 of the
block-extension algorithm. In order to collect level 3 and level 4 constraints
in the form of linear equations on the binary PS variables, we need to consider
pairs of loci for each parents-offspring trio. Without distinguishing the two
parents in a parents-offspring trio, there are essentially four types of level
3 constraints and three types of level 4 constraints as summarized in
Tables 1 and 2. In the tables, x, y are
the parents and z is the child. An * indicates any allele. x1
represents the binary PS variable for locus 1 in member x. These constraints
are collected trio by trio. Definition 4 Given the level 2-4
constraints defined above, a consistent solution is an assignment of binary
values to all the PS variables that satisfies every constraint.
Clearly, every feasible
0-recombinant solution is consistent. The following theorem shows that the
converse is also true.
Theorem 3 Every consistent solution is a feasible
0-recombinant solution.
The above constraints form a
system of (sparse) linear equations over the group Z2, which could
be solved by the classical Gaussian elimination method running in cubic time
(see e.g. [8]). Since we are dealing
with Z2, a much simpler algorithm (adapted from Gaussian
elimination) is presented in the full paper.
We
have implemented the above block-extension algorithm as a C++ program, PedPhase,
which is available to the public upon request to either of the authors. To
evaluate the performance of PedPhase, we compared PedPhase and MRH on simulated
genotype data in terms of accuracy and efficiency using three different
pedigree structures. The results show that PedPhase is comparable with MRH in
terms of accuracy when the number of recombination events is small and is much
faster than MRH on large dataset (large pedigree/large number of loci). We
further compared the performance of PedPhase, MRH and an EM algorithm on a real
data set that consists of 12 multi-generation pedigrees from a recent paper [3]. The results show that
both rule-based haplotyping methods (PedPhase and MRH) could discover almost
all common haplotypes (i.e. haplotypes with frequencies >
5%) that were inferred by the EM algorithm [3]. The haplotype frequencies
estimated from the haplotype results of PedPhase and MRH (by simple counting)
are also similar to the estimations by the EM algorithm. Most of the blocks (
> 85%) were found by PedPhase and MRH to have involved 0 recombinants. The
assumptions and observations made in Section 2.2
and the minimum recombination principle are well supported by this real
dataset.
Pedigrees with mating loops are not a big problem for both rule-based algorithms block-extension and MRH, although they usually cause troubles for statistical methods. But, we have observed in the experiments that for both of the rule-based algorithms, the results on the pedigree structure with a loop are slightly worse than the results on the other two pedigree structures without loops. More investigation is needed on this issue. Notice that the pedigree structure used in the proof of the NP-hardness of the MRHC problem is very complicated and requires a large number of recombinants. An interesting question is if there is an efficient algorithm for solving MRHC (exactly) on pedigrees that require only a small (fixed) number of recombinants.
1An overview of
Jing Li's current research for ACM SRC Grand Finals.
2Research
supported by NSF grant CCR-9988353.
3The pedigree
diagrams in this paper were generated using WPEDRAW [1].
4All proofs can
be found in the full paper at website http://www.cs.ucr.edu/~
jili/haplotyping.htm.