Efficient Rule-Based Haplotyping Algorithms for Pedigree Data1

Jing Li2
jili@cs.ucr.edu
Faculty advisor: Prof. Tao Jiang
Dept of Computer Science & Engineering
University of California - Riverside
ACM membership number: 8011447

Abstract

We study haplotype reconstruction under the Mendelian law of inheritance and the minimum recombination principle on pedigree data. We prove that the problem of finding a minimum-recombinant haplotype configuration (MRHC) is in general NP-hard. This is the first complexity result concerning the problem to our knowledge. An iterative algorithm based on blocks of consecutive resolved marker loci (called block-extension) is proposed. It is very efficient and can be used for large pedigrees with a large number of markers, especially for those data sets requiring few recombinants (or recombination events). A polynomial-time exact algorithm for haplotype reconstruction without recombinants is also presented. This algorithm first identifies all the necessary constraints based on the Mendelian law and the zero recombinant assumption, and represents them using a system of linear equations over the cyclic group Z2. By using a simple method based on Gaussian elimination, we could obtain all possible feasible haplotype configurations. A C++ implementation of the block-extension algorithm, called PedPhase, has been tested on both simulated data and real data. The results show that the program performs very well on both types of data and will be useful for large scale haplotype inference projects. Due to space limit, the references, tables and figures are in a separate file. The full version of our paper is accepted by Journal of Bioinformatics & Computational Biology and is available at our website http://www.cs.ucr.edu/~jili/haplotyping.html . An extended abstract is to appear in RECOMB03 .

1  Haplotyping problem, background & related work

Genetic fine-mapping for complex diseases (such as cancers, diabetes, osteoporoses etc.) is currently a great challenge for geneticists and will continue to be so in the near future. With the availability of single nucleotide polymorphisms (SNPs) information, researchers see new potentials of genetic mapping. The ongoing genetic variation and haplotype map projects at the National Human Genome Research Institute (NHGRI) of the USA and the international HapMap project are focused on the discovery and typing of SNPs and development of high-resolution maps of genetic variation and haplotypes for human [4]. But, the use of haplotype maps has been limited due to the fact that the human genome is a diploid and, in practice, genotype data instead of haplotype data are collected directly, especially in large scale sequencing projects, because of cost considerations. Efficient and accurate computational methods for haplotype reconstruction from genotype data are still highly demanded.

We are interested in rule-based haplotype reconstruction methods on pedigrees. In a very recent paper [7], Qian and Beckmann proposed a rule-based algorithm to reconstruct haplotype configurations for pedigree data, based on the minimum recombination principle. (From now on, we call their algorithm MRH.) Given a pedigree and the genotype information about each member of the pedigree (with possibly missing data), the authors are interested in finding the haplotype configurations for each member such that the total number of recombinants (or recombination events) in the whole pedigree is minimized. We call the problem the Minimum-Recombinant Haplotype Configuration (MRHC) problem. Although the algorithm MRH in [7] performs very well for small pedigrees, its effectiveness scales very poorly because it runs extremely slowly on data of even moderate sizes, especially for data with biallelic markers. This is regrettable since large SNP data sets on pedigrees are becoming increasingly interesting and SNP markers are biallelic.

In this paper, we first show that the MRHC problem is in general NP-hard and then devise an efficient iterative (heuristic) algorithm (called block-extension) for MRHC. We also consider the special case of haplotype reconstruction where no recombinants are assumed. This special case is interesting not only because its solution may be useful for solving the general MRHC problem as a subroutine but also because the frequency of recombinants is expected to be close to 0 when a small (or moderately large) region of genomic DNA is considered in real data set [3]. Previously, only an exponential-time algorithm based on exhaustive enumeration was known [5]. A C++ implementation of the block-extension algorithm, called PedPhase, has been tested on both simulated data and real data. The results show that the program performs very well on both types of data and will be useful for large scale haplotype inference projects.

2  Methods

2.1  The MRHC problem and its NP-hardness

In this section, we first give a formal definition of the MRHC problem, and then prove that the problem is NP-hard even if the input pedigree data contains only two loci.

Definition 1 A pedigree graph is a connected directed acyclic graph (DAG) G = {V, E}, where V = M ÈF ÈN, M stands for the male nodes, F stands for the female nodes, N stands for the mating nodes, and E = {e = (u,v): u Î M ÈF and v Î N or u Î N and v Î M ÈF}. M ÈF are called the individual nodes. The in-degree of each individual node is at most 1. The in-degree of a mating node must be 2, with one edge starting from a male node (called father) and the other edge from a female node (called mother), and the out-degree of a mating node must be larger than zero.

In a pedigree, the individual nodes adjacent to a mating node (i.e. they have edges from the mating node) are called the children of the two individual nodes adjacent from the mating node (i.e. the father and mother nodes, which have edges to the mating node). The individual nodes that have no parents (in-degree is zero) are called founders. For each mating node, the induced subgraph containing the father, mother, mating, and child nodes is called a nuclear family. A parents-offspring trio consists of two parents and one of their children. A mating loop is a cycle in the graph if the directions of edges are ignored. Figure 1 shows an illustration of an example pedigree that highlights mating nodes on top of a conventional drawing of the pedigree (where the mating nodes are omitted). For convenience, we will use conventional drawings of pedigrees throughout the paper. Figure 2 illustrates an example where the paternal haplotype of member 3 is the result of a recombinant.  3

We use PS(parental source) to indicate which allele comes from which parent at each locus and we use GS (grand-parental source) to indicate if an allele at a PS-resolved locus comes from a grand-paternal allele or a grand-maternal allele.

Definition 2 A haplotype configuration of a pedigree is an assignment of the PS of each locus and the GS of each allele for each member of the pedigree that is consistent with the Mendelian law.

Definition 3 Given a pedigree graph and genotype information for each member of the pedigree, find a haplotype configuration for the pedigree that requires the minimum number of recombinants.

Unfortunately, we can prove that MRHC is NP-hard and thus does not have a polynomial-time algorithm, unless P = NP.

Theorem 1 MRHC is NP-hard.4

2.2  An efficient iterative algorithm for MRHC

Because of the NP-hardness of MRHC, it is unlikely to find an efficient algorithm to solve MRHC exactly. We thus propose an iterative heuristic algorithm for MRHC, called the block-extension algorithm. Here, a block means a consecutive sequence of resolved loci of some individual. The basic idea of the algorithm is to partition the loci in each member of the pedigree into blocks after applying the Mendelian law and some simple greedy strategy (such as avoiding double recombinants within a small region of loci). It then repeatedly uses the longest haplotype block in the pedigree to extend other blocks (in other members) by resolving unresolved loci based on the minimum recombination principle. This approach may potentially be more efficient than the algorithm MRH from [7], especially when the number of required recombinants is small, because multiple loci are resolved simultaneously instead of separately. More precisely, the block-extension algorithm is based on the following observations.

1) Long haplotype blocks are common in human genomes [2,3,4]. Few or zero recombinants are expected within these blocks.

2) Shared haplotype blocks among siblings are strong evidence that no recombination should occur on those siblings based on the assumption that recombinants are rare in a pedigree.

3) Double recombinants are very rare within a small region.

4) It is sometimes possible to determine that some individuals must involve recombinants. We may be better off to leave these individuals until the last.

The algorithm has 5 steps. The main steps of the algorithm are also illustrated by an example.

Step 1: Missing genotype imputing by the Mendelian law.

Step 2: PS and GS assignments by the Mendelian law.

Step 3: Greedy assignment of GS values.

Step 4: Block-extension.

Step 5: Finishing the remaining gaps.

Figure 3 illustrates the main steps of the above algorithm using a simple example. Since the input pedigree and genotype information, as given in (A), have no missing data, we skip step 1. (B) shows the result of step 2. In step 3, since we know the GS values of the alleles at the first locus of member 3-3, we assign the adjacent alleles (at loci 2 to 4) the same GS value. This forces locus 2 at both parents of member 3-3 to be PS-resolved. The result of step 3 is shown in (C). In step 4, we use the longest block (loci 1 to 4 in member 3-1) to help resolve the same region in member 3-4 (by consulting the corresponding region in member 3-2). This results in a longer block in member 3-4 as shown in (D). We then use the longest block in member 3-4 to help resolve loci 5 and 6 in members 3-1 and 3-2 and the blocks in members 3-1 and 3-2 to help resolve locus 5 in member 3-3, as shown in (E). In step 5, we consider the two alternative PS assignments for locus 7 in member 3-1. For each such assignment, we calculate the best PS assignments for locus 7 in the other members of the same nuclear family (i.e. members 3-2, 3-3, and 3-4) in order to minimize the number of recombinants required. The assignment shown in (F) turns out to be the best choice, and the final haplotype configuration is shown in (G), which happens to require one recombinant.

Theorem 2 Let n denote the size of the pedigree, m the number of loci, and d the largest number of children in a nuclear family. The block-extension algorithm runs in O(dmn) time.

2.3  A constraint-based algorithm for 0-recombinant data

As a heuristic, the above block-extension algorithm does not always compute an optimal solution for MRHC. For a special case (i.e. data can be interpreted with zero recombinants, which are common for organisms like human), we have an efficient (polynomial-time) algorithm to compute all possible haplotype assignments involving no recombinants. Observe that finding a haplotype configuration is equivalent to reducing the degree of freedom in the PS/GS assignment for each locus/allele in every individual. We may thus find all necessary constraints on the PS/GS assignments first so that the freedom left is really free (i.e. the choices will always lead to 0-recombinant haplotype configurations). Then we can simply enumerate all haplotype configurations satisfying the constraints.

We define four levels of constraints. The first level of constraints are the strongest and specify values for the involved alleles (i.e. these alleles are GS-resolved). The second level of constraints specify values for the involved loci (i.e. these loci are PS-resolved). The last two levels of constraints are concerned with two loci. The third level of constraints describe what alleles should be on the same haplotype in one member without specifying the actual PS values. For example, in Figure 4 (left), the PS-resolved loci in member 3 forces the parents have the same haplotype grouping in order to obtain a solution without recombinants, although we do not know the actual PS of the two haplotypes in the grouping. The forth level constraint are the weakest. Each level 4 constraint is concerned with the relationship between the haplotype groupings in some different individuals (i.e. parent-child or mates) at two loci, although it does not specify the actual haplotype grouping. For example, in Figure 4 (right), the three members of the parents-offspring trio should all have the same haplotype grouping in order not to incur any recombinant in the trio. But, we do not have information to determine which case must hold. All possible types of level 3 and level 4 constraints will be listed below.

For each locus i in member j, we define a binary variable xi,j to represent the PS value of the locus. The basic idea of our algorithm is to identify all the level 2-4 constraints based on the Mendelian law and the 0-recombinant assumption by examining every parents-offspring trio, and represent the constraints as linear equations on xi,j's over the cyclic group Z2. It then finds all feasible 0-recombinant solutions by solving the equations. More specifically, the level 2 constraints are collected locus by locus by examining every parents-offspring trio, which is the same as step 2 of the block-extension algorithm. In order to collect level 3 and level 4 constraints in the form of linear equations on the binary PS variables, we need to consider pairs of loci for each parents-offspring trio. Without distinguishing the two parents in a parents-offspring trio, there are essentially four types of level 3 constraints and three types of level 4 constraints as summarized in Tables 1 and 2. In the tables, x, y are the parents and z is the child. An * indicates any allele. x1 represents the binary PS variable for locus 1 in member x. These constraints are collected trio by trio. Definition 4 Given the level 2-4 constraints defined above, a consistent solution is an assignment of binary values to all the PS variables that satisfies every constraint.

Clearly, every feasible 0-recombinant solution is consistent. The following theorem shows that the converse is also true.

Theorem 3 Every consistent solution is a feasible 0-recombinant solution.

The above constraints form a system of (sparse) linear equations over the group Z2, which could be solved by the classical Gaussian elimination method running in cubic time (see e.g. [8]). Since we are dealing with Z2, a much simpler algorithm (adapted from Gaussian elimination) is presented in the full paper.

3  Results

We have implemented the above block-extension algorithm as a C++ program, PedPhase, which is available to the public upon request to either of the authors. To evaluate the performance of PedPhase, we compared PedPhase and MRH on simulated genotype data in terms of accuracy and efficiency using three different pedigree structures. The results show that PedPhase is comparable with MRH in terms of accuracy when the number of recombination events is small and is much faster than MRH on large dataset (large pedigree/large number of loci). We further compared the performance of PedPhase, MRH and an EM algorithm on a real data set that consists of 12 multi-generation pedigrees from a recent paper [3]. The results show that both rule-based haplotyping methods (PedPhase and MRH) could discover almost all common haplotypes (i.e. haplotypes with frequencies > 5%) that were inferred by the EM algorithm [3]. The haplotype frequencies estimated from the haplotype results of PedPhase and MRH (by simple counting) are also similar to the estimations by the EM algorithm. Most of the blocks ( > 85%) were found by PedPhase and MRH to have involved 0 recombinants. The assumptions and observations made in Section 2.2 and the minimum recombination principle are well supported by this real dataset.

4  Concluding remarks

Pedigrees with mating loops are not a big problem for both rule-based algorithms block-extension and MRH, although they usually cause troubles for statistical methods. But, we have observed in the experiments that for both of the rule-based algorithms, the results on the pedigree structure with a loop are slightly worse than the results on the other two pedigree structures without loops. More investigation is needed on this issue. Notice that the pedigree structure used in the proof of the NP-hardness of the MRHC problem is very complicated and requires a large number of recombinants. An interesting question is if there is an efficient algorithm for solving MRHC (exactly) on pedigrees that require only a small (fixed) number of recombinants.

Footnotes:

1An overview of Jing Li's current research for ACM SRC Grand Finals.

2Research supported by NSF grant CCR-9988353.

3The pedigree diagrams in this paper were generated using WPEDRAW [1].

4All proofs can be found in the full paper at website http://www.cs.ucr.edu/~ jili/haplotyping.htm.