# Joint Mapping of Quantitative Trait Loci for Multiple Binary Characters

^{*}Department of Botany and Plant Sciences, University of California, Riverside, California 92521^{†}Chinese Academy of Agricultural Sciences, Beijing 100081, China

- 1
*Corresponding author:*Department of Botany and Plant Sciences, University of California, Riverside, CA 92521. E-mail: xu{at}genetics.ucr.edu

## Abstract

Joint mapping for multiple quantitative traits has shed new light on genetic mapping by pinpointing pleiotropic effects and close linkage. Joint mapping also can improve statistical power of QTL detection. However, such a joint mapping procedure has not been available for discrete traits. Most disease resistance traits are measured as one or more discrete characters. These discrete characters are often correlated. Joint mapping for multiple binary disease traits may provide an opportunity to explore pleiotropic effects and increase the statistical power of detecting disease loci. We develop a maximum-likelihood method for mapping multiple binary traits. We postulate a set of multivariate normal disease liabilities, each contributing to the phenotypic variance of one disease trait. The underlying liabilities are linked to the binary phenotypes through some underlying thresholds. The new method actually maps loci for the variation of multivariate normal liabilities. As a result, we are able to take advantage of existing methods of joint mapping for quantitative traits. We treat the multivariate liabilities as missing values so that an expectation-maximization (EM) algorithm can be applied here. We also extend the method to joint mapping for both discrete and continuous traits. Efficiency of the method is demonstrated using simulated data. We also apply the new method to a set of real data and detect several loci responsible for blast resistance in rice.

MULTIPLE traits are measured virtually in all line-crossing experiments of QTL mapping. Yet, almost all data collected for multiple traits are analyzed separately for different traits. Joint analysis for multiple traits has shed new light in QTL mapping by improving the statistical power of QTL detection and increasing the accuracy of QTL localization when different traits segregating in the mapping population are genetically related. Joint analysis for multiple traits is defined as a method that includes all traits simultaneously in a single model, rather than analyzing one trait at a time and reporting the results in a format that appears to be multiple-trait analysis. In addition to the increased power and resolution of QTL detection, joint mapping can provide insights into fundamental genetic mechanisms underlying trait relationships such as pleiotropy *vs.* close linkage and genotype-by-environment (*G* × *E*) interaction, which would otherwise be difficult to address if traits are analyzed separately.

Substantial work has been done in joint mapping for multiple quantitative traits (Jiang and Zeng 1995; Korol* et al*. 1995, 2001; Mangin* et al*. 1998; Henshall and Goddard 1999; Williams* et al.* 1999; Knott and Haley 2000; Hackett* et al*. 2001). In general, there are two ways to perform a joint mapping. One way is the true multivariate analysis in which a multivariate normal distribution is assumed for the multiple traits, and thus a multivariate Gaussian model is applied to construct the likelihood function. Parameter estimation is conducted via either the expectation-maximization (EM) algorithm (Dempster* et al*. 1977) or the multiple-traits least-squares method (Knott and Haley 2000). One problem with these multivariate analyses is that if the number of traits is large, there will be too many hypotheses to test and interpretation of the results will become cumbersome. The other way of multiple-trait analysis is to utilize a dimension reduction technique, *e.g.*, the principal component analysis, to transform the data into fewer variables, *i.e.*, “super traits,” that explain the majority of the total variation of the entire set of traits. Analyzing the super traits requires little additional work (Korol* et al*. 1995, 2001; Mangin* et al.* 1998) compared to that for the single-trait genetic mapping statistics. However, as pointed out by Hackett* et al*. (2001), inferences based on the super traits might result in detection of spurious QTL. Furthermore, parameters of the super traits are often difficult to interpret biologically. Nevertheless, joint mapping provides a good opportunity to answer more questions about the genetic architecture of complex trait sets and deserves continued efforts from investigators in the QTL mapping community.

In contrast to that for multiple quantitative traits, relatively little work has been done in joint mapping for multiple discrete traits (Lange and Whittaker 2001). In fact, multiple discrete traits, especially multiple binary disease traits, are frequently collected in plants and laboratory animals. Most disease resistance traits are measured as one or more dichotomous characters. For example, in experiments mapping disease resistance loci, multiple pathogen races or strains are commonly used to determine the number of race-specific resistance loci involved. In such cases, infection by each strain is measured as a binary trait and the overall infection spectrum is a vector of several binary measurements. In practice, scientists may be less interested in identifying resistance loci to individual strains, but more interested in loci with a wide spectrum of resistance, which, in principle, can be better addressed with the joint-mapping strategy. Unfortunately, there has been no report on such a joint-mapping analysis for multiple binary traits. Recently, Lange and Whittaker (2001) applied the generalized estimating equations (GEE; Liang and Zeger 1986) method to mapping multiple discrete trait loci. Results of GEE are hardly compared with those of single-trait mapping because there is no univariate version of the GEE.

Furthermore, it is not uncommon that investigators may collect both continuous and discrete traits in a single mapping experiment. For example, disease resistance characters may be measured in a QTL mapping experiment for yield traits, or vice versa. Combining the disease resistance traits (discrete) with the yield traits (continuous) may allow investigators to answer some important questions such as possible fitness penalty of resistance loci. Even if the associated quantitative traits are not directly responsible for the disease status but linked to the disease loci, joint analysis will still provide additional power in identifying the disease loci (Williams* et al.* 1999; Huang and Jiang 2003). So far, joint analysis of mixed types of traits has been attempted only in pedigree analysis of human populations under the random model methodology.

The goal of this study is to develop a formal multivariate version of the maximum-likelihood methodology for joint mapping of QTL underlying multiple binary traits and mixed types of traits in line-crossing experiments under the fixed-model framework. We analyzed both simulated data and data collected from field experiments.

## METHODS

### Joint mapping for multiple binary traits:

#### Statistical model:

Suppose that we have a sample of *n* individuals from an F_{2} population derived from the cross of two inbred lines with observation on *m* binary traits. Assume that we also genotype a number of codominant molecular markers with known map positions for the species in question. Let *w _{jk}* denote the phenotype of the

*k*th binary trait on the

*j*th individual and

*w*= 1 if individual

_{jk}*j*is affected and

*w*= 0 if

_{jk}*j*is unaffected. Further define

*y*as the underlying latent variable,

_{jk}*i.e.*, the liability, for the

*k*th binary trait on the

*j*th individual. The relationship between the liability and the discrete phenotype is described by the following threshold model, 1where the threshold 0 is chosen in an arbitrary fashion. Each of the

*m*liabilities is a continuous variable, similar to the phenotypic value of a quantitative trait. The difference between a liability and a quantitative trait is that the former is not observable but inferred from the discrete phenotype. The liabilities are described by the following linear models, 2where

*b*

_{0}

*is the mean effect (intercept) for trait*

_{k}*k*in the scale of liability;

*b*

_{1}

*and*

_{k}*b*

_{2}

*are, respectively, the additive and dominance effects of the putative QTL;*

_{k}*x*

_{0}

*,*

_{j}*x*

_{1}

*, and*

_{j}*x*

_{2}

*are the incidence variables for the mean effect and the additive and dominance effects, respectively; and*

_{j}*e*is the residual error for trait

_{jk}*k*of individual

*j*. We assume that the residual errors are independent among individuals but correlated among traits within individuals. In matrix notation, model (2) can be written as 3where

**y**

*= [*

_{j}*y*

_{j}_{1}…

*y*] is a 1 ×

_{jm}*m*vector for the liabilities;

**x**

*= [*

_{j}*x*

_{0}

_{j}x_{1}

_{j}x_{2}

*], which is defined as*

_{j}**x**

*=*

_{j}**h**

_{1}= [1 1 0] if individual

*j*takes genotype

**x**

*=*

_{j}**h**

_{2}= [1 0 1] if

*j*has a genotype

**x**

*=*

_{j}**h**

_{3}= [1 −1 0] if

*j*is

**B**is a 3 ×

*m*matrix defined as 4and

**e**

*is a 1 ×*

_{j}*m*vector for the residual errors, which has a covariance matrix 5

Note that the variances are estimable from the latent variables but not from the observed data. Therefore, some restrictions are required when the binary data are taken into account (McCulloch 1994). The probability that individual *j* is affected by all the *m* binary diseases is 6where 7is the multivariate normal probability density. The probabilities for other joint binary phenotypes are derived similarly. For *m* dichotomous traits, there will be 2* ^{m}* possible joint binary phenotypes. With the threshold model, mapping binary traits has been formulated as a problem of mapping quantitative trait loci. Therefore, methods of QTL mapping for multiple quantitative traits (Jiang and Zeng 1995) may be adopted here in multiple binary trait mapping.

Model (3) is a general multiple linear model (GMLM) with missing values in **x*** _{j}* because the QTL genotypes are not observable. The next step of the GMLM analysis with missing values is to infer the probabilities of QTL genotypes conditional on the marker information, denoted by

*p*= Pr(

_{jq}**x**

*=*

_{j}**h**

*|*

_{q}*I*

_{M}) for

*q*= 1, 2, 3, where

*I*

_{M}denotes marker information. We can compute

*p*using Table 1 of Jiang and Zeng (1995) if double recombinants are not considered or using Table 1 of Luo and Kearsey (1992) if no crossover interference is assumed. However, in general, we need to adopt the multipoint method (Jiang and Zeng 1997) to handle missing or dominance markers.

_{jq}#### Maximum-likelihood estimation:

Let us denote the parameters by a vector **θ** = {**B**, **V**}. The likelihood function for the *j*th individual conditional on **x*** _{j}* is 8where

**w**

*= [*

_{j}*w*

_{j}_{1}, … ,

*w*],

_{jm}*g*

_{1}(

*w*) = (

_{jk}*w*− 1) × ∞, and

_{jk}*g*

_{2}(

*w*) =

_{jk}*w*× ∞, for

_{jk}*k*= 1, … ,

*m*. Note that

*g*

_{1}(

*w*) = (

_{jk}*w*− 1) × ∞ = −∞ and

_{jk}*g*

_{2}(

*w*) =

_{jk}*w*× ∞ = 0 if

_{jk}*w*= 0, whereas

_{jk}*g*

_{1}(

*w*) = (

_{jk}*w*− 1) × ∞ = 0 and

_{jk}*g*

_{2}(

*w*) =

_{jk}*w*× ∞ = ∞ if

_{jk}*w*= 1. The

_{jk}*m*-dimensional integral Φ

*(*

_{m}**w**

*;*

_{j}**x**

_{j}**B**,

**V**) may be calculated with some special algorithms or by executing an intrinsic function from some software packages. A two-dimensional integral can be found in SAS (SAS Institute 1999). The probability Pr(

**w**

*|*

_{j}**x**

*,*

_{j}**θ**) is also called the penetrance of the QTL with genotype

**x**

*.*

_{j}Since **x*** _{j}* is missing and only

*p*is calculated, the actual likelihood function for the

_{jq}*j*th individual is 9The overall log-likelihood for the entire mapping population is 10

Solving the above log-likelihood function is tedious. We now introduce a simple EM algorithm to find the solution, which takes advantage of the simplicity of the original linear model with both and being treated as missing values.

Instead of directly maximizing the log likelihood given in Equation 10, the EM algorithm deals with the complete-data log-likelihood function, 11Because **X** and **Y** are treated as missing values, the EM algorithm starts with maximizing the expectation of the complete-data log-likelihood, 12where the expectation is taken with respect to **X** and **Y**, conditional on the data and the parameters at the current values **θ**^{(}^{t}^{)}. For brevity, we use *E*[ξ(**X**, **Y**)] to denote the conditional expectation throughout the entire text, but the full notation for the expectation should be where ξ(**X**, **Y**) represents the term whose expectation is required in the EM algorithm. Note that Equation 12 differs from Equation 10 in two aspects: (i) Equation 10 takes the expectation before the log transformation whereas Equation 12 takes the expectation after the log transformation; and (ii) the expectations are taken using different probability distributions for the two equations. In Equation 12, the expectation is taken using the probability distribution conditional on the current parameter values and the phenotypic values. Maximizing Equation 12 with respect to the parameters, we get 1314The MLE of **B** and **V** in the complete-data situation (both **X** and **Y** are observed) can be found in Anderson (1984)(Sect. 8.2). Giri (1996)(pp. 92–98) also provided the derivation in the simple case where **x**_{j}**B** = **μ** is a 1 × *m* vector of means. The results shown in Equations 13 and 14 are extensions of the results in the complete-data situation by adding the symbols of conditional expectation. appendix C gives a step-by-step derivation of Equations 13 and 14.

In binary data analysis under the liability model, the usual restriction is to let all the variances (diagonal elements of matrix **V**) equal unity (McCulloch 1994). If we had maximized the complete-data log likelihood function with such a restriction, the maximization step would be extremely complicated because there is no easy way to find a closed-form expression for **V̂**. Trying to find an explicit form for **V̂**, one would lose the advantage of the EM algorithm. Fortunately, Gueorguieva and Agresti (2001) showed that maximizing Equation 12 with **V** unrestricted, the EM algorithm converges to unique estimates of the fully identifiable ratios, **BS**^{−1}, where 15

Therefore, Gueorguieva and Agresti (2001) maximized the log-likelihood function with **V** unrestricted and then standardized the model effects by taking **B*** = **BS**^{−1}. The standardized covariance matrix became **R** = **S**^{−1}**VS**^{−1}. At each EM iteration, Gueorguieva and Agresti (2001) estimated **B** and **V** and immediately replaced these two parameters by their standardized versions, **B*** and **R**, before entering the next EM iteration to make sure that the EM converges. In our EM algorithm, we have already defined **B** as a standardized vector of genetic effects. We simply need to standardize **V** during each iteration of the EM algorithm to ensure the convergence of the iterations. The estimated correlation matrix **R̂** is indeed the MLE of **R** based on the invariance property of the MLE (DeGroot 1986) because **V̂** is the MLE of **V** and **R** is a function of **V**. Equations 13 and 14 represent the maximization step of the EM algorithm. We now investigate the expectation step of the EM algorithm. Recall that the probability of **x*** _{j}* conditional on marker information is denoted by

*p*. This probability may be called the prior probability. After incorporating the phenotypic value and the parameters, we obtain the posterior probability, denoted by 16

_{jq}Note that the **V** matrix has been replaced by the **R** matrix to reflect the standardization. In fact, the unrestricted covariance matrix **V** is used only once when we try to estimate it. Once **V** is estimated, it is immediately standardized into the form of **R**, which is then used in all steps of the EM iterations. The expectations are actually obtained using the posterior probabilities rather than the prior probabilities. Therefore, the conditional expectations given the data **W** are 17where *E*(**y*** _{j}*|

**w**

*,*

_{j}**h**

*,*

_{q}**θ**) is the expectation of a truncated multivariate normal distribution. The residual error covariance matrix in Equation 14 becomes 18where

**μ**

*=*

_{jq}*E*(

**y**

*|*

_{j}**w**

*,*

_{j}**h**

*,*

_{q}**θ**) is a short notation for the conditional expectation and

**U**

*= Var(*

_{jq}**y**

*|*

_{j}**w**

*,*

_{j}**h**

*,*

_{q}**θ**) denotes the conditional covariance matrix of a truncated multivariate normal distribution. Neither

**μ**

*nor*

_{jq}**U**

*has an explicit expression except in the special case when*

_{jq}*m*= 2 and 3. These expectation and covariance matrices are calculated using the moment-generating function (Tallis 1963) or the Gibbs sampler (Chan and Kuk 1997). The moment-generating function approach needs multidimensional integrals and cannot be implemented easily in practice when

*m*is large (see appendix A for the special case when

*m*= 2). The Gibbs sampling approach requires Monte Carlo simulation, which is suitable for large

*m*. Details of the Monte Carlo method are given in appendix B.

The EM algorithm may be summarized in the following steps:

Step 1. Choose the initial values for

**θ**,**θ**^{(0)}= {**B**^{(0)},**R**^{(0)}}.Step 2. Calculate the posterior probabilities of the QTL genotype given the current values of all unknowns using Equation 16.

Step 3. Calculate the expectations using Equation 17, a process in the E-step.

Step 4. Calculate

**μ**=_{jq}*E*(**y**|_{j}**w**,_{j}**h**,_{q}**θ**) and**U**= Var(_{jq}**y**|_{j}**w**,_{j}**h**,_{q}**θ**) using the moment-generating function or the Gibbs sampler (see appendix A and appendix B), another process of the E-step.Step 5. Update parameter

**B**using Equation 13, update parameter**V**using Equation 14, and convert**V**into**R**.Step 6. Replace the initial parameters by the updated values and repeat steps 2–5 until convergence.

#### Likelihood-ratio test:

Define the log-likelihood function evaluated at the maximum-likelihood estimate (MLE) of parameters as 19where and **θ̂** is the MLE of **θ**. This is also called the likelihood value under the full model. We need the likelihood values under various restricted models to test different hypotheses.

The overall null hypothesis is “no effect of QTL at the locus of interest,” denoted by H_{0}: *b*_{1}* _{k}* =

*b*

_{2}

*= 0 for*

_{k}*k*= 1, … ,

*m*or H

_{0}:

**LB**=

**0**, where If we solve the MLE of the parameters under the restriction of

**LB**=

**0**and evaluate the likelihood function evaluated at the solutions with this restriction, we have 20where

**θ̃**is the MLE of

**θ**under the restricted model. The likelihood-ratio test statistic is 21Under the null hypothesis, this test statistic will approximately follow a chi-square distribution with 2(

*m*− 1) d.f.

Various other test statistics may be defined by choosing different **L** matrices, as given by Jiang and Zeng (1995). For example, to test the additive effects for all traits, the **L** matrix is defined as **L** = [0 1 0]. The degrees of freedom for this test are *m* − 1. To test the dominance effects for all traits, we use **L** = [0 0 1] with *m* − 1 d.f. for the test statistic.

To test trait-specific effects, we need to introduce another matrix, **T**, which is used to postmultiply matrix **B**. For example, to test QTL effects (both additive and dominance) for the *k*th trait, the null hypothesis is H_{0}: **LBT** = **0**, where and **T** is an *m* × 1 vector with the *k*th element being one and all the remaining elements being zeros. The test statistic will be 22with 2 d.f. In general, the **L** matrix controls the type of effects (population mean, additive, and dominance) being tested and the **T** matrix controls the traits (from 1 to *m*) being tested.

The position of the QTL is another parameter of interest. However, in the one-dimensional genome scan, the position is first treated as fixed and then the entire genome is tested for every putative position with a 1- or 2-cM increment. The likelihood-ratio test statistic forms a test statistic profile. The position corresponding to the highest peak is declared as the estimated QTL position if the peak surpasses a given critical value (Churchill and Doerge 1994; Diggle *et al*. 1996; Piepho 2001).

### Joint mapping for mixed types of traits:

We now describe a statistical model and likelihood analysis for joint mapping of loci that affect one binary trait and multiple quantitative traits. Let *w _{j}*

_{1}be the phenotype of the binary trait for the

*j*th individual and defined as

*w*

_{j}_{1}= 1 if individual

*j*is affected and

*w*

_{j}_{1}= 0 if it is not affected. Further define

*y*as the value of the

_{jk}*k*th observed quantitative trait, for

*k*= 2, … ,

*m*, on the

*j*th individual. We also define

*y*

_{j}_{1}as the liability for the binary trait, The liability of the binary trait and the phenotypic values of the quantitative traits are arranged in a vector called where is a special notation for a subvector of

**y**

*that excludes the first element. The*

_{j}**y**

*vector is described by the same model as given in Equation 3. Let us further partition matrix*

_{j}**B**into , where is the first column of matrix

**B**and is the remaining columns of

**B**. The residual errors are joint normal with mean zero and a covariance matrix

**V**, which can be partitioned into 23

where , and

is the lower right block of matrix **V**. The variance of the liability for the disease trait, however, is restricted to unity. Therefore, the standardized form of the **V** matrix is **R** = **S**^{−1}**VS**^{−1}, which is a function of the unrestricted covariance matrix **V**, where **S** = diag(σ_{1} 1 … 1). Similar partitioning given in Equation 23 also applies to matrix **R**. The joint distribution of the phenotype for individual *j* is 24where 25and 26The probability density φ within the integral is a conditional density of *y _{j}*

_{1}given

**y**

_{j}

_{1̅}. It is a univariate normal with mean 27and variance 28Therefore, Φ(

*w*

_{j}_{1};

**y**

_{j}_{(−1)},

**x**

_{j}**b**

_{1},

**R**

_{11}) is a truncated univariate normal probability.

The likelihood function for the *j*th individual is 29The overall log-likelihood of the entire mapping population is 30

Again, we adopt the EM algorithm to find the MLE of parameters. The maximization step is accomplished through Equations 13 and 14. The expectation step requires calculation of the posterior probabilities of QTL genotypes and then uses these probabilities to find various expectations. The posterior probability of a QTL genotype is 31from which we get the expectations 32where 33is a 1 × *m* vector, which has been partitioned into a scalar *E* and a vector **y**_{j}_{1̅}. The expectation is taken only for the liability of the binary trait. The remaining traits already take the observed values and thus no expectations are taken. The expectation for the liability, *E*, is obtained from the truncated normal distribution with mean given by Equation B5 of appendix B and variance given by Equation B6 of appendix B. The residual error covariance matrix is given by Equation 18. However, the conditional expectation is replaced by 34and the conditional variance by 35

where Var is the variance of a truncated normal distribution. Both *E* and Var can be found from the truncated normal distribution (Cohen 1991) and no Gibbs sampler is required.

## RESULTS

### Simulation studies:

To further evaluate the properties of the proposed method, we conducted two simulation experiments. For the sake of simplicity, we designed one experiment to evaluate the performance of joint mapping for two binary traits and another experiment for the mixture of one binary and one quantitative trait. In each experiment, one chromosome with 11 evenly distributed markers was simulated for an F_{2} population. We simulated a single QTL located at 35 cM of the chromosome and the QTL effects of both traits under three different levels of heritability (proportion of variance in liability explained by the QTL). The effects of the QTL used in the simulation experiments are given in Table 1. The correlation coefficient between the residuals of the liabilities for the two traits was chosen at 0.25. Under these settings, both traits had the same heritability. The three levels of the QTL effects led to three different levels of the heritability: 5, 10, and 15%. The genetic correlation between the two traits was expected to be 1.0, −0.446, and 0.423, respectively, for the different chosen levels of the heritability. The sample size of the simulated F_{2} population was *n* = 200. Each simulation experiment was replicated 100 times.

The first simulation experiment was to evaluate the efficiency of the joint binary trait mapping. We first simulated the liabilities of the two traits and then artificially truncated the continuous liabilities into two binary phenotypes using a threshold of zero. In the second simulation experiment, we truncated only the liability of the first trait to generate a binary phenotype but left the second trait intact so that we had one binary trait and one continuous trait.

Each data sample was analyzed using both the joint-mapping and single-trait-mapping statistics. For the single-trait analysis, we used Lander and Botstein's (1989) method for the continuous trait and the method of Xu* et al*. (2003) for the binary trait. Since we considered only two traits in the joint mapping, explicit formulas were used in each of the EM iteration steps (see appendix A for the explicit formulas). The critical values for the chromosomewise type I error rate of 5% were determined by the approximate method of Piepho (2001). In real data analysis, one should use the permutation test (Churchill and Doerge 1994) to obtain the most appropriate critical values for significance tests. For the joint analysis, the empirical power was determined by the proportion of the replicated samples (out of 100) whose highest test statistic values along the chromosome were greater than Piepho's (2001) critical value. The peak where the highest test statistic occurred was usually close to the true QTL position. However, a significant QTL was declared even if the peak was not exactly at the true position. For the separate analyses of individual traits, the statistical power was determined for the analysis of each trait as in the joint analysis. The critical value was recalculated for each trait in each replicate.

Tables 2 and 3 show the observed powers of QTL detection, the mean, and standard deviations (SD) of the estimated QTL locations and effects obtained from 100 replicated simulations. We compared the power of the joint analysis with that of a single-trait analysis for each trait separately. Joint analysis has a substantially higher power than either single-trait analysis. We understand that this may not be a fair comparison because joint analysis uses two traits while the single-trait analysis uses only one trait. However, this has been the standard way for comparison of joint mapping with separate mapping (Jiang and Zeng 1997). One may want to redefine the power for the separate analyses as the ability to detect at least one QTL effect (either additive or dominance) in at least one trait. Under this definition of the power, results of the two separate analyses may be combined so that the power is recalculated in the combined result. The combined power analysis requires either redefining the critical values by taking into account the multiple tests, which is difficult because the two separate analyses may be highly correlated, or simply using the sum of the powers of separate analyses (with an appropriate adjustment) as the combined power, which is *p*_{1} + *p*_{2} − *p*_{12}, where *p*_{1} and *p*_{2} are the powers for traits 1 and 2, respectively, and *p*_{12} is the proportion of the replicated simulations in which both traits are significant. For example, among the 100 replicates, if a significant QTL effect is detected in 50 samples for the first trait and a significant QTL effect is detected in 80 samples for the second trait, then *p*_{1} = 0.5 and *p*_{2} = 0.8. If QTL effects for both traits are detected in 40 samples, then *p*_{12} = 0.4. The combined power for the separate analyses will be 0.5 + 0.8 − 0.4 = 0.9. Using this approach to calculating the power, the combined power of the separate trait analyses was almost identical to that of the joint analysis (data not shown). Therefore, power increase in joint mapping as opposed to separate mapping depends on how one defines the power in the separate analyses. From the traditional definition of statistical power for single-trait analysis (Jiang and Zeng 1997), joint analysis has higher power than single-trait analysis, *i.e*., joint power greater than *p*_{1} and joint power greater than *p*_{2}. But the joint analysis has an equivalent power to the combined power for separate analyses if the combined power is defined as *p*_{1} + *p*_{2} − *p*_{12}, *i.e.*, joint power ≈ *p*_{1} + *p*_{2} − *p*_{12}.

The QTL effects and their standard deviations estimated from the joint mapping are comparable to those obtained from separate analyses. No obvious advantage of the joint mapping has been demonstrated from the simulation studies with respect to the estimates of QTL effects. The real advantage of the joint mapping over separate analyses has been demonstrated by the increased precision of the QTL position estimates in all situations examined (see Tables 2 and 3).

Overall, the parameter estimates are fairly close to the true parametric values. The general trend follows our expectation: high heritability tends to produce more accurate estimates than low heritability. If we compare the joint mapping of two binary traits with that of one binary and one continuous trait, we will note the power difference between the two experiments. Experiment 2 shows higher powers than experiment 1. This observation also follows our expectation because binary data are not as informative as continuously distributed data.

### Mapping rice blast resistance loci:

Developing blast resistance cultivars is one of the major objectives in rice (*Oryza sativa L*.) breeding in both tropical and temperate countries. The causal organism of the rice blast, *Pyricularia grisea*, is known for its high genetic variability, allowing it to overcome the resistance of the host plant. A framework linkage map was developed using 284 F_{10} recombinant inbred lines (RILs) from a “Lemont” × “Teqing” rice cultivar cross. A subset of 245 RILs innoculated with two rice blast races, IB54 and IG1, was used to map loci responsible for the hypersensitive reaction. Details of the experimental design, the measurements of phenotypes, and genotypes can be found in the original article by Tabien *et al*. (2000). The phenotypes were evaluated using a completely randomized design with three replicates. In other words, each line was evaluated three times for its reaction to each of the phathogen infections. The original scores of the plant response were measured from grade 0 to grade 5. The average score of the three replicates for each line was recorded as the raw data observation. The binary phenotype was defined as *w* = 0 if the average score was within the range 0–3 and *w* = 1 if the average score was 4–5. We were provided only with the binary data, not the original scores. The breeders were more interested in the genetic study of the qualitative dichotomous trait than in the genetic study of the numerical scores. This explains why we were approached by the breeders to analyze their data using the new methods.

Since the mapping population was a RIL population, a slight modification of our method for F_{2} was required. We replaced the probability transition matrix of F_{2} by that of F_{10} in calculating the conditional probability of QTL genotype (Jiang and Zeng 1997). There was still a 4% residual heterozygosity in the RIL lines (due to F_{10} instead of F_{∞}), which is sufficiently high to allow the dominance effects to be estimated. We treated the plant responses to blast pathogen races IB54 and IG1 as two separate binary traits. Therefore, joint mapping for both traits and separate mappings for individual traits were conducted for comparisons. The critical values of test statistics used to declare QTL were calculated using the method of Piepho (2001).

Table 4 shows the results of joint mapping and separate analyses. The joint mapping may have a greater power than separate analyses, as demonstrated by more detected QTL and higher test statistic values. A total of five resistance loci were identified by the joint mapping (qtl1–qtl5), but only four of them were detected with separate analyses (qtl4 was missed). Of these detected QTL, three of them (qtl1, qtl3, and qtl4) corresponded to *Pi-tq5*, *Pi-lm2*, and *Pi-tq6* detected previously on the basis of chi-square tests of individual marker-trait associations (Tabien *et al*. 2000). Two additional loci (qtl2 and qtl5) were detected on chromosomes 3 and 12, and they were not reported in the previous study (Tabien *et al*. 2000). For each of the two loci, the allele carried by the Lemont parent was responsible for the resistance. None of the genetic parameters, *e.g.*, the QTL effects and positions, were estimable in the previous chi-square tests conducted by the original authors (Tabien *et al.* 2000). The most striking result from the joint mapping was that all five resistance loci showed fairly consistent effects against both *P. grisea* races, while different resistance loci were detected separately by the single-trait analyses.

It is worth mentioning that results of joint mapping and separate mapping do not seem to be consistent in the real data analyses. This inconsistency, however, did not occur in the simulation studies. The reason for this is that we have taken a one-dimensional genome-scan approach, which uses a single-QTL model. In the simulation studies, we indeed simulated a single QTL. Therefore, the model adequately described the data. In the real data analysis, however, we used the single-QTL model to fit data controlled by apparently multiple QTL. The remaining QTL not fitted in the model may have caused all the inconsistencies observed between the joint and the separate analyses. In addition, the background QTL also have caused the observed high residual correlation. These problems can be solved by fitting a multiple-QTL model (see the discussion in a later section).

Table 5 shows the probabilities of the four possible phenotypic combinations under different genotypes of the identified QTL. For a single disease trait, penetrance is defined as the probability that a specific QTL genotype shows the affected phenotype. Penetrance has not been defined for multiple disease traits. Therefore, we listed the probabilities of all the four possible phenotype combinations for all genotypes of each detected QTL in Table 5. The penetrances of any particular genotypes for each QTL may be calculated from this table. For example, if we define the penetrance of a genotype as the probability that a plant with this genotype is affected by either of the two pathogens, the penetrance should be calculated using 1 − Pr(IB54 = R and IG1 = R). On the other hand, if we define the penetrance as the probability that the plant is affected by both pathogens, then we should use Pr(IB54 = S and IG1 = S). The marginal penetrance for one pathogen, say pathogen IB54, should be defined as Taking the first genotype of the first QTL, for example, we may be able to find penetrances defined in all possible ways, as shown below, and

Interested rice geneticists and breeders may want to find out all kinds of penetrances of interest from Table 5. This table may also help rice breeders develop an optimal marker-assisted seletion scheme to improve blast resistance in rice.

## DISCUSSION

Joint mapping offers several advantages over single-trait analyses. First, joint mapping may increase statistical power of QTL detection compared to single-trait analyses. Second, joint analysis can improve the precision of parameter estimation. Third, joint mapping provides an opportunity to answer more questions related to the genetic architecture of complex traits. These have been discussed by many authors (Jiang and Zeng 1995; Korol* et al*. 1995; Mangin* et al*. 1998; Henshall and Goddard 1999; Knott and Haley 2000) in multiple quantitative traits QTL mapping. Similar advantages also have been demonstrated here in the joint mapping for multiple binary traits. In this study, we paid more attention to the development of the EM algorithm rather than to various hypotheses tests, because the latter have been fully addressed by Jiang and Zeng (1995). In addition, the method was derived in the context of interval mapping. Extension to composite interval mapping should be preferred in practice, but this is simply a matter of implementation. Furthermore, the proposed method for F_{2} populations can be easily extended to other types of populations, *e.g*., backcrosses or four-way crosses, as demonstrated by the extension from F_{2} to RILs described in this study. The method differs from one mating design to another only by the possible different number of genotypes and different transition matrix from one locus to another.

In fact, there has been much work on single binary trait mapping (Hackett and Weller 1995; Xu and Atchley 1996; Yi and Xu 1999, 2000; Xu* et al*. 2003), using likelihood-based methods or Bayesian methods. However, the method of separate analyses of individual binary traits is, so far, the only approach currently available. For the first time, we developed the full probability model for joint mapping of multiple binary traits. The method requires numerical multiple integrals, as we know that high-dimensional numerical integration cannot be implemented easily in practice. Therefore, we presented the method using two traits as examples. In real data analysis, one may pay more attention to the information extracted from the data and thus may wish to perform joint mapping for more than two traits using the general algorithm developed here. Two factors may limit the number of traits included in the analysis. One is the computing time and the other is the difficulty in interpreting the results. For the rice blast data analysis with two binary traits, QTL search for the entire rice genome took ∼10 min, which is quite reasonable. For more than two traits, computing time is a big factor of concern. We highly recommended using a different but fast numerical integration algorithm specially designed for high-dimensional integration, *e.g*., Monte Carlo integration. The Bayesian method implemented via Markov chain Monte Carlo (MCMC) is an ideal tool to accomplish this. In addition, the Bayesian method can handle the multiple-QTL model with ease. To deal with the problem of interpretation, one must have some intuitive knowledge about the trait relationships and hypotheses underlying the traits. In the disease-resistance case, one would be interested not only in the number of loci involved, but also in the level of race specificity of individual resistance loci, since the hypersensitive response of rice to *P. grisea* is known to be controlled by the gene-for-gene system (Silué *et al*. 1992). However, this gene-for-gene system normally assumes that only two consequences, resistance or susceptibility, would result from interactions between alleles at a resistance locus of host plants and alleles at its corresponding avirulence locus in pathogens, which may not be always true, as is discussed in the following section; imperfect penetrance appears to be an important feature of resistance loci involved in the gene-for-gene interactions between host plants and their pathogens.

We took the maximum-likelihood approach and implemented the method via the EM algorithm. This is different from the GEE method described earlier. We favor the EM algorithm because it was developed on the basis of all existing theory and methods currently used for mapping loci of regular quantitative traits. In single binary trait mapping, one of the most frequently asked questions is “What is the advantage of using the probability model over the simple analysis that treats the binary traits as if they were continuous traits?” (Visscher *et al*. 1996). The same question also may be asked here for joint mapping of multiple binary traits. Although we did not try the joint analysis of binary traits by ignoring the binary nature of the traits, we predict that treating binary traits as if they were continuous traits may result in similar power in most situations. In some special cases, the probabilistic model may provide higher power than the simplified analysis, and we have not figured out the parameter range in which this will happen. The probabilistic model approach enables estimation of penetrance of a particular QTL genotype, which is an important property of genes involved in human diseases (Terwilliger and Weiss 1998) and plant disease resistance, as we demonstrated here. McIntyre *et al*. (2001) developed a different probabilistic model for QTL mapping of binary traits. The method also allows calculation of penetrance. Extension of their model to multiple binary trait analysis is another alternative approach. We did not choose this extension because the threshold model via the EM algorithm has a natural connection to existing methods of QTL mapping.

Finally, with the successful development of joint mapping of both multiple quantitative and qualitative traits, an important but largely unexploited area in genetic mapping begins to emerge, *i.e.*, the joint analysis of mixed types of traits. Although some authors (Williams* et al.* 1999; Huang and Jiang 2003) already exploited this idea in the context of human genetic mapping under the IBD-based random model framework, it has never been explored in QTL mapping of experimental populations. The method may be particularly useful in situations where mapping qualitative disease resistance of the gene-for-gene system is the primary objective while traits associated with quantitative resistance to the same or different pathogens are measured as by-products in the experiment. This coexistence of qualitative and quantitative resistance is widely present in many relationships between host plants and their pathogens in natural and agricultural systems (Leonard and Czochor 1980). Joint analyses of the correlated qualitative and quantitative phenotypes may substantially increase the power of detecting disease resistance loci and allow exploration of new features of loci involved.

## APPENDIX A:

### MOMENTS OF TRUNCATED BIVARIATE NORMAL DISTRIBUTION

Let **z**^{T} = [*z*_{1} *z*_{2}] be a vector of two variables distributed as a standardized bivariate normal distribution with correlation ρ. Let [*a*_{1}, *c*_{1}] and [*a*_{2}, *c*_{2}] be the double truncation points on variables *z*_{1} and *z*_{2}, respectively, and define α_{1} = Pr(*z*_{1} > *c*_{1}, *z*_{2} > *c*_{2}) as the area (integral) within the domain. Let us further define the first and second moments of the truncated standardized bivariate normal distribution at *z*_{1} > *c*_{1} and *z*_{2} > *c*_{2} as A1where A2and A3Equations A1 can be found from Tallis (1963). Similarly, we can calculate the first and second moments of truncated standardized bivariate normal distribution at *z*_{1} > *a*_{1} and *z*_{2} > *c*_{2}, *z*_{1} > *a*_{2} and *z*_{2} > *c*_{1}, and *z*_{1} > *a*_{1} and *z*_{2} > *a*_{2}, respectively. We further denote the above four truncated domains by 1, 2, 3, and 4, respectively. The following formula is used to calculate the moments under the double truncation with [*a*_{1}, *c*_{1}] and [*a*_{2}, *c*_{2}], A4where *T _{i}*,

*i*= 1, 2, 3, 4, represents the arbitrary first moment of (A1), α

*represents the probability under the corresponding truncated domain, and A5*

_{i}## APPENDIX B:

### CONDITIONAL EXPECTATION AND VARIANCE VIA GIBBS SAMPLER

The basic idea of the Gibbs sampler is to find the distribution of one element, say *y _{jk}*, conditional on the remaining components in vector

**y**

*and sample*

_{j}*y*from the conditional distribution. Under the assumption of multivariate normality for the liability vector,

_{jk}*i.e*.,

**y**

*∼*

_{j}*N*(

_{m}**x**

_{j}**B**,

**R**), the conditional density of a single component is univariate normal with mean and variance described as follows. First, let us make the following matrix partitioning, where B1is a special notation for a subset of vector

**y**

*that excludes*

_{j}*y*;

_{jk}*i.e*., the subscript

*k̅*indexes all elements except

*k*. Using this special notation we can partition matrix

**B**into , where is the

*k*th column of matrix

**B**and B2is a submatrix of

**B**with the

*k*th column left out. Let us further partition matrix

**R**into B3where

**R**

*= 1, , and B4Note that*

_{kk}**R**

_{k̅k̅}is the submatrix of

**R**with the

*k*th row and

*k*th column removed. The above matrix partitionings allow us to define the conditional mean of

*y*as B5and the conditional variance as B6Having found the distribution of one component conditional on the remaining components, one can easily sample each element from its perspective univariate normal distribution. The binary phenotype for each trait has not played a role in the above sampling scheme. To incorporate this information, we need to sample each liability from a truncated normal distribution with the mean and variance given above. For example, if

_{jk}*w*= 0,

_{jk}*y*should be sampled only if

_{jk}*y*≤ 0. If

_{jk}*w*= 1, however,

_{jk}*y*should be sampled only if

_{jk}*y*> 0. In fact, we adopted the algorithm of Devroye (1986) to simulate a variable from a truncated normal distribution. This special algorithm has a 100% rate of acceptance. The Monte Carlo sampling process is repeated many times with the simulated

_{jk}**y**

*forming a large sample,*

_{j}**y**

^{}

_{j},

**y**

^{}

_{j}, … ,

**y**

^{}

_{j}, where

*M*is a large number. Discarding the observations during the burn-in period and thereafter saving one observation every few cycles, we get a sample containing roughly independent observations, from which the sampled mean vector and the covariance matrix are calculated. The sampled mean and covariance matrix are used in place of

**μ**

*=*

_{jq}*E*(

**y**

*|*

_{j}**w**

*,*

_{j}**h**

*,*

_{q}**θ**) and

**U**

*= Var(*

_{jq}**y**

*|*

_{j}**w**

*,*

_{j}**h**

*,*

_{q}**θ**).

## APPENDIX C:

### DERIVATION OF THE EM ALGORITHM

**The expected likelihood function:**

The complete-data log likelihood is C1where **θ** = {**B**, **V**} is the vector of parameters and **X** and **Y** are the missing values. The data are the phenotypes of multiple binary traits, denoted by **W**.

The expectation of the complete-data log-likelihood function conditional on the current parameter values and the data is C2where the expectation is taken with respect to the missing values, **X** and **Y**, conditional on the current parameters **θ**^{(}^{t}^{)} = {**B**^{(}^{t}^{)}, **V**^{(}^{t}^{)}} and the data **W**. Note that we use a special notation *E _{Y}*

_{|}

*to denote conditional expectation with respect to*

_{X}**Y**given

**X**. The expectation of the complete-data log-likelihood (C2) is the target function subject to maximization in the EM algorithm.

#### Maximization with respect to B:

The expectation of the complete-data log-likelihood function relevant to **B** is C3The partial derivative of *L*(**B**|**θ**^{(}^{t}^{)}) with respect to **B** is C4

Setting (C4) equal to zero and solving for **B**, we obtain C5In the main text, we used the following simple notation for the conditional expectations, With this short notation, the solution for **B** becomes C6which concludes the proof of Equation 13 of the main text.

#### Maximization with respect to V:

The expectation of the complete-data log-likelihood function relevant to **V** is C7The partial derivative of this likelihood function with respect to **V** is complicated, but the derivative of *L* with respect to **V**^{−1} is straightforward. On the basis of the invariance property of ML analysis, if is the MLE of **V**^{−1}, then should be the MLE of **V**. Therefore, we set the partial derivative of *L* with respect to **V**^{−1} equalto zero and solve for **V**, as C8Setting (C8) equal to zero and solving for **V**, we get C9In the main text, we adopted a short notation for the expectation and denoted Equation C9 by C10This proves Equation 14 of the main text of the article.

## Acknowledgments

We are grateful to three anonymous reviewers for their suggestive comments on early versions of the manuscript. This research was supported by the National Institutes of Health grants R01-GM55321 and the United States Department of Agriculture National Research Initiative competitive grants program 00-35300-9245 to S.X.

## Footnotes

Communicating editor: R. Doerge

- Received June 26, 2003.
- Accepted October 11, 2004.

- Genetics Society of America