Introduction to QTL mapping

JT Lovell

2016-11-01

library(qtl)
library(devtools)
install_github("jtlovell/qtlTools")
## Downloading GitHub repo jtlovell/qtlTools@master
## from URL https://api.github.com/repos/jtlovell/qtlTools/zipball/master
## Installing qtlTools
## '/Library/Frameworks/R.framework/Resources/bin/R' --no-site-file  \
##   --no-environ --no-save --no-restore --quiet CMD INSTALL  \
##   '/private/var/folders/b6/w7rk82fx273ct14rskd_csk40000gp/T/RtmpGtjxfY/devtools2616754f3e49/jtlovell-qtlTools-588f597'  \
##   --library='/Library/Frameworks/R.framework/Versions/3.3/Resources/library'  \
##   --install-tests
## 
## Reloading installed qtlTools
library(qtlTools)

First, lets simulate some data

# Set the seed so that all our maps are identical
set.seed(42) 

# Simulate a 2-chromosome map with 25 markers on each 100cM linkage group
map<-sim.map(len = c(100,100), n.mar = c(25,25), include.x=F)

# Simulate 1 QTL on each chromosome
# Here, we set the bigger one on Chr 1 to be purely additive
#       the second, smaller one is dominant
cross<-sim.cross(map, n.ind=200, type="f2", map.function="kosambi",
                 model = rbind(c(1,10,10,1),c(2,90,5,5)))

cross is an R object containing the genetic map, genotype and phenotype matrices

The genetic map is an array of genetic markers, where markers are ordered by their similarity. This should reflect a) the recombination rate and b) the physical position of the markers in the genome. A centimorgan (cM) is a measure of the distance between genetic markers, referring to the probability of a recombination event happening for 1/100 individuals between a pair of markers

This is a genetic map, where each horizontal bar represents a marker

plot.map(cross)

Here, we have plotted the relationship among markers in the map:

Yellow indicates low recombination fractions (fewer cross-overs), which means that those markers should be closer to eachother on the genetic map. See the tutorial on improving genetic maps for more details.

plot.rf(cross)
## Warning in plot.rf(cross): Running est.rf.

Calculating conditional genotype probabilities

QTL mapping in experimental populations was traditionally limitted by the number of markers that could be feasibly genotyped. With NGS data, this is less of a problem; however, we still need to calculate conditional genotype probabilities to satisfy various statistical assumptions, and to fix genotyping errors. R/qtl has two functions, sim.geno and calc.genoprob that do this. Here, we only focus on calc.genoprob because it is better for complete genotype matrices with low error rate, and it is much faster.

cross<-calc.genoprob(cross, 
                     step = 2, # This means that we want 
                               # pseudomarkers every 2cM
                     error.prob = 0.001)

QTL mapping is simply the correlation of the genotype and phenotype

In our simulated cross object, we placed two QTL, one on the top of Chr1 and another on the tail of Chr2. This is what the genotype means look like across the genome:

meanScan(cross, pheno.col = "phenotype")

Another way to think about the effects of QTL is to order the individuals in a cross by their phenotypic value - where structure exists in this ordered matrix, a QTL is likely present.

The randomly ordered Chr1 genotype matrix

geno.image(cross, chr = 1, reorder = FALSE)

The Chr1 genotype matrix, ordered by the phenotype. Note the structure on the top of Chr1.

geno.image(cross, chr = 1, reorder = TRUE)

Basic QTL Mapping

Now that we have the general idea of QTL mapping, lets simulate a more realistic cross with QTL that have smaller effect sizes.

set.seed(42) 
map<-sim.map(len = c(100,100), n.mar = c(25,25), include.x=F)
cross<-sim.cross(map, n.ind=200, type="f2", map.function="kosambi",
                 model = rbind(c(1,10,1,1),c(1,90,.5,0),c(2,90,.5,1)))
cross<-calc.genoprob(cross, step=2, error.prob = 0.001)

Scanone - the QTL package’s simplest test for QTL

Scanone makes a statistical test at each point in the (pseudo)marker grid. The LOD statistic (log of the odds of the likelihood ratio) tests the hypothesis that there is no QTL versus the presence of a single qtl.

s1<-scanone(cross, method = "hk")
summary(s1)
##       chr  pos   lod
## D1M2    1 11.7 13.69
## D2M21   2 90.7  6.76
plot(s1, main = "one-way QTL scan")

Permutations allow inference of significance

By permuting (randomizing) the phenotype matrix, while keeping the phenotype matrix constant, we can define the null distribution of the LOD score. The number of times a permuted LOD excedes that of a QTL peak (out of 100) is the P-value.

perms<-scanone(cross, n.perm=100, verbose=F, method = "hk")
plot(perms)
abline(v = quantile(perms,.95), col = "red", lty=2)

summary(s1, perms = perms, pvalues=T)
##       chr  pos   lod pval
## D1M2    1 11.7 13.69    0
## D2M21   2 90.7  6.76    0
plot(s1, main = "one-way QTL scan with significance threshold")
add.threshold(out=s1, perms=perms, alpha = 0.05, col = "red", lty=2)

Scanone has its limitations

Note that we simulated this cross with 3 QTL, 2 on Chr1, 1 on Chr2. Yet summary(s1, perms = perms, pvalues=T) returns only the top peaks per chromosome. In fact, scanone can only define the best QTL on a chromosome.

Multiple QTL modeling - basics

To get a better picture of what is actually going on, esspecially where multiple QTL exist on a chromosome, we have to build multiple QTL models. This can be done manually… See: addqtl, addtoqtl, refineqtl

To begin, we start with the best QTL peak

maxS1 <- max(s1)
s1chr <- as.numeric(as.character(maxS1$chr))
s1pos <- as.numeric(as.character(maxS1$pos))

We use makeqtl to generate a qtl model containing the genotype probabilities at the chosen QTL position

qtl1 <- makeqtl(cross, chr=s1chr, pos=s1pos, what="prob")
plot(qtl1)

Given the presence of this first QTL, we look for a second and add it to the model using addqtl and addtoqtl.

form1 <- "y ~ Q1"
s2 <- addqtl(cross, qtl=qtl1, formula=form1, method="hk")
plot(s2)

s2chr <- max(s2)$chr
s2pos <- max(s2)$pos
qtl2 <- addtoqtl(cross, qtl=qtl1, chr=s2chr, pos=s2pos)

Now, with two qtl, it is important to make sure out positions are correct, conditional on the previously defined QTL.

form2 <- "y ~ Q1 + Q2"
qtl2 <- refineqtl(cross, formula=form2, qtl=qtl2, method="hk", verbose=F)
plotLodProfile(qtl2)

Lets finally repeat this process looking for that minor QTL on chr1

plot(s3 <- addqtl(cross, qtl=qtl2, formula=form2, method="hk"))

plot(qtl3 <- addtoqtl(cross, qtl=qtl2, chr=max(s3)$chr, pos=max(s3)$pos))

form3 <- "y ~ Q1 + Q2 + Q3"
plotLodProfile(qtl.model <- refineqtl(cross, formula=form3, qtl=qtl3, method="hk", verbose=F))

Automated stepwise QTL model selection

Here, we can do all of the steps described above, in one line. Follow the output to see how it works.

step.model<-stepwiseqtl(cross, max.qtl = 3, method = "hk", additive.only = T)
##  -Initial scan
## initial lod:  13.6942 
## ** new best ** (pLOD increased by 10.1742)
##     no.qtl =  1   pLOD = 10.1742   formula: y ~ Q1 
##  -Step 1 
##  ---Scanning for additive qtl
##         plod = 18.37641 
##  ---Refining positions
##     no.qtl =  2   pLOD = 18.37641   formula: y ~ Q1 + Q2 
## ** new best ** (pLOD increased by 8.2022)
##  -Step 2 
##  ---Scanning for additive qtl
##         plod = 20.21612 
##  ---Refining positions
##     no.qtl =  3   pLOD = 20.21612   formula: y ~ Q1 + Q2 + Q3 
## ** new best ** (pLOD increased by 1.8397)
##  -Starting backward deletion
##  ---Dropping Q3 
##     no.qtl =  2   pLOD = 18.37641   formula: y ~ Q1 + Q2 
##  ---Refining positions
##  ---Dropping Q2 
##     no.qtl =  1   pLOD = 10.1742   formula: y ~ Q1 
##  ---Refining positions
##  ---One last pass through refineqtl
plotLodProfile(step.model)

plot(step.model)

Fitting a multiple QTL model.

The most important function is fitqtl which fits an analysis of variance to the conditional genotype probabilities in the qtl model.

print(fit<-summary(fitqtl(cross, qtl = step.model, formula = formula(step.model), 
                    dropone = T, get.ests = T, method = "hk")))
## 
##      fitqtl summary
## 
## Method: Haley-Knott regression 
## Model:  normal phenotype
## Number of observations : 200 
## 
## Full model result
## ----------------------------------  
## Model formula: y ~ Q1 + Q2 + Q3 
## 
##        df       SS        MS      LOD     %var Pvalue(Chi2) Pvalue(F)
## Model   6 223.8609 37.310147 30.77612 50.76899            0         0
## Error 193 217.0793  1.124763                                         
## Total 199 440.9402                                                   
## 
## 
## Drop one QTL at a time ANOVA table: 
## ----------------------------------  
##        df Type III SS   LOD   %var F value Pvalue(Chi2) Pvalue(F)    
## 1@11.7  2      118.46 18.91 26.865   52.66            0   < 2e-16 ***
## 1@91.5  2       28.51  5.36  6.467   12.68            0  6.73e-06 ***
## 2@90.6  2       79.02 13.48 17.921   35.13            0  9.77e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## Estimated effects:
## -----------------
##              est     SE      t
## Intercept 0.9337 0.0754 12.385
## 1@11.7a   0.8365 0.1070  7.820
## 1@11.7d   1.0747 0.1551  6.930
## 1@91.5a   0.5553 0.1124  4.941
## 1@91.5d   0.1683 0.1538  1.094
## 2@90.6a   0.6754 0.1078  6.265
## 2@90.6d   0.8322 0.1534  5.427

Getting confidence intervals.

It is also useful to understand the size of the interval that contains the QTL. There are two methods to get confidence intervals: lodint assesses the interval by the relative drop in LOD score from the QTL peak; bayesint uses the area under the peak to determine confidence intervals

lodint(step.model, qtl.index = 1, drop = 1.5) # LOD drop of 1.5
##         chr      pos      lod
## c1.loc8   1  8.00000 15.24452
## D1M2      1 11.74874 18.91250
## D1M3      1 13.46666 15.33715
lodint(step.model, qtl.index = 1, drop = 4) # LOD drop of 4
##          chr      pos      lod
## c1.loc6    1  6.00000 12.21613
## D1M2       1 11.74874 18.91250
## c1.loc16   1 16.00000 14.77962
bayesint(step.model, qtl.index = 1, prob = .80)
##          chr      pos      lod
## D1M2       1 11.74874 18.91250
## D1M2       1 11.74874 18.91250
## c1.loc12   1 12.00000 18.67627
bayesint(step.model, qtl.index = 1, prob = .99)
##          chr      pos      lod
## c1.loc10   1 10.00000 17.65661
## D1M2       1 11.74874 18.91250
## c1.loc12   1 12.00000 18.67627

Since we have multiple QTL, we would have to do this calculation for each QTL peak. However, we can use qtlTools::calcCis, which loops through the QTL and puts the output into a dataframe.

print(cis<-calcCis(cross, mod = step.model, lodint = TRUE, drop = 1.5))
##   pheno qtlname chr      pos    maxLod lowmarker highmarker lowposition
## 1    NA  1@11.7   1 11.74874 18.912503   c1.loc8       D1M3           8
## 2    NA  1@91.5   1 91.48060  5.359709  c1.loc86      D1M24          86
## 3    NA  2@90.6   2 90.57381 13.482043  c2.loc88   c2.loc94          88
##   highposition
## 1     13.46666
## 2     98.88917
## 3     94.00000

Plotting confidence intervals

For publication, reviewers often want to see the position of QTL intervals on a genetic map. qtlTools::segmentsOnMap provides one method to do this.

segmentsOnMap(cross=cross, 
              phe=cis$pheno, 
              chr=cis$chr, 
              l = cis$lowposition, 
              h = cis$highposition, 
              lwd = 5, legendPosition = "right", leg.inset=.1,
              palette = rainbow)
##    phe chr lod cm  l        h
## 1  1@8   1  NA NA  8 13.46666
## 2 1@86   1  NA NA 86 98.88917
## 3 2@88   2  NA NA 88 94.00000

Or more simply:

segmentsOnMap(cross=cross,
              calcCisResults=cis, 
              legendPosition = "right", leg.inset=.1, 
              palette = rainbow)
##    phe chr       lod       cm  l        h
## 1 1@12   1 18.912503 11.74874  8 13.46666
## 2 1@91   1  5.359709 91.48060 86 98.88917
## 3 2@91   2 13.482043 90.57381 88 94.00000

Compiling statistics

While R/qtl provides many functions to conduct qtl mapping, it is often difficult to extract and manipulate statical outputs. qtlTools::qtlStats provides a simple method to get statistics from a qtl model.

qtlStats(cross,  pheno.col = "phenotype",
         form = formula(step.model), 
         mod = step.model)
##    terms altnames qtlnames phenotype          formula df Type.III.SS
## 1 1@11.7       Q1   1@11.7 phenotype y ~ Q1 + Q2 + Q3  2   118.46006
## 2 1@91.5       Q2   1@91.5 phenotype y ~ Q1 + Q2 + Q3  2    28.51343
## 3 2@90.6       Q3   2@90.6 phenotype y ~ Q1 + Q2 + Q3  2    79.02101
## 4   <NA>     <NA>     <NA>      <NA>             <NA> NA          NA
## 5   <NA>     <NA>     <NA>      <NA>             <NA> NA          NA
## 6   <NA>     <NA>     <NA>      <NA>             <NA> NA          NA
##         LOD     X.var  F.value Pvalue.Chi2.    Pvalue.F. modelLod
## 1 18.912503 26.865334 52.65999 0.000000e+00 0.000000e+00 30.77612
## 2  5.359709  6.466508 12.67530 4.368084e-06 6.727920e-06 30.77612
## 3 13.482043 17.921025 35.12784 3.297362e-14 9.769963e-14 30.77612
## 4        NA        NA       NA           NA           NA       NA
## 5        NA        NA       NA           NA           NA       NA
## 6        NA        NA       NA           NA           NA       NA
##   modelPercVar qtlname  chr      pos    maxLod lowmarker highmarker
## 1     50.76899    <NA> <NA>       NA        NA      <NA>       <NA>
## 2     50.76899    <NA> <NA>       NA        NA      <NA>       <NA>
## 3     50.76899    <NA> <NA>       NA        NA      <NA>       <NA>
## 4           NA  1@11.7    1 11.74874 18.912503   c1.loc8       D1M3
## 5           NA  1@91.5    1 91.48060  5.359709  c1.loc86      D1M24
## 6           NA  2@90.6    2 90.57381 13.482043  c2.loc88   c2.loc94
##   lowposition highposition
## 1          NA           NA
## 2          NA           NA
## 3          NA           NA
## 4           8     13.46666
## 5          86     98.88917
## 6          88     94.00000

Extracting and Plotting qtl effects

`lsmeans4qtl produces a dataframe with SAS-style LSMeans, and also includes standard means, and se.

library(lsmeans)
## Loading required package: estimability
## 
## Attaching package: 'lsmeans'
## The following object is masked from 'package:devtools':
## 
##     test
alllsms<-lsmeans4qtl(cross, 
                     pheno.col = "phenotype",
                     form = "y ~ Q1 + Q2 + Q3", 
                     mod = step.model, 
                     covar=NULL)
print(alllsms)
##     Q1   Q2   Q3      lsmean        SE  df   lower.CL    upper.CL
## 1   AA <NA> <NA> -0.60919625 0.1601700 193 -0.9251046 -0.29328792
## 2   AB <NA> <NA>  1.30326041 0.1154055 193  1.0756424  1.53087842
## 3   BB <NA> <NA>  1.06590193 0.1545786 193  0.7610217  1.37078214
## 4 <NA>   AA <NA> -0.02770275 0.1636935 193 -0.3505607  0.29515519
## 5 <NA>   AB <NA>  0.70019254 0.1124443 193  0.4784150  0.92197004
## 6 <NA>   BB <NA>  1.08747630 0.1600063 193  0.7718908  1.40306181
## 7 <NA> <NA>   AA -0.36570969 0.1594569 193 -0.6802116 -0.05120775
## 8 <NA> <NA>   AB  1.14012265 0.1105252 193  0.9221303  1.35811497
## 9 <NA> <NA>   BB  0.98555312 0.1590346 193  0.6718842  1.29922206
##          mean       sem
## 1 -0.39956595 0.1765342
## 2  1.43599907 0.1231533
## 3  1.31643680 0.1958242
## 4  0.21348170 0.2566733
## 5  1.06129123 0.1240206
## 6  1.41357005 0.2162085
## 7 -0.01403133 0.1979478
## 8  1.23201204 0.1338871
## 9  1.37607761 0.2067131

Often we may be interested in some interaction among QTL (epistasis) or with a covariate. Lets add in one arbitrarily, then cull to the LSmeans of Q1 and Q3, averaging across values at Q2.

alllsms<-lsmeans4qtl(cross, 
                     pheno.col = "phenotype",
                     form = "y ~ Q1 + Q2 + Q3 + Q1*Q3", 
                     mod = step.model, 
                     covar=NULL)
## NOTE: Results may be misleading due to involvement in interactions
## NOTE: Results may be misleading due to involvement in interactions
lsms<-alllsms[!is.na(alllsms$Q1) & 
                !is.na(alllsms$Q3),]
lsms<-lsms[,c("Q1","Q3","lsmean","SE","mean","sem")]

library(ggplot2)
pos<-position_dodge(.1)
ggplot(lsms, aes(x = Q1, y = lsmean, shape = Q3,
   color = Q3, group = Q3))+
   geom_point(position = pos)+
   geom_line(position = pos)+
   theme_jtl()+
   geom_errorbar(aes(ymin = lsmean - SE, ymax = lsmean+SE), width = .1,position = pos)+
   ggtitle("sas-style LSMeans")

Summary

Here, we went over QTL mapping basics, including: - the effects of QTL - one-way QTL scans - building multiple QTL models - Fitting statistics to multiple QTL models

We have ignored many complexities in QTL mapping, including: - Epistasis - QTL*covariate interactions - Multiple phenotypes - Generating genetic maps

For more information on complex parts of QTL mapping, see R/qtl’s website.