STAT540 Seminar 06: Fitting and interpreting linear models (high volume)

Rebecca Johnston 12/02/2014

Load required packages and photoRec data
Fit a linear model
Be the boss of topTable()
Perform inference for some contrasts
Take-home exercise

Load required packages and photoRec data

library(limma)  # fit linear models
library(ggplot2)  # graphing
library(limma)  # graphing
library(reshape2)  # for reshaping data from wide to tall format
library(plyr)  # for data aggregation tasks

Load gene expression dataset:

prDat <- read.table("GSE4051_data.tsv")
str(prDat, max.level = 0)

## 'data.frame':    29949 obs. of  39 variables:

Load design matrix:

prDes <- readRDS("GSE4051_design.rds")
str(prDes)

## 'data.frame':    39 obs. of  4 variables:
##  $ sidChar : chr  "Sample_20" "Sample_21" "Sample_22" "Sample_23" ...
##  $ sidNum  : num  20 21 22 23 16 17 6 24 25 26 ...
##  $ devStage: Factor w/ 5 levels "E16","P2","P6",..: 1 1 1 1 1 1 1 2 2 2 ...
##  $ gType   : Factor w/ 2 levels "wt","NrlKO": 1 1 1 1 2 2 2 1 1 1 ...

Fit a linear model

Explain gene expression in the wild type mice as a function of developmental stage (one-way ANOVA)?

Let's just work with the wild type data.

wtDat <- subset(prDat, select = prDes$gType == "wt")
str(wtDat, max.level = 0)

## 'data.frame':    29949 obs. of  20 variables:

wtDes <- subset(prDes, gType == "wt")
str(wtDes)

## 'data.frame':    20 obs. of  4 variables:
##  $ sidChar : chr  "Sample_20" "Sample_21" "Sample_22" "Sample_23" ...
##  $ sidNum  : num  20 21 22 23 24 25 26 27 28 29 ...
##  $ devStage: Factor w/ 5 levels "E16","P2","P6",..: 1 1 1 1 2 2 2 2 3 3 ...
##  $ gType   : Factor w/ 2 levels "wt","NrlKO": 1 1 1 1 1 1 1 1 1 1 ...

Before we can use limma we must make our design matrix. Let's accept the default “reference + treatment effects” scheme for handling the devStage factor. I encourage you to inspect the design matrix and confirm it's what you expect.

wtDesMat <- model.matrix(~devStage, wtDes)
wtDesMat

##    (Intercept) devStageP2 devStageP6 devStageP10 devStage4_weeks
## 12           1          0          0           0               0
## 13           1          0          0           0               0
## 14           1          0          0           0               0
## 15           1          0          0           0               0
## 28           1          1          0           0               0
## 29           1          1          0           0               0
## 30           1          1          0           0               0
## 31           1          1          0           0               0
## 36           1          0          1           0               0
## 37           1          0          1           0               0
## 38           1          0          1           0               0
## 39           1          0          1           0               0
## 20           1          0          0           1               0
## 21           1          0          0           1               0
## 22           1          0          0           1               0
## 23           1          0          0           1               0
## 5            1          0          0           0               1
## 6            1          0          0           0               1
## 7            1          0          0           0               1
## 8            1          0          0           0               1
## attr(,"assign")
## [1] 0 1 1 1 1
## attr(,"contrasts")
## attr(,"contrasts")$devStage
## [1] "contr.treatment"

Now we will fit the model, for all probes at once, and use eBayes() to moderate the estimated error variances:

wtFit <- lmFit(wtDat, wtDesMat)
wtEbFit <- eBayes(wtFit)

The first thing we might ask is “which genes show differential expression over the course of development”? This can be addressed with an overall F test for the model. In the language used in lecture, we will compare a “big” model to a “small” model, where the “big” model includes a mean parameter (or effect) for each level of devStage and the “small” model includes a single mean parameter, e.g. an intercept. You might expect this to be the F test performed by topTable() by default, i.e. when no specific coefficients or contrasts are given to the coef argument …

topTable(wtEbFit)

##              X.Intercept. devStageP2 devStageP6 devStageP10
## 1423641_s_at        12.18    -0.0175     0.0750      0.0675
## 1438940_x_at        12.86     0.0850     0.1325      0.3425
## 1438657_x_at        12.78     0.1400     0.1250     -0.1850
## 1456736_x_at        12.32     0.1625     0.3050      0.2075
## 1436884_x_at        12.93     0.1775     0.3225      0.0300
## 1419700_a_at        12.32     0.1650     0.6475      0.8175
## 1435800_a_at        12.28     0.0450     0.6825      0.9000
## 1454613_at          12.47    -0.1075    -0.0500     -0.1025
## 1451240_a_at        13.00     0.3100     0.2800      0.2800
## 1450084_s_at        12.63     0.0825     0.0525      0.1725
##              devStage4_weeks AveExpr     F   P.Value adj.P.Val
## 1423641_s_at          0.1800   12.24 45350 3.574e-36 5.201e-32
## 1438940_x_at          0.3500   13.04 44957 3.865e-36 5.201e-32
## 1438657_x_at         -0.4500   12.71 43486 5.210e-36 5.201e-32
## 1456736_x_at          0.0725   12.47 39509 1.233e-35 6.725e-32
## 1436884_x_at          0.0250   13.04 39269 1.302e-35 6.725e-32
## 1419700_a_at          0.6825   12.78 39121 1.347e-35 6.725e-32
## 1435800_a_at          1.0200   12.81 36668 2.410e-35 1.031e-31
## 1454613_at           -0.3825   12.34 35835 2.962e-35 1.078e-31
## 1451240_a_at         -0.3700   13.10 35481 3.239e-35 1.078e-31
## 1450084_s_at          0.2600   12.75 34411 4.265e-35 1.234e-31

You'll see that, by default, topTable() reports the top 10 hits. But let's take more care and specify explicitly the coefficients we want to test for equality with zero.

colnames(coef(wtEbFit))  # remind yourself of the coef names

## [1] "(Intercept)"     "devStageP2"      "devStageP6"      "devStageP10"    
## [5] "devStage4_weeks"

(dsHits <- topTable(wtEbFit, coef = grep("devStage", colnames(coef(wtEbFit)))))

##              devStageP2 devStageP6 devStageP10 devStage4_weeks AveExpr
## 1440645_at       0.3990     0.1952      0.9200           3.961   6.528
## 1416041_at       0.1580     0.4797      0.3327           5.115   9.383
## 1425222_x_at     0.8820     0.7995      1.5488           5.532   7.028
## 1451635_at       1.3025     1.1900      2.0160           6.188   8.319
## 1429028_at      -2.4433    -3.4073     -4.3105          -4.602   8.045
## 1422929_s_at    -2.9117    -3.6182     -3.5472          -3.661   7.278
## 1424852_at       0.4575     0.2298      0.5740           3.979   7.454
## 1425171_at       0.9980     3.0530      5.2787           6.079   9.620
## 1451617_at       0.7255     2.5128      4.9837           6.685   8.817
## 1451618_at       0.6028     2.8903      5.0507           6.288   9.431
##                  F   P.Value adj.P.Val
## 1440645_at   425.4 1.588e-17 4.755e-13
## 1416041_at   195.5 1.522e-14 2.280e-10
## 1425222_x_at 173.4 4.348e-14 4.341e-10
## 1451635_at   157.3 1.013e-13 7.585e-10
## 1429028_at   148.8 1.646e-13 9.203e-10
## 1422929_s_at 146.9 1.844e-13 9.203e-10
## 1424852_at   143.2 2.290e-13 9.799e-10
## 1425171_at   138.8 3.002e-13 1.124e-09
## 1451617_at   136.5 3.485e-13 1.160e-09
## 1451618_at   134.2 4.032e-13 1.207e-09

You will notice that these are not the same hits we got with our first call to topTable(). Compare, e.g., the Affy IDs for the top hits and/or look at the typical F statistic magnitudes. And so we learn that you really must use the coef argument (or a contrasts workflow in more complicated settings) to explicitly define what you regard as a hit.

Use the hit list you stored above and your functions for extracting and plotting data to produce this plot for hits 3, 6, and 9 on the list.

Function for preparing gene expression data:

prepData <- function(x) {
  luckyDf <- prDat[x, ]
  luckyDf <- data.frame(gene = rownames(luckyDf), luckyDf) 
  luckyDf <- melt(luckyDf, id.vars = "gene", variable.name = "sidChar", 
                  value.name = "gExp")
  # Merge with wt experimental design
  luckyDf <- merge(luckyDf, wtDes, by = "sidChar")
  luckyDf <- arrange(luckyDf, devStage, gType)
  }

Function for making strip plots:

makeStripplot <- function(x) {
    ggplot(x, aes(devStage, gExp, colour = gType, group = gType)) + geom_point() + 
        facet_wrap(~gene) + stat_summary(fun.y = mean, geom = "line")
}

Run both functions given required gene list:

makeStripplot(newDat <- prepData(rownames(dsHits[c(3, 6, 9), ])))

plot of chunk unnamed-chunk-11

str(newDat)

## 'data.frame':    60 obs. of  6 variables:
##  $ sidChar : Factor w/ 39 levels "Sample_20","Sample_21",..: 1 1 1 2 2 2 3 3 3 4 ...
##  $ gene    : Factor w/ 3 levels "1422929_s_at",..: 3 2 1 1 3 2 2 3 1 3 ...
##  $ gExp    : num  5.83 5.49 9.88 10.7 5.72 ...
##  $ sidNum  : num  20 20 20 21 21 21 22 22 22 23 ...
##  $ devStage: Factor w/ 5 levels "E16","P2","P6",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ gType   : Factor w/ 2 levels "wt","NrlKO": 1 1 1 1 1 1 1 1 1 1 ...

Does it look plausible to you that – using only wild type data – these probes show the most compelling evidence for expression change over development?

Optional exercise: use lm() on one or all 3 of these probes and check if the F stats and p-values are similar. Don't expect exact equality because you must remember that limma has moderated the estimated error variance.

I'll choose one of the 3 probes, “1451617_at”, for a comparison:

mDat <- prepData("1451617_at")
str(mDat)

## 'data.frame':    20 obs. of  6 variables:
##  $ sidChar : Factor w/ 39 levels "Sample_20","Sample_21",..: 1 2 3 4 8 9 10 11 16 17 ...
##  $ gene    : Factor w/ 1 level "1451617_at": 1 1 1 1 1 1 1 1 1 1 ...
##  $ gExp    : num  5.83 5.72 6.08 5.71 6.8 ...
##  $ sidNum  : num  20 21 22 23 24 25 26 27 28 29 ...
##  $ devStage: Factor w/ 5 levels "E16","P2","P6",..: 1 1 1 1 2 2 2 2 3 3 ...
##  $ gType   : Factor w/ 2 levels "wt","NrlKO": 1 1 1 1 1 1 1 1 1 1 ...

Now call linear model using “lm” function:

mDatlm <- lm(gExp ~ devStage, mDat)
summary(mDatlm)

## 
## Call:
## lm(formula = gExp ~ devStage, data = mDat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.9730 -0.1863 -0.0371  0.2429  1.1810 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        5.835      0.259   22.54  5.6e-13 ***
## devStageP2         0.726      0.366    1.98    0.066 .  
## devStageP6         2.513      0.366    6.86  5.4e-06 ***
## devStageP10        4.984      0.366   13.61  7.6e-10 ***
## devStage4_weeks    6.685      0.366   18.26  1.2e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.518 on 15 degrees of freedom
## Multiple R-squared:  0.969,  Adjusted R-squared:  0.961 
## F-statistic:  119 on 4 and 15 DF,  p-value: 3.58e-11

Compare this to limma result:

dsHits["1451617_at", c("F", "P.Value")]

##                F   P.Value
## 1451617_at 136.5 3.485e-13

The results from lm vs. limma for this probe are fairly similar, but not exactly the same.

Be the boss of topTable()

You need to learn to take control of topTable() by using various arguments to get the hits you want in the order you want. Furthermore, you should familiarize yourself with the output it returns, so you are comfortable extracting the output that you need.

How many probes have Benjamini-Hochberg (“BH”) adjusted p-values for the F test conducted above that are less than 1e-05?

cutoff <- 1e-05
dsHits <- topTable(wtEbFit, coef = grep("devStage", colnames(coef(wtEbFit))), 
    p.value = cutoff, n = nrow(wtDat))
nrow(dsHits)

## [1] 350

What is the 63rd hit on this list? Provide it's Affy ID, F statistic, BH adjusted p-value, and the estimated effect for developmental stage “P6” in that order.

dsHits[63, c("F", "adj.P.Val", "devStageP6")]

##                  F adj.P.Val devStageP6
## 1451633_a_at 64.01 1.049e-07      2.069

Consider the effects associated with developmental stages P2 and P10. Scatterplot the t statistics for the test that the P2 effect is zero against that for P10. Ideally this plot would be a high-volume scatterplot, include an x = y line, and have an aspect ratio of 1 and common axes, but just do your best.

First identify differentially expressed genes in P2 and P10 using topTable, make a dataframe containing the t statistic columns from the respective results, then plot using ggplot2:

# colnames(wtEbFit)
P2Hits <- topTable(wtEbFit, coef = "devStageP2", n = nrow(wtDat), 
                   sort = "none")
P10Hits <- topTable(wtEbFit, coef = "devStageP10", n = nrow(wtDat), 
                    sort = "none")
# head(P2Hits)
# head(P10Hits)

P2P10t <- data.frame(P2 = P2Hits$t, P10 = P10Hits$t)
summary(P2P10t)

##        P2               P10         
##  Min.   :-15.934   Min.   :-20.055  
##  1st Qu.: -1.650   1st Qu.: -1.830  
##  Median :  0.299   Median :  0.457  
##  Mean   :  0.104   Mean   : -0.033  
##  3rd Qu.:  1.961   3rd Qu.:  1.916  
##  Max.   :  7.012   Max.   : 16.750

P10 has the largest range in values, therefore use these values for both the x and y limits of the plot for the aspect ratio to be 1:

ggplot(P2P10t, aes(P2, P10)) +
  geom_point(alpha = 0.1) +
  stat_density2d(geom = "polygon", aes(fill = ..level..)) + 
  geom_abline(intercept = 0, slope = 1) +
  scale_x_continuous(limits = c(min(P2P10t$P10), max(P2P10t$P10))) +
  scale_y_continuous(limits = c(min(P2P10t$P10), max(P2P10t$P10))) +
  xlab("t-statistic for P2 effect") +
  ylab("t-statistic for P10 effect")

plot of chunk unnamed-chunk-18

Create a density plot of the associated adjusted p-values, so you can get a sense of which developmental stage, P2 or P10, is more clearly distinguished from baseline E16.

P2P10adjP <- data.frame(P2 = P2Hits$adj.P.Val, P10 = P10Hits$adj.P.Val)
P2P10adjPtall <- melt(P2P10adjP, variable.name = "DevStage", 
                      value.name = "adjPVal")

ggplot(P2P10adjPtall, aes(adjPVal, colour = DevStage)) + geom_density() + xlab("Adjusted P Value")

plot of chunk unnamed-chunk-20

Is this what you'd expect? Yes, I would expect the gene expression of P10 to be more clearly distinguished from baseline E16 than P2 as it is a later time point.

If you require a BH adjusted p-value less than 1e-03, how many hits do you get for P2? How many for P10? How much overlap is there?

addmargins(table(P2Hits$adj.P.Val < 0.001, P10Hits$adj.P.Val < 0.001, dnn = c("P2", 
    "P10")))

##        P10
## P2      FALSE  TRUE   Sum
##   FALSE 29201   695 29896
##   TRUE      1    52    53
##   Sum   29202   747 29949

There are 53 hits for P2, 747 hits for P10, and 53 overlap.

How can you do this test without using table?

Now just focus on the P10 effect. Create a scatterplot matrix of raw p-values, BH adjusted p-values, and BY p-values.

P10pVals <- data.frame(raw = P10Hits$P.Value,
                       BH = P10Hits$adj.P.Val,
                       BY = p.adjust(P10Hits$P.Value, method = "BY"))
head(P10pVals)

##      raw     BH BY
## 1 0.5156 0.6027  1
## 2 0.0420 0.1056  1
## 3 0.1602 0.2498  1
## 4 0.9666 0.9742  1
## 5 0.1911 0.2839  1
## 6 0.2164 0.3112  1

The ggplot2 result from plotmatrix looked awful, so I'm resorting to base graphics. A message from the ggplot2 output suggested the GGally package intsead but this package masks several functions I use regularly such as melt!

pairs(P10pVals)

plot of chunk unnamed-chunk-23

As expected, the raw and BH-adjusted p values are positively correlated. I'm not sure what BY-adjusted P values are so I can't comment on this.

Perform inference for some contrasts

Let's try to distinguish genes that have stable expression at the last three developmental stages (P6, P10, and 4_weeks) from those that do not. If expression doesn't change from P6 to P10 to 4_weeks, then the effects for all 3 of those developmental stages should be the same. That means that the difference between the P10 and P6 effects is zero and ditto for the difference between 4_weeks effect and P10 (or P6, for that matter). Let's form these contrasts.

colnames(wtDesMat)

## [1] "(Intercept)"     "devStageP2"      "devStageP6"      "devStageP10"    
## [5] "devStage4_weeks"

(cont.matrix <- makeContrasts(
  P10VsP6 = devStageP10 - devStageP6, 
  fourweeksVsP10 = devStage4_weeks - devStageP10, 
  levels = wtDesMat))

## Warning: Renaming (Intercept) to Intercept

##                  Contrasts
## Levels            P10VsP6 fourweeksVsP10
##   Intercept             0              0
##   devStageP2            0              0
##   devStageP6           -1              0
##   devStageP10           1             -1
##   devStage4_weeks       0              1

wtFitCont <- contrasts.fit(wtFit, cont.matrix)

## Warning: row names of contrasts don't match col names of coefficients

wtEbFitCont <- eBayes(wtFitCont)

What does topTable() do with our contrasts?

topTable(wtEbFitCont)

##              P10VsP6 fourweeksVsP10 AveExpr     F   P.Value adj.P.Val
## 1440645_at    0.7247          3.041   6.528 632.7 2.224e-17 6.662e-13
## 1416041_at   -0.1470          4.782   9.383 302.4 1.473e-14 2.206e-10
## 1425222_x_at  0.7492          3.983   7.028 235.4 1.300e-13 1.297e-09
## 1424852_at    0.3442          3.405   7.454 225.1 1.910e-13 1.430e-09
## 1420726_x_at  0.1732          3.551   7.190 203.5 4.555e-13 2.640e-09
## 1451635_at    0.8260          4.172   8.319 200.0 5.289e-13 2.640e-09
## 1429394_at   -0.0980          2.410   7.848 167.5 2.416e-12 1.034e-08
## 1455447_at   -0.9765         -1.800   9.973 153.5 5.063e-12 1.896e-08
## 1429791_at    0.2480          1.658   8.026 145.7 7.877e-12 2.621e-08
## 1422612_at    0.4837          3.426   8.833 142.2 9.676e-12 2.840e-08

The top hits are probes where there is big change from P6 to P10, from P10 to 4_weeks, or both. Let's check that by plotting the data from the top 4 hits.

contHits <- topTable(wtEbFitCont)
makeStripplot(prepData(rownames(contHits)[1:4]))

plot of chunk unnamed-chunk-26

So far, so good. These 4 probes show little expression change from P6 to P10 and a strong increase from P10 to 4_weeks. I would like to find some where there's a change in each case but perhaps in opposite direction. Let's press on.

Let's use decideTests() to adjust the p-values for both contrasts globally, i.e. all together and then threshhold them at a cutoff of 1e-04.

cutoff <- 1e-04
wtResCont <- decideTests(wtEbFitCont, p.value = cutoff, method = "global")
summary(wtResCont)

##    P10VsP6 fourweeksVsP10
## -1       4              8
## 0    29945          29895
## 1        0             46

We see there are 4 probes that go down from P6 to P10 and no hits going the other way. There are 8 probes that go down from P10 to 4_weeks and 46 going the other way. Let's try to pull out various hits and plot their data. Here are the 4 that decline from P6 to P10:

(hits1 <- rownames(prDat)[which(wtResCont[, "P10VsP6"] < 0)])

## [1] "1416635_at" "1437781_at" "1454752_at" "1455260_at"

makeStripplot(prepData(hits1))

plot of chunk unnamed-chunk-28

Here are 4 of the 8 that decline from P10 to 4_weeks:

(hits2 <- rownames(prDat)[which(wtResCont[, "fourweeksVsP10"] < 0)])

## [1] "1416021_a_at" "1423851_a_at" "1434500_at"   "1435415_x_at"
## [5] "1437502_x_at" "1448182_a_at" "1452679_at"   "1455447_at"

makeStripplot(prepData(hits2[1:4]))

plot of chunk unnamed-chunk-29

Is there any overlap between these probes?

intersect(hits1, hits2)

## character(0)

Apparently not. Here are 4 of the 46 that increase from P10 to 4_weeks:

(hits3 <- rownames(prDat)[which(wtResCont[, "fourweeksVsP10"] > 0)])

##  [1] "1416041_at"   "1417280_at"   "1418406_at"   "1418710_at"  
##  [5] "1418789_at"   "1419069_at"   "1420725_at"   "1420726_x_at"
##  [9] "1421061_at"   "1421818_at"   "1422612_at"   "1424852_at"  
## [13] "1424895_at"   "1425222_x_at" "1426059_at"   "1426223_at"  
## [17] "1427388_at"   "1428763_at"   "1429394_at"   "1429791_at"  
## [21] "1430580_at"   "1431174_at"   "1433699_at"   "1434297_at"  
## [25] "1434573_at"   "1435436_at"   "1435679_at"   "1435727_s_at"
## [29] "1436265_at"   "1436287_at"   "1440402_at"   "1440645_at"  
## [33] "1441518_at"   "1442243_at"   "1443252_at"   "1446484_at"  
## [37] "1449170_at"   "1449393_at"   "1451042_a_at" "1451635_at"  
## [41] "1452243_at"   "1453385_at"   "1455493_at"   "1457878_at"  
## [45] "1458418_at"   "1459904_at"

makeStripplot(prepData(hits3[1:4]))

plot of chunk unnamed-chunk-31

Is there any overlap between these probes and the previous “down” hits?

intersect(hits1, hits3)

## character(0)

intersect(hits2, hits3)

## character(0)

That's disappointing. If I revisit this workflow but make the p-value cutoff less stringent, maybe I can find the gene expression profile I'm looking for:

cutoff <- 0.01
nHits <- 8
wtResCont <- decideTests(wtEbFitCont, p.value = cutoff, method = "global")
summary(wtResCont)

##    P10VsP6 fourweeksVsP10
## -1      40             49
## 0    29897          29636
## 1       12            264

hits1 <- rownames(prDat)[which(wtResCont[, "P10VsP6"] < 0)]
makeStripplot(prepData(hits1[1:nHits]))