Need to install limma:
source(“http://www.bioconductor.org/biocLite.R”) biocLite(“limma”) biocLite(“statmod”)
Load packages:
library(limma)
library(lattice)
library(ggplot2)
Load photoRec data:
prDat <- read.delim("../photoRecDataset/dataClean/GSE4051_data.tsv")
str(prDat, max.level = 0)
## 'data.frame': 29949 obs. of 39 variables:
prDes <- readRDS("../photoRecDataset/dataClean/GSE4051_design.rds")
str(prDes)
## 'data.frame': 39 obs. of 4 variables:
## $ sidChar : chr "Sample_11" "Sample_12" "Sample_2" "Sample_9" ...
## $ sidNum : num 11 12 2 9 36 37 38 39 16 17 ...
## $ devStage: Factor w/ 5 levels "E16","P2","P6",..: 5 5 5 5 5 5 5 5 1 1 ...
## $ gType : Factor w/ 2 levels "wt","NrlKO": 2 2 2 2 1 1 1 1 2 2 ...
You might want to use the functions you wrote last week to extract and stripplot excerpts from the photoRec dataset. If you stored the code defining those functions cleanly in a script, you could make them available now by using the source() function.
source("../photoRecDataset/codes/prepDatScript.r")
source("../photoRecDataset/codes/makeStripplot.R")
The lmFit() from limma is arguably your main workhorse function for fitting a common linear model to the data for a very large number of genes. It has at least two strengths to recommend it:
1. It does this in a computationally efficient way, i.e. better than you writing a top-level for() loop and probably even better than pursuing an apply()-type strategy.
2. It borrows strength across the large number of genes (= datasets) to moderate the gene-wise estimate of error variance.
Before we dive in and start using limma with the photoRec dataset, let's do a small simulation to illustrate how lousy variance estimates can be when the number of samples is small.
Let's simulate data for 1000 genes. For each gene, we get 3 observations from a normal distribution with mean 0 and variance 1. We generate the data for each gene independent of the others.
m <- 1000
n <- 3
x <- matrix(rnorm(m * n), nrow = m)
Let's take the observed gene-wise variances. Yes, folks, we are estimating variance with samples of size 3.
obsVar <- apply(x, 1, var)
summary(obsVar)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.001 0.296 0.740 0.972 1.360 6.640
mean(obsVar < 1/3)
## [1] 0.279
densityplot(~obsVar, n = 200)
Notice how many of the observed variances are freakishly small (and freakishly large!), even though they are indeed equal to 1 “on average”. For example, we see that at least a quarter of the genes appear to exhibit a sample variance that is less than one-third the true variance. This can wreak havoc with statistical inference, such as t-statistics. This is what limma – or the statistical methods it embodies, actually – is designed to combat.
wtDes <- subset(prDes, gType == "wt")
str(wtDes)
## 'data.frame': 20 obs. of 4 variables:
## $ sidChar : chr "Sample_36" "Sample_37" "Sample_38" "Sample_39" ...
## $ sidNum : num 36 37 38 39 20 21 22 23 32 33 ...
## $ devStage: Factor w/ 5 levels "E16","P2","P6",..: 5 5 5 5 1 1 1 1 4 4 ...
## $ gType : Factor w/ 2 levels "wt","NrlKO": 1 1 1 1 1 1 1 1 1 1 ...
wtDat <- subset(prDat, select = prDes$gType == "wt")
str(wtDat, max.level = 0)
## 'data.frame': 29949 obs. of 20 variables:
Let's accept the default “ref + treatment effects” scheme for handling the devStage factor. I encourage you to inspect the design matrix and confirm it's what you expect.
(wtDesMat <- model.matrix(~devStage, wtDes))
## (Intercept) devStageP2 devStageP6 devStageP10 devStage4_weeks
## 5 1 0 0 0 1
## 6 1 0 0 0 1
## 7 1 0 0 0 1
## 8 1 0 0 0 1
## 12 1 0 0 0 0
## 13 1 0 0 0 0
## 14 1 0 0 0 0
## 15 1 0 0 0 0
## 20 1 0 0 1 0
## 21 1 0 0 1 0
## 22 1 0 0 1 0
## 23 1 0 0 1 0
## 28 1 1 0 0 0
## 29 1 1 0 0 0
## 30 1 1 0 0 0
## 31 1 1 0 0 0
## 36 1 0 1 0 0
## 37 1 0 1 0 0
## 38 1 0 1 0 0
## 39 1 0 1 0 0
## attr(,"assign")
## [1] 0 1 1 1 1
## attr(,"contrasts")
## attr(,"contrasts")$devStage
## [1] "contr.treatment"
wtFit <- lmFit(wtDat, wtDesMat)
wtEbFit <- eBayes(wtFit)
The first thing we might ask is “which genes show differential expression over the course of development”? This can be addressed with an overall F test for the model. In the language used in lecture, we will compare a “big” model to a “small” model, where the “big” model includes a mean parameter (or effect) for each level of devStage and the “small” model includes a single mean parameter, e.g. an intercept. You might expect this to be the F test performed by topTable() by default, i.e. when no specific coefficients or contrasts are given to the coef argument …
topTable(wtEbFit)
## X.Intercept. devStageP2 devStageP6 devStageP10
## 1423641_s_at 12.18 -0.0175 0.0750 0.0675
## 1438940_x_at 12.86 0.0850 0.1325 0.3425
## 1438657_x_at 12.78 0.1400 0.1250 -0.1850
## 1456736_x_at 12.32 0.1625 0.3050 0.2075
## 1436884_x_at 12.93 0.1775 0.3225 0.0300
## 1419700_a_at 12.32 0.1650 0.6475 0.8175
## 1435800_a_at 12.29 0.0450 0.6825 0.9000
## 1454613_at 12.47 -0.1075 -0.0500 -0.1025
## 1451240_a_at 13.00 0.3100 0.2800 0.2800
## 1450084_s_at 12.63 0.0825 0.0525 0.1725
## devStage4_weeks AveExpr F P.Value adj.P.Val
## 1423641_s_at 0.1800 12.24 45350 3.574e-36 5.201e-32
## 1438940_x_at 0.3500 13.04 44957 3.865e-36 5.201e-32
## 1438657_x_at -0.4500 12.71 43486 5.210e-36 5.201e-32
## 1456736_x_at 0.0725 12.47 39509 1.233e-35 6.725e-32
## 1436884_x_at 0.0250 13.04 39269 1.302e-35 6.725e-32
## 1419700_a_at 0.6825 12.78 39121 1.347e-35 6.725e-32
## 1435800_a_at 1.0200 12.81 36668 2.410e-35 1.031e-31
## 1454613_at -0.3825 12.34 35835 2.962e-35 1.078e-31
## 1451240_a_at -0.3700 13.10 35481 3.239e-35 1.078e-31
## 1450084_s_at 0.2600 12.75 34411 4.265e-35 1.234e-31
by default, topTable() reports the top 10 hits. But let's take more care and specify explicitly the coefficients we want to test for equality with zero. Recall that one can specify these by number but I recommend doing this by name.
# this will work but is not recommended
topTable(wtEbFit, coef = 2:5)
## devStageP2 devStageP6 devStageP10 devStage4_weeks AveExpr
## 1440645_at 0.3990 0.1953 0.9200 3.961 6.528
## 1416041_at 0.1580 0.4798 0.3327 5.114 9.383
## 1425222_x_at 0.8820 0.7995 1.5488 5.532 7.028
## 1451635_at 1.3025 1.1900 2.0160 6.188 8.319
## 1429028_at -2.4432 -3.4072 -4.3105 -4.602 8.045
## 1422929_s_at -2.9117 -3.6182 -3.5472 -3.661 7.278
## 1424852_at 0.4575 0.2297 0.5740 3.979 7.454
## 1425171_at 0.9980 3.0530 5.2787 6.079 9.620
## 1451617_at 0.7255 2.5127 4.9837 6.685 8.817
## 1451618_at 0.6027 2.8902 5.0507 6.288 9.431
## F P.Value adj.P.Val
## 1440645_at 425.4 1.588e-17 4.755e-13
## 1416041_at 195.5 1.522e-14 2.280e-10
## 1425222_x_at 173.4 4.348e-14 4.341e-10
## 1451635_at 157.3 1.013e-13 7.585e-10
## 1429028_at 148.8 1.646e-13 9.203e-10
## 1422929_s_at 146.9 1.844e-13 9.203e-10
## 1424852_at 143.2 2.290e-13 9.799e-10
## 1425171_at 138.8 3.002e-13 1.124e-09
## 1451617_at 136.5 3.485e-13 1.160e-09
## 1451618_at 134.2 4.032e-13 1.207e-09
# remind yoursefl the coefficients names
colnames(coef(wtEbFit))
## [1] "(Intercept)" "devStageP2" "devStageP6" "devStageP10"
## [5] "devStage4_weeks"
(dsHits <- topTable(wtEbFit, coef = grep("devStage", colnames(coef(wtEbFit)))))
## devStageP2 devStageP6 devStageP10 devStage4_weeks AveExpr
## 1440645_at 0.3990 0.1953 0.9200 3.961 6.528
## 1416041_at 0.1580 0.4798 0.3327 5.114 9.383
## 1425222_x_at 0.8820 0.7995 1.5488 5.532 7.028
## 1451635_at 1.3025 1.1900 2.0160 6.188 8.319
## 1429028_at -2.4432 -3.4072 -4.3105 -4.602 8.045
## 1422929_s_at -2.9117 -3.6182 -3.5472 -3.661 7.278
## 1424852_at 0.4575 0.2297 0.5740 3.979 7.454
## 1425171_at 0.9980 3.0530 5.2787 6.079 9.620
## 1451617_at 0.7255 2.5127 4.9837 6.685 8.817
## 1451618_at 0.6027 2.8902 5.0507 6.288 9.431
## F P.Value adj.P.Val
## 1440645_at 425.4 1.588e-17 4.755e-13
## 1416041_at 195.5 1.522e-14 2.280e-10
## 1425222_x_at 173.4 4.348e-14 4.341e-10
## 1451635_at 157.3 1.013e-13 7.585e-10
## 1429028_at 148.8 1.646e-13 9.203e-10
## 1422929_s_at 146.9 1.844e-13 9.203e-10
## 1424852_at 143.2 2.290e-13 9.799e-10
## 1425171_at 138.8 3.002e-13 1.124e-09
## 1451617_at 136.5 3.485e-13 1.160e-09
## 1451618_at 134.2 4.032e-13 1.207e-09
You will notice that these are not the same hits we got with our first call to topTable(). Compare, e.g., the Affy IDs for the top hits and/or look at the typical F statistic magnitudes. And so we learn that you really must use the coef argument (or a contrasts workflow in more complicated settings) to explicitly define what you regard as a hit.
# select the genes
theGenes <- rownames(dsHits[c(3, 6, 9), ])
# use the makeStripplot and prepDat functions to plot the 3 hit genes
makeStripplot(prepDat(wtDat, wtDes, theGenes))
## Warning: row names were found from a short variable and have been
## discarded
Does it look plausible to you that – using only wild type data – these probes show the most compelling evidence for expression change over development?
nrow(topTable(wtEbFit, number = Inf, p.value = 1e-05, coef = grep("devStage",
colnames(coef(wtEbFit)))))
## [1] 350
# or
topHits <- topTable(wtEbFit, number = Inf, p.value = 1e-05, coef = grep("devStage",
colnames(coef(wtEbFit))))
nrow(topHits)
## [1] 350
topHits <- topTable(wtEbFit, number = Inf, p.value = 1e-05, coef = grep("devStage",
colnames(coef(wtEbFit))))
names(topHits)
## [1] "devStageP2" "devStageP6" "devStageP10" "devStage4_weeks"
## [5] "AveExpr" "F" "P.Value" "adj.P.Val"
topHits[63, c("F", "adj.P.Val", "devStageP6")]
## F adj.P.Val devStageP6
## 1451633_a_at 64.01 1.049e-07 2.069
# alternatively can do it like this:
topTable(wtEbFit, number = Inf, p.value = 1e-05, coef = grep("devStage", colnames(coef(wtEbFit))))[63,
c("F", "adj.P.Val", "devStageP6")]
## F adj.P.Val devStageP6
## 1451633_a_at 64.01 1.049e-07 2.069
p2hits <- topTable(wtEbFit, number = Inf, sort.by = "none", coef = grep("devStageP2",
colnames(coef(wtEbFit))))
p10hits <- topTable(wtEbFit, number = Inf, sort.by = "none", coef = grep("devStageP10",
colnames(coef(wtEbFit))))
names(p2hits)
## [1] "logFC" "AveExpr" "t" "P.Value" "adj.P.Val" "B"
smoothScatter(p2hits$t ~ p10hits$t, xlim = c(-20, 20), ylim = c(-20, 20), xlab = "t-statistic for effect of P10",
ylab = "t-statistic for effect of P2")
## KernSmooth 2.23 loaded
## Copyright M. P. Wand 1997-2009
abline(a = 0, b = 1, col = "orange")
densityplot(~p2hits$adj.P.Val + p10hits$adj.P.Val, auto.key = T, plot.points = F,
n = 200)
P10 is more clearly distinquished from E16 (the reference).
addmargins(table(p2hits$adj.P.Val < 0.001, p10hits$adj.P.Val < 0.001, dnn = c("P2",
"P10")))
## P10
## P2 FALSE TRUE Sum
## FALSE 29201 695 29896
## TRUE 1 52 53
## Sum 29202 747 29949
With the p value threshold at 0.001 there are 53 hits for P2, 747 hits for P10, and 52 overlapping hits for both.
p10hitsBY <- topTable(wtEbFit, number = Inf, sort.by = "none", adjust.method = "BY",
coef = grep("devStageP10", colnames(coef(wtEbFit))))
pValsP10 <- data.frame(raw = p10hits$P.Value, BHadj = p10hits$adj.P.Val, BYadj = p10hitsBY$adj.P.Val)
splom(pValsP10)
In other words find differentially expressed genes at P6, P10, and 4weeks. If expression doesn't change from P6 to P10 to 4weeks, then the effects for all 3 of those developmental stages should be the same. That means that the difference between the P10 and P6 effects is zero and ditto for the difference between 4_weeks effect and P10 (or P6, for that matter). Let's form these contrasts.
colnames(wtDesMat)
## [1] "(Intercept)" "devStageP2" "devStageP6" "devStageP10"
## [5] "devStage4_weeks"
(contMat <- makeContrasts(P10vsP6 = devStageP10 - devStageP6, fourWeeksVsP10 = devStage4_weeks -
devStageP10, levels = wtDesMat))
## Warning: Renaming (Intercept) to Intercept
## Contrasts
## Levels P10vsP6 fourWeeksVsP10
## Intercept 0 0
## devStageP2 0 0
## devStageP6 -1 0
## devStageP10 1 -1
## devStage4_weeks 0 1
wtFitCont <- contrasts.fit(wtFit, contMat)
## Warning: row names of contrasts don't match col names of coefficients
wtEbFitCont <- eBayes(wtFitCont)
topTable(wtEbFitCont)
## P10vsP6 fourWeeksVsP10 AveExpr F P.Value adj.P.Val
## 1440645_at 0.7247 3.041 6.528 632.7 2.224e-17 6.662e-13
## 1416041_at -0.1470 4.782 9.383 302.4 1.473e-14 2.206e-10
## 1425222_x_at 0.7493 3.983 7.028 235.4 1.300e-13 1.297e-09
## 1424852_at 0.3442 3.405 7.454 225.1 1.910e-13 1.430e-09
## 1420726_x_at 0.1732 3.551 7.190 203.5 4.555e-13 2.640e-09
## 1451635_at 0.8260 4.172 8.319 200.0 5.289e-13 2.640e-09
## 1429394_at -0.0980 2.410 7.848 167.5 2.416e-12 1.034e-08
## 1455447_at -0.9765 -1.800 9.973 153.5 5.063e-12 1.896e-08
## 1429791_at 0.2480 1.658 8.026 145.7 7.877e-12 2.621e-08
## 1422612_at 0.4837 3.426 8.833 142.2 9.676e-12 2.840e-08
(topGenes <- rownames(topTable(wtEbFitCont)[1:4, ]))
## [1] "1440645_at" "1416041_at" "1425222_x_at" "1424852_at"
makeStripplot(prepDat(wtDat, wtDes, topGenes))
## Warning: row names were found from a short variable and have been
## discarded
These 4 probes show little expression change from P6 to P10 and a strong increase from P10 to 4_weeks.
cutoff <- 1e-04
wtResCont <- decideTests(wtEbFitCont, p.value = cutoff, method = "global")
summary(wtResCont)
## P10vsP6 fourWeeksVsP10
## -1 4 8
## 0 29945 29895
## 1 0 46
There are 4 probes that go down from P6 to P10 and no hits going the other way. There are 8 probes that go down from P10 to 4_weeks and 46 going the other way.
(hits1 <- rownames(wtDat)[which(wtResCont[, "P10vsP6"] < 0)])
## [1] "1416635_at" "1437781_at" "1454752_at" "1455260_at"
makeStripplot(prepDat(wtDat, wtDes, hits1))
## Warning: row names were found from a short variable and have been
## discarded
(hits2 <- rownames(wtDat)[which(wtResCont[, "fourWeeksVsP10"] < 0)])
## [1] "1416021_a_at" "1423851_a_at" "1434500_at" "1435415_x_at"
## [5] "1437502_x_at" "1448182_a_at" "1452679_at" "1455447_at"
makeStripplot(prepDat(wtDat, wtDes, hits2[1:4]))
## Warning: row names were found from a short variable and have been
## discarded
intersect(hits1, hits2)
## character(0)
No, there is not.
(hits3 <- rownames(wtDat)[which(wtResCont[, "fourWeeksVsP10"] > 0)])
## [1] "1416041_at" "1417280_at" "1418406_at" "1418710_at"
## [5] "1418789_at" "1419069_at" "1420725_at" "1420726_x_at"
## [9] "1421061_at" "1421818_at" "1422612_at" "1424852_at"
## [13] "1424895_at" "1425222_x_at" "1426059_at" "1426223_at"
## [17] "1427388_at" "1428763_at" "1429394_at" "1429791_at"
## [21] "1430580_at" "1431174_at" "1433699_at" "1434297_at"
## [25] "1434573_at" "1435436_at" "1435679_at" "1435727_s_at"
## [29] "1436265_at" "1436287_at" "1440402_at" "1440645_at"
## [33] "1441518_at" "1442243_at" "1443252_at" "1446484_at"
## [37] "1449170_at" "1449393_at" "1451042_a_at" "1451635_at"
## [41] "1452243_at" "1453385_at" "1455493_at" "1457878_at"
## [45] "1458418_at" "1459904_at"
makeStripplot(prepDat(wtDat, wtDes, hits3[1:4]))
## Warning: row names were found from a short variable and have been
## discarded
intersect(hits1, hits3)
## character(0)
intersect(hits2, hits3)
## character(0)
That's disappointing.
cutoff <- 0.01
nHits <- 8
wtResCont <- decideTests(wtEbFitCont, p.value = cutoff, method = "global")
summary(wtResCont)
## P10vsP6 fourWeeksVsP10
## -1 40 49
## 0 29897 29636
## 1 12 264
hits1 <- rownames(wtDat)[which(wtResCont[, "P10vsP6"] < 0)]
makeStripplot(prepDat(wtDat, wtDes, hits1[1:nHits]))
## Warning: row names were found from a short variable and have been
## discarded
hits2 <- rownames(wtDat)[which(wtResCont[, "fourWeeksVsP10"] < 0)]
makeStripplot(prepDat(wtDat, wtDes, hits2[1:nHits]))
## Warning: row names were found from a short variable and have been
## discarded
hits3 <- rownames(wtDat)[which(wtResCont[, "P10vsP6"] > 0)]
makeStripplot(prepDat(wtDat, wtDes, hits3[1:nHits]))
## Warning: row names were found from a short variable and have been
## discarded
hits4 <- rownames(wtDat)[which(wtResCont[, "fourWeeksVsP10"] > 0)]
makeStripplot(prepDat(wtDat, wtDes, hits4[1:nHits]))
## Warning: row names were found from a short variable and have been
## discarded
vennDiagram(wtResCont)
hits5 <- rownames(wtDat)[which(wtResCont[, "P10vsP6"] != 0 & wtResCont[, "fourWeeksVsP10"] !=
0)]
makeStripplot(prepDat(wtDat, wtDes, hits5))
## Warning: row names were found from a short variable and have been
## discarded
(hits6 <- rownames(wtDat)[which(wtResCont[, "P10vsP6"] > 0 & wtResCont[, "fourWeeksVsP10"] <
0)])
## [1] "1441815_at"
makeStripplot(prepDat(wtDat, wtDes, hits6))
colnames(wtDesMat)
## [1] "(Intercept)" "devStageP2" "devStageP6" "devStageP10"
## [5] "devStage4_weeks"
(contMat2 <- makeContrasts(P2vsE16 = devStageP2 - (Intercept), P6vsP2 = devStageP6 -
devStageP2, P10vsP6 = devStageP10 - devStageP6, fourWeeksVsP10 = devStage4_weeks -
devStageP10, levels = wtDesMat))
## Warning: Renaming (Intercept) to Intercept
## Contrasts
## Levels P2vsE16 P6vsP2 P10vsP6 fourWeeksVsP10
## Intercept -1 0 0 0
## devStageP2 1 -1 0 0
## devStageP6 0 1 -1 0
## devStageP10 0 0 1 -1
## devStage4_weeks 0 0 0 1
wtFitCont2 <- contrasts.fit(wtFit, contMat2)
## Warning: row names of contrasts don't match col names of coefficients
wtEbFitCont2 <- eBayes(wtFitCont2)
topTable(wtEbFitCont2)
## P2vsE16 P6vsP2 P10vsP6 fourWeeksVsP10 AveExpr F
## 1438657_x_at -12.64 -0.0150 -3.100e-01 -0.2650 12.71 2641
## 1423641_s_at -12.20 0.0925 -7.500e-03 0.1125 12.24 2610
## 1438940_x_at -12.78 0.0475 2.100e-01 0.0075 13.04 2485
## 1436884_x_at -12.75 0.1450 -2.925e-01 -0.0050 13.04 2226
## 1454613_at -12.57 0.0575 -5.250e-02 -0.2800 12.34 2212
## 1456736_x_at -12.16 0.1425 -9.750e-02 -0.1350 12.47 2202
## 1456349_x_at -11.58 -0.1200 -1.250e-02 -0.3375 11.53 2077
## 1451240_a_at -12.69 -0.0300 -2.776e-16 -0.6500 13.10 2032
## 1438859_x_at -13.00 -0.2425 -9.750e-02 -0.3125 12.63 1978
## 1416187_s_at -12.08 -0.1300 -5.500e-02 0.0825 12.18 1964
## P.Value adj.P.Val
## 1438657_x_at 1.309e-24 2.179e-20
## 1423641_s_at 1.455e-24 2.179e-20
## 1438940_x_at 2.259e-24 2.256e-20
## 1436884_x_at 6.061e-24 3.329e-20
## 1454613_at 6.413e-24 3.329e-20
## 1456736_x_at 6.669e-24 3.329e-20
## 1456349_x_at 1.125e-23 4.811e-20
## 1451240_a_at 1.369e-23 5.057e-20
## 1438859_x_at 1.744e-23 5.057e-20
## 1416187_s_at 1.858e-23 5.057e-20
(topGenes <- rownames(topTable(wtEbFitCont2[1:4, ])))
## [1] "1415673_at" "1415670_at" "1415672_at" "1415671_at"
makeStripplot(prepDat(wtDat, wtDes, topGenes))
## Warning: row names were found from a short variable and have been
## discarded
cutoff = 1e-04
wtResCont2 <- decideTests(wtEbFitCont2, p.value = cutoff, method = "global")
summary(wtResCont2)
## P2vsE16 P6vsP2 P10vsP6 fourWeeksVsP10
## -1 29903 0 32 37
## 0 46 29937 29907 29712
## 1 0 12 10 200
hits7 <- rownames(wtDat)[which(wtResCont2[, "P2vsE16"] != 0 & wtResCont2[, "P6vsP2"] !=
0 & wtResCont2[, "P10vsP6"] == 0 & wtResCont2[, "fourWeeksVsP10"] == 0)]
makeStripplot(prepDat(wtDat, wtDes, hits7))
## Warning: row names were found from a short variable and have been
## discarded
this does not look right?!