Rebecca Johnston 05/02/2014
library("ggplot2") # for graphing
library("reshape2") # for converting data from wide to tall format
library("plyr") # for data aggregation tasks
Load gene expression dataset:
prDat <- read.table("GSE4051_data.tsv")
str(prDat, max.level = 0)
## 'data.frame': 29949 obs. of 39 variables:
Load design matrix:
prDes <- readRDS("GSE4051_design.rds")
str(prDes)
## 'data.frame': 39 obs. of 4 variables:
## $ sidChar : chr "Sample_20" "Sample_21" "Sample_22" "Sample_23" ...
## $ sidNum : num 20 21 22 23 16 17 6 24 25 26 ...
## $ devStage: Factor w/ 5 levels "E16","P2","P6",..: 1 1 1 1 1 1 1 2 2 2 ...
## $ gType : Factor w/ 2 levels "wt","NrlKO": 1 1 1 1 2 2 2 1 1 1 ...
Write a function that takes any given affy probe ID and turns their associated values into a data frame.
METHOD: melting and merging. Index prDat with chosen genes:
luckyGenes <- c("1419655_at", "1438815_at")
luckyDf <- prDat[luckyGenes, ]
Reshape and bind with design matrix. Add rownames (probe ID) as a column for melt function first:
luckyDf <- data.frame(gene = rownames(luckyDf), luckyDf)
luckyDf <- melt(luckyDf, id.vars = "gene", variable.name = "sidChar", value.name = "gExp")
luckyDf <- merge(luckyDf, prDes, by = "sidChar")
Reorder using plyr function arrange:
luckyDf <- arrange(luckyDf, devStage, gType)
We can now write the function where variable input is “luckyGenes”:
luckyGenes <- c("1419655_at", "1438815_at")
prepData <- function(x) {
luckyDf <- prDat[x, ]
luckyDf <- data.frame(gene = rownames(luckyDf), luckyDf)
luckyDf <- melt(luckyDf, id.vars = "gene", variable.name = "sidChar", value.name = "gExp")
luckyDf <- merge(luckyDf, prDes, by = "sidChar")
luckyDf <- arrange(luckyDf, devStage, gType)
}
jDat <- prepData(luckyGenes)
str(jDat)
## 'data.frame': 78 obs. of 6 variables:
## $ sidChar : Factor w/ 39 levels "Sample_20","Sample_21",..: 1 1 2 2 3 3 4 4 5 5 ...
## $ gene : Factor w/ 2 levels "1419655_at","1438815_at": 1 2 1 2 1 2 1 2 1 2 ...
## $ gExp : num 10.93 8.84 10.74 8.85 10.67 ...
## $ sidNum : num 20 20 21 21 22 22 23 23 16 16 ...
## $ devStage: Factor w/ 5 levels "E16","P2","P6",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ gType : Factor w/ 2 levels "wt","NrlKO": 1 1 1 1 1 1 1 1 2 2 ...
Plot data using ggplot:
ggplot(jDat, aes(devStage, gExp, colour = gType, group = gType)) + geom_point() +
facet_wrap(~gene) + stat_summary(fun.y = mean, geom = "line")
You will probably make lots of these plots. Why not write a function for this?
makeStripplot <- function(x) {
ggplot(x, aes(devStage, gExp, colour = gType, group = gType)) + geom_point() +
facet_wrap(~gene) + stat_summary(fun.y = mean, geom = "line")
}
makeStripplot(jDat)
You can use both of your functions together and create a mini dataset and plot it all at once:
makeStripplot(newDat <- prepData("1456341_a_at"))
str(newDat)
## 'data.frame': 39 obs. of 6 variables:
## $ sidChar : Factor w/ 39 levels "Sample_20","Sample_21",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ gene : Factor w/ 1 level "1456341_a_at": 1 1 1 1 1 1 1 1 1 1 ...
## $ gExp : num 7.04 7.48 7.37 6.94 6.16 ...
## $ sidNum : num 20 21 22 23 16 17 6 24 25 26 ...
## $ devStage: Factor w/ 5 levels "E16","P2","P6",..: 1 1 1 1 1 1 1 2 2 2 ...
## $ gType : Factor w/ 2 levels "wt","NrlKO": 1 1 1 1 2 2 2 1 1 1 ...
Let's test for a difference in expected gene expression for probeset “1456341_a_at” at developmental stage P2 v. 4 weeks post-natal (ignoring genotype, i.e. lump the wild types and knockouts together). Let's assume a common variance in the two groups.
levels(newDat$devStage)
## [1] "E16" "P2" "P6" "P10" "4_weeks"
t.test(gExp ~ devStage, newDat, subset = devStage == "P2" | devStage == "4_weeks")
##
## Welch Two Sample t-test
##
## data: gExp by devStage
## t = -18.84, df = 13.98, p-value = 2.477e-11
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -4.078 -3.244
## sample estimates:
## mean in group P2 mean in group 4_weeks
## 6.326 9.987
In other words, do “one-way ANOVA”. Focus on probeset “1438786_a_at”. Use function we already prepared to subset data and draw stripplot:
makeStripplot(mDat <- prepData("1438786_a_at"))
str(mDat)
## 'data.frame': 39 obs. of 6 variables:
## $ sidChar : Factor w/ 39 levels "Sample_20","Sample_21",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ gene : Factor w/ 1 level "1438786_a_at": 1 1 1 1 1 1 1 1 1 1 ...
## $ gExp : num 7.94 8.99 8.55 8.62 5.72 ...
## $ sidNum : num 20 21 22 23 16 17 6 24 25 26 ...
## $ devStage: Factor w/ 5 levels "E16","P2","P6",..: 1 1 1 1 1 1 1 2 2 2 ...
## $ gType : Factor w/ 2 levels "wt","NrlKO": 1 1 1 1 2 2 2 1 1 1 ...
Now call linear model using “lm” function:
mDatlm <- lm(gExp ~ devStage, mDat, subset = gType == "wt")
summary(mDatlm)
##
## Call:
## lm(formula = gExp ~ devStage, data = mDat, subset = gType ==
## "wt")
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.1565 -0.4400 0.0288 0.4915 1.2065
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.523 0.379 22.50 5.7e-13 ***
## devStageP2 -1.450 0.536 -2.71 0.016 *
## devStageP6 -0.107 0.536 -0.20 0.845
## devStageP10 -1.201 0.536 -2.24 0.040 *
## devStage4_weeks 0.081 0.536 0.15 0.882
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.758 on 15 degrees of freedom
## Multiple R-squared: 0.497, Adjusted R-squared: 0.363
## F-statistic: 3.71 on 4 and 15 DF, p-value: 0.0272
This result looks plausible: the intercept of 8.52 is close to the mean of E16 (the reference). And the values for the other devStages change accordingly (-/+ slope wrt devStageE16).
The “W” shape of the expression profile for “1438786_a_at” means that the expression values for developmental stages P2 and P10 are quite similar. We could formally test whether the P2 and P10 effects are equal or, equivalently, whether their difference is equal to zero. First extract the parameter estimates from the linear model you fit above. Hint: the coef() function will pull parameter estimates out of a wide array of fitted model objects in R.
coef(mDatlm)
## (Intercept) devStageP2 devStageP6 devStageP10
## 8.5227 -1.4503 -0.1067 -1.2012
## devStage4_weeks
## 0.0810
Now you need to construct the contrast matrix to form the difference between the P2 and P10 effects. Hint: it's OK for a matrix to have just one row.
levels(mDat$devStage)
## [1] "E16" "P2" "P6" "P10" "4_weeks"
Contrast matrix for observed difference in sample mean for wt mice at P2 - P10:
contMat <- matrix(c(0, 1, 0, -1, 0), nrow = 1)
(obsDiff <- contMat %*% coef(mDatlm)) # %*% means matrix multiplication!
## [,1]
## [1,] -0.249
Let's check that this really is the observed difference in sample mean for the wild type mice, P2 v. P10:
(sampMeans <- aggregate(gExp ~ devStage, mDat, FUN = mean, subset = gType ==
"wt"))
## devStage gExp
## 1 E16 8.523
## 2 P2 7.072
## 3 P6 8.416
## 4 P10 7.322
## 5 4_weeks 8.604
with(sampMeans, gExp[devStage == "P2"] - gExp[devStage == "P10"])
## [1] -0.249
Yes! Agrees with the observed difference we computed by multiplying our contrast matrix and the estimated parameters. If you don't get agreement, you have a problem … probably with your contrast matrix.
Now we need the (estimated) standard error for our contrast. The variance-covariance matrix of the parameters estimated in the original model can be obtained with vcov() and is equal to (XTX)-1(sigma hat)2.
vcov(mDatlm)
## (Intercept) devStageP2 devStageP6 devStageP10
## (Intercept) 0.1435 -0.1435 -0.1435 -0.1435
## devStageP2 -0.1435 0.2870 0.1435 0.1435
## devStageP6 -0.1435 0.1435 0.2870 0.1435
## devStageP10 -0.1435 0.1435 0.1435 0.2870
## devStage4_weeks -0.1435 0.1435 0.1435 0.1435
## devStage4_weeks
## (Intercept) -0.1435
## devStageP2 0.1435
## devStageP6 0.1435
## devStageP10 0.1435
## devStage4_weeks 0.2870
Let's check that this is really true. If we take the diagonal elements and take their square root, they should be exactly equal to the standard errors reported for out original model. Are they?
summary(mDatlm)$coefficients[, "Std. Error"]
## (Intercept) devStageP2 devStageP6 devStageP10
## 0.3788 0.5357 0.5357 0.5357
## devStage4_weeks
## 0.5357
sqrt(diag(vcov(mDatlm)))
## (Intercept) devStageP2 devStageP6 devStageP10
## 0.3788 0.5357 0.5357 0.5357
## devStage4_weeks
## 0.5357
Returning to our test of the P2 v. P10 contrast, recall that the variance-covariance matrix of a contrast obtained as Calpha is C(XTX)-1CT(sigma hat)2.
(estSe <- contMat %*% vcov(mDatlm) %*% t(contMat))
## [,1]
## [1,] 0.287
Now we form a test statistic as an observed effect divided by its estimated standard error:
(testStat <- obsDiff/estSe)
## [,1]
## [1,] -0.8676
Under the null hypothesis that the contrast equals zero, i.e. there is no true difference in mean for expression at P2 and P10 in wild type mice for this gene, the test statistic has a t distribution with n-p=20-5=15 degrees of freedom. We compute a two-sided p-value and we're done.
2 * pt(abs(testStat), df = df.residual(mDatlm), lower.tail = FALSE)
## [,1]
## [1,] 0.3993
Not surprisingly, this p-value is rather large and we conclude there is no difference.
Let's focus on probeset “1448690_at”. Use your functions to prepare the data and plot it.
makeStripplot(oDat <- prepData("1448690_at"))
str(oDat)
## 'data.frame': 39 obs. of 6 variables:
## $ sidChar : Factor w/ 39 levels "Sample_20","Sample_21",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ gene : Factor w/ 1 level "1448690_at": 1 1 1 1 1 1 1 1 1 1 ...
## $ gExp : num 8.02 9.05 8.71 8.92 6.8 ...
## $ sidNum : num 20 21 22 23 16 17 6 24 25 26 ...
## $ devStage: Factor w/ 5 levels "E16","P2","P6",..: 1 1 1 1 1 1 1 2 2 2 ...
## $ gType : Factor w/ 2 levels "wt","NrlKO": 1 1 1 1 2 2 2 1 1 1 ...
Fit a linear model with covariates gType and devStage and include their interactions. Include interactions = multiply categorical covariates:
oFitBig <- lm(gExp ~ gType * devStage, oDat)
round(summary(oFitBig)$coefficients, digits = 5)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.67800 0.3987 21.76755 0.00000
## gTypeNrlKO -0.84233 0.6090 -1.38320 0.17716
## devStageP2 -1.02900 0.5638 -1.82512 0.07830
## devStageP6 -1.91450 0.5638 -3.39571 0.00200
## devStageP10 -2.19325 0.5638 -3.89012 0.00054
## devStage4_weeks -2.08200 0.5638 -3.69280 0.00091
## gTypeNrlKO:devStageP2 0.06983 0.8299 0.08415 0.93352
## gTypeNrlKO:devStageP6 0.16533 0.8299 0.19922 0.84348
## gTypeNrlKO:devStageP10 0.22583 0.8299 0.27212 0.78745
## gTypeNrlKO:devStage4_weeks 0.64608 0.8299 0.77852 0.44257
Vet the results. Is the intercept plausible? How about the various effects? Do the ones with small p-values, e.g. meeting a conventional cut-off of 0.05, look 'real' to you?
Yes, significant p-values for wtdevStageP6, P10 and 4_weeks seem real. As for the intercepts for the interaction of gType and devStage, not sure about this? Why are the estimate values positive? Now I understand (thank you Alastair!). This value represents the interaction effect, the observed average - expected average (intercept + KO effect + devStageP2 effect).
Omit interactions = sum categorical covariates. No interaction assumes the lines must be parallel:
oFitSml <- lm(gExp ~ gType + devStage, oDat)
summary(oFitSml)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.5803 0.3046 28.165 1.177e-24
## gTypeNrlKO -0.6144 0.2430 -2.528 1.643e-02
## devStageP2 -1.0104 0.3924 -2.575 1.470e-02
## devStageP6 -1.8481 0.3924 -4.710 4.328e-05
## devStageP10 -2.0966 0.3924 -5.343 6.703e-06
## devStage4_weeks -1.7752 0.3924 -4.524 7.444e-05
Now let's determine if the interaction terms are truly necessary. From the plot, the case for interaction seems very weak. This can be assessed with an F test that essentially looks at the reduction in the sum of squared residuals due to using a larger, more complicated model and determines if it is “big enough” given the number of additional parameters used. Recall the anova() function can take two fitted models, one nested within the other, and perform this test. (anova() can also be used on a single model to assess significance of terms, but remember the problem with the standard anova() function and unbalanced data)
anova(oFitSml, oFitBig)
## Analysis of Variance Table
##
## Model 1: gExp ~ gType + devStage
## Model 2: gExp ~ gType * devStage
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 33 18.9
## 2 29 18.4 4 0.497 0.2 0.94
With a p-value awfully close to one, we confirm that, no, there is no evidence for interaction in this particular case.
If you'd like to get a more exciting result, take a look at probeset “1429225_at”. Here are my plots, excerpts from the fitted model reports, and the F test for interaction:
makeStripplot(pDat <- prepData("1429225_at"))
Big model, with interaction = Number of parameters: gType * devStage = 2 * 5 = 10 parameters
pFitBig <- lm(gExp ~ gType * devStage, pDat)
round(summary(pFitBig)$coefficient, digits = 5)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.3125 0.2617 27.9392 0.00000
## gTypeNrlKO -0.2602 0.3998 -0.6507 0.52033
## devStageP2 -1.1583 0.3701 -3.1292 0.00397
## devStageP6 -1.2495 0.3701 -3.3757 0.00211
## devStageP10 -1.0717 0.3701 -2.8955 0.00712
## devStage4_weeks -0.9087 0.3701 -2.4551 0.02032
## gTypeNrlKO:devStageP2 0.2804 0.5448 0.5147 0.61068
## gTypeNrlKO:devStageP6 0.7589 0.5448 1.3929 0.17423
## gTypeNrlKO:devStageP10 1.7914 0.5448 3.2880 0.00265
## gTypeNrlKO:devStage4_weeks 2.2389 0.5448 4.1094 0.00030
Small model, no interaction = Number of parameters: (gType - 1) + (devStage - 1) + 1 = 1 + 4 + 1 = 6 parameters
pFitSml <- lm(gExp ~ gType + devStage, pDat)
round(summary(pFitSml)$coefficient, digits = 5)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.8652 0.2722 25.2199 0.00000
## gTypeNrlKO 0.7836 0.2172 3.6085 0.00101
## devStageP2 -1.0926 0.3506 -3.1161 0.00378
## devStageP6 -0.9446 0.3506 -2.6940 0.01101
## devStageP10 -0.2506 0.3506 -0.7147 0.47981
## devStage4_weeks 0.1361 0.3506 0.3883 0.70027
anova(pFitSml, pFitBig)
## Analysis of Variance Table
##
## Model 1: gExp ~ gType + devStage
## Model 2: gExp ~ gType * devStage
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 33 15.12
## 2 29 7.95 4 7.17 6.54 7e-04 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Not surprisingly, the interaction here is highly statistically significant.
For another day…