STAT540 Seminar 05: Fitting and interpreting linear models (low volume)

Rebecca Johnston 05/02/2014

Load required packages and photoRec data
Function to prepare a dataset for given genes
Write a function to stripplot a mini-dataset
Do a two-sample t-test
Fit a linear model with a categorical covariate
Perform inference for a contrast
Fit a linear model with two categorical covariates
Ideas for further work

Load required packages and photoRec data

library("ggplot2")  # for graphing
library("reshape2")  # for converting data from wide to tall format
library("plyr")  # for data aggregation tasks

Load gene expression dataset:

prDat <- read.table("GSE4051_data.tsv")
str(prDat, max.level = 0)

## 'data.frame':    29949 obs. of  39 variables:

Load design matrix:

prDes <- readRDS("GSE4051_design.rds")
str(prDes)

## 'data.frame':    39 obs. of  4 variables:
##  $ sidChar : chr  "Sample_20" "Sample_21" "Sample_22" "Sample_23" ...
##  $ sidNum  : num  20 21 22 23 16 17 6 24 25 26 ...
##  $ devStage: Factor w/ 5 levels "E16","P2","P6",..: 1 1 1 1 1 1 1 2 2 2 ...
##  $ gType   : Factor w/ 2 levels "wt","NrlKO": 1 1 1 1 2 2 2 1 1 1 ...

Function to prepare a dataset for given genes

Write a function that takes any given affy probe ID and turns their associated values into a data frame.

METHOD: melting and merging. Index prDat with chosen genes:

luckyGenes <- c("1419655_at", "1438815_at")
luckyDf <- prDat[luckyGenes, ]

Reshape and bind with design matrix. Add rownames (probe ID) as a column for melt function first:

luckyDf <- data.frame(gene = rownames(luckyDf), luckyDf)
luckyDf <- melt(luckyDf, id.vars = "gene", variable.name = "sidChar", value.name = "gExp")
luckyDf <- merge(luckyDf, prDes, by = "sidChar")

Reorder using plyr function arrange:

luckyDf <- arrange(luckyDf, devStage, gType)

We can now write the function where variable input is “luckyGenes”:

luckyGenes <- c("1419655_at", "1438815_at")
prepData <- function(x) {
    luckyDf <- prDat[x, ]
    luckyDf <- data.frame(gene = rownames(luckyDf), luckyDf)
    luckyDf <- melt(luckyDf, id.vars = "gene", variable.name = "sidChar", value.name = "gExp")
    luckyDf <- merge(luckyDf, prDes, by = "sidChar")
    luckyDf <- arrange(luckyDf, devStage, gType)
}
jDat <- prepData(luckyGenes)
str(jDat)

## 'data.frame':    78 obs. of  6 variables:
##  $ sidChar : Factor w/ 39 levels "Sample_20","Sample_21",..: 1 1 2 2 3 3 4 4 5 5 ...
##  $ gene    : Factor w/ 2 levels "1419655_at","1438815_at": 1 2 1 2 1 2 1 2 1 2 ...
##  $ gExp    : num  10.93 8.84 10.74 8.85 10.67 ...
##  $ sidNum  : num  20 20 21 21 22 22 23 23 16 16 ...
##  $ devStage: Factor w/ 5 levels "E16","P2","P6",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ gType   : Factor w/ 2 levels "wt","NrlKO": 1 1 1 1 1 1 1 1 2 2 ...

Plot data using ggplot:

ggplot(jDat, aes(devStage, gExp, colour = gType, group = gType)) + geom_point() + 
    facet_wrap(~gene) + stat_summary(fun.y = mean, geom = "line")

plot of chunk unnamed-chunk-8

Write a function to stripplot a mini-dataset

You will probably make lots of these plots. Why not write a function for this?

makeStripplot <- function(x) {
    ggplot(x, aes(devStage, gExp, colour = gType, group = gType)) + geom_point() + 
        facet_wrap(~gene) + stat_summary(fun.y = mean, geom = "line")
}
makeStripplot(jDat)

plot of chunk unnamed-chunk-9

You can use both of your functions together and create a mini dataset and plot it all at once:

makeStripplot(newDat <- prepData("1456341_a_at"))

plot of chunk unnamed-chunk-10

str(newDat)

## 'data.frame':    39 obs. of  6 variables:
##  $ sidChar : Factor w/ 39 levels "Sample_20","Sample_21",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ gene    : Factor w/ 1 level "1456341_a_at": 1 1 1 1 1 1 1 1 1 1 ...
##  $ gExp    : num  7.04 7.48 7.37 6.94 6.16 ...
##  $ sidNum  : num  20 21 22 23 16 17 6 24 25 26 ...
##  $ devStage: Factor w/ 5 levels "E16","P2","P6",..: 1 1 1 1 1 1 1 2 2 2 ...
##  $ gType   : Factor w/ 2 levels "wt","NrlKO": 1 1 1 1 2 2 2 1 1 1 ...

Do a two-sample t-test

Let's test for a difference in expected gene expression for probeset “1456341_a_at” at developmental stage P2 v. 4 weeks post-natal (ignoring genotype, i.e. lump the wild types and knockouts together). Let's assume a common variance in the two groups.

levels(newDat$devStage)

## [1] "E16"     "P2"      "P6"      "P10"     "4_weeks"

t.test(gExp ~ devStage, newDat, subset = devStage == "P2" | devStage == "4_weeks")

## 
##  Welch Two Sample t-test
## 
## data:  gExp by devStage
## t = -18.84, df = 13.98, p-value = 2.477e-11
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -4.078 -3.244
## sample estimates:
##      mean in group P2 mean in group 4_weeks 
##                 6.326                 9.987

Fit a linear model with a categorical covariate

In other words, do “one-way ANOVA”. Focus on probeset “1438786_a_at”. Use function we already prepared to subset data and draw stripplot:

makeStripplot(mDat <- prepData("1438786_a_at"))

plot of chunk unnamed-chunk-13

str(mDat)

## 'data.frame':    39 obs. of  6 variables:
##  $ sidChar : Factor w/ 39 levels "Sample_20","Sample_21",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ gene    : Factor w/ 1 level "1438786_a_at": 1 1 1 1 1 1 1 1 1 1 ...
##  $ gExp    : num  7.94 8.99 8.55 8.62 5.72 ...
##  $ sidNum  : num  20 21 22 23 16 17 6 24 25 26 ...
##  $ devStage: Factor w/ 5 levels "E16","P2","P6",..: 1 1 1 1 1 1 1 2 2 2 ...
##  $ gType   : Factor w/ 2 levels "wt","NrlKO": 1 1 1 1 2 2 2 1 1 1 ...

Now call linear model using “lm” function:

mDatlm <- lm(gExp ~ devStage, mDat, subset = gType == "wt")
summary(mDatlm)

## 
## Call:
## lm(formula = gExp ~ devStage, data = mDat, subset = gType == 
##     "wt")
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.1565 -0.4400  0.0288  0.4915  1.2065 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        8.523      0.379   22.50  5.7e-13 ***
## devStageP2        -1.450      0.536   -2.71    0.016 *  
## devStageP6        -0.107      0.536   -0.20    0.845    
## devStageP10       -1.201      0.536   -2.24    0.040 *  
## devStage4_weeks    0.081      0.536    0.15    0.882    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.758 on 15 degrees of freedom
## Multiple R-squared:  0.497,  Adjusted R-squared:  0.363 
## F-statistic: 3.71 on 4 and 15 DF,  p-value: 0.0272

This result looks plausible: the intercept of 8.52 is close to the mean of E16 (the reference). And the values for the other devStages change accordingly (-/+ slope wrt devStageE16).

Perform inference for a contrast

The “W” shape of the expression profile for “1438786_a_at” means that the expression values for developmental stages P2 and P10 are quite similar. We could formally test whether the P2 and P10 effects are equal or, equivalently, whether their difference is equal to zero. First extract the parameter estimates from the linear model you fit above. Hint: the coef() function will pull parameter estimates out of a wide array of fitted model objects in R.

coef(mDatlm)

##     (Intercept)      devStageP2      devStageP6     devStageP10 
##          8.5227         -1.4503         -0.1067         -1.2012 
## devStage4_weeks 
##          0.0810

Now you need to construct the contrast matrix to form the difference between the P2 and P10 effects. Hint: it's OK for a matrix to have just one row.

levels(mDat$devStage)

## [1] "E16"     "P2"      "P6"      "P10"     "4_weeks"

Contrast matrix for observed difference in sample mean for wt mice at P2 - P10:

contMat <- matrix(c(0, 1, 0, -1, 0), nrow = 1)
(obsDiff <- contMat %*% coef(mDatlm))  # %*% means matrix multiplication!

##        [,1]
## [1,] -0.249

Let's check that this really is the observed difference in sample mean for the wild type mice, P2 v. P10:

(sampMeans <- aggregate(gExp ~ devStage, mDat, FUN = mean, subset = gType == 
    "wt"))

##   devStage  gExp
## 1      E16 8.523
## 2       P2 7.072
## 3       P6 8.416
## 4      P10 7.322
## 5  4_weeks 8.604

with(sampMeans, gExp[devStage == "P2"] - gExp[devStage == "P10"])

## [1] -0.249

Yes! Agrees with the observed difference we computed by multiplying our contrast matrix and the estimated parameters. If you don't get agreement, you have a problem … probably with your contrast matrix.

Now we need the (estimated) standard error for our contrast. The variance-covariance matrix of the parameters estimated in the original model can be obtained with vcov() and is equal to (X^TX)^-1(sigma hat)^2.

vcov(mDatlm)

##                 (Intercept) devStageP2 devStageP6 devStageP10
## (Intercept)          0.1435    -0.1435    -0.1435     -0.1435
## devStageP2          -0.1435     0.2870     0.1435      0.1435
## devStageP6          -0.1435     0.1435     0.2870      0.1435
## devStageP10         -0.1435     0.1435     0.1435      0.2870
## devStage4_weeks     -0.1435     0.1435     0.1435      0.1435
##                 devStage4_weeks
## (Intercept)             -0.1435
## devStageP2               0.1435
## devStageP6               0.1435
## devStageP10              0.1435
## devStage4_weeks          0.2870

Let's check that this is really true. If we take the diagonal elements and take their square root, they should be exactly equal to the standard errors reported for out original model. Are they?

summary(mDatlm)$coefficients[, "Std. Error"]

##     (Intercept)      devStageP2      devStageP6     devStageP10 
##          0.3788          0.5357          0.5357          0.5357 
## devStage4_weeks 
##          0.5357

sqrt(diag(vcov(mDatlm)))

##     (Intercept)      devStageP2      devStageP6     devStageP10 
##          0.3788          0.5357          0.5357          0.5357 
## devStage4_weeks 
##          0.5357

Returning to our test of the P2 v. P10 contrast, recall that the variance-covariance matrix of a contrast obtained as Calpha is C(X^TX)^-1C^T(sigma hat)^2.

(estSe <- contMat %*% vcov(mDatlm) %*% t(contMat))

##       [,1]
## [1,] 0.287

Now we form a test statistic as an observed effect divided by its estimated standard error:

(testStat <- obsDiff/estSe)

##         [,1]
## [1,] -0.8676

Under the null hypothesis that the contrast equals zero, i.e. there is no true difference in mean for expression at P2 and P10 in wild type mice for this gene, the test statistic has a t distribution with n-p=20-5=15 degrees of freedom. We compute a two-sided p-value and we're done.

2 * pt(abs(testStat), df = df.residual(mDatlm), lower.tail = FALSE)

##        [,1]
## [1,] 0.3993

Not surprisingly, this p-value is rather large and we conclude there is no difference.

Fit a linear model with two categorical covariates

Let's focus on probeset “1448690_at”. Use your functions to prepare the data and plot it.

makeStripplot(oDat <- prepData("1448690_at"))

plot of chunk unnamed-chunk-24

str(oDat)

## 'data.frame':    39 obs. of  6 variables:
##  $ sidChar : Factor w/ 39 levels "Sample_20","Sample_21",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ gene    : Factor w/ 1 level "1448690_at": 1 1 1 1 1 1 1 1 1 1 ...
##  $ gExp    : num  8.02 9.05 8.71 8.92 6.8 ...
##  $ sidNum  : num  20 21 22 23 16 17 6 24 25 26 ...
##  $ devStage: Factor w/ 5 levels "E16","P2","P6",..: 1 1 1 1 1 1 1 2 2 2 ...
##  $ gType   : Factor w/ 2 levels "wt","NrlKO": 1 1 1 1 2 2 2 1 1 1 ...

Fit a linear model with covariates gType and devStage and include their interactions. Include interactions = multiply categorical covariates:

oFitBig <- lm(gExp ~ gType * devStage, oDat)
round(summary(oFitBig)$coefficients, digits = 5)

##                            Estimate Std. Error  t value Pr(>|t|)
## (Intercept)                 8.67800     0.3987 21.76755  0.00000
## gTypeNrlKO                 -0.84233     0.6090 -1.38320  0.17716
## devStageP2                 -1.02900     0.5638 -1.82512  0.07830
## devStageP6                 -1.91450     0.5638 -3.39571  0.00200
## devStageP10                -2.19325     0.5638 -3.89012  0.00054
## devStage4_weeks            -2.08200     0.5638 -3.69280  0.00091
## gTypeNrlKO:devStageP2       0.06983     0.8299  0.08415  0.93352
## gTypeNrlKO:devStageP6       0.16533     0.8299  0.19922  0.84348
## gTypeNrlKO:devStageP10      0.22583     0.8299  0.27212  0.78745
## gTypeNrlKO:devStage4_weeks  0.64608     0.8299  0.77852  0.44257

Vet the results. Is the intercept plausible? How about the various effects? Do the ones with small p-values, e.g. meeting a conventional cut-off of 0.05, look 'real' to you?

Yes, significant p-values for wtdevStageP6, P10 and 4_weeks seem real. As for the intercepts for the interaction of gType and devStage, not sure about this? Why are the estimate values positive? Now I understand (thank you Alastair!). This value represents the interaction effect, the observed average - expected average (intercept + KO effect + devStageP2 effect).

Omit interactions = sum categorical covariates. No interaction assumes the lines must be parallel:

oFitSml <- lm(gExp ~ gType + devStage, oDat)
summary(oFitSml)$coefficients

##                 Estimate Std. Error t value  Pr(>|t|)
## (Intercept)       8.5803     0.3046  28.165 1.177e-24
## gTypeNrlKO       -0.6144     0.2430  -2.528 1.643e-02
## devStageP2       -1.0104     0.3924  -2.575 1.470e-02
## devStageP6       -1.8481     0.3924  -4.710 4.328e-05
## devStageP10      -2.0966     0.3924  -5.343 6.703e-06
## devStage4_weeks  -1.7752     0.3924  -4.524 7.444e-05

Now let's determine if the interaction terms are truly necessary. From the plot, the case for interaction seems very weak. This can be assessed with an F test that essentially looks at the reduction in the sum of squared residuals due to using a larger, more complicated model and determines if it is “big enough” given the number of additional parameters used. Recall the anova() function can take two fitted models, one nested within the other, and perform this test. (anova() can also be used on a single model to assess significance of terms, but remember the problem with the standard anova() function and unbalanced data)

anova(oFitSml, oFitBig)

## Analysis of Variance Table
## 
## Model 1: gExp ~ gType + devStage
## Model 2: gExp ~ gType * devStage
##   Res.Df  RSS Df Sum of Sq   F Pr(>F)
## 1     33 18.9                        
## 2     29 18.4  4     0.497 0.2   0.94

With a p-value awfully close to one, we confirm that, no, there is no evidence for interaction in this particular case.

If you'd like to get a more exciting result, take a look at probeset “1429225_at”. Here are my plots, excerpts from the fitted model reports, and the F test for interaction:

makeStripplot(pDat <- prepData("1429225_at"))

plot of chunk unnamed-chunk-28

Big model, with interaction = Number of parameters: gType * devStage = 2 * 5 = 10 parameters

pFitBig <- lm(gExp ~ gType * devStage, pDat)
round(summary(pFitBig)$coefficient, digits = 5)

##                            Estimate Std. Error t value Pr(>|t|)
## (Intercept)                  7.3125     0.2617 27.9392  0.00000
## gTypeNrlKO                  -0.2602     0.3998 -0.6507  0.52033
## devStageP2                  -1.1583     0.3701 -3.1292  0.00397
## devStageP6                  -1.2495     0.3701 -3.3757  0.00211
## devStageP10                 -1.0717     0.3701 -2.8955  0.00712
## devStage4_weeks             -0.9087     0.3701 -2.4551  0.02032
## gTypeNrlKO:devStageP2        0.2804     0.5448  0.5147  0.61068
## gTypeNrlKO:devStageP6        0.7589     0.5448  1.3929  0.17423
## gTypeNrlKO:devStageP10       1.7914     0.5448  3.2880  0.00265
## gTypeNrlKO:devStage4_weeks   2.2389     0.5448  4.1094  0.00030

Small model, no interaction = Number of parameters: (gType - 1) + (devStage - 1) + 1 = 1 + 4 + 1 = 6 parameters

pFitSml <- lm(gExp ~ gType + devStage, pDat)
round(summary(pFitSml)$coefficient, digits = 5)

##                 Estimate Std. Error t value Pr(>|t|)
## (Intercept)       6.8652     0.2722 25.2199  0.00000
## gTypeNrlKO        0.7836     0.2172  3.6085  0.00101
## devStageP2       -1.0926     0.3506 -3.1161  0.00378
## devStageP6       -0.9446     0.3506 -2.6940  0.01101
## devStageP10      -0.2506     0.3506 -0.7147  0.47981
## devStage4_weeks   0.1361     0.3506  0.3883  0.70027

anova(pFitSml, pFitBig)

## Analysis of Variance Table
## 
## Model 1: gExp ~ gType + devStage
## Model 2: gExp ~ gType * devStage
##   Res.Df   RSS Df Sum of Sq    F Pr(>F)    
## 1     33 15.12                             
## 2     29  7.95  4      7.17 6.54  7e-04 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Not surprisingly, the interaction here is highly statistically significant.

Ideas for further work

For another day…