Multiple comparisons

In class on 2/10 we discussed the example of taking all pairwise comparisons between four data sets. I expressed confusion about two ways of calculating the Type I error (false positive) rate that I had used, which generated different answers. In retrospect, this was reasonable: let’s see why.

  1. For each \(n \in \{2, 3, ..., 9, 10\}\) (i.e., Ns = 2:10), use a simulation with runif(n,0,1), as in the 2/10 notes, to approximate the Type 1 error rate for all pairwise comparisons between \(n\) normally distributed sets, assuming that the null hypothesis is true in each comparison. Note that you’ll need to know the number of pairwise comparisons (where order doesn’t matter) for each \(n\): calculate this by writing an auxiliary function that computes \[ \mathbf{Comb}(n,2) = \frac{n!}{(n-2)!2!}. \] You can calculate \(n!\) in R using factorial(n). Test your function on some numbers for which you can easily calculate the answer by hand to make sure it works as intended. Then, build an \((n - 1) \times 3\) matrix with columns for (1) \(n\), (2) the number of pairwise comparisons for \(n\), and (3) the estimated Type I error rate. Plot \(n\) against the error rate for each \(n\).
comb.n.2 <- function(n){
  # number of unordered pairs among n items: n! / ((n-2)! 2!)
  factorial(n) / (factorial(n-2) * factorial(2))
}
comb.n.2(3)  # should be 3
## [1] 3
comb.n.2(4)  # should be 6 
## [1] 6
comb.n.2(5)  # should be 10
## [1] 10
Ns = 2:10

error.rate.sim <- function(n, threshold=0.05) {
  # One simulated "experiment": draw one Unif(0,1) p-value per pairwise
  # comparison and check whether any falls below the significance threshold.
  sim <- function(i) any(runif(comb.n.2(n), 0, 1) < threshold)
  sims <- sapply(1:10000, sim)  # repeat the experiment 10,000 times
  return(mean(sims))  # proportion of experiments with at least one false positive
}

numbers.of.comparisons <- sapply(Ns, comb.n.2)
FP.rates <- sapply(Ns, error.rate.sim)

err.mat <- matrix(c(Ns, numbers.of.comparisons, FP.rates), ncol=3)
colnames(err.mat)<- c('n', '#Comparisons', 'FP rates')
err.mat
##        n #Comparisons FP rates
##  [1,]  2            1   0.0501
##  [2,]  3            3   0.1384
##  [3,]  4            6   0.2729
##  [4,]  5           10   0.3970
##  [5,]  6           15   0.5474
##  [6,]  7           21   0.6529
##  [7,]  8           28   0.7613
##  [8,]  9           36   0.8394
##  [9,] 10           45   0.9015
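Under the assumption that each of the \(k\) comparisons independently has a \(5\%\) chance of a false positive, the familywise Type I error rate is \(1 - 0.95^{k}\). As a quick sanity check (not part of the question), this analytic value should track the simulated rates in the table above; the code below reuses the Ns, numbers.of.comparisons, and FP.rates objects already defined.

alpha <- 0.05
# analytic familywise error rate: 1 - (1 - alpha)^k for k independent comparisons
analytic.rates <- 1 - (1 - alpha)^numbers.of.comparisons
round(cbind(n = Ns, simulated = FP.rates, analytic = analytic.rates), 3)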
plot(Ns, FP.rates, ylim=c(0,1),
     main="Pairwise comparison false positive rate estimates", 
     xlab="n (number of groups)", ylab="False Positive rate")

(A question about English: in “Plot \(n\) against the error rate for each \(n\),” should \(n\) be the horizontal axis?)

What I didn’t tell you in class, in order to maintain the veil of ignorance (but which you can see by looking at the code chunk tagged with “include = FALSE” toward the end of the source file for that day’s notes), was that I had generated all four data sets from the same distribution: d1 = rnorm(15, 3, 1), etc. So:

  2. Re-do (1), but this time actually generate \(n\) data sets, each of size 15, from a \(\mathcal{N}(3,1)\) distribution. Use the t.test() function inside a for-loop to get p-values for all pairwise comparisons between data sets. Again, build a matrix and plot \(n\) against the Type I error rate. Here is some starter code:
Ns <- 2:10
FP.rates.norm <- rep(0, length(Ns)) 

for (n in Ns) { # set value of n, using a loop over 2:10
  
  sim = function(k) {
    data.sets <- sapply(1:n, FUN=function(i) rnorm(15, 3, 1))
    all.p.values <- c()
    for (i in 1:(n - 1)) {
      for (j in (i+1):n) {
        p.value <- t.test(data.sets[,i], data.sets[,j])$p.value
        all.p.values <- c(all.p.values, p.value) 
      }
    }
    return(any(all.p.values < .05))
  }
  
  sims <- sapply(1:1000, sim)
  
  # do something with the vector of Booleans you've just computed
  FP.rates.norm[n-1] <- mean(sims)
}

The sim() function generates \(n\) data sets and loops over them, performing t-tests for each pair. Make sure you go through the function and understand what it does, step-by-step, before using it.
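As an aside (not required by the question), the nested for-loop can be replaced by R’s built-in combn(), which enumerates all unordered pairs of column indices; pairwise.p below is just an illustrative helper name, a sketch of the same computation.

# For each unordered pair of columns, run a t-test and collect the p-value.
pairwise.p <- function(data.sets) {
  combn(ncol(data.sets), 2,
        FUN = function(idx) t.test(data.sets[, idx[1]], data.sets[, idx[2]])$p.value)
}
# any(pairwise.p(data.sets) < .05) then gives the same Boolean as the loop above.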

err.mat <- matrix(c(Ns, numbers.of.comparisons, FP.rates.norm), ncol=3)
colnames(err.mat)<- c('n', '#Comparisons', 'FP rates')
err.mat
##        n #Comparisons FP rates
##  [1,]  2            1    0.050
##  [2,]  3            3    0.117
##  [3,]  4            6    0.217
##  [4,]  5           10    0.306
##  [5,]  6           15    0.367
##  [6,]  7           21    0.455
##  [7,]  8           28    0.515
##  [8,]  9           36    0.548
##  [9,] 10           45    0.645
plot(Ns, FP.rates.norm, ylim=c(0,1),
     main="FP rate estimates (no independence assumption)", 
     xlab="n (number of groups)", ylab="False Positive rate")

  3. Plot the Type I error rates predicted by the two simulation methods against each other. Are your answers to (1) and (2) different? If so, why? (Hint: what assumption were we making about (in)dependence between tests when we modeled the probability of a Type I error on each comparison by drawing a fresh random number from \(\mathcal{U}(0,1)\)? Is this assumption fulfilled, given the generating distribution we assumed in (2)?) What does this analysis tell us about the dependence of the Type I error rate on facts about the true but unknown generating distribution?
plot(Ns, FP.rates, ylim=c(0,1),
     main="Pairwise comparison false positive rate estimates", 
     xlab="n (number of groups)", ylab="False Positive rate")

points(Ns, FP.rates.norm, col="blue")

legend("topleft",
       c("independence", "no independence"),
       col = c("black", "blue"),
       pch = 1)

The above plot shows that the first method tends to overestimate the Type I error rate. This is because it wrongly assumes that the pairwise t-test results are independent of one another, with each comparison having its own \(5\%\) chance of committing a Type I error. Under the null hypothesis, i.e., when all groups are generated from the same distribution, the results of comparing group 1 with group 2, group 2 with group 3, and group 1 with group 3 are correlated. For example, if the t-tests (correctly) come out non-significant both for group 1 vs. group 2 and for group 2 vs. group 3, then the differences between groups 1 and 2 and between groups 2 and 3 are small, which suggests that the difference between groups 1 and 3 is also likely small. This makes the t-test comparing groups 1 and 3 less likely to produce a false positive than it would be if the comparisons were independent.

Hence the independence assumption does not hold, and a simulation that relies on it will overestimate the Type I error rate.
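To see the dependence concretely, here is a small sketch (not part of the assignment) that repeatedly draws three groups from the same \(\mathcal{N}(3,1)\) distribution and correlates the p-values of the two comparisons that share group 3. Under the independence assumption this correlation should be near zero; in fact it comes out noticeably positive.

# p13 and p23 share group 3, so their p-values are correlated under the null.
set.seed(1)
shared.group.p <- replicate(2000, {
  g1 <- rnorm(15, 3, 1); g2 <- rnorm(15, 3, 1); g3 <- rnorm(15, 3, 1)
  c(p13 = t.test(g1, g3)$p.value, p23 = t.test(g2, g3)$p.value)
})
cor(shared.group.p["p13", ], shared.group.p["p23", ])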

Linear regression

The english dataset in the languageR package includes information about the mean lexical decision reaction time (RTlexdec) and also about the neighborhood density of each word (which, for some reason, is called Ncount).

  4. Plot lexical decision time against neighborhood density.
library(languageR)
english$RT <- exp(english$RTlexdec)  # RTlexdec encodes log reaction time 
plot(english$Ncount, english$RT, 
     xlab="neighborhood density", ylab="lexical decision reaction time") 

(This time, lexical decision time is on the vertical axis.)

  5. Build a linear regression model which attempts to predict lexical decision time from neighborhood density, saving the object produced: m = lm(..., data=english). Use summary() to inspect the results. Explain the output, line-by-line, in prose. Make sure your explanation accounts for at least the following:
m <- lm(RT ~ Ncount, data=english)
summary(m)
## 
## Call:
## lm(formula = RT ~ Ncount, data = english)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -218.79  -90.72   -8.04   66.49  614.66 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 717.7208     2.7555 260.467  < 2e-16 ***
## Ncount       -1.5301     0.3466  -4.415 1.04e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 114.6 on 4566 degrees of freedom
## Multiple R-squared:  0.00425,    Adjusted R-squared:  0.004032 
## F-statistic: 19.49 on 1 and 4566 DF,  p-value: 1.036e-05

In the summary, the first line Call records the command to construct the linear model.

The second line reports the five-number summary of the residuals.

The third line reports the coefficients in the model. The intercept is the lexical decision time the model predicts when the neighborhood density is 0. Its estimate is 717.72, and it is highly statistically significant (\(p < 2e-16\)); all this means, though, is that the data would be extremely unlikely if the true intercept were \(0\). That is both uninformative (RTs can never be 0 anyway) and irrelevant (conceptually the neighborhood density should never be 0, so it makes little sense to predict the RT for 0 neighborhood density).

The slope is the change in the predicted RT when the neighborhood density increases by 1. Its estimate is -1.53: each additional neighbor lowers the predicted RT by about 1.53. This is also highly statistically significant (\(p = 1.04e-05 < 0.001\)), i.e., the data would be very unlikely if the true slope were \(0\). So neighborhood density does help predict the RT: the higher the neighborhood density, the lower the predicted RT.
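As a concrete illustration of what these coefficients mean (the value Ncount = 10 below is an arbitrary example), the model's predicted RT at a given neighborhood density can be computed from coef(m) or with predict():

# Predicted RT for a word with 10 neighbors: intercept + slope * 10.
coef(m)[1] + coef(m)[2] * 10
predict(m, newdata = data.frame(Ncount = 10))  # same prediction via predict()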

The R-squared is the proportion of the variance in the data that the model explains; here it is \(0.00425\), so the model explains less than \(1\%\) of the variance. The baseline model is the intercept-only model, which predicts the sample mean for every observation. The total sum of squares in the sample is as follows:

var(english$RT) * (length(english$RT)-1)
## [1] 60251538

The ANOVA table below shows how much of this total the model with neighborhood density as a predictor explains.

anova(m)
## Analysis of Variance Table
## 
## Response: RT
##             Df   Sum Sq Mean Sq F value    Pr(>F)    
## Ncount       1   256064  256064  19.488 1.036e-05 ***
## Residuals 4566 59995474   13140                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Hence the model explains only \(256064 / 60251538 = 0.00425\) of the total sum of squares, which is exactly the \(R^2\). However, the F-statistic is significant: despite the tiny improvement, the model still fits better than the intercept-only baseline.
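As a sanity check (a sketch, not required), the same \(R^2\) can be recovered directly from the residual sum of squares of the model:

# R^2 = 1 - RSS / total sum of squares; should reproduce the 0.00425 above.
TSS <- var(english$RT) * (length(english$RT) - 1)
1 - sum(resid(m)^2) / TSS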

  6. Extract the intercept and slope coefficients using coef(). Plot lexical decision time against neighborhood density again, and use abline(a= ..., b= ..., col='red') to overlay a red line showing the best-fit regression line. [a= ... gives the intercept of the line, and b= ... gives the slope.] How well does neighborhood density explain lexical decision time? Is this finding in conflict with your answer to the previous question about statistical significance?
plot(english$Ncount, english$RT, 
     xlab="neighborhood density", ylab="lexical decision reaction time") 
abline(coef(m)[1], coef(m)[2], col='red')

We can see that the regression line is almost horizontal, leaving lots of variance unexplained. This means that neighborhood density does not explain lexical decision time very well.

This is consistent with the previous answer. The slope is statistically significantly different from zero, but its estimate is only -1.53, which is a small effect; significance only tells us the slope is reliably distinguishable from zero, not that neighborhood density has a large influence on the RT.
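One way to quantify how small the effect is (again just a sketch): multiply the slope by the full observed range of neighborhood density, giving the largest RT difference the model can attribute to Ncount.

# Largest predicted RT difference attributable to Ncount over its observed range.
coef(m)["Ncount"] * diff(range(english$Ncount))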

  7. Plot the residuals by index. If your model is called m, you can access the residuals using m$residual or resid(m). [Note: If you give plot() only one argument, it will default to treating the index of the observation as the x-value: so plot(resid(m)) does the same thing as plot(1:length(resid(m)), resid(m)).] Is there an obvious pattern in the residuals? Look at the data around the index where the shift occurs and see if you can work out what’s going on. By inspecting the data in this way, can you figure out which other variable we should add to the model in order to improve the model?
plot(resid(m))

Clearly, the residuals are different for different subgroups of data.

english$AgeSubject[1440:1460]
##  [1] young young young young young young young young young young young
## [12] young young old   old   old   old   old   old   old   old  
## Levels: old young
english$AgeSubject[2890:2910]
##  [1] old   old   old   old   old   old   old   old   old   old   old  
## [12] old   old   old   old   young young young young young young
## Levels: old young
english$AgeSubject[3730:3750]
##  [1] young young young young young young young old   old   old   old  
## [12] old   old   old   old   old   old   old   old   old   old  
## Levels: old young

We can see from the above that the shifts in the residuals occur exactly where the age group changes. So we should consider adding AgeSubject as another predictor to improve the model.
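A quick way to confirm this visually (not required by the question) is to color the residual plot by AgeSubject; the two bands should separate cleanly if the age group is what drives the shifts.

# Residuals colored by age group: old in black, young in blue.
plot(resid(m), col = ifelse(english$AgeSubject == "young", "blue", "black"))
legend("topright", c("old", "young"), col = c("black", "blue"), pch = 1)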

  8. Build a second model with the variable you just picked out as an additional predictor, and use summary() to inspect the result. How does this model compare to the previous model in terms of \(R^2\) and RSS? What happens if you use anova() to compare the two models? Plot the residuals for your second model, comparing to the first.
m2 <- lm(RT ~ AgeSubject + Ncount, data=english)
summary(m2)
## 
## Call:
## lm(formula = RT ~ AgeSubject + Ncount, data = english)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -192.54  -59.90  -15.53   45.00  536.07 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      796.3071     2.3553 338.086  < 2e-16 ***
## AgeSubjectyoung -157.1727     2.4692 -63.654  < 2e-16 ***
## Ncount            -1.5301     0.2523  -6.064 1.43e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 83.44 on 4565 degrees of freedom
## Multiple R-squared:  0.4725, Adjusted R-squared:  0.4722 
## F-statistic:  2044 on 2 and 4565 DF,  p-value: < 2.2e-16

Now \(R^2=0.4725\), which is a great improvement on the previous model.

anova(m, m2)
## Analysis of Variance Table
## 
## Model 1: RT ~ Ncount
## Model 2: RT ~ AgeSubject + Ncount
##   Res.Df      RSS Df Sum of Sq      F    Pr(>F)    
## 1   4566 59995474                                  
## 2   4565 31784357  1  28211118 4051.8 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can also see that the second model has a much lower RSS than the first (31784357 vs. 59995474), and the anova() comparison shows that this improvement is statistically significant.
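The RSS values in the anova() table can also be computed directly from the residuals, as a quick check:

sum(resid(m)^2)   # RSS of Model 1 (RT ~ Ncount)
sum(resid(m2)^2)  # RSS of Model 2 (RT ~ AgeSubject + Ncount)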

plot(resid(m2))

Now the residuals do not have obvious patterns, compared to those in the previous model.

  9. Should we try to improve the model further by adding in an interaction between the two variables in your second model? Why or why not? Try to answer this question by reasoning about the kind of data contained in these variables, without actually building such a model.

(Probably) No. First, the residual plot shows no obvious patterns. Second, conceptually it seems that age and neighborhood density affect the RT for independent reasons, so we would expect the two factors to be additive.

(Although I think it would still be good to actually check whether there is an interaction.)

  10. Build the model with an interaction discussed in (9) and use summary() to examine the output. How does the output compare to your a priori analysis in (9)?
m3 <- lm(RT ~ AgeSubject * Ncount, data=english)
summary(m3)
## 
## Call:
## lm(formula = RT ~ AgeSubject * Ncount, data = english)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -193.48  -59.81  -15.53   45.09  536.03 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             797.2523     2.8369 281.030  < 2e-16 ***
## AgeSubjectyoung        -159.0631     4.0120 -39.647  < 2e-16 ***
## Ncount                   -1.6809     0.3568  -4.711 2.54e-06 ***
## AgeSubjectyoung:Ncount    0.3017     0.5046   0.598     0.55    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 83.45 on 4564 degrees of freedom
## Multiple R-squared:  0.4725, Adjusted R-squared:  0.4722 
## F-statistic:  1363 on 3 and 4564 DF,  p-value: < 2.2e-16

We can see that the coefficient for the interaction term is not significantly different from \(0\), which means there is not enough evidence supporting the existence of an interaction. This is consistent with the previous answer.
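An equivalent check (a sketch; output not shown) is to compare the two nested models directly with anova(), which tests whether adding the interaction term significantly reduces the residual sum of squares:

# Nested model comparison: m2 (additive) vs. m3 (with interaction).
anova(m2, m3)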