In class on 2/10 we discussed the example of taking all pairwise comparisons between four data sets. I expressed confusion about two ways of calculating the Type I error (false positive) rate that I had used, which generated different answers. In retrospect, this was reasonable: let’s see why.
For each \(n\) in Ns = 2:10, use a simulation with runif(n,0,1), as in the 2/10 notes, to approximate the Type I error rate for all pairwise comparisons between \(n\) normally distributed sets, assuming that the null hypothesis is true in each comparison. Note that you’ll need to know the number of pairwise comparisons (where order doesn’t matter) for each \(n\): calculate this by writing an auxiliary function that computes \[
\mathbf{Comb}(n,2) = \frac{n!}{(n-2)!2!}.
\] You can calculate \(n!\) in R using factorial(n). Test your function on some numbers for which you can easily calculate the answer by hand to make sure it works as intended. Then, build an \((n - 1) \times 3\) matrix with columns for (1) \(n\), (2) the number of pairwise comparisons for \(n\), and (3) the estimated Type I error rate. Plot \(n\) against the error rate for each \(n\).
comb.n.2 <- function(n){ # number of unordered pairs among n items
  factorial(n) / (factorial(n-2) * factorial(2))
}
comb.n.2(3) # should be 3
## [1] 3
comb.n.2(4) # should be 6
## [1] 6
comb.n.2(5) # should be 10
## [1] 10
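As a quick cross-check (not part of the assignment), R’s built-in choose() computes the same binomial coefficient; and since factorial() is vectorized, comb.n.2() also works on whole vectors:
comb.n.2(2:10)  # 1 3 6 10 15 21 28 36 45
choose(2:10, 2) # the built-in gives the same values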
Ns <- 2:10
error.rate.sim <- function(n, threshold=0.05) {
  # one simulation: TRUE if any of the comb.n.2(n) null p-values falls below the threshold
  sim <- function(i) any(runif(comb.n.2(n), 0, 1) < threshold)
  sims <- sapply(1:10000, sim)
  return(mean(sims)) # fraction of simulations with at least one false positive
}
numbers.of.comparisons <- sapply(Ns, comb.n.2)
FP.rates <- sapply(Ns, error.rate.sim)
err.mat <- matrix(c(Ns, numbers.of.comparisons, FP.rates), ncol=3)
colnames(err.mat) <- c('n', '#Comparisons', 'FP rates')
err.mat
## n #Comparisons FP rates
## [1,] 2 1 0.0501
## [2,] 3 3 0.1384
## [3,] 4 6 0.2729
## [4,] 5 10 0.3970
## [5,] 6 15 0.5474
## [6,] 7 21 0.6529
## [7,] 8 28 0.7613
## [8,] 9 36 0.8394
## [9,] 10 45 0.9015
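Under the independence assumption these rates also have a closed form: the chance of at least one false positive among \(m\) independent tests at level \(0.05\) is \(1 - 0.95^m\). As a sanity check (not part of the assignment), the analytic values track the simulated FP rates column above to within simulation noise:
round(1 - (1 - 0.05)^comb.n.2(Ns), 4) # e.g. 0.1426 for n = 3, 0.2649 for n = 4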
plot(Ns, FP.rates, ylim=c(0,1),
main="Pairwise comparison false positive rate estimates",
xlab="n (number of groups)", ylab="False Positive rate")
(A question about English: in “Plot \(n\) against the error rate for each \(n\),” should \(n\) be on the horizontal axis?)
What I didn’t tell you in class, in order to maintain the veil of ignorance (but which you can see by looking at the code chunk tagged with “include = FALSE” toward the end of the source file for that day’s notes), was that I had generated all four data sets from the same distribution: d1 = rnorm(15, 3, 1), etc. So:
Use the t.test() function inside a for-loop to get p-values for all pairwise comparisons between data sets. Again, build a matrix and plot \(n\) against the Type I error rate. Here is some starter code:
Ns <- 2:10
FP.rates.norm <- rep(0, length(Ns))
for (n in Ns) { # set value of n, using a loop over 2:10
  sim <- function(k) {
    data.sets <- sapply(1:n, FUN=function(i) rnorm(15, 3, 1))
    all.p.values <- c()
    for (i in 1:(n - 1)) {
      for (j in (i+1):n) {
        p.value <- t.test(data.sets[,i], data.sets[,j])$p.value
        all.p.values <- c(all.p.values, p.value)
      }
    }
    return(any(all.p.values < .05)) # TRUE if at least one comparison rejects
  }
  sims <- sapply(1:1000, sim)
  # the error rate estimate is the fraction of simulations with a false positive
  FP.rates.norm[n-1] <- mean(sims)
}
The sim() function generates \(n\) data sets and loops over every pair of them, performing a t-test for each pair. Make sure you go through the function and understand what it does, step by step, before using it.
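As an aside, the nested loop over pairs can be written more compactly with R’s combn(), which applies a function to every pair of indices; a minimal equivalent sketch (same behavior, not required):
sim.combn <- function(k, n) { # k is the (unused) simulation index
  data.sets <- sapply(1:n, function(i) rnorm(15, 3, 1))
  p.values <- combn(n, 2, function(pair)
    t.test(data.sets[, pair[1]], data.sets[, pair[2]])$p.value)
  any(p.values < .05)
}
# usage: sims <- sapply(1:1000, sim.combn, n = n)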
err.mat <- matrix(c(Ns, numbers.of.comparisons, FP.rates.norm), ncol=3)
colnames(err.mat) <- c('n', '#Comparisons', 'FP rates')
err.mat
## n #Comparisons FP rates
## [1,] 2 1 0.050
## [2,] 3 3 0.117
## [3,] 4 6 0.217
## [4,] 5 10 0.306
## [5,] 6 15 0.367
## [6,] 7 21 0.455
## [7,] 8 28 0.515
## [8,] 9 36 0.548
## [9,] 10 45 0.645
plot(Ns, FP.rates.norm, ylim=c(0,1),
main="FP rate estimates (no independence assumption)",
xlab="n (number of groups)", ylab="False Positive rate")
plot(Ns, FP.rates, ylim=c(0,1),
main="Pairwise comparison false positive rate estimates",
xlab="n (number of groups)", ylab="False Positive rate")
points(Ns, FP.rates.norm, col="blue")
legend("topleft",
c("independence", "no independence"),
col = c("black", "blue"),
pch = 1)
The above plot shows that the first method tends to overestimate the Type I error rate. This is because it wrongly assumes that the t-test results of the pairwise comparisons are independent of one another, with each having a \(5\%\) chance of committing a Type I error. Under the null hypothesis, i.e., when all groups of the data are generated from the same distribution, the results of comparing group 1 with group 2, group 2 with group 3, and group 1 with group 3 are correlated. For example, if the t-tests (correctly) come out non-significant for groups 1 and 2 and for groups 2 and 3, the differences between groups 1 and 2 and between groups 2 and 3 are small. This suggests that the difference between groups 1 and 3 is also likely small, so the t-test comparing groups 1 and 3 is less likely than \(5\%\) to give a false positive.
Hence the independence assumption does not hold, and a simulation based on it will overestimate the Type I error rate.
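To see this dependence directly, here is a minimal sketch (not part of the assignment) that simulates three groups under the null and compares the 1-vs-3 test’s unconditional rejection rate with its rate conditional on the other two tests not rejecting; under independence both rates would be \(5\%\):
set.seed(1)
ps <- t(sapply(1:5000, function(k) {
  g <- replicate(3, rnorm(15, 3, 1)) # three groups from the same distribution
  c(p12 = t.test(g[, 1], g[, 2])$p.value,
    p23 = t.test(g[, 2], g[, 3])$p.value,
    p13 = t.test(g[, 1], g[, 3])$p.value)
}))
mean(ps[, "p13"] < .05) # unconditional: about 0.05
mean(ps[ps[, "p12"] >= .05 & ps[, "p23"] >= .05, "p13"] < .05) # conditional: noticeably lower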
The english dataset in the languageR package includes information about the mean lexical decision reaction time (RTlexdec) and also about the neighborhood density of each word (which, for some reason, is called Ncount).
library(languageR)
english$RT <- exp(english$RTlexdec) # RTlexdec encodes log reaction time
plot(english$Ncount, english$RT,
xlab="neighborhood density", ylab="lexical decision reaction time")
(This time, lexical decision time is on the vertical axis.)
Fit a linear model with m = lm(..., data=english). Use summary() to inspect the results. Explain the output, line by line, in prose. Make sure your explanation accounts for at least the following:
m <- lm(RT ~ Ncount, data=english)
summary(m)
##
## Call:
## lm(formula = RT ~ Ncount, data = english)
##
## Residuals:
## Min 1Q Median 3Q Max
## -218.79 -90.72 -8.04 66.49 614.66
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 717.7208 2.7555 260.467 < 2e-16 ***
## Ncount -1.5301 0.3466 -4.415 1.04e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 114.6 on 4566 degrees of freedom
## Multiple R-squared: 0.00425, Adjusted R-squared: 0.004032
## F-statistic: 19.49 on 1 and 4566 DF, p-value: 1.036e-05
In the summary, the first section, Call, records the command used to construct the linear model.
The second section, Residuals, reports the five-number summary of the residuals.
The third section, Coefficients, reports the coefficients in the model. The intercept is the lexical decision time the model predicts when the neighborhood density is 0. Its estimate is 717.7208, and its p-value is extremely small (\(<2e-16\), very close to 0), although all this tells us is that the data would be extremely unlikely if the true intercept were \(0\). This is both uninformative (RTs can never be 0 anyway) and irrelevant (conceptually, the neighborhood density should never be 0, so it makes little sense to predict the RT for a density of 0).
The slope is the change in predicted RT when the neighborhood density increases by 1. Its estimate is -1.5301, which is negative: each unit increase in neighborhood density decreases the predicted RT by about 1.53. The slope is also highly statistically significant (\(p = 1.04e-05 < 0.001\)): we would be very unlikely to see an estimate this far from \(0\) if the true slope were \(0\). So neighborhood density does help predict the RT: the higher the neighborhood density, the lower the RT.
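To make the slope concrete, here is a quick check (the density values 5 and 6 are chosen arbitrarily): the model’s predictions at two adjacent densities differ by exactly the estimated slope.
predict(m, newdata = data.frame(Ncount = c(5, 6))) # the two predictions differ by about -1.53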
The proportion of the variance in the data that the model explains is the \(R^2\), which is \(0.00425\): less than \(1\%\) of the variance is explained by the model. The baseline for this comparison is the intercept-only model, whose prediction is just the sample mean. The total sum of squares in the sample (the variance times \(n - 1\)) is
var(english$RT) * (length(english$RT)-1)
## [1] 60251538
Below we can see how much of this total the model with neighborhood density as a predictor explains.
anova(m)
## Analysis of Variance Table
##
## Response: RT
## Df Sum Sq Mean Sq F value Pr(>F)
## Ncount 1 256064 256064 19.488 1.036e-05 ***
## Residuals 4566 59995474 13140
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Hence the model explains only \(256064 / 60251538 = 0.00425\) of the total variance: this is the \(R^2\). However, the F-statistic is significant, which means that despite the tiny improvement, the model is still reliably better than the baseline.
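Equivalently, we can recover the \(R^2\) from the anova table programmatically and check it against summary():
ss <- anova(m)[["Sum Sq"]] # model SS and residual SS
ss[1] / sum(ss)            # model SS / total SS = 0.00425
summary(m)$r.squared       # the same number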
Extract the fitted coefficients with coef(). Plot lexical decision time against neighborhood density again, and use abline(a= ..., b= ..., col='red') to overlay a red line showing the best-fit regression line. [a= ... gives the intercept of the line, and b= ... gives the slope.] How well does neighborhood density explain lexical decision time? Is this finding in conflict with your answer to the previous question about statistical significance?
plot(english$Ncount, english$RT,
xlab="neighborhood density", ylab="lexical decision reaction time")
abline(coef(m)[1], coef(m)[2], col='red')
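Incidentally, abline() can also read the intercept and slope straight from the fitted model object, so the overlay can be drawn in one step:
abline(m, col='red') # equivalent to passing coef(m)[1] and coef(m)[2]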
We can see that the regression line is almost horizontal, leaving lots of variance unexplained. This means that neighborhood density does not explain lexical decision time very well.
This is consistent with the previous answer: the slope is statistically significantly non-zero, but its estimate, about -1.53, is small relative to the scale of the RTs, so neighborhood density has little practical effect on them.
Given a fitted model m, you can access the residuals using m$residuals or resid(m). [Note: If you give plot() only one argument, it will default to treating the index of the observation as the x-value: so plot(resid(m)) does the same thing as plot(1:length(resid(m)), resid(m)).] Is there an obvious pattern in the residuals? Look at the data around the indices where the shifts occur and see if you can work out what’s going on. By inspecting the data in this way, can you figure out which other variable we should add in order to improve the model?
plot(resid(m))
Clearly, the residuals are different for different subgroups of data.
english$AgeSubject[1440:1460]
## [1] young young young young young young young young young young young
## [12] young young old old old old old old old old
## Levels: old young
english$AgeSubject[2890:2910]
## [1] old old old old old old old old old old old
## [12] old old old old young young young young young young
## Levels: old young
english$AgeSubject[3730:3750]
## [1] young young young young young young young old old old old
## [12] old old old old old old old old old old
## Levels: old young
We can see from above that shifts occur when the age group changes. So we should consider adding age as another predictor to improve the model.
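A quicker way to confirm this is to color the residual plot by age group, using the factor’s integer codes as colors (a sketch, not required by the question):
plot(resid(m), col = as.integer(english$AgeSubject)) # one color per age group
legend("topright", levels(english$AgeSubject), col = 1:2, pch = 1)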
Refit the model with this new predictor and use summary() to inspect the result. How does this model compare to the previous model in terms of \(R^2\) and RSS? What happens if you use anova() to compare the two models? Plot the residuals for your second model, comparing to the first.
m2 <- lm(RT ~ AgeSubject + Ncount, data=english)
summary(m2)
##
## Call:
## lm(formula = RT ~ AgeSubject + Ncount, data = english)
##
## Residuals:
## Min 1Q Median 3Q Max
## -192.54 -59.90 -15.53 45.00 536.07
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 796.3071 2.3553 338.086 < 2e-16 ***
## AgeSubjectyoung -157.1727 2.4692 -63.654 < 2e-16 ***
## Ncount -1.5301 0.2523 -6.064 1.43e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 83.44 on 4565 degrees of freedom
## Multiple R-squared: 0.4725, Adjusted R-squared: 0.4722
## F-statistic: 2044 on 2 and 4565 DF, p-value: < 2.2e-16
Now \(R^2=0.4725\), which is a great improvement on the previous model.
anova(m, m2)
## Analysis of Variance Table
##
## Model 1: RT ~ Ncount
## Model 2: RT ~ AgeSubject + Ncount
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 4566 59995474
## 2 4565 31784357 1 28211118 4051.8 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We can also see that the second model has a much lower RSS than the first (31,784,357 vs. 59,995,474), and the anova() comparison shows that it is statistically significantly better.
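The RSS values in the table can also be reproduced directly from the residuals:
sum(resid(m)^2)  # 59995474, the Model 1 RSS above
sum(resid(m2)^2) # 31784357, the Model 2 RSS above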
plot(resid(m2))
Now the residuals show no obvious pattern, in contrast to those of the previous model.
(Probably) no. First, the residual plot shows no obvious patterns. Second, conceptually it seems that age and neighborhood density affect the RT for independent reasons, so we would expect the two factors to be additive.
(Although I think it would still be good to actually check whether there is an interaction.)
Fit the model with an interaction term and use summary() to examine the output. How does the output compare to your a priori analysis in (9)?
m3 <- lm(RT ~ AgeSubject * Ncount, data=english)
summary(m3)
##
## Call:
## lm(formula = RT ~ AgeSubject * Ncount, data = english)
##
## Residuals:
## Min 1Q Median 3Q Max
## -193.48 -59.81 -15.53 45.09 536.03
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 797.2523 2.8369 281.030 < 2e-16 ***
## AgeSubjectyoung -159.0631 4.0120 -39.647 < 2e-16 ***
## Ncount -1.6809 0.3568 -4.711 2.54e-06 ***
## AgeSubjectyoung:Ncount 0.3017 0.5046 0.598 0.55
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 83.45 on 4564 degrees of freedom
## Multiple R-squared: 0.4725, Adjusted R-squared: 0.4722
## F-statistic: 1363 on 3 and 4564 DF, p-value: < 2.2e-16
We can see that the coefficient for the interaction term is not significantly different from \(0\) (\(p = 0.55\)), which means there is not enough evidence for an interaction. This is consistent with the previous answer.
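One way to see this visually (a sketch going beyond what the question asks): fit a separate regression per age group and overlay both lines; with no interaction, the lines should be roughly parallel.
plot(english$Ncount, english$RT, col = as.integer(english$AgeSubject),
     xlab = "neighborhood density", ylab = "lexical decision reaction time")
for (grp in levels(english$AgeSubject)) {
  abline(lm(RT ~ Ncount, data = english[english$AgeSubject == grp, ]))
}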