We've taken a little break from building models to study general ideas of statistical inference, drawing conclusions about the world from data.
We're going to introduce another way of analyzing models: ANOVA, which stands for "Analysis of Variance."
We sometimes need to deal with complicated models. We already have one strategy for interpreting such models: partial derivatives and partial change. But we also need a strategy for deciding when the evidence warrants including a variable or a term. Looking at individual coefficients is very difficult. Looking at \( R^2 \) hides how the different terms contribute. Looking at the change in \( R^2 \) as terms are added involves constructing lots and lots of models, which isn't itself a problem but is a hassle. (Later, we'll see that it also hides something, what we'll come to call "eating the variance.")
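To make the change-in-\( R^2 \) strategy concrete, here's a minimal sketch, assuming the CPS85 wage data used below is available (it is distributed with the mosaic package):

small = lm(wage ~ sector + sex, data = CPS85)
big = lm(wage ~ sector + sex + exper, data = CPS85)
# the change in R-squared attributable to adding exper:
summary(big)$r.squared - summary(small)$r.squared

Multiply that by every term you might want to add or drop, and the hassle is clear.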
To illustrate, here's a complicated model of wages. Do we really need all these terms?
mod = lm(wage ~ sector * sex + exper * sex * educ + married * union * sector,
data = CPS85)
The ANOVA report gives us a p-value for each model term, making it easy to spot what's helping and what's not.
anova(mod)
## Analysis of Variance Table
##
## Response: wage
## Df Sum Sq Mean Sq F value Pr(>F)
## sector 7 2572 367 20.34 < 2e-16 ***
## sex 1 436 436 24.14 1.2e-06 ***
## exper 1 264 264 14.64 0.00015 ***
## educ 1 982 982 54.37 7.1e-13 ***
## married 1 18 18 1.01 0.31610
## union 1 172 172 9.50 0.00217 **
## sector:sex 6 115 19 1.07 0.38228
## sex:exper 1 77 77 4.27 0.03935 *
## exper:educ 1 44 44 2.43 0.11986
## sex:educ 1 1 1 0.07 0.79450
## married:union 1 4 4 0.23 0.63341
## sector:married 7 168 24 1.33 0.23404
## sector:union 7 174 25 1.37 0.21367
## sex:exper:educ 1 0 0 0.01 0.92735
## sector:married:union 4 163 41 2.25 0.06221 .
## Residuals 492 8886 18
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Only one of the interaction terms, sex:exper, reaches the 0.05 level, and with a p-value of 0.039 even that contribution is pretty marginal.
Let's try the same thing for housing prices:
houses = fetchData("SaratogaHouses.csv")
## Retrieving from http://www.mosaic-web.org/go/datasets/SaratogaHouses.csv
hmod = lm(Price ~ Living.Area * Baths * Bedrooms * Fireplace * Age, data = houses)
anova(hmod)
## Analysis of Variance Table
##
## Response: Price
## Df Sum Sq Mean Sq F value
## Living.Area 1 4.51e+12 4.51e+12 1960.63
## Baths 1 1.56e+11 1.56e+11 68.07
## Bedrooms 1 3.30e+10 3.30e+10 14.37
## Fireplace 1 5.17e+09 5.17e+09 2.25
## Age 1 3.96e+09 3.96e+09 1.72
## Living.Area:Baths 1 3.31e+11 3.31e+11 143.91
## Living.Area:Bedrooms 1 1.33e+09 1.33e+09 0.58
## Baths:Bedrooms 1 5.13e+09 5.13e+09 2.23
## Living.Area:Fireplace 1 8.74e+09 8.74e+09 3.80
## Baths:Fireplace 1 5.67e+09 5.67e+09 2.47
## Bedrooms:Fireplace 1 3.21e+09 3.21e+09 1.40
## Living.Area:Age 1 1.00e+10 1.00e+10 4.37
## Baths:Age 1 3.86e+10 3.86e+10 16.77
## Bedrooms:Age 1 7.11e+08 7.11e+08 0.31
## Fireplace:Age 1 2.18e+09 2.18e+09 0.95
## Living.Area:Baths:Bedrooms 1 2.67e+10 2.67e+10 11.60
## Living.Area:Baths:Fireplace 1 2.67e+09 2.67e+09 1.16
## Living.Area:Bedrooms:Fireplace 1 1.04e+10 1.04e+10 4.54
## Baths:Bedrooms:Fireplace 1 3.25e+09 3.25e+09 1.41
## Living.Area:Baths:Age 1 2.53e+09 2.53e+09 1.10
## Living.Area:Bedrooms:Age 1 9.33e+09 9.33e+09 4.06
## Baths:Bedrooms:Age 1 1.06e+09 1.06e+09 0.46
## Living.Area:Fireplace:Age 1 1.90e+08 1.90e+08 0.08
## Baths:Fireplace:Age 1 9.00e+09 9.00e+09 3.92
## Bedrooms:Fireplace:Age 1 8.67e+09 8.67e+09 3.77
## Living.Area:Baths:Bedrooms:Fireplace 1 6.39e+09 6.39e+09 2.78
## Living.Area:Baths:Bedrooms:Age 1 1.33e+09 1.33e+09 0.58
## Living.Area:Baths:Fireplace:Age 1 3.41e+09 3.41e+09 1.48
## Living.Area:Bedrooms:Fireplace:Age 1 2.63e+07 2.63e+07 0.01
## Baths:Bedrooms:Fireplace:Age 1 1.04e+10 1.04e+10 4.54
## Living.Area:Baths:Bedrooms:Fireplace:Age 1 2.94e+09 2.94e+09 1.28
## Residuals 1031 2.37e+12 2.30e+09
## Pr(>F)
## Living.Area < 2e-16 ***
## Baths 4.8e-16 ***
## Bedrooms 0.00016 ***
## Fireplace 0.13390
## Age 0.18935
## Living.Area:Baths < 2e-16 ***
## Living.Area:Bedrooms 0.44710
## Baths:Bedrooms 0.13542
## Living.Area:Fireplace 0.05139 .
## Baths:Fireplace 0.11669
## Bedrooms:Fireplace 0.23760
## Living.Area:Age 0.03678 *
## Baths:Age 4.5e-05 ***
## Bedrooms:Age 0.57811
## Fireplace:Age 0.33057
## Living.Area:Baths:Bedrooms 0.00068 ***
## Living.Area:Baths:Fireplace 0.28170
## Living.Area:Bedrooms:Fireplace 0.03331 *
## Baths:Bedrooms:Fireplace 0.23497
## Living.Area:Baths:Age 0.29475
## Living.Area:Bedrooms:Age 0.04421 *
## Baths:Bedrooms:Age 0.49725
## Living.Area:Fireplace:Age 0.77378
## Baths:Fireplace:Age 0.04805 *
## Bedrooms:Fireplace:Age 0.05231 .
## Living.Area:Baths:Bedrooms:Fireplace 0.09582 .
## Living.Area:Baths:Bedrooms:Age 0.44756
## Living.Area:Baths:Fireplace:Age 0.22336
## Living.Area:Bedrooms:Fireplace:Age 0.91478
## Baths:Bedrooms:Fireplace:Age 0.03330 *
## Living.Area:Baths:Bedrooms:Fireplace:Age 0.25792
## Residuals
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Our job today is to start developing a clear understanding of how ANOVA works, so that we can interpret its results in useful ways.
Start with a very simple model with one quantitative explanatory variable, e.g.
mod = lm(width ~ length, data = KidsFeet)
summary(mod)
##
## Call:
## lm(formula = width ~ length, data = KidsFeet)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.8386 -0.3106 -0.0089 0.2762 0.7630
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.8623 1.2081 2.37 0.023 *
## length 0.2479 0.0488 5.08 1.1e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.396 on 37 degrees of freedom
## Multiple R-squared: 0.411, Adjusted R-squared: 0.395
## F-statistic: 25.8 on 1 and 37 DF, p-value: 1.1e-05
anova(mod)
## Analysis of Variance Table
##
## Response: width
## Df Sum Sq Mean Sq F value Pr(>F)
## length 1 4.06 4.06 25.8 1.1e-05 ***
## Residuals 37 5.81 0.16
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Work through the ANOVA calculation by comparing the regression report to the ANOVA table:

Show that F is equal to \( t^2 \): here \( 5.08^2 \approx 25.8 \). The relevant parameter is the number of degrees of freedom in the residual, which is simply n (the number of cases) minus m (the number of coefficients or, equivalently, the number of model vectors, including the intercept); here 39 - 2 = 37.
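We can pull both statistics out of the fitted model to check; a minimal sketch using the mod fitted above:

t_stat = coef(summary(mod))["length", "t value"]
t_stat^2 # about 25.8 ...
anova(mod)["length", "F value"] # ... matching the F value in the ANOVA table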
The translation from t values to p values is based on the t-distribution.
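Concretely, with t = 5.08 and 37 residual degrees of freedom from the report above:

2 * pt(-5.08, df = 37) # two-sided p-value, about 1.1e-05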
foo = rt(10000, df = 100)
xhistogram(~foo, fit = "normal")
## Loading required package: MASS
foo = rt(10000, df = 4)
xhistogram(~foo, fit = "normal") # not a match
The normal doesn't put enough density at the center and puts way too little at the tails.
The reason for the long tails is that, when there are few degrees of freedom in the residual, we don't have a very good idea of the length of the residual vector.
This gets really bad when df is one or two.
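One way to see this is to watch the critical value for 95% coverage balloon as the degrees of freedom shrink:

qt(0.975, df = c(1, 2, 5, 30)) # roughly 12.71, 4.30, 2.57, 2.04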
The distribution of \( t^2 \) is called F, specifically F with 1 and n - m degrees of freedom, but there's more to F than \( t^2 \).
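Numerically, the upper tail of F matches the two-sided t probability; again with t = 5.08 and 37 residual degrees of freedom:

1 - pf(5.08^2, df1 = 1, df2 = 37) # the same 1.1e-05 as from the t-distribution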
One place where F shines is when we want to look at many explanatory vectors collectively.
In the last class, we looked at professor-wise gradepoint averages, with an eye to the question of whether some professors are easy graders. We used the model coefficient for each professor as a test statistic, and ran into the question of multiple comparisons.
Now let's return to the question using analysis of variance.
grades = fetchData("grades.csv")
## Retrieving from http://www.mosaic-web.org/go/datasets/grades.csv
courses = fetchData("courses.csv")
## Retrieving from http://www.mosaic-web.org/go/datasets/courses.csv
g2n = fetchData("grade-to-number.csv")
## Retrieving from http://www.mosaic-web.org/go/datasets/grade-to-number.csv
all = merge(grades, courses)
all = merge(all, g2n) # a data set of every grade given, etc.
Suppose, instead of being concerned about individual professors, we were interested in the professoriate as a whole: do they grade in a consistent way, where "consistent" means "drawing grades from a common pool"? This test can be done easily. Build the model and see if the explanatory variable accounts for more than is likely to arise from chance:
mod1 = lm(gradepoint ~ iid, data = all)
r.squared(mod1)
## [1] 0.1745
The regression report actually gives a p-value for this \( R^2 \). It's not any different from what we would get by traveling to Planet Null: randomizing iid and seeing what the distribution of \( R^2 \) is on Planet Null.
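The trip to Planet Null might look like this, a sketch assuming the shuffle() and do() idioms from the mosaic package:

nulls = do(100) * r.squared(lm(gradepoint ~ shuffle(iid), data = all))
# compare the observed 0.1745 to the distribution of values in nulls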
Another way to summarize the model is with an ANOVA report:
anova(mod1)
## Analysis of Variance Table
##
## Response: gradepoint
## Df Sum Sq Mean Sq F value Pr(>F)
## iid 358 352 0.983 3.16 <2e-16 ***
## Residuals 5350 1665 0.311
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Explain \( R^2 \) in terms of the graph of (hypothetical) \( R^2 \) versus the number of junky model vectors.
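Here is one way to draw that graph, a purely hypothetical simulation with random explanatory vectors; nothing here comes from the grades data:

set.seed(1)
y = rnorm(50) # a response with no structure at all
junk = matrix(rnorm(50 * 40), nrow = 50) # 40 junky model vectors
r2 = sapply(1:40, function(k) summary(lm(y ~ junk[, 1:k]))$r.squared)
plot(r2, type = "l", xlab = "number of junky vectors", ylab = "R-squared")

\( R^2 \) creeps steadily upward even though every explanatory vector is pure noise.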
Now do it stepwise, by finding the sum of squares of the fitted model values in a pair of models, ~1 and ~1 + iid. A way to think of the F statistic: miles per gallon for the model terms compared to miles per gallon for the residual. Each model vector spends a degree of freedom, and F compares the sum of squares bought per degree of freedom by the term to the sum of squares per degree of freedom left in the residual.
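A sketch of that stepwise calculation, reusing mod1 from above and taking the degrees of freedom from the ANOVA table (this assumes no missing values, so that both models use the same cases):

mod0 = lm(gradepoint ~ 1, data = all)
ss_iid = sum(fitted(mod1)^2) - sum(fitted(mod0)^2) # sum of squares earned by iid
ss_resid = sum(resid(mod1)^2)
(ss_iid / 358) / (ss_resid / 5350) # the F value, about 3.16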
Of course, it's not fair to credit professors for variation in grades that is really due to the students. So we want to divide up the variation into the part due to the students and the part due to the professors. ANOVA lets you do this:
mod2 = lm(gradepoint ~ sid + iid, data = all)
anova(mod2)
## Analysis of Variance Table
##
## Response: gradepoint
## Df Sum Sq Mean Sq F value Pr(>F)
## sid 442 645 1.459 6.76 <2e-16 ***
## iid 358 313 0.873 4.04 <2e-16 ***
## Residuals 4908 1060 0.216
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Interestingly, the result depends on the order in which you put the model terms, even though the model values do not depend on that order at all. Whichever term comes first gets credit for whatever variance it can account for, leaving less for the terms that follow; this is the "eating the variance" phenomenon promised earlier.
mod3 = lm(gradepoint ~ iid + sid, data = all)
anova(mod3)
## Analysis of Variance Table
##
## Response: gradepoint
## Df Sum Sq Mean Sq F value Pr(>F)
## iid 358 352 0.983 4.55 <2e-16 ***
## sid 442 606 1.370 6.34 <2e-16 ***
## Residuals 4908 1060 0.216
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
sum((fitted(mod2) - fitted(mod3))^2) # model values are the same
## [1] 1.039e-22
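The residuals agree as well, which is why the Residuals line is identical in the two ANOVA tables:

sum(resid(mod2)^2) - sum(resid(mod3)^2) # effectively zero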