Stats 155 Class Notes 2012-11-09

Where We Are

We've taken a little break from building models to study general ideas of statistical inference, drawing conclusions about the world from data.

Where We're Heading

We're going to introduce another type of analysis of models, ANOVA, standing for “Analysis of Variance.”

A Teaser

We sometimes need to deal with complicated models. We already have one strategy for interpreting such models: partial derivatives and partial change. But we also need a strategy for deciding when the evidence warrants including a variable or a term. Looking at individual coefficients is very difficult. Looking at R2 hides how the different terms contribute. Looking at the change in R2 involves constructing lots and lots of models, which isn't itself a problem but is a hassle. (Later, we'll see that it also hides something, what we'll come to call "eating the variance.")

To illustrate, here's a complicated model of wages. Do we really need all these terms?

mod = lm(wage ~ sector * sex + exper * sex * educ + married * union * sector, 
    data = CPS85)

The ANOVA report gives us a p-value for each model term, making it easy to spot what's helping and what's not.

anova(mod)
## Analysis of Variance Table
## 
## Response: wage
##                       Df Sum Sq Mean Sq F value  Pr(>F)    
## sector                 7   2572     367   20.34 < 2e-16 ***
## sex                    1    436     436   24.14 1.2e-06 ***
## exper                  1    264     264   14.64 0.00015 ***
## educ                   1    982     982   54.37 7.1e-13 ***
## married                1     18      18    1.01 0.31610    
## union                  1    172     172    9.50 0.00217 ** 
## sector:sex             6    115      19    1.07 0.38228    
## sex:exper              1     77      77    4.27 0.03935 *  
## exper:educ             1     44      44    2.43 0.11986    
## sex:educ               1      1       1    0.07 0.79450    
## married:union          1      4       4    0.23 0.63341    
## sector:married         7    168      24    1.33 0.23404    
## sector:union           7    174      25    1.37 0.21367    
## sex:exper:educ         1      0       0    0.01 0.92735    
## sector:married:union   4    163      41    2.25 0.06221 .  
## Residuals            492   8886      18                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Only one of the interaction terms seems to be contributing, and that's pretty marginal.

Let's try the same for housing prices:

houses = fetchData("SaratogaHouses.csv")
## Retrieving from http://www.mosaic-web.org/go/datasets/SaratogaHouses.csv
hmod = lm(Price ~ Living.Area * Baths * Bedrooms * Fireplace * Age, data = houses)
anova(hmod)
## Analysis of Variance Table
## 
## Response: Price
##                                            Df   Sum Sq  Mean Sq F value  Pr(>F)
## Living.Area                                 1 4.51e+12 4.51e+12 1960.63 < 2e-16 ***
## Baths                                       1 1.56e+11 1.56e+11   68.07 4.8e-16 ***
## Bedrooms                                    1 3.30e+10 3.30e+10   14.37 0.00016 ***
## Fireplace                                   1 5.17e+09 5.17e+09    2.25 0.13390
## Age                                         1 3.96e+09 3.96e+09    1.72 0.18935
## Living.Area:Baths                           1 3.31e+11 3.31e+11  143.91 < 2e-16 ***
## Living.Area:Bedrooms                        1 1.33e+09 1.33e+09    0.58 0.44710
## Baths:Bedrooms                              1 5.13e+09 5.13e+09    2.23 0.13542
## Living.Area:Fireplace                       1 8.74e+09 8.74e+09    3.80 0.05139 .
## Baths:Fireplace                             1 5.67e+09 5.67e+09    2.47 0.11669
## Bedrooms:Fireplace                          1 3.21e+09 3.21e+09    1.40 0.23760
## Living.Area:Age                             1 1.00e+10 1.00e+10    4.37 0.03678 *
## Baths:Age                                   1 3.86e+10 3.86e+10   16.77 4.5e-05 ***
## Bedrooms:Age                                1 7.11e+08 7.11e+08    0.31 0.57811
## Fireplace:Age                               1 2.18e+09 2.18e+09    0.95 0.33057
## Living.Area:Baths:Bedrooms                  1 2.67e+10 2.67e+10   11.60 0.00068 ***
## Living.Area:Baths:Fireplace                 1 2.67e+09 2.67e+09    1.16 0.28170
## Living.Area:Bedrooms:Fireplace              1 1.04e+10 1.04e+10    4.54 0.03331 *
## Baths:Bedrooms:Fireplace                    1 3.25e+09 3.25e+09    1.41 0.23497
## Living.Area:Baths:Age                       1 2.53e+09 2.53e+09    1.10 0.29475
## Living.Area:Bedrooms:Age                    1 9.33e+09 9.33e+09    4.06 0.04421 *
## Baths:Bedrooms:Age                          1 1.06e+09 1.06e+09    0.46 0.49725
## Living.Area:Fireplace:Age                   1 1.90e+08 1.90e+08    0.08 0.77378
## Baths:Fireplace:Age                         1 9.00e+09 9.00e+09    3.92 0.04805 *
## Bedrooms:Fireplace:Age                      1 8.67e+09 8.67e+09    3.77 0.05231 .
## Living.Area:Baths:Bedrooms:Fireplace        1 6.39e+09 6.39e+09    2.78 0.09582 .
## Living.Area:Baths:Bedrooms:Age              1 1.33e+09 1.33e+09    0.58 0.44756
## Living.Area:Baths:Fireplace:Age             1 3.41e+09 3.41e+09    1.48 0.22336
## Living.Area:Bedrooms:Fireplace:Age          1 2.63e+07 2.63e+07    0.01 0.91478
## Baths:Bedrooms:Fireplace:Age                1 1.04e+10 1.04e+10    4.54 0.03330 *
## Living.Area:Baths:Bedrooms:Fireplace:Age    1 2.94e+09 2.94e+09    1.28 0.25792
## Residuals                                1031 2.37e+12 2.30e+09
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Our job today is to start developing a clear understanding of ANOVA and how it works, so that we can interpret its results in useful ways.

ANOVA and Regression Report are Related

Fit a very simple model with one quantitative explanatory variable, e.g.

mod = lm(width ~ length, data = KidsFeet)
summary(mod)
## 
## Call:
## lm(formula = width ~ length, data = KidsFeet)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.8386 -0.3106 -0.0089  0.2762  0.7630 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.8623     1.2081    2.37    0.023 *  
## length        0.2479     0.0488    5.08  1.1e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 0.396 on 37 degrees of freedom
## Multiple R-squared: 0.411,   Adjusted R-squared: 0.395 
## F-statistic: 25.8 on 1 and 37 DF,  p-value: 1.1e-05
anova(mod)
## Analysis of Variance Table
## 
## Response: width
##           Df Sum Sq Mean Sq F value  Pr(>F)    
## length     1   4.06    4.06    25.8 1.1e-05 ***
## Residuals 37   5.81    0.16                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Work through the ANOVA calculation by comparing the two reports.

Show that F is equal to \( t^2 \): here \( 5.08^2 \approx 25.8 \). The relevant parameter is the number of degrees of freedom in the residual, which is simply n (the number of cases) minus m (the number of coefficients or, equivalently, the number of model vectors, including the intercept). For KidsFeet, that's 39 − 2 = 37.
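
Here's a quick numerical check, pulling the t statistic and the F value straight out of the two reports above:

tstat = summary(mod)$coefficients["length", "t value"]
fstat = anova(mod)["length", "F value"]
c(t_squared = tstat^2, F = fstat)  # the two agree: both about 25.8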

The translation from t values to p values is based on the t-distribution.
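
In fact, we can reproduce the p-value in the report directly with base R's pt(), using the t value and the residual degrees of freedom:

2 * pt(-5.08, df = 37)  # two-sided p-value: about 1.1e-05, matching the report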

foo = rt(10000, df = 100)
xhistogram(~foo, fit = "normal")
## Loading required package: MASS

[Plot: histogram of the 10,000 t-distributed values (df = 100) with a fitted normal curve; a close match]

foo = rt(10000, df = 4)
xhistogram(~foo, fit = "normal")  # not a match

[Plot: histogram of the t-distributed values (df = 4) with a fitted normal curve; not a match]

The normal doesn't put enough density at the center and puts way too little at the tails.

The reason for the long tails is that, when there are few degrees of freedom in the residual, we don't have a very good idea of the length of the residual vector.
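
To see this concretely, here's a small simulation sketch that builds t statistics by hand: each one is the mean of five normal values divided by its estimated standard error, so the standard deviation estimate rests on only 4 degrees of freedom.

sims = replicate(10000, {x = rnorm(5); mean(x) / (sd(x) / sqrt(5))})
xhistogram(~sims, fit = "normal")  # long tails, just like rt() with df = 4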

Digression

This gets really bad when the residual df is one or two. (With one degree of freedom, the t-distribution is the Cauchy distribution, whose tails are so heavy that it doesn't even have a mean.)

The distribution of \( t^2 \) is called F; more precisely, the square of a t with d degrees of freedom is an F with 1 and d degrees of freedom. But there's more to F than \( t^2 \).
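
A quick check of that correspondence using base R's quantile functions qt() and qf():

qt(0.975, df = 37)^2         # 97.5th percentile of t on 37 df, squared
qf(0.95, df1 = 1, df2 = 37)  # 95th percentile of F on (1, 37) df: the same number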

Example: Do professors vary in how they grade? Revisited

One place where F shines is when we want to look at many explanatory vectors collectively.

In the last class, we looked at professor-wise gradepoint averages, with an eye to the question of whether some professors are easy graders. We used the model coefficient for each professor as the test statistic, and ran into the problem of multiple comparisons.

Now let's return to the question using analysis of variance.

grades = fetchData("grades.csv")
## Retrieving from http://www.mosaic-web.org/go/datasets/grades.csv
courses = fetchData("courses.csv")
## Retrieving from http://www.mosaic-web.org/go/datasets/courses.csv
g2n = fetchData("grade-to-number.csv")
## Retrieving from http://www.mosaic-web.org/go/datasets/grade-to-number.csv
all = merge(grades, courses)
all = merge(all, g2n)  # a data set of every grade given, etc.

Suppose, instead of being concerned about individual professors, we were interested in the professoriate as a whole: do they grade in a consistent way, where "consistent" means "drawing grades from a common pool"? This test can be done easily. Build the model and see whether the explanatory variable accounts for more than is likely to arise from chance:

mod1 = lm(gradepoint ~ iid, data = all)
r.squared(mod1)
## [1] 0.1745

The regression report actually gives a p-value for this R2. It's no different from what we would get by travelling to Planet Null: randomizing iid and seeing what the distribution of R2 looks like there.
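
Here's what that trip might look like, as a sketch using mosaic's shuffle() and do():

null_r2 = do(100) * r.squared(lm(gradepoint ~ shuffle(iid), data = all))
# compare the observed R2 of 0.1745 to this null distribution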

Another way to summarize the model is with an ANOVA report:

anova(mod1)
## Analysis of Variance Table
## 
## Response: gradepoint
##             Df Sum Sq Mean Sq F value Pr(>F)    
## iid        358    352   0.983    3.16 <2e-16 ***
## Residuals 5350   1665   0.311                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Explain R2 in terms of the graph of (hypothetical) R2 versus number of junky model vectors.
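
To illustrate the idea: junky model vectors always eat a little variance, so R2 creeps upward even when the additions are meaningless. Here's a sketch, assuming mosaic's rand() (which generates random model vectors) is available:

r.squared(lm(gradepoint ~ iid, data = all))             # 0.1745, as before
r.squared(lm(gradepoint ~ iid + rand(50), data = all))  # a bit bigger, by chance alone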

Now do it stepwise, by finding the sum of squares of the fitted model values in each of a pair of models, ~1 and ~1 + iid, as sketched below.
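
The increase in the sum of squares of the fitted values from the intercept-only model to the model with iid is exactly the Sum Sq that anova() credits to iid:

mod0 = lm(gradepoint ~ 1, data = all)      # the all-cases-alike model
sum(fitted(mod1)^2) - sum(fitted(mod0)^2)  # about 352, the Sum Sq for iid above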

A way to think of the F statistic: miles per gallon for the model terms compared to miles per gallon for the residuals. Each mean square is a sum of squares per degree of freedom; the degrees of freedom are the "gallons" spent to capture that sum of squares.
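
From the table above, the mileage calculation for iid works out to the reported F:

(352/358) / (1665/5350)  # mean square for iid over mean square residual: about 3.16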

Breaking up the Variance into Parts

Of course, it's not fair to credit professors for variation in grades that is really due to the students. So we want to divide up the variation into the part due to the students and the part due to the professors. ANOVA lets you do this:

mod2 = lm(gradepoint ~ sid + iid, data = all)
anova(mod2)
## Analysis of Variance Table
## 
## Response: gradepoint
##             Df Sum Sq Mean Sq F value Pr(>F)    
## sid        442    645   1.459    6.76 <2e-16 ***
## iid        358    313   0.873    4.04 <2e-16 ***
## Residuals 4908   1060   0.216                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Interestingly, the result depends on the order in which you put the model terms, even though the model values do not depend on that order at all.

mod3 = lm(gradepoint ~ iid + sid, data = all)
anova(mod3)
## Analysis of Variance Table
## 
## Response: gradepoint
##             Df Sum Sq Mean Sq F value Pr(>F)    
## iid        358    352   0.983    4.55 <2e-16 ***
## sid        442    606   1.370    6.34 <2e-16 ***
## Residuals 4908   1060   0.216                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
sum((fitted(mod2) - fitted(mod3))^2)  # model values are the same
## [1] 1.039e-22