We've taken a little break from building models to study general ideas of statistical inference, drawing conclusions about the world from data.
We're going to introduce another way of analyzing models: ANOVA, which stands for "Analysis of Variance."
We sometimes need to deal with complicated models. We already have one strategy for interpreting such models: partial derivatives and partial change. But we also need a strategy for deciding when the evidence warrants including a variable or a term. Looking at individual coefficients is very difficult. Looking at \( R^2 \) hides how the different terms contribute. Looking at the change in \( R^2 \) as terms are added involves constructing lots and lots of models, which isn't itself a problem but is a hassle. (Later, we'll see that it also hides something, what we'll come to call "eating the variance.")
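To make the change-in-\( R^2 \) strategy concrete, here's a minimal sketch, assuming the CPS85 wage data used below is available (it is distributed with the mosaic package):

small = lm(wage ~ sector + sex, data = CPS85)
big = lm(wage ~ sector + sex + exper, data = CPS85)
# the change in R-squared attributable to adding exper:
summary(big)$r.squared - summary(small)$r.squared

Multiply that by every term you might want to add or drop, and the hassle is clear.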
To illustrate, here's a complicated model of wages. Do we really need all these terms?
mod = lm(wage ~ sector * sex + exper * sex * educ + married * union * sector,
data = CPS85)
The ANOVA report gives us a p-value for each model term, making it easy to spot what's helping and what's not.
anova(mod)
## Analysis of Variance Table
##
## Response: wage
## Df Sum Sq Mean Sq F value Pr(>F)
## sector 7 2572 367 20.34 < 2e-16 ***
## sex 1 436 436 24.14 1.2e-06 ***
## exper 1 264 264 14.64 0.00015 ***
## educ 1 982 982 54.37 7.1e-13 ***
## married 1 18 18 1.01 0.31610
## union 1 172 172 9.50 0.00217 **
## sector:sex 6 115 19 1.07 0.38228
## sex:exper 1 77 77 4.27 0.03935 *
## exper:educ 1 44 44 2.43 0.11986
## sex:educ 1 1 1 0.07 0.79450
## married:union 1 4 4 0.23 0.63341
## sector:married 7 168 24 1.33 0.23404
## sector:union 7 174 25 1.37 0.21367
## sex:exper:educ 1 0 0 0.01 0.92735
## sector:married:union 4 163 41 2.25 0.06221 .
## Residuals 492 8886 18
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Only one of the interaction terms, sex:exper, reaches the 0.05 level, and with a p-value of 0.039 even that contribution is pretty marginal.
Let's try the same thing for housing prices:
houses = fetchData("SaratogaHouses.csv")
## Retrieving from http://www.mosaic-web.org/go/datasets/SaratogaHouses.csv
hmod = lm(Price ~ Living.Area * Baths * Bedrooms * Fireplace * Age, data = houses)
anova(hmod)
## Analysis of Variance Table
##
## Response: Price
## Df Sum Sq Mean Sq F value
## Living.Area 1 4.51e+12 4.51e+12 1960.63
## Baths 1 1.56e+11 1.56e+11 68.07
## Bedrooms 1 3.30e+10 3.30e+10 14.37
## Fireplace 1 5.17e+09 5.17e+09 2.25
## Age 1 3.96e+09 3.96e+09 1.72
## Living.Area:Baths 1 3.31e+11 3.31e+11 143.91
## Living.Area:Bedrooms 1 1.33e+09 1.33e+09 0.58
## Baths:Bedrooms 1 5.13e+09 5.13e+09 2.23
## Living.Area:Fireplace 1 8.74e+09 8.74e+09 3.80
## Baths:Fireplace 1 5.67e+09 5.67e+09 2.47
## Bedrooms:Fireplace 1 3.21e+09 3.21e+09 1.40
## Living.Area:Age 1 1.00e+10 1.00e+10 4.37
## Baths:Age 1 3.86e+10 3.86e+10 16.77
## Bedrooms:Age 1 7.11e+08 7.11e+08 0.31
## Fireplace:Age 1 2.18e+09 2.18e+09 0.95
## Living.Area:Baths:Bedrooms 1 2.67e+10 2.67e+10 11.60
## Living.Area:Baths:Fireplace 1 2.67e+09 2.67e+09 1.16
## Living.Area:Bedrooms:Fireplace 1 1.04e+10 1.04e+10 4.54
## Baths:Bedrooms:Fireplace 1 3.25e+09 3.25e+09 1.41
## Living.Area:Baths:Age 1 2.53e+09 2.53e+09 1.10
## Living.Area:Bedrooms:Age 1 9.33e+09 9.33e+09 4.06
## Baths:Bedrooms:Age 1 1.06e+09 1.06e+09 0.46
## Living.Area:Fireplace:Age 1 1.90e+08 1.90e+08 0.08
## Baths:Fireplace:Age 1 9.00e+09 9.00e+09 3.92
## Bedrooms:Fireplace:Age 1 8.67e+09 8.67e+09 3.77
## Living.Area:Baths:Bedrooms:Fireplace 1 6.39e+09 6.39e+09 2.78
## Living.Area:Baths:Bedrooms:Age 1 1.33e+09 1.33e+09 0.58
## Living.Area:Baths:Fireplace:Age 1 3.41e+09 3.41e+09 1.48
## Living.Area:Bedrooms:Fireplace:Age 1 2.63e+07 2.63e+07 0.01
## Baths:Bedrooms:Fireplace:Age 1 1.04e+10 1.04e+10 4.54
## Living.Area:Baths:Bedrooms:Fireplace:Age 1 2.94e+09 2.94e+09 1.28
## Residuals 1031 2.37e+12 2.30e+09
## Pr(>F)
## Living.Area < 2e-16 ***
## Baths 4.8e-16 ***
## Bedrooms 0.00016 ***
## Fireplace 0.13390
## Age 0.18935
## Living.Area:Baths < 2e-16 ***
## Living.Area:Bedrooms 0.44710
## Baths:Bedrooms 0.13542
## Living.Area:Fireplace 0.05139 .
## Baths:Fireplace 0.11669
## Bedrooms:Fireplace 0.23760
## Living.Area:Age 0.03678 *
## Baths:Age 4.5e-05 ***
## Bedrooms:Age 0.57811
## Fireplace:Age 0.33057
## Living.Area:Baths:Bedrooms 0.00068 ***
## Living.Area:Baths:Fireplace 0.28170
## Living.Area:Bedrooms:Fireplace 0.03331 *
## Baths:Bedrooms:Fireplace 0.23497
## Living.Area:Baths:Age 0.29475
## Living.Area:Bedrooms:Age 0.04421 *
## Baths:Bedrooms:Age 0.49725
## Living.Area:Fireplace:Age 0.77378
## Baths:Fireplace:Age 0.04805 *
## Bedrooms:Fireplace:Age 0.05231 .
## Living.Area:Baths:Bedrooms:Fireplace 0.09582 .
## Living.Area:Baths:Bedrooms:Age 0.44756
## Living.Area:Baths:Fireplace:Age 0.22336
## Living.Area:Bedrooms:Fireplace:Age 0.91478
## Baths:Bedrooms:Fireplace:Age 0.03330 *
## Living.Area:Baths:Bedrooms:Fireplace:Age 0.25792
## Residuals
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Our job today is to start developing a clear understanding of how ANOVA works, so that we can interpret its results in useful ways.
Start with a very simple model with one quantitative explanatory variable, e.g.
mod = lm(width ~ length, data = KidsFeet)
summary(mod)
##
## Call:
## lm(formula = width ~ length, data = KidsFeet)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.8386 -0.3106 -0.0089 0.2762 0.7630
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.8623 1.2081 2.37 0.023 *
## length 0.2479 0.0488 5.08 1.1e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.396 on 37 degrees of freedom
## Multiple R-squared: 0.411, Adjusted R-squared: 0.395
## F-statistic: 25.8 on 1 and 37 DF, p-value: 1.1e-05
anova(mod)
## Analysis of Variance Table
##
## Response: width
## Df Sum Sq Mean Sq F value Pr(>F)
## length 1 4.06 4.06 25.8 1.1e-05 ***
## Residuals 37 5.81 0.16
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Work through the ANOVA calculation by comparing the regression report to the ANOVA table:

Show that F is equal to \( t^2 \): here \( 5.08^2 \approx 25.8 \). The relevant parameter is the number of degrees of freedom in the residual, which is simply n (the number of cases) minus m (the number of coefficients or, equivalently, the number of model vectors, including the intercept); here 39 - 2 = 37.
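We can pull both statistics out of the fitted model to check; a minimal sketch using the mod fitted above:

t_stat = coef(summary(mod))["length", "t value"]
t_stat^2 # about 25.8 ...
anova(mod)["length", "F value"] # ... matching the F value in the ANOVA table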
The translation from t values to p values is based on the t-distribution.
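Concretely, with t = 5.08 and 37 residual degrees of freedom from the report above:

2 * pt(-5.08, df = 37) # two-sided p-value, about 1.1e-05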
foo = rt(10000, df = 100)
xhistogram(~foo, fit = "normal")
## Loading required package: MASS
foo = rt(10000, df = 4)
xhistogram(~foo, fit = "normal") # not a match
The normal doesn't put enough density at the center and puts way too little at the tails.
The reason for the long tails is that, when there are few degrees of freedom in the residual, we don't have a very good idea of the length of the residual vector.
This gets really bad when df is one or two.
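One way to see this is to watch the critical value for 95% coverage balloon as the degrees of freedom shrink:

qt(0.975, df = c(1, 2, 5, 30)) # roughly 12.71, 4.30, 2.57, 2.04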
The distribution of \( t^2 \) is called F, specifically F with 1 and n - m degrees of freedom, but there's more to F than \( t^2 \).
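Numerically, the upper tail of F matches the two-sided t probability; again with t = 5.08 and 37 residual degrees of freedom:

1 - pf(5.08^2, df1 = 1, df2 = 37) # the same 1.1e-05 as from the t-distribution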
One place where F shines is when we want to look at many explanatory vectors collectively.
In the last class, we looked at professor-wise gradepoint averages, with an eye to the question of whether some professors are easy graders. We used the model coefficient for each professor as a test statistic, and ran into the question of multiple comparisons.
Now let's return to the question using analysis of variance.
grades = fetchData("grades.csv")
## Retrieving from http://www.mosaic-web.org/go/datasets/grades.csv
courses = fetchData("courses.csv")
## Retrieving from http://www.mosaic-web.org/go/datasets/courses.csv
g2n = fetchData("grade-to-number.csv")
## Retrieving from http://www.mosaic-web.org/go/datasets/grade-to-number.csv
all = merge(grades, courses)
all = merge(all, g2n) # a data set of every grade given, etc.
Suppose, instead of being concerned about individual professors, we were interested in the professoriate as a whole: do they grade in a consistent way, where "consistent" means "drawing grades from a common pool"? This test can be done easily. Build the model and see if the explanatory variable accounts for more than is likely to arise from chance:
mod1 = lm(gradepoint ~ iid, data = all)
r.squared(mod1)
## [1] 0.1745
The regression report actually gives a p-value for this \( R^2 \). It's not any different from what we would get by traveling to Planet Null: randomizing iid and seeing what the distribution of \( R^2 \) is on Planet Null.
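The trip to Planet Null might look like this, a sketch assuming the shuffle() and do() idioms from the mosaic package:

nulls = do(100) * r.squared(lm(gradepoint ~ shuffle(iid), data = all))
# compare the observed 0.1745 to the distribution of values in nulls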
Another way to summarize the model is with an ANOVA report:
anova(mod1)
## Analysis of Variance Table
##
## Response: gradepoint
## Df Sum Sq Mean Sq F value Pr(>F)
## iid 358 352 0.983 3.16 <2e-16 ***
## Residuals 5350 1665 0.311
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Explain \( R^2 \) in terms of the graph of (hypothetical) \( R^2 \) versus the number of junky model vectors.
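Here is one way to draw that graph, a purely hypothetical simulation with random explanatory vectors; nothing here comes from the grades data:

set.seed(1)
y = rnorm(50) # a response with no structure at all
junk = matrix(rnorm(50 * 40), nrow = 50) # 40 junky model vectors
r2 = sapply(1:40, function(k) summary(lm(y ~ junk[, 1:k]))$r.squared)
plot(r2, type = "l", xlab = "number of junky vectors", ylab = "R-squared")

\( R^2 \) creeps steadily upward even though every explanatory vector is pure noise.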
Now do it stepwise, by finding the sum of squares of the fitted model values in a pair of models, ~1 and ~1 + iid. A way to think of the F statistic: miles per gallon for the model terms compared to miles per gallon for the residual. Each model vector spends a degree of freedom, and F compares the sum of squares bought per degree of freedom by the term to the sum of squares per degree of freedom left in the residual.
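A sketch of that stepwise calculation, reusing mod1 from above and taking the degrees of freedom from the ANOVA table (this assumes no missing values, so that both models use the same cases):

mod0 = lm(gradepoint ~ 1, data = all)
ss_iid = sum(fitted(mod1)^2) - sum(fitted(mod0)^2) # sum of squares earned by iid
ss_resid = sum(resid(mod1)^2)
(ss_iid / 358) / (ss_resid / 5350) # the F value, about 3.16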
Of course, it's not fair to credit professors for variation in grades that is really due to the students. So we want to divide up the variation into the part due to the students and the part due to the professors. ANOVA lets you do this:
mod2 = lm(gradepoint ~ sid + iid, data = all)
anova(mod2)
## Analysis of Variance Table
##
## Response: gradepoint
## Df Sum Sq Mean Sq F value Pr(>F)
## sid 442 645 1.459 6.76 <2e-16 ***
## iid 358 313 0.873 4.04 <2e-16 ***
## Residuals 4908 1060 0.216
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Interestingly, the result depends on the order in which you put the model terms, even though the model values do not depend on that order at all. Whichever term comes first gets credit for whatever variance it can account for, leaving less for the terms that follow; this is the "eating the variance" phenomenon promised earlier.
mod3 = lm(gradepoint ~ iid + sid, data = all)
anova(mod3)
## Analysis of Variance Table
##
## Response: gradepoint
## Df Sum Sq Mean Sq F value Pr(>F)
## iid 358 352 0.983 4.55 <2e-16 ***
## sid 442 606 1.370 6.34 <2e-16 ***
## Residuals 4908 1060 0.216
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
sum((fitted(mod2) - fitted(mod3))^2) # model values are the same
## [1] 1.039e-22
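The residuals agree as well, which is why the Residuals line is identical in the two ANOVA tables:

sum(resid(mod2)^2) - sum(resid(mod3)^2) # effectively zero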