Stats 155 Class Notes 2012-11-12

Grading the Election Model predictions

Hand out the prediction-model papers at random and direct students to this app for evaluating the election-result models. Adjust the parameters as indicated and write the resulting likelihood at the top of the paper.

See who has the largest likelihood in the class.

Suppose you were trying to decide between two hypotheses based on the data. Here's how:

p(mod1 | 332) = p(332 | mod1) p(mod1) / p(332)

What's p(332)? Since, by your prior, either mod1 or mod2 must be right,

p(332) = p(332 | mod1) p(mod1) + p(332 | mod2) p(mod2)

More simply, assign each model the value p(332|modX) p(modX). The posterior probability p(modX|332) is then that value divided by the sum of the corresponding values over all the models.

If the priors are all the same, then the posterior probability is proportional to the likelihood.
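Here's a minimal sketch of that arithmetic in R. The likelihood values are hypothetical placeholders; substitute the ones from your own models:

likelihoods = c(mod1 = 0.04, mod2 = 0.01)  # hypothetical values of p(332 | modX)
priors = c(mod1 = 0.5, mod2 = 0.5)  # equal priors
likelihoods * priors / sum(likelihoods * priors)  # posterior probabilities
## mod1 mod2 
##  0.8  0.2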

ANOVA

The elementary setting for ANOVA

Differences among groups. When there are two groups, this is called a “t-test”. With more than two, it's traditionally called simply “ANOVA” or “one-way ANOVA”, but it's just a standard application of the ANOVA procedure to a simple model.

The t-test

Consider the difference between the sexes in wage:

mod = lm(wage ~ sex, data = CPS85)
summary(mod)
## 
## Call:
## lm(formula = wage ~ sex, data = CPS85)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -8.99  -3.53  -1.07   2.39  36.62 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    7.879      0.322   24.50  < 2e-16 ***
## sexM           2.116      0.437    4.84  1.7e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 5.03 on 532 degrees of freedom
## Multiple R-squared: 0.0422,  Adjusted R-squared: 0.0404 
## F-statistic: 23.4 on 1 and 532 DF,  p-value: 1.7e-06
anova(mod)
## Analysis of Variance Table
## 
## Response: wage
##            Df Sum Sq Mean Sq F value  Pr(>F)    
## sex         1    594     594    23.4 1.7e-06 ***
## Residuals 532  13483      25                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Note that the regression report and the ANOVA report give the same p-value. There's no “team” issue when only one coefficient is involved.
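In fact, with a single coefficient the F statistic is just the square of the t statistic, so the two reports must agree:

4.84^2  # the t statistic for sexM, squared, reproduces the F statistic
## [1] 23.43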

For multiple groups, consider wage versus sector:

mod2 = lm(wage ~ sector, data = CPS85)
summary(mod2)
## 
## Call:
## lm(formula = wage ~ sector, data = CPS85)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -11.70  -3.04  -1.04   2.28  31.80 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      7.423      0.475   15.63  < 2e-16 ***
## sectorconst      2.079      1.149    1.81    0.071 .  
## sectormanag      5.281      0.789    6.69  5.7e-11 ***
## sectormanuf      0.613      0.740    0.83    0.407    
## sectorother      1.078      0.740    1.46    0.146    
## sectorprof       4.525      0.659    6.87  1.8e-11 ***
## sectorsales      0.170      0.895    0.19    0.849    
## sectorservice   -0.885      0.699   -1.27    0.206    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 4.68 on 526 degrees of freedom
## Multiple R-squared: 0.183,   Adjusted R-squared: 0.172 
## F-statistic: 16.8 on 7 and 526 DF,  p-value: <2e-16

In the regression report, each sector gets its own p-value with respect to the reference sector. It might be that there is not enough evidence to support a claim that any one sector differs from the reference, but taken together the coefficients suggest that the variable as a whole plays a role in explaining the response. ANOVA provides a simple way to combine the contributions of all the levels of a variable.

anova(mod2)
## Analysis of Variance Table
## 
## Response: wage
##            Df Sum Sq Mean Sq F value Pr(>F)    
## sector      7   2572     367    16.8 <2e-16 ***
## Residuals 526  11505      22                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In the Null Hypothesis world, the “explanatory” variables are no better than the residuals. Here's how that shows up in the ANOVA table:

modnull = lm(wage ~ shuffle(sector), data = CPS85)
anova(modnull)
## Analysis of Variance Table
## 
## Response: wage
##                  Df Sum Sq Mean Sq F value Pr(>F)
## shuffle(sector)   7    164    23.5    0.89   0.52
## Residuals       526  13912    26.4

Two categorical variables

Example: Sector and sex

mod = lm(wage ~ sector * sex, data = CPS85)
anova(mod)
## Analysis of Variance Table
## 
## Response: wage
##             Df Sum Sq Mean Sq F value  Pr(>F)    
## sector       7   2572     367   17.50 < 2e-16 ***
## sex          1    436     436   20.77 6.5e-06 ***
## sector:sex   6    175      29    1.39    0.22    
## Residuals  519  10894      21                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Have we detected an interaction? With a p-value of 0.22 on sector:sex, there's no compelling evidence for one here.

ACTIVITY

Create the set of nested models and construct the ANOVA table by hand.
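Here's one possible scaffold for the activity. (A sketch: by the Pythagorean relationship among nested models, each term's credit is the increase in the sum of squares of the fitted values as the term is added.)

m0 = lm(wage ~ 1, data = CPS85)
m1 = lm(wage ~ sector, data = CPS85)
m2 = lm(wage ~ sector + sex, data = CPS85)
m3 = lm(wage ~ sector * sex, data = CPS85)
sum(fitted(m1)^2) - sum(fitted(m0)^2)  # credit for sector: compare to 2572 above
sum(fitted(m2)^2) - sum(fitted(m1)^2)  # credit for sex: compare to 436
sum(fitted(m3)^2) - sum(fitted(m2)^2)  # credit for sector:sex: compare to 175
sum(resid(m3)^2)  # residual sum of squares: compare to 10894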

Digression: Log prices and wages

Prices are relative. An indication of this is the almost universal use of percentages to describe inflation, wage increases, etc. For example, an often quoted number is that women earn approximately 72 cents for each dollar earned by a man.

The naive way to find this number (and, in fact, the way it is found) is to divide the average wage of women by the average wage of men:

mean(wage ~ sex, data = CPS85)
##     F     M 
## 7.879 9.995
7.88/9.99
## [1] 0.7888

Close to the quoted number in this data set from the 1980s.

A better way is to work with log wages, find the contribution of sex, and then convert that back into a multiplier. That lets us adjust for various other factors. Here's the basic calculation, done in log-wage style:

lm(log(wage) ~ sex, data = CPS85)
## 
## Call:
## lm(formula = log(wage) ~ sex, data = CPS85)
## 
## Coefficients:
## (Intercept)         sexM  
##       1.934        0.231
exp(-0.2312)
## [1] 0.7936

Now we can include covariates:

lm(log(wage) ~ sex + sector + exper + educ, data = CPS85)
## 
## Call:
## lm(formula = log(wage) ~ sex + sector + exper + educ, data = CPS85)
## 
## Coefficients:
##   (Intercept)           sexM    sectorconst    sectormanag    sectormanuf  
##        0.6978         0.2197         0.1637         0.2087         0.0447  
##   sectorother     sectorprof    sectorsales  sectorservice          exper  
##        0.0393         0.1976        -0.1513        -0.1539         0.0118  
##          educ  
##        0.0761
exp(-0.2197)
## [1] 0.8028

Hardly any difference. But maybe the model should be more complicated.

Activity

Use ANOVA and log wages to see if interactions should be included in the model. Example:

mod = lm(log(wage) ~ sector * sex, data = CPS85)
anova(mod)
## Analysis of Variance Table
## 
## Response: log(wage)
##             Df Sum Sq Mean Sq F value  Pr(>F)    
## sector       7   27.0    3.86   17.78 < 2e-16 ***
## sex          1    5.4    5.44   25.06 7.6e-07 ***
## sector:sex   6    3.2    0.53    2.46   0.024 *  
## Residuals  519  112.8    0.22                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Student Activity

Question: Is there an interaction between age and mileage in the used car data? Does it show up if we look at log prices?

cars = fetchData("used-hondas.csv")
## Retrieving from http://www.mosaic-web.org/go/datasets/used-hondas.csv
anova(lm(log(Price) ~ Age * Mileage, data = cars))
## Analysis of Variance Table
## 
## Response: log(Price)
##             Df Sum Sq Mean Sq F value Pr(>F)    
## Age          1   8.33    8.33  599.43 <2e-16 ***
## Mileage      1   2.19    2.19  157.41 <2e-16 ***
## Age:Mileage  1   0.05    0.05    3.29  0.073 .  
## Residuals   88   1.22    0.01                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Example: Do professors vary in how they grade? Revisited

One place where F shines is when we want to look at many explanatory vectors collectively.

In the last class, we looked at professor-wise gradepoint averages, with an eye to the question of whether some professors are easy graders. We used the model coefficient for each professor as the test statistic, and ran into the question of multiple comparisons.

Now let's return to the question using analysis of variance.

grades = fetchData("grades.csv")
## Retrieving from http://www.mosaic-web.org/go/datasets/grades.csv
courses = fetchData("courses.csv")
## Retrieving from http://www.mosaic-web.org/go/datasets/courses.csv
g2n = fetchData("grade-to-number.csv")
## Retrieving from http://www.mosaic-web.org/go/datasets/grade-to-number.csv
all = merge(grades, courses)
all = merge(all, g2n)  # a data set of every grade given, etc.

Suppose, instead of being concerned about individual professors, we were interested in the professorate as a whole: do they grade in a consistent way, where “consistent” means drawing grades from a common pool? This test can be done easily. Build the model and see whether the explanatory variable accounts for more than is likely to arise from chance:

mod1 = lm(gradepoint ~ iid, data = all)
r.squared(mod1)
## [1] 0.1745

The regression report actually gives a p-value for this R-squared. It's no different from what we would get by travelling to Planet Null: randomizing iid and seeing what the distribution of R-squared is there.
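Here's a sketch of that trip, using shuffle() and do() from mosaic (just 100 trials, for speed):

nulldist = do(100) * r.squared(lm(gradepoint ~ shuffle(iid), data = all))
# Compare the observed R-squared of 0.1745 to this null distribution. With 358
# junky degrees of freedom, a fair amount of R-squared shows up even on Planet Null.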

Another way to summarize the model is with an ANOVA report:

anova(mod1)
## Analysis of Variance Table
## 
## Response: gradepoint
##             Df Sum Sq Mean Sq F value Pr(>F)    
## iid        358    352   0.983    3.16 <2e-16 ***
## Residuals 5350   1665   0.311                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Theory of F

Explain R-squared in terms of a graph of (hypothetical) R-squared versus the number of junky model vectors. F is the ratio of the slopes of the segments.

Now do it stepwise, by finding the sum of squares of the fitted model values in a nested pair of models, ~ 1 and ~ 1 + iid.

A way to think of the F statistic: miles per gallon for the model terms compared to miles per gallon for the residuals, where the sum of squares is the distance covered and each degree of freedom is a gallon of fuel.
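To make the analogy concrete, here's the stepwise calculation for the grades model, using the degrees of freedom from the ANOVA table above:

m0 = lm(gradepoint ~ 1, data = all)
m1 = lm(gradepoint ~ 1 + iid, data = all)
ss = sum(fitted(m1)^2) - sum(fitted(m0)^2)  # the 'miles' covered by iid: about 352
ss/358  # per degree of freedom ('gallon') spent: about 0.983
sum(resid(m1)^2)/5350  # a junky vector's miles per gallon: about 0.311
(ss/358) / (sum(resid(m1)^2)/5350)  # the ratio is F: about 3.16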

Breaking up the Variance into Parts

Of course, it's not fair to credit professors for variation in grades that is really due to the students. So we want to divide up the variation into a part due to the students and a part due to the professors. ANOVA lets you do this:

mod2 = lm(gradepoint ~ sid + iid, data = all)
anova(mod2)
## Analysis of Variance Table
## 
## Response: gradepoint
##             Df Sum Sq Mean Sq F value Pr(>F)    
## sid        442    645   1.459    6.76 <2e-16 ***
## iid        358    313   0.873    4.04 <2e-16 ***
## Residuals 4908   1060   0.216                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Interestingly, the result depends on the order in which you put the model terms, even though the model values themselves do not depend on that order at all.

mod3 = lm(gradepoint ~ iid + sid, data = all)
anova(mod3)
## Analysis of Variance Table
## 
## Response: gradepoint
##             Df Sum Sq Mean Sq F value Pr(>F)    
## iid        358    352   0.983    4.55 <2e-16 ***
## sid        442    606   1.370    6.34 <2e-16 ***
## Residuals 4908   1060   0.216                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
sum((fitted(mod2) - fitted(mod3))^2)  # model values are the same
## [1] 1.039e-22

Eating Up the Variance

The F statistic compares the “credit” earned by a model term to the mean square residual, which can be interpreted as the credit that would be earned by a junky random term.

Fit a model and add in some random terms. Show that the F for the random terms is about 1 and that the mean square of the residual is hardly changed by the random terms.

mod0 = lm(wage ~ sector + sex + exper, data = CPS85)
anova(mod0)
## Analysis of Variance Table
## 
## Response: wage
##            Df Sum Sq Mean Sq F value  Pr(>F)    
## sector      7   2572     367    17.8 < 2e-16 ***
## sex         1    436     436    21.1 5.3e-06 ***
## exper       1    264     264    12.8 0.00037 ***
## Residuals 524  10805      21                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The mean square residual is about 21.

Now throw in some junk:

mod10 = lm(wage ~ sector + sex + exper + rand(10), data = CPS85)
anova(mod10)
## Analysis of Variance Table
## 
## Response: wage
##            Df Sum Sq Mean Sq F value  Pr(>F)    
## sector      7   2572     367   17.72 < 2e-16 ***
## sex         1    436     436   21.03 5.7e-06 ***
## exper       1    264     264   12.76 0.00039 ***
## rand(10)   10    151      15    0.73 0.69703    
## Residuals 514  10654      21                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

But what if a term eats more variance than a junky term would? Such a term makes it easier for the other terms to show significance.

EXAMPLE: Difference in age between husband and wife in couples getting married.

Ask: Who is older in a married couple, the man or the woman? By how much?

Let's see what the data say:

m = fetchData("marriage.csv")
## Retrieving from http://www.mosaic-web.org/go/datasets/marriage.csv
mod0 = mm(Age ~ Person, data = m)
mod0
## Groupwise Model.
## Call:
## Age ~ Person
## 
## Coefficients:
## Bride  Groom  
##  33.2   35.8
confint(mod0)
##   group 2.5 % 97.5 %
## 1 Bride 29.15  37.33
## 2 Groom 31.70  39.87
mod1 = lm(Age ~ Person, data = m)
summary(mod1)
## 
## Call:
## lm(formula = Age ~ Person, data = m)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -17.42 -12.02  -3.34   8.80  39.56 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    33.24       2.06   16.13   <2e-16 ***
## PersonGroom     2.55       2.91    0.87     0.38    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 14.4 on 96 degrees of freedom
## Multiple R-squared: 0.00789, Adjusted R-squared: -0.00245 
## F-statistic: 0.763 on 1 and 96 DF,  p-value: 0.384

The point estimate is about right, but the margin of error is so large that we can't take the estimate very seriously. And the p-value is so large that we can't reject the null hypothesis that there is no relationship between Person and Age.

Try adding in some other variables (astrological sign, years of education, etc.) and show that this doesn't help much.

Finally, add in the BookpageID variable.

mod2 = lm(Age ~ Person + BookpageID, data = m)
anova(mod2)
## Analysis of Variance Table
## 
## Response: Age
##            Df Sum Sq Mean Sq F value Pr(>F)    
## Person      1    159     159    9.07 0.0041 ** 
## BookpageID 48  19127     398   22.77 <2e-16 ***
## Residuals  48    840      18                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
head(confint(mod2))
##                        2.5 % 97.5 %
## (Intercept)          18.7105 30.728
## PersonGroom           0.8461  4.245
## BookpageIDB230p1354  -9.1763  7.648
## BookpageIDB230p1665  -6.6544 10.169
## BookpageIDB230p1948 -13.7133  3.111
## BookpageIDB230p539   -3.7393 13.085

This gives an individual ID to each marriage. Putting this in the model effectively holds the couple constant when considering the Person. In terms of ANOVA, BookpageID is eating up lots and lots of variance.
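To see why BookpageID makes such a difference, here's a sketch that does the couple-by-couple comparison explicitly. (It assumes each BookpageID identifies exactly one bride/groom pair.)

brides = subset(m, Person == "Bride")
grooms = subset(m, Person == "Groom")
couples = merge(brides, grooms, by = "BookpageID")  # one row per couple
mean(couples$Age.y - couples$Age.x)  # groom's age minus bride's: about 2.5 years

Holding the couple constant replaces the noisy between-couple comparison with a much sharper within-couple one.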

Choosing Model Terms

Put terms first and they get the credit. Put terms later and they can help the earlier terms get credit, by eating up residual variance.

Look at sex and race in the wage data. Put them first. Put them last. What are the different interpretations?

mod0 = lm(log(wage) ~ race + sex, data = CPS85)
anova(mod0)
## Analysis of Variance Table
## 
## Response: log(wage)
##            Df Sum Sq Mean Sq F value  Pr(>F)    
## race        1    0.7    0.66     2.5    0.11    
## sex         1    7.2    7.23    27.3 2.5e-07 ***
## Residuals 531  140.6    0.26                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Race isn't showing up as significant, but sex is. Perhaps if we eat up the variance by putting in more explanatory variables …

mod1 = lm(log(wage) ~ race + sex + sector + educ + exper + south, data = CPS85)
anova(mod1)
## Analysis of Variance Table
## 
## Response: log(wage)
##            Df Sum Sq Mean Sq F value  Pr(>F)    
## race        1    0.7    0.66    3.47  0.0631 .  
## sex         1    7.2    7.23   37.96 1.4e-09 ***
## sector      7   24.8    3.54   18.56 < 2e-16 ***
## educ        1    5.5    5.52   28.95 1.1e-07 ***
## exper       1    9.5    9.49   49.83 5.4e-12 ***
## south       1    1.5    1.53    8.03  0.0048 ** 
## Residuals 521   99.3    0.19                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The significance of sex is even stronger. Race is right on the margin. With more data (and this is a very limited amount of data) we might see something.

But what if someone argues that it's really educational levels and the like that matter, and that you have to see how those affect wage first:

mod3 = lm(log(wage) ~ sector + educ + exper + south + sex + race, data = CPS85)
anova(mod3)
## Analysis of Variance Table
## 
## Response: log(wage)
##            Df Sum Sq Mean Sq F value  Pr(>F)    
## sector      7   27.0    3.86   20.27 < 2e-16 ***
## educ        1    6.5    6.50   34.13 9.1e-09 ***
## exper       1    8.8    8.83   46.34 2.7e-11 ***
## south       1    1.5    1.45    7.63  0.0059 ** 
## sex         1    5.2    5.23   27.46 2.3e-07 ***
## race        1    0.1    0.13    0.70  0.4019    
## Residuals 521   99.3    0.19                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

No indication that race is playing a role.

QUESTION: But is race affecting education, experience, and sector? If so, then race should go first.
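One way to probe that question (a sketch, not a settled answer) is to check whether race accounts for variance in the proposed intermediates:

anova(lm(educ ~ race, data = CPS85))
anova(lm(exper ~ race, data = CPS85))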

An Odd Situation

Sometimes the later term gets credit, but only because it was set up by an earlier term. Without the earlier term, it wouldn't get credit.

Examples:

ANOVA and near redundancy

The CPS85 data. With ANOVA, you can throw all sorts of stuff into the model and not have to worry about redundancy or near redundancy. Here, exper, age, and educ are nearly redundant (experience is roughly age minus education minus 6), which makes the individual coefficients and their standard errors unstable but leaves the ANOVA table perfectly usable.

m1 = lm(wage ~ sector + exper + age + educ, data = CPS85)
summary(m1)  # terrible standard errors
## 
## Call:
## lm(formula = wage ~ sector + exper + age + educ, data = CPS85)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -10.36  -2.80  -0.65   2.01  33.92 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -5.0445     6.8083   -0.74   0.4591    
## sectorconst     3.0259     1.0984    2.75   0.0061 ** 
## sectormanag     4.0025     0.7651    5.23  2.4e-07 ***
## sectormanuf     1.7037     0.7186    2.37   0.0181 *  
## sectorother     2.1345     0.7114    3.00   0.0028 ** 
## sectorprof      2.7200     0.6766    4.02  6.7e-05 ***
## sectorsales    -0.1196     0.8467   -0.14   0.8877    
## sectorservice  -0.1434     0.6727   -0.21   0.8313    
## exper          -0.0765     1.1133   -0.07   0.9452    
## age             0.1751     1.1122    0.16   0.8749    
## educ            0.5733     1.1182    0.51   0.6084    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 4.42 on 523 degrees of freedom
## Multiple R-squared: 0.274,   Adjusted R-squared: 0.26 
## F-statistic: 19.7 on 10 and 523 DF,  p-value: <2e-16
anova(m1)
## Analysis of Variance Table
## 
## Response: wage
##            Df Sum Sq Mean Sq F value  Pr(>F)    
## sector      7   2572     367   18.80 < 2e-16 ***
## exper       1    213     213   10.91   0.001 ** 
## age         1   1068    1068   54.65 5.7e-13 ***
## educ        1      5       5    0.26   0.608    
## Residuals 523  10219      20                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
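Where does the near redundancy come from? The CPS85 documentation describes exper as inferred from age and education. If that's so, this check should come out at (essentially) zero:

with(CPS85, max(abs(age - educ - 6 - exper)))  # exper inferred as age - educ - 6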