But not at all the general use for ANOVA
An attempt to be honest with p-values and avoid the problem of multiple comparisons.
Several levels of a categorical variable. Null hypothesis is generally phrased as “the levels are all the same.”
Compare to t-test: two levels.
Professors: many levels.
Typical central question: Is there an interaction between the variables?
Example: Sector and sex
mod = lm(wage ~ sector * sex, data = CPS85)
anova(mod)
## Analysis of Variance Table
##
## Response: wage
## Df Sum Sq Mean Sq F value Pr(>F)
## sector 7 2572 367 17.50 < 2e-16 ***
## sex 1 436 436 20.77 6.5e-06 ***
## sector:sex 6 175 29 1.39 0.22
## Residuals 519 10894 21
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Have we detected an interaction?
Prices are relative. An indication of this is the almost universal use of percentages to describe inflation, wage increases, etc. For example, an often quoted number is that women earn approximately 72 cents for each dollar earned by a man.
The naive way to find this number, which is in fact the way it is found, is to divide the average wage of women by the average wage of men, e.g.
mean(wage ~ sex, data = CPS85)
## F M
## 7.879 9.995
7.88/9.99
## [1] 0.7888
Close to the quoted number in this data set from the 1980s.
A better way is to work with log wages, find the contribution from sex, and then convert that back into a multiplier. That let's us adjust for various other factors. Here's the basic calculation, done in log-wage style:
lm(log(wage) ~ sex, data = CPS85)
##
## Call:
## lm(formula = log(wage) ~ sex, data = CPS85)
##
## Coefficients:
## (Intercept) sexM
## 1.934 0.231
exp(-0.2312)
## [1] 0.7936
Now we can include covariates:
lm(log(wage) ~ sex + sector + exper + educ, data = CPS85)
##
## Call:
## lm(formula = log(wage) ~ sex + sector + exper + educ, data = CPS85)
##
## Coefficients:
## (Intercept) sexM sectorconst sectormanag sectormanuf
## 0.6978 0.2197 0.1637 0.2087 0.0447
## sectorother sectorprof sectorsales sectorservice exper
## 0.0393 0.1976 -0.1513 -0.1539 0.0118
## educ
## 0.0761
exp(-0.2197)
## [1] 0.8028
Hardly any difference. But maybe the model should be more complicated.
Use ANOVA and log wages to see if interactions should be included in the model. Example:
mod = lm(log(wage) ~ sector * sex, data = CPS85)
anova(mod)
## Analysis of Variance Table
##
## Response: log(wage)
## Df Sum Sq Mean Sq F value Pr(>F)
## sector 7 27.0 3.86 17.78 < 2e-16 ***
## sex 1 5.4 5.44 25.06 7.6e-07 ***
## sector:sex 6 3.2 0.53 2.46 0.024 *
## Residuals 519 112.8 0.22
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Question: Is there an interaction between age and mileage in the used car data? Does it show up if we look at log prices?
cars = fetchData("used-hondas.csv")
## Retrieving from http://www.mosaic-web.org/go/datasets/used-hondas.csv
anova(lm(log(Price) ~ Age * Mileage, data = cars))
## Analysis of Variance Table
##
## Response: log(Price)
## Df Sum Sq Mean Sq F value Pr(>F)
## Age 1 8.33 8.33 599.43 <2e-16 ***
## Mileage 1 2.19 2.19 157.41 <2e-16 ***
## Age:Mileage 1 0.05 0.05 3.29 0.073 .
## Residuals 88 1.22 0.01
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
One place where F shines is when we want to look at many explanatory vectors collectively.
In the last class, we looked at professor-wise gradepoint averages, with an eye to the question of whether some professors are easy grading. We used as a test statistic, the model coefficient for each professor, and ran into the question of multiple comparisons.
Now let's return to the question using analysis of variance.
grades = fetchData("grades.csv")
## Retrieving from http://www.mosaic-web.org/go/datasets/grades.csv
courses = fetchData("courses.csv")
## Retrieving from http://www.mosaic-web.org/go/datasets/courses.csv
g2n = fetchData("grade-to-number.csv")
## Retrieving from http://www.mosaic-web.org/go/datasets/grade-to-number.csv
all = merge(grades, courses)
all = merge(all, g2n) # a data set of every grade given, etc.
Suppose, instead of being concerned about individual professors, we were interested in the professorate as a whole: do they grade in a consistent way, where “consistent” means, “draw grades from a common pool.” This test can be done easily. Build the model and see if the explanatory variable accounts for more than is likely to arise from chance:
mod1 = lm(gradepoint ~ iid, data = all)
r.squared(mod1)
## [1] 0.1745
The regression report actually gives a p-value for this r.squared. It's not any different than we would get by travelling to Planet Null: randomizing iid and seeing what is the distribution of R2 on Planet Null.
Another way to summarize the model is with an ANOVA report:
anova(mod1)
## Analysis of Variance Table
##
## Response: gradepoint
## Df Sum Sq Mean Sq F value Pr(>F)
## iid 358 352 0.983 3.16 <2e-16 ***
## Residuals 5350 1665 0.311
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Of course, it's not fair to credit professors for variation in grades that is really due to the students. So we want to divide up the variation into that due to the students and that due to the professors. ANOVA let's you do this:
mod2 = lm(gradepoint ~ sid + iid, data = all)
anova(mod2)
## Analysis of Variance Table
##
## Response: gradepoint
## Df Sum Sq Mean Sq F value Pr(>F)
## sid 442 645 1.459 6.76 <2e-16 ***
## iid 358 313 0.873 4.04 <2e-16 ***
## Residuals 4908 1060 0.216
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Interestingly, the result depends on the order in which you put the model terms, even though the model values do not at all depend on this.
mod3 = lm(gradepoint ~ iid + sid, data = all)
anova(mod3)
## Analysis of Variance Table
##
## Response: gradepoint
## Df Sum Sq Mean Sq F value Pr(>F)
## iid 358 352 0.983 4.55 <2e-16 ***
## sid 442 606 1.370 6.34 <2e-16 ***
## Residuals 4908 1060 0.216
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
sum((fitted(mod2) - fitted(mod3))^2) # model values are the same
## [1] 1.039e-22
The F statistic compares the “credit” earned by a model term to the mean square residual, which can be interpreted as the credit that would be earned by a junky random term.
Fit a model and add in some random terms. Show that the F for the random terms is about 1 and that the mean square of the residual is hardly changed by the random terms.
mod0 = lm(wage ~ sector + sex + exper, data = CPS85)
anova(mod0)
## Analysis of Variance Table
##
## Response: wage
## Df Sum Sq Mean Sq F value Pr(>F)
## sector 7 2572 367 17.8 < 2e-16 ***
## sex 1 436 436 21.1 5.3e-06 ***
## exper 1 264 264 12.8 0.00037 ***
## Residuals 524 10805 21
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The mean square residual is about 20.
Now throw in some junk:
mod10 = lm(wage ~ sector + sex + exper + rand(10), data = CPS85)
anova(mod10)
## Analysis of Variance Table
##
## Response: wage
## Df Sum Sq Mean Sq F value Pr(>F)
## sector 7 2572 367 17.63 <2e-16 ***
## sex 1 436 436 20.92 6e-06 ***
## exper 1 264 264 12.69 0.0004 ***
## rand(10) 10 91 9 0.44 0.9280
## Residuals 514 10713 21
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
But what if a term eats more variance than a junky term. That term makes it easier for the other terms to show significance.
EXAMPLE: Difference in age between husband and wife in couples getting married.
Ask: Who is older in a married couple, the man or the woman? By how much?
Let's see if the data support this:
m = fetchData("marriage.csv")
## Retrieving from http://www.mosaic-web.org/go/datasets/marriage.csv
mod0 = mm(Age ~ Person, data = m)
mod0
## Groupwise Model.
## Call:
## Age ~ Person
##
## Coefficients:
## Bride Groom
## 33.2 35.8
confint(mod0)
## group 2.5 % 97.5 %
## 1 Bride 29.15 37.33
## 2 Groom 31.70 39.87
mod1 = lm(Age ~ Person, data = m)
summary(mod1)
##
## Call:
## lm(formula = Age ~ Person, data = m)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.42 -12.02 -3.34 8.80 39.56
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.24 2.06 16.13 <2e-16 ***
## PersonGroom 2.55 2.91 0.87 0.38
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.4 on 96 degrees of freedom
## Multiple R-squared: 0.00789, Adjusted R-squared: -0.00245
## F-statistic: 0.763 on 1 and 96 DF, p-value: 0.384
The point estimate is about right, but the margin of error is so large that we can't take this estimate very seriously. The p-value is so large that we can't reject the null that there is no relationship between Person and age.
Try adding in some other variables, astrological sign, years of education, etc. and show that this doesn't help much.
Finally, add in the BookpageID variable.
mod2 = lm(Age ~ Person + BookpageID, data = m)
anova(mod2)
## Analysis of Variance Table
##
## Response: Age
## Df Sum Sq Mean Sq F value Pr(>F)
## Person 1 159 159 9.07 0.0041 **
## BookpageID 48 19127 398 22.77 <2e-16 ***
## Residuals 48 840 18
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
head(confint(mod2))
## 2.5 % 97.5 %
## (Intercept) 18.7105 30.728
## PersonGroom 0.8461 4.245
## BookpageIDB230p1354 -9.1763 7.648
## BookpageIDB230p1665 -6.6544 10.169
## BookpageIDB230p1948 -13.7133 3.111
## BookpageIDB230p539 -3.7393 13.085
This gives an individual ID to each marriage. Putting this in the model effectively holds the couple constant when considering the Person. In terms of ANOVA, BookpageID is eating up lots and lots of variance.
Consider the SAT data and sat as the response variable. Make models with frac, salary, expend and salary as the explanatory variables. Try different orders for the explanatory variables. Try both the ANOVA and the regression reports. Come up with a theory of what's happening.