Groupwise models
BUT …
Why would we want to use more than one variable? Because things are complicated.
Example: Do wages depend on sex in the CPS85 data?
mod = mm(wage ~ sex, data = CPS85)
coef(mod) # men make about $2.20 per hour more
## F M
## 7.879 9.995
summary(mod) # the confidence intervals don't overlap
## Groupwise Model
## Call: wage ~ sex
##
## group 2.5 % 97.5 %
## 1 F 7.247 8.511
## 2 M 9.413 10.577
##
## sigma: 5.03
## r-squared: 0.0422
## Adj. r-squared: 0.0404
We could find the resampling distribution on the difference between men's and women's wages using the techniques we talked about in the last class. Just for convenience, we can use the diff() function to find the difference between the two groups:
diff(coef(mod)) # The point estimate
## M
## 2.116
# Do some resampling to get the resampling distribution
trials = do(100) * diff(coef(mm(wage ~ sex, data = resample(CPS85))))
qdata(c(0.025, 0.975), M, data = trials)
## 2.5% 97.5%
## 1.285 2.941
sd(trials) # Standard error
## M
## 0.4307
densityplot(~M, data = trials)
So there seems to be a difference between the men's and women's average wages.
Ways in which men and women might differ:
How could we look at the issue of differences between the sexes in age, taking into account these other possibilities?
Sector of the economy is easy:
summary(mm(wage ~ sex + sector, data = CPS85))
## Groupwise Model
## Call: wage ~ sex + sector
##
## group 2.5 % 97.5 %
## 1 F.clerical 6.372 8.437
## 2 M.clerical 5.525 9.453
## 3 F.const NaN NaN
## 4 M.const 7.489 11.515
## 5 F.manag 9.092 13.020
## 6 M.manag 12.178 15.265
## 7 F.manuf 3.876 7.551
## 8 M.manuf 7.946 10.660
## 9 F.other 2.127 9.476
## 10 M.other 7.619 9.905
## 11 F.prof 9.857 12.353
## 12 M.prof 11.538 14.010
## 13 F.sales 3.059 7.425
## 14 M.sales 7.532 11.460
## 15 F.service 4.774 7.345
## 16 M.service 5.683 8.770
##
## sigma: 4.58
## r-squared: 0.226
## Adj. r-squared: 0.205
Compare the corresponding sector for the different sexes. The confidence intervals often overlap.
How to add in additional variables, e.g. age and education?
Assume we have a response variable D, a quantitative explanatory variable, A, and a categorical variable G.
Here are the basic types of models, illustrated on the swimming data.
More complicated models mainly add additional variables but follow the same overall pattern.
swim = fetchData("swim100m.csv")
## Retrieving from http://www.mosaic-web.org/go/datasets/swim100m.csv
D ~ 1All cases are the same. There is one parameter, the grand mean.
mod = lm(time ~ 1, data = swim)
xyplot(time + fitted(mod) ~ year, data = swim)
D ~ 1 + A
The cases vary with the quantitative variable. There are two parameters: slope and intercept.mod = lm(time ~ 1 + year, data = swim)
xyplot(time + fitted(mod) ~ year, data = swim)
D ~ 1 + G
The groupwise mean model. But note that the parameters are written differently, as an “intercept” and a “difference”. QUESTION: Which difference is it?mod = lm(time ~ 1 + sex, data = swim)
xyplot(time + fitted(mod) ~ year, data = swim)
D ~ 1 + A + G
Both variables play a role. The parameters are “intercept,” “slope,” and “difference between groups.”mod = lm(time ~ 1 + year + sex, data = swim)
xyplot(time + fitted(mod) ~ year, data = swim)
Note that the slope is the same for the two groups.
To see a different slope for the two groups, we have to ask the model to permit such a thing:
D ~ 1 + A + G + A:Gmod = lm(time ~ 1 + year + sex + year:sex, data = swim)
xyplot(time + fitted(mod) ~ year, data = swim)
How a variable is different from a model term.
Model terms that are not variables: intercept, interaction terms.
Visual choice of model terms
Instructions are given here (These have to be reformatted as a handout and as an instructor's guide.)
The only substantive differences is in the way the software is loaded:
fetchData("mLM.R")
## Retrieving from http://www.mosaic-web.org/go/datasets/mLM.R
## [1] TRUE
The program itself is called mLM() (not lm.select.terms())