Stats 155 Class Notes 2012-09-21

Review

Groupwise models

BUT …

Why would we want to use more than one variable? Because things are complicated.

Example: Do wages depend on sex in the CPS85 data?

mod = mm(wage ~ sex, data = CPS85)
coef(mod)  # men make about $2.20 per hour more
##     F     M 
## 7.879 9.995 
summary(mod)  # the confidence intervals don't overlap
## Groupwise Model
## Call:  wage ~ sex 
## 
##   group 2.5 % 97.5 %
## 1     F 7.247  8.511
## 2     M 9.413 10.577
## 
## sigma:  5.03 
## r-squared: 0.0422 
## Adj. r-squared: 0.0404 

We could find the resampling distribution on the difference between men's and women's wages using the techniques we talked about in the last class. Just for convenience, we can use the diff() function to find the difference between the two groups:

diff(coef(mod))  # The point estimate
##     M 
## 2.116 
# Do some resampling to get the resampling distribution
trials = do(100) * diff(coef(mm(wage ~ sex, data = resample(CPS85))))
qdata(c(0.025, 0.975), M, data = trials)
##  2.5% 97.5% 
## 1.285 2.941 
sd(trials)  # Standard error
##      M 
## 0.4307 
densityplot(~M, data = trials)

plot of chunk unnamed-chunk-3

So there seems to be a difference between the men's and women's average wages.

Possible other factors

Ways in which men and women might differ:

How could we look at the issue of differences between the sexes in age, taking into account these other possibilities?

Sector of the economy is easy:

summary(mm(wage ~ sex + sector, data = CPS85))
## Groupwise Model
## Call:  wage ~ sex + sector 
## 
##         group  2.5 % 97.5 %
## 1  F.clerical  6.372  8.437
## 2  M.clerical  5.525  9.453
## 3     F.const    NaN    NaN
## 4     M.const  7.489 11.515
## 5     F.manag  9.092 13.020
## 6     M.manag 12.178 15.265
## 7     F.manuf  3.876  7.551
## 8     M.manuf  7.946 10.660
## 9     F.other  2.127  9.476
## 10    M.other  7.619  9.905
## 11     F.prof  9.857 12.353
## 12     M.prof 11.538 14.010
## 13    F.sales  3.059  7.425
## 14    M.sales  7.532 11.460
## 15  F.service  4.774  7.345
## 16  M.service  5.683  8.770
## 
## sigma:  4.58 
## r-squared: 0.226 
## Adj. r-squared: 0.205 

Compare the corresponding sector for the different sexes. The confidence intervals often overlap.

How to add in additional variables, e.g. age and education?

The basic models

Assume we have a response variable D, a quantitative explanatory variable, A, and a categorical variable G.
Here are the basic types of models, illustrated on the swimming data.

More complicated models mainly add additional variables but follow the same overall pattern.

swim = fetchData("swim100m.csv")
## Retrieving from http://www.mosaic-web.org/go/datasets/swim100m.csv

All cases are the same. There is one parameter, the grand mean.

mod = lm(time ~ 1, data = swim)
xyplot(time + fitted(mod) ~ year, data = swim)

plot of chunk unnamed-chunk-6

mod = lm(time ~ 1 + year, data = swim)
xyplot(time + fitted(mod) ~ year, data = swim)

plot of chunk unnamed-chunk-7

mod = lm(time ~ 1 + sex, data = swim)
xyplot(time + fitted(mod) ~ year, data = swim)

plot of chunk unnamed-chunk-8

mod = lm(time ~ 1 + year + sex, data = swim)
xyplot(time + fitted(mod) ~ year, data = swim)

plot of chunk unnamed-chunk-9

Note that the slope is the same for the two groups.

To see a different slope for the two groups, we have to ask the model to permit such a thing:

mod = lm(time ~ 1 + year + sex + year:sex, data = swim)
xyplot(time + fitted(mod) ~ year, data = swim)

plot of chunk unnamed-chunk-10

Variables and Model Terms

How a variable is different from a model term.
Model terms that are not variables: intercept, interaction terms.

Activity

Visual choice of model terms

Instructions are given here (These have to be reformatted as a handout and as an instructor's guide.)

The only substantive differences is in the way the software is loaded:

fetchData("mLM.R")
## Retrieving from http://www.mosaic-web.org/go/datasets/mLM.R
## [1] TRUE

The program itself is called mLM() (not lm.select.terms())

Weekly quiz