Stats 155 Class Notes 2012-09-21

Review

Groupwise models

Easy enough to fit groupwise models. The groupwise mean is a nice estimate for the parameter that eliminates the bias in the residuals and minimizes the variance.
We can use more than one categorical variable as a grouping variable. When we do this, we get results for all the “crosses”.

BUT …

What if we want to use a quantitative variable as an explanatory variable?
What if we have more than one categorical variable?
What if we have a mixture of quantitative and categorical variables?

Why would we want to use more than one variable? Because things are complicated.

Example: Do wages depend on sex in the CPS85 data?

mod = mm(wage ~ sex, data = CPS85)
coef(mod)  # men make about $2.20 per hour more

##     F     M 
## 7.879 9.995

summary(mod)  # the confidence intervals don't overlap

## Groupwise Model
## Call:  wage ~ sex 
## 
##   group 2.5 % 97.5 %
## 1     F 7.247  8.511
## 2     M 9.413 10.577
## 
## sigma:  5.03 
## r-squared: 0.0422 
## Adj. r-squared: 0.0404

We could find the resampling distribution on the difference between men's and women's wages using the techniques we talked about in the last class. Just for convenience, we can use the diff() function to find the difference between the two groups:

diff(coef(mod))  # The point estimate

##     M 
## 2.116

# Do some resampling to get the resampling distribution
trials = do(100) * diff(coef(mm(wage ~ sex, data = resample(CPS85))))
qdata(c(0.025, 0.975), M, data = trials)

##  2.5% 97.5% 
## 1.285 2.941

sd(trials)  # Standard error

##      M 
## 0.4307

densityplot(~M, data = trials)

plot of chunk unnamed-chunk-3

So there seems to be a difference between the men's and women's average wages.

Possible other factors

Ways in which men and women might differ:

education
age
type of job

How could we look at the issue of differences between the sexes in age, taking into account these other possibilities?

Sector of the economy is easy:

summary(mm(wage ~ sex + sector, data = CPS85))

## Groupwise Model
## Call:  wage ~ sex + sector 
## 
##         group  2.5 % 97.5 %
## 1  F.clerical  6.372  8.437
## 2  M.clerical  5.525  9.453
## 3     F.const    NaN    NaN
## 4     M.const  7.489 11.515
## 5     F.manag  9.092 13.020
## 6     M.manag 12.178 15.265
## 7     F.manuf  3.876  7.551
## 8     M.manuf  7.946 10.660
## 9     F.other  2.127  9.476
## 10    M.other  7.619  9.905
## 11     F.prof  9.857 12.353
## 12     M.prof 11.538 14.010
## 13    F.sales  3.059  7.425
## 14    M.sales  7.532 11.460
## 15  F.service  4.774  7.345
## 16  M.service  5.683  8.770
## 
## sigma:  4.58 
## r-squared: 0.226 
## Adj. r-squared: 0.205

Compare the corresponding sector for the different sexes. The confidence intervals often overlap.

How to add in additional variables, e.g. age and education?

The basic models

Assume we have a response variable D, a quantitative explanatory variable, A, and a categorical variable G.
Here are the basic types of models, illustrated on the swimming data.

More complicated models mainly add additional variables but follow the same overall pattern.

swim = fetchData("swim100m.csv")

## Retrieving from http://www.mosaic-web.org/go/datasets/swim100m.csv

D ~ 1

All cases are the same. There is one parameter, the grand mean.

mod = lm(time ~ 1, data = swim)
xyplot(time + fitted(mod) ~ year, data = swim)

plot of chunk unnamed-chunk-6

D ~ 1 + A The cases vary with the quantitative variable. There are two parameters: slope and intercept.

mod = lm(time ~ 1 + year, data = swim)
xyplot(time + fitted(mod) ~ year, data = swim)

plot of chunk unnamed-chunk-7

D ~ 1 + G The groupwise mean model. But note that the parameters are written differently, as an “intercept” and a “difference”. QUESTION: Which difference is it?

mod = lm(time ~ 1 + sex, data = swim)
xyplot(time + fitted(mod) ~ year, data = swim)

plot of chunk unnamed-chunk-8

D ~ 1 + A + G Both variables play a role. The parameters are “intercept,” “slope,” and “difference between groups.”

mod = lm(time ~ 1 + year + sex, data = swim)
xyplot(time + fitted(mod) ~ year, data = swim)

plot of chunk unnamed-chunk-9

Note that the slope is the same for the two groups.

To see a different slope for the two groups, we have to ask the model to permit such a thing:

D ~ 1 + A + G + A:G

mod = lm(time ~ 1 + year + sex + year:sex, data = swim)
xyplot(time + fitted(mod) ~ year, data = swim)

plot of chunk unnamed-chunk-10

Variables and Model Terms

How a variable is different from a model term.
Model terms that are not variables: intercept, interaction terms.

Activity

Visual choice of model terms

Instructions are given here (These have to be reformatted as a handout and as an instructor's guide.)

The only substantive differences is in the way the software is loaded:

fetchData("mLM.R")

## Retrieving from http://www.mosaic-web.org/go/datasets/mLM.R

## [1] TRUE