Have two students come to the front of the class. Use a ruler to translate the top of each student's head to the board, but have one student stand substantially in front of the other, so that the ruler sits at a steeper angle for one of them.
Do the marks on the board indicate that one of the students is taller than the other?
Since we didn't adopt a sensible measurement protocol, there is a large variation in the measurements, so we can't use them to decide who is taller.
Do women earn less than men according to the 1985 CPS survey?
cps = fetchData("CPS85")
## Data CPS85 found in package.
mod = mm(wage ~ sex, data = cps)
coef(mod)
## F M
## 7.879 9.995
The coefficients suggest a difference, but how do we know how precise the measurements are? That's what the confidence intervals are for.
confint(mod)
## group 2.5 % 97.5 %
## 1 F 7.247 8.511
## 2 M 9.413 10.577
If the intervals don't overlap, we have good reason to think the measurements are precise enough to indicate the difference reliably and repeatably.
If they overlap hugely, we don't have such reason.
If there's a bit of overlap, we'll need to be more careful. An example: age and sex.
mod2 = mm(age ~ sex, data = cps)
confint(mod2)
## group 2.5 % 97.5 %
## 1 F 36.37 39.31
## 2 M 34.63 37.33
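A quick numeric check, as a sketch using the bounds printed above (two intervals overlap when each lower bound falls below the other's upper bound):
F_ci = c(36.37, 39.31)  # bounds for F from confint(mod2)
M_ci = c(34.63, 37.33)  # bounds for M
F_ci[1] < M_ci[2] & M_ci[1] < F_ci[2]  # TRUE: the intervals overlap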
Two equivalent formats: point estimate ± margin of error, or a lower and upper bound.
Three components:
1. Point estimate \( p \)
2. Margin of error \( m \)
3. Confidence level \( c \), typically 95%
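The two formats convert into one another. Here's a quick sketch using the female wage interval reported above:
lower = 7.247; upper = 8.511
p = (lower + upper)/2  # point estimate: 7.879
m = (upper - lower)/2  # margin of error: 0.632
c(p - m, p + m)        # back to the lower/upper format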
Two main methods for computing an interval: the standard-error method and the quantile method.
With a set of resampling replications in hand, confint() will compute the interval whichever way you ask:
trials = do(100) * mm(wage ~ sex, data = resample(cps))
confint(trials, method = "stderr")
## name lower upper
## 1 F 7.314460 8.46676
## 2 M 9.344742 10.54030
## 3 sigma 4.383313 5.66285
## 4 r.squared 0.008799 0.07456
confint(trials, method = "quantile")
## name 2.5 % 97.5 %
## 1 F 7.3710 8.47561
## 2 M 9.3716 10.48836
## 3 sigma 4.4221 5.70775
## 4 r.squared 0.0171 0.07587
Almost always, however, you'll take a short-cut and just ask for the confidence interval by passing the model itself, rather than a set of trials, to confint. As it happens, this is done with the standard error method.
For instance, applying confint() directly to the wage model from above:
confint(mod)
## group 2.5 % 97.5 %
## 1 F 7.247 8.511
## 2 M 9.413 10.577
The shorthand way of talking generally involves the standard error. You're supposed to know to multiply it by 2 to get a 95% margin of error.
Why don't we just talk about the margin of error and forget the standard error? Because the standard error is a standard deviation (of a sampling distribution) and for historical reasons the standard deviation is considered basic.
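For instance, a sketch with the standard error back-computed from the female wage interval above (the value 0.316 isn't printed anywhere; it's (8.511 - 7.247)/4):
se = 0.316             # assumed: back-computed from the interval above
7.879 + c(-2, 2) * se  # reproduces the 7.247 to 8.511 interval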
The standard error is defined as the standard deviation of the sampling distribution. In practice, we compute it as the standard deviation of the resampling distribution.
The standard error quantifies how precise a model coefficient is. It depends on several things, which we'll encounter as we go through the course. But one of them we can understand right now, and it critically shapes how you decide to collect a sample:
The standard error depends on the size of the sample.
Student Activity: Use sampling and do() with 200 trials to calculate the standard error for different sample sizes. Assign a sample size to each student: 25, 50, 100, 200, 400, 800, 1600, 3200.
Example:
run = fetchData("repeat-runners.csv")
## Retrieving from http://www.mosaic-web.org/go/datasets/repeat-runners.csv
trials = do(200) * mm(net ~ sex, data = sample(run, size = 800))
sd(trials)
## F M sigma r.squared
## 0.85815 0.69596 NA 0.02575
Draw a plot of the standard error versus sample size. Show that the standard error gets smaller as \( 1/\sqrt{n} \). You need 4 times as much data to get an estimate that's twice as precise.
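Here's a sketch of the whole activity in one go, assuming the run data loaded above:
sizes = c(25, 50, 100, 200, 400, 800, 1600, 3200)
ses = sapply(sizes, function(n) {
    trials = do(200) * mm(net ~ sex, data = sample(run, size = n))
    sd(trials$M)  # standard error of the M coefficient
})
plot(sizes, ses, log = "xy")  # roughly a straight line with slope -1/2 on log-log axes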
Groupwise models
BUT … why would we want to use more than one variable? Because things are complicated.
Example: Do wages depend on sex in the CPS85 data?
mod = mm(wage ~ sex, data = cps)
coef(mod) # men make about $2.12 per hour more
## F M
## 7.879 9.995
summary(mod) # the confidence intervals don't overlap
## Groupwise Model
## Call: wage ~ sex
##
## group 2.5 % 97.5 %
## 1 F 7.247 8.511
## 2 M 9.413 10.577
##
## sigma: 5.03
## r-squared: 0.0422
## Adj. r-squared: 0.0404
We could find the resampling distribution of the difference between men's and women's wages using the techniques we talked about in the last class. Just for convenience, we can use the diff() function to find the difference between the two groups:
diff(coef(mod)) # The point estimate
## M
## 2.116
# Do some resampling to get the resampling distribution
trials = do(100) * diff(coef(mm(wage ~ sex, data = resample(cps))))
qdata(c(0.025, 0.975), M, data = trials)
## 2.5% 97.5%
## 1.066 2.773
sd(trials) # Standard error
## M
## 0.4298
densityplot(~M, data = trials)
So there seems to be a difference between the men's and women's average wages.
Ways in which men and women might differ: sector of the economy, age, education.
How could we look at the issue of differences between the sexes in wage, taking into account these other possibilities?
Sector of the economy is easy:
summary(mm(wage ~ sex + sector, data = cps))
## Groupwise Model
## Call: wage ~ sex + sector
##
## group 2.5 % 97.5 %
## 1 F.clerical 6.372 8.437
## 2 M.clerical 5.525 9.453
## 3 F.const NaN NaN
## 4 M.const 7.489 11.515
## 5 F.manag 9.092 13.020
## 6 M.manag 12.178 15.265
## 7 F.manuf 3.876 7.551
## 8 M.manuf 7.946 10.660
## 9 F.other 2.127 9.476
## 10 M.other 7.619 9.905
## 11 F.prof 9.857 12.353
## 12 M.prof 11.538 14.010
## 13 F.sales 3.059 7.425
## 14 M.sales 7.532 11.460
## 15 F.service 4.774 7.345
## 16 M.service 5.683 8.770
##
## sigma: 4.58
## r-squared: 0.226
## Adj. r-squared: 0.205
Compare the corresponding sectors for the two sexes. The confidence intervals often overlap.
How to add in additional variables, e.g. age and education?
One intuitive idea is to break up the quantitative variables into categorical variables. For instance, age:
mod3 = mm(wage ~ sex + sector + cut(age, breaks = c(0, 20, 40, 60, 80)), data = cps)
Look at confint(mod3) and tell me about the overlap between the confidence intervals. The intervals are huge. The problem is that there isn't much data in any one group:
tally(~sex + sector + cut(age, breaks = c(0, 20, 40, 60, 80)), data = cps)
## , , cut(age, breaks = c(0, 20, 40, 60, 80)) = (0,20]
##
## sector
## sex clerical const manag manuf other prof sales service Total
## F 5 0 1 0 0 1 0 3 10
## M 1 0 0 3 6 1 1 6 18
## Total 6 0 1 3 6 2 1 9 28
##
## , , cut(age, breaks = c(0, 20, 40, 60, 80)) = (20,40]
##
## sector
## sex clerical const manag manuf other prof sales service Total
## F 42 0 12 15 5 35 10 22 141
## M 17 10 20 27 42 38 12 18 184
## Total 59 10 32 42 47 73 22 40 325
##
## , , cut(age, breaks = c(0, 20, 40, 60, 80)) = (40,60]
##
## sector
## sex clerical const manag manuf other prof sales service Total
## F 24 0 8 9 0 14 6 19 80
## M 3 10 13 13 14 14 5 6 78
## Total 27 10 21 22 14 28 11 25 158
##
## , , cut(age, breaks = c(0, 20, 40, 60, 80)) = (60,80]
##
## sector
## sex clerical const manag manuf other prof sales service Total
## F 5 0 0 0 1 2 1 5 14
## M 0 0 1 1 0 0 3 4 9
## Total 5 0 1 1 1 2 4 9 23
##
## , , cut(age, breaks = c(0, 20, 40, 60, 80)) = Total
##
## sector
## sex clerical const manag manuf other prof sales service Total
## F 76 0 21 24 6 52 17 49 245
## M 21 20 34 44 62 53 21 34 289
## Total 97 20 55 68 68 105 38 83 534
We can do better. The key to doing better is to recognize that some pairs of groups are “nearby” and that nearby groups should have related results. This is what lm() will do for us. Time to take off the training wheels.
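As a peek ahead, here is a sketch (not the worked example from class) of the same question with lm(), where age can enter as a quantitative variable rather than coarse categories:
mod4 = lm(wage ~ sex + sector + age, data = cps)
confint(mod4)  # one interval per coefficient, e.g. sexM for the sex difference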