Have two students come to the front of the class. Use a ruler to translate the top of each student's head to the board, but have one student stand substantially in front of the other, so that the ruler sits at a steeper angle for one of them.
Do the marks on the board indicate that one of the students is taller than the other?
Since we didn't adopt a sensible measurement protocol, there is a large variation in the measurements, so we can't use them to decide who is taller.
Do women earn less than men according to the 1985 CPS survey?
cps = fetchData("CPS85")
## Data CPS85 found in package.
mod = mm(wage ~ sex, data = cps)
coef(mod)
## F M
## 7.879 9.995
The coefficients suggest a difference, but how do we know how precise the measurements are? That's what the confidence intervals are for.
confint(mod)
## group 2.5 % 97.5 %
## 1 F 7.247 8.511
## 2 M 9.413 10.577
If the intervals don't overlap, we have good reason to think the measurements are precise enough to indicate the difference reliably and repeatably.
If they overlap hugely, we don't have such reason.
If there's a bit of overlap, we'll need to be more careful. An example: age and sex.
mod2 = mm(age ~ sex, data = cps)
confint(mod2)
## group 2.5 % 97.5 %
## 1 F 36.37 39.31
## 2 M 34.63 37.33
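A quick numeric check, as a sketch using the bounds printed above (two intervals overlap when each lower bound falls below the other's upper bound):
F_ci = c(36.37, 39.31)  # bounds for F from confint(mod2)
M_ci = c(34.63, 37.33)  # bounds for M
F_ci[1] < M_ci[2] & M_ci[1] < F_ci[2]  # TRUE: the intervals overlap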
Two equivalent formats: point estimate ± margin of error, or a lower and upper bound.
Three components:
1. Point estimate \( p \)
2. Margin of error \( m \)
3. Confidence level \( c \), typically 95%
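The two formats convert into one another. Here's a quick sketch using the female wage interval reported above:
lower = 7.247; upper = 8.511
p = (lower + upper)/2  # point estimate: 7.879
m = (upper - lower)/2  # margin of error: 0.632
c(p - m, p + m)        # back to the lower/upper format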
Two main methods for computing an interval: the standard-error method and the quantile method.
With a set of resampling replications in hand, confint() will compute the interval whichever way you ask:
trials = do(100) * mm(wage ~ sex, data = resample(cps))
confint(trials, method = "stderr")
## name lower upper
## 1 F 7.314460 8.46676
## 2 M 9.344742 10.54030
## 3 sigma 4.383313 5.66285
## 4 r.squared 0.008799 0.07456
confint(trials, method = "quantile")
## name 2.5 % 97.5 %
## 1 F 7.3710 8.47561
## 2 M 9.3716 10.48836
## 3 sigma 4.4221 5.70775
## 4 r.squared 0.0171 0.07587
Almost always, however, you'll take a short-cut and just ask for the confidence interval by passing the model itself, rather than a set of trials, to confint. As it happens, this is done with the standard error method.
For instance, applying confint() directly to the wage model from above:
confint(mod)
## group 2.5 % 97.5 %
## 1 F 7.247 8.511
## 2 M 9.413 10.577
The shorthand way of talking generally involves the standard error. You're supposed to know to multiply it by 2 to get a 95% margin of error.
Why don't we just talk about the margin of error and forget the standard error? Because the standard error is a standard deviation (of a sampling distribution) and for historical reasons the standard deviation is considered basic.
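For instance, a sketch with the standard error back-computed from the female wage interval above (the value 0.316 isn't printed anywhere; it's (8.511 - 7.247)/4):
se = 0.316             # assumed: back-computed from the interval above
7.879 + c(-2, 2) * se  # reproduces the 7.247 to 8.511 interval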
The standard error is defined as the standard deviation of the sampling distribution. In practice, we compute it as the standard deviation of the resampling distribution.
The standard error quantifies how precise a model coefficient is. It depends on several things, which we'll encounter as we go through the course. But one of them we can understand right now, and it critically shapes how you decide to collect a sample:
The standard error depends on the size of the sample.
Student Activity: Use sampling and do() with 200 trials to calculate the standard error for different sample sizes. Assign a sample size to each student: 25, 50, 100, 200, 400, 800, 1600, 3200.
Example:
run = fetchData("repeat-runners.csv")
## Retrieving from http://www.mosaic-web.org/go/datasets/repeat-runners.csv
trials = do(200) * mm(net ~ sex, data = sample(run, size = 800))
sd(trials)
## F M sigma r.squared
## 0.85815 0.69596 NA 0.02575
Draw a plot of the standard error versus sample size. Show that the standard error gets smaller as \( 1/\sqrt{n} \). You need 4 times as much data to get an estimate that's twice as precise.
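Here's a sketch of the whole activity in one go, assuming the run data loaded above:
sizes = c(25, 50, 100, 200, 400, 800, 1600, 3200)
ses = sapply(sizes, function(n) {
    trials = do(200) * mm(net ~ sex, data = sample(run, size = n))
    sd(trials$M)  # standard error of the M coefficient
})
plot(sizes, ses, log = "xy")  # roughly a straight line with slope -1/2 on log-log axes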
Groupwise models
BUT … why would we want to use more than one variable? Because things are complicated.
Example: Do wages depend on sex in the CPS85 data?
mod = mm(wage ~ sex, data = cps)
coef(mod) # men make about $2.12 per hour more
## F M
## 7.879 9.995
summary(mod) # the confidence intervals don't overlap
## Groupwise Model
## Call: wage ~ sex
##
## group 2.5 % 97.5 %
## 1 F 7.247 8.511
## 2 M 9.413 10.577
##
## sigma: 5.03
## r-squared: 0.0422
## Adj. r-squared: 0.0404
We could find the resampling distribution of the difference between men's and women's wages using the techniques we talked about in the last class. Just for convenience, we can use the diff() function to find the difference between the two groups:
diff(coef(mod)) # The point estimate
## M
## 2.116
# Do some resampling to get the resampling distribution
trials = do(100) * diff(coef(mm(wage ~ sex, data = resample(cps))))
qdata(c(0.025, 0.975), M, data = trials)
## 2.5% 97.5%
## 1.066 2.773
sd(trials) # Standard error
## M
## 0.4298
densityplot(~M, data = trials)
So there seems to be a difference between the men's and women's average wages.
Ways in which men and women might differ: sector of the economy, age, education.
How could we look at the issue of differences between the sexes in wage, taking into account these other possibilities?
Sector of the economy is easy:
summary(mm(wage ~ sex + sector, data = cps))
## Groupwise Model
## Call: wage ~ sex + sector
##
## group 2.5 % 97.5 %
## 1 F.clerical 6.372 8.437
## 2 M.clerical 5.525 9.453
## 3 F.const NaN NaN
## 4 M.const 7.489 11.515
## 5 F.manag 9.092 13.020
## 6 M.manag 12.178 15.265
## 7 F.manuf 3.876 7.551
## 8 M.manuf 7.946 10.660
## 9 F.other 2.127 9.476
## 10 M.other 7.619 9.905
## 11 F.prof 9.857 12.353
## 12 M.prof 11.538 14.010
## 13 F.sales 3.059 7.425
## 14 M.sales 7.532 11.460
## 15 F.service 4.774 7.345
## 16 M.service 5.683 8.770
##
## sigma: 4.58
## r-squared: 0.226
## Adj. r-squared: 0.205
Compare the corresponding sectors for the two sexes. The confidence intervals often overlap.
How to add in additional variables, e.g. age and education?
One intuitive idea is to break up the quantitative variables into categorical variables. For instance, age:
mod3 = mm(wage ~ sex + sector + cut(age, breaks = c(0, 20, 40, 60, 80)), data = cps)
Look at confint(mod3) and tell me about the overlap between the confidence intervals. The intervals are huge. The problem is that there isn't much data in any one group:
tally(~sex + sector + cut(age, breaks = c(0, 20, 40, 60, 80)), data = cps)
## , , cut(age, breaks = c(0, 20, 40, 60, 80)) = (0,20]
##
## sector
## sex clerical const manag manuf other prof sales service Total
## F 5 0 1 0 0 1 0 3 10
## M 1 0 0 3 6 1 1 6 18
## Total 6 0 1 3 6 2 1 9 28
##
## , , cut(age, breaks = c(0, 20, 40, 60, 80)) = (20,40]
##
## sector
## sex clerical const manag manuf other prof sales service Total
## F 42 0 12 15 5 35 10 22 141
## M 17 10 20 27 42 38 12 18 184
## Total 59 10 32 42 47 73 22 40 325
##
## , , cut(age, breaks = c(0, 20, 40, 60, 80)) = (40,60]
##
## sector
## sex clerical const manag manuf other prof sales service Total
## F 24 0 8 9 0 14 6 19 80
## M 3 10 13 13 14 14 5 6 78
## Total 27 10 21 22 14 28 11 25 158
##
## , , cut(age, breaks = c(0, 20, 40, 60, 80)) = (60,80]
##
## sector
## sex clerical const manag manuf other prof sales service Total
## F 5 0 0 0 1 2 1 5 14
## M 0 0 1 1 0 0 3 4 9
## Total 5 0 1 1 1 2 4 9 23
##
## , , cut(age, breaks = c(0, 20, 40, 60, 80)) = Total
##
## sector
## sex clerical const manag manuf other prof sales service Total
## F 76 0 21 24 6 52 17 49 245
## M 21 20 34 44 62 53 21 34 289
## Total 97 20 55 68 68 105 38 83 534
We can do better. The key to doing better is to recognize that some pairs of groups are “nearby” and that nearby groups should have related results. This is what lm() will do for us. Time to take off the training wheels.
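As a peek ahead, here is a sketch (not the worked example from class) of the same question with lm(), where age can enter as a quantitative variable rather than coarse categories:
mod4 = lm(wage ~ sex + sector + age, data = cps)
confint(mod4)  # one interval per coefficient, e.g. sexM for the sex difference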