You should understand that …
Toss around the globe.
Sampling Distributions and Re-sampling Distributions
Construct an interval using do() and resample().
library(mosaic)  # provides do(), resample(), and the KidsFeet data
s = do(100) * lm(width ~ sex + length, data = resample(KidsFeet))
sd(s) # standard error
## Intercept sexG length sigma r.squared
## 1.02534 0.12989 0.04023 0.03397 0.10202
densityplot(~length, data = s)
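One simple way to turn the re-sampling distribution into an interval is the percentile method. A sketch, using the column names shown in the output above:
quantile(s$length, c(0.025, 0.975))  # 95% percentile interval for the length coefficient
quantile(s$sexG, c(0.025, 0.975))    # and for the sex coefficient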
Using confint()
Using summary(). Explain the standard error and how to get confidence intervals from it (see the sketch after the first confint() example below).
The setting for this problem is the question of whether girls' feet are narrower than boys'. Confidence intervals give us a quick answer (based on the available data):
mod = lm(width ~ sex, data = KidsFeet)
confint(mod)
## 2.5 % 97.5 %
## (Intercept) 8.9759 9.40411
## sexG -0.7125 -0.09903
The confidence interval on the difference between boys' and girls' foot widths lies entirely below zero. So even this small sample provides evidence that girls' feet are narrower than boys'.
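The same intervals can be built from the output of summary(): each estimate plus or minus a t multiplier times its standard error. A minimal sketch, reusing mod from above:
summary(mod)$coefficients  # estimates, standard errors, t statistics, p-values
est = coef(summary(mod))[, "Estimate"]
se = coef(summary(mod))[, "Std. Error"]
est + outer(se, qt(c(0.025, 0.975), df = df.residual(mod)))  # matches confint(mod)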
The real question, however, is whether girls' feet are narrower at any given shoe size (determined by length):
mod2 = lm(width ~ sex + length, data = KidsFeet)
confint(mod2)
## 2.5 % 97.5 %
## (Intercept) 1.1048 6.17752
## sexG -0.4948 0.02974
## length 0.1202 0.32182
Not so much.
If we fit a more complicated model, the confidence intervals on the individual coefficients become harder to interpret:
mod3 = lm(width ~ sex * length, data = KidsFeet)
confint(mod3)
## 2.5 % 97.5 %
## (Intercept) 0.09816 7.6060
## sexG -5.70118 4.4534
## length 0.06326 0.3620
## sexG:length -0.18915 0.2208
Notice how the confidence interval on sexG has gotten much wider. This can be confusing, since sexG also enters into the interaction term. Comparing model values is a better approach here:
f3 = makeFun(mod3)
f3(length = 25, sex = "G")
## 1
## 8.939
f3(length = 25, sex = c("G", "B"), interval = "confidence")
## fit lwr upr
## 1 8.939 8.734 9.145
## 2 9.168 8.990 9.346
There is considerable overlap between the two intervals.
Another sort of question to ask: for a specific girl and a specific boy with the same foot length, how different are their foot widths likely to be:
f3(length = 25, sex = c("G", "B"), interval = "prediction")
## fit lwr upr
## 1 8.939 8.121 9.758
## 2 9.168 8.356 9.980
The typical boy's value is very reasonable for a girl and vice versa.
As the amount of data becomes very large, the CI on the model value becomes very narrow. But the CI on the prediction always reflects the residuals.
SIMULATION: 10,000 kids' feet.
mod4 = lm(width ~ sex * length, data = resample(KidsFeet, size = 10000))
f3 = makeFun(mod4)  # f3 now refers to the large-sample model
f3(length = 25, sex = c("G", "B"), interval = "confidence")
## fit lwr upr
## 1 8.934 8.923 8.946
## 2 9.170 9.160 9.180
f3(length = 25, sex = c("G", "B"), interval = "prediction")
## fit lwr upr
## 1 8.934 8.213 9.656
## 2 9.170 8.449 9.892
Increased spending is associated with lower SAT scores
confint(lm(sat ~ expend, data = SAT))
## 2.5 % 97.5 %
## (Intercept) 1000.04 1178.546
## expend -35.63 -6.158
Until you adjust for who took the test …
confint(lm(sat ~ expend + frac, data = SAT))
## 2.5 % 97.5 %
## (Intercept) 949.909 1037.754
## expend 3.788 20.785
## frac -3.284 -2.418
Do the same for salary and ratio.
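A sketch of the parallel analysis for teacher salary and student/teacher ratio (output not shown):
confint(lm(sat ~ salary, data = SAT))         # salary alone
confint(lm(sat ~ salary + frac, data = SAT))  # ... adjusting for who takes the test
confint(lm(sat ~ ratio, data = SAT))          # student/teacher ratio alone
confint(lm(sat ~ ratio + frac, data = SAT))   # ... adjusting for who takes the test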
There are all sorts of things that affect our use of natural gas for heating.
One therm (roughly 1 ccf of natural gas) is about 29 kWh, so 1 kWh is about 1/29 ≈ 0.03 ccf. See the Wikipedia entry.
u = fetchData("utilities.csv")
## Retrieving from http://www.mosaic-web.org/go/datasets/utilities.csv
winter = subset(u, temp < 50)
# There's an outlier
winter = subset(winter, ccf > 50)
confint(lm(ccf ~ temp + kwh, data = winter))
## 2.5 % 97.5 %
## (Intercept) 270.43774 331.638090
## temp -4.89808 -3.838684
## kwh -0.05873 0.009314
u = fetchData("/Users/kaplan/kaplanfiles/stats-book/DataSets/utilities-up-to-date.csv")
## Complete file name given. No searching necessary.
winter = subset(u, temp < 50)
# There's an outlier
winter = subset(winter, ccf > 50)
confint(lm(ccf ~ temp + kwh, data = winter))
## 2.5 % 97.5 %
## (Intercept) 270.61172 321.081688
## temp -4.78583 -3.974095
## kwh -0.04321 0.005774
What does the standard error depend on?
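Before turning to the formula-based logic, the bootstrap gives an empirical answer: redo the re-sampling at different sample sizes and watch how the standard error changes. A sketch (the sample sizes are arbitrary choices):
for (n in c(20, 40, 80, 160)) {
  trials = do(100) * lm(width ~ sex + length, data = resample(KidsFeet, size = n))
  print(c(n = n, se.length = sd(trials$length)))
}
Roughly, doubling the sample size cuts the standard error by a factor of about sqrt(2).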
Logic behind non-resampling estimates of the standard error
Key idea: if we collected another sample, the deterministic component would be the same, but the random component would be utterly different — it would point in another direction.
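One way to act out this idea in code is residual re-sampling: keep the fitted values (the deterministic component) and attach a freshly re-sampled copy of the residuals (the random component) before refitting. A sketch, reusing mod2 from the KidsFeet example above:
refits = do(100) * lm(fitted(mod2) + resample(resid(mod2)) ~ sex + length, data = KidsFeet)
sd(refits$length)  # compare with the standard error reported by summary(mod2)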