March 8, 2013 Class Notes

Where We Are

You should understand that …

Where We're Heading

Classroom Notes

What fraction of the Earth's surface is covered by water?

Toss around the globe.

Review:

Sampling Distributions and Re-sampling Distributions

Construct an interval using do() and resample().

s = do(100) * lm(width ~ sex + length, data = resample(KidsFeet))
sd(s)  # standard error
## Intercept      sexG    length     sigma r.squared 
##   1.02534   0.12989   0.04023   0.03397   0.10202
densityplot(~length, data = s)

[Figure: density plot of the resampled length coefficient, from densityplot(~length, data = s)]
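The same resampling idea can be sketched without the mosaic helpers. This is a minimal base-R illustration, not the code used in class: the data frame toy is simulated stand-in data (not KidsFeet), and one_trial() is a hypothetical helper name. Each trial draws rows with replacement and refits the model. The standard deviation of the refitted coefficients is the resampling standard error, and the estimate plus or minus 2 standard errors gives an approximate 95% interval.

```r
set.seed(1)

# Simulated stand-in data (an assumption, NOT the KidsFeet measurements)
n <- 40
toy <- data.frame(length = runif(n, 22, 27))
toy$width <- 3 + 0.24 * toy$length + rnorm(n, sd = 0.4)

# One resampling trial: draw rows with replacement, refit, return coefficients
one_trial <- function(d) {
  rows <- sample(nrow(d), replace = TRUE)
  coef(lm(width ~ length, data = d[rows, ]))
}

# 500 trials, one row per trial and one column per coefficient
trials <- t(replicate(500, one_trial(toy)))

# Resampling standard error of each coefficient
se <- apply(trials, 2, sd)

# Approximate 95% interval: estimate plus or minus 2 standard errors
est <- coef(lm(width ~ length, data = toy))
ci <- cbind(lower = est - 2 * se, upper = est + 2 * se)
print(ci)
```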

Using confint()

Using summary(). Explain the standard error and how to get confidence intervals from that.
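One way to see where the interval comes from: summary() reports an estimate and a standard error for each coefficient, and the 95% interval is the estimate plus or minus a t multiplier (close to 2) times the standard error. The sketch below uses simulated data (an assumption, chosen only so that it runs on its own) and checks that this hand calculation matches confint().

```r
set.seed(2)

# Simulated data, for illustration only
d <- data.frame(x = rnorm(30))
d$y <- 1 + 2 * d$x + rnorm(30)
fit <- lm(y ~ x, data = d)

# Coefficient table from summary(): Estimate, Std. Error, t value, Pr(>|t|)
tab <- coef(summary(fit))

# t multiplier for a 95% interval with the model's residual degrees of freedom
tstar <- qt(0.975, df = df.residual(fit))

# Estimate plus or minus multiplier times standard error
manual <- cbind(tab[, "Estimate"] - tstar * tab[, "Std. Error"],
                tab[, "Estimate"] + tstar * tab[, "Std. Error"])

# Compare with the interval that confint() reports
same <- isTRUE(all.equal(unname(manual), unname(confint(fit))))
print(same)
```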

Kids' foot width

The setting for this problem is the question of whether girls' feet are narrower than boys'. Confidence intervals give us a quick answer (based on the available data):

mod = lm(width ~ sex, data = KidsFeet)
confint(mod)
##               2.5 %   97.5 %
## (Intercept)  8.9759  9.40411
## sexG        -0.7125 -0.09903

The confidence interval on the difference between boys' and girls' foot widths lies entirely below zero. So even this small sample provides evidence that girls' feet are narrower than boys'.

The real question, however, is whether, for any given shoe size (determined by length), girls' feet are narrower:

mod2 = lm(width ~ sex + length, data = KidsFeet)
confint(mod2)
##               2.5 %  97.5 %
## (Intercept)  1.1048 6.17752
## sexG        -0.4948 0.02974
## length       0.1202 0.32182

Not so much: once length is in the model, the interval on sexG (-0.49 to 0.03) includes zero.

If we make a complicated model, it's harder to interpret the confidence intervals on the coefficients:

mod3 = lm(width ~ sex * length, data = KidsFeet)
confint(mod3)
##                2.5 % 97.5 %
## (Intercept)  0.09816 7.6060
## sexG        -5.70118 4.4534
## length       0.06326 0.3620
## sexG:length -0.18915 0.2208

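Notice how the confidence interval on sexG has gotten much wider. This can be confusing, since sexG also enters into the interaction term: with the interaction in the model, the sexG coefficient describes the girl-boy difference at length = 0, far outside the range of the data. It is more useful to compare model values here: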

f3 = makeFun(mod3)
f3(length = 25, sex = "G")
##     1 
## 8.939
f3(length = 25, sex = c("G", "B"), interval = "confidence")
##     fit   lwr   upr
## 1 8.939 8.734 9.145
## 2 9.168 8.990 9.346

There is considerable overlap between the two intervals.

Another sort of question to ask: given a specific girl and a specific boy with the same foot length, how different are their foot widths likely to be? Prediction intervals speak to this:

f3(length = 25, sex = c("G", "B"), interval = "prediction")
##     fit   lwr   upr
## 1 8.939 8.121 9.758
## 2 9.168 8.356 9.980

The typical boy's value (9.17) falls well inside the girls' prediction interval, and the typical girl's value (8.94) falls well inside the boys'.

Differences between confidence interval on the model value and on the prediction

As the amount of data becomes very large, the CI on the model value becomes very narrow. But the interval on the prediction stays wide, because it always reflects the spread of the residuals.

SIMULATION: 10000 kids' feet, resampled from KidsFeet.

mod4 = lm(width ~ sex * length, data = resample(KidsFeet, size = 10000))
f3 = makeFun(mod4)  # f3 is redefined here to use the large-sample model mod4
f3(length = 25, sex = c("G", "B"), interval = "confidence")
##     fit   lwr   upr
## 1 8.934 8.923 8.946
## 2 9.170 9.160 9.180
f3(length = 25, sex = c("G", "B"), interval = "prediction")
##     fit   lwr   upr
## 1 8.934 8.213 9.656
## 2 9.170 8.449 9.892
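The contrast can also be reproduced in base R with predict(). The sketch below is an illustration under stated assumptions: the data are simulated (not KidsFeet), the model has a single explanatory variable, and interval_widths() is a hypothetical helper. It compares the widths of the two kinds of interval at length = 25 for a small and a large sample.

```r
set.seed(3)

# Fit a straight-line model to n simulated feet and return the widths of the
# 95% confidence and prediction intervals at length = 25.
interval_widths <- function(n) {
  d <- data.frame(length = runif(n, 22, 27))
  d$width <- 3 + 0.24 * d$length + rnorm(n, sd = 0.4)
  fit <- lm(width ~ length, data = d)
  new <- data.frame(length = 25)
  conf_int <- predict(fit, new, interval = "confidence")
  pred_int <- predict(fit, new, interval = "prediction")
  c(n = n,
    conf = as.numeric(conf_int[1, "upr"] - conf_int[1, "lwr"]),
    pred = as.numeric(pred_int[1, "upr"] - pred_int[1, "lwr"]))
}

# One row per sample size: n, confidence-interval width, prediction-interval width
widths <- rbind(interval_widths(50), interval_widths(5000))
print(round(widths, 3))
```

The conf column should shrink roughly in proportion to 1/sqrt(n), while the pred column stays close to four times the residual standard deviation used in the simulation.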

SAT and school spending

Increased spending is associated with lower SAT scores:

confint(lm(sat ~ expend, data = SAT))
##               2.5 %   97.5 %
## (Intercept) 1000.04 1178.546
## expend       -35.63   -6.158

Until you adjust for who took the test …

confint(lm(sat ~ expend + frac, data = SAT))
##               2.5 %   97.5 %
## (Intercept) 949.909 1037.754
## expend        3.788   20.785
## frac         -3.284   -2.418
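A sign flip like this is what a confounded covariate produces. The following sketch uses simulated data, not the SAT data set: the variable names are borrowed for readability, and the numbers in the simulation are assumptions chosen to make the effect visible. Spending has a positive effect on the score, but it tends to be higher where frac is higher, and a higher frac lowers the score.

```r
set.seed(4)
n <- 500

# Simulated stand-ins (assumed values, not the real SAT data)
frac <- runif(n, 5, 80)                               # share taking the test
expend <- 4 + 0.03 * frac + rnorm(n, sd = 0.8)        # spending rises with frac
sat <- 1000 + 12 * expend - 2.8 * frac + rnorm(n, sd = 30)
sim <- data.frame(sat, expend, frac)

# Coefficient on expend without and with adjustment for frac
unadjusted <- coef(lm(sat ~ expend, data = sim))[["expend"]]
adjusted <- coef(lm(sat ~ expend + frac, data = sim))[["expend"]]
print(c(unadjusted = unadjusted, adjusted = adjusted))
```

Leaving frac out lets the expend coefficient absorb part of the effect of frac, which is why the unadjusted coefficient comes out negative even though the simulated effect of spending is positive.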

Do the same for salary and ratio.

Can we see electricity offset gas heating in the utilities data?

There are all sorts of things that affect our use of natural gas for heating:

One therm (roughly 1 ccf) is about 29 kWh. So 1 kWh is about 1/29 ≈ 0.034 ccf. See the Wikipedia entry on the therm.

u = fetchData("utilities.csv")
## Retrieving from http://www.mosaic-web.org/go/datasets/utilities.csv
winter = subset(u, temp < 50)
# There's an outlier with very low gas use; keep only months with ccf > 50
winter = subset(winter, ccf > 50)
confint(lm(ccf ~ temp + kwh, data = winter))
##                 2.5 %     97.5 %
## (Intercept) 270.43774 331.638090
## temp         -4.89808  -3.838684
## kwh          -0.05873   0.009314
u = fetchData("/Users/kaplan/kaplanfiles/stats-book/DataSets/utilities-up-to-date.csv")
## Complete file name given.  No searching necessary.
winter = subset(u, temp < 50)
# There's an outlier with very low gas use; keep only months with ccf > 50
winter = subset(winter, ccf > 50)
confint(lm(ccf ~ temp + kwh, data = winter))
##                 2.5 %     97.5 %
## (Intercept) 270.61172 321.081688
## temp         -4.78583  -3.974095
## kwh          -0.04321   0.005774
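A rough check on what offsetting would look like in these intervals. The sketch assumes that each kWh of electricity displaces its heat equivalent in gas, which is an idealization and not something the notes establish, and it treats 1 therm as 1 ccf as above. The interval endpoints are copied from the two confint() outputs. The helper contains() is a hypothetical name.

```r
# Unit conversion: 1 therm is about 29.3 kWh, and 1 therm is treated as 1 ccf
kwh_per_therm <- 29.3
ccf_per_kwh <- 1 / kwh_per_therm

# Coefficient on kwh expected under perfect offsetting (an assumption)
offset_coef <- -ccf_per_kwh

# Intervals on the kwh coefficient, copied from the outputs above
ci_first <- c(lower = -0.05873, upper = 0.009314)
ci_second <- c(lower = -0.04321, upper = 0.005774)

# TRUE when value lies strictly inside the interval
contains <- function(ci, value) ci[1] < value && value < ci[2]

print(round(offset_coef, 4))
print(c(first = contains(ci_first, offset_coef),
        second = contains(ci_second, offset_coef)))
```

Both intervals contain this value as well as zero, so on these data neither full offsetting nor no offsetting can be ruled out.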

Some Choices in context

Geometry of Confidence Intervals

What does the standard error depend on?

Logic behind non-resampling estimates of the standard error

Key idea: if we collected another sample, the deterministic component would be the same, but the random component would be utterly different — it would point in another direction.
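The key idea can be illustrated with a small simulation. This is a sketch under assumptions: the deterministic component, the noise level, and the helper name slope_once() are all made up for illustration. The deterministic part is held fixed while the random part is redrawn for each new sample. The spread of the fitted slopes across samples is the standard error, and it can be compared with the usual formula for a straight-line model, sigma / sqrt(sum((x - mean(x))^2)).

```r
set.seed(5)

# Fixed explanatory values and fixed deterministic component (assumed values)
x <- runif(40, 22, 27)
signal <- 3 + 0.24 * x
sigma <- 0.4

# A new sample: same deterministic component, fresh random component
slope_once <- function() {
  y <- signal + rnorm(length(x), sd = sigma)
  coef(lm(y ~ x))[["x"]]
}

# Spread of the fitted slope across many hypothetical samples
slopes <- replicate(2000, slope_once())
simulated_se <- sd(slopes)

# Formula-based standard error of the slope for comparison
formula_se <- sigma / sqrt(sum((x - mean(x))^2))
print(c(simulated = simulated_se, formula = formula_se))
```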

Geometry of Confidence Intervals

Shiny App