Stats 155 Class Notes 2012-10-15

Plans for the Week

  1. A bit more about total vs partial change (Monday)
  2. Confidence intervals on model coefficients
  3. Mid-term Exam

Next Monday will be about collinearity and confidence intervals

Next Wednesday about probability models.

Then, after Fall break, we'll move into hypothesis testing.

Where We Are

You should understand that …

Where We're Heading

Classroom Notes

Holding covariates constant

From Friday's notes:

  1. GPA example
  2. Aging and running example

Confidence Intervals

Review:

Sampling Distributions and Re-sampling Distributions

Construct an interval using do() and resample().

s = do(100) * lm(width ~ sex + length, data = resample(KidsFeet))
sd(s)  # standard error
## Intercept      sexG    length     sigma r.squared 
##   0.79180   0.12599   0.03000   0.03354   0.09552 
densityplot(~length, data = s)

plot of chunk unnamed-chunk-2

Using confint()

Using summary(). Explain the standard error and how to get confidence intervals from that.

Kids' foot width

The setting for this problem is the question of whether girls' feet are narrower than boys'. Confidence intervals give us a quick answer (based on the available data):

mod = lm(width ~ sex, data = KidsFeet)
confint(mod)
##               2.5 %   97.5 %
## (Intercept)  8.9759  9.40411
## sexG        -0.7125 -0.09903

The confidence interval on the difference between boys' and girls' foot widths is entirely in the negative. So even this small sample provides evidence that girls' feet are narrower than boys'.

The real question, however, is whether, for any given shoe size (determined by length) the girls' feet are narrower:

mod2 = lm(width ~ sex + length, data = KidsFeet)
confint(mod2)
##               2.5 %  97.5 %
## (Intercept)  1.1048 6.17752
## sexG        -0.4948 0.02974
## length       0.1202 0.32182

Not so much.

If we make a complicated model, it's harder to interpret the confidence intervals on the coefficients:

mod3 = lm(width ~ sex * length, data = KidsFeet)
confint(mod3)
##                2.5 % 97.5 %
## (Intercept)  0.09816 7.6060
## sexG        -5.70118 4.4534
## length       0.06326 0.3620
## sexG:length -0.18915 0.2208

Notice how the confidence interval on sexG has gotten much wider. This can be confusing, since sexG also enters in to the interaction term. Model values are good to compare here:

f3 = makeFun(mod3)
f3(length = 25, sex = "G")
##     1 
## 8.939 
f3(length = 25, sex = c("G", "B"), interval = "confidence")
##     fit   lwr   upr
## 1 8.939 8.734 9.145
## 2 9.168 8.990 9.346

There is considerable overlap between the two intervals.

Another sort of question to ask is, if we have a specific girl and a specific boy of the same foot length, how likely are their foot widths to be different:

f3(length = 25, sex = c("G", "B"), interval = "prediction")
##     fit   lwr   upr
## 1 8.939 8.121 9.758
## 2 9.168 8.356 9.980

The typical boy's value is very reasonable for a girl and vice versa.

Differences between confidence interval on the model value and on the prediction

As the amount of data becomes very large, the CI on the model value becomes very narrow. But the CI on the prediction always reflects the residuals.

SIMULATION: 10000 kids feet.

mod4 = lm(width ~ sex * length, data = resample(KidsFeet, size = 10000))
f3 = makeFun(mod4)
f3(length = 25, sex = c("G", "B"), interval = "confidence")
##     fit   lwr  upr
## 1 8.928 8.917 8.94
## 2 9.170 9.160 9.18
f3(length = 25, sex = c("G", "B"), interval = "prediction")
##     fit   lwr   upr
## 1 8.928 8.197 9.660
## 2 9.170 8.438 9.902

SAT and school spending

Increased spending is associated with lower SAT scores

confint(lm(sat ~ expend, data = SAT))
##               2.5 %   97.5 %
## (Intercept) 1000.04 1178.546
## expend       -35.63   -6.158

Until you adjust for who took the test …

confint(lm(sat ~ expend + frac, data = SAT))
##               2.5 %   97.5 %
## (Intercept) 949.909 1037.754
## expend        3.788   20.785
## frac         -3.284   -2.418

Do the same for salary and ratio.

Can we see electricity offset gas heating in the utilities data?

Some Choices in content

Geometry of Confidence Intervals

What does the standard error depend on?

Logic behind non-resampling estimates of the standard error

Key idea: if we collected another sample, the deterministic component would be the same, but the random component would be utterly different — it would point in another direction.

Why does it depend on the number of cases: orthogonality of random vectors in high dimensions.

Groupwise means on the GPA.

Construct the data:

grades = fetchData("grades.csv")
g2pt = fetchData("grade-to-number.csv")
courses = fetchData("courses.csv")
grades = merge(grades, g2pt)
options(na.rm = TRUE)
mod = mm(gradepoint ~ sid, data = grades)
ci = confint(mod)
head(ci)
##    group 2.5 % 97.5 %
## 1 S31185 2.058  2.767
## 2 S31188 2.781  3.255
## 3 S31191 2.953  3.471
## 4 S31194 3.069  3.649
## 5 S31197 3.078  3.635
## 6 S31200 1.918  2.455

A complicated graphic. Each student is one line. The horizontal extent shows the confidence interval for that student's GPA (based on just half the courses, so the real CI after 4 years would be shorter by \( \sqrt{2} \)).

ci2 = ci[order(coef(mod)), ]  # sort from lowest to highest.
plot(1:2, ylim = c(0, 455), xlim = c(1.5, 4.5), type = "n", xlab = "GPA", ylab = "Class Rank")
for (k in 1:nrow(ci2)) {
    lines(ci2[k, c(2, 3)], c(k, k), col = rgb(0, 0, 0, 1 - 0.8 * (k%%10 != 0)))
}

plot of chunk unnamed-chunk-13

All the students whose confidence intervals intersect a given vertical line are equivalent. But their class ranks can be very different.

Why doesn't the registrar report the confidence interval?

Bad reasons

Aside: Departmentwise GPAs

Add department and course information to data:

courses = fetchData("courses.csv")
## Retrieving from http://www.mosaic-web.org/go/datasets/courses.csv
grades = merge(grades, courses)
modd = mm(gradepoint ~ dept, data = grades)
cid = confint(modd)
head(cid)
##   group 2.5 % 97.5 %
## 1     A 2.694  4.296
## 2     b 3.104  3.381
## 3     B 3.017  3.453
## 4     C 3.450  3.598
## 5     d 3.263  3.365
## 6     D 3.488  3.630
ci2d = cid[order(coef(modd)), ]  # sort from lowest to highest.
plot(1:2, ylim = c(0, 40), xlim = c(1.5, 4.5), type = "n", xlab = "GPA", ylab = "Department Order")
for (k in 1:nrow(ci2d)) {
    lines(ci2d[k, c(2, 3)], c(k, k))
}

plot of chunk unnamed-chunk-15

GPAs adjusting for department, level, and enrollment

mod3 = lm(gradepoint ~ sid - 1 + dept + enroll + level, data = grades)

Student-by-student confidence intervals:

cid = confint(mod3)[1:443, ]
ci2d = cid[order(coef(mod3)[1:443]), ]  # sort from lowest to highest.
plot(1:2, ylim = c(0, 450), xlim = c(1.5, 4.5), type = "n", xlab = "Adjusted GPA", 
    ylab = "Class Rank")
for (k in 1:nrow(ci2d)) {
    lines(ci2d[k, ], c(k, k))
}

plot of chunk unnamed-chunk-17

The inclusion of the covariates has increased the width of the confidence intervals.

Department-by-department confidence intervals:

mod4 = lm(gradepoint ~ dept - 1 + sid + enroll + level, data = grades)
cid = confint(mod3)[1:40, ]
ci2d = cid[order(coef(mod3)[1:40]), ]  # sort from lowest to highest.
plot(1:2, ylim = c(0, 45), xlim = c(1.5, 4.5), type = "n", xlab = "Adjusted GPA", 
    ylab = "Department")
for (k in 1:nrow(ci2d)) {
    lines(ci2d[k, ], c(k, k))
}

plot of chunk unnamed-chunk-18