Next Monday will be about collinearity and confidence intervals.
Next Wednesday will be about probability models.
Then, after Fall break, we'll move into hypothesis testing.
You should understand that …
From Friday's notes:
Sampling Distributions and Re-sampling Distributions
Construct an interval using do() and resample().
library(mosaic)  # for do() and resample(); loading mosaic also attaches lattice for densityplot()
s = do(100) * lm(width ~ sex + length, data = resample(KidsFeet))
sd(s)  # standard errors: the spread of each coefficient across the 100 resampling trials
## Intercept sexG length sigma r.squared
## 0.79180 0.12599 0.03000 0.03354 0.09552
densityplot(~length, data = s)
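The resampling distribution in s can also be turned into an interval directly. As a small added sketch (not part of the original notes), base R's quantile() gives a percentile-style 95% interval on the length coefficient:
quantile(s$length, c(0.025, 0.975))  # middle 95% of the 100 resampled length coefficients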
Using confint()
Using summary(). Explain the standard error and how to get confidence intervals from it.
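As a sketch of that calculation (added here, not in the original notes; the model name mod.se is just for illustration), each coefficient plus or minus roughly two standard errors reproduces confint() closely:
mod.se = lm(width ~ sex + length, data = KidsFeet)
coefs = summary(mod.se)$coefficients  # columns: Estimate, Std. Error, t value, Pr(>|t|)
coefs[, "Estimate"] - 2 * coefs[, "Std. Error"]  # approximate lower limits
coefs[, "Estimate"] + 2 * coefs[, "Std. Error"]  # approximate upper limits
confint() uses the exact t quantile instead of 2, so its limits will differ slightly.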
The setting for this problem is the question of whether girls' feet are narrower than boys'. Confidence intervals give us a quick answer (based on the available data):
mod = lm(width ~ sex, data = KidsFeet)
confint(mod)
## 2.5 % 97.5 %
## (Intercept) 8.9759 9.40411
## sexG -0.7125 -0.09903
The confidence interval on sexG, the difference between girls' and boys' foot widths, lies entirely below zero. So even this small sample provides evidence that girls' feet are narrower than boys'.
The real question, however, is whether, for any given shoe size (determined by length), girls' feet are narrower:
mod2 = lm(width ~ sex + length, data = KidsFeet)
confint(mod2)
## 2.5 % 97.5 %
## (Intercept) 1.1048 6.17752
## sexG -0.4948 0.02974
## length 0.1202 0.32182
Not so much: once length is taken into account, the confidence interval on sexG includes zero.
With a more complicated model, the confidence intervals on the individual coefficients become harder to interpret:
mod3 = lm(width ~ sex * length, data = KidsFeet)
confint(mod3)
## 2.5 % 97.5 %
## (Intercept) 0.09816 7.6060
## sexG -5.70118 4.4534
## length 0.06326 0.3620
## sexG:length -0.18915 0.2208
Notice how the confidence interval on sexG has gotten much wider. This can be confusing, since sexG also enters into the interaction term. Comparing model values is more informative here:
f3 = makeFun(mod3)
f3(length = 25, sex = "G")
## 1
## 8.939
f3(length = 25, sex = c("G", "B"), interval = "confidence")
## fit lwr upr
## 1 8.939 8.734 9.145
## 2 9.168 8.990 9.346
There is considerable overlap between the two intervals.
Another sort of question to ask: for a specific girl and a specific boy with the same foot length, how different are their individual foot widths likely to be? That calls for a prediction interval:
f3(length = 25, sex = c("G", "B"), interval = "prediction")
## fit lwr upr
## 1 8.939 8.121 9.758
## 2 9.168 8.356 9.980
The typical boy's value is very reasonable for a girl and vice versa.
As the amount of data becomes very large, the CI on the model value becomes very narrow. But the CI on the prediction always reflects the residuals.
SIMULATION: 10,000 kids' feet, generated by resampling the original data.
mod4 = lm(width ~ sex * length, data = resample(KidsFeet, size = 10000))
f3 = makeFun(mod4)  # overwrite f3 with the function fitted to the large resample
f3(length = 25, sex = c("G", "B"), interval = "confidence")
## fit lwr upr
## 1 8.928 8.917 8.94
## 2 9.170 9.160 9.18
f3(length = 25, sex = c("G", "B"), interval = "prediction")
## fit lwr upr
## 1 8.928 8.197 9.660
## 2 9.170 8.438 9.902
Increased spending is associated with lower SAT scores
confint(lm(sat ~ expend, data = SAT))
## 2.5 % 97.5 %
## (Intercept) 1000.04 1178.546
## expend -35.63 -6.158
Until you adjust for who took the test …
confint(lm(sat ~ expend + frac, data = SAT))
## 2.5 % 97.5 %
## (Intercept) 949.909 1037.754
## expend 3.788 20.785
## frac -3.284 -2.418
Do the same for salary and ratio.
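As a sketch of what that would look like (commands only, output not reproduced here; this assumes the SAT data frame carries salary and ratio columns, as in the mosaic package's version of these data):
confint(lm(sat ~ salary, data = SAT))  # teacher salary alone
confint(lm(sat ~ salary + frac, data = SAT))  # ... adjusting for the fraction taking the test
confint(lm(sat ~ ratio, data = SAT))  # pupil-teacher ratio alone
confint(lm(sat ~ ratio + frac, data = SAT))  # ... adjusting for the fraction taking the test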
What does the standard error depend on?
Logic behind non-resampling estimates of the standard error
Key idea: if we collected another sample, the deterministic component would be the same, but the random component would be utterly different — it would point in another direction.
Why does it depend on the number of cases? Because random vectors in high dimensions are nearly orthogonal.
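A quick simulation (an added illustration, not part of the original notes) makes the point: the cosine of the angle between two random vectors shrinks toward zero, roughly like \( 1/\sqrt{n} \), as the number of cases n grows.
cos.angle = function(n) {
    a = rnorm(n); b = rnorm(n)
    sum(a * b) / sqrt(sum(a^2) * sum(b^2))  # cosine of the angle between a and b
}
sapply(c(5, 50, 500, 5000), function(n) sd(replicate(1000, cos.angle(n))))  # typical size of the cosine
The shrinking alignment between the random component and the model vectors is why the standard error falls as the sample size grows.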
Construct the data:
grades = fetchData("grades.csv")
g2pt = fetchData("grade-to-number.csv")
courses = fetchData("courses.csv")
grades = merge(grades, g2pt)
options(na.rm = TRUE)
mod = mm(gradepoint ~ sid, data = grades)  # mm() fits groupwise means: one mean gradepoint per student (sid)
ci = confint(mod)
head(ci)
## group 2.5 % 97.5 %
## 1 S31185 2.058 2.767
## 2 S31188 2.781 3.255
## 3 S31191 2.953 3.471
## 4 S31194 3.069 3.649
## 5 S31197 3.078 3.635
## 6 S31200 1.918 2.455
A complicated graphic. Each student is one line. The horizontal extent shows the confidence interval for that student's GPA (based on just half of the courses, so the real CI after four years would be narrower by a factor of \( \sqrt{2} \)).
ci2 = ci[order(coef(mod)), ] # sort from lowest to highest.
plot(1:2, ylim = c(0, 455), xlim = c(1.5, 4.5), type = "n", xlab = "GPA", ylab = "Class Rank")
for (k in 1:nrow(ci2)) {
# draw every 10th student's interval solid; the rest faintly
lines(ci2[k, c(2, 3)], c(k, k), col = rgb(0, 0, 0, 1 - 0.8 * (k%%10 != 0)))
}
All the students whose confidence intervals intersect a given vertical line are statistically indistinguishable from one another. But their class ranks can be very different.
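To make that concrete, a small added sketch (not in the original notes) counts how many students' intervals cross the vertical line at GPA = 3.0:
sum(ci2[, 2] <= 3.0 & ci2[, 3] >= 3.0)  # students indistinguishable from a 3.0 GPA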
Why doesn't the registrar report the confidence interval?
Bad reasons
Add department and course information to the data:
courses = fetchData("courses.csv")
## Retrieving from http://www.mosaic-web.org/go/datasets/courses.csv
grades = merge(grades, courses)
modd = mm(gradepoint ~ dept, data = grades)
cid = confint(modd)
head(cid)
## group 2.5 % 97.5 %
## 1 A 2.694 4.296
## 2 b 3.104 3.381
## 3 B 3.017 3.453
## 4 C 3.450 3.598
## 5 d 3.263 3.365
## 6 D 3.488 3.630
ci2d = cid[order(coef(modd)), ] # sort from lowest to highest.
plot(1:2, ylim = c(0, 40), xlim = c(1.5, 4.5), type = "n", xlab = "GPA", ylab = "Department Order")
for (k in 1:nrow(ci2d)) {
lines(ci2d[k, c(2, 3)], c(k, k))
}
# adjust each student's GPA for the department, enrollment, and level of the courses taken
mod3 = lm(gradepoint ~ sid - 1 + dept + enroll + level, data = grades)
cid = confint(mod3)[1:443, ]
ci2d = cid[order(coef(mod3)[1:443]), ] # sort from lowest to highest.
plot(1:2, ylim = c(0, 450), xlim = c(1.5, 4.5), type = "n", xlab = "Adjusted GPA",
ylab = "Class Rank")
for (k in 1:nrow(ci2d)) {
lines(ci2d[k, ], c(k, k))
}
The inclusion of the covariates has increased the width of the confidence intervals.
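One way to check that claim (a small added sketch, not in the original notes) is to compare the typical interval widths from the unadjusted per-student model with the adjusted ones:
median(ci[, 3] - ci[, 2])  # unadjusted per-student widths (ci puts the group label in column 1)
median(cid[, 2] - cid[, 1])  # covariate-adjusted widths from confint() on the lm fit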
# similarly, adjust each department's mean gradepoint for student, enrollment, and level
mod4 = lm(gradepoint ~ dept - 1 + sid + enroll + level, data = grades)
cid = confint(mod4)[1:40, ]
ci2d = cid[order(coef(mod4)[1:40]), ] # sort from lowest to highest.
plot(1:2, ylim = c(0, 45), xlim = c(1.5, 4.5), type = "n", xlab = "Adjusted GPA",
ylab = "Department")
for (k in 1:nrow(ci2d)) {
lines(ci2d[k, ], c(k, k))
}