Next Monday will be about collinearity and confidence intervals.
Next Wednesday will be about probability models.
Then, after Fall break, we'll move into hypothesis testing.
You should understand that …
From Friday's notes:
Sampling Distributions and Re-sampling Distributions
Construct an interval using do() and resample().
library(mosaic)  # for do() and resample(); loading mosaic also attaches lattice for densityplot()
s = do(100) * lm(width ~ sex + length, data = resample(KidsFeet))
sd(s)  # standard errors: the spread of each coefficient across the 100 resampling trials
## Intercept sexG length sigma r.squared
## 0.79180 0.12599 0.03000 0.03354 0.09552
densityplot(~length, data = s)
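The resampling distribution in s can also be turned into an interval directly. As a small added sketch (not part of the original notes), base R's quantile() gives a percentile-style 95% interval on the length coefficient:
quantile(s$length, c(0.025, 0.975))  # middle 95% of the 100 resampled length coefficients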
Using confint()
Using summary(). Explain the standard error and how to get confidence intervals from it.
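As a sketch of that calculation (added here, not in the original notes; the model name mod.se is just for illustration), each coefficient plus or minus roughly two standard errors reproduces confint() closely:
mod.se = lm(width ~ sex + length, data = KidsFeet)
coefs = summary(mod.se)$coefficients  # columns: Estimate, Std. Error, t value, Pr(>|t|)
coefs[, "Estimate"] - 2 * coefs[, "Std. Error"]  # approximate lower limits
coefs[, "Estimate"] + 2 * coefs[, "Std. Error"]  # approximate upper limits
confint() uses the exact t quantile instead of 2, so its limits will differ slightly.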
The setting for this problem is the question of whether girls' feet are narrower than boys'. Confidence intervals give us a quick answer (based on the available data):
mod = lm(width ~ sex, data = KidsFeet)
confint(mod)
## 2.5 % 97.5 %
## (Intercept) 8.9759 9.40411
## sexG -0.7125 -0.09903
The confidence interval on sexG, the difference between girls' and boys' foot widths, lies entirely below zero. So even this small sample provides evidence that girls' feet are narrower than boys'.
The real question, however, is whether, for any given shoe size (determined by length), girls' feet are narrower:
mod2 = lm(width ~ sex + length, data = KidsFeet)
confint(mod2)
## 2.5 % 97.5 %
## (Intercept) 1.1048 6.17752
## sexG -0.4948 0.02974
## length 0.1202 0.32182
Not so much: once length is taken into account, the confidence interval on sexG includes zero.
With a more complicated model, the confidence intervals on the individual coefficients become harder to interpret:
mod3 = lm(width ~ sex * length, data = KidsFeet)
confint(mod3)
## 2.5 % 97.5 %
## (Intercept) 0.09816 7.6060
## sexG -5.70118 4.4534
## length 0.06326 0.3620
## sexG:length -0.18915 0.2208
Notice how the confidence interval on sexG has gotten much wider. This can be confusing, since sexG also enters into the interaction term. Comparing model values is more informative here:
f3 = makeFun(mod3)
f3(length = 25, sex = "G")
## 1
## 8.939
f3(length = 25, sex = c("G", "B"), interval = "confidence")
## fit lwr upr
## 1 8.939 8.734 9.145
## 2 9.168 8.990 9.346
There is considerable overlap between the two intervals.
Another sort of question to ask: for a specific girl and a specific boy with the same foot length, how different are their individual foot widths likely to be? That calls for a prediction interval:
f3(length = 25, sex = c("G", "B"), interval = "prediction")
## fit lwr upr
## 1 8.939 8.121 9.758
## 2 9.168 8.356 9.980
The typical boy's value is very reasonable for a girl and vice versa.
As the amount of data becomes very large, the CI on the model value becomes very narrow. But the CI on the prediction always reflects the residuals.
SIMULATION: 10,000 kids' feet, generated by resampling the original data.
mod4 = lm(width ~ sex * length, data = resample(KidsFeet, size = 10000))
f3 = makeFun(mod4)  # overwrite f3 with the function fitted to the large resample
f3(length = 25, sex = c("G", "B"), interval = "confidence")
## fit lwr upr
## 1 8.928 8.917 8.94
## 2 9.170 9.160 9.18
f3(length = 25, sex = c("G", "B"), interval = "prediction")
## fit lwr upr
## 1 8.928 8.197 9.660
## 2 9.170 8.438 9.902
Increased spending is associated with lower SAT scores
confint(lm(sat ~ expend, data = SAT))
## 2.5 % 97.5 %
## (Intercept) 1000.04 1178.546
## expend -35.63 -6.158
Until you adjust for who took the test …
confint(lm(sat ~ expend + frac, data = SAT))
## 2.5 % 97.5 %
## (Intercept) 949.909 1037.754
## expend 3.788 20.785
## frac -3.284 -2.418
Do the same for salary and ratio.
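As a sketch of what that would look like (commands only, output not reproduced here; this assumes the SAT data frame carries salary and ratio columns, as in the mosaic package's version of these data):
confint(lm(sat ~ salary, data = SAT))  # teacher salary alone
confint(lm(sat ~ salary + frac, data = SAT))  # ... adjusting for the fraction taking the test
confint(lm(sat ~ ratio, data = SAT))  # pupil-teacher ratio alone
confint(lm(sat ~ ratio + frac, data = SAT))  # ... adjusting for the fraction taking the test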
What does the standard error depend on?
Logic behind non-resampling estimates of the standard error
Key idea: if we collected another sample, the deterministic component would be the same, but the random component would be utterly different — it would point in another direction.
Why does it depend on the number of cases? Because random vectors in high dimensions are nearly orthogonal.
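A quick simulation (an added illustration, not part of the original notes) makes the point: the cosine of the angle between two random vectors shrinks toward zero, roughly like \( 1/\sqrt{n} \), as the number of cases n grows.
cos.angle = function(n) {
    a = rnorm(n); b = rnorm(n)
    sum(a * b) / sqrt(sum(a^2) * sum(b^2))  # cosine of the angle between a and b
}
sapply(c(5, 50, 500, 5000), function(n) sd(replicate(1000, cos.angle(n))))  # typical size of the cosine
The shrinking alignment between the random component and the model vectors is why the standard error falls as the sample size grows.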
Construct the data:
grades = fetchData("grades.csv")
g2pt = fetchData("grade-to-number.csv")
courses = fetchData("courses.csv")
grades = merge(grades, g2pt)
options(na.rm = TRUE)
mod = mm(gradepoint ~ sid, data = grades)  # mm() fits groupwise means: one mean gradepoint per student (sid)
ci = confint(mod)
head(ci)
## group 2.5 % 97.5 %
## 1 S31185 2.058 2.767
## 2 S31188 2.781 3.255
## 3 S31191 2.953 3.471
## 4 S31194 3.069 3.649
## 5 S31197 3.078 3.635
## 6 S31200 1.918 2.455
A complicated graphic. Each student is one line. The horizontal extent shows the confidence interval for that student's GPA (based on just half of the courses, so the real CI after four years would be narrower by a factor of \( \sqrt{2} \)).
ci2 = ci[order(coef(mod)), ] # sort from lowest to highest.
plot(1:2, ylim = c(0, 455), xlim = c(1.5, 4.5), type = "n", xlab = "GPA", ylab = "Class Rank")
for (k in 1:nrow(ci2)) {
# draw every 10th student's interval solid; the rest faintly
lines(ci2[k, c(2, 3)], c(k, k), col = rgb(0, 0, 0, 1 - 0.8 * (k%%10 != 0)))
}
All the students whose confidence intervals intersect a given vertical line are statistically indistinguishable from one another. But their class ranks can be very different.
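To make that concrete, a small added sketch (not in the original notes) counts how many students' intervals cross the vertical line at GPA = 3.0:
sum(ci2[, 2] <= 3.0 & ci2[, 3] >= 3.0)  # students indistinguishable from a 3.0 GPA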
Why doesn't the registrar report the confidence interval?
Bad reasons
Add department and course information to the data:
courses = fetchData("courses.csv")
## Retrieving from http://www.mosaic-web.org/go/datasets/courses.csv
grades = merge(grades, courses)
modd = mm(gradepoint ~ dept, data = grades)
cid = confint(modd)
head(cid)
## group 2.5 % 97.5 %
## 1 A 2.694 4.296
## 2 b 3.104 3.381
## 3 B 3.017 3.453
## 4 C 3.450 3.598
## 5 d 3.263 3.365
## 6 D 3.488 3.630
ci2d = cid[order(coef(modd)), ] # sort from lowest to highest.
plot(1:2, ylim = c(0, 40), xlim = c(1.5, 4.5), type = "n", xlab = "GPA", ylab = "Department Order")
for (k in 1:nrow(ci2d)) {
lines(ci2d[k, c(2, 3)], c(k, k))
}
# adjust each student's GPA for the department, enrollment, and level of the courses taken
mod3 = lm(gradepoint ~ sid - 1 + dept + enroll + level, data = grades)
cid = confint(mod3)[1:443, ]
ci2d = cid[order(coef(mod3)[1:443]), ] # sort from lowest to highest.
plot(1:2, ylim = c(0, 450), xlim = c(1.5, 4.5), type = "n", xlab = "Adjusted GPA",
ylab = "Class Rank")
for (k in 1:nrow(ci2d)) {
lines(ci2d[k, ], c(k, k))
}
The inclusion of the covariates has increased the width of the confidence intervals.
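One way to check that claim (a small added sketch, not in the original notes) is to compare the typical interval widths from the unadjusted per-student model with the adjusted ones:
median(ci[, 3] - ci[, 2])  # unadjusted per-student widths (ci puts the group label in column 1)
median(cid[, 2] - cid[, 1])  # covariate-adjusted widths from confint() on the lm fit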
# similarly, adjust each department's mean gradepoint for student, enrollment, and level
mod4 = lm(gradepoint ~ dept - 1 + sid + enroll + level, data = grades)
cid = confint(mod4)[1:40, ]
ci2d = cid[order(coef(mod4)[1:40]), ] # sort from lowest to highest.
plot(1:2, ylim = c(0, 45), xlim = c(1.5, 4.5), type = "n", xlab = "Adjusted GPA",
ylab = "Department")
for (k in 1:nrow(ci2d)) {
lines(ci2d[k, ], c(k, k))
}