A demo from our own software:
kids = fetchData("KidsFeet")
## Data KidsFeet found in package.
mod = lm(width ~ length * sex, data = kids)
f = makeFun(mod)
foo = with(data.frame(width, length, sex, f(length = length, sex = sex, data = kids,
interval = "confidence")), data = kids)
xyplot(width + lwr + upr ~ length | sex, data = foo)
foo = with(data.frame(width, length, sex, f(length = length, sex = sex, data = kids,
interval = "prediction")), data = kids)
xyplot(width + lwr + upr ~ length | sex, data = foo)
xyplot(width + lwr + upr ~ length, data = foo)
What does the confidence-interval/standard-error depend on?
Logic behind non-resampling estimates of the standard error
Key idea: if we collected another sample, the deterministic component would be the same, but the random component would be utterly different — it would point in another direction.
The 3-dimensional view is inherently misleading. The “third”“ dimension in the drawings is really a subspace that may have dimension \( > 1 \).
We need to understand something about how random vectors work in high dimensions.
Distribution of random vectors
fetchData("m155development.R") # for the angle program
## Retrieving from http://www.mosaic-web.org/go/datasets/m155development.R
## [1] TRUE
Generate some randomly directed points in 2 dimensions:
x = rnorm(1000)
y = rnorm(1000)
xyplot(y ~ x)
Angles between random vectors: How are they distributed?
s = do(1000) * angle(rnorm(2), rnorm(2))
densityplot(~result, data = s)
Now do it in three dimensions. In higher dimensions. What's the pattern?
Why is it different in 3 than in two dimensions?
Toss around the globe again. We'll consider the north pole as one of the vectors. Why are you much more likely to land near the equator than near the poles?
Construct the data:
grades = fetchData("grades.csv")
g2pt = fetchData("grade-to-number.csv")
courses = fetchData("courses.csv")
grades = merge(grades, g2pt)
options(na.rm = TRUE)
mod = mm(gradepoint ~ sid, data = grades)
ci = confint(mod)
head(ci)
## group 2.5 % 97.5 %
## 1 S31185 2.058 2.767
## 2 S31188 2.781 3.255
## 3 S31191 2.953 3.471
## 4 S31194 3.069 3.649
## 5 S31197 3.078 3.635
## 6 S31200 1.918 2.455
A complicated graphic. Each student is one line. The horizontal extent shows the confidence interval for that student's GPA (based on just half the courses, so the real CI after 4 years would be shorter by \( \sqrt{2} \)).
ci2 = ci[order(coef(mod)), ] # sort from lowest to highest.
plot(1:2, ylim = c(0, 455), xlim = c(1.5, 4.5), type = "n", xlab = "GPA", ylab = "Class Rank")
for (k in 1:nrow(ci2)) {
lines(ci2[k, c(2, 3)], c(k, k), col = rgb(0, 0, 0, 1 - 0.8 * (k%%10 != 0)))
}
All the students whose confidence intervals intersect a given vertical line are equivalent. But their class ranks can be very different.
Why doesn't the registrar report the confidence interval?
Bad reasons
Add department and course information to data:
courses = fetchData("courses.csv")
## Retrieving from http://www.mosaic-web.org/go/datasets/courses.csv
grades = merge(grades, courses)
modd = mm(gradepoint ~ dept, data = grades)
cid = confint(modd)
head(cid)
## group 2.5 % 97.5 %
## 1 A 2.694 4.296
## 2 b 3.104 3.381
## 3 B 3.017 3.453
## 4 C 3.450 3.598
## 5 d 3.263 3.365
## 6 D 3.488 3.630
ci2d = cid[order(coef(modd)), ] # sort from lowest to highest.
plot(1:2, ylim = c(0, 40), xlim = c(1.5, 4.5), type = "n", xlab = "GPA", ylab = "Department Order")
for (k in 1:nrow(ci2d)) {
lines(ci2d[k, c(2, 3)], c(k, k))
}
mod3 = lm(gradepoint ~ sid - 1 + dept + enroll + level, data = grades)
cid = confint(mod3)[1:443, ]
ci2d = cid[order(coef(mod3)[1:443]), ] # sort from lowest to highest.
plot(1:2, ylim = c(0, 450), xlim = c(1.5, 4.5), type = "n", xlab = "Adjusted GPA",
ylab = "Class Rank")
for (k in 1:nrow(ci2d)) {
lines(ci2d[k, ], c(k, k))
}
The inclusion of the covariates has increased the width of the confidence intervals.
mod4 = lm(gradepoint ~ dept - 1 + sid + enroll + level, data = grades)
cid = confint(mod3)[1:40, ]
ci2d = cid[order(coef(mod3)[1:40]), ] # sort from lowest to highest.
plot(1:2, ylim = c(0, 45), xlim = c(1.5, 4.5), type = "n", xlab = "Adjusted GPA",
ylab = "Department")
for (k in 1:nrow(ci2d)) {
lines(ci2d[k, ], c(k, k))
}
Running data: Compare the cross-sectional to the longitudinal data to get at how tie changes versus age. Question: hold individual constant or not.
f = fetchData("Cherry-Blossom-Long.csv")
## Retrieving from
## http://www.mosaic-web.org/go/datasets/Cherry-Blossom-Long.csv
nrow(f)
## [1] 41248
sample(f, size = 5)
## name.yob age net gun sex year previous nruns
## 32547 randy gale 1976 31 NA 88.87 M 2007 0 2
## 22033 kellie kubena 1972 32 128.53 134.07 F 2004 1 3
## 32732 rebecca goldman 1967 33 69.78 72.27 F 2000 1 2
## 30905 patricia jenson 1980 25 110.90 118.38 F 2005 2 5
## 2047 ann parnow 1981 24 96.82 102.43 F 2005 0 2
## orig.ids
## 32547 32547
## 22033 22033
## 32732 32732
## 30905 30905
## 2047 2047
f = subset(f, nruns > 5)
nrow(f)
## [1] 6186
Two models
mod1 = lm(net ~ age, data = f)
mod2 = lm(net ~ age + name.yob, data = f)
confint(mod1)
## 2.5 % 97.5 %
## (Intercept) 70.2074 73.8952
## age 0.2657 0.3443
head(confint(mod2))
## 2.5 % 97.5 %
## (Intercept) 53.9686 64.2096
## age 0.7492 0.9293
## name.yobabiy zewde 1967 7.2916 19.7242
## name.yobadam anthony 1966 -15.2825 -1.3436
## name.yobadam stolzberg 1976 -17.5268 -5.1321
## name.yobai-mei chang 1964 -1.5574 10.9556
Note how substantially the age dependence differs depending on whether you are looking longitudinally or cross-sectionally.
Example: Grades
grades = fetchData("grades.csv")
## Retrieving from http://www.mosaic-web.org/go/datasets/grades.csv
g2pt = fetchData("grade-to-number.csv")
## Retrieving from http://www.mosaic-web.org/go/datasets/grade-to-number.csv
courses = fetchData("courses.csv")
## Retrieving from http://www.mosaic-web.org/go/datasets/courses.csv
grades = merge(grades, g2pt) # Convert letter grade to number
grades = merge(grades, courses)
Compute the GPA in the ordinary way:
options(na.rm = TRUE)
head(mean(gradepoint ~ sid, data = grades))
## S31185 S31188 S31191 S31194 S31197 S31200
## 2.413 3.018 3.212 3.359 3.331 2.186
… or by a model
conventional = coef(lm(gradepoint ~ sid - 1, data = grades))
head(conventional)
## sidS31185 sidS31188 sidS31191 sidS31194 sidS31197 sidS31200
## 2.413 3.018 3.212 3.359 3.331 2.186
What can we hold constant? Department, level, class enrollment?
adjusted = coef(lm(gradepoint ~ sid - 1 + dept + level + enroll, data = grades))
head(adjusted)
## sidS31185 sidS31188 sidS31191 sidS31194 sidS31197 sidS31200
## 2.519 3.190 3.137 3.461 3.356 2.142
How do they compare?
xyplot(conventional ~ adjusted[1:443], pch = 20)
Or in terms of class rank:
xyplot(rank(conventional) ~ rank(adjusted[1:443]), pch = 20)
Suppose the cut-off for class rank was to be \( \geq 150 \). There are students who pass by the adjusted criteria but fail by the unadjusted.
xyplot(rank(conventional) ~ rank(adjusted[1:443]), pch = 20)
plotFun(y >= 150 ~ x & y, add = TRUE)
plotFun(x >= 150 ~ x & y, add = TRUE)
The "takes easy courses” index: a positive number indicates taking easy courses.
densityplot(~rank(conventional) - rank(adjusted[1:443]))
Cross the IDs of the people taking easy courses with the department or instructor … to be done.
Or perhaps the Age coefficient on the running model, with and without controlling for the individual.
We need to understand how confidence intervals can change as we add in covariates.