Monday March 11, 2013 Class notes

The Big Picture

  1. Judgment and expertise is incredibly important
  2. Literacy:
  3. Technical questions:
  4. You can “untangle” explanatory variables.
  5. You can analyze models simulating “holding variables constant”
  6. You can use confidence intervals how precise your measurement of a claimed effect is.

A demo from our own software:

kids = fetchData("KidsFeet")
## Data KidsFeet found in package.
mod = lm(width ~ length * sex, data = kids)
f = makeFun(mod)
foo = with(data.frame(width, length, sex, f(length = length, sex = sex, data = kids, 
    interval = "confidence")), data = kids)
xyplot(width + lwr + upr ~ length | sex, data = foo)

plot of chunk kidsexample

foo = with(data.frame(width, length, sex, f(length = length, sex = sex, data = kids, 
    interval = "prediction")), data = kids)
xyplot(width + lwr + upr ~ length | sex, data = foo)

plot of chunk kidsexample

xyplot(width + lwr + upr ~ length, data = foo)

plot of chunk kidsexample

What does the confidence-interval/standard-error depend on?

Logic behind non-resampling estimates of the standard error

Key idea: if we collected another sample, the deterministic component would be the same, but the random component would be utterly different — it would point in another direction.

Geometry of Confidence Intervals

Shiny App

Why does it depend on the number of cases?

The 3-dimensional view is inherently misleading. The “third”“ dimension in the drawings is really a subspace that may have dimension \( > 1 \).

We need to understand something about how random vectors work in high dimensions.

Distribution of random vectors

fetchData("m155development.R")  # for the angle program
## Retrieving from http://www.mosaic-web.org/go/datasets/m155development.R
## [1] TRUE

Generate some randomly directed points in 2 dimensions:

x = rnorm(1000)
y = rnorm(1000)
xyplot(y ~ x)

plot of chunk unnamed-chunk-3

Angles between random vectors: How are they distributed?

s = do(1000) * angle(rnorm(2), rnorm(2))
densityplot(~result, data = s)

plot of chunk unnamed-chunk-4

Now do it in three dimensions. In higher dimensions. What's the pattern?

Why is it different in 3 than in two dimensions?

Toss around the globe again. We'll consider the north pole as one of the vectors. Why are you much more likely to land near the equator than near the poles?

Groupwise means on the GPA.

Construct the data:

grades = fetchData("grades.csv")
g2pt = fetchData("grade-to-number.csv")
courses = fetchData("courses.csv")
grades = merge(grades, g2pt)
options(na.rm = TRUE)
mod = mm(gradepoint ~ sid, data = grades)
ci = confint(mod)
head(ci)
##    group 2.5 % 97.5 %
## 1 S31185 2.058  2.767
## 2 S31188 2.781  3.255
## 3 S31191 2.953  3.471
## 4 S31194 3.069  3.649
## 5 S31197 3.078  3.635
## 6 S31200 1.918  2.455

A complicated graphic. Each student is one line. The horizontal extent shows the confidence interval for that student's GPA (based on just half the courses, so the real CI after 4 years would be shorter by \( \sqrt{2} \)).

ci2 = ci[order(coef(mod)), ]  # sort from lowest to highest.
plot(1:2, ylim = c(0, 455), xlim = c(1.5, 4.5), type = "n", xlab = "GPA", ylab = "Class Rank")
for (k in 1:nrow(ci2)) {
    lines(ci2[k, c(2, 3)], c(k, k), col = rgb(0, 0, 0, 1 - 0.8 * (k%%10 != 0)))
}

plot of chunk unnamed-chunk-7

All the students whose confidence intervals intersect a given vertical line are equivalent. But their class ranks can be very different.

Why doesn't the registrar report the confidence interval?

Bad reasons

Aside: Departmentwise GPAs

Add department and course information to data:

courses = fetchData("courses.csv")
## Retrieving from http://www.mosaic-web.org/go/datasets/courses.csv
grades = merge(grades, courses)
modd = mm(gradepoint ~ dept, data = grades)
cid = confint(modd)
head(cid)
##   group 2.5 % 97.5 %
## 1     A 2.694  4.296
## 2     b 3.104  3.381
## 3     B 3.017  3.453
## 4     C 3.450  3.598
## 5     d 3.263  3.365
## 6     D 3.488  3.630
ci2d = cid[order(coef(modd)), ]  # sort from lowest to highest.
plot(1:2, ylim = c(0, 40), xlim = c(1.5, 4.5), type = "n", xlab = "GPA", ylab = "Department Order")
for (k in 1:nrow(ci2d)) {
    lines(ci2d[k, c(2, 3)], c(k, k))
}

plot of chunk unnamed-chunk-9

GPAs adjusting for department, level, and enrollment

mod3 = lm(gradepoint ~ sid - 1 + dept + enroll + level, data = grades)

Student-by-student confidence intervals:

cid = confint(mod3)[1:443, ]
ci2d = cid[order(coef(mod3)[1:443]), ]  # sort from lowest to highest.
plot(1:2, ylim = c(0, 450), xlim = c(1.5, 4.5), type = "n", xlab = "Adjusted GPA", 
    ylab = "Class Rank")
for (k in 1:nrow(ci2d)) {
    lines(ci2d[k, ], c(k, k))
}

plot of chunk unnamed-chunk-11

The inclusion of the covariates has increased the width of the confidence intervals.

Department-by-department confidence intervals:

mod4 = lm(gradepoint ~ dept - 1 + sid + enroll + level, data = grades)
cid = confint(mod3)[1:40, ]
ci2d = cid[order(coef(mod3)[1:40]), ]  # sort from lowest to highest.
plot(1:2, ylim = c(0, 45), xlim = c(1.5, 4.5), type = "n", xlab = "Adjusted GPA", 
    ylab = "Department")
for (k in 1:nrow(ci2d)) {
    lines(ci2d[k, ], c(k, k))
}

plot of chunk unnamed-chunk-12

Confidence intervals on coefficients

Longitudinal running data

Running data: Compare the cross-sectional to the longitudinal data to get at how tie changes versus age. Question: hold individual constant or not.

f = fetchData("Cherry-Blossom-Long.csv")
## Retrieving from
## http://www.mosaic-web.org/go/datasets/Cherry-Blossom-Long.csv
nrow(f)
## [1] 41248
sample(f, size = 5)
##                   name.yob age    net    gun sex year previous nruns
## 32547      randy gale 1976  31     NA  88.87   M 2007        0     2
## 22033   kellie kubena 1972  32 128.53 134.07   F 2004        1     3
## 32732 rebecca goldman 1967  33  69.78  72.27   F 2000        1     2
## 30905 patricia jenson 1980  25 110.90 118.38   F 2005        2     5
## 2047       ann parnow 1981  24  96.82 102.43   F 2005        0     2
##       orig.ids
## 32547    32547
## 22033    22033
## 32732    32732
## 30905    30905
## 2047      2047
f = subset(f, nruns > 5)
nrow(f)
## [1] 6186

Two models

mod1 = lm(net ~ age, data = f)
mod2 = lm(net ~ age + name.yob, data = f)
confint(mod1)
##               2.5 %  97.5 %
## (Intercept) 70.2074 73.8952
## age          0.2657  0.3443
head(confint(mod2))
##                                2.5 %  97.5 %
## (Intercept)                  53.9686 64.2096
## age                           0.7492  0.9293
## name.yobabiy zewde 1967       7.2916 19.7242
## name.yobadam anthony 1966   -15.2825 -1.3436
## name.yobadam stolzberg 1976 -17.5268 -5.1321
## name.yobai-mei chang 1964    -1.5574 10.9556

Note how substantially the age dependence differs depending on whether you are looking longitudinally or cross-sectionally.

Example: Grades

Grades and the GPA

grades = fetchData("grades.csv")
## Retrieving from http://www.mosaic-web.org/go/datasets/grades.csv
g2pt = fetchData("grade-to-number.csv")
## Retrieving from http://www.mosaic-web.org/go/datasets/grade-to-number.csv
courses = fetchData("courses.csv")
## Retrieving from http://www.mosaic-web.org/go/datasets/courses.csv
grades = merge(grades, g2pt)  # Convert letter grade to number
grades = merge(grades, courses)

Compute the GPA in the ordinary way:

options(na.rm = TRUE)
head(mean(gradepoint ~ sid, data = grades))
## S31185 S31188 S31191 S31194 S31197 S31200 
##  2.413  3.018  3.212  3.359  3.331  2.186

… or by a model

conventional = coef(lm(gradepoint ~ sid - 1, data = grades))
head(conventional)
## sidS31185 sidS31188 sidS31191 sidS31194 sidS31197 sidS31200 
##     2.413     3.018     3.212     3.359     3.331     2.186

What can we hold constant? Department, level, class enrollment?

adjusted = coef(lm(gradepoint ~ sid - 1 + dept + level + enroll, data = grades))
head(adjusted)
## sidS31185 sidS31188 sidS31191 sidS31194 sidS31197 sidS31200 
##     2.519     3.190     3.137     3.461     3.356     2.142

How do they compare?

xyplot(conventional ~ adjusted[1:443], pch = 20)

plot of chunk unnamed-chunk-20

Or in terms of class rank:

xyplot(rank(conventional) ~ rank(adjusted[1:443]), pch = 20)

plot of chunk unnamed-chunk-21

Suppose the cut-off for class rank was to be \( \geq 150 \). There are students who pass by the adjusted criteria but fail by the unadjusted.

xyplot(rank(conventional) ~ rank(adjusted[1:443]), pch = 20)
plotFun(y >= 150 ~ x & y, add = TRUE)
plotFun(x >= 150 ~ x & y, add = TRUE)

plot of chunk unnamed-chunk-22

The "takes easy courses” index: a positive number indicates taking easy courses.

densityplot(~rank(conventional) - rank(adjusted[1:443]))

plot of chunk unnamed-chunk-23

Cross the IDs of the people taking easy courses with the department or instructor … to be done.

Or perhaps the Age coefficient on the running model, with and without controlling for the individual.

We need to understand how confidence intervals can change as we add in covariates.