Monday March 11, 2013 Class notes

The Big Picture

Judgment and expertise is incredibly important
- Which explanatory variables are important
- What do you want to hold constant to answer the question at hand.
Literacy:
- Already: variance, variation, outliers, 95% intervals, quantitative vs categorical
- Coming: frugality with degrees of freedom, experiment, causal chains
Technical questions:
- Whether to include interaction and transformation terms
- Is colinearity involved.
You can “untangle” explanatory variables.
You can analyze models simulating “holding variables constant”
You can use confidence intervals how precise your measurement of a claimed effect is.
- confidence intervals on coefficients.
- confidence intervals on model values. See the Modeling notes for Week 4 DCV
- confidence intervals on predictions.

A demo from our own software:

kids = fetchData("KidsFeet")

## Data KidsFeet found in package.

mod = lm(width ~ length * sex, data = kids)
f = makeFun(mod)
foo = with(data.frame(width, length, sex, f(length = length, sex = sex, data = kids, 
    interval = "confidence")), data = kids)
xyplot(width + lwr + upr ~ length | sex, data = foo)

plot of chunk kidsexample

foo = with(data.frame(width, length, sex, f(length = length, sex = sex, data = kids, 
    interval = "prediction")), data = kids)
xyplot(width + lwr + upr ~ length | sex, data = foo)

plot of chunk kidsexample

xyplot(width + lwr + upr ~ length, data = foo)

plot of chunk kidsexample

What does the confidence-interval/standard-error depend on?

Size of residuals
number of cases (minus number of model vectors – but we won't worry about this right now)
colinearity

Logic behind non-resampling estimates of the standard error

Models partition into a deterministic and random component
Deterministic component is along the model vectors
Random component is the residual

Key idea: if we collected another sample, the deterministic component would be the same, but the random component would be utterly different — it would point in another direction.

Geometry of Confidence Intervals

Shiny App

Why does it depend on the number of cases?

The 3-dimensional view is inherently misleading. The “third”“ dimension in the drawings is really a subspace that may have dimension \( > 1 \).

We need to understand something about how random vectors work in high dimensions.

Distribution of random vectors

fetchData("m155development.R")  # for the angle program

## Retrieving from http://www.mosaic-web.org/go/datasets/m155development.R

## [1] TRUE

Generate some randomly directed points in 2 dimensions:

x = rnorm(1000)
y = rnorm(1000)
xyplot(y ~ x)

plot of chunk unnamed-chunk-3

Angles between random vectors: How are they distributed?

s = do(1000) * angle(rnorm(2), rnorm(2))
densityplot(~result, data = s)

plot of chunk unnamed-chunk-4

Now do it in three dimensions. In higher dimensions. What's the pattern?

Why is it different in 3 than in two dimensions?

Toss around the globe again. We'll consider the north pole as one of the vectors. Why are you much more likely to land near the equator than near the poles?

Groupwise means on the GPA.

Construct the data:

grades = fetchData("grades.csv")
g2pt = fetchData("grade-to-number.csv")
courses = fetchData("courses.csv")
grades = merge(grades, g2pt)

options(na.rm = TRUE)
mod = mm(gradepoint ~ sid, data = grades)
ci = confint(mod)
head(ci)

##    group 2.5 % 97.5 %
## 1 S31185 2.058  2.767
## 2 S31188 2.781  3.255
## 3 S31191 2.953  3.471
## 4 S31194 3.069  3.649
## 5 S31197 3.078  3.635
## 6 S31200 1.918  2.455

A complicated graphic. Each student is one line. The horizontal extent shows the confidence interval for that student's GPA (based on just half the courses, so the real CI after 4 years would be shorter by \( \sqrt{2} \)).

ci2 = ci[order(coef(mod)), ]  # sort from lowest to highest.
plot(1:2, ylim = c(0, 455), xlim = c(1.5, 4.5), type = "n", xlab = "GPA", ylab = "Class Rank")
for (k in 1:nrow(ci2)) {
    lines(ci2[k, c(2, 3)], c(k, k), col = rgb(0, 0, 0, 1 - 0.8 * (k%%10 != 0)))
}

plot of chunk unnamed-chunk-7

All the students whose confidence intervals intersect a given vertical line are equivalent. But their class ranks can be very different.

Why doesn't the registrar report the confidence interval?

Bad reasons

Most people don't know how to interpret them.
It would suggest that grades are not certain. Good reason
The variation in grades is not just do to a random process, but to systematic effects such as the courses taken. Example: A student who gets a B in every course will have a narrow confidence interval. But a student who takes half her courses in Dept. One and half in Dept. Two, getting an A in every course in One and a C in every course in Two is just as consistent. We need to incorporate those effects when calculating the confidence interval.

Aside: Departmentwise GPAs

Add department and course information to data:

courses = fetchData("courses.csv")

## Retrieving from http://www.mosaic-web.org/go/datasets/courses.csv

grades = merge(grades, courses)

modd = mm(gradepoint ~ dept, data = grades)
cid = confint(modd)
head(cid)

##   group 2.5 % 97.5 %
## 1     A 2.694  4.296
## 2     b 3.104  3.381
## 3     B 3.017  3.453
## 4     C 3.450  3.598
## 5     d 3.263  3.365
## 6     D 3.488  3.630

ci2d = cid[order(coef(modd)), ]  # sort from lowest to highest.
plot(1:2, ylim = c(0, 40), xlim = c(1.5, 4.5), type = "n", xlab = "GPA", ylab = "Department Order")
for (k in 1:nrow(ci2d)) {
    lines(ci2d[k, c(2, 3)], c(k, k))
}

plot of chunk unnamed-chunk-9

GPAs adjusting for department, level, and enrollment

mod3 = lm(gradepoint ~ sid - 1 + dept + enroll + level, data = grades)

Student-by-student confidence intervals:

cid = confint(mod3)[1:443, ]
ci2d = cid[order(coef(mod3)[1:443]), ]  # sort from lowest to highest.
plot(1:2, ylim = c(0, 450), xlim = c(1.5, 4.5), type = "n", xlab = "Adjusted GPA", 
    ylab = "Class Rank")
for (k in 1:nrow(ci2d)) {
    lines(ci2d[k, ], c(k, k))
}

plot of chunk unnamed-chunk-11

The inclusion of the covariates has increased the width of the confidence intervals.

Department-by-department confidence intervals:

mod4 = lm(gradepoint ~ dept - 1 + sid + enroll + level, data = grades)
cid = confint(mod3)[1:40, ]
ci2d = cid[order(coef(mod3)[1:40]), ]  # sort from lowest to highest.
plot(1:2, ylim = c(0, 45), xlim = c(1.5, 4.5), type = "n", xlab = "Adjusted GPA", 
    ylab = "Department")
for (k in 1:nrow(ci2d)) {
    lines(ci2d[k, ], c(k, k))
}

plot of chunk unnamed-chunk-12

Confidence intervals on coefficients

Longitudinal running data

Running data: Compare the cross-sectional to the longitudinal data to get at how tie changes versus age. Question: hold individual constant or not.

f = fetchData("Cherry-Blossom-Long.csv")

## Retrieving from
## http://www.mosaic-web.org/go/datasets/Cherry-Blossom-Long.csv

nrow(f)

## [1] 41248

sample(f, size = 5)

##                   name.yob age    net    gun sex year previous nruns
## 32547      randy gale 1976  31     NA  88.87   M 2007        0     2
## 22033   kellie kubena 1972  32 128.53 134.07   F 2004        1     3
## 32732 rebecca goldman 1967  33  69.78  72.27   F 2000        1     2
## 30905 patricia jenson 1980  25 110.90 118.38   F 2005        2     5
## 2047       ann parnow 1981  24  96.82 102.43   F 2005        0     2
##       orig.ids
## 32547    32547
## 22033    22033
## 32732    32732
## 30905    30905
## 2047      2047

f = subset(f, nruns > 5)
nrow(f)

## [1] 6186

Two models

mod1 = lm(net ~ age, data = f)
mod2 = lm(net ~ age + name.yob, data = f)

confint(mod1)

##               2.5 %  97.5 %
## (Intercept) 70.2074 73.8952
## age          0.2657  0.3443

head(confint(mod2))

##                                2.5 %  97.5 %
## (Intercept)                  53.9686 64.2096
## age                           0.7492  0.9293
## name.yobabiy zewde 1967       7.2916 19.7242
## name.yobadam anthony 1966   -15.2825 -1.3436
## name.yobadam stolzberg 1976 -17.5268 -5.1321
## name.yobai-mei chang 1964    -1.5574 10.9556

Note how substantially the age dependence differs depending on whether you are looking longitudinally or cross-sectionally.

Example: Grades

Grades and the GPA

grades = fetchData("grades.csv")

## Retrieving from http://www.mosaic-web.org/go/datasets/grades.csv

g2pt = fetchData("grade-to-number.csv")

## Retrieving from http://www.mosaic-web.org/go/datasets/grade-to-number.csv

courses = fetchData("courses.csv")

## Retrieving from http://www.mosaic-web.org/go/datasets/courses.csv

grades = merge(grades, g2pt)  # Convert letter grade to number
grades = merge(grades, courses)

Compute the GPA in the ordinary way:

options(na.rm = TRUE)
head(mean(gradepoint ~ sid, data = grades))

## S31185 S31188 S31191 S31194 S31197 S31200 
##  2.413  3.018  3.212  3.359  3.331  2.186

… or by a model

conventional = coef(lm(gradepoint ~ sid - 1, data = grades))
head(conventional)

## sidS31185 sidS31188 sidS31191 sidS31194 sidS31197 sidS31200 
##     2.413     3.018     3.212     3.359     3.331     2.186

What can we hold constant? Department, level, class enrollment?

adjusted = coef(lm(gradepoint ~ sid - 1 + dept + level + enroll, data = grades))
head(adjusted)

## sidS31185 sidS31188 sidS31191 sidS31194 sidS31197 sidS31200 
##     2.519     3.190     3.137     3.461     3.356     2.142

How do they compare?

xyplot(conventional ~ adjusted[1:443], pch = 20)

plot of chunk unnamed-chunk-20

Or in terms of class rank:

xyplot(rank(conventional) ~ rank(adjusted[1:443]), pch = 20)

plot of chunk unnamed-chunk-21

Suppose the cut-off for class rank was to be \( \geq 150 \). There are students who pass by the adjusted criteria but fail by the unadjusted.

xyplot(rank(conventional) ~ rank(adjusted[1:443]), pch = 20)
plotFun(y >= 150 ~ x & y, add = TRUE)
plotFun(x >= 150 ~ x & y, add = TRUE)

plot of chunk unnamed-chunk-22

The "takes easy courses” index: a positive number indicates taking easy courses.

densityplot(~rank(conventional) - rank(adjusted[1:443]))

plot of chunk unnamed-chunk-23

Cross the IDs of the people taking easy courses with the department or instructor … to be done.

Or perhaps the Age coefficient on the running model, with and without controlling for the individual.

We need to understand how confidence intervals can change as we add in covariates.