Stats 155 Class Notes 2012-10-12

Review

We saw a general strategy for modeling when the concern is with effect size:

  1. Decide which sort of effect you want to study: total or partial.
  2. Construct a model that includes both the variable you are interested in and any covariates that you want to hold constant. What sort of model terms (e.g. interactions) you want to include is another matter, you have a choice.
  3. Compute a partial derivative with respect to the variable whose effect size you want to study holding the other variables constant.

IMPORTANT: Leaving a covariate out of a model does not hold it constant. Ignoring is not the same thing as holding constant.

More than one variable?

It can occasionally happen that you want to study the effect of a variable that necessarily has an effect on another variable. For instance, suppose we want to study the effect of education on wages. But each year of education necessarily leads to a year less of “experience.” So, include both in the model:

mod = lm( wage ~ sex*poly(exper,2)*educ + sector*educ, data=CPS85)
f = makeFun(mod)
f(sex="M", exper=9, educ=15,sector="prof") - f( sex="M", exper=10, educ=14, sector="prof")
##      1 
## 0.4116 

You could also do this by constructing the differential involving both experience and education, but this will do.

Hold Constant or Not

The in-class activity:

Total-vs-partial In-class activity

Longitudinal running data

Running data: Compare the cross-sectional to the longitudinal data to get at how tie changes versus age. Question: hold individual constant or not.

f = fetchData("Cherry-Blossom-Long.csv")
## Retrieving from
## http://www.mosaic-web.org/go/datasets/Cherry-Blossom-Long.csv
nrow(f)
## [1] 41248
sample(f, size = 5)
##                     name.yob age    net    gun sex year previous nruns
## 12117    erin alexander 1982  25     NA 111.38   F 2007        0     2
## 40268 vincent guerrieri 1943  59 118.20 121.65   M 2002        3     6
## 16003      jane schultz 1964  37  77.68  79.13   F 2001        1     7
## 18058       joan mooney 1964  38  92.20  94.27   F 2002        2     3
## 36705   stephanie brown 1964  36 110.95 114.05   F 2000        1     4
##       orig.ids
## 12117    12117
## 40268    40268
## 16003    16003
## 18058    18058
## 36705    36705
f = subset(f, nruns > 5)
nrow(f)
## [1] 6186

Two models

mod1 = lm(net ~ age, data = f)
mod2 = lm(net ~ age + name.yob, data = f)
head(coef(mod2))
##                 (Intercept)                         age 
##                     59.0891                      0.8393 
##     name.yobabiy zewde 1967   name.yobadam anthony 1966 
##                     13.5079                     -8.3130 
## name.yobadam stolzberg 1976   name.yobai-mei chang 1964 
##                    -11.3294                      4.6991 
mod1
## 
## Call:
## lm(formula = net ~ age, data = f)
## 
## Coefficients:
## (Intercept)          age  
##      72.051        0.305  
## 

Note how substantially the age dependence differs depending on whether you are looking longitudinally or cross-sectionally.

Grades and the GPA

grades = fetchData("grades.csv")
## Retrieving from http://www.mosaic-web.org/go/datasets/grades.csv
g2pt = fetchData("grade-to-number.csv")
## Retrieving from http://www.mosaic-web.org/go/datasets/grade-to-number.csv
courses = fetchData("courses.csv")
## Retrieving from http://www.mosaic-web.org/go/datasets/courses.csv
grades = merge(grades, g2pt)  # Convert letter grade to number
grades = merge(grades, courses)

Compute the GPA in the ordinary way:

options(na.rm = TRUE)
head(mean(gradepoint ~ sid, data = grades))
## S31185 S31188 S31191 S31194 S31197 S31200 
##  2.413  3.018  3.212  3.359  3.331  2.186 

… or by a model

conventional = coef(lm(gradepoint ~ sid - 1, data = grades))
head(conventional)
## sidS31185 sidS31188 sidS31191 sidS31194 sidS31197 sidS31200 
##     2.413     3.018     3.212     3.359     3.331     2.186 

What can we hold constant? Department, level, class enrollment?

adjusted = coef(lm(gradepoint ~ sid - 1 + dept + level + enroll, data = grades))
head(adjusted)
## sidS31185 sidS31188 sidS31191 sidS31194 sidS31197 sidS31200 
##     2.519     3.190     3.137     3.461     3.356     2.142 

How do they compare?

xyplot(conventional ~ adjusted[1:443], pch = 20)

plot of chunk unnamed-chunk-10

Or in terms of class rank:

xyplot(rank(conventional) ~ rank(adjusted[1:443]), pch = 20)

plot of chunk unnamed-chunk-11

Suppose the cut-off for class rank was to be \( \geq 150 \). There are students who pass by the adjusted criteria but fail by the unadjusted.

xyplot(rank(conventional) ~ rank(adjusted[1:443]), pch = 20)
plotFun(y >= 150 ~ x & y, add = TRUE)
plotFun(x >= 150 ~ x & y, add = TRUE)

plot of chunk unnamed-chunk-12

The “takes easy courses” index: a positive number indicates taking easy courses.

densityplot(~rank(conventional) - rank(adjusted[1:443]))

plot of chunk unnamed-chunk-13

Cross the IDs of the people taking easy courses with the department … to be done.