We saw a general strategy for modeling when the concern is effect size:
IMPORTANT: Leaving a covariate out of a model does not hold it constant. Ignoring is not the same thing as holding constant.
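A minimal sketch of the point, using the CPS85 data that appears just below (and assuming the mosaic package, which supplies CPS85, is loaded): if leaving sector out of a wage model were the same as holding it constant, the educ coefficient could not change when sector is added.
require(mosaic)  # provides the CPS85 data
coef(lm(wage ~ educ, data = CPS85))["educ"]           # sector ignored
coef(lm(wage ~ educ + sector, data = CPS85))["educ"]  # sector held constant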
It can happen that the variable whose effect you want to study is necessarily tied to another variable. For instance, suppose we want to study the effect of education on wages. But, at any given age, each additional year of education means one year less of “experience.” So include both in the model:
mod = lm(wage ~ sex * poly(exper, 2) * educ + sector * educ, data = CPS85)
f = makeFun(mod)
# One more year of education (15 vs 14) in exchange for one less year of experience (9 vs 10):
f(sex = "M", exper = 9, educ = 15, sector = "prof") - f(sex = "M", exper = 10, educ = 14, sector = "prof")
## 1
## 0.4116
You could also do this by constructing the differential involving both experience and education, but this will do.
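Here is a sketch of that differential approach, using the same fitted function f: estimate the two partial derivatives numerically and combine them with a change of +1 year of education and -1 year of experience (the step size h is an arbitrary small number chosen for illustration).
h = 0.001
dwage.deduc = (f(sex = "M", exper = 10, educ = 14 + h, sector = "prof") -
               f(sex = "M", exper = 10, educ = 14, sector = "prof")) / h
dwage.dexper = (f(sex = "M", exper = 10 + h, educ = 14, sector = "prof") -
                f(sex = "M", exper = 10, educ = 14, sector = "prof")) / h
dwage.deduc * 1 + dwage.dexper * (-1)  # net change: +1 year of educ, -1 year of exper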
The in-class activity: Total-vs-partial.
Running data: compare the cross-sectional to the longitudinal data to get at how running time changes with age. Question: do we hold the individual runner constant or not?
f = fetchData("Cherry-Blossom-Long.csv")
## Retrieving from
## http://www.mosaic-web.org/go/datasets/Cherry-Blossom-Long.csv
nrow(f)
## [1] 41248
sample(f, size = 5)
##                      name.yob age    net    gun sex year previous nruns orig.ids
## 12117    erin alexander 1982  25     NA 111.38   F 2007        0     2    12117
## 40268 vincent guerrieri 1943  59 118.20 121.65   M 2002        3     6    40268
## 16003      jane schultz 1964  37  77.68  79.13   F 2001        1     7    16003
## 18058       joan mooney 1964  38  92.20  94.27   F 2002        2     3    18058
## 36705   stephanie brown 1964  36 110.95 114.05   F 2000        1     4    36705
f = subset(f, nruns > 5)
nrow(f)
## [1] 6186
Two models
mod1 = lm(net ~ age, data = f)              # cross-sectional: the individual runner is ignored
mod2 = lm(net ~ age + name.yob, data = f)   # longitudinal: the individual runner is held constant
head(coef(mod2))
## (Intercept) age
## 59.0891 0.8393
## name.yobabiy zewde 1967 name.yobadam anthony 1966
## 13.5079 -8.3130
## name.yobadam stolzberg 1976 name.yobai-mei chang 1964
## -11.3294 4.6991
mod1
##
## Call:
## lm(formula = net ~ age, data = f)
##
## Coefficients:
## (Intercept) age
## 72.051 0.305
##
Note how substantially the age dependence differs depending on whether you look longitudinally or cross-sectionally: about 0.84 per year of age when the individual runner is held constant, versus about 0.31 per year across runners.
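To put the two slopes side by side (a small sketch using the models above):
c(cross.sectional = coef(mod1)["age"], within.runner = coef(mod2)["age"])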
grades = fetchData("grades.csv")
## Retrieving from http://www.mosaic-web.org/go/datasets/grades.csv
g2pt = fetchData("grade-to-number.csv")
## Retrieving from http://www.mosaic-web.org/go/datasets/grade-to-number.csv
courses = fetchData("courses.csv")
## Retrieving from http://www.mosaic-web.org/go/datasets/courses.csv
grades = merge(grades, g2pt) # Convert letter grade to number
grades = merge(grades, courses) # Attach course information
Compute the GPA in the ordinary way:
head(mean(gradepoint ~ sid, data = grades, na.rm = TRUE))  # na.rm drops missing grades
## S31185 S31188 S31191 S31194 S31197 S31200
## 2.413 3.018 3.212 3.359 3.331 2.186
… or by a model
conventional = coef(lm(gradepoint ~ sid - 1, data = grades))  # -1 drops the intercept, giving one coefficient per student
head(conventional)
## sidS31185 sidS31188 sidS31191 sidS31194 sidS31197 sidS31200
## 2.413 3.018 3.212 3.359 3.331 2.186
What can we hold constant? Department, level, class enrollment?
adjusted = coef(lm(gradepoint ~ sid - 1 + dept + level + enroll, data = grades))  # hold department, level, and enrollment constant
head(adjusted)
## sidS31185 sidS31188 sidS31191 sidS31194 sidS31197 sidS31200
## 2.519 3.190 3.137 3.461 3.356 2.142
How do they compare?
xyplot(conventional ~ adjusted[1:443], pch = 20)  # coefficients 1:443 are the per-student terms; the rest adjust for dept, level, enroll
Or in terms of class rank:
xyplot(rank(conventional) ~ rank(adjusted[1:443]), pch = 20)
Suppose the cut-off for class rank were set at \( \geq 150 \). There are students who pass by the adjusted criterion but fail by the unadjusted one; the sketch after the plot identifies them.
xyplot(rank(conventional) ~ rank(adjusted[1:443]), pch = 20)
plotFun(y >= 150 ~ x & y, add = TRUE)  # horizontal cut-off: conventional rank = 150
plotFun(x >= 150 ~ x & y, add = TRUE)  # vertical cut-off: adjusted rank = 150
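A sketch of how to identify those students (keeping the assumptions above: the first 443 coefficients of adjusted are the per-student terms, and the cut-off is rank \( \geq 150 \)):
pass.adj.only = rank(adjusted[1:443]) >= 150 & rank(conventional) < 150
sum(pass.adj.only)                        # how many students are affected
head(names(conventional)[pass.adj.only])  # their IDs (names carry a "sid" prefix)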
The “takes easy courses” index: a positive value means the student’s rank drops once courses are adjusted for, suggesting a schedule of relatively easy courses.
densityplot(~rank(conventional) - rank(adjusted[1:443]))
Cross the IDs of the students taking easy courses with the departments of their courses … to be done.
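One possible sketch of that cross (the 50-place rank drop used to flag “easy courses” is an arbitrary threshold, chosen only for illustration):
easy.index = rank(conventional) - rank(adjusted[1:443])
easy.ids = sub("^sid", "", names(conventional)[easy.index > 50])  # strip the "sid" prefix to recover student IDs
tally(~dept, data = subset(grades, sid %in% easy.ids))  # which departments their grades come from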