March 6, 2013 Class Notes

Review of Partial Derivatives

A derivative is a way of describing how two (quantitative) variables are related: it converts a change in one variable into the corresponding change in the other.

Partial derivatives are derivatives taken with respect to one variable while holding all the other variables constant.

Relate to the two-variable polynomial: \( f(x,y) = a_0 + a_1 x + a_2 y + a_3 x y + ... \)

With a nonzero interaction coefficient \( a_3 \), the derivative with respect to \( x \) depends on \( y \).
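Truncating the polynomial at the interaction term, the partial derivatives are \( \partial f/\partial x = a_1 + a_3 y \) and \( \partial f/\partial y = a_2 + a_3 x \): each is the ordinary derivative with the other variable treated as a constant. The dependence on the other variable disappears only when \( a_3 = 0 \).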

Discussion

Sometimes you want partial derivatives, sometimes you don't.

Kids' feet data.

The question is whether girls' shoes are narrower than boys' because girls' feet are narrower. Address this.
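One way to get at this with the KidsFeet data that comes with the mosaic package (a sketch; it uses foot width and length rather than actual shoe measurements):

require(mosaic)
# Total difference: how much narrower girls' feet are overall
lm(width ~ sex, data = KidsFeet)
# Partial difference: how much narrower girls' feet are, holding foot length constant
lm(width ~ length + sex, data = KidsFeet)

If the sex coefficient shrinks toward zero once length is held constant, the narrowness is mostly a matter of girls' feet being shorter.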

Wages

Choice of Covariates

Consider the models A ~ B+C or A ~ B*C

When you include the covariate C and look at the coefficient on B, you are essentially looking at the partial derivative of A with respect to B, holding C constant.

If you want to look at the total effect, you can either …
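As the used-car example below illustrates, one route to the total effect is simply to leave the covariate out of the model; another is to combine the partial derivatives with how the covariate itself changes. A minimal sketch with the placeholder names A, B, C, and dat:

# A, B, C, dat are placeholders, not real data
# Partial effect of B on A: C is in the model, so it is held constant
lm(A ~ B + C, data = dat)
# Total effect of B on A: C is left out, so it is free to change along with B
lm(A ~ B, data = dat)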

Example: Looking at the Used Car Price Models

cars = fetchData("used-hondas.csv")
## Retrieving from http://www.mosaic-web.org/go/datasets/used-hondas.csv
mod1 = lm(Price ~ Mileage, data = cars)        # mileage alone
mod2 = lm(Price ~ Mileage * Age, data = cars)  # mileage and age, with an interaction
amod = lm(Age ~ Mileage - 1, data = cars)      # age as a function of mileage. Why suppress the intercept?

A 10,000-mile increase in mileage is associated with about a 0.57-year increase in age (from the amod coefficient: 10000 miles × 5.7e-05 years/mile ≈ 0.57 years).

Looking at the size of the effect:

coef(mod1)
## (Intercept)     Mileage 
##  20766.5803     -0.1013
coef(mod2)
## (Intercept)     Mileage         Age Mileage:Age 
##   2.214e+04  -9.414e-02  -7.495e+02   3.450e-03
coef(amod)
##   Mileage 
## 5.728e-05
f1 = makeFun(mod1)  # model price as a function of Mileage
f2 = makeFun(mod2)  # model price as a function of Mileage and Age

Looking at the coefficients from the two models: the first says that price decreases by about 10 cents for each additional mile. For mod2, for a 4-year-old car, the partial derivative of price with respect to mileage is about -8 cents/mile.
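In symbols, mod2 gives \( \partial \text{Price}/\partial \text{Mileage} = -0.0941 + 0.00345\,\text{Age} \) dollars per mile (using the coefficients above), which at Age = 4 is \( -0.0941 + 4 \times 0.00345 \approx -0.080 \) dollars/mile.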

f1(Mileage = 40000) - f1(Mileage = 50000)  # mod1: mileage alone
##    1 
## 1013
f2(Mileage = 40000, Age = 4) - f2(Mileage = 50000, Age = 4)  # partial: Age held at 4
##     1 
## 803.4
f2(Mileage = 40000, Age = 4) - f2(Mileage = 50000, Age = 4.57)  # total: Age increases along with Mileage
##    1 
## 1132
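The third comparison is the total effect: from amod, a 10,000-mile increase in mileage typically comes with a 0.57-year increase in age, so Age is moved from 4 to 4.57 along with Mileage. In derivative form this is the chain rule, \( \frac{d\,\text{Price}}{d\,\text{Mileage}} = \frac{\partial\,\text{Price}}{\partial\,\text{Mileage}} + \frac{\partial\,\text{Price}}{\partial\,\text{Age}} \cdot \frac{d\,\text{Age}}{d\,\text{Mileage}} \). It pushes mod2's answer ($1132) close to the total effect seen in mod1 ($1013), compared with the partial effect of $803.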

Activity

Look at the Saratoga house price data. Build and evaluate appropriate models to answer these questions (a sketch of possible starting models follows the list):

(1) What's the effect of adding a fireplace?

(2) What's the effect of adding a bedroom?

(3) What's the effect of partitioning off the living room to create a bedroom?
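A sketch of possible starting models, assuming the SaratogaHouses data frame from the mosaicData package (the names price, fireplaces, bedrooms, and livingArea are assumptions; substitute whatever Saratoga data you actually fetch):

require(mosaicData)  # assumed source of the SaratogaHouses data frame
# (1) Fireplace: total effect, and the effect holding house size constant
lm(price ~ fireplaces, data = SaratogaHouses)
lm(price ~ fireplaces + livingArea, data = SaratogaHouses)
# (2) Bedroom: adding a bedroom usually adds floor area too (total effect) ...
lm(price ~ bedrooms, data = SaratogaHouses)
# ... versus a bedroom with living area held fixed (partial effect)
lm(price ~ bedrooms + livingArea, data = SaratogaHouses)
# (3) Partitioning off the living room adds a bedroom while livingArea stays the same;
#     one reading: the partial coefficient on bedrooms (livingArea held constant) is the relevant one.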

Activity

Total-vs-partial In-class activity

Longitudinal running data

Running data: compare the cross-sectional to the longitudinal data to get at how running time changes with age. The question: do we hold the individual runner constant or not?

f = fetchData("Cherry-Blossom-Long.csv")
## Retrieving from
## http://www.mosaic-web.org/go/datasets/Cherry-Blossom-Long.csv
nrow(f)
## [1] 41248
sample(f, size = 5)
##                  name.yob age   net    gun sex year previous nruns
## 40900 william noonan 1972  30 75.70  76.47   M 2002        2     4
## 24513     lidia beer 1957  50    NA  83.17   F 2007        6     8
## 30373  nicole atwell 1978  27 79.20  81.12   F 2005        2     5
## 16747   jeffrey furr 1972  30 64.78  65.03   M 2002        1     2
## 30626   oliver bragg 1928  71    NA 109.25   M 1999        0     3
##       orig.ids
## 40900    40900
## 24513    24513
## 30373    30373
## 16747    16747
## 30626    30626
f = subset(f, nruns > 5)  # keep only runners with more than 5 races
nrow(f)
## [1] 6186

Two models

mod1 = lm(net ~ age, data = f)             # cross-sectional
mod2 = lm(net ~ age + name.yob, data = f)  # longitudinal: holds the individual runner constant

Model 1 suggests a slowing of about 0.3 minutes of net time per year of age, but Model 2's estimate is much larger: about 0.84 minutes per year of age.

coef(mod1)
## (Intercept)         age 
##      72.051       0.305
head(coef(mod2))
##                 (Intercept)                         age 
##                     59.0891                      0.8393 
##     name.yobabiy zewde 1967   name.yobadam anthony 1966 
##                     13.5079                     -8.3130 
## name.yobadam stolzberg 1976   name.yobai-mei chang 1964 
##                    -11.3294                      4.6991

Note how substantially the age dependence differs depending on whether you are looking longitudinally or cross-sectionally.
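One way to see the contrast in a picture (a sketch; the plotting choices are assumptions, not from the notes):

# Each line follows one runner over the years (the longitudinal view);
# the overall cloud of runners taken together is what the cross-sectional model sees.
xyplot(net ~ age, data = f, groups = name.yob, type = "l", col = "gray")
g1 = makeFun(mod1)
plotFun(g1(age) ~ age, add = TRUE)  # overlay the cross-sectional fit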

Grades and the GPA

grades = fetchData("grades.csv")
## Retrieving from http://www.mosaic-web.org/go/datasets/grades.csv
g2pt = fetchData("grade-to-number.csv")
## Retrieving from http://www.mosaic-web.org/go/datasets/grade-to-number.csv
courses = fetchData("courses.csv")
## Retrieving from http://www.mosaic-web.org/go/datasets/courses.csv
grades = merge(grades, g2pt)  # Convert letter grade to number
grades = merge(grades, courses)  # attach course information (dept, level, enroll)

Compute the GPA in the ordinary way:

options(na.rm = TRUE)
head(mean(gradepoint ~ sid, data = grades))
## S31185 S31188 S31191 S31194 S31197 S31200 
##  2.413  3.018  3.212  3.359  3.331  2.186

… or by a model

conventional = coef(lm(gradepoint ~ sid - 1, data = grades))  # one coefficient per student (intercept suppressed)
head(conventional)
## sidS31185 sidS31188 sidS31191 sidS31194 sidS31197 sidS31200 
##     2.413     3.018     3.212     3.359     3.331     2.186

What can we hold constant? Department, level, class enrollment?

adjusted = coef(lm(gradepoint ~ sid - 1 + dept + level + enroll, data = grades))  # student coefficients, adjusting for course characteristics
head(adjusted)
## sidS31185 sidS31188 sidS31191 sidS31194 sidS31197 sidS31200 
##     2.519     3.190     3.137     3.461     3.356     2.142

How do they compare?

xyplot(conventional ~ adjusted[1:443], pch = 20)  # the first 443 coefficients of 'adjusted' are the sid terms; the rest belong to dept, level, enroll

[Plot: conventional GPA vs. adjusted GPA]

Or in terms of class rank:

xyplot(rank(conventional) ~ rank(adjusted[1:443]), pch = 20)

[Plot: class rank, conventional vs. adjusted]

Suppose the cut-off for class rank were \( \geq 150 \). There are students who pass by the adjusted criterion but fail by the unadjusted one.

xyplot(rank(conventional) ~ rank(adjusted[1:443]), pch = 20)
plotFun(y >= 150 ~ x & y, add = TRUE)  # cut-off on the conventional rank (vertical axis)
plotFun(x >= 150 ~ x & y, add = TRUE)  # cut-off on the adjusted rank (horizontal axis)

[Plot: ranks with the rank-150 cut-offs overlaid]

The “takes easy courses” index, rank(conventional) - rank(adjusted): a positive value means a student ranks higher before adjusting for course characteristics than after, suggesting a schedule of easy courses.

densityplot(~rank(conventional) - rank(adjusted[1:443]))

[Plot: density of the “takes easy courses” index]