Stats 155 Class Notes 2012-09-24

Form Groups for Car Project

Outline the used car project and show some data from http://www.cars.com. Perhaps MINI Coopers.

Get students to form groups right now, oriented around different kinds of cars. They should make a Google Spreadsheet and share it with all the other members of their group.

Broad Review

The statistical models we are studying account for the variation in a response variable using the variation in explanatory variables. Some points:

fetchData("m155development.R")  # load the software
## Retrieving from http://www.mosaic-web.org/go/datasets/m155development.R
## [1] TRUE
f = listFun(Antelope = 2, Beaver = -1, Cougar = 7)
f("Cougar")
## Cougar 
##      7 

Introducing \( R^2 \) Briefly

Since we're trying to account for variation in the response variable, it's helpful to have a way to quantify how much variation we have accounted for. The quantity called “R-squared” — \( R^2 \), officially “the coefficient of determination” although hardly anyone uses that name — is a standard way to do this. Briefly, \( R^2 \) is the fraction of variance in the response variable that is “captured” by the model. It is always between 0 (didn't capture any) and 1 (got it all!).

The Overall Modeling Process

The modelling process, in this framework, is

  1. Choose the response variable, based on the motivation you have for constructing your model.
  2. Pick explanatory variables
  3. Decide which terms to construct from the explanatory variables.
  4. Fit the model — this is automatic, given (3) and the data.
  5. Interpret the model, including consideration of the residuals. We'll spend a lot of time on this in the second half of the course.
  6. Perhaps modify the model based on the information in (5) and go back to step (2). There may be many iterations of this.
  7. Interpret the model in terms of the scientific or policy questions of interest. This may also involve iterating back to steps (1) or (2)

Right now, we are going to focus on step (3). For now, we'll pretend that (1) and (2) are obvious — but (2) especially can involve considerable nuance and debate. We'll also spend a bit of time on the techniques used in (7).

The Math 135 Heuristic and two upcoming heuristics

Reminder of what “heuristic” means and why heuristics are useful.

Possible definition “Computing a solution by rules that are loosely defined.”

From Wikipedia:

Heuristic refers to experience-based techniques for problem solving, learning, and discovery. Where an exhaustive search is impractical, heuristic methods are used to speed up the process of finding a satisfactory solution. Examples of this method include using a rule of thumb, an educated guess, an intuitive judgment, or common sense.

In more precise terms, heuristics are strategies using readily accessible, though loosely applicable, information to control problem solving in human beings and machines.

We use heuristics to provide guidance and a ready way to move forward, while allowing room for judgment.

Back in Applied Calculus, we introduced a heuristic for including terms in a polynomial model.

That heuristic is still valid. In statistics, we'll be able to introduce another heuristic, based on the amount of evidence we have in the data to support the inclusion of any particular term. We won't be able to introduce this for a few weeks.

At the very end of the course, we'll introduce still another heuristic, based on the connections between variables in our theory of how the system we are modeling works.

Notation

In the lm() program (and mm() and glm() programs), we write the response variable to the left of the ~ and the model terms, separated by the + notation, on the right side. Don't confuse + with mathematical addition. It really is just a separator for the various terms we are going to be including in the linear combination that will constitute the model.

Example of simple models with and without the intercept term.

lm(height ~ 1 + father, data = Galton)
## 
## Call:
## lm(formula = height ~ 1 + father, data = Galton)
## 
## Coefficients:
## (Intercept)       father  
##      39.110        0.399  
## 
lm(height ~ father, data = Galton)  # same thing as previous
## 
## Call:
## lm(formula = height ~ father, data = Galton)
## 
## Coefficients:
## (Intercept)       father  
##      39.110        0.399  
## 
lm(height ~ father - 1, data = Galton)  # leave out intercept
## 
## Call:
## lm(formula = height ~ father - 1, data = Galton)
## 
## Coefficients:
## father  
##  0.964  
## 
lm(height ~ father + mother + father:mother, data = Galton)
## 
## Call:
## lm(formula = height ~ father + mother + father:mother, data = Galton)
## 
## Coefficients:
##   (Intercept)         father         mother  father:mother  
##      132.3478        -1.2060        -1.4294         0.0247  
## 
lm(height ~ father * mother, data = Galton)
## 
## Call:
## lm(formula = height ~ father * mother, data = Galton)
## 
## Coefficients:
##   (Intercept)         father         mother  father:mother  
##      132.3478        -1.2060        -1.4294         0.0247  
## 
lm(height ~ cos(father), data = Galton)  # silly model!
## 
## Call:
## lm(formula = height ~ cos(father), data = Galton)
## 
## Coefficients:
## (Intercept)  cos(father)  
##     66.7618      -0.0094  
## 
swim = fetchData("swim100m.csv")
## Retrieving from http://www.mosaic-web.org/go/datasets/swim100m.csv
swimmod = lm(time ~ sex * exp(-year/50), data = swim)
xyplot(time + fitted(swimmod) ~ year, data = swim)

plot of chunk unnamed-chunk-6

(N.B.: For instructors … Beware that the above exponential model just barely makes it within the floating-point arithmetic capabilities of the computer. Applying an exponential to numbers arguments like \( -20 \) can be perilous. Better to have rescaled year to, say, start at 1900.)

swimmod2 = lm(time ~ sex * poly(year, 2), data = swim)
xyplot(time + fitted(swimmod2) ~ year, data = swim)

plot of chunk unnamed-chunk-7

N.B. For instructors … You may be tempted to write polynomial terms directly, for instance

swimmod3 = lm(time ~ sex * (year + I(year^2)), data = swim)
xyplot(time + fitted(swimmod3) ~ year, data = swim)

plot of chunk unnamed-chunk-8

Note the use of I() to instruct R to treat arithmetical symbols in the numerical way. Also, note that poly() cleverly does a shift into orthogonal polynomials to preserve numerical stability. As such, the specific coefficients in swimmod2 and swimmod3 will be different, even though the model formulas amount to the same thing.

swimmod2
## 
## Call:
## lm(formula = time ~ sex * poly(year, 2), data = swim)
## 
## Coefficients:
##         (Intercept)                 sexM       poly(year, 2)1  
##               64.57                -9.64               -76.13  
##      poly(year, 2)2  sexM:poly(year, 2)1  sexM:poly(year, 2)2  
##               27.23                36.31               -22.49  
## 
swimmod3
## 
## Call:
## lm(formula = time ~ sex * (year + I(year^2)), data = swim)
## 
## Coefficients:
##    (Intercept)            sexM            year       I(year^2)  
##       1.90e+04       -1.54e+04       -1.90e+01        4.79e-03  
##      sexM:year  sexM:I(year^2)  
##       1.56e+01       -3.96e-03  
## 

In order to deal with the possible double meaning of symbols like +, you can use I() to signal to R to treat the mathematical operations literally.

QUESTION: What's unrealistic about a parabolic form for world records?

Activity

Visual choice of model terms

Instructions are given here (These have to be reformatted as a handout and as an instructor's guide.)

The only substantive differences is in the way the software is loaded:

fetchData("mLM.R")
## Retrieving from http://www.mosaic-web.org/go/datasets/mLM.R
## [1] TRUE

The program itself is called mLM() (not lm.select.terms())

Model Formulas and Coefficients

Fit some models and work students through the calculations. The makeFun() function can help in this.

A suggested style for a calculation:

mod = lm(wage ~ sex + age, data = CPS85)
coef(mod)
## (Intercept)        sexM         age 
##     4.65425     2.27469     0.08522 

What is the wage for a female of age 40? A male of age 30? Work through the arithmetic “by hand”.

4.65 + 0 * 2.274 + 40 * 0.0852  # for the 40-year old female
## [1] 8.058
4.65 + 1 * 2.274 + 30 * 0.0852  # for the 30-year old male
## [1] 9.48

You can also turn the model information into a mathematical function:

f = makeFun(mod)
f(sex = "F", age = 40)
##     1 
## 8.063 
f(sex = "M", age = 30)
##     1 
## 9.485 

Other settings: