February 22, 2013 Class Notes

Where We Are

  1. You can fit models.
  2. The model partitions variation between what's explained and what's not explained: the fitted model values and the residuals.
  3. Fitting means setting parameters to make the residuals as small as possible. This is automatic.
  4. You can also make the residuals smaller by adding new explanatory variables and terms to the model. Does this mean we should add in many things to a model?
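A minimal sketch of point 4, using simulated data rather than a course dataset (the names `x`, `y`, `junk`, and `rss` are made up for illustration). The variable `junk` is generated independently of the response, yet adding it still does not make the sum of squared residuals any larger:

```r
# Simulated data (an assumption for illustration; not a course dataset).
set.seed(101)
n <- 40
x <- rnorm(n)
y <- 5 + 2 * x + rnorm(n)
junk <- rnorm(n)  # generated independently of y, so it has no real relationship to it

# Sum of squared residuals for a fitted model
rss <- function(mod) sum(resid(mod)^2)

rss_base <- rss(lm(y ~ x))
rss_junk <- rss(lm(y ~ x + junk))

# Compare the two; least squares can always set the junk coefficient to zero,
# so the larger model can never do worse on the data used for fitting.
c(base = rss_base, with_junk = rss_junk)
```

Smaller residuals are therefore not, by themselves, evidence that a new term belongs in the model.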

Where We Need to Be

  1. How to interpret a model with multiple variables? With interaction terms and transformation terms, the “effect” of a variable can be spread around several terms in the model.
  2. How to decide when enough is enough in adding new terms to a model?

This week and next, we'll be working with three main ideas:

  1. Collinearity.
  2. Redundancy (which is the extreme state of collinearity)
  3. Measuring how much of the variation in response values a model has captured. (You may already know that \( R^2 \) does this, but it helps to know why it's even possible to do at all.)
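The difference between the first two ideas can be sketched with simulated vectors (all names here are hypothetical). In this sketch `b` points in nearly the same direction as `a`, while `r` is an exact linear combination of `a` and the intercept, so it adds nothing new to the model:

```r
# Simulated vectors (an assumption for illustration).
set.seed(102)
n <- 30
a <- rnorm(n)
b <- a + rnorm(n, sd = 0.1)  # almost the same direction as a: collinear
r <- 2 * a - 3               # exactly determined by a and the intercept: redundant
y <- 1 + a + rnorm(n)

cor_ab <- cor(a, b)               # close to 1
coef_collinear <- coef(lm(y ~ a + b))  # both coefficients are still estimated
coef_redundant <- coef(lm(y ~ a + r))  # lm() reports NA for the redundant term

cor_ab
coef_collinear
coef_redundant
```

With collinear vectors the fit still works, though the individual coefficients can be hard to pin down. With a redundant vector there is no unique answer at all, which is why `lm()` leaves that coefficient as `NA`.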

The tool that we will use to explore these ideas is geometry. Some people will find that this immediately illuminates what's going on. Some people will not find it helpful at all. I don't know why there are such different reactions from different people. I encourage you to keep an open mind as we talk about geometry. The worst case is that you will spend a couple of hours (spread over the next weeks) that won't lead anywhere for you. That's not too big a bet to place on the possibility that you may find it truly useful, as many people do.

Untangling

Finish Untangling House Prices

houses = fetchData("SaratogaHouses.csv")
## Retrieving from http://www.mosaic-web.org/go/datasets/SaratogaHouses.csv
lm(Price ~ Bedrooms, data = houses)
## 
## Call:
## lm(formula = Price ~ Bedrooms, data = houses)
## 
## Coefficients:
## (Intercept)     Bedrooms  
##        4599        51703
lm(Price ~ Bedrooms + Living.Area, data = houses)
## 
## Call:
## lm(formula = Price ~ Bedrooms + Living.Area, data = houses)
## 
## Coefficients:
## (Intercept)     Bedrooms  Living.Area  
##       13458        -9030          101
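The sign change on Bedrooms can be reproduced with made-up data. This is only a sketch: the numbers below are simulated under assumed values, not taken from the Saratoga data, and the names `area`, `bedrooms`, and `price` are stand-ins. Bedrooms and living area are built to be strongly related, and price is built to rise with area and fall slightly with bedrooms once area is held constant:

```r
# Simulated houses (assumed values; NOT the SaratogaHouses data).
set.seed(103)
n <- 500
area <- rnorm(n, mean = 1800, sd = 600)       # living area, square feet
bedrooms <- area / 600 + rnorm(n, sd = 0.5)   # bigger houses tend to have more bedrooms
price <- 10000 + 100 * area - 9000 * bedrooms + rnorm(n, sd = 15000)

alone <- coef(lm(price ~ bedrooms))            # bedrooms stands in for overall size
together <- coef(lm(price ~ bedrooms + area))  # bedrooms at a fixed living area

alone["bedrooms"]
together["bedrooms"]
```

In the first model the bedrooms coefficient also carries the effect of house size. In the second it describes adding a bedroom without adding living area.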

SAT and School Expenditures

Simpson's paradox in expenditures and fraction taking the SAT.

Least Squares

Activity: fitting a line visually.

Load the software:

fetchData("mLineFit.R")
## Retrieving from http://www.mosaic-web.org/go/datasets/mLineFit.R
## [1] TRUE

Then run it:

mLineFit(width ~ length, data = KidsFeet)

Vary the slope and intercept using the sliders.
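What the sliders are doing can be sketched without the interactive tool. The data below are simulated stand-ins for foot length and width (assumed values, not the KidsFeet data), and `ssr` is a hypothetical helper that scores any candidate line by its sum of squared residuals:

```r
# Simulated stand-ins for foot measurements (assumed values; not KidsFeet).
set.seed(104)
len <- rnorm(39, mean = 25, sd = 1.3)        # hypothetical lengths (cm)
wid <- 3 + 0.24 * len + rnorm(39, sd = 0.4)  # hypothetical widths (cm)

# Sum of squared residuals for a candidate line
ssr <- function(intercept, slope) sum((wid - (intercept + slope * len))^2)

guess_ssr <- ssr(intercept = 2, slope = 0.3)  # a line chosen "by eye"

best <- unname(coef(lm(wid ~ len)))           # the least-squares line
fitted_ssr <- ssr(best[1], best[2])

c(by_eye = guess_ssr, least_squares = fitted_ssr)
```

No choice of intercept and slope gives a smaller sum of squared residuals than the line `lm()` finds.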

QUESTIONS:

Aside: “… in the context of what remains unexplained”

Although model fitting makes the residuals as small as possible for a given model “design”, it's not appropriate to choose a design solely on the basis of making the residuals small.

Doing so ignores sampling variation. A different random sample would give different residuals, so a design chosen only to shrink the residuals in this sample is partly fitting that sample's random quirks.
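A sketch of the sampling issue, using simulated data (the helper names `make_data` and `sse` are hypothetical). Two samples are drawn from the same process; the models are fitted to the first and then scored on both. The fifteen extra columns are pure noise by construction:

```r
# Two random samples from the same simulated process (an assumption for illustration).
set.seed(105)
make_data <- function(n) {
  d <- data.frame(x = rnorm(n), matrix(rnorm(n * 15), n, 15))  # x plus 15 noise columns
  d$y <- 1 + 2 * d$x + rnorm(n)                                # only x matters
  d
}
train <- make_data(30)
fresh <- make_data(30)

small <- lm(y ~ x, data = train)  # the design that matches how y was made
big <- lm(y ~ ., data = train)    # the same, plus all 15 noise columns

# Sum of squared prediction errors of a model on a data set
sse <- function(mod, d) sum((d$y - predict(mod, newdata = d))^2)

in_sample <- c(small = sse(small, train), big = sse(big, train))
new_sample <- c(small = sse(small, fresh), big = sse(big, fresh))
in_sample
new_sample
```

On the sample used for fitting, the big model always has the smaller residuals. On the fresh sample it typically does worse, because what it gained came from the quirks of the first sample.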

Statistical Geometry

Vectors, abstractly

Case space versus variable space

Adam and Eve picture

Activity comparing case space and variable space

Fitting with one explanatory vector in variable space.
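A small sketch of fitting with a single explanatory vector, using made-up numbers. In variable space the fitted vector is the projection of `y` onto `x`, so the coefficient is the ratio of two dot products. The comparison model omits the intercept so that `x` is the only model vector:

```r
# Made-up vectors with five cases (an assumption for illustration).
y <- c(3, 1, 4, 1, 5)
x <- c(1, 2, 3, 4, 5)

b <- sum(x * y) / sum(x * x)  # projection coefficient: (x . y) / (x . x)
fit <- b * x                  # fitted vector: the point on the x line closest to y
res <- y - fit                # residual vector

b_lm <- unname(coef(lm(y ~ x - 1)))  # same model fitted by lm(), with no intercept

c(by_projection = b, by_lm = b_lm)
sum(res * x)                  # dot product of residual and x: zero up to rounding
```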

Show that the residual is perpendicular to the model. The model triangle.
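The model triangle can be checked numerically. This is a sketch on simulated data (assumed values). The fitted vector and the residual vector meet at a right angle, so their squared lengths add up to the squared length of the response. After subtracting the mean, the same triangle gives \( R^2 \) as a ratio of squared lengths:

```r
# Simulated data (an assumption for illustration).
set.seed(106)
n <- 25
x <- rnorm(n)
y <- 4 + 1.5 * x + rnorm(n)

mod <- lm(y ~ x)
f <- fitted(mod)
r <- resid(mod)

dot_fr <- sum(f * r)                 # right angle: zero up to rounding
lengths_sq <- c(response = sum(y^2), fitted_plus_resid = sum(f^2) + sum(r^2))

# Centered version of the triangle: the ratio of squared lengths is R^2
r2_geometry <- sum((f - mean(f))^2) / sum((y - mean(y))^2)
r2_summary <- summary(mod)$r.squared

dot_fr
lengths_sq
c(geometry = r2_geometry, summary = r2_summary)
```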

Calculations using the computer [Math_155_Activity_on_Statistical_Geometry_and_Computing]: ask students to build their own models using their own data and confirm that the residuals are perpendicular to each explanatory vector.

Example:

kids = fetchData("kidsfeet.csv")
## Retrieving from http://www.mosaic-web.org/go/datasets/kidsfeet.csv
mod = lm(width ~ length * sex + domhand, data = kids)
sum(resid(mod) * kids$length)
## [1] -3.775e-15
sum(resid(mod) * (kids$sex == "B"))
## [1] 2.22e-16
sum(resid(mod) * (kids$domhand == "R"))
## [1] 1.013e-15

Looking forward