February 25, 2013 Class Notes

Statistical Geometry

Back to Case Space versus Variable Space

Adam and Eve picture

lm() does what we do by eye

  1. Make a vector A (small integer values), e.g. A = c(-3,4)
  2. Make a vector B, e.g. B = c(0,2)
  3. Draw the vectors on the board.
  4. Eyeball the coefficient by projection.
  5. Fit the model A ~ B - 1 and pull off the coefficient.
  6. Now try the model A ~ B + 1 (including the intercept).
  7. Show how to fit the model by eye using the vector walk.
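
The steps above can be checked with a little arithmetic. This is a minimal sketch using the example vectors from steps 1 and 2; the names by_hand and by_lm are made up for illustration.

```r
# Vectors from the board example
A = c(-3, 4)
B = c(0, 2)

# Projection of A onto B: coefficient = (A . B) / (B . B)
by_hand = sum(A * B) / sum(B * B)

# The same coefficient from lm(), with the intercept removed
by_lm = unname(coef(lm(A ~ B - 1)))

by_hand  # 8 / 4 = 2
by_lm    # matches by_hand

# With the intercept there are two model vectors, (1, 1) and B.
# In this two-case example they reach A exactly: A = -3 * (1, 1) + 3.5 * B
coef(lm(A ~ B + 1))
```

The last line is the vector walk of step 7 done numerically: go -3 along the intercept vector, then 3.5 along B.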

Review of Geometry through Arithmetic

From Model Terms to Vectors

Derive the model vectors for interaction terms.

Make a small, illustrative data set and a model from it.

small = sample(CPS85, size = 5)[, c("wage", "educ", "sector")]
small
##      wage educ   sector
## 266  6.00   15  service
## 513  6.50   12     prof
## 102  3.56   12 clerical
## 184  6.25   13 clerical
## 300 16.65   14    manag

Write down the vectors as columns of numbers

with(small, wage)
## [1]  6.00  6.50  3.56  6.25 16.65
with(small, educ)
## [1] 15 12 12 13 14
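
To get the vectors for an interaction term, one option is to ask R for the model matrix. This sketch types in the five cases printed above by hand (so that it does not depend on the random sample) and assumes the model wage ~ educ * sector, which is an illustrative choice and not necessarily the model used in class.

```r
# The five cases printed above, typed in by hand
small = data.frame(
  wage   = c(6.00, 6.50, 3.56, 6.25, 16.65),
  educ   = c(15, 12, 12, 13, 14),
  sector = factor(c("service", "prof", "clerical", "clerical", "manag"))
)

# One column per model vector for wage ~ educ * sector
X = model.matrix(wage ~ educ * sector, data = small)
colnames(X)

# An interaction vector is the element-by-element product of the
# quantitative vector and the indicator vector for one level
by_hand = small$educ * (small$sector == "manag")
cbind(by_hand, from_R = X[, "educ:sectormanag"])
```

The two columns printed at the end should agree case by case: the interaction vector equals educ for the manag case and zero everywhere else.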

Geometry of Fitting with Multiple Vectors

  1. Diagram with two explanatory vectors
  2. Why coefficients change as you go from one explanatory vector to two.

What's going on with

Trashing the Intercept

Almost all models will (and should!) include an intercept term.

You've seen that the value of the intercept is not always directly relevant: in time ~ year it is the extrapolated swimming time in year 0, back in the Roman era!

Since we're interested in explaining variation, we might as well eliminate the part of each variable that isn't about variation, the mean. Many important statistical measures do exactly that.

Let's compare models with and without the mean subtracted, fitting each one with and without an intercept:

swim = fetchData("swim100m.csv")
## Retrieving from http://www.mosaic-web.org/go/datasets/swim100m.csv
mod1 = lm(time ~ year, data = swim)
coef(mod1)
## (Intercept)        year 
##    567.2420     -0.2599
mod1b = lm(time ~ year - 1, data = swim)
coef(mod1b)  # Completely different!
##    year 
## 0.03063
swim = transform(swim, timeV = time - mean(time), yearV = year - mean(year))
mod2 = lm(timeV ~ yearV, data = swim)
coef(mod2)
## (Intercept)       yearV 
##  -2.198e-14  -2.599e-01
mod2b = lm(timeV ~ yearV - 1, data = swim)
coef(mod2b)  # same slope as mod1 and mod2
##   yearV 
## -0.2599

Measuring alignment

Use the angle between the vectors.

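More sophisticated: take into account that there is (almost always) an intercept term in the model. Subtract the mean from each vector first; the cosine of the angle between the mean-subtracted vectors is the correlation coefficient \( r \).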
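
A small sketch of both measures, using the wage and educ vectors from the five-case data set earlier in these notes. The helper cos_angle is a made-up name, not a function from any package.

```r
wage = c(6.00, 6.50, 3.56, 6.25, 16.65)
educ = c(15, 12, 12, 13, 14)

# Cosine of the angle between two vectors: (u . v) / (|u| |v|)
cos_angle = function(u, v) sum(u * v) / sqrt(sum(u * u) * sum(v * v))

# Raw vectors: dominated by the fact that both have large means
cos_angle(wage, educ)

# Subtract the means first, as in the "Trashing the Intercept" section
wageV = wage - mean(wage)
educV = educ - mean(educ)
r_by_hand = cos_angle(wageV, educV)

r_by_hand                    # cosine of the angle between centered vectors
cor(wage, educ)              # the built-in correlation coefficient
acos(r_by_hand) * 180 / pi   # the angle itself, in degrees
```

The second and third printed values should agree: the correlation coefficient is the cosine of the angle between the mean-subtracted vectors.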

The Consequences of Alignment

When explanatory vectors aren't perpendicular, the coefficients depend on both vectors. Going from A ~ B to A ~ B + C will change the coefficient on B.
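
A tiny made-up example can show the effect. The vectors below are chosen only for easy arithmetic: C is not perpendicular to B, while D is.

```r
A = c(1, 2, 3)
B = c(1, 0, 0)
C = c(1, 1, 0)   # B . C = 1, so B and C are not perpendicular
D = c(0, 1, 0)   # B . D = 0, so B and D are perpendicular

# Coefficient on B when B is the only explanatory vector
b_alone  = unname(coef(lm(A ~ B - 1))["B"])

# Coefficient on B after adding an aligned vector
b_with_C = unname(coef(lm(A ~ B + C - 1))["B"])

# Coefficient on B after adding a perpendicular vector
b_with_D = unname(coef(lm(A ~ B + D - 1))["B"])

c(alone = b_alone, with_C = b_with_C, with_D = b_with_D)
```

Adding C moves the coefficient on B from 1 to -1, while adding the perpendicular vector D leaves it at 1.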

Residual and Explanatory Vectors are Orthogonal

Show that the residual is orthogonal to each and every model vector. It doesn't matter which vector or which model.
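
One way to show this numerically is to take the dot product of the residual with every column of the model matrix. The sketch below reuses the five wage and educ values from earlier in these notes; the two models are arbitrary choices for illustration.

```r
wage = c(6.00, 6.50, 3.56, 6.25, 16.65)
educ = c(15, 12, 12, 13, 14)

# Model with an intercept: model vectors are (Intercept) and educ
mod_a  = lm(wage ~ educ)
dots_a = crossprod(model.matrix(mod_a), resid(mod_a))

# Model without an intercept: the only model vector is educ
mod_b  = lm(wage ~ educ - 1)
dots_b = crossprod(model.matrix(mod_b), resid(mod_b))

# One dot product per model vector; each is zero up to rounding error
dots_a
dots_b
```

The residual of mod_b is orthogonal to educ but not, in general, to the intercept vector, because the intercept is not one of that model's vectors.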

Redundancy

The extreme case of alignment is when the explanatory vectors are exactly parallel. Then you could choose any coefficient you want on one of the vectors and compensate with the coefficient on the other.

Demonstration:

swim = transform(swim, day = year * 365)
lm(time ~ year + day + sex, data = swim)
## 
## Call:
## lm(formula = time ~ year + day + sex, data = swim)
## 
## Coefficients:
## (Intercept)         year          day         sexM  
##     555.717       -0.251           NA       -9.798
lm(time ~ day + year + sex, data = swim)
## 
## Call:
## lm(formula = time ~ day + year + sex, data = swim)
## 
## Coefficients:
## (Intercept)          day         year         sexM  
##    5.56e+02    -6.89e-04           NA    -9.80e+00

What does R do? It drops the redundant vector (whichever of the two comes later in the formula) and reports its coefficient as NA.

There's another, more subtle form of redundancy: when a vector lies in a subspace spanned by other vectors. Let's make the indicator variables.

swim = transform(swim, guys = sex == "M", gals = sex == "F")

Now try some models:

coef(lm(time ~ guys, data = swim))
## (Intercept)    guysTRUE 
##       65.19      -10.54
coef(lm(time ~ gals, data = swim))
## (Intercept)    galsTRUE 
##       54.66       10.54
coef(lm(time ~ guys + gals, data = swim))
## (Intercept)    guysTRUE    galsTRUE 
##       65.19      -10.54          NA
coef(lm(time ~ gals + guys, data = swim))
## (Intercept)    galsTRUE    guysTRUE 
##       54.66       10.54          NA
coef(lm(time ~ gals + guys - 1, data = swim))
## galsFALSE  galsTRUE  guysTRUE 
##     54.66     65.19        NA

Every categorical variable is redundant with the intercept vector: the indicator vectors for all of its levels add up to the intercept vector.

We throw away the indicator vector for one level of the variable to avoid the redundancy.
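
The redundancy can be seen directly with a few made-up cases. The sex values below are invented for illustration and are not taken from the swim data.

```r
# Five made-up cases
sex = factor(c("F", "M", "M", "F", "M"))

intercept = rep(1, length(sex))
gals = as.numeric(sex == "F")
guys = as.numeric(sex == "M")

# The indicator vectors add up to the intercept vector
sums_to_intercept = all(gals + guys == intercept)
sums_to_intercept

# Three vectors, but they span only a two-dimensional subspace
X = cbind(intercept, gals, guys)
rank_X = qr(X)$rank
rank_X
```

Dropping any one of the three columns leaves two vectors that span the same subspace, which is why one indicator gets thrown away when the intercept is in the model.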

Putting the intercept into the model changes the meaning of the coefficient on the other vectors as well as the numerical value.

Draw a vector picture with just sexF. Then add in the indicator for sexM. This is what mm() is doing. Because the two indicator vectors are perpendicular, the coefficients for the different sexes are independent of one another.

Draw a vector picture with just the intercept. Then modify it to add in an indicator variable for sexF. Since the intercept and sexF are aligned, adding in sexF changes the meaning of the intercept, and adding in the intercept changes the meaning of sexF.

Other points.