
lm() does what we do by eye.
A = c(-3, 4)
B = c(0, 2)
A ~ B - 1 and pull off the coefficient.
A ~ B + 1 (including the intercept)
sum(a*b)
Derive the model vectors for interaction terms.
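As a rough check on the by-eye projection, the sketch below computes the coefficient of A on B from dot products and compares it with lm(). The names coef_by_hand, coef_lm, and coef_int are made up for this illustration.

```r
# Vectors from the notes
A <- c(-3, 4)
B <- c(0, 2)

# Projection coefficient of A onto B: (A . B) / (B . B)
coef_by_hand <- sum(A * B) / sum(B * B)

# The same coefficient from lm() with the intercept removed
coef_lm <- unname(coef(lm(A ~ B - 1)))

# With the intercept, A is projected onto the space spanned by
# B and the intercept vector c(1, 1)
coef_int <- coef(lm(A ~ B + 1))

print(coef_by_hand)
print(coef_lm)
print(coef_int)
```

The first two numbers should agree. The third fit has two coefficients for two data points, so it reproduces A exactly.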
Make a small, illustrative data set and a model from it.
small = sample(CPS85, size = 5)[, c("wage", "educ", "sector")]
small
##      wage educ   sector
## 266  6.00   15  service
## 513  6.50   12     prof
## 102  3.56   12 clerical
## 184  6.25   13 clerical
## 300 16.65   14    manag
with(small, wage)
## [1] 6.00 6.50 3.56 6.25 16.65
with(small, educ)
## [1] 15 12 12 13 14
Write out the indicator vectors due to sector. Include one level whose indicator is all zeros in this short data set.
Write out the interaction vectors as the component-wise products of the educ vector with each of the sector indicator vectors.
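A minimal sketch of that construction, typing in the five rows printed above by hand rather than re-sampling. The level "sales" is an assumption: it is included only to supply a level with no cases in this short data set. The names ind and inter are hypothetical.

```r
# The five cases shown above, entered by hand
small <- data.frame(
  wage = c(6.00, 6.50, 3.56, 6.25, 16.65),
  educ = c(15, 12, 12, 13, 14),
  sector = factor(
    c("service", "prof", "clerical", "clerical", "manag"),
    # "sales" has no cases here, so its indicator is all zeros
    levels = c("clerical", "manag", "prof", "sales", "service")
  )
)

# One indicator vector (0/1) per level of sector
ind <- sapply(levels(small$sector),
              function(lev) as.numeric(small$sector == lev))

# Interaction vectors: educ times each indicator, component by component
inter <- ind * small$educ

print(ind)
print(inter)
```

Each column of inter repeats educ where the case belongs to that sector and is zero elsewhere.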
Almost all models will (and should!) include an intercept term.
You've seen that the value of the intercept is not always directly relevant: swimming times in the Roman era!
Since we're interested in explaining variation, we might as well eliminate the part of each variable that isn't about variation, the mean. Many important statistical measures do exactly that.
Let's compare models fit with and without subtracting the mean, each with and without an intercept term:
swim = fetchData("swim100m.csv")
## Retrieving from http://www.mosaic-web.org/go/datasets/swim100m.csv
mod1 = lm(time ~ year, data = swim)
coef(mod1)
## (Intercept) year
## 567.2420 -0.2599
mod1b = lm(time ~ year - 1, data = swim)
coef(mod1b) # Completely different!
## year
## 0.03063
swim = transform(swim, timeV = time - mean(time), yearV = year - mean(year))
mod2 = lm(timeV ~ yearV, data = swim)
coef(mod2)
## (Intercept) yearV
## -2.198e-14 -2.599e-01
mod2b = lm(timeV ~ yearV - 1, data = swim)
coef(mod2b) # same
## yearV
## -0.2599
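The same comparison can be run without downloading anything. This is a sketch on made-up data (x, y, and the numbers used to simulate them are assumptions, not the swim records): once both variables are centered, the slope no longer depends on whether the intercept is in the model.

```r
set.seed(1)

# Made-up stand-ins for year and time
x <- 1900 + 3 * (0:29)
y <- 90 - 0.25 * (x - 1900) + rnorm(30)

# Subtract the mean from each variable
xV <- x - mean(x)
yV <- y - mean(y)

slope_raw      <- unname(coef(lm(y ~ x))["x"])        # intercept included
slope_raw_noint <- unname(coef(lm(y ~ x - 1))["x"])   # intercept removed
slope_centered <- unname(coef(lm(yV ~ xV - 1))["xV"]) # centered, no intercept

print(c(slope_raw, slope_raw_noint, slope_centered))
```

The first and third slopes should match. The second differs because, without centering, dropping the intercept forces the line through the origin.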
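Use the angle between the vectors to describe how closely they are aligned.
More sophisticated: take into account that there is (almost always) an intercept term in the model by subtracting the mean from each vector first. The cosine of the angle between the centered vectors is the correlation coefficient \( r \).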
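A short sketch of the angle calculation on made-up vectors (a, b, and the helper cos_angle are assumptions for illustration). After subtracting the means, the cosine of the angle matches what cor() reports.

```r
set.seed(2)

# Two made-up, related vectors
a <- rnorm(20)
b <- 0.5 * a + rnorm(20)

# Cosine of the angle between two vectors, from dot products
cos_angle <- function(u, v) sum(u * v) / sqrt(sum(u * u) * sum(v * v))

# Center each vector, then measure the angle
r_geom    <- cos_angle(a - mean(a), b - mean(b))
r_builtin <- cor(a, b)
theta_deg <- acos(r_geom) * 180 / pi

print(c(r_geom, r_builtin, theta_deg))
```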
When explanatory vectors aren't perpendicular, the coefficients depend on both vectors. Going from A ~ B to A ~ B + C will change the coefficient on B.
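To see this numerically, here is a sketch with simulated vectors. The intercept is left out (the - 1) so that "perpendicular" means a plain zero dot product. All names and simulation settings are assumptions.

```r
set.seed(3)
n <- 50

B <- rnorm(n)
C <- 0.7 * B + rnorm(n)          # C is partly aligned with B
A <- 2 * B - 1 * C + rnorm(n)

# Coefficient on B alone, then with C added
b_alone  <- unname(coef(lm(A ~ B - 1))["B"])
b_with_C <- unname(coef(lm(A ~ B + C - 1))["B"])

# Remove from C its component along B, leaving a vector perpendicular to B
C_perp <- C - sum(C * B) / sum(B * B) * B
b_with_Cperp <- unname(coef(lm(A ~ B + C_perp - 1))["B"])

print(c(b_alone, b_with_C, b_with_Cperp))
```

Adding C moves the coefficient on B. Adding the perpendicular version C_perp leaves it where it was.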
Show that the residual is orthogonal to each and every model vector. It doesn't matter which vector or which model.
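One way to show it, sketched on simulated data (the data frame d and its columns are made up): take the dot product of the residual with every column of the model matrix.

```r
set.seed(4)

# Made-up data: one quantitative and one categorical explanatory variable
d <- data.frame(
  x = rnorm(30),
  g = factor(rep(c("a", "b", "c"), each = 10))
)
d$y <- 1 + 2 * d$x + rnorm(30)

# Any model will do; this one has main effects and interactions
mod <- lm(y ~ x * g, data = d)

# Columns of the model matrix are the model vectors
X <- model.matrix(mod)

# Dot product of the residual with each model vector
dots <- drop(t(X) %*% resid(mod))
print(round(dots, 10))
```

Every entry should be zero up to rounding error. Changing the formula changes the model vectors but not this result.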
The extreme case of alignment is when the explanatory vectors are exactly parallel. Then you could choose any coefficient you want on one of the vectors and make it up with the others.
Demonstration:
swim = transform(swim, day = year * 365)
lm(time ~ year + day + sex, data = swim)
##
## Call:
## lm(formula = time ~ year + day + sex, data = swim)
##
## Coefficients:
## (Intercept) year day sexM
## 555.717 -0.251 NA -9.798
lm(time ~ day + year + sex, data = swim)
##
## Call:
## lm(formula = time ~ day + year + sex, data = swim)
##
## Coefficients:
## (Intercept) day year sexM
## 5.56e+02 -6.89e-04 NA -9.80e+00
What does R do? It drops the redundant vector: whichever of year and day comes later in the formula gets an NA coefficient.
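The "choose any coefficient you want" claim can be checked directly. This sketch uses made-up data in place of the swim records. Any pair of coefficients on year and day with the same value of b_year + 365 * b_day gives the same fitted values.

```r
set.seed(5)

# Made-up stand-ins for the swim data
year <- seq(1905, 2004, by = 3)
day  <- year * 365                 # exactly parallel to year
time <- 600 - 0.27 * year + rnorm(length(year))

fit <- lm(time ~ year)
b0 <- coef(fit)[["(Intercept)"]]
b1 <- coef(fit)[["year"]]

# Three different ways to split the same total effect between year and day
fitted_a <- b0 + b1 * year + 0 * day
fitted_b <- b0 + 0  * year + (b1 / 365) * day
fitted_c <- b0 + 10 * year + ((b1 - 10) / 365) * day

print(all.equal(fitted_a, fitted_b))
print(all.equal(fitted_a, fitted_c))
```

Since the data cannot distinguish among these choices, lm() settles the matter by keeping one vector and dropping the other.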
There's another, more subtle form of redundancy: when a vector lies in a subspace spanned by other vectors. Let's make the indicator variables.
swim = transform(swim, guys = sex == "M", gals = sex == "F")
Now try some models:
coef(lm(time ~ guys, data = swim))
## (Intercept) guysTRUE
## 65.19 -10.54
coef(lm(time ~ gals, data = swim))
## (Intercept) galsTRUE
## 54.66 10.54
coef(lm(time ~ guys + gals, data = swim))
## (Intercept) guysTRUE galsTRUE
## 65.19 -10.54 NA
coef(lm(time ~ gals + guys, data = swim))
## (Intercept) galsTRUE guysTRUE
## 54.66 10.54 NA
coef(lm(time ~ gals + guys - 1, data = swim))
## galsFALSE galsTRUE guysTRUE
## 54.66 65.19 NA
Every categorical variable is redundant with the intercept vector: the indicator vectors for all of its levels add up to the intercept vector.
We throw away the indicator vector for one level of the variable to avoid the redundancy.
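A small sketch of this redundancy on a made-up factor (sex here is six hand-typed values, not the swim data):

```r
# A made-up categorical variable
sex <- factor(c("F", "M", "M", "F", "M", "F"))

# Full set of indicator vectors, one per level
ind <- sapply(levels(sex), function(lev) as.numeric(sex == lev))
intercept <- rep(1, length(sex))

# The indicators add up to the intercept vector
sums_to_intercept <- all(rowSums(ind) == intercept)

# So intercept + both indicators span only 2 dimensions, not 3
rank_full <- qr(cbind(intercept, ind))$rank

# model.matrix() keeps the intercept and drops the first level's indicator
X <- model.matrix(~ sex)

print(sums_to_intercept)
print(rank_full)
print(colnames(X))
```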
Putting the intercept into the model changes the meaning of the coefficient on the other vectors as well as the numerical value.
Draw a vector picture with just sexF. Then add in the indicator for sexM. This is what mm() is doing. The two indicator vectors are perpendicular, so the coefficients for the different sexes are independent of one another.
Draw a vector picture with just the intercept. Then modify it to add in an indicator variable for sexF. Since the intercept and sexF are aligned, adding in sexF changes the meaning of the intercept, and adding in the intercept changes the meaning of sexF.
lm() includes the intercept by default; mm() does not. By including the intercept vector, one of the indicator vectors from each categorical variable is made redundant, and the meaning of the coefficients on the other vectors changes: difference from reference group rather than groupwise mean.
Example: sex and sector in the CPS85 data.
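The change in meaning can be seen in a sketch on simulated data. Here lm(time ~ sex - 1) is used as a stand-in for a groupwise-mean model, on the assumption that, for a single categorical variable, removing the intercept gives one coefficient per group. The data frame d is made up.

```r
set.seed(6)

# Made-up data: two groups with different typical times
d <- data.frame(sex = factor(rep(c("F", "M"), each = 10)))
d$time <- ifelse(d$sex == "F", 65, 55) + rnorm(20)

# Groupwise means computed directly
means <- tapply(d$time, d$sex, mean)

# Without the intercept: one coefficient per group, each a group mean
no_int <- coef(lm(time ~ sex - 1, data = d))

# With the intercept: reference-group mean, then difference from it
with_int <- coef(lm(time ~ sex, data = d))

print(means)
print(no_int)
print(with_int)
```

With the intercept, the coefficient labeled sexM is the M mean minus the F mean, and the intercept carries the F mean.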