Here we’ll explore how R deals with redundancy, using the SwimRecords dataset as an example. First, we’ll load the data:
library(mosaicData)  # assumes mosaicData is installed; it provides SwimRecords and Galton
data("SwimRecords")
Now we’ll add a new (redundant) variable “beforepres”, the number of years before the present, taking 2015 as the present:
SwimRecords$beforepres = 2015 - SwimRecords$year
We’ll start with a simple model that doesn’t involve redundancy:
mod1 = lm(time ~ year + sex, data=SwimRecords)
coef(mod1)
## (Intercept) year sexM
## 555.7167834 -0.2514637 -9.7979615
Now we’ll add the redundant variable and see what R does:
mod2 = lm(time ~ year + sex + beforepres, data=SwimRecords)
coef(mod2)
## (Intercept) year sexM beforepres
## 555.7167834 -0.2514637 -9.7979615 NA
Note that R assigns NA to the redundant coefficient. Normally, order doesn’t matter when specifying a model, but R blames redundancy on terms added later:
mod3 = lm(time ~ beforepres + year + sex, data=SwimRecords)
coef(mod3)
## (Intercept) beforepres year sexM
## 49.0174651 0.2514637 NA -9.7979615
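The NA is R's way of reporting a rank-deficient model matrix: the formula asks for more columns than are linearly independent. The sketch below checks this on simulated data so that it runs with base R alone; the `year` and `time` values are made up for illustration and are not the real swim records.

```r
# Simulated stand-in for the swim data; these values are hypothetical.
set.seed(1)
year = 1905:2004
time = 550 - 0.25 * year + rnorm(length(year))
beforepres = 2015 - year             # exact linear function of year

fit = lm(time ~ beforepres + year)   # year is listed second, so it gets the NA
coef(fit)
ncol(model.matrix(fit))              # number of columns the formula requested
fit$rank                             # number of linearly independent columns
```

When `fit$rank` is smaller than the number of columns in the model matrix, at least one coefficient cannot be estimated separately from the others.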
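Notice that the intercept differs between mod2 (555.72) and mod3 (49.02), and that this time it is year that gets the NA. Why does this happen? Because beforepres = 2015 - year, the two models describe the same fit with different coordinates, so the constant 2015 is absorbed into the intercept. Despite the different coefficients, the fitted values for both models are the same: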
f2 = fitted(mod2)
f3 = fitted(mod3)
all.equal(f2, f3)
## [1] TRUE
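The shift in the intercept can be worked out by substituting the definition of beforepres into the fitted equation of mod3. The symbols $b_0$, $b_1$, $b_2$ for the mod3 coefficients are notation introduced here for illustration:

```latex
\begin{aligned}
\widehat{\text{time}}
  &= b_0 + b_1\,\text{beforepres} + b_2\,\text{sexM} \\
  &= b_0 + b_1\,(2015 - \text{year}) + b_2\,\text{sexM} \\
  &= (b_0 + 2015\,b_1) \;-\; b_1\,\text{year} \;+\; b_2\,\text{sexM}
\end{aligned}
```

With the printed values, 49.0174651 + 2015 × 0.2514637 ≈ 555.7168, which agrees with the mod2 intercept of 555.7167834 to within the rounding of the printed slope. The slope on year is the negative of the slope on beforepres, and the sexM coefficient is unchanged.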
You’ve already seen R do something similar for categorical variables. The base level of any categorical variable is redundant with the intercept, so if we suppress the intercept, the base level keeps its own coefficient:
coef(lm(time ~ sex - 1, data=SwimRecords))
## sexF sexM
## 65.19226 54.65613
But if we leave the intercept term in, R automatically drops the base level (sexF); the intercept takes its value, and sexM becomes a difference from it:
coef(lm(time ~ sex, data=SwimRecords))
## (Intercept) sexM
## 65.19226 -10.53613
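To see why the base level is redundant, it can help to look at the columns R builds for each formula. Here is a small sketch with a made-up factor; the `sex` vector below is hypothetical and is not taken from SwimRecords.

```r
# Hypothetical four-row factor, used only to inspect the model matrix columns.
sex = factor(c("F", "M", "F", "M"))

with_int = model.matrix(~ sex)       # columns: (Intercept), sexM
no_int   = model.matrix(~ sex - 1)   # columns: sexF, sexM

colnames(with_int)
colnames(no_int)

# The intercept column equals sexF + sexM, so keeping all three columns
# would be redundant; R drops sexF when the intercept is present.
all(with_int[, "(Intercept)"] == no_int[, "sexF"] + no_int[, "sexM"])
```

This is also why the two fits above agree: 65.19226 - 10.53613 = 54.65613, which is the sexM coefficient from the model without an intercept.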
So why do we measure variation with the variance and not the standard deviation or something else? The reason is partitioning. The standard deviation, the IQR, and a 95% coverage interval simply don’t partition variation. That is, the sd (or IQR) of the fitted values plus the sd (or IQR) of the residuals does not generally equal the sd (or IQR) of the response variable:
data("Galton")
m = lm(height ~ 1 + mother + sex, data=Galton)
sd1 = sd(fitted(m))
sd2 = sd(resid(m))
sd3 = sd(Galton$height)  # base R sd() takes a vector and has no data argument
sd1+sd2
## [1] 5.057289
sd3
## [1] 3.582918
iqr1 = IQR(fitted(m))
iqr2 = IQR(resid(m))
iqr3 = IQR(Galton$height)
iqr1+iqr2
## [1] 8.191579
iqr3
## [1] 5.7
The variance does partition the variation. There are geometric reasons for this, which you are encouraged to read up on, but they won’t be on the midterm…
var1 = var(fitted(m))
var2 = var(resid(m))
var3 = var(Galton$height)
var1 + var2
## [1] 12.8373
var3
## [1] 12.8373
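The geometric reason is that least-squares residuals are uncorrelated with the fitted values, so the two variances add in the same way that squared side lengths add in the Pythagorean theorem. Here is a minimal sketch on simulated data; `x` and `y` are made up for illustration and are not the Galton heights.

```r
# Simulated data; x and y are hypothetical.
set.seed(2)
x = rnorm(200)
y = 3 + 2 * x + rnorm(200)
fit = lm(y ~ x)

cor(fitted(fit), resid(fit))              # zero up to floating-point error
var(fitted(fit)) + var(resid(fit))        # same value as var(y)
var(y)
sd(fitted(fit)) + sd(resid(fit)) > sd(y)  # standard deviations do not add
```

The standard deviations fail to add for the same reason that the two legs of a right triangle are together longer than the hypotenuse.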
This demo is based directly on material from ‘Statistical Modeling: A Fresh Approach (2nd Edition)’ by Daniel Kaplan.