Lecture 13: Untangling and Redundancy

Redundancy

Here we’ll explore how R deals with redundancy, using the SwimRecords dataset as an example. First, we’ll load the data:

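library(mosaicData)  # assumption: SwimRecords (and Galton, used later) come from the mosaicData package
data("SwimRecords")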

Now we’ll add a new (redundant) variable, “beforepres”, the number of years before the present (taken here to be 2015):

SwimRecords$beforepres = 2015 - SwimRecords$year

We’ll start with a simple model that doesn’t involve redundancy:

mod1 = lm(time ~ year + sex, data=SwimRecords)
coef(mod1)

## (Intercept)        year        sexM 
## 555.7167834  -0.2514637  -9.7979615

Now we’ll add the redundant variable and see what R does:

mod2 = lm(time ~ year + sex + beforepres, data=SwimRecords)
coef(mod2)

## (Intercept)        year        sexM  beforepres 
## 555.7167834  -0.2514637  -9.7979615          NA

Note that R assigns NA to the coefficient of the redundant term. Normally the order of terms doesn’t matter when specifying a model, but when terms are redundant R keeps the earlier ones and blames the redundancy on the term added later:

mod3 = lm(time ~ beforepres + year + sex, data=SwimRecords)
coef(mod3)

## (Intercept)  beforepres        year        sexM 
##  49.0174651   0.2514637          NA  -9.7979615
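In both fits, the NA marks a column of the model matrix that adds no new information. As a minimal sketch of how one might confirm this, the code below uses simulated stand-in data (the data frame d and its values are made up, not the SwimRecords data) and compares the number of columns in the model matrix with its numerical rank, then asks alias() how the dropped term depends on the others:

```r
# Simulated stand-in data (hypothetical; not the SwimRecords data).
set.seed(1)
d <- data.frame(year = 1950:1999)
d$beforepres <- 2015 - d$year                      # exact linear function of year
d$time <- 60 - 0.1 * (d$year - 1950) + rnorm(nrow(d))

# Model matrix: intercept, year, beforepres.
X <- model.matrix(time ~ year + beforepres, data = d)
n_cols <- ncol(X)       # number of columns in the model matrix
x_rank <- qr(X)$rank    # numerical rank; smaller than n_cols when a column is redundant

mod <- lm(time ~ year + beforepres, data = d)
dropped <- names(which(is.na(coef(mod))))   # terms lm() could not estimate

n_cols
x_rank
dropped
alias(mod)$Complete     # expresses each dropped term via the terms that were kept
```

A rank below the column count is the general symptom; which term gets the NA depends only on the order in the formula.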

Notice how the intercept changes! Why does this happen? Despite the different coefficients, the fitted values for both models are the same:

f2 = fitted(mod2)
f3 = fitted(mod3)
all.equal(f2, f3)

## [1] TRUE
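The intercept changes because beforepres is defined as 2015 - year. Substituting year = 2015 - beforepres into \(b_0 + b_1\,\text{year}\) gives \((b_0 + 2015\,b_1) - b_1\,\text{beforepres}\): the same fit, with a new intercept and a slope of opposite sign. With the coefficients printed above, 555.7167834 + 2015 × (-0.2514637) ≈ 49.017, which matches the mod3 intercept up to rounding of the printed values. The sketch below checks the same identity on simulated stand-in data (the data frame d and its values are made up for illustration):

```r
# Simulated stand-in data (hypothetical; not the SwimRecords data).
set.seed(2)
d <- data.frame(year = 1950:1999)
d$beforepres <- 2015 - d$year
d$time <- 60 - 0.1 * (d$year - 1950) + rnorm(nrow(d))

b_year   <- coef(lm(time ~ year, data = d))        # intercept b0 and slope b1 on year
b_before <- coef(lm(time ~ beforepres, data = d))  # same fit, expressed via beforepres

# Intercept in the beforepres parameterization should equal b0 + 2015 * b1.
intercept_ok <- all.equal(unname(b_before["(Intercept)"]),
                          unname(b_year["(Intercept)"] + 2015 * b_year["year"]))

# The slope on beforepres should be the negative of the slope on year.
slope_ok <- all.equal(unname(b_before["beforepres"]), -unname(b_year["year"]))

intercept_ok
slope_ok
```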

You’ve already seen R do something similar for categorical variables. The indicator for the base level of any categorical variable is redundant with the intercept. If we suppress the intercept, the base level keeps its own coefficient:

coef(lm(time ~ sex - 1, data=SwimRecords))

##     sexF     sexM 
## 65.19226 54.65613

But if we leave the intercept term in, then R automatically drops the base level (here sexF), and the intercept takes its place:

coef(lm(time ~ sex, data=SwimRecords))

## (Intercept)        sexM 
##    65.19226   -10.53613
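The two sets of coefficients carry the same information: in the output above, 65.19226 - 10.53613 = 54.65613, the sexM value from the no-intercept fit. A minimal sketch of the same relationship on a small made-up data set (the values in d are hypothetical):

```r
# Small made-up data set (hypothetical values, for illustration only).
d <- data.frame(
  sex  = factor(c("F", "F", "F", "M", "M", "M")),
  time = c(66, 64, 65, 55, 54, 56)
)

# Without an intercept: one indicator column per level,
# and each coefficient is that group's mean.
colnames(model.matrix(time ~ sex - 1, data = d))
means <- coef(lm(time ~ sex - 1, data = d))

# With an intercept: the base level's indicator column is left out.
colnames(model.matrix(time ~ sex, data = d))
b <- coef(lm(time ~ sex, data = d))

# The intercept is the base-level mean; sexM is the difference from it.
all.equal(unname(b["(Intercept)"]), unname(means["sexF"]))
all.equal(unname(b["sexM"]), unname(means["sexM"] - means["sexF"]))
```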

Some thoughts on \(R^2\)

So why is \(R^2\) built on the variance and not the standard deviation or something else? The reason is partitioning. The sd, the IQR, and a 95% coverage interval simply don’t partition variation. That is, the sd (or IQR) of the fitted values plus the sd (or IQR) of the residuals is not necessarily the sd (or IQR) of the response variable:

data("Galton")
m = lm(height ~ 1 + mother + sex, data=Galton)

sd1 = sd(fitted(m))       # spread of the fitted values
sd2 = sd(resid(m))        # spread of the residuals
sd3 = sd(Galton$height)   # spread of the response
sd1+sd2

## [1] 5.057289

sd3

## [1] 3.582918

iqr1 = IQR(fitted(m))       # IQR of the fitted values
iqr2 = IQR(resid(m))        # IQR of the residuals
iqr3 = IQR(Galton$height)   # IQR of the response
iqr1 + iqr2

## [1] 8.191579

iqr3

## [1] 5.7

The variance does partition the variation. There are geometric reasons for this, which you are encouraged to read up on, but they won’t be on the midterm…

var1 = var(fitted(m))       # variance of the fitted values
var2 = var(resid(m))        # variance of the residuals
var3 = var(Galton$height)   # variance of the response
var1 + var2

## [1] 12.8373

var3

## [1] 12.8373
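One way to see why the variance partitions: in a least-squares fit with an intercept, the residuals are uncorrelated with the fitted values, so the cross term in var(fitted + resid) vanishes. The sketch below illustrates this on simulated data (the variables x and y are made up, not the Galton data) and also checks that the ratio var(fitted) / var(response) agrees with the \(R^2\) reported by summary():

```r
# Simulated data (hypothetical; not the Galton data).
set.seed(3)
n <- 200
x <- rnorm(n)
y <- 2 + 1.5 * x + rnorm(n)

m <- lm(y ~ x)
f <- fitted(m)
r <- resid(m)

# Residuals are uncorrelated with the fitted values (zero up to rounding error).
cov(f, r)

# So the variances add up exactly ...
all.equal(var(y), var(f) + var(r))

# ... and the fitted share of the variance is R^2.
all.equal(var(f) / var(y), summary(m)$r.squared)

# The standard deviations, by contrast, overshoot.
sd(f) + sd(r) > sd(y)
```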

Reference

This demo is based directly on material from ‘Statistical Modeling: A Fresh Approach (2nd Edition)’ by Daniel Kaplan.