Although you can make models of considerable complexity, there are some gaps or deficiencies in the story:
We need to fix these deficiencies. It will take a while.
This week, we'll be working with three main ideas:
The tool that we will use to explore these ideas is geometry. Some people will find that this immediately illuminates what's going on. Some people will not find it helpful at all. I don't know why there are such different reactions from different people. I encourage you to keep an open mind as we talk about geometry. The worst case is that you will spend a couple of hours (spread over the next weeks) that won't lead anywhere for you. That's not too big a bet to place on the possibility that you may find it truly useful, as many people do.
Using indicator variables: the 0-1 encoding of the levels of a categorical variable. Economists and many others call these “dummy variables”. That's fine, and you're welcome to use that phrase, but it seems a bit pejorative.
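To see the 0-1 encoding concretely, here's a quick sketch (it assumes the KidsFeet data used below, which comes from the mosaicData package): model.matrix shows the indicator columns that R constructs behind the scenes for a categorical variable.
# Peek at the encoding: an intercept column of 1s plus a 0-1 indicator column for one of the sexes
head(model.matrix(~ sex, data = KidsFeet))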
There's a funny pattern with the residuals, fitted model values, and response variable. Actually, two patterns, one of which is obviously a result of definition, but the other isn't.
The response variable is the sum of the fitted model values and the residuals. That's just another way of saying that the residuals are the difference between the response variable and the fitted model values. No big deal.
Have the students do this themselves.
1. Pick some data set along with a response variable and one or more explanatory variables. I'll use my favorite: KidsFeet (I like this one because it's small, and it's practical to look at the individual fitted model values and residuals.)
2. Pick some model on that dataset. Here's mine:
mod = lm(width ~ length * sex + domhand, data = KidsFeet)
together = fitted(mod) + resid(mod)
together is the same as the response variable. Here's one way to check. It involves a command that you won't need in this course, cbind, which takes two vectors of numbers and puts them side by side:
head(with(KidsFeet, cbind(width, together)))
## width together
## 1 8.4 8.4
## 2 8.8 8.8
## 3 9.7 9.7
## 4 9.8 9.8
## 5 8.9 8.9
## 6 9.7 9.7
Here's another way: subtract the sum (stored as together) from the response variable and show that the result is always zero:
head(with(KidsFeet, width - together))
## 1 2 3 4 5 6
## 0 0 0 0 0 0
This is not practical when you have a data set of any substantial length. Here's a trick. Square the differences and add them up. This will be zero only if each and every case has a value of zero:
sum(with(KidsFeet, width - together)^2)
## [1] 3.155e-30
The sum comes out essentially zero; the tiny non-zero value is just round-off in the computer's arithmetic. This relationship between the fitted model values, the residuals, and the response variable will be true for every model you make. QUESTION: Does anyone in the class find that this is not the case for their model?
The variance of the response variable is the sum of the variance of the fitted model values and the variance of the residuals. The same is also true of the sum of squares.
Try it on your model:
var(width, data = KidsFeet)
## [1] 0.2597
var(fitted(mod)) + var(resid(mod))
## [1] 0.2597
Note that this is not (necessarily) true for the standard deviation:
sd(width, data = KidsFeet)
## [1] 0.5096
sd(fitted(mod)) + sd(resid(mod))
## [1] 0.7206
But it is true for the “sum of squares”:
sum(with(KidsFeet, width^2))
## [1] 3163
sum(fitted(mod)^2) + sum(resid(mod)^2)
## [1] 3163
Once you develop an understanding of why this is so, you'll have a tool for understanding what's going on in complicated models.
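Here's one computational hint at why the squares add up, a quick check using the mod fitted above: the fitted model values and the residuals are uncorrelated, so the cross term in the expansion of the sum of squares vanishes.
# The cross term between the fitted values and the residuals: essentially zero, up to round-off
sum(fitted(mod) * resid(mod))
# The fitted values and residuals are also uncorrelated (the model includes an intercept)
cor(fitted(mod), resid(mod))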

A ~ 1
A ~ B - 1
*Calculations using the computer [Math_155_Activity_on_Statistical_Geometry_and_Computing]: ask students to build their own models using their own data and confirm this.
*How to fit with more than one model vector.
*Simpson's paradox and how it relates to correlations among explanatory variables.
There's a linear combination of the indicator variables for a categorical variable that will produce something that is all 1s.
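To check this claim concretely, here's a sketch again using the sex variable from KidsFeet: the “- 1” in the formula drops the intercept, so model.matrix returns one 0-1 indicator column per level, and adding those columns with a coefficient of 1 on each produces the all-1s vector.
indicators = model.matrix(~ sex - 1, data = KidsFeet)   # one 0-1 column per level of sex
head(indicators)
rowSums(indicators)   # the linear combination with coefficient 1 on each column: every entry is 1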