We've seen the standard deviation in its role as a description of the width of a distribution. That's an excellent way to think about the standard deviation.
First, the variance; then the standard deviation, whose units are the same as those of the quantity itself.
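For reference, the familiar formulas for a quantity \( x \) with \( n \) cases:
\[ \mathrm{var}(x) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 , \qquad \mathrm{sd}(x) = \sqrt{\mathrm{var}(x)} . \]
The variance carries the square of the units of \( x \); the standard deviation carries the units of \( x \) itself.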
Consider the very simple, all-cases-the-same model: x ~ 1. The only “explanatory” vector is the intercept vector.
New idea: the “average per component length” of a vector. This would be useful in describing the typical size of a residual for each case.
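Concretely, for a residual vector \( r \) with \( n \) components, the candidate measures explored below are
\[ \frac{|r|}{n} , \qquad \frac{|r|^2}{n} = \frac{1}{n} \sum_i r_i^2 , \qquad \sqrt{\frac{|r|^2}{n}} . \]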
ACTIVITY:
Have students fit the all-cases-the-same model to the Galton height data, but with different sample sizes. Here's the same comparison carried out with the Cherry Blossom race data, modeling the runners' ages:
runners = fetchData("Cherry-Blossom-2008.csv")
nrow(runners)
## [1] 8787
small = sample(runners, size = 200)
big = sample(runners, size = 8000)
mod1 = lm(age ~ 1, data = small)
mod2 = lm(age ~ 1, data = big)
# average length per component
sqrt(sum(resid(mod1)^2))/200
## [1] 0.6931
sqrt(sum(resid(mod2)^2))/8000
## [1] 0.1062
# average square length per component
sum(resid(mod1)^2)/200
## [1] 96.09
sum(resid(mod2)^2)/8000
## [1] 90.21
# and the square root of that
sqrt(sum(resid(mod1)^2)/200)
## [1] 9.802
sqrt(sum(resid(mod2)^2)/8000)
## [1] 9.498
Note that the average square length is consistent between the two sample sizes, but the average length is not. We want our estimates to describe the system under study, not the irrelevant details of the particular sample we happened to collect, so the average square length is the better measure.
But what is the dimension of the space that the residual lives in? \( n-1 \) since the residual is always orthogonal to the intercept vector.
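Putting the pieces together for the intercept-only model: the variance is the square length of the residual vector divided by \( n-1 \), the dimension of the space the residual lives in, and the standard deviation is its square root. A quick check using the small sample from above (a sketch; the exact values depend on the random sample that was drawn):
n = nrow(small)
r = resid(mod1)  # mod1 is the intercept-only model lm(age ~ 1, data = small)
sum(r^2)/(n - 1)        # square length of the residual per dimension: the variance
sqrt(sum(r^2)/(n - 1))  # its square root: the standard deviation
var(small$age)          # agrees with the variance computed above
sd(small$age)           # agrees with the standard deviation computed above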
The residuals will be orthogonal to each and every model vector:
mod = lm(wage ~ age * educ + exper, data = CPS85)
with(data = CPS85, sum(resid(mod) * exper))
## [1] -1.02e-12
with(data = CPS85, sum(resid(mod) * age))
## [1] -1.458e-12
with(data = CPS85, sum(resid(mod) * educ))
## [1] 2.391e-12
with(data = CPS85, sum(resid(mod) * age * educ))
## [1] 8.23e-11
# but not to a vector that isn't one of the model vectors
with(data = CPS85, sum(resid(mod) * exper * educ))
## [1] -2521
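The intercept is a model vector too, so the residuals are also orthogonal to the vector of all 1s, which is just to say that they sum to zero. A quick check, continuing with mod from above:
sum(resid(mod))  # the dot product of the residuals with the intercept vector of 1s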
How to make a model of a response A say anything you want about an explanatory variable B, through redundancy (the algebra behind the recipe is sketched just after the list):
1. Pick a variable B and the coefficient you want on it.
2. Find other vectors V in whose space the variable B lies.
3. Fit B ~ V
4. Fit A ~ V
5. Combine the coefficients from (3) and (4) to make a model.
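In symbols: suppose step (3) finds that B lies exactly in the space of V, say \( B = c_0 + c_1 V \). Then \( B - c_0 - c_1 V \) is the zero vector, so for any number \( k \) the model
\[ A = b_0 + b_1 V + k \, ( B - c_0 - c_1 V ) \]
has exactly the same fitted values as the step-(4) model \( A = b_0 + b_1 V \), yet displays a coefficient of \( k \) on B.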
Example: I run a small college and I want to show that wages increase by $2 per hour for every year of education. Let's do this with the CPS85 data.
Note first that there is a redundancy in the CPS85 data among age, exper, and educ. Confirm that
exper has been estimated in the data as age - (educ+6)
There is one case — # 350 — that's wrong. It was just an arithmetic mistake in the data. (Actually, that's why we're using these data from 1985. Anyone who wants to get some modern CPS data is encouraged to do so and will get some extra credit.)
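A quick way to confirm both the redundancy and the single bad case (a sketch, assuming the CPS85 data frame used above is in the workspace):
# Which cases violate exper == age - (educ + 6)?  This should flag only
# case 350, the arithmetic mistake mentioned above.
with(CPS85, which(exper != age - (educ + 6)))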
First, let's work with the corrected data:
cps = fetchData("CPS85-corrected.csv")
## Retrieving from http://www.mosaic-web.org/go/datasets/CPS85-corrected.csv
We already know what the relationship is between experience and age and education: \( educ = age - exper - 6 \)
Confirm this:
modEduc = lm(educ ~ age + exper, data = cps)
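With the corrected data the relationship holds exactly, so the fitted coefficients should simply reproduce \( educ = age - exper - 6 \), up to round-off:
coef(modEduc)  # expect roughly: (Intercept) -6, age 1, exper -1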
Now fit a model of wage versus age, experience, and education …
mod1 = lm(wage ~ age + exper + educ, data = cps)
coef(mod1)
## (Intercept) age exper educ
## -10.4603 0.9260 -0.8208 NA
Notice that the software has flagged the redundancy: the coefficient on educ comes back NA, so education isn't playing any role in the model.
One way to characterize the quality of the model is with the typical size of a residual.
sd(resid(mod1))
## [1] 4.591
Now I'm going to construct a function that includes education but gives the same model values as the fitted model and therefore has the same size residuals.
To help, here's a function that's always zero, reflecting the redundancy among age, experience, education and the intercept:
zero = makeFun(educ - (age - (exper + 6)) ~ educ & age & exper)
zero() can be added to the model without changing any of the model values. Indeed, you can add 10 times zero, or 100 times, or whatever multiple you like. Here's the model with 10 times zero added:
myMod2 = makeFun(-10.46027 + 0.9259656 * age - 0.820833 * exper + 10 * zero(educ,
age, exper) ~ educ & age & exper)
Both these models give essentially the same model values for every case in the data set (the tiny discrepancies reflect the rounding of the typed-in coefficients):
v2 = with(cps, myMod2(age = age, educ = educ, exper = exper))
sd(fitted(mod1) - v2)
## [1] 1.157e-05
mean(fitted(mod1) - v2)
## [1] -3.634e-05
sum((fitted(mod1) - v2)^2)
## [1] 7.768e-07
Let's look at the effect of education on wage. One way to do this is to ask how wage changes with education: a derivative.
effect1 = D(myMod(age = age, educ = educ, exper = exper) ~ educ)
effect2 = D(myMod2(age = age, educ = educ, exper = exper) ~ educ)
effect1(age = 30, exper = 10, educ = 14)
## Error: could not find function "myMod"
effect2(age = 30, exper = 10, educ = 14)
## [1] 10
According to the 2nd model, education is worth $10 per hour. (The effect1 call fails only because no function named myMod was ever defined; in any case, mod1 dropped educ as redundant, so it attributes no effect to education at all.)
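The arithmetic behind the $10: the fitted part of myMod2 does not involve educ at all, while zero() contains educ with coefficient 1, so
\[ \frac{\partial \, \mathrm{myMod2}}{\partial \, educ} = 0 + 10 \cdot \frac{\partial}{\partial \, educ} \Big( educ - ( age - ( exper + 6 ) ) \Big) = 10 . \]
To produce the $2-per-hour claim from the motivating example instead, add 2 * zero(educ, age, exper) rather than 10 times it.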
By careful choice of covariates, you can influence what role a variable appears to play.
Simpson's paradox, or more generally the dependence of the coefficients of the variables on the covariates, is a simple consequence of alignment. It's going to happen, like it or not.
We'll need more guidance on the choice of covariates.
Approaches we will see:
1. Use of experimental assignment and randomization to create orthogonality
2. Analysis of covariance
3. Causation-related inference.
How much have we explained? That is, what fraction of the distance to the response variable have we covered?
BUT … we want to emphasize variance, not the mean. So we measure the fraction within the subspace orthogonal to the intercept; we don't want to give undue credit to a model vector that merely happens to be aligned with the intercept.
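In R, using mod1 from the CPS85 example above, this amounts to comparing variances rather than raw lengths (a sketch):
# Fraction of the variance (the part orthogonal to the intercept) accounted for:
var(fitted(mod1)) / var(cps$wage)
# The same quantity is reported as the R-squared of the model:
summary(mod1)$r.squared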