Stats 155 Class Notes 2012-10-05

Main Ideas for Today

  1. The standard deviation, geometrically
  2. Fitting and orthogonality of the residual in high dimensions
  3. \( R^2 \): the fraction of the distance travelled to the response vector, neglecting the part aligned with the intercept.
  4. Simpson's paradox
  5. Redundancy

Standard deviation

We've seen the standard deviation in its role as describing the width of a distribution. That's an excellent way to think about the standard deviation.

Two rules of thumb for estimating the standard deviation:

The arithmetic:

First, the variance: the mean square deviation from the mean, \( s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2 \).

Then the standard deviation: the square root of the variance, \( s = \sqrt{s^2} \).

The units: the same as those of the quantity itself. (The variance, in contrast, carries squared units.)
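
In R, the arithmetic looks like this (a minimal sketch, with a few made-up heights used purely for illustration):

x = c(72, 68, 65, 70, 69)  # made-up heights in inches, illustrative only
v = sum((x - mean(x))^2)/(length(x) - 1)  # first, the variance
sqrt(v)  # then the standard deviation: same units as x itself
sd(x)  # R's built-in agrees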

Geometrically

Consider the very simple, all-cases-the-same model: x ~ 1. The only “explanatory” vector is the intercept vector.

New idea: the “average per component length” of a vector. This would be useful in describing the typical size of a residual for each case.

ACTIVITY:
Have students fit the all-cases-the-same model to the runners' ages in the Cherry Blossom race data, but with different sample sizes.

runners = fetchData("Cherry-Blossom-2008.csv")
nrow(runners)
## [1] 8787
small = sample(runners, size = 200)
big = sample(runners, size = 8000)
mod1 = lm(age ~ 1, data = small)
mod2 = lm(age ~ 1, data = big)
# average length per component
sqrt(sum(resid(mod1)^2))/200
## [1] 0.6931
sqrt(sum(resid(mod2)^2))/8000
## [1] 0.1062
# average square length per component
sum(resid(mod1)^2)/200
## [1] 96.09
sum(resid(mod2)^2)/8000
## [1] 90.21
# and the square root of that
sqrt(sum(resid(mod1)^2)/200)
## [1] 9.802
sqrt(sum(resid(mod2)^2)/8000)
## [1] 9.498

Fitting and orthogonality

The residuals will be orthogonal to each and every model vector:

mod = lm(wage ~ age * educ + exper, data = CPS85)
with(data = CPS85, sum(resid(mod) * exper))
## [1] -1.02e-12
with(data = CPS85, sum(resid(mod) * age))
## [1] -1.458e-12
with(data = CPS85, sum(resid(mod) * educ))
## [1] 2.391e-12
with(data = CPS85, sum(resid(mod) * age * educ))
## [1] 8.23e-11
# but not
with(data = CPS85, sum(resid(mod) * exper * educ))
## [1] -2521
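
A compact way to check every model vector at once (a sketch; model.matrix() returns all of the model vectors, including the intercept, as columns):

round(t(model.matrix(mod)) %*% resid(mod), 6)  # every entry zero, up to round-off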

Redundancy

How to make a model say anything you want about the effect of a variable B on a response A, by exploiting redundancy:
1. Pick the variable B and the coefficient you want on it.
2. Find other vectors V in whose space the variable B lies.
3. Fit B ~ V.
4. Fit A ~ V.
5. Combine the coefficients from (3) and (4) to build a model with your chosen coefficient on B.

Example: I run a small college and I want to show that wages increase by $10 per hour for every year of education. Let's do this with the CPS85 data.

Note first that there is a redundancy in the CPS85 data among age, exper, and educ. Confirm that exper has been computed in the data as age - (educ + 6).

There is one case — # 350 — that's wrong. It was just an arithmetic mistake in the data. (Actually, that's why we're using these data from 1985. Anyone who wants to get some modern CPS data is encouraged to do so and will get some extra credit.)
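
A quick check on the original (uncorrected) CPS85 data, as a sketch: tabulating the discrepancy should show zero for every case but the one flagged above.

with(CPS85, table(exper - (age - (educ + 6))))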

First, let's work with the corrected data:

cps = fetchData("CPS85-corrected.csv")
## Retrieving from http://www.mosaic-web.org/go/datasets/CPS85-corrected.csv

We already know the relationship among education, age, and experience: \( educ = age - exper - 6 \). Confirm this:

modEduc = lm(educ ~ age + exper, data = cps)
coef(modEduc)  # the corrected data satisfy the relationship exactly, so this should come out: (Intercept) -6, age 1, exper -1

Now fit a model of wage versus age, experience, and education …

mod1 = lm(wage ~ age + exper + educ, data = cps)
coef(mod1)
## (Intercept)         age       exper        educ 
##    -10.4603      0.9260     -0.8208          NA 

Notice that the software has flagged the redundancy. Education isn't playing a role in the model.

One way to characterize the quality of the model is with the typical size of a residual.

sd(resid(mod1))
## [1] 4.591

Now I'm going to construct a function that includes education but gives the same model values as the fitted model and therefore has the same size residuals.

To help, here's a function that's always zero, reflecting the redundancy among age, experience, education and the intercept:

zero = makeFun(educ - (age - (exper + 6)) ~ educ & age & exper)
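
A sanity check (a sketch): evaluated on the corrected data, zero() really does come out zero for every case.

range(with(cps, zero(educ, age, exper)))  # both extremes should be 0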

Zero can be added to the model formula without changing anything. Indeed, you can add 10 times zero without changing anything, or 100 times, or whatever multiple you like. Here's adding 10 times zero:

myMod2 = makeFun(-10.46027 + 0.9259656 * age - 0.820833 * exper + 10 * zero(educ, 
    age, exper) ~ educ & age & exper)

Both these models give exactly the same model values for all the data in the data set:

v2 = with(cps, myMod2(age = age, educ = educ, exper = exper))
sd(fitted(mod1) - v2)
## [1] 1.157e-05
mean(fitted(mod1) - v2)
## [1] -3.634e-05
sum((fitted(mod1) - v2)^2)
## [1] 7.768e-07
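
The multiple 10 was nothing special. As a sketch with k = 100, any multiple of zero leaves the model values untouched, since zero() vanishes on every case:

v100 = with(cps, -10.46027 + 0.9259656 * age - 0.820833 * exper + 100 * zero(educ, age, exper))
sd(fitted(mod1) - v100)  # still essentially zero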

Let's look at the effect of education on wage. One way to do this is to ask how wage changes with education: a derivative.

myMod = makeFun(mod1)  # make mod1 into a function so it can be differentiated
effect1 = D(myMod(age = age, educ = educ, exper = exper) ~ educ)
effect2 = D(myMod2(age = age, educ = educ, exper = exper) ~ educ)
effect1(age = 30, exper = 10, educ = 14)  # should be zero: educ plays no role in mod1
effect2(age = 30, exper = 10, educ = 14)
## [1] 10

According to the 2nd model, education is worth $10 per hour: exactly the multiple of zero we chose to add.

Alignment and Simpson's Paradox

By careful choice of covariates, you can influence what the role of a variable appears to be.

Simpson's paradox, or more generally the dependence of the coefficients of the variables on the covariates, is a simple consequence of alignment. It's going to happen, like it or not.
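
As a sketch of this dependence (the particular numbers depend on the data), compare the coefficient on exper in CPS85 with and without educ as a covariate. Because exper and educ are closely aligned, the coefficient shifts:

coef(lm(wage ~ exper, data = CPS85))
coef(lm(wage ~ exper + educ, data = CPS85))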

We'll need more guidance on the choice of covariates.

Approaches we will see:
1. Use of experimental assignment and randomization to create orthogonality
2. Analysis of covariance
3. Causation-related inference.

The Coefficient of Determination: \( R^2 \)

How much have we explained? That is, what fraction of the distance toward the response vector has the fitted model vector travelled?

BUT … we want to emphasize variance, not the mean. So we take the fraction in the subspace orthogonal to the intercept: we don't want to give undue credit to a model vector that happens to be aligned with the intercept.
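
A minimal sketch of the computation, reusing mod from the orthogonality section: for a model with an intercept term, \( R^2 \) is the variance of the fitted values expressed as a fraction of the variance of the response, and it matches what lm itself reports.

with(CPS85, var(fitted(mod))/var(wage))  # fraction of variance accounted for
summary(mod)$r.squared  # lm's own R-squared, for comparison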

Geometry of Fitting with Multiple Vectors

1. Diagram with two explanatory vectors

2. Show that the residual is orthogonal to each and every model vector.