We've seen the standard deviation in its role as a description of the width of a distribution. That's an excellent way to think about the standard deviation.
First, the variance; then the standard deviation, whose units are the same as those of the quantity itself.
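For reference, the familiar formulas for a quantity \( x \) with \( n \) cases:
\[ \mathrm{var}(x) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 , \qquad \mathrm{sd}(x) = \sqrt{\mathrm{var}(x)} . \]
The variance carries the square of the units of \( x \); the standard deviation carries the units of \( x \) itself.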
Consider the very simple, all-cases-the-same model: x ~ 1. The only “explanatory” vector is the intercept vector.
New idea: the “average per component length” of a vector. This would be useful in describing the typical size of a residual for each case.
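Concretely, for a residual vector \( r \) with \( n \) components, the candidate measures explored below are
\[ \frac{|r|}{n} , \qquad \frac{|r|^2}{n} = \frac{1}{n} \sum_i r_i^2 , \qquad \sqrt{\frac{|r|^2}{n}} . \]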
ACTIVITY:
Have students fit the all-cases-the-same model to the Galton height data, but with different sample sizes. Here's the same comparison carried out with the Cherry Blossom race data, modeling the runners' ages:
runners = fetchData("Cherry-Blossom-2008.csv")
nrow(runners)
## [1] 8787
small = sample(runners, size = 200)
big = sample(runners, size = 8000)
mod1 = lm(age ~ 1, data = small)
mod2 = lm(age ~ 1, data = big)
# average length per component
sqrt(sum(resid(mod1)^2))/200
## [1] 0.6931
sqrt(sum(resid(mod2)^2))/8000
## [1] 0.1062
# average square length per component
sum(resid(mod1)^2)/200
## [1] 96.09
sum(resid(mod2)^2)/8000
## [1] 90.21
# and the square root of that
sqrt(sum(resid(mod1)^2)/200)
## [1] 9.802
sqrt(sum(resid(mod2)^2)/8000)
## [1] 9.498
Note that the average square length is consistent between the two sample sizes, but the average length is not. We want our estimates to describe the system under study, not the irrelevant details of the particular sample we happened to collect, so the average square length is the better measure.
But what is the dimension of the space that the residual lives in? \( n-1 \) since the residual is always orthogonal to the intercept vector.
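Putting the pieces together for the intercept-only model: the variance is the square length of the residual vector divided by \( n-1 \), the dimension of the space the residual lives in, and the standard deviation is its square root. A quick check using the small sample from above (a sketch; the exact values depend on the random sample that was drawn):
n = nrow(small)
r = resid(mod1)  # mod1 is the intercept-only model lm(age ~ 1, data = small)
sum(r^2)/(n - 1)        # square length of the residual per dimension: the variance
sqrt(sum(r^2)/(n - 1))  # its square root: the standard deviation
var(small$age)          # agrees with the variance computed above
sd(small$age)           # agrees with the standard deviation computed above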
The residuals will be orthogonal to each and every model vector:
mod = lm(wage ~ age * educ + exper, data = CPS85)
with(data = CPS85, sum(resid(mod) * exper))
## [1] -1.02e-12
with(data = CPS85, sum(resid(mod) * age))
## [1] -1.458e-12
with(data = CPS85, sum(resid(mod) * educ))
## [1] 2.391e-12
with(data = CPS85, sum(resid(mod) * age * educ))
## [1] 8.23e-11
# but not to a vector that isn't one of the model vectors
with(data = CPS85, sum(resid(mod) * exper * educ))
## [1] -2521
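The intercept is a model vector too, so the residuals are also orthogonal to the vector of all 1s, which is just to say that they sum to zero. A quick check, continuing with mod from above:
sum(resid(mod))  # the dot product of the residuals with the intercept vector of 1s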
How to make a model of a response A say anything you want about an explanatory variable B, through redundancy (the algebra behind the recipe is sketched just after the list):
1. Pick a variable B and the coefficient you want on it.
2. Find other vectors V in whose space the variable B lies.
3. Fit B ~ V
4. Fit A ~ V
5. Combine the coefficients from (3) and (4) to make a model.
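In symbols: suppose step (3) finds that B lies exactly in the space of V, say \( B = c_0 + c_1 V \). Then \( B - c_0 - c_1 V \) is the zero vector, so for any number \( k \) the model
\[ A = b_0 + b_1 V + k \, ( B - c_0 - c_1 V ) \]
has exactly the same fitted values as the step-(4) model \( A = b_0 + b_1 V \), yet displays a coefficient of \( k \) on B.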
Example: I run a small college and I want to show that wages increase by $2 per hour for every year of education. Let's do this with the CPS85 data.
Note first that there is a redundancy in the CPS85 data among age, exper, and educ. Confirm that
exper has been estimated in the data as age - (educ+6)
There is one case — # 350 — that's wrong. It was just an arithmetic mistake in the data. (Actually, that's why we're using these data from 1985. Anyone who wants to get some modern CPS data is encouraged to do so and will get some extra credit.)
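A quick way to confirm both the redundancy and the single bad case (a sketch, assuming the CPS85 data frame used above is in the workspace):
# Which cases violate exper == age - (educ + 6)?  This should flag only
# case 350, the arithmetic mistake mentioned above.
with(CPS85, which(exper != age - (educ + 6)))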
First, let's work with the corrected data:
cps = fetchData("CPS85-corrected.csv")
## Retrieving from http://www.mosaic-web.org/go/datasets/CPS85-corrected.csv
We already know what the relationship is between experience and age and education: \( educ = age - exper - 6 \)
Confirm this:
modEduc = lm(educ ~ age + exper, data = cps)
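With the corrected data the relationship holds exactly, so the fitted coefficients should simply reproduce \( educ = age - exper - 6 \), up to round-off:
coef(modEduc)  # expect roughly: (Intercept) -6, age 1, exper -1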
Now fit a model of wage versus age, experience, and education …
mod1 = lm(wage ~ age + exper + educ, data = cps)
coef(mod1)
## (Intercept) age exper educ
## -10.4603 0.9260 -0.8208 NA
Notice that the software has flagged the redundancy: the coefficient on educ comes back NA, so education isn't playing any role in the model.
One way to characterize the quality of the model is with the typical size of a residual.
sd(resid(mod1))
## [1] 4.591
Now I'm going to construct a function that includes education but gives the same model values as the fitted model and therefore has the same size residuals.
To help, here's a function that's always zero, reflecting the redundancy among age, experience, education and the intercept:
zero = makeFun(educ - (age - (exper + 6)) ~ educ & age & exper)
zero() can be added to the model without changing any of the model values. Indeed, you can add 10 times zero, or 100 times, or whatever multiple you like. Here's the model with 10 times zero added:
myMod2 = makeFun(-10.46027 + 0.9259656 * age - 0.820833 * exper + 10 * zero(educ,
age, exper) ~ educ & age & exper)
Both these models give essentially the same model values for every case in the data set (the tiny discrepancies reflect the rounding of the typed-in coefficients):
v2 = with(cps, myMod2(age = age, educ = educ, exper = exper))
sd(fitted(mod1) - v2)
## [1] 1.157e-05
mean(fitted(mod1) - v2)
## [1] -3.634e-05
sum((fitted(mod1) - v2)^2)
## [1] 7.768e-07
Let's look at the effect of education on wage. One way to do this is to ask how wage changes with education: a derivative.
effect1 = D(myMod(age = age, educ = educ, exper = exper) ~ educ)
effect2 = D(myMod2(age = age, educ = educ, exper = exper) ~ educ)
effect1(age = 30, exper = 10, educ = 14)
## Error: could not find function "myMod"
effect2(age = 30, exper = 10, educ = 14)
## [1] 10
According to the 2nd model, education is worth $10 per hour. (The effect1 call fails only because no function named myMod was ever defined; in any case, mod1 dropped educ as redundant, so it attributes no effect to education at all.)
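The arithmetic behind the $10: the fitted part of myMod2 does not involve educ at all, while zero() contains educ with coefficient 1, so
\[ \frac{\partial \, \mathrm{myMod2}}{\partial \, educ} = 0 + 10 \cdot \frac{\partial}{\partial \, educ} \Big( educ - ( age - ( exper + 6 ) ) \Big) = 10 . \]
To produce the $2-per-hour claim from the motivating example instead, add 2 * zero(educ, age, exper) rather than 10 times it.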
By careful choice of covariates, you can influence what role a variable appears to play.
Simpson's paradox, or more generally the dependence of the coefficients of the variables on the covariates, is a simple consequence of alignment. It's going to happen, like it or not.
We'll need more guidance on the choice of covariates.
Approaches we will see:
1. Use of experimental assignment and randomization to create orthogonality
2. Analysis of covariance
3. Causation-related inference.
How much have we explained? That is, what fraction of the distance to the response variable have we covered?
BUT … we want to emphasize variance, not the mean. So we measure the fraction within the subspace orthogonal to the intercept; we don't want to give undue credit to a model vector that merely happens to be aligned with the intercept.
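In R, using mod1 from the CPS85 example above, this amounts to comparing variances rather than raw lengths (a sketch):
# Fraction of the variance (the part orthogonal to the intercept) accounted for:
var(fitted(mod1)) / var(cps$wage)
# The same quantity is reported as the R-squared of the model:
summary(mod1)$r.squared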