Stats 155 Class Notes 2012-10-08

Hand out the R Command Crib Sheet.

Where We Are

At this point in the course, after the “Week of Theory,” you should have a better understanding of:
1. Why the coefficients on any given term depend on what other terms are in the model. That Simpson's Paradox is not the act of an angry deity, but a simple consequence of alignment of explanatory vectors.
2. The role that the intercept plays. Why it's a choice whether to represent a model whose model values are groupwise means as the means themselves, or as a reference mean plus differences for the other groups.
3. That alignment of vectors makes things a bit harder, but is not a fundamental difficulty so long as the alignment is not absolute, as in redundancy. We can measure alignment between two vectors as an angle. But we need a way to measure alignment between a vector and a set of vectors.
4. That a rough measure of the “quality of a model” is the size of the residuals. The smaller the typical residual, the better the model.

You should already know how to fit models, even pretty complicated models that include lots of variables. You understand that the model itself takes the form of a function, but that there are important attributes of a model that go beyond just the function itself, e.g. the residuals, model coefficients, confidence intervals on model coefficients.
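
A reminder of how those attributes are extracted in R (a sketch using the KidsFeet data, which also appears later in these notes; any fitted model would do):

mod = lm(length ~ width + sex, data = KidsFeet)
fitted(mod)  # the model values
resid(mod)  # the residuals
coef(mod)  # the model coefficients
confint(mod)  # confidence intervals on the coefficients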

But models are looking like they're hard to interpret. The quantification of effect size is complicated by interaction terms. It's also complicated by the change in coefficients when an additional term is put into a model.

Where We're Heading

This week, we'll be developing some tools for interpreting models.

Accounting for Variation

The definition offered for “statistics” at the beginning of the semester was

The explanation of or accounting for variation in the context of what remains unexplained.

At this point, you should know
1. What is meant by variation.

2. What is meant by “explanation” and why “accounted for” might be a more accurate phrase. That we're concerned in our models with explaining just one quantity, labelled the “response variable,” and are using the explanatory variables to do the explaining or accounting.
3. That “what remains unexplained” is represented by the residual. So it's not at this point a philosophical statement of what we don't know, but just the difference between what the model says and what actually happened.
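
The arithmetic behind item 3, as a quick sketch (using the KidsFeet data that appears later in these notes): the response is exactly the fitted model values plus the residuals.

mod = lm(length ~ width, data = KidsFeet)
max(abs(with(KidsFeet, length) - (fitted(mod) + resid(mod))))  # zero, up to round-off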

Now we need to sort through the meaning of “in the context of”.

We will eventually reach \( R^2 \), a very widely used quantity. So you might as well know what \( R^2 \) is: a ratio of variances, the variance of the fitted model values divided by the variance of the response. That is, how much variation the model accounts for versus how much there was to account for.
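
To make that concrete, here's a sketch using KidsFeet; for any model fitted with an intercept, the ratio of variances matches the \( R^2 \) that summary() reports:

mod = lm(length ~ width, data = KidsFeet)
var(fitted(mod))/with(KidsFeet, var(length))  # the ratio of variances
summary(mod)$r.squared  # the same number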

Two ways to display \( R^2 \):

Some questions:

Partitioning of Variability

Variance

Why not, say, var(fitted)/var(resid)?

That would be reasonable, too. But var(fitted)/var(response) has the nice property of being easy to interpret: it's always between 0 and 1.

Later on we'll switch to a more technical statistic, called the F statistic, which is essentially this same comparison of fitted to residual variation, adjusted for degrees of freedom.
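
In the notation here, with \( m \) explanatory vectors beyond the intercept and \( n \) cases, the standard form is

\( F = \frac{\mathrm{var(fitted)}/m}{\mathrm{var(resid)}/(n-m-1)} = \frac{R^2/m}{(1-R^2)/(n-m-1)} \)

that is, the same comparison of fitted to residual variation, but on a per-degree-of-freedom basis.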

Why is var(fitted) always no bigger than var(response)?

This has to do with the lengths of vectors. The variance is proportional to the square of the length of the vector (after subtracting out the mean). The Pythagorean theorem mandates that a leg of a right triangle cannot be longer than the hypotenuse.
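
A quick numerical check of the variance-length connection, with made-up numbers:

x = rnorm(10)
sum((x - mean(x))^2)/(length(x) - 1)  # squared length of the mean-centered vector, divided by n-1
var(x)  # the same number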

Actually, you can create models that have a var(fitted) larger than var(response). This involves either of two strategies:

  1. Not fitting the model. If you don't fit, there's no guarantee that the residual is perpendicular to the fitted model values.
  2. Not including an intercept or its equivalent. (A categorical variable provides the equivalent, since the vector of 1s lies in the subspace spanned by its indicator vectors.) Leaving out the intercept breaks the link between the length of the vector and the variance.

Example (which shows one reason not to leave off the intercept):

mod = lm(length ~ width - 1, data = KidsFeet)  # the "-1" takes out the intercept
var(fitted(mod))
## [1] 1.958
with(KidsFeet, var(length))
## [1] 1.736

To avoid this problem, there's another definition of \( R^2 \) used by summary() that reduces to the ratio of variances when the vector of 1s is in the model subspace.
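
You can see the difference with the no-intercept model above. With no intercept, summary() measures squared lengths from the origin rather than from the mean. A sketch, reusing mod from the KidsFeet example:

summary(mod)$r.squared  # not the ratio 1.958/1.736
with(KidsFeet, sum(fitted(mod)^2)/sum(length^2))  # reproduces summary()'s value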

How Big is Big Enough?

This depends on your purpose. As a point of calibration, remember that even pure noise shows some apparent alignment. Here's the largest \( r^2 \) from 100 pairs of 10 random numbers each:

max(do(100) * cor(rnorm(10), rnorm(10))^2)
## result 
## 0.5996 

TURN THIS INTO A BLOG: Early inference.

Correlation Coefficient.

History.

Relationship between \( R^2 \) and \( r \)

\( r \) is just for a straight-line model
* Example using ccf versus month: obviously a relationship but not a straight-line relationship
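
A quick check of the agreement in the straight-line case, sketched with KidsFeet:

with(KidsFeet, cor(length, width)^2)  # little r, squared
summary(lm(length ~ width, data = KidsFeet))$r.squared  # the same number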

Comparing Models

The “Coefficient of Determination” \( R^2 \)

Orthogonality and Alignment

Orthogonality of explanatory vectors is a nice, simplifying property. (One more demo of that today.)

Why Groupwise Means Are So Simple

They are so simple that it's tempting to try to cast everything in terms of groupwise means. But this is too limiting — it doesn't, for example, allow us to deal with quantitative explanatory variables.

Are there lessons to learn in terms of simplifying the interpretation of other kinds of coefficients?

EXAMPLE: Suppose you have decided that the issue in wages is about unionization and whether or not the worker is a service worker — all the other sectors are equivalent. You add a new variable to the wage data:

cps = fetchData("CPS85")
## Data CPS85 found in package.
cps = transform(cps, service = sector == "service")
mod1 = mm(wage ~ service, data = cps)

A colleague comes along with a theory that construction workers need to be broken out separately. You're skeptical, but you do this anyway:

cps = transform(cps, const = sector == "const")
mod2 = mm(wage ~ service + const, data = cps)

Compare the two models:

coef(mod1)
## FALSE  TRUE 
## 9.482 6.537 
coef(mod2)
## FALSE.FALSE  TRUE.FALSE  FALSE.TRUE   TRUE.TRUE 
##       9.481       6.537       9.502         NaN 

Notice two things
1. There's a redundancy. Why? (There are really only three groups: service, construction and the rest.)
2. The first two coefficients didn't change from mod1 to mod2, even though we added in a new model term.

By comparison, consider these two models:

coef(lm(wage ~ age, data = cps))
## (Intercept)         age 
##     6.16747     0.07755 
coef(lm(wage ~ age + exper, data = cps))
## (Intercept)         age       exper 
##    -10.3819      0.9232     -0.8190 

The coefficients change hugely.

Don't think that this is about mm() versus lm(). It's a matter of the vectors and whether they are orthogonal to one another or not.

You can test whether two vectors are orthogonal by taking their dot product:

with(data = cps, sum(service * const))  # orthogonal; TRUE/FALSE count as 1/0
## [1] 0
with(data = cps, sum(age * exper))  # not orthogonal
## [1] 426214

When the new vector is orthogonal to the vectors already in the model, including the new vector won't change the coefficients on the old vectors.

Angle program in m155development.R

fetchData("m155development.R")
## Retrieving from http://www.mosaic-web.org/go/datasets/m155development.R
## [1] TRUE

This takes a model formula or a pair of numerical vectors.

with(data = cps, angle(service, const))  # orthogonal
## [1] 90
with(data = cps, angle(age, exper))  # not orthogonal
## [1] 17.84
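
For the vector-pair case, the computation is just the dot-product formula for the cosine. Here's a minimal sketch (my own version, not the actual code in m155development.R):

vecAngle = function(u, v) {
    cosine = sum(u * v)/sqrt(sum(u^2) * sum(v^2))
    180/pi * acos(cosine)  # convert from radians to degrees
}
with(data = cps, vecAngle(age, exper))  # reproduces the 17.84 above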

Remember that when using angle() on a model formula, you are finding the angle between the response and the subspace spanned by all the explanatory vectors.

angle(age ~ exper, data = cps)
## [1] 3.628
angle(age ~ exper - 1, data = cps)
## [1] 17.84

IN PROGRESS

Overall idea: \( R^2 \) summarizes the alignment on a 0-1 scale. 0 means not at all aligned (orthogonal); 1 means perfectly aligned (which might be 0 or 180 degrees). It always includes the intercept in the model — the point is to account for variation. That's why it can be calculated as the variance of the fitted divided by the variance of the response.
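
A tiny illustration of the “0 or 180 degrees” point, with made-up numbers: flipping the sign of a vector reverses its direction but leaves the alignment at 1.

x = rnorm(10)
cor(x, -x)^2  # exactly 1, even though the angle is 180 degrees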

Another way to think of it: the fraction of the squared distance to the response that you've travelled on the buses provided by your model vectors.

Little \( r \)

Watch Out