Stats 155 Class Notes 2012-10-08

Where We Are

At this point in the course, after the “Week of Theory,” you should have a better understanding of:
1. Why the coefficients on any given term depend on what other terms are in the model. That Simpson's Paradox is not the act of an angry deity, but a simple consequence of alignment of explanatory vectors.
2. The role that the intercept plays. Why it's a choice to represent a model whose model values are groupwise means as either the means themselves or a reference mean and deltas to the other means.
3. That alignment of vectors makes things a bit harder, but is not a fundamental difficulty so long as the alignment is not absolute, as in redundancy. We can measure alignment between two vectors as an angle. But we need a way to measure alignment between a vector and a set of vectors.
4. That a rough measure of the “quality of a model” is the size of the residuals. The smaller the typical residual, the better the model.

You should already know how to fit models, even pretty complicated models that include lots of variables. You understand that the model itself takes the form of a function, but that there are important attributes of a model that go beyond just the function itself, e.g. the residuals, model coefficients, confidence intervals on model coefficients.

But models are looking like they hard to interpret. The quantification of effect size is complicated by interaction terms. It's also complicated by the change in coefficients when an additional model term is put into a model.

Where We're Heading

This week, we'll be developing some tools for interpreting models.

The “official” statistical method for measuring alignment: not an angle but \( R^2 \) and \( r \): the coefficient of determination and the correlation coefficient. (Bad names: these “coefficients” are not like coefficients in models. That's why \( R^2 \) and \( r \) are so widely used.)
The use of partial derivatives applied to model functions to quantify effect size. This will simplify the interpretation of effect size. However, you'll discover that you need to be thoughtful about whether a partial derivative is what you want or not.

Accounting for Variation

The definition offered for “statistics” at the beginning of the semester was

The explanation of or accounting for variation in the context of what remains unexplained.

At this point, you should know
1. What is meant by variation.

2. What is meant by “explanation” and why “accounted for” might be a more accurate. That We're concerned in our models with explaining just one quantity, labelled the “response variable,” and are using the explanatory variable to do the explaining or accounting for.
3. That “what remains unexplained” is represented by the residual. So it's not at this point a philosophical statement of what we don't know, but just the difference between what the model says and what actually happened.

Now we need to sort through the meaning of “in the context of”.

We will eventually reach \( R^2 \), a very widely used quantity. So you might as well know what \( R^2 \) is: a ratio of variances: the variance in the fitted divided by the variance of the response. That is, how much the model created versus how much there was to create.

Two ways do display \( R^2 \):

Do summary() on the model.
Use do(1), which is arranged to extract the \( R^2 \).

Some questions:

(theoretical) Why variance and not standard deviation or something else? Ans: Partitioning.
(theoretical) Why var(fitted)/var(response) rather than var(fitted)/var(resid)?
(theoretical) Why the fitted model values can never have a larger variance than the response variable?
(practical) How big is big enough?

Partitioning of Variability

Variance

Show that sd, IQR, 95% coverage don't partition variation. That is, the IQR of the fitted + IQR of resids is not necessarily IQR of response variable.
Var does partition variation. This has to do with the residual being perpendicular to the fitted model values and the pythagorean theorem.

Why not, say, var(fitted)/var(resid)?

That would be reasonable, too. var(fitted)/var(response) has a nice property of being easy to interpret. It's always between 0 and 1.

Later on we'll switch to a more technical statistic, called the F statistic, which is

F = (var(fitted)/m-1)/(var(resid)/(n-m-1)) This has nice properties in terms of the shape of the probability distribution. But fundamentally, it has no advantage over \( R^2 \).

Why is var(fitted) always no bigger than var(response)?

This has to do with the length of vectors. The variance is proportional to the length of the vector (after subtracting out the mean). The triangle inequality mandates that the length of the leg of a right triangle cannot be longer than the length of the hypotenuse.

Actually, you can create models that have a var(fitted) larger than var(response). This involves either of two strategies:

Not fitting the model. If you don't fit, there's no guarantee that the residual is perpendicular to the fitted model values.
Not including an intercept (or, the equivalent, which can be had by including a categorical variable). This messes up the link between the length of the vector and the variance.

Example (which shows one reason not to leave off the intercept):

mod = lm(length ~ width - 1, data = KidsFeet)
var(fitted(mod))

## [1] 1.958

with(KidsFeet, var(length))

## [1] 1.736

To avoid this problem, there's another definition of \( R^2 \) used by summary() that reduces to the ratio of variances when the vector of 1s is in the model subspace.

How Big is Big Enough?

This depends on your purpose.

\( R^2 = 0.01 \) on daily stock-market prices would be plenty for me to become the richest person in the world.
Rule of thumb often heard: Bigger than 0.15. But this has no clear justification. EXAMPLE: SAT versus freshman GPA: \( R^2 = 0.16 \) WATCH OUT, people sometimes report R in order to make it seem that the number is bigger. Why this is wrong: it doesn't take into account how much evidence there is. Suppose you did an experiement comparing 10 cases to 10 controls. In reality, there was no relationship. What do you see?

max(do(100) * cor(rnorm(10), rnorm(10))^2)

## result 
## 0.5996

TURN THIS INTO A BLOG: Early inference.

Comparing models. Bigger \( R^2 \) is better. (But is it really?)
Hypothesis testing approach: Can we detect the effect?

Correlation Coefficient.

History.

Ellipical scatter plots.
Ellipses: ratio of minor to major axis legthth.

Relationship between \( R^2 \) and \( r \)

\( r \) is just for a straight-line model
* Example using ccf versus month: obviously a relationship but not a straight-line relationship

Comparing Models

How R² Changes as we add in model terms
Nesting
Random vectors. Use shuffle() and a small data set.

The “Coefficient of Determination” \( R^2 \)

\( R^2 \) as ratio of variances
- Meaning of \( R^2 \) as fraction of variance accounted for by model
- Meaning of correlation coefficient r

Orthogonality and Alignment

Orthogonality of explanatory vectors is a nice, simplifying property. (One more demo of that today.)

Why Groupwise means are so simple

They are so simple that it's tempting to try to cast everything in terms of groupwise means. But this is too limiting — it doesn't, for example, allow us to deal with quantitative explanatory variables.

Are there lessons to learn in terms of simplifying the interpretation of other kinds of coefficients?

EXAMPLE: Suppose you have decided that the issue in wages is about unionization and whether or not the worker is a service worker — all the other sectors are equivalent. You add a new variable to the wage data:

cps = fetchData("CPS85")

## Data CPS85 found in package.

cps = transform(cps, service = sector == "service")
mod1 = mm(wage ~ service, data = cps)

A colleague comes along with a theory that construction workers need to be broken out separately. You're skeptical, but you do this anyways

cps = transform(cps, const = sector == "const")
mod2 = mm(wage ~ service + const, data = cps)

Compare the two models:

coef(mod1)

## FALSE  TRUE 
## 9.482 6.537

coef(mod2)

## FALSE.FALSE  TRUE.FALSE  FALSE.TRUE   TRUE.TRUE 
##       9.481       6.537       9.502         NaN

Notice two things
1. There's a redundancy. Why? (There are really only three groups: service, construction and the rest.)
2. The first two coefficients didn't change from mod1 to mod2, even though we added in a new model term.

By comparison, consider these two models:

coef(lm(wage ~ age, data = cps))

## (Intercept)         age 
##     6.16747     0.07755

coef(lm(wage ~ age + exper, data = cps))

## (Intercept)         age       exper 
##    -10.3819      0.9232     -0.8190

The coefficients change hugely.

Don't think that this is about mm() versus lm(). It's a matter of the vectors and whether they are orthogonal to one another or not.

You can test whether two vectors are orthogonal by taking their dot product:

with(data = cps, sum(service * const))  # orthogonal

## [1] 0

with(data = cps, sum(age * exper))  # not orthogonal

## [1] 426214

When the new vector is orthogonal to the one already in the model, the inclusion of the new vector in the model won't change the coefficients on the old vector.

Angle program in `m155development.R`

fetchData("m155development.R")

## Retrieving from http://www.mosaic-web.org/go/datasets/m155development.R

## [1] TRUE

This takes a model formula or a pair of numerical vectors.

with(data = cps, angle(service, const))  # orthogonal

## [1] 90

with(data = cps, angle(age, exper))  # not orthogonal

## [1] 17.84

Remember that when using angle on a model or a model formul, you are finding the angle between the response and the subspace due to all the explanatory vectors.

angle(age ~ exper, data = cps)

## [1] 3.628

angle(age ~ exper - 1, data = cps)

## [1] 17.84

In PROGRESS

Overall idea: \( R^2 \) summarizes the alignment on a 0-1 scale. 0 means not at all aligned (orthogonal); 1 means perfectly aligned (which might be 0 or 180 degrees). It always includes the intercept in the model — the point is to account for variation. That's why it can be calculated as the variance of the fitted divided by the variance of the response.

Another way to think of it: The fraction of the square-distance that you've travelled to the response on the busses provided by your model vectors.

Little \( r \)

Just the square-root of \( R^2 \) from the simple model A ~ B with A and B both quantitative.
People who don't want to talk about model coefficients use the correlation coefficient.
- The sign reflects the sign of the coefficient.
- But the magnitude isn't an effect size.
\( r \) is stupid if the relationship is nonlinear (try ccf versus month in utility data)
Our practices:
- Use the coefficients to describe effect size (or, more generally, partial derivatives)
- Include what needs to be included in the model to capture the phenomena of interest.

Watch Out

Sometimes people report \( R \) instead of \( R^2 \) because it looks bigger! You should be aware of this.
Example: SAT scores and college GPA have \( R \approx 0.4 \). That sounds pretty large, but remember that \( R^2 \approx 0.16 \). It wouldn't really matter which one you use, but it's wrong to compare two different things as if they were the same.
- In SAT scores, there's a question of whether the lack of predictability is due to SAT or to GPA. How would you decide?