Stats 155 Class Notes 2012-10-22

Topics for the Day

Another “Day of Theory”

Review of the factors that shape a confidence interval on a model coefficient.

But model coefficients are not the only thing you might want to put a confidence interval on; predictions and model values are other candidates (see the last section).

Geometrical Picture of Confidence Intervals

CI geometry slides

For the coefficients, and for how the spread of the coefficients grows as the ball gets bigger, see slides 38 and 39 of the geometry-of-statistics slides.

What's the geometry of the number of data points?

Looking at the distribution of angles between random vectors in 2 and higher dimensions.

Ask the class

Confirm by simulation

Ask for suggestions for ways to generate a randomly pointing vector in the plane. Perhaps someone will suggest generating a random angle and taking the sine and cosine of that. Good idea, but it builds in the assumption that the angles of randomly generated vectors are evenly distributed, which is exactly what we want to check, and it doesn't generalize to higher dimensions.
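For the record, that suggestion would look something like this (just a sketch; the choice of 1000 points is mine):

theta = runif(1000, min = 0, max = 2 * pi)  # a random angle ...
xyplot(sin(theta) ~ cos(theta))  # ... turned into a point on the unit circle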

Consider two ways to generate random numbers:

  1. Generate a sample with large n (say n=10000) and report on the distribution.
  2. Generate x and y coordinates and plot them:
xyplot(runif(1000, min = -1, max = 1) ~ runif(1000, min = -1, max = 1))

[plot: scatter of 1000 uniformly distributed (x, y) points]

xyplot(rnorm(1000) ~ rnorm(1000))

[plot: scatter of 1000 normally distributed (x, y) points]

  3. Get the distribution of angles. From trigonometry, remember that the angle is the arctangent of the y and x coordinates; atan2() handles all four quadrants. Do help(atan2). We'll do a very large simulation (n=100000) so we get a very smooth distribution:
sunif = atan2(runif(1e+05, min = -1, max = 1), runif(1e+05, min = -1, max = 1)) * 
    180/pi
densityplot(~sunif)

[plot: density of angles computed from uniform coordinates]

snorm = atan2(rnorm(1e+05), rnorm(1e+05)) * 180/pi
densityplot(~snorm)

[plot: density of angles computed from normal coordinates]

The uniform distribution doesn't generate uniformly distributed angles, but the normal distribution does.

atan2() works in the plane, but we'll want to experiment in higher dimensions as well. So let's calculate the angle based on the dot product.
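From the dot product, \( \cos(\theta) = \frac{u \cdot v}{\|u\| \, \|v\|} \), so the angle is \( \theta = \arccos\left( \frac{u \cdot v}{\|u\| \, \|v\|} \right) \).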

Write an angle program. (You'll want to build this up in stages.)

ang = makeFun((180/pi) * acos(sum(u * v)/sqrt(sum(u * u) * sum(v * v))) ~ u & v)
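A quick sanity check on vectors whose angle we already know (my own test cases, not from the notes):

ang(c(1, 0), c(0, 1))  # perpendicular: should give 90
ang(c(1, 1), c(2, 2))  # parallel: should give 0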

Or, if you prefer, use the angle() program in m155development.R:

fetchData("m155development.R")
## Retrieving from http://www.mosaic-web.org/go/datasets/m155development.R
## [1] TRUE

Random angles in 2 dimensions:

s = do(1000) * angle(rnorm(2), rnorm(2))
densityplot(~result, data = s)

[plot: density of angles between pairs of random 2-dimensional vectors]

In 2 dimensions the angle is uniform on 0 to 180 degrees. But that intuition is misleading in higher dimensions. Do the same thing in 3 and higher dimensions.
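For instance, in 3 dimensions (same pattern as above; the 1000 repetitions are arbitrary):

s = do(1000) * angle(rnorm(3), rnorm(3))
densityplot(~result, data = s)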

Do the same thing, but replacing one of the vectors with a fixed vector:

v = rnorm(2)
s = do(1000) * angle(rnorm(2), v)
densityplot(~result, data = s)

[plot: density of angles between random 2-dimensional vectors and a fixed vector]

Now with a 3-dimensional vector:

v = rnorm(3)
s = do(1000) * angle(rnorm(3), v)
densityplot(~result, data = s)

[plot: density of angles, 3-dimensional vectors]

Why aren't the angles uniformly distributed? Demo: fix one of the vectors at the North Pole. Toss a globe around the room. Report the co-latitude of where your right middle finger lands. (Co-latitude is the angle from the North Pole. So the equator is at 90 deg.) You're much more likely to land near the equator than near the poles.
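Quantitatively, on a sphere the band of directions at co-latitude \( \theta \) has area proportional to \( \sin(\theta) \), so landing spots pile up near the equator (\( \theta = 90^\circ \)) and are scarce near the poles.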

Now a 30-dimensional vector:

v = rnorm(30)
s = do(1000) * angle(rnorm(30), v)
densityplot(~result, data = s, xlim = c(0, 180))

[plot: density of angles, 30-dimensional vectors]

ACTIVITY: Find a simple relationship between the mean angle and n. Find a simple relationship between the variance of the angle and n. Divide people up into 6 groups:
1. use the angle in degrees, study the mean angle
2. use the angle in degrees, study the variance of the angle
3. use the angle in radians, study the mean angle (set degrees=FALSE in angle())
4. use the angle in radians, study the variance of the angle.
5. use the cosine of the angle, study the mean cosine of the angle.
6. use the cosine of the angle, study the variance of the angle.

Suggestions:
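For instance, a group might loop over several values of n (a sketch; the particular sample sizes and the 1000 repetitions are my own choices):

ns = c(2, 5, 10, 20, 50, 100)
for (n in ns) {
    s = do(1000) * angle(rnorm(n), rnorm(n))
    print(c(n = n, mean = mean(s$result), sd = sd(s$result)))
}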

Results
The mean of the random angle is 90 degrees. The standard deviation of the angle is close to proportional to \( 1/\sqrt{n} \). Random vectors tend to be orthogonal.

Back to the Geometry

Since the random vectors will tend to be almost orthogonal to the subspace of the explanatory vectors B and C, the size of the random ball will be reduced. Reduced by how much? By a factor of \( \cos(\theta) \).

Overall result: the standard deviation of the cosine of the angle scales as \( 1/\sqrt{n} \).
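A quick simulation check of this claim (a sketch; n = 50 and the 1000 repetitions are arbitrary choices):

n = 50
s = do(1000) * angle(rnorm(n), rnorm(n), degrees = FALSE)
sd(cos(s$result))  # compare to 1/sqrt(n)
1/sqrt(n)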

Disastrous Collinearity

Consider a model of wage that incorporates all of these variables: age, educ, exper.

Let's look at them one at a time:

mod0 = lm(wage ~ exper, data = CPS85)
confint(mod0)
##                 2.5 %  97.5 %
## (Intercept) 7.6159013 9.14403
## exper       0.0009191 0.07136
0.07136 - 0.00092  # width of the CI on exper
## [1] 0.07044

Putting in covariates can reduce the confidence interval:

mod1 = lm(wage ~ exper + sector * sex, data = CPS85)
head(confint(mod1))
##                2.5 %  97.5 %
## (Intercept)  5.18536 7.55326
## exper        0.02329 0.08694
## sectorconst -1.24170 4.35135
## sectormanag  1.58599 5.98202
## sectormanuf -3.90760 0.26778
## sectorother -5.32009 2.23733
0.08694 - 0.02329
## [1] 0.06365

The interval is reduced because the length of the residual vector has been reduced:

sqrt(sum(resid(mod0)^2))
## [1] 118.2
sqrt(sum(resid(mod1)^2))
## [1] 103.2

What about collinearity with the covariates?

r2 = r.squared(lm(exper ~ sector * sex, data = CPS85))
acos(sqrt(r2)) * 180/pi
## [1] 78.42
1/sin(acos(sqrt(r2)))  # variance inflation: 2%
## [1] 1.021
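Equivalently, since \( \cos(\theta) = \sqrt{R^2} \), the inflation factor \( 1/\sin(\theta) \) is just \( 1/\sqrt{1 - R^2} \); it could have been computed directly as 1/sqrt(1 - r2).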

Now add in another variable, perhaps treated as a covariate, or perhaps because we want to untangle the effects of experience and education (lower experience is correlated with higher education).

mod3 = lm(wage ~ exper + educ, data = CPS85)
confint(mod3)
##                2.5 %  97.5 %
## (Intercept) -7.29899 -2.5100
## exper        0.07135  0.1389
## educ         0.76605  1.0859
0.1389 - 0.0713
## [1] 0.0676
mod4 = lm(wage ~ exper + educ + sector * sex, data = CPS85)
head(confint(mod4))
##                2.5 %  97.5 %
## (Intercept) -6.67715 -0.7109
## exper        0.06775  0.1334
## educ         0.52040  0.9139
## sectorconst  0.01977  5.3970
## sectormanag  0.77908  5.0046
## sectormanuf -2.39933  1.6660
0.1334 - 0.0677
## [1] 0.0657

Only a small reduction in the width of the confidence interval. Here's the reduction in the length of the residual vector.

sqrt(sum(resid(mod3)^2))
## [1] 106
sqrt(sum(resid(mod4)^2))
## [1] 98.46

Looking at the collinearity introduced by educ:

r2 = r.squared(lm(exper ~ educ + sector * sex, data = CPS85))
acos(sqrt(r2)) * 180/pi
## [1] 64.99
1/sin(acos(sqrt(r2)))  # variance inflation: 10%
## [1] 1.103

Now add in age as well

mod5 = lm(wage ~ exper + educ + age + sector * sex, data = CPS85)
head(confint(mod5))
##                 2.5 % 97.5 %
## (Intercept) -16.89668  9.521
## exper        -2.05383  2.257
## educ         -1.44558  2.882
## age          -2.15431  2.152
## sectorconst   0.01712  5.400
## sectormanag   0.77703  5.007

The standard errors have exploded. The length of the residual vector is hardly any smaller than in the previous model,

sqrt(sum(resid(mod4)^2))
## [1] 98.46
sqrt(sum(resid(mod5)^2))
## [1] 98.46

but the variance inflation due to collinearity is much larger

r2 = r.squared(lm(exper ~ educ + age + sector * sex, data = CPS85))
acos(sqrt(r2)) * 180/pi  # less than 1 degree!
## [1] 0.792
1/sin(acos(sqrt(r2)))  # variance inflation: about 72-fold!
## [1] 72.34

As we've seen before, in the CPS85 data, exper was derived from educ and age, but there was a mistake made in one case. Fixing that mistake …

cps = fetchData("CPS85-corrected.csv")
## Retrieving from http://www.mosaic-web.org/go/datasets/CPS85-corrected.csv
mod6 = lm(wage ~ exper + educ + age + sector * sex, data = cps)
head(confint(mod6))
##                2.5 %  97.5 %
## (Intercept) -6.67715 -0.7109
## exper        0.06775  0.1334
## educ         0.52040  0.9139
## age               NA      NA
## sectorconst  0.01977  5.3970
## sectormanag  0.77908  5.0046

When the collinearity is absolute (redundancy), the software recognizes the situation and corrects it by deleting a redundant vector from the model.
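To confirm which vector was dropped, one can look directly at the fitted coefficients (a quick check; output not reproduced here):

coef(mod6)  # the coefficient on age is reported as NA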

Prediction confidence intervals

For prediction or model values, collinearity is not a problem:

mod1 = lm(wage ~ exper + sector * sex, data = CPS85)
r.squared(mod1)
## [1] 0.243
f1 = makeFun(mod1)
f1(exper = 10, sex = "F", sector = "prof", interval = "confidence")
## Warning: prediction from a rank-deficient fit may be misleading
##     fit   lwr   upr
## 1 10.77 9.518 12.02
mod2 = lm(wage ~ exper + educ + sector * sex, data = CPS85)
r.squared(mod2)
## [1] 0.3113
f2 = makeFun(mod2)
f2(exper = 10, educ = 14, sex = "F", sector = "prof", interval = "confidence")
## Warning: prediction from a rank-deficient fit may be misleading
##     fit   lwr   upr
## 1 9.568 8.329 10.81
mod3 = lm(wage ~ exper + educ + age + sector * sex, data = CPS85)
r.squared(mod3)
## [1] 0.3113
f3 = makeFun(mod3)
f3(exper = 10, educ = 14, age = 30, sex = "F", sector = "prof", interval = "confidence")
## Warning: prediction from a rank-deficient fit may be misleading
##     fit   lwr   upr
## 1 9.568 8.315 10.82

Conclusions

Effect Size

When you construct a model to estimate an effect size (whether directly from a coefficient or from a partial derivative), collinearity between the variable of interest and the covariates inflates the standard error and widens the confidence interval.

If you absolutely need a collinear covariate for adjustment purposes, and your confidence intervals are too broad, then you'll need to collect more data.

Prediction/Classification

When you construct a model to make a prediction, you don't need to worry about the collinearity.

The problem with collinearity arises when you want to “share the credit” among two or more related (that is, collinear) variables. Because the variables are aligned, there is ambiguity about how to split up the credit: different ways the residual happens to point lead to different splits.