Lecture 17: Confidence in Models

Review

Sampling Distributions and Resampling Distributions

Construct an interval using do() and resample().

library(mosaic)  # provides do(), resample(), densityplot(), and the KidsFeet data
data("KidsFeet")
s = do(100)*lm(width ~ sex + length, data=resample(KidsFeet))  # 100 fits to resampled data
sd(~length, data=s) # Standard error of the `length` coefficient
## [1] 0.03487237
densityplot(~length, data=s)

To compute a confidence interval from the standard error, we multiply the standard error by 1.96 and center the interval on the mean of the resampling distribution:

confint(s)[3, c("lower", "upper")]  # Grab only `length`
## Warning: confint: Using df=Inf.
##      lower     upper
## 3 0.153881 0.2905782
std = sd(~length, data=s)  # Standard error for `length`
est = mean(~length, data=s)
est + c(-std, +std) * 1.96  # Confidence interval
## [1] 0.1538798 0.2905795
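
We can also read an interval straight off the resampling distribution, without the normal approximation. A quick sketch using mosaic's qdata() (assuming a mosaic version with the formula interface; the bounds will vary a little from run to run):

qdata(~length, p = c(0.025, 0.975), data = s)  # percentile interval; should roughly match the above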

Confidence in coefficients

Kids’ foot width

The setting for this problem is the question of whether girls’ feet are narrower than boys’. Confidence intervals give us a quick answer (based on the available data):

mod = lm(width ~ sex, data=KidsFeet)
confint(mod)
##                  2.5 %      97.5 %
## (Intercept)  8.9758882  9.40411181
## sexG        -0.7125476 -0.09903131

The confidence interval on the difference between boys’ and girls’ foot widths is entirely in the negative. So even this small sample provides evidence that girls’ feet are narrower than boys’.
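
As a cross-check, the resampling approach from the review gives a similar answer. A sketch (the 500 trials are an arbitrary choice, and the exact bounds will vary from run to run):

s1 = do(500)*lm(width ~ sex, data=resample(KidsFeet))
confint(s1)  # the interval on `sexG` should again be entirely negative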

The real question, however, is whether, for any given shoe size (determined by length), girls’ feet are narrower:

mod2 = lm(width ~ sex + length, data=KidsFeet)
confint(mod2)
##                  2.5 %     97.5 %
## (Intercept)  1.1048182 6.17751841
## sexG        -0.4947759 0.02974084
## length       0.1202348 0.32181513

Not so much… the interval on sexG now includes zero, so at a given length the evidence for a sex difference is equivocal.

Confidence in predictions

If we make a complicated model, it’s harder to interpret the confidence intervals on the coefficients:

mod3 = lm(width ~ sex*length, data=KidsFeet)
confint(mod3)
##                   2.5 %    97.5 %
## (Intercept)  0.09816469 7.6059964
## sexG        -5.70117633 4.4534183
## length       0.06326171 0.3619858
## sexG:length -0.18914537 0.2207867

Notice how the confidence interval on sexG has gotten much wider. This can be confusing, since sexG also enters into the interaction term. Model values are good to compare here:

f3 = makeFun(mod3)
f3(length=25, sex="G")
##        1 
## 8.939312
f3(length=25, sex=c("G","B"), interval="confidence")
##        fit      lwr      upr
## 1 8.939312 8.734104 9.144521
## 2 9.167675 8.989846 9.345503

There is considerable overlap between the two intervals. Here we are talking about confidence intervals on the model values themselves.
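
These model values combine several coefficients at once, which is exactly why sexG is hard to read on its own: sexG shifts the intercept while sexG:length shifts the slope. Rebuilding the girls’ value by hand makes this explicit (it should reproduce the 8.939 above):

co = coef(mod3)
co["(Intercept)"] + co["sexG"] + (co["length"] + co["sexG:length"]) * 25  # girls' model value at length 25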

Another sort of question to ask is: if we have a specific girl and a specific boy with the same foot length, how likely are their foot widths to differ:

f3(length=25, sex=c("G","B"), interval="prediction")
##        fit      lwr      upr
## 1 8.939312 8.120989 9.757635
## 2 9.167675 8.355785 9.979564

Here we are interested in the prediction interval, and we find that a typical boy’s predicted value is entirely plausible for a girl, and vice versa.
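
Comparing the widths of the two kinds of intervals makes the distinction vivid: prediction intervals must cover individual-to-individual variation, not just uncertainty in the model values, so they come out roughly four times wider here. A quick computation from the outputs above:

ci = f3(length=25, sex=c("G","B"), interval="confidence")
pi = f3(length=25, sex=c("G","B"), interval="prediction")
ci[, "upr"] - ci[, "lwr"]  # about 0.41 (G) and 0.36 (B)
pi[, "upr"] - pi[, "lwr"]  # about 1.64 (G) and 1.62 (B)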

Confidence and collinearity

Consider a set of models of “wage” from the Current Population Survey dataset from the 1980s. A relatively simple model might take just the worker’s education into account when predicting their wage:

data("CPS85")
mod4 = lm(wage ~ 1 + educ, data=CPS85)
summary(mod4)$coefficients
##               Estimate Std. Error    t value     Pr(>|t|)
## (Intercept) -0.7459797 1.04545406 -0.7135461 4.758208e-01
## educ         0.7504608 0.07873373  9.5316300 5.473845e-20

According to our simple model, the coefficient on “educ” indicates that a one-year increase in education is associated with a 0.75 dollars/hour increase in “wage” (these are data from the 1980s, after all). The confidence interval is roughly 0.75 +/- 0.15 dollars/hour:

confint(mod4)
##                  2.5 %    97.5 %
## (Intercept) -2.7997043 1.3077449
## educ         0.5957936 0.9051279
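
The same interval can be rebuilt by hand from the summary table, just as in the review section (confint() uses a t quantile rather than 1.96, so the match is close but not exact):

b = summary(mod4)$coefficients["educ", ]
b["Estimate"] + c(-1, 1) * 1.96 * b["Std. Error"]  # roughly (0.60, 0.90)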

A slightly more elaborate model includes “age” as a covariate. Maybe we think this is important because education levels are typically lower for older workers, so a comparison of workers with the same level of education ought to hold “age” constant. Here is the model that does this:

mod5 = lm(wage ~ 1 + educ + age, data=CPS85)
summary(mod5)$coefficients
##               Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) -5.5342315 1.27936340 -4.325770 1.816262e-05
## educ         0.8211071 0.07704570 10.657403 3.647559e-24
## age          0.1050279 0.01718347  6.112145 1.904603e-09

Now, the coefficient on “educ” has changed from 0.75 to 0.82 dollars/hour while leaving the standard error pretty much the same. Ok, that is good news, but suppose we decide that wage also depends on experience. We can naively add “exper” to the model:

mod6 = lm(wage ~ 1 + educ + age + exper, data=CPS85)
summary(mod6)$coefficients
##                Estimate Std. Error     t value  Pr(>|t|)
## (Intercept) -4.76987017   7.042711 -0.67727758 0.4985254
## educ         0.94832861   1.155237  0.82089541 0.4120749
## age         -0.02241042   1.154752 -0.01940712 0.9845236
## exper        0.12755813   1.155710  0.11037211 0.9121561

What we now have are hugely inflated standard errors! The confidence interval on “educ” is now 0.95 +/- 2.27 dollars/hour - a very imprecise estimate indeed! In fact, the interval now also includes negative values:

confint(mod6)
##                  2.5 %   97.5 %
## (Intercept) -18.604924 9.065183
## educ         -1.321076 3.217734
## age          -2.290863 2.246042
## exper        -2.142776 2.397892

Why has this happened? Probably because every additional year of education necessarily reduces years of experience by one year. In other words, the multicollinearity of “exper” with “age”, and to a lesser degree “educ”, is making the model coefficients less precise - this inflates the standard errors. We can make a quick assessment of multicollinearity by computing a correlation matrix:

cor(CPS85[,c("educ", "exper", "age")])
##             educ      exper        age
## educ   1.0000000 -0.3526764 -0.1500195
## exper -0.3526764  1.0000000  0.9779612
## age   -0.1500195  0.9779612  1.0000000

where we see that “exper” is highly collinear with “age” (r = 0.978). The smart thing to do here is to remove one of the terms and refit the model before making any claims about the relationship between wage and education.
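
Two quick follow-ups sketch this, assuming (as is conventional for CPS data) that exper was constructed as roughly age minus years of schooling minus 6: regressing exper on the other two predictors should give an R-squared near 1, confirming the near-perfect collinearity, and refitting without age should restore sensible standard errors.

# how much of exper is explained by age and educ together?
summary(lm(exper ~ age + educ, data=CPS85))$r.squared  # expect a value very close to 1
# refit with the redundant term dropped (output omitted)
mod7 = lm(wage ~ 1 + educ + exper, data=CPS85)
summary(mod7)$coefficients  # standard errors should shrink back to sensible sizes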

Reference

This demo is based directly on material from ‘Statistical Modeling: A Fresh Approach (2nd Edition)’ by Daniel Kaplan.