Lecture 18: (More) Confidence in Models

Confidence in coefficients

Regression reports are generated using software that you have already encountered: lm to fit a model (linear model), and summary to construct a report from the fitted model (the model after it has generated a least squares fit). Here is a quick example using the “swim100m.csv” data-set:

library(mosaic)

data("SwimRecords")
mod = lm(time ~ year + sex, data=SwimRecords)
summary(mod)
## 
## Call:
## lm(formula = time ~ year + sex, data = SwimRecords)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.7027 -2.7027 -0.5968  1.2796 19.0759 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 555.71678   33.79991  16.441  < 2e-16 ***
## year         -0.25146    0.01732 -14.516  < 2e-16 ***
## sexM         -9.79796    1.01287  -9.673 8.79e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.983 on 59 degrees of freedom
## Multiple R-squared:  0.844,  Adjusted R-squared:  0.8387 
## F-statistic: 159.6 on 2 and 59 DF,  p-value: < 2.2e-16

Confidence intervals from standard errors

Given the coefficient estimate and the standard error from the regression report, the confidence interval is easily generated. For a 95% confidence interval, you simply multiply the standard error by approximately 2 to get the margin of error. For example, in the above, the margin of error on sexM is 2 * 1.0128719 = 2.0257438.

To calculate the two endpoints of the confidence interval, we can simply subtract and add the margin of error to the estimate:

me = 2.0 * 1.0129
-9.7980 + me
## [1] -7.7722
-9.7980 - me
## [1] -11.8238

The key thing to remember is the multiplier that is applied to the standard error. When computing a 95% confidence interval, a values of approximately 2 is used. To calculate the confidence interval directly, we can use the confint function, which calculates the exact multiplier (which depends to some degree on sample size) and applies it to the standard error to produce the confidence intervals.

confint(mod)
##                   2.5 %      97.5 %
## (Intercept) 488.0833106 623.3502562
## year         -0.2861277  -0.2167997
## sexM        -11.8247135  -7.7712094

It would be convenient if the regression report simply included the confidence intervals on the coefficients rather than the standard error. Part of the reason it doesn’t is historical: the desire to connect to the traditional hand calculations…

Bootstrapping confidence intervals

Confidence intervals on model coefficients can also be computed using the same bootstrapping procedure introduced previously. We first start with our fitted model, in this case, we’ll stick with the previous model, which fits swimming time against year, taking into account sex: time ~ year + sex. The coefficients for this model are given below:

mod = lm(time ~ year + sex, data=SwimRecords) # refitting here for clarity
coef(mod)
## (Intercept)        year        sexM 
## 555.7167834  -0.2514637  -9.7979615

These coefficients reflect one hypothetical random draw from the population-based sampling distribution. Since it is impossible to get another draw from the ‘population’ here, the actual records are all we have to work with. However, we can approximate sampling variation by treating the sample as our population, and re-sampling many times and collecting the results:

s = do(500) * coef(lm(time ~ year + sex, data=resample(SwimRecords)))
head(s)
##   Intercept       year       sexM
## 1  504.9555 -0.2254423  -9.634909
## 2  592.5853 -0.2700457  -9.653906
## 3  526.7751 -0.2371344  -9.490887
## 4  631.4513 -0.2900060 -10.749309
## 5  569.1354 -0.2585922  -8.890205
## 6  545.1696 -0.2459218 -10.232772

To find the standard error of the coefficients, we simply take the standard deviation across the re-sampling trials:

sapply(s, sd)  # Apply sd to each column
##   Intercept        year        sexM 
## 41.83853150  0.02113542  0.93278124

Multiplying the standard error by 2 should give you the approximate 95% margin of error, which in this case, is 1.8655625 for sexM:

sapply(s, sd) * 2.0
##   Intercept        year        sexM 
## 83.67706300  0.04227083  1.86556247

Alternatively, we can compute the actual (approximate) interval:

sapply(s, function(x) mean(x) + (sd(x) * 2 * c(-1, 1)))
##      Intercept       year       sexM
## [1,]  469.9886 -0.2927026 -11.689977
## [2,]  637.3428 -0.2081609  -7.958852

Prediction confidence intervals

When a model is used to make a prediction, it is helpful to be able to describe now precise the prediction is. For example, suppose you want to use the “kidsfeet.csv” data-set to make a prediction of the foot width of a girl whose foot length is 25cm. First, you’ll need to explore you data a bit, and then fit a model:

data("KidsFeet")
names(KidsFeet)
## [1] "name"       "birthmonth" "birthyear"  "length"     "width"     
## [6] "sex"        "biggerfoot" "domhand"
with(data=KidsFeet, levels(sex))
## [1] "B" "G"
mod = lm(width ~ length + sex, data=KidsFeet)

The object created by lm() contains a lot of information about the model and the data used for fitting the model. For the purposes of using the model to make a prediction, you want a function that takes the values for the explanatory variables as inputs and returns the corresponding model value as the output. To turn the model into a function, we can use makeFun() (you can call the resulting function anything you want):

f = makeFun(mod)

Now we can use f() to convert inputs into outputs. Take care to use the right coding for categorical variables:

f(length=25, sex="G")
##        1 
## 8.934275

In order to generate a confidence interval, the prediction function f() needs to be told what type of interval is needed. There are two types of prediction confidence intervals:

Interval on the model value, which reflects the sampling distribution of the coefficients themselves. To calculate this, use the interval="confidence" named argument:

f(length=25, sex="G", interval="confidence")
##        fit      lwr      upr
## 1 8.934275 8.742565 9.125985

The components named lwr and upr are the lower and upper limits of the confidence interval, respectively.

Interval on the prediction, which includes the variation due to the uncertainty in the coefficients as well as the size of a typical residual. To find this interval, use the interval="prediction" named argument:

f(length=25, sex="G", interval="prediction")
##        fit      lwr      upr
## 1 8.934275 8.130484 9.738066

The prediction interval is larger than the model-value confidence interval because the residual always gives additional uncertainty around the model value. To see this visually, we can plot the confidence and prediction intervals around our KidsFeet data:

Reference

This demo is based directly on material from ‘Statistical Modeling: A Fresh Approach (2nd Edition)’ by Daniel Kaplan.