Regression reports are generated using software that you have already encountered: lm to fit a model (linear model), and summary to construct a report from the fitted model (the model after it has generated a least squares fit). Here is a quick example using the “swim100m.csv” data-set:
library(mosaic)
data("SwimRecords")
mod = lm(time ~ year + sex, data=SwimRecords)
summary(mod)
##
## Call:
## lm(formula = time ~ year + sex, data = SwimRecords)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.7027 -2.7027 -0.5968 1.2796 19.0759
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 555.71678 33.79991 16.441 < 2e-16 ***
## year -0.25146 0.01732 -14.516 < 2e-16 ***
## sexM -9.79796 1.01287 -9.673 8.79e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.983 on 59 degrees of freedom
## Multiple R-squared: 0.844, Adjusted R-squared: 0.8387
## F-statistic: 159.6 on 2 and 59 DF, p-value: < 2.2e-16
Given the coefficient estimate and the standard error from the regression report, the confidence interval is easily generated. For a 95% confidence interval, you simply multiply the standard error by approximately 2 to get the margin of error. For example, in the above, the margin of error on sexM is 2 * 1.0128719 = 2.0257438.
To calculate the two endpoints of the confidence interval, we can simply subtract and add the margin of error to the estimate:
me = 2.0 * 1.0129
-9.7980 + me
## [1] -7.7722
-9.7980 - me
## [1] -11.8238
The key thing to remember is the multiplier that is applied to the standard error. When computing a 95% confidence interval, a values of approximately 2 is used. To calculate the confidence interval directly, we can use the confint function, which calculates the exact multiplier (which depends to some degree on sample size) and applies it to the standard error to produce the confidence intervals.
confint(mod)
## 2.5 % 97.5 %
## (Intercept) 488.0833106 623.3502562
## year -0.2861277 -0.2167997
## sexM -11.8247135 -7.7712094
It would be convenient if the regression report simply included the confidence intervals on the coefficients rather than the standard error. Part of the reason it doesn’t is historical: the desire to connect to the traditional hand calculations…
Confidence intervals on model coefficients can also be computed using the same bootstrapping procedure introduced previously. We first start with our fitted model, in this case, we’ll stick with the previous model, which fits swimming time against year, taking into account sex: time ~ year + sex. The coefficients for this model are given below:
mod = lm(time ~ year + sex, data=SwimRecords) # refitting here for clarity
coef(mod)
## (Intercept) year sexM
## 555.7167834 -0.2514637 -9.7979615
These coefficients reflect one hypothetical random draw from the population-based sampling distribution. Since it is impossible to get another draw from the ‘population’ here, the actual records are all we have to work with. However, we can approximate sampling variation by treating the sample as our population, and re-sampling many times and collecting the results:
s = do(500) * coef(lm(time ~ year + sex, data=resample(SwimRecords)))
head(s)
## Intercept year sexM
## 1 504.9555 -0.2254423 -9.634909
## 2 592.5853 -0.2700457 -9.653906
## 3 526.7751 -0.2371344 -9.490887
## 4 631.4513 -0.2900060 -10.749309
## 5 569.1354 -0.2585922 -8.890205
## 6 545.1696 -0.2459218 -10.232772
To find the standard error of the coefficients, we simply take the standard deviation across the re-sampling trials:
sapply(s, sd) # Apply sd to each column
## Intercept year sexM
## 41.83853150 0.02113542 0.93278124
Multiplying the standard error by 2 should give you the approximate 95% margin of error, which in this case, is 1.8655625 for sexM:
sapply(s, sd) * 2.0
## Intercept year sexM
## 83.67706300 0.04227083 1.86556247
Alternatively, we can compute the actual (approximate) interval:
sapply(s, function(x) mean(x) + (sd(x) * 2 * c(-1, 1)))
## Intercept year sexM
## [1,] 469.9886 -0.2927026 -11.689977
## [2,] 637.3428 -0.2081609 -7.958852
When a model is used to make a prediction, it is helpful to be able to describe now precise the prediction is. For example, suppose you want to use the “kidsfeet.csv” data-set to make a prediction of the foot width of a girl whose foot length is 25cm. First, you’ll need to explore you data a bit, and then fit a model:
data("KidsFeet")
names(KidsFeet)
## [1] "name" "birthmonth" "birthyear" "length" "width"
## [6] "sex" "biggerfoot" "domhand"
with(data=KidsFeet, levels(sex))
## [1] "B" "G"
mod = lm(width ~ length + sex, data=KidsFeet)
The object created by lm() contains a lot of information about the model and the data used for fitting the model. For the purposes of using the model to make a prediction, you want a function that takes the values for the explanatory variables as inputs and returns the corresponding model value as the output. To turn the model into a function, we can use makeFun() (you can call the resulting function anything you want):
f = makeFun(mod)
Now we can use f() to convert inputs into outputs. Take care to use the right coding for categorical variables:
f(length=25, sex="G")
## 1
## 8.934275
In order to generate a confidence interval, the prediction function f() needs to be told what type of interval is needed. There are two types of prediction confidence intervals:
Interval on the model value, which reflects the sampling distribution of the coefficients themselves. To calculate this, use the interval="confidence" named argument:
f(length=25, sex="G", interval="confidence")
## fit lwr upr
## 1 8.934275 8.742565 9.125985
The components named lwr and upr are the lower and upper limits of the confidence interval, respectively.
Interval on the prediction, which includes the variation due to the uncertainty in the coefficients as well as the size of a typical residual. To find this interval, use the interval="prediction" named argument:
f(length=25, sex="G", interval="prediction")
## fit lwr upr
## 1 8.934275 8.130484 9.738066
The prediction interval is larger than the model-value confidence interval because the residual always gives additional uncertainty around the model value. To see this visually, we can plot the confidence and prediction intervals around our KidsFeet data:
This demo is based directly on material from ‘Statistical Modeling: A Fresh Approach (2nd Edition)’ by Daniel Kaplan.