In a previous example, linear regression was examined through the simple regression setting, i.e., one independent variable. Fitting a linear model allows one to answer questions such as:

- What is the mean value of the response \(y\) at a given value of the predictor \(x\)?
- What value of the response would be predicted for a new observation at a given value of \(x\)?

In the case of the cars dataset, which was examined in the previous example and will be used again here, these questions can be made more precise:

- What is the mean stopping distance of cars traveling at a particular speed?
- What stopping distance would be predicted for a new car traveling at a particular speed?
Confidence and prediction intervals are often formed to answer questions such as these. An interval estimates a range of values that, at a stated level of confidence (typically 95%), contains the true population parameter. It should be noted that although the questions above sound similar, estimating a mean response and predicting a new value are different problems, and the difference will be seen in the interval equations. This post will explore confidence and prediction intervals as well as the confidence ‘bands’ around a linear regression line.
Begin by loading the cars dataset and the ggplot2 package.
library(ggplot2)
data("cars")
Once again, plot the two variables as a scatterplot and draw a linear regression line through the points. The gray band around the line is the 95% confidence interval of the regression line. This plot will be recreated using base R graphics in a function below.
ggplot(cars, aes(x=speed, y=dist)) +
geom_point(color='#2980B9', size = 4) +
geom_smooth(method=lm, color='#2C3E50')
Fit a linear regression model and print a summary.
cars.lm <- lm(dist ~ speed, data = cars)
summary(cars.lm)
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -29.069  -9.525  -2.272   9.215  43.201
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12
As before, speed has a strong, statistically significant relationship with stopping distance. Knowing this relationship, one can then form and answer questions such as those above. The confidence interval (also called the mean interval) addresses the first question: what is the mean stopping distance of cars traveling at a particular speed?
The confidence interval around the mean response, denoted \(\mu_y\), when the predictor value is \(x_k\) is defined as:

\[ \hat{y}_k \pm t_{\alpha / 2, n - 2} \sqrt{MSE \left(\frac{1}{n} + \frac{(x_k - \bar{x})^2}{\sum (x_i - \bar{x})^2} \right)} \]

Where \(\hat{y}_k\) is the fitted response at the predictor value \(x_k\), \(t_{\alpha / 2, n - 2}\) is the critical t-value with \(n - 2\) degrees of freedom, and \(\sqrt{MSE \left(\frac{1}{n} + \frac{(x_k - \bar{x})^2}{\sum (x_i - \bar{x})^2} \right)}\) is the standard error of the fit.
To find the confidence interval in R, create a new data.frame with the desired value to predict. The prediction is made with the predict() function, with the interval argument set to ‘confidence’ to output the mean interval.
new.dat <- data.frame(speed=30)
predict(cars.lm, newdata = new.dat, interval = 'confidence')
##        fit      lwr      upr
## 1 100.3932 87.43543 113.3509
From the output, the fitted stopping distance at a speed of 30 mph is just above 100 feet. The interval of (87.44, 113.35) is the range in which the true mean stopping distance at that speed lies, at a 95% level of confidence.
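As a cross-check on the equation above, predict() can also return the standard error of the fit directly through its se.fit argument; multiplying that by the critical t-value reproduces the interval. A small sketch reusing cars.lm and new.dat:

# se.fit = TRUE returns the standard error of the fitted mean response
fit <- predict(cars.lm, newdata = new.dat, se.fit = TRUE)
t.val <- qt(0.975, fit$df)  # Critical t-value with n - 2 = 48 degrees of freedom
c(fit$fit - t.val * fit$se.fit, fit$fit + t.val * fit$se.fit)  # Matches (87.44, 113.35)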
The prediction interval, which addresses the second question above, is calculated much like the confidence interval, but as mentioned earlier, the two differ in what they represent. The prediction interval equation is defined as:

\[ \hat{y}_k \pm t_{\alpha / 2, n - 2} \sqrt{MSE \left(1 + \frac{1}{n} + \frac{(x_k - \bar{x})^2}{\sum (x_i - \bar{x})^2} \right)} \]

Where \(\hat{y}_k\) is the fitted response at predictor value \(x_k\) and the critical t-value is \(t_{\alpha / 2, n - 2}\) with \(n - 2\) degrees of freedom. \(\sqrt{MSE \left(1 + \frac{1}{n} + \frac{(x_k - \bar{x})^2}{\sum (x_i - \bar{x})^2} \right)}\) is the standard error of prediction.
To find the prediction interval in R, the predict() function is utilized once again, but this time the interval argument is given ‘prediction’.
predict(cars.lm, newdata = new.dat, interval = 'prediction')
##        fit      lwr     upr
## 1 100.3932 66.86529 133.921
Notice the fitted value is the same as before, but the interval is wider. This is due to the additional term in the standard error of prediction. Prediction and confidence intervals are similar in that both are built around a predicted response, but they represent different things: the confidence interval accounts only for the uncertainty in estimating the mean response \(\mu_y\) at \(x_k\), while the prediction interval must also account for the variability of an individual observation around that mean. Since a new observation varies around the mean with variance \(\sigma^2\) (estimated by MSE), the prediction interval will always be wider.
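This relationship can be seen numerically: the squared standard error of prediction equals MSE plus the squared standard error of the fit, so the wider interval follows directly. A sketch reusing the objects above (residual.scale is the residual standard error returned by predict()):

# The prediction variance adds the full MSE to the variance of the fitted mean
fit <- predict(cars.lm, newdata = new.dat, se.fit = TRUE)
mse <- fit$residual.scale^2             # residual.scale is sqrt(MSE)
se.pred <- sqrt(mse + fit$se.fit^2)     # Standard error of prediction
t.val <- qt(0.975, fit$df)
c(fit$fit - t.val * se.pred, fit$fit + t.val * se.pred)  # Matches (66.87, 133.92)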
Confidence intervals can also be formed around the estimated slope coefficient \(\hat{\beta}_1\). The interval for \(\beta_1\) is defined as:

\[ \hat{\beta}_1 \pm t_{\alpha / 2, n - 2} \left(\frac{\sqrt{MSE}}{\sqrt{\sum (x_i - \bar{x})^2}}\right) \]
If the confidence interval for \(\beta_1\) contains 0, it can be concluded there is no significant evidence of a linear relationship between predictor \(x\) and response \(y\) in the population.
The \(\beta_1\) interval can be found with the confint() function.
confint(cars.lm)
##                  2.5 %    97.5 %
## (Intercept) -31.167850 -3.990340
## speed         3.096964  4.767853
The fitted \(\beta_1\) is 3.93 with an interval of (3.10, 4.77). Since the interval does not contain 0, there is significant evidence of a linear relationship between speed and stopping distance.
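As a quick sketch, the same interval can be reproduced from the \(\beta_1\) equation above using the fitted model:

# Reproduce the beta1 interval from the interval equation
x <- cars$speed
mse <- sum(residuals(cars.lm)^2) / df.residual(cars.lm)  # Mean squared error
se.b1 <- sqrt(mse) / sqrt(sum((x - mean(x))^2))          # Standard error of beta1
coef(cars.lm)[2] + c(-1, 1) * qt(0.975, df.residual(cars.lm)) * se.b1  # Matches confint()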
The equations written above can also be verified together with a custom function. The following function replicates the mean, prediction, and coefficient intervals found earlier.
mean.pred.intervals <- function(x, y, pred.x) {
  n <- length(y)                       # Find sample size
  lm.model <- lm(y ~ x)                # Fit linear model
  y.fitted <- lm.model$fitted.values   # Extract the fitted values of y

  # Coefficients of the linear model, beta0 and beta1
  b0 <- lm.model$coefficients[1]
  b1 <- lm.model$coefficients[2]

  pred.y <- b1 * pred.x + b0           # Predict y at the given value of x (argument pred.x)

  # Find SSE and MSE
  sse <- sum((y - y.fitted)^2)
  mse <- sse / (n - 2)

  t.val <- qt(0.975, n - 2)            # Critical value of t

  # Variance factors multiplied by MSE under the square root of each interval equation
  mean.se.fit <- 1 / n + (pred.x - mean(x))^2 / sum((x - mean(x))^2)      # Mean (confidence) interval
  pred.se.fit <- 1 + 1 / n + (pred.x - mean(x))^2 / sum((x - mean(x))^2)  # Prediction interval

  # Mean estimate upper and lower confidence limits at 95% confidence
  mean.conf.upper <- pred.y + t.val * sqrt(mse * mean.se.fit)
  mean.conf.lower <- pred.y - t.val * sqrt(mse * mean.se.fit)

  # Prediction upper and lower confidence limits at 95% confidence
  pred.conf.upper <- pred.y + t.val * sqrt(mse * pred.se.fit)
  pred.conf.lower <- pred.y - t.val * sqrt(mse * pred.se.fit)

  # Beta1 upper and lower confidence limits at 95% confidence
  b1.conf.upper <- b1 + t.val * sqrt(mse) / sqrt(sum((x - mean(x))^2))
  b1.conf.lower <- b1 - t.val * sqrt(mse) / sqrt(sum((x - mean(x))^2))

  # Build data.frames of the upper and lower limits calculated above,
  # as well as the predicted y and beta1 values
  upper <- data.frame(rbind(round(mean.conf.upper, 2), round(pred.conf.upper, 2), round(b1.conf.upper, 2)))
  lower <- data.frame(rbind(round(mean.conf.lower, 2), round(pred.conf.lower, 2), round(b1.conf.lower, 2)))
  fit <- data.frame(rbind(round(pred.y, 2), round(pred.y, 2), round(b1, 2)))

  # Collect all into a data.frame and rename the columns
  results <- data.frame(cbind(lower, upper, fit), row.names = c('Mean', 'Prediction', 'Coefficient'))
  colnames(results) <- c('Lower', 'Upper', 'Fit')

  return(results)
}
mean.pred.intervals(cars$speed, cars$dist, new.dat$speed)
##             Lower  Upper    Fit
## Mean        87.44 113.35 100.39
## Prediction  66.87 133.92 100.39
## Coefficient  3.10   4.77   3.93
The output confirms the equations above return the same results as the built-in R functions.
The gray ‘bands’ around the regression line in the plot above represent the range in which the true regression line lies at a certain level of confidence (95% in the plot). The bands trace the confidence interval for the mean response at every value of \(x\) and are tightest near \(\bar{x}\), where the \((x - \bar{x})^2\) term is smallest.
The confidence interval around the regression line is calculated similarly to the other intervals examined above.
\[ \hat{y} \pm t_{\alpha / 2, n - 2} \, (s.e.)_y \]

Where \((s.e.)_y\) is the standard error of the regression (the square root of MSE) multiplied by a factor that grows with the distance of \(x\) from \(\bar{x}\):

\[ (s.e.)_y = \sqrt{\frac{\sum^n_{i=1} (y_i - \hat{y}_i)^2}{n - 2}} \sqrt{\frac{1}{n} + \frac{(x - \bar{x})^2}{\sum^n_{i=1} (x_i - \bar{x})^2}} \]
The function below fits a linear regression line on one independent variable and finds the confidence interval around the line. The function also outputs a graph with the computed confidence intervals using base R graphics. It is not meant to replace the much nicer plot above, but rather to demonstrate how the intervals are calculated and plotted.
reg.conf.intervals <- function(x, y) {
  n <- length(y)                       # Find length of y to use as sample size
  lm.model <- lm(y ~ x)                # Fit linear model

  # Extract fitted coefficients from model object
  b0 <- lm.model$coefficients[1]
  b1 <- lm.model$coefficients[2]

  # Find SSE and MSE
  sse <- sum((y - lm.model$fitted.values)^2)
  mse <- sse / (n - 2)

  t.val <- qt(0.975, n - 2)            # Calculate critical t-value

  # Evaluate the fitted line on a grid of x values
  x_new <- seq(min(x), max(x), length.out = 100)
  y.fit <- b1 * x_new + b0

  # Find the standard error of the regression line at each grid point
  se <- sqrt(mse) * sqrt(1 / n + (x_new - mean(x))^2 / sum((x - mean(x))^2))

  # Upper and lower confidence bands around the fitted line
  slope.upper <- y.fit + t.val * se
  slope.lower <- y.fit - t.val * se

  # Collect the computed confidence bands into a data.frame and name the columns
  bands <- data.frame(slope.lower, slope.upper)
  colnames(bands) <- c('Lower Confidence Band', 'Upper Confidence Band')

  # Plot the fitted linear regression line and the computed confidence bands
  plot(x, y, cex = 1.75, pch = 21, bg = 'gray')
  lines(x_new, y.fit, col = 'black', lwd = 2)
  lines(x_new, slope.lower, col = 'blue', lty = 2, lwd = 2)
  lines(x_new, slope.upper, col = 'blue', lty = 2, lwd = 2)

  return(bands)
}
conf.intervals <- reg.conf.intervals(cars$speed, cars$dist)
The blue dashed lines represent the confidence bands around the fitted regression line.
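As with the other intervals, these hand-computed bands can be cross-checked against predict() evaluated over a grid of speeds. A sketch using matlines() to draw the fit and both bands in one call:

# Cross-check the bands with predict() over a grid of speed values
grid <- data.frame(speed = seq(min(cars$speed), max(cars$speed), length.out = 100))
ci <- predict(cars.lm, newdata = grid, interval = 'confidence')
plot(cars$speed, cars$dist, cex = 1.75, pch = 21, bg = 'gray')
matlines(grid$speed, ci, lty = c(1, 2, 2), col = c('black', 'blue', 'blue'), lwd = 2)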
This post examined confidence and prediction intervals of predicted values from a linear regression model, as well as intervals for the model’s coefficients. Confidence intervals are often a much more useful representation of the data than a point estimate (such as a prediction), as intervals give more information about the population.
The functions above can also be found as a Gist.
3.3 - Prediction interval for a new response. (2016). Retrieved July 12, 2016, from Penn State Eberly College of Science: https://onlinecourses.science.psu.edu/stat501/node/274

Understanding shape and calculation of confidence bands in linear regression. (2016). Retrieved from http://stats.stackexchange.com/questions/101318/understanding-shape-and-calculation-of-confidence-bands-in-linear-regression

Confidence interval. (2016). In Wikipedia. Retrieved from https://en.wikipedia.org/wiki/Confidence_interval#Statistical_theory