A key use of OLS regression is to predict what the value of Y would be for some value of X. A closely related skill is graphically displaying the results of a linear regression. The mechanics behind these two tasks in R are very similar, although the philosophy behind them is very different. This handout addresses both.
One important point that may not be obvious is that this process generates the same kind of output – predicted values – that is essential to 2SLS instrumental variables.
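If you have seen 2SLS before, here is a minimal by-hand sketch of that connection using made-up variables y, x, and z (nothing here comes from the swiss data, and the second-stage standard errors from this shortcut are not correct, so treat it purely as an illustration of where predicted values show up):
set.seed(1)
z <- rnorm(100)                       ## a made-up instrument
x <- 0.5*z + rnorm(100)               ## a made-up (endogenous) regressor
y <- 1 + 2*x + rnorm(100)             ## a made-up outcome
first.stage  <- lm(x ~ z)
x.hat        <- predict(first.stage)  ## predicted values of x -- the key ingredient
second.stage <- lm(y ~ x.hat)         ## regress y on those predictions -- the "two stages"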
Let’s go back to the swiss dataset.
swiss.lm <- lm(Fertility~Agriculture,data=swiss)
summary(swiss.lm)
##
## Call:
## lm(formula = Fertility ~ Agriculture, data = swiss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.5374 -7.8685 -0.6362 9.0464 24.4858
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 60.30438 4.25126 14.185 <2e-16 ***
## Agriculture 0.19420 0.07671 2.532 0.0149 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.82 on 45 degrees of freedom
## Multiple R-squared: 0.1247, Adjusted R-squared: 0.1052
## F-statistic: 6.409 on 1 and 45 DF, p-value: 0.01492
This should be a pretty straightforward exercise by now. What comes next is a little more complex. We’re going to create a vector of X values, pred.x, which we will use to generate a vector of predicted Y values, y.hat (think \(\hat{Y}\)). We will calculate those predicted Y values by applying the regression formula we’ve just estimated to the X values we are specifying.
To create pred.x, we start by specifying a reasonable range of X values – in this case, the range of Agriculture, extended a little on each end (so we go just beyond the range we estimated this equation on). We use the seq() command to generate a “sequence” of numbers:
pred.x <- seq(min(swiss$Agriculture)-5,
max(swiss$Agriculture)+5,
1)
pred.x ## displaying the values
## [1] -3.8 -2.8 -1.8 -0.8 0.2 1.2 2.2 3.2 4.2 5.2 6.2 7.2 8.2 9.2
## [15] 10.2 11.2 12.2 13.2 14.2 15.2 16.2 17.2 18.2 19.2 20.2 21.2 22.2 23.2
## [29] 24.2 25.2 26.2 27.2 28.2 29.2 30.2 31.2 32.2 33.2 34.2 35.2 36.2 37.2
## [43] 38.2 39.2 40.2 41.2 42.2 43.2 44.2 45.2 46.2 47.2 48.2 49.2 50.2 51.2
## [57] 52.2 53.2 54.2 55.2 56.2 57.2 58.2 59.2 60.2 61.2 62.2 63.2 64.2 65.2
## [71] 66.2 67.2 68.2 69.2 70.2 71.2 72.2 73.2 74.2 75.2 76.2 77.2 78.2 79.2
## [85] 80.2 81.2 82.2 83.2 84.2 85.2 86.2 87.2 88.2 89.2 90.2 91.2 92.2 93.2
## [99] 94.2
This isn’t the most exciting thing in the world, but I wanted you to see what was going on there. (seq() is a useful tool – play around with it.)
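A couple of quick variations to try (the third argument is the step size, or you can instead ask for a fixed number of points with length.out):
seq(0, 1, 0.25)               ## 0.00 0.25 0.50 0.75 1.00
seq(10, 50, length.out = 5)   ## 10 20 30 40 50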
Now we will use predict() to generate our y.hat vector:
y.hat <- predict(swiss.lm, ## taking the regression output from above -- this is our model
data.frame(Agriculture=pred.x), ## telling the code that it should use `pred.x` as values for Agriculture
interval="confidence") ## note this option
head(y.hat) ## again, just displaying these...
## fit lwr upr
## 1 59.56641 50.46412 68.66869
## 2 59.76061 50.80096 68.72026
## 3 59.95481 51.13740 68.77223
## 4 60.14901 51.47341 68.82462
## 5 60.34322 51.80899 68.87745
## 6 60.53742 52.14410 68.93074
So y.hat has three columns: fit, which is the actual fitted value; lwr, which is the lower bound of the confidence interval around that fitted value; and upr, which is – you guessed it – the upper bound. Remember how we estimate a coefficient and also the standard error of that coefficient estimate? lwr and upr represent the consequences of that uncertainty about the coefficient estimate. Roughly, if the true coefficient were about two standard errors below the “point” estimate, \(\hat{\beta_1}\), we would predict a Y near the value given by lwr; if it were about two standard errors above the point estimate, we would predict a Y near the value given by upr. (And look at how big that range turns out to be!)
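If you want to see where lwr and upr come from, here is a rough by-hand check at a single (arbitrarily chosen) value of Agriculture. It uses the full variance-covariance matrix of the coefficient estimates, so it captures uncertainty about the intercept as well as the slope, and the factor of 2 is only an approximation to the exact t value that predict() uses:
x0 <- 50                                             ## an arbitrary value of Agriculture
X0 <- c(1, x0)                                       ## 1 for the intercept, then the X value
fit0 <- sum(coef(swiss.lm) * X0)                     ## the point prediction at x0
se0  <- sqrt(drop(t(X0) %*% vcov(swiss.lm) %*% X0))  ## std. error of that fitted value
fit0 + c(-2, 2) * se0                                ## roughly the lwr and upr from predict()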
We can do a lot of things with our predicted values. One of them is easy: we can plot our confidence intervals graphically. This takes a little more work than just plotting the regression line, as we’ve been doing.
The first step is to plot the space we’ll be using:
plot(swiss$Fertility~swiss$Agriculture,pch="")
This is … weird. It’s just a blank space! That’s because pch is the plotting character and we told R to use, literally, nothing to plot this. That creates an x-y Cartesian plane that is scaled appropriately but onto which we can add more things later on.
Such as, for instance:
polygon(c(rev(pred.x),pred.x),
c(rev(y.hat[,3]),y.hat[,2]),
col='grey80',border=NULL)
polygon creates a polygon – a solid two-dimensional shape – from the x and y coordinates we give it. (Note that rev() literally reverses the order of elements: by reversing one edge, we trace the upper bound from right to left and then the lower bound from left to right, so the outline closes into a single band.)
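If rev() is unfamiliar, a one-line demonstration:
rev(1:5)   ## returns 5 4 3 2 1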
That gray area, then, shows the bounds of uncertainty at a 95% confidence level for the fitted line, which flows from our uncertainty about the coefficient estimates – if our model is properly specified, then at any given value of X, intervals constructed this way will contain the true average value of Y 95% of the time.
Next, we want to fill in the confidence intervals and the regression line itself:
lines(pred.x,y.hat[,1],col="red")
lines(pred.x,y.hat[,2],col="blue",lty="dashed")
lines(pred.x,y.hat[,3],col="blue",lty="dashed")
(The colors are up to you; these just seemed nice.)
And, finally, we add the points as an overlay:
points(x=swiss$Agriculture,y=swiss$Fertility,
pch=16,col="purple")
Again, the color is up to you.
Putting this all together in a nice form:
plot(swiss$Fertility~swiss$Agriculture,pch="",
main = "Swiss Provincial Fertility Rates Vs Agricultural Employment",
xlab = "Percent Men in Agriculture by Province",
ylab = "Fertility Measure")
polygon(c(rev(pred.x),pred.x),
c(rev(y.hat[,3]),y.hat[,2]),
col='grey80',border=NULL)
lines(pred.x,y.hat[,1],col="red")
lines(pred.x,y.hat[,2],col="blue",lty="dashed")
lines(pred.x,y.hat[,3],col="blue",lty="dashed")
points(x=swiss$Agriculture,y=swiss$Fertility,
pch=16,col="purple")
The example above used a bivariate regression to help you focus on the code. Now, we turn to the more general multivariate context. Again, we’ll create a pred.x and a y.hat, but we need to be a little more careful here because, in a multivariate context, we have lots of X variables that we need to be mindful of. For instance, take the regression:
swiss.lm2 <- lm(Fertility~Agriculture+Education+Catholic+Infant.Mortality,data=swiss)
summary(swiss.lm2)
##
## Call:
## lm(formula = Fertility ~ Agriculture + Education + Catholic +
## Infant.Mortality, data = swiss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.6765 -6.0522 0.7514 3.1664 16.1422
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 62.10131 9.60489 6.466 8.49e-08 ***
## Agriculture -0.15462 0.06819 -2.267 0.02857 *
## Education -0.98026 0.14814 -6.617 5.14e-08 ***
## Catholic 0.12467 0.02889 4.315 9.50e-05 ***
## Infant.Mortality 1.07844 0.38187 2.824 0.00722 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.168 on 42 degrees of freedom
## Multiple R-squared: 0.6993, Adjusted R-squared: 0.6707
## F-statistic: 24.42 on 4 and 42 DF, p-value: 1.717e-10
Here, our x variables include not only Agriculture but also Education, Catholic, and Infant.Mortality. That means that when we predict our Y values, we need to have some values of X for all of these, or else the math won’t work.
We solve this problem by letting the most important variable that we’re interested in – here, we’ll stick with Agriculture – vary across a range of values and then by setting the other variables to some fixed amount. That fixed amount could be the sample mean (the arithmetic average of all units on that characteristic), an important category value (for instance, a female dummy set to 1 or 0), or something else. Here, we’ll go with the sample mean.
pred.x <- seq(min(swiss$Agriculture)-5,
max(swiss$Agriculture)+5,
1)
y.hat <- predict(swiss.lm2,
data.frame(Agriculture=pred.x,
Education=mean(swiss$Education),
Catholic=mean(swiss$Catholic),
Infant.Mortality=mean(swiss$Infant.Mortality)),
interval="confidence")
plot(swiss$Fertility~swiss$Agriculture,pch="",
main = "Swiss Provincial Fertility Rates Vs Agricultural Employment",
xlab = "Percent Men in Agriculture by Province",
ylab = "Fertility Measure")
polygon(c(rev(pred.x),pred.x),
c(rev(y.hat[,3]),y.hat[,2]),
col='grey80',border=NULL)
lines(pred.x,y.hat[,1],col="red")
lines(pred.x,y.hat[,2],col="blue",lty="dashed")
lines(pred.x,y.hat[,3],col="blue",lty="dashed")
points(x=swiss$Agriculture,y=swiss$Fertility,
pch=16,col="purple")
How interesting! It’s the same plotting code, but a different predicted relationship. Remember why:
library(stargazer)
stargazer(swiss.lm,swiss.lm2,type="html")
Dependent variable: Fertility

|                     | (1)                  | (2)                    |
|---------------------|----------------------|------------------------|
| Agriculture         | 0.194**              | -0.155**               |
|                     | (0.077)              | (0.068)                |
| Education           |                      | -0.980***              |
|                     |                      | (0.148)                |
| Catholic            |                      | 0.125***               |
|                     |                      | (0.029)                |
| Infant.Mortality    |                      | 1.078***               |
|                     |                      | (0.382)                |
| Constant            | 60.304***            | 62.101***              |
|                     | (4.251)              | (9.605)                |
| Observations        | 47                   | 47                     |
| R2                  | 0.125                | 0.699                  |
| Adjusted R2         | 0.105                | 0.671                  |
| Residual Std. Error | 11.816 (df = 45)     | 7.168 (df = 42)        |
| F Statistic         | 6.409** (df = 1; 45) | 24.424*** (df = 4; 42) |
| Note:               | *p<0.1; **p<0.05; ***p<0.01 |                 |
In the multivariate framework, we’ve found that there’s actually a negative relationship between Agriculture and Fertility. The plot reflects that … which is what we want. The big point of multivariate OLS is that it reveals relationships that aren’t obvious — that don’t even hold — in the bivariate context.
Confidence intervals are derived with reference to our estimates of the slope of the line. That’s great, but it’s actually kind of weak for prediction in many cases. When we want to predict something, we want to take notice of the observed distribution of the data because our task is to give bands within which we will observe the next data point in a sequence. Those bands will appear much larger than the confidence intervals; this page explains why.
I should note that there’s a huge philosophical difference between prediction and causal inference. This isn’t going to matter much for what you’re doing right now, but it’s worthwhile enough to flag so that you can be aware of it. Prediction is much more of an empirical activity: given this data, what will future data look like? Causal inference is much more of a theoretical activity: given what I think about the universe, does this data offer confirming or disconfirming evidence for it? When we deal with theory, we might be interested in detecting something that has a tiny effect but establishing its presence with great certainty; when we deal with empirics, we are interested in not getting the outcome wrong. Those aren’t really the same thing, although we sometimes – wrongly – talk about them as if they were. This article explains more.
Fortunately, the code for graphically displaying predictions is easy:
swiss.lm <- lm(Fertility~Agriculture,data=swiss)
pred.x <- seq(min(swiss$Agriculture)-5,
max(swiss$Agriculture)+5,
1)
y.hat <- predict(swiss.lm,
data.frame(Agriculture=pred.x),
interval="prediction") ## here's the difference!
plot(swiss$Fertility~swiss$Agriculture,pch="")
polygon(c(rev(pred.x),pred.x),
c(rev(y.hat[,3]),y.hat[,2]),
col='grey80',border=NULL)
lines(pred.x,y.hat[,1],col="red")
lines(pred.x,y.hat[,2],col="blue",lty="dashed")
lines(pred.x,y.hat[,3],col="blue",lty="dashed")
points(x=swiss$Agriculture,y=swiss$Fertility,
pch=16,col="purple")
When I say the bounds are much larger for prediction intervals than for confidence intervals, note that they are, indeed, much larger! That’s because these bounds mark out the area within which we would expect to see the next observation 95% of the time. Given that the error in this dataset is pretty large, that’s something to be aware of.
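To put numbers on that comparison, you can evaluate both kinds of intervals at a single point and compare their widths; here I (arbitrarily) use the sample mean of Agriculture:
new.x <- data.frame(Agriculture = mean(swiss$Agriculture))
conf.int <- predict(swiss.lm, new.x, interval = "confidence")
pred.int <- predict(swiss.lm, new.x, interval = "prediction")
conf.int[, "upr"] - conf.int[, "lwr"]   ## width of the confidence interval
pred.int[, "upr"] - pred.int[, "lwr"]   ## width of the prediction interval -- much wider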
Strictly as a technical exercise, here is the parallel code for prediction using multivariate OLS:
swiss.lm2 <- lm(Fertility~Agriculture+Education+Catholic+Infant.Mortality,data=swiss)
pred.x <- seq(min(swiss$Agriculture)-5,
max(swiss$Agriculture)+5,
1)
y.hat <- predict(swiss.lm2,
data.frame(Agriculture=pred.x,
Education=mean(swiss$Education),
Catholic=mean(swiss$Catholic),
Infant.Mortality=mean(swiss$Infant.Mortality)),
interval="prediction")
plot(swiss$Fertility~swiss$Agriculture,pch="",
main = "Predicting Provincial Fertility Rates Vs Agricultural Employment",
xlab = "Percent Men in Agriculture by Province",
ylab = "Fertility Measure")
polygon(c(rev(pred.x),pred.x),
c(rev(y.hat[,3]),y.hat[,2]),
col='grey80',border=NULL)
lines(pred.x,y.hat[,1],col="red")
lines(pred.x,y.hat[,2],col="blue",lty="dashed")
lines(pred.x,y.hat[,3],col="blue",lty="dashed")
points(x=swiss$Agriculture,y=swiss$Fertility,
pch=16,col="purple")
Note that the prediction bounds are still pretty large, but they’re much narrower (and differently sloped) than in the bivariate example because the additional covariates absorb a lot of the residual variation, reducing our uncertainty about where the next observation will fall. (By the way, the fit looks bad in two dimensions, but it would look better as a four-dimensional hyperplane in a five-dimensional space.)
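One quick way to see where that narrowing comes from is to compare the residual standard errors of the two models; these are the same numbers reported in the summary() output above:
summary(swiss.lm)$sigma     ## about 11.8 in the bivariate model
summary(swiss.lm2)$sigma    ## about 7.2 with the additional covariates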