Predictions
Up until now, we’ve covered the process of estimating a linear model and making inferences based on that model. In this lab, our focus shifts to prediction. The problem at hand can be summarized as follows: given a new set of predictors \(x_0\), what is the predicted response? It can be shown to be:
\[ \hat{y}_0=x^T_0\hat{\beta} \] Here, \(\hat{\beta}\) is estimated without taking into account the new information provided by \(x_0\), hence we can express \(\boldsymbol{\hat{\beta}}=(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\). However, for this scenario, a point estimate is not sufficient to make reliable predictions. We need to assess the uncertainty associated with this prediction. For instance, if we want to predict the high water mark of a river, we may need to construct barriers that are high enough to withstand floods that could potentially exceed the predicted maximum. Financial projections are similarly not very useful without a realistic estimate of the associated uncertainty.
Just before introducing you to confidence intervals, we need to define what we mean by the predicted mean response and a prediction of a future observation. Let’s suppose we have built a regression model that predicts the rental prices of houses in a given area based on predictors such as the number of bedrooms and proximity to a major highway. We have two types of predictions that can be made:
1. Suppose that a specific house comes on the market, whose characteristics are collected in the vector \(x_0\). Its rental price will be \(Y_0=x_0^T\boldsymbol{\beta}+\epsilon_0\), where, without loss of generality, we can still assume that \(\epsilon_0\sim \mathcal{N} (0, \sigma^2)\). It is easy to prove that \(E[\hat{Y}_0- Y_0]=0\) and \(Var[\hat{Y}_0- Y_0]=\sigma^2[x_0^T(\mathbf{X}^T\mathbf{X})^{-1}x_0+1]\) (please attempt to do this). In assessing the variance of the prediction \(x_0^T \hat{\boldsymbol{\beta}}\) we need to take into account the variance of \(\epsilon_0\).
2. On the other hand, suppose our manager asks the question “What would a house with characteristics \(x_0\) rent for on average?”. This average rental price is \(x_0^T\boldsymbol{\beta}\) and is predicted by \(x_0^T\hat{\boldsymbol{\beta}}\); in this case it makes sense to take into account only the variance of \(\hat{\boldsymbol{\beta}}\).
To recap: in the first case we are looking for a prediction of a future value, while in the second case we want a prediction of the mean response.
For (1) a \(100(1-\alpha)\%\) CI for a single future response is
\[ \hat{y}_0\pm t_{n-p, \alpha/2}\hat{\sigma}\sqrt{1+x_0^T(\mathbf{X}^T\mathbf{X})^{-1}x_0} \]
So far, CIs have been for parameters; under our frequentist framework a parameter \(\theta\) is a fixed quantity, not random. However, a future observation is by definition random. For this reason the interval above is called a prediction interval. We say that there is a \(95\%\) chance that the future value falls within this interval. This would be incorrect, and possibly outrageous for some statisticians (including me), to say of a parameter. Please take care about what you write in your lab exam.
For (2) a \(100(1-\alpha)\%\) CI is
\[ \hat{y}_0\pm t_{n-p, \alpha/2}\hat{\sigma}\sqrt{x_0^T(\mathbf{X}^T\mathbf{X})^{-1}x_0} \] The CI for (2) is usually much narrower than the prediction interval. It is tempting to use this version to make intervals for predicted values; please do not do that, as it is a serious mistake, especially in a report for a firm.
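To make the two formulas concrete, here is a minimal sketch that computes both intervals by hand on simulated data (all variable names are illustrative, not part of the lab):

set.seed(1)
n <- 50
x <- runif(n)
y <- 1 + 2 * x + rnorm(n)
X <- cbind(1, x)                       # design matrix with intercept
bhat <- solve(t(X) %*% X, t(X) %*% y)  # least squares estimate
p <- ncol(X)
sigma.hat <- sqrt(sum((y - X %*% bhat)^2) / (n - p))
x0 <- c(1, 0.5)                        # new point, intercept included
yhat0 <- drop(x0 %*% bhat)
tcrit <- qt(0.975, n - p)
h0 <- drop(t(x0) %*% solve(t(X) %*% X) %*% x0)
yhat0 + c(-1, 1) * tcrit * sigma.hat * sqrt(1 + h0)  # prediction interval
yhat0 + c(-1, 1) * tcrit * sigma.hat * sqrt(h0)      # CI for the mean response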
Let’s see how to compute these quantities in R.
library(faraway)
data(fat)
lmod <- lm(brozek ~ age + weight + height + neck + chest + abdom + hip + thigh + knee + ankle + biceps + forearm + wrist, data = fat)

Measuring body fat is not an easy task. One method requires submerging the body underwater in a tank and measuring the increase in the water level. As many people would prefer not to be submerged underwater, researchers recorded age, weight, height, and 10 body circumference measurements for 252 men, with the aim of building a model able to predict body fat.
In the model just fitted we use brozek as the response (Brozek’s equation estimates percent body fat from density). Let’s consider a typical man, exemplified by the median value of all the predictors.
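One way to build such a vector is to take column-wise medians of the design matrix (a sketch consistent with the output shown below):

(x0 <- apply(model.matrix(lmod), 2, median))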
## (Intercept) age weight height neck chest
## 1.00 43.00 176.50 70.00 38.00 99.65
## abdom hip thigh knee ankle biceps
## 90.95 99.30 59.00 38.50 22.80 32.05
## forearm wrist
## 28.70 18.30
We obtained a vector with the same columns as the design matrix but just one row. We can now obtain \(\hat{y}_0\) using predict().
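A call along these lines (note the transpose, turning x0 into a one-row data frame) produces the output below:

predict(lmod, newdata = data.frame(t(x0)))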
## 1
## 17.49322
Please take care, as the predict function requires the newdata argument to be a data frame with the same format as the data used to fit the model.
Now, if we want a \(95\%\) CI, we just have to decide whether we are predicting the body fat of one particular man or the mean body fat of all men having these same characteristics.
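For a single future observation we request a prediction interval; a call of this form produces the output below:

predict(lmod, newdata = data.frame(t(x0)), interval = "prediction")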
## fit lwr upr
## 1 17.49322 9.61783 25.36861
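For the mean response of all such men we request a confidence interval instead:

predict(lmod, newdata = data.frame(t(x0)), interval = "confidence")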
## fit lwr upr
## 1 17.49322 16.94426 18.04219
Transformations
Transformations can be an incredibly useful tool to improve the fit and correct some violations of the assumptions (e.g. heteroscedasticity can usually be corrected by applying a \(\log()\) transformation to the \(Y\)’s). Another option is to add predictors that are functions of the existing ones, such as quadratic or cross-product terms.
Suppose that we are considering the following model
\[ \log(Y)=\beta_0+\beta_1X+\epsilon \]
In the original scale of the response, this model becomes (just apply \(\exp()\) to both sides)
\[ Y=\exp(\beta_0+\beta_1X)\times\exp(\epsilon) \]
Under this framework the error enters the model in a multiplicative way. Take-home message: using \(\log()\) can save your model, and your mark during the lab exam.
In practice we may not know how the error enters the model. The best approach is to try different transformations to get the structural form of the model right and worry about the error component later; non-linearity can usually be detected by exploring your data with plots of \(y\) against each \(x_i\). Once the transformation has been applied and the model fitted, we can check the residuals to see whether they satisfy the conditions required for linear regression. When you transform, you have to take care when interpreting your results (more details in the lecture notes).
When a log is used on the response, the regression coefficients can be interpreted as follows. The fitted model is
\[ \log(\hat{y})=\hat{\beta}_0+\hat{\beta}_1x_1+\dots+\hat{\beta}_px_p \]
Exponentiating both sides gives \[ \hat{y}=e^{\hat{\beta}_0}e^{\hat{\beta}_{1} x_1}\dots\,e^{\hat{\beta}_p x_p} \] An increase of one unit in \(x_1\) multiplies \(\hat{y}\) by \(e^{\hat{\beta}_{1}}\). Thus, when a log scale is used, the regression coefficients can be interpreted in a multiplicative rather than an additive manner.
Recall that \(\log(1+x)\approx x\) for small \(x\) (this is obtained by a Taylor expansion around zero). So, for example, suppose \(\hat{\beta}_1=0.09\); then an increase of one in \(x_1\) would lead to about a \(0.09\) increase in \(\log y\), which is roughly a \(9\%\) increase in \(y\).
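A quick numerical check of this approximation in R:

exp(0.09)  # 1.0942: a one-unit increase in x1 multiplies y by about 1.09, i.e. ~9%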
Box-Cox transformation
A useful tool you can employ to decide which transformation to apply is the Box-Cox method. This method is designed for strictly positive responses and chooses the transformation to find the best fit to the data. It transforms the response \(y\to g_{\lambda}(y)\), where the family of transformations is indexed by \(\lambda\):
\[ g_{\lambda}(y)=\begin{cases} \dfrac{y^{\lambda}-1}{\lambda} & \lambda\ne0\\ \log(y) & \lambda=0 \end{cases} \]
For fixed \(y>0\), \(g_{\lambda}(y)\) is continuous in \(\lambda\). The idea is to choose \(\lambda\) by maximum likelihood. Assuming normality of the errors, the profile log-likelihood (up to a constant) is
\[ L(\lambda)=-\frac{n}{2}\log(RSS_{\lambda}/n)+(\lambda-1)\sum_i\log(y_i) \] where \(RSS_\lambda\) is the residual sum of squares when \(g_{\lambda}(y)\) is the response. Transforming the response can make our model hard to interpret, so we want to avoid doing it unless it is really necessary. One way to check this is to form a confidence interval for \(\lambda\). A \(100(1-\alpha)\%\) CI is
\[ \{\lambda: L(\lambda)>L(\hat{\lambda})-0.5\chi_{1, (1-\alpha)}^2\} \]
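As a sanity check, \(L(\lambda)\) can be computed by hand over a grid of values (a minimal sketch on simulated data; all names are illustrative):

set.seed(2)
n <- 40
x <- runif(n)
y <- exp(1 + x + rnorm(n, sd = 0.3))   # a response that "wants" a log
bc.loglik <- function(lambda) {
  gy <- if (lambda == 0) log(y) else (y^lambda - 1) / lambda
  rss <- sum(residuals(lm(gy ~ x))^2)  # RSS with g_lambda(y) as the response
  -n / 2 * log(rss / n) + (lambda - 1) * sum(log(y))
}
sapply(c(-0.5, 0, 0.5, 1), bc.loglik)  # lambda near 0 should score highest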
For each value of \(\lambda\) a transformation is proposed; some are shown in the table below.
| \(\lambda\) | Transformation |
|---|---|
| -3 | \(Y^{-3}\) |
| -2 | \(Y^{-2}\) |
| -1 | \(Y^{-1}\) |
| -0.5 | \(Y^{-0.5}\) |
| 0 | \(\log(Y)\) |
| 1 | \(Y\) |
| 2 | \(Y^{2}\) |
| 3 | \(Y^{3}\) |
Let’s move into R to see how things work. To use the Box-Cox transformation we need to install and load the MASS library (I suggest you explore the documentation of MASS, as it is an interesting package with tons of statistical tools implemented).
library(MASS)
library(faraway)
data(savings)
lmod <- lm(sr~ pop15+pop75+dpi+ddpi, savings)
boxcox(lmod, plotit = TRUE, lambda = seq(0.5, 1.5, by = 0.1))

We can see from the plot that there seems to be no good reason to transform our data.
Let’s consider the Galapagos Islands data (gala).
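The summary below comes from a fit of this form (the boxcox call that produced the plot discussed next is a plausible reconstruction):

data(gala)
lmod <- lm(Species ~ Area + Elevation + Nearest + Scruz + Adjacent, data = gala)
summary(lmod)
boxcox(lmod, plotit = TRUE)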
##
## Call:
## lm(formula = Species ~ Area + Elevation + Nearest + Scruz + Adjacent,
## data = gala)
##
## Residuals:
## Min 1Q Median 3Q Max
## -111.679 -34.898 -7.862 33.460 182.584
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.068221 19.154198 0.369 0.715351
## Area -0.023938 0.022422 -1.068 0.296318
## Elevation 0.319465 0.053663 5.953 3.82e-06 ***
## Nearest 0.009144 1.054136 0.009 0.993151
## Scruz -0.240524 0.215402 -1.117 0.275208
## Adjacent -0.074805 0.017700 -4.226 0.000297 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 60.98 on 24 degrees of freedom
## Multiple R-squared: 0.7658, Adjusted R-squared: 0.7171
## F-statistic: 15.7 on 5 and 24 DF, p-value: 6.838e-07
A possible transformation is the square root, as \(\lambda=0.5\) falls just within the confidence interval.
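Refitting with the square-root-transformed response (a call of this form produces the summary below):

lmods <- lm(sqrt(Species) ~ Area + Elevation + Nearest + Scruz + Adjacent, data = gala)
summary(lmods)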
##
## Call:
## lm(formula = sqrt(Species) ~ Area + Elevation + Nearest + Scruz +
## Adjacent, data = gala)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5572 -1.4969 -0.3031 1.3527 5.2110
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.3919243 0.8712678 3.893 0.000690 ***
## Area -0.0019718 0.0010199 -1.933 0.065080 .
## Elevation 0.0164784 0.0024410 6.751 5.55e-07 ***
## Nearest 0.0249326 0.0479495 0.520 0.607844
## Scruz -0.0134826 0.0097980 -1.376 0.181509
## Adjacent -0.0033669 0.0008051 -4.182 0.000333 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.774 on 24 degrees of freedom
## Multiple R-squared: 0.7827, Adjusted R-squared: 0.7374
## F-statistic: 17.29 on 5 and 24 DF, p-value: 2.874e-07
The residuals in the fit with \(\sqrt{Y}\) seem to be approximately normal (note how the \(min\)/\(max\) and \(Q_1\)/\(Q_3\) are roughly symmetric about zero), and the residual standard error decreased.
Just a few words of warning about the Box-Cox transformation:
The Box-Cox method is not robust against outliers, so if you get a value like \(\hat{\lambda}=5\), the reason could be that your data are affected by outliers. (Do you see why it is so important to explore your data first?)
If some \(y_i<0\), it is customary (though not an elegant solution) to add a constant to all the \(y\)’s. Note that the constant added should be small (e.g. 0.05, 0.10).
There is some doubt as to whether the estimation of \(\lambda\) counts as an extra parameter to be considered in the degrees of freedom. The doubt arises because \(\lambda\) is not a linear parameter and its estimation is not part of the least squares fit.
Exercises
- For the prostate data, fit a model with lpsa as the response and the other variables as predictors.
- Suppose a new patient arrives with the following values:
  lcavol=1.44692, lweight=3.62301, age=65, lbph=0.30010, svi=0.00000, lcp=-0.79851, gleason=7.00000, pgg45=15.00000.
  Predict the lpsa for this patient along with an appropriate \(95\%\) CI.
- Repeat the last question for a patient with the same values except that he is age 20. Explain why the CI is wider.
- For the model of the previous question, remove all the predictors that are not significant at the \(5\%\) level. Now recompute the predictions of the previous questions. Which predictions would you prefer? Explain.
- Using the teengamb data, fit a model with gamble as the response and the other variables as predictors.
- Predict the amount that a male with average (given these data) status, income and verbal score would gamble, along with an appropriate \(95\%\) CI.
- Repeat the prediction for a male with maximal values (for this data) of status, income and verbal score. Which CI is wider, and why is this result expected?
- Fit a model with sqrt(gamble) as the response but with the same predictors. Now predict the response and give a \(95\%\) prediction interval for the individual in (a). Take care to give your answer in the original units of the response.
- Repeat the prediction for the model in (c) for a female with status=20, income=1, verbal=10. Comment on the credibility of the result.
References
- Faraway, J. (2015). Linear Models with R (2nd ed.). Chapman & Hall/CRC Texts in Statistical Science.