This is a case study of Bayesian modelling of the R cars dataset. Based on the diagnostic plots, a square-root transformation of the response variable is applied before fitting a linear regression model. Standard reference priors are used since we have little prior knowledge of the unknown parameters, which leads to results equivalent to the classical regression analysis; more informative priors could be used to improve the model. Gibbs sampling is applied to draw samples from the posterior distribution, and the posterior predictive distribution for future observations is obtained from the same simulations.
The cars dataset contains 50 observations of car speed (mph) and stopping distance (ft). We begin by looking at the scatter plot and fitting (1) a linear regression line and (2) a quadratic regression curve. Let dist denote the stopping distance and x the speed, and regress dist on x:
car.lm1 = lm(dist~speed,data=cars)
car.lm2 = lm(dist~speed+I(speed^2),data=cars)
plot(cars, xlab = "Speed (mph)", ylab = "Stopping distance (ft)",
las = 1, xlim = c(0, 25), pch=20)
title(main = "cars data")
abline(car.lm1,col="blue",lwd=3,lty=2)
lines(cars$speed, fitted(car.lm2), col="red", lwd=3,lty=1)
legend("topleft", c(expression(paste("dist"[i],"=",beta[0]+beta[1],"x"[i]+epsilon[i])),
expression(paste("dist"[i],"=",beta[0]+beta[1],"x"[i]+beta[2],"x"[i]^2+epsilon[i]))),
lty = c(2,1), lwd = 3, bty = "n", col = c("blue", "red"))
Adding a quadratic term appears to fit the data better at the smallest and largest \(x\) values, but overall the two fits are very similar. However, the Residuals vs Fitted plot of the first-order linear regression model shows a slightly curved pattern, which suggests that including the quadratic term could give a better model.
plot(car.lm1,which=1)
From the diagnostic plots for the second-order regression model, we see that the red line in the Residuals vs Fitted plot is now roughly horizontal. The Scale-Location plot indicates whether the residuals are spread equally across the range of the predictor; here they spread somewhat wider over the first half of the x-axis. This is not severe and is acceptable, but it is not ideal if we rely on the assumption of homoscedasticity, i.e. that the residual variance is constant across the range of the predictors. The normal Q-Q plot also indicates that the residuals have heavier tails than the normal distribution.
par(mfrow=c(2,2))
plot(car.lm2)
While these can be considered minor issues, in the analysis below we take the alternative approach of square-root transforming the response variable dist. This produces a model that better satisfies the assumptions and explains more variance (has a higher \(R^2\)):
summary(car.lm2)
##
## Call:
## lm(formula = dist ~ speed + I(speed^2), data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28.720 -9.184 -3.188 4.628 45.152
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.47014 14.81716 0.167 0.868
## speed 0.91329 2.03422 0.449 0.656
## I(speed^2) 0.09996 0.06597 1.515 0.136
##
## Residual standard error: 15.18 on 47 degrees of freedom
## Multiple R-squared: 0.6673, Adjusted R-squared: 0.6532
## F-statistic: 47.14 on 2 and 47 DF, p-value: 5.852e-12
car.lm3 = lm(sqrt(dist)~speed,data=cars)
summary(car.lm3)
##
## Call:
## lm(formula = sqrt(dist) ~ speed, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.0684 -0.6983 -0.1799 0.5909 3.1534
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.27705 0.48444 2.636 0.0113 *
## speed 0.32241 0.02978 10.825 1.77e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.102 on 48 degrees of freedom
## Multiple R-squared: 0.7094, Adjusted R-squared: 0.7034
## F-statistic: 117.2 on 1 and 48 DF, p-value: 1.773e-14
plot(cars$speed,sqrt(cars$dist), xlab = "Speed (mph)", ylab = expression(sqrt("Stopping distance (ft)")),las = 1, xlim = c(0, 25))
abline(car.lm3,col="red",lwd=3,lty=1)
legend("topleft", bty="n",expression(paste(sqrt("dist"[i]),"=",beta[0]+beta[1],"x"[i]+epsilon[i])))
The Scale-Location plot shows that square-rooting the response variable also alleviates the problem of the residual spread progressively increasing with the predictor, and the normal Q-Q plot looks better too:
par(mfrow=c(2,2))
plot(car.lm3)
Write \(\beta=\{\beta_0,\beta_1\}\). We use the standard reference prior for \(\beta\) and \(\sigma^2\): \[p(\beta~|~\sigma^2) \propto 1, \qquad\qquad p(\sigma^2) \propto \frac{1}{\sigma^2}.\] The joint prior distribution is then \[p(\beta, \sigma^2)\propto \frac{1}{\sigma^2}.\] Write \(y=\{y_1,\ldots,y_n\}\), and let \(X=\{X_1,X_2\}\) be an \(n\times 2\) matrix, where \(X_1\) is a vector of 1's and \(X_2\) is the vector of measured car speeds. The posterior distribution of \(\beta\) conditional on \(\sigma^2\) is proportional to the product of the prior and the likelihood: \[p(\beta~|~\sigma^2,y) \propto p(\beta~|~\sigma^2)\,p(y~|~\beta,\sigma^2)\propto p(y~|~\beta,\sigma^2), \qquad \text{so} \qquad \beta~|~\sigma^2,y\sim N(\hat\beta,\sigma^2(X^TX)^{-1}).\] The marginal posterior distribution of \(\sigma^2\) follows a scaled inverse-\(\chi^2\) distribution: \[\sigma^2~|~y\sim \textsf{Inv-}\chi^2(n-2,s^2),\] where \(s^2=\frac{1}{n-2}(y-X\hat\beta)^T(y-X\hat\beta)\). This is the same as an inverse-gamma distribution \(\textsf{Inv-Gamma}\left(\frac{n-2}{2}, \frac{(n-2)s^2}{2}\right)\).
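As a quick sanity check on this equivalence (not part of the original analysis), we can compare quantiles of draws from the two parameterisations; the values of nu and s2.demo below are arbitrary and chosen only for illustration:
set.seed(42)
nu = 48; s2.demo = 1.2                                              # arbitrary illustrative values
draws.invchisq = nu*s2.demo/rchisq(1e5, df = nu)                    # scaled inverse-chi-squared draws
draws.invgamma = 1/rgamma(1e5, shape = nu/2, rate = nu*s2.demo/2)   # inverse-gamma draws
round(rbind(quantile(draws.invchisq, c(.025,.25,.5,.75,.975)),
            quantile(draws.invgamma, c(.025,.25,.5,.75,.975))), 3)  # the two rows should agree closely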
To simulate {\(\beta,\sigma^2\)} \(\sim p(\beta,\sigma^2~|~y)\):
First, draw a simulation of \(\sigma^2\) from its marginal posterior \(\sigma^2~|~y\sim\textsf{Inv-Gamma}\left(\frac{n-2}{2}, \frac{(n-2)s^2}{2}\right)\), with \(s^2=\frac{1}{n-2}(y-X\hat\beta)^T(y-X\hat\beta)\)
Use this \(\sigma^2\) to simulate \(\beta\) from \(\beta~|~\sigma^2,y\sim N(\hat\beta,\sigma^2(X^TX)^{-1})\)
Repeat \(M\) times to obtain MCMC samples: \(\{\sigma^2,\beta\}^{(1)},\ldots,\{\sigma^2,\beta\}^{(M)}\)
library(MASS)       # mvrnorm() for multivariate normal draws
library(invgamma)   # rinvgamma() with shape/rate arguments (e.g. from the invgamma package)
n = nrow(cars)
X = cbind(1,cars$speed)               # design matrix: intercept and speed
X.new = cbind(1,seq(0,25,by=.5))      # new speeds at which to predict
y = sqrt(cars$dist)                   # square-root transformed response
beta.hat = solve(t(X)%*%X)%*%t(X)%*%y                   # OLS estimate / posterior mean of beta
s2 = t(y-X%*%beta.hat) %*% (y-X%*%beta.hat) / (n-2)     # residual variance s^2
M = 1000
set.seed(1)
sigma2.post=beta.post=y.new=NULL
for(i in 1:M) {
  # draw sigma^2 from its marginal posterior
  sigma2.post = c(sigma2.post,rinvgamma(1,shape = (n-2)/2, rate = (n-2)*s2/2))
  # draw beta given sigma^2
  beta.post = rbind(beta.post,mvrnorm(1, beta.hat, sigma2.post[i]*solve(t(X)%*%X)))
  # draw predictions at the new speeds from the posterior predictive distribution
  y.new = rbind(y.new,mvrnorm(1, X.new%*%beta.post[i,], sigma2.post[i]*diag(nrow(X.new))))
}
Some results:
head(sigma2.post,5)
## [1] 1.4185395 1.1794999 0.9018846 0.9015705 1.2171636
head(beta.post,5)
## [,1] [,2]
## [1,] 0.5802870 0.3498101
## [2,] 0.9820851 0.3412191
## [3,] 1.4496720 0.3203886
## [4,] 0.9514233 0.3354863
## [5,] 0.8669939 0.3618727
(beta.post.mean = apply(beta.post,2,mean)) ## posterior mean of beta
## [1] 1.2799550 0.3224963
apply(beta.post,2,sd) ## their standard deviations
## [1] 0.49499574 0.03064602
summary(car.lm3)$coef ## Compare with estimates from classical regression
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.2770502 0.48444202 2.636126 1.126220e-02
## speed 0.3224125 0.02978377 10.825106 1.773141e-14
## Marginal posterior distributions of the parameters
library(ggplot2)    # ggplot() for the histograms and density overlays
library(gridExtra)  # grid.arrange() to stack the three plots
s2plot = ggplot(mapping=aes(x=sigma2.post)) + geom_histogram(aes(y=..density..), binwidth=.1, colour="black", fill="white") + geom_density(alpha=.2, fill="#FF6666") + xlab(bquote(sigma^2~"| y"))
b0plot = ggplot(mapping=aes(x=beta.post[,1])) + geom_histogram(aes(y=..density..), binwidth=.2, colour="black", fill="white") + geom_density(alpha=.2, fill="#FF6666") + xlab(bquote(beta[0]~"| y"))
b1plot = ggplot(mapping=aes(x=beta.post[,2])) + geom_histogram(aes(y=..density..), binwidth=.01, colour="black", fill="white") + geom_density(alpha=.2, fill="#FF6666")+ xlab(bquote(beta[1]~"| y"))
grid.arrange(s2plot, b0plot, b1plot, ncol=1)
Under the noninformative reference prior, the Bayesian analysis is equivalent to the frequentist analysis. With fitted values transformed back to the original scale, the regression curves from the two analyses overlap, as shown below:
plot(cars, xlab = "Speed (mph)", ylab = "Stopping distance (ft)",las = 1, xlim = c(0, 25), pch=20)
title(main = "Cars data: transformed back to the original scale")
lines(cars$speed, fitted(car.lm3)^2, col=4, lwd=3,lty=2)
lines(cars$speed,(beta.post.mean[1]+beta.post.mean[2]*cars$speed)^2,col=2,lwd=2,lty=1)
legend("topleft", c("Classical","Bayesian"),
lty = c(2,1), lwd = 3, bty = "n", col = c(4,2))
The posterior mean of the intercept, \(\hat\beta_0=1.28\), gives the expected value of \(\sqrt{dist}\) when \(x=0\), corresponding to an expected stopping distance of \(1.28^2\approx 1.64\) ft when the initial speed is zero. From common sense, however, the expected distance should be exactly zero if the car is not moving. Simply dropping the constant term to force the regression line through the origin is not a good remedy, as it may bias the regression coefficients and predictions. We therefore keep the intercept in the model, but we should not interpret it as the expected stopping distance at zero speed: the intercept has no intrinsic meaning here. No data can be collected at a speed of 0 mph, since the car must be moving for the experiment to take place, and from a thought experiment the expected distance at zero speed is simply zero.
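To illustrate this point, here is a quick side comparison (not part of the main analysis) of the slope estimated with and without the intercept, using R's "- 1" formula syntax to drop the constant term; the fitted object car.lm0 is introduced only for this comparison:
car.lm0 = lm(sqrt(dist) ~ speed - 1, data = cars)   # regression forced through the origin
coef(car.lm0)   # slope without an intercept
coef(car.lm3)   # intercept and slope of the model used above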
Given that 15 meters \(\approx\) 49.21 ft, we want to know the maximum speed for which dist \(\le 49\) ft with 95% probability. Since \(0\le dist\le 49\) is equivalent to \(\sqrt{dist}\le\sqrt{49}\), i.e. \(y\le 7\), with our model the problem translates into finding the maximum \(x\) such that the predicted \(y\) is less than or equal to \(7\) with 95% probability. To solve this we need the posterior predictive distribution of unobserved data, \(p(\tilde y~|~y)\).
Suppose we have a matrix \(\tilde X\) of explanatory variables at which we wish to predict the outcomes \(\tilde y\). In this study the posterior predictive distribution \(p(\tilde y~|~y)\) can be obtained analytically: it follows a multivariate t distribution with center \(\tilde X\hat\beta\), squared scale matrix \(s^2(I+\tilde X (X^TX)^{-1}\tilde {X}^T)\), and \(n-2\) degrees of freedom (Equation 14.9 of the reference book): \[\tilde{y}~|~y\sim \textsf{t}_{n-2}\left(\tilde X\hat\beta,\,s^2(I+\tilde X (X^TX)^{-1}\tilde{X}^T)\right).\] Alternatively, we can draw samples of \(\tilde y\) naturally within the MCMC iterations: at iteration \(m\), given \(\beta^{(m)}\) and \({\sigma^2}^{(m)}\), we sample \(\tilde y^{(m)}\sim N(\tilde X\beta^{(m)},{\sigma^2}^{(m)}I)\), so that \(\tilde y^{(1)},\ldots,\tilde y^{(M)}\) are draws from the posterior predictive distribution. The R code is given in the Gibbs sampling code chunk above. The 95% credible interval bounds for the new predictions in the plot are not straight lines, as they are obtained from simulations:
ynew.ci = apply(y.new,2,quantile,c(.025,.975))
ggplot(data = cars, aes(x = speed, y = sqrt(dist))) + geom_point() + ylab(bquote(sqrt("dist"))) +
geom_line(mapping = aes(x=X.new[-1,2],y=beta.post.mean[1]+beta.post.mean[2]*X.new[-1,2], color = "posterior mean"),lwd=1) +
geom_line(mapping = aes(x=X.new[-1,2],y=ynew.ci[1,-1],color="95% CI for posterior predictions")) +
geom_line(mapping = aes(x=X.new[-1,2],y=ynew.ci[2,-1],color="95% CI for posterior predictions")) +
theme_bw() + theme(legend.position = c(1, 0), legend.justification = c(1.5, 0))
We see that for the upper bound of the 95% credible interval of a prediction to be \(\le 7\), the maximum \(x\) must be below \(11\): in steps of \(0.5\) mph this is approximately \(10.5\) mph \(\approx 17\) km/h, or \(10\) mph if the speed is measured in whole mph.
max(which(ynew.ci[2,]<=7))
## [1] 22
ynew.ci[,22:23]
## [,1] [,2]
## 2.5% 2.498079 2.832669
## 97.5% 6.813289 7.048618
ynew.ci[,22:23]^2
## [,1] [,2]
## 2.5% 6.240396 8.024015
## 97.5% 46.420909 49.683009
X.new[22,2]
## [1] 10.5
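As a cross-check (a sketch, not part of the original code), the analytic multivariate-t predictive distribution given above can be used directly, reusing X, beta.hat, s2, n and X.new from the Gibbs sampling chunk; this should give essentially the same maximum speed as the simulation-based answer:
pred.mean = X.new %*% beta.hat                                               # predictive centers
pred.sd = sqrt(c(s2) * (1 + rowSums((X.new %*% solve(t(X)%*%X)) * X.new)))   # predictive scales
pred.upper = pred.mean + qt(.975, df = n-2) * pred.sd                        # analytic 97.5% bounds
X.new[max(which(pred.upper <= 7)), 2]                                        # largest speed whose upper bound stays below 7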
A. Gelman, J. B. Carlin, H. S. Stern, D. B. Dunson, A. Vehtari and D. B. Rubin. Bayesian Data Analysis, 3rd edn. Chapman and Hall/CRC, Boca Raton, 2013.