Bayesian Linear Regression (BLR) is an alternative approach to linear regression that is based on Bayesian inference and Bayes' theorem. While the better known Ordinary Least Squares (OLS) approach estimates a single value for each regression model parameter, the Bayesian approach generates a probability distribution of possible values for each parameter. BLR offers some benefits and additional insight compared to the OLS method:

- Instead of point estimates, we get full posterior distributions for the model parameters, which directly convey their uncertainty.
- The credible intervals of BLR have a more intuitive probability interpretation than the confidence intervals of OLS.
- Predictions also come as posterior distributions, quantifying the uncertainty of each prediction instead of returning a single value.
In this example analysis, we have a dataset of used cars, which
includes their prices and various features. We will try to predict the
price of used cars, first using regular OLS linear regression, then with
BLR. We will compare their results, and analyze the additional
information provided by BLR.
The dataset was sourced from Kaggle,
provided by the contributors Nehal Birla, Nishant Verma and Nikhil Kushawa.
Let’s load our dataset and view the first 5 rows, along with simple summary statistics.
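A minimal sketch of the loading step; the CSV file name is an assumption based on the Kaggle dataset.

# load the raw data (file name is an assumption)
df <- read.csv("CAR DETAILS FROM CAR DEKHO.csv")
# treat the categorical columns as factors, leaving car names as character
cols <- c("fuel", "seller_type", "transmission", "owner")
df[cols] <- lapply(df[cols], factor)
head(df, 5)   # first 5 rows
summary(df)   # simple summary statistics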
Used cars dataset, first 5 rows:

| | name | year | selling_price | km_driven | fuel | seller_type | transmission | owner |
|---|---|---|---|---|---|---|---|---|
| 1 | Maruti 800 AC | 2007 | 60000 | 70000 | Petrol | Individual | Manual | First Owner |
| 2 | Maruti Wagon R LXI Minor | 2007 | 135000 | 50000 | Petrol | Individual | Manual | First Owner |
| 3 | Hyundai Verna 1.6 SX | 2012 | 600000 | 100000 | Diesel | Individual | Manual | First Owner |
| 4 | Datsun RediGO T Option | 2017 | 250000 | 46000 | Petrol | Individual | Manual | First Owner |
| 5 | Honda Amaze VX i-DTEC | 2014 | 450000 | 141000 | Diesel | Individual | Manual | Second Owner |
## name year selling_price km_driven
## Length:4340 Min. :1992 Min. : 20000 Min. : 1
## Class :character 1st Qu.:2011 1st Qu.: 208750 1st Qu.: 35000
## Mode :character Median :2014 Median : 350000 Median : 60000
## Mean :2013 Mean : 504127 Mean : 66216
## 3rd Qu.:2016 3rd Qu.: 600000 3rd Qu.: 90000
## Max. :2020 Max. :8900000 Max. :806599
## fuel seller_type transmission
## CNG : 40 Dealer : 994 Automatic: 448
## Diesel :2153 Individual :3244 Manual :3892
## Electric: 1 Trustmark Dealer: 102
## LPG : 23
## Petrol :2123
##
## owner
## Test Drive Car : 17
## First Owner :2832
## Second Owner :1106
## Third Owner : 304
## Fourth & Above Owner: 81
##
The original dataset has 4340 observations and 8 columns. The variables in the original dataset are:

- name: make and model of the car
- year: model year of the car
- selling_price: the price the car was sold at
- km_driven: kilometers driven by the car
- fuel: fuel type (CNG, Diesel, Electric, LPG or Petrol)
- seller_type: Dealer, Individual or Trustmark Dealer
- transmission: Automatic or Manual
- owner: number of previous owners, from First Owner to Fourth & Above Owner, plus Test Drive Car
There are some data cleaning and recoding steps to consider (see the code sketch after this list):

- Selling price and kilometers driven are rescaled to thousands for readability.
- The rare fuel types CNG, Electric and LPG (64 cars in total) are dropped, leaving only Diesel and Petrol.
- The small Trustmark Dealer category is merged into Dealer.
- The rare owner categories Test Drive Car and Fourth & Above Owner are dropped.
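These steps are inferred from the before and after summaries; here is a minimal sketch of how they could be done with dplyr:

library(dplyr)

df <- df %>%
  filter(fuel %in% c("Diesel", "Petrol"),
         !owner %in% c("Test Drive Car", "Fourth & Above Owner")) %>%
  mutate(selling_price = selling_price / 1000,  # price in thousands
         km_driven = km_driven / 1000,          # distance in thousands of km
         seller_type = recode(seller_type, "Trustmark Dealer" = "Dealer"),
         across(c(fuel, seller_type, transmission, owner), droplevels))
summary(df)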
## name year selling_price km_driven
## Length:4182 Min. :1995 Min. : 20.0 Min. : 0.001
## Class :character 1st Qu.:2011 1st Qu.: 211.2 1st Qu.: 35.000
## Mode :character Median :2014 Median : 370.0 Median : 60.000
## Mean :2013 Mean : 512.4 Mean : 65.724
## 3rd Qu.:2016 3rd Qu.: 600.0 3rd Qu.: 90.000
## Max. :2020 Max. :8900.0 Max. :806.599
## fuel seller_type transmission owner
## Diesel:2112 Dealer :1066 Automatic: 444 First Owner :2801
## Petrol:2070 Individual:3116 Manual :3738 Second Owner:1081
## Third Owner : 300
##
##
##
With these changes, we have a dataset of 4182 observations, and simpler, less unbalanced factor variables.
Let’s look at the distributions of our numeric variables.
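A minimal sketch of these plots (the originals may have been styled differently):

par(mfrow = c(1, 3))
hist(df$year, main = "Year", xlab = "Model year")
hist(df$selling_price, breaks = 60, main = "Selling price", xlab = "Price (thousands)")
hist(df$km_driven, breaks = 60, main = "Kilometers driven", xlab = "Distance (thousands of km)")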
Our outcome variable, selling price, follows a heavily right-skewed distribution, with numerous very large outliers relative to the median and the interquartile range.
Let’s look at the balances of our factor variables.
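For example, with base R barplots:

par(mfrow = c(2, 2))
for (v in c("fuel", "seller_type", "transmission", "owner")) {
  barplot(table(df[[v]]), main = v)  # category counts for each factor variable
}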
Let’s check the correlations between our variables, starting with the
numeric variables.
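A minimal sketch:

# pairwise Pearson correlations of the numeric columns
cor(df[, c("year", "selling_price", "km_driven")])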
Let’s check the degrees of association between our factor variables,
with tile plots and chi-square tests.
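A sketch of one such test; pairing fuel with seller type for tchi1 is inferred from the discussion that follows:

# contingency table of fuel type vs. seller type
tchi1 <- table(df$fuel, df$seller_type)
chisq.test(tchi1)  # Yates' continuity correction is applied automatically for 2x2 tables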
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: tchi1
## X-squared = 4.5774, df = 1, p-value = 0.0324
We see that dealers are a bit more likely to sell diesel cars, while
individuals are a bit more likely to sell petrol cars. The association
is weak but statistically significant.
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: tchi2
## X-squared = 8.0562, df = 1, p-value = 0.004535
Automatic cars are a bit more likely to be diesel, and manual cars
are a bit more likely to be petrol. Again, a weak but statistically
significant association.
##
## Pearson's Chi-squared test
##
## data: tchi3
## X-squared = 0.5651, df = 2, p-value = 0.7539
There is no significant association between fuel type and number of
owners.
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: tchi4
## X-squared = 192.08, df = 1, p-value < 0.00000000000000022
Individual sellers are much more likely to sell manual cars compared
to dealers. This is a strong, highly statistically significant
association.
##
## Pearson's Chi-squared test
##
## data: tchi5
## X-squared = 284.24, df = 2, p-value < 0.00000000000000022
Dealers sell roughly 1/3 of first owner cars, 1/10 of second owner cars, and 1/33 of third owner cars. There is a very strong and highly statistically significant association between these variables. Dealers strongly prefer selling first owner cars.
##
## Pearson's Chi-squared test
##
## data: tchi6
## X-squared = 30.363, df = 2, p-value = 0.0000002552
First owner cars are a bit more likely to be automatic transmission
cars. There is a moderate, statistically significant association between
these variables.
Finally, let's look at the associations between numeric and factor variables, using boxplots of the numeric variables grouped by the factor variables.
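For example:

# selling price grouped by each factor variable
par(mfrow = c(1, 3))
boxplot(selling_price ~ fuel, data = df)
boxplot(selling_price ~ transmission, data = df)
boxplot(selling_price ~ owner, data = df)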
Now let's consider the relationships between selling price and the other variables in our dataset. We have already done this to a degree with our correlation analysis above. For the factor variables, the grouped boxplots suggest that diesel cars tend to sell for more than petrol cars, automatic cars for more than manual cars, and prices decline with the number of previous owners.
Let's use scatterplots to better visualize the relationships between the two numeric predictors, year and km_driven, and selling price. Let's also group the observations by the factor variables for more possible insights.
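A minimal sketch with ggplot2, grouping by transmission as one example:

library(ggplot2)
ggplot(df, aes(x = year, y = selling_price, color = transmission)) +
  geom_point(alpha = 0.3) +
  scale_y_log10()   # log scale makes the skewed prices easier to read
ggplot(df, aes(x = km_driven, y = selling_price, color = transmission)) +
  geom_point(alpha = 0.3) +
  scale_y_log10()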
Before we fit an OLS linear regression model, there is a key transformation to consider: selling price is heavily right skewed, so we will model its logarithm, which brings its distribution much closer to normal.
As for variable choice, we saw that year, fuel type, transmission type and owner type are all potential predictors of selling price. There is no apparent strong correlation among these variables, so we can include all of them as predictors.
Kilometers driven and seller type exhibit weak relationships with selling price. Besides, these relationships may be largely accounted for by the other variables in our dataset: kilometers driven is correlated with year, as older cars tend to have been driven more, and seller type is strongly associated with transmission and owner type, as the chi-squared tests above showed. Therefore, we will keep kilometers driven and seller type out of our regression model.
Let’s fit our first linear model, with the log transformed selling price as the outcome variable, and year, fuel, transmission, owner as our predictor variables.
lm1 <- lm(log(selling_price) ~ year + fuel + transmission + owner, data=df)
summary(lm1)
##
## Call:
## lm(formula = log(selling_price) ~ year + fuel + transmission +
## owner, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.77309 -0.29788 0.00815 0.31172 2.35456
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -229.245853 4.136982 -55.414 < 0.0000000000000002 ***
## year 0.117310 0.002052 57.171 < 0.0000000000000002 ***
## fuelPetrol -0.504037 0.015026 -33.543 < 0.0000000000000002 ***
## transmissionManual -0.850391 0.024433 -34.805 < 0.0000000000000002 ***
## ownerSecond Owner -0.081459 0.018610 -4.377 0.0000123193 ***
## ownerThird Owner -0.166369 0.031027 -5.362 0.0000000867 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4811 on 4176 degrees of freedom
## Multiple R-squared: 0.6665, Adjusted R-squared: 0.6661
## F-statistic: 1669 on 5 and 4176 DF, p-value: < 0.00000000000000022
Since the outcome variable is log transformed, the coefficients are on the log scale. Let's reverse the transformation by exponentiating them, convert them into percentage changes, and get the 95% confidence intervals.
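A minimal sketch of this back-transformation; 100 * (exp(b) - 1) converts a log-scale coefficient b into a percentage change:

pct <- function(b) 100 * (exp(b) - 1)  # log-scale coefficient to % change
ci <- confint(lm1)
cbind("2.5 %" = pct(ci[-1, 1]),
      estimate = pct(coef(lm1)[-1]),   # drop the intercept
      "97.5 %" = pct(ci[-1, 2]))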
## 2.5 % estimate 97.5 %
## year 11.99531 12.446749 12.900012
## fuelPetrol -41.34493 -39.591267 -37.785173
## transmissionManual -59.27353 -57.275207 -55.178830
## ownerSecond Owner -11.12550 -7.822957 -4.397695
## ownerThird Owner -20.32377 -15.326617 -10.016050
Here's how to interpret these transformed coefficient estimates and the confidence intervals:

- Each additional model year is associated with a roughly 12.4% higher selling price, holding the other variables constant.
- Petrol cars sell for roughly 39.6% less than diesel cars, all else equal.
- Manual transmission cars sell for roughly 57.3% less than automatic cars.
- Second owner and third owner cars sell for roughly 7.8% and 15.3% less than first owner cars, respectively.
- Each 95% confidence interval gives a range of plausible values for the corresponding effect: under repeated sampling, 95% of intervals constructed this way would contain the true value.
Another way to visualize the individual effect of each variable on price, while holding the effects of the other variables constant, is with added variable plots.
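These can be produced with the car package, for example:

library(car)
avPlots(lm1)  # one partial regression plot per coefficient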
Let’s plot the diagnostic graphs for lm1, and see if our model fits
the assumptions for an OLS linear regression.
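A sketch of the usual base R diagnostics, plus the variance inflation factor check whose output is shown below:

library(car)
par(mfrow = c(2, 2))
plot(lm1)  # residuals vs. fitted, Q-Q, scale-location, residuals vs. leverage
vif(lm1)   # (generalized) variance inflation factors to check multicollinearity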
## GVIF Df GVIF^(1/(2*Df))
## year 1.286950 1 1.134438
## fuel 1.019771 1 1.009837
## transmission 1.023523 1 1.011693
## owner 1.253309 2 1.058070
Another way to visually assess the model’s fit is to plot the fitted
values against the real observed values. Let’s do this, while reversing
the log transformations on the model’s predictions.
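A minimal sketch (plot styling is an assumption):

fitted_price <- exp(fitted(lm1))  # reverse the log transformation
plot(fitted_price, df$selling_price,
     xlab = "Fitted selling price (thousands)",
     ylab = "Observed selling price (thousands)")
abline(0, 1, col = "red")  # points on this line are predicted perfectly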
While the fitted vs. observed values are reasonably close to following a
straight line, quite a few observations, especially higher priced cars
above 2.5m, are vastly underpredicted. We expected this from our
exploratory analysis, and splitting the data can offer more accurate
predictions, but for the sake of our example, let’s stick with these
results.
Now, let's fit a BLR model with the same formula as lm1. We will go over the outputs of a BLR model and compare them with the outputs of our OLS model lm1. Our outcome variable is log transformed, which gives it a distribution reasonably close to normal, so we will keep the default normal prior distributions. We don't have any prior information on the regression coefficients' distributions, so we will leave those at their defaults too.

Our BLR model is fit with a Markov chain Monte Carlo (MCMC) algorithm, which draws random samples from the model parameters' posterior distribution. Each run of a BLR model will yield slightly different results, so we need to set a seed to get replicable outputs.
library(rstanarm)  # Bayesian regression models via Stan

set.seed(1)
blr1 <- stan_glm(log(selling_price) ~ year + fuel + transmission + owner,
                 data = df, seed = 1)
The coefficients of the model blr1 are on the log scale, so let's reverse the transformation, convert the estimates into percentage changes, and summarize the posterior distributions of the model parameters.
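The summary below looks like the output of bayestestR's describe_posterior; a sketch of that call (the exact code used, including the percentage conversion, is an assumption):

library(bayestestR)
# reports medians, credible intervals, pd, ROPE, Rhat and ESS, on the log scale;
# the table below has additionally been converted to percentage changes
# with 100 * (exp(b) - 1), as we did for lm1
describe_posterior(blr1)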
## Summary of Posterior Distribution
##
## Parameter | Median | 95% CI | pd | ROPE | % in ROPE | Rhat | ESS
## ---------------------------------------------------------------------------------------------------
## year | 12.46 | [ 11.96, 12.90] | 100% | [-1.00, 1.01] | 0% | 0.999 | 4329.00
## fuelPetrol | -39.58 | [-41.30, -37.76] | 100% | [-1.00, 1.01] | 0% | 1.000 | 5993.00
## transmissionManual | -57.26 | [-59.22, -55.19] | 100% | [-1.00, 1.01] | 0% | 1.000 | 5666.00
## ownerSecond Owner | -7.81 | [-11.10, -4.32] | 100% | [-1.00, 1.01] | 0% | 1.000 | 4917.00
## ownerThird Owner | -15.35 | [-20.29, -9.90] | 100% | [-1.00, 1.01] | 0% | 1.000 | 4602.00
Let’s compare these with the estimated coefficients of lm1, and their 95% confidence intervals.
## 2.5 % estimate 97.5 %
## year 11.99531 12.446749 12.900012
## fuelPetrol -41.34493 -39.591267 -37.785173
## transmissionManual -59.27353 -57.275207 -55.178830
## ownerSecond Owner -11.12550 -7.822957 -4.397695
## ownerThird Owner -20.32377 -15.326617 -10.016050
We can see that the estimates of lm1 and blr1 are very close. The 95% confidence intervals of lm1 are also very close to the 95% credible intervals of blr1. However, their interpretations are a bit different:

- The OLS confidence interval is a statement about the estimation procedure: under repeated sampling, 95% of intervals constructed this way would contain the true parameter value.
- The BLR credible interval is a direct probability statement: given the data and the priors, the parameter lies within the interval with 95% probability.
We also have a few metrics to assess each parameter in blr1:

- pd (probability of direction): the probability that the effect is entirely positive or entirely negative; 100% means the posterior leaves essentially no doubt about the direction of the effect.
- ROPE and % in ROPE: the region of practical equivalence around zero, and the share of the posterior that falls inside it; 0% suggests a practically meaningful effect.
- Rhat: an MCMC convergence diagnostic; values very close to 1 indicate the chains converged.
- ESS: the effective sample size of the posterior draws; higher values mean more reliable estimates.
BLR generates a posterior probability distribution for each model parameter, which helps visualize its level of uncertainty. Let's plot the posterior distributions of our model coefficients, with the log transformations reversed. Keep in mind that the X-axis values in the plots below are exponentiated log coefficients, i.e. multiplicative effects on price: the closer a value is to 1, the smaller its effect on price. A coefficient greater than 1 has a positive effect on price, and one smaller than 1 has a negative effect.
If we want to compare the effect and uncertainty of each coefficient, we
can plot their distributions together, though it will be hard to see the
distribution for each individual coefficient this way.
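A sketch of one way to produce such a combined plot, using the bayesplot package (the plotting tool is an assumption):

library(bayesplot)
draws <- exp(as.matrix(blr1)[, 2:6])  # coefficient draws, log transformation reversed
mcmc_areas(draws, prob = 0.95)        # one density per coefficient, 95% interval shaded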
In this plot, coefficients that are farther away from 1 have a bigger
unit effect on the price, coefficients greater than 1 have a positive
effect, and coefficients smaller than 1 have a negative effect.
The ranges of the coefficient posterior distributions give us an idea about their uncertainty.
The BLR model also gives us a posterior probability distribution for
the predicted values of the outcome variable. We can plot this against
the distribution of the original selling price in our dataset, to check
if the predictions fit well with the original data.
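This kind of plot is rstanarm's posterior predictive check; a minimal sketch, assuming the default density overlay:

# overlay 50 posterior predictive densities on the observed log(selling_price)
pp_check(blr1, nreps = 50)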
Overall, the posterior probability distributions are reasonably close to
the distribution of the original outcome variable, though there are
considerable deviations in certain ranges. We can see from these plots
that the log transformation brought the distribution of selling price
reasonably close to a normal distribution.
We can plot the average selling prices predicted by blr1, against the
actual values of selling price, just like we did with lm1. This will
yield very similar results, as the average estimates of blr1 are very
close to the estimates of lm1.
We can see that the average fitted values of blr1 are practically the same as the fitted values of lm1. They overlap almost perfectly in the plot, though blr1's predictions are very slightly higher.
While the OLS model lm1 generates one prediction per data point, the BLR model blr1 generates 4,000 predictions per data point, one per posterior draw. We can visualize the range of predictions made by blr1 as a histogram, and overlay the single prediction made by lm1. Let's do this for one observation in our dataset:
## name year selling_price km_driven fuel
## 1048 Ford Ecosport 1.5 DV5 MT Ambiente 2013 500 60 Diesel
## seller_type transmission owner
## 1048 Individual Manual Second Owner
Let’s predict a single price for this car, using lm1, and 4,000 prices, using blr1. Let’s plot the prices predicted by blr1 as a histogram, along with the point prediction of lm1, and the original price of the car.
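A minimal sketch of these predictions (the row index and plot styling are assumptions based on the printout above):

obs <- df[1048, ]  # the Ford Ecosport shown above

point_pred <- exp(predict(lm1, newdata = obs))     # single back-transformed prediction
pp <- exp(posterior_predict(blr1, newdata = obs))  # 4,000 posterior predictive draws

hist(pp, breaks = 50, main = "", xlab = "Predicted selling price (thousands)")
abline(v = point_pred, col = "red", lwd = 2)          # lm1's point prediction
abline(v = obs$selling_price, col = "blue", lwd = 2)  # actual selling price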
lm1's point prediction for this observation is 390k, underestimating the real price of 500k. However, blr1's posterior prediction tells us more about the degree of uncertainty in the price predictions: instead of a single number, we get a whole distribution of plausible prices, and the observed price of 500k, while above most of the predictions, falls well within that distribution's range.
Our OLS linear regression model, lm1, yielded a decent fit for our data, generally satisfying the assumptions of OLS linear regression, or coming close.

Our BLR model, blr1, yielded average parameter estimates and average predictions close to those of lm1, along with posterior distributions that quantify their uncertainty.

Overall, both models confirmed the chosen predictors as highly significant predictors of selling price, but the predictions themselves can be quite inaccurate, especially for higher priced cars.