Bayes Theorem is one of the most powerful equation in Statistics. It uses prior knowledge and takes a different approach compared to Frequenist approch. Bayesian vs Frequentist debate has a big history and an on-going debate. It is imperative to understand both sides of the table.
In this blog, we will introduce Bayesian Linear regression and the math behind it. We will also compare it with general linear regression.
From the frequentist approch, the linear regression will be of below form.
\(y=\beta_0+\beta_1x_1+\beta_2x_2+e\)
We have seen various forms of linear regression throughout the class. Here y is the response variable and x are the predictor variables. All we do is to find the weights of \(\beta\) by minimizing residual sum squares(RSS). Line of best fit is formed using OLS method.
With all these calculation, we boild down to single estimate for the model parameters with confidence interval using training data. Our model is completely informed by the data. Once we have our \(\hat\beta\) values, we will estimate the response values.
excercise <- read.csv('./data/excercise.csv',stringsAsFactors = FALSE)
calories <- read.csv('./data/calories.csv',stringsAsFactors = FALSE)
df = merge(x=excercise,y=calories,by='User_ID')
head(df)
## User_ID Gender Age Height Weight Duration Heart_Rate Body_Temp Calories
## 1 10001159 female 67 176 74 12 103 39.6 76
## 2 10001607 female 34 178 79 19 96 40.6 93
## 3 10005485 female 38 178 77 14 82 40.5 49
## 4 10005630 female 39 169 66 8 90 39.6 36
## 5 10006441 male 23 169 73 25 102 40.7 122
## 6 10006606 male 50 183 89 23 96 40.4 130
Now, we will fit a simple linear regression using one variable and estimate the model parameters.
linear_model <- lm(Calories ~ Duration,df)
summary(linear_model)
##
## Call:
## lm(formula = Calories ~ Duration, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -72.290 -11.215 -0.215 9.995 135.019
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -21.8597 0.3189 -68.55 <2e-16 ***
## Duration 7.1729 0.0181 396.30 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.44 on 14998 degrees of freedom
## Multiple R-squared: 0.9128, Adjusted R-squared: 0.9128
## F-statistic: 1.571e+05 on 1 and 14998 DF, p-value: < 2.2e-16
Here the intercept is -21.85 and the estimated \(\hat\beta\) for duration is is 7.1729. This means, if you excercise duration is 20 minutes, calories which you will burn is
plot(Calories ~ Duration,df)
-21.87+(7*20)
## [1] 118.13
around 118.13 calories. So it provides a single point of estimate. We can add confidence interval to it. Current dataset is of decent size, but if we have a small dataset the estimate fluctuates. This is were Bayesian linear regression comes to the picture.
Bayesian linear regression is an one form of approch to linear regression within the context of Bayesian inference. In Bayesian viewpoint, linear regression is formulated using probability distributions rather than point estimates. Response variable y is not estimated as a single value, but it is assumed to be drawn from probability distribution.
\(y = N(\beta^T X, \sigma^2 I)\)
Response y is generated from a normal distribution characterized by a mean and variance. The mean for linear regression is the transpose of the weight matrix multipled by predictor matrix. The mean for linear regression is the transpose of the weight matrix multiplied by predictor matrix.
Bayesian linear regression will determine the posterior distribution for model parameters. Here response variable also assumed to come from a normal distribution.
\(P(\beta|y,X) = P(y|\beta, X) * P(\beta|X) / P(y|X)\)
Here \(P(\beta|y,X)\) is the posterior probability distribution of the model paraters given the inputs and outputs. This is equal to the likelihood of the data \(P(y|\beta,X)\), multiplied by the prior probabiliy of the paraters and divded by a normalization constant.
Priors: If we have domain knowledge, we can include them in our model paramters. In frequentist approach, all the parameters comes from the data.
Posterior: The result of performing Bayesian linear Regression is a distribution of possible model paraters based on the data and the prior knowledge. If we have fewer data points, the posterier distribution will be more spread out.
In this type of model, more the data points will converge to the values obtatined by OLS. Here we start with initial estimate, as we gather more evidence, our model becomes less wrong.
In practice, sampling methods to draw samples from posterior in order to approximate the posterior. Drawing random samples from the distribution to approximate the distribution is an application of Monte Carlo methods.
Lets first try to fit a model with just 6 records. Bayesian regression models are good to estimate the model parameters with any number of records.
#y = df$Calories %>% head()
#X = as.matrix(as.numeric(unlist(df %>% select(Duration))) %>% head())
#model1_lessdata <- BLR(y,XL=X)
print(paste0("Intercept: ",model1_lessdata$bL))
## [1] "Intercept: 0.824959394377874"
print(paste0("Beta-hat estimate: ",model1_lessdata$mu))
## [1] "Beta-hat estimate: 68.2164393705854"
hist(rnorm(100,model1_lessdata$bL,model1_lessdata$SD.bL),main = 'Bayesian model with 6 records')
Above plot is an approximate generation of normal distribution on \(\hat\beta\). Even the input records is very small, still Bayesian model has the orginal estimate in its distribution.
Now lets try to create a model using all the records in the dataset.
#y = df$Calories
#X = as.matrix(as.numeric(unlist(df %>% select(Duration))) )
#model2_moredata <- BLR(y,XL=X)
print(paste0("Intercept: ",model2_moredata$bL))
## [1] "Intercept: 7.17534789975482"
print(paste0("Beta-hat estimate: ",model2_moredata$mu))
## [1] "Beta-hat estimate: -21.8855582322848"
hist(rnorm(100,model2_moredata$bL,model2_moredata$SD.bL),main = 'Bayesian model with all records')
Above plot shows the exact \(\hat\beta\) parameters which we got from OLS method. It also perfectly predicts the intercept of the model. Initially the estimate has a wide normal distribution. It gradually shrinks to the actual value. This is the power of Bayesian Regression.
Bayesian inference and modelling is used in various parts of machine learning. It is important to understand how it works compared to frequentist approch. So it can be used in the correct dataset.