---
title: "Bayesian regression with implementation in R"
output: html_notebook
---
This is a reproduction, for study purposes, from https://towardsdatascience.com/bayesian-regression-with-implementation-in-r-fa71396dd59e

# Starting from the basics
Recall that in linear regression, we are given target values $y$ and data $X$, and we use the model
![](/Users/maxandchang/OneDrive/ModelComparison/LinearModel.png)
$$y = Xw + \epsilon,$$
where $y$ is an $N \times 1$ vector, $X$ is an $N \times D$ matrix, $w$ is a $D \times 1$ vector, and the error $\epsilon$ is an $N \times 1$ vector. We have $N$ data points. The dimension $D$ counts features: if we use a list of $x$, a list of $x^2$, and a list of 1's corresponding to $w_0$, we say $D = 3$.
In classic linear regression, the error term is assumed to be normally distributed, so it immediately follows that $y$ is normally distributed with mean $Xw$ and whatever variance the error term has (denote it $\sigma^2$, or a diagonal covariance matrix with entries $\sigma^2$). The normal assumption works well in most cases, and this normal model is also what we use in Bayesian regression.
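As a tiny, hypothetical illustration of these dimensions (my own snippet, not from the original article): with the features 1, $x$, and $x^2$, the design matrix has $D = 3$ columns.
```{r}
# Hypothetical example: an N x 3 design matrix for the features 1, x, x^2
x <- c(-1, 0, 1, 2)            # N = 4 raw inputs
X_design <- cbind(1, x, x^2)   # columns correspond to w_0, w_1, w_2, so D = 3
dim(X_design)                  # 4 3
```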

# Parameter Inference
We are now faced with two problems: inference of $w$, and prediction of $y$ for any new $X$. Using the well-known Bayes' rule and the above assumptions, we are only a few steps away from not only solving these two problems but also giving a full probability distribution of $y$ for any new $X$. Here is Bayes' rule in our notation, expressing the posterior distribution of the parameter $w$ given the data:
![Bayes](/Users/maxandchang/OneDrive/ModelComparison/Bayes.png)
$$\pi(w \mid y, X) = \frac{f(y \mid w, X)\,\pi(w)}{\int f(y \mid w, X)\,\pi(w)\,dw}$$
Here $\pi$ and $f$ are probability density functions. Since the result is a function of $w$, we can ignore the denominator, knowing that the numerator is proportional to the left-hand side up to a constant. We know from the assumptions that the likelihood $f(y \mid w, X)$ is normal. The other term, the prior distribution of $w$, reflects, as the name suggests, prior knowledge of the parameters.
- Prior Distribution
Defining the prior is an interesting part of the Bayesian workflow. For convenience we let $w \sim N(m_0, S_0)$, so the hyperparameters $m_0$ and $S_0$ encode our prior knowledge of $w$. If you have little knowledge of $w$, or find any assignment of $m_0$ and $S_0$ too subjective, 'non-informative' priors are a remedy: set $m_0$ to 0 and, more importantly, set $S_0$ to a diagonal matrix with very large values. We are then saying that $w$ has a very high prior variance, i.e. we have little knowledge of what $w$ will be.

With all these probability functions defined, a few lines of simple algebraic manipulation (quite a few lines, in fact) give the posterior after observing $N$ data points:
![posterior](/Users/maxandchang/OneDrive/ModelComparison/posterior.png)
$$w \mid y \sim N(m_N, S_N), \qquad S_N = \Big(S_0^{-1} + \tfrac{1}{\sigma^2}\Phi^T\Phi\Big)^{-1}, \qquad m_N = S_N\Big(S_0^{-1}m_0 + \tfrac{1}{\sigma^2}\Phi^T y\Big),$$
where $\Phi$ is the (feature-expanded) design matrix.
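As a minimal sketch of these update formulas in R (my own helper, not part of the original article; the name `posterior_update` is hypothetical):
```{r}
# Sketch: conjugate posterior update for w ~ N(m0, S0) with noise variance sigma2.
# Phi is the N x D (feature-expanded) design matrix, y the targets.
posterior_update <- function(m0, S0, Phi, y, sigma2) {
  SN <- solve(solve(S0) + t(Phi) %*% Phi / sigma2)
  mN <- SN %*% (solve(S0) %*% m0 + t(Phi) %*% y / sigma2)
  list(mN = mN, SN = SN)
}
```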

# Predictive Distribution
A full Bayesian approach means not only getting a single prediction (denote the new data pair by $y_0, x_0$) but also acquiring the distribution of this new point.
![](/Users/maxandchang/OneDrive/ModelComparison/NewDistribution.png)
$$f(y_0 \mid x_0, y) = \int f(y_0, w \mid x_0, y)\,dw = \int f(y_0 \mid x_0, w)\,\pi(w \mid y)\,dw$$
The first equality is the reverse of marginalizing a joint density to get a marginal; the second applies Bayes' rule inside the integral, where we have also dropped unnecessary dependencies. Notice that we know both probability functions under the final integral. The full predictive distribution works out to:
![](/Users/maxandchang/OneDrive/ModelComparison/PredictiveDistribution.png)
$$y_0 \mid x_0, y \sim N\big(m_N^T\phi(x_0),\; \sigma^2 + \phi(x_0)^T S_N\,\phi(x_0)\big)$$

# Implementation in R

Implementation in R is quite convenient. Backed by the theoretical results above, we simply translate the matrix formulas into code and obtain both point predictions and full predictive distributions. To illustrate, we use a toy problem: $X$ runs from -1 to 1, evenly spaced, and $y$ is built by stitching together sinusoidal curves with added normal noise (see the graph below for an illustration of $y$).
![BayesianToyProblem](/Users/maxandchang/OneDrive/ModelComparison/BayesianToyProblem.png)
The following code generates the data.
```{r}
library(ggplot2)
# — — — — — Get data — — — — — — — — — — — — — — — — — — — — —+
# set.seed(1)                  # uncomment for a reproducible draw
X <- (-30:30)/30               # 61 evenly spaced points in [-1, 1]
N <- length(X)
D <- 10                        # basis dimension: x itself plus 9 RBFs
var <- 0.15*0.15               # noise variance sigma^2 (note: masks base::var)
e <- rnorm(N, 0, var^0.5)      # rnorm takes the standard deviation
EY <- sin(2*pi*X)*(X<=0) + 0.5*sin(4*pi*X)*(X>0)   # true function
Y <- EY + e                    # noisy observations
data <- data.frame(X, Y)
g1 <- ggplot(data=data) + geom_point(mapping=aes(x=X, y=Y))
g1
```
The following code (under the section Maximum a Posteriori) implements all of the above theoretical results through matrix multiplications. We also expand the features of x (phi_X in the code, under the section Construct basis functions). Just as we might expand x into x² and so on, we now expand it into 9 radial basis functions, each looking like the one below. Note that although these look like normal densities, they are not interpreted as probabilities.
One advantage of radial basis functions is that they can fit a wide variety of curves, including polynomial and sinusoidal ones.
![RBF1](/Users/maxandchang/OneDrive/ModelComparison/RBF1.png)
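To visualize the nine bumps directly, here is a quick base-R sketch of my own, using the same centers and width as the chunk below:
```{r}
# Sketch: the 9 Gaussian basis functions used below (centers evenly spaced
# in (-1, 1); the width 2*var matches the exponent in the main code)
x_grid  <- seq(-1, 1, length.out = 200)
centers <- seq(-1, 1, length.out = 11)[2:10]   # 9 interior centers
width   <- 2 * 0.15^2                          # same 2*var as below
matplot(x_grid, sapply(centers, function(m) exp(-(x_grid - m)^2 / width)),
        type = "l", lty = 1, xlab = "x", ylab = "basis value")
```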
```{r}
# — — — — — Construct basis functions — — — — — — — — — — — —+
phi_X <- matrix(0, nrow=N, ncol=D)
phi_X[,1] <- X                              # first feature: x itself
mu <- seq(min(X), max(X), length.out=D+1)   # 11 evenly spaced points in [-1, 1]
mu <- mu[c(-1, -length(mu))]                # drop the endpoints: 9 interior centers
for(i in 2:D){
  phi_X[,i] <- exp(-(X-mu[i-1])^2/(2*var))  # Gaussian bump; width reuses var
}

# — — — — — Maximum a Posteriori — — — — — — — — — — — — — — —+
# Commented out below is the general conjugate prior N(m0, S0)
# (note: base R has no inv(); use solve() for the matrix inverse)
# m0 <- matrix(0,D,1)
# S0 <- diag(x=1000,D,D)
# SN <- solve(solve(S0) + t(phi_X)%*%phi_X/var)
# mN <- SN %*% (solve(S0)%*%m0 + t(phi_X)%*%Y/var)
# Y_hat <- t(mN) %*% t(phi_X)

# We use a non-informative (flat) prior for now, i.e. S0^{-1} -> 0
m0 <- matrix(0,D,1)                # prior mean; drops out under the flat prior
SN <- solve(t(phi_X)%*%phi_X/var)  # posterior covariance S_N
mN <- SN %*% t(phi_X) %*% Y / var  # posterior mean m_N (also the MAP/OLS fit)
Y_hat <- t(mN) %*% t(phi_X)        # predictive mean at the training inputs
var_hat <- array(0, N)
for(i in 1:N){
  var_hat[i] <- var + phi_X[i,] %*% SN %*% phi_X[i,]   # predictive variance
}
g_bayes <- g1 + geom_line(mapping=aes(x=X, y=Y_hat[1,]), color='#0000FF') +
  geom_line(aes(x=X, y=EY), color='red')
g_bayes
# y does not belong in geom_ribbon's aes (it triggers a warning), and a
# constant alpha should be set outside aes()
g_bayes_full <- g_bayes + geom_ribbon(mapping=aes(x=X,
                  ymin=Y_hat[1,]-1.96*var_hat^0.5,
                  ymax=Y_hat[1,]+1.96*var_hat^0.5),
                fill="#9999FF", alpha=0.1)
g_bayes_full
```
The blue line is the expected value of the predictive distribution at each point $x$, and the light-blue band covers roughly two standard deviations (the 1.96 multiplier) around it. The red line is the true function, and the dots are data randomly generated from that function with normal noise.

# Remarks
The multiple linear regression result is the same as that of Bayesian regression using an improper prior with an infinite covariance matrix; however, the Bayesian predictive distribution usually has a tighter variance. In general, it is good practice to gather some domain knowledge about the parameters and use an informative prior. Bayesian regression can then quickly quantify and show how different prior knowledge impacts predictions.
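As a quick sketch of that last point (my own code, reusing phi_X, Y, var, and D from the chunks above): tightening the prior covariance $S_0$ modestly shrinks the average predictive standard deviation toward the noise level.
```{r}
# Sketch: how the prior scale s0 affects the average predictive sd
for (s0 in c(0.01, 1000)) {
  S0   <- diag(s0, D, D)
  SN_s <- solve(solve(S0) + t(phi_X) %*% phi_X / var)
  # m0 = 0, so the prior-mean term in mN drops out
  pred_var <- var + rowSums((phi_X %*% SN_s) * phi_X)  # diag(Phi SN Phi^T)
  cat("prior variance", s0, ": mean predictive sd =", mean(sqrt(pred_var)), "\n")
}
```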

Bayesian regression is quite flexible, as it quantifies uncertainty in everything: predictions as well as all parameters. This flexibility offers several conveniences. For example, you can marginalize out any variables from the joint distribution and study the distribution of any combination of variables. Fitting data in this framework also lets you 'learn as you go'. Say I first observed 10,000 data points and computed a posterior for the parameter $w$. If I later acquire 1,000 more points, then instead of rerunning the whole regression I can use the previously computed posterior as my prior for those 1,000 points. This sequential process yields the same result as fitting all the data at once. I like this idea because it is very intuitive: a learned opinion is proportional to previously learned opinions plus new observations, and the learning goes on. A joke says that a Bayesian who dreams of a horse and observes a donkey will call it a mule; but with more observations, he will eventually say it is indeed a donkey.
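A minimal sketch of this sequential update (my own demo, reusing the hypothetical `posterior_update` helper from the Parameter Inference section together with phi_X, Y, var, D, and N from the chunks above):
```{r}
# Sketch: the posterior from batch 1, used as the prior for batch 2,
# matches the single-pass posterior over all the data
m_prior <- matrix(0, D, 1)
S_prior <- diag(1000, D, D)
full   <- posterior_update(m_prior, S_prior, phi_X, Y, var)   # all N points at once
batch1 <- posterior_update(m_prior, S_prior, phi_X[1:30, ], Y[1:30], var)
batch2 <- posterior_update(batch1$mN, batch1$SN, phi_X[31:N, ], Y[31:N], var)
max(abs(full$mN - batch2$mN))   # numerically zero
```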
