The objective of these experiments is to compare and understand the fundamental differences in modeling procedure and evaluation between the frequentist and Bayesian paradigms. I will pick a simple regression dataset and run linear regression under both paradigms. As I'm no expert in Bayesian statistics, this notebook is a beginner's first impression of Bayesian modeling and an attempt to develop intuition for it.

library(ggplot2)
library(GGally)
library(rstan)
library(bayesplot)

options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)

For simplicity I will use the airquality dataset: it has few independent variables and is fairly small. The target variable in the dataset is Ozone.

airq <- airquality[complete.cases(airquality),]  # drop NAs
dim(airq)
## [1] 111   6
head(airq)
ggpairs(airq, progress=FALSE)

Note that we are violating the linear regression assumption that the errors (and hence the target, conditional on the predictors) are normally distributed. In a more advanced exercise, I will attempt modeling with different likelihood distributions and generalized models. Transforming the inputs (quadratic?) could also help, but I would like to explore the methods with all the imperfections in the data, so I will not be performing any preprocessing.
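
A quick histogram makes the skew concrete:

# the target is visibly right-skewed
hist(airq$Ozone, breaks = 20, xlab = "Ozone", main = "Distribution of the target")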

Univariate regression

Ozone vs temperature

There seems to be a simple relationship between the two, allowing for easy comparison and evaluation of estimates.

Linear regression

model <- lm(Ozone ~ Temp, airq)
summary(model)
## 
## Call:
## lm(formula = Ozone ~ Temp, data = airq)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -40.922 -17.459  -0.874  10.444 118.078 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -147.6461    18.7553  -7.872 2.76e-12 ***
## Temp           2.4391     0.2393  10.192  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.92 on 109 degrees of freedom
## Multiple R-squared:  0.488,  Adjusted R-squared:  0.4833 
## F-statistic: 103.9 on 1 and 109 DF,  p-value: < 2.2e-16
plot(model, pch=20)

plot(airq$Temp, airq$Ozone, pch=20)
abline(model, col="blue")

On a side note, the lowess fit in the residual plot shows a pattern that the linear model failed to capture.

Bayesian regression

Assuming a non-informative prior (with no prior statements in the model block, Stan defaults to flat priors over each parameter's declared range). For simplicity, I'm including the prediction component in the Stan code below; depending on the use case this might not be ideal. I'm also not using an external test set, as I'm not necessarily performing prediction.

data {
  int<lower=0> N;
  vector[N] x;
  vector[N] y;
}
parameters {
  real alpha;
  real beta;
  real<lower=0> sigma;
}
model {
  y ~ normal(alpha + beta * x, sigma);
}
generated quantities {
  vector[N] y_out;
  for (n in 1:N)
    y_out[n] = normal_rng(alpha + x[n] * beta, sigma);
}

In RStudio, the stan code chunk automatically compiles into a stanmodel object for us (here named bayesmodel), so we can directly use the rstan::sampling method.
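
Outside RStudio, the equivalent is to compile the model explicitly with rstan::stan_model; a minimal sketch, assuming the Stan code above is saved to a file named linear.stan (the file name is my choice):

bayesmodel <- stan_model(file = "linear.stan")  # compile once; auto_write caches the binary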

bayesfit <- sampling(bayesmodel, data=list(x=airq$Temp, y=airq$Ozone, N=nrow(airq)), iter=1000)

Visualizing the posteriors

print(bayesfit, pars=c("alpha", "beta", "sigma"))
## Inference for Stan model: 8a02eb78101a371090d8c70a5bee506c.
## 4 chains, each with iter=1000; warmup=500; thin=1; 
## post-warmup draws per chain=500, total post-warmup draws=2000.
## 
##          mean se_mean    sd    2.5%     25%     50%     75%   97.5% n_eff Rhat
## alpha -146.34    0.83 19.17 -182.72 -159.10 -146.18 -133.45 -107.81   527 1.00
## beta     2.42    0.01  0.24    1.92    2.25    2.42    2.59    2.88   529 1.00
## sigma   24.31    0.07  1.74   21.29   23.11   24.22   25.39   28.17   672 1.01
## 
## Samples were drawn using NUTS(diag_e) at Sat Jun 27 22:12:18 2020.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).
plot(bayesfit, plotfun="hist", pars=c("alpha", "beta", "sigma"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

pairs(bayesfit, pars=c("alpha", "beta", "sigma"))

traceplot(bayesfit, inc_warmup=T, pars=c("alpha", "beta", "sigma"))

Plot predictions and 90% intervals

cols <- grep("^y_out", names(bayesfit), value = TRUE)  # the posterior predictive (y_out) columns

fitmat <- as.matrix(bayesfit)
fitmat <- fitmat[,cols]

color_scheme_set("red")

# posterior outcome distribution
ppc_dens_overlay(airq$Ozone, fitmat)

# prediction plot
ppc_ribbon(y = airq$Ozone, yrep = fitmat, x = airq$Temp)

# individual uncertainties
ppc_intervals(y = airq$Ozone, yrep = fitmat, x = airq$Temp)

As expected, the interval plot shows higher uncertainty in predictions around temperatures of 75 to 85; the ribbon doesn't show this regional uncertainty as clearly. This type of local uncertainty is comparatively difficult to capture with a frequentist approach. If the likelihood weren't fixed to be normal, the posterior predictive distributions might look closer to the original distribution; then again, the original relationship isn't purely linear. We could get around this by normalizing or reducing the skewness of the target variable, which I do not do.
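
One rough way to quantify this local uncertainty is the width of the posterior predictive interval at each observation; a small sketch using the fitmat draws from above:

# width of the 90% posterior predictive interval for each observation
widths <- apply(fitmat, 2, function(d) diff(quantile(d, probs = c(0.05, 0.95))))
plot(airq$Temp, widths, pch = 20, xlab = "Temp", ylab = "90% interval width")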

Comparing estimates

Linear regression

## 
## Call:
## lm(formula = Ozone ~ Temp, data = airq)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -40.922 -17.459  -0.874  10.444 118.078 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -147.6461    18.7553  -7.872 2.76e-12 ***
## Temp           2.4391     0.2393  10.192  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.92 on 109 degrees of freedom
## Multiple R-squared:  0.488,  Adjusted R-squared:  0.4833 
## F-statistic: 103.9 on 1 and 109 DF,  p-value: < 2.2e-16

Bayes regression

## Inference for Stan model: 8a02eb78101a371090d8c70a5bee506c.
## 4 chains, each with iter=1000; warmup=500; thin=1; 
## post-warmup draws per chain=500, total post-warmup draws=2000.
## 
##          mean se_mean    sd    2.5%     25%     50%     75%   97.5% n_eff Rhat
## alpha -146.34    0.83 19.17 -182.72 -159.10 -146.18 -133.45 -107.81   527 1.00
## beta     2.42    0.01  0.24    1.92    2.25    2.42    2.59    2.88   529 1.00
## sigma   24.31    0.07  1.74   21.29   23.11   24.22   25.39   28.17   672 1.01
## 
## Samples were drawn using NUTS(diag_e) at Sat Jun 27 22:12:18 2020.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).

The coefficient estimates are essentially the same: with a flat prior and a normal likelihood, the posterior means reproduce the OLS estimates up to sampling noise, as the quick check below shows.
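
# side-by-side check of the point estimates
coef(model)                                                     # OLS estimates
summary(bayesfit, pars = c("alpha", "beta"))$summary[, "mean"]  # posterior means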

Confidence vs credible intervals

Since the airq dataframe has repeated x values, I take the unique set of x values to make plotting the intervals easier.

Confidence and predictive intervals for the linear regression model

idx <- !duplicated(airq$Temp)  # take unique
x_preds <- airq$Temp[idx]
sort_idx <- order(x_preds)  # sort x
x_preds <- x_preds[sort_idx]

# confidence interval of lm 
conf_int <- predict(model, newdata=data.frame(Temp = x_preds), interval="confidence", level = 0.95)
pred_int <- predict(model, newdata=data.frame(Temp = x_preds), interval="prediction", level = 0.95)
plot(airq$Temp, airq$Ozone, xlab="x", ylab="y", main="Linear Regression", pch=20)
abline(model, col = "blue", lwd=2)
matlines(x_preds, pred_int[,2:3], col = "blue", lty=3, lwd=1)
polygon(c(x_preds, rev(x_preds)), c(conf_int[,3], rev(conf_int[,2])), col=rgb(0, 0, 1, 0.3), border=NA)  # x_preds is already sorted

Credible and predictive intervals for Bayesian regression

While the prediction plots shown earlier are a more Bayesian way to visualize the intervals, I plot the credible intervals in the same style as the linear regression plot for comparison.

# predictive interval
idx <- !duplicated(airq$Temp)  # take unique
x_preds <- airq$Temp[idx]
sort_idx <- order(x_preds)  # sort both x and y
x_preds <- x_preds[sort_idx]
y_preds <- fitmat[,idx]
y_preds <- y_preds[,sort_idx]

plot(airq$Temp, airq$Ozone, xlab="x", ylab="y", main="Bayesian Regression", pch=20)
cred_int <- apply(y_preds, 2, quantile, probs=c(0.025,0.5,0.975))  # 95% interval
lines(x_preds, cred_int[1,], col="red", lty=3, lwd=1)
lines(x_preds, cred_int[3,], col="red", lty=3, lwd=1)

# credible interval
x <- as.matrix(seq(min(airq$Temp), max(airq$Temp), 1))
beta <- as.matrix(bayesfit)[, "beta"]    # posterior draws of the slope
alpha <- as.matrix(bayesfit)[, "alpha"]  # posterior draws of the intercept

y_mean <- sweep(x %*% t(beta), 2, alpha, '+')  # alpha + beta * x for every posterior draw
ci <- apply(y_mean, 1, quantile, probs=c(0.025,0.5,0.975))
lines(x, ci[2,], col="red", lty=1, lwd=2)
polygon(c(x, rev(x)), c(ci[1,], rev(ci[3,])), border=NA, col=rgb(1, 0, 0, 0.3))

We can also plot the different regression lines from different samples of the posterior to check all the possible fits.
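
For instance, a minimal sketch overlaying the fits from a random subset of posterior draws:

draws <- as.matrix(bayesfit)          # columns include "alpha" and "beta"
plot(airq$Temp, airq$Ozone, pch = 20, xlab = "x", ylab = "y")
for (i in sample(nrow(draws), 100))   # 100 random posterior samples
  abline(a = draws[i, "alpha"], b = draws[i, "beta"], col = rgb(1, 0, 0, 0.1))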

Observations:

  1. Since we used a normal likelihood and uninformative priors, the credible interval essentially coincides with the confidence interval.

  2. Credible intervals are conceptually natural and intuitive: they are read directly off a probability distribution (see the one-liner after this list). The definition of a confidence interval (that, under repeated sampling, 95% of intervals constructed this way would contain the true parameter value) can feel awkward to many people.

  3. With the Bayesian approach, we can easily obtain uncertainties for individual predictions.
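
To illustrate point 2, a credible interval falls straight out of the posterior draws; a one-line sketch for the slope:

# 95% credible interval for the slope, read directly from the posterior draws
quantile(as.matrix(bayesfit)[, "beta"], probs = c(0.025, 0.975))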

Robust regression using the Student-t distribution

The normal assumption above places some restrictions on our model; in particular, the model is sensitive to outliers. Heavy-tailed distributions like the Student-t help address this issue. We will rewrite the Stan code to use a Student-t likelihood. I will use the common Gamma(2, 0.1) prior for the degrees-of-freedom parameter nu.
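
To get a feel for what this prior implies, a quick sketch of its density (shape 2, rate 0.1, so the prior mean of nu is 20):

# Gamma(2, 0.1) prior on nu: allows small nu (heavy tails) while keeping mass on moderate values
curve(dgamma(x, shape = 2, rate = 0.1), from = 0, to = 100, xlab = "nu", ylab = "prior density")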

data {
  int<lower=0> N;
  vector[N] x;
  vector[N] y;
}
parameters {
  real alpha;
  real beta;
  real<lower=0> sigma;
  real<lower=1> nu;
}
model {
  y ~ student_t(nu, alpha + beta * x, sigma);
  nu ~ gamma(2, 0.1);
}
generated quantities {
  vector[N] y_out;
  for (n in 1:N)
    y_out[n] = student_t_rng(nu, alpha + x[n] * beta, sigma);
}
bayesfit <- sampling(bayesmodel, data=list(x=airq$Temp, y=airq$Ozone, N=nrow(airq)), iter=1500)
print(bayesfit, pars=c("alpha", "beta", "sigma", "nu"))
## Inference for Stan model: 0ce30b79169eea018403a1ebdc422b15.
## 4 chains, each with iter=1500; warmup=750; thin=1; 
## post-warmup draws per chain=750, total post-warmup draws=3000.
## 
##          mean se_mean    sd    2.5%     25%     50%     75%   97.5% n_eff Rhat
## alpha -137.09    0.52 15.80 -167.99 -147.70 -136.79 -126.51 -106.32   923    1
## beta     2.27    0.01  0.21    1.87    2.14    2.27    2.41    2.67   920    1
## sigma   17.91    0.05  1.95   14.30   16.56   17.79   19.21   21.87  1429    1
## nu       6.29    0.09  3.50    2.61    4.17    5.43    7.38   15.01  1401    1
## 
## Samples were drawn using NUTS(diag_e) at Sat Jun 27 22:13:53 2020.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).
plot(bayesfit, plotfun="hist", pars=c("alpha", "beta", "sigma", "nu"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

pairs(bayesfit, pars=c("alpha", "beta", "sigma", "nu"))

traceplot(bayesfit, inc_warmup=T, pars=c("alpha", "beta", "sigma", "nu"))

cols <- grep("^y_out", names(bayesfit), value = TRUE)  # the posterior predictive (y_out) columns

fitmat <- as.matrix(bayesfit)
fitmat <- fitmat[,cols]

color_scheme_set("red")

# posterior outcome distribution
ppc_dens_overlay(airq$Ozone, fitmat)

# individual uncertainties
ppc_intervals(y = airq$Ozone, yrep = fitmat, x = airq$Temp)

Note the wider tails of the output distribution. This correctly shows the possible, albeit less likely, values the response variable can take.

Compare regression lines

# predictive interval
idx <- !duplicated(airq$Temp)  # take unique
x_preds <- airq$Temp[idx]
sort_idx <- order(x_preds)  # sort both x and y
x_preds <- x_preds[sort_idx]
y_preds <- fitmat[,idx]
y_preds <- y_preds[,sort_idx]

plot(airq$Temp, airq$Ozone, xlab="x", ylab="y", main="Bayesian Regression", pch=20)
abline(model, col="blue", lwd=2)
cred_int <- apply(y_preds, 2, quantile, probs=c(0.025,0.5,0.975))  # 95% interval
lines(x_preds, cred_int[1,], col="red", lty=3, lwd=1)
lines(x_preds, cred_int[3,], col="red", lty=3, lwd=1)

# credible interval, by manually computing output from the posterior samples of the estimates
x <- as.matrix(seq(min(airq$Temp), max(airq$Temp), 1))
beta <- as.matrix(bayesfit)[, "beta"]    # posterior draws of the slope
alpha <- as.matrix(bayesfit)[, "alpha"]  # posterior draws of the intercept

y_mean <- sweep(x %*% t(beta), 2, alpha, '+')  # alpha + beta * x for every posterior draw
ci <- apply(y_mean, 1, quantile, probs=c(0.025,0.5,0.975))
lines(x, ci[2,], col="red", lty=1, lwd=2)
polygon(c(x, rev(x)), c(ci[1,], rev(ci[3,])), border=NA, col=rgb(1, 0, 0, 0.3))

We see that our earlier regression line (in blue) was more influenced by the outliers (the points outside the prediction interval) than the robust Bayesian model (in red).

In a future exercise, we can look into more complex problems where the differences are less subtle than here.