This notebook is a continuation of a previous one: http://rpubs.com/SergioMarrero/349638.
Let’s load the data:
library(rethinking)
data("Howell1")
d <- Howell1
# keep only the adults
d2 <- d[d$age >= 18, ]
What we want to do now is to add a predictor. Our hypothesis is that height and weight are correlated: learning an individual’s weight should improve our knowledge of his or her height.
plot(d2$height ~ d2$weight)
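A quick numeric check of this hypothesis (a small addition to the original notebook) is the sample correlation between the two variables:
# correlation between height and weight among the adults
cor(d2$height, d2$weight)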
The strategy is to make the parameter for the mean of a Gaussian distribution, \(\mu\), into a linear function of the predictor variable. Let’s write down the assumptions we’ll make, calling \(Y\) the predicted variable and \(X\) the predictor (or explanatory) variable.
In the classic approach, we’d try to reach the best fit, but in the Bayesian approach each likelihood value represents the relative number of ways in which the model could produce the data. For each combination of parameter values, the machine computes the posterior probability, which is a measure of relative plausibility, given the model and data. So the posterior distribution ranks the infinite possible combinations of parameter values by their logical plausibility.
Let’s look at the model. If we don’t take any predictor into account, the model looks like this:
\[\begin{align*} & h_i \backsim \mathcal{N}(\mu, \sigma) \quad \quad \quad \textrm{Likelihood}\\ & \mu \backsim \mathcal{N}(178, 20) \quad \quad \quad \mu \textrm{ prior}\\ & \sigma \backsim \mathcal{U}(0, 50) \quad \quad \quad \sigma \textrm{ prior} \end{align*}\]But now we add a predictor, so we can remove the line \(\mu \backsim \mathcal{N}(178, 20)\): we are no longer interested in the global mean of the height (\(Y\)); instead we are interested in the mean of each height conditioned on the weight. Let’s express this idea as \(h_i \backsim \mathcal{N}(\mu_i, \sigma)\) with \(\mu_i = \alpha + \beta x_i\), where \(x_i\) is the weight of case \(i\). Now we need priors for the new parameters, the slope \(\beta\) and the intercept \(\alpha\).
Putting it all together we have:
\[\begin{align*} & h_i \backsim \mathcal{N}(\mu_i, \sigma) \quad \quad \quad \textrm{Likelihood}\\ &\mu_i = \alpha + \beta x_i \quad \quad \quad \textrm{ Linear Model}\\ &\alpha \backsim \mathcal{N}(178, 100) \quad \quad \quad \alpha \textrm{ prior}\\ &\beta \backsim \mathcal{N}(0, 10) \quad \quad \quad \beta \textrm{ prior}\\ & \sigma \backsim \mathcal{U}(0, 50) \quad \quad \quad \sigma \textrm{ prior} \end{align*}\]Let’s fit the model using quadratic approximation.
m4.3 <- map(
alist(
height ~ dnorm(mu, sigma),
mu <- a + b * weight,
a ~ dnorm(156, 100),
b ~ dnorm(0, 10),
sigma ~ dunif(0, 50)),
data = d2,
start = list(a = 125, b = 2))
Be aware that everything that depends upon parameters has a posterior distribution. The parameter \(\mu\) is no longer a parameter, since it has become a function of the parameters \(\alpha\) and \(\beta\). But since the parameters \(\alpha\) and \(\beta\) have a joint posterior, so too does \(\mu\). Since parameters are uncertain, everything that depends upon them is also uncertain.
Let’s look at the summary of the posterior:
precis(m4.3)
## Mean StdDev 5.5% 94.5%
## a 113.90 1.91 110.85 116.94
## b 0.90 0.04 0.84 0.97
## sigma 5.07 0.19 4.77 5.38
The first row gives the quadratic approximation for \(\alpha\), the second the approximation for \(\beta\), and the third the approximation for \(\sigma\). Let’s try to make some sense of them. \(\beta\) is a slope; its value can be read as: a person 1 kg heavier is expected to be 0.90 cm taller, and \(89\%\) of the posterior probability lies between 0.84 and 0.97. The value of the intercept is frequently uninterpretable on its own, which is why we need very weak priors for intercepts. Finally, the estimate for \(\sigma\) informs us of the width of the distribution of heights around the mean. A quick way to interpret it is to recall that about \(95\%\) of the probability in a Gaussian distribution lies within two standard deviations. So in this case the estimate tells us that \(95\%\) of plausible heights lie within 10 cm (\(2\sigma\)) of the mean height. But there is also uncertainty about this, as indicated by the \(89\%\) percentile interval.
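As a quick sanity check of this reading (a small addition to the original notebook), we can compute the same quantities directly from the MAP estimates:
# expected height difference (in cm) between two people 10 kg apart
coef(m4.3)["b"] * 10
# width (2 sigma) of the band containing roughly 95% of heights around the mean
2 * coef(m4.3)["sigma"]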
Let’s now look at the parameter correlations, computed from the variance-covariance matrix:
cov2cor(vcov(m4.3))
## a b sigma
## a 1.0000000000 -0.9898830221 0.0006375356
## b -0.9898830221 1.0000000000 -0.0006373961
## sigma 0.0006375356 -0.0006373961 1.0000000000
\(\alpha\) and \(\beta\) are almost perfectly correlated. It just means that these two parameters carry the same information - as you change the slope of the line, the best intercept changes to match it. But in more complex models, strong correlations like this can make it difficult to fit the model to the data. So we’ll want to use some golem engineering tricks to avoid it, when possible.
The first trick is CENTERING. Centering is the procedure of subtracting the mean of a variable from each value. So:
d2$weight.c <- d2$weight - mean(d2$weight)
Let’s refit the model:
m4.4 <- map(
alist(
height ~ dnorm(mu, sigma),
mu <- a + b * weight.c,
a ~ dnorm(156, 100),
b ~ dnorm(0, 10),
sigma ~ dunif(0, 50)),
data = d2,
start = list(a = 125, b = 2))
Let’s see the summary:
precis(m4.3, corr = TRUE)
## Mean StdDev 5.5% 94.5% a b sigma
## a 113.90 1.91 110.85 116.94 1.00 -0.99 0
## b 0.90 0.04 0.84 0.97 -0.99 1.00 0
## sigma 5.07 0.19 4.77 5.38 0.00 0.00 1
precis(m4.4, corr = TRUE)
## Mean StdDev 5.5% 94.5% a b sigma
## a 154.60 0.27 154.16 155.03 1 0 0
## b 0.91 0.04 0.84 0.97 0 1 0
## sigma 5.07 0.19 4.77 5.38 0 0 1
mean(d2$height)
## [1] 154.5971
The estimate for the intercept, \(\alpha\), still means the same thing it did before: the expected value of the outcome variable when the predictor variable is equal to zero. But now the mean value of the predictor is also zero. So the intercept also means: the expected value of the outcome when the predictor is at its average value. This makes the intercept directly interpretable as the expected value of the height data. That expected value does not depend on \(\beta\), and for this reason the two parameters are now uncorrelated.
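As a quick check (an addition to the original notebook), the intercept of the centered model should match the sample mean shown above almost exactly:
# intercept of the centered model; compare with mean(d2$height) above
coef(m4.4)["a"]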
Now it’s time to plot the posterior inference against the data. We’re going to start with a simple version of that task, superimposing just the MAP values over the height and weight data. Then we’ll slowly add more and more information to the prediction plots, until we’ve used the entire posterior distribution.
plot(height ~ weight, data = d2, col = "blue")
abline( a = coef(m4.3)["a"], b = coef(m4.3)["b"])
Be aware that this is the best fitted line, but not the only plausible one: it is the line defined by the MAP estimates, the peak of the posterior.
The MAP line is just the posterior mean, the most plausible line in the infinite universe of lines the posterior distribution has considered.
Remember, the posterior distribution considers every possible line connecting height to weight, and it assigns a relative plausibility to each. This means that each combination of \(\alpha\) and \(\beta\) has a posterior probability. So how can we get that uncertainty onto the plot? Together, a combination of \(\alpha\) and \(\beta\) defines a line. So we could sample a bunch of lines from the posterior distribution and then display those lines on the plot, to visualize the uncertainty in the regression relationship.
post <- extract.samples(m4.3)
head(post, 6)
coef(m4.3)
## a b sigma
## 113.8955139 0.9046745 5.0718207
Each row of the post data frame is a model: a line (intercept and slope) with a standard deviation (sigma). Each row is a correlated random sample from the joint posterior of all three parameters. The average of very many of these lines is the MAP line. But the scatter around that average is meaningful, because it alters our confidence in the relationship between the predictor and the outcome. So let’s display a bunch of these lines, so you can see the scatter. This lesson will be easier to appreciate if we use only some of the data to begin. Then you can see how adding in more data changes the scatter of the lines.
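As a quick check (again an addition), the column means of these samples should be very close to the coef(m4.3) values shown above:
# the average of many sampled lines is essentially the MAP line
colMeans(post)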
Let’s wrap the fitting and the plotting of sampled lines in a helper function:
make_scatterLinearModels <- function(nOfSamples, nOfLines, alp){
dN <- d2[ 1:nOfSamples , ]
mN <- map(
alist(
height ~ dnorm( mu , sigma ) ,
mu <- a + b*weight ,
a ~ dnorm( 178 , 100 ) ,
b ~ dnorm( 0 , 10 ) ,
sigma ~ dunif( 0 , 50 )
) , data=dN )
# extract nOfLines samples from the posterior
n <- nOfLines
post <- extract.samples( mN , n = n )
# display raw data and sample size
plot( dN$weight , dN$height ,
xlim=range(d2$weight) , ylim=range(d2$height) ,
col="red" , xlab="weight" , ylab="height" , type = "p")
mtext(concat("N = ",nOfSamples))
# plot the lines, with transparency
for ( i in 1:n )
abline( a=post$a[i] , b=post$b[i] , col=col.alpha("black",alp) )
}
par(mfrow = c(2,2))
make_scatterLinearModels(nOfSamples = 10, nOfLines = 1000, alp = 0.01)
make_scatterLinearModels(nOfSamples = 50, nOfLines = 1000, alp = 0.01)
make_scatterLinearModels(nOfSamples = 150, nOfLines = 1000, alp = 0.01)
make_scatterLinearModels(nOfSamples = 352, nOfLines = 1000, alp = 0.01)
The cloud of regression lines is an appealing display, because it communicates uncertainty about the relationship in a way that many people find intuitive. But it’s much more common to see the uncertainty displayed by plotting an interval or contour around the MAP regression line. So what we want is to plot a shaded region around the MAP line, to display the interval.
The interval we’re going to plot will incorporate uncertainty in both the slope \(\beta\) and the intercept \(\alpha\) at the same time. To understand how, let’s focus our attention on a single weight value, say 50 kilograms. You can quickly make a list of 10000 values of \(\mu\) for an individual who weighs 50 kilograms, by using your samples from the posterior. Remember that:
\[\mu_i = \alpha + \beta x_i\] If we used the MAP line to calculate \(\mu\) for \(x_i = 50\), we’d have:
\[\mu|(x = 50) = \alpha_{MAP} + \beta_{MAP} \cdot 50\] This would be the value of \(\mu\) if we used only the peak of the posterior. But what happens if we sample 10000 lines and use all of them to calculate \(\mu\)? We obtain a list of 10000 values of \(\mu\), and this list contains the uncertainty of the model. Let’s do it:
mu_at_50 <- post$a + post$b * 50
str(mu_at_50)
## num [1:10000] 159 159 160 160 159 ...
Let’s see the histogram and density of it:
hist(mu_at_50, col = "gray", prob = TRUE,
xlab = "E[height|weight = 50]")
lines(density(mu_at_50))
As the components of \(\mu\) have distributions, so too does \(\mu\). And since the distributions of \(\alpha\) and \(\beta\) are Gaussian, so too is the distribution of \(\mu\) (adding Gaussian distributions always produces a Gaussian distribution). Since the posterior for \(\mu\) is a distribution, you can find intervals for it, just like for any posterior distribution. To find the \(89\%\) highest posterior density interval of \(\mu\) at 50 kg, just use the HPDI command:
HPDI(mu_at_50, prob = 0.89)
## |0.89 0.89|
## 158.5853 159.6713
Let’s now do the same for all possible values of \(x\); we use the link function to do it. Just as in the case of \(x = 50\), we compute the distribution of \(\mu\), but now for every \(x_i\) in the data set:
mu <- link(m4.3, n = 1e4)
str(mu)
## num [1:10000, 1:352] 157 157 158 157 158 ...
Each row of the mu matrix is one sample from the posterior: the values of \(\mu\) computed from that sample for every case \(x_i\). There are 352 columns, one per case in the dataset, and each column contains 10000 samples of \(\mu\) for the corresponding weight. So the link function provides a posterior distribution of \(\mu\) for each case we feed it. Actually what we want is slightly different: we want the uncertainty for each value on the \(x\) axis, not just for the observed cases.
# Define a sequence of weights to compute predictions for (the values on the x axis)
weight.seq <- seq( from = 25, to = 70, by = 1)
# Use link to compute mu for these weights
mu <- link(m4.3, data = data.frame(weight = weight.seq))
str(mu)
## num [1:1000, 1:46] 138 137 136 135 136 ...
You can now see that the amount of uncertainty in \(\mu\) depends upon the value of weight. The final step is to summarize the distribution for each weight value. We’ll use apply, which applies a function of your choice to a matrix.
# summarize the distribution of mu
mu.mean <- apply(mu, 2, mean)
mu.HPDI <- apply(mu, 2, HPDI, prob = 0.89)
So let’s plot this:
par(mfrow = c(1,2))
# FIRST PLOT
# use type = "n" to hide raw data
plot(height ~ weight, d2, type = "n")
# loop over samples
for (i in 1:100)
points(weight.seq, mu[i,], pch = 16, col = col.alpha(rangi2, 0.1))
# SECOND PLOT
# plot raw data
plot(height ~ weight, data = d2, col = col.alpha(rangi2, 0.5))
# plot the map line
lines(weight.seq, mu.mean)
# plot a shaded region for 89% HPDI
shade(mu.HPDI, weight.seq, col = "red")
And now, compare this plot with the previous one.
For every model, keep the following in mind: these inferences are always conditional on the model. Even a very bad model can have very tight confidence intervals. It is better to think of it as: conditional on the assumption that mean height and weight are related by a straight line, this is the most plausible line, and these are its plausible bounds.
Now let’s walk through generating an \(89\%\) prediction interval for actual heights, not just the average height. This means we will incorporate the standard deviation \(\sigma\) and its uncertainty as well. Remember, the statistical model here is (omitting priors):
\[h_i \backsim \mathcal{N}(\mu_i, \sigma)\] \[\mu_i = \alpha + \beta x_i\]
As we’ve seen, \(\alpha\) and \(\beta\) are random variables with normal distributions, and so \(\mu\) is normal too. But now we are interested in each \(h_i\). What should we do? Here is where \(\sigma\) gets into the game.
Here’s how you do it. Imagine simulating heights. For any unique weight value, you sample from a Gaussian distribution with the mean \(\mu\) for that weight, using the corresponding value of \(\sigma\) sampled from the same posterior distribution. If you do this for every sample \((\mu, \sigma)\) from the posterior, for every weight value of interest, you end up with a collection of simulated heights that embody the uncertainty in the posterior as well as the uncertainty in the Gaussian likelihood.
sim.height <- sim(m4.3, data = list(weight = weight.seq))
str(sim.height)
## num [1:1000, 1:46] 134 139 131 131 123 ...
This matrix contains simulated heights.
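To see roughly what sim is doing under the hood, here is a manual sketch (an addition to the original notebook) that uses the posterior samples in post from above; note it draws nrow(post) = 10000 simulations per weight instead of sim’s default 1000:
# simulate heights by hand: for each weight, draw heights from a Gaussian whose
# mean (a + b*weight) and sd (sigma) come from the posterior samples
sim.height.manual <- sapply(weight.seq, function(w)
  rnorm(n = nrow(post), mean = post$a + post$b * w, sd = post$sigma))
str(sim.height.manual)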
We can summarize these simulated heights in the same way we summarized the distributions of \(\mu\), by using apply:
height.PI <- apply(sim.height, 2, PI, prob = 0.89)
Now height.PI contains the \(89\%\) posterior prediction interval of observable (according to the model) heights, across the values of weight in weight.seq.
Let’s plot everything we’ve built up:
# plot raw data
plot(height ~ weight, d2, col = col.alpha(rangi2, 0.5))
# draw MAP line
lines(weight.seq, mu.mean)
# draw HPDI region for line
shade(mu.HPDI, weight.seq, col = "red")
# draw PI region for simulated heights
shade(height.PI, weight.seq)
The wide shaded region in the figure represents the area within which the model expects to find \(89\%\) of actual heights in the population, at each weight. The red region is the \(89\%\) uncertainty interval for \(\mu|x\).
Notice that the outline of the wide shaded interval is a little jagged. This is the simulation variance in the tails of the sampled Gaussian values. We could reduce it by increasing the number of samples we take from the posterior distribution. With extreme percentiles, it can be very hard to get rid of all of the jaggedness. Luckily, it hardly matters, except for aesthetics.
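For example (a small addition, not shown in the original run), sim accepts an n argument, so we could draw more samples when simulating and recompute the interval:
# draw more posterior samples to smooth the boundary of the prediction interval
sim.height <- sim(m4.3, data = list(weight = weight.seq), n = 1e4)
height.PI <- apply(sim.height, 2, PI, prob = 0.89)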