Let’s make a regression. For this, we’ll need to load three components in RStudio:

  • the rethinking library, which we’ll use to build our model via quadratic approximation;
  • the tidyverse library, which we’ll use for data manipulation and visualization; and
  • the Howell1 data set, which ships with the rethinking package and lists biometric measurements from a sample of !Kung San people.
library(rethinking)
library(tidyverse)
data(Howell1)
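If you haven’t installed these packages yet, note that tidyverse comes from CRAN, while rethinking is installed from GitHub. Something like the following should work, but check the rethinking README for the current instructions (the dependency list below is an assumption based on that README and may change):

# One-time setup (sketch only; see https://github.com/rmcelreath/rethinking):
install.packages("tidyverse")
install.packages(c("coda", "mvtnorm", "devtools", "loo", "dagitty", "shape"))
devtools::install_github("rmcelreath/rethinking")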

Linear Model

Step 1: Set up our data

From this data set, we can use a linear regression model to predict height from weight. To simplify things, we’ll consider only adults for now.

Look at the code below.

# R programs are built from four types of components:
# - variables, like df, Howell1, and age
# - functions, like filter()
# - operators, like <-, |>, and >=
# - constants, like 18

df <- Howell1 |> filter( age >= 18 )

# We've defined a new variable called `df`.
# `df` is created by the assignment operator `<-`.
# It will contain whatever value is on the other side of the <- symbol.
# In the future, we can use that content just by using the variable.
#
# Here, the other side of <- has three parts:
# - the Howell1 variable, which we imported during setup above
# - the pipe operator |>
# - the function filter(), with an instruction `age >= 18`.
# Combined, they tell R to:
# - start with the content of Howell1,
# - pipe it into the filter function, and
# - send the result back to df.
#
# Note: The name of the variable we've created here is arbitrary.
# In R, "df" is commonly used as an abbreviation for 'data frame'. 
# Plus, it's nice and short.
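If you want to confirm that the filter did what we expect, here are a couple of optional checks (the row counts in the comments are what we anticipate, assuming the standard Howell1 data):

# Optional sanity checks on the filter:
nrow(Howell1)   # 544 rows in the full data set
nrow(df)        # 352 adults should remain after filtering
range(df$age)   # the minimum age should now be at least 18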

Next, let’s plot the data to see what we’re working with.

# We'll start with the filtered data `df` that we defined above
df |>
  # Then we pipe our data into ggplot with the mapping parameter we want
  # We need the `aes()` function to convert our mappings into a format ggplot can read
  # We'll map the weight variable in `df` to the x-axis, and height to the y-axis
  ggplot( mapping = aes(x=weight, y=height) ) +
    # We'll add a geom_point layer to make a scatter plot
    geom_point( color="purple" , shape="plus" ) +
    # and we'll add some labels for fun
    labs( 
      title = "Scatterplot (height ~ weight)",
      x = "Weight (kg)",
      y= "Height (cm)"
    )

Nice. Our data looks reasonable. There’s clearly a relationship here between height and weight. Let’s see if we can model it.

[Some Review]

So, how do we build a model? And what is a model?

If we’re unclear about these questions, we can return to our core Bayesian principles. That’s the key to all of this. However messy the coding or the probability theory might get, every analysis is just an extension of the same underlying logic. Even if we don’t understand every detail of how a statistical model works, we can understand what it is doing by connecting it back to this small set of basic ideas. Let’s review those ideas now.

All research – whether qualitative or quantitative – asks the same question:

Given the things I have observed, what can I say about the world?

In statistical inference, we ask that question in a more specific way:

Given the observations in my data set, what can I say about the parameters of my model?

Some terms:

  • A parameter is a variable whose value we do not know and cannot easily measure.
  • Our data is a collection of observations about variables that we do know because we have measured them.
  • A model is a theoretical account of the relationships among our variables.

Typically, we’ll use our model to establish a connection between our parameters and our data, estimating the variables we don’t know (our parameters) from the variables we do know (our data).

We can express this question formally as P( parameters | observation ); in English, it means “given the observations we have made, what is the probability that some particular set of parameter values is correct?”

In statistics, we only rarely deal with absolutes of true or false. More often, we’re comparing explanations that seem relatively more plausible with ones that are relatively less plausible. Randomness and probability, in this sense, are not facts about the world but rather ways of talking about our own uncertainty.

  • In Bayesian analysis, this probability expression P( parameters | observation ) is known as the posterior.

Understanding this posterior is the reason for all this work. It’s the thing that (we hope!) will tell us something about the world. With Bayes’ theorem, we can calculate our posterior from other values: posterior ∝ prior × likelihood.

  • Our priors are the set of assumptions we had about our parameters before we made our new observations.

Sometimes, especially if there’s been good research on the subject already, we’ll know a lot going in. Sometimes, we’ll know very little. Either way, we need to make our assumptions explicit from the start.

Typically, we’ll express our priors not as single values but rather as probability distributions. A probability distribution allows us to say “we don’t know exactly what this value is, but we believe that some possibilities are more likely to be correct than others”.

Formally, our priors can be represented as probability distributions in the form P( parameters ).
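For instance, the prior we’ll eventually use below for average height is Normal(178, 20). Here’s a quick sketch of what that distribution looks like as a curve:

# A minimal sketch: the Normal(178, 20) prior for average height, drawn as a curve.
tibble(height = seq(100, 250, length.out = 200)) |>
  mutate(density = dnorm(height, mean = 178, sd = 20)) |>
  ggplot(mapping = aes(x = height, y = density)) +
    geom_line() +
    labs(title = "A prior for average height: Normal(178, 20)", x = "Height (cm)", y = "Density")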

Next, we need to figure out our likelihood. This is where things can start to get very messy. Nearly all of the work we do in statistical modeling relates to figuring out how to calculate likelihood, but don’t let that rattle you. Conceptually, what likelihood expresses is very straightforward.

  • Likelihood is just counting. Specifically, it is counting the number of different ways that the thing we observed could have happened, according to our model. Things that can be produced more ways are more likely, and things that can be produced fewer ways are less likely.

In formal terms, likelihood is the inversion of our posterior: P( observation | parameters ). We’re asking, for each possible combination of parameter values, how likely were we to get the data we observed?
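To make that concrete, here’s a minimal sketch of a likelihood calculation using two made-up parameter combinations (155cm/5cm and 120cm/5cm are purely for illustration). In practice we sum log densities rather than multiplying raw probabilities, to avoid numerical underflow:

# A minimal sketch (made-up parameter values): how probable is our observed
# height data under a Normal(mu, sigma) model? We sum log densities because
# multiplying hundreds of small numbers quickly underflows to zero.
log_likelihood <- function(mu, sigma, heights) {
  sum(dnorm(heights, mean = mu, sd = sigma, log = TRUE))
}

log_likelihood(mu = 155, sigma = 5, heights = df$height)   # relatively plausible
log_likelihood(mu = 120, sigma = 5, heights = df$height)   # far less plausible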

To answer that question, we need to define our model.

Step 2: Define our model

Models are made of theoretical assumptions. When we define a model, we’re stating these assumptions explicitly. Specifically, we need to do two things: (1) define our priors, and (2) define a likelihood: a way to calculate how probable our observations are under any given combination of parameter values.

To start, let’s think about how height works. Within a population, height is both variable and random. Some people are taller, some people are shorter, and we don’t have any sure way to predict who will be what.

But, that doesn’t mean we don’t know anything. Based on prior research, we have reason to believe that height within a population tends to follow a normal distribution (at least roughly). A normal distribution always takes two parameters: a mean and a standard deviation. The mean tells us what the average value is, and the standard deviation tells us how much variation there is within the values.

We don’t know what the average height is nor how spread out height tends to be, but we can still make our theory here explicit as follows:

heightᵢ ~ Normal( μ , σ )

This just says, the height of any given individual is random, but not all possibilities are equally likely. Rather, we expect the height in our sample to have a normal distribution with a mean of μ and a standard deviation of σ. (Why μ and σ? These are the Greek letters mu and sigma, and their association with mean and standard deviation is just an old convention.)
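To get a feel for what a statement like this implies, here’s a quick sketch that simulates a population of heights from a normal distribution using placeholder values for μ and σ (160 and 10 are arbitrary, not estimates):

# A quick sketch with placeholder values (mu = 160, sigma = 10):
# simulate 10,000 heights and look at their distribution.
tibble(height = rnorm(1e4, mean = 160, sd = 10)) |>
  ggplot(mapping = aes(x = height)) +
    geom_histogram(binwidth = 2) +
    labs(title = "Simulated heights from Normal(160, 10)", x = "Height (cm)", y = "Count")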

As yet, we don’t know what this particular population’s mean or standard deviation is. If we wanted, we could treat both of these values as parameters, take a guess at priors, and gather likelihoods from there. But, that’s not actually our goal in this exercise. Rather, we’re looking to build a regression. That means that we’re trying to estimate height not for the whole population at once but rather as a function dependent on weight.

To do that, we need to change how we define μ. Instead of treating μ as a parameter with a single unknown value, we can propose that it’s actually calculated from an individual’s weight. There are countless ways we can define the relationship between height and weight, but in a linear regression we’ll always use some variation of the format y = a + bx. For our needs, this will work:

μᵢ = α + β(xᵢ − x̄)

More Greek letters, but we can unpack the meaning.

First, notice that we’re no longer talking about μ (the mean height of the entire population). Instead, we’re talking about μᵢ. That subscript i means that this is the predicted value for this individual. Put differently, μᵢ can be thought of as the “expected” height for a given individual based on their weight.

If the algebra above doesn’t make sense to you, don’t worry. All we’re doing is defining a line that maps each possible weight value to a predicted height. The values xᵢ and x̄ refer to an individual’s weight and the average weight in the sample, respectively. The parameter α is the average height of the population, and β is our “verb”: when weight goes up or down by 1 kilogram, we expect height to change by β centimeters.

Notice that μᵢ is no longer a parameter. It’s defined not by a distribution (~) but by an equation (=). We have broken μᵢ down into two other parameters (α and β).
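As a concrete (and entirely hypothetical) example of the arithmetic: if α were 155cm, β were 0.9, and the sample’s mean weight were 45kg, the expected height of a 50kg individual would be:

# Hypothetical numbers, just to illustrate the arithmetic of the linear model:
155 + 0.9 * (50 - 45)   # = 159.5 cm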

We now have three parameters in our model:

  • σ (the standard deviation of height in our population),
  • α (the average height in our population), and
  • β (the coefficient of weight on height)

Next, we’ll set prior estimates for these three parameters. Once we’ve done that, our model is now complete:

heightᵢ ~ Normal( μᵢ , σ )
μᵢ = α + β(xᵢ − x̄)
α ~ Normal( 178 , 20 )
β ~ Normal( 0 , 10 )
σ ~ Uniform( 0 , 50 )

It’s worth noting that these aren’t great priors. In Bayesian terms, we can say that they are relatively “uninformed”. That just means that they’re very wide guesses. This is visible in our very large standard deviation values. In a normal distribution, ±1 sd from the mean covers about 68% of the probability, and ±2 sd from the mean covers about 95% (there’s a quick check of these numbers after the list below). Here, with standard deviations so big, we’re saying we expect some pretty extreme possibilities to be common. Specifically:

  • For mean height (α), we’ve said that we think the correct parameter value is very likely (about 95% sure) between 138cm and 218cm, centered at 178cm. That’s not unreasonable, but it’s a big range, and it seems unlikely that the average person is really 178cm tall. We could have picked a lot of other values with just as much certainty.
  • For β, we’ve guessed that each additional kilo will most likely (95%) be linked to a change in height between -20 and 20 centimeters. This is a massive range. Furthermore, we probably could have ruled out all the negative values, since it is hard to imagine that people who weigh more will tend to be shorter than people who weigh less. We could use something other than a normal distribution to eliminate negative values, but we’ll stick with this for simplicity.
  • For the population’s standard deviation (σ), we’ve assigned a uniform distribution. We’re saying we believe the correct value is somewhere between 0 and 50, but within those bounds we don’t think any particular possibility is more likely than any other. This is a huge range, allowing for possibilities that are completely implausible. Using this prior is nearly the same as saying we know nothing at all.
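Here’s the quick check of those coverage numbers promised above:

# How much probability sits within 1 and 2 standard deviations of the mean?
pnorm(1) - pnorm(-1)   # ~0.683, i.e. about 68%
pnorm(2) - pnorm(-2)   # ~0.954, i.e. about 95%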

If we did a bit of desk research, we could likely come up with much better priors. But, as we’ll see, these are probably good enough for our purposes here.
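One way to judge priors like these before touching the data is to simulate regression lines directly from them, a “prior predictive” check. Here’s a minimal sketch (the 50 lines and the 30–65kg weight range are arbitrary choices):

# A minimal prior predictive sketch: draw (alpha, beta) pairs from the priors
# above and plot the straight lines they imply across a plausible weight range.
set.seed(1)
prior_draws <- tibble(
  id    = 1:50,
  alpha = rnorm(50, mean = 178, sd = 20),
  beta  = rnorm(50, mean = 0, sd = 10)
)

prior_draws |>
  crossing(weight = c(30, 65)) |>   # two endpoints are enough to draw a straight line
  mutate(height = alpha + beta * (weight - mean(df$weight))) |>
  ggplot(mapping = aes(x = weight, y = height, group = id)) +
    geom_line(alpha = 0.3) +
    labs(title = "Regression lines implied by our priors", x = "Weight (kg)", y = "Height (cm)")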

If this process of model definition feels uncomfortably arbitrary, good. The fact that models are math-y doesn’t make them objective. They are, like all theoretical arguments, subjective constructions built from our intuitions about the world. Here, the stakes are pretty low, but we can imagine how complicated this process can get once we start dealing with more charged matters like power, poverty, and identity.

Anyway, now what?

We have the two things we needed: priors and a definition of our model’s likelihood function. What we really want, though, is our posterior probabilities. How do we get those?

We’ve seen a few answers to this question already, including the garden of forking paths and grid approximation. Now it’s time to move on to a new method called “quadratic approximation”. The math behind quadratic approximation is beyond our scope here, but we don’t need to understand how it works to use it effectively. We just need to understand that it fundamentally does the same thing that our path counting and our grid approximations did.

In R, quadratic approximation is implemented in the rethinking package by the function quap:

# There's a fair bit going on here, so let's unpack it.

# To define our linear model, we'll need the mean weight of our sample.
# We'll calculate this value and assign it to a new variable called `xbar`
xbar <- mean(df$weight)

# Then, we will run `quap` and save the output to a variable we've called `model`.
# The `quap` function takes two parameters:
# - (1) the list of relationships that define our model
# - (2) our filtered version of the Howell1 data set.
# That's all we need. `quap` will do all the heavy lifting behind the scenes
model <- quap(
  alist(
    height ~ dnorm(mu , sigma) ,       # likelihood function
    mu <- a + b * (weight - xbar) ,    # linear model
    a ~ dnorm(178 , 20) ,              # parameter (mean height of population)
    b ~ dnorm(0 , 10) ,                # parameter (coefficient of weight)
    sigma ~ dunif(0 , 50)              # parameter (standard deviation of height)
  ) ,
  data = df
)

# To see our results, we'll run `precis` on the model we saved.
# This gives us a tidy summary of our posterior values.
precis(model)
##              mean         sd        5.5%       94.5%
## a     154.6013675 0.27030697 154.1693647 155.0333702
## b       0.9050132 0.04192754   0.8380049   0.9720215
## sigma   5.0718680 0.19115333   4.7663680   5.3773679

So what did we find?

  • For our parameter a (α), we’re quite confident that the correct value is between 154 and 155. This is, we believe, the average height of the population we’ve sampled.
  • For b (β), we think the likeliest value is about 0.9. This means that for every kilogram more or less an individual is, we expect their height to change by 0.9cm.
  • For sigma (σ), we now believe that the likeliest standard deviation of the population’s height is about 5cm. Practically, that means we expect about 95% of people to be within 10cm of the height our linear model predicts for them.

Now, take some time to play around with quap. Try changing the priors to see what happens. As long as you don’t do anything extreme, you might notice that the priors here don’t have much effect on our posteriors. That’s because we have a large sample size here. Whatever we thought before, our data is robust enough to really change what we believe. If our sample size had been a lot smaller, our priors would have played a much larger role in our posterior results. This is a common dynamic in statistical inference.
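As one example of the kind of experiment you might try (the specific numbers are arbitrary), here’s the same model with a much narrower, and deliberately mis-centered, prior on a:

# A sketch of one experiment: shrink the prior on `a` and shift its center.
# With this much data, the posterior estimates should barely move.
model_tight <- quap(
  alist(
    height ~ dnorm(mu , sigma) ,
    mu <- a + b * (weight - xbar) ,
    a ~ dnorm(150 , 5) ,               # was dnorm(178, 20)
    b ~ dnorm(0 , 10) ,
    sigma ~ dunif(0 , 50)
  ) ,
  data = df
)
precis(model_tight)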

Step 3: Priors and posteriors

Before we try to visualize our regression, it’s worth taking a moment to compare the probability distributions for our priors and posteriors. I’m not going to get into the code presented here in too much detail, but I would like to point out one thing about how we represent our prior and posterior distributions.

What we don’t do is use calculus or some other technique to define an analytic equation that represents our distribution mathematically. That’s tedious in simple cases, impossible in complicated cases, and not at all necessary for what we are trying to accomplish.

Instead, we define our prior and posterior distributions by sampling from our model. This is what quap helps us to do. The variable in the code below called posterior, for example, is just a giant list of 10,000 different possible values for each of our three parameters. Run the code and see for yourself. Every time we run this code, we’ll get a slightly different result. That’s okay. Because our priors and posteriors are both probability distributions, more likely values will show up more often and less likely values will show up less often. This is a simple but very effective “brute force” method.

In the output of the code below, what do you notice? For each parameter (with sigma’s prior as a bit of an exception), our prior and our posterior estimates actually look pretty similar. There is a very important difference, though. Can you spot it?

prior <- extract.prior( model , n=1e4) |> as.data.frame()
posterior <- extract.samples( model )

rbind(
  prior |> mutate(type = 'prior'),
  posterior |> mutate(type = 'posterior')
) |>
  mutate(type = fct_relevel(type, "prior", "posterior")) |>
  pivot_longer(cols=c(a,b,sigma), names_to='parameter') |>
  ggplot() + 
    geom_density(mapping = aes(x = value, color = type)) +
    facet_wrap(vars(parameter, type), nrow=3, ncol=2, scales='free') +
    labs(title="Prior/Posterior visualization")

Step 4: Visualize our model

Finally, we’ve reached the end! Congratulations on making it this far. We’ve defined our model, including our priors and our likelihood function. We’ve applied this model to our observed data to create a posterior. We’ve done some visualization, and we’ve sanity-checked our results. Now, all that’s left is to visualize our regression.

Our approach will have two steps:

  • First, we’re going to create a new table called summary that will highlight some key properties of our posterior distribution.
  • Next, we’ll feed this summary into ggplot (along with our original scatterplot data) to create a lovely little graph of our findings.

We’re calling this a “summary” of our model because that’s what it is. Our model’s output is the full probability distribution of the posteriors of all three of our parameters. We’ve graphed those distributions in the section above, but it’s possible that we can present this information in a more useful way for our audience.

The scatter plot we created at the start of this tutorial offered an intuitive way to understand our observation data, so let’s superimpose our analysis onto that. Specifically, let’s create:

  1. A line showing our model’s most likely value of height for each weight;
  2. A band highlighting our uncertainty about that expected height, for some credibility interval; and
  3. A band highlighting the range of individual variation we expect to actually observe, for some prediction interval.

The code below is a bit complex, so let’s walk through it.

# To start, we're going to create a new summary table with tibble()
# In that table, we're just going to create a single column called `weight.seq`
# This column will have evenly spaced values covering the range we're interested in
# We'll use 30 to 65 because it covers the range in our data set (31.07 to 62.99)
summary <- tibble(
  weight.seq = seq( from=30 , to=65 )
) |>
  #Next, we'll use the `mutate` function to add additional columns to our table
  mutate(
    # The first new column we'll make is `mu` (same as in our model)
    #
    # In our posterior sample, we have 10k values for `a` and 10k values for `b`.
    # We'll use these values in our linear model to generate 
    # 10,000 possible values of `mu` *for each value in our weight sequence*.
    #
    # So, we'll generate 10,000 estimates of `mu` for 30kg, 
    # another 10,000 estimates for 31kg, 
    # another 10,000 for 32kg, 
    # another 10,000 for 33kg, 
    # and so on, all the way up to 65kg.
    #
    # For each set of 10,000 estimates, we'll generate a mean
    # and also generate bounds for a 96% interval (the middle 9600 cases)
    mu = map(weight.seq, \(x) posterior$a + posterior$b * (x - xbar)),
    mu.mean = map_dbl(mu, mean),
    mu.ci.min = map_dbl(mu, quantile, probs=0.02),
    mu.ci.max = map_dbl(mu, quantile, probs=0.98),
    
    # We'll also generate a column called "sim", short for simulation.
    # This is calculated in a very similar way to `mu`,
    # but it represents something different (more below)
    # We'll also create a 96% interval
    sim = map(weight.seq, \(x) rnorm(
      n=nrow(posterior),
      mean=posterior$a + posterior$b * (x - xbar),
      sd=posterior$sigma
    )),
    sim.pi.min = map_dbl(sim, quantile, probs=0.02),
    sim.pi.max = map_dbl(sim, quantile, probs=0.98)
  )

It is important to understand what these different values represent. Our mu estimates aren’t a prediction of what individual heights will look like. Instead, they represent our degree of certainty about the average height we expect at each weight. To generate predictions about what an actual population will look like, we have to simulate that.

The sim column we’ve created does exactly that. Using the same means we calculated for mu, together with our model’s posterior standard deviation (sigma), we use the rnorm function in R to simulate 10,000 individuals from our model for each value in our weight sequence. Then, just like above, we create a prediction interval covering the middle 9,600 (once again ignoring the top 2% and the bottom 2%).
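If the distinction still feels abstract, here’s a minimal sketch comparing the two kinds of interval at a single, arbitrarily chosen weight of 50kg:

# A minimal sketch at one arbitrary weight (50 kg):
# mu_at_50  -> our uncertainty about the *average* height of 50 kg individuals
# sim_at_50 -> the spread of *individual* heights we expect to observe at 50 kg
mu_at_50  <- posterior$a + posterior$b * (50 - xbar)
sim_at_50 <- rnorm(n=nrow(posterior), mean=mu_at_50, sd=posterior$sigma)

quantile(mu_at_50,  probs=c(0.02, 0.98))   # narrow: the mean is pinned down well
quantile(sim_at_50, probs=c(0.02, 0.98))   # much wider: individuals vary a lot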

You can see the results in the table below. (Note: The values for mu and sim aren’t displayed here because each cell in those columns contains a list of 10,000 different values, which would be impossible to print in a readable way.)

[Table: Our Regression Summary]

Now, let’s combine our original scatter plot with this new summary. I won’t try to explain everything in the code below, but feel free to explore and play around. What’s important here is that we’re asking ggplot to create four visual (“geometry”) layers: two geom_ribbon layers to show our mu.ci and sim.pi ranges, a geom_line to show our mu.mean values, and a geom_point scatter plot to show our actual sample data.

Take a look at the graph. What do you see?

summary |> 
  ggplot() +
    geom_ribbon(
      mapping=aes(x=weight.seq, ymin=sim.pi.min, ymax=sim.pi.max), 
      fill="lightgray", alpha=0.6) +
    geom_ribbon(
      mapping=aes(x=weight.seq, ymin=mu.ci.min, ymax=mu.ci.max), 
      fill="gray", alpha=0.8) +
    geom_line(mapping=aes(x = weight.seq, y = mu.mean)) +
    geom_point(
      data=df, 
      mapping=aes(x = weight, y = height), 
      color="purple", shape="plus", alpha=0.6) +
    labs(
      title="Linear Regression of height by weight",
      x="Weight (kg)",
      y="Height (cm)"
    )

Well, that’s it! We’ve covered a lot in this demo, and for the first time we’re digging into some real analysis. Predicting height from weight isn’t that interesting, perhaps, but the work we’ve done lays the foundation for any kind of regression we want to do. Once we all feel comfortable, we’ll move on to some data sets where the interpretation isn’t so obvious.

Cubic Model (Bonus)

I’m not going to get into this code here, but I wanted to share a bit of analysis that demonstrates some more advanced techniques. If we include the children from the Howell1 data set, our linear model starts to look a bit dodgy. A cubic model fits much better, and we can create that simply by changing the definition of our model very slightly.

There’s one more change here related to how the weight variable is handled. I’ll leave it to those who are starting to feel a bit more comfortable with the coding to try to decipher what’s happening there. Feel free to ask questions!

df <- Howell1 |>
  mutate(weight_s = ( weight - mean(weight) ) / sd(weight) ) |>
  mutate(weight_s2 = weight_s^2) |>
  mutate(weight_s3 = weight_s^3)

model <- quap(
  alist(
    height ~ dnorm(mu , sigma) ,
    mu <- a + b1*weight_s + b2*weight_s2  + b3*weight_s3,
    a ~ dnorm(178 , 20) ,
    sigma ~ dunif(0 , 50) ,
    b1 ~ dlnorm( 0 , 1 ) , 
    b2 ~ dnorm( 0 , 1 ) ,
    b3 ~ dnorm( 0, 1 )
  ) ,
  data = df
)
precis(model)
##             mean        sd       5.5%      94.5%
## a     146.394390 0.3099942 145.898959 146.889821
## sigma   4.829994 0.1469506   4.595138   5.064849
## b1     15.220005 0.4762755  14.458825  15.981185
## b2     -6.202628 0.2571633  -6.613624  -5.791631
## b3      3.583239 0.2287781   3.217608   3.948871
prior <- extract.prior( model ) |> as.data.frame()
posterior <- extract.samples( model )

summary <- tibble(
  weight.seq = seq( from=-2.2 , to=2 , length.out=30 )
) |>
  mutate(
    mu = map(
      weight.seq, 
      \(x) posterior$a + posterior$b1 * x + posterior$b2 * x^2 + posterior$b3 * x^3
    ),
    mu.mean = map_dbl(mu, mean),
    mu.ci.min = map_dbl(mu, quantile, probs=0.02),
    mu.ci.max = map_dbl(mu, quantile, probs=0.98),
    sim = map(weight.seq, \(x) rnorm(
      n=nrow(posterior),
      mean=posterior$a + posterior$b1 * x + posterior$b2 * x^2 + posterior$b3 * x^3,
      sd=posterior$sigma
    )),
    sim.pi.min = map_dbl(sim, quantile, probs=0.02),
    sim.pi.max = map_dbl(sim, quantile, probs=0.98)
  )

summary |> 
  mutate(weight.seq = mean(df$weight) + sd(df$weight) * weight.seq) |>
  ggplot() +
    geom_ribbon(
      mapping=aes(x=weight.seq, ymin=sim.pi.min, ymax=sim.pi.max), 
      fill="lightgray", alpha=0.6) +
    geom_ribbon(
      mapping=aes(x=weight.seq, ymin=mu.ci.min, ymax=mu.ci.max), 
      fill="gray", alpha=0.8) +
    geom_line(mapping=aes(x = weight.seq, y = mu.mean)) +
    geom_point(
      data=df, 
      mapping=aes(x = weight, y = height), 
      color="purple", shape="plus", alpha=0.6) +
    labs(
      title="Cubic Regression of height by weight",
      x="Weight (kg)",
      y="Height (cm)"
    )