library(rethinking)
library(tidyverse)
data(Howell1)
df <- Howell1 |>
filter(age >= 18) |>
mutate(
height_std = standardize(height),
age_std = standardize(age),
sex = ifelse(male == 0, "female", "male") |> as_factor()
)

Models step-by-step
Models define relationships between parameters. We can use models to estimate the values of parameters that interest us by applying them to data we’ve observed.
To understand how this works, let’s build a complex model from simple parts.
Step 1: Parameter estimate
The simplest model we can define is just a single parameter estimate.
In the Howell1 data set, we have a sample of biometric data from individuals in a !Kung San community. If we think our sample is representative, we can use this data to infer characteristics of the population as a whole.
For that, we need a model:
\[ \begin{align} \text{weight}_i &\sim \text{Normal}(\mu, \sigma) \\ \mu &\sim \text{Normal}(50, 10) \\ \sigma &\sim \text{Uniform}(0, 50) \end{align} \]
This model is very simple. It describes how we think weight works.
- It has a likelihood function that accounts for random (i.e., unexplained) variation.
- This likelihood function assumes that weight will be normally distributed.
- This normal distribution is defined by two parameters:
  - a mean μ (mu), and
  - a standard deviation σ (sigma).
- For each of these parameters, we have defined a prior probability distribution, based on some very loose assumptions (one quick way to check those assumptions is sketched below).
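A prior predictive simulation draws parameters from the priors and checks whether the implied weights look biologically plausible. A minimal sketch, not part of the model fit itself (dens() is rethinking's density plot):

# Prior predictive check: simulate weights implied by the priors alone.
set.seed(101)
prior_mu     <- rnorm(1e4, 50, 10)   # mu ~ Normal(50, 10)
prior_sigma  <- runif(1e4, 0, 50)    # sigma ~ Uniform(0, 50)
prior_weight <- rnorm(1e4, prior_mu, prior_sigma)
dens(prior_weight)                   # most mass should be at plausible weights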
From this definition, ulam will calculate our posterior probability distribution.
set.seed(101)
model <- ulam(
alist(
weight ~ normal(mu, sigma),
mu ~ normal(50, 10),
sigma ~ uniform(0, 50)
),
data = df
)

and then…

precis(model, prob = 0.95)

           mean        sd      2.5%     97.5%     rhat ess_bulk
mu    44.964487 0.3588423 44.216600 45.646569 1.001590 388.2219
sigma  6.458195 0.2365766  6.022862  6.954235 1.027775 350.1732
If we trust our assumptions, we can be 95% sure that the real average weight is between 44.22 and 45.65 kg, with a standard deviation between 6.02 and 6.95 kg. This doesn't tell us much beyond what we already knew from looking at the raw data, but it's a perfectly valid model.
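Those interval bounds are just quantiles of the posterior samples. A small sketch showing how to compute them by hand with rethinking's extract.samples() and PI():

posterior <- extract.samples(model)
PI(posterior$mu, prob = 0.95)      # ~44.22 to ~45.65
PI(posterior$sigma, prob = 0.95)   # ~6.02 to ~6.95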
Visually, here’s what that looks like:
posterior <- extract.samples(model)
ggplot(df) +
geom_jitter(
data = df,
aes(x = 0, y = weight),
shape = "cross",
color = "purple"
) +
geom_hline(
yintercept = mean(posterior$mu),
color = "red",
linetype = "dashed"
) +
xlim(-2, 2)

Step 2: Linear regression
Let’s make things more interesting by trying to predict weight from height. It’s reasonable to think that height and weight are related, which means we should be able to explain more of the random variation in weight by paying attention to height than we could by looking at weight alone.
To create a regression, we need to define one of our parameters in terms of our other variables. That definition is called a linear predictor.
We’ll start with our original model…
\[ \begin{align} \text{weight}_i &\sim \text{Normal}(\mu, \sigma) \\ \mu &\sim \text{Normal}(50, 10) \\ \sigma &\sim \text{Uniform}(0, 50) \end{align} \]
…but, instead of treating mu as a parameter, we’ll redefine it as a prediction built from a combination of other parameters (α and β) and observed data (heightᵢ).
\[ \begin{align} \text{weight}_i &\sim \text{Normal}(\mu_i, \sigma) \\ \mu_i &= \alpha + \beta\times \text{height}_i \\ \alpha &\sim \text{Normal}(50, 10) \\ \beta &\sim \text{Normal}(0, 10) \\ \sigma &\sim \text{Uniform}(0, 50) \end{align} \]
Now we have three parameters. Sigma (σ) remains the same as before, but mu (μ) has been redefined.
In our first model, we had two parameters (μ and σ) and one observed variable (weightᵢ). In this expanded model, we have three parameters (α, β, and σ) and two observed variables (heightᵢ and weightᵢ).
set.seed(101)
model <- ulam(
alist(
weight ~ normal(mu, sigma),
mu <- a + b * height_std,
a ~ normal(50, 10),
b ~ normal(0, 10),
sigma ~ uniform(0, 50)
),
data = df
)

and then…

precis(model, prob = 0.95)

           mean        sd      2.5%     97.5%      rhat ess_bulk
a     45.008302 0.2302654 44.568423 45.460216 0.9998891 564.7267
b      4.858297 0.2194084  4.427990  5.285442 1.0049412 472.3716
sigma  4.246329 0.1531567  3.967734  4.547446 1.0101649 344.9605
Our posterior for a is very similar to the value we found for mu in our previous model, as expected. In addition, we’ve got a value for our new parameter b: when height changes by a standard deviation, our model predicts that weight will change on average by 4.86 kilograms.
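One caveat: because we standardized height, b is measured in kilograms per standard deviation of height. A quick sketch to convert it to kilograms per centimeter, assuming df still holds the raw height column from the setup above:

posterior <- extract.samples(model)
mean(posterior$b) / sd(df$height)  # average slope in kg per cm of height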
Also interesting is the fact that our sigma posterior has decreased from 6.02-6.95 (95%) to 3.97-4.55 (95%). This is important! The sigma represents the amount of variation that our model isn’t able to explain. Our height regression explains the data better than our simple parameter estimate model did, so the sigma has gotten smaller.
Visually, we can plot our regression line like this:
posterior <- extract.samples(model)
df |>
rowwise() |>
mutate(
mu = mean(posterior$a + posterior$b * height_std),
mu_min = quantile(posterior$a + posterior$b * height_std, 0.025),
mu_max = quantile(posterior$a + posterior$b * height_std, 0.975),
) |>
ggplot(aes(x = height_std, y = weight)) +
geom_point(color = "purple", shape = "plus") +
geom_line(aes(y = mu)) +
geom_ribbon(aes(ymin = mu_min, ymax = mu_max), alpha = 0.1)

Step 3: …add a second variable
Now let’s add a second variable. In addition to height, let’s see if age has a separate effect on weight.
To do that, we just extend our previous model with a new parameter:
\[ \begin{align} \text{weight}_i &\sim \text{Normal}(\mu_i, \sigma) \\ \mu_i &= \alpha + \beta_H \times \text{height}_i \color{green}+ \beta_A \times \text{age}_i \\ \alpha &\sim \text{Normal}(50, 10) \\ \beta_H &\sim \text{Normal}(0, 10) \\ {\color{green}\beta_A} &\sim{\color{green} \text{Normal}(0, 10)} \\ \sigma &\sim \text{Uniform}(0, 50) \end{align} \]
set.seed(101)
model <- ulam(
alist(
weight ~ normal(mu, sigma),
mu <- a + bH * height_std + bA * age_std,
a ~ normal(50, 10),
bH ~ normal(0, 10),
bA ~ normal(0, 10),
sigma ~ uniform(0, 50)
),
data = df
)

and then…

precis(model, prob = 0.95)

            mean        sd      2.5%      97.5%      rhat ess_bulk
a     45.0063333 0.2332397 44.537425 45.4432791 0.9990798 708.1853
bH     4.8074260 0.2235092  4.378068  5.2530347 1.0000991 759.2218
bA    -0.6256028 0.2445704 -1.112845 -0.1637151 1.0003396 639.5563
sigma  4.2086974 0.1683824  3.890773  4.5289087 1.0019273 734.6700
It looks like weight tends to go down with age, but the effect is much smaller than the effect of height. Since we have two predictor variables now, we can’t create a simple scatter plot like before, but we can understand the effects from the numbers directly.
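One way to see the age effect anyway is a counterfactual prediction: hold height at its mean and vary age. A sketch using rethinking's link(), which computes mu for new data from the posterior samples:

# height_std = 0 is average height; age_std of -1, 0, 1 spans +/- 1 SD of age.
new_data <- data.frame(height_std = 0, age_std = c(-1, 0, 1))
mu <- link(model, data = new_data)
apply(mu, 2, mean)             # predicted mean weight at each age
apply(mu, 2, PI, prob = 0.95)  # with 95% intervals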
Step 4: …add a categorical variable
This time, instead of age, let’s add a categorical variable to the mix.
There are a few ways to approach this, but often the most elegant is to use an approach called “indexing”. With indexing, we create multiple versions of a parameter, corresponding to the possible values of our categorical variable.
Look at this example: \[ \begin{align} \text{weight}_i &\sim \text{Normal}(\mu_i, \sigma) \\ \mu_i &= \alpha_{\color{green}sex[i]} + \beta\times \text{height}_i \\ \alpha_{\color{green}sex[i]} &\sim \text{Normal}(50, 10) \\ \beta &\sim \text{Normal}(0, 10) \\ \sigma &\sim \text{Uniform}(0, 50) \end{align} \]
This says that we actually have two different versions of our alpha parameter: one used for males and one used for females.
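Which index corresponds to which sex depends on the factor's level order (as_factor() assigns levels in the order they first appear in the data). It's worth a quick check before interpreting a[1] and a[2]:

levels(df$sex)  # level 1 maps to a[1], level 2 to a[2]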
set.seed(101)
model <- ulam(
alist(
weight ~ normal(mu, sigma),
mu <- a[sex] + b * height_std,
a[sex] ~ normal(50, 10),
b ~ normal(0, 10),
sigma ~ uniform(0, 50)
),
data = df
)

and then…

precis(model, prob = 0.95, depth = 2)

           mean        sd      2.5%     97.5%      rhat  ess_bulk
a[1]  44.950999 0.3972514 44.196660 45.768317 1.0115343 215.74544
a[2]  45.041006 0.3504517 44.315962 45.726883 1.0374340  44.78828
b      4.907916 0.3143433  4.282036  5.491431 1.0258512  96.18636
sigma  4.252760 0.1531087  3.974815  4.583011 0.9990955 364.64369
We can see that we have four parameters now: our b and sigma like before, but now also two versions of a: a[1] (male) and a[2] (female). This seems to indicate that there is very little difference in the value of a between men and women. That might seem strange, since we know that men tend to be physically larger than women. To understand this better, let’s plot it:
posterior <- extract.samples(model)
df |>
rowwise() |>
mutate(
mu = mean(
posterior$a[, as.integer(sex)] +
posterior$b * height_std
),
mu_min = quantile(
posterior$a[, as.integer(sex)] +
posterior$b * height_std,
0.025
),
mu_max = quantile(
posterior$a[, as.integer(sex)] +
posterior$b * height_std,
0.975
),
) |>
ggplot(aes(x = height_std, y = weight, color = sex)) +
geom_point(shape = "plus") +
geom_line(aes(y = mu, group = sex)) +
geom_ribbon(
aes(ymin = mu_min, ymax = mu_max, fill = sex, color = NULL),
alpha = 0.2
)

Men are definitely heavier than women on average, but the very small difference in a values suggests that the effect is indirect. Men are typically heavier than women, but it’s not because they’re men; the difference is fully explained by the differences in average height. This is a really important feature of regressions: we can separate out the effects of different variables to understand which one actually predicts the variation. Here, it’s not sex but height.
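We can put a number on "very little difference" by computing the posterior distribution of the contrast between the two intercepts. A minimal sketch, assuming a[1] is male and a[2] is female per the factor levels:

posterior <- extract.samples(model)
contrast_a <- posterior$a[, 1] - posterior$a[, 2]  # male minus female intercept
precis(data.frame(contrast_a), prob = 0.95)        # interval straddles zero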
Step 5: …add interactions
Let’s extend our categorical variable even further. Perhaps sex doesn’t just affect average weight; perhaps it also affects how height relates to weight. To accomplish this, we add our sex index to our height coefficient as well:
\[ \begin{align} \text{weight}_i &\sim \text{Normal}(\mu_i, \sigma) \\ \mu_i &= \alpha_{sex[i]} + \beta_{\color{green}sex[i]} \times \text{height}_i \\ \alpha_{sex[i]} &\sim \text{Normal}(50, 10) \\ \beta_{\color{green}sex[i]} &\sim \text{Normal}(0, 10) \\ \sigma &\sim \text{Uniform}(0, 50) \end{align} \]
And in R:
set.seed(101)
model <- ulam(
alist(
weight ~ normal(mu, sigma),
mu <- a[sex] + b[sex] * height_std,
a[sex] ~ normal(50, 10),
b[sex] ~ normal(0, 10),
sigma ~ uniform(0, 50)
),
data = df
)

and then…

precis(model, prob = 0.95, depth = 2)

           mean        sd      2.5%     97.5%      rhat ess_bulk
a[1]  45.040130 0.4534546 44.073248 45.913876 0.9987345 383.3840
a[2]  45.147611 0.4317244 44.314936 45.984924 0.9998939 325.3868
b[1]   4.772401 0.4239174  3.967369  5.613861 1.0006380 375.7449
b[2]   5.059902 0.4687931  4.127644  5.924877 1.0051617 280.7667
sigma  4.256292 0.1554622  3.978444  4.555664 1.0012784 490.8568
Now we have five parameters: sigma, like always, but also two versions of a corresponding to male and female, and two versions of b corresponding to male and female. There’s still not much difference in a, but there’s a bit more difference in b. It looks like height actually has a moderately larger effect on weight for women than it does for men.
Is this enough of a difference to be interesting? Possibly, but quite possibly not. Remember that our parameters are never single values but probability distributions. If we compare the mean values, it looks like there’s some difference between the two, but if we look at our 95% credible intervals, it’s less clear. For men, that interval is 3.97-5.61, and for women it’s 4.13-5.92. That’s a lot of overlap.
Bottom line: there’s a substantial chance that these values are not actually different in the real world.
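Rather than eyeballing overlapping intervals, it’s better to compute the contrast between the two slopes directly from the posterior samples. A sketch, with the same index convention as above:

posterior <- extract.samples(model)
contrast_b <- posterior$b[, 2] - posterior$b[, 1]  # female minus male slope
precis(data.frame(contrast_b), prob = 0.95)
mean(contrast_b > 0)  # posterior probability that the female slope is larger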
posterior <- extract.samples(model)
df |>
rowwise() |>
mutate(
mu = mean(
posterior$a[, as.integer(sex)] +
posterior$b[, as.integer(sex)] * height_std
),
mu_min = quantile(
posterior$a[, as.integer(sex)] +
posterior$b[, as.integer(sex)] * height_std,
0.025
),
mu_max = quantile(
posterior$a[, as.integer(sex)] +
posterior$b[, as.integer(sex)] * height_std,
0.975
),
) |>
ggplot(aes(x = height_std, y = weight, color = sex)) +
geom_point(shape = "plus") +
geom_line(aes(y = mu, group = sex)) +
geom_ribbon(
aes(ymin = mu_min, ymax = mu_max, fill = sex, color = NULL),
alpha = 0.2
)

Step 6: Logistic regression
Setup
library(tidybayes)
library(tidybayes.rethinking)
data(UCBadmit)
df <- UCBadmit |>
select(
admits = admit,
applications = applications,
gender = applicant.gender,
dept = dept
)

Version 1
\[ \begin{align} \text{admits}_i &\sim \text{Binomial}(\text{applications}_i, p_i) \\ \text{logit}(p_i) &= \alpha_{gender[i]} \\ \alpha_{j} &\sim \text{Normal}(0, 1) \end{align} \]
set.seed(101)
model1 <- ulam(
alist(
admits ~ binomial(applications, p),
logit(p) <- a[gender],
a[gender] ~ normal(0, 1)
),
data = df
)

precis(model1, depth = 2)

           mean         sd       5.5%      94.5%     rhat ess_bulk
a[1] -0.8287053 0.05148636 -0.9105779 -0.7522498 1.003775 350.2840
a[2] -0.2199847 0.03612165 -0.2751070 -0.1578954 1.001938 353.1245
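These a values are on the log-odds scale; rethinking’s inv_logit() maps them back to admission probabilities (roughly 0.30 for index 1 and 0.45 for index 2 here; which index is which gender follows the factor levels of gender). A sketch:

post <- extract.samples(model1)
apply(inv_logit(post$a), 2, mean)  # mean admission probability per gender index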
Version 2
\[ \begin{align} \text{admits}_i &\sim \text{Binomial}(\text{applications}_i, p_i) \\ \text{logit}(p_i) &= \alpha_{(gender[i],dept[i])} \\ \alpha_{(j,k)} &\sim \text{Normal}(0, 1) \end{align} \]
set.seed(101)
model2 <- ulam(
alist(
admits ~ binomial(applications, p),
logit(p) <- a[gender, dept],
matrix[gender, dept]:a ~ normal(0, 1)
),
data = df
)

precis(model2, depth = 3)

             mean         sd        5.5%      94.5%      rhat ess_bulk
a[1,1] 1.4705355 0.26555147 1.07462659 1.9004895 1.0010981 707.6872
a[2,1] 0.4869733 0.07283152 0.37198367 0.6050517 0.9985599 739.1043
a[1,2] 0.6308035 0.40991415 -0.02780612 1.2841253 0.9985723 550.2971
a[2,2] 0.5226522 0.08488525 0.38869063 0.6559704 0.9984990 786.2539
a[1,3] -0.6588740 0.08712420 -0.79616406 -0.5253976 0.9985570 551.1637
a[2,3] -0.5348634 0.11361395 -0.72090596 -0.3576598 1.0020146 606.0370
a[1,4] -0.6173820 0.10950972 -0.79924161 -0.4374970 1.0009823 785.4720
a[2,4] -0.7007157 0.10254216 -0.87168024 -0.5391900 1.0021981 504.3807
a[1,5] -1.1476876 0.11311679 -1.33419340 -0.9734511 0.9981301 748.9539
a[2,5] -0.9324493 0.15283201 -1.16657739 -0.6944724 1.0025646 528.7649
a[1,6] -2.4897135 0.19669078 -2.81859585 -2.2027468 0.9989111 493.8628
a[2,6] -2.6736262 0.20578594 -2.99045092 -2.3490099 1.0081540 535.9687
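To compare genders within each department, we can take contrasts on the probability scale. A sketch, assuming extract.samples() returns the matrix parameter as a samples × gender × dept array:

post <- extract.samples(model2)
p_diff <- inv_logit(post$a[, 1, ]) - inv_logit(post$a[, 2, ])  # index 1 minus 2
apply(p_diff, 2, mean)  # difference in admission probability, per department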
Version 3
\[ \begin{align} \text{admits}_i &\sim \text{Binomial}(\text{applications}_i, p_i) \\ \text{logit}(p_i) &= \alpha\text{G}_{gender[i]} + \alpha\text{D}_{dept[i]} \\ \alpha\text{G}_{j} &\sim \text{Normal}(0, 1) \\ \alpha\text{D}_{k} &\sim \text{Normal}(0, 1) \end{align} \]
set.seed(101)
model3 <- ulam(
alist(
admits ~ binomial(applications, p),
logit(p) <- aG[gender] + aD[dept],
aG[gender] ~ normal(0, 1),
aD[dept] ~ normal(0, 1)
),
data = df
)

precis(model3, depth = 3)

               mean         sd       5.5%     94.5%      rhat ess_bulk
aG[1] -0.827822973 0.05011072 -0.9111907 -0.7482978 0.9989513 816.7754
aG[2] -0.222438336 0.03701411 -0.2853299 -0.1619034 1.0019150 891.5175
aD[1] -0.024118451 0.98351088 -1.5900159 1.4041107 1.0009420 958.9369
aD[2] 0.017274972 0.94524875 -1.4321499 1.5112977 0.9987031 668.0473
aD[3] -0.038440686 0.95941149 -1.5884755 1.4294079 1.0008976 689.4736
aD[4] -0.005316958 0.96939207 -1.5272949 1.5725989 0.9982843 795.5305
aD[5] -0.003089526 0.93492886 -1.4953379 1.4335258 1.0004981 616.1700
aD[6] 0.002814438 1.01241243 -1.5946927 1.6854526 1.0013627 726.8505
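Because this model is additive, the overall gender contrast is a single comparison on the log-odds scale. A sketch:

post <- extract.samples(model3)
precis(data.frame(diff_G = post$aG[, 1] - post$aG[, 2]), prob = 0.95)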