When learning statistics, it’s often very difficult to know what’s important and what’s not. In most R scripts, 95% of the code is just there to tidy things up: converting variable formats, pushing columns around, twisting our external data into the shape that our software wants, etc. These things are important because they let us do the analysis we want, but they’re not very interesting.

Much more important are the conceptual underpinnings.

Two things that really tend to confuse people are standardization and link functions. Both of these concepts are important, but they’re not very conceptually interesting. We use them because they make analysis easier. That’s it. But, if you’re unclear about what they do, they can undermine your ability to conceptualize how the model is working.

The bottom line: standardization and link functions convert between different units of measurement.

Standardization

In a regression model relating height to weight, we could measure weight in either pounds or kilograms. The choice of unit doesn’t change our analysis, because only the numbers differ. The thing being represented stays the same. When we standardize a variable, we’re doing essentially the same thing: we center the scale on our sample mean and set the unit so that 1 equals one sample standard deviation:

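library(rethinking)  # provides Howell1 and standardize()
library(tidyverse)   # dplyr, ggplot2, forcats
data(Howell1)

# Keep adults only, then express weight in three different units
df <- Howell1 |>
  filter(age >= 18) |>
  mutate(
    weight_kilos = weight,                    # Howell1 records weight in kg
    weight_pounds = weight * 2.2,             # approximate kg-to-lb conversion
    weight_std = standardize(weight),         # mean 0, standard deviation 1
    height_std = standardize(height),
    sex = ifelse((male == 0), "female", "male") |> as_factor()
  )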
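
To make the "centering and rescaling" idea concrete, here is a minimal sketch in base R. The toy weights are made up for illustration (they are not the Howell1 data), and the variable names are my own. It standardizes by hand and checks the result against base R's `scale()`:

```r
# Hypothetical toy weights in kilograms (made up for illustration)
weight_kg <- c(45, 52, 58, 61, 70)

# Standardize by hand: subtract the sample mean, divide by the sample SD
weight_std_manual <- (weight_kg - mean(weight_kg)) / sd(weight_kg)

# Base R's scale() does the same centering and scaling (it returns a matrix)
weight_std_scale <- as.numeric(scale(weight_kg))

all.equal(weight_std_manual, weight_std_scale)  # TRUE

mean(weight_std_manual)  # 0, up to floating-point error
sd(weight_std_manual)    # 1, up to floating-point error
```

The ordering of the values and the relative gaps between them are untouched. Only the origin and the unit have changed.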

Let’s compare plots of height against weight in each of the three units:

df |>
  ggplot(mapping = aes(x = weight_kilos, y = height)) +
  geom_point(color = "purple", shape = "plus") +
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE) +
  labs(x = "Weight (kg)", y = "Height (cm)")

df |>
  ggplot(mapping = aes(x = weight_pounds, y = height)) +
  geom_point(color = "purple", shape = "plus") +
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE) +
  labs(x = "Weight (pounds)", y = "Height (cm)")

df |>
  ggplot(mapping = aes(x = weight_std, y = height)) +
  geom_point(color = "purple", shape = "plus") +
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE) +
  labs(x = "Weight (standardized)", y = "Height (cm)")

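All three plots show the same cloud of points and the same fitted line. The only thing that changes is the numbers on the x-axis. Standardization does the same kind of thing as converting from kilos to pounds. The one difference is that it also moves the zero point to the sample mean.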
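
The same point can be checked numerically rather than by eye. The sketch below uses simulated data (not the Howell1 sample), and the variable names and simulation parameters are assumptions chosen only for illustration. It fits the same regression in three units and confirms that the fitted values agree, while the slopes differ only by the conversion factor:

```r
set.seed(1)  # simulated data for illustration, not the Howell1 sample

n <- 100
weight_kg <- rnorm(n, mean = 45, sd = 6)
height_cm <- 115 + 0.9 * weight_kg + rnorm(n, sd = 5)

# The same predictor in two other units
weight_lb  <- weight_kg * 2.2  # rough kg-to-lb factor
weight_std <- (weight_kg - mean(weight_kg)) / sd(weight_kg)

fit_kg  <- lm(height_cm ~ weight_kg)
fit_lb  <- lm(height_cm ~ weight_lb)
fit_std <- lm(height_cm ~ weight_std)

# The fitted heights are identical whichever unit we use
all.equal(unname(fitted(fit_kg)), unname(fitted(fit_lb)))   # TRUE
all.equal(unname(fitted(fit_kg)), unname(fitted(fit_std)))  # TRUE

# The slopes differ only by the unit conversion
coef(fit_kg)[["weight_kg"]] / coef(fit_lb)[["weight_lb"]]    # 2.2
coef(fit_std)[["weight_std"]] / coef(fit_kg)[["weight_kg"]]  # equals sd(weight_kg)
```

In other words, the standardized slope is the expected change in height per one standard deviation of weight, rather than per kilogram.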

So, why standardize? There are a few advantages.

There’s really only one disadvantage, which is that a standardized variable might mean less to your audience than the natural units would. For this reason, a very common workflow is to standardize all variables for manipulation and analysis, then convert them all back to their natural units when done. This is easy to do.
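
As a minimal sketch of that back-conversion, assuming you saved the mean and standard deviation used to standardize: multiply by the standard deviation and add the mean. The toy weights and variable names below are made up for illustration.

```r
# Hypothetical toy weights in kilograms (made up for illustration)
weight_kg <- c(45, 52, 58, 61, 70)

# Save the centering and scaling constants before standardizing
w_mean <- mean(weight_kg)
w_sd   <- sd(weight_kg)
weight_std <- (weight_kg - w_mean) / w_sd

# Undo the standardization: rescale, then shift back
weight_back <- weight_std * w_sd + w_mean
all.equal(weight_back, weight_kg)  # TRUE

# Any value on the standardized scale converts the same way,
# e.g. a point 1.5 standard deviations above the mean
z_new <- 1.5
z_new * w_sd + w_mean  # that same point expressed in kilograms
```

The key design choice is to keep `w_mean` and `w_sd` from the original sample, since recomputing them on different data would give a different scale.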