```r
library(rethinking)
library(tidyverse)

data(Howell1)

df <- Howell1 |>
  filter(age >= 18) |>
  mutate(
    weight_kilos = weight,
    weight_pounds = weight * 2.2,
    weight_std = standardize(weight),
    height_std = standardize(height),
    sex = ifelse(male == 0, "female", "male") |> as_factor()
  )
```

Standardization and Link Functions
When learning statistics, it’s often very difficult to know what’s important and what’s not. In most R scripts, 95% of the code is just there to tidy things up: converting variable formats, pushing columns around, twisting our external data into the shape that our software wants, etc. These things are important because they let us do the analysis we want, but they’re not very interesting.
Much more important are the conceptual underpinnings. These include:
- How to understand model definitions,
- What likelihood functions do,
- Why we need linear predictor expressions, and
- How to understand parameter probability distributions (prior and posterior).
Two things that really tend to confuse people are standardization and link functions. Both of these concepts are important, but they’re not very conceptually interesting. We use them because they make analysis easier. That’s it. But, if you’re unclear about what they do, they can undermine your ability to conceptualize how the model is working.
The bottom line: standardization and link functions convert between different units of measurement.
Standardization
In a regression model comparing height and weight, it makes sense that we could measure weight in either pounds or kilograms. Whichever unit we use, it doesn’t change our analysis because it’s only the numbers that are different. The thing being represented stays the same. When we standardize the variable, we’re doing exactly the same thing. We’re just centering the scale around our sample mean and setting the unit so that 1 is equal to our sample’s standard deviation.
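Concretely, standardizing just subtracts the sample mean and divides by the sample standard deviation. Here is a minimal sketch of that done by hand (the object name `z_weight` is mine, not from the code above):

```r
# Standardize by hand: subtract the sample mean, divide by the sample SD.
# This reproduces what standardize() from rethinking did in the setup chunk.
z_weight <- (df$weight - mean(df$weight)) / sd(df$weight)

head(z_weight)
head(df$weight_std)  # same values
```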
Let’s compare three plots of the same relationship:
```r
df |>
  ggplot(mapping = aes(x = weight_kilos, y = height)) +
  geom_point(color = "purple", shape = "plus") +
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE) +
  labs(x = "Weight (kg)", y = "Height (cm)")
```

```r
df |>
  ggplot(mapping = aes(x = weight_pounds, y = height)) +
  geom_point(color = "purple", shape = "plus") +
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE) +
  labs(x = "Weight (pounds)", y = "Height (cm)")
```

```r
df |>
  ggplot(mapping = aes(x = weight_std, y = height)) +
  geom_point(color = "purple", shape = "plus") +
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE) +
  labs(x = "Weight (standardized)", y = "Height (cm)")
```

All three plots are exactly the same. The only thing that changes is the numbers on the x-axis. Standardization is doing the same thing as converting from kilos to pounds.
So, why standardize? There are a few advantages.
- First, standardization makes it easier to compare variables with very different ranges by converting them to a common scale.
- Second, in linear regressions of the form y = a + bx, standardization makes our intercept value `a` easier to interpret, because it represents the expected value of y when all independent variables are at their average values.
- Third, using standardized units makes it easier to define priors even when we don’t have much preexisting knowledge.
There’s really only one disadvantage, which is that a standardized variable might mean less to your audience than the natural units would. For this reason, a very common workflow is to standardize all variables for manipulation and analysis, then convert them all back to default units when done. This is easy to do.
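For example, to go from standardized weight back to kilograms, we just reverse the transformation. A quick sketch (the object name `weight_back` is illustrative):

```r
# Undo standardization: multiply by the sample SD, then add back the sample mean
weight_back <- df$weight_std * sd(df$weight) + mean(df$weight)

head(weight_back)
head(df$weight_kilos)  # matches the original kilogram values
```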
Link Functions
In a linear regression, we make predictions about a continuous variable. We might, for example, create a function like `weight = a + b * height` to predict weight from height.
- Our `a` represents a “base” weight when all other parameters are zero (i.e., the average weight if we’re using standardized variables).
- Our `b` represents the “effect” of height. When height changes by 1 unit, weight will change by `b` units.
In this way, all of our parameters represent different numerical parts of the weight variable we are trying to estimate.
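To make this concrete, here is a rough sketch of how that model could be written with quap() from rethinking, using the standardized columns created earlier. The priors are placeholder assumptions for illustration, not values from the text:

```r
# A simple linear model: standardized weight as a function of standardized height
m_height <- quap(
  alist(
    weight_std ~ dnorm(mu, sigma),  # likelihood
    mu <- a + b * height_std,       # linear predictor
    a ~ dnorm(0, 0.5),              # placeholder prior for the intercept
    b ~ dnorm(0, 0.5),              # placeholder prior for the slope
    sigma ~ dexp(1)                 # placeholder prior for the spread
  ),
  data = df
)

precis(m_height)
```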
But, what if we want to predict something other than a continuous variable?
Maybe we want to predict the number of times an event will happen, expected connections in a network analysis, the importance of time series effects, or simple binary outcomes (yes/no). We can do all these things with regressions, but we need to change our models. We’re no longer trying to represent variation around a mean, but rather something else. This something else will often require a very different structure of parameters.
Link functions “connect” our linear predictor to our likelihood function, ensuring that our parameters all behave the way they’re supposed to.
In our Berkeley example, we used a logistic regression to predict university admissions. Admission is a binary event. An applicant is either admitted or rejected with nothing in between. To do this, we’ll need a different kind of likelihood function. Instead of a normal distribution…
\[ \begin{align} \text{weight}_i &\sim \text{Normal}(\mu_i, \sigma) \end{align} \]
…we’ll use a binomial distribution…
\[ \begin{align} \text{admitted}_i &\sim \text{Binomial}(n_i, p_i) \end{align} \]
…where n is the number of applications and p is the probability each application has of being admitted.
This introduces a new constraint. In our linear regression, our predicted weight μ could potentially be any value. Our value p, however, represents a probability, so it can only take values between 0 and 1.
To ensure this, we can write our linear predictor with logit as a link function:
\[ \begin{align} \text{admitted}_i &\sim \text{Binomial}(n_i, p_i) \\ \text{logit}(p_i) &= \alpha_{\text{sex}[i]} \\ \alpha_{\text{sex}[i]} &\sim \text{Normal}(0, 1) \end{align} \]
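Here is a hedged sketch of how this model could be fit with quap(), assuming rethinking’s UCBadmit data stands in for the Berkeley example; the column names below come from that dataset rather than from any earlier code:

```r
# Index-coded admissions model: one intercept per sex, on the log-odds scale
data(UCBadmit)

dat <- list(
  admit        = UCBadmit$admit,
  applications = UCBadmit$applications,
  sex          = ifelse(UCBadmit$applicant.gender == "female", 1L, 2L)
)

m_admit <- quap(
  alist(
    admit ~ dbinom(applications, p),  # binomial likelihood
    logit(p) <- a[sex],               # logit link connects p to the linear predictor
    a[sex] ~ dnorm(0, 1)              # Normal(0, 1) prior on the log-odds scale
  ),
  data = dat
)

precis(m_admit, depth = 2)  # posterior summaries for a[1] (female) and a[2] (male)
```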
All the logit function does here is convert between two different ways of representing the likelihood of binary outcomes:
- Probability (`p`), which gives us a value between 0 and 1, equivalent to the percent chance that a thing will happen, and
- Log odds (`logit(p)`), which represents the same thing on a scale between −∞ and ∞.
It’s just a scale conversion, equivalent to variable standardization or converting between pounds and kilograms.
| Probability (p) | Log odds (logit(p)) |
|---|---|
| 0 | −∞ |
| .01 | -4.59512 |
| .1 | -2.197225 |
| .3 | -0.8472979 |
| .5 | 0 |
| .7 | 0.8472979 |
| .9 | 2.197225 |
| .99 | 4.59512 |
| 1 | ∞ |
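Those table values can be reproduced with the logit() and inv_logit() helpers that come with rethinking:

```r
# Probability -> log odds, and back again
p <- c(0.01, 0.1, 0.3, 0.5, 0.7, 0.9, 0.99)

logit(p)             # the log-odds column of the table
inv_logit(logit(p))  # converting back recovers the original probabilities
```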
Or, as a visual graph:
```r
curve(logit(x), n = 100000)
```

But, why go to all this trouble? Why not just define our parameter alpha to use a value between 0 and 1?
\[ \begin{align} \text{admitted}_i &\sim \text{Binomial}(n_i, p_i) \\ p_i &= \alpha_{\text{sex}[i]} \\ \alpha_{\text{sex}[i]} &\sim \text{Uniform}(0, 1) \end{align} \]
In this simple case, we absolutely could. Apart from having a different prior distribution, this model is functionally equivalent to the one above. But, if we wanted to add additional independent variables to our predictor function, using logit makes everything much, much easier.
For example, let’s say that the base admission rate is 10%. But, maybe there are a few additional factors:

- A GPA above 3.5 increases an applicant’s chance from 10% to 33%
- First-division test scores increase their chance from 10% to 20%
- No work experience decreases their chance from 10% to 6%
Imagine somebody with all of these factors. What is their total chance of getting in? We can figure it out, but the math gets pretty messy.
Alternatively, we can convert everything to log-odds:
| Type | Probability | Log-odds conversion | Log-odds |
|---|---|---|---|
| Base probability | 10% | logit(0.1) | -2.2 |
| Good GPA | 10% -> 33% | logit(0.33) - logit(0.1) | 1.5 |
| Good Test Scores | 10% -> 20% | logit(0.2) - logit(0.1) | 0.8 |
| No Work Experience | 10% -> 6% | logit(0.06) - logit(0.1) | -0.6 |
Unlike probabilities, log-odds values can just be added and subtracted. We start with the base log-odds of -2.2, add 1.5 for a good GPA, add 0.8 for good test scores, subtract 0.6 for no work experience, and end up with a value of -0.5.
We can use the inv_logit function to convert back from log-odds to probability: inv_logit(-0.5) equals about 38%. Somebody with all three of these factors would have roughly a 38% chance of admission.
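Here is that arithmetic in code; the object names are just for illustration:

```r
# Each factor is a shift on the log-odds scale
base    <- logit(0.10)                 # about -2.2
gpa     <- logit(0.33) - logit(0.10)   # about  1.5
tests   <- logit(0.20) - logit(0.10)   # about  0.8
no_work <- logit(0.06) - logit(0.10)   # about -0.6

# Add the shifts, then convert the total back to a probability
inv_logit(base + gpa + tests + no_work)  # roughly 0.39; the rounded values above give 38%
```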
Bottom line: link functions, like variable standardization, are just a way of converting between scales of representation. Logistic regressions expect parameters on the log-odds scale, but the binomial distribution expects a probability. The logit link function is there to bridge that gap.
In a logistic regression, combining multiple probabilities is messy but combining multiple log-odds is simple. We define our priors in terms of log-odds, do the work we want to do, and then convert back before giving the value to our likelihood function. We could do all the math manually if we wanted. That would allow us to skip the logit function. But, it would be a lot more work.