When learning statistics, it’s often very difficult to know what’s important and what’s not. In most R scripts, 95% of the code is just there to tidy things up: converting variable formats, pushing columns around, twisting our external data into the shape that our software wants, etc. These things are important because they let us do the analysis we want, but they’re not very interesting.
Much more important are the conceptual underpinnings.
Two things that really tend to confuse people are standardization and link functions. Both of these concepts are important, but they’re not very conceptually interesting. We use them because they make analysis easier. That’s it. But, if you’re unclear about what they do, they can undermine your ability to conceptualize how the model is working.
The bottom line: standardization and link functions convert between different units of measurement.
In a regression model comparing height and weight, it makes sense that we could measure weight in either pounds or kilograms. Whichever unit we use, it doesn’t change our analysis because it’s only the numbers that are different. The thing being represented stays the same. When we standardize the variable, we’re doing exactly the same thing. We’re just centering the scale around our sample mean and setting the unit so that 1 is equal to our sample’s standard deviation:
library(rethinking)
library(tidyverse)

data(Howell1)

# Adults only, with weight expressed in several different units/scales
df <- Howell1 |>
  filter(age >= 18) |>
  mutate(
    weight_kilos  = weight,
    weight_pounds = weight * 2.2,
    weight_std    = standardize(weight),
    height_std    = standardize(height),
    sex           = ifelse(male == 0, "female", "male") |> as_factor()
  )
Let’s compare the three plots:
df |>
  ggplot(mapping = aes(x = weight_kilos, y = height)) +
  geom_point(color = "purple", shape = "plus") +
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE) +
  labs(x = "Weight (kg)", y = "Height (cm)")

df |>
  ggplot(mapping = aes(x = weight_pounds, y = height)) +
  geom_point(color = "purple", shape = "plus") +
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE) +
  labs(x = "Weight (pounds)", y = "Height (cm)")

df |>
  ggplot(mapping = aes(x = weight_std, y = height)) +
  geom_point(color = "purple", shape = "plus") +
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE) +
  labs(x = "Weight (standardized)", y = "Height (cm)")
All three plots are exactly the same. The only thing that’s changing is the numbers on the x-axis. Standardization is doing the same thing as converting from kilos to pounds.
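If you want to convince yourself that standardize() is nothing more exotic than a unit conversion, you can reproduce it by hand. Here’s a quick check using the df built above (the column name weight_std_manual is just mine):

# Standardizing by hand: subtract the sample mean, divide by the sample SD
df <- df |>
  mutate(weight_std_manual = (weight - mean(weight)) / sd(weight))

# Should be TRUE: the manual version matches rethinking::standardize()
# (check.attributes = FALSE because standardize() may keep scaling attributes)
all.equal(df$weight_std_manual, df$weight_std, check.attributes = FALSE)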
So, why standardize? There are a few advantages. For one, the intercept a becomes easier to interpret, because it represents the expected value of y when all independent variables are at their average values.

There’s really only one disadvantage, which is that a standardized variable might mean less to your audience than the natural units would. For this reason, a very common workflow is to standardize all variables for manipulation and analysis, then convert them all back to natural units when done. This is easy to do.
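For example, converting a standardized weight back to kilograms is just the reverse arithmetic: multiply by the sample standard deviation, then add the sample mean back in. A quick sketch using the df from above (weight_back is my name for the recovered column):

# Undo the standardization: multiply by the sample SD, then add back the mean
df <- df |>
  mutate(weight_back = weight_std * sd(weight) + mean(weight))

# Should be TRUE: we recover the original kilograms
all.equal(df$weight_back, df$weight_kilos, check.attributes = FALSE)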
In a linear regression, we make predictions about a continuous variable. We might, for example, create a function like weight = a + b*height to predict weight from height.

- a represents a “base” weight when all other parameters are zero (i.e., the average weight if we’re using standardized variables).
- b represents the “effect” of height. When height changes by 1 unit, weight will change by b units.

In this way, all of our parameters represent different numerical parts of the weight variable we are trying to estimate.
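To make that concrete, here is a minimal sketch of this kind of model fit with rethinking’s quap(), using the standardized columns from the df above. The model name m_hw and the priors are mine and purely illustrative:

# weight_std = a + b * height_std, with normally distributed residuals
m_hw <- quap(
  alist(
    W ~ dnorm(mu, sigma),   # likelihood: weight varies normally around mu
    mu <- a + b * H,        # linear predictor
    a ~ dnorm(0, 1),        # "base" weight when H is at its mean (H = 0)
    b ~ dnorm(0, 1),        # change in W (in SDs) per 1-SD change in H
    sigma ~ dexp(1)         # residual standard deviation
  ),
  data = list(W = df$weight_std, H = df$height_std)
)

precis(m_hw)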
But, what if we want to predict something other than a continuous variable?
Maybe we want to predict the number of times an event will happen, expected connections in a network analysis, the importance of time-series effects, or simple binary outcomes (yes/no). We can do all these things with regressions, but we need to change our models. We’re no longer trying to represent variation around a mean but rather something else, and that something else will often require a very different structure of parameters.
Link functions “connect” our linear predictor to our likelihood function, ensuring that our parameters all behave the way they’re supposed to.
In our Berkeley example, we used a logistic regression to predict university admissions. Admission is a binary event. An applicant is either admitted or rejected with nothing in between. To do this, we’ll need a different kind of likelihood function. Instead of a normal distribution…
\[ \begin{align} \text{weight}_i &\sim \text{Normal}(\mu_i, \sigma) \end{align} \]
…we’ll use a binomial distribution…
\[ \begin{align} \text{admitted}_i &\sim \text{Binomial}(n_i, p_i) \end{align} \]
…where n is the number of applications and
p is the probability each application has of being
admitted.
This introduces a new constraint. In our linear regression, our predicted weight μ could potentially be any value. Our value p, however, represents a probability, so it can only take values between 0 and 1.
To ensure this, we can write our linear predictor with
logit as a link function:
\[ \begin{align} \text{admitted}_i &\sim \text{Binomial}(n_i, p_i) \\ \text{logit}(p_i) &= \alpha_{sex[i]} \\ \alpha_{sex[i]} &\sim \text{Normal}(0, 1) \end{align} \]
All the logit function does here is convert between two different ways of representing the likelihood of binary outcomes:

- Probability (p), which gives us a value between 0 and 1, equivalent to the percent chance that a thing will happen, and
- Log odds (logit(p)), which represents the same thing on a scale between −∞ and ∞.
It’s just a scale conversion, equivalent to variable standardization or converting between pounds and kilograms.
| probability (p) | log odds (logit(p)) |
|---|---|
| 0 | −∞ |
| 0.01 | -4.595 |
| 0.1 | -2.197 |
| 0.3 | -0.847 |
| 0.5 | 0 |
| 0.7 | 0.847 |
| 0.9 | 2.197 |
| 0.99 | 4.595 |
| 1 | ∞ |
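You can reproduce this table with logit() from rethinking (it’s equivalent to base R’s qlogis()):

# Probabilities and their log-odds equivalents
p <- c(0, 0.01, 0.1, 0.3, 0.5, 0.7, 0.9, 0.99, 1)
data.frame(probability = p, log_odds = logit(p))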
Or, as a visual graph:
curve(logit(x), n = 100000)
But, why go to all this trouble? Why not just define our parameter alpha to use a value between 0 and 1?
\[ \begin{align} \text{admitted}_i &\sim \text{Binomial}(n_i, p_i) \\ p_i &= \alpha_{sex[i]} \\ \alpha_{sex[i]} &\sim \text{Uniform}(0, 1) \end{align} \]
In this simple case, we absolutely could. Apart from having a
different prior distribution, this model is functionally equivalent to
the one above. But, if we wanted to add additional independent variables
to our predictor function, using logit makes everything
much, much easier.
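As a sketch of how the logit-link model above can be written in code, here is roughly how it could be fit with quap(), assuming the UCBadmit data that ships with rethinking (with its applicant.gender column); the index coding of sex and the model name m_admit are mine:

data(UCBadmit)
d <- UCBadmit

dat <- list(
  admit        = d$admit,
  applications = d$applications,
  sex          = ifelse(d$applicant.gender == "female", 1L, 2L)  # 1 = female, 2 = male
)

m_admit <- quap(
  alist(
    admit ~ dbinom(applications, p),  # binomial likelihood
    logit(p) <- a[sex],               # logit link: a lives on the log-odds scale
    a[sex] ~ dnorm(0, 1)              # prior stated in log-odds
  ),
  data = dat
)

# Estimates are in log-odds; use inv_logit() to read them as probabilities
precis(m_admit, depth = 2)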
For example, let’s say that the base admission rate is 10%. But, maybe there are a few additional factors:

- A GPA above 3.5 increases the base chance from 10% to 33%.
- First-division test scores increase the base chance from 10% to 20%.
- No work experience decreases the base chance from 10% to 6%.
Imagine somebody with all of these factors. What is their total chance of getting in? We can figure it out, but the math gets pretty messy.
Alternatively, we can convert everything to log-odds.
| Type | Probability | Log-odds conversion | Log-odds |
|---|---|---|---|
| Base probability | 10% | logit(0.1) | -2.2 |
| Good GPA | 10% -> 33% | logit(0.33) - logit(0.1) | 1.5 |
| Good Test Scores | 10% -> 20% | logit(0.2) - logit(0.1) | 0.8 |
| No Work Experience | 10% -> 6% | logit(0.06) - logit(0.1) | -0.6 |
Unlike probabilities, log-odds values can just be added and
subtracted. We start with the base log-odds of -2.2, add 1.5 for a good
GPA, add 0.8 for good test scores, subtract 0.6 for no work experience,
and end up with a value of -0.5.
We can use the inv_logit function to convert back from log-odds to probability: inv_logit(-0.5) equals about 38%. Somebody with all three of these factors would have roughly a 38% chance of admission.
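Here is the same arithmetic in R, using logit() and inv_logit() from rethinking (the unrounded values land at roughly 0.39; the rounded values used above give about 0.38):

base_lo <- logit(0.10)                # base rate on the log-odds scale
gpa_lo  <- logit(0.33) - logit(0.10)  # shift for a good GPA
test_lo <- logit(0.20) - logit(0.10)  # shift for good test scores
work_lo <- logit(0.06) - logit(0.10)  # shift for no work experience

# Add the shifts on the log-odds scale, then convert back to a probability
inv_logit(base_lo + gpa_lo + test_lo + work_lo)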
Bottom line: link functions, like variable standardization,
are just a way of converting between scales of representation.
Logistic regressions expect parameters on the log-odds scale, but the
binomial distribution expects a probability. The logit link
function is there to bridge that gap.
In a logistic regression, combining multiple probabilities is messy
but combining multiple log-odds is simple. We define our priors in terms
of log-odds, do the work we want to do, and then convert back before
giving the value to our likelihood function. We could do all the math
manually if we wanted. That would allow us to skip the
logit function. But, it would be a lot more work.