A linear model allows us to calculate conditional means. For example, a model to estimate the mean number of visits to the health clinic for two dorms at a university might be
\[ \text{visits} = 0.9 + 2.2x \]
where \(x = 0\) for dorm 1 and \(x = 1\) for dorm 2. In other words, the mean number of visits to the health clinic for students in dorm 1 is 0.9, and the mean number of visits to the health clinic for students in dorm 2 is 0.9 + 2.2 = 3.1.
This model will not always be right. It’s just an approximation. It will be off by some amount. This amount is usually called the error, symbolized with \(\epsilon\).
\[ \text{visits} = 0.9 + 2.2x + \epsilon \]
We assume the error is a random draw from a probability distribution. The probability distribution for a basic linear model, also known as the ordinary least squares (OLS) model, is the normal distribution. The normal distribution has two parameters: the mean and standard deviation. For a linear model we assume the mean is 0 and we estimate the standard deviation. This is reported as residual standard error in model output.
Here’s what the linear model looks like when we fit it in R:
m <- lm(visits ~ x, data = d)
summary(m)
Call:
lm(formula = visits ~ x, data = d)
Residuals:
Min 1Q Median 3Q Max
-2.1 -1.1 0.1 0.9 3.9
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.9000 0.2896 3.108 0.00316 **
x 2.2000 0.3738 5.885 3.76e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.295 on 48 degrees of freedom
Multiple R-squared: 0.4191, Adjusted R-squared: 0.407
F-statistic: 34.63 on 1 and 48 DF, p-value: 3.757e-07
The intercept is the expected mean visits for dorm 1. The x coefficient is the difference in mean visits between dorm 2 and dorm 1. The standard deviation of the normal distribution from which the error is sampled is estimated to be 1.295.
Because the error is normally distributed, this implies the estimated means are also normally distributed. We can state this mathematically as
\[ \text{dorm 1 count} \sim N(0.9, 1.295) \] \[ \text{dorm 2 count} \sim N(0.9 + 2.2, 1.295) \]
We can use this model to estimate the expected number of visits to
the health clinic for 10 incoming students to dorm 1 where the expected
number of visits is 0.9. We do this by sampling from a normal
distribution with mean 0.9 and standard deviation 1.295. We can do the
sampling in R using the rnorm()
function.
rnorm(n = 10, mean = 0.9, sd = 1.295)
[1] 0.08874232 1.13781810 -0.18213905 2.96588864 1.32671256 -0.16250656
[7] 1.53122062 1.85613049 1.64563685 0.50452204
These expected values are bizarre. For one, they all have decimals. A student won’t have 0.08874232 visits to the health clinic. They will have zero or more visits. Another problem is that two of the values are negative. That certainly doesn’t make any sense. But the normal distribution doesn’t know this. It works for any range of numbers with any level of precision. It has no lower or upper bound. Therefore we would like to consider another probability distribution for this model.
Basic linear models assume a normal distribution for the conditional
means. There is no option to change the distribution in the
lm()
function. If we want to consider alternative
distributions, we need to use generalized linear models. We can
do this in R with the glm()
function.
One distribution we might like to consider for our example above is
the Poisson distribution. The Poisson distribution has one
parameter, a mean. This is usually symbolized with \(\lambda\). Draws from this distribution are
whole numbers that are greater than or equal to 0. We specify this
distribution in the glm()
function using the
family
argument.
m2 <- glm(visits ~ x, data = d, family = poisson)
summary(m2)
Call:
glm(formula = visits ~ x, family = poisson, data = d)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.1054 0.2357 -0.447 0.655
x 1.2368 0.2575 4.803 1.56e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 67.018 on 49 degrees of freedom
Residual deviance: 37.416 on 48 degrees of freedom
AIC: 159.08
Number of Fisher Scoring iterations: 5
This implies the estimated means are Poisson distributed. We might state this mathematically as
\[ \text{dorm 1 count} \sim \text{Poisson}(-0.1054) \]
\[ \text{dorm 2 count} \sim \text{Poisson}(-0.1054 + 1.2368) \]
But this would be wrong! You may have already been suspicious of the negative intercept, -0.1054, which is the estimated mean number of visits to the health clinic for students in dorm 1. That doesn’t make sense. In fact, if we use this value to draw random samples from a Poisson distribution we get a warning and missing values.
rpois(n = 10, lambda = -0.1054)
Warning in rpois(n = 10, lambda = -0.1054): NAs produced
[1] NA NA NA NA NA NA NA NA NA NA
What’s going on? It turns out we need to exponentiate the model to get the conditional means.
\[ \text{dorm 1 count} \sim \text{Poisson}(e^{-0.1054}) \]
\[ \text{dorm 2 count} \sim \text{Poisson}(e^{-0.1054 + 1.2368}) \]
Using the exp()
function, we can now get estimated
counts for the next 10 students in dorm 1:
rpois(n = 10, lambda = exp(-0.1054))
[1] 2 0 1 0 0 0 0 0 2 0
Why do we have to exponentiate? Because exponentiating ensures the estimated means are positive. This allows our model, called the linear predictor, to take any range of values, positive or negative. No matter the estimated model, we’ll always get positive means once we exponentiate.
But why didn’t R just do that for us and report the exponentiated coefficients in the output? Because the estimated mean has to be exponentiated, not the individual coefficients. We don’t exponentiate 1.2368. We exponentiate -0.1054 + 1.2368. We can exponentiate 1.2368, but that doesn’t return an estimated mean. It returns a multiplicative effect.
exp(1.2358)
[1] 3.44113
This says the counts for dorm 2 are expected to be about 3.4 times higher than the counts for dorm 1.
Now we can express our model as a single probability expression:
\[ \text{visits} \sim \text{Poisson}(e^{-0.1054 + 1.2368x}) \]
The model, or linear predictor, is \(-0.1054 + 1.2368x\). To extract the model, we have to take the log of the expression. The log is the inverse of exponentiation.
\[ \text{log}(e^{x}) = x \]
Because the log transformation allows us to extract the linear model,
we call the log the “link” to our model. The log is the default
link for the poisson()
function in R. We can explicitly
state this when we fit a Poisson model as follows:
m2 <- glm(visits ~ x, data = d, family = poisson(link = "log"))
The data for the example above was simulated using the
rpois()
function. Notice the true linear model is
\(0.2 + 0.8x\). Also notice the
exp()
function exponentiates the model.
RNGversion("4.4.3")
set.seed(12) # make reproducible
x <- sample(0:1, size = 50, replace = TRUE)
visits <- rpois(n = 50, lambda = exp(0.2 + 0.8*x))
d <- data.frame(visits, x)
The link to the model is the log transformation:
\[ \text{log}(e^{0.2 + 0.8x}) = 0.2 + 0.8x \]
Two other links for the poisson family are the “identity” and “sqrt”
(square root) links. The identity link simply means there is no
transformation. We can simulate data appropriate for an identity link by
not using the exp()
function. Notice the estimated
coefficients are close to the true values. (We use a larger sample size
to help the model come closer to estimating the true coefficients of the
linear model.)
set.seed(13)
n <- 1000
x <- sample(0:1, size = n, replace = TRUE)
visits <- rpois(n = n, lambda = 0.2 + 0.8*x)
d <- data.frame(visits, x)
m_i <- glm(visits ~ x, data = d, family = poisson(link = "identity"))
summary(m_i)
Call:
glm(formula = visits ~ x, family = poisson(link = "identity"),
data = d)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.17624 0.01868 9.434 <2e-16 ***
x 0.89447 0.05012 17.847 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 1247.50 on 999 degrees of freedom
Residual deviance: 890.27 on 998 degrees of freedom
AIC: 1827.5
Number of Fisher Scoring iterations: 3
We can simulate data appropriate for a square root link by squaring the model. That’s because the square root is the inverse of the square.
\[ \sqrt{(0.2 + 0.8x)^2} = 0.2 + 0.8x \]
set.seed(14)
x <- sample(0:1, size = n, replace = TRUE)
visits <- rpois(n = n, lambda = (0.2 + 0.8*x)^2) # square the model
d <- data.frame(visits, x)
m_s <- glm(visits ~ x, data = d, family = poisson(link = "sqrt"))
summary(m_s)
Call:
glm(formula = visits ~ x, family = poisson(link = "sqrt"), data = d)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.20135 0.02197 9.165 <2e-16 ***
x 0.80897 0.03164 25.565 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 1279.45 on 999 degrees of freedom
Residual deviance: 709.04 on 998 degrees of freedom
AIC: 1463.6
Number of Fisher Scoring iterations: 5