A linear model allows us to calculate conditional means. For example, a model to estimate the mean number of visits to the health clinic for two dorms at a university might be

\[ \text{visits} = 0.9 + 2.2x \]

where \(x = 0\) for dorm 1 and \(x = 1\) for dorm 2. In other words, the mean number of visits to the health clinic for students in dorm 1 is 0.9, and the mean number of visits to the health clinic for students in dorm 2 is 0.9 + 2.2 = 3.1.

This model will not always be right. It’s just an approximation. It will be off by some amount. This amount is usually called the error, symbolized with \(\epsilon\).

\[ \text{visits} = 0.9 + 2.2x + \epsilon \]

We assume the error is a random draw from a probability distribution. The probability distribution for a basic linear model, also known as the ordinary least squares (OLS) model, is the normal distribution. The normal distribution has two parameters: the mean and standard deviation. For a linear model we assume the mean is 0 and we estimate the standard deviation. This is reported as residual standard error in model output.

Here’s what the linear model looks like when we fit it in R:

m <- lm(visits ~ x, data = d)
summary(m)

Call:
lm(formula = visits ~ x, data = d)

Residuals:
   Min     1Q Median     3Q    Max 
  -2.1   -1.1    0.1    0.9    3.9 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.9000     0.2896   3.108  0.00316 ** 
x             2.2000     0.3738   5.885 3.76e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.295 on 48 degrees of freedom
Multiple R-squared:  0.4191,    Adjusted R-squared:  0.407 
F-statistic: 34.63 on 1 and 48 DF,  p-value: 3.757e-07

The intercept is the expected mean visits for dorm 1. The x coefficient is the difference in mean visits between dorm 2 and dorm 1. The standard deviation of the normal distribution from which the error is sampled is estimated to be 1.295.

Because the error is normally distributed, this implies the estimated means are also normally distributed. We can state this mathematically as

\[ \text{dorm 1 count} \sim N(0.9, 1.295) \] \[ \text{dorm 2 count} \sim N(0.9 + 2.2, 1.295) \]

We can use this model to estimate the expected number of visits to the health clinic for 10 incoming students to dorm 1 where the expected number of visits is 0.9. We do this by sampling from a normal distribution with mean 0.9 and standard deviation 1.295. We can do the sampling in R using the rnorm() function.

rnorm(n = 10, mean = 0.9, sd = 1.295)
 [1]  0.08874232  1.13781810 -0.18213905  2.96588864  1.32671256 -0.16250656
 [7]  1.53122062  1.85613049  1.64563685  0.50452204

These expected values are bizarre. For one, they all have decimals. A student won’t have 0.08874232 visits to the health clinic. They will have zero or more visits. Another problem is that two of the values are negative. That certainly doesn’t make any sense. But the normal distribution doesn’t know this. It works for any range of numbers with any level of precision. It has no lower or upper bound. Therefore we would like to consider another probability distribution for this model.

The generalized linear model

Basic linear models assume a normal distribution for the conditional means. There is no option to change the distribution in the lm() function. If we want to consider alternative distributions, we need to use generalized linear models. We can do this in R with the glm() function.

One distribution we might like to consider for our example above is the Poisson distribution. The Poisson distribution has one parameter, a mean. This is usually symbolized with \(\lambda\). Draws from this distribution are whole numbers that are greater than or equal to 0. We specify this distribution in the glm() function using the family argument.

m2 <- glm(visits ~ x, data = d, family = poisson)
summary(m2)

Call:
glm(formula = visits ~ x, family = poisson, data = d)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -0.1054     0.2357  -0.447    0.655    
x             1.2368     0.2575   4.803 1.56e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 67.018  on 49  degrees of freedom
Residual deviance: 37.416  on 48  degrees of freedom
AIC: 159.08

Number of Fisher Scoring iterations: 5

This implies the estimated means are Poisson distributed. We might state this mathematically as

\[ \text{dorm 1 count} \sim \text{Poisson}(-0.1054) \]

\[ \text{dorm 2 count} \sim \text{Poisson}(-0.1054 + 1.2368) \]

But this would be wrong! You may have already been suspicious of the negative intercept, -0.1054, which is the estimated mean number of visits to the health clinic for students in dorm 1. That doesn’t make sense. In fact, if we use this value to draw random samples from a Poisson distribution we get a warning and missing values.

rpois(n = 10, lambda = -0.1054)
Warning in rpois(n = 10, lambda = -0.1054): NAs produced
 [1] NA NA NA NA NA NA NA NA NA NA

What’s going on? It turns out we need to exponentiate the model to get the conditional means.

\[ \text{dorm 1 count} \sim \text{Poisson}(e^{-0.1054}) \]

\[ \text{dorm 2 count} \sim \text{Poisson}(e^{-0.1054 + 1.2368}) \]

Using the exp() function, we can now get estimated counts for the next 10 students in dorm 1:

rpois(n = 10, lambda = exp(-0.1054))
 [1] 2 0 1 0 0 0 0 0 2 0

Why do we have to exponentiate? Because exponentiating ensures the estimated means are positive. This allows our model, called the linear predictor, to take any range of values, positive or negative. No matter the estimated model, we’ll always get positive means once we exponentiate.

But why didn’t R just do that for us and report the exponentiated coefficients in the output? Because the estimated mean has to be exponentiated, not the individual coefficients. We don’t exponentiate 1.2368. We exponentiate -0.1054 + 1.2368. We can exponentiate 1.2368, but that doesn’t return an estimated mean. It returns a multiplicative effect.

exp(1.2358)
[1] 3.44113

This says the counts for dorm 2 are expected to be about 3.4 times higher than the counts for dorm 1.

Now we can express our model as a single probability expression:

\[ \text{visits} \sim \text{Poisson}(e^{-0.1054 + 1.2368x}) \]

The model, or linear predictor, is \(-0.1054 + 1.2368x\). To extract the model, we have to take the log of the expression. The log is the inverse of exponentiation.

\[ \text{log}(e^{x}) = x \]

Because the log transformation allows us to extract the linear model, we call the log the “link” to our model. The log is the default link for the poisson() function in R. We can explicitly state this when we fit a Poisson model as follows:

m2 <- glm(visits ~ x, data = d, family = poisson(link = "log"))

Simulating data for a Poisson model

The data for the example above was simulated using the rpois() function. Notice the true linear model is \(0.2 + 0.8x\). Also notice the exp() function exponentiates the model.

RNGversion("4.4.3")
set.seed(12) # make reproducible
x <- sample(0:1, size = 50, replace = TRUE)
visits <- rpois(n = 50, lambda = exp(0.2 + 0.8*x))
d <- data.frame(visits, x)

The link to the model is the log transformation:

\[ \text{log}(e^{0.2 + 0.8x}) = 0.2 + 0.8x \]

Two other links for the poisson family are the “identity” and “sqrt” (square root) links. The identity link simply means there is no transformation. We can simulate data appropriate for an identity link by not using the exp() function. Notice the estimated coefficients are close to the true values. (We use a larger sample size to help the model come closer to estimating the true coefficients of the linear model.)

set.seed(13)
n <- 1000
x <- sample(0:1, size = n, replace = TRUE)
visits <- rpois(n = n, lambda = 0.2 + 0.8*x)
d <- data.frame(visits, x)
m_i <- glm(visits ~ x, data = d, family = poisson(link = "identity"))
summary(m_i)

Call:
glm(formula = visits ~ x, family = poisson(link = "identity"), 
    data = d)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  0.17624    0.01868   9.434   <2e-16 ***
x            0.89447    0.05012  17.847   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 1247.50  on 999  degrees of freedom
Residual deviance:  890.27  on 998  degrees of freedom
AIC: 1827.5

Number of Fisher Scoring iterations: 3

We can simulate data appropriate for a square root link by squaring the model. That’s because the square root is the inverse of the square.

\[ \sqrt{(0.2 + 0.8x)^2} = 0.2 + 0.8x \]

set.seed(14)
x <- sample(0:1, size = n, replace = TRUE)
visits <- rpois(n = n, lambda = (0.2 + 0.8*x)^2) # square the model
d <- data.frame(visits, x)
m_s <- glm(visits ~ x, data = d, family = poisson(link = "sqrt"))
summary(m_s)

Call:
glm(formula = visits ~ x, family = poisson(link = "sqrt"), data = d)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  0.20135    0.02197   9.165   <2e-16 ***
x            0.80897    0.03164  25.565   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 1279.45  on 999  degrees of freedom
Residual deviance:  709.04  on 998  degrees of freedom
AIC: 1463.6

Number of Fisher Scoring iterations: 5