- Introduce probability model for count data.
- Interpretation of coefficients on log and non-log scale.
- Making predictions from a model in log and non-log counts.
- Running comparisons for nested models and interpreting the results.
In a Poisson model
\[ \begin{aligned} y_i &\sim \text{Poisson}(\lambda_i) \end{aligned} \]
we assume that \(y\) is distributed following a Poisson process.
A logarithmic link function maps the rate \(\lambda\) to a linear function of a predictor \(x\):
\[ \log(\lambda_i) = \beta_0 + \beta_1 \times x_i \]
This allows us to model rate changes relative to changes in one (more more) predictor variable(s).
Similar to binary logistic regression: probability is not a linear function of the predictor variables but log-odds are.
We will be focusing on cigs: number of cigarettes smoked in a particular period of time (per day).
# Load the smoking data smoking_df <- read_csv( "https://raw.githubusercontent.com/mark-andrews/ntupsychology-data/main/data-sets/smoking.csv") # Check variables in data glimpse(smoking_df)
Rows: 807 Columns: 10 $ educ <dbl> 16.0, 16.0, 12.0, 13.5, 10.0, 6.0, 12.0, 15.0, 12.0, 12.0, 13… $ cigpric <dbl> 60.506, 57.883, 57.664, 57.883, 58.320, 59.340, 57.883, 57.88… $ white <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1… $ age <dbl> 46, 40, 58, 30, 17, 86, 35, 48, 48, 31, 27, 24, 71, 44, 22, 2… $ income <dbl> 20000, 30000, 30000, 20000, 20000, 6500, 20000, 30000, 20000,… $ cigs <dbl> 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 10, 20, 30, 0, 0, 20, 0, 30, 0,… $ restaurn <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… $ lincome <dbl> 9.903487, 10.308952, 10.308952, 9.903487, 9.903487, 8.779557,… $ agesq <dbl> 2116, 1600, 3364, 900, 289, 7396, 1225, 2304, 2304, 961, 729,… $ lcigpric <dbl> 4.102743, 4.058424, 4.054633, 4.058424, 4.065945, 4.083283, 4…
Number of each value of cigs: do any values stand out?
Distribution family is Poisson and link is log (default).
# Poisson model with intercept only
m_cigs_0 <- glm(cigs ~ 1,
data = smoking_df,
family = poisson(link = "log"))
# Return estimate for model parameters
coef(m_cigs_0)
(Intercept) 2.161769
The intercept is the log number of cigarettes.
coef(m_cigs_0) # Intercept coefficient on log scale (log number of cigs)
(Intercept) 2.161769
Coefficient of intercept-only model is the same as the log of the average (mean) number of cigs.
# Average of cigs d_mean <- describe(data = smoking_df, avg = mean(cigs)) # Add log of the average mutate(d_mean, log_avg = log(avg))
# A tibble: 1 × 2
avg log_avg
<dbl> <dbl>
1 8.69 2.16
How can we turn the intercept estimate into “number of cigarettes”? Remember odds vs log odds.
How does the number of cigs smoked vary as a function of age, education and income as predictors?
m_cigs <- glm(cigs ~ age + educ + lincome, data = smoking_df, family = poisson(link = "log"))
summary(m_cigs)
Call:
glm(formula = cigs ~ age + educ + lincome, family = poisson(link = "log"),
data = smoking_df)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.7978288 0.1858839 4.292 1.77e-05 ***
age -0.0049920 0.0007308 -6.831 8.44e-12 ***
educ -0.0458054 0.0042516 -10.774 < 2e-16 ***
lincome 0.2193037 0.0195039 11.244 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 15821 on 806 degrees of freedom
Residual deviance: 15596 on 803 degrees of freedom
AIC: 17075
Number of Fisher Scoring iterations: 6
coef(m_cigs)
(Intercept) age educ lincome 0.797828756 -0.004992017 -0.045805445 0.219303695
cigs smoked change for every unit change in age, educ, lincome?Extract model coefficients and store them in betas:
betas <- coef(m_cigs) betas
(Intercept) age educ lincome 0.797828756 -0.004992017 -0.045805445 0.219303695
Change in the log mean of cigaretts per day for each year of age.
betas['age']
age -0.004992017
The log number of cigs for an 18 year old (after controlling for income and education) is
log_mean <- betas["(Intercept)"] + betas['age'] * 18 log_mean
(Intercept) 0.7079725
What is the log mean (number of cigs) if age = 25, educ = 15, and lincome = 10?
betas <- coef(m_cigs) # assigning the output of the function to an object 'betas' betas
(Intercept) age educ lincome 0.797828756 -0.004992017 -0.045805445 0.219303695
Use the linear regression equation as usual; use indexing for the coefficients stored in betas.
What is the log mean (number of cigs) if age = 25, educ = 15, and lincome = 10?
# Saving the log average of the calculation to the object 'log_mean' log_mean <- betas[1] + betas[2] * 25 + betas[3] * 15 + betas[4] * 10 log_mean
(Intercept) 2.178984
Convert the log average number of cigs to the actual number of cigs.
\(e\) to the power of 2.179 which is \(e^\beta\) in R is exp() is the inverse of logarithm.
# Exponential of log_mean value (2.178984) # Saving this calculation to an object called 'lambda' lambda <- exp(log_mean) lambda # <- average number of cigs
(Intercept)
8.83732
transforms it back
# the opposite of exponential is the natural logarithm log(lambda)
(Intercept) 2.178984
This value (~8.837) is the mean of the distribution of people who have the specific characteristics of age = 25, educ = 15, lincome = 10; average of \(\text{Poisson}(\lambda = 8.837)\).
What is the probability that a person with these characteristics smokes exactly 5 cigs?
We can calculate the probability of obtaining any given value in a Poisson distribution with a given \(\lambda\):
dpois(x = 5, lambda = lambda)
[1] 0.06522625
This is the probability of smoking exactly 5 cigs given the Poisson distribution implied by a hypothetical participant with the above characteristics.
For a hypothetical participant with the following characteristics, age = 18, educ = 4, lincome = 7, what is the probability that they smoke 3, 6, 10 cigarettes?
cigs using the regression equation.dpois to get the probabilityWhich number for x maximizes this probability value?
betas
(Intercept) age educ lincome 0.797828756 -0.004992017 -0.045805445 0.219303695
\(e^{\beta_\text{age}}\) is the multiplicative factor by which the mean changes for each one-unit increase in the predictor (year for age) holding other predictors constant.
exp(betas['age'])
age 0.9950204
If the factor of change is less than than 1 it decreases! (example follows)
What’s the factor which the mean of cigs changes for each one-unit language in lincome?
Alternative way to get these log means is to use predict or add_predictions
Let’s use three different values for age but, lets keep other variables constant.
# New dataframe with 3 values of age (21, 22, 23) but each have an educ of 15, lincome of 10 smoking_df2 <- tibble(age = c(21, 22, 23), educ = 15, lincome = 10) smoking_df2
# A tibble: 3 × 3
age educ lincome
<dbl> <dbl> <dbl>
1 21 15 10
2 22 15 10
3 23 15 10
Predicted log of the mean values for these three different characteristics:
# Unless specified, these will be in the units of the model so logs) library(modelr) add_predictions(data = smoking_df2, model = m_cigs)
# A tibble: 3 × 4
age educ lincome pred
<dbl> <dbl> <dbl> <dbl>
1 21 15 10 2.20
2 22 15 10 2.19
3 23 15 10 2.19
Predicted the mean values for these three different characteristics
# We can specify predictions to be the same unit as the data (number of cigs) add_predictions(data = smoking_df2, model = m_cigs, type = 'response')
# A tibble: 3 × 4
age educ lincome pred
<dbl> <dbl> <dbl> <dbl>
1 21 15 10 9.02
2 22 15 10 8.97
3 23 15 10 8.93
The factor by which the mean changes for a unit change in age was exp(betas['age']) = 0.9950204.
Given the predicted values
add_predictions(data = smoking_df2, model = m_cigs, type = 'response')
# A tibble: 3 × 4
age educ lincome pred
<dbl> <dbl> <dbl> <dbl>
1 21 15 10 9.02
2 22 15 10 8.97
3 23 15 10 8.93
the change between the number of cigarettes for adjacent ages 21 and 22 years
8.970664 / 9.015557 # change in age by 1 unit
[1] 0.9950205
What’s the change between 22 and 23 years?
Implement the following two models with \(\text{cigs}_i \sim \text{Poisson}(\lambda_i)\):
\[
\mathcal{M_1}: \text{log}(\lambda_i) = \beta_0 + \beta_1 \times \log(\text{income}_i)\\
\mathcal{M_2}: \text{log}(\lambda_i) = \beta_0 + \beta_1 \times \log(\text{income}_i) + \beta_2 \times \text{education}_i + \beta_3 \times \text{white}_i
\] We use white (binary), educ (as continuous variable), and the log of income (lincome).
Using anova(..., ...., test = 'Chisq') evaluate whether education and being white has a (combined) association with smoking (after controlling for income).
For the same models \(\mathcal{M}_1\) and \(\mathcal{M}_2\):
deviance function)?pchisq()?Hint: You need the difference in deviance and the difference in number of parameters (as we did for binary logistic regression). Make sure the difference is positive (the deviance / number of parameters of the more complex model can’t be smaller than 0).
Interpretation of p-value i.e. \(Pr(>\chi^2)\): Probability to observe our difference in the deviances (or something more extreme) if both models were to make identical predictions. If p is below 0.05, the models are unlikely to be identical.
For binary logistic regression, we have seen that the deviance \(\mathcal{D}\) and the likelihood \(\mathcal{L}\) have the following correspondence:
\[ \mathcal{D} = -2 \times \log(\mathcal{L}) \]
This isn’t true for Poisson regression: The deviance of model m1 is
deviance(m1)
[1] 15740.23
but for the log likelihood we see that
-2 * logLik(m1)
'log Lik.' 17210.8 (df=2)
However, for differences \(\Delta\) in \(\mathcal{D}\) and \(\mathcal{L}\) the following relation still holds:
\[ \Delta\mathcal{D} = -2 \times \Delta\log(\mathcal{L}) \]
This can be shown for our two models (code for models not included): The difference between the deviance
deviance(m1) - deviance(m2)
[1] 98.51167
is the same as as the negative double for the difference in log likelihoods
-2 * (logLik(m1) - logLik(m2))
'log Lik.' 98.51167 (df=2)