- Another type of generalised linear model
- Type of count model (one of three we will be looking at)
Number of crimes committed by someone
Number of times someone got sent to jail
Number of times someone had an affair
Sometimes / often people treat counts as a continuous variable
Sometimes that’s fine (large quantities) sometimes it is not (small quantities, i.e. when events are rare)
In a Poison model
\[ \begin{aligned} y &\sim \text{Poisson}(\lambda_i) \end{aligned} \]
we assume that \(y\) is distributed following a Poisson process.
When the rate changes as a function of changes in predictor variables, we require a logarithmic link function to map the rate \(\lambda\) to a linear function of a set of predictors:
\[ \log(\lambda_i) = \beta_0 + \beta_1 \times x_i \]
Similar to binary logistic regression where the probability is not a linear function of the predictor variables but the log odds are.
We will be focusing on cigs: number of cigarettes smoked in a particular period of time (per day).
# Load the smoking data smoking_df <- read_csv( "https://raw.githubusercontent.com/mark-andrews/ntupsychology-data/main/data-sets/smoking.csv") # Check variables in data glimpse(smoking_df)
Rows: 807 Columns: 10 $ educ <dbl> 16.0, 16.0, 12.0, 13.5, 10.0, 6.0, 12.0, 15.0, 12.0, 12.0, 13… $ cigpric <dbl> 60.506, 57.883, 57.664, 57.883, 58.320, 59.340, 57.883, 57.88… $ white <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1… $ age <dbl> 46, 40, 58, 30, 17, 86, 35, 48, 48, 31, 27, 24, 71, 44, 22, 2… $ income <dbl> 20000, 30000, 30000, 20000, 20000, 6500, 20000, 30000, 20000,… $ cigs <dbl> 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 10, 20, 30, 0, 0, 20, 0, 30, 0,… $ restaurn <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… $ lincome <dbl> 9.903487, 10.308952, 10.308952, 9.903487, 9.903487, 8.779557,… $ agesq <dbl> 2116, 1600, 3364, 900, 289, 7396, 1225, 2304, 2304, 961, 729,… $ lcigpric <dbl> 4.102743, 4.058424, 4.054633, 4.058424, 4.065945, 4.083283, 4…
Number of each value of cigs: does any value stand out?
Distribution family is Poisson and link is log (default).
m_cigs_0 <- glm(cigs ~ 1,
data = smoking_df,
family = poisson(link = "log"))
coef(m_cigs_0)
(Intercept) 2.161769
Hence, the intercept is the log number of cigarettes.
Coefficient of intercept only model is the same as the log of the average (mean) number of cigs.
# Average of cigs d_mean <- describe(data = smoking_df, avg = mean(cigs)) # Add log of the average mutate(d_mean, log_avg = log(avg))
# A tibble: 1 × 2
avg log_avg
<dbl> <dbl>
1 8.69 2.16
How can we turn the intercept estimate into “number of cigarettes”? Remember odds vs log odds.
coef(m_cigs_0)
(Intercept) 2.161769
How does the number of cigs smoked vary as a function of age, education and income as predictors?
m_cigs <- glm(cigs ~ age + educ + lincome, data = smoking_df, family = poisson(link = "log"))
summary(m_cigs)
Call:
glm(formula = cigs ~ age + educ + lincome, family = poisson(link = "log"),
data = smoking_df)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.7978288 0.1858839 4.292 1.77e-05 ***
age -0.0049920 0.0007308 -6.831 8.44e-12 ***
educ -0.0458054 0.0042516 -10.774 < 2e-16 ***
lincome 0.2193037 0.0195039 11.244 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 15821 on 806 degrees of freedom
Residual deviance: 15596 on 803 degrees of freedom
AIC: 17075
Number of Fisher Scoring iterations: 6
coef(m_cigs)
(Intercept) age educ lincome 0.797828756 -0.004992017 -0.045805445 0.219303695
cigs smoked change for every unit change in age, educ, lincome?Extract model coefficients and store them im betas:
betas <- coef(m_cigs) betas
(Intercept) age educ lincome 0.797828756 -0.004992017 -0.045805445 0.219303695
Change in the log of the mean for each year of age.
betas['age']
age -0.004992017
Extract model coefficients and store them im betas:
betas <- coef(m_cigs) betas
(Intercept) age educ lincome 0.797828756 -0.004992017 -0.045805445 0.219303695
\(e^{\beta_\text{age}}\) is the multiplicative factor by which the mean changes for each one-unit increase in the predictor (year for age) holding other predictors constant.
exp(betas['age'])
age 0.9950204
Lower and upper bound of 95% confidence interval for the logarithm of the rate cigs for age
confint.default(m_cigs, parm='age')
2.5 % 97.5 % age -0.00642435 -0.003559684
What is the 95% confidence interval for the factor by which the rate of cigs smokes change with age?
exp(confint.default(m_cigs, parm='age'))
2.5 % 97.5 % age 0.9935962 0.9964466
If a factor of change is less than than 1 it will always decrease! (example follows)
Alternative way to get these log means is to use predict or add_predictions
Let’s use three different values for age but, lets keep other variables constant.
# New dataframe with 3 values of age (21, 22, 23) but each have an educ of 15, lincome of 10 smoking_df2 <- tibble(age = c(21, 22, 23), educ = 15, lincome = 10) smoking_df2
# A tibble: 3 × 3
age educ lincome
<dbl> <dbl> <dbl>
1 21 15 10
2 22 15 10
3 23 15 10
Predicted log of the mean values for these three different characteristics:
# Unless specified, these will be in the units of the model so logs) library(modelr) add_predictions(data = smoking_df2, model = m_cigs)
# A tibble: 3 × 4
age educ lincome pred
<dbl> <dbl> <dbl> <dbl>
1 21 15 10 2.20
2 22 15 10 2.19
3 23 15 10 2.19
Predicted the mean values for these three different characteristics
# We can specify predictions to be the same unit as the data (number of cigs) add_predictions(data = smoking_df2, model = m_cigs, type = 'response')
# A tibble: 3 × 4
age educ lincome pred
<dbl> <dbl> <dbl> <dbl>
1 21 15 10 9.02
2 22 15 10 8.97
3 23 15 10 8.93
The factor by which the mean changes for a unit change in age was exp(betas['age']) = 0.9950204.
Given the predicted values
add_predictions(data = smoking_df2, model = m_cigs, type = 'response')
# A tibble: 3 × 4
age educ lincome pred
<dbl> <dbl> <dbl> <dbl>
1 21 15 10 9.02
2 22 15 10 8.97
3 23 15 10 8.93
the change between the number of cigarettes for adjacent ages 21 and 22 years
8.970664 / 9.015557 # change in age by 1 unit
[1] 0.9950205
What’s the change between 22 and 23 years?
What is the log mean (number of cigs) if age = 25, educ = 15, and lincome = 10?
Use the linear regression equation as usual; use indexing for the coefficients stored in betas.
betas <- coef(m_cigs) # assigning the output of the function to an object 'betas' betas
(Intercept) age educ lincome 0.797828756 -0.004992017 -0.045805445 0.219303695
What is the log mean (number of cigs) if age = 25, educ = 15, and lincome = 10?
# Saving the log average of the calculation to the object 'log_mean' log_mean <- betas[1] + betas[2] * 25 + betas[3] * 15 + betas[4] * 10 log_mean
(Intercept) 2.178984
Convert the log average number of cigs to the actual number of cigs.
\(e\) to the power of 2.179 which is \(e^\beta\) in R is exp() is the inverse of logarithm.
# Exponential of log_mean value (2.178984) # Saving this calculation to an object called 'lambda' lambda <- exp(log_mean) lambda # <- average number of cigs
(Intercept)
8.83732
transforms it back
# the opposite of exponential is the natural logarithm log(lambda)
(Intercept) 2.178984
This value (~8.837) is the mean of the distribution of people who have the specific characteristics of age = 25, educ = 15, lincome = 10; average of \(\text{Poisson}(\lambda = 8.837)\).
What is the probability that a person with these characteristics smokes exactly 5 cigs?
We can calculate the probability of obtaining any given value in a Poisson distribution with a given \(\lambda\):
dpois(x = 5, lambda = lambda)
[1] 0.06522625
This is the probability of smoking exactly 5 cigs given the Poisson distribution implied by a hypothetical participant with the above characteristics.
For a hypothetical participant with the following characteristics, age = 18, educ = 4, lincome = 7, what is the probability that they smoke 3, 6, 10 cigarettes?
cigs using the regression equation.dpois to get the probabilityWhich number for x maximizes this probability value?
Implement the following two models with \(\text{cigs}_i \sim \text{Poisson}(\lambda_i)\):
\[
\mathcal{M_1}: \text{log}(\lambda_i) = \beta_0 + \beta_1 \times \log(\text{income})\\
\mathcal{M_2}: \text{log}(\lambda_i) = \beta_0 + \beta_1 \times \log(\text{income}) + \beta_2 \times \text{education} + \beta_3 \times \text{white}
\] We use white (binary), educ (as continuous variable), and the log of income (lincome).
Using anova(..., ...., test = 'Chisq') evaluate whether education and being white has a (combined) association with smoking (after controlling for income).
What is the deviance of each of the two models?
deviance() (the lower the deviance the better fit the model).pchisq()