1 Data

We use the data set warpbreaks.

  • This data set gives the number of warp breaks per loom, where a loom corresponds to a fixed length of yarn.
Model vs actual photo of the warp break when weaving.
Model vs actual photo of the warp break when weaving.
remove(list=ls())

?warpbreaks

df <- warpbreaks
library(visdat)
vis_dat(df)

head(df)
##   breaks wool tension
## 1     26    A       L
## 2     30    A       L
## 3     54    A       L
## 4     25    A       L
## 5     70    A       L
## 6     52    A       L
library("psych")
describe(warpbreaks,
         skew = F, 
         ranges = T
         )
##          vars  n  mean    sd median min max range   se
## breaks      1 54 28.15 13.20   26.0  10  70    60 1.80
## wool*       2 54  1.50  0.50    1.5   1   2     1 0.07
## tension*    3 54  2.00  0.82    2.0   1   3     2 0.11
plot(x = df$breaks, 
     ylab = "Breaks"
     )

table(df$tension)
## 
##  L  M  H 
## 18 18 18
table(df$wool)
## 
##  A  B 
## 27 27
  • Considering breaks as the response variable.

  • The wool type and tension are taken as predictor variables.

2 OLS

When the response variable is a count, it is generally not appropriate to use ordinary least squares (OLS) regression, which assumes that the response variable is continuous and normally distributed. Count data often exhibit characteristics such as skewness, overdispersion and a discrete distribution, which violate the assumptions of OLS regression.

If count data is used as the response variable in OLS regression, the resulting model may produce biased and unreliable estimates of the regression coefficients, and the standard errors may be underestimated. In addition, the predicted values may fall outside of the possible range of counts, leading to implausible predictions.

Therefore, it is recommended to use count regression models such as poisson or negative binomial regression instead of OLS regression for count data. These models are specifically designed to handle count data and can account for the inherent characteristics of count data.

ols <- lm(formula = breaks ~ wool + tension,
             data = warpbreaks
         )

par(mfrow = c(2,2))

plot(ols)

par(mfrow = c(1, 1))  # Set the layout to a single plot
dev.off()             # Close the current device (in this case, the 2 by 2 layout)
## null device 
##           1
  • Residual plots do not show any major issues, but we know the OLS estimator will be inefficient, inconsistent and unbiased.

    • Fancier way to do residual analysis charts (user written commands).
require("see")
## Loading required package: see
require("patchwork")
## Loading required package: patchwork
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ ggplot2::%+%()   masks psych::%+%()
## ✖ ggplot2::alpha() masks psych::alpha()
## ✖ dplyr::filter()  masks stats::filter()
## ✖ dplyr::lag()     masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(performance)
performance::check_model(ols)

3 Recap: Distributions

Poisson and negative binomial distributions are related, but they are not equal. However, they can approximate each other under certain conditions.

  • The Poisson distribution is often used to model the number of events occurring in a fixed interval of time or space when these events occur with a known constant rate and independently of the time since the last event. It has a single parameter \(\lambda\) (lambda), which represents the average rate of occurrence.

  • The negative binomial distribution, on the other hand, models the number of failures that occur before a specified number of successes (r) is achieved in a series of independent Bernoulli trials. It has two parameters: the number of successes (r) and the success probability (p).

Under certain conditions, as the number of trials in the negative binomial distribution becomes large and the success probability becomes small, it can approximate the Poisson distribution. This is known as the Poisson approximation of the negative binomial distribution.

  • Specifically, when the number of successes (r) in the negative binomial distribution becomes large and the success probability (p) becomes small, such that r * p approaches a constant (\(\lambda\)), the negative binomial distribution converges to the Poisson distribution with parameter \(\lambda\). This convergence is often referred to as the law of rare events.

https://cran.r-project.org/web/views/Distributions.html

?distribution
## Help on topic 'distribution' was found in the following packages:
## 
##   Package               Library
##   stats                 /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library
##   bayestestR            /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library
## 
## 
## Using the first match ...
x <- 0:100
y_pois <- dpois(x, lambda = 30)
?plot
## Help on topic 'plot' was found in the following packages:
## 
##   Package               Library
##   graphics              /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library
##   base                  /Library/Frameworks/R.framework/Resources/library
## 
## 
## Using the first match ...
plot(x = x, 
     y = y_pois, 
     type = "l", 
     xlab = "Count", 
     ylab = "PMF", 
     main="Poisson Distribution (lambda=30)"
     )

x <- 0:100
y_nb <- dnbinom(x, size = 5, prob = .1)
?plot
## Help on topic 'plot' was found in the following packages:
## 
##   Package               Library
##   graphics              /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library
##   base                  /Library/Frameworks/R.framework/Resources/library
## 
## 
## Using the first match ...
plot(x = x, 
     y = y_nb, 
     type = "l", 
     xlab = "Count", 
     ylab = "PMF", 
     main="Negative Binomial Distribution (size = 5; prob = .1)"
     )

4 Poisson

Implementing the poisson regression model:

  1. Make sure you have replaced lm with glm

  2. Specify the family as poisson or poisson(link = "log")

?glm # glm is used to fit generalized linear models, specified by giving a symbolic description of the linear predictor and a description of the error distribution.

?stats::family
poisson <- glm(formula = breaks ~ wool + tension,
                  data = warpbreaks, 
                family = poisson # a description of the error distribution and link function to be used in the model. 
              )

summary(poisson)
## 
## Call:
## glm(formula = breaks ~ wool + tension, family = poisson, data = warpbreaks)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  3.69196    0.04541  81.302  < 2e-16 ***
## woolB       -0.20599    0.05157  -3.994 6.49e-05 ***
## tensionM    -0.32132    0.06027  -5.332 9.73e-08 ***
## tensionH    -0.51849    0.06396  -8.107 5.21e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 297.37  on 53  degrees of freedom
## Residual deviance: 210.39  on 50  degrees of freedom
## AIC: 493.06
## 
## Number of Fisher Scoring iterations: 4
poisson2 <- glm(formula = breaks ~ wool + tension,
                   data = warpbreaks, 
                 family = poisson(link = "log") 
               ) 

library(stargazer)
## 
## Please cite as:
##  Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
##  R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
stargazer(poisson,poisson2, 
          type="text"
          )
## 
## ==============================================
##                       Dependent variable:     
##                   ----------------------------
##                              breaks           
##                        (1)            (2)     
## ----------------------------------------------
## woolB               -0.206***      -0.206***  
##                      (0.052)        (0.052)   
##                                               
## tensionM            -0.321***      -0.321***  
##                      (0.060)        (0.060)   
##                                               
## tensionH            -0.518***      -0.518***  
##                      (0.064)        (0.064)   
##                                               
## Constant             3.692***      3.692***   
##                      (0.045)        (0.045)   
##                                               
## ----------------------------------------------
## Observations            54            54      
## Log Likelihood       -242.528      -242.528   
## Akaike Inf. Crit.    493.056        493.056   
## ==============================================
## Note:              *p<0.1; **p<0.05; ***p<0.01

Same output, as expected. Better to explicitly specify all arguments.

5 Negative Binomial

The glm.nb function is used to fit a negative binomial regression model, and the MASS package is loaded to provide access to this function.

Implementing the negative binomial regression model:

# Fit negative binomial regression model
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
## The following object is masked from 'package:patchwork':
## 
##     area
?glm.nb # A modification of the system function glm() to include estimation of the additional parameter, theta, for a Negative Binomial generalized linear model.

negative_binomial <- glm.nb(formula = breaks ~ wool + tension,
                               data = warpbreaks
                             )

negative_binomial2 <- glm.nb(formula = breaks ~ wool + tension,
                                data = warpbreaks,
                                link = log
                             )



stargazer(negative_binomial, negative_binomial2, 
          type="text"
          )
## 
## ===================================================
##                          Dependent variable:       
##                   ---------------------------------
##                                breaks              
##                         (1)              (2)       
## ---------------------------------------------------
## woolB                 -0.186*          -0.186*     
##                       (0.101)          (0.101)     
##                                                    
## tensionM              -0.299**         -0.299**    
##                       (0.122)          (0.122)     
##                                                    
## tensionH             -0.511***        -0.511***    
##                       (0.124)          (0.124)     
##                                                    
## Constant              3.673***         3.673***    
##                       (0.098)          (0.098)     
##                                                    
## ---------------------------------------------------
## Observations             54               54       
## Log Likelihood        -200.382         -200.382    
## theta             9.944*** (2.561) 9.944*** (2.561)
## Akaike Inf. Crit.     408.764          408.764     
## ===================================================
## Note:                   *p<0.1; **p<0.05; ***p<0.01
?distribution # shows you negative binomial distribution.
## Help on topic 'distribution' was found in the following packages:
## 
##   Package               Library
##   stats                 /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library
##   bayestestR            /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library
## 
## 
## Using the first match ...

5.1 Negative Binomial or Poisson?

Over-dispersion is a common issue that arises in count data analysis, where the variance of the count variable is greater than its mean, which violates the assumption of a Poisson distribution.

Over-dispersion can occur due to several reasons, such as the presence of unobserved heterogeneity, extra zeros, or clustering of the counts. When over-dispersion is present, the standard Poisson regression model may produce biased and inefficient estimates of the regression coefficients, and the model fit may be poor.

To account for over-dispersion in count data, a common approach is to use a negative binomial regression model, which allows for the variance of the count variable to be greater than its mean. This model adds an additional parameter, known as the dispersion parameter, that estimates the amount of over-dispersion present in the data.

Another approach to account for over-dispersion is to use a quasi-Poisson regression model, which assumes that the variance of the count variable is a function of its mean. This model does not explicitly model the over-dispersion parameter but allows for the variance to be greater than the mean, providing more flexibility than the standard Poisson model.

In summary, over-dispersion is a common issue in count data analysis, where the variance of the count variable is greater than its mean. To account for over-dispersion, negative binomial regression or quasi-Poisson regression models can be used instead of the standard Poisson regression model.

6 Compare Models

stargazer(ols, poisson, negative_binomial,
          type  = "text",
          title = "Regression Results"
          )
## 
## Regression Results
## ====================================================================
##                                   Dependent variable:               
##                     ------------------------------------------------
##                                          breaks                     
##                              OLS           Poisson      negative    
##                                                         binomial    
##                              (1)             (2)          (3)       
## --------------------------------------------------------------------
## woolB                      -5.778*        -0.206***     -0.186*     
##                            (3.162)         (0.052)      (0.101)     
##                                                                     
## tensionM                  -10.000**       -0.321***     -0.299**    
##                            (3.872)         (0.060)      (0.122)     
##                                                                     
## tensionH                 -14.722***       -0.518***    -0.511***    
##                            (3.872)         (0.064)      (0.124)     
##                                                                     
## Constant                  39.278***       3.692***      3.673***    
##                            (3.162)         (0.045)      (0.098)     
##                                                                     
## --------------------------------------------------------------------
## Observations                 54              54            54       
## R2                          0.269                                   
## Adjusted R2                 0.225                                   
## Log Likelihood                            -242.528      -200.382    
## theta                                               9.944*** (2.561)
## Akaike Inf. Crit.                          493.056      408.764     
## Residual Std. Error   11.617 (df = 50)                              
## F Statistic         6.138*** (df = 3; 50)                           
## ====================================================================
## Note:                                    *p<0.1; **p<0.05; ***p<0.01

7 Interpretation

The interpretation is in terms of the multiplicative effect on the rate of the event (e.g., warp breaks).

7.1 Poisson

woolB​: -0.206

  • Exponentiating the coefficient: \(exp(-0.206) \approx 0.83\)

  • Interpretation: Using wool B (compared to wool A) results in an decrease in expected count of warp breaks by a factor of 0.83, keeping everything else constant.

  • Specifically, it means the expected count of warp breaks using wool B is about 83% of the count using wool A, implying a reduction of approximately 17% {\(({1-e^{-0.206}})*100 \%\)}, cetrius paribus.

7.2 Negative Binomial

woolB​: -0.186

  • Exponentiating the coefficient: \(exp(-0.186) \approx 0.814\)

  • Interpretation: Using wool B (compared to wool A) results in an decrease in expected count of warp breaks by a factor of 0.814, keeping everything else constant.

  • Specifically, it means the expected count of warp breaks using wool B is about 81% of the count using wool A, implying a reduction of approximately 19% {\(({1-e^{-0.186}})*100 \%\)}, cetrius paribus.

7.3 Additional Models

Try zero-inflated poisson and zero-inflated negative binomial model as well.

7.4 References

  1. Negative Binomial

  2. Poisson Code