1 Data

We use the data set warpbreaks.

This data set gives the number of warp breaks per loom, where a loom corresponds to a fixed length of yarn.

remove(list=ls())

?warpbreaks

df <- warpbreaks
library(visdat)
vis_dat(df)

head(df)

##   breaks wool tension
## 1     26    A       L
## 2     30    A       L
## 3     54    A       L
## 4     25    A       L
## 5     70    A       L
## 6     52    A       L

library("psych")
describe(warpbreaks,
         skew = F, 
         ranges = T
         )

##          vars  n  mean    sd min max range   se
## breaks      1 54 28.15 13.20  10  70    60 1.80
## wool*       2 54  1.50  0.50   1   2     1 0.07
## tension*    3 54  2.00  0.82   1   3     2 0.11

plot(x = df$breaks, 
     ylab = "Breaks"
     )

table(df$tension)

## 
##  L  M  H 
## 18 18 18

table(df$wool)

## 
##  A  B 
## 27 27

Considering “breaks” as the response variable.
The wool “type” and “tension” are taken as predictor variables.

2 OLS

When the response variable is a count, it is generally not appropriate to use ordinary least squares (OLS) regression, which assumes that the response variable is continuous and normally distributed. Count data often exhibit characteristics such as skewness, overdispersion and a discrete distribution, which violate the assumptions of OLS regression.

If count data is used as the response variable in OLS regression, the resulting model may produce biased and unreliable estimates of the regression coefficients, and the standard errors may be underestimated. In addition, the predicted values may fall outside of the possible range of counts, leading to implausible predictions.

Therefore, it is recommended to use count regression models such as Poisson or negative binomial regression instead of OLS regression for count data. These models are specifically designed to handle count data and can account for the inherent characteristics of count data.

ols <- lm(formula = breaks ~ wool + tension,
             data = warpbreaks
         )

par(mfrow = c(2,2))

plot(ols)

par(mfrow = c(1, 1))  # Set the layout to a single plot
dev.off()            # Close the current device (in this case, the 2 by 2 layout)

## null device 
##           1

Residual plots do not show any major issues, but we know the OLS estimator will be inefficient, inconsistent and unbiased.

Fancier way to do residual analysis charts (user written commands).

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ ggplot2::%+%()   masks psych::%+%()
## ✖ ggplot2::alpha() masks psych::alpha()
## ✖ dplyr::filter()  masks stats::filter()
## ✖ dplyr::lag()     masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(performance)
performance::check_model(ols)

3 Poisson

Implementing the poisson regression model:

Make sure you have replaced lm with glm
Specify the family as poisson or poisson(link = "log")

?glm # glm is used to fit generalized linear models, specified by giving a symbolic description of the linear predictor and a description of the error distribution.

?stats::family
poisson <- glm(formula = breaks ~ wool + tension,
                  data = warpbreaks, 
                family = poisson # a description of the error distribution and link function to be used in the model. 
              )

summary(poisson)

## 
## Call:
## glm(formula = breaks ~ wool + tension, family = poisson, data = warpbreaks)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.6871  -1.6503  -0.4269   1.1902   4.2616  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  3.69196    0.04541  81.302  < 2e-16 ***
## woolB       -0.20599    0.05157  -3.994 6.49e-05 ***
## tensionM    -0.32132    0.06027  -5.332 9.73e-08 ***
## tensionH    -0.51849    0.06396  -8.107 5.21e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 297.37  on 53  degrees of freedom
## Residual deviance: 210.39  on 50  degrees of freedom
## AIC: 493.06
## 
## Number of Fisher Scoring iterations: 4

poisson2 <- glm(formula = breaks ~ wool + tension,
                   data = warpbreaks, 
                 family = poisson(link = "log") 
               ) 

library(stargazer)

## 
## Please cite as:

##  Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.

##  R package version 5.2.3. https://CRAN.R-project.org/package=stargazer

stargazer(poisson,poisson2, 
          type="text"
          )

## 
## ==============================================
##                       Dependent variable:     
##                   ----------------------------
##                              breaks           
##                        (1)            (2)     
## ----------------------------------------------
## woolB               -0.206***      -0.206***  
##                      (0.052)        (0.052)   
##                                               
## tensionM            -0.321***      -0.321***  
##                      (0.060)        (0.060)   
##                                               
## tensionH            -0.518***      -0.518***  
##                      (0.064)        (0.064)   
##                                               
## Constant             3.692***      3.692***   
##                      (0.045)        (0.045)   
##                                               
## ----------------------------------------------
## Observations            54            54      
## Log Likelihood       -242.528      -242.528   
## Akaike Inf. Crit.    493.056        493.056   
## ==============================================
## Note:              *p<0.1; **p<0.05; ***p<0.01

4 Negative Binomial

The glm.nb function is used to fit a negative binomial regression model, and the MASS package is loaded to provide access to this function.

implementing the negative binomial regression model:

# Fit negative binomial regression model
library(MASS)

## 
## Attaching package: 'MASS'

## The following object is masked from 'package:dplyr':
## 
##     select

?glm.nb # A modification of the system function glm() to include estimation of the additional parameter, theta, for a Negative Binomial generalized linear model.

negative_binomial <- glm.nb(formula = breaks ~ wool + tension,
                               data = warpbreaks
                             )

negative_binomial2 <- glm.nb(formula = breaks ~ wool + tension,
                                data = warpbreaks,
                                link = log, 
                          init.theta = 7
                             )



stargazer(negative_binomial, negative_binomial2, 
          type="text"
          )

## 
## ===================================================
##                          Dependent variable:       
##                   ---------------------------------
##                                breaks              
##                         (1)              (2)       
## ---------------------------------------------------
## woolB                 -0.186*          -0.186*     
##                       (0.101)          (0.101)     
##                                                    
## tensionM              -0.299**         -0.299**    
##                       (0.122)          (0.122)     
##                                                    
## tensionH             -0.511***        -0.511***    
##                       (0.124)          (0.124)     
##                                                    
## Constant              3.673***         3.673***    
##                       (0.098)          (0.098)     
##                                                    
## ---------------------------------------------------
## Observations             54               54       
## Log Likelihood        -200.382         -200.382    
## theta             9.944*** (2.561) 9.944*** (2.561)
## Akaike Inf. Crit.     408.764          408.764     
## ===================================================
## Note:                   *p<0.1; **p<0.05; ***p<0.01

?distribution # shows you negative binomial distribution.

## Help on topic 'distribution' was found in the following packages:
## 
##   Package               Library
##   bayestestR            /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library
##   stats                 /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library
## 
## 
## Using the first match ...

4.1 Negative Binomial or Poisson?

Over-dispersion is a common issue that arises in count data analysis, where the variance of the count variable is greater than its mean, which violates the assumption of a Poisson distribution.

Over-dispersion can occur due to several reasons, such as the presence of unobserved heterogeneity, extra zeros, or clustering of the counts. When over-dispersion is present, the standard Poisson regression model may produce biased and inefficient estimates of the regression coefficients, and the model fit may be poor.

To account for over-dispersion in count data, a common approach is to use a negative binomial regression model, which allows for the variance of the count variable to be greater than its mean. This model adds an additional parameter, known as the dispersion parameter, that estimates the amount of over-dispersion present in the data.

Another approach to account for over-dispersion is to use a quasi-Poisson regression model, which assumes that the variance of the count variable is a function of its mean. This model does not explicitly model the over-dispersion parameter but allows for the variance to be greater than the mean, providing more flexibility than the standard Poisson model.

In summary, over-dispersion is a common issue in count data analysis, where the variance of the count variable is greater than its mean. To account for over-dispersion, negative binomial regression or quasi-Poisson regression models can be used instead of the standard Poisson regression model.

5 Compare Models

stargazer(ols, poisson, negative_binomial,
          type  = "text",
          title = "Regression Results"
          )

## 
## Regression Results
## ====================================================================
##                                   Dependent variable:               
##                     ------------------------------------------------
##                                          breaks                     
##                              OLS           Poisson      negative    
##                                                         binomial    
##                              (1)             (2)          (3)       
## --------------------------------------------------------------------
## woolB                      -5.778*        -0.206***     -0.186*     
##                            (3.162)         (0.052)      (0.101)     
##                                                                     
## tensionM                  -10.000**       -0.321***     -0.299**    
##                            (3.872)         (0.060)      (0.122)     
##                                                                     
## tensionH                 -14.722***       -0.518***    -0.511***    
##                            (3.872)         (0.064)      (0.124)     
##                                                                     
## Constant                  39.278***       3.692***      3.673***    
##                            (3.162)         (0.045)      (0.098)     
##                                                                     
## --------------------------------------------------------------------
## Observations                 54              54            54       
## R2                          0.269                                   
## Adjusted R2                 0.225                                   
## Log Likelihood                            -242.528      -200.382    
## theta                                               9.944*** (2.561)
## Akaike Inf. Crit.                          493.056      408.764     
## Residual Std. Error   11.617 (df = 50)                              
## F Statistic         6.138*** (df = 3; 50)                           
## ====================================================================
## Note:                                    *p<0.1; **p<0.05; ***p<0.01

6 Interpretation

The interpretation is in terms of the multiplicative effect on the rate of the event (e.g., warp breaks).

6.1 Poisson

woolB: -0.206

Exponentiating the coefficient: \(exp(-0.206) \approx 0.814\)
Interpretation: Using wool B (compared to wool A) results in an decrease in expected rate of warp breaks by approximately \({1-e^{-0.206}}*100 \%\) i.e. 26 %, cetrius paribus.

6.2 Negative Binomial

woolB: -0.186

Exponentiating the coefficient: \(exp(-0.186) \approx 0.814\)
Interpretation: Using wool B (compared to wool A) results in an decrease in expected rate of warp breaks by approximately \(1-e^{-0.186} *100\%\) i.e =17%, cetrius paribus.

6.3 References

Negative Binomial
Poisson Code

Count Models

Arvind Sharma

2023-11-18