We use the warpbreaks data set, which records the number of warp breaks per loom for two types of wool and three levels of tension.
rm(list = ls())  # clear the workspace
?warpbreaks      # help page for the data set
df <- warpbreaks
library(visdat)
vis_dat(df)      # visualise variable types and missing values
head(df)
##   breaks wool tension
## 1     26    A       L
## 2     30    A       L
## 3     54    A       L
## 4     25    A       L
## 5     70    A       L
## 6     52    A       L
library(psych)
describe(warpbreaks,
         skew = FALSE,
         ranges = TRUE
)
##          vars  n  mean    sd median min max range   se
## breaks      1 54 28.15 13.20   26.0  10  70    60 1.80
## wool*       2 54  1.50  0.50    1.5   1   2     1 0.07
## tension*    3 54  2.00  0.82    2.0   1   3     2 0.11
plot(x = df$breaks,
     ylab = "Breaks"
)
table(df$tension)
##
## L M H
## 18 18 18
table(df$wool)
##
## A B
## 27 27
We take breaks as the response variable, with wool type and tension as the predictor variables.
When the response variable is a count, it is generally not appropriate to use ordinary least squares (OLS) regression, which assumes that the response variable is continuous and normally distributed. Count data often exhibit characteristics such as skewness, overdispersion and a discrete distribution, which violate the assumptions of OLS regression.
If count data is used as the response variable in OLS regression, the resulting model may produce biased and unreliable estimates of the regression coefficients, and the standard errors may be underestimated. In addition, the predicted values may fall outside of the possible range of counts, leading to implausible predictions.
Therefore, it is recommended to use count regression models such as Poisson or negative binomial regression instead of OLS regression for count data. These models are specifically designed for counts and account for their inherent characteristics.
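As a quick sanity check of these characteristics in the warpbreaks data, we can compare the mean and variance of the response and look at its distribution (a minimal sketch using base R only):
mean(warpbreaks$breaks)  # about 28.1
var(warpbreaks$breaks)   # about 174, far larger than the mean, suggesting over-dispersion
hist(warpbreaks$breaks,
     main = "Distribution of breaks",
     xlab = "Breaks"
)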
ols <- lm(formula = breaks ~ wool + tension,
          data = warpbreaks
)
par(mfrow = c(2,2))
plot(ols)
par(mfrow = c(1, 1)) # Set the layout to a single plot
dev.off() # Close the current device (in this case, the 2 by 2 layout)
## null device
## 1
Residual plots do not show any major issues, but with a count response the OLS estimator is inefficient and can produce biased estimates and implausible (e.g., negative) predictions.
require("see")
## Loading required package: see
require("patchwork")
## Loading required package: patchwork
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ ggplot2::%+%() masks psych::%+%()
## ✖ ggplot2::alpha() masks psych::alpha()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(performance)
performance::check_model(ols)
Poisson and negative binomial distributions are related, but they are not equal. However, they can approximate each other under certain conditions.
The Poisson distribution is often used to model the number of events occurring in a fixed interval of time or space when these events occur with a known constant rate and independently of the time since the last event. It has a single parameter \(\lambda\) (lambda), which represents the average rate of occurrence.
The negative binomial distribution, on the other hand, models the number of failures that occur before a specified number of successes (r) is achieved in a series of independent Bernoulli trials. It has two parameters: the number of successes (r) and the success probability (p).
Under certain conditions the negative binomial can approximate the Poisson: as the required number of successes \(r\) becomes large and the failure probability \(1 - p\) becomes small, with the mean \(r(1-p)/p\) held fixed, the negative binomial distribution converges to the Poisson distribution. Equivalently, in the mean and dispersion parameterisation, the negative binomial approaches the Poisson as its dispersion parameter (size) tends to infinity. This is known as the Poisson approximation of the negative binomial distribution.
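As a small illustration of this approximation (a sketch using the alternative mean parameterisation of dnbinom() via its mu argument): with the mean held fixed, the negative binomial PMF approaches the Poisson PMF as size grows large.
x <- 0:100
plot(x, dpois(x, lambda = 30),
     type = "l",
     xlab = "Count",
     ylab = "PMF",
     main = "Poisson(30) vs negative binomial with mu = 30"
)
lines(x, dnbinom(x, size = 5, mu = 30), lty = 2)     # clearly over-dispersed relative to the Poisson
lines(x, dnbinom(x, size = 1000, mu = 30), lty = 3)  # nearly indistinguishable from the Poisson
legend("topright",
       legend = c("Poisson(30)", "NB, size = 5", "NB, size = 1000"),
       lty = 1:3
)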
https://cran.r-project.org/web/views/Distributions.html
?distribution
## Help on topic 'distribution' was found in the following packages:
##
## Package Library
## stats /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library
## bayestestR /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library
##
##
## Using the first match ...
x <- 0:100
y_pois <- dpois(x, lambda = 30)
?plot
## Help on topic 'plot' was found in the following packages:
##
## Package Library
## graphics /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library
## base /Library/Frameworks/R.framework/Resources/library
##
##
## Using the first match ...
plot(x = x,
     y = y_pois,
     type = "l",
     xlab = "Count",
     ylab = "PMF",
     main = "Poisson Distribution (lambda = 30)"
)
x <- 0:100
y_nb <- dnbinom(x, size = 5, prob = .1)
plot(x = x,
     y = y_nb,
     type = "l",
     xlab = "Count",
     ylab = "PMF",
     main = "Negative Binomial Distribution (size = 5; prob = .1)"
)
Implementing the Poisson regression model: make sure you have replaced lm with glm, and specify the family as poisson or poisson(link = "log").
?glm # glm is used to fit generalized linear models, specified by giving a symbolic description of the linear predictor and a description of the error distribution.
?stats::family
poisson <- glm(formula = breaks ~ wool + tension,
               data = warpbreaks,
               family = poisson  # a description of the error distribution and link function to be used in the model
)
summary(poisson)
##
## Call:
## glm(formula = breaks ~ wool + tension, family = poisson, data = warpbreaks)
##
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  3.69196    0.04541  81.302  < 2e-16 ***
## woolB       -0.20599    0.05157  -3.994 6.49e-05 ***
## tensionM    -0.32132    0.06027  -5.332 9.73e-08 ***
## tensionH    -0.51849    0.06396  -8.107 5.21e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 297.37 on 53 degrees of freedom
## Residual deviance: 210.39 on 50 degrees of freedom
## AIC: 493.06
##
## Number of Fisher Scoring iterations: 4
poisson2 <- glm(formula = breaks ~ wool + tension,
                data = warpbreaks,
                family = poisson(link = "log")
)
library(stargazer)
##
## Please cite as:
## Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
## R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
stargazer(poisson, poisson2,
          type = "text"
)
## 
## =================================================
##                        Dependent variable:       
##                   -------------------------------
##                               breaks             
##                        (1)             (2)       
## -------------------------------------------------
## woolB                -0.206***       -0.206***   
##                      (0.052)         (0.052)     
## 
## tensionM             -0.321***       -0.321***   
##                      (0.060)         (0.060)     
## 
## tensionH             -0.518***       -0.518***   
##                      (0.064)         (0.064)     
## 
## Constant              3.692***        3.692***   
##                      (0.045)         (0.045)     
## 
## -------------------------------------------------
## Observations             54              54      
## Log Likelihood        -242.528        -242.528   
## Akaike Inf. Crit.      493.056         493.056   
## =================================================
## Note:                 *p<0.1; **p<0.05; ***p<0.01
Same output, as expected. Better to explicitly specify all arguments.
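A quick way to confirm that the two specifications coincide (the log link is the default for the Poisson family):
all.equal(coef(poisson), coef(poisson2))  # TRUE: identical coefficient estimates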
The glm.nb function is used to fit a negative binomial regression model; the MASS package is loaded to provide access to this function.
Implementing the negative binomial regression model:
# Fit negative binomial regression model
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
## The following object is masked from 'package:patchwork':
##
## area
?glm.nb # A modification of the system function glm() to include estimation of the additional parameter, theta, for a Negative Binomial generalized linear model.
negative_binomial <- glm.nb(formula = breaks ~ wool + tension,
                            data = warpbreaks
)
negative_binomial2 <- glm.nb(formula = breaks ~ wool + tension,
                             data = warpbreaks,
                             link = log
)
stargazer(negative_binomial, negative_binomial2,
          type = "text"
)
## 
## ========================================================
##                            Dependent variable:          
##                   --------------------------------------
##                                  breaks                 
##                          (1)                 (2)        
## --------------------------------------------------------
## woolB                  -0.186*             -0.186*      
##                        (0.101)             (0.101)      
## 
## tensionM               -0.299**            -0.299**     
##                        (0.122)             (0.122)      
## 
## tensionH               -0.511***           -0.511***    
##                        (0.124)             (0.124)      
## 
## Constant                3.673***            3.673***    
##                        (0.098)             (0.098)      
## 
## --------------------------------------------------------
## Observations               54                  54       
## Log Likelihood          -200.382            -200.382    
## theta               9.944*** (2.561)    9.944*** (2.561)
## Akaike Inf. Crit.       408.764             408.764     
## ========================================================
## Note:                        *p<0.1; **p<0.05; ***p<0.01
?distribution # the stats help page lists the available distributions, including the negative binomial.
## Help on topic 'distribution' was found in the following packages:
##
## Package Library
## stats /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library
## bayestestR /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library
##
##
## Using the first match ...
Over-dispersion is a common issue that arises in count data analysis, where the variance of the count variable is greater than its mean, which violates the assumption of a Poisson distribution.
Over-dispersion can occur due to several reasons, such as the presence of unobserved heterogeneity, extra zeros, or clustering of the counts. When over-dispersion is present, the standard Poisson regression model may produce biased and inefficient estimates of the regression coefficients, and the model fit may be poor.
To account for over-dispersion in count data, a common approach is to use a negative binomial regression model, which allows for the variance of the count variable to be greater than its mean. This model adds an additional parameter, known as the dispersion parameter, that estimates the amount of over-dispersion present in the data.
Another approach to account for over-dispersion is to use a quasi-Poisson regression model, which assumes that the variance of the count variable is proportional to its mean. Rather than modelling the counts with a different distribution, it estimates a dispersion parameter that scales the variance (and hence the standard errors) relative to the standard Poisson model, providing more flexibility than the standard Poisson model.
In summary, over-dispersion is a common issue in count data analysis, where the variance of the count variable is greater than its mean. To account for over-dispersion, negative binomial regression or quasi-Poisson regression models can be used instead of the standard Poisson regression model.
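A minimal sketch of both ideas, assuming the poisson model fitted above is still in the workspace: the ratio of the residual deviance to its degrees of freedom gives a rough dispersion check (values well above 1 indicate over-dispersion), and family = quasipoisson fits the quasi-Poisson alternative. The performance package also provides check_overdispersion() for a more formal test.
deviance(poisson) / df.residual(poisson)    # about 4.2 here, well above 1
performance::check_overdispersion(poisson)  # formal test for over-dispersion

quasi_poisson <- glm(formula = breaks ~ wool + tension,
                     data = warpbreaks,
                     family = quasipoisson(link = "log")
)
summary(quasi_poisson)  # same coefficients as the Poisson fit, but inflated standard errors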
stargazer(ols, poisson, negative_binomial,
          type = "text",
          title = "Regression Results"
)
## 
## Regression Results
## ========================================================================
##                                        Dependent variable:             
##                     ----------------------------------------------------
##                                              breaks                     
##                              OLS              Poisson        negative   
##                                                               binomial  
##                              (1)                (2)             (3)     
## ------------------------------------------------------------------------
## woolB                      -5.778*           -0.206***        -0.186*   
##                            (3.162)            (0.052)         (0.101)   
## 
## tensionM                  -10.000**          -0.321***        -0.299**  
##                            (3.872)            (0.060)         (0.122)   
## 
## tensionH                  -14.722***         -0.518***       -0.511***  
##                            (3.872)            (0.064)         (0.124)   
## 
## Constant                   39.278***          3.692***        3.673***  
##                            (3.162)            (0.045)         (0.098)   
## 
## ------------------------------------------------------------------------
## Observations                  54                 54              54     
## R2                           0.269                                      
## Adjusted R2                  0.225                                      
## Log Likelihood                                -242.528        -200.382  
## theta                                                     9.944*** (2.561)
## Akaike Inf. Crit.                              493.056         408.764  
## Residual Std. Error     11.617 (df = 50)                                
## F Statistic           6.138*** (df = 3; 50)                             
## ========================================================================
## Note:                                      *p<0.1; **p<0.05; ***p<0.01
The interpretation is in terms of the multiplicative effect on the rate of the event (e.g., warp breaks).
Poisson model, woolB: -0.206
Exponentiating the coefficient: \(exp(-0.206) \approx 0.814\).
Interpretation: using wool B (compared with wool A) decreases the expected count of warp breaks by a factor of 0.814, keeping everything else constant.
Specifically, the expected count of warp breaks with wool B is about 81% of the count with wool A, implying a reduction of approximately 19% {\(({1-e^{-0.206}})*100 \%\)}, ceteris paribus.
Negative binomial model, woolB: -0.186
Exponentiating the coefficient: \(exp(-0.186) \approx 0.830\).
Interpretation: using wool B (compared with wool A) decreases the expected count of warp breaks by a factor of 0.830, keeping everything else constant.
Specifically, the expected count of warp breaks with wool B is about 83% of the count with wool A, implying a reduction of approximately 17% {\(({1-e^{-0.186}})*100 \%\)}, ceteris paribus.
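Rather than exponentiating by hand, the incidence-rate ratios (and their confidence intervals) can be obtained directly from the fitted models:
exp(coef(poisson))            # e.g. woolB: exp(-0.206) ~ 0.814
exp(coef(negative_binomial))  # e.g. woolB: exp(-0.186) ~ 0.830
exp(confint(poisson))         # profile-likelihood confidence intervals on the ratio scale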
Try zero-inflated Poisson and zero-inflated negative binomial models as well.
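A sketch of how these could be fitted with the zeroinfl() function from the pscl package (not used elsewhere in this document). Note that breaks contains no zeros at all (its minimum is 10), so zero-inflated models are purely illustrative here; they are intended for count data with more zeros than a Poisson or negative binomial model can accommodate.
library(pscl)  # provides zeroinfl()
# Purely illustrative: warpbreaks has no zero counts, so expect a degenerate or
# unstable zero-inflation component; use these models on data with excess zeros.
zip  <- zeroinfl(breaks ~ wool + tension, data = warpbreaks, dist = "poisson")
zinb <- zeroinfl(breaks ~ wool + tension, data = warpbreaks, dist = "negbin")
summary(zip)
summary(zinb)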
Poisson Code