Introduction
In this blog, I aim to explore when to use Zero-Inflated Poisson (ZIP)
and Zero-Inflated Negative Binomial (ZINB) models. More specifically, I
will outline the steps involved in determining which of these models
provides the best fit for a given dataset. The data used in this blog
comes from the AER package which includes a dataset ‘DoctorVisits’. The
target variable in this dataset is the variable ‘visits’, which is
defined as a count variable representing the number of annual doctor
visits. Each row represents one patient. Since ‘visits’ is a known count
variable given it’s description and, since we know the ‘visits’ variable
has many many zeros, we need to determine which model, ZIP or ZINB, is
more appropriate for handling the data.
In datasets where the target variable is a count variable with many zero values, the choice between ZIP and ZINB models depends on the presence of overdispersion. If the data exhibits both overdispersion, denoted by variance greater than the mean, and lower AIC for the negative binomial model, then the Negative Binomial model is more appropriate. However, if the variance is comparable or approximately equal to the mean and the Poisson model exhibits a lower AIC, the ZIP model is the more appropriate choice.
Below is a summary of steps for this analysis:
Step 1: Exploratory Data Analysis including missing
values.
Step 2: Use a histograms to visualize the data including histogram
of the target variable ‘visits’)’
Step 3: Overlay ZIP and ZINB curves to assess fit.
Step 4: Calculate mean and variance to check for
overdispersion.
Step 5: Compare AIC.
1. Exploratory Data Analysis
Data is loaded into a dataframe and missing values are checked for each
variable. There are 5,180 rows and 12 vriables. No missing variables are
found.
Variables
Visits: Numeric and is the target variable defined as the number of doctor visits per individual in a year.
Gender: Factor variables and is defined as the gender of the individual (Male/Female).
Age: Numeric variable and is defined as the age of the individual in years.
Income: Numeric variable and is defined as the household income in thousands of dollars.
Illness: Numeric variable and is the number of illnesses reported in the past two weeks.
Reduced: Numeric variable and is defined as the number of days with reduced activity due to illness in the past two weeks.
Health: Factor variable and is defined as self-reported health status (excellent, good, fair, poor).
Private: Factor variable and defined as whether the individual has private health insurance (“yes”/“no”).
Freepoor: Factor variable and defined as free healthcare status for individuals classified as “poor”(“yes”/“no”).
Freerepat: Factor variable and defined as free healthcare status for veterans or certain other groups (“yes”/“no”).
Nchronic: Factor variable and defined as the number of chronic conditions reported.
Lchronic: Factor variable and defined as an indicator for chronic conditions (may record the presence of long-term health issues).
library(AER)
library(pscl)
library(readr)
library(dplyr)
library(tidyverse)
library(ggplot2)
library(dplyr)
library(kableExtra)
library(MASS)
library(pscl) # For zero-inflated models
url <- "https://raw.githubusercontent.com/greggmaloy/621/refs/heads/main/DoctorVisits.csv"
df<- read_csv(url)
print(df)
## # A tibble: 5,190 × 12
## visits gender age income illness reduced health private freepoor freerepat
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr>
## 1 1 female 0.19 0.55 1 4 1 yes no no
## 2 1 female 0.19 0.45 1 2 1 yes no no
## 3 1 male 0.19 0.9 3 0 0 no no no
## 4 1 male 0.19 0.15 1 0 0 no no no
## 5 1 male 0.19 0.45 2 5 1 no no no
## 6 1 female 0.19 0.35 5 1 9 no no no
## 7 1 female 0.19 0.55 4 0 2 no no no
## 8 1 female 0.19 0.15 3 0 6 no no no
## 9 1 female 0.19 0.65 2 0 5 yes no no
## 10 1 male 0.19 0.15 1 0 0 yes no no
## # ℹ 5,180 more rows
## # ℹ 2 more variables: nchronic <chr>, lchronic <chr>
Missing Values/NA’s
There are no missing values or NA’s to be concern about.
#missing values/na
apply(df, 2, function(col) sum(is.na(col))) %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "condensed"), full_width = TRUE, position = "center")
| x | |
|---|---|
| visits | 0 |
| gender | 0 |
| age | 0 |
| income | 0 |
| illness | 0 |
| reduced | 0 |
| health | 0 |
| private | 0 |
| freepoor | 0 |
| freerepat | 0 |
| nchronic | 0 |
| lchronic | 0 |
Summary Statistics
summary(df)
## visits gender age income
## Min. :0.0000 Length:5190 Min. :0.1900 Min. :0.0000
## 1st Qu.:0.0000 Class :character 1st Qu.:0.2200 1st Qu.:0.2500
## Median :0.0000 Mode :character Median :0.3200 Median :0.5500
## Mean :0.3017 Mean :0.4064 Mean :0.5832
## 3rd Qu.:0.0000 3rd Qu.:0.6200 3rd Qu.:0.9000
## Max. :9.0000 Max. :0.7200 Max. :1.5000
## illness reduced health private
## Min. :0.000 Min. : 0.0000 Min. : 0.000 Length:5190
## 1st Qu.:0.000 1st Qu.: 0.0000 1st Qu.: 0.000 Class :character
## Median :1.000 Median : 0.0000 Median : 0.000 Mode :character
## Mean :1.432 Mean : 0.8619 Mean : 1.218
## 3rd Qu.:2.000 3rd Qu.: 0.0000 3rd Qu.: 2.000
## Max. :5.000 Max. :14.0000 Max. :12.000
## freepoor freerepat nchronic lchronic
## Length:5190 Length:5190 Length:5190 Length:5190
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
Step 2: Histograms of Numeric Variables
A histogram of the target variable, ‘visits,’ reveals a high frequency
of zero values(~1,500), suggesting that ZIP and ZINB models are
well-suited for analyzing this data. Zero values for the ‘visits’
variable denotes lack of a doctor visit.
# Select numeric columns
numeric_df <- df %>% select_if(is.numeric)
# Convert to long format
df_long <- numeric_df %>%
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value")
# Plot histograms
ggplot(df_long, aes(x = Value)) +
geom_histogram(bins = 30, fill = "skyblue", color = "black", alpha = 0.7) +
facet_wrap(~ Variable, scales = "free", ncol = 3) +
theme_minimal() +
labs(title = "Histograms of All Numeric Variables", x = "Value", y = "Frequency")
Step 3: ZIP and ZINB curves for fit assessment
Both the ZIP and ZINB distributions are overlaid on a histogram of the
observed visit counts. The green points represent the ZIP model, while
the purple points correspond to the ZINB model. Both model results are
generally comparable, denoting the need for further analysis.
# Fit ZIP and ZINB
zip_model <- pscl::zeroinfl(visits ~ 1 | 1, data = df, dist = "poisson")
zinb_model <- pscl::zeroinfl(visits ~ 1 | 1, data = df, dist = "negbin")
# Extract parameters for ZIP and ZINB
zip_lambda <- exp(coef(zip_model)["count_(Intercept)"])
zip_pi <- exp(coef(zip_model)["zero_(Intercept)"]) / (1 + exp(coef(zip_model)["zero_(Intercept)"]))
zinb_mu <- exp(coef(zinb_model)["count_(Intercept)"])
zinb_theta <- zinb_model$theta
zinb_pi <- exp(coef(zinb_model)["zero_(Intercept)"]) / (1 + exp(coef(zinb_model)["zero_(Intercept)"]))
# Histogram
hist(df$visits, breaks = 30, probability = TRUE, main = "ZIP vs ZINB Fit", xlab = "Counts")
# Values for plotting
x_vals <- 0:max(df$visits)
# Overlay ZIP
zip_probs <- (1 - zip_pi) * dpois(x_vals, zip_lambda)
zip_probs[1] <- zip_probs[1] + zip_pi
lines(x_vals, zip_probs, col = "green", type = "b", pch = 18)
# Overlay ZINB
zinb_probs <- (1 - zinb_pi) * dnbinom(x_vals, size = zinb_theta, mu = zinb_mu)
zinb_probs[1] <- zinb_probs[1] + zinb_pi
lines(x_vals, zinb_probs, col = "purple", type = "b", pch = 19)
legend("topright", legend = c("ZIP", "ZINB"), col = c("green", "purple"), pch = c(18, 19))
Step 4: Calculate mean and variance to check for
overdispersion
Below, the mean and variance of the target variable ‘visits’ are
calculated. The variance moderately exceeds the mean, indicating
overdispersion and suggesting that the ZINB model may provide a better
fit for the data. If the mean and variance had been approximately equal,
the ZIP model would have been the more suitable choice.
mean_count <- mean(df$visits)
var_count <- var(df$visits)
cat("Mean:", mean_count, "Variance:", var_count, "\n")
## Mean: 0.3017341 Variance: 0.6370176
Step 5: Compare AIC
The Negative Binomial model provides a better fit as denoted by the
lower AIC, suggesting that the Negative Binomial distribution should be
used.
# Fit Poisson model
poisson_model <- glm(visits ~ 1, family = poisson, data = df)
# Fit Negative Binomial model
nb_model <- glm.nb(visits ~ 1, data = df)
# Compare AIC values
AIC(poisson_model, nb_model)
Conclusion
The since the data exhibits overdispersion, meaning the variance is
greater than the mean, and since the ZINB model provides a better fit as
denoted by the lower AIC, the ZINB model is preferred to use with this
data. Next steps in this analysis would be possible variable
transformation and model refinement.