DATA 621: Blog 3

Introduction
In this blog, I aim to explore when to use Zero-Inflated Poisson (ZIP) and Zero-Inflated Negative Binomial (ZINB) models. More specifically, I will outline the steps involved in determining which of these models provides the best fit for a given dataset. The data used in this blog comes from the AER package which includes a dataset ‘DoctorVisits’. The target variable in this dataset is the variable ‘visits’, which is defined as a count variable representing the number of annual doctor visits. Each row represents one patient. Since ‘visits’ is a known count variable given it’s description and, since we know the ‘visits’ variable has many many zeros, we need to determine which model, ZIP or ZINB, is more appropriate for handling the data.

In datasets where the target variable is a count variable with many zero values, the choice between ZIP and ZINB models depends on the presence of overdispersion. If the data exhibits both overdispersion, denoted by variance greater than the mean, and lower AIC for the negative binomial model, then the Negative Binomial model is more appropriate. However, if the variance is comparable or approximately equal to the mean and the Poisson model exhibits a lower AIC, the ZIP model is the more appropriate choice.

Below is a summary of steps for this analysis:
Step 1: Exploratory Data Analysis including missing values.
Step 2: Use a histograms to visualize the data including histogram of the target variable ‘visits’)’
Step 3: Overlay ZIP and ZINB curves to assess fit.
Step 4: Calculate mean and variance to check for overdispersion.
Step 5: Compare AIC.

1. Exploratory Data Analysis
Data is loaded into a dataframe and missing values are checked for each variable. There are 5,180 rows and 12 vriables. No missing variables are found.

Variables

Visits: Numeric and is the target variable defined as the number of doctor visits per individual in a year.
Gender: Factor variables and is defined as the gender of the individual (Male/Female).
Age: Numeric variable and is defined as the age of the individual in years.
Income: Numeric variable and is defined as the household income in thousands of dollars.
Illness: Numeric variable and is the number of illnesses reported in the past two weeks.
Reduced: Numeric variable and is defined as the number of days with reduced activity due to illness in the past two weeks.
Health: Factor variable and is defined as self-reported health status (excellent, good, fair, poor).
Private: Factor variable and defined as whether the individual has private health insurance (“yes”/“no”).
Freepoor: Factor variable and defined as free healthcare status for individuals classified as “poor”(“yes”/“no”).
Freerepat: Factor variable and defined as free healthcare status for veterans or certain other groups (“yes”/“no”).
Nchronic: Factor variable and defined as the number of chronic conditions reported.
Lchronic: Factor variable and defined as an indicator for chronic conditions (may record the presence of long-term health issues).

library(AER)
library(pscl)
library(readr)
library(dplyr)
library(tidyverse)
library(ggplot2)
library(dplyr)
library(kableExtra)
library(MASS)
library(pscl)  # For zero-inflated models
url <- "https://raw.githubusercontent.com/greggmaloy/621/refs/heads/main/DoctorVisits.csv"
df<- read_csv(url)
print(df)

## # A tibble: 5,190 × 12
##    visits gender   age income illness reduced health private freepoor freerepat
##     <dbl> <chr>  <dbl>  <dbl>   <dbl>   <dbl>  <dbl> <chr>   <chr>    <chr>    
##  1      1 female  0.19   0.55       1       4      1 yes     no       no       
##  2      1 female  0.19   0.45       1       2      1 yes     no       no       
##  3      1 male    0.19   0.9        3       0      0 no      no       no       
##  4      1 male    0.19   0.15       1       0      0 no      no       no       
##  5      1 male    0.19   0.45       2       5      1 no      no       no       
##  6      1 female  0.19   0.35       5       1      9 no      no       no       
##  7      1 female  0.19   0.55       4       0      2 no      no       no       
##  8      1 female  0.19   0.15       3       0      6 no      no       no       
##  9      1 female  0.19   0.65       2       0      5 yes     no       no       
## 10      1 male    0.19   0.15       1       0      0 yes     no       no       
## # ℹ 5,180 more rows
## # ℹ 2 more variables: nchronic <chr>, lchronic <chr>

Missing Values/NA’s
There are no missing values or NA’s to be concern about.

#missing values/na
apply(df, 2, function(col) sum(is.na(col))) %>% 
  kable() %>% 
  kable_styling(bootstrap_options = c("striped", "condensed"), full_width = TRUE, position = "center")

	x
visits	0
gender	0
age	0
income	0
illness	0
reduced	0
health	0
private	0
freepoor	0
freerepat	0
nchronic	0
lchronic	0

Summary Statistics

summary(df)

##      visits          gender               age             income      
##  Min.   :0.0000   Length:5190        Min.   :0.1900   Min.   :0.0000  
##  1st Qu.:0.0000   Class :character   1st Qu.:0.2200   1st Qu.:0.2500  
##  Median :0.0000   Mode  :character   Median :0.3200   Median :0.5500  
##  Mean   :0.3017                      Mean   :0.4064   Mean   :0.5832  
##  3rd Qu.:0.0000                      3rd Qu.:0.6200   3rd Qu.:0.9000  
##  Max.   :9.0000                      Max.   :0.7200   Max.   :1.5000  
##     illness         reduced            health         private         
##  Min.   :0.000   Min.   : 0.0000   Min.   : 0.000   Length:5190       
##  1st Qu.:0.000   1st Qu.: 0.0000   1st Qu.: 0.000   Class :character  
##  Median :1.000   Median : 0.0000   Median : 0.000   Mode  :character  
##  Mean   :1.432   Mean   : 0.8619   Mean   : 1.218                     
##  3rd Qu.:2.000   3rd Qu.: 0.0000   3rd Qu.: 2.000                     
##  Max.   :5.000   Max.   :14.0000   Max.   :12.000                     
##    freepoor          freerepat           nchronic           lchronic        
##  Length:5190        Length:5190        Length:5190        Length:5190       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##

Step 2: Histograms of Numeric Variables
A histogram of the target variable, ‘visits,’ reveals a high frequency of zero values(~1,500), suggesting that ZIP and ZINB models are well-suited for analyzing this data. Zero values for the ‘visits’ variable denotes lack of a doctor visit.

# Select numeric columns 
numeric_df <- df %>% select_if(is.numeric)

# Convert to long format
df_long <- numeric_df %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value")

# Plot histograms 
ggplot(df_long, aes(x = Value)) +
  geom_histogram(bins = 30, fill = "skyblue", color = "black", alpha = 0.7) +
  facet_wrap(~ Variable, scales = "free", ncol = 3) +
  theme_minimal() +
  labs(title = "Histograms of All Numeric Variables", x = "Value", y = "Frequency")

Step 3: ZIP and ZINB curves for fit assessment
Both the ZIP and ZINB distributions are overlaid on a histogram of the observed visit counts. The green points represent the ZIP model, while the purple points correspond to the ZINB model. Both model results are generally comparable, denoting the need for further analysis.

# Fit ZIP and ZINB 
zip_model <- pscl::zeroinfl(visits ~ 1 | 1, data = df, dist = "poisson")
zinb_model <- pscl::zeroinfl(visits ~ 1 | 1, data = df, dist = "negbin")

# Extract parameters for ZIP and ZINB
zip_lambda <- exp(coef(zip_model)["count_(Intercept)"])
zip_pi <- exp(coef(zip_model)["zero_(Intercept)"]) / (1 + exp(coef(zip_model)["zero_(Intercept)"]))

zinb_mu <- exp(coef(zinb_model)["count_(Intercept)"])
zinb_theta <- zinb_model$theta
zinb_pi <- exp(coef(zinb_model)["zero_(Intercept)"]) / (1 + exp(coef(zinb_model)["zero_(Intercept)"]))

# Histogram
hist(df$visits, breaks = 30, probability = TRUE, main = "ZIP vs ZINB Fit", xlab = "Counts")

# Values for plotting
x_vals <- 0:max(df$visits)

# Overlay ZIP
zip_probs <- (1 - zip_pi) * dpois(x_vals, zip_lambda)
zip_probs[1] <- zip_probs[1] + zip_pi
lines(x_vals, zip_probs, col = "green", type = "b", pch = 18)

# Overlay ZINB
zinb_probs <- (1 - zinb_pi) * dnbinom(x_vals, size = zinb_theta, mu = zinb_mu)
zinb_probs[1] <- zinb_probs[1] + zinb_pi
lines(x_vals, zinb_probs, col = "purple", type = "b", pch = 19)


legend("topright", legend = c("ZIP", "ZINB"), col = c("green", "purple"), pch = c(18, 19))

Step 4: Calculate mean and variance to check for overdispersion
Below, the mean and variance of the target variable ‘visits’ are calculated. The variance moderately exceeds the mean, indicating overdispersion and suggesting that the ZINB model may provide a better fit for the data. If the mean and variance had been approximately equal, the ZIP model would have been the more suitable choice.

mean_count <- mean(df$visits)
var_count <- var(df$visits)
cat("Mean:", mean_count, "Variance:", var_count, "\n")

## Mean: 0.3017341 Variance: 0.6370176

Step 5: Compare AIC
The Negative Binomial model provides a better fit as denoted by the lower AIC, suggesting that the Negative Binomial distribution should be used.

# Fit Poisson model
poisson_model <- glm(visits ~ 1, family = poisson, data = df)

# Fit Negative Binomial model
nb_model <- glm.nb(visits ~ 1, data = df)

# Compare AIC values
AIC(poisson_model, nb_model)

Conclusion
The since the data exhibits overdispersion, meaning the variance is greater than the mean, and since the ZINB model provides a better fit as denoted by the lower AIC, the ZINB model is preferred to use with this data. Next steps in this analysis would be possible variable transformation and model refinement.

DATA 621: Blog 3

Gregg Maloy