library(descr)
## Warning: package 'descr' was built under R version 4.2.3
library(MASS)#glm modeling
## Warning: package 'MASS' was built under R version 4.2.3
library(rstatix) #Summary Tables
## Warning: package 'rstatix' was built under R version 4.2.3
##
## Attaching package: 'rstatix'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:stats':
##
## filter
library(tidyverse) #General coding
## Warning: package 'tidyverse' was built under R version 4.2.3
## Warning: package 'ggplot2' was built under R version 4.2.3
## Warning: package 'tibble' was built under R version 4.2.3
## Warning: package 'tidyr' was built under R version 4.2.3
## Warning: package 'readr' was built under R version 4.2.3
## Warning: package 'purrr' was built under R version 4.2.3
## Warning: package 'dplyr' was built under R version 4.2.3
## Warning: package 'stringr' was built under R version 4.2.3
## Warning: package 'forcats' was built under R version 4.2.3
## Warning: package 'lubridate' was built under R version 4.2.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks rstatix::filter(), stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ dplyr::select() masks rstatix::select(), MASS::select()
## ℹ Use the ]8;;http://conflicted.r-lib.org/conflicted package]8;; to force all conflicts to become errors
library(patchwork) #Merge GGPlots togeter
## Warning: package 'patchwork' was built under R version 4.2.3
##
## Attaching package: 'patchwork'
##
## The following object is masked from 'package:MASS':
##
## area
library(ggplot2) #Graphing
library(stargazer) #Tabular Regression Results
##
## Please cite as:
##
## Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
## R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
library(jtools) #Tabular Regression Results
## Warning: package 'jtools' was built under R version 4.2.3
library(descr) #Easy Frequency Tables
library(haven) #Imports survey data
## Warning: package 'haven' was built under R version 4.2.3
library(stats) #Various statistical calculations
library(ggeffects) #Predicted Probabilities from Regressions
## Warning: package 'ggeffects' was built under R version 4.2.3
library(pscl) #Zero Inflated Negative Binomial
## Warning: package 'pscl' was built under R version 4.2.3
## Classes and Methods for R developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University
## Simon Jackman
## hurdle and zeroinfl functions by Achim Zeileis
anes <- read_dta("C:/Users/Carey/Dropbox/UMass/2024-2025/Spring/AQM/Data/anes_2020_count.dta")
anes <- anes %>%
mutate(pid_x = labelled(
case_when(V201228 == 1 & V201229 == 1 ~ 1,
V201228 == 1 & V201229 == 2 ~ 2,
V201230 == 3 ~ 3,
V201230 == 2 ~ 4,
V201230 == 1 ~ 5,
V201228 == 2 & V201229 == 2 ~ 6,
V201228 == 2 & V201229 == 1 ~ 7),
labels = c("Strong Dem" = 1,
"Weak Dem" = 2,
"Lean Dem" = 3,
"Ind" = 4,
"Lean Rep" = 5,
"Weak Rep" = 6,
"Strong Rep" = 7)))
anes <- anes %>% #
mutate(college = case_when(
V201511x==1 ~ 0,
V201511x==2 ~ 0,
V201511x==3 ~ 0,
V201511x==4 ~ 1,
V201511x==5 ~ 1))
anes <- anes %>%
mutate(poc = case_when(
V201549x ==1 ~ 0,
V201549x >=2 ~ 1,
V201549x <=-1 ~ NA_real_))
anes <- anes %>%
mutate(female = case_when(
V201600 ==1 ~ 0,
V201600 ==2 ~ 1,
V201600 <=-1 ~ NA_real_))
anes <- anes %>%
mutate(libcon = replace(V201200, (V201200 <= -1 | V201200 >= 99), NA))
freq(anes$libcon)
## PRE: 7pt scale liberal-conservative self-placement
## Frequency Percent Valid Percent
## 1 163 4.656 5.181
## 2 566 16.167 17.991
## 3 404 11.540 12.842
## 4 739 21.108 23.490
## 5 379 10.825 12.047
## 6 704 20.109 22.378
## 7 191 5.456 6.071
## NA's 355 10.140
## Total 3501 100.000 100.000
value_labels <- c("Ext Liberal", "Liberal",
"Slightly Liberal", "Moderate", "Slightly Conserv",
"Conserv", "Ext Conserv") #Add value labels to your new measure. Only use this if you need to as it changes the variable type to 'factor' which influences the types of analysis you can do with it. Previously, it was 'numeric'.
# Assign value labels to the Response variable
anes$libcon_f <- factor(anes$libcon, levels = 1:7, labels = value_labels)
anes$libcon_f <- relevel(anes$libcon_f, ref = "Moderate") #Change reference group
In this tutorial, we will be reviewing how to estimate, evaluate and interpret count regression models. The dependent variable in these models are counts of something occurring. In survey data, it is oftentimes the count of how many times something is done in a specified time period, like going to the gym in the past 30 days or the number of different participatory acts someone did during a recent election.
While it is tempting to consider this variable type to be continuous, it is technically drawn from a count distribution rather than a Gaussian one. For instance, counts must be an integer and cannot be negative. That type of data structure does not apply to a Gaussian distribution.
Here, we review several different count model types including Poisson, Negative-Binomial, Zero-Inflated Poisson (Zips), and Zero-Inflated Negative-Binomial (ZINB). This includes covering when to use each type of model and why.
We will be working out of the 2020 American National Election Survey in this tutorial. The dependent variable will be a count of the number of rightwing news sources an individual consumed during the 2020 presidential election. There will be two specific DVs:
fox_shows
)
all_rw_shows
)
First, examine the distribution of each count variable.
freq(anes$fox_shows)
## anes$fox_shows
## Frequency Percent
## 0 2879 82.2336
## 1 361 10.3113
## 2 170 4.8558
## 3 84 2.3993
## 4 7 0.1999
## Total 3501 100.0000
freq(anes$all_rw_shows)
## anes$all_rw_shows
## Frequency Percent Valid Percent
## 0 2325 66.40960 76.15460
## 1 281 8.02628 9.20406
## 2 145 4.14167 4.74943
## 3 117 3.34190 3.83230
## 4 75 2.14225 2.45660
## 5 56 1.59954 1.83426
## 6 20 0.57127 0.65509
## 7 27 0.77121 0.88438
## 8 3 0.08569 0.09826
## 9 4 0.11425 0.13102
## NA's 448 12.79634
## Total 3501 100.00000 100.00000
For the Fox News program count, there are a lot of zeroes, ~82% meaning of the sample watched 0 of the Fox News opinion shows listed, with 1’s being the most commonly watched number of shows. The full right wing media variable had fewer 0’s, ~66%, with a few respondents consuming 9 of the 12 possible shows. We will return to this distribution shortly, but first let’s estimate the most basic of the count models - the Poisson.
The Poisson count model assumes the dependent variable is drawn from the Poisson distribution. What is unique about the Poisson distribution is that it assumes the mean of the count variable equals its variance. When this is violated, the estimated standard errors are the wrong size leading to Type I and Type II errors. However, this does nothing to bias the coefficients, which remain the same regardless of the issues with the standard errors.
We start with a simple model with one predictor - political ideology - measured on a 7-point scale from extremely liberal (1) to extremely conservative (7) with moderates (4) in the middle.
pois_fox <- glm(fox_shows ~ libcon_f, family = poisson, data = anes, weights = anes$V200010b)
summary(pois_fox)
##
## Call:
## glm(formula = fox_shows ~ libcon_f, family = poisson, data = anes,
## weights = anes$V200010b)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.0296 -0.6216 -0.2701 -0.0930 6.6264
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.89079 0.09819 -19.257 < 2e-16 ***
## libcon_fExt Liberal -15.41179 304.76458 -0.051 0.96
## libcon_fLiberal -1.52599 0.27287 -5.592 2.24e-08 ***
## libcon_fSlightly Liberal -1.48588 0.31356 -4.739 2.15e-06 ***
## libcon_fSlightly Conserv 0.76332 0.13603 5.611 2.01e-08 ***
## libcon_fConserv 1.51977 0.10934 13.900 < 2e-16 ***
## libcon_fExt Conserv 1.84095 0.12423 14.819 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 2970.9 on 2898 degrees of freedom
## Residual deviance: 2127.6 on 2892 degrees of freedom
## (602 observations deleted due to missingness)
## AIC: 3383.3
##
## Number of Fisher Scoring iterations: 15
pois_all <- glm(all_rw_shows ~ libcon_f, family = poisson, data = anes, weights = anes$V200010b)
summary(pois_all)
##
## Call:
## glm(formula = all_rw_shows ~ libcon_f, family = poisson, data = anes,
## weights = anes$V200010b)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.7584 -0.7615 -0.3587 -0.1587 7.4819
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.21260 0.07547 -16.067 < 2e-16 ***
## libcon_fExt Liberal -2.88802 0.73057 -3.953 7.71e-05 ***
## libcon_fLiberal -1.78814 0.22465 -7.960 1.72e-15 ***
## libcon_fSlightly Liberal -1.13007 0.19953 -5.664 1.48e-08 ***
## libcon_fSlightly Conserv 0.79642 0.10332 7.708 1.28e-14 ***
## libcon_fConserv 1.60536 0.08341 19.246 < 2e-16 ***
## libcon_fExt Conserv 2.14999 0.09097 23.634 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 5126.2 on 2543 degrees of freedom
## Residual deviance: 3304.7 on 2537 degrees of freedom
## (957 observations deleted due to missingness)
## AIC: 4966.5
##
## Number of Fisher Scoring iterations: 6
stargazer(pois_fox, pois_all, type="text", digits=3)
##
## =====================================================
## Dependent variable:
## ----------------------------
## fox_shows all_rw_shows
## (1) (2)
## -----------------------------------------------------
## libcon_fExt Liberal -15.412 -2.888***
## (304.765) (0.731)
##
## libcon_fLiberal -1.526*** -1.788***
## (0.273) (0.225)
##
## libcon_fSlightly Liberal -1.486*** -1.130***
## (0.314) (0.200)
##
## libcon_fSlightly Conserv 0.763*** 0.796***
## (0.136) (0.103)
##
## libcon_fConserv 1.520*** 1.605***
## (0.109) (0.083)
##
## libcon_fExt Conserv 1.841*** 2.150***
## (0.124) (0.091)
##
## Constant -1.891*** -1.213***
## (0.098) (0.075)
##
## -----------------------------------------------------
## Observations 2,899 2,544
## Log Likelihood -1,684.661 -2,476.275
## Akaike Inf. Crit. 3,383.322 4,966.549
## =====================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Results from our two simple model reveal the relationship between political ideology and right wing media consumption. The coefficients are on the logarithmic scale so are not easily interpretable in terms of the substantive effect We can interpret the direction and significance of the relationship directly from a regression table. Not surprisingly, conservatives consume more right wing media than moderates while liberals consume less.
fox_pred<-ggpredict(pois_fox, terms="libcon_f")
print(fox_pred)
## # Predicted counts of fox_shows
##
## libcon_f | Predicted | 95% CI
## ------------------------------------------------
## Moderate | 0.15 | [0.12, 0.18]
## Ext Liberal | 0.00 | [0.00, 7.97e+251]
## Liberal | 0.03 | [0.02, 0.05]
## Slightly Liberal | 0.03 | [0.02, 0.06]
## Slightly Conserv | 0.32 | [0.27, 0.39]
## Conserv | 0.69 | [0.63, 0.76]
## Ext Conserv | 0.95 | [0.82, 1.10]
all_pred<-ggpredict(pois_all, terms="libcon_f")
print(all_pred)
## # Predicted counts of all_rw_shows
##
## libcon_f | Predicted | 95% CI
## -------------------------------------------
## Moderate | 0.30 | [0.26, 0.34]
## Ext Liberal | 0.02 | [0.00, 0.07]
## Liberal | 0.05 | [0.03, 0.08]
## Slightly Liberal | 0.10 | [0.07, 0.14]
## Slightly Conserv | 0.66 | [0.57, 0.76]
## Conserv | 1.48 | [1.38, 1.59]
## Ext Conserv | 2.55 | [2.31, 2.82]
One thing we have not done yet is to test the assumption that the means of the DVs are equal to their variance. If they are not, we risk misestimating the size of our standard errors which can lead to incorrect conclusions.
There are multiple ways to test the assumption that the mean equals the variance of the dependent variables. We will cover a few of them including…
var_fox_shows <- var(anes$fox_shows, na.rm = TRUE)
var_all_rw_shows <- var(anes$all_rw_shows, na.rm = TRUE)
mean_fox_shows <- mean(anes$fox_shows, na.rm = TRUE)
mean_all_rw_shows <- mean(anes$all_rw_shows, na.rm = TRUE)
ratio_fox<-var_fox_shows/mean_fox_shows
ratio_all<-var_all_rw_shows/mean_all_rw_shows
# Combine means and variances for comparison
count_variance <- data.frame(
Variable = c("fox_shows", "all_rw_shows"),
Mean = c(mean_fox_shows, mean_all_rw_shows),
Variance = c(var_fox_shows, var_all_rw_shows),
Ratio=c(ratio_fox, ratio_all)
)
# Print the variance results for evaluation
print(count_variance)
## Variable Mean Variance Ratio
## 1 fox_shows 0.2802057 0.4668909 1.666244
## 2 all_rw_shows 0.6128398 1.9418010 3.168529
Using the tidyverse
, we calculate the mean, variance,
and ratio of the two for both of the count DVs. The ratio value is the
easiest way to evaluate the relationship, also called dispersion. If
mean = variance then the ratio will equal 1. The larger the value the
worse the overdispersion is in the count variable. Overdispersion leads
to standard errors that are too small, which leads us to be more certain
about the relationship between our Xs and Y then we should be. The
smaller the value, the worse the underdispersion is which leads the
standard errors to be too big. This is less common than overdispersion
but can happen.
These results suggest that we definitely have an overdispersion problem with the full right-wing media variable (ratio > 3) while the Fox TV shows variable shows some, but not as overwhelming, evidence of overdispersion (ratio of 1.7).
With the dispersion issues seen in variance/mean calculation, we model the count DVs with a technique that explicitly accounts for this issue. First, we will use the quasi-Poisson model before turning our attention to the Negative-Binomial model.
The quasi-Poisson model uses the same distribution as the Poisson model but adjusts for the variance/mean issue. One thing to note is that the quasi-Poisson modeling approach will not return basic fit statistics like the AIC, BIC or likelihood ratio. In the summary command for the regression results, it does return a dispersion parameter. Here we have a dispersion parameter of 1.23 for the Fox News shows - indicating slight overdispersion like we saw above - and 2.027 for all Right Wing media sources indicating problematic overdispersion. When calculating the negative binomial model, we will review the theta value which will gives us statistical tests to know if we need to worry about the dispersion parameter or not.
quasi_fox <- glm(fox_shows ~ libcon_f, family = quasipoisson, data = anes, weights = anes$V200010b)
summary(quasi_fox)
##
## Call:
## glm(formula = fox_shows ~ libcon_f, family = quasipoisson, data = anes,
## weights = anes$V200010b)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.0296 -0.6216 -0.2701 -0.0930 6.6264
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.8908 0.1088 -17.374 < 2e-16 ***
## libcon_fExt Liberal -15.4118 337.7893 -0.046 0.964
## libcon_fLiberal -1.5260 0.3024 -5.046 4.80e-07 ***
## libcon_fSlightly Liberal -1.4859 0.3475 -4.275 1.97e-05 ***
## libcon_fSlightly Conserv 0.7633 0.1508 5.063 4.39e-07 ***
## libcon_fConserv 1.5198 0.1212 12.541 < 2e-16 ***
## libcon_fExt Conserv 1.8409 0.1377 13.370 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasipoisson family taken to be 1.228465)
##
## Null deviance: 2970.9 on 2898 degrees of freedom
## Residual deviance: 2127.6 on 2892 degrees of freedom
## (602 observations deleted due to missingness)
## AIC: NA
##
## Number of Fisher Scoring iterations: 15
quasi_all <- glm(all_rw_shows ~ libcon_f, family = quasipoisson, data = anes, weights = anes$V200010b)
summary(quasi_all)
##
## Call:
## glm(formula = all_rw_shows ~ libcon_f, family = quasipoisson,
## data = anes, weights = anes$V200010b)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.7584 -0.7615 -0.3587 -0.1587 7.4819
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.2126 0.1075 -11.284 < 2e-16 ***
## libcon_fExt Liberal -2.8880 1.0402 -2.776 0.00554 **
## libcon_fLiberal -1.7881 0.3199 -5.590 2.51e-08 ***
## libcon_fSlightly Liberal -1.1301 0.2841 -3.978 7.16e-05 ***
## libcon_fSlightly Conserv 0.7964 0.1471 5.413 6.76e-08 ***
## libcon_fConserv 1.6054 0.1188 13.517 < 2e-16 ***
## libcon_fExt Conserv 2.1500 0.1295 16.598 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasipoisson family taken to be 2.027423)
##
## Null deviance: 5126.2 on 2543 degrees of freedom
## Residual deviance: 3304.7 on 2537 degrees of freedom
## (957 observations deleted due to missingness)
## AIC: NA
##
## Number of Fisher Scoring iterations: 6
stargazer(quasi_fox, quasi_all, type="text", digits=3)
##
## =====================================================
## Dependent variable:
## ----------------------------
## fox_shows all_rw_shows
## (1) (2)
## -----------------------------------------------------
## libcon_fExt Liberal -15.412 -2.888***
## (337.789) (1.040)
##
## libcon_fLiberal -1.526*** -1.788***
## (0.302) (0.320)
##
## libcon_fSlightly Liberal -1.486*** -1.130***
## (0.348) (0.284)
##
## libcon_fSlightly Conserv 0.763*** 0.796***
## (0.151) (0.147)
##
## libcon_fConserv 1.520*** 1.605***
## (0.121) (0.119)
##
## libcon_fExt Conserv 1.841*** 2.150***
## (0.138) (0.130)
##
## Constant -1.891*** -1.213***
## (0.109) (0.107)
##
## -----------------------------------------------------
## Observations 2,899 2,544
## =====================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
stargazer(pois_fox, quasi_fox, pois_all, quasi_all, type="text", digits=3)
##
## ==================================================================================
## Dependent variable:
## ---------------------------------------------------------
## fox_shows all_rw_shows
## Poisson glm: quasipoisson Poisson glm: quasipoisson
## link = log link = log
## (1) (2) (3) (4)
## ----------------------------------------------------------------------------------
## libcon_fExt Liberal -15.412 -15.412 -2.888*** -2.888***
## (304.765) (337.789) (0.731) (1.040)
##
## libcon_fLiberal -1.526*** -1.526*** -1.788*** -1.788***
## (0.273) (0.302) (0.225) (0.320)
##
## libcon_fSlightly Liberal -1.486*** -1.486*** -1.130*** -1.130***
## (0.314) (0.348) (0.200) (0.284)
##
## libcon_fSlightly Conserv 0.763*** 0.763*** 0.796*** 0.796***
## (0.136) (0.151) (0.103) (0.147)
##
## libcon_fConserv 1.520*** 1.520*** 1.605*** 1.605***
## (0.109) (0.121) (0.083) (0.119)
##
## libcon_fExt Conserv 1.841*** 1.841*** 2.150*** 2.150***
## (0.124) (0.138) (0.091) (0.130)
##
## Constant -1.891*** -1.891*** -1.213*** -1.213***
## (0.098) (0.109) (0.075) (0.107)
##
## ----------------------------------------------------------------------------------
## Observations 2,899 2,899 2,544 2,544
## Log Likelihood -1,684.661 -2,476.275
## Akaike Inf. Crit. 3,383.322 4,966.549
## ==================================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
The second stargazer table compares the quasi-Poisson with the basic Poisson. Notice that the coefficients remain identical, as they should. Think of the quasi-Poisson in a similar way to using robust standard errors in OLS - the coefficients remain the same but the standard errors are adjusted. Because we have overdispersion problem in the data, the standard errors are larger in the quasi-Poisson compared to the Poisson. If we had an underdispersion problem, they would be smaller.
Now, we will estimate both models using a Negative-Binomial approach. This modeling approach assumes the count variable is drawn from a slightly different probability distribution, the Negative-Binomial distribution. Importantly, this distribution does not assume that the mean=variance like the Poisson distribution. The dispersion parameter in the output is called theta. The closer to zero theta is the more a problem overdispersion is in the data.
We will also use the fact that a Poisson model is technically a nested model of the Negative-Binomial model (it simply is a form of the NB when mean = variance) to compare the model fit statistics we have covered previously. All things being equal, we should use the model that better fit the data.
# Checking Dispersion Parameter from a Negative Binomial
nb_fox <- glm.nb(fox_shows ~ libcon_f , anes, weights = anes$V200010b)
summary(nb_fox)
##
## Call:
## glm.nb(formula = fox_shows ~ libcon_f, data = anes, weights = anes$V200010b,
## init.theta = 1.147262873, link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6807 -0.5852 -0.2667 -0.0924 5.6124
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.8908 0.1044 -18.102 < 2e-16 ***
## libcon_fExt Liberal -18.4118 1365.8600 -0.013 0.989
## libcon_fLiberal -1.5260 0.2785 -5.479 4.28e-08 ***
## libcon_fSlightly Liberal -1.4859 0.3197 -4.647 3.36e-06 ***
## libcon_fSlightly Conserv 0.7633 0.1492 5.115 3.15e-07 ***
## libcon_fConserv 1.5198 0.1209 12.571 < 2e-16 ***
## libcon_fExt Conserv 1.8409 0.1466 12.553 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(1.1473) family taken to be 1)
##
## Null deviance: 2230.8 on 2898 degrees of freedom
## Residual deviance: 1556.4 on 2892 degrees of freedom
## (602 observations deleted due to missingness)
## AIC: 3301.1
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 1.147
## Std. Err.: 0.179
##
## 2 x log-likelihood: -3285.055
nb_all <- glm.nb(all_rw_shows ~ libcon_f , anes, weights = anes$V200010b)
summary(nb_all)
##
## Call:
## glm.nb(formula = all_rw_shows ~ libcon_f, data = anes, weights = anes$V200010b,
## init.theta = 0.5601640797, link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.1040 -0.6530 -0.3235 -0.1427 4.9686
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.21260 0.09338 -12.985 < 2e-16 ***
## libcon_fExt Liberal -2.88802 0.74323 -3.886 0.000102 ***
## libcon_fLiberal -1.78814 0.23973 -7.459 8.71e-14 ***
## libcon_fSlightly Liberal -1.13007 0.22065 -5.121 3.03e-07 ***
## libcon_fSlightly Conserv 0.79642 0.13987 5.694 1.24e-08 ***
## libcon_fConserv 1.60536 0.11540 13.911 < 2e-16 ***
## libcon_fExt Conserv 2.14999 0.15185 14.159 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(0.5602) family taken to be 1)
##
## Null deviance: 2414.5 on 2543 degrees of freedom
## Residual deviance: 1532.5 on 2537 degrees of freedom
## (957 observations deleted due to missingness)
## AIC: 4306.1
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 0.5602
## Std. Err.: 0.0467
##
## 2 x log-likelihood: -4290.1410
stargazer(nb_fox, nb_all, type="text", diits=3)
##
## ==========================================================
## Dependent variable:
## ---------------------------------
## fox_shows all_rw_shows
## (1) (2)
## ----------------------------------------------------------
## libcon_fExt Liberal -18.412 -2.888***
## (1,365.860) (0.743)
##
## libcon_fLiberal -1.526*** -1.788***
## (0.279) (0.240)
##
## libcon_fSlightly Liberal -1.486*** -1.130***
## (0.320) (0.221)
##
## libcon_fSlightly Conserv 0.763*** 0.796***
## (0.149) (0.140)
##
## libcon_fConserv 1.520*** 1.605***
## (0.121) (0.115)
##
## libcon_fExt Conserv 1.841*** 2.150***
## (0.147) (0.152)
##
## Constant -1.891*** -1.213***
## (0.104) (0.093)
##
## ----------------------------------------------------------
## Observations 2,899 2,544
## Log Likelihood -1,643.528 -2,146.071
## theta 1.147*** (0.179) 0.560*** (0.047)
## Akaike Inf. Crit. 3,301.055 4,306.141
## ==========================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
##
## =
## 3
## -
The first thing we will examine from the Negative-Binomial models is the theta parameter. The benefit of running the negative binomial model instead of just looking at the ratio of variance to mean is the associated significance test that accompanies theta. Here, we see significant theta values in both models indicating that overdispersion is a problem so the Poisson model will have standard errors that are too small potentially leading to Type I error.
Next, let’s take advantage of the fact that the Poisson model is nested within the NB and compare model fit statistics. We can use AIC and BIC to compare but because these models are nested we can also use the Likelihood Ratio and deviance tests for comparisons.
#Create function to compare model fit stats
glm_fit <- function(model) {
# Calculate Likelihood Ratio
lr <- logLik(model)
# Calculate AIC
aic <- AIC(model)
# Calculate BIC
n <- nobs(model)
p <- length(coef(model))
bic <- -2 * logLik(model) + p * log(n)
# Calculate Deviance
deviance <- summary(model)$deviance
# Return the metrics as a list
metrics <- data.frame(Likelihood_Ratio = lr, AIC = aic, BIC = bic, Deviance = deviance, Coefficients = p)
return(metrics)
}
glm_fit(pois_fox)
glm_fit(nb_fox)
glm_fit(pois_all)
glm_fit(nb_all)
If the NB distribution fits the DV data structure better, we will see better fit statistics compared to the Poisson model. That is exactly what we see for both sets of models. The AIC and BIC fir stats are smaller for both NB models and both the LR and Deviance tests are closer to zero as well (indicting a better fit). This is more evidence that we should be reporting the NB models rather than the Poisson.
Next, we will compare the Poisson and NB model results side-by-side. Despite the two models assuming a slightly different distribution of the DV, both models’ coefficients are in the logarithmic scale.
stargazer(pois_fox, nb_fox, pois_all, nb_all, type="text", digits=3)
##
## ================================================================================
## Dependent variable:
## -------------------------------------------------------
## fox_shows all_rw_shows
## Poisson negative Poisson negative
## binomial binomial
## (1) (2) (3) (4)
## --------------------------------------------------------------------------------
## libcon_fExt Liberal -15.412 -18.412 -2.888*** -2.888***
## (304.765) (1,365.860) (0.731) (0.743)
##
## libcon_fLiberal -1.526*** -1.526*** -1.788*** -1.788***
## (0.273) (0.279) (0.225) (0.240)
##
## libcon_fSlightly Liberal -1.486*** -1.486*** -1.130*** -1.130***
## (0.314) (0.320) (0.200) (0.221)
##
## libcon_fSlightly Conserv 0.763*** 0.763*** 0.796*** 0.796***
## (0.136) (0.149) (0.103) (0.140)
##
## libcon_fConserv 1.520*** 1.520*** 1.605*** 1.605***
## (0.109) (0.121) (0.083) (0.115)
##
## libcon_fExt Conserv 1.841*** 1.841*** 2.150*** 2.150***
## (0.124) (0.147) (0.091) (0.152)
##
## Constant -1.891*** -1.891*** -1.213*** -1.213***
## (0.098) (0.104) (0.075) (0.093)
##
## --------------------------------------------------------------------------------
## Observations 2,899 2,899 2,544 2,544
## Log Likelihood -1,684.661 -1,643.528 -2,476.275 -2,146.071
## theta 1.147*** (0.179) 0.560*** (0.047)
## Akaike Inf. Crit. 3,383.322 3,301.055 4,966.549 4,306.141
## ================================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Results are largely identical between the two types of models. This will oftentimes be the case, but it does not have to be. The coefficients will sometimes be slightly different between the two models. Generally, the main difference is in the size of the standard errors since the NB model calculates SE differently. This typically increases the size of the SE, which it does here in both models, but sometimes can make them smaller as well (just like Robust SE).
The odd findings for the Extreme Liberals on the Fox Shows DV is because only three respondents in this group watched at least 1 show. This low occurrence rate made the standard errors more difficult to estimate.
Now that we have decided to use the NB model, we can calculate
predicted number of shows watched for each level of the IV, political
ideology, using the same general format from previous tutorials. We will
be using the more efficient ggpredict
package here.
fox_pred<-ggpredict(nb_fox, terms="libcon_f")
print(fox_pred)
## # Predicted counts of fox_shows
##
## libcon_f | Predicted | 95% CI
## -----------------------------------------------
## Moderate | 0.15 | [0.12, 0.19]
## Ext Liberal | 0.00 | [0.00, Inf]
## Liberal | 0.03 | [0.02, 0.05]
## Slightly Liberal | 0.03 | [0.02, 0.06]
## Slightly Conserv | 0.32 | [0.26, 0.40]
## Conserv | 0.69 | [0.61, 0.78]
## Ext Conserv | 0.95 | [0.78, 1.16]
all_pred<-ggpredict(nb_all, terms="libcon_f")
print(all_pred)
## # Predicted counts of all_rw_shows
##
## libcon_f | Predicted | 95% CI
## -------------------------------------------
## Moderate | 0.30 | [0.25, 0.36]
## Ext Liberal | 0.02 | [0.00, 0.07]
## Liberal | 0.05 | [0.03, 0.08]
## Slightly Liberal | 0.10 | [0.06, 0.14]
## Slightly Conserv | 0.66 | [0.54, 0.81]
## Conserv | 1.48 | [1.30, 1.69]
## Ext Conserv | 2.55 | [2.02, 3.23]
As expected from reviewing the regression results, conservatives consume more right-wing media than moderates or liberals. From the Fox only model, our basic model expects moderates to watch about .15 shows on average whereas an extremely conservative person is likely watch about .95 shows on average. With the broader right-winder media environment model, there are bigger differences with extremely conservative respondents consume about 2.5 shows on average compared to just .3 shows for moderates.
We can pass these values into ggplot
for easy
visualization as well.
custom_order <- c('Ext Liberal', 'Liberal', 'Slightly Liberal', 'Moderate', 'Slightly Conserv', 'Conserv', 'Ext Conserv')
ggplot(all_pred, aes(x = x, y = predicted)) +
geom_bar(stat = "identity", position = "dodge", width = 0.7, fill="grey") +
geom_errorbar(aes(ymin = conf.low, ymax = conf.high), width = 0.4, position = position_dodge(width = 0.7)) +
theme_minimal(base_size = 13) +
labs(x = "# of Right-wing Media Shows Consumed", y = "Predicted Count",
title = "Predicted Count of Right-wing Media Shows Consumed",
subtitle = "Results from Negative-Binomial Model")+
scale_x_discrete(limits = custom_order)
If you remember all the way back at the beginning of this tutorial, we examined the frequency of distribution of both count DVs and saw that both contained a high number of zeros. If this is the case, it can be appropriate to model the count variable using a different approach called Zero-Inflated Poisson/Negative Binomial. In this modeling approach, we actually assume there are two separate data generating processes: 1. DGP that describes why a person would be a 0 or > 0 on the count DV 2. DGP that describes why a person would be a count of 1, 2, 3, etc. on the count DV
You can think of it conceptually as two models. Model 1 is a binomial model predicting if a respondent is a 0 or not on the Count DV. Model 2 is a count model predicting the count of the DV for the people we predict are not a zero from the first stage.
In R, we will be estimating two separate models in one line of code. Because we already have decided to use the NB model on the count data, we will use the Zero Inflated NB model. You need to evaluate the variance/mean relationship first before deciding on using zero inflated model.
Also only use a zero inflated model if you have theory to suggest that the DGP is different between doing it at all and the count of doing it. For the example in this tutorial, we could argue something like the following.
Political ideology will directly influence whether a person chooses to consume right-wing opinion media. For people who do consume right-wing opinion media, being more conservative should be directly related to consuming a higher number of shows since it matches their political beliefs.
fox_zinb <- zeroinfl(fox_shows ~ libcon_f, data = anes)
summary(fox_zinb)
##
## Call:
## zeroinfl(formula = fox_shows ~ libcon_f, data = anes)
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -0.7776 -0.4571 -0.1803 -0.1286 13.1113
##
## Count model coefficients (poisson with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.4428 0.1839 -2.407 0.01607 *
## libcon_fExt Liberal -4.6347 1.8122 -2.557 0.01054 *
## libcon_fLiberal -1.1986 1.0018 -1.196 0.23153
## libcon_fSlightly Liberal -0.2645 0.5161 -0.512 0.60830
## libcon_fSlightly Conserv 0.3299 0.2345 1.407 0.15947
## libcon_fConserv 0.4940 0.1961 2.520 0.01175 *
## libcon_fExt Conserv 0.7919 0.2106 3.760 0.00017 ***
##
## Zero-inflation model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.2830 0.2175 5.898 3.68e-09 ***
## libcon_fExt Liberal -5.0121 63.8681 -0.078 0.9374
## libcon_fLiberal 0.9108 1.0736 0.848 0.3962
## libcon_fSlightly Liberal 0.9664 0.5440 1.777 0.0756 .
## libcon_fSlightly Conserv -0.7348 0.3022 -2.432 0.0150 *
## libcon_fConserv -2.0007 0.2792 -7.165 7.76e-13 ***
## libcon_fExt Conserv -1.8696 0.3278 -5.704 1.17e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Number of iterations in BFGS optimization: 59
## Log-likelihood: -1760 on 14 Df
all_zinb <- zeroinfl(all_rw_shows ~ libcon_f, data = anes)
summary(all_zinb)
##
## Call:
## zeroinfl(formula = all_rw_shows ~ libcon_f, data = anes)
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -1.0353 -0.3669 -0.2193 -0.1228 11.8598
##
## Count model coefficients (poisson with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.3181 0.1008 3.155 0.001603 **
## libcon_fExt Liberal 0.4846 0.5325 0.910 0.362782
## libcon_fLiberal -1.7437 0.7012 -2.487 0.012892 *
## libcon_fSlightly Liberal -0.4112 0.3014 -1.364 0.172476
## libcon_fSlightly Conserv 0.4486 0.1258 3.565 0.000364 ***
## libcon_fConserv 0.6136 0.1074 5.715 1.09e-08 ***
## libcon_fExt Conserv 0.9016 0.1156 7.797 6.34e-15 ***
##
## Zero-inflation model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.3554 0.1299 10.431 < 2e-16 ***
## libcon_fExt Liberal 2.7849 0.7389 3.769 0.000164 ***
## libcon_fLiberal 0.4337 0.7827 0.554 0.579498
## libcon_fSlightly Liberal 0.8845 0.3300 2.680 0.007354 **
## libcon_fSlightly Conserv -0.7200 0.1845 -3.903 9.48e-05 ***
## libcon_fConserv -1.8679 0.1619 -11.539 < 2e-16 ***
## libcon_fExt Conserv -2.0641 0.2222 -9.289 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Number of iterations in BFGS optimization: 21
## Log-likelihood: -2314 on 14 Df
stargazer(fox_zinb, all_zinb, type='text', digits=3)
##
## =====================================================
## Dependent variable:
## ----------------------------
## fox_shows all_rw_shows
## (1) (2)
## -----------------------------------------------------
## libcon_fExt Liberal -4.635** 0.485
## (1.812) (0.532)
##
## libcon_fLiberal -1.199 -1.744**
## (1.002) (0.701)
##
## libcon_fSlightly Liberal -0.265 -0.411
## (0.516) (0.301)
##
## libcon_fSlightly Conserv 0.330 0.449***
## (0.235) (0.126)
##
## libcon_fConserv 0.494** 0.614***
## (0.196) (0.107)
##
## libcon_fExt Conserv 0.792*** 0.902***
## (0.211) (0.116)
##
## Constant -0.443** 0.318***
## (0.184) (0.101)
##
## -----------------------------------------------------
## Observations 3,146 2,762
## Log Likelihood -1,760.496 -2,313.545
## =====================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
With the zero-inflated modeling approaches, we see two sets of results for each DV:
Count Model which is modeling the count variable for cases that are > 0 on the count DV
Binomial Logit model which is modeling if the respondent consumed any Fox or right-wing media shows or not (1= consumed any while 0 = consumed zero shows)
Above, we estimate the model and in this instance we estimate the same model for the count and the binomial portion of the zero-inflated approach. This does not have to be the case. Oftentimes, there is theoretical reason to think that an X influences one part of the data generating process but not the other. In that instance, the models would be different, which is perfectly acceptable.
We can change the variables included in the models using the coding
approach below. In the zeroinfl
code, we specify two
separate models splitting them with the |
operator. Here,
we include political ideology, college education and biological sex as
predictors for the count portion of the model while we only include
political ideology in the binomial logit portion. This is because the
code starts with the count model before the |
and then
moves to the binomial logit model past that operator. We only have
libcon_f
specified past the |
operator so that
will be the only IV in the binomial logit portion of the model.
# Estimate the zero-inflated negative binomial model with separate predictors
zinb_model <- zeroinfl(all_rw_shows ~ libcon_f + college + female | libcon_f + college , data = anes)
summary(zinb_model)
##
## Call:
## zeroinfl(formula = all_rw_shows ~ libcon_f + college + female | libcon_f +
## college, data = anes)
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -1.0519 -0.3766 -0.2198 -0.1629 12.3646
##
## Count model coefficients (poisson with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.35934 0.10887 3.301 0.000965 ***
## libcon_fExt Liberal 0.52072 0.53244 0.978 0.328083
## libcon_fLiberal -1.79900 0.70799 -2.541 0.011054 *
## libcon_fSlightly Liberal -0.45763 0.30427 -1.504 0.132573
## libcon_fSlightly Conserv 0.44848 0.12777 3.510 0.000448 ***
## libcon_fConserv 0.61665 0.10932 5.641 1.69e-08 ***
## libcon_fExt Conserv 0.86424 0.11876 7.277 3.40e-13 ***
## college 0.04087 0.05590 0.731 0.464668
## female -0.15682 0.05528 -2.837 0.004558 **
##
## Zero-inflation model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.3193 0.1424 9.262 < 2e-16 ***
## libcon_fExt Liberal 2.7691 0.7392 3.746 0.00018 ***
## libcon_fLiberal 0.3389 0.8082 0.419 0.67492
## libcon_fSlightly Liberal 0.8172 0.3380 2.418 0.01562 *
## libcon_fSlightly Conserv -0.7264 0.1864 -3.897 9.75e-05 ***
## libcon_fConserv -1.8522 0.1639 -11.302 < 2e-16 ***
## libcon_fExt Conserv -2.0722 0.2291 -9.047 < 2e-16 ***
## college 0.0487 0.1195 0.407 0.68366
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Number of iterations in BFGS optimization: 23
## Log-likelihood: -2273 on 17 Df
Let’s calculated the predicted counts from the basic ZINB model that
only has political ideology included. Once again, we will use the
ggpredict
function. We can then easily graph the results
using ggplot
if we so choose.
# Summarize the model & calculate predicted counts
summary(fox_zinb)
##
## Call:
## zeroinfl(formula = fox_shows ~ libcon_f, data = anes)
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -0.7776 -0.4571 -0.1803 -0.1286 13.1113
##
## Count model coefficients (poisson with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.4428 0.1839 -2.407 0.01607 *
## libcon_fExt Liberal -4.6347 1.8122 -2.557 0.01054 *
## libcon_fLiberal -1.1986 1.0018 -1.196 0.23153
## libcon_fSlightly Liberal -0.2645 0.5161 -0.512 0.60830
## libcon_fSlightly Conserv 0.3299 0.2345 1.407 0.15947
## libcon_fConserv 0.4940 0.1961 2.520 0.01175 *
## libcon_fExt Conserv 0.7919 0.2106 3.760 0.00017 ***
##
## Zero-inflation model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.2830 0.2175 5.898 3.68e-09 ***
## libcon_fExt Liberal -5.0121 63.8681 -0.078 0.9374
## libcon_fLiberal 0.9108 1.0736 0.848 0.3962
## libcon_fSlightly Liberal 0.9664 0.5440 1.777 0.0756 .
## libcon_fSlightly Conserv -0.7348 0.3022 -2.432 0.0150 *
## libcon_fConserv -2.0007 0.2792 -7.165 7.76e-13 ***
## libcon_fExt Conserv -1.8696 0.3278 -5.704 1.17e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Number of iterations in BFGS optimization: 59
## Log-likelihood: -1760 on 14 Df
fz<-ggpredict(fox_zinb, terms="libcon_f")
print(fz)
## # Predicted counts of fox_shows
##
## libcon_f | Predicted | 95% CI
## -------------------------------------------
## Moderate | 0.64 | [0.45, 0.92]
## Ext Liberal | 0.01 | [0.00, 0.21]
## Liberal | 0.19 | [0.03, 1.33]
## Slightly Liberal | 0.49 | [0.19, 1.27]
## Slightly Conserv | 0.89 | [0.67, 1.19]
## Conserv | 1.05 | [0.92, 1.20]
## Ext Conserv | 1.42 | [1.16, 1.73]
summary(all_zinb)
##
## Call:
## zeroinfl(formula = all_rw_shows ~ libcon_f, data = anes)
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -1.0353 -0.3669 -0.2193 -0.1228 11.8598
##
## Count model coefficients (poisson with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.3181 0.1008 3.155 0.001603 **
## libcon_fExt Liberal 0.4846 0.5325 0.910 0.362782
## libcon_fLiberal -1.7437 0.7012 -2.487 0.012892 *
## libcon_fSlightly Liberal -0.4112 0.3014 -1.364 0.172476
## libcon_fSlightly Conserv 0.4486 0.1258 3.565 0.000364 ***
## libcon_fConserv 0.6136 0.1074 5.715 1.09e-08 ***
## libcon_fExt Conserv 0.9016 0.1156 7.797 6.34e-15 ***
##
## Zero-inflation model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.3554 0.1299 10.431 < 2e-16 ***
## libcon_fExt Liberal 2.7849 0.7389 3.769 0.000164 ***
## libcon_fLiberal 0.4337 0.7827 0.554 0.579498
## libcon_fSlightly Liberal 0.8845 0.3300 2.680 0.007354 **
## libcon_fSlightly Conserv -0.7200 0.1845 -3.903 9.48e-05 ***
## libcon_fConserv -1.8679 0.1619 -11.539 < 2e-16 ***
## libcon_fExt Conserv -2.0641 0.2222 -9.289 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Number of iterations in BFGS optimization: 21
## Log-likelihood: -2314 on 14 Df
az<-ggpredict(all_zinb, terms="libcon_f")
print(az)
## # Predicted counts of all_rw_shows
##
## libcon_f | Predicted | 95% CI
## -------------------------------------------
## Moderate | 1.37 | [1.13, 1.67]
## Ext Liberal | 2.23 | [0.80, 6.22]
## Liberal | 0.24 | [0.06, 0.94]
## Slightly Liberal | 0.91 | [0.52, 1.59]
## Slightly Conserv | 2.15 | [1.86, 2.50]
## Conserv | 2.54 | [2.36, 2.73]
## Ext Conserv | 3.39 | [3.03, 3.78]
Using the zero-inflated models, we calculated the predicted counts of right-wing media shows consumed. If you were paying close attention to the predicted values from the previous count modeling approaches, you might recognize that these predicted values are considerably higher than before. These values are closer to being the predicted number of right wing media shows consumed for people who consumed at least 1 show or not. While that is not entirely true since the math is a bit more complicated in the combination of the two models, but conceptually that is a decent way to think about these numbers.
Let’s look at the mean number of right wing shows consumed by ideology both overall and for respondents who consumed at least 1 right-wing show.
anes %>%
group_by(libcon_f) %>%
summarize(mean = mean(all_rw_shows, na.rm = TRUE))
anes %>%
filter(all_rw_shows > 0) %>%
group_by(libcon_f) %>%
summarize(mean = mean(all_rw_shows))
These means largely track with the predicted count means from both the Negative-Binomial model (first set of means above) and the Zero-Inflated Negative-Binomial model (second set of means).