This assignment explores multilevel count data modeling using the
Diabetes Health Indicators dataset. Specifically, we model the number of
mentally unhealthy days (ment_hlth
) while accounting for
individual characteristics such as high blood pressure, age, and
physical activity. We incorporate income level as a grouping variable to
investigate whether the effects of predictors vary across income
brackets.
We compare standard Poisson and Negative Binomial models to their multilevel counterparts to address potential overdispersion and unobserved heterogeneity across income groups.
library(tidyverse)
library(janitor)
library(lme4)
library(MASS)
library(glmmTMB)
library(ggplot2)
library(texreg)
library(broom.mixed)
# Load and clean data
diabetes <- read_csv("Diabetes Health Indicators.csv")
diabetes <- clean_names(diabetes)
# Prepare relevant variables
diabetes_model <- diabetes %>%
dplyr::select(ment_hlth, high_bp, age, phys_activity, income) %>%
drop_na() %>%
mutate(
high_bp = factor(high_bp, labels = c("No", "Yes")),
phys_activity = factor(phys_activity, labels = c("No", "Yes")),
income = as.factor(income),
age = as.numeric(age),
age_z = scale(age)
)
Age was standardized to improve model convergence and interpretability of interaction terms in multilevel models.
## ment_hlth high_bp age phys_activity income
## Min. : 0.000 No :144851 Min. : 1.000 No : 61760 8 :90385
## 1st Qu.: 0.000 Yes:108829 1st Qu.: 6.000 Yes:191920 7 :43219
## Median : 0.000 Median : 8.000 6 :36470
## Mean : 3.185 Mean : 8.032 5 :25883
## 3rd Qu.: 2.000 3rd Qu.:10.000 4 :20135
## Max. :30.000 Max. :13.000 3 :15994
## (Other):21594
## age_z.V1
## Min. :-2.3024269
## 1st Qu.:-0.6653479
## Median :-0.0105163
## Mean : 0.0000000
## 3rd Qu.: 0.6443152
## Max. : 1.6265626
##
poisson_model <- glm(ment_hlth ~ high_bp + age_z + phys_activity,
family = poisson(link = "log"),
data = diabetes_model)
summary(poisson_model)
##
## Call:
## glm(formula = ment_hlth ~ high_bp + age_z + phys_activity, family = poisson(link = "log"),
## data = diabetes_model)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.355486 0.002338 579.7 <2e-16 ***
## high_bpYes 0.415205 0.002419 171.7 <2e-16 ***
## age_z -0.309949 0.001173 -264.3 <2e-16 ***
## phys_activityYes -0.606207 0.002335 -259.6 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 2634293 on 253679 degrees of freedom
## Residual deviance: 2495199 on 253676 degrees of freedom
## AIC: 2781315
##
## Number of Fisher Scoring iterations: 7
##
## Overdispersion test
##
## data: poisson_model
## z = 150.57, p-value < 2.2e-16
## alternative hypothesis: true dispersion is greater than 1
## sample estimates:
## dispersion
## 16.69128
The significant p-value indicates overdispersion. This justifies using Negative Binomial and multilevel models.
nb_model <- MASS::glm.nb(ment_hlth ~ high_bp + age_z + phys_activity,
data = diabetes_model)
summary(nb_model)
##
## Call:
## MASS::glm.nb(formula = ment_hlth ~ high_bp + age_z + phys_activity,
## data = diabetes_model, init.theta = 0.1172691729, link = log)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.37817 0.01363 101.08 <2e-16 ***
## high_bpYes 0.39177 0.01282 30.57 <2e-16 ***
## age_z -0.33552 0.00632 -53.09 <2e-16 ***
## phys_activityYes -0.62764 0.01385 -45.30 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(0.1173) family taken to be 1)
##
## Null deviance: 166609 on 253679 degrees of freedom
## Residual deviance: 161627 on 253676 degrees of freedom
## AIC: 825431
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 0.117269
## Std. Err.: 0.000521
##
## 2 x log-likelihood: -825420.863000
ml_pois <- glmmTMB(
ment_hlth ~ high_bp + age + phys_activity + (1 | income),
family = poisson(link = "log"),
data = diabetes_model
)
summary(ml_pois)
## Family: poisson ( log )
## Formula: ment_hlth ~ high_bp + age + phys_activity + (1 | income)
## Data: diabetes_model
##
## AIC BIC logLik deviance df.resid
## 2642755 2642807 -1321373 2642745 253675
##
## Random effects:
##
## Conditional model:
## Groups Name Variance Std.Dev.
## income (Intercept) 0.177 0.4207
## Number of obs: 253680, groups: income, 8
##
## Conditional model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.320757 0.148767 15.6 <2e-16 ***
## high_bpYes 0.299868 0.002443 122.8 <2e-16 ***
## age -0.103150 0.000375 -275.1 <2e-16 ***
## phys_activityYes -0.436660 0.002380 -183.5 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ml_nb <- glmmTMB(ment_hlth ~ high_bp + age_z + phys_activity + (1 | income),
family = nbinom2(link = "log"),
data = diabetes_model)
summary(ml_nb)
## Family: nbinom2 ( log )
## Formula: ment_hlth ~ high_bp + age_z + phys_activity + (1 | income)
## Data: diabetes_model
##
## AIC BIC logLik deviance df.resid
## 820695.1 820757.7 -410341.5 820683.1 253674
##
## Random effects:
##
## Conditional model:
## Groups Name Variance Std.Dev.
## income (Intercept) 0.1851 0.4302
## Number of obs: 253680, groups: income, 8
##
## Dispersion parameter for nbinom2 family (): 0.123
##
## Conditional model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.535125 0.152726 10.05 <2e-16 ***
## high_bpYes 0.271289 0.012480 21.74 <2e-16 ***
## age_z -0.363171 0.006608 -54.96 <2e-16 ***
## phys_activityYes -0.480944 0.013752 -34.97 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
screenreg(list(poisson_model, nb_model, ml_pois, ml_nb),
custom.model.names = c("Poisson", "Neg Binomial", "ML Poisson", "ML NB"),
caption = "Comparison of Count Models")
##
## =========================================================================================
## Poisson Neg Binomial ML Poisson ML NB
## -----------------------------------------------------------------------------------------
## (Intercept) 1.36 *** 1.38 *** 2.32 *** 1.54 ***
## (0.00) (0.01) (0.15) (0.15)
## high_bpYes 0.42 *** 0.39 *** 0.30 *** 0.27 ***
## (0.00) (0.01) (0.00) (0.01)
## age_z -0.31 *** -0.34 *** -0.36 ***
## (0.00) (0.01) (0.01)
## phys_activityYes -0.61 *** -0.63 *** -0.44 *** -0.48 ***
## (0.00) (0.01) (0.00) (0.01)
## age -0.10 ***
## (0.00)
## -----------------------------------------------------------------------------------------
## AIC 2781315.29 825430.86 2642755.16 820695.06
## BIC 2781357.06 825483.08
## Log Likelihood -1390653.64 -412710.43 -1321372.58 -410341.53
## Deviance 2495199.39 161626.84
## Num. obs. 253680 253680 253680 253680
## Num. groups: income 8 8
## Var: income (Intercept) 0.18 0.19
## =========================================================================================
## *** p < 0.001; ** p < 0.01; * p < 0.05
## df AIC
## poisson_model 4 2781315.3
## nb_model 5 825430.9
## ml_pois 5 2642755.2
## ml_nb 6 820695.1
## df BIC
## poisson_model 4 2781357.1
## nb_model 5 825483.1
## ml_pois 5 2642807.4
## ml_nb 6 820757.7
## (Intercept)
## 1 0.60274142
## 2 0.51491233
## 3 0.26973089
## 4 0.12153457
## 5 -0.08703087
## 6 -0.31456660
## 7 -0.41458183
## 8 -0.69437084
tidy(ml_nb, effects = "ran_vals") %>%
filter(group == "income") %>%
ggplot(aes(x = term, y = estimate)) +
geom_point() +
coord_flip() +
labs(title = "Random Intercepts by Income Group",
x = "Income Group", y = "Intercept Estimate") +
theme_minimal()
The spread of random intercepts across income groups suggests substantial group-level variability, reinforcing the value of a multilevel modeling approach.
The dispersion test revealed overdispersion, indicating that the Poisson model is not an appropriate fit. This supports the use of the Negative Binomial and multilevel models to better capture variability in the data.
Physical activity significantly reduces the expected number of mentally unhealthy days across all models.
High blood pressure increases the number of mentally unhealthy days.
The multilevel models account for unobserved heterogeneity across income groups and demonstrate better fit through AIC comparisons.
Given its ability to account for both overdispersion and group-level variation, the multilevel Negative Binomial model offers the most appropriate and reliable fit for the data.
Multilevel modeling reveals that health-related predictors of mental health vary across income groups, suggesting that interventions should be income-sensitive. Promoting physical activity and managing high blood pressure could be especially important strategies, particularly for low-income groups that may face higher mental health burdens.
This assignment demonstrated the application of both standard and multilevel count models to a large-scale public health dataset. By addressing overdispersion and incorporating income-level variation, we were able to more accurately model mentally unhealthy days and extract meaningful policy insights.
Overall, accounting for income-level differences using a multilevel framework provided a more nuanced understanding of mental health disparities.