Introduction

This assignment explores multilevel count data modeling using the Diabetes Health Indicators dataset. Specifically, we model the number of mentally unhealthy days (ment_hlth) while accounting for individual characteristics such as high blood pressure, age, and physical activity. We incorporate income level as a grouping variable to investigate whether the effects of predictors vary across income brackets.

We compare standard Poisson and Negative Binomial models to their multilevel counterparts to address potential overdispersion and unobserved heterogeneity across income groups.

Data Preparation

library(tidyverse)
library(janitor)
library(lme4)
library(MASS)
library(glmmTMB)
library(ggplot2)
library(texreg)
library(broom.mixed)
# Load and clean data
diabetes <- read_csv("Diabetes Health Indicators.csv")
diabetes <- clean_names(diabetes)

# Prepare relevant variables
diabetes_model <- diabetes %>%
  dplyr::select(ment_hlth, high_bp, age, phys_activity, income) %>%
  drop_na() %>%
  mutate(
    high_bp = factor(high_bp, labels = c("No", "Yes")),
    phys_activity = factor(phys_activity, labels = c("No", "Yes")),
    income = as.factor(income),
age = as.numeric(age),
age_z = scale(age)
  )

Age was standardized to improve model convergence and interpretability of interaction terms in multilevel models.

Exploratory Data Analysis

Summary Statistics

summary(diabetes_model)
##    ment_hlth      high_bp           age         phys_activity     income     
##  Min.   : 0.000   No :144851   Min.   : 1.000   No : 61760    8      :90385  
##  1st Qu.: 0.000   Yes:108829   1st Qu.: 6.000   Yes:191920    7      :43219  
##  Median : 0.000                Median : 8.000                 6      :36470  
##  Mean   : 3.185                Mean   : 8.032                 5      :25883  
##  3rd Qu.: 2.000                3rd Qu.:10.000                 4      :20135  
##  Max.   :30.000                Max.   :13.000                 3      :15994  
##                                                               (Other):21594  
##        age_z.V1      
##  Min.   :-2.3024269  
##  1st Qu.:-0.6653479  
##  Median :-0.0105163  
##  Mean   : 0.0000000  
##  3rd Qu.: 0.6443152  
##  Max.   : 1.6265626  
## 

Outcome Distribution

ggplot(diabetes_model, aes(x = ment_hlth)) +
  geom_histogram(binwidth = 1, fill = "cadetblue") +
  labs(title = "Distribution of Mentally Unhealthy Days", x = "Days", y = "Frequency") +
  theme_minimal()

Baseline Count Models

Standard Poisson Model

poisson_model <- glm(ment_hlth ~ high_bp + age_z + phys_activity,
                     family = poisson(link = "log"),
                     data = diabetes_model)
summary(poisson_model)
## 
## Call:
## glm(formula = ment_hlth ~ high_bp + age_z + phys_activity, family = poisson(link = "log"), 
##     data = diabetes_model)
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       1.355486   0.002338   579.7   <2e-16 ***
## high_bpYes        0.415205   0.002419   171.7   <2e-16 ***
## age_z            -0.309949   0.001173  -264.3   <2e-16 ***
## phys_activityYes -0.606207   0.002335  -259.6   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 2634293  on 253679  degrees of freedom
## Residual deviance: 2495199  on 253676  degrees of freedom
## AIC: 2781315
## 
## Number of Fisher Scoring iterations: 7

Overdispersion Test

library(AER)
dispersiontest(poisson_model)
## 
##  Overdispersion test
## 
## data:  poisson_model
## z = 150.57, p-value < 2.2e-16
## alternative hypothesis: true dispersion is greater than 1
## sample estimates:
## dispersion 
##   16.69128

The significant p-value indicates overdispersion. This justifies using Negative Binomial and multilevel models.

Negative Binomial Model

nb_model <- MASS::glm.nb(ment_hlth ~ high_bp + age_z + phys_activity,
                         data = diabetes_model)
summary(nb_model)
## 
## Call:
## MASS::glm.nb(formula = ment_hlth ~ high_bp + age_z + phys_activity, 
##     data = diabetes_model, init.theta = 0.1172691729, link = log)
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       1.37817    0.01363  101.08   <2e-16 ***
## high_bpYes        0.39177    0.01282   30.57   <2e-16 ***
## age_z            -0.33552    0.00632  -53.09   <2e-16 ***
## phys_activityYes -0.62764    0.01385  -45.30   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(0.1173) family taken to be 1)
## 
##     Null deviance: 166609  on 253679  degrees of freedom
## Residual deviance: 161627  on 253676  degrees of freedom
## AIC: 825431
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  0.117269 
##           Std. Err.:  0.000521 
## 
##  2 x log-likelihood:  -825420.863000

Multilevel Count Models (Grouped by Income)

Multilevel Poisson Model

ml_pois <- glmmTMB(
  ment_hlth ~ high_bp + age + phys_activity + (1 | income),
  family = poisson(link = "log"),
  data = diabetes_model
)
summary(ml_pois)
##  Family: poisson  ( log )
## Formula:          ment_hlth ~ high_bp + age + phys_activity + (1 | income)
## Data: diabetes_model
## 
##      AIC      BIC   logLik deviance df.resid 
##  2642755  2642807 -1321373  2642745   253675 
## 
## Random effects:
## 
## Conditional model:
##  Groups Name        Variance Std.Dev.
##  income (Intercept) 0.177    0.4207  
## Number of obs: 253680, groups:  income, 8
## 
## Conditional model:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       2.320757   0.148767    15.6   <2e-16 ***
## high_bpYes        0.299868   0.002443   122.8   <2e-16 ***
## age              -0.103150   0.000375  -275.1   <2e-16 ***
## phys_activityYes -0.436660   0.002380  -183.5   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Multilevel Negative Binomial Model

ml_nb <- glmmTMB(ment_hlth ~ high_bp + age_z + phys_activity + (1 | income),
                 family = nbinom2(link = "log"),
                 data = diabetes_model)
summary(ml_nb)
##  Family: nbinom2  ( log )
## Formula:          ment_hlth ~ high_bp + age_z + phys_activity + (1 | income)
## Data: diabetes_model
## 
##       AIC       BIC    logLik  deviance  df.resid 
##  820695.1  820757.7 -410341.5  820683.1    253674 
## 
## Random effects:
## 
## Conditional model:
##  Groups Name        Variance Std.Dev.
##  income (Intercept) 0.1851   0.4302  
## Number of obs: 253680, groups:  income, 8
## 
## Dispersion parameter for nbinom2 family (): 0.123 
## 
## Conditional model:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       1.535125   0.152726   10.05   <2e-16 ***
## high_bpYes        0.271289   0.012480   21.74   <2e-16 ***
## age_z            -0.363171   0.006608  -54.96   <2e-16 ***
## phys_activityYes -0.480944   0.013752  -34.97   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Model Comparison

screenreg(list(poisson_model, nb_model, ml_pois, ml_nb),
          custom.model.names = c("Poisson", "Neg Binomial", "ML Poisson", "ML NB"),
          caption = "Comparison of Count Models")
## 
## =========================================================================================
##                          Poisson          Neg Binomial    ML Poisson       ML NB         
## -----------------------------------------------------------------------------------------
## (Intercept)                     1.36 ***        1.38 ***         2.32 ***        1.54 ***
##                                (0.00)          (0.01)           (0.15)          (0.15)   
## high_bpYes                      0.42 ***        0.39 ***         0.30 ***        0.27 ***
##                                (0.00)          (0.01)           (0.00)          (0.01)   
## age_z                          -0.31 ***       -0.34 ***                        -0.36 ***
##                                (0.00)          (0.01)                           (0.01)   
## phys_activityYes               -0.61 ***       -0.63 ***        -0.44 ***       -0.48 ***
##                                (0.00)          (0.01)           (0.00)          (0.01)   
## age                                                             -0.10 ***                
##                                                                 (0.00)                   
## -----------------------------------------------------------------------------------------
## AIC                       2781315.29       825430.86       2642755.16       820695.06    
## BIC                       2781357.06       825483.08                                     
## Log Likelihood           -1390653.64      -412710.43      -1321372.58      -410341.53    
## Deviance                  2495199.39       161626.84                                     
## Num. obs.                  253680          253680           253680          253680       
## Num. groups: income                                              8               8       
## Var: income (Intercept)                                          0.18            0.19    
## =========================================================================================
## *** p < 0.001; ** p < 0.01; * p < 0.05

Model Fit Comparison (AIC/BIC)

AIC(poisson_model, nb_model, ml_pois, ml_nb)
##               df       AIC
## poisson_model  4 2781315.3
## nb_model       5  825430.9
## ml_pois        5 2642755.2
## ml_nb          6  820695.1
BIC(poisson_model, nb_model, ml_pois, ml_nb)
##               df       BIC
## poisson_model  4 2781357.1
## nb_model       5  825483.1
## ml_pois        5 2642807.4
## ml_nb          6  820757.7

Random Intercepts by Income Group

ranef(ml_nb)$cond$income
##   (Intercept)
## 1  0.60274142
## 2  0.51491233
## 3  0.26973089
## 4  0.12153457
## 5 -0.08703087
## 6 -0.31456660
## 7 -0.41458183
## 8 -0.69437084

Random Effects Summary

tidy(ml_nb, effects = "ran_vals") %>%
  filter(group == "income") %>%
  ggplot(aes(x = term, y = estimate)) +
  geom_point() +
  coord_flip() +
  labs(title = "Random Intercepts by Income Group",
       x = "Income Group", y = "Intercept Estimate") +
  theme_minimal()

The spread of random intercepts across income groups suggests substantial group-level variability, reinforcing the value of a multilevel modeling approach.

Key Findings and Interpretation

  • The dispersion test revealed overdispersion, indicating that the Poisson model is not an appropriate fit. This supports the use of the Negative Binomial and multilevel models to better capture variability in the data.

  • Physical activity significantly reduces the expected number of mentally unhealthy days across all models.

  • High blood pressure increases the number of mentally unhealthy days.

  • The multilevel models account for unobserved heterogeneity across income groups and demonstrate better fit through AIC comparisons.

  • Given its ability to account for both overdispersion and group-level variation, the multilevel Negative Binomial model offers the most appropriate and reliable fit for the data.

Implications for Public Health

Multilevel modeling reveals that health-related predictors of mental health vary across income groups, suggesting that interventions should be income-sensitive. Promoting physical activity and managing high blood pressure could be especially important strategies, particularly for low-income groups that may face higher mental health burdens.

Conclusion

This assignment demonstrated the application of both standard and multilevel count models to a large-scale public health dataset. By addressing overdispersion and incorporating income-level variation, we were able to more accurately model mentally unhealthy days and extract meaningful policy insights.

Overall, accounting for income-level differences using a multilevel framework provided a more nuanced understanding of mental health disparities.