Assignment 6

Introduction

This assignment explores count data models using the Diabetes Health Indicators Dataset. This analysis focuses on modeling the number of mentally unhealthy days reported in the past 30 days, using health and lifestyle predictors such as age, high blood pressure, and physical activity.

Data Preparation

Load Packages

library(tidyverse)
library(janitor)
library(AER)
library(MASS)
library(pscl)
library(clarify)
library(texreg)
library(dplyr)

Load and Clean Dataset

diabetes <- read_csv("Diabetes Health Indicators.csv")
diabetes <- clean_names(diabetes)

# Select relevant variables
diabetes_model <- diabetes %>%
  dplyr::select(ment_hlth, high_bp, age, phys_activity) %>%
  drop_na() %>%
  mutate(
    high_bp = factor(high_bp, labels = c("No", "Yes")),
    phys_activity = factor(phys_activity, labels = c("No", "Yes")),
    age = as.numeric(age)
  )

# Preview
glimpse(diabetes_model)
## Rows: 253,680
## Columns: 4
## $ ment_hlth     <dbl> 18, 0, 30, 0, 3, 0, 0, 0, 30, 0, 0, 0, 0, 0, 30, 5, 0, 0…
## $ high_bp       <fct> Yes, No, Yes, Yes, Yes, Yes, Yes, Yes, Yes, No, No, Yes,…
## $ age           <dbl> 9, 7, 9, 11, 11, 10, 9, 11, 9, 8, 13, 10, 7, 11, 4, 6, 1…
## $ phys_activity <fct> No, Yes, No, Yes, Yes, Yes, No, Yes, No, No, Yes, No, No…

The summary statistics provide context on key variables. The ment_hlth variable is right-skewed with a median of 0 days, while age ranges from 1 to 13, reflecting coded age groups.

Exploratory Analysis

Outcome Distribution

ggplot(diabetes_model, aes(x = ment_hlth)) +
  geom_histogram(binwidth = 1, fill = "steelblue") +
  labs(title = "Distribution of Mentally Unhealthy Days", x = "Days", y = "Frequency") +
  theme_minimal()

Summary Statistics

# Print numeric summary of key variables
summary(diabetes_model)
##    ment_hlth      high_bp           age         phys_activity
##  Min.   : 0.000   No :144851   Min.   : 1.000   No : 61760   
##  1st Qu.: 0.000   Yes:108829   1st Qu.: 6.000   Yes:191920   
##  Median : 0.000                Median : 8.000                
##  Mean   : 3.185                Mean   : 8.032                
##  3rd Qu.: 2.000                3rd Qu.:10.000                
##  Max.   :30.000                Max.   :13.000

Model Building

Poisson Regression

poisson_model <- glm(ment_hlth ~ high_bp + age + phys_activity,
                     family = poisson(link = "log"),
                     data = diabetes_model)
summary(poisson_model)
## 
## Call:
## glm(formula = ment_hlth ~ high_bp + age + phys_activity, family = poisson(link = "log"), 
##     data = diabetes_model)
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       2.170605   0.003361   645.9   <2e-16 ***
## high_bpYes        0.415205   0.002419   171.7   <2e-16 ***
## age              -0.101482   0.000384  -264.3   <2e-16 ***
## phys_activityYes -0.606206   0.002335  -259.6   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 2634293  on 253679  degrees of freedom
## Residual deviance: 2495199  on 253676  degrees of freedom
## AIC: 2781315
## 
## Number of Fisher Scoring iterations: 7

Overdispersion Test

dispersiontest(poisson_model)
## 
##  Overdispersion test
## 
## data:  poisson_model
## z = 150.57, p-value < 2.2e-16
## alternative hypothesis: true dispersion is greater than 1
## sample estimates:
## dispersion 
##   16.69128

A significant p-value suggests overdispersion. This justifies exploring a Negative Binomial model.

Negative Binomial Regression Model

nb_model <- MASS::glm.nb(ment_hlth ~ high_bp + age + phys_activity, data = diabetes_model)
summary(nb_model)
## 
## Call:
## MASS::glm.nb(formula = ment_hlth ~ high_bp + age + phys_activity, 
##     data = diabetes_model, init.theta = 0.1172691729, link = log)
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       2.260537   0.020418  110.71   <2e-16 ***
## high_bpYes        0.391771   0.012816   30.57   <2e-16 ***
## age              -0.109854   0.002069  -53.09   <2e-16 ***
## phys_activityYes -0.627640   0.013854  -45.30   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(0.1173) family taken to be 1)
## 
##     Null deviance: 166609  on 253679  degrees of freedom
## Residual deviance: 161627  on 253676  degrees of freedom
## AIC: 825431
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  0.117269 
##           Std. Err.:  0.000521 
## 
##  2 x log-likelihood:  -825420.863000

Model Comparison

htmlreg(list(poisson_model, nb_model),
        custom.model.names = c("Poisson", "Negative Binomial"),
        caption = "Model Comparison: Poisson vs Negative Binomial")
## <table class="texreg" style="margin: 10px auto;border-collapse: collapse;border-spacing: 0px;caption-side: bottom;color: #000000;border-top: 2px solid #000000;">
## <caption>Model Comparison: Poisson vs Negative Binomial</caption>
## <thead>
## <tr>
## <th style="padding-left: 5px;padding-right: 5px;">&nbsp;</th>
## <th style="padding-left: 5px;padding-right: 5px;">Poisson</th>
## <th style="padding-left: 5px;padding-right: 5px;">Negative Binomial</th>
## </tr>
## </thead>
## <tbody>
## <tr style="border-top: 1px solid #000000;">
## <td style="padding-left: 5px;padding-right: 5px;">(Intercept)</td>
## <td style="padding-left: 5px;padding-right: 5px;">2.17<sup>&#42;&#42;&#42;</sup></td>
## <td style="padding-left: 5px;padding-right: 5px;">2.26<sup>&#42;&#42;&#42;</sup></td>
## </tr>
## <tr>
## <td style="padding-left: 5px;padding-right: 5px;">&nbsp;</td>
## <td style="padding-left: 5px;padding-right: 5px;">(0.00)</td>
## <td style="padding-left: 5px;padding-right: 5px;">(0.02)</td>
## </tr>
## <tr>
## <td style="padding-left: 5px;padding-right: 5px;">high_bpYes</td>
## <td style="padding-left: 5px;padding-right: 5px;">0.42<sup>&#42;&#42;&#42;</sup></td>
## <td style="padding-left: 5px;padding-right: 5px;">0.39<sup>&#42;&#42;&#42;</sup></td>
## </tr>
## <tr>
## <td style="padding-left: 5px;padding-right: 5px;">&nbsp;</td>
## <td style="padding-left: 5px;padding-right: 5px;">(0.00)</td>
## <td style="padding-left: 5px;padding-right: 5px;">(0.01)</td>
## </tr>
## <tr>
## <td style="padding-left: 5px;padding-right: 5px;">age</td>
## <td style="padding-left: 5px;padding-right: 5px;">-0.10<sup>&#42;&#42;&#42;</sup></td>
## <td style="padding-left: 5px;padding-right: 5px;">-0.11<sup>&#42;&#42;&#42;</sup></td>
## </tr>
## <tr>
## <td style="padding-left: 5px;padding-right: 5px;">&nbsp;</td>
## <td style="padding-left: 5px;padding-right: 5px;">(0.00)</td>
## <td style="padding-left: 5px;padding-right: 5px;">(0.00)</td>
## </tr>
## <tr>
## <td style="padding-left: 5px;padding-right: 5px;">phys_activityYes</td>
## <td style="padding-left: 5px;padding-right: 5px;">-0.61<sup>&#42;&#42;&#42;</sup></td>
## <td style="padding-left: 5px;padding-right: 5px;">-0.63<sup>&#42;&#42;&#42;</sup></td>
## </tr>
## <tr>
## <td style="padding-left: 5px;padding-right: 5px;">&nbsp;</td>
## <td style="padding-left: 5px;padding-right: 5px;">(0.00)</td>
## <td style="padding-left: 5px;padding-right: 5px;">(0.01)</td>
## </tr>
## <tr style="border-top: 1px solid #000000;">
## <td style="padding-left: 5px;padding-right: 5px;">AIC</td>
## <td style="padding-left: 5px;padding-right: 5px;">2781315.29</td>
## <td style="padding-left: 5px;padding-right: 5px;">825430.86</td>
## </tr>
## <tr>
## <td style="padding-left: 5px;padding-right: 5px;">BIC</td>
## <td style="padding-left: 5px;padding-right: 5px;">2781357.06</td>
## <td style="padding-left: 5px;padding-right: 5px;">825483.08</td>
## </tr>
## <tr>
## <td style="padding-left: 5px;padding-right: 5px;">Log Likelihood</td>
## <td style="padding-left: 5px;padding-right: 5px;">-1390653.64</td>
## <td style="padding-left: 5px;padding-right: 5px;">-412710.43</td>
## </tr>
## <tr>
## <td style="padding-left: 5px;padding-right: 5px;">Deviance</td>
## <td style="padding-left: 5px;padding-right: 5px;">2495199.39</td>
## <td style="padding-left: 5px;padding-right: 5px;">161626.84</td>
## </tr>
## <tr style="border-bottom: 2px solid #000000;">
## <td style="padding-left: 5px;padding-right: 5px;">Num. obs.</td>
## <td style="padding-left: 5px;padding-right: 5px;">253680</td>
## <td style="padding-left: 5px;padding-right: 5px;">253680</td>
## </tr>
## </tbody>
## <tfoot>
## <tr>
## <td style="font-size: 0.8em;" colspan="3"><sup>&#42;&#42;&#42;</sup>p &lt; 0.001; <sup>&#42;&#42;</sup>p &lt; 0.01; <sup>&#42;</sup>p &lt; 0.05</td>
## </tr>
## </tfoot>
## </table>

Simulation-Based Interpretation

Average Marginal Effects (AME)

options(clarify.sim = 500)
nb_sim <- sim(nb_model)

ame_bp <- sim_ame(nb_sim, var = "high_bp", contrast = "rd")
summary(ame_bp)
##           Estimate 2.5 % 97.5 %
## E[Y(No)]      2.69  2.65   2.74
## E[Y(Yes)]     3.99  3.91   4.06
## RD            1.29  1.20   1.39
ame_phys <- sim_ame(nb_sim, var = "phys_activity", contrast = "rd")
summary(ame_phys)
##           Estimate 2.5 % 97.5 %
## E[Y(No)]      4.95  4.84   5.07
## E[Y(Yes)]     2.64  2.60   2.68
## RD           -2.31 -2.44  -2.19

Dose-Response Relationship for Age

# Create a sequence of age values for prediction
age_seq <- seq(min(diabetes_model$age), max(diabetes_model$age), by = 1)

# Create a new dataframe for prediction - hold other variables constant
new_data <- data.frame(
  age = age_seq,
  high_bp = factor("Yes", levels = c("No", "Yes")),
  phys_activity = factor("No", levels = c("No", "Yes"))
)

# Get predicted values and confidence intervals
new_data$fit <- predict(nb_model, newdata = new_data, type = "response")
pred_se <- predict(nb_model, newdata = new_data, type = "link", se.fit = TRUE)$se.fit

# Confidence intervals (on response scale)
new_data$lower <- qpois(0.025, lambda = new_data$fit)
new_data$upper <- qpois(0.975, lambda = new_data$fit)

# Plot
ggplot(new_data, aes(x = age, y = fit)) +
  geom_line(color = "blue") +
  geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.2) +
  labs(
    title = "Predicted Mentally Unhealthy Days by Age",
    x = "Age", y = "Predicted Mentally Unhealthy Days"
  ) +
  theme_minimal()

Interpretation of Results

  • Overdispersion Detected: The dispersion test confirmed overdispersion in the Poisson model, which validates the use of the Negative Binomial model.

  • Physical Activity: Simulation results show that individuals who are physically active report ~2.3 fewer mentally unhealthy days on average.

  • High Blood Pressure: Respondents with high blood pressure experience ~1.3 more mentally unhealthy days, on average.

  • Age: The predicted number of unhealthy days increases with age, suggesting a dose-response effect.

Why This Matters

Simulations allow us to move beyond raw coefficients and express effects in real-world terms (like number of days), with confidence intervals. This makes interpretation more intuitive and communication of results more accessible to wider audiences.

Conclusion

Count data models like Poisson and Negative Binomial are powerful tools in public health research. In this analysis, we demonstrated how health and lifestyle factors such as high blood pressure, physical activity, and age influence mental health outcomes. The clarify package helped communicate effects more clearly through simulation and visualization. Future work could extend this model by examining interaction effects or additional predictors. These findings highlight the importance of preventative strategies—such as promoting physical activity—to mitigate poor mental health outcomes. Identifying at-risk populations (e.g., older adults with high blood pressure) allows for more targeted interventions.