Assignment 6
Introduction
This assignment explores count data models using the Diabetes Health Indicators Dataset. This analysis focuses on modeling the number of mentally unhealthy days reported in the past 30 days, using health and lifestyle predictors such as age, high blood pressure, and physical activity.
Data Preparation
Load Packages
Load and Clean Dataset
diabetes <- read_csv("Diabetes Health Indicators.csv")
diabetes <- clean_names(diabetes)
# Select relevant variables
diabetes_model <- diabetes %>%
dplyr::select(ment_hlth, high_bp, age, phys_activity) %>%
drop_na() %>%
mutate(
high_bp = factor(high_bp, labels = c("No", "Yes")),
phys_activity = factor(phys_activity, labels = c("No", "Yes")),
age = as.numeric(age)
)
# Preview
glimpse(diabetes_model)
## Rows: 253,680
## Columns: 4
## $ ment_hlth <dbl> 18, 0, 30, 0, 3, 0, 0, 0, 30, 0, 0, 0, 0, 0, 30, 5, 0, 0…
## $ high_bp <fct> Yes, No, Yes, Yes, Yes, Yes, Yes, Yes, Yes, No, No, Yes,…
## $ age <dbl> 9, 7, 9, 11, 11, 10, 9, 11, 9, 8, 13, 10, 7, 11, 4, 6, 1…
## $ phys_activity <fct> No, Yes, No, Yes, Yes, Yes, No, Yes, No, No, Yes, No, No…
The summary statistics provide context on key variables. The
ment_hlth
variable is right-skewed with a median of 0 days,
while age ranges from 1 to 13, reflecting coded age groups.
Exploratory Analysis
Outcome Distribution
ggplot(diabetes_model, aes(x = ment_hlth)) +
geom_histogram(binwidth = 1, fill = "steelblue") +
labs(title = "Distribution of Mentally Unhealthy Days", x = "Days", y = "Frequency") +
theme_minimal()
Summary Statistics
## ment_hlth high_bp age phys_activity
## Min. : 0.000 No :144851 Min. : 1.000 No : 61760
## 1st Qu.: 0.000 Yes:108829 1st Qu.: 6.000 Yes:191920
## Median : 0.000 Median : 8.000
## Mean : 3.185 Mean : 8.032
## 3rd Qu.: 2.000 3rd Qu.:10.000
## Max. :30.000 Max. :13.000
Model Building
Poisson Regression
poisson_model <- glm(ment_hlth ~ high_bp + age + phys_activity,
family = poisson(link = "log"),
data = diabetes_model)
summary(poisson_model)
##
## Call:
## glm(formula = ment_hlth ~ high_bp + age + phys_activity, family = poisson(link = "log"),
## data = diabetes_model)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.170605 0.003361 645.9 <2e-16 ***
## high_bpYes 0.415205 0.002419 171.7 <2e-16 ***
## age -0.101482 0.000384 -264.3 <2e-16 ***
## phys_activityYes -0.606206 0.002335 -259.6 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 2634293 on 253679 degrees of freedom
## Residual deviance: 2495199 on 253676 degrees of freedom
## AIC: 2781315
##
## Number of Fisher Scoring iterations: 7
Overdispersion Test
##
## Overdispersion test
##
## data: poisson_model
## z = 150.57, p-value < 2.2e-16
## alternative hypothesis: true dispersion is greater than 1
## sample estimates:
## dispersion
## 16.69128
A significant p-value suggests overdispersion. This justifies exploring a Negative Binomial model.
Negative Binomial Regression Model
nb_model <- MASS::glm.nb(ment_hlth ~ high_bp + age + phys_activity, data = diabetes_model)
summary(nb_model)
##
## Call:
## MASS::glm.nb(formula = ment_hlth ~ high_bp + age + phys_activity,
## data = diabetes_model, init.theta = 0.1172691729, link = log)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.260537 0.020418 110.71 <2e-16 ***
## high_bpYes 0.391771 0.012816 30.57 <2e-16 ***
## age -0.109854 0.002069 -53.09 <2e-16 ***
## phys_activityYes -0.627640 0.013854 -45.30 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(0.1173) family taken to be 1)
##
## Null deviance: 166609 on 253679 degrees of freedom
## Residual deviance: 161627 on 253676 degrees of freedom
## AIC: 825431
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 0.117269
## Std. Err.: 0.000521
##
## 2 x log-likelihood: -825420.863000
Model Comparison
htmlreg(list(poisson_model, nb_model),
custom.model.names = c("Poisson", "Negative Binomial"),
caption = "Model Comparison: Poisson vs Negative Binomial")
## <table class="texreg" style="margin: 10px auto;border-collapse: collapse;border-spacing: 0px;caption-side: bottom;color: #000000;border-top: 2px solid #000000;">
## <caption>Model Comparison: Poisson vs Negative Binomial</caption>
## <thead>
## <tr>
## <th style="padding-left: 5px;padding-right: 5px;"> </th>
## <th style="padding-left: 5px;padding-right: 5px;">Poisson</th>
## <th style="padding-left: 5px;padding-right: 5px;">Negative Binomial</th>
## </tr>
## </thead>
## <tbody>
## <tr style="border-top: 1px solid #000000;">
## <td style="padding-left: 5px;padding-right: 5px;">(Intercept)</td>
## <td style="padding-left: 5px;padding-right: 5px;">2.17<sup>***</sup></td>
## <td style="padding-left: 5px;padding-right: 5px;">2.26<sup>***</sup></td>
## </tr>
## <tr>
## <td style="padding-left: 5px;padding-right: 5px;"> </td>
## <td style="padding-left: 5px;padding-right: 5px;">(0.00)</td>
## <td style="padding-left: 5px;padding-right: 5px;">(0.02)</td>
## </tr>
## <tr>
## <td style="padding-left: 5px;padding-right: 5px;">high_bpYes</td>
## <td style="padding-left: 5px;padding-right: 5px;">0.42<sup>***</sup></td>
## <td style="padding-left: 5px;padding-right: 5px;">0.39<sup>***</sup></td>
## </tr>
## <tr>
## <td style="padding-left: 5px;padding-right: 5px;"> </td>
## <td style="padding-left: 5px;padding-right: 5px;">(0.00)</td>
## <td style="padding-left: 5px;padding-right: 5px;">(0.01)</td>
## </tr>
## <tr>
## <td style="padding-left: 5px;padding-right: 5px;">age</td>
## <td style="padding-left: 5px;padding-right: 5px;">-0.10<sup>***</sup></td>
## <td style="padding-left: 5px;padding-right: 5px;">-0.11<sup>***</sup></td>
## </tr>
## <tr>
## <td style="padding-left: 5px;padding-right: 5px;"> </td>
## <td style="padding-left: 5px;padding-right: 5px;">(0.00)</td>
## <td style="padding-left: 5px;padding-right: 5px;">(0.00)</td>
## </tr>
## <tr>
## <td style="padding-left: 5px;padding-right: 5px;">phys_activityYes</td>
## <td style="padding-left: 5px;padding-right: 5px;">-0.61<sup>***</sup></td>
## <td style="padding-left: 5px;padding-right: 5px;">-0.63<sup>***</sup></td>
## </tr>
## <tr>
## <td style="padding-left: 5px;padding-right: 5px;"> </td>
## <td style="padding-left: 5px;padding-right: 5px;">(0.00)</td>
## <td style="padding-left: 5px;padding-right: 5px;">(0.01)</td>
## </tr>
## <tr style="border-top: 1px solid #000000;">
## <td style="padding-left: 5px;padding-right: 5px;">AIC</td>
## <td style="padding-left: 5px;padding-right: 5px;">2781315.29</td>
## <td style="padding-left: 5px;padding-right: 5px;">825430.86</td>
## </tr>
## <tr>
## <td style="padding-left: 5px;padding-right: 5px;">BIC</td>
## <td style="padding-left: 5px;padding-right: 5px;">2781357.06</td>
## <td style="padding-left: 5px;padding-right: 5px;">825483.08</td>
## </tr>
## <tr>
## <td style="padding-left: 5px;padding-right: 5px;">Log Likelihood</td>
## <td style="padding-left: 5px;padding-right: 5px;">-1390653.64</td>
## <td style="padding-left: 5px;padding-right: 5px;">-412710.43</td>
## </tr>
## <tr>
## <td style="padding-left: 5px;padding-right: 5px;">Deviance</td>
## <td style="padding-left: 5px;padding-right: 5px;">2495199.39</td>
## <td style="padding-left: 5px;padding-right: 5px;">161626.84</td>
## </tr>
## <tr style="border-bottom: 2px solid #000000;">
## <td style="padding-left: 5px;padding-right: 5px;">Num. obs.</td>
## <td style="padding-left: 5px;padding-right: 5px;">253680</td>
## <td style="padding-left: 5px;padding-right: 5px;">253680</td>
## </tr>
## </tbody>
## <tfoot>
## <tr>
## <td style="font-size: 0.8em;" colspan="3"><sup>***</sup>p < 0.001; <sup>**</sup>p < 0.01; <sup>*</sup>p < 0.05</td>
## </tr>
## </tfoot>
## </table>
Simulation-Based Interpretation
Average Marginal Effects (AME)
options(clarify.sim = 500)
nb_sim <- sim(nb_model)
ame_bp <- sim_ame(nb_sim, var = "high_bp", contrast = "rd")
summary(ame_bp)
## Estimate 2.5 % 97.5 %
## E[Y(No)] 2.69 2.65 2.74
## E[Y(Yes)] 3.99 3.91 4.06
## RD 1.29 1.20 1.39
## Estimate 2.5 % 97.5 %
## E[Y(No)] 4.95 4.84 5.07
## E[Y(Yes)] 2.64 2.60 2.68
## RD -2.31 -2.44 -2.19
Dose-Response Relationship for Age
# Create a sequence of age values for prediction
age_seq <- seq(min(diabetes_model$age), max(diabetes_model$age), by = 1)
# Create a new dataframe for prediction - hold other variables constant
new_data <- data.frame(
age = age_seq,
high_bp = factor("Yes", levels = c("No", "Yes")),
phys_activity = factor("No", levels = c("No", "Yes"))
)
# Get predicted values and confidence intervals
new_data$fit <- predict(nb_model, newdata = new_data, type = "response")
pred_se <- predict(nb_model, newdata = new_data, type = "link", se.fit = TRUE)$se.fit
# Confidence intervals (on response scale)
new_data$lower <- qpois(0.025, lambda = new_data$fit)
new_data$upper <- qpois(0.975, lambda = new_data$fit)
# Plot
ggplot(new_data, aes(x = age, y = fit)) +
geom_line(color = "blue") +
geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.2) +
labs(
title = "Predicted Mentally Unhealthy Days by Age",
x = "Age", y = "Predicted Mentally Unhealthy Days"
) +
theme_minimal()
Interpretation of Results
Overdispersion Detected: The dispersion test confirmed overdispersion in the Poisson model, which validates the use of the Negative Binomial model.
Physical Activity: Simulation results show that individuals who are physically active report ~2.3 fewer mentally unhealthy days on average.
High Blood Pressure: Respondents with high blood pressure experience ~1.3 more mentally unhealthy days, on average.
Age: The predicted number of unhealthy days increases with age, suggesting a dose-response effect.
Why This Matters
Simulations allow us to move beyond raw coefficients and express effects in real-world terms (like number of days), with confidence intervals. This makes interpretation more intuitive and communication of results more accessible to wider audiences.
Conclusion
Count data models like Poisson and Negative Binomial are powerful
tools in public health research. In this analysis, we demonstrated how
health and lifestyle factors such as high blood pressure, physical
activity, and age influence mental health outcomes. The
clarify
package helped communicate effects more clearly
through simulation and visualization. Future work could extend this
model by examining interaction effects or additional predictors. These
findings highlight the importance of preventative strategies—such as
promoting physical activity—to mitigate poor mental health outcomes.
Identifying at-risk populations (e.g., older adults with high blood
pressure) allows for more targeted interventions.