library(tidyverse)
library(haven)
library(janitor)
library(knitr)
library(kableExtra)
library(plotly)
library(broom)
library(ggeffects)
library(gtsummary)
library(GGally)
library(car)
library(lmtest)
library(corrplot)# ── Recode and clean ──────────────────────────────────────────────────────────
brfss_mlr <- brfss_full %>%
mutate(
# Outcome: mentally unhealthy days in past 30 (88 = none = 0; 77/99 = DK/refused = NA)
menthlth_days = case_when(
menthlth == 88 ~ 0,
menthlth >= 1 & menthlth <= 30 ~ as.numeric(menthlth),
TRUE ~ NA_real_
),
# Physical health days (key predictor)
physhlth_days = case_when(
physhlth == 88 ~ 0,
physhlth >= 1 & physhlth <= 30 ~ as.numeric(physhlth),
TRUE ~ NA_real_
),
# Sleep hours (practical cap at 14)
sleep_hrs = case_when(
sleptim1 >= 1 & sleptim1 <= 14 ~ as.numeric(sleptim1),
TRUE ~ NA_real_
),
# Age (capped at 80 per BRFSS coding)
age = age80,
# Income category (ordinal 1–8)
income_cat = case_when(
income2 %in% 1:8 ~ as.numeric(income2),
TRUE ~ NA_real_
),
# Sex
sex = factor(sexvar, levels = c(1, 2), labels = c("Male", "Female")),
# Exercise in past 30 days (any physical activity)
exercise = factor(case_when(
exerany2 == 1 ~ "Yes",
exerany2 == 2 ~ "No",
TRUE ~ NA_character_
), levels = c("No", "Yes")),
# BMI (stored as integer × 100 in BRFSS)
bmi = ifelse(bmi5 > 0, bmi5 / 100, NA_real_),
# Income as labeled factor (for display)
income_f = factor(income2, levels = 1:8,
labels = c("<$10k", "$10-15k", "$15-20k", "$20-25k",
"$25-35k", "$35-50k", "$50-75k", ">$75k"))
) %>%
filter(
!is.na(menthlth_days),
!is.na(physhlth_days),
!is.na(sleep_hrs),
!is.na(age), age >= 18,
!is.na(income_cat),
!is.na(sex),
!is.na(exercise)
)
# ── Analytic sample (reproducible random sample) ────────────────────────────
set.seed(553)
brfss_mlr <- brfss_mlr %>%
select(menthlth_days, physhlth_days, sleep_hrs, age,
income_cat, income_f, sex, exercise, bmi) %>%
slice_sample(n = 5000)
# Save for lab activity
saveRDS(brfss_mlr,
"C:/rmd/brfss_mlr_2020.rds")
tibble(Metric = c("Observations", "Variables"),
Value = c(nrow(brfss_mlr), ncol(brfss_mlr))) %>%
kable(caption = "Analytic Dataset Dimensions") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Metric | Value |
|---|---|
| Observations | 5000 |
| Variables | 9 |
Use the saved analytic dataset from today’s lecture. It contains 5,000 randomly sampled BRFSS 2020 respondents with the following variables:
| Variable | Description | Type |
|---|---|---|
menthlth_days |
Mentally unhealthy days in past 30 | Continuous (0–30) |
physhlth_days |
Physically unhealthy days in past 30 | Continuous (0–30) |
sleep_hrs |
Sleep hours per night | Continuous (1–14) |
age |
Age in years (capped at 80) | Continuous |
income_cat |
Household income (1 = <$10k to 8 = >$75k) | Ordinal |
sex |
Sex (Male/Female) | Factor |
exercise |
Any physical activity past 30 days (Yes/No) | Factor |
# Load the dataset
library(tidyverse)
library(broom)
library(knitr)
library(kableExtra)
library(gtsummary)
library(ggeffects)
brfss_mlr <- readRDS(
"C:/rmd/brfss_mlr_2020.rds"
)1a. (5 pts) Create a descriptive statistics table
using tbl_summary() that includes all variables in the
dataset. Include means (SD) for continuous variables and n (%) for
categorical variables.
brfss_mlr %>%
select(menthlth_days, physhlth_days, sleep_hrs, age,
income_cat, sex, exercise) %>%
tbl_summary(
label = list(
menthlth_days ~ "Mentally unhealthy days (past 30)",
physhlth_days ~ "Physically unhealthy days (past 30)",
sleep_hrs ~ "Sleep (hours/night)",
age ~ "Age (years)",
income_cat ~ "Household income (cat)",
sex ~ "Sex",
exercise ~ "Any physical activity (past 30 days)"
),
statistic = list(
all_continuous() ~ "{mean} ({sd})",
all_categorical() ~ "{n} ({p}%)"
),
digits = all_continuous() ~ 1,
missing = "no"
) %>%
add_n() %>%
bold_labels() %>%
modify_caption("**Table 1. Descriptive Statistics — BRFSS 2020 Analytic Sample (n = 5,000)**")| Characteristic | N | N = 5,0001 |
|---|---|---|
| Mentally unhealthy days (past 30) | 5,000 | 3.8 (7.7) |
| Physically unhealthy days (past 30) | 5,000 | 3.3 (7.8) |
| Sleep (hours/night) | 5,000 | 7.1 (1.3) |
| Age (years) | 5,000 | 54.3 (17.2) |
| Household income (cat) | 5,000 | |
| 1 | 190 (3.8%) | |
| 2 | 169 (3.4%) | |
| 3 | 312 (6.2%) | |
| 4 | 434 (8.7%) | |
| 5 | 489 (9.8%) | |
| 6 | 683 (14%) | |
| 7 | 841 (17%) | |
| 8 | 1,882 (38%) | |
| Sex | 5,000 | |
| Male | 2,331 (47%) | |
| Female | 2,669 (53%) | |
| Any physical activity (past 30 days) | 5,000 | 3,874 (77%) |
| 1 Mean (SD); n (%) | ||
1b. (5 pts) Create a histogram of
menthlth_days. Describe the shape of the distribution. Is
it symmetric, right-skewed, or left-skewed? What are the implications of
this shape for regression modeling?
p_hist <- ggplot(brfss_mlr, aes(x = menthlth_days)) +
geom_histogram(binwidth = 1, fill = "steelblue", color = "white", alpha = 0.85) +
labs(
title = "Distribution of Mentally Unhealthy Days in the Past 30 Days",
subtitle = "BRFSS 2020 Analytic Sample (n = 5,000)",
x = "Number of Mentally Unhealthy Days",
y = "Count"
) +
theme_minimal(base_size = 13)
ggplotly(p_hist)Distribution of Mentally Unhealthy Days (BRFSS 2020)
Describe the shape of the distribution: The histogram of number of mentally unhealthy days is skewed to the right, as we observe most of the participants report number of mentally unhealthy days =0, we can observe the long tail till 30 days of mentally unhealthy data. Data is assymetrical.
1c. (5 pts) Create a scatterplot matrix (using
GGally::ggpairs() or similar) for the continuous variables:
menthlth_days, physhlth_days,
sleep_hrs, and age.
#Pairs plot of continuous predictors vs outcome
brfss_mlr %>%
select(menthlth_days, physhlth_days, sleep_hrs, age) %>%
rename(
`Mental Health\nDays` = menthlth_days,
`Physical Health\nDays` = physhlth_days,
`Sleep\n(hrs)` = sleep_hrs,
Age = age
) %>%
ggpairs(
lower = list(continuous = wrap("points", alpha = 0.05, size = 0.5)),
diag = list(continuous = wrap("densityDiag", fill = "steelblue", alpha = 0.5)),
upper = list(continuous = wrap("cor", size = 4)),
title = "Pairwise Relationships Among Key Variables (BRFSS 2020)"
) +
theme_minimal(base_size = 11)
Comment on the direction and strength of each pairwise correlation with
the outcome: 1) We observe positive weak correlation between
Mental Health days and Physical Health days;
negative very weak correlation between Mental Health
days and Sleep; negative very weak correlation
between Mental Health days and Age; 2) We observe negative
very weak correlation between Physical Health days and
Sleep; positive negligible correlation between
Physical Health days and Age; 3) Finally, we observe
positive negligible correlation between Sleep and
Age;
2a. (5 pts) Fit a simple linear regression model
regressing menthlth_days on sleep_hrs
alone.
# Model 1: Unadjusted
model_slr <- lm(menthlth_days ~ sleep_hrs, data = brfss_mlr)
summary(model_slr)##
## Call:
## lm(formula = menthlth_days ~ sleep_hrs, data = brfss_mlr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.670 -3.845 -3.040 -0.040 31.785
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.47429 0.57712 16.42 <2e-16 ***
## sleep_hrs -0.80424 0.08025 -10.02 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.642 on 4998 degrees of freedom
## Multiple R-squared: 0.0197, Adjusted R-squared: 0.0195
## F-statistic: 100.4 on 1 and 4998 DF, p-value: < 2.2e-16
Write out the fitted regression equation: \[\widehat{\text{menthlth_days}} = `9.47` - `0.80` \times \text{sleep_hrs}\] 2b. (5 pts) Interpret the slope for sleep in a single sentence appropriate for a public health audience (no statistical jargon).
For every additional +1 hour of sleep Days of poor mental health decreases by almost 1 day (0.80) on average among participants.
2c. (5 pts) State the null and alternative hypotheses for the slope, report the t-statistic and p-value, and state your conclusion. What is the degree of freedom for this test? \[H_0: \beta_1 = 0 \quad \text{(no linear relationship between Days of poor mental health and Hours of sleep)}\]
3a. (5 pts) Fit three models:
menthlth_days ~ sleep_hrsmenthlth_days ~ sleep_hrs + age + sexmenthlth_days ~ sleep_hrs + age + sex + physhlth_days + income_cat + exercise# Model A: Unadjusted
mA <- lm(menthlth_days ~ sleep_hrs, data = brfss_mlr)
# Model B: Add sleep
mB <- lm(menthlth_days ~ sleep_hrs + age + sex, data = brfss_mlr)
# Model C: Full multivariable model
mC <- lm(menthlth_days ~ sleep_hrs + age + sex + physhlth_days + income_cat +exercise, data = brfss_mlr)
summary(mC)##
## Call:
## lm(formula = menthlth_days ~ sleep_hrs + age + sex + physhlth_days +
## income_cat + exercise, data = brfss_mlr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.9192 -3.4262 -1.7803 0.2948 30.0568
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.475489 0.716959 17.401 < 2e-16 ***
## sleep_hrs -0.509160 0.075348 -6.757 1.57e-11 ***
## age -0.082307 0.005933 -13.872 < 2e-16 ***
## sexFemale 1.245053 0.202333 6.153 8.17e-10 ***
## physhlth_days 0.291657 0.013579 21.478 < 2e-16 ***
## income_cat -0.321323 0.052012 -6.178 7.02e-10 ***
## exerciseYes -0.342685 0.253138 -1.354 0.176
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.09 on 4993 degrees of freedom
## Multiple R-squared: 0.1569, Adjusted R-squared: 0.1559
## F-statistic: 154.9 on 6 and 4993 DF, p-value: < 2.2e-16
3b. (10 pts) Create a table comparing the sleep coefficient (\(\hat{\beta}\), SE, 95% CI, p-value) across Models A, B, and C. Does the sleep coefficient change substantially when you add more covariates? What does this suggest about confounding?
tidy(mC, conf.int = TRUE) %>%
mutate(
term = dplyr::recode(term,
"(Intercept)" = "Intercept",
"sleep_hrs" = "Sleep (hours/night)",
"age" = "Age (years)",
"sexFemale" = "Sex: Female (ref = Male)",
"physhlth_days" = "Physical health days",
"income_cat" = "Income (ordinal 1-8)",
"exerciseYes" = "Exercise: Yes (ref = No)"
),
across(where(is.numeric), ~ round(., 4))
) %>%
kable(
caption = "Table 1. Multiple Linear Regression: Mentally Unhealthy Days ~ Multiple Predictors (BRFSS 2020, n = 5,000)",
col.names = c("Term", "Estimate (β̂)", "Std. Error", "t-statistic",
"p-value", "95% CI Lower", "95% CI Upper")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(0, bold = TRUE) %>%
row_spec(c(2, 3), background = "#EBF5FB") # highlight key predictors| Term | Estimate (β̂ |
Std. Erro
| ||||
|---|---|---|---|---|---|---|
| Intercept | 12.4755 | 0.7170 | 17.4006 | 0.0000 | 11.0699 | 13.8810 |
| Sleep (hours/night) | -0.5092 | 0.0753 | -6.7574 | 0.0000 | -0.6569 | -0.3614 |
| Age (years) | -0.0823 | 0.0059 | -13.8724 | 0.0000 | -0.0939 | -0.0707 |
| Sex: Female (ref = Male) | 1.2451 | 0.2023 | 6.1535 | 0.0000 | 0.8484 | 1.6417 |
| Physical health days | 0.2917 | 0.0136 | 21.4779 | 0.0000 | 0.2650 | 0.3183 |
| Income (ordinal 1-8) | -0.3213 | 0.0520 | -6.1778 | 0.0000 | -0.4233 | -0.2194 |
| Exercise: Yes (ref = No) | -0.3427 | 0.2531 | -1.3537 | 0.1759 | -0.8389 | 0.1536 |
3c. (10 pts) For Model C, write out the full fitted regression equation and interpret every coefficient in plain language appropriate for a public health report. \[\widehat{\text{Mental Health Days}} = `rb[1]` + `rb[2]`(\text{Sleep Hours}) + `rb[3]`(\text{Age}) + `rb[4]`(\text{Female}) + `rb[5]`(\text{Physical Health days}) + `rb[6]`(\text{Income}) + `rb[7]`(\text{Excersice: Yes}) \] Interpretation: Sleep hours (\(\hat{\beta}\) = -0.5092): Each additional 1 hour of sleep is associated with an estimated 0.5092 fewer mentally unhealthy days on average, holding age, sex, and days of poor physical health days constant (95% CI: -0.6569 to -0.3614). The negative sign indicates a protective association.
Age (\(\hat{\beta}\) = -0.0823): Each additional 1-year of age is associated with an estimated 0.0823 fewer mentally unhealthy days in average, holding, sleep hours, sex, poor physical health days constant (95% CI: -0.0939 to -0.0707); The negative sign indicates a protective association.
Sex: Female (\(\hat{\beta}\) = 1.2451): Being a female is associated with an estimated 1.2451 additional mentally unhealthy days in average, holding sleep hours, age, and days of poor physical health days constant (95% CI: 0.8484 to 1.6417).
Physical health days (\(\hat{\beta}\) = 0.2917): Each additional day of poor physical health is associated with an estimated 0.2917 additional mentally unhealthy days on average, holding sleep, age, sex constant (95% CI: 0.2650 to 0.3183). This is the strongest predictor (t-value=24.10) in the model and is consistent with the bidirectional mind-body connection documented in the literature
Income category (\(\hat{\beta}\) = -0.3213): Each increase by one unit of income category (on 1-8 ordinal scale) is associated with an estimated 0.3213 fewer mentally unhealthy days in average, holding sleep hours, age, female sex and days of poor physical health constant (95% CI: -0.4233 to -0.2194). The negative sign indicates a protective association.
Exersice: Yes (\(\hat{\beta}\) = -0.3427 ): People who engaged in any physical activity in the past 30 days report an estimated 0.3427 fewer mentally unhealthy days in average, holding sleep hours, age, female sex, days of poor physical, and income constant (95% CI: -0.8389 to 0.1536). The negative sign indicates a protective association.
4a. (5 pts) Report \(R^2\) and Adjusted \(R^2\) for each of the three models (A, B, C). Create a table. Which model explains the most variance in mental health days?
## Analysis of Variance Table
##
## Model 1: menthlth_days ~ sleep_hrs
## Model 2: menthlth_days ~ sleep_hrs + age + sex
## Model 3: menthlth_days ~ sleep_hrs + age + sex + physhlth_days + income_cat +
## exercise
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 4998 291847
## 2 4996 282717 2 9129 90.805 < 2.2e-16 ***
## 3 4993 250993 3 31724 210.365 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
tribble(
~Model, ~Predictors, ~R2, ~`Adj. R2`,
"mA: Sleep Hours only", 1,
round(summary(mA)$r.squared, 4),
round(summary(mA)$adj.r.squared, 4),
"mB: + age + sex", 2,
round(summary(mB)$r.squared, 4),
round(summary(mB)$adj.r.squared, 4),
"mC: Full (+ physhlth_days + income_cat + exercise)", 6,
round(summary(mC)$r.squared, 4),
round(summary(mC)$adj.r.squared, 4)
) %>%
kable(caption = "Table 7. R2 and Adjusted R2 Across Sequential Models") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(3, bold = TRUE, background = "#EBF5FB")| Model | Predictors | R2 | Adj. R2 |
|---|---|---|---|
| mA: Sleep Hours only | 1 | 0.0197 | 0.0195 |
| mB: + age + sex | 2 | 0.0504 | 0.0498 |
| mC: Full (+ physhlth_days + income_cat + exercise) | 6 | 0.1569 | 0.1559 |
Interpretation: R² for mental health outcomes is highest for full multivariable regression model (including menthlth_days ~ sleep_hrs + age + sex + physhlth_days + income_cat + exercise). Nonetheless, this full multivariable linear regression model explains only 15.59% variability of the days of poor mental health, the rest variability of the outcome explained by other factors.
4b. (5 pts) What is the Root MSE for Model C? Interpret it in practical terms — what does it tell you about prediction accuracy?
\[RMSE = \sqrt{MSE} = s_{Y|X}\]
rmse_mC <- round(summary(mC)$sigma, 2)
cat("Root MSE (Model C):", rmse_mC, "mentally unhealthy days\n")## Root MSE (Model C): 7.09 mentally unhealthy days
The Root MSE (also called the Residual Standard Error in R output) is the estimated standard deviation of the errors. On average, 7.09 days the model’s prediction is off by. Or the model’s predicted days of poor mental health deviate from the observed by roughly 7.09 days on average.
4c. (10 pts) Using the ANOVA output for Model C,
fill in the following table manually (i.e., compute the values using the
output from anova() or glance()):
glance(mC) %>%
select(r.squared, adj.r.squared, sigma, statistic, p.value, df, df.residual, nobs) %>%
mutate(across(where(is.numeric), ~ round(., 4))) %>%
pivot_longer(everything(), names_to = "Statistic", values_to = "Value") %>%
mutate(Statistic = dplyr::recode(Statistic,
"r.squared" = "R2",
"adj.r.squared" = "Adjusted R2",
"sigma" = "Residual Std Error (Root MSE)",
"statistic" = "F statistic",
"p.value" = "p-value (overall F test)",
"df" = "Model df (p)",
"df.residual" = "Residual df (n - p - 1)",
"nobs" = "n (observations)"
)) %>%
kable(caption = "Table 6. Overall Model Summary - Model C") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Statistic | Value |
|---|---|
| R2 | 0.1569 |
| Adjusted R2 | 0.1559 |
| Residual Std Error (Root MSE) | 7.0901 |
| F statistic | 154.8953 |
| p-value (overall F test) | 0.0000 |
| Model df (p) | 6.0000 |
| Residual df (n - p - 1) | 4993.0000 |
| n (observations) | 5000.0000 |
| Source | df | SS | MS | F |
|---|---|---|---|---|
| Model | 6 | 46719.86 | 7786.64 | 154.8953 |
| Residual | 4993.0 | 250998.23 | 50.273 | |
| Total | 5000 | 297718.09 |
State the null hypothesis for the overall F-test and your conclusion. \[H_0: \beta_1 = 0 \quad \text{( All regression coefficients (except the intercept) are equal to zero)}\] \[H_A: \beta_1 = 0 \quad \text{( At least one regression coefficient is not zero)}\] —
5a. (5 pts) For Model C, produce the four standard diagnostic plots (Residuals vs. Fitted, Normal Q-Q, Scale-Location, Cook’s Distance). Comment on what each plot tells you about the LINE assumptions.
par(mfrow = c(2, 2))
plot(mC, which = 1:4,
col = adjustcolor("steelblue", 0.4),
pch = 19, cex = 0.6)Standard Regression Diagnostic Plots
Interpretation: -Residuals vs. Fitted plot does not show random scatter around zero, there is some pattern; meaning potential violation of the linearity assumption. Also red curved line shows that the relationship between predictors and outcome may be not absolutely linear.
Normal QQ plot plot shows the S-shaped line - residuals are deviating from the line, assuming that the residuals may not be normally distributed. This means a possible violation of the normality assumption.
Scale location plot shows the red line increasing, and not remaining at 0. This means that the variance of errors is not completely constant across fitted values (possible heteroscedasticity).
Cook’s distance plot shows that there are no influential observations that might affect the model (all points < 0.014)
5b. (5 pts) Given the nature of the outcome (mental health days, bounded at 0 and 30, heavily right-skewed), which assumptions are most likely to be violated? Does this invalidate the analysis? Explain. So most of the assumptions are violated: Linearity, Normality and Homoscedasticity. Considering that our sample size = 5,000 people, MLR is robust to those violations of assumptions.
5c. c(5 pts) Identify any observations with Cook’s Distance > 1. How many are there? What would you do with them in a real analysis? Cook’s distance plot shows that there are no influential observations that might affect the model (all points < 0.014). —
Suppose you are writing the results section of a public health paper. Write a 3–4 sentence paragraph summarizing the findings from Model C for a non-statistical audience. Your paragraph should:
Based on the analysis, it seems that age, female sex, income, sleep hours have meaningful association with the days of poor mental health among participants, whereas exercise does not have it. Physical health days is the strongest predictor in the model, meaning each additional day of poor physical health is associated with almost half of a poor mentally unhealthy day.