Setup and Data

library(tidyverse)
library(haven)
library(janitor)
library(knitr)
library(kableExtra)
library(plotly)
library(broom)
library(ggeffects)
library(gtsummary)
library(GGally)
library(car)
library(lmtest)
library(corrplot)

brfss_full <- read_xpt(
  "C:/rmd/LLCP2020.XPT"
) %>%
  clean_names()

# ── Recode and clean ──────────────────────────────────────────────────────────
brfss_mlr <- brfss_full %>%
  mutate(
    # Outcome: mentally unhealthy days in past 30 (88 = none = 0; 77/99 = DK/refused = NA)
    menthlth_days = case_when(
      menthlth == 88              ~ 0,
      menthlth >= 1 & menthlth <= 30 ~ as.numeric(menthlth),
      TRUE                        ~ NA_real_
    ),
    # Physical health days (key predictor)
    physhlth_days = case_when(
      physhlth == 88              ~ 0,
      physhlth >= 1 & physhlth <= 30 ~ as.numeric(physhlth),
      TRUE                        ~ NA_real_
    ),
    # Sleep hours (practical cap at 14)
    sleep_hrs = case_when(
      sleptim1 >= 1 & sleptim1 <= 14 ~ as.numeric(sleptim1),
      TRUE                        ~ NA_real_
    ),
    # Age (capped at 80 per BRFSS coding)
    age = age80,
    # Income category (ordinal 1–8)
    income_cat = case_when(
      income2 %in% 1:8 ~ as.numeric(income2),
      TRUE             ~ NA_real_
    ),
    # Sex
    sex = factor(sexvar, levels = c(1, 2), labels = c("Male", "Female")),
    # Exercise in past 30 days (any physical activity)
    exercise = factor(case_when(
      exerany2 == 1 ~ "Yes",
      exerany2 == 2 ~ "No",
      TRUE          ~ NA_character_
    ), levels = c("No", "Yes")),
    # BMI (stored as integer × 100 in BRFSS)
    bmi = ifelse(bmi5 > 0, bmi5 / 100, NA_real_),
    # Income as labeled factor (for display)
    income_f = factor(income2, levels = 1:8,
      labels = c("<$10k", "$10-15k", "$15-20k", "$20-25k",
                 "$25-35k", "$35-50k", "$50-75k", ">$75k"))
  ) %>%
  filter(
    !is.na(menthlth_days),
    !is.na(physhlth_days),
    !is.na(sleep_hrs),
    !is.na(age),   age >= 18,
    !is.na(income_cat),
    !is.na(sex),
    !is.na(exercise)
  )

# ── Analytic sample (reproducible random sample) ────────────────────────────
set.seed(553)
brfss_mlr <- brfss_mlr %>%
  select(menthlth_days, physhlth_days, sleep_hrs, age,
         income_cat, income_f, sex, exercise, bmi) %>%
  slice_sample(n = 5000)

# Save for lab activity
saveRDS(brfss_mlr,
  "C:/rmd/brfss_mlr_2020.rds")

tibble(Metric = c("Observations", "Variables"),
       Value  = c(nrow(brfss_mlr), ncol(brfss_mlr))) %>%
  kable(caption = "Analytic Dataset Dimensions") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Analytic Dataset Dimensions
Metric	Value
Observations	5000
Variables	9

Part 2: In-Class Lab Activity

Data for the Lab

Use the saved analytic dataset from today’s lecture. It contains 5,000 randomly sampled BRFSS 2020 respondents with the following variables:

Variable	Description	Type
`menthlth_days`	Mentally unhealthy days in past 30	Continuous (0–30)
`physhlth_days`	Physically unhealthy days in past 30	Continuous (0–30)
`sleep_hrs`	Sleep hours per night	Continuous (1–14)
`age`	Age in years (capped at 80)	Continuous
`income_cat`	Household income (1 = <$10k to 8 = >$75k)	Ordinal
`sex`	Sex (Male/Female)	Factor
`exercise`	Any physical activity past 30 days (Yes/No)	Factor

# Load the dataset
library(tidyverse)
library(broom)
library(knitr)
library(kableExtra)
library(gtsummary)
library(ggeffects)

brfss_mlr <- readRDS(
  "C:/rmd/brfss_mlr_2020.rds"
)

Task 1: Exploratory Data Analysis (15 points)

1a. (5 pts) Create a descriptive statistics table using tbl_summary() that includes all variables in the dataset. Include means (SD) for continuous variables and n (%) for categorical variables.

brfss_mlr %>%
  select(menthlth_days, physhlth_days, sleep_hrs, age,
         income_cat, sex, exercise) %>%
  tbl_summary(
    label = list(
      menthlth_days ~ "Mentally unhealthy days (past 30)",
      physhlth_days ~ "Physically unhealthy days (past 30)",
      sleep_hrs     ~ "Sleep (hours/night)",
      age           ~ "Age (years)",
      income_cat      ~ "Household income (cat)",
      sex           ~ "Sex",
      exercise      ~ "Any physical activity (past 30 days)"
    ),
    statistic = list(
      all_continuous()  ~ "{mean} ({sd})",
      all_categorical() ~ "{n} ({p}%)"
    ),
    digits = all_continuous() ~ 1,
    missing = "no"
  ) %>%
  add_n() %>%
  bold_labels() %>%
  modify_caption("**Table 1. Descriptive Statistics — BRFSS 2020 Analytic Sample (n = 5,000)**")

**Table 1. Descriptive Statistics — BRFSS 2020 Analytic Sample (n = 5,000)**
Characteristic	N	N = 5,000¹
Mentally unhealthy days (past 30)	5,000	3.8 (7.7)
Physically unhealthy days (past 30)	5,000	3.3 (7.8)
Sleep (hours/night)	5,000	7.1 (1.3)
Age (years)	5,000	54.3 (17.2)
Household income (cat)	5,000
1		190 (3.8%)
2		169 (3.4%)
3		312 (6.2%)
4		434 (8.7%)
5		489 (9.8%)
6		683 (14%)
7		841 (17%)
8		1,882 (38%)
Sex	5,000
Male		2,331 (47%)
Female		2,669 (53%)
Any physical activity (past 30 days)	5,000	3,874 (77%)
¹ Mean (SD); n (%)

1b. (5 pts) Create a histogram of menthlth_days. Describe the shape of the distribution. Is it symmetric, right-skewed, or left-skewed? What are the implications of this shape for regression modeling?

p_hist <- ggplot(brfss_mlr, aes(x = menthlth_days)) +
  geom_histogram(binwidth = 1, fill = "steelblue", color = "white", alpha = 0.85) +
  labs(
    title    = "Distribution of Mentally Unhealthy Days in the Past 30 Days",
    subtitle = "BRFSS 2020 Analytic Sample (n = 5,000)",
    x        = "Number of Mentally Unhealthy Days",
    y        = "Count"
  ) +
  theme_minimal(base_size = 13)

ggplotly(p_hist)

Distribution of Mentally Unhealthy Days (BRFSS 2020)

Describe the shape of the distribution: The histogram of number of mentally unhealthy days is skewed to the right, as we observe most of the participants report number of mentally unhealthy days =0, we can observe the long tail till 30 days of mentally unhealthy data. Data is assymetrical.

1c. (5 pts) Create a scatterplot matrix (using GGally::ggpairs() or similar) for the continuous variables: menthlth_days, physhlth_days, sleep_hrs, and age.

#Pairs plot of continuous predictors vs outcome
brfss_mlr %>%
  select(menthlth_days, physhlth_days, sleep_hrs, age) %>%
  rename(
    `Mental Health\nDays`  = menthlth_days,
    `Physical Health\nDays` = physhlth_days,
    `Sleep\n(hrs)`         = sleep_hrs,
    Age                    = age
  ) %>%
  ggpairs(
    lower  = list(continuous = wrap("points", alpha = 0.05, size = 0.5)),
    diag   = list(continuous = wrap("densityDiag", fill = "steelblue", alpha = 0.5)),
    upper  = list(continuous = wrap("cor", size = 4)),
    title  = "Pairwise Relationships Among Key Variables (BRFSS 2020)"
  ) +
  theme_minimal(base_size = 11)

Comment on the direction and strength of each pairwise correlation with the outcome: 1) We observe positive weak correlation between Mental Health days and Physical Health days; negative very weak correlation between Mental Health days and Sleep; negative very weak correlation between Mental Health days and Age; 2) We observe negative very weak correlation between Physical Health days and Sleep; positive negligible correlation between Physical Health days and Age; 3) Finally, we observe positive negligible correlation between Sleep and Age;

Task 2: Unadjusted (Simple) Linear Regression (15 points)

2a. (5 pts) Fit a simple linear regression model regressing menthlth_days on sleep_hrs alone.

# Model 1: Unadjusted
model_slr <- lm(menthlth_days ~ sleep_hrs, data = brfss_mlr)
summary(model_slr)

## 
## Call:
## lm(formula = menthlth_days ~ sleep_hrs, data = brfss_mlr)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.670 -3.845 -3.040 -0.040 31.785 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  9.47429    0.57712   16.42   <2e-16 ***
## sleep_hrs   -0.80424    0.08025  -10.02   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.642 on 4998 degrees of freedom
## Multiple R-squared:  0.0197, Adjusted R-squared:  0.0195 
## F-statistic: 100.4 on 1 and 4998 DF,  p-value: < 2.2e-16

b0 <- round(coef(model_slr)[1], 3)
b1 <- round(coef(model_slr)[2], 4)

Write out the fitted regression equation: \[\widehat{\text{menthlth_days}} = `9.47` - `0.80` \times \text{sleep_hrs}\] 2b. (5 pts) Interpret the slope for sleep in a single sentence appropriate for a public health audience (no statistical jargon).

For every additional +1 hour of sleep Days of poor mental health decreases by almost 1 day (0.80) on average among participants.

2c. (5 pts) State the null and alternative hypotheses for the slope, report the t-statistic and p-value, and state your conclusion. What is the degree of freedom for this test? \[H_0: \beta_1 = 0 \quad \text{(no linear relationship between Days of poor mental health and Hours of sleep)}\]

\[H_A: \beta_1 \neq 0 \quad \text{(there is a linear relationship)}\]

Task 3: Building the Multivariable Model (25 points)

3a. (5 pts) Fit three models:

Model A: menthlth_days ~ sleep_hrs
Model B: menthlth_days ~ sleep_hrs + age + sex
Model C: menthlth_days ~ sleep_hrs + age + sex + physhlth_days + income_cat + exercise

# Model A: Unadjusted
mA <- lm(menthlth_days ~ sleep_hrs, data = brfss_mlr)

# Model B: Add sleep
mB <- lm(menthlth_days ~ sleep_hrs + age + sex, data = brfss_mlr)

# Model C: Full multivariable model
mC <- lm(menthlth_days ~ sleep_hrs + age + sex + physhlth_days + income_cat +exercise, data = brfss_mlr)

summary(mC)

## 
## Call:
## lm(formula = menthlth_days ~ sleep_hrs + age + sex + physhlth_days + 
##     income_cat + exercise, data = brfss_mlr)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.9192  -3.4262  -1.7803   0.2948  30.0568 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   12.475489   0.716959  17.401  < 2e-16 ***
## sleep_hrs     -0.509160   0.075348  -6.757 1.57e-11 ***
## age           -0.082307   0.005933 -13.872  < 2e-16 ***
## sexFemale      1.245053   0.202333   6.153 8.17e-10 ***
## physhlth_days  0.291657   0.013579  21.478  < 2e-16 ***
## income_cat    -0.321323   0.052012  -6.178 7.02e-10 ***
## exerciseYes   -0.342685   0.253138  -1.354    0.176    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.09 on 4993 degrees of freedom
## Multiple R-squared:  0.1569, Adjusted R-squared:  0.1559 
## F-statistic: 154.9 on 6 and 4993 DF,  p-value: < 2.2e-16

3b. (10 pts) Create a table comparing the sleep coefficient ($\hat{\beta}$, SE, 95% CI, p-value) across Models A, B, and C. Does the sleep coefficient change substantially when you add more covariates? What does this suggest about confounding?

tidy(mC, conf.int = TRUE) %>%
  mutate(
    term     = dplyr::recode(term,
      "(Intercept)" = "Intercept",
      "sleep_hrs"     = "Sleep (hours/night)",
      "age"           = "Age (years)",
      "sexFemale"     = "Sex: Female (ref = Male)",
      "physhlth_days" = "Physical health days",
      "income_cat"    = "Income (ordinal 1-8)",
      "exerciseYes"   = "Exercise: Yes (ref = No)"
      
    ),
    across(where(is.numeric), ~ round(., 4))
  ) %>%
  kable(
    caption  = "Table 1. Multiple Linear Regression: Mentally Unhealthy Days ~ Multiple Predictors (BRFSS 2020, n = 5,000)",
    col.names = c("Term", "Estimate (β̂)", "Std. Error", "t-statistic",
                  "p-value", "95% CI Lower", "95% CI Upper")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(0, bold = TRUE) %>%
  row_spec(c(2, 3), background = "#EBF5FB")  # highlight key predictors

Table 1. Multiple Linear Regression: Mentally Unhealthy Days ~ Multiple Predictors (BRFSS 2020, n = 5,000)
Term	Estimate (β̂	Std. Erro
Intercept	12.4755	0.7170	17.4006	0.0000	11.0699	13.8810
Sleep (hours/night)	-0.5092	0.0753	-6.7574	0.0000	-0.6569	-0.3614
Age (years)	-0.0823	0.0059	-13.8724	0.0000	-0.0939	-0.0707
Sex: Female (ref = Male)	1.2451	0.2023	6.1535	0.0000	0.8484	1.6417
Physical health days	0.2917	0.0136	21.4779	0.0000	0.2650	0.3183
Income (ordinal 1-8)	-0.3213	0.0520	-6.1778	0.0000	-0.4233	-0.2194
Exercise: Yes (ref = No)	-0.3427	0.2531	-1.3537	0.1759	-0.8389	0.1536

3c. (10 pts) For Model C, write out the full fitted regression equation and interpret every coefficient in plain language appropriate for a public health report. \[\widehat{\text{Mental Health Days}} = `rb[1]` + `rb[2]`(\text{Sleep Hours}) + `rb[3]`(\text{Age}) + `rb[4]`(\text{Female}) + `rb[5]`(\text{Physical Health days}) + `rb[6]`(\text{Income}) + `rb[7]`(\text{Excersice: Yes}) \] Interpretation: Sleep hours ($\hat{\beta}$ = -0.5092): Each additional 1 hour of sleep is associated with an estimated 0.5092 fewer mentally unhealthy days on average, holding age, sex, and days of poor physical health days constant (95% CI: -0.6569 to -0.3614). The negative sign indicates a protective association.

Age ($\hat{\beta}$ = -0.0823): Each additional 1-year of age is associated with an estimated 0.0823 fewer mentally unhealthy days in average, holding, sleep hours, sex, poor physical health days constant (95% CI: -0.0939 to -0.0707); The negative sign indicates a protective association.

Sex: Female ($\hat{\beta}$ = 1.2451): Being a female is associated with an estimated 1.2451 additional mentally unhealthy days in average, holding sleep hours, age, and days of poor physical health days constant (95% CI: 0.8484 to 1.6417).

Physical health days ($\hat{\beta}$ = 0.2917): Each additional day of poor physical health is associated with an estimated 0.2917 additional mentally unhealthy days on average, holding sleep, age, sex constant (95% CI: 0.2650 to 0.3183). This is the strongest predictor (t-value=24.10) in the model and is consistent with the bidirectional mind-body connection documented in the literature

Income category ($\hat{\beta}$ = -0.3213): Each increase by one unit of income category (on 1-8 ordinal scale) is associated with an estimated 0.3213 fewer mentally unhealthy days in average, holding sleep hours, age, female sex and days of poor physical health constant (95% CI: -0.4233 to -0.2194). The negative sign indicates a protective association.

Exersice: Yes ($\hat{\beta}$ = -0.3427 ): People who engaged in any physical activity in the past 30 days report an estimated 0.3427 fewer mentally unhealthy days in average, holding sleep hours, age, female sex, days of poor physical, and income constant (95% CI: -0.8389 to 0.1536). The negative sign indicates a protective association.

Task 4: Model Fit and ANOVA (20 points)

4a. (5 pts) Report $R^2$ and Adjusted $R^2$ for each of the three models (A, B, C). Create a table. Which model explains the most variance in mental health days?

anova_mA <- anova(mA)
anova_mB <- anova(mB)
anova_mC <- anova (mC)
anova (mA, mB, mC)

## Analysis of Variance Table
## 
## Model 1: menthlth_days ~ sleep_hrs
## Model 2: menthlth_days ~ sleep_hrs + age + sex
## Model 3: menthlth_days ~ sleep_hrs + age + sex + physhlth_days + income_cat + 
##     exercise
##   Res.Df    RSS Df Sum of Sq       F    Pr(>F)    
## 1   4998 291847                                   
## 2   4996 282717  2      9129  90.805 < 2.2e-16 ***
## 3   4993 250993  3     31724 210.365 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

tribble(
  ~Model, ~Predictors, ~R2, ~`Adj. R2`,
  "mA: Sleep Hours only", 1,
  round(summary(mA)$r.squared, 4),
  round(summary(mA)$adj.r.squared, 4),

  "mB: + age + sex", 2,
  round(summary(mB)$r.squared, 4),
  round(summary(mB)$adj.r.squared, 4),

  "mC: Full (+ physhlth_days + income_cat + exercise)", 6,
  round(summary(mC)$r.squared, 4),
  round(summary(mC)$adj.r.squared, 4)

) %>%
  kable(caption = "Table 7. R2 and Adjusted R2 Across Sequential Models") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(3, bold = TRUE, background = "#EBF5FB")

Table 7. R2 and Adjusted R2 Across Sequential Models
Model	Predictors	R2	Adj. R2
mA: Sleep Hours only	1	0.0197	0.0195
mB: + age + sex	2	0.0504	0.0498
mC: Full (+ physhlth_days + income_cat + exercise)	6	0.1569	0.1559

Interpretation: R² for mental health outcomes is highest for full multivariable regression model (including menthlth_days ~ sleep_hrs + age + sex + physhlth_days + income_cat + exercise). Nonetheless, this full multivariable linear regression model explains only 15.59% variability of the days of poor mental health, the rest variability of the outcome explained by other factors.

4b. (5 pts) What is the Root MSE for Model C? Interpret it in practical terms — what does it tell you about prediction accuracy?

\[RMSE = \sqrt{MSE} = s_{Y|X}\]

rmse_mC <- round(summary(mC)$sigma, 2)
cat("Root MSE (Model C):", rmse_mC, "mentally unhealthy days\n")

## Root MSE (Model C): 7.09 mentally unhealthy days

The Root MSE (also called the Residual Standard Error in R output) is the estimated standard deviation of the errors. On average, 7.09 days the model’s prediction is off by. Or the model’s predicted days of poor mental health deviate from the observed by roughly 7.09 days on average.

4c. (10 pts) Using the ANOVA output for Model C, fill in the following table manually (i.e., compute the values using the output from anova() or glance()):

glance(mC) %>%
  select(r.squared, adj.r.squared, sigma, statistic, p.value, df, df.residual, nobs) %>%
  mutate(across(where(is.numeric), ~ round(., 4))) %>%
  pivot_longer(everything(), names_to = "Statistic", values_to = "Value") %>%
  mutate(Statistic = dplyr::recode(Statistic,
    "r.squared"     = "R2",
    "adj.r.squared" = "Adjusted R2",
    "sigma"         = "Residual Std Error (Root MSE)",
    "statistic"     = "F statistic",
    "p.value"       = "p-value (overall F test)",
    "df"            = "Model df (p)",
    "df.residual"   = "Residual df (n - p - 1)",
    "nobs"          = "n (observations)"
  )) %>%
  kable(caption = "Table 6. Overall Model Summary - Model C") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Table 6. Overall Model Summary - Model C
Statistic	Value
R2	0.1569
Adjusted R2	0.1559
Residual Std Error (Root MSE)	7.0901
F statistic	154.8953
p-value (overall F test)	0.0000
Model df (p)	6.0000
Residual df (n - p - 1)	4993.0000
n (observations)	5000.0000

Source	df	SS	MS	F
Model	6	46719.86	7786.64	154.8953
Residual	4993.0	250998.23	50.273
Total	5000	297718.09

State the null hypothesis for the overall F-test and your conclusion. \[H_0: \beta_1 = 0 \quad \text{( All regression coefficients (except the intercept) are equal to zero)}\] \[H_A: \beta_1 = 0 \quad \text{( At least one regression coefficient is not zero)}\] —

Task 5: Residual Diagnostics (15 points)

5a. (5 pts) For Model C, produce the four standard diagnostic plots (Residuals vs. Fitted, Normal Q-Q, Scale-Location, Cook’s Distance). Comment on what each plot tells you about the LINE assumptions.

par(mfrow = c(2, 2))
plot(mC, which = 1:4,
     col = adjustcolor("steelblue", 0.4),
     pch = 19, cex = 0.6)

Standard Regression Diagnostic Plots

par(mfrow = c(1, 1))

Interpretation: -Residuals vs. Fitted plot does not show random scatter around zero, there is some pattern; meaning potential violation of the linearity assumption. Also red curved line shows that the relationship between predictors and outcome may be not absolutely linear.

Normal QQ plot plot shows the S-shaped line - residuals are deviating from the line, assuming that the residuals may not be normally distributed. This means a possible violation of the normality assumption.
Scale location plot shows the red line increasing, and not remaining at 0. This means that the variance of errors is not completely constant across fitted values (possible heteroscedasticity).
Cook’s distance plot shows that there are no influential observations that might affect the model (all points < 0.014)

5b. (5 pts) Given the nature of the outcome (mental health days, bounded at 0 and 30, heavily right-skewed), which assumptions are most likely to be violated? Does this invalidate the analysis? Explain. So most of the assumptions are violated: Linearity, Normality and Homoscedasticity. Considering that our sample size = 5,000 people, MLR is robust to those violations of assumptions.

5c. c(5 pts) Identify any observations with Cook’s Distance > 1. How many are there? What would you do with them in a real analysis? Cook’s distance plot shows that there are no influential observations that might affect the model (all points < 0.014). —

Task 6: Interpretation for a Public Health Audience (10 points)

Suppose you are writing the results section of a public health paper. Write a 3–4 sentence paragraph summarizing the findings from Model C for a non-statistical audience. Your paragraph should:

Identify which predictors were significantly associated with mental health days
State the direction and approximate magnitude of the most important associations
Appropriately caveat the cross-sectional nature of the data (no causal language)
Not use any statistical jargon (no “significant,” “coefficient,” “p-value”)

Based on the analysis, it seems that age, female sex, income, sleep hours have meaningful association with the days of poor mental health among participants, whereas exercise does not have it. Physical health days is the strongest predictor in the model, meaning each additional day of poor physical health is associated with almost half of a poor mentally unhealthy day.

Epi553_lab07_MLR_Azamatova_Aigerim

Azamatova Aigerim

2026-03-08