Setup and Data

library(tidyverse)
library(haven)
library(janitor)
library(knitr)
library(kableExtra)
library(plotly)
library(broom)
library(ggeffects)
library(gtsummary)
library(GGally)
library(car)
library(lmtest)
library(corrplot)

brfss_full <- read_xpt("data/LLCP2020.XPT") %>%
  clean_names()

# ── Recode and clean ──────────────────────────────────────────────────────────
brfss_mlr <- brfss_full %>%
  mutate(
    # Outcome: mentally unhealthy days in past 30 (88 = none = 0; 77/99 = DK/refused = NA)
    menthlth_days = case_when(
      menthlth == 88              ~ 0,
      menthlth >= 1 & menthlth <= 30 ~ as.numeric(menthlth),
      TRUE                        ~ NA_real_
    ),
    # Physical health days (key predictor)
    physhlth_days = case_when(
      physhlth == 88              ~ 0,
      physhlth >= 1 & physhlth <= 30 ~ as.numeric(physhlth),
      TRUE                        ~ NA_real_
    ),
    # Sleep hours (practical cap at 14)
    sleep_hrs = case_when(
      sleptim1 >= 1 & sleptim1 <= 14 ~ as.numeric(sleptim1),
      TRUE                        ~ NA_real_
    ),
    # Age (capped at 80 per BRFSS coding)
    age = age80,
    # Income category (ordinal 1–8)
    income_cat = case_when(
      income2 %in% 1:8 ~ as.numeric(income2),
      TRUE             ~ NA_real_
    ),
    # Sex
    sex = factor(sexvar, levels = c(1, 2), labels = c("Male", "Female")),
    # Exercise in past 30 days (any physical activity)
    exercise = factor(case_when(
      exerany2 == 1 ~ "Yes",
      exerany2 == 2 ~ "No",
      TRUE          ~ NA_character_
    ), levels = c("No", "Yes")),
    # BMI (stored as integer × 100 in BRFSS)
    bmi = ifelse(bmi5 > 0, bmi5 / 100, NA_real_),
    # Income as labeled factor (for display)
    income_f = factor(income2, levels = 1:8,
      labels = c("<$10k", "$10-15k", "$15-20k", "$20-25k",
                 "$25-35k", "$35-50k", "$50-75k", ">$75k"))
  ) %>%
  filter(
    !is.na(menthlth_days),
    !is.na(physhlth_days),
    !is.na(sleep_hrs),
    !is.na(age),   age >= 18,
    !is.na(income_cat),
    !is.na(sex),
    !is.na(exercise)
  )

# ── Analytic sample (reproducible random sample) ────────────────────────────
set.seed(553)
brfss_mlr <- brfss_mlr %>%
  select(menthlth_days, physhlth_days, sleep_hrs, age,
         income_cat, income_f, sex, exercise, bmi) %>%
  slice_sample(n = 5000)

# Save for lab activity
saveRDS(brfss_mlr,
  "data/brfss_mlr_2020.rds")

tibble(Metric = c("Observations", "Variables"),
       Value  = c(nrow(brfss_mlr), ncol(brfss_mlr))) %>%
  kable(caption = "Analytic Dataset Dimensions") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Analytic Dataset Dimensions
Metric	Value
Observations	5000
Variables	9

Part 2: In-Class Lab Activity

EPI 553 — Multiple Linear Regression Lab Due: End of class, March 10, 2026

Instructions

In this lab, you will build and interpret multiple linear regression models using the BRFSS 2020 analytic dataset. Work through each task systematically. You may discuss concepts with classmates, but your written answers and R code must be your own.

Submission: Knit your .Rmd to HTML and upload to Brightspace by end of class.

Data for the Lab

Use the saved analytic dataset from today’s lecture. It contains 5,000 randomly sampled BRFSS 2020 respondents with the following variables:

Variable	Description	Type
`menthlth_days`	Mentally unhealthy days in past 30	Continuous (0–30)
`physhlth_days`	Physically unhealthy days in past 30	Continuous (0–30)
`sleep_hrs`	Sleep hours per night	Continuous (1–14)
`age`	Age in years (capped at 80)	Continuous
`income_cat`	Household income (1 = <$10k to 8 = >$75k)	Ordinal
`sex`	Sex (Male/Female)	Factor
`exercise`	Any physical activity past 30 days (Yes/No)	Factor

Task 1: Exploratory Data Analysis (15 points)

1a. (5 pts) Create a descriptive statistics table using tbl_summary() that includes all variables in the dataset. Include means (SD) for continuous variables and n (%) for categorical variables.

brfss_mlr %>%
  select(menthlth_days, physhlth_days, sleep_hrs, age,
         income_f, sex, exercise) %>%
  tbl_summary(
    label = list(
      menthlth_days ~ "Mentally unhealthy days (past 30)",
      physhlth_days ~ "Physically unhealthy days (past 30)",
      sleep_hrs     ~ "Sleep (hours/night)",
      age           ~ "Age (years)",
      income_f      ~ "Household income",
      sex           ~ "Sex",
      exercise      ~ "Any physical activity (past 30 days)"
    ),
    statistic = list(
      all_continuous()  ~ "{mean} ({sd})",
      all_categorical() ~ "{n} ({p}%)"
    ),
    digits = all_continuous() ~ 1,
    missing = "no"
  ) %>%
  add_n() %>%
  bold_labels() %>%
  modify_caption("**Table 1. Descriptive Statistics — BRFSS 2020 Analytic Sample (n = 5,000)**") %>%
  as_flex_table()

**Table 1. Descriptive Statistics — BRFSS 2020 Analytic Sample (n = 5,000)**
Characteristic	N	N = 5,0001
Mentally unhealthy days (past 30)	5,000	3.8 (7.7)
Physically unhealthy days (past 30)	5,000	3.3 (7.8)
Sleep (hours/night)	5,000	7.1 (1.3)
Age (years)	5,000	54.3 (17.2)
Household income	5,000
<$10k		190 (3.8%)
$10-15k		169 (3.4%)
$15-20k		312 (6.2%)
$20-25k		434 (8.7%)
$25-35k		489 (9.8%)
$35-50k		683 (14%)
$50-75k		841 (17%)
>$75k		1,882 (38%)
Sex	5,000
Male		2,331 (47%)
Female		2,669 (53%)
Any physical activity (past 30 days)	5,000	3,874 (77%)
1Mean (SD); n (%)

1b. (5 pts) Create a histogram of menthlth_days. Describe the shape of the distribution. Is it symmetric, right-skewed, or left-skewed? What are the implications of this shape for regression modeling?

p_hist <- ggplot(brfss_mlr, aes(x = menthlth_days)) +
  geom_histogram(binwidth = 10, fill = "steelblue", color = "white", alpha = 0.85) +
  labs(
    title    = "Distribution of Mentally Unhealthy Days in the Past 30 Days",
    subtitle = "BRFSS 2020 Analytic Sample (n = 5,000)",
    x        = "Number of Mentally Unhealthy Days",
    y        = "Count"
  ) +
  theme_minimal(base_size = 13)

ggplotly(p_hist)

Distribution of Mentally Unhealthy Days (BRFSS 2020)

The outcome is right-skewed with most respondents reporting no mentally unhealthy days, with a long tail toward 30. This means that outliers may skew our results.

1c. (5 pts) Create a scatterplot matrix (using GGally::ggpairs() or similar) for the continuous variables: menthlth_days, physhlth_days, sleep_hrs, and age. Comment on the direction and strength of each pairwise correlation with the outcome.

brfss_mlr %>%
  select(menthlth_days, physhlth_days, sleep_hrs, age) %>%
  rename(
    `Mental Health\nDays`  = menthlth_days,
    `Physical Health\nDays` = physhlth_days,
    `Sleep\n(hrs)`         = sleep_hrs,
    Age                    = age
  ) %>%
  ggpairs(
    lower  = list(continuous = wrap("points", alpha = 0.05, size = 0.5)),
    diag   = list(continuous = wrap("densityDiag", fill = "steelblue", alpha = 0.5)),
    upper  = list(continuous = wrap("cor", size = 4)),
    title  = "Pairwise Relationships Among Key Variables (BRFSS 2020)"
  ) +
  theme_minimal(base_size = 11)

Mental health days and Physical health days (r= 0.315) shows a moderate positive correlation Mental health days and sleep (r= -0.140) shows a weak negative correlation Mental health days and age (r= -0.156) shows a weak negative correlation Physical health days and sleep (r= -0.112) shows a weak negative correlation Physical health days and age (r= 0.093) shows a weak positive correlation Sleep and age (r= 0.089) shows a weak positive correlation —

Task 2: Unadjusted (Simple) Linear Regression (15 points)

2a. (5 pts) Fit a simple linear regression model regressing menthlth_days on sleep_hrs alone. Write out the fitted regression equation.

m1 <- lm(menthlth_days ~ sleep_hrs, data = brfss_mlr)
summary(m1)

## 
## Call:
## lm(formula = menthlth_days ~ sleep_hrs, data = brfss_mlr)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.670 -3.845 -3.040 -0.040 31.785 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  9.47429    0.57712   16.42   <2e-16 ***
## sleep_hrs   -0.80424    0.08025  -10.02   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.642 on 4998 degrees of freedom
## Multiple R-squared:  0.0197, Adjusted R-squared:  0.0195 
## F-statistic: 100.4 on 1 and 4998 DF,  p-value: < 2.2e-16

Mental Health Days = 9.474 + -0.804 (Sleep Hours)

2b. (5 pts) Interpret the slope for sleep in a single sentence appropriate for a public health audience (no statistical jargon). The estimated change in the mean number of days with poor mental health is -0.804 for a one-unit increase in sleep hours, holding all other predictors in the model constant.

2c. (5 pts) State the null and alternative hypotheses for the slope, report the t-statistic and p-value, and state your conclusion. What is the degree of freedom for this test? The null hypothesis is that there is no linear association between the number of days with poor health and BMI, whereas the alternative hypothesis is that there is a linear relationship. The null hypothesis can be rejected (p< 0.05) suggesting there is a linear relationship. The test-statistic is 16.42 with 4998 degrees of freedom.

Task 3: Building the Multivariable Model (25 points)

3a. (5 pts) Fit three models:

Model A: menthlth_days ~ sleep_hrs
Model B: menthlth_days ~ sleep_hrs + age + sex
Model C: menthlth_days ~ sleep_hrs + age + sex + physhlth_days + income_cat + exercise

# Model 1: Unadjusted
m1 <- lm(menthlth_days ~ sleep_hrs, data = brfss_mlr)

# Model 2: Add sleep
m2 <- lm(menthlth_days ~ sleep_hrs + age + sex, data = brfss_mlr)

# Model 3: Full multivariable model
m3 <- lm(menthlth_days ~ sleep_hrs + physhlth_days + age + income_cat + sex + exercise, data = brfss_mlr)
summary(m3)

## 
## Call:
## lm(formula = menthlth_days ~ sleep_hrs + physhlth_days + age + 
##     income_cat + sex + exercise, data = brfss_mlr)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.9192  -3.4262  -1.7803   0.2948  30.0568 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   12.475489   0.716959  17.401  < 2e-16 ***
## sleep_hrs     -0.509160   0.075348  -6.757 1.57e-11 ***
## physhlth_days  0.291657   0.013579  21.478  < 2e-16 ***
## age           -0.082307   0.005933 -13.872  < 2e-16 ***
## income_cat    -0.321323   0.052012  -6.178 7.02e-10 ***
## sexFemale      1.245053   0.202333   6.153 8.17e-10 ***
## exerciseYes   -0.342685   0.253138  -1.354    0.176    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.09 on 4993 degrees of freedom
## Multiple R-squared:  0.1569, Adjusted R-squared:  0.1559 
## F-statistic: 154.9 on 6 and 4993 DF,  p-value: < 2.2e-16

3b. (10 pts) Create a table comparing the sleep coefficient ($\hat{\beta}$, SE, 95% CI, p-value) across Models A, B, and C. Does the sleep coefficient change substantially when you add more covariates? What does this suggest about confounding?

tribble(
  ~Model, ~`Sleep β̂`, ~`95% CI`, ~`Adj. R²`,
  "M1 (unadjusted)", round(coef(m1)[2], 3),
    paste0("(", round(confint(m1)[2,1],3), ", ", round(confint(m1)[2,2],3), ")"),
    round(summary(m1)$adj.r.squared, 3),
  "M2 (+ age and sex)",     round(coef(m2)[2], 3),
    paste0("(", round(confint(m2)[2,1],3), ", ", round(confint(m2)[2,2],3), ")"),
    round(summary(m2)$adj.r.squared, 3),
  "M3 (full)",       round(coef(m3)[2], 3),
    paste0("(", round(confint(m3)[2,1],3), ", ", round(confint(m3)[2,2],3), ")"),
    round(summary(m3)$adj.r.squared, 3)
) %>%
  kable(caption = "Sleep Coefficient Across Sequential Models") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(0, bold = TRUE)

Sleep Coefficient Across Sequential Models
Model	Sleep β	\|95% CI	Adj. R
M1 (unadjusted)	-0.804	(-0.962, -0.647)	0.020
M2 (+ age and sex)	-0.734	(-0.889, -0.578)	0.050
M3 (full)	-0.509	(-0.657, -0.361)	0.156

The sleep coefficient appears to decrease with increasing model covariates. This suggests that the covariates added to the model were confounding the relationship between sleep and days with poor mental health.

3c. (10 pts) For Model C, write out the full fitted regression equation and interpret every coefficient in plain language appropriate for a public health report. Mental Health Days = 12.475 + -0.509 (Sleep Hours) + 0.292 (Physhealh_days) + -0.082 (age) + -0.321 (income_cat) + 1.245 (sex: Female) + -0.343 (exercise: Yes) This fitted regression equation describes the association between covariates and mental health days: 1) The estimated change in the mean number of days with poor mental health is -0.509 for a one-unit increase in sleep hours, holding all other predictors in the model constant. 2) The estimated change in the mean number of days with poor mental health is 0.292 for a one-unit increase in days with poor physical health, holding all other predictors in the model constant. 3) The estimated change in the mean number of days with poor mental health is -0.082 for a one-unit increase in age, holding all other predictors in the model constant. 4) The estimated change in the mean number of days with poor mental health is -0.321 for a one-unit increase in income category, holding all other predictors in the model constant. 5) The estimated change in the mean number of days with poor mental health is 1.245 for females compared to males, holding all other predictors in the model constant. 6) The estimated change in the mean number of days with poor mental health is -0.343 for those who exercise compared to those who do not, holding all other predictors in the model constant. —

Task 4: Model Fit and ANOVA (20 points)

4a. (5 pts) Report $R^2$ and Adjusted $R^2$ for each of the three models (A, B, C). Create a table. Which model explains the most variance in mental health days?

tribble(
  ~Model, ~Predictors, ~R2, ~`Adj. R²`,
  "M1: Sleep only", 1, round(summary(m1)$r.squared, 4), round(summary(m1)$adj.r.squared, 4),
  "M2: + age and sex", 2, round(summary(m2)$r.squared, 4), round(summary(m2)$adj.r.squared, 4),
  "M3: Full (sleep_hrs + age + sex + physhlth_days + income_cat + exercise)", 6,
    round(summary(m3)$r.squared, 4), round(summary(m3)$adj.r.squared, 4)
) %>%
  kable(caption = "R² and Adjusted R² Across Sequential Models") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(3, bold = TRUE, background = "#EBF5FB")

R² and Adjusted R² Across Sequential Models
Model	Predictors	R2	Adj. R²
M1: Sleep only	1	0.0197	0.0195
M2: + age and sex	2	0.0504	0.0498
M3: Full (sleep_hrs + age + sex + physhlth_days + income_cat + exercise)	6	0.1569	0.1559

The full model explains the most variance in mental health days as evidenced by the largest r2 and adjusted r2 values (0.157 and 0.156).

4b. (5 pts) What is the Root MSE for Model C? Interpret it in practical terms — what does it tell you about prediction accuracy? rmse_m3 <- round(summary(m3)$sigma, 2) cat(“Root MSE (Model 3):”, rmse_m3, “mentally unhealthy days”) The root MSE is 7.09, which suggests that the model is off by 7.09 mentally unhealthy days.

4c. (10 pts) Using the ANOVA output for Model C, fill in the following table manually (i.e., compute the values using the output from anova() or glance()):

anova_m3 <- anova(m3)
print(anova_m3)

## Analysis of Variance Table
## 
## Response: menthlth_days
##                 Df Sum Sq Mean Sq  F value    Pr(>F)    
## sleep_hrs        1   5865  5864.8 116.6678 < 2.2e-16 ***
## physhlth_days    1  26933 26933.0 535.7786 < 2.2e-16 ***
## age              1   9261  9261.5 184.2385 < 2.2e-16 ***
## income_cat       1   2649  2648.8  52.6918 4.503e-13 ***
## sex              1   1918  1918.4  38.1626 7.025e-10 ***
## exercise         1     92    92.1   1.8326    0.1759    
## Residuals     4993 250993    50.3                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

glance(m3) %>%
  select(r.squared, adj.r.squared, sigma, statistic, p.value, df, df.residual, nobs) %>%
  mutate(across(where(is.numeric), ~ round(., 4))) %>%
  pivot_longer(everything(), names_to = "Statistic", values_to = "Value") %>%
  mutate(Statistic = dplyr::recode(Statistic,
    "r.squared"     = "R²",
    "adj.r.squared" = "Adjusted R²",
    "sigma"         = "Residual Std. Error (Root MSE)",
    "statistic"     = "F-statistic",
    "p.value"       = "p-value (overall F-test)",
    "df"            = "Model df (p)",
    "df.residual"   = "Residual df (n − p − 1)",
    "nobs"          = "n (observations)"
  )) %>%
  kable(caption = "Overall Model Summary — Model 3") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Overall Model Summary — Model 3
Statistic	Value
R²	0.1569
Adjusted R²	0.1559
Residual Std. Error (Root MSE)	7.0901
F-statistic	154.8953
p-value (overall F-test)	0.0000
Model df (p)	6.0000
Residual df (n − p − 1)	4993.0000
n (observations)	5000.0000

Source	df	SS	MS	F
Model C	6	46718	46717.3	154.90
Residual	4993	250993	50.3	154.90
Total	5000	297711	46767.6	154.90

**State the null hypothesis for the overall F-test and your conclusion. The null hypothesis is that none of the covariates have a linear relationship with mental health days. The overall F-test is 15.90 with a p-value of <0.05, whcih means that at least one of these covariatres does have a linear relationship with mental health days so we can reject the null.

Task 5: Residual Diagnostics (15 points)

5a. (5 pts) For Model C, produce the four standard diagnostic plots (Residuals vs. Fitted, Normal Q-Q, Scale-Location, Cook’s Distance). Comment on what each plot tells you about the LINE assumptions.

par(mfrow = c(2, 2))
plot(m3, which = 1:4, col = adjustcolor("steelblue", alpha.f = 0.3), pch = 16)

cooks_d <- cooks.distance(m3)

# Create data frame
influence_df <- data.frame(
  observation = 1:length(cooks_d),
  cooks_d = cooks_d
) %>%
  mutate(influential = ifelse(cooks_d > 1, "Yes", "No"))

# Plot
p5 <- ggplot(influence_df, aes(x = observation, y = cooks_d, color = influential)) +
  geom_point(alpha = 0.6) +
  geom_hline(yintercept = 1, linetype = "dashed", color = "red") +
  labs(
    title = "Cook's Distance: Identifying Influential Observations",
    subtitle = "Values > 1 indicate potentially influential observations",
    x = "Observation Number",
    y = "Cook's Distance",
    color = "Influential?"
  ) +
  scale_color_manual(values = c("No" = "steelblue", "Yes" = "red")) +
  theme_minimal(base_size = 12)

ggplotly(p5)

Residual vs. fitted: Checks for linearity and equal variance. We want a horizontal red line and random scatter with no pattern. This graphs shows the spread increasing with fitted values indicating heteroscedasticity.

Normal Q-Q: Checks for normality of residuals. Points should fall approximately along the 45° reference line. The heavy tail suggests non-normality.

Scale-Location: Another check for equal variance or homoscedasticity. A flat line indicates constant variance. The graph shows increases along fitted values suggesting heteroscedasticity.

Cook’s Distance: Check for influential observations using Cook’s distance. Points above a Cook’s distance of 0.5 or 1 are considered to have high influence.

5b. (5 pts) Given the nature of the outcome (mental health days, bounded at 0 and 30, heavily right-skewed), which assumptions are most likely to be violated? Does this invalidate the analysis? Explain. Normality and equal variance are likely to be violated in this analysis. This does not invalidate the analysis because the sample size is large (n=5000) and the covariate group sizes are relatively balanced.

5c. (5 pts) Identify any observations with Cook’s Distance > 1. How many are there? What would you do with them in a real analysis? There are no observations with Cook’s distance >1. If there were I would remove them as they would be highly influential. —

Task 6: Interpretation for a Public Health Audience (10 points)

Suppose you are writing the results section of a public health paper. Write a 3–4 sentence paragraph summarizing the findings from Model C for a non-statistical audience. Your paragraph should:

Identify which predictors were significantly associated with mental health days
State the direction and approximate magnitude of the most important associations
Appropriately caveat the cross-sectional nature of the data (no causal language)
Not use any statistical jargon (no “significant,” “coefficient,” “p-value”)

This analysis aimed to study the associations between poor mental health and sleep, physical health, age, income, sex, and exercise. The results show that all of these factors play a role in poor mental health days, except for exercise which was found to have no association. Higher sleep, age, and income all reduced the average number of poor mental health days, while a lower number of days with poor physical health and being female reduced poor mental health days. It’s important to note that since this analysis only took a snapshot of a sample group of people, we cannot determine whther these variables are impacting mental health or if mental health is simply assoicated with them. Further studies that follow-up with their study participants can help determine this association.

Multiple Linear Regression

EPI 553 — Principles of Statistical Inference II (Spring 2026)

Matthew Goldman

March 3, 5, and 10, 2026