Part 2: In-Class Lab Activity

EPI 553 — Multiple Linear Regression Lab Due: End of class, March 10, 2026


Instructions

In this lab, you will build and interpret multiple linear regression models using the BRFSS 2020 analytic dataset. Work through each task systematically. You may discuss concepts with classmates, but your written answers and R code must be your own.

Submission: Knit your .Rmd to HTML and upload to Brightspace by end of class.


Data for the Lab

Use the saved analytic dataset from today’s lecture. It contains 5,000 randomly sampled BRFSS 2020 respondents with the following variables:

Variable Description Type
menthlth_days Mentally unhealthy days in past 30 Continuous (0–30)
physhlth_days Physically unhealthy days in past 30 Continuous (0–30)
sleep_hrs Sleep hours per night Continuous (1–14)
age Age in years (capped at 80) Continuous
income_cat Household income (1 = <$10k to 8 = >$75k) Ordinal
sex Sex (Male/Female) Factor
exercise Any physical activity past 30 days (Yes/No) Factor
# Load the dataset
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.5.2
## Warning: package 'tibble' was built under R version 4.5.2
## Warning: package 'tidyr' was built under R version 4.5.2
## Warning: package 'readr' was built under R version 4.5.2
## Warning: package 'purrr' was built under R version 4.5.2
## Warning: package 'dplyr' was built under R version 4.5.2
## Warning: package 'stringr' was built under R version 4.5.2
## Warning: package 'forcats' was built under R version 4.5.2
## Warning: package 'lubridate' was built under R version 4.5.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(broom)
## Warning: package 'broom' was built under R version 4.5.2
library(knitr)
## Warning: package 'knitr' was built under R version 4.5.2
library(kableExtra)
## Warning: package 'kableExtra' was built under R version 4.5.2
## 
## Attaching package: 'kableExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     group_rows
library(gtsummary)
## Warning: package 'gtsummary' was built under R version 4.5.2
library(ggeffects)
## Warning: package 'ggeffects' was built under R version 4.5.2
library(plotly)
## Warning: package 'plotly' was built under R version 4.5.2
## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout
library(GGally)
## Warning: package 'GGally' was built under R version 4.5.2
brfss_mlr <- readRDS(
  "C:/Users/abbym/OneDrive/Desktop/STATS553/R Materials/epi553/scripts/brfss_mlr_2020.rds"
)

Task 1: Exploratory Data Analysis (15 points)

# Create a descriptive statistics table
brfss_mlr %>%
  select(menthlth_days, physhlth_days, sleep_hrs, age,
         income_f, sex, exercise) %>%
  tbl_summary(
    label = list(
      menthlth_days ~ "Mentally unhealthy days (past 30)",
      physhlth_days ~ "Physically unhealthy days (past 30)",
      sleep_hrs     ~ "Sleep (hours/night)",
      age           ~ "Age (years)",
      income_f      ~ "Household income",
      sex           ~ "Sex",
      exercise      ~ "Any physical activity (past 30 days)"
    ),
    statistic = list(
      all_continuous()  ~ "{mean} ({sd})",
      all_categorical() ~ "{n} ({p}%)"
    ),
    digits = all_continuous() ~ 1,
    missing = "no"
  ) %>%
  add_n() %>%
  bold_labels() %>%
  modify_caption("**Table 1. Descriptive Statistics — BRFSS 2020 Analytic Sample (n = 5,000)**")
Table 1. Descriptive Statistics — BRFSS 2020 Analytic Sample (n = 5,000)
Characteristic N N = 5,0001
Mentally unhealthy days (past 30) 5,000 3.8 (7.7)
Physically unhealthy days (past 30) 5,000 3.3 (7.8)
Sleep (hours/night) 5,000 7.1 (1.3)
Age (years) 5,000 54.3 (17.2)
Household income 5,000
    <$10k
190 (3.8%)
    $10-15k
169 (3.4%)
    $15-20k
312 (6.2%)
    $20-25k
434 (8.7%)
    $25-35k
489 (9.8%)
    $35-50k
683 (14%)
    $50-75k
841 (17%)
    >$75k
1,882 (38%)
Sex 5,000
    Male
2,331 (47%)
    Female
2,669 (53%)
Any physical activity (past 30 days) 5,000 3,874 (77%)
1 Mean (SD); n (%)
# Create histogram of menthlth_days
p_hist <- ggplot(brfss_mlr, aes(x = menthlth_days)) +
  geom_histogram(binwidth = 1, fill = "steelblue", color = "white", alpha = 0.85) +
  labs(
    title    = "Distribution of Mentally Unhealthy Days in the Past 30 Days",
    subtitle = "BRFSS 2020 Analytic Sample (n = 5,000)",
    x        = "Number of Mentally Unhealthy Days",
    y        = "Count"
  ) +
  theme_minimal(base_size = 13)

ggplotly(p_hist)
# Create a scatterplot matrix for continuous variables 
brfss_mlr %>%
  select(menthlth_days, physhlth_days, sleep_hrs, age) %>%
  rename(
    `Mental Health\nDays`  = menthlth_days,
    `Physical Health\nDays` = physhlth_days,
    `Sleep\n(hrs)`         = sleep_hrs,
    Age                    = age
  ) %>%
  ggpairs(
    lower  = list(continuous = wrap("points", alpha = 0.05, size = 0.5)),
    diag   = list(continuous = wrap("densityDiag", fill = "steelblue", alpha = 0.5)),
    upper  = list(continuous = wrap("cor", size = 4)),
    title  = "Pairwise Relationships Among Key Variables (BRFSS 2020)"
  ) +
  theme_minimal(base_size = 11)

1a. (5 pts) Create a descriptive statistics table using tbl_summary() that includes all variables in the dataset. Include means (SD) for continuous variables and n (%) for categorical variables.

1b. (5 pts) Create a histogram of menthlth_days. Describe the shape of the distribution. Is it symmetric, right-skewed, or left-skewed? What are the implications of this shape for regression modeling? The histogram of bad mental health days is heavily right skewed. This goes against the assumption of normality for linear regression.

1c. (5 pts) Create a scatterplot matrix (using GGally::ggpairs() or similar) for the continuous variables: menthlth_days, physhlth_days, sleep_hrs, and age. Comment on the direction and strength of each pairwise correlation with the outcome. Physical health Days and mental health days have a moderate positive correlation. Sleep and mental health days have a weak negative correlation. Sleep and physical health days have a weak negative correlation. Mental Health days and age have a weak negative correlation. Physical health days and age have a weak positive correlation. Sleep and age have a weak positive correlation. —

Task 2: Unadjusted (Simple) Linear Regression (15 points)

# Fit a simple linear regression model 
m1 <- lm(menthlth_days ~ sleep_hrs, data = brfss_mlr)

tidy(m1, conf.int = TRUE) %>%
  mutate(
    term     = dplyr::recode(term,
      "(Intercept)" = "Intercept",
      "menthlth_days" = "Mental health days",
      "sleep_hrs"     = "Sleep (hours/night)"
    ),
    across(where(is.numeric), ~ round(., 4))
  ) %>%
  kable(
    caption  = "Table 3. Multiple Linear Regression: Mentally Unhealthy Days ~ Hours of Sleep (BRFSS 2020, n = 5,000)",
    col.names = c("Term", "Estimate (β̂)", "Std. Error", "t-statistic",
                  "p-value", "95% CI Lower", "95% CI Upper")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(0, bold = TRUE) 
Table 3. Multiple Linear Regression: Mentally Unhealthy Days ~ Hours of Sleep (BRFSS 2020, n = 5,000)
Term Estimate (β̂
Std. Erro
Intercept 9.4743 0.5771 16.4165 0 8.3429 10.6057
Sleep (hours/night) -0.8042 0.0802 -10.0218 0 -0.9616 -0.6469
# Conduct a slope test
slope_test <- tidy(m1, conf.int = TRUE) %>% filter(term == "sleep_hrs")

tibble(
  Quantity = c("Slope (b₁)", "SE(b₁)", "t-statistic",
               "Degrees of freedom", "p-value", "95% CI Lower", "95% CI Upper"),
  Value    = c(
    round(slope_test$estimate, 4),
    round(slope_test$std.error, 4),
    round(slope_test$statistic, 3),
    4998,
    format.pval(slope_test$p.value, digits = 3),
    round(slope_test$conf.low, 4),
    round(slope_test$conf.high, 4)
  )
) %>%
  kable(caption = "t-Test for the Slope (H₀: β₁ = 0)") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
t-Test for the Slope (H₀: β₁ = 0)
Quantity Value
Slope (b₁) -0.8042
SE(b₁) 0.0802
t-statistic -10.022
Degrees of freedom 4998
p-value <2e-16
95% CI Lower -0.9616
95% CI Upper -0.6469

2a. (5 pts) Fit a simple linear regression model regressing menthlth_days on sleep_hrs alone. Write out the fitted regression equation. menthlth_days= 9.4743 - 0.8042(sleep_hrs) 2b. (5 pts) Interpret the slope for sleep in a single sentence appropriate for a public health audience (no statistical jargon). Each additional hour of sleep is associated with 0.8042 fewer bad mental health days.

2c. (5 pts) State the null and alternative hypotheses for the slope, report the t-statistic and p-value, and state your conclusion. What is the degree of freedom for this test? The null hypothesis is that there is no linear relationship between hours of sleep and number of bad mental health days (slope=0) and the null hypothesis is that there is a linear relationship between hours of sleep and number of bad mental health days (slope does not equal 0). The t-statistic is -10.022, the p-value is <2e-16 and the degrees of freedom are 4998. You reject the null hypothesis that there is no linear relationship between sleep and bad mental health days. The slope is statistically significant, and shows a negative association between sleep and bad mental health days.


Task 3: Building the Multivariable Model (25 points)

# Fit three models 
# Model 1: Unadjusted
m1 <- lm(menthlth_days ~ sleep_hrs, data = brfss_mlr)

# Model 2: Add sleep
m2 <- lm(menthlth_days ~ sleep_hrs + age + sex, data = brfss_mlr)

# Model 3: Full multivariable model
m3 <- lm(menthlth_days ~ sleep_hrs + age + sex + physhlth_days + income_cat + exercise,
         data = brfss_mlr)
# Create comparison table
tribble(
  ~Model, ~`Mental Health Days β̂`, ~`SE`, ~`p-value`, ~`95% CI`, ~`Adj. R²`,

  "M1 (unadjusted)", 
    round(coef(m1)[2], 3),
    round(summary(m1)$coefficients[2,2], 3),
    round(summary(m1)$coefficients[2,4], 4),
    paste0("(", round(confint(m1)[2,1],3), ", ", round(confint(m1)[2,2],3), ")"),
    round(summary(m1)$adj.r.squared, 3),

  "M2 (+age +sex)",     
    round(coef(m2)[2], 3),
    round(summary(m2)$coefficients[2,2], 3),
    round(summary(m2)$coefficients[2,4], 4),
    paste0("(", round(confint(m2)[2,1],3), ", ", round(confint(m2)[2,2],3), ")"),
    round(summary(m2)$adj.r.squared, 3),

  "M3 (full)",       
    round(coef(m3)[2], 3),
    round(summary(m3)$coefficients[2,2], 3),
    round(summary(m3)$coefficients[2,4], 4),
    paste0("(", round(confint(m3)[2,1],3), ", ", round(confint(m3)[2,2],3), ")"),
    round(summary(m3)$adj.r.squared, 3)

) %>%
  kable(caption = "Table 4. Physical Health Days Coefficient Across Sequential Models") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(0, bold = TRUE)
Table 4. Physical Health Days Coefficient Across Sequential Models
Model Mental Health Days β
   S
M1 (unadjusted) -0.804 0.080 0 (-0.962, -0.647) 0.020
M2 (+age +sex) -0.734 0.079 0 (-0.889, -0.578) 0.050
M3 (full) -0.509 0.075 0 (-0.657, -0.361) 0.156

3a. (5 pts) Fit three models:

  • Model A: menthlth_days ~ sleep_hrs
  • Model B: menthlth_days ~ sleep_hrs + age + sex
  • Model C: menthlth_days ~ sleep_hrs + age + sex + physhlth_days + income_cat + exercise

3b. (10 pts) Create a table comparing the sleep coefficient (\(\hat{\beta}\), SE, 95% CI, p-value) across Models A, B, and C. Does the sleep coefficient change substantially when you add more covariates? What does this suggest about confounding? The sleep coeffecient does change a good amount when other covariates are included. The coefficient decreases the number of fewer days 1 hour of additional sleep impacts, meaning that the other variables are confounders that enhance the relationship between sleep and bad mental health days. 3c. (10 pts) For Model C, write out the full fitted regression equation and interpret every coefficient in plain language appropriate for a public health report. \(\hat{Mental Health Days}\)= 12.4755 -0.5092(sleep_hrs)-0.0823(age)+1.2451(sex)+0.2917(physhlth_days)-0.3213(income_cat)-0.3427(exercizse) Each additional hour of sleep is associated with 0.5092 fewer bad mental health days, holding everything else constant. Each year increase in age is associated with 0.0823 fewer bad mental health days, holding everything else constant. Compared to males, females experience 1.2451 more bad mental health days on average, holding everything else constant. Each additional physical health day is associated with 0.2917 more bad mental health days, holding everything else constant. Each one unit increase in income level is associated with 0.3213 fewer bad mental health days, holding everything else constant. Compared to those who do not exercise, those who do experience 0.3427 fewer bad mental health days on average, holding everything else constant. —

Task 4: Model Fit and ANOVA (20 points)

tribble(
  ~Model, ~Predictors, ~R2, ~`Adj. R²`,
  "M1: Sleep only",      1, round(summary(m1)$r.squared, 4), round(summary(m1)$adj.r.squared, 4),
  "M2: + Sleep + Age",                   2, round(summary(m2)$r.squared, 4), round(summary(m2)$adj.r.squared, 4),
  "M3: Full (+ age, income, sex, exercise, physical days)", 6,
    round(summary(m3)$r.squared, 4), round(summary(m3)$adj.r.squared, 4)
) %>%
  kable(caption = "Table 7. R² and Adjusted R² Across Sequential Models") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(3, bold = TRUE, background = "#EBF5FB")
Table 7. R² and Adjusted R² Across Sequential Models
Model Predictors R2 Adj. R²
M1: Sleep only 1 0.0197 0.0195
M2: + Sleep + Age 2 0.0504 0.0498
M3: Full (+ age, income, sex, exercise, physical days) 6 0.1569 0.1559
glance(m3) %>%
  select(r.squared, adj.r.squared, sigma, statistic, p.value, df, df.residual, nobs) %>%
  mutate(across(where(is.numeric), ~ round(., 4))) %>%
  pivot_longer(everything(), names_to = "Statistic", values_to = "Value") %>%
  mutate(Statistic = dplyr::recode(Statistic,
    "r.squared"     = "R²",
    "adj.r.squared" = "Adjusted R²",
    "sigma"         = "Residual Std. Error (Root MSE)",
    "statistic"     = "F-statistic",
    "p.value"       = "p-value (overall F-test)",
    "df"            = "Model df (p)",
    "df.residual"   = "Residual df (n − p − 1)",
    "nobs"          = "n (observations)"
  )) %>%
  kable(caption = "Table 6. Overall Model Summary — Model 3") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Table 6. Overall Model Summary — Model 3
Statistic Value
0.1569
Adjusted R² 0.1559
Residual Std. Error (Root MSE) 7.0901
F-statistic 154.8953
p-value (overall F-test) 0.0000
Model df (p) 6.0000
Residual df (n − p − 1) 4993.0000
n (observations) 5000.0000
anova_m3 <- anova(m3)
print(anova_m3)
## Analysis of Variance Table
## 
## Response: menthlth_days
##                 Df Sum Sq Mean Sq  F value    Pr(>F)    
## sleep_hrs        1   5865  5864.8 116.6678 < 2.2e-16 ***
## age              1   6182  6182.2 122.9832 < 2.2e-16 ***
## sex              1   2947  2947.1  58.6266 2.274e-14 ***
## physhlth_days    1  29456 29455.5 585.9585 < 2.2e-16 ***
## income_cat       1   2177  2176.8  43.3031 5.169e-11 ***
## exercise         1     92    92.1   1.8326    0.1759    
## Residuals     4993 250993    50.3                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

4a. (5 pts) Report \(R^2\) and Adjusted \(R^2\) for each of the three models (A, B, C). Create a table. Which model explains the most variance in mental health days? The full model (model C) explains the most variance in mental health days.

4b. (5 pts) What is the Root MSE for Model C? Interpret it in practical terms — what does it tell you about prediction accuracy? The root MSE is 7.0901. This tells us how many bad mental health days the model is off on average. 4c. (10 pts) Using the ANOVA output for Model C, fill in the following table manually (i.e., compute the values using the output from anova() or glance()):

Source df SS MS F
Model 6.0000 46719 7786.5 154.8953
Residual 4993.0000 250993 50.3
Total 4999.0000 297712

State the null hypothesis for the overall F-test and your conclusion. The null hypothesis for the F-test is that all of the coefficients are equal to 0. There is statistically significant evidence that at least one predictor’s coefficient is not equal to 0, meaning it is associated with the number of poor mental health days. —

Task 5: Residual Diagnostics (15 points)

par(mfrow = c(2, 2))
plot(m3, which = 1:4, col = adjustcolor("steelblue", alpha.f = 0.3), pch = 16)

par(mfrow = c(1, 1))

# Cook's Distance 
augmented <- augment(m3)

augmented <- augmented %>%
  mutate(
    obs_num = row_number(),
    cooks_d = cooks.distance(m3),
    influential = ifelse(cooks_d > 1, "Potentially influential", "Not influential")
  )

n_influential <- sum(augmented$cooks_d > 1)

p_cooks <- ggplot(augmented, aes(x = obs_num, y = cooks_d, color = influential)) +
  geom_point(alpha = 0.6, size = 1.2) +
  geom_hline(yintercept = 1, linetype = "dashed",
             color = "red", linewidth = 1) +
  scale_color_manual(values = c("Potentially influential" = "tomato",
                                "Not influential" = "steelblue")) +
  labs(
    title = "Cook's Distance",
    subtitle = paste0("Dashed line = 1 threshold | ",
                      n_influential, " potentially influential observations"),
    x = "Observation Number",
    y = "Cook's Distance",
    color = ""
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "top")

p_cooks

5a. (5 pts) For Model C, produce the four standard diagnostic plots (Residuals vs. Fitted, Normal Q-Q, Scale-Location, Cook’s Distance). Comment on what each plot tells you about the LINE assumptions. The residuals vs. fitted plot shows a somewhat random distribution of the points, but the red line isn’t exactly horizontal. This tells us that the linearity assumption could be violated.The Q-Q plot shows a deviation from the line, meaning the data are not normally distributed. The scale-location test shows a non-horizontal red line, meaning the equivalence of variances assumptions could be violated. There are no observations about Cook’s distance, meaning there are no influential observations. 5b. (5 pts) Given the nature of the outcome (mental health days, bounded at 0 and 30, heavily right-skewed), which assumptions are most likely to be violated? Does this invalidate the analysis? Explain. The normality, linearity, and equivalent variance assumptions will most likely be violated. The normality violation will unlikely invalidate the results, because it is such a large sample size, the central limit theorem applies. The linearity and equivalence of variance violations are more likely to invalidate the results. The results of these tests can still have give information regarding the relationship between the predictors and bad mental health days. 5c. (5 pts) Identify any observations with Cook’s Distance > 1. How many are there? What would you do with them in a real analysis? There are no observations with Cook’s distance > 1. I would remove them in a real analysis. —

Task 6: Interpretation for a Public Health Audience (10 points)

Suppose you are writing the results section of a public health paper. Write a 3–4 sentence paragraph summarizing the findings from Model C for a non-statistical audience. Your paragraph should:

  • Identify which predictors were significantly associated with mental health days
  • State the direction and approximate magnitude of the most important associations
  • Appropriately caveat the cross-sectional nature of the data (no causal language)
  • Not use any statistical jargon (no “significant,” “coefficient,” “p-value”)

Sleep, age, sex, physical health days, and income category all have impacts on the number of poor mental health days. Increasing age, sleep, and income decreases the number of bad mental health days, whereas being female and having more physical health days increases the number of bad mental health days you may have in a month. Sleep, sex, and income have larger impacts on mental health days than age and the number of physical health days.

End of Lab Activity