Part 2: In-Class Lab Activity

EPI 553 — Multiple Linear Regression Lab Due: End of class, March 10, 2026

Instructions

In this lab, you will build and interpret multiple linear regression models using the BRFSS 2020 analytic dataset. Work through each task systematically. You may discuss concepts with classmates, but your written answers and R code must be your own.

Submission: Knit your .Rmd to HTML and upload to Brightspace by end of class.

Data for the Lab

Use the saved analytic dataset from today’s lecture. It contains 5,000 randomly sampled BRFSS 2020 respondents with the following variables:

Variable	Description	Type
`menthlth_days`	Mentally unhealthy days in past 30	Continuous (0–30)
`physhlth_days`	Physically unhealthy days in past 30	Continuous (0–30)
`sleep_hrs`	Sleep hours per night	Continuous (1–14)
`age`	Age in years (capped at 80)	Continuous
`income_cat`	Household income (1 = <$10k to 8 = >$75k)	Ordinal
`sex`	Sex (Male/Female)	Factor
`exercise`	Any physical activity past 30 days (Yes/No)	Factor

Task 1: Exploratory Data Analysis (15 points)

1a. (5 pts) Create a descriptive statistics table using tbl_summary() that includes all variables in the dataset. Include means (SD) for continuous variables and n (%) for categorical variables.

#recode for income_cat
brfss_mlr <- brfss_mlr %>%
  mutate(
    income_cat = factor(
      income_cat,
      levels = 1:8,
      labels = c(
        "<$10,000",
        "$10,000–<15,000",
        "$15,000–<20,000",
        "$20,000–<25,000",
        "$25,000–<35,000",
        "$35,000–<50,000",
        "$50,000–<75,000",
        "$75,000+"
      )
    )
  )

# Descriptive statistics table
brfss_mlr %>%
  select(menthlth_days,physhlth_days, sleep_hrs,age, income_cat, sex, exercise) %>%
  tbl_summary(
    label = list(
      menthlth_days ~ "Mentally Unhealthy Days",
      physhlth_days ~ "Physically Unhealthy Days",
      sleep_hrs ~ "Sleep (hours/night)",
      income_cat ~ "Household Income",
      sex ~ "Sex",
      exercise ~ "Physical Activity"),
    statistic = list(
      all_continuous() ~ "{mean} ({sd})",
      all_categorical() ~ "{n} ({p})"
    ),
  digits = all_continuous() ~ 1 
  ) %>%
  add_n() %>%
  bold_labels() %>%
  modify_caption("Table 1: Descriptive Statistics (BRFSS 2020)")

Table 1: Descriptive Statistics (BRFSS 2020)
Characteristic	N	N = 5,000¹
Mentally Unhealthy Days	5,000	3.8 (7.7)
Physically Unhealthy Days	5,000	3.3 (7.8)
Sleep (hours/night)	5,000	7.1 (1.3)
IMPUTED AGE VALUE COLLAPSED ABOVE 80	5,000	54.3 (17.2)
Household Income	5,000
<$10,000		190 (3.8)
$10,000–<15,000		169 (3.4)
$15,000–<20,000		312 (6.2)
$20,000–<25,000		434 (8.7)
$25,000–<35,000		489 (9.8)
$35,000–<50,000		683 (14)
$50,000–<75,000		841 (17)
$75,000+		1,882 (38)
Sex	5,000
Male		2,331 (47)
Female		2,669 (53)
Physical Activity	5,000	3,874 (77)
¹ Mean (SD); n (%)

1b. (5 pts) Create a histogram of menthlth_days. Describe the shape of the distribution. Is it symmetric, right-skewed, or left-skewed? What are the implications of this shape for regression modeling?

#Histogram of menthlth_days
ggplot(brfss_mlr, aes(x = menthlth_days)) +
  geom_histogram(
    bins = 30,
    color = "white", fill = "steelblue") +
  labs(
    title = "Histogram of Mentally Unhealthy Days (BRFSS 2020)",
    x = "Mentally Unhealthy Days",
    y = "Count")

Interpretation:

The distribution is right-skewed. Many respondents report 0 mentally unhealthy days. With the heavy concentration at 0, it’s difficult for the linear regression to satisfy assumptions of normality and equal-variance.

1c. (5 pts) Create a scatterplot matrix (using GGally::ggpairs() or similar) for the continuous variables: menthlth_days, physhlth_days, sleep_hrs, and age. Comment on the direction and strength of each pairwise correlation with the outcome.

brfss_mlr %>%
  select(menthlth_days, physhlth_days, sleep_hrs, age) %>%
   rename(
    `Mental Health\nDays`  = menthlth_days,
    `Physical Health\nDays` = physhlth_days,
    `Sleep\n(hrs)`         = sleep_hrs,
    Age                    = age
  ) %>% 
  ggpairs(
    lower  = list(continuous = wrap("points", alpha = 0.05, size = 0.5)),
    diag   = list(continuous = wrap("densityDiag", fill = "steelblue", alpha = 0.5)),
    upper  = list(continuous = wrap("cor", size = 4)),
    title  = "Pairwise Relationships Among Key Variables (BRFSS 2020)"
  ) +
  theme_minimal(base_size = 11)

Mental health days and physical helath days have a moderate positive correlation, meaning individuals reporting more physically unhealhty days tend to have more mentally unhealthy days.

Age and mental health days have a small negative correlation, indicating younger adults report more mentally unhealthy days than older adults.

Hours of sleep has a small negative association with mental health days, meaning individuals who get more hours of sleep tend to report fewer mentally unhealthy days

Task 2: Unadjusted (Simple) Linear Regression (15 points)

2a. (5 pts) Fit a simple linear regression model regressing menthlth_days on sleep_hrs alone. Write out the fitted regression equation.

$\hat{Y} = 9.474 + -0.804 * sleep$

model_slr <- lm(menthlth_days ~ sleep_hrs, data = brfss_mlr)

summary(model_slr)

## 
## Call:
## lm(formula = menthlth_days ~ sleep_hrs, data = brfss_mlr)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.670 -3.845 -3.040 -0.040 31.785 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  9.47429    0.57712   16.42   <2e-16 ***
## sleep_hrs   -0.80424    0.08025  -10.02   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.642 on 4998 degrees of freedom
## Multiple R-squared:  0.0197, Adjusted R-squared:  0.0195 
## F-statistic: 100.4 on 1 and 4998 DF,  p-value: < 2.2e-16

2b. (5 pts) Interpret the slope for sleep in a single sentence appropriate for a public health audience (no statistical jargon).

Adults who sleep more hours a night tend to report fewer mentally unhealthy days, where each additional hour of sleep is associated with 0.8 fewer mentally unhealthy days.

2c. (5 pts) State the null and alternative hypotheses for the slope, report the t-statistic and p-value, and state your conclusion. What is the degree of freedom for this test?

H0: b1 = 0 Ha: b1 != 0 Test statistic: t= -10.02 p-value: <2e-16 Degree of Freedom: 4998

The p-value is <2e-16, so we reject the null hypothesis. we can conclude that hours of sleep per night is significantly associated with mentally unhealthy days in this sample.

Task 3: Building the Multivariable Model (25 points)

3a. (5 pts) Fit three models:

Model A: menthlth_days ~ sleep_hrs
Model B: menthlth_days ~ sleep_hrs + age + sex
Model C: menthlth_days ~ sleep_hrs + age + sex + physhlth_days + income_cat + exercise

mA <- lm(menthlth_days ~ sleep_hrs, data= brfss_mlr)

mB <- lm(menthlth_days ~ sleep_hrs + age + sex, data= brfss_mlr)

mC <- lm(menthlth_days ~ sleep_hrs + age + sex + physhlth_days + income_cat + exercise, data = brfss_mlr)

3b. (10 pts) Create a table comparing the sleep coefficient ($\hat{\beta}$, SE, 95% CI, p-value) across Models A, B, and C. Does the sleep coefficient change substantially when you add more covariates? What does this suggest about confounding?

tribble(
  ~Model, ~ `Sleep β̂`, ~`95% CI`, ~`Adj. R²`,
  "M1 (unadjusted)", round(coef(mA)[2], 3),
    paste0("(", round(confint(mA)[2,1],3), ", ", round(confint(mA)[2,2],3), ")"),
    round(summary(mA)$adj.r.squared, 3),
  "M2 (+age, +sex)",     round(coef(mB)[2], 3),
    paste0("(", round(confint(mB)[2,1],3), ", ", round(confint(mB)[2,2],3), ")"),
    round(summary(mB)$adj.r.squared, 3),
  "M3 (full)",       round(coef(mC)[2], 3),
    paste0("(", round(confint(mC)[2,1],3), ", ", round(confint(mC)[2,2],3), ")"),
    round(summary(mC)$adj.r.squared, 3)
) %>%
  kable(caption = "Sleep Coefficient Across Sequential Models") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(0, bold = TRUE)

Sleep Coefficient Across Sequential Models
Model	Sleep β	\|95% CI	Adj. R
M1 (unadjusted)	-0.804	(-0.962, -0.647)	0.020
M2 (+age, +sex)	-0.734	(-0.889, -0.578)	0.050
M3 (full)	-0.505	(-0.653, -0.357)	0.156

Interpretation:

Across the 3 models, the effect of sleep becomes slightly less negative as more covariates. This suggests that physical health days, income, exercise, age, and sex explain some of the relationship between sleep and mental health.

3c. (10 pts) For Model C, write out the full fitted regression equation and interpret every coefficient in plain language appropriate for a public health report.

summary(mC)

## 
## Call:
## lm(formula = menthlth_days ~ sleep_hrs + age + sex + physhlth_days + 
##     income_cat + exercise, data = brfss_mlr)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.4299  -3.4508  -1.7554   0.3154  29.7449 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               12.62902    0.81563  15.484  < 2e-16 ***
## sleep_hrs                 -0.50502    0.07537  -6.700 2.31e-11 ***
## age                       -0.08245    0.00596 -13.834  < 2e-16 ***
## sexFemale                  1.24182    0.20236   6.137 9.09e-10 ***
## physhlth_days              0.29136    0.01358  21.451  < 2e-16 ***
## income_cat$10,000–<15,000 -0.04980    0.75075  -0.066 0.947111    
## income_cat$15,000–<20,000 -1.47121    0.65316  -2.252 0.024337 *  
## income_cat$20,000–<25,000 -1.80753    0.61835  -2.923 0.003481 ** 
## income_cat$25,000–<35,000 -2.04393    0.60934  -3.354 0.000801 ***
## income_cat$35,000–<50,000 -1.77637    0.58710  -3.026 0.002493 ** 
## income_cat$50,000–<75,000 -2.68254    0.57658  -4.653 3.36e-06 ***
## income_cat$75,000+        -2.61495    0.54904  -4.763 1.96e-06 ***
## exerciseYes               -0.36319    0.25369  -1.432 0.152315    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.088 on 4987 degrees of freedom
## Multiple R-squared:  0.1584, Adjusted R-squared:  0.1563 
## F-statistic: 78.19 on 12 and 4987 DF,  p-value: < 2.2e-16

$ = 12.629 + -0.505(sleep_hrs) + -0.082(age) + 1.242(sex-female) + 0.291(physhlth_days) + -0.0498(10k - <15k) + -1.471(15k - <20k)+ -1.808 (20k-<25k) + -2.044 (25k - <35k) + -1.776(35k-<50k) + -2.683(50k-<75k) + -2.615(75k+) + -0.363(exercise-yes) $

Intercept (12.629): The represents the expected number of mentally unhealthy days for someone in all reference groups who is age 0, sleeps 0 hours, and has 0 physically unhealthy days. The intercept is not meaningful on its own, since these are not realistic conditions.

Sleep (-0.505): For each additional hour of sleep per night, adults report about half a day fewer mentally unhealthy days.

Age(-0.082): Each additional year of age is associated with slightly fewer mentally unhealthy days.

Sex-Female (1.242): Women report about 1.2 more mentally unhealthy days per month than men.

Physically Unhealthy Days (0.291): Each additional physically unhealthy day is linked to about 0.291 more mentally unhealthy days.

Income: Adults in higher-income groups consistently report fewer mentally unhealthy days. $10k - <15k: -0.0498 $15 - <20k: -1.471 $20k - <25k: -1.808 $25k - <35k: -2.044 $35k - <50k: - 1.776 $50k - <75k: -2.683 $75k: -2.615

Exercise-yes (-0.363): Adults who report regular exercise have slightly fewer mentally unhealthy days.

Task 4: Model Fit and ANOVA (20 points)

4a. (5 pts) Report $R^2$ and Adjusted $R^2$ for each of the three models (A, B, C). Create a table. Which model explains the most variance in mental health days?

r2_table <- bind_rows(
  glance(mA) %>% mutate(Model = "Model A: Sleep only"),
  glance(mB) %>% mutate(Model = "Model B: + Age + Sex"),
  glance(mC) %>% mutate(Model = "Model C: Full model")
) %>%
  select(Model, r.squared, adj.r.squared) %>%
  mutate(across(where(is.numeric), ~ round(., 4)))

r2_table %>%
  kable(
    caption = "Table. R² and Adjusted R² for Models A, B, and C",
    col.names = c("Model", "R²", "Adjusted R²")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(0, bold = TRUE)

Table. R² and Adjusted R² for Models A, B, and C
Model	R²	Adjusted R²
Model A: Sleep only	0.0197	0.0195
Model B: + Age + Sex	0.0504	0.0498
Model C: Full model	0.1584	0.1563

Model A: $R^2$ = 0.0197, Adjusted $R^2$ = 0.0195 Model B: $R^2$ = 0.0504, Adjusted $R^2$ = 0.0498 Model C: $R^2$ = 0.1584, Adjusted $R^2$ = 0.1563

Model C explains the most variance in mentally unhealthy days since it incorporates the strongest predictors.

4b. (5 pts) What is the Root MSE for Model C? Interpret it in practical terms — what does it tell you about prediction accuracy?

The Root MSE for Model C is 7.088. This indicates that Model C’s predictions differ from actual mentally unhealthy days by about 7 days.

4c. (10 pts) Using the ANOVA output for Model C, fill in the following table manually (i.e., compute the values using the output from anova() or glance()):

anova(mC)

## Analysis of Variance Table
## 
## Response: menthlth_days
##                 Df Sum Sq Mean Sq  F value    Pr(>F)    
## sleep_hrs        1   5865  5864.8 116.7255 < 2.2e-16 ***
## age              1   6182  6182.2 123.0441 < 2.2e-16 ***
## sex              1   2947  2947.1  58.6557 2.242e-14 ***
## physhlth_days    1  29456 29455.5 586.2487 < 2.2e-16 ***
## income_cat       7   2592   370.2   7.3687 7.870e-09 ***
## exercise         1    103   103.0   2.0495    0.1523    
## Residuals     4987 250567    50.2                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Source	df	SS	MS	F
Model	12	47145	3928.75	78.26
Residual	4987	250567	50.2
Total	4999	297712

State the null hypothesis for the overall F-test and your conclusion.

H0: All regression coefficients are equal to zero. The overall F-statistic is 78.26 with a p-value of <2.2e-16, indicating we reject the null hypothesis. Model C explains significantly more variation in mentally unhealthy days than models A and B.

Task 5: Residual Diagnostics (15 points)

5a. (5 pts) For Model C, produce the four standard diagnostic plots (Residuals vs. Fitted, Normal Q-Q, Scale-Location, Cook’s Distance). Comment on what each plot tells you about the LINE assumptions.

par(mfrow = c(2, 2))
plot(mC, which = 1:4,
     col = adjustcolor("steelblue", 0.4),
     pch = 19, cex = 0.6)

Interpretation: The Residuals vs Fitted plot shows a slight curve in the reference line. This indicates that the relationship between all predictors and mentally unhealthy days is not perfectly linear. The Q-Q plot shows clear deviations from the reference line, particularly in the upper tail. This reflects a skewed outcome. Scale-Location plot shows an increasing spread of residuals as fitted values increase, indicating heteroscedasticity. The Cook’s Distance plot shows no influential observations.

5b. (5 pts) Given the nature of the outcome (mental health days, bounded at 0 and 30, heavily right-skewed), which assumptions are most likely to be violated? Does this invalidate the analysis? Explain.

Since mentally unhealthy days are bounded at 0 and 30 and strongly right skewed,the assumptions most likely to be strained are normality, equal variance, and linearity. They don’t invalidate the analysis as this is a large sample size so linear regression still provides estimates of association. The predictions at the high end of the scale are less precise.

5c. (5 pts) Identify any observations with Cook’s Distance > 1. How many are there? What would you do with them in a real analysis?

cooks_d <- cooks.distance(mC)

ggplot(mC, aes(x = seq_along(cooks_d), y= cooks_d)) +
    geom_point(alpha = 0.6, color = "steelblue") +
  geom_hline(yintercept = 1, linetype = "dashed", color = "red") +
  labs(
    title = "Cook's Distance",
    x = "Observation",
    y = "Cook's Distance"
  )

## Warning: `fortify(<lm>)` was deprecated in ggplot2 4.0.0.
## ℹ Please use `broom::augment(<lm>)` instead.
## ℹ The deprecated feature was likely used in the ggplot2 package.
##   Please report the issue at <https://github.com/tidyverse/ggplot2/issues>.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Interpretation: Based on Cook’s Distance, there are no influential observations. If present in an analysis, we would need to inspect for data entry errors, check to see whether they represent unusual but valid cases, and consider sensitivity analyses if they have meaning to the results.

Task 6: Interpretation for a Public Health Audience (10 points)

Suppose you are writing the results section of a public health paper. Write a 3–4 sentence paragraph summarizing the findings from Model C for a non-statistical audience. Your paragraph should:

Identify which predictors were significantly associated with mental health days
State the direction and approximate magnitude of the most important associations
Appropriately caveat the cross-sectional nature of the data (no causal language)
Not use any statistical jargon (no “significant,” “coefficient,” “p-value”)

Adults who slept more hours per night tend to report fewer mentally unhealthy days, after accounting for age, sex, physical health, income, and exercise. Individuals with more physically unhealthy days and those with lower income report the highest levels of mentally unhealthy days, with physical health showing the strongest association. Since the data is cross-sectional, the findings describe patterns in the population, but can’t determine whether the factors directly cause changes to mental health status.

End of Lab Activity

MLR_Lab

Sarah Andres

2026-03-06