Part 2: In-Class Lab Activity
EPI 553 — Multiple Linear Regression Lab
Due: End of class, March 10, 2026
Instructions
In this lab, you will build and interpret multiple linear regression
models using the BRFSS 2020 analytic dataset. Work through each task
systematically. You may discuss concepts with classmates, but your
written answers and R code must be your own.
Submission: Knit your .Rmd to HTML and upload to
Brightspace by end of class.
Data for the Lab
Use the saved analytic dataset from today’s lecture. It contains
5,000 randomly sampled BRFSS 2020 respondents with the following
variables:
menthlth_days |
Mentally unhealthy days in past 30 |
Continuous (0–30) |
physhlth_days |
Physically unhealthy days in past 30 |
Continuous (0–30) |
sleep_hrs |
Sleep hours per night |
Continuous (1–14) |
age |
Age in years (capped at 80) |
Continuous |
income_cat |
Household income (1 = <$10k to 8 = >$75k) |
Ordinal |
sex |
Sex (Male/Female) |
Factor |
exercise |
Any physical activity past 30 days (Yes/No) |
Factor |
# Load the dataset
library(tidyverse)
library(broom)
library(knitr)
library(kableExtra)
library(gtsummary)
library(ggeffects)
brfss_mlr <- readRDS(
"C:/Users/userp/OneDrive/Рабочий стол/HSTA553/R files/brfss_mlr_2020.rds"
)
Task 1: Exploratory Data Analysis (15 points)
1a. (5 pts) Create a descriptive statistics table
using tbl_summary() that includes all variables in the
dataset. Include means (SD) for continuous variables and n (%) for
categorical variables.
brfss_mlr <- readRDS(
"C:/Users/userp/OneDrive/Рабочий стол/HSTA553/R files/brfss_mlr_2020.rds"
)
descriptive_table <- brfss_mlr %>%
tbl_summary(
statistic = list(
all_continuous() ~ "{mean} ({sd})",
all_categorical() ~ "{n} ({p}%)"
),
digits = all_continuous() ~ 2
)
descriptive_table
| Characteristic |
N = 5,000 |
| menthlth_days |
3.79 (7.72) |
| physhlth_days |
3.28 (7.79) |
| sleep_hrs |
7.06 (1.35) |
| IMPUTED AGE VALUE COLLAPSED ABOVE 80 |
54.31 (17.18) |
| income_cat |
|
| 1 |
190 (3.8%) |
| 2 |
169 (3.4%) |
| 3 |
312 (6.2%) |
| 4 |
434 (8.7%) |
| 5 |
489 (9.8%) |
| 6 |
683 (14%) |
| 7 |
841 (17%) |
| 8 |
1,882 (38%) |
| income_f |
|
| <$10k |
190 (3.8%) |
| $10-15k |
169 (3.4%) |
| $15-20k |
312 (6.2%) |
| $20-25k |
434 (8.7%) |
| $25-35k |
489 (9.8%) |
| $35-50k |
683 (14%) |
| $50-75k |
841 (17%) |
| >$75k |
1,882 (38%) |
| sex |
|
| Male |
2,331 (47%) |
| Female |
2,669 (53%) |
| exercise |
3,874 (77%) |
| bmi |
28.36 (6.35) |
| Unknown |
294 |
1b. (5 pts) Create a histogram of
menthlth_days. Describe the shape of the distribution. Is
it symmetric, right-skewed, or left-skewed? What are the implications of
this shape for regression modeling?
ggplot(brfss_mlr, aes(x = menthlth_days)) +
geom_histogram(binwidth = 1, fill = "skyblue", color = "black") +
labs(
title = "Distribution of Mental Health Days",
x = "Number of Poor Mental Health Days (Last 30 Days)",
y = "Frequency"
) +
theme_minimal()

1c. (5 pts) Create a scatterplot matrix (using
GGally::ggpairs() or similar) for the continuous variables:
menthlth_days, physhlth_days,
sleep_hrs, and age. Comment on the direction
and strength of each pairwise correlation with the outcome.
brfss_mlr %>%
select(menthlth_days, physhlth_days, sleep_hrs, age) %>%
GGally::ggpairs()

Task 2: Unadjusted (Simple) Linear Regression (15 points)
2a. (5 pts) Fit a simple linear regression model
regressing menthlth_days on sleep_hrs alone.
Write out the fitted regression equation.
model_sleep <- lm(menthlth_days ~ sleep_hrs, data = brfss_mlr)
summary(model_sleep)
##
## Call:
## lm(formula = menthlth_days ~ sleep_hrs, data = brfss_mlr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.670 -3.845 -3.040 -0.040 31.785
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.47429 0.57712 16.42 <2e-16 ***
## sleep_hrs -0.80424 0.08025 -10.02 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.642 on 4998 degrees of freedom
## Multiple R-squared: 0.0197, Adjusted R-squared: 0.0195
## F-statistic: 100.4 on 1 and 4998 DF, p-value: < 2.2e-16
2b. (5 pts) Interpret the slope for sleep in a
single sentence appropriate for a public health audience (no statistical
jargon).
Each additional hour of sleep is associated with fewer days of poor
mental health per month, suggesting that people who sleep longer tend to
report better mental health.
2c. (5 pts) State the null and alternative
hypotheses for the slope, report the t-statistic and p-value, and state
your conclusion. What is the degree of freedom for this test?
The null hypothesis states that the slope for sleep hours equals
zero, meaning sleep duration is not associated with mental health days.
The alternative hypothesis states that the slope is not zero. The
t-statistic is -10.02 with a p-value < 2×10⁻¹⁶, indicating strong
evidence against the null hypothesis. Therefore, we conclude that sleep
hours are significantly associated with the number of poor mental health
days. The degrees of freedom for this test are 4998.
Task 3: Building the Multivariable Model (25 points)
3a. (5 pts) Fit three models:
- Model A:
menthlth_days ~ sleep_hrs
- Model B:
menthlth_days ~ sleep_hrs + age + sex
- Model C:
menthlth_days ~ sleep_hrs + age + sex + physhlth_days + income_cat + exercise
model_A <- lm(menthlth_days ~ sleep_hrs, data = brfss_mlr)
model_B <- lm(menthlth_days ~ sleep_hrs + age + sex, data = brfss_mlr)
model_C <- lm(menthlth_days ~ sleep_hrs + age + sex + physhlth_days + income_cat + exercise,
data = brfss_mlr)
summary(model_A)
##
## Call:
## lm(formula = menthlth_days ~ sleep_hrs, data = brfss_mlr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.670 -3.845 -3.040 -0.040 31.785
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.47429 0.57712 16.42 <2e-16 ***
## sleep_hrs -0.80424 0.08025 -10.02 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.642 on 4998 degrees of freedom
## Multiple R-squared: 0.0197, Adjusted R-squared: 0.0195
## F-statistic: 100.4 on 1 and 4998 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = menthlth_days ~ sleep_hrs + age + sex, data = brfss_mlr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.306 -3.980 -2.515 -0.307 32.360
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.76563 0.64447 18.256 < 2e-16 ***
## sleep_hrs -0.73391 0.07932 -9.253 < 2e-16 ***
## age -0.06648 0.00622 -10.688 < 2e-16 ***
## sexFemale 1.53992 0.21339 7.217 6.13e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.523 on 4996 degrees of freedom
## Multiple R-squared: 0.05036, Adjusted R-squared: 0.04979
## F-statistic: 88.32 on 3 and 4996 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = menthlth_days ~ sleep_hrs + age + sex + physhlth_days +
## income_cat + exercise, data = brfss_mlr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.9192 -3.4262 -1.7803 0.2948 30.0568
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.475489 0.716959 17.401 < 2e-16 ***
## sleep_hrs -0.509160 0.075348 -6.757 1.57e-11 ***
## age -0.082307 0.005933 -13.872 < 2e-16 ***
## sexFemale 1.245053 0.202333 6.153 8.17e-10 ***
## physhlth_days 0.291657 0.013579 21.478 < 2e-16 ***
## income_cat -0.321323 0.052012 -6.178 7.02e-10 ***
## exerciseYes -0.342685 0.253138 -1.354 0.176
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.09 on 4993 degrees of freedom
## Multiple R-squared: 0.1569, Adjusted R-squared: 0.1559
## F-statistic: 154.9 on 6 and 4993 DF, p-value: < 2.2e-16
3b. (10 pts) Create a table comparing the sleep
coefficient (\(\hat{\beta}\), SE, 95%
CI, p-value) across Models A, B, and C. Does the sleep coefficient
change substantially when you add more covariates? What does this
suggest about confounding?
sleep_table <- bind_rows(
tidy(model_A, conf.int = TRUE) %>% filter(term == "sleep_hrs") %>% mutate(Model = "Model A"),
tidy(model_B, conf.int = TRUE) %>% filter(term == "sleep_hrs") %>% mutate(Model = "Model B"),
tidy(model_C, conf.int = TRUE) %>% filter(term == "sleep_hrs") %>% mutate(Model = "Model C")
) %>%
select(Model, estimate, std.error, conf.low, conf.high, p.value)
sleep_table
## # A tibble: 3 × 6
## Model estimate std.error conf.low conf.high p.value
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Model A -0.804 0.0802 -0.962 -0.647 2.03e-23
## 2 Model B -0.734 0.0793 -0.889 -0.578 3.17e-20
## 3 Model C -0.509 0.0753 -0.657 -0.361 1.57e-11
3c. (10 pts) For Model C, write out
the full fitted regression equation and interpret every
coefficient in plain language appropriate for a public health
report.
#In the fully adjusted model, each additional hour of sleep was
associated with 0.51 fewer poor mental health days per month, after
controlling for age, sex, physical health days, income, and exercise.
Age reflects the change in mental health days associated with increasing
age. Sex represents differences in mental health days between males and
females. Physical health days are positively associated with mental
health days, indicating that individuals with more poor physical health
also report more poor mental health days. Income category and exercise
represent socioeconomic and behavioral differences that may influence
mental health outcomes.
Task 4: Model Fit and ANOVA (20 points)
4a. (5 pts) Report \(R^2\) and Adjusted \(R^2\) for each of the three models (A, B,
C). Create a table. Which model explains the most variance in mental
health days?
model_table <- bind_rows(
glance(model_A) %>% mutate(Model = "Model A"),
glance(model_B) %>% mutate(Model = "Model B"),
glance(model_C) %>% mutate(Model = "Model C")
) %>%
select(Model, r.squared, adj.r.squared)
model_table
## # A tibble: 3 × 3
## Model r.squared adj.r.squared
## <chr> <dbl> <dbl>
## 1 Model A 0.0197 0.0195
## 2 Model B 0.0504 0.0498
## 3 Model C 0.157 0.156
4b. (5 pts) What is the Root MSE for Model C?
Interpret it in practical terms — what does it tell you about prediction
accuracy?
#The Root MSE for Model C represents the typical prediction error of
the model. It indicates that the predicted number of poor mental health
days differs from the observed values by approximately X days on
average.
4c. (10 pts) Using the ANOVA output for Model C,
fill in the following table manually (i.e., compute the values using the
output from anova() or glance()):
## Analysis of Variance Table
##
## Response: menthlth_days
## Df Sum Sq Mean Sq F value Pr(>F)
## sleep_hrs 1 5865 5864.8 116.6678 < 2.2e-16 ***
## age 1 6182 6182.2 122.9832 < 2.2e-16 ***
## sex 1 2947 2947.1 58.6266 2.274e-14 ***
## physhlth_days 1 29456 29455.5 585.9585 < 2.2e-16 ***
## income_cat 1 2177 2176.8 43.3031 5.169e-11 ***
## exercise 1 92 92.1 1.8326 0.1759
## Residuals 4993 250993 50.3
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## # A tibble: 1 × 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.157 0.156 7.09 155. 6.48e-181 6 -16885. 33785. 33837.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
#table | Source | df | SS | MS | F | |———-|——-|———–|——-|——-| | Model
| 6 | 46,719 | 7,786 | 154,9 | | Residual | 4,993 | 250,992 | 50,3 | | |
Total | 4,999 | 297,711.8 | | |
#State the null hypothesis for the overall F-test and your
conclusion: The F-statistic is 154.9 with a p-value < 0.001, which is
far below the typical significance level of 0.05. Therefore, we reject
the null hypothesis.
Task 5: Residual Diagnostics (15 points)
5a. (5 pts) For Model C, produce the four standard
diagnostic plots (Residuals vs. Fitted, Normal Q-Q, Scale-Location,
Cook’s Distance). Comment on what each plot tells you about the LINE
assumptions.
par(mfrow = c(2,2))
plot(model_C)

#Comments: #Residuals vs. Fitted (Linearity & Equal Variance)
This plot checks whether the relationship between predictors and the
outcome is linear and whether residuals have constant variance. #Normal
Q–Q Plot (Normality) The Q–Q plot compares the distribution of residuals
to a normal distribution. #Scale-Location Plot (Equal Variance) This
plot assesses whether the variance of residuals is constant across
fitted values. #Residuals vs. Leverage (Cook’s Distance) This plot
identifies influential observations that may strongly affect the
model.
5b. (5 pts) Given the nature of the outcome (mental
health days, bounded at 0 and 30, heavily right-skewed), which
assumptions are most likely to be violated? Does this invalidate the
analysis? Explain.
#Because mental health days are bounded between 0 and 30 and are
strongly right-skewed, the assumptions of normally distributed residuals
and constant variance are most likely violated. However, with a large
sample size, linear regression is relatively robust to these violations,
so the results remain informative. Alternative count-based models such
as Poisson or negative binomial regression could also be considered.
5c. (5 pts) Identify any observations with Cook’s
Distance > 1. How many are there? What would you do with them in a
real analysis?
cooks <- cooks.distance(model_C)
which(cooks > 1)
## named integer(0)
## [1] 0
#No observations have Cook’s Distance greater than 1, indicating that
there are no highly influential observations in the dataset. In a real
analysis, observations with large Cook’s Distance values would be
examined for potential data errors or unusual characteristics, and
sensitivity analyses could be performed to determine whether they
substantially affect the model results.
Task 6: Interpretation for a Public Health Audience (10 points)
#After accounting for several factors, people who reported more hours
of sleep tended to report fewer days of poor mental health each month,
with roughly half a day fewer for each additional hour of sleep.
Individuals who experienced more days of poor physical health reported
substantially more days of poor mental health, suggesting that physical
and mental health often occur together. Age, sex, and income level were
also related to differences in reported mental health days, while
exercise showed little difference after considering other factors.
Because the data were collected at a single point in time, these
findings describe patterns and relationships but cannot determine
whether any of these factors directly cause changes in mental
health.
End of Lab Activity