knitr::opts_chunk$set(
  echo    = TRUE,
  warning = FALSE,
  message = FALSE,
  fig.width  = 10,
  fig.height = 6,
  cache  = FALSE
)

Introduction

In the previous lectures, we used Simple Linear Regression (SLR) to model a continuous outcome as a function of a single predictor. But real-world health phenomena are never driven by one variable alone. A person’s mental health is shaped by how much they sleep, how physically ill they are, how old they are, their income, their sex, and much more — simultaneously.

Multiple Linear Regression (MLR) extends SLR to accommodate this complexity. It allows us to:

Adjust for confounding — estimate the association between an exposure and outcome after accounting for other covariates
Improve prediction — use multiple pieces of information to better forecast an outcome
Model effect modification — examine whether the effect of one variable changes across levels of another
Gain precision — by explaining additional variance in the outcome, we can reduce uncertainty in our estimates

In epidemiology, MLR is our primary tool for multivariable analysis of continuous outcomes.

Setup and Data

library(tidyverse)
library(haven)
library(janitor)
library(knitr)
library(kableExtra)
library(plotly)
library(broom)
library(ggeffects)
library(gtsummary)
library(GGally)
library(car)
library(lmtest)
library(corrplot)

Part 2: In-Class Lab Activity

EPI 553 — Multiple Linear Regression Lab Due: End of class, March 10, 2026

Instructions

In this lab, you will build and interpret multiple linear regression models using the BRFSS 2020 analytic dataset. Work through each task systematically. You may discuss concepts with classmates, but your written answers and R code must be your own.

Submission: Knit your .Rmd to HTML and upload to Brightspace by end of class.

Data for the Lab

Use the saved analytic dataset from today’s lecture. It contains 5,000 randomly sampled BRFSS 2020 respondents with the following variables:

Variable	Description	Type
`menthlth_days`	Mentally unhealthy days in past 30	Continuous (0–30)
`physhlth_days`	Physically unhealthy days in past 30	Continuous (0–30)
`sleep_hrs`	Sleep hours per night	Continuous (1–14)
`age`	Age in years (capped at 80)	Continuous
`income_cat`	Household income (1 = <$10k to 8 = >$75k)	Ordinal
`sex`	Sex (Male/Female)	Factor
`exercise`	Any physical activity past 30 days (Yes/No)	Factor

# Load the dataset
library(tidyverse)
library(broom)
library(knitr)
library(kableExtra)
library(gtsummary)
library(ggeffects)

brfss_mlr_new <- readRDS(
"/Users/morganwheat/Epi_553/data/LLCP2020.RDS  copy"
)

Task 1: Exploratory Data Analysis (15 points)

# a. (5 pts)** Create a descriptive statistics table using `tbl_summary()` that includes all variables in the dataset. Include means (SD) for continuous variables and n (%) for categorical variables.

brfss_mlr_new %>%
  select(menthlth_days, physhlth_days, sleep_hrs, age,
         income_f, sex, exercise) %>%
  tbl_summary(
    label = list(
      menthlth_days ~ "Mentally unhealthy days (past 30)",
      physhlth_days ~ "Physically unhealthy days (past 30)",
      sleep_hrs     ~ "Sleep (hours/night)",
      age           ~ "Age (years)",
      income_f      ~ "Household income",
      sex           ~ "Sex",
      exercise      ~ "Any physical activity (past 30 days)"
    ),
    statistic = list(
      all_continuous()  ~ "{mean} ({sd})",
      all_categorical() ~ "{n} ({p}%)"
    ),
    digits = all_continuous() ~ 1,
    missing = "no"
  ) %>%
  add_n() %>%
  # add_overall() %>%
  bold_labels() %>%
  #italicize_levels() %>%
  modify_caption("**Table 1. Descriptive Statistics — BRFSS 2020 Analytic Sample (n = 5,000)**")

**Table 1. Descriptive Statistics — BRFSS 2020 Analytic Sample (n = 5,000)**
Characteristic	N	N = 5,000¹
Mentally unhealthy days (past 30)	5,000	3.8 (7.7)
Physically unhealthy days (past 30)	5,000	3.3 (7.8)
Sleep (hours/night)	5,000	7.1 (1.3)
Age (years)	5,000	54.3 (17.2)
Household income	5,000
<$10k		190 (3.8%)
$10-15k		169 (3.4%)
$15-20k		312 (6.2%)
$20-25k		434 (8.7%)
$25-35k		489 (9.8%)
$35-50k		683 (14%)
$50-75k		841 (17%)
>$75k		1,882 (38%)
Sex	5,000
Male		2,331 (47%)
Female		2,669 (53%)
Any physical activity (past 30 days)	5,000	3,874 (77%)
¹ Mean (SD); n (%)

# b. (5 pts)** Create a histogram of `menthlth_days`. Describe the shape of the distribution. Is it symmetric, right-skewed, or left-skewed? What are the implications of this shape for regression modeling?

p_hist <- ggplot(brfss_mlr_new, aes(x = menthlth_days)) +
  geom_histogram(binwidth = 1, fill = "steelblue", color = "white", alpha = 0.85) +
  labs(
    title    = "Distribution of Mentally Unhealthy Days in the Past 30 Days",
    subtitle = "BRFSS 2020 Analytic Sample (n = 5,000)",
    x        = "Number of Mentally Unhealthy Days",
    y        = "Count"
  ) +
  theme_minimal(base_size = 13)

ggplotly(p_hist)

# c. (5 pts)** Create a scatterplot matrix (using `GGally::ggpairs()` or similar) for the continuous variables: `menthlth_days`, `physhlth_days`, `sleep_hrs`, and `age`. Comment on the direction and strength of each pairwise correlation with the outcome.

# Pairs plot of continuous predictors vs outcome
brfss_mlr_new %>%
  select(menthlth_days, physhlth_days, sleep_hrs, age) %>%
  rename(
    `Mental Health\nDays`  = menthlth_days,
    `Physical Health\nDays` = physhlth_days,
    `Sleep\n(hrs)`         = sleep_hrs,
    Age                    = age
  ) %>%
  ggpairs(
    lower  = list(continuous = wrap("points", alpha = 0.05, size = 0.5)),
    diag   = list(continuous = wrap("densityDiag", fill = "steelblue", alpha = 0.5)),
    upper  = list(continuous = wrap("cor", size = 4)),
    title  = "Pairwise Relationships Among Key Variables (BRFSS 2020)"
  ) +
  theme_minimal(base_size = 11)

ANSWERED QUESTIONS

The created histogram displays that the outcome (Number of Mentally Unhealthy Days in the past 30 days) is highly right-skewed. This may affect regression modeling because it can lead to non-normally distributed residuals, potentially violating the normality assumption required for valid inference in multiple linear regression.
Continuing to use the number of mentally unhealthy days in the past 30 days (menthlth_days) as the outcome, there is a weak positive correlation (r = 0.315) with the number of physically unhealthy days in the past 30 days, a very weak negative correlation (r = -0.140) with sleep hours per night, and a very weak negative correlation (r = -0.156) with age.

Task 2: Unadjusted (Simple) Linear Regression (15 points)

# a. (5 pts)** Fit a simple linear regression model regressing `menthlth_days` on `sleep_hrs` alone. Write out the fitted regression equation.

# Model 1: Unadjusted
mo1 <- lm(menthlth_days ~ sleep_hrs, data = brfss_mlr_new)
summary(mo1)

## 
## Call:
## lm(formula = menthlth_days ~ sleep_hrs, data = brfss_mlr_new)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.670 -3.845 -3.040 -0.040 31.785 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  9.47429    0.57712   16.42   <2e-16 ***
## sleep_hrs   -0.80424    0.08025  -10.02   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.642 on 4998 degrees of freedom
## Multiple R-squared:  0.0197, Adjusted R-squared:  0.0195 
## F-statistic: 100.4 on 1 and 4998 DF,  p-value: < 2.2e-16

#b. (5 pts)** Interpret the slope for sleep in a single sentence appropriate for a public health audience (no statistical jargon).]

#c. (5 pts)** State the null and alternative hypotheses for the slope, report the t-statistic and p-value, and state your conclusion. What is the degree of freedom for this test?
slope_test <- tidy(mo1, conf.int = TRUE) %>%
   filter(term == "sleep_hrs")

tibble(
  Quantity = c("Slope (b1)", "t-statistic", "p-value"),
  Value = c(
    round(slope_test$estimate, 4),
    round(slope_test$statistic, 3),
    format.pval(slope_test$p.value, digits = 3)
  )
) %>%
  kable(caption = "t-Test for the Slope (H₀: β₁ = 0)") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

t-Test for the Slope (H₀: β₁ = 0)
Quantity	Value
Slope (b1)	-0.8042
t-statistic	-10.022
p-value	<2e-16

ANSWERED QUESTIONS

$\hat{Mentally Unhealthy Days} = 9.47429 + -0.80424 (Sleep(hrs/night)$
For every one-hour increase in sleep per night, the number of mentally unhealthy days in the past 30 days is predicted to decrease by about 1 day.
The null hypothesis ((H₀): β₁ = 0) is that there is no linear relationship between sleep hours per night and mentally unhealthy days in the past 30 day.The alternative hypothesis ((𝐻𝐴):𝛽1≠0) is that there is a linear relationship between sleep hours per night and mentally unhealthy days in the past 30 days. The the test statistic is (t = -10.02) and the p-value is (p = 2e-16). The numerator degree of freedom for this test is (df = 1), and the denominator degree of freedom is (df = 4998).

Task 3: Building the Multivariable Model (25 points)

#a. (5 pts)** Fit three models:

# **Model A:** `menthlth_days ~ sleep_hrs`
# **Model B:** `menthlth_days ~ sleep_hrs + age + sex`
# **Model C:** `menthlth_days ~ sleep_hrs + age + sex + physhlth_days + income_cat + exercise`

# Model 1: `menthlth_days ~ sleep_hrs`
mo1 <- lm(menthlth_days ~ sleep_hrs, data = brfss_mlr_new)

# mo1 <- lm(menthlth_days ~ sleep_hrs -1, data = brfss_mlr_new)

# Model 2: Add age and sex
mo2 <- lm(menthlth_days ~ sleep_hrs + age + sex, data = brfss_mlr_new)

# Model 3: `menthlth_days ~ sleep_hrs + age + sex + physhlth_days + income_cat + exercise`
mo3 <- lm(menthlth_days ~ sleep_hrs + age + sex + physhlth_days + income_cat + exercise,
         data = brfss_mlr_new)

summary(mo1)

## 
## Call:
## lm(formula = menthlth_days ~ sleep_hrs, data = brfss_mlr_new)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.670 -3.845 -3.040 -0.040 31.785 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  9.47429    0.57712   16.42   <2e-16 ***
## sleep_hrs   -0.80424    0.08025  -10.02   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.642 on 4998 degrees of freedom
## Multiple R-squared:  0.0197, Adjusted R-squared:  0.0195 
## F-statistic: 100.4 on 1 and 4998 DF,  p-value: < 2.2e-16

summary(mo2)

## 
## Call:
## lm(formula = menthlth_days ~ sleep_hrs + age + sex, data = brfss_mlr_new)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.306 -3.980 -2.515 -0.307 32.360 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.76563    0.64447  18.256  < 2e-16 ***
## sleep_hrs   -0.73391    0.07932  -9.253  < 2e-16 ***
## age         -0.06648    0.00622 -10.688  < 2e-16 ***
## sexFemale    1.53992    0.21339   7.217 6.13e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.523 on 4996 degrees of freedom
## Multiple R-squared:  0.05036,    Adjusted R-squared:  0.04979 
## F-statistic: 88.32 on 3 and 4996 DF,  p-value: < 2.2e-16

summary(mo3)

## 
## Call:
## lm(formula = menthlth_days ~ sleep_hrs + age + sex + physhlth_days + 
##     income_cat + exercise, data = brfss_mlr_new)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.9192  -3.4262  -1.7803   0.2948  30.0568 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   12.475489   0.716959  17.401  < 2e-16 ***
## sleep_hrs     -0.509160   0.075348  -6.757 1.57e-11 ***
## age           -0.082307   0.005933 -13.872  < 2e-16 ***
## sexFemale      1.245053   0.202333   6.153 8.17e-10 ***
## physhlth_days  0.291657   0.013579  21.478  < 2e-16 ***
## income_cat    -0.321323   0.052012  -6.178 7.02e-10 ***
## exerciseYes   -0.342685   0.253138  -1.354    0.176    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.09 on 4993 degrees of freedom
## Multiple R-squared:  0.1569, Adjusted R-squared:  0.1559 
## F-statistic: 154.9 on 6 and 4993 DF,  p-value: < 2.2e-16

#b. (10 pts)** Create a table comparing the sleep coefficient ($\hat{\beta}$, SE, 95% CI, p-value) across Models A, B, and C. Does the sleep coefficient change substantially when you add more covariates? What does this suggest about confounding?

b <- round(coef(mo3), 3)
ci <- round(confint(mo3), 3)

# Compare the sleep_hrs coefficient across models
tribble(
  ~Model, ~`sleep_hrs β̂`, ~`95% CI`, ~`p-value`, ~`SE`,
  "Mo1 (unadjusted)", round(coef(mo1)[2], 3),
    paste0("(", round(confint(mo1)[2,1],3), ", ", round(confint(mo1)[2,2],3), ")"),
    round(summary(mo1)$coefficients[2,4], 3),
    round(summary(mo1)$coefficients[2,2], 3),

  "Mo2 (+sleep)", round(coef(mo2)[2], 3),
    paste0("(", round(confint(mo2)[2,1],3), ", ", round(confint(mo2)[2,2],3), ")"),
    round(summary(mo2)$coefficients[2,4], 3),
    round(summary(mo2)$coefficients[2,2], 3),

  "Mo3 (full)", round(coef(mo3)[2], 3),
    paste0("(", round(confint(mo3)[2,1],3), ", ", round(confint(mo3)[2,2],3), ")"),
    round(summary(mo3)$coefficients[2,4], 3),
    round(summary(mo3)$coefficients[2,2], 3)
) %>%
  kable(caption = "Table 4. Sleep Hours Per Night Coefficient Across Sequential Models") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(0, bold = TRUE)

Table 4. Sleep Hours Per Night Coefficient Across Sequential Models
Model	sleep_hrs β	\|95% CI
Mo1 (unadjusted)	-0.804	(-0.962, -0.647)	0.080
Mo2 (+sleep)	-0.734	(-0.889, -0.578)	0.079
Mo3 (full)	-0.509	(-0.657, -0.361)	0.075

#c. (10 pts)** For **Model C**, write out the full fitted regression equation and interpret **every** coefficient in plain language appropriate for a public health report.

ANSWERED QUESTIONS

The sleep coefficient does change substantially when additional covariates are added to the models. This suggests that these variables may introduce some confounding in the relationship between sleep hours per night and the number of mentally unhealthy days in the past 30 days.
$\hat{Mentally Unhealthy Days} = 12.475 + -0.509 (Sleep(hrs/night)$ + -0.082(Age) + 1.245(Female) + 0.292(Physical Health Days) + -0.321(Income) + -0.343(Excercise:Yes). To interpret all listed coefficients in plain language, each additional hour of sleep per night is associated with an estimated 1/2 day decrease in mentally unhealthy days on average. For each additional year of age, there is an associated 0.082 fewer mentally unhealthy days on average. As for sex, compared to males, females report an estimated 1.245 more mentally unhealthy days on average. Looking at physical health days, for each additional day of poor physical health, there is an associated 0.292 increase in the number of mentally unhealthy days on average. For the income category (on the 1-8 ordinal scale), each one-unit increase is associated with 0.321 fewer mentally unhealthy days on average. People who engaged in any physical activity in the past 30 days report an estimated 0.343 fewer mentally unhealthy days compared to individuals who did not exercise. Finally, looking at the intercept, the estimated mean number of unhealthy mental health days for a person with zero hours of sleep per night, zero physically unhealthy days, zero years of age, who is male and does not exercise, and has an income level of zero, is about 12 1/2 days within the last 30 days.

Task 4: Model Fit and ANOVA (20 points)

# a. (5 pts)** Report $R^2$ and Adjusted $R^2$ for each of the three models (A, B, C). Create a table. Which model explains the most variance in mental health days?
  
  tribble(
  ~Model, ~Predictors, ~R2, ~`Adj. R²`,
  "Mo1: Physical health only",      1, round(summary(mo1)$r.squared, 4), round(summary(mo1)$adj.r.squared, 4),
  "Mo2: + Sleep",                   2, round(summary(mo2)$r.squared, 4), round(summary(mo2)$adj.r.squared, 4),
  "Mo3: Full (+ age, income, sex, exercise)", 6,
    round(summary(mo3)$r.squared, 4), round(summary(mo3)$adj.r.squared, 4)
) %>%
  kable(caption = "Table 7. R² and Adjusted R² Across Sequential Models") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(3, bold = TRUE, background = "#EBF5FB")

Table 7. R² and Adjusted R² Across Sequential Models
Model	Predictors	R2	Adj. R²
Mo1: Physical health only	1	0.0197	0.0195
Mo2: + Sleep	2	0.0504	0.0498
Mo3: Full (+ age, income, sex, exercise)	6	0.1569	0.1559

# b. (5 pts)** What is the Root MSE for Model C? Interpret it in practical terms — what does it tell you about prediction accuracy?

rmse_mo3 <- round(summary(mo3)$sigma, 2)
cat("Root MSE (Model 3):", rmse_mo3, "mentally unhealthy days\n")

## Root MSE (Model 3): 7.09 mentally unhealthy days

# c. (10 pts)** Using the ANOVA output for Model C, fill in the following table manually (i.e., compute the values using the output from `anova()` or `glance()`):State the null hypothesis for the overall F-test and your conclusion.

#anova()
anova_mo3 <- anova(mo3)
print(anova_mo3)

## Analysis of Variance Table
## 
## Response: menthlth_days
##                 Df Sum Sq Mean Sq  F value    Pr(>F)    
## sleep_hrs        1   5865  5864.8 116.6678 < 2.2e-16 ***
## age              1   6182  6182.2 122.9832 < 2.2e-16 ***
## sex              1   2947  2947.1  58.6266 2.274e-14 ***
## physhlth_days    1  29456 29455.5 585.9585 < 2.2e-16 ***
## income_cat       1   2177  2176.8  43.3031 5.169e-11 ***
## exercise         1     92    92.1   1.8326    0.1759    
## Residuals     4993 250993    50.3                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# glance()
glance(mo3) %>%
  select(statistic, p.value, df, df.residual, nobs) %>%
  mutate(across(where(is.numeric), ~ round(., 4))) %>%
  pivot_longer(everything(), names_to = "Statistic", values_to = "Value") %>%
  mutate(Statistic = dplyr::recode(Statistic,
    "statistic" = "F-statistic",
    "p.value"   = "p-value (overall F-test)",
    "df"        = "Model df (p)",
    "df.residual"   = "Residual df (n − p − 1)",
    "nobs"      = "n (observations)"
  )) %>%
  kable(caption = "Table 6. Overall Model Summary — Model 3") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Table 6. Overall Model Summary — Model 3
Statistic	Value
F-statistic	154.8953
p-value (overall F-test)	0.0000
Model df (p)	6.0000
Residual df (n − p − 1)	4993.0000
n (observations)	5000.0000

ANSWERED QUESTIONS

Source	df	SS	MS	F
Model	6	46,719	7,791.2336	154.8953
Residual	4,993	250,993	50.3
Total	4,499	297,712

The r-squared for the first model (menthlth_days ~ sleep_hrs) is 0.0197 and the adjusted r-squared is 0.0195. The r-squared for the second model (menthlth_days ~ sleep_hrs + sleep_hrs) is 0.0504 and the adjusted r-sqaured is 0.0498. Finally, the r-squared for the third model (menthlth_days ~ sleep_hrs + age + sex + physhlth_days + income_cat + exercise) is 0.1569 and the adjusted r-squared is 0.1559. The third model explains most of the variance in mental health days as it has the largest r-squared and adjusted r-squared values in comparison to the first two models. However, when comparing the adjusted r-squared values, all three models explain relatively little variance overall (model 1 = 2%, model 2 = 5%, model 3 = 16%).
The Root MSE for Model C is 7.09 mentally unhealthy days. In practical terms, this indicates that the model’s predicted number of mentally unhealthy days differs from the observed value by about 7 days on average.

The null hypothesis for the overall F-test is that none of the predictors matter (𝐻0:𝛽1=𝛽2=⋯=𝛽𝑝=0). My conclusion for the overall F-test is that the F-statistic (F = 154.8953) is statistically significant since (p = 0). In other words, we can reject the null hypothesis since a large F-value (F = 154.8953) and small p-value (p = 0) means that the predictors collectively explain more variation than expected by chance.

Task 5: Residual Diagnostics (15 points)

# a. (5 pts)** For Model C, produce the four standard diagnostic plots (Residuals vs. Fitted, Normal Q-Q, Scale-Location, Cook's Distance). Comment on what each plot tells you about the LINE assumptions.
par(mfrow = c(2, 2))
plot(mo3, which = 1:4, col = adjustcolor("steelblue", alpha.f = 0.3), pch = 16)

# b. (5 pts)** Given the nature of the outcome (mental health days, bounded at 0 and 30, heavily right-skewed), which assumptions are most likely to be violated? Does this invalidate the analysis? Explain.

# c. (5 pts)** Identify any observations with Cook's Distance > 1. How many are there? What would you do with them in a real analysis?

# Cook's distance
n <- nobs(mo3)
augmented <- augment(mo3) %>%
  mutate(
    obs_num = row_number(),
    cooks_d = cooks.distance(mo3),
    influential = ifelse(cooks_d > 1, "Potentially influential", "Not influential")
  )

n_influential <- sum(augmented$cooks_d > 1)

p_cooks <- ggplot(augmented, aes(x = obs_num, y = cooks_d, color = influential)) +
  geom_point(alpha = 0.6, size = 1.2) +
  geom_hline(yintercept = 1, linetype = "dashed",
             color = "red", linewidth = 1) +
  scale_color_manual(values = c("Potentially influential" = "tomato",
                                "Not influential" = "steelblue")) +
  labs(
    title = "Cook's Distance",
    subtitle = paste0("Dashed line = D > 1 threshold | ",
                      n_influential, " potentially influential observations"),
    x = "Observation Number",
    y = "Cook's Distance",
    color = ""
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "top")

p_cooks

ANSWERED QUESTIONS

The Residuals vs Fitted plot does not show complete random scatter around zero, and the red line is not perfectly horizontal, suggesting potential non-linearity in the relationship. However, there appears to be no clear “fan shape” which indicates that heteroscedasticity may not be severe. The Q-Q residuals plot indicates that the residuals deviate from the reference line, suggesting that they are not normally distributed and may be right-skewed.The scale-location plot which is another equal variance check (homoscedasticity), is not flat, indicating some evidence of non-constant variance. Finally, the cook’s distance plot suggests the presence of potentially influential observations.
Given the nature of the outcome (mental health days, bounded at 0 and 30, heavily right-skewed), the normality assumption is most likely to be violated as the outcome is heavily right-skewed indicating that the residuals are not approximately normally distributed. The equal variance assumption (homoscedasticity) is also most likely to be violated since the outcome cannot go below 0-30 since this variable measures the number of unhealthy mental health days WITHIN THE LAST 30 DAYS ONLY. However, the violation of these assumptions do not necessarily mean the analysis is invalid. This is because with large sample sizes such as this, the Central Limit Theorem means mild violations cause minimal bias and homoscedasticity can withstand mild departures.
When Cook’s Distance > 1, there are 0 potentially influential observations. In a real analysis, if I had one, I would first investigate whether these observations are legitimate data points or a result from data entry or measurement error problems. If these points are determined to be legitimate but extreme, I would use robust regression techniques or assess how removing these points would affect the model results.

Task 6: Interpretation for a Public Health Audience (10 points)

Suppose you are writing the results section of a public health paper. Write a 3–4 sentence paragraph summarizing the findings from Model C for a non-statistical audience. Your paragraph should:

Identify which predictors were significantly associated with mental health days
State the direction and approximate magnitude of the most important associations
Appropriately caveat the cross-sectional nature of the data (no causal language)
Not use any statistical jargon (no “significant,” “coefficient,” “p-value”)

ANSWERED QUESTION

The number of hours of sleep per night, age, sex, number of poor physical health days, and income category were associated with the number of mentally unhealthy days reported in the past 30 days. Two notable associations were observed. First, for each additional hour of sleep per night, there was an estimated decrease of about half a mentally unhealthy day on average. Second, compared to males, females reported an estimated 1.2 more mentally unhealthy days on average in the past 30 days. However, because this data is from a cross-sectional study, or a study that collects data at a single point in time, we cannot say that these factors directly cause the number of mentally unhealthy days.

End of Lab Activity

Multiple Linear Regression

EPI 553 — Principles of Statistical Inference II (Spring 2026)

Morgan Wheat

March 3, 5, and 10, 2026

Introduction

Setup and Data

Part 2: In-Class Lab Activity

Instructions

Data for the Lab

Task 1: Exploratory Data Analysis (15 points)

Task 2: Unadjusted (Simple) Linear Regression (15 points)

Task 3: Building the Multivariable Model (25 points)

Task 4: Model Fit and ANOVA (20 points)

Task 5: Residual Diagnostics (15 points)

Task 6: Interpretation for a Public Health Audience (10 points)