Lab Activity

Instructions

In this lab, you will build and interpret multiple linear regression models using the Behavioral Risk Factor Surveillance System (BRFSS) 2020 analytic dataset. Work through each task systematically. You may discuss concepts with classmates, but your written answers and R code must be your own.

Data for the Lab

Use the saved analytic dataset containing 5,000 randomly sampled BRFSS 2020 respondents with the following variables:

Variable	Description	Type
`menthlth_days`	Mentally unhealthy days in past 30	Continuous (0–30)
`physhlth_days`	Physically unhealthy days in past 30	Continuous (0–30)
`sleep_hrs`	Sleep hours per night	Continuous (1–14)
`age`	Age in years (capped at 80)	Continuous
`income_cat`	Household income (1 = <$10k to 8 = >$75k)	Ordinal
`sex`	Sex (Male/Female)	Factor
`exercise`	Any physical activity past 30 days (Yes/No)	Factor

Setup and Data

library(tidyverse)
library(haven)
library(janitor)
library(knitr)
library(kableExtra)
library(plotly)
library(broom)
library(ggeffects)
library(gtsummary)
library(GGally)
library(car)
library(lmtest)
library(corrplot)

brfss_full <- read_xpt(
  "/Users/nataliasmall/Downloads/EPI 553/LLCP2020.XPT "
) %>%
  clean_names()

# ── Recode and clean ──────────────────────────────────────────────────────────
brfss_mlr <- brfss_full %>%
  mutate(
    # Outcome: mentally unhealthy days in past 30 (88 = none = 0; 77/99 = DK/refused = NA)
    menthlth_days = case_when(
      menthlth == 88              ~ 0,
      menthlth >= 1 & menthlth <= 30 ~ as.numeric(menthlth),
      TRUE                        ~ NA_real_
    ),
    # Physical health days (key predictor)
    physhlth_days = case_when(
      physhlth == 88              ~ 0,
      physhlth >= 1 & physhlth <= 30 ~ as.numeric(physhlth),
      TRUE                        ~ NA_real_
    ),
    # Sleep hours (practical cap at 14)
    sleep_hrs = case_when(
      sleptim1 >= 1 & sleptim1 <= 14 ~ as.numeric(sleptim1),
      TRUE                        ~ NA_real_
    ),
    # Age (capped at 80 per BRFSS coding)
    age = age80,
    # Income category (ordinal 1–8)
    income_cat = case_when(
      income2 %in% 1:8 ~ as.numeric(income2),
      TRUE             ~ NA_real_
    ),
    # Sex
    sex = factor(sexvar, levels = c(1, 2), labels = c("Male", "Female")),
    # Exercise in past 30 days (any physical activity)
    exercise = factor(case_when(
      exerany2 == 1 ~ "Yes",
      exerany2 == 2 ~ "No",
      TRUE          ~ NA_character_
    ), levels = c("No", "Yes")),
    # BMI (stored as integer × 100 in BRFSS)
    bmi = ifelse(bmi5 > 0, bmi5 / 100, NA_real_),
    # Income as labeled factor (for display)
    income_f = factor(income2, levels = 1:8,
      labels = c("<$10k", "$10-15k", "$15-20k", "$20-25k",
                 "$25-35k", "$35-50k", "$50-75k", ">$75k"))
  ) %>%
  filter(
    !is.na(menthlth_days),
    !is.na(physhlth_days),
    !is.na(sleep_hrs),
    !is.na(age),   age >= 18,
    !is.na(income_cat),
    !is.na(sex),
    !is.na(exercise)
  )

# ── Analytic sample (reproducible random sample) ────────────────────────────
set.seed(553)
brfss_mlr <- brfss_mlr %>%
  select(menthlth_days, physhlth_days, sleep_hrs, age,
         income_cat, income_f, sex, exercise, bmi) %>%
  slice_sample(n = 5000)

# Save for lab activity
saveRDS(brfss_mlr,
  "/Users/nataliasmall/Downloads/EPI 553/brfss_mlr_2020.rds")

tibble(Metric = c("Observations", "Variables"),
       Value  = c(nrow(brfss_mlr), ncol(brfss_mlr))) %>%
  kable(caption = "Analytic Dataset Dimensions") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Analytic Dataset Dimensions
Metric	Value
Observations	5000
Variables	9

Task 1: Exploratory Data Analysis (15 points)

1a. (5 pts) Create a descriptive statistics table using tbl_summary() that includes all variables in the dataset. Include means (SD) for continuous variables and n (%) for categorical variables.

brfss_mlr %>%
  select(menthlth_days, physhlth_days, sleep_hrs, age,
         income_cat, income_f, sex, exercise, bmi) %>%
  tbl_summary(
    label = list(
      menthlth_days ~ "Mentally unhealthy days (past 30)",
      physhlth_days ~ "Physically unhealthy days (past 30)",
      sleep_hrs     ~ "Sleep (hours/night)",
      age           ~ "Age (years)",
      income_cat    ~ "Income Category",
      income_f      ~ "Household income",
      sex           ~ "Sex",
      exercise      ~ "Any physical activity (past 30 days)",
      bmi           ~ "BMI (kg/m²)"
    ),
    statistic = list(
      all_continuous()  ~ "{mean} ({sd})",
      all_categorical() ~ "{n} ({p}%)"
    ),
    digits = all_continuous() ~ 1,
    missing = "no"
  ) %>%
  add_n() %>%
  bold_labels() %>%
  modify_caption("**Table 1. Descriptive Statistics — BRFSS 2020 Analytic Sample (n = 5,000)**")

**Table 1. Descriptive Statistics — BRFSS 2020 Analytic Sample (n = 5,000)**
Characteristic	N	N = 5,000¹
Mentally unhealthy days (past 30)	5,000	3.8 (7.7)
Physically unhealthy days (past 30)	5,000	3.3 (7.8)
Sleep (hours/night)	5,000	7.1 (1.3)
Age (years)	5,000	54.3 (17.2)
Income Category	5,000
1		190 (3.8%)
2		169 (3.4%)
3		312 (6.2%)
4		434 (8.7%)
5		489 (9.8%)
6		683 (14%)
7		841 (17%)
8		1,882 (38%)
Household income	5,000
<$10k		190 (3.8%)
$10-15k		169 (3.4%)
$15-20k		312 (6.2%)
$20-25k		434 (8.7%)
$25-35k		489 (9.8%)
$35-50k		683 (14%)
$50-75k		841 (17%)
>$75k		1,882 (38%)
Sex	5,000
Male		2,331 (47%)
Female		2,669 (53%)
Any physical activity (past 30 days)	5,000	3,874 (77%)
BMI (kg/m²)	4,706	28.4 (6.4)
¹ Mean (SD); n (%)

1b. (5 pts) Create a histogram of menthlth_days. Describe the shape of the distribution. Is it symmetric, right-skewed, or left-skewed? What are the implications of this shape for regression modeling?

p_hist <- ggplot(brfss_mlr, aes(x = menthlth_days)) +
  geom_histogram(binwidth = 1, fill = "steelblue", color = "white", alpha = 0.85) +
  labs(
    title    = "Distribution of Mentally Unhealthy Days in the Past 30 Days",
    subtitle = "BRFSS 2020 Analytic Sample (n = 5,000)",
    x        = "Number of Mentally Unhealthy Days",
    y        = "Count"
  ) +
  theme_minimal(base_size = 13)

ggplotly(p_hist)

Distribution of Mentally Unhealthy Days (BRFSS 2020)

The distribution is heavily right skewed. Most respondents report zero mentally unhealthy days, with a long tail toward 30. Implications of right skewed shape: Since OLS regression assumes normally distributed residuals and our we have a large sample size (n = 5,000) providing robustness, we should consider a log transformation or Poisson/negative binomial regression as alternatives.

1c. (5 pts) Create a scatterplot matrix (using GGally::ggpairs() or similar) for the continuous variables: menthlth_days, physhlth_days, sleep_hrs, and age. Comment on the direction and strength of each pairwise correlation with the outcome.

# Pairs plot of continuous predictors vs outcome
brfss_mlr %>%
  select(menthlth_days, physhlth_days, sleep_hrs, age) %>%
  rename(
    `Mental Health\nDays`  = menthlth_days,
    `Physical Health\nDays` = physhlth_days,
    `Sleep\n(hrs)`         = sleep_hrs,
    Age                    = age
  ) %>%
  ggpairs(
    lower  = list(continuous = wrap("points", alpha = 0.05, size = 0.5)),
    diag   = list(continuous = wrap("densityDiag", fill = "steelblue", alpha = 0.5)),
    upper  = list(continuous = wrap("cor", size = 4)),
    title  = "Pairwise Relationships Among Key Variables (BRFSS 2020)"
  ) +
  theme_minimal(base_size = 11)

Create a scatterplot matrix for continuous variables

Mental Health Days and Physical Health Days: r = 0.315, weak positive correlation, positive in direction. This pairwise correlation is the strongest among all pairs. As physical health days increase, mental health days tend to increase.
Mental Health Days and Sleep: r = -0.14, weak negative correlation, negative in direction. More sleep is associated with fewer mentally unhealthy days.
Mental Health Days and Age: r = -0.156, weak negative correlation, negative in direction. In this sample, older age is slightly associated with fewer mentally unhealthy days.

Task 2: Unadjusted (Simple) Linear Regression (15 points)

2a. (5 pts) Fit a simple linear regression model regressing menthlth_days on sleep_hrs alone. Write out the fitted regression equation.

# (a) Fit SLR: menthlth_days ~ sleep_hrs
model_slr <- lm(menthlth_days ~ sleep_hrs, data = brfss_mlr)
summary(model_slr)

## 
## Call:
## lm(formula = menthlth_days ~ sleep_hrs, data = brfss_mlr)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.670 -3.845 -3.040 -0.040 31.785 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  9.47429    0.57712   16.42   <2e-16 ***
## sleep_hrs   -0.80424    0.08025  -10.02   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.642 on 4998 degrees of freedom
## Multiple R-squared:  0.0197, Adjusted R-squared:  0.0195 
## F-statistic: 100.4 on 1 and 4998 DF,  p-value: < 2.2e-16

# Tidy coefficient table
tidy(model_slr, conf.int = TRUE) %>%
  mutate(across(where(is.numeric), ~ round(., 4))) %>%
  kable(
    caption = "Simple Linear Regression: menthlth_days ~ sleep_hrs (BRFSS 2020)",
    col.names = c("Term", "Estimate", "Std. Error", "t-statistic",
                  "p-value", "95% CI Lower", "95% CI Upper"),
    align = "lrrrrrrr"
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(0, bold = TRUE)

Simple Linear Regression: menthlth_days ~ sleep_hrs (BRFSS 2020)
Term	Estimate	Std. Error	t-statistic	p-value	95% CI Lower	95% CI Upper
(Intercept)	9.4743	0.5771	16.4165	0	8.3429	10.6057
sleep_hrs	-0.8042	0.0802	-10.0218	0	-0.9616	-0.6469

Fitted regression equation: menthlth_days = 9.4743 + −0.8042 × sleep_hrs

2b. (5 pts) Interpret the slope for sleep in a single sentence appropriate for a public health audience (no statistical jargon).

On average, for every 1 hour increase in sleep a person gets, it is assumed that they will experience about 1 fewer mentally unhealthy day per month.

2c. (5 pts) State the null and alternative hypotheses for the slope, report the t-statistic and p-value, and state your conclusion. What is the degree of freedom for this test?

df: 4998
null hypothesis: 𝐻0:𝛽1=0 (no linear relationship between menthlth_days (Y) and sleep_hrs (X))
alternative hypothesis: 𝐻𝐴:𝛽1≠0 (there is a linear relationship between menthlth_days (Y) and sleep_hrs (X))
test statistic: -10.0218
p-value: <2e-16 (< 0.05)
conclusion.: since p-value < 0.05, we reject the null hypothesis. There is a statistically significant negative association between sleep hours and mentally unhealthy days, indicating that more sleep is linked to fewer unhealthy days.

Task 3: Building the Multivariable Model (25 points)

3a. (5 pts) Fit three models:

Model A: menthlth_days ~ sleep_hrs
Model B: menthlth_days ~ sleep_hrs + age + sex
Model C: menthlth_days ~ sleep_hrs + age + sex + physhlth_days + income_cat + exercise

# Model A: Unadjusted
mA <- lm(menthlth_days ~ sleep_hrs, data = brfss_mlr)

# Model B: Add age and sex
mB <- lm(menthlth_days ~ sleep_hrs + age + sex, data = brfss_mlr)

# Model C: Full multivariable model
mC <- lm(menthlth_days ~ sleep_hrs + age + sex + physhlth_days + income_cat + exercise,
         data = brfss_mlr)

3b. (10 pts) Create a table comparing the sleep coefficient ($\hat{\beta}$, SE, 95% CI, p-value) across Models A, B, and C. Does the sleep coefficient change substantially when you add more covariates? What does this suggest about confounding?

# Compare the sleep_hrs coefficient across models
tribble(
  ~Model, ~`Mental Health Days β̂`, ~`SE` ,  ~`95% CI`, ~`p-value`,
  "MA (unadjusted)", round(coef(mA)[2], 3),
    round(summary(mA)$coefficients[2,2], 3),
    paste0("(", round(confint(mA)[2,1],3), ", ", round(confint(mA)[2,2],3), ")"),
    round(summary(mA)$coefficients[2,4], 3),
  
  "MB (+age+sex)",   round(coef(mB)[2], 3),
    round(summary(mB)$coefficients[2,2], 3),
    paste0("(", round(confint(mB)[2,1],3), ", ", round(confint(mB)[2,2],3), ")"),
    round(summary(mB)$coefficients[2,4], 3),
  
  "MC (full)",       round(coef(mC)[2], 3),
    round(summary(mC)$coefficients[2,2], 3),
    paste0("(", round(confint(mC)[2,1],3), ", ", round(confint(mC)[2,2],3), ")"),
    round(summary(mC)$coefficients[2,4], 3)
) %>%
  kable(caption = "Table 3. Sleep Coefficient Across Sequential Models") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(0, bold = TRUE)

Table 3. Sleep Coefficient Across Sequential Models
Model	Mental Health Days β	S
MA (unadjusted)	-0.804	0.080	(-0.962, -0.647)
MB (+age+sex)	-0.734	0.079	(-0.889, -0.578)
MC (full)	-0.509	0.075	(-0.657, -0.361)

When more covariates are added the sleep coefficient does not appear to change substantially. The coefficient moves from -0.804 in Model A to -0.509 in Model C. This suggest that while some of the relationship between sleep and mental health is explained by other confounding factors, sleep remains an independent predictor, and confounding by these specific variables is not overwhelming.

3c. (10 pts) For Model C, write out the full fitted regression equation and interpret every coefficient in plain language appropriate for a public health report.

summary(mC)

## 
## Call:
## lm(formula = menthlth_days ~ sleep_hrs + age + sex + physhlth_days + 
##     income_cat + exercise, data = brfss_mlr)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.9192  -3.4262  -1.7803   0.2948  30.0568 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   12.475489   0.716959  17.401  < 2e-16 ***
## sleep_hrs     -0.509160   0.075348  -6.757 1.57e-11 ***
## age           -0.082307   0.005933 -13.872  < 2e-16 ***
## sexFemale      1.245053   0.202333   6.153 8.17e-10 ***
## physhlth_days  0.291657   0.013579  21.478  < 2e-16 ***
## income_cat    -0.321323   0.052012  -6.178 7.02e-10 ***
## exerciseYes   -0.342685   0.253138  -1.354    0.176    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.09 on 4993 degrees of freedom
## Multiple R-squared:  0.1569, Adjusted R-squared:  0.1559 
## F-statistic: 154.9 on 6 and 4993 DF,  p-value: < 2.2e-16

tidy(mC, conf.int = TRUE) %>%
  mutate(
    term     = dplyr::recode(term,
      "(Intercept)" = "Intercept",
      "menthlth_days" = "Poor  mental health days",
      "sleep_hrs"     = "Sleep (hours/night)",
      "age"           = "Age (years)",
      "physhlth_days" = "Physical unealthy days",
      "income_cat"    = "Income (ordinal 1-8)",
      "sexFemale"     = "Sex: Female (ref = Male)",
      "exerciseYes"   = "Exercise: Yes (ref = No)"
    ),
    across(where(is.numeric), ~ round(., 4))
  ) %>%
  kable(
    caption  = "Table 3. Multiple Linear Regression: Mentally Unhealthy Days ~ Multiple Predictors (BRFSS 2020, n = 5,000)",
    col.names = c("Term", "Estimate (β̂)", "Std. Error", "t-statistic",
                  "p-value", "95% CI Lower", "95% CI Upper")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(0, bold = TRUE) %>%
  row_spec(c(2, 3), background = "#EBF5FB")  # highlight key predictors

Table 3. Multiple Linear Regression: Mentally Unhealthy Days ~ Multiple Predictors (BRFSS 2020, n = 5,000)
Term	Estimate (β̂	Std. Erro
Intercept	12.4755	0.7170	17.4006	0.0000	11.0699	13.8810
Sleep (hours/night)	-0.5092	0.0753	-6.7574	0.0000	-0.6569	-0.3614
Age (years)	-0.0823	0.0059	-13.8724	0.0000	-0.0939	-0.0707
Sex: Female (ref = Male)	1.2451	0.2023	6.1535	0.0000	0.8484	1.6417
Physical unealthy days	0.2917	0.0136	21.4779	0.0000	0.2650	0.3183
Income (ordinal 1-8)	-0.3213	0.0520	-6.1778	0.0000	-0.4233	-0.2194
Exercise: Yes (ref = No)	-0.3427	0.2531	-1.3537	0.1759	-0.8389	0.1536

regression equation: Mental Health Daysˆ= 12.475 + -0.5092(Sleep hrs) + -0.0823(Age) + 1.2451(Sex) + 0.2917(phys. health days) + -0.3213(income) + -0.3427(Exercise: Yes)

Interpretation: Sleep hrs (𝛽̂= -0.5092): Each additional hour of sleep per night is associated with an estimated 0.509 fewer mentally unhealthy days on average, adjusting for all other covariates (95% CI: -0.657 to -0.361).

Age (𝛽̂ = -0.0823):Each additional year of age is associated with 0.082 fewer mentally unhealthy days on average (holding all else constant). This counterintuitive finding is well-documented — older adults often report fewer mental health difficulties, possibly due to better emotion regulation, survivor bias, or cohort effects.

Sex:Feamle (𝛽̂= 1.2451): Compared to males (the reference group), females report an estimated 1.245 more mentally unhealthy days on average, holding all other variables constant.

Physical health days (𝛽̂ = 0.2917): Each additional day of poor physical health is associated with an estimated 0.292 additional mentally unhealthy day on average, holding sleep, age, income, sex, and exercise constant (95% CI: 0.265 to 0.318). This is the strongest predictor in the model and is consistent with the bidirectional mind-body connection documented in the literature.

Income category (𝛽̂ = -0.3213): Each one-unit increase in the income category (on the 1-8 ordinal scale) is associated with 0.3213 fewer mentally unhealthy days on average, consistent with the well-established socioeconomic gradient in mental health.

Exercise: Yes (𝛽̂ = -0.3427): People who engaged in any physical activity in the past 30 days report an estimated 0.343 fewer mentally unhealthy days compared to those who did not exercise, adjusting for all other covariates.

Intercept (𝛽̂0= 12.475): The estimated mean mental health days for a person with zero physically unhealthy days, zero sleep hours, age = 0, income category = 0, who is male and does not exercise. This is a mathematical artifact - not a meaningful quantity in context. Note that p-value is 0.176.

Task 4: Model Fit and ANOVA (20 points)

4a. (5 pts) Report $R^2$ and Adjusted $R^2$ for each of the three models (A, B, C). Create a table. Which model explains the most variance in mental health days?

tribble(
  ~Model, ~Predictors, ~R2, ~`Adj. R²`,
  "MA: menthlth_days ~ sleep_hrs",      1, round(summary(mA)$r.squared, 4), round(summary(mA)$adj.r.squared, 4),
  "MB: + age + sex",               2, round(summary(mB)$r.squared, 4), round(summary(mB)$adj.r.squared, 4),
  "MC: Full (+ age + sex + physhlth_days + income_cat + exercise)", 6,
    round(summary(mC)$r.squared, 4), round(summary(mC)$adj.r.squared, 4)
) %>%
  kable(caption = "Table 4. R² and Adjusted R² Across Sequential Models") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(3, bold = TRUE, background = "#EBF5FB")

Table 4. R² and Adjusted R² Across Sequential Models
Model	Predictors	R2	Adj. R²
MA: menthlth_days ~ sleep_hrs	1	0.0197	0.0195
MB: + age + sex	2	0.0504	0.0498
MC: Full (+ age + sex + physhlth_days + income_cat + exercise)	6	0.1569	0.1559

Model A R²:0.0197 Model A adj.R²:0.0195 Model B R²:0.0504 Model B adj.R²:0.0498 Model C R²:0.1569 Model C adj.R²:0.1559

Model C explains the most variance in mental health days, having the highest R² and adjusted R² values.

4b. (5 pts) What is the Root MSE for Model C? Interpret it in practical terms — what does it tell you about prediction accuracy?

rmse_mC <- round(summary(mC)$sigma, 2)
cat("Root MSE (Model 3):", rmse_mC, "mentally unhealthy days\n")

## Root MSE (Model 3): 7.09 mentally unhealthy days

The root MSE for Model C is 7.09 mentally unhealthy days. It tells you, on average, how many days the model’s prediction is off by. On average, the model’s prediction of mentally unhealthy days deviates from the actual reported value by about 7.1 days.

4c. (10 pts) Using the ANOVA output for Model C, fill in the following table manually (i.e., compute the values using the output from anova() or glance()):

anova_mC <- anova(mC)
print(anova_mC)

## Analysis of Variance Table
## 
## Response: menthlth_days
##                 Df Sum Sq Mean Sq  F value    Pr(>F)    
## sleep_hrs        1   5865  5864.8 116.6678 < 2.2e-16 ***
## age              1   6182  6182.2 122.9832 < 2.2e-16 ***
## sex              1   2947  2947.1  58.6266 2.274e-14 ***
## physhlth_days    1  29456 29455.5 585.9585 < 2.2e-16 ***
## income_cat       1   2177  2176.8  43.3031 5.169e-11 ***
## exercise         1     92    92.1   1.8326    0.1759    
## Residuals     4993 250993    50.3                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

glance(mC) %>%
  select(r.squared, adj.r.squared, sigma, statistic, p.value, df, df.residual, nobs) %>%
  mutate(across(where(is.numeric), ~ round(., 4))) %>%
  pivot_longer(everything(), names_to = "Statistic", values_to = "Value") %>%
  mutate(Statistic = dplyr::recode(Statistic,
    "r.squared"     = "R²",
    "adj.r.squared" = "Adjusted R²",
    "sigma"         = "Residual Std. Error (Root MSE)",
    "statistic"     = "F-statistic",
    "p.value"       = "p-value (overall F-test)",
    "df"            = "Model df (p)",
    "df.residual"   = "Residual df (n − p − 1)",
    "nobs"          = "n (observations)"
  )) %>%
  kable(caption = "Table 4c. Overall Model Summary — Model C") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Table 4c. Overall Model Summary — Model C
Statistic	Value
R²	0.1569
Adjusted R²	0.1559
Residual Std. Error (Root MSE)	7.0901
F-statistic	154.8953
p-value (overall F-test)	0.0000
Model df (p)	6.0000
Residual df (n − p − 1)	4993.0000
n (observations)	5000.0000

Source	df	SS	MS	F
Model	6	46,780.2	7,796.7	154.9
Residual	4993	250,993.0	50.3
Total	4999	298,016.5

State the null hypothesis for the overall F-test and your conclusion. Null hypotheses: All coefficients in model C = 0 Alternative hypotheses: At least one coefficient in model C ≠ 0 Conclusion: There is strong statistical evidence that at least one predictor is associated with the number of mentally unhealthy days. Since p-value < 0.05, we reject the null hypothesis.

Task 5: Residual Diagnostics (15 points)

5a. (5 pts) For Model C, produce the four standard diagnostic plots (Residuals vs. Fitted, Normal Q-Q, Scale-Location, Cook’s Distance). Comment on what each plot tells you about the LINE assumptions.

par(mfrow = c(2, 2))
plot(mC, which = 1:4, col = adjustcolor("steelblue", alpha.f = 0.3), pch = 16)

Standard Residual Diagnostic Plots

par(mfrow = c(1, 1))

Residuals vs Fitted: The plot shows a distinct downward trend. The red line is starts out relatively flat but gradually deviates upward. It can be assumed the model isn’t capturing the structure of the data perfectly and assumptions of linearity is not satisfied. Q-Q Residuals: The points deviate significantly from the straight dashed line at both tails. Residuals are not normally distributed, confirming that the Normality assumption is violated. Scale-Location: The points form a V shape, indicating that points may not be spread randomly. As fitted values increase so does the red line, indicating heteroscedasticity. This confirms Equal variance assumption is violated. Cook’s Distance: Few data points can been seen straying away from the cook’s distance, so we can say there are some potentially influential observations. There aren’t enough to severely distort the model.

5b. (5 pts) Given the nature of the outcome (mental health days, bounded at 0 and 30, heavily right-skewed), which assumptions are most likely to be violated? Does this invalidate the analysis? Explain.

Assumptions of normality is most likely to be violated. However, this does not necessarily invalidate the analysis. Since our sample size is very large, n = 5000, the Central Limit Theorem helps ensure that coefficient estimates are still unbiased, despite violation. Transformation using log or negative binomial regression could be considered for a more formal analysis.

5c. (5 pts) Identify any observations with Cook’s Distance > 1. How many are there? What would you do with them in a real analysis?

There are 0 observations with Cook’s Distance > 1, as the y-axis only goes up to 0.014. In a real analysis, these observations would require no further action. But, if there were observations with Cook’s Distance > 1, in a real analysis you would most likely double check that the point was not a data entry error.

Task 6: Interpretation for a Public Health Audience (10 points)

Suppose you are writing the results section of a public health paper. Write a 3–4 sentence paragraph summarizing the findings from Model C for a non-statistical audience. Your paragraph should:

Identify which predictors were significantly associated with mental health days
State the direction and approximate magnitude of the most important associations
Appropriately caveat the cross-sectional nature of the data (no causal language)
Not use any statistical jargon (no “significant,” “coefficient,” “p-value”)

In this sample of 5,000 U.S. adults, sleep and physical health were the two predictors most associated with the number of days individuals experience poor mental health. Getting more sleep and reporting fewer days of physical illness were both linked to better mental health. On average, women reported 1.25 more mentally unhealthy days than men, adults with lower incomes reported worse mental health compared to those of higher income groups, and older adults appeared to report fewer mentally unhealthy days than younger adults. Since this study represents a single point in time, findings highlight important connections made during the identified period rather than proving that one factor directly causes another.

End of Lab Activity

Multiple Linear Regression

EPI 553 — Principles of Statistical Inference II (Spring 2026)

Natalia Small

March 3, 5, and 10, 2026