In-Class Lab Activity

EPI 553 — Multiple Linear Regression Lab

Data for the Lab

Use the saved analytic dataset from today’s lecture. It contains 5,000 randomly sampled BRFSS 2020 respondents with the following variables:

Variable Description Type
menthlth_days Mentally unhealthy days in past 30 Continuous (0–30)
physhlth_days Physically unhealthy days in past 30 Continuous (0–30)
sleep_hrs Sleep hours per night Continuous (1–14)
age Age in years (capped at 80) Continuous
income_cat Household income (1 = <$10k to 8 = >$75k) Ordinal
sex Sex (Male/Female) Factor
exercise Any physical activity past 30 days (Yes/No) Factor
# Load the dataset
library(tidyverse)
library(haven)
library(janitor)
library(knitr)
library(kableExtra)
library(plotly)
library(broom)
library(ggeffects)
library(gtsummary)
library(GGally)
library(car)
library(lmtest)
library(corrplot)
library(dplyr)
library(ggplot2)


brfss_mlr <- readRDS(
  "C:/Users/graci/OneDrive/Documents/UA GRAD SCHOOL/2nd Semester/EPI553/BRFSS_mlr_2020.rds"
)

Task 1: Exploratory Data Analysis (15 points)

1a. (5 pts) Create a descriptive statistics table using tbl_summary() that includes all variables in the dataset. Include means (SD) for continuous variables and n (%) for categorical variables.

brfss_mlr %>%
  select(menthlth_days, physhlth_days, sleep_hrs, age,
         income_f, sex, exercise, bmi) %>%
  tbl_summary(
    label = list(
      menthlth_days ~ "Mentally unhealthy days (past 30)",
      physhlth_days ~ "Physically unhealthy days (past 30)",
      sleep_hrs     ~ "Sleep (hours/night)",
      age           ~ "Age (years)",
      sex           ~ "Sex",
      exercise      ~ "Any physical activity (past 30 days)",
      bmi           ~ "Body mass index",
      income_f      ~ "Household income by income level"
    ),
    statistic = list(
      all_continuous()  ~ "{mean} ({sd})",
      all_categorical() ~ "{n} ({p}%)"
    ),
    digits = all_continuous() ~ 1,
    missing = "no"
  ) %>%
  add_n() %>%
  bold_labels() %>%
  modify_caption("**Table 1. Descriptive Statistics — BRFSS 2020 Analytic Sample (n = 5,000)**")
Table 1. Descriptive Statistics — BRFSS 2020 Analytic Sample (n = 5,000)
Characteristic N N = 5,0001
Mentally unhealthy days (past 30) 5,000 3.8 (7.7)
Physically unhealthy days (past 30) 5,000 3.3 (7.8)
Sleep (hours/night) 5,000 7.1 (1.3)
Age (years) 5,000 54.3 (17.2)
Household income by income level 5,000
    <$10k
190 (3.8%)
    $10-15k
169 (3.4%)
    $15-20k
312 (6.2%)
    $20-25k
434 (8.7%)
    $25-35k
489 (9.8%)
    $35-50k
683 (14%)
    $50-75k
841 (17%)
    >$75k
1,882 (38%)
Sex 5,000
    Male
2,331 (47%)
    Female
2,669 (53%)
Any physical activity (past 30 days) 5,000 3,874 (77%)
Body mass index 4,706 28.4 (6.4)
1 Mean (SD); n (%)

1b. (5 pts) Create a histogram of menthlth_days. Describe the shape of the distribution. Is it symmetric, right-skewed, or left-skewed? What are the implications of this shape for regression modeling?

p_hist <- ggplot(brfss_mlr, aes(x = menthlth_days)) +
  geom_histogram(binwidth = 1, fill = "steelblue", color = "white", alpha = 0.85) +
  labs(
    title    = "Distribution of Mentally Unhealthy Days in the Past 30 Days",
    subtitle = "BRFSS 2020 Analytic Sample (n = 5,000)",
    x        = "Number of Mentally Unhealthy Days",
    y        = "Count"
  ) +
  theme_minimal(base_size = 13)

ggplotly(p_hist)

Description of 1b: The distribution of mentally unhealthy days is strongly right skewed. This skewness suggests the outcome is not normally distributed, which may affect regression assumptions such as normality of residuals.

1c. (5 pts) Create a scatterplot matrix (using GGally::ggpairs() or similar) for the continuous variables: menthlth_days, physhlth_days, sleep_hrs, and age. Comment on the direction and strength of each pairwise correlation with the outcome.

brfss_mlr %>%
  select(menthlth_days, physhlth_days, sleep_hrs, age) %>%
  rename(
    `Mental Health\nDays`  = menthlth_days,
    `Physical Health\nDays` = physhlth_days,
    `Sleep\n(hrs)`         = sleep_hrs,
    Age                    = age
  ) %>%
  ggpairs(
    lower  = list(continuous = wrap("points", alpha = 0.05, size = 0.5)),
    diag   = list(continuous = wrap("densityDiag", fill = "steelblue", alpha = 0.5)),
    upper  = list(continuous = wrap("cor", size = 4)),
    title  = "Pairwise Relationships Among Key Variables (BRFSS 2020)"
  ) +
  theme_minimal(base_size = 11)

Description for 1c:

Mental health days and physical health days: The correlation is 0.315 indicating a moderate positive relationship. Individuals reporting more physical health days also tend to report more mental health days.

Mental health days and sleep: The correlation is -0.140 indicating a weak negative relationship. More sleep is slightly associated with fewer mental health days.

Mental health days and age: The correlation is -0.156 indicating a weak negative relationship. Younger individuals tend to report slightly more mental health days compared to older individuals.

Task 2: Unadjusted (Simple) Linear Regression (15 points)

2a. (5 pts) Fit a simple linear regression model regressing menthlth_days on sleep_hrs alone. Write out the fitted regression equation.

brfss_mlr_lab <- lm(menthlth_days ~ sleep_hrs, data = brfss_mlr)

# Summary output
summary(brfss_mlr_lab)
## 
## Call:
## lm(formula = menthlth_days ~ sleep_hrs, data = brfss_mlr)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.670 -3.845 -3.040 -0.040 31.785 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  9.47429    0.57712   16.42   <2e-16 ***
## sleep_hrs   -0.80424    0.08025  -10.02   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.642 on 4998 degrees of freedom
## Multiple R-squared:  0.0197, Adjusted R-squared:  0.0195 
## F-statistic: 100.4 on 1 and 4998 DF,  p-value: < 2.2e-16
tidy(brfss_mlr_lab, conf.int = TRUE) %>%
  mutate(across(where(is.numeric), ~ round(., 4))) %>%
  kable(
    caption = "Simple Linear Regression: Menthlth_days ~ Sleep_hrs (BRFSS 2020)",
    col.names = c("Term", "Estimate", "Std. Error", "t-statistic",
                  "p-value", "95% CI Lower", "95% CI Upper"),
    align = "lrrrrrrr"
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(0, bold = TRUE)
Simple Linear Regression: Menthlth_days ~ Sleep_hrs (BRFSS 2020)
Term Estimate Std. Error t-statistic p-value 95% CI Lower 95% CI Upper
(Intercept) 9.4743 0.5771 16.4165 0 8.3429 10.6057
sleep_hrs -0.8042 0.0802 -10.0218 0 -0.9616 -0.6469
b0 <- round(coef(brfss_mlr_lab)[1], 3)
b1 <- round(coef(brfss_mlr_lab)[2], 4)

Fitted Regression Equation: menthlth_days=9.4743−0.8042×sleep_hrs

2b. (5 pts) Interpret the slope for sleep in a single sentence appropriate for a public health audience (no statistical jargon).

For every additional hour of sleep per night, people reported about 0.8 fewer days of poor mental health in the past 30 days.

2c. (5 pts) State the null and alternative hypotheses for the slope, report the t-statistic and p-value, and state your conclusion. What is the degree of freedom for this test?

Null hypothesis: There is no linear relationship between sleep hours and mental unhealthy days, the slope is 0.

Alternative hypothesis: There is a linear relationship between sleep and mentally unhealthy days, the slope is not zero.

T- Statistic: -10.0218 p-value: 0.000 The p-value is below 0.05 so we reject the null hypothesis and conclude that there is evidence that sleep hours are associated with the number of mentally unhealthy days.

Degrees of freedom: df= n-2 df=4998


Task 3: Building the Multivariable Model (25 points)

3a. (5 pts) Fit three models:

  • Model A: menthlth_days ~ sleep_hrs
  • Model B: menthlth_days ~ sleep_hrs + age + sex
  • Model C: menthlth_days ~ sleep_hrs + age + sex + physhlth_days + income_cat + exercise
 #Model 1: menthlth_days ~ sleep_hrs
m1 <- lm(menthlth_days ~ sleep_hrs, data = brfss_mlr)

# Model 2: menthlth_days ~ sleep_hrs + age + sex`
m2 <- lm(menthlth_days ~ sleep_hrs + age + sex, data = brfss_mlr)

# Model 3: Full multivariable model
m3 <- lm(menthlth_days ~ sleep_hrs + age + sex + physhlth_days + income_cat + exercise,
         data = brfss_mlr)

tidy(m3, conf.int = TRUE) %>%
  mutate(
    term = dplyr::recode(term,
      "(Intercept)"   = "Intercept",
      "physhlth_days" = "Physical health days",
      "sleep_hrs"     = "Sleep (hours/night)",
      "age"           = "Age (years)",
      "income_cat"    = "Income (ordinal 1-8)",
      "sexFemale"     = "Sex: Female (ref = Male)",
      "exerciseYes"   = "Exercise: Yes (ref = No)"
      
    ),
    across(where(is.numeric), ~ round(., 4))
  ) %>%
  kable(
    caption   = "Table 3. Multiple Linear Regression: Mentally Unhealthy Days ~ Multiple Predictors (BRFSS 2020, n = 5,000)",
    col.names = c("Term", "Estimate", "Std. Error", "t-statistic", "p-value", "95% CI Lower", "95% CI Upper"),
    format = "html"
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = TRUE) %>%
  row_spec(0, bold = TRUE) %>%
  row_spec(c(2, 3), background = "#EBF5FB")
Table 3. Multiple Linear Regression: Mentally Unhealthy Days ~ Multiple Predictors (BRFSS 2020, n = 5,000)
Term Estimate Std. Error t-statistic p-value 95% CI Lower 95% CI Upper
Intercept 12.4755 0.7170 17.4006 0.0000 11.0699 13.8810
Sleep (hours/night) -0.5092 0.0753 -6.7574 0.0000 -0.6569 -0.3614
Age (years) -0.0823 0.0059 -13.8724 0.0000 -0.0939 -0.0707
Sex: Female (ref = Male) 1.2451 0.2023 6.1535 0.0000 0.8484 1.6417
Physical health days 0.2917 0.0136 21.4779 0.0000 0.2650 0.3183
Income (ordinal 1-8) -0.3213 0.0520 -6.1778 0.0000 -0.4233 -0.2194
Exercise: Yes (ref = No) -0.3427 0.2531 -1.3537 0.1759 -0.8389 0.1536

3b. (10 pts) Create a table comparing the sleep coefficient (\(\hat{\beta}\), SE, 95% CI, p-value) across Models A, B, and C. Does the sleep coefficient change substantially when you add more covariates? What does this suggest about confounding?

tribble(
  ~Model, ~`Mental Health Days β̂`, ~`SE`, ~`p-value`, ~`95% CI`, ~`Adj. R²`,
  "M1 (sleep_hrs)", 
    round(coef(m1)[2], 3),
    round(summary(m1)$coefficients[2,2], 3),
    round(summary(m1)$coefficients[2,4], 3),
    paste0("(", round(confint(m1)[2,1],3), ", ", round(confint(m1)[2,2],3), ")"),
    round(summary(m1)$adj.r.squared, 3),

  "M2 (sleep_hrs)",     
    round(coef(m2)[2], 3),
    round(summary(m2)$coefficients[2,2], 3),
    round(summary(m2)$coefficients[2,4], 3),
    paste0("(", round(confint(m2)[2,1],3), ", ", round(confint(m2)[2,2],3), ")"),
    round(summary(m2)$adj.r.squared, 3),

  "M3 (sleep_hrs)",       
    round(coef(m3)[2], 3),
    round(summary(m3)$coefficients[2,2], 3),
    round(summary(m3)$coefficients[2,4], 3),
    paste0("(", round(confint(m3)[2,1],3), ", ", round(confint(m3)[2,2],3), ")"),
    round(summary(m3)$adj.r.squared, 3)
) %>%
  kable(caption = "Table 4. Sleep Coefficient Across Sequential Models") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(0, bold = TRUE)
Table 4. Sleep Coefficient Across Sequential Models
Model Mental Health Days β
   S
M1 (sleep_hrs) -0.804 0.080 0 (-0.962, -0.647) 0.020
M2 (sleep_hrs) -0.734 0.079 0 (-0.889, -0.578) 0.050
M3 (sleep_hrs) -0.509 0.075 0 (-0.657, -0.361) 0.156

3c. (10 pts) For Model C, write out the full fitted regression equation and interpret every coefficient in plain language appropriate for a public health report.

Fitted regression equation: Menthlth_days= 12.476-0.509X sleep_hrs-0.082Xage +1.245X female+0.2917X phys_health_days-0.3213X income-0.3427X exercise+e

Intercept (12.476): The predicted number of poor mental health days for covarriates at 0/baseline.

Sleep hours (-0.509): Each extra hour of sleep per nigh is associated with about half a day fewer mentally unhealthy days.

Age(-0.082): each additional year of age is associated with 0.08 fewer mentally unhealthy days per month

Sex(1.245): Women report about 1.25 more mentally unhealthy days per month than men.

Physical health days (0.292): Each additional day of poor physical health is associated with 0.29 more mentally unhealthy days per month.

Income(-0.321): Each step higher income is associated with 0.32 fewer mentally unhealthy days per month.

Exercise(0.343, p=0.176): People who exercise report 0.34 fewer mentally unhealthy days per month, however this result is not statistically significant because the p-value is not less than 0.05. We cannot confidently say the association differs from zero.


Task 4: Model Fit and ANOVA (20 points)

4a. (5 pts) Report \(R^2\) and Adjusted \(R^2\) for each of the three models (A, B, C). Create a table.

tribble(
  ~Model, ~Predictors, ~R2, ~`Adj. R²`,
  "M1: Menthlth_days ~ sleep_hrs",      1, round(summary(m1)$r.squared, 4), round(summary(m1)$adj.r.squared, 4),
  "M2: + age and sex",                   2, round(summary(m2)$r.squared, 4), round(summary(m2)$adj.r.squared, 4),
  "M3: Full (+ age + sex + physhlth_days + income_cat + exercise)", 6,
    round(summary(m3)$r.squared, 4), round(summary(m3)$adj.r.squared, 4)
) %>%
  kable(caption = "Table 7. R² and Adjusted R² Across Sequential Models") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(3, bold = TRUE, background = "#EBF5FB")
Table 7. R² and Adjusted R² Across Sequential Models
Model Predictors R2 Adj. R²
M1: Menthlth_days ~ sleep_hrs 1 0.0197 0.0195
M2: + age and sex 2 0.0504 0.0498
M3: Full (+ age + sex + physhlth_days + income_cat + exercise) 6 0.1569 0.1559

Which model explains the most variance in mental health days?

Model C explains the most variance because it has the highest R^2 and adjusted R^2. This shows us that by including the additonal predictors it improves the models explanatory power.

4b. (5 pts) What is the Root MSE for Model C? Interpret it in practical terms — what does it tell you about prediction accuracy?

glance(m3) %>%
  select(r.squared, adj.r.squared, sigma, statistic, p.value, df, df.residual, nobs) %>%
  mutate(across(where(is.numeric), ~ round(., 4))) %>%
  pivot_longer(everything(), names_to = "Statistic", values_to = "Value") %>%
  mutate(Statistic = dplyr::recode(Statistic,
    "r.squared"     = "R²",
    "adj.r.squared" = "Adjusted R²",
    "sigma"         = "Residual Std. Error (Root MSE)",
    "statistic"     = "F-statistic",
    "p.value"       = "p-value (overall F-test)",
    "df"            = "Model df (p)",
    "df.residual"   = "Residual df (n − p − 1)",
    "nobs"          = "n (observations)"
  )) %>%
  kable(caption = "Table 6. Overall Model Summary — Model 3") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Table 6. Overall Model Summary — Model 3
Statistic Value
0.1569
Adjusted R² 0.1559
Residual Std. Error (Root MSE) 7.0901
F-statistic 154.8953
p-value (overall F-test) 0.0000
Model df (p) 6.0000
Residual df (n − p − 1) 4993.0000
n (observations) 5000.0000

Interpretation: The root MSE of 7.0901 means that the model predicted number of poor mental health days is typically off by about 7 days per month from the actual reported value.

4c. (10 pts) Using the ANOVA output for Model C, fill in the following table manually (i.e., compute the values using the output from anova() or glance()):

Source df SS MS F
Model 6 46,719 7,786.5 154.8953
Residual 4,993 250,993 50.3
Total 4,999 297,712
anova_m3 <- anova(m3)
print(anova_m3)
## Analysis of Variance Table
## 
## Response: menthlth_days
##                 Df Sum Sq Mean Sq  F value    Pr(>F)    
## sleep_hrs        1   5865  5864.8 116.6678 < 2.2e-16 ***
## age              1   6182  6182.2 122.9832 < 2.2e-16 ***
## sex              1   2947  2947.1  58.6266 2.274e-14 ***
## physhlth_days    1  29456 29455.5 585.9585 < 2.2e-16 ***
## income_cat       1   2177  2176.8  43.3031 5.169e-11 ***
## exercise         1     92    92.1   1.8326    0.1759    
## Residuals     4993 250993    50.3                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

State the null hypothesis for the overall F-test and your conclusion.

Null hypotheses: All coefficients in model C equal 0. Alternative hypotheses: At least one coefficient in model C does not equal 0. Conclusion: There is strong statistical evidence that at least one predictor is associated with the number of poor mental health days. The p-value is less than 0.05, so we can reject the null hypothesis.


Task 5: Residual Diagnostics (15 points)

5a. (5 pts) For Model C, produce the four standard diagnostic plots (Residuals vs. Fitted, Normal Q-Q, Scale-Location, Cook’s Distance). Comment on what each plot tells you about the LINE assumptions.

par(mfrow = c(2, 2))
plot(m3, which = 1:4, col = adjustcolor("steelblue", alpha.f = 0.3), pch = 16)

Interpretation of results:

Residuals vs. Fitted: The residuals are scattered around zero but show a slight curved pattern and wider spread at lower fitted values. This suggests the linearity assumption is mostly reasonable but some violation is possible. Normal Q-Q: The points follow the diagonal line in the center but deviate noticeably at the tail. This shows us that residuals are approximately normal but deviations in tails suggest some non normality. Scale-Location: The red line trends slightly upward indicating that some heteroscedasticity is present. Cook’s Distance:Most of the observations have small cooks distance values with only a few slightly larger points. This shows that there is no extremely influential observations.

5b. (5 pts) Given the nature of the outcome (mental health days, bounded at 0 and 30, heavily right-skewed), which assumptions are most likely to be violated? Does this invalidate the analysis? Explain. The first assumption of violation is normality. With a bounded and right skewed outcome, residuals are likely non normal. The second violation is homoscedasticity. Those with low or high mental health days may show different variability. Third, linearity may be violated because some predictors may have nonlinear effects on mental health days. This does not invalidate the analysis. The results are still informative for general trends but estimates may be slightly biased for extreme values and standard error may be underestimated.

5c. (5 pts) Identify any observations with Cook’s Distance > 1. How many are there? What would you do with them in a real analysis? Observation: For Cooks Distance, none of the observations appear to exceed 1 and all values are very close to 0. If the observations did exceed 1, the data points would be looked at carefully to ensure that there are no data entry errors, unusual but valid observations, or influential outliers. —

Task 6: Interpretation for a Public Health Audience (10 points)

Suppose you are writing the results section of a public health paper. Write a 3–4 sentence paragraph summarizing the findings from Model C for a non-statistical audience. Your paragraph should:

  • Identify which predictors were significantly associated with mental health days
  • State the direction and approximate magnitude of the most important associations
  • Appropriately caveat the cross-sectional nature of the data (no causal language)
  • Not use any statistical jargon (no “significant,” “coefficient,” “p-value”)

People who reported more poor physical health days also tended to report more mentally unhealthy days, with about 0.3 additional mentally unhealthy days for each extra physically unhealthy day.Several factors were linked with fewer mentally unhealthy days, including more sleep, older age, and higher income. There was about a half a day fewer for each additional hour of sleep, about 0.08 fewer days for each additional year of age, and about 0.3 fewer days for each step up in income level. Women reported about 1.2 more mentally unhealthy days per month on average than men and those who reported exercising had about 0.3 fewer days of mentally unhealthy days. This data does not show cause and effect but rather the relationship between factors and mental health days.