EPI 553 — Multiple Linear Regression Lab
Use the saved analytic dataset from today’s lecture. It contains 5,000 randomly sampled BRFSS 2020 respondents with the following variables:
| Variable | Description | Type |
|---|---|---|
menthlth_days |
Mentally unhealthy days in past 30 | Continuous (0–30) |
physhlth_days |
Physically unhealthy days in past 30 | Continuous (0–30) |
sleep_hrs |
Sleep hours per night | Continuous (1–14) |
age |
Age in years (capped at 80) | Continuous |
income_cat |
Household income (1 = <$10k to 8 = >$75k) | Ordinal |
sex |
Sex (Male/Female) | Factor |
exercise |
Any physical activity past 30 days (Yes/No) | Factor |
# Load the dataset
library(tidyverse)
library(haven)
library(janitor)
library(knitr)
library(kableExtra)
library(plotly)
library(broom)
library(ggeffects)
library(gtsummary)
library(GGally)
library(car)
library(lmtest)
library(corrplot)
library(dplyr)
library(ggplot2)
brfss_mlr <- readRDS(
"C:/Users/graci/OneDrive/Documents/UA GRAD SCHOOL/2nd Semester/EPI553/BRFSS_mlr_2020.rds"
)1a. (5 pts) Create a descriptive statistics table
using tbl_summary() that includes all variables in the
dataset. Include means (SD) for continuous variables and n (%) for
categorical variables.
brfss_mlr %>%
select(menthlth_days, physhlth_days, sleep_hrs, age,
income_f, sex, exercise, bmi) %>%
tbl_summary(
label = list(
menthlth_days ~ "Mentally unhealthy days (past 30)",
physhlth_days ~ "Physically unhealthy days (past 30)",
sleep_hrs ~ "Sleep (hours/night)",
age ~ "Age (years)",
sex ~ "Sex",
exercise ~ "Any physical activity (past 30 days)",
bmi ~ "Body mass index",
income_f ~ "Household income by income level"
),
statistic = list(
all_continuous() ~ "{mean} ({sd})",
all_categorical() ~ "{n} ({p}%)"
),
digits = all_continuous() ~ 1,
missing = "no"
) %>%
add_n() %>%
bold_labels() %>%
modify_caption("**Table 1. Descriptive Statistics — BRFSS 2020 Analytic Sample (n = 5,000)**")| Characteristic | N | N = 5,0001 |
|---|---|---|
| Mentally unhealthy days (past 30) | 5,000 | 3.8 (7.7) |
| Physically unhealthy days (past 30) | 5,000 | 3.3 (7.8) |
| Sleep (hours/night) | 5,000 | 7.1 (1.3) |
| Age (years) | 5,000 | 54.3 (17.2) |
| Household income by income level | 5,000 | |
| <$10k | 190 (3.8%) | |
| $10-15k | 169 (3.4%) | |
| $15-20k | 312 (6.2%) | |
| $20-25k | 434 (8.7%) | |
| $25-35k | 489 (9.8%) | |
| $35-50k | 683 (14%) | |
| $50-75k | 841 (17%) | |
| >$75k | 1,882 (38%) | |
| Sex | 5,000 | |
| Male | 2,331 (47%) | |
| Female | 2,669 (53%) | |
| Any physical activity (past 30 days) | 5,000 | 3,874 (77%) |
| Body mass index | 4,706 | 28.4 (6.4) |
| 1 Mean (SD); n (%) | ||
1b. (5 pts) Create a histogram of
menthlth_days. Describe the shape of the distribution. Is
it symmetric, right-skewed, or left-skewed? What are the implications of
this shape for regression modeling?
p_hist <- ggplot(brfss_mlr, aes(x = menthlth_days)) +
geom_histogram(binwidth = 1, fill = "steelblue", color = "white", alpha = 0.85) +
labs(
title = "Distribution of Mentally Unhealthy Days in the Past 30 Days",
subtitle = "BRFSS 2020 Analytic Sample (n = 5,000)",
x = "Number of Mentally Unhealthy Days",
y = "Count"
) +
theme_minimal(base_size = 13)
ggplotly(p_hist)Description of 1b: The distribution of mentally unhealthy days is strongly right skewed. This skewness suggests the outcome is not normally distributed, which may affect regression assumptions such as normality of residuals.
1c. (5 pts) Create a scatterplot matrix (using
GGally::ggpairs() or similar) for the continuous variables:
menthlth_days, physhlth_days,
sleep_hrs, and age. Comment on the direction
and strength of each pairwise correlation with the outcome.
brfss_mlr %>%
select(menthlth_days, physhlth_days, sleep_hrs, age) %>%
rename(
`Mental Health\nDays` = menthlth_days,
`Physical Health\nDays` = physhlth_days,
`Sleep\n(hrs)` = sleep_hrs,
Age = age
) %>%
ggpairs(
lower = list(continuous = wrap("points", alpha = 0.05, size = 0.5)),
diag = list(continuous = wrap("densityDiag", fill = "steelblue", alpha = 0.5)),
upper = list(continuous = wrap("cor", size = 4)),
title = "Pairwise Relationships Among Key Variables (BRFSS 2020)"
) +
theme_minimal(base_size = 11)Description for 1c:
Mental health days and physical health days: The correlation is 0.315 indicating a moderate positive relationship. Individuals reporting more physical health days also tend to report more mental health days.
Mental health days and sleep: The correlation is -0.140 indicating a weak negative relationship. More sleep is slightly associated with fewer mental health days.
2a. (5 pts) Fit a simple linear regression model
regressing menthlth_days on sleep_hrs alone.
Write out the fitted regression equation.
brfss_mlr_lab <- lm(menthlth_days ~ sleep_hrs, data = brfss_mlr)
# Summary output
summary(brfss_mlr_lab)##
## Call:
## lm(formula = menthlth_days ~ sleep_hrs, data = brfss_mlr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.670 -3.845 -3.040 -0.040 31.785
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.47429 0.57712 16.42 <2e-16 ***
## sleep_hrs -0.80424 0.08025 -10.02 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.642 on 4998 degrees of freedom
## Multiple R-squared: 0.0197, Adjusted R-squared: 0.0195
## F-statistic: 100.4 on 1 and 4998 DF, p-value: < 2.2e-16
tidy(brfss_mlr_lab, conf.int = TRUE) %>%
mutate(across(where(is.numeric), ~ round(., 4))) %>%
kable(
caption = "Simple Linear Regression: Menthlth_days ~ Sleep_hrs (BRFSS 2020)",
col.names = c("Term", "Estimate", "Std. Error", "t-statistic",
"p-value", "95% CI Lower", "95% CI Upper"),
align = "lrrrrrrr"
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(0, bold = TRUE)| Term | Estimate | Std. Error | t-statistic | p-value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 9.4743 | 0.5771 | 16.4165 | 0 | 8.3429 | 10.6057 |
| sleep_hrs | -0.8042 | 0.0802 | -10.0218 | 0 | -0.9616 | -0.6469 |
Fitted Regression Equation: menthlth_days=9.4743−0.8042×sleep_hrs
2b. (5 pts) Interpret the slope for sleep in a single sentence appropriate for a public health audience (no statistical jargon).
For every additional hour of sleep per night, people reported about 0.8 fewer days of poor mental health in the past 30 days.
2c. (5 pts) State the null and alternative hypotheses for the slope, report the t-statistic and p-value, and state your conclusion. What is the degree of freedom for this test?
Null hypothesis: There is no linear relationship between sleep hours and mental unhealthy days, the slope is 0.
Alternative hypothesis: There is a linear relationship between sleep and mentally unhealthy days, the slope is not zero.
T- Statistic: -10.0218 p-value: 0.000 The p-value is below 0.05 so we reject the null hypothesis and conclude that there is evidence that sleep hours are associated with the number of mentally unhealthy days.
Degrees of freedom: df= n-2 df=4998
3a. (5 pts) Fit three models:
menthlth_days ~ sleep_hrsmenthlth_days ~ sleep_hrs + age + sexmenthlth_days ~ sleep_hrs + age + sex + physhlth_days + income_cat + exercise #Model 1: menthlth_days ~ sleep_hrs
m1 <- lm(menthlth_days ~ sleep_hrs, data = brfss_mlr)
# Model 2: menthlth_days ~ sleep_hrs + age + sex`
m2 <- lm(menthlth_days ~ sleep_hrs + age + sex, data = brfss_mlr)
# Model 3: Full multivariable model
m3 <- lm(menthlth_days ~ sleep_hrs + age + sex + physhlth_days + income_cat + exercise,
data = brfss_mlr)
tidy(m3, conf.int = TRUE) %>%
mutate(
term = dplyr::recode(term,
"(Intercept)" = "Intercept",
"physhlth_days" = "Physical health days",
"sleep_hrs" = "Sleep (hours/night)",
"age" = "Age (years)",
"income_cat" = "Income (ordinal 1-8)",
"sexFemale" = "Sex: Female (ref = Male)",
"exerciseYes" = "Exercise: Yes (ref = No)"
),
across(where(is.numeric), ~ round(., 4))
) %>%
kable(
caption = "Table 3. Multiple Linear Regression: Mentally Unhealthy Days ~ Multiple Predictors (BRFSS 2020, n = 5,000)",
col.names = c("Term", "Estimate", "Std. Error", "t-statistic", "p-value", "95% CI Lower", "95% CI Upper"),
format = "html"
) %>%
kable_styling(bootstrap_options = c("striped", "hover"),
full_width = TRUE) %>%
row_spec(0, bold = TRUE) %>%
row_spec(c(2, 3), background = "#EBF5FB")| Term | Estimate | Std. Error | t-statistic | p-value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|
| Intercept | 12.4755 | 0.7170 | 17.4006 | 0.0000 | 11.0699 | 13.8810 |
| Sleep (hours/night) | -0.5092 | 0.0753 | -6.7574 | 0.0000 | -0.6569 | -0.3614 |
| Age (years) | -0.0823 | 0.0059 | -13.8724 | 0.0000 | -0.0939 | -0.0707 |
| Sex: Female (ref = Male) | 1.2451 | 0.2023 | 6.1535 | 0.0000 | 0.8484 | 1.6417 |
| Physical health days | 0.2917 | 0.0136 | 21.4779 | 0.0000 | 0.2650 | 0.3183 |
| Income (ordinal 1-8) | -0.3213 | 0.0520 | -6.1778 | 0.0000 | -0.4233 | -0.2194 |
| Exercise: Yes (ref = No) | -0.3427 | 0.2531 | -1.3537 | 0.1759 | -0.8389 | 0.1536 |
3b. (10 pts) Create a table comparing the sleep coefficient (\(\hat{\beta}\), SE, 95% CI, p-value) across Models A, B, and C. Does the sleep coefficient change substantially when you add more covariates? What does this suggest about confounding?
tribble(
~Model, ~`Mental Health Days β̂`, ~`SE`, ~`p-value`, ~`95% CI`, ~`Adj. R²`,
"M1 (sleep_hrs)",
round(coef(m1)[2], 3),
round(summary(m1)$coefficients[2,2], 3),
round(summary(m1)$coefficients[2,4], 3),
paste0("(", round(confint(m1)[2,1],3), ", ", round(confint(m1)[2,2],3), ")"),
round(summary(m1)$adj.r.squared, 3),
"M2 (sleep_hrs)",
round(coef(m2)[2], 3),
round(summary(m2)$coefficients[2,2], 3),
round(summary(m2)$coefficients[2,4], 3),
paste0("(", round(confint(m2)[2,1],3), ", ", round(confint(m2)[2,2],3), ")"),
round(summary(m2)$adj.r.squared, 3),
"M3 (sleep_hrs)",
round(coef(m3)[2], 3),
round(summary(m3)$coefficients[2,2], 3),
round(summary(m3)$coefficients[2,4], 3),
paste0("(", round(confint(m3)[2,1],3), ", ", round(confint(m3)[2,2],3), ")"),
round(summary(m3)$adj.r.squared, 3)
) %>%
kable(caption = "Table 4. Sleep Coefficient Across Sequential Models") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(0, bold = TRUE)| Model | Mental Health Days β |
S
| |||
|---|---|---|---|---|---|
| M1 (sleep_hrs) | -0.804 | 0.080 | 0 | (-0.962, -0.647) | 0.020 |
| M2 (sleep_hrs) | -0.734 | 0.079 | 0 | (-0.889, -0.578) | 0.050 |
| M3 (sleep_hrs) | -0.509 | 0.075 | 0 | (-0.657, -0.361) | 0.156 |
3c. (10 pts) For Model C, write out the full fitted regression equation and interpret every coefficient in plain language appropriate for a public health report.
Fitted regression equation: Menthlth_days= 12.476-0.509X sleep_hrs-0.082Xage +1.245X female+0.2917X phys_health_days-0.3213X income-0.3427X exercise+e
Intercept (12.476): The predicted number of poor mental health days for covarriates at 0/baseline.
Sleep hours (-0.509): Each extra hour of sleep per nigh is associated with about half a day fewer mentally unhealthy days.
Age(-0.082): each additional year of age is associated with 0.08 fewer mentally unhealthy days per month
Sex(1.245): Women report about 1.25 more mentally unhealthy days per month than men.
Physical health days (0.292): Each additional day of poor physical health is associated with 0.29 more mentally unhealthy days per month.
Income(-0.321): Each step higher income is associated with 0.32 fewer mentally unhealthy days per month.
Exercise(0.343, p=0.176): People who exercise report 0.34 fewer mentally unhealthy days per month, however this result is not statistically significant because the p-value is not less than 0.05. We cannot confidently say the association differs from zero.
4a. (5 pts) Report \(R^2\) and Adjusted \(R^2\) for each of the three models (A, B, C). Create a table.
tribble(
~Model, ~Predictors, ~R2, ~`Adj. R²`,
"M1: Menthlth_days ~ sleep_hrs", 1, round(summary(m1)$r.squared, 4), round(summary(m1)$adj.r.squared, 4),
"M2: + age and sex", 2, round(summary(m2)$r.squared, 4), round(summary(m2)$adj.r.squared, 4),
"M3: Full (+ age + sex + physhlth_days + income_cat + exercise)", 6,
round(summary(m3)$r.squared, 4), round(summary(m3)$adj.r.squared, 4)
) %>%
kable(caption = "Table 7. R² and Adjusted R² Across Sequential Models") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(3, bold = TRUE, background = "#EBF5FB")| Model | Predictors | R2 | Adj. R² |
|---|---|---|---|
| M1: Menthlth_days ~ sleep_hrs | 1 | 0.0197 | 0.0195 |
| M2: + age and sex | 2 | 0.0504 | 0.0498 |
| M3: Full (+ age + sex + physhlth_days + income_cat + exercise) | 6 | 0.1569 | 0.1559 |
Which model explains the most variance in mental health days?
Model C explains the most variance because it has the highest R^2 and adjusted R^2. This shows us that by including the additonal predictors it improves the models explanatory power.
4b. (5 pts) What is the Root MSE for Model C? Interpret it in practical terms — what does it tell you about prediction accuracy?
glance(m3) %>%
select(r.squared, adj.r.squared, sigma, statistic, p.value, df, df.residual, nobs) %>%
mutate(across(where(is.numeric), ~ round(., 4))) %>%
pivot_longer(everything(), names_to = "Statistic", values_to = "Value") %>%
mutate(Statistic = dplyr::recode(Statistic,
"r.squared" = "R²",
"adj.r.squared" = "Adjusted R²",
"sigma" = "Residual Std. Error (Root MSE)",
"statistic" = "F-statistic",
"p.value" = "p-value (overall F-test)",
"df" = "Model df (p)",
"df.residual" = "Residual df (n − p − 1)",
"nobs" = "n (observations)"
)) %>%
kable(caption = "Table 6. Overall Model Summary — Model 3") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Statistic | Value |
|---|---|
| R² | 0.1569 |
| Adjusted R² | 0.1559 |
| Residual Std. Error (Root MSE) | 7.0901 |
| F-statistic | 154.8953 |
| p-value (overall F-test) | 0.0000 |
| Model df (p) | 6.0000 |
| Residual df (n − p − 1) | 4993.0000 |
| n (observations) | 5000.0000 |
Interpretation: The root MSE of 7.0901 means that the model predicted number of poor mental health days is typically off by about 7 days per month from the actual reported value.
4c. (10 pts) Using the ANOVA output for Model C,
fill in the following table manually (i.e., compute the values using the
output from anova() or glance()):
| Source | df | SS | MS | F |
|---|---|---|---|---|
| Model | 6 | 46,719 | 7,786.5 | 154.8953 |
| Residual | 4,993 | 250,993 | 50.3 | |
| Total | 4,999 | 297,712 |
## Analysis of Variance Table
##
## Response: menthlth_days
## Df Sum Sq Mean Sq F value Pr(>F)
## sleep_hrs 1 5865 5864.8 116.6678 < 2.2e-16 ***
## age 1 6182 6182.2 122.9832 < 2.2e-16 ***
## sex 1 2947 2947.1 58.6266 2.274e-14 ***
## physhlth_days 1 29456 29455.5 585.9585 < 2.2e-16 ***
## income_cat 1 2177 2176.8 43.3031 5.169e-11 ***
## exercise 1 92 92.1 1.8326 0.1759
## Residuals 4993 250993 50.3
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
State the null hypothesis for the overall F-test and your conclusion.
Null hypotheses: All coefficients in model C equal 0. Alternative hypotheses: At least one coefficient in model C does not equal 0. Conclusion: There is strong statistical evidence that at least one predictor is associated with the number of poor mental health days. The p-value is less than 0.05, so we can reject the null hypothesis.
5a. (5 pts) For Model C, produce the four standard diagnostic plots (Residuals vs. Fitted, Normal Q-Q, Scale-Location, Cook’s Distance). Comment on what each plot tells you about the LINE assumptions.
Interpretation of results:
Residuals vs. Fitted: The residuals are scattered around zero but show a slight curved pattern and wider spread at lower fitted values. This suggests the linearity assumption is mostly reasonable but some violation is possible. Normal Q-Q: The points follow the diagonal line in the center but deviate noticeably at the tail. This shows us that residuals are approximately normal but deviations in tails suggest some non normality. Scale-Location: The red line trends slightly upward indicating that some heteroscedasticity is present. Cook’s Distance:Most of the observations have small cooks distance values with only a few slightly larger points. This shows that there is no extremely influential observations.
5b. (5 pts) Given the nature of the outcome (mental health days, bounded at 0 and 30, heavily right-skewed), which assumptions are most likely to be violated? Does this invalidate the analysis? Explain. The first assumption of violation is normality. With a bounded and right skewed outcome, residuals are likely non normal. The second violation is homoscedasticity. Those with low or high mental health days may show different variability. Third, linearity may be violated because some predictors may have nonlinear effects on mental health days. This does not invalidate the analysis. The results are still informative for general trends but estimates may be slightly biased for extreme values and standard error may be underestimated.
5c. (5 pts) Identify any observations with Cook’s Distance > 1. How many are there? What would you do with them in a real analysis? Observation: For Cooks Distance, none of the observations appear to exceed 1 and all values are very close to 0. If the observations did exceed 1, the data points would be looked at carefully to ensure that there are no data entry errors, unusual but valid observations, or influential outliers. —
Suppose you are writing the results section of a public health paper. Write a 3–4 sentence paragraph summarizing the findings from Model C for a non-statistical audience. Your paragraph should:
People who reported more poor physical health days also tended to report more mentally unhealthy days, with about 0.3 additional mentally unhealthy days for each extra physically unhealthy day.Several factors were linked with fewer mentally unhealthy days, including more sleep, older age, and higher income. There was about a half a day fewer for each additional hour of sleep, about 0.08 fewer days for each additional year of age, and about 0.3 fewer days for each step up in income level. Women reported about 1.2 more mentally unhealthy days per month on average than men and those who reported exercising had about 0.3 fewer days of mentally unhealthy days. This data does not show cause and effect but rather the relationship between factors and mental health days.