In this lab, you will build and interpret multiple linear regression models using the Behavioral Risk Factor Surveillance System (BRFSS) 2020 analytic dataset. Work through each task systematically. You may discuss concepts with classmates, but your written answers and R code must be your own.
Use the saved analytic dataset containing 5,000 randomly sampled BRFSS 2020 respondents with the following variables:
| Variable | Description | Type |
|---|---|---|
| menthlth_days | Mentally unhealthy days in past 30 | Continuous (0–30) |
| physhlth_days | Physically unhealthy days in past 30 | Continuous (0–30) |
| sleep_hrs | Sleep hours per night | Continuous (1–14) |
| age | Age in years (capped at 80) | Continuous |
| income_cat | Household income (1 = <$10k to 8 = >$75k) | Ordinal |
| sex | Sex (Male/Female) | Factor |
| exercise | Any physical activity past 30 days (Yes/No) | Factor |
library(tidyverse)
library(haven)
library(janitor)
library(knitr)
library(kableExtra)
library(plotly)
library(broom)
library(ggeffects)
library(gtsummary)
library(GGally)
library(car)
library(lmtest)
library(corrplot)
# ── Recode and clean ──────────────────────────────────────────────────────────
brfss_mlr <- brfss_full %>%
mutate(
# Outcome: mentally unhealthy days in past 30 (88 = none = 0; 77/99 = DK/refused = NA)
menthlth_days = case_when(
menthlth == 88 ~ 0,
menthlth >= 1 & menthlth <= 30 ~ as.numeric(menthlth),
TRUE ~ NA_real_
),
# Physical health days (key predictor)
physhlth_days = case_when(
physhlth == 88 ~ 0,
physhlth >= 1 & physhlth <= 30 ~ as.numeric(physhlth),
TRUE ~ NA_real_
),
# Sleep hours (practical cap at 14)
sleep_hrs = case_when(
sleptim1 >= 1 & sleptim1 <= 14 ~ as.numeric(sleptim1),
TRUE ~ NA_real_
),
# Age (capped at 80 per BRFSS coding)
age = age80,
# Income category (ordinal 1–8)
income_cat = case_when(
income2 %in% 1:8 ~ as.numeric(income2),
TRUE ~ NA_real_
),
# Sex
sex = factor(sexvar, levels = c(1, 2), labels = c("Male", "Female")),
# Exercise in past 30 days (any physical activity)
exercise = factor(case_when(
exerany2 == 1 ~ "Yes",
exerany2 == 2 ~ "No",
TRUE ~ NA_character_
), levels = c("No", "Yes")),
# BMI (stored as integer × 100 in BRFSS)
bmi = ifelse(bmi5 > 0, bmi5 / 100, NA_real_),
# Income as labeled factor (for display)
income_f = factor(income2, levels = 1:8,
labels = c("<$10k", "$10-15k", "$15-20k", "$20-25k",
"$25-35k", "$35-50k", "$50-75k", ">$75k"))
) %>%
filter(
!is.na(menthlth_days),
!is.na(physhlth_days),
!is.na(sleep_hrs),
!is.na(age), age >= 18,
!is.na(income_cat),
!is.na(sex),
!is.na(exercise)
)
# ── Analytic sample (reproducible random sample) ────────────────────────────
set.seed(553)
brfss_mlr <- brfss_mlr %>%
select(menthlth_days, physhlth_days, sleep_hrs, age,
income_cat, income_f, sex, exercise, bmi) %>%
slice_sample(n = 5000)
# Save for lab activity
saveRDS(brfss_mlr,
"/Users/nataliasmall/Downloads/EPI 553/brfss_mlr_2020.rds")
tibble(Metric = c("Observations", "Variables"),
Value = c(nrow(brfss_mlr), ncol(brfss_mlr))) %>%
kable(caption = "Analytic Dataset Dimensions") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE)
| Metric | Value |
|---|---|
| Observations | 5000 |
| Variables | 9 |
1a. (5 pts) Create a descriptive statistics table
using tbl_summary() that includes all variables in the
dataset. Include means (SD) for continuous variables and n (%) for
categorical variables.
brfss_mlr %>%
select(menthlth_days, physhlth_days, sleep_hrs, age,
income_cat, income_f, sex, exercise, bmi) %>%
tbl_summary(
label = list(
menthlth_days ~ "Mentally unhealthy days (past 30)",
physhlth_days ~ "Physically unhealthy days (past 30)",
sleep_hrs ~ "Sleep (hours/night)",
age ~ "Age (years)",
income_cat ~ "Income Category",
income_f ~ "Household income",
sex ~ "Sex",
exercise ~ "Any physical activity (past 30 days)",
bmi ~ "BMI (kg/m²)"
),
statistic = list(
all_continuous() ~ "{mean} ({sd})",
all_categorical() ~ "{n} ({p}%)"
),
digits = all_continuous() ~ 1,
missing = "no"
) %>%
add_n() %>%
bold_labels() %>%
modify_caption("**Table 1. Descriptive Statistics — BRFSS 2020 Analytic Sample (n = 5,000)**")
| Characteristic | N | N = 5,000¹ |
|---|---|---|
| Mentally unhealthy days (past 30) | 5,000 | 3.8 (7.7) |
| Physically unhealthy days (past 30) | 5,000 | 3.3 (7.8) |
| Sleep (hours/night) | 5,000 | 7.1 (1.3) |
| Age (years) | 5,000 | 54.3 (17.2) |
| Income Category | 5,000 | |
| 1 | | 190 (3.8%) |
| 2 | | 169 (3.4%) |
| 3 | | 312 (6.2%) |
| 4 | | 434 (8.7%) |
| 5 | | 489 (9.8%) |
| 6 | | 683 (14%) |
| 7 | | 841 (17%) |
| 8 | | 1,882 (38%) |
| Household income | 5,000 | |
| <$10k | | 190 (3.8%) |
| $10-15k | | 169 (3.4%) |
| $15-20k | | 312 (6.2%) |
| $20-25k | | 434 (8.7%) |
| $25-35k | | 489 (9.8%) |
| $35-50k | | 683 (14%) |
| $50-75k | | 841 (17%) |
| >$75k | | 1,882 (38%) |
| Sex | 5,000 | |
| Male | | 2,331 (47%) |
| Female | | 2,669 (53%) |
| Any physical activity (past 30 days) | 5,000 | 3,874 (77%) |
| BMI (kg/m²) | 4,706 | 28.4 (6.4) |

¹ Mean (SD); n (%)
1b. (5 pts) Create a histogram of
menthlth_days. Describe the shape of the distribution. Is
it symmetric, right-skewed, or left-skewed? What are the implications of
this shape for regression modeling?
p_hist <- ggplot(brfss_mlr, aes(x = menthlth_days)) +
geom_histogram(binwidth = 1, fill = "steelblue", color = "white", alpha = 0.85) +
labs(
title = "Distribution of Mentally Unhealthy Days in the Past 30 Days",
subtitle = "BRFSS 2020 Analytic Sample (n = 5,000)",
x = "Number of Mentally Unhealthy Days",
y = "Count"
) +
theme_minimal(base_size = 13)
ggplotly(p_hist)
Distribution of Mentally Unhealthy Days (BRFSS 2020)
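The distribution is heavily right-skewed. Most respondents report zero mentally unhealthy days, with a long tail toward 30. Implications for regression modeling: OLS inference assumes normally distributed residuals with constant variance, and an outcome this skewed and bounded (0 to 30) is unlikely to produce them. The large sample size (n = 5,000) provides some robustness for inference on the coefficients, but a log transformation (for example of days + 1, since many values are zero) or a count model such as Poisson or negative binomial regression should be considered as alternatives.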
1c. (5 pts) Create a scatterplot matrix (using
GGally::ggpairs() or similar) for the continuous variables:
menthlth_days, physhlth_days,
sleep_hrs, and age. Comment on the direction
and strength of each pairwise correlation with the outcome.
# Pairs plot of continuous predictors vs outcome
brfss_mlr %>%
select(menthlth_days, physhlth_days, sleep_hrs, age) %>%
rename(
`Mental Health\nDays` = menthlth_days,
`Physical Health\nDays` = physhlth_days,
`Sleep\n(hrs)` = sleep_hrs,
Age = age
) %>%
ggpairs(
lower = list(continuous = wrap("points", alpha = 0.05, size = 0.5)),
diag = list(continuous = wrap("densityDiag", fill = "steelblue", alpha = 0.5)),
upper = list(continuous = wrap("cor", size = 4)),
title = "Pairwise Relationships Among Key Variables (BRFSS 2020)"
) +
theme_minimal(base_size = 11)
Create a scatterplot matrix for continuous variables
2a. (5 pts) Fit a simple linear regression model
regressing menthlth_days on sleep_hrs alone.
Write out the fitted regression equation.
# (a) Fit SLR: menthlth_days ~ sleep_hrs
model_slr <- lm(menthlth_days ~ sleep_hrs, data = brfss_mlr)
summary(model_slr)
##
## Call:
## lm(formula = menthlth_days ~ sleep_hrs, data = brfss_mlr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.670 -3.845 -3.040 -0.040 31.785
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.47429 0.57712 16.42 <2e-16 ***
## sleep_hrs -0.80424 0.08025 -10.02 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.642 on 4998 degrees of freedom
## Multiple R-squared: 0.0197, Adjusted R-squared: 0.0195
## F-statistic: 100.4 on 1 and 4998 DF, p-value: < 2.2e-16
# Tidy coefficient table
tidy(model_slr, conf.int = TRUE) %>%
mutate(across(where(is.numeric), ~ round(., 4))) %>%
kable(
caption = "Simple Linear Regression: menthlth_days ~ sleep_hrs (BRFSS 2020)",
col.names = c("Term", "Estimate", "Std. Error", "t-statistic",
"p-value", "95% CI Lower", "95% CI Upper"),
align = "lrrrrrr"
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(0, bold = TRUE)
| Term | Estimate | Std. Error | t-statistic | p-value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 9.4743 | 0.5771 | 16.4165 | 0 | 8.3429 | 10.6057 |
| sleep_hrs | -0.8042 | 0.0802 | -10.0218 | 0 | -0.9616 | -0.6469 |
Fitted regression equation: \(\widehat{\text{menthlth\_days}} = 9.4743 - 0.8042 \times \text{sleep\_hrs}\)
2b. (5 pts) Interpret the slope for sleep in a single sentence appropriate for a public health audience (no statistical jargon).
On average, each additional hour of sleep per night is linked to about 0.8 fewer mentally unhealthy days per month.
2c. (5 pts) State the null and alternative hypotheses for the slope, report the t-statistic and p-value, and state your conclusion. What is the degree of freedom for this test?
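One way to approach 2c is to read the test off the `summary(model_slr)` output above: the hypotheses are \(H_0: \beta_1 = 0\) versus \(H_A: \beta_1 \neq 0\), where \(\beta_1\) is the slope for `sleep_hrs`. The sketch below recomputes the t-statistic, degrees of freedom, and two-sided p-value from the printed estimate and standard error. The variable names (`est`, `se`, `n`) are introduced here for illustration and are not part of the lab code.

```r
# Values copied from the summary(model_slr) output above
est <- -0.80424   # slope estimate for sleep_hrs
se  <- 0.08025    # standard error of the slope
n   <- 5000       # analytic sample size

# H0: beta_sleep = 0  vs.  HA: beta_sleep != 0
df_resid <- n - 2                                  # intercept and one slope estimated
t_stat   <- est / se                               # test statistic
p_value  <- 2 * pt(-abs(t_stat), df = df_resid)    # two-sided p-value

cat("t =", round(t_stat, 2), "on", df_resid, "df; p =", format.pval(p_value), "\n")
```

With t of about -10.02 on 4998 degrees of freedom and p < 2e-16, the null hypothesis is rejected at the 0.05 level: there is strong evidence of a negative linear association between sleep hours and mentally unhealthy days.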
3a. (5 pts) Fit three models:
Model A: menthlth_days ~ sleep_hrs
Model B: menthlth_days ~ sleep_hrs + age + sex
Model C: menthlth_days ~ sleep_hrs + age + sex + physhlth_days + income_cat + exercise
# Model A: Unadjusted
mA <- lm(menthlth_days ~ sleep_hrs, data = brfss_mlr)
# Model B: Add age and sex
mB <- lm(menthlth_days ~ sleep_hrs + age + sex, data = brfss_mlr)
# Model C: Full multivariable model
mC <- lm(menthlth_days ~ sleep_hrs + age + sex + physhlth_days + income_cat + exercise,
data = brfss_mlr)
3b. (10 pts) Create a table comparing the sleep coefficient (\(\hat{\beta}\), SE, 95% CI, p-value) across Models A, B, and C. Does the sleep coefficient change substantially when you add more covariates? What does this suggest about confounding?
# Compare the sleep_hrs coefficient across models
tribble(
~Model, ~`Mental Health Days β̂`, ~`SE` , ~`95% CI`, ~`p-value`,
"MA (unadjusted)", round(coef(mA)[2], 3),
round(summary(mA)$coefficients[2,2], 3),
paste0("(", round(confint(mA)[2,1],3), ", ", round(confint(mA)[2,2],3), ")"),
round(summary(mA)$coefficients[2,4], 3),
"MB (+age+sex)", round(coef(mB)[2], 3),
round(summary(mB)$coefficients[2,2], 3),
paste0("(", round(confint(mB)[2,1],3), ", ", round(confint(mB)[2,2],3), ")"),
round(summary(mB)$coefficients[2,4], 3),
"MC (full)", round(coef(mC)[2], 3),
round(summary(mC)$coefficients[2,2], 3),
paste0("(", round(confint(mC)[2,1],3), ", ", round(confint(mC)[2,2],3), ")"),
round(summary(mC)$coefficients[2,4], 3)
) %>%
kable(caption = "Table 3. Sleep Coefficient Across Sequential Models") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(0, bold = TRUE)
| Model | Mental Health Days β̂ | SE | 95% CI | p-value |
|---|---|---|---|---|
| MA (unadjusted) | -0.804 | 0.080 | (-0.962, -0.647) | 0 |
| MB (+age+sex) | -0.734 | 0.079 | (-0.889, -0.578) | 0 |
| MC (full) | -0.509 | 0.075 | (-0.657, -0.361) | 0 |
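Adding covariates moves the sleep coefficient toward zero. It changes little from Model A (-0.804) to Model B (-0.734), a shift of about 9%, but drops to -0.509 in Model C, an attenuation of about 37% relative to the unadjusted estimate. By the common 10% change-in-estimate rule of thumb this is a substantial change, and it suggests that part of the crude association between sleep and mental health reflects confounding by physical health, income, and exercise. Even so, the coefficient stays negative and its 95% CI excludes zero in all three models, so sleep remains independently associated with mentally unhealthy days after adjustment.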
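The size of the change in the sleep coefficient can be checked with a short calculation. This is a minimal sketch that uses the rounded coefficients from the comparison table; the vector name `b_sleep` is an illustrative choice, not something defined in the lab code.

```r
# Sleep coefficients copied from the comparison table above
b_sleep <- c(A = -0.804, B = -0.734, C = -0.509)

# Percent change relative to the unadjusted estimate (Model A);
# negative values mean the coefficient moved toward zero
pct_change <- 100 * (b_sleep - b_sleep["A"]) / b_sleep["A"]
round(pct_change, 1)
```

The change from Model A to Model B is under 10%, while the change from Model A to Model C is well above it, which is why the fuller adjustment set is the one that matters for confounding here.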
3c. (10 pts) For Model C, write out the full fitted regression equation and interpret every coefficient in plain language appropriate for a public health report.
##
## Call:
## lm(formula = menthlth_days ~ sleep_hrs + age + sex + physhlth_days +
## income_cat + exercise, data = brfss_mlr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.9192 -3.4262 -1.7803 0.2948 30.0568
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.475489 0.716959 17.401 < 2e-16 ***
## sleep_hrs -0.509160 0.075348 -6.757 1.57e-11 ***
## age -0.082307 0.005933 -13.872 < 2e-16 ***
## sexFemale 1.245053 0.202333 6.153 8.17e-10 ***
## physhlth_days 0.291657 0.013579 21.478 < 2e-16 ***
## income_cat -0.321323 0.052012 -6.178 7.02e-10 ***
## exerciseYes -0.342685 0.253138 -1.354 0.176
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.09 on 4993 degrees of freedom
## Multiple R-squared: 0.1569, Adjusted R-squared: 0.1559
## F-statistic: 154.9 on 6 and 4993 DF, p-value: < 2.2e-16
tidy(mC, conf.int = TRUE) %>%
mutate(
term = dplyr::recode(term,
"(Intercept)" = "Intercept",
"menthlth_days" = "Poor mental health days",
"sleep_hrs" = "Sleep (hours/night)",
"age" = "Age (years)",
"physhlth_days" = "Physically unhealthy days",
"income_cat" = "Income (ordinal 1-8)",
"sexFemale" = "Sex: Female (ref = Male)",
"exerciseYes" = "Exercise: Yes (ref = No)"
),
across(where(is.numeric), ~ round(., 4))
) %>%
kable(
caption = "Table 3. Multiple Linear Regression: Mentally Unhealthy Days ~ Multiple Predictors (BRFSS 2020, n = 5,000)",
col.names = c("Term", "Estimate (β̂)", "Std. Error", "t-statistic",
"p-value", "95% CI Lower", "95% CI Upper")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(0, bold = TRUE) %>%
row_spec(c(2, 3), background = "#EBF5FB") # highlight key predictors
| Term | Estimate (β̂) | Std. Error | t-statistic | p-value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|
| Intercept | 12.4755 | 0.7170 | 17.4006 | 0.0000 | 11.0699 | 13.8810 |
| Sleep (hours/night) | -0.5092 | 0.0753 | -6.7574 | 0.0000 | -0.6569 | -0.3614 |
| Age (years) | -0.0823 | 0.0059 | -13.8724 | 0.0000 | -0.0939 | -0.0707 |
| Sex: Female (ref = Male) | 1.2451 | 0.2023 | 6.1535 | 0.0000 | 0.8484 | 1.6417 |
| Physically unhealthy days | 0.2917 | 0.0136 | 21.4779 | 0.0000 | 0.2650 | 0.3183 |
| Income (ordinal 1-8) | -0.3213 | 0.0520 | -6.1778 | 0.0000 | -0.4233 | -0.2194 |
| Exercise: Yes (ref = No) | -0.3427 | 0.2531 | -1.3537 | 0.1759 | -0.8389 | 0.1536 |
Fitted regression equation: \(\widehat{\text{Mental Health Days}} = 12.4755 - 0.5092(\text{Sleep hrs}) - 0.0823(\text{Age}) + 1.2451(\text{Female}) + 0.2917(\text{Phys. health days}) - 0.3213(\text{Income}) - 0.3427(\text{Exercise: Yes})\)
Interpretation: Sleep hrs (𝛽̂= -0.5092): Each additional hour of sleep per night is associated with an estimated 0.509 fewer mentally unhealthy days on average, adjusting for all other covariates (95% CI: -0.657 to -0.361).
Age (𝛽̂ = -0.0823): Each additional year of age is associated with an estimated 0.082 fewer mentally unhealthy days on average, holding all else constant. This counterintuitive finding is well documented: older adults often report fewer mental health difficulties, possibly due to better emotion regulation, survivor bias, or cohort effects.
Sex: Female (𝛽̂ = 1.2451): Compared to males (the reference group), females report an estimated 1.245 more mentally unhealthy days on average, holding all other variables constant.
Physical health days (𝛽̂ = 0.2917): Each additional day of poor physical health is associated with an estimated 0.292 additional mentally unhealthy day on average, holding sleep, age, income, sex, and exercise constant (95% CI: 0.265 to 0.318). This is the strongest predictor in the model and is consistent with the bidirectional mind-body connection documented in the literature.
Income category (𝛽̂ = -0.3213): Each one-unit increase in the income category (on the 1-8 ordinal scale) is associated with an estimated 0.321 fewer mentally unhealthy days on average, holding all other variables constant, consistent with the well-established socioeconomic gradient in mental health.
Exercise: Yes (𝛽̂ = -0.3427): People who engaged in any physical activity in the past 30 days report an estimated 0.343 fewer mentally unhealthy days than those who did not, adjusting for all other covariates. This difference is not statistically significant (p = 0.176; 95% CI: -0.839 to 0.154).
Intercept (𝛽̂0 = 12.4755): The estimated mean number of mentally unhealthy days for a male who does not exercise and has zero sleep hours, zero physically unhealthy days, age = 0, and income category = 0. This is a mathematical artifact, not a meaningful quantity in context.
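Because the intercept describes an impossible person, it can help to plug a realistic profile into the fitted equation instead. The sketch below does this by hand with the Model C coefficients from the summary output. The respondent profile is hypothetical and chosen only for illustration, and the object names (`b`, `x`, `pred_days`) are not part of the lab code.

```r
# Coefficients copied from the Model C summary output above
b <- c(intercept = 12.475489, sleep_hrs = -0.509160, age = -0.082307,
       sexFemale = 1.245053, physhlth_days = 0.291657,
       income_cat = -0.321323, exerciseYes = -0.342685)

# Hypothetical respondent (illustrative values, not taken from the data):
# a 40-year-old woman who sleeps 7 hours, reports 0 physically unhealthy
# days, is in income category 5, and reports exercising
x <- c(intercept = 1, sleep_hrs = 7, age = 40, sexFemale = 1,
       physhlth_days = 0, income_cat = 5, exerciseYes = 1)

# Predicted mentally unhealthy days = sum of coefficient * covariate value
pred_days <- sum(b * x[names(b)])
round(pred_days, 2)
```

In a live session with the model object available, `predict(mC, newdata = ...)` would give the same quantity without copying coefficients by hand.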
4a. (5 pts) Report \(R^2\) and Adjusted \(R^2\) for each of the three models (A, B, C). Create a table. Which model explains the most variance in mental health days?
tribble(
~Model, ~Predictors, ~R2, ~`Adj. R²`,
"MA: menthlth_days ~ sleep_hrs", 1, round(summary(mA)$r.squared, 4), round(summary(mA)$adj.r.squared, 4),
"MB: + age + sex", 3, round(summary(mB)$r.squared, 4), round(summary(mB)$adj.r.squared, 4),
"MC: Full (+ age + sex + physhlth_days + income_cat + exercise)", 6,
round(summary(mC)$r.squared, 4), round(summary(mC)$adj.r.squared, 4)
) %>%
kable(caption = "Table 4. R² and Adjusted R² Across Sequential Models") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(3, bold = TRUE, background = "#EBF5FB")
| Model | Predictors | R2 | Adj. R² |
|---|---|---|---|
| MA: menthlth_days ~ sleep_hrs | 1 | 0.0197 | 0.0195 |
| MB: + age + sex | 3 | 0.0504 | 0.0498 |
| MC: Full (+ age + sex + physhlth_days + income_cat + exercise) | 6 | 0.1569 | 0.1559 |
Model A: R² = 0.0197, adjusted R² = 0.0195. Model B: R² = 0.0504, adjusted R² = 0.0498. Model C: R² = 0.1569, adjusted R² = 0.1559.
Model C explains the most variance in mental health days, having the highest R² and adjusted R² values.
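The adjusted values can be reproduced from R² with the usual formula \(R^2_{adj} = 1 - (1 - R^2)\frac{n - 1}{n - p - 1}\), where \(p\) is the number of slope terms. A minimal sketch, using the rounded R² values reported above (the object names are illustrative):

```r
n  <- 5000
r2 <- c(A = 0.0197, B = 0.0504, C = 0.1569)  # R-squared values reported above
p  <- c(A = 1, B = 3, C = 6)                 # slope terms: A = sleep; B adds age, sex;
                                             # C adds physhlth_days, income_cat, exercise

# The penalty factor (n - 1) / (n - p - 1) grows with the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
round(adj_r2, 4)
```

With n = 5,000 the penalty for extra predictors is tiny, which is why R² and adjusted R² are nearly identical in every model.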
4b. (5 pts) What is the Root MSE for Model C? Interpret it in practical terms — what does it tell you about prediction accuracy?
rmse_mC <- round(summary(mC)$sigma, 2)
cat("Root MSE (Model C):", rmse_mC, "mentally unhealthy days\n")
## Root MSE (Model C): 7.09 mentally unhealthy days
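The Root MSE for Model C is 7.09 mentally unhealthy days. It is the typical size of a residual: the model’s predicted number of mentally unhealthy days typically differs from the reported value by about 7.1 days. Relative to an outcome that ranges from 0 to 30 with a mean of 3.8 days (Table 1), that is a large error, so Model C is not accurate for predicting an individual’s mentally unhealthy days.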
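One rough way to put the Root MSE in context is to compare it with the standard deviation of the outcome, which is the typical error made by predicting the sample mean for everyone. The sketch below uses the rounded values reported in this document; `rmse_c` and `sd_outcome` are illustrative names.

```r
rmse_c     <- 7.09   # Root MSE for Model C (from the output above)
sd_outcome <- 7.7    # SD of menthlth_days (Table 1, rounded)

# Proportional reduction in typical prediction error compared with
# predicting the mean for every respondent
reduction <- 1 - rmse_c / sd_outcome
round(100 * reduction, 1)
```

The reduction is only about 8%, which matches the modest R² for Model C: the predictors are clearly associated with the outcome, but individual predictions remain imprecise.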
4c. (10 pts) Using the ANOVA output for Model C,
fill in the following table manually (i.e., compute the values using the
output from anova() or glance()):
## Analysis of Variance Table
##
## Response: menthlth_days
## Df Sum Sq Mean Sq F value Pr(>F)
## sleep_hrs 1 5865 5864.8 116.6678 < 2.2e-16 ***
## age 1 6182 6182.2 122.9832 < 2.2e-16 ***
## sex 1 2947 2947.1 58.6266 2.274e-14 ***
## physhlth_days 1 29456 29455.5 585.9585 < 2.2e-16 ***
## income_cat 1 2177 2176.8 43.3031 5.169e-11 ***
## exercise 1 92 92.1 1.8326 0.1759
## Residuals 4993 250993 50.3
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
glance(mC) %>%
select(r.squared, adj.r.squared, sigma, statistic, p.value, df, df.residual, nobs) %>%
mutate(across(where(is.numeric), ~ round(., 4))) %>%
pivot_longer(everything(), names_to = "Statistic", values_to = "Value") %>%
mutate(Statistic = dplyr::recode(Statistic,
"r.squared" = "R²",
"adj.r.squared" = "Adjusted R²",
"sigma" = "Residual Std. Error (Root MSE)",
"statistic" = "F-statistic",
"p.value" = "p-value (overall F-test)",
"df" = "Model df (p)",
"df.residual" = "Residual df (n − p − 1)",
"nobs" = "n (observations)"
)) %>%
kable(caption = "Table 4c. Overall Model Summary — Model C") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
| Statistic | Value |
|---|---|
| R² | 0.1569 |
| Adjusted R² | 0.1559 |
| Residual Std. Error (Root MSE) | 7.0901 |
| F-statistic | 154.8953 |
| p-value (overall F-test) | 0.0000 |
| Model df (p) | 6.0000 |
| Residual df (n − p − 1) | 4993.0000 |
| n (observations) | 5000.0000 |
| Source | df | SS | MS | F |
|---|---|---|---|---|
| Model | 6 | 46,718.5 | 7,786.4 | 154.9 |
| Residual | 4993 | 250,993.0 | 50.3 | |
| Total | 4999 | 297,711.5 | | |
State the null hypothesis for the overall F-test and your conclusion.
Null hypothesis: all six slope coefficients in Model C equal zero, \(H_0: \beta_1 = \beta_2 = \dots = \beta_6 = 0\).
Alternative hypothesis: at least one slope coefficient in Model C is not zero.
Conclusion: F = 154.9 on 6 and 4993 degrees of freedom, p < 2.2e-16. Since the p-value is below 0.05, we reject the null hypothesis. There is strong statistical evidence that at least one predictor is associated with the number of mentally unhealthy days.
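The Model, Residual, and Total rows of the ANOVA table can be rebuilt from the sequential `anova()` output by summing the per-term sums of squares. The sketch below hard-codes the printed values so it runs on its own; the object names are illustrative, and small rounding differences from the exact fit are expected.

```r
# Sequential sums of squares copied from the anova() output for Model C
# (the Mean Sq column is used because each term has 1 df and it shows one more decimal)
ss_terms <- c(sleep_hrs = 5864.8, age = 6182.2, sex = 2947.1,
              physhlth_days = 29455.5, income_cat = 2176.8, exercise = 92.1)
ss_resid <- 250993
df_model <- length(ss_terms)   # one df per predictor
df_resid <- 4993

ss_model <- sum(ss_terms)             # model sum of squares
ms_model <- ss_model / df_model       # model mean square
ms_resid <- ss_resid / df_resid       # residual mean square (MSE)
f_stat   <- ms_model / ms_resid       # overall F statistic
ss_total <- ss_model + ss_resid       # total sum of squares

round(c(SS_model = ss_model, MS_model = ms_model, MS_resid = ms_resid,
        F = f_stat, SS_total = ss_total, R2 = ss_model / ss_total), 4)
```

As a consistency check, the F statistic and the ratio SS_model / SS_total computed this way should agree with the F-statistic and R² reported by `glance(mC)`.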
5a. (5 pts) For Model C, produce the four standard diagnostic plots (Residuals vs. Fitted, Normal Q-Q, Scale-Location, Cook’s Distance). Comment on what each plot tells you about the LINE assumptions.
Standard Residual Diagnostic Plots
Residuals vs. Fitted: The points show a distinct downward trend, and the red smoothing line starts out relatively flat but gradually drifts upward. The model is not fully capturing the structure of the data, so the linearity assumption is not satisfied.
Normal Q-Q: The points deviate markedly from the dashed reference line in both tails. The residuals are not normally distributed, so the normality assumption is violated.
Scale-Location: The points form a V shape rather than a random band, and the red line rises as the fitted values increase. This indicates heteroscedasticity, so the equal variance assumption is violated.
Cook’s Distance: A few observations have noticeably larger Cook’s distances than the rest, so there are some potentially influential observations, but not enough to severely distort the model.
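The code that produced the four panels is not shown in this write-up. A minimal sketch using base R's `plot()` method for `lm` objects is given below. It uses the lab's `mC` when that object exists in the session and otherwise falls back to a stand-in model on the built-in `mtcars` data, which is an assumption made only so the example runs on its own.

```r
# Use Model C if available; otherwise a stand-in fit (assumption for illustration)
fit <- if (exists("mC")) mC else lm(mpg ~ wt + hp, data = mtcars)

op <- par(mfrow = c(2, 2))   # arrange the four panels in a 2 x 2 grid
plot(fit, which = 1:4)       # 1 = Residuals vs Fitted, 2 = Normal Q-Q,
                             # 3 = Scale-Location,      4 = Cook's distance
par(op)                      # restore the previous graphics settings
```

Passing `which = 1:4` requests the Cook's distance panel explicitly; the default set of panels replaces it with Residuals vs Leverage.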
5b. (5 pts) Given the nature of the outcome (mental health days, bounded at 0 and 30, heavily right-skewed), which assumptions are most likely to be violated? Does this invalidate the analysis? Explain.
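Normality of the residuals and equal variance are the assumptions most likely to be violated: a heavily right-skewed outcome with a floor at 0 and a ceiling at 30 cannot produce symmetric, constant-variance residuals. This does not necessarily invalidate the analysis. OLS coefficient estimates do not need normal residuals to be unbiased, and with n = 5,000 the Central Limit Theorem makes the sampling distributions of the coefficients approximately normal, so t-tests and confidence intervals remain approximately valid. Heteroscedasticity can still distort the standard errors, so a log transformation or negative binomial regression could be considered for a more formal analysis.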
5c. (5 pts) Identify any observations with Cook’s Distance > 1. How many are there? What would you do with them in a real analysis?
There are 0 observations with Cook’s Distance > 1; the largest values on the plot are only about 0.014, so no observation requires action on these grounds. If an observation did exceed 1, in a real analysis you would first check whether it reflects a data entry or coding error, then refit the model with and without it and report whether the conclusions change.
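Reading the count off the plot works here, but it can also be computed directly with `cooks.distance()`. The sketch below counts observations above the threshold of 1 named in the question and, as an additional and stricter rule of thumb that is not part of the lab, above 4/n. As an assumption for self-containment, it falls back to a stand-in model on `mtcars` when `mC` is not in the session.

```r
# Use Model C if available; otherwise a stand-in fit (assumption for illustration)
fit <- if (exists("mC")) mC else lm(mpg ~ wt + hp, data = mtcars)

cooks_d <- cooks.distance(fit)   # one value per observation used in the fit

n_above_1  <- sum(cooks_d > 1)                    # threshold named in 5c
n_above_4n <- sum(cooks_d > 4 / length(cooks_d))  # stricter 4/n rule of thumb

cat("Max Cook's D:", round(max(cooks_d), 4), "\n")
cat("Observations with D > 1:", n_above_1, "\n")
cat("Observations with D > 4/n:", n_above_4n, "\n")

# Largest values, to inspect before deciding how to handle them
head(sort(cooks_d, decreasing = TRUE), 5)
```

Observations flagged by the 4/n rule are candidates for inspection rather than automatic removal; a sensitivity analysis with and without them is the usual next step.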
Suppose you are writing the results section of a public health paper. Write a 3–4 sentence paragraph summarizing the findings from Model C for a non-statistical audience. Your paragraph should:
In this sample of 5,000 U.S. adults, more sleep and fewer days of poor physical health were each linked to fewer days of poor mental health: each additional hour of sleep per night was associated with about 0.5 fewer mentally unhealthy days per month, and each additional physically unhealthy day with about 0.3 more. On average, women reported about 1.25 more mentally unhealthy days than men, adults with lower incomes reported more mentally unhealthy days than those with higher incomes, and older adults reported fewer than younger adults. Reporting any exercise in the past 30 days was not clearly associated with mentally unhealthy days once the other factors were taken into account. Because the survey captures a single point in time, these findings describe associations and cannot show that one factor causes another.
End of Lab Activity