Descriptive Statistics
brfss_mlr %>%
select(menthlth_days, physhlth_days, sleep_hrs, age,
income_f, sex, exercise) %>%
tbl_summary(
label = list(
menthlth_days ~ "Mentally unhealthy days (past 30)",
physhlth_days ~ "Physically unhealthy days (past 30)",
sleep_hrs ~ "Sleep (hours/night)",
age ~ "Age (years)",
income_f ~ "Household income",
sex ~ "Sex",
exercise ~ "Any physical activity (past 30 days)"
),
statistic = list(
all_continuous() ~ "{mean} ({sd})",
all_categorical() ~ "{n} ({p}%)"
),
digits = all_continuous() ~ 1,
missing = "no"
) %>%
add_n() %>%
bold_labels() %>%
modify_caption("**Table 1. Descriptive Statistics — BRFSS 2020 Analytic Sample (n = 5,000)**")
Table 1. Descriptive Statistics — BRFSS 2020 Analytic Sample (n = 5,000)
| Characteristic |
N |
N = 5,000 |
| Mentally unhealthy days (past 30) |
5,000 |
3.8 (7.7) |
| Physically unhealthy days (past 30) |
5,000 |
3.3 (7.8) |
| Sleep (hours/night) |
5,000 |
7.1 (1.3) |
| Age (years) |
5,000 |
54.3 (17.2) |
| Household income |
5,000 |
|
| <$10k |
|
190 (3.8%) |
| $10-15k |
|
169 (3.4%) |
| $15-20k |
|
312 (6.2%) |
| $20-25k |
|
434 (8.7%) |
| $25-35k |
|
489 (9.8%) |
| $35-50k |
|
683 (14%) |
| $50-75k |
|
841 (17%) |
| >$75k |
|
1,882 (38%) |
| Sex |
5,000 |
|
| Male |
|
2,331 (47%) |
| Female |
|
2,669 (53%) |
| Any physical activity (past 30 days) |
5,000 |
3,874 (77%) |
1b. (5 pts) Create a histogram of
menthlth_days. Describe the shape of the distribution. Is
it symmetric, right-skewed, or left-skewed? What are the implications of
this shape for regression modeling?
p_hist <- ggplot(brfss_mlr, aes(x = menthlth_days)) +
geom_histogram(binwidth = 1, fill = "steelblue", color = "white", alpha = 0.85) +
labs(
title = "Distribution of Mentally Unhealthy Days in the Past 30 Days",
subtitle = "BRFSS 2020 Analytic Sample (n = 5,000)",
x = "Number of Mentally Unhealthy Days",
y = "Count"
) +
theme_minimal(base_size = 13)
ggplotly(p_hist)
| The histogram appears to right-skewed. Since the data are count
outcomes and zero is a common reported answer and can extend all the way
to 30. This suggests that the residuals may not be normally distributed
and the variance may not be constant and is worth keeping track of as we
continue with the models later. |
1c. (5 pts) Create a scatterplot matrix (using
GGally::ggpairs() or similar) for the continuous variables:
menthlth_days, physhlth_days,
sleep_hrs, and age. Comment on the direction
and strength of each pairwise correlation with the outcome.
# Pairs plot of continuous predictors vs outcome
brfss_mlr %>%
select(menthlth_days, physhlth_days, sleep_hrs, age) %>%
rename(
`Mental Health\nDays` = menthlth_days,
`Physical Health\nDays` = physhlth_days,
`Sleep\n(hrs)` = sleep_hrs,
Age = age
) %>%
ggpairs(
lower = list(continuous = wrap("points", alpha = 0.05, size = 0.5)),
diag = list(continuous = wrap("densityDiag", fill = "steelblue", alpha = 0.5)),
upper = list(continuous = wrap("cor", size = 4)),
title = "Pairwise Relationships Among Key Variables (BRFSS 2020)"
) +
theme_minimal(base_size = 11)

| Overall, none of the correlations are strong. Physical health days
and mental health days shows a moderate positive correlation of 0.315.
Sleep in hours and mental health days demonstrate a weak negative
correlation of -0.140. Sleep in hours and physical health days show a
weak negative correlation of -0.112. Age and mental health days
demonstrate a weak negative correlation of -0.156. Age and physical
health days show a weak positive correlation of 0.093. Finally, age and
sleep in hours demonstrates a weak positive correlation of 0.089. |
Task 2: Unadjusted (Simple) Linear Regression (15 points)
2a. (5 pts) Fit a simple linear regression model
regressing menthlth_days on sleep_hrs alone.
Write out the fitted regression equation.
# Model A: Unadjusted mental health and sleep hours
modela <- lm(menthlth_days ~ sleep_hrs, data = brfss_mlr)
| Write the fitted regression equation in the form |
| \(\widehat{Mental Health Days} = 9.4743 +
(-.8042) \times\ Hours of Sleep\). |
tidy(modela, conf.int = TRUE) %>%
mutate(
term = dplyr::recode(term,
"(Intercept)" = "Intercept",
"sleep_hrs" = "Sleep (hours/night)"
),
across(where(is.numeric), ~ round(., 4))
) %>%
kable(
caption = "Table 2. Multiple Linear Regression: Mentally Unhealthy Days ~ Sleep (BRFSS 2020, n = 5,000)",
col.names = c("Term", "Estimate", "Std. Error", "t-statistic", "p-value", "95% CI Lower", "95% CI Upper"),
format = "html"
) %>%
kable_styling(bootstrap_options = c("striped", "hover"),
full_width = TRUE) %>%
row_spec(0, bold = TRUE)
Table 2. Multiple Linear Regression: Mentally Unhealthy Days ~ Sleep
(BRFSS 2020, n = 5,000)
|
Term
|
Estimate
|
Std. Error
|
t-statistic
|
p-value
|
95% CI Lower
|
95% CI Upper
|
|
Intercept
|
9.4743
|
0.5771
|
16.4165
|
0
|
8.3429
|
10.6057
|
|
Sleep (hours/night)
|
-0.8042
|
0.0802
|
-10.0218
|
0
|
-0.9616
|
-0.6469
|
# Anova to see degrees of freedom
anova_modela <- anova(modela)
anova_modela %>%
kable(
caption = "ANOVA Table: Phys_days ~ BMI",
digits = 3,
col.names = c("Source", "Df", "Sum Sq", "Mean Sq", "F value", "Pr(>F)")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
ANOVA Table: Phys_days ~ BMI
|
Source
|
Df
|
Sum Sq
|
Mean Sq
|
F value
|
Pr(>F)
|
|
sleep_hrs
|
1
|
5864.765
|
5864.765
|
100.437
|
0
|
|
Residuals
|
4998
|
291846.575
|
58.393
|
NA
|
NA
|
2b. (5 pts) Interpret the slope for sleep in a
single sentence appropriate for a public health audience (no statistical
jargon).
Each additional hour of sleep per night (𝛽̂ = -0.8042) is associated
with an estimated 0.8042 fewer mentally unhealthy days on average (95%
CI: -0.9616 to -0.6469).
2c. (5 pts) State the null and alternative
hypotheses for the slope, report the t-statistic and p-value, and state
your conclusion. What is the degree of freedom for this test?
Null Hypothesis (Ho): No association between mental health days and
sleep hours Alternative Hypothesis (H1): There is an association between
mental health days and sleep hours. The t-statistic is -10.0218 and the
p-value is < .001 suggesting there is enough evidence to reject the
null hypothesis and consider the evidence that there is an association
between mental health and sleep hours. The degress of freedom for this
test is one in the numerator and 4998 in the denominator.
Task 3: Building the Multivariable Model (25 points)
3a. (5 pts) Fit three models:
- Model A:
menthlth_days ~ sleep_hrs
- Model B:
menthlth_days ~ sleep_hrs + age + sex
- Model C:
menthlth_days ~ sleep_hrs + age + sex + physhlth_days + income_cat + exercise
# Model A: Unadjusted
modela <- lm(menthlth_days ~ sleep_hrs, data = brfss_mlr)
# Model B: Add sleep
modelb <- lm(menthlth_days ~ sleep_hrs + age + sex, data = brfss_mlr)
# Model C: Full multivariable model
modelc <- lm(menthlth_days ~ sleep_hrs + age + sex + physhlth_days + income_cat + exercise,
data = brfss_mlr)
3b. (10 pts) Create a table comparing the sleep
coefficient (\(\hat{\beta}\), SE, 95%
CI, p-value) across Models A, B, and C. Does the sleep coefficient
change substantially when you add more covariates? What does this
suggest about confounding?
tidy(modelb, conf.int = TRUE) %>%
mutate(
term = dplyr::recode(term,
"(Intercept)" = "Intercept",
"sleep_hrs" = "Sleep (hours/night)",
"age" = "Age (years)",
"sexFemale" = "Sex: Female (ref = Male)"
),
across(where(is.numeric), ~ round(., 4))
) %>%
kable(
caption = "Table 3. Multiple Linear Regression: Mentally Unhealthy Days ~ Sleep, Age, Sex (BRFSS 2020, n = 5,000)",
col.names = c("Term", "Estimate", "Std. Error", "t-statistic", "p-value", "95% CI Lower", "95% CI Upper"),
format = "html"
) %>%
kable_styling(bootstrap_options = c("striped", "hover"),
full_width = TRUE) %>%
row_spec(0, bold = TRUE)
Table 3. Multiple Linear Regression: Mentally Unhealthy Days ~ Sleep,
Age, Sex (BRFSS 2020, n = 5,000)
|
Term
|
Estimate
|
Std. Error
|
t-statistic
|
p-value
|
95% CI Lower
|
95% CI Upper
|
|
Intercept
|
11.7656
|
0.6445
|
18.2563
|
0
|
10.5022
|
13.0291
|
|
Sleep (hours/night)
|
-0.7339
|
0.0793
|
-9.2528
|
0
|
-0.8894
|
-0.5784
|
|
Age (years)
|
-0.0665
|
0.0062
|
-10.6877
|
0
|
-0.0787
|
-0.0543
|
|
Sex: Female (ref = Male)
|
1.5399
|
0.2134
|
7.2166
|
0
|
1.1216
|
1.9583
|
tidy(modelc, conf.int = TRUE) %>%
mutate(
term = dplyr::recode(term,
"(Intercept)" = "Intercept",
"physhlth_days" = "Physical health days",
"sleep_hrs" = "Sleep (hours/night)",
"age" = "Age (years)",
"income_cat" = "Income (ordinal 1-8)",
"sexFemale" = "Sex: Female (ref = Male)",
"exerciseYes" = "Exercise: Yes (ref = No)"
),
across(where(is.numeric), ~ round(., 4))
) %>%
kable(
caption = "Table 4. Multiple Linear Regression: Mentally Unhealthy Days ~ Multiple Predictors (BRFSS 2020, n = 5,000)",
col.names = c("Term", "Estimate", "Std. Error", "t-statistic", "p-value", "95% CI Lower", "95% CI Upper"),
format = "html"
) %>%
kable_styling(bootstrap_options = c("striped", "hover"),
full_width = TRUE) %>%
row_spec(0, bold = TRUE)
Table 4. Multiple Linear Regression: Mentally Unhealthy Days ~ Multiple
Predictors (BRFSS 2020, n = 5,000)
|
Term
|
Estimate
|
Std. Error
|
t-statistic
|
p-value
|
95% CI Lower
|
95% CI Upper
|
|
Intercept
|
12.4755
|
0.7170
|
17.4006
|
0.0000
|
11.0699
|
13.8810
|
|
Sleep (hours/night)
|
-0.5092
|
0.0753
|
-6.7574
|
0.0000
|
-0.6569
|
-0.3614
|
|
Age (years)
|
-0.0823
|
0.0059
|
-13.8724
|
0.0000
|
-0.0939
|
-0.0707
|
|
Sex: Female (ref = Male)
|
1.2451
|
0.2023
|
6.1535
|
0.0000
|
0.8484
|
1.6417
|
|
Physical health days
|
0.2917
|
0.0136
|
21.4779
|
0.0000
|
0.2650
|
0.3183
|
|
Income (ordinal 1-8)
|
-0.3213
|
0.0520
|
-6.1778
|
0.0000
|
-0.4233
|
-0.2194
|
|
Exercise: Yes (ref = No)
|
-0.3427
|
0.2531
|
-1.3537
|
0.1759
|
-0.8389
|
0.1536
|
# Compare the sleep_hrs coefficient across models
tribble(
~Model, ~`sleep_hrs β̂`, ~`95% CI`, ~`Adj. R²`,
"Model A (unadjusted sleep)", round(coef(modela)[2], 3),
paste0("(", round(confint(modela)[2,1],3), ", ", round(confint(modela)[2,2],3), ")"),
round(summary(modela)$adj.r.squared, 3),
"Model B (+ age, sex)", round(coef(modelb)[2], 3),
paste0("(", round(confint(modelb)[2,1],3), ", ", round(confint(modelb)[2,2],3), ")"),
round(summary(modelb)$adj.r.squared, 3),
"Model C (full)", round(coef(modelc)[2], 3),
paste0("(", round(confint(modelc)[2,1],3), ", ", round(confint(modelc)[2,2],3), ")"),
round(summary(modelc)$adj.r.squared, 3)
) %>%
kable(caption = "Table 5. Sleep Hours Coefficient Across Sequential Models") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(0, bold = TRUE)
Table 5. Sleep Hours Coefficient Across Sequential Models
|
Model
|
sleep_hrs β
|
|95% CI
|
Adj. R
|
|
Model A (unadjusted sleep)
|
-0.804
|
(-0.962, -0.647)
|
0.020
|
|
Model B (+ age, sex)
|
-0.734
|
(-0.889, -0.578)
|
0.050
|
|
Model C (full)
|
-0.509
|
(-0.657, -0.361)
|
0.156
|
| The sleep coefficient does change when we add more covariates. In
Model A with only sleep, the coefficient is -0.804. When we added age
and sex as covariates the coefficient gets higher -0.734. Finally when
we have the full model, the coefficient is its highest at -0.509. This
suggests that Model A was overestimating the association due to
confounding because when those other covariates are controlled for such
as in the full Model C, the coefficient is closer to 0. |
3c. (10 pts) For Model C, write out
the full fitted regression equation and interpret every
coefficient in plain language appropriate for a public health
report.
\(\widehat{Mental Health Days} = 12.4755 +
-0.5092(Sleep) + -0.0823(Age) + 1.2451(Female) + 0.2917(Physical Health
Days) + -0.3213(Income) + -0.3427(Exercise:Yes)\).
Sleep hours (𝛽̂ = -0.5092): Each additional hour of sleep per night is
associated with an estimated 0.5092 fewer mentally unhealthy days on
average, adjusting for all other covariates (95% CI: -0.6569 to
-0.3614). The negative sign suggests that sleep hours could cause less
bad mental health days.
Age (𝛽̂ = -0.0823): Each additional year of age is associated with
0.082 fewer mentally unhealthy days on average (holding all else
constant). This suggests that as individuals get older, they possibly
experience less mentally unhealthy days on average.
Income category (𝛽̂ = -0.3213): Each one-unit increase in the income
category (on the 1–8 ordinal scale) is associated with 0.321 fewer
mentally unhealthy days on average. This findings seems correct when
considering that individuals who are more financially comfortable may
not experience as many mental health days.
Sex: Female (𝛽̂ = 1.2451): Compared to males (the reference group),
females report an estimated 1.245 more mentally unhealthy days on
average, holding all other variables constant.
Exercise: Yes (𝛽̂ = -0.3427): People who engaged in any physical
activity in the past 30 days report an estimated 0.343 fewer mentally
unhealthy days compared to those who did not exercise, when adjusting
for all other covariates.
Intercept (𝛽̂ 0 = 12.4755): The estimated mean mental health days for
a person with zero physically unhealthy days, zero sleep hours, age = 0,
income category = 0, who is male and does not exercise. This is not a
meaningful quantity in the current context because these qualifiers are
nearly impossible to meet. —
Task 4: Model Fit and ANOVA (20 points)
4a. (5 pts) Report \(R^2\) and Adjusted \(R^2\) for each of the three models (A, B,
C). Create a table. Which model explains the most variance in mental
health days?
tribble(
~Model, ~Predictors, ~R2, ~`Adj. R²`,
"Model A: Sleep Hours Only", 1, round(summary(modela)$r.squared, 4), round(summary(modela)$adj.r.squared, 4),
"Model B: + Age + Sex", 2, round(summary(modelb)$r.squared, 4), round(summary(modelb)$adj.r.squared, 4),
"Model C: Full (+ physical health, income, exercise)", 6,
round(summary(modelc)$r.squared, 4), round(summary(modelc)$adj.r.squared, 4)
) %>%
kable(caption = "Table 6. R² and Adjusted R² Across Sequential Models") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(3, bold = TRUE, background = "#EBF5FB")
Table 6. R² and Adjusted R² Across Sequential Models
|
Model
|
Predictors
|
R2
|
Adj. R²
|
|
Model A: Sleep Hours Only
|
1
|
0.0197
|
0.0195
|
|
Model B: + Age + Sex
|
2
|
0.0504
|
0.0498
|
|
Model C: Full (+ physical health, income, exercise)
|
6
|
0.1569
|
0.1559
|
| Model C explains the most variance in mental health days with a R²
of 0.1569 and an adjusted R² of 0.1559 whereas Model B had a R² of
0.0504 and adjusted R² of 0.0498 and finally Model A had an R² of 0.0197
and an adjusted R² of 0.0195. |
4b. (5 pts) What is the Root MSE for Model C?
Interpret it in practical terms — what does it tell you about prediction
accuracy?
rmse_modelc <- round(summary(modelc)$sigma, 2)
cat("Root MSE (Model C):", rmse_modelc, "mentally unhealthy days\n")
## Root MSE (Model C): 7.09 mentally unhealthy days
| The root MSE for Model C is 7.09. This suggests in practical terms
that Model C’s prediction is off, on average, by 7.09 mentally unhealthy
days. |
4c. (10 pts) Using the ANOVA output for Model C,
fill in the following table manually (i.e., compute the values using the
output from anova() or glance()):
# (a) Display the ANOVA table
anova_modelc <- anova(modelc)
anova_modelc %>%
kable(
caption = "ANOVA Table: Model C ",
digits = 3,
col.names = c("Source", "Df", "Sum Sq", "Mean Sq", "F value", "Pr(>F)")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
ANOVA Table: Model C
|
Source
|
Df
|
Sum Sq
|
Mean Sq
|
F value
|
Pr(>F)
|
|
sleep_hrs
|
1
|
5864.765
|
5864.765
|
116.668
|
0.000
|
|
age
|
1
|
6182.237
|
6182.237
|
122.983
|
0.000
|
|
sex
|
1
|
2947.099
|
2947.099
|
58.627
|
0.000
|
|
physhlth_days
|
1
|
29455.513
|
29455.513
|
585.959
|
0.000
|
|
income_cat
|
1
|
2176.799
|
2176.799
|
43.303
|
0.000
|
|
exercise
|
1
|
92.124
|
92.124
|
1.833
|
0.176
|
|
Residuals
|
4993
|
250992.804
|
50.269
|
NA
|
NA
|
g <- glance(modelc)
# Extract values
k <- g$df
df_res <- g$df.residual
sigma <- g$sigma
R2 <- g$r.squared
F_val <- g$statistic
p_val <- g$p.value
# Compute components
MSE <- sigma^2
SSE <- MSE * df_res
SST <- SSE / (1 - R2)
SSR <- SST - SSE
MSR <- SSR / k
F_manual <- MSR / MSE
# Print ANOVA table
anova_table <- data.frame(
Source = c("Regression", "Residuals", "Total"),
DF = c(k, df_res, k + df_res),
SS = c(SSR, SSE, SST),
MS = c(MSR, MSE, NA),
F = c(F_manual, NA, NA)
)
anova_table
## Source DF SS MS F
## 1 Regression 6 46718.54 7786.42287 154.8953
## 2 Residuals 4993 250992.80 50.26894 NA
## 3 Total 4999 297711.34 NA NA
# View model summary to see the F-Test results
summary(modelc)
##
## Call:
## lm(formula = menthlth_days ~ sleep_hrs + age + sex + physhlth_days +
## income_cat + exercise, data = brfss_mlr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.9192 -3.4262 -1.7803 0.2948 30.0568
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.475489 0.716959 17.401 < 2e-16 ***
## sleep_hrs -0.509160 0.075348 -6.757 1.57e-11 ***
## age -0.082307 0.005933 -13.872 < 2e-16 ***
## sexFemale 1.245053 0.202333 6.153 8.17e-10 ***
## physhlth_days 0.291657 0.013579 21.478 < 2e-16 ***
## income_cat -0.321323 0.052012 -6.178 7.02e-10 ***
## exerciseYes -0.342685 0.253138 -1.354 0.176
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.09 on 4993 degrees of freedom
## Multiple R-squared: 0.1569, Adjusted R-squared: 0.1559
## F-statistic: 154.9 on 6 and 4993 DF, p-value: < 2.2e-16
| Model |
6 |
46718.54 |
7786.42 |
154.90 |
| Residual |
4993 |
250992.80 |
50.27 |
NA |
| Total |
4999 |
297711.34 |
NA |
NA |
State the null hypothesis for the overall F-test and your
conclusion.
\[H_0: \beta_1 = \beta_2 = \cdots =
\beta_p = 0 \quad \text{(none of the predictors matter)}\] \[H_A: \text{At least one } \beta_j \neq 0\]
The F-statistic for Model C is 154.9 on 6 and 4993 degrees of freedom
with a p-value < .001. This suggests that there is significant
evidence to reject the null hypothesis and to consider the evidence that
at least one predictor has an association and the high f-test and low
p-value suggest this is unlikely to chance.
Task 5: Residual Diagnostics (15 points)
5a. (5 pts) For Model C, produce the four standard
diagnostic plots (Residuals vs. Fitted, Normal Q-Q, Scale-Location,
Cook’s Distance). Comment on what each plot tells you about the LINE
assumptions.
par(mfrow = c(2, 2))
plot(modelc, which = 1:4, col = adjustcolor("steelblue", alpha.f = 0.3), pch = 16)
Residuals vs. Fitted: Relative random scatter around 0 for the first
half of the graph with a slight negative slope that plateaus toward the
second half of the graph. This may suggest issues with
homoscedasticity.
Q-Q Residuals: Demonstrates a curve resembling an S instead of
following the reference line. This suggests that the residuals are not
normally distributed.
Scale-Location: Should depict a flat red line, with constant spread,
instead we see a positive sloped line suggesting issues with
homoscedasticity.
Cook’s Distance: There are no points beyond 0.5 or 1.0 of Cook’s D
contour suggesting no influential observations.
5b. (5 pts) Given the nature of the outcome (mental
health days, bounded at 0 and 30, heavily right-skewed), which
assumptions are most likely to be violated? Does this invalidate the
analysis? Explain.
Equal variance and normal distribution of residuals are most likely
violated. Independence may also be violated because some BRFSS
collections are from the same family. This does not necessary invalidate
the analysis because we have such a large sample size allowing central
limit theorem to state that mild violations will cause minimal bias.
Mild departures from variance are acceptable.
5c. (5 pts) Identify any observations with Cook’s
Distance > 1. How many are there? What would you do with them in a
real analysis?
augmented <- augment(modelc) %>%
mutate(
obs_num = row_number(),
cooks_d = cooks.distance(modelc),
influential = ifelse(cooks_d > 1, "Influential (Cook's D > 1)", "Not influential")
)
# Count how many observations exceed Cook's D > 1
n_influential <- sum(augmented$cooks_d > 1)
p_cooks <- ggplot(augmented, aes(x = obs_num, y = cooks_d, color = influential)) +
geom_point(alpha = 0.6, size = 1.2) +
geom_hline(yintercept = 1, linetype = "dashed",
color = "red", linewidth = 1) +
scale_color_manual(values = c("Influential (Cook's D > 1)" = "tomato",
"Not influential" = "steelblue")) +
labs(
title = "Cook's Distance",
subtitle = paste0("Dashed line = Cook's D > 1 | ",
n_influential, " influential observations"),
x = "Observation Number",
y = "Cook's Distance",
color = ""
) +
theme_minimal(base_size = 13) +
theme(legend.position = "top")
p_cooks

| There are no observations identified by Cook’s Distance. If there
were an outliers, we could center them, utilize imputation, or cap them
at 1.5 IQR. |