This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
summary(cars)
## speed dist
## Min. : 4.0 Min. : 2.00
## 1st Qu.:12.0 1st Qu.: 26.00
## Median :15.0 Median : 36.00
## Mean :15.4 Mean : 42.98
## 3rd Qu.:19.0 3rd Qu.: 56.00
## Max. :25.0 Max. :120.00
You can also embed plots, for example:
plot(pressure)
Use the saved analytic dataset from today’s lecture. It contains 5,000 randomly sampled BRFSS 2020 respondents with the following variables:
| Variable | Description | Type |
|---|---|---|
menthlth_days |
Mentally unhealthy days in past 30 | Continuous (0–30) |
physhlth_days |
Physically unhealthy days in past 30 | Continuous (0–30) |
sleep_hrs |
Sleep hours per night | Continuous (1–14) |
age |
Age in years (capped at 80) | Continuous |
sex |
Sex (Male/Female) | Factor |
education |
Education level (4 categories) | Factor |
gen_health |
General health status (5 categories) | Factor |
marital_status |
Marital status (6 categories) | Factor |
educ_numeric |
Education as numeric code (1–4) | Numeric |
# Load the dataset
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.5.2
## Warning: package 'tibble' was built under R version 4.5.2
## Warning: package 'tidyr' was built under R version 4.5.2
## Warning: package 'readr' was built under R version 4.5.2
## Warning: package 'purrr' was built under R version 4.5.2
## Warning: package 'dplyr' was built under R version 4.5.2
## Warning: package 'stringr' was built under R version 4.5.2
## Warning: package 'forcats' was built under R version 4.5.2
## Warning: package 'lubridate' was built under R version 4.5.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.2.0 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.2 ✔ tibble 3.3.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(broom)
library(knitr)
## Warning: package 'knitr' was built under R version 4.5.2
library(kableExtra)
## Warning: package 'kableExtra' was built under R version 4.5.2
##
## Attaching package: 'kableExtra'
##
## The following object is masked from 'package:dplyr':
##
## group_rows
library(gtsummary)
## Warning: package 'gtsummary' was built under R version 4.5.2
library(car)
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.5.2
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
library(ggeffects)
## Warning: package 'ggeffects' was built under R version 4.5.2
library(ggplot2)
library(flextable)
## Warning: package 'flextable' was built under R version 4.5.3
##
## Attaching package: 'flextable'
##
## The following object is masked from 'package:gtsummary':
##
## continuous_summary
##
## The following objects are masked from 'package:kableExtra':
##
## as_image, footnote
##
## The following object is masked from 'package:purrr':
##
## compose
brfss_dv <- readRDS("C:\\Users\\safwa\\OneDrive - University at Albany - SUNY\\EPI 553\\Labs\\brfss_dv_2020.rds")
1a. (5 pts) Create a descriptive statistics table
using tbl_summary() that includes
menthlth_days, age, sex,
gen_health, and marital_status. Show means
(SD) for continuous variables and n (%) for categorical variables.
brfss_dv |>
select(menthlth_days, physhlth_days, age, sex,
education, gen_health) |>
tbl_summary(
label = list(
menthlth_days ~ "Mentally unhealthy days (past 30)",
physhlth_days ~ "Physically unhealthy days (past 30)",
age ~ "Age (years)",
sex ~ "Sex",
education ~ "Education level",
gen_health ~ "General health status"
),
statistic = list(
all_continuous() ~ "{mean} ({sd})",
all_categorical() ~ "{n} ({p}%)"
),
digits = all_continuous() ~ 1,
missing = "no"
) |>
add_n() |>
bold_labels() |>
italicize_levels() |>
modify_caption("**Table 1. Descriptive Statistics — BRFSS 2020 Analytic Sample (n = 5,000)**") |>
as_flex_table()
Characteristic | N | N = 5,0001 |
|---|---|---|
Mentally unhealthy days (past 30) | 5,000 | 3.8 (7.9) |
Physically unhealthy days (past 30) | 5,000 | 3.3 (7.9) |
Age (years) | 5,000 | 54.9 (17.5) |
Sex | 5,000 | |
Male | 2,303 (46%) | |
Female | 2,697 (54%) | |
Education level | 5,000 | |
Less than HS | 290 (5.8%) | |
HS graduate | 1,348 (27%) | |
Some college | 1,340 (27%) | |
College graduate | 2,022 (40%) | |
General health status | 5,000 | |
Excellent | 1,065 (21%) | |
Very good | 1,803 (36%) | |
Good | 1,426 (29%) | |
Fair | 523 (10%) | |
Poor | 183 (3.7%) | |
1Mean (SD); n (%) | ||
1b. (5 pts) Create a boxplot of
menthlth_days by gen_health. Which group
reports the most mentally unhealthy days? Does the pattern appear
consistent with what you would expect?
Ans: The poor group has the highest unhealthy mental days and the pattern appeared consistent with my expectation.
ggplot(brfss_dv, aes(x =gen_health , y = menthlth_days, fill = gen_health)) +
geom_boxplot(alpha = 0.7, outlier.alpha = 0.2) +
scale_fill_brewer(palette = "Blues") +
labs(
title = "Mentally Unhealthy Days by General health status",
subtitle = "BRFSS 2020 (n = 5,000)",
x = "General health status",
y = "Mentally Unhealthy Days (Past 30)"
) +
theme_minimal(base_size = 13) +
theme(legend.position = "none")
1c. (5 pts) Create a grouped bar chart or table
showing the mean number of mentally unhealthy days by
marital_status. Which marital status group has the highest
mean? The lowest?
Ans: The highest mean is separated group and the lowest men is the widowed group.
# Compute means by marital status
mental_by_marital <- brfss_dv |>
filter(!is.na(menthlth_days), !is.na(marital_status)) |>
group_by(marital_status) |>
summarise(
mean_days = mean(menthlth_days, na.rm = TRUE),
n = n()
) |>
arrange(desc(mean_days))
print(mental_by_marital)
## # A tibble: 6 × 3
## marital_status mean_days n
## <fct> <dbl> <int>
## 1 Separated 6.22 109
## 2 Unmarried couple 6.07 179
## 3 Never married 5.28 848
## 4 Divorced 4.49 622
## 5 Married 3.10 2708
## 6 Widowed 2.67 534
ggplot(mental_by_marital, aes(x = reorder(marital_status, mean_days), y = mean_days, fill = marital_status)) +
geom_col(show.legend = FALSE) +
coord_flip() +
labs(
title = "Mean mentally unhealthy days by marital status",
x = NULL,
y = "Mean days in past 30"
) +
theme_minimal()
2a. (5 pts) Using the gen_health
variable, create a numeric version coded as: Excellent = 1, Very good =
2, Good = 3, Fair = 4, Poor = 5. Fit a simple regression model:
menthlth_days ~ gen_health_numeric. Report the coefficient
and interpret it.
Ans: The coefficient on gen_health_numeric will be positive (roughly +2–3). Each one-unit increase in the numeric scale (i.e., moving one category “worse,” e.g., from Excellent to Very good) is associated with approximately that many additional mentally unhealthy days in the past 30, on average.
df <- brfss_dv |>
mutate(gen_health_numeric = case_when(
gen_health == "Excellent" ~ 1,
gen_health == "Very good" ~ 2,
gen_health == "Good" ~ 3,
gen_health == "Fair" ~ 4,
gen_health == "Poor" ~ 5
))
model_2a <- lm(menthlth_days ~ gen_health_numeric, data = df)
summary(model_2a)
##
## Call:
## lm(formula = menthlth_days ~ gen_health_numeric, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.6173 -4.9016 -3.0438 -0.0438 28.8140
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.6718 0.2705 -2.484 0.013 *
## gen_health_numeric 1.8578 0.1036 17.926 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.661 on 4998 degrees of freedom
## Multiple R-squared: 0.06041, Adjusted R-squared: 0.06022
## F-statistic: 321.3 on 1 and 4998 DF, p-value: < 2.2e-16
2b. (5 pts) Now fit the same model but treating
gen_health as a factor:
menthlth_days ~ gen_health. Compare the two models. Why
does the factor version use 4 coefficients instead of 1? Explain why the
naive numeric approach may be misleading.
Ans: Why 4 coefficients instead of 1? The factor version creates k−1 = 4 dummy variables (one per level, minus the reference “Excellent”). Each dummy compares that group to the reference. The numeric approach forces the differences between consecutive levels to be equal (a “dose–response” assumption). This is misleading because the mental health gap between Excellent and Very good may differ substantially from the gap between Good and Fair — the factor model reveals this; the numeric model hides it.
df <- brfss_dv |>
mutate(gen_health = factor(gen_health,
levels = c("Excellent","Very good","Good","Fair","Poor")))
model_2b <- lm(menthlth_days ~ gen_health, data = df)
summary(model_2b)
##
## Call:
## lm(formula = menthlth_days ~ gen_health, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.7814 -4.0708 -2.7077 -0.1174 27.8826
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.1174 0.2332 9.079 < 2e-16 ***
## gen_healthVery good 0.5903 0.2941 2.007 0.0448 *
## gen_healthGood 1.9535 0.3082 6.337 2.54e-10 ***
## gen_healthFair 5.0624 0.4064 12.457 < 2e-16 ***
## gen_healthPoor 9.6640 0.6090 15.868 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.611 on 4995 degrees of freedom
## Multiple R-squared: 0.07334, Adjusted R-squared: 0.07259
## F-statistic: 98.83 on 4 and 4995 DF, p-value: < 2.2e-16
# Compare
AIC(model_2a, model_2b)
## df AIC
## model_2a 3 34555.43
## model_2b 6 34492.16
3a. (5 pts) Fit the following model with
gen_health as a factor:
menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health
Write out the fitted regression equation.
model_3a <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health,
data = df)
summary(model_3a)
##
## Call:
## lm(formula = menthlth_days ~ age + sex + physhlth_days + sleep_hrs +
## gen_health, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.5175 -3.5549 -1.6999 0.4316 31.3631
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.592982 0.630441 15.216 < 2e-16 ***
## age -0.086672 0.005982 -14.489 < 2e-16 ***
## sexFemale 1.725379 0.205472 8.397 < 2e-16 ***
## physhlth_days 0.231420 0.016177 14.306 < 2e-16 ***
## sleep_hrs -0.586595 0.076572 -7.661 2.21e-14 ***
## gen_healthVery good 0.789947 0.279661 2.825 0.00475 **
## gen_healthGood 1.843601 0.297260 6.202 6.03e-10 ***
## gen_healthFair 3.395283 0.417964 8.123 5.66e-16 ***
## gen_healthPoor 5.335347 0.682949 7.812 6.80e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.208 on 4991 degrees of freedom
## Multiple R-squared: 0.1694, Adjusted R-squared: 0.1681
## F-statistic: 127.3 on 8 and 4991 DF, p-value: < 2.2e-16
tidy(model_3a, conf.int = TRUE)
## # A tibble: 9 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 9.59 0.630 15.2 3.82e-51 8.36 10.8
## 2 age -0.0867 0.00598 -14.5 1.25e-46 -0.0984 -0.0749
## 3 sexFemale 1.73 0.205 8.40 5.90e-17 1.32 2.13
## 4 physhlth_days 0.231 0.0162 14.3 1.59e-45 0.200 0.263
## 5 sleep_hrs -0.587 0.0766 -7.66 2.21e-14 -0.737 -0.436
## 6 gen_healthVery good 0.790 0.280 2.82 4.75e- 3 0.242 1.34
## 7 gen_healthGood 1.84 0.297 6.20 6.03e-10 1.26 2.43
## 8 gen_healthFair 3.40 0.418 8.12 5.66e-16 2.58 4.21
## 9 gen_healthPoor 5.34 0.683 7.81 6.80e-15 4.00 6.67
3b. (10 pts) Interpret every dummy
variable coefficient for gen_health in plain language. Be
specific about the reference group, the direction and magnitude of each
comparison, and include the phrase “holding all other variables
constant.”
Ans: gen_healthVery good- People who rate their health as Very good report, on average, β₅ more mentally unhealthy days than those with Excellent health, holding all other variables constant. gen_healthGood - People reporting Good health have, on average, β₆ more mentally unhealthy days than the Excellent group, holding all other variables constant. gen_healthFair- People reporting Fair health have, on average, β₇ more mentally unhealthy days than the Excellent group, holding all other variables constant. gen_healthPoor- People reporting Poor health have, on average, β₈ more mentally unhealthy days than those with Excellent health, holding all other variables constant — this is typically the largest positive coefficient.
3c. (10 pts) Create a coefficient plot (forest plot)
showing the estimated coefficients and 95% confidence intervals for the
gen_health dummy variables only. Which group differs most
from the reference group?
# ── 3c: Coefficient / forest plot ────────────────────────────────────────────
coef_df <- tidy(model_3a, conf.int = TRUE) |>
filter(grepl("gen_health", term)) |>
mutate(term = gsub("gen_health", "", term),
term = factor(term, levels = c("Very good","Good","Fair","Poor")))
ggplot(coef_df, aes(x = estimate, y = term)) +
geom_vline(xintercept = 0, linetype = "dashed", color = "gray50") +
geom_errorbarh(aes(xmin = conf.low, xmax = conf.high), height = 0.2, linewidth = 0.8) +
geom_point(size = 3, color = "#534AB7") +
labs(
title = "Effect of general health on mentally unhealthy days",
subtitle = "Reference group: Excellent health",
x = "Estimated additional days (vs. Excellent)",
y = "General health category"
) +
theme_minimal(base_size = 13)
## Warning: `geom_errorbarh()` was deprecated in ggplot2 4.0.0.
## ℹ Please use the `orientation` argument of `geom_errorbar()` instead.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `height` was translated to `width`.
4a. (5 pts) Use relevel() to change the
reference group for gen_health to “Good.” Refit the model
from Task 3a.
df <- brfss_dv |>
mutate(gen_health_good_ref = relevel(gen_health, ref = "Good"))
model_4a <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health_good_ref,
data = df)
summary(model_4a)
##
## Call:
## lm(formula = menthlth_days ~ age + sex + physhlth_days + sleep_hrs +
## gen_health_good_ref, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.5175 -3.5549 -1.6999 0.4316 31.3631
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.436584 0.629825 18.158 < 2e-16 ***
## age -0.086672 0.005982 -14.489 < 2e-16 ***
## sexFemale 1.725379 0.205472 8.397 < 2e-16 ***
## physhlth_days 0.231420 0.016177 14.306 < 2e-16 ***
## sleep_hrs -0.586595 0.076572 -7.661 2.21e-14 ***
## gen_health_good_refExcellent -1.843601 0.297260 -6.202 6.03e-10 ***
## gen_health_good_refVery good -1.053654 0.258126 -4.082 4.54e-05 ***
## gen_health_good_refFair 1.551682 0.386128 4.019 5.94e-05 ***
## gen_health_good_refPoor 3.491746 0.650560 5.367 8.35e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.208 on 4991 degrees of freedom
## Multiple R-squared: 0.1694, Adjusted R-squared: 0.1681
## F-statistic: 127.3 on 8 and 4991 DF, p-value: < 2.2e-16
4b. (5 pts) Compare the education and other continuous variable coefficients between the two models (original reference vs. new reference). Are they the same? Why or why not?
Ans: The coefficients for age, sex, physhlth_days, and sleep_hrs are identical across both models. The reference group only changes the intercept and the dummy variable coefficients (now expressing differences from Good rather than Excellent). The continuous predictors are unaffected because they measure partial slopes that don’t depend on the reference category.
tidy(model_3a) |> filter(!grepl("gen_health|Intercept", term))
## # A tibble: 4 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 age -0.0867 0.00598 -14.5 1.25e-46
## 2 sexFemale 1.73 0.205 8.40 5.90e-17
## 3 physhlth_days 0.231 0.0162 14.3 1.59e-45
## 4 sleep_hrs -0.587 0.0766 -7.66 2.21e-14
tidy(model_4a) |> filter(!grepl("gen_health|Intercept", term))
## # A tibble: 4 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 age -0.0867 0.00598 -14.5 1.25e-46
## 2 sexFemale 1.73 0.205 8.40 5.90e-17
## 3 physhlth_days 0.231 0.0162 14.3 1.59e-45
## 4 sleep_hrs -0.587 0.0766 -7.66 2.21e-14
4c. (5 pts) Verify that the predicted values from both models are identical by computing the correlation between the two sets of predictions. Explain in your own words why changing the reference group does not change predictions.
pred_3a <- predict(model_3a)
pred_4a <- predict(model_4a)
cor(pred_3a, pred_4a) # Should be exactly 1.0
## [1] 1
all.equal(pred_3a, pred_4a) # Should be TRUE
## [1] TRUE
5a. (5 pts) Fit a reduced model without
gen_health:
menthlth_days ~ age + sex + physhlth_days + sleep_hrs
Report \(R^2\) and Adjusted \(R^2\) for both the reduced model and the full model (from Task 3a).
# ── 5a: Reduced model ────────────────────────────────────────────────────────
model_reduced <- lm(menthlth_days ~ age + sex + physhlth_days + sleep_hrs, data = df)
summary(model_reduced)$r.squared
## [1] 0.1521948
summary(model_reduced)$adj.r.squared
## [1] 0.1515159
summary(model_3a)$r.squared
## [1] 0.1694246
summary(model_3a)$adj.r.squared
## [1] 0.1680933
5b. (10 pts) Conduct a partial F-test using
anova() to test whether gen_health as a whole
significantly improves the model. State the null and alternative
hypotheses. Report the F-statistic, degrees of freedom, and p-value.
State your conclusion.
Hypotheses:
H₀: The four gen_health dummy coefficients are all equal to zero (general health adds no predictive power beyond the other variables). H₁: At least one gen_health dummy coefficient is nonzero.
Expected result: The F-statistic will be very large (likely F > 100) with 4 numerator df and a p-value < 0.0001, leading us to reject H₀ — general health status significantly improves the model.
anova(model_reduced, model_3a)
## Analysis of Variance Table
##
## Model 1: menthlth_days ~ age + sex + physhlth_days + sleep_hrs
## Model 2: menthlth_days ~ age + sex + physhlth_days + sleep_hrs + gen_health
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 4995 264715
## 2 4991 259335 4 5379.8 25.884 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
5c. (5 pts) Use car::Anova() with
type = "III" on the full model. Compare the result for
gen_health to your partial F-test. Are they consistent?
Ans:The gen_health row in the Type III ANOVA will report the same F-statistic and p-value as the partial F-test from anova(), because both test the joint significance of all gen_health dummies given all other predictors are in the model. They are consistent by design.
library(car)
Anova(model_3a, type = "III")
## Anova Table (Type III tests)
##
## Response: menthlth_days
## Sum Sq Df F value Pr(>F)
## (Intercept) 12031 1 231.536 < 2.2e-16 ***
## age 10908 1 209.926 < 2.2e-16 ***
## sex 3664 1 70.512 < 2.2e-16 ***
## physhlth_days 10634 1 204.654 < 2.2e-16 ***
## sleep_hrs 3049 1 58.687 2.207e-14 ***
## gen_health 5380 4 25.884 < 2.2e-16 ***
## Residuals 259335 4991
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
6a. (5 pts) Using the full model from Task 3a, write a 3–4 sentence paragraph summarizing the association between general health status and mental health days for a non-statistical audience. Your paragraph should:
Ans: People who describe their general health as Poor or Fair tend to experience substantially more mentally unhealthy days than those who consider themselves in Excellent health — in some cases nearly two weeks more per month. Even after accounting for age, sex, physical health, and sleep, this pattern holds: mental and general health are closely intertwined. People in Good or Very good health fall in between, with the gap growing larger as self-rated health worsens. Because this data captures a single point in time, we cannot say whether poor general health causes poorer mental health or vice versa — the relationship likely runs in both directions.
6b. (10 pts) Now consider both the education model (from the guided practice) and the general health model (from your lab). Discuss: Which categorical predictor appears to be more strongly associated with mental health days? How would you decide which to include if you were building a final model? Write 3–4 sentences addressing this comparison.
Ans: General health status is likely the stronger predictor of mental health days. General health is a holistic self-assessment that implicitly incorporates chronic illness, pain, and daily functioning — all of which are tightly coupled with mental wellbeing. Education, by contrast, is a distal socioeconomic factor whose effect on mental health is more indirect and attenuated. To formally compare them, you would look at the incremental R² each variable adds when entered last (i.e., partial R²), or compare standardized coefficients. If building a final model, you would include both if theory and sample size permit, but if parsimony is required, the variable with greater incremental explained variance — likely gen_health — would take priority. You would also check whether the two variables interact or suppress one another before making a final decision.
End of Lab Activity