Submission: Knit this file to HTML, publish to RPubs with the title
epi553_hw02_lastname_firstname, and paste the RPubs URL in the Brightspace submission comments. Also upload this.Rmdfile to Brightspace.AI Policy: AI tools are NOT permitted on this assignment. See the assignment description for full details.
# Load required packages
library(tidyverse)
library(haven)
library(broom)
library(kableExtra)
library(janitor)
library(car)
library(gtsummary)
library(ggeffects)# Import the dataset
brfss <- read_xpt("C:/Users/joshm/Documents/UAlbany/Spring 2026/EPI 553/Assignment 3/LLCP2023.xpt") %>%
clean_names()
#select variables
brfss_work <- brfss |>
select(menthlth, physhlth, bmi5, sexvar, exerany2, ageg5yr, incomg1, educa, smoker3)brfss_work <- brfss_work %>%
mutate(
menthlth_days = case_when(menthlth == 88 ~ 0,
menthlth >= 1 & menthlth <= 30 ~ as.numeric(menthlth),
TRUE ~ NA_real_),
physhlth_days = case_when(physhlth == 88 ~ 0,
menthlth >= 1 & menthlth <= 30 ~ as.numeric(menthlth),
TRUE ~ NA_real_),
bmi = case_when(bmi5 == 9999 ~ NA, bmi5 < 9999 ~ bmi5/100),
sex = factor(sexvar, levels = c(1, 2), labels = c("Male", "Female")),
exercise = case_when(exerany2 == 7 & exerany2 == 9 ~ NA),
exercise = factor(exerany2, levels = c(1, 2), labels = c("Yes", "No")),
age_group = case_when(ageg5yr == 14 ~ NA),
age_group = factor(ageg5yr, levels = 1:13,
labels = c("18-24", "25-29", "30-34", "35-39", "40-44", "45-49", "50-54", "55-59", "60-64", "65-69", "70-74", "75-79", "80+") ),
income = case_when(incomg1 == 9 ~ NA),
income = factor(incomg1, levels = 1:7,
labels = c("<$15k", "$15-25k", "$25-35k", "$35-50k",
"$50-100k", "$100-200k", ">$200k")),
education = case_when(educa == 9 ~ NA),
education = factor(educa, levels = 1:6, labels =
c("Less than high school",
"Less than high school",
"High school diploma or GED",
"Some college or technical school",
"College graduate",
"Graduate or professional degree")),
smoking = case_when(smoker3 == 9 ~ NA),
smoking = factor(smoker3, levels=1:4, labels =
c("Current daily smoker",
"Current some-day smoker",
"Former smoker",
"Never smoker"))
)## Total N: 433323
## Missing data: 291293
## Missing data %: 67.22306
set.seed(1220)
brfss_analytic <- brfss_work |>
drop_na(menthlth_days, physhlth_days, bmi, sex, exercise,
age_group, income, education, smoking) |>
slice_sample(n = 8000)
cat("FInal analytic N:", nrow(brfss_analytic), "\n")## FInal analytic N: 8000
brfss_analytic %>%
select(menthlth_days, physhlth_days, bmi, sex, exercise, age_group, income, education, smoking) %>%
tbl_summary(
label = list(
menthlth_days ~ "Mentally unhealthy days (past 30)",
physhlth_days ~ "Physically unhealthy days (past 30)",
bmi ~ "Body mass index",
sex ~ "Sex",
exercise ~ "Any exercise in past 30 days",
age_group ~ "Age group",
income ~ "Annual household income",
education ~ "Highest level of education completed",
smoking ~ "Smoking status"
),
statistic = list(
all_continuous() ~ "{mean} ({sd})",
all_categorical() ~ "{n} ({p}%)"
),
digits = all_continuous() ~ 1,
missing = "no"
) %>%
add_n() %>%
bold_labels() %>%
modify_caption("**Table 1. Descriptive Statistics — BRFSS 2023 Analytic Sample (n = 8000)**")| Characteristic | N | N = 8,0001 |
|---|---|---|
| Mentally unhealthy days (past 30) | 8,000 | 5.6 (9.0) |
| Physically unhealthy days (past 30) | 8,000 | 3.5 (7.9) |
| Body mass index | 8,000 | 28.6 (6.5) |
| Sex | 8,000 | |
| Male | 3,921 (49%) | |
| Female | 4,079 (51%) | |
| Any exercise in past 30 days | 8,000 | 6,242 (78%) |
| Age group | 8,000 | |
| 18-24 | 457 (5.7%) | |
| 25-29 | 424 (5.3%) | |
| 30-34 | 533 (6.7%) | |
| 35-39 | 590 (7.4%) | |
| 40-44 | 628 (7.9%) | |
| 45-49 | 538 (6.7%) | |
| 50-54 | 627 (7.8%) | |
| 55-59 | 694 (8.7%) | |
| 60-64 | 819 (10%) | |
| 65-69 | 818 (10%) | |
| 70-74 | 759 (9.5%) | |
| 75-79 | 577 (7.2%) | |
| 80+ | 536 (6.7%) | |
| Annual household income | 8,000 | |
| <$15k | 424 (5.3%) | |
| $15-25k | 643 (8.0%) | |
| $25-35k | 812 (10%) | |
| $35-50k | 1,102 (14%) | |
| $50-100k | 2,551 (32%) | |
| $100-200k | 1,809 (23%) | |
| >$200k | 659 (8.2%) | |
| Highest level of education completed | 8,000 | |
| Less than high school | 127 (1.6%) | |
| High school diploma or GED | 254 (3.2%) | |
| Some college or technical school | 1,795 (22%) | |
| College graduate | 2,144 (27%) | |
| Graduate or professional degree | 3,680 (46%) | |
| Smoking status | 8,000 | |
| Current daily smoker | 691 (8.6%) | |
| Current some-day smoker | 274 (3.4%) | |
| Former smoker | 2,195 (27%) | |
| Never smoker | 4,840 (61%) | |
| 1 Mean (SD); n (%) | ||
model <- lm(menthlth_days ~ physhlth_days + bmi + sex + exercise + age_group + income + education + smoking, data = brfss_analytic)
summary(model)##
## Call:
## lm(formula = menthlth_days ~ physhlth_days + bmi + sex + exercise +
## age_group + income + education + smoking, data = brfss_analytic)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.8259 -2.4065 -1.4297 -0.1347 29.4079
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.818398 0.692538 6.958 3.74e-12
## physhlth_days 0.850795 0.008271 102.870 < 2e-16
## bmi 0.010978 0.009866 1.113 0.265864
## sexFemale 0.586915 0.125583 4.674 3.01e-06
## exerciseNo 0.272611 0.158988 1.715 0.086445
## age_group25-29 -0.602502 0.376024 -1.602 0.109130
## age_group30-34 -0.768963 0.357147 -2.153 0.031343
## age_group35-39 -0.727709 0.351562 -2.070 0.038491
## age_group40-44 -1.342610 0.349512 -3.841 0.000123
## age_group45-49 -1.924205 0.360173 -5.342 9.42e-08
## age_group50-54 -2.272300 0.350703 -6.479 9.76e-11
## age_group55-59 -2.230300 0.341562 -6.530 6.99e-11
## age_group60-64 -2.648336 0.329843 -8.029 1.12e-15
## age_group65-69 -3.152289 0.331799 -9.501 < 2e-16
## age_group70-74 -3.385648 0.335638 -10.087 < 2e-16
## age_group75-79 -3.244898 0.355529 -9.127 < 2e-16
## age_group80+ -3.594839 0.358756 -10.020 < 2e-16
## income$15-25k -0.237845 0.346795 -0.686 0.492836
## income$25-35k -0.132541 0.335342 -0.395 0.692677
## income$35-50k -0.445473 0.323322 -1.378 0.168303
## income$50-100k -0.569835 0.304430 -1.872 0.061269
## income$100-200k -1.005289 0.318742 -3.154 0.001617
## income>$200k -1.256938 0.368011 -3.415 0.000640
## educationHigh school diploma or GED 0.381209 0.602595 0.633 0.527005
## educationSome college or technical school 0.904135 0.513082 1.762 0.078080
## educationCollege graduate 0.975882 0.513649 1.900 0.057482
## educationGraduate or professional degree 0.832938 0.513798 1.621 0.105028
## smokingCurrent some-day smoker -0.842184 0.394788 -2.133 0.032934
## smokingFormer smoker -0.738961 0.250086 -2.955 0.003138
## smokingNever smoker -1.368381 0.237205 -5.769 8.28e-09
##
## (Intercept) ***
## physhlth_days ***
## bmi
## sexFemale ***
## exerciseNo .
## age_group25-29
## age_group30-34 *
## age_group35-39 *
## age_group40-44 ***
## age_group45-49 ***
## age_group50-54 ***
## age_group55-59 ***
## age_group60-64 ***
## age_group65-69 ***
## age_group70-74 ***
## age_group75-79 ***
## age_group80+ ***
## income$15-25k
## income$25-35k
## income$35-50k
## income$50-100k .
## income$100-200k **
## income>$200k ***
## educationHigh school diploma or GED
## educationSome college or technical school .
## educationCollege graduate .
## educationGraduate or professional degree
## smokingCurrent some-day smoker *
## smokingFormer smoker **
## smokingNever smoker ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.514 on 7970 degrees of freedom
## Multiple R-squared: 0.6257, Adjusted R-squared: 0.6243
## F-statistic: 459.4 on 29 and 7970 DF, p-value: < 2.2e-16
## (Intercept)
## 4.81839770
## physhlth_days
## 0.85079499
## bmi
## 0.01097774
## sexFemale
## 0.58691470
## exerciseNo
## 0.27261058
## age_group25-29
## -0.60250175
## age_group30-34
## -0.76896262
## age_group35-39
## -0.72770861
## age_group40-44
## -1.34260981
## age_group45-49
## -1.92420498
## age_group50-54
## -2.27229975
## age_group55-59
## -2.23029964
## age_group60-64
## -2.64833599
## age_group65-69
## -3.15228943
## age_group70-74
## -3.38564774
## age_group75-79
## -3.24489757
## age_group80+
## -3.59483883
## income$15-25k
## -0.23784516
## income$25-35k
## -0.13254056
## income$35-50k
## -0.44547317
## income$50-100k
## -0.56983491
## income$100-200k
## -1.00528854
## income>$200k
## -1.25693832
## educationHigh school diploma or GED
## 0.38120905
## educationSome college or technical school
## 0.90413454
## educationCollege graduate
## 0.97588172
## educationGraduate or professional degree
## 0.83293760
## smokingCurrent some-day smoker
## -0.84218438
## smokingFormer smoker
## -0.73896133
## smokingNever smoker
## -1.36838119
Step 2: Write the fitted regression equation including all terms, with coefficients rounded to three decimal places.
Menthlth_days = 4.818 + (physhlth_days * 0.850) + (bmi * 0.010) + (sexFemale * 0.586) + (exerciseNo * 0.272) + (age_group25-29 * -0.602) + (age_group30-34 * -0.768) + (age_group35-39 * -0.727) + (age_group40-44 * -1.342) + (age_group45-49 * -1.924) + (age_group50-54 * -2.272) + (age_group55-59 * -2.230) + (age_group60-64 * -2.648) + (age_group65-69 * -3.152) + (age_group70-74 * -3.385) + (age_group75-79 * -3.244) + (age_group80+ * -3.594) + (income15-25k * -0.237) + income25-35k * -0.132) + (income35-50k * -0.445) + income50-100k * -0.569) + (income100-200k * -1.005) + (income>$200k * -1.256) + (educationHigh school diploma or GED * 0.381) + (educationSome college or technical school * 0.904) + (educationCollege graduate * 0.975 + (educationGraduate or professional degree * 0.832) + (smokingCurrent some-day smoker * -0.842) + (smokingFormer smoker * -0.738) + (smokingNever smoker * -1.368)
Step 3: Interpret the following coefficients in plain language. For each, state the direction, magnitude, and meaning in terms of mentally unhealthy days, holding all other variables constant: • physhlth_days (continuous predictor)
For every 1 day increase in physhlth_days, menthlth days increases by 0.850, keeping all other variables constant.
• bmi (continuous predictor)
For every 1 unit increase in bmi, menthlth days increases by 0.010, keeping all other variables constant.
• sex: Female vs. Male (reference)
Being female is associated with 0.586 more mentally unhealthy days per month compared to males, keeping all other variables constant.
• exercise: Yes vs. No (reference)
Not exercising is associated with 0.272 more mentally unhealthy days per month compared to exercising, keeping all other variables constant.
• One age_group coefficient of your choice
Being 30-34 is associated with 0.768 less mentally unhealthy days per month than being 18-24, keeping all other variables constant.
• Two income coefficients of your choice, compared to the reference category (Less than $15,000)
Making >200K is associated with 1.256 less mentally unhealthy days per month than making <15K, and making 35-50K is associated with 0.445 less mentally unhealthy days per month than making <15K, keeping all other variables constant.
Step 4: Report and interpret each of the following model-level statistics: • R-squared: proportion of variance in mentally unhealthy days explained by all predictors
0.6257: all predictors explain 62.57% of the variance in menthlth_days.
• Adjusted R-squared: how it differs from R-squared and what it adds
0.6243: all predictors explain 62.43% of the variance in menthlth_days. Differs from R2 because adjusted R2 only increases when a predictor is added that improves the model by more than would be expected by chance.
• Root MSE (residual standard error): average prediction error of the model
5.514: average of deviations from the linear model. Datapoints are 5.514 mental health days off from the regression model, on average.
• Overall F-test: state H0, F-statistic, degrees of freedom (numerator and denominator), p-value, and conclusion
H0: all the coefficients are 0 - there is no linear relationship between any predictor and the outcome. F: 459.4 df1 = 29, df2 = 7970 p-value: <2.2e-16
Conclusion: Reject H0; there is a linear relationship between some predictors and the outcome.
Anova(model, type = "III") |>
tidy() |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(caption = "Type 3 ANOVA on Full Model") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| term | sumsq | df | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 1471.8139 | 1 | 48.4081 | 0.0000 |
| physhlth_days | 321747.4695 | 1 | 10582.2957 | 0.0000 |
| bmi | 37.6448 | 1 | 1.2381 | 0.2659 |
| sex | 664.0808 | 1 | 21.8417 | 0.0000 |
| exercise | 89.3911 | 1 | 2.9401 | 0.0864 |
| age_group | 9085.6971 | 12 | 24.9024 | 0.0000 |
| income | 760.5243 | 6 | 4.1689 | 0.0003 |
| education | 179.4841 | 4 | 1.4758 | 0.2066 |
| smoking | 1313.1821 | 3 | 14.3969 | 0.0000 |
| Residuals | 242322.4044 | 7970 | NA | NA |
Physhlth_days, sex, age_group, income, and smoking have statistically significant partial associations.
reduced <- lm(menthlth_days ~ physhlth_days + bmi + sex + exercise + age_group + education + smoking, data = brfss_analytic)
anova(reduced, model) %>%
as.data.frame() %>%
rownames_to_column("Model") %>%
mutate(
Model = c("Reduced (no Income)", "Full (with Income)"),
across(where(is.numeric), ~ round(., 4))
) %>%
kable(
caption = "Partial F-Test: Does Income add to the model?",
col.names = c("Model", "Res. df", "RSS", "df", "Sum of Sq", "F", "p-value")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Model | Res. df | RSS | df | Sum of Sq | F | p-value |
|---|---|---|---|---|---|---|
| Reduced (no Income) | 7976 | 243082.9 | NA | NA | NA | NA |
| Full (with Income) | 7970 | 242322.4 | 6 | 760.5243 | 4.1689 | 3e-04 |
H0: Income does not add information to the prediction of mental health days, given the other variables present in the model. HA: Income adds information to the prediction of mental health days, given the other variables present in the model.
F: 4.1689 df: 6 p-value: 3e-04
I reject the null hypothesis and conclude that income adds information to the prediction of mental health days even given the other variables present.
reduced2 <- lm(menthlth_days ~ physhlth_days + bmi + sex + exercise + age_group + income + smoking, data = brfss_analytic)
anova(reduced2, model) %>%
as.data.frame() %>%
rownames_to_column("Model") %>%
mutate(
Model = c("Reduced (no Education)", "Full (with Education)"),
across(where(is.numeric), ~ round(., 4))
) %>%
kable(
caption = "Partial F-Test: Does Education add to the model?",
col.names = c("Model", "Res. df", "RSS", "df", "Sum of Sq", "F", "p-value")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Model | Res. df | RSS | df | Sum of Sq | F | p-value |
|---|---|---|---|---|---|---|
| Reduced (no Education) | 7974 | 242501.9 | NA | NA | NA | NA |
| Full (with Education) | 7970 | 242322.4 | 4 | 179.4841 | 1.4758 | 0.2066 |
H0: Education does not add information to the prediction of mental health days, given the other variables present in the model. HA: Education adds information to the prediction of mental health days, given the other variables present in the model.
F: 1.4758 df: 4 p-value: 0.2066
I do not reject the null hypothesis and conclude that there is no evidence education adds information to the prediction of mental health days given the other variables present.
The results of the Type III ANOVA indicate that Physhlth_days, sex, age_group, income, and smoking significantly contribute to the model given that all other variables are already in the model. These predictors make the strongest independent contributions. The chunk tests indicate whether a group of variables adds to the model compared to a reduced model without that group of variables, and in this case, it was found that income but not education contributes significantly to the model. This is important because even if a single variable is not significant in the overall Type III ANOVA, it could still be an important contributor within a group of variables.
brfss_analytic <- brfss_analytic %>%
mutate(
smoker_current = case_when(smoking == "Current some-day smoker" ~ 0,
smoking == "Current daily smoker" ~ 0,
smoking == "Former smoker" ~ 1,
smoking == "Never smoker" ~ 1),
smoker_current = factor(smoker_current, levels=0:1, labels =
c("Current smoker",
"Non_smoker")) )
model_A <- lm(menthlth_days ~ physhlth_days + bmi + sex + smoker_current + exercise + age_group + income + education, data = brfss_analytic)
tidy(model_A, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Model A",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 4.3262 | 0.6787 | 6.3742 | 0.0000 | 2.9958 | 5.6566 |
| physhlth_days | 0.8526 | 0.0083 | 103.0635 | 0.0000 | 0.8364 | 0.8688 |
| bmi | 0.0111 | 0.0099 | 1.1192 | 0.2631 | -0.0083 | 0.0304 |
| sexFemale | 0.5489 | 0.1254 | 4.3752 | 0.0000 | 0.3029 | 0.7948 |
| smoker_currentNon_smoker | -0.9071 | 0.2017 | -4.4978 | 0.0000 | -1.3025 | -0.5118 |
| exerciseNo | 0.2770 | 0.1592 | 1.7405 | 0.0818 | -0.0350 | 0.5890 |
| age_group25-29 | -0.5205 | 0.3760 | -1.3844 | 0.1663 | -1.2575 | 0.2165 |
| age_group30-34 | -0.6452 | 0.3566 | -1.8093 | 0.0704 | -1.3442 | 0.0538 |
| age_group35-39 | -0.5765 | 0.3505 | -1.6450 | 0.1000 | -1.2635 | 0.1105 |
| age_group40-44 | -1.1457 | 0.3474 | -3.2977 | 0.0010 | -1.8267 | -0.4647 |
| age_group45-49 | -1.7312 | 0.3583 | -4.8316 | 0.0000 | -2.4335 | -1.0288 |
| age_group50-54 | -2.1153 | 0.3495 | -6.0517 | 0.0000 | -2.8005 | -1.4301 |
| age_group55-59 | -2.0583 | 0.3400 | -6.0539 | 0.0000 | -2.7247 | -1.3918 |
| age_group60-64 | -2.4695 | 0.3280 | -7.5287 | 0.0000 | -3.1125 | -1.8265 |
| age_group65-69 | -2.9304 | 0.3287 | -8.9139 | 0.0000 | -3.5748 | -2.2859 |
| age_group70-74 | -3.1576 | 0.3323 | -9.5012 | 0.0000 | -3.8091 | -2.5062 |
| age_group75-79 | -2.9905 | 0.3513 | -8.5116 | 0.0000 | -3.6793 | -2.3018 |
| age_group80+ | -3.3630 | 0.3554 | -9.4622 | 0.0000 | -4.0597 | -2.6663 |
| income$15-25k | -0.2432 | 0.3472 | -0.7003 | 0.4837 | -0.9239 | 0.4375 |
| income$25-35k | -0.1420 | 0.3358 | -0.4229 | 0.6723 | -0.8002 | 0.5162 |
| income$35-50k | -0.4414 | 0.3237 | -1.3637 | 0.1727 | -1.0760 | 0.1931 |
| income$50-100k | -0.5743 | 0.3048 | -1.8842 | 0.0596 | -1.1719 | 0.0232 |
| income$100-200k | -0.9893 | 0.3191 | -3.0998 | 0.0019 | -1.6149 | -0.3637 |
| income>$200k | -1.2491 | 0.3685 | -3.3899 | 0.0007 | -1.9714 | -0.5268 |
| educationHigh school diploma or GED | 0.4668 | 0.6031 | 0.7741 | 0.4389 | -0.7154 | 1.6491 |
| educationSome college or technical school | 1.0098 | 0.5132 | 1.9675 | 0.0492 | 0.0037 | 2.0159 |
| educationCollege graduate | 1.0710 | 0.5139 | 2.0839 | 0.0372 | 0.0635 | 2.0784 |
| educationGraduate or professional degree | 0.8558 | 0.5144 | 1.6637 | 0.0962 | -0.1525 | 1.8641 |
model_B <- lm(menthlth_days ~ physhlth_days + bmi + sex * smoker_current + exercise + age_group + income + education, data = brfss_analytic)
tidy(model_B, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Model B",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 4.0900 | 0.6923 | 5.9081 | 0.0000 | 2.7330 | 5.4470 |
| physhlth_days | 0.8522 | 0.0083 | 102.9927 | 0.0000 | 0.8360 | 0.8684 |
| bmi | 0.0108 | 0.0099 | 1.0962 | 0.2730 | -0.0085 | 0.0302 |
| sexFemale | 1.1269 | 0.3575 | 3.1522 | 0.0016 | 0.4261 | 1.8277 |
| smoker_currentNon_smoker | -0.5875 | 0.2737 | -2.1465 | 0.0319 | -1.1241 | -0.0510 |
| exerciseNo | 0.2758 | 0.1591 | 1.7330 | 0.0831 | -0.0362 | 0.5878 |
| age_group25-29 | -0.5236 | 0.3759 | -1.3927 | 0.1637 | -1.2605 | 0.2134 |
| age_group30-34 | -0.6432 | 0.3565 | -1.8041 | 0.0713 | -1.3421 | 0.0557 |
| age_group35-39 | -0.5824 | 0.3504 | -1.6618 | 0.0966 | -1.2693 | 0.1046 |
| age_group40-44 | -1.1493 | 0.3474 | -3.3085 | 0.0009 | -1.8303 | -0.4683 |
| age_group45-49 | -1.7351 | 0.3583 | -4.8431 | 0.0000 | -2.4374 | -1.0328 |
| age_group50-54 | -2.1209 | 0.3495 | -6.0681 | 0.0000 | -2.8061 | -1.4358 |
| age_group55-59 | -2.0601 | 0.3399 | -6.0601 | 0.0000 | -2.7265 | -1.3937 |
| age_group60-64 | -2.4701 | 0.3280 | -7.5315 | 0.0000 | -3.1130 | -1.8272 |
| age_group65-69 | -2.9346 | 0.3287 | -8.9278 | 0.0000 | -3.5790 | -2.2903 |
| age_group70-74 | -3.1600 | 0.3323 | -9.5093 | 0.0000 | -3.8114 | -2.5086 |
| age_group75-79 | -2.9967 | 0.3513 | -8.5298 | 0.0000 | -3.6854 | -2.3080 |
| age_group80+ | -3.3610 | 0.3554 | -9.4577 | 0.0000 | -4.0577 | -2.6644 |
| income$15-25k | -0.2396 | 0.3472 | -0.6901 | 0.4902 | -0.9202 | 0.4410 |
| income$25-35k | -0.1352 | 0.3358 | -0.4028 | 0.6871 | -0.7934 | 0.5229 |
| income$35-50k | -0.4322 | 0.3237 | -1.3352 | 0.1818 | -1.0668 | 0.2023 |
| income$50-100k | -0.5650 | 0.3048 | -1.8535 | 0.0639 | -1.1625 | 0.0326 |
| income$100-200k | -0.9855 | 0.3191 | -3.0882 | 0.0020 | -1.6110 | -0.3599 |
| income>$200k | -1.2470 | 0.3684 | -3.3846 | 0.0007 | -1.9693 | -0.5248 |
| educationHigh school diploma or GED | 0.4322 | 0.6034 | 0.7164 | 0.4738 | -0.7505 | 1.6150 |
| educationSome college or technical school | 0.9721 | 0.5136 | 1.8925 | 0.0585 | -0.0348 | 1.9790 |
| educationCollege graduate | 1.0310 | 0.5144 | 2.0043 | 0.0451 | 0.0226 | 2.0393 |
| educationGraduate or professional degree | 0.8155 | 0.5149 | 1.5840 | 0.1132 | -0.1937 | 1.8248 |
| sexFemale:smoker_currentNon_smoker | -0.6563 | 0.3801 | -1.7268 | 0.0843 | -1.4013 | 0.0887 |
anova(model_A, model_B) %>%
as.data.frame() %>%
rownames_to_column("Model") %>%
mutate(
Model = c("Model A", "Model B"),
across(where(is.numeric), ~ round(., 4))
) %>%
kable(
caption = "Partial F-Test",
col.names = c("Model", "Res. df", "RSS", "df", "Sum of Sq", "F", "p-value")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Model | Res. df | RSS | df | Sum of Sq | F | p-value |
|---|---|---|---|---|---|---|
| Model A | 7972 | 243018.9 | NA | NA | NA | NA |
| Model B | 7971 | 242928.0 | 1 | 90.871 | 2.9817 | 0.0843 |
H0: the two lines are identical; current smoking has no effect on the sex-menthlth_days relationship (neither slope nor intercept) HA: the two lines are not identical, current smoking does affect the sex-menthlth days relationship (either slope or intercept or both)
F: 2.9817 df: 1 p-value: 0.0843
I conclude the two lines are identical; I fail to reject H0, and conclude current smoking has no effect on the sex-menthlth_days relationship (neither slope nor intercept).
The estimated effect of current smoking on mentally unhealthy days among men is -0.5875. This means that holding all other variables equal, not being a current smoker results in 0.5875 less mentally unhealthy days than being a current smoker.
The estimated effect of current smoking on mentally unhealthy days among women is -1.2438. This means that holding all other variables equal, not being a current smoker results in 1.2438 less mentally unhealthy days than being a current smoker.
This indicates that smoking is more strongly associated with poor mental health in females than in males. In males, being a current smoker results in 0.5875 more mentally unhealthy days, but in females, being a current smoker results in 1.2438 more mentally unhealthy days. Thus, sex modifies the current smoker-menthlth_days relationship.
pred_int <- ggpredict(model_B, terms = c("smoker_current", "sex"))
ggplot(pred_int, aes(x = x, y = predicted, color = group, fill = group)) +
geom_point() +
geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.12, color = NA) +
labs(
title = "Predicted Mental Health Days by Sex & Smoking",
x = "Smoking status",
y = "Predicted Mentally Unhealthy Days",
color = "sex",
fill = "sex"
) +
theme_minimal(base_size = 13) +
scale_color_brewer(palette = "Set1")The figure shows there is an interaction between smoking status and sex, that is, current smoking modifies the sex-mental health days relationship. Among current smokers, females have much more mentally unhealthy days than males, but among non-smokers, the gap between sexes is much smaller. This means that the relationship between sex and mentally unhealthy days is affected by current smoking.
REFLECTION
The results suggest that physically unhealthy days, sex, age group, income, and smoking status are the determinants most strongly associated with mental health among US adults. BMI, exercise, and education were not statistically significant. Key limitations of using cross-sectional survey data is that the data was collected at a single point in time, and so it is not possible to infer causality, which requires a temporal relationship to be assessed. Additionally, there may be additional variables unaccounted for that could be confounding observed relationships. For example, the presence of chronic illness could be a confounder of the physically unhealthy days-mentally unhealthy days relationship, as well as a potential confounder of the age group-mentally unhealthy days relationship.
Adjusted R squared only changes when variables are added that significantly contribute to the prediction power of the model more than would be expected by chance, while R squared does not make this distinction. This means that when adding multiple categorical predictors, adjusted R squared is a better metric to determine linear relationships between predictors and the outcome. Groups of predictors should be tested with chunk tests because a variable may be nonsignificant on its own, but contribute information to the model as part of a group of variables.
I did not use AI at all to complete this assignment.