Submission: Knit this file to HTML, publish to RPubs with the title epi553_hw02_lastname_firstname, and paste the RPubs URL in the Brightspace submission comments. Also upload this .Rmd file to Brightspace.

AI Policy: AI tools are NOT permitted on this assignment. See the assignment description for full details.


Part 0: Data Preparation (10 points)

# Load required packages
library(tidyverse)
library(haven)
library(broom)
library(kableExtra)
library(janitor)
library(car)
library(gtsummary)
library(ggeffects)
# Import the dataset

brfss <- read_xpt("C:/Users/joshm/Documents/UAlbany/Spring 2026/EPI 553/Assignment 3/LLCP2023.xpt") %>%
  clean_names()
#select variables
brfss_work <- brfss |>
  select(menthlth, physhlth, bmi5, sexvar, exerany2, ageg5yr, incomg1, educa, smoker3)
brfss_work <- brfss_work %>%
  mutate(
      menthlth_days = case_when(menthlth == 88 ~ 0,
      menthlth >= 1 & menthlth <= 30 ~ as.numeric(menthlth),
      TRUE                        ~ NA_real_),
      
      physhlth_days = case_when(physhlth == 88 ~ 0,
      menthlth >= 1 & menthlth <= 30 ~ as.numeric(menthlth),
      TRUE                        ~ NA_real_),
      
      bmi = case_when(bmi5 == 9999 ~ NA, bmi5 < 9999 ~ bmi5/100),
      
      sex = factor(sexvar, levels = c(1, 2), labels = c("Male", "Female")),
      
      exercise = case_when(exerany2 == 7 & exerany2 == 9 ~ NA),
      exercise = factor(exerany2, levels = c(1, 2), labels = c("Yes", "No")),
      
      age_group = case_when(ageg5yr == 14 ~ NA),
      age_group = factor(ageg5yr, levels = 1:13,
      labels = c("18-24", "25-29", "30-34", "35-39", "40-44", "45-49", "50-54", "55-59", "60-64", "65-69", "70-74", "75-79", "80+") ),
      
      income = case_when(incomg1 == 9 ~ NA),
      income = factor(incomg1, levels = 1:7,
      labels = c("<$15k", "$15-25k", "$25-35k", "$35-50k",
                 "$50-100k", "$100-200k", ">$200k")),
  
      education = case_when(educa == 9 ~ NA),
      education = factor(educa, levels = 1:6, labels = 
                           c("Less than high school", 
                             "Less than high school",
                             "High school diploma or GED",
                             "Some college or technical school",
                             "College graduate",
                             "Graduate or professional degree")),
      
      smoking = case_when(smoker3 == 9 ~ NA),
      smoking = factor(smoker3, levels=1:4, labels =
                         c("Current daily smoker",
                           "Current some-day smoker",
                           "Former smoker",
                           "Never smoker"))
      
  )
cat("Total N:", nrow(brfss_work), "\n")
## Total N: 433323
cat("Missing data:", sum(is.na(brfss_work)), "\n")
## Missing data: 291293
a <- nrow(brfss_work)
b <- sum(is.na(brfss_work))
c <- (b / a)*100

cat("Missing data %:", c , "\n")
## Missing data %: 67.22306
set.seed(1220) 
brfss_analytic <- brfss_work |> 
drop_na(menthlth_days, physhlth_days, bmi, sex, exercise, 
age_group, income, education, smoking) |> 
slice_sample(n = 8000) 

cat("FInal analytic N:", nrow(brfss_analytic), "\n")
## FInal analytic N: 8000
brfss_analytic %>%
  select(menthlth_days, physhlth_days, bmi, sex, exercise, age_group, income, education, smoking) %>% 
  tbl_summary(
    label = list(
      menthlth_days ~ "Mentally unhealthy days (past 30)",
      physhlth_days ~ "Physically unhealthy days (past 30)",
      bmi     ~ "Body mass index",
      sex           ~ "Sex",
      exercise      ~ "Any exercise in past 30 days",
      age_group      ~ "Age group",
      income           ~ "Annual household income",
      education       ~ "Highest level of education completed",
      smoking         ~ "Smoking status"
    ),
    statistic = list(
      all_continuous()  ~ "{mean} ({sd})",
      all_categorical() ~ "{n} ({p}%)"
    ),
    digits = all_continuous() ~ 1,
    missing = "no"
  ) %>%
  add_n() %>%
  bold_labels() %>%
  modify_caption("**Table 1. Descriptive Statistics — BRFSS 2023 Analytic Sample (n = 8000)**")
Table 1. Descriptive Statistics — BRFSS 2023 Analytic Sample (n = 8000)
Characteristic N N = 8,0001
Mentally unhealthy days (past 30) 8,000 5.6 (9.0)
Physically unhealthy days (past 30) 8,000 3.5 (7.9)
Body mass index 8,000 28.6 (6.5)
Sex 8,000
    Male
3,921 (49%)
    Female
4,079 (51%)
Any exercise in past 30 days 8,000 6,242 (78%)
Age group 8,000
    18-24
457 (5.7%)
    25-29
424 (5.3%)
    30-34
533 (6.7%)
    35-39
590 (7.4%)
    40-44
628 (7.9%)
    45-49
538 (6.7%)
    50-54
627 (7.8%)
    55-59
694 (8.7%)
    60-64
819 (10%)
    65-69
818 (10%)
    70-74
759 (9.5%)
    75-79
577 (7.2%)
    80+
536 (6.7%)
Annual household income 8,000
    <$15k
424 (5.3%)
    $15-25k
643 (8.0%)
    $25-35k
812 (10%)
    $35-50k
1,102 (14%)
    $50-100k
2,551 (32%)
    $100-200k
1,809 (23%)
    >$200k
659 (8.2%)
Highest level of education completed 8,000
    Less than high school
127 (1.6%)
    High school diploma or GED
254 (3.2%)
    Some college or technical school
1,795 (22%)
    College graduate
2,144 (27%)
    Graduate or professional degree
3,680 (46%)
Smoking status 8,000
    Current daily smoker
691 (8.6%)
    Current some-day smoker
274 (3.4%)
    Former smoker
2,195 (27%)
    Never smoker
4,840 (61%)
1 Mean (SD); n (%)
model <- lm(menthlth_days ~ physhlth_days + bmi + sex + exercise + age_group + income + education + smoking, data = brfss_analytic)
summary(model)
## 
## Call:
## lm(formula = menthlth_days ~ physhlth_days + bmi + sex + exercise + 
##     age_group + income + education + smoking, data = brfss_analytic)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.8259 -2.4065 -1.4297 -0.1347 29.4079 
## 
## Coefficients:
##                                            Estimate Std. Error t value Pr(>|t|)
## (Intercept)                                4.818398   0.692538   6.958 3.74e-12
## physhlth_days                              0.850795   0.008271 102.870  < 2e-16
## bmi                                        0.010978   0.009866   1.113 0.265864
## sexFemale                                  0.586915   0.125583   4.674 3.01e-06
## exerciseNo                                 0.272611   0.158988   1.715 0.086445
## age_group25-29                            -0.602502   0.376024  -1.602 0.109130
## age_group30-34                            -0.768963   0.357147  -2.153 0.031343
## age_group35-39                            -0.727709   0.351562  -2.070 0.038491
## age_group40-44                            -1.342610   0.349512  -3.841 0.000123
## age_group45-49                            -1.924205   0.360173  -5.342 9.42e-08
## age_group50-54                            -2.272300   0.350703  -6.479 9.76e-11
## age_group55-59                            -2.230300   0.341562  -6.530 6.99e-11
## age_group60-64                            -2.648336   0.329843  -8.029 1.12e-15
## age_group65-69                            -3.152289   0.331799  -9.501  < 2e-16
## age_group70-74                            -3.385648   0.335638 -10.087  < 2e-16
## age_group75-79                            -3.244898   0.355529  -9.127  < 2e-16
## age_group80+                              -3.594839   0.358756 -10.020  < 2e-16
## income$15-25k                             -0.237845   0.346795  -0.686 0.492836
## income$25-35k                             -0.132541   0.335342  -0.395 0.692677
## income$35-50k                             -0.445473   0.323322  -1.378 0.168303
## income$50-100k                            -0.569835   0.304430  -1.872 0.061269
## income$100-200k                           -1.005289   0.318742  -3.154 0.001617
## income>$200k                              -1.256938   0.368011  -3.415 0.000640
## educationHigh school diploma or GED        0.381209   0.602595   0.633 0.527005
## educationSome college or technical school  0.904135   0.513082   1.762 0.078080
## educationCollege graduate                  0.975882   0.513649   1.900 0.057482
## educationGraduate or professional degree   0.832938   0.513798   1.621 0.105028
## smokingCurrent some-day smoker            -0.842184   0.394788  -2.133 0.032934
## smokingFormer smoker                      -0.738961   0.250086  -2.955 0.003138
## smokingNever smoker                       -1.368381   0.237205  -5.769 8.28e-09
##                                              
## (Intercept)                               ***
## physhlth_days                             ***
## bmi                                          
## sexFemale                                 ***
## exerciseNo                                .  
## age_group25-29                               
## age_group30-34                            *  
## age_group35-39                            *  
## age_group40-44                            ***
## age_group45-49                            ***
## age_group50-54                            ***
## age_group55-59                            ***
## age_group60-64                            ***
## age_group65-69                            ***
## age_group70-74                            ***
## age_group75-79                            ***
## age_group80+                              ***
## income$15-25k                                
## income$25-35k                                
## income$35-50k                                
## income$50-100k                            .  
## income$100-200k                           ** 
## income>$200k                              ***
## educationHigh school diploma or GED          
## educationSome college or technical school .  
## educationCollege graduate                 .  
## educationGraduate or professional degree     
## smokingCurrent some-day smoker            *  
## smokingFormer smoker                      ** 
## smokingNever smoker                       ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.514 on 7970 degrees of freedom
## Multiple R-squared:  0.6257, Adjusted R-squared:  0.6243 
## F-statistic: 459.4 on 29 and 7970 DF,  p-value: < 2.2e-16
coef(model)
##                               (Intercept) 
##                                4.81839770 
##                             physhlth_days 
##                                0.85079499 
##                                       bmi 
##                                0.01097774 
##                                 sexFemale 
##                                0.58691470 
##                                exerciseNo 
##                                0.27261058 
##                            age_group25-29 
##                               -0.60250175 
##                            age_group30-34 
##                               -0.76896262 
##                            age_group35-39 
##                               -0.72770861 
##                            age_group40-44 
##                               -1.34260981 
##                            age_group45-49 
##                               -1.92420498 
##                            age_group50-54 
##                               -2.27229975 
##                            age_group55-59 
##                               -2.23029964 
##                            age_group60-64 
##                               -2.64833599 
##                            age_group65-69 
##                               -3.15228943 
##                            age_group70-74 
##                               -3.38564774 
##                            age_group75-79 
##                               -3.24489757 
##                              age_group80+ 
##                               -3.59483883 
##                             income$15-25k 
##                               -0.23784516 
##                             income$25-35k 
##                               -0.13254056 
##                             income$35-50k 
##                               -0.44547317 
##                            income$50-100k 
##                               -0.56983491 
##                           income$100-200k 
##                               -1.00528854 
##                              income>$200k 
##                               -1.25693832 
##       educationHigh school diploma or GED 
##                                0.38120905 
## educationSome college or technical school 
##                                0.90413454 
##                 educationCollege graduate 
##                                0.97588172 
##  educationGraduate or professional degree 
##                                0.83293760 
##            smokingCurrent some-day smoker 
##                               -0.84218438 
##                      smokingFormer smoker 
##                               -0.73896133 
##                       smokingNever smoker 
##                               -1.36838119

Step 2: Write the fitted regression equation including all terms, with coefficients rounded to three decimal places.

Menthlth_days = 4.818 + (physhlth_days * 0.850) + (bmi * 0.010) + (sexFemale * 0.586) + (exerciseNo * 0.272) + (age_group25-29 * -0.602) + (age_group30-34 * -0.768) + (age_group35-39 * -0.727) + (age_group40-44 * -1.342) + (age_group45-49 * -1.924) + (age_group50-54 * -2.272) + (age_group55-59 * -2.230) + (age_group60-64 * -2.648) + (age_group65-69 * -3.152) + (age_group70-74 * -3.385) + (age_group75-79 * -3.244) + (age_group80+ * -3.594) + (income15-25k * -0.237) + income25-35k * -0.132) + (income35-50k * -0.445) + income50-100k * -0.569) + (income100-200k * -1.005) + (income>$200k * -1.256) + (educationHigh school diploma or GED * 0.381) + (educationSome college or technical school * 0.904) + (educationCollege graduate * 0.975 + (educationGraduate or professional degree * 0.832) + (smokingCurrent some-day smoker * -0.842) + (smokingFormer smoker * -0.738) + (smokingNever smoker * -1.368)

Step 3: Interpret the following coefficients in plain language. For each, state the direction, magnitude, and meaning in terms of mentally unhealthy days, holding all other variables constant: • physhlth_days (continuous predictor)

For every 1 day increase in physhlth_days, menthlth days increases by 0.850, keeping all other variables constant.

• bmi (continuous predictor)

For every 1 unit increase in bmi, menthlth days increases by 0.010, keeping all other variables constant.

• sex: Female vs. Male (reference)

Being female is associated with 0.586 more mentally unhealthy days per month compared to males, keeping all other variables constant.

• exercise: Yes vs. No (reference)

Not exercising is associated with 0.272 more mentally unhealthy days per month compared to exercising, keeping all other variables constant.

• One age_group coefficient of your choice

Being 30-34 is associated with 0.768 less mentally unhealthy days per month than being 18-24, keeping all other variables constant.

• Two income coefficients of your choice, compared to the reference category (Less than $15,000)

Making >200K is associated with 1.256 less mentally unhealthy days per month than making <15K, and making 35-50K is associated with 0.445 less mentally unhealthy days per month than making <15K, keeping all other variables constant.

Step 4: Report and interpret each of the following model-level statistics: • R-squared: proportion of variance in mentally unhealthy days explained by all predictors

0.6257: all predictors explain 62.57% of the variance in menthlth_days.

• Adjusted R-squared: how it differs from R-squared and what it adds

0.6243: all predictors explain 62.43% of the variance in menthlth_days. Differs from R2 because adjusted R2 only increases when a predictor is added that improves the model by more than would be expected by chance.

• Root MSE (residual standard error): average prediction error of the model

5.514: average of deviations from the linear model. Datapoints are 5.514 mental health days off from the regression model, on average.

• Overall F-test: state H0, F-statistic, degrees of freedom (numerator and denominator), p-value, and conclusion

H0: all the coefficients are 0 - there is no linear relationship between any predictor and the outcome. F: 459.4 df1 = 29, df2 = 7970 p-value: <2.2e-16

Conclusion: Reject H0; there is a linear relationship between some predictors and the outcome.

Anova(model, type = "III") |>
  tidy() |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(caption = "Type 3 ANOVA on Full Model") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Type 3 ANOVA on Full Model
term sumsq df statistic p.value
(Intercept) 1471.8139 1 48.4081 0.0000
physhlth_days 321747.4695 1 10582.2957 0.0000
bmi 37.6448 1 1.2381 0.2659
sex 664.0808 1 21.8417 0.0000
exercise 89.3911 1 2.9401 0.0864
age_group 9085.6971 12 24.9024 0.0000
income 760.5243 6 4.1689 0.0003
education 179.4841 4 1.4758 0.2066
smoking 1313.1821 3 14.3969 0.0000
Residuals 242322.4044 7970 NA NA

Physhlth_days, sex, age_group, income, and smoking have statistically significant partial associations.

reduced <- lm(menthlth_days ~ physhlth_days + bmi + sex + exercise + age_group + education + smoking, data = brfss_analytic)

anova(reduced, model) %>%
  as.data.frame() %>%
  rownames_to_column("Model") %>%
  mutate(
    Model = c("Reduced (no Income)", "Full (with Income)"),
    across(where(is.numeric), ~ round(., 4))
  ) %>%
  kable(
    caption = "Partial F-Test: Does Income add to the model?",
    col.names = c("Model", "Res. df", "RSS", "df", "Sum of Sq", "F", "p-value")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Partial F-Test: Does Income add to the model?
Model Res. df RSS df Sum of Sq F p-value
Reduced (no Income) 7976 243082.9 NA NA NA NA
Full (with Income) 7970 242322.4 6 760.5243 4.1689 3e-04

H0: Income does not add information to the prediction of mental health days, given the other variables present in the model. HA: Income adds information to the prediction of mental health days, given the other variables present in the model.

F: 4.1689 df: 6 p-value: 3e-04

I reject the null hypothesis and conclude that income adds information to the prediction of mental health days even given the other variables present.

reduced2 <- lm(menthlth_days ~ physhlth_days + bmi + sex + exercise + age_group + income + smoking, data = brfss_analytic)

anova(reduced2, model) %>%
  as.data.frame() %>%
  rownames_to_column("Model") %>%
  mutate(
    Model = c("Reduced (no Education)", "Full (with Education)"),
    across(where(is.numeric), ~ round(., 4))
  ) %>%
  kable(
    caption = "Partial F-Test: Does Education add to the model?",
    col.names = c("Model", "Res. df", "RSS", "df", "Sum of Sq", "F", "p-value")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Partial F-Test: Does Education add to the model?
Model Res. df RSS df Sum of Sq F p-value
Reduced (no Education) 7974 242501.9 NA NA NA NA
Full (with Education) 7970 242322.4 4 179.4841 1.4758 0.2066

H0: Education does not add information to the prediction of mental health days, given the other variables present in the model. HA: Education adds information to the prediction of mental health days, given the other variables present in the model.

F: 1.4758 df: 4 p-value: 0.2066

I do not reject the null hypothesis and conclude that there is no evidence education adds information to the prediction of mental health days given the other variables present.

The results of the Type III ANOVA indicate that Physhlth_days, sex, age_group, income, and smoking significantly contribute to the model given that all other variables are already in the model. These predictors make the strongest independent contributions. The chunk tests indicate whether a group of variables adds to the model compared to a reduced model without that group of variables, and in this case, it was found that income but not education contributes significantly to the model. This is important because even if a single variable is not significant in the overall Type III ANOVA, it could still be an important contributor within a group of variables.

brfss_analytic <- brfss_analytic %>%
  mutate(
      smoker_current = case_when(smoking == "Current some-day smoker" ~ 0,
                                 smoking == "Current daily smoker" ~ 0,
                                 smoking == "Former smoker" ~ 1,
                                 smoking == "Never smoker" ~ 1),
  

      smoker_current = factor(smoker_current, levels=0:1, labels =
                         c("Current smoker",
                           "Non_smoker")) )

model_A <- lm(menthlth_days ~ physhlth_days + bmi + sex + smoker_current + exercise + age_group + income + education, data = brfss_analytic)
tidy(model_A, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Model A",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Model A
Term Estimate SE t p-value CI Lower CI Upper
(Intercept) 4.3262 0.6787 6.3742 0.0000 2.9958 5.6566
physhlth_days 0.8526 0.0083 103.0635 0.0000 0.8364 0.8688
bmi 0.0111 0.0099 1.1192 0.2631 -0.0083 0.0304
sexFemale 0.5489 0.1254 4.3752 0.0000 0.3029 0.7948
smoker_currentNon_smoker -0.9071 0.2017 -4.4978 0.0000 -1.3025 -0.5118
exerciseNo 0.2770 0.1592 1.7405 0.0818 -0.0350 0.5890
age_group25-29 -0.5205 0.3760 -1.3844 0.1663 -1.2575 0.2165
age_group30-34 -0.6452 0.3566 -1.8093 0.0704 -1.3442 0.0538
age_group35-39 -0.5765 0.3505 -1.6450 0.1000 -1.2635 0.1105
age_group40-44 -1.1457 0.3474 -3.2977 0.0010 -1.8267 -0.4647
age_group45-49 -1.7312 0.3583 -4.8316 0.0000 -2.4335 -1.0288
age_group50-54 -2.1153 0.3495 -6.0517 0.0000 -2.8005 -1.4301
age_group55-59 -2.0583 0.3400 -6.0539 0.0000 -2.7247 -1.3918
age_group60-64 -2.4695 0.3280 -7.5287 0.0000 -3.1125 -1.8265
age_group65-69 -2.9304 0.3287 -8.9139 0.0000 -3.5748 -2.2859
age_group70-74 -3.1576 0.3323 -9.5012 0.0000 -3.8091 -2.5062
age_group75-79 -2.9905 0.3513 -8.5116 0.0000 -3.6793 -2.3018
age_group80+ -3.3630 0.3554 -9.4622 0.0000 -4.0597 -2.6663
income$15-25k -0.2432 0.3472 -0.7003 0.4837 -0.9239 0.4375
income$25-35k -0.1420 0.3358 -0.4229 0.6723 -0.8002 0.5162
income$35-50k -0.4414 0.3237 -1.3637 0.1727 -1.0760 0.1931
income$50-100k -0.5743 0.3048 -1.8842 0.0596 -1.1719 0.0232
income$100-200k -0.9893 0.3191 -3.0998 0.0019 -1.6149 -0.3637
income>$200k -1.2491 0.3685 -3.3899 0.0007 -1.9714 -0.5268
educationHigh school diploma or GED 0.4668 0.6031 0.7741 0.4389 -0.7154 1.6491
educationSome college or technical school 1.0098 0.5132 1.9675 0.0492 0.0037 2.0159
educationCollege graduate 1.0710 0.5139 2.0839 0.0372 0.0635 2.0784
educationGraduate or professional degree 0.8558 0.5144 1.6637 0.0962 -0.1525 1.8641
model_B <- lm(menthlth_days ~ physhlth_days + bmi + sex * smoker_current + exercise + age_group + income + education, data = brfss_analytic)
tidy(model_B, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Model B",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Model B
Term Estimate SE t p-value CI Lower CI Upper
(Intercept) 4.0900 0.6923 5.9081 0.0000 2.7330 5.4470
physhlth_days 0.8522 0.0083 102.9927 0.0000 0.8360 0.8684
bmi 0.0108 0.0099 1.0962 0.2730 -0.0085 0.0302
sexFemale 1.1269 0.3575 3.1522 0.0016 0.4261 1.8277
smoker_currentNon_smoker -0.5875 0.2737 -2.1465 0.0319 -1.1241 -0.0510
exerciseNo 0.2758 0.1591 1.7330 0.0831 -0.0362 0.5878
age_group25-29 -0.5236 0.3759 -1.3927 0.1637 -1.2605 0.2134
age_group30-34 -0.6432 0.3565 -1.8041 0.0713 -1.3421 0.0557
age_group35-39 -0.5824 0.3504 -1.6618 0.0966 -1.2693 0.1046
age_group40-44 -1.1493 0.3474 -3.3085 0.0009 -1.8303 -0.4683
age_group45-49 -1.7351 0.3583 -4.8431 0.0000 -2.4374 -1.0328
age_group50-54 -2.1209 0.3495 -6.0681 0.0000 -2.8061 -1.4358
age_group55-59 -2.0601 0.3399 -6.0601 0.0000 -2.7265 -1.3937
age_group60-64 -2.4701 0.3280 -7.5315 0.0000 -3.1130 -1.8272
age_group65-69 -2.9346 0.3287 -8.9278 0.0000 -3.5790 -2.2903
age_group70-74 -3.1600 0.3323 -9.5093 0.0000 -3.8114 -2.5086
age_group75-79 -2.9967 0.3513 -8.5298 0.0000 -3.6854 -2.3080
age_group80+ -3.3610 0.3554 -9.4577 0.0000 -4.0577 -2.6644
income$15-25k -0.2396 0.3472 -0.6901 0.4902 -0.9202 0.4410
income$25-35k -0.1352 0.3358 -0.4028 0.6871 -0.7934 0.5229
income$35-50k -0.4322 0.3237 -1.3352 0.1818 -1.0668 0.2023
income$50-100k -0.5650 0.3048 -1.8535 0.0639 -1.1625 0.0326
income$100-200k -0.9855 0.3191 -3.0882 0.0020 -1.6110 -0.3599
income>$200k -1.2470 0.3684 -3.3846 0.0007 -1.9693 -0.5248
educationHigh school diploma or GED 0.4322 0.6034 0.7164 0.4738 -0.7505 1.6150
educationSome college or technical school 0.9721 0.5136 1.8925 0.0585 -0.0348 1.9790
educationCollege graduate 1.0310 0.5144 2.0043 0.0451 0.0226 2.0393
educationGraduate or professional degree 0.8155 0.5149 1.5840 0.1132 -0.1937 1.8248
sexFemale:smoker_currentNon_smoker -0.6563 0.3801 -1.7268 0.0843 -1.4013 0.0887
anova(model_A, model_B) %>%
  as.data.frame() %>%
  rownames_to_column("Model") %>%
  mutate(
    Model = c("Model A", "Model B"),
    across(where(is.numeric), ~ round(., 4))
  ) %>%
  kable(
    caption = "Partial F-Test",
    col.names = c("Model", "Res. df", "RSS", "df", "Sum of Sq", "F", "p-value")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Partial F-Test
Model Res. df RSS df Sum of Sq F p-value
Model A 7972 243018.9 NA NA NA NA
Model B 7971 242928.0 1 90.871 2.9817 0.0843

H0: the two lines are identical; current smoking has no effect on the sex-menthlth_days relationship (neither slope nor intercept) HA: the two lines are not identical, current smoking does affect the sex-menthlth days relationship (either slope or intercept or both)

F: 2.9817 df: 1 p-value: 0.0843

I conclude the two lines are identical; I fail to reject H0, and conclude current smoking has no effect on the sex-menthlth_days relationship (neither slope nor intercept).

The estimated effect of current smoking on mentally unhealthy days among men is -0.5875. This means that holding all other variables equal, not being a current smoker results in 0.5875 less mentally unhealthy days than being a current smoker.

The estimated effect of current smoking on mentally unhealthy days among women is -1.2438. This means that holding all other variables equal, not being a current smoker results in 1.2438 less mentally unhealthy days than being a current smoker.

This indicates that smoking is more strongly associated with poor mental health in females than in males. In males, being a current smoker results in 0.5875 more mentally unhealthy days, but in females, being a current smoker results in 1.2438 more mentally unhealthy days. Thus, sex modifies the current smoker-menthlth_days relationship.

pred_int <- ggpredict(model_B, terms = c("smoker_current", "sex"))

ggplot(pred_int, aes(x = x, y = predicted, color = group, fill = group)) +
  geom_point() +
  geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.12, color = NA) +
  labs(
    title    = "Predicted Mental Health Days by Sex & Smoking",
    x        = "Smoking status",
    y        = "Predicted Mentally Unhealthy Days",
    color    = "sex",
    fill     = "sex"
  ) +
  theme_minimal(base_size = 13) +
  scale_color_brewer(palette = "Set1")

The figure shows there is an interaction between smoking status and sex, that is, current smoking modifies the sex-mental health days relationship. Among current smokers, females have much more mentally unhealthy days than males, but among non-smokers, the gap between sexes is much smaller. This means that the relationship between sex and mentally unhealthy days is affected by current smoking.

REFLECTION

The results suggest that physically unhealthy days, sex, age group, income, and smoking status are the determinants most strongly associated with mental health among US adults. BMI, exercise, and education were not statistically significant. Key limitations of using cross-sectional survey data is that the data was collected at a single point in time, and so it is not possible to infer causality, which requires a temporal relationship to be assessed. Additionally, there may be additional variables unaccounted for that could be confounding observed relationships. For example, the presence of chronic illness could be a confounder of the physically unhealthy days-mentally unhealthy days relationship, as well as a potential confounder of the age group-mentally unhealthy days relationship.

Adjusted R squared only changes when variables are added that significantly contribute to the prediction power of the model more than would be expected by chance, while R squared does not make this distinction. This means that when adding multiple categorical predictors, adjusted R squared is a better metric to determine linear relationships between predictors and the outcome. Groups of predictors should be tested with chunk tests because a variable may be nonsignificant on its own, but contribute information to the model as part of a group of variables.

I did not use AI at all to complete this assignment.