What individual and behavioral factors are associated with the number of mentally unhealthy days reported by U.S. adults, and does the association between current smoking and mental health differ by sex?
Source: https://www.cdc.gov/brfss/annual_data/annual_2023.html
If there is already a file named “working_data.rds” in the data folder in the directory, then read the rds cache in. If there is not, read from the XPT. Write the rds cache only if it already exists, to save processing time.
filename <- "LLCP2023.XPT"
filename <- paste0("data/", filename)
df <- if (file.exists("data/working_data.rds")) {
print("Loading working data from cache: data/working_data.rds")
readRDS("data/working_data.rds")
} else {
print(paste0("Loading full XPT from ", filename))
haven::read_xpt(filename)
}## [1] "Loading working data from cache: data/working_data.rds"
if (file.exists("data/working_data.rds")) {
print("nah it's cool don't rewrite")
} else {
saveRDS(df, "data/working_data.rds")
print("Writing rds")
}## [1] "nah it's cool don't rewrite"
There are currently 433323 observations in the dataset.
The following nine variables are required for this assignment.
| Raw Variable Name | Description | Your Recoded Name |
|---|---|---|
| MENTHLTH | Mentally unhealthy days in past 30 (outcome) | menthlth_days |
| PHYSHLTH | Physically unhealthy days in past 30 | physhlth_days |
| _BMI5 | Body mass index × 100 (divide by 100) | bmi |
| SEXVAR | Sex (1 = Male, 2 = Female) | sex |
| EXERANY2 | Any exercise in past 30 days (1 = Yes, 2 = No) | exercise |
| _AGEG5YR | Age group in 5-year categories (13 levels) | age_group |
| _INCOMG1 | Annual household income (7 levels) | income |
| EDUCA | Highest level of education completed | education |
| _SMOKER3 | Smoking status (4 levels) | smoking |
df <- df %>%
select(MENTHLTH, PHYSHLTH, `_BMI5`, SEXVAR, EXERANY2, `_AGEG5YR`, `_INCOMG1`, EDUCA, `_SMOKER3`) %>%
mutate(
##Recode MENTHLTH to menthlth_days
menthlth_days = case_when(
MENTHLTH == 88 ~ 0,
MENTHLTH == 99 ~ NA,
MENTHLTH == 77 ~ NA,
TRUE ~ MENTHLTH),
##Recode PHYSHLTH to physhlth_days
physhlth_days = case_when(
PHYSHLTH == 88 ~ 0,
PHYSHLTH == 99 ~ NA,
PHYSHLTH == 77 ~ NA,
TRUE ~ PHYSHLTH),
##Recode __BMI5 to bmi - divide by 100
bmi = case_when(`_BMI5` == 9999 ~ NA,
TRUE ~ `_BMI5` /100),
## Recode sex as factor
sex = factor(
ifelse(SEXVAR %in% c(7, 9), NA, SEXVAR),
levels = c(1, 2),
labels = c("Male", "Female")
),
## Recode exerany2 as factor
exercise = factor(
ifelse(EXERANY2 %in% c(7, 9), NA, EXERANY2),
levels = c(1, 2),
labels = c("Yes", "No")
),
## Recode age group
age_group = factor(ifelse(`_AGEG5YR` %in% c(14), NA,`_AGEG5YR`),
levels = c(1,2,3,4,5,6,7,8,9,10,11,12,13),
labels = c("18-24", "25-29", "30-34", "35-39",
"40-44", "45-49", "50-54", "55-59",
"60-64", "65-69", "70-74", "75-79",
"80+")
),
##recode income group
income = factor(ifelse(`_INCOMG1` == 9, NA, `_INCOMG1`),
levels = c(1,2,3,4,5, 6, 7),
labels = c("Less than $15,000",
"$15,000 to less than $25,000",
"$25,000 to less than $35,000",
"$35,000 to less than $50,000",
"$50,000 to less than $100,000",
"$100,000 to less than $200,000",
"$200,000 or more")
),
education = factor(
case_when(
EDUCA == 9 ~ NA_character_,
EDUCA %in% c(1, 2) ~ "Less than high school",
EDUCA == 3 ~ "High school diploma or GED",
EDUCA == 4 ~ "Some college or technical school",
EDUCA == 5 ~ "College graduate",
EDUCA == 6 ~ "Graduate or professional degree"
),
levels = c(
"Less than high school",
"High school diploma or GED",
"Some college or technical school",
"College graduate",
"Graduate or professional degree"
)
),
smoking = factor(ifelse(`_SMOKER3` == 9, NA, `_SMOKER3`),
levels = c(1,2,3,4),
labels = c("Current smoker - now smokes every day",
"Current smoker - now smokes some days",
"Former smoker",
"Never smoked"))
)
#QA - Print top 3 rows, output for new var
(df[1:3, (ncol(df)-8):ncol(df)]) %>%
kable(caption = "3 row sample of recoded variables"
)| menthlth_days | physhlth_days | bmi | sex | exercise | age_group | income | education | smoking |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 30.47 | Female | No | 80+ | NA | College graduate | Never smoked |
| 0 | 0 | 28.56 | Female | Yes | 80+ | NA | College graduate | Never smoked |
| 2 | 6 | 22.31 | Female | Yes | 80+ | Less than $15,000 | Some college or technical school | Former smoker |
After recoding, create your analytic dataset by removing observations with missing values on all nine analysis variables, then drawing a random sample of 8,000 observations. Report the number and percentage of missing observations for menthlth_days and at least two other key variables, before removing any missing values
Using set.seed(1220) ensures that your results are reproducible and comparable across submissions (this is critical)
This helper function was written for my final project. Adjusting it so it looks at everything in an input list, rather than just by prefix.
missingness <- function(w_data, colsearch){
w_data %>%
summarize(
across(all_of(colsearch),
~sum(is.na(.x)), .names = "{.col}")) %>%
pivot_longer(cols = everything(),
names_to = "variable",
values_to = "nmissing") %>%
mutate(
run = unique(w_data$v_year),
n = nrow(w_data),
pct_missing = nmissing / n * 100
)
}Call helper function on the list
#list of columns to search in
search <- c("menthlth_days", "physhlth_days", "bmi", "sex", "exercise", "age_group", "income", "education", "smoking")
#
a_df <- df
test <- missingness(a_df, search)
test %>% kable()| variable | nmissing | n | pct_missing |
|---|---|---|---|
| menthlth_days | 8108 | 433323 | 1.8711215 |
| physhlth_days | 10785 | 433323 | 2.4889055 |
| bmi | 40535 | 433323 | 9.3544538 |
| sex | 0 | 433323 | 0.0000000 |
| exercise | 1251 | 433323 | 0.2886992 |
| age_group | 7779 | 433323 | 1.7951967 |
| income | 86623 | 433323 | 19.9903998 |
| education | 2325 | 433323 | 0.5365513 |
| smoking | 23062 | 433323 | 5.3221269 |
Call helper function on the list
## Drop if missing on any values in this list.
a_df <- a_df |>
drop_na(search)
n_end <- nrow(a_df)
set.seed(1220)
a_df <- a_df |> slice_sample(n=8000)The starting dataset had 433323 observations, and with exclusions of any variables for complete case analysis, the dataset had 310768 observations. A sample of 8000 observations will be used for the analysis.
table1 <- a_df %>%
tbl_summary(
include = search,
label= list(menthlth_days ~ "Mentally unhealthy days in past 30 days",
physhlth_days ~ "Physically unhealthy days in past 30 days",
bmi ~ "Body mass index",
exercise ~ "Exercise in past 50 days",
age_group ~ "Age (years)",
income ~ "Anual household income",
education ~ "Highest level of education completed",
smoking ~ "Smoking status"
),
by = sex) |>
modify_caption(caption= "Table 1. Characteristics of sample of 2023 BRFSS particpants (n=8000)")
as_flex_table(table1)Characteristic | Male | Female |
|---|---|---|
Mentally unhealthy days in past 30 days | 0 (0, 3) | 0 (0, 5) |
Physically unhealthy days in past 30 days | 0 (0, 3) | 0 (0, 5) |
Body mass index | 28 (25, 31) | 27 (24, 32) |
Exercise in past 50 days | 3,146 (80%) | 3,094 (76%) |
Age (years) | ||
18-24 | 235 (6.0%) | 171 (4.2%) |
25-29 | 219 (5.6%) | 189 (4.7%) |
30-34 | 253 (6.4%) | 210 (5.2%) |
35-39 | 263 (6.7%) | 302 (7.4%) |
40-44 | 290 (7.4%) | 292 (7.2%) |
45-49 | 266 (6.8%) | 252 (6.2%) |
50-54 | 305 (7.7%) | 303 (7.5%) |
55-59 | 308 (7.8%) | 352 (8.7%) |
60-64 | 408 (10%) | 379 (9.3%) |
65-69 | 418 (11%) | 483 (12%) |
70-74 | 382 (9.7%) | 426 (10%) |
75-79 | 325 (8.3%) | 338 (8.3%) |
80+ | 264 (6.7%) | 367 (9.0%) |
Anual household income | ||
Less than $15,000 | 160 (4.1%) | 247 (6.1%) |
$15,000 to less than $25,000 | 271 (6.9%) | 370 (9.1%) |
$25,000 to less than $35,000 | 376 (9.6%) | 495 (12%) |
$35,000 to less than $50,000 | 482 (12%) | 585 (14%) |
$50,000 to less than $100,000 | 1,251 (32%) | 1,260 (31%) |
$100,000 to less than $200,000 | 996 (25%) | 869 (21%) |
$200,000 or more | 400 (10%) | 238 (5.9%) |
Highest level of education completed | ||
Less than high school | 75 (1.9%) | 49 (1.2%) |
High school diploma or GED | 130 (3.3%) | 122 (3.0%) |
Some college or technical school | 950 (24%) | 877 (22%) |
College graduate | 1,018 (26%) | 1,120 (28%) |
Graduate or professional degree | 1,763 (45%) | 1,896 (47%) |
Smoking status | ||
Current smoker - now smokes every day | 339 (8.6%) | 319 (7.8%) |
Current smoker - now smokes some days | 151 (3.8%) | 117 (2.9%) |
Former smoker | 1,207 (31%) | 1,055 (26%) |
Never smoked | 2,239 (57%) | 2,573 (63%) |
1Median (Q1, Q3); n (%) | ||
Research question: What is the independent association of each predictor with the number of mentally unhealthy days in the past 30 days?
Use menthlth_days as the outcome and the following predictors: physhlth_days, bmi, sex, exercise, age_group, income, education, and smoking
# Model : Full multivariable model
m1 <- lm(menthlth_days ~ physhlth_days + bmi + sex + exercise + age_group + income + education + smoking,
data = a_df)
b <- round(coef(m1), 3)
ci <- round(confint(m1), 3)
summary(m1)##
## Call:
## lm(formula = menthlth_days ~ physhlth_days + bmi + sex + exercise +
## age_group + income + education + smoking, data = a_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.080 -3.865 -1.597 0.712 30.471
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 8.99937 0.93968 9.577
## physhlth_days 0.26558 0.01007 26.384
## bmi 0.03338 0.01321 2.527
## sexFemale 1.39038 0.16750 8.301
## exerciseNo 0.65116 0.21472 3.033
## age_group25-29 -1.05660 0.51938 -2.034
## age_group30-34 -1.09395 0.50646 -2.160
## age_group35-39 -1.81103 0.48851 -3.707
## age_group40-44 -2.89307 0.48749 -5.935
## age_group45-49 -3.05618 0.49769 -6.141
## age_group50-54 -3.51674 0.48323 -7.277
## age_group55-59 -4.49597 0.47555 -9.454
## age_group60-64 -4.52215 0.45848 -9.863
## age_group65-69 -5.57795 0.45019 -12.390
## age_group70-74 -6.02536 0.45741 -13.173
## age_group75-79 -6.28656 0.47484 -13.239
## age_group80+ -6.81968 0.47684 -14.302
## income$15,000 to less than $25,000 -1.63703 0.46899 -3.491
## income$25,000 to less than $35,000 -2.07637 0.44797 -4.635
## income$35,000 to less than $50,000 -2.54685 0.43819 -5.812
## income$50,000 to less than $100,000 -3.05048 0.41069 -7.428
## income$100,000 to less than $200,000 -3.49984 0.42923 -8.154
## income$200,000 or more -3.78409 0.50036 -7.563
## educationHigh school diploma or GED 0.07991 0.81066 0.099
## educationSome college or technical school 1.07679 0.68973 1.561
## educationCollege graduate 1.79091 0.69119 2.591
## educationGraduate or professional degree 1.73781 0.69250 2.509
## smokingCurrent smoker - now smokes some days -1.58670 0.53523 -2.965
## smokingFormer smoker -1.98971 0.33713 -5.902
## smokingNever smoked -2.93681 0.32162 -9.131
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## physhlth_days < 2e-16 ***
## bmi 0.011510 *
## sexFemale < 2e-16 ***
## exerciseNo 0.002432 **
## age_group25-29 0.041950 *
## age_group30-34 0.030803 *
## age_group35-39 0.000211 ***
## age_group40-44 3.07e-09 ***
## age_group45-49 8.61e-10 ***
## age_group50-54 3.72e-13 ***
## age_group55-59 < 2e-16 ***
## age_group60-64 < 2e-16 ***
## age_group65-69 < 2e-16 ***
## age_group70-74 < 2e-16 ***
## age_group75-79 < 2e-16 ***
## age_group80+ < 2e-16 ***
## income$15,000 to less than $25,000 0.000485 ***
## income$25,000 to less than $35,000 3.63e-06 ***
## income$35,000 to less than $50,000 6.40e-09 ***
## income$50,000 to less than $100,000 1.22e-13 ***
## income$100,000 to less than $200,000 4.07e-16 ***
## income$200,000 or more 4.38e-14 ***
## educationHigh school diploma or GED 0.921484
## educationSome college or technical school 0.118520
## educationCollege graduate 0.009585 **
## educationGraduate or professional degree 0.012111 *
## smokingCurrent smoker - now smokes some days 0.003041 **
## smokingFormer smoker 3.74e-09 ***
## smokingNever smoked < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.352 on 7970 degrees of freedom
## Multiple R-squared: 0.1853, Adjusted R-squared: 0.1824
## F-statistic: 62.52 on 29 and 7970 DF, p-value: < 2.2e-16
\[ \begin{aligned} \widehat{\text{Mental Health Days}} =\;& 8.999 + 0.266(\text{Phys. Health Days}) + 0.033(\text{BMI}) + 1.39(\text{Sex: Female}) + 0.651(\text{Exercise: No}) \\ &+ -1.057(\text{Age: 25–29}) + -1.094(\text{Age: 30–34}) + -1.811(\text{Age: 35–39}) + -2.893(\text{Age: 40–44}) \\ &+ -3.056(\text{Age: 45–49}) + -3.517(\text{Age: 50–54}) + -4.496(\text{Age: 55–59}) + -4.522(\text{Age: 60–64}) \\ &+ -5.578(\text{Age: 65–69}) + -6.025(\text{Age: 70–74}) + -6.287(\text{Age: 75–80}) + -6.82(\text{Age: 80+}) \\ &+ -1.637(\text{Income: \$15k–\$25k}) + -2.076(\text{Income: \$25k–\$35k}) + -2.547(\text{Income: \$35k–\$50k}) + -3.05(\text{Income: \$50k-\$100k}) + -3.5(\text{Income: \$100k-\$200k}) + -3.784(\text{Income: \$200k+}) \\ &+ 0.08(\text{Education: High school diploma or GED}) + 1.077(\text{Education: Some college or technical school}) + 1.791(\text{Education: College graduate}) + 1.738(\text{Education: Graduate or professional degree})\\ &+ -1.587(\text{Smoking: Current - now smokes some days}) + -1.99(\text{Smoking: Former smoker}) + -2.937(\text{Smoking: Never smoked}) \end{aligned} \]
glance(m1) %>%
select(r.squared, adj.r.squared, sigma, statistic, p.value, df, df.residual, nobs) %>%
mutate(across(where(is.numeric), ~ round(., 4))) %>%
pivot_longer(everything(), names_to = "Statistic", values_to = "Value") %>%
mutate(Statistic = dplyr::recode(Statistic,
"r.squared" = "R^2",
"adj.r.squared" = "Adjusted R^2",
"sigma" = "Residual Std. Error (Root MSE)",
"statistic" = "F-statistic",
"p.value" = "p-value (overall F-test)",
"df" = "Model df (p)",
"df.residual" = "Residual df (n - p - 1)",
"nobs" = "n (observations)"
)) %>%
kable(caption = "Table 2. Overall Model Summary") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Statistic | Value |
|---|---|
| R^2 | 0.1853 |
| Adjusted R^2 | 0.1824 |
| Residual Std. Error (Root MSE) | 7.3516 |
| F-statistic | 62.5234 |
| p-value (overall F-test) | 0.0000 |
| Model df (p) | 29.0000 |
| Residual df (n - p - 1) | 7970.0000 |
| n (observations) | 8000.0000 |
The overall model’s \(R^2\) is 0.185, indicating that the amount of variance in mentally unhealthy days explained by all predictors is relatively low - however, this is quite common for mental health outcomes as mental health is shaped by myriad unmeasured factors. Adjusted \(R^2\) penalizes for the number of predictors \(p\), It only increases when a new predictor improves the model by more than chance alone - because this model has many predictors, it is more appropriate for use than an unadjusted \(R^2\), though with an adjusted \(R^2\) of 0.182 both are quite similar.
The model’s Root MSE indicates an average prediction error of about 7.35 mentally unhealthy days in the past 30 days.
The model’s overall F-test asks: \[H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0 \quad \text{(none of the predictors matter)}\] The F-statistic is quite large, at 62.52, indicating that the predictors collectively explain more variation than expected by chance. This test has 29 numerator degrees of freedom, 7970 denominator degrees of freedom, and a total p-value of < 2.22e-16.
Based on an alpha of 0.05, we can reject the null hypothesis that the predictors only impact the model variation to the extent that would be expected by chance.
# Type III using car::Anova()
anova_type3 <- Anova(m1, type = "III")
anova_type3 %>%
as.data.frame() %>%
rownames_to_column("Source") %>%
mutate(across(where(is.numeric), ~ round(., 4))) %>%
kable(
caption = "Table 3. Type III (Partial) Sums of Squares",
col.names = c("Source", "Sum of Sq", "df", "F value", "p-value")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Source | Sum of Sq | df | F value | p-value |
|---|---|---|---|---|
| (Intercept) | 4957.0906 | 1 | 91.7191 | 0.0000 |
| physhlth_days | 37622.7952 | 1 | 696.1198 | 0.0000 |
| bmi | 345.2408 | 1 | 6.3879 | 0.0115 |
| sex | 3723.8662 | 1 | 68.9012 | 0.0000 |
| exercise | 497.0434 | 1 | 9.1966 | 0.0024 |
| age_group | 30092.1774 | 12 | 46.3986 | 0.0000 |
| income | 4733.8943 | 6 | 14.5982 | 0.0000 |
| education | 1265.1504 | 4 | 5.8521 | 0.0001 |
| smoking | 5101.1076 | 3 | 31.4613 | 0.0000 |
| Residuals | 430750.0872 | 7970 | NA | NA |
Based on the results of the Type III sums of squares test for all predictors, all predictors have statistically significant partial associations at alpha = 0.05, because all p-values are less than 0.05.
Test for whether income collectively improves the model significantly, after accounting for all other predictors.
The chunk test asks: \[H_0: \beta_{k+1} = \beta_{k+2} = \cdots = \beta_p = 0 \quad \text{(the group adds nothing)}\]
## Fit a reduced model that includes all predictors except income
m_reduced <- lm(menthlth_days ~ physhlth_days + bmi + sex + exercise + age_group + education + smoking,
data = a_df)
## Fit a full model that includes all predictors, including income
m_full <- lm(menthlth_days ~ physhlth_days + bmi + sex + exercise + age_group + education + smoking + income,
data = a_df)
## Use Anova to compare the two models
chunk_income <- anova(m_reduced, m_full)
chunk_income %>%
as.data.frame() %>%
rownames_to_column("Model") %>%
mutate(
Model = c("Reduced", "Full (+ income)"),
across(where(is.numeric), ~ round(., 4))
) %>%
kable(
caption = "Table 4. Chunk Test: Does income add to the model?",
col.names = c("Model", "Res. df", "RSS", "df", "Sum of Sq", "F", "p-value")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Model | Res. df | RSS | df | Sum of Sq | F | p-value |
|---|---|---|---|---|---|---|
| Reduced | 7976 | 435484.0 | NA | NA | NA | NA |
| Full (+ income) | 7970 | 430750.1 | 6 | 4733.894 | 14.5982 | 0 |
The chunk test found that because the p_value was < 2.22e-16, which is less than the alpha of 0.05, income adds statistically significant information to the model.
## Fit a reduced model that includes all predictors except income
m_reduced <- lm(menthlth_days ~ physhlth_days + bmi + sex + exercise + age_group + income + smoking,
data = a_df)
## Fit a full model that includes all predictors, including income
m_full <- lm(menthlth_days ~ physhlth_days + bmi + sex + exercise + age_group + income + smoking + education,
data = a_df)
## Use Anova to compare the two models
chunk_education <- anova(m_reduced, m_full)
chunk_education %>%
as.data.frame() %>%
rownames_to_column("Model") %>%
mutate(
Model = c("Reduced", "Full (+ income)"),
across(where(is.numeric), ~ round(., 4))
) %>%
kable(
caption = "Table 5. Chunk Test: Does education add to the model?",
col.names = c("Model", "Res. df", "RSS", "df", "Sum of Sq", "F", "p-value")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Model | Res. df | RSS | df | Sum of Sq | F | p-value |
|---|---|---|---|---|---|---|
| Reduced | 7974 | 432015.2 | NA | NA | NA | NA |
| Full (+ income) | 7970 | 430750.1 | 4 | 1265.15 | 5.8521 | 1e-04 |
The chunk test found that because the p_value was 0.00010644, which is less than the alpha of 0.05, education adds statistically significant information to the model.
The Type III Anova results found the statistical significance of each variable included in the model after controlling for all the others, and it determined that every one of the variables that we included in the model were statistically significant, although some were less dominant than others - like BMI, which had the largest p-value, at 0.01151 (though it was statistically significant), and a moderate F-statistic value of 6.39. Reported Physically Healthy days was one of the strongest individual contributors in the model, with an F-statistic of over 696.12, and a p value of < 2.22e-16.
Both chunk tests indicated that models including their respective predictors (income and education) were statistically significantly more predictive than models that excluded them individually. When a chunk test is conducted on a single variable, it yields the same statistical significance as the corresponding partial effect from the Type III test. However, by explicitly comparing nested models, chunk tests provide a model-comparison perspective, indicating whether the inclusion of a variable or group of variables improves overall model fit.
In this assignment, the chunk tests were conducted in parallel rather than sequentially, so they reflect each predictor’s independent contribution relative to the same baseline model. If conducted sequentially or with grouped predictors, chunk tests could instead be used to evaluate the joint contribution of income and education, as well as how their effects emerge depending on the order in which they are added to the model.
Effect modification occurs when the association between a predictor and an outcome differs across levels of another variable. You will test whether the association between current smoking and mentally unhealthy days differs between men and women.
## Model A (main effects only): regress menthlth_days on physhlth_days, bmi, sex, smoker current, smoker_current, exercise, age_group, income, and education.
ma <- lm(menthlth_days ~ physhlth_days + bmi + sex + smoker_current + exercise + age_group + income + education,
data = a_df)
## Model B (with interaction)
mb <- lm(menthlth_days ~ physhlth_days + bmi + sex*smoker_current + exercise + age_group + income + education,
data = a_df)To test whether sex \(\times\) smoking status is significant overall, we use a partial F-test comparing the model with and without the interaction terms:
\[H_0: \beta_3 = 0 \quad \text{(there is no interaction between sex and smoking status)}\] \[H_A: \beta_3 \not= 0 \quad \text{(there is interaction between sex and smoking status)}\]
## Use Anova to compare the two models
comparison <- anova(ma, mb)
comparison %>%
as.data.frame() %>%
rownames_to_column("Model") %>%
mutate(
Model = c("Main Effects", "Main Effects + Interaction (Sex * Smoker_Current)"),
across(where(is.numeric), ~ round(., 4))
) %>%
kable(
caption = "Table 6. Interaction Effect",
col.names = c("Model", "Res. df", "RSS", "df", "Sum of Sq", "F", "p-value")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Model | Res. df | RSS | df | Sum of Sq | F | p-value |
|---|---|---|---|---|---|---|
| Main Effects | 7972 | 432508.9 | NA | NA | NA | NA |
| Main Effects + Interaction (Sex * Smoker_Current) | 7971 | 432174.9 | 1 | 333.9749 | 6.1598 | 0.0131 |
The F-statistic is 6.16 with a p-value of 0.013089. At \(\alpha = 0.05\), the overall interaction is statistically significant (p < 0.05), indicating that we do have sufficient evidence to conclude that the effect of sex on the outcome depends on smoking status. There is one interaction effect because both sex and smoker_current are categorical, so there is one degree of freedom for this model’s predictor.
display <- tidy(mb, conf.int = TRUE) |> mutate(across(where(is.numeric), \(x) round(x, 4))) |>
filter(str_detect(term, "sex")|(str_detect(term, "smoke")))
display |>
kable(
caption = "Interactions: Sex and Current Smoking Status",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| sexFemale | 1.1855 | 0.1775 | 6.6784 | 0.0000 | 0.8376 | 1.5335 |
| smoker_currentCurrent smoker | 1.5208 | 0.3654 | 4.1621 | 0.0000 | 0.8045 | 2.2371 |
| sexFemale:smoker_currentCurrent smoker | 1.2833 | 0.5171 | 2.4819 | 0.0131 | 0.2697 | 2.2968 |
For men, the estimated effect of current smoking is an increase in number of reported mentally unhealthy days in the past 30 by 1.5208 days, compared to non-smoking men.
Among women, the estimated effect of current smoking is an increase in the number of reported mentally unhealthy days in the past 30 by both 1.1855, for women, and by 1.2833 for the interaction effect between sex and smoking status, for a resulting effect of 2.4688 more reported mentally unhealthy days compared to non-smoking men.
Because the effect of current smoking is positive for both men and women, and the interaction terms indicate the the effect is larger for women than for men, this does indicate that smoking is more strongly associated with poor mental health in women than it is in men.
Please note, because this comparison is two purely categorical variables, I am showing the predicted values for each combination, to demonstrate the way the interaction effects are working.
pred_mb <- ggeffects::ggpredict(mb, terms = c("smoker_current", "sex"))
ggplot(pred_mb, aes(x = x, y = predicted, color = group, fill = group)) +
geom_col() +
geom_ribbon(aes(ymin = conf.low, ymax = conf.high), color = NA) +
facet_wrap(~group) +
labs(
title = "Predicted Poor Mental Health Days by Sex and Smoking Status",
x = "Smoking status",
y = "Predicted Mentally Unhealthy Days",
color = "Sex",
fill = "Sex"
) +
theme_classic(base_size = 13) +
scale_color_brewer(palette = "Set2")For men, current smoking is associated with an increase of about 1.52 reported mentally unhealthy days in the last 30 days. Among women, the increase is even larger - with an increase of 1.28 days comparing smoking women to non-smoking women, on top of an increase of 1.185 for women. Overall, both men and women who smoke more report a larger number of poor mental health days than non-smokers, but the effect is heightened among women. This can be used to highlight people in need of programs with greater access to mental health support.
These results suggest that the determinants of mental health among US adults are varied and multifaceted. All predictors used in the analysis were statistically significant, although the reported number of physically unhealthy days was the strongest predictor of mentally unhealthy days, while BMI was the weakest predictor.
A major confounder that may affect this study is chronic health conditions that may result in both more mentally unhealthy days and physically unhealthy days, but do not impact the interaction between the two. Another potential unmeasured confounder is occupational physical activity – we saw some variation in the model based on income level, and we’ve seen a link between physically unhealthy days and poor mental health – it’s possible the type of work that someone does may affect mental health and isn’t measured by the model.
Please note, BRFSS data is a cross-sectional survey, where all data is taken at a single point in time. This means that there is no way to infer causality, as there’s no way of knowing if the predictor or the outcome occurred first for each participant in the study.
R^2 always increases when more predictors are added because of the way that the measure is calculated – Adjusting R^2 ensures that additional predictors that don’t improve the model do result in a decreasing R^2, as a better measure of model quality. We saw with this assignment that there were categorical variables that added 6, 12, 5 variables – because we treat them with dummy variables, this would substantially increase the number of predictors, and we need to be able to differentiate whether they actually add model quality or just inflate the R^2.
Chunk tests report on the all values for a categorical variable, allowing all the dummy variables to be evaluated collectively. Also if different variables that have a lot of overlap are tested (say, income and education), then the potential correlation between those variables can be identified with a chunk test. I asked chatgpt “can i add a line break in an equation formatted like this in rmarkdown”, when I noticed the equation was making the hrml output super wide, and chatgpt flagged the use of “\” to add line breaks. I also asked “why has this begun rendering weird” after I exceeded a certain equation length and it shifted to vertical formatting, and I was given some formatting updates. Once I applied them, the equation rendered more clearly.