Research Question

What individual and behavioral factors are associated with the number of mentally unhealthy days reported by U.S. adults, and does the association between current smoking and mental health differ by sex?

Part 0. Data Preparation

0.1 Data Loading

Source: https://www.cdc.gov/brfss/annual_data/annual_2023.html

If there is already a file named “working_data.rds” in the data folder in the directory, then read the rds cache in. If there is not, read from the XPT. Write the rds cache only if it already exists, to save processing time.

filename <- "LLCP2023.XPT"
filename <- paste0("data/", filename)

df <- if (file.exists("data/working_data.rds")) {
  print("Loading working data from cache: data/working_data.rds")
  readRDS("data/working_data.rds")
} else {
  print(paste0("Loading full XPT from ", filename))
  haven::read_xpt(filename)
}

## [1] "Loading working data from cache: data/working_data.rds"

if (file.exists("data/working_data.rds")) {
  print("nah it's cool don't rewrite")
  } else {
    saveRDS(df, "data/working_data.rds")
  print("Writing rds")
  }

## [1] "nah it's cool don't rewrite"

n_start <- nrow(df)

There are currently 433323 observations in the dataset.

0.2 Variable Recoding

0.2.1 Variable List

The following nine variables are required for this assignment.

Raw Variable Name	Description	Your Recoded Name
MENTHLTH	Mentally unhealthy days in past 30 (outcome)	`menthlth_days`
PHYSHLTH	Physically unhealthy days in past 30	`physhlth_days`
_BMI5	Body mass index × 100 (divide by 100)	`bmi`
SEXVAR	Sex (1 = Male, 2 = Female)	`sex`
EXERANY2	Any exercise in past 30 days (1 = Yes, 2 = No)	`exercise`
_AGEG5YR	Age group in 5-year categories (13 levels)	`age_group`
_INCOMG1	Annual household income (7 levels)	`income`
EDUCA	Highest level of education completed	`education`
_SMOKER3	Smoking status (4 levels)	`smoking`

0.2.2 Recode variables

df <- df %>%
  select(MENTHLTH, PHYSHLTH, `_BMI5`, SEXVAR, EXERANY2, `_AGEG5YR`, `_INCOMG1`, EDUCA, `_SMOKER3`) %>%
  mutate(
  ##Recode MENTHLTH to menthlth_days
  menthlth_days = case_when(
                      MENTHLTH == 88 ~ 0,
                      MENTHLTH == 99 ~ NA,
                      MENTHLTH == 77 ~ NA,
                      TRUE ~ MENTHLTH), 
  ##Recode PHYSHLTH to physhlth_days
   physhlth_days =   case_when(
                      PHYSHLTH == 88 ~ 0,
                      PHYSHLTH == 99 ~ NA,
                      PHYSHLTH == 77 ~ NA,
                      TRUE ~ PHYSHLTH), 
  ##Recode __BMI5 to bmi - divide by 100
  bmi = case_when(`_BMI5` == 9999 ~ NA,
                  TRUE ~ `_BMI5` /100),
  ## Recode sex as factor
sex = factor(
        ifelse(SEXVAR %in% c(7, 9), NA, SEXVAR),
        levels = c(1, 2),
        labels = c("Male", "Female")
),
  ## Recode exerany2 as factor
exercise = factor(
  ifelse(EXERANY2 %in% c(7, 9), NA, EXERANY2),
  levels = c(1, 2),
  labels = c("Yes", "No")
),
## Recode age group
  age_group = factor(ifelse(`_AGEG5YR` %in% c(14), NA,`_AGEG5YR`), 
                     levels = c(1,2,3,4,5,6,7,8,9,10,11,12,13),
                     labels = c("18-24", "25-29", "30-34", "35-39",
                                "40-44", "45-49", "50-54", "55-59",
                                "60-64", "65-69", "70-74", "75-79",
                                "80+")
  ),
  ##recode income group
  income = factor(ifelse(`_INCOMG1` == 9, NA, `_INCOMG1`),
                  levels = c(1,2,3,4,5, 6, 7),
                  labels = c("Less than $15,000", 
                             "$15,000 to less than $25,000",
                             "$25,000 to less than $35,000",
                             "$35,000 to less than $50,000",
                             "$50,000 to less than $100,000",
                             "$100,000 to less than $200,000",
                             "$200,000 or more")
                  ),
 education = factor(
  case_when(
    EDUCA == 9 ~ NA_character_,
    EDUCA %in% c(1, 2) ~ "Less than high school",
    EDUCA == 3 ~ "High school diploma or GED",
    EDUCA == 4 ~ "Some college or technical school",
    EDUCA == 5 ~ "College graduate",
    EDUCA == 6 ~ "Graduate or professional degree"
  ),
  levels = c(
    "Less than high school",
    "High school diploma or GED",
    "Some college or technical school",
    "College graduate",
    "Graduate or professional degree"
  )
),
  smoking = factor(ifelse(`_SMOKER3` == 9, NA, `_SMOKER3`),
                   levels = c(1,2,3,4),
                   labels = c("Current smoker - now smokes every day",
                              "Current smoker - now smokes some days",
                              "Former smoker",
                              "Never smoked"))

  ) 

#QA - Print top 3 rows, output for new var
(df[1:3, (ncol(df)-8):ncol(df)]) %>% 
  kable(caption = "3 row sample of recoded variables"
)

3 row sample of recoded variables
menthlth_days	physhlth_days	bmi	sex	exercise	age_group	income	education	smoking
0	0	30.47	Female	No	80+	NA	College graduate	Never smoked
0	0	28.56	Female	Yes	80+	NA	College graduate	Never smoked
2	6	22.31	Female	Yes	80+	Less than $15,000	Some college or technical school	Former smoker

0.3 Analytical Sample

After recoding, create your analytic dataset by removing observations with missing values on all nine analysis variables, then drawing a random sample of 8,000 observations. Report the number and percentage of missing observations for menthlth_days and at least two other key variables, before removing any missing values

Using set.seed(1220) ensures that your results are reproducible and comparable across submissions (this is critical)

0.3.1 - Helper Function

This helper function was written for my final project. Adjusting it so it looks at everything in an input list, rather than just by prefix.

missingness <- function(w_data, colsearch){
w_data %>%
  summarize(
    across(all_of(colsearch),
        ~sum(is.na(.x)), .names = "{.col}")) %>%
  pivot_longer(cols = everything(), 
               names_to = "variable", 
               values_to = "nmissing") %>%
  mutate(
      run = unique(w_data$v_year),
      n = nrow(w_data),
      pct_missing = nmissing / n * 100
    )
}

0.3.2 - Missing by Variable

Call helper function on the list

#list of columns to search in
search <- c("menthlth_days", "physhlth_days", "bmi", "sex", "exercise", "age_group", "income", "education", "smoking")

#
a_df <- df

test <- missingness(a_df, search)

test %>% kable()

variable	nmissing	n	pct_missing
menthlth_days	8108	433323	1.8711215
physhlth_days	10785	433323	2.4889055
bmi	40535	433323	9.3544538
sex	0	433323	0.0000000
exercise	1251	433323	0.2886992
age_group	7779	433323	1.7951967
income	86623	433323	19.9903998
education	2325	433323	0.5365513
smoking	23062	433323	5.3221269

0.3.3 - Complete case exclusion and slice by sample

Call helper function on the list

## Drop if missing on any values in this list.
a_df <- a_df |>
 drop_na(search)

n_end <- nrow(a_df)

set.seed(1220)

a_df <- a_df |> slice_sample(n=8000)

The starting dataset had 433323 observations, and with exclusions of any variables for complete case analysis, the dataset had 310768 observations. A sample of 8000 observations will be used for the analysis.

0.4 Descriptive statistics table

table1 <- a_df %>%
  tbl_summary(
    include = search,
    label= list(menthlth_days ~ "Mentally unhealthy days in past 30 days",
                physhlth_days ~ "Physically unhealthy days in past 30 days",
                bmi ~ "Body mass index",
                exercise  ~ "Exercise in past 50 days",
                age_group  ~ "Age (years)",
                income ~ "Anual household income",
                education ~ "Highest level of education completed",
                smoking ~ "Smoking status"
                ),
    by = sex) |>
  
  modify_caption(caption= "Table 1. Characteristics of sample of 2023 BRFSS particpants (n=8000)")

as_flex_table(table1)

Table 1. Characteristics of sample of 2023 BRFSS particpants (n=8000)
Characteristic	Male N = 3,9361	Female N = 4,0641
Mentally unhealthy days in past 30 days	0 (0, 3)	0 (0, 5)
Physically unhealthy days in past 30 days	0 (0, 3)	0 (0, 5)
Body mass index	28 (25, 31)	27 (24, 32)
Exercise in past 50 days	3,146 (80%)	3,094 (76%)
Age (years)
18-24	235 (6.0%)	171 (4.2%)
25-29	219 (5.6%)	189 (4.7%)
30-34	253 (6.4%)	210 (5.2%)
35-39	263 (6.7%)	302 (7.4%)
40-44	290 (7.4%)	292 (7.2%)
45-49	266 (6.8%)	252 (6.2%)
50-54	305 (7.7%)	303 (7.5%)
55-59	308 (7.8%)	352 (8.7%)
60-64	408 (10%)	379 (9.3%)
65-69	418 (11%)	483 (12%)
70-74	382 (9.7%)	426 (10%)
75-79	325 (8.3%)	338 (8.3%)
80+	264 (6.7%)	367 (9.0%)
Anual household income
Less than $15,000	160 (4.1%)	247 (6.1%)
$15,000 to less than $25,000	271 (6.9%)	370 (9.1%)
$25,000 to less than $35,000	376 (9.6%)	495 (12%)
$35,000 to less than $50,000	482 (12%)	585 (14%)
$50,000 to less than $100,000	1,251 (32%)	1,260 (31%)
$100,000 to less than $200,000	996 (25%)	869 (21%)
$200,000 or more	400 (10%)	238 (5.9%)
Highest level of education completed
Less than high school	75 (1.9%)	49 (1.2%)
High school diploma or GED	130 (3.3%)	122 (3.0%)
Some college or technical school	950 (24%)	877 (22%)
College graduate	1,018 (26%)	1,120 (28%)
Graduate or professional degree	1,763 (45%)	1,896 (47%)
Smoking status
Current smoker - now smokes every day	339 (8.6%)	319 (7.8%)
Current smoker - now smokes some days	151 (3.8%)	117 (2.9%)
Former smoker	1,207 (31%)	1,055 (26%)
Never smoked	2,239 (57%)	2,573 (63%)
1Median (Q1, Q3); n (%)

Part 1: Multiple Linear Regression

Research question: What is the independent association of each predictor with the number of mentally unhealthy days in the past 30 days?

1.1 Fit a multiple linear regression model

Use menthlth_days as the outcome and the following predictors: physhlth_days, bmi, sex, exercise, age_group, income, education, and smoking

# Model : Full multivariable model
m1 <- lm(menthlth_days ~ physhlth_days + bmi + sex + exercise + age_group + income + education + smoking,
         data = a_df)

b <- round(coef(m1), 3)
ci <- round(confint(m1), 3)

summary(m1)

## 
## Call:
## lm(formula = menthlth_days ~ physhlth_days + bmi + sex + exercise + 
##     age_group + income + education + smoking, data = a_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.080  -3.865  -1.597   0.712  30.471 
## 
## Coefficients:
##                                              Estimate Std. Error t value
## (Intercept)                                   8.99937    0.93968   9.577
## physhlth_days                                 0.26558    0.01007  26.384
## bmi                                           0.03338    0.01321   2.527
## sexFemale                                     1.39038    0.16750   8.301
## exerciseNo                                    0.65116    0.21472   3.033
## age_group25-29                               -1.05660    0.51938  -2.034
## age_group30-34                               -1.09395    0.50646  -2.160
## age_group35-39                               -1.81103    0.48851  -3.707
## age_group40-44                               -2.89307    0.48749  -5.935
## age_group45-49                               -3.05618    0.49769  -6.141
## age_group50-54                               -3.51674    0.48323  -7.277
## age_group55-59                               -4.49597    0.47555  -9.454
## age_group60-64                               -4.52215    0.45848  -9.863
## age_group65-69                               -5.57795    0.45019 -12.390
## age_group70-74                               -6.02536    0.45741 -13.173
## age_group75-79                               -6.28656    0.47484 -13.239
## age_group80+                                 -6.81968    0.47684 -14.302
## income$15,000 to less than $25,000           -1.63703    0.46899  -3.491
## income$25,000 to less than $35,000           -2.07637    0.44797  -4.635
## income$35,000 to less than $50,000           -2.54685    0.43819  -5.812
## income$50,000 to less than $100,000          -3.05048    0.41069  -7.428
## income$100,000 to less than $200,000         -3.49984    0.42923  -8.154
## income$200,000 or more                       -3.78409    0.50036  -7.563
## educationHigh school diploma or GED           0.07991    0.81066   0.099
## educationSome college or technical school     1.07679    0.68973   1.561
## educationCollege graduate                     1.79091    0.69119   2.591
## educationGraduate or professional degree      1.73781    0.69250   2.509
## smokingCurrent smoker - now smokes some days -1.58670    0.53523  -2.965
## smokingFormer smoker                         -1.98971    0.33713  -5.902
## smokingNever smoked                          -2.93681    0.32162  -9.131
##                                              Pr(>|t|)    
## (Intercept)                                   < 2e-16 ***
## physhlth_days                                 < 2e-16 ***
## bmi                                          0.011510 *  
## sexFemale                                     < 2e-16 ***
## exerciseNo                                   0.002432 ** 
## age_group25-29                               0.041950 *  
## age_group30-34                               0.030803 *  
## age_group35-39                               0.000211 ***
## age_group40-44                               3.07e-09 ***
## age_group45-49                               8.61e-10 ***
## age_group50-54                               3.72e-13 ***
## age_group55-59                                < 2e-16 ***
## age_group60-64                                < 2e-16 ***
## age_group65-69                                < 2e-16 ***
## age_group70-74                                < 2e-16 ***
## age_group75-79                                < 2e-16 ***
## age_group80+                                  < 2e-16 ***
## income$15,000 to less than $25,000           0.000485 ***
## income$25,000 to less than $35,000           3.63e-06 ***
## income$35,000 to less than $50,000           6.40e-09 ***
## income$50,000 to less than $100,000          1.22e-13 ***
## income$100,000 to less than $200,000         4.07e-16 ***
## income$200,000 or more                       4.38e-14 ***
## educationHigh school diploma or GED          0.921484    
## educationSome college or technical school    0.118520    
## educationCollege graduate                    0.009585 ** 
## educationGraduate or professional degree     0.012111 *  
## smokingCurrent smoker - now smokes some days 0.003041 ** 
## smokingFormer smoker                         3.74e-09 ***
## smokingNever smoked                           < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.352 on 7970 degrees of freedom
## Multiple R-squared:  0.1853, Adjusted R-squared:  0.1824 
## F-statistic: 62.52 on 29 and 7970 DF,  p-value: < 2.2e-16

1.2 Write fitted regression equation

\[ \begin{aligned} \widehat{\text{Mental Health Days}} =\;& 8.999 + 0.266(\text{Phys. Health Days}) + 0.033(\text{BMI}) + 1.39(\text{Sex: Female}) + 0.651(\text{Exercise: No}) \\ &+ -1.057(\text{Age: 25–29}) + -1.094(\text{Age: 30–34}) + -1.811(\text{Age: 35–39}) + -2.893(\text{Age: 40–44}) \\ &+ -3.056(\text{Age: 45–49}) + -3.517(\text{Age: 50–54}) + -4.496(\text{Age: 55–59}) + -4.522(\text{Age: 60–64}) \\ &+ -5.578(\text{Age: 65–69}) + -6.025(\text{Age: 70–74}) + -6.287(\text{Age: 75–80}) + -6.82(\text{Age: 80+}) \\ &+ -1.637(\text{Income: \$15k–\$25k}) + -2.076(\text{Income: \$25k–\$35k}) + -2.547(\text{Income: \$35k–\$50k}) + -3.05(\text{Income: \$50k-\$100k}) + -3.5(\text{Income: \$100k-\$200k}) + -3.784(\text{Income: \$200k+}) \\ &+ 0.08(\text{Education: High school diploma or GED}) + 1.077(\text{Education: Some college or technical school}) + 1.791(\text{Education: College graduate}) + 1.738(\text{Education: Graduate or professional degree})\\ &+ -1.587(\text{Smoking: Current - now smokes some days}) + -1.99(\text{Smoking: Former smoker}) + -2.937(\text{Smoking: Never smoked}) \end{aligned} \]

1.3 Interpretation of Coefficients

physhlth_days: Holding all other variables constant, for each 1 day increase in the number of reported physically unhealthy days in the past 30 days, there is a predicted increase of 0.266 mentally unhealthy days reported for the last 30 days.
bmi: For every 1 kg/m² increase in a participant’s BMI, holding all other variables constant, there is a predicted increase of 0.033 reported mentally unhealthy days in the last 30 days.
sex: Compared to the male reference group, holding all other variables constant, a respondent being female indicates a 1.39 increase in predicted number of mentally unhealthy days in the last 30 days.
exercise: Compared to the reference group that does exercise, holding all other variables constant, a respondent who does not exercise has a predicted increase of 1.39 mentally unhealthy days in the last 30 days.
Age group (80+): Being in the age group of 80+ results in a predicted decrease in -6.82 mentally unhealthy days in the last 30 days versus the reference group of 18-24, holding all other variables constant.
Income: Each increased level of income grouping in the model reflected a larger magnitude of predicted decrease in mentally unhealthy days in the last 30 days compared to the reference group of less than $15,000 in annual income, holding all other variables constant. The income group just above the reference group, $15,000 to less than $25,000 reflects an expected -1.637 fewer mentally unhealthy days than the reference group, whereas the maximum income group of $200,000 or more, reflects an expected -3.784 decrease in mentally unhealthy days compared to the reference group.

1.4 Model-level statistics

glance(m1) %>%
  select(r.squared, adj.r.squared, sigma, statistic, p.value, df, df.residual, nobs) %>%
  mutate(across(where(is.numeric), ~ round(., 4))) %>%
  pivot_longer(everything(), names_to = "Statistic", values_to = "Value") %>%
  mutate(Statistic = dplyr::recode(Statistic,
    "r.squared"     = "R^2",
    "adj.r.squared" = "Adjusted R^2",
    "sigma"         = "Residual Std. Error (Root MSE)",
    "statistic"     = "F-statistic",
    "p.value"       = "p-value (overall F-test)",
    "df"            = "Model df (p)",
    "df.residual"   = "Residual df (n - p - 1)",
    "nobs"          = "n (observations)"
  )) %>%
  kable(caption = "Table 2. Overall Model Summary") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Table 2. Overall Model Summary
Statistic	Value
R^2	0.1853
Adjusted R^2	0.1824
Residual Std. Error (Root MSE)	7.3516
F-statistic	62.5234
p-value (overall F-test)	0.0000
Model df (p)	29.0000
Residual df (n - p - 1)	7970.0000
n (observations)	8000.0000

The overall model’s $R^2$ is 0.185, indicating that the amount of variance in mentally unhealthy days explained by all predictors is relatively low - however, this is quite common for mental health outcomes as mental health is shaped by myriad unmeasured factors. Adjusted $R^2$ penalizes for the number of predictors $p$, It only increases when a new predictor improves the model by more than chance alone - because this model has many predictors, it is more appropriate for use than an unadjusted $R^2$, though with an adjusted $R^2$ of 0.182 both are quite similar.

The model’s Root MSE indicates an average prediction error of about 7.35 mentally unhealthy days in the past 30 days.

1.4.1 Overall F-Test

The model’s overall F-test asks: \[H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0 \quad \text{(none of the predictors matter)}\] The F-statistic is quite large, at 62.52, indicating that the predictors collectively explain more variation than expected by chance. This test has 29 numerator degrees of freedom, 7970 denominator degrees of freedom, and a total p-value of < 2.22e-16.

Based on an alpha of 0.05, we can reject the null hypothesis that the predictors only impact the model variation to the extent that would be expected by chance.

Part 2: Tests of Hypotheses

2.1 Type III Sums of Squares

# Type III using car::Anova()
anova_type3 <- Anova(m1, type = "III")

anova_type3 %>%
  as.data.frame() %>%
  rownames_to_column("Source") %>%
  mutate(across(where(is.numeric), ~ round(., 4))) %>%
  kable(
    caption = "Table 3. Type III (Partial) Sums of Squares",
    col.names = c("Source", "Sum of Sq", "df", "F value", "p-value")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Table 3. Type III (Partial) Sums of Squares
Source	Sum of Sq	df	F value	p-value
(Intercept)	4957.0906	1	91.7191	0.0000
physhlth_days	37622.7952	1	696.1198	0.0000
bmi	345.2408	1	6.3879	0.0115
sex	3723.8662	1	68.9012	0.0000
exercise	497.0434	1	9.1966	0.0024
age_group	30092.1774	12	46.3986	0.0000
income	4733.8943	6	14.5982	0.0000
education	1265.1504	4	5.8521	0.0001
smoking	5101.1076	3	31.4613	0.0000
Residuals	430750.0872	7970	NA	NA

Based on the results of the Type III sums of squares test for all predictors, all predictors have statistically significant partial associations at alpha = 0.05, because all p-values are less than 0.05.

2.2 Chunk Test for Income

Test for whether income collectively improves the model significantly, after accounting for all other predictors.

The chunk test asks: \[H_0: \beta_{k+1} = \beta_{k+2} = \cdots = \beta_p = 0 \quad \text{(the group adds nothing)}\]

## Fit a reduced model that includes all predictors except income
m_reduced <- lm(menthlth_days ~ physhlth_days + bmi + sex + exercise + age_group + education + smoking,
                  data = a_df)
## Fit a full model that includes all predictors, including income
m_full <- lm(menthlth_days ~ physhlth_days + bmi + sex + exercise + age_group + education + smoking + income,
             data = a_df)


## Use Anova to compare the two models
chunk_income <- anova(m_reduced, m_full)

chunk_income %>%
  as.data.frame() %>%
  rownames_to_column("Model") %>%
  mutate(
    Model = c("Reduced", "Full (+ income)"),
    across(where(is.numeric), ~ round(., 4))
  ) %>%
  kable(
    caption = "Table 4. Chunk Test: Does income add to the model?",
    col.names = c("Model", "Res. df", "RSS", "df", "Sum of Sq", "F", "p-value")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Table 4. Chunk Test: Does income add to the model?
Model	Res. df	RSS	df	Sum of Sq	F	p-value
Reduced	7976	435484.0	NA	NA	NA	NA
Full (+ income)	7970	430750.1	6	4733.894	14.5982	0

The chunk test found that because the p_value was < 2.22e-16, which is less than the alpha of 0.05, income adds statistically significant information to the model.

2.3 Chunk Test for Education

## Fit a reduced model that includes all predictors except income
m_reduced <- lm(menthlth_days ~ physhlth_days + bmi + sex + exercise + age_group + income + smoking,
                  data = a_df)
## Fit a full model that includes all predictors, including income
m_full <- lm(menthlth_days ~ physhlth_days + bmi + sex + exercise + age_group + income + smoking + education,
             data = a_df)


## Use Anova to compare the two models
chunk_education <- anova(m_reduced, m_full)

chunk_education %>%
  as.data.frame() %>%
  rownames_to_column("Model") %>%
  mutate(
    Model = c("Reduced", "Full (+ income)"),
    across(where(is.numeric), ~ round(., 4))
  ) %>%
  kable(
    caption = "Table 5. Chunk Test: Does education add to the model?",
    col.names = c("Model", "Res. df", "RSS", "df", "Sum of Sq", "F", "p-value")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Table 5. Chunk Test: Does education add to the model?
Model	Res. df	RSS	df	Sum of Sq	F	p-value
Reduced	7974	432015.2	NA	NA	NA	NA
Full (+ income)	7970	430750.1	4	1265.15	5.8521	1e-04

The chunk test found that because the p_value was 0.00010644, which is less than the alpha of 0.05, education adds statistically significant information to the model.

2.4 Interpretation of test results

The Type III Anova results found the statistical significance of each variable included in the model after controlling for all the others, and it determined that every one of the variables that we included in the model were statistically significant, although some were less dominant than others - like BMI, which had the largest p-value, at 0.01151 (though it was statistically significant), and a moderate F-statistic value of 6.39. Reported Physically Healthy days was one of the strongest individual contributors in the model, with an F-statistic of over 696.12, and a p value of < 2.22e-16.

Both chunk tests indicated that models including their respective predictors (income and education) were statistically significantly more predictive than models that excluded them individually. When a chunk test is conducted on a single variable, it yields the same statistical significance as the corresponding partial effect from the Type III test. However, by explicitly comparing nested models, chunk tests provide a model-comparison perspective, indicating whether the inclusion of a variable or group of variables improves overall model fit.

In this assignment, the chunk tests were conducted in parallel rather than sequentially, so they reflect each predictor’s independent contribution relative to the same baseline model. If conducted sequentially or with grouped predictors, chunk tests could instead be used to evaluate the joint contribution of income and education, as well as how their effects emerge depending on the order in which they are added to the model.

Part 3: Interaction Analysis

Effect modification occurs when the association between a predictor and an outcome differs across levels of another variable. You will test whether the association between current smoking and mentally unhealthy days differs between men and women.

3.1 Create binary smoking variable

a_df <- a_df |>
  mutate(
 smoker_current = factor(case_when(
                          smoking %in% c("Current smoker - now smokes every day","Current smoker - now smokes some days") ~ 1,
                          smoking %in% c("Former smoker","Never smoked") ~ 0),                        
  levels = c(0,1),
  labels = c("Non-smoker", "Current smoker")))

3.2 Fit two models

Model A (main effects only): regress menthlth_days on physhlth_days, bmi, sex, smoker_current, exercise, age_group, income, and education.
Model B (with interaction): same as Model A, but use sex * smoker_current in the formula to include the interaction term.

## Model A (main effects only): regress menthlth_days on physhlth_days, bmi, sex, smoker current, smoker_current, exercise, age_group, income, and education.
ma <- lm(menthlth_days ~ physhlth_days + bmi + sex + smoker_current + exercise + age_group + income + education,
                  data = a_df)
## Model B (with interaction)
mb <- lm(menthlth_days ~ physhlth_days + bmi + sex*smoker_current + exercise + age_group + income + education,
                  data = a_df)

3.3 Compare models with Anova

3.3.1 Hypotheses for the test and test statistics

To test whether sex $\times$ smoking status is significant overall, we use a partial F-test comparing the model with and without the interaction terms:

\[H_0: \beta_3 = 0 \quad \text{(there is no interaction between sex and smoking status)}\] \[H_A: \beta_3 \not= 0 \quad \text{(there is interaction between sex and smoking status)}\]

3.3.2 Model comparison

## Use Anova to compare the two models
comparison <- anova(ma, mb)

comparison %>%
  as.data.frame() %>%
  rownames_to_column("Model") %>%
  mutate(
    Model = c("Main Effects", "Main Effects + Interaction (Sex * Smoker_Current)"),
    across(where(is.numeric), ~ round(., 4))
  ) %>%
  kable(
    caption = "Table 6. Interaction Effect",
    col.names = c("Model", "Res. df", "RSS", "df", "Sum of Sq", "F", "p-value")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Table 6. Interaction Effect
Model	Res. df	RSS	df	Sum of Sq	F	p-value
Main Effects	7972	432508.9	NA	NA	NA	NA
Main Effects + Interaction (Sex * Smoker_Current)	7971	432174.9	1	333.9749	6.1598	0.0131

3.3.3 Interpretation of test

The F-statistic is 6.16 with a p-value of 0.013089. At $\alpha = 0.05$, the overall interaction is statistically significant (p < 0.05), indicating that we do have sufficient evidence to conclude that the effect of sex on the outcome depends on smoking status. There is one interaction effect because both sex and smoker_current are categorical, so there is one degree of freedom for this model’s predictor.

3.4 Interpretation of Model B

display <- tidy(mb, conf.int = TRUE)  |> mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  filter(str_detect(term, "sex")|(str_detect(term, "smoke")))

display |>
  kable(
    caption = "Interactions: Sex and Current Smoking Status",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Interactions: Sex and Current Smoking Status
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
sexFemale	1.1855	0.1775	6.6784	0.0000	0.8376	1.5335
smoker_currentCurrent smoker	1.5208	0.3654	4.1621	0.0000	0.8045	2.2371
sexFemale:smoker_currentCurrent smoker	1.2833	0.5171	2.4819	0.0131	0.2697	2.2968

For men, the estimated effect of current smoking is an increase in number of reported mentally unhealthy days in the past 30 by 1.5208 days, compared to non-smoking men.

Among women, the estimated effect of current smoking is an increase in the number of reported mentally unhealthy days in the past 30 by both 1.1855, for women, and by 1.2833 for the interaction effect between sex and smoking status, for a resulting effect of 2.4688 more reported mentally unhealthy days compared to non-smoking men.

Because the effect of current smoking is positive for both men and women, and the interaction terms indicate the the effect is larger for women than for men, this does indicate that smoking is more strongly associated with poor mental health in women than it is in men.

3.5 Visualization of interaction effects

Please note, because this comparison is two purely categorical variables, I am showing the predicted values for each combination, to demonstrate the way the interaction effects are working.

pred_mb <- ggeffects::ggpredict(mb, terms = c("smoker_current", "sex"))

ggplot(pred_mb, aes(x = x, y = predicted, color = group, fill = group)) +
  geom_col() +
  geom_ribbon(aes(ymin = conf.low, ymax = conf.high), color = NA) +
  facet_wrap(~group) +
  labs(
    title    = "Predicted Poor Mental Health Days by Sex and Smoking Status",
    x        = "Smoking status",
    y        = "Predicted Mentally Unhealthy Days",
    color    = "Sex",
    fill     = "Sex"
  ) +
  theme_classic(base_size = 13) +
  scale_color_brewer(palette = "Set2")

3.6 Interpretation of test results

For men, current smoking is associated with an increase of about 1.52 reported mentally unhealthy days in the last 30 days. Among women, the increase is even larger - with an increase of 1.28 days comparing smoking women to non-smoking women, on top of an increase of 1.185 for women. Overall, both men and women who smoke more report a larger number of poor mental health days than non-smokers, but the effect is heightened among women. This can be used to highlight people in need of programs with greater access to mental health support.

Part 4: Reflection

These results suggest that the determinants of mental health among US adults are varied and multifaceted. All predictors used in the analysis were statistically significant, although the reported number of physically unhealthy days was the strongest predictor of mentally unhealthy days, while BMI was the weakest predictor.

A major confounder that may affect this study is chronic health conditions that may result in both more mentally unhealthy days and physically unhealthy days, but do not impact the interaction between the two. Another potential unmeasured confounder is occupational physical activity – we saw some variation in the model based on income level, and we’ve seen a link between physically unhealthy days and poor mental health – it’s possible the type of work that someone does may affect mental health and isn’t measured by the model.

Please note, BRFSS data is a cross-sectional survey, where all data is taken at a single point in time. This means that there is no way to infer causality, as there’s no way of knowing if the predictor or the outcome occurred first for each participant in the study.

R^2 always increases when more predictors are added because of the way that the measure is calculated – Adjusting R^2 ensures that additional predictors that don’t improve the model do result in a decreasing R^2, as a better measure of model quality. We saw with this assignment that there were categorical variables that added 6, 12, 5 variables – because we treat them with dummy variables, this would substantially increase the number of predictors, and we need to be able to differentiate whether they actually add model quality or just inflate the R^2.

Chunk tests report on the all values for a categorical variable, allowing all the dummy variables to be evaluated collectively. Also if different variables that have a lot of overlap are tested (say, income and education), then the potential correlation between those variables can be identified with a chunk test. I asked chatgpt “can i add a line break in an equation formatted like this in rmarkdown”, when I noticed the equation was making the hrml output super wide, and chatgpt flagged the use of “\” to add line breaks. I also asked “why has this begun rendering weird” after I exceeded a certain equation length and it shifted to vertical formatting, and I was given some formatting updates. Once I applied them, the equation rendered more clearly.

Assignment 3: Multiple Linear Regression, Tests of Hypotheses, and Interaction Analysis

EPI 553: Principles of Statistical Inference II

Alex Mossawir

2026-04-02