EPI 553 — Homework 3: Multiple Linear Regression, Tests of Hypotheses, and Interaction Analysis

Submission: Knit this file to HTML, publish to RPubs with the title epi553_hw02_lastname_firstname, and paste the RPubs URL in the Brightspace submission comments. Also upload this .Rmd file to Brightspace.

AI Policy: AI tools are NOT permitted on this assignment. See the assignment description for full details.

Part 0: Data Preparation (10 points)

# Load required packages
library(tidyverse)
library(haven)
library(broom)
library(kableExtra)
library(janitor)
library(car)
library(gtsummary)
library(ggeffects)

# Import the dataset

brfss <- read_xpt("C:/Users/joshm/Documents/UAlbany/Spring 2026/EPI 553/Assignment 3/LLCP2023.xpt") %>%
  clean_names()
#select variables
brfss_work <- brfss |>
  select(menthlth, physhlth, bmi5, sexvar, exerany2, ageg5yr, incomg1, educa, smoker3)

brfss_work <- brfss_work %>%
  mutate(
      menthlth_days = case_when(menthlth == 88 ~ 0,
      menthlth >= 1 & menthlth <= 30 ~ as.numeric(menthlth),
      TRUE                        ~ NA_real_),
      
      physhlth_days = case_when(physhlth == 88 ~ 0,
      menthlth >= 1 & menthlth <= 30 ~ as.numeric(menthlth),
      TRUE                        ~ NA_real_),
      
      bmi = case_when(bmi5 == 9999 ~ NA, bmi5 < 9999 ~ bmi5/100),
      
      sex = factor(sexvar, levels = c(1, 2), labels = c("Male", "Female")),
      
      exercise = case_when(exerany2 == 7 & exerany2 == 9 ~ NA),
      exercise = factor(exerany2, levels = c(1, 2), labels = c("Yes", "No")),
      
      age_group = case_when(ageg5yr == 14 ~ NA),
      age_group = factor(ageg5yr, levels = 1:13,
      labels = c("18-24", "25-29", "30-34", "35-39", "40-44", "45-49", "50-54", "55-59", "60-64", "65-69", "70-74", "75-79", "80+") ),
      
      income = case_when(incomg1 == 9 ~ NA),
      income = factor(incomg1, levels = 1:7,
      labels = c("<$15k", "$15-25k", "$25-35k", "$35-50k",
                 "$50-100k", "$100-200k", ">$200k")),
  
      education = case_when(educa == 9 ~ NA),
      education = factor(educa, levels = 1:6, labels = 
                           c("Less than high school", 
                             "Less than high school",
                             "High school diploma or GED",
                             "Some college or technical school",
                             "College graduate",
                             "Graduate or professional degree")),
      
      smoking = case_when(smoker3 == 9 ~ NA),
      smoking = factor(smoker3, levels=1:4, labels =
                         c("Current daily smoker",
                           "Current some-day smoker",
                           "Former smoker",
                           "Never smoker"))
      
  )

cat("Total N:", nrow(brfss_work), "\n")

## Total N: 433323

cat("Missing data:", sum(is.na(brfss_work)), "\n")

## Missing data: 291293

a <- nrow(brfss_work)
b <- sum(is.na(brfss_work))
c <- (b / a)*100

cat("Missing data %:", c , "\n")

## Missing data %: 67.22306

set.seed(1220) 
brfss_analytic <- brfss_work |> 
drop_na(menthlth_days, physhlth_days, bmi, sex, exercise, 
age_group, income, education, smoking) |> 
slice_sample(n = 8000) 

cat("FInal analytic N:", nrow(brfss_analytic), "\n")

## FInal analytic N: 8000

brfss_analytic %>%
  select(menthlth_days, physhlth_days, bmi, sex, exercise, age_group, income, education, smoking) %>% 
  tbl_summary(
    label = list(
      menthlth_days ~ "Mentally unhealthy days (past 30)",
      physhlth_days ~ "Physically unhealthy days (past 30)",
      bmi     ~ "Body mass index",
      sex           ~ "Sex",
      exercise      ~ "Any exercise in past 30 days",
      age_group      ~ "Age group",
      income           ~ "Annual household income",
      education       ~ "Highest level of education completed",
      smoking         ~ "Smoking status"
    ),
    statistic = list(
      all_continuous()  ~ "{mean} ({sd})",
      all_categorical() ~ "{n} ({p}%)"
    ),
    digits = all_continuous() ~ 1,
    missing = "no"
  ) %>%
  add_n() %>%
  bold_labels() %>%
  modify_caption("**Table 1. Descriptive Statistics — BRFSS 2023 Analytic Sample (n = 8000)**")

**Table 1. Descriptive Statistics — BRFSS 2023 Analytic Sample (n = 8000)**
Characteristic	N	N = 8,000¹
Mentally unhealthy days (past 30)	8,000	5.6 (9.0)
Physically unhealthy days (past 30)	8,000	3.5 (7.9)
Body mass index	8,000	28.6 (6.5)
Sex	8,000
Male		3,921 (49%)
Female		4,079 (51%)
Any exercise in past 30 days	8,000	6,242 (78%)
Age group	8,000
18-24		457 (5.7%)
25-29		424 (5.3%)
30-34		533 (6.7%)
35-39		590 (7.4%)
40-44		628 (7.9%)
45-49		538 (6.7%)
50-54		627 (7.8%)
55-59		694 (8.7%)
60-64		819 (10%)
65-69		818 (10%)
70-74		759 (9.5%)
75-79		577 (7.2%)
80+		536 (6.7%)
Annual household income	8,000
<$15k		424 (5.3%)
$15-25k		643 (8.0%)
$25-35k		812 (10%)
$35-50k		1,102 (14%)
$50-100k		2,551 (32%)
$100-200k		1,809 (23%)
>$200k		659 (8.2%)
Highest level of education completed	8,000
Less than high school		127 (1.6%)
High school diploma or GED		254 (3.2%)
Some college or technical school		1,795 (22%)
College graduate		2,144 (27%)
Graduate or professional degree		3,680 (46%)
Smoking status	8,000
Current daily smoker		691 (8.6%)
Current some-day smoker		274 (3.4%)
Former smoker		2,195 (27%)
Never smoker		4,840 (61%)
¹ Mean (SD); n (%)

model <- lm(menthlth_days ~ physhlth_days + bmi + sex + exercise + age_group + income + education + smoking, data = brfss_analytic)
summary(model)

## 
## Call:
## lm(formula = menthlth_days ~ physhlth_days + bmi + sex + exercise + 
##     age_group + income + education + smoking, data = brfss_analytic)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.8259 -2.4065 -1.4297 -0.1347 29.4079 
## 
## Coefficients:
##                                            Estimate Std. Error t value Pr(>|t|)
## (Intercept)                                4.818398   0.692538   6.958 3.74e-12
## physhlth_days                              0.850795   0.008271 102.870  < 2e-16
## bmi                                        0.010978   0.009866   1.113 0.265864
## sexFemale                                  0.586915   0.125583   4.674 3.01e-06
## exerciseNo                                 0.272611   0.158988   1.715 0.086445
## age_group25-29                            -0.602502   0.376024  -1.602 0.109130
## age_group30-34                            -0.768963   0.357147  -2.153 0.031343
## age_group35-39                            -0.727709   0.351562  -2.070 0.038491
## age_group40-44                            -1.342610   0.349512  -3.841 0.000123
## age_group45-49                            -1.924205   0.360173  -5.342 9.42e-08
## age_group50-54                            -2.272300   0.350703  -6.479 9.76e-11
## age_group55-59                            -2.230300   0.341562  -6.530 6.99e-11
## age_group60-64                            -2.648336   0.329843  -8.029 1.12e-15
## age_group65-69                            -3.152289   0.331799  -9.501  < 2e-16
## age_group70-74                            -3.385648   0.335638 -10.087  < 2e-16
## age_group75-79                            -3.244898   0.355529  -9.127  < 2e-16
## age_group80+                              -3.594839   0.358756 -10.020  < 2e-16
## income$15-25k                             -0.237845   0.346795  -0.686 0.492836
## income$25-35k                             -0.132541   0.335342  -0.395 0.692677
## income$35-50k                             -0.445473   0.323322  -1.378 0.168303
## income$50-100k                            -0.569835   0.304430  -1.872 0.061269
## income$100-200k                           -1.005289   0.318742  -3.154 0.001617
## income>$200k                              -1.256938   0.368011  -3.415 0.000640
## educationHigh school diploma or GED        0.381209   0.602595   0.633 0.527005
## educationSome college or technical school  0.904135   0.513082   1.762 0.078080
## educationCollege graduate                  0.975882   0.513649   1.900 0.057482
## educationGraduate or professional degree   0.832938   0.513798   1.621 0.105028
## smokingCurrent some-day smoker            -0.842184   0.394788  -2.133 0.032934
## smokingFormer smoker                      -0.738961   0.250086  -2.955 0.003138
## smokingNever smoker                       -1.368381   0.237205  -5.769 8.28e-09
##                                              
## (Intercept)                               ***
## physhlth_days                             ***
## bmi                                          
## sexFemale                                 ***
## exerciseNo                                .  
## age_group25-29                               
## age_group30-34                            *  
## age_group35-39                            *  
## age_group40-44                            ***
## age_group45-49                            ***
## age_group50-54                            ***
## age_group55-59                            ***
## age_group60-64                            ***
## age_group65-69                            ***
## age_group70-74                            ***
## age_group75-79                            ***
## age_group80+                              ***
## income$15-25k                                
## income$25-35k                                
## income$35-50k                                
## income$50-100k                            .  
## income$100-200k                           ** 
## income>$200k                              ***
## educationHigh school diploma or GED          
## educationSome college or technical school .  
## educationCollege graduate                 .  
## educationGraduate or professional degree     
## smokingCurrent some-day smoker            *  
## smokingFormer smoker                      ** 
## smokingNever smoker                       ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.514 on 7970 degrees of freedom
## Multiple R-squared:  0.6257, Adjusted R-squared:  0.6243 
## F-statistic: 459.4 on 29 and 7970 DF,  p-value: < 2.2e-16

coef(model)

##                               (Intercept) 
##                                4.81839770 
##                             physhlth_days 
##                                0.85079499 
##                                       bmi 
##                                0.01097774 
##                                 sexFemale 
##                                0.58691470 
##                                exerciseNo 
##                                0.27261058 
##                            age_group25-29 
##                               -0.60250175 
##                            age_group30-34 
##                               -0.76896262 
##                            age_group35-39 
##                               -0.72770861 
##                            age_group40-44 
##                               -1.34260981 
##                            age_group45-49 
##                               -1.92420498 
##                            age_group50-54 
##                               -2.27229975 
##                            age_group55-59 
##                               -2.23029964 
##                            age_group60-64 
##                               -2.64833599 
##                            age_group65-69 
##                               -3.15228943 
##                            age_group70-74 
##                               -3.38564774 
##                            age_group75-79 
##                               -3.24489757 
##                              age_group80+ 
##                               -3.59483883 
##                             income$15-25k 
##                               -0.23784516 
##                             income$25-35k 
##                               -0.13254056 
##                             income$35-50k 
##                               -0.44547317 
##                            income$50-100k 
##                               -0.56983491 
##                           income$100-200k 
##                               -1.00528854 
##                              income>$200k 
##                               -1.25693832 
##       educationHigh school diploma or GED 
##                                0.38120905 
## educationSome college or technical school 
##                                0.90413454 
##                 educationCollege graduate 
##                                0.97588172 
##  educationGraduate or professional degree 
##                                0.83293760 
##            smokingCurrent some-day smoker 
##                               -0.84218438 
##                      smokingFormer smoker 
##                               -0.73896133 
##                       smokingNever smoker 
##                               -1.36838119

Step 2: Write the fitted regression equation including all terms, with coefficients rounded to three decimal places.

Menthlth_days = 4.818 + (physhlth_days * 0.850) + (bmi * 0.010) + (sexFemale * 0.586) + (exerciseNo * 0.272) + (age_group25-29 * -0.602) + (age_group30-34 * -0.768) + (age_group35-39 * -0.727) + (age_group40-44 * -1.342) + (age_group45-49 * -1.924) + (age_group50-54 * -2.272) + (age_group55-59 * -2.230) + (age_group60-64 * -2.648) + (age_group65-69 * -3.152) + (age_group70-74 * -3.385) + (age_group75-79 * -3.244) + (age_group80+ * -3.594) + (income15-25k * -0.237) + income25-35k * -0.132) + (income35-50k * -0.445) + income50-100k * -0.569) + (income100-200k * -1.005) + (income>$200k * -1.256) + (educationHigh school diploma or GED * 0.381) + (educationSome college or technical school * 0.904) + (educationCollege graduate * 0.975 + (educationGraduate or professional degree * 0.832) + (smokingCurrent some-day smoker * -0.842) + (smokingFormer smoker * -0.738) + (smokingNever smoker * -1.368)

Step 3: Interpret the following coefficients in plain language. For each, state the direction, magnitude, and meaning in terms of mentally unhealthy days, holding all other variables constant: • physhlth_days (continuous predictor)

For every 1 day increase in physhlth_days, menthlth days increases by 0.850, keeping all other variables constant.

• bmi (continuous predictor)

For every 1 unit increase in bmi, menthlth days increases by 0.010, keeping all other variables constant.

• sex: Female vs. Male (reference)

Being female is associated with 0.586 more mentally unhealthy days per month compared to males, keeping all other variables constant.

• exercise: Yes vs. No (reference)

Not exercising is associated with 0.272 more mentally unhealthy days per month compared to exercising, keeping all other variables constant.

• One age_group coefficient of your choice

Being 30-34 is associated with 0.768 less mentally unhealthy days per month than being 18-24, keeping all other variables constant.

• Two income coefficients of your choice, compared to the reference category (Less than $15,000)

Making >200K is associated with 1.256 less mentally unhealthy days per month than making <15K, and making 35-50K is associated with 0.445 less mentally unhealthy days per month than making <15K, keeping all other variables constant.

Step 4: Report and interpret each of the following model-level statistics: • R-squared: proportion of variance in mentally unhealthy days explained by all predictors

0.6257: all predictors explain 62.57% of the variance in menthlth_days.

• Adjusted R-squared: how it differs from R-squared and what it adds

0.6243: all predictors explain 62.43% of the variance in menthlth_days. Differs from R2 because adjusted R2 only increases when a predictor is added that improves the model by more than would be expected by chance.

• Root MSE (residual standard error): average prediction error of the model

5.514: average of deviations from the linear model. Datapoints are 5.514 mental health days off from the regression model, on average.

• Overall F-test: state H0, F-statistic, degrees of freedom (numerator and denominator), p-value, and conclusion

H0: all the coefficients are 0 - there is no linear relationship between any predictor and the outcome. F: 459.4 df1 = 29, df2 = 7970 p-value: <2.2e-16

Conclusion: Reject H0; there is a linear relationship between some predictors and the outcome.

Anova(model, type = "III") |>
  tidy() |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(caption = "Type 3 ANOVA on Full Model") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Type 3 ANOVA on Full Model
term	sumsq	df	statistic	p.value
(Intercept)	1471.8139	1	48.4081	0.0000
physhlth_days	321747.4695	1	10582.2957	0.0000
bmi	37.6448	1	1.2381	0.2659
sex	664.0808	1	21.8417	0.0000
exercise	89.3911	1	2.9401	0.0864
age_group	9085.6971	12	24.9024	0.0000
income	760.5243	6	4.1689	0.0003
education	179.4841	4	1.4758	0.2066
smoking	1313.1821	3	14.3969	0.0000
Residuals	242322.4044	7970	NA	NA

Physhlth_days, sex, age_group, income, and smoking have statistically significant partial associations.

reduced <- lm(menthlth_days ~ physhlth_days + bmi + sex + exercise + age_group + education + smoking, data = brfss_analytic)

anova(reduced, model) %>%
  as.data.frame() %>%
  rownames_to_column("Model") %>%
  mutate(
    Model = c("Reduced (no Income)", "Full (with Income)"),
    across(where(is.numeric), ~ round(., 4))
  ) %>%
  kable(
    caption = "Partial F-Test: Does Income add to the model?",
    col.names = c("Model", "Res. df", "RSS", "df", "Sum of Sq", "F", "p-value")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Partial F-Test: Does Income add to the model?
Model	Res. df	RSS	df	Sum of Sq	F	p-value
Reduced (no Income)	7976	243082.9	NA	NA	NA	NA
Full (with Income)	7970	242322.4	6	760.5243	4.1689	3e-04

H0: Income does not add information to the prediction of mental health days, given the other variables present in the model. HA: Income adds information to the prediction of mental health days, given the other variables present in the model.

F: 4.1689 df: 6 p-value: 3e-04

I reject the null hypothesis and conclude that income adds information to the prediction of mental health days even given the other variables present.

reduced2 <- lm(menthlth_days ~ physhlth_days + bmi + sex + exercise + age_group + income + smoking, data = brfss_analytic)

anova(reduced2, model) %>%
  as.data.frame() %>%
  rownames_to_column("Model") %>%
  mutate(
    Model = c("Reduced (no Education)", "Full (with Education)"),
    across(where(is.numeric), ~ round(., 4))
  ) %>%
  kable(
    caption = "Partial F-Test: Does Education add to the model?",
    col.names = c("Model", "Res. df", "RSS", "df", "Sum of Sq", "F", "p-value")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Partial F-Test: Does Education add to the model?
Model	Res. df	RSS	df	Sum of Sq	F	p-value
Reduced (no Education)	7974	242501.9	NA	NA	NA	NA
Full (with Education)	7970	242322.4	4	179.4841	1.4758	0.2066

H0: Education does not add information to the prediction of mental health days, given the other variables present in the model. HA: Education adds information to the prediction of mental health days, given the other variables present in the model.

F: 1.4758 df: 4 p-value: 0.2066

I do not reject the null hypothesis and conclude that there is no evidence education adds information to the prediction of mental health days given the other variables present.

The results of the Type III ANOVA indicate that Physhlth_days, sex, age_group, income, and smoking significantly contribute to the model given that all other variables are already in the model. These predictors make the strongest independent contributions. The chunk tests indicate whether a group of variables adds to the model compared to a reduced model without that group of variables, and in this case, it was found that income but not education contributes significantly to the model. This is important because even if a single variable is not significant in the overall Type III ANOVA, it could still be an important contributor within a group of variables.

brfss_analytic <- brfss_analytic %>%
  mutate(
      smoker_current = case_when(smoking == "Current some-day smoker" ~ 0,
                                 smoking == "Current daily smoker" ~ 0,
                                 smoking == "Former smoker" ~ 1,
                                 smoking == "Never smoker" ~ 1),
  

      smoker_current = factor(smoker_current, levels=0:1, labels =
                         c("Current smoker",
                           "Non_smoker")) )

model_A <- lm(menthlth_days ~ physhlth_days + bmi + sex + smoker_current + exercise + age_group + income + education, data = brfss_analytic)
tidy(model_A, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Model A",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Model A
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	4.3262	0.6787	6.3742	0.0000	2.9958	5.6566
physhlth_days	0.8526	0.0083	103.0635	0.0000	0.8364	0.8688
bmi	0.0111	0.0099	1.1192	0.2631	-0.0083	0.0304
sexFemale	0.5489	0.1254	4.3752	0.0000	0.3029	0.7948
smoker_currentNon_smoker	-0.9071	0.2017	-4.4978	0.0000	-1.3025	-0.5118
exerciseNo	0.2770	0.1592	1.7405	0.0818	-0.0350	0.5890
age_group25-29	-0.5205	0.3760	-1.3844	0.1663	-1.2575	0.2165
age_group30-34	-0.6452	0.3566	-1.8093	0.0704	-1.3442	0.0538
age_group35-39	-0.5765	0.3505	-1.6450	0.1000	-1.2635	0.1105
age_group40-44	-1.1457	0.3474	-3.2977	0.0010	-1.8267	-0.4647
age_group45-49	-1.7312	0.3583	-4.8316	0.0000	-2.4335	-1.0288
age_group50-54	-2.1153	0.3495	-6.0517	0.0000	-2.8005	-1.4301
age_group55-59	-2.0583	0.3400	-6.0539	0.0000	-2.7247	-1.3918
age_group60-64	-2.4695	0.3280	-7.5287	0.0000	-3.1125	-1.8265
age_group65-69	-2.9304	0.3287	-8.9139	0.0000	-3.5748	-2.2859
age_group70-74	-3.1576	0.3323	-9.5012	0.0000	-3.8091	-2.5062
age_group75-79	-2.9905	0.3513	-8.5116	0.0000	-3.6793	-2.3018
age_group80+	-3.3630	0.3554	-9.4622	0.0000	-4.0597	-2.6663
income$15-25k	-0.2432	0.3472	-0.7003	0.4837	-0.9239	0.4375
income$25-35k	-0.1420	0.3358	-0.4229	0.6723	-0.8002	0.5162
income$35-50k	-0.4414	0.3237	-1.3637	0.1727	-1.0760	0.1931
income$50-100k	-0.5743	0.3048	-1.8842	0.0596	-1.1719	0.0232
income$100-200k	-0.9893	0.3191	-3.0998	0.0019	-1.6149	-0.3637
income>$200k	-1.2491	0.3685	-3.3899	0.0007	-1.9714	-0.5268
educationHigh school diploma or GED	0.4668	0.6031	0.7741	0.4389	-0.7154	1.6491
educationSome college or technical school	1.0098	0.5132	1.9675	0.0492	0.0037	2.0159
educationCollege graduate	1.0710	0.5139	2.0839	0.0372	0.0635	2.0784
educationGraduate or professional degree	0.8558	0.5144	1.6637	0.0962	-0.1525	1.8641

model_B <- lm(menthlth_days ~ physhlth_days + bmi + sex * smoker_current + exercise + age_group + income + education, data = brfss_analytic)
tidy(model_B, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Model B",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Model B
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	4.0900	0.6923	5.9081	0.0000	2.7330	5.4470
physhlth_days	0.8522	0.0083	102.9927	0.0000	0.8360	0.8684
bmi	0.0108	0.0099	1.0962	0.2730	-0.0085	0.0302
sexFemale	1.1269	0.3575	3.1522	0.0016	0.4261	1.8277
smoker_currentNon_smoker	-0.5875	0.2737	-2.1465	0.0319	-1.1241	-0.0510
exerciseNo	0.2758	0.1591	1.7330	0.0831	-0.0362	0.5878
age_group25-29	-0.5236	0.3759	-1.3927	0.1637	-1.2605	0.2134
age_group30-34	-0.6432	0.3565	-1.8041	0.0713	-1.3421	0.0557
age_group35-39	-0.5824	0.3504	-1.6618	0.0966	-1.2693	0.1046
age_group40-44	-1.1493	0.3474	-3.3085	0.0009	-1.8303	-0.4683
age_group45-49	-1.7351	0.3583	-4.8431	0.0000	-2.4374	-1.0328
age_group50-54	-2.1209	0.3495	-6.0681	0.0000	-2.8061	-1.4358
age_group55-59	-2.0601	0.3399	-6.0601	0.0000	-2.7265	-1.3937
age_group60-64	-2.4701	0.3280	-7.5315	0.0000	-3.1130	-1.8272
age_group65-69	-2.9346	0.3287	-8.9278	0.0000	-3.5790	-2.2903
age_group70-74	-3.1600	0.3323	-9.5093	0.0000	-3.8114	-2.5086
age_group75-79	-2.9967	0.3513	-8.5298	0.0000	-3.6854	-2.3080
age_group80+	-3.3610	0.3554	-9.4577	0.0000	-4.0577	-2.6644
income$15-25k	-0.2396	0.3472	-0.6901	0.4902	-0.9202	0.4410
income$25-35k	-0.1352	0.3358	-0.4028	0.6871	-0.7934	0.5229
income$35-50k	-0.4322	0.3237	-1.3352	0.1818	-1.0668	0.2023
income$50-100k	-0.5650	0.3048	-1.8535	0.0639	-1.1625	0.0326
income$100-200k	-0.9855	0.3191	-3.0882	0.0020	-1.6110	-0.3599
income>$200k	-1.2470	0.3684	-3.3846	0.0007	-1.9693	-0.5248
educationHigh school diploma or GED	0.4322	0.6034	0.7164	0.4738	-0.7505	1.6150
educationSome college or technical school	0.9721	0.5136	1.8925	0.0585	-0.0348	1.9790
educationCollege graduate	1.0310	0.5144	2.0043	0.0451	0.0226	2.0393
educationGraduate or professional degree	0.8155	0.5149	1.5840	0.1132	-0.1937	1.8248
sexFemale:smoker_currentNon_smoker	-0.6563	0.3801	-1.7268	0.0843	-1.4013	0.0887

anova(model_A, model_B) %>%
  as.data.frame() %>%
  rownames_to_column("Model") %>%
  mutate(
    Model = c("Model A", "Model B"),
    across(where(is.numeric), ~ round(., 4))
  ) %>%
  kable(
    caption = "Partial F-Test",
    col.names = c("Model", "Res. df", "RSS", "df", "Sum of Sq", "F", "p-value")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Partial F-Test
Model	Res. df	RSS	df	Sum of Sq	F	p-value
Model A	7972	243018.9	NA	NA	NA	NA
Model B	7971	242928.0	1	90.871	2.9817	0.0843

H0: the two lines are identical; current smoking has no effect on the sex-menthlth_days relationship (neither slope nor intercept) HA: the two lines are not identical, current smoking does affect the sex-menthlth days relationship (either slope or intercept or both)

F: 2.9817 df: 1 p-value: 0.0843

I conclude the two lines are identical; I fail to reject H0, and conclude current smoking has no effect on the sex-menthlth_days relationship (neither slope nor intercept).

The estimated effect of current smoking on mentally unhealthy days among men is -0.5875. This means that holding all other variables equal, not being a current smoker results in 0.5875 less mentally unhealthy days than being a current smoker.

The estimated effect of current smoking on mentally unhealthy days among women is -1.2438. This means that holding all other variables equal, not being a current smoker results in 1.2438 less mentally unhealthy days than being a current smoker.

This indicates that smoking is more strongly associated with poor mental health in females than in males. In males, being a current smoker results in 0.5875 more mentally unhealthy days, but in females, being a current smoker results in 1.2438 more mentally unhealthy days. Thus, sex modifies the current smoker-menthlth_days relationship.

pred_int <- ggpredict(model_B, terms = c("smoker_current", "sex"))

ggplot(pred_int, aes(x = x, y = predicted, color = group, fill = group)) +
  geom_point() +
  geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.12, color = NA) +
  labs(
    title    = "Predicted Mental Health Days by Sex & Smoking",
    x        = "Smoking status",
    y        = "Predicted Mentally Unhealthy Days",
    color    = "sex",
    fill     = "sex"
  ) +
  theme_minimal(base_size = 13) +
  scale_color_brewer(palette = "Set1")

The figure shows there is an interaction between smoking status and sex, that is, current smoking modifies the sex-mental health days relationship. Among current smokers, females have much more mentally unhealthy days than males, but among non-smokers, the gap between sexes is much smaller. This means that the relationship between sex and mentally unhealthy days is affected by current smoking.

REFLECTION

The results suggest that physically unhealthy days, sex, age group, income, and smoking status are the determinants most strongly associated with mental health among US adults. BMI, exercise, and education were not statistically significant. Key limitations of using cross-sectional survey data is that the data was collected at a single point in time, and so it is not possible to infer causality, which requires a temporal relationship to be assessed. Additionally, there may be additional variables unaccounted for that could be confounding observed relationships. For example, the presence of chronic illness could be a confounder of the physically unhealthy days-mentally unhealthy days relationship, as well as a potential confounder of the age group-mentally unhealthy days relationship.

Adjusted R squared only changes when variables are added that significantly contribute to the prediction power of the model more than would be expected by chance, while R squared does not make this distinction. This means that when adding multiple categorical predictors, adjusted R squared is a better metric to determine linear relationships between predictors and the outcome. Groups of predictors should be tested with chunk tests because a variable may be nonsignificant on its own, but contribute information to the model as part of a group of variables.

I did not use AI at all to complete this assignment.

EPI 553 — Homework 3: Multiple Linear Regression, Tests of Hypotheses, and Interaction Analysis

Josh Macera

Due: Thursday, March 6, 2026 at 11:59 PM ET

Part 0: Data Preparation (10 points)