Logistic Regression

Instructions

Complete the four tasks below using the BRFSS 2020 dataset (brfss_logistic_2020.rds). Submit a knitted HTML file via Brightspace. You may collaborate, but each student must submit their own work.

Data

Variable	Description	Type
`fmd`	Frequent mental distress (No/Yes)	Factor (outcome)
`menthlth_days`	Mentally unhealthy days (0-30)	Numeric
`physhlth_days`	Physically unhealthy days (0-30)	Numeric
`sleep_hrs`	Hours of sleep per night	Numeric
`age`	Age in years	Numeric
`sex`	Male / Female	Factor
`bmi`	Body mass index	Numeric
`exercise`	Exercised in past 30 days (No/Yes)	Factor
`income_cat`	Household income category (1-8)	Numeric
`smoker`	Former/Never vs. Current	Factor

library(tidyverse)
library(broom)
library(knitr)
library(kableExtra)
library(gtsummary)
library(ggeffects)

options(gtsummary.use_ftExtra = TRUE)
set_gtsummary_theme(theme_gtsummary_compact(set_theme = TRUE))

brfss_logistic <- readRDS("~/desktop/EPI553/data/brfss_logistic_2020.rds")

Task 1: Explore the Binary Outcome (15 points)

1a. (5 pts) Create a frequency table showing the number and percentage of individuals with and without frequent mental distress.

# create frequency table showing # and % of individuals w and w/o FMD
brfss_logistic |>
  tbl_summary(
    include = fmd,
    type= list(fmd ~ "categorical"), 
    statistic = all_categorical() ~ "{n} ({p}%)",
    label = list(fmd ~ "Frequent Mental Distress"),
  )

Characteristic	N = 5,000¹
Frequent Mental Distress
No	4,243 (85%)
Yes	757 (15%)
¹ n (%)

1b. (5 pts) Create a descriptive summary table of at least 4 predictors, stratified by FMD status. Use tbl_summary().

# descriptive summary table
brfss_logistic |>
  tbl_summary(
    by = fmd,
    include = c(physhlth_days, sleep_hrs, age, sex, bmi, exercise,
                income_cat, smoker),
    type = list(
      c(physhlth_days, sleep_hrs, age, bmi, income_cat) ~ "continuous"),
    statistic = list(
      all_continuous() ~ "{mean} ({sd})"
    ),
    label = list(
      physhlth_days ~ "Physical unhealthy days",
      sleep_hrs     ~ "Sleep hours",
      age           ~ "Age (years)",
      sex           ~ "Sex",
      bmi           ~ "BMI",
      exercise      ~ "Exercise in past 30 days",
      income_cat    ~ "Income category (1-8)",
      smoker        ~ "Smoking status"
    )
  ) |>
  add_overall() |>
  add_p() |>
  bold_labels()

Characteristic	Overall N = 5,000¹	No N = 4,243¹	Yes N = 757¹	p-value²
Physical unhealthy days	4 (9)	3 (8)	10 (13)	<0.001
Sleep hours	7.00 (1.48)	7.09 (1.40)	6.51 (1.83)	<0.001
Age (years)	56 (16)	57 (16)	50 (16)	<0.001
Sex				<0.001
Male	2,701 (54%)	2,378 (56%)	323 (43%)
Female	2,299 (46%)	1,865 (44%)	434 (57%)
BMI	28.5 (6.3)	28.4 (6.2)	29.3 (7.0)	0.001
Exercise in past 30 days	3,673 (73%)	3,192 (75%)	481 (64%)	<0.001
Income category (1-8)	5.85 (2.11)	6.00 (2.04)	5.05 (2.29)	<0.001
Smoking status				<0.001
Former/Never	3,280 (66%)	2,886 (68%)	394 (52%)
Current	1,720 (34%)	1,357 (32%)	363 (48%)
¹ Mean (SD); n (%)
² Wilcoxon rank sum test; Pearson’s Chi-squared test

1c. (5 pts) Create a bar chart showing the proportion of FMD by exercise status OR smoking status.

# bar chart proportion of FMD by exercise
brfss_logistic |>
  ggplot(aes(x = exercise, fill = fmd)) +
  geom_bar(position = "fill") +
  labs(
    x = "Exercise in Past 30 Days",
    y = "Proportion of FMD",
    fill = "Frequent Mental Distress"
  ) +
  scale_y_continuous(labels = scales::percent) +
  scale_fill_manual(values= c("No"= "hotpink","Yes"= "purple")) +
  theme_minimal()

Task 2: Simple Logistic Regression (20 points)

2a. (5 pts) Fit a simple logistic regression model predicting FMD from exercise. Report the coefficients on the log-odds scale.

B0 (Intercept)= -1.337 B1 (Exercise coefficient)= -0.555

# fit simple logistic regression model
mod_exercise <- glm(fmd ~ exercise, data = brfss_logistic,
                    family = binomial(link = "logit"))

# output
tidy(mod_exercise, conf.int = TRUE, exponentiate = FALSE) |>
  kable(digits = 3, caption = "Simple Logistic Regression: FMD ~ Exercise (Log-Odds Scale)") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Simple Logistic Regression: FMD ~ Exercise (Log-Odds Scale)
term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	-1.337	0.068	-19.769	0	-1.471	-1.206
exerciseYes	-0.555	0.083	-6.655	0	-0.718	-0.391

2b. (5 pts) Exponentiate the coefficients to obtain odds ratios with 95% confidence intervals.

# getting odds ratio by exponentiating the coefficients
tidy(mod_exercise, conf.int = TRUE, exponentiate = TRUE) |>
  kable(digits = 3,
        caption = "Simple Logistic Regression: FMD ~ Exercise (Odds Ratio Scale)") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Simple Logistic Regression: FMD ~ Exercise (Odds Ratio Scale)
term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	0.263	0.068	-19.769	0	0.230	0.299
exerciseYes	0.574	0.083	-6.655	0	0.488	0.676

# create table of results
mod_exercise |>
  tbl_regression(
    exponentiate = TRUE,
    label = list(exercise ~ "Exercise in past 30 days")
  ) |>
  bold_labels() |>
  bold_p()

Characteristic	OR	95% CI	p-value
Exercise in past 30 days
No	—	—
Yes	0.57	0.49, 0.68	<0.001
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

2c. (5 pts) Interpret the odds ratio for exercise in the context of the research question.

The odds ratio for exercise is 0.574 with a 95% CI of [0.488, 0.676] which tells us that those who exercise have lower odds of frequent mental distress in comparison to those who do not exercise, without adjustment for any other variables. Those who exercise have 42.6% lower odds of frequent mental distress.

2d. (5 pts) Create a plot showing the predicted probability of FMD across levels of a continuous predictor (e.g., age or sleep hours).

# predicted probability of FMD across continuous predictor (sleep hours)
mod_age <- glm(fmd ~ sleep_hrs, data = brfss_logistic,
               family = binomial(link = "logit"))

tidy(mod_age, conf.int = TRUE, exponentiate = TRUE) |>
  kable(digits = 3,
        caption = "Simple Logistic Regression: FMD ~ Sleep Hours (Odds Ratio Scale)") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Simple Logistic Regression: FMD ~ Sleep Hours (Odds Ratio Scale)
term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	1.101	0.184	0.523	0.601	0.767	1.580
sleep_hrs	0.765	0.027	-9.835	0.000	0.725	0.807

# plot (visualization)
ggpredict(mod_age, terms = "sleep_hrs") |>
  plot() +
  labs(title = "Predicted Probability of Frequent Mental Distress by Sleep",
       x = "Sleep (Hours)", y = "Predicted Probability of FMD") +
  theme_minimal()

Task 3: Comparing Predictors (20 points)

3a. (5 pts) Fit three separate simple logistic regression models, each with a different predictor of your choice.

# first simple logistic regression model
mod_smoker <- glm(fmd ~ smoker, data = brfss_logistic,
                    family = binomial(link = "logit"))

# second simple logistic regression model
mod_income_cat <- glm(fmd ~ income_cat, data = brfss_logistic,
                    family = binomial(link = "logit"))

#third simple logistic regression model
mod_age <- glm(fmd ~ age, data = brfss_logistic,
                    family = binomial(link = "logit"))

3b. (10 pts) Create a table comparing the odds ratios from all three models.

# odds ratio
table_smoker <- tidy(mod_smoker, conf.int = TRUE, exponentiate = TRUE) |>
    mutate(Model = "Smoker")
table_income <- tidy(mod_income_cat, conf.int = TRUE, exponentiate = TRUE) |>
   mutate(Model = "Income")
table_age <- tidy(mod_age, conf.int = TRUE, exponentiate = TRUE) |>
   mutate(Model = "Age")

# combine all results 
all_models <- bind_rows(table_smoker, table_income, table_age)

# final comparison table 
all_models |>
  select(Model, term, estimate, conf.low, conf.high, p.value) |>
  kable(digits = 3,
        col.names = c("Model", "Predictor", "OR", "CI Lower", "CI Upper", "p-value"),
        caption = "Comparison of Odds Ratios Across Three Logistic Regression Models") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Comparison of Odds Ratios Across Three Logistic Regression Models
Model	Predictor	OR	CI Lower	CI Upper	p-value
Smoker	(Intercept)	0.137	0.123	0.151	0.000
Smoker	smokerCurrent	1.959	1.675	2.291	0.000
Income	(Intercept)	0.531	0.435	0.647	0.000
Income	income_cat	0.821	0.793	0.850	0.000
Age	(Intercept)	0.725	0.559	0.936	0.014
Age	age	0.974	0.970	0.979	0.000

3c. (5 pts) Which predictor has the strongest crude association with FMD? Justify your answer.

The smoker predictor has the strongest crude association with FMD. Current smokers have 1.959 times higher odds of FMD in comparison to non-smokers. Since the OR for smoking is farther from 1 than the OR for age and income then it has the strongest crude association.

Task 4: Introduction to Multiple Logistic Regression (20 points)

4a. (5 pts) Fit a multiple logistic regression model predicting FMD from at least 3 predictors.

# mlr predicting FMD from smoking status, age, sex, sleep, and income
mod_multi <- glm(fmd ~ smoker + age + sex + sleep_hrs + income_cat,
                 data = brfss_logistic,
                 family = binomial(link = "logit"))

4b. (5 pts) Report the adjusted odds ratios using tbl_regression().

# reporting adjusted odds ratio
mod_multi |>
  tbl_regression(
    exponentiate = TRUE,
    label = list(
      age        ~ "Age (per year)",
      sex        ~ "Sex",
      smoker     ~ "Smoking status",
      sleep_hrs  ~ "Sleep hours (per hour)",
      income_cat ~ "Income category (per unit)"
    )
  ) |>
  bold_labels() |>
  bold_p()

Characteristic	OR	95% CI	p-value
Smoking status
Former/Never	—	—
Current	1.29	1.08, 1.53	0.004
Age (per year)	0.98	0.97, 0.98	<0.001
Sex
Male	—	—
Female	1.67	1.42, 1.97	<0.001
Sleep hours (per hour)	0.81	0.77, 0.85	<0.001
Income category (per unit)	0.84	0.81, 0.87	<0.001
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

4c. (5 pts) For one predictor, compare the crude OR (from Task 3) with the adjusted OR (from Task 4). Show both values.

crude_smoker <- tidy(mod_smoker, exponentiate = TRUE, conf.int = TRUE) |>
  filter(term == "smokerCurrent") |>
  dplyr::select(term, estimate, conf.low, conf.high) |>
  mutate(type = "Crude")

adj_smoker <- tidy(mod_multi, exponentiate = TRUE, conf.int = TRUE) |>
  filter(term == "smokerCurrent") |>
  dplyr::select(term, estimate, conf.low, conf.high) |>
  mutate(type = "Adjusted") 

bind_rows(crude_smoker, adj_smoker) |>
  mutate(across(c(estimate, conf.low, conf.high), \(x) round(x, 3))) |>
  kable(col.names = c("Predictor", "OR", "95% CI Lower", "95% CI Upper", "Type"),
        caption = "Crude vs. Adjusted Odds Ratios") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Crude vs. Adjusted Odds Ratios
Predictor	OR	95% CI Lower	95% CI Upper	Type
smokerCurrent	1.959	1.675	2.291	Crude
smokerCurrent	1.286	1.084	1.525	Adjusted

4d. (5 pts) In 2-3 sentences, assess whether confounding is present for the predictor you chose. Which direction did the OR change, and what does this mean?

The crude OR for the smoker predictor is 1.959 and the adjusted OR is 1.286. These values suggest that confounding is present for the smoker predictor because the adjusted OR moved towards 1, indicating confounding was inflating the crude association.

Completion credit (25 points): Awarded for a complete, good-faith attempt at all tasks. Total: 75 + 25 = 100 points.

End of Lab Activity

Logistic Regression — Part 1

EPI 553 — Principles of Statistical Inference II (Spring 2026)

Jackie Korosec

April 13, 2026

Part 7: In-Class Lab Activity