library(tidyverse)
library(gt)
library(scales)
library(broom)

Logistic regression predicts the probability of a binary outcome (yes/no, survived/died, 0/1) from one or more predictor variables. Unlike linear regression, which predicts a continuous value, logistic regression predicts a probability between 0 and 1.
The model uses the logistic function to ensure predictions stay within this range, producing the characteristic S-shaped curve.
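To see that S-shape concretely, here is a minimal sketch of the logistic function; the logistic() helper is our own name for it, and base R's plogis() computes the same mapping:

```r
# The logistic function squashes any real number into (0, 1)
logistic <- function(x) 1 / (1 + exp(-x))

logistic(c(-5, 0, 5))
#> [1] 0.006692851 0.500000000 0.993307149

# Base R's plogis() is the identical mapping
all.equal(logistic(2), plogis(2))
#> [1] TRUE
```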
We’ll use Titanic passenger data to ask: Can we predict who survived based on their characteristics?
We’ll start with a single predictor (fare paid), then build up to a model with multiple predictors.
We’ll download the famous Titanic dataset, which contains detailed passenger information including age, sex, class, fare, and survival status.
titanic_df <- read_csv("https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv") |>
rename_with(tolower) |>
select(survived, pclass, sex, age, fare) |>
mutate(
survived = factor(survived,
levels = c(0, 1),
labels = c("Died", "Survived")),
pclass = factor(pclass,
levels = c(1, 2, 3),
labels = c("1st", "2nd", "3rd")),
sex = factor(sex)
) |>
drop_na()
glimpse(titanic_df)

Rows: 887
Columns: 5
$ survived <fct> Died, Survived, Survived, Survived, Died, Died, Died, Died, S…
$ pclass <fct> 3rd, 1st, 3rd, 1st, 3rd, 3rd, 1st, 3rd, 3rd, 2nd, 3rd, 1st, 3…
$ sex <fct> male, female, female, female, male, male, male, male, female,…
$ age <dbl> 22, 38, 26, 35, 35, 27, 54, 2, 27, 14, 4, 58, 20, 39, 14, 55,…
$ fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625, 21…
Before modelling, let’s explore how fare relates to survival.
ggplot(titanic_df,
aes(x = fare,
y = as.numeric(survived == "Survived"))) +
# Individual points (jittered)
geom_jitter(aes(colour = survived),
height = 0.05,
alpha = 0.4,
size = 2) +
# Smoothed probability curve (glm with binomial family)
geom_smooth(method = "glm",
method.args = list(family = "binomial"),
colour = "#1d3557",
fill = "#1d3557",
alpha = 0.2,
linewidth = 1.2) +
# Annotation: the S-curve
annotate("label",
x = 400, y = 0.5,
label = "The logistic curve:\nprobability stays\nbetween 0 and 1",
fill = "#f1faee",
colour = "#1d3557",
size = 3.5,
label.padding = unit(0.5, "lines")) +
annotate("curve",
x = 350, y = 0.5,
xend = 200, yend = 0.7,
curvature = 0.3,
arrow = arrow(length = unit(2, "mm")),
colour = "#1d3557") +
# Annotation: low fare region
annotate("segment",
x = 0, xend = 30,
y = -0.12, yend = -0.12,
colour = "#e63946",
linewidth = 2) +
annotate("text",
x = 15, y = -0.18,
label = "Low fare = Low survival",
colour = "#e63946",
fontface = "bold",
size = 3.5) +
scale_colour_manual(values = c("Died" = "#e63946",
"Survived" = "#2a9d8f")) +
scale_x_continuous(labels = dollar_format(),
limits = c(0, 520)) +
scale_y_continuous(breaks = c(0, 0.5, 1),
labels = c("0\n(Died)", "0.5", "1\n(Survived)")) +
coord_cartesian(clip = "off",
ylim = c(-0.1, 1.1)) +
labs(
title = "Did Money Buy Survival on the Titanic?",
subtitle = "Higher fares were associated with higher survival probability",
x = "Fare Paid",
y = "Probability of Survival",
colour = NULL
) +
theme_minimal(base_size = 14) +
theme(
plot.title = element_text(face = "bold", size = 18),
plot.subtitle = element_text(colour = "grey40"),
plot.title.position = "plot",
legend.position = "bottom",
panel.grid.minor = element_blank(),
plot.margin = margin(10, 10, 30, 10)
)
The S-shaped curve shows that survival probability increases with fare, but the relationship isn’t linear—it follows the logistic function.
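In equation form, the fitted curve is \(\Pr(\text{Survived}) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 \cdot \text{fare})}}\), where \(\beta_0\) and \(\beta_1\) are the intercept and fare coefficient on the log-odds scale.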
Let’s start with a single predictor: Fare.
# family = binomial specifies logistic regression
model_simple <- glm(survived ~ fare,
data = titanic_df,
family = binomial)
summary(model_simple)
Call:
glm(formula = survived ~ fare, family = binomial, data = titanic_df)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.931730 0.095215 -9.786 < 2e-16 ***
fare 0.015084 0.002228 6.771 1.28e-11 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1182.8 on 886 degrees of freedom
Residual deviance: 1114.6 on 885 degrees of freedom
AIC: 1118.6
Number of Fisher Scoring iterations: 4
The raw coefficients are in log-odds, which aren’t intuitive. Let’s convert them to odds ratios:
# tidy() from broom extracts coefficients with confidence intervals
tidy(model_simple, conf.int = TRUE) |>
mutate(
odds_ratio = exp(estimate),
or_lower = exp(conf.low),
or_upper = exp(conf.high)
) |>
select(term, estimate, odds_ratio,
or_lower, or_upper, p.value) |>
gt() |>
fmt_number(columns = c(estimate, odds_ratio,
or_lower, or_upper),
decimals = 3) |>
fmt_scientific(columns = p.value, decimals = 2) |>
cols_label(
term = "Term",
estimate = "Log-Odds",
odds_ratio = "Odds Ratio",
or_lower = "OR (Lower)",
or_upper = "OR (Upper)",
p.value = "p-value"
) |>
tab_header(
title = md("**Simple Logistic Regression: Fare**")
)
**Simple Logistic Regression: Fare**

| Term        | Log-Odds | Odds Ratio | OR (Lower) | OR (Upper) | p-value       |
|-------------|----------|------------|------------|------------|---------------|
| (Intercept) | −0.932   | 0.394      | 0.326      | 0.473      | 1.30 × 10⁻²² |
| fare        | 0.015    | 1.015      | 1.011      | 1.020      | 1.28 × 10⁻¹¹ |
Interpretation: The odds ratio for Fare is approximately 1.015. This means for each additional dollar in fare, the odds of survival increase by about 1.5%. For a $10 increase, the odds multiply by \(1.015^{10} \approx 1.16\), or a 16% increase.
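We can verify this arithmetic directly from the fitted model; the new_fares grid below is hypothetical, created just for illustration:

```r
# Odds ratio per $1 and per $10 of fare
exp(coef(model_simple)["fare"])        # ~1.015 per extra dollar
exp(coef(model_simple)["fare"] * 10)   # ~1.16 per extra $10, i.e. 1.015^10

# Predicted survival probabilities at a few illustrative fares
new_fares <- tibble(fare = c(10, 50, 100, 250))
predict(model_simple, newdata = new_fares, type = "response")
```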
Now let’s add more predictors: Sex and Passenger Class.
model_full <- glm(survived ~ fare + sex + pclass,
data = titanic_df,
family = binomial)
summary(model_full)
Call:
glm(formula = survived ~ fare + sex + pclass, family = binomial,
data = titanic_df)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.142035 0.278170 7.700 1.36e-14 ***
fare 0.001815 0.002101 0.864 0.38769
sexmale -2.617672 0.185211 -14.133 < 2e-16 ***
pclass2nd -0.735407 0.270871 -2.715 0.00663 **
pclass3rd -1.782118 0.251735 -7.079 1.45e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1182.77 on 886 degrees of freedom
Residual deviance: 825.31 on 882 degrees of freedom
AIC: 835.31
Number of Fisher Scoring iterations: 4
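As a check beyond the summary output (not part of the original analysis), a likelihood-ratio test compares the two nested models; the deviance drop from roughly 1115 to 825 on three extra degrees of freedom is highly significant:

```r
# Likelihood-ratio test: does adding sex and pclass improve on fare alone?
anova(model_simple, model_full, test = "Chisq")
```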
model_results <- tidy(model_full, conf.int = TRUE) |>
filter(term != "(Intercept)") |>
mutate(
odds_ratio = exp(estimate),
or_lower = exp(conf.low),
or_upper = exp(conf.high),
term = case_when(
term == "fare" ~ "Fare (per $1)",
term == "sexmale" ~ "Sex: Male",
term == "pclass2nd" ~ "Class: 2nd",
term == "pclass3rd" ~ "Class: 3rd"
),
term = fct_reorder(term, odds_ratio),
significant = p.value < 0.05
)
ggplot(model_results,
aes(x = odds_ratio, y = term)) +
# Reference line at OR = 1
geom_vline(xintercept = 1,
linetype = "dashed",
colour = "grey50",
linewidth = 0.8) +
# Confidence intervals
geom_errorbarh(aes(xmin = or_lower,
xmax = or_upper),
height = 0.2,
linewidth = 0.8,
colour = "grey30") +
# Point estimates
geom_point(aes(fill = odds_ratio < 1),
size = 5,
shape = 21,
colour = "black",
stroke = 1) +
# Labels on points
geom_text(aes(label = sprintf("%.2f", odds_ratio)),
vjust = -1.2,
fontface = "bold",
size = 3.5) +
# Annotations
annotate("text",
x = 0.3, y = 4.3,
label = "← Lower survival odds",
hjust = 0,
colour = "#e63946",
fontface = "italic",
size = 3.5) +
annotate("text",
x = 1.5, y = 4.3,
label = "Higher survival odds →",
hjust = 0,
colour = "#2a9d8f",
fontface = "italic",
size = 3.5) +
scale_fill_manual(values = c("TRUE" = "#e63946",
"FALSE" = "#2a9d8f"),
guide = "none") +
scale_x_log10(breaks = c(0.1, 0.2, 0.5, 1, 2)) +
coord_cartesian(clip = "off",
xlim = c(0.08, 2.5)) +
labs(
title = "Who Survived? Odds Ratios from Multiple Logistic Regression",
subtitle = "Adjusted for all other variables in the model",
x = "Odds Ratio (log scale)",
y = NULL
) +
theme_minimal(base_size = 14) +
theme(
plot.title = element_text(face = "bold", size = 16),
plot.subtitle = element_text(colour = "grey40"),
plot.title.position = "plot",
panel.grid.minor = element_blank(),
panel.grid.major.y = element_blank(),
axis.text.y = element_text(face = "bold", size = 11),
plot.margin = margin(20, 10, 10, 10)
)

tidy(model_full, conf.int = TRUE) |>
mutate(
odds_ratio = exp(estimate),
or_ci = paste0(round(exp(conf.low), 2), " – ",
round(exp(conf.high), 2)),
term = case_when(
term == "(Intercept)" ~ "Intercept",
term == "fare" ~ "Fare (per $1)",
term == "sexmale" ~ "Sex: Male (vs Female)",
term == "pclass2nd" ~ "Class: 2nd (vs 1st)",
term == "pclass3rd" ~ "Class: 3rd (vs 1st)"
)
) |>
select(term, odds_ratio, or_ci, p.value) |>
gt() |>
fmt_number(columns = odds_ratio, decimals = 2) |>
fmt_scientific(columns = p.value, decimals = 2) |>
cols_label(
term = "Predictor",
odds_ratio = "Odds Ratio",
or_ci = "95% CI",
p.value = "p-value"
) |>
data_color(
columns = odds_ratio,
rows = odds_ratio < 1,
palette = "#e63946"
) |>
data_color(
columns = odds_ratio,
rows = odds_ratio > 1,
palette = "#2a9d8f"
) |>
tab_header(
title = md("**Multiple Logistic Regression Results**"),
subtitle = "Predicting survival on the Titanic"
) |>
tab_style(
style = cell_text(weight = "bold"),
locations = cells_column_labels()
)

**Multiple Logistic Regression Results**
Predicting survival on the Titanic

| Predictor             | Odds Ratio | 95% CI       | p-value       |
|-----------------------|------------|--------------|---------------|
| Intercept             | 8.52       | 4.95 – 14.77 | 1.36 × 10⁻¹⁴ |
| Fare (per $1)         | 1.00       | 1.00 – 1.01  | 3.88 × 10⁻¹  |
| Sex: Male (vs Female) | 0.07       | 0.05 – 0.10  | 2.36 × 10⁻⁴⁵ |
| Class: 2nd (vs 1st)   | 0.48       | 0.28 – 0.81  | 6.63 × 10⁻³  |
| Class: 3rd (vs 1st)   | 0.17       | 0.10 – 0.28  | 1.45 × 10⁻¹² |
Each coefficient is now adjusted for the other variables. Here’s what the results tell us:
- **Sex: Male:** Men had dramatically lower odds of survival compared to women, holding fare and class constant. This is the strongest predictor: “women and children first” was clearly applied.
- **Class: 3rd:** Third-class passengers had much lower odds of survival compared to first-class, even after accounting for fare and sex.
- **Class: 2nd:** Second-class passengers also had lower odds of survival compared to first-class.
- **Fare:** After controlling for sex and class, fare itself has little additional predictive power. The effect of money was largely through class assignment.
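To make these adjusted effects concrete, here is a minimal sketch that predicts survival probabilities for a few illustrative passenger profiles; the profiles grid is hypothetical, not part of the original analysis:

```r
# Predicted survival probabilities for illustrative passenger profiles,
# all paying the same $30 fare so that sex and class drive the differences
profiles <- expand_grid(
  sex    = factor(c("female", "male")),
  pclass = factor(c("1st", "3rd"), levels = c("1st", "2nd", "3rd")),
  fare   = 30
)

profiles |>
  mutate(p_survive = predict(model_full, newdata = profiles, type = "response"))
```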
Logistic regression has several assumptions that should be verified.
Our outcome (survived) is binary (Died/Survived). This assumption is satisfied by design.
Each observation should be independent. This is satisfied—each row represents a different passenger.
For continuous predictors like Fare, the relationship between the predictor and the log-odds of the outcome should be linear. We can check this visually:
# Create bins and calculate empirical log-odds
linearity_check <- titanic_df |>
mutate(fare_bin = cut_number(fare, n = 8)) |>
group_by(fare_bin) |>
summarise(
fare_mid = median(fare),
n = n(),
prop_survived = mean(survived == "Survived"),
log_odds = log(prop_survived / (1 - prop_survived)),
.groups = "drop"
) |>
filter(is.finite(log_odds))
ggplot(linearity_check,
aes(x = fare_mid, y = log_odds)) +
geom_point(size = 4, colour = "#1d3557") +
geom_smooth(method = "lm",
colour = "#e63946",
fill = "#e63946",
alpha = 0.2) +
annotate("label",
x = 350, y = -0.5,
label = "Roughly linear:\nassumption satisfied",
fill = "#d8f3dc",
colour = "#2d6a4f",
fontface = "bold",
size = 4) +
labs(
title = "Checking Linearity of Log-Odds",
subtitle = "Binned fare values vs empirical log-odds of survival",
x = "Fare (bin midpoint)",
y = "Log-Odds of Survival"
) +
theme_minimal(base_size = 14) +
theme(
plot.title = element_text(face = "bold"),
plot.title.position = "plot"
)

The relationship appears roughly linear, so the assumption is reasonably satisfied.
Predictors shouldn’t be too highly correlated. We can check the Variance Inflation Factor (VIF):
car::vif(model_full)

           GVIF Df GVIF^(1/(2*Df))
fare   1.403414  1        1.184658
sex    1.099365  1        1.048506
pclass 1.522478  2        1.110805
Rule of thumb: VIF > 5 suggests problematic multicollinearity (for factors with more than one degree of freedom, compare the GVIF^(1/(2*Df)) column against √5 ≈ 2.24). Our values are low, so this assumption is satisfied.
Logistic regression requires adequate events per predictor. A common rule is at least 10 events (survivors) and 10 non-events (deaths) per predictor variable.
titanic_df |>
count(survived) |>
gt() |>
tab_header(title = md("**Sample Size Check**"))

**Sample Size Check**

| survived | n   |
|----------|-----|
| Died     | 545 |
| Survived | 342 |
With 342 survivors and 545 deaths across 4 predictors, we have more than sufficient sample size.
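Here is a quick sketch of that events-per-variable arithmetic, using the four non-intercept terms in model_full:

```r
# Events (and non-events) per predictor term; rule of thumb is at least 10
n_terms <- length(coef(model_full)) - 1  # 4 terms besides the intercept
titanic_df |>
  count(survived) |>
  mutate(per_term = n / n_terms)
```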
We can check for observations that disproportionately influence the model using Cook’s distance:
tibble(
index = 1:nrow(titanic_df),
cooks_d = cooks.distance(model_full)
) |>
ggplot(aes(x = index, y = cooks_d)) +
geom_point(alpha = 0.5, colour = "#1d3557") +
geom_hline(yintercept = 4 / nrow(titanic_df),
linetype = "dashed",
colour = "#e63946") +
annotate("text",
x = 600, y = 4 / nrow(titanic_df) + 0.01,
label = "Threshold: 4/n",
colour = "#e63946",
hjust = 0) +
labs(
title = "Checking for Influential Observations",
subtitle = "Cook's distance identifies points with high leverage",
x = "Observation Index",
y = "Cook's Distance"
) +
theme_minimal(base_size = 14) +
theme(
plot.title = element_text(face = "bold"),
plot.title.position = "plot"
)

A few points exceed the threshold, but none are extreme. The model is not unduly influenced by individual observations.
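If we wanted to see who those flagged passengers are, a short sketch like this lists the largest Cook’s distances:

```r
# Inspect the five most influential passengers
titanic_df |>
  mutate(cooks_d = cooks.distance(model_full)) |>
  slice_max(cooks_d, n = 5)
```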
Logistic regression revealed that survival on the Titanic was primarily determined by:

- **Sex:** women had far higher odds of survival than men
- **Passenger class:** first-class passengers fared best, third-class worst
- **Fare:** little independent effect once sex and class were accounted for

All key assumptions were checked and reasonably satisfied, giving us confidence in our model’s validity.