Logistic Regression

0.1 What is Logistic Regression?

Logistic regression predicts the probability of a binary outcome (yes/no, survived/died, 0/1) from one or more predictor variables. Unlike linear regression which predicts a continuous value, logistic regression predicts a probability between 0 and 1.

The model uses the logistic function to ensure predictions stay within this range, producing the characteristic S-shaped curve.

0.2 The Question

We’ll use Titanic passenger data to ask: Can we predict who survived based on their characteristics?

We’ll start with a single predictor (fare paid), then build up to a model with multiple predictors.

0.3 Setup

Show the code
library(tidyverse)
library(gt)
library(scales)
library(broom)

0.4 The Data

We’ll download the famous Titanic dataset, which contains detailed passenger information including age, sex, class, fare, and survival status.

Show the code
titanic_df <- read_csv("https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv") |>
  rename_with(tolower) |>
  select(survived, pclass, sex, age, fare) |>   
  mutate(
    survived = factor(survived,
                      levels = c(0, 1),
                      labels = c("Died", "Survived")),
    pclass = factor(pclass,
                    levels = c(1, 2, 3),
                    labels = c("1st", "2nd", "3rd")),
    sex = factor(sex)
  ) |>
  drop_na()

glimpse(titanic_df)
1
Download Titanic data from Stanford’s CS109 course
2
Convert column names to lowercase
3
Convert survival to a labelled factor
4
Convert class to an ordered factor with clear labels
5
Remove rows with missing values
Rows: 887
Columns: 5
$ survived <fct> Died, Survived, Survived, Survived, Died, Died, Died, Died, S…
$ pclass   <fct> 3rd, 1st, 3rd, 1st, 3rd, 3rd, 1st, 3rd, 3rd, 2nd, 3rd, 1st, 3…
$ sex      <fct> male, female, female, female, male, male, male, male, female,…
$ age      <dbl> 22, 38, 26, 35, 35, 27, 54, 2, 27, 14, 4, 58, 20, 39, 14, 55,…
$ fare     <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625, 21…

0.5 Visualising the Question

Before modelling, let’s explore how fare relates to survival.

Show the code
ggplot(titanic_df, 
       aes(x = fare, 
           y = as.numeric(survived == "Survived"))) +
  

  # Individual points (jittered)
  geom_jitter(aes(colour = survived),
              height = 0.05,
              alpha = 0.4,
              size = 2) +
  
  # Smoothed probability curve
  geom_smooth(method = "glm",
              method.args = list(family = "binomial"),
              colour = "#1d3557",
              fill = "#1d3557",
              alpha = 0.2,
              linewidth = 1.2) +
  
  # Annotation: the S-curve
  annotate("label",
           x = 400, y = 0.5,
           label = "The logistic curve:\nprobability stays\nbetween 0 and 1",
           fill = "#f1faee",
           colour = "#1d3557",
           size = 3.5,
           label.padding = unit(0.5, "lines")) +
  
  annotate("curve",
           x = 350, y = 0.5,
           xend = 200, yend = 0.7,
           curvature = 0.3,
           arrow = arrow(length = unit(2, "mm")),
           colour = "#1d3557") +
  
  # Annotation: low fare region
  annotate("segment",
           x = 0, xend = 30,
           y = -0.12, yend = -0.12,
           colour = "#e63946",
           linewidth = 2) +
  annotate("text",
           x = 15, y = -0.18,
           label = "Low fare = Low survival",
           colour = "#e63946",
           fontface = "bold",
           size = 3.5) +
  
  scale_colour_manual(values = c("Died" = "#e63946", 
                                  "Survived" = "#2a9d8f")) +
  scale_x_continuous(labels = dollar_format(),
                     limits = c(0, 520)) +
  scale_y_continuous(breaks = c(0, 0.5, 1),
                     labels = c("0\n(Died)", "0.5", "1\n(Survived)")) +
  coord_cartesian(clip = "off",
                  ylim = c(-0.1, 1.1)) +
  labs(
    title = "Did Money Buy Survival on the Titanic?",
    subtitle = "Higher fares were associated with higher survival probability",
    x = "Fare Paid",
    y = "Probability of Survival",
    colour = NULL
  ) +
  theme_minimal(base_size = 14) +
  theme(
    plot.title = element_text(face = "bold", size = 18),
    plot.subtitle = element_text(colour = "grey40"),
    plot.title.position = "plot",
    legend.position = "bottom",
    panel.grid.minor = element_blank(),
    plot.margin = margin(10, 10, 30, 10)
  )
1
Jittered points show individual passengers (0 = died, 1 = survived)
2
Logistic regression curve using glm with binomial family
3
Annotation explaining the S-curve shape
4
Highlight the low-fare danger zone

The S-shaped curve shows that survival probability increases with fare, but the relationship isn’t linear—it follows the logistic function.


1 Part 1: Simple Logistic Regression

Let’s start with a single predictor: Fare.

1.1 Fitting the Model

Show the code
model_simple <- glm(survived ~ fare,
                    data = titanic_df,
                    family = binomial)

summary(model_simple)
1
Formula: Survival predicted by Fare
2
family = binomial specifies logistic regression

Call:
glm(formula = survived ~ fare, family = binomial, data = titanic_df)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -0.931730   0.095215  -9.786  < 2e-16 ***
fare         0.015084   0.002228   6.771 1.28e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1182.8  on 886  degrees of freedom
Residual deviance: 1114.6  on 885  degrees of freedom
AIC: 1118.6

Number of Fisher Scoring iterations: 4

1.2 Interpreting the Coefficients

The raw coefficients are in log-odds, which aren’t intuitive. Let’s convert them to odds ratios:

Show the code
tidy(model_simple, conf.int = TRUE) |>
  mutate(
    odds_ratio = exp(estimate),
    or_lower = exp(conf.low),
    or_upper = exp(conf.high)
  ) |>
  select(term, estimate, odds_ratio, 
         or_lower, or_upper, p.value) |>
  gt() |>
  fmt_number(columns = c(estimate, odds_ratio, 
                         or_lower, or_upper),
             decimals = 3) |>
  fmt_scientific(columns = p.value, decimals = 2) |>
  cols_label(
    term = "Term",
    estimate = "Log-Odds",
    odds_ratio = "Odds Ratio",
    or_lower = "OR (Lower)",
    or_upper = "OR (Upper)",
    p.value = "p-value"
  ) |>
  tab_header(
    title = md("**Simple Logistic Regression: Fare**")
  )
1
tidy() from broom extracts coefficients with confidence intervals
2
Exponentiate to convert log-odds to odds ratios

Simple Logistic Regression: Fare

Term Log-Odds Odds Ratio OR (Lower) OR (Upper) p-value
(Intercept) −0.932 0.394 0.326 0.473 1.30 × 10−22
fare 0.015 1.015 1.011 1.020 1.28 × 10−11

Interpretation: The odds ratio for Fare is approximately 1.015. This means for each additional dollar in fare, the odds of survival increase by about 1.5%. For a $10 increase, the odds multiply by \(1.015^{10} \approx 1.16\), or a 16% increase.


2 Part 2: Multiple Logistic Regression

Now let’s add more predictors: Sex and Passenger Class.

2.1 Fitting the Full Model

Show the code
model_full <- glm(survived ~ fare + sex + pclass,
                  data = titanic_df,
                  family = binomial)

summary(model_full)
1
Add Sex and Pclass as additional predictors

Call:
glm(formula = survived ~ fare + sex + pclass, family = binomial, 
    data = titanic_df)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)  2.142035   0.278170   7.700 1.36e-14 ***
fare         0.001815   0.002101   0.864  0.38769    
sexmale     -2.617672   0.185211 -14.133  < 2e-16 ***
pclass2nd   -0.735407   0.270871  -2.715  0.00663 ** 
pclass3rd   -1.782118   0.251735  -7.079 1.45e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1182.77  on 886  degrees of freedom
Residual deviance:  825.31  on 882  degrees of freedom
AIC: 835.31

Number of Fisher Scoring iterations: 4

2.2 Visualising the Coefficients

Show the code
model_results <- tidy(model_full, conf.int = TRUE) |>
  filter(term != "(Intercept)") |>
  mutate(
    odds_ratio = exp(estimate),
    or_lower = exp(conf.low),
    or_upper = exp(conf.high),
    term = case_when(
      term == "fare" ~ "Fare (per $1)",
      term == "sexmale" ~ "Sex: Male",
      term == "pclass2nd" ~ "Class: 2nd",
      term == "pclass3rd" ~ "Class: 3rd"
    ),
    term = fct_reorder(term, odds_ratio),
    significant = p.value < 0.05
  )

ggplot(model_results,
       aes(x = odds_ratio, y = term)) +
  
  # Reference line at OR = 1
  geom_vline(xintercept = 1,
             linetype = "dashed",
             colour = "grey50",
             linewidth = 0.8) +
  
  # Confidence intervals
  geom_errorbarh(aes(xmin = or_lower,
                     xmax = or_upper),
                 height = 0.2,
                 linewidth = 0.8,
                 colour = "grey30") +
  
  # Point estimates
  geom_point(aes(fill = odds_ratio < 1),
             size = 5,
             shape = 21,
             colour = "black",
             stroke = 1) +
  
  # Labels on points
  geom_text(aes(label = sprintf("%.2f", odds_ratio)),
            vjust = -1.2,
            fontface = "bold",
            size = 3.5) +
  
  # Annotations
  annotate("text",
           x = 0.3, y = 4.3,
           label = "← Lower survival odds",
           hjust = 0,
           colour = "#e63946",
           fontface = "italic",
           size = 3.5) +
  annotate("text",
           x = 1.5, y = 4.3,
           label = "Higher survival odds →",
           hjust = 0,
           colour = "#2a9d8f",
           fontface = "italic",
           size = 3.5) +
  
  scale_fill_manual(values = c("TRUE" = "#e63946",
                                "FALSE" = "#2a9d8f"),
                    guide = "none") +
  scale_x_log10(breaks = c(0.1, 0.2, 0.5, 1, 2)) +
  coord_cartesian(clip = "off",
                  xlim = c(0.08, 2.5)) +
  labs(
    title = "Who Survived? Odds Ratios from Multiple Logistic Regression",
    subtitle = "Adjusted for all other variables in the model",
    x = "Odds Ratio (log scale)",
    y = NULL
  ) +
  theme_minimal(base_size = 14) +
  theme(
    plot.title = element_text(face = "bold", size = 16),
    plot.subtitle = element_text(colour = "grey40"),
    plot.title.position = "plot",
    panel.grid.minor = element_blank(),
    panel.grid.major.y = element_blank(),
    axis.text.y = element_text(face = "bold", size = 11),
    plot.margin = margin(20, 10, 10, 10)
  )
1
Remove intercept (not meaningful as an odds ratio)
2
Create readable labels for each term
3
Flag statistically significant predictors
4
Reference line at OR = 1 (no effect)
5
Horizontal error bars show 95% confidence intervals
6
Colour points by direction of effect (red = harmful, green = protective)
7
Add odds ratio values above each point
8
Add directional annotations
9
Log scale for x-axis (standard for odds ratios)

2.3 Results Table

Show the code
tidy(model_full, conf.int = TRUE) |>
  mutate(
    odds_ratio = exp(estimate),
    or_ci = paste0(round(exp(conf.low), 2), " – ", 
                   round(exp(conf.high), 2)),
    term = case_when(
      term == "(Intercept)" ~ "Intercept",
      term == "fare" ~ "Fare (per $1)",
      term == "sexmale" ~ "Sex: Male (vs Female)",
      term == "pclass2nd" ~ "Class: 2nd (vs 1st)",
      term == "pclass3rd" ~ "Class: 3rd (vs 1st)"
    )
  ) |>
  select(term, odds_ratio, or_ci, p.value) |>
  gt() |>
  fmt_number(columns = odds_ratio, decimals = 2) |>
  fmt_scientific(columns = p.value, decimals = 2) |>
  cols_label(
    term = "Predictor",
    odds_ratio = "Odds Ratio",
    or_ci = "95% CI",
    p.value = "p-value"
  ) |>
  data_color(
    columns = odds_ratio,
    rows = odds_ratio < 1,
    palette = "#e63946"
  ) |>
  data_color(
    columns = odds_ratio,
    rows = odds_ratio > 1,
    palette = "#2a9d8f"
  ) |>
  tab_header(
    title = md("**Multiple Logistic Regression Results**"),
    subtitle = "Predicting survival on the Titanic"
  ) |>
  tab_style(
    style = cell_text(weight = "bold"),
    locations = cells_column_labels()
  )

Multiple Logistic Regression Results

Predicting survival on the Titanic
Predictor Odds Ratio 95% CI p-value
Intercept 8.52 4.95 – 14.77 1.36 × 10−14
Fare (per $1) 1.00 1 – 1.01 3.88 × 10−1
Sex: Male (vs Female) 0.07 0.05 – 0.1 2.36 × 10−45
Class: 2nd (vs 1st) 0.48 0.28 – 0.81 6.63 × 10−3
Class: 3rd (vs 1st) 0.17 0.1 – 0.28 1.45 × 10−12

2.4 Interpreting Multiple Predictors

Each coefficient is now adjusted for the other variables. Here’s what the results tell us:

  • Sex: Male: Men had dramatically lower odds of survival compared to women, holding fare and class constant. This is the strongest predictor—“women and children first” was clearly applied.

  • Class: 3rd: Third-class passengers had much lower odds of survival compared to first-class, even after accounting for fare and sex.

  • Class: 2nd: Second-class passengers also had lower odds of survival compared to first-class.

  • Fare: After controlling for sex and class, fare itself has little additional predictive power. The effect of money was largely through class assignment.


3 Part 3: Checking Assumptions

Logistic regression has several assumptions that should be verified.

3.1 1. Binary Outcome ✓

Our outcome (survived) is binary (Died/Survived). This assumption is satisfied by design.

3.2 2. Independence of Observations

Each observation should be independent. This is satisfied—each row represents a different passenger.

3.3 3. Linearity of Log-Odds (for Continuous Predictors)

For continuous predictors like Fare, the relationship between the predictor and the log-odds of the outcome should be linear. We can check this visually:

Show the code
# Create bins and calculate empirical log-odds
linearity_check <- titanic_df |>
  mutate(fare_bin = cut_number(fare, n = 8)) |>
  group_by(fare_bin) |>
  summarise(
    fare_mid = median(fare),
    n = n(),
    prop_survived = mean(survived == "Survived"),
    log_odds = log(prop_survived / (1 - prop_survived)),
    .groups = "drop"
  ) |>
  filter(is.finite(log_odds))

ggplot(linearity_check,
       aes(x = fare_mid, y = log_odds)) +
  geom_point(size = 4, colour = "#1d3557") +
  geom_smooth(method = "lm",
              colour = "#e63946",
              fill = "#e63946",
              alpha = 0.2) +
  annotate("label",
           x = 350, y = -0.5,
           label = "Roughly linear:\nassumption satisfied",
           fill = "#d8f3dc",
           colour = "#2d6a4f",
           fontface = "bold",
           size = 4) +
  labs(
    title = "Checking Linearity of Log-Odds",
    subtitle = "Binned fare values vs empirical log-odds of survival",
    x = "Fare (bin midpoint)",
    y = "Log-Odds of Survival"
  ) +
  theme_minimal(base_size = 14) +
  theme(
    plot.title = element_text(face = "bold"),
    plot.title.position = "plot"
  )
1
Divide fare into 8 equal-sized bins
2
Calculate empirical log-odds for each bin
3
Remove bins where proportion is 0 or 1 (infinite log-odds)
4
Fit a linear trend to assess linearity

The relationship appears roughly linear, so the assumption is reasonably satisfied.

3.4 4. No Multicollinearity (for Multiple Predictors)

Predictors shouldn’t be too highly correlated. We can check the Variance Inflation Factor (VIF):

Show the code
car::vif(model_full)
1
VIF values below 5 (or 10) indicate acceptable multicollinearity
           GVIF Df GVIF^(1/(2*Df))
fare   1.403414  1        1.184658
sex    1.099365  1        1.048506
pclass 1.522478  2        1.110805

Rule of thumb: VIF > 5 suggests problematic multicollinearity. Our values are low, so this assumption is satisfied.

3.5 5. Sufficient Sample Size

Logistic regression requires adequate events per predictor. A common rule is at least 10 events (survivors) and 10 non-events (deaths) per predictor variable.

Show the code
titanic_df |>
  count(survived) |>
  gt() |>
  tab_header(title = md("**Sample Size Check**"))

Sample Size Check

survived n
Died 545
Survived 342

With 342 survivors and 545 deaths across 4 predictors, we have more than sufficient sample size.

3.6 6. Influential Observations

We can check for observations that disproportionately influence the model using Cook’s distance:

Show the code
tibble(
  index = 1:nrow(titanic_df),
  cooks_d = cooks.distance(model_full)
) |>
  ggplot(aes(x = index, y = cooks_d)) +
  geom_point(alpha = 0.5, colour = "#1d3557") +
  geom_hline(yintercept = 4 / nrow(titanic_df),
             linetype = "dashed",
             colour = "#e63946") +
  annotate("text",
           x = 600, y = 4 / nrow(titanic_df) + 0.01,
           label = "Threshold: 4/n",
           colour = "#e63946",
           hjust = 0) +
  labs(
    title = "Checking for Influential Observations",
    subtitle = "Cook's distance identifies points with high leverage",
    x = "Observation Index",
    y = "Cook's Distance"
  ) +
  theme_minimal(base_size = 14) +
  theme(
    plot.title = element_text(face = "bold"),
    plot.title.position = "plot"
  )
1
Calculate Cook’s distance for each observation
2
Common threshold is 4/n

A few points exceed the threshold, but none are extreme. The model is not unduly influenced by individual observations.


3.7 Summary

Logistic regression revealed that survival on the Titanic was primarily determined by:

  1. Sex – Women had dramatically higher survival odds
  2. Passenger Class – Higher class meant higher survival probability
  3. Fare – Mattered less once class was accounted for

All key assumptions were checked and satisfied, giving us confidence in our model’s validity.