Chi-Squared Test in R

Testing Association Between Categorical Variables with Titanic Data

Author

Abdullah Al Shamim

Published

May 31, 2026

1 What is a Chi-Squared Test?

The chi-squared test of independence determines whether there is a statistically significant association between two categorical variables. It compares the frequencies we observe in our data to the frequencies we would expect if the two variables were completely independent of each other.

If the observed counts differ substantially from the expected counts, we have evidence that the variables are associated.
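For reference, the comparison described above is usually written as Pearson's statistic. The symbols here are introduced for illustration only and are not part of the original text: $O_{ij}$ is the observed count in row $i$ and column $j$, $E_{ij}$ the count expected under independence, $R_i$ and $C_j$ the row and column totals, and $N$ the grand total.

```latex
\[
\chi^2 \;=\; \sum_{i=1}^{r}\sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}},
\qquad
E_{ij} \;=\; \frac{R_i \, C_j}{N}
\]
```

Each cell contributes its squared deviation scaled by the expected count, so a cell with a small expected count and a large gap contributes the most.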

2 The Question

We’ll use the famous Titanic dataset to ask: Was survival on the Titanic associated with passenger class?

In other words, did your ticket class (1st, 2nd, or 3rd) affect your chances of surviving the disaster?

3 Setup

library(tidyverse)
library(gt)
library(gtExtras)

1: Load the tidyverse — provides ggplot2, dplyr, tidyr, and forcats used throughout
2: Load gt — for creating polished, publication-ready HTML tables
3: Load gtExtras, which extends gt with additional themes and inline-plot helpers; note that data_color(), used below for conditional cell colouring, is part of gt itself

4 The Data

The Titanic dataset is built into R. It contains counts of passengers and crew aboard the RMS Titanic, cross-classified by class, sex, age, and survival status. We’ll focus on passengers only.

titanic_df <- Titanic |>
  as_tibble() |>
  uncount(n) |>
  filter(Class != "Crew")

glimpse(titanic_df)

1: Titanic is a built-in R 4-dimensional array storing aggregated passenger counts — not individual rows
2: as_tibble() converts the array to a long-format tibble with a column n holding each combination’s count
3: uncount(n) expands each aggregate row into n individual rows, giving exactly one row per passenger
4: Exclude the Crew class — keeping only the three passenger classes (1st, 2nd, 3rd) for the analysis
5: glimpse() prints a compact preview — confirms 1,316 rows and 4 character columns

Rows: 1,316
Columns: 4
$ Class    <chr> "3rd", "3rd", "3rd", "3rd", "3rd", "3rd", "3rd", "3rd", "3rd"…
$ Sex      <chr> "Male", "Male", "Male", "Male", "Male", "Male", "Male", "Male…
$ Age      <chr> "Child", "Child", "Child", "Child", "Child", "Child", "Child"…
$ Survived <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "…
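As a quick sanity check on the row count, the same total can be recovered from the original array without the tidyverse. This is a minimal sketch using base R only; the names class_totals and passenger_total are illustrative and do not appear in the code above.

```r
# Titanic ships with base R (datasets package), so no extra packages are needed.
# Summing over Sex, Age and Survived leaves one head count per Class.
class_totals <- apply(Titanic, 1, sum)

# Keep the three passenger classes and drop the crew.
passenger_total <- sum(class_totals[c("1st", "2nd", "3rd")])

print(class_totals)
print(passenger_total)
```

The passenger total should match the number of rows reported by glimpse().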

5 Visualising the Question

Before running the test, let’s visualise the relationship between class and survival.

titanic_summary <- titanic_df |>
  count(Class, Survived) |>
  group_by(Class) |>
  mutate(
    total = sum(n),
    prop = n / total,
    cumulative = cumsum(prop),
    label_pos = cumulative - (prop / 2)
  ) |>
  ungroup()

titanic_summary |>
  ggplot(aes(x = prop, 
             y = fct_rev(Class),
             fill = fct_rev(Survived))) +
  geom_col(width = 0.6) +
  geom_text(aes(x = label_pos, label = n),
            colour = "white", 
            fontface = "bold",
            size = 5) +
  geom_text(aes(x = 1.08,
                label = ifelse(Survived == "Yes", 
                               paste("Total:", total), 
                               "")),
            hjust = 0, 
            colour = "grey40",
            size = 4) +
  scale_fill_manual(
    values = c("Yes" = "#60a5fa", "No" = "#f87171"),
    labels = c("Survived", "Died"),
    breaks = c("Yes", "No")
  ) +
  scale_x_continuous(
    expand = expansion(mult = c(0, 0.15)),
    labels = NULL
  ) +
  labs(
    title = "Survival on the Titanic by Passenger Class",
    subtitle = "1st class passengers were far more likely to survive",
    x = NULL, 
    y = NULL, 
    fill = NULL
  ) +
  theme_minimal(base_size = 14) +
  theme(
    legend.position = "bottom",
    panel.grid = element_blank(),
    axis.text.y = element_text(face = "bold", size = 12),
    plot.title = element_text(face = "bold"),
    plot.title.position = "plot"
  )

1: Count co-occurrences of Class and Survived — produces 6 rows (3 classes × 2 outcomes) with a raw count n for each combination
2: total = sum(n) computes the total passengers per class within the group_by() context — used for proportion calculations and the right-side total labels
3: prop = n / total converts raw counts to proportions (0 to 1); this is what drives the width of each bar segment
4: cumulative = cumsum(prop) is the running sum of proportions within each class — it marks the right edge of each stacked bar segment
5: label_pos = cumulative - (prop / 2) finds the midpoint of each segment (right edge minus half the width) — the x position where count labels are centred
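6: fct_rev() on both aesthetics reverses factor order: it puts 1st Class at the top of the y-axis and makes “Yes” the first fill level. Because geom_col() stacks the first level farthest from zero, “Died” starts at the left edge and “Survived” completes the bar, which matches the No-then-Yes row order used to compute label_pos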
7: geom_col() draws horizontal stacked bars from the pre-computed proportions; width = 0.6 controls bar thickness leaving visible gaps between rows
8: First geom_text places the raw passenger count at label_pos — the centre of each segment — with white bold text for legibility against both bar colours
9: Second geom_text adds a “Total: N” label to the right of each full bar; the ifelse prints only on the “Yes” segment to avoid duplicate labels per row
10: scale_fill_manual assigns blue (#60a5fa) to survivors and red (#f87171) to deaths, with “Survived” / “Died” as legend labels in the correct display order
11: scale_x_continuous suppresses the 0–1 proportion axis labels and adds 15% right-side padding so the total labels have room beyond the bar edge
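The label-position arithmetic in steps 3 to 5 can be traced by hand for a single class. The sketch below uses the 1st class counts from the data and plain vectors instead of the grouped mutate(); it assumes the segments are stacked in the same order as the rows, with "No" first.

```r
# Counts for 1st class, in the order count() returns them ("No" before "Yes").
n <- c(No = 122, Yes = 203)

prop       <- n / sum(n)            # share of the bar taken by each segment
cumulative <- cumsum(prop)          # right edge of each segment
label_pos  <- cumulative - prop / 2 # midpoint: right edge minus half the width

round(label_pos, 3)
```

The first midpoint sits at half the width of the first segment, and the second sits halfway between the first segment's right edge and 1.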

6 Summary Table

titanic_df |>
  count(Class, Survived) |>
  pivot_wider(names_from = Survived,
              values_from = n) |>
  mutate(
    Total = No + Yes,
    `Survival Rate` = Yes / Total
  ) |>
  rename(Died = No, Survived = Yes) |>
  gt() |>
  fmt_percent(`Survival Rate`, decimals = 1) |>
  data_color(
    columns = `Survival Rate`,
    palette = c("#f87171", "#fbbf24", "#4ade80"),
    domain = c(0.2, 0.7)
  ) |>
  cols_align(align = "center",
             columns = -Class) |>
  tab_header(
    title = md("**Titanic Survival by Class**"),
    subtitle = "A clear gradient from 1st Class to 3rd Class"
  ) |>
  tab_style(
    style = cell_text(weight = "bold"),
    locations = cells_column_labels()
  ) |>
  tab_options(
    table.font.size = px(14),
    heading.align = "left"
  )

1: Count co-occurrences of Class and Survived — produces 6 rows (3 classes × 2 outcomes) as the starting point for the table
2: pivot_wider reshapes from long to wide format: one row per class, with separate No and Yes columns holding the raw counts
3: mutate computes Total passengers per class and derives Survival Rate as the fraction who survived — both will become table columns
4: rename swaps No → Died and Yes → Survived for clean, reader-friendly column headers before passing to gt()
5: gt() converts the tibble into a gt table object — all subsequent |> calls are gt formatting verbs layered on top
6: fmt_percent() formats the Survival Rate column as a percentage with one decimal place (e.g. 62.5%)
7: data_color() applies a red → amber → green gradient to the Survival Rate cells; domain = c(0.2, 0.7) fixes the colour scale to a range slightly wider than the observed rates (25.2% to 62.5%)
8: cols_align() centre-aligns all numeric columns; the -Class selector excludes Class, leaving it left-aligned by default
9: tab_header() adds a bold md() title and a plain subtitle above the table; the heading is aligned left via tab_options() below
10: tab_style() bolds all column label text using cell_text(weight = "bold") targeted precisely at cells_column_labels()
11: tab_options() sets a 14px base font size and left-aligns the heading block to match the bar chart’s title alignment

Titanic Survival by Class
A clear gradient from 1st Class to 3rd Class
Class	Died	Survived	Total	Survival Rate
1st	122	203	325	62.5%
2nd	167	118	285	41.4%
3rd	528	178	706	25.2%

The visualisation and table both reveal a clear pattern: the survival rate falls from 62.5% in 1st class to 41.4% in 2nd class and 25.2% in 3rd class.
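The survival rates in the table can be reproduced from the raw counts with base R. This is a small sketch, not the method used above: the matrix survival is typed in by hand from the table, and prop.table() with margin = 1 divides each row by its own total.

```r
# Observed counts copied from the summary table (rows: class, columns: outcome).
survival <- matrix(
  c(122, 203,
    167, 118,
    528, 178),
  nrow = 3, byrow = TRUE,
  dimnames = list(Class = c("1st", "2nd", "3rd"),
                  Outcome = c("Died", "Survived"))
)

# Row-wise proportions: each row sums to 1.
row_props <- prop.table(survival, margin = 1)

round(100 * row_props, 1)
```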

7 The Chi-Squared Test

titanic_df |>
  select(Class, Survived) |>
  table() |>
  chisq.test()

1: Isolate the two variables of interest — Class (3 levels) and Survived (2 levels); all other columns are dropped before testing
2: table() cross-tabulates the two variables into a 3 × 2 contingency matrix of observed counts — this is the direct input to the chi-squared test
3: chisq.test() runs Pearson’s chi-squared test of independence on the contingency matrix, using the asymptotic chi-squared distribution; Yates’ continuity correction (correct = TRUE by default) is applied only to 2 × 2 tables, so none is used for this 3 × 2 table


    Pearson's Chi-squared test

data:  table(select(titanic_df, Class, Survived))
X-squared = 133.05, df = 2, p-value < 2.2e-16
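To see where the statistic comes from, it can be rebuilt by hand from the observed counts. The sketch below is illustrative: the matrix observed is typed in from the summary table, and the names expected, statistic and dof are introduced here rather than taken from the code above.

```r
# Observed counts from the summary table (rows: class, columns: outcome).
observed <- matrix(
  c(122, 203,
    167, 118,
    528, 178),
  nrow = 3, byrow = TRUE,
  dimnames = list(Class = c("1st", "2nd", "3rd"),
                  Outcome = c("Died", "Survived"))
)

# Expected count under independence: row total * column total / grand total.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson's statistic: sum of squared deviations scaled by the expected counts.
statistic <- sum((observed - expected)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1).
dof <- (nrow(observed) - 1) * (ncol(observed) - 1)

# The built-in test on the same matrix should agree with the manual value.
result <- chisq.test(observed)

round(expected, 1)
round(statistic, 2)
dof
```

The largest gaps between observed and expected counts are in the 1st and 3rd class rows, which is where most of the statistic comes from.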

8 Interpreting the Results

The output tells us three key things:

X-squared: The test statistic. Larger values indicate greater deviation from what we’d expect if the variables were independent.
df (degrees of freedom): Calculated as (rows - 1) × (columns - 1) = (3 - 1) × (2 - 1) = 2.
p-value: Extremely small (< 2.2e-16). This is far below the conventional threshold of 0.05.
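R prints "< 2.2e-16" because its print method reports any p-value below machine epsilon that way, not because the value is exactly that size. A minimal sketch of the underlying numbers, using the rounded statistic from the output above (so the p-value is approximate):

```r
x2  <- 133.05  # test statistic, rounded as printed above
dof <- 2       # (3 - 1) * (2 - 1)

# Upper-tail probability of a chi-squared distribution with 2 degrees of freedom.
p_value <- pchisq(x2, df = dof, lower.tail = FALSE)

# Critical value at the 0.05 level: the statistic needed to reject independence.
critical <- qchisq(0.95, df = dof)

p_value
critical
.Machine$double.eps  # the 2.2e-16 floor used when printing p-values
```

The statistic exceeds the 0.05 critical value of about 5.99 by a wide margin, and the p-value is on the order of 1e-29.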

Important

Conclusion: We reject the null hypothesis of independence. There is a statistically significant association between passenger class and survival on the Titanic. The test shows an association rather than proving a cause, but the pattern is stark: passengers in lower classes were far less likely to survive, with the survival rate falling from 62.5% in 1st class to 25.2% in 3rd class.