Chi-Squared Test in R

Testing Association Between Categorical Variables with Titanic Data

Author

Abdullah Al Shamim

Published

May 31, 2026

1 What is a Chi-Squared Test?

The chi-squared test of independence determines whether there is a statistically significant association between two categorical variables. It compares the frequencies we observe in our data to the frequencies we would expect if the two variables were completely independent of each other.

If the observed counts differ substantially from the expected counts, we have evidence that the variables are associated.
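
In symbols, this comparison is usually written as Pearson's statistic. The notation below is an assumption introduced here for illustration: $O_{ij}$ is the observed count in row $i$ and column $j$, $R_i$ and $C_j$ are the row and column totals, and $N$ is the total number of observations.

```latex
\[
E_{ij} = \frac{R_i \, C_j}{N},
\qquad
\chi^2 = \sum_{i} \sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}
\]
```

Each cell contributes its squared deviation from the expected count, scaled by that expected count, so large cells do not dominate simply because they are large.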


2 The Question

We’ll use the famous Titanic dataset to ask: Was survival on the Titanic associated with passenger class?

In other words, were your chances of surviving the disaster related to your ticket class (1st, 2nd, or 3rd)?


3 Setup

library(tidyverse)
library(gt)
library(gtExtras)

1. Load the tidyverse, which provides ggplot2, dplyr, tidyr, and forcats, all used throughout
2. Load gt for creating polished, publication-ready HTML tables
3. Load gtExtras, which extends gt with extra themes and helpers; note that data_color(), used below for conditional cell colouring, is exported by gt itself

4 The Data

The Titanic dataset is built into R. It contains counts of passengers and crew aboard the RMS Titanic, cross-classified by class, sex, age, and survival status. We’ll focus on passengers only.

titanic_df <- Titanic |>
  as_tibble() |>
  uncount(n) |>
  filter(Class != "Crew")

glimpse(titanic_df)

1. Titanic is a built-in 4-dimensional array in R that stores aggregated counts, not individual rows
2. as_tibble() converts the array to a long-format tibble with a column n holding each combination’s count
3. uncount(n) expands each aggregate row into n individual rows, giving exactly one row per person
4. filter() excludes the Crew class, keeping only the three passenger classes (1st, 2nd, 3rd) for the analysis
5. glimpse() prints a compact preview, confirming 1,316 rows and 4 character columns

Rows: 1,316
Columns: 4
$ Class    <chr> "3rd", "3rd", "3rd", "3rd", "3rd", "3rd", "3rd", "3rd", "3rd"…
$ Sex      <chr> "Male", "Male", "Male", "Male", "Male", "Male", "Male", "Male…
$ Age      <chr> "Child", "Child", "Child", "Child", "Child", "Child", "Child"…
$ Survived <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "…
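
As a cross-check that does not depend on the tidyverse, the same passenger total can be recovered directly from the built-in array. This is a minimal base R sketch; the names `passengers`, `total_passengers`, and `per_class` are illustrative and not part of the original analysis.

```r
# Keep the three passenger classes; the result is still a 4-D array
# (Class x Sex x Age x Survived) of aggregated counts.
passengers <- Titanic[c("1st", "2nd", "3rd"), , , ]

# Total number of passengers: should equal nrow(titanic_df).
total_passengers <- sum(passengers)

# Passengers per class, summing over Sex, Age, and Survived.
per_class <- apply(passengers, 1, sum)

total_passengers
per_class
```

If the total here disagrees with the row count reported by glimpse(), the most likely cause is a filtering step that kept or dropped the wrong class.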

5 Visualising the Question

Before running the test, let’s visualise the relationship between class and survival.

titanic_summary <- titanic_df |>
  count(Class, Survived) |>
  group_by(Class) |>
  mutate(
    total = sum(n),
    prop = n / total,
    cumulative = cumsum(prop),
    label_pos = cumulative - (prop / 2)
  ) |>
  ungroup()

titanic_summary |>
  ggplot(aes(x = prop, 
             y = fct_rev(Class),
             fill = fct_rev(Survived))) +
  geom_col(width = 0.6) +
  geom_text(aes(x = label_pos, label = n),
            colour = "white", 
            fontface = "bold",
            size = 5) +
  geom_text(aes(x = 1.08,
                label = ifelse(Survived == "Yes", 
                               paste("Total:", total), 
                               "")),
            hjust = 0, 
            colour = "grey40",
            size = 4) +
  scale_fill_manual(
    values = c("Yes" = "#60a5fa", "No" = "#f87171"),
    labels = c("Survived", "Died"),
    breaks = c("Yes", "No")
  ) +
  scale_x_continuous(
    expand = expansion(mult = c(0, 0.15)),
    labels = NULL
  ) +
  labs(
    title = "Survival on the Titanic by Passenger Class",
    subtitle = "1st class passengers were far more likely to survive",
    x = NULL, 
    y = NULL, 
    fill = NULL
  ) +
  theme_minimal(base_size = 14) +
  theme(
    legend.position = "bottom",
    panel.grid = element_blank(),
    axis.text.y = element_text(face = "bold", size = 12),
    plot.title = element_text(face = "bold"),
    plot.title.position = "plot"
  )

1. count() tallies co-occurrences of Class and Survived, producing 6 rows (3 classes × 2 outcomes) with a raw count n for each combination
2. total = sum(n) computes the total passengers per class within the group_by() context; it is used for the proportions and for the right-side total labels
3. prop = n / total converts raw counts to proportions (0 to 1); this is what drives the width of each bar segment
4. cumulative = cumsum(prop) is the running sum of proportions within each class; it marks the right edge of each stacked bar segment
5. label_pos = cumulative - (prop / 2) finds the midpoint of each segment (right edge minus half the width), the x position where count labels are centred
6. fct_rev() on both aesthetics reverses factor order: on the y-axis it puts 1st class at the top, and on the fill it makes “No” the last level, so the “Died” segment starts at zero and “Survived” stacks to its right, matching the label positions computed from cumsum()
7. geom_col() draws horizontal stacked bars from the pre-computed proportions; width = 0.6 controls bar thickness, leaving visible gaps between rows
8. The first geom_text() places the raw passenger count at label_pos, the centre of each segment, in white bold text for legibility against both bar colours
9. The second geom_text() adds a “Total: N” label to the right of each full bar; the ifelse() prints only on the “Yes” row to avoid duplicate labels per bar
10. scale_fill_manual() assigns blue (#60a5fa) to survivors and red (#f87171) to deaths, with “Survived” / “Died” as legend labels in that display order
11. scale_x_continuous() suppresses the 0–1 proportion axis labels and adds 15% right-side padding so the total labels have room beyond the bar edge
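
The cumulative and label_pos arithmetic is easy to verify by hand for a single bar. The sketch below uses the 1st class counts and plain vectors rather than the grouped tibble; the variable names mirror the columns in the pipeline but the standalone vectors are an illustration only.

```r
# Counts for 1st class, in the order count() returns them ("No", then "Yes").
n <- c(No = 122, Yes = 203)

prop       <- n / sum(n)             # segment widths; they sum to 1
cumulative <- cumsum(prop)           # right edge of each segment
label_pos  <- cumulative - prop / 2  # midpoint of each segment

round(rbind(prop, cumulative, label_pos), 3)
```

For this class the “No” label lands at about 0.19 and the “Yes” label at about 0.69, the centres of a segment spanning 0 to 0.375 and a segment spanning 0.375 to 1.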


6 Summary Table

titanic_df |>
  count(Class, Survived) |>
  pivot_wider(names_from = Survived,
              values_from = n) |>
  mutate(
    Total = No + Yes,
    `Survival Rate` = Yes / Total
  ) |>
  rename(Died = No, Survived = Yes) |>
  gt() |>
  fmt_percent(`Survival Rate`, decimals = 1) |>
  data_color(
    columns = `Survival Rate`,
    palette = c("#f87171", "#fbbf24", "#4ade80"),
    domain = c(0.2, 0.7)
  ) |>
  cols_align(align = "center",
             columns = -Class) |>
  tab_header(
    title = md("**Titanic Survival by Class**"),
    subtitle = "A clear gradient from 1st Class to 3rd Class"
  ) |>
  tab_style(
    style = cell_text(weight = "bold"),
    locations = cells_column_labels()
  ) |>
  tab_options(
    table.font.size = px(14),
    heading.align = "left"
  )

1. count() tallies co-occurrences of Class and Survived, producing 6 rows (3 classes × 2 outcomes) as the starting point for the table
2. pivot_wider() reshapes from long to wide format: one row per class, with separate No and Yes columns holding the raw counts
3. mutate() computes Total passengers per class and derives Survival Rate as the fraction who survived; both become table columns
4. rename() changes No to Died and Yes to Survived for clean, reader-friendly column headers before passing to gt()
5. gt() converts the tibble into a gt table object; all subsequent |> calls are gt formatting verbs layered on top
6. fmt_percent() formats the Survival Rate column as a percentage with one decimal place (e.g. 62.5%)
7. data_color() applies a red → amber → green gradient to the Survival Rate cells; domain = c(0.2, 0.7) fixes the colour scale to a range that brackets the observed rates (25.2% to 62.5%)
8. cols_align() centre-aligns all numeric columns; the -Class selector excludes Class, leaving it left-aligned by default
9. tab_header() adds a bold md() title and a plain subtitle above the table; the heading is aligned left via tab_options() below
10. tab_style() bolds all column label text using cell_text(weight = "bold"), targeted at cells_column_labels()
11. tab_options() sets a 14px base font size and left-aligns the heading block to match the bar chart’s title alignment

Titanic Survival by Class
A clear gradient from 1st Class to 3rd Class

Class   Died   Survived   Total   Survival Rate
1st      122        203     325           62.5%
2nd      167        118     285           41.4%
3rd      528        178     706           25.2%

The visualisation and table both reveal a clear pattern: survival rates decreased dramatically as you moved down the class hierarchy.


7 The Chi-Squared Test

titanic_df |>
  select(Class, Survived) |>
  table() |>
  chisq.test()

1. select() isolates the two variables of interest, Class (3 levels) and Survived (2 levels); all other columns are dropped before testing
2. table() cross-tabulates the two variables into a 3 × 2 contingency matrix of observed counts; this is the direct input to the chi-squared test
3. chisq.test() runs Pearson’s chi-squared test of independence on the contingency matrix using the asymptotic chi-squared distribution; Yates’ continuity correction (correct = TRUE by default) is applied only to 2 × 2 tables, so none is used for this 3 × 2 table

    Pearson's Chi-squared test

data:  table(select(titanic_df, Class, Survived))
X-squared = 133.05, df = 2, p-value < 2.2e-16
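
The object returned by chisq.test() holds more than the printed summary, including the expected counts and residuals. The sketch below rebuilds the same 3 × 2 table from the built-in array so that it runs on its own; the names `observed` and `result` are illustrative.

```r
# Class x Survived counts from the built-in array, passengers only.
observed <- margin.table(Titanic, margin = c(1, 4))[c("1st", "2nd", "3rd"), ]

result <- chisq.test(observed)

round(result$expected, 1)  # counts expected if class and survival were independent
all(result$expected >= 5)  # common rule of thumb for the chi-squared approximation
round(result$stdres, 2)    # standardised residuals: which cells deviate most
```

Positive standardised residuals mark cells with more passengers than independence would predict, and negative ones mark cells with fewer, which helps locate where the association comes from.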

8 Interpreting the Results

The output tells us three key things:

  • X-squared: The test statistic. Larger values indicate greater deviation from what we’d expect if the variables were independent.

  • df (degrees of freedom): Calculated as (rows - 1) × (columns - 1) = (3 - 1) × (2 - 1) = 2.

  • p-value: Extremely small (< 2.2e-16). This is far below the conventional threshold of 0.05.
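
These three quantities can be reproduced by hand from the counts in the summary table. This is a minimal sketch using base R only; the object names are illustrative, and the matrix is typed in from the table rather than derived from titanic_df.

```r
# Observed counts (rows: class, columns: outcome), copied from the summary table.
observed <- matrix(
  c(122, 203,
    167, 118,
    528, 178),
  nrow = 3, byrow = TRUE,
  dimnames = list(Class = c("1st", "2nd", "3rd"),
                  Survived = c("No", "Yes"))
)

n_total  <- sum(observed)

# Expected count per cell under independence: row total * column total / N.
expected <- outer(rowSums(observed), colSums(observed)) / n_total

x_squared <- sum((observed - expected)^2 / expected)
df        <- (nrow(observed) - 1) * (ncol(observed) - 1)
p_value   <- pchisq(x_squared, df = df, lower.tail = FALSE)

c(x_squared = x_squared, df = df, p_value = p_value)
```

The statistic and degrees of freedom agree with the chisq.test() output. The computed p-value is on the order of 1e-29; R prints it as < 2.2e-16 because that is the floor it uses when displaying very small p-values.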

Important

Conclusion: We reject the null hypothesis of independence. There is a statistically significant association between passenger class and survival on the Titanic. The test establishes association rather than causation, and it does not indicate direction by itself; the survival rates in the summary table show the pattern. Passengers in lower classes were markedly less likely to survive (62.5% in 1st class versus 25.2% in 3rd).