Chi-Squared Test of Independence

Analyzing Social Stratification and Survival Aboard the Titanic

Author

Abdullah Al Shamim

Published

May 25, 2026

1. What is a Chi-Squared Test?

The Chi-Squared Test of Independence determines whether there is a statistically significant association between two categorical variables. It compares the frequencies we observe in our actual data to the frequencies we would expect to see if the two variables were completely independent of each other.

If the observed counts differ substantially from the expected counts, we have evidence that the variables are associated.

The Core Intuition

Independent: Knowing the value of Variable A tells you nothing about Variable B.
Dependent: Knowing the value of Variable A helps you predict Variable B.
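
The comparison between observed and expected counts can be written out as a short sketch. The symbols are notation introduced here for illustration and do not come from the original text: \(O_{ij}\) is the observed count in row \(i\) and column \(j\), \(E_{ij}\) is the expected count under independence, \(R_i\) and \(C_j\) are the row and column totals, and \(N\) is the grand total.

\[E_{ij} = \frac{R_i \times C_j}{N}\]

\[X^2 = \sum_{i}\sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}\]

Each cell contributes its squared gap between observed and expected, scaled by the expected count, so \(X^2\) is zero when the data match independence exactly and grows as the gaps widen.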

2. The Question

We’ll use the historic Titanic dataset to investigate a crucial sociological question:

Was survival on the Titanic associated with passenger class?

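In other words, was your ticket status (1st, 2nd, or 3rd Class) related to your chances of surviving the disaster, or was survival unrelated to class? Formally, the null hypothesis (\(H_0\)) is that survival is independent of passenger class; the alternative is that the two are associated.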

3. Environment Setup

We begin by loading the tidyverse ecosystem for data manipulation and visualization, along with gt and gtExtras to render publication-quality tables.

Code

# Load Core Libraries
library(tidyverse)
library(gt)
library(gtExtras)

4. The Data

The Titanic dataset ships with base R. It is a four-way table of counts that cross-classifies passengers and crew members by class, sex, age, and survival. To maintain our focus on social stratification, we will expand the compressed table into one row per person and filter out the crew.

Code

# Clean, expand, and filter the dataset
titanic_df <- Titanic |>  
  as_tibble() |>  
  uncount(n) |>  
  filter(Class != "Crew")

Code

glimpse(titanic_df)

Rows: 1,316
Columns: 4
$ Class    <chr> "3rd", "3rd", "3rd", "3rd", "3rd", "3rd", "3rd", "3rd", "3rd"…
$ Sex      <chr> "Male", "Male", "Male", "Male", "Male", "Male", "Male", "Male…
$ Age      <chr> "Child", "Child", "Child", "Child", "Child", "Child", "Child"…
$ Survived <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "…
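
To make the expansion step concrete, here is a minimal sketch of what uncount() does on a toy table. The toy_counts data and its column names are hypothetical and exist only for illustration.

```r
library(tibble)
library(tidyr)

# Hypothetical toy table: one row per category, with a count column `n`
toy_counts <- tibble(
  group = c("A", "B"),
  n     = c(2L, 3L)
)

# uncount() repeats each row `n` times and drops the count column
toy_rows <- uncount(toy_counts, n)

toy_rows$group
nrow(toy_rows)
```

The two summary rows become five individual rows (two "A", three "B"), which is the same transformation applied to the Titanic counts above.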

5. Visualizing the Question

Before diving into the math, it helps to visualize the raw distributions. We will construct a horizontal 100% stacked bar chart to compare the proportions of survivors and deaths within each ticket class.

Code

# Calculate internal metrics for positioning labels
titanic_summary <- titanic_df |>  
  count(Class, Survived) |>  
  group_by(Class) |>  
  mutate(
    total = sum(n),
    prop = n / total,
    cumulative = cumsum(prop),
    label_pos = cumulative - (prop / 2)
  ) |>  
  ungroup()

# Generate the plot
titanic_summary |>  
  ggplot(aes(x = prop, y = fct_rev(Class), fill = fct_rev(Survived))) +  
  geom_col(width = 0.6) +  
  geom_text(aes(x = label_pos, label = n), colour = "white", fontface = "bold", size = 5) +  
  geom_text(aes(x = 1.08, label = ifelse(Survived == "Yes", paste("Total:", total), "")), 
            hjust = 0, colour = "grey40", size = 4) +  
  scale_fill_manual(
    values = c("Yes" = "#60a5fa", "No" = "#f87171"),
    labels = c("Survived", "Died"),
    breaks = c("Yes", "No")
  ) +  
  scale_x_continuous(expand = expansion(mult = c(0, 0.15)), labels = NULL) +  
  labs(
    title = "Survival on the Titanic by Passenger Class",
    subtitle = "1st class passengers were far more likely to survive",
    x = NULL, y = NULL, fill = NULL
  ) +  
  theme_minimal(base_size = 14) +  
  theme(
    legend.position = "bottom",
    panel.grid = element_blank(),
    axis.text.y = element_text(face = "bold", size = 12),
    plot.title = element_text(face = "bold"),
    plot.title.position = "plot"
  )

Code

titanic_df |>  
  count(Class, Survived) |>  
  pivot_wider(names_from = Survived, values_from = n) |>  
  mutate(
    Total = No + Yes,
    `Survival Rate` = Yes / Total
  ) |>  
  rename(Died = No, Survived = Yes) |>  
  gt() |>  
  fmt_percent(`Survival Rate`, decimals = 1) |>  
  data_color(
    columns = `Survival Rate`,
    palette = c("#f87171", "#fbbf24", "#4ade80"),
    domain = c(0.2, 0.7)
  ) |>  
  cols_align(align = "center", columns = -Class) |>  
  tab_header(
    title = md("**Titanic Survival by Class**"),
    subtitle = "A clear socio-economic gradient from 1st Class to 3rd Class"
  ) |>  
  tab_style(
    style = cell_text(weight = "bold"),
    locations = cells_column_labels()
  ) |>  
  tab_options(
    table.font.size = px(14),
    heading.align = "left"
  )

Titanic Survival by Class
A clear socio-economic gradient from 1st Class to 3rd Class
Class	Died	Survived	Total	Survival Rate
1st	122	203	325	62.5%
2nd	167	118	285	41.4%
3rd	528	178	706	25.2%

6. The Chi-Squared Test

While the raw disparity is visually obvious, we need to check whether a difference this large could plausibly have arisen by chance alone. We cross-tabulate the data and apply Pearson’s Chi-squared test.

Code

# Conduct the Test of Independence
titanic_df |>  
  select(Class, Survived) |>  
  table() |>  
  chisq.test()


    Pearson's Chi-squared test

data:  table(select(titanic_df, Class, Survived))
X-squared = 133.05, df = 2, p-value < 2.2e-16
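
As a rough sketch of what chisq.test() does internally, the statistic can be rebuilt by hand from the six counts in the class-by-survival table above. The object names (observed, expected, x_squared, builtin) are chosen here for illustration; the counts are copied from this document rather than recomputed from titanic_df so that the block runs on its own.

```r
# Observed counts copied from the class-by-survival table in this document
observed <- matrix(
  c(122, 203,
    167, 118,
    528, 178),
  nrow = 3, byrow = TRUE,
  dimnames = list(Class = c("1st", "2nd", "3rd"),
                  Survived = c("No", "Yes"))
)

# Expected counts under independence: (row total * column total) / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson statistic: sum over all cells of (O - E)^2 / E
x_squared <- sum((observed - expected)^2 / expected)

round(expected, 1)
round(x_squared, 2)

# Cross-check against the built-in test
builtin <- chisq.test(observed)
all.equal(unname(builtin$statistic), x_squared)
```

Under independence, about 123 of the 325 1st Class passengers would have been expected to survive, compared with the 203 who did. Gaps like this one are what drive the statistic so far from zero.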

7. Interpreting the Results

The test output reports three values needed for formal reporting:

Metric: Chi-squared statistic (X-squared)
Value: 133.05
Interpretation: This measures how far the observed counts fall from the counts expected if class and survival were independent. With 2 degrees of freedom, the critical value at the 0.05 level is about 5.99, so 133.05 indicates an extreme deviation from independence.

Metric: Degrees of freedom (df)
Value: 2
Interpretation: Determined by the dimensions of the contingency table:

\[\text{df} = (\text{Rows} - 1) \times (\text{Columns} - 1) = (3 - 1) \times (2 - 1) = 2\]

Metric: p-value
Value: < 2.2e-16 (extremely close to 0)
Interpretation: This is the probability of seeing a statistic at least this large if class and survival were truly independent. Since it is far below the conventional threshold of 0.05, we reject the null hypothesis (\(H_0\)) of independence.
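
As a rough check on these values, the sketch below recomputes the degrees of freedom, looks up the 0.05 critical value, and evaluates the upper-tail p-value with base R's qchisq() and pchisq(). The variable names are illustrative, and x_squared is simply copied from the test output above.

```r
# Statistic copied from the chisq.test() output in this document
x_squared <- 133.05

# Degrees of freedom for a 3 x 2 table
df <- (3 - 1) * (2 - 1)

# Critical value: the statistic must exceed this to reject H0 at alpha = 0.05
critical_value <- qchisq(0.95, df = df)

# Upper-tail p-value: P(X^2 >= observed statistic) under independence
p_value <- pchisq(x_squared, df = df, lower.tail = FALSE)

critical_value
p_value

# R prints p-values below machine epsilon as "< 2.2e-16"
p_value < .Machine$double.eps
```

The printed "< 2.2e-16" is a display floor rather than the exact p-value: R stops reporting digits once the value drops below .Machine$double.eps, and the actual tail probability here is far smaller still.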

Final Academic Conclusion

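There is a highly statistically significant association between passenger class and survival on the Titanic (\(X^2\) = 133.05, df = 2, p < 0.001). Survival rates fell from 62.5% in 1st Class to 41.4% in 2nd Class and 25.2% in 3rd Class. The test establishes an association rather than a mechanism, but the pattern is consistent with ticket class shaping a passenger’s likelihood of making it to a lifeboat. Tragically, those in the lower classes faced vastly diminished odds of survival.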

Analytical Rule of Thumb

Metric	High Value	Low Value
Chi-Square (\(X^2\))	Strong deviation from independence	Observed counts close to expected counts
P-Value	Not statistically significant (consistent with chance)	Statistically significant (unlikely under independence)