Chi-Squared Test of Independence

Author

Abdullah Al Shamim

Published

May 25, 2026

<div class="subtitle">Analyzing Social Stratification and Survival Aboard the Titanic</div>

1. What is a Chi-Squared Test?

The Chi-Squared Test of Independence determines whether there is a statistically significant association between two categorical variables. It compares the frequencies we observe in our actual data to the frequencies we would expect to see if the two variables were completely independent of each other.

The Core Intuition
  • Independent: Knowing the value of Variable A tells you nothing about Variable B.
  • Dependent: Knowing Variable A helps you predict Variable B. If the Observed counts differ substantially from the Expected baseline counts, we have strong evidence that the variables are associated.
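The comparison between Observed and Expected counts can be written out as a pair of formulas. The symbols below are notation introduced here for illustration: \(O_{ij}\) is the observed count in row \(i\) and column \(j\) of the cross-tabulation, \(E_{ij}\) is the count expected under independence, and \(N\) is the total number of observations.

\[E_{ij} = \frac{(\text{Row } i \text{ total}) \times (\text{Column } j \text{ total})}{N}\]

\[X^2 = \sum_{i} \sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}\]

Each cell adds the squared gap between what was observed and what independence predicts, scaled by the expected count, so \(X^2\) grows as the data move away from independence.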

2. The Question

We’ll use the historic Titanic dataset to investigate a crucial sociological question:

Was survival on the Titanic associated with passenger class?

In other words, were your chances of surviving the disaster tied to your ticket class (1st, 2nd, or 3rd), or were survival rates essentially the same across all classes?


3. Environment Setup

We begin by loading the tidyverse ecosystem for data manipulation and visualization, along with gt and gtExtras to render publication-quality tables.

Code
# Load Core Libraries
library(tidyverse)
library(gt)
library(gtExtras)

4. The Data

The Titanic dataset ships with base R (in the datasets package). It contains counts of passengers and crew members cross-classified by class, sex, age, and survival. To keep the focus on social stratification, we will filter out the crew and expand the compressed table so that each row represents one person.

Code
# Clean, expand, and filter the dataset
titanic_df <- Titanic |>  
  as_tibble() |>  
  uncount(n) |>  
  filter(Class != "Crew")
Code
glimpse(titanic_df)
Rows: 1,316
Columns: 4
$ Class    <chr> "3rd", "3rd", "3rd", "3rd", "3rd", "3rd", "3rd", "3rd", "3rd"…
$ Sex      <chr> "Male", "Male", "Male", "Male", "Male", "Male", "Male", "Male…
$ Age      <chr> "Child", "Child", "Child", "Child", "Child", "Child", "Child"…
$ Survived <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "…
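
As a quick sanity check on the expansion step, the row count can be compared against the cell totals of the original table. This is a minimal sketch in base R only; the names `passenger_total` and `crew_total` are illustrative and not part of the original analysis.

```r
# Titanic is a 4-dimensional table: Class x Sex x Age x Survived
dim(Titanic)

# Sum the cell counts for the three passenger classes, leaving out the crew
passenger_total <- sum(Titanic[c("1st", "2nd", "3rd"), , , ])
crew_total <- sum(Titanic["Crew", , , ])

passenger_total   # should equal the row count reported by glimpse()
crew_total
```

If `passenger_total` matches the `Rows` value above, no passengers were lost or duplicated when the table was expanded.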

5. Visualizing the Question

Before diving into the math, it helps to look at the raw distribution. We will build a horizontal 100% stacked bar chart showing the proportion of survivors and deaths within each passenger class.

Code
# Calculate internal metrics for positioning labels
titanic_summary <- titanic_df |>  
  count(Class, Survived) |>  
  group_by(Class) |>  
  mutate(
    total = sum(n),
    prop = n / total,
    cumulative = cumsum(prop),
    label_pos = cumulative - (prop / 2)
  ) |>  
  ungroup()

# Generate the plot
titanic_summary |>  
  ggplot(aes(x = prop, y = fct_rev(Class), fill = fct_rev(Survived))) +  
  geom_col(width = 0.6) +  
  geom_text(aes(x = label_pos, label = n), colour = "white", fontface = "bold", size = 5) +  
  geom_text(aes(x = 1.08, label = ifelse(Survived == "Yes", paste("Total:", total), "")), 
            hjust = 0, colour = "grey40", size = 4) +  
  scale_fill_manual(
    values = c("Yes" = "#60a5fa", "No" = "#f87171"),
    labels = c("Survived", "Died"),
    breaks = c("Yes", "No")
  ) +  
  scale_x_continuous(expand = expansion(mult = c(0, 0.15)), labels = NULL) +  
  labs(
    title = "Survival on the Titanic by Passenger Class",
    subtitle = "1st class passengers were far more likely to survive",
    x = NULL, y = NULL, fill = NULL
  ) +  
  theme_minimal(base_size = 14) +  
  theme(
    legend.position = "bottom",
    panel.grid = element_blank(),
    axis.text.y = element_text(face = "bold", size = 12),
    plot.title = element_text(face = "bold"),
    plot.title.position = "plot"
  )

Code
# Summarize counts per class, compute survival rates, and render with gt
titanic_df |>
  count(Class, Survived) |>  
  pivot_wider(names_from = Survived, values_from = n) |>  
  mutate(
    Total = No + Yes,
    `Survival Rate` = Yes / Total
  ) |>  
  rename(Died = No, Survived = Yes) |>  
  gt() |>  
  fmt_percent(`Survival Rate`, decimals = 1) |>  
  data_color(
    columns = `Survival Rate`,
    palette = c("#f87171", "#fbbf24", "#4ade80"),
    domain = c(0.2, 0.7)
  ) |>  
  cols_align(align = "center", columns = -Class) |>  
  tab_header(
    title = md("**Titanic Survival by Class**"),
    subtitle = "A clear socio-economic gradient from 1st Class to 3rd Class"
  ) |>  
  tab_style(
    style = cell_text(weight = "bold"),
    locations = cells_column_labels()
  ) |>  
  tab_options(
    table.font.size = px(14),
    heading.align = "left"
  )
Titanic Survival by Class
A clear socio-economic gradient from 1st Class to 3rd Class
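
| Class | Died | Survived | Total | Survival Rate |
|-------|------|----------|-------|---------------|
| 1st   | 122  | 203      | 325   | 62.5%         |
| 2nd   | 167  | 118      | 285   | 41.4%         |
| 3rd   | 528  | 178      | 706   | 25.2%         |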

The chart and the table both show a stark disparity: the survival rate falls from 62.5% in 1st class to 41.4% in 2nd class and 25.2% in 3rd class.
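
The same survival rates can be reproduced without the tidyverse, which is a useful cross-check on the pipeline. The sketch below works directly on the built-in `Titanic` table using base R; `class_by_survival` and `survival_rates` are illustrative names.

```r
# Collapse the built-in table to Class x Survived and drop the crew row
class_by_survival <- margin.table(Titanic, c(1, 4))[c("1st", "2nd", "3rd"), ]

# Row-wise proportions: each class sums to 1
survival_rates <- prop.table(class_by_survival, margin = 1)
round(survival_rates, 3)
```

The `Yes` column of the result should line up with the Survival Rate column in the table.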


6. The Chi-Squared Test

While the raw disparity is visually obvious, we still need to check whether a gap this large could plausibly arise by chance alone if class and survival were independent. We cross-tabulate the data and apply Pearson’s Chi-squared test.

Code
# Conduct the Test of Independence
titanic_df |>  
  select(Class, Survived) |>  
  table() |>  
  chisq.test()

    Pearson's Chi-squared test

data:  table(select(titanic_df, Class, Survived))
X-squared = 133.05, df = 2, p-value < 2.2e-16
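
To see where the reported statistic comes from, it can be rebuilt by hand from the observed and expected counts. The following is a minimal base R sketch, not the author's code: `observed`, `expected`, `x_squared`, and `degrees_of_freedom` are illustrative names, and the counts are taken from the built-in `Titanic` table with the crew row dropped.

```r
# Observed Class x Survived counts for passengers only
observed <- margin.table(Titanic, c(1, 4))[c("1st", "2nd", "3rd"), ]

# Expected counts under independence: (row total * column total) / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson statistic: sum over cells of (observed - expected)^2 / expected
x_squared <- sum((observed - expected)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1)
degrees_of_freedom <- (nrow(observed) - 1) * (ncol(observed) - 1)

# Compare with the built-in test; $expected holds the same expected counts
builtin <- chisq.test(observed)
x_squared
builtin$statistic
round(builtin$expected, 1)
```

The hand-computed value should agree with the `X-squared` line in the output above. Inspecting `builtin$expected` is also a convenient way to confirm that the expected counts are large enough for the chi-squared approximation to be reasonable.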

7. Interpreting the Results

The statistical output yields three values needed for formal reporting:

Chi-squared statistic (X-squared)
  • Value: 133.05
  • Interpretation: This measures how far the observed counts fall from the counts expected if class and survival were independent. For a table with only 2 degrees of freedom, a value of 133.05 indicates an extreme deviation from independence.

Degrees of freedom (df)
  • Value: 2
  • Interpretation: Calculated as:

\[\text{df} = (\text{Rows} - 1) \times (\text{Columns} - 1)\]

\[\text{df} = (3 - 1) \times (2 - 1) = 2\]

This determines the shape of the reference Chi-Squared distribution used to assess significance.

p-value
  • Value: < 2.2e-16 (extremely close to 0)
  • Interpretation: This is the probability of seeing a statistic at least this large if the Null Hypothesis (\(H_0\)) of independence were true. Since it is far below the conventional significance threshold of 0.05, we reject \(H_0\).
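
The link between the statistic, the degrees of freedom, and the p-value can be checked directly against the chi-squared distribution. This is a small sketch using base R's `qchisq()` and `pchisq()`, with the values copied from the test output; `critical_value` and `p_value` are illustrative names.

```r
# Critical value: the statistic needed to reach significance at 0.05 with df = 2
critical_value <- qchisq(0.95, df = 2)

# Upper-tail probability of a statistic of 133.05 or larger with df = 2
p_value <- pchisq(133.05, df = 2, lower.tail = FALSE)

critical_value
p_value

# R prints "< 2.2e-16" because the p-value falls below machine precision
p_value < .Machine$double.eps
```

The observed statistic sits far beyond the critical value, which is why the printed p-value is reported only as an upper bound.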

Final Academic Conclusion

There is a highly statistically significant association between passenger class and survival on the Titanic (X-squared = 133.05, df = 2, p < 2.2e-16). The test itself establishes association rather than causation, but the pattern is consistent with ticket class shaping a passenger’s chances of reaching a lifeboat. Passengers in 3rd class survived at well under half the rate of those in 1st class (25.2% versus 62.5%).


Analytical Rule of Thumb

| Metric | High Value | Low Value |
|--------|------------|-----------|
| Chi-Square (\(X^2\)) | Strong deviation from independence | Observed data close to what independence predicts |
| P-Value | Not statistically significant (consistent with noise) | Statistically significant (evidence of a real pattern) |