Categorical Data Analysis in R

Chi-Squared Goodness of Fit and Independence Tests

Author

Abdullah Al Shamim

Published

February 17, 2026

Introduction to Chi-Squared Tests

The Chi-Squared test is a fundamental statistical tool for analyzing categorical data. In this lesson, we explore two primary applications:

Goodness of Fit: Does the observed proportion of a single variable (e.g., flower size) match an expected distribution (e.g., equal proportions)?
Test of Independence: Is there a significant relationship between two categorical variables (e.g., Species and Size)?

1. Environment Setup and Data Exploration

We will use the classic iris dataset and transform the continuous Sepal.Length variable into categorical bins.

# Load libraries
library(tidyverse)

# Quick look at the data
head(iris)

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

2. Data Preparation

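We use the cut() function to discretize Sepal.Length into three categories: Small, Medium, and Large. With breaks = 3, cut() divides the observed range into three equal-width intervals, so the groups need not contain equal numbers of flowers.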

# Converting numeric variables into categorical variables
flower <- iris %>% 
  mutate(size = cut(Sepal.Length,
                    breaks = 3,
                    labels = c("Small", "Medium", "Large"))) %>%
  select(Species, size)  

# View the formatted table
table(flower)

            size
Species      Small Medium Large
  setosa        47      3     0
  versicolor    11     36     3
  virginica      1     32    17
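
To see where the cut points fall, it can help to call cut() without labels. The sketch below assumes the same equal-width binning as above; the names `bins` and `bin_counts` are illustrative and not part of the original lesson.

```r
# With breaks = 3, cut() splits the observed range of Sepal.Length
# into three equal-width intervals (not three equal-sized groups).
bins <- cut(iris$Sepal.Length, breaks = 3)

# The interval labels show the boundaries R chose
levels(bins)

# Counts per interval; these match the column totals of table(flower)
bin_counts <- table(bins)
bin_counts
```

Because the bins are equal in width rather than in count, unequal group sizes are expected. Quantile-based breaks would be one alternative if balanced groups were the goal.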

3. Chi-Squared Goodness of Fit Test

This test evaluates a single variable. We want to know whether the observed proportions of “Small,” “Medium,” and “Large” flowers differ significantly from the equal proportions expected under the null hypothesis.

Hypotheses

$H_0$: The proportions of Small, Medium, and Large flowers are equal ($1/3$ each).
$H_1$: The proportions are unequal.

# Visualizing the distribution
flower %>%
  ggplot(aes(x = size)) +
  geom_bar(fill = "#a139ca", alpha = 0.7) +
  labs(title = "Goodness of Fit: Observed Frequencies",
       subtitle = "Proportion of flowers by size",
       x = "Size Category",
       y = "Frequency") +
  theme_test(base_size = 15)

# Performing the test
goodness_fit <- flower %>% 
  select(size) %>% 
  table() %>% 
  chisq.test()

goodness_fit


    Chi-squared test for given probabilities

data:  .
X-squared = 28.44, df = 2, p-value = 6.673e-07

Interpretation: The p-value is approximately $6.67 \times 10^{-7}$. Since this is far below the conventional 0.05 significance level, we reject the null hypothesis. The proportions of flower sizes in this dataset differ significantly from an equal split.
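
The statistic reported by chisq.test() can be reproduced by hand, which makes clear what the test measures. This is a minimal sketch using the column totals of the contingency table from the data preparation step (59, 71, 20); the variable names are illustrative.

```r
# Observed counts per size category (column totals of the contingency table)
observed <- c(Small = 59, Medium = 71, Large = 20)

# Under the null hypothesis, each category holds 1/3 of the flowers
expected <- rep(sum(observed) / length(observed), length(observed))

# Chi-squared statistic: sum of (observed - expected)^2 / expected
chi_sq <- sum((observed - expected)^2 / expected)

# Degrees of freedom: number of categories minus one
df <- length(observed) - 1

# Upper-tail probability of the chi-squared distribution
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

chi_sq   # 28.44
p_value  # about 6.67e-07
```

When the null hypothesis specifies unequal proportions, chisq.test() accepts them through its `p` argument, and the same formula applies with the corresponding expected counts.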

4. Chi-Squared Test of Independence

This test evaluates the relationship between two categorical variables: Species and Size.

Hypotheses

$H_0$: Species and Size are independent (no relationship).
$H_1$: Species and Size are dependent (there is an association).
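
Independence means the size distribution would look roughly the same within every species. One way to inspect that numerically is with row proportions. The sketch below rebuilds the table in base R so it runs on its own; `counts` and `row_props` are illustrative names.

```r
# Rebuild the size variable so this block runs standalone
size <- cut(iris$Sepal.Length, breaks = 3,
            labels = c("Small", "Medium", "Large"))
counts <- table(Species = iris$Species, size = size)

# margin = 1 rescales each row (species) to sum to 1
row_props <- prop.table(counts, margin = 1)
round(row_props, 2)
```

If the rows looked alike, that would be consistent with independence. Large differences between rows point toward an association, which the formal test then quantifies.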

# Visualizing the relationship
flower %>%
  ggplot(aes(size, fill = Species)) +
  geom_bar(alpha = 0.7) +
  labs(title = "Test of Independence: Species vs. Size",
       subtitle = "Stacked bar chart showing categorical overlap",
       x = "Size Category",
       y = "Count") +
  theme_test(base_size = 15) +
  scale_fill_manual(values = c("setosa" = "#d48ce1",
                               "versicolor" = "#a139ca",
                               "virginica" = "#6d2683"))

Performing the Test

# Run the independence test
independence_test <- table(flower) %>% chisq.test()

independence_test


    Pearson's Chi-squared test

data:  .
X-squared = 111.63, df = 4, p-value < 2.2e-16

Interpretation: The p-value is below $2.2 \times 10^{-16}$. Since this is far below the conventional 0.05 significance level, we reject the null hypothesis. Species and Size are strongly associated: knowing the species helps predict the likely size of the flower.
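
A significant result does not say which cells drive the association. The object returned by chisq.test() includes Pearson residuals that can help with this. The following is a sketch rather than part of the original lesson; the table is rebuilt in base R so the block is self-contained.

```r
# Rebuild the contingency table so this block runs standalone
size <- cut(iris$Sepal.Length, breaks = 3,
            labels = c("Small", "Medium", "Large"))
counts <- table(Species = iris$Species, size = size)
independence_test <- chisq.test(counts)

# Pearson residuals: (observed - expected) / sqrt(expected)
# Positive values mean more flowers than expected under independence
pearson_res <- independence_test$residuals
round(pearson_res, 2)

# The squared residuals add up to the X-squared statistic
sum(pearson_res^2)
```

Cells with large absolute residuals, such as setosa in the Small category, contribute most to the statistic.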

5. Checking Assumptions (Expected Values)

A Chi-squared test may be unreliable if more than 20% of the expected frequencies are less than 5. In such cases, Fisher’s Exact Test is preferred.

# View expected frequencies
independence_test$expected

            size
Species         Small   Medium    Large
  setosa     19.66667 23.66667 6.666667
  versicolor 19.66667 23.66667 6.666667
  virginica  19.66667 23.66667 6.666667
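
The expected frequencies come directly from the table margins, and the 20% rule of thumb can be checked programmatically. This is a minimal sketch; `expected` and `share_below_5` are illustrative names, and the table is rebuilt in base R so the block runs on its own.

```r
# Rebuild the contingency table so this block runs standalone
size <- cut(iris$Sepal.Length, breaks = 3,
            labels = c("Small", "Medium", "Large"))
counts <- table(Species = iris$Species, size = size)

# Expected count per cell = row total * column total / grand total
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)
round(expected, 2)

# Share of cells with an expected count below 5
share_below_5 <- mean(expected < 5)
share_below_5

# Rule of thumb from the text: the test is suspect if this exceeds 20%
share_below_5 > 0.20
```

For this table the smallest expected count is about 6.67, so no cell falls below 5 and the chi-squared approximation appears reasonable.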

Systematic Checklist (Cheat Sheet)

Discretizing Data: cut(variable, breaks = n, labels = c(...))
Contingency Table: table(var1, var2)
Statistical Test: chisq.test(table_object)
Checking Validity: test_object$expected
Alternative for Small Samples: fisher.test(table_object)
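
To illustrate the last item, here is a sketch of a case where Fisher's Exact Test would be the safer choice. The 2 x 2 counts below are hypothetical and chosen only to produce small expected frequencies; they do not come from the iris data.

```r
# Hypothetical 2 x 2 table with only 8 observations (not from iris)
small_table <- matrix(c(3, 1, 1, 3), nrow = 2,
                      dimnames = list(group = c("A", "B"),
                                      outcome = c("yes", "no")))

# Every expected count is 2, well below the threshold of 5
expected_small <- outer(rowSums(small_table), colSums(small_table)) /
  sum(small_table)
expected_small

# Fisher's Exact Test does not rely on the large-sample approximation
fisher_result <- fisher.test(small_table)
fisher_result$p.value
```

With counts this small, chisq.test() would typically warn that the approximation may be incorrect, which is a cue to switch to fisher.test().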

Summary: You have now worked through both types of Chi-squared tests. You can test whether the categories of a single variable are evenly distributed and whether two categorical variables are associated with one another.