Introduction

The Chi-Square Test is a nonparametric statistical method used to analyze categorical data. It is used when we have frequency counts and want to test whether:

  1. Chi-Square Goodness-of-Fit Test: The observed frequency distribution differs from an expected theoretical distribution.
  2. Chi-Square Test for Independence: Two categorical variables are associated (dependent) or independent.

Unlike parametric tests, the chi-square test does not assume normality and is useful for count-based data.


1. Chi-Square Goodness-of-Fit Test

Theory

The goodness-of-fit test evaluates whether an observed distribution differs from a hypothesized theoretical distribution.

Hypotheses

  • \(H_0\): The observed frequencies match the expected frequencies.
  • \(H_A\): The observed frequencies differ significantly from the expected frequencies.

Test Statistic

\[ \chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i} \]

where:

  • \(O_i\) = observed frequency for category \(i\),
  • \(E_i\) = expected frequency under \(H_0\),
  • \(k\) = number of categories.

Under \(H_0\), \(\chi^2\) approximately follows a chi-square distribution with \(k - 1\) degrees of freedom, provided the expected frequencies are not too small.
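
To make the formula concrete, the statistic can be computed by hand and compared with R's built-in test. This is a minimal sketch: the counts below are hypothetical and chosen only for illustration, and the variable names are not part of the original example.

```r
# Hypothetical counts for three categories (illustrative only)
observed <- c(55, 25, 20)
expected_prop <- c(0.5, 0.3, 0.2)

# Expected frequencies under H0: E_i = n * p_i
expected <- sum(observed) * expected_prop

# Test statistic: sum over categories of (O_i - E_i)^2 / E_i
chi_sq <- sum((observed - expected)^2 / expected)

# Degrees of freedom: k - 1
deg_free <- length(observed) - 1

# Upper-tail p-value from the chi-square distribution
p_value <- pchisq(chi_sq, df = deg_free, lower.tail = FALSE)

# Cross-check against the built-in test
builtin <- chisq.test(observed, p = expected_prop)
c(manual = chi_sq, builtin = unname(builtin$statistic))
```

The two values should agree, since chisq.test() applies the same formula when a vector of probabilities is supplied through p.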

Example: Simulated Data

Let’s test whether people’s drink preferences follow a given probability distribution.

# Load required packages
library(ggplot2)
library(dplyr)
library(knitr)
# Simulated observed counts
observed_counts <- c(50, 30, 20)  # Coffee, Tea, Juice
expected_proportions <- c(0.5, 0.3, 0.2)  # Expected proportions
total_count <- sum(observed_counts)

# Compute expected frequencies
expected_counts <- total_count * expected_proportions

# Perform Chi-square Goodness-of-Fit Test
chisq_test <- chisq.test(observed_counts, p = expected_proportions)

# Create a table
df_goodness <- data.frame(
  Category = c("Coffee", "Tea", "Juice"),
  Observed = observed_counts,
  Expected = expected_counts
)

# Display the table
kable(df_goodness, caption = "Observed vs Expected Frequencies")
Observed vs Expected Frequencies
Category   Observed   Expected
Coffee           50         50
Tea              30         30
Juice            20         20
# Print test result
chisq_test
## 
##  Chi-squared test for given probabilities
## 
## data:  observed_counts
## X-squared = 0, df = 2, p-value = 1

Visualizing Observed vs Expected Frequencies

ggplot(df_goodness, aes(x = Category, y = Observed, fill = Category)) +
  geom_col(alpha = 0.8) +
  geom_point(aes(y = Expected), size = 4, color = "red") +
  labs(title = "Observed vs Expected Frequencies", y = "Count") +
  theme_minimal()

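Interpretation

  • If the p-value is below 0.05, reject \(H_0\): the observed distribution differs significantly from the expected one.
  • If the p-value is 0.05 or above, fail to reject \(H_0\): the data provide no evidence that the observed distribution differs from the expected one.

Here the observed counts equal the expected counts exactly, so \(\chi^2 = 0\) and the p-value is 1; we fail to reject \(H_0\).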
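
The decision rule can also be written out in code. The sketch below reuses the drink-preference counts from the example; the significance level alpha = 0.05 and the variable name decision are assumptions made for illustration.

```r
# Data from the drink-preference example
observed_counts <- c(50, 30, 20)
expected_proportions <- c(0.5, 0.3, 0.2)
alpha <- 0.05  # assumed significance level

result <- chisq.test(observed_counts, p = expected_proportions)

# Compare the p-value with alpha and record the decision
decision <- if (result$p.value < alpha) {
  "Reject H0: the observed distribution differs from the expected one."
} else {
  "Fail to reject H0: no evidence of a difference from the expected distribution."
}
decision
```

Failing to reject \(H_0\) does not prove that the distributions match; it only means the data are compatible with the hypothesized proportions.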


2. Chi-Square Test for Independence

This test evaluates whether two categorical variables are associated.

\(H_0\): The two categorical variables are independent.

\(H_A\): There is an association between the variables.

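For an \(r \times c\) contingency table, the expected count in cell \((i, j)\) under \(H_0\) is:

\[ E_{ij} = \frac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{\text{grand total}} \]

The test statistic is:

\[ \chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \]

Under \(H_0\), \(\chi^2\) approximately follows a chi-square distribution with \((r - 1)(c - 1)\) degrees of freedom.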
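
The expected-count formula can be checked directly in R. The following sketch uses R's built-in HairEyeColor dataset (the same data as the example that follows); the names tab, expected_manual, deg_free, and fit are illustrative.

```r
# Collapse the built-in HairEyeColor array over its third dimension (Sex)
tab <- margin.table(HairEyeColor, c(1, 2))

# Expected counts: (row total x column total) / grand total
expected_manual <- outer(rowSums(tab), colSums(tab)) / sum(tab)

# Degrees of freedom for an r x c table: (r - 1)(c - 1)
deg_free <- (nrow(tab) - 1) * (ncol(tab) - 1)

# Compare with the expected counts stored by chisq.test()
fit <- chisq.test(tab)
all.equal(unname(expected_manual), unname(fit$expected))
deg_free
```

outer() multiplies every row total by every column total, which gives the full matrix of expected counts in one step.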

Example: Hair and Eye Color (Real Data)

Let’s analyze whether hair color and eye color are independent.

# Load dataset
data(HairEyeColor)

# Convert to a 2D contingency table (sum over Sex)
hair_eye_data <- margin.table(HairEyeColor, c(1, 2))

# Perform Chi-Square Test
chisq_indep_test <- chisq.test(hair_eye_data)

# Display contingency table
kable(hair_eye_data, caption = "Hair Color vs Eye Color Contingency Table")
Hair Color vs Eye Color Contingency Table
Hair \ Eye   Brown   Blue   Hazel   Green
Black           68     20      15       5
Brown          119     84      54      29
Red             26     17      14      14
Blond            7     94      10      16
# Print test results
chisq_indep_test
## 
##  Pearson's Chi-squared test
## 
## data:  hair_eye_data
## X-squared = 138.29, df = 9, p-value < 2.2e-16
# Mosaic plot of the association (requires the vcd package)
library(vcd)
mosaic(hair_eye_data, shade = TRUE, legend = TRUE, main = "Hair vs Eye Color Association")
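
To see which cells drive the association, the Pearson residuals can be inspected numerically. As far as the vcd documentation describes, shade = TRUE colours the mosaic tiles by these residuals by default. The sketch below relies only on base R; the names fit and pearson_manual are illustrative.

```r
# Same 2D table as in the example: hair colour by eye colour, summed over Sex
hair_eye_data <- margin.table(HairEyeColor, c(1, 2))
fit <- chisq.test(hair_eye_data)

# Pearson residuals: (observed - expected) / sqrt(expected)
pearson_manual <- (hair_eye_data - fit$expected) / sqrt(fit$expected)

# chisq.test() stores the same quantity in $residuals
round(fit$residuals, 2)

# The squared Pearson residuals add up to the chi-square statistic
sum(fit$residuals^2)
```

Large positive residuals mark cells with more observations than independence predicts, and large negative residuals mark cells with fewer.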

📊 Dataset

Product Type   Region A   Region B
Type 1               40         60
Type 2               30         50
Type 3               50         40

📝 Task for Students

  1. Create the contingency table in R.
  2. Perform a chi-square test for independence.
  3. Interpret the result.
# Create the contingency table (values listed row by row to match byrow = TRUE)
product_sales <- matrix(c(40, 60, 30, 50, 50, 40), nrow = 3, byrow = TRUE,
                       dimnames = list(ProductType = c("Type 1", "Type 2", "Type 3"),
                                       Region = c("Region A", "Region B")))

product_sales
##            Region
## ProductType Region A Region B
##      Type 1       40       60
##      Type 2       30       50
##      Type 3       50       40

Help Code

# Create contingency table
product_data <- matrix(c(40, 60, 30, 50, 50, 40), nrow = 3, byrow = TRUE)
colnames(product_data) <- c("Region A", "Region B")
rownames(product_data) <- c("Type 1", "Type 2", "Type 3")

# Perform Chi-square test
product_chisq_test <- chisq.test(product_data)

# Display table
kable(product_data, caption = "Product Sales Contingency Table")
Product Sales Contingency Table
Product Type   Region A   Region B
Type 1               40         60
Type 2               30         50
Type 3               50         40
# Print test results
product_chisq_test
## 
##  Pearson's Chi-squared test
## 
## data:  product_data
## X-squared = 6.8625, df = 2, p-value = 0.03235
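
For the third task, one possible way to interpret the result is sketched below. The significance level alpha = 0.05 is an assumption, as are the names test and decision; the check on expected counts reflects the common rule of thumb that the chi-square approximation works best when expected counts are at least about 5.

```r
# Contingency table from the exercise
product_data <- matrix(c(40, 60, 30, 50, 50, 40), nrow = 3, byrow = TRUE,
                       dimnames = list(c("Type 1", "Type 2", "Type 3"),
                                       c("Region A", "Region B")))
alpha <- 0.05  # assumed significance level

test <- chisq.test(product_data)

# Smallest expected count, to check the rule of thumb for the approximation
min(test$expected)

# Decision at the assumed significance level
decision <- if (test$p.value < alpha) {
  "Reject H0: product type and region appear to be associated."
} else {
  "Fail to reject H0: no evidence of an association."
}
decision
```

With a p-value of about 0.032, the test rejects \(H_0\) at the 5% level, which suggests that product preference differs between the two regions.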