Chi-Square Analysis in R

Statistical Testing for Categorical Data

Illya Mowerman, Ph.D.

Overview

Introduction

Chi-square tests are fundamental statistical methods for analyzing categorical data. This deck covers two of them: the test of independence and the goodness of fit test.

Chi-Square Test of Independence

Purpose

# Load ggplot2, which the plots in this deck rely on
library(ggplot2)

# Create sample marketing data
marketing_data <- data.frame(
  Campaign = c(rep("Email", 200), rep("Social", 150), rep("Print", 150)),
  Response = c(
    rep("Converted", 80), rep("Not_Converted", 120),  # Email
    rep("Converted", 45), rep("Not_Converted", 105),  # Social
    rep("Converted", 30), rep("Not_Converted", 120)   # Print
  )
)

# Create and print contingency table
campaign_table <- table(marketing_data$Campaign, marketing_data$Response)
print(campaign_table)
##         
##          Converted Not_Converted
##   Email         80           120
##   Print         30           120
##   Social        45           105
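Row proportions make the table easier to read before any test is run. The sketch below rebuilds the same counts as a matrix (the name campaign_counts is introduced here for illustration only) and uses base R's prop.table() to express each campaign's responses as shares of its row total.

```r
# Same counts as the contingency table above, entered directly as a matrix.
# campaign_counts is an illustrative name, not part of the original slides.
campaign_counts <- matrix(
  c(80, 120,
    30, 120,
    45, 105),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    Campaign = c("Email", "Print", "Social"),
    Response = c("Converted", "Not_Converted")
  )
)

# margin = 1 divides each cell by its row total, so every row sums to 1
row_props <- prop.table(campaign_counts, margin = 1)
print(round(row_props, 3))
# Converted share: Email 0.4, Print 0.2, Social 0.3
```

These conversion shares are what the chi-square test of independence compares across campaigns.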

Marketing Campaign Analysis

# Perform chi-square test
campaign_test <- chisq.test(campaign_table)
print(campaign_test)
## 
##  Pearson's Chi-squared test
## 
## data:  campaign_table
## X-squared = 16.129, df = 2, p-value = 0.0003145

# Visualize results
ggplot(marketing_data, aes(x = Campaign, fill = Response)) +
  geom_bar(position = "fill") +
  labs(title = "Conversion Rates by Marketing Campaign",
       y = "Proportion",
       x = "Campaign Type") +
  theme_minimal()
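To see where the reported statistic comes from, it can help to reproduce it by hand. The following is a minimal sketch that assumes the same campaign counts. It builds the expected counts from the row and column totals, sums (observed - expected)^2 / expected, and converts the result to a p-value with pchisq(). The variable names are illustrative.

```r
# Observed counts, matching the campaign contingency table
observed <- matrix(
  c(80, 120,
    30, 120,
    45, 105),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    Campaign = c("Email", "Print", "Social"),
    Response = c("Converted", "Not_Converted")
  )
)

# Expected count under independence: row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson chi-square statistic and its degrees of freedom
x_squared <- sum((observed - expected)^2 / expected)
df <- (nrow(observed) - 1) * (ncol(observed) - 1)

# Upper-tail probability of the chi-square distribution
p_value <- pchisq(x_squared, df = df, lower.tail = FALSE)

cat("X-squared =", round(x_squared, 3), " df =", df, " p-value =", signif(p_value, 4), "\n")

# Standardized residuals indicate which cells depart most from independence
print(round(chisq.test(observed)$stdres, 2))
```

Cells with large absolute standardized residuals (roughly beyond 2) are the ones contributing most to a significant result.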

Chi-Square Goodness of Fit Test

Purpose

# Create sample dice roll data
dice_rolls <- c(40, 45, 55, 35, 50, 35)  # Frequency of each number (1-6)
names(dice_rolls) <- 1:6

# Expected probabilities (fair die)
expected_prob <- rep(1/6, 6)

Dice Roll Analysis

# Perform chi-square goodness of fit test
dice_test <- chisq.test(dice_rolls, p = expected_prob)
print(dice_test)
## 
##  Chi-squared test for given probabilities
## 
## data:  dice_rolls
## X-squared = 7.6923, df = 5, p-value = 0.174

# Visualize results
dice_df <- data.frame(
  Number = 1:6,
  Observed = dice_rolls,
  Expected = sum(dice_rolls)/6
)

ggplot(dice_df, aes(x = factor(Number))) +
  geom_col(aes(y = Observed), fill = "skyblue") +
  geom_hline(yintercept = sum(dice_rolls)/6, color = "red", linetype = "dashed") +
  labs(title = "Dice Roll Frequencies",
       x = "Dice Number",
       y = "Frequency") +
  theme_minimal()
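The goodness of fit statistic can be checked by hand in the same way. This sketch assumes the dice counts shown above and a fair die. It computes the statistic, the p-value, and the Pearson residuals, which show which faces deviate most from the expected count. Variable names are illustrative.

```r
# Observed frequency of each face, as in the dice example
observed <- c("1" = 40, "2" = 45, "3" = 55, "4" = 35, "5" = 50, "6" = 35)

# Under a fair die every face is expected n / 6 times
expected <- sum(observed) * rep(1 / 6, 6)

# Pearson chi-square statistic with (number of categories - 1) degrees of freedom
x_squared <- sum((observed - expected)^2 / expected)
df <- length(observed) - 1
p_value <- pchisq(x_squared, df = df, lower.tail = FALSE)

cat("X-squared =", round(x_squared, 4), " df =", df, " p-value =", round(p_value, 3), "\n")

# Pearson residuals: (observed - expected) / sqrt(expected)
pearson_res <- (observed - expected) / sqrt(expected)
print(round(pearson_res, 2))
```

Face 3 has the largest residual here, but with p = 0.174 the overall departure from a fair die is not significant at the 5% level.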

Assumptions and Requirements

  1. Independence of observations
  2. Expected frequencies ≥ 5 in each cell
  3. Categorical (nominal or ordinal) data
  4. Random sampling

# Check expected frequencies
campaign_test$expected
##         
##          Converted Not_Converted
##   Email       62.0         138.0
##   Print       46.5         103.5
##   Social      46.5         103.5
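When some expected counts fall below 5, one common fallback is Fisher's exact test. The sketch below is one possible way to automate that check. The helper expected_counts_ok() is hypothetical (not part of base R), and small_table contains made-up counts used only to trigger the fallback.

```r
# Hypothetical helper: TRUE when every expected count reaches the threshold.
# suppressWarnings() hides the approximation warning chisq.test() emits
# for sparse tables, since we only want the expected counts here.
expected_counts_ok <- function(tbl, threshold = 5) {
  expected <- suppressWarnings(chisq.test(tbl)$expected)
  all(expected >= threshold)
}

# Made-up sparse 2x2 table, for illustration only
small_table <- matrix(
  c(3, 1,
    2, 6),
  nrow = 2,
  dimnames = list(Group = c("A", "B"), Outcome = c("Yes", "No"))
)

# Use the chi-square test when the assumption holds, Fisher's exact test otherwise
if (expected_counts_ok(small_table)) {
  result <- chisq.test(small_table)
} else {
  result <- fisher.test(small_table)
}

print(result$method)
print(result$p.value)
```

Another option for sparse tables is chisq.test(tbl, simulate.p.value = TRUE), which estimates the p-value by Monte Carlo simulation instead of relying on the chi-square approximation.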

Practice Exercise

# Create student dataset
student_data <- data.frame(
  Gender = c(rep("Male", 100), rep("Female", 100)),
  Study_Method = c(
    rep("Online", 40), rep("Traditional", 60),  # Male
    rep("Online", 55), rep("Traditional", 45)   # Female
  )
)

# Tasks:
# 1. Create contingency table
student_table <- table(student_data$Gender, student_data$Study_Method)
print(student_table)
##         
##          Online Traditional
##   Female     55          45
##   Male       40          60

# 2. Perform chi-square test
student_test <- chisq.test(student_table)
print(student_test)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  student_table
## X-squared = 3.9298, df = 1, p-value = 0.04744
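For 2x2 tables, chisq.test() applies Yates' continuity correction by default, which is why the output above names it. The sketch below, assuming the same student counts, runs the test with and without the correction so the two results can be compared side by side. The object names are illustrative.

```r
# Student counts, matching the contingency table in the exercise
student_counts <- matrix(
  c(55, 45,
    40, 60),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    Gender = c("Female", "Male"),
    Study_Method = c("Online", "Traditional")
  )
)

# Default behaviour for a 2x2 table: Yates' continuity correction is applied
with_yates <- chisq.test(student_counts)

# correct = FALSE gives the uncorrected Pearson statistic
without_yates <- chisq.test(student_counts, correct = FALSE)

comparison <- data.frame(
  correction = c("Yates", "none"),
  statistic = unname(c(with_yates$statistic, without_yates$statistic)),
  p_value = c(with_yates$p.value, without_yates$p.value)
)
print(comparison)
```

The correction shrinks the statistic, so the corrected p-value is the more conservative of the two. With a result this close to 0.05, it is worth reporting which version was used.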

Practice Exercise Visualization

# 3. Visualize results
ggplot(student_data, aes(x = Gender, fill = Study_Method)) +
  geom_bar(position = "fill") +
  labs(title = "Study Method Preference by Gender",
       y = "Proportion",
       x = "Gender") +
  theme_minimal()

Key Tips

Common Pitfalls

  1. Using the test with small expected frequencies
  2. Applying to non-categorical data
  3. Ignoring assumptions
  4. Over-interpreting results
  5. Not considering effect size
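On the last pitfall: a small p-value says an association exists, not how strong it is. One widely used effect size for contingency tables is Cramér's V, defined as sqrt(X² / (n * (min(rows, cols) - 1))). The helper cramers_v() below is a hypothetical sketch written for this deck's campaign counts, not a function from base R.

```r
# Hypothetical helper: Cramér's V from a contingency table of counts.
# Uses the uncorrected Pearson statistic (correct = FALSE).
cramers_v <- function(tbl) {
  test <- suppressWarnings(chisq.test(tbl, correct = FALSE))
  n <- sum(tbl)                          # total number of observations
  k <- min(nrow(tbl), ncol(tbl)) - 1     # smaller table dimension minus one
  sqrt(unname(test$statistic) / (n * k))
}

# Campaign counts from the test of independence example
campaign_counts <- matrix(
  c(80, 120,
    30, 120,
    45, 105),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    Campaign = c("Email", "Print", "Social"),
    Response = c("Converted", "Not_Converted")
  )
)

v <- cramers_v(campaign_counts)
print(round(v, 3))
```

V ranges from 0 (no association) to 1 (perfect association). The value here is about 0.18, so the campaign effect is highly significant but modest in size.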

Additional Resources

Custom CSS (save as custom.css)

.title-slide {
    background-color: #2C3E50;
    color: white;
}

h2 {
    color: #2C3E50;
    border-bottom: 2px solid #2C3E50;
}

.emphasized {
    font-weight: bold;
    color: #E74C3C;
}

code {
    color: #2980B9;
}

pre {
    background-color: #F7F9F9;
    border: 1px solid #BDC3C7;
    border-radius: 5px;
}