2025-06-08

Chi-Square Test \((\chi^2-\text{Test})\)

  • Used to determine whether two categorical variables are statistically independent or dependent
  • Example: is the number of car accidents statistically related to the day of the week?
  • If num_car_accidents and day_of_the_week are independent, we would expect (in a perfect scenario) the same number of accidents each day. How far must the observed counts depart from the expected counts before we can statistically conclude that the two might be dependent?
Day of Week   Monday  Tuesday  Wednesday  Thursday  Friday  Saturday  Sunday
Actual Acc.   8       5        5          6         9       12        10
Exp. Acc.     7       7        7          7         7       7         7

Setup

\(\text{Null Hypothesis H}_0\)

  • We expect equal frequencies of car accidents each day of the week

\(\text{Alternate Hypothesis H}_1\)

  • Unequal frequencies (implying dependency)

\(\text{Degrees of Freedom : } df = n-1\)

  • 7 days of the week - 1 = 6 degrees of freedom

\(\text{Critical Value : } CV\)

  • the threshold, set by the chosen level of significance, against which the calculated statistic is compared (statistic above CV -> reject the null hypothesis, below -> accept)
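
As a minimal sketch, the critical value for this setup can be computed in base R with `qchisq()` instead of a lookup table. The variable names are illustrative and the significance level of 0.05 is the one used in the rest of these notes.

```r
alpha <- 0.05            # significance level
n_categories <- 7        # days of the week
df <- n_categories - 1   # degrees of freedom

# Upper-tail critical value: the point with probability alpha to its right
critical_value <- qchisq(1 - alpha, df)
round(critical_value, 3)   # 12.592
```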

Example \(\chi^2\) plot

  • blue, red, orange, green -> df = 2, 5, 8, 10
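
A plot like the one described can be sketched in base R as follows. This is an illustrative reconstruction using the colours and df values listed above, not necessarily the code behind the original figure; the x-range of 0 to 25 is an assumption.

```r
x <- seq(0, 25, length.out = 500)
dfs <- c(2, 5, 8, 10)
cols <- c("blue", "red", "orange", "green")

# One column of chi-square densities per df value
dens <- sapply(dfs, function(d) dchisq(x, df = d))

# Draw all four curves on shared axes and label them
matplot(x, dens, type = "l", lty = 1, col = cols,
        xlab = "Chi-Square Value", ylab = "Density")
legend("topright", legend = paste("df =", dfs), col = cols, lty = 1)
```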

Calculating \(\chi^2_\alpha\) value

  • The critical value is usually read from a table (or computed by software), unless you really want to do it manually
  • You can approximate \(\chi^2_\alpha\) with the ‘Wilson-Hilferty transformation’; the approximation tends to improve as the degrees of freedom grow

\(\chi^2_{\alpha, df}\approx df\left(1-\frac{2}{9\cdot df}+z_\alpha\sqrt{\frac{2}{9\cdot df}}\right)^3\)

  • where \(z_\alpha\) is the z-score associated with the desired significance level (for \(\alpha = 0.05\), \(z_\alpha \approx 1.645\))

  • For df = 6: 12.592 (lookup table) vs. 12.5686 (using above formula)
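
The comparison above can be reproduced with a short R sketch. The helper name `wilson_hilferty` is hypothetical (it is not a base R function); `qnorm()` and `qchisq()` are standard.

```r
# Wilson-Hilferty approximation to the upper-tail chi-square critical value
wilson_hilferty <- function(alpha, df) {
  z <- qnorm(1 - alpha)   # upper-tail z-score, about 1.645 for alpha = 0.05
  a <- 2 / (9 * df)
  df * (1 - a + z * sqrt(a))^3
}

approx_cv <- wilson_hilferty(0.05, 6)   # formula above
exact_cv  <- qchisq(0.95, 6)            # what a lookup table reports

round(c(approx = approx_cv, exact = exact_cv), 4)
```

For df = 6 the two values differ by roughly 0.02, which is small relative to the critical value itself.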

Calculating \(\chi^2_c\) value

  • Significantly simpler
  • If this value is smaller than \(\chi^2_\alpha \rightarrow \text{accept H}_0\), else reject

\(\chi^2_c = \sum\limits_{i = 1}^k\frac{(x_i-m_i)^2}{m_i}\)

  • where \(x_i\) is the observed frequency and \(m_i\) is the expected frequency for category \(i\) (a particular day of the week), and \(k\) is the number of categories (7 days)

  • if we expect 7 accidents per day, regardless of the day_of_the_week:

\(\chi^2_c=\frac{(8-7)^2}{7}+\frac{(5-7)^2}{7}+\cdots+\frac{(10-7)^2}{7}=\frac{48}{7}\approx6.857\)
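
The same sum can be checked in R. This is a minimal sketch using the counts from the table and the expected value of 7 per day assumed above; the vector names are illustrative.

```r
observed <- c(Mon = 8, Tue = 5, Wed = 5, Thu = 6, Fri = 9, Sat = 12, Sun = 10)
expected <- rep(7, length(observed))   # 7 accidents per day, as in the table

# Chi-square statistic: sum of squared deviations scaled by expected counts
chi_sq_c <- sum((observed - expected)^2 / expected)
critical_value <- qchisq(0.95, df = length(observed) - 1)

chi_sq_c                    # 48 / 7, about 6.857
chi_sq_c < critical_value   # TRUE: below the critical value, so H0 is not rejected
```

Note that base R's `chisq.test(observed)` derives its expected counts from the observed total (55 / 7 per day here) rather than using 7, so its statistic will not match the hand calculation exactly.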

\(\chi^2\) plot with calculated values

Code for \(\chi^2\) plot with calculated values

# compacted for better display on slides
library(ggplot2)                                          # needed for ggplot()
df <- 6
alpha <- 0.05
critical_value <- qchisq(1 - alpha, df)                   # chisq funs come with base R!
calculated_value <- 6.857                                 # 48 / 7, from the sum above

x <- seq(0, 20, length.out = 1000)                        # length.out makes 
y <- dchisq(x, df)                                        # curve much smoother
data <- data.frame(x, y)                                  # make x, y into dataframe 

fig2 <- ggplot(data, aes(x = x, y = y)) +
  geom_line(color = "blue", linewidth = 0.5) +            # Chi-square curve
  geom_ribbon(data = subset(data, x >= critical_value),   # fill area
              aes(ymax = y, ymin = 0), 
              fill = "green", alpha = 0.2) +
  geom_ribbon(data = subset(data, x <= calculated_value), # fill area
              aes(ymax = y, ymin = 0), 
              fill = "red", alpha = 0.2) +
  geom_vline(xintercept = critical_value,                 # critical value line
             color = "green", linetype = "dashed", linewidth = 0.5) +
  geom_vline(xintercept = calculated_value,               # our chi-square value line
             color = "red", linetype = "dashed", linewidth = 0.5) +
  labs(title = "Chi-Square Distribution (df=6) with CV corresponding to a=0.05",
       x = "Chi-Square Value",
       y = "Density")