Motivating Question: Does how an NFL drive ends differ based on how the drive begins?

Now we can summarize our results in an \(I \times J\) table. Specifically, a \(4 \times 4\) table, since how a drive starts and how it ends each have 4 different outcomes.

Let’s change the order of the groups using factor(levels = c(...))

drives <- 
  drives |> 
  mutate(drive_start = factor(drive_start,
                              levels = c("Punt", "Kickoff", "Interception", "Fumble")), # Reordering the groups
         
         # And let's reorder drive end again as well
         drive_end = factor(drive_end, 
                            levels = c("Turnover", "Punt", "Field Goal", "Touchdown"),
                            ordered = TRUE))

# Look at our resulting 4x4 table
drive_freq <- 
  xtabs(formula = ~ drive_start + drive_end,
        data = drives)

# Make the table look nice
drive_freq |> 
  data.frame() |> 
  pivot_wider(names_from = "drive_end",
              values_from = "Freq") |> 
  rename(`Drive Start` = drive_start) |> 
  
  gt() |> 
  cols_align(align = "center") |> 
  gt_add_divider(columns = everything(),
                 color = "black",
                 sides = "all") |> 
  tab_header(title = "Drive Result")
Drive Result

| Drive Start  | Turnover | Punt | Field Goal | Touchdown |
|--------------|---------:|-----:|-----------:|----------:|
| Punt         |      356 | 1074 |        432 |       537 |
| Kickoff      |      459 | 1238 |        496 |       663 |
| Interception |       40 |  106 |        107 |       114 |
| Fumble       |       60 |  138 |        120 |       125 |

Cramer’s V

One measure of association, Cramer's V, uses the \(\chi^2\) test statistic from Pearson's Test of Independence, where \(N\) is the total sample size and \(I\) and \(J\) are the number of rows and columns:

\[V = \sqrt{\frac{\chi^2/N}{\min(I-1, J-1)}}\]

Like an \(R^2\) value, it will be somewhere between 0 and 1: 0 indicates no association and 1 indicates a perfect association.
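
To see the formula in action, we can sketch the calculation by hand with base R's chisq.test() and the drive_freq table created above; this is just the formula applied directly, and it should agree with the packaged calculation below.

# By-hand sketch of Cramer's V from the two-way table
chi_sq <- chisq.test(drive_freq, correct = FALSE)$statistic  # Pearson's chi-squared test statistic
N      <- sum(drive_freq)                                    # Total number of drives
V_by_hand <- sqrt((chi_sq / N) / (min(dim(drive_freq)) - 1)) # min(I-1, J-1) = 3 for a 4x4 table

unname(V_by_hand)                                            # Should be about 0.072, matching cramersv() below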

The cramersv() function in the confintr package will calculate Cramer's V for us; we can pass it either the two-way table created by xtabs() or, as below, a data frame with just the two columns of interest

pacman::p_load(confintr)

confintr::cramersv(drives[ , c('drive_start', 'drive_end')])
## [1] 0.07201795

We have a value of about 0.072, indicating a fairly weak association. But why is the association weak when we had very strong evidence of an association?

The hypothesis test only answers the question of whether an association exists; it doesn't measure how strong that association is.

While the p-value is very small, we also have an incredibly large sample size, so we have very strong evidence that an association exists, but not necessarily evidence that the association is strong!
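
One rough way to see this is to shrink every cell of the table while keeping the same proportions: Cramer's V stays essentially the same, but the evidence from the chi-squared test becomes much weaker. This is just a sketch, and the shrink factor of 5 is an arbitrary choice.

# Rough sketch: same cell proportions, much smaller sample size
small_freq <- round(drive_freq / 5)              # Shrink every cell count by a factor of 5
chisq.test(small_freq, correct = FALSE)$p.value  # A much larger p-value than before
confintr::cramersv(small_freq)                   # But Cramer's V is essentially unchanged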

Confidence Interval for V

We can create a confidence interval for V by using ci_cramersv() instead of cramersv()

confintr::ci_cramersv(drives[ , c('drive_start', 'drive_end')])
## 
##  Two-sided 95% chi-squared confidence interval for the population
##  Cramer's V
## 
## Sample estimate: 0.07201795 
## Confidence interval:
##       2.5%      97.5% 
## 0.05837129 0.08661868

Dissimilarity Index

The dissimilarity index is a measure that tells us how many cases are in the “wrong” cell if a particular model is true.

For example, if the DI is 0.10, then 10% of the cases would need to be moved into a different \(Y\) group in order for the model to have a perfect fit.

The dissimilarity index formula is:

\[DI = \sum_{i,j}^{I,J} \frac{|n_{ij} - \frac{n_{i\bullet}\times n_{\bullet j}}{N}|}{2N}\]

It is the absolute value of the difference between the observed and expected counts, summed over all cells, and divided by \(2N\).

(We divide by 2 because the sum counts each misplaced case twice: once for the cell it would move out of and once for the cell it would move into.)
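
For example, the first cell of our table (drives starting with a punt and ending in a turnover) has an observed count of 356, a row total of 2399, a column total of 915, and \(N = 6065\), so its contribution to the numerator is

\[\left|n_{11} - \frac{n_{1\bullet} \times n_{\bullet 1}}{N}\right| = \left|356 - \frac{2399 \times 915}{6065}\right| \approx |356 - 361.93| \approx 5.93\]

The DI repeats this for all 16 cells, adds up the results, and divides by \(2N\).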

There isn’t a function to calculate it in R, but we can calculate it pretty quickly by hand. We first need to use chisq_test() and expected_freq() from rstatix again.

drives_exp <- 
  chisq_test(x = drive_freq,
             correct = F) |> 
  expected_freq()

drives_exp
##               drive_end
## drive_start     Turnover      Punt Field Goal Touchdown
##   Punt         361.92663 1011.0213  456.85820 569.19390
##   Kickoff      430.87222 1203.6168  543.88788 677.62308
##   Interception  55.36768  154.6664   69.89035  87.07552
##   Fumble        66.83347  186.6955   84.36356 105.10750

From there, we can take our observed counts table, subtract the expected counts table, and calculate the DI

wrong_cell_count <- 
  abs(drive_freq/2 - drives_exp/2) |>   # |observed - expected|/2 for each of the 16 cells
  sum()                                 # Add them all up

wrong_cell_count
## [1] 245.0528

So under the independence model, about 245 cases end up in the wrong \(X, Y\) cell. We can convert this to a proportion by dividing by the total sample size:

wrong_cell_count/sum(drive_freq)
## [1] 0.04040441

And we get a dissimilarity index of about 4%, meaning the independence model mispredicts about 4% of the drives in the data. That's not a very large percentage, indicating that the independence model isn't that wrong!
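
If we wanted to reuse this calculation, one option is to wrap the steps above in a small helper function. This is just a sketch: dissimilarity_index() is our own made-up name, not a function from any package.

# Hypothetical helper: the same DI calculation as above, wrapped in a function
dissimilarity_index <- function(tab){
  expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)  # Expected counts under independence
  sum(abs(tab - expected)) / (2 * sum(tab))                 # Sum of |observed - expected| over 2N
}

dissimilarity_index(drive_freq)                             # Should match the ~0.04 found above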