Motivating Question: Does how an NFL drive ends differ based on how the drive begins?
Now we can summarize our results in an \(I \times J\) table. Specifically, a \(4 \times 4\) table, since how a drive starts and how it ends each have 4 possible outcomes.
Let’s change the order of the groups using factor(levels = c(...)):
drives <- 
  drives |> 
  mutate(drive_start = factor(drive_start,
                              levels = c("Punt", "Kickoff", "Interception", "Fumble")), # Reordering the groups
         # And let's reorder drive end again as well
         drive_end = factor(drive_end,
                            levels = c("Turnover", "Punt", "Field Goal", "Touchdown"),
                            ordered = TRUE))

# Look at our resulting 4x4 table
drive_freq <- 
  xtabs(formula = ~ drive_start + drive_end,
        data = drives)
# Make the table look nice
drive_freq |> 
  data.frame() |> 
  pivot_wider(names_from = "drive_end",
              values_from = "Freq") |> 
  rename(`Drive Start` = drive_start) |> 
  gt() |> 
  cols_align(align = "center") |> 
  gt_add_divider(columns = everything(),
                 color = "black",
                 sides = "all") |> 
  tab_header(title = "Drive Result")
Drive Result

| Drive Start  | Turnover | Punt | Field Goal | Touchdown |
|--------------|----------|------|------------|-----------|
| Punt         | 356      | 1074 | 432        | 537       |
| Kickoff      | 459      | 1238 | 496        | 663       |
| Interception | 40       | 106  | 107        | 114       |
| Fumble       | 60       | 138  | 120        | 125       |
One measure of association, Cramér's V, is calculated from the test statistic of Pearson's Test of Independence:
\[V = \sqrt{\frac{\chi^2/N}{\min(I-1, J-1)}}\]
Like an \(R^2\) value, it will be somewhere between 0 and 1: 0 indicates no association and 1 indicates a perfect association.
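As a quick sanity check of the formula, here is a minimal by-hand sketch (assuming the drive_freq table created above); the result should match the cramersv() output shown next.

chi_sq <- chisq.test(drive_freq, correct = FALSE)$statistic  # Pearson's chi-squared statistic
N      <- sum(drive_freq)                                    # total number of drives
k      <- min(nrow(drive_freq), ncol(drive_freq)) - 1        # min(I - 1, J - 1)
unname(sqrt((chi_sq / N) / k))                               # Cramer's V, about 0.072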
The cramersv() function in the confintr package will calculate Cramér's V for us; it accepts either the two-way table created by xtabs() or, as below, the two relevant columns of the data.
pacman::p_load(confintr)
confintr::cramersv(drives[ , c('drive_start', 'drive_end')])
## [1] 0.07201795
We have a value of about 0.072, indicating a fairly weak association. But why is the association weak when we had very strong evidence of an association?
The hypothesis test only tries to answer the question of whether an association exists; it doesn't measure how strong that association is.
While the p-value is very small, we also have an incredibly large sample size. So we have very strong evidence that an association exists, but not necessarily evidence that it is a strong association!
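To see this sample-size effect, here is an illustrative sketch (not part of the original analysis): shrinking the table to roughly a fifth of the data while keeping the same proportions leaves Cramér's V roughly unchanged, but the p-value becomes far less impressive.

small_freq <- round(drive_freq / 5)              # same proportions, about 1/5 the sample size

chisq.test(drive_freq, correct = FALSE)$p.value  # full data: tiny p-value
chisq.test(small_freq, correct = FALSE)$p.value  # smaller sample: much larger p-value

confintr::cramersv(small_freq)                   # still roughly 0.07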
We can create a confidence interval for V by using ci_cramersv() instead of cramersv():
confintr::ci_cramersv(drives[ , c('drive_start', 'drive_end')])
##
## Two-sided 95% chi-squared confidence interval for the population
## Cramer's V
##
## Sample estimate: 0.07201795
## Confidence interval:
## 2.5% 97.5%
## 0.05837129 0.08661868
The dissimilarity index is a measure that tells us how many cases are in the “wrong” cell if a particular model is true.
For example, if the DI is 0.10, then 10% of the cases would need to be moved into a different \(Y\) group in order for the model to have a perfect fit.
The dissimilarity index formula is:
\[DI = \sum_{i=1}^{I}\sum_{j=1}^{J} \frac{\left|n_{ij} - \frac{n_{i\bullet} \times n_{\bullet j}}{N}\right|}{2N}\]
It is the absolute value of the difference between the observed and expected counts, summed across all cells and divided by \(2N\).
(We divide by 2 because each misplaced case gets counted twice: once in the cell it would be moved out of and once in the cell it would be moved into.)
There isn’t a function to calculate it in R, but we can calculate it pretty quickly by hand. We first need to use chisq_test() and expected_freq() from rstatix again.
drives_exp <- 
  chisq_test(x = drive_freq,
             correct = FALSE) |> 
  expected_freq()

drives_exp
## drive_end
## drive_start Turnover Punt Field Goal Touchdown
## Punt 361.92663 1011.0213 456.85820 569.19390
## Kickoff 430.87222 1203.6168 543.88788 677.62308
## Interception 55.36768 154.6664 69.89035 87.07552
## Fumble 66.83347 186.6955 84.36356 105.10750
From there, we can take our observed counts table, subtract the expected counts table, and calculate the DI:
# Sum of |observed - expected| / 2 = number of misplaced cases
wrong_cell_count <- 
  abs(drive_freq/2 - drives_exp/2) |> 
  sum()

wrong_cell_count
## [1] 245.0528
So the independence model places about 245 cases in the wrong \(X, Y\) category. We can convert this to a proportion by dividing by the total sample size:
wrong_cell_count/sum(drive_freq)
## [1] 0.04040441
And we get a dissimilarity index of about 4%, meaning the independence model mispredicts about 4% of the drives in the data. That's not a very large percentage, indicating that the independence model isn't that wrong!
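If we wanted to reuse this calculation, we could wrap the steps above in a small helper of our own. Below is a minimal sketch using only base R; dissimilarity_index() is a hypothetical name, not a function from any package.

# Hypothetical helper: dissimilarity index for a two-way table under the independence model
dissimilarity_index <- function(tab) {
  expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)  # n_i. * n_.j / N
  sum(abs(tab - expected)) / (2 * sum(tab))                 # sum of |obs - exp| over 2N
}

dissimilarity_index(drive_freq)  # should be about 0.04, matching the result above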