Testing whether how a drive starts (drive_start2) and whether it ends in a touchdown (drive_end2) are associated:
chisq_test(x = drives$drive_start2, y = drives$drive_end2)
## # A tibble: 1 × 6
## n statistic p df method p.signif
## * <int> <dbl> <dbl> <int> <chr> <chr>
## 1 4152 18.4 0.000018 1 Chi-square test ****
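If you want a quick peek at the observed 2x2 table behind this test, base R’s table() will show it:
# Observed counts of drive_start2 by drive_end2
table(drives$drive_start2, drives$drive_end2)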
Let’s include another variable: yds_to_td -> how far away the drive starts from the endzone (the closer to the endzone, the easier it should be to score a TD).
Since we already tested if \(X\) (drive_start2) and \(Y\) (drive_end2) are associated, let’s check to see if \(X\) and \(Y\) are both independent of \(Z\) (yds_to_td).
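In symbols, the joint independence hypothesis we’ll test is
\[H_0: \pi_{ijk} = \pi_{ij\bullet}\,\pi_{\bullet\bullet k} \quad \text{for all } i, j, k\]
where \(i\) indexes drive_start2, \(j\) indexes drive_end2, and \(k\) indexes yds_to_td.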
Creating a summarized data set for yds_to_td, drive_start2, and drive_end2:
drive_sum <-
drives |>
count(yds_to_td, drive_start2, drive_end2) |>
# Calculating the proportions: n/sum(n)
mutate(sat_prop = n/sum(n))
drive_sum
## yds_to_td drive_start2 drive_end2 n sat_prop
## 1 < 25 Kick No TD 8 0.001926782
## 2 < 25 Kick TD 11 0.002649326
## 3 < 25 Turnover No TD 32 0.007707129
## 4 < 25 Turnover TD 39 0.009393064
## 5 25 - 50 Kick No TD 75 0.018063584
## 6 25 - 50 Kick TD 44 0.010597303
## 7 25 - 50 Turnover No TD 86 0.020712909
## 8 25 - 50 Turnover TD 56 0.013487476
## 9 50 - 75 Kick No TD 658 0.158477842
## 10 50 - 75 Kick TD 258 0.062138728
## 11 50 - 75 Turnover No TD 81 0.019508671
## 12 50 - 75 Turnover TD 29 0.006984586
## 13 >= 75 Kick No TD 2096 0.504816956
## 14 >= 75 Kick TD 584 0.140655106
## 15 >= 75 Turnover No TD 78 0.018786127
## 16 >= 75 Turnover TD 17 0.004094412
We’ll also need \(I\), \(J\), and \(K\), the number of groups for each of our 3 variables:
# I = number of different ways a drive can start (Kick/Turnover) = 2
I <- n_distinct(drives$drive_start2)
# J = number of different ways a drive can end (TD/No TD) = 2
J <- n_distinct(drives$drive_end2)
# K = number of different distances a drive can start from the endzone = 4
K <- n_distinct(drives$yds_to_td)
To test joint independence (JI) vs the full association model (FA), we need to calculate the expected proportions assuming drive_start2 and drive_end2 are both independent of yds_to_td:
\[\hat{\pi}_{ijk,0} = \frac{n_{ij\bullet}}{N} \frac{n_{\bullet\bullet k}}{N}\]
and the degrees of freedom will be:
\[df = r_1 - r_0 = (IJK - 1) - (IJ + K - 2) = (2(2)(4) - 1) - (2(2) + 4 - 2) = 15 - 6 = 9\]
We could work it out “by hand” like we needed to do for Complete Independence, but if we want to use a canned function, the simplest way to test for joint independence is to combine the two variables that we aren’t claiming are independent together into one variable!
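Working it out “by hand” might look something like the sketch below (object names like ji_by_hand, X2, and df_JI are just for illustration; it assumes dplyr is loaded and uses drive_sum, I, J, and K from above):
# A sketch of the "by hand" route, using drive_sum, I, J, K from above
ji_by_hand <-
  drive_sum |>
  # n_ij. / N: joint proportion for each drive_start2 x drive_end2 combination
  group_by(drive_start2, drive_end2) |>
  mutate(prop_ij = sum(n) / sum(drive_sum$n)) |>
  # n_..k / N: marginal proportion for each yds_to_td group
  group_by(yds_to_td) |>
  mutate(prop_k = sum(n) / sum(drive_sum$n)) |>
  ungroup() |>
  # Expected proportions and counts assuming joint independence
  mutate(
    ji_prop  = prop_ij * prop_k,
    ji_count = ji_prop * sum(n)
  )

# Pearson test statistic: sum of (observed - expected)^2 / expected
X2 <- sum((ji_by_hand$n - ji_by_hand$ji_count)^2 / ji_by_hand$ji_count)
X2

# Degrees of freedom from the formula above: (IJK - 1) - (IJ + K - 2)
df_JI <- (I * J * K - 1) - (I * J + K - 2)

# p-value from a chi-square distribution with df_JI degrees of freedom
pchisq(X2, df = df_JI, lower.tail = FALSE)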
For the canned-function route, we’ll create a variable called start_end that tells us how the drive started and ended!
drives <-
drives |>
mutate(start_end = paste(drive_start2, drive_end2, sep = ":"))
drives |>
head(n = 10)
## drive_id drive_start drive_end drive_start2 drive_end2 yds_to_td
## 1 2018_02_DET_SF_8 Fumble Field Goal Turnover No TD < 25
## 2 2019_06_HOU_KC_3 Fumble Field Goal Turnover No TD < 25
## 3 2019_16_GB_MIN_2 Fumble Field Goal Turnover No TD < 25
## 4 2016_09_ATL_TB_4 Fumble Field Goal Turnover No TD < 25
## 5 2020_03_TEN_MIN_5 Fumble Field Goal Turnover No TD < 25
## 6 2018_10_DET_CHI_13 Fumble Field Goal Turnover No TD < 25
## 7 2018_04_NO_NYG_7 Fumble Field Goal Turnover No TD < 25
## 8 2017_04_TEN_HOU_23 Fumble Field Goal Turnover No TD < 25
## 9 2021_05_CLE_LAC_10 Fumble Field Goal Turnover No TD < 25
## 10 2019_09_CLE_DEN_10 Fumble Field Goal Turnover No TD < 25
## start_end
## 1 Turnover:No TD
## 2 Turnover:No TD
## 3 Turnover:No TD
## 4 Turnover:No TD
## 5 Turnover:No TD
## 6 Turnover:No TD
## 7 Turnover:No TD
## 8 Turnover:No TD
## 9 Turnover:No TD
## 10 Turnover:No TD
Once we’ve done that, we can conduct our hypothesis test by using chisq_test() from the rstatix package like we did earlier in this document:
JI_vs_sat <- chisq_test(x = drives$start_end, y = drives$yds_to_td)
## Warning in stats::chisq.test(x, y, correct = correct, p = p, rescale.p =
## rescale.p, : Chi-squared approximation may be incorrect
JI_vs_sat
## # A tibble: 1 × 6
## n statistic p df method p.signif
## * <int> <dbl> <dbl> <int> <chr> <chr>
## 1 4152 1328. 3.46e-280 9 Chi-square test ****
Now we have:
test statistic \(\approx 1328\)
degrees of freedom = \(9\)
p-value \(\approx 0\)
So we have very, very strong evidence that at least one of the two variables (drive_start2 or drive_end2) is also associated with yds_to_td!
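As a quick sanity check, the canned statistic should agree (up to rounding) with the \(X^2\) from the by-hand sketch above (this assumes the X2 object created there):
# Compare the hand-calculated statistic to the one from chisq_test()
c(by_hand = X2, canned = JI_vs_sat$statistic)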
Note that the test statistic is smaller than our test statistic for complete independence. That’s because the more complicated model (JI) will always fit the data better than the simpler model (CI), and will be closer to the observed proportions (saturated)!
We can use the same “trick” to calculate the likelihood ratio \(G\)-test using the GTest() function in the DescTools package:
JI_vs_sat_Gtest <- DescTools::GTest(x = drives$start_end, y = drives$yds_to_td)
JI_vs_sat_Gtest
##
## Log likelihood ratio (G-test) test of independence without correction
##
## data: drives$start_end and drives$yds_to_td
## G = 797.97, X-squared df = 9, p-value < 2.2e-16
While the test statistic is noticeably smaller than Pearson’s test stat, it is still very large and indicates that how a drive starts and if it ends in a TD are not independent of where the drive starts!
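For reference, the \(G\) statistic compares the observed counts to the same expected counts via
\[G = 2 \sum_{i,j,k} n_{ijk} \ln\!\left(\frac{n_{ijk}}{\hat{\mu}_{ijk,0}}\right)\]
where \(\hat{\mu}_{ijk,0} = N \hat{\pi}_{ijk,0}\) are the expected counts under joint independence. Using the objects from the earlier by-hand sketch (an illustration only, not needed for GTest()):
# G computed from the by-hand expected counts; it should be close to GTest()'s G
2 * sum(ji_by_hand$n * log(ji_by_hand$n / ji_by_hand$ji_count))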
We should still check to see if our sample size is large enough. We can use the expected_freq() function to calculate our expected counts for us:
expected_freq(JI_vs_sat) |>
round(digits = 1) |>
# Converting from a table to a data frame
as.data.frame() |>
# changing the row names to a column then stacking the columns of exp
rownames_to_column('XY') |>
pivot_longer(
cols = -XY,
names_to = 'yds_to_go',
values_to = 'expected_count'
) |>
# arranging from smallest to largest expected count
arrange(expected_count)
## # A tibble: 16 × 3
## XY yds_to_go expected_count
## <chr> <chr> <dbl>
## 1 Turnover:TD < 25 3.1
## 2 Turnover:No TD < 25 6
## 3 Turnover:TD 25 - 50 8.9
## 4 Turnover:No TD 25 - 50 17.4
## 5 Kick:TD < 25 19.4
## 6 Turnover:TD 50 - 75 34.8
## 7 Kick:TD 25 - 50 56.4
## 8 Kick:No TD < 25 61.5
## 9 Turnover:No TD 50 - 75 68.4
## 10 Turnover:TD >= 75 94.2
## 11 Kick:No TD 25 - 50 178.
## 12 Turnover:No TD >= 75 185.
## 13 Kick:TD 50 - 75 222.
## 14 Kick:TD >= 75 600.
## 15 Kick:No TD 50 - 75 701.
## 16 Kick:No TD >= 75 1896.
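If you just want the headline numbers rather than the full table, a quick check could look like this (a sketch using JI_vs_sat from above):
# Smallest expected count under joint independence
min(expected_freq(JI_vs_sat))

# Number of cells with an expected count below 5
sum(expected_freq(JI_vs_sat) < 5)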
Like last time, we have 1 expected count less than 5, indicating that a \(\chi^2\) distribution might not be appropriate to find the p-value.
But again, with such a large test statistic, we can be confident that at least one of drive_start2 or drive_end2 is associated with yds_to_td.
So it appears that all 3 variables have an association!
We’ll use the std_residuals() function to find the standardized residuals assuming that the two variables are jointly independent and then create a heat map of the differences, like we did last time:
std_residuals(JI_vs_sat) |>
data.frame() |>
rename(
std_res = Freq,
yds_to_td = y
) |>
# Separate will split apart the two variables that we combined together
separate(
col = x, # Column to split
into = c("drive_start", "drive_end"), # name of the new columns
sep = ":" # Where to split the column
) |>
# Creating the heat map using ggplot and geom_tile()
ggplot(
mapping = aes(
x = drive_end,
y = drive_start,
fill = std_res
)
) +
geom_tile(color = "white") +
geom_text(mapping = aes(label = round(std_res, digits = 2))) +
facet_wrap(facets = vars(yds_to_td)) +
# Removes the buffer around the plot
coord_cartesian(expand = F) +
scale_fill_gradient2(
low = "red",
mid = "white",
high = "blue",
midpoint = 0
) +
labs(
x = "How a drive ends",
y = "How a drive begins",
fill = "z-score for \nFA vs JI models"
)