Testing to see if how a drive starts (drive_start2
) and
if it ends in a touchdown (drive_end2
) are associated:
chisq_test(x = drives$drive_start2, y = drives$drive_end2)
## # A tibble: 1 × 6
## n statistic p df method p.signif
## * <int> <dbl> <dbl> <int> <chr> <chr>
## 1 4152 18.4 0.000018 1 Chi-square test ****
Let’s include another variable: yds_to_td
-> how far
away the drive starts from the endzone (closer to the endzone the easier
it should be to score a TD)
Since we already tested if \(X\)
(drive_start2
) and \(Y\)
(drive_end2
) are associated, let’s check to see if \(X\) and \(Y\) are both independent of \(Z\) (yds_to_td
)
Let’s test to see if how a drive begins (drive_start2
)
and if the drive results in a touchdown (drive_end2
) are
independent, if we know where on the field the drive began
(yds_to_td
)
Now we are testing to see if \(X\) and \(Y\) are independent, conditioning on \(Z\)
The expected proportions for the outcomes of \(X\), \(Y\), and \(Z\) assuming conditional independence on \(Z\) is:
\[\hat{\pi}_{ijk,0} = P(X = i|Z = k) P(Y = j|Z = k) P(Z = k) = \frac{n_{i\bullet k}}{n_{\bullet\bullet k}} \frac{n_{\bullet jk}}{n_{\bullet\bullet k}} \frac{n_{\bullet\bullet k}}{N}\]
The number of parameters for the conditional independence model is:
\[(I + J - 1)K - 1\]
and the degrees of freedom of the test will be:
\[df = r_1 - r_0 = \\IJK - 1 -(I + J -1)K - 1 = \\2(2)(4) - 1 -(2 + 2 -1)4 - 1 = \\15 - 11 =\\ 4\]
chisq_test()
We can shortcut calculating our test statistic by performing a \(\chi^2\) test between \(X\) and \(Y\) for each group of \(Z\), then adding their individual test statistics together!
We can see the four different \(2\times 2\) tables below:
xtabs(formula = ~ drive_start2 + drive_end2 + yds_to_td, data = drives)
## , , yds_to_td = < 25
##
## drive_end2
## drive_start2 No TD TD
## Kick 8 11
## Turnover 32 39
##
## , , yds_to_td = 25 - 50
##
## drive_end2
## drive_start2 No TD TD
## Kick 75 44
## Turnover 86 56
##
## , , yds_to_td = 50 - 75
##
## drive_end2
## drive_start2 No TD TD
## Kick 658 258
## Turnover 81 29
##
## , , yds_to_td = >= 75
##
## drive_end2
## drive_start2 No TD TD
## Kick 2096 584
## Turnover 78 17
xtabs(formula = ~ yds_to_td + drive_start2 + drive_end2, data = drives) |>
ftable()
## drive_end2 No TD TD
## yds_to_td drive_start2
## < 25 Kick 8 11
## Turnover 32 39
## 25 - 50 Kick 75 44
## Turnover 86 56
## 50 - 75 Kick 658 258
## Turnover 81 29
## >= 75 Kick 2096 584
## Turnover 78 17
Using summarize()
and chisq_test()
, we can
perform a separate \(\chi^2\) for \(X\) and \(Y\) at each level of \(Z\):
partial_chisq_tests <-
drives |>
# Calculating the test statistic, df, and p-value
# for each individual partial table of yds_to_td
summarize(
.by = yds_to_td,
test_stat = chisq_test(drive_start2, drive_end2)$statistic,
df = chisq_test(drive_start2, drive_end2)$df,
p_val = chisq_test(drive_start2, drive_end2)$p
)
partial_chisq_tests
## yds_to_td test_stat df p_val
## 1 < 25 6.726077e-31 1 1.000
## 2 >= 75 6.073278e-01 1 0.436
## 3 25 - 50 7.819726e-02 1 0.780
## 4 50 - 75 8.151203e-02 1 0.775
Examining each partial table individually, no evidence that how the drive ends depends on how the drive begins at any level of yards to touchdown.
But we don’t want to conduct 4 separate tests, but one big “omnibus” test. How?
Just combine the test statistics of each partial table!
partial_chisq_tests |>
# Adding the partial test stats and df together
summarize(
test_stat = sum(test_stat),
df = sum(df)
) |>
# calculating the p-value
mutate(
p_val = pchisq(test_stat, df = df, lower.tail = F)
)
## test_stat df p_val
## 1 0.7670371 4 0.9428118
Out test statistic is 0.767, which is small, and our p-value is close to 1, indicating that we don’t have evidence that drive start and drive end are associated, once we account for the starting location of the drive.
What that tells us is, while drive_start2 and drive_end2 are associated, they are related because of their association with yds_to_td.
Because a drive that begins from a turnover is more likely to be closer to the endzone, it is also more likely to end with a touchdown.
But if a drive starts at the same location, the probability it ends with a touchdown is the same if it began from a kickoff/punt or a turnover!
If we wanted to conduct a likelihood ratio G-test instead of
Pearson’s \(\chi^2\) test, we can
repeat the code chunk named “partial_chi_squareds”, but replace
chisq_test()
with GTest()
in
DescTools
and $p
with $p.value
and $df
with $parameter
library(DescTools)
partial_G_tests <-
drives |>
# Calculating the test statistic, df, and p-value
# for each individual partial table
summarize(
.by = yds_to_td,
test_stat = GTest(drive_start2, drive_end2)$statistic,
df = GTest(drive_start2, drive_end2)$parameter,
p_val = GTest(drive_start2, drive_end2)$p.value
)
partial_G_tests
## yds_to_td test_stat df p_val
## 1 < 25 0.05352689 1 0.8170361
## 2 >= 75 0.85764542 1 0.3543984
## 3 25 - 50 0.16616331 1 0.6835443
## 4 50 - 75 0.16002587 1 0.6891327
# test stat and p-value
partial_G_tests |>
summarize(
test_stat = sum(test_stat),
df = sum(df)
) |>
mutate(
p_val = pchisq(test_stat, df, lower.tail = F)
)
## test_stat df p_val
## 1 1.237361 4 0.8719098
We end up with similar results!
Unfortunately, there isn’t a simple way to determine if any expected counts are below 5 using the methods above, which means we have to do it in a complicated way:
drives |>
summarize(
.by = yds_to_td,
below_5 = sum(chisq.test(drive_start2, drive_end2)$expected <5)
)
## yds_to_td below_5
## 1 < 25 0
## 2 >= 75 0
## 3 25 - 50 0
## 4 50 - 75 0
All of our expected counts for all 16 groups are at least 5, so we meet our sample size requirement!