We can create a three-way table of our three variables using
xtabs()
formula = ~ X + Y + Z
data =
the data setdrives_table_3 <-
xtabs(
formula = ~ drive_start2 + drive_end2 + yds_to_td,
data = drives
)
drives_table_3
## , , yds_to_td = < 25
##
## drive_end2
## drive_start2 No TD TD
## Kick 8 11
## Turnover 32 39
##
## , , yds_to_td = 25 - 50
##
## drive_end2
## drive_start2 No TD TD
## Kick 75 44
## Turnover 86 56
##
## , , yds_to_td = 50 - 75
##
## drive_end2
## drive_start2 No TD TD
## Kick 658 258
## Turnover 81 29
##
## , , yds_to_td = >= 75
##
## drive_end2
## drive_start2 No TD TD
## Kick 2096 584
## Turnover 78 17
You can convert the partial tables created by xtabs()
to
a flat table using ftable(...)
ftable(drives_table_3)
## yds_to_td < 25 25 - 50 50 - 75 >= 75
## drive_start2 drive_end2
## Kick No TD 8 75 658 2096
## TD 11 44 258 584
## Turnover No TD 32 86 81 78
## TD 39 56 29 17
Testing to see if how a drive starts (drive_start2
) and
if it ends in a touchdown (drive_end2
) are associated:
chisq_test(x = drives$drive_start2, y = drives$drive_end2)
## # A tibble: 1 × 6
## n statistic p df method p.signif
## * <int> <dbl> <dbl> <int> <chr> <chr>
## 1 4152 18.4 0.000018 1 Chi-square test ****
Let’s include another variable: yds_to_td
-> how far
away the drive starts from the endzone (closer to the endzone the easier
it should be to score a TD)
Creating a summarized data set for yds_to_td
,
drive_start2
, drive_end2
and create the joint
proportions: \(p_{ijk}\)
drive_sum <-
drives |>
count(yds_to_td, drive_start2, drive_end2) |>
# Calculating the proportions: n/sum(n)
mutate(FA_prop = n/sum(n))
drive_sum
## yds_to_td drive_start2 drive_end2 n FA_prop
## 1 < 25 Kick No TD 8 0.001926782
## 2 < 25 Kick TD 11 0.002649326
## 3 < 25 Turnover No TD 32 0.007707129
## 4 < 25 Turnover TD 39 0.009393064
## 5 25 - 50 Kick No TD 75 0.018063584
## 6 25 - 50 Kick TD 44 0.010597303
## 7 25 - 50 Turnover No TD 86 0.020712909
## 8 25 - 50 Turnover TD 56 0.013487476
## 9 50 - 75 Kick No TD 658 0.158477842
## 10 50 - 75 Kick TD 258 0.062138728
## 11 50 - 75 Turnover No TD 81 0.019508671
## 12 50 - 75 Turnover TD 29 0.006984586
## 13 >= 75 Kick No TD 2096 0.504816956
## 14 >= 75 Kick TD 584 0.140655106
## 15 >= 75 Turnover No TD 78 0.018786127
## 16 >= 75 Turnover TD 17 0.004094412
We’ll also need \(I\), \(J\), and \(K\), the number of groups for each of our 3 variables:
# I = number of different ways a drive can start (Kick/Turnover) = 2
I <- n_distinct(drives$drive_start2)
# J = number of different ways a drive can end (TD/No TD) = 2
J <- n_distinct(drives$drive_end2)
# K = number of different distances a drive can start from the endzone = 4
K <- n_distinct(drives$yds_to_td)
Are all three variables independent of one another?
To test complete independence vs the full association model, we need to calculate the expected proportions assuming all three variables are independent:
\[\hat{\pi}_{ijk,0} = \frac{n_{i\bullet\bullet}}{N} \frac{n_{\bullet j\bullet}}{N} \frac{n_{\bullet\bullet k}}{N}\]
So we need to find the total counts for each group of each variable separately. Not difficult, but it is a little tedious to do.
drive_CI <-
drive_sum |>
# Start by adding the totals for each yds_to_td group
mutate(
.by = yds_to_td,
yds_n = sum(n)
) |>
# Next, find the totals for each drive_start2 group:
mutate(
.by = drive_start2,
start_n = sum(n)
) |>
# Finally, count for each group of drive_end2:
mutate(
.by = drive_end2,
end_n = sum(n)
) |>
# Now we can calculate the expected proportion for each outcome assuming complete independence
mutate(
CI_prop = yds_n/sum(n) * start_n/sum(n) * end_n/sum(n)
) |>
# Dropping the count columns because we don't need them in the future:
dplyr::select(-yds_n, -start_n, -end_n)
drive_CI
## yds_to_td drive_start2 drive_end2 n FA_prop CI_prop
## 1 < 25 Kick No TD 8 0.001926782 0.014620539
## 2 < 25 Kick TD 11 0.002649326 0.004873513
## 3 < 25 Turnover No TD 32 0.007707129 0.001636686
## 4 < 25 Turnover TD 39 0.009393064 0.000545562
## 5 25 - 50 Kick No TD 75 0.018063584 0.042399564
## 6 25 - 50 Kick TD 44 0.010597303 0.014133188
## 7 25 - 50 Turnover No TD 86 0.020712909 0.004746389
## 8 25 - 50 Turnover TD 56 0.013487476 0.001582130
## 9 50 - 75 Kick No TD 658 0.158477842 0.166674150
## 10 50 - 75 Kick TD 258 0.062138728 0.055558050
## 11 50 - 75 Turnover No TD 81 0.019508671 0.018658220
## 12 50 - 75 Turnover TD 29 0.006984586 0.006219407
## 13 >= 75 Kick No TD 2096 0.504816956 0.450799966
## 14 >= 75 Kick TD 584 0.140655106 0.150266655
## 15 >= 75 Turnover No TD 78 0.018786127 0.050464485
## 16 >= 75 Turnover TD 17 0.004094412 0.016821495
Now that we have the expected proportions (CI model) and the observed proportions (FA model), we can calculate our test statistic!
CI_FA_test <-
drive_CI |>
# Calculating the individual pieces of our test statistics (chi^2 and G)
mutate(
zi2 = (FA_prop - CI_prop)^2/CI_prop,
gi = n*log(FA_prop/CI_prop)
) |>
# Adding the individual pieces to get the test statistic:
summarize(
chi2 = sum(n)*sum(zi2),
lrt_g = 2*sum(gi)
) |>
# Changing the results from being stored in separate columns to the same column
pivot_longer(
cols = chi2:lrt_g,
names_to = "test",
values_to = "stat"
)
CI_FA_test
## # A tibble: 2 × 2
## test stat
## <chr> <dbl>
## 1 chi2 1553.
## 2 lrt_g 816.
Both test statistics appear to be very, very large (probably large enough that we can go ahead and reject the null hypothesis). But we should also calculate the degrees of freedom and find the p-value, just to be sure
The degrees of freedom are
\[IJK - 1 - (I + J + K - 3) =\\ 2\times2\times4 - 1 -(2 + 2 + 4 - 3) = \\15 - 5 = \\10\]
# The number of unique proportions needed for the FA model
r1 <- I*J*K - 1
# The number of unique proportions needed for the CI model
r0 <- I + J + K - 3
# The degrees of freedom: r1 - r0
df_CI <- r1 - r0
df_CI
## [1] 10
Now that we have the degrees of freedom, let’s calculate the p-values!
#CI_FA_test <-
CI_FA_test |>
mutate(p_val = pchisq(stat, df = df_CI, lower = F))
## # A tibble: 2 × 3
## test stat p_val
## <chr> <dbl> <dbl>
## 1 chi2 1553. 0
## 2 lrt_g 816. 8.30e-169
We get p-values that are essentially 0, indicating that at least 2 of the 3 variables are associated.
However, for the results to hold, we need to make sure our sample size is large enough to conduct a test. How large of a sample do we need?
Any of these \(\chi^2\) tests (or LRT \(G\) tests) all have the same sample size requirement: that \(N\) times the expected proportions are all at least 5. We’ll have 16 expected counts, but we only need to focus on the smallest one!
# Checking to see if our sample size is large enough:
drive_CI |>
mutate(n_CI = sum(n)*CI_prop) |>
arrange(n_CI)
## yds_to_td drive_start2 drive_end2 n FA_prop CI_prop n_CI
## 1 < 25 Turnover TD 39 0.009393064 0.000545562 2.265173
## 2 25 - 50 Turnover TD 56 0.013487476 0.001582130 6.569003
## 3 < 25 Turnover No TD 32 0.007707129 0.001636686 6.795520
## 4 25 - 50 Turnover No TD 86 0.020712909 0.004746389 19.707009
## 5 < 25 Kick TD 11 0.002649326 0.004873513 20.234827
## 6 50 - 75 Turnover TD 29 0.006984586 0.006219407 25.822977
## 7 25 - 50 Kick TD 44 0.010597303 0.014133188 58.680997
## 8 < 25 Kick No TD 8 0.001926782 0.014620539 60.704480
## 9 >= 75 Turnover TD 17 0.004094412 0.016821495 69.842847
## 10 50 - 75 Turnover No TD 81 0.019508671 0.018658220 77.468931
## 11 25 - 50 Kick No TD 75 0.018063584 0.042399564 176.042991
## 12 >= 75 Turnover No TD 78 0.018786127 0.050464485 209.528540
## 13 50 - 75 Kick TD 258 0.062138728 0.055558050 230.677023
## 14 >= 75 Kick TD 584 0.140655106 0.150266655 623.907153
## 15 50 - 75 Kick No TD 658 0.158477842 0.166674150 692.031069
## 16 >= 75 Kick No TD 2096 0.504816956 0.450799966 1871.721460
There is one outcome with an expected count less than 5, so our p-values may not be completely accurate. But since the test statistic is very large compared to the degrees of freedom, we can be confident that we reached the right decision, even if we can’t be confident the p-values are that accurate.
Let’s create another heat map, but let’s create a separate heat map
for each outcome of our control variable, yds_to_td
:
drive_CI |>
# Calculating the Standardized z-score for each of the 16 groups:
mutate(
zi = (FA_prop - CI_prop)/sqrt(CI_prop * (1-CI_prop)/sum(n))
) |>
# Creating the heat map using ggplot and geom_tile()
ggplot(
mapping = aes(
x = drive_end2,
y = drive_start2,
fill = zi
)
) +
geom_tile(color = "white") +
geom_text(mapping = aes(label = round(zi, digits = 2))) +
facet_wrap(facets = vars(yds_to_td)) +
# Removes the buffer around the plot
coord_cartesian(expand = F) +
scale_fill_gradient2(
low = "red",
mid = "white",
high = "blue",
midpoint = 0
) +
labs(
x = "How a drive ends",
y = "How a drive begins",
fill = "z-score for \nFA vs CI models"
)