Testing whether how a drive starts (drive_start2) and whether it ends in a touchdown (drive_end2) are associated:
chisq_test(x = drives$drive_start2, y = drives$drive_end2)
## # A tibble: 1 × 6
## n statistic p df method p.signif
## * <int> <dbl> <dbl> <int> <chr> <chr>
## 1 4152 18.4 0.000018 1 Chi-square test ****
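If you want a quick peek at the observed 2x2 table behind this test, base R’s table() will show it:
# Observed counts of drive_start2 by drive_end2
table(drives$drive_start2, drives$drive_end2)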
Let’s include another variable: yds_to_td -> how far away the drive starts from the endzone (the closer to the endzone, the easier it should be to score a TD).
Since we already tested if \(X\) (drive_start2) and \(Y\) (drive_end2) are associated, let’s check to see if \(X\) and \(Y\) are both independent of \(Z\) (yds_to_td).
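In symbols, the joint independence hypothesis we’ll test is
\[H_0: \pi_{ijk} = \pi_{ij\bullet}\,\pi_{\bullet\bullet k} \quad \text{for all } i, j, k\]
where \(i\) indexes drive_start2, \(j\) indexes drive_end2, and \(k\) indexes yds_to_td.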
Creating a summarized data set for yds_to_td, drive_start2, and drive_end2:
drive_sum <-
drives |>
count(yds_to_td, drive_start2, drive_end2) |>
# Calculating the proportions: n/sum(n)
mutate(sat_prop = n/sum(n))
drive_sum
## yds_to_td drive_start2 drive_end2 n sat_prop
## 1 < 25 Kick No TD 8 0.001926782
## 2 < 25 Kick TD 11 0.002649326
## 3 < 25 Turnover No TD 32 0.007707129
## 4 < 25 Turnover TD 39 0.009393064
## 5 25 - 50 Kick No TD 75 0.018063584
## 6 25 - 50 Kick TD 44 0.010597303
## 7 25 - 50 Turnover No TD 86 0.020712909
## 8 25 - 50 Turnover TD 56 0.013487476
## 9 50 - 75 Kick No TD 658 0.158477842
## 10 50 - 75 Kick TD 258 0.062138728
## 11 50 - 75 Turnover No TD 81 0.019508671
## 12 50 - 75 Turnover TD 29 0.006984586
## 13 >= 75 Kick No TD 2096 0.504816956
## 14 >= 75 Kick TD 584 0.140655106
## 15 >= 75 Turnover No TD 78 0.018786127
## 16 >= 75 Turnover TD 17 0.004094412
We’ll also need \(I\), \(J\), and \(K\), the number of groups for each of our 3 variables:
# I = number of different ways a drive can start (Kick/Turnover) = 2
I <- n_distinct(drives$drive_start2)
# J = number of different ways a drive can end (TD/No TD) = 2
J <- n_distinct(drives$drive_end2)
# K = number of different distances a drive can start from the endzone = 4
K <- n_distinct(drives$yds_to_td)
To test joint independence (JI) vs the full association model (FA), we need to calculate the expected proportions assuming drive_start2 and drive_end2 are both independent of yds_to_td:
\[\hat{\pi}_{ijk,0} = \frac{n_{ij\bullet}}{N} \frac{n_{\bullet\bullet k}}{N}\]
and the degrees of freedom will be:
\[df = r_1 - r_0 = (IJK - 1) - (IJ + K - 2) = (2(2)(4) - 1) - (2(2) + 4 - 2) = 15 - 6 = 9\]
We could work it out “by hand” like we needed to do for Complete Independence, but if we want to use a canned function, the simplest way to test for joint independence is to combine the two variables that we aren’t claiming are independent together into one variable!
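Working it out “by hand” might look something like the sketch below (object names like ji_by_hand, X2, and df_JI are just for illustration; it assumes dplyr is loaded and uses drive_sum, I, J, and K from above):
# A sketch of the "by hand" route, using drive_sum, I, J, K from above
ji_by_hand <-
  drive_sum |>
  # n_ij. / N: joint proportion for each drive_start2 x drive_end2 combination
  group_by(drive_start2, drive_end2) |>
  mutate(prop_ij = sum(n) / sum(drive_sum$n)) |>
  # n_..k / N: marginal proportion for each yds_to_td group
  group_by(yds_to_td) |>
  mutate(prop_k = sum(n) / sum(drive_sum$n)) |>
  ungroup() |>
  # Expected proportions and counts assuming joint independence
  mutate(
    ji_prop  = prop_ij * prop_k,
    ji_count = ji_prop * sum(n)
  )

# Pearson test statistic: sum of (observed - expected)^2 / expected
X2 <- sum((ji_by_hand$n - ji_by_hand$ji_count)^2 / ji_by_hand$ji_count)
X2

# Degrees of freedom from the formula above: (IJK - 1) - (IJ + K - 2)
df_JI <- (I * J * K - 1) - (I * J + K - 2)

# p-value from a chi-square distribution with df_JI degrees of freedom
pchisq(X2, df = df_JI, lower.tail = FALSE)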
For the canned-function route, we’ll create a variable called start_end that tells us how the drive started and ended!
drives <-
drives |>
mutate(start_end = paste(drive_start2, drive_end2, sep = ":"))
drives |>
head(n = 10)
## drive_id drive_start drive_end drive_start2 drive_end2 yds_to_td
## 1 2018_02_DET_SF_8 Fumble Field Goal Turnover No TD < 25
## 2 2019_06_HOU_KC_3 Fumble Field Goal Turnover No TD < 25
## 3 2019_16_GB_MIN_2 Fumble Field Goal Turnover No TD < 25
## 4 2016_09_ATL_TB_4 Fumble Field Goal Turnover No TD < 25
## 5 2020_03_TEN_MIN_5 Fumble Field Goal Turnover No TD < 25
## 6 2018_10_DET_CHI_13 Fumble Field Goal Turnover No TD < 25
## 7 2018_04_NO_NYG_7 Fumble Field Goal Turnover No TD < 25
## 8 2017_04_TEN_HOU_23 Fumble Field Goal Turnover No TD < 25
## 9 2021_05_CLE_LAC_10 Fumble Field Goal Turnover No TD < 25
## 10 2019_09_CLE_DEN_10 Fumble Field Goal Turnover No TD < 25
## start_end
## 1 Turnover:No TD
## 2 Turnover:No TD
## 3 Turnover:No TD
## 4 Turnover:No TD
## 5 Turnover:No TD
## 6 Turnover:No TD
## 7 Turnover:No TD
## 8 Turnover:No TD
## 9 Turnover:No TD
## 10 Turnover:No TD
Once we’ve done that, we can conduct our hypothesis test by using chisq_test() from the rstatix package like we did earlier in this document:
JI_vs_sat <- chisq_test(x = drives$start_end, y = drives$yds_to_td)
## Warning in stats::chisq.test(x, y, correct = correct, p = p, rescale.p =
## rescale.p, : Chi-squared approximation may be incorrect
JI_vs_sat
## # A tibble: 1 × 6
## n statistic p df method p.signif
## * <int> <dbl> <dbl> <int> <chr> <chr>
## 1 4152 1328. 3.46e-280 9 Chi-square test ****
Now we have:
test statistic \(\approx 1328\)
degrees of freedom = \(9\)
p-value \(\approx 0\)
So we have very, very strong evidence that at least one of the two variables (drive_start2 or drive_end2) is also associated with yds_to_td!
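As a quick sanity check, the canned statistic should agree (up to rounding) with the \(X^2\) from the by-hand sketch above (this assumes the X2 object created there):
# Compare the hand-calculated statistic to the one from chisq_test()
c(by_hand = X2, canned = JI_vs_sat$statistic)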
Note that the test statistic is smaller than our test statistic for complete independence. That’s because the more complicated model (JI) will always fit the data better than the simpler model (CI), and will be closer to the observed proportions (saturated)!
We can use the same “trick” to calculate the likelihood ratio \(G\)-test using the GTest() function in the DescTools package:
JI_vs_sat_Gtest <- DescTools::GTest(x = drives$start_end, y = drives$yds_to_td)
JI_vs_sat_Gtest
##
## Log likelihood ratio (G-test) test of independence without correction
##
## data: drives$start_end and drives$yds_to_td
## G = 797.97, X-squared df = 9, p-value < 2.2e-16
While the test statistic is noticeably smaller than Pearson’s test stat, it is still very large and indicates that how a drive starts and if it ends in a TD are not independent of where the drive starts!
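For reference, the \(G\) statistic compares the observed counts to the same expected counts via
\[G = 2 \sum_{i,j,k} n_{ijk} \ln\!\left(\frac{n_{ijk}}{\hat{\mu}_{ijk,0}}\right)\]
where \(\hat{\mu}_{ijk,0} = N \hat{\pi}_{ijk,0}\) are the expected counts under joint independence. Using the objects from the earlier by-hand sketch (an illustration only, not needed for GTest()):
# G computed from the by-hand expected counts; it should be close to GTest()'s G
2 * sum(ji_by_hand$n * log(ji_by_hand$n / ji_by_hand$ji_count))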
We should still check to see if our sample size is large enough. We can use the expected_freq() function to calculate our expected counts for us:
expected_freq(JI_vs_sat) |>
round(digits = 1) |>
# Converting from a table to a data frame
as.data.frame() |>
# changing the row names to a column then stacking the columns of exp
rownames_to_column('XY') |>
pivot_longer(
cols = -XY,
names_to = 'yds_to_go',
values_to = 'expected_count'
) |>
# arranging from smallest to largest expected count
arrange(expected_count)
## # A tibble: 16 × 3
## XY yds_to_go expected_count
## <chr> <chr> <dbl>
## 1 Turnover:TD < 25 3.1
## 2 Turnover:No TD < 25 6
## 3 Turnover:TD 25 - 50 8.9
## 4 Turnover:No TD 25 - 50 17.4
## 5 Kick:TD < 25 19.4
## 6 Turnover:TD 50 - 75 34.8
## 7 Kick:TD 25 - 50 56.4
## 8 Kick:No TD < 25 61.5
## 9 Turnover:No TD 50 - 75 68.4
## 10 Turnover:TD >= 75 94.2
## 11 Kick:No TD 25 - 50 178.
## 12 Turnover:No TD >= 75 185.
## 13 Kick:TD 50 - 75 222.
## 14 Kick:TD >= 75 600.
## 15 Kick:No TD 50 - 75 701.
## 16 Kick:No TD >= 75 1896.
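If you just want the headline numbers rather than the full table, a quick check could look like this (a sketch using JI_vs_sat from above):
# Smallest expected count under joint independence
min(expected_freq(JI_vs_sat))

# Number of cells with an expected count below 5
sum(expected_freq(JI_vs_sat) < 5)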
Like last time, we have 1 expected count less than 5, indicating that a \(\chi^2\) distribution might not be appropriate to find the p-value.
But again, with such a large test statistic, we can be confident that at least one of drive_start2 or drive_end2 is associated with yds_to_td.
So it appears that all 3 variables have an association!
We’ll use the std_residuals() function to find the standardized residuals assuming that the two variables are jointly independent and then create a heat map of the differences, like we did last time:
std_residuals(JI_vs_sat) |>
data.frame() |>
rename(
std_res = Freq,
yds_to_td = y
) |>
# Separate will split apart the two variables that we combined together
separate(
col = x, # Column to split
into = c("drive_start", "drive_end"), # name of the new columns
sep = ":" # Where to split the column
) |>
# Creating the heat map using ggplot and geom_tile()
ggplot(
mapping = aes(
x = drive_end,
y = drive_start,
fill = std_res
)
) +
geom_tile(color = "white") +
geom_text(mapping = aes(label = round(std_res, digits = 2))) +
facet_wrap(facets = vars(yds_to_td)) +
# Removes the buffer around the plot
coord_cartesian(expand = F) +
scale_fill_gradient2(
low = "red",
mid = "white",
high = "blue",
midpoint = 0
) +
labs(
x = "How a drive ends",
y = "How a drive begins",
fill = "z-score for \nFA vs JI models"
)