Part A: Testing Complete Independence

Three-way table

We can create a three-way table of our three variables using xtabs()

formula = ~ X + Y + Z
data = the data set

drives_table_3 <- 
  xtabs(
    formula = ~ drive_start2 + drive_end2 + yds_to_td,
    data = drives
  )

drives_table_3

## , , yds_to_td = < 25
## 
##             drive_end2
## drive_start2 No TD   TD
##     Kick         8   11
##     Turnover    32   39
## 
## , , yds_to_td = 25 - 50
## 
##             drive_end2
## drive_start2 No TD   TD
##     Kick        75   44
##     Turnover    86   56
## 
## , , yds_to_td = 50 - 75
## 
##             drive_end2
## drive_start2 No TD   TD
##     Kick       658  258
##     Turnover    81   29
## 
## , , yds_to_td = >= 75
## 
##             drive_end2
## drive_start2 No TD   TD
##     Kick      2096  584
##     Turnover    78   17

You can convert the partial tables created by xtabs() to a flat table using ftable(...)

ftable(drives_table_3)

##                         yds_to_td < 25 25 - 50 50 - 75 >= 75
## drive_start2 drive_end2                                     
## Kick         No TD                   8      75     658  2096
##              TD                     11      44     258   584
## Turnover     No TD                  32      86      81    78
##              TD                     39      56      29    17

Test for two variables

Testing to see if how a drive starts (drive_start2) and if it ends in a touchdown (drive_end2) are associated:

chisq_test(x = drives$drive_start2,  y = drives$drive_end2)

## # A tibble: 1 × 6
##       n statistic        p    df method          p.signif
## * <int>     <dbl>    <dbl> <int> <chr>           <chr>   
## 1  4152      18.4 0.000018     1 Chi-square test ****

Let’s include another variable: yds_to_td -> how far away the drive starts from the endzone (closer to the endzone the easier it should be to score a TD)

Summarizing the data: counts and observed proportions

Creating a summarized data set for yds_to_td, drive_start2, drive_end2 and create the joint proportions: \(p_{ijk}\)

drive_sum <- 
  drives |> 
  count(yds_to_td, drive_start2, drive_end2) |> 
  # Calculating the proportions: n/sum(n)
  mutate(FA_prop = n/sum(n))

drive_sum

##    yds_to_td drive_start2 drive_end2    n     FA_prop
## 1       < 25         Kick      No TD    8 0.001926782
## 2       < 25         Kick         TD   11 0.002649326
## 3       < 25     Turnover      No TD   32 0.007707129
## 4       < 25     Turnover         TD   39 0.009393064
## 5    25 - 50         Kick      No TD   75 0.018063584
## 6    25 - 50         Kick         TD   44 0.010597303
## 7    25 - 50     Turnover      No TD   86 0.020712909
## 8    25 - 50     Turnover         TD   56 0.013487476
## 9    50 - 75         Kick      No TD  658 0.158477842
## 10   50 - 75         Kick         TD  258 0.062138728
## 11   50 - 75     Turnover      No TD   81 0.019508671
## 12   50 - 75     Turnover         TD   29 0.006984586
## 13     >= 75         Kick      No TD 2096 0.504816956
## 14     >= 75         Kick         TD  584 0.140655106
## 15     >= 75     Turnover      No TD   78 0.018786127
## 16     >= 75     Turnover         TD   17 0.004094412

We’ll also need \(I\), \(J\), and \(K\), the number of groups for each of our 3 variables:

# I = number of different ways a drive can start (Kick/Turnover) = 2
I <- n_distinct(drives$drive_start2)

# J = number of different ways a drive can end (TD/No TD) = 2
J <- n_distinct(drives$drive_end2)

# K = number of different distances a drive can start from the endzone = 4
K <- n_distinct(drives$yds_to_td)

Testing for complete independence

Are all three variables independent of one another?

To test complete independence vs the full association model, we need to calculate the expected proportions assuming all three variables are independent:

\[\hat{\pi}_{ijk,0} = \frac{n_{i\bullet\bullet}}{N} \frac{n_{\bullet j\bullet}}{N} \frac{n_{\bullet\bullet k}}{N}\]

So we need to find the total counts for each group of each variable separately. Not difficult, but it is a little tedious to do.

drive_CI <- 
  drive_sum |> 
  # Start by adding the totals for each yds_to_td group
  mutate(
    .by = yds_to_td,
    yds_n = sum(n)
  ) |> 
  # Next, find the totals for each drive_start2 group:
  mutate(
    .by = drive_start2,
    start_n = sum(n)
  ) |> 
  # Finally, count for each group of drive_end2:
  mutate(
    .by = drive_end2,
    end_n = sum(n)
  ) |> 
  # Now we can calculate the expected proportion for each outcome assuming complete independence
  mutate(
    CI_prop = yds_n/sum(n) * start_n/sum(n) * end_n/sum(n)
  ) |> 
  
  # Dropping the count columns because we don't need them in the future:
  dplyr::select(-yds_n, -start_n, -end_n)

drive_CI

##    yds_to_td drive_start2 drive_end2    n     FA_prop     CI_prop
## 1       < 25         Kick      No TD    8 0.001926782 0.014620539
## 2       < 25         Kick         TD   11 0.002649326 0.004873513
## 3       < 25     Turnover      No TD   32 0.007707129 0.001636686
## 4       < 25     Turnover         TD   39 0.009393064 0.000545562
## 5    25 - 50         Kick      No TD   75 0.018063584 0.042399564
## 6    25 - 50         Kick         TD   44 0.010597303 0.014133188
## 7    25 - 50     Turnover      No TD   86 0.020712909 0.004746389
## 8    25 - 50     Turnover         TD   56 0.013487476 0.001582130
## 9    50 - 75         Kick      No TD  658 0.158477842 0.166674150
## 10   50 - 75         Kick         TD  258 0.062138728 0.055558050
## 11   50 - 75     Turnover      No TD   81 0.019508671 0.018658220
## 12   50 - 75     Turnover         TD   29 0.006984586 0.006219407
## 13     >= 75         Kick      No TD 2096 0.504816956 0.450799966
## 14     >= 75         Kick         TD  584 0.140655106 0.150266655
## 15     >= 75     Turnover      No TD   78 0.018786127 0.050464485
## 16     >= 75     Turnover         TD   17 0.004094412 0.016821495

Now that we have the expected proportions (CI model) and the observed proportions (FA model), we can calculate our test statistic!

CI_FA_test <- 
  drive_CI |> 
  # Calculating the individual pieces of our test statistics (chi^2 and G)
  mutate(
    zi2 = (FA_prop - CI_prop)^2/CI_prop,
    gi = n*log(FA_prop/CI_prop)
  ) |> 
  
  # Adding the individual pieces to get the test statistic:
  summarize(
    chi2 = sum(n)*sum(zi2),
    lrt_g = 2*sum(gi)
  ) |> 
  
  # Changing the results from being stored in separate columns to the same column
  pivot_longer(
    cols = chi2:lrt_g,
    names_to = "test",
    values_to = "stat"
  )

CI_FA_test

## # A tibble: 2 × 2
##   test   stat
##   <chr> <dbl>
## 1 chi2  1553.
## 2 lrt_g  816.

Both test statistics appear to be very, very large (probably large enough that we can go ahead and reject the null hypothesis). But we should also calculate the degrees of freedom and find the p-value, just to be sure

The degrees of freedom are

\[IJK - 1 - (I + J + K - 3) =\\ 2\times2\times4 - 1 -(2 + 2 + 4 - 3) = \\15 - 5 = \\10\]

# The number of unique proportions needed for the FA model
r1 <- I*J*K - 1

# The number of unique proportions needed for the CI model
r0 <- I + J + K - 3

# The degrees of freedom: r1 - r0
df_CI <- r1 - r0
df_CI

## [1] 10

Now that we have the degrees of freedom, let’s calculate the p-values!

#CI_FA_test <- 
  CI_FA_test |> 
  mutate(p_val = pchisq(stat, df = df_CI, lower = F))

## # A tibble: 2 × 3
##   test   stat     p_val
##   <chr> <dbl>     <dbl>
## 1 chi2  1553. 0        
## 2 lrt_g  816. 8.30e-169

We get p-values that are essentially 0, indicating that at least 2 of the 3 variables are associated.

However, for the results to hold, we need to make sure our sample size is large enough to conduct a test. How large of a sample do we need?

Any of these \(\chi^2\) tests (or LRT \(G\) tests) all have the same sample size requirement: that \(N\) times the expected proportions are all at least 5. We’ll have 16 expected counts, but we only need to focus on the smallest one!

# Checking to see if our sample size is large enough:
drive_CI |> 
  mutate(n_CI = sum(n)*CI_prop) |> 
  arrange(n_CI)

##    yds_to_td drive_start2 drive_end2    n     FA_prop     CI_prop        n_CI
## 1       < 25     Turnover         TD   39 0.009393064 0.000545562    2.265173
## 2    25 - 50     Turnover         TD   56 0.013487476 0.001582130    6.569003
## 3       < 25     Turnover      No TD   32 0.007707129 0.001636686    6.795520
## 4    25 - 50     Turnover      No TD   86 0.020712909 0.004746389   19.707009
## 5       < 25         Kick         TD   11 0.002649326 0.004873513   20.234827
## 6    50 - 75     Turnover         TD   29 0.006984586 0.006219407   25.822977
## 7    25 - 50         Kick         TD   44 0.010597303 0.014133188   58.680997
## 8       < 25         Kick      No TD    8 0.001926782 0.014620539   60.704480
## 9      >= 75     Turnover         TD   17 0.004094412 0.016821495   69.842847
## 10   50 - 75     Turnover      No TD   81 0.019508671 0.018658220   77.468931
## 11   25 - 50         Kick      No TD   75 0.018063584 0.042399564  176.042991
## 12     >= 75     Turnover      No TD   78 0.018786127 0.050464485  209.528540
## 13   50 - 75         Kick         TD  258 0.062138728 0.055558050  230.677023
## 14     >= 75         Kick         TD  584 0.140655106 0.150266655  623.907153
## 15   50 - 75         Kick      No TD  658 0.158477842 0.166674150  692.031069
## 16     >= 75         Kick      No TD 2096 0.504816956 0.450799966 1871.721460

There is one outcome with an expected count less than 5, so our p-values may not be completely accurate. But since the test statistic is very large compared to the degrees of freedom, we can be confident that we reached the right decision, even if we can’t be confident the p-values are that accurate.

Finding where the differences occur

Let’s create another heat map, but let’s create a separate heat map for each outcome of our control variable, yds_to_td:

drive_CI |>
  
  # Calculating the Standardized z-score for each of the 16 groups:
  mutate(
    zi = (FA_prop - CI_prop)/sqrt(CI_prop * (1-CI_prop)/sum(n))
  ) |> 
  
  # Creating the heat map using ggplot and geom_tile()
  ggplot(
    mapping = aes(
      x = drive_end2,
      y = drive_start2,
      fill = zi
    )
  ) + 
  
  geom_tile(color = "white") + 
  
  geom_text(mapping = aes(label = round(zi, digits = 2))) +
  
  facet_wrap(facets = vars(yds_to_td)) + 
  
  # Removes the buffer around the plot
  coord_cartesian(expand = F) + 
  
  scale_fill_gradient2(
    low = "red",
    mid = "white",
    high = "blue",
    midpoint = 0
  ) + 
  
  labs(
    x = "How a drive ends",
    y = "How a drive begins",
    fill = "z-score for \nFA vs CI models"
  )