Motivating Question: Does how an NFL drive ends differ based on how the drive begins?

Now we can summarize our results in an \(I \times J\) table. Specifically, a \(4 \times 4\) table, since how a drive starts and how it ends each have 4 different outcomes.

Let’s change the order of the groups using factor(levels = c(...))

drives <- 
  drives |> 
  mutate(drive_start = factor(drive_start,
                              levels = c("Punt", "Kickoff", "Interception", "Fumble")), # Reordering the groups
         
         # And let's reorder drive end again as well
         drive_end = factor(drive_end, 
                            levels = c("Turnover", "Punt", "Field Goal", "Touchdown"),
                            ordered = TRUE))

# Look at our resulting 4x4 table
drive_freq <- 
  xtabs(formula = ~ drive_start + drive_end,
        data = drives)

# Make the table look nice
drive_freq |> 
  data.frame() |> 
  pivot_wider(names_from = "drive_end",
              values_from = "Freq") |> 
  rename(`Drive Start` = drive_start) |> 
  
  gt() |> 
  cols_align(align = "center") |> 
  gt_add_divider(columns = everything(),
                 color = "black",
                 sides = "all") |> 
  tab_header(title = "Drive Result")
Drive Result

Drive Start    Turnover   Punt   Field Goal   Touchdown
Punt                356   1074          432         537
Kickoff             459   1238          496         663
Interception         40    106          107         114
Fumble               60    138          120         125



Let’s convert from counts to conditional proportions using prop.table():

drive_freq |> 
  # Convert counts to conditional proportions
  prop.table(margin = "drive_start") |> 
  # Display 3 significant digits
  signif(digits = 3) |> 
  # Convert to a data frame
  data.frame() |> 
  # Changing drive_end from 1 column to a column per way the drive ends
  pivot_wider(names_from = "drive_end",
              values_from = "Freq") |> 
  # Renaming drive_start to a more table-friendly display name
  rename(`Drive Start` = drive_start) |> 
  
  # Creating a nice looking table
  gt() |> 
  cols_align(align = "center") |> 
  gt_add_divider(columns = everything(),
                 color = "black",
                 sides = "all") |> 
  tab_header(title = "Drive Result")
Drive Result

Drive Start    Turnover   Punt   Field Goal   Touchdown
Punt              0.148  0.448        0.180       0.224
Kickoff           0.161  0.433        0.174       0.232
Interception      0.109  0.289        0.292       0.311
Fumble            0.135  0.312        0.271       0.282



Always a good idea to include a graph of the data!

drives |> 
  ggplot(mapping = aes(x = drive_start,
                       fill = fct_rev(drive_end))) +
  
  geom_bar(position = "fill") + 
  
  # Adding labels and title
  labs(y = "Proportion",
       x = "How Drive Began",
       fill = "Drive \nResult") + 
  
  # Removing the grid lines
  theme_test() + 
  
  # Removing the space at the bottom of the graph
  scale_y_continuous(expand = c(0, 0, 0, 0.05)) 

When examining all four groups, there is a more noticeable difference in how drives end based on how they begin.

While some of these differences look large, it is possible that the differences we see are just due to sampling variation and not indicative of a difference in the population. So what do we do?

We want to test whether the Independence Model is adequate for the observed counts or if the more complicated Association Model is necessary to describe the population proportions.

Test of Association

When we only have two variables, we only have the two different models (Independence/Association).

Our Null Hypothesis assumes that the two variables are independent (the Independence Model is true for the population).

\[H_0: \textrm{How an NFL drive begins and how it ends are Independent of one another} \rightarrow \pi_{ij} = \pi_{i\bullet} \times \pi_{\bullet j}\]

The alternative claims that the Association Model is necessary to accurately describe the probabilities formed by the groups of the two variables.

\[H_a: \textrm{How an NFL drive begins and how it ends are associated with one another} \rightarrow \pi_{ij} \ne \pi_{i\bullet} \times \pi_{\bullet j}\]
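
For a concrete example using the sample estimates shown later in these notes: about 47.1% of drives start with a kickoff and about 42.1% of drives end in a punt, so under \(H_0\) the joint probability that a drive starts with a kickoff and ends in a punt would be estimated as \(0.471 \times 0.421 \approx 0.198\), compared with the observed joint proportion of \(1238/6065 \approx 0.204\).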

Test Statistics

Our statistic is the same as it was for our goodness-of-fit tests, except we replace \(\pi_{i,0}\) with the estimated joint probability assuming the two variables are independent: \(\hat{\pi}_{ij,0} = p_{i\bullet} \times p_{\bullet j}\)

\[\chi^2 = N \sum_{i,j}^{I,J} \frac{ (p_{ij} - \hat{\pi}_{ij,0} )^2}{\hat{\pi}_{ij,0}}\]

\[G = 2\sum_{i,j}^{I,J} n_{ij} \ln \frac{p_{ij}}{\hat{\pi}_{ij,0}}\]

Both test statistics will follow a \(\chi^2_r\) distribution as long as all of the expected counts are at least 5:

\[N\hat{\pi}_{ij,0} \ge 5\]
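
As a quick numeric check of these formulas, here is a minimal base-R sketch (an illustration, separate from the by-hand walkthrough below) that computes both statistics directly from the drive_freq table created earlier:

# Observed joint proportions and their estimates under Ho
N      <- sum(drive_freq)                        # total number of drives
p_obs  <- drive_freq / N                         # p_ij
p_null <- outer(rowSums(p_obs), colSums(p_obs))  # pi_ij,0 = p_i. * p_.j

N * p_null                                  # expected counts (all at least 5)
N * sum((p_obs - p_null)^2 / p_null)        # Pearson's chi-square, about 94.4
2 * sum(drive_freq * log(p_obs / p_null))   # G statistic, about 92.6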

Degrees of freedom

If it follows a \(\chi^2_r\) distribution, what is the degrees of freedom, \(r\)?

It follows the same pattern as other tests we’ve seen:

  1. \(r_1 =\) the number of proportions estimated in the alternative hypothesis model:
  • Need to estimate the joint probabilities for the association model: \(\pi_{ij} \rightarrow p_{ij}\)
  • There will be \(IJ - 1\) different proportions we need to estimate since \(\sum_{i,j}^{I,J} \pi_{ij} = 1\)
  2. \(r_0 =\) the number of proportions estimated in the null hypothesis model:
  • Need to estimate the marginal probabilities for the independence model: \(\pi_{i\bullet}\) and \(\pi_{\bullet j}\)
  • There will be \(I - 1\) different marginal proportions to estimate for \(X\) and \(J - 1\) to estimate for \(Y\)

The degrees of freedom will be the difference between \(r_1\) and \(r_0\):

\[r = r_1 - r_0 = (IJ - 1) - ((I - 1) + (J - 1)) = (I-1)(J-1)\]

For our example, \(I = J = 4\), so our degrees of freedom will be \((4 - 1)(4 - 1) = 9\).
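
We can confirm this directly from the dimensions of the frequency table (a one-line sketch using the drive_freq table created earlier):

# r = (I - 1)(J - 1), taken from the table's dimensions
(nrow(drive_freq) - 1) * (ncol(drive_freq) - 1)
## [1] 9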

Test by hand

drives_TOA <- 
  drives |>
  # Calculate the joint counts:
  summarize(
    .by = c(drive_start, drive_end),
    counts = n()
  ) |> 
  
  # Calculate the joint proportion:
  mutate(joint_prop = counts/sum(counts)) |>  # n_ij/N

  # Calculating the marginal proportions for drive_start
  mutate(
    .by = drive_start,
    prop_start = sum(joint_prop)
  ) |>  # sum of p_ij over j = p_i.
  
  
  # Calculating the marginal proportions for drive_end
  mutate(
    .by = drive_end,
    prop_end = sum(joint_prop)
  ) |>  # sum of p_ij over i = p_.j
  
  
  # Calculating the remaining values needed for the test:
  mutate(
    # The joint proportions assuming Ho is true: p_i. * p_.j
    prop_null = prop_start * prop_end,
    
    # The individual zi2 
    zi2 = sum(counts) * (joint_prop - prop_null)^2/prop_null,
    
    # The individual g_i
    g_i = counts * log(joint_prop/prop_null)
  )
  
drives_TOA |> 
  mutate(across(.cols = where(is.numeric),
                .fns = \(x) round(x, digits = 3)))
##     drive_start  drive_end counts joint_prop prop_start prop_end prop_null
## 1        Fumble Field Goal    120      0.020      0.073    0.190     0.014
## 2        Fumble       Punt    138      0.023      0.073    0.421     0.031
## 3        Fumble  Touchdown    125      0.021      0.073    0.237     0.017
## 4        Fumble   Turnover     60      0.010      0.073    0.151     0.011
## 5  Interception Field Goal    107      0.018      0.061    0.190     0.012
## 6  Interception       Punt    106      0.017      0.061    0.421     0.026
## 7  Interception  Touchdown    114      0.019      0.061    0.237     0.014
## 8  Interception   Turnover     40      0.007      0.061    0.151     0.009
## 9       Kickoff Field Goal    496      0.082      0.471    0.190     0.090
## 10      Kickoff       Punt   1238      0.204      0.471    0.421     0.198
## 11      Kickoff  Touchdown    663      0.109      0.471    0.237     0.112
## 12      Kickoff   Turnover    459      0.076      0.471    0.151     0.071
## 13         Punt Field Goal    432      0.071      0.396    0.190     0.075
## 14         Punt       Punt   1074      0.177      0.396    0.421     0.167
## 15         Punt  Touchdown    537      0.089      0.396    0.237     0.094
## 16         Punt   Turnover    356      0.059      0.396    0.151     0.060
##       zi2     g_i
## 1  15.053  42.283
## 2  12.701 -41.707
## 3   3.765  21.666
## 4   0.699  -6.472
## 5  19.704  45.571
## 6  15.313 -40.050
## 7   8.325  30.714
## 8   4.265 -13.005
## 9   4.216 -45.715
## 10  0.982  34.870
## 11  0.316 -14.464
## 12  1.836  29.027
## 13  1.353 -24.169
## 14  3.923  64.901
## 15  1.821 -31.266
## 16  0.097  -5.878

Calculating the test statistics and p-values

tibble(
  test =      c(            'chisq',                   'G'),
  test_stat = c(sum(drives_TOA$zi2), 2*sum(drives_TOA$g_i)),
  p_val = pchisq(q = test_stat, df = 9, lower.tail = FALSE) |> signif(digits = 4)
)
## # A tibble: 2 × 3
##   test  test_stat    p_val
##   <chr>     <dbl>    <dbl>
## 1 chisq      94.4 2.15e-16
## 2 G          92.6 4.86e-16

Both tests agree that there is incredibly strong evidence of an association between how a drive starts and how a drive ends.

Using the canned functions in R

We can use the two tests we used earlier to conduct a test of independence:

chisq_test() in the rstatix package with the following arguments:

  • x = which can be one of two options:

  • a two-way table of counts

  • a vector of the groups of the explanatory variable. If this form of x is used, we also need to specify y

  • y = the vector of the other factor variable. Only needed when x is a single vector

# Using the drive_freq table:
nfl_chisq <- 
  chisq_test(x = drive_freq)

nfl_chisq
## # A tibble: 1 × 6
##       n statistic        p    df method          p.signif
## * <int>     <dbl>    <dbl> <int> <chr>           <chr>   
## 1  6065      94.4 2.15e-16     9 Chi-square test ****
# Using the raw data.frame
chisq_test(x = drives$drive_start,
           y = drives$drive_end)
## # A tibble: 1 × 6
##       n statistic        p    df method          p.signif
## * <int>     <dbl>    <dbl> <int> <chr>           <chr>   
## 1  6065      94.4 2.15e-16     9 Chi-square test ****

The GTest() function in the DescTools package works the same way as the chisq_test() function:

pacman::p_load(DescTools)
nfl_G <- 
  GTest(x = drive_freq)

nfl_G
## 
##  Log likelihood ratio (G-test) test of independence without correction
## 
## data:  drive_freq
## G = 92.612, X-squared df = 9, p-value = 4.441e-16
# Using the raw data instead of the summary table
GTest(x = drives$drive_start,
      y = drives$drive_end)
## 
##  Log likelihood ratio (G-test) test of independence without correction
## 
## data:  drives$drive_start and drives$drive_end
## G = 92.612, X-squared df = 9, p-value = 4.441e-16

We should also check that the sample size is large enough:

\[\tilde{n}_{ij} = N (p_{i\bullet}\times p_{\bullet j}) \ge 5\]

These can be displayed by passing the object created by chisq_test() to the expected_freq() function:

expected_freq(nfl_chisq) |> 
  floor()
##               drive_end
## drive_start    Turnover Punt Field Goal Touchdown
##   Punt              361 1011        456       569
##   Kickoff           430 1203        543       677
##   Interception       55  154         69        87
##   Fumble             66  186         84       105

Since we just need the smallest expected count to be at least 5, we can use min() to find it:

expected_freq(nfl_chisq) |> 
  floor() |> 
  min()
## [1] 55

Since the smallest expected count is 55, we have a large enough sample size to use the \(\chi^2\) distribution to find the p-value.




Post Hoc Analysis: Finding where the associations occur

Just rejecting our null hypothesis and claiming that the two variables are not independent usually is not a very strong claim. We’d like to see which combinations of groups are responsible for the association.

Using Pearson’s test, we can find two different residuals:

  1. Pearson’s residuals:

\[z_{ij} = \frac{ p_{ij} - \hat{\pi}_{ij,0}}{\sqrt{\frac{\hat{\pi}_{ij,0}}{N}}}\]

  2. Standardized Residuals:

\[z_{ij} = \frac{ p_{ij} - \hat{\pi}_{ij,0}}{\sqrt{\frac{\hat{\pi}_{ij,0}(1-p_{i\bullet})(1-p_{\bullet j})}{N}}}\]

Typically we use the Standardized residuals since they are closer to standard Normal variables than Pearson’s residuals.

We can find either of these by passing the object created by chisq_test() to pearson_residuals() or std_residuals()

# Pearson's Residuals
drive_pearson <- 
  pearson_residuals(nfl_chisq)

drive_pearson |> 
  round(digits = 2)
##               drive_end
## drive_start    Turnover  Punt Field Goal Touchdown
##   Punt            -0.31  1.98      -1.16     -1.35
##   Kickoff          1.36  0.99      -2.05     -0.56
##   Interception    -2.07 -3.91       4.44      2.89
##   Fumble          -0.84 -3.56       3.88      1.94



Standardized residuals

drive_standard <- 
  std_residuals(nfl_chisq)

drive_standard |> 
  round(digits = 2)
##               drive_end
## drive_start    Turnover  Punt Field Goal Touchdown
##   Punt            -0.43  3.35      -1.66     -1.99
##   Kickoff          2.02  1.79      -3.14     -0.88
##   Interception    -2.31 -5.31       5.09      3.41
##   Fumble          -0.94 -4.87       4.48      2.31
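
As a sanity check, here is a by-hand sketch that reproduces these standardized residuals using the columns we already computed in drives_TOA:

# Standardized residuals: (p_ij - pi_ij,0) / sqrt(pi_ij,0 * (1 - p_i.) * (1 - p_.j) / N)
drives_TOA |> 
  mutate(std_resid = (joint_prop - prop_null) /
           sqrt(prop_null * (1 - prop_start) * (1 - prop_end) / sum(counts))) |> 
  select(drive_start, drive_end, std_resid)

The values agree with the std_residuals() table above (e.g., about -0.43 for drives that start with a punt and end in a turnover).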

The larger the table, the harder it is to notice any patterns in the associations. So instead of a table, let’s make a graph!

We’ll just use the standardized residuals since they more closely follow a standard Normal distribution

With 16 different category combinations, if the standardized residual is at least 3 (or -3), we say that that combination occurs more (or less) frequently than would be expected if the variables were independent.
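
Before building the graph, a quick sketch lists just the combinations that cross that threshold:

# Flag the combinations with |standardized residual| >= 3
drive_standard |> 
  data.frame() |> 
  rename(residual = Freq) |> 
  filter(abs(residual) >= 3)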

# Start with the residual table
drive_standard |> 
  # Convert to a data frame
  data.frame() |> 
  # Rename Freq to residual
  rename(residual = Freq) |> 
  # Creating a column called sig that flags the large residuals
  mutate(sig = if_else(abs(residual) >= 3,
                       true = "Yes",
                       false = "No")) |> 

  # use ggplot() to create a tile graph
  ggplot(mapping = aes(x = drive_start,
                       y = drive_end,
                       fill = residual)) +
  
  geom_tile(mapping = aes(color = sig),   # Draws a different color line around the tile if it is significant
            linewidth = 1,
            width = 0.98,
            height = 0.98) +
  
  # Adding the residuals to the plot using geom_text()
  geom_text(mapping = aes(label = round(residual, digits = 2))) +
  
  labs(x = "How the Drive Started",
       y = "Drive Result",
       fill = "Standardized \nResidual") +
  
  scale_x_discrete(expand = c(0,0)) + 
  scale_y_discrete(expand = c(0,0)) + 
  
  # Change the color gradient for the fill aesthetic
  scale_fill_gradient2(low = "orange",
                       high = "blue") +
  
  # Changing the line to be red if it is significant and white if it is not
  scale_color_manual(values = c("Yes" = "red",
                                "No" = "white")) +
  
  theme_test() + 
  guides(color = "none")


The residuals show that: