Motivating Question: Does how an NFL drive end differ based on how the drive begins??
Both the explanatory and the response variable can be considered
ordinal, since there are preferred ways for a drive to start and also
preferred ways for a drive to end. The orders have been listed above, so
let’s also order them using factor(levels = c(...))
drives <- 
  drives |> 
  mutate(
    drive_start = factor(drive_start,
                         levels = c("Kickoff", "Punt", "Interception", "Fumble"),
                         ordered = T), # Reordering the groups
    
    # And let's reorder drive end again as well
    drive_end = factor(drive_end, 
                       levels = c("Turnover", "Punt", "Field Goal", "Touchdown"),
                       ordered = T)
  )
# Look at our resulting 4x4 table
drive_freq <- 
  xtabs(formula = ~ drive_start + drive_end,
        data = drives)
drive_freq
##               drive_end
## drive_start    Turnover Punt Field Goal Touchdown
##   Kickoff           459 1238        496       663
##   Punt              356 1074        432       537
##   Interception       40  106        107       114
##   Fumble             60  138        120       125
There are two ways we can calculate Maentel-Haenszel’s \(r\)
We can calculate the counts for each combination of \(X\) and \(Y\) using the count()
function:
drive_counts <- 
  drives |> 
  count(
    drive_start,
    drive_end,
    name = "count"
  )
drive_counts
##     drive_start  drive_end count
## 1       Kickoff   Turnover   459
## 2       Kickoff       Punt  1238
## 3       Kickoff Field Goal   496
## 4       Kickoff  Touchdown   663
## 5          Punt   Turnover   356
## 6          Punt       Punt  1074
## 7          Punt Field Goal   432
## 8          Punt  Touchdown   537
## 9  Interception   Turnover    40
## 10 Interception       Punt   106
## 11 Interception Field Goal   107
## 12 Interception  Touchdown   114
## 13       Fumble   Turnover    60
## 14       Fumble       Punt   138
## 15       Fumble Field Goal   120
## 16       Fumble  Touchdown   125
Now we have a data frame with 16 rows, one for each combination of our two variables (\(4\times4\)).
What we need to do next is assign each group of \(X\) a score, and then each group of \(Y\) a score. Let’s keep it simple and just
use 1 - 4 for each for now. We can assign each group a value using
as.numeric() inside mutate()
drive_MH <- 
  drive_counts |> 
  mutate(
    start_score = as.numeric(drive_start),
    end_score = as.numeric(drive_end)
  )
drive_MH
##     drive_start  drive_end count start_score end_score
## 1       Kickoff   Turnover   459           1         1
## 2       Kickoff       Punt  1238           1         2
## 3       Kickoff Field Goal   496           1         3
## 4       Kickoff  Touchdown   663           1         4
## 5          Punt   Turnover   356           2         1
## 6          Punt       Punt  1074           2         2
## 7          Punt Field Goal   432           2         3
## 8          Punt  Touchdown   537           2         4
## 9  Interception   Turnover    40           3         1
## 10 Interception       Punt   106           3         2
## 11 Interception Field Goal   107           3         3
## 12 Interception  Touchdown   114           3         4
## 13       Fumble   Turnover    60           4         1
## 14       Fumble       Punt   138           4         2
## 15       Fumble Field Goal   120           4         3
## 16       Fumble  Touchdown   125           4         4
From there, we just need to calculate a weighted correlation to find
Maentel-Haenszel’s \(r\) using
cov.wt() where \(X\) and
\(Y\) are the assigned scores and the
weights are the counts calculated earlier:
MH_r <- 
  cov.wt(
    x = drive_MH |> 
      dplyr::select(start_score, end_score),
    wt = drive_MH$count,
    method = "ML",   # So it divides by the correct number
    cor = TRUE
  ) # Have it calculate the correlations
MH_r$cor |> 
  data.frame()
##             start_score  end_score
## start_score  1.00000000 0.07051453
## end_score    0.07051453 1.00000000
We get a correlation of 0.0705, not a very strong correlation (which agrees with our methods of association for nominal variables we saw earlier)
If we want to see if it is significant, we can conduct our \(M^2\)-test
\[M^2 = (N-1)r^2 \sim \chi^2_1\]
# Test statistic
M2_stat <- (nrow(drives) - 1)*MH_r$cor[2,1]^2
# P-value
c(
  'test stat' = round(M2_stat, 3),
  'p-value' = pchisq(
    q = M2_stat,
    df = 1,
    lower.tail = F
  )
)
##    test stat      p-value 
## 3.015200e+01 3.994711e-08
So we end up with a very small p-value. While the “correlation” is weak, it is definitely not 0!
It’s actually easier to calculate the correlation using the
unsummarized data. All we need to do is the same conversion we did
earlier with as.numeric() and then calculate the
correlation using cor()
# Start with the unsummarized data
drives |> 
  # Convert the data from ordinal to quantitative using the assigned scores
  # as.numeric works if we are just using the 1,...,I and 1,...,J method
  mutate(
    start_score = as.numeric(drive_start),
    end_score = as.numeric(drive_end)
  ) |> 
  
  # Pick the two columns with the scores using dplyr::select()
  dplyr::select(start_score, end_score) |> 
  
  # Then pass the results into cor()
  cor()
##             start_score  end_score
## start_score  1.00000000 0.07051453
## end_score    0.07051453 1.00000000
Same exact result as how we did it with the summarized data!
If you want to assign specific scores, you can use the
case_when() function and list out each group and the score
that you want to assign that group
# Start with the unsummarized data
drives |> 
  # Let's keep start_score the same, but change end_score to reflect points earned:
  mutate(
    start_score = as.numeric(drive_start),
    end_score = case_when(
      drive_end == "Turnover" ~ -3,
      drive_end == "Punt" ~ 0,
      drive_end == "Field Goal" ~ 3,
      drive_end == "Touchdown" ~ 7
    )
  ) |> 
  
  # Pick the two columns with the scores using dplyr::select()
  dplyr::select(start_score, end_score) |> 
  
  # Then pass the results into cor()
  cor()
##             start_score  end_score
## start_score  1.00000000 0.06765273
## end_score    0.06765273 1.00000000
Overall, not much has changed. That’s because a linear transformation of the scores will give you the exact same result (because correlations are invariant or “immune” to linear transformations!)
For instance, our scores for drive end are essentially
\[y = 3*x - 6\] Turnover: \(x = 1 \rightarrow y = -3\) Punt: \(x = 2 \rightarrow y = 0\) Field Goal: \(x = 3 \rightarrow y = 3\) Touchdown: \(x = 4 \rightarrow y = 6\)
The only difference is that we assigned a touchdown 7 points instead of 6. If we redid it with a touchdown worth 6 points, the correlations will be exactly the same!
# Start with the unsummarized data
drives |> 
  # Let's keep start_score the same, but change end_score to reflect points earned:
  mutate(
    start_score = as.numeric(drive_start),
    end_score = case_when(
      drive_end == "Turnover" ~ -3,
      drive_end == "Punt" ~ 0,
      drive_end == "Field Goal" ~ 3,
      drive_end == "Touchdown" ~ 6
    )
  ) |> 
  
  # Pick the two columns with the scores using dplyr::select()
  dplyr::select(start_score, end_score) |> 
  
  # Then pass the results into cor()
  cor()
##             start_score  end_score
## start_score  1.00000000 0.07051453
## end_score    0.07051453 1.00000000