Motivating Question: Does how an NFL drive end differ based on how the drive begins??
Both the explanatory and the response variable can be considered
ordinal, since there are preferred ways for a drive to start and also
preferred ways for a drive to end. The orders have been listed above, so
let’s also order them using factor(levels = c(...))
drives <-
drives |>
mutate(
drive_start = factor(drive_start,
levels = c("Kickoff", "Punt", "Interception", "Fumble"),
ordered = T), # Reordering the groups
# And let's reorder drive end again as well
drive_end = factor(drive_end,
levels = c("Turnover", "Punt", "Field Goal", "Touchdown"),
ordered = T)
)
# Look at our resulting 4x4 table
drive_freq <-
xtabs(formula = ~ drive_start + drive_end,
data = drives)
drive_freq
## drive_end
## drive_start Turnover Punt Field Goal Touchdown
## Kickoff 459 1238 496 663
## Punt 356 1074 432 537
## Interception 40 106 107 114
## Fumble 60 138 120 125
There are two ways we can calculate Maentel-Haenszel’s \(r\)
We can calculate the counts for each combination of \(X\) and \(Y\) using the count()
function:
drive_counts <-
drives |>
count(
drive_start,
drive_end,
name = "count"
)
drive_counts
## drive_start drive_end count
## 1 Kickoff Turnover 459
## 2 Kickoff Punt 1238
## 3 Kickoff Field Goal 496
## 4 Kickoff Touchdown 663
## 5 Punt Turnover 356
## 6 Punt Punt 1074
## 7 Punt Field Goal 432
## 8 Punt Touchdown 537
## 9 Interception Turnover 40
## 10 Interception Punt 106
## 11 Interception Field Goal 107
## 12 Interception Touchdown 114
## 13 Fumble Turnover 60
## 14 Fumble Punt 138
## 15 Fumble Field Goal 120
## 16 Fumble Touchdown 125
Now we have a data frame with 16 rows, one for each combination of our two variables (\(4\times4\)).
What we need to do next is assign each group of \(X\) a score, and then each group of \(Y\) a score. Let’s keep it simple and just
use 1 - 4 for each for now. We can assign each group a value using
as.numeric()
inside mutate()
drive_MH <-
drive_counts |>
mutate(
start_score = as.numeric(drive_start),
end_score = as.numeric(drive_end)
)
drive_MH
## drive_start drive_end count start_score end_score
## 1 Kickoff Turnover 459 1 1
## 2 Kickoff Punt 1238 1 2
## 3 Kickoff Field Goal 496 1 3
## 4 Kickoff Touchdown 663 1 4
## 5 Punt Turnover 356 2 1
## 6 Punt Punt 1074 2 2
## 7 Punt Field Goal 432 2 3
## 8 Punt Touchdown 537 2 4
## 9 Interception Turnover 40 3 1
## 10 Interception Punt 106 3 2
## 11 Interception Field Goal 107 3 3
## 12 Interception Touchdown 114 3 4
## 13 Fumble Turnover 60 4 1
## 14 Fumble Punt 138 4 2
## 15 Fumble Field Goal 120 4 3
## 16 Fumble Touchdown 125 4 4
From there, we just need to calculate a weighted correlation to find
Maentel-Haenszel’s \(r\) using
cov.wt()
where \(X\) and
\(Y\) are the assigned scores and the
weights are the counts calculated earlier:
MH_r <-
cov.wt(
x = drive_MH |>
dplyr::select(start_score, end_score),
wt = drive_MH$count,
method = "ML", # So it divides by the correct number
cor = TRUE
) # Have it calculate the correlations
MH_r$cor |>
data.frame()
## start_score end_score
## start_score 1.00000000 0.07051453
## end_score 0.07051453 1.00000000
We get a correlation of 0.0705, not a very strong correlation (which agrees with our methods of association for nominal variables we saw earlier)
If we want to see if it is significant, we can conduct our \(M^2\)-test
\[M^2 = (N-1)r^2 \sim \chi^2_1\]
# Test statistic
M2_stat <- (nrow(drives) - 1)*MH_r$cor[2,1]^2
# P-value
c(
'test stat' = round(M2_stat, 3),
'p-value' = pchisq(
q = M2_stat,
df = 1,
lower.tail = F
)
)
## test stat p-value
## 3.015200e+01 3.994711e-08
So we end up with a very small p-value. While the “correlation” is weak, it is definitely not 0!
It’s actually easier to calculate the correlation using the
unsummarized data. All we need to do is the same conversion we did
earlier with as.numeric()
and then calculate the
correlation using cor()
# Start with the unsummarized data
drives |>
# Convert the data from ordinal to quantitative using the assigned scores
# as.numeric works if we are just using the 1,...,I and 1,...,J method
mutate(
start_score = as.numeric(drive_start),
end_score = as.numeric(drive_end)
) |>
# Pick the two columns with the scores using dplyr::select()
dplyr::select(start_score, end_score) |>
# Then pass the results into cor()
cor()
## start_score end_score
## start_score 1.00000000 0.07051453
## end_score 0.07051453 1.00000000
Same exact result as how we did it with the summarized data!
If you want to assign specific scores, you can use the
case_when()
function and list out each group and the score
that you want to assign that group
# Start with the unsummarized data
drives |>
# Let's keep start_score the same, but change end_score to reflect points earned:
mutate(
start_score = as.numeric(drive_start),
end_score = case_when(
drive_end == "Turnover" ~ -3,
drive_end == "Punt" ~ 0,
drive_end == "Field Goal" ~ 3,
drive_end == "Touchdown" ~ 7
)
) |>
# Pick the two columns with the scores using dplyr::select()
dplyr::select(start_score, end_score) |>
# Then pass the results into cor()
cor()
## start_score end_score
## start_score 1.00000000 0.06765273
## end_score 0.06765273 1.00000000
Overall, not much has changed. That’s because a linear transformation of the scores will give you the exact same result (because correlations are invariant or “immune” to linear transformations!)
For instance, our scores for drive end are essentially
\[y = 3*x - 6\] Turnover: \(x = 1 \rightarrow y = -3\) Punt: \(x = 2 \rightarrow y = 0\) Field Goal: \(x = 3 \rightarrow y = 3\) Touchdown: \(x = 4 \rightarrow y = 6\)
The only difference is that we assigned a touchdown 7 points instead of 6. If we redid it with a touchdown worth 6 points, the correlations will be exactly the same!
# Start with the unsummarized data
drives |>
# Let's keep start_score the same, but change end_score to reflect points earned:
mutate(
start_score = as.numeric(drive_start),
end_score = case_when(
drive_end == "Turnover" ~ -3,
drive_end == "Punt" ~ 0,
drive_end == "Field Goal" ~ 3,
drive_end == "Touchdown" ~ 6
)
) |>
# Pick the two columns with the scores using dplyr::select()
dplyr::select(start_score, end_score) |>
# Then pass the results into cor()
cor()
## start_score end_score
## start_score 1.00000000 0.07051453
## end_score 0.07051453 1.00000000