Two Categorical Variables - Binary Response and Nonbinary Explanatory

Motivating Question: Is a team more or less likely to score a touchdown depending on how the drive began?

Explanatory Variable: Drive Start - Punt(1), Kickoff(2), Interception(3), Fumble(4)
Response Variable: Drive End: Touchdown(1) or No Touchdown(2)

Now we can store our results in an \(I\times 2\) table. Not as simple as a \(2 \times 2\) table, but simpler than an \(I \times J\) table.

drives2 <- 
  drives |> 
  mutate(drive_start = factor(drive_start,
                              levels = c("Punt", "Kickoff", "Interception", "Fumble")), # Reordering the groups
         
         # And let's convert drive end to a binary variable again
         drive_end = if_else(drive_end == "Touchdown",
                             true = "TD",
                             false = "noTD") |> 
                     factor(levels = c("TD", "noTD")))

# Look at our resulting 4x2 table
drive_freq <- 
  xtabs(formula = ~ drive_start + drive_end,
        data = drives2)

# Make the table look nice
drive_freq |> 
  data.frame() |> 
  pivot_wider(names_from = "drive_end",
              values_from = "Freq") |> 
  rename(Touchdown = TD,
         `No Touchdown` = noTD,
         `Drive Start` = drive_start) |> 
  gt() |> 
  cols_align(align = "center") |> 
  gt_add_divider(columns = everything(),
                 color = "black",
                 sides = "all")

Drive Start	Touchdown	No Touchdown
Punt	537	1862
Kickoff	663	2193
Interception	114	253
Fumble	125	318

Let’s convert from counts to conditional proportions using prop.table():

drive_freq |> 
  prop.table(margin = "drive_start") |> 
  signif(digits = 3) |> 
  data.frame() |> 
  pivot_wider(names_from = "drive_end",
              values_from = "Freq") |> 
  rename(Touchdown = TD,
         `No Touchdown` = noTD,
         `Drive Start` = drive_start) |> 
  gt() |> 
  cols_align(align = "center") |> 
  gt_add_divider(columns = everything(),
                 color = "black",
                 sides = "all")

Drive Start	Touchdown	No Touchdown
Punt	0.224	0.776
Kickoff	0.232	0.768
Interception	0.311	0.689
Fumble	0.282	0.718

Always a good idea to include a graph of the data!

drive_freq |> 
  # Calculate the conditional proportions
  prop.table(margin = "drive_start") |> 
  # Convert the table to a data frame to be used by ggplot()
  data.frame() |> 
  # Rename Freq to prop
  rename(prop = Freq) |> 
  
  ggplot(mapping = aes(x = drive_start,
                       fill = drive_end,
                       y = prop)) + 
  
  geom_col(width = 0.5) + 
  
  labs(y = "Proportion",
       x = "How Drive Began",
       fill = "Drive \nResult") + 
  
  theme_test() + 
  
  scale_y_continuous(expand = c(0, 0, 0, 0.05)) + 
  
  scale_fill_manual(labels = c("Touchdown", "No Touchdown"),
                    values = c("steelblue", "darkorange"))

There are some differences, but it isn’t a very noticeable difference.

Summarizing the associations

When \(Y\) is binary, we can still calculate risk ratios and odds ratios in a fairly straight forward manner. The main difference is we need to decide on a baseline or reference group to use for the explanatory variable (Say group 1). From there, we’ll calculate the relative risk and odds ratio for each other group in comparison to the reference group:

Odds of a TD if the drive starts from a kickoff compared to a punt
Odds of a TD if the drive starts from an interception compared to a punt
Odds of a TD if the drive starts from a fumble compared to a punt

While the list above mentions the odds, we can use the probability (or risk) as well!

Since the data was collected randomly with regards to \(X\) and \(Y\), we can calculate both the relative risk and odds ratio (instead of just the odds ratio)!

Risk Ratios

We’ll use the same function from the epitools package to calculate the risk ratio that we did when we were analyzing two binary variables.

The riskratio() function wants a table with the first column being failures (noTD) and the second being successes (TD) and the first row should be the reference group. The order of the remaining columns is not that important.

td_rr <- 
  riskratio(x = drive_freq,
            rev = "c",
            method = "wald")

# p.values
td_rr$p.value |> 
  data.frame() |> 
  signif(digits = 3)

##              midp.exact fisher.exact chi.square
## Punt                 NA           NA         NA
## Kickoff        0.476000     0.489000   0.475000
## Interception   0.000381     0.000444   0.000262
## Fumble         0.008720     0.008530   0.007630

# Estimates and confidence intervals
td_rr$measure |> 
  data.frame() |> 
  signif(digits = 3)

##              estimate lower upper
## Punt             1.00    NA    NA
## Kickoff          1.04 0.938  1.15
## Interception     1.39 1.170  1.64
## Fumble           1.26 1.070  1.49

Since the confidence interval for Kickoff has 1 inside, we don’t have evidence that the probability a drive ends in a touchdown is different for kickoffs compared to a punt at a 5% significance level.

But we can be 95% confident that the probability a drive ends in a touchdown is 1.170 to 1.64 times higher if the drive starts from an interception and 1.07 to 1.46 higher for a fumble than if the drive started with a punt.

Visualizing the confidence intervals for relative risk

We can visualize the association using relative risk with a Cleveland plot, which has

the values on the x-axis
the groups on the y-axis
the estimates represented by a point
the confidence intervals represented by a line

td_rr$measure |> 
  data.frame() |>
  
  # Changing the groups from row names to having their own column:
  rownames_to_column(var = "drive_start") |> 
  
  # Let's remove the row for the reference group
  filter(!is.na(lower)) |> 
  
  # use ggplot() to create our graph
  ggplot(mapping = aes(x = estimate,
                       y = reorder(drive_start, -estimate),
                       color = drive_start)) + 
  
  # Adding the estimate
  geom_point(size = 3) +
  
  # Adding the line for the confidence interval with geom_segment()
  geom_segment(mapping = aes(x = lower,
                             xend = upper,
                             y = drive_start,
                             yend = drive_start),
               size = 1) + 
  
  labs(x = NULL,  # Removing the x-axis label
       y = "Drive Beginning",
       title = "Risk ratio of NFL Drives Ending in a TD \nfor how They Start Compared to a Punt") +
  
  theme_bw() +
  
  theme(legend.position = 'none',
        plot.title = element_text(hjust = 0.5))  +
  
  # and let's add a vertical line at 1 to indicate no difference compared to a punt
  geom_vline(xintercept = 1,
             size = 1,
             linetype = "dashed")

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

The graph is nice because it let’s us quickly see where significant differences occur, and the direction of the difference:

If the entire interval is on the left of the dashed line, then that means the probability of a success for the group is lower than the reference group.
If the entire interval is on the right of the dashed line, then that indicates the probability of a success for that group is higher than the reference group.
If the interval overlaps with the dashed line, then we don’t have any evidence that there is a difference in the probability of success for the group compared to the reference group.

Odds Ratios

We’ll use the oddsratio() function from the epitools package. oddsratio() wants the same table as riskratio() does. In fact, you can just copy and paste the code for riskratio() and just change the function name!

td_or <- 
  oddsratio(x = drive_freq,
            rev = "c",
            method = "wald")

# p.values
td_or$p.value |> 
  data.frame() |> 
  signif(digits = 3)

##              midp.exact fisher.exact chi.square
## Punt                 NA           NA         NA
## Kickoff        0.476000     0.489000   0.475000
## Interception   0.000381     0.000444   0.000262
## Fumble         0.008720     0.008530   0.007630

# Estimates and confidence intervals
td_or$measure |> 
  data.frame() |> 
  signif(digits = 3)

##              estimate lower upper
## Punt             1.00    NA    NA
## Kickoff          1.05 0.921  1.19
## Interception     1.56 1.230  1.99
## Fumble           1.36 1.080  1.71

Since the confidence interval for Kickoff has 1 inside, we don’t have evidence that the odds a drive ends in a touchdown is different for kickoffs compared to a punt at a 5% significance level.

But we can be 95% confident that the **odds* a drive ends in a touchdown is 1.230 to 1.99 times higher if the drive starts from an interception and 1.08 to 1.71 times higher for a fumble than if the drive started with a punt.

we can also create the same type of graph, but now representing the oddsratio instead:

td_or$measure |> 
  data.frame() |>
  
  # Changing the groups from row names to having their own column:
  rownames_to_column(var = "drive_start") |> 
  
  # Let's remove the row for the reference group
  filter(!is.na(lower)) |> 
  
  # use ggplot() to create our graph
  ggplot(mapping = aes(x = estimate,
                       y = reorder(drive_start, -estimate),
                       color = drive_start)) + 
  
  # Adding the estimate
  geom_point(size = 3) +
  
  # Adding the line for the confidence interval with geom_segment()
  geom_segment(mapping = aes(x = lower,
                             xend = upper,
                             y = drive_start,
                             yend = drive_start),
               size = 1) + 
  
  labs(x = NULL,  # Removing the x-axis label
       y = "Drive Beginning",
       title = "Odds ratio of NFL Drives Ending in a TD \nfor how They Start Compared to a Punt") +
  
  theme_bw() +
  
  theme(legend.position = 'none',
        plot.title = element_text(hjust = 0.5))  +
  
  # and let's add a vertical line at 1 to indicate no difference compared to a punt
  geom_vline(xintercept = 1,
             size = 1,
             linetype = "dashed")  + 
  
  # You can optionally add the end points to the graph using 2 instances of geom_text()
  geom_text(mapping = aes(x = lower,
                          label = round(lower, digits = 2)),
            vjust = -0.5) + 
  
  geom_text(mapping = aes(x = upper,
                          label = round(upper, digits = 2)),
            vjust = -0.5)

Two Categorical Variables - Binary Response and Nonbinary Explanatory

STAT 5350

Chapter 2

Summarizing the associations

Risk Ratios

Visualizing the confidence intervals for relative risk

Odds Ratios