Introduction

Exploring Mean Distance between Players in NFL Plays

This analysis examines the distance between the ball carrier and potential tacklers during an NFL play using player tracking data. We use an example play (a punt return by Isaiah McKenzie of the Buffalo Bills versus the Baltimore Ravens on January 3, 2021) to compare two methods for calculating distances: Euclidean distance and Manhattan distance. Comparing the two shows how the choice of metric changes the measured spacing between players.

# Load required libraries
library(data.table)
library(ggplot2)
library(dplyr)
library(tidyr)
library(scales)  # For percentage formatting

Comparison of Methods: Euclidean vs. Manhattan Distance

Euclidean Distance:

Formula: √((X₁−X₂)² + (Y₁−Y₂)²)

Measures straight-line distance, providing the shortest path between two players. Pros: Ideal for assessing immediate proximity, such as determining how close a defender is to making a tackle. Cons: Does not account for field constraints or practical movement patterns like avoiding other players or following routes.

Manhattan Distance:

Formula: ∣X₁−X₂∣ + ∣Y₁−Y₂∣

Measures the total horizontal and vertical distance, reflecting grid-like movement paths. Pros: Better represents practical player movement constrained by field layout or tactical situations. Cons: Overestimates straight-line distances, which may not be realistic for plays requiring direct interaction (e.g., tackles).
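As a minimal worked example of the two formulas, the sketch below evaluates both metrics for one pair of made-up coordinates. The helper names `euclidean_distance` and `manhattan_distance` and the positions are illustrative and are not part of the original analysis.

```r
# Hypothetical helpers for illustration; not part of the original analysis.
euclidean_distance <- function(x1, y1, x2, y2) {
  sqrt((x1 - x2)^2 + (y1 - y2)^2)
}

manhattan_distance <- function(x1, y1, x2, y2) {
  abs(x1 - x2) + abs(y1 - y2)
}

# Made-up positions: a returner at (30, 20) and a defender at (33, 24).
euclidean_distance(30, 20, 33, 24)  # 5 (a 3-4-5 right triangle)
manhattan_distance(30, 20, 33, 24)  # 7 (3 + 4)
```

The two values agree only when the players share an x or y coordinate; any diagonal offset makes the Manhattan value larger.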

# df holds tracking rows for every player and the football;
# df_player1 is assumed to hold only the ball carrier's rows.
df2 <- df %>%
  left_join(df_player1, by = c("gameId", "playId", "frameId")) %>%
  
  # Distance from the ball carrier (.y columns) to each opposing player
  # (.x columns); teammates and the football are set to NA
  mutate(euclidean_dist = ifelse(team.x != team.y & team.x != "football", 
                                 sqrt((x.x - x.y)^2 + (y.x - y.y)^2), 
                                 NA_real_),
         manhattan_dist = ifelse(team.x != team.y & team.x != "football", 
                                 abs(x.x - x.y) + abs(y.x - y.y), 
                                 NA_real_)) %>%
  
  # Per-frame mean of each distance type across opposing players
  group_by(gameId, playId, frameId) %>%
  mutate(mean_euclidean_dist = mean(euclidean_dist, na.rm = TRUE),
         mean_manhattan_dist = mean(manhattan_dist, na.rm = TRUE)) %>%
  ungroup()
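To make the join-and-filter logic easier to follow, here is a small self-contained sketch that runs the same steps on a single made-up frame. The data frame `toy`, the player ids, and the choice of `nflId` 10 as ball carrier are all assumptions for illustration; the real tracking data may be laid out differently.

```r
library(dplyr)

# Toy tracking data for a single frame; every value here is made up.
toy <- tibble(
  gameId = 1L, playId = 1L, frameId = 1L,
  nflId  = c(10L, 11L, 20L, 21L, NA),
  team   = c("home", "home", "away", "away", "football"),
  x      = c(30, 35, 33, 30, 30),
  y      = c(20, 20, 24, 32, 20)
)

# Assumption: nflId 10 is the ball carrier (the role df_player1 plays above).
carrier <- toy %>%
  filter(nflId == 10L) %>%
  select(gameId, playId, frameId, team, x, y)

toy_dist <- toy %>%
  left_join(carrier, by = c("gameId", "playId", "frameId")) %>%
  mutate(
    # TRUE only for players on the other team (not teammates, not the ball)
    opponent       = team.x != team.y & team.x != "football",
    euclidean_dist = ifelse(opponent, sqrt((x.x - x.y)^2 + (y.x - y.y)^2), NA_real_),
    manhattan_dist = ifelse(opponent, abs(x.x - x.y) + abs(y.x - y.y), NA_real_)
  )

toy_means <- toy_dist %>%
  group_by(gameId, playId, frameId) %>%
  summarise(
    mean_euclidean_dist = mean(euclidean_dist, na.rm = TRUE),
    mean_manhattan_dist = mean(manhattan_dist, na.rm = TRUE),
    .groups = "drop"
  )

toy_means  # one row: mean Euclidean 8.5, mean Manhattan 9.5
```

Only the two "away" players contribute: one is 5 (Euclidean) and 7 (Manhattan) units away, the other is 12 units away under both metrics because it shares the carrier's x coordinate.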

Visualization 1: Distance Distributions

The plot “Comparison of Euclidean and Manhattan Distance Distributions” compares the two metrics:

Key Insights: Both distributions share similar shapes, suggesting that they capture the same overall spatial patterns. The Manhattan Distance curve consistently lies to the right, because Manhattan Distance is never smaller than Euclidean Distance for the same pair of players. The gap between the two is largest when players are offset diagonally and vanishes when they are aligned along one axis.
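The rightward shift of the Manhattan curve can be stated as a general bound. Writing Δx = ∣X₁−X₂∣ and Δy = ∣Y₁−Y₂∣ (shorthand introduced here, not used elsewhere in the analysis), the two metrics satisfy:

```latex
\[
\sqrt{\Delta x^{2} + \Delta y^{2}}
\;\le\; \Delta x + \Delta y
\;\le\; \sqrt{2}\,\sqrt{\Delta x^{2} + \Delta y^{2}}
\]
% Left inequality:  (\Delta x + \Delta y)^2 = \Delta x^2 + \Delta y^2 + 2\,\Delta x\,\Delta y \ge \Delta x^2 + \Delta y^2
% Right inequality: 2(\Delta x^2 + \Delta y^2) - (\Delta x + \Delta y)^2 = (\Delta x - \Delta y)^2 \ge 0
```

The left side holds with equality when Δx or Δy is zero, and the right side when Δx = Δy, so the Manhattan value is always between 1 and √2 (about 1.41) times the Euclidean value.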

library(ggplot2)

ggplot(df2 %>% filter(!is.na(euclidean_dist) & !is.na(manhattan_dist))) +
  geom_density(aes(x = euclidean_dist, fill = "Euclidean"), alpha = 0.5) +
  geom_density(aes(x = manhattan_dist, fill = "Manhattan"), alpha = 0.5) +
  labs(title = "Comparison of Euclidean and Manhattan Distance Distributions",
       x = "Distance",
       y = "Density") +
  scale_fill_manual(values = c("Euclidean" = "blue", "Manhattan" = "red")) +
  theme_minimal()

Visualization 2: Distribution of Distance Differences (Manhattan - Euclidean)

The plot “Distribution of Distance Differences (Manhattan - Euclidean)” shows how the two metrics differ:

Key Insights: Most differences fall between 0 and 10 units, with the highest density around 4-5 units, indicating modest deviations for the majority of cases. The right tail shows some extreme differences (up to ~17.7), which arise in frames where players are far apart and offset diagonally. Differences near zero occur when players are close together or nearly aligned along one axis; the gap grows with both separation and diagonal offset.
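The dependence on separation and direction can be made explicit. For two players a Euclidean distance d apart, at angle θ to the x-axis, the difference is d(∣cos θ∣ + ∣sin θ∣ − 1). The sketch below evaluates that expression for a few made-up cases; `metric_gap` is a hypothetical helper, not part of the original analysis.

```r
# Manhattan minus Euclidean distance for two players separated by
# Euclidean distance d at angle theta (radians) relative to the x-axis.
# Hypothetical helper for illustration only.
metric_gap <- function(d, theta) {
  d * (abs(cos(theta)) + abs(sin(theta)) - 1)
}

metric_gap(10, 0)        # 0: players aligned along the x-axis
metric_gap(10, pi / 4)   # about 4.14: the largest possible gap for d = 10
metric_gap(40, pi / 4)   # about 16.57

# Smallest Euclidean separation that can produce a gap of 17.7 units
17.7 / (sqrt(2) - 1)     # about 42.7
```

Under this formula, the largest observed difference of about 17.7 units can only occur when the two players are at least roughly 42.7 units apart in a straight line.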

df2 <- df2 %>%
  mutate(distance_diff = manhattan_dist - euclidean_dist)

ggplot(df2 %>% filter(!is.na(distance_diff)), aes(x = distance_diff)) +
  geom_histogram(binwidth = 0.5, fill = "purple", color = "black") +
  labs(title = "Distribution of Distance Differences (Manhattan - Euclidean)",
       x = "Difference (Manhattan - Euclidean)",
       y = "Count") +
  theme_minimal()

Summary Statistics: The mean difference between Manhattan and Euclidean distances is 4.79 units. The median of 4.29 is slightly below the mean, indicating a mild right skew in which a minority of larger differences pulls the mean upward. The standard deviation of 3.48 units indicates moderate variability. The minimum of 0 is consistent with two players sharing an x or y coordinate, and the maximum of 17.7 units comes from long diagonal separations.

df2 %>%
  summarize(mean_diff = mean(distance_diff, na.rm = TRUE),
            median_diff = median(distance_diff, na.rm = TRUE),
            sd_diff = sd(distance_diff, na.rm = TRUE),
            min_diff = min(distance_diff, na.rm = TRUE),
            max_diff = max(distance_diff, na.rm = TRUE))

## # A tibble: 1 × 5
##   mean_diff median_diff sd_diff min_diff max_diff
##       <dbl>       <dbl>   <dbl>    <dbl>    <dbl>
## 1      4.79        4.29    3.48        0     17.7

Paired t-Test Between Metrics: The paired t-test reports a highly significant difference between Manhattan and Euclidean distances (p < 2.2 × 10⁻¹⁶, t = 69.394, df = 2540). The mean difference is 4.79 units, with a 95% confidence interval of [4.656, 4.927]. Because Manhattan distance can never be smaller than Euclidean distance, the direction of the difference is an inherent property of the two metrics rather than something the test discovers; the informative result is the size of the gap, about 4.8 units on average for this play.

# t.test() has no na.rm argument; incomplete pairs are dropped automatically
t_test_result <- t.test(df2$manhattan_dist, df2$euclidean_dist, paired = TRUE)
print(t_test_result)

## 
##  Paired t-test
## 
## data:  df2$manhattan_dist and df2$euclidean_dist
## t = 69.394, df = 2540, p-value < 2.2e-16
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
##  4.656008 4.926792
## sample estimates:
## mean difference 
##          4.7914
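A paired t-test is a one-sample t-test on the per-pair differences. The sketch below illustrates this on simulated offsets (made-up data generated with `runif`, not the tracking data used above) and also checks that every simulated difference is non-negative.

```r
set.seed(1)

# Simulated horizontal and vertical offsets; made-up data for illustration.
dx <- runif(200, 0, 30)
dy <- runif(200, 0, 30)

euclid    <- sqrt(dx^2 + dy^2)
manhattan <- dx + dy

# Paired test on the two metrics vs. one-sample test on their difference
paired     <- t.test(manhattan, euclid, paired = TRUE)
one_sample <- t.test(manhattan - euclid, mu = 0)

all.equal(unname(paired$statistic), unname(one_sample$statistic))  # TRUE
all(manhattan >= euclid)                                           # TRUE
```

Since no pair can have a negative difference, a very large t-statistic is expected for any reasonably sized sample, whatever play is analyzed.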

Conclusion

This analysis shows a consistent gap between Euclidean and Manhattan distances in the example play, averaging about 4.8 units. Manhattan Distance may better capture practical movement patterns and tactical constraints, making it a candidate for analyzing defensive coverage, zone movement, or route navigation. In contrast, Euclidean Distance is suited to proximity-based interactions, such as assessing the likelihood of a tackle or catch. Combining these metrics can provide a more comprehensive understanding of player dynamics, offering useful input for coaching strategies, play design, and defensive analysis.