Introduction

We were tasked with improving the metric provided to identify the defensive players who were closest to the player of interest/ had the shortest distance from them . We then were tasked with comparing this data and visualizing the results

Description of Project

To modify the metric used to calculate distance from the player, we used the arrange function to sort the players from greatest to least distance from the player of interest. We then isolated the last three players from these results, the players with the shortest distance to the player per frame. To analyze this metric and the metric provided to us, we used a t test to compare.

Data Visualization

To visualize the comparison/ analysis done with the t test, we created a density plot to model the difference between the given and developed results

setwd("C:/Users/genny/OneDrive/Documents/IS470 Sports Analysis")

library(ggplot2)
library(data.table)
library(dplyr)
library(scales)
library(tidytext)
library(RColorBrewer)
library(kableExtra)
library(scales)  
library(patchwork)
library(reshape2)

df_tracking <- fread("Data/NFLBDB2022/tracking2020.csv")
df_plays    <- fread("Data/NFLBDB2022/plays.csv")
df_games    <- fread("Data/NFLBDB2022/games.csv")

# you are creating a metric for a player in a play (playId) in a game (gameId)
my_gameId <- 2021010300
my_playId <- 1586
my_playerNumber <- 19

df <- df_tracking %>%
  filter(gameId == my_gameId & playId == my_playId ) %>%
  left_join(df_plays, by = c("playId" = "playId", "gameId" = "gameId")) %>%
  select(x, y, displayName, jerseyNumber, team, gameId, playId, frameId, time,
         nflId, dis, playDescription, s) %>%
  data.frame()

df_player1 <- df %>%
  filter(jerseyNumber == my_playerNumber) %>%
  select(gameId, playId, frameId, x, y, dis, team) %>%
  data.frame()

results <- df %>%
  left_join(df_player1, by = c( "playId" = "playId", "gameId" = "gameId", "frameId" = "frameId")) %>%
  
  # ----------------------------- here is the code that I created in class -----------------------------
  # compute the distance to each player on the opposing team (not to your teammates and not to the ball)

  mutate(dist = ifelse(team.x != team.y & team.x != "football", 
                       sqrt((x.x - x.y)^2 + (y.x - y.y)^2), 
                       NA)) %>%
  group_by(gameId, playId, frameId) %>%
  mutate(mean_dist_to_player = mean(dist, na.rm = TRUE),
         alpha = 1 - (mean_dist_to_player / 113.33)) %>%


# -------------------------------------start your code below ----------------------------------------

 # Arrange code by frameId and distance to determine closest three players

  arrange(frameId, dist) %>%  # Arrange by frameId and dist
  group_by(frameId, team.x == "away") %>%
  mutate(closest3_distance = nth(dist, 3, order_by = dist),
         closest_players = ifelse(dist <= closest3_distance, 1, 0)) %>%
  ungroup() %>%
  data.frame()

Statistical Test 1: A t test that compares the average distance to target player vs the distance of the closest three playerstwo sample t test to compare if theres a signficant difference in the average distance and closest

t_test_data <- results %>%
  select(mean_dist_to_player, closest3_distance)%>%
  na.omit()

t_test_result <- t.test(t_test_data$mean_dist_to_player,
                        t_test_data$closest3_distance)

print(t_test_result)
## 
##  Welch Two Sample t-test
## 
## data:  t_test_data$mean_dist_to_player and t_test_data$closest3_distance
## t = 10.107, df = 5045.4, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  4.102183 6.076631
## sample estimates:
## mean of x mean of y 
##  27.08562  21.99621
# correlation test

cor_results <- cor.test(t_test_data$mean_dist_to_player,
         t_test_data$closest3_distance , method = "pearson")

print(cor_results)
## 
##  Pearson's product-moment correlation
## 
## data:  t_test_data$mean_dist_to_player and t_test_data$closest3_distance
## t = 371.09, df = 2539, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9901746 0.9915846
## sample estimates:
##       cor 
## 0.9909068
   # ------------------------------------- your code ends here -----------------------------------------

Visualization 2: Three density plots that show the comparison of the original metric, average distance, and the closest three players

#density graph histogram

d_df <- melt(results %>% select(mean_dist_to_player,closest3_distance, dist)%>% filter(!is.na(dist)), id = c())

# Plot
plot1 <- ggplot(d_df, aes(x = value, fill = variable)) +
  geom_density(fill = "lightblue", alpha = 0.2) +
  facet_wrap(~variable, scale = "free", nrow = 1)

# Print the plot
print(plot1)


Conclusion

After filtering and sorting the data to units of interest, we were able to plot and visualize the different densities of the mean distance vs the closest three players’ distance. We found that the Closest 3 players, as expected had shorter distances to the target player overall in comparison to the mean distance of the players.

Again as expected the mean distance has a larger distribution across the graph, peaking at 12-15 units and again at 50

There is significant overlap around 12-15 units, suggesting correlation between the mean and the closest three players around that time

These plots suggests that players mean distance includes farther players leading to a spread out distribution, but upon consolidating the data of interest to the closest three players, results in consistently shorter distances. This is relevant in certain tactical contexts like analyzing traffic in this example and/or close player interactions or pressure

talk about t test and correlation test

We also used a Pearson correlation test to determine the correlation between the average distance the three closest players. The results were as follows: a strong positive correlation : .99 indicating a strong positive correlation between the two variables p-value: <2.2e-16 very low p value indicating strong likelihood that the correlation is not random and supports the hypothesis that this is a meaningful relationship
confidence interval of 95% from 0.9901746 to 0.9915846, which is very close to 1 and indicates true correlation being close to .99