Sports Analytics Assignment 3

# this was to figure out how to create chunks
library(ggplot2)
library(ggalt)
library(ggforce)
library(hms)
library(gganimate)
library(data.table)
library(dplyr)
library(RColorBrewer)
library(nflfastR)
library(ggimage)
library(png)
library(gifski)
#Here is my code

Project and Metric Explantaion

For our project, we focused on improving the code provided to us by Paul Tallon by narrowing down the data provided for a more accurate analysis.

We started this off by finding the 25th and 75th quartile data points for the mean of the distance calculated in the prior code. We used this to identify relevant ranges for us to filter our data down with only to find the “impactful data”. These quartiles (calculated at 12.3 and 48.2) give us the range so that we always have the back 25% of data taken while removing the front 25% (during which the majority of players are too far away to be impactful).

After initializing a data frame, a for loop is created to run 10,000 times to reduce variation. The for loop uses a uniform distribution and the quartiles we calculated earlier take a randomly generated number between the ends. Then, we create a new variable ” rev_dist” that only pulls from “dist” column if the calculated straight-line distance is shorter than the cutoff.

This means that in addition to the football and friendly players on the field, any players farther than the calculated range (identified in variable “limit”) are not included in the data, since their impact is negligible at that point.

A second variable “rev_mean_dist_to_player” is then created by averaging the eligible players on the field (those within the distance cutoff who have been awarded subsequent “rev_dist” values. This gives us a more accurate look at the average distance from players on the field to the player with the ball without having players that are way too far away to impact the ball, skewing the data.

To compare these methods, we then ran a t-test and a correlation test to determine how comparable the two average distance metrics were to each other.

We were able to conclude that the two tests were statistically similar to each other with average p_values and correlations printed below. This shows how although the two tests’ results are similar, ours is more refined focussing on more of the relevant player and excluding the player data where impact on the ball is likely, This provided us with a better idea of analyzing the pressure observed on the field.

#remmeber to set the wd to where I need it to be
setwd("C:/New Data")
file.path(getwd(),"Data/NFLBDB2022")

## [1] "C:/New Data/Data/NFLBDB2022"

# ---------- supplied functions start here -----------------------------------------------------------------


# ---------------------------------------------------------------------------
# --------------------------- Load Libraries --------------------------------
# ---------------------------------------------------------------------------
# supply a vector of desired packages. Packages will be installed if not
# already installed and available to RStudio


load_packages <- function(packages = c()) {
  if (length(packages) == 0) {
    print('You did not specify any packages/libraries to load')
  }
  else {
    for (i in packages){
      if(! i %in% installed.packages()){
        install.packages(i, dependencies = TRUE)
      }
      suppressMessages(suppressWarnings(library(i, character.only=T)))
    }
  }
}



load_packages(c("data.table","dplyr","ggplot2", "ggalt", "ggforce","hms", "gganimate", "RColorBrewer", "nflfastR", "ggimage", "png", "gifski"))
# -----------------------------------------------------------------------------------------------
# ------------ Checks for a valid playId and gameId; optionally check for valid frameId ---------
# -----------------------------------------------------------------------------------------------








# ----------  main code starts here ----------------------------------------------------------------------

load_packages(c("ggplot2", "ggalt", "ggforce", "hms", "gganimate", "lubridate", "data.table", "dplyr", "nflfastR", "gifski", "png", "ggimage", "knitr", "kableExtra"))

df_tracking <- fread("Data/NFLBDB2022/tracking2020.csv")
df_plays    <- fread("Data/NFLBDB2022/plays.csv")
df_games    <- fread("Data/NFLBDB2022/games.csv")


# you are creating a metric for a player in a play (playId) in a game (gameId)
my_gameId <- 2021010300
my_playId <- 1586 
my_playerNumber <- 19


df <- df_tracking %>%
  filter(gameId == my_gameId & playId == my_playId ) %>%
  left_join(df_plays, by = c("playId" = "playId", "gameId" = "gameId")) %>%
  select(x, y, displayName, jerseyNumber, team, gameId, playId, frameId, time, 
         nflId, dis, playDescription, s) %>%
  data.frame()





team1 <- as.character(  df_games[df_games$gameId == my_gameId, "homeTeamAbbr"]  )
team2 <- as.character(  df_games[df_games$gameId == my_gameId, "visitorTeamAbbr"]  )


df_team1 <- teams_colors_logos %>%
  filter(team_abbr == team1)


df_team2 <- teams_colors_logos %>%
  filter(team_abbr == team2)


df_player1 <- df %>%
  filter(jerseyNumber == my_playerNumber) %>%
  select(gameId, playId, frameId, x, y, dis, team) %>%
  data.frame()


df <- df %>%
  left_join(df_player1, by = c( "playId" = "playId", "gameId" = "gameId", "frameId" = "frameId")) %>%
  
  # ----------------------------- here is the code that I created in class -----------------------------
# compute the distance to each player on the opposing team (not to your teammates and not to the ball)
mutate(dist = ifelse( team.x != team.y & team.x != "football", 
                      sqrt( (x.x - x.y)^2 + (y.x - y.y)^2), 
                      NA)) %>%
  
  group_by(gameId, playId, frameId) %>%
  
  # compute the average of all distances
  mutate(mean_dist_to_player = mean(dist, na.rm = TRUE ),
         alpha = 1 - (mean_dist_to_player/113.33)  ) %>%
  ungroup() 
  #data.frame()
  # -------------------------------------- my code ends here ------------------------------------------
# -------------------------------------start your code below ----------------------------------------
#Determining Our Range


quant1 <- quantile(df$mean_dist_to_player, .25)
quant2 <- quantile(df$mean_dist_to_player, .75)

results <- data.frame(limit = numeric(), t_statistic = numeric(), p_value = numeric(), correlation = numeric())

for (i in 1:10000) {
  limit <- runif(1, min = quant1, max = quant2)
   df <- df %>% 
    mutate(rev_dist = ifelse(dist < limit, dist, NA)) %>%
    group_by(frameId) %>%
    mutate(rev_mean_dist_to_player = mean(rev_dist, na.rm = TRUE)) %>%
    mutate(rev_mean_dist_to_player = ifelse(is.nan(rev_mean_dist_to_player), NA, rev_mean_dist_to_player)) %>%
    ungroup() 
    
  
  tt = t.test(df$mean_dist_to_player, df$rev_mean_dist_to_player)
  cor_value = cor(df$mean_dist_to_player, df$rev_mean_dist_to_player, method = "pearson", use = "complete.obs")
  
  results = rbind(results, data.frame(limit = limit, t_statistic = tt$statistic, p_value = tt$p.value, correlation = cor_value))
}

# Print statements with labels
print(paste("", "mean p_value:", mean(results$p_value)))

## [1] " mean p_value: 1.40508819864771e-96"

print(paste("", "mean correlation:", mean(results$correlation)))

## [1] " mean correlation: 0.706588052836873"

#Correlation test
cor(df$mean_dist_to_player, df$rev_mean_dist_to_player, method = c("pearson", "kendall", "spearman"), use = "complete.obs")

## [1] 0.9743552

  # ------------------------------------- your code ends here -----------------------------------------
# compute a running total of the distance traveled
#arrange(nflId, frameId) %>%
  #group_by(nflId) %>%
  #mutate( dist_running_total = cumsum(dis.x)) %>%
  #rename( "x" = "x.x",
          #"y" = "y.x",
          #"team" = "team.x",
          #"dis" = "dis.x") %>%
  


# run the line below if you want to visualize the play
#visualize_play(df, df_team1, df_team2, playerNumber = 19)

# think about what statistical tests to run to compare my Euclidean measure with your measures
# correlation
# t-test
# prob density functions overlapping

#cor(df$dist, df$rev_dist, method = c("pearson", "kendall", "spearman"), use = "complete.obs")


ggplot(data = df[!is.na(df$rev_mean_dist_to_player),], aes(rev_mean_dist_to_player, fill = "red"), color = "black") +
  geom_density(position = "identity", alpha=0.5)

x <- df %>%
  select(mean_dist_to_player, rev_mean_dist_to_player) %>% stack() %>%
  data.frame() 


ggplot(data = x[!is.na(x$values),], aes(values, group = ind, fill = ind)) + geom_density(alpha=0.5)

print(head(results, 10 ))

##       limit t_statistic      p_value correlation
## t  33.35590    45.64800  0.00000e+00   0.9279480
## t1 27.39759    54.64683  0.00000e+00   0.8498197
## t2 32.17694    46.95815  0.00000e+00   0.9166306
## t3 35.29571    42.63508  0.00000e+00   0.9462354
## t4 23.17084    61.66328  0.00000e+00   0.6340451
## t5 23.26151    61.49464  0.00000e+00   0.6584847
## t6 27.04559    55.66314  0.00000e+00   0.8234476
## t7 33.93802    44.64231  0.00000e+00   0.9367159
## t8 47.59498    22.28513 3.57539e-107   0.9878944
## t9 33.14170    45.76013  0.00000e+00   0.9276389

Sports Analytics Assignment 3

Brad Kramer and Drew Carrier

11/04/2024

Project and Metric Explantaion