Introduction/ Description

In our Project, we set out to determine the impact of motion on data. We decided to do this by taking the data provided by the NFLs Big Data Bowl to predict player movements through Kalman filters. Following this, we used our Random Forrest to classify and group it into sucsess’s and failures. We where then able to analyze and compare the impact motion had on the plays by breaking it into Motion and Non Motion sets and running that data. Eventually we broke it down further, comparing specific type of motion plays to make reccomendations about said types.

Settup and Preperation

First, we settup our R studio run by connecting it with our directorys, and loaded in some packages and functions to run code more effectively

#Set working Directory/Load libraries
setwd("C:/")
directory = paste0(getwd(), "/Data/NFLBDB2025")
source('https://raw.githubusercontent.com/ptallon/SportsAnalytics_Fall2024/main/SharedCode.R')
load_packages(c("ggplot2", "ggalt", "ggforce", "hms", "gganimate", "lubridate", "data.table", "dplyr", "nflfastR", "gifski", "png", "ggimage", "tidyverse", "caret", "randomForest", "kableExtra", "viridis"))

Next, we filtered our data to remove the unnecessary rows and re-order, before setting up our statement to load in our desired data.

#Define remove_frames_before_lineset
remove_frames_before_lineset <- function(df, renumberFrames = TRUE) {  
  # Keep rows after the 'line_set' event is found
  lineset_df <- df %>%
    arrange(gameId, playId, frameId) %>%
    group_by(playId, gameId) %>%
    mutate(m1 = which(event == 'line_set', arr.ind = TRUE)[1]) %>%
    filter(row_number(gameId) >= unique(m1)) %>%
    ungroup() %>%
    data.frame()
  
  # Renumber frames if the flag is TRUE
  if (renumberFrames) {
    lineset_df <- lineset_df %>%
      group_by(gameId, playId) %>%
      mutate(frameId = frameId - min(frameId) + 1) %>%
      ungroup() %>%
      data.frame()
  }
  return(lineset_df)
}

DataLoader <- function(directory, 
                       merge = FALSE,
                       columns = c(),
                       sample_fraction = 0.1,
                       renumberFrames = TRUE) {
  # Initialize an empty list to store sampled data frames
  sampled_dfs <- list()
  specific_play_data <- data.frame()
  
  for (week in seq(1, 9)) {
      # Load the data for the current week
      df <- load_data_for_one_week(directory, week, merge, columns)
      
      # Apply remove_frames_before_lineset
      df <- remove_frames_before_lineset(df, renumberFrames = renumberFrames)
      
      # Identify unique plays
      unique_plays <- df %>%
        distinct(playId, gameId)
      
      # Sample 1/10 of the plays
      sampled_plays <- unique_plays %>%
        sample_frac(size = sample_fraction)
      
      # Filter the data to include only the sampled plays
      df_sample <- df %>%
        semi_join(sampled_plays, by = c("playId", "gameId"))
      
      # Append the sampled data to the list
      sampled_dfs[[week]] <- df_sample
      
      # Ensure the specific play is included
      specific_play_data <- rbind(specific_play_data, df %>% filter(playId == 101 & gameId == 2022090800))
  }
  
  # Combine all sampled data frames into one
  combined_df <- bind_rows(sampled_dfs)
  
  # Ensure the final dataframe contains all rows where playId == 101 and gameId == 2022090800
  final_df <- bind_rows(combined_df, specific_play_data) %>%
    distinct()  # Remove any duplicate rows
  
  return(final_df)
}

After that, we called our function to load in our data with relivant columns. We took a random sample of 1/10 of the data to get a fair spread from across the weeks, but without taking in too much data.

#Calling Data Loading functions
df<- DataLoader(directory, merge = TRUE, columns = c("gameId", "playId", "frameId","x", "y", "yardsGained", "event", "inMotionAtBallSnap", "shiftSinceLineset", "motionSinceLineset", "s", "club", "nflId", "displayName", "team", "dir", "dis", "yardsToGo", "quarter", "down"), renumberFrames = TRUE)

Initial Stats

We created some initial graphs to show the descriptive results of the data on motion before messing with the code to get a base hypothesis.

TwoXTwoTab2 <- df %>%
  # Create a flag indicating if motion is present in a play
  mutate(MotionPresent = ifelse(inMotionAtBallSnap == TRUE, TRUE, FALSE)) %>%
  # Arrange data so that rows with motion are prioritized
  arrange(gameId, playId, desc(MotionPresent)) %>%
  # Select distinct gameId and playId, keeping rows with motion when available
  distinct(gameId, playId, .keep_all = TRUE) %>% 
  # Create Outcome and Motion columns with descriptive labels
  mutate(Outcome = ifelse(yardsGained > 0, 'Positive_Yards', 'No_Yards'),
         Motion = ifelse(MotionPresent, "Motion_Present", "No_Motion")) %>%
  # Ensure Outcome and Motion are not NA
  filter(!is.na(Outcome) & !is.na(Motion)) %>%
  # Group by Motion and Outcome
  group_by(Motion, Outcome) %>%
  # Summarize count and calculate percentages within each Motion group
  summarise(n = n(), .groups = 'drop') %>% 
  group_by(Motion) %>% 
  mutate(percentage = (n / sum(n)) * 100) %>%
  # Reshape the table and set the top-left column header to an empty string
  select(Motion, Outcome, percentage) %>%
  pivot_wider(names_from = Outcome, values_from = percentage, values_fill = 0) %>%
  data.frame()
# Output the TwoXTwoTab2 table
knitr::kable(TwoXTwoTab2, caption = "Motion vs Outcome Table (2x2)", format = "html")%>%
  kable_styling(bootstrap_options = c("striped", "hover"))

Motion vs Outcome Table (2x2)
Motion	No_Yards	Positive_Yards
Motion_Present	29.48113	70.51887
No_Motion	28.40801	71.59199

#Breaking down motion further
ThreexTwo <- df %>%
  mutate(MotionType = case_when(
    shiftSinceLineset ~ "Shift",
    motionSinceLineset & !inMotionAtBallSnap ~ "Jet Motion",
    inMotionAtBallSnap ~ "Orbit Motion")) %>%
  # Prioritize rows with motion for distinct plays
  arrange(gameId, playId, desc(inMotionAtBallSnap)) %>%
  distinct(gameId, playId, .keep_all = TRUE) %>%
  # Create descriptive labels for Outcome and MotionType
  mutate(Outcome = ifelse(yardsGained > 0, 'Positive_Yards', 'No_Yards'),
         MotionType = ifelse(is.na(MotionType), "No Motion", MotionType)) %>%
  # Exclude "No Motion"
  filter(MotionType != "No Motion") %>%
  # Ensure no NA values in Outcome or MotionType
  filter(!is.na(Outcome) & !is.na(MotionType)) %>%
  # Group by MotionType and Outcome
  group_by(MotionType, Outcome) %>%
  # Summarize counts
  summarise(n = n(), .groups = 'drop') %>%
  group_by(MotionType) %>%
  mutate(
    percentage = (n / sum(n)) * 100 # Calculate percentage within each MotionType group
  ) %>%
  # Reshape the table
  select(MotionType, Outcome, percentage) %>%
  pivot_wider(names_from = Outcome, values_from = percentage, values_fill = 0) %>%
  distinct(MotionType, .keep_all = TRUE) %>%
  # Arrange by descending percentage for Positive_Yards
  arrange(desc(`Positive_Yards`)) %>%
  data.frame()

# Output the ThreexTwo table
knitr::kable(ThreexTwo, caption = "Motion Type Breakdown (3x2)") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

Motion Type Breakdown (3x2)
MotionType	No_Yards	Positive_Yards
Shift	12.50000	87.50000
Jet Motion	13.51351	86.48649
Orbit Motion	29.49640	70.50360

Kalman Filter(s)

We took several steps, the first being creating a kalman filter for a individual player, then one for all of the players on a single play. Ultimately we created one to work for all plays in our data, and forgoed the individual player one. In our data we use the Kalman filter applied across all data for our analyasis and modeling, and the individual play to hep us with additional visualizations.

Kalman filter initial settup

# |-----------------Kalman Filter for a Player -------------------|
# Filter data and convert columns to numeric
gamer_df <- df %>%
  filter(club != "football") %>%
  data.frame()

gamer_df$x <- as.numeric(as.character(gamer_df$x))
gamer_df$y <- as.numeric(as.character(gamer_df$y))
gamer_df$s <- as.numeric(as.character(gamer_df$s))
gamer_df$dir <- as.numeric(as.character(gamer_df$dir))
gamer_df$dis <- as.numeric(as.character(gamer_df$dis))

gamer_df <- gamer_df %>%
  filter(!is.na(x) & !is.na(y) & !is.na(s) & !is.na(dir) & !is.na(dis))

# Kalman filter function with velocity
kalman_filter_velocity <- function(dt, process_noise, measurement_noise, initial_state) {
  state <- initial_state  # Use provided initial state
  P <- diag(6)
  F <- matrix(c(
    1, 0, dt, 0, 0.5 * dt^2, 0,
    0, 1, 0, dt, 0, 0.5 * dt^2,
    0, 0, 1, 0, dt, 0,
    0, 0, 0, 1, 0, dt,
    0, 0, 0, 0, 1, 0,
    0, 0, 0, 0, 0, 1
  ), nrow = 6, byrow = TRUE)
  H <- matrix(c(
    1, 0, 0, 0, 0, 0,
    0, 1, 0, 0, 0, 0
  ), nrow = 2, byrow = TRUE)
  Q <- diag(process_noise, 6)
  R <- diag(measurement_noise, 2)
  
  list(
    predict = function() {
      state <<- F %*% state
      P <<- F %*% P %*% t(F) + Q
    },
    update = function(z) {
      y <- z - H %*% state
      S <- H %*% P %*% t(H) + R
      K <- P %*% t(H) %*% solve(S)
      state <<- state + K %*% y
      P <<- (diag(6) - K %*% H) %*% P
    },
    get_state = function() {
      return(state)
    }
  )
}

# Example usage
dt <- 0.1
process_noise <- 0.1
measurement_noise <- 1.0

Play application of the Kalman filter

# | -------------------Single-Play Application----------------------------------------------|

# Initialize empty data frames to store results
play_predictions <- data.frame()  # To store predicted values for all players
play_metrics <- data.frame()  # To store performance metrics for all players

# Combine all player data for the play
play_data <- gamer_df %>%
  filter(playId == 101 & gameId == 2022090800)  

# Initialize Kalman filter with the first frame's data
initial_state <- matrix(c(
  play_data$x[1], 
  play_data$y[1], 
  0,  # Initial velocity in the x direction (assuming 0 at the start)
  0,  # Initial velocity in the y direction (assuming 0 at the start)
  0,  # Velocity in the z direction (assuming 0 at the start)
  0   # Error covariance (initial assumption)
), nrow = 6, ncol = 1)

# Initialize Kalman filter
kf_velocity <- kalman_filter_velocity(dt, process_noise, measurement_noise, initial_state)

# Iterate over each frame for all players
for (i in 1:nrow(play_data)) {
  # Get the observed measurements for this frame (position x, y)
  z <- matrix(c(play_data$x[i], play_data$y[i]), ncol = 1)
  
  # Predict the next state based on the Kalman filter
  kf_velocity$predict()
  
  # Update the Kalman filter with the new observation
  kf_velocity$update(z)
  state <- kf_velocity$get_state()
  
  # Store the predicted values
  play_predictions <- rbind(
    play_predictions, 
    data.frame(frameId = play_data$frameId[i], 
               nflId = play_data$nflId[i], 
               x_pred = state[1], 
               y_pred = state[2], 
               x = play_data$x[i], 
               y = play_data$y[i], 
               s = play_data$s[i], 
               team = play_data$team[i])
  )
}
# Calculate performance metrics (for the entire dataset)
mse <- mean((play_data$x - play_predictions$x_pred)^2 + (play_data$y - play_predictions$y_pred)^2)
rmse <- sqrt(mse)
ss_total <- sum((play_data$x - mean(play_data$x))^2 + (play_data$y - mean(play_data$y))^2)
ss_residual <- sum((play_data$x - play_predictions$x_pred)^2 + (play_data$y - play_predictions$y_pred)^2)
r_squared <- 1 - (ss_residual / ss_total)

# Append metrics to the metrics data frame
play_metrics <- rbind(play_metrics, data.frame(
  mse = mse,
  rmse = rmse,
  r_squared = r_squared
))

# Print overall performance metrics for the entire dataset
knitr::kable(play_metrics, caption = "Play Kalman Filter Performance Metrics") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

Play Kalman Filter Performance Metrics
mse	rmse	r_squared
31.55963	5.617796	0.5348461

As you can see, the metrics (especially r^2) is within reason for the play itteration. However, this is not high enough and we wanted to increase the size so we ran the results on more data, leading to the query below. Initially we had this as a larger sample, but after my computer ran all night without being able to finish, we were forced to scale back how large our sample was. Regardless, it generated an effective r^2 metric, showing the effectiveness of the model.

Whole DF Kalman filter implimentation

# |-------------------------- Full DF Kalman----------------------------------------------------|

# Initialize empty data frames to store results
all_predictions <- data.frame()  # To store predicted values for all frames
all_metrics <- data.frame()  # To store overall performance metrics for the entire data


# Get unique playId and gameId combinations
unique_play_game_pairs <- unique(gamer_df[, c("playId", "gameId")])

# Sample of the unique playId and gameId pairs to reduce dataset size
sampled_play_game_pairs <- unique_play_game_pairs[sample(nrow(unique_play_game_pairs), size = nrow(unique_play_game_pairs) / 90, replace = FALSE), ]

# Initialize Kalman filter with the first frame's data from the first play
initial_state_kalman <- matrix(c(
  gamer_df$x[1], 
  gamer_df$y[1], 
  0,  # Initial velocity in the x direction (assuming 0 at the start)
  0,  # Initial velocity in the y direction (assuming 0 at the start)
  0,  # Velocity in the z direction (assuming 0 at the start)
  0   # Error covariance (initial assumption)
), nrow = 6, ncol = 1)

# Initialize Kalman filter for the entire dataset
kf_velocity_kalman <- kalman_filter_velocity(dt, process_noise, measurement_noise, initial_state_kalman)

# Iterate over each sampled playId and gameId combination
for (i in 1:nrow(sampled_play_game_pairs)) {
  # Get the current playId and gameId
  current_play_id <- sampled_play_game_pairs$playId[i]
  current_game_id <- sampled_play_game_pairs$gameId[i]
  
  # Filter the data for the current playId and gameId combination
  current_play_data_kalman <- gamer_df %>%
    filter(playId == current_play_id & gameId == current_game_id)
  
  # Iterate over each frame for the current play
  for (j in 1:nrow(current_play_data_kalman)) {
    # Get the observed measurements for this frame (position x, y)
    z <- matrix(c(current_play_data_kalman$x[j], current_play_data_kalman$y[j]), ncol = 1)
    
    # Predict the next state based on the Kalman filter
    kf_velocity_kalman$predict()
    
    # Update the Kalman filter with the new observation
    kf_velocity_kalman$update(z)
    state_kalman <- kf_velocity_kalman$get_state()
    
    # Store the predicted values for the current frame
    all_predictions <- rbind(
      all_predictions, 
      data.frame(frameId = current_play_data_kalman$frameId[j], 
                 nflId = current_play_data_kalman$nflId[j], 
                 playId = current_play_id,  # Add playId to the predictions
                 gameId = current_game_id,  # Add gameId to the predictions
                 x_pred = state_kalman[1], 
                 y_pred = state_kalman[2], 
                 x = current_play_data_kalman$x[j], 
                 y = current_play_data_kalman$y[j], 
                 s = current_play_data_kalman$s[j], 
                 team = current_play_data_kalman$team[j])
    )
  }
}

# Calculate overall performance metrics for the entire dataset
mse_kalman <- mean((all_predictions$x - all_predictions$x_pred)^2 + (all_predictions$y - all_predictions$y_pred)^2)
rmse_kalman <- sqrt(mse_kalman)
ss_total_kalman <- sum((all_predictions$x - mean(all_predictions$x))^2 + (all_predictions$y - mean(all_predictions$y))^2)
ss_residual_kalman <- sum((all_predictions$x - all_predictions$x_pred)^2 + (all_predictions$y - all_predictions$y_pred)^2)
r_squared_kalman <- 1 - (ss_residual_kalman / ss_total_kalman)

# Store overall metrics in the metrics data frame
all_metrics <- data.frame(
  mse = mse_kalman,
  rmse = rmse_kalman,
  r_squared = r_squared_kalman
)

# Print overall performance metrics for the entire dataset
knitr::kable(all_metrics, caption = "All Kalman Filter Performance Metrics") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

All Kalman Filter Performance Metrics
mse	rmse	r_squared
52.98582	7.279136	0.9244106

The metrics computed from the larger kalman filter are much stronger, with a significantly higher r^2 value. To display this strength, we computed a residual graph to show how close the rresiduals values had come to the actuals.

Graph of residuals for whole df

#--------------------------------Full DF Histogram------------------------------------|
## Calculate the residuals (difference between actual and predicted positions)
residuals_x <- all_predictions$x - all_predictions$x_pred
residuals_y <- all_predictions$y - all_predictions$y_pred

# Combine the residuals into one dataset for plotting
residuals <- c(residuals_x, residuals_y)

# Create the histogram with percentage frequency and zoom in on the range [-20, 20]
ggplot(data.frame(residuals), aes(x = residuals)) +
  geom_histogram( 
    bins = 50,  # Increase the number of bins for more granularity
    fill = "skyblue", 
    color = "black", 
    alpha = 0.7
  ) +
  labs(title = "Histogram of Residuals (Percentage)", 
       x = "Residuals", 
       y = "Frequency ") +
  scale_x_continuous(limits = c(-20, 20), breaks = seq(-20, 20, by = 5)) +  # Zoom in and add more ticks
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, size = 18),
    axis.title = element_text(size = 14),
    axis.text = element_text(size = 12)
  )

Heatmap Plots

We also created numerous heatmap plots

This first one plots the actual data so you can get a feel for the expected result. The second one, plots the predicted heatmap based on the kalman filters predicted values. Finally, we created a overlay to show the difference between the two to highlight the acuracy of the model(s). As you can see, there is a difference but slight. ``{r Actual Values Heatmap, echo=TRUE, message=FALSE, warning=FALSE, cache=TRUE}

|————————–Actual-Values-Moving HeatMap ——————————————– |

games_df <- df %>% filter(gameId == 2022090800 & playId == 101)

heatmap_single_play_act <- function(games_df) { # Ensure the required columns are present required_columns <- c(“x”, “y”, “frameId”, “yardsGained”, “event”, “playId”, “gameId”, “team”, “nflId”, “inMotionAtBallSnap”) if (!all(required_columns %in% names(games_df))) { stop(“Data frame does not contain all required columns.”) }

# Define the football field dimensions field_width <- 53.3 field_length <- 120

# Filter out rows with incomplete data for the football football_data <- games_df %>% filter(team == “football” & !is.na(x) & !is.na(y))

# Identify the football’s position games_df <- games_df %>% left_join( football_data %>% distinct(frameId, .keep_all = TRUE) %>% select(frameId, football_x = x, football_y = y), by = “frameId” )

# Handle cases where football position is missing games_df <- games_df %>% group_by(frameId) %>% mutate( football_x = ifelse(is.na(football_x), mean(x[team != “football”], na.rm = TRUE), football_x), football_y = ifelse(is.na(football_y), mean(y[team != “football”], na.rm = TRUE), football_y) ) %>% ungroup()

# Define grid size grid_size <- 1 # 1-yard grid cells

# Calculate grid-based density games_df <- games_df %>% rowwise() %>% mutate( grid_x = floor(x / grid_size), grid_y = floor(y / grid_size) ) %>% ungroup()

density_data_act <- games_df %>% group_by(frameId, grid_x, grid_y) %>% summarise(density = n(), .groups = “drop”)

# Add density contributions for adjacent cells adjacent_offsets <- expand.grid(dx = -1:1, dy = -1:1)

expanded_density <- density_data_act %>% rowwise() %>% do({ tibble( frameId = .$frameId, grid_x = .$grid_x + adjacent_offsets$dx, grid_y = .$grid_y + adjacent_offsets$dy, density = .$density / 9 # Spread density across 9 cells ) }) %>% ungroup()

# Aggregate densities density_data_act <- expanded_density %>% group_by(frameId, grid_x, grid_y) %>% summarise(density = sum(density), .groups = “drop”)

# Convert grid coordinates back to field coordinates density_data_act <- density_data_act %>% mutate( x = grid_x * grid_size, y = grid_y * grid_size )

# Assign colors to teams with high-contrast colors team_colors <- c(“home” = “yellow”, “away” = “green”, “football” = “orange”)

# Focus on the area of action games_df <- games_df %>% group_by(frameId) %>% mutate( min_x = max(min(x, na.rm = TRUE) - 15, 0), max_x = min(max(x, na.rm = TRUE) + 15, field_length), min_y = max(min(y, na.rm = TRUE) - 10, 0), max_y = min(max(y, na.rm = TRUE) + 10, field_width) ) %>% ungroup()

# Create the heatmap plot heatmap_plot <- ggplot() + geom_tile(data = density_data_act, aes(x = x, y = y, fill = density), alpha = 0.9) + scale_fill_viridis_c(name = “Player Density”, option = “C”) + geom_point(data = games_df %>% filter(team == “football”), aes(x = football_x, y = football_y), color = “orange”, size = 5, shape = 23, fill = “orange”) + geom_point(data = games_df %>% filter(team == “home” ), aes(x = x, y = y, color = team), size = 3, shape = 21, fill = “white”) + geom_point(data = games_df %>% filter(team == “away” & inMotionAtBallSnap == TRUE), aes(x = x, y = y, color = team), size = 4, shape = 21, fill = “black”) + geom_point(data = games_df %>% filter(team == “away” & inMotionAtBallSnap == FALSE), aes(x = x, y = y, color = team), size = 3, shape = 21, fill = “white”) + scale_color_manual(values = team_colors) + coord_cartesian( xlim = range(games_df$min_x, games_df$max_x, na.rm = TRUE), ylim = range(games_df$min_y, games_df$max_y, na.rm = TRUE) ) + labs( title = “Animated Heatmap: Player Density”, subtitle = “Frame: {frame_time}”, x = “Field Length (yards)”, y = “Field Width (yards)” ) + theme_minimal() + theme( plot.title = element_text(hjust = 0.5, size = 20), plot.subtitle = element_text(hjust = 0.5, size = 14), legend.position = “right” )

# Animate the heatmap animation <- heatmap_plot + transition_time(frameId) + ease_aes(‘linear’)

# Render the animation animate(animation, nframes = max(games_df$frameId, na.rm = TRUE), fps = 10, width = 800, height = 450) return(density_data_act) } #Calling actual Heatmap density_data_act = heatmap_single_play_act(games_df)


Predicted Values Heatmap
``{r Predicted Values Heatmap, echo=TRUE, message=FALSE, warning=FALSE, fig.width=11, fig.height=6.5, cache=TRUE}
# |----------------------------Predictive-Heat-Map-----------------------------------------------|
heatmap_single_play_pred <- function(games_df) {

  # Define the football field dimensions
  field_width <- 53.3
  field_length <- 120
  
  # Define grid size
  grid_size <- 1 # 1-yard grid cells
  
  # Calculate grid-based density
  games_df <- games_df %>%
    filter(!is.na(x_pred) & !is.na(y_pred)) %>%
    rowwise() %>%
    mutate(
      grid_x = floor(x_pred / grid_size),
      grid_y = floor(y_pred / grid_size)
    ) %>%
    ungroup()
  
  density_data_pred <- games_df %>%
    group_by(frameId, grid_x, grid_y) %>%
    summarise(density = n(), .groups = "drop")
  if (nrow(density_data_pred) == 0) stop("Density data is empty.")
  
  # Add density contributions for adjacent cells
  adjacent_offsets <- expand.grid(dx = -1:1, dy = -1:1)
  expanded_density <- density_data_pred %>%
    rowwise() %>%
    do({
      tibble(
        frameId = .$frameId,
        grid_x = .$grid_x + adjacent_offsets$dx,
        grid_y = .$grid_y + adjacent_offsets$dy,
        density = .$density / 9 # Spread density across 9 cells
      )
    }) %>%
    ungroup()
  
  # Aggregate densities
  density_data_pred <- expanded_density %>%
    group_by(frameId, grid_x, grid_y) %>%
    summarise(density = sum(density), .groups = "drop")
  
  # Convert grid coordinates back to field coordinates
  density_data_pred <- density_data_pred %>%
    mutate(
      x = grid_x * grid_size,
      y = grid_y * grid_size
    )
  
  # Assign colors to teams with high-contrast colors
  team_colors <- c("home" = "yellow", "away" = "green")
  
  # Focus on the area of action
  games_df <- games_df %>%
    group_by(frameId) %>%
    mutate(
      min_x = ifelse(all(is.na(x_pred)), 0, max(min(x_pred, na.rm = TRUE) - 15, 0)),
      max_x = ifelse(all(is.na(x_pred)), field_length, min(max(x_pred, na.rm = TRUE) + 15, field_length)),
      min_y = ifelse(all(is.na(y_pred)), 0, max(min(y_pred, na.rm = TRUE) - 10, 0)),
      max_y = ifelse(all(is.na(y_pred)), field_width, min(max(y_pred, na.rm = TRUE) + 10, field_width))
    ) %>%
    ungroup()
  
  # Create the heatmap plot
  heatmap_plot <- ggplot() +
    geom_tile(data = density_data_pred, aes(x = x, y = y, fill = density), alpha = 0.9) +
    scale_fill_viridis_c(name = "Predicted Density", option = "C") +
    geom_point(data = games_df %>% filter(team == "home"), aes(x = x_pred, y = y_pred, color = team), size = 3, shape = 21, fill = "white") +
    geom_point(data = games_df %>% filter(team == "away"), aes(x = x_pred, y = y_pred, color = team), size = 4, shape = 21, fill = "black") +
    geom_point(data = games_df %>% filter(team == "away"), aes(x = x_pred, y = y_pred, color = team), size = 3, shape = 21, fill = "white") +
    scale_color_manual(values = team_colors) +
    coord_cartesian(
      xlim = range(games_df$min_x, games_df$max_x, na.rm = TRUE),
      ylim = range(games_df$min_y, games_df$max_y, na.rm = TRUE)
    ) +
    labs(
      title = "Predicted Animated Heatmap",
      subtitle = "Frame: {frame_time}",
      x = "Field Length (yards)",
      y = "Field Width (yards)"
    ) +
    theme_minimal() +
    theme(
      plot.title = element_text(hjust = 0.5, size = 20),
      plot.subtitle = element_text(hjust = 0.5, size = 14),
      legend.position = "right"
    ) +
    geom_vline(xintercept = seq(0, field_length, by = 10), color = "darkgrey", linetype = "dashed") +
    geom_hline(yintercept = c(0, field_width), color = "darkgrey") +
    geom_text(data = data.frame(x = seq(10, field_length - 10, by = 10), y = -2, label = seq(10, field_length - 10, by = 10)), aes(x = x, y = y, label = label), size = 4, color = "darkgrey")
  
  # Animate the heatmap
  animation <- heatmap_plot +
    transition_time(frameId) +
    ease_aes('linear')
  
  # Render the animation
  if (nrow(games_df) == 0 || nrow(density_data_pred) == 0) {
    stop("No data available for animation.")
  }
  animate(animation, nframes = max(games_df$frameId, na.rm = TRUE), fps = 10, width = 800, height = 450)
  
  return(density_data_pred)
}
#Calling predictive Heatmap
density_data_pred = heatmap_single_play_pred(play_predictions)

Comparative heatmap, to show what the difference between them is in overlay. 

``{r Heatmap Comparison, echo=TRUE, message=FALSE, warning=FALSE, fig.width=11, fig.height=6.5, cache=TRUE}
# ------------------------- Compare actual and predicted density data-------------------------------------| 
compare_density_heatmaps <- function(density_act, density_pred) {
  # Merge actual and predicted density data
  density_comparison <- density_act %>%
    full_join(density_pred, by = c("frameId", "grid_x", "grid_y"), suffix = c("_act", "_pred")) %>%
    mutate(
      density_diff = coalesce(density_pred, 0) - coalesce(density_act, 0),
      x = grid_x * 1,  # Convert grid coordinates to field coordinates
      y = grid_y * 1
    )
  
  # Create the difference heatmap plot with enhanced color scheme
  ggplot(density_comparison, aes(x = x, y = y)) +
    geom_tile(aes(fill = density_diff), alpha = 0.9) +
    scale_fill_gradient2(low = "blue", mid = "darkgray", high = "red", midpoint = 0, 
                         limits = c(-max(abs(density_comparison$density_diff)), max(abs(density_comparison$density_diff))),
                         name = "Density Difference") +
    labs(
      title = "Difference Heatmap: Predicted vs Actual Densities",
      x = "Field Length (yards)",
      y = "Field Width (yards)"
    ) +
    theme_minimal() +
    theme(
      plot.title = element_text(hjust = 0.5, size = 18),
      axis.title = element_text(size = 12),
      legend.position = "right"
    ) +
    coord_fixed()
}

# Compare densities and plot the difference heatmap
compare_density_heatmaps(density_data_act, density_data_pred)

# | ----------------------------------Random Forest -------------------------| 

filt_df <- gamer_df %>%
  filter(playId %in% sampled_plays$playId & gameId %in% sampled_plays$gameId)

filt_df$positiveYards <- ifelse(filt_df$yardsgained > 0, 1, 0)

# Remove irrelevant columns from df
df_clean <- df %>%
  select(-c(gameId, playId, frameId, displayName, event, team, nflId, club)) %>%
  filter(!is.na(yardsgained))

# Prepare the data - Exclude any irrelevant or non-numeric columns (e.g., player names)
# Also, handle any missing data (e.g., remove rows with missing yardsgained values)

# Optional: Convert categorical variables like inMotionAtBallSnap to factors if needed
df_clean$inMotionAtBallSnap <- as.factor(df_clean$inMotionAtBallSnap)
df_clean$shiftSinceLineset <- as.factor(df_clean$shiftSinceLineset)
df_clean$motionSinceLineset <- as.factor(df_clean$motionSinceLineset)


df_clean <- filt_df %>%
  select(-c(displayName, teamAbbr, playDescription)) %>%  # Remove non-numeric columns
  filter(!is.na(yardsgained))  # Remove rows with missing yardsgained

df_clean$motionAtBallSnap[is.na(df_clean$motionAtBallSnap)] <- 
  as.factor(names(sort(table(df_clean$motionAtBallSnap), decreasing = TRUE))[1])

df_clean$shiftSinceLineset[is.na(df_clean$shiftSinceLineset)] <- 
  as.factor(names(sort(table(df_clean$shiftSinceLineset), decreasing = TRUE))[1])

df_clean$motionSinceLineset[is.na(df_clean$motionSinceLineset)] <- 
  as.factor(names(sort(table(df_clean$motionSinceLineset), decreasing = TRUE))[1])

# Split the data into training and testing sets (80/20 split)
set.seed(123)  # Set a random seed for reproducibility
trainIndex <- sample(1:nrow(df_clean), 0.8 * nrow(df_clean))
trainData <- df_clean[trainIndex, ]
testData <- df_clean[-trainIndex, ]

# Train the Random Forest model
rf_model <- randomForest(positiveYards ~ ., data = trainData, ntree = 500, mtry = 3, importance = TRUE)

# Print the model summary
print(rf_model)

# Predict probabilities for the test set
rf_predictions <- predict(rf_model, newdata = testData, type = "response")


# | ------------------------- Confusion matrix (2x2)
# Evaluate the model using a confusion matrix
conf_matrix <- table(Predicted = rf_predictions, Actual = testData$positiveYards)

# Display confusion matrix with kable
kable(conf_matrix, caption = "Confusion Matrix for Random Forest Model",
      align = "c")  # Center-align the matrix


# Calculate the model's accuracy
accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)
print(paste("Accuracy:", round(accuracy, 4)))

# Get feature importance from the model
importance_df <- data.frame(Feature = rownames(importance(rf_model)),
                            Importance = importance(rf_model)[,1])

# Sort features by importance
importance_df <- importance_df[order(importance_df$Importance, decreasing = TRUE),]

# ------------------------------- Random Forest Plot --------------------| 

# Use kable to print the results in a neat table
kable(importance_df, caption = "Feature Importance from Random Forest Model", 
      col.names = c("Feature", "Importance"), 
      align = "lcc")  # 'markdown' for R Markdown output

# Create a customized ggplot for feature importance
ggplot(importance_df, aes(x = reorder(Feature, Importance), y = Importance)) +
  geom_bar(stat = "identity", fill = "skyblue", color = "blue") +
  coord_flip() +  # Flip the coordinates to make the plot horizontal
  labs(title = "Random Forest Feature Importance",
       x = "Feature",
       y = "Importance") +
  theme_minimal() +
  theme(axis.text = element_text(size = 12),
        axis.title = element_text(size = 14),
        plot.title = element_text(size = 16, face = "bold"),
        panel.grid.major = element_line(color = "gray", size = 0.5)) +
  scale_y_continuous(labels = scales::comma)

# | --------------------------Splitting Data for comparative analyasis ----------------------| 

# Filter the gamer_df into two separate data frames: one for motion plays and one for no-motion plays
motion_df <- gamer_df %>%
  filter(inMotionAtBallSnap == TRUE | shiftSinceLineset == TRUE | motionSinceLineset == TRUE)
no_motion_df <- gamer_df %>%
  filter(inMotionAtBallSnap == FALSE & shiftSinceLineset == FALSE & motionSinceLineset == FALSE)

#Breaking up motion into sub Data Frames to compare each motion
MotionBUP <- motion_df %>%
  mutate(MotionType = case_when(
    shiftSinceLineset ~ "Shift",
    motionSinceLineset & !inMotionAtBallSnap ~ "Jet Motion",
    inMotionAtBallSnap ~ "Orbit Motion"
  )) %>%
  # Prioritize rows with motion for distinct plays
  arrange(gameId, playId, desc(inMotionAtBallSnap)) %>%
  distinct(gameId, playId, .keep_all = TRUE)

# Filter data for each motion type: Shift, Jet Motion, Orbit Motion
shift_df <- MotionBUP %>%
  filter(MotionType == "Shift")

jet_motion_df <- MotionBUP %>%
  filter(MotionType == "Jet Motion")

orbit_motion_df <- MotionBUP %>%
  filter(MotionType == "Orbit Motion")

Conclusion/ Notes

Computer Issues loading the vast ammounts of data and predictions hindered us throughout the project, but we where still able to generate good and effective examples of code to display what we were aiming to acomplish. Prior to our BDB submission (which I am planning on doing), we intend to finish and refine the aspects we where not able to fully impliment due to the time constraints. But overall, we are proud of what work were able to acomplish and look forward to the more complete end result.

NFL Big Data Bowl 2025 Mock Submission

B. Kramer

12/14/2024