For our final assignment of the Fall 2024 semester, Nick and I decided to focus on the impact of pre-snap motion on rushing plays.
We built a randomForest model to predict the expected yards gained, given a handful of variables from all the plays we were provided.
Once the model was trained, we analyzed and visualized its results: expected versus actual yards gained, successful plays from both the expected and actual perspectives, and the impact of motion.
A fairly straightforward section of our code: setting our working directory, sourcing a helper script from our professor (Thanks Prof. Tallon!), loading the numerous packages we use, and reading the CSV files into RStudio.
# setwd("C:/Users/raymo/OneDrive/Documents/Loyola Fall 2024/IS470/Data")
setwd("C:/")
source("https://raw.githubusercontent.com/ptallon/SportsAnalytics_Fall2024/main/SharedCode.R")
load_packages(c("httr", "ggplot2", "ggalt", "ggforce", "hms",
"gganimate", "lubridate", "data.table", "dplyr",
"nflfastR", "gifski", "png", "ggimage", "caret",
"randomForest", "tidyr", "DT", "reshape2", "ranger"))
games <- fread("Users/raymo/OneDrive/Documents/Loyola Fall 2024/IS470/Data/games.csv")
plays <- fread("Users/raymo/OneDrive/Documents/Loyola Fall 2024/IS470/Data/plays.csv")
players <- fread("Users/raymo/OneDrive/Documents/Loyola Fall 2024/IS470/Data/players.csv")
player_play <- fread("Users/raymo/OneDrive/Documents/Loyola Fall 2024/IS470/Data/player_play.csv")
rushing_df <- left_join(plays, player_play, by = c("gameId", "playId"))
# NO PROBLEM; RUN AND DONE!
Our main data frame, which we filtered down to rushing plays only. We then created a categorical variable that determines whether each rushing play was effective. We decided against 'hardcoding' a single benchmark for what a successful play looks like; instead, we took a situational approach to each rushing play and 'softcoded' this variable based on down and distance.
rushing_df_success <- rushing_df %>%
filter(hadRushAttempt == 1) %>%
mutate(successful_Run_Play = case_when(
down == 4 ~ ifelse(yardsToGo < yardsGained, "Successful", "Not Successful"),
(yardsToGo / (4 - down)) < yardsGained ~ "Successful",
TRUE ~ "Not Successful")) %>%
data.frame()
motion_rushing_df <- rushing_df_success %>%
filter(motionSinceLineset == 1) %>%
data.frame()
# NO PROBLEM; RUN AND DONE!
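To make that softcoded benchmark concrete, here is a minimal sketch (using made-up play values, not rows from our data) of the threshold each down implies: on downs 1 through 3 a run must gain more than yardsToGo / (4 - down), and on 4th down it must gain more than the full yardsToGo.
# Minimal sketch with hypothetical plays illustrating the situational benchmark
example_plays <- data.frame(down = c(1, 2, 3, 4),
                            yardsToGo = c(10, 7, 4, 2),
                            yardsGained = c(4, 3, 5, 3))
example_plays %>%
  mutate(threshold = ifelse(down == 4, yardsToGo, yardsToGo / (4 - down)),
         successful_Run_Play = ifelse(yardsGained > threshold, "Successful", "Not Successful"))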
Here is the bulk of the code for our randomForest model. We gave the model six predictors to work with; of the 10+ forests we created, this combination provided the largest improvement over the baseline.
df_with_predictors_3 <- rushing_df_success %>%
select(down, yardsToGo, yardsGained, successful_Run_Play, motionSinceLineset, inMotionAtBallSnap, pff_manZone, rushLocationType)%>%
mutate(across(where(is.character), as.factor)) %>% # Convert character variables to factors
mutate(across(where(is.logical), as.numeric)) %>% # Convert logical variables to numeric
data.frame()
df_predictors_cleaned_3 <- na.omit(df_with_predictors_3) %>% data.frame()
set.seed(43)
train_index_3 <- sample(1:nrow(df_predictors_cleaned_3), size = 0.7 * nrow(df_predictors_cleaned_3))
train_data_3 <- df_predictors_cleaned_3[train_index_3, ]
test_data_3 <- df_predictors_cleaned_3[-train_index_3, ]
# Define formula explicitly to avoid dynamic formula issues
formula_rf <- as.formula("yardsGained ~ yardsToGo + down + motionSinceLineset + inMotionAtBallSnap + pff_manZone + rushLocationType")
# Define a custom grid of hyperparameters for tuning
tune_grid <- expand.grid(
mtry = seq(1, length(df_predictors_cleaned_3), by = 1) # Number of predictors considered at each split (note: length() of a data frame counts all of its columns)
)
# Cross-validation settings
train_control <- trainControl(
method = "cv", # Cross-validation
number = 5, # 5-fold CV
verboseIter = FALSE, # Display training progress
search = "grid" # Grid search for hyperparameters
)
# Train a Random Forest model with hyperparameter tuning
rf_tuned <- train(
form = formula_rf,
data = train_data_3,
method = "rf",
trControl = train_control,
tuneGrid = tune_grid,
metric = "RMSE",
importance = TRUE)
predictions <- predict(rf_tuned, newdata = df_predictors_cleaned_3)
df_predictors_cleaned_3$predicted_yards <- predictions
# NO PROBLEM; RUN AND DONE!
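As a quick sanity check that we did not include in our original script, caret keeps the cross-validation results inside the train object, so the selected mtry and its RMSE can be inspected directly (exact values will vary with the seed and data split):
# Inspect the tuned model
print(rf_tuned$bestTune)  # mtry chosen by 5-fold CV
print(rf_tuned$results)   # RMSE, R-squared, and MAE for each mtry in the grid
plot(rf_tuned)            # RMSE as a function of mtry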
Below is the section of code that evaluates how our model fared at predicting expected yards. For this model, we looked specifically at MSE, the improvement over a baseline (also measured by MSE), and how much each variable contributes to the randomForest model's predictions. Additionally, we conducted a t-test, which revealed that the difference in yards gained between motion and non-motion plays is not statistically significant.
train_predictions <- predict(rf_tuned, newdata = train_data_3)
train_mse <- mean((train_data_3$yardsGained - train_predictions)^2)
cat("Training MSE: ", train_mse, "\n")
## Training MSE: 44.13779
test_predictions <- predict(rf_tuned, newdata = test_data_3)
test_mse <- mean((test_data_3$yardsGained - test_predictions)^2)
cat("Test MSE: ", test_mse, "\n")
## Test MSE: 43.83375
baseline_pred <- rep(mean(train_data_3$yardsGained), length(test_data_3$yardsGained))
baseline_mse <- mean((test_data_3$yardsGained - baseline_pred)^2)
cat("Baseline Test MSE: ", baseline_mse, "\n")
## Baseline Test MSE: 44.73895
improvement <- (baseline_mse - test_mse) / baseline_mse * 100
cat("Improvement Over Baseline: ", round(improvement, 2), "%\n")
## Improvement Over Baseline: 2.02 %
importance_values <- varImp(rf_tuned, scale = FALSE) # Get variable importance
print(importance_values)
## rf variable importance
##
## Overall
## yardsToGo 18.46548
## down 17.77698
## pff_manZoneOther 15.19732
## rushLocationTypeOUTSIDE_LEFT 6.64308
## inMotionAtBallSnap 6.05817
## pff_manZoneZone 4.04202
## rushLocationTypeOUTSIDE_RIGHT 3.42475
## rushLocationTypeINSIDE_RIGHT 0.05135
## motionSinceLineset 0.04031
## rushLocationTypeUNKNOWN 0.00000
# varImpPlot(rf_tuned$finalModel, main = "Variable Importance Plot") # We ran it BUT it does not look great in the markdown file.
motion_plays <- df_predictors_cleaned_3$yardsGained[df_predictors_cleaned_3$motionSinceLineset == 1]
non_motion_plays <- df_predictors_cleaned_3$yardsGained[df_predictors_cleaned_3$motionSinceLineset == 0]
result <- t.test(motion_plays, non_motion_plays, paired = FALSE, alternative = "two.sided")
print(result)
##
## Welch Two Sample t-test
##
## data: motion_plays and non_motion_plays
## t = 1.6325, df = 551.5, p-value = 0.1031
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.1230039 1.3334675
## sample estimates:
## mean of x mean of y
## 5.258065 4.652833
# NO PROBLEM; RUN AND DONE!
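The t-test reports significance but not magnitude, so here is a small sketch (not part of our submitted analysis) of a pooled-standard-deviation effect size, Cohen's d, computed in base R from the same two vectors:
# Effect-size sketch: Cohen's d for motion vs. non-motion yards gained
n1 <- length(motion_plays)
n2 <- length(non_motion_plays)
pooled_sd <- sqrt(((n1 - 1) * var(motion_plays) + (n2 - 1) * var(non_motion_plays)) / (n1 + n2 - 2))
cohens_d <- (mean(motion_plays) - mean(non_motion_plays)) / pooled_sd
cat("Cohen's d:", round(cohens_d, 3), "\n")  # a small d would echo the non-significant t-test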
Further, Nick and I wanted to break the model's predictions down into true positives, true negatives, false positives, and false negatives. The code below achieves that. Our model did well at identifying true positives, but it also produced a large number of false positives, as shown below. Its precision was 53.2%, meaning just over half of the plays it predicted to be successful actually were. Its recall was 93.6%, meaning nearly all of the actually successful plays were correctly identified. The F1 score, which combines precision and recall, was 67.9%, indicating an acceptable balance that leans toward higher recall at the expense of precision. Roughly 4.5% of the variance in actual yards gained is explained by our randomForest model (R-squared: 0.04459). Finally, on average, the model's predicted yardsGained differs from the actual value by about 4.026 yards (MAE).
df_predictors_cleaned_3 <- df_predictors_cleaned_3 %>%
mutate(predicted_successful_Run_Play = case_when(
down == 4 ~ ifelse(yardsToGo < predicted_yards, "Successful", "Not Successful"),
(yardsToGo / (4 - down)) < predicted_yards ~ "Successful",
TRUE ~ "Not Successful"
)) %>%
data.frame()
# Create a new column to classify the results as TP, FP, FN, or TN
comparison <- df_predictors_cleaned_3 %>%
select(yardsToGo, yardsGained, predicted_yards, successful_Run_Play, predicted_successful_Run_Play) %>%
mutate(
comparison_result = case_when(
successful_Run_Play == "Successful" & predicted_successful_Run_Play == "Successful" ~ "True Positive",
successful_Run_Play == "Not Successful" & predicted_successful_Run_Play == "Successful" ~ "False Positive",
successful_Run_Play == "Successful" & predicted_successful_Run_Play == "Not Successful" ~ "False Negative",
successful_Run_Play == "Not Successful" & predicted_successful_Run_Play == "Not Successful" ~ "True Negative",
TRUE ~ NA_character_ # Handle any unexpected values (if any)
)
) %>%
data.frame()
table(comparison$comparison_result)
##
## False Negative False Positive True Negative True Positive
## 212 2747 483 3126
TP <- sum(comparison$comparison_result == "True Positive")
FP <- sum(comparison$comparison_result == "False Positive")
FN <- sum(comparison$comparison_result == "False Negative")
TN <- sum(comparison$comparison_result == "True Negative")
precision <- TP / (TP + FP)
recall <- TP / (TP + FN)
f1_score <- 2 * (precision * recall) / (precision + recall)
cat("Precision:", precision, "\n")
## Precision: 0.5322663
cat("Recall:", recall, "\n")
## Recall: 0.9364889
cat("F1 Score:", f1_score, "\n")
## F1 Score: 0.6787537
df_predictors_cleaned_3$rmse <- sqrt(mean((df_predictors_cleaned_3$yardsGained - df_predictors_cleaned_3$predicted_yards)^2))
rsq <- cor(df_predictors_cleaned_3$yardsGained, df_predictors_cleaned_3$predicted_yards)^2
cat("R-squared: ", rsq, "\n")
## R-squared: 0.04459822
mae <- mean(abs(df_predictors_cleaned_3$yardsGained - df_predictors_cleaned_3$predicted_yards))
cat("MAE: ", mae, "\n")
## MAE: 4.026051
# NO PROBLEM; RUN AND DONE!
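As a cross-check we did not include originally, caret's confusionMatrix() reproduces these counts from the same two columns and adds accuracy and specificity; the explicit levels below are an assumption on our part, used only to keep "Successful" as the positive class.
# Cross-check of the confusion counts with caret
confusionMatrix(
  data = factor(comparison$predicted_successful_Run_Play, levels = c("Successful", "Not Successful")),
  reference = factor(comparison$successful_Run_Play, levels = c("Successful", "Not Successful")),
  positive = "Successful")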
We wanted to explore the results of the random forest and compare them to what actually happened. The handful of graphics below elaborates on how the use of motion impacts the yards gained on rushing plays.
First, we made a layered histogram showing the situational yards needed (in blue), the actual yards gained (in green), and the predicted yards (in red), with outliers removed. We anticipated the bulk of the predicted yardage to fall within a positive gain of 10 yards, since a large portion of the actual yards gained (in green) sits in this same interval. We observed that the majority of plays needed 10 yards for a first down, which is expected since each new set of downs results in a '1st & 10'. The distribution of yards gained is slightly right-skewed, with most plays gaining between 1 and 5 yards. Our model's predictions are not as dispersed as the actual yards gained and mostly sit at gains of 4 to 5 yards.
# Create layered histogram of yardsToGo, Yards Gained and Predicted Yards Gained
ggplot(df_predictors_cleaned_3) +
geom_histogram(aes(x = yardsToGo, fill = "Yards to Go"), alpha = 0.5, binwidth = 1, color = "black") +
geom_histogram(aes(x = yardsGained, fill = "Yards Gained"), alpha = 0.5, binwidth = 1, color = "black") +
geom_histogram(aes(x = predicted_yards, fill = "Predicted Yards"), alpha = 0.5, binwidth = 1, color = "black") +
scale_fill_manual(
name = "Legend",
values = c("Yards to Go" = "blue", "Yards Gained" = "green", "Predicted Yards" = "red")
) +
labs(
title = "Layered Histogram of Yards Metrics",
x = "Yards",
y = "Frequency"
) +
scale_x_continuous(limits = c(-15, 20)) + # Restrict x-axis to -15 to 20
theme_minimal() +
theme(legend.position = "top")
# NO PROBLEM; RUN AND DONE!
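The "outliers removed" above simply means the x-axis is clipped, so values outside that window are dropped from the plot; a quick sketch (an addition of ours, not in the original script) shows what share of the actual yards gained falls outside it:
# Share of plays whose actual yards gained fall outside the plotted -15 to 20 window
mean(df_predictors_cleaned_3$yardsGained < -15 | df_predictors_cleaned_3$yardsGained > 20)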
Second, we created a scatter plot that displays the relationship between predicted yards and actual yards gained. The regression line has a slight positive slope, indicating that higher predicted yardage tends to correspond to more actual yards gained. The red regression line in the plot helps visualize what the actual yardage gained would be, based on our random forest's prediction, if the two variables were linearly related.
actual <- df_predictors_cleaned_3$yardsGained
predicted <- df_predictors_cleaned_3$predicted_yards
plot(predicted, actual,
main = "Predicted Yards Gained vs Actual Yards Gained (via RF)",
xlab = "Predicted Yards Gained",
ylab = "Actual Yards Gained",
pch = 19,
col = "blue")
regression_model <- lm(actual ~ predicted) # Fit a linear model
abline(regression_model, col = "red", lwd = 2) # Add the regression line in red
# NO PROBLEM; RUN AND DONE!
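To put a number on the slope of that red line (something we did not print in the original write-up), summary() of the fitted linear model reports the coefficient on predicted yards along with its standard error:
# Coefficients of the actual ~ predicted regression (values depend on the model run)
summary(regression_model)$coefficients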
Third, we created a heat map. It shows the correlations between the five selected variables: motionSinceLineset, inMotionAtBallSnap, yardsToGo, predicted_yards, and down. Three of the four other variables have a weak-to-moderate positive correlation with predicted_yards. The exception is down, which has a weak negative correlation, indicating that it has little, if any, linear relationship with the predicted yards.
# Compute the correlation matrix
numerical_vars <- df_predictors_cleaned_3 %>%
select(motionSinceLineset, inMotionAtBallSnap, yardsToGo, predicted_yards, down)
cor_matrix <- cor(numerical_vars)
cor_matrix[upper.tri(cor_matrix)] <- NA # Keep only the lower triangle
cor_melted <- melt(cor_matrix, na.rm = TRUE) # Convert the matrix to long format for ggplot2
# Plot the heatmap with the reversed y-axis and correlation values inside the boxes
ggplot(cor_melted, aes(Var1, Var2, fill = value)) +
geom_tile() +
scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0, name = "Correlation") +
geom_text(aes(label = sprintf("%.2f", value)), color = "black", size = 4) + # Add correlation values inside the boxes
theme_minimal() +
labs(title = "Correlation Heatmap of Numerical Variables", x = "Variable", y = "Variable") +
theme(axis.text.x = element_text(angle = 0, hjust = 1)) +
scale_y_discrete(limits = rev(levels(cor_melted$Var2))) # Reverse the y-axis so the chart slopes down
# NO PROBLEM; RUN AND DONE!
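To read those values without squinting at the tiles, the column of correlations with predicted_yards can be printed directly (a small addition of ours, computed from the same selected variables):
# Correlations of each selected variable with the model's predicted yards
round(cor(numerical_vars)[, "predicted_yards"], 3)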
Fourth, we made two stacked bar graphs representing the successful and unsuccessful plays that were predicted both correctly and incorrectly. Based on our predicted values and the softcoded definition of a successful play, our random forest correctly predicts unsuccessful plays fairly well, most notably on first and fourth down. Although it correctly predicts unsuccessful plays, it is not as accurate at predicting successful plays, sitting at just above 50% accuracy across all downs.
facet_labels <- c(
"Not Successful" = "Predicted Not Successful",
"Successful" = "Predicted Successful"
)
df_predictors_cleaned_3 <- df_predictors_cleaned_3 %>%
mutate( predicted_successful_Run_Play = as.factor(predicted_successful_Run_Play))
df_motion_1 <- df_predictors_cleaned_3 %>% filter(motionSinceLineset == 1)
ggplot(df_motion_1) +
geom_bar(aes(x = factor(down), fill = successful_Run_Play), position = "fill") +
facet_wrap(~predicted_successful_Run_Play, labeller = labeller(predicted_successful_Run_Play = facet_labels)) +
labs(
title = "Successful Motion Plays by Down",
x = "Down",
y = "Proportion",
fill = "Actual Play"
) +
scale_y_continuous(labels = scales::percent_format()) +
theme_minimal() +
theme(legend.position = "top")
# NO PROBLEM; RUN AND DONE!
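For readers who prefer numbers to stacked bars, here is a hedged sketch (not in our original code) that tabulates, for motion plays only, how often the predicted success label agrees with the actual label on each down:
# Per-down agreement between actual and predicted success labels (motion plays only)
df_motion_1 %>%
  group_by(down) %>%
  summarise(agreement = mean(as.character(successful_Run_Play) == as.character(predicted_successful_Run_Play))) %>%
  ungroup()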
Fifth and finally, we have two tables below that shed light on how rush location type and defensive coverage type factor into the success of rushing plays. Based on the success rates below, offensive coordinators should opt for rushing attempts (with the use of motion at their discretion) when they face 'Other', less common defensive coverages. Conversely, when defensive coordinators face rushing plays WITH motion, they should opt for Zone coverage, which shows the lowest success rate at 40.6%. With regard to rush location type, running the ball toward the sidelines rather than through the center of the field yields roughly a 52% success rate.
success_by_zone_motion <- df_predictors_cleaned_3 %>%
group_by(pff_manZone, inMotionAtBallSnap) %>%
summarise(success_rate = mean(successful_Run_Play == "Successful", na.rm = TRUE)) %>% # compare to the label itself; as.numeric() on a factor returns its level index, not 0/1
mutate(success_rate = round(success_rate, 4)) %>%
ungroup()
datatable(success_by_zone_motion)
success_by_rush_location <- df_predictors_cleaned_3 %>%
group_by(rushLocationType) %>%
summarise(success_rate = mean(successful_Run_Play == "Successful", na.rm = TRUE)) %>%
mutate(success_rate = round(success_rate, 4)) %>%
ungroup()
datatable(success_by_rush_location)
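Because some coverage-and-motion combinations may contain only a few plays, a hedged variant of the first table that also reports group sizes would guard against over-reading small cells:
# Same coverage/motion summary, with group sizes included
df_predictors_cleaned_3 %>%
  group_by(pff_manZone, inMotionAtBallSnap) %>%
  summarise(n = n(),
            success_rate = round(mean(successful_Run_Play == "Successful", na.rm = TRUE), 4)) %>%
  ungroup()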
In short, we are unfortunately unable to conclude that using motion on rushing plays makes a definitive, positive difference in offensive yards gained. Our randomForest is a good start at predicting the outcome of a play given whether or not it uses motion. Since we could not conclude that motion has a statistically significant, positive impact on yards gained, we advise that more research, exploration, and testing be performed before drawing that conclusion.
Our biggest regret: not utilizing the tracking data. It would have helped us significantly, as we could have derived additional features such as the physical distance from the rushing player to the defensive players on the field. Additionally, we could have visualized highly successful rushing plays with and without motion, with specific commentary on why they were so successful.
Once again, we would like to thank Prof. Tallon for all of his knowledge and support in his IS470 class this semester.