Introduction

The purpose of this project was to create a tool for offensive coordinators to quickly and easily evaluate the potential outcome of a passing play given a pre-snap state. To do this, a machine learning model was created that generates a decision tree to predict the outcome of the play given the value of 14 pre-snap variables. This model was created using the data provided for the 2025 NFL Big Data Bowl which encompasses player data from the first 9 weeks of the 2022 NFL season.

Description of Project

Using the data from the NFL Big Data Bowl, multiple dataframes were created to gather the metrics for all quarterbacks involved in play during the first 9 weeks. Metrics for average dropback time, time to throw, time in tackle box, and accuracy were generated. The data was then put through a process of testing and filtering to determine what variables were most useful in the creation of a decision tree model using the random forest machine learning algorithm. The end result is a model that given 14 variables from the pre-snap condition can accurately predict the outcome of the pass ~65% of the time.

Quarterback Data Table


The below table illustrates a sample of the metrics generated for evaluation. These averages are created from data that would be available to teams about the quarterback they are facing and can be used in the creation of the model for increased accuracy.

setwd("C:/Users/brygn/Documents/Stuff/RDataSources/NFLBigDataBowl2025")
#setwd("~/RStuff/Data")

suppressMessages(library(kableExtra))


directory <- paste0(getwd())

source("https://raw.githubusercontent.com/ptallon/SportsAnalytics_Fall2024/main/SharedCode.R")

load_packages(c("data.table", "dplyr", "ggplot2", "ggalt", "gifski", "gganimate", "lubridate", "randomForest", "caret"))


games <- fread("games.csv")
plays <- fread("plays.csv")
players <- fread("players.csv")
player_play <- fread("player_play.csv")


dfPlayerData <- players %>%
  filter(position == "QB") %>%
  mutate(playCount = 0) %>%
  select(nflId, position, displayName, playCount) %>%
  data.frame()

dfPlayerPlayFiltered <- player_play %>%
  filter(hadRushAttempt == 0) %>%
  data.frame()

# Match counts to nflId in dfPlayerData
occurrences <- table(dfPlayerPlayFiltered$nflId)
dfPlayerData$playCount <- ifelse(dfPlayerData$nflId %in% names(occurrences), occurrences[as.character(dfPlayerData$nflId)], 0)

dfJoined <- dfPlayerPlayFiltered %>%
  left_join(plays, by = c("playId", "gameId")) %>%
  data.frame

# Filter merged data to only include nflIds in dfPlayerData
filteredData <- dfJoined[dfJoined$nflId %in% dfPlayerData$nflId, ] %>% mutate(completedTest = ifelse(passResult == "C", 1, 0))

# Sum the metrics for each nflId
player_sums <- aggregate(
  cbind(timeToThrow, dropbackDistance, timeInTackleBox, completedTest) ~ nflId,
  data = filteredData,
  FUN = sum,
  na.rm = TRUE
)

# Merge the aggregated sums back into dfPlayerData
dfPlayerData <- merge(dfPlayerData, player_sums, by = "nflId", all.x = TRUE)

# Calculates averages for each QB
dfPlayerData <- dfPlayerData %>%
  mutate(
    avgTimeToThrow = timeToThrow / playCount,
    avgDropbackDistance = dropbackDistance / playCount,
    avgTimeInTackleBox = timeInTackleBox / playCount,
    accuracy = completedTest / playCount
  ) %>%
  select(-timeToThrow, -dropbackDistance, -timeInTackleBox) %>%
  data.frame()


dfPlays <- player_play %>%
  select(gameId, playId, nflId) %>%
  left_join(plays, by = c("gameId", "playId")) %>%
  left_join(dfPlayerData, by = "nflId") %>%
  filter(nflId %in% dfPlayerData$nflId) %>%
  data.frame()

dfDataSet <- dfPlays %>%
  select(gameId, playId, nflId, passResult, targetX, targetY, passLength, yardsGained, pff_manZone, pff_passCoverage, yardsToGo, yardlineNumber, gameClock, preSnapHomeScore, preSnapVisitorScore, preSnapHomeTeamWinProbability, preSnapVisitorTeamWinProbability, receiverAlignment, playClockAtSnap, avgTimeToThrow, avgTimeInTackleBox, avgDropbackDistance, accuracy) %>%
  filter(passResult %in% c("C", "I"), pff_manZone == "Man") %>%
  distinct(playId, .keep_all = TRUE) %>%
  data.frame()

knitr::kable(head(dfPlayerData, 10)) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
nflId position displayName playCount completedTest avgTimeToThrow avgDropbackDistance avgTimeInTackleBox accuracy
25511 QB Tom Brady 551 243 1.6175935 2.195535 1.6033775 0.4410163
29851 QB Aaron Rodgers 528 197 1.4910265 1.742367 1.4209072 0.3731061
33084 QB Matt Ryan 452 197 1.7290265 1.873606 1.6897788 0.4358407
33099 QB Joe Flacco 206 87 1.9605583 2.520777 1.8808155 0.4223301
33138 QB Chad Henne 8 0 0.4796250 0.533750 0.4796250 0.0000000
34452 QB Matthew Stafford 454 189 1.6292687 2.260419 1.5644273 0.4162996
34754 QB Chase Daniel 1 NA NA NA NA NA
34843 QB Brian Hoyer 15 5 0.8764667 1.627333 0.8764667 0.3333333
37110 QB Andy Dalton 327 123 1.5200306 2.179816 1.4777554 0.3761468
37255 QB Tyrod Taylor 9 1 0.8846667 1.038889 0.8846667 0.1111111

Modeling

After generation of variables and data preparation is completed, the model can be generated. A split of 70/30 training and testing is utilized. Seeing as the play data is being utilized, we avoid the issue of a play being split up unevenly between training and test as would be the case if we were utilizing the tracking data which shows by frames. With the 14 variables used, the model achieves ~65% accuracy of predicting the correct outcome of the play.


Visualization 1: Bar Chart showing the Impact of Variables on Model Accuracy:

The below table visualizes Gini and increase in inaccuracy which illustrates each variables impact on the creation of the random forest. Gini is a measure of how important each variable is to constructing the model where decrease in accuracy lowers the level of accuracy of the model and introduces confusion.

dfDataSet$passResult <- as.factor(dfDataSet$passResult)

set.seed(42)

trainIndex <- sample(seq_len(nrow(dfDataSet)), size = 0.7 * nrow(dfDataSet))
trainData <- dfDataSet[trainIndex, ]
testData <- dfDataSet[-trainIndex, ]


rfModel <- randomForest(
  passResult ~ passLength + yardsToGo + yardlineNumber + gameClock + preSnapHomeScore + preSnapVisitorScore + preSnapHomeTeamWinProbability + preSnapVisitorTeamWinProbability + receiverAlignment + playClockAtSnap + avgTimeToThrow + avgDropbackDistance + avgTimeInTackleBox + pff_passCoverage + accuracy,
  data = trainData,
  nodesize = 1,
  ntree = 5000,
  importance = TRUE
)

predictions <- predict(rfModel, testData)
confusionMatrix <- table(testData$passResult, predictions)

accuracy <- sum(diag(confusionMatrix)) / sum(confusionMatrix)

importance <- importance(rfModel) %>% data.frame() %>% arrange(-MeanDecreaseAccuracy) %>% select(MeanDecreaseGini,MeanDecreaseAccuracy)

importance$variable <- rownames(importance) 
rownames(importance) <- NULL

importance2 <- importance %>% select(variable, MeanDecreaseGini, MeanDecreaseAccuracy) %>% arrange(-MeanDecreaseGini)


library(gridExtra)
library(grid)

giniPlot <- ggplot(importance2, aes(x = MeanDecreaseGini, y = reorder(variable, MeanDecreaseGini))) +
  geom_bar(stat = "identity", fill = "skyblue") +
  labs(title = "Mean Decrease in Gini", x = "Value", y = "Variable Modeled") +
  geom_text(aes(label = round(MeanDecreaseGini, 2)), # Display rounded values as labels
            hjust = -0.3, # Position label slightly above the bar
            size = 3) +   # Set the size of the text
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1),
        plot.title = element_text(hjust = 0.5)) +
  xlim(0, max(importance2$MeanDecreaseGini) * 1.1)

# Create the bar chart for MeanDecreaseAccuracy
accuracyPlot <- ggplot(importance2, aes(x = MeanDecreaseAccuracy, y=  reorder(variable, MeanDecreaseGini))) +
  geom_bar(stat = "identity", fill = "salmon") +
  labs(title = "Mean Decrease in Accuracy", x = "Value", y = "Variable Modeled") +
  geom_text(aes(label = round(MeanDecreaseAccuracy, 2)), # Display rounded values as labels
            hjust = -0.3, # Position label slightly above the bar
            size = 3) +   # Set the size of the text
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1),
        plot.title = element_text(hjust = 0.5)) +
  xlim(min(importance2$MeanDecreaseAccuracy)*1.1, max(importance2$MeanDecreaseAccuracy) * 1.1)

# Arrange the two plots side-by-side
grid.arrange(
  giniPlot, accuracyPlot,
  ncol = 2,
  top = textGrob(
    "Comparison of Variable Importance Metrics", # Title text
    gp = gpar(fontsize = 16, fontface = "bold")  # Customize size and style
  )
)


Visualization 2: Confusion Matrix Table:

The below table shows the confusion matrix produced by our model. This shows that the model is very accurate a predicting completed passes as completed passes, it struggles to properly identify true incomplete passes.

# Pulls the confusion Matrix from above
knitr::kable(head(confusionMatrix, 10)) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
C I
C 232 62
I 125 99

Visualizations 3 and 4: Demonstration of Variable Impact:

The below two charts demonstrate an example of how the model goes about predicting based on a pre-snap condition. Receiver alignment would be a known variable to an offensive coordinator as it is part of the play design. The model when fed this information, is able to better inform its prediction as to the outcome of the pass based on the statistics that are illustrated below

library(ggplot2)

ggplot(dfDataSet, aes(x = receiverAlignment, fill = passResult)) + 
  geom_bar(position = "fill") + 
  labs(title = "Pass Result by Receiver Alignment", x = "Receiver Alignment", y = "Proportion")

# Enhanced Bar Plot
ggplot(dfDataSet, aes(x = reorder(receiverAlignment, passResult, FUN = length), fill = passResult)) + 
  geom_bar(position = "fill") + 
  geom_text(
    stat = "count", 
    aes(label = scales::percent(..count.. / tapply(..count.., ..x.., sum)[..x..], accuracy = 0.1)), 
    position = position_fill(vjust = 0.5), 
    size = 3
  ) + 
  scale_fill_manual(
    values = c("C" = "green", "I" = "firebrick"), 
    name = "Pass Result",
    labels = c("Complete", "Incomplete")
  ) + 
  labs(
    title = "Proportion of Pass Results by Receiver Alignment",
    subtitle = "Distribution of pass completions and incompletions across alignments",
    x = "Receiver Alignment",
    y = "Proportion",
    fill = "Pass Result"
  ) + 
  theme_minimal() + 
  theme(
    plot.title = element_text(size = 14, face = "bold"),
    plot.subtitle = element_text(size = 10, face = "italic"),
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "top"
  )

Conclusion

While the accuracy of this model is not 100%, this project creates value for the NFL by providing coaches and offensive coordinators with a quick and dirty way of analyzing the effectiveness of a given passing play based on data that is known entirely before the ball is snapped. This allows for coordinators to make small changes to an offensive scheme and with an accuracy of slightly less than 70%, evaluate the result of that passing play

Future work on this project includes improvements to the model to further enhance accuracy and creation of an interface that a coordinator or coach could enter in the answer to the 14 pre-snap variables involved in the model and be presented with the model’s prediction of the outcome of the play. This would allow real time analysis of offensive schemes in a control enviornment in preperation for gameday.

Note

If interested in more or have any questions or suggestions, please get in contact with us at or