The purpose of this project was to create a tool for offensive coordinators to quickly and easily evaluate the potential outcome of a passing play given a pre-snap state. To do this, a machine learning model was created that generates a decision tree to predict the outcome of the play given the value of 14 pre-snap variables. This model was created using the data provided for the 2025 NFL Big Data Bowl which encompasses player data from the first 9 weeks of the 2022 NFL season.
Using the data from the NFL Big Data Bowl, multiple dataframes were created to gather the metrics for all quarterbacks involved in play during the first 9 weeks. Metrics for average dropback time, time to throw, time in tackle box, and accuracy were generated. The data was then put through a process of testing and filtering to determine what variables were most useful in the creation of a decision tree model using the random forest machine learning algorithm. The end result is a model that given 14 variables from the pre-snap condition can accurately predict the outcome of the pass ~65% of the time.
The below table illustrates a sample of the metrics generated for evaluation. These averages are created from data that would be available to teams about the quarterback they are facing and can be used in the creation of the model for increased accuracy.
setwd("C:/Users/brygn/Documents/Stuff/RDataSources/NFLBigDataBowl2025")
#setwd("~/RStuff/Data")
suppressMessages(library(kableExtra))
directory <- paste0(getwd())
source("https://raw.githubusercontent.com/ptallon/SportsAnalytics_Fall2024/main/SharedCode.R")
load_packages(c("data.table", "dplyr", "ggplot2", "ggalt", "gifski", "gganimate", "lubridate", "randomForest", "caret"))
games <- fread("games.csv")
plays <- fread("plays.csv")
players <- fread("players.csv")
player_play <- fread("player_play.csv")
dfPlayerData <- players %>%
filter(position == "QB") %>%
mutate(playCount = 0) %>%
select(nflId, position, displayName, playCount) %>%
data.frame()
dfPlayerPlayFiltered <- player_play %>%
filter(hadRushAttempt == 0) %>%
data.frame()
# Match counts to nflId in dfPlayerData
occurrences <- table(dfPlayerPlayFiltered$nflId)
dfPlayerData$playCount <- ifelse(dfPlayerData$nflId %in% names(occurrences), occurrences[as.character(dfPlayerData$nflId)], 0)
dfJoined <- dfPlayerPlayFiltered %>%
left_join(plays, by = c("playId", "gameId")) %>%
data.frame
# Filter merged data to only include nflIds in dfPlayerData
filteredData <- dfJoined[dfJoined$nflId %in% dfPlayerData$nflId, ] %>% mutate(completedTest = ifelse(passResult == "C", 1, 0))
# Sum the metrics for each nflId
player_sums <- aggregate(
cbind(timeToThrow, dropbackDistance, timeInTackleBox, completedTest) ~ nflId,
data = filteredData,
FUN = sum,
na.rm = TRUE
)
# Merge the aggregated sums back into dfPlayerData
dfPlayerData <- merge(dfPlayerData, player_sums, by = "nflId", all.x = TRUE)
# Calculates averages for each QB
dfPlayerData <- dfPlayerData %>%
mutate(
avgTimeToThrow = timeToThrow / playCount,
avgDropbackDistance = dropbackDistance / playCount,
avgTimeInTackleBox = timeInTackleBox / playCount,
accuracy = completedTest / playCount
) %>%
select(-timeToThrow, -dropbackDistance, -timeInTackleBox) %>%
data.frame()
dfPlays <- player_play %>%
select(gameId, playId, nflId) %>%
left_join(plays, by = c("gameId", "playId")) %>%
left_join(dfPlayerData, by = "nflId") %>%
filter(nflId %in% dfPlayerData$nflId) %>%
data.frame()
dfDataSet <- dfPlays %>%
select(gameId, playId, nflId, passResult, targetX, targetY, passLength, yardsGained, pff_manZone, pff_passCoverage, yardsToGo, yardlineNumber, gameClock, preSnapHomeScore, preSnapVisitorScore, preSnapHomeTeamWinProbability, preSnapVisitorTeamWinProbability, receiverAlignment, playClockAtSnap, avgTimeToThrow, avgTimeInTackleBox, avgDropbackDistance, accuracy) %>%
filter(passResult %in% c("C", "I"), pff_manZone == "Man") %>%
distinct(playId, .keep_all = TRUE) %>%
data.frame()
knitr::kable(head(dfPlayerData, 10)) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| nflId | position | displayName | playCount | completedTest | avgTimeToThrow | avgDropbackDistance | avgTimeInTackleBox | accuracy |
|---|---|---|---|---|---|---|---|---|
| 25511 | QB | Tom Brady | 551 | 243 | 1.6175935 | 2.195535 | 1.6033775 | 0.4410163 |
| 29851 | QB | Aaron Rodgers | 528 | 197 | 1.4910265 | 1.742367 | 1.4209072 | 0.3731061 |
| 33084 | QB | Matt Ryan | 452 | 197 | 1.7290265 | 1.873606 | 1.6897788 | 0.4358407 |
| 33099 | QB | Joe Flacco | 206 | 87 | 1.9605583 | 2.520777 | 1.8808155 | 0.4223301 |
| 33138 | QB | Chad Henne | 8 | 0 | 0.4796250 | 0.533750 | 0.4796250 | 0.0000000 |
| 34452 | QB | Matthew Stafford | 454 | 189 | 1.6292687 | 2.260419 | 1.5644273 | 0.4162996 |
| 34754 | QB | Chase Daniel | 1 | NA | NA | NA | NA | NA |
| 34843 | QB | Brian Hoyer | 15 | 5 | 0.8764667 | 1.627333 | 0.8764667 | 0.3333333 |
| 37110 | QB | Andy Dalton | 327 | 123 | 1.5200306 | 2.179816 | 1.4777554 | 0.3761468 |
| 37255 | QB | Tyrod Taylor | 9 | 1 | 0.8846667 | 1.038889 | 0.8846667 | 0.1111111 |
After generation of variables and data preparation is completed, the model can be generated. A split of 70/30 training and testing is utilized. Seeing as the play data is being utilized, we avoid the issue of a play being split up unevenly between training and test as would be the case if we were utilizing the tracking data which shows by frames. With the 14 variables used, the model achieves ~65% accuracy of predicting the correct outcome of the play.
The below table visualizes Gini and increase in inaccuracy which illustrates each variables impact on the creation of the random forest. Gini is a measure of how important each variable is to constructing the model where decrease in accuracy lowers the level of accuracy of the model and introduces confusion.
dfDataSet$passResult <- as.factor(dfDataSet$passResult)
set.seed(42)
trainIndex <- sample(seq_len(nrow(dfDataSet)), size = 0.7 * nrow(dfDataSet))
trainData <- dfDataSet[trainIndex, ]
testData <- dfDataSet[-trainIndex, ]
rfModel <- randomForest(
passResult ~ passLength + yardsToGo + yardlineNumber + gameClock + preSnapHomeScore + preSnapVisitorScore + preSnapHomeTeamWinProbability + preSnapVisitorTeamWinProbability + receiverAlignment + playClockAtSnap + avgTimeToThrow + avgDropbackDistance + avgTimeInTackleBox + pff_passCoverage + accuracy,
data = trainData,
nodesize = 1,
ntree = 5000,
importance = TRUE
)
predictions <- predict(rfModel, testData)
confusionMatrix <- table(testData$passResult, predictions)
accuracy <- sum(diag(confusionMatrix)) / sum(confusionMatrix)
importance <- importance(rfModel) %>% data.frame() %>% arrange(-MeanDecreaseAccuracy) %>% select(MeanDecreaseGini,MeanDecreaseAccuracy)
importance$variable <- rownames(importance)
rownames(importance) <- NULL
importance2 <- importance %>% select(variable, MeanDecreaseGini, MeanDecreaseAccuracy) %>% arrange(-MeanDecreaseGini)
library(gridExtra)
library(grid)
giniPlot <- ggplot(importance2, aes(x = MeanDecreaseGini, y = reorder(variable, MeanDecreaseGini))) +
geom_bar(stat = "identity", fill = "skyblue") +
labs(title = "Mean Decrease in Gini", x = "Value", y = "Variable Modeled") +
geom_text(aes(label = round(MeanDecreaseGini, 2)), # Display rounded values as labels
hjust = -0.3, # Position label slightly above the bar
size = 3) + # Set the size of the text
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1),
plot.title = element_text(hjust = 0.5)) +
xlim(0, max(importance2$MeanDecreaseGini) * 1.1)
# Create the bar chart for MeanDecreaseAccuracy
accuracyPlot <- ggplot(importance2, aes(x = MeanDecreaseAccuracy, y= reorder(variable, MeanDecreaseGini))) +
geom_bar(stat = "identity", fill = "salmon") +
labs(title = "Mean Decrease in Accuracy", x = "Value", y = "Variable Modeled") +
geom_text(aes(label = round(MeanDecreaseAccuracy, 2)), # Display rounded values as labels
hjust = -0.3, # Position label slightly above the bar
size = 3) + # Set the size of the text
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1),
plot.title = element_text(hjust = 0.5)) +
xlim(min(importance2$MeanDecreaseAccuracy)*1.1, max(importance2$MeanDecreaseAccuracy) * 1.1)
# Arrange the two plots side-by-side
grid.arrange(
giniPlot, accuracyPlot,
ncol = 2,
top = textGrob(
"Comparison of Variable Importance Metrics", # Title text
gp = gpar(fontsize = 16, fontface = "bold") # Customize size and style
)
)
The below table shows the confusion matrix produced by our model. This shows that the model is very accurate a predicting completed passes as completed passes, it struggles to properly identify true incomplete passes.
# Pulls the confusion Matrix from above
knitr::kable(head(confusionMatrix, 10)) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| C | I | |
|---|---|---|
| C | 232 | 62 |
| I | 125 | 99 |
The below two charts demonstrate an example of how the model goes about predicting based on a pre-snap condition. Receiver alignment would be a known variable to an offensive coordinator as it is part of the play design. The model when fed this information, is able to better inform its prediction as to the outcome of the pass based on the statistics that are illustrated below
library(ggplot2)
ggplot(dfDataSet, aes(x = receiverAlignment, fill = passResult)) +
geom_bar(position = "fill") +
labs(title = "Pass Result by Receiver Alignment", x = "Receiver Alignment", y = "Proportion")
# Enhanced Bar Plot
ggplot(dfDataSet, aes(x = reorder(receiverAlignment, passResult, FUN = length), fill = passResult)) +
geom_bar(position = "fill") +
geom_text(
stat = "count",
aes(label = scales::percent(..count.. / tapply(..count.., ..x.., sum)[..x..], accuracy = 0.1)),
position = position_fill(vjust = 0.5),
size = 3
) +
scale_fill_manual(
values = c("C" = "green", "I" = "firebrick"),
name = "Pass Result",
labels = c("Complete", "Incomplete")
) +
labs(
title = "Proportion of Pass Results by Receiver Alignment",
subtitle = "Distribution of pass completions and incompletions across alignments",
x = "Receiver Alignment",
y = "Proportion",
fill = "Pass Result"
) +
theme_minimal() +
theme(
plot.title = element_text(size = 14, face = "bold"),
plot.subtitle = element_text(size = 10, face = "italic"),
axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "top"
)
While the accuracy of this model is not 100%, this project creates value for the NFL by providing coaches and offensive coordinators with a quick and dirty way of analyzing the effectiveness of a given passing play based on data that is known entirely before the ball is snapped. This allows for coordinators to make small changes to an offensive scheme and with an accuracy of slightly less than 70%, evaluate the result of that passing play
Future work on this project includes improvements to the model to further enhance accuracy and creation of an interface that a coordinator or coach could enter in the answer to the 14 pre-snap variables involved in the model and be presented with the model’s prediction of the outcome of the play. This would allow real time analysis of offensive schemes in a control enviornment in preperation for gameday.
If interested in more or have any questions or suggestions, please get in contact with us at brygnichols@gmail.com or ssilguero@loyola.edu