Every year, the NFL holds a big competition called the Big Data Bowl in which they release a ton of their data about players, games, and plays with data captures recording countless things about each player 10 times per second. After releasing this data, they prompt the community with a certain focus and ask for the best, most actionable analysis about it. This year, the focus was on offensive and defensive lines in the pass game, asking us to find insights about what goes on in the trenches.
When tasked with this as a final project in our sports analytics class, we decided that in order to find out who is the best pass rusher, we need to figure out how quickly they get to the quarterback, as that is the focus of each pass rusher once the ball is snapped. Not only did we want to find out who gets there quickest, we also wanted to know who is able to maintain pressure by not getting pushed back by the offensive line, as in order to cause effective pressure on the opposing quarterback, a pass rusher must go forward as much as possible and avoid being pushed back. With these goals in mind, we got to work and began trying to create actionable pass rushing stats.
To start off, we had to import the data from the NFL into our R file and then extract the parts of it that we needed to create our statistics. We split it into two data frames to start, getting the quarterback into one and then the positions that would be pass rushing into another. Once we merged these data frames, we then had to calculate the stats we would need, such as the distance from each defender to the quarterback, seconds until the defender reaches the quarterback, and how often a defender is pushed back.
# NFL Big Data Bowl Submission
# Michael McCrory & Nick Harding
# LOAD IN LIBRARIES
suppressWarnings(suppressMessages(library(data.table)))
suppressWarnings(suppressMessages(library(dplyr)))
suppressWarnings(suppressMessages(library(ggplot2)))
suppressWarnings(suppressMessages(library(gganimate)))
suppressWarnings(suppressMessages(library(gifski)))
suppressWarnings(suppressMessages(library(stats)))
# CREATE WORKING DIRECTORY AND LOAD IN FILE
mypath <- "C:/Users/nrhar/OneDrive/Documents/SA 2023 NFL files"
setwd(mypath)
bigDataFile <- fread("nfl_big_data_bowl_all_data_merged_2023.csv")
# CREATE STATISTICS NEEDED
qbDF <- bigDataFile %>%
filter(frameId > 5) %>%
filter(officialPosition == "QB") %>%
data.frame()
everyoneElseDF <- bigDataFile %>%
filter(frameId > 5, pff_role == "Pass Rush") %>%
filter(officialPosition %in% c("MLB", "DT", "ILB", "OLB", "NT", "LB", "DE")) %>%
data.frame()
totalDF <- left_join(qbDF, everyoneElseDF, by = c("gameId" = "gameId",
"playId" = "playId",
"frameId" = "frameId",
"time" = "time")) %>%
mutate(distNow = sqrt((x.x - x.y)^2 + (y.x - y.y)^2)) %>%
arrange(gameId, playId, frameId) %>%
mutate(yardsGained = lag(distNow) - distNow) %>%
data.frame()
df10 <- totalDF %>%
mutate(secondsToFailure1 = (1 / yardsGained / .1)*distNow) %>%
data.frame()
df12 <- df10 %>%
mutate(binaryPercentage = ifelse(secondsToFailure1 >= 0, 0, 1)) %>%
data.frame()
Now that our data is put together and we have the basic stats that we need, we must then break up each position into different data frames in order to calculate the final statistics. In each of these, we have the different positions that pass rush, DE, DT, NT, and OLB. We initially included the MLBs and ILBs as well however the amount that they played and the stats they had were not on par with the other positions. Once we had all these done it was just merging them back together in order to create data frames that we could use to visualize the data.
dt_df1 <- df12 %>%
filter(officialPosition.y == "DT") %>%
select(displayName.y, nflId.y, officialPosition.y, binaryPercentage) %>%
filter(!is.na(binaryPercentage),
displayName.y != "Malik Jackson") %>%
group_by(displayName.y, nflId.y, officialPosition.y) %>%
summarise(totalNegative = sum(binaryPercentage),
totalPositive = n() - totalNegative,
totalSeconds = n()/10,
percentageNegative = if(totalPositive > 0) round(100*totalNegative/(totalNegative+totalPositive), 2) else 0,
.groups = 'keep') %>%
arrange(percentageNegative) %>%
ungroup() %>%
filter(totalSeconds > 420) %>%
top_n(10, wt = -percentageNegative) %>%
data.frame()
nt_df1 <- df12 %>%
filter(officialPosition.y == "NT") %>%
select(displayName.y, nflId.y, officialPosition.y, binaryPercentage) %>%
filter(!is.na(binaryPercentage)) %>%
group_by(displayName.y, nflId.y, officialPosition.y) %>%
summarise(totalNegative = sum(binaryPercentage),
totalPositive = n() - totalNegative,
totalSeconds = n()/10,
percentageNegative = if(totalPositive > 0) round(100*totalNegative/(totalNegative+totalPositive), 2) else 0,
.groups = 'keep') %>%
arrange(percentageNegative) %>%
ungroup() %>%
filter(totalSeconds > 420) %>%
top_n(10, wt = -percentageNegative) %>%
data.frame()
olb_df1 <- df12 %>%
filter(officialPosition.y == "OLB") %>%
select(displayName.y, nflId.y, officialPosition.y, binaryPercentage) %>%
filter(!is.na(binaryPercentage)) %>%
group_by(displayName.y, nflId.y, officialPosition.y) %>%
summarise(totalNegative = sum(binaryPercentage),
totalPositive = n() - totalNegative,
totalSeconds = n()/10,
percentageNegative = if(totalPositive > 0) round(100*totalNegative/(totalNegative+totalPositive), 2) else 0,
.groups = 'keep') %>%
arrange(percentageNegative) %>%
ungroup() %>%
filter(totalSeconds > 420) %>%
top_n(10, wt = -percentageNegative) %>%
data.frame()
de_df1 <- df12 %>%
filter(officialPosition.y == "DE") %>%
select(displayName.y, nflId.y, officialPosition.y, binaryPercentage) %>%
filter(!is.na(binaryPercentage)) %>%
group_by(displayName.y, nflId.y, officialPosition.y) %>%
summarise(totalNegative = sum(binaryPercentage),
totalPositive = n() - totalNegative,
totalSeconds = n()/10,
percentageNegative = if(totalPositive > 0) round(100*totalNegative/(totalNegative+totalPositive), 2) else 0,
.groups = 'keep') %>%
arrange(percentageNegative) %>%
ungroup() %>%
filter(totalSeconds > 420) %>%
top_n(10, wt = -percentageNegative) %>%
data.frame()
# MERGE DATAFRAMES
position_df1 <- rbind(dt_df1, nt_df1)
position_df2 <- rbind(olb_df1, de_df1)
final_df1 <- rbind(position_df1, position_df2)
visual_df1 <- final_df1 %>%
top_n(8, wt = -percentageNegative) %>%
data.frame()
######################
df11 <- df10 %>%
filter(secondsToFailure1 >= 0) %>%
filter(secondsToFailure1 < 600) %>%
data.frame()
de_df2 <- df11 %>%
filter(officialPosition.y == "DE") %>%
filter(!is.na(secondsToFailure1)) %>%
group_by(displayName.y, nflId.y, officialPosition.y) %>%
summarise(meanSecondsToQB = mean(secondsToFailure1),
totalSeconds = n()/10) %>%
arrange(meanSecondsToQB) %>%
filter(totalSeconds > 300) %>%
ungroup() %>%
top_n(5, wt = -meanSecondsToQB) %>%
data.frame()
dt_df2 <- df11 %>%
filter(officialPosition.y == "DT") %>%
filter(!is.na(secondsToFailure1)) %>%
group_by(displayName.y, nflId.y, officialPosition.y) %>%
summarise(meanSecondsToQB = mean(secondsToFailure1),
totalSeconds = n()/10) %>%
arrange(meanSecondsToQB) %>%
filter(totalSeconds > 300) %>%
ungroup() %>%
top_n(5, wt = -meanSecondsToQB) %>%
data.frame()
nt_df2 <- df11 %>%
filter(officialPosition.y == "NT") %>%
filter(!is.na(secondsToFailure1)) %>%
group_by(displayName.y, nflId.y, officialPosition.y) %>%
summarise(meanSecondsToQB = mean(secondsToFailure1),
totalSeconds = n()/10) %>%
arrange(meanSecondsToQB) %>%
filter(totalSeconds > 200) %>%
ungroup() %>%
top_n(5, wt = -meanSecondsToQB) %>%
data.frame()
olb_df2 <- df11 %>%
filter(officialPosition.y == "OLB") %>%
filter(!is.na(secondsToFailure1)) %>%
group_by(displayName.y, nflId.y, officialPosition.y) %>%
summarise(meanSecondsToQB = mean(secondsToFailure1),
totalSeconds = n()/10) %>%
arrange(meanSecondsToQB) %>%
filter(totalSeconds > 200) %>%
ungroup() %>%
top_n(5, wt = -meanSecondsToQB) %>%
data.frame()
# MERGE DATAFRAMES
position2_df1 <- rbind(dt_df2, nt_df2)
position2_df2 <- rbind(olb_df2, de_df2)
final_df2 <- rbind(position2_df1, position2_df2)
visual_df2 <- final_df2 %>%
top_n(8, wt = -meanSecondsToQB) %>%
data.frame()
With our stats calculated and put into data frames that are organized, it is now possible to use these stats in visualizations in order to see how each defender and position matches up. We created a trellis chart for each stat separating the players by position as well as a bar chart for each stat showing the players who had the best stats regardless of position.
# CREATE VISUALIZATION
ggplot(final_df1, aes(x = percentageNegative, y = reorder(displayName.y, -percentageNegative))) +
geom_bar(stat = "identity", fill = "navyblue") +
geom_text(aes(label = paste0(round(percentageNegative, 2), "%")), hjust = 1.2, size = 3, color = "white") +
labs(y= "% Pushed Back\n", x = "\nName", title = '% of Time Spent Pushed Back\n') +
facet_wrap(ncol = 2, nrow = 2, ~officialPosition.y, scales = 'free') +
theme(plot.title = element_text(hjust = 0.5), axis.text.x = element_text(size = 7.5),
panel.grid.major = element_blank(), panel.grid.minor = element_blank())
ggplot(visual_df1, aes(x = reorder(displayName.y, percentageNegative), y = percentageNegative)) +
geom_bar(stat = "identity", fill = "red4") +
geom_text(aes(label = paste0(round(percentageNegative, 1), "%")), vjust = 2, size = 3, color = "white") +
geom_text(aes(label = officialPosition.y), vjust = 4, size = 3, color = "white") +
labs(y= "% Pushed Back\n", x = "\nName", title = '% Pushed Back Overall') +
theme(plot.title = element_text(hjust = 0.5), axis.text.x = element_text(size = 7.5),
panel.grid.major = element_blank(), panel.grid.minor = element_blank())
# CREATE VISUALIZATION
ggplot(final_df2, aes(x = reorder(displayName.y, meanSecondsToQB), y = meanSecondsToQB)) +
geom_bar(stat = "identity", fill = "navyblue") +
geom_text(aes(label = paste0(round(meanSecondsToQB, 1), " secs")), vjust = 2, size = 3, color = "white") +
labs(y= "# Seconds To QB", x = "\nName", title = 'Lowest Average secondsToQB by Position\n') +
facet_wrap(ncol = 2, nrow = 2, ~officialPosition.y, scales = 'free') +
theme(plot.title = element_text(hjust = 0.5), axis.text.x = element_text(size = 8),
panel.grid.major = element_blank(), panel.grid.minor = element_blank())
ggplot(visual_df2, aes(x = reorder(displayName.y, meanSecondsToQB), y = meanSecondsToQB)) +
geom_bar(stat = "identity", fill = "tan2") +
geom_text(aes(label = paste0(round(meanSecondsToQB, 1), " secs")), vjust = 2, size = 3, color = "white") +
geom_text(aes(label = officialPosition.y), vjust = 4, size = 3, color = "white") +
labs(y= "# Seconds To QB", x = "\nName", title = 'Lowest Average secondsToQB Overall') +
theme(plot.title = element_text(hjust = 0.5), axis.text.x = element_text(size = 7.5),
panel.grid.major = element_blank(), panel.grid.minor = element_blank())