Introduction

In this project, we analyze the NFL Big Data Bowl 2025 data set to uncover insights into team strategies and player movements. The data includes player tracking information, game details, play descriptions, and player-specific actions. The project involves joining multiple data sets and creating new features for analysis and visualization. To start, we source the necessary shared functions and set up our working directory. We then load several key libraries, including httr, data.table, and dplyr, which aid in data manipulation and analysis. Our goal was to analyze information from before the snap: we wanted to use the offense’s pre-snap formation to predict what it was going to do, which would help the defense know what it needed to do to protect the end zone. Along the way, we had to change our end goal based on the data provided and the time allotted, but we are happy with the result. We scored each team on its performance in each offensive formation (or, alternatively, on the defense’s ability to stop those formations) and analyzed the average speed of key players on each team.

Description of Project

Our main goal in this project was to analyze the defense. Using the offensive formation variable (offenseFormation), we were able to analyze both the defense and the offense. We assigned each team a score based on the outcome of each play when the offense was in each formation. We also created another visualization that shows each team’s average speed by frame, giving us a second way to analyze performance.

Data Visualization

Our first visualizations were heat maps: one for the offensive results and one for the defensive results. Teams are on the vertical axis, and the formations the offense can line up in are on the horizontal axis. The offensive heat map shows which teams are most effective in each formation. The darker a team’s square for a formation, the better its average score, and a higher score means the team gained more against the defense from that formation. On the defensive heat map, the darker the square, the better that team was at stopping the offense in the formation shown. The heat maps show which offensive formations work best for individual teams and which work best overall. There aren’t many dark squares on the offensive heat map, which shows that most teams didn’t have very high offensive scores. It makes sense, then, that the defensive heat map is darker: the two scales mirror each other, so a typical gain of 1 to 20 yards earns the offense 1 point but the defense 5.
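One way to read off the "best overall" formations is to average the per-team scores within each formation. The sketch below uses base R on a small made-up table shaped like summary_df_offense; the team codes, formation names, and numbers are invented for illustration and are not results from the data.

```r
# Toy stand-in for the per-team summary table (all values are invented).
toy_summary <- data.frame(
  possessionTeam   = c("AAA", "AAA", "BBB", "BBB"),
  offenseFormation = c("SHOTGUN", "I_FORM", "SHOTGUN", "I_FORM"),
  averagescore     = c(1.5, 0.5, 1.0, 1.0)
)

# Average score of each formation across teams, sorted best first.
overall <- aggregate(averagescore ~ offenseFormation, data = toy_summary, FUN = mean)
overall <- overall[order(-overall$averagescore), ]
print(overall)
```

This weights every team equally. Weighting each team's average by the number of plays it ran from that formation would be another reasonable choice.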

The line graph shows the average speed of the two teams in a game, by frame. Frame ID is on the x-axis and average speed is on the y-axis. We restricted this chart to players at the SS, FS, RB, and FB positions, as those are the players whose speed matters most for what we were trying to analyze. The example shown here is a game between Denver and Seattle. The teams were well matched at the beginning of the game, but Denver had more possession at the end of the game. Seattle won this game, but that was because of its 17 points in the first half. Denver had 3 points in the 4th quarter but lost 16 to 17. The line graph is useful because it works for any game: the visualization is produced by a function that takes a game ID as its argument. The line graph also shows the average speed throughout the entire game, so a team can be analyzed at any point.

source('https://raw.githubusercontent.com/ptallon/SportsAnalytics_Fall2024/main/SharedCode.R')
setwd("/Users/rubysullivan/Desktop/Sports Analytics")
mypath <- paste0(getwd(), "/Data/NFLBDB2025")
directory <- paste0(getwd(), "/Data/NFLBDB2025")
library(httr)
library(data.table)
library(dplyr)
library(ggplot2)   # needed for the heat maps and line graph below
library(ggrepel)
library(stringr)

week1 <- fread("Data/NFLBDB2025/tracking_week_1.csv")
games_1 <- fread("Data/NFLBDB2025/games.csv")
plays_1 <- fread("Data/NFLBDB2025/plays.csv")
players_1 <- fread("Data/NFLBDB2025/players.csv")
player_play_1 <- fread("Data/NFLBDB2025/player_play.csv")

df <- left_join(week1, games_1, by = c("gameId"))
df <- left_join(df, plays_1, by = c("gameId", "playId"))
df <- left_join(df, players_1, by = c("nflId"))
dfall <- left_join(df, player_play_1, by = c("gameId", "playId", "nflId"))

#create data frame of wanted variables only before the snap and at the snap

wantedvariables <- dfall %>%
  select(gameId, playId, nflId, frameId, frameType, time, jerseyNumber, playDirection, x, y, s, dis, club, 
         week, homeTeamAbbr, visitorTeamAbbr, playDescription, possessionTeam, defensiveTeam, offenseFormation, 
         receiverAlignment, yardsGained, pff_passCoverage, pff_manZone, position, inMotionAtBallSnap, motionSinceLineset, shiftSinceLineset) %>%
  filter(frameType %in% c("BEFORE_SNAP", "SNAP")) %>%   # %in% avoids the element-wise recycling that == would do
  mutate(team = ifelse( club == homeTeamAbbr, "home", ifelse(club == "football", "football", "away"))) %>%
  data.frame()

#pick out who has possession of the ball
defense <- dfall %>%
  select(gameId, playId, playDescription, jerseyNumber, possessionTeam, defensiveTeam, offenseFormation, frameType, receiverAlignment, inMotionAtBallSnap) %>%
  filter(frameType == "BEFORE_SNAP") %>%
  distinct() %>%
  data.frame()

#counting each offensive formation for each team

offensive_formations <- dfall %>%
  select(possessionTeam, offenseFormation, frameType, playId, gameId) %>%
  filter(frameType == "BEFORE_SNAP") %>%
  distinct() %>%
  data.frame()

#outcomes

outcomes <- dfall %>%
  select(possessionTeam, playDescription, offenseFormation, playId, gameId, defensiveTeam) %>%
  mutate(outcome = case_when(
    str_detect(playDescription, "TOUCHDOWN") ~ "T",
    str_detect(playDescription, "no gain") ~ "0",
    str_detect(playDescription, "\\-?\\d+ yards?") ~ str_extract(playDescription, "\\-?\\d+(?= yards?)"),  # "yards?" also matches the singular "1 yard"
    TRUE ~ "0" 
  )) %>%
  distinct() %>%
  data.frame()

offense_scores <- outcomes %>%
  select(possessionTeam, offenseFormation, outcome, gameId, defensiveTeam) %>%
  mutate(
    # outcome is "T" for a touchdown, otherwise the yards gained as text
    yards = suppressWarnings(as.numeric(outcome)),  # NA for "T"
    score = case_when(
      outcome == "T" ~ 6,   # touchdown
      yards < 0      ~ -1,  # loss of yards
      yards == 0     ~ 0,   # no gain
      yards <= 20    ~ 1,
      yards <= 40    ~ 2,
      yards <= 60    ~ 3,
      yards <= 80    ~ 4,
      yards <= 99    ~ 5,
      TRUE           ~ NA_real_
    )
  ) %>%
  select(-yards) %>%
  data.frame()

summary_df_offense <- offense_scores %>%
  group_by(possessionTeam, offenseFormation) %>%
  summarise(
    numberoftimestheplaywasran = n(),
    averagescore = mean(score, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  data.frame()

defense_scores <- outcomes %>%
  select(defensiveTeam, offenseFormation, outcome, possessionTeam) %>%
  mutate(
    # mirror of the offensive scale: the fewer yards allowed, the higher the score
    yards = suppressWarnings(as.numeric(outcome)),  # NA for "T"
    score = case_when(
      outcome == "T" ~ 0,   # touchdown allowed
      yards <= 0     ~ 6,   # loss of yards or no gain
      yards <= 20    ~ 5,
      yards <= 40    ~ 4,
      yards <= 60    ~ 3,
      yards <= 80    ~ 2,
      yards <= 99    ~ 1,
      TRUE           ~ NA_real_
    )
  ) %>%
  select(-yards) %>%
  data.frame()

summary_df_defense <- defense_scores %>%
  group_by(defensiveTeam, offenseFormation) %>%
  summarise(
    numberoftimestheplaywasran = n(),
    averagescore = mean(score, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  data.frame()

Visualization 1: Offensive Heatmap

ggplot(summary_df_offense, aes(x = offenseFormation, y = possessionTeam, fill = averagescore)) +
  geom_tile() +
  scale_fill_gradient(low = "white", high = "blue") +
  labs(x = "Offense Formation", y = "Possession Team", fill = "Offensive Average Score") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  ggtitle("Offensive Heatmap")


Visualization 2: Defensive Heatmap

# Create defense heatmap
ggplot(summary_df_defense, aes(x = offenseFormation, y = defensiveTeam, fill = averagescore)) +
  geom_tile() +
  scale_fill_gradient(low = "white", high = "blue") +
  labs(x = "Offense Formation", y = "Defensive Team", fill = "Defensive Average Score") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  ggtitle("Defensive Heatmap")


Visualization 3: Line graph for average speed of two teams in any given gameID

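# Plot average speed by frame for the two teams in a given game
plot_average_speed <- function(game_id) {
  filtered_data <- dfall %>%
    select(possessionTeam, dis, position, frameId, gameId, playId) %>%
    # keep only safeties (SS, FS) and backs (RB, FB)
    filter(position %in% c("SS", "FS", "RB", "FB")) %>%
    filter(gameId == game_id) %>%
    # rows are grouped by the team in possession on each play
    group_by(frameId, possessionTeam) %>%
    # dis is the distance (yards) covered since the previous frame, used here as the speed measure
    summarise(avr = mean(dis), .groups = "drop") %>%
    data.frame()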
  
  
  ggplot(filtered_data, aes(x = frameId, y = avr, color = possessionTeam)) + 
    geom_line() + 
    labs(
      title = paste0("Average Speed by Team Over Frames (Game ", game_id, ")"),
      x = "Frame ID",
      y = "Average Speed (yards per frame)",
      color = "Possession Team"
    ) + 
    theme_minimal() + 
    scale_color_manual(values = c("darkorange", "green"))
}

#example gameID, this works for any gameID
plot_average_speed(2022091200)


Conclusion

Overall, we are very happy with how this project turned out. We started with a much wider scope and had to narrow it based on the data provided and the time we had, but we were still able to create some insightful visualizations. With more time, we might expand on the line graph by using a metric other than average speed for the y-axis, which would let us get even more information out of the data. We would also consider a more advanced scoring system for our heat maps. We used a coarse scale (-1 to 6 for the offense and 0 to 6 for the defense) with values assigned by 20-yard ranges of yards gained or lost; with more time, we could have built a finer system that assigned values based on smaller ranges. This was a fun project to work on together and a great way to use what we learned in this class all semester.
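As a rough sketch of what a finer-grained scoring system might look like, the base R snippet below bins yards gained into 5-yard ranges with cut(). The function name score_yards_fine, the 5-yard bin width, and the point values are hypothetical choices for illustration; none of this was implemented in the project, and touchdowns would still need their own rule.

```r
# Hypothetical finer-grained offensive score (an assumption, not the project's method):
# -1 for any loss, 0 for no gain, then one extra point per 5-yard band.
score_yards_fine <- function(yards) {
  breaks <- c(-Inf, -1, 0, seq(5, 100, by = 5))
  # labels = FALSE returns the bin index: 1 = loss, 2 = no gain, 3 = 1-5 yards, ...
  bin <- cut(yards, breaks = breaks, labels = FALSE)
  bin - 2
}

# Example call on a few whole-number yardages.
example_scores <- score_yards_fine(c(-3, 0, 1, 5, 6, 23))
print(example_scores)
```

Because cut() uses right-closed intervals by default, a gain of exactly 5 yards falls in the 1 to 5 band and a gain of 6 starts the next one.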