NFL Passing or Rushing

Author

Noah Lazar

For this project, I initially set out to work with NFL data, as there is an abundance of it readily available. My first approach involved pulling datasets from Kaggle, which included information on teams, stadiums, weather conditions, and betting odds. However, this proved to be quite challenging due to the complexity of integrating and making sense of such diverse data sources.

After discussing my struggles with Alex, they introduced me to nflfastR, a comprehensive play-by-play data package designed specifically for NFL analytics. This was a game-changer, and I decided to pivot to nflfastR for my project. With its wealth of information and a structure tailored for football analysis, I was able to shift focus to something more manageable and meaningful.

The aspect of NFL gameplay that resonated most with me was predicting whether a play would be a pass or a rush. This is a fundamental distinction in football strategy and provides a great opportunity for predictive modeling. I explored various methods and techniques to approach this classification problem, experimenting with different models and frameworks to achieve reliable predictions. Each step of the process deepened my understanding of the data and the dynamics of NFL gameplay.

This shift not only simplified the technical challenges but also aligned the project with a clear and focused objective, making it both more engaging and insightful.

library(nflfastR)
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(gsisdecoder)
library(randomForest)
randomForest 4.7-1.2
Type rfNews() to see new features/changes/bug fixes.

Attaching package: 'randomForest'

The following object is masked from 'package:dplyr':

    combine

The following object is masked from 'package:ggplot2':

    margin
library(janitor)

Attaching package: 'janitor'

The following objects are masked from 'package:stats':

    chisq.test, fisher.test
library(caret)
Loading required package: lattice

Attaching package: 'caret'

The following object is masked from 'package:purrr':

    lift
library(ipred)
library(tidytext)

This code processes NFL play-by-play (PBP) data from the 2023 season. It starts by loading the data and slicing the first 7,000 rows for analysis (because the full dataset was taking too long to load). The dataset is then filtered to exclude missing EPA values and retain only “pass” and “run” play types. A new binary variable, is_pass, converts play_type into numeric form (1 for pass, 0 for run). The code then selects relevant features, such as game context (e.g., score_differential, posteam_timeouts_remaining) and play-specific metrics (e.g., game_seconds_remaining, ydstogo, and rushing and passing EPA totals), which are likely important for predicting whether a play is a pass or a run. Finally, any rows with missing data are removed using na.omit, preparing the dataset for modeling. This cleaned data, model_data, is now ready for use in a predictive model.

pbp = load_pbp(2023) |>
  slice(1:7000)

#table(pbp$play_type)

filtered_data <- pbp |>
  filter(!is.na(epa), play_type %in% c('pass', 'run')) |>
  mutate(is_pass = ifelse(play_type == 'pass', 1, 0))
# encode play_type numerically: 1 = pass, 0 = run

model_data = filtered_data |>
  select(is_pass,
         game_seconds_remaining, down, ydstogo, yardline_100,
         score_differential, posteam_timeouts_remaining,
         defteam_timeouts_remaining, quarter_seconds_remaining,
         half_seconds_remaining, drive, qtr, ydsnet,
         total_home_rush_epa, total_away_rush_epa,
         # total_away_rush_epa appeared twice here; the duplicate is dropped
         # (total_away_pass_epa may have been intended)
         total_home_pass_epa, wp, desc) |>
  na.omit()

This code splits model_data into training and test sets, using 50% of the data for training and the rest for testing, and ensures that the is_pass variable is treated as a factor for classification. A random forest model is then trained to predict is_pass on the training data, growing 500 trees and trying 4 variables at each split. The out-of-bag (OOB) error rate is 34.5%. The confusion matrix shows that the model correctly classifies 525 “run” and 1,106 “pass” plays, with class error rates of 0.488 for “run” and 0.245 for “pass.”

set.seed(1)
train_index = createDataPartition(model_data$is_pass, p = 0.5, list = FALSE)
train_data = model_data[train_index, ]|>
 slice(1:5000)
test_data = model_data[-train_index, ]|>
 slice(1:5000)


train_data$is_pass = as.factor(train_data$is_pass)
test_data$is_pass = as.factor(test_data$is_pass)


rf_model = randomForest(is_pass ~ ., data = train_data)


print(rf_model)

Call:
 randomForest(formula = is_pass ~ ., data = train_data) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 4

        OOB estimate of  error rate: 34.5%
Confusion matrix:
    0    1 class.error
0 525  501   0.4883041
1 358 1106   0.2445355

The random forest model achieved an accuracy of 66.81% on the test set, with a Kappa of 0.301, indicating moderate agreement between predicted and actual values. Sensitivity was 50.91% and specificity was 78.40% (note that the positive class here is 0, i.e., run plays). The most influential variables, as measured by MeanDecreaseGini, were wp, half_seconds_remaining, ydstogo, and yardline_100.

predictions = predict(rf_model, test_data, type = 'response')

confusion_matrix = confusionMatrix(as.factor(predictions), as.factor(test_data$is_pass))
print(confusion_matrix)
Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0  534  311
         1  515 1129
                                          
               Accuracy : 0.6681          
                 95% CI : (0.6493, 0.6866)
    No Information Rate : 0.5785          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.301           
                                          
 Mcnemar's Test P-Value : 1.626e-12       
                                          
            Sensitivity : 0.5091          
            Specificity : 0.7840          
         Pos Pred Value : 0.6320          
         Neg Pred Value : 0.6867          
             Prevalence : 0.4215          
         Detection Rate : 0.2145          
   Detection Prevalence : 0.3395          
      Balanced Accuracy : 0.6465          
                                          
       'Positive' Class : 0               
                                          
importance(rf_model)
                           MeanDecreaseGini
game_seconds_remaining             87.89676
down                               67.28140
ydstogo                            92.97961
yardline_100                       90.34105
score_differential                 57.49231
posteam_timeouts_remaining         15.08450
defteam_timeouts_remaining         13.47222
quarter_seconds_remaining          82.13762
half_seconds_remaining             97.14679
drive                              55.46816
qtr                                13.41556
ydsnet                             70.95013
total_home_rush_epa                80.37910
total_away_rush_epa                80.45762
total_home_pass_epa                88.63848
wp                                123.29779
desc                               88.68270
varImpPlot(rf_model)

This code loads play-by-play (PBP) data for the 2023 NFL season, focusing on “run” and “pass” plays. It preprocesses the data by cleaning column names and filtering for relevant variables, followed by removing missing values. A Random Forest model is fitted to predict play type based on selected features. The model summary is printed, and variable importance is visualized using a plot.

library(dplyr)
library(randomForest)
library(janitor)

pbp_data <- load_pbp(2023)

pbp_clean <- pbp_data |> 
  dplyr::filter(!is.na(play_type), play_type %in% c("run", "pass")) |>
  select(play_type, yardline_100, down, ydstogo, game_seconds_remaining) |> 
  na.omit() |> 
  clean_names()  

pbp_clean$play_type <- factor(pbp_clean$play_type)

rf_pbp <- randomForest(play_type ~ ., 
                       data = pbp_clean, 
                       importance = TRUE)


print(rf_pbp)

Call:
 randomForest(formula = play_type ~ ., data = pbp_clean, importance = TRUE) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 2

        OOB estimate of  error rate: 34.59%
Confusion matrix:
      pass  run class.error
pass 13991 6653   0.3222728
run   5617 9213   0.3787593
varImpPlot(rf_pbp)

This code processes NFL play descriptions from the model_data dataset to analyze the frequency and relationship of words used in pass and run plays. It first filters out missing descriptions and selects the relevant columns. Then, it creates a unique identifier for each play description. The desc column is tokenized into individual words, and stop words are removed to focus on meaningful terms. The code counts the occurrences of each word and groups them by play type (pass vs. run). It then constructs a document-feature matrix (DFM) and performs hierarchical clustering on the scaled word frequencies to visualize patterns.

nfl_words <- model_data |>
  filter(!is.na(desc)) |>
  select(desc, is_pass)

# Create a unique identifier for 'desc'
nfl_words <- nfl_words |>
  mutate(desc_id = row_number())

# Tokenize 'desc' into individual words and filter out stop words
tokenized_desc <- nfl_words |>
  unnest_tokens(input = desc, output = Word, token = "words") |>
  filter(!(Word %in% stop_words$word))|>
  slice(1:1000)

# Count the occurrences of each word
word_counts <- tokenized_desc |>
  count(Word) |>
  arrange(desc(n))

# Count the occurrences of each word by play type
word_counts_by_play <- tokenized_desc |>
  count(is_pass, Word) |>
  arrange(desc(n))|>
  slice(1:1000)

# Creating the document-feature matrix (DFM)
dfm1 <- tokenized_desc |>
  pivot_wider(id_cols = c("desc_id", "is_pass"),
              names_from = "Word",
              values_from = Word,
              values_fn = length,
              values_fill = 0)


dfm1 |>
  select(-desc_id, -is_pass) |>
  scale() |>
  dist() |>
  hclust() |>
  plot()

This code processes NFL play descriptions to identify patterns and reduce dimensionality using word frequency analysis, hierarchical clustering, and Principal Component Analysis (PCA). First, it calculates the total word count for each word in the document-feature matrix (dfm1), retaining only words that appear more than 25 times (as specified by X). The filtered matrix (dfm2) includes only these frequent words. The code then performs hierarchical clustering on the filtered data, calculating pairwise distances and visualizing the results through a dendrogram. This helps reveal similarities in play descriptions based on word usage. Afterward, PCA is applied to reduce the dataset’s dimensionality, identifying the most significant components. The first and fifth principal components (PC1 and PC5) are explored by examining their loadings and scores, providing insight into which words contribute most to each component. This allows for a deeper understanding of the patterns in play descriptions based on word frequencies.

# Load necessary libraries
library(dplyr)
library(tidyr)

# Calculate word counts and filter words used more than X times
is_pass_count <- dfm1 |>
  select_if(is.numeric) |>
  colSums()

X <- 25

# Only return elements which have a value greater than X
good_pass <- names(is_pass_count[which(is_pass_count > X)])

dfm2 <- dfm1 |>
  select(all_of(good_pass))

# Quick check using hierarchical clustering
dfm2 |>
  dist() |>
  hclust() |>
  plot()

# Perform PCA
pca_plays <- dfm2 |>
  prcomp(center = TRUE, scale. = TRUE)

# Look at our first PC
pca_plays$rotation |>
  data.frame()|>
  select(PC1) |>
  arrange(PC1)
                 PC1
yards    -0.23698446
ari      -0.18675732
left     -0.03574220
desc_id   0.09551176
shotgun   0.11407416
14        0.33406936
s.howell  0.37525703
short     0.41712120
pass      0.47652062
is_pass   0.48173123
# Look at the scores for each play
pca_plays$x[,1]
 [1] -2.30541860  2.27487816 -2.64830719  2.26320900  1.49830107  2.20035924
 [7]  2.20518947  1.48709570 -2.92506068 -2.92023044  0.23739370  0.68510913
[13] -0.10549489 -2.54836042  1.82664227 -2.51300387  1.04086849 -2.22330457
[19] -2.59671948  1.77828321 -2.58705900 -2.20398362  0.60109146  0.60592169
[25]  1.10045128 -1.69496332  1.11011175  0.09990032  2.54244485  1.14593210
[31] -1.92952133  1.06175270 -2.22336125 -2.52426592 -1.22954690 -2.41639912
[37]  1.32143767  2.04919192  0.80838910 -3.03424407 -2.49045426 -2.41311378
[43] -1.39757373  1.58847400 -0.12372055  1.63763501 -0.11406008 -2.45664260
[49]  0.14090389  0.07322389 -2.36964165  0.36292319 -2.35998117  2.52605050
[55]  2.43704086  2.51438134 -0.04442807  1.96183222  1.70042810  1.66575779
[61]  2.53853252  1.35800330  2.54819300  0.42088604 -1.86349958  0.72236684
[67]  0.81362189  1.30815148
# Look at PC5
pca_plays$rotation |>
  data.frame() |>
  select(PC5) |>
  arrange(PC5)
                 PC5
desc_id  -0.49560480
left     -0.45691145
14       -0.12177136
s.howell -0.09943687
shotgun  -0.07849934
pass      0.16324321
is_pass   0.20176361
ari       0.24729734
short     0.28713320
yards     0.55115456
# Look at the scores for each play in PC5
pca_plays$x[,5]
 [1]  0.641431034  1.367076680 -0.142926558  0.364330471  1.670866771
 [6] -0.612733885 -0.637797697  1.807811761  0.111533759  0.086469947
[11]  2.118470950 -0.184559369  1.334113358  0.720444666  1.227521886
[16]  0.458180615  2.721130800  0.215346227 -0.331811126  0.175266093
[21] -0.381938751  0.115090978  1.412862013  1.387798201  0.223841041
[26] -1.124057619  0.173713416  0.002173316 -1.182769565  1.051140564
[31] -0.104051332 -0.878542376 -1.087547687 -0.707768309 -1.330819173
[36] -0.043095629  1.485804398 -1.009932684  0.460558654  0.878471930
[41] -0.883214994  0.018657925  0.416909949 -0.021422209 -0.019068301
[46] -1.258624570 -0.069195925 -1.058661680  1.358752851  0.406752307
[51] -0.206916385  0.163917919 -0.257044009  0.063758446 -1.840860682
[56] -0.938987762  0.632784537 -0.777158770 -1.584454129 -0.422443204
[61] -1.064306823 -1.229120522 -1.114434448 -0.136847827 -1.671021251
[66] -0.414999605  0.310054277 -0.853902883

This code analyzes the probability that a specific row (row 1) appears in a bootstrap sample of NFL play descriptions. It first creates a unique identifier for each row and sets the bootstrap sample size to the number of rows in the dataset. It then runs a simulation 1,000 times, each time drawing a bootstrap sample and checking whether row 1 appears, to estimate the probability that row 1 shows up at least once. The code also calculates the limiting probability in closed form: each of the n draws misses row 1 with probability 1 - 1/n, so row 1 appears with probability 1 - (1 - 1/n)^n, which approaches 1 - (1/e) as n grows, where e is Euler's number. The estimated probability (0.622 in this run) is close to the limiting value (about 0.632), showing how well the simulation approximates the exact probability.

nfl_words1 <- model_data|>
  filter(!is.na(desc))|>
  select(desc, is_pass)

# Create a unique identifier for each row
nfl_words1 <- nfl_words1 |>
  mutate(row_id = row_number())

# Set up the number of rows (bootstrap samples)
n_boot <- nrow(nfl_words1)

# Generate bootstrap sample
bootstrap_sample <- sample(1:n_boot, size = n_boot, replace = TRUE)

# Probability that a given row appears at all in the bootstrap sample
simulations <- replicate(1000, {
  bootstrap_sample <- sample(1:n_boot, size = n_boot, replace = TRUE)
  sum(bootstrap_sample == 1) > 0
})

# Proportion of times that row 1 shows up in the bootstrap sample
prob_appears <- mean(simulations)

# Calculate using the known probability
prob_appears_exact <- 1 - (1 / exp(1))

# Print probabilities
print(paste("Estimated Probability:", prob_appears))
[1] "Estimated Probability: 0.622"
print(paste("Exact Probability:", prob_appears_exact))
[1] "Exact Probability: 0.632120558828558"

The graph generated by the code is not informative, as it uses the is_pass column (binary pass/run indicator) for the x-axis, which does not provide meaningful insight when visualizing clusters. The plot is a poor representation of the clustering results; a better approach would use more relevant numeric features, such as expected points added (EPA) or other performance metrics, or a low-dimensional projection of the word frequencies, for a clearer view of cluster separation (a sketch follows the plot below).

# Filter non-missing 'desc' and select relevant columns
nfl_words2 <- model_data |>
  filter(!is.na(desc)) |>
  select(desc, is_pass)

# Create a unique identifier for 'desc'
nfl_words2 <- nfl_words2 |>
  mutate(desc_id = row_number())

# Tokenize 'desc' into individual words and filter out stop words
tokenized_desc1 <- nfl_words2 |>
  unnest_tokens(input = desc, output = Word, token = "words") |>
  filter(!(Word %in% stop_words$word))

# Creating the document-feature matrix (DFM)
dfm3 <- tokenized_desc1 |>
  pivot_wider(id_cols = c("desc_id", "is_pass"),
              names_from = "Word",
              values_from = Word,
              values_fn = length,
              values_fill = 0)


dfm3_subset <- dfm3 |>
  slice(1:700)


tf_matrix <- dfm3_subset|>
  select(-desc_id, -is_pass)

# Set number of clusters
k <- 25

# Perform k-means clustering
k_means_nfl <- kmeans(tf_matrix, centers = k)

# Add cluster assignment to the original data
nfl_clusters <- dfm3_subset|>
  mutate(cluster = k_means_nfl$cluster)

# Visualize clusters (example with made-up columns, adjust as needed)
ggplot(nfl_clusters) +
  geom_point(aes(x = is_pass, y = cluster, color = factor(cluster)), size = 5) +
  labs(title = "K-Means Clustering of NFL Plays",
       x = "Pass/Run Indicator",
       y = "Cluster",
       color = "Cluster")

The code uses a Support Vector Machine (SVM) to predict whether a play results in a first down based on various play features. The model achieved an accuracy of approximately 0.00016, which is essentially zero. This is almost certainly an artifact of leaving first_down numeric: with a numeric response, svm() defaults to regression, so its predictions are continuous values that almost never exactly match the 0/1 labels when tabulated. Converting first_down to a factor so that svm() performs classification would yield a meaningful accuracy (a sketch follows the output below).

# Load necessary libraries
library(nflfastR)
library(dplyr)
library(e1071)

# Load the PBP data for the 2023 season
pbp <- nflfastR::load_pbp(2023)

# Preprocess the data: Filter relevant columns and create a binary target variable
# Convert necessary columns to factors and ensure they have at least two levels
svm_data <- pbp |>
  filter(!is.na(yards_gained) & !is.na(down)) |>
  mutate(first_down = ifelse(yards_gained >= ydstogo, 1, 0)) |>
  select(play_id, game_id, home_team, away_team, season_type, week, posteam, defteam, yardline_100, 
         quarter_seconds_remaining, game_seconds_remaining, game_half, down, ydstogo, yards_gained, 
         shotgun, no_huddle, pass_length, air_yards, yards_after_catch, play_type, first_down) |>
  mutate(across(where(is.character), as.factor)) |>
  mutate(across(where(is.factor), ~ factor(as.character(.), levels = unique(as.character(.))))) |>
  na.omit()

# Split the data into training and testing sets
train_index <- sample(1:nrow(svm_data), 0.5 * nrow(svm_data))
train_data <- svm_data[train_index, ]
test_data <- svm_data[-train_index, ]

# Train the SVM model
svm_model <- svm(first_down ~ ., data = train_data, kernel = "linear", cost = 1)

# Make predictions
predictions <- predict(svm_model, test_data)

# Evaluate the model
confusion_matrix <- table(predictions, test_data$first_down)
#print(confusion_matrix)
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
print(paste("Accuracy:", accuracy))
[1] "Accuracy: 0.000160926939169617"

The code builds a gradient boosting model using xgboost to predict the binary outcome is_pass (whether a play is a pass) based on features such as game_seconds_remaining, down, ydstogo, and wp. After training the model, SHAP (Shapley Additive Explanations) values are used to interpret the model’s predictions. The SHAP analysis reveals that down, ydstogo, wp (win probability), and game_seconds_remaining (time) are critical factors influencing the model’s decisions. These features appear as the most important in the visualizations, suggesting they play a central role in determining the likelihood of a pass play. The sv_waterfall function shows individual explanations for specific plays (row 25 and row 1111), and the sv_importance plot highlights the overall importance of each feature, reinforcing that these game-specific elements are significant in predicting pass plays.

library(dplyr)
library(xgboost)

Attaching package: 'xgboost'
The following object is masked from 'package:dplyr':

    slice
library(shapviz)


set.seed(3)
pbp_subset <- model_data |>
  dplyr::select(game_seconds_remaining, down, ydstogo, wp)



boost_data <- xgb.DMatrix(data.matrix(pbp_subset),
                          label = model_data$is_pass) # label = the target to predict
boost_model <- xgb.train(data = boost_data,
                         nrounds = 65, # number of boosting rounds (B)
                         params = list(learning_rate = 0.1, # shrinkage (lambda)
                                       objective = "reg:squarederror"))


shap_values <- shapviz(boost_model,
                       X_pred = data.matrix(pbp_subset),
                       X = pbp_subset)


sv_waterfall(shap_values, row_id = 25)

sv_waterfall(shap_values, row_id = 1111)

sv_importance(shap_values, kind = "beeswarm")
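
Since is_pass is binary, one variation worth noting (a sketch, not what was run above) is swapping in the logistic objective so the boosted model outputs probabilities of a pass:

# Same design matrix, but with a classification objective; predictions
# are probabilities that the play is a pass rather than squared-error fits
boost_logit <- xgb.train(data = boost_data,
                         nrounds = 65,
                         params = list(learning_rate = 0.1,
                                       objective = "binary:logistic"))

shap_logit <- shapviz(boost_logit,
                      X_pred = data.matrix(pbp_subset),
                      X = pbp_subset)
sv_importance(shap_logit, kind = "beeswarm")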


This code creates and visualizes multiple decision trees using the rpart package to predict the is_pass variable based on different features of the dataset. The trees include variables such as half_seconds_remaining, ydstogo, game_seconds_remaining, and additional features like home and away rush EPA. The models vary in complexity, with some having no pruning (cp = 0) and others controlling tree depth. The final tree incorporates multiple features for improved prediction accuracy. I focus on the first tree because the others appeared to be pruned too aggressively, and it looked the best.

library(rpart)
library(rpart.plot)

set.seed(3)

# Basic decision tree model
tree1 <- rpart(factor(is_pass) ~ ., data = model_data)
rpart.plot(tree1)

# Decision tree with 'half_seconds_remaining'
tree2 <- rpart(is_pass ~ factor(half_seconds_remaining), data = model_data)
rpart.plot(tree2)

# Decision tree with 'ydstogo' and no pruning (more complex)
tree3 <- rpart(is_pass ~ factor(ydstogo), data = model_data, cp = 0)
rpart.plot(tree3)

# Decision tree with 'ydstogo' and 'game_seconds_remaining'
tree4 <- rpart(is_pass ~ factor(ydstogo) + game_seconds_remaining, data = model_data, cp = 0)
rpart.plot(tree4)
Warning: labs do not fit even at cex 0.15, there may be some overplotting

# Prediction for a specific case (example values)
predict(tree4, data.frame(ydstogo = 10, game_seconds_remaining = 100))
1 
1 
# Final tree with additional features
tree_final <- rpart(is_pass ~ factor(ydstogo) + game_seconds_remaining + total_home_rush_epa + total_away_rush_epa,
                    data = model_data, 
                    control = rpart.control(minsplit = 5, cp = 0.01))
rpart.plot(tree_final)

This code performs K-means clustering on a subset of numeric columns from the model_data dataset, specifically factors like ydstogo, game_seconds_remaining, and various EPA metrics. An elbow plot is generated to evaluate the optimal number of clusters: the within-cluster sum of squares drops steeply up to about k = 3 and only gradually afterward, suggesting that 3 may be a reasonable number of clusters for this data.

library(dplyr)
library(ggplot2)

pbp_numeric <- model_data |>
  select(ydstogo, game_seconds_remaining, total_home_rush_epa, total_away_rush_epa, total_home_pass_epa, is_pass)

# Set the number of clusters (k)
k <- 5
k_means_pbp <- kmeans(pbp_numeric |> select(-is_pass), centers = k)

# Create a scatter plot colored by cluster
model_data |>
  mutate(cluster = k_means_pbp$cluster)|>
  ggplot() +
  geom_point(aes(x = total_home_rush_epa, y = game_seconds_remaining,
                 color = factor(cluster)), size = 0.1) +
  labs(title = "K-means Clustering of NFL Plays",
       x = "Total Home Rush EPA",
       y = "Game Seconds Remaining",
       color = "Cluster")

# Calculate within-cluster sum of squares for different values of k
ans <- numeric(20)  # vector to store the total within-cluster sum of squares
for(k in 1:20) {
  k_means_pbp <- kmeans(pbp_numeric |> select(-is_pass), centers = k)  # drop the label, as above
  ans[k] <- sum(k_means_pbp$withinss)
}

# Plot the elbow plot to determine the optimal number of clusters
plot(1:20, ans, type = "b", main = "Elbow Plot for K-means Clustering",
     xlab = "Number of Clusters (k)", ylab = "Within-cluster Sum of Squares")

This code generates a visualization that compares the average offensive passing EPA (Expected Points Added) per play at home with the mean in-play win probability (wp) for each NFL team during the 2023 regular season. The scatter plot shows how teams with higher passing EPA at home relate to their win probability. The reversed y-axis emphasizes teams with higher win probabilities, highlighting any trends between offensive performance and team success.

pbp <- nflfastR::load_pbp(2023) |>
  dplyr::filter(season_type == "REG") |>
  dplyr::filter(!is.na(posteam) & (rush == 1 | pass == 1))

offense <- pbp |>
  dplyr::group_by(team = posteam) |>
  dplyr::summarise(total_home_pass_epa = mean(total_home_pass_epa, na.rm = TRUE))

# Note: grouping by defteam averages the win probability of the offenses
# each team faced; grouping by posteam may have been intended here
defense <- pbp |>
  dplyr::group_by(team = defteam) |>
  dplyr::summarise(wp = mean(wp, na.rm = TRUE))

combined <- offense |>
  dplyr::inner_join(defense, by = "team")


ggplot2::ggplot(combined, aes(x = total_home_pass_epa, y = wp)) +
  #ggplot2::geom_abline(slope = -1.5, intercept = seq(0.4, -0.3, -0.1), alpha = .2) +
  nflplotR::geom_mean_lines(aes(x0 = total_home_pass_epa , y0 = wp)) +
  nflplotR::geom_nfl_logos(aes(team_abbr = team), width = 0.065, alpha = 0.7) +
  ggplot2::labs(
    x = "Average Offensive Passing EPA per Play at Home",
    y = "Win Percentage",
    title = "2023 NFL Home Passing EPA vs Win Percentage"
  ) +
  ggplot2::theme_minimal() +
  ggplot2::theme(
    plot.title = ggplot2::element_text(face = "bold"),
    plot.title.position = "plot",
    plot.background = ggplot2::element_rect(fill = "#F0F0F0")
  ) +
  ggplot2::scale_y_reverse()

This code generates a visualization that compares the average offensive rushing EPA (Expected Points Added) per play at home with the mean in-play win probability (wp) for each NFL team during the 2023 regular season. The scatter plot shows how teams with higher rushing EPA at home relate to their win probability. The reversed y-axis highlights teams with higher win probabilities, providing insight into whether a team’s rushing performance contributes to its success.

pbp <- nflfastR::load_pbp(2023) |>
  dplyr::filter(season_type == "REG") |>
  dplyr::filter(!is.na(posteam) & (rush == 1 | pass == 1))

offense_rush <- pbp |>
  dplyr::group_by(team = posteam) |>
  dplyr::summarise(total_home_rush_epa = mean(total_home_rush_epa, na.rm = TRUE))

defense_rush <- pbp |>
  dplyr::group_by(team = posteam) |>
  dplyr::summarise(wp = mean(wp, na.rm = TRUE))

combined_rush <- offense_rush |>
  dplyr::inner_join(defense_rush, by = "team")


ggplot2::ggplot(combined_rush, aes(x = total_home_rush_epa, y = wp)) +
 # ggplot2::geom_abline(slope = -1.5, intercept = seq(0.4, -0.3, -0.1), alpha = .2) +
  nflplotR::geom_mean_lines(aes(x0 = total_home_rush_epa , y0 = wp)) +
  nflplotR::geom_nfl_logos(aes(team_abbr = team), width = 0.065, alpha = 0.7) +
  ggplot2::labs(
    x = "Average Offensive Rushing EPA per Play at Home",
    y = "Win Percentage",
    title = "2023 NFL Home Rushing vs WP per Play"
  ) +
  ggplot2::theme_minimal() +
  ggplot2::theme(
    plot.title = ggplot2::element_text(face = "bold"),
    plot.title.position = "plot",
    plot.background = ggplot2::element_rect(fill = "#F0F0F0")
  ) +
  ggplot2::scale_y_reverse()

This plot visualizes the relationship between a team’s average offensive passing EPA per play at home and its average offensive rushing EPA per play at home during the 2023 NFL regular season. Each team is represented by its logo, and the plot compares the efficiency of passing and rushing for each team. The reversed y-axis emphasizes the performance of teams with higher rushing EPA. This graph provides insight into how teams balance their offensive strategies in both the passing and rushing game.

pbp <- nflfastR::load_pbp(2023) |>
  dplyr::filter(season_type == "REG") |>
  dplyr::filter(!is.na(posteam) & (rush == 1 | pass == 1))

offense_pas <- pbp |>
  dplyr::group_by(team = posteam) |>
  dplyr::summarise(total_home_pass_epa = mean(total_home_pass_epa, na.rm = TRUE))

defense_rush <- pbp |>
  dplyr::group_by(team = defteam) |>
  dplyr::summarise(total_home_rush_epa = mean(total_home_rush_epa, na.rm = TRUE))

combined <- offense_pas |>
  dplyr::inner_join(defense_rush, by = "team")


ggplot2::ggplot(combined, aes(x = total_home_pass_epa, y = total_home_rush_epa)) +
  #ggplot2::geom_abline(slope = -1.5, intercept = seq(0.4, -0.3, -0.1), alpha = .2) +
  nflplotR::geom_mean_lines(aes(x0 = total_home_pass_epa , y0 = total_home_rush_epa)) +
  nflplotR::geom_nfl_logos(aes(team_abbr = team), width = 0.065, alpha = 0.7) +
  ggplot2::labs(
    x = "Average Offensive Passing EPA per Play at Home",
    y = "Average Offensive Rushing EPA per play at home",
    title = "2023 NFL Home Passing EPA vs  Rushing EPA"
  ) +
  ggplot2::theme_minimal() +
  ggplot2::theme(
    plot.title = ggplot2::element_text(face = "bold"),
    plot.title.position = "plot",
    plot.background = ggplot2::element_rect(fill = "#F0F0F0")
  ) +
  ggplot2::scale_y_reverse()

In this analysis, I began by loading and preprocessing NFL play-by-play (pbp) data from the 2023 season using the nflfastR package, focusing on plays classified as either ‘pass’ or ‘run’. I then created a new binary variable, is_pass, to represent whether a play was a pass. After cleaning the data, I selected relevant features such as down, yards_gained, score_differential, qtr, and game_seconds_remaining, and removed missing values.

First, I fitted a logistic regression model (fit_logistic) to predict the likelihood of a pass based on these variables. All of its predictors, including down, yards_gained, score_differential, qtr, and game_seconds_remaining, showed highly significant effects (p < 0.001).

I then explored linear regression models, first with basic continuous predictors (nfl_lm) and subsequently incorporating splines for continuous variables (nfl_lm1) to allow for non-linear relationships. The linear models showed a reasonable fit, with down, yards_gained, and qtr being particularly influential, though score_differential had less of an impact.

To improve model flexibility, I applied splines using the bs() function to the continuous variables in the logistic regression model (fit_spline). The splines captured more complex relationships, particularly for yards_gained, score_differential, and game_seconds_remaining, and the spline model’s AIC (24116) was lower than that of the simpler models, indicating a better fit. I also compared the adjusted R-squared values and residual standard errors of the linear models (a glm object such as fit_spline has no adjusted R-squared, which is why that summary returns NULL); the spline logistic model’s lower AIC was the clearest evidence of better fit. Finally, I visualized the fitted GAM using plot(fit_gam, pages = 1, se = TRUE) to assess the estimated smooth effects of the continuous variables, providing a detailed view of how each predictor influenced the likelihood of a pass.

Generalized Additive Models (GAMs) are an extension of generalized linear models that allow for nonlinear relationships between predictors and the response variable. In R, the mgcv package is commonly used to fit GAMs. The gam function in this package enables the specification of smooth functions for predictors, facilitating the modeling of complex, nonlinear patterns. For example, to fit a GAM with a smooth term for a predictor x, you can use the formula y ~ s(x), where s(x) denotes a smooth function of x. This approach provides a flexible framework for modeling nonlinear relationships in data. https://www.geeksforgeeks.org/generalized-additive-models-using-r/
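
A minimal, self-contained sketch of that formula interface (on simulated data, not the NFL dataset) might look like:

library(mgcv)

# Simulate a smooth nonlinear relationship between x and y
set.seed(42)
x <- runif(200)
y <- sin(2 * pi * x) + rnorm(200, sd = 0.3)

# s(x) asks gam() to estimate a smooth function of x
fit_demo <- gam(y ~ s(x))
plot(fit_demo, se = TRUE)  # estimated smooth with confidence bands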

In my code, I implemented spline regression in R to model nonlinear relationships between variables. By dividing the range of a predictor into intervals at specific points, known as knots, spline regression applies piecewise polynomial functions to each segment. This approach allows a flexible fit that adapts to changes in the data’s pattern. Using the splines package in R, I employed functions like bs() to create B-splines, which are linear combinations of basis functions that ensure smoothness at the knots. This method effectively captured the underlying structure of the data, providing a more accurate representation than simple linear regression. https://www.geeksforgeeks.org/understanding-spline-regression-in-r/
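
As a small illustration (again on simulated data), bs() can be dropped into an ordinary lm() call:

library(splines)

# Simulate data with a wavy trend
set.seed(42)
x <- runif(200, 0, 10)
y <- cos(x) + rnorm(200, sd = 0.2)

# df = 5 lets bs() place interior knots automatically
spline_demo <- lm(y ~ bs(x, df = 5))

# Plot the fitted spline curve over the raw points
grid <- data.frame(x = seq(0, 10, length.out = 200))
plot(x, y, pch = 16, cex = 0.5)
lines(grid$x, predict(spline_demo, newdata = grid), col = "blue", lwd = 2)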

library(nflfastR)
library(dplyr)
library(ggplot2)
library(splines)
library(pROC)
Type 'citation("pROC")' for a citation.

Attaching package: 'pROC'
The following objects are masked from 'package:stats':

    cov, smooth, var
library(mgcv)
Loading required package: nlme

Attaching package: 'nlme'
The following object is masked from 'package:dplyr':

    collapse
This is mgcv 1.9-1. For overview type 'help("mgcv-package")'.
# Load and preprocess data
filtered_data <- load_pbp(2023) |>
  filter(!is.na(epa), play_type %in% c('pass', 'run')) |>
  mutate(is_pass = ifelse(play_type == 'pass', 1, 0))

model_data <- filtered_data |>
  select(is_pass, game_seconds_remaining, down, ydstogo, yardline_100, 
         score_differential, posteam_timeouts_remaining, defteam_timeouts_remaining,
         quarter_seconds_remaining, half_seconds_remaining, drive, qtr, 
         ydsnet, total_home_rush_epa, total_away_rush_epa, total_home_pass_epa, 
         wp, temp, yards_gained) |>  # duplicate total_away_rush_epa dropped
  na.omit()

# Fit logistic regression model
fit_logistic <- glm(is_pass ~ down + yards_gained + score_differential + 
                    qtr + game_seconds_remaining, data = model_data, family = binomial)
summary(fit_logistic)

Call:
glm(formula = is_pass ~ down + yards_gained + score_differential + 
    qtr + game_seconds_remaining, family = binomial, data = model_data)

Coefficients:
                         Estimate Std. Error z value Pr(>|z|)    
(Intercept)             1.054e+00  2.244e-01   4.697 2.64e-06 ***
down                    5.141e-01  1.888e-02  27.225  < 2e-16 ***
yards_gained            2.536e-02  1.917e-03  13.230  < 2e-16 ***
score_differential     -3.156e-02  1.518e-03 -20.786  < 2e-16 ***
qtr                    -4.018e-01  5.022e-02  -8.002 1.22e-15 ***
game_seconds_remaining -4.499e-04  5.425e-05  -8.293  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 26511  on 19495  degrees of freedom
Residual deviance: 25032  on 19490  degrees of freedom
AIC: 25044

Number of Fisher Scoring iterations: 4
# Fit linear regression models with splines for continuous variables
nfl_lm <- lm(is_pass ~ down + yards_gained + score_differential + game_seconds_remaining + 
             qtr + wp + ydstogo, data = model_data)
summary(nfl_lm)

Call:
lm(formula = is_pass ~ down + yards_gained + score_differential + 
    game_seconds_remaining + qtr + wp + ydstogo, data = model_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.3883 -0.4704  0.1911  0.4184  0.8776 

Coefficients:
                         Estimate Std. Error t value Pr(>|t|)    
(Intercept)             0.6481578  0.0546784  11.854  < 2e-16 ***
down                    0.1486875  0.0042669  34.847  < 2e-16 ***
yards_gained            0.0048301  0.0003886  12.428  < 2e-16 ***
score_differential      0.0013833  0.0008686   1.593    0.111    
game_seconds_remaining -0.0000977  0.0000120  -8.139 4.22e-16 ***
qtr                    -0.0925139  0.0111171  -8.322  < 2e-16 ***
wp                     -0.3174314  0.0306685 -10.350  < 2e-16 ***
ydstogo                 0.0241519  0.0009223  26.185  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4648 on 19488 degrees of freedom
Multiple R-squared:  0.1127,    Adjusted R-squared:  0.1124 
F-statistic: 353.5 on 7 and 19488 DF,  p-value: < 2.2e-16
nfl_lm1 <- lm(is_pass ~ bs(down, df = 3) + bs(yards_gained, df = 4) + 
              bs(score_differential, df = 4) + qtr + bs(game_seconds_remaining, df = 3), 
              data = model_data)
summary(nfl_lm1)

Call:
lm(formula = is_pass ~ bs(down, df = 3) + bs(yards_gained, df = 4) + 
    bs(score_differential, df = 4) + qtr + bs(game_seconds_remaining, 
    df = 3), data = model_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.6136 -0.4783  0.2028  0.4269  1.4492 

Coefficients:
                                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)                          1.98516    0.17680  11.228  < 2e-16 ***
bs(down, df = 3)1                   -0.09475    0.02376  -3.987 6.71e-05 ***
bs(down, df = 3)2                    0.57856    0.03017  19.179  < 2e-16 ***
bs(down, df = 3)3                    0.09488    0.02292   4.139 3.50e-05 ***
bs(yards_gained, df = 4)1           -1.01157    0.13954  -7.249 4.35e-13 ***
bs(yards_gained, df = 4)2           -1.86999    0.11590 -16.135  < 2e-16 ***
bs(yards_gained, df = 4)3            0.65621    0.18378   3.571 0.000357 ***
bs(yards_gained, df = 4)4           -2.17640    0.18189 -11.965  < 2e-16 ***
bs(score_differential, df = 4)1      0.36477    0.14661   2.488 0.012855 *  
bs(score_differential, df = 4)2      0.02156    0.09753   0.221 0.825011    
bs(score_differential, df = 4)3     -0.13780    0.14060  -0.980 0.327061    
bs(score_differential, df = 4)4     -0.59229    0.16014  -3.698 0.000217 ***
qtr                                 -0.08768    0.01202  -7.297 3.06e-13 ***
bs(game_seconds_remaining, df = 3)1 -0.07559    0.03780  -2.000 0.045553 *  
bs(game_seconds_remaining, df = 3)2 -0.22579    0.04555  -4.957 7.24e-07 ***
bs(game_seconds_remaining, df = 3)3 -0.35398    0.04362  -8.114 5.18e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4667 on 19480 degrees of freedom
Multiple R-squared:  0.1059,    Adjusted R-squared:  0.1052 
F-statistic: 153.9 on 15 and 19480 DF,  p-value: < 2.2e-16
fit_spline <- glm(is_pass ~ bs(down, df = 3) + bs(yards_gained, df = 4) + 
                  bs(score_differential, df = 4) + qtr + bs(game_seconds_remaining, df = 3), 
                  data = model_data, family = binomial)
summary(fit_spline)

Call:
glm(formula = is_pass ~ bs(down, df = 3) + bs(yards_gained, df = 4) + 
    bs(score_differential, df = 4) + qtr + bs(game_seconds_remaining, 
    df = 3), family = binomial, data = model_data)

Coefficients:
                                     Estimate Std. Error z value Pr(>|z|)    
(Intercept)                          18.95928    1.93379   9.804  < 2e-16 ***
bs(down, df = 3)1                    -0.56141    0.11080  -5.067 4.04e-07 ***
bs(down, df = 3)2                     2.78762    0.14879  18.736  < 2e-16 ***
bs(down, df = 3)3                     0.41778    0.10528   3.968 7.24e-05 ***
bs(yards_gained, df = 4)1           -16.85707    1.91442  -8.805  < 2e-16 ***
bs(yards_gained, df = 4)2           -21.23259    1.68932 -12.569  < 2e-16 ***
bs(yards_gained, df = 4)3            -7.07773    2.20574  -3.209  0.00133 ** 
bs(yards_gained, df = 4)4           -21.92803    1.87671 -11.684  < 2e-16 ***
bs(score_differential, df = 4)1       1.86311    0.72375   2.574  0.01005 *  
bs(score_differential, df = 4)2      -0.04741    0.47692  -0.099  0.92081    
bs(score_differential, df = 4)3      -0.31649    0.70397  -0.450  0.65301    
bs(score_differential, df = 4)4      -3.70267    0.90577  -4.088 4.35e-05 ***
qtr                                  -0.40189    0.05543  -7.251 4.14e-13 ***
bs(game_seconds_remaining, df = 3)1  -0.35026    0.17559  -1.995  0.04607 *  
bs(game_seconds_remaining, df = 3)2  -1.05712    0.20946  -5.047 4.49e-07 ***
bs(game_seconds_remaining, df = 3)3  -1.60401    0.20114  -7.974 1.53e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 26511  on 19495  degrees of freedom
Residual deviance: 24084  on 19480  degrees of freedom
AIC: 24116

Number of Fisher Scoring iterations: 6
# Compare model performance using AIC and adjusted R-squared
AIC(nfl_lm, nfl_lm1, fit_spline)
           df      AIC
nfl_lm      9 25466.24
nfl_lm1    17 25629.87
fit_spline 16 24115.91
summary(nfl_lm)$adj.r.squared
[1] 0.112363
summary(nfl_lm1)$adj.r.squared
[1] 0.1052491
summary(fit_spline)$adj.r.squared  # NULL: glm objects have no adjusted R-squared
NULL
# Predict probabilities
model_data <- model_data |>
  mutate(predicted_prob = predict(fit_spline, type = "response"))

# ROC curve and AUC
roc_curve <- roc(model_data$is_pass, model_data$predicted_prob)
Setting levels: control = 0, case = 1
Setting direction: controls < cases
plot(roc_curve, col = "blue")

auc(roc_curve)
Area under the curve: 0.702
# Predict and plot the effect of yards_gained
yard_grid <- seq(min(model_data$yards_gained), max(model_data$yards_gained), length.out = 100)
yard_effect <- data.frame(yards_gained = yard_grid,
                          predicted = predict(fit_spline, 
                                              newdata = data.frame(down = 1, 
                                                                   yards_gained = yard_grid, 
                                                                   score_differential = 0, 
                                                                   qtr = 1, 
                                                                   game_seconds_remaining = 900), 
                                              type = "response"))

ggplot(yard_effect, aes(x = yards_gained, y = predicted)) +
  geom_line(color = "blue") +
  labs(title = "Effect of Yards Gained on Probability of Pass", 
       x = "Yards Gained", 
       y = "Predicted Probability of Pass")

# Fit a Generalized Additive Model (GAM)
model_data <- model_data|>
  mutate(down = as.factor(down))

fit_gam <- gam(is_pass ~ 
                 yardline_100 + 
                wp +
                 s(ydstogo, k = 5) + 
                s(score_differential, k =5) + 
                game_seconds_remaining +
               factor(down) +
               factor(qtr),
               data = model_data, 
               family = binomial)


summary(fit_gam)

Family: binomial 
Link function: logit 

Formula:
is_pass ~ yardline_100 + wp + s(ydstogo, k = 5) + s(score_differential, 
    k = 5) + game_seconds_remaining + factor(down) + factor(qtr)

Parametric coefficients:
                         Estimate Std. Error z value Pr(>|z|)    
(Intercept)             2.864e+00  2.354e-01  12.166  < 2e-16 ***
yardline_100           -1.755e-03  7.667e-04  -2.289   0.0221 *  
wp                     -3.268e+00  2.336e-01 -13.987  < 2e-16 ***
game_seconds_remaining -4.849e-04  5.863e-05  -8.270  < 2e-16 ***
factor(down)2           7.738e-01  3.986e-02  19.416  < 2e-16 ***
factor(down)3           1.957e+00  5.592e-02  34.994  < 2e-16 ***
factor(down)4           1.515e+00  1.232e-01  12.297  < 2e-16 ***
factor(qtr)2           -1.654e-01  7.130e-02  -2.320   0.0204 *  
factor(qtr)3           -9.289e-01  1.151e-01  -8.070 7.04e-16 ***
factor(qtr)4           -1.194e+00  1.652e-01  -7.226 4.98e-13 ***
factor(qtr)5           -1.423e+00  2.656e-01  -5.357 8.44e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
                        edf Ref.df Chi.sq p-value    
s(ydstogo)            3.980  4.000  886.9  <2e-16 ***
s(score_differential) 3.932  3.997  117.8  <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.143   Deviance explained = 11.2%
UBRE = 0.20972  Scale est. = 1         n = 19496
plot(fit_gam, pages = 1, se = TRUE)

# Check the number of unique values for each term
sapply(pbp_clean, function(x) length(unique(x)))
             play_type           yardline_100                   down 
                     2                     99                      4 
               ydstogo game_seconds_remaining 
                    36                   3589 
# Plot residual diagnostics
gam.check(fit_gam)


Method: UBRE   Optimizer: outer newton
full convergence after 7 iterations.
Gradient range [6.144679e-09,1.678914e-06]
(score 0.2097242 & scale 1).
Hessian positive definite, eigenvalue range [1.922126e-06,9.466006e-06].
Model rank =  19 / 19 

Basis dimension (k) checking results. Low p-value (k-index<1) may
indicate that k is too low, especially if edf is close to k'.

                        k'  edf k-index p-value  
s(ydstogo)            4.00 3.98    0.99   0.285  
s(score_differential) 4.00 3.93    0.97   0.015 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1