For this project, I initially set out to work with NFL data, as there is an abundance of it readily available. My first approach involved pulling datasets from Kaggle, which included information on teams, stadiums, weather conditions, and betting odds. However, this proved to be quite challenging due to the complexity of integrating and making sense of such diverse data sources.
After discussing my struggles with Alex, they introduced me to nflfastr, a comprehensive dataset specifically designed for NFL analytics. This was a game-changer, and I decided to pivot to using nflfastr for my project. With its wealth of information and structure tailored for football analysis, I was able to shift focus to something more manageable and meaningful.
The aspect of NFL gameplay that resonated most with me was predicting whether a play would be a pass or a rush. This is a fundamental distinction in football strategy and provides a great opportunity for predictive modeling. I explored various methods and techniques to approach this classification problem, experimenting with different models and frameworks to achieve reliable predictions. Each step of the process deepened my understanding of the data and the dynamics of NFL gameplay.
This shift not only simplified the technical challenges but also aligned the project with a clear and focused objective, making it both more engaging and insightful.
library(nflfastR)library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(gsisdecoder)library(randomForest)
randomForest 4.7-1.2
Type rfNews() to see new features/changes/bug fixes.
Attaching package: 'randomForest'
The following object is masked from 'package:dplyr':
combine
The following object is masked from 'package:ggplot2':
margin
library(janitor)
Attaching package: 'janitor'
The following objects are masked from 'package:stats':
chisq.test, fisher.test
library(caret)
Loading required package: lattice
Attaching package: 'caret'
The following object is masked from 'package:purrr':
lift
library(ipred)library(tidytext)
This code processes NFL play-by-play (PBP) data from the 2023 season. It starts by loading the data and slicing the first 7,000 rows for analysis (becuase it was taking way to long to load) The dataset is then filtered to exclude missing EPA values and retain only “pass” and “run” play types. A new binary variable, is_pass, is created to convert the play_type into a numeric format (1 for pass, 0 for run). The code then selects relevant features, such as game context (e.g., score_differential, posteam_timeouts_remaining) and play-specific metrics (e.g., game_seconds_remaining, ydstogo, rushing and passing EPA for both teams), which are likely important for predicting whether a play is a pass or a run. Finally, any rows with missing data are removed using na.omit, preparing the dataset for modeling. This cleaned data, model_data, is now ready for use in a predictive model.
pbp =load_pbp(2023)|>slice(1:7000) #table(pbp$play_type) filtered_data <- pbp |>filter(!is.na(epa), play_type %in%c('pass', 'run')) |>mutate(is_pass =ifelse(play_type =='pass', 1, 0))# needed to turn this into a value rather than a wordmodel_data = filtered_data |>select(is_pass, game_seconds_remaining, down, ydstogo, yardline_100, score_differential, posteam_timeouts_remaining,defteam_timeouts_remaining, quarter_seconds_remaining, half_seconds_remaining, drive, qtr, ydsnet, total_home_rush_epa, total_away_rush_epa, total_home_pass_epa,total_away_rush_epa, wp, desc) |>na.omit()
This code splits the model_data into training and test sets, using 50% of the data for training and the rest for testing. It ensures that the is_pass variable is treated as a factor for classification. A random forest model is then trained to predict is_pass using the training data. The model has 500 trees and tries 4 variables at each split. The out-of-bag (OOB) error rate is 33.86%. The confusion matrix shows that the model correctly classifies 518 “run” and 1129 “pass” plays, with a class error rate of 0.495 for “run” and 0.229 for “pass.”
set.seed(1)train_index =createDataPartition(model_data$is_pass, p =0.5, list =FALSE)train_data = model_data[train_index, ]|>slice(1:5000)test_data = model_data[-train_index, ]|>slice(1:5000)train_data$is_pass =as.factor(train_data$is_pass)test_data$is_pass =as.factor(test_data$is_pass)rf_model =randomForest(is_pass ~ ., data = train_data)print(rf_model)
Call:
randomForest(formula = is_pass ~ ., data = train_data)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 4
OOB estimate of error rate: 34.5%
Confusion matrix:
0 1 class.error
0 525 501 0.4883041
1 358 1106 0.2445355
The random forest model achieved an accuracy of 66.21%, with a Kappa of 0.2884, indicating moderate agreement between predicted and actual values. Sensitivity was 50.24%, and specificity was 77.85%. Key variables influencing predictions included wp, yardline_100, and ydstogo, with their importance measured by MeanDecreaseGini.
predictions =predict(rf_model, test_data, type ='response')confusion_matrix =confusionMatrix(as.factor(predictions), as.factor(test_data$is_pass))print(confusion_matrix)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 534 311
1 515 1129
Accuracy : 0.6681
95% CI : (0.6493, 0.6866)
No Information Rate : 0.5785
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.301
Mcnemar's Test P-Value : 1.626e-12
Sensitivity : 0.5091
Specificity : 0.7840
Pos Pred Value : 0.6320
Neg Pred Value : 0.6867
Prevalence : 0.4215
Detection Rate : 0.2145
Detection Prevalence : 0.3395
Balanced Accuracy : 0.6465
'Positive' Class : 0
This code loads play-by-play (PBP) data for the 2023 NFL season, focusing on “run” and “pass” plays. It preprocesses the data by cleaning column names and filtering for relevant variables, followed by removing missing values. A Random Forest model is fitted to predict play type based on selected features. The model summary is printed, and variable importance is visualized using a plot.
Call:
randomForest(formula = play_type ~ ., data = pbp_clean, importance = TRUE)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 2
OOB estimate of error rate: 34.59%
Confusion matrix:
pass run class.error
pass 13991 6653 0.3222728
run 5617 9213 0.3787593
varImpPlot(rf_pbp)
This code processes NFL play descriptions from the model_data dataset to analyze the frequency and relationship of words used in pass and non-pass plays. It first filters out missing descriptions and selects the relevant columns. Then, it creates a unique identifier for each play description. The desc column is tokenized into individual words, and stop words are removed to focus on meaningful terms. The code counts the occurrences of each word and groups them by play type (pass or non-pass). It then constructs a document-feature matrix (DFM) and performs hierarchical clustering on the scaled word frequencies to visualize patterns.
nfl_words <- model_data |>filter(!is.na(desc)) |>select(desc, is_pass)# Create a unique identifier for 'desc'nfl_words <- nfl_words |>mutate(desc_id =row_number())# Tokenize 'desc' into individual words and filter out stop wordstokenized_desc <- nfl_words |>unnest_tokens(input = desc, output = Word, token ="words") |>filter(!(Word %in% stop_words$word))|>slice(1:1000)# Count the occurrences of each wordword_counts <- tokenized_desc |>count(Word) |>arrange(desc(n))# Count the occurrences of each word by epaword_counts_by_play <- tokenized_desc |>count(is_pass, Word) |>arrange(desc(n))|>slice(1:1000)# Creating the document-feature matrix (DFM)dfm1 <- tokenized_desc |>pivot_wider(id_cols =c("desc_id", "is_pass"),names_from ="Word",values_from = Word,values_fn = length,values_fill =0)dfm1 |>select(-desc_id, -is_pass) |>scale() |>dist() |>hclust() |>plot()
This code processes NFL play descriptions to identify patterns and reduce dimensionality using word frequency analysis, hierarchical clustering, and Principal Component Analysis (PCA). First, it calculates the total word count for each word in the document-feature matrix (dfm1), retaining only words that appear more than 25 times (as specified by X). The filtered matrix (dfm2) includes only these frequent words. The code then performs hierarchical clustering on the filtered data, calculating pairwise distances and visualizing the results through a dendrogram. This helps reveal similarities in play descriptions based on word usage. Afterward, PCA is applied to reduce the dataset’s dimensionality, identifying the most significant components. The first and fifth principal components (PC1 and PC5) are explored by examining their loadings and scores, providing insight into which words contribute most to each component. This allows for a deeper understanding of the patterns in play descriptions based on word frequencies.
# Load necessary librarieslibrary(dplyr)library(tidyr)# Calculate word counts and filter words used more than X timesis_pass_count <- dfm1 |>select_if(is.numeric) |>colSums()X <-25# Only return elements which have a value greater than Xgood_pass <-names(is_pass_count[which(is_pass_count > X)])dfm2 <- dfm1 |>select(all_of(good_pass))# Quick check using hierarchical clusteringdfm2 |>dist() |>hclust() |>plot()
# Perform PCApca_plays <- dfm2 |>prcomp(center =TRUE, scale. =TRUE)# Look at our first PCpca_plays$rotation |>data.frame()|>select(PC1) |>arrange(PC1)
PC1
yards -0.23698446
ari -0.18675732
left -0.03574220
desc_id 0.09551176
shotgun 0.11407416
14 0.33406936
s.howell 0.37525703
short 0.41712120
pass 0.47652062
is_pass 0.48173123
This code analyzes the probability of a specific row (row 1) appearing in a bootstrap sample of NFL play descriptions. It first creates a unique identifier for each row and sets up the number of bootstrap samples based on the number of rows in the dataset. Then, it generates a bootstrap sample and runs a simulation 1,000 times to estimate the probability that row 1 appears in the sample. The code also calculates the exact probability using a known formula. The estimated probability (approximately 0.639) represents the likelihood that row 1 will appear at least once in a bootstrap sample, while the exact probability (0.632) is derived from the known formula 1−(1/e)1 - (1/e)1−(1/e), where eee is Euler’s number. This comparison shows how well the simulation approximates the exact probability.
nfl_words1 <- model_data|>filter(!is.na(desc))|>select(desc, is_pass)# Create a unique identifier for each rownfl_words1 <- nfl_words1 |>mutate(row_id =row_number())# Set up the number of rows (bootstrap samples)n_boot <-nrow(nfl_words1)# Generate bootstrap samplebootstrap_sample <-sample(1:n_boot, size = n_boot, replace =TRUE)# Probability that a given row appears at all in the bootstrap samplesimulations <-replicate(1000, { bootstrap_sample <-sample(1:n_boot, size = n_boot, replace =TRUE)sum(bootstrap_sample ==1) >0})# Proportion of times that row 1 shows up in the bootstrap sampleprob_appears <-mean(simulations)# Calculate using the known probabilityprob_appears_exact <-1- (1/exp(1))# Print probabilitiesprint(paste("Estimated Probability:", prob_appears))
The graph generated by the code is not informative, as it uses the is_pass column (binary pass/run indicator) for the x-axis, which does not provide meaningful insights when visualizing clusters. The plot is a poor representation of the clustering results, and a better approach would involve using more relevant numeric features, like expected points added (EPA) or other performance metrics, for clearer visualization of cluster separation.
# Filter non-missing 'desc' and select relevant columnsnfl_words2 <- model_data |>filter(!is.na(desc)) |>select(desc, is_pass)# Create a unique identifier for 'desc'nfl_words2 <- nfl_words2 |>mutate(desc_id =row_number())# Tokenize 'desc' into individual words and filter out stop wordstokenized_desc1 <- nfl_words2 |>unnest_tokens(input = desc, output = Word, token ="words") |>filter(!(Word %in% stop_words$word))# Creating the document-feature matrix (DFM)dfm3 <- tokenized_desc1 |>pivot_wider(id_cols =c("desc_id", "is_pass"),names_from ="Word",values_from = Word,values_fn = length,values_fill =0)dfm3_subset <- dfm3 |>slice(1:700)tf_matrix <- dfm3_subset|>select(-desc_id, -is_pass)# Set number of clustersk <-25# Perform k-means clusteringk_means_nfl <-kmeans(tf_matrix, centers = k)# Add cluster assignment to the original datanfl_clusters <- dfm3_subset|>mutate(cluster = k_means_nfl$cluster)# Visualize clusters (example with made-up columns, adjust as needed)ggplot(nfl_clusters) +geom_point(aes(x = is_pass, y = cluster, color =factor(cluster)), size =5) +labs(title ="K-Means Clustering of NFL Plays",x ="Pass/Run Indicator",y ="Cluster",color ="Cluster")
The code uses a Support Vector Machine (SVM) to predict whether a play results in a first down based on various play features. The model achieved an accuracy of approximately 0.00016, which is extremely low. This suggests that the model is not performing well and might be overfitting, underfitting, or not capturing important features to predict first downs. Further tuning, feature engineering, or model adjustments may be necessary to improve performance.
# Load necessary librarieslibrary(nflfastR)library(dplyr)library(e1071)# Load the PBP data for the 2023 seasonpbp <- nflfastR::load_pbp(2023)# Preprocess the data: Filter relevant columns and create a binary target variable# Convert necessary columns to factors and ensure they have at least two levelssvm_data <- pbp |>filter(!is.na(yards_gained) &!is.na(down)) |>mutate(first_down =ifelse(yards_gained >= ydstogo, 1, 0)) |>select(play_id, game_id, home_team, away_team, season_type, week, posteam, defteam, yardline_100, quarter_seconds_remaining, game_seconds_remaining, game_half, down, ydstogo, yards_gained, shotgun, no_huddle, pass_length, air_yards, yards_after_catch, play_type, first_down) |>mutate(across(where(is.character), as.factor)) |>mutate(across(where(is.factor), ~factor(as.character(.), levels =unique(as.character(.))))) |>na.omit()# Split the data into training and testing setstrain_index <-sample(1:nrow(svm_data), 0.5*nrow(svm_data))train_data <- svm_data[train_index, ]test_data <- svm_data[-train_index, ]# Train the SVM modelsvm_model <-svm(first_down ~ ., data = train_data, kernel ="linear", cost =1)# Make predictionspredictions <-predict(svm_model, test_data)# Evaluate the modelconfusion_matrix <-table(predictions, test_data$first_down)#print(confusion_matrix)accuracy <-sum(diag(confusion_matrix)) /sum(confusion_matrix)print(paste("Accuracy:", accuracy))
[1] "Accuracy: 0.000160926939169617"
The code builds a gradient boosting model using xgboost to predict the binary outcome is_pass (whether a play is a pass) based on features such as game_seconds_remaining, down, ydstogo, and wp. After training the model, SHAP (Shapley Additive Explanations) values are used to interpret the model’s predictions. The SHAP analysis reveals that down, ydstogo, wp (win probability), and game_seconds_remaining (time) are critical factors influencing the model’s decisions. These features appear as the most important in the visualizations, suggesting they play a central role in determining the likelihood of a pass play. The sv_waterfall function shows individual explanations for specific plays (row 25 and row 1111), and the sv_importance plot highlights the overall importance of each feature, reinforcing that these game-specific elements are significant in predicting pass plays.
library(dplyr)library(xgboost)
Attaching package: 'xgboost'
The following object is masked from 'package:dplyr':
slice
library(shapviz)set.seed(3)pbp_subset <- model_data |> dplyr::select(game_seconds_remaining, down, ydstogo, wp)boost_data <-xgb.DMatrix(data.matrix(pbp_subset),label = model_data$is_pass) #label is what you #are trying to predictboost_model <-xgb.train(data = boost_data,nrounds =65, #bparams =list(learning_rate =0.1, #lamdaobjective ="reg:squarederror"))shap_values <-shapviz(boost_model,X_pred =data.matrix(pbp_subset),X = pbp_subset)sv_waterfall(shap_values, row_id =25)
sv_waterfall(shap_values, row_id =1111)
sv_importance(shap_values, kind ="beeswarm")
This code creates and visualizes multiple decision trees using the rpart package to predict the is_pass variable based on different features of the dataset. The trees include variables such as half_seconds_remaining, ydstogo, game_seconds_remaining, and additional features like home and away rush EPA. The models vary in complexity, with some having no pruning (cp = 0) and others controlling tree depth. The final tree incorporates multiple features for improved prediction accuracy. I only plot the first tree because it looked like they were pruning to much and looked the best.
library(rpart)library(rpart.plot)set.seed(3)# Basic decision tree modeltree1 <-rpart(factor(is_pass) ~ ., data = model_data)rpart.plot(tree1)
# Decision tree with 'half_seconds_remaining'tree2 <-rpart(is_pass ~factor(half_seconds_remaining), data = model_data)rpart.plot(tree2)
# Decision tree with 'ydstogo' and no pruning (more complex)tree3 <-rpart(is_pass ~factor(ydstogo), data = model_data, cp =0)rpart.plot(tree3)
# Decision tree with 'ydstogo' and 'game_seconds_remaining'tree4 <-rpart(is_pass ~factor(ydstogo) + game_seconds_remaining, data = model_data, cp =0)rpart.plot(tree4)
Warning: labs do not fit even at cex 0.15, there may be some overplotting
# Prediction for a specific case (example values)predict(tree4, data.frame(ydstogo =10, game_seconds_remaining =100))
1
1
# Final tree with additional featurestree_final <-rpart(is_pass ~factor(ydstogo) + game_seconds_remaining + total_home_rush_epa + total_away_rush_epa,data = model_data, control =rpart.control(minsplit =5, cp =0.01))rpart.plot(tree_final)
This code performs K-means clustering on a subset of numeric columns from the model_data dataset, specifically focusing on factors like ydstogo, game_seconds_remaining, and various EPA metrics. An elbow plot is generated to evaluate the optimal number of clusters. The clustering results show that as the number of clusters (k) increases beyond 3, the boundaries between clusters become very distinct, suggesting that 3 may be the ideal number of clusters for this data.
library(dplyr)library(ggplot2)pbp_numeric <- model_data |>select(ydstogo, game_seconds_remaining, total_home_rush_epa, total_away_rush_epa, total_home_pass_epa, is_pass)# Set the number of clusters (k)k <-5k_means_pbp <-kmeans(pbp_numeric |>select(-is_pass), centers = k)# Create a scatter plot colored by clustermodel_data |>mutate(cluster = k_means_pbp$cluster)|>ggplot() +geom_point(aes(x = total_home_rush_epa, y = game_seconds_remaining,color =factor(cluster)), size =0.1) +labs(title ="K-means Clustering of NFL Plays",x ="Total Home Rush EPA",y ="Game Seconds Remaining",color ="Cluster")
# Calculate within-cluster sum of squares for different values of kans <-numeric(20) # Initialize empty vector to store sum of squares for each kfor(k in1:20) { k_means_pbp <-kmeans(pbp_numeric, centers = k) ans[k] <-sum(k_means_pbp$withinss)}# Plot the elbow plot to determine the optimal number of clustersplot(1:20, ans, type ="b", main ="Elbow Plot for K-means Clustering",xlab ="Number of Clusters (k)", ylab ="Within-cluster Sum of Squares")
This code generates a visualization that compares the average offensive passing EPA (Expected Points Added) per play at home with the win percentage (wp) for each NFL team during the 2023 regular season. The scatter plot shows how teams with higher passing EPA at home correlate with their win percentage, . The reversed y-axis emphasizes teams with higher win percentages, highlighting any trends between offensive performance and team success.
This code generates a visualization that compares the average offensive rushing EPA (Expected Points Added) per play at home with the win percentage (wp) for each NFL team during the 2023 regular season. The scatter plot shows how teams with higher rushing EPA at home correlate with their win percentage. The reversed y-axis highlights teams with higher win percentages, providing insight into whether a team’s rushing performance contributes to its success.
This plot visualizes the relationship between a team’s average offensive passing EPA per play at home and its average offensive rushing EPA per play at home during the 2023 NFL regular season. Each team is represented by its logo, and the plot compares the efficiency of passing and rushing for each team. The reversed y-axis emphasizes the performance of teams with higher rushing EPA. This graph provides insight into how teams balance their offensive strategies in both the passing and rushing game.
In this analysis, I began by loading and preprocessing NFL play-by-play (pbp) data from the 2023 season using the nflfastR package, focusing on plays classified as either ‘pass’ or ‘run’. I then created a new binary variable, is_pass, to represent whether a play was a pass. After cleaning the data, I selected relevant features such as down, yards_gained, score_differential, qtr, and game_seconds_remaining, and removed missing values. First, I fitted a logistic regression model (fit_logistic) to predict the likelihood of a pass based on these variables. The model’s significant predictors included down, yards_gained, score_differential, qtr, and game_seconds_remaining, with all variables showing a highly significant effect (p < 0.001). I then explored linear regression models, first with basic continuous predictors (nfl_lm) and subsequently incorporating splines for continuous variables (nfl_lm1) to allow for non-linear relationships. The linear models showed a reasonable fit, with down, yards_gained, and qtr being particularly influential, though score_differential had less of an impact. To improve model flexibility, I applied splines using the bs() function for continuous variables in the logistic regression model (fit_spline). This model demonstrated better performance, with the inclusion of splines capturing more complex relationships, particularly for yards_gained, score_differential, and game_seconds_remaining. The spline model’s AIC (24116) was lower than that of the simpler linear models, indicating a better fit. Additionally, I evaluated the adjusted R-squared values and residual standard errors of the models, which confirmed that the spline model offered superior predictive power. Finally, I visualized the model’s fit using plot(fit_gam, pages = 1, se = TRUE) to assess the effects of splines on the continuous variables, providing a detailed view of how each predictor influenced the likelihood of a pass.
Generalized Additive Models (GAMs) are an extension of generalized linear models that allow for nonlinear relationships between predictors and the response variable. In R, the mgcv package is commonly used to fit GAMs.The gam function in this package enables the specification of smooth functions for predictors, facilitating the modeling of complex, nonlinear patterns For example, to fit a GAM with a smooth term for a predictor x, you can use the formula y ~ s(x), where s(x) denotes a smooth function of x. This approach provides a flexible framework for modeling nonlinear relationships in data #https://www.geeksforgeeks.org/generalized-additive-models-using-r/
In your code, you implemented spline regression in R to model nonlinear relationships between variables.By dividing the dataset into intervals at specific points, known as knots, you applied piecewise polynomial functions to each segment. This approach allowed for a flexible fit that adapts to changes in the data’s pattern.Utilizing the splines package in R, you employed functions like bs() to create B-splines, which are linear combinations of basis functions that ensure smoothness at the knots.This method effectively captured the underlying structure of the data, providing a more accurate representation than simple linear regression. https://www.geeksforgeeks.org/understanding-spline-regression-in-r/
Call:
glm(formula = is_pass ~ down + yards_gained + score_differential +
qtr + game_seconds_remaining, family = binomial, data = model_data)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.054e+00 2.244e-01 4.697 2.64e-06 ***
down 5.141e-01 1.888e-02 27.225 < 2e-16 ***
yards_gained 2.536e-02 1.917e-03 13.230 < 2e-16 ***
score_differential -3.156e-02 1.518e-03 -20.786 < 2e-16 ***
qtr -4.018e-01 5.022e-02 -8.002 1.22e-15 ***
game_seconds_remaining -4.499e-04 5.425e-05 -8.293 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 26511 on 19495 degrees of freedom
Residual deviance: 25032 on 19490 degrees of freedom
AIC: 25044
Number of Fisher Scoring iterations: 4
# Fit linear regression models with splines for continuous variablesnfl_lm <-lm(is_pass ~ down + yards_gained + score_differential + game_seconds_remaining + qtr + wp + ydstogo, data = model_data)summary(nfl_lm)
# Predict probabilitiesmodel_data <- model_data |>mutate(predicted_prob =predict(fit_spline, type ="response"))# ROC curve and AUCroc_curve <-roc(model_data$is_pass, model_data$predicted_prob)
Setting levels: control = 0, case = 1
Setting direction: controls < cases
plot(roc_curve, col ="blue")
auc(roc_curve)
Area under the curve: 0.702
# Predict and plot the effect of yards_gainedyard_grid <-seq(min(model_data$yards_gained), max(model_data$yards_gained), length.out =100)yard_effect <-data.frame(yards_gained = yard_grid,predicted =predict(fit_spline, newdata =data.frame(down =1, yards_gained = yard_grid, score_differential =0, qtr =1, game_seconds_remaining =900), type ="response"))ggplot(yard_effect, aes(x = yards_gained, y = predicted)) +geom_line(color ="blue") +labs(title ="Effect of Yards Gained on Probability of Pass", x ="Yards Gained", y ="Predicted Probability of Pass")
# Fit a Generalized Additive Model (GAM)model_data <- model_data|>mutate(down =as.factor(down))fit_gam <-gam(is_pass ~ yardline_100 + wp +s(ydstogo, k =5) +s(score_differential, k =5) + game_seconds_remaining +factor(down) +factor(qtr),data = model_data, family = binomial)summary(fit_gam)
# Check the number of unique values for each termsapply(pbp_clean, function(x) length(unique(x)))
play_type yardline_100 down
2 99 4
ydstogo game_seconds_remaining
36 3589
# Plot residual diagnosticsgam.check(fit_gam)
Method: UBRE Optimizer: outer newton
full convergence after 7 iterations.
Gradient range [6.144679e-09,1.678914e-06]
(score 0.2097242 & scale 1).
Hessian positive definite, eigenvalue range [1.922126e-06,9.466006e-06].
Model rank = 19 / 19
Basis dimension (k) checking results. Low p-value (k-index<1) may
indicate that k is too low, especially if edf is close to k'.
k' edf k-index p-value
s(ydstogo) 4.00 3.98 0.99 0.285
s(score_differential) 4.00 3.93 0.97 0.015 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1