Load Training and Testing data

Let’s load in our training data set and testing data sets. Our training data set is play-by-play data from the 2016-17 season through the 2018-19 season. Our testing data set is play-by-play data from the non bubble games in 2019-20. We’re using only non bubble games because the bubble games took place on a neutral court and are too dissimilar from our training dataset.

We’ll also load in the play-by-play data for game 3 of the Eastern Conference Finals so we can produce Win Probability charts for this game.

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3     ✓ purrr   0.3.4
## ✓ tibble  3.0.5     ✓ dplyr   1.0.3
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(magrittr)
## 
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
## 
##     set_names
## The following object is masked from 'package:tidyr':
## 
##     extract
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
train <- read.csv("/Users/knarsu3/Google Drive/Sports Info Solutions/pbpdata_1617_1819.csv")
test <- read.csv("/Users/knarsu3/Google Drive/Sports Info Solutions/pbpdata_1920_nonbubble.csv")
game3 <- read.csv("/Users/knarsu3/Google Drive/Sports Info Solutions/Game 3 ECF Milwaukee at Atlanta 062721.csv")

Let’s view our datasets

head(train,3)
##    Season   gameid       date home_team road_team home_w q TimeRemaining hs vs
## 1 2018-19 21800928 2019-03-01       ATL       CHI      0 1          2880  0  0
## 2 2018-19 21800928 2019-03-01       ATL       CHI      0 1          2870  0  0
## 3 2018-19 21800928 2019-03-01       ATL       CHI      0 1          2864  0  0
##   OT home_line home_days_rest homedaysrest away_days_rest awaydaysrest
## 1  0        -2              2            2              2            2
## 2  0        -2              2            2              2            2
## 3  0        -2              2            2              2            2
head(test,3)
##    Season   gameid       date home_team road_team home_w q TimeRemaining hs vs
## 1 2019-20 21900282 2019-11-30       HOU       ATL      1 1          2880  0  0
## 2 2019-20 21900282 2019-11-30       HOU       ATL      1 1          2876  0  0
## 3 2019-20 21900282 2019-11-30       HOU       ATL      1 1          2860  2  0
##   OT home_line home_days_rest homedaysrest away_days_rest awaydaysrest
## 1  0       -14              3            3              1            1
## 2  0       -14              3            3              1            1
## 3  0       -14              3            3              1            1
head(game3,3)
##    Season   gameid       date home_team road_team home_w q TimeRemaining hs vs
## 1 2020-21 42000303 2021-06-27       ATL       MIL      0 1          2880  0  0
## 2 2020-21 42000303 2021-06-27       ATL       MIL      0 1          2877  0  0
## 3 2020-21 42000303 2021-06-27       ATL       MIL      0 1          2859  0  0
##   OT home_line home_days_rest homedaysrest away_days_rest awaydaysrest
## 1  0       4.5              2            2              2            2
## 2  0       4.5              2            2              2            2
## 3  0       4.5              2            2              2            2

Add Clutch variable for different game states.

We may (or may not) use these later on. We’ll define 3 different clutch states: 1) Clutch as defined by nba.com: last 5 minutes, lead within 5 for either team. 2) Last 3 minutes, lead within 5 for either team. 3) Last minute, lead within 3 for either team.

train %<>% mutate(clutch = ifelse(TimeRemaining <= 300 & abs(hs-vs) <= 5,"clutch1","nonclutch")) %>%
  mutate(clutch = ifelse(TimeRemaining <= 180 & abs(hs-vs) <= 5,"clutch2",clutch)) %>%
  mutate(clutch = ifelse(TimeRemaining <= 60 & abs(hs-vs) <= 3,"clutch3",clutch))
train %<>% mutate(clutch = as.factor(clutch))
train %<>% mutate(clutch = ordered(clutch, levels = c("nonclutch", "clutch1", "clutch2","clutch3")))
levels(train$clutch)
## [1] "nonclutch" "clutch1"   "clutch2"   "clutch3"
test %<>% mutate(clutch = ifelse(TimeRemaining <= 300 & abs(hs-vs) <= 5,"clutch1","nonclutch")) %>%
  mutate(clutch = ifelse(TimeRemaining <= 180 & abs(hs-vs) <= 5,"clutch2",clutch)) %>%
  mutate(clutch = ifelse(TimeRemaining <= 60 & abs(hs-vs) <= 3,"clutch3",clutch))
test %<>% mutate(clutch = as.factor(clutch))
test %<>% mutate(clutch = ordered(clutch, levels = c("nonclutch", "clutch1", "clutch2","clutch3")))
levels(test$clutch)
## [1] "nonclutch" "clutch1"   "clutch2"   "clutch3"

Model Win Probability using Logistic Regression

We’re removing some of the variables such as quarter, clutch and home/away_days_rest which include NAs. The reason they include NAs is because for every first game of the season, there is no number of “rest days” (unless we used preseason games). To keep things simple, the home/away_days_rest variable was converted to homedaysrest where each first game of the season was given 7 days rest (this is admittedly arbitrary and there is probably a better way to do it).

Let’s run the model.

mod <- glm(home_w ~ ., family = "binomial", 
           data = select(train, -Season:-road_team,-q,-home_days_rest,-away_days_rest,-clutch))
summary(mod)
## 
## Call:
## glm(formula = home_w ~ ., family = "binomial", data = select(train, 
##     -Season:-road_team, -q, -home_days_rest, -away_days_rest, 
##     -clutch))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.3084  -0.7621   0.2823   0.7527   2.8787  
## 
## Coefficients:
##                 Estimate Std. Error  z value Pr(>|z|)    
## (Intercept)   -5.167e-01  3.238e-02  -15.955  < 2e-16 ***
## TimeRemaining  1.898e-04  1.137e-05   16.694  < 2e-16 ***
## hs             1.430e-01  3.163e-04  452.084  < 2e-16 ***
## vs            -1.389e-01  3.107e-04 -446.971  < 2e-16 ***
## OT            -1.801e-01  1.555e-02  -11.583  < 2e-16 ***
## home_line     -1.268e-01  3.395e-04 -373.489  < 2e-16 ***
## homedaysrest  -1.515e-02  2.196e-03   -6.897  5.3e-12 ***
## awaydaysrest   3.131e-02  2.127e-03   14.716  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2345348  on 1728065  degrees of freedom
## Residual deviance: 1609980  on 1728058  degrees of freedom
## AIC: 1609996
## 
## Number of Fisher Scoring iterations: 5

Let’s check the Accuracy of the model.

train_predictions <- predict(mod, train, type = "response")
train_predictions1 <- round(train_predictions, 0)
train_accuracy <- Metrics::accuracy(train$home_w, train_predictions1) 

test_predictions <- predict(mod, test, type = "response")
test_predictions1 <- round(test_predictions, 0)
test_accuracy <- Metrics::accuracy(test$home_w, test_predictions1) 

print(paste0("Training Accuracy %: ", round(100 * train_accuracy, 2)))
## [1] "Training Accuracy %: 77.04"
print(paste0("Testing Accuracy %: ", round(100 * test_accuracy, 2)))
## [1] "Testing Accuracy %: 76.92"

Looks pretty good! However, let’s visualize and test it to see how it handles late game scenarios. We’ll look at both game 3 and a late game example which has wild fluctuations.

First, let’s load in our late game scenario and view the data.

library(gt)
lategame <- read.csv("/Users/knarsu3/Google Drive/Sports Info Solutions/lategamescenario.csv")
lategame %>% gt()
TimeRemaining hs vs OT home_line homedaysrest awaydaysrest
60 112 112 0 -5 2 2
50 114 112 0 -5 2 2
40 114 112 0 -5 2 2
30 114 115 0 -5 2 2
20 114 115 0 -5 2 2
10 116 115 0 -5 2 2
5 116 115 0 -5 2 2
2 116 117 0 -5 2 2
0 116 117 0 -5 2 2

As we can see, there’s a lot of lead changes in the last minute. This will be a good test of how good our Win Probability model is.

Visualize Game 3 and our late game scenario

Not so great. We can see multiple issues. For both game 3 and the late game scenario, we see that the win probability doesn’t end at 0. Additionally, in the late game scenario, we can see that there aren’t really wild fluctuations in win probability that you would expect given the constant lead changes. We can also see that the home team’s win probability is always above 50% even though they lost! That is likely due to the fact that they were favored by 5 points heading into the game so at least we can be positive the vegas line is having an impact on the model.

Let’s try a generalized additive models (GAM) to see if we can improve upon our logistic regression model.

Generalized Additive Model (GAM)

library(mgcv)
## Loading required package: nlme
## 
## Attaching package: 'nlme'
## The following object is masked from 'package:dplyr':
## 
##     collapse
## This is mgcv 1.8-33. For overview type 'help("mgcv-package")'.
stime <- Sys.time()
gam <- gam(home_w ~ OT+s(TimeRemaining)+s(hs)+s(vs)+home_line+homedaysrest+awaydaysrest, 
           family = binomial, data = select(train, -Season:-road_team,-q,-home_days_rest,-away_days_rest,-clutch))
gamtime <- Sys.time() - stime
summary(gam)
## 
## Family: binomial 
## Link function: logit 
## 
## Formula:
## home_w ~ OT + s(TimeRemaining) + s(hs) + s(vs) + home_line + 
##     homedaysrest + awaydaysrest
## 
## Parametric coefficients:
##                Estimate Std. Error  z value Pr(>|z|)    
## (Intercept)   0.2751955  0.0049035   56.122  < 2e-16 ***
## OT           -0.2170275  0.0223902   -9.693  < 2e-16 ***
## home_line    -0.1278376  0.0003454 -370.099  < 2e-16 ***
## homedaysrest -0.0132279  0.0022338   -5.922 3.19e-09 ***
## awaydaysrest  0.0279040  0.0021637   12.897  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                    edf Ref.df   Chi.sq p-value    
## s(TimeRemaining) 8.165  8.847    460.9  <2e-16 ***
## s(hs)            8.989  9.000 176501.5  <2e-16 ***
## s(vs)            8.961  8.999 177127.5  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =   0.39   Deviance explained =   34%
## UBRE = -0.10461  Scale est. = 1         n = 1728066
print(paste0("GAM model time: ", gamtime, " mins"))
## [1] "GAM model time: 8.30573508342107 mins"

Let’s check the Accuracy of the GAM model.

train_predictions <- predict(gam, train, type = "response")
train_predictions1 <- round(train_predictions, 0)
train_accuracy1 <- Metrics::accuracy(train$home_w, train_predictions1) 

test_predictions <- predict(gam, test, type = "response")
test_predictions1 <- round(test_predictions, 0)
test_accuracy1 <- Metrics::accuracy(test$home_w, test_predictions1) 

print(paste0("GAM Training Accuracy %: ", round(100 * train_accuracy1, 2)))
## [1] "GAM Training Accuracy %: 77.82"
print(paste0("Logistic Training Accuracy %: ", round(100 * train_accuracy, 2)))
## [1] "Logistic Training Accuracy %: 77.04"
print(paste0("GAM Testing Accuracy %: ", round(100 * test_accuracy1, 2)))
## [1] "GAM Testing Accuracy %: 77.59"
print(paste0("Logistic Testing Accuracy %: ", round(100 * test_accuracy, 2)))
## [1] "Logistic Testing Accuracy %: 76.92"

Our training accuracy has improved slightly! Let’s visualize it to see how it performs in the late game situations we looked at earlier.

Visualize Game 3 and our late game scenario for the GAM model

## Warning: attributes are not identical across measure variables;
## they will be dropped

The GAM model appears to be handling the end of the game better! We can see the win probability for the GAM model approach 0 when the game ends. Let’s check our late game scenario.

## Warning: attributes are not identical across measure variables;
## they will be dropped

Unfortunately, we haven’t quite solved the end of game scenario here. However, the GAM model is better; we can see larger fluctuations in win probability but it still does not approach 0 at the end of the game.

Let’s try fitting a random forest model specifically for end of game situations.

End of Game situations Random Forest Model

library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
train2 <- train %>% filter(clutch == "clutch3") %>% 
  select(-Season:-road_team,-q,-home_days_rest,-away_days_rest,-clutch)
test2 <- test %>% filter(clutch == "clutch3") %>% 
  select(-Season:-road_team,-q,-home_days_rest,-away_days_rest,-clutch)
set.seed(4)
rf <- randomForest(as.factor(home_w) ~ ., data=train2,ntree=500,importance=TRUE,mtry=4)
rf
## 
## Call:
##  randomForest(formula = as.factor(home_w) ~ ., data = train2,      ntree = 500, importance = TRUE, mtry = 4) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 0.6%
## Confusion matrix:
##      0     1 class.error
## 0 9829    80 0.008073469
## 1   51 11731 0.004328637
varImpPlot(rf)

prediction_for_table <- predict(rf,test2[,-1])
tab = table(observed=test2[,1],predicted=prediction_for_table)
table(observed=test2[,1],predicted=prediction_for_table)
##         predicted
## observed    0    1
##        0 1650 1067
##        1  769 2209
tab <- data.frame(tab)
tab %<>% mutate(correct = ifelse(observed == 0 & predicted == 0,1,0)) %>%
  mutate(correct = ifelse(observed == 1 & predicted == 1,1,correct))
tab2 <- tab %>% group_by(correct) %>% summarise(Freq = sum(Freq)) %>% data.frame()
tab2 %<>% mutate(Tot = sum(Freq))
tab2 %<>% mutate(Pct = Freq/Tot)
print(paste0("Testing Accuracy %: ", round(100 * tab2[2,4], 2)))
## [1] "Testing Accuracy %: 67.76"
train2_predictions <- predict(rf, train2, type = "prob")
train2_predictions1 <- round(train2_predictions, 0)
train2_accuracy1 <- Metrics::accuracy(train2$home_w, train2_predictions1) 

test_predictions <- predict(rf, test2, type = "prob")
test_predictions1 <- round(test_predictions, 0)
test_accuracy3 <- Metrics::accuracy(test2$home_w, test_predictions1) 

print(paste0("RF Training Accuracy %: ", round(100 * train2_accuracy1, 2)))
## [1] "RF Training Accuracy %: 50"
print(paste0("RF Testing Accuracy %: ", round(100 * test_accuracy3, 2)))
## [1] "RF Testing Accuracy %: 50.16"

Unfortunately, the accuracy isn’t that great for these end of game situations. They’re tricky to model. Let’s at least see if this improves upon the late game scenario we looked at earlier.

Visualize Late Game Situation for Random Forest model

lategame %<>% mutate(rfWinProb = predict(rf,lategame,type="prob"))
rflg <- lategame$rfWinProb
lategame$rfWinProb=NULL
lategame %<>% mutate(rfWinProb = rflg[,2])
lg <- lategame %>%
  select(TimeRemaining, WinProb, gWinProb, rfWinProb) %>%
  gather(key = "variable", value = "value", -TimeRemaining)
## Warning: attributes are not identical across measure variables;
## they will be dropped

Unfortunately, the random forest model doesn’t converge on 0 either but we can see it’s an improvement from the other two models based on the fluctuations in Win Probability.

Let’s merge the GAM and situational random forest model together to get our final predictions.

game3 %<>% mutate(clutch = ifelse(TimeRemaining <= 300 & abs(hs-vs) <= 5,"clutch1","nonclutch")) %>%
  mutate(clutch = ifelse(TimeRemaining <= 180 & abs(hs-vs) <= 5,"clutch2",clutch)) %>%
  mutate(clutch = ifelse(TimeRemaining <= 60 & abs(hs-vs) <= 3,"clutch3",clutch))
game3 %<>% mutate(clutch = as.factor(clutch))
game3 %<>% mutate(clutch = ordered(clutch, levels = c("nonclutch", "clutch1", "clutch2","clutch3")))
levels(game3$clutch)
## [1] "nonclutch" "clutch1"   "clutch2"   "clutch3"
train %<>% mutate(WinProb = predict(gam,train,type="response")) %>%
  mutate(WinProb = ifelse(clutch == "clutch3",predict(rf, train, type = "prob"),WinProb))

test %<>% mutate(WinProb = predict(gam,test,type="response")) %>%
  mutate(WinProb = ifelse(clutch == "clutch3",predict(rf, test, type = "prob"),WinProb))

game3 %<>% mutate(WinProb = predict(gam,game3,type="response")) %>%
  mutate(WinProb = ifelse(clutch == "clutch3",predict(rf, game3, type = "prob"),WinProb))

Final visualization of Game 3