Let’s load in our training data set and testing data sets. Our training data set is play-by-play data from the 2016-17 season through the 2018-19 season. Our testing data set is play-by-play data from the non bubble games in 2019-20. We’re using only non bubble games because the bubble games took place on a neutral court and are too dissimilar from our training dataset.
We’ll also load in the play-by-play data for game 3 of the Eastern Conference Finals so we can produce Win Probability charts for this game.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3 ✓ purrr 0.3.4
## ✓ tibble 3.0.5 ✓ dplyr 1.0.3
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(magrittr)
##
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
##
## set_names
## The following object is masked from 'package:tidyr':
##
## extract
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
train <- read.csv("/Users/knarsu3/Google Drive/Sports Info Solutions/pbpdata_1617_1819.csv")
test <- read.csv("/Users/knarsu3/Google Drive/Sports Info Solutions/pbpdata_1920_nonbubble.csv")
game3 <- read.csv("/Users/knarsu3/Google Drive/Sports Info Solutions/Game 3 ECF Milwaukee at Atlanta 062721.csv")
head(train,3)
## Season gameid date home_team road_team home_w q TimeRemaining hs vs
## 1 2018-19 21800928 2019-03-01 ATL CHI 0 1 2880 0 0
## 2 2018-19 21800928 2019-03-01 ATL CHI 0 1 2870 0 0
## 3 2018-19 21800928 2019-03-01 ATL CHI 0 1 2864 0 0
## OT home_line home_days_rest homedaysrest away_days_rest awaydaysrest
## 1 0 -2 2 2 2 2
## 2 0 -2 2 2 2 2
## 3 0 -2 2 2 2 2
head(test,3)
## Season gameid date home_team road_team home_w q TimeRemaining hs vs
## 1 2019-20 21900282 2019-11-30 HOU ATL 1 1 2880 0 0
## 2 2019-20 21900282 2019-11-30 HOU ATL 1 1 2876 0 0
## 3 2019-20 21900282 2019-11-30 HOU ATL 1 1 2860 2 0
## OT home_line home_days_rest homedaysrest away_days_rest awaydaysrest
## 1 0 -14 3 3 1 1
## 2 0 -14 3 3 1 1
## 3 0 -14 3 3 1 1
head(game3,3)
## Season gameid date home_team road_team home_w q TimeRemaining hs vs
## 1 2020-21 42000303 2021-06-27 ATL MIL 0 1 2880 0 0
## 2 2020-21 42000303 2021-06-27 ATL MIL 0 1 2877 0 0
## 3 2020-21 42000303 2021-06-27 ATL MIL 0 1 2859 0 0
## OT home_line home_days_rest homedaysrest away_days_rest awaydaysrest
## 1 0 4.5 2 2 2 2
## 2 0 4.5 2 2 2 2
## 3 0 4.5 2 2 2 2
We may (or may not) use these later on. We’ll define 3 different clutch states: 1) Clutch as defined by nba.com: last 5 minutes, lead within 5 for either team. 2) Last 3 minutes, lead within 5 for either team. 3) Last minute, lead within 3 for either team.
train %<>% mutate(clutch = ifelse(TimeRemaining <= 300 & abs(hs-vs) <= 5,"clutch1","nonclutch")) %>%
mutate(clutch = ifelse(TimeRemaining <= 180 & abs(hs-vs) <= 5,"clutch2",clutch)) %>%
mutate(clutch = ifelse(TimeRemaining <= 60 & abs(hs-vs) <= 3,"clutch3",clutch))
train %<>% mutate(clutch = as.factor(clutch))
train %<>% mutate(clutch = ordered(clutch, levels = c("nonclutch", "clutch1", "clutch2","clutch3")))
levels(train$clutch)
## [1] "nonclutch" "clutch1" "clutch2" "clutch3"
test %<>% mutate(clutch = ifelse(TimeRemaining <= 300 & abs(hs-vs) <= 5,"clutch1","nonclutch")) %>%
mutate(clutch = ifelse(TimeRemaining <= 180 & abs(hs-vs) <= 5,"clutch2",clutch)) %>%
mutate(clutch = ifelse(TimeRemaining <= 60 & abs(hs-vs) <= 3,"clutch3",clutch))
test %<>% mutate(clutch = as.factor(clutch))
test %<>% mutate(clutch = ordered(clutch, levels = c("nonclutch", "clutch1", "clutch2","clutch3")))
levels(test$clutch)
## [1] "nonclutch" "clutch1" "clutch2" "clutch3"
We’re removing some of the variables such as quarter, clutch and home/away_days_rest which include NAs. The reason they include NAs is because for every first game of the season, there is no number of “rest days” (unless we used preseason games). To keep things simple, the home/away_days_rest variable was converted to homedaysrest where each first game of the season was given 7 days rest (this is admittedly arbitrary and there is probably a better way to do it).
Let’s run the model.
mod <- glm(home_w ~ ., family = "binomial",
data = select(train, -Season:-road_team,-q,-home_days_rest,-away_days_rest,-clutch))
summary(mod)
##
## Call:
## glm(formula = home_w ~ ., family = "binomial", data = select(train,
## -Season:-road_team, -q, -home_days_rest, -away_days_rest,
## -clutch))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.3084 -0.7621 0.2823 0.7527 2.8787
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.167e-01 3.238e-02 -15.955 < 2e-16 ***
## TimeRemaining 1.898e-04 1.137e-05 16.694 < 2e-16 ***
## hs 1.430e-01 3.163e-04 452.084 < 2e-16 ***
## vs -1.389e-01 3.107e-04 -446.971 < 2e-16 ***
## OT -1.801e-01 1.555e-02 -11.583 < 2e-16 ***
## home_line -1.268e-01 3.395e-04 -373.489 < 2e-16 ***
## homedaysrest -1.515e-02 2.196e-03 -6.897 5.3e-12 ***
## awaydaysrest 3.131e-02 2.127e-03 14.716 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2345348 on 1728065 degrees of freedom
## Residual deviance: 1609980 on 1728058 degrees of freedom
## AIC: 1609996
##
## Number of Fisher Scoring iterations: 5
train_predictions <- predict(mod, train, type = "response")
train_predictions1 <- round(train_predictions, 0)
train_accuracy <- Metrics::accuracy(train$home_w, train_predictions1)
test_predictions <- predict(mod, test, type = "response")
test_predictions1 <- round(test_predictions, 0)
test_accuracy <- Metrics::accuracy(test$home_w, test_predictions1)
print(paste0("Training Accuracy %: ", round(100 * train_accuracy, 2)))
## [1] "Training Accuracy %: 77.04"
print(paste0("Testing Accuracy %: ", round(100 * test_accuracy, 2)))
## [1] "Testing Accuracy %: 76.92"
Looks pretty good! However, let’s visualize and test it to see how it handles late game scenarios. We’ll look at both game 3 and a late game example which has wild fluctuations.
First, let’s load in our late game scenario and view the data.
library(gt)
lategame <- read.csv("/Users/knarsu3/Google Drive/Sports Info Solutions/lategamescenario.csv")
lategame %>% gt()
| TimeRemaining | hs | vs | OT | home_line | homedaysrest | awaydaysrest |
|---|---|---|---|---|---|---|
| 60 | 112 | 112 | 0 | -5 | 2 | 2 |
| 50 | 114 | 112 | 0 | -5 | 2 | 2 |
| 40 | 114 | 112 | 0 | -5 | 2 | 2 |
| 30 | 114 | 115 | 0 | -5 | 2 | 2 |
| 20 | 114 | 115 | 0 | -5 | 2 | 2 |
| 10 | 116 | 115 | 0 | -5 | 2 | 2 |
| 5 | 116 | 115 | 0 | -5 | 2 | 2 |
| 2 | 116 | 117 | 0 | -5 | 2 | 2 |
| 0 | 116 | 117 | 0 | -5 | 2 | 2 |
As we can see, there’s a lot of lead changes in the last minute. This will be a good test of how good our Win Probability model is.
Not so great. We can see multiple issues. For both game 3 and the late game scenario, we see that the win probability doesn’t end at 0. Additionally, in the late game scenario, we can see that there aren’t really wild fluctuations in win probability that you would expect given the constant lead changes. We can also see that the home team’s win probability is always above 50% even though they lost! That is likely due to the fact that they were favored by 5 points heading into the game so at least we can be positive the vegas line is having an impact on the model.
Let’s try a generalized additive models (GAM) to see if we can improve upon our logistic regression model.
library(mgcv)
## Loading required package: nlme
##
## Attaching package: 'nlme'
## The following object is masked from 'package:dplyr':
##
## collapse
## This is mgcv 1.8-33. For overview type 'help("mgcv-package")'.
stime <- Sys.time()
gam <- gam(home_w ~ OT+s(TimeRemaining)+s(hs)+s(vs)+home_line+homedaysrest+awaydaysrest,
family = binomial, data = select(train, -Season:-road_team,-q,-home_days_rest,-away_days_rest,-clutch))
gamtime <- Sys.time() - stime
summary(gam)
##
## Family: binomial
## Link function: logit
##
## Formula:
## home_w ~ OT + s(TimeRemaining) + s(hs) + s(vs) + home_line +
## homedaysrest + awaydaysrest
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.2751955 0.0049035 56.122 < 2e-16 ***
## OT -0.2170275 0.0223902 -9.693 < 2e-16 ***
## home_line -0.1278376 0.0003454 -370.099 < 2e-16 ***
## homedaysrest -0.0132279 0.0022338 -5.922 3.19e-09 ***
## awaydaysrest 0.0279040 0.0021637 12.897 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df Chi.sq p-value
## s(TimeRemaining) 8.165 8.847 460.9 <2e-16 ***
## s(hs) 8.989 9.000 176501.5 <2e-16 ***
## s(vs) 8.961 8.999 177127.5 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.39 Deviance explained = 34%
## UBRE = -0.10461 Scale est. = 1 n = 1728066
print(paste0("GAM model time: ", gamtime, " mins"))
## [1] "GAM model time: 8.30573508342107 mins"
train_predictions <- predict(gam, train, type = "response")
train_predictions1 <- round(train_predictions, 0)
train_accuracy1 <- Metrics::accuracy(train$home_w, train_predictions1)
test_predictions <- predict(gam, test, type = "response")
test_predictions1 <- round(test_predictions, 0)
test_accuracy1 <- Metrics::accuracy(test$home_w, test_predictions1)
print(paste0("GAM Training Accuracy %: ", round(100 * train_accuracy1, 2)))
## [1] "GAM Training Accuracy %: 77.82"
print(paste0("Logistic Training Accuracy %: ", round(100 * train_accuracy, 2)))
## [1] "Logistic Training Accuracy %: 77.04"
print(paste0("GAM Testing Accuracy %: ", round(100 * test_accuracy1, 2)))
## [1] "GAM Testing Accuracy %: 77.59"
print(paste0("Logistic Testing Accuracy %: ", round(100 * test_accuracy, 2)))
## [1] "Logistic Testing Accuracy %: 76.92"
Our training accuracy has improved slightly! Let’s visualize it to see how it performs in the late game situations we looked at earlier.
## Warning: attributes are not identical across measure variables;
## they will be dropped
The GAM model appears to be handling the end of the game better! We can see the win probability for the GAM model approach 0 when the game ends. Let’s check our late game scenario.
## Warning: attributes are not identical across measure variables;
## they will be dropped
Unfortunately, we haven’t quite solved the end of game scenario here. However, the GAM model is better; we can see larger fluctuations in win probability but it still does not approach 0 at the end of the game.
Let’s try fitting a random forest model specifically for end of game situations.
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
train2 <- train %>% filter(clutch == "clutch3") %>%
select(-Season:-road_team,-q,-home_days_rest,-away_days_rest,-clutch)
test2 <- test %>% filter(clutch == "clutch3") %>%
select(-Season:-road_team,-q,-home_days_rest,-away_days_rest,-clutch)
set.seed(4)
rf <- randomForest(as.factor(home_w) ~ ., data=train2,ntree=500,importance=TRUE,mtry=4)
rf
##
## Call:
## randomForest(formula = as.factor(home_w) ~ ., data = train2, ntree = 500, importance = TRUE, mtry = 4)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 0.6%
## Confusion matrix:
## 0 1 class.error
## 0 9829 80 0.008073469
## 1 51 11731 0.004328637
varImpPlot(rf)
prediction_for_table <- predict(rf,test2[,-1])
tab = table(observed=test2[,1],predicted=prediction_for_table)
table(observed=test2[,1],predicted=prediction_for_table)
## predicted
## observed 0 1
## 0 1650 1067
## 1 769 2209
tab <- data.frame(tab)
tab %<>% mutate(correct = ifelse(observed == 0 & predicted == 0,1,0)) %>%
mutate(correct = ifelse(observed == 1 & predicted == 1,1,correct))
tab2 <- tab %>% group_by(correct) %>% summarise(Freq = sum(Freq)) %>% data.frame()
tab2 %<>% mutate(Tot = sum(Freq))
tab2 %<>% mutate(Pct = Freq/Tot)
print(paste0("Testing Accuracy %: ", round(100 * tab2[2,4], 2)))
## [1] "Testing Accuracy %: 67.76"
train2_predictions <- predict(rf, train2, type = "prob")
train2_predictions1 <- round(train2_predictions, 0)
train2_accuracy1 <- Metrics::accuracy(train2$home_w, train2_predictions1)
test_predictions <- predict(rf, test2, type = "prob")
test_predictions1 <- round(test_predictions, 0)
test_accuracy3 <- Metrics::accuracy(test2$home_w, test_predictions1)
print(paste0("RF Training Accuracy %: ", round(100 * train2_accuracy1, 2)))
## [1] "RF Training Accuracy %: 50"
print(paste0("RF Testing Accuracy %: ", round(100 * test_accuracy3, 2)))
## [1] "RF Testing Accuracy %: 50.16"
Unfortunately, the accuracy isn’t that great for these end of game situations. They’re tricky to model. Let’s at least see if this improves upon the late game scenario we looked at earlier.
lategame %<>% mutate(rfWinProb = predict(rf,lategame,type="prob"))
rflg <- lategame$rfWinProb
lategame$rfWinProb=NULL
lategame %<>% mutate(rfWinProb = rflg[,2])
lg <- lategame %>%
select(TimeRemaining, WinProb, gWinProb, rfWinProb) %>%
gather(key = "variable", value = "value", -TimeRemaining)
## Warning: attributes are not identical across measure variables;
## they will be dropped
Unfortunately, the random forest model doesn’t converge on 0 either but we can see it’s an improvement from the other two models based on the fluctuations in Win Probability.
Let’s merge the GAM and situational random forest model together to get our final predictions.
game3 %<>% mutate(clutch = ifelse(TimeRemaining <= 300 & abs(hs-vs) <= 5,"clutch1","nonclutch")) %>%
mutate(clutch = ifelse(TimeRemaining <= 180 & abs(hs-vs) <= 5,"clutch2",clutch)) %>%
mutate(clutch = ifelse(TimeRemaining <= 60 & abs(hs-vs) <= 3,"clutch3",clutch))
game3 %<>% mutate(clutch = as.factor(clutch))
game3 %<>% mutate(clutch = ordered(clutch, levels = c("nonclutch", "clutch1", "clutch2","clutch3")))
levels(game3$clutch)
## [1] "nonclutch" "clutch1" "clutch2" "clutch3"
train %<>% mutate(WinProb = predict(gam,train,type="response")) %>%
mutate(WinProb = ifelse(clutch == "clutch3",predict(rf, train, type = "prob"),WinProb))
test %<>% mutate(WinProb = predict(gam,test,type="response")) %>%
mutate(WinProb = ifelse(clutch == "clutch3",predict(rf, test, type = "prob"),WinProb))
game3 %<>% mutate(WinProb = predict(gam,game3,type="response")) %>%
mutate(WinProb = ifelse(clutch == "clutch3",predict(rf, game3, type = "prob"),WinProb))