This report is a continuation of the All-NBA Team Capstone project, which utilizes historical NBA statistics from 1937 to 2012 to predict All-NBA Teams. It will cover the ensemble modeling of the complete player data using the randomForest and gbm packages. See this report for web scraping the 2018-2019 stats.
The data cleaning, exploratory data analysis, and logistic regression model reports are also on RPubs, as well as my capstone project repository.
For this report, we’ll use the complete set of NBA stats to build this model, reserving the 2018 season as the test set. Then, we’ll use the current 2019 stats to make predictions for this year’s All-NBA team rosters.
# Load packages used throughout (assumed to be loaded in the report's setup chunk)
library(tidyverse)

# Historical player data from the EDA report, dropping the individual award columns
p1 <- as_tibble(read.csv("players_EDA.csv")) %>%
  select(-c(allDefFirstTeam, allDefSecondTeam, allNBAFirstTeam, allNBASecondTeam,
            allNBAThirdTeam, MVP, defPOTY))

# Scraped 2018-2019 regular season stats
p2 <- as_tibble(read.csv("players_2019.csv"))

# Combine into a single data set
players <- rbind(p1, p2)

# Training set: all seasons before 2018, keeping only the predictors and the allNBA outcome
train.data <- players %>%
  filter(year < 2018) %>%
  select(-c(playerID, year, tmID, center, forward, guard))

# Test set: the 2018 season
test.data <- players %>%
  filter(year == 2018) %>%
  select(-c(playerID, year, tmID, center, forward, guard))
Decision trees model data by building “trees” of hierarchical branches. Branches are made until they reach “leaves” that represent predictions. The main advantage of decision trees over regression models is that they can model non-linear relationships. However, they are prone to overfitting; so much so that if they are completely unconstrained, they can completely “memorize” the training data by creating more and more branches until each observation has its own leaf!
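As a quick toy illustration of that failure mode (not part of the original analysis; it assumes the rpart package), an unconstrained classification tree driven to tiny leaves can nearly memorize the training data:

# Toy sketch: grow an essentially unconstrained classification tree
library(rpart)

tree.df <- train.data
tree.df$allNBA <- as.factor(tree.df$allNBA)

overfit.tree <- rpart(allNBA ~ ., data = tree.df, method = "class",
                      control = rpart.control(cp = 0, minsplit = 2, minbucket = 1, xval = 0))

# Training error is near zero, but performance on unseen seasons would suffer badly
mean(predict(overfit.tree, type = "class") != tree.df$allNBA)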
To take advantage of their flexibility while preventing overfitting to the training data, we can use ensembles.
Ensembles are machine learning methods that combine multiple models into a single, better model. There are two main ensemble methods:

- Bagging (bootstrap aggregating): many models are trained independently, each on a bootstrap sample of the training data, and their predictions are averaged. This mainly reduces variance.
- Boosting: models are trained sequentially, with each new model focusing on the errors of the ones before it. This mainly reduces bias.

When the base models for ensembling are decision trees, they have special names: bagged trees whose splits also consider only a random subset of the predictors are random forests, and boosted trees are often called gradient boosted machines (GBMs). A toy sketch of the bagging idea follows below.
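Here’s that bagging idea by hand (a toy sketch, not part of the original analysis, assuming the rpart package): fit a tree to each bootstrap sample and average the predicted probabilities.

# Toy bagging sketch: 25 trees, each fit to a bootstrap sample of the training data
library(rpart)

bag.df <- train.data
bag.df$allNBA <- as.factor(bag.df$allNBA)

set.seed(42)
bag.trees <- lapply(1:25, function(i) {
  boot.idx <- sample(nrow(bag.df), replace = TRUE)
  rpart(allNBA ~ ., data = bag.df[boot.idx, ], method = "class")
})

# Average the predicted probability of class "1" (All-NBA) across the bagged trees
bag.prob <- rowMeans(sapply(bag.trees, function(tree)
  predict(tree, newdata = test.data, type = "prob")[, "1"]))

A random forest does essentially this, but additionally samples a random subset of predictors at each split to decorrelate the trees.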
Here we’ll use the randomForest package to implement our first random forest model. We’ll specify the importance = TRUE argument to inspect variable importance.
# Import library
library(randomForest)
# Make the outcome variable a factor
train.data$allNBA <- as.factor(train.data$allNBA)
test.data$allNBA <- as.factor(test.data$allNBA)
# default rf model
set.seed(42)
rf <- randomForest(allNBA ~ ., data = train.data, importance = TRUE)
rf
##
## Call:
## randomForest(formula = allNBA ~ ., data = train.data, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 7
##
## OOB estimate of error rate: 1.52%
## Confusion matrix:
## 0 1 class.error
## 0 16969 88 0.005159172
## 1 179 349 0.339015152
varImpPlot(rf)
# Drop the outcome column and get predicted class probabilities for the 2018 test set
test <- test.data %>% select(-allNBA)
rf.pred <- predict(rf, test, type = "prob")

# Attach each 2018 player's predicted probability of making an All-NBA team
rf.results <- players %>%
  filter(year == 2018) %>%
  mutate(probability = as.vector(rf.pred[,2])) %>%
  select(playerID, year, allNBA, probability, center, forward, guard)

# Top 10 players by predicted probability at each position
rf.centers <- rf.results %>%
  filter(center == 1) %>%
  select(-c(center, forward, guard)) %>%
  arrange(desc(probability)) %>%
  head(10)

rf.forwards <- rf.results %>%
  filter(forward == 1) %>%
  select(-c(center, forward, guard)) %>%
  arrange(desc(probability)) %>%
  head(10)

rf.guards <- rf.results %>%
  filter(guard == 1) %>%
  select(-c(center, forward, guard)) %>%
  arrange(desc(probability)) %>%
  head(10)
By default, the resulting randomForest object reports an out-of-bag confusion matrix, from which we can see that the class error for All-NBA team members is \(\frac{179}{179+349} = 0.339015152\).
The varImpPlot() function returns two different measures of variable importance: mean decrease in accuracy and mean decrease in Gini. Mean decrease in accuracy measures how much the prediction error increases when the values of a predictor are randomly permuted. Mean decrease in Gini is a measure of node purity: lower impurity at a node is better, so a predictor that produces a large Gini decrease does a better job of separating All-NBA members from non-members at the nodes where it is used.
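The numbers behind the plot can also be pulled directly from the fitted model with randomForest’s importance() function; for example:

# Top 10 predictors by mean decrease in accuracy
imp <- importance(rf)
head(imp[order(imp[, "MeanDecreaseAccuracy"], decreasing = TRUE), ], 10)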
Looks like All-Star status and game score are at the top of both measures, which suggests that they are relatively strong predictors of All-NBA team membership.
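The evaluation objects used below (auroc.rf, auprc.rf, auc.rf, pr.rf, and mxe.rf) come from chunks that aren’t echoed in this report. Here is a minimal sketch of how they might be built, assuming the PRROC and ggplot2 packages (the pr.rf$auc.integral component used below matches the output of PRROC’s pr.curve()); the .rf.tune, .gb, and .gb.tune analogues later in the report would follow the same pattern.

# Sketch only: test-set evaluation objects for the random forest model
library(PRROC)
library(ggplot2)

labels <- as.numeric(as.character(test.data$allNBA))  # 0/1 outcome for the 2018 test set
probs  <- rf.pred[, 2]                                # predicted P(allNBA = 1)

# ROC and precision-recall curves (PRROC takes the scores split by class)
roc.obj <- roc.curve(scores.class0 = probs[labels == 1],
                     scores.class1 = probs[labels == 0], curve = TRUE)
pr.rf   <- pr.curve(scores.class0 = probs[labels == 1],
                    scores.class1 = probs[labels == 0], curve = TRUE)
auc.rf  <- roc.obj$auc

# ggplot versions of the curves so grid.arrange() can place them side by side
auroc.rf <- ggplot(data.frame(fpr = roc.obj$curve[, 1], tpr = roc.obj$curve[, 2]),
                   aes(fpr, tpr)) +
  geom_line() +
  labs(title = "AUROC", x = "False positive rate", y = "True positive rate")
auprc.rf <- ggplot(data.frame(recall = pr.rf$curve[, 1], precision = pr.rf$curve[, 2]),
                   aes(recall, precision)) +
  geom_line() +
  labs(title = "AUPRC", x = "Recall", y = "Precision")

# Cross-entropy (mean log loss), clipping probabilities to avoid log(0)
eps    <- 1e-15
p.clip <- pmin(pmax(probs, eps), 1 - eps)
mxe.rf <- -mean(labels * log(p.clip) + (1 - labels) * log(1 - p.clip))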
# Plots
gridExtra::grid.arrange(auroc.rf, auprc.rf, ncol=2)
# AUROC
auc.rf
## [1] 0.998151
# AUPRC
pr.rf$auc.integral
## [1] 0.9250388
# Cross-Entropy
mxe.rf
## [1] 0.02196089
rf.guards
rf.forwards
rf.centers
Even without tuning, the random forest model performs exceptionally well, with an AUROC of \(0.9981510\), AUPRC of \(0.9250388\), and cross-entropy loss of \(0.0219609\). Adjusting for the fact that Jimmy Butler and LaMarcus Aldridge made the All-NBA team as forwards, and Anthony Davis as a center, the model correctly predicted 13 out of 15 players!
We can attempt to improve on the previous model by tuning parameters with the caret package.
Here, we use 10-fold cross-validation and a simple grid search to find the optimal value of mtry, which determines the number of variables randomly sampled as candidates at each split. The default value for classification is \(mtry = \lfloor\sqrt{p}\rfloor\), where \(p\) is the number of predictors. In our case, that is \(\lfloor\sqrt{54}\rfloor = 7\) (which matches the value reported by the previous model), so we’ll define our grid to cover the values 1 through 15.
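A quick sanity check of that default (assuming every remaining column besides allNBA is treated as a predictor):

# floor(sqrt(p)) with p = 54 predictors gives the default mtry of 7 seen above
floor(sqrt(ncol(train.data) - 1))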
# caret (assumed loaded in the setup chunk of the original report)
library(caret)

# Create tuning df
train.data.tune <- train.data
levels(train.data.tune$allNBA) <- c("allNBA", "not") # Rename levels to valid R names for train()

# 10-fold CV with class probabilities so ROC can be used as the summary metric
fitControl <- trainControl(method = "cv",
                           number = 10,
                           search = "grid",
                           classProbs = TRUE,
                           summaryFunction = twoClassSummary)

# Define grid of parameters to try
tunegrid <- expand.grid(.mtry = c(1:15))

set.seed(42)
rf.tune <- train(allNBA ~ ., data = train.data.tune,
                 method = "rf",
                 metric = "ROC",
                 tuneGrid = tunegrid,
                 trControl = fitControl)
print(rf.tune)
## Random Forest
##
## 17585 samples
## 54 predictor
## 2 classes: 'allNBA', 'not'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 15828, 15826, 15826, 15826, 15827, 15826, ...
## Resampling results across tuning parameters:
##
## mtry ROC Sens Spec
## 1 0.9918000 0.9965408 0.5376996
## 2 0.9922197 0.9959545 0.6061321
## 3 0.9926631 0.9954855 0.6211538
## 4 0.9928568 0.9953096 0.6306967
## 5 0.9926948 0.9951923 0.6288099
## 6 0.9928858 0.9950164 0.6534470
## 7 0.9928009 0.9947232 0.6514877
## 8 0.9918921 0.9944887 0.6477866
## 9 0.9920960 0.9945473 0.6667271
## 10 0.9929486 0.9944301 0.6591074
## 11 0.9930554 0.9941956 0.6686502
## 12 0.9930928 0.9944301 0.6686502
## 13 0.9929302 0.9941369 0.6743469
## 14 0.9929726 0.9941370 0.6705007
## 15 0.9911000 0.9936680 0.6799347
##
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 12.
plot(rf.tune)
The results show an optimal setting of mtry = 12, so we’ll refit the model using this parameter and compare the results.
set.seed(42)
rf.tune <- randomForest(allNBA ~ ., data = train.data, mtry = 12, importance = TRUE)
rf.tune
##
## Call:
## randomForest(formula = allNBA ~ ., data = train.data, mtry = 12, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 12
##
## OOB estimate of error rate: 1.53%
## Confusion matrix:
## 0 1 class.error
## 0 16961 96 0.005628188
## 1 173 355 0.327651515
varImpPlot(rf.tune)
# Plots
gridExtra::grid.arrange(auroc.rf.tune, auprc.rf.tune, ncol=2)
# AUROC
auc.rf.tune
## [1] 0.9979456
# AUPRC
pr.rf.tune$auc.integral
## [1] 0.9195144
# Cross-Entropy
mxe.rf.tune
## [1] 0.02162433
The tuned model actually has a slightly lower AUROC (\(0.9979456\)) and AUPRC (\(0.9195144\)) than the untuned model. However, it has a slightly lower (better) cross-entropy loss (\(0.0216243\) vs. \(0.0219609\)). It also correctly predicts 13 out of 15 players.
As mentioned above, boosted trees use a different approach to ensembling than random forests. Random forests build an ensemble of deep, independent trees, while boosted trees are built sequentially from shallow, weak trees, with each new tree learning from the errors of the ones before it. We’ll build a boosted tree model using the gbm package. Implementation follows the UC Business Analytics guide.
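As a toy illustration of that sequential idea (not part of the original analysis, and assuming the rpart package), two rounds of boosting on squared-error residuals look roughly like this:

# Toy boosting sketch: each shallow tree is fit to the residuals of the current model
library(rpart)

boost.df <- train.data
boost.df$allNBA <- as.numeric(as.character(boost.df$allNBA)) # back to a numeric 0/1 outcome
lr <- 0.1 # learning rate (shrinkage)

f0 <- mean(boost.df$allNBA)                 # initial prediction: the base rate
boost.df$resid <- boost.df$allNBA - f0

tree1 <- rpart(resid ~ . - allNBA, data = boost.df, method = "anova",
               control = rpart.control(maxdepth = 2))
boost.df$resid <- boost.df$resid - lr * predict(tree1)

tree2 <- rpart(resid ~ . - allNBA, data = boost.df, method = "anova",
               control = rpart.control(maxdepth = 2))

# Ensemble prediction after two boosting rounds
boost.pred <- f0 + lr * predict(tree1) + lr * predict(tree2)

gbm automates this process (with a logistic rather than squared-error loss for the bernoulli distribution) over thousands of trees.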
We’ll try a basic implementation of gbm using mostly default settings. However, we’ll again use 10-fold cross-validation, as well as increase n.trees to 5000 and decrease shrinkage to 0.01.
# Load package
library(gbm)
# Reload the train/test data (gbm's bernoulli distribution expects a numeric 0/1 outcome, not a factor)
train.data <- players %>%
  filter(year < 2018) %>%
  select(-c(playerID, year, tmID, center, forward, guard))
test.data <- players %>%
  filter(year == 2018) %>%
  select(-c(playerID, year, tmID, center, forward, guard))

# Build model
set.seed(42)
gb <- gbm(
  formula = allNBA ~ .,
  distribution = "bernoulli",
  data = train.data,
  n.trees = 5000,
  shrinkage = 0.01,
  cv.folds = 10)
# Variable importance plot for top 10 variables
vip::vip(gb)
Just as with the random forest models, the boosted tree model places high importance on All-Star status and game score.
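As before, the evaluation objects below (auroc.gb, auprc.gb, auc.gb, pr.gb, and mxe.gb) come from chunks that aren’t echoed. A minimal sketch of how the test-set probabilities behind them might be obtained (gbm.perf() selects the CV-optimal number of trees; the metrics would then follow the same PRROC/log-loss pattern shown earlier):

# CV-selected number of trees, then predicted probabilities on the 2018 test set
best.iter <- gbm.perf(gb, method = "cv", plot.it = FALSE)
gb.pred   <- predict(gb, newdata = test.data, n.trees = best.iter, type = "response")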
# Plots
gridExtra::grid.arrange(auroc.gb, auprc.gb, ncol=2)
# AUROC
auc.gb
## [1] 0.9973292
# AUPRC
pr.gb$auc.integral
## [1] 0.867211
# Cross-Entropy
mxe.gb
## [1] 0.02347881
Without any tuning, the boosted tree model still has excellent AUROC, AUPRC, and cross-entropy loss. It predicts the same number of correct players as the random forest models, though the rosters are shuffled around a bit. Next, we’ll try tuning some parameters using caret.
caret provides the following 4 options for tuning parameters. From the author’s bookdown document:

- n.trees (# Boosting Iterations)
- interaction.depth (Max Tree Depth) - tree complexity
- shrinkage (Shrinkage) - learning rate
- n.minobsinnode (Min. Terminal Node Size) - smallest allowable leaf

Unlike the random forest model, which only needs mtry adjusted, boosted tree models are much more involved when it comes to tuning. Our parameter grid below contains \(3^4 = 81\) total parameter combinations. To help speed up computation, the allowParallel argument in the trainControl function is set to TRUE.
# Parameter grid
gbmGrid <- expand.grid(interaction.depth = c(1, 3, 5),
                       n.trees = c(2500, 5000, 10000),
                       shrinkage = c(0.01, 0.05, 0.1),
                       n.minobsinnode = c(5, 10, 15))

# Control settings
gbmControl <- trainControl(method = "cv",
                           number = 10,
                           search = "grid",
                           classProbs = TRUE,
                           allowParallel = TRUE,
                           summaryFunction = twoClassSummary)

# Build models
set.seed(42)
gb.opt <- train(allNBA ~ ., data = train.data,
                method = "gbm",
                metric = "ROC",
                tuneGrid = gbmGrid,
                trControl = gbmControl)
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 0.2580 nan 0.0100 0.0057
## 10 0.2047 nan 0.0100 0.0020
## 100 0.1027 nan 0.0100 0.0002
## 500 0.0566 nan 0.0100 -0.0000
## 1000 0.0432 nan 0.0100 -0.0000
## 2500 0.0238 nan 0.0100 -0.0000
print(gb.opt)
## Stochastic Gradient Boosting
##
## 17585 samples
## 54 predictor
## 2 classes: 'allNBA', 'not'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 15828, 15826, 15826, 15826, 15827, 15826, ...
## Resampling results across tuning parameters:
##
## shrinkage interaction.depth n.minobsinnode n.trees ROC
## 0.01 1 5 2500 0.9925547
## 0.01 1 5 5000 0.9926221
## 0.01 1 5 10000 0.9925256
## 0.01 1 10 2500 0.9925304
## 0.01 1 10 5000 0.9925298
## 0.01 1 10 10000 0.9924287
## 0.01 1 15 2500 0.9926210
## 0.01 1 15 5000 0.9926334
## 0.01 1 15 10000 0.9925477
## 0.01 3 5 2500 0.9932405
## 0.01 3 5 5000 0.9931004
## 0.01 3 5 10000 0.9928965
## 0.01 3 10 2500 0.9931590
## 0.01 3 10 5000 0.9930040
## 0.01 3 10 10000 0.9928857
## 0.01 3 15 2500 0.9932580
## 0.01 3 15 5000 0.9931065
## 0.01 3 15 10000 0.9929454
## 0.01 5 5 2500 0.9932267
## 0.01 5 5 5000 0.9932036
## 0.01 5 5 10000 0.9931293
## 0.01 5 10 2500 0.9932362
## 0.01 5 10 5000 0.9931932
## 0.01 5 10 10000 0.9931291
## 0.01 5 15 2500 0.9933334
## 0.01 5 15 5000 0.9932470
## 0.01 5 15 10000 0.9931189
## 0.05 1 5 2500 0.9924139
## 0.05 1 5 5000 0.9920860
## 0.05 1 5 10000 0.9913018
## 0.05 1 10 2500 0.9922249
## 0.05 1 10 5000 0.9917101
## 0.05 1 10 10000 0.9911202
## 0.05 1 15 2500 0.9922416
## 0.05 1 15 5000 0.9918316
## 0.05 1 15 10000 0.9909350
## 0.05 3 5 2500 0.9925773
## 0.05 3 5 5000 0.9924258
## 0.05 3 5 10000 0.9922565
## 0.05 3 10 2500 0.9927159
## 0.05 3 10 5000 0.9926647
## 0.05 3 10 10000 0.9925780
## 0.05 3 15 2500 0.9926644
## 0.05 3 15 5000 0.9925763
## 0.05 3 15 10000 0.9922057
## 0.05 5 5 2500 0.9930087
## 0.05 5 5 5000 0.9927923
## 0.05 5 5 10000 0.9911668
## 0.05 5 10 2500 0.9929238
## 0.05 5 10 5000 0.9927555
## 0.05 5 10 10000 0.9925631
## 0.05 5 15 2500 0.9914765
## 0.05 5 15 5000 0.9928544
## 0.05 5 15 10000 0.9924921
## 0.10 1 5 2500 0.9922804
## 0.10 1 5 5000 0.9913557
## 0.10 1 5 10000 0.9900686
## 0.10 1 10 2500 0.9918524
## 0.10 1 10 5000 0.9910617
## 0.10 1 10 10000 0.9900409
## 0.10 1 15 2500 0.9916587
## 0.10 1 15 5000 0.9908605
## 0.10 1 15 10000 0.9899106
## 0.10 3 5 2500 0.9926214
## 0.10 3 5 5000 0.9924569
## 0.10 3 5 10000 0.9922713
## 0.10 3 10 2500 0.9926546
## 0.10 3 10 5000 0.9925491
## 0.10 3 10 10000 0.9925334
## 0.10 3 15 2500 0.9924481
## 0.10 3 15 5000 0.9922049
## 0.10 3 15 10000 0.9922325
## 0.10 5 5 2500 0.9927748
## 0.10 5 5 5000 0.9926944
## 0.10 5 5 10000 0.9892830
## 0.10 5 10 2500 0.9925895
## 0.10 5 10 5000 0.9910017
## 0.10 5 10 10000 0.9859020
## 0.10 5 15 2500 0.9928859
## 0.10 5 15 5000 0.9925326
## 0.10 5 15 10000 0.9851374
## Sens Spec
## 0.9929057 0.6855951
## 0.9930231 0.6874456
## 0.9927885 0.6837083
## 0.9930816 0.6836720
## 0.9933161 0.6723149
## 0.9929644 0.6742380
## 0.9932575 0.6893687
## 0.9932575 0.6855588
## 0.9928471 0.6818578
## 0.9930230 0.7083091
## 0.9927297 0.7064224
## 0.9933161 0.7064586
## 0.9927299 0.7064586
## 0.9927884 0.7064224
## 0.9930229 0.7083817
## 0.9929058 0.7121190
## 0.9930230 0.7121916
## 0.9933162 0.6969521
## 0.9931403 0.7216618
## 0.9931988 0.7141147
## 0.9933161 0.7046081
## 0.9936093 0.7178157
## 0.9935507 0.7292090
## 0.9934921 0.7140421
## 0.9936680 0.7158563
## 0.9933748 0.7139332
## 0.9937266 0.7120827
## 0.9926127 0.7046081
## 0.9922023 0.6932511
## 0.9917332 0.6836720
## 0.9928472 0.6742743
## 0.9920851 0.6780842
## 0.9920265 0.6799347
## 0.9927885 0.6856676
## 0.9920264 0.6706459
## 0.9916748 0.6724601
## 0.9928471 0.7158926
## 0.9929056 0.6988752
## 0.9933747 0.6951016
## 0.9930816 0.6969884
## 0.9935506 0.7045356
## 0.9936093 0.7046081
## 0.9933162 0.7159289
## 0.9936680 0.7045718
## 0.9939025 0.7045718
## 0.9931989 0.7083817
## 0.9935506 0.7121916
## 0.9936092 0.7007983
## 0.9937853 0.7026851
## 0.9940197 0.6914006
## 0.9937852 0.6951379
## 0.9933747 0.7235123
## 0.9936092 0.7158926
## 0.9937265 0.7121190
## 0.9919092 0.6931785
## 0.9912643 0.6817489
## 0.9912643 0.6837446
## 0.9923782 0.6742743
## 0.9919679 0.6703919
## 0.9919093 0.6723512
## 0.9922609 0.6818940
## 0.9922610 0.6628810
## 0.9916747 0.6533382
## 0.9931989 0.7101959
## 0.9931402 0.7102685
## 0.9934920 0.7007983
## 0.9937265 0.7027213
## 0.9934335 0.7065312
## 0.9934333 0.7084180
## 0.9934921 0.6988389
## 0.9938438 0.6969521
## 0.9936679 0.7045718
## 0.9934920 0.7178157
## 0.9937265 0.7121916
## 0.9941956 0.7084180
## 0.9936679 0.7101959
## 0.9937265 0.6989478
## 0.9938437 0.7007983
## 0.9938438 0.7178157
## 0.9935506 0.7027213
## 0.9934921 0.7007620
##
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 2500,
## interaction.depth = 5, shrinkage = 0.01 and n.minobsinnode = 15.
The train function reports that the final values used for the optimal model are n.trees = 2500, interaction.depth = 5, shrinkage = 0.01, and n.minobsinnode = 15, so let’s rebuild the boosted tree model with those parameters.
# Build model
set.seed(42)
gb.tune <- gbm(
  formula = allNBA ~ .,
  distribution = "bernoulli",
  data = train.data,
  n.trees = 2500,
  interaction.depth = 5,
  shrinkage = 0.01,
  n.minobsinnode = 15,
  cv.folds = 10)
# Variable importance plot for top 10 variables
vip::vip(gb.tune)
# Plots
gridExtra::grid.arrange(auroc.gb.tune, auprc.gb.tune, ncol=2)
# AUROC
auc.gb.tune
## [1] 0.9974319
# AUPRC
pr.gb.tune$auc.integral
## [1] 0.8734863
# Cross-Entropy
mxe.gb.tune
## [1] 0.02299367
The tuned boosted tree model provides a very slight improvement over the untuned model. As with the random forest models, it correctly predicts 13 out of 15 players, with the order of players slightly different.
We can now compare our evaluation metrics for our tree ensembles and our previous regularized regression models.
| Model | AUROC | AUPRC | Cross-Entropy |
|---|---|---|---|
| Lasso | 0.9930 | 0.8369 | 0.0412 |
| Ridge | 0.9915 | 0.8219 | 0.0637 |
| Elastic-Net | 0.9932 | 0.8412 | 0.0417 |
| Random Forest | 0.9979 | 0.9195 | 0.0216 |
| Boosted Tree | 0.9974 | 0.8735 | 0.0230 |
The random forest model performed the best based on all evaluation criteria, so let’s use that to predict the 2019 All-NBA team!
# Subset the 2018-2019 regular season data and drop the non-predictor columns
players.2019 <- players %>%
  filter(year == 2019) %>%
  select(-c(playerID, year, tmID, center, forward, guard, allNBA))

# Predicted class probabilities from the tuned random forest
set.seed(42)
rf.tune.pred.2019 <- predict(rf.tune, players.2019, type = "prob")
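The roster table below was presumably assembled by ranking the 2019 players within each position group by their predicted probability, as was done for the 2018 test set; a minimal sketch (the original chunk is not shown, and the object names here are illustrative):

# Rank 2019 players within each position group by predicted probability
rf.results.2019 <- players %>%
  filter(year == 2019) %>%
  mutate(probability = as.vector(rf.tune.pred.2019[, 2])) %>%
  select(playerID, probability, center, forward, guard)

# Six guards, six forwards, and three centers fill out the three All-NBA teams
top.guards   <- rf.results.2019 %>% filter(guard == 1)   %>% arrange(desc(probability)) %>% head(6)
top.forwards <- rf.results.2019 %>% filter(forward == 1) %>% arrange(desc(probability)) %>% head(6)
top.centers  <- rf.results.2019 %>% filter(center == 1)  %>% arrange(desc(probability)) %>% head(3)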
| Position | First Team | Second Team | Third Team |
|---|---|---|---|
| Guard | Bradley Beal | Damian Lillard | Kemba Walker |
| Guard | Russell Westbrook | James Harden | Stephen Curry |
| Forward | Giannis Antetokounmpo | Paul George | LeBron James |
| Forward | Kevin Durant | Blake Griffin | Kawhi Leonard |
| Center | Karl-Anthony Towns | Nikola Jokic | Joel Embiid |
The random forest model produces a very different roster than what our regularized regression models came up with. There are admittedly some questionable choices in there; however, this only highlights the fact that the model’s predictions should be used only as a guide. There are still plenty of improvements that could be made.
Regardless, it’ll be interesting to see how each predicted roster compares with the actual roster. As of today (May 15th, 2019), we’ll just have to keep waiting!