This report is a continuation of the All-NBA Team Capstone project, which utilizes historical NBA statistics from 1937 to 2012 to predict All-NBA Teams. It will cover the ensemble modeling of the complete player data using the randomForest and gbm packages. See this report for web scraping the 2018-2019 stats.
The data cleaning, exploratory data analysis, and logistic regression model reports are also on RPubs, as well as my capstone project repository.
For this report, we’ll use the complete set of NBA stats to build this model, reserving the 2018 season as the test set. Then, we’ll use the current 2019 stats to make predictions for this year’s All-NBA team rosters.
# Load packages used throughout (assumed to be loaded in the report's setup chunk)
library(tidyverse)

# Historical player data from the EDA report, dropping the individual award columns
p1 <- as_tibble(read.csv("players_EDA.csv")) %>%
  select(-c(allDefFirstTeam, allDefSecondTeam, allNBAFirstTeam, allNBASecondTeam,
            allNBAThirdTeam, MVP, defPOTY))

# Scraped 2018-2019 regular season stats
p2 <- as_tibble(read.csv("players_2019.csv"))

# Combine into a single data set
players <- rbind(p1, p2)

# Training set: all seasons before 2018, keeping only the predictors and the allNBA outcome
train.data <- players %>%
  filter(year < 2018) %>%
  select(-c(playerID, year, tmID, center, forward, guard))

# Test set: the 2018 season
test.data <- players %>%
  filter(year == 2018) %>%
  select(-c(playerID, year, tmID, center, forward, guard))
Decision trees model data by building “trees” of hierarchical branches. Branches are made until they reach “leaves” that represent predictions. The main advantage of decision trees over regression models is that they can model non-linear relationships. However, they are prone to overfitting; so much so that if they are completely unconstrained, they can completely “memorize” the training data by creating more and more branches until each observation has its own leaf!
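As a quick toy illustration of that failure mode (not part of the original analysis; it assumes the rpart package), an unconstrained classification tree driven to tiny leaves can nearly memorize the training data:

# Toy sketch: grow an essentially unconstrained classification tree
library(rpart)

tree.df <- train.data
tree.df$allNBA <- as.factor(tree.df$allNBA)

overfit.tree <- rpart(allNBA ~ ., data = tree.df, method = "class",
                      control = rpart.control(cp = 0, minsplit = 2, minbucket = 1, xval = 0))

# Training error is near zero, but performance on unseen seasons would suffer badly
mean(predict(overfit.tree, type = "class") != tree.df$allNBA)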
To take advantage of their flexibility while preventing overfitting to the training data, we can use ensembles.
Ensembles are machine learning methods that combine multiple models into a single, better model. There are two main ensemble methods:

- Bagging (bootstrap aggregating): many models are trained independently, each on a bootstrap sample of the training data, and their predictions are averaged. This mainly reduces variance.
- Boosting: models are trained sequentially, with each new model focusing on the errors of the ones before it. This mainly reduces bias.

When the base models for ensembling are decision trees, they have special names: bagged trees whose splits also consider only a random subset of the predictors are random forests, and boosted trees are often called gradient boosted machines (GBMs). A toy sketch of the bagging idea follows below.
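Here’s that bagging idea by hand (a toy sketch, not part of the original analysis, assuming the rpart package): fit a tree to each bootstrap sample and average the predicted probabilities.

# Toy bagging sketch: 25 trees, each fit to a bootstrap sample of the training data
library(rpart)

bag.df <- train.data
bag.df$allNBA <- as.factor(bag.df$allNBA)

set.seed(42)
bag.trees <- lapply(1:25, function(i) {
  boot.idx <- sample(nrow(bag.df), replace = TRUE)
  rpart(allNBA ~ ., data = bag.df[boot.idx, ], method = "class")
})

# Average the predicted probability of class "1" (All-NBA) across the bagged trees
bag.prob <- rowMeans(sapply(bag.trees, function(tree)
  predict(tree, newdata = test.data, type = "prob")[, "1"]))

A random forest does essentially this, but additionally samples a random subset of predictors at each split to decorrelate the trees.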
Here we’ll use the randomForest package to implement our first random forest model. We’ll specify the importance = TRUE argument to inspect variable importance.
# Import library
library(randomForest)
# Make the outcome variable a factor
train.data$allNBA <- as.factor(train.data$allNBA)
test.data$allNBA <- as.factor(test.data$allNBA)
# default rf model
set.seed(42)
rf <- randomForest(allNBA ~ ., data = train.data, importance = TRUE)
rf
##
## Call:
## randomForest(formula = allNBA ~ ., data = train.data, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 7
##
## OOB estimate of error rate: 1.52%
## Confusion matrix:
## 0 1 class.error
## 0 16969 88 0.005159172
## 1 179 349 0.339015152
varImpPlot(rf)
# Drop the outcome column and get predicted class probabilities for the 2018 test set
test <- test.data %>% select(-allNBA)
rf.pred <- predict(rf, test, type = "prob")

# Attach each 2018 player's predicted probability of making an All-NBA team
rf.results <- players %>%
  filter(year == 2018) %>%
  mutate(probability = as.vector(rf.pred[,2])) %>%
  select(playerID, year, allNBA, probability, center, forward, guard)

# Top 10 players by predicted probability at each position
rf.centers <- rf.results %>%
  filter(center == 1) %>%
  select(-c(center, forward, guard)) %>%
  arrange(desc(probability)) %>%
  head(10)

rf.forwards <- rf.results %>%
  filter(forward == 1) %>%
  select(-c(center, forward, guard)) %>%
  arrange(desc(probability)) %>%
  head(10)

rf.guards <- rf.results %>%
  filter(guard == 1) %>%
  select(-c(center, forward, guard)) %>%
  arrange(desc(probability)) %>%
  head(10)
By default, the resulting randomForest object reports an out-of-bag confusion matrix, from which we can see that the class error for All-NBA team members is \(\frac{179}{179+349} = 0.339015152\).
The varImpPlot() function returns two different measures of variable importance: mean decrease in accuracy and mean decrease in Gini. Mean decrease in accuracy measures how much the prediction error increases when the values of a predictor are randomly permuted. Mean decrease in Gini is a measure of node purity: lower impurity at a node is better, so a predictor that produces a large Gini decrease does a better job of separating All-NBA members from non-members at the nodes where it is used.
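The numbers behind the plot can also be pulled directly from the fitted model with randomForest’s importance() function; for example:

# Top 10 predictors by mean decrease in accuracy
imp <- importance(rf)
head(imp[order(imp[, "MeanDecreaseAccuracy"], decreasing = TRUE), ], 10)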
Looks like All-Star status and game score are at the top of both measures, which suggests that they are relatively strong predictors of All-NBA team membership.
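The evaluation objects used below (auroc.rf, auprc.rf, auc.rf, pr.rf, and mxe.rf) come from chunks that aren’t echoed in this report. Here is a minimal sketch of how they might be built, assuming the PRROC and ggplot2 packages (the pr.rf$auc.integral component used below matches the output of PRROC’s pr.curve()); the .rf.tune, .gb, and .gb.tune analogues later in the report would follow the same pattern.

# Sketch only: test-set evaluation objects for the random forest model
library(PRROC)
library(ggplot2)

labels <- as.numeric(as.character(test.data$allNBA))  # 0/1 outcome for the 2018 test set
probs  <- rf.pred[, 2]                                # predicted P(allNBA = 1)

# ROC and precision-recall curves (PRROC takes the scores split by class)
roc.obj <- roc.curve(scores.class0 = probs[labels == 1],
                     scores.class1 = probs[labels == 0], curve = TRUE)
pr.rf   <- pr.curve(scores.class0 = probs[labels == 1],
                    scores.class1 = probs[labels == 0], curve = TRUE)
auc.rf  <- roc.obj$auc

# ggplot versions of the curves so grid.arrange() can place them side by side
auroc.rf <- ggplot(data.frame(fpr = roc.obj$curve[, 1], tpr = roc.obj$curve[, 2]),
                   aes(fpr, tpr)) +
  geom_line() +
  labs(title = "AUROC", x = "False positive rate", y = "True positive rate")
auprc.rf <- ggplot(data.frame(recall = pr.rf$curve[, 1], precision = pr.rf$curve[, 2]),
                   aes(recall, precision)) +
  geom_line() +
  labs(title = "AUPRC", x = "Recall", y = "Precision")

# Cross-entropy (mean log loss), clipping probabilities to avoid log(0)
eps    <- 1e-15
p.clip <- pmin(pmax(probs, eps), 1 - eps)
mxe.rf <- -mean(labels * log(p.clip) + (1 - labels) * log(1 - p.clip))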
# Plots
gridExtra::grid.arrange(auroc.rf, auprc.rf, ncol=2)
# AUROC
auc.rf
## [1] 0.998151
# AUPRC
pr.rf$auc.integral
## [1] 0.9250388
# Cross-Entropy
mxe.rf
## [1] 0.02196089
rf.guards
rf.forwards
rf.centers
Even without tuning, the random forest model performs exceptionally well, with an AUROC of \(0.9981510\), AUPRC of \(0.9250388\), and cross-entropy loss of \(0.0219609\). Adjusting for the fact that Jimmy Butler and LaMarcus Aldridge made the All-NBA team as forwards, and Anthony Davis as a center, the model correctly predicted 13 out of 15 players!
We can attempt to improve on the previous model by tuning parameters with the caret package.
Here, we use 10-fold cross-validation and a simple grid search to find the optimal value of mtry, which determines the number of variables randomly sampled as candidates at each split. The default value for classification is \(mtry = \lfloor\sqrt{p}\rfloor\), where \(p\) is the number of predictors. In our case, that is \(\lfloor\sqrt{54}\rfloor = 7\) (which matches the value reported by the previous model), so we’ll define our grid to cover the values 1 through 15.
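A quick sanity check of that default (assuming every remaining column besides allNBA is treated as a predictor):

# floor(sqrt(p)) with p = 54 predictors gives the default mtry of 7 seen above
floor(sqrt(ncol(train.data) - 1))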
# caret (assumed loaded in the setup chunk of the original report)
library(caret)

# Create tuning df
train.data.tune <- train.data
levels(train.data.tune$allNBA) <- c("allNBA", "not") # Rename levels to valid R names for train()

# 10-fold CV with class probabilities so ROC can be used as the summary metric
fitControl <- trainControl(method = "cv",
                           number = 10,
                           search = "grid",
                           classProbs = TRUE,
                           summaryFunction = twoClassSummary)

# Define grid of parameters to try
tunegrid <- expand.grid(.mtry = c(1:15))

set.seed(42)
rf.tune <- train(allNBA ~ ., data = train.data.tune,
                 method = "rf",
                 metric = "ROC",
                 tuneGrid = tunegrid,
                 trControl = fitControl)
print(rf.tune)
## Random Forest
##
## 17585 samples
## 54 predictor
## 2 classes: 'allNBA', 'not'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 15828, 15826, 15826, 15826, 15827, 15826, ...
## Resampling results across tuning parameters:
##
## mtry ROC Sens Spec
## 1 0.9918000 0.9965408 0.5376996
## 2 0.9922197 0.9959545 0.6061321
## 3 0.9926631 0.9954855 0.6211538
## 4 0.9928568 0.9953096 0.6306967
## 5 0.9926948 0.9951923 0.6288099
## 6 0.9928858 0.9950164 0.6534470
## 7 0.9928009 0.9947232 0.6514877
## 8 0.9918921 0.9944887 0.6477866
## 9 0.9920960 0.9945473 0.6667271
## 10 0.9929486 0.9944301 0.6591074
## 11 0.9930554 0.9941956 0.6686502
## 12 0.9930928 0.9944301 0.6686502
## 13 0.9929302 0.9941369 0.6743469
## 14 0.9929726 0.9941370 0.6705007
## 15 0.9911000 0.9936680 0.6799347
##
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 12.
plot(rf.tune)
The results show an optimal setting of mtry = 12, so we’ll refit the model using this parameter and compare the results.
set.seed(42)
rf.tune <- randomForest(allNBA ~ ., data = train.data, mtry = 12, importance = TRUE)
rf.tune
##
## Call:
## randomForest(formula = allNBA ~ ., data = train.data, mtry = 12, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 12
##
## OOB estimate of error rate: 1.53%
## Confusion matrix:
## 0 1 class.error
## 0 16961 96 0.005628188
## 1 173 355 0.327651515
varImpPlot(rf.tune)
# Plots
gridExtra::grid.arrange(auroc.rf.tune, auprc.rf.tune, ncol=2)
# AUROC
auc.rf.tune
## [1] 0.9979456
# AUPRC
pr.rf.tune$auc.integral
## [1] 0.9195144
# Cross-Entropy
mxe.rf.tune
## [1] 0.02162433
The tuned model actually has a slightly lower AUROC (\(0.9979456\)) and AUPRC (\(0.9195144\)) than the untuned model. However, it has a slightly lower (better) cross-entropy loss (\(0.0216243\) vs. \(0.0219609\)). It also correctly predicts 13 out of 15 players.
As mentioned above, boosted trees use a different approach to ensembling than random forests. Random forests build an ensemble of deep, independent trees, while boosted trees are built sequentially from shallow, weak trees, with each new tree learning from the errors of the ones before it. We’ll build a boosted tree model using the gbm package. Implementation follows the UC Business Analytics guide.
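As a toy illustration of that sequential idea (not part of the original analysis, and assuming the rpart package), two rounds of boosting on squared-error residuals look roughly like this:

# Toy boosting sketch: each shallow tree is fit to the residuals of the current model
library(rpart)

boost.df <- train.data
boost.df$allNBA <- as.numeric(as.character(boost.df$allNBA)) # back to a numeric 0/1 outcome
lr <- 0.1 # learning rate (shrinkage)

f0 <- mean(boost.df$allNBA)                 # initial prediction: the base rate
boost.df$resid <- boost.df$allNBA - f0

tree1 <- rpart(resid ~ . - allNBA, data = boost.df, method = "anova",
               control = rpart.control(maxdepth = 2))
boost.df$resid <- boost.df$resid - lr * predict(tree1)

tree2 <- rpart(resid ~ . - allNBA, data = boost.df, method = "anova",
               control = rpart.control(maxdepth = 2))

# Ensemble prediction after two boosting rounds
boost.pred <- f0 + lr * predict(tree1) + lr * predict(tree2)

gbm automates this process (with a logistic rather than squared-error loss for the bernoulli distribution) over thousands of trees.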
We’ll try a basic implementation of gbm using mostly default settings. However, we’ll again use 10-fold cross-validation, as well as increase n.trees to 5000 and decrease shrinkage to 0.01.
# Load package
library(gbm)
# Reload the train/test data (gbm's bernoulli distribution expects a numeric 0/1 outcome, not a factor)
train.data <- players %>%
  filter(year < 2018) %>%
  select(-c(playerID, year, tmID, center, forward, guard))
test.data <- players %>%
  filter(year == 2018) %>%
  select(-c(playerID, year, tmID, center, forward, guard))

# Build model
set.seed(42)
gb <- gbm(
  formula = allNBA ~ .,
  distribution = "bernoulli",
  data = train.data,
  n.trees = 5000,
  shrinkage = 0.01,
  cv.folds = 10)
# Variable importance plot for top 10 variables
vip::vip(gb)
Just as with the random forest models, the boosted tree model places high importance on All-Star status and game score.
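As before, the evaluation objects below (auroc.gb, auprc.gb, auc.gb, pr.gb, and mxe.gb) come from chunks that aren’t echoed. A minimal sketch of how the test-set probabilities behind them might be obtained (gbm.perf() selects the CV-optimal number of trees; the metrics would then follow the same PRROC/log-loss pattern shown earlier):

# CV-selected number of trees, then predicted probabilities on the 2018 test set
best.iter <- gbm.perf(gb, method = "cv", plot.it = FALSE)
gb.pred   <- predict(gb, newdata = test.data, n.trees = best.iter, type = "response")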
# Plots
gridExtra::grid.arrange(auroc.gb, auprc.gb, ncol=2)
# AUROC
auc.gb
## [1] 0.9973292
# AUPRC
pr.gb$auc.integral
## [1] 0.867211
# Cross-Entropy
mxe.gb
## [1] 0.02347881
Without any tuning, the boosted tree model still has excellent AUROC, AUPRC, and cross-entropy loss. It predicts the same number of correct players as the random forest models, though the rosters are shuffled around a bit. Next, we’ll try tuning some parameters using caret.
caret provides the following 4 options for tuning parameters. From the author’s bookdown document:

- n.trees (# Boosting Iterations)
- interaction.depth (Max Tree Depth) - tree complexity
- shrinkage (Shrinkage) - learning rate
- n.minobsinnode (Min. Terminal Node Size) - smallest allowable leaf

Unlike the random forest model, which only needs mtry adjusted, boosted tree models are much more involved when it comes to tuning. Our parameter grid below contains \(3^4 = 81\) total parameter combinations. To help speed up computation, the allowParallel argument in the trainControl function is set to TRUE.
# Parameter grid
gbmGrid <- expand.grid(interaction.depth = c(1, 3, 5),
                       n.trees = c(2500, 5000, 10000),
                       shrinkage = c(0.01, 0.05, 0.1),
                       n.minobsinnode = c(5, 10, 15))

# Control settings
gbmControl <- trainControl(method = "cv",
                           number = 10,
                           search = "grid",
                           classProbs = TRUE,
                           allowParallel = TRUE,
                           summaryFunction = twoClassSummary)

# Build models
set.seed(42)
gb.opt <- train(allNBA ~ ., data = train.data,
                method = "gbm",
                metric = "ROC",
                tuneGrid = gbmGrid,
                trControl = gbmControl)
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 0.2580 nan 0.0100 0.0057
## 10 0.2047 nan 0.0100 0.0020
## 100 0.1027 nan 0.0100 0.0002
## 500 0.0566 nan 0.0100 -0.0000
## 1000 0.0432 nan 0.0100 -0.0000
## 2500 0.0238 nan 0.0100 -0.0000
print(gb.opt)
## Stochastic Gradient Boosting
##
## 17585 samples
## 54 predictor
## 2 classes: 'allNBA', 'not'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 15828, 15826, 15826, 15826, 15827, 15826, ...
## Resampling results across tuning parameters:
##
## shrinkage interaction.depth n.minobsinnode n.trees ROC
## 0.01 1 5 2500 0.9925547
## 0.01 1 5 5000 0.9926221
## 0.01 1 5 10000 0.9925256
## 0.01 1 10 2500 0.9925304
## 0.01 1 10 5000 0.9925298
## 0.01 1 10 10000 0.9924287
## 0.01 1 15 2500 0.9926210
## 0.01 1 15 5000 0.9926334
## 0.01 1 15 10000 0.9925477
## 0.01 3 5 2500 0.9932405
## 0.01 3 5 5000 0.9931004
## 0.01 3 5 10000 0.9928965
## 0.01 3 10 2500 0.9931590
## 0.01 3 10 5000 0.9930040
## 0.01 3 10 10000 0.9928857
## 0.01 3 15 2500 0.9932580
## 0.01 3 15 5000 0.9931065
## 0.01 3 15 10000 0.9929454
## 0.01 5 5 2500 0.9932267
## 0.01 5 5 5000 0.9932036
## 0.01 5 5 10000 0.9931293
## 0.01 5 10 2500 0.9932362
## 0.01 5 10 5000 0.9931932
## 0.01 5 10 10000 0.9931291
## 0.01 5 15 2500 0.9933334
## 0.01 5 15 5000 0.9932470
## 0.01 5 15 10000 0.9931189
## 0.05 1 5 2500 0.9924139
## 0.05 1 5 5000 0.9920860
## 0.05 1 5 10000 0.9913018
## 0.05 1 10 2500 0.9922249
## 0.05 1 10 5000 0.9917101
## 0.05 1 10 10000 0.9911202
## 0.05 1 15 2500 0.9922416
## 0.05 1 15 5000 0.9918316
## 0.05 1 15 10000 0.9909350
## 0.05 3 5 2500 0.9925773
## 0.05 3 5 5000 0.9924258
## 0.05 3 5 10000 0.9922565
## 0.05 3 10 2500 0.9927159
## 0.05 3 10 5000 0.9926647
## 0.05 3 10 10000 0.9925780
## 0.05 3 15 2500 0.9926644
## 0.05 3 15 5000 0.9925763
## 0.05 3 15 10000 0.9922057
## 0.05 5 5 2500 0.9930087
## 0.05 5 5 5000 0.9927923
## 0.05 5 5 10000 0.9911668
## 0.05 5 10 2500 0.9929238
## 0.05 5 10 5000 0.9927555
## 0.05 5 10 10000 0.9925631
## 0.05 5 15 2500 0.9914765
## 0.05 5 15 5000 0.9928544
## 0.05 5 15 10000 0.9924921
## 0.10 1 5 2500 0.9922804
## 0.10 1 5 5000 0.9913557
## 0.10 1 5 10000 0.9900686
## 0.10 1 10 2500 0.9918524
## 0.10 1 10 5000 0.9910617
## 0.10 1 10 10000 0.9900409
## 0.10 1 15 2500 0.9916587
## 0.10 1 15 5000 0.9908605
## 0.10 1 15 10000 0.9899106
## 0.10 3 5 2500 0.9926214
## 0.10 3 5 5000 0.9924569
## 0.10 3 5 10000 0.9922713
## 0.10 3 10 2500 0.9926546
## 0.10 3 10 5000 0.9925491
## 0.10 3 10 10000 0.9925334
## 0.10 3 15 2500 0.9924481
## 0.10 3 15 5000 0.9922049
## 0.10 3 15 10000 0.9922325
## 0.10 5 5 2500 0.9927748
## 0.10 5 5 5000 0.9926944
## 0.10 5 5 10000 0.9892830
## 0.10 5 10 2500 0.9925895
## 0.10 5 10 5000 0.9910017
## 0.10 5 10 10000 0.9859020
## 0.10 5 15 2500 0.9928859
## 0.10 5 15 5000 0.9925326
## 0.10 5 15 10000 0.9851374
## Sens Spec
## 0.9929057 0.6855951
## 0.9930231 0.6874456
## 0.9927885 0.6837083
## 0.9930816 0.6836720
## 0.9933161 0.6723149
## 0.9929644 0.6742380
## 0.9932575 0.6893687
## 0.9932575 0.6855588
## 0.9928471 0.6818578
## 0.9930230 0.7083091
## 0.9927297 0.7064224
## 0.9933161 0.7064586
## 0.9927299 0.7064586
## 0.9927884 0.7064224
## 0.9930229 0.7083817
## 0.9929058 0.7121190
## 0.9930230 0.7121916
## 0.9933162 0.6969521
## 0.9931403 0.7216618
## 0.9931988 0.7141147
## 0.9933161 0.7046081
## 0.9936093 0.7178157
## 0.9935507 0.7292090
## 0.9934921 0.7140421
## 0.9936680 0.7158563
## 0.9933748 0.7139332
## 0.9937266 0.7120827
## 0.9926127 0.7046081
## 0.9922023 0.6932511
## 0.9917332 0.6836720
## 0.9928472 0.6742743
## 0.9920851 0.6780842
## 0.9920265 0.6799347
## 0.9927885 0.6856676
## 0.9920264 0.6706459
## 0.9916748 0.6724601
## 0.9928471 0.7158926
## 0.9929056 0.6988752
## 0.9933747 0.6951016
## 0.9930816 0.6969884
## 0.9935506 0.7045356
## 0.9936093 0.7046081
## 0.9933162 0.7159289
## 0.9936680 0.7045718
## 0.9939025 0.7045718
## 0.9931989 0.7083817
## 0.9935506 0.7121916
## 0.9936092 0.7007983
## 0.9937853 0.7026851
## 0.9940197 0.6914006
## 0.9937852 0.6951379
## 0.9933747 0.7235123
## 0.9936092 0.7158926
## 0.9937265 0.7121190
## 0.9919092 0.6931785
## 0.9912643 0.6817489
## 0.9912643 0.6837446
## 0.9923782 0.6742743
## 0.9919679 0.6703919
## 0.9919093 0.6723512
## 0.9922609 0.6818940
## 0.9922610 0.6628810
## 0.9916747 0.6533382
## 0.9931989 0.7101959
## 0.9931402 0.7102685
## 0.9934920 0.7007983
## 0.9937265 0.7027213
## 0.9934335 0.7065312
## 0.9934333 0.7084180
## 0.9934921 0.6988389
## 0.9938438 0.6969521
## 0.9936679 0.7045718
## 0.9934920 0.7178157
## 0.9937265 0.7121916
## 0.9941956 0.7084180
## 0.9936679 0.7101959
## 0.9937265 0.6989478
## 0.9938437 0.7007983
## 0.9938438 0.7178157
## 0.9935506 0.7027213
## 0.9934921 0.7007620
##
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 2500,
## interaction.depth = 5, shrinkage = 0.01 and n.minobsinnode = 15.
The train function reports that the final values used for the optimal model are n.trees = 2500, interaction.depth = 5, shrinkage = 0.01, and n.minobsinnode = 15, so let’s rebuild the boosted tree model with those parameters.
# Build model
set.seed(42)
gb.tune <- gbm(
  formula = allNBA ~ .,
  distribution = "bernoulli",
  data = train.data,
  n.trees = 2500,
  interaction.depth = 5,
  shrinkage = 0.01,
  n.minobsinnode = 15,
  cv.folds = 10)
# Variable importance plot for top 10 variables
vip::vip(gb.tune)
# Plots
gridExtra::grid.arrange(auroc.gb.tune, auprc.gb.tune, ncol=2)
# AUROC
auc.gb.tune
## [1] 0.9974319
# AUPRC
pr.gb.tune$auc.integral
## [1] 0.8734863
# Cross-Entropy
mxe.gb.tune
## [1] 0.02299367
The tuned boosted tree model provides a very slight improvement over the untuned model. As with the random forest models, it correctly predicts 13 out of 15 players, with the order of players slightly different.
We can now compare our evaluation metrics for our tree ensembles and our previous regularized regression models.
| Model | AUROC | AUPRC | Cross-Entropy |
|---|---|---|---|
| Lasso | 0.9930 | 0.8369 | 0.0412 |
| Ridge | 0.9915 | 0.8219 | 0.0637 |
| Elastic-Net | 0.9932 | 0.8412 | 0.0417 |
| Random Forest | 0.9979 | 0.9195 | 0.0216 |
| Boosted Tree | 0.9974 | 0.8735 | 0.0230 |
The random forest model performed the best based on all evaluation criteria, so let’s use that to predict the 2019 All-NBA team!
# Subset the 2018-2019 regular season data and drop the non-predictor columns
players.2019 <- players %>%
  filter(year == 2019) %>%
  select(-c(playerID, year, tmID, center, forward, guard, allNBA))

# Predicted class probabilities from the tuned random forest
set.seed(42)
rf.tune.pred.2019 <- predict(rf.tune, players.2019, type = "prob")
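The roster table below was presumably assembled by ranking the 2019 players within each position group by their predicted probability, as was done for the 2018 test set; a minimal sketch (the original chunk is not shown, and the object names here are illustrative):

# Rank 2019 players within each position group by predicted probability
rf.results.2019 <- players %>%
  filter(year == 2019) %>%
  mutate(probability = as.vector(rf.tune.pred.2019[, 2])) %>%
  select(playerID, probability, center, forward, guard)

# Six guards, six forwards, and three centers fill out the three All-NBA teams
top.guards   <- rf.results.2019 %>% filter(guard == 1)   %>% arrange(desc(probability)) %>% head(6)
top.forwards <- rf.results.2019 %>% filter(forward == 1) %>% arrange(desc(probability)) %>% head(6)
top.centers  <- rf.results.2019 %>% filter(center == 1)  %>% arrange(desc(probability)) %>% head(3)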
| Position | First Team | Second Team | Third Team |
|---|---|---|---|
| Guard | Bradley Beal | Damian Lillard | Kemba Walker |
| Guard | Russell Westbrook | James Harden | Stephen Curry |
| Forward | Giannis Antetokounmpo | Paul George | LeBron James |
| Forward | Kevin Durant | Blake Griffin | Kawhi Leonard |
| Center | Karl-Anthony Towns | Nikola Jokic | Joel Embiid |
The random forest model produces a very different roster than what our regularized regression models came up with. There are admittedly some questionable choices in there; however, this only highlights the fact that the model’s predictions should be used only as a guide. There are still plenty of improvements that could be made.
Regardless, it’ll be interesting to see how each predicted roster compares with the actual roster. As of today (May 15th, 2019), we’ll just have to keep waiting!