Machine Learning

Introduction

This report is a continuation of the All-NBA Team Capstone project, which utilizes historical NBA statistics from 1937 to 2012 to predict All-NBA Teams. It will cover the logistic regression modeling of the cleaned players data using base R and the glmnet package.

See RPubs for the data cleaning and exploratory data analysis reports, or check my capstone project repository. Also, see this report for web scraping the 2018-2019 stats.

Importing the Data

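# Load packages used throughout (dplyr re-exports as_tibble and %>%)
library(dplyr)
library(glmnet)

# Store clean data as "players"
players <- as_tibble(read.csv("players_EDA.csv"))
# Drop award columns (see note below)
players <- players %>%
  select(-c(allDefFirstTeam, allDefSecondTeam, MVP, defPOTY))
players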

Note: NBA awards are removed as they are typically announced after the All-NBA teams.

Splitting the Data

Data is typically split in an 80/20 ratio for the train and test sets, respectively. With that in mind, we’ll first build our model reserving the last 6 seasons as our test set.
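As a rough illustration of how a season-based split compares with the 80/20 rule of thumb, the sketch below measures the share of rows that land in each set. The toy data frame is hypothetical (random seasons, made-up row count); with the real data, the players tibble would take its place.

```r
# Toy stand-in for the cleaned data: one row per player-season (hypothetical).
set.seed(1)
toy <- data.frame(year = sample(1990:2011, size = 2000, replace = TRUE))

# Season-based split: seasons before 2006 train, 2006 onward test.
is.test <- toy$year >= 2006

# Share of rows in each set, to compare against the 80/20 rule of thumb.
split.share <- c(train = mean(!is.test), test = mean(is.test))
round(split.share, 2)
```

Splitting by season rather than at random keeps later seasons out of the training set, which is closer to how the model would be used in practice.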

# Only include variables we're actually regressing on
train.data <- players %>% 
  filter(year < 2006) %>%
  select(-c(playerID, year, tmID, allNBAFirstTeam, allNBASecondTeam, allNBAThirdTeam, center, forward, guard))

test.data <- players %>% 
  filter(year >= 2006) %>%
  select(-c(playerID, year, tmID, allNBAFirstTeam, allNBASecondTeam, allNBAThirdTeam, center, forward, guard))

Building the Models

Regularization

Regularization is a technique used to prevent model overfitting by imposing a penalty on model coefficients. There are 3 commonly used penalized regression models that we’ll use:

Lasso: only the most contributive features are kept; the rest are set to exactly zero (automatic feature selection)
Ridge: all features are kept, but the coefficients of less contributive ones are shrunk toward zero (feature shrinkage)
Elastic-Net: a combination of the above

We can implement these using the glmnet package which, per the Glmnet Vignette, solves the following problem:

\[\min\limits_{\beta_0, \beta} \frac{1}{N} \sum\limits^N_{i = 1} w_i l (y_i, \beta_0 + \beta^T x_i) + \lambda[(1 - \alpha) ||\beta||^2_2 /2 + \alpha ||\beta||_1]\]

The “strength” of the penalty for all 3 models is dependent on \(\lambda\), which is one of 2 tunable parameters in regularized regression. The other parameter is \(\alpha\), which determines the ratio of the two penalty types. Also, notice that ridge and lasso are technically special cases of elastic-net (\(\alpha = 0\) and \(\alpha = 1\), respectively).
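To make the roles of \(\lambda\) and \(\alpha\) concrete, here is a minimal sketch of the penalty term alone, written directly from the formula above. The function name enet.penalty and the coefficient values are hypothetical; this is not how glmnet computes things internally.

```r
# Elastic-net penalty term from the objective above:
#   lambda * [ (1 - alpha) * ||beta||_2^2 / 2 + alpha * ||beta||_1 ]
enet.penalty <- function(beta, lambda, alpha) {
  l2 <- sum(beta^2) / 2   # ridge part
  l1 <- sum(abs(beta))    # lasso part
  lambda * ((1 - alpha) * l2 + alpha * l1)
}

beta <- c(0.5, -2, 0)     # hypothetical coefficients (intercept excluded)

ridge.pen <- enet.penalty(beta, lambda = 0.1, alpha = 0)    # pure ridge
lasso.pen <- enet.penalty(beta, lambda = 0.1, alpha = 1)    # pure lasso
mixed.pen <- enet.penalty(beta, lambda = 0.1, alpha = 0.5)  # even mix
c(ridge = ridge.pen, lasso = lasso.pen, mixed = mixed.pen)
```

Setting alpha to 0 or 1 drops one of the two terms entirely, which is why ridge and lasso fall out as special cases.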

Threshold Value

With a logistic regression model, we would normally pick a threshold value for the predicted probability (the inverse logit of the model's linear predictor) that determines membership of the positive (allNBA = 1) or negative class (allNBA = 0). For instance, if the threshold value is 0.50, our model would label allNBA = 1 for probabilities greater than 0.50.

In practice, the selection of a threshold value is a business decision that depends upon the willingness to accept false positives (or false negatives). In our case, there is obviously no consequence for setting our threshold higher or lower. More importantly, however, there are a couple of constraints we have to consider regarding the selection of the All-NBA team that we didn’t build into our model:

Each All-NBA team roster must have 2 forwards, 2 guards, and 1 center.
Players who play “hybrid” positions (e.g. C-F, F-G) can be awarded honors in one position or another, depending on how votes shape up.

So, rather than set a threshold value, we’ll simply group by player position and sort by descending probability and determine based on rank. In the ideal case of correctly identifying all 15 unique members: the top 6 players in the guard and forward lists and the top 3 players in the center list make up all 3 rosters. Of course, because of how player positions are encoded in the data, there may be some overlap that we’ll have to look out for.
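The rank-based selection described above can be sketched in base R as follows. The player names, probabilities, and the helper pick.top are all made up for illustration, and the toy quotas are smaller than the 6 guards, 6 forwards, and 3 centers used in the report.

```r
# Hypothetical predicted probabilities for one season (names and values made up).
preds <- data.frame(
  player   = paste0("player", 1:12),
  position = rep(c("guard", "forward", "center"), each = 4),
  prob     = c(0.91, 0.40, 0.75, 0.10,   # guards
               0.88, 0.95, 0.05, 0.60,   # forwards
               0.30, 0.80, 0.20, 0.55),  # centers
  stringsAsFactors = FALSE
)

# Slots per position (the report would use guard = 6, forward = 6, center = 3).
slots <- c(guard = 2, forward = 2, center = 1)

pick.top <- function(df, slots) {
  picks <- lapply(names(slots), function(pos) {
    d <- df[df$position == pos, ]
    d <- d[order(-d$prob), ]      # sort by descending probability
    head(d, slots[[pos]])         # keep the top-ranked players
  })
  do.call(rbind, picks)
}

roster <- pick.top(preds, slots)
roster
```

Because no threshold is involved, only the ordering of probabilities within each position matters. A player listed under two positions would need to be de-duplicated by hand, as noted above.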

Model Evaluation Criteria

Area Under the ROC Curve (AUROC)

All things considered, we simply want our model to have the best sorting ability (as opposed to evaluating model performance by accuracy, specificity, recall, etc.). To characterize this in a way that is insensitive to unbalanced classes (as in our case), we use the area under the ROC curve (AUROC) as the criterion for cross-validation. The receiver operating characteristic (ROC) curve is a plot of true positive rate (TPR) against false positive rate (FPR) at various threshold values, so the AUROC describes the probability that our model assigns a higher score to a randomly chosen All-NBA team member than to a randomly chosen non-member.

Using AUROC as our performance metric is easily done by adding the argument type.measure = "auc" to the cv.glmnet call.
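For intuition about what this metric measures, the AUROC can also be computed by hand as the share of member/non-member pairs in which the member receives the higher score. The sketch below uses hypothetical labels and scores; auc.rank is an illustrative helper, not a glmnet function.

```r
# AUROC via the rank (Mann-Whitney) formulation: the share of
# positive/negative pairs in which the positive gets the higher score
# (ties count as half).
auc.rank <- function(prob, label) {
  pos <- prob[label == 1]
  neg <- prob[label == 0]
  cmp <- outer(pos, neg, function(a, b) (a > b) + 0.5 * (a == b))
  mean(cmp)
}

# Hypothetical scores: 3 members, 5 non-members.
label <- c(1, 1, 1, 0, 0, 0, 0, 0)
prob  <- c(0.9, 0.7, 0.2, 0.6, 0.3, 0.1, 0.05, 0.02)

auc.toy <- auc.rank(prob, label)
auc.toy
```

Here 13 of the 15 member/non-member pairs are ordered correctly, so the toy AUROC is 13/15. The value depends only on the ranking, which fits the sorting goal described above.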

Area Under the Precision-Recall Curve (AUPRC)

Precision is the ratio of the number of true positives (TP) divided by the sum of true positives and false positives (FP):

\[Precision = \frac{TP}{TP + FP}\]

It basically describes the proportion of positive class predictions (i.e. allNBA = 1) that were actually correct.

Recall is the ratio of the number of true positives divided by the sum of true positives and false negatives (FN):

\[Recall = \frac{TP}{TP + FN}\]

It describes the proportion of actual All-NBA members that the model correctly identified.

Considering both precision and recall is particularly useful when a problem involves imbalanced classes, which is the case with our problem (i.e. there are many more non-members than there are members). Note that, in the equations above, we aren’t concerned with the number of true negatives - because of the heavy class imbalance, our model’s ability to predict the negative class isn’t as important as its ability to correctly predict the minority (positive) class.

Though it isn’t as easily interpretable as the area under the ROC curve, the area under the precision-recall curve also provides a measure of model performance. A perfect classifier will have \(AUC = 1.0\). So, as with the ROC curve, an AUC value closer to \(1.0\) suggests better performance; on the precision-recall plot, this corresponds to a curve that sits as close to the upper-right corner as possible.

We can calculate the area under the precision-recall curve using the pr.curve function from the PRROC package.
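Each point on the precision-recall curve comes from evaluating the two formulas above at a single threshold. A minimal base R sketch, using hypothetical labels and scores (prec.recall is an illustrative helper, not part of PRROC):

```r
# Precision and recall at a single probability threshold.
prec.recall <- function(prob, label, threshold) {
  pred <- as.integer(prob > threshold)
  tp <- sum(pred == 1 & label == 1)   # members predicted as members
  fp <- sum(pred == 1 & label == 0)   # non-members predicted as members
  fn <- sum(pred == 0 & label == 1)   # members the model missed
  c(precision = tp / (tp + fp), recall = tp / (tp + fn))
}

# Hypothetical scores: 3 members, 5 non-members.
label <- c(1, 1, 1, 0, 0, 0, 0, 0)
prob  <- c(0.9, 0.7, 0.2, 0.6, 0.3, 0.1, 0.05, 0.02)

pr.50 <- prec.recall(prob, label, threshold = 0.50)
pr.65 <- prec.recall(prob, label, threshold = 0.65)
rbind(pr.50, pr.65)
```

Raising the threshold from 0.50 to 0.65 removes the one false positive, so precision rises while recall stays the same in this toy case. Note that precision is undefined (NaN) at thresholds where nothing is predicted positive.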

Cross-Entropy

Cross-Entropy is a typical cost function used with classifiers. The value of the cross-entropy of a given observation is based on two things:

The actual class
The model’s probability of the actual class

For binary classification in particular, cross-entropy can be calculated as

\[-[y \ln(p) + (1-y) \ln(1-p)]\]

where \(y\) is the indicator (\(0\) or \(1\)) and \(p\) is the predicted probability. Cross-entropy loss increases rapidly for non-member predictions when the player actually made any roster, and vice versa. A perfect model will have a loss of 0, so the lower the cross-entropy loss, the better a model performed.

From the ROCR package, we can use performance() with the argument measure = "mxe" to calculate mean cross-entropy.
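The formula above is simple enough to compute by hand, which can help when sanity-checking a reported value. The sketch below is a base R version with hypothetical labels and probabilities; mean.xent is an illustrative helper, and the clipping constant eps is an assumption added to keep the logarithm finite.

```r
# Mean binary cross-entropy, following the formula above.
mean.xent <- function(y, p, eps = 1e-15) {
  p <- pmin(pmax(p, eps), 1 - eps)   # clip so log() stays finite
  -mean(y * log(p) + (1 - y) * log(1 - p))
}

y <- c(1, 0, 1, 0)                    # hypothetical true classes

loss.good <- mean.xent(y, c(0.9, 0.1, 0.8, 0.2))  # confident and right
loss.bad  <- mean.xent(y, c(0.1, 0.9, 0.2, 0.8))  # confident and wrong
c(good = loss.good, bad = loss.bad)
```

The second set of predictions is confidently wrong on every observation, so its loss is far larger than the first.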

Lasso

\(L_1\), or Lasso (Least Absolute Shrinkage and Selection Operator), regularization relies on the \(L_1\) norm (absolute size) to impose penalties on coefficients \(\beta_i\).

Practically speaking, this type of penalty results in less-significant variables being “turned off” (i.e. \(\beta_i = 0\)).

The degree of the penalty is dependent upon tuning parameter \(\lambda\), whose optimal value we can determine using cross-validation on our training set.

To perform a lasso regression in glmnet, we simply set the argument alpha = 1 (see formula above).

# Create inputs for glmnet
x <- train.data %>% select(-allNBA) %>% as.matrix # glmnet requires a numeric predictor matrix
y <- train.data$allNBA

# Find optimal lambda via cross-validation
set.seed(42)
cv.lasso <- cv.glmnet(x, y, alpha = 1, family = "binomial", type.measure = "auc") # Perform cross-validation based on AUROC

# Make predictions on test data
x.test <- test.data %>% select(-allNBA) %>% as.matrix
p <- cv.lasso %>% predict(newx = x.test, s = "lambda.min", type = "response") %>% as.numeric # store probabilities

To explore the predictions for each year, we can use the interactive tables below, which were created using the DT package.

Guards

Forwards

Centers

Positive Coefficients

##             lgAssists              tmPoints               GPRatio 
##          3.467938e+02          2.398640e+01          3.464876e+00 
##               allstar         blocksPerGame         stealsPerGame 
##          2.619913e+00          4.306973e-01          4.201650e-01 
##          avgGameScore       reboundsPerGame      threeMadePerGame 
##          3.994291e-01          3.028850e-01          1.245655e-01 
##      oReboundsPerGame      dReboundsPerGame threeAttemptedPerGame 
##          1.030358e-01          7.590572e-02          5.420529e-02 
##        assistsPerGame           ftAttempted                ftMade 
##          8.705046e-03          5.857263e-04          4.407272e-04

Zero Coefficients

##                 GP             points          oRebounds 
##                  0                  0                  0 
##          dRebounds           rebounds            assists 
##                  0                  0                  0 
##             steals             blocks             fgMade 
##                  0                  0                  0 
##          threeMade     minutesPerGame      pointsPerGame 
##                  0                  0                  0 
##      fgMadePerGame ftAttemptedPerGame      ftMadePerGame 
##                  0                  0                  0 
##            healthy           lgPoints         lgRebounds 
##                  0                  0                  0 
##        lgORebounds         tmRebounds        tmDRebounds 
##                  0                  0                  0 
##              fgPct             efgPct        dReboundPct 
##                  0                  0                  0 
##     totalGameScore 
##                  0

Negative Coefficients

##     threeAttempted            minutes        fgAttempted 
##      -6.250630e-04      -6.908618e-04      -8.186588e-04 
##          turnovers                 PF fgAttemptedPerGame 
##      -5.133172e-03      -6.280982e-03      -7.110358e-02 
##   turnoversPerGame           threePct        astTovRatio 
##      -1.411060e-01      -1.720310e-01      -2.701563e-01 
##        tmORebounds        oReboundPct              ftPct 
##      -1.924349e+00      -2.037514e+00      -2.880400e+00 
##          tmAssists        (Intercept)        lgDRebounds 
##      -4.150487e+00      -1.120680e+01      -2.139338e+02
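The report does not show how the three coefficient groups above were produced. One way such a grouping could be done is sketched below on a small named vector that reuses a few of the values printed above. The extraction line in the comment is an assumption about how the vector might be pulled from the fitted model.

```r
# A few coefficients copied from the output above. With a fitted model,
# the full vector could presumably be obtained with something like
#   b <- as.matrix(coef(cv.lasso, s = "lambda.min"))[, 1]
b <- c(allstar = 2.619913, GPRatio = 3.464876, points = 0, fgPct = 0,
       turnoversPerGame = -0.141106, ftPct = -2.8804)

positive <- sort(b[b > 0], decreasing = TRUE)   # largest first
zero     <- b[b == 0]                           # "turned off" by the lasso
negative <- sort(b[b < 0], decreasing = TRUE)   # closest to zero first

list(positive = positive, zero = zero, negative = negative)
```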

Lasso Metrics

# Plots
gridExtra::grid.arrange(auroc.lasso, auprc.lasso, ncol=2)

# AUROC
max(cv.lasso$cvm)

## [1] 0.9934185

# AUPRC
pr.lasso$auc.integral

## [1] 0.7796794

# Cross-Entropy
mxe.lasso

## [1] 0.04313924

From the looks of it, our model is doing a pretty decent job! The AUROC and AUPRC of this model are \(0.9934185\) and \(0.7796794\) (respectively), and it looks like the correct players are within the top 15 or so most probable players per position. But, let’s see if we can make an adjustment in our train/test split to try and improve our model.

Maximizing training data

Since we’d presumably use this model to predict All-NBA teams every year, we should try using only the last season (2011-2012) as our test set. Hopefully, maximizing our training data will improve the model. We’ll also make corrections to the positions of Rajon Rondo, Dwight Howard, and Dirk Nowitzki.

# Overwrite train/test sets
train.data <- players %>% 
  filter(year != 2011) %>%
  select(-c(playerID, year, tmID, allNBAFirstTeam, allNBASecondTeam, allNBAThirdTeam, center, forward, guard))

test.data <- players %>% 
  filter(year == 2011) %>%
  select(-c(playerID, year, tmID, allNBAFirstTeam, allNBASecondTeam, allNBAThirdTeam, center, forward, guard))

# glmnet inputs
x <- train.data %>% select(-allNBA) %>% as.matrix 
y <- train.data$allNBA

# Find optimal lambda via cross-validation
set.seed(42)
cv.lasso <- cv.glmnet(x, y, alpha = 1, family = "binomial", type.measure = "auc") # Perform cross-validation based on AUC

# Make predictions on test data
x.test <- test.data %>% select(-allNBA) %>% as.matrix
p <- cv.lasso %>% predict(newx = x.test, s = "lambda.min", type = "response") %>% as.numeric

Guards

lasso.guard.predictions.2011

Forwards

lasso.forward.predictions.2011

Centers

lasso.center.predictions.2011

Positive Coefficients

##             lgAssists              tmPoints               GPRatio 
##          2.644141e+02          2.836507e+01          4.883256e+00 
##               allstar         blocksPerGame       reboundsPerGame 
##          2.221494e+00          5.534877e-01          4.629383e-01 
##          avgGameScore               healthy threeAttemptedPerGame 
##          4.184888e-01          3.793851e-01          2.829082e-01 
##         stealsPerGame      oReboundsPerGame        assistsPerGame 
##          2.630618e-01          1.164550e-01          9.175199e-02 
##           lgORebounds         ftMadePerGame             dRebounds 
##          2.065856e-02          6.797864e-03          3.400448e-03 
##        totalGameScore                ftMade 
##          8.442028e-04          7.547251e-05

Zero Coefficients

##                 GP             points          oRebounds 
##                  0                  0                  0 
##           rebounds            assists             steals 
##                  0                  0                  0 
##          turnovers             fgMade        ftAttempted 
##                  0                  0                  0 
##          threeMade      pointsPerGame   dReboundsPerGame 
##                  0                  0                  0 
## ftAttemptedPerGame   threeMadePerGame           lgPoints 
##                  0                  0                  0 
##         lgRebounds              fgPct             efgPct 
##                  0                  0                  0 
##        astTovRatio        dReboundPct 
##                  0                  0

Negative Coefficients

##        fgAttempted             blocks            minutes 
##      -5.286172e-04      -1.428885e-03      -1.510106e-03 
##     threeAttempted     minutesPerGame                 PF 
##      -3.684711e-03      -4.657490e-03      -6.184822e-03 
##      fgMadePerGame fgAttemptedPerGame   turnoversPerGame 
##      -1.271416e-02      -1.054196e-01      -2.008124e-01 
##           threePct        tmORebounds         tmRebounds 
##      -2.606585e-01      -6.922066e-01      -1.883970e+00 
##              ftPct        oReboundPct        tmDRebounds 
##      -2.981195e+00      -4.645660e+00      -4.879549e+00 
##          tmAssists        (Intercept)        lgDRebounds 
##      -5.488186e+00      -1.136295e+01      -4.279627e+02

Lasso Metrics

# Plots
gridExtra::grid.arrange(auroc.lasso, auprc.lasso, ncol=2)

# AUROC
max(cv.lasso$cvm)

## [1] 0.9930087

# AUPRC
pr.lasso$auc.integral

## [1] 0.8369115

# Cross-Entropy
mxe.lasso

## [1] 0.04119739

For the most part the predicted line-ups didn’t change (Brandon Jennings is no longer in the top 6, moving Tony Parker closer to the appropriate rank). Additionally, the top contenders for each position have higher probabilities relative to the previous model. Despite a slightly lower AUROC of \(0.9930087\), the AUPRC is much-improved at \(0.8369115\). It also looks like taking advantage of as many seasons as possible gave this model a slight edge over the previous one, given the slight improvement in cross-entropy losses \((0.04119739 < 0.04313924)\). Considering all of these facts, we’ll use this method when building the other models.

Interpreting the Regression Coefficients

The logit mentioned at the beginning of the report represents the log odds, not probabilities. This means that, for any given variable and holding other variables constant, a one-unit increase multiplies the odds by \(e^{\beta}\) (an odds ratio, not a ratio of probabilities). For example, the regression coefficient for All-Star team membership is \(\beta = 2.221494\). Then, holding every other variable fixed, the ratio of the odds of making an All-NBA team roster for an All-Star (allstar = 1) to the odds for a non-All-Star (allstar = 0) is \(e^{2.221494} = 9.221097\). In other words, the odds are 9.22 times greater for All-Stars to make an All-NBA roster (if everything else is equal).
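Since \(e^{\beta}\) acts on odds rather than on probabilities, its effect on the probability depends on the starting point. The sketch below works through one case, using the lasso coefficient quoted above and a hypothetical 5% baseline probability for a non-All-Star (that baseline is an assumption chosen purely for illustration).

```r
beta.allstar <- 2.221494            # lasso coefficient quoted above
odds.ratio   <- exp(beta.allstar)   # multiplicative change in the odds

# Hypothetical baseline: a non-All-Star with a 5% chance of making a roster.
p0    <- 0.05
odds0 <- p0 / (1 - p0)              # convert probability to odds
odds1 <- odds0 * odds.ratio         # apply the odds ratio
p1    <- odds1 / (1 + odds1)        # convert back to a probability

c(odds.ratio = odds.ratio, p0 = p0, p1 = p1, prob.ratio = p1 / p0)
```

In this example the odds grow by a factor of about 9.22, but the probability grows by a smaller factor, which is why the coefficient should be read as an odds ratio.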

Lasso Discussion

It’s interesting to see that the lasso model “turned off” many of the offensive variables like points, assists, and field goals, and even more interesting that it penalizes others like field goal attempts. Some of these penalties make sense, for instance, personal fouls and turnovers per game. For others, like blocks, the penalties are probably a result of the algorithm adjusting for bias due to other features over-representing the raw statistic. This can be seen by comparing the blocks coefficient to the blocksPerGame coefficient. It looks like our features GPRatio and healthy, which were inspired by our exploratory data analysis, turned out to be good indicators of All-NBA team membership. Sweet.

Overall, it looks like this lasso model did a fairly good job predicting All-NBA membership in general. Except for Rajon Rondo and Carmelo Anthony, the correct players are placed in the top 10 for each position. A cursory glance at the results, particularly the AUROC of \(0.9930087\), suggests that this is a pretty good model.

Ridge

Similar to lasso regression, \(L_2\), or ridge, regression penalizes large coefficients. However, it instead relies on the squared \(L_2\) norm (the sum of squared coefficients).

Practically speaking, this leads to coefficient shrinkage (it doesn’t force them to 0).

Again, the “strength” of the penalty is tuned by the parameter \(\lambda\).
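The difference between shrinkage and selection is easiest to see in a textbook special case: linear regression with orthonormal predictors, where both penalties have closed-form solutions (up to how \(\lambda\) is scaled). The sketch below is only an illustration under that assumption; logistic regression in glmnet has no such closed form, and the starting estimates are hypothetical.

```r
# Textbook special case (orthonormal predictors, squared-error loss):
#   ridge: beta / (1 + lambda)                     shrinks, never exactly zero
#   lasso: sign(beta) * max(|beta| - lambda, 0)    small ones become exactly zero
ridge.shrink <- function(beta, lambda) beta / (1 + lambda)
lasso.shrink <- function(beta, lambda) sign(beta) * pmax(abs(beta) - lambda, 0)

beta.ols <- c(3, -1.5, 0.4, -0.2)   # hypothetical unpenalized estimates
lambda   <- 0.5

shrunk <- rbind(
  ols   = beta.ols,
  ridge = ridge.shrink(beta.ols, lambda),
  lasso = lasso.shrink(beta.ols, lambda)
)
shrunk
```

Ridge scales every coefficient down proportionally, so none reaches zero, whereas the lasso subtracts a fixed amount and zeroes out the two smallest.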

Using glmnet, ridge regression can be done by setting alpha = 0.

# Find optimal lambda via cross-validation
set.seed(42)
cv.ridge <- cv.glmnet(x, y, alpha = 0, family = "binomial", type.measure = "auc") # No need to re-initialize x and y

# Make predictions on test data
p <- cv.ridge %>% predict(newx = x.test, s = "lambda.min", type = "response") %>% as.numeric # store probabilities

Guard Predictions

ridge.guard.predictions.2011

Forward Predictions

ridge.forward.predictions.2011

Center Predictions

ridge.center.predictions.2011

Positive Coefficients

##             lgAssists              lgPoints           lgORebounds 
##          5.763250e+01          4.856552e+01          4.257824e+00 
##              tmPoints            lgRebounds               allstar 
##          2.735093e+00          2.465918e+00          1.795669e+00 
##             tmAssists           tmORebounds            tmRebounds 
##          1.739097e+00          1.412404e+00          1.154470e+00 
##                efgPct                 fgPct           tmDRebounds 
##          1.010174e+00          9.945160e-01          9.354206e-01 
##           dReboundPct         blocksPerGame         stealsPerGame 
##          3.590569e-01          3.252119e-01          1.954983e-01 
##         ftMadePerGame           astTovRatio      threeMadePerGame 
##          1.111507e-01          1.047392e-01          1.024201e-01 
##    ftAttemptedPerGame        assistsPerGame      dReboundsPerGame 
##          9.486095e-02          8.798613e-02          8.109276e-02 
##      oReboundsPerGame         fgMadePerGame          avgGameScore 
##          7.122856e-02          6.500880e-02          5.260153e-02 
##       reboundsPerGame               healthy threeAttemptedPerGame 
##          4.786521e-02          3.829229e-02          2.931582e-02 
##         pointsPerGame              threePct    fgAttemptedPerGame 
##          2.621667e-02          2.215310e-02          1.799737e-02 
##               GPRatio        minutesPerGame                blocks 
##          4.165392e-03          2.614274e-03          2.378526e-03 
##             threeMade                ftMade               assists 
##          1.151087e-03          7.770179e-04          7.755045e-04 
##                steals           ftAttempted        totalGameScore 
##          6.515315e-04          6.238926e-04          4.123563e-04 
##                fgMade             dRebounds        threeAttempted 
##          3.571578e-04          2.227062e-04          1.370472e-04 
##                points              rebounds           fgAttempted 
##          1.353643e-04          5.686495e-05          4.973701e-06

Negative Coefficients

##          minutes        oRebounds        turnovers               PF 
##    -6.467301e-05    -5.346227e-04    -1.223218e-03    -2.111940e-03 
##               GP turnoversPerGame            ftPct      oReboundPct 
##    -2.601061e-03    -9.323615e-03    -4.331699e-02    -4.633257e-01 
##      lgDRebounds      (Intercept) 
##    -5.127471e-01    -1.178001e+01

Ridge Metrics

# Plots
gridExtra::grid.arrange(auroc.ridge, auprc.ridge, ncol=2)

# AUROC
max(cv.ridge$cvm)

## [1] 0.9915167

# AUPRC
pr.ridge$auc.integral

## [1] 0.8218856

# Cross-Entropy
mxe.ridge

## [1] 0.06370576

Ridge Discussion

Like the Lasso model, the Ridge model correctly identified the First Team, as well as a good chunk of the Second Team. Deron Williams again ranked pretty high despite not making a roster at all; John Wall also scooted another guard off the top 6. Rajon Rondo was only given a probability of 0.038, which would just place him in the top 10 for guards. It had Carmelo Anthony in 7th place for forwards, but it also gave poor Tyson Chandler a measly 0.015 probability of making the roster.

Note that, as mentioned at the top of the section, all of the variables contribute, though some coefficients are quite small and thus have little effect. The GPRatio and healthy variables are much less significant compared to the Lasso model, but many of the offensive stats are much more important to the Ridge model.

Though the Ridge model pulled off a high AUROC of \(0.9915167\), it doesn’t perform as well as the lasso model based on the AUPRC and cross-entropy.

Elastic-net

Elastic-net is a compromise between lasso and ridge regression. The ratio of \(L_1\) and \(L_2\) penalties is determined by alpha.

To find the optimal alpha, we’ll have to perform several iterations of cross-validation and store the alpha that corresponds to the maximum AUC. To speed up computation, we can use %dopar% from the foreach package to run in parallel (in conjunction with the doParallel package).

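# Register a parallel backend so that %dopar% actually runs in parallel
library(foreach)
library(doParallel)
registerDoParallel() # default number of workers; adjust as needed

# Set alphas to try
a <- seq(0.05, 0.95, 0.05)

# Loop with foreach
loop <- foreach(i = a, .combine = rbind, .packages = "glmnet") %dopar% {
  set.seed(42)
  cv <- cv.glmnet(x, y, family = "binomial", type.measure = "auc", parallel = TRUE, alpha = i)
  data.frame(cvm = cv$cvm[cv$lambda == cv$lambda.min], s = "lambda.min", alpha = i)
}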

# Optimal alpha
alpha <- loop[loop$cvm == max(loop$cvm),]$alpha
alpha

## [1] 0.15

# Remake the optimal fitted cv.glmnet object
# (seed set before the fit so the folds match those used in the alpha search)
set.seed(42)
cv.elastic <- cv.glmnet(x, y, alpha = alpha, family = "binomial", type.measure = "auc")

# Make predictions with the elastic net model
p <- cv.elastic %>% predict(newx = x.test, s = "lambda.min", type = "response") %>% as.numeric

Guards

elastic.guard.predictions.2011

Forwards

elastic.forward.predictions.2011

Centers

elastic.center.predictions.2011

Positive Coefficients

##             lgAssists              lgPoints           lgORebounds 
##          1.968793e+02          1.655255e+02          4.185373e+01 
##              tmPoints                 fgPct                efgPct 
##          2.156011e+01          3.145837e+00          2.531974e+00 
##               GPRatio               allstar           dReboundPct 
##          2.401594e+00          2.184386e+00          2.069288e+00 
##         blocksPerGame               healthy           tmORebounds 
##          7.647388e-01          5.311330e-01          4.731112e-01 
##         stealsPerGame      oReboundsPerGame      dReboundsPerGame 
##          4.515525e-01          3.708144e-01          2.747295e-01 
## threeAttemptedPerGame        assistsPerGame       reboundsPerGame 
##          2.451623e-01          2.129503e-01          2.024594e-01 
##          avgGameScore         ftMadePerGame      threeMadePerGame 
##          1.732118e-01          1.464864e-01          1.455032e-01 
##         pointsPerGame        totalGameScore             dRebounds 
##          1.018255e-02          1.300575e-03          1.170581e-03 
##                ftMade              rebounds 
##          5.097407e-04          3.443722e-04

Zero Coefficients

##             points            assists             steals 
##                  0                  0                  0 
##             fgMade        ftAttempted          threeMade 
##                  0                  0                  0 
##      fgMadePerGame ftAttemptedPerGame        astTovRatio 
##                  0                  0                  0

Negative Coefficients

##          oRebounds            minutes        fgAttempted 
##      -1.314830e-04      -7.415770e-04      -8.699819e-04 
##             blocks          turnovers     threeAttempted 
##      -2.149325e-03      -2.949120e-03      -3.103796e-03 
##                 GP fgAttemptedPerGame                 PF 
##      -4.431213e-03      -4.819964e-03      -7.007939e-03 
##     minutesPerGame   turnoversPerGame           threePct 
##      -4.328564e-02      -2.155763e-01      -2.456865e-01 
##         tmRebounds        tmDRebounds              ftPct 
##      -4.367087e-01      -9.687259e-01      -2.003881e+00 
##          tmAssists        oReboundPct        (Intercept) 
##      -2.297697e+00      -2.775426e+00      -1.479367e+01 
##         lgRebounds        lgDRebounds 
##      -1.019791e+02      -3.263354e+02

Elastic-Net Metrics

# Plots
gridExtra::grid.arrange(auroc.elastic, auprc.elastic, ncol=2)

# AUROC
max(loop$cvm)

## [1] 0.9931599

# AUPRC
pr.elastic$auc.integral

## [1] 0.8411672

# Cross-Entropy
mxe.elastic

## [1] 0.04172166

We can see that the elastic-net model did a pretty good job predicting all three teams. Like the other two models, the First Team and most of the Second Team are correctly predicted, and the true roster members are all within the top 15 most probable players. Overall, the model correctly predicts the most players, and has the highest AUROC of \(0.9931599\), as well as the highest AUPRC of \(0.8411672\). It has an excellent cross-entropy loss of only \(0.04172166\), though it’s not as low as that of the lasso model.

Model Comparison

Prediction Summary

There are a few things to note:

Dirk Nowitzki actually made the team as a forward, though he’s coded as a center.
Dwight Howard made the team as a center, though he’s coded as a forward.
Again, Rajon Rondo was incorrectly coded as a forward, so he should be listed as a guard.

With these adjustments, the results are as follows:

2012 All-NBA First Team

Position	Lasso	Ridge	Elastic-Net	Actual
Guard	Chris Paul	Chris Paul	Chris Paul	Chris Paul
Guard	Kobe Bryant	Kobe Bryant	Kobe Bryant	Kobe Bryant
Forward	LeBron James	LeBron James	LeBron James	LeBron James
Forward	Kevin Durant	Kevin Durant	Kevin Durant	Kevin Durant
Center	Dwight Howard	Dwight Howard	Dwight Howard	Dwight Howard

2012 All-NBA Second Team

Position	Lasso	Ridge	Elastic-Net	Actual
Guard	~~Deron Williams~~	~~Deron Williams~~	~~Deron Williams~~	Tony Parker
Guard	Russell Westbrook	Russell Westbrook	Russell Westbrook	Russell Westbrook
Forward	Kevin Love	Kevin Love	Kevin Love	Kevin Love
Forward	Blake Griffin	Blake Griffin	Blake Griffin	Blake Griffin
Center	Andrew Bynum	Andrew Bynum	Andrew Bynum	Andrew Bynum

2012 All-NBA Third Team

Position	Lasso	Ridge	Elastic-Net	Actual
Guard	~~Brandon Jennings~~	Dwyane Wade	Dwyane Wade	Dwyane Wade
Guard	*Tony Parker*	~~John Wall~~	*Tony Parker*	Rajon Rondo
Forward	~~Pau Gasol~~	~~Pau Gasol~~	~~Pau Gasol~~	Dirk Nowitzki
Forward	~~Josh Smith~~	~~Josh Smith~~	~~Josh Smith~~	Carmelo Anthony
Center	~~Marcin Gortat~~	~~Marc Gasol~~	~~Marcin Gortat~~	Tyson Chandler

Roster Discussion

Note that, in the tables above, players that are struck through did not make any roster (i.e. ~~Marcin Gortat~~, ~~Josh Smith~~, etc.). Tony Parker is italicized since he was predicted to make Third Team, but actually made Second Team.

All 3 models correctly predicted the First Team roster, as well as 4 of 5 Second Team members. Interestingly, they all agreed that Deron Williams should have been on the Second Team, though he didn’t actually make any roster (sorry Deron). The Third Team roster is where it gets hairy - the Lasso and Ridge models only correctly predicted 1 of 5 players, and Elastic-Net only got 2. Also note that, if Deron Williams wasn’t ranked 4th on the Guards list, Tony Parker would have correctly made Second Team for the Elastic-Net model. Unfortunately, poor Rondo was only given a 0.066 probability of making a roster, so he still wouldn’t have cracked Third Team on this model.

There are a lot of intangible factors that go into the All-NBA team selection, particularly since it’s determined by a point-based voting system - and not by players and coaches, but by a panel of sportswriters and broadcasters. From Wikipedia:

Players receive five points for a first team vote, three points for a second team vote, and one point for a third team vote.

This provides a bit of an explanation for the high accuracy of the First and Second teams as well as the low accuracy of the Third Team. Voters mostly seem to agree on the obvious First- and Second-Teamers, and each of those votes is worth 5x or 3x as many points as a Third Team vote. Given the weight of the points, it makes sense that the Third Team roster has such variance. Not even the models seem to agree on Third Team, for that matter.
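To see how the 5/3/1 weighting separates consensus picks from borderline ones, here is a small sketch of the point tally. The ballot counts and player labels are entirely made up; only the point weights come from the rule quoted above.

```r
# Hypothetical ballot counts for three players (made-up numbers).
votes <- data.frame(
  player = c("A", "B", "C"),
  first  = c(100, 10, 0),
  second = c(20, 60, 15),
  third  = c(0, 40, 70),
  stringsAsFactors = FALSE
)

# 5 points per First Team vote, 3 per Second Team vote, 1 per Third Team vote.
votes$points <- 5 * votes$first + 3 * votes$second + 1 * votes$third

votes[order(-votes$points), c("player", "points")]
```

Player C appears on the most Third Team ballots in this toy example yet finishes far behind, since those votes carry the least weight.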

At any rate, given that all 3 models correctly predicted 9 of the 10 First and Second Team members, it seems that there’s some semblance of agreement on the criteria by which voters judge worthiness. But whatever those criteria are, the First Team apparently has them in spades, given their relative probabilities in each model.

Evaluation Metrics

A direct comparison of the ROC and Precision-Recall curves, as well as a summary of the evaluation metrics, is provided below.

Note: The best value is highlighted in green and the worst in red.

Model	AUROC	AUPRC	Cross-Entropy
Lasso	0.9930087	0.8369115	0.0411974
Ridge	0.9915167	0.8218856	0.0637058
Elastic-Net	0.9931599	0.8411672	0.0417217

The ROC Curves plot demonstrates the similarity in AUROC scores for each model, though we can barely see the edge that the elastic-net model has on the other two (it’s above the other two curves). The Precision-Recall curves are a bit more scattered; it’s difficult to tell which model wins from the plot alone.

The elastic-net model outperformed the lasso model based on AUROC and AUPRC, although the lasso model has a narrow edge in terms of cross-entropy. The ridge model consistently performed the worst across all metrics. Despite this, however, the ridge and lasso models correctly predicted the same number of All-NBA team members.

The elastic-net model identified the most correct players out of all models, although it only beat the other models by 1 player (11 out of 15 players correctly identified). Based on this and the fact that the elastic-net model won in 2 out of 3 evaluation metrics, we can say that elastic-net provides the winning model.
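The player counts can be re-derived from the roster tables above by intersecting each model's 15 predicted names with the 15 actual honorees, counting a player as correct if he made any of the three teams. The sketch below simply transcribes those tables into vectors.

```r
# Actual 2012 All-NBA honorees (all three teams), from the tables above.
actual <- c("Chris Paul", "Kobe Bryant", "LeBron James", "Kevin Durant",
            "Dwight Howard", "Tony Parker", "Russell Westbrook", "Kevin Love",
            "Blake Griffin", "Andrew Bynum", "Dwyane Wade", "Rajon Rondo",
            "Dirk Nowitzki", "Carmelo Anthony", "Tyson Chandler")

# First and Second Team predictions were identical across the three models.
shared <- c("Chris Paul", "Kobe Bryant", "LeBron James", "Kevin Durant",
            "Dwight Howard", "Deron Williams", "Russell Westbrook",
            "Kevin Love", "Blake Griffin", "Andrew Bynum")

# Third Team predictions differ by model.
third <- list(
  lasso       = c("Brandon Jennings", "Tony Parker", "Pau Gasol", "Josh Smith", "Marcin Gortat"),
  ridge       = c("Dwyane Wade", "John Wall", "Pau Gasol", "Josh Smith", "Marc Gasol"),
  elastic.net = c("Dwyane Wade", "Tony Parker", "Pau Gasol", "Josh Smith", "Marcin Gortat")
)

# Number of predicted players who made any All-NBA roster.
correct <- sapply(third, function(t3) length(intersect(c(shared, t3), actual)))
correct
```

Counting this way gives 10, 10, and 11 correct players for lasso, ridge, and elastic-net respectively, matching the figures discussed above.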

2019 All-NBA Predictions

Of course, our model probably (no pun intended) won’t be effective skipping 7 seasons of data. Not to mention how the changes to the All-Star team selection, one of the most significant predictors, will affect the model.

Let’s use our elastic-net model to predict the All-NBA teams for the 2018-2019 season anyway!

# Read in 2018-2019 regular season data and save as matrix
players.2019 <- read.csv("players_2019.csv")

x.2019 <- players.2019 %>% 
  select(-c(playerID, tmID, center, forward, guard)) %>% 
  as.matrix

set.seed(42)
p <- cv.elastic %>% predict(newx = x.2019, s = "lambda.min", type = "response") %>% as.numeric

Probability Rankings

Guard Predictions

Forward Predictions

Center Predictions

Predicted Roster

	First Team	Second Team	Third Team
Guard	James Harden	Stephen Curry	Kemba Walker
Guard	Russell Westbrook	Damian Lillard	Kyrie Irving
Forward	Giannis Antetokounmpo	Paul George	Kawhi Leonard
Forward	LeBron James	Kevin Durant	Blake Griffin
Center	Anthony Davis	Joel Embiid	Karl-Anthony Towns

Seems like a pretty good list. We’ll find out how good in a few weeks!

Conclusion

We created 3 logistic regression models using lasso, ridge, and elastic-net regularization. The elastic-net model performed the best in terms of AUROC, AUPRC, and players correctly predicted (11 of 15). There is a lot of room for improvement, but it’s not terrible! It’s also worth noting that all 3 models agreed on both the First and Second Teams.

It’s interesting to note that all 3 models have lgAssists as the highest coefficient, followed by either lgPoints or tmPoints. Of course, being proportions, league- and team-wide statistics vary from player to player on the order of tenths, hundredths, or even thousandths of a point. On such a scale their importance can be a bit difficult to interpret; this is one of the disadvantages of reading unstandardized logistic regression coefficients. However, it makes enough sense that the highest scorers and assist-ers have much greater probabilities of making a roster.

All-Star status may be the best predictor of All-NBA team membership - the elastic-net and lasso models gave All-Stars about 9 times greater odds of making a roster (ridge gave about 6 times). Player health (and/or GPRatio) is also a significant contributor - the lasso and elastic-net models each placed GPRatio within their 10 largest coefficients, though the ridge model gave both variables far less weight. Another point of interest is the apparent importance of lgORebounds, which can, in some opinions, be an indirect indicator of effort/activity.

Though not exactly unexpected, all 3 models penalize turnovers per game and personal fouls, and the ridge and elastic-net models also penalize total turnovers (the final lasso model sets that coefficient to zero). What’s more interesting is that all 3 models also penalize some combination of minutes, minutes per game, and games played. It’s not much of a penalty in any model - for instance, the winning elastic-net model lowers the odds by ~0.4% for every game played. Perhaps this has to do with teams resting their key players every once in a while (or before the playoffs).
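The odds figures quoted in the last two paragraphs can be recomputed from the coefficients reported earlier. A minimal sketch, with the values copied from the coefficient listings of the three final models:

```r
# allstar coefficients from the final lasso, ridge, and elastic-net models.
allstar <- c(lasso = 2.221494, ridge = 1.795669, elastic.net = 2.184386)
odds.ratios <- exp(allstar)          # odds multiplier for All-Stars
round(odds.ratios, 2)

# GP (games played) coefficient from the elastic-net model.
gp.elastic <- -4.431213e-03
pct.change <- 100 * (exp(gp.elastic) - 1)   # % change in odds per extra game
round(pct.change, 2)
```

Both results describe changes in odds with the other variables held fixed; the corresponding change in probability depends on a player's baseline.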

We used the model to predict the All-NBA team rosters for the current (2018-2019) season. As of April 18, 2019, the teams have not yet been announced.

James Martinez

April 17, 2019
