Question

Our goal with this project is to see whether we can predict where NBA Draft prospects will be picked based on their performance at the NBA Draft Combine. The question we are trying to answer is:

  • Is it possible to predict if a player will be selected in the NBA Draft Lottery (first 14 picks)?

Background Information

Each year, some of the top young basketball players around the world get to hear their name called at the NBA Draft in June. However, the preceding months are filled with a thorough evaluation process, so each team can have the utmost confidence in the picks they make. One significant part of this process is the NBA Draft Combine, which occurs roughly one month before the draft.

While it does not get the same media attention, the NBA Draft Combine functions much like the NFL Combine: players have their heights and weights measured and are put through a number of tests such as the vertical jump, bench press, and three-quarter court sprint. Although on-court performance in college or international leagues is often most influential, a strong performance at the NBA Draft Combine can give teams a more positive view of a player, leading to the prospect getting drafted earlier.

Ultimately, we are trying to analyze the correlation between NBA Draft Combine performance and NBA Draft position. More specifically, we hope to identify which combine tests matter most to NBA teams, then build a model to predict which players may be selected in the NBA Draft Lottery, or the first 14 picks. For instance, if a prospect has a large wingspan and a fast sprint time, does that mean they are more likely to be selected earlier? Our model seeks to answer questions like these.

Exploratory Data Analysis

The data we have chosen for this project is in the file "nba_draft_combine_all_years.csv". It includes the following information about each player:
    • Player Name
    • Year
    • Draft Pick
    • Height (No Shoes)
    • Height (With Shoes)
    • Wingspan
    • Standing reach
    • Vertical (Max)
    • Vertical (Max Reach)
    • Vertical (No Step)
    • Vertical (No Step Reach)
    • Weight
    • Body Fat
    • Hand (Length)
    • Hand (Width)
    • Bench
    • Agility
    • Sprint

Some of these variables will not be useful for our intended application. For instance, a player's name and draft year are not worth keeping in our model, as they identify a player rather than measure performance. Similarly, we chose to drop both height variables, standing reach, weight, and body fat, so that our model would not be biased toward Power Forwards and Centers, the positions that are typically filled by larger athletes. We also removed some of the vertical measures, since they could be redundant, and bench, since many prospects chose to opt out of that test at the NBA Draft Combine.

There were also a significant number of NA values in the data. This is due to various players opting out of some combine activities. There are also discrepancies in how the combine was run from year to year. For example, hand size was only measured after 2010. We decided to remove the rows containing NA values, as the dataset was large enough to still draw meaningful conclusions.
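The cleaning steps described above can be sketched in a few lines. This is a Python illustration, not the code used for this report: the column names mirror the variable list above but may differ from the CSV's actual headers, and the four rows are made up.

```python
import csv
import io

# Hypothetical four-row excerpt; all values are invented for illustration.
RAW = """Player,Year,Draft pick,Wingspan,Hand (Length),Agility,Sprint
Player A,2009,3,84.5,NA,11.2,3.28
Player B,2012,25,82.0,8.75,,3.15
Player C,2013,11,80.25,9.0,11.4,3.30
Player D,2014,40,79.5,8.5,11.0,3.22
"""

# Columns kept for modeling (assumed names, a subset shown for brevity).
KEEP = ["Draft pick", "Wingspan", "Hand (Length)", "Agility", "Sprint"]
MISSING = {"", "NA"}


def clean(rows, keep=KEEP):
    """Keep only the chosen columns and drop rows with any missing value."""
    cleaned = []
    for row in rows:
        subset = {col: row[col] for col in keep}
        if any(value.strip() in MISSING for value in subset.values()):
            continue  # a skipped test or an unmeasured year leaves a gap
        cleaned.append({col: float(value) for col, value in subset.items()})
    return cleaned


rows = list(csv.DictReader(io.StringIO(RAW)))
cleaned = clean(rows)
print(len(rows), "rows before,", len(cleaned), "rows after")
```

Player A stands in for a pre-2010 prospect with no hand measurement, and Player B for a prospect who skipped the agility drill; both rows are dropped.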

Our correlation matrix revealed that none of our variables had an exceptionally strong correlation with Draft Pick. Each variable, along with its correlation with Draft Pick, is shown below.

##                       Test Correlation
## 1               Draft pick   1.0000000
## 2                 Wingspan  -0.1588609
## 3     Vertical (Max Reach)  -0.2242434
## 4 Vertical (No Step Reach)  -0.1936272
## 5            Hand (Length)  -0.1671339
## 6             Hand (Width)  -0.1495533
## 7                  Agility   0.1345048
## 8                   Sprint   0.1795948

It is important to note that a negative correlation for wingspan, both vertical variables, and both hand size variables actually indicates a favorable relationship, because a lower draft pick number means an earlier selection. For instance, the wingspan correlation is negative because a large wingspan (higher value) is correlated with a low draft pick value. The agility and sprint correlations are positive because faster times (lower values) are correlated with low draft pick values. While the correlations are not especially strong, they are all in the direction we expected. For example, we hypothesized that teams would favor prospects who are agile and fast with a large wingspan, high vertical, and large hands, and the correlations reflect that.
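The sign convention can be illustrated with a small Pearson correlation sketch in Python. The five prospects below are invented, so only the signs of the results are meaningful, not their magnitudes.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up values for five hypothetical prospects.
draft_pick = [2, 9, 17, 28, 41]              # lower number = earlier pick
wingspan = [86.0, 84.5, 85.0, 82.0, 81.5]    # inches, larger is better
sprint = [3.12, 3.20, 3.18, 3.31, 3.35]      # seconds, lower is better

# Larger wingspans go with smaller pick numbers: negative coefficient.
print("wingspan vs pick:", round(pearson(wingspan, draft_pick), 3))
# Slower sprints go with larger pick numbers: positive coefficient.
print("sprint vs pick:  ", round(pearson(sprint, draft_pick), 3))
```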

Methods and Evaluation

Once the exploratory data analysis was complete, we had to split our data into the training and test sets. We elected to have 90% of the data in the training set, to ensure the model was trained on as much data as possible.

Next we were interested in determining how accurate our model would be, based on the number of randomly selected predictors. The plot of Accuracy vs. Number of Predictors is shown below.

While the accuracy generally decreases with more predictors, the drop is small. With anywhere from one to ten randomly selected predictors, the accuracy remains between 73.5% and 77.5%.

When continuing to build our model, it would be helpful to know the prevalence of lottery picks in our dataset. After our data cleaning, 61 lottery picks and 224 non-lottery picks remained in the dataset, for a prevalence of 21.4%.
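The prevalence figure and the 90/10 split can be sketched together. The class counts come from this report; the stratified split (sampling each class separately) is an assumption, since the report does not say how the split was drawn, and the function name is hypothetical. Under that assumption the test set holds 6 lottery picks and 22 non-lottery picks, 28 players in total.

```python
import random


def stratified_split(labels, train_tenths=9, seed=0):
    """Return (train_idx, test_idx), splitting each class separately."""
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Ceiling of train_tenths/10 of the class, using integer arithmetic.
        n_train = -(-train_tenths * len(indices) // 10)
        train_idx.extend(indices[:n_train])
        test_idx.extend(indices[n_train:])
    return sorted(train_idx), sorted(test_idx)


# Class counts after cleaning, as stated in the report.
labels = ["lottery"] * 61 + ["non-lottery"] * 224
prevalence = labels.count("lottery") / len(labels)
print("prevalence:", round(prevalence, 3))  # 0.214

train_idx, test_idx = stratified_split(labels)
test_labels = [labels[i] for i in test_idx]
print("train:", len(train_idx), "test:", len(test_idx))
print("test lottery:", test_labels.count("lottery"),
      "test non-lottery:", test_labels.count("non-lottery"))
```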

Next, we needed to identify what threshold would work best for our model. To do this, we created a plot that gave the true positive and false positive rates at varying thresholds. The graph is shown below, with the colors and their corresponding thresholds on the right side.

The corner in the yellow region, with a true positive rate of roughly 75% and a false positive rate just under 20%, represented the best balance of a high true positive rate and a low false positive rate. Thus, we decided to move forward with the threshold corresponding to the yellow region, 76%. It is important to note that this threshold needed to be flipped to its opposite, 24%, for our model's purposes.
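One way to read the threshold flip is that the plot was indexed by the probability of the non-lottery class, so a cutoff of 76% on that scale is the same rule as a cutoff of 24% on the lottery probability. That reading is an assumption, and the eight predicted probabilities below are made up. A Python sketch of the equivalence and of how the rates move with the threshold:

```python
def rates(y_true, predicted):
    """Return (true positive rate, false positive rate)."""
    tp = sum(1 for y, p in zip(y_true, predicted) if y and p)
    fn = sum(1 for y, p in zip(y_true, predicted) if y and not p)
    fp = sum(1 for y, p in zip(y_true, predicted) if not y and p)
    tn = sum(1 for y, p in zip(y_true, predicted) if not y and not p)
    return tp / (tp + fn), fp / (fp + tn)


# Invented example: 1 = lottery pick, with a predicted lottery probability.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
p_lottery = [0.62, 0.31, 0.18, 0.40, 0.22, 0.15, 0.09, 0.05]

# Rule A: call a player a lottery pick when P(lottery) >= 0.24.
pred_a = [p >= 0.24 for p in p_lottery]
# Rule B: the same rule stated on the other class, P(non-lottery) <= 0.76.
pred_b = [(1 - p) <= 0.76 for p in p_lottery]
print("rules agree:", pred_a == pred_b)

# Lower thresholds catch more lottery picks but raise false positives.
for threshold in (0.10, 0.24, 0.50):
    predicted = [p >= threshold for p in p_lottery]
    tpr, fpr = rates(y_true, predicted)
    print(f"threshold {threshold:.2f}: TPR {tpr:.2f}, FPR {fpr:.2f}")
```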

Using a threshold of 24% resulted in a sensitivity of 83.33%, while still maintaining a specificity of 72.73%. The confusion matrix and accompanying measures of performance can be seen below.

## Confusion Matrix and Statistics
## 
##                   Actual
## Prediction         Lottery pick Non-lottery pick
##   Lottery pick                5                6
##   Non-lottery pick            1               16
##                                           
##                Accuracy : 0.75            
##                  95% CI : (0.5513, 0.8931)
##     No Information Rate : 0.7857          
##     P-Value [Acc > NIR] : 0.7623          
##                                           
##                   Kappa : 0.4302          
##                                           
##  Mcnemar's Test P-Value : 0.1306          
##                                           
##             Sensitivity : 0.8333          
##             Specificity : 0.7273          
##          Pos Pred Value : 0.4545          
##          Neg Pred Value : 0.9412          
##               Precision : 0.4545          
##                  Recall : 0.8333          
##                      F1 : 0.5882          
##              Prevalence : 0.2143          
##          Detection Rate : 0.1786          
##    Detection Prevalence : 0.3929          
##       Balanced Accuracy : 0.7803          
##                                           
##        'Positive' Class : Lottery pick    
## 
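The summary statistics above follow directly from the four cells of the confusion matrix. The Python sketch below recomputes them from those counts; the function name is hypothetical and this is a cross-check, not the code that produced the output.

```python
def metrics(tp, fp, fn, tn):
    """Derive common classification metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)          # recall, true positive rate
    specificity = tn / (tn + fp)          # true negative rate
    precision = tp / (tp + fp)            # positive predictive value
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    # Agreement expected by chance, from the row and column totals.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total ** 2
    kappa = (accuracy - expected) / (1 - expected)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "kappa": kappa,
    }


# Counts from the confusion matrix, with "Lottery pick" as the positive class.
result = metrics(tp=5, fp=6, fn=1, tn=16)
for name, value in result.items():
    print(f"{name:>18}: {value:.4f}")
```

The printed values match the report: accuracy 0.7500, sensitivity 0.8333, specificity 0.7273, precision 0.4545, F1 0.5882, balanced accuracy 0.7803, and kappa 0.4302.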

To see how our model performed at all thresholds, we created an ROC (receiver operating characteristic) curve, which can be seen below.

## Area under the curve: 0.7803

The area under an ROC curve can be used to evaluate the quality of a model. An area of 1.0 would mean that the model ranks every lottery pick above every non-lottery pick, so that some threshold separates the two classes perfectly, while an area of 0.5 is what random guessing would achieve. Our model had an area under the curve of 0.7803, which means our model qualifies as being along the border of good and very good.
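The area under the curve has a direct interpretation: it is the probability that a randomly chosen lottery pick receives a higher score than a randomly chosen non-lottery pick, with ties counted as half. The Python sketch below computes it that way. The first example uses made-up probabilities; the second feeds in the hard 0/1 predictions rebuilt from the confusion matrix in this report.

```python
def auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Invented probabilities for three lottery picks and five others.
toy_true = [1, 1, 1, 0, 0, 0, 0, 0]
toy_scores = [0.62, 0.31, 0.18, 0.40, 0.22, 0.15, 0.09, 0.05]
print(round(auc(toy_true, toy_scores), 4))   # 0.8

# Hard class predictions rebuilt from the confusion matrix:
# 6 actual lottery picks (5 predicted lottery), 22 others (6 predicted lottery).
hard_true = [1] * 6 + [0] * 22
hard_pred = [1] * 5 + [0] * 1 + [1] * 6 + [0] * 16
print(round(auc(hard_true, hard_pred), 4))   # 0.7803
```

The second call returns 0.7803, the same value as the balanced accuracy, which is what this calculation yields when it is given class labels instead of probabilities. Whether the reported curve was built that way is not stated, so this is an observation rather than a conclusion.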

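One of our goals was to determine which combine tests have the most significance to NBA teams. The variable importance table below suggests that agility and sprint speed carry the most weight in our model's predictions, and by extension may matter most to NBA teams during the draft process, while hand size is not that significant. The scores are scaled so that the most important variable is 100 and the least important is 0, so hand length's 0.00 means it ranked last, not that it carried no information.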

## rf variable importance
## 
##                        Overall
## agility                 100.00
## sprint                   97.42
## wingspan                 66.55
## vertical_max_reach       54.33
## vertical_no_step_reach   52.38
## hand_width               15.30
## hand_length               0.00

Conclusions

  • The overall accuracy of our model is 0.75.

    • Our model correctly identifies whether a player will be drafted in the lottery 75% of the time.

    • Compare this to the baseline prevalence of 0.21. If we were to simply predict that no player will be drafted in the lottery, that naive classifier would technically have an accuracy of 0.79, higher than our model's.

    • Therefore, it's important to look at metrics other than accuracy.

  • Because we're interested in classifying the positive cases (NBA players who are drafted in the lottery), our question is best answered with recall and precision.

  • From the evaluation section, the model has a high recall/TPR.

    • Out of all players drafted in the lottery, the model accurately predicted 83.33% as such.

    • Our model is pretty good at labeling NBA players who will be drafted in the lottery as such from their combine data.

  • However, optimizing the recall of the model can lead to a decreased precision.

    • Precision of 0.4545, so out of all players predicted to be drafted in the lottery, only 45.45% actually were.

    • This precision is sub-optimal; further evaluation of the threshold could help us balance the trade-off between the two.

  • F1 score is also a good metric for our purposes because our data is imbalanced (prevalence of lottery pick is 21.43%). It also acts to balance the precision and recall.

    • The F1 score of our model is 0.5882, which indicates moderate performance on the lottery-pick class. The model's predictions carry real signal, but there is still a lot of room for improvement.
  • The LogLoss value for our model is 0.4561243. By comparison, the dummy LogLoss given our prevalence of 21% is about 0.52.

    • Our LogLoss value is less than the dummy LogLoss, which suggests that the model is at least better than a dummy classifier at estimating the probabilities of both classes.
  • So, we can conclude that our model does a good job of determining which players have the greatest chance of being drafted in the lottery, but it needs refining to be able to pick these players without as many false positives.
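The dummy LogLoss quoted in the conclusions can be reproduced by scoring a classifier that predicts the prevalence for every player. The Python sketch below uses the class counts from this report; the function name is hypothetical.

```python
from math import log


def log_loss(y_true, p_positive):
    """Mean negative log-likelihood of the true labels."""
    total = 0.0
    for y, p in zip(y_true, p_positive):
        total += -log(p) if y == 1 else -log(1.0 - p)
    return total / len(y_true)


# 61 lottery picks and 224 non-lottery picks, as reported after cleaning.
y_true = [1] * 61 + [0] * 224
prevalence = sum(y_true) / len(y_true)

# A dummy classifier assigns the same probability to every player.
dummy = log_loss(y_true, [prevalence] * len(y_true))
model = 0.4561243   # LogLoss reported for the model

print("dummy LogLoss:", round(dummy, 3))   # about 0.52
print("model beats dummy:", model < dummy)
```

A lower LogLoss is better, so the gap between roughly 0.52 and 0.456 is the margin by which the model's probabilities improve on always guessing the base rate.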

Future Work

  • Limitations of our analysis on this project:

    • Unbalanced data. Because of the nature of the draft, the data is inherently unbalanced. That is, there are significantly more players who are not drafted in the lottery than players who are, so there isn't as much data available about players who are drafted in the lottery (in comparison to those who are not).
  • Future Improvements:

    • Grouping by Position. Because different basketball positions have different requirements for the various metrics in the Combine data set, we could improve the model by building separate models for each player position. In doing so, we could narrow the predictors for each model to those that best predict draft position for that specific position.

    • Adding Clusters. In order to further improve the model, we could also cluster players on various combinations of metrics. For example, we could have a clustering based on shooting metrics and another based on general speed and strength.