Classifying NBA Players: Predicting ESPN’s Top 50

Sports Analytics & Insight — Classification & Clustering Assignment, Part A

Author

Luka Kelentric

Published

April 28, 2026

1 Introduction

In September 2018, ESPN.com published a list of the top 50 NBA players for the 2018-19 season. It is assumed that performance data from the 2017-18 season was a major input into ESPN’s rankings. The aim of Part A is to use classification techniques on 2017-18 player statistics to predict whether or not a player ended up on ESPN’s Top 50 list.

Two classification methods are applied:

A Classification Tree built using the rpart package.
A Binary Logistic Regression model built using the glm function.

The variable pts (average points scored per game) is excluded from both models per the assignment brief, and the variable player is excluded as it is purely an identifier. The two models are then compared on accuracy and the predictors they each consider important.

Code

library(tidyverse)
library(rpart)
library(rattle)
library(TTR)
library(dplyr)
library(ggplot2)
library(ggrepel)
library(tidyr)
library(gt)
library(scales)
library(janitor)

2 Question 1: Import the Data

The training and testing datasets are imported using read_csv() from the tidyverse. A quick class-balance check on the target variable (top_50) is included to give context for the accuracy results later on.

Code

nba_training <- read_csv("nba_training.csv")    #Importing all data sets into R
nba_testing <- read_csv("nba_testing.csv")

nba_training %>% count(top_50)      #assessing the balance/imbalance of the top50

# A tibble: 2 × 2
  top_50     n
  <chr>  <int>
1 N         99
2 Y         31

The training set contains 130 players and the testing set contains 66 players. The target top_50 is stored as “Y” for Top 50 players and “N” otherwise. As expected from the brief (only 45 of ESPN’s Top 50 met the 1500-minutes-played requirement), the data is imbalanced, with far more non-Top-50 players than Top-50 players in both sets.

3 Question 2: Classification Tree

3.1 Q2a — Build and visualise the tree

Position (pos) is a categorical variable, so it is converted to a factor before modelling. The classification tree is then built using all available predictors except player and pts.

Code

nba_training$pos <- as.factor(nba_training$pos)     #changing position into a categorical variable
nba_testing$pos  <- as.factor(nba_testing$pos)


nba_model_tree <- rpart(top_50 ~ pos + fg + fgp + thr + thrp + efg +    #Building the tree using all the predictors
                          trb + ast + stl + blk + tov + pf,
                        data = nba_training, method = 'class')     #classification not regression

Code

fancyRpartPlot(nba_model_tree)    #Visualization of the tree

Classification tree predicting whether an NBA player made ESPN’s Top 50.

3.2 Q2b — Interpret the tree

3.2.1 Q2b.i — A rule for predicting Top 50

The rule for the tree model predicting if the player will be in the Top 50 based on his regular season statistics is if the player simply has ‘fg’ ≥ 7.1, this leaf has a purity of 81% which is the highest of any leaf that predicts a player being “Top 50”.

3.2.2 Q2b.ii — A rule for predicting Not Top 50

Now on the other hand the rule for finding if a player is “not top 50” material, we follow the following rule: fg < 7.1, thr < 2.3 and trb < 6.4, this path tells us with 99% purity that the player will not be within the top 50 if applied over different set of data.

3.2.3 Q2b.iii — Variable importance

The variable.importance element of the fitted rpart object gives a numeric score for each variable, reflecting how much it contributed to the splits in the tree.

Code

nba_model_tree$variable.importance    #assess importance of each variable

        fg        thr        tov         pf        blk        trb        pos 
16.3355244  6.1996468  5.4451748  3.5572075  2.8196389  2.7911430  1.9380366 
       ast       thrp        stl        fgp 
 1.8660357  1.3760600  1.0881536  0.8043276

Code

summary(nba_model_tree)      #show the importance differences in % form

Call:
rpart(formula = top_50 ~ pos + fg + fgp + thr + thrp + efg + 
    trb + ast + stl + blk + tov + pf, data = nba_training, method = "class")
  n= 130 

          CP nsplit rel error    xerror      xstd
1 0.41935484      0 1.0000000 1.0000000 0.1567347
2 0.01075269      1 0.5806452 0.7096774 0.1379081
3 0.01000000      4 0.5483871 0.7741935 0.1427004

Variable importance
  fg  thr  tov   pf  blk  trb  pos  ast thrp  stl  fgp 
  37   14   12    8    6    6    4    4    3    2    2 

Node number 1: 130 observations,    complexity param=0.4193548
  predicted class=N  expected loss=0.2384615  P(node) =1
    class counts:    99    31
   probabilities: 0.762 0.238 
  left son=2 (109 obs) right son=3 (21 obs)
  Primary splits:
      fg  < 7.1    to the left,  improve=16.335520, (0 missing)
      thr < 2.25   to the left,  improve=10.069930, (0 missing)
      tov < 1.45   to the left,  improve= 8.330769, (0 missing)
      stl < 1.35   to the left,  improve= 6.179021, (0 missing)
      trb < 7.3    to the left,  improve= 5.553524, (0 missing)
  Surrogate splits:
      tov < 2.75   to the left,  agree=0.892, adj=0.333, (0 split)
      pf  < 3.05   to the left,  agree=0.869, adj=0.190, (0 split)
      thr < 2.6    to the left,  agree=0.854, adj=0.095, (0 split)
      ast < 8.5    to the left,  agree=0.854, adj=0.095, (0 split)
      stl < 1.75   to the left,  agree=0.846, adj=0.048, (0 split)

Node number 2: 109 observations,    complexity param=0.01075269
  predicted class=N  expected loss=0.1284404  P(node) =0.8384615
    class counts:    95    14
   probabilities: 0.872 0.128 
  left son=4 (97 obs) right son=5 (12 obs)
  Primary splits:
      thr < 2.25   to the left,  improve=3.723257, (0 missing)
      fg  < 4.65   to the left,  improve=3.478123, (0 missing)
      trb < 7.3    to the left,  improve=2.893895, (0 missing)
      efg < 0.5335 to the left,  improve=2.570815, (0 missing)
      tov < 1.45   to the left,  improve=2.270472, (0 missing)
  Surrogate splits:
      ast < 6.75   to the left,  agree=0.899, adj=0.083, (0 split)
      stl < 1.65   to the left,  agree=0.899, adj=0.083, (0 split)

Node number 3: 21 observations
  predicted class=Y  expected loss=0.1904762  P(node) =0.1615385
    class counts:     4    17
   probabilities: 0.190 0.810 

Node number 4: 97 observations,    complexity param=0.01075269
  predicted class=N  expected loss=0.08247423  P(node) =0.7461538
    class counts:    89     8
   probabilities: 0.918 0.082 
  left son=8 (73 obs) right son=9 (24 obs)
  Primary splits:
      trb < 6.35   to the left,  improve=2.791143, (0 missing)
      blk < 0.95   to the left,  improve=2.233258, (0 missing)
      fg  < 4.65   to the left,  improve=1.802364, (0 missing)
      pos splits as  RLLLL,      improve=1.542761, (0 missing)
      fgp < 0.464  to the left,  improve=1.435421, (0 missing)
  Surrogate splits:
      pos  splits as  RLLLL,      agree=0.845, adj=0.375, (0 split)
      thrp < 0.2565 to the right, agree=0.835, adj=0.333, (0 split)
      blk  < 0.75   to the left,  agree=0.825, adj=0.292, (0 split)
      thr  < 0.25   to the right, agree=0.814, adj=0.250, (0 split)
      fgp  < 0.549  to the left,  agree=0.804, adj=0.208, (0 split)

Node number 5: 12 observations
  predicted class=N  expected loss=0.5  P(node) =0.09230769
    class counts:     6     6
   probabilities: 0.500 0.500 

Node number 8: 73 observations
  predicted class=N  expected loss=0.01369863  P(node) =0.5615385
    class counts:    72     1
   probabilities: 0.986 0.014 

Node number 9: 24 observations,    complexity param=0.01075269
  predicted class=N  expected loss=0.2916667  P(node) =0.1846154
    class counts:    17     7
   probabilities: 0.708 0.292 
  left son=18 (15 obs) right son=19 (9 obs)
  Primary splits:
      blk < 0.95   to the left,  improve=2.0055560, (0 missing)
      stl < 0.9    to the left,  improve=1.0416670, (0 missing)
      fgp < 0.474  to the left,  improve=0.9388889, (0 missing)
      efg < 0.51   to the left,  improve=0.9388889, (0 missing)
      thr < 0.85   to the left,  improve=0.7500000, (0 missing)
  Surrogate splits:
      pos  splits as  RLLL-,      agree=0.792, adj=0.444, (0 split)
      thrp < 0.1715 to the right, agree=0.708, adj=0.222, (0 split)
      pf   < 2.65   to the left,  agree=0.708, adj=0.222, (0 split)
      fgp  < 0.474  to the left,  agree=0.667, adj=0.111, (0 split)
      thr  < 0.05   to the right, agree=0.667, adj=0.111, (0 split)

Node number 18: 15 observations
  predicted class=N  expected loss=0.1333333  P(node) =0.1153846
    class counts:    13     2
   probabilities: 0.867 0.133 

Node number 19: 9 observations
  predicted class=Y  expected loss=0.4444444  P(node) =0.06923077
    class counts:     4     5
   probabilities: 0.444 0.556

The most important variables for predicting whether a player is in ESPN’s Top 50 are the ones at the top of this list. The top three are:

[1] "fg"  "thr" "tov"

In other words, fg, thr, and tov are the predictors the tree relied on most heavily when separating Top 50 from non-Top-50 players.

3.3 Q2c — Accuracy of the tree

3.3.1 Training accuracy

Code

# --- TRAINING DATA ---

nba_training_tree_prediction <- predict(nba_model_tree, newdata = nba_training, type = 'class')

nba_training_tree_tab <- table(nba_training$top_50,
                               nba_training_tree_prediction,
                               dnn = c('Actual', 'Predicted'))
nba_training_tree_tab

      Predicted
Actual  N  Y
     N 91  8
     Y  9 22

Code

nba_training_tree_acc <- sum(diag(nba_training_tree_tab)) / sum(nba_training_tree_tab)
nba_training_tree_acc

[1] 0.8692308

3.3.2 Testing accuracy

Code

# --- TESTING ---
nba_testing_tree_prediction <- predict(nba_model_tree, newdata = nba_testing, type = 'class')

nba_testing_tree_tab <- table(nba_testing$top_50,
                              nba_testing_tree_prediction,
                              dnn = c('Actual', 'Predicted'))
nba_testing_tree_tab

      Predicted
Actual  N  Y
     N 48  4
     Y  3 11

Code

nba_testing_tree_acc <- sum(diag(nba_testing_tree_tab)) / sum(nba_testing_tree_tab)
nba_testing_tree_acc

[1] 0.8939394

The classification tree is 86.9% accurate on the training data and 89.4% accurate on the testing data.

3.3.3 Q2c.i — Overfitting check

A model is said to be overfitting when it performs much better on the training data than on the testing data — it has memorised the training set rather than learnt rules that generalise to unseen players.

In this case, the gap between training accuracy (86.9%) and testing accuracy (89.4%) is -2.5 percentage points. The gap is not significant enough in order to label the classification tree as overfitting on the training data set, rather the tree generalises well considering the insiginificant degree of the gap between the training and testing tree accuracy.

4 Question 3: Binary Logistic Regression

4.1 Q3a.i — Preparing the response variable

The target top_50 is stored as text “N” / “Y”. For logistic regression, it must be converted to a factor with the reference level first (“N” — i.e. not in the Top 50) and the event level second (“Y” — in the Top 50). This ordering ensures the model estimates the log-odds of being in the Top 50. Readable labels are also applied so all the confusion matrixes are easier to interpret.

Code

nba_training$top_50 <- factor(nba_training$top_50,       #converting to a factor N=reference Y=event
                              levels = c("N", "Y"),     #Top 50 stored as Y N in the csv file
                              labels = c("Not Top 50", "Top 50"))

nba_testing$top_50  <- factor(nba_testing$top_50,
                              levels = c("N", "Y"),
                              labels = c("Not Top 50", "Top 50"))     #labels allow the matrix to be more digestible

levels(nba_training$top_50)    #should print: "Not Top 50" "Top 50"

[1] "Not Top 50" "Top 50"

Code

levels(nba_testing$top_50)

[1] "Not Top 50" "Top 50"

4.2 Q3a — Building the logistic regression model

Code

#regression model

nba_model_lr <- glm(top_50 ~ pos + fg + fgp + thr + thrp + efg +
                      trb + ast + stl + blk + tov + pf,
                    data = nba_training,
                    family = binomial(link = 'logit'))

summary(nba_model_lr)


Call:
glm(formula = top_50 ~ pos + fg + fgp + thr + thrp + efg + trb + 
    ast + stl + blk + tov + pf, family = binomial(link = "logit"), 
    data = nba_training)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept) -18.6844     7.3431  -2.544   0.0109 *
posPF        -2.1358     1.3850  -1.542   0.1230  
posPG        -1.8758     2.0854  -0.900   0.3684  
posSF        -0.6826     1.6757  -0.407   0.6837  
posSG        -0.3048     1.7937  -0.170   0.8651  
fg            1.1205     0.4981   2.250   0.0245 *
fgp          16.4680    41.3104   0.399   0.6902  
thr           2.3870     1.5595   1.531   0.1259  
thrp         -4.5043     5.9837  -0.753   0.4516  
efg          -0.5173    41.2340  -0.013   0.9900  
trb           0.1575     0.2460   0.640   0.5219  
ast           0.5552     0.4049   1.371   0.1703  
stl           1.3078     1.1259   1.162   0.2454  
blk           1.3646     0.9495   1.437   0.1507  
tov          -1.7518     1.1043  -1.586   0.1127  
pf            0.2102     0.9898   0.212   0.8318  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 142.818  on 129  degrees of freedom
Residual deviance:  61.859  on 114  degrees of freedom
AIC: 93.859

Number of Fisher Scoring iterations: 7

4.2.1 Q3a.ii — The regression equation

Using the estimated coefficients in the summary() output above, the fitted regression equation is of the form:

\[ \log\left(\frac{\pi}{1 - \pi}\right) = \beta_0 + \beta_1(\text{pos}) + \beta_2(\text{fg}) + \beta_3(\text{fgp}) + \beta_4(\text{thr}) + \beta_5(\text{thrp}) + \beta_6(\text{efg}) + \beta_7(\text{trb}) + \beta_8(\text{ast}) + \beta_9(\text{stl}) + \beta_{10}(\text{blk}) + \beta_{11}(\text{tov}) + \beta_{12}(\text{pf}) \]

where π is the probability that a player is in the Top 50. To produce the actual numerical equation, substitute each β with the corresponding Estimate from the summary() table above.

\[ \begin{aligned} \log\!\left(\frac{\pi}{1 - \pi}\right) =\ & -18.6844 - 2.1358\,(\text{posPF}) - 1.8758\,(\text{posPG}) - 0.6826\,(\text{posSF}) \\ & - 0.3048\,(\text{posSG}) + 1.1205\,(\text{fg}) + 16.4680\,(\text{fgp}) + 2.3870\,(\text{thr}) \\ & - 4.5043\,(\text{thrp}) - 0.5173\,(\text{efg}) + 0.1575\,(\text{trb}) + 0.5552\,(\text{ast}) \\ & + 1.3078\,(\text{stl}) + 1.3646\,(\text{blk}) - 1.7518\,(\text{tov}) + 0.2102\,(\text{pf}) \end{aligned} \]

where π is the probability that a player is in the Top 50.

4.2.2 Q3a.iii — Significant predictors

The predictors that are statistically significant at conventional levels are those with p-values (the Pr(>|z|) column) below 0.05.

Code

# Extract the coefficient table
lr_summary <- summary(nba_model_lr)$coefficients
round(lr_summary, 4)

            Estimate Std. Error z value Pr(>|z|)
(Intercept) -18.6844     7.3431 -2.5445   0.0109
posPF        -2.1358     1.3850 -1.5421   0.1230
posPG        -1.8758     2.0854 -0.8995   0.3684
posSF        -0.6826     1.6757 -0.4074   0.6837
posSG        -0.3048     1.7937 -0.1699   0.8651
fg            1.1205     0.4981  2.2498   0.0245
fgp          16.4680    41.3104  0.3986   0.6902
thr           2.3870     1.5595  1.5306   0.1259
thrp         -4.5043     5.9837 -0.7528   0.4516
efg          -0.5173    41.2340 -0.0125   0.9900
trb           0.1575     0.2460  0.6404   0.5219
ast           0.5552     0.4049  1.3712   0.1703
stl           1.3078     1.1259  1.1616   0.2454
blk           1.3646     0.9495  1.4372   0.1507
tov          -1.7518     1.1043 -1.5864   0.1127
pf            0.2102     0.9898  0.2124   0.8318

Code

# Identify variables significant at the 5% level (excluding the intercept)
sig_vars <- rownames(lr_summary)[lr_summary[, 4] < 0.05 & rownames(lr_summary) != "(Intercept)"]
sig_vars

[1] "fg"

The variables identified as significant at the 5% level are: fg.

The only variable showing any level of significance is “fg” with a coefficient of 0.0245. Following the standard significance level of p < 0.05

4.2.3 Q3a.iv — Impact on the odds

Exponentiating each coefficient converts it from the log-odds scale to an odds ratio — the multiplicative change in the odds of being Top 50 for a one-unit increase in the predictor.

Code

exp(coef(nba_model_lr))

 (Intercept)        posPF        posPG        posSF        posSG           fg 
7.681944e-09 1.181444e-01 1.532270e-01 5.052774e-01 7.372898e-01 3.066453e+00 
         fgp          thr         thrp          efg          trb          ast 
1.419001e+07 1.088058e+01 1.106101e-02 5.961075e-01 1.170595e+00 1.742356e+00 
         stl          blk          tov           pf 
3.697919e+00 3.914198e+00 1.734596e-01 1.233917e+00

For each significant predictor, the odds ratio can be read as follows:

An odds ratio above 1 means a one-unit increase in that variable increases the odds of being in the Top 50. Specifically, the odds are multiplied by the odds ratio. To express it as a percentage change, compute (OR − 1) × 100%.
An odds ratio below 1 means the variable decreases the odds. The percentage change is (OR − 1) × 100% (negative).

(Odds_Ratio−1) × 100% = (3.07−1) × 100% = 207%

The odds ratio for fg is 3.07, meaning each additional field goal made per game multiplies the odds of being in the Top 50 by 3.07. This is equivalent to a 207% increase in the odds.

4.3 Q3b — Accuracy of the logistic regression model

4.3.1 Training accuracy

Code

#training

nba_training_lr_pi <- predict(nba_model_lr, newdata = nba_training, type = 'response')

nba_training_lr_final <- nba_training %>%
  mutate(pi = nba_training_lr_pi) %>%
  mutate(nba_training_lr_prediction = case_when(pi > 0.5  ~ 'Top 50',
                                                pi <= 0.5 ~ 'Not Top 50'))

nba_training_lr_final$nba_training_lr_prediction <- factor(
  nba_training_lr_final$nba_training_lr_prediction,
  levels = c("Not Top 50", "Top 50")
)

nba_training_lr_tab <- table(nba_training_lr_final$top_50,
                             nba_training_lr_final$nba_training_lr_prediction,
                             dnn = c('Actual', 'Predicted'))
nba_training_lr_tab

            Predicted
Actual       Not Top 50 Top 50
  Not Top 50         94      5
  Top 50              6     25

Code

nba_training_lr_acc <- sum(diag(nba_training_lr_tab)) / sum(nba_training_lr_tab)
nba_training_lr_acc

[1] 0.9153846

4.3.2 Testing accuracy

Code

#testing

nba_testing_lr_pi <- predict(nba_model_lr, newdata = nba_testing, type = 'response')

nba_testing_lr_final <- nba_testing %>%
  mutate(pi = nba_testing_lr_pi) %>%
  mutate(nba_testing_lr_prediction = case_when(pi > 0.5  ~ 'Top 50',
                                               pi <= 0.5 ~ 'Not Top 50'))

nba_testing_lr_final$nba_testing_lr_prediction <- factor(
  nba_testing_lr_final$nba_testing_lr_prediction,
  levels = c("Not Top 50", "Top 50")
)

nba_testing_lr_tab <- table(nba_testing_lr_final$top_50,
                            nba_testing_lr_final$nba_testing_lr_prediction,
                            dnn = c('Actual', 'Predicted'))
nba_testing_lr_tab

            Predicted
Actual       Not Top 50 Top 50
  Not Top 50         50      2
  Top 50              4     10

Code

nba_testing_lr_acc <- sum(diag(nba_testing_lr_tab)) / sum(nba_testing_lr_tab)
nba_testing_lr_acc

[1] 0.9090909

The logistic regression is 91.5% accurate on the training data and 90.9% accurate on the testing data. Similarly to the tree, a small distance between the training and testing data suggests the model generalises well, whereas a larger gap would suggest that the model is overfitting.

5 Question 4: Comparing the Two Models

Code

nba_training_tree_acc #Tree

[1] 0.8692308

Code

nba_testing_tree_acc

[1] 0.8939394

Code

nba_training_lr_acc #logistic regression

[1] 0.9153846

Code

nba_testing_lr_acc

[1] 0.9090909

# A tibble: 2 × 3
  Model               TrainingAccuracy TestingAccuracy
  <chr>                          <dbl>           <dbl>
1 Classification Tree            0.869           0.894
2 Logistic Regression            0.915           0.909

5.1 Q4a — Which model is more accurate?

On the testing data, the Logistic Regression model achieved the higher accuracy (90.9% vs 89.4%).

5.2 Q4b — Variable importance comparison

The classification tree relied most heavily on the predictors at the top of nba_model_tree$variable.importance, while the logistic regression flagged the variables in sig_vars as statistically significant.

Tree’s top predictors: fg, thr, tov, pf, blk.
Logistic regression’s significant predictors (p < 0.05): fg.

The two lists both independently agreed on significance of a single variable which was ‘fg’. The reason this variable is high in popularity in terms of significance is likely due to being highly correlated to a variable (pts) specifically not used in this experiment due to it being the most primary stat on which player value is assessed and would hence make prediction for top 50 quite straight forward. ‘fg’ is correlated to ‘pts’ as they refer to how many shots a player makes on average per game. The reason they aren’t directly correlated with ‘pts’ is due to the fact that players are able to score from methods different to field goals (free throws). Which usually correspond for a smaller portion of a player’s total points but still plays a part nevertheless.