NBA Player Value

# Imports
library(tidyverse)
library(MASS)
library(Rfast)
library(kableExtra)

Goals of This Analysis

Import NBA player stats and salaries (scraped via this Python script)
Optimize and linear model to accurately predict player salaries relative to their statistics
Assess which teams are over- and under-paying for their players, broadly speaking

# Read in CSV from .py script
nba <- read_csv("Stats-and-Salaries.csv")                               

# Custom histogram function
fastHistogram <- function(FEAT, NAME, BINS=30) {
        nba %>% ggplot(aes(x = scale(FEAT))) +
                geom_histogram(color="white", bins = BINS, fill="royalblue4", alpha=0.92) +
                labs(x = paste("Scaled Feature: ", NAME), y = "") +
                theme_minimal()
}

Clean + Transform

First, we’ll isolate the numeric variables in this dataset in order to run a base model with the kitchen sink thrown in. We’ll take a look at the skewness (or kurtosis of each variable) to determine if any transformations are necessary to maximize the predictive power of a given feature.

nba$Salary <- as.numeric(gsub("[\\$,]", "", nba$Salary))                # Convert salary to int value

nba <- nba %>% 
        mutate("PPM" = PTS / MP, "APM" = AST/MP)                        # Calculate Points and Assists / Minute
        
# Numeric variables only
nba.reduced <- nba %>% 
        dplyr::select(!c("Rk", "Player", "Tm", "Pos")) %>% 
        na.omit()

sapply(nba.reduced, function(x) skew(x))                                # Calculate column-wise kurtosis

##        Age          G         GS         MP         FG        FGA        FG% 
##  0.4651902 -0.2181294  1.1229062  0.5015262  1.3012428  1.2444990 -0.1581002 
##         3P        3PA        3P%         2P        2PA        2P%       eFG% 
##  1.5418959  1.4344909 -0.3863949  1.6422724  1.5373924 -0.6220154 -1.4119478 
##         FT        FTA        FT%        ORB        DRB        TRB        AST 
##  2.6963236  2.4969941 -1.6425254  2.3163317  1.5360696  1.6405875  2.1501044 
##        STL        BLK        TOV         PF        PTS     Salary        PPM 
##  1.1502839  3.0530029  1.7197048  0.5549311  1.4245024  1.6709484  0.7747620 
##        APM 
##  1.1954246

We’ll lean on the > | 1.0 | rule of thumb - in other words, if the skewness of any one features is greater than 1.0 or less than -1.0, we’ll consider it eligible to transformed in our model.

Next, let’s plot a few of these distributions to get a better idea of what we’re looking at (i.e., how these observations are distributed).

# Plot Age distribution
fastHistogram(nba$Age, "Age%", 25)

After scaling Age, it’s evident that this distribution is slightly right-skewed (in other words, the majority of the data falls below the mean). We’ll attempt to log-transform this variable to approximate normality, though if the transformation yields any infinite values, we won’t be able to effectively perform the log-transform …. but more on that in a minute.

# Plot 3P%
fastHistogram(nba$`3P%`, "3-Point %", 40)

3-Point% is slightly left-skewed, and seems to have several severe outliers. This most likely represents low-volume shooters (for example, players that take and make one 3-pointer would be an extreme outlier). We likely will not need to transform this variable, as a log-transformation will not impact the extreme positive outliers observed here.

# Plot salary distribution
fastHistogram(nba$Salary, "Salary", 65)

Salary is observed to be massively right-skewed, such that the majority of observations are less than the mean average of the total distribution. There are noticeable outliers in this distribution, with several observations 3 or more standard deviations from the mean. This is a good indicator that, to obtain a robust result, we’ll need to log-transform player salaries in our linear model.

Modeling

To start, let’s include numeric variables in the observed data. That will give us a base.model that we can then optimize.

base.model <- lm(log(Salary) ~ ., data = nba.reduced)                   # Build model with all features
summary(base.model)

## 
## Call:
## lm(formula = log(Salary) ~ ., data = nba.reduced)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.3991 -0.6826  0.1208  0.7190  3.2659 
## 
## Coefficients: (4 not defined because of singularities)
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.9363668  0.6085755  17.970  < 2e-16 ***
## Age          0.0775823  0.0125862   6.164 1.32e-09 ***
## G           -0.0048601  0.0064574  -0.753 0.451973    
## GS           0.0055850  0.0046099   1.212 0.226183    
## MP           0.0009650  0.0004813   2.005 0.045406 *  
## FG          -0.0060011  0.0046280  -1.297 0.195241    
## FGA          0.0025386  0.0024982   1.016 0.309974    
## `FG%`        7.3683658  2.2946001   3.211 0.001394 ** 
## `3P`        -0.0061923  0.0099819  -0.620 0.535261    
## `3PA`        0.0033786  0.0044025   0.767 0.443142    
## `3P%`        0.2690149  0.6308078   0.426 0.669928    
## `2P`                NA         NA      NA       NA    
## `2PA`               NA         NA      NA       NA    
## `2P%`       -0.5453082  0.7905613  -0.690 0.490609    
## `eFG%`      -5.4557201  2.2096746  -2.469 0.013832 *  
## FT          -0.0003526  0.0056329  -0.063 0.950114    
## FTA         -0.0002607  0.0048619  -0.054 0.957261    
## `FT%`       -0.2374785  0.3929910  -0.604 0.545888    
## ORB         -0.0012013  0.0031488  -0.381 0.702973    
## DRB          0.0021340  0.0012426   1.717 0.086444 .  
## TRB                 NA         NA      NA       NA    
## AST         -0.0041606  0.0017121  -2.430 0.015392 *  
## STL          0.0040577  0.0041256   0.984 0.325746    
## BLK          0.0050401  0.0038253   1.318 0.188162    
## TOV          0.0018591  0.0041573   0.447 0.654906    
## PF          -0.0031449  0.0025984  -1.210 0.226639    
## PTS                 NA         NA      NA       NA    
## PPM          1.1820742  0.7023505   1.683 0.092902 .  
## APM          6.9676172  1.8146754   3.840 0.000137 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.185 on 587 degrees of freedom
## Multiple R-squared:  0.3997, Adjusted R-squared:  0.3751 
## F-statistic: 16.28 on 24 and 587 DF,  p-value: < 2.2e-16

The coefficients in the base model appear to be weakly predictive, at best. This is due in part to the combination of variables. To yield a better, more predictive model, we’re going to employ AIC Stepwise Variable Selection to incorporate different combinations of variables, with our final model containing the most robust set of predictors.

# Log-transform points and 2-point attempts
base.model <- stats::update(base.model, . ~ . -PTS +log(PTS) -`2PA` +log(`2PA`))

# Select optimal variables via AIC stepwise selection
step.model <- MASS::stepAIC(base.model, 
                            direction = "both",
                            k = 2,
                            trace = F, 
                            steps = 1000)

summary(step.model)

## 
## Call:
## lm(formula = log(Salary) ~ Age + G + GS + `3PA` + `FT%` + DRB + 
##     BLK + PPM + APM + log(`2PA`), data = nba.reduced)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4762 -0.6296  0.1195  0.7398  3.0722 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.7607978  0.4325826  24.876  < 2e-16 ***
## Age          0.0700988  0.0113216   6.192 1.10e-09 ***
## G           -0.0278480  0.0050946  -5.466 6.75e-08 ***
## GS           0.0048627  0.0034604   1.405  0.16047    
## `3PA`        0.0019174  0.0005897   3.252  0.00121 ** 
## `FT%`       -0.4846788  0.3279999  -1.478  0.14002    
## DRB          0.0014435  0.0008635   1.672  0.09510 .  
## BLK          0.0048360  0.0032439   1.491  0.13654    
## PPM         -0.9128554  0.4905971  -1.861  0.06327 .  
## APM          2.0798548  1.0122942   2.055  0.04035 *  
## log(`2PA`)   0.7309538  0.0779231   9.380  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.12 on 601 degrees of freedom
## Multiple R-squared:  0.4506, Adjusted R-squared:  0.4414 
## F-statistic: 49.29 on 10 and 601 DF,  p-value: < 2.2e-16

Our AIC selection algorithm yields an interesting set of predictors. We observe an R-squared value of 0.4414; put simply, our model accounts for 44.14% of the variance in player salary. Considering the swath of variance that we can’t account for given the present data (injury, market value, salary cap, etc.) this is actually pretty decent! Let’s apply this finished model to our full dataset to see how teams and players fare.

Predictions

Now that we have an optimized model, let’s apply it to our data. This will give us a PREDICTIONS column, which represents an estimation of what each player should be paid based on their individual stat lines.

nba["PREDICTIONS"] <- exp(predict(step.model, nba))                     # Predict salaries with our optimal model

# Define custom X + Y labels
x_labels <- c("$0", "$10M", "$20M", "$30M", "$40M")
y_labels <- c("$0", "$20M", "$40M", "$60M")

# Plot actual vs. predicted salaries
salary.plot <- nba %>%
                        ggplot(aes(x = Salary, y = PREDICTIONS, color=scale(PTS))) +
                        geom_point(alpha=0.65, size=(nba$MP / 400)) +
                        geom_smooth(color="white", alpha=0.65) +
                        theme_minimal() +
                        labs(x = "Actual Salary", 
                             y = "Predicted Salary",
                             title = "Salary Prediction Model",
                             color = "Scaled Points") +
                        theme(plot.title = element_text(hjust = 0.5, face="bold")) +
        scale_x_continuous(labels = x_labels) + scale_y_continuous(labels = y_labels) +
        theme(axis.text.x = element_text(hjust = 1),
              axis.text.y = element_text(vjust = -1))

salary.plot

We observe a moderately close fit between the the observed data points and the line of best fit in this plot. While there appears to be a general trend of points scored correlating with salary, this model seems to inflate player salaries slightly, such that players’ predicted salaries tend to be higher than their actual salaries. This is most likely due to the lack of control parameters built in to this model (for example, salary cap is not enforced in this environment, which provides no ceiling for the model to work under).

Determining Player Value

How will this work?

Using the predicted values from our model, we’ll calculate the differential from actual salaries to create a Salary.Differential variable. If this variable is positive - i.e., the predicted salary is higher than the actual salary - we may assert that this player is underpaid. Conversely, if the predicted salary is lower than the actual salary, it would suggest that the player is overpaid.

nba <- nba %>%
        mutate("Salary.Differntial" = PREDICTIONS - Salary,
               "Overpaid" = ifelse(Salary.Differntial < 0, "Overpaid", "Underpaid"))

nba %>% 
        dplyr::filter(Overpaid != "NA") %>% 
        ggplot(aes(x = Overpaid, fill=Overpaid)) +
        geom_bar(color="white", alpha=0.85) +
        labs(x = "Salary Differential", 
             y ="",
             title = "Salary Differential") +
        theme_minimal() +
        theme(legend.position = "none",
              axis.title.x = element_blank(),
              axis.text.x = element_text(face="bold"),
              plot.title = element_text(face = "bold", hjust = 0.5)) +
        scale_fill_manual(values = c("royalblue4", "dodgerblue3"))

Underpaid Players

First, let’s look at players whose projected salaries are higher than what they are actually paid. We’ll consider these players underpaid:

nba %>%
        dplyr::select(Player, Age, Salary, PREDICTIONS, Salary.Differntial) %>% 
        arrange(desc(Salary.Differntial)) %>% 
        head(10) %>% 
        kable() %>% kable_styling(kable.format)

Player	Age	Salary	PREDICTIONS	Salary.Differntial
LeBron James	35	37436858	58766387	21329529
Carmelo Anthony	35	2159029	18770254	16611225
Pascal Siakam	25	2351839	17660671	15308832
Luka Dončić	20	7683360	19347889	11664529
Devonte’ Graham	24	1416852	11960571	10543719
Domantas Sabonis	23	3529555	13170514	9640959
Bam Adebayo	22	3454080	12539230	9085150
James Harden	30	37800000	46682005	8882005
Buddy Hield	27	4861208	12982360	8121152
Donovan Mitchell	23	3635760	11663591	8027831

Obvious points: You can’t overpay LeBron James. Whatever the league will allow you to pay him, pay him 150% of that. Similarly, the other players on this list include Luka, Siakam, and Bam Adebayo - in other words, young players who either haven’t hit the bank yet or are playing above their value.

Less obvious: Carmelo’s inclusion on this list. Our model indicates that his stat line should have put him around the $16M / year range. His actual salary is relatively modest in comparison, which makes him a bit of a bargain.

Overpaid Players

Here we’ll take the opposite approach, and explore players whose predicted salaries are less than their actual salaries. We’ll filter out Steph Curry since he was injured all year and makes lots of money. An outlier if I’ve ever seen one!

nba %>%
        dplyr::select(Player, Age, Salary, PREDICTIONS, Salary.Differntial) %>% 
        dplyr::filter(Player != "Stephen Curry") %>% 
        arrange((Salary.Differntial)) %>% 
        head(10) %>% 
        kable() %>% kable_styling(kable.format)

Player	Age	Salary	PREDICTIONS	Salary.Differntial
Blake Griffin	30	34234964	7035132	-27199832
Otto Porter	26	27250576	3190033	-24060543
Kyrie Irving	27	31742000	8299659	-23442341
D’Angelo Russell	23	27285000	3861536	-23423464
Andre Drummond	26	27093019	3926889	-23166130
Andrew Wiggins	24	27504630	4578262	-22926368
Nicolas Batum	31	25565217	2683952	-22881265
Mike Conley	32	32511623	11294960	-21216663
Paul Millsap	34	30500000	10148431	-20351569
Paul George	29	33005556	13783402	-19222154

Obvious points: Relative to the underpaid player list, the majority of these players are in their late 20’s and early 30’s. This point is twofold - older players are more likely to have higher salaries (i.e., non-rookie deals), and older players are generally less likely to have robust stat lines relative to younger players (see: Blake Griffin on both accounts).

Less obvious: Westbrook (57 games played) and Kyrie (20 games played). The Westbrook outcome is especially interesting - it’d be interesting to approach this question slightly differently, to see how Westbrook stacks up in years that he plays the full season. He’s a former MVP, but he can easily drop to a below-average player at times. Surely this has been quantified by other researchers, but still worth a shot!

Team Cap Management

Lastly, let’s track how teams look overall - how many of their players are overpaid and how many of their teams are underpaid. We’ll calculate the percentage of each roster that is made up of overpaid players, then we’ll observe the “bottom 10” - i.e., the 10 teams that have the highest percentage of overpaid players.

nba %>% 
        group_by(Tm) %>% 
        filter(Tm != "TOT") %>% 
        summarise(Players = n(),
                  Underpaid.Players = sum(Overpaid == "Underpaid", na.rm = T),
                  Overpaid.Players = sum(Overpaid == "Overpaid", na.rm = T),
                  Pct.Overpaid = (Overpaid.Players / Players)) %>% 
        head(10) %>% 
        arrange(desc(Pct.Overpaid)) %>% kable() %>% kable_styling(kable.format)

Tm	Players	Underpaid.Players	Overpaid.Players	Pct.Overpaid
CHI	15	4	11	0.7333333
DAL	18	6	12	0.6666667
BOS	15	6	9	0.6000000
DEN	21	7	11	0.5238095
CLE	21	9	10	0.4761905
ATL	19	8	9	0.4736842
DET	24	12	10	0.4166667
CHO	18	11	7	0.3888889
GSW	24	17	7	0.2916667
BRK	25	19	5	0.2000000

Seems to be a bit of a mixed bag, just by the eye test. Of the bottom-10 teams with a high-percentage of overpaid players, some of them are competitive - Dallas, Boston, Denver - while others are competing for the lottery every year (such as Cleveland and Detroit). Intuitively, it seems like it’s more important to pay the right players, even if those players are large cap expenditures.

Dallas is a good example of this, as they’re able to get amazing output from Luka while only paying him ~$7M annually.

nba %>% 
        dplyr::filter(Tm == "DAL") %>% 
        dplyr::select(Player, Age, Pos, Salary) %>% 
        arrange(desc(Salary)) %>% head(5) %>% kable() %>% kable_styling(kable.format)

Player	Age	Pos	Salary
Kristaps Porziņģis	24	PF	27285000
Michael Kidd-Gilchrist	26	SF	13000000
Courtney Lee	34	SG	12759670
Dwight Powell	28	C	10259375
Delon Wright	27	SG	9473684

Future Directions

Calculate the correlation between “overpaid player percentage” and win total in the regular season
Fix the dataset to minimize NA values in the model (and, in turn, yield a higher R-squared value)

Thanks for reading!
//IRF