# Imports
library(tidyverse)
library(MASS)
library(Rfast)
library(kableExtra)
# Read in CSV from .py script
nba <- read_csv("Stats-and-Salaries.csv")
# Custom histogram function
fastHistogram <- function(FEAT, NAME, BINS=30) {
nba %>% ggplot(aes(x = scale(FEAT))) +
geom_histogram(color="white", bins = BINS, fill="royalblue4", alpha=0.92) +
labs(x = paste("Scaled Feature: ", NAME), y = "") +
theme_minimal()
}
First, we’ll isolate the numeric variables in this dataset in order to run a base model with the kitchen sink thrown in. We’ll take a look at the skewness (or kurtosis of each variable) to determine if any transformations are necessary to maximize the predictive power of a given feature.
nba$Salary <- as.numeric(gsub("[\\$,]", "", nba$Salary)) # Convert salary to int value
nba <- nba %>%
mutate("PPM" = PTS / MP, "APM" = AST/MP) # Calculate Points and Assists / Minute
# Numeric variables only
nba.reduced <- nba %>%
dplyr::select(!c("Rk", "Player", "Tm", "Pos")) %>%
na.omit()
sapply(nba.reduced, function(x) skew(x)) # Calculate column-wise kurtosis
## Age G GS MP FG FGA FG%
## 0.4651902 -0.2181294 1.1229062 0.5015262 1.3012428 1.2444990 -0.1581002
## 3P 3PA 3P% 2P 2PA 2P% eFG%
## 1.5418959 1.4344909 -0.3863949 1.6422724 1.5373924 -0.6220154 -1.4119478
## FT FTA FT% ORB DRB TRB AST
## 2.6963236 2.4969941 -1.6425254 2.3163317 1.5360696 1.6405875 2.1501044
## STL BLK TOV PF PTS Salary PPM
## 1.1502839 3.0530029 1.7197048 0.5549311 1.4245024 1.6709484 0.7747620
## APM
## 1.1954246
We’ll lean on the > | 1.0 | rule of thumb - in other words, if the skewness of any one features is greater than 1.0 or less than -1.0, we’ll consider it eligible to transformed in our model.
Next, let’s plot a few of these distributions to get a better idea of what we’re looking at (i.e., how these observations are distributed).
# Plot Age distribution
fastHistogram(nba$Age, "Age%", 25)
After scaling Age, it’s evident that this distribution is slightly right-skewed (in other words, the majority of the data falls below the mean). We’ll attempt to log-transform this variable to approximate normality, though if the transformation yields any infinite values, we won’t be able to effectively perform the log-transform …. but more on that in a minute.
# Plot 3P%
fastHistogram(nba$`3P%`, "3-Point %", 40)
3-Point% is slightly left-skewed, and seems to have several severe outliers. This most likely represents low-volume shooters (for example, players that take and make one 3-pointer would be an extreme outlier). We likely will not need to transform this variable, as a log-transformation will not impact the extreme positive outliers observed here.
# Plot salary distribution
fastHistogram(nba$Salary, "Salary", 65)
Salary is observed to be massively right-skewed, such that the majority of observations are less than the mean average of the total distribution. There are noticeable outliers in this distribution, with several observations 3 or more standard deviations from the mean. This is a good indicator that, to obtain a robust result, we’ll need to log-transform player salaries in our linear model.
To start, let’s include numeric variables in the observed data. That will give us a base.model that we can then optimize.
base.model <- lm(log(Salary) ~ ., data = nba.reduced) # Build model with all features
summary(base.model)
##
## Call:
## lm(formula = log(Salary) ~ ., data = nba.reduced)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.3991 -0.6826 0.1208 0.7190 3.2659
##
## Coefficients: (4 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.9363668 0.6085755 17.970 < 2e-16 ***
## Age 0.0775823 0.0125862 6.164 1.32e-09 ***
## G -0.0048601 0.0064574 -0.753 0.451973
## GS 0.0055850 0.0046099 1.212 0.226183
## MP 0.0009650 0.0004813 2.005 0.045406 *
## FG -0.0060011 0.0046280 -1.297 0.195241
## FGA 0.0025386 0.0024982 1.016 0.309974
## `FG%` 7.3683658 2.2946001 3.211 0.001394 **
## `3P` -0.0061923 0.0099819 -0.620 0.535261
## `3PA` 0.0033786 0.0044025 0.767 0.443142
## `3P%` 0.2690149 0.6308078 0.426 0.669928
## `2P` NA NA NA NA
## `2PA` NA NA NA NA
## `2P%` -0.5453082 0.7905613 -0.690 0.490609
## `eFG%` -5.4557201 2.2096746 -2.469 0.013832 *
## FT -0.0003526 0.0056329 -0.063 0.950114
## FTA -0.0002607 0.0048619 -0.054 0.957261
## `FT%` -0.2374785 0.3929910 -0.604 0.545888
## ORB -0.0012013 0.0031488 -0.381 0.702973
## DRB 0.0021340 0.0012426 1.717 0.086444 .
## TRB NA NA NA NA
## AST -0.0041606 0.0017121 -2.430 0.015392 *
## STL 0.0040577 0.0041256 0.984 0.325746
## BLK 0.0050401 0.0038253 1.318 0.188162
## TOV 0.0018591 0.0041573 0.447 0.654906
## PF -0.0031449 0.0025984 -1.210 0.226639
## PTS NA NA NA NA
## PPM 1.1820742 0.7023505 1.683 0.092902 .
## APM 6.9676172 1.8146754 3.840 0.000137 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.185 on 587 degrees of freedom
## Multiple R-squared: 0.3997, Adjusted R-squared: 0.3751
## F-statistic: 16.28 on 24 and 587 DF, p-value: < 2.2e-16
The coefficients in the base model appear to be weakly predictive, at best. This is due in part to the combination of variables. To yield a better, more predictive model, we’re going to employ AIC Stepwise Variable Selection to incorporate different combinations of variables, with our final model containing the most robust set of predictors.
# Log-transform points and 2-point attempts
base.model <- stats::update(base.model, . ~ . -PTS +log(PTS) -`2PA` +log(`2PA`))
# Select optimal variables via AIC stepwise selection
step.model <- MASS::stepAIC(base.model,
direction = "both",
k = 2,
trace = F,
steps = 1000)
summary(step.model)
##
## Call:
## lm(formula = log(Salary) ~ Age + G + GS + `3PA` + `FT%` + DRB +
## BLK + PPM + APM + log(`2PA`), data = nba.reduced)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4762 -0.6296 0.1195 0.7398 3.0722
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.7607978 0.4325826 24.876 < 2e-16 ***
## Age 0.0700988 0.0113216 6.192 1.10e-09 ***
## G -0.0278480 0.0050946 -5.466 6.75e-08 ***
## GS 0.0048627 0.0034604 1.405 0.16047
## `3PA` 0.0019174 0.0005897 3.252 0.00121 **
## `FT%` -0.4846788 0.3279999 -1.478 0.14002
## DRB 0.0014435 0.0008635 1.672 0.09510 .
## BLK 0.0048360 0.0032439 1.491 0.13654
## PPM -0.9128554 0.4905971 -1.861 0.06327 .
## APM 2.0798548 1.0122942 2.055 0.04035 *
## log(`2PA`) 0.7309538 0.0779231 9.380 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.12 on 601 degrees of freedom
## Multiple R-squared: 0.4506, Adjusted R-squared: 0.4414
## F-statistic: 49.29 on 10 and 601 DF, p-value: < 2.2e-16
Our AIC selection algorithm yields an interesting set of predictors. We observe an R-squared value of 0.4414; put simply, our model accounts for 44.14% of the variance in player salary. Considering the swath of variance that we can’t account for given the present data (injury, market value, salary cap, etc.) this is actually pretty decent! Let’s apply this finished model to our full dataset to see how teams and players fare.
Now that we have an optimized model, let’s apply it to our data. This will give us a PREDICTIONS column, which represents an estimation of what each player should be paid based on their individual stat lines.
nba["PREDICTIONS"] <- exp(predict(step.model, nba)) # Predict salaries with our optimal model
# Define custom X + Y labels
x_labels <- c("$0", "$10M", "$20M", "$30M", "$40M")
y_labels <- c("$0", "$20M", "$40M", "$60M")
# Plot actual vs. predicted salaries
salary.plot <- nba %>%
ggplot(aes(x = Salary, y = PREDICTIONS, color=scale(PTS))) +
geom_point(alpha=0.65, size=(nba$MP / 400)) +
geom_smooth(color="white", alpha=0.65) +
theme_minimal() +
labs(x = "Actual Salary",
y = "Predicted Salary",
title = "Salary Prediction Model",
color = "Scaled Points") +
theme(plot.title = element_text(hjust = 0.5, face="bold")) +
scale_x_continuous(labels = x_labels) + scale_y_continuous(labels = y_labels) +
theme(axis.text.x = element_text(hjust = 1),
axis.text.y = element_text(vjust = -1))
salary.plot
We observe a moderately close fit between the the observed data points and the line of best fit in this plot. While there appears to be a general trend of points scored correlating with salary, this model seems to inflate player salaries slightly, such that players’ predicted salaries tend to be higher than their actual salaries. This is most likely due to the lack of control parameters built in to this model (for example, salary cap is not enforced in this environment, which provides no ceiling for the model to work under).
How will this work?
Using the predicted values from our model, we’ll calculate the differential from actual salaries to create a Salary.Differential variable. If this variable is positive - i.e., the predicted salary is higher than the actual salary - we may assert that this player is underpaid. Conversely, if the predicted salary is lower than the actual salary, it would suggest that the player is overpaid.
nba <- nba %>%
mutate("Salary.Differntial" = PREDICTIONS - Salary,
"Overpaid" = ifelse(Salary.Differntial < 0, "Overpaid", "Underpaid"))
nba %>%
dplyr::filter(Overpaid != "NA") %>%
ggplot(aes(x = Overpaid, fill=Overpaid)) +
geom_bar(color="white", alpha=0.85) +
labs(x = "Salary Differential",
y ="",
title = "Salary Differential") +
theme_minimal() +
theme(legend.position = "none",
axis.title.x = element_blank(),
axis.text.x = element_text(face="bold"),
plot.title = element_text(face = "bold", hjust = 0.5)) +
scale_fill_manual(values = c("royalblue4", "dodgerblue3"))
First, let’s look at players whose projected salaries are higher than what they are actually paid. We’ll consider these players underpaid:
nba %>%
dplyr::select(Player, Age, Salary, PREDICTIONS, Salary.Differntial) %>%
arrange(desc(Salary.Differntial)) %>%
head(10) %>%
kable() %>% kable_styling(kable.format)
| Player | Age | Salary | PREDICTIONS | Salary.Differntial |
|---|---|---|---|---|
| LeBron James | 35 | 37436858 | 58766387 | 21329529 |
| Carmelo Anthony | 35 | 2159029 | 18770254 | 16611225 |
| Pascal Siakam | 25 | 2351839 | 17660671 | 15308832 |
| Luka Dončić | 20 | 7683360 | 19347889 | 11664529 |
| Devonte’ Graham | 24 | 1416852 | 11960571 | 10543719 |
| Domantas Sabonis | 23 | 3529555 | 13170514 | 9640959 |
| Bam Adebayo | 22 | 3454080 | 12539230 | 9085150 |
| James Harden | 30 | 37800000 | 46682005 | 8882005 |
| Buddy Hield | 27 | 4861208 | 12982360 | 8121152 |
| Donovan Mitchell | 23 | 3635760 | 11663591 | 8027831 |
Obvious points: You can’t overpay LeBron James. Whatever the league will allow you to pay him, pay him 150% of that. Similarly, the other players on this list include Luka, Siakam, and Bam Adebayo - in other words, young players who either haven’t hit the bank yet or are playing above their value.
Less obvious: Carmelo’s inclusion on this list. Our model indicates that his stat line should have put him around the $16M / year range. His actual salary is relatively modest in comparison, which makes him a bit of a bargain.
Here we’ll take the opposite approach, and explore players whose predicted salaries are less than their actual salaries. We’ll filter out Steph Curry since he was injured all year and makes lots of money. An outlier if I’ve ever seen one!
nba %>%
dplyr::select(Player, Age, Salary, PREDICTIONS, Salary.Differntial) %>%
dplyr::filter(Player != "Stephen Curry") %>%
arrange((Salary.Differntial)) %>%
head(10) %>%
kable() %>% kable_styling(kable.format)
| Player | Age | Salary | PREDICTIONS | Salary.Differntial |
|---|---|---|---|---|
| Blake Griffin | 30 | 34234964 | 7035132 | -27199832 |
| Otto Porter | 26 | 27250576 | 3190033 | -24060543 |
| Kyrie Irving | 27 | 31742000 | 8299659 | -23442341 |
| D’Angelo Russell | 23 | 27285000 | 3861536 | -23423464 |
| Andre Drummond | 26 | 27093019 | 3926889 | -23166130 |
| Andrew Wiggins | 24 | 27504630 | 4578262 | -22926368 |
| Nicolas Batum | 31 | 25565217 | 2683952 | -22881265 |
| Mike Conley | 32 | 32511623 | 11294960 | -21216663 |
| Paul Millsap | 34 | 30500000 | 10148431 | -20351569 |
| Paul George | 29 | 33005556 | 13783402 | -19222154 |
Obvious points: Relative to the underpaid player list, the majority of these players are in their late 20’s and early 30’s. This point is twofold - older players are more likely to have higher salaries (i.e., non-rookie deals), and older players are generally less likely to have robust stat lines relative to younger players (see: Blake Griffin on both accounts).
Less obvious: Westbrook (57 games played) and Kyrie (20 games played). The Westbrook outcome is especially interesting - it’d be interesting to approach this question slightly differently, to see how Westbrook stacks up in years that he plays the full season. He’s a former MVP, but he can easily drop to a below-average player at times. Surely this has been quantified by other researchers, but still worth a shot!
Lastly, let’s track how teams look overall - how many of their players are overpaid and how many of their teams are underpaid. We’ll calculate the percentage of each roster that is made up of overpaid players, then we’ll observe the “bottom 10” - i.e., the 10 teams that have the highest percentage of overpaid players.
nba %>%
group_by(Tm) %>%
filter(Tm != "TOT") %>%
summarise(Players = n(),
Underpaid.Players = sum(Overpaid == "Underpaid", na.rm = T),
Overpaid.Players = sum(Overpaid == "Overpaid", na.rm = T),
Pct.Overpaid = (Overpaid.Players / Players)) %>%
head(10) %>%
arrange(desc(Pct.Overpaid)) %>% kable() %>% kable_styling(kable.format)
| Tm | Players | Underpaid.Players | Overpaid.Players | Pct.Overpaid |
|---|---|---|---|---|
| CHI | 15 | 4 | 11 | 0.7333333 |
| DAL | 18 | 6 | 12 | 0.6666667 |
| BOS | 15 | 6 | 9 | 0.6000000 |
| DEN | 21 | 7 | 11 | 0.5238095 |
| CLE | 21 | 9 | 10 | 0.4761905 |
| ATL | 19 | 8 | 9 | 0.4736842 |
| DET | 24 | 12 | 10 | 0.4166667 |
| CHO | 18 | 11 | 7 | 0.3888889 |
| GSW | 24 | 17 | 7 | 0.2916667 |
| BRK | 25 | 19 | 5 | 0.2000000 |
Seems to be a bit of a mixed bag, just by the eye test. Of the bottom-10 teams with a high-percentage of overpaid players, some of them are competitive - Dallas, Boston, Denver - while others are competing for the lottery every year (such as Cleveland and Detroit). Intuitively, it seems like it’s more important to pay the right players, even if those players are large cap expenditures.
Dallas is a good example of this, as they’re able to get amazing output from Luka while only paying him ~$7M annually.
nba %>%
dplyr::filter(Tm == "DAL") %>%
dplyr::select(Player, Age, Pos, Salary) %>%
arrange(desc(Salary)) %>% head(5) %>% kable() %>% kable_styling(kable.format)
| Player | Age | Pos | Salary |
|---|---|---|---|
| Kristaps Porziņģis | 24 | PF | 27285000 |
| Michael Kidd-Gilchrist | 26 | SF | 13000000 |
| Courtney Lee | 34 | SG | 12759670 |
| Dwight Powell | 28 | C | 10259375 |
| Delon Wright | 27 | SG | 9473684 |
Calculate the correlation between “overpaid player percentage” and win total in the regular season
Fix the dataset to minimize NA values in the model (and, in turn, yield a higher R-squared value)
Thanks for reading!
//IRF