library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.2.3
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(reshape2)
I’m interested in how well NBA players’ salaries can be predicted from their in-game performance. Fortunately, I found a dataset on Kaggle that includes salary information alongside many such performance metrics:
https://www.kaggle.com/datasets/jamiewelsh2/nba-player-salaries-2022-23-season
df <- read.csv('nba_2022-23_all_stats_with_salary.csv')
First, I’m going to filter the data down to players who played at least 50 games. This gives us an appropriate sample size and improves the likelihood that a player’s performance “matches” their salary (for example, it ensures we are not basing our model on players who earn a lot but were injured for the majority of the season, or on players who scored many points in a few outlier games).
df <- df %>%
filter(GP >= 50)
Because we need a quadratic variable for this exercise, I suspect points-per-game could have such a relationship with salary. Since the top-tier players make so much money, I could see salaries going up by many millions of dollars for what seem like relatively small increases in points-per-game at the highest levels. I can gut-check this visually:
ggplot(data=df, aes(x=PTS, y=Salary)) +
geom_point() +
geom_smooth(method = "lm", formula = y ~ poly(x, 2), color = "blue")
And numerically:
quad_model <- lm(Salary~I(PTS^2), data=df)
summary(quad_model)
##
## Call:
## lm(formula = Salary ~ I(PTS^2), data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19841386 -4130915 -1841596 3301247 33276790
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3943706 662002 5.957 8.33e-09 ***
## I(PTS^2) 39000 2267 17.205 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7985000 on 259 degrees of freedom
## Multiple R-squared: 0.5333, Adjusted R-squared: 0.5315
## F-statistic: 296 on 1 and 259 DF, p-value: < 2.2e-16
It’s far from perfect, but it will do!
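As a quick sanity check (a sketch; this comparison isn’t part of the original output, and linear_model and full_quad are new names introduced here), we can ask whether the squared term actually beats a plain linear fit:
# Compare a linear fit, the pure-quadratic fit above, and a full quadratic
# (linear + squared term); lower AIC means a better fit/complexity trade-off
linear_model <- lm(Salary ~ PTS, data = df)
full_quad <- lm(Salary ~ PTS + I(PTS^2), data = df)
AIC(linear_model, quad_model, full_quad)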
Next, we need a binary categorical variable, of which there are none natively in the data. However, I believe age is a good candidate here, since it makes sense that older players (defined relatively) earn more, through a combination of improved performance over time and rules around paying veterans negotiated by the players’ union.
median(df$Age)
## [1] 25
mean(df$Age)
## [1] 26.02682
The median age of 25 seems like a good cutoff, even though it is lower than the average, because it’s close to the age at which many players start their second contracts. After a few years on a rookie deal (the smallest contracts in the league), good players often “get paid” around 25, once they have proven themselves on that smaller deal. I’ll make a new categorical variable with that cutoff:
df$below_25_yo <- ifelse(df$Age<25, 1, 0)
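As a quick check (not in the original output), it’s worth confirming the cutoff doesn’t produce a badly lopsided split:
# Count players on each side of the age-25 cutoff
table(df$below_25_yo)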
And see if there are categorical differences:
#Ensure below_25_yo is treated as categorical for graphing
df$below_25_yo <- factor(df$below_25_yo)
ggplot(df, aes(x=below_25_yo, y=Salary)) +
geom_boxplot()
We can see that while some below-25 players manage to make very high salaries, the distributions make clear that older players are more likely to be earning higher salaries.
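To put numbers behind the boxplot (a sketch; these summaries aren’t in the original output), we can compare the two groups directly:
# Median and mean salary by age group, plus group sizes
df %>%
  group_by(below_25_yo) %>%
  summarise(median_salary = median(Salary),
            mean_salary = mean(Salary),
            n = n())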
Now, we can gut-check which variables we may want to include or not based on correlations among them:
numeric_data <- df[sapply(df, is.numeric)]
cor_matrix <- cor(numeric_data)
cor_data <- melt(cor_matrix)
ggplot(cor_data, aes(Var1, Var2, fill = value)) +
geom_tile() +
scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0, limit = c(-1, 1)) +
theme_minimal() +
labs(x = "Variables", y = "Variables", fill = "Correlation") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
There is a LOT of collinearity in this data, which does make sense: players who score a lot are likely the ones who play a lot of minutes, which allows them to rack up other stats as well. There are some variables that don’t correlate with points (like turnovers and rebounds), but those, interestingly, also don’t seem to correlate with salary. While some individual players may earn their pay in those areas, in aggregate those stats are apparently just not real drivers of earnings.
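To make the heatmap concrete (a sketch; this isn’t in the original output), we can pull out just the correlations with Salary and rank them:
# Rank each numeric stat by its raw correlation with Salary
salary_cor <- cor_matrix[, "Salary"]
sort(salary_cor, decreasing = TRUE)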
So: I’ll include PTS as my quadratic variable, my new “below 25 years old” variable as my categorical variable, and the interaction of below_25_yo and PTS, since I suspect they are related.
model <- lm(Salary~I(PTS^2) + below_25_yo + below_25_yo*PTS, data=df)
summary(model)
##
## Call:
## lm(formula = Salary ~ I(PTS^2) + below_25_yo + below_25_yo *
## PTS, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17862849 -3675334 -754673 2320547 27367133
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2330393 1878566 -1.241 0.216
## I(PTS^2) 4741 7915 0.599 0.550
## below_25_yo1 1509697 1720027 0.878 0.381
## PTS 1310992 263958 4.967 1.25e-06 ***
## below_25_yo1:PTS -774295 128372 -6.032 5.65e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6540000 on 256 degrees of freedom
## Multiple R-squared: 0.6906, Adjusted R-squared: 0.6857
## F-statistic: 142.8 on 4 and 256 DF, p-value: < 2.2e-16
Only PTS and the interaction of PTS and below_25_yo had a significant effect. For players 25 and older (the reference group), a 1-point increase in scoring per game was associated with a roughly $1.3 million higher salary. For players below 25, the interaction term subtracts about $774k from that slope, so each additional point per game is worth only around $537k for the younger group. Not bad! The adjusted \(R^2\) of 0.69 shows decently strong predictive value for the model.
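To see what the fitted model implies in practice (an illustrative sketch; the hypothetical players here are made up), we can predict salaries for a 20-points-per-game scorer on each side of the age cutoff:
# Predicted salary for a hypothetical 20-PPG player, age 25+ vs. below 25
new_players <- data.frame(PTS = c(20, 20),
                          below_25_yo = factor(c(0, 1), levels = c(0, 1)))
predict(model, newdata = new_players)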
Now we can run a residuals analysis to check some assumptions.
residuals <- residuals(model)
plot(fitted(model), residuals,
xlab = "Fitted values", ylab = "Residuals",
main = "Residuals vs Fitted")
abline(h = 0, col = "red")
Unfortunately, I’m interpreting mixed results here. Across the right half of the graph the residuals appear randomly distributed, but they also appear to be “fanning out” from left to right, suggesting some degree of heteroscedasticity. This would violate a key assumption for regression.
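For a more formal check on that fanning pattern (a sketch; the lmtest package isn’t loaded above and would need to be installed), the Breusch-Pagan test puts a p-value on the visual impression:
# Breusch-Pagan test for heteroscedasticity; a small p-value
# supports what the fan shape suggests visually
# install.packages("lmtest")
library(lmtest)
bptest(model)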
qqnorm(residuals)
qqline(residuals, col = "red")
Similarly, our Q-Q plot shows a mostly linear relationship, but it also suggests a somewhat non-normal distribution of residuals. I think I would need to make some updates to trust this model more fully.
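One natural update to try (a sketch, not a final model): since salaries are strongly right-skewed, modeling log(Salary) instead often reduces both the fanning and the non-normality seen above:
# Refit the same specification on the log of salary
log_model <- lm(log(Salary) ~ I(PTS^2) + below_25_yo + below_25_yo * PTS, data = df)
summary(log_model)$adj.r.squared  # fit on the transformed scale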