Business Focus

In the National Basketball Association, projecting the appropriate amount of money to invest in players is important when building a team. Knowing the average salary of players at each position, as well as understanding a player's value based on their versatility, can clarify how many resources we should invest. We will look at the variables that have a high correlation with Salary, including Win Shares (WS), Points (PTS), and Total Rebounds (TRB), to name a few. I will create three machine learning models that predict a player's salary: Lasso Regression, Neural Networks, and XG Boost. Finally, I will compare each model's predicted salaries with players' actual salaries and determine which model performed best.

Link to dataset: https://www.kaggle.com/datasets/jamiewelsh2/nba-player-salaries-2022-23-season

library(tidyverse)
library(tseries)
library(forecast)
library(kableExtra)
library(reactable)
library(seasonal)
library(tsibble)
library(openxlsx)
library(readxl)
library(mice)
library(caret)
library(zoo)
library(vtable)
library(lubridate)
library(imputeTS)
library(naniar)
library(timeplyr)
library(rstatix)
library(timetk)
library(glmnet)
library(corrr)
library(corrplot)
library(ggcorrplot)
library(plotly)
library(GGally)
library(car)
library(e1071)
library(neuralnet)
library(reshape2)
df <- read_csv("/Users/mohamedhassan/Downloads/nba_2022-23_all_stats_with_salary.csv")
reactable(df)
glimpse(df)
## Rows: 467
## Columns: 52
## $ ...1            <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, …
## $ `Player Name`   <chr> "Stephen Curry", "John Wall", "Russell Westbrook", "Le…
## $ Salary          <dbl> 48070014, 47345760, 47080179, 44474988, 44119845, 4327…
## $ Position        <chr> "PG", "PG", "PG", "PF", "PF", "SG", "SF", "SF", "PF", …
## $ Age             <dbl> 34, 32, 34, 38, 34, 29, 31, 32, 28, 32, 32, 30, 30, 31…
## $ Team            <chr> "GSW", "LAC", "LAL/LAC", "LAL", "BRK/PHO", "WAS", "LAC…
## $ GP              <dbl> 56, 34, 73, 55, 47, 50, 52, 56, 63, 58, 69, 60, 70, 33…
## $ GS              <dbl> 56, 3, 24, 54, 47, 50, 50, 56, 63, 58, 69, 60, 70, 19,…
## $ MP              <dbl> 34.7, 22.2, 29.1, 35.5, 35.6, 33.5, 33.6, 34.6, 32.1, …
## $ FG              <dbl> 10.0, 4.1, 5.9, 11.1, 10.3, 8.9, 8.6, 8.2, 11.2, 9.6, …
## $ FGA             <dbl> 20.2, 9.9, 13.6, 22.2, 18.3, 17.6, 16.8, 17.9, 20.3, 2…
## $ `FG%`           <dbl> 0.493, 0.408, 0.436, 0.500, 0.560, 0.506, 0.512, 0.457…
## $ `3P`            <dbl> 4.9, 1.0, 1.2, 2.2, 2.0, 1.6, 2.0, 2.8, 0.7, 4.2, 4.4,…
## $ `3PA`           <dbl> 11.4, 3.2, 3.9, 6.9, 4.9, 4.4, 4.8, 7.6, 2.7, 11.3, 10…
## $ `3P%`           <dbl> 0.427, 0.303, 0.311, 0.321, 0.404, 0.365, 0.416, 0.371…
## $ `2P`            <dbl> 5.1, 3.1, 4.7, 8.9, 8.3, 7.3, 6.6, 5.4, 10.5, 5.4, 3.6…
## $ `2PA`           <dbl> 8.8, 6.7, 9.7, 15.3, 13.4, 13.2, 11.9, 10.3, 17.6, 9.4…
## $ `2P%`           <dbl> 0.579, 0.459, 0.487, 0.580, 0.617, 0.552, 0.551, 0.521…
## $ `eFG%`          <dbl> 0.614, 0.457, 0.481, 0.549, 0.614, 0.551, 0.572, 0.536…
## $ FT              <dbl> 4.6, 2.3, 2.8, 4.6, 6.5, 3.8, 4.7, 4.6, 7.9, 8.8, 1.7,…
## $ FTA             <dbl> 5.0, 3.3, 4.3, 5.9, 7.1, 4.6, 5.4, 5.3, 12.3, 9.6, 1.9…
## $ `FT%`           <dbl> 0.915, 0.681, 0.656, 0.768, 0.919, 0.842, 0.871, 0.871…
## $ ORB             <dbl> 0.7, 0.4, 1.2, 1.2, 0.4, 0.8, 1.1, 0.8, 2.2, 0.8, 0.6,…
## $ DRB             <dbl> 5.4, 2.3, 4.6, 7.1, 6.3, 3.1, 5.4, 5.3, 9.6, 4.0, 3.6,…
## $ TRB             <dbl> 6.1, 2.7, 5.8, 8.3, 6.7, 3.9, 6.5, 6.1, 11.8, 4.8, 4.1…
## $ AST             <dbl> 6.3, 5.2, 7.5, 6.8, 5.0, 5.4, 3.9, 5.1, 5.7, 7.3, 2.4,…
## $ STL             <dbl> 0.9, 0.8, 1.0, 0.9, 0.7, 0.9, 1.4, 1.5, 0.8, 0.9, 0.7,…
## $ BLK             <dbl> 0.4, 0.4, 0.5, 0.6, 1.4, 0.7, 0.5, 0.4, 0.8, 0.3, 0.4,…
## $ TOV             <dbl> 3.2, 2.4, 3.5, 3.2, 3.3, 2.9, 1.7, 3.1, 3.9, 3.3, 1.8,…
## $ PF              <dbl> 2.1, 1.7, 2.2, 1.6, 2.1, 2.1, 1.6, 2.8, 3.1, 1.9, 1.9,…
## $ PTS             <dbl> 29.4, 11.4, 15.9, 28.9, 29.1, 23.2, 23.8, 23.8, 31.1, …
## $ `Total Minutes` <dbl> 1941, 755, 2126, 1954, 1672, 1673, 1748, 1939, 2024, 2…
## $ PER             <dbl> 24.1, 13.6, 16.1, 23.9, 25.9, 19.7, 23.9, 19.6, 29.0, …
## $ `TS%`           <dbl> 0.656, 0.498, 0.513, 0.583, 0.677, 0.593, 0.623, 0.588…
## $ `3PAr`          <dbl> 0.564, 0.322, 0.289, 0.309, 0.267, 0.249, 0.287, 0.424…
## $ FTr             <dbl> 0.248, 0.334, 0.317, 0.268, 0.387, 0.260, 0.320, 0.293…
## $ `ORB%`          <dbl> 2.3, 2.1, 4.7, 3.7, 1.2, 2.8, 3.7, 2.6, 7.3, 2.4, 1.9,…
## $ `DRB%`          <dbl> 16.8, 11.4, 16.5, 20.8, 19.5, 9.9, 17.8, 17.0, 30.0, 1…
## $ `TRB%`          <dbl> 9.7, 6.8, 10.8, 12.5, 10.5, 6.5, 10.8, 9.9, 19.1, 7.6,…
## $ `AST%`          <dbl> 30.0, 35.3, 38.6, 33.5, 24.5, 26.6, 19.6, 24.2, 33.2, …
## $ `STL%`          <dbl> 1.3, 1.8, 1.7, 1.2, 1.0, 1.3, 2.0, 2.1, 1.2, 1.2, 1.0,…
## $ `BLK%`          <dbl> 0.9, 1.4, 1.3, 1.4, 3.4, 1.7, 1.4, 0.9, 2.1, 0.8, 1.1,…
## $ `TOV%`          <dbl> 12.5, 17.1, 18.4, 11.6, 13.4, 12.9, 8.1, 13.4, 13.2, 1…
## $ `USG%`          <dbl> 31.0, 27.0, 27.7, 33.3, 30.7, 29.2, 27.0, 29.5, 38.8, …
## $ OWS             <dbl> 5.8, -0.4, -0.6, 3.2, 4.7, 2.2, 4.9, 2.3, 4.9, 8.2, 1.…
## $ DWS             <dbl> 2.0, 0.7, 2.6, 2.4, 2.1, 1.2, 2.2, 2.3, 3.7, 0.8, 1.9,…
## $ WS              <dbl> 7.8, 0.3, 1.9, 5.6, 6.8, 3.4, 7.1, 4.6, 8.6, 9.0, 3.1,…
## $ `WS/48`         <dbl> 0.192, 0.020, 0.044, 0.138, 0.194, 0.099, 0.194, 0.114…
## $ OBPM            <dbl> 7.5, -0.8, 0.3, 5.5, 6.0, 2.9, 5.1, 2.4, 5.8, 8.3, 1.5…
## $ DBPM            <dbl> 0.1, -0.4, -0.1, 0.6, 1.2, -1.2, 0.9, 0.3, 2.7, -1.2, …
## $ BPM             <dbl> 7.5, -1.2, 0.2, 6.1, 7.1, 1.8, 6.1, 2.8, 8.5, 7.1, -0.…
## $ VORP            <dbl> 4.7, 0.1, 1.2, 4.0, 3.9, 1.6, 3.5, 2.3, 5.4, 4.9, 1.0,…
st(df)
Summary Statistics
Variable N Mean Std. Dev. Min Pctl. 25 Pctl. 75 Max
…1 467 233 135 0 116 350 466
Salary 467 8416599 10708118 5849 1782621 10633544 48070014
Age 467 26 4.3 19 23 29 42
GP 467 48 25 1 31 68 83
GS 467 23 27 0 1 46 83
MP 467 20 9.5 1.8 12 28 41
FG 467 3.4 2.5 0 1.6 4.3 11
FGA 467 7.1 5 0 3.3 9.4 22
FG% 466 0.47 0.11 0 0.42 0.51 1
3P 467 1 0.88 0 0.3 1.5 4.9
3PA 467 2.8 2.3 0 1 4.2 11
3P% 454 0.33 0.13 0 0.29 0.39 1
2P 467 2.4 2 0 0.9 3.3 10
2PA 467 4.3 3.6 0 1.8 5.8 18
2P% 463 0.53 0.14 0 0.49 0.59 1
eFG% 466 0.53 0.1 0 0.49 0.58 1
FT 467 1.4 1.6 0 0.5 1.9 10
FTA 467 1.8 1.9 0 0.6 2.3 12
FT% 444 0.75 0.15 0 0.69 0.84 1
ORB 467 0.87 0.75 0 0.4 1.1 5.1
DRB 467 2.7 1.7 0 1.4 3.5 9.6
TRB 467 3.5 2.3 0 1.9 4.5 12
AST 467 2.1 2 0 0.8 2.9 11
STL 467 0.61 0.4 0 0.3 0.8 3
BLK 467 0.38 0.36 0 0.1 0.5 2.5
TOV 467 1.1 0.83 0 0.5 1.5 4.1
PF 467 1.7 0.78 0 1.2 2.2 5
PTS 467 9.1 6.9 0 4.1 12 33
Total Minutes 467 1110 827 2 341 1842 2963
PER 467 13 6.2 -21 10 16 66
TS% 466 0.56 0.1 0 0.52 0.61 1.1
3PAr 466 0.4 0.22 0 0.27 0.55 1
FTr 466 0.25 0.19 0 0.14 0.32 2
ORB% 467 5.2 4.3 0 2.1 7.1 29
DRB% 467 15 6.5 0 11 18 55
TRB% 467 10 4.7 0 6.8 12 30
AST% 467 14 8.6 0 7.5 18 48
STL% 467 1.5 1.4 0 1 1.7 24
BLK% 467 1.8 2.5 0 0.7 2.3 44
TOV% 467 13 7.2 0 9.5 15 100
USG% 467 18 5.9 5.1 14 21 52
OWS 467 1.2 1.8 -1.9 0 1.8 9.6
DWS 467 1.1 0.98 0 0.3 1.8 4.8
WS 467 2.3 2.5 -1.6 0.3 3.5 13
WS/48 467 0.082 0.088 -0.52 0.047 0.13 0.63
OBPM 467 -1.3 3.6 -22 -2.9 0.45 17
DBPM 467 -0.087 2.2 -10 -0.9 0.8 33
BPM 467 -1.4 4.9 -26 -3.3 0.6 49
VORP 467 0.54 1.2 -1.3 -0.1 0.8 6.4
# drop the unnamed row-index column
df <- df %>%
  select(-1)
sum(is.na(df))
## [1] 45
miss_var_summary(df)
## # A tibble: 51 × 3
##    variable    n_miss pct_miss
##    <chr>        <int>    <num>
##  1 FT%             23    4.93 
##  2 3P%             13    2.78 
##  3 2P%              4    0.857
##  4 FG%              1    0.214
##  5 eFG%             1    0.214
##  6 TS%              1    0.214
##  7 3PAr             1    0.214
##  8 FTr              1    0.214
##  9 Player Name      0    0    
## 10 Salary           0    0    
## # ℹ 41 more rows
df[!complete.cases(df), ]
## # A tibble: 34 × 51
##    `Player Name` Salary Position   Age Team     GP    GS    MP    FG   FGA `FG%`
##    <chr>          <dbl> <chr>    <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 Mitchell Rob… 1.70e7 C           24 NYK      59    58  27     3.2   4.7 0.671
##  2 Mason Plumlee 9.08e6 C           32 LAC/…    79    60  26     4.2   6.1 0.68 
##  3 Dylan Windler 4.04e6 SF          26 CLE       3     0   3.3   0.7   1   0.667
##  4 Mark Williams 3.72e6 C           21 CHO      43    17  19.3   3.7   5.8 0.637
##  5 Sterling Bro… 3.12e6 SF          27 LAL       4     0   6     0     1   0    
##  6 Joe Wieskamp  2.91e6 SF          23 TOR       9     0   5.6   0.3   1.6 0.214
##  7 Bismack Biyo… 2.91e6 C           30 PHO      61    14  14.3   2     3.4 0.578
##  8 Udoka Azubui… 2.17e6 C           23 UTA      36     4  10     1.6   2   0.819
##  9 Daniel Gaffo… 1.93e6 C           24 WAS      78    47  20.6   3.7   5.1 0.732
## 10 Ryan Arcidia… 1.84e6 PG          28 NYK/…    20     4   8.6   0.5   1.9 0.243
## # ℹ 24 more rows
## # ℹ 40 more variables: `3P` <dbl>, `3PA` <dbl>, `3P%` <dbl>, `2P` <dbl>,
## #   `2PA` <dbl>, `2P%` <dbl>, `eFG%` <dbl>, FT <dbl>, FTA <dbl>, `FT%` <dbl>,
## #   ORB <dbl>, DRB <dbl>, TRB <dbl>, AST <dbl>, STL <dbl>, BLK <dbl>,
## #   TOV <dbl>, PF <dbl>, PTS <dbl>, `Total Minutes` <dbl>, PER <dbl>,
## #   `TS%` <dbl>, `3PAr` <dbl>, FTr <dbl>, `ORB%` <dbl>, `DRB%` <dbl>,
## #   `TRB%` <dbl>, `AST%` <dbl>, `STL%` <dbl>, `BLK%` <dbl>, `TOV%` <dbl>, …

Upon inspection, the missing values come from players who did not attempt a given shot type and therefore have no percentage to record. I filled in those missing values with 0:

df[is.na(df)] <- 0
sum(is.na(df))
## [1] 0
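The blanket assignment above works here because the percentage columns are the only ones with missingness, but a more targeted version would only touch those columns, so unexpected missingness elsewhere would still surface. A minimal sketch, assuming the tidyverse verbs loaded above:

# Alternative (not run): zero out only the percentage-style columns,
# leaving any other NA untouched so unrelated missingness is not masked
df <- df %>%
  mutate(across(c(`FG%`, `3P%`, `2P%`, `eFG%`, `FT%`, `TS%`, `3PAr`, FTr),
                ~ tidyr::replace_na(.x, 0)))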

Exploratory Data Analysis

Distribution of Player Salaries

Below is a plot of the distribution of salaries among all NBA players. There is a distinct skew to the right, which is not surprising considering that only a select number of players earn very large salaries. Almost 100 players earned about $3 million this season, and the majority appear to have earned less than $10 million.

salary_histogram <- ggplot(df, aes(x = Salary)) +
  geom_histogram(binwidth = 1000000, color = "white", fill = "#69b3a2", linewidth = 0.2) + 
  labs(title = "Distribution of NBA Player Salaries", x = "Salary", y = "Number of Players") + 
  scale_x_continuous(labels = scales::dollar_format(prefix = "$"), breaks = seq(0, max(df$Salary), by = 1e7)) + 
  theme(plot.title = element_text(hjust = 0.5))
salary_histogram
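Given the heavy right skew, a log-scaled version of the same histogram can make the crowded low end of the pay scale easier to read. A quick sketch (the bin and axis choices are my own):

# Same distribution on a log10 salary axis to spread out the low end
salary_histogram_log <- ggplot(df, aes(x = Salary)) +
  geom_histogram(bins = 40, color = "white", fill = "#69b3a2") +
  scale_x_log10(labels = scales::dollar_format(prefix = "$")) +
  labs(title = "Distribution of NBA Player Salaries (Log Scale)",
       x = "Salary (log scale)", y = "Number of Players") +
  theme(plot.title = element_text(hjust = 0.5))
salary_histogram_log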

Average Salary by Player Position

Players in this dataset are classified as playing either one position or multiple positions. In the first bar plot, you can see that players classified as Point Guard-Shooting Guard (PG-SG) have the highest average salary, followed by players classified as SG-PG and Small Forward-Shooting Guard (SF-SG). This suggests that players who can play multiple positions are highly valued, and that a sizable amount of money should be allocated when signing players at these positions. Conversely, the three lowest average salaries belong to Small Forward-Power Forward (SF-PF), Shooting Guard (SG), and Center (C). This suggests that forwards and centers are not as highly valued as guards, and that playing only the Shooting Guard role does not carry the same value as being able to play multiple guard positions.

Overall, players designated Point Guard-Shooting Guard (PG-SG) average the most in salary at just under $22 million, while players at the Small Forward-Power Forward (SF-PF) position average the least at around $3 million. A premium is placed on players who can play multiple positions, particularly PG and SG. The boxplot below shows a more detailed breakdown of the salary distribution.

df %>%
  group_by(Position) %>%
  summarise(Salary = mean(Salary), .groups = "drop") %>%
  ggplot(aes(x=reorder(Position,-Salary), y=Salary, fill=Position)) +
  geom_col() +
  scale_y_continuous(labels = scales::dollar_format(prefix = "$"), breaks = seq(0, max(df$Salary), by = 2e6)) +
  labs(x="Position",
       y="Average Salary",
       title="Average Salary by Player Position \n for the 2022-2023 NBA Regular Season") +
  theme(plot.title = element_text(hjust = 0.5))

# Salary by position boxplot
position_boxplot <- ggplot(df, aes(x = Position, y = Salary)) +
  geom_boxplot(fill = "skyblue", color = "steelblue", alpha = 0.7) +
  labs(title = "NBA Player Salaries by Position", x = "Position", y = "Salary") +
  scale_y_continuous(labels = scales::comma_format(scale = 1, big.mark = ",", decimal.mark = ".", prefix = "$"))
position_boxplot

Average Salary by Player Primary Position

I also wanted to look at each player's listed primary position and get an idea of the average salary for a single position:

# Some players are listed under multiple positions. We will only consider their first position listed.
df$PrimaryPosition <- sapply(strsplit(as.character(df$Position), "-"), function(x) x[1])
df$PrimaryPosition <- factor(df$PrimaryPosition, levels = c("PG", "SG", "SF", "PF", "C"))
df %>%
  group_by(PrimaryPosition) %>%
  summarise(Salary = mean(Salary), .groups = "drop") %>%
  ggplot(aes(x=reorder(PrimaryPosition,-Salary), y=Salary, fill=PrimaryPosition)) +
  geom_col() +
  scale_y_continuous(labels = scales::dollar_format(prefix = "$"), breaks = seq(0, max(df$Salary), by = 2e6)) +
  labs(x="Position",
       y="Average Salary",
       title="Average Salary by Primary Player Position \n for the 2022-2023 NBA Regular Season") +
  theme(plot.title = element_text(hjust = 0.5))

Not surprisingly, Point Guard (PG) had the highest average salary at just under $12 million. However, the second-highest average salary belonged to Power Forward (PF) at just under $9 million. A possible explanation is that not many players are classified primarily as Power Forwards, so the average could be inflated by the small number of players at the position; the quick count sketched below checks this, and the boxplot after it provides more detail about the distribution of salaries by position:
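A minimal sketch of that check, using the PrimaryPosition column created above:

# Number of players classified under each primary position
df %>%
  count(PrimaryPosition, sort = TRUE)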

# Salary by position boxplot
position_boxplot <- ggplot(df, aes(x = PrimaryPosition, y = Salary)) +
  geom_boxplot(fill = "skyblue", color = "steelblue", alpha = 0.7) +
  labs(title = "NBA Player Salaries by Primary Position", x = "Primary Position", y = "Salary") +
  scale_y_continuous(labels = scales::comma_format(scale = 1, big.mark = ",", decimal.mark = ".", prefix = "$"))
position_boxplot

Age and Salary

I wanted to see whether a player's age impacts their salary. The highest salaries predominantly belong to players in their late 20s to mid-30s; only two players under the age of 30 made over $40 million. This makes sense: the more years of experience a player has, the more opportunities they have had to earn larger contracts, and players still playing into their 30s have most likely performed very well over their NBA careers.

age_vs_salary <- ggplot(df, aes(x = Age, y = Salary)) + 
  geom_point(alpha = 0.7) +
  geom_smooth(formula = y ~ x, method = "loess", color = "#69b3a2") +
  labs(title = "Age vs Salary") +
  scale_y_continuous(labels = scales::dollar_format(prefix = "$"), limits = c(0, max(df$Salary))) +
  theme(plot.title = element_text(hjust = 0.5, size = 16))
age_vs_salary

Feature Analysis and Selection

# removed categorical variables except Position
df2 <- df %>%
  select(-`Player Name`, -Team, -PrimaryPosition)
# integer-encode the Position feature so it can be included in the correlation matrix
df2$Position <- as.integer(factor(df2$Position))
set.seed(123)
df2 %>% 
  correlate() %>% 
  focus(Salary) %>%
  arrange(desc(Salary)) %>%
  reactable()

For the purposes of this analysis, we will use features that have a correlation coefficient of at least 0.5 with the target variable Salary. PTS, FG, and FGA have the highest correlations with Salary, each with a coefficient above 0.7. Based on the characteristics of these independent variables, there may well be multicollinearity among them, which I will examine further using a correlation plot.
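The feature list in the next chunk was picked by hand from the table above; an equivalent programmatic selection, sketched here with the same corrr pipeline and the 0.5 cutoff, would be:

# Pull every feature whose absolute correlation with Salary is at least 0.5
high_corr_vars <- df2 %>%
  correlate() %>%
  focus(Salary) %>%
  filter(abs(Salary) >= 0.5) %>%
  pull(term)
high_corr_vars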

nba_df_feat <- df2 %>%
  select(Salary, PTS, FG, FGA, `2PA`, VORP, `2P`, FT, FTA, TOV, MP, WS, GS, AST, OWS, DRB, DWS, `Total Minutes`, TRB) %>%
  as.data.frame()
corr_mat <- cor(nba_df_feat)
ggcorrplot::ggcorrplot(corr_mat, type = "lower",
          lab = TRUE, lab_size = 2.1, tl.cex = 8)

df_cor <- Hmisc::rcorr(as.matrix(nba_df_feat))
data.frame(df_cor$r) %>% kable() %>% kable_styling()
Salary PTS FG FGA X2PA VORP X2P FT FTA TOV MP WS GS AST OWS DRB DWS Total.Minutes TRB
Salary 1.0000000 0.7275967 0.7202600 0.7063930 0.6819264 0.6803390 0.6764062 0.6737723 0.6686381 0.6468075 0.6420985 0.6246638 0.6022673 0.5939712 0.5824164 0.5765958 0.5669897 0.5658969 0.5036332
PTS 0.7275967 1.0000000 0.9918424 0.9821855 0.9286500 0.7569086 0.9147482 0.8979597 0.8897570 0.8600304 0.8737560 0.7370737 0.7470972 0.7329330 0.6901382 0.7048537 0.6596061 0.7632907 0.6182641
FG 0.7202600 0.9918424 1.0000000 0.9780933 0.9496727 0.7436993 0.9430055 0.8598332 0.8609104 0.8576005 0.8786385 0.7351147 0.7539334 0.7222491 0.6777565 0.7345357 0.6769033 0.7636258 0.6584965
FGA 0.7063930 0.9821855 0.9780933 1.0000000 0.9161943 0.6820033 0.8790428 0.8391898 0.8285125 0.8596452 0.8829525 0.6594889 0.7449237 0.7566952 0.5915472 0.6666348 0.6366949 0.7679677 0.5694408
2PA 0.6819264 0.9286500 0.9496727 0.9161943 1.0000000 0.7020484 0.9852865 0.8710493 0.8866262 0.8416437 0.7988539 0.6934428 0.7088033 0.6936573 0.6326276 0.7316343 0.6509837 0.6961546 0.6781086
VORP 0.6803390 0.7569086 0.7436993 0.6820033 0.7020484 1.0000000 0.7309881 0.7618190 0.7532235 0.6098389 0.6165867 0.8927801 0.6294509 0.6231408 0.8908721 0.6511420 0.7029440 0.5960315 0.6014074
2P 0.6764062 0.9147482 0.9430055 0.8790428 0.9852865 0.7309881 1.0000000 0.8594309 0.8827964 0.8135829 0.7812706 0.7347046 0.7009049 0.6426601 0.6829717 0.7682642 0.6670889 0.6811883 0.7301540
FT 0.6737723 0.8979597 0.8598332 0.8391898 0.8710493 0.7618190 0.8594309 1.0000000 0.9875281 0.7845790 0.7056326 0.7192200 0.6303017 0.6464973 0.7179870 0.6231723 0.5638565 0.6304507 0.5509019
FTA 0.6686381 0.8897570 0.8609104 0.8285125 0.8866262 0.7532235 0.8827964 0.9875281 1.0000000 0.7915174 0.7053006 0.7199464 0.6356131 0.6296448 0.7079629 0.6672606 0.5836073 0.6301941 0.6095663
TOV 0.6468075 0.8600304 0.8576005 0.8596452 0.8416437 0.6098389 0.8135829 0.7845790 0.7915174 1.0000000 0.7960867 0.5563825 0.6471489 0.8265584 0.4785576 0.6694122 0.5718471 0.6488436 0.5875191
MP 0.6420985 0.8737560 0.8786385 0.8829525 0.7988539 0.6165867 0.7812706 0.7056326 0.7053006 0.7960867 1.0000000 0.7280854 0.8275999 0.7499174 0.6278770 0.7496944 0.7489025 0.8746072 0.6790314
WS 0.6246638 0.7370737 0.7351147 0.6594889 0.6934428 0.8927801 0.7347046 0.7192200 0.7199464 0.5563825 0.7280854 1.0000000 0.7605724 0.5400001 0.9585744 0.7258428 0.8585728 0.7962848 0.7089825
GS 0.6022673 0.7470972 0.7539334 0.7449237 0.7088033 0.6294509 0.7009049 0.6303017 0.6356131 0.6471489 0.8275999 0.7605724 1.0000000 0.5907430 0.6626077 0.6587868 0.7711772 0.8775652 0.6138609
AST 0.5939712 0.7329330 0.7222491 0.7566952 0.6936573 0.6231408 0.6426601 0.6464973 0.6296448 0.8265584 0.7499174 0.5400001 0.5907430 1.0000000 0.5039736 0.4907528 0.4857097 0.5832622 0.3903797
OWS 0.5824164 0.6901382 0.6777565 0.5915472 0.6326276 0.8908721 0.6829717 0.7179870 0.7079629 0.4785576 0.6278770 0.9585744 0.6626077 0.5039736 1.0000000 0.6219740 0.6775770 0.6649729 0.6153671
DRB 0.5765958 0.7048537 0.7345357 0.6666348 0.7316343 0.6511420 0.7682642 0.6231723 0.6672606 0.6694122 0.7496944 0.7258428 0.6587868 0.4907528 0.6219740 1.0000000 0.7545584 0.6621512 0.9701003
DWS 0.5669897 0.6596061 0.6769033 0.6366949 0.6509837 0.7029440 0.6670889 0.5638565 0.5836073 0.5718471 0.7489025 0.8585728 0.7711772 0.4857097 0.6775770 0.7545584 1.0000000 0.8594498 0.7233293
Total Minutes 0.5658969 0.7632907 0.7636258 0.7679677 0.6961546 0.5960315 0.6811883 0.6304507 0.6301941 0.6488436 0.8746072 0.7962848 0.8775652 0.5832622 0.6649729 0.6621512 0.8594498 1.0000000 0.6042844
TRB 0.5036332 0.6182641 0.6584965 0.5694408 0.6781086 0.6014074 0.7301540 0.5509019 0.6095663 0.5875191 0.6790314 0.7089825 0.6138609 0.3903797 0.6153671 0.9701003 0.7233293 0.6042844 1.0000000

Most of the independent variables are highly correlated with one another. This isn't a total surprise, since many of the features measure similar things. For instance, Points (PTS) are produced from 2-pointers (2P), 3-pointers (3P), and Free Throws (FT), which are represented as a whole by Field Goals (FG), and Win Shares (WS) is the combination of Offensive Win Shares (OWS) and Defensive Win Shares (DWS). Conversely, Total Rebounds (TRB) and Assists (AST) had the lowest correlation coefficient at 0.3903797.

Because of the high multicollinearity and similar attributes among the independent variables, I decided to reduce the number of features further, removing variables that have exceptionally high multicollinearity with PTS, specifically 2PA, FG, FGA, 2P, FT, and FTA, each of which has a correlation coefficient above 0.88 with PTS. Likewise, I removed OWS and DWS because of their overlap with WS, Total Minutes because it is similar to Minutes Played (MP), and Defensive Rebounds (DRB) because it is similar to Total Rebounds (TRB).

Value Over Replacement Player (VORP) was kept as an independent variable, since it is a different statistical measurement than the other variables: a box-score estimate of the points per 100 team possessions that a player contributed above a replacement-level (-2.0) player, translated to an average team and prorated to an 82-game season (multiply by 2.70 to convert to wins over replacement; see Basketball-Reference's glossary for details).

Even after removing those redundant variables, the multicollinearity among the remaining predictors influenced my decision to choose Lasso Regression, Neural Networks, and XG Boost, since these three machine learning algorithms handle multicollinearity well.
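The car package loaded earlier gives one way to quantify that redundancy before dropping features. A minimal sketch, fitting an ordinary linear model on the 18 selected features and inspecting variance inflation factors (values above roughly 5 to 10 are commonly read as problematic; with correlations this high, expect very large values):

# Variance inflation factors for the initial 18-feature set;
# large values flag predictors that are close to linear combinations of the others
vif_check <- car::vif(lm(Salary ~ ., data = nba_df_feat))
sort(vif_check, decreasing = TRUE)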

Modified Data Set and Correlation Plot

nba_df_feat2 <- nba_df_feat %>%
  select(Salary, PTS, VORP, WS, TRB, MP, TOV, GS, AST) %>%
  as.data.frame()
p <- ggpairs(nba_df_feat2[,c(1:9)], lower = list(continuous = wrap("smooth", se=FALSE, alpha = 0.7, size=0.5)))
p[5,3] <- p[5,3] + theme(panel.border = element_rect(color = 'blue', fill = NA, size = 2))
p[3,5] <- p[3,5] + theme(panel.border = element_rect(color = 'blue', fill = NA, size = 2))
p

Points and Salary

Since PTS has a high correlation with Salary, I wanted to explore the relationship between the two variables:

ggplotly(ggplot(nba_df_feat2 %>%
                  drop_na(PTS, Salary), aes(x = PTS, y = Salary)) + 
           geom_point(col = "blue") +
           geom_smooth(formula = y ~ x, method = "loess") +
           scale_y_continuous(labels = scales::dollar_format(prefix = "$"), breaks = seq(0, max(df$Salary), by = 1e7)) +
           labs(title = "2022-23 Average Points Per Game and Salary",
                x = "Average Points Per Game", y = "2022-23 Salary")) 

There appears to be a positive relationship between Points and Salary, shown by the upward trajectory of the trend line. This isn't totally surprising: players who score a lot of points are highly coveted, and teams will invest a lot of money in their scoring production.
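The strength of that relationship can be quantified directly; per the correlation matrix above it is roughly 0.73:

# Pearson correlation between points per game and salary
cor(nba_df_feat2$PTS, nba_df_feat2$Salary)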

Lasso Regression

Train-Test Split

set.seed(123)  
trainIndex <- createDataPartition(nba_df_feat2$Salary, p = 0.8, list = FALSE)
trainData <- nba_df_feat2[trainIndex, ]
testData <- nba_df_feat2[-trainIndex, ]

# Scaling data
preProcValues <- preProcess(trainData, method = c("center", "scale"))
train_data_scaled <- predict(preProcValues, trainData)
test_data_scaled <- predict(preProcValues, testData)

# Predictor and response variables for training data
X_train_scaled <- as.matrix(train_data_scaled[, -1]) # scaled
#X_train2 <- as.matrix(trainData[, -1])
y_train_scaled <- train_data_scaled$Salary # scaled
#y_train2 <- trainData$Salary

# Predictor and response variables for test data
X_test_scaled <- as.matrix(test_data_scaled[, -1]) # scaled
#X_test2 <- as.matrix(testData[, -1])
y_test_scaled <- test_data_scaled$Salary # scaled
#y_test2 <- testData$Salary
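One caveat: because Salary is scaled along with the predictors, every RMSE and MAE reported below is in standard-deviation units of Salary rather than dollars. A small helper, sketched under the assumption of the center/scale preprocessing above, converts scaled predictions back to dollars using the training-set parameters stored in preProcValues:

# Map scaled salary predictions back to dollars for interpretation
# (assumes the center/scale preProcess object created above)
unscale_salary <- function(pred_scaled, pp = preProcValues) {
  pred_scaled * pp$std["Salary"] + pp$mean["Salary"]
}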

Lasso Regression Model

# Set seed for reproducible random selection and assignment operations
set.seed(1985)

# Specify 10-fold cross-validation as training method
ctrlspecs <- trainControl(method="cv", 
                          number=10,
                          savePredictions="all")

# Create vector of potential lambda values
lambda_vector <- 10^seq(5, -5, length=500)

# Specify lasso regression model to be estimated using training data
# and k-fold cross-validation process
lr_model_scaled <- train(Salary ~ ., 
                data=train_data_scaled,
                preProcess=c("center","scale"),
                method="glmnet", 
                tuneGrid=expand.grid(alpha=1, lambda=lambda_vector),
                trControl=ctrlspecs,
                na.action=na.omit)
summary(lr_model_scaled)
##             Length Class      Mode     
## a0           79    -none-     numeric  
## beta        632    dgCMatrix  S4       
## df           79    -none-     numeric  
## dim           2    -none-     numeric  
## lambda       79    -none-     numeric  
## dev.ratio    79    -none-     numeric  
## nulldev       1    -none-     numeric  
## npasses       1    -none-     numeric  
## jerr          1    -none-     numeric  
## offset        1    -none-     logical  
## call          5    -none-     call     
## nobs          1    -none-     numeric  
## lambdaOpt     1    -none-     numeric  
## xNames        8    -none-     character
## problemType   1    -none-     character
## tuneValue     2    data.frame list     
## obsLevels     1    -none-     logical  
## param         0    -none-     list
set.seed(123)
# CV metrics below index the first row of the tuning grid, not necessarily the selected lambda
lr_model_scaled$results$MAE[1]
## [1] 0.4419835
lr_model_scaled$results$RMSE[1]
## [1] 0.6659549
lr_model_scaled$results$Rsquared[1]
## [1] 0.55629
set.seed(123)
lr_predictions <- predict(lr_model_scaled, X_test_scaled)
lr_model_resample <- postResample(pred = lr_predictions, obs = y_test_scaled)
lr_results <- data.frame(Model = "Lasso Regression",
                         RMSE = caret::RMSE(lr_predictions, y_test_scaled),
                         Rsquared = caret::R2(lr_predictions, y_test_scaled),
                         MAE = caret::MAE(lr_predictions, y_test_scaled))
set.seed(123)
lr_results |>
  kbl() |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE, latex_options = "scale_down")
Model RMSE Rsquared MAE
Lasso Regression 0.6326769 0.6226381 0.4168337

Source: https://rforhr.com/lassoregression.html

Variable Importance - Lasso Regression

varImp(lr_model_scaled)
## glmnet variable importance
## 
##      Overall
## PTS  100.000
## VORP  92.951
## GS    39.363
## TOV   32.503
## AST    2.664
## WS     0.000
## MP     0.000
## TRB    0.000
set.seed(123)
lasso_model_importance <- varImp(lr_model_scaled)$importance |>
  as.data.frame() |>
  rownames_to_column("Variable") |>
  #filter(Overall >= 50) |>
  arrange(desc(Overall)) |>
  mutate(importance = row_number())

set.seed(123)
varImp(lr_model_scaled) %>%
  plot(., top = max(lasso_model_importance$importance), main = "Important Variables In Predicting NBA Player Salary Using \n Lasso Regression")
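The zero importance scores for WS, MP, and TRB correspond to coefficients the lasso shrank exactly to zero. That can be confirmed directly from the underlying glmnet fit; a quick sketch:

# Coefficients of the final lasso fit at the cross-validated best lambda;
# predictors shrunk exactly to zero were effectively dropped by the model
coef(lr_model_scaled$finalModel, s = lr_model_scaled$bestTune$lambda)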

Neural Networks

# tooHigh <- findCorrelation(cor(train_x), cutoff  = .75)
# 
# train_x2 <- train_x[, -tooHigh]
# test_x2 <- test_x[, -tooHigh]

nnetGrid <- expand.grid(.decay = c(0, 0.01, .1),
                        .size = c(1:10))
set.seed(669)
nnetModel <- train(X_train_scaled, y_train_scaled,
                  method = "nnet",
                  tuneGrid = nnetGrid,
                  trControl = trainControl(method = "repeatedcv",
                                            repeats = 5),
                  preProc = c("center", "scale"),
                  linout = TRUE,
                  trace = FALSE,
                  MaxNWts = 10 * (ncol(X_train_scaled) + 1) + 10 + 1,
                  maxit = 500)

nnetModel
## Neural Network 
## 
## 375 samples
##   8 predictor
## 
## Pre-processing: centered (8), scaled (8) 
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 338, 339, 336, 337, 336, 338, ... 
## Resampling results across tuning parameters:
## 
##   decay  size  RMSE       Rsquared   MAE      
##   0.00    1    0.6826304  0.5387993  0.4439362
##   0.00    2    0.7546671  0.4624549  0.4912958
##   0.00    3    0.7841705  0.4462586  0.4972833
##   0.00    4    0.8142730  0.4344568  0.5249582
##   0.00    5    1.0251343  0.3791678  0.6076205
##   0.00    6    0.9131322  0.3364305  0.5872640
##   0.00    7    0.9377538  0.3416392  0.6057707
##   0.00    8    0.9495851  0.3669709  0.6154047
##   0.00    9    0.9647247  0.3659755  0.6214975
##   0.00   10    1.4082450  0.2735293  0.7340524
##   0.01    1    0.6818064  0.5379839  0.4456567
##   0.01    2    0.7261134  0.4845550  0.4676771
##   0.01    3    0.7636211  0.4601992  0.4868423
##   0.01    4    0.7962528  0.4287952  0.5159405
##   0.01    5    0.8114874  0.4263842  0.5191773
##   0.01    6    0.8294005  0.4322094  0.5283609
##   0.01    7    0.8560919  0.4102477  0.5537744
##   0.01    8    0.8953805  0.3973334  0.5751889
##   0.01    9    0.9329654  0.3891464  0.5979713
##   0.01   10    1.0178994  0.3271282  0.6461981
##   0.10    1    0.6754181  0.5474605  0.4378668
##   0.10    2    0.7013695  0.5193044  0.4587552
##   0.10    3    0.7277943  0.4889851  0.4740563
##   0.10    4    0.7250926  0.4980875  0.4686974
##   0.10    5    0.7292474  0.4953542  0.4726979
##   0.10    6    0.7601260  0.4796456  0.4897747
##   0.10    7    0.7506504  0.4864477  0.4829723
##   0.10    8    0.8035754  0.4359179  0.5187116
##   0.10    9    0.8013368  0.4549355  0.5237018
##   0.10   10    0.8135000  0.4312123  0.5382530
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 1 and decay = 0.1.
set.seed(123)
# CV metrics below index the first tuning-grid row (decay = 0, size = 1), not the selected model
nnetModel$results$MAE[1]
## [1] 0.4439362
nnetModel$results$RMSE[1]
## [1] 0.6826304
nnetModel$results$Rsquared[1]
## [1] 0.5387993
set.seed(669)
nnetPred <- predict(nnetModel, newdata = X_test_scaled)
NNET_Model <- postResample(pred = nnetPred, obs = y_test_scaled)
nnet_results <- data.frame(Model = "Neural Networks",
                         RMSE = caret::RMSE(nnetPred, y_test_scaled),
                         Rsquared = caret::R2(nnetPred, y_test_scaled),
                         MAE = caret::MAE(nnetPred, y_test_scaled))


set.seed(123)
nnet_results |>
  kbl() |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE, latex_options = "scale_down")
Model RMSE Rsquared MAE
Neural Networks 0.6362838 0.6130482 0.4125134

Variable Importance - Neural Networks

varImp(nnetModel)
## nnet variable importance
## 
##      Overall
## VORP  100.00
## MP     61.14
## PTS    44.75
## TOV    39.61
## GS     27.73
## WS     12.85
## AST    12.62
## TRB     0.00
set.seed(123)
nnet_model_importance <- varImp(nnetModel)$importance |>
  as.data.frame() |>
  rownames_to_column("Variable") |>
  #filter(Overall >= 50) |>
  arrange(desc(Overall)) |>
  mutate(importance = row_number())

set.seed(123)
varImp(nnetModel) %>%
  plot(., top = max(nnet_model_importance$importance), main = "Important Variables for Predicting NBA Player Salary Using \n Neural Networks")

XG Boost

set.seed(123)
fitControl <- trainControl(method = "repeatedcv", number = 10, repeats = 2, search = "random")
# train a xgbTree model using caret::train
xg_model <- train(Salary ~., data = train_data_scaled, method = "xgbTree", trControl = fitControl)

# Instead of tree for our boosters, you can also fit a linear regression or logistic regression model using xgbLinear
# model <- train(factor(Improved)~., data = df, method = "xgbLinear", trControl = fitControl)

# See model results
print(xg_model)
## eXtreme Gradient Boosting 
## 
## 375 samples
##   8 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 2 times) 
## Summary of sample sizes: 338, 337, 338, 339, 337, 338, ... 
## Resampling results across tuning parameters:
## 
##   eta        max_depth  gamma     colsample_bytree  min_child_weight  subsample
##   0.2745122   2         6.775706  0.6599300         18                0.7813979
##   0.3313096  10         4.533342  0.3411699          4                0.7417793
##   0.5741432   6         5.726334  0.3984351          8                0.6580495
##   nrounds  RMSE       Rsquared   MAE      
##   526      0.6703663  0.5452747  0.4399166
##   179      0.7059238  0.5011445  0.4583089
##   195      0.7327009  0.4878432  0.4772313
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nrounds = 526, max_depth = 2, eta
##  = 0.2745122, gamma = 6.775706, colsample_bytree = 0.65993, min_child_weight
##  = 18 and subsample = 0.7813979.
set.seed(123)
# CV metrics below index the first row of the results table, not necessarily the selected tune
xg_model$results$MAE[1]
## [1] 0.4583089
xg_model$results$RMSE[1]
## [1] 0.7059238
xg_model$results$Rsquared[1]
## [1] 0.5011445
set.seed(669)
xgPred <- predict(xg_model, newdata = X_test_scaled)
xg_mod <- postResample(pred = xgPred, obs = y_test_scaled)
xg_results <- data.frame(Model = "XG Boost",
                         RMSE = caret::RMSE(xgPred, y_test_scaled),
                         Rsquared = caret::R2(xgPred, y_test_scaled),
                         MAE = caret::MAE(xgPred, y_test_scaled))


set.seed(123)
xg_results |>
  kbl() |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE, latex_options = "scale_down")
Model RMSE Rsquared MAE
XG Boost 0.6445609 0.5982462 0.4191515

Variable Importance - XG Boost

varImp(xg_model)
## xgbTree variable importance
## 
##      Overall
## MP   100.000
## TOV   70.145
## VORP  67.051
## PTS   31.934
## AST   26.189
## WS    16.511
## GS     8.129
## TRB    0.000
set.seed(123)
xg_model_importance <- varImp(xg_model)$importance |>
  as.data.frame() |>
  rownames_to_column("Variable") |>
  #filter(Overall >= 50) |>
  arrange(desc(Overall)) |>
  mutate(importance = row_number())

set.seed(123)
varImp(xg_model) %>%
  plot(., top = max(xg_model_importance$importance), main = "Important Variables for Predicting NBA Player Salary Using \n XG Boost")

Combined Results

set.seed(123)
combine_results <- rbind(lr_results, nnet_results, xg_results)
combine_results |>
  kbl() |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE, latex_options = "scale_down")
Model RMSE Rsquared MAE
Lasso Regression 0.6326769 0.6226381 0.4168337
Neural Networks 0.6362838 0.6130482 0.4125134
XG Boost 0.6445609 0.5982462 0.4191515
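The introduction promised a side-by-side look at predicted versus actual salaries. A sketch of that comparison in dollar terms, assuming the row order of df matches nba_df_feat2 and back-transforming with the training-set mean and standard deviation stored in preProcValues:

# Actual vs. predicted test-set salaries in dollars for all three models
salary_mean <- preProcValues$mean["Salary"]
salary_sd <- preProcValues$std["Salary"]
comparison <- tibble(
  Player = df[-trainIndex, ]$`Player Name`,
  Actual = testData$Salary,
  Lasso = as.numeric(lr_predictions) * salary_sd + salary_mean,
  NeuralNet = as.numeric(nnetPred) * salary_sd + salary_mean,
  XGBoost = as.numeric(xgPred) * salary_sd + salary_mean
)
comparison %>% arrange(desc(Actual)) %>% head(10)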

Findings

The business goal of this analysis is to identify which variables we should consider when assessing how much money to invest in players, as well as the best model to use to predict a player's salary. The three models used for this analysis were Lasso Regression, Neural Networks, and XG Boost, selected because they handle the multicollinearity present among the independent variables. As discussed earlier, the distribution of NBA player salaries is skewed to the right, with many outliers in the dataset. When identifying independent variables highly correlated with Salary, I found 18 variables with a correlation coefficient of at least 0.5 and selected those. Most of the selected variables had high multicollinearity with one another, so on further examination I removed variables that were not only collinear but also redundant: as mentioned earlier, 2PA, FG, FGA, 2P, FT, FTA, OWS, DWS, Total Minutes, and Defensive Rebounds (DRB) were dropped due to their high correlation and redundancy with other variables. With the subsetted dataset, I split the data into train and test sets, and because of the significant outliers in the data, I scaled the dependent and independent variables in both sets.

All three models performed well. Lasso Regression had the highest R-squared at 0.6226381, followed by Neural Networks at 0.6130482 and XG Boost at 0.5982462. The RMSE and MAE values for each model were also very close, with Lasso Regression posting the lowest Root Mean Squared Error (RMSE) by a slim margin and Neural Networks the lowest Mean Absolute Error (MAE). For Lasso Regression, Points had the highest variable importance at 100%, slightly ahead of Value Over Replacement Player (VORP) at 92.95%, followed by significantly lower scores for Games Started (GS), Turnovers (TOV), and Assists (AST); Win Shares (WS), Minutes Played (MP), and Total Rebounds (TRB) did not register an importance score. Neural Networks produced a different ranking, with VORP at 100% followed by Minutes Played and Points. The rankings were different again for XG Boost, with Minutes Played scoring the highest at 100% followed by Turnovers, VORP, and Points. This is in sharp contrast with Lasso Regression, where Minutes Played was given a score of zero. For all three models, Total Rebounds received an importance score of zero, which indicates that it has little impact on the salary of NBA players. Taken altogether, Points and VORP appear to be the two independent variables with the most consistent impact on NBA player salaries across the three models.

While small margins in performance metrics separate the three models, the model I would choose is Lasso Regression. Its ability to handle multicollinearity, combined with its high R-squared and low RMSE, makes it a viable model for this data.

Next Steps

While the statistical variables provided in this dataset are metrics used to measure player performance, other advanced metrics could be explored. Effective Field Goal Percentage (eFG%), Box Plus/Minus (BPM), and True Shooting Percentage (TS%) are a few features used to evaluate players that are in the dataset but did not correlate highly with Salary. There are also other statistics, particularly defensive ones, that could be added to the dataset; it would be interesting to see whether they have any impact on a player's salary. While scaling the data is important to mitigate the outliers, building models on unscaled data would allow a comparison against the three models trained on scaled data. Finally, with more hyperparameter tuning, the models used for this analysis could be improved to predict NBA player salaries more accurately.