In the National Basketball Association, projecting how much money to invest in players is an important part of building a team. Knowing the average salary of players at each position, as well as understanding a player’s value based on their versatility, can clarify how many resources we should commit. We will look at variables that have a high correlation with Salary, including Win Shares (WS), Points (PTS), and Total Rebounds (TRB), to name a few. I will then build three machine learning models that predict a player’s salary: Lasso Regression, Neural Networks, and XG Boost. Finally, I will compare each model’s predicted salaries with players’ actual salaries and determine which model performed best.
Link to dataset: https://www.kaggle.com/datasets/jamiewelsh2/nba-player-salaries-2022-23-season
library(tidyverse)
library(tseries)
library(forecast)
library(kableExtra)
library(reactable)
library(seasonal)
library(tsibble)
library(openxlsx)
library(readxl)
library(mice)
library(caret)
library(zoo)
library(vtable)
library(lubridate)
library(imputeTS)
library(naniar)
library(timeplyr)
library(rstatix)
library(timetk)
library(glmnet)
library(corrr)
library(corrplot)
library(ggcorrplot)
library(plotly)
library(GGally)
library(car)
library(e1071)
library(neuralnet)
library(reshape2)
df <- read_csv("/Users/mohamedhassan/Downloads/nba_2022-23_all_stats_with_salary.csv")
reactable(df)
glimpse(df)
## Rows: 467
## Columns: 52
## $ ...1 <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, …
## $ `Player Name` <chr> "Stephen Curry", "John Wall", "Russell Westbrook", "Le…
## $ Salary <dbl> 48070014, 47345760, 47080179, 44474988, 44119845, 4327…
## $ Position <chr> "PG", "PG", "PG", "PF", "PF", "SG", "SF", "SF", "PF", …
## $ Age <dbl> 34, 32, 34, 38, 34, 29, 31, 32, 28, 32, 32, 30, 30, 31…
## $ Team <chr> "GSW", "LAC", "LAL/LAC", "LAL", "BRK/PHO", "WAS", "LAC…
## $ GP <dbl> 56, 34, 73, 55, 47, 50, 52, 56, 63, 58, 69, 60, 70, 33…
## $ GS <dbl> 56, 3, 24, 54, 47, 50, 50, 56, 63, 58, 69, 60, 70, 19,…
## $ MP <dbl> 34.7, 22.2, 29.1, 35.5, 35.6, 33.5, 33.6, 34.6, 32.1, …
## $ FG <dbl> 10.0, 4.1, 5.9, 11.1, 10.3, 8.9, 8.6, 8.2, 11.2, 9.6, …
## $ FGA <dbl> 20.2, 9.9, 13.6, 22.2, 18.3, 17.6, 16.8, 17.9, 20.3, 2…
## $ `FG%` <dbl> 0.493, 0.408, 0.436, 0.500, 0.560, 0.506, 0.512, 0.457…
## $ `3P` <dbl> 4.9, 1.0, 1.2, 2.2, 2.0, 1.6, 2.0, 2.8, 0.7, 4.2, 4.4,…
## $ `3PA` <dbl> 11.4, 3.2, 3.9, 6.9, 4.9, 4.4, 4.8, 7.6, 2.7, 11.3, 10…
## $ `3P%` <dbl> 0.427, 0.303, 0.311, 0.321, 0.404, 0.365, 0.416, 0.371…
## $ `2P` <dbl> 5.1, 3.1, 4.7, 8.9, 8.3, 7.3, 6.6, 5.4, 10.5, 5.4, 3.6…
## $ `2PA` <dbl> 8.8, 6.7, 9.7, 15.3, 13.4, 13.2, 11.9, 10.3, 17.6, 9.4…
## $ `2P%` <dbl> 0.579, 0.459, 0.487, 0.580, 0.617, 0.552, 0.551, 0.521…
## $ `eFG%` <dbl> 0.614, 0.457, 0.481, 0.549, 0.614, 0.551, 0.572, 0.536…
## $ FT <dbl> 4.6, 2.3, 2.8, 4.6, 6.5, 3.8, 4.7, 4.6, 7.9, 8.8, 1.7,…
## $ FTA <dbl> 5.0, 3.3, 4.3, 5.9, 7.1, 4.6, 5.4, 5.3, 12.3, 9.6, 1.9…
## $ `FT%` <dbl> 0.915, 0.681, 0.656, 0.768, 0.919, 0.842, 0.871, 0.871…
## $ ORB <dbl> 0.7, 0.4, 1.2, 1.2, 0.4, 0.8, 1.1, 0.8, 2.2, 0.8, 0.6,…
## $ DRB <dbl> 5.4, 2.3, 4.6, 7.1, 6.3, 3.1, 5.4, 5.3, 9.6, 4.0, 3.6,…
## $ TRB <dbl> 6.1, 2.7, 5.8, 8.3, 6.7, 3.9, 6.5, 6.1, 11.8, 4.8, 4.1…
## $ AST <dbl> 6.3, 5.2, 7.5, 6.8, 5.0, 5.4, 3.9, 5.1, 5.7, 7.3, 2.4,…
## $ STL <dbl> 0.9, 0.8, 1.0, 0.9, 0.7, 0.9, 1.4, 1.5, 0.8, 0.9, 0.7,…
## $ BLK <dbl> 0.4, 0.4, 0.5, 0.6, 1.4, 0.7, 0.5, 0.4, 0.8, 0.3, 0.4,…
## $ TOV <dbl> 3.2, 2.4, 3.5, 3.2, 3.3, 2.9, 1.7, 3.1, 3.9, 3.3, 1.8,…
## $ PF <dbl> 2.1, 1.7, 2.2, 1.6, 2.1, 2.1, 1.6, 2.8, 3.1, 1.9, 1.9,…
## $ PTS <dbl> 29.4, 11.4, 15.9, 28.9, 29.1, 23.2, 23.8, 23.8, 31.1, …
## $ `Total Minutes` <dbl> 1941, 755, 2126, 1954, 1672, 1673, 1748, 1939, 2024, 2…
## $ PER <dbl> 24.1, 13.6, 16.1, 23.9, 25.9, 19.7, 23.9, 19.6, 29.0, …
## $ `TS%` <dbl> 0.656, 0.498, 0.513, 0.583, 0.677, 0.593, 0.623, 0.588…
## $ `3PAr` <dbl> 0.564, 0.322, 0.289, 0.309, 0.267, 0.249, 0.287, 0.424…
## $ FTr <dbl> 0.248, 0.334, 0.317, 0.268, 0.387, 0.260, 0.320, 0.293…
## $ `ORB%` <dbl> 2.3, 2.1, 4.7, 3.7, 1.2, 2.8, 3.7, 2.6, 7.3, 2.4, 1.9,…
## $ `DRB%` <dbl> 16.8, 11.4, 16.5, 20.8, 19.5, 9.9, 17.8, 17.0, 30.0, 1…
## $ `TRB%` <dbl> 9.7, 6.8, 10.8, 12.5, 10.5, 6.5, 10.8, 9.9, 19.1, 7.6,…
## $ `AST%` <dbl> 30.0, 35.3, 38.6, 33.5, 24.5, 26.6, 19.6, 24.2, 33.2, …
## $ `STL%` <dbl> 1.3, 1.8, 1.7, 1.2, 1.0, 1.3, 2.0, 2.1, 1.2, 1.2, 1.0,…
## $ `BLK%` <dbl> 0.9, 1.4, 1.3, 1.4, 3.4, 1.7, 1.4, 0.9, 2.1, 0.8, 1.1,…
## $ `TOV%` <dbl> 12.5, 17.1, 18.4, 11.6, 13.4, 12.9, 8.1, 13.4, 13.2, 1…
## $ `USG%` <dbl> 31.0, 27.0, 27.7, 33.3, 30.7, 29.2, 27.0, 29.5, 38.8, …
## $ OWS <dbl> 5.8, -0.4, -0.6, 3.2, 4.7, 2.2, 4.9, 2.3, 4.9, 8.2, 1.…
## $ DWS <dbl> 2.0, 0.7, 2.6, 2.4, 2.1, 1.2, 2.2, 2.3, 3.7, 0.8, 1.9,…
## $ WS <dbl> 7.8, 0.3, 1.9, 5.6, 6.8, 3.4, 7.1, 4.6, 8.6, 9.0, 3.1,…
## $ `WS/48` <dbl> 0.192, 0.020, 0.044, 0.138, 0.194, 0.099, 0.194, 0.114…
## $ OBPM <dbl> 7.5, -0.8, 0.3, 5.5, 6.0, 2.9, 5.1, 2.4, 5.8, 8.3, 1.5…
## $ DBPM <dbl> 0.1, -0.4, -0.1, 0.6, 1.2, -1.2, 0.9, 0.3, 2.7, -1.2, …
## $ BPM <dbl> 7.5, -1.2, 0.2, 6.1, 7.1, 1.8, 6.1, 2.8, 8.5, 7.1, -0.…
## $ VORP <dbl> 4.7, 0.1, 1.2, 4.0, 3.9, 1.6, 3.5, 2.3, 5.4, 4.9, 1.0,…
st(df)
| Variable | N | Mean | Std. Dev. | Min | Pctl. 25 | Pctl. 75 | Max |
|---|---|---|---|---|---|---|---|
| …1 | 467 | 233 | 135 | 0 | 116 | 350 | 466 |
| Salary | 467 | 8416599 | 10708118 | 5849 | 1782621 | 10633544 | 48070014 |
| Age | 467 | 26 | 4.3 | 19 | 23 | 29 | 42 |
| GP | 467 | 48 | 25 | 1 | 31 | 68 | 83 |
| GS | 467 | 23 | 27 | 0 | 1 | 46 | 83 |
| MP | 467 | 20 | 9.5 | 1.8 | 12 | 28 | 41 |
| FG | 467 | 3.4 | 2.5 | 0 | 1.6 | 4.3 | 11 |
| FGA | 467 | 7.1 | 5 | 0 | 3.3 | 9.4 | 22 |
| FG% | 466 | 0.47 | 0.11 | 0 | 0.42 | 0.51 | 1 |
| 3P | 467 | 1 | 0.88 | 0 | 0.3 | 1.5 | 4.9 |
| 3PA | 467 | 2.8 | 2.3 | 0 | 1 | 4.2 | 11 |
| 3P% | 454 | 0.33 | 0.13 | 0 | 0.29 | 0.39 | 1 |
| 2P | 467 | 2.4 | 2 | 0 | 0.9 | 3.3 | 10 |
| 2PA | 467 | 4.3 | 3.6 | 0 | 1.8 | 5.8 | 18 |
| 2P% | 463 | 0.53 | 0.14 | 0 | 0.49 | 0.59 | 1 |
| eFG% | 466 | 0.53 | 0.1 | 0 | 0.49 | 0.58 | 1 |
| FT | 467 | 1.4 | 1.6 | 0 | 0.5 | 1.9 | 10 |
| FTA | 467 | 1.8 | 1.9 | 0 | 0.6 | 2.3 | 12 |
| FT% | 444 | 0.75 | 0.15 | 0 | 0.69 | 0.84 | 1 |
| ORB | 467 | 0.87 | 0.75 | 0 | 0.4 | 1.1 | 5.1 |
| DRB | 467 | 2.7 | 1.7 | 0 | 1.4 | 3.5 | 9.6 |
| TRB | 467 | 3.5 | 2.3 | 0 | 1.9 | 4.5 | 12 |
| AST | 467 | 2.1 | 2 | 0 | 0.8 | 2.9 | 11 |
| STL | 467 | 0.61 | 0.4 | 0 | 0.3 | 0.8 | 3 |
| BLK | 467 | 0.38 | 0.36 | 0 | 0.1 | 0.5 | 2.5 |
| TOV | 467 | 1.1 | 0.83 | 0 | 0.5 | 1.5 | 4.1 |
| PF | 467 | 1.7 | 0.78 | 0 | 1.2 | 2.2 | 5 |
| PTS | 467 | 9.1 | 6.9 | 0 | 4.1 | 12 | 33 |
| Total Minutes | 467 | 1110 | 827 | 2 | 341 | 1842 | 2963 |
| PER | 467 | 13 | 6.2 | -21 | 10 | 16 | 66 |
| TS% | 466 | 0.56 | 0.1 | 0 | 0.52 | 0.61 | 1.1 |
| 3PAr | 466 | 0.4 | 0.22 | 0 | 0.27 | 0.55 | 1 |
| FTr | 466 | 0.25 | 0.19 | 0 | 0.14 | 0.32 | 2 |
| ORB% | 467 | 5.2 | 4.3 | 0 | 2.1 | 7.1 | 29 |
| DRB% | 467 | 15 | 6.5 | 0 | 11 | 18 | 55 |
| TRB% | 467 | 10 | 4.7 | 0 | 6.8 | 12 | 30 |
| AST% | 467 | 14 | 8.6 | 0 | 7.5 | 18 | 48 |
| STL% | 467 | 1.5 | 1.4 | 0 | 1 | 1.7 | 24 |
| BLK% | 467 | 1.8 | 2.5 | 0 | 0.7 | 2.3 | 44 |
| TOV% | 467 | 13 | 7.2 | 0 | 9.5 | 15 | 100 |
| USG% | 467 | 18 | 5.9 | 5.1 | 14 | 21 | 52 |
| OWS | 467 | 1.2 | 1.8 | -1.9 | 0 | 1.8 | 9.6 |
| DWS | 467 | 1.1 | 0.98 | 0 | 0.3 | 1.8 | 4.8 |
| WS | 467 | 2.3 | 2.5 | -1.6 | 0.3 | 3.5 | 13 |
| WS/48 | 467 | 0.082 | 0.088 | -0.52 | 0.047 | 0.13 | 0.63 |
| OBPM | 467 | -1.3 | 3.6 | -22 | -2.9 | 0.45 | 17 |
| DBPM | 467 | -0.087 | 2.2 | -10 | -0.9 | 0.8 | 33 |
| BPM | 467 | -1.4 | 4.9 | -26 | -3.3 | 0.6 | 49 |
| VORP | 467 | 0.54 | 1.2 | -1.3 | -0.1 | 0.8 | 6.4 |
df <- df %>%
  select(-1) # drop the unnamed row-index column
sum(is.na(df))
## [1] 45
miss_var_summary(df)
## # A tibble: 51 × 3
## variable n_miss pct_miss
## <chr> <int> <num>
## 1 FT% 23 4.93
## 2 3P% 13 2.78
## 3 2P% 4 0.857
## 4 FG% 1 0.214
## 5 eFG% 1 0.214
## 6 TS% 1 0.214
## 7 3PAr 1 0.214
## 8 FTr 1 0.214
## 9 Player Name 0 0
## 10 Salary 0 0
## # ℹ 41 more rows
df[!complete.cases(df), ]
## # A tibble: 34 × 51
## `Player Name` Salary Position Age Team GP GS MP FG FGA `FG%`
## <chr> <dbl> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Mitchell Rob… 1.70e7 C 24 NYK 59 58 27 3.2 4.7 0.671
## 2 Mason Plumlee 9.08e6 C 32 LAC/… 79 60 26 4.2 6.1 0.68
## 3 Dylan Windler 4.04e6 SF 26 CLE 3 0 3.3 0.7 1 0.667
## 4 Mark Williams 3.72e6 C 21 CHO 43 17 19.3 3.7 5.8 0.637
## 5 Sterling Bro… 3.12e6 SF 27 LAL 4 0 6 0 1 0
## 6 Joe Wieskamp 2.91e6 SF 23 TOR 9 0 5.6 0.3 1.6 0.214
## 7 Bismack Biyo… 2.91e6 C 30 PHO 61 14 14.3 2 3.4 0.578
## 8 Udoka Azubui… 2.17e6 C 23 UTA 36 4 10 1.6 2 0.819
## 9 Daniel Gaffo… 1.93e6 C 24 WAS 78 47 20.6 3.7 5.1 0.732
## 10 Ryan Arcidia… 1.84e6 PG 28 NYK/… 20 4 8.6 0.5 1.9 0.243
## # ℹ 24 more rows
## # ℹ 40 more variables: `3P` <dbl>, `3PA` <dbl>, `3P%` <dbl>, `2P` <dbl>,
## # `2PA` <dbl>, `2P%` <dbl>, `eFG%` <dbl>, FT <dbl>, FTA <dbl>, `FT%` <dbl>,
## # ORB <dbl>, DRB <dbl>, TRB <dbl>, AST <dbl>, STL <dbl>, BLK <dbl>,
## # TOV <dbl>, PF <dbl>, PTS <dbl>, `Total Minutes` <dbl>, PER <dbl>,
## # `TS%` <dbl>, `3PAr` <dbl>, FTr <dbl>, `ORB%` <dbl>, `DRB%` <dbl>,
## # `TRB%` <dbl>, `AST%` <dbl>, `STL%` <dbl>, `BLK%` <dbl>, `TOV%` <dbl>, …
Upon inspection, the missing values come from players who never attempted the relevant shot type and therefore have no percentage recorded for it. I filled the missing values in with 0:
df[is.na(df)] <- 0
sum(is.na(df))
## [1] 0
Below is a plot of the distribution of salaries among all NBA players. There is a distinct skew to the right, which is not surprising considering that only a select few players earn the largest contracts. Almost 100 players earned about $3 million this season, and a majority appear to have earned less than $10 million.
salary_histogram <- ggplot(df, aes(x = Salary)) +
  geom_histogram(binwidth = 1000000, color = "white", fill = "#69b3a2", size = 0.2) +
  labs(title = "Distribution of NBA Player Salaries", x = "Salary", y = "Number of Players") +
  scale_x_continuous(labels = scales::dollar_format(prefix = "$"), breaks = seq(0, max(df$Salary), by = 1e7)) +
  theme(plot.title = element_text(hjust = 0.5))
salary_histogram
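The right skew can also be quantified directly; the `e1071` package loaded above provides a sample skewness function (a quick check, not part of the original write-up):
# Sample skewness of the salary distribution; a value well above 0 confirms a long right tail
skewness(df$Salary)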
Players in this dataset are classified as playing either one position or multiple positions. In the first bar plot, you can see that players classified as Point Guard-Shooting Guard (PG-SG) have the highest average salary, followed by players classified as SG-PG and Small Forward-Shooting Guard (SF-SG). This suggests that players who can play multiple positions are highly valued, and that a sizable amount of money should be allocated when seeking to sign players at these positions. Conversely, the three lowest average salaries belong to Small Forward-Power Forward (SF-PF), Shooting Guard (SG), and Center (C). This suggests that forwards and centers are not valued as highly as guards, and that playing only the Shooting Guard role does not carry the same value as being able to play multiple guard positions.
Overall, players with the Point Guard-Shooting Guard (PG-SG) designation average the most in salary, at just under $22 million, while players at the Small Forward-Power Forward (SF-PF) position average the least, at around $3 million. A premium is placed on players who can play multiple positions, particularly PG and SG. The boxplot below shows a more detailed breakdown of the salary distribution by position.
df %>%
  group_by(Position) %>%
  summarise(Salary = mean(Salary)) %>%
  ggplot(aes(x = reorder(Position, -Salary), y = Salary, fill = Position)) +
  geom_bar(stat = "identity") +
  scale_y_continuous(labels = scales::dollar_format(prefix = "$"), breaks = seq(0, max(df$Salary), by = 2e6)) +
  labs(x = "Position",
       y = "Average Salary",
       title = "Average Salary by Player Position \n for the 2022-2023 NBA Regular Season") +
  theme(plot.title = element_text(hjust = 0.5))
# Salary by position boxplot
position_boxplot <- ggplot(df, aes(x = Position, y = Salary)) +
  geom_boxplot(fill = "skyblue", color = "steelblue", alpha = 0.7) +
  labs(title = "NBA Player Salaries by Position", x = "Position", y = "Salary") +
  scale_y_continuous(labels = scales::comma_format(scale = 1, big.mark = ",", decimal.mark = ".", prefix = "$"))
position_boxplot
Next, I looked at each player’s first listed position to get an idea of the average salary for a single primary position:
# Some players are listed under multiple positions. We will only consider their first position listed.
df$PrimaryPosition <- sapply(strsplit(as.character(df$Position), "-"), function(x) x[1])
df$PrimaryPosition <- factor(df$PrimaryPosition, levels = c("PG", "SG", "SF", "PF", "C"))
df %>%
  group_by(PrimaryPosition) %>%
  summarise(Salary = mean(Salary)) %>%
  ggplot(aes(x = reorder(PrimaryPosition, -Salary), y = Salary, fill = PrimaryPosition)) +
  geom_bar(stat = "identity") +
  scale_y_continuous(labels = scales::dollar_format(prefix = "$"), breaks = seq(0, max(df$Salary), by = 2e6)) +
  labs(x = "Position",
       y = "Average Salary",
       title = "Average Salary by Primary Player Position \n for the 2022-2023 NBA Regular Season") +
  theme(plot.title = element_text(hjust = 0.5))
Not surprisingly, Point Guard (PG) had the highest average salary, at just under $12 million. However, the second-highest average salary belonged to Power Forward (PF), at just under $9 million. A possible explanation is that relatively few players are classified primarily as Power Forwards, so the average could be inflated by the small number of players. The boxplot below provides more detail about the distribution of salaries by primary position:
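That small-sample explanation is easy to check by counting players at each primary position (a quick verification, not in the original write-up):
# Number of players at each primary position
df %>% count(PrimaryPosition, sort = TRUE)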
# Salary by position boxplot
position_boxplot <- ggplot(df, aes(x = PrimaryPosition, y = Salary)) +
  geom_boxplot(fill = "skyblue", color = "steelblue", alpha = 0.7) +
  labs(title = "NBA Player Salaries by Primary Position", x = "Primary Position", y = "Salary") +
  scale_y_continuous(labels = scales::comma_format(scale = 1, big.mark = ",", decimal.mark = ".", prefix = "$"))
position_boxplot
I also wanted to see whether a player’s age impacts their salary. The highest salaries belong predominantly to players in their late 20s to mid-30s; only two players under the age of 30 made over $40 million. This makes sense: the more years of experience a player accumulates, the more earning potential they have, and players still on rosters into their 30s have most likely performed very well over their NBA careers.
age_vs_salary <- ggplot(df, aes(x = Age, y = Salary)) +
  geom_point(alpha = 0.7) +
  geom_smooth(formula = y ~ x, method = "loess", color = "#69b3a2") +
  labs(title = "Age vs Salary") +
  scale_y_continuous(labels = scales::dollar_format(prefix = "$"), limits = c(0, max(df$Salary))) +
  theme(plot.title = element_text(hjust = 0.5, size = 16))
age_vs_salary
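The claim about the $40 million earners can be verified directly (a quick check, not part of the original output):
# List players earning over $40M along with their ages
df %>%
  filter(Salary > 4e7) %>%
  select(`Player Name`, Age, Salary) %>%
  arrange(desc(Salary))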
# removed categorical variables except Position
df2 <- df %>%
  select(-`Player Name`, -Team, -PrimaryPosition)
# created encoding for Position feature
df2$Position <- as.integer(factor(df2$Position))
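One caveat: converting Position to integer codes imposes an arbitrary ordering on an unordered category. An alternative worth considering is one-hot encoding, sketched below with caret’s `dummyVars` (hypothetical; not used in the analysis that follows):
# Hypothetical alternative: one binary indicator column per position
pos_dummies <- predict(dummyVars(~ Position, data = df), newdata = df)
head(pos_dummies)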
set.seed(123)
df2 %>%
  correlate() %>%
  focus(Salary) %>%
  arrange(desc(Salary)) %>%
  reactable()
For the purposes of this analysis, we will use features that have a correlation coefficient of at least 50% with the target variable Salary. PTS, FG, and FGA have the highest correlations with Salary, each with a coefficient of at least 70%. Just based on the definitions of these independent variables, there is likely multicollinearity among them, which I will examine further using a correlation plot.
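The 50% cutoff can also be applied programmatically rather than picking columns by hand (a small sketch using the corrr verbs already loaded):
# Features whose absolute correlation with Salary is at least 0.5
df2 %>%
  correlate() %>%
  focus(Salary) %>%
  filter(abs(Salary) >= 0.5) %>%
  arrange(desc(Salary))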
nba_df_feat <- df2 %>%
  select(Salary, PTS, FG, FGA, `2PA`, VORP, `2P`, FT, FTA, TOV, MP, WS, GS, AST, OWS, DRB, DWS, `Total Minutes`, TRB) %>%
  as.data.frame()
corr_mat <- cor(nba_df_feat)
ggcorrplot::ggcorrplot(corr_mat, type = "lower",
                       lab = TRUE, lab_size = 2.1, tl.cex = 8)
df_cor <- Hmisc::rcorr(as.matrix(nba_df_feat))
data.frame(df_cor$r) %>% kable() %>% kable_styling()
| | Salary | PTS | FG | FGA | 2PA | VORP | 2P | FT | FTA | TOV | MP | WS | GS | AST | OWS | DRB | DWS | Total Minutes | TRB |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Salary | 1.0000000 | 0.7275967 | 0.7202600 | 0.7063930 | 0.6819264 | 0.6803390 | 0.6764062 | 0.6737723 | 0.6686381 | 0.6468075 | 0.6420985 | 0.6246638 | 0.6022673 | 0.5939712 | 0.5824164 | 0.5765958 | 0.5669897 | 0.5658969 | 0.5036332 |
| PTS | 0.7275967 | 1.0000000 | 0.9918424 | 0.9821855 | 0.9286500 | 0.7569086 | 0.9147482 | 0.8979597 | 0.8897570 | 0.8600304 | 0.8737560 | 0.7370737 | 0.7470972 | 0.7329330 | 0.6901382 | 0.7048537 | 0.6596061 | 0.7632907 | 0.6182641 |
| FG | 0.7202600 | 0.9918424 | 1.0000000 | 0.9780933 | 0.9496727 | 0.7436993 | 0.9430055 | 0.8598332 | 0.8609104 | 0.8576005 | 0.8786385 | 0.7351147 | 0.7539334 | 0.7222491 | 0.6777565 | 0.7345357 | 0.6769033 | 0.7636258 | 0.6584965 |
| FGA | 0.7063930 | 0.9821855 | 0.9780933 | 1.0000000 | 0.9161943 | 0.6820033 | 0.8790428 | 0.8391898 | 0.8285125 | 0.8596452 | 0.8829525 | 0.6594889 | 0.7449237 | 0.7566952 | 0.5915472 | 0.6666348 | 0.6366949 | 0.7679677 | 0.5694408 |
| 2PA | 0.6819264 | 0.9286500 | 0.9496727 | 0.9161943 | 1.0000000 | 0.7020484 | 0.9852865 | 0.8710493 | 0.8866262 | 0.8416437 | 0.7988539 | 0.6934428 | 0.7088033 | 0.6936573 | 0.6326276 | 0.7316343 | 0.6509837 | 0.6961546 | 0.6781086 |
| VORP | 0.6803390 | 0.7569086 | 0.7436993 | 0.6820033 | 0.7020484 | 1.0000000 | 0.7309881 | 0.7618190 | 0.7532235 | 0.6098389 | 0.6165867 | 0.8927801 | 0.6294509 | 0.6231408 | 0.8908721 | 0.6511420 | 0.7029440 | 0.5960315 | 0.6014074 |
| 2P | 0.6764062 | 0.9147482 | 0.9430055 | 0.8790428 | 0.9852865 | 0.7309881 | 1.0000000 | 0.8594309 | 0.8827964 | 0.8135829 | 0.7812706 | 0.7347046 | 0.7009049 | 0.6426601 | 0.6829717 | 0.7682642 | 0.6670889 | 0.6811883 | 0.7301540 |
| FT | 0.6737723 | 0.8979597 | 0.8598332 | 0.8391898 | 0.8710493 | 0.7618190 | 0.8594309 | 1.0000000 | 0.9875281 | 0.7845790 | 0.7056326 | 0.7192200 | 0.6303017 | 0.6464973 | 0.7179870 | 0.6231723 | 0.5638565 | 0.6304507 | 0.5509019 |
| FTA | 0.6686381 | 0.8897570 | 0.8609104 | 0.8285125 | 0.8866262 | 0.7532235 | 0.8827964 | 0.9875281 | 1.0000000 | 0.7915174 | 0.7053006 | 0.7199464 | 0.6356131 | 0.6296448 | 0.7079629 | 0.6672606 | 0.5836073 | 0.6301941 | 0.6095663 |
| TOV | 0.6468075 | 0.8600304 | 0.8576005 | 0.8596452 | 0.8416437 | 0.6098389 | 0.8135829 | 0.7845790 | 0.7915174 | 1.0000000 | 0.7960867 | 0.5563825 | 0.6471489 | 0.8265584 | 0.4785576 | 0.6694122 | 0.5718471 | 0.6488436 | 0.5875191 |
| MP | 0.6420985 | 0.8737560 | 0.8786385 | 0.8829525 | 0.7988539 | 0.6165867 | 0.7812706 | 0.7056326 | 0.7053006 | 0.7960867 | 1.0000000 | 0.7280854 | 0.8275999 | 0.7499174 | 0.6278770 | 0.7496944 | 0.7489025 | 0.8746072 | 0.6790314 |
| WS | 0.6246638 | 0.7370737 | 0.7351147 | 0.6594889 | 0.6934428 | 0.8927801 | 0.7347046 | 0.7192200 | 0.7199464 | 0.5563825 | 0.7280854 | 1.0000000 | 0.7605724 | 0.5400001 | 0.9585744 | 0.7258428 | 0.8585728 | 0.7962848 | 0.7089825 |
| GS | 0.6022673 | 0.7470972 | 0.7539334 | 0.7449237 | 0.7088033 | 0.6294509 | 0.7009049 | 0.6303017 | 0.6356131 | 0.6471489 | 0.8275999 | 0.7605724 | 1.0000000 | 0.5907430 | 0.6626077 | 0.6587868 | 0.7711772 | 0.8775652 | 0.6138609 |
| AST | 0.5939712 | 0.7329330 | 0.7222491 | 0.7566952 | 0.6936573 | 0.6231408 | 0.6426601 | 0.6464973 | 0.6296448 | 0.8265584 | 0.7499174 | 0.5400001 | 0.5907430 | 1.0000000 | 0.5039736 | 0.4907528 | 0.4857097 | 0.5832622 | 0.3903797 |
| OWS | 0.5824164 | 0.6901382 | 0.6777565 | 0.5915472 | 0.6326276 | 0.8908721 | 0.6829717 | 0.7179870 | 0.7079629 | 0.4785576 | 0.6278770 | 0.9585744 | 0.6626077 | 0.5039736 | 1.0000000 | 0.6219740 | 0.6775770 | 0.6649729 | 0.6153671 |
| DRB | 0.5765958 | 0.7048537 | 0.7345357 | 0.6666348 | 0.7316343 | 0.6511420 | 0.7682642 | 0.6231723 | 0.6672606 | 0.6694122 | 0.7496944 | 0.7258428 | 0.6587868 | 0.4907528 | 0.6219740 | 1.0000000 | 0.7545584 | 0.6621512 | 0.9701003 |
| DWS | 0.5669897 | 0.6596061 | 0.6769033 | 0.6366949 | 0.6509837 | 0.7029440 | 0.6670889 | 0.5638565 | 0.5836073 | 0.5718471 | 0.7489025 | 0.8585728 | 0.7711772 | 0.4857097 | 0.6775770 | 0.7545584 | 1.0000000 | 0.8594498 | 0.7233293 |
| Total Minutes | 0.5658969 | 0.7632907 | 0.7636258 | 0.7679677 | 0.6961546 | 0.5960315 | 0.6811883 | 0.6304507 | 0.6301941 | 0.6488436 | 0.8746072 | 0.7962848 | 0.8775652 | 0.5832622 | 0.6649729 | 0.6621512 | 0.8594498 | 1.0000000 | 0.6042844 |
| TRB | 0.5036332 | 0.6182641 | 0.6584965 | 0.5694408 | 0.6781086 | 0.6014074 | 0.7301540 | 0.5509019 | 0.6095663 | 0.5875191 | 0.6790314 | 0.7089825 | 0.6138609 | 0.3903797 | 0.6153671 | 0.9701003 | 0.7233293 | 0.6042844 | 1.0000000 |
Most of the independent variables are highly correlated with one another. This isn’t a total surprise, since many of the features overlap by definition. For instance, Points (PTS) are produced from 2-pointers (2P), 3-pointers (3P), and Free Throws (FT), and Field Goals (FG) count the made 2-pointers and 3-pointers. Win Shares (WS) is the sum of Offensive Win Shares (OWS) and Defensive Win Shares (DWS). Conversely, Total Rebounds (TRB) and Assists (AST) had the lowest correlation coefficient, at 0.39. Because of the high multicollinearity and overlapping definitions among the independent variables, I reduced the number of features further, removing variables with exceptionally high multicollinearity with PTS, specifically 2PA, FG, FGA, 2P, FT, and FTA, each of which has a correlation coefficient above 88% with PTS. Likewise, I removed OWS and DWS since they are components of WS, Total Minutes since it is largely redundant with Minutes Played (MP), and Defensive Rebounds (DRB) since it closely tracks Total Rebounds (TRB).
Value Over Replacement Player (VORP) was kept as an independent variable, since it measures something different from the other variables. It is a box score estimate of the points per 100 team possessions that a player contributed above a replacement-level (-2.0) player, translated to an average team and prorated to an 82-game season; multiplying by 2.70 converts it to wins over replacement. You can read more about the statistic on Basketball-Reference.
Even after removing those redundant variables, the multicollinearity remaining among the independent variables influenced my decision to choose Lasso Regression, Neural Networks, and XG Boost, since these algorithms handle multicollinearity well.
nba_df_feat2 <- nba_df_feat %>%
  select(Salary, PTS, VORP, WS, TRB, MP, TOV, GS, AST) %>%
  as.data.frame()
p <- ggpairs(nba_df_feat2[,c(1:9)], lower = list(continuous = wrap("smooth", se=FALSE, alpha = 0.7, size=0.5)))
p[5,3] <- p[5,3] + theme(panel.border = element_rect(color = 'blue', fill = NA, size = 2))
p[3,5] <- p[3,5] + theme(panel.border = element_rect(color = 'blue', fill = NA, size = 2))
p
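As an additional check on the remaining multicollinearity, variance inflation factors can be computed for the reduced feature set; the `car` package is already loaded, and values well above 5-10 are conventionally read as problematic. This is a quick sketch, not part of the original analysis:
# VIF of each predictor in a linear model of Salary on the reduced feature set
vif(lm(Salary ~ ., data = nba_df_feat2))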
Since PTS has a high correlation with Salary, I wanted to explore the relationship between the two variables:
ggplotly(ggplot(nba_df_feat2 %>%
                  drop_na(PTS, Salary), aes(x = PTS, y = Salary)) +
           geom_point(col = "blue") +
           geom_smooth(formula = y ~ x, method = "loess") +
           scale_y_continuous(labels = scales::dollar_format(prefix = "$"), breaks = seq(0, max(df$Salary), by = 1e7)) +
           labs(title = "2022-23 Average Points Per Game and Salary",
                x = "Average Points Per Game", y = "2022-23 Salary"))
There is a clear positive relationship between Points and Salary, shown by the upward trend of the smoothed line. This isn’t totally surprising: players who score a lot of points are highly coveted, so NBA teams invest heavily in scoring production.
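The strength of that relationship can be quantified with a correlation test (a quick check, not in the original pipeline):
# Pearson correlation between points per game and salary, with a confidence interval
cor.test(nba_df_feat2$PTS, nba_df_feat2$Salary)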
set.seed(123)
trainIndex <- createDataPartition(nba_df_feat2$Salary, p = 0.8, list = FALSE)
trainData <- nba_df_feat2[trainIndex, ]
testData <- nba_df_feat2[-trainIndex, ]
# Scaling data
preProcValues <- preProcess(trainData, method = c("center", "scale"))
train_data_scaled <- predict(preProcValues, trainData)
test_data_scaled <- predict(preProcValues, testData)
# Predictor and response variables for training data
X_train_scaled <- as.matrix(train_data_scaled[, -1]) # scaled
#X_train2 <- as.matrix(trainData[, -1])
y_train_scaled <- train_data_scaled$Salary # scaled
#y_train2 <- trainData$Salary
# Predictor and response variables for test data
X_test_scaled <- as.matrix(test_data_scaled[, -1]) # scaled
#X_test2 <- as.matrix(testData[, -1])
y_test_scaled <- test_data_scaled$Salary # scaled
#y_test2 <- testData$Salary
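One note on the scaling: because Salary itself is centered and scaled, all of the error metrics reported below are in standardized units rather than dollars. If dollar-scale predictions are needed, the transformation can be inverted with the mean and standard deviation stored by preProcess. A minimal sketch (the `unscale_salary` helper is hypothetical, not part of the original pipeline):
# Hypothetical helper: invert the center/scale transform applied to Salary
unscale_salary <- function(z, pp = preProcValues) {
  z * pp$std[["Salary"]] + pp$mean[["Salary"]]
}
# Example: recover the test-set salaries in dollars
head(unscale_salary(y_test_scaled))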
# Set seed for reproducible random selection and assignment operations
set.seed(1985)
# Specify 10-fold cross-validation as training method
ctrlspecs <- trainControl(method="cv",
                          number=10,
                          savePredictions="all")
# Create vector of potential lambda values
lambda_vector <- 10^seq(5, -5, length=500)
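For reference, the lasso estimates coefficients by minimizing the least-squares loss plus an L1 penalty; larger values of $\lambda$ shrink more coefficients exactly to zero, and the grid above searches 500 values between $10^{5}$ and $10^{-5}$ (glmnet scales the loss term by $\frac{1}{2n}$, which changes the scale of $\lambda$ but not the idea):

$$\hat{\beta} = \underset{\beta}{\arg\min}\; \sum_{i=1}^{n}\left(y_i - x_i^{\top}\beta\right)^2 + \lambda \sum_{j=1}^{p}\lvert\beta_j\rvert$$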
# Specify lasso regression model to be estimated using training data
# and k-fold cross-validation process
lr_model_scaled <- train(Salary ~ .,
                         data=train_data_scaled,
                         preProcess=c("center","scale"),
                         method="glmnet",
                         tuneGrid=expand.grid(alpha=1, lambda=lambda_vector),
                         trControl=ctrlspecs,
                         na.action=na.omit)
summary(lr_model_scaled)
## Length Class Mode
## a0 79 -none- numeric
## beta 632 dgCMatrix S4
## df 79 -none- numeric
## dim 2 -none- numeric
## lambda 79 -none- numeric
## dev.ratio 79 -none- numeric
## nulldev 1 -none- numeric
## npasses 1 -none- numeric
## jerr 1 -none- numeric
## offset 1 -none- logical
## call 5 -none- call
## nobs 1 -none- numeric
## lambdaOpt 1 -none- numeric
## xNames 8 -none- character
## problemType 1 -none- character
## tuneValue 2 data.frame list
## obsLevels 1 -none- logical
## param 0 -none- list
set.seed(123)
lr_model_scaled$results$MAE[1]
## [1] 0.4419835
lr_model_scaled$results$RMSE[1]
## [1] 0.6659549
lr_model_scaled$results$Rsquared[1]
## [1] 0.55629
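Note that `$results[...][1]` indexes the first row of the tuning grid, which is not necessarily the row for the selected lambda; the cross-validated metrics of the chosen model can be pulled explicitly (a small sketch, and the same caveat applies to the Neural Network and XG Boost summaries below):
# Cross-validated metrics at the lambda chosen by resampling
subset(lr_model_scaled$results, lambda == lr_model_scaled$bestTune$lambda)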
set.seed(123)
lr_predictions <- predict(lr_model_scaled, X_test_scaled)
lr_model_resample <- postResample(pred = lr_predictions, obs = y_test_scaled)
lr_results <- data.frame(Model = "Lasso Regression",
                         RMSE = caret::RMSE(lr_predictions, y_test_scaled),
                         Rsquared = caret::R2(lr_predictions, y_test_scaled),
                         MAE = caret::MAE(lr_predictions, y_test_scaled))
set.seed(123)
lr_results |>
  kbl() |>
  kable_styling(latex_options="scale_down", c("striped", "hover", "condensed", full_width=F))
| Model | RMSE | Rsquared | MAE |
|---|---|---|---|
| Lasso Regression | 0.6326769 | 0.6226381 | 0.4168337 |
varImp(lr_model_scaled)
## glmnet variable importance
##
## Overall
## PTS 100.000
## VORP 92.951
## GS 39.363
## TOV 32.503
## AST 2.664
## WS 0.000
## MP 0.000
## TRB 0.000
set.seed(123)
lasso_model_importance <- varImp(lr_model_scaled)$importance |>
  as.data.frame() |>
  rownames_to_column("Variable") |>
  #filter(Overall >= 50) |>
  arrange(desc(Overall)) |>
  mutate(importance = row_number())
set.seed(123)
varImp(lr_model_scaled) %>%
  plot(., top = max(lasso_model_importance$importance), main = "Important Variables In Predicting NBA Player Salary Using \n Lasso Regression")
# tooHigh <- findCorrelation(cor(train_x), cutoff = .75)
#
# train_x2 <- train_x[, -tooHigh]
# test_x2 <- test_x[, -tooHigh]
# Tuning grid: weight decay (regularization) and hidden-layer size
nnetGrid <- expand.grid(.decay = c(0, 0.01, .1),
                        .size = c(1:10))
set.seed(669)
nnetModel <- train(X_train_scaled, y_train_scaled,
                   method = "nnet",
                   tuneGrid = nnetGrid,
                   trControl = trainControl(method = "repeatedcv",
                                            repeats = 5),
                   preProc = c("center", "scale"),
                   linout = TRUE, # linear output unit, since this is regression
                   trace = FALSE,
                   # weight cap for the largest grid size: size*(p+1) + size + 1 with size = 10
                   MaxNWts = 10 * (ncol(X_train_scaled) + 1) + 10 + 1,
                   maxit = 500)
nnetModel
## Neural Network
##
## 375 samples
## 8 predictor
##
## Pre-processing: centered (8), scaled (8)
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 338, 339, 336, 337, 336, 338, ...
## Resampling results across tuning parameters:
##
## decay size RMSE Rsquared MAE
## 0.00 1 0.6826304 0.5387993 0.4439362
## 0.00 2 0.7546671 0.4624549 0.4912958
## 0.00 3 0.7841705 0.4462586 0.4972833
## 0.00 4 0.8142730 0.4344568 0.5249582
## 0.00 5 1.0251343 0.3791678 0.6076205
## 0.00 6 0.9131322 0.3364305 0.5872640
## 0.00 7 0.9377538 0.3416392 0.6057707
## 0.00 8 0.9495851 0.3669709 0.6154047
## 0.00 9 0.9647247 0.3659755 0.6214975
## 0.00 10 1.4082450 0.2735293 0.7340524
## 0.01 1 0.6818064 0.5379839 0.4456567
## 0.01 2 0.7261134 0.4845550 0.4676771
## 0.01 3 0.7636211 0.4601992 0.4868423
## 0.01 4 0.7962528 0.4287952 0.5159405
## 0.01 5 0.8114874 0.4263842 0.5191773
## 0.01 6 0.8294005 0.4322094 0.5283609
## 0.01 7 0.8560919 0.4102477 0.5537744
## 0.01 8 0.8953805 0.3973334 0.5751889
## 0.01 9 0.9329654 0.3891464 0.5979713
## 0.01 10 1.0178994 0.3271282 0.6461981
## 0.10 1 0.6754181 0.5474605 0.4378668
## 0.10 2 0.7013695 0.5193044 0.4587552
## 0.10 3 0.7277943 0.4889851 0.4740563
## 0.10 4 0.7250926 0.4980875 0.4686974
## 0.10 5 0.7292474 0.4953542 0.4726979
## 0.10 6 0.7601260 0.4796456 0.4897747
## 0.10 7 0.7506504 0.4864477 0.4829723
## 0.10 8 0.8035754 0.4359179 0.5187116
## 0.10 9 0.8013368 0.4549355 0.5237018
## 0.10 10 0.8135000 0.4312123 0.5382530
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 1 and decay = 0.1.
set.seed(123)
nnetModel$results$MAE[1]
## [1] 0.4439362
nnetModel$results$RMSE[1]
## [1] 0.6826304
nnetModel$results$Rsquared[1]
## [1] 0.5387993
set.seed(669)
nnetPred <- predict(nnetModel, newdata = X_test_scaled)
NNET_Model <- postResample(pred = nnetPred, obs = y_test_scaled)
nnet_results <- data.frame(Model = "Neural Networks",
                           RMSE = caret::RMSE(nnetPred, y_test_scaled),
                           Rsquared = caret::R2(nnetPred, y_test_scaled),
                           MAE = caret::MAE(nnetPred, y_test_scaled))
set.seed(123)
nnet_results |>
  kbl() |>
  kable_styling(latex_options="scale_down", c("striped", "hover", "condensed", full_width=F))
| Model | RMSE | Rsquared | MAE |
|---|---|---|---|
| Neural Networks | 0.6362838 | 0.6130482 | 0.4125134 |
varImp(nnetModel)
## nnet variable importance
##
## Overall
## VORP 100.00
## MP 61.14
## PTS 44.75
## TOV 39.61
## GS 27.73
## WS 12.85
## AST 12.62
## TRB 0.00
set.seed(123)
nnet_model_importance <- varImp(nnetModel)$importance |>
  as.data.frame() |>
  rownames_to_column("Variable") |>
  #filter(Overall >= 50) |>
  arrange(desc(Overall)) |>
  mutate(importance = row_number())
set.seed(123)
varImp(nnetModel) %>%
  plot(., top = max(nnet_model_importance$importance), main = "Important Variables for Predicting NBA Player Salary Using \n Neural Networks")
set.seed(123)
fitControl <- trainControl(method = "repeatedcv", number = 10, repeats = 2, search = "random")
# train a xgbTree model using caret::train
xg_model <- train(Salary ~., data = train_data_scaled, method = "xgbTree", trControl = fitControl)
# Instead of trees as the boosters, a boosted linear model can be fit by setting method = "xgbLinear"
# See model results
print(xg_model)
## eXtreme Gradient Boosting
##
## 375 samples
## 8 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 2 times)
## Summary of sample sizes: 338, 337, 338, 339, 337, 338, ...
## Resampling results across tuning parameters:
##
## eta max_depth gamma colsample_bytree min_child_weight subsample
## 0.2745122 2 6.775706 0.6599300 18 0.7813979
## 0.3313096 10 4.533342 0.3411699 4 0.7417793
## 0.5741432 6 5.726334 0.3984351 8 0.6580495
## nrounds RMSE Rsquared MAE
## 526 0.6703663 0.5452747 0.4399166
## 179 0.7059238 0.5011445 0.4583089
## 195 0.7327009 0.4878432 0.4772313
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nrounds = 526, max_depth = 2, eta
## = 0.2745122, gamma = 6.775706, colsample_bytree = 0.65993, min_child_weight
## = 18 and subsample = 0.7813979.
set.seed(123)
xg_model$results$MAE[1]
## [1] 0.4583089
xg_model$results$RMSE[1]
## [1] 0.7059238
xg_model$results$Rsquared[1]
## [1] 0.5011445
set.seed(669)
xgPred <- predict(xg_model, newdata = X_test_scaled)
xg_mod <- postResample(pred = xgPred, obs = y_test_scaled)
xg_results <- data.frame(Model = "XG Boost",
                         RMSE = caret::RMSE(xgPred, y_test_scaled),
                         Rsquared = caret::R2(xgPred, y_test_scaled),
                         MAE = caret::MAE(xgPred, y_test_scaled))
set.seed(123)
xg_results |>
  kbl() |>
  kable_styling(latex_options="scale_down", c("striped", "hover", "condensed", full_width=F))
| Model | RMSE | Rsquared | MAE |
|---|---|---|---|
| XG Boost | 0.6445609 | 0.5982462 | 0.4191515 |
varImp(xg_model)
## xgbTree variable importance
##
## Overall
## MP 100.000
## TOV 70.145
## VORP 67.051
## PTS 31.934
## AST 26.189
## WS 16.511
## GS 8.129
## TRB 0.000
set.seed(123)
xg_model_importance <- varImp(xg_model)$importance |>
  as.data.frame() |>
  rownames_to_column("Variable") |>
  #filter(Overall >= 50) |>
  arrange(desc(Overall)) |>
  mutate(importance = row_number())
set.seed(123)
varImp(xg_model) %>%
  plot(., top = max(xg_model_importance$importance), main = "Important Variables for Predicting NBA Player Salary Using \n XG Boost")
set.seed(123)
combine_results <- rbind(lr_results, nnet_results, xg_results)
combine_results |>
  kbl() |>
  kable_styling(latex_options="scale_down", c("striped", "hover", "condensed", full_width=F))
| Model | RMSE | Rsquared | MAE |
|---|---|---|---|
| Lasso Regression | 0.6326769 | 0.6226381 | 0.4168337 |
| Neural Networks | 0.6362838 | 0.6130482 | 0.4125134 |
| XG Boost | 0.6445609 | 0.5982462 | 0.4191515 |
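To close the loop on the comparison promised in the introduction, the test-set predictions of all three models can be plotted against the actual salaries (in standardized units). This sketch reuses the prediction objects created above and was not part of the original output:
# Combine each model's test-set predictions with the actual standardized salaries
pred_df <- data.frame(
  Actual = rep(y_test_scaled, 3),
  Predicted = c(lr_predictions, nnetPred, xgPred),
  Model = rep(c("Lasso Regression", "Neural Networks", "XG Boost"),
              each = length(y_test_scaled))
)
# Points close to the dashed 45-degree line indicate accurate predictions
ggplot(pred_df, aes(x = Actual, y = Predicted)) +
  geom_point(alpha = 0.6, color = "#69b3a2") +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  facet_wrap(~ Model) +
  labs(title = "Predicted vs. Actual Standardized Salary by Model")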
The business goal of this analysis is to identify which variables we should consider when assessing the amount of money to invest in players, as well as the best model for predicting a player’s salary. The three models used were Lasso Regression, Neural Networks, and XG Boost, chosen because they handle the multicollinearity present among the independent variables. As discussed earlier, the distribution of NBA player salaries is skewed to the right, with many outliers in the dataset. When identifying independent variables highly correlated with Salary, I found 18 variables with at least a 50% correlation coefficient and selected them as candidate features. Most of these had high multicollinearity with each other, so on further examination I removed the variables that were not only collinear but also redundant: 2PA, FG, FGA, 2P, FT, FTA, OWS, DWS, Total Minutes, and Defensive Rebounds (DRB) were dropped due to their high correlation and redundancy with other variables. With the subsetted dataset, I split the data into train and test sets, and because of the significant outliers I scaled the dependent and independent variables in both sets.
All three models performed well. Lasso Regression had the highest R-squared at 0.6226381, followed by Neural Networks at 0.6130482 and XG Boost at 0.5982462. The RMSE and MAE values for each model were also very close, with Lasso Regression posting a slightly lower Root Mean Squared Error (RMSE) and Neural Networks a slightly lower Mean Absolute Error (MAE). For Lasso Regression, Points had the highest variable importance at 100%, slightly ahead of Value Over Replacement Player (VORP) at 92.95%, followed by significantly lower scores for Games Started (GS), Turnovers (TOV), and Assists (AST); Win Shares (WS), Minutes Played (MP), and Total Rebounds (TRB) did not register an importance score at all. Neural Networks produced a different ranking, with VORP scoring 100%, followed by Minutes Played and Points. The ranking differed again for XG Boost, where Minutes Played had the highest score at 100%, followed by Turnovers, VORP, and Points, in sharp contrast to Lasso Regression, which gave Minutes Played a score of zero. In all three models, Total Rebounds received an importance score of zero, indicating that it has little impact on NBA player salaries. Taken altogether, Points and VORP appear to be the two independent variables with the most consistent impact on NBA player salaries across the three models.
While only small margins in the performance metrics separate the three models, the model I would choose is Lasso Regression. Its ability to handle multicollinearity, combined with its high R-squared and low RMSE, makes it a viable model for this data.
While the statistical variables provided in this dataset are standard measures of player performance, other advanced metrics could be incorporated. Effective Field Goal Percentage (eFG%), Box Plus/Minus (BPM), and True Shooting Percentage (TS%) are a few player-evaluation features that are in the dataset but did not correlate strongly with Salary. There are also other statistics, particularly defensive ones, that are not captured here and would be interesting to test for an impact on a player’s salary. While scaling the data is important for mitigating the outliers, building models on unscaled data would make a useful comparison against the three scaled-data models. Finally, with more hyperparameter tuning, the models used for this analysis could be improved to predict NBA player salaries more accurately.