As a NBA fan for years, I am curious about the polarised NBA salary. Players who seem to play similar games receive a very different salary. Hence, we want to know the most critical elements in NBA players that make them better than others, while salary is a good standard representing the players’ value in the NBA.
There are several approaches to the problem, including interviewing NBA players and reading different articles. However, we want to take a more mathematics approach and use secondary source data online and do data analysis ourselves to find our own results. It allows us to give our independent interpretation of the game.
This report aims to find the relationships between the players’ statistics and the salary in the following year.
We found player statistics for the 2021-2022 season and their respective salary in the 2022-2023 season. We have collected the player statistics and salaries of 384 NBA players who played in 2021-2022 and will receive a salary in the 2022-23 season. The player statistics are collected from Basketball Reference and NBA official website, and the detailed links are listed below in the Source of files section.
We started with merging the data from different sources and removing repeated variables.
In the exploratory data analysis, we start with a univariate analysis of the response variable salary and then explore the correlation of individual variables with salary.
In the modelling section, we use the RMSE score on the train set to
determine the final model. We have 10-fold cross-validation on each
model (the 10-fold all use RMSE to hyperparameter tuning). The root mean
squared error (RMSE) measures the prediction error, while R-squared
measures the proportion of variation explained by the model.
We have included the following modelling methods:
As the Random Forest Model has the lowest RMSE, we chose it as the final model for our prediction. It has 0.4835 for the estimated out-of-bag RMSE.
In the interpretation part, we will mainly use linear regression and principal component analysis to analyse. We will not use the random forest model as it is hard to give an interpretation from a random forest model.
In the analysis, imperial units are used as it is the standard unit systems in the NBA. The salaries are recorded in USD.
We want to use player statistics to explore what kind of players with
which attributes will result in higher salaries in the modern NBA.
There are 59 explanatory predictive variables after removing repetitive
variables in the raw dataset with salary as the response variable.
In the current NBA, offensive players are valued more than defensive players as they are usually more eye-catching and thus generate more revenue for the organization. Hence, we predict that the points per game, assist per game and the age of the player will be the most dominant factors in the players’ salaries.
player_PGstats_2021.csv – NBA players’ statistics per game in the
2021-2022 season
source: https://www.basketball-reference.com/leagues/NBA_2022_per_game.html
player_Adstats_2021.csv – NBA players advance statistics in 2021-2022
season source: https://www.basketball-reference.com/leagues/NBA_2022_advanced.html
salary2022.csv – NBA players’ contracts in 2022-2023 season onward
source: https://www.basketball-reference.com/contracts/players.html
bio.csv – NBA player bio (height and weight)
source: https://www.nba.com/stats/players/bio/?Season=2021-22&SeasonType=Regular%20Season&sort=PLAYER_NAME&dir=1
The source Basketball
Reference is a online database specified for NBA that has long-term
reliable sources.
The source NBA official website is
the official website of NBA which represent the standard of all NBA
related informations.
Load libraries required for the analysis.
library(knitr)
library(plyr)
library(dplyr)
library(tidyr)
library(caret)
library(ggplot2)
library(corrplot)
library(stringr)
library(scales)
library(randomForest)
library(glmnet)
library(rpart)
library(lubridate)
library(plotly)
library(forcats)
library(psych)
library(ggExtra)
library(reshape2)
library(tree)
library(MASS)
library(Metrics)
library(rattle)
library(neuralnet)
library(sigmoid)
opts_chunk$set(echo = TRUE, cache = TRUE)
opts_chunk$set(tidy.opts = list(width.cutoff = 80), tidy = TRUE, fig.height = 6, fig.width = 9)
Read in the data that were downloaded from the sources.
pgstats <- read.csv("files/2022/player_PGstats_2021.csv")
adstats <- read.csv("files/2022/player_Adstats_2021.csv")
salary <- read.csv("files/2022/salary2022.csv")
bio <- read.csv("files/2022/bio.csv")
NBA players’ statistics per game in the 2021-2022 season
source: https://www.basketball-reference.com/leagues/NBA_2022_per_game.html
The file contains NBA player per-game statistics in 2021-22 season,
including points, assists, rebounds, blocks, and steal et al..
dim(pgstats)
## [1] 812 31
str(pgstats)
## 'data.frame': 812 obs. of 31 variables:
## $ Rk : int 1 2 3 4 5 6 6 6 7 8 ...
## $ Player : chr "Precious Achiuwa" "Steven Adams" "Bam Adebayo" "Santi Aldama" ...
## $ Pos : chr "C" "C" "C" "PF" ...
## $ Age : int 22 28 24 21 36 23 23 23 26 23 ...
## $ Tm : chr "TOR" "MEM" "MIA" "MEM" ...
## $ G : int 73 76 56 32 47 65 50 15 66 56 ...
## $ GS : int 28 75 56 0 12 21 19 2 61 56 ...
## $ MP : num 23.6 26.3 32.6 11.3 22.3 22.6 26.3 9.9 27.3 32.3 ...
## $ FG : num 3.6 2.8 7.3 1.7 5.4 3.9 4.7 1.1 3.9 6.6 ...
## $ FGA : num 8.3 5.1 13 4.1 9.7 10.5 12.6 3.2 8.6 9.7 ...
## $ FG. : num 0.439 0.547 0.557 0.402 0.55 0.372 0.375 0.333 0.448 0.677 ...
## $ X3P : num 0.8 0 0 0.2 0.3 1.6 1.9 0.7 2.4 0 ...
## $ X3PA : num 2.1 0 0.1 1.5 1 5.2 6.1 2.2 5.9 0.2 ...
## $ X3P. : num 0.359 0 0 0.125 0.304 0.311 0.311 0.303 0.409 0.1 ...
## $ X2P : num 2.9 2.8 7.3 1.5 5.1 2.3 2.8 0.4 1.5 6.6 ...
## $ X2PA : num 6.1 5 12.9 2.6 8.8 5.3 6.5 1 2.7 9.6 ...
## $ X2P. : num 0.468 0.548 0.562 0.56 0.578 0.433 0.434 0.4 0.533 0.688 ...
## $ eFG. : num 0.486 0.547 0.557 0.424 0.566 0.449 0.45 0.438 0.588 0.678 ...
## $ FT : num 1.1 1.4 4.6 0.6 1.9 1.2 1.4 0.7 1 2.9 ...
## $ FTA : num 1.8 2.6 6.1 1 2.2 1.7 1.9 0.8 1.1 4.2 ...
## $ FT. : num 0.595 0.543 0.753 0.625 0.873 0.743 0.722 0.917 0.865 0.708 ...
## $ ORB : num 2 4.6 2.4 1 1.6 0.6 0.7 0.1 0.5 3.4 ...
## $ DRB : num 4.5 5.4 7.6 1.7 3.9 2.3 2.6 1.5 2.9 7.3 ...
## $ TRB : num 6.5 10 10.1 2.7 5.5 2.9 3.3 1.5 3.4 10.8 ...
## $ AST : num 1.1 3.4 3.4 0.7 0.9 2.4 2.8 1.1 1.5 1.6 ...
## $ STL : num 0.5 0.9 1.4 0.2 0.3 0.7 0.8 0.3 0.7 0.8 ...
## $ BLK : num 0.6 0.8 0.8 0.3 1 0.4 0.4 0.3 0.3 1.3 ...
## $ TOV : num 1.2 1.5 2.6 0.5 0.9 1.4 1.7 0.5 0.7 1.7 ...
## $ PF : num 2.1 2 3.1 1.1 1.7 1.6 1.8 1 1.5 1.7 ...
## $ PTS : num 9.1 6.9 19.1 4.1 12.9 10.6 12.8 3.5 11.1 16.1 ...
## $ player_id: chr "achiupr01" "adamsst01" "adebaba01" "aldamsa01" ...
player_Adstats_2021.csv – NBA players advance statistics in 2021-2022
season source: https://www.basketball-reference.com/leagues/NBA_2022_advanced.html
This file contains NBA player advance statistics in 2021-22 season,
including PER (player efficiency rating), win shares, box score, and
VORP (value over replacement player) et al..
dim(adstats)
## [1] 812 30
str(adstats)
## 'data.frame': 812 obs. of 30 variables:
## $ Rk : int 1 2 3 4 5 6 6 6 7 8 ...
## $ Player : chr "Precious Achiuwa" "Steven Adams" "Bam Adebayo" "Santi Aldama" ...
## $ Pos : chr "C" "C" "C" "PF" ...
## $ Age : int 22 28 24 21 36 23 23 23 26 23 ...
## $ Tm : chr "TOR" "MEM" "MIA" "MEM" ...
## $ G : int 73 76 56 32 47 65 50 15 66 56 ...
## $ MP : int 1725 1999 1825 360 1050 1466 1317 149 1805 1809 ...
## $ PER : num 12.7 17.6 21.8 10.2 19.6 10.5 10.5 10.2 12.7 23 ...
## $ TS. : num 0.503 0.56 0.608 0.452 0.604 0.475 0.474 0.497 0.609 0.698 ...
## $ X3PAr : num 0.259 0.003 0.008 0.364 0.1 0.497 0.483 0.688 0.684 0.018 ...
## $ FTr : num 0.217 0.518 0.466 0.242 0.223 0.16 0.153 0.25 0.13 0.428 ...
## $ ORB. : num 8.7 17.9 8.7 9.4 7.8 2.7 3 0.8 1.9 12 ...
## $ DRB. : num 21.7 22 26.1 16.1 18.7 11.5 11 15.6 10.9 24.5 ...
## $ TRB. : num 14.9 19.9 17.5 12.6 13.4 7.1 6.9 8.5 6.5 18.4 ...
## $ AST. : num 6.9 16.1 17.5 7.7 6.3 16.1 16.1 15.5 7.6 8.2 ...
## $ STL. : num 1.1 1.6 2.2 0.8 0.6 1.5 1.5 1.7 1.2 1.2 ...
## $ BLK. : num 2.3 2.7 2.6 2.5 4 1.5 1.4 2.4 1 3.7 ...
## $ TOV. : num 11.3 19.6 14.4 9.9 8 11.3 11.2 13.1 6.7 12.7 ...
## $ USG. : num 18.5 12 25 18.4 22.4 24.1 24.8 17.9 15.2 18.1 ...
## $ X : logi NA NA NA NA NA NA ...
## $ OWS : num 0.4 3.8 3.6 -0.1 2.1 -1.1 -1.1 0 2.8 5.4 ...
## $ DWS : num 2.1 3 3.5 0.4 1 1.1 0.9 0.2 1.4 3 ...
## $ WS : num 2.5 6.8 7.2 0.3 3.1 0.1 -0.1 0.2 4.2 8.5 ...
## $ WS.48 : num 0.07 0.163 0.188 0.044 0.141 0.003 -0.005 0.07 0.11 0.225 ...
## $ X.1 : logi NA NA NA NA NA NA ...
## $ OBPM : num -2 1 1.7 -4.2 1.3 -1.8 -1.7 -2.9 0.6 2.7 ...
## $ DBPM : num -0.6 1 2.1 -1.5 -0.6 -1.1 -1.3 1.2 -0.2 1.2 ...
## $ BPM : num -2.6 2 3.8 -5.7 0.7 -2.9 -3 -1.7 0.4 3.9 ...
## $ VORP : num -0.2 2 2.7 -0.3 0.7 -0.3 -0.3 0 1.1 2.7 ...
## $ player_id: chr "achiupr01" "adamsst01" "adebaba01" "aldamsa01" ...
salary2022.csv – NBA players contract in 2022-2023 season
onward
source: https://www.basketball-reference.com/contracts/players.html
This file contains NBA players’ salaries, types of contracts and the
guaranteed amount of money from the contract.
dim(salary)
## [1] 448 12
str(salary)
## 'data.frame': 448 obs. of 12 variables:
## $ Rk : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Player : chr "Stephen Curry" "Russell Westbrook" "LeBron James" "Kevin Durant" ...
## $ Tm : chr "GSW" "LAL" "LAL" "BRK" ...
## $ X2022.23 : chr "$48070014" "$47063478" "$44474988" "$44119845" ...
## $ X2023.24 : chr "$51915615" "" "" "$46407433" ...
## $ X2024.25 : chr "$55761216" "" "" "$49856021" ...
## $ X2025.26 : chr "$59606817" "" "" "$53282609" ...
## $ X2026.27 : chr "" "" "" "" ...
## $ X2027.28 : chr "" "" "" "" ...
## $ Signed.Using: chr "Bird" "Bird Rights" "Bird" "Bird" ...
## $ Guaranteed : chr "$215353662" "$47063478" "$44474988" "$193665908" ...
## $ player_id : chr "curryst01" "westbru01" "jamesle01" "duranke01" ...
bio.csv – NBA player bio (height and weight)
source: https://www.nba.com/stats/players/bio/?Season=2021-22&SeasonType=Regular%20Season&sort=PLAYER_NAME&dir=1
This file contains the height and weight of NBA players in the 2021-22
season.
dim(bio)
## [1] 605 4
str(bio)
## 'data.frame': 605 obs. of 4 variables:
## $ Player: chr "Aaron Gordon" "Aaron Henry" "Aaron Holiday" "Aaron Nesmith" ...
## $ Team : chr "DEN" "PHI" "PHX" "BOS" ...
## $ Weight: int 235 210 185 215 190 225 200 241 174 240 ...
## $ Height: chr "6-8" "6-5" "6-0" "6-5" ...
We merge the tables by their primary key (pgstats.player_id) and foreign key (salary.player_id) by inner join (only take the entries which exist in both tables). We treat the players who received a salary but have not played any games as outliers.
Merge player per-game statistics and advance statistics.
merged <- merge(pgstats, adstats, by = c("player_id", "Tm"))
Since some players have changed teams in the middle of the season, We use the total stats (indicated by Tm = “TOT”).
repeated <- names(which(table(merged$player_id) > 1))
for (i in 1:length(repeated)) {
id <- repeated[i]
use <- merged[which(merged$player_id == id & merged$Tm == "TOT"), ]
temp_tm <- merged$Tm[max(which(merged$player_id == id))]
merged <- merged[merged$player_id != id, ]
merged <- rbind(merged, use)
merged$Tm[merged$player_id == id] <- temp_tm
}
Remove and rename some variables to increase readability.
merged <- merged %>%
dplyr::select(!c(Rk.y, Player.y, Pos.y, Age.y, MP.y, G.y, X, X.1)) %>%
rename(Rk = Rk.x, Player = Player.x, Position = Pos.x, Age = Age.x, Game_played = G.x,
FGpct = FG., X3Ppct = X3P., X2Ppct = X2P., eFGpct = eFG., FTpct = FT., TSpct = TS.,
ORBpct = ORB., DRBpct = DRB., TRBpct = TRB., ASTpct = AST., STLpct = STL.,
BLKpct = BLK., TOVpct = TOV., USGpct = USG., WSper48 = WS.48, Game_started = GS,
MP = MP.x)
The salary dataset contains multiple entries for players who had changed their teams in the middle of the season. The salary for the repeated entries are the same.
salary <- salary %>%
group_by(Player, player_id, X2022.23) %>%
summarise(Tm = Tm[1], Signed.Using = Signed.Using[1], Guaranteed = sum(as.numeric(grep(pattern = "[0-9]+",
Guaranteed))), Rk = Rk[1])
Merge the remain datasets.
merged <- merge(merged, salary, by = c("player_id"))
bio <- bio %>%
dplyr::select(!Team) %>%
rename(Player.x = Player)
merged <- merge(merged, bio, by = c("Player.x"), all.x = TRUE)
write.csv(merged, "dataset/all2022.csv")
all <- as.data.frame(read.csv("dataset/all2022.csv"))
dim(all)
## [1] 384 60
There are 384 entries.
str(all)
## 'data.frame': 384 obs. of 60 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Player.x : chr "Aaron Gordon" "Aaron Holiday" "Aaron Nesmith" "Aaron Wiggins" ...
## $ player_id : chr "gordoaa01" "holidaa01" "nesmiaa01" "wiggiaa01" ...
## $ Tm.x : chr "DEN" "WAS" "BOS" "OKC" ...
## $ Rk.x : int 198 244 406 581 250 85 448 97 328 497 ...
## $ Position : chr "PF" "PG" "SF" "SG" ...
## $ Age : int 26 25 22 23 35 30 20 27 28 19 ...
## $ Game_played : int 75 63 52 50 69 81 61 41 39 72 ...
## $ Game_started: int 75 15 3 35 69 44 12 18 10 13 ...
## $ MP : num 31.7 16.2 11 24.2 29.1 28.6 20.2 28 15.9 20.7 ...
## $ FG : num 5.8 2.4 1.4 3.1 3.9 3.5 3 2.5 2.4 3.5 ...
## $ FGA : num 11.1 5.4 3.5 6.7 8.2 9 7.5 6.2 4.5 7.3 ...
## $ FGpct : num 0.52 0.447 0.396 0.463 0.467 0.391 0.408 0.398 0.534 0.474 ...
## $ X3P : num 1.2 0.6 0.6 0.8 1.3 1.9 0.9 1 0.2 0.4 ...
## $ X3PA : num 3.5 1.6 2.2 2.8 3.8 4.8 3.2 3.1 0.5 1.6 ...
## $ X3Ppct : num 0.335 0.379 0.27 0.304 0.336 0.404 0.289 0.333 0.286 0.248 ...
## $ X2P : num 4.6 1.8 0.8 2.3 2.6 1.6 2.1 1.5 2.3 3.1 ...
## $ X2PA : num 7.7 3.7 1.3 4 4.4 4.2 4.2 3.2 4 5.7 ...
## $ X2Ppct : num 0.605 0.477 0.612 0.573 0.582 0.378 0.498 0.462 0.568 0.539 ...
## $ eFGpct : num 0.573 0.504 0.481 0.525 0.546 0.499 0.47 0.48 0.551 0.502 ...
## $ FT : num 2.3 0.9 0.4 1.2 1.2 2.7 0.6 1.4 1.1 2.3 ...
## $ FTA : num 3.1 1.1 0.5 1.7 1.4 3.3 0.8 1.8 1.6 3.2 ...
## $ FTpct : num 0.743 0.868 0.808 0.729 0.842 0.822 0.7 0.795 0.651 0.711 ...
## $ ORB : num 1.7 0.4 0.3 1 1.6 0.6 1.2 0.8 1.3 1.9 ...
## $ DRB : num 4.2 1.6 1.4 2.5 6.1 4.3 4 2.8 2.8 3.5 ...
## $ TRB : num 5.9 1.9 1.7 3.6 7.7 4.9 5.2 3.6 4.1 5.5 ...
## $ AST : num 2.5 2.4 0.4 1.4 3.4 3 2.1 4 1.2 2.6 ...
## $ STL : num 0.6 0.7 0.4 0.6 0.7 1 0.6 1.7 0.3 0.8 ...
## $ BLK : num 0.6 0.1 0.1 0.2 1.3 0.3 0.6 0.4 0.6 0.9 ...
## $ TOV : num 1.8 1.1 0.6 1.1 0.9 1.1 1.5 1.4 1.1 2 ...
## $ PF : num 2 1.5 1.3 1.9 1.9 2.7 1.4 2.6 2.6 3 ...
## $ PTS : num 15 6.3 3.8 8.3 10.2 11.7 7.6 7.4 6 9.6 ...
## $ PER : num 15.3 12.6 7.3 10.3 16.7 13.7 12 11.7 13.1 16 ...
## $ TSpct : num 0.602 0.544 0.507 0.556 0.574 0.559 0.485 0.528 0.577 0.552 ...
## $ X3PAr : num 0.312 0.305 0.632 0.409 0.466 0.534 0.432 0.492 0.119 0.223 ...
## $ FTr : num 0.276 0.201 0.143 0.252 0.167 0.363 0.11 0.285 0.358 0.442 ...
## $ ORBpct : num 6.1 2.6 2.9 4.3 6 2.1 6 3.3 9 10.1 ...
## $ DRBpct : num 14.3 10.3 13.6 11 22.2 16.6 20.8 11.2 19.3 18.9 ...
## $ TRBpct : num 10.3 6.5 8.4 7.6 14.3 9.2 13.3 7.3 14.1 14.5 ...
## $ ASTpct : num 11.6 20.7 5.4 8.5 16.4 15.7 16.5 18.5 10.6 19.1 ...
## $ STLpct : num 0.9 2 1.7 1.2 1.2 1.8 1.5 3 1 1.9 ...
## $ BLKpct : num 1.7 0.7 0.8 0.8 4.2 1.1 2.9 1.1 3.5 4.1 ...
## $ TOVpct : num 12.5 15.4 13.8 12.6 9.6 9.7 16.4 16.5 17.4 18.8 ...
## $ USGpct : num 19.7 18.7 17.2 15.3 14.8 17.7 20 13.3 17.1 22 ...
## $ OWS : num 3.2 0.5 -0.4 0.5 3.7 3.2 -1.1 0.7 0.4 0.8 ...
## $ DWS : num 2 0.9 0.9 0.8 3.8 2.9 1.5 1.2 0.5 1.3 ...
## $ WS : num 5.2 1.5 0.4 1.2 7.6 6.1 0.4 2 0.9 2.1 ...
## $ WSper48 : num 0.105 0.068 0.038 0.048 0.181 0.126 0.014 0.082 0.07 0.068 ...
## $ OBPM : num 0.5 -1.9 -4.9 -3.4 1.4 -0.4 -2.7 -2.2 -3.2 -1.6 ...
## $ DBPM : num -1.1 0.3 0.7 -0.9 2.9 1.2 0.1 2.3 0.3 0.6 ...
## $ BPM : num -0.6 -1.7 -4.3 -4.3 4.3 0.8 -2.6 0.2 -2.9 -1 ...
## $ VORP : num 0.9 0.1 -0.3 -0.7 3.2 1.7 -0.2 0.6 -0.1 0.4 ...
## $ Player.y : chr "Aaron Gordon" "Aaron Holiday" "Aaron Nesmith" "Aaron Wiggins" ...
## $ X2022.23 : chr "$19690909" "$1968175" "$3804360" "$1563518" ...
## $ Tm.y : chr "DEN" "ATL" "IND" "OKC" ...
## $ Signed.Using: chr "Bird" "Minimum Salary" "1st Round Pick" "" ...
## $ Guaranteed : int 1 1 1 0 1 1 1 1 1 1 ...
## $ Rk.y : int 63 362 266 430 48 139 281 154 261 278 ...
## $ Weight : int 235 185 215 190 240 214 NA 186 250 NA ...
## $ Height : chr "6-8" "6-0" "6-5" "6-4" ...
sum(all$Player.x != all$Player.y)
## [1] 1
Since there is no difference in the player name, I will remove player.y and rename player.x to name
all <- all %>%
dplyr::select(!Player.y) %>%
rename(name = Player.x)
It is the rank in their respective original tables (alphabetical
order of player name in player statistics tables and salary in 2022-23
season for salary table).
Since it does not carry any extra information, I will remove the
variables.
all <- all %>%
dplyr::select(!c(Rk.x, Rk.y))
Tm.x is the team that the player played in the 2021-22 season, while Tm.y is the team of 2022-23 season. I will change Tm.x to team_2021 and Tm.y to team_2022.
all <- all %>%
rename(team_2021 = Tm.x, team_2022 = Tm.y)
The project aims to predict the salary next year. I will remove the salary of 2023-24 season onward and change the X2022.23 to numeric variables.
all <- all %>%
rename(salary = X2022.23) %>%
mutate(salary = as.numeric(str_extract(salary, "[0-9]+")))
plot_ly(data = all %>%
drop_na(salary), x = ~salary, type = "histogram", nbinsx = 30) %>%
layout(title = "Frequency Diagram of NBA salary in 2022-23 season", xaxis = list(title = "yearly salary (USD)"),
yaxis = list(title = "frequency"))
The salary is highly right-skewed. It is expected as the top NBA
players are paid more in order for the team to keep their top
players.
There is a high frequency concentrated in the 0 to 2 million range. it
might be because of the minimum salary in NBA, which is 1 million to 3
million per year, depending on their experience (Adams, 2022). I
will keep that in mind.
summary(all$salary)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 333333 2196960 5837760 10131574 13955840 48070014 1
The salary of NBA players in the 2022-23 season (who played in the 2021-22 season) ranges from 3.3 million USD to 48.1 million USD. The median salary is 5.8 million USD, and the mean is 10 million USD.
I will first use the correlation with salary to get a feel for the numeric variables on the response variables
numVar <- which(sapply(all, is.numeric))
numVarNames <- names(numVar)
length(numVarNames)
## [1] 50
There are 50 numeric variables
all_numVar <- all[, numVar]
all_numVar <- dplyr::select(all_numVar, !X)
cor_Mat <- cor(all_numVar, use = "pairwise.complete.obs")
cor_names <- names(sort(cor_Mat[, "salary"], decreasing = TRUE))[1:20]
cor_Mat <- cor_Mat[cor_names, cor_names]
corrplot.mixed(cor_Mat, tl.pos = "lt")
We chose three variables that show a high correlation with the
salary: Points per game, Value Over replacement player, and assists per
game.
The correlation shows high multicollinearity among predictive
variables.
For example:
PTS:
PTS - Points per game
It has the highest correlation with salary among the numeric variables (0.7944907). It is the average point per game played.
ggplotly(ggplot(all %>%
drop_na(PTS, salary), aes(x = PTS, y = salary)) + geom_point(col = "blue") +
geom_smooth(formula = y ~ x, method = "loess") + labs(title = "points per game in NBA 2021-22 vs salary in NBA 2022-23",
x = "points per game", y = "yearly salary (USD)"))
There is a clear linear correlation between salary and points per game. The correlation is smaller when the points per game are below about nine but increases after it goes above nine points.
ggplotly(ggplot(all %>%
drop_na(PTS, salary), aes(x = PTS)) + geom_histogram(bins = 30) + labs(title = "Frequency Distribution of points per game in NBA 2021-22",
x = "points per game", y = "Frequency"))
Value Over Replacement Player:
VORP - Value Over Replacement Player (available since the 1973-74 season in the NBA); a box score estimate of the points per 100 TEAM possessions that a player contributed above a replacement-level (-2.0) player, translated to an average team and prorated to an 82-game season.
Although FG, FGA, FT et al., are more highly correlated, they are also highly correlated to points per game (> 0.75). I will look at the next one that is not highly correlated to points per game. It has a correlation of (0.6645534) with salary. It estimates the value provided by the player over a below-average player.
ggplotly(ggplot(all %>%
drop_na(VORP, salary), aes(x = VORP, y = salary, label = ifelse(VORP <= -0.9 |
VORP > 6, name, ""))) + geom_point(col = "blue") + geom_smooth(formula = y ~
x, method = "loess") + geom_smooth(formula = y ~ x, method = "glm", linetype = "dotted",
col = "red", se = FALSE) + labs(title = "Value over replacement player in NBA 2021-22 vs salary in NBA 2022-23",
x = "VORP", y = "yearly salary (USD)") + geom_text(size = 3)) %>%
style(textposition = "right")
It shows a clear linear correlation except for some in both extremes of the VORP. The non-linear part is due to the outliers, while the general trend still shows a positive correlation.
Assists:
AST - Assists per game
It has a high correlation with the salary while not having such a high correlation with points per game. It has a correlation of (0.6154281) with salary. It is the passes that lead to a field goal for the team.
ggplotly(ggplot(all %>%
drop_na(AST, salary), aes(x = AST, y = salary, label = ifelse(AST > 8, name,
""))) + geom_point(col = "blue") + geom_smooth(formula = y ~ x, method = "loess") +
geom_smooth(formula = y ~ x, method = "glm", linetype = "dotted", col = "red",
se = FALSE) + labs(title = "Assists per game in NBA 2021-22 vs salary in NBA 2022-23",
x = "Assists per game", y = "yearly salary (USD)") + geom_text(size = 3)) %>%
style(textposition = "right")
It shows a positive correlation until it goes above six assists per game, where it a shows negative correlation. It may be explained by the fact that the players with high assists are usually not the teamss first attacking choice, which means they might not be the teams’ top player and thus have a lower salary.
Nacol <- names(which(colSums(is.na(all) | all == "") > 0))
sort(colSums(sapply(all[Nacol], function(x) is.na(x) | x == "")), decreasing = TRUE)
## Signed.Using Weight Height X3Ppct Position FTpct
## 49 17 17 14 8 5
## salary
## 1
I will impute the salary first since it is the most important variable of the dataset (the response variable).
kable(all[is.na(all$salary), c("X", "name", "salary")])
| X | name | salary | |
|---|---|---|---|
| 146 | 146 | Ish Wainright | NA |
The salary of Ish Wainright is 125000 USD spotrac (n.d.) (source: spotrac, Accessed: 05/08/2022).
all$salary[all$name == "Ish Wainright"] <- 125000
Signed.Using:
The type of contract use to sign
Changing the value with capitalisation difference to the same to make it the same factor.
unique(all$Signed.Using)
## [1] "Bird" "Minimum Salary" "1st Round Pick"
## [4] "" "Cap Space" "MLE"
## [7] "Bi-Annual Exception" "Mini MLE" "Early Bird"
## [10] "1st Round pick" "Sign and Trade" "Bird Rights"
## [13] "Room Exception" "1st round pick" "Non Bird"
## [16] "Cap space"
all$Signed.Using[grep("^1st [Rr]ound [Pp]ick", all$Signed.Using)] <- "1st round pick"
all$Signed.Using[grep("Cap [Ss]pace", all$Signed.Using)] <- "Cap space"
The NAs mean nothing special about the signing of the contract. The NAs are replaced with “None”.
all$Signed.Using[is.na(all$Signed.Using) | all$Signed.Using == ""] <- "None"
ggplotly(ggplot(all, aes(x = fct_reorder(as.factor(Signed.Using), salary, .fun = "mean"),
y = salary, fill = Signed.Using)) + geom_boxplot() + geom_point(stat = "summary",
fun = "mean") + labs(title = "Type of contract vs salary", x = "Type of contract",
y = "yearly salary (USD)") + theme(axis.text.x = element_text(angle = 45, hjust = 1)))
There are no apparent ordinal elements in the Signed.Using variable so it will be kept as a character variable.
Some of the players’ height and weight are missing. We manually searched up each player and input them. The data are from Basketball Reference.
kable(all[which(is.na(all$Weight) | is.na(all$Height)), c("name", "Weight", "Height")])
| name | Weight | Height | |
|---|---|---|---|
| 7 | Aleksej Pokusevski | NA | NA |
| 10 | Alperen Şengün | NA | NA |
| 22 | Boban Marjanović | NA | NA |
| 24 | Bogdan Bogdanović | NA | NA |
| 25 | Bojan Bogdanović | NA | NA |
| 29 | Brandon Boston Jr. | NA | NA |
| 75 | Dāvis Bertāns | NA | NA |
| 185 | Jonas Valančiūnas | NA | NA |
| 204 | Jusuf Nurkić | NA | NA |
| 245 | Luka Dončić | NA | NA |
| 253 | Marcus Morris | NA | NA |
| 285 | Nikola Jokić | NA | NA |
| 286 | Nikola Vučević | NA | NA |
| 315 | Robert Williams | NA | NA |
| 346 | Théo Maledon | NA | NA |
| 373 | Vlatko Čančar | NA | NA |
| 378 | Willy Hernangómez | NA | NA |
I will manually search up their height and weight.
all$Weight[which(is.na(all$Weight))] <- c(190, 235, 290, 220, 226, 185, 225, 265,
290, 230, 218, 284, 260, 237, 175, 236, 250)
all$Height[which(is.na(all$Height))] <- c("7-0", "6-9", "7-3", "6-6", "6-7", "6-7",
"6-10", "6-11", "6-11", "6-7", "6-8", "6-11", "6-10", "6-8", "6-4", "6-8", "6-11")
Change height unit from feet to inches.
heightinch <- as.numeric(sapply(all$Height, function(x) substring(x, 1, 1))) * 12 +
as.numeric(sapply(all$Height, function(x) substring(x, 3, nchar(x))))
all$Height <- heightinch
g1 <- ggplot(all, aes(x = as.numeric(Height), y = as.numeric(Weight), col = as.numeric(salary),
size = salary)) + geom_point(alpha = 0.7) + theme_minimal() + labs(title = "Weight, Height and salary") +
scale_color_continuous(high = "#132B43", low = "#56B1F7")
ggMarginal(g1, type = "boxplot")
There is clear positive correlation between height and weight but there
is no visible correlation with salary.
X3Ppct:
3 point field goal percentage
Some players had not made any 3 points attempt in the season, and their 3-point percentage became NA.
kable(all[which(is.na(all$X3Ppct)), c("name", "X3P", "X3PA", "X3Ppct")])
| name | X3P | X3PA | X3Ppct | |
|---|---|---|---|---|
| 21 | Bismack Biyombo | 0 | 0 | NA |
| 82 | DeAndre Jordan | 0 | 0 | NA |
| 147 | Ivica Zubac | 0 | 0 | NA |
| 150 | Jaden Springer | 0 | 0 | NA |
| 176 | Jericho Sims | 0 | 0 | NA |
| 269 | Mitchell Robinson | 0 | 0 | NA |
| 273 | Moses Brown | 0 | 0 | NA |
| 280 | Nic Claxton | 0 | 0 | NA |
| 281 | Nick Richards | 0 | 0 | NA |
| 291 | Onyeka Okongwu | 0 | 0 | NA |
| 351 | Tony Bradley | 0 | 0 | NA |
| 364 | Tyrell Terry | 0 | 0 | NA |
| 368 | Udoka Azubuike | 0 | 0 | NA |
| 370 | Vernon Carey Jr. | 0 | 0 | NA |
I will impute by setting to 0 if there is no 3 point attempt.
all$X3Ppct[which(is.na(all$X3Ppct))] <- sapply(which(is.na(all$X3Ppct)), function(x) ifelse(all$X3PA[x] ==
0, 0, all$X3P[x]/all$X3PA[x]))
ggplotly(ggplot(all, aes(x = X3Ppct, y = salary)) + geom_point() + geom_smooth(col = "red",
formula = y ~ x, method = "glm") + labs(title = "3 point percentage vs salary",
x = "3 point field goal percentage", y = "yearly salary (USD)"))
It shows a slight but not significant correlation between salary and 3-point percentage. Most players’ 3-point percentage lies between 20% and 43%.
ggplotly(ggplot(all, aes(x = X3Ppct, y = salary, col = Position)) + geom_point() +
geom_smooth(formula = y ~ x, method = "glm") + facet_grid(Position ~ .) + labs(title = "3 point percentage vs salary for each position",
x = "3 point field goal percentage", y = "yearly salary (USD)"))
There is little to no correlation if separate to each position. It might be the result of the modern NBA requiring everyone to have a certain degree of 3-point ability regardless of their position.
The position of the following players is NA:
kable(all[is.na(all$Position), c("name", "Position")])
| name | Position | |
|---|---|---|
| 18 | Armoni Brooks | NA |
| 97 | Didi Louzada | NA |
| 99 | Domantas Sabonis | NA |
| 202 | Justin Holiday | NA |
| 237 | Larry Nance Jr. | NA |
| 287 | Norman Powell | NA |
| 331 | Spencer Dinwiddie | NA |
| 365 | Tyrese Haliburton | NA |
I will impute manually by search up their primary position
all$Position[is.na(all$Position)] <- c("SG", "SF", "PF", "SG", "PF", "SG", "PG",
"PG")
ggplot(all, aes(x = Position, y = salary, fill = Position)) + geom_boxplot() + labs(title = "salary distribution for each position")
Although there are numbers corresponding to each position in basketball,
the above shows no ordinal correlation with the salary. Hence, I will
keep it as a factor.
FTpct:
Free throw percentage
Some players had not made a free throw attempt throughout the season, which is recorded as NA.
kable(all[which(is.na(all$FTpct)), c("FT", "FTA", "FTpct")])
| FT | FTA | FTpct | |
|---|---|---|---|
| 150 | 0 | 0 | NA |
| 205 | 0 | 0 | NA |
| 251 | 0 | 0 | NA |
| 325 | 0 | 0 | NA |
| 364 | 0 | 0 | NA |
I will impute by setting to 0 if there is no free throw attempt.
all$FTpct[which(is.na(all$FTpct))] <- sapply(which(is.na(all$FTpct)), function(x) ifelse(all$FTA[x] ==
0, 0, all$FT[x]/all$FTA[x]))
ggplotly(ggplot(all, aes(x = FTpct, y = salary)) + geom_point() + geom_smooth(formula = y ~
x, method = "glm") + labs(title = "Free throw percentage vs salary", x = "free throw percentage",
y = "yearly salary (USD)"))
Guaranteed:
The amount of a player's remaining salary that is guarenteed.
Since it is a direct indication of the salary, I will remove this variable.
all <- dplyr::select(all, !Guaranteed)
Find all character variables:
chrVar <- names(which(sapply(all, is.character)))
chrVar
## [1] "name" "player_id" "team_2021" "team_2022" "Signed.Using"
I will keep them for now to keep track of each entry but remove them before fitting the model.
player_info <- all[, c("name", "player_id")]
ggplotly(ggplot(all, aes(x = fct_reorder(as.factor(team_2021), salary, median, .desc = TRUE),
y = salary, fill = reorder(as.factor(team_2021), salary, .fun = "mean", decreasing = TRUE))) +
geom_boxplot() + labs(title = "Salary for each team", x = "team in 2021-22",
y = "yearly salary in 2022-23 (USD)") + theme(axis.text.x = element_text(angle = 45,
hjust = 1)) + guides(fill = guide_legend(title = "Team")))
ggplotly(ggplot(all, aes(x = fct_reorder(as.factor(team_2022), salary, median, .desc = TRUE),
y = salary, fill = reorder(as.factor(team_2022), salary, .fun = "mean", decreasing = TRUE))) +
geom_boxplot() + labs(title = "Salary for each team", x = "team in 2022-23",
y = "yearly salary in 2022-23 (USD)") + theme(axis.text.x = element_text(angle = 45,
hjust = 1)) + guides(fill = guide_legend(title = "Team")))
all$team_2021 <- as.factor(all$team_2021)
all$team_2022 <- as.factor(all$team_2022)
I will also remove this variable as this might be a direct indication to the salary of the player.
all <- dplyr::select(all, !Signed.Using)
Although I have already done some visualization in exploratory analysis, I will visualize it again.
summary(all$salary)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 125000 2189160 5823098 10105516 13931408 48070014
ggplotly(ggplot(data = all, aes(x = "", y = salary)) + geom_boxplot() + labs(y = "salary") +
coord_flip() + theme_minimal())
The salary (in USD) ranges from 0.1 million to 48.1 million, with a mean of 10.1 million and a median of 5.8 million.
ggplotly(ggplot(data = all, aes(x = salary)) + geom_histogram(bins = 50) + labs(title = "Frequency distribution of salary",
x = "yearly salary (USD)", y = "frequency") + theme_minimal())
As from above, the data is highly right skewed and has a large spike in about 2 million.
g1 <- ggplot(all, aes(x = Age, y = salary)) + geom_point(alpha = 0.7) + theme_minimal() +
geom_smooth(formula = y ~ x, method = "glm") + labs(title = "Age vs Salary")
ggMarginal(g1, type = "boxplot")
There is a clear positive correlation between age and salary. It can be
due to various reasons, including the existence of rookie contracts,
experienced players general earns more and players with long experience
have a higher minimum salary.
Group variables to different categories:
pbio <- c("Age", "Height", "Weight")
attendence <- c("Game_played", "Game_started", "MP", "USGpct")
shooting <- c("FG", "FGA", "FGpct", "eFGpct", "PTS", "TSpct")
X2_point <- c("X2P", "X2PA", "X2Ppct")
X3_point <- c("X3P", "X3PA", "X3Ppct", "X3PAr")
Free_throw <- c("FT", "FTA", "FTpct", "FTr")
rebounding <- c("ORB", "DRB", "TRB", "ORBpct", "DRBpct", "TRBpct")
playmaking <- c("AST", "TOV", "ASTpct", "TOVpct")
defence <- c("STL", "BLK", "PF", "STLpct", "BLKpct")
overall_adstats <- c("PER", "OWS", "DWS", "WS", "WSper48", "OBPM", "DBPM", "BPM",
"VORP")
all_bio <- melt(all, id.vars = "salary", measure.vars = pbio)
ggplot(all_bio, aes(x = value, y = salary)) + geom_point() + geom_smooth(formula = y ~
x, method = glm) + facet_wrap(vars(variable), scales = "free")
The age shows a clear correlation while height and weight do not show
much. It is reasonable as different positions require different heights
and weights, and players with different heights have a play style that
suits their bodies. There is no clear correlation between their body and
their performance and thus salary.
all_attendence <- melt(all, id.vars = "salary", measure.vars = attendence)
ggplot(all_attendence, aes(x = value, y = salary)) + geom_point() + geom_smooth(formula = y ~
x, method = glm) + facet_wrap(vars(variable), scales = "free")
All of the attributes show a positive correlation. It makes sense as
players with stronger performance get played more and paid more. Both
attendance and salary are related to player performance.
all_ovSh <- melt(all, id.vars = "salary", measure.vars = shooting)
ggplot(all_ovSh, aes(x = value, y = salary)) + geom_point() + geom_smooth(formula = y ~
x, method = glm) + facet_wrap(vars(variable), scales = "free")
Field goal percentage directly correlates to FG and FGA and does not have a large correlation to salary. Effective field goal percentage is removed as it represents similar information to true shooting percentage and has less correlation.
cor(all$FG, (all$FGA * all$FGpct))
## [1] 0.9999042
all <- all %>%
dplyr::select(!eFGpct)
all_X2 <- melt(all, id.vars = "salary", measure.vars = X2_point)
ggplot(all_X2, aes(x = value, y = salary)) + geom_point() + geom_smooth(formula = y ~
x, method = glm) + facet_wrap(vars(variable), scales = "free")
It is out of expectation that 2-point field goal percentage negatively
correlates with salary. The result can be because players who shoot more
tend to decrease in shooting percentage from fatigue while players with
few attempts can easily maintain a high shooting percentage. At the same
time players who shoot more are usually the teams’ top players and thus
have higher salaries.
all_X3 <- melt(all, id.vars = "salary", measure.vars = X3_point)
ggplot(all_X3, aes(x = value, y = salary)) + geom_point() + geom_smooth(formula = y ~
x, method = glm) + facet_wrap(vars(variable), scales = "free")
All except the 3-point attempt rate show a positive correlation which
makes sense, as a high 3-point attempt rate shows that the players only
have a few attempts in total and rely on little ways of scoring, which
are often role players and thus paid less.
all_ft <- melt(all, id.vars = "salary", measure.vars = Free_throw)
ggplot(all_ft, aes(x = value, y = salary)) + geom_point() + geom_smooth(formula = y ~
x, method = glm) + facet_wrap(vars(variable), scales = "free")
All attributes have a positive correlation with salary.
all_rb <- melt(all, id.vars = "salary", measure.vars = rebounding)
ggplot(all_rb, aes(x = value, y = salary)) + geom_point() + geom_smooth(formula = y ~
x, method = glm) + facet_wrap(vars(variable), scales = "free")
It is out of expectation that the offensive rebound percentage shows a
negative correlation. The correlation might be explained by only role
players will attempt to grab offensive rebounds as this can conserve the
main players’ stamina.
all_pm <- melt(all, id.vars = "salary", measure.vars = playmaking)
ggplot(all_pm, aes(x = value, y = salary)) + geom_point() + geom_smooth(formula = y ~
x, method = glm) + facet_wrap(vars(variable), scales = "free")
The plots show a strong correlation between turnover and salary. The top
players with more usage of the ball are easier to turn over the
ball.
all_df <- melt(all, id.vars = "salary", measure.vars = defence)
ggplot(all_df, aes(x = value, y = salary)) + geom_point() + geom_smooth(formula = y ~
x, method = glm) + facet_wrap(vars(variable), scales = "free")
all_ovAd <- melt(all, id.vars = "salary", measure.vars = overall_adstats)
ggplot(all_ovAd, aes(x = value, y = salary)) + geom_point() + geom_smooth(formula = y ~
x, method = glm) + facet_wrap(vars(variable), scales = "free")
To reduce the complexity of the model, I will combine and delete some
variables. A high complexity model will result in overfitting, leading
to lower accuracy in predicting unseen data.
Before this, I will save a copy of the original data for possible future
operation.
write.csv(all[, !names(all) %in% c("X")], "dataset/cleaned_data.csv")
I will remove player id and name since they are unique for each player and thus cannot use in regression, but I will save the player id and name in other variables to take reference from.
players <- all[, c("player_id", "name")]
all <- dplyr::select(all, !c(player_id, name))
It shows little correlation, and this research is mainly about the players’ independent statistics. Hence, it is removed.
cor(as.numeric(fct_reorder(as.factor(all$team_2021), all$salary, median)), all$salary)
## [1] 0.2288759
cor(as.numeric(fct_reorder(as.factor(all$team_2022), all$salary, median)), all$salary)
## [1] 0.1704062
There is no clear correlation between salary and team as each team will have varying salaries for their star players and bench players. I will remove this variable.
all <- dplyr::select(all, !c(team_2021, team_2022))
quick_rf <- randomForest(salary ~ ., data = all, ntree = 100, importance = TRUE)
imp_rf <- randomForest::importance(quick_rf)
imp_df <- data.frame(Variables = row.names(imp_rf), MSE = imp_rf[, 1])
imp_df <- imp_df[order(imp_df$MSE, decreasing = TRUE), ]
ggplot(imp_df[1:20, ], aes(x = reorder(Variables, MSE), y = MSE, fill = MSE)) + geom_bar(stat = "identity") +
labs(x = "Variables", y = "% increase MSE if variable is randomly permuted") +
coord_flip() + theme(legend.position = "none")
There are steal, block and offensive rebounds per game records, but their correlation is not very strong. I will combine them into possession gain to make a stronger variable as these actions will all result in a possession gain for the team.
cor(all$STL, all$salary)
## [1] 0.460643
cor(all$BLK, all$salary)
## [1] 0.2902457
cor(all$ORB, all$salary)
## [1] 0.1674434
Removing steal, block and offensive rebounds per game and add possession gain.
all <- all %>%
mutate(possGain = STL + BLK + ORB) %>%
dplyr::select(!c(STL, BLK, ORB))
I will combine turnovers and personal fouls to become possession lost. These variables have low importance in the random forest. These actions will result in a possession lost.
imp_df$MSE[imp_df$Variables %in% c("TOV", "PF")]
## [1] 2.277404 2.245044
all <- all %>%
mutate(possLost = TOV + PF) %>%
dplyr::select(!c(TOV, PF))
I will remove anything about 2-pointers as it is just a portion of field goals that are not 3-pointers.
all <- all %>%
dplyr::select(!c(X2P, X2PA, X2Ppct))
I will remove free throws, field goals and 3-point percentages and attempts. I will replace them with free throw, field goal and 3-pointers missed.
all <- all %>%
mutate(FGM = FGA - FG, X3M = X3PA - X3P, FTM = FTA - FT) %>%
dplyr::select(!c(FGA, FGpct, X3PA, X3Ppct, FTA, FTpct))
I will remove the game started and replace it will starter. It will be defined by whether that player has started for more than 50% of their game played. I will also remove the game played as it is directly related to minutes played while minutes played has a stronger correlation and importance. I will also change minutes played per game to minutes played in total
all <- all %>%
mutate(starter = ifelse(Game_started/Game_played >= 0.5, 1, 0), MP = MP * Game_played) %>%
dplyr::select(!c(Game_started, Game_played))
all$starter <- as.factor(all$starter)
I will remove total rebound as it is the sum of offensive and defensive rebounds, while defensive rebound has a stronger correlation and importance.
all <- dplyr::select(all, !TRB)
I will remove Box Plus or minus as it is the sum of the offensive and defensive box scores.
cor(all[, c("OBPM", "DBPM", "BPM", "salary")])[, "salary"]
## OBPM DBPM BPM salary
## 0.6356575 0.1026086 0.5481368 1.0000000
imp_df[imp_df$Variables %in% c("OBPM", "DBPM", "BPM"), ]
## Variables MSE
## OBPM OBPM 2.941175
## BPM BPM 2.335061
## DBPM DBPM 1.468923
cor(all$OBPM + all$DBPM, all$BPM)
## [1] 0.9998765
all$BPM <- NULL
I will remove them as they are directly related to field goals and field goals missed.
all$FTr <- NULL
all$X3PAr <- NULL
As I am not sure about the effect of the variables on the model, I will not remove any extra variables but look at the results first
rm(list = ls()[!ls() %in% c("all", "players")])
numVar <- names(which(sapply(all, is.numeric)))
salary <- all$salary
player_name <- all$name
numVar <- numVar[!numVar %in% c("salary", "X", "name")]
all_num <- all[, numVar]
all_fac <- all[, !names(all) %in% c(numVar, "salary", "name", "X")]
There are 30 numeric predictors and 2 factor predictors.
I will use min-max normalization for variables with negative values
to turn all data positive.
Variables highly right-skewed (skewness > 0.8) are natural logged to
reduce skewness.
Variables highly left-skewed (skewness < -0.8) are squared to reduce
skewness.
log_names <- c()
minMax <- c()
sq_names <- c()
for (i in 1:ncol(all_num)) {
if (any(all_num[, i] <= 0)) {
process <- preProcess(as.data.frame(all_num[, i]), method = "range")
all_num[, i] <- predict(process, as.data.frame(all_num[, i]))
minMax <- c(minMax, i)
}
if (skew(all_num[, i]) > 0.8) {
all_num[, i] <- log(all_num[, i] + 1)
log_names <- c(log_names, i)
} else if (skew(all_num[, i]) < -0.8) {
all_num[, i] <- all_num[, i]^2
sq_names <- c(sq_names, i)
}
}
log_names <- names(all_num)[log_names]
minMax_names <- names(all_num)[minMax]
sq_names <- names(all_num)[sq_names]
These columns have been log + 1 due to their skewness.
(The + 1 is to prevent logging 0, resulting in NA)
The remaining data is normalized by feature scaling and mean normalization.
all_num[!names(all_num) %in% log_names] <- as.data.frame(scale(all_num[!names(all_num) %in%
log_names]))
I will convert all the remaining variables to numeric (it is required by many machine learning algorithms).
all_fac <- as.data.frame(model.matrix(~. - 1, as.data.frame(all_fac)))
skew(salary)
## [1] 1.564389
The skewness of salary is too high, which will be harder to fit a model
qqnorm(salary)
qqline(salary)
This is a QQ plot where the x-axis is the theoretical quantiles while
the y-axis is the sample quantile. The diagonal line is where sample
theoretical quantiles are perfectly aligned. The placement of the points
indicate the skewness of the sample.
Salary is too right skewed and not normally distributed.
salary <- log(salary)
skew(salary)
## [1] 0.04653786
After logging, the points lie more towards the line (less skewed).
qqnorm(salary)
qqline(salary)
plot(x = 1:length(salary), y = sort(salary))
abline(h = 14)
I treat log(salary) < 14 as outliers and remove them from the data.
all_noNorm <- cbind(X = 1:nrow(all), salary = salary, all[, numVar], all_fac)
alldata <- cbind(X = 1:nrow(all), salary = salary, all_num, all_fac)
row.names(alldata) <- players$name
name <- players$name[alldata$salary > 14]
ind <- which(alldata$salary > 14)
alldata <- alldata[alldata$salary > 14, ]
all_noNorm <- all_noNorm[alldata$salary > 14, ]
I will use train-test split and 10-fold cross-validation in the training set,
inTrain <- sample(1:2, size = nrow(alldata), prob = c(0.8, 0.2), replace = TRUE)
train <- alldata[inTrain == 1, ]
test <- alldata[inTrain == 2, ]
train_noNorm <- all_noNorm[inTrain == 1, ]
test_noNorm <- all_noNorm[inTrain == 2, ]
write.csv(alldata[, -c(1, 2)], "dataset/normalized_data.csv")
The modelling parameter to be used is training root mean squared error to determine which model to choose.
mod_lm <- lm(salary ~ . - X, data = train)
plot(mod_lm, which = 1)
plot(mod_lm, which = 2)
plot(mod_lm, which = 3)
plot(mod_lm, which = 4)
abline(h = 4/nrow(train))
The cook’s distance measures the residual and leverage of that point. It
represents the degree of influence on the regression model. We will look
at the player with a high cook’s distance.
sort(cooks.distance(mod_lm), decreasing = TRUE)[1:3]
## Jaden Springer Desmond Bane Devin Cannady
## 0.42218052 0.04500047 0.04449516
players[players$name %in% names(which(cooks.distance(mod_lm) > 4/nrow(train))), ]
## player_id name
## 43 edwarca01 Carsen Edwards
## 91 banede01 Desmond Bane
## 93 cannade01 Devin Cannady
## 114 kaminfr01 Frank Kaminsky
## 150 sprinja01 Jaden Springer
## 166 culveja01 Jarrett Culver
## 180 harrijo01 Joe Harris
## 189 poolejo01 Jordan Poole
## 220 loveke01 Kevin Love
## 240 jamesle01 LeBron James
## 264 portemi01 Michael Porter Jr.
## 308 tuckera01 Rayjon Tucker
## 322 westbru01 Russell Westbrook
## 324 beysa01 Saddiq Bey
## 344 taylote01 Terry Taylor
## 365 halibty01 Tyrese Haliburton
## 367 jonesty01 Tyus Jones
summary(mod_lm)
##
## Call:
## lm(formula = salary ~ . - X, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.82226 -0.31639 0.05619 0.35760 1.27033
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.483750 1.394815 11.818 < 2e-16 ***
## Age 0.307495 0.037006 8.309 5.45e-15 ***
## MP -0.328533 0.100108 -3.282 0.00117 **
## FG 1.411302 1.044245 1.352 0.17771
## X3P -0.162821 0.181761 -0.896 0.37119
## FT -0.317065 1.277467 -0.248 0.80418
## DRB 2.461169 1.768175 1.392 0.16514
## AST 3.371172 1.418029 2.377 0.01816 *
## PTS -1.402118 0.895750 -1.565 0.11873
## PER 0.373310 0.677644 0.551 0.58218
## TSpct -2.156322 2.921740 -0.738 0.46117
## ORBpct 2.857177 3.851514 0.742 0.45886
## DRBpct 6.086728 6.086241 1.000 0.31821
## TRBpct -11.152085 9.186810 -1.214 0.22588
## ASTpct -3.406084 1.245786 -2.734 0.00669 **
## STLpct -0.259637 0.752566 -0.345 0.73037
## BLKpct 0.398177 1.442842 0.276 0.78279
## TOVpct 0.989632 1.092366 0.906 0.36580
## USGpct 0.034019 0.155156 0.219 0.82662
## OWS 2.474102 1.415557 1.748 0.08168 .
## DWS 0.230342 0.102941 2.238 0.02610 *
## OBPM 0.137806 0.139308 0.989 0.32348
## DBPM -0.044575 0.090066 -0.495 0.62108
## VORP -1.675764 1.884377 -0.889 0.37467
## Weight -0.040123 0.059448 -0.675 0.50032
## Height 0.134180 0.071508 1.876 0.06172 .
## possGain 0.411751 1.275648 0.323 0.74712
## possLost -0.001725 0.107122 -0.016 0.98717
## FGM 0.181024 0.270827 0.668 0.50447
## X3M 0.333070 0.161365 2.064 0.04001 *
## FTM 1.307387 0.587963 2.224 0.02704 *
## PositionC -0.220577 0.196897 -1.120 0.26364
## PositionPF -0.135904 0.143901 -0.944 0.34583
## PositionPG -0.207653 0.124694 -1.665 0.09706 .
## PositionSF -0.214658 0.115568 -1.857 0.06439 .
## PositionSG NA NA NA NA
## starter1 0.206174 0.108258 1.904 0.05796 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5332 on 259 degrees of freedom
## Multiple R-squared: 0.7608, Adjusted R-squared: 0.7285
## F-statistic: 23.54 on 35 and 259 DF, p-value: < 2.2e-16
The r squared of the model is 0.7608406
rmse(train$salary, mod_lm$fitted.values)
## [1] 0.4996236
rmse(test$salary, predict(object = mod_lm, newdata = test))
## [1] 0.5427067
The train root mean squared error is 1.1762827, while the testing (out-of-bag) root mean squared error is 0.4904072.
train_control <- trainControl(method = "cv", number = 10)
param_grid <- expand.grid(alpha = 1, lambda = seq(0.001, 0.1, by = 5e-04))
mod_lasso <- train(salary ~ . - X, data = train, method = "glmnet", trControl = train_control,
tuneGrid = param_grid)
Final model hyperparameter
mod_lasso$bestTune
## alpha lambda
## 30 1 0.0155
lambda is the regularization penalty.
The coefficient of the final model:
coef(mod_lasso$finalModel, mod_lasso$bestTune$lambda)
## 37 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 14.627020415
## Age 0.281093896
## MP .
## FG 0.216383524
## X3P .
## FT .
## DRB 0.683145734
## AST .
## PTS .
## PER .
## TSpct .
## ORBpct -0.065845405
## DRBpct .
## TRBpct .
## ASTpct -0.117123868
## STLpct -0.200340328
## BLKpct 0.786544105
## TOVpct .
## USGpct .
## OWS 0.004649154
## DWS 0.003991165
## OBPM .
## DBPM .
## VORP 1.322332179
## Weight .
## Height 0.037242954
## possGain 0.315031195
## possLost .
## FGM 0.254924634
## X3M 0.108150531
## FTM 0.672180412
## PositionC -0.017661106
## PositionPF 0.034136845
## PositionPG .
## PositionSF .
## PositionSG 0.090828654
## starter1 0.332252523
mod_lasso$results$Rsquared[mod_lasso$results$lambda == mod_lasso$bestTune$lambda]
## [1] 0.707579
The r squared of the best-tuned model is 0.707579.
rmse(predict(mod_lasso), train$salary)
## [1] 0.5245659
rmse(predict(mod_lasso, test), test$salary)
## [1] 0.5291957
The train root mean squared error is 1.1476965, while the testing (out-of-bag) root mean squared error is 0.5076991.
mod_elaNet <- train(salary ~ . - X, data = train, method = "glmnet", tuneLength = 10,
trControl = trainControl(method = "cv", number = 10))
mod_elaNet$results$Rsquared[which.min(mod_elaNet$results$RMSE)]
## [1] 0.7177684
The R squared is 0.7177684
rmse(predict(mod_elaNet), train$salary)
## [1] 0.5133823
rmse(predict(mod_elaNet, test), test$salary)
## [1] 0.5344562
The train root mean squared error is 1.1569839, while the testing (out-of-bag) root mean squared error is 0.497098.
mod_stepAIC <- stepAIC(mod_lm, scope = list(upper = ~., lower = ~1), trace = FALSE,
direction = "both")
RSS <- sum((mod_stepAIC$fitted.values - train$salary)^2)
TSS <- sum((train$salary - mean(train$salary))^2)
RSQ <- 1 - RSS/TSS
RSQ
## [1] 0.7535305
The R-squared is 0.7535305 and the AIC value is -364.5193241
mod_stepAIC$coefficients
## (Intercept) Age MP FG DRB AST
## 15.6844715 0.2904833 -0.2329144 1.4273136 2.1189764 2.6561159
## PTS TRBpct ASTpct BLKpct OWS DWS
## -1.2030373 -1.9317407 -2.3767727 1.0820445 1.4693787 0.1176143
## Height FGM X3M FTM starter1 PositionSG
## 0.1077515 0.2414456 0.1694718 1.2876026 0.1944891 0.1708528
rmse(mod_stepAIC$fitted.values, train$salary)
## [1] 0.5072018
rmse(predict(mod_stepAIC, test), test$salary)
## [1] 0.5224379
The train root mean squared error is 1.1703253, while the testing (out-of-bag) root mean squared error is 0.4957687.
mod_dt <- train(salary ~ ., data = train_noNorm[, -1], method = "rpart", trControl = trainControl("cv",
number = 10), tuneLength = 20)
Final model hyperparameter
mod_dt$bestTune
## cp
## 14 0.01577353
cp indicate the complexity of the tree.
fancyRpartPlot(mod_dt$finalModel, main = "Decision Tree", type = 1)
mod_dt$results$Rsquared[as.numeric(row.names(mod_dt$bestTune))]
## [1] 0.6438431
The R-squared of the model is 0.6438431
rmse(train_noNorm$salary, predict(mod_dt))
## [1] 0.509649
pred <- predict(mod_dt, test_noNorm)
rmse(test_noNorm$salary, pred)
## [1] 0.5939312
The train root mean squared error is 1.1648492, while the testing (out-of-bag) root mean squared error is 1.3939946.
mod_rft <- train(salary ~ ., train[, -c(1)], method = "rf", trControl = trainControl(method = "cv",
number = 10))
Final model hyperparameter:
mod_rft$bestTune
## mtry
## 3 36
mtry is the number of samples in each resampling.
rsq_rft <- mod_rft$results$Rsquared[as.numeric(row.names(mod_rft$bestTune))]
The R-squared is 0.7034341.
rmse(predict(mod_rft$finalModel, train), train$salary)
## [1] 0.2264029
rmse(predict(mod_rft$finalModel, test), test$salary)
## [1] 0.4720607
The train root mean squared error is 0.2934023, while the testing (out-of-bag) root mean squared error is 0.3208452.
For the neural network, it is tested with different layers and nodes. This is the final mode hyperparameter with 100 nodes in the first hidden layer, 75 nodes in the second hidden layer and 50 nodes in the third hidden layer. (The experimental record can be found in neural_network_train_record.md)
tunegrid_neural <- c(100, 75, 50)
tunegrid_neural <- as.data.frame(t(matrix(tunegrid_neural, nrow = 3)))
colnames(tunegrid_neural) <- c("layer1", "layer2", "layer3")
mod_neur <- train(salary ~ . - X, data = train, method = "neuralnet", tuneGrid = tunegrid_neural,
trControl = trainControl(method = "cv", number = 10, verboseIter = TRUE), linear.output = TRUE)
mod_neur$results$RMSE
mod_neur$results$Rsquared
## [1] 0.6624476
The R-squared is 0.6624476.
mod_neur$results$RMSE[as.numeric(row.names(mod_neur$bestTune))]
## [1] 0.6170212
rmse(test$salary, predict(mod_neur$finalModel, test))
## [1] 0.6328699
The train root mean squared error is 0.6170212, while the testing (out-of-bag) root mean squared error is 0.3406681.
Random forest model will be used since it has the lowest training RMSE.
mtry - Number of variables randomly sampled as candidates at each split
mod_rft$finalModel$mtry
## [1] 36
The number of samples per split is 36.
ntree - number of decision in the random forest
mod_rft$finalModel$ntree
## [1] 500
The number of decision trees in the random forest is 500.
Training error:
rmse(predict(mod_rft), train$salary)
## [1] 0.2264029
absErr <- mean(abs(exp(predict(mod_rft)) - exp(train$salary)))
absErr
## [1] 1587135
The root mean squared error is 1.1658628.
The average difference between the predicted and actual values in the
training set is 1.6 million USD.
Testing error (out-bag-error): This is the expected error for unseen data.
rmse(predict(mod_rft, test), test$salary)
## [1] 0.4720607
absErr <- mean(abs(exp(predict(mod_rft, test)) - exp(test$salary)))
absErr
## [1] 3034959
The out-of-bag root mean squared error is 0.3208452.
The average difference between the predicted and actual values in the
training set is 3 million USD, which means the expected error for the
salary from unseen data is plus or minus 3 million USD.
RSQ <- mod_rft$results$Rsquared[as.numeric(row.names(mod_rft$bestTune))]
The R-squared of the final model is 0.7034341, which the model has accounted for 70.3% of the variation of the salary.
imp_rf <- mod_rft$finalModel$importance
imp_df <- data.frame(Variables = row.names(imp_rf), MSE = imp_rf[, 1])
imp_df <- imp_df[order(imp_df$MSE, decreasing = TRUE), ]
g1 <- ggplot(imp_df[1:20, ], aes(x = reorder(Variables, MSE), y = MSE, fill = MSE)) +
geom_bar(stat = "identity") + labs(x = "Variables", y = "% increase MSE if variable is randomly permuted") +
coord_flip() + theme(legend.position = "none") + theme_minimal()
g1
The length of the bar is the percentage increase in MSE is randomly
permuted. The higher the value, the more important the variable is. From
the graph, points per game, field goal and age are the top three most
important in the model.
I will use principal component analysis to investigate the importance and correlation between predictive variables.
all_pca <- alldata
all_pca$X <- NULL
all_pca <- as.matrix(all_pca)
rownames(all_pca) <- row.names(alldata)
pca <- prcomp(all_pca, scale = TRUE)
pca.var <- pca$sdev^2
pca.var.per <- round(pca.var/sum(pca.var), 5)
Percentage of information contained in each principal component:
ggplot(data.frame(value = pca.var.per, index = 1:length(pca.var.per)), aes(x = index,
y = pca.var.per)) + geom_bar(stat = "identity") + theme_minimal() + labs(title = "Percentage of information in each principal component",
x = "Principal component", y = "Percentage of information (%)")
The plot shows that first two components have already represented more
than 50% of the variation in the dataset. The following analysis will
focus on the first 2 components.
pca.data <- data.frame(Sample = rownames(pca$x), PC1 = pca$x[, 1], PC2 = pca$x[,
2], id = 1:nrow(pca$x), salary = all$salary[ind])
ggplot(pca.data, aes(x = PC1, y = PC2)) + geom_point() + xlab(paste("PC1 - ", pca.var.per[1],
"%", sep = "")) + ylab(paste("PC2 - ", pca.var.per[2], "%", sep = "")) + theme_minimal() +
ggtitle("PCA Graph")
fig <- plot_ly(pca.data, x = ~PC1, y = ~PC2, z = ~salary, marker = list(color = ~salary,
colorscale = c("#FFE1A1", "#683531"), showscale = TRUE)) %>%
add_markers() %>%
layout(scene = list(xaxis = list(title = "Principal Component 1"), yaxis = list(title = "Principal Component 2"),
zaxis = list(title = "salary")), annotations = list(x = 1.13, y = 1.05, text = "salary",
xref = "paper", yref = "paper", showarrow = FALSE))
fig
Both the first and second principal components show a negative
correlation with salary.
Plot out the principal components that contain more than 5% of the
information of the original dataset.
count = 0
for (i in 1:length(pca.var.per)) {
if (pca.var.per[i] < 0.05) {
break
} else {
print(ggplot(data.frame(score = pca$rotation[, i], var = names(pca$rotation[,
i])), aes(x = reorder(var, abs(score)), y = abs(score), fill = ifelse(pca$rotation[,
i] > 0, "positive", "negative"))) + geom_bar(stat = "identity") + theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 7)) +
scale_color_manual(values = list(positive = "blue", negative = "red")) +
labs(title = paste0("Principal Component ", as.character(i), " (", as.character(pca.var.per[i] *
100), "%)"), x = "variable", y = "proportion") + guides(fill = guide_legend(title = "")))
}
}
cor_Mat <- cor(data.frame(alldata[, numVar], pca$x[, 1:4]))
cor_high <- names(which(rowSums(abs(cor_Mat[, c("PC1", "PC2", "PC3", "PC4")]) > 0.5) >
0))
cor_high <- cor_high[!cor_high %in% c("PC1", "PC2", "PC3", "PC4")]
options(scipen = 100)
round(cor_Mat[cor_high, c("PC1", "PC2", "PC3", "PC4")], digits = 2)
## PC1 PC2 PC3 PC4
## MP -0.78 0.16 0.15 -0.18
## FG -0.93 0.13 0.17 0.08
## X3P -0.55 0.61 0.30 -0.14
## FT -0.86 0.00 0.05 0.18
## DRB -0.79 -0.42 0.08 0.10
## AST -0.77 0.38 -0.39 0.13
## PTS -0.92 0.19 0.19 0.07
## PER -0.68 -0.47 -0.13 -0.13
## TSpct -0.18 -0.48 0.03 -0.55
## ORBpct 0.12 -0.88 -0.04 0.13
## DRBpct -0.25 -0.81 0.01 0.21
## TRBpct -0.12 -0.92 -0.01 0.19
## ASTpct -0.58 0.40 -0.56 0.22
## STLpct -0.04 0.11 -0.59 -0.11
## BLKpct 0.04 -0.71 -0.15 -0.07
## TOVpct -0.02 -0.25 -0.57 0.31
## USGpct -0.68 0.20 0.06 0.45
## OWS -0.70 -0.25 0.03 -0.43
## DWS -0.72 -0.22 -0.08 -0.26
## OBPM -0.79 -0.12 -0.02 -0.30
## DBPM -0.15 -0.40 -0.55 -0.47
## VORP -0.83 -0.18 -0.15 -0.30
## Weight -0.09 -0.77 0.28 0.14
## Height -0.02 -0.80 0.35 0.16
## possGain -0.55 -0.64 -0.12 -0.01
## possLost -0.85 -0.03 -0.10 0.27
## FGM -0.83 0.44 0.16 0.20
## X3M -0.56 0.64 0.27 0.00
## FTM -0.66 -0.34 -0.02 0.33
corrplot(cor_Mat[c("PC1", "PC2", "PC3", "PC4"), cor_high], tl.pos = "lt", method = "number")
It is highly correlated with minutes played, field goal, free throw,
defensive rebound, assists, points, offensive box score, value over
replacement player, possession lost (turnovers and personal fouls), and
field goal missed. This means that the above variables vary together. If
one increases, then the remaining ones tend to increase as well. These
statistics should be correlated through minutes played as the better
players usually have higher minutes and better statistics in these
criteria.
It is highly correlated with offensive rebound percentage, defensive rebound percentage, total rebound percentage, weight, and height. This means these variables tend to vary together. These statistics are all related to rebounding. Weight and height are crucial in grabbing rebounds as higher weight can lead to better box out while taller can lead to easier catching of the ball, leading to better rebounding.
Some variables have been removed due to their multicollinearity with
each other and their interpretability.
We will define the absolute value of correlation higher than 0.75 as too
highly correlated for linear regression.
alldata_lm <- dplyr::select(alldata, !c(X, PTS, FGM, FT, possLost, OBPM, OWS, ASTpct,
X3M, PositionPG, PositionSG, PositionSF, PositionPF, PositionC))
mod_lm_sh <- train(salary ~ ., alldata_lm, tuneLength = 20, trControl = trainControl(method = "cv",
number = 10), method = "lm")
summary(mod_lm_sh)$r.squared
## [1] 0.7296182
rmse(predict(mod_lm_sh), alldata$salary)
## [1] 0.5218133
The R-squared of this model is 0.7296182, which is still high enough for interpretation, and the residual (1.3217758) is also similar.
train_control <- trainControl(method = "cv", number = 10)
mod_lm_lcurve <- learning_curve_dat(dat = alldata_lm, method = "lm", outcome = "salary",
tuneLength = 20, trControl = train_control, metric = "RMSE", proportion = (1:10)/10)
ggplot(mod_lm_lcurve, aes(x = Training_Size, y = RMSE, color = Data)) + geom_smooth(method = loess,
span = 0.8, formula = y ~ x) + scale_x_continuous(expand = c(0, 0)) + scale_y_continuous(expand = c(0,
0)) + labs(title = "learning curve of linear regression", x = "Training Size",
y = "RMSE") + theme_minimal()
As the learning curve’s resampling error and train error both converge
to a low RMSE, this shows that the model with these variables is neither
overfitting nor underfitting.
summary(mod_lm_sh)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.76806 -0.33955 0.01651 0.36756 1.27412
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.300815 0.559333 27.355 < 0.0000000000000002 ***
## Age 0.292665 0.032188 9.092 < 0.0000000000000002 ***
## MP -0.143047 0.078545 -1.821 0.069414 .
## FG 0.642475 0.285371 2.251 0.024973 *
## X3P 0.117674 0.056309 2.090 0.037348 *
## DRB 2.452007 1.265765 1.937 0.053517 .
## AST 0.498903 0.652059 0.765 0.444709
## PER -0.140897 0.450991 -0.312 0.754908
## TSpct -1.365468 1.937087 -0.705 0.481330
## ORBpct 2.010968 3.163803 0.636 0.525435
## DRBpct 1.653886 4.968362 0.333 0.739419
## TRBpct -5.077570 7.377477 -0.688 0.491743
## STLpct -0.314362 0.601318 -0.523 0.601447
## BLKpct 2.201886 1.119657 1.967 0.050011 .
## TOVpct -0.055161 0.819803 -0.067 0.946392
## USGpct 0.029995 0.087718 0.342 0.732596
## DWS 0.076833 0.079427 0.967 0.334027
## DBPM 0.009965 0.067802 0.147 0.883240
## VORP 1.501991 0.728270 2.062 0.039896 *
## Weight -0.045318 0.049069 -0.924 0.356339
## Height 0.129969 0.052146 2.492 0.013144 *
## possGain -0.697663 0.986468 -0.707 0.479885
## FTM 0.786640 0.472979 1.663 0.097163 .
## starter1 0.298772 0.085892 3.478 0.000567 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5392 on 355 degrees of freedom
## Multiple R-squared: 0.7296, Adjusted R-squared: 0.7121
## F-statistic: 41.65 on 23 and 355 DF, p-value: < 0.00000000000000022
We will look at variables with significant weight (p-value <
0.1).
Age:
## Estimate Std. Error
## 0.292665200372947242879462 0.032187685763820542139690
## t value Pr(>|t|)
## 9.092458604212778183750743 0.000000000000000006995567
The age shows a very significant positive weight in the model. The
weight is 0.293.
The possibility of age results in that weight for the null hypothesis
being 0 is less than \(2\cdot 10^{-14}
\%\). The weight is more than 9 standard errors away from
0.
The NBA minimum salary increases with the player’s experience ranging
from 953k for 0 experience to 2.721M for 10 or more years.
In addition, older players are often more experienced, while players
with more experience usually give more confidence to the team. They have
already proved with time their ability to play in NBA, which made the
team willing to give them a higher salary.
Starter:
(starter1 = 1 means the player starts for more than 50% of the game,
starter1 = 0 means the player starts for less than 50% of the game)
## Estimate Std. Error t value Pr(>|t|)
## 0.2987724122 0.0858921270 3.4784609781 0.0005669854
The starter factor variable shows a very significant positive weight
in the model. The weight is 0.299.
The possibility of the weight resulting in this weight is 0.057%, which
is about 3.5 standard errors away from 0.
Starter players are often the best players in the team and play the most
in a game. Hence, it is reasonable that the starter players get a higher
salary.
Field Goal:
## Estimate Std. Error t value Pr(>|t|)
## 0.64247509 0.28537135 2.25136505 0.02497346
The field goal has a weight of 0.642 and shows high
significance.
The p-value is \(2.5\cdot 10^{-2}\) and
2.3 standard error from 0.
Players with more field goals have more firepower and a stronger ability
to score. As the NBA encourages offence, players with strong offensive
abilities are worth more. A high means a higher field goal attempt rate
(correlation of FG and FGA = 0.9417247), which represents the players’
offensive importance in the team. Hence, these players with high field
goals will usually be paid more.
3 Point Field Goal:
## Estimate Std. Error t value Pr(>|t|)
## 0.11767364 0.05630899 2.08978435 0.03734845
The 3-point field goal has a weight of 0.118 and shows high
significance.
The p-value is 3.7% and 2.1 standard error away from 0.
In the modern NBA, spacing has become more important and valued by every
team in the league. It can lead to more wide-open chances, more
defensive loopholes from the opponent team and thus easier points. The
3-point ability of a team is one of the crucial factors that determine
the team’s offensive spacing. Hence, players with solid 3-point ability
are often valuable to the team. A high 3-point field goal indicates that
the player has stable 3-point ability, leading to a higher salary.
Value Over Replacement Player:
Value over Replacement Player (VORP) converts the BPM rate into an estimate of each player's overall contribution to the team, measured vs. what a theoretical "replacement player" would provide, where the "replacement player" is defined as a player on minimum salary or not a normal member of a team's rotation.
(source: https://www.basketball-reference.com/about/bpm2.html)
## Estimate Std. Error t value Pr(>|t|)
## 1.50199100 0.72827021 2.06240897 0.03989593
The value over replacement player has a weight of 1.5 and shows high
significance.
The p-value is 3.4% and 2.1 standard error away from 0.
The value over replacement player (VORP) is an overall estimate of the
player’s contribution to a team compared with a rotational player. The
higher the VORP, the better the player’s performance and thus usually
have a higher salary.
Height:
## Estimate Std. Error t value Pr(>|t|)
## 0.12996933 0.05214643 2.49239169 0.01314391
The weight is 0.13 and shows high significance.
The p-value is 1.3% and 2.5 standard errors away from 0.
Height has always been a critical player attribute throughout the
history of basketball. Pace and acceleration have played a key role in
modern NBA, indirectly reducing the importance of height as height is
usually negatively correlated to quickness. However, it is undeniable
that height will affect the players finishing ability, defensive
ability, et al. Shorter players usually have less defensive capability,
which decreases the players’ overall value and thus have a lower
salary.
Minutes Played:
## Estimate Std. Error t value Pr(>|t|)
## -0.14304745 0.07854464 -1.82122491 0.06941423
The minutes played has a weight of -0.143 and shows moderate
significance.
The p-value is 6.9% and -1.8 standard away from 0.
This result contradicts the correlation between minutes played and
salary of 0.6089823.
It shows that the players with high salaries but low minutes played
might be the reason for the result. Most of these players have long-term
injuries, which shows that more minutes played does not necessarily
correlate with higher salaries. The players’ salaries can also be
affected by passing performance and team contribution outside the
court.
Defensive Rebound:
## Estimate Std. Error t value Pr(>|t|)
## 2.45200687 1.26576543 1.93717321 0.05351719
It has a weight of 2.5 and shows moderate significance.
It has a p-value of 5.4% and 1.9 standard error away from 0.
Securing defensive rebounds is vital in a game after the opponent misses
their field goal attempts. A high defensive rebound represents a less
second-chance point for the opponent and a chance for transition
offence. A high defensive rebound can also indirectly indicate the
team’s defensive ability, as a defensive rebound is a result of a
successful defence. This factor will lead to a higher salary for the
player.
Block Percentage:
BLK% - Block Percentage (available since the 1973-74 season in the NBA); the formula is 100 * (BLK * (Tm MP / 5)) / (MP * (Opp FGA - Opp 3PA)). Block percentage is an estimate of the percentage of opponent two-point field goal attempts blocked by the player while he was on the floor.
## Estimate Std. Error t value Pr(>|t|)
## 2.20188562 1.11965738 1.96657090 0.05001135
It has a weight of 2.2 and shows moderate significance.
It has a p-value of 5% and 2 standard errors away from 0.
A high block percentage represents a solid defensive ability which
increases the players’ value, which results in a higher salary.
Free Throw Missed:
## Estimate Std. Error t value Pr(>|t|)
## 0.78663950 0.47297942 1.66315798 0.09716335
It has a weight of 0.79 and shows moderate significance.
It has a p-value of 9.7% and 1.6 standard error away from 0. The result
is counter-intuitive as a better free throw player should be better.
Although it is counter-intuitive, the correlation between free throws missed and salary is 0.5263002. Some players like Giannis Antetokounmpo and LeBron James might explain this as they have a solid overall ability but low free throw percentages.
There are many factors affecting the NBA players’ salaries, and this study provides insight into how the players’ court statistics will affect the players’ salaries. The essential attributes are age, whether they are a starter in the team and points-related attributes (including points, field goals, and free throws). When evaluating an NBA player, we should focus on the field goals per game, 3-point field goals per game, their value over replacement players, height, minutes played, and defensive rebound, according to the study.
The variance not explained by this model could be explained from other perspectives.
There are different ways of predicting an NBA player’s salary back in the old days, without any help from machine learning. The most common way is interviewing experts that are employed by different NBA teams.
There are different aspects that an NBA team would consider when they offer a contract to a player. The first and most common one is using the Predicted Market Value(PMV) of a player . By using this method, it considers the value of a player as the percentage of salary cap space he encompasses. The PMV of the player is calculated using different aspects such as age, performance in the previous season, whether it suits the team etc. The PMV is often used as a reference to the general manager of the team.
Apart from that, interviewing experts that are specialised in sports science is also a way to predict the salary of an NBA player. Those experts could predict the market value of a player by observing the ability of the player, suitability of the player to the team, the age of the player, et, al. And they would compare the salary of the specific player to a player that has a similar market value in order to give a more comprehensive judgement. Therefore, the opinions of the experts are often taken into consideration by the teams.
Lastly, there is another aspect that many teams would consider, which is the reputation of the player. If the player is an All-star player, in general, they would have higher pay. It is because All-star players have a stronger fan base. Their fans would purchase their jerseys and sneakers, which would bring a fortune to the team. Moreover, it would bring more fans to support their team. Therefore, it is another factor that would lead to a higher salary for a player.
We can consider the economic factors that may affect the salary of NBA players. The main concern of different teams’ owners would be profit. As a result, the business value of every player may also be concerned as a factor for team owners to raise or lower the salary of team players.
The business value of team players could take many different data into account, such as the player’s followers on social media, which may lead to a further promotion to the team. It could also be the sales of NBA jerseys or other related souvenirs, which produces profit directly for the team. If keeping the player would give the team owner a larger return, a higher salary would be given.
However, this approach would be affected by the location of the teams’ home stadiums. Larger and richer cities would bring the players a much higher business value. Furthermore, there are salary restrictions for players in NBA, so the estimated value would not be accurate and close to reality.