1 Introduction

Hi, my name is Koki Ando.
I’ve been a big fan of NBA for about 10 years. And recently, Ive been teaching myself R, and i really enjoy using it especcially when im dealing with data about basketball.
Ive done some data analysis about basketball before, but this time, I’m going to dive into nba players’ salary data. through this analysis, what i liked about was predicting something using linear regression model, which i just taught myself a few days ago. hope you guys also find this data analysis interesting.

2 Preparation

2.1 Required packages

library(data.table)
library(corrplot)
## corrplot 0.84 loaded
library(GGally)
library(tidyverse)
## -- Attaching packages ---------------------------------- tidyverse 1.2.1 --
## <U+221A> ggplot2 2.2.1     <U+221A> purrr   0.2.4
## <U+221A> tibble  1.3.4     <U+221A> dplyr   0.7.4
## <U+221A> tidyr   0.7.2     <U+221A> stringr 1.3.0
## <U+221A> readr   1.1.1     <U+221A> forcats 0.3.0
## -- Conflicts ------------------------------------- tidyverse_conflicts() --
## x dplyr::between()   masks data.table::between()
## x dplyr::filter()    masks stats::filter()
## x dplyr::first()     masks data.table::first()
## x dplyr::lag()       masks stats::lag()
## x dplyr::last()      masks data.table::last()
## x purrr::transpose() masks data.table::transpose()
library(PerformanceAnalytics)
## Loading required package: xts
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## Attaching package: 'xts'
## The following objects are masked from 'package:dplyr':
## 
##     first, last
## The following objects are masked from 'package:data.table':
## 
##     first, last
## 
## Attaching package: 'PerformanceAnalytics'
## The following object is masked from 'package:graphics':
## 
##     legend
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout

2.2 Data Preparation

“salary.table” dataset was not provided at kaggle. So i scraped this site and created by myself. full code is on my GitHub page. let me know whatever u think. I needed to create this dataset because i wanted to compare salary dataset of this season(17-18) to the nba players stats of the previous season(16-17).
I wanted to investigate the how the stats effect next season’s salary they get.

salary.table <- 
  fread("https://s3-ap-southeast-2.amazonaws.com/koki25ando/NBA_season1718_salary.csv", 
        data.table = FALSE)
ss <- fread("https://s3-ap-southeast-2.amazonaws.com/playerinfomation/Seasons_Stats.csv",
            data.table = FALSE)

just in case u want to know how the datasets look like.

str(salary.table)
## 'data.frame':    573 obs. of  4 variables:
##  $ V1        : chr  "1" "2" "3" "4" ...
##  $ Player    : chr  "Stephen Curry" "LeBron James" "Paul Millsap" "Gordon Hayward" ...
##  $ Tm        : chr  "GSW" "CLE" "DEN" "BOS" ...
##  $ seson17_18: num  34682550 33285709 31269231 29727900 29512900 ...
str(ss)
## 'data.frame':    24691 obs. of  53 variables:
##  $ V1    : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ Year  : int  1950 1950 1950 1950 1950 1950 1950 1950 1950 1950 ...
##  $ Player: chr  "Curly Armstrong" "Cliff Barker" "Leo Barnhorst" "Ed Bartels" ...
##  $ Pos   : chr  "G-F" "SG" "SF" "F" ...
##  $ Age   : int  31 29 25 24 24 24 22 23 28 28 ...
##  $ Tm    : chr  "FTW" "INO" "CHS" "TOT" ...
##  $ G     : int  63 49 67 15 13 2 60 3 65 36 ...
##  $ GS    : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ MP    : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ PER   : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ TS%   : num  0.368 0.435 0.394 0.312 0.308 0.376 0.422 0.275 0.346 0.362 ...
##  $ 3PAr  : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ FTr   : num  0.467 0.387 0.259 0.395 0.378 0.75 0.301 0.313 0.395 0.48 ...
##  $ ORB%  : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ DRB%  : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ TRB%  : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ AST%  : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ STL%  : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ BLK%  : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ TOV%  : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ USG%  : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ blanl : logi  NA NA NA NA NA NA ...
##  $ OWS   : num  -0.1 1.6 0.9 -0.5 -0.5 0 3.6 -0.1 -2.2 -0.7 ...
##  $ DWS   : num  3.6 0.6 2.8 -0.1 -0.1 0 1.2 0 5 2.2 ...
##  $ WS    : num  3.5 2.2 3.6 -0.6 -0.6 0 4.8 -0.1 2.8 1.5 ...
##  $ WS/48 : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ blank2: logi  NA NA NA NA NA NA ...
##  $ OBPM  : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ DBPM  : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ BPM   : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ VORP  : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ FG    : int  144 102 174 22 21 1 340 5 226 125 ...
##  $ FGA   : int  516 274 499 86 82 4 936 16 813 435 ...
##  $ FG%   : num  0.279 0.372 0.349 0.256 0.256 0.25 0.363 0.313 0.278 0.287 ...
##  $ 3P    : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ 3PA   : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ 3P%   : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ 2P    : int  144 102 174 22 21 1 340 5 226 125 ...
##  $ 2PA   : int  516 274 499 86 82 4 936 16 813 435 ...
##  $ 2P%   : num  0.279 0.372 0.349 0.256 0.256 0.25 0.363 0.313 0.278 0.287 ...
##  $ eFG%  : num  0.279 0.372 0.349 0.256 0.256 0.25 0.363 0.313 0.278 0.287 ...
##  $ FT    : int  170 75 90 19 17 2 215 0 209 132 ...
##  $ FTA   : int  241 106 129 34 31 3 282 5 321 209 ...
##  $ FT%   : num  0.705 0.708 0.698 0.559 0.548 0.667 0.762 0 0.651 0.632 ...
##  $ ORB   : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ DRB   : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ TRB   : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ AST   : int  176 109 140 20 20 0 233 2 163 75 ...
##  $ STL   : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ BLK   : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ TOV   : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ PF    : int  217 99 192 29 27 2 132 6 273 140 ...
##  $ PTS   : int  458 279 438 63 59 4 895 10 661 382 ...

2.2.1 Data Cleaning

I wanted to use the stats data of 2016-17 only, so that i filtered. And unfortunattely stats per game was not in this “ss” dataset. so i mutated them all.

stats17 <- 
  ss %>% filter(Year >= 2017) %>% 
  select(Year:G, MP, PER, FG:PTS) %>% 
  distinct(Player, .keep_all = TRUE) %>% 
  mutate(MPG = MP/G, PPG = PTS/G, APG = AST/G, 
         RPG = TRB/G, TOPG = TOV/G, BPG = BLK/G, 
        SPG = STL/G) 

2.2.2 Merging Data

stats_salary <- merge(stats17, salary.table, by.x = "Player", by.y = "Player")
names(stats_salary)[40] <- "salary17_18"
stats_salary <- stats_salary[-39]

3 Correlation check

3.1 Correlation check No.1

corrplot(cor(stats_salary %>% 
               select(salary17_18, MPG:SPG, 
                      Age, PER, contains("%")), 
             use = "complete.obs"), 
         method = "circle",type = "upper")

3.2 Correlation check No.2

stats_salary_cor <- 
  stats_salary %>% 
  select(salary17_18, PPG, MPG, TOPG, RPG, PER, SPG, APG)
ggpairs(stats_salary_cor)

cor(stats_salary_cor)[,"salary17_18"]
## salary17_18         PPG         MPG        TOPG         RPG         PER 
##   1.0000000   0.7031051   0.6693910   0.5842982   0.5665350   0.5215509 
##         SPG         APG 
##   0.5118549   0.4856552

Correlation strength is: PPG > MPG > TOPG > RPG > PER > SPG > APG

the interesting part of this is that the number of turnover players make is linked to their salary, and they have a positive correlation.
So, i interpreted this relationship like this: the more turnovers they makes means that they are more involved in ball movement, which means that players who make turnovers are, at some extend, important to their team. and i thought this could be expressed as “agressiveness”. i know this interpretation could not be appropriate one. maybe next time, i should get data which include ho long players keep ball.

4 Data Visualization

4.1 Interactive Plot

names(stats_salary)[5] <- "Team"
plot_ly(data = stats_salary, x = ~salary17_18, y = ~PPG, color = ~Team,
        text = ~paste("Player: ", Player))
## No trace type specified:
##   Based on info supplied, a 'scatter' trace seems appropriate.
##   Read more about this trace type -> https://plot.ly/r/reference/#scatter
## No scatter mode specifed:
##   Setting the mode to markers
##   Read more about this attribute -> https://plot.ly/r/reference/#scatter-mode
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors

If you hover on the points in this plot, you can check the player names.
As you can tell the most paid player is Stephen Curry. Last summer he got an extended contract with Warriors.

4.2 Scatter plot with regression line

stats_salary %>% 
  ggplot(aes(x = salary17_18, y = PPG)) + 
  geom_point() + 
  geom_smooth(method = "lm") 

5 Regression Analysis

stats_salary_regression <- 
  stats_salary %>% select(salary17_18, MPG:SPG)
lm(salary17_18~., data=stats_salary_regression)
## 
## Call:
## lm(formula = salary17_18 ~ ., data = stats_salary_regression)
## 
## Coefficients:
## (Intercept)          MPG          PPG          APG          RPG  
##    -2792909        30565       686815      1059087       916087  
##        TOPG          BPG          SPG  
##    -2709447       470136       631255
  1. Point per game increases salary by $686,815 per year
  2. The more assists they make the more salary they get

5.1 Is the Player trusted by coach? How many turnovers do they make?

avg.minutes <- mean(stats_salary_regression$MPG)
avg.turnover <- mean(stats_salary_regression$TOPG)
stats_salary_regression$Trusted <- as.factor(ifelse(stats_salary_regression$MPG >= avg.minutes, "Yes", "No"))
stats_salary_regression$Agressiveness <- as.factor(ifelse(stats_salary_regression$TOPG >= avg.turnover, "Yes", "No"))
head(stats_salary_regression)
##   salary17_18       MPG       PPG       APG      RPG      TOPG       BPG
## 1     1312611  7.409091  2.181818 0.1818182 1.636364 0.4545455 0.5909091
## 2     2116955 13.753846  4.953846 1.9230769 1.061538 1.0153846 0.1384615
## 3     5504420 28.725000 12.737500 1.8750000 5.062500 1.1125000 0.5000000
## 4    27734405 32.250000 14.000000 4.9558824 6.823529 1.7058824 1.2794118
## 5     9769821 14.106061  8.106061 0.8636364 4.212121 0.5000000 0.2424242
## 6     7319035 29.065574  8.721311 1.6229508 7.393443 1.5409836 0.7213115
##          SPG Trusted Agressiveness
## 1 0.04545455      No            No
## 2 0.38461538      No            No
## 3 0.80000000     Yes            No
## 4 0.76470588     Yes           Yes
## 5 0.28787879      No            No
## 6 0.98360656     Yes           Yes

5.2 Scatter plot coloured by how many turnovers they make

stats_salary_regression %>% 
  ggplot(aes(x = salary17_18, y = PPG, colour = Agressiveness)) + 
  geom_point() + 
  geom_smooth(method="lm")

As i mentioned before, players who get paid more tend to make more turnovers.

lm(formula = salary17_18 ~ Trusted * Agressiveness, data=stats_salary_regression)
## 
## Call:
## lm(formula = salary17_18 ~ Trusted * Agressiveness, data = stats_salary_regression)
## 
## Coefficients:
##                 (Intercept)                   TrustedYes  
##                     2914582                      5125780  
##            AgressivenessYes  TrustedYes:AgressivenessYes  
##                      969783                      3518647

if you are trusted by coach (which means, in this case, you get play time more than average in the entire league) and score, the salary increase by $325,079.

6 Modeling & Conclusion

6.1 Prediction Function

salary_prediction <- function(m, point, minutes, turn_over){
  pre_new <- predict(m, data.frame(PPG = point, MPG = minutes, TOPG = turn_over))
  msg <- paste("PPG:", point, ",MPG:", minutes, ",TOPG:", turn_over, " ==> Expected Salary: $", round(pre_new), sep = "")
  print(msg)
}

Let’s take a example.
In this season, as we all know Philadelphia 76ers are having a great season. This time i am going to fucus on JJ Redick.

According to Basketball Reference : J.J. Redick, Redick’s stat is “Points per game : 16.7, Minutes per game : 31.2, Turnovers per game : 1.5” so far in this season.
And he doesn’t have contract with 76ers for nect season.
Let’s predict.

6.2 Analysis conclusion

model <- lm(formula = salary17_18 ~ PPG + MPG + TOPG, data = stats_salary_regression)
salary_prediction(model, 16.7, 31.2, 1.5)
## [1] "PPG:16.7,MPG:31.2,TOPG:1.5 ==> Expected Salary: $13959120"

According to my prediction model, Anhe will get $13,959,120 next season.