Hi, my name is Koki Ando.
I’ve been a big fan of NBA for about 10 years. And recently, Ive been teaching myself R, and i really enjoy using it especcially when im dealing with data about basketball.
Ive done some data analysis about basketball before, but this time, I’m going to dive into nba players’ salary data. through this analysis, what i liked about was predicting something using linear regression model, which i just taught myself a few days ago. hope you guys also find this data analysis interesting.
library(data.table)
library(corrplot)## corrplot 0.84 loaded
library(GGally)
library(tidyverse)## -- Attaching packages ---------------------------------- tidyverse 1.2.1 --
## <U+221A> ggplot2 2.2.1 <U+221A> purrr 0.2.4
## <U+221A> tibble 1.3.4 <U+221A> dplyr 0.7.4
## <U+221A> tidyr 0.7.2 <U+221A> stringr 1.3.0
## <U+221A> readr 1.1.1 <U+221A> forcats 0.3.0
## -- Conflicts ------------------------------------- tidyverse_conflicts() --
## x dplyr::between() masks data.table::between()
## x dplyr::filter() masks stats::filter()
## x dplyr::first() masks data.table::first()
## x dplyr::lag() masks stats::lag()
## x dplyr::last() masks data.table::last()
## x purrr::transpose() masks data.table::transpose()
library(PerformanceAnalytics)## Loading required package: xts
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
##
## Attaching package: 'xts'
## The following objects are masked from 'package:dplyr':
##
## first, last
## The following objects are masked from 'package:data.table':
##
## first, last
##
## Attaching package: 'PerformanceAnalytics'
## The following object is masked from 'package:graphics':
##
## legend
library(plotly)##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
“salary.table” dataset was not provided at kaggle. So i scraped this site and created by myself. full code is on my GitHub page. let me know whatever u think. I needed to create this dataset because i wanted to compare salary dataset of this season(17-18) to the nba players stats of the previous season(16-17).
I wanted to investigate the how the stats effect next season’s salary they get.
salary.table <-
fread("https://s3-ap-southeast-2.amazonaws.com/koki25ando/NBA_season1718_salary.csv",
data.table = FALSE)
ss <- fread("https://s3-ap-southeast-2.amazonaws.com/playerinfomation/Seasons_Stats.csv",
data.table = FALSE)just in case u want to know how the datasets look like.
str(salary.table)## 'data.frame': 573 obs. of 4 variables:
## $ V1 : chr "1" "2" "3" "4" ...
## $ Player : chr "Stephen Curry" "LeBron James" "Paul Millsap" "Gordon Hayward" ...
## $ Tm : chr "GSW" "CLE" "DEN" "BOS" ...
## $ seson17_18: num 34682550 33285709 31269231 29727900 29512900 ...
str(ss)## 'data.frame': 24691 obs. of 53 variables:
## $ V1 : int 0 1 2 3 4 5 6 7 8 9 ...
## $ Year : int 1950 1950 1950 1950 1950 1950 1950 1950 1950 1950 ...
## $ Player: chr "Curly Armstrong" "Cliff Barker" "Leo Barnhorst" "Ed Bartels" ...
## $ Pos : chr "G-F" "SG" "SF" "F" ...
## $ Age : int 31 29 25 24 24 24 22 23 28 28 ...
## $ Tm : chr "FTW" "INO" "CHS" "TOT" ...
## $ G : int 63 49 67 15 13 2 60 3 65 36 ...
## $ GS : int NA NA NA NA NA NA NA NA NA NA ...
## $ MP : int NA NA NA NA NA NA NA NA NA NA ...
## $ PER : num NA NA NA NA NA NA NA NA NA NA ...
## $ TS% : num 0.368 0.435 0.394 0.312 0.308 0.376 0.422 0.275 0.346 0.362 ...
## $ 3PAr : num NA NA NA NA NA NA NA NA NA NA ...
## $ FTr : num 0.467 0.387 0.259 0.395 0.378 0.75 0.301 0.313 0.395 0.48 ...
## $ ORB% : num NA NA NA NA NA NA NA NA NA NA ...
## $ DRB% : num NA NA NA NA NA NA NA NA NA NA ...
## $ TRB% : num NA NA NA NA NA NA NA NA NA NA ...
## $ AST% : num NA NA NA NA NA NA NA NA NA NA ...
## $ STL% : num NA NA NA NA NA NA NA NA NA NA ...
## $ BLK% : num NA NA NA NA NA NA NA NA NA NA ...
## $ TOV% : num NA NA NA NA NA NA NA NA NA NA ...
## $ USG% : num NA NA NA NA NA NA NA NA NA NA ...
## $ blanl : logi NA NA NA NA NA NA ...
## $ OWS : num -0.1 1.6 0.9 -0.5 -0.5 0 3.6 -0.1 -2.2 -0.7 ...
## $ DWS : num 3.6 0.6 2.8 -0.1 -0.1 0 1.2 0 5 2.2 ...
## $ WS : num 3.5 2.2 3.6 -0.6 -0.6 0 4.8 -0.1 2.8 1.5 ...
## $ WS/48 : num NA NA NA NA NA NA NA NA NA NA ...
## $ blank2: logi NA NA NA NA NA NA ...
## $ OBPM : num NA NA NA NA NA NA NA NA NA NA ...
## $ DBPM : num NA NA NA NA NA NA NA NA NA NA ...
## $ BPM : num NA NA NA NA NA NA NA NA NA NA ...
## $ VORP : num NA NA NA NA NA NA NA NA NA NA ...
## $ FG : int 144 102 174 22 21 1 340 5 226 125 ...
## $ FGA : int 516 274 499 86 82 4 936 16 813 435 ...
## $ FG% : num 0.279 0.372 0.349 0.256 0.256 0.25 0.363 0.313 0.278 0.287 ...
## $ 3P : int NA NA NA NA NA NA NA NA NA NA ...
## $ 3PA : int NA NA NA NA NA NA NA NA NA NA ...
## $ 3P% : num NA NA NA NA NA NA NA NA NA NA ...
## $ 2P : int 144 102 174 22 21 1 340 5 226 125 ...
## $ 2PA : int 516 274 499 86 82 4 936 16 813 435 ...
## $ 2P% : num 0.279 0.372 0.349 0.256 0.256 0.25 0.363 0.313 0.278 0.287 ...
## $ eFG% : num 0.279 0.372 0.349 0.256 0.256 0.25 0.363 0.313 0.278 0.287 ...
## $ FT : int 170 75 90 19 17 2 215 0 209 132 ...
## $ FTA : int 241 106 129 34 31 3 282 5 321 209 ...
## $ FT% : num 0.705 0.708 0.698 0.559 0.548 0.667 0.762 0 0.651 0.632 ...
## $ ORB : int NA NA NA NA NA NA NA NA NA NA ...
## $ DRB : int NA NA NA NA NA NA NA NA NA NA ...
## $ TRB : int NA NA NA NA NA NA NA NA NA NA ...
## $ AST : int 176 109 140 20 20 0 233 2 163 75 ...
## $ STL : int NA NA NA NA NA NA NA NA NA NA ...
## $ BLK : int NA NA NA NA NA NA NA NA NA NA ...
## $ TOV : int NA NA NA NA NA NA NA NA NA NA ...
## $ PF : int 217 99 192 29 27 2 132 6 273 140 ...
## $ PTS : int 458 279 438 63 59 4 895 10 661 382 ...
I wanted to use the stats data of 2016-17 only, so that i filtered. And unfortunattely stats per game was not in this “ss” dataset. so i mutated them all.
stats17 <-
ss %>% filter(Year >= 2017) %>%
select(Year:G, MP, PER, FG:PTS) %>%
distinct(Player, .keep_all = TRUE) %>%
mutate(MPG = MP/G, PPG = PTS/G, APG = AST/G,
RPG = TRB/G, TOPG = TOV/G, BPG = BLK/G,
SPG = STL/G) stats_salary <- merge(stats17, salary.table, by.x = "Player", by.y = "Player")
names(stats_salary)[40] <- "salary17_18"
stats_salary <- stats_salary[-39]corrplot(cor(stats_salary %>%
select(salary17_18, MPG:SPG,
Age, PER, contains("%")),
use = "complete.obs"),
method = "circle",type = "upper")stats_salary_cor <-
stats_salary %>%
select(salary17_18, PPG, MPG, TOPG, RPG, PER, SPG, APG)
ggpairs(stats_salary_cor)cor(stats_salary_cor)[,"salary17_18"]## salary17_18 PPG MPG TOPG RPG PER
## 1.0000000 0.7031051 0.6693910 0.5842982 0.5665350 0.5215509
## SPG APG
## 0.5118549 0.4856552
Correlation strength is: PPG > MPG > TOPG > RPG > PER > SPG > APG
the interesting part of this is that the number of turnover players make is linked to their salary, and they have a positive correlation.
So, i interpreted this relationship like this: the more turnovers they makes means that they are more involved in ball movement, which means that players who make turnovers are, at some extend, important to their team. and i thought this could be expressed as “agressiveness”. i know this interpretation could not be appropriate one. maybe next time, i should get data which include ho long players keep ball.
names(stats_salary)[5] <- "Team"
plot_ly(data = stats_salary, x = ~salary17_18, y = ~PPG, color = ~Team,
text = ~paste("Player: ", Player))## No trace type specified:
## Based on info supplied, a 'scatter' trace seems appropriate.
## Read more about this trace type -> https://plot.ly/r/reference/#scatter
## No scatter mode specifed:
## Setting the mode to markers
## Read more about this attribute -> https://plot.ly/r/reference/#scatter-mode
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors
If you hover on the points in this plot, you can check the player names.
As you can tell the most paid player is Stephen Curry. Last summer he got an extended contract with Warriors.
stats_salary %>%
ggplot(aes(x = salary17_18, y = PPG)) +
geom_point() +
geom_smooth(method = "lm") stats_salary_regression <-
stats_salary %>% select(salary17_18, MPG:SPG)
lm(salary17_18~., data=stats_salary_regression)##
## Call:
## lm(formula = salary17_18 ~ ., data = stats_salary_regression)
##
## Coefficients:
## (Intercept) MPG PPG APG RPG
## -2792909 30565 686815 1059087 916087
## TOPG BPG SPG
## -2709447 470136 631255
avg.minutes <- mean(stats_salary_regression$MPG)
avg.turnover <- mean(stats_salary_regression$TOPG)
stats_salary_regression$Trusted <- as.factor(ifelse(stats_salary_regression$MPG >= avg.minutes, "Yes", "No"))
stats_salary_regression$Agressiveness <- as.factor(ifelse(stats_salary_regression$TOPG >= avg.turnover, "Yes", "No"))
head(stats_salary_regression)## salary17_18 MPG PPG APG RPG TOPG BPG
## 1 1312611 7.409091 2.181818 0.1818182 1.636364 0.4545455 0.5909091
## 2 2116955 13.753846 4.953846 1.9230769 1.061538 1.0153846 0.1384615
## 3 5504420 28.725000 12.737500 1.8750000 5.062500 1.1125000 0.5000000
## 4 27734405 32.250000 14.000000 4.9558824 6.823529 1.7058824 1.2794118
## 5 9769821 14.106061 8.106061 0.8636364 4.212121 0.5000000 0.2424242
## 6 7319035 29.065574 8.721311 1.6229508 7.393443 1.5409836 0.7213115
## SPG Trusted Agressiveness
## 1 0.04545455 No No
## 2 0.38461538 No No
## 3 0.80000000 Yes No
## 4 0.76470588 Yes Yes
## 5 0.28787879 No No
## 6 0.98360656 Yes Yes
stats_salary_regression %>%
ggplot(aes(x = salary17_18, y = PPG, colour = Agressiveness)) +
geom_point() +
geom_smooth(method="lm") As i mentioned before, players who get paid more tend to make more turnovers.
lm(formula = salary17_18 ~ Trusted * Agressiveness, data=stats_salary_regression)##
## Call:
## lm(formula = salary17_18 ~ Trusted * Agressiveness, data = stats_salary_regression)
##
## Coefficients:
## (Intercept) TrustedYes
## 2914582 5125780
## AgressivenessYes TrustedYes:AgressivenessYes
## 969783 3518647
if you are trusted by coach (which means, in this case, you get play time more than average in the entire league) and score, the salary increase by $325,079.
salary_prediction <- function(m, point, minutes, turn_over){
pre_new <- predict(m, data.frame(PPG = point, MPG = minutes, TOPG = turn_over))
msg <- paste("PPG:", point, ",MPG:", minutes, ",TOPG:", turn_over, " ==> Expected Salary: $", round(pre_new), sep = "")
print(msg)
}Let’s take a example.
In this season, as we all know Philadelphia 76ers are having a great season. This time i am going to fucus on JJ Redick.
According to Basketball Reference : J.J. Redick, Redick’s stat is “Points per game : 16.7, Minutes per game : 31.2, Turnovers per game : 1.5” so far in this season.
And he doesn’t have contract with 76ers for nect season.
Let’s predict.
model <- lm(formula = salary17_18 ~ PPG + MPG + TOPG, data = stats_salary_regression)
salary_prediction(model, 16.7, 31.2, 1.5)## [1] "PPG:16.7,MPG:31.2,TOPG:1.5 ==> Expected Salary: $13959120"
According to my prediction model, Anhe will get $13,959,120 next season.