Load Packages
library(tidyr)
library(dplyr)
library(psych)
library(tidyverse)
library(ggplot2)Research question
In the NBA, can draft pick location predict success in the league? If so, what are the odds of a second round player, having a successful career in the NBA?
The goal of this project is to predict how successful an NBA player is based on the Draft number in the NBA Draft. Every year, around 60 players get drafted to play in the NBA, but not everyone is successful. To define success, I looked at the amount of years played In the league and VORP (Value Over Replacement Player).
Type of study
This is an observational study.
Explanatory
The explanatory variables are number of years in the league (numerical), VORP(numerical) and experience (categorical).
Data Collection
Data was collected from https://www.basketball-reference.com/draft/ and the list shows career statistics of NBA player’s drafted from 1994 to 2014.
nba_players <- read.csv("https://raw.githubusercontent.com/mikegankhuyag/606-HW/master/NBA%20Draft%20Picks%201994-2014.csv", stringsAsFactors = FALSE)
nba.players <- data.frame(nba_players)
head(nba.players)## Year Rk Pk Tm Player College Yrs G MP
## 1 2014 1 1 CLE Andrew Wiggins University of Kansas 4 251 9069
## 2 2014 2 2 MIL Jabari Parker Duke University 3 152 4874
## 3 2014 3 3 PHI Joel Embiid University of Kansas 2 36 924
## 4 2014 4 4 ORL Aaron Gordon University of Arizona 4 209 5091
## 5 2014 5 5 UTA Dante Exum 2 148 3045
## 6 2014 6 6 BOS Marcus Smart Oklahoma State University 4 211 5995
## PTS TRB AST FG. X3P. FT. MP.1 PTS.1 TRB.1 AST.1 WS WS.48 BPM
## 1 5111 1020 532 0.450 0.332 0.755 36.1 20.4 4.1 2.1 10.5 0.055 -2.3
## 2 2403 847 314 0.491 0.341 0.748 32.1 15.8 5.6 2.1 9.0 0.088 -1.3
## 3 730 294 79 0.469 0.347 0.789 25.7 20.3 8.2 2.2 2.1 0.107 2.7
## 4 2069 1119 322 0.463 0.300 0.701 24.4 9.9 5.4 1.5 10.8 0.102 0.1
## 5 805 263 309 0.385 0.308 0.743 20.6 5.4 1.8 2.1 1.1 0.017 -3.3
## 6 1964 803 772 0.357 0.291 0.757 28.4 9.3 3.8 3.7 9.1 0.073 0.4
## VORP
## 1 -0.8
## 2 0.9
## 3 1.1
## 4 2.7
## 5 -1.0
## 6 3.7
Data Preperation
nba.players$Yrs <-as.character(nba.players$Yrs)
nba.players$Yrs <- as.numeric(nba.players$Yrs)nba.players$Yrs[is.na(nba.players$Yrs)] <- 0nba.players$VORP <- as.character(nba.players$VORP)
nba.players$VORP <- as.numeric(nba.players$VORP)nba.players$VORP[is.na(nba.players$VORP)] <- 0nba.players$Pk <- as.character(nba.players$Pk)
nba.players$Pk <- as.numeric(nba.players$Pk)nba.players$Draft_Inverse <- nba.players$Pk*-1Categorize
For this project, I wanted to categorize success in the NBA by the amount of years a player played in the league.
I categorized years in the league below.
1.Legend - 10 or more years 2.Very Experienced - 8-9 Years 3.Experienced - 6-7 Years 4.Some Experience - 4-5 Years 5.Little Experience - 1-3 Years 6-Unsuccessful - 0 years
Experience_grade <- data.frame(
Years = 0:23,
Experience_g = c("Unsuccessful(0)","Little Experience(1-3)","Little Experience(1-3)","Little Experience(1-3)", "Some Experience(4-5)", "Some Experience(4-5)", "Experienced(6-7)", "Experienced(6-7)", "Very Experienced(8-9)","Very Experienced(8-9)","Legend(10+)","Legend(10+)","Legend(10+)","Legend(10+)","Legend(10+)","Legend(10+)","Legend(10+)","Legend(10+)","Legend(10+)","Legend(10+)","Legend(10+)","Legend(10+)","Legend(10+)","Legend(10+)"))nba.players$Experience <- Experience_grade$Experience_g[match(nba.players$Yrs,Experience_grade$Years)]
head(nba.players$Experience)## [1] Some Experience(4-5) Little Experience(1-3) Little Experience(1-3)
## [4] Some Experience(4-5) Little Experience(1-3) Some Experience(4-5)
## 6 Levels: Experienced(6-7) Legend(10+) ... Very Experienced(8-9)
Relevant summary statistics
Provide summary statistics relevant to your research question. For example, if you’re comparing means across groups provide means, SDs, sample sizes of each group. This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.
describe(nba.players$Yrs)## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 1234 5.59 4.76 4 5.12 4.45 0 21 21 0.72 -0.36
## se
## X1 0.14
describe(nba.players$VORP)## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 1234 3.96 10.8 0 1.55 1.19 -8.3 116.6 124.9 4.24 25.14
## se
## X1 0.31
table(nba.players$Experience, useNA = 'ifany')##
## Experienced(6-7) Legend(10+) Little Experience(1-3)
## 130 291 342
## Some Experience(4-5) Unsuccessful(0) Very Experienced(8-9)
## 178 181 112
describeBy(nba.players$Yrs,
group = nba.players$Experience, mat=TRUE)## item group1 vars n mean sd median
## X11 1 Experienced(6-7) 1 130 6.423077 0.4959586 6
## X12 2 Legend(10+) 1 291 12.608247 2.4978199 12
## X13 3 Little Experience(1-3) 1 342 1.941520 0.7863021 2
## X14 4 Some Experience(4-5) 1 178 4.370787 0.4843779 4
## X15 5 Unsuccessful(0) 1 181 0.000000 0.0000000 0
## X16 6 Very Experienced(8-9) 1 112 8.464286 0.5009643 8
## trimmed mad min max range skew kurtosis se
## X11 6.403846 0.0000 6 7 1 0.3078134 -1.9198419 0.04349845
## X12 12.248927 2.9652 10 21 11 1.0244646 0.4998887 0.14642479
## X13 1.927007 1.4826 1 3 2 0.1025859 -1.3809722 0.04251835
## X14 4.340278 0.0000 4 5 1 0.5305261 -1.7281563 0.03630566
## X15 0.000000 0.0000 0 0 0 NaN NaN 0.00000000
## X16 8.455556 0.0000 8 9 1 0.1413091 -1.9976293 0.04733668
describeBy(nba.players$VORP,
group = nba.players$Experience, mat=TRUE)## item group1 vars n mean sd median
## X11 1 Experienced(6-7) 1 130 2.3538462 5.4015380 0.65
## X12 2 Legend(10+) 1 291 13.9927835 17.5389198 9.20
## X13 3 Little Experience(1-3) 1 342 -0.3114035 0.7378846 -0.20
## X14 4 Some Experience(4-5) 1 178 0.1494382 2.4126373 -0.40
## X15 5 Unsuccessful(0) 1 181 0.0000000 0.0000000 0.00
## X16 6 Very Experienced(8-9) 1 112 5.2303571 8.5144587 2.50
## trimmed mad min max range skew kurtosis se
## X11 1.3778846 3.18759 -4.1 25.4 29.5 1.843642 3.494965 0.47374622
## X12 10.9828326 10.97124 -7.1 116.6 123.7 2.215044 6.765234 1.02814968
## X13 -0.2908759 0.29652 -2.9 8.6 11.5 5.089910 63.413006 0.03990023
## X14 -0.1902778 1.03782 -4.1 12.6 16.7 2.215084 7.409229 0.18083481
## X15 0.0000000 0.00000 0.0 0.0 0.0 NaN NaN 0.00000000
## X16 3.9366667 5.63388 -8.3 42.3 50.6 1.916635 5.015057 0.80454072
Ploting the Data
ggplot(nba.players, aes(x=nba.players$Yrs)) + geom_histogram()## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(nba.players, aes(x=nba.players$VORP)) + geom_histogram()## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
##Models ###Predicting years in the league by draft location.
Visualizing the relationship between draft pick position and years in the NBA.
ggplot(data = nba.players, mapping = aes(x=nba.players$Yrs, y= nba.players$Pk))+
geom_point() +
ggtitle("Relationship between Draft Pick and Years in the League")cor(nba.players$Yrs,nba.players$Pk)## [1] -0.5552688
Yrs_P<-lm(Yrs ~ Pk, data = nba.players)
summary(Yrs_P)##
## Call:
## lm(formula = Yrs ~ Pk, data = nba.players)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.925 -2.815 -1.083 2.496 14.627
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.236187 0.228203 44.86 <2e-16 ***
## Pk -0.155503 0.006636 -23.43 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.961 on 1232 degrees of freedom
## Multiple R-squared: 0.3083, Adjusted R-squared: 0.3078
## F-statistic: 549.2 on 1 and 1232 DF, p-value: < 2.2e-16
The correlation tells the model to calculate a players amount of years in the league by draft spot, is 10.236 plus the multiplication of -0.1555 and draft spot.
Years in NBA = -0.155503*(pick) + 10.236
#In 2017, Lonzo Ball was drafted numeber 2 and based on this model, he will last 9.93 years in the league. Josh Hart was drafted number 30 and based on the model, he will last 5.57 years in the league.
(-0.155503*2)+10.236187## [1] 9.925181
(-0.155503*30)+10.236187## [1] 5.571097
The linear model above.,gives us a R squared of 30.8%. This tells us that 30.8% of variability in years in the NBA is explained by draft pick location.
Prediction and prediction errors
Lets check the conditions of the linear model.
plot(nba.players$Yrs ~ nba.players$Pk)
abline(Yrs_P)Model Assessment: 1. linearity 2. nearly normal residuals 3.constant variability.
plot(Yrs_P$residuals ~ nba.players$Pk)
abline(h = 0, lty = 3)hist(Yrs_P$residuals)qqnorm(Yrs_P$residuals)
qqline(Yrs_P$residuals)Based on the Model Assessment, the linear model looks good.
Predicting VORP with draft location.
ggplot(data = nba.players, mapping = aes(x=nba.players$VORP, y= nba.players$Pk))+
geom_point()+
ggtitle("Relationship between Draft Pick and Player's VORP")cor(nba.players$VORP,nba.players$Pk)## [1] -0.3465931
VORP_P<-lm(VORP ~ Pk, data = nba.players)
summary(VORP_P)##
## Call:
## lm(formula = VORP ~ Pk, data = nba.players)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.619 -4.931 -1.514 1.567 106.281
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.53909 0.58368 18.06 <2e-16 ***
## Pk -0.22012 0.01697 -12.97 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.13 on 1232 degrees of freedom
## Multiple R-squared: 0.1201, Adjusted R-squared: 0.1194
## F-statistic: 168.2 on 1 and 1232 DF, p-value: < 2.2e-16
The correlation tells the model to calculate a player’s VORP by draft spot, is 10.539 plus the multiplication of -0.22012 and draft spot.
Years in NBA = -0.22012*(pick) + 10.539
#In 2017, Lonzo Ball was drafted numeber 2 and based on this model, his VORP score in the league will be 10.1. Josh Hart was drafted number 30 and based on the model, his VORP score in the league will be 3.94.
(-0.22012 *2)+ 10.53909## [1] 10.09885
(-0.22012 *30) +10.53909## [1] 3.93549
The R^2 in the linear model is 12%, which means that 12% of variability in VORP is predicted by draft pick.
12% is very low, which means draft pick locations aren’t very good for predicting VORP. Let’s check if the model is skewed.
nba.players %>% select(Pk, Player, PTS, VORP, Experience) %>%
filter(VORP > 60 | VORP < -5)## Pk Player PTS VORP Experience
## 1 16 Nick Young 7643 -5.1 Legend(10+)
## 2 4 Chris Paul 15614 71.7 Legend(10+)
## 3 13 Sebastian Telfair 4183 -5.8 Legend(10+)
## 4 1 LeBron James 28959 116.6 Legend(10+)
## 5 41 Willie Green 6059 -5.9 Legend(10+)
## 6 1 Michael Olowokandi 4135 -8.3 Very Experienced(8-9)
## 7 9 Dirk Nowitzki 30331 65.8 Legend(10+)
## 8 10 Paul Pierce 26397 61.5 Legend(10+)
## 9 1 Tim Duncan 26496 89.3 Legend(10+)
## 10 14 Maurice Taylor 5889 -5.9 Very Experienced(8-9)
## 11 27 Jacque Vaughn 3463 -7.1 Legend(10+)
## 12 13 Kobe Bryant 33643 72.1 Legend(10+)
## 13 5 Kevin Garnett 26071 94.0 Legend(10+)
## 14 2 Jason Kidd 17529 78.2 Legend(10+)
We see that only 8 players have a VORP score 60 and 6 players have a score of -5.
Lets transform VORP using log +1 transformation to deal with the skew.
nba.players$LOGVORP <- log(nba.players$VORP +1)## Warning in log(nba.players$VORP + 1): NaNs produced
nba.players$LOGVORP[which(is.infinite(nba.players$LOGVORP))] =0
nba.players$LOGVORP[which(is.nan(nba.players$LOGVORP))] =0
LOGVORP_P <- lm(LOGVORP ~ Pk, data = nba.players)
summary(LOGVORP_P)##
## Call:
## lm(formula = LOGVORP ~ Pk, data = nba.players)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.8109 -0.8082 -0.1140 0.6161 4.1415
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.540108 0.070767 21.76 <2e-16 ***
## Pk -0.031835 0.002058 -15.47 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.228 on 1232 degrees of freedom
## Multiple R-squared: 0.1627, Adjusted R-squared: 0.162
## F-statistic: 239.4 on 1 and 1232 DF, p-value: < 2.2e-16
VORP with a log + 1 transformation did not make a significant change.
plot(LOGVORP_P$residuals ~ nba.players$Pk)
abline(h = 0, lty = 3)hist(LOGVORP_P$residuals)qqnorm(LOGVORP_P$residuals)
qqline(LOGVORP_P$residuals)Since VORP is not a good predictor, lets focus on years in the league to score success in the league.
ggplot(nba.players, aes(x=Experience, y=Pk)) +
geom_boxplot(fill = 'yellow') +
ggtitle("Players Draft Picks & Experience")+
coord_flip() Based on the categories, lets calculate the odds of each pick falling into each category.
nba_counts <- nba.players %>% group_by(Pk) %>% count(Experience)
nba_draft <- nba.players %>% group_by(Pk) %>% count(Pk)
Prob <- left_join(nba_counts,nba_draft, by = "Pk")
colnames(Prob) <- c("Pk","Experience","Experience_count","Pick_count")
probability_nba <- mutate(Prob, probability = Experience_count/Pick_count)
probability_nba## # A tibble: 303 x 5
## # Groups: Pk [60]
## Pk Experience Experience_count Pick_count probability
## <dbl> <fctr> <int> <int> <dbl>
## 1 1 Experienced(6-7) 2 21 0.09523810
## 2 1 Legend(10+) 11 21 0.52380952
## 3 1 Little Experience(1-3) 1 21 0.04761905
## 4 1 Some Experience(4-5) 2 21 0.09523810
## 5 1 Very Experienced(8-9) 5 21 0.23809524
## 6 2 Experienced(6-7) 2 21 0.09523810
## 7 2 Legend(10+) 10 21 0.47619048
## 8 2 Little Experience(1-3) 2 21 0.09523810
## 9 2 Some Experience(4-5) 2 21 0.09523810
## 10 2 Very Experienced(8-9) 5 21 0.23809524
## # ... with 293 more rows
probability_nba_spread <- probability_nba %>% select(Pk, Experience, probability) %>% spread(Experience, probability)
probability_nba_spread[is.na(probability_nba_spread)] = 0
probability_nba_spread <-select(probability_nba_spread,'Pk','Unsuccessful(0)','Little Experience(1-3)','Some Experience(4-5)','Experienced(6-7)','Very Experienced(8-9)','Legend(10+)')
probability_nba_spread[sample(nrow(probability_nba_spread), 10), ]## # A tibble: 10 x 7
## # Groups: Pk [10]
## Pk `Unsuccessful(0)` `Little Experience(1-3)` `Some Experience(4-5)`
## <dbl> <dbl> <dbl> <dbl>
## 1 27 0.0000000 0.2380952 0.1904762
## 2 22 0.0000000 0.3809524 0.3333333
## 3 49 0.3333333 0.3333333 0.0952381
## 4 55 0.4000000 0.3500000 0.0500000
## 5 56 0.4000000 0.4000000 0.0000000
## 6 48 0.2857143 0.4761905 0.1904762
## 7 46 0.2857143 0.4285714 0.0952381
## 8 2 0.0000000 0.0952381 0.0952381
## 9 44 0.4285714 0.2380952 0.1428571
## 10 21 0.0000000 0.2857143 0.1428571
## # ... with 3 more variables: `Experienced(6-7)` <dbl>, `Very
## # Experienced(8-9)` <dbl>, `Legend(10+)` <dbl>
Visualizing the predictions based on the categories.
par(mfrow=c(3,2))
plot(probability_nba_spread$`Unsuccessful(0)`,main="Draft Pick Spot of Players That Never Played")
plot(probability_nba_spread$`Little Experience(1-3)`,main="Draft Pick Spots of Players with 1-3 Yrs Experience")
plot(probability_nba_spread$`Some Experience(4-5)`,main="Draft Pick Spots of Players with 4-5 Yrs Experience")
plot(probability_nba_spread$`Experienced(6-7)`,main="Draft Pick Spots of Players with 6-7 Yrs Experience")
plot(probability_nba_spread$`Very Experienced(8-9)`,main="Draft Pick Spots of Players with 8-9 Yrs Experience")
plot(probability_nba_spread$`Legend(10+)`,main="Draft Pick Spots of Players with Over 10 Yrs Experience") As, expected, there ins an inverse relationship, between players that never played and players that played 10 or more years. There isn’t much of a relationship between players that played 4-7.
We can see the overall trend below.
plot(nba.players %>%
select(Pk, Yrs) %>%
group_by(Pk) %>%
summarise(mean = mean(Yrs)), main ="Average NBA Career span(Years) based on Draft Spot")2nd Rounders.
Below is the visualization of average second rounders by draft pick(picks 31-60).
plot(nba.players %>%
select(Pk, Yrs) %>%
filter(Pk >=31) %>%
group_by(Pk) %>%
summarise(mean = mean(Yrs)), main ="Average NBA Career span(Years) of Second Rounders") Visualizing only second rounders, the model is not linear as all picks. Let’s check which draft spots has been most successful.
probability_nba %>%
select(Pk, Experience, probability) %>%
filter(Pk >=31 & Experience == 'Legend(10+)' & probability >0.15)## # A tibble: 4 x 3
## # Groups: Pk [4]
## Pk Experience probability
## <dbl> <fctr> <dbl>
## 1 35 Legend(10+) 0.1904762
## 2 37 Legend(10+) 0.2857143
## 3 42 Legend(10+) 0.1904762
## 4 43 Legend(10+) 0.1904762
Based on the data, a player has a 28.5% chance of playing 10 or more years if they are selected 37th overall. Players selected, 35,42, and 43 have a 19% chance of playing 10 or more years in the league.
A lot of second rounders do not play in the league, as their contracts aren’t guaranteed. Which spots had the highest chances of playing any years in the league?
probability_nba %>%
select(Pk, Experience, probability) %>%
filter(Pk >=31 & Experience !='Unsuccessful(0)' & probability >0.50)## # A tibble: 5 x 3
## # Groups: Pk [5]
## Pk Experience probability
## <dbl> <fctr> <dbl>
## 1 32 Little Experience(1-3) 0.6666667
## 2 33 Little Experience(1-3) 0.5238095
## 3 38 Little Experience(1-3) 0.5238095
## 4 40 Little Experience(1-3) 0.5238095
## 5 41 Little Experience(1-3) 0.5238095
there was a 67% chance of playing in the league if you are selected 32nd and 52% chance of playing in the league if you are selected 33rd, 38th, 40th, or 41st. All surprisingly more successful than the 31st pick.
Are other NBA statistics a better predictor for Experience?
From our initial data, I tested the other stats to check if they were better predictors of success.
Exp_all <- lm(Yrs ~ PTS + TRB + AST +FG. +X3P.+ FT., data = nba.players)
summary(Exp_all)##
## Call:
## lm(formula = Yrs ~ PTS + TRB + AST + FG. + X3P. + FT., data = nba.players)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.103 -1.554 -0.321 1.431 7.899
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.401e-01 7.630e-01 0.184 0.854
## PTS 2.488e-04 4.273e-05 5.823 7.94e-09 ***
## TRB 9.606e-04 7.924e-05 12.124 < 2e-16 ***
## AST 6.672e-04 9.980e-05 6.685 3.95e-11 ***
## FG. 6.207e+00 1.258e+00 4.935 9.45e-07 ***
## X3P. 7.245e-01 5.987e-01 1.210 0.227
## FT. 1.128e+00 7.346e-01 1.535 0.125
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.264 on 944 degrees of freedom
## (283 observations deleted due to missingness)
## Multiple R-squared: 0.7433, Adjusted R-squared: 0.7417
## F-statistic: 455.6 on 6 and 944 DF, p-value: < 2.2e-16
From the data we can see that Pts, Rebounds, Assists and FG% is correlated with years in the NBA. This is unsurprising, considering that points, rebounds, assists, and field goal percentages are the most crucial statistics in the NBA.
Conclusion
To predict how successful an NBA player is, I modeled two numerical values, years in the league vs draft pick location and VORP score and draft pick location. Based on our linear model, draft pick location vs VORP had an R^2 of 12% and was heavily skewed. The model for draft pick location and years in the league had a strong R^2 of 30%.
Since, years in the league had a stronger correlation, I decided to categorize the amount of years in the league into the 6 categories below. 1.Legend - 10 or more years 2.Very Experienced - 8-9 Years 3.Experienced - 6-7 Years 4.Some Experience - 4-5 Years 5.Little Experience - 1-3 Years 6-Unsuccessful - 0 years
Using the categories above, I was able to find the probability of each location falling into the category. As expected there was an inverse relationship between players that played 10 years in the league vs players that played 0 years in the league. The probabilities of players that played between 4-7 years in the league were pretty evenly selected in the draft. Based on the probability scores above, I decided to look into second rounders. There was a surprising statistic that the 37th pick overall had a 28% chance of playing 10 years or more, which is significantly higher than any other picks. Picks 35, 42, and 43 were second highest with 19% chance of playing 10 or more years in the league.
Since playing in the league isn’t always guaranteed, I wanted to see which location had the highest chance or playing in the league at all. It was the 32nd pick that had a chance of 66% of playing 1-3 years in the league. Also, to stay in the league, other statistics are always important. Based on the overall statistics of the players from 1994 to 2014, points, assists, rebounds and field goal percentage had the highest correlation to the number of years in the league.
Overall, this if a player was known to be going into the second round, they have a 66% chance of playing the league if they are drafted 32nd and a 28% chance to play in the league more than 10 years if they are 37th overall. Since our model only has a 30% R^2, its important to consider other statistics such as points, assists, rebounds and field goal percentage.