Each year, one NBA rookie rises to the top of their class and earns the Rookie of the Year award. Many of the game's greatest players have achieved this feat, while other winners have gone on to lackluster careers.
The race for this award is always interesting, but to what degree can it be predicted? With the help of machine learning, I will estimate the probability that each of the top 5 rookies in this year's class wins Rookie of the Year, based on their statistical averages for the season.
To do this, data on previous Rookie of the Year winners and candidates is analyzed and used to produce probabilities for this season's rookie class.
Logistic regression and K-Nearest Neighbors classifiers are used to predict the winner.
A Shiny app was developed to show the probability of winning the award for different statistical averages, using the best-fitting model.
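The app itself is not reproduced here, but a minimal sketch of the idea could look like the following. It assumes the best-fitting logistic model developed later in this analysis (royLog4) has been saved to disk; the file name royLog4.rds is hypothetical.

```r
library(shiny)

# Hypothetical path to the fitted logistic model built later in this analysis
royLog4 <- readRDS("royLog4.rds")

ui <- fluidPage(
  titlePanel("Rookie of the Year Probability"),
  sidebarLayout(
    sidebarPanel(
      sliderInput("pts", "Points per game", min = 0, max = 35, value = 15),
      sliderInput("trb", "Rebounds per game", min = 0, max = 15, value = 5),
      sliderInput("ast", "Assists per game", min = 0, max = 12, value = 3)
    ),
    mainPanel(textOutput("prob"))
  )
)

server <- function(input, output) {
  output$prob <- renderText({
    # Predict the probability of winning from the slider values
    newdata <- data.frame(PTS = input$pts, TRB = input$trb, AST = input$ast)
    p <- predict(royLog4, newdata = newdata, type = "response")
    paste0("Estimated probability of winning: ", round(100 * p, 1), "%")
  })
}

shinyApp(ui, server)
```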
library(tidyverse)
library(Zelig)
library(readr)
library(texreg)
library(formattable)
library(class)
library(caret)
# Read every Rookie of the Year voting CSV in the working directory and combine them
roy_raw = list.files(pattern = '\\.csv$')
roy = lapply(roy_raw, read_csv) %>%
bind_rows() %>%
# Flag the award winner: Rank 1 in the voting receives the award
mutate(winner = as.integer(ifelse(Rank %in% c('1'),1,0))) %>%
filter(!is.na(Player))
formattable(head(roy))
| Rank | Player | Age | Tm | First | Pts Won | Pts Max | Share | G | MP | PTS | TRB | AST | STL | BLK | FG% | 3P% | FT% | WS | WS/48 | Year | winner |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Mike Miller | 20 | ORL | 75 | 75 | 124 | 0.605 | 82 | 29.1 | 11.9 | 4.0 | 1.7 | 0.6 | 0.2 | 0.436 | 0.407 | 0.711 | 5.3 | 0.106 | NA | 1 |
| 2 | Kenyon Martin | 23 | NJN | 36 | 36 | 124 | 0.290 | 68 | 33.4 | 12.0 | 7.4 | 1.9 | 1.1 | 1.7 | 0.445 | 0.091 | 0.630 | 2.1 | 0.044 | NA | 0 |
| 3 | Marc Jackson | 26 | GSW | 7 | 7 | 124 | 0.056 | 48 | 29.4 | 13.2 | 7.5 | 1.2 | 0.7 | 0.6 | 0.467 | 0.217 | 0.802 | 2.6 | 0.089 | NA | 0 |
| 4 | Darius Miles | 19 | LAC | 3 | 3 | 124 | 0.024 | 81 | 26.3 | 9.4 | 5.9 | 1.2 | 0.6 | 1.5 | 0.505 | 0.053 | 0.521 | 3.0 | 0.068 | NA | 0 |
| 4 | Morris Peterson | 23 | TOR | 3 | 3 | 124 | 0.024 | 80 | 22.6 | 9.3 | 3.2 | 1.3 | 0.8 | 0.3 | 0.431 | 0.382 | 0.717 | 3.6 | 0.096 | NA | 0 |
| 1 | Pau Gasol | 21 | MEM | 117 | 117 | 126 | 0.929 | 82 | 36.7 | 17.6 | 8.9 | 2.7 | 0.5 | 2.1 | 0.518 | 0.200 | 0.709 | 7.6 | 0.121 | NA | 1 |
Using the 'nbastatR' package, data is obtained directly from Basketball-Reference.com, which provides player-level data for the NBA. The top 5 Rookie of the Year candidates are selected along with their statistical averages for the season (points, assists, rebounds, blocks, steals). As their stats change throughout the season, the change in each player's probability of winning the award will be reflected in the model.
To begin the analysis, the top Rookie of the Year candidates from the 1988 through 2018 seasons were selected. A binary indicator specifies the winner of the award from each class: 1 for winning, 0 for not.
A view of the data is shown above. The variables of interest are the players' per-game averages: points, rebounds, assists, steals, and blocks. The Rookie of the Year award depends almost entirely on individual performance throughout the season.
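As a quick check on that claim, the historical per-game averages of winners can be compared with those of the other candidates. This is a small sketch using the roy data frame built above, with column names taken from the table shown earlier:

```r
# Average per-game numbers for award winners (winner = 1) versus other candidates (winner = 0)
roy %>%
  group_by(winner) %>%
  summarise(across(c(PTS, TRB, AST, STL, BLK), ~ mean(.x, na.rm = TRUE)))
```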
library(caTools)
set.seed(88)
split = sample.split(roy$winner, SplitRatio = .9)
royTrain = subset(roy, split == TRUE)
royTest = subset(roy, split == FALSE)
A training set and a testing set are created from the original dataset, using a 90/10 train-to-test split.
royLog1 <- glm(winner ~ PTS, family = binomial, data = royTrain)
royLog2 <- glm(winner ~ PTS + TRB, family = binomial, data = royTrain)
royLog3 <- glm(winner ~ PTS + AST, family = binomial, data = royTrain)
royLog4 <- glm(winner ~ PTS + TRB + AST, family = binomial, data = royTrain)
royLog5 <- glm(winner ~ PTS + TRB + AST + STL, family = binomial, data = royTrain)
royLog6 <- glm(winner ~ PTS + TRB + AST + STL + BLK, family = binomial, data = royTrain)
htmlreg(list(royLog1,royLog2, royLog3, royLog4, royLog5, royLog6), doctype = FALSE)
| | Model 1 | Model 2 | Model 3 | Model 4 | Model 5 | Model 6 |
|---|---|---|---|---|---|---|
| (Intercept) | -8.86*** | -9.17*** | -9.65*** | -10.92*** | -10.37*** | -10.41*** |
| | (1.56) | (1.64) | (1.76) | (2.01) | (2.03) | (2.05) |
| PTS | 0.48*** | 0.47*** | 0.45*** | 0.38*** | 0.41*** | 0.42*** |
| | (0.09) | (0.09) | (0.10) | (0.10) | (0.11) | (0.11) |
| TRB | | 0.09 | | 0.31 | 0.31 | 0.34 |
| | | (0.13) | | (0.17) | (0.17) | (0.22) |
| AST | | | 0.33* | 0.50** | 0.65** | 0.65** |
| | | | (0.15) | (0.18) | (0.22) | (0.22) |
| STL | | | | | -1.46 | -1.51 |
| | | | | | (1.07) | (1.11) |
| BLK | | | | | | -0.14 |
| | | | | | | (0.83) |
| AIC | 80.26 | 81.73 | 77.19 | 75.31 | 75.24 | 77.21 |
| BIC | 86.28 | 90.76 | 86.22 | 87.35 | 90.29 | 95.27 |
| Log Likelihood | -38.13 | -37.87 | -35.60 | -33.65 | -32.62 | -32.60 |
| Deviance | 76.26 | 75.73 | 71.19 | 67.31 | 65.24 | 65.21 |
| Num. obs. | 150 | 150 | 150 | 150 | 150 | 150 |
| ***p < 0.001; **p < 0.01; *p < 0.05 | | | | | | |
Historically, the Rookie of the Year award depends on the statistical averages of the player. Team-based metrics and records can't be taken into consideration because the best rookies are generally drafted by teams coming off losing seasons. For an award like the MVP, by contrast, team success and individual statistics both weigh heavily.
First, several logistic regression models are fit to find the best specification for the data; in this case that is Model 4 (points, rebounds, and assists), with TRB significant at the 0.1 level (p between .05 and .1).
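For reference, the same AIC comparison can be pulled out programmatically (a small sketch; the values match the table above):

```r
# Compare the six fitted models by AIC; lower values indicate a better fit/complexity trade-off
data.frame(
  model = paste0("royLog", 1:6),
  AIC = sapply(list(royLog1, royLog2, royLog3, royLog4, royLog5, royLog6), AIC)
)
```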
# Predicted probabilities for the held-out test set
predictTest = predict(royLog4, type='response', newdata=royTest)
# Classify a player as a predicted winner when the modeled probability is at least 0.2
tableTest <- table(royTest$winner, predictTest >= .2)
fourfoldplot(tableTest)
There are 12 cases where a player was correctly labeled as not winning the award, 2 false positives where a player was labeled as winning when he did not, and 2 cases where the player was correctly labeled as winning Rookie of the Year, for an overall accuracy of 87.5%.
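That accuracy can also be computed directly from the confusion table built above (a small sketch):

```r
# Correct classifications (diagonal of the confusion table) divided by total test cases
sum(diag(tableTest)) / sum(tableTest)
```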
library(ggthemes)
library(reshape2)
library(cowplot)
library(ggpubr)
oldroy1 <- read.csv('9596_roy.csv')
oldroy2 <- read.csv('9697_roy.csv')
oldroy3 <- read.csv('9798_roy.csv')
oldroyComb <- rbind(oldroy1,oldroy2,oldroy3) %>%
filter(!is.na(Rank)) %>%
mutate(Winner = as.integer(ifelse(Rank %in% c('1'),1,0)))
oldroyComb <- subset(oldroyComb, select = c('Player','PTS','AST','TRB','BLK','STL','Winner','Rank', 'Year'))
predictTest3 = predict(royLog4, type='response', newdata=oldroyComb)
oldroyComb$Predicted <- predictTest3
oldroyComb <- oldroyComb %>%
filter(Rank <= 2)
oldroyComb$Predicted <- round(oldroyComb$Predicted, digits = 2)
oldRoy96 <- oldroyComb %>%
filter(Year == 96)
oldRoy97 <- oldroyComb %>%
filter(Year == 97)
oldRoy98 <- oldroyComb %>%
filter(Year == 98)
center_title <- theme(plot.title = element_text(hjust=.5))
plot96 <- ggplot(oldRoy96, aes(Player, Predicted, fill = factor(Winner))) +
  geom_bar(stat = 'identity') +
  theme_minimal() +
  scale_fill_fivethirtyeight(name = 'Rookie of the Year', labels = c('Runner Up', 'Winner')) +
  labs(x = NULL, y = NULL, title = '1995-96\n') +
  geom_text(aes(label = Predicted), vjust = 1.5, color = 'white', fontface = 'bold') +
  center_title +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_y_continuous(limits = c(0, 1))
plot97 <- ggplot(oldRoy97, aes(Player, Predicted, fill = factor(Winner))) +
  geom_bar(stat = 'identity') +
  theme_minimal() +
  scale_fill_fivethirtyeight(name = 'Rookie of the Year', labels = c('Runner Up', 'Winner')) +
  labs(x = NULL, y = NULL, title = '1996-97') +
  geom_text(aes(label = Predicted), vjust = 1.5, color = 'white', fontface = 'bold') +
  center_title +
  scale_x_discrete(limits = c('Stephon Marbury', 'Allen Iverson')) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
plot98 <- ggplot(oldRoy98, aes(Player, Predicted, fill = factor(Winner))) +
  geom_bar(stat = 'identity') +
  theme_minimal() +
  scale_fill_fivethirtyeight(name = 'Rookie of the Year', labels = c('Runner Up', 'Winner')) +
  labs(x = NULL, y = NULL, title = '1997-98') +
  geom_text(aes(label = Predicted), vjust = 1.5, color = 'white', fontface = 'bold') +
  center_title +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_y_continuous(limits = c(0, 1))
ggarrange(plot96, plot97, plot98, ncol=3, common.legend = TRUE, legend = 'top')
To validate the model, Rookie of the Year candidates were selected from the 1995-96 through 1997-98 seasons, data outside the testing set.
The winner and runner-up from each season are evaluated as individual cases.
The award winner for each year is shown next to the runner-up to highlight the model's accuracy in picking the winner. The model gave the actual winner a probability of 90% or higher in each season.
library(nbastatR)
library(dplyr)
library(plotly)
rookies19 <- bref_players_stats(seasons = 2019, tables = "per_game", assign_to_environment = FALSE, only_totals = TRUE, nest_data = FALSE, join_data = TRUE, return_message = FALSE) %>%
filter(namePlayer %in% c('Luka Doncic','Kevin Knox','Collin Sexton','Deandre Ayton','Trae Young')) %>%
rename(TRB = trbPerGame, AST = astPerGame, STL = stlPerGame, BLK = blkPerGame, PTS = ptsPerGame, Player = namePlayer) %>%
select(Player, PTS, AST, TRB, STL, BLK)
formattable(rookies19, align=c('l'), list(PTS = color_tile("white","orange"), AST=color_tile("white","orange"),TRB= color_tile("white","orange"),STL = color_tile("white","orange"),BLK = color_tile("white","orange")))
| Player | PTS | AST | TRB | STL | BLK |
|---|---|---|---|---|---|
| Collin Sexton | 15.0 | 2.9 | 3.1 | 0.6 | 0.0 |
| Deandre Ayton | 16.4 | 1.9 | 10.5 | 0.8 | 0.9 |
| Kevin Knox | 12.6 | 1.0 | 4.4 | 0.5 | 0.3 |
| Luka Doncic | 20.9 | 5.7 | 7.3 | 1.0 | 0.3 |
| Trae Young | 17.8 | 7.7 | 3.3 | 0.8 | 0.2 |
Looking at the statistics for the top candidates, Luka Doncic ranks first in points, second in assists, second in rebounds, and first in steals; on these numbers he comes across as the leading candidate. Deandre Ayton, however, ranks third in points while leading the class in rebounds and blocks.
predictTest2 = predict(royLog4 , type='response', newdata=rookies19)
rookies19$percentage <- predictTest2
rookies19$percentage <- round(rookies19$percentage, digits = 2)
ggplot(rookies19, aes(Player, percentage, fill = as.factor(percentage))) +
  geom_bar(stat = 'identity', show.legend = FALSE) +
  scale_fill_manual(values = c('#f58426','#6f263d','#e03a3e','#1d1160','#00538c'), name = 'Calculated Probabilities') +
  theme_minimal() +
  scale_x_discrete(limits = c('Luka Doncic','Trae Young','Deandre Ayton','Collin Sexton','Kevin Knox')) +
  geom_text(aes(label = percentage), vjust = -.25, color = 'black', fontface = 'bold', position = position_dodge(width = .9)) +
  labs(x = '\nPlayer', y = 'Probability\n')
Based on the model, Luka Doncic has the highest probability of winning Rookie of the Year, followed by Trae Young and then Deandre Ayton. Even though Collin Sexton and Kevin Knox are considered top-5 candidates, their chances of winning are minimal.
# k-NN using PTS, TRB, and AST (columns 11:13), with the winner flag as the class label
royNN <- knn(train = royTrain[,11:13], test = royTest[,11:13], cl = royTrain[,22, drop = TRUE], k = 5)
tableTest2 <- table(royTest$winner, royNN)
fourfoldplot(tableTest2)
Next, a K-Nearest Neighbors model is applied to the dataset. Using points, rebounds, and assists as features, the confusion matrix above is obtained: 13 observations are correctly labeled as not winning the award, 1 observation is correctly labeled as the winner, 1 is incorrectly labeled as the winner, and 1 is incorrectly labeled as not winning.
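The choice of k = 5 is carried through the rest of the analysis. As a sketch of one way to sanity-check that choice, caret (loaded earlier) can tune k by cross-validation; the centering/scaling step and the grid of k values here are assumptions, not part of the original workflow.

```r
# Tune k by 10-fold cross-validation on the training set (a sketch, not the original workflow)
knnTune <- train(
  winner ~ PTS + TRB + AST,
  data = mutate(royTrain, winner = factor(winner)),   # caret needs a factor outcome for classification
  method = "knn",
  preProcess = c("center", "scale"),                  # k-NN is distance-based, so scale the features
  trControl = trainControl(method = "cv", number = 10),
  tuneGrid = data.frame(k = seq(3, 15, 2))
)
knnTune$bestTune
```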
royTable <- data.frame(select(royTest, Player, winner), royNN)
colnames(royTable) <- c('Player','Observed', 'Predicted')
# Keep only the misclassified test cases
royTable <- filter(royTable, Observed != Predicted)
formattable(royTable)
| Player | Observed | Predicted |
|---|---|---|
| Emeka Okafor | 1 | 0 |
| John Wall | 0 | 1 |
oldroyCombNN <- rbind(oldroy1,oldroy2,oldroy3) %>%
filter(!is.na(Rank)) %>%
mutate(winner = as.integer(ifelse(Rank %in% c('1'),1,0)))
oldroyCombNN <- subset(oldroyCombNN, select = c('Player','PTS','AST','TRB','BLK','STL','winner','Rank', 'Year'))
oldroyCombNN <- oldroyCombNN %>%
filter(Rank <= 2)
oldroyNN <- knn(train = royTrain[,11:13], test = select(oldroyCombNN, PTS,TRB,AST), cl = royTrain[,22, drop = TRUE], k = 5)
oldroyTable <- data.frame(select(oldroyCombNN, Player, Year, winner), oldroyNN)
colnames(oldroyTable) <- c('Player', 'Year', 'Observed', 'Predicted')
formattable(oldroyTable)
| Player | Year | Observed | Predicted |
|---|---|---|---|
| Damon Stoudamire | 96 | 1 | 1 |
| Arvydas Sabonis | 96 | 0 | 0 |
| Stephon Marbury | 97 | 0 | 0 |
| Allen Iverson | 97 | 1 | 1 |
| Tim Duncan | 98 | 1 | 1 |
| Keith Van Horn | 98 | 0 | 0 |
The k-NN model correctly predicts the previous Rookie of the Year winner in each case from 3 separate years, just as the logistic model did.
rookie19NN <- knn(train = royTrain[,11:13], test = select(rookies19, PTS,TRB,AST), cl = royTrain[,22, drop = TRUE], k = 5)
rookietable <- data.frame(select(rookies19, Player), rookie19NN)
colnames(rookietable) <- c('Player','Prediction')
formattable(rookietable)
| Player | Prediction |
|---|---|
| Collin Sexton | 0 |
| Deandre Ayton | 0 |
| Kevin Knox | 0 |
| Luka Doncic | 1 |
| Trae Young | 1 |
When running the model on the current 2019 rookie class, Luka Doncic and Trae Young are each predicted to win the award when evaluated as individual cases. Based on their statistical averages, both are classified as winners.
Comparing the results of the logistic and k-NN models, the predictions are nearly identical. Both models correctly predicted the previous winners, and both predicted the same rookie to win the award this year.
The exception is that the k-NN model classifies two players as winners. Trae Young's relatively high points and assists per game lead the model to label him a winner, and Luka Doncic's high scoring paired with a relatively high assist average does the same for him.
The logistic model gives Luka Doncic the highest probability of winning the award at 90%, with Trae Young at a still-strong probability of roughly 70%.
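For a side-by-side view, both models' predictions for the 2019 class can be combined into one table. This is a sketch that assumes the objects built above (rookies19, royLog4, and rookie19NN) are still in memory:

```r
# Combine the logistic probabilities and the k-NN classifications for the 2019 rookies
comparison <- rookies19 %>%
  select(Player) %>%
  mutate(
    LogisticProb = round(predict(royLog4, type = 'response', newdata = rookies19), 2),
    kNNPrediction = rookie19NN
  ) %>%
  arrange(desc(LogisticProb))
formattable(comparison)
```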