Each year, one NBA rookie rises to the top of their class and earns the Rookie of the Year award. Many of the game's greatest players have achieved this feat, while other winners have gone on to lackluster careers.
The race for this award is always interesting, but to what degree can it be predicted? With the help of machine learning, I will estimate the probability that each of the top 5 rookies in this year's class wins Rookie of the Year, based on their statistical averages for the season.
To do this, data on previous Rookie of the Year winners and candidates is analyzed and used to produce probabilities for this season's rookie class.
Logistic regression and K-Nearest Neighbors classifiers are used to predict the winner.
A Shiny app was developed to show the probability of winning the award for different statistical averages, using the best-fitting model.
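The app itself is not reproduced here, but a minimal sketch of the idea could look like the following. It assumes the best-fitting logistic model developed later in this analysis (royLog4) has been saved to disk; the file name royLog4.rds is hypothetical.

```r
library(shiny)

# Hypothetical path to the fitted logistic model built later in this analysis
royLog4 <- readRDS("royLog4.rds")

ui <- fluidPage(
  titlePanel("Rookie of the Year Probability"),
  sidebarLayout(
    sidebarPanel(
      sliderInput("pts", "Points per game", min = 0, max = 35, value = 15),
      sliderInput("trb", "Rebounds per game", min = 0, max = 15, value = 5),
      sliderInput("ast", "Assists per game", min = 0, max = 12, value = 3)
    ),
    mainPanel(textOutput("prob"))
  )
)

server <- function(input, output) {
  output$prob <- renderText({
    # Predict the probability of winning from the slider values
    newdata <- data.frame(PTS = input$pts, TRB = input$trb, AST = input$ast)
    p <- predict(royLog4, newdata = newdata, type = "response")
    paste0("Estimated probability of winning: ", round(100 * p, 1), "%")
  })
}

shinyApp(ui, server)
```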
library(tidyverse)
library(Zelig)
library(readr)
library(texreg)
library(formattable)
library(class)
library(caret)
# Read every Rookie of the Year voting CSV in the working directory and combine them
roy_raw = list.files(pattern = '\\.csv$')
roy = lapply(roy_raw, read_csv) %>%
bind_rows() %>%
# Flag the award winner: Rank 1 in the voting receives the award
mutate(winner = as.integer(ifelse(Rank %in% c('1'),1,0))) %>%
filter(!is.na(Player))
formattable(head(roy))
| Rank | Player | Age | Tm | First | Pts Won | Pts Max | Share | G | MP | PTS | TRB | AST | STL | BLK | FG% | 3P% | FT% | WS | WS/48 | Year | winner |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Mike Miller | 20 | ORL | 75 | 75 | 124 | 0.605 | 82 | 29.1 | 11.9 | 4.0 | 1.7 | 0.6 | 0.2 | 0.436 | 0.407 | 0.711 | 5.3 | 0.106 | NA | 1 |
| 2 | Kenyon Martin | 23 | NJN | 36 | 36 | 124 | 0.290 | 68 | 33.4 | 12.0 | 7.4 | 1.9 | 1.1 | 1.7 | 0.445 | 0.091 | 0.630 | 2.1 | 0.044 | NA | 0 |
| 3 | Marc Jackson | 26 | GSW | 7 | 7 | 124 | 0.056 | 48 | 29.4 | 13.2 | 7.5 | 1.2 | 0.7 | 0.6 | 0.467 | 0.217 | 0.802 | 2.6 | 0.089 | NA | 0 |
| 4 | Darius Miles | 19 | LAC | 3 | 3 | 124 | 0.024 | 81 | 26.3 | 9.4 | 5.9 | 1.2 | 0.6 | 1.5 | 0.505 | 0.053 | 0.521 | 3.0 | 0.068 | NA | 0 |
| 4 | Morris Peterson | 23 | TOR | 3 | 3 | 124 | 0.024 | 80 | 22.6 | 9.3 | 3.2 | 1.3 | 0.8 | 0.3 | 0.431 | 0.382 | 0.717 | 3.6 | 0.096 | NA | 0 |
| 1 | Pau Gasol | 21 | MEM | 117 | 117 | 126 | 0.929 | 82 | 36.7 | 17.6 | 8.9 | 2.7 | 0.5 | 2.1 | 0.518 | 0.200 | 0.709 | 7.6 | 0.121 | NA | 1 |
Using the 'nbastatR' package, data is obtained directly from Basketball-Reference.com, which provides player-level data for the NBA. The top 5 Rookie of the Year candidates are selected along with their statistical averages for the season (points, assists, rebounds, blocks, steals). As their stats change throughout the season, the change in each player's probability of winning the award will be reflected in the model.
To begin the analysis, the top Rookie of the Year candidates from the 1988 through 2018 seasons were selected. A binary indicator specifies the winner of the award from each class: 1 for winning, 0 for not.
A view of the data is shown above. The variables of interest are the players' per-game averages: points, rebounds, assists, steals, and blocks. The Rookie of the Year award depends almost entirely on individual performance throughout the season.
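As a quick check on that claim, the historical per-game averages of winners can be compared with those of the other candidates. This is a small sketch using the roy data frame built above, with column names taken from the table shown earlier:

```r
# Average per-game numbers for award winners (winner = 1) versus other candidates (winner = 0)
roy %>%
  group_by(winner) %>%
  summarise(across(c(PTS, TRB, AST, STL, BLK), ~ mean(.x, na.rm = TRUE)))
```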
library(caTools)
set.seed(88)
split = sample.split(roy$winner, SplitRatio = .9)
royTrain = subset(roy, split == TRUE)
royTest = subset(roy, split == FALSE)
A training set and a testing set are created from the original dataset, using a 90/10 train-to-test split.
royLog1 <- glm(winner ~ PTS, family = binomial, data = royTrain)
royLog2 <- glm(winner ~ PTS + TRB, family = binomial, data = royTrain)
royLog3 <- glm(winner ~ PTS + AST, family = binomial, data = royTrain)
royLog4 <- glm(winner ~ PTS + TRB + AST, family = binomial, data = royTrain)
royLog5 <- glm(winner ~ PTS + TRB + AST + STL, family = binomial, data = royTrain)
royLog6 <- glm(winner ~ PTS + TRB + AST + STL + BLK, family = binomial, data = royTrain)
htmlreg(list(royLog1,royLog2, royLog3, royLog4, royLog5, royLog6), doctype = FALSE)
| | Model 1 | Model 2 | Model 3 | Model 4 | Model 5 | Model 6 |
|---|---|---|---|---|---|---|
| (Intercept) | -8.86*** | -9.17*** | -9.65*** | -10.92*** | -10.37*** | -10.41*** |
| | (1.56) | (1.64) | (1.76) | (2.01) | (2.03) | (2.05) |
| PTS | 0.48*** | 0.47*** | 0.45*** | 0.38*** | 0.41*** | 0.42*** |
| | (0.09) | (0.09) | (0.10) | (0.10) | (0.11) | (0.11) |
| TRB | | 0.09 | | 0.31 | 0.31 | 0.34 |
| | | (0.13) | | (0.17) | (0.17) | (0.22) |
| AST | | | 0.33* | 0.50** | 0.65** | 0.65** |
| | | | (0.15) | (0.18) | (0.22) | (0.22) |
| STL | | | | | -1.46 | -1.51 |
| | | | | | (1.07) | (1.11) |
| BLK | | | | | | -0.14 |
| | | | | | | (0.83) |
| AIC | 80.26 | 81.73 | 77.19 | 75.31 | 75.24 | 77.21 |
| BIC | 86.28 | 90.76 | 86.22 | 87.35 | 90.29 | 95.27 |
| Log Likelihood | -38.13 | -37.87 | -35.60 | -33.65 | -32.62 | -32.60 |
| Deviance | 76.26 | 75.73 | 71.19 | 67.31 | 65.24 | 65.21 |
| Num. obs. | 150 | 150 | 150 | 150 | 150 | 150 |
| ***p < 0.001; **p < 0.01; *p < 0.05 | | | | | | |
Historically, the Rookie of the Year award depends on the statistical averages of the player. Team-based metrics and records can't be taken into consideration because the best rookies are generally drafted by teams coming off losing seasons. For an award like the MVP, by contrast, team success and individual statistics both weigh heavily.
First, several logistic regression models are fit to find the best specification for the data; in this case that is Model 4 (points, rebounds, and assists), with TRB significant at the 0.1 level (p between .05 and .1).
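For reference, the same AIC comparison can be pulled out programmatically (a small sketch; the values match the table above):

```r
# Compare the six fitted models by AIC; lower values indicate a better fit/complexity trade-off
data.frame(
  model = paste0("royLog", 1:6),
  AIC = sapply(list(royLog1, royLog2, royLog3, royLog4, royLog5, royLog6), AIC)
)
```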
# Predicted probabilities for the held-out test set
predictTest = predict(royLog4, type='response', newdata=royTest)
# Classify a player as a predicted winner when the modeled probability is at least 0.2
tableTest <- table(royTest$winner, predictTest >= .2)
fourfoldplot(tableTest)
There are 12 cases where a player was correctly labeled as not winning the award, 2 false positives where a player was labeled as winning when he did not, and 2 cases where the player was correctly labeled as winning Rookie of the Year, for an overall accuracy of 87.5%.
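That accuracy can also be computed directly from the confusion table built above (a small sketch):

```r
# Correct classifications (diagonal of the confusion table) divided by total test cases
sum(diag(tableTest)) / sum(tableTest)
```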
library(ggthemes)
library(reshape2)
library(cowplot)
library(ggpubr)
oldroy1 <- read.csv('9596_roy.csv')
oldroy2 <- read.csv('9697_roy.csv')
oldroy3 <- read.csv('9798_roy.csv')
oldroyComb <- rbind(oldroy1,oldroy2,oldroy3) %>%
filter(!is.na(Rank)) %>%
mutate(Winner = as.integer(ifelse(Rank %in% c('1'),1,0)))
oldroyComb <- subset(oldroyComb, select = c('Player','PTS','AST','TRB','BLK','STL','Winner','Rank', 'Year'))
predictTest3 = predict(royLog4, type='response', newdata=oldroyComb)
oldroyComb$Predicted <- predictTest3
oldroyComb <- oldroyComb %>%
filter(Rank <= 2)
oldroyComb$Predicted <- round(oldroyComb$Predicted, digits = 2)
oldRoy96 <- oldroyComb %>%
filter(Year == 96)
oldRoy97 <- oldroyComb %>%
filter(Year == 97)
oldRoy98 <- oldroyComb %>%
filter(Year == 98)
center_title <- theme(plot.title = element_text(hjust=.5))
plot96 <- ggplot(oldRoy96, aes(Player, Predicted, fill = factor(Winner))) +
  geom_bar(stat = 'identity') +
  theme_minimal() +
  scale_fill_fivethirtyeight(name = 'Rookie of the Year', labels = c('Runner Up', 'Winner')) +
  labs(x = NULL, y = NULL, title = '1995-96\n') +
  geom_text(aes(label = Predicted), vjust = 1.5, color = 'white', fontface = 'bold') +
  center_title +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_y_continuous(limits = c(0, 1))
plot97 <- ggplot(oldRoy97, aes(Player, Predicted, fill = factor(Winner))) +
  geom_bar(stat = 'identity') +
  theme_minimal() +
  scale_fill_fivethirtyeight(name = 'Rookie of the Year', labels = c('Runner Up', 'Winner')) +
  labs(x = NULL, y = NULL, title = '1996-97') +
  geom_text(aes(label = Predicted), vjust = 1.5, color = 'white', fontface = 'bold') +
  center_title +
  scale_x_discrete(limits = c('Stephon Marbury', 'Allen Iverson')) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
plot98 <- ggplot(oldRoy98, aes(Player, Predicted, fill = factor(Winner))) +
  geom_bar(stat = 'identity') +
  theme_minimal() +
  scale_fill_fivethirtyeight(name = 'Rookie of the Year', labels = c('Runner Up', 'Winner')) +
  labs(x = NULL, y = NULL, title = '1997-98') +
  geom_text(aes(label = Predicted), vjust = 1.5, color = 'white', fontface = 'bold') +
  center_title +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_y_continuous(limits = c(0, 1))
ggarrange(plot96, plot97, plot98, ncol=3, common.legend = TRUE, legend = 'top')
To validate the model, Rookie of the Year candidates were selected from the 1995-96 through 1997-98 seasons, data outside the testing set.
The winner and runner-up from each season are evaluated as individual cases.
The award winner for each year is shown next to the runner-up to highlight the model's accuracy in picking the winner. The model gave the actual winner a probability of 90% or higher in each season.
library(nbastatR)
library(dplyr)
library(plotly)
rookies19 <- bref_players_stats(seasons = 2019, tables = "per_game", assign_to_environment = FALSE, only_totals = TRUE, nest_data = FALSE, join_data = TRUE, return_message = FALSE) %>%
filter(namePlayer %in% c('Luka Doncic','Kevin Knox','Collin Sexton','Deandre Ayton','Trae Young')) %>%
rename(TRB = trbPerGame, AST = astPerGame, STL = stlPerGame, BLK = blkPerGame, PTS = ptsPerGame, Player = namePlayer) %>%
select(Player, PTS, AST, TRB, STL, BLK)
formattable(rookies19, align=c('l'), list(PTS = color_tile("white","orange"), AST=color_tile("white","orange"),TRB= color_tile("white","orange"),STL = color_tile("white","orange"),BLK = color_tile("white","orange")))
| Player | PTS | AST | TRB | STL | BLK |
|---|---|---|---|---|---|
| Collin Sexton | 15.0 | 2.9 | 3.1 | 0.6 | 0.0 |
| Deandre Ayton | 16.4 | 1.9 | 10.5 | 0.8 | 0.9 |
| Kevin Knox | 12.6 | 1.0 | 4.4 | 0.5 | 0.3 |
| Luka Doncic | 20.9 | 5.7 | 7.3 | 1.0 | 0.3 |
| Trae Young | 17.8 | 7.7 | 3.3 | 0.8 | 0.2 |
Looking at the statistics for the top candidates, Luka Doncic ranks first in points, second in assists, second in rebounds, and first in steals; on these numbers he comes across as the leading candidate. Deandre Ayton, however, ranks third in points while leading the class in rebounds and blocks.
predictTest2 = predict(royLog4 , type='response', newdata=rookies19)
rookies19$percentage <- predictTest2
rookies19$percentage <- round(rookies19$percentage, digits = 2)
ggplot(rookies19, aes(Player, percentage, fill = as.factor(percentage))) +
  geom_bar(stat = 'identity', show.legend = FALSE) +
  scale_fill_manual(values = c('#f58426','#6f263d','#e03a3e','#1d1160','#00538c'), name = 'Calculated Probabilities') +
  theme_minimal() +
  scale_x_discrete(limits = c('Luka Doncic','Trae Young','Deandre Ayton','Collin Sexton','Kevin Knox')) +
  geom_text(aes(label = percentage), vjust = -.25, color = 'black', fontface = 'bold', position = position_dodge(width = .9)) +
  labs(x = '\nPlayer', y = 'Probability\n')
Based on the model, Luka Doncic has the highest probability of winning Rookie of the Year, followed by Trae Young and then Deandre Ayton. Even though Collin Sexton and Kevin Knox are considered top-5 candidates, their chances of winning are minimal.
# k-NN using PTS, TRB, and AST (columns 11:13), with the winner flag as the class label
royNN <- knn(train = royTrain[,11:13], test = royTest[,11:13], cl = royTrain[,22, drop = TRUE], k = 5)
tableTest2 <- table(royTest$winner, royNN)
fourfoldplot(tableTest2)
Next, a K-Nearest Neighbors model is applied to the dataset. Using points, rebounds, and assists as features, the confusion matrix above is obtained: 13 observations are correctly labeled as not winning the award, 1 observation is correctly labeled as the winner, 1 is incorrectly labeled as the winner, and 1 is incorrectly labeled as not winning.
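The choice of k = 5 is carried through the rest of the analysis. As a sketch of one way to sanity-check that choice, caret (loaded earlier) can tune k by cross-validation; the centering/scaling step and the grid of k values here are assumptions, not part of the original workflow.

```r
# Tune k by 10-fold cross-validation on the training set (a sketch, not the original workflow)
knnTune <- train(
  winner ~ PTS + TRB + AST,
  data = mutate(royTrain, winner = factor(winner)),   # caret needs a factor outcome for classification
  method = "knn",
  preProcess = c("center", "scale"),                  # k-NN is distance-based, so scale the features
  trControl = trainControl(method = "cv", number = 10),
  tuneGrid = data.frame(k = seq(3, 15, 2))
)
knnTune$bestTune
```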
royTable <- data.frame(select(royTest, Player, winner), royNN)
colnames(royTable) <- c('Player','Observed', 'Predicted')
# Keep only the misclassified test cases
royTable <- filter(royTable, Observed != Predicted)
formattable(royTable)
| Player | Observed | Predicted |
|---|---|---|
| Emeka Okafor | 1 | 0 |
| John Wall | 0 | 1 |
oldroyCombNN <- rbind(oldroy1,oldroy2,oldroy3) %>%
filter(!is.na(Rank)) %>%
mutate(winner = as.integer(ifelse(Rank %in% c('1'),1,0)))
oldroyCombNN <- subset(oldroyCombNN, select = c('Player','PTS','AST','TRB','BLK','STL','winner','Rank', 'Year'))
oldroyCombNN <- oldroyCombNN %>%
filter(Rank <= 2)
oldroyNN <- knn(train = royTrain[,11:13], test = select(oldroyCombNN, PTS,TRB,AST), cl = royTrain[,22, drop = TRUE], k = 5)
oldroyTable <- data.frame(select(oldroyCombNN, Player, Year, winner), oldroyNN)
colnames(oldroyTable) <- c('Player', 'Year', 'Observed', 'Predicted')
formattable(oldroyTable)
| Player | Year | Observed | Predicted |
|---|---|---|---|
| Damon Stoudamire | 96 | 1 | 1 |
| Arvydas Sabonis | 96 | 0 | 0 |
| Stephon Marbury | 97 | 0 | 0 |
| Allen Iverson | 97 | 1 | 1 |
| Tim Duncan | 98 | 1 | 1 |
| Keith Van Horn | 98 | 0 | 0 |
The k-NN model correctly predicts the previous Rookie of the Year winner in each case from 3 separate years, just as the logistic model did.
rookie19NN <- knn(train = royTrain[,11:13], test = select(rookies19, PTS,TRB,AST), cl = royTrain[,22, drop = TRUE], k = 5)
rookietable <- data.frame(select(rookies19, Player), rookie19NN)
colnames(rookietable) <- c('Player','Prediction')
formattable(rookietable)
| Player | Prediction |
|---|---|
| Collin Sexton | 0 |
| Deandre Ayton | 0 |
| Kevin Knox | 0 |
| Luka Doncic | 1 |
| Trae Young | 1 |
When running the model on the current 2019 rookie class, Luka Doncic and Trae Young are each predicted to win the award when evaluated as individual cases. Based on their statistical averages, both are classified as winners.
Comparing the results of the logistic and k-NN models, the predictions are nearly identical. Both models correctly predicted the previous winners, and both predicted the same rookie to win the award this year.
The exception is that the k-NN model classifies two players as winners. Trae Young's relatively high points and assists per game lead the model to label him a winner, and Luka Doncic's high scoring paired with a relatively high assist average does the same for him.
The logistic model gives Luka Doncic the highest probability of winning the award at 90%, with Trae Young at a still-strong probability of roughly 70%.
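For a side-by-side view, both models' predictions for the 2019 class can be combined into one table. This is a sketch that assumes the objects built above (rookies19, royLog4, and rookie19NN) are still in memory:

```r
# Combine the logistic probabilities and the k-NN classifications for the 2019 rookies
comparison <- rookies19 %>%
  select(Player) %>%
  mutate(
    LogisticProb = round(predict(royLog4, type = 'response', newdata = rookies19), 2),
    kNNPrediction = rookie19NN
  ) %>%
  arrange(desc(LogisticProb))
formattable(comparison)
```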