In the following report, statistical analysis was performed on the NBA dataset to identify the features within the data that are most suitable for a multiregression model. From the features determined significant, a model was fitted to predict the Win/Loss % of a team in next year’s NBA season.
Using our model, we can simply forecast the values for each team’s feature values for the upcoming 2019-2020 season and obtain a predicted Win/Loss %. From our sorted results, we predict that the Golden State Warriors ‘GSW’ will have the highest Win/Loss % in the upcoming season (based on their most recent performance), with the Houston Rockets ‘HOU’, Toronto Raptors ‘TOR’, and Utah Jazz ‘UTA’ all within range of contending for the championships alongside the GSW.
However, caution is noted from these results as the model fitted contains several limitations, most notably that of an over-reliance on historical data for future outlook (with no regard for the effect that player trades, injuries, and chemistry have on Win/Loss %), as well as the noise present in the data. Moving forward, we will use our findings in this model to lay foundation for our advanced, interactive ‘roster’ model which will incorporate individual player stats to simulate the effects that particular trades will have on predicted outcomes.
To build a multi-variable model, we will use the data created in the ‘Data Collection and Cleaning’ script within the NBA Simulator - Simple Model github repository. We will also modify the data (by adding new features and creating specific subsets) as performed in our exploratory data analysis.
## Read in necessary libraries
library(data.table)
library(dplyr)
library(forecast)
## Read data and alter names of key variables for ease of use
setwd("..")
stats <- fread(file = "NBA Data by Team and Season.csv")
names(stats)[names(stats) == "ADV.W.L."] <- "W.L"
names(stats)[names(stats) == "ADV.playoffs"] <- "playoffs"
## Create 'NRtg' variable
stats <- stats[,ADV.NRtg:= ADV.ORtg - ADV.DRtg]
## Create subset of recent 5 years of data
stats <- stats[stats$ADV.Season %in% c("2018-19","2017-18","2016-17","2015-16","2014-15"),]
names(stats)
## [1] "key" "ADV.Season" "ADV.Tm" "ADV.G"
## [5] "ADV.W" "ADV.L" "W.L" "ADV.MOV"
## [9] "ADV.SOS" "ADV.SRS" "ADV.Pace" "ADV.ORtg"
## [13] "ADV.DRtg" "ADV.eFG." "ADV.TOV." "ADV.ORB."
## [17] "ADV.FT.FGA" "ADV.eFG..OPP" "ADV.TOV..OPP" "ADV.ORB..OPP"
## [21] "ADV.FT.FGA.OPP" "playoffs" "champs" "TM.FG"
## [25] "TM.FGA" "TM.2P" "TM.2PA" "TM.3P"
## [29] "TM.3PA" "TM.FT" "TM.FTA" "TM.ORB"
## [33] "TM.DRB" "TM.TRB" "TM.AST" "TM.STL"
## [37] "TM.BLK" "TM.TOV" "TM.PF" "TM.PTS"
## [41] "OPP.FG" "OPP.FGA" "OPP.2P" "OPP.2PA"
## [45] "OPP.3P" "OPP.3PA" "OPP.FT" "OPP.FTA"
## [49] "OPP.ORB" "OPP.DRB" "OPP.TRB" "OPP.AST"
## [53] "OPP.STL" "OPP.BLK" "OPP.TOV" "OPP.PF"
## [57] "OPP.PTS" "ADV.NRtg"
From our exploratory data analysis, the features of interest include the following: ‘ORtg’ & ‘DRtg’ (and therefore, their net effect in ‘NRtg’), eFG%, AST, PTS, OPP.DRB, OPP.AST, and OPP.PTS.
Descriptions of these variables can be found within the README file in this repository, as well as at Basketball Reference’s glossary for NBA stats.
First, we will examine the key features suspected to affect W/L% the most (from our previous exploratory data analysis performed).
pairs(stats[,c("W.L","ADV.NRtg", "ADV.eFG.","TM.AST","TM.PTS")], lower.panel = NULL)
pairs(stats[,c("W.L","OPP.DRB","OPP.AST","OPP.PTS")], lower.panel = NULL)
From the above charts, Net Rating appears to have the strongest relationship with W/L%, with most of the remaining features showing weaker signs of a relationship.
However, these only include a few of the features as these were initially suspected to influence W/L% the most. To explore if other features may be better correlated to W/L%, we will fit a model using all features and analyze their p-values to determine significance in their relationships.
fitall <- lm(W.L ~ ., data = stats[,-c(1,2,3,4,5,6,22,23,58)]) ## fit model
pvals <- summary(fitall)$coefficients[,4] # extract p-values
coefs <- summary(fitall)$coefficients[,1] # extract coefficients
top <- cbind(coefs, pvals) # cbind coefficients and p-values
top <- top[order(top[,2], decreasing = FALSE),] # sort by p-values
top <- top[-(grep("Intercept",rownames(top))),] # Remove Intercept row
top <- head(top, n = 12); top # select top features
## coefs pvals
## ADV.DRtg -0.14880998 0.003311750
## OPP.TRB -0.18970436 0.005464123
## TM.ORB 0.22134092 0.005738584
## TM.FTA -0.04066872 0.014589743
## TM.2P -0.15169697 0.017964510
## TM.AST -0.00663483 0.021174820
## TM.FG 0.22257755 0.021269077
## ADV.ORB. -0.06537841 0.054831393
## OPP.DRB 0.13184901 0.064838416
## ADV.eFG..OPP 12.98988395 0.069352136
## ADV.Pace -0.01804002 0.072556940
## OPP.ORB 0.13450586 0.078732045
From the model fitted with all features (excluding Net Rating ‘NRtg’ due to collinearity with ‘DRtg’ and ‘ORtg’, and excluding select identifier variables in columns 1:6), there are 7 features that are significant at the 95% confidence level, with 5 other variables with p-values between 0.05 and 0.08.
We will also assess each of the non-significant variables to see if they should be removed or kept in the final model.
To assess this, we will perform the following steps:
Create initial model with the top 8 features
Review Adjusted R-squared from initial model
Remove the feature with the highest p-value
Run revised model and compare new Adjusted R-squared to previous Adjusted R-squared
Continue removing features with highest p-values, re-running the model, and comparing
If Adjusted R-squared decreases after removing some feature, consider keeping feature within the final model
topfeatures <- rownames(top) # extract names of top features
finalstats <- stats %>% select("W.L",topfeatures) # create subset of 'stats' dataset w/ top features
topmodel <- lm(W.L ~ ., data = finalstats) # fit model using all topfeatures
summary(topmodel)
##
## Call:
## lm(formula = W.L ~ ., data = finalstats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.098762 -0.025978 -0.001373 0.023372 0.119589
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.174663 0.781363 6.623 7.39e-10 ***
## ADV.DRtg -0.026066 0.002683 -9.716 < 2e-16 ***
## OPP.TRB -0.153299 0.066617 -2.301 0.0229 *
## TM.ORB 0.089358 0.055639 1.606 0.1106
## TM.FTA 0.020663 0.002024 10.211 < 2e-16 ***
## TM.2P -0.031339 0.002155 -14.545 < 2e-16 ***
## TM.AST -0.003475 0.002607 -1.333 0.1847
## TM.FG 0.098366 0.004086 24.076 < 2e-16 ***
## ADV.ORB. -0.059009 0.032066 -1.840 0.0679 .
## OPP.DRB 0.113367 0.069410 1.633 0.1047
## ADV.eFG..OPP -0.280004 0.484934 -0.577 0.5646
## ADV.Pace -0.033585 0.003152 -10.657 < 2e-16 ***
## OPP.ORB 0.155655 0.066537 2.339 0.0208 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.03972 on 137 degrees of freedom
## Multiple R-squared: 0.9368, Adjusted R-squared: 0.9313
## F-statistic: 169.3 on 12 and 137 DF, p-value: < 2.2e-16
From our first model, the adjusted R-squared is 0.9313.
We now fit a new model, removing the feature with the highest p-value (ADV.eFG..OPP).
topfeatures <- topfeatures[topfeatures != c("ADV.eFG..OPP")] # remove feature
finalstats <- stats %>% select("W.L",topfeatures)
topmodel <- lm(W.L ~ ., data = finalstats)
summary(topmodel)
##
## Call:
## lm(formula = W.L ~ ., data = finalstats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.099609 -0.024956 -0.002132 0.022938 0.117186
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.131596 0.775914 6.614 7.6e-10 ***
## ADV.DRtg -0.027445 0.001218 -22.535 < 2e-16 ***
## OPP.TRB -0.153745 0.066451 -2.314 0.0222 *
## TM.ORB 0.086389 0.055267 1.563 0.1203
## TM.FTA 0.020760 0.002012 10.320 < 2e-16 ***
## TM.2P -0.031399 0.002147 -14.625 < 2e-16 ***
## TM.AST -0.003552 0.002597 -1.368 0.1736
## TM.FG 0.098536 0.004065 24.239 < 2e-16 ***
## ADV.ORB. -0.057198 0.031834 -1.797 0.0746 .
## OPP.DRB 0.114800 0.069198 1.659 0.0994 .
## ADV.Pace -0.033757 0.003130 -10.786 < 2e-16 ***
## OPP.ORB 0.157555 0.066295 2.377 0.0188 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.03963 on 138 degrees of freedom
## Multiple R-squared: 0.9367, Adjusted R-squared: 0.9316
## F-statistic: 185.5 on 11 and 138 DF, p-value: < 2.2e-16
It appears that the Adjusted R-squared increased slightly to 0.9316. Therefore, we will continue to remove features (next to remove: TM.AST):
topfeatures <- topfeatures[topfeatures != c("TM.AST")] # remove feature
finalstats <- stats %>% select("W.L",topfeatures)
topmodel <- lm(W.L ~ ., data = finalstats)
summary(topmodel)
##
## Call:
## lm(formula = W.L ~ ., data = finalstats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.10461 -0.02368 -0.00192 0.02341 0.11348
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.058443 0.776489 6.515 1.24e-09 ***
## ADV.DRtg -0.026961 0.001169 -23.064 < 2e-16 ***
## OPP.TRB -0.161205 0.066434 -2.427 0.0165 *
## TM.ORB 0.083191 0.055390 1.502 0.1354
## TM.FTA 0.021526 0.001938 11.105 < 2e-16 ***
## TM.2P -0.031056 0.002139 -14.520 < 2e-16 ***
## TM.FG 0.096571 0.003815 25.315 < 2e-16 ***
## ADV.ORB. -0.054466 0.031871 -1.709 0.0897 .
## OPP.DRB 0.125085 0.069003 1.813 0.0720 .
## ADV.Pace -0.035036 0.002996 -11.694 < 2e-16 ***
## OPP.ORB 0.163693 0.066350 2.467 0.0148 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.03975 on 139 degrees of freedom
## Multiple R-squared: 0.9358, Adjusted R-squared: 0.9312
## F-statistic: 202.6 on 10 and 139 DF, p-value: < 2.2e-16
It appears that the Adjusted R-Squared decreased slightly to 0.9312. Therefore, we will keep the TM.AST feature and, instead, remove the TM.ORB feature:
topfeatures <- rownames(top) # reset 'topfeatures' variable to initial values including all top features
topfeatures <- topfeatures[topfeatures != c("TM.ORB", "ADV.eFG..OPP")] # remove features
finalstats <- stats %>% select("W.L",topfeatures)
topmodel <- lm(W.L ~ ., data = finalstats)
summary(topmodel)
##
## Call:
## lm(formula = W.L ~ ., data = finalstats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.109056 -0.022381 -0.002155 0.022341 0.117833
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.944428 0.159654 24.706 < 2e-16 ***
## ADV.DRtg -0.027603 0.001220 -22.626 < 2e-16 ***
## OPP.TRB -0.153573 0.066795 -2.299 0.022986 *
## TM.FTA 0.020864 0.002021 10.323 < 2e-16 ***
## TM.2P -0.031313 0.002157 -14.515 < 2e-16 ***
## TM.AST -0.003380 0.002608 -1.296 0.197103
## TM.FG 0.098794 0.004083 24.197 < 2e-16 ***
## ADV.ORB. -0.007537 0.002031 -3.711 0.000298 ***
## OPP.DRB 0.141753 0.067362 2.104 0.037150 *
## ADV.Pace -0.033820 0.003146 -10.751 < 2e-16 ***
## OPP.ORB 0.157830 0.066638 2.368 0.019238 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.03983 on 139 degrees of freedom
## Multiple R-squared: 0.9355, Adjusted R-squared: 0.9309
## F-statistic: 201.7 on 10 and 139 DF, p-value: < 2.2e-16
Similarly, Adjusted R-squared decreased slightly to 0.9309. As a result, we’ll opt to only remove the ‘ADV.eFG..OPP’ feature.
Our final model will consist of the following variables:
topfeatures <- rownames(top) # reset 'topfeatures' variable to initial values including all top features
topfeatures <- topfeatures[topfeatures != c("ADV.eFG..OPP")] # remove feature
finalstats <- stats %>% select("W.L",topfeatures)
topmodel <- lm(W.L ~ ., data = finalstats)
topfeatures
## [1] "ADV.DRtg" "OPP.TRB" "TM.ORB" "TM.FTA" "TM.2P" "TM.AST"
## [7] "TM.FG" "ADV.ORB." "OPP.DRB" "ADV.Pace" "OPP.ORB"
par(mfrow = c(2,2))
plot(topmodel)
From the diagnostic plots on our final model, there doesn’t appear to be any noticeable patterns that require us to troubleshoot the model.
In order to build predictions on which teams are most likely to win the upcoming season’s NBA Championship using our model, we will simply forecast the features determined previously for each team based on their historical trend. These forecasted values will be plugged into the final model to estimate each team’s W/L%, with the highest W/L% denoting our best guess for the year’s NBA champions.
First, we will gather (from the stats dataset) Team names, Season, and the top features and split this into a list by Team:
StatsByTeam <- stats %>% select("ADV.Tm", "ADV.Season", topfeatures)
StatsByTeam <- split(StatsByTeam,StatsByTeam$ADV.Tm)
Within each element in this list resides data for each team in the NBA for the top features selected in our model:
head(StatsByTeam[[1]],n = 3)
## ADV.Tm ADV.Season ADV.DRtg OPP.TRB TM.ORB TM.FTA TM.2P TM.AST TM.FG
## 1: ATL 2018-19 113.9 46.0 11.6 23.4 28.4 25.8 41.4
## 2: ATL 2017-18 110.6 44.2 9.1 20.2 27.0 23.7 38.2
## 3: ATL 2016-17 105.7 43.9 10.3 24.9 29.2 23.6 38.1
## ADV.ORB. OPP.DRB ADV.Pace OPP.ORB
## 1: 24.7 35.4 103.9 10.6
## 2: 21.1 34.0 98.3 10.3
## 3: 23.6 33.2 97.4 10.7
Next, we will run through each element in the list (i.e., each team set) and forecast each variable for each team using the ‘forecast’ package. The outer ‘for’ loop runs through each element within the list and the nested ‘for’ loop forecasts the value of each column within each list element:
## Create empty 'forecasts' dataframe to be filled with team forecasts from 'for' loops below
forecasts <- data.frame(matrix(NA, nrow = length(topfeatures)+1,
ncol = length(StatsByTeam)))
## Outer 'for' loop to run through each list element and order the team's dataset by season, then pull only past 3 years of data
for(i in 1:length(StatsByTeam)){
Tm <- data.frame(StatsByTeam[i]) ## Set team subset
Tm <- Tm[order(Tm[2], decreasing = TRUE),]; Tm <- head(Tm, 3) ## past 3 years of data
teamdata <- data.frame()
## Nested 'for' loop to forecast the value of each column within each team's subset of data.
for(n in 1:length(topfeatures)){
col <- n + 2
series <- ts(Tm[col]) # create time series to be used for forecast() function
forecast <- data.frame(forecast(series))[1,1] # generate simple forecast based on time series
teamdata[n,1] <- forecast
}
finalteam <- rbind(Tm[1,1],teamdata)
forecasts[,i] <- finalteam
}
## Create and clean final data frame of forecasts (to be used in prediction model)
forecasts <- data.frame(t(forecasts)); names(forecasts) <- c("Team", topfeatures)
len <- as.numeric(ncol(forecasts))
forecasts[2:len] <- apply(apply(forecasts[2:len],2,as.character),2,as.numeric)
forecasts[,1] <- as.character(forecasts[,1])
Note that the number of years to calculate the forecast on is 3. Therefore, we are forecasting all values based on the most recent 3 years of team performance. This is based on the assumption that team performance is highly-dependent on the players within the team. Player contract lengths float between 1 to 5 years (with a few exceptions). Therefore, we will assume that players will be signed to a team for an average of 2.5 years (rounded up to 3). However, this assumption does not consider the true mean of the distribution of contract lengths within the NBA, nor does it account for the fact players can be traded mid-contract.
Our final dataset to be plugged into our prediction model is structured as follows:
head(forecasts, n = 3)
## Team ADV.DRtg OPP.TRB TM.ORB TM.FTA TM.2P TM.AST TM.FG
## X1 ATL 110.0667 44.70000 10.333333 22.83333 28.20000 24.36667 39.23333
## X2 BOS 106.7000 44.73333 9.433333 21.13333 27.66667 24.66667 39.66667
## X3 BRK 110.3333 46.73333 9.833333 24.23333 26.70000 22.96667 38.76667
## ADV.ORB. OPP.DRB ADV.Pace OPP.ORB
## X1 23.13333 34.20000 99.86667 10.53333
## X2 21.43333 34.50000 97.46667 10.26667
## X3 21.46667 35.93333 100.33333 10.80000
Using our final model stored in ‘topmodel’, we can predict the W/L%’s per team based on the forecasts made above:
forecasts <- data.table(forecasts)
forecasts <- forecasts[,WL:= predict(topmodel,forecasts)] # Create new WL column using values fitted into 'topmodel'
forecasts <- forecasts[order(forecasts[,c("WL")], decreasing = TRUE),]
pred <- forecasts[1:16,c("Team","WL")]; pred # Select only top 16 teams (i.e., playoff-qualifying)
## Team WL
## 1: GSW 0.7489522
## 2: HOU 0.6868871
## 3: TOR 0.6769078
## 4: UTA 0.6408476
## 5: SAS 0.6025206
## 6: MIL 0.6008968
## 7: BOS 0.5937128
## 8: OKC 0.5875458
## 9: LAC 0.5648256
## 10: POR 0.5579062
## 11: DEN 0.5575158
## 12: IND 0.5476148
## 13: MIA 0.5311400
## 14: WAS 0.5082633
## 15: PHI 0.4971288
## 16: NOP 0.4967300
Based on our model, we predict the above 16 teams to qualify for the 2019-2020 NBA Playoffs, with the Golden State Warriors ‘GSW’ possessing the highest W/L%. In addition, GSW, HOU, TOR and UTA are shown as our model’s predictions for the 2019-2020 season’s primary championship contenders.
A few notes on the assumptions and limitations on the model:
Simple model oversimplifies the factors involved in determining championship contending teams. Within the data, there exists an abundance of noise, making it difficult to create accurate predictions using a simple regression model.
The model only takes into account past performance and not future outlooks. For example, major changes can occur during offseasons (e.g., superstar trades, injuries, etc.) which can drasticaly affect the future outlook of teams, regardless of historical performance.
Feature values for each team are forecasted based on only the past 3 years of data. However, team ‘eras’ and ‘dynasties’ (i.e., periods where teams achieve high success over several seasons, control several outperforming allstars, etc.) can last several years, or even dissolve in 1-2 seasons due to lack of chemistry.
The model does not take into account the effect that specific players, coaches, and management teams have on performance. It assumes that teams remain static (i.e., no trades/waivers/drafts/retirements/injuries/coaching/management changes occur).
These limitations will be addressed in our advanced machine learning ‘roster’ model (outline below).
We’ve developed a multivariable regression model (based on 11 features from our NBA stats dataset) which allow us to predict the results of the upcoming 2019-2020 NBA season. The features used in our model include the following (with a description of each feature found here): “ADV.DRtg” “OPP.TRB” “TM.ORB” “TM.FTA” “TM.2P” “TM.AST” “TM.FG” “ADV.ORB.” “OPP.DRB” “ADV.Pace” “OPP.ORB”.
From our model (using the past 3 years of data), we predict that the Golden State Warriors (GSW), Houston Rockets (HOU), San Antonio Spurs (SAS), and Toronto Raptors (TOR) will be the main championship contenders in the 2019-2020 season.
Moving forward, a more advanced prediction model will be developed to address some of the limitations of this multivariable model through the following:
To address noise in the dataset across features, machine learning models will be explored (e.g., boosting) to determine if we can obtain high prediction accuracy by combining potentially weak features within the data.
To reduce our model’s reliance on historical performance and increase our ability to weigh our predictions based on expected future events (such as trades), our advanced model will incorporate roster data. This will allow us to simulate the effects of trades/new signings/waivers/etc. as each individual player will account for some contribution to the team’s expected future performance.
To determine the appropriate horizon of time to forecast team performance (i.e., how many years of data to forecast feature values on), each team will be analyzed for the proportion of players in their ‘core’ that are expected to remain on the team. This will be based on the assumption that if a team stays entirely the same from one season to the next (i.e., no one gets traded/waived/etc.), then the team should be expected to perform generally the same as the previous year. By giving each team a proportion of ‘core’ players (essentially a player turnover/retention ratio), we can adjust the weighting of historical performance to be fed into the final model.
By incorporating these tweaks and developing a machine learning model, we can improve our prediction accuracy, as well as develop a dynamic model allowing us to predict future seasons by making player trades, new signings, retirements, etc. all possible.