This exploratory data analysis was performed on an NBA dataset of advanced, team per-game, and opponent per-game statistics gathered from the Basketball Reference site. The data cover regular-season performance metrics for 1980-2019 (in earlier years, several metrics were missing or were not calculated or recorded).
Using this data, we explored the traits of winning teams and looked for league-wide trends from year to year. Here, ‘winning teams’ are those that qualified for the playoffs (and, within that group, those that went on to become NBA champions).
To measure a team’s success more closely, Win/Loss % (W/L%) was used as the outcome measure, and several metrics were explored to determine whether some were stronger predictors of W/L% than others.
Our findings suggest that a team’s regular season Net Rating (i.e., Offensive Rating minus Defensive Rating) had the highest correlation with W/L%. In addition, the apparent W/L% cutoff to be considered a championship contender sits around 0.65. Championship teams also tend to show high margins of victory (MOV) during the regular season. For playoff-qualifying teams, a below-average strength of schedule (SOS) was observed, suggesting that there may be a ‘luck’ factor involved in determining winning teams (since a lower SOS indicates an ‘easier’ game schedule).
Across the league, we also observe an increase in pace of game (and total points scored per game) from the 2000s through 2019. A portion of this increase in points per game can be attributed to the league’s increased reliance on the 3-point shot (and correspondingly decreased reliance on the 2-point shot), as indicated by total 3-point and 2-point shots attempted season over season.
From these findings, we will build 2 versions of an NBA simulator:
1. First, a simple linear model using the most suitable feature to predict W/L%. With this model, we will forecast the feature values in coming years for each team, feed these forecasts into our linear model, and obtain an estimate of W/L%. By ranking teams on their predicted W/L%, we obtain a prediction of the NBA’s next championship team (2019-2020 season) and qualifying playoff contenders.
2. Next, an advanced, interactive predictor model with the end goal of being able to incorporate team rosters and the effect of player trades on predicted team performance. This model will contain an advanced regression model to fit W/L% based on each player’s performance and contributions. By increasing the granularity of our model to incorporate player contributions, we can adjust a team’s overall predicted performance by ‘trading’ players to other teams (and, therefore, moving their contributions to other teams). This granularity will allow our model to account for the major player acquisitions and losses that occur each season, which should improve our prediction success.
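As a rough illustration of the first approach, the sketch below fits a one-feature linear model, forecasts the feature, and ranks teams by predicted W/L%. It uses base R on synthetic data; the team labels, the values, and the three-season-average forecast are assumptions made for illustration, not the data or forecasting method used later in this project.

```r
# Synthetic stand-in data: 3 hypothetical teams over 5 seasons (not real NBA values).
set.seed(1)
toy <- data.frame(
  team   = rep(c("AAA", "BBB", "CCC"), each = 5),
  season = rep(2015:2019, times = 3),
  NRtg   = c(rnorm(5, 5, 1), rnorm(5, 0, 1), rnorm(5, -4, 1))
)
# Assumed relationship for the toy data only: W/L% rises by 0.03 per point of NRtg.
toy$W.L <- 0.5 + 0.03 * toy$NRtg + rnorm(nrow(toy), 0, 0.02)

# Step 1: single-feature linear model.
fit <- lm(W.L ~ NRtg, data = toy)

# Step 2: naive next-season forecast of the feature
# (mean of each team's last 3 seasons; a placeholder for a real forecast).
last3 <- toy[toy$season >= 2017, ]
next_NRtg <- aggregate(NRtg ~ team, data = last3, FUN = mean)

# Step 3: predict W/L% from the forecast feature and rank teams.
next_NRtg$pred_W.L <- predict(fit, newdata = next_NRtg)
ranking <- next_NRtg[order(-next_NRtg$pred_W.L), ]
print(ranking)
```

Keeping the forecast step separate from the regression step means a different forecasting method can be swapped in without refitting the W/L% model.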
library(data.table)
library(ggplot2)
library(gridExtra)
library(caret)
library(forecast)
For this report, we will load the “NBA Data by Team and Season.csv” file created from the “Data Collection and Cleaning.R” script:
setwd("..")
stats <- fread(file = "NBA Data by Team and Season.csv")
names(stats)[names(stats) == "ADV.W.L."] <- "W.L"
names(stats)[names(stats) == "ADV.playoffs"] <- "playoffs"
names(stats)
## [1] "key" "ADV.Season" "ADV.Tm" "ADV.G"
## [5] "ADV.W" "ADV.L" "W.L" "ADV.MOV"
## [9] "ADV.SOS" "ADV.SRS" "ADV.Pace" "ADV.ORtg"
## [13] "ADV.DRtg" "ADV.eFG." "ADV.TOV." "ADV.ORB."
## [17] "ADV.FT.FGA" "ADV.eFG..OPP" "ADV.TOV..OPP" "ADV.ORB..OPP"
## [21] "ADV.FT.FGA.OPP" "playoffs" "champs" "TM.FG"
## [25] "TM.FGA" "TM.2P" "TM.2PA" "TM.3P"
## [29] "TM.3PA" "TM.FT" "TM.FTA" "TM.ORB"
## [33] "TM.DRB" "TM.TRB" "TM.AST" "TM.STL"
## [37] "TM.BLK" "TM.TOV" "TM.PF" "TM.PTS"
## [41] "OPP.FG" "OPP.FGA" "OPP.2P" "OPP.2PA"
## [45] "OPP.3P" "OPP.3PA" "OPP.FT" "OPP.FTA"
## [49] "OPP.ORB" "OPP.DRB" "OPP.TRB" "OPP.AST"
## [53] "OPP.STL" "OPP.BLK" "OPP.TOV" "OPP.PF"
## [57] "OPP.PTS"
Each variable is outlined within the README file found in the NBA Simulator repository. The procedure for obtaining, cleaning, and consolidating the data can also be found in the README file.
To fit a regression model on the data and build predictions on a team's chances of making the playoffs and contending for a championship, we will explore the data to identify possible features consistent in winning teams. We define 'winning teams' as those with a Win/Loss % (W/L%) high enough to have made the playoffs. Within this set, we define 'championship contending teams' as those with a Win/Loss % within the 'championship cut-off' range which we will discuss in the next section.
We will also identify trends within the league (across seasons) to identify how the play-style within the NBA has evolved through the years as this will affect the weighting of certain features in our regression model when developing a more advanced predictor.
However, for the purposes of our initial ‘simple’ model, we will fit a simple linear regression using the single feature that we find most predictive of team W/L%.
To identify if there is a cutoff for being considered a championship contender, we can plot the distribution of the W/L% of playoff teams and color the distribution of the championship teams (see ‘Win/Loss % of Playoff Teams’ histogram below).
playoffplot <- qplot(stats$W.L, fill = stats$playoffs, xlab = "W/L%",
ylab = "Count (Teams)", binwidth = 0.05)
playoffdata <-stats[stats$playoffs == TRUE] ## create subset of playoff-only observations
champsplot <- qplot(playoffdata$W.L, fill = playoffdata$champs, xlab = "W/L%",
ylab = "Count (Teams)", binwidth = 0.05)
grid.arrange(playoffplot, champsplot, nrow = 1)
WLbyAllTeams <- summary(stats$W.L)
WLbyPlayoffTeams <- summary(stats[stats$playoffs == TRUE]$W.L)
WLbyChampTeams <- summary(stats[stats$champs == TRUE]$W.L)
rbind(WLbyAllTeams, WLbyPlayoffTeams,WLbyChampTeams)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## WLbyAllTeams 0.106 0.37875 0.512 0.5002362 0.610 0.890
## WLbyPlayoffTeams 0.366 0.53700 0.606 0.6084870 0.671 0.890
## WLbyChampTeams 0.573 0.70200 0.744 0.7436154 0.793 0.878
From the histogram on the right, it appears that to be considered a championship-contending team, W/L% must be at least in the top half of the distribution of W/L% across playoff teams. To support this insight, we need to verify that the cutoff remains static year over year (i.e., that the distribution of W/L% is roughly the same across seasons) and that it is not drifting over time. To do this, we can examine the same distributions of W/L% by season:
qplot(stats$ADV.Season, stats$W.L, geom = c("boxplot"),
ylab = "W/L% Distribution", xlab = "Season")
From the boxplots above, the cutoff we’ve set to contend for a championship appears valid, since the W/L% distribution does not appear to change from season to season.
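The visual checks above can be backed by numbers. The sketch below is a minimal base R example on synthetic data: it computes the share of champions at or above the playoff-team median and the W/L% quartiles per season. The toy playoff and championship rules (top 6 teams qualify, best record wins the title) are assumptions made only to label the synthetic rows.

```r
# Synthetic stand-in: 4 hypothetical seasons of 10 teams each (not real NBA data).
set.seed(42)
toy <- data.frame(
  season = rep(c("2015-16", "2016-17", "2017-18", "2018-19"), each = 10),
  W.L    = round(runif(40, 0.2, 0.8), 3)
)
# Toy rules (assumptions): top 6 by W/L% make the playoffs; best record is champion.
toy$rank     <- ave(-toy$W.L, toy$season,
                    FUN = function(x) rank(x, ties.method = "first"))
toy$playoffs <- toy$rank <= 6
toy$champs   <- toy$rank == 1

# Share of champions whose W/L% is at or above the playoff-team median.
playoff_median <- median(toy$W.L[toy$playoffs])
share_above    <- mean(toy$W.L[toy$champs] >= playoff_median)

# Quartiles of W/L% per season: similar rows suggest a stable distribution.
by_season <- t(sapply(split(toy$W.L, toy$season), quantile,
                      probs = c(0.25, 0.5, 0.75)))
print(share_above)
print(round(by_season, 3))
```

On the real data, the same two summaries would be computed from `stats$W.L`, `stats$playoffs`, `stats$champs`, and `stats$ADV.Season`.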
One factor we’d like to examine is teams’ offensive and defensive ratings. These ratings provide a measure of a team’s performance on both sides of the court and incorporate calculations using player points, field-goal percentage, total possessions, fouls, free throws, rebounds, turnovers, blocks, and steals. By netting the two (i.e., Offensive Rating minus Defensive Rating), we obtain a team’s ‘Net Rating’, a single summary of performance at both ends of the court.
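At the team level, these ratings are expressed as points per 100 possessions. The sketch below works through the arithmetic for one hypothetical team; the totals are made-up values, and the possession count is taken as given rather than estimated from box-score statistics.

```r
# Hypothetical season totals for one team (illustrative values, not from the dataset).
pts_for     <- 9200   # points scored
pts_against <- 8900   # points allowed
poss        <- 8200   # possessions (assumed equal for the team and its opponents)

ortg <- 100 * pts_for / poss       # points scored per 100 possessions
drtg <- 100 * pts_against / poss   # points allowed per 100 possessions
nrtg <- ortg - drtg                # net rating: positive means the team outscores opponents

print(round(c(ORtg = ortg, DRtg = drtg, NRtg = nrtg), 2))
```

Because both ratings share the same denominator, the net rating is simply the scoring margin rescaled to 100 possessions.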
Similar to the histograms we created above, we can plot the distribution of Offensive (ORtg) and Defensive Ratings (DRtg) and fill by playoff and non-playoff teams:
ORtg.hist <- qplot(stats$ADV.ORtg, fill = stats$playoffs,
xlab = "Offensive Rating", ylab = "Count (Teams)")
ORtg.box <- qplot(stats$playoffs, stats$ADV.ORtg, geom = c("boxplot"),
xlab = "Made Playoffs?", ylab = "Offensive Rating")
DRtg.hist <- qplot(stats$ADV.DRtg, fill = stats$playoffs,
xlab = "Defensive Rating", ylab = "Count (Teams)")
DRtg.box <- qplot(stats$playoffs, stats$ADV.DRtg, geom = c("boxplot"),
xlab = "Made Playoffs?", ylab = "Defensive Rating")
grid.arrange(ORtg.hist,ORtg.box, DRtg.hist, DRtg.box,nrow = 2, ncol = 2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
grid.arrange(qplot(stats$ADV.ORtg, stats$W.L, geom = c("point"), color = stats$playoffs,
xlab = "Offensive Rating", ylab = "W/L%"),
qplot(stats$playoffs, stats$ADV.ORtg, geom = c("boxplot"),
xlab = "Playoff Team?", ylab = "Offensive Rating"),
qplot(stats$ADV.DRtg, stats$W.L, geom = c("point"), color = stats$playoffs,
xlab = "Defensive Rating", ylab = "W/L%"),
qplot(stats$playoffs, stats$ADV.DRtg, geom = c("boxplot"),
xlab = "Playoff Team?", ylab = "Defensive Rating"),
nrow = 2)
From these plots, both ORtg and DRtg appear to separate playoff teams from non-playoff teams. Below, hypothesis tests are performed to verify that the differences in mean ORtg and DRtg between playoff and non-playoff teams are statistically significant:
t.test(stats$ADV.ORtg[stats$playoffs == TRUE],
stats$ADV.ORtg[stats$playoffs == FALSE],
paired = FALSE)
##
## Welch Two Sample t-test
##
## data: stats$ADV.ORtg[stats$playoffs == TRUE] and stats$ADV.ORtg[stats$playoffs == FALSE]
## t = 19.575, df = 1021.4, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 3.492334 4.270539
## sample estimates:
## mean of x mean of y
## 108.2474 104.3660
Based on the results (p-value < 2.2e-16), there is statistically significant evidence that the true difference in means between the groups is not equal to 0. Playoff teams average an offensive rating of 108.2 versus 104.4 for non-playoff teams (95% confidence interval for the difference: 3.49 to 4.27), so offensive ratings are higher (i.e., more favorable) in teams that make the playoffs.
Statistical significance is also observed in the test for the difference in mean defensive ratings between the groups:
t.test(stats$ADV.DRtg[stats$playoffs == TRUE],
stats$ADV.DRtg[stats$playoffs == FALSE],
paired = FALSE)
##
## Welch Two Sample t-test
##
## data: stats$ADV.DRtg[stats$playoffs == TRUE] and stats$ADV.DRtg[stats$playoffs == FALSE]
## t = -19.973, df = 1053.4, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -3.981427 -3.269099
## sample estimates:
## mean of x mean of y
## 104.9686 108.5938
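Playoff teams average a defensive rating of 105.0 versus 108.6 for non-playoff teams (95% confidence interval for the difference: -3.98 to -3.27). Therefore, defensive ratings are lower (i.e., more favorable) in teams that make the playoffs.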
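With samples this large, a very small p-value says little about how big the difference is. One common complement is a standardized effect size. The sketch below computes Cohen's d alongside a Welch test on synthetic ratings; the `cohens_d` helper and the simulated values are illustrative assumptions, not part of the original analysis.

```r
# Synthetic stand-in ratings (not real NBA data): playoff teams shifted up by ~4 points.
set.seed(7)
ortg_playoff    <- rnorm(200, mean = 108, sd = 3.5)
ortg_nonplayoff <- rnorm(180, mean = 104, sd = 3.5)

# Welch two-sample t-test, as used in the report.
welch <- t.test(ortg_playoff, ortg_nonplayoff, paired = FALSE)

# Cohen's d with a pooled standard deviation (hypothetical helper).
cohens_d <- function(x, y) {
  nx <- length(x)
  ny <- length(y)
  pooled_sd <- sqrt(((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2))
  (mean(x) - mean(y)) / pooled_sd
}
d <- cohens_d(ortg_playoff, ortg_nonplayoff)

print(welch$conf.int)  # interval for the difference in means
print(d)               # difference expressed in pooled standard deviations
```

Applied to the real data, the two vectors would be `stats$ADV.ORtg[stats$playoffs == TRUE]` and `stats$ADV.ORtg[stats$playoffs == FALSE]`.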
Another way to view these features is to net them (Offensive Rating minus Defensive Rating) to obtain a Net Rating (‘NRtg’) for each team by season. Here we view it only for playoff teams, split by championship status:
stats <- stats[,ADV.NRtg:= ADV.ORtg - ADV.DRtg]
playoffdata <- playoffdata[,ADV.NRtg:= ADV.ORtg - ADV.DRtg]
grid.arrange(qplot(playoffdata$ADV.NRtg, playoffdata$W.L, color = playoffdata$champs,
ylab = "W/L%", xlab = "Net Rating"),
qplot(playoffdata$champs, playoffdata$ADV.NRtg, geom = c("boxplot"),
xlab = "Championship Team?", ylab = "Net Rating"),
nrow = 1)
t.test(playoffdata$ADV.NRtg[playoffdata$champs == TRUE],
playoffdata$ADV.NRtg[playoffdata$champs == FALSE],paired = FALSE)
##
## Welch Two Sample t-test
##
## data: playoffdata$ADV.NRtg[playoffdata$champs == TRUE] and playoffdata$ADV.NRtg[playoffdata$champs == FALSE]
## t = 10.665, df = 44.935, p-value = 6.779e-14
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 3.569589 5.231802
## sample estimates:
## mean of x mean of y
## 7.400000 2.999304
From the hypothesis test on the difference in mean NRtg between championship and non-championship playoff teams, the p-value of 6.779e-14 is statistically significant. Champions average an NRtg of 7.40 versus 3.00 for other playoff teams, so we have evidence that NRtg is higher in championship teams.
The margin-of-victory (MOV) statistic is a team’s average point differential per game. For example, a team with an MOV of 12 outscores its opponents by 12 points per game on average; a negative MOV means the team is outscored on average. As such, this statistic can be viewed as a measure of how balanced the league is across teams: an unbalanced league (i.e., one with NBA superstars concentrated on a few teams) would show high variability in MOV, whereas a balanced league would show less variable MOV, with values clustered around 0 (all teams are competitive with one another and none dominates the rest). To analyze this, we plot the distribution of MOV in playoff vs. non-playoff teams (boxplot), as well as in championship vs. non-championship teams (histogram).
MOV.box <- qplot(stats$playoffs, stats$ADV.MOV, geom = c("boxplot"),
xlab = "Made Playoffs?", ylab = "MOV")
MOVchamps.hist <- qplot(stats$ADV.MOV, fill = stats$champs,
xlab = "MOV", ylab = "Count (Teams)")
grid.arrange(MOV.box, MOVchamps.hist,
nrow = 1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
From the above, the championship teams sit approximately in the top quartile of MOV.
By graphing the MOV distributions by year, we can get a sense of how the balance in the league has changed over time:
MOVbySeason.box <- qplot(stats$ADV.Season, stats$ADV.MOV,
geom = c("boxplot")); MOVbySeason.box
Seasons with taller boxes and longer whiskers indicate higher variability (and therefore greater imbalance) across teams in those seasons. It appears that in 2007-2008 there was a spike in MOV variation. In recent years, however, the league’s MOV distribution has been fairly stable and slightly less variable, suggesting that the league is trending towards a more balanced composition of teams (i.e., fewer teams ‘dominating’ the league and fewer ‘superteams’ hoarding all-star rosters).
Since we know that MOV is, on average, higher in teams that make the playoffs than in non-playoff teams, we can also perform a hypothesis test on the difference in mean MOV between championship and non-championship playoff teams:
## index playoffdata by its own champs column; stats$champs has a different
## length, so using it here would select the wrong rows
MOVtest <- t.test(playoffdata$ADV.MOV[playoffdata$champs == TRUE],
playoffdata$ADV.MOV[playoffdata$champs == FALSE],
paired = FALSE)
MOVtest
MOVtest$p.value
Note that the groups must be selected with playoffdata$champs rather than stats$champs. Indexing playoffdata$ADV.MOV with stats$champs gives p-value = 0.6635701 with group means of 2.879 and 3.127, but because stats and playoffdata have different numbers of rows, those groups do not correspond to champions and non-champions, and that result cannot support a conclusion about MOV. MOV and NRtg both measure scoring margin (per game and per 100 possessions, respectively), so the test on correctly selected groups is expected to resemble the NRtg result above, which would be consistent with the histogram showing champions in the top quartile of MOV. The output of the code above should be checked before MOV is accepted or rejected as a feature for predicting championship teams out of the pool of playoff teams.
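When a test like the one above subsets one table with a logical vector, that vector has to come from the same table. The minimal sketch below uses made-up values to show what happens otherwise; the `full` and `sub` data frames are hypothetical stand-ins for `stats` and `playoffdata`.

```r
# Toy parent table (hypothetical values): 6 teams, 4 in the playoffs, 1 champion.
full <- data.frame(
  MOV      = c(-5, 8, 2, -3, 4, 1),
  playoffs = c(FALSE, TRUE, TRUE, FALSE, TRUE, TRUE),
  champs   = c(FALSE, TRUE, FALSE, FALSE, FALSE, FALSE)
)
sub <- full[full$playoffs, ]   # rows 2, 3, 5, 6 of `full`

# Misaligned: the logical vector has 6 entries but `sub` has only 4 rows,
# so position 2 of `sub` (MOV = 2) is selected instead of the champion.
wrong <- sub$MOV[full$champs]

# Aligned: the logical vector comes from `sub` itself, selecting the champion (MOV = 8).
right <- sub$MOV[sub$champs]

print(wrong)
print(right)
```

R raises no error in the misaligned case, which is why a quick check that `length(index) == nrow(table)` is worth adding before a test of this kind.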
Another variable we will visit is the Strength-of-schedule (SOS) metric which is a measure of the difficulty in a team’s schedule. Although there is variability in how this metric is calculated, the following are typical variables:
- Opponent proficiency
- Road trip length, back-to-back game schedules, and game locations
- Home-court advantages
- Game times (mornings/afternoons/evenings)
For the purposes of this analysis, we will use the SOS ratings calculated by Basketball Reference.
By analyzing SOS, we’re looking to see whether various factors of a team’s schedule (and therefore, luck) play a role in determining playoff or non-playoff status.
SOS.hist <- qplot(stats$ADV.SOS, fill = stats$playoffs,
xlab = "SOS", ylab = "Count (Teams)")
SOS.box <- qplot(stats$playoffs, stats$ADV.SOS, geom = c("boxplot"),
xlab = "Made Playoffs?", ylab = "SOS")
grid.arrange(SOS.hist, SOS.box, nrow = 1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
SOStest <- t.test(stats$ADV.SOS[stats$playoffs == TRUE],
stats$ADV.SOS[stats$playoffs == FALSE],
paired = FALSE); SOStest
##
## Welch Two Sample t-test
##
## data: stats$ADV.SOS[stats$playoffs == TRUE] and stats$ADV.SOS[stats$playoffs == FALSE]
## t = -15.374, df = 1019.6, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.3851192 -0.2979380
## sample estimates:
## mean of x mean of y
## -0.1466775 0.1948511
From the above hypothesis test, we see that the mean SOS is, in fact, lower in teams that qualify for the playoffs (-0.147 vs. 0.195). This suggests that there may be a luck factor involved, since teams that qualify for the playoffs have easier-than-average schedules.
Next, we will take a closer look at the correlations of specific metrics to W/L%:
par(mfrow=c(2,1))
pairs(stats[,c("W.L","ADV.Pace","ADV.ORtg","ADV.DRtg","ADV.eFG.","ADV.TOV.", "ADV.NRtg")],
lower.panel = NULL)
pairs(stats[,c("W.L","ADV.ORtg","ADV.DRtg","ADV.eFG.", "ADV.NRtg")],
lower.panel = NULL)
From the advanced statistics, it appears that there exists a relationship between W/L% and the following variables: ORtg, DRtg, eFG%, NRtg.
par(mfrow=c(2,1))
pairs(stats[,c("W.L","TM.2P","TM.3P","TM.FTA","TM.ORB")], lower.panel = NULL)
pairs(stats[,c("W.L","TM.DRB","TM.AST","TM.TOV","TM.PTS")], lower.panel = NULL)
pairs(stats[,c("W.L","TM.AST","TM.PTS")], lower.panel = NULL)
From the above correlation plots, the variables that appear to be correlated with W/L% are AST and PTS.
par(mfrow=c(2,1))
pairs(stats[,c("W.L","OPP.2P","OPP.3P","OPP.FT")], lower.panel = NULL)
pairs(stats[,c("W.L","OPP.ORB","OPP.DRB","OPP.AST","OPP.TOV")], lower.panel = NULL)
Similar variables of interest (excluding OPP.TOV) appear when examining the opponent game statistics for each team. The variables that appear to have a relationship with W/L% are: OPP.DRB, and OPP.AST. However, these relationships don’t appear to be as pronounced as the ones examined from the ADV and TM variables.
pairs(stats[,c("W.L","OPP.DRB","OPP.AST")], lower.panel = NULL)
To account for the effect of possible changes across the league in recent years (e.g., pace of game, total points scored, 3-point era, etc.), we will also view the same statistics spanning across the most recent 5-year period:
substats <- stats[is.element(stats$ADV.Season,c("2018-19","2017-18","2016-17",
"2015-16","2014-15"))]
unique(substats$ADV.Season)
## [1] "2018-19" "2017-18" "2016-17" "2015-16" "2014-15"
par(mfrow=c(3,1))
pairs(substats[,c("W.L","ADV.ORtg","ADV.DRtg","ADV.eFG.", "ADV.NRtg")], lower.panel = NULL)
pairs(substats[,c("W.L","TM.AST","TM.TOV","TM.PTS")], lower.panel = NULL)
pairs(substats[,c("W.L","OPP.DRB","OPP.AST","OPP.PTS")], lower.panel = NULL)
From our analysis of the features within the most recent 5 years of data, our conclusion stands: the following variables show a relationship with W/L%: ‘ORtg’ and ‘DRtg’ (and therefore their net effect, ‘NRtg’), eFG%, AST, PTS, OPP.DRB, OPP.AST, and OPP.PTS.
However, statistical analysis will be performed on these features to determine if these relationships are significant (performed in our ‘Simple Linear Model’ documentation).
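As a preview of what such a check could look like, the sketch below computes Pearson correlations with W/L% and tests one of them against zero. It runs on synthetic data in which W/L% is built to depend on net rating and not on pace; the variable names mirror the dataset, but the values and the assumed relationship are illustrative only.

```r
# Synthetic stand-in (not real NBA data): W/L% driven by net rating, unrelated to pace.
set.seed(123)
n <- 150
toy <- data.frame(NRtg = rnorm(n, 0, 4.5), Pace = rnorm(n, 95, 4))
toy$W.L <- 0.5 + 0.03 * toy$NRtg + rnorm(n, 0, 0.04)

# Pearson correlation of each candidate feature with W/L%.
cors <- sapply(toy[c("NRtg", "Pace")], function(x) cor(x, toy$W.L))

# Test whether the NRtg correlation differs from zero.
nrtg_test <- cor.test(toy$NRtg, toy$W.L)

print(round(cors, 3))
print(nrtg_test$p.value)
```

On the real data, the same calls would take columns such as `stats$ADV.NRtg` and `stats$W.L`, giving a numeric ranking of features to set beside the pairs plots.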
The next section of analysis aims to highlight changes in the overall league season-by-season to examine changes in play-style and the relative importance of various metrics.
SeasonPace.plot <- qplot(stats$ADV.Season, stats$ADV.Pace, geom = c("boxplot"),
xlab = "Season", ylab = "Pace")
SeasonPTS.plot <- qplot(stats$ADV.Season, stats$TM.PTS, geom = c("boxplot"),
xlab = "Season", ylab = "Pts. per Game")
grid.arrange(SeasonPace.plot, SeasonPTS.plot, nrow = 1)
From plotting both Pace and points per game (PTS) by season, we observe a downward trend in the earlier years of the dataset (1980-2000), followed by an upward trend towards faster-paced games in more recent years (i.e., more possessions per game and shorter possession times). However, no clear relationship exists between Pace and W/L% (see scatterplot below).
qplot(stats$W.L, stats$ADV.Pace, xlab = "W/L%", ylab = "Pace")
From plotting both 3-point shots and 2-point shots attempted by season, we see a clear trend in the league for greater reliance on the 3-point shot which explains some of the increase in average total points scored season-by-season.
qplot(stats$TM.3PA,fill = stats$ADV.Season, binwidth = 5,
xlab = "3-Point Attempts", ylab = "Count (Teams)")
qplot(stats$TM.2PA,fill = stats$ADV.Season, binwidth = 7.5,
xlab = "2-Point Attempts", ylab = "Count (Teams)")
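The shift toward the 3-point shot can also be summarized as the share of field-goal attempts taken from 3-point range in each season. The sketch below uses made-up per-game values for two hypothetical seasons; only the column names `TM.3PA` and `TM.2PA` are taken from the dataset.

```r
# Hypothetical per-game team averages (illustrative values, not from the dataset).
toy <- data.frame(
  season = c("2000-01", "2000-01", "2018-19", "2018-19"),
  TM.3PA = c(13, 15, 31, 33),
  TM.2PA = c(67, 65, 58, 56)
)

# Share of field-goal attempts taken from 3-point range, per team-season.
toy$share_3PA <- toy$TM.3PA / (toy$TM.3PA + toy$TM.2PA)

# League-average share per season.
season_share <- tapply(toy$share_3PA, toy$season, mean)
print(round(season_share, 3))
```

A single share per season is easier to compare across years than two overlapping histograms, since it is unaffected by changes in total attempts.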
Field-goal percentage is one measure of efficiency: it compares total field goals made to total field goals attempted. Effective field-goal percentage (eFG%) adjusts for the fact that 3-point shots are worth more than 2-point shots.
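The adjustment is the standard formula eFG% = (FG + 0.5 * 3P) / FGA, which counts each made 3-pointer as one and a half made shots. A short worked example with hypothetical numbers:

```r
# Hypothetical per-game team line (illustrative values, not from the dataset).
FG  <- 40   # field goals made (2-point and 3-point combined)
X3P <- 10   # 3-point field goals made
FGA <- 85   # field goals attempted

fg_pct  <- FG / FGA                 # plain field-goal percentage
efg_pct <- (FG + 0.5 * X3P) / FGA   # each made 3-pointer counts as 1.5 makes

print(round(c(FG_pct = fg_pct, eFG_pct = efg_pct), 3))
```

Two teams with the same field-goal percentage can therefore differ in eFG% if one of them makes more of its shots from 3-point range.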
From the plot below, it appears that championship teams are generally more efficient, landing toward the upper end of the league’s eFG% range in many of the seasons in the dataset.
qplot(stats$ADV.eFG., stats$ADV.Season, color = stats$champs,
xlab = "eFG%", ylab = "Season")
From our analysis, we determine the following:
From our findings, we determine the following for the approach concerning our predictor models: