The NBA has seen an explosion of data. Advancements in tracking technologies and developments of more sophisticated measures of performance encourage richer understandings of what happens between the baselines. Furthermore, the availability and accessability of this data has never been greater, although much remains behind closed doors.
And whether framed as evolution or devolution, the NBA game has changed along with measures of its performance. A cursory search through the numbers and a glance over the trends in season totals confirm that the mainstream identity of an NBA offenses has fluctuated over time. For example, consider the plots below. The first, “Mean Three Point Rate by Year, 1980-2015” and the second, “Mean Pace by Year, 1980-2015”, both illustrate these changes.
This fluctuation problematizes predictive modeling over time with NBA statistics. In 2015, for example, three point makes were strongly correlated to team success (r = 0.60) and total points (r = 0.66). By contrast, three point makes in 1983 had the opposite effect: they were negatively correlated to wins (r = -0.32) and negatively correlated to total points (r = -0.13). To call out two explicit examples, the San Antonio Spurs hit 94 three point shots in 1983 en route to 53 wins, and 22 years later, the Houston Rockets hit 839 more three point shots (933), while winning only three more games (56). Again, the game has changed.
The purpose of this project is to explore how NBA teams have manufactured points over a thirty-five year period (1980-2015). The goal is not to predict precise point totals, but to begin to quantify stylistic changes. The 1979-1980 season is a sensible starting point, as it marked the advent of the three-point shot.
Is it possible to build a single model that transcends these fluctuations and models the competitive balance of the eighties, nineties, tens, and teens? A model that handles the grinding, slow teams of the late nineties as well as the trigger-happy, three point shooting teams of the present? Perhaps. At the very least, the attempt may provide clues to how NBA teams adapt (or fail to adapt) to a relatively fluid competitive landscape. We can pay special attention to the seasons before and after important rule changes such as the hand-check rule introduced during the 2004-2005 season, and the extension of the three-point line before the 1997-1998 season.
The data was scraped from Basketball Reference, a popular sports website that has collected data from every game and team in the NBA’s history. Using the rvest package and the Chrome extension SelectorGadget, I created get_nba_season_data(), a function that takes a single argument, year, and returns a data.frame of 72 variables or features of team data for that season. The source code for the function is below, along with some limited commentary:
library(dplyr)
library(magrittr)
library(rvest)
library(stringr)
library(knitr)
get_nba_season_data <- function(year) {
options(warn = -1) #silences later left_join() warning about coercion to factor
#Concatenate url to scrape
url <- paste0("http://www.basketball-reference.com/leagues/NBA_", year, ".html")
#Get season data
season_data <- url %>%
read_html() %>%
html_nodes("#team") %>%
html_table(header = FALSE, fill = FALSE) %>%
data.frame()
#Clean season data header
names(season_data) <- season_data %>%
slice(1) %>%
str_replace_all("%", "_pct") %>%
str_replace_all("/", "_per_") %>%
make.names(unique = TRUE, allow = TRUE) %>%
str_replace_all("[.]", "") %>%
str_replace_all("1", "_pct") %>%
str_replace_all("X", "") %>%
tolower()
#Clean season data, create new variables: year, playoff
season_data %<>%
slice(-c(1, nrow(.))) %>%
mutate(year = rep(year, nrow(.)),
playoffs = ifelse(grepl("[*]", team), "playoffs", "lottery")) %>%
mutate(team = str_replace_all(team, "[*]", "")) %>%
select(-rk, -g)
#Get miscellaneous data
misc_data <- url %>%
read_html() %>%
html_nodes("#misc") %>%
html_table(header = FALSE, fill = FALSE) %>%
data.frame()
#Clean miscellaneous data header
names(misc_data) <- misc_data %>%
slice(2) %>%
str_replace_all("%", "_pct") %>%
str_replace_all("/", "_per_") %>%
make.names(unique = TRUE, allow = TRUE) %>%
str_replace_all("[.]", "") %>%
str_replace_all("X", "") %>%
tolower()
#Clean miscellaneous data
misc_data %<>%
slice(-c(1:2, nrow(.))) %>%
select(-rk, -arena, -attendance) %>%
mutate(team = str_replace_all(team, "[*]", ""))
#Specify certain 'opponent' variables
names(misc_data)[(ncol(misc_data) - 4):ncol(misc_data)] <-
paste0('opp_', names(misc_data)[(ncol(misc_data) - 4):ncol(misc_data)])
#Join season data and miscellaneous data
all_data <- left_join(season_data, misc_data, by = "team")
#Get oppononent data
opp_data <- url %>%
read_html() %>%
html_nodes("#opponent") %>%
html_table(header = FALSE, fill = FALSE) %>%
data.frame()
#Clean opponent data header
names(opp_data) <- opp_data %>%
slice(1) %>%
str_replace_all("%", "_pct") %>%
str_replace_all("/", "_per_") %>%
make.names(unique = TRUE, allow = TRUE) %>%
str_replace_all("[.]", "") %>%
str_replace_all("X", "") %>%
str_replace_all("[1]", "_pct") %>%
tolower() %>%
paste0("opp_", .)
#Clean opponent data
opp_data %<>%
slice(-c(1, nrow(.))) %>%
select(-c(1, 3:4)) %>%
rename(team = opp_team) %>%
mutate(team = str_replace_all(team, "[*]", ""))
#Join opponent data to larger dataset
all_data <- left_join(all_data, opp_data, by = "team")
#Get wins data from different url
wins_url <- paste0('http://www.basketball-reference.com/leagues/NBA_', year, '_standings.html')
wins_data <- wins_url %>%
read_html() %>%
html_nodes("#expanded-standings td:nth-child(2) ,
#expanded-standings td:nth-child(2) ,
#expanded-standings td:nth-child(3) ,
#expanded-standings .sort_default_asc+ .tooltip") %>%
html_text()
teams <- wins_data[c(TRUE, FALSE)]
wins <- substr(wins_data[c(FALSE, TRUE)], 1, 2)
wins_df <- data.frame(team = teams,
wins = wins)
#Join wins data to larger dataset
all_season_data <- left_join(all_data, wins_df, by = "team")
#Get season champion
champion <- url %>%
read_html() %>%
html_nodes('.sub_index+ p a') %>%
html_text()
#Create champion variable, specify variable order
all_season_data %<>%
mutate(champion = ifelse(team == champion, "Champion", "Non-Champion"))%>%
select(team, playoffs, champion, everything())
#Convert factor classes to character to numeric
all_season_data[,4:ncol(all_season_data)] <-
sapply(all_season_data[,4:ncol(all_season_data)], as.character)
all_season_data[,4:ncol(all_season_data)] <-
sapply(all_season_data[,4:ncol(all_season_data)], as.numeric)
return(all_season_data)
}
Now that we have our function, we can scrape the season data between 1980 and 2015:
years <- 1980:2015
nba_data <- data.frame() #Initialize empty data frame
for (year in years) {
year_data <- get_nba_season_data(year)
nba_data <- rbind(nba_data, year_data) #Bind rows from each function call
}
Our nba_data dataset has 984 observations and 71 features. The features we want to include in the model should exhibit a moderately strong linear relationship with our predictor variable, pts, and here I’ll set the minimum threshold to 0.4:
nba_correlations <- apply(nba_data[sapply(nba_data, is.numeric)],
2,
function(x) with(nba_data, cor(x, pts, use = "complete.obs")))
mod_names <- nba_correlations[abs(nba_correlations) > 0.4]
nba_data_for_model <- nba_data %>%
select_(.dots = names(mod_names))
The remaining features to be used in the model:
| Feature | Description |
|---|---|
| orb | Offensive Rebounds |
| stl | Steals |
| ast | Assists |
| tov | Turnovers |
| pace | Estimate of the number of possessions per 48 minutes |
| opp_fga | Opponent Field Goal Attempts |
| opp_orb | Opponent Offensive Rebounds |
| opp_drb | Opponent Defensive Rebounds |
| opp_ast | Opponent Assists |
| opp_stl | Opponent Steals |
| opp_tov | Opponent Turnovers |
Each of these variables makes intuitive sense. With the exception of pace, they all indicate some measure of a team’s ability to increase their number of possessions, and thus their opportunities to score, or approximate the quality of shot attempts.
Behind the scenes, I removed some additional variables that are also highly correlated to pts such as pts_per_game, ortg (Offensive Rating), pw (Pythagorean Wins), fg_pct (Field Goal Percentage), fga (Field Goal Attempts), and fg (Field Goals). Including these variables in any model is senseless because many of these variables are calculated from or caused by pts.
We can now use the package leaps to evaluate all the best-subset models. We’ll first attempt to build a model using adjusted R-squared as our preferred measure.
library(leaps)
regfit.full <- regsubsets(pts ~ ., data = nba_data_for_model, nvmax = 12)
reg.summary <- summary(regfit.full)
plot(reg.summary$adjr2, xlab = "Number of Variables", ylab = "Adjusted R-Squared")
optimal_variables <- which.max(reg.summary$adjr2)
points(optimal_variables,
reg.summary$adjr2[optimal_variables],
pch=20, col="red")
plot(regfit.full, scale = "adjr2", ylab = "Adjusted R Squared")
The model with the highest adjusted R-squared has nine features and includes ast, stl, tov, opp_fga, opp_orb, opp_drb, opp_ast, opp_stl, and opp_tov. There is, however, no significant change after six variables, so we can safely choose the corresponding model with ast, tov, opp_fga, opp_drb, opp_orb, and opp_tov.
Now we can build the model with those features included:
set.seed(50)
sample_index <- sample(1:2, nrow(nba_data_for_model), replace = TRUE, prob = c(0.80, 0.20))
nba_training <- nba_data_for_model[sample_index == 1,]
nba_testing <- nba_data_for_model[sample_index == 2,]
nba_mod_r2 <- lm(pts ~ ast + tov + opp_fga +
opp_drb + opp_orb + opp_tov, data = nba_training)
summary(nba_mod_r2)
##
## Call:
## lm(formula = pts ~ ast + tov + opp_fga + opp_drb + opp_orb +
## opp_tov, data = nba_training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -594.52 -148.02 -9.68 137.17 617.07
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 96.84688 84.31650 1.149 0.251
## ast 0.29286 0.05537 5.289 1.59e-07 ***
## tov -1.41628 0.08099 -17.487 < 2e-16 ***
## opp_fga 1.68213 0.03942 42.668 < 2e-16 ***
## opp_drb -1.07222 0.06029 -17.785 < 2e-16 ***
## opp_orb -1.79749 0.10231 -17.569 < 2e-16 ***
## opp_tov 1.91615 0.07750 24.726 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 207 on 783 degrees of freedom
## Multiple R-squared: 0.9468, Adjusted R-squared: 0.9464
## F-statistic: 2321 on 6 and 783 DF, p-value: < 2.2e-16
Diagnostic plots:
par(mfrow = c(1, 2))
plot(nba_mod_r2, which = c(1, 2))
Test performance:
predictions_adjr2 <- predict.lm(nba_mod_r2, newdata = nba_testing)
nba_df <- data.frame(predictions = predictions_adjr2,
values = nba_testing$pts)
ggplot(nba_df, aes(predictions, values)) +
geom_point() +
geom_smooth(method = "lm", color = "red") +
labs(x = "Predicted Values", y = "Actual Values") +
theme_minimal()
nba_df %>%
mutate(residual = predictions - values) %>%
summarize(Mean_Difference = mean(abs(residual)))
## Mean_Difference
## 1 169.0154
The model’s point prediction was off by an average of 169 points.
Now let’s rebuild the model using the Bayesian Information Criterion (BIC):
optimal_variables <- which.min(reg.summary$bic)
plot(reg.summary$bic, xlab = "Number of Variables", ylab = "Bayesian Information Criterion")
points(optimal_variables,
reg.summary$bic[optimal_variables],
pch=20, col="red")
plot(regfit.full, scale = "bic", ylab = "Bayesian Information Criterion")
The model with the lowest BIC has only six features and includes ast, tov, opp_fga, opp_orb, opp_drb, and opp_tov, the same variables we ascertained from the previous model.
Earlier I mentioned how team statistics fluctuate over time. Our preferred model was trained and tested on data spanning a thirty-five year period, and it is worth questioning the model’s consistency. Does the model perform better in a particular “era”, or period of NBA history? Using the purrr package, we can split and model the our data by year while extracting the adjusted R-squared for each year:
library(purrr)
year_adjr2 <- nba_data %>%
split(.$year) %>%
map(~ lm(pts ~ ast + tov + opp_fga + opp_orb + opp_drb + opp_tov, data = .)) %>%
map(summary) %>%
map_dbl("adj.r.squared")
df_year_adjr2 <- data.frame(year = as.numeric(names(year_adjr2)),
rsq = unname(year_adjr2))
ggplot(df_year_adjr2, aes(year, rsq)) +
geom_point() +
geom_smooth(color = "red") +
labs(x = "Year", y = "Adjusted R-Squared") +
theme_minimal()
There is a slight trend downward in the adjusted R squared from 1980 to 2000, before the value plateaus, and perhaps even begins another decline. This phenomenon resists easy interpretation. The trend appears to loosely follow the general decline in pace, but another explanation is the stylistic changes from 2000-2015. The early 2000s were marked by heavy doses of isolation play, but towards the end of the decade and soon into the tens, NBA teams attempted and made three pointers more aggressively. Still, the model’s performance is remarkably consistent over the 35 year period.
As mentioned above, teams are shooting more and more three pointers. In the eighties and early nineties, no three-point shooting metric was a strong indicator of offensive production. That changed in the 2000s.
This model takes three predictable variables: threes (Made 3-Point Field Goals), three_rate (3-Point Attempt Rate), and threep_pct (3-Point Percentage).
threes_nba <- nba_data %>%
filter(year != 2012) %>%
split(.$year) %>%
map(~ lm(pts ~ threes + three_rate + threep_pct, data = .)) %>%
map(summary) %>%
map_dbl("adj.r.squared")
df_threes <- data.frame(year = as.numeric(names(threes_nba)),
rsq = unname(threes_nba))
ggplot(df_threes, aes(year, rsq, label = year)) +
geom_point() +
geom_smooth(color = "red") +
labs(x = "Year", y = "Adjusted R-Squared") +
theme_minimal()