The NBA has enjoyed a talent boom over the last 5-10 years. Popularity, attention, and league-wide revenue have been growing, along with punditry and data analysis. Fans of the league are riveted by important, profound questions such as “is Kevin Durant a cupcake?”
More importantly, the NBA is on the cutting edge of data collection in sports. Through its partnership with STATS SportVU, the league uses a system of 6 cameras hung from the ceiling to track ball and player movement 25 times per second. This unstructured data can be used to define metrics such as secondary assists, miles traveled, made field goals defended at the rim, etc.
The Golden State Warriors received high praise last season as the best team ever. They are frequently said to play the right type of basketball, to be unselfish, and to make their teammates better. Can we prove this? Or disprove it?
Basketball is one of the most popular sports in America and worldwide. The strategy on and off the court is fascinating. It’s my personal favorite and I’m excited that we finally have the means to answer the really complex questions that keep fans coming back.
Lucky for us, the NBA has made the data they collect readily available on their website. Unfortunately, they do not make it readily available for download. This required some data wrangling using the jsonlite package.
Variables of interest include:
Other variables will be considered as the model is constructed. I will use data from the 2016-2017 NBA regular season and postseason.
I’ll first try to understand how a player adds value to his team in a quantifiable way.
Then, I can establish a baseline ‘sum of its parts’ for overall team performance and see which teams underperform/overperform.
I can then try to predict which teams will outperform the sum of their parts, or which players make their teams better indirectly.
Ultimately, there are a few assumptions I’ll have to more fully address over the course of this project:
Besides the aforementioned pundits and fans, basketball organizations, players, agents, and everyone else who gets paid from basketball-related revenue should care. Team strategy off the court is proving critical to success. It’s hard to gain an edge in a league with a salary cap, and paying the best players or hiring the best coach doesn’t cut it anymore. Being able to answer questions like these allows teams to find value where others miss it and optimize the resources they have.
Further iterations on this question would allow teams to understand not only which players impact their teammates, but which types of teammates best suit their star players. This could help organizations tailor their teams around a foundation (e.g. how the Cavs build around LeBron James after Kyrie Irving’s request for a trade, or how to build around the new pairing of Chris Paul and James Harden for the Rockets).
# Packages Required
library(tidyverse) # used for several functions in r, including dplyr, ggplot2, etc.
library(jsonlite) # used to convert JSON files to dataframes
library(plotly) # used for making graphs interactive and easier to interpret (especially when overplotted)
library(DT) # used for providing a data dictionary in HTML
library(stringr) # needed to alter some text strings to match observations
# Import json files from web
# List of json URLs
json_urls <- c("http://stats.nba.com/stats/leaguedashplayerstats?College=&Conference=&Country=&DateFrom=&DateTo=&Division=&DraftPick=&DraftYear=&GameScope=&GameSegment=&Height=&LastNGames=0&LeagueID=00&Location=&MeasureType=Advanced&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=PerGame&Period=0&PlayerExperience=&PlayerPosition=&PlusMinus=N&Rank=N&Season=2016-17&SeasonSegment=&SeasonType=Regular+Season&ShotClockRange=&StarterBench=&TeamID=0&VsConference=&VsDivision=&Weight=
",
"http://stats.nba.com/stats/leaguedashplayerstats?College=&Conference=&Country=&DateFrom=&DateTo=&Division=&DraftPick=&DraftYear=&GameScope=&GameSegment=&Height=&LastNGames=0&LeagueID=00&Location=&MeasureType=Scoring&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=PerGame&Period=0&PlayerExperience=&PlayerPosition=&PlusMinus=N&Rank=N&Season=2016-17&SeasonSegment=&SeasonType=Regular+Season&ShotClockRange=&StarterBench=&TeamID=0&VsConference=&VsDivision=&Weight=
",
"http://stats.nba.com/stats/leaguedashptdefend?College=&Conference=&Country=&DateFrom=&DateTo=&DefenseCategory=2+Pointers&Division=&DraftPick=&DraftYear=&GameSegment=&Height=&LastNGames=0&LeagueID=00&Location=&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PerMode=PerGame&Period=0&PlayerExperience=&PlayerPosition=&Season=2016-17&SeasonSegment=&SeasonType=Regular+Season&StarterBench=&TeamID=0&VsConference=&VsDivision=&Weight=
",
"http://stats.nba.com/stats/leaguedashptdefend?College=&Conference=&Country=&DateFrom=&DateTo=&DefenseCategory=3+Pointers&Division=&DraftPick=&DraftYear=&GameSegment=&Height=&LastNGames=0&LeagueID=00&Location=&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PerMode=PerGame&Period=0&PlayerExperience=&PlayerPosition=&Season=2016-17&SeasonSegment=&SeasonType=Regular+Season&StarterBench=&TeamID=0&VsConference=&VsDivision=&Weight=
",
"http://stats.nba.com/stats/leaguedashptdefend?College=&Conference=&Country=&DateFrom=&DateTo=&DefenseCategory=Overall&Division=&DraftPick=&DraftYear=&GameSegment=&Height=&LastNGames=0&LeagueID=00&Location=&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PerMode=PerGame&Period=0&PlayerExperience=&PlayerPosition=&Season=2016-17&SeasonSegment=&SeasonType=Regular+Season&StarterBench=&TeamID=0&VsConference=&VsDivision=&Weight=
",
"http://stats.nba.com/stats/leaguedashplayerstats?College=&Conference=&Country=&DateFrom=&DateTo=&Division=&DraftPick=&DraftYear=&GameScope=&GameSegment=&Height=&LastNGames=0&LeagueID=00&Location=&MeasureType=Base&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=PerGame&Period=0&PlayerExperience=&PlayerPosition=&PlusMinus=N&Rank=N&Season=2016-17&SeasonSegment=&SeasonType=Regular+Season&ShotClockRange=&StarterBench=&TeamID=0&VsConference=&VsDivision=&Weight=
","http://stats.nba.com/stats/leaguedashteamstats?Conference=&DateFrom=&DateTo=&Division=&GameScope=&GameSegment=&LastNGames=0&LeagueID=00&Location=&MeasureType=Base&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=PerGame&Period=0&PlayerExperience=&PlayerPosition=&PlusMinus=N&Rank=N&Season=2016-17&SeasonSegment=&SeasonType=Regular+Season&ShotClockRange=&StarterBench=&TeamID=0&VsConference=&VsDivision=
",
"http://stats.nba.com/stats/leaguedashteamstats?Conference=&DateFrom=&DateTo=&Division=&GameScope=&GameSegment=&LastNGames=0&LeagueID=00&Location=&MeasureType=Advanced&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=PerGame&Period=0&PlayerExperience=&PlayerPosition=&PlusMinus=N&Rank=N&Season=2016-17&SeasonSegment=&SeasonType=Regular+Season&ShotClockRange=&StarterBench=&TeamID=0&VsConference=&VsDivision=
",
"http://stats.nba.com/stats/leaguedashteamstats?Conference=&DateFrom=&DateTo=&Division=&GameScope=&GameSegment=&LastNGames=0&LeagueID=00&Location=&MeasureType=Four+Factors&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=PerGame&Period=0&PlayerExperience=&PlayerPosition=&PlusMinus=N&Rank=N&Season=2016-17&SeasonSegment=&SeasonType=Regular+Season&ShotClockRange=&StarterBench=&TeamID=0&VsConference=&VsDivision=
")
json_names <- list("nba_advanced", "nba_assistedFGs", "nba_d2fgm", "nba_d3fgm", "nba_dfg", "nba_plusminus", "nba_teambasic", "nba_teamadvanced", "nba_fourfactors")
# Create Data Import Loop
data_list <- list(NULL)
for(i in seq_along(json_urls)){
foo <- fromJSON(paste(readLines(json_urls[i])),
simplifyDataFrame = TRUE,
flatten = TRUE)
flat_file <- flatten(foo$resultSets)
foo <- data.frame(flat_file$rowSet)
colnames(foo) <- t(data.frame(flat_file$headers))
data_list[i] <- list(foo)
print(i)
}
data_list <- setNames(data_list, nm = json_names)
# import csv, edit column = name to match json files
nba_rpm <- read_csv(
"https://raw.githubusercontent.com/robert-fields/nba-stats/master/nba_rpm_csv.csv",
col_types = "_?___????")
# keep only the part of Name before the first comma
nba_rpm$Name <- substr(nba_rpm$Name,
1,
str_locate(nba_rpm$Name, ",")[, "start"] - 1)
# match different spellings of names
nba_rpm$Name[nba_rpm$Name == "Nene Hilario"] <- "Nene"
nba_rpm$Name[nba_rpm$Name == "Sheldon Mac"] <- "Sheldon McClellan"
# Merge relevant variables from json files and csv and filter out NA from csv
player_data <-
data_list[[1]] %>%
left_join(data_list[[2]][,c("PLAYER_ID","PCT_AST_FGM","PCT_UAST_FGM")],
by = "PLAYER_ID") %>%
left_join(data_list[[3]][,c("PLAYER_NAME","FG2M")],
by = "PLAYER_NAME") %>%
left_join(data_list[[4]][,c("PLAYER_NAME","FG3M")],
by = "PLAYER_NAME") %>%
left_join(data_list[[5]][,c("PLAYER_NAME","D_FG_PCT")],
by = "PLAYER_NAME") %>%
left_join(data_list[[6]][,c("PLAYER_ID","PLUS_MINUS")],
by = "PLAYER_ID") %>%
left_join(nba_rpm, by = c("PLAYER_NAME" = "Name")) %>%
filter(!is.na(RPM))
# Merge team data
team_data <-
data_list[[9]] %>%
left_join(data_list[[7]][,c("TEAM_ID","PLUS_MINUS")],
by = "TEAM_ID") %>%
left_join(data_list[[8]][,c("TEAM_ID","TS_PCT","PIE",
"NET_RATING","OFF_RATING","DEF_RATING",
"AST_PCT","AST_TO","AST_RATIO")],
by = "TEAM_ID")
# clean up workspace
rm(list = setdiff(ls(), c("player_data","team_data")))
# change vectors to numerics
player_data[5:67] <- sapply(player_data[5:67],
function(x) as.numeric(as.character(x)))
team_data[3:39] <- sapply(team_data[3:39],
function(x) as.numeric(as.character(x)))
# filter out unneeded variables, match team and player data, add variables
player_data <-
player_data %>%
select(-(GP_RANK:CFPARAMS)) %>%
left_join(mutate(team_data,
TEAM_PIE = PIE,
TEAM_PLUS_MINUS = PLUS_MINUS,
TEAM_WINS = W,
TEAM_MIN = MIN,
TEAM_NET_RATING = NET_RATING,
TEAM_OFF_RATING = OFF_RATING,
TEAM_DEF_RATING = DEF_RATING)
[,c("TEAM_ID","TEAM_PIE","TEAM_PLUS_MINUS","TEAM_WINS","TEAM_MIN",
"TEAM_NET_RATING","TEAM_OFF_RATING","TEAM_DEF_RATING")],
by = "TEAM_ID") %>%
mutate(TEAM_RPM_SHARE = (RPM * (GP * MIN) / (TEAM_MIN)))
team_data <-
team_data %>%
select(-(GP_RANK:CFPARAMS)) %>%
left_join(
aggregate(player_data$TEAM_RPM_SHARE ~ player_data$TEAM_ID,
player_data, sum),
by = c("TEAM_ID" = "player_data$TEAM_ID")) %>%
rename("TEAM_RPM" = "player_data$TEAM_RPM_SHARE") %>%
left_join(
aggregate(player_data$WINS ~ player_data$TEAM_ID,player_data, sum),
by = c("TEAM_ID" = "player_data$TEAM_ID")) %>%
rename("TEAM_RPM_WINS" = "player_data$WINS")
player_data <- player_data %>%
left_join(mutate(team_data,
TEAM_RPM_WINS = TEAM_RPM_WINS,
TEAM_RPM = TEAM_RPM)[,c("TEAM_ID","TEAM_RPM_WINS",
"TEAM_RPM")],
by = "TEAM_ID")
I have included a data dictionary with details on each variable below each data set on their respective tabs. A more in-depth look at different statistics used for NBA data can be found in the NBA stats glossary.
I had to compile two data sets, one for Team data and one for Player data. I did this by locating the JSON URLs behind the stats.nba.com pages and using the “jsonlite” package to read this data and flatten it into data frames. I used a loop to tidy up this repetitive process.
data_list <- list(NULL)
for(i in seq_along(json_urls)){
foo <- fromJSON(paste(readLines(json_urls[i])),
simplifyDataFrame = TRUE,
flatten = TRUE)
flat_file <- flatten(foo$resultSets)
foo <- data.frame(flat_file$rowSet)
colnames(foo) <- t(data.frame(flat_file$headers))
data_list[i] <- list(foo)
print(i)
}
data_list <- setNames(data_list, nm = json_names)
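To make the shape of the payload concrete, here is a minimal sketch that parses a small inline JSON string instead of calling stats.nba.com. The field names `resultSets`, `headers`, and `rowSet` come from the loop above; the toy values and the helper name `result_set_to_df` are made up for illustration. The sketch also parses with `simplifyVector = FALSE` and builds the data frame by hand, rather than relying on jsonlite's automatic simplification as the loop does.

```r
library(jsonlite)

# Toy payload mimicking the structure the import loop relies on.
# The values are invented; only the field names come from the code above.
toy_json <- '{
  "resultSets": [{
    "name": "ToyStats",
    "headers": ["PLAYER_ID", "PLAYER_NAME", "MIN"],
    "rowSet": [[1, "Player A", 30.1],
               [2, "Player B", null]]
  }]
}'

# Hypothetical helper: turn one result set into a data frame of strings
result_set_to_df <- function(result_set) {
  headers <- vapply(result_set$headers, as.character, character(1))
  rows <- lapply(result_set$rowSet, function(row) {
    # JSON null arrives as NULL; keep it as NA so every row has equal length
    vapply(row,
           function(v) if (is.null(v)) NA_character_ else as.character(v),
           character(1))
  })
  df <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
  names(df) <- headers
  df
}

parsed <- fromJSON(toy_json, simplifyVector = FALSE)
toy_df <- result_set_to_df(parsed$resultSets[[1]])

# Every column arrives as character, so numeric columns need converting,
# which is the same reason the main script calls as.numeric(as.character(x))
toy_df$MIN <- as.numeric(toy_df$MIN)
str(toy_df)
```

Building the rows by hand is more verbose than the loop, but it makes explicit where missing values and type conversions come from.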
I combined all of these data frames into two single data sets. I also pulled Real Plus/Minus data from ESPN, converted it into a .csv and imported that. I was able to merge all the JSON dataframes and the csv based on either a Player ID from nba.com or the Player Name. I also joined data from the team_data set to the player_data set to make some comparisons between teams and players.
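Joining on player names is more fragile than joining on IDs, since spellings can differ between sources. Below is a small sketch of how unmatched names could be spotted before merging, using `dplyr::anti_join`. Both toy tables and the player names in them are invented for illustration; only the column names `PLAYER_NAME`, `Name`, and `RPM` follow the script.

```r
library(dplyr)

# Invented stand-ins for the nba.com table and the RPM csv
nba_toy <- tibble(PLAYER_ID   = c("1", "2"),
                  PLAYER_NAME = c("Jon Smith", "Player B"))
rpm_toy <- tibble(Name = c("Jonathan Smith", "Player B"),
                  RPM  = c(1.0, -0.5))

# Rows of the RPM table whose name has no match in the nba.com table
unmatched <- anti_join(rpm_toy, nba_toy, by = c("Name" = "PLAYER_NAME"))
unmatched$Name

# Recode the mismatched spelling, then join as the script does
rpm_toy$Name[rpm_toy$Name == "Jonathan Smith"] <- "Jon Smith"
merged <- left_join(nba_toy, rpm_toy, by = c("PLAYER_NAME" = "Name"))
merged
```

Running the `anti_join` check before the merge shows which rows would otherwise end up with a missing RPM and be dropped by the later `filter(!is.na(RPM))`.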
I created a variable TEAM_RPM_SHARE in the player_data set that weighted a player’s RPM by the minutes he played during the season. I then aggregated this by team to create a TEAM_RPM variable in the team_data set. This is the variable I use as a baseline for team performance based on the players they have on their team. I did the same thing with RPM WINS, creating the TEAM_RPM_WINS variable in the team_data set.
mutate(player_data, TEAM_RPM_SHARE = (RPM * (GP * MIN) / (TEAM_MIN)))
left_join(team_data,
aggregate(player_data$TEAM_RPM_SHARE ~ player_data$TEAM_ID,
player_data, sum),
by = c("TEAM_ID" = "player_data$TEAM_ID")) %>%
rename("TEAM_RPM" = "player_data$TEAM_RPM_SHARE")
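A worked example may make the weighting easier to follow. Each player's RPM is scaled by his share of the team's minutes (`GP * MIN / TEAM_MIN`), and the scaled values are summed per team. The numbers below are invented for a single hypothetical team, and the sketch uses `group_by()` and `summarise()` in place of `aggregate()`.

```r
library(dplyr)

# Invented numbers for one hypothetical team; column names follow the script
toy_players <- tibble(
  TEAM_ID  = "T1",
  PLAYER   = c("A", "B", "C"),
  RPM      = c(4.0, -1.0, 0.5),
  GP       = c(80, 60, 40),   # games played
  MIN      = c(30, 20, 10),   # minutes per game
  TEAM_MIN = 4000             # team minutes over the season
)

# Minutes-weighted contribution of each player
toy_players <- toy_players %>%
  mutate(TEAM_RPM_SHARE = RPM * (GP * MIN) / TEAM_MIN)

# Player A: 4.0 * 2400 / 4000 =  2.40
# Player B: -1.0 * 1200 / 4000 = -0.30
# Player C: 0.5 *  400 / 4000 =  0.05
toy_team <- toy_players %>%
  group_by(TEAM_ID) %>%
  summarise(TEAM_RPM = sum(TEAM_RPM_SHARE))

toy_team$TEAM_RPM  # 2.15
```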
I dropped any unnecessary variables, leaving 26 variables for Team data and 50 variables for Player data (as the summaries below show) that might be of interest later on.
## Observations: 30
## Variables: 26
## $ TEAM_ID <fctr> 1610612737, 1610612738, 1610612751, 1610612766,...
## $ TEAM_NAME <fctr> Atlanta Hawks, Boston Celtics, Brooklyn Nets, C...
## $ GP <dbl> 82, 82, 82, 82, 82, 82, 82, 82, 82, 82, 82, 82, ...
## $ W <dbl> 43, 53, 20, 36, 41, 51, 33, 40, 37, 67, 55, 42, ...
## $ L <dbl> 39, 29, 62, 46, 41, 31, 49, 42, 45, 15, 27, 40, ...
## $ W_PCT <dbl> 0.524, 0.646, 0.244, 0.439, 0.500, 0.622, 0.402,...
## $ MIN <dbl> 3976, 3951, 3951, 3966, 3956, 3976, 3956, 3951, ...
## $ EFG_PCT <dbl> 0.504, 0.525, 0.507, 0.501, 0.487, 0.547, 0.505,...
## $ FTA_RATE <dbl> 0.295, 0.273, 0.289, 0.279, 0.259, 0.275, 0.225,...
## $ TM_TOV_PCT <dbl> 0.157, 0.133, 0.159, 0.117, 0.138, 0.137, 0.126,...
## $ OREB_PCT <dbl> 0.236, 0.212, 0.196, 0.199, 0.270, 0.219, 0.181,...
## $ OPP_EFG_PCT <dbl> 0.507, 0.503, 0.513, 0.523, 0.507, 0.516, 0.529,...
## $ OPP_FTA_RATE <dbl> 0.232, 0.290, 0.275, 0.211, 0.219, 0.226, 0.282,...
## $ OPP_TOV_PCT <dbl> 0.153, 0.139, 0.128, 0.132, 0.137, 0.127, 0.157,...
## $ OPP_OREB_PCT <dbl> 0.239, 0.247, 0.239, 0.204, 0.232, 0.242, 0.224,...
## $ PLUS_MINUS <dbl> -0.9, 2.6, -6.7, 0.2, 0.4, 3.2, -2.9, 0.5, -1.1,...
## $ TS_PCT <dbl> 0.541, 0.567, 0.551, 0.547, 0.530, 0.580, 0.541,...
## $ PIE <dbl> 0.506, 0.514, 0.460, 0.505, 0.504, 0.510, 0.479,...
## $ NET_RATING <dbl> -0.8, 3.1, -6.1, 0.3, 0.1, 2.9, -2.6, -0.5, -2.0...
## $ OFF_RATING <dbl> 102.3, 108.6, 101.9, 106.4, 104.6, 110.9, 103.7,...
## $ DEF_RATING <dbl> 103.1, 105.5, 108.0, 106.1, 104.5, 108.0, 106.3,...
## $ AST_PCT <dbl> 0.621, 0.653, 0.566, 0.611, 0.584, 0.567, 0.574,...
## $ AST_TO <dbl> 1.50, 1.90, 1.29, 2.01, 1.66, 1.66, 1.75, 1.69, ...
## $ AST_RATIO <dbl> 17.5, 18.9, 16.0, 17.7, 17.0, 17.2, 16.9, 18.3, ...
## $ TEAM_RPM <dbl> -0.3893458, 2.9810554, -6.5075485, 1.8591311, -0...
## $ TEAM_RPM_WINS <dbl> 41.23, 46.97, 21.13, 41.66, 35.50, 56.19, 28.86,...
## Observations: 464
## Variables: 50
## $ PLAYER_ID <fctr> 1627773, 201166, 203932, 1626151, 203940, 2...
## $ PLAYER_NAME <chr> "AJ Hammons", "Aaron Brooks", "Aaron Gordon"...
## $ TEAM_ID <fctr> 1610612742, 1610612754, 1610612753, 1610612...
## $ TEAM_ABBREVIATION <fctr> DAL, IND, ORL, CHA, MIN, BOS, IND, POR, LAC...
## $ AGE <dbl> 24, 32, 21, 22, 26, 31, 32, 26, 34, 24, 25, ...
## $ GP <dbl> 22, 65, 80, 5, 18, 68, 66, 61, 30, 47, 42, 6...
## $ W <dbl> 4, 36, 29, 2, 5, 46, 33, 33, 20, 11, 26, 37,...
## $ L <dbl> 18, 29, 51, 3, 13, 22, 33, 28, 10, 36, 16, 3...
## $ W_PCT <dbl> 0.182, 0.554, 0.363, 0.400, 0.278, 0.676, 0....
## $ MIN <dbl> 7.4, 13.7, 28.7, 3.3, 7.5, 32.3, 14.1, 29.1,...
## $ OFF_RATING <dbl> 102.2, 101.5, 105.4, 83.3, 102.6, 110.7, 102...
## $ DEF_RATING <dbl> 102.8, 104.6, 108.2, 101.9, 101.8, 105.8, 10...
## $ NET_RATING <dbl> -0.6, -3.0, -2.8, -18.6, 0.8, 5.0, -5.8, 1.8...
## $ AST_PCT <dbl> 0.038, 0.216, 0.097, 0.375, 0.089, 0.239, 0....
## $ AST_TO <dbl> 0.40, 1.89, 1.69, 0.00, 0.88, 2.93, 1.73, 1....
## $ AST_RATIO <dbl> 6.2, 24.6, 12.5, 38.1, 9.0, 25.7, 9.5, 13.8,...
## $ OREB_PCT <dbl> 0.049, 0.022, 0.054, 0.000, 0.069, 0.049, 0....
## $ DREB_PCT <dbl> 0.199, 0.064, 0.141, 0.200, 0.200, 0.183, 0....
## $ REB_PCT <dbl> 0.119, 0.043, 0.096, 0.094, 0.132, 0.118, 0....
## $ TM_TOV_PCT <dbl> 15.4, 13.0, 7.4, 0.0, 10.3, 8.8, 5.5, 13.1, ...
## $ EFG_PCT <dbl> 0.464, 0.483, 0.499, 0.000, 0.454, 0.527, 0....
## $ TS_PCT <dbl> 0.472, 0.507, 0.530, 0.102, 0.505, 0.553, 0....
## $ USG_PCT <dbl> 0.167, 0.191, 0.200, 0.142, 0.224, 0.199, 0....
## $ PACE <dbl> 95.66, 96.55, 99.70, 92.43, 96.52, 98.96, 96...
## $ PIE <dbl> 0.043, 0.062, 0.088, 0.000, 0.082, 0.125, 0....
## $ FGM <dbl> 17, 121, 393, 0, 23, 379, 235, 183, 30, 138,...
## $ FGA <dbl> 42, 300, 865, 4, 54, 801, 471, 466, 80, 267,...
## $ FGM_PG <dbl> 0.8, 1.9, 4.9, 0.0, 1.3, 5.6, 3.6, 3.0, 1.0,...
## $ FGA_PG <dbl> 1.9, 4.6, 10.8, 0.8, 3.0, 11.8, 7.1, 7.6, 2....
## $ FG_PCT <dbl> 0.405, 0.403, 0.454, 0.000, 0.426, 0.473, 0....
## $ PCT_AST_FGM <dbl> 0.941, 0.289, 0.570, 0.000, 0.696, 0.678, 0....
## $ PCT_UAST_FGM <dbl> 0.059, 0.711, 0.430, 0.000, 0.304, 0.322, 0....
## $ FG2M <dbl> 1.59, 1.43, 2.91, 0.75, 1.24, 5.63, 2.43, 3....
## $ FG3M <dbl> 0.18, 0.75, 1.01, NA, 0.24, 1.19, 0.32, 1.20...
## $ D_FG_PCT <dbl> 0.476, 0.455, 0.437, 0.600, 0.439, 0.467, 0....
## $ PLUS_MINUS <dbl> -0.2, -0.5, -2.0, -1.0, 0.6, 3.2, -1.5, 0.4,...
## $ ORPM <dbl> -2.77, -1.81, 1.25, -1.42, -1.69, 0.76, -1.0...
## $ DRPM <dbl> 1.27, -1.47, -0.78, -0.32, 1.36, 1.06, -0.45...
## $ RPM <dbl> -1.50, -3.28, 0.47, -1.74, -0.33, 1.82, -1.4...
## $ WINS <dbl> 0.16, -0.09, 5.32, 0.01, 0.24, 6.93, 0.98, 4...
## $ TEAM_PIE <dbl> 0.479, 0.505, 0.460, 0.505, 0.501, 0.514, 0....
## $ TEAM_PLUS_MINUS <dbl> -2.9, -0.2, -6.6, 0.2, -1.1, 2.6, -0.2, -0.5...
## $ TEAM_WINS <dbl> 33, 42, 29, 36, 31, 53, 42, 41, 51, 24, 51, ...
## $ TEAM_MIN <dbl> 3956, 3971, 3961, 3966, 3961, 3951, 3971, 39...
## $ TEAM_NET_RATING <dbl> -2.6, -0.1, -6.8, 0.3, -1.0, 3.1, -0.1, 0.0,...
## $ TEAM_OFF_RATING <dbl> 103.7, 106.2, 101.2, 106.4, 108.1, 108.6, 10...
## $ TEAM_DEF_RATING <dbl> 106.3, 106.3, 108.0, 106.1, 109.1, 105.5, 10...
## $ TEAM_RPM_SHARE <dbl> -0.061729019, -0.735542684, 0.272436253, -0....
## $ TEAM_RPM_WINS <dbl> 28.86, 38.70, 20.55, 41.66, 36.75, 46.97, 38...
## $ TEAM_RPM <dbl> -2.7163504, -0.1413704, -7.2108700, 1.859131...
datatable(player_data,
extensions = c("FixedColumns","Buttons","KeyTable","Scroller"),
options = list(orderClasses = TRUE,
fixedColumns = list(leftColumns = 5),
dom = "Bfrtip",
buttons = c("copy","csv","excel","colvis"),
keys = TRUE,
deferRender = FALSE,
scrollY = 750,
scrollX = 1000,
scroller = TRUE,
filter = "top",
orderMulti = TRUE,
columnDefs =
list(list(visible = FALSE, targets = c(1,3,5:50)))))
#Creating Data Dictionary
VarType<- c("PLAYER_NAME",
"TEAM_ABBREVIATION",
"GP",
"W",
"L",
"MIN",
"OFF_RATING",
"DEF_RATING",
"NET_RATING",
"AST_PCT",
"AST_TO",
"AST_RATIO",
"OREB_PCT",
"DREB_PCT",
"REB_PCT",
"TM_TOV_PCT",
"EFG_PCT",
"TS_PCT",
"USG_PCT",
"PACE",
"PIE",
"FG_PCT",
"PCT_UAST_FGM",
"DFG2M",
"DFG3M",
"D_FG_PCT",
"PLUS_MINUS",
"ORPM",
"DRPM",
"RPM",
"WINS",
"TEAM_PIE",
"TEAM_PLUS_MINUS",
"TEAM_WINS",
"TEAM_RPM_WINS",
"TEAM_RPM")
VarDesc<-c("Player Name",
"3-letter abbreviation of team name",
"Games Played",
"Wins",
"Losses",
"Minutes Played",
"Offensive Rating (points per 100 possessions while player is on court)",
"Defensive Rating (points allowed per 100 possessions while player is on court)",
"Net Rating (point differential per 100 possessions while player is on court)",
"Assist Percentage (percentage of shots a player assists (AST / TmFGM - FGM))",
"Assist to Turnover Ratio",
"Assist Ratio (number of assists per 100 possessions used)",
"Offensive Rebound Percentage (percentage of available offensive rebounds a player grabbed while on floor)",
"Defensive Rebound Percentage (percentage of available defensive rebounds a player grabbed while on floor)",
"Overall Rebound Percentage (percentage of total rebounds a player grabbed while on floor)",
"Team Turnover Percentage (turnovers per 100 possessions used by a player)",
"Effective Field Goal Percentage (adjusts field goal percentage based on higher expected value of 3-pointers)",
"True Shooting Percentage (metric of shooting efficiency (points / points possible on possessions with FG or FT attempt))",
"Usage Rate (possessions used by a player when on the floor ((FGA + (.44*FTA) + TO) / Possessions))",
"Pace of play (number of possessions per 48 minutes)",
"Player Impact Estimate (comparable to other advanced statistic ratings e.g. PER)",
"Field Goal Percentage",
"Percentage of FGs made that were unassisted",
"Made 2-point FGs a player defended",
"Made 3-point FGs a player defended",
"Defensive Field Goal Percentage",
"Plus Minus (point differential while player is on the floor)",
"Offensive Real Plus Minus",
"Defensive Real Plus Minus",
"Real Plus Minus",
"RPM Wins (similar to win share, or number of wins a player contributes to his team's win total)",
"Team Player Impact Estimate (overall PIE of the team a player is on)",
"Team Plus Minus (Overall Plus Minus of team a player is on)",
"Team Wins (overall wins of the team a player is on, including games in which a player did not play)",
"Aggregate of RPM Wins (predicted wins a player contributes to a team)",
"Aggregate of RPM for all players on a team, weighted by minutes played")
Data_Dictionary<- as.data.frame(cbind(VarType,VarDesc))
colnames(Data_Dictionary)<-c("Variable Name","Variable Description")
knitr::kable(Data_Dictionary, caption = "Data Dictionary of relevant variables")

| Variable Name | Variable Description |
|---|---|
| PLAYER_NAME | Player Name |
| TEAM_ABBREVIATION | 3-letter abbreviation of team name |
| GP | Games Played |
| W | Wins |
| L | Losses |
| MIN | Minutes Played |
| OFF_RATING | Offensive Rating (points per 100 possessions while player is on court) |
| DEF_RATING | Defensive Rating (points allowed per 100 possessions while player is on court) |
| NET_RATING | Net Rating (point differential per 100 possessions while player is on court) |
| AST_PCT | Assist Percentage (percentage of shots a player assists (AST / TmFGM - FGM)) |
| AST_TO | Assist to Turnover Ratio |
| AST_RATIO | Assist Ratio (number of assists per 100 possessions used) |
| OREB_PCT | Offensive Rebound Percentage (percentage of available offensive rebounds a player grabbed while on floor) |
| DREB_PCT | Defensive Rebound Percentage (percentage of available defensive rebounds a player grabbed while on floor) |
| REB_PCT | Overall Rebound Percentage (percentage of total rebounds a player grabbed while on floor) |
| TM_TOV_PCT | Team Turnover Percentage (turnovers per 100 possessions used by a player) |
| EFG_PCT | Effective Field Goal Percentage (adjusts field goal percentage based on higher expected value of 3-pointers) |
| TS_PCT | True Shooting Percentage (metric of shooting efficiency (points / points possible on possessions with FG or FT attempt)) |
| USG_PCT | Usage Rate (possessions used by a player when on the floor ((FGA + (.44*FTA) + TO) / Possessions)) |
| PACE | Pace of play (number of possessions per 48 minutes) |
| PIE | Player Impact Estimate (comparable to other advanced statistic ratings e.g. PER) |
| FG_PCT | Field Goal Percentage |
| PCT_UAST_FGM | Percentage of FGs made that were unassisted |
| DFG2M | Made 2-point FGs a player defended |
| DFG3M | Made 3-point FGs a player defended |
| D_FG_PCT | Defensive Field Goal Percentage |
| PLUS_MINUS | Plus Minus (point differential while player is on the floor) |
| ORPM | Offensive Real Plus Minus |
| DRPM | Defensive Real Plus Minus |
| RPM | Real Plus Minus |
| WINS | RPM Wins (similar to win share, or number of wins a player contributes to his team’s win total) |
| TEAM_PIE | Team Player Impact Estimate (overall PIE of the team a player is on) |
| TEAM_PLUS_MINUS | Team Plus Minus (Overall Plus Minus of team a player is on) |
| TEAM_WINS | Team Wins (overall wins of the team a player is on, including games in which a player did not play) |
| TEAM_RPM_WINS | Aggregate of RPM Wins (predicted wins a player contributes to a team) |
| TEAM_RPM | Aggregate of RPM for all players on a team, weighted by minutes played |
datatable(team_data,
extensions = c("FixedColumns","Buttons","KeyTable","Scroller"),
options = list(orderClasses = TRUE,
fixedColumns = list(leftColumns = 3),
dom = "Bfrtip",
buttons = c("copy","csv","excel","colvis"),
keys = TRUE,
deferRender = FALSE,
scrollY = 750,
scrollX = 1000,
scroller = TRUE,
filter = "top",
orderMulti = TRUE,
columnDefs =
list(list(visible = FALSE, targets = c(1,3:26)))))
VarType<- c("TEAM_NAME",
"W",
"L",
"W_PCT",
"MIN",
"EFG_PCT",
"TM_TOV_PCT",
"OREB_PCT",
"OPP_EFG_PCT",
"OPP_TOV_PCT",
"OPP_OREB_PCT",
"PLUS_MINUS",
"TS_PCT",
"PIE",
"NET_RATING",
"AST_PCT",
"AST_TO",
"AST_RATIO",
"TEAM_RPM")
VarDesc<-c("Team Name",
"Wins",
"Losses",
"Win Percentage",
"Minutes",
"Effective Field Goal Percentage (adjusts field goal percentage based on higher expected value of 3-pointers)",
"Team Turnover Percentage (turnovers per 100 possessions)",
"Offensive Rebound Percentage (percentage of available offensive rebounds a team grabs)",
"Opponent's Effective Field Goal Percentage",
"Opponent's Turnover Percentage",
"Opponent's Offensive Rebound Percentage",
"Plus Minus (overall point differential)",
"True Shooting Percentage (metric of shooting efficiency (points / points possible on possessions with FG or FT attempt))",
"Player Impact Estimate",
"Net Rating",
"Assist Percentage",
"Assist to Turnover Ratio",
"Assist Ratio (number of assists per 100 possessions)",
"Team Real Plus Minus")
Data_Dictionary<- as.data.frame(cbind(VarType,VarDesc))
colnames(Data_Dictionary)<-c("Variable Name","Description")
knitr::kable(Data_Dictionary, caption = "Data Dictionary of relevant variables")

| Variable Name | Description |
|---|---|
| TEAM_NAME | Team Name |
| W | Wins |
| L | Losses |
| W_PCT | Win Percentage |
| MIN | Minutes |
| EFG_PCT | Effective Field Goal Percentage (adjusts field goal percentage based on higher expected value of 3-pointers) |
| TM_TOV_PCT | Team Turnover Percentage (turnovers per 100 possessions) |
| OREB_PCT | Offensive Rebound Percentage (percentage of available offensive rebounds a team grabs) |
| OPP_EFG_PCT | Opponent’s Effective Field Goal Percentage |
| OPP_TOV_PCT | Opponent’s Turnover Percentage |
| OPP_OREB_PCT | Opponent’s Offensive Rebound Percentage |
| PLUS_MINUS | Plus Minus (overall point differential) |
| TS_PCT | True Shooting Percentage (metric of shooting efficiency (points / points possible on possessions with FG or FT attempt)) |
| PIE | Player Impact Estimate |
| NET_RATING | Net Rating |
| AST_PCT | Assist Percentage |
| AST_TO | Assist to Turnover Ratio |
| AST_RATIO | Assist Ratio (number of assists per 100 possessions) |
| TEAM_RPM | Team Real Plus Minus |
Some preliminary analysis shows that there are noticeable differences in the distributions of player talent across teams, which should be expected.
This follows conventional wisdom (the Warriors are good, the Sixers not so much), but a few interesting points deserve further analysis, such as whether the Warriors are better or worse than the sum of their parts (comparing the sum of player RPM, a predicted value, vs. NET_RATING, an observed value), and how other teams compare in the same context.
# boxplot of player RPM on each team
ggplotly(
player_data %>%
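# reorder() sets the axis order; arrange() alone does not reorder a discrete axis
ggplot(aes(reorder(TEAM_ABBREVIATION, -TEAM_PLUS_MINUS), RPM)) +
xlab("TEAM_ABBREVIATION") +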
geom_boxplot())
# density plot of player RPM on each team
ggplotly(
player_data %>%
arrange(desc(TEAM_PLUS_MINUS)) %>%
ggplot(aes(RPM, fill = TEAM_ABBREVIATION)) +
geom_density(alpha = .2))
In thinking through how to approach this analysis, I settled on two different questions:
For the first question, I compared TEAM_RPM vs. NET_RATING for each team. I also compared TEAM_RPM_WINS vs. W. The RPM statistics are produced by ESPN and adjusted to be a representation of a player’s contribution to his team, independent of teammates or opponents. Therefore, when aggregated these statistics should represent the sum of each player’s individual contribution.
I plotted both of these with a line with slope = 1 (a null hypothesis that teams equal the sum of their parts). I also plotted these with a geom_smooth line of best fit to show how teams might compare to other teams in the league.
ggplotly(
ggplot(team_data,aes(TEAM_RPM,NET_RATING)) +
geom_point(aes(color = TEAM_NAME)) +
geom_smooth(method = "glm") +
geom_abline()
)
ggplotly(
ggplot(team_data,aes(TEAM_RPM_WINS,W)) +
geom_point(aes(color = TEAM_NAME)) +
geom_smooth(method = "glm") +
geom_abline())
These graphs indicate that teams like the Utah Jazz and San Antonio Spurs may be some of the bigger overachievers.
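One way to put a number on over- or underachieving is the vertical distance from the slope = 1 line, i.e. the observed value minus the RPM-based baseline. The sketch below uses the first three teams from the team summary shown earlier; the column names `RATING_GAP` and `WINS_GAP` are hypothetical and not part of the original data set.

```r
# First three teams from the team_data summary shown earlier
teams <- data.frame(
  TEAM_NAME     = c("Atlanta Hawks", "Boston Celtics", "Brooklyn Nets"),
  NET_RATING    = c(-0.8, 3.1, -6.1),
  TEAM_RPM      = c(-0.3893458, 2.9810554, -6.5075485),
  W             = c(43, 53, 20),
  TEAM_RPM_WINS = c(41.23, 46.97, 21.13)
)

# Positive gap: the team sits above the slope = 1 line (better than its baseline)
teams$RATING_GAP <- teams$NET_RATING - teams$TEAM_RPM
teams$WINS_GAP   <- teams$W - teams$TEAM_RPM_WINS

# Sort from biggest overachiever to biggest underachiever by wins
teams[order(-teams$WINS_GAP), c("TEAM_NAME", "RATING_GAP", "WINS_GAP")]
```

Among these three teams, Boston shows the largest positive wins gap (53 actual wins against 46.97 RPM wins). The two measures need not agree in sign, so it is worth looking at both.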
The NET_RATING chart suggests that while the Warriors might not be maximizing the sum of the talent on that team, they are still doing better than league average.
It is also worth observing where the slope = 1 line and the best fit line intersect, which suggests that it’s easier for bad teams to outperform their talent level. Additionally, the gap is much smaller when observing NET_RATING (a points measure) instead of wins, suggesting that point differentials are better representations of talent levels (but ultimately, wins are still what teams are after).
This might suggest there are diminishing marginal returns to talent and how efficiently it translates to points/wins.
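The crossing point of the two lines follows directly from the fitted coefficients: a fit of the form `NET_RATING = b0 + b1 * TEAM_RPM` meets the slope = 1 line where `TEAM_RPM = b0 / (1 - b1)`. The sketch below illustrates this with invented points that lie exactly on a line; it is not the real team data, and it uses `lm()` rather than the `geom_smooth()` fit from the plots.

```r
# Invented points lying exactly on y = 0.5 + 0.8 * x; not real team data
toy <- data.frame(TEAM_RPM = c(-6, -2, 0, 3, 6))
toy$NET_RATING <- 0.5 + 0.8 * toy$TEAM_RPM

fit <- lm(NET_RATING ~ TEAM_RPM, data = toy)
b0 <- unname(coef(fit)[1])  # intercept
b1 <- unname(coef(fit)[2])  # slope

# Solve b0 + b1 * x = x for x (only defined when the slope is not 1)
crossing <- b0 / (1 - b1)
crossing  # 2.5 for these toy points
```

With a slope below 1, as in this toy case, the fitted line sits above the slope = 1 line to the left of the crossing and below it to the right, which is the pattern the diminishing-returns reading relies on.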
I’d like to build a model that looks into what players do on the court (i.e. shooting, passing, defensive statistics, etc.) and how this contributes to the synergy of a team. Furthermore, I’d like to see which player lineups are most effective at pushing a team past its “sum of its parts” baseline.
Potential ideas to explore:
This is an ongoing project that I will continue to work on in my spare time.