Introduction

What’s the question?

Which NBA players make their teammates better?

The NBA has enjoyed a talent boom over the last 5-10 years. Popularity, attention, and league-wide revenue have been growing, as have punditry and data analysis. Fans of the league are riveted by important, profound questions such as “is Kevin Durant a cupcake?”

More importantly, the NBA is on the cutting edge of data collection in sports. Through its partnership with STATS SportVU, the league tracks games with a system of 6 cameras hung from the ceiling that records ball and player movement 25 times per second. This unstructured data can be used to define metrics such as secondary assists, miles traveled, made field goals defended at the rim, etc.

The Golden State Warriors received high praise last season as the best team ever. They are frequently said to play the right type of basketball, to be unselfish, and to make their teammates better, but can we prove this? Or disprove it?

Basketball is one of the most popular sports in America and worldwide. The strategy on and off the court is fascinating. It’s my personal favorite and I’m excited that we finally have the means to answer the really complex questions that keep fans coming back.

What data do we need?

Lucky for us, the NBA has made the data they collect readily available on their website. Unfortunately, they do not make it readily available for download. This required some data wrangling using the jsonlite package.

Variables of interest include:

Real Plus/Minus
Points Scored
Points Scored Against
Player Efficiency Rating (PER)
Usage Rates
Effective Field Goal %
True Shooting %
Net Rating
Wins

Other variables will be considered as the model is constructed. I will use data from the 2016-2017 NBA regular season and postseason.

What methodology can we use?

I’ll first try to understand how a player adds value to his team in a quantifiable way.

Then, I can establish a baseline ‘sum of its parts’ for overall team performance and see which teams underperform/overperform.

I can then try to predict which teams will outperform the sum of their parts, or which players make their teams better indirectly.

Ultimately, there are a few assumptions I’ll have to more fully address over the course of this project:

How am I defining making your teammates ‘better’?
How will I measure a player’s impact on this definition of ‘better’?
And how do I determine the difference between one player making teammates better and synergy between teammates?
- Comparing similar lineups with and without a particular player may shed light on this, but ultimately this question may not matter. The information is still useful in picking lineups or players that maximize a group of 5 players’ value while they’re on the court.

Who cares?

Besides the aforementioned pundits and fans, basketball organizations should care, players should care, agents should care, and everyone who gets paid from basketball-related revenue should care. Team strategy off the court is proving critical to success. It’s hard to gain an edge in a league with a salary cap, and paying the best players or hiring the best coach doesn’t cut it anymore. Being able to answer questions like these allows teams to find value where others miss it and optimize the resources they have.

Further iterations on this question would allow teams to understand not only which players impact their teammates, but which types of teammates best suit their star players. This could help organizations tailor their teams around a foundation (e.g. how the Cavs build around LeBron James after Kyrie Irving’s request for a trade, or how to build around the new pairing of Chris Paul and James Harden for the Rockets).

Packages & Data Prep

Packages Used

  # Packages Required
  library(tidyverse) # used for several functions in R, including dplyr, ggplot2, etc.
  library(jsonlite) # used to convert JSON files to dataframes
  library(plotly) # used for making graphs interactive and easier to interpret (especially when overplotted)
  library(DT) # used for displaying the data sets as interactive HTML tables
  library(knitr) # provides kable(), used to render the data dictionaries
  library(stringr) # needed to alter some text strings to match observations

Data Set

Compilation

# Import json files from web

# List of json URLs

json_urls <- c("http://stats.nba.com/stats/leaguedashplayerstats?College=&Conference=&Country=&DateFrom=&DateTo=&Division=&DraftPick=&DraftYear=&GameScope=&GameSegment=&Height=&LastNGames=0&LeagueID=00&Location=&MeasureType=Advanced&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=PerGame&Period=0&PlayerExperience=&PlayerPosition=&PlusMinus=N&Rank=N&Season=2016-17&SeasonSegment=&SeasonType=Regular+Season&ShotClockRange=&StarterBench=&TeamID=0&VsConference=&VsDivision=&Weight=",
               "http://stats.nba.com/stats/leaguedashplayerstats?College=&Conference=&Country=&DateFrom=&DateTo=&Division=&DraftPick=&DraftYear=&GameScope=&GameSegment=&Height=&LastNGames=0&LeagueID=00&Location=&MeasureType=Scoring&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=PerGame&Period=0&PlayerExperience=&PlayerPosition=&PlusMinus=N&Rank=N&Season=2016-17&SeasonSegment=&SeasonType=Regular+Season&ShotClockRange=&StarterBench=&TeamID=0&VsConference=&VsDivision=&Weight=",
               "http://stats.nba.com/stats/leaguedashptdefend?College=&Conference=&Country=&DateFrom=&DateTo=&DefenseCategory=2+Pointers&Division=&DraftPick=&DraftYear=&GameSegment=&Height=&LastNGames=0&LeagueID=00&Location=&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PerMode=PerGame&Period=0&PlayerExperience=&PlayerPosition=&Season=2016-17&SeasonSegment=&SeasonType=Regular+Season&StarterBench=&TeamID=0&VsConference=&VsDivision=&Weight=",
               "http://stats.nba.com/stats/leaguedashptdefend?College=&Conference=&Country=&DateFrom=&DateTo=&DefenseCategory=3+Pointers&Division=&DraftPick=&DraftYear=&GameSegment=&Height=&LastNGames=0&LeagueID=00&Location=&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PerMode=PerGame&Period=0&PlayerExperience=&PlayerPosition=&Season=2016-17&SeasonSegment=&SeasonType=Regular+Season&StarterBench=&TeamID=0&VsConference=&VsDivision=&Weight=",
               "http://stats.nba.com/stats/leaguedashptdefend?College=&Conference=&Country=&DateFrom=&DateTo=&DefenseCategory=Overall&Division=&DraftPick=&DraftYear=&GameSegment=&Height=&LastNGames=0&LeagueID=00&Location=&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PerMode=PerGame&Period=0&PlayerExperience=&PlayerPosition=&Season=2016-17&SeasonSegment=&SeasonType=Regular+Season&StarterBench=&TeamID=0&VsConference=&VsDivision=&Weight=",
               "http://stats.nba.com/stats/leaguedashplayerstats?College=&Conference=&Country=&DateFrom=&DateTo=&Division=&DraftPick=&DraftYear=&GameScope=&GameSegment=&Height=&LastNGames=0&LeagueID=00&Location=&MeasureType=Base&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=PerGame&Period=0&PlayerExperience=&PlayerPosition=&PlusMinus=N&Rank=N&Season=2016-17&SeasonSegment=&SeasonType=Regular+Season&ShotClockRange=&StarterBench=&TeamID=0&VsConference=&VsDivision=&Weight=",
               "http://stats.nba.com/stats/leaguedashteamstats?Conference=&DateFrom=&DateTo=&Division=&GameScope=&GameSegment=&LastNGames=0&LeagueID=00&Location=&MeasureType=Base&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=PerGame&Period=0&PlayerExperience=&PlayerPosition=&PlusMinus=N&Rank=N&Season=2016-17&SeasonSegment=&SeasonType=Regular+Season&ShotClockRange=&StarterBench=&TeamID=0&VsConference=&VsDivision=",
               "http://stats.nba.com/stats/leaguedashteamstats?Conference=&DateFrom=&DateTo=&Division=&GameScope=&GameSegment=&LastNGames=0&LeagueID=00&Location=&MeasureType=Advanced&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=PerGame&Period=0&PlayerExperience=&PlayerPosition=&PlusMinus=N&Rank=N&Season=2016-17&SeasonSegment=&SeasonType=Regular+Season&ShotClockRange=&StarterBench=&TeamID=0&VsConference=&VsDivision=",
               "http://stats.nba.com/stats/leaguedashteamstats?Conference=&DateFrom=&DateTo=&Division=&GameScope=&GameSegment=&LastNGames=0&LeagueID=00&Location=&MeasureType=Four+Factors&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=PerGame&Period=0&PlayerExperience=&PlayerPosition=&PlusMinus=N&Rank=N&Season=2016-17&SeasonSegment=&SeasonType=Regular+Season&ShotClockRange=&StarterBench=&TeamID=0&VsConference=&VsDivision=")
json_names <- list("nba_advanced", "nba_assistedFGs", "nba_d2fgm", "nba_d3fgm", "nba_dfg", "nba_plusminus", "nba_teambasic", "nba_teamadvanced", "nba_fourfactors")

# Create Data Import Loop

data_list <- list(NULL)
for(i in seq_along(json_urls)){
  foo <- fromJSON(paste(readLines(json_urls[i])),
                  simplifyDataFrame = TRUE,
                  flatten = TRUE)
  flat_file <- flatten(foo$resultSets)
  foo <- data.frame(flat_file$rowSet)
  colnames(foo) <- t(data.frame(flat_file$headers))
  data_list[i] <- list(foo)
  print(i)
  }
data_list <- setNames(data_list, nm = json_names)

# import csv, edit column = name to match json files

nba_rpm <- read_csv(
  "https://raw.githubusercontent.com/robert-fields/nba-stats/master/nba_rpm_csv.csv",
  col_types = "_?___????")
# keep only the text before the first comma in each name
nba_rpm$Name <- sub(",.*$", "", nba_rpm$Name)

# match different spellings of names
nba_rpm$Name[nba_rpm$Name == "Nene Hilario"] <- "Nene"
nba_rpm$Name[nba_rpm$Name == "Sheldon Mac"] <- "Sheldon McClellan"

# Merge relevant variables from json files and csv and filter out NA from csv

player_data <-
  data_list[[1]] %>%
  left_join(data_list[[2]][,c("PLAYER_ID","PCT_AST_FGM","PCT_UAST_FGM")],
          by = "PLAYER_ID") %>%
  left_join(data_list[[3]][,c("PLAYER_NAME","FG2M")],
            by = "PLAYER_NAME") %>%
  left_join(data_list[[4]][,c("PLAYER_NAME","FG3M")],
            by = "PLAYER_NAME") %>%
  left_join(data_list[[5]][,c("PLAYER_NAME","D_FG_PCT")],
            by = "PLAYER_NAME") %>%
  left_join(data_list[[6]][,c("PLAYER_ID","PLUS_MINUS")],
            by = "PLAYER_ID") %>%
  left_join(nba_rpm, by = c("PLAYER_NAME" = "Name")) %>%
  filter(!is.na(RPM))

# Merge team data

team_data <-
  data_list[[9]] %>%
  left_join(data_list[[7]][,c("TEAM_ID","PLUS_MINUS")],
            by = "TEAM_ID") %>%
  left_join(data_list[[8]][,c("TEAM_ID","TS_PCT","PIE",
                              "NET_RATING","OFF_RATING","DEF_RATING",
                              "AST_PCT","AST_TO","AST_RATIO")],
            by = "TEAM_ID")

# clean up workspace
rm(list = setdiff(ls(), c("player_data","team_data")))

# change vectors to numerics
player_data[5:67] <- sapply(player_data[5:67],
  function(x) as.numeric(as.character(x)))

team_data[3:39] <- sapply(team_data[3:39],
                            function(x) as.numeric(as.character(x)))

# filter out unneeded variables, match team and player data, add variables
player_data <-
  player_data %>%
  select(-(GP_RANK:CFPARAMS)) %>%
  left_join(mutate(team_data,
                   TEAM_PIE = PIE,
                   TEAM_PLUS_MINUS = PLUS_MINUS,
                   TEAM_WINS = W,
                   TEAM_MIN = MIN,
                   TEAM_NET_RATING = NET_RATING,
                   TEAM_OFF_RATING = OFF_RATING,
                   TEAM_DEF_RATING = DEF_RATING)
            [,c("TEAM_ID","TEAM_PIE","TEAM_PLUS_MINUS","TEAM_WINS","TEAM_MIN",
                "TEAM_NET_RATING","TEAM_OFF_RATING","TEAM_DEF_RATING")],
       by = "TEAM_ID") %>%
       mutate(TEAM_RPM_SHARE = (RPM * (GP * MIN) / (TEAM_MIN)))

 team_data <-
   team_data %>%
   select(-(GP_RANK:CFPARAMS)) %>%
   left_join(
     aggregate(player_data$TEAM_RPM_SHARE ~ player_data$TEAM_ID,
       player_data, sum),
       by = c("TEAM_ID" = "player_data$TEAM_ID")) %>%
   rename("TEAM_RPM" = "player_data$TEAM_RPM_SHARE") %>%
   left_join(
             aggregate(player_data$WINS ~ player_data$TEAM_ID,player_data, sum),
             by = c("TEAM_ID" = "player_data$TEAM_ID")) %>%
   rename("TEAM_RPM_WINS" = "player_data$WINS")

 player_data <- player_data %>%
  left_join(mutate(team_data,
            TEAM_RPM_WINS = TEAM_RPM_WINS,
            TEAM_RPM = TEAM_RPM)[,c("TEAM_ID","TEAM_RPM_WINS",
                                    "TEAM_RPM")],
            by = "TEAM_ID")

I have included a data dictionary with details on each variable below each data set on its respective tab. A more in-depth look at the different statistics used for NBA data can be found in the NBA stats glossary.

I had to compile two data sets, one for Team data and one for Player data. I did this by requesting the JSON URLs behind the stats.nba.com pages and using the “jsonlite” package to parse the responses and flatten them into data frames. I used a loop to tidy up this repetitive process.

data_list <- list(NULL)
for(i in seq_along(json_urls)){
  foo <- fromJSON(paste(readLines(json_urls[i])),
                  simplifyDataFrame = TRUE,
                  flatten = TRUE)
  flat_file <- flatten(foo$resultSets)
  foo <- data.frame(flat_file$rowSet)
  colnames(foo) <- t(data.frame(flat_file$headers))
  data_list[i] <- list(foo)
  print(i)
  }
data_list <- setNames(data_list, nm = json_names)
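To make the reshaping step concrete, here is a minimal sketch that parses a small hand-written JSON string instead of a live URL. The payload shape (a `resultSets` array whose entries carry `headers` and `rowSet`) is inferred from the fields the loop accesses; the rows, the set name, and the helper `result_set_to_df` are made up for illustration and are not part of the original pipeline.

```r
library(jsonlite)

# Hand-written stand-in for a stats.nba.com response (all values are made up)
toy_json <- '{
  "resultSets": [{
    "name": "ToyStats",
    "headers": ["PLAYER_ID", "PLAYER_NAME", "GP"],
    "rowSet": [[1, "Player A", 80],
               [2, "Player B", 50]]
  }]
}'

# Hypothetical helper: turn one result set into a data frame
result_set_to_df <- function(result_set) {
  headers <- unlist(result_set$headers)
  rows <- lapply(result_set$rowSet, function(row) {
    # each row is a plain list; naming it gives a one-row data frame
    as.data.frame(setNames(row, headers), stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

# simplifyVector = FALSE keeps every row as a list, so rows that mix
# numbers and strings are not coerced to a single type
parsed <- fromJSON(toy_json, simplifyVector = FALSE)
toy_df <- result_set_to_df(parsed$resultSets[[1]])
print(toy_df)
```

Because each column keeps its own type here, the later `as.numeric(as.character(x))` conversion would not be needed for this toy input. Rows containing JSON `null` values would need extra handling that this sketch leaves out.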

I combined all of these data frames into two single data sets. I also pulled Real Plus/Minus data from ESPN, converted it into a .csv and imported that. I was able to merge all the JSON dataframes and the csv based on either a Player ID from nba.com or the Player Name. I also joined data from the team_data set to the player_data set to make some comparisons between teams and players.
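Joining on names only works when both sources spell every name the same way, so it can help to list the mismatches before merging. The sketch below is one way to do that with base R; the short name vectors stand in for the two sources, and the lookup table `recode` is a hypothetical convenience rather than the code used in this project.

```r
# Small stand-ins for the name columns of the two sources
nba_names <- c("Nene", "Sheldon McClellan", "Aaron Gordon")
rpm_names <- c("Nene Hilario", "Sheldon Mac", "Aaron Gordon")

# Names that appear in one source but not in the other
unmatched_in_rpm <- setdiff(rpm_names, nba_names)
unmatched_in_nba <- setdiff(nba_names, rpm_names)
print(unmatched_in_rpm)

# Hypothetical lookup: RPM-file spelling -> nba.com spelling
recode <- c("Nene Hilario" = "Nene", "Sheldon Mac" = "Sheldon McClellan")
needs_fix <- rpm_names %in% names(recode)
rpm_names[needs_fix] <- unname(recode[rpm_names[needs_fix]])

# After recoding, nothing should be left unmatched
remaining <- setdiff(rpm_names, nba_names)
print(remaining)
```

Running the same check on the real columns before the `left_join` would show which players would otherwise be dropped by the `filter(!is.na(RPM))` step.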

I created a variable TEAM_RPM_SHARE in the player_data set that weighted a player’s RPM by the minutes he played during the season. I then aggregated this by team to create a TEAM_RPM variable in the team_data set. This is the variable I use as a baseline for team performance based on the players they have on their team. I did the same thing with RPM WINS, creating the TEAM_RPM_WINS variable in the team_data set.

mutate(player_data, TEAM_RPM_SHARE = (RPM * (GP * MIN) / (TEAM_MIN)))

left_join(team_data,
     aggregate(player_data$TEAM_RPM_SHARE ~ player_data$TEAM_ID,
       player_data, sum),
       by = c("TEAM_ID" = "player_data$TEAM_ID")) %>%
   rename("TEAM_RPM" = "player_data$TEAM_RPM_SHARE")
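As a worked example of the weighting, the sketch below applies the same formula to three made-up players on two made-up teams. All numbers are invented for illustration; only the formula and the column names come from the code above.

```r
# Made-up player rows: two players on team "A", one on team "B"
players <- data.frame(
  TEAM_ID  = c("A", "A", "B"),
  RPM      = c(2.0, -1.0, 0.5),
  GP       = c(80, 50, 82),       # games played
  MIN      = c(30, 20, 24),       # minutes per game
  TEAM_MIN = c(4000, 4000, 3936)  # total team minutes for the season
)

# A player's RPM scaled by his share of the team's minutes
players$TEAM_RPM_SHARE <- with(players, RPM * (GP * MIN) / TEAM_MIN)

# Sum the shares within each team
team_rpm <- aggregate(TEAM_RPM_SHARE ~ TEAM_ID, data = players, FUN = sum)
names(team_rpm)[2] <- "TEAM_RPM"
print(team_rpm)
# Team A: 2.0 * 2400 / 4000 + (-1.0) * 1000 / 4000 = 1.2 - 0.25 = 0.95
# Team B: 0.5 * 1968 / 3936 = 0.25
```

Using bare column names with `data =` in `aggregate()` keeps the output columns named `TEAM_ID` and `TEAM_RPM_SHARE`, which avoids having to join and rename on names like `player_data$TEAM_ID`.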

I dropped any unnecessary variables, paring the data down from 39 to 26 variables for Team data and from 67 to 50 variables for Player data that might be of interest later on.

## Observations: 30
## Variables: 26
## $ TEAM_ID       <fctr> 1610612737, 1610612738, 1610612751, 1610612766,...
## $ TEAM_NAME     <fctr> Atlanta Hawks, Boston Celtics, Brooklyn Nets, C...
## $ GP            <dbl> 82, 82, 82, 82, 82, 82, 82, 82, 82, 82, 82, 82, ...
## $ W             <dbl> 43, 53, 20, 36, 41, 51, 33, 40, 37, 67, 55, 42, ...
## $ L             <dbl> 39, 29, 62, 46, 41, 31, 49, 42, 45, 15, 27, 40, ...
## $ W_PCT         <dbl> 0.524, 0.646, 0.244, 0.439, 0.500, 0.622, 0.402,...
## $ MIN           <dbl> 3976, 3951, 3951, 3966, 3956, 3976, 3956, 3951, ...
## $ EFG_PCT       <dbl> 0.504, 0.525, 0.507, 0.501, 0.487, 0.547, 0.505,...
## $ FTA_RATE      <dbl> 0.295, 0.273, 0.289, 0.279, 0.259, 0.275, 0.225,...
## $ TM_TOV_PCT    <dbl> 0.157, 0.133, 0.159, 0.117, 0.138, 0.137, 0.126,...
## $ OREB_PCT      <dbl> 0.236, 0.212, 0.196, 0.199, 0.270, 0.219, 0.181,...
## $ OPP_EFG_PCT   <dbl> 0.507, 0.503, 0.513, 0.523, 0.507, 0.516, 0.529,...
## $ OPP_FTA_RATE  <dbl> 0.232, 0.290, 0.275, 0.211, 0.219, 0.226, 0.282,...
## $ OPP_TOV_PCT   <dbl> 0.153, 0.139, 0.128, 0.132, 0.137, 0.127, 0.157,...
## $ OPP_OREB_PCT  <dbl> 0.239, 0.247, 0.239, 0.204, 0.232, 0.242, 0.224,...
## $ PLUS_MINUS    <dbl> -0.9, 2.6, -6.7, 0.2, 0.4, 3.2, -2.9, 0.5, -1.1,...
## $ TS_PCT        <dbl> 0.541, 0.567, 0.551, 0.547, 0.530, 0.580, 0.541,...
## $ PIE           <dbl> 0.506, 0.514, 0.460, 0.505, 0.504, 0.510, 0.479,...
## $ NET_RATING    <dbl> -0.8, 3.1, -6.1, 0.3, 0.1, 2.9, -2.6, -0.5, -2.0...
## $ OFF_RATING    <dbl> 102.3, 108.6, 101.9, 106.4, 104.6, 110.9, 103.7,...
## $ DEF_RATING    <dbl> 103.1, 105.5, 108.0, 106.1, 104.5, 108.0, 106.3,...
## $ AST_PCT       <dbl> 0.621, 0.653, 0.566, 0.611, 0.584, 0.567, 0.574,...
## $ AST_TO        <dbl> 1.50, 1.90, 1.29, 2.01, 1.66, 1.66, 1.75, 1.69, ...
## $ AST_RATIO     <dbl> 17.5, 18.9, 16.0, 17.7, 17.0, 17.2, 16.9, 18.3, ...
## $ TEAM_RPM      <dbl> -0.3893458, 2.9810554, -6.5075485, 1.8591311, -0...
## $ TEAM_RPM_WINS <dbl> 41.23, 46.97, 21.13, 41.66, 35.50, 56.19, 28.86,...

## Observations: 464
## Variables: 50
## $ PLAYER_ID         <fctr> 1627773, 201166, 203932, 1626151, 203940, 2...
## $ PLAYER_NAME       <chr> "AJ Hammons", "Aaron Brooks", "Aaron Gordon"...
## $ TEAM_ID           <fctr> 1610612742, 1610612754, 1610612753, 1610612...
## $ TEAM_ABBREVIATION <fctr> DAL, IND, ORL, CHA, MIN, BOS, IND, POR, LAC...
## $ AGE               <dbl> 24, 32, 21, 22, 26, 31, 32, 26, 34, 24, 25, ...
## $ GP                <dbl> 22, 65, 80, 5, 18, 68, 66, 61, 30, 47, 42, 6...
## $ W                 <dbl> 4, 36, 29, 2, 5, 46, 33, 33, 20, 11, 26, 37,...
## $ L                 <dbl> 18, 29, 51, 3, 13, 22, 33, 28, 10, 36, 16, 3...
## $ W_PCT             <dbl> 0.182, 0.554, 0.363, 0.400, 0.278, 0.676, 0....
## $ MIN               <dbl> 7.4, 13.7, 28.7, 3.3, 7.5, 32.3, 14.1, 29.1,...
## $ OFF_RATING        <dbl> 102.2, 101.5, 105.4, 83.3, 102.6, 110.7, 102...
## $ DEF_RATING        <dbl> 102.8, 104.6, 108.2, 101.9, 101.8, 105.8, 10...
## $ NET_RATING        <dbl> -0.6, -3.0, -2.8, -18.6, 0.8, 5.0, -5.8, 1.8...
## $ AST_PCT           <dbl> 0.038, 0.216, 0.097, 0.375, 0.089, 0.239, 0....
## $ AST_TO            <dbl> 0.40, 1.89, 1.69, 0.00, 0.88, 2.93, 1.73, 1....
## $ AST_RATIO         <dbl> 6.2, 24.6, 12.5, 38.1, 9.0, 25.7, 9.5, 13.8,...
## $ OREB_PCT          <dbl> 0.049, 0.022, 0.054, 0.000, 0.069, 0.049, 0....
## $ DREB_PCT          <dbl> 0.199, 0.064, 0.141, 0.200, 0.200, 0.183, 0....
## $ REB_PCT           <dbl> 0.119, 0.043, 0.096, 0.094, 0.132, 0.118, 0....
## $ TM_TOV_PCT        <dbl> 15.4, 13.0, 7.4, 0.0, 10.3, 8.8, 5.5, 13.1, ...
## $ EFG_PCT           <dbl> 0.464, 0.483, 0.499, 0.000, 0.454, 0.527, 0....
## $ TS_PCT            <dbl> 0.472, 0.507, 0.530, 0.102, 0.505, 0.553, 0....
## $ USG_PCT           <dbl> 0.167, 0.191, 0.200, 0.142, 0.224, 0.199, 0....
## $ PACE              <dbl> 95.66, 96.55, 99.70, 92.43, 96.52, 98.96, 96...
## $ PIE               <dbl> 0.043, 0.062, 0.088, 0.000, 0.082, 0.125, 0....
## $ FGM               <dbl> 17, 121, 393, 0, 23, 379, 235, 183, 30, 138,...
## $ FGA               <dbl> 42, 300, 865, 4, 54, 801, 471, 466, 80, 267,...
## $ FGM_PG            <dbl> 0.8, 1.9, 4.9, 0.0, 1.3, 5.6, 3.6, 3.0, 1.0,...
## $ FGA_PG            <dbl> 1.9, 4.6, 10.8, 0.8, 3.0, 11.8, 7.1, 7.6, 2....
## $ FG_PCT            <dbl> 0.405, 0.403, 0.454, 0.000, 0.426, 0.473, 0....
## $ PCT_AST_FGM       <dbl> 0.941, 0.289, 0.570, 0.000, 0.696, 0.678, 0....
## $ PCT_UAST_FGM      <dbl> 0.059, 0.711, 0.430, 0.000, 0.304, 0.322, 0....
## $ FG2M              <dbl> 1.59, 1.43, 2.91, 0.75, 1.24, 5.63, 2.43, 3....
## $ FG3M              <dbl> 0.18, 0.75, 1.01, NA, 0.24, 1.19, 0.32, 1.20...
## $ D_FG_PCT          <dbl> 0.476, 0.455, 0.437, 0.600, 0.439, 0.467, 0....
## $ PLUS_MINUS        <dbl> -0.2, -0.5, -2.0, -1.0, 0.6, 3.2, -1.5, 0.4,...
## $ ORPM              <dbl> -2.77, -1.81, 1.25, -1.42, -1.69, 0.76, -1.0...
## $ DRPM              <dbl> 1.27, -1.47, -0.78, -0.32, 1.36, 1.06, -0.45...
## $ RPM               <dbl> -1.50, -3.28, 0.47, -1.74, -0.33, 1.82, -1.4...
## $ WINS              <dbl> 0.16, -0.09, 5.32, 0.01, 0.24, 6.93, 0.98, 4...
## $ TEAM_PIE          <dbl> 0.479, 0.505, 0.460, 0.505, 0.501, 0.514, 0....
## $ TEAM_PLUS_MINUS   <dbl> -2.9, -0.2, -6.6, 0.2, -1.1, 2.6, -0.2, -0.5...
## $ TEAM_WINS         <dbl> 33, 42, 29, 36, 31, 53, 42, 41, 51, 24, 51, ...
## $ TEAM_MIN          <dbl> 3956, 3971, 3961, 3966, 3961, 3951, 3971, 39...
## $ TEAM_NET_RATING   <dbl> -2.6, -0.1, -6.8, 0.3, -1.0, 3.1, -0.1, 0.0,...
## $ TEAM_OFF_RATING   <dbl> 103.7, 106.2, 101.2, 106.4, 108.1, 108.6, 10...
## $ TEAM_DEF_RATING   <dbl> 106.3, 106.3, 108.0, 106.1, 109.1, 105.5, 10...
## $ TEAM_RPM_SHARE    <dbl> -0.061729019, -0.735542684, 0.272436253, -0....
## $ TEAM_RPM_WINS     <dbl> 28.86, 38.70, 20.55, 41.66, 36.75, 46.97, 38...
## $ TEAM_RPM          <dbl> -2.7163504, -0.1413704, -7.2108700, 1.859131...

Player Data

datatable(player_data,
          extensions = c("FixedColumns","Buttons","KeyTable","Scroller"),
          options = list(orderClasses = TRUE,
                         fixedColumns = list(leftColumns = 5),
                         dom = "Bfrtip",
                         buttons = c("copy","csv","excel","colvis"),
                         keys = TRUE,
                         deferRender = FALSE,
                         scrollY = 750,
                         scrollX = 1000,
                         scroller = TRUE,
                         filter = "top",
                         orderMulti = TRUE,
                         columnDefs =
                           list(list(visible = FALSE, targets = c(1,3,5:50)))))

#Creating Data Dictionary

VarType<- c("PLAYER_NAME",
            "TEAM_ABBREVIATION",
            "GP",
            "W",
            "L",
            "MIN",
            "OFF_RATING",
            "DEF_RATING",
            "NET_RATING",
            "AST_PCT",
            "AST_TO",
            "AST_RATIO",
            "OREB_PCT",
            "DREB_PCT",
            "REB_PCT",
            "TM_TOV_PCT",
            "EFG_PCT",
            "TS_PCT",
            "USG_PCT",
            "PACE",
            "PIE",
            "FG_PCT",
            "PCT_UAST_FGM",
            "DFG2M",
            "DFG3M",
            "D_FG_PCT",
            "PLUS_MINUS",
            "ORPM",
            "DRPM",
            "RPM",
            "WINS",
            "TEAM_PIE",
            "TEAM_PLUS_MINUS",
            "TEAM_WINS",
            "TEAM_RPM_WINS",
            "TEAM_RPM")

VarDesc<-c("Player Name",
           "3-letter abbreviation of team name",
           "Games Played",
           "Wins",
           "Losses",
           "Minutes Played",
           "Offensive Rating (points per 100 possessions while player is on court)",
           "Defensive Rating (points allowed per 100 possessions while player is on court)",
           "Net Rating (point differential per 100 possessions while player is on court)",
           "Assist Percentage (percentage of shots a player assists (AST / TmFGM - FGM))",
           "Assist to Turnover Ratio",
           "Assist Ratio (number of assists per 100 possessions used)",
           "Offensive Rebound Percentage (percentage of available offensive rebounds a player grabbed while on floor)",
           "Defensive Rebound Percentage (percentage of available defensive rebounds a player grabbed while on floor)",
           "Overall Rebound Percentage (percentage of total rebounds a player grabbed while on floor)",
           "Team Turnover Percentage (turnovers per 100 possessions used by a player)",
           "Effective Field Goal Percentage (adjusts field goal percentage based on higher expected value of 3-pointers)",
           "True Shooting Percentage (metric of shooting efficiency (points / points possible on possessions with FG or FT attempt))",
           "Usage Rate (possessions used by a player when on the floor ((FGA + (.44*FTA) + TO) / Possessions))",
           "Pace of play (number of possessions per 48 minutes)",
           "Player Impact Estimate (comparable to other advanced statistic ratings e.g. PER)",
           "Field Goal Percentage",
           "Percentage of FGs made that were unassisted",
           "Made 2-point FGs a player defended",
           "Made 3-point FGs a player defended",
           "Defensive Field Goal Percentage",
           "Plus Minus (point differential while player is on the floor)",
           "Offensive Real Plus Minus",
           "Defensive Real Plus Minus",
           "Real Plus Minus",
           "RPM Wins (similar to win share, or number of wins a player contributes to his team's win total)",
           "Team Player Impact Estimate (overall PIE of the team a player is on)",
           "Team Plus Minus (Overall Plus Minus of team a player is on)",
           "Team Wins (overall wins of the team a player is on, including games in which a player did not play)",
           "Aggregate of RPM Wins (predicted wins a player contributes to a team)",
           "Aggregate of RPM for all players on a team, weighted by minutes played")
Data_Dictionary<- as.data.frame(cbind(VarType,VarDesc))
colnames(Data_Dictionary)<-c("Variable Name","Variable Description")
kable(Data_Dictionary, caption = "Data Dictionary of relevant variables")

Data Dictionary of relevant variables
Variable Name	Variable Description
PLAYER_NAME	Player Name
TEAM_ABBREVIATION	3-letter abbreviation of team name
GP	Games Played
W	Wins
L	Losses
MIN	Minutes Played
OFF_RATING	Offensive Rating (points per 100 possessions while player is on court)
DEF_RATING	Defensive Rating (points allowed per 100 possessions while player is on court)
NET_RATING	Net Rating (point differential per 100 possessions while player is on court)
AST_PCT	Assist Percentage (percentage of shots a player assists (AST / TmFGM - FGM))
AST_TO	Assist to Turnover Ratio
AST_RATIO	Assist Ratio (number of assists per 100 possessions used)
OREB_PCT	Offensive Rebound Percentage (percentage of available offensive rebounds a player grabbed while on floor)
DREB_PCT	Defensive Rebound Percentage (percentage of available defensive rebounds a player grabbed while on floor)
REB_PCT	Overall Rebound Percentage (percentage of total rebounds a player grabbed while on floor)
TM_TOV_PCT	Team Turnover Percentage (turnovers per 100 possessions used by a player)
EFG_PCT	Effective Field Goal Percentage (adjusts field goal percentage based on higher expected value of 3-pointers)
TS_PCT	True Shooting Percentage (metric of shooting efficiency (points / points possible on possessions with FG or FT attempt))
USG_PCT	Usage Rate (possessions used by a player when on the floor ((FGA + (.44*FTA) + TO) / Possessions))
PACE	Pace of play (number of possessions per 48 minutes)
PIE	Player Impact Estimate (comparable to other advanced statistic ratings e.g. PER)
FG_PCT	Field Goal Percentage
PCT_UAST_FGM	Percentage of FGs made that were unassisted
DFG2M	Made 2-point FGs a player defended
DFG3M	Made 3-point FGs a player defended
D_FG_PCT	Defensive Field Goal Percentage
PLUS_MINUS	Plus Minus (point differential while player is on the floor)
ORPM	Offensive Real Plus Minus
DRPM	Defensive Real Plus Minus
RPM	Real Plus Minus
WINS	RPM Wins (similar to win share, or number of wins a player contributes to his team’s win total)
TEAM_PIE	Team Player Impact Estimate (overall PIE of the team a player is on)
TEAM_PLUS_MINUS	Team Plus Minus (Overall Plus Minus of team a player is on)
TEAM_WINS	Team Wins (overall wins of the team a player is on, including games in which a player did not play)
TEAM_RPM_WINS	Aggregate of RPM Wins (predicted wins a player contributes to a team)
TEAM_RPM	Aggregate of RPM for all players on a team, weighted by minutes played
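The two shooting-efficiency entries in the dictionary can be made concrete with a short sketch. The formulas below are the commonly used definitions of effective field goal percentage and true shooting percentage; the function names, argument names, and the stat line are made up for illustration. Note that `threes_made` means a player's own made 3-pointers, not the `FG3M` column in the player table, which comes from the defensive tracking data.

```r
# Effective field goal percentage: a made three counts as 1.5 made shots
efg_pct <- function(fgm, threes_made, fga) {
  (fgm + 0.5 * threes_made) / fga
}

# True shooting percentage: points per two scoring attempts, with free-throw
# attempts weighted by the conventional 0.44 factor
ts_pct <- function(pts, fga, fta) {
  pts / (2 * (fga + 0.44 * fta))
}

# Made-up single-game stat line
efg <- efg_pct(fgm = 8, threes_made = 2, fga = 16)  # (8 + 1) / 16 = 0.5625
ts  <- ts_pct(pts = 22, fga = 16, fta = 5)          # 22 / 36.4, about 0.604
print(c(EFG_PCT = efg, TS_PCT = ts))
```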

Team Data

datatable(team_data,
          extensions = c("FixedColumns","Buttons","KeyTable","Scroller"),
          options = list(orderClasses = TRUE,
                         fixedColumns = list(leftColumns = 3),
                         dom = "Bfrtip",
                         buttons = c("copy","csv","excel","colvis"),
                         keys = TRUE,
                         deferRender = FALSE,
                         scrollY = 750,
                         scrollX = 1000,
                         scroller = TRUE,
                         filter = "top",
                         orderMulti = TRUE,
                         columnDefs =
                           list(list(visible = FALSE, targets = c(1,3:26)))))

VarType<- c("TEAM_NAME",
            "W",
            "L",
            "W_PCT",
            "MIN",
            "EFG_PCT",
            "TM_TOV_PCT",
            "OREB_PCT",
            "OPP_EFG_PCT",
            "OPP_TOV_PCT",
            "OPP_OREB_PCT",
            "PLUS_MINUS",
            "TS_PCT",
            "PIE",
            "NET_RATING",
            "AST_PCT",
            "AST_TO",
            "AST_RATIO",
            "TEAM_RPM")

VarDesc<-c("Team Name",
           "Wins",
           "Losses",
           "Win Percentage",
           "Minutes",
           "Effective Field Goal Percentage (adjusts field goal percentage based on higher expected value of 3-pointers)",
           "Team Turnover Percentage (turnovers per 100 possessions)",
           "Offensive Rebound Percentage (percentage of available offensive rebounds a team grabs)",
           "Opponent's Effective Field Goal Percentage",
           "Opponent's Turnover Percentage",
           "Opponent's Offensive Rebound Percentage",
           "Plus Minus (overall point differential)",
           "True Shooting Percentage (metric of shooting efficiency (points / points possible on possessions with FG or FT attempt))",
           "Player Impact Estimate",
           "Net Rating",
           "Assist Percentage",
           "Assist to Turnover Ratio",
           "Assist Ratio (number of assists per 100 possessions)",
           "Team Real Plus Minus")
Data_Dictionary<- as.data.frame(cbind(VarType,VarDesc))
colnames(Data_Dictionary)<-c("Variable Name","Description")
kable(Data_Dictionary, caption = "Data Dictionary of relevant variables")

Data Dictionary of relevant variables
Variable Name	Description
TEAM_NAME	Team Name
W	Wins
L	Losses
W_PCT	Win Percentage
MIN	Minutes
EFG_PCT	Effective Field Goal Percentage (adjusts field goal percentage based on higher expected value of 3-pointers)
TM_TOV_PCT	Team Turnover Percentage (turnovers per 100 possessions)
OREB_PCT	Offensive Rebound Percentage (percentage of available offensive rebounds a team grabs)
OPP_EFG_PCT	Opponent’s Effective Field Goal Percentage
OPP_TOV_PCT	Opponent’s Turnover Percentage
OPP_OREB_PCT	Opponent’s Offensive Rebound Percentage
PLUS_MINUS	Plus Minus (overall point differential)
TS_PCT	True Shooting Percentage (metric of shooting efficiency (points / points possible on possessions with FG or FT attempt))
PIE	Player Impact Estimate
NET_RATING	Net Rating
AST_PCT	Assist Percentage
AST_TO	Assist to Turnover Ratio
AST_RATIO	Assist Ratio (number of assists per 100 possessions)
TEAM_RPM	Team Real Plus Minus

Analysis

Data Exploration

Talent Distribution

Some preliminary analysis shows that there are noticeable differences in the distributions of player talent across teams, which should be expected.

This follows conventional wisdom (the Warriors are good, the Sixers not so much), but a few interesting points deserve further analysis, such as whether the Warriors are better or worse than the sum of their parts (comparing the sum of player RPM, a predicted value, vs. NET_RATING, an observed value) and how other teams compare in the same context.

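# boxplot of player RPM on each team, ordered from best to worst team plus/minus
# (arrange() does not change the order of a discrete axis, so reorder() is used)
ggplotly(
  player_data %>%
  ggplot(aes(reorder(TEAM_ABBREVIATION, -TEAM_PLUS_MINUS), RPM)) +
    geom_boxplot() +
    xlab("TEAM_ABBREVIATION"))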

# density plot of player RPM on each team
ggplotly(
  player_data %>%
  arrange(desc(TEAM_PLUS_MINUS)) %>%
  ggplot(aes(RPM, fill = TEAM_ABBREVIATION)) +
  geom_density(alpha = .2))

Team Performance vs. Baseline

In thinking through approaching this analysis, I settled on two different questions:

How do teams perform against a “sum of their parts” baseline?
Which players are responsible for this performance?

For the first question, I compared TEAM_RPM vs. NET_RATING for each team. I also compared TEAM_RPM_WINS vs. W. The RPM statistics are produced by ESPN and adjusted to be a representation of a player’s contribution to his team, independent of teammates or opponents. Therefore, when aggregated these statistics should represent the sum of each player’s individual contribution.

I plotted both of these with a line with slope = 1 (a null hypothesis that teams equal the sum of their parts). I also plotted these with a geom_smooth line of best fit to show how teams might compare to other teams in the league.

ggplotly(
  ggplot(team_data,aes(TEAM_RPM,NET_RATING)) +
    geom_point(aes(color = TEAM_NAME)) +
    geom_smooth(method = "glm") +
    geom_abline()
)

ggplotly(
  ggplot(team_data,aes(TEAM_RPM_WINS,W)) +
    geom_point(aes(color = TEAM_NAME)) +
    geom_smooth(method = "glm") +
    geom_abline())

These graphs indicate that teams like the Utah Jazz and San Antonio Spurs may be some of the bigger overachievers.

The NET_RATING chart suggests that while the Warriors might not be maximizing the sum of the talent on that team, they are still doing better than league average.

It is also worth observing where the slope = 1 line and the best-fit line intersect: below that point the best-fit line sits above the slope = 1 line, suggesting that it’s easier for bad teams to outperform their talent level. Additionally, the gap is much smaller when observing NET_RATING (a points measure) instead of wins, suggesting that point differentials are better representations of talent levels (but ultimately, wins are still what teams are after).

This might suggest there are diminishing marginal returns to talent and how efficiently it translates to points/wins.
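The comparison against the baseline can also be tabulated rather than read off the charts. The sketch below uses only the first four teams from the team_data printout above (identified by TEAM_ID), so it illustrates the calculation and cannot support any conclusion by itself; the column names `RATING_GAP` and `WINS_GAP` and the variable `crossing` are made up here.

```r
# First four rows of team_data as printed above
teams <- data.frame(
  TEAM_ID       = c("1610612737", "1610612738", "1610612751", "1610612766"),
  NET_RATING    = c(-0.8, 3.1, -6.1, 0.3),
  TEAM_RPM      = c(-0.3893458, 2.9810554, -6.5075485, 1.8591311),
  W             = c(43, 53, 20, 36),
  TEAM_RPM_WINS = c(41.23, 46.97, 21.13, 41.66)
)

# Positive gap = the team did better than its "sum of its parts" baseline
teams$RATING_GAP <- teams$NET_RATING - teams$TEAM_RPM
teams$WINS_GAP   <- teams$W - teams$TEAM_RPM_WINS
print(teams[, c("TEAM_ID", "RATING_GAP", "WINS_GAP")])

# Straight-line fit of observed vs. baseline, and the point where that
# line crosses the slope = 1 line: a + b * x = x  =>  x = a / (1 - b)
fit <- lm(NET_RATING ~ TEAM_RPM, data = teams)
a <- unname(coef(fit)[1])
b <- unname(coef(fit)[2])
crossing <- a / (1 - b)
print(c(intercept = a, slope = b, crossing = crossing))
```

Running the same steps on the full `team_data` frame would give the gaps for all 30 teams. When the fitted slope is below 1, teams with a baseline below `crossing` sit above the slope = 1 line on average.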

Next Steps

I’d like to build a model that looks into what players do on the court (i.e. shooting, passing, defensive statistics, etc.) and how this contributes to the synergy of a team. Furthermore, I’d like to see which player lineups are most effective at pushing a team past its “sum of its parts” baseline.

Potential ideas to explore:

On/Off numbers for each player (team performance with the player on vs. off the court)
- Points, Net Rating, Rebounds, Assists, etc.
Hierarchical models that explore the overall impacts of a player and sets of players
Factor Analysis on game statistics and their impact on winning
Clustering on Players/Lineups/Teams to determine if there are NBA ‘Archetypes’ of successful teams
Differences in the Regular season vs. Postseason

This is an ongoing project that I will continue to work on in my spare time.

Do the Warriors really play the right way?

Robert Fields

August 23, 2017