Collecting NBA Salary Data

Motivation: Three is Greater Than Two

NBA analytics junkies devote most of their energy to developing new metrics for relative player value. Is John Wall better than Victor Oladipo? Is James Harden better than Steph Curry? However, an often overlooked, yet crucial, element of the analysis is comparing player value to pay - a price-to-earnings ratio for the NBA. While we know that the three-point-shot is, give or take, worth one more point than the two-point-shot, what may be less clear is how much we should pay those who excel in that skill.

Just like in financial markets, truths become trends, which become fads, which become bubbles, which become busts. Investors try to buy the next Google or Apple while it’s still a young company and stay away from the next Pets.com at the top of a speculative frenzy. The same should be true in the NBA. The hypothesis I would like to test is: are three-point-shots now overvalued in the NBA? And, by extension, are the post-scoring dinosaurs of your father’s NBA now a bargain?

To answer that question, we will need data. This post will show you how to get it.

The Data We Need

We need the following four pieces of data:

Salary Data: We’ll use this to track the market value of different skills over time.
Salary Cap Data: 20 million dollars can’t buy you in 2018 what it could in 2001. We’ll use the salary cap data to discount annual salary by year.
Player Bio Info: We’ll use data on height and weight to help categorize player archetypes.
Player Stats: Duh.

Salary Data

Patricia Bender has collected historical NBA salary data at this website: https://www.eskimo.com/~pbender/. Bender’s data, while terrific, is not presented with the busy data scientist in mind. The data requires extensive cleaning and tidying. Here’s what it looks like in its raw form on the web:

A web scraper that collects salary data requires the following libraries:

library(tidyverse)
library(rvest)
library(stringr)

The breaks between teams differ for each year in the sample. Additionally, the number of dots separating player name and salary depends on the year. So, in order for our webscraper to operate correctly, it must identify the following metadata:

Year: This artifact, which we’ll call year.roster, is listed on the top of every page. We need it so that we can deflate salaries by the appropriate salary cap number.
Splitting Function: In the raw data, Bender demarcates each team either using four dashes or a double line break. All years after 2007 use the dashes, and years before that use the double line break.
Position Split: As you can see in the image above, each player and salary are separated by a series of periods. However, each year has a different number of periods separating the two. This object, position.split, gives us the position (# of characters) where we need to split the text to give us either player or salary.
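
To make the position split concrete, here is a minimal sketch on a single made-up roster line. The player name and salary are hypothetical, not taken from Bender’s pages, and the sketch assumes, as the scraper does, that the dollar sign marks where the salary begins.

```r
library(stringr)

# Hypothetical line in the same "name, dots, salary" layout; not real data
line.example <- "Example Player........ $1,234,567"

# Character position of the first dollar sign on the line
position.example <- str_locate(line.example, "\\$")[1, "start"]

# Everything before the split is the player name
player.example <- line.example %>%
  str_sub(1, position.example - 1) %>%
  str_replace_all("\\.", "") %>%
  str_trim()

# Everything from the split onward is the salary
salary.example <- line.example %>%
  str_sub(position.example) %>%
  str_replace_all("[$,]", "") %>%
  as.numeric()
```

Because the number of periods changes from year to year, the scraper measures this position once per page rather than hard-coding it.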

The final wrapper collects, cleans, and tidies the html for each salary-year page according to the metadata identified above.

get.salary.data <- function(x) {
  url.base <- x
  html.raw <- read_html(url.base)
  
  #What year does this salary correspond to? Assigned globally (<<-) because clean.salary.data reads it
  year.roster <<-
    html.raw %>% 
      html_text() %>% str_split("\n") %>% .[[1]] %>% .[1] %>% str_split(" ") %>% .[[1]] %>% .[1] %>% str_split("-") %>% unlist() %>% .[2] %>% as.numeric()
  
  #How do we split each team, with a series of dashes or a double line break?
  regex.splitting <- ifelse(year.roster > 2007, "----", "\n\n")
  
  #Where do we split the text between player name and player salary? Also global, for clean.salary.data
  position.split <<- 
    html.raw %>%
      html_text() %>%
      str_split("\n") %>% unlist() %>% 
      .[str_which(.,"Boston(.*?)Total")] %>% #We use the Boston Celtics here because their name hasn't changed over the course of the sample
      str_locate("\\$") %>% as.numeric() %>% .[1]
  
  #Collect, combine, and clean the data
  html.raw %>%
    html_text() %>%
    #Split text by team
    str_split(regex.splitting) %>% unlist() %>% .[5:length(.)] %>%
    #We'll come back to this
    map(clean.salary.data) %>%
    bind_rows() %>%
    mutate(player = str_replace(player,":","")) %>%
    filter(player != team) %>%
    filter(str_detect(player, "\\$") == FALSE)
}

The clean.salary.data function is the meat of this scraper - it is how we turn the text from Bender’s webpage into a tidy dataframe.

clean.salary.data <- function(x) {
  x %>%
    str_split("\n") %>% #Each player gets their own line
    unlist() %>%
    as_tibble() %>%
    mutate(salary = str_sub(value, position.split,position.split+11) %>% trimws() %>% str_replace_all(.,",","") %>% str_replace(.,"\\$","") %>% as.numeric(),
           player = str_sub(value, 0,position.split-1) %>% str_replace("(Total|Total:)","") %>% str_replace_all(., "\\.","") %>% trimws()) %>%
    select(player,salary) %>%
    na.omit() %>%
    mutate(team = slice(.,1) %>% .$player %>% str_replace("\\$","") %>% str_replace(":",""),
           year = year.roster) %>%
    filter(player != team)
}

The final salary dataset includes 7,889 observations of player salaries from 2001-2016. The distribution of player salaries is heavily skewed to the right. Although the median salary is 2.27 million dollars, the average salary is 4.4 million dollars. In 2016, Kobe Bryant made 25 million and LeBron made 23 million.

Salary Cap Data

Copy and paste from Basketball Reference. Move it along, people!
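
Once the cap figures sit in a data frame, deflating salaries is a join and a division. The sketch below is only an illustration: the cap values are placeholders rather than the real caps, the players are hypothetical, and the names data.salary.cap.example and salary.cap.share are my own.

```r
library(dplyr)
library(tibble)

# Placeholder cap figures for illustration only; substitute the real values
data.salary.cap.example <- tibble(
  year       = c(2001, 2016),
  salary.cap = c(40e6, 80e6)
)

# Hypothetical salaries in the same shape as the scraper output
data.salary.example <- tibble(
  player = c("Player A", "Player B"),
  salary = c(20e6, 20e6),
  year   = c(2001, 2016)
)

# Express each salary as a share of that year's cap
data.salary.deflated <- data.salary.example %>%
  left_join(data.salary.cap.example, by = "year") %>%
  mutate(salary.cap.share = salary / salary.cap)
```

With these placeholder numbers, the same 20 million dollars is half the cap in one year and a quarter of it in the other, which is the comparison the raw dollar figures hide.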

Player Bio Data

Next, we’ll scrape the NBA player bio data, that is, each player’s height, weight, date of birth, etc. You can find all this data at Basketball Reference, one of the best resources for data on the NBA: https://www.basketball-reference.com/players/a/. The website is arranged alphabetically, with the first letter of the player’s last name designating each page.

To collect this data, we first need a wrapper that can collect, clean, and tidy the data from each page. The rvest package’s html_text function collapses tables into a single vector. However, we can exploit the regular ordering of the vector to organize the data into a data frame. In this case, each column is a lagged instance of that vector. Then, we’ll just select every nth row to get a regular dataframe.
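
Before the full function, the lagging trick is easier to see on a toy vector. This sketch uses a hypothetical two-column table that has been flattened row by row, the way html_text returns it; the names and values are made up.

```r
library(dplyr)

# Toy stand-in for html_text() output: header cells first, then one row at a time
vector.example <- c("name", "height", "Player A", "6-3", "Player B", "6-10")

df.example <- tibble(
  name   = vector.example,
  height = lead(vector.example, 1)  # same vector shifted up by one position
) %>%
  .[-c(1:2), ] %>%                  # drop the header cells (one per column)
  .[seq(1, nrow(.), 2), ]           # keep every 2nd row, where the columns line up
```

Only the rows that start on a true first-column cell are aligned; every other row pairs cells from neighbouring columns, which is why they are discarded. With n columns, the same idea keeps every nth row.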

get.player.base.info <- function(x) {
  print(x) #Track progress
  
  html.raw <- x %>% read_html()
  
  #Store output as a vector of character strings
  vector.raw <- 
    html.raw %>%
    html_nodes(".center , .right , .left") %>%
    html_text()
  
  #Arrange vector into columns
  df.raw <- 
    tibble(
      player      = vector.raw,
      year.start  = lead(vector.raw,1),
      year.retire = lead(vector.raw,2),
      position    = lead(vector.raw,3),
      height      = lead(vector.raw,4),
      weight      = lead(vector.raw,5),
      date.birth  = lead(vector.raw,6),
      college     = lead(vector.raw,7)
    )
  
  #Clean the output
  df.raw %>%
    #Drop the 8 header cells, then keep every 8th row; only those rows are aligned because the table has 8 columns
    .[-c(1:8),] %>%
    .[seq(1,nrow(.),8),] %>%
    mutate(year.start    = as.numeric(year.start),
           year.retire   = as.numeric(year.retire),
           career.length = year.retire-year.start) %>%
    arrange(desc(career.length))
}

Then, we can industrialize that wrapper to collect data across each letter of the alphabet:

url.list.production <- sprintf("https://www.basketball-reference.com/players/%s/",letters) #create a vector of 26 urls - one for each letter in the alphabet

data.safe.list <- 
  url.list.production %>% map(safely(get.player.base.info))

data.player.base.info <- data.safe.list %>%
  transpose() %>%
  .$result %>%
  bind_rows()
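
One thing to keep in mind with this pattern: bind_rows silently skips the NULL results of failed pages, so it is worth checking which calls failed. Here is a minimal sketch with a hypothetical stand-in scraper (scrape.example is not part of the code above).

```r
library(purrr)

# Hypothetical stand-in for a scraper that fails on one input
scrape.example <- function(x) {
  if (x == "bad") stop("page not found")
  toupper(x)
}

safe.list.example <- c("a", "bad", "c") %>% map(safely(scrape.example))

# transpose() turns a list of (result, error) pairs into a pair of lists
transposed.example <- transpose(safe.list.example)

# A NULL error means the call succeeded, so non-NULL entries mark the failures
failed.example <- transposed.example$error %>% map_lgl(~ !is.null(.x))
```

Indexing the URL vector with a logical vector like failed.example would give the pages to retry.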

NBA player bodies have changed A LOT since the days when a guy named Joseph Franklin “Jumping Joe” Fulks (I’m not kidding) led the league in scoring and smoked at halftime.

Jumping Joe

You can see that the average height for NBA players grew from just over six feet in 1947 to more than six and a half feet by 2016. Over the same period, player weights peaked in 2010 and have since been trending down as the league embraced faster players and small-ball tactics.
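
Averages like these need height as a number first. A rough sketch of that step follows, assuming height arrives as a "feet-inches" string such as "6-10"; the rows are hypothetical and the column names feet, inches, and height.inches are my own.

```r
library(dplyr)
library(tidyr)

# Hypothetical rows; assumes height is stored as a "feet-inches" string
data.bio.example <- tibble(
  player     = c("Player A", "Player B"),
  year.start = c(1947, 2016),
  height     = c("6-1", "6-7")
)

# Split the string into feet and inches, then convert to total inches
data.bio.inches <- data.bio.example %>%
  separate(height, into = c("feet", "inches"), sep = "-", convert = TRUE) %>%
  mutate(height.inches = feet * 12 + inches)

# Average height by debut year
height.by.year <- data.bio.inches %>%
  group_by(year.start) %>%
  summarise(height.mean = mean(height.inches))
```

Grouping by debut year is just one choice; averaging over every season a player was active would need the roster years instead.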

Player Stats

The final dataset we need is basic player stats. That is, statistics like points, assists, and rebounds per game. We will use this info to categorize players into archetypes like “back-to-the-basket post-scorers” or “scoring guards.”

The best place to get this data is also Basketball Reference in the seasonal player stats pages. Each page lists all player basic stats from a given season.

The method to collect basic stats is very similar to that of the player bio data. We first write a wrapper to collect, clean, and tidy the html for each page, then industrialize that wrapper to collect data across each page.

library(magrittr)

#Write wrapper to get player stats data

get.basic.stats <- function(x) {
  print(x) #Track web scraping progress

  year <- x %>% str_extract("NBA_(.*?)_per") %>% str_replace("NBA_","") %>% str_replace("_per", "") %>% as.numeric()
    
  html.raw <- x %>% read_html()
  
  vector.raw <-
    html.raw %>%
      html_nodes(".left , .right , .center") %>% html_text()

  #Organize raw vector into a dataframe
  df.raw <- 
    tibble(
      rank                    = vector.raw,
      player                  = lead(vector.raw,1),
      position                = lead(vector.raw,2),
      age                     = lead(vector.raw,3),
      team                    = lead(vector.raw,4),
      games.played            = lead(vector.raw,5),
      games.started           = lead(vector.raw,6),
      minutes                 = lead(vector.raw,7),
      field.goals.made        = lead(vector.raw,8),
      field.goals.attempted   = lead(vector.raw,9),
      field.goal.percentage   = lead(vector.raw,10),
      three.pointer.made      = lead(vector.raw,11),
      three.pointer.attempted = lead(vector.raw,12),
      three.point.percentage  = lead(vector.raw,13),
      two.point.made          = lead(vector.raw,14),
      two.point.attempt       = lead(vector.raw,15), 
      two.point.percentage    = lead(vector.raw,16),
      effective.fg.pct        = lead(vector.raw,17),
      free.throws.made        = lead(vector.raw,18),
      free.throws.attempted   = lead(vector.raw,19),
      free.throw.pct          = lead(vector.raw,20),
      rebounds.offensive      = lead(vector.raw,21),
      rebounds.defensive      = lead(vector.raw,22),
      rebounds.total          = lead(vector.raw,23),
      assists                 = lead(vector.raw,24),
      steals                  = lead(vector.raw,25),
      blocks                  = lead(vector.raw,26),
      turnovers               = lead(vector.raw,27),
      personal.fouls          = lead(vector.raw,28),
      points.per.game         = lead(vector.raw,29))
  
  
  #Clean dataframe
  #Offsets assume the 30-column per-game layout (Rk through PTS): drop the 30 header cells, then keep every 30th row
  df.clean <- 
    df.raw %>%
    .[-c(1:30),] %>%
    .[seq(1,nrow(.),30),] 
  
  #Convert every stat column (games.played through points.per.game) to numeric
  df.clean[,c(6:30)] %<>% lapply(function(x) as.numeric(x))
  df.clean[is.na(df.clean)] <- 0

  #Output
  df.clean %>%
    mutate(year = year) %>%
    filter(rank != "Rk")
}

#Industrialize

sequence.nba.salary <- str_pad(1:16, width = 2, side = "left", pad = "0") #"01" through "16"
url.list.production <- sprintf("https://www.basketball-reference.com/leagues/NBA_20%s_per_game.html",sequence.nba.salary)

data.safe.list <- 
  url.list.production %>% map(safely(get.basic.stats))

data.player.stats.basic <- data.safe.list %>%
  transpose() %>%
  .$result %>%
  bind_rows()

The NBA game has clearly become more open - the distribution of three-point shots taken per game shifts further to the right year after year.

Next Steps

Now that we have the data collected, we can move on to test the hypothesis outlined at the top: are three-point-shots now overvalued in the NBA? And, by extension, are the post-scoring players now a bargain?