How Much Does an NBA Player Really Cost?

Author

Brian Caceres

A new way of looking at a player's value:

This exploration stems from my curiosity about how much each individual basketball player's contributions to a team cost. There are many advanced statistics in the NBA, but none that puts a price on each positive statistic.

NBA teams pay salaries and they want to get their money's worth, so I set out to compare a player's contributions to their salary. I will do this by totaling all positive contributions (such as field goals, steals, blocks, etc.) and dividing the player's salary by that total. This is somewhat similar to salary per minute played, which is another statistic; however, my version gives a better idea of the price of each positive contribution.

Loading Libraries and Data-set.

Also setting a working directory so that the read_csv function pulls the correct data that I stored on my computer.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.3     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.3     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(ggplot2)
library(RColorBrewer)

setwd("/Users/Briancaceres/Desktop/Data_110")
nba_dataset <- read_csv("nba_2022-23_all_stats_with_salary.csv")
New names:
• `` -> `...1`
Rows: 467 Columns: 52
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (3): Player Name, Position, Team
dbl (49): ...1, Salary, Age, GP, GS, MP, FG, FGA, FG%, 3P, 3PA, 3P%, 2P, 2PA...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Taking a look at our data

This will help me decide how I want to go about creating my new variable of price per positive statistic. By looking at the data set, I know I need to first create the total number of positive values per column. For example, I know the total number of games played and the field goals per game, so I can multiply those two variables to get a new column which shows the total number of field goals for the season.
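As a quick check of that arithmetic, here is the calculation for the first row of the table (Stephen Curry: 10 field goals per game over 56 games). The variable names are only for illustration. Because the per-game averages are rounded to one decimal place, season totals computed this way are approximations.

```r
fg_per_game  <- 10   # FG column, first row
games_played <- 56   # GP column, first row

# Per-game average times games played gives the approximate season total
season_field_goals <- fg_per_game * games_played
season_field_goals
```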

nba_dataset
# A tibble: 467 × 52
    ...1 `Player Name` Salary Position   Age Team     GP    GS    MP    FG   FGA
   <dbl> <chr>          <dbl> <chr>    <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
 1     0 Stephen Curry 4.81e7 PG          34 GSW      56    56  34.7  10    20.2
 2     1 John Wall     4.73e7 PG          32 LAC      34     3  22.2   4.1   9.9
 3     2 Russell West… 4.71e7 PG          34 LAL/…    73    24  29.1   5.9  13.6
 4     3 LeBron James  4.45e7 PF          38 LAL      55    54  35.5  11.1  22.2
 5     4 Kevin Durant  4.41e7 PF          34 BRK/…    47    47  35.6  10.3  18.3
 6     5 Bradley Beal  4.33e7 SG          29 WAS      50    50  33.5   8.9  17.6
 7     6 Kawhi Leonard 4.25e7 SF          31 LAC      52    50  33.6   8.6  16.8
 8     7 Paul George   4.25e7 SF          32 LAC      56    56  34.6   8.2  17.9
 9     8 Giannis Ante… 4.25e7 PF          28 MIL      63    63  32.1  11.2  20.3
10     9 Damian Lilla… 4.25e7 PG          32 POR      58    58  36.3   9.6  20.7
# ℹ 457 more rows
# ℹ 41 more variables: `FG%` <dbl>, `3P` <dbl>, `3PA` <dbl>, `3P%` <dbl>,
#   `2P` <dbl>, `2PA` <dbl>, `2P%` <dbl>, `eFG%` <dbl>, FT <dbl>, FTA <dbl>,
#   `FT%` <dbl>, ORB <dbl>, DRB <dbl>, TRB <dbl>, AST <dbl>, STL <dbl>,
#   BLK <dbl>, TOV <dbl>, PF <dbl>, PTS <dbl>, `Total Minutes` <dbl>,
#   PER <dbl>, `TS%` <dbl>, `3PAr` <dbl>, FTr <dbl>, `ORB%` <dbl>,
#   `DRB%` <dbl>, `TRB%` <dbl>, `AST%` <dbl>, `STL%` <dbl>, `BLK%` <dbl>, …

Data cleanup: I want to make all the column names lowercase. Since I had no control over how the data was entered, I do not want any surprises later when I start manipulating the data. I also replaced the spaces in column names with underscores to keep my syntax consistent.

names(nba_dataset) <- tolower(names(nba_dataset))
names(nba_dataset) <- gsub(" ", "_", names(nba_dataset))
nba_dataset
# A tibble: 467 × 52
    ...1 player_name   salary position   age team     gp    gs    mp    fg   fga
   <dbl> <chr>          <dbl> <chr>    <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
 1     0 Stephen Curry 4.81e7 PG          34 GSW      56    56  34.7  10    20.2
 2     1 John Wall     4.73e7 PG          32 LAC      34     3  22.2   4.1   9.9
 3     2 Russell West… 4.71e7 PG          34 LAL/…    73    24  29.1   5.9  13.6
 4     3 LeBron James  4.45e7 PF          38 LAL      55    54  35.5  11.1  22.2
 5     4 Kevin Durant  4.41e7 PF          34 BRK/…    47    47  35.6  10.3  18.3
 6     5 Bradley Beal  4.33e7 SG          29 WAS      50    50  33.5   8.9  17.6
 7     6 Kawhi Leonard 4.25e7 SF          31 LAC      52    50  33.6   8.6  16.8
 8     7 Paul George   4.25e7 SF          32 LAC      56    56  34.6   8.2  17.9
 9     8 Giannis Ante… 4.25e7 PF          28 MIL      63    63  32.1  11.2  20.3
10     9 Damian Lilla… 4.25e7 PG          32 POR      58    58  36.3   9.6  20.7
# ℹ 457 more rows
# ℹ 41 more variables: `fg%` <dbl>, `3p` <dbl>, `3pa` <dbl>, `3p%` <dbl>,
#   `2p` <dbl>, `2pa` <dbl>, `2p%` <dbl>, `efg%` <dbl>, ft <dbl>, fta <dbl>,
#   `ft%` <dbl>, orb <dbl>, drb <dbl>, trb <dbl>, ast <dbl>, stl <dbl>,
#   blk <dbl>, tov <dbl>, pf <dbl>, pts <dbl>, total_minutes <dbl>, per <dbl>,
#   `ts%` <dbl>, `3par` <dbl>, ftr <dbl>, `orb%` <dbl>, `drb%` <dbl>,
#   `trb%` <dbl>, `ast%` <dbl>, `stl%` <dbl>, `blk%` <dbl>, `tov%` <dbl>, …
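A minimal sketch of the same cleanup on a handful of header names written in the style of this dataset. The vector below is illustrative, not read from the file. It also shows that names such as `fg%` and `3p` are still not syntactically valid R names after the cleanup, so they need backticks (or quotes) when referenced later.

```r
# Illustrative header names in the same style as the dataset's columns
raw_names <- c("Player Name", "FG%", "3P", "Total Minutes")

# Lowercase first, then swap literal spaces for underscores
clean_names <- gsub(" ", "_", tolower(raw_names), fixed = TRUE)
clean_names

# make.names() changes any name that is not syntactically valid,
# so a mismatch marks a column that must be wrapped in backticks
needs_backticks <- clean_names != make.names(clean_names)
needs_backticks
```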

I need to create new variables that hold the total field goals (including free throws), total rebounds, total assists, total steals, and total blocks. I will then add all these values together and divide the player's salary by that sum to get a cost per productive value. I will then compare the cost per productive value to a player's "win shares" statistic.

Looking at the chunk below: I used the mutate function to add columns. Within mutate, I added up the made shots per game (three-pointers, two-pointers, and free throws) and multiplied that by games played to get the total number of baskets made in the season. I then used mutate in the same way to create new columns for total rebounds, total assists, total steals, and total blocks. These are all my productive values.

I then used mutate again to create my total production variable, which sums all of these values into one column. I used mutate once more to divide each player's salary by this sum, which gave me the price per productive value I had been looking for. There were a couple of huge outliers, such as Kemba Walker, who unfortunately got injured after signing a huge deal. I used the filter function to keep only the players with a price per productive value of $120,000 or less. This removed my major outliers and kept 452 of the 467 players (about 96.8%).

# Season totals: per-game averages multiplied by games played.
# Columns are referenced directly inside mutate(); names that start with
# a digit need backticks.
stat_totals <- nba_dataset |>
  mutate(
    total_fieldgoals = (`3p` + `2p` + ft) * gp,  # made shots, free throws included
    total_rebounds   = trb * gp,
    total_assist     = ast * gp,
    total_steals     = stl * gp,
    total_blocks     = blk * gp
  )

# Sum every productive category into one column
newstat <- stat_totals |>
  mutate(total_production =
           total_steals +
           total_blocks +
           total_assist +
           total_rebounds +
           total_fieldgoals)

# Salary divided by total production gives the price per productive value
stat_totals2 <- newstat |>
  mutate(price_per_productive = salary / total_production) |>
  select(player_name,
         total_production,
         price_per_productive,
         age, ws, position,
         salary) |>
  filter(price_per_productive <= 120000)

stat_totals2
# A tibble: 452 × 7
   player_name total_production price_per_productive   age    ws position salary
   <chr>                  <dbl>                <dbl> <dbl> <dbl> <chr>     <dbl>
 1 Stephen Cu…            1585.               30332.    34   7.8 PG       4.81e7
 2 John Wall               527                89840.    32   0.3 PG       4.73e7
 3 Russell We…            1716.               27444.    34   1.9 PG       4.71e7
 4 LeBron Jam…            1776.               25035.    38   5.6 PF       4.45e7
 5 Kevin Dura…            1438.               30677.    34   6.8 PF       4.41e7
 6 Bradley Be…            1180                36677.    29   3.4 SG       4.33e7
 7 Kawhi Leon…            1331.               31920.    31   7.1 SF       4.25e7
 8 Paul George            1450.               29297.    32   4.6 SF       4.25e7
 9 Giannis An…            2407.               17657.    28   8.6 PF       4.25e7
10 Damian Lil…            1839.               23111.    32   9   PG       4.25e7
# ℹ 442 more rows
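To make the price calculation and the cutoff concrete, here is a small sketch on made-up data. The players and numbers in `toy` are hypothetical and exist only for illustration. It shows that the filter removes both ordinary outliers and players with zero production, whose price is `Inf` after dividing by zero. The last line recomputes the share of real players kept, using the row counts shown in the output above.

```r
library(dplyr)

# Hypothetical players and numbers, made up purely for illustration
toy <- data.frame(
  player_name      = c("Player A", "Player B", "Player C", "Player D"),
  salary           = c(30000000, 2000000, 9000000, 5000000),
  total_production = c(1500, 800, 20, 0)
)

# Same formula as above: salary divided by total production
priced <- toy |>
  mutate(price_per_productive = salary / total_production)

# Player C is an ordinary outlier; Player D divides by zero and becomes Inf
kept <- priced |>
  filter(price_per_productive <= 120000)

kept$player_name

# Share of the real dataset that passed the cutoff (452 of 467 rows)
round(452 / 467 * 100, 1)
```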

Explaining win shares:

Win shares (ws) is an NBA advanced statistic that credits each player with a portion of the team's wins, based on the player's marginal offense and marginal defense relative to the marginal points per win. That is a fancy way of calculating how many of a team's wins a player contributed. I like to think of it as how much stock a player holds in the team's wins. Just to give perspective, Stephen Curry is credited with 7.8 of Golden State's wins, and they won 44 games total. This stat shows he is very valuable to his team.
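For perspective, the Curry example can be expressed as a share of the team's wins. This is only simple arithmetic on the two numbers quoted above; the variable names are illustrative.

```r
curry_win_shares <- 7.8   # ws value from the table above
team_wins        <- 44    # Golden State's win total cited in the text

# Fraction of the team's wins credited to one player
share_of_wins <- curry_win_shares / team_wins
round(share_of_wins * 100, 1)   # percentage of team wins
```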

Why compare my new variable to win shares?

I wanted to see if there was a trend between a player's price per productive value and how valuable they are in general to their team. I also wanted to spotlight any outliers, either positive or negative, as that is valuable information to a team.

I used geom_point to create a scatter plot with the price per productive value on the y axis and win shares on the x axis. I also used color = position so that it is easier to see which position a data point belongs to. It also makes it easier to cross-reference with the table above to see which player we are looking at.

ggplot(stat_totals2, aes(x = ws,
                         y = price_per_productive,
                         color = position)) +
  # alpha is a fixed setting rather than a mapped variable,
  # so it goes in geom_point() instead of aes()
  geom_point(alpha = 0.7) +
  theme_minimal() +
  labs(title = "NBA: Price Per Productive Value vs. Win Shares",
       y = "Price per productive value*",
       x = "Win Shares",
       color = "Position",  # legend title for the color aesthetic
       caption = "Source: Basketball-Reference.com
       *Productive value is any made field goal (including free throws), rebound, assist, steal, or block
       **Only shows price per productive value <= $120,000 to exclude major outliers, e.g. Kemba Walker's $465,000 pppv") +
  scale_color_brewer(palette = "Paired")

Conclusion Essay

My data cleaning consisted of changing all column names to lowercase. This helps me if I get stuck in my data wrangling because my code is not working and I have to hunt down the proper spelling of a variable. I also noticed that every multi-word variable name used a space as a separator. I used the gsub function to replace all spaces with "_" because that is the syntax I am used to and it helps me stay consistent.

I had a few issues that I needed data wrangling for. I needed the total of each statistic for the whole season instead of the per-game average. I also needed to create my new variable, "price per productive value".

I tackled the first issue by using mutate(). I created a new column for each statistic and multiplied the per-game average by the number of games played. This gave me the season totals. I then used mutate() again to create my total production variable, which summed all my productive season totals. I then divided the yearly salary by the total production, which gave me the price of each productive value.

I then used select() to keep the variables I thought might be of use and to narrow down the variables I needed for my scatter plot. While looking through my data, I realized that I had a few huge outliers that would skew how my scatter plot shows the data distribution. I decided to use the filter() function to return the players whose price per productive value was $120,000 or less.

I decided to use a scatter plot considering I have over 400 players' data to show. I also set the color equal to position, which allows us to see what position a player plays. It also helps narrow down which player any given dot represents, as we can now see the price per productive value (pppv), win shares, and position.

The scatter plot does show the outliers: players who have a very high pppv compared to their win shares. An interesting discovery is how much narrower the spread of the data becomes as win shares increase. This means the most significant players provide very good financial value to their teams as well as on-court performance. This visualization could also be used by a player to negotiate an increased salary. It could also be used by a team to see who is not giving them an adequate return on their investment. There are no inherent surprises. I like that the best-value player with more than 10 win shares also won the MVP.
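One way to check the narrowing numerically, rather than by eye, is to bin players by win shares and compare the spread of pppv within each bin. The sketch below uses made-up values in `toy`, and the bin edges (0, 3, 8) are arbitrary assumptions, not values from the analysis. Swapping in stat_totals2 would apply the same idea to the real data.

```r
# Hypothetical values for illustration only, not taken from the dataset
toy <- data.frame(
  ws = c(0.5, 1.0, 1.5, 4.0, 5.0, 6.0, 9.0, 10.0, 11.0),
  price_per_productive = c(5000, 60000, 110000,
                           10000, 30000, 45000,
                           15000, 20000, 25000)
)

# Group win shares into three bins; the edges are arbitrary choices
toy$ws_bin <- cut(toy$ws,
                  breaks = c(0, 3, 8, Inf),
                  labels = c("low", "mid", "high"))

# Spread = highest minus lowest pppv within each bin
spread <- tapply(toy$price_per_productive, toy$ws_bin,
                 function(x) diff(range(x)))
spread
```

If the spread shrinks from the low bin to the high bin, that supports the visual impression from the scatter plot.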

I would have liked to use plotly to add interactivity. That way we would be able to immediately see who a player is and their data.