In the last post we wrote web scrapers to collected data on NBA player salaries, basic info, and playing stats over the past 20 years. Now that we have the data collected, we can move on to test the hypothesis outlined at the top: are three-point-shots now overvalued in the NBA? And, by extension, are the post-scoring players now a bargain?
NBA players are often characterized by position: point guard, center, small forward, etc. However, these postions fail to explain the diferences in skill types between players; some times tall guys can shoot, sometimes little guys can rebound, sometimes there is a guy who can do it all.
Although NBA jargon has proliferated to fill the gaps of NBA platonic ideals - words like three-and-D, point-forward, stretch-five, combo-guard, the list goes on - we can take a statistical approach to classification. As a first description of the dataset, let’s use principal component analysis to identify the phenotypes that explain the most of what makes each NBA player unqiue.
First, we clean the player bio data to extract numeric values for height and weight which we will merge with the player statistics data.
data.player.bodies <-
data.player.base.info %>%
rowwise() %>%
#Turn feet-inches into numeric variable for height
mutate(height.feet = str_sub(height,0,1) %>% as.numeric(),
height.inches = str_split(height, "-") %>% unlist() %>% .[2] %>% as.numeric(),
height.total = height.feet + height.inches/12,
weight = as.numeric(weight)) %>%
ungroup() %>%
select(player, height.total, weight)
The player statistics data requires some cleaning as well before it can be used. For example, counting stats must be normalized by minutes per game, player positions must be converted to numeric values, and all data must be scaled to prevent large-valued objects (like total minutes played) from swamping out the rest of the data. Finally, we’ll select only players who have played at least 500 minutes. No scrubs.
data.player.descriptions <-
data.player.stats.basic %>%
select(-X1, -rank) %>%
#Convert traditional positional labels into a numeric value
mutate(position.numeric = case_when(
position == "PG" ~ 1,
position == "SG" ~ 2,
position == "SF" ~ 3,
position == "PF" ~ 4,
position == "C" ~ 5)
) %>%
select(-position, -team) %>%
#Create new variables. Especially, normalize the counting statistics to count per 48 minutes
mutate(games.started.pct = games.started/games.played,
minutes.total = minutes*games.played,
three.point.ratio = three.pointer.attempted/field.goals.attempted,
fga.per.min = 48*field.goals.attempted/minutes,
three.attempt.per.min = 48*three.pointer.attempted/minutes,
fta.per.min = 48*free.throws.attempted/minutes,
rebounds.offensive.per.min = 48*rebounds.offensive/minutes,
rebounds.defensive.per.min = 48*rebounds.defensive/minutes,
steals.per.min = 48*steals/minutes,
blocks.per.min = 48*blocks/minutes,
turnovers.per.min = 48*turnovers/minutes,
fouls.per.min = 48*personal.fouls/minutes,
assist.per.min = 48*assists/minutes
) %>%
#Join with the player body data
left_join(data.player.bodies) %>%
select(player,age,minutes,minutes.total, ends_with("min"), ends_with("pct"), three.point.ratio, height.total, weight, year,position.numeric, points.per.game)
#Create list of variables that need to be scales
list.variables.to.scale <- data.player.descriptions %>%
select(-player, -year, -minutes.total) %>% names()
data.players.clean <-
data.player.descriptions %>%
#No scrubs
filter(minutes.total > 500) %>%
select(-minutes.total) %>%
#Scale each numeric variable
mutate_each_(funs(scale(.) %>% as.vector), vars = list.variables.to.scale) %>%
select(-player, -year)
Now, we can run a principal components analysis with the PCA
function. 74% of the original dataset’s variance can be explained in the first five principal components.
pca <- PCA(data.players.clean, graph = FALSE,ncp = 5)
pca$eig %>%
as_tibble() %>%
rownames_to_column() %>%
.[,c(1,4)] %>%
set_colnames(c("component","cum.pct.of.var")) %>%
mutate(component = as.numeric(component)) %>%
slice(1:5) %>%
ggplot(aes(component, cum.pct.of.var)) +
geom_line() +
theme_minimal() +
labs(title = "Principal Component Analysis",
subtitle = "Five principal components explain 74% of the variance in the NBA player description dataset",
x = "Principal Component",
y = "Cumulative Percentage of Variance (%)")
We’ve collapsed the original descriptive dataset down to five essential NBA phenotypes: Size, Volume Scoring, Shooting, Irrational Confidence, and Age:
This table shows the correlation between each of the five principal components and the underlying data. Blue colors and higher numbers indicator higher correlation and red colors with negative numbers indicate lower correlation. For example, the “Size” phenotype comes with correlations of 0.82 to player weight, 0.83 to player height, and -0.72 for three point ratio - the proportion of a players shots taken behind the arc.
We can be sure now that NBA players actually exhibit pheotypes the way that fans, coaches, and the players themselves speak.
Remember the motivation of the study, have the relative values of three-point shooters and back-to-the-basket scorers diverged to the point where NBA GM’s can be selling high on the first and buying low on the second? To get there, we will first have to create clusters of players to identify them in each bucket. To do that, we’ll use hierarchical clustering.
To do so, we first clean the data for clustering.
data.to.cluster <-
data.player.descriptions %>%
#No Scrubs
filter(minutes.total > 500) %>%
select(-minutes.total) %>%
#Scale variables
mutate_each_(funs(scale(.) %>% as.vector), vars = list.variables.to.scale) %>%
mutate(ID = str_c(player,year, sep = ";")) %>%
select(ID, everything(), -player, -year) %>%
group_by(ID) %>%
arrange(ID, desc(minutes)) %>%
#Select only one player per player year combo; the player with the most minutes
slice(1) %>%
ungroup() %>%
as.data.frame() %>%
column_to_rownames(var = "ID") %>%
na.omit()
Determining how many clusters to use is thorny. According to the elbow and silohette tests for k-means clustering, the ideal number of clusters is 2. While that may minimize statistical measures of in and out of cluster variance, it doesn’t help us in our study.
Instead, I used hierarchical clustering plus a healthy dose of hand-waving to identify the cut off point. In an ideal world, each cluster would be approximately the same size and correspond to a well-known player archetype. It didn’t work out that way. Instead, I looked at the number of players in each cluster for cutoffs ranging from 1 to 20 clusters and found 14 to be the ideal cluster number. Why? Because that is where we last reduced the largest cluster without addiing trivially small clusters beneath it.
library(ggalt)
library(cluster)
library(ggdendro)
get.cluster.counts <- function(k) {
#Track progress
print(k)
data.to.cluster %>%
#Do hierarchical clustering
dist() %>%
hclust() %>%
cutree(k = k) %>%
table() %>%
#Tidy the number of observations in each cluster
as.tibble() %>%
arrange(desc(n)) %>%
mutate(clusters = k) %>%
select(clusters, n)
}
total.cluster.counts <-
c(1:20) %>%
map(get.cluster.counts)
total.cluster.counts %>%
bind_rows() %>%
group_by(clusters) %>%
slice(c(1, n())) %>%
mutate(type = rep(c("High","Low"))) %>%
spread(key = type, value = n) %>%
ggplot(aes(x = High, xend = Low, y = clusters, group = clusters)) +
geom_dumbbell(color="#a3c4dc",
size_x = 2, size_xend = 2,
size = 1,
colour_x = "#0e668b") +
theme_minimal() +
labs(title = "Number of Players in the Largest and Smallest Clusters",
y = "Clusters",
x = "Players")
Now we have the tools to answer our initial question: do teams overpay for three’s and underpay for bigs?
To answer that, I calculated the average salary for each player archetype (cluster) in each year. However, I actually calculated the average salary as a ratio of total salary cap, so that we deflate salaries appropriately.
I should note two things: First, there is no cluster that obviously correlates to three point shooting. The closest one is the final cluster, 14, but that really is the last few Shane Battier seasons. Second, there is one interesting trend here - the prices paid for the starting big, someone like a 2016 Marc Gasol or 2015 Zach Randolph, have the most obvious downtrend.
I don’t think that I’ve appropriately answered the question at hand. In fact, I don’t think that clustering is the right way to think about this problem. First, this doesn’t control for “quality” - however it is measured by archetype (# of rebounds for a player paid to rebound, for example). Second, because none of our clusters isolated three point shooting, we were unable to answer the question about overpaying for three pointers.
Next Steps In my next post, I will use regression techniques to build a model for player salaries. That way, I can control for player quality and three point shooting more directly.