Libraries and Reading Data

library(tidymodels) # broom, dials, parsnip, tune, workflows, yardstick
library(tidyverse) # ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, lubridate
library(gghighlight)

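# Read the cleaned data and drop the first column
NBACleanData <- read_csv("Data/model0623.csv", show_col_types = FALSE) %>% select(-1)
# Drop column 80, keep Year as character so it is not scaled, then z-score every numeric column
model_df <- NBACleanData %>% select(-80) %>%
  mutate(Year = as.character(Year)) %>%
  mutate(across(where(is.numeric), ~as.numeric(scale(.))))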

From Pt. 2 of the project, I decided to use 7 clusters. PCA didn’t noticeably improve the results on this dataset, and I need to be able to explain the results concisely, so I’ve opted not to use PCA to reduce dimensionality. I’ve also chosen k-means over hierarchical clustering, as I think it will be a little easier to interpret.
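
Pt. 2 isn’t reproduced here, so how 7 was chosen isn’t shown. As a rough illustration only, one common way to compare candidate cluster counts is to look at the total within-cluster sum of squares for each k. The sketch below uses the built-in mtcars data as a stand-in (an assumption, not the NBA data) and base R only.

```r
# Toy stand-in for the scaled player stats (mtcars ships with R; not NBA data)
toy_scaled <- scale(mtcars)

set.seed(1234)
# Total within-cluster sum of squares for k = 1..6
wss <- sapply(1:6, function(k) {
  kmeans(toy_scaled, centers = k, nstart = 10, iter.max = 100)$tot.withinss
})

# Look for the "elbow" where adding another cluster stops helping much
data.frame(k = 1:6, wss = round(wss, 1))
```

With z-scored columns, the k = 1 value is simply (rows - 1) x (columns), which makes a handy sanity check before reading the rest of the curve.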

Modeling - K-Means w/o PCA

From the last part, here is the model

set.seed(1234)

nba_clust <- kmeans(model_df %>% select(-c(1:4)),
                    iter.max = 1,  # a single iteration; kmeans() may warn that it did not converge
                    nstart = 10,
                    centers = 7)

# head(augment(nba_clust, model_df)[, c(80, 1:9)], 5)
nba_aug <- broom::augment(nba_clust, model_df)  %>%
  rename(Cluster = .cluster) %>%
  select(-c(1:4)) %>%
  select(Cluster, everything())
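
For readers without the NBA file, here is a minimal sketch of the same fit-then-label step on toy data. It uses the built-in iris measurements and base R (both assumptions for illustration); in the code above, broom::augment() does the labelling and returns it in a `.cluster` column.

```r
# Toy data: the four numeric iris columns, z-scored (iris ships with R; not NBA data)
toy <- as.data.frame(scale(iris[, 1:4]))

set.seed(1234)
fit <- kmeans(toy, centers = 3, nstart = 10, iter.max = 100)

# Attach each row's cluster label as the first column
toy_aug <- cbind(Cluster = factor(fit$cluster), toy)

head(toy_aug)
table(toy_aug$Cluster)  # cluster sizes
```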

Analysis - All Data

Create a Dataframe for Plotting Results

options(scipen = 99)
kmeans_centers <- data.frame(Cluster = c(paste0('Cluster ', 1:7)), nba_clust$centers) %>% #creates a column w/ the cluster name
  pivot_longer(!Cluster, names_to = 'feature', values_to = 'center') %>% #pivots longer for easier graphing
  mutate(feature = as.factor(feature)) %>% #makes the feature a factor
  mutate(Cluster = as.factor(Cluster))
head(kmeans_centers)

## # A tibble: 6 × 3
##   Cluster   feature     center
##   <fct>     <fct>        <dbl>
## 1 Cluster 1 FG_pp        1.14 
## 2 Cluster 1 FGA_pp       0.715
## 3 Cluster 1 FG_pct_pp    0.848
## 4 Cluster 1 ThrP_pp     -0.807
## 5 Cluster 1 ThrPA_pp    -0.839
## 6 Cluster 1 ThrP_pct_pp -0.360
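
The reshaping above turns a clusters-by-features matrix of centers into one row per cluster/feature pair. A small base R sketch of the same idea, using made-up centers for 2 clusters and 3 features (the numbers are invented; only the feature names come from the output above):

```r
# Made-up centers: each line below is one feature column (values for cluster 1, cluster 2)
centers <- matrix(c( 1.1, -0.5,
                     0.7,  0.2,
                    -0.8,  0.9),
                  nrow = 2,
                  dimnames = list(NULL, c("FG_pp", "FGA_pp", "ThrP_pp")))

# One row per (cluster, feature) pair; as.vector() walks the matrix column by column
long <- data.frame(
  Cluster = rep(paste0("Cluster ", seq_len(nrow(centers))), times = ncol(centers)),
  feature = rep(colnames(centers), each = nrow(centers)),
  center  = as.vector(centers)
)
long
```

The rows come out ordered by feature rather than by cluster, unlike the pivot_longer() output above, but the content is the same.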

final_df <- model_df %>%
  select(c(1:4)) %>%
  cbind(nba_aug) %>%
  mutate(Cluster = as.character(Cluster))

df_2022 <- final_df %>%
  filter(Year %in% '2022')

The plots below showcase the makeup of each cluster.

Plot 1: Each stat category is listed on the x-axis and the cluster center on the y-axis. The label on each point is the rank of that cluster within the stat category.
- A rank of 1 in the per-game, per-100-possessions, and advanced stat categories (suffixed with pg, pp, and adv) indicates that players in that cluster performed particularly well in that category. The exceptions are the personal foul and turnover categories (PF and TOV), where a high value is less desirable, so the ranking is reversed and a rank of 1 goes to the lowest value.
- A rank of 1 in the shooting stat category is informational rather than desirable or undesirable. It represents, among other stats, the distances where players are taking shots from. For example, it helps identify which players/clusters take 3-point shots vs shots close to the basket.
- Looking at Cluster 1, we see that players in this cluster get a lot of rebounds (Reb_T, Reb_D, Reb_O): they have a high cluster center and rank first within that category. We also see that they commit a lot of fouls and have a poor defensive rating. A high cluster center for fouls is undesirable, and a low cluster center for defensive rating means that these players aren’t good at defense (they are more offensive-minded).
Plot 2: This shows the percentage of stats that fall within each rank for the cluster.
- If there’s a high percentage in ranks 1 and 2, the cluster includes players that generally perform well. If there’s a high percentage in ranks 6 and 7, the cluster includes players that generally don’t contribute as much as the others.
Plot 3: This shows the count of traditional NBA positions in each cluster.
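
The ranking rule behind Plot 1 can be sketched on toy data: rank centers from high to low within each stat, except for foul and turnover stats, where the lowest center gets rank 1. The centers below are made up, and base R’s ave() stands in for the grouped dplyr code used in the plotting function.

```r
# Made-up centers for 3 clusters on one "more is better" and one "less is better" stat
toy <- data.frame(
  Cluster = rep(paste("Cluster", 1:3), times = 2),
  feature = rep(c("PTS_pg", "TOV_pg"), each = 3),
  center  = c(1.2, -0.4, 0.3, 0.9, -0.7, 0.1)
)

# Stats where a high value is undesirable (same list as in the plotting function)
lower_is_better <- c("PF_pg", "PF_pp", "TOV_pg", "TOV_pp", "TOV_pct_adv")

# Flip the sign for "more is better" stats so that rank 1 is always the best value
signed   <- ifelse(toy$feature %in% lower_is_better, toy$center, -toy$center)
toy$rank <- ave(signed, toy$feature, FUN = rank)
toy
```

Here Cluster 1 ranks first in PTS_pg (highest center) but third in TOV_pg (also the highest center, which is the worst outcome for turnovers).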

I’ve included a new position name for each cluster, a summary of each position (aka cluster), and listed some of the players from the 2022 season that fall into each position.

Plot Function for the Clusters

cluster_plots <- function(cluster_value) {
  
  # Geom_point for rankings in each category
  filtered_data <- kmeans_centers %>%
    group_by(feature) %>%
    mutate(rank = ifelse(as.character(feature) %in% c('PF_pg', 'PF_pp', 'TOV_pg', 'TOV_pp', 'TOV_pct_adv'), 
                         rank(center), 
                         rank(desc(center)))
    ) %>% # adjust rank for high values in less desirable stat categories
    arrange(case_when(Cluster == cluster_value ~ 1, TRUE ~ 2), Cluster, rank, desc(center)) %>%
    mutate(Cluster = as.character(Cluster)) %>%
    filter(Cluster == cluster_value)
  
  plot <- kmeans_centers %>%
    group_by(feature) %>%
    mutate(rank = ifelse(as.character(feature) %in% c('PF_pg', 'PF_pp', 'TOV_pg', 'TOV_pp', 'TOV_pct_adv'), 
                         rank(center), 
                         rank(desc(center)))
    ) %>% # adjust rank for high values in less desirable stat categories
    arrange(case_when(Cluster == cluster_value ~ 1, TRUE ~ 2), Cluster, rank, desc(center)) %>%
    ggplot(aes(x = factor(feature, levels = unique(feature)), y = center, color = cluster_value)) +
    geom_point(color = "#5A2D81") +
    geom_text(data = filtered_data, aes(label = rank), color = "black", vjust = -0.5) +  # Add point labels for rank
    theme_minimal() +
    gghighlight(Cluster == cluster_value, use_direct_label = FALSE) +
    theme(legend.position = "none",
          axis.text.x = element_text(angle = 90, size = 9, hjust = 0.5),
          axis.title = element_text(hjust = 0.5),
          plot.title = element_text(hjust = 0.5),
          plot.subtitle = element_text(hjust = 0.5)) +
    labs(x = "Statistic", y = "Cluster Center",
         title = "Visualizing K-Means Cluster Makeups",
         subtitle = cluster_value)
  
  print(plot)
  
  
  # Bar Plot for Percentage of Stat Categories in Each Rank
  filtered_data <- kmeans_centers %>%
    group_by(feature) %>%
    mutate(rank = ifelse(as.character(feature) %in% c('PF_pg', 'PF_pp', 'TOV_pg', 'TOV_pp', 'TOV_pct_adv'), 
                         rank(center), 
                         rank(desc(center)))
    ) %>%
    arrange(case_when(Cluster == cluster_value ~ 1, TRUE ~ 2), Cluster, rank, desc(center)) %>%
    mutate(Cluster = as.character(Cluster)) %>%
    filter(Cluster == cluster_value)
  
  plot <- kmeans_centers %>%
    group_by(feature) %>%
    mutate(rank = ifelse(as.character(feature) %in% c('PF_pg', 'PF_pp', 'TOV_pg', 'TOV_pp', 'TOV_pct_adv'), 
                         rank(center), 
                         rank(desc(center)))
    ) %>%
    ungroup() %>%
    select(Cluster, rank) %>%
    group_by(Cluster, rank) %>%
    summarise(count = n()) %>%
    ungroup() %>%
    group_by(Cluster) %>%
    mutate(percentage = count / sum(count) * 100) %>%
    filter(Cluster == cluster_value) %>%
    ggplot(aes(x = as.factor(rank), y = percentage, fill = as.factor(rank))) +
    geom_bar(stat = "identity") +
    geom_text(aes(label = paste0(round(percentage, 0), "%")), vjust = -0.5, color = "black") +
    labs(x = "Rank", y = "Percentage",
         title = "Percentage of Stat Categories in Each Rank", subtitle = cluster_value) +
    scale_fill_discrete(name = "Rank") +
    scale_x_discrete(limits = as.character(1:7)) +
    theme_minimal()
  
  print(plot)
  
  #Bar plot that shows count of traditional positions in each cluster
  df_2022 <- final_df %>%
    filter(Year %in% '2022')
  
  count_df <- df_2022 %>%
    filter(Cluster == substr(cluster_value, nchar(cluster_value), nchar(cluster_value))) %>%
    group_by(Pos) %>%
    summarise(count = n()) %>%
    ungroup() %>%
    arrange(desc(count))
  
  plot <- ggplot(count_df, aes(x = reorder(Pos, -count), y = count, fill = Pos)) +
    geom_bar(stat = "identity") +
    geom_text(aes(label = count), vjust = -0.5, color = "black") +
    scale_fill_manual(values = c("red", "blue", "green", "yellow", "orange")) +
    labs(x = "Pos", y = "Count", title = paste0("Bar Plot of Traditional Positions in ", cluster_value))
  
  print(plot)
}
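
One detail of the function worth spelling out: the cluster number is taken as the last character of the label. That is fine for 7 clusters, but it would silently pick the wrong number for a two-digit label. A small sketch, where "Cluster 12" is a hypothetical label used only for illustration:

```r
cluster_value <- "Cluster 7"
last_char <- substr(cluster_value, nchar(cluster_value), nchar(cluster_value))
last_char  # "7"

# Hypothetical label with a two-digit cluster number
wide_label <- "Cluster 12"
substr(wide_label, nchar(wide_label), nchar(wide_label))  # "2" (the leading 1 is lost)
sub("^Cluster ", "", wide_label)                          # "12"
```

Stripping the prefix with sub() is one way to make the lookup independent of the number of clusters.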

Cluster Breakdown Analysis

Cluster 1: MVP Bigs

In 2022, there were 20 players in this cluster, ~5% - the smallest cluster
- Notable players include: Anthony Davis, Deandre Ayton, Kristaps Porzingis, Rudy Gobert, Domantas Sabonis, Bam Adebayo
- These players are predominately centers with some power forwards sprinkled in
They rank first or second in ~60% of the stats (second in ~45%) and second to last or last in ~5%.
They’re first or second in almost all rebounding and blocking stats. They’re second in many of the advanced stats, including PER, Win Share, and VORP. They play the second most minutes and score the second most points. They take a lot of two-point shots and shoot a lot of free throws.
They only have lower ranks in a few stats. They turn the ball over a lot and they foul a lot. They have the shortest field goal distance, which isn’t a bad thing, it just means they shoot close to the basket.
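
The rank shares quoted in these breakdowns (for example, "first or second in ~60%") are proportions over the stat categories. A minimal sketch with a made-up rank vector for one cluster across 10 stats:

```r
# Made-up ranks (1 = best of 7 clusters) for one cluster across 10 stat categories
ranks <- c(1, 2, 2, 2, 3, 1, 2, 7, 4, 2)

top_two    <- mean(ranks <= 2)  # share of stats ranked first or second
bottom_two <- mean(ranks >= 6)  # share ranked second to last or last

top_two     # 0.7
bottom_two  # 0.1

# Full breakdown by rank, as shown in Plot 2
round(100 * prop.table(table(factor(ranks, levels = 1:7))))
```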

cluster_plots('Cluster 1')

Cluster 2: 3-Point Backups

In 2022, there were 52 players in this cluster, ~13%
- Notable players include: Aaron Holiday, Donte DiVincenzo, Kent Bazemore, Lou Williams, Rajon Rondo
- 40 of the 52 are guards – with 26 being point guards, there are 11 forwards, and the lone center is Killian Tillie
They rank first or second in only ~10% of the stats and rank second to last or last in ~65%, with ~45% being last.
They’re first or second in only a few stats. They have the highest percentage of shots from 16 feet and beyond and the fewest fouls per game. They have the highest defensive rating (opponent’s points while the player is on the floor), but that’s likely due to them playing the fewest minutes. They’re second in a few other 3-point shooting categories.
They rank the lowest in many stats – minutes, points, rebounds, blocks, shots, and free throws. They’re last in many of the advanced stats as well – PER, BPM, VORP, Win Share.

cluster_plots('Cluster 2')

Cluster 3: Game Generals

In 2022, there were 62 players in this cluster, ~16%
- Notable players include: CJ McCollum, Derrick Rose, Fred VanVleet, Klay Thompson, Marcus Smart, Russell Westbrook
- 41 of the 62 are guards – with 22 being point guards, there are 20 forwards, and the lone center is Kelly Olynyk
They don’t rank first in any stats and rank second in 15%. They only rank second to last or last in 20%, meaning that the majority of their rankings are right in the middle, with 37% ranked third.
They’re strongest in the assist, steal, and 3-point stats. They play the third most minutes and score the third most points. They take command in each game on both the offensive and defensive side and shoot the ball well
They don’t get a lot of blocks or rebounds or take shots close to the basket. They commit a lot of turnovers.

cluster_plots('Cluster 3')

Cluster 4: 3-Point Threats

In 2022, there were 161 players in this cluster, ~41%, by far the largest cluster.
- Notable players include: Alex Caruso, Blake Griffin, Carmelo Anthony, Duncan Robinson, Evan Fournier, Jae Crowder
- 53 are shooting guards, 44 are small forwards, 37 are power forwards, 18 are point guards, and 9 are centers. It’s truly a big mix of players.
They rank first or second in 20%, ranking first in 16%. They rank second to last or last in 30%
They’re first in many of the 3-point stats and turnover stats. When they shoot, they shoot threes.
They don’t take two-point shots, get rebounds, or shoot free throws.

cluster_plots('Cluster 4')

Cluster 5: Shooting Bigs

In 2022, there were 22 players, ~6%
- Notable players include: DeMarcus Cousins, Derrick Favors, Robin Lopez, Serge Ibaka
- The majority are centers (10) and power forwards (7), with the lone shooting guard being Hamidou Diallo
They rank first or second in 6% and second to last or last in ~40%, ranking first in only one stat category and last in only one.
They have a good jump shot. Most of their shots are taken between 10-16 feet and most of their shots are assisted. They don’t typically create shots, but rather serve as an outlet for players driving the lane.
They don’t play a lot of minutes, score points, or get steals and assists. They have low win share and BPM values.

cluster_plots('Cluster 5')

Cluster 6: Franchise Players

In 2022, there were 32 players, ~8%
- Notable players include: Chris Paul, Giannis Antetokounmpo, James Harden, Joel Embiid, Kevin Durant, Nikola Jokic, LeBron James, Luka Doncic, Stephen Curry
- The majority are point guards (13), but there’s a mix of each position.
They rank first in 43% and second in 13% (56% total). They rank second to last or last in only 5%.
These players do it all. These are your all-stars, your future hall of famers, your Finals MVPs. You build a team around these players.
The only downside is that they commit a lot of turnovers.

cluster_plots('Cluster 6')

Cluster 7: Rebounding Bigs

In 2022, there were 43 players, ~11%
- Notable players include: Andre Drummond, Draymond Green, JaVale McGee, Steven Adams, Tristan Thompson
- 33 of the 43 are centers, with the lone shooting guard being Terry Taylor
They rank first or second in ~35%, but second to last or last in 38%
They’re first in rebounding and block stats. They have the highest field goal percentage and take the majority of their shots between 0-3 feet.
They don’t get a lot of points, steals, or assists. They commit a lot of fouls and turnovers, they don’t shoot often, and they almost never take threes.

cluster_plots('Cluster 7')
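
As a quick cross-check, the 2022 cluster sizes quoted in the breakdowns above (20, 52, 62, 161, 22, 32, and 43 players) reproduce the stated percentages:

```r
# 2022 player counts per cluster, copied from the cluster breakdowns above
counts_2022 <- c("MVP Bigs" = 20, "3-Point Backups" = 52, "Game Generals" = 62,
                 "3-Point Threats" = 161, "Shooting Bigs" = 22,
                 "Franchise Players" = 32, "Rebounding Bigs" = 43)

total_2022 <- sum(counts_2022)
total_2022  # 392 players in total

# Share of 2022 players in each cluster, rounded to whole percent
pct_2022 <- round(100 * counts_2022 / total_2022)
pct_2022    # 5 13 16 41 6 8 11
```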

Analysis - Historical and 2022 Players

History of Clusters

The plot below shows how the player types have evolved since the 2000-2001 season; a brief summary of each cluster comes first.

MVP Bigs: Started to taper off in 2017. Went from about 8%-11% down to about 5%-6%. This position is most related to the traditional center role, which has significantly evolved over the years. More centers are taking threes and shooting the ball rather than playing in the post.
3-Point Backups: Was generally between 17% and 23%, but then dropped to 12% and 13% in 2021 and 2022. Some of these players likely became 3-Point Threats or Shooting Bigs in these years.
Game Generals: Fairly consistent over the years, ranging between 16%-24% without any significant increases or decreases. It’s actually the same in 2022 as it was in 2001 - 16%
3-Point Threats: This position has shown the most dramatic increase - it was 6.4% in 2004 and 41% in 2022. The 3-point shot is a huge part of the game today.
Shooting Bigs: This position has shown the most dramatic decrease - it was 28% in 2001 and fell to 2.5% in 2020. A lot of these players have likely become 3-Point Threats.
Franchise Players: These players have always been part of the game, generally around 5%-9%.
Rebounding Bigs: These players have also generally been part of the game, ranging from around 6% to 13%.

ggplot(final_df %>% mutate(Year = as.numeric(Year)),
       aes(x=Year, 
           fill=factor(Cluster))
       ) +
  geom_bar(position="fill") +
  geom_text(
    aes(label = after_stat(paste0(signif(count / tapply(count, x, sum)[as.character(x)] * 100, digits = 2), "%"))), 
    stat="count",
    position=position_fill(vjust=0.5),colour="black", size=3)+
  labs(y="Percentage") +
  scale_x_continuous("Year", breaks=seq(2001,2022,by=1)) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  scale_fill_discrete(name = "Player\nType", labels =c('MVP Bigs', '3-Point\nBackups', 'Game\nGenerals', '3-Point\nThreats','Shooting\nBigs', 'Franchise\nPlayers', 'Rebounding\nBigs')) +
  theme(legend.key.height=unit(2, "cm"))
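
The percentages labelled in the plot above are within-year shares: each cluster’s count divided by that year’s total. A minimal base R sketch of that calculation on made-up rows (the years and cluster labels below are invented for illustration):

```r
# Made-up player rows: one row per player-season
toy <- data.frame(
  Year    = c(2021, 2021, 2021, 2021, 2022, 2022, 2022, 2022, 2022),
  Cluster = c("1",  "4",  "4",  "6",  "4",  "4",  "4",  "6",  "1")
)

# margin = 1 normalises each row (year) so that its shares sum to 1
shares <- prop.table(table(toy$Year, toy$Cluster), margin = 1)
round(100 * shares, 1)
```

In this toy table, cluster 4 goes from 50% of players in 2021 to 60% in 2022.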

2022 - Traditional Positions Vs New Positions - Counts

The plot below displays the count of each traditional player position (C, PF, SF, SG, PG) within each of the new clusters (i.e. new player types).

MVP Bigs
3-point Backups
Game Generals
3-point Threats
Shooting Bigs
Franchise Players
Rebounding Bigs

Of the seven new player types, four have players from each traditional position: 3-Point Backups, Game Generals, 3-Point Threats, and Franchise Players. However, 3-Point Backups and Game Generals have only 1 center each.
MVP Bigs and Rebounding Bigs have only power forwards and centers, except that Rebounding Bigs includes one shooting guard.

ggplot(final_df %>% 
         mutate(Year = as.numeric(Year),
                Cluster = as.numeric(Cluster)) %>%
         filter(Year == 2022),
       aes(x = Cluster, fill = Pos)) +
  geom_bar(aes(y = after_stat(count)), colour = "white") +
  geom_text(stat = "count", aes(label = after_stat(count)), position = position_stack(vjust = 0.5), color = "black") +
  ggtitle("Count of Traditional Position in Each New Player Type - 2022") +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_x_continuous(name = "New Position Name", breaks = c(1, 2, 3, 4, 5, 6, 7),
                     labels = c('MVP Bigs', '3-Point\nBackups', 'Game\nGenerals', '3-Point\nThreats', 'Shooting\nBigs', 'Franchise\nPlayers', 'Rebounding\nBigs')) +
  labs(fill = "Traditional Position", y = "Count") +
  theme(legend.key.height=unit(2, "cm"))

2022 - Traditional Positions Vs New Positions - Percentages

dat <- final_df %>% filter(Year ==2022) 
dat1 <- as.data.frame(prop.table(table(dat$Pos, dat$Cluster), margin = 1))
colnames(dat1) <- c("Pos", "Cluster", "percent")

dat1 <- dat1 %>% 
  group_by(Pos) %>% 
  mutate(Pos_label_y = 1 - (cumsum(percent) - 0.5 * percent)) %>% 
  ungroup()

ggplot(dat1, aes(Pos, y = percent, fill = factor(Cluster))) +
  geom_bar(data = . %>% filter(percent > 0), position = "fill", stat = "identity") +
  scale_y_continuous(labels = scales::percent) +
  geom_text(data = . %>% filter(percent > 0), aes(y = Pos_label_y, label = round(100 * percent, 1))) +
  ggtitle("Percentage of Cluster within Each Traditional Position - 2022") +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_x_discrete(name = "Traditional Position") +
  scale_fill_discrete(name = "New Position Name", labels = c('MVP Bigs', '3-Point\nBackups', 'Game\nGenerals', '3-Point\nThreats', 'Shooting\nBigs', 'Franchise\nPlayers', 'Rebounding\nBigs')) +
  ylab("Percentage") +
  theme(strip.background = element_blank(),
        strip.text = element_blank(),
        legend.key.height = unit(1, "cm"))
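
The `Pos_label_y` column used above places each label at the midpoint of its bar segment. Because the first fill level is stacked at the top of the bar, the midpoint is measured down from 1. A worked example with made-up shares for a single position:

```r
# Made-up shares of three clusters within one traditional position (sum to 1)
percent <- c(0.5, 0.3, 0.2)

# Running total minus half of the current segment gives the segment midpoint from the top;
# subtracting from 1 converts it to a y coordinate measured from the bottom
label_y <- 1 - (cumsum(percent) - 0.5 * percent)
label_y  # 0.75 0.35 0.10
```

The first segment spans 0.5 to 1 (midpoint 0.75), the second 0.2 to 0.5 (midpoint 0.35), and the third 0 to 0.2 (midpoint 0.10).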

2022 - New Positions Per Team

#teams by cluster
ggplot(df_2022 %>% 
         mutate(Year = as.numeric(Year),
                Cluster = as.numeric(Cluster)) %>%
         filter(Year == 2022),
       aes(x = Tm, fill = factor(Cluster))) +
  ggtitle("Count of New Position on Each Team - 2022") +
  geom_bar(position = "fill") + ylab("Proportion") +
  stat_count(geom = "text", 
             aes(label = after_stat(count)),
             position=position_fill(vjust=0.5), colour="white") +
    scale_fill_discrete(name = "New Position", labels =c('MVP Bigs', '3-Point\nBackups', 'Game\nGenerals', '3-Point\nThreats', 'Shooting\nBigs', 'Franchise\nPlayers', 'Rebounding\nBigs')) +
  theme(legend.key.height=unit(1, "cm"))

2022 - New Positions Per Team - above .500

above500_2022 <- c('PHO','MEM','MIA','GSW','DAL','BOS','MIL','PHI','UTA','TOR','DEN','MIN','CHI','BRK','CLE','ATL','CHO','LAC')

ggplot(data = df_2022 %>% filter(Tm %in% above500_2022), aes(x = Tm, fill = factor(Cluster))) +
  geom_bar(position = "fill") + ylab("Proportion by Count") +
  stat_count(geom = "text", 
             aes(label = after_stat(count)),
             position=position_fill(vjust=0.5), colour="Black") +
      scale_fill_discrete(name = "New Position", labels =c('MVP Bigs', '3-Point\nBackups', 'Game\nGenerals', '3-Point\nThreats', 'Shooting\nBigs', 'Franchise\nPlayers', 'Rebounding\nBigs'))  +
  theme(legend.key.height=unit(1, "cm"))

Takeaways

- Each team with a record above .500 had at least 1 Franchise Player.
- Boston, Brooklyn, Chicago, Milwaukee, Philadelphia, and Phoenix had 2 Franchise Players.
- Boston, Dallas, Denver, Golden State, LA Clippers, Minnesota, and Philadelphia didn’t have an MVP Big.
- Cleveland had 2 MVP Bigs.

2022 - New Positions Per Team - below .500

below500_2022 <- c('NYK','NOP','WAS','SAS','LAL','SAC','POR','IND','OKC','DET','ORL','HOU')
ggplot(data = df_2022 %>% filter(Tm %in% below500_2022), aes(x = Tm, fill = factor(Cluster))) +
  geom_bar(position = "fill") + ylab("Proportion by Count") +
  stat_count(geom = "text", 
             aes(label = after_stat(count)),
             position=position_fill(vjust=0.5), colour="Black") +
      scale_fill_discrete(name = "New Position", labels =c('MVP Bigs', '3-Point\nBackups', 'Game\nGenerals', '3-Point\nThreats', 'Shooting\nBigs', 'Franchise\nPlayers', 'Rebounding\nBigs'))  +
  theme(legend.key.height=unit(1, "cm"))

Takeaways

- 5 of the 12 teams with a record below .500 didn’t have a Franchise Player.
- Of the 5, Detroit and Indiana didn’t have an MVP Big.
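
Takeaways like these can be read off the plots, but they can also be computed directly. A minimal sketch on made-up rosters (the team codes and rows are invented; cluster "6" stands for Franchise Players, as in the analysis above):

```r
# Made-up rosters: one row per player, with team and cluster label
toy <- data.frame(
  Tm      = c("AAA", "AAA", "AAA", "BBB", "BBB", "CCC", "CCC", "CCC"),
  Cluster = c("6",   "4",   "6",   "4",   "1",   "6",   "7",   "4")
)

# Number of Franchise Players (cluster 6) on each team
fp_per_team <- tapply(toy$Cluster == "6", toy$Tm, sum)
fp_per_team  # AAA 2, BBB 0, CCC 1

# Teams without a Franchise Player
no_fp <- names(fp_per_team)[fp_per_team == 0]
no_fp        # "BBB"
```

Applied to `df_2022`, the same pattern would give the per-team counts behind both takeaway lists without reading them from the bars.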

Conclusion and Next Steps

Further analysis needs to be done to determine which player types are included on successful teams. This analysis will be used to help determine what kind of players to target so that a team can be successful.

NBA Player Segmentation Pt. 3 - Clustering Analysis Results

Mike Kaminski

2023-06-23