Introduction

This project applies clustering to NBA player statistics from the 2024/2025 season. The goal is to check whether players can be grouped into meaningful clusters that reflect different playing styles (e.g., scorers, playmakers, rim protectors) using a small set of basic per-game metrics.

Data

The dataset comes from Basketball-Reference (NBA 2025 per-game table) and was exported to Excel. We keep the player’s name for interpretation, but clustering is performed only on numeric performance variables.

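library(dplyr)       # pipes, select, mutate, glimpse
library(tidyr)       # drop_na
library(stringr)     # str_replace_all
library(readxl)      # read_excel
library(factoextra)  # fviz_nbclust, fviz_cluster

path <- "sportsref_download.xlsx"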

nba_raw <- read_excel(path)

glimpse(nba_raw)
## Rows: 332
## Columns: 10
## $ Rk     <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, …
## $ Player <chr> "Shai Gilgeous-Alexander", "Giannis Antetokounmpo", "Nikola Jok…
## $ G      <dbl> 76, 67, 70, 50, 79, 72, 62, 52, 70, 65, 46, 75, 47, 58, 51, 50,…
## $ MP     <chr> "34.2", "34.2", "36.7", "35.4", "36.3", "36.4", "36.5", "37.7",…
## $ TRB    <chr> "5.0", "11.9", "12.7", "8.1999999999999993", "5.7", "8.69999999…
## $ AST    <chr> "6.4", "6.5", "10.199999999999999", "7.7", "4.5", "6.0", "4.2",…
## $ STL    <chr> "1.7", "0.9", "1.8", "1.8", "1.2", "1.1000000000000001", "0.8",…
## $ BLK    <chr> "1.0", "1.2", "0.6", "0.4", "0.6", "0.5", "1.2", "0.4", "0.8", …
## $ TOV    <chr> "2.4", "3.1", "3.3", "3.6", "3.2", "2.9", "3.1", "2.4", "4.4000…
## $ PTS    <chr> "32.7", "30.4", "29.6", "28.2", "27.6", "26.8", "26.6", "26.3",…

cols_needed <- c("Player","G","MP","PTS","AST","TRB","STL","BLK","TOV")
nba <- nba_raw %>%
  select(all_of(cols_needed))

# Convert text columns to numeric, accepting a decimal comma or a decimal point.
to_num <- function(x) {
  x_chr <- as.character(x)
  x_chr <- str_replace_all(x_chr, ",", ".")
  suppressWarnings(as.numeric(x_chr))  # values that fail to parse become NA
}

nba <- nba %>%
  mutate(across(-Player, to_num))  # convert every column except Player

summary(nba)
##     Player                G               MP             PTS        
##  Length:332         Min.   :25.00   Min.   :15.00   Min.   : 3.000  
##  Class :character   1st Qu.:50.75   1st Qu.:19.90   1st Qu.: 7.675  
##  Mode  :character   Median :64.00   Median :25.45   Median :10.400  
##                     Mean   :61.30   Mean   :25.50   Mean   :12.346  
##                     3rd Qu.:74.00   3rd Qu.:31.20   3rd Qu.:16.200  
##                     Max.   :82.00   Max.   :37.70   Max.   :32.700  
##       AST              TRB             STL              BLK       
##  Min.   : 0.500   Min.   : 1.20   Min.   :0.2000   Min.   :0.000  
##  1st Qu.: 1.400   1st Qu.: 3.00   1st Qu.:0.6000   1st Qu.:0.200  
##  Median : 2.200   Median : 4.00   Median :0.8000   Median :0.400  
##  Mean   : 2.855   Mean   : 4.69   Mean   :0.8678   Mean   :0.519  
##  3rd Qu.: 3.825   3rd Qu.: 5.70   3rd Qu.:1.0250   3rd Qu.:0.600  
##  Max.   :11.600   Max.   :13.90   Max.   :3.0000   Max.   :3.800  
##       TOV       
##  Min.   :0.300  
##  1st Qu.:0.900  
##  Median :1.200  
##  Mean   :1.462  
##  3rd Qu.:1.900  
##  Max.   :4.700
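
Because to_num() suppresses coercion warnings, a value that cannot be parsed turns into NA without any message. The following is a minimal sketch of a check for that, not part of the original analysis. The helper count_failed() and the example vector are hypothetical, and base gsub() stands in for str_replace_all() so that the block runs without extra packages.

```r
# Same conversion logic as to_num() in the report, written with base R only.
to_num <- function(x) {
  x_chr <- gsub(",", ".", as.character(x), fixed = TRUE)
  suppressWarnings(as.numeric(x_chr))
}

# Hypothetical helper: count entries that were present before conversion
# but are NA afterwards, i.e. values that failed to parse.
count_failed <- function(x) {
  sum(!is.na(x) & is.na(to_num(x)))
}

# Made-up input: two clean values, one decimal comma, one junk string, one NA.
example <- c("34.2", "8.1999999999999993", "7,5", "n/a", NA)

converted <- to_num(example)
n_failed <- count_failed(example)

print(converted)
print(n_failed)
```

Applied to each column of nba_raw before the mutate() step, a count above zero would point to cells that need manual inspection.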

Filtering and feature selection

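To reduce noise from players with very small samples, we keep only players who played at least 25 games (G ≥ 25) and averaged at least 15 minutes per game (MP ≥ 15). The summary above already shows minimum values of G = 25 and MP = 15, so on this export the filter acts as a safeguard rather than removing rows.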

For clustering we use a basic and interpretable feature set: PTS, AST, TRB, STL, BLK, TOV.

nba_f <- nba %>%
  filter(!is.na(G), !is.na(MP)) %>%
  filter(G >= 25, MP >= 15) %>%
  drop_na(PTS, AST, TRB, STL, BLK, TOV)

nrow(nba_f)
## [1] 332
players <- nba_f$Player

X <- nba_f %>%
  select(PTS, AST, TRB, STL, BLK, TOV)

Standardization

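K-means is distance-based, so variables must be standardized to a comparable scale; otherwise high-magnitude variables such as PTS (up to 32.7) would dominate low-magnitude ones such as BLK (up to 3.8) in the Euclidean distance. We use z-score standardization (mean 0, sd 1).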

X_scaled <- scale(X)
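
The z-score transformation can be written out by hand to confirm what scale() does with its default arguments. The sketch below uses a small made-up data frame (X_demo is not taken from the dataset) and compares the manual result with scale().

```r
# Made-up values for illustration only; not rows from the actual dataset.
X_demo <- data.frame(
  PTS = c(32.7, 10.4, 3.0, 16.2),
  BLK = c(1.0, 0.4, 0.0, 0.6)
)

# scale() centers each column on its mean and divides by its sample sd.
X_demo_scaled <- scale(X_demo)

# The same computation written out column by column.
manual <- sapply(X_demo, function(col) (col - mean(col)) / sd(col))

# Compare the numeric values, ignoring the attributes scale() attaches.
same_result <- isTRUE(all.equal(manual, X_demo_scaled, check.attributes = FALSE))

print(round(X_demo_scaled, 3))
print(same_result)
```

After scaling, every column has mean 0 and standard deviation 1, so each feature contributes on the same footing to the distances that k-means minimizes.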

Choosing the number of clusters (k)

We use two common diagnostics: the elbow method (within-cluster sum of squares) and the average silhouette width.

set.seed(123)

fviz_nbclust(X_scaled, kmeans, method = "wss") +
  ggtitle("Elbow method (WSS)")

fviz_nbclust(X_scaled, kmeans, method = "silhouette") +
  ggtitle("Silhouette method")

The elbow plot shows a sharp decrease in within-cluster sum of squares for small values of k, followed by a noticeable flattening around k=4. Beyond this point, adding more clusters results in only marginal improvements, suggesting that four clusters provide a reasonable trade-off between model complexity and compactness.

The silhouette method indicates the highest average silhouette width for k=2, suggesting a coarse separation of players into two broad groups. However, such a solution is overly simplistic from a basketball perspective. While silhouette values decrease for larger k, the solution with k=4 still shows acceptable separation and allows for a more meaningful, role-based interpretation of players.
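
The two diagnostics can also be tabulated numerically, which makes it easier to compare candidate values of k side by side. The sketch below is an illustration on simulated data rather than the player table: X_demo, the range of k, and nstart = 25 are all assumptions chosen for the example.

```r
library(cluster)  # silhouette()

set.seed(123)

# Simulated stand-in data: two Gaussian blobs in two dimensions, then scaled.
X_demo <- scale(rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 4), ncol = 2)
))

ks <- 2:6
d_demo <- dist(X_demo)

# For each k, fit k-means once and record both diagnostics from the same fit.
fits <- lapply(ks, function(k) kmeans(X_demo, centers = k, nstart = 25))

wss <- sapply(fits, function(fit) fit$tot.withinss)
avg_sil <- sapply(fits, function(fit) {
  mean(silhouette(fit$cluster, d_demo)[, "sil_width"])
})

diagnostics <- data.frame(k = ks, wss = wss, avg_sil = avg_sil)
print(diagnostics)
```

Replacing X_demo with X_scaled would give a table for the player data. It need not match the plots exactly, since fviz_nbclust calls kmeans with its default number of random starts.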

K-means clustering

set.seed(123)
k <- 4

km <- kmeans(X_scaled, centers = k, nstart = 50)

km$size
## [1]  56 146  40  90

Cluster quality assessment

library(cluster)

sil <- silhouette(km$cluster, dist(X_scaled))
mean_silhouette <- mean(sil[, 3])
mean_silhouette
## [1] 0.2722838

For k=4, the average silhouette width is approximately 0.27. This indicates fairly weak separation, with noticeable overlap between clusters, but we accept it in exchange for the finer granularity of the solution.
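
A single average can hide differences between clusters, so it may help to break the silhouette width down per cluster. The sketch below shows one way to do that with summary() on a silhouette object. It runs on simulated data, and the names X_demo, km_demo and per_cluster are hypothetical.

```r
library(cluster)  # silhouette()

set.seed(123)

# Simulated stand-in data: three Gaussian blobs in two dimensions.
X_demo <- rbind(
  matrix(rnorm(60, mean = 0), ncol = 2),
  matrix(rnorm(60, mean = 3), ncol = 2),
  matrix(rnorm(60, mean = 6), ncol = 2)
)

km_demo <- kmeans(X_demo, centers = 3, nstart = 25)
sil_demo <- silhouette(km_demo$cluster, dist(X_demo))

# summary() reports cluster sizes and the average width within each cluster.
sil_summary <- summary(sil_demo)

per_cluster <- data.frame(
  cluster = names(sil_summary$clus.avg.widths),
  size = as.integer(sil_summary$clus.sizes),
  avg_width = as.numeric(sil_summary$clus.avg.widths)
)

print(per_cluster)
print(sil_summary$avg.width)
```

Clusters with a low average width are the ones whose members sit close to a neighbouring cluster, which is where role labels are least reliable.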

Cluster visualization

fviz_cluster(km, data = X_scaled, geom = "point") +
  ggtitle("K-means clusters of NBA players (scaled features)")

The cluster visualization plots the players on the first two principal components of the scaled features. It shows separation between groups, particularly between interior-oriented and playmaking-oriented players, which supports the view that the clustering captures meaningful differences in playing styles.
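
With more than two features, fviz_cluster draws the points on the first two principal components, so the plot shows only part of the variation that k-means actually used. The sketch below estimates how much variance those two components retain. It uses simulated data with the same six column names, and every value in X_demo is made up.

```r
set.seed(123)

# Simulated stand-in for the six features; the values are illustrative only.
X_demo <- matrix(
  rnorm(600), ncol = 6,
  dimnames = list(NULL, c("PTS", "AST", "TRB", "STL", "BLK", "TOV"))
)
# Induce some correlation so the components are not all equally important.
X_demo[, "TOV"] <- X_demo[, "AST"] + rnorm(100, sd = 0.3)

# PCA on standardized columns, matching the z-score scaling used for k-means.
pca <- prcomp(X_demo, center = TRUE, scale. = TRUE)

# Share of total variance carried by each component, and by the first two.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
share_2d <- sum(var_explained[1:2])

print(round(var_explained, 3))
print(round(share_2d, 3))
```

If the share for the first two components is low, clusters that overlap in the plot may still be separated along the remaining dimensions, and the reverse can also hold.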

Interpreting clusters

cluster_summary <- nba_f %>%
  mutate(cluster = factor(km$cluster)) %>%
  group_by(cluster) %>%
  summarise(
    n = n(),
    PTS = mean(PTS),
    AST = mean(AST),
    TRB = mean(TRB),
    STL = mean(STL),
    BLK = mean(BLK),
    TOV = mean(TOV),
    .groups = "drop"
  ) %>%
  arrange(cluster)

cluster_summary
## # A tibble: 4 × 8
##   cluster     n   PTS   AST   TRB   STL   BLK   TOV
##   <fct>   <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1          56 21.8   5.84  6.25 1.26  0.534 2.81 
## 2 2         146  7.88  1.57  3.53 0.652 0.371 0.851
## 3 3          40 13.3   2.14  8.38 0.762 1.35  1.52 
## 4 4          90 13.3   3.40  3.97 1.02  0.381 1.59

Based on the cluster means, the four clusters can be interpreted as follows. Cluster 1 (56 players) has the highest scoring (21.8 PTS), assists (5.84 AST), steals (1.26 STL), and turnovers (2.81 TOV), corresponding to high-usage primary scorers and playmakers. Cluster 2 (146 players) has the lowest average on every feature and groups lower-usage role players. Cluster 3 (40 players) shows elevated rebounds (8.38 TRB) and blocks (1.35 BLK), representing interior and rim-protecting players. Cluster 4 (90 players) matches cluster 3 in scoring (13.3 PTS) but has more assists (3.40 AST) and steals (1.02 STL) and fewer rebounds and blocks, suggesting secondary perimeter contributors. Scoring and playmaking therefore do not split into separate clusters; both are concentrated in cluster 1.
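
The cluster means in original units are equivalent to the k-means centers once the z-score scaling is undone, which offers a quick consistency check on the summary table. The sketch below demonstrates the back-transformation on simulated two-feature data. X_demo, km_demo and the other names are hypothetical and the values are made up.

```r
set.seed(123)

# Simulated stand-in data with two loosely separated groups.
X_demo <- data.frame(
  PTS = c(rnorm(30, 22, 3), rnorm(30, 8, 2)),
  TRB = c(rnorm(30, 6, 1.5), rnorm(30, 3.5, 1))
)

X_demo_scaled <- scale(X_demo)
km_demo <- kmeans(X_demo_scaled, centers = 2, nstart = 25)

# scale() stores the column means and sds it used as attributes.
ctr <- attr(X_demo_scaled, "scaled:center")
scl <- attr(X_demo_scaled, "scaled:scale")

# Undo the z-score: multiply each column by its sd, then add its mean.
centers_orig <- sweep(sweep(km_demo$centers, 2, scl, "*"), 2, ctr, "+")

# Cluster means computed directly from the unscaled data.
means_orig <- aggregate(X_demo, by = list(cluster = km_demo$cluster), FUN = mean)

print(round(centers_orig, 2))
print(means_orig)
```

The two printed tables should agree up to rounding, because each k-means center is the mean of the points assigned to it.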

Example players from each cluster

results <- nba_f %>%
  mutate(cluster = factor(km$cluster)) %>%
  select(Player, cluster, PTS, AST, TRB, STL, BLK, TOV) %>%
  arrange(cluster, desc(PTS))

results %>%
  group_by(cluster) %>%
  slice_head(n = 10)
## # A tibble: 40 × 8
## # Groups:   cluster [4]
##    Player                  cluster   PTS   AST   TRB   STL   BLK   TOV
##    <chr>                   <fct>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 Shai Gilgeous-Alexander 1        32.7   6.4   5     1.7   1     2.4
##  2 Giannis Antetokounmpo   1        30.4   6.5  11.9   0.9   1.2   3.1
##  3 Nikola Jokić            1        29.6  10.2  12.7   1.8   0.6   3.3
##  4 Luka Dončić             1        28.2   7.7   8.2   1.8   0.4   3.6
##  5 Anthony Edwards         1        27.6   4.5   5.7   1.2   0.6   3.2
##  6 Jayson Tatum            1        26.8   6     8.7   1.1   0.5   2.9
##  7 Kevin Durant            1        26.6   4.2   6     0.8   1.2   3.1
##  8 Tyrese Maxey            1        26.3   6.1   3.3   1.8   0.4   2.4
##  9 Cade Cunningham         1        26.1   9.1   6.1   1     0.8   4.4
## 10 Jalen Brunson           1        26     7.3   2.9   0.9   0.1   2.5
## # ℹ 30 more rows
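
The printed tibble above is cut off after its first ten rows, so only cluster 1 is visible. One way to display every selected row is to pass n = Inf to print(). The sketch below shows this on a small made-up tibble; the player names and values in demo are hypothetical.

```r
library(dplyr)

# Made-up players: four clusters with three players each.
demo <- tibble(
  Player = paste("Player", LETTERS[1:12]),
  cluster = factor(rep(1:4, each = 3)),
  PTS = c(25.1, 30.2, 21.0, 7.5, 9.1, 6.3, 14.0, 12.2, 15.8, 13.1, 16.4, 11.9)
)

# Top two scorers per cluster, following the same steps as the report.
top_demo <- demo %>%
  arrange(cluster, desc(PTS)) %>%
  group_by(cluster) %>%
  slice_head(n = 2) %>%
  ungroup()

# n = Inf asks the tibble print method to show all rows.
print(top_demo, n = Inf)
```

Applied to results with slice_head(n = 10), the same call would list all 40 rows, ten per cluster.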

Conclusion

The clustering results suggest that even a small set of basic per-game statistics can separate NBA players into interpretable groups. K-means provides a simple segmentation that aligns with common roles in basketball, such as high-usage scorers and playmakers, interior players, and lower-usage role players. Further extensions could include comparing the results with advanced metrics or adding features (e.g., shooting efficiency) while controlling for scale.