This project applies clustering to NBA player statistics from the 2024/2025 season. The goal is to check whether players can be grouped into meaningful clusters that reflect different playing styles (e.g., scorers, playmakers, rim protectors) using a small set of basic per-game metrics.
The dataset comes from Basketball-Reference (NBA 2025 per-game table) and was exported to Excel. We keep the player’s name for interpretation, but clustering is performed only on numeric performance variables.
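library(tidyverse)   # dplyr, tidyr, stringr, ggplot2
library(readxl)      # read_excel()
library(factoextra)  # fviz_nbclust(), fviz_cluster()
path <- "sportsref_download.xlsx"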
nba_raw <- read_excel(path)
glimpse(nba_raw)
## Rows: 332
## Columns: 10
## $ Rk <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, …
## $ Player <chr> "Shai Gilgeous-Alexander", "Giannis Antetokounmpo", "Nikola Jok…
## $ G <dbl> 76, 67, 70, 50, 79, 72, 62, 52, 70, 65, 46, 75, 47, 58, 51, 50,…
## $ MP <chr> "34.2", "34.2", "36.7", "35.4", "36.3", "36.4", "36.5", "37.7",…
## $ TRB <chr> "5.0", "11.9", "12.7", "8.1999999999999993", "5.7", "8.69999999…
## $ AST <chr> "6.4", "6.5", "10.199999999999999", "7.7", "4.5", "6.0", "4.2",…
## $ STL <chr> "1.7", "0.9", "1.8", "1.8", "1.2", "1.1000000000000001", "0.8",…
## $ BLK <chr> "1.0", "1.2", "0.6", "0.4", "0.6", "0.5", "1.2", "0.4", "0.8", …
## $ TOV <chr> "2.4", "3.1", "3.3", "3.6", "3.2", "2.9", "3.1", "2.4", "4.4000…
## $ PTS <chr> "32.7", "30.4", "29.6", "28.2", "27.6", "26.8", "26.6", "26.3",…
cols_needed <- c("Player","G","MP","PTS","AST","TRB","STL","BLK","TOV")
nba <- nba_raw %>%
  select(all_of(cols_needed))

# Convert text columns to numeric, accepting "," as a decimal separator.
# Values that cannot be parsed become NA (the coercion warning is suppressed).
to_num <- function(x) {
  x_chr <- as.character(x)
  x_chr <- str_replace_all(x_chr, ",", ".")
  suppressWarnings(as.numeric(x_chr))
}

# Apply the conversion to every column except the player name
nba <- nba %>%
  mutate(across(-Player, to_num))
summary(nba)
## Player G MP PTS
## Length:332 Min. :25.00 Min. :15.00 Min. : 3.000
## Class :character 1st Qu.:50.75 1st Qu.:19.90 1st Qu.: 7.675
## Mode :character Median :64.00 Median :25.45 Median :10.400
## Mean :61.30 Mean :25.50 Mean :12.346
## 3rd Qu.:74.00 3rd Qu.:31.20 3rd Qu.:16.200
## Max. :82.00 Max. :37.70 Max. :32.700
## AST TRB STL BLK
## Min. : 0.500 Min. : 1.20 Min. :0.2000 Min. :0.000
## 1st Qu.: 1.400 1st Qu.: 3.00 1st Qu.:0.6000 1st Qu.:0.200
## Median : 2.200 Median : 4.00 Median :0.8000 Median :0.400
## Mean : 2.855 Mean : 4.69 Mean :0.8678 Mean :0.519
## 3rd Qu.: 3.825 3rd Qu.: 5.70 3rd Qu.:1.0250 3rd Qu.:0.600
## Max. :11.600 Max. :13.90 Max. :3.0000 Max. :3.800
## TOV
## Min. :0.300
## 1st Qu.:0.900
## Median :1.200
## Mean :1.462
## 3rd Qu.:1.900
## Max. :4.700
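Because `to_num()` suppresses coercion warnings, unparseable cells turn into `NA` silently. The following is a minimal base-R sketch of the same conversion on made-up strings (the `demo` vector is hypothetical and not taken from the dataset), showing what happens to long floating-point exports, comma decimals, and non-numeric text:

```r
# Base-R equivalent of the to_num() helper (gsub instead of str_replace_all)
to_num_base <- function(x) {
  x_chr <- gsub(",", ".", as.character(x), fixed = TRUE)
  suppressWarnings(as.numeric(x_chr))
}

# Hypothetical inputs: an Excel float artifact, a comma decimal, a non-number
demo <- c("8.1999999999999993", "10,2", "n/a")
converted <- to_num_base(demo)
converted

# Count how many values were lost to NA during conversion
n_lost <- sum(is.na(converted) & !is.na(demo))
n_lost
```

Running a count like `n_lost` on each real column is one way to confirm that no values were dropped before clustering.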
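To reduce noise from players with very small samples, we keep only players who played at least 25 games (G ≥ 25) and averaged at least 15 minutes per game (MP ≥ 15). The exported table already satisfies both thresholds (the minimum G is 25 and the minimum MP is 15), so the filter retains all 332 players.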
For clustering we use a basic and interpretable feature set: PTS, AST, TRB, STL, BLK, TOV.
nba_f <- nba %>%
  filter(!is.na(G), !is.na(MP)) %>%
  filter(G >= 25, MP >= 15) %>%
  drop_na(PTS, AST, TRB, STL, BLK, TOV)
nrow(nba_f)
## [1] 332
players <- nba_f$Player
X <- nba_f %>%
  select(PTS, AST, TRB, STL, BLK, TOV)
K-means is distance-based, so variables should be standardized to a comparable scale; otherwise high-magnitude variables such as PTS would dominate the distance. We use z-score standardization (mean 0, sd 1).
X_scaled <- scale(X)
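As a small illustration of what `scale()` does, the sketch below standardizes a toy table by hand and compares the result with `scale()`. The numbers in `toy` are hypothetical and chosen only to mimic the difference in magnitude between points and blocks:

```r
# Toy per-game numbers (hypothetical, not from the dataset)
toy <- data.frame(
  PTS = c(30, 20, 10, 8),
  BLK = c(0.5, 1.5, 0.2, 0.4)
)

# Built-in z-score standardization
toy_scaled <- scale(toy)

# Same transformation by hand: subtract the column mean, divide by the column sd
manual <- sapply(toy, function(col) (col - mean(col)) / sd(col))

# TRUE if both approaches agree (attributes added by scale() are ignored)
all.equal(toy_scaled, manual, check.attributes = FALSE)

# After scaling, every column has mean 0 and standard deviation 1
round(colMeans(toy_scaled), 10)
apply(toy_scaled, 2, sd)
```

Note that `scale()` stores the original means and standard deviations in the `scaled:center` and `scaled:scale` attributes, which makes it possible to map results back to the original units later.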
We use two common diagnostics: the elbow method (within-cluster sum of squares) and the average silhouette width.
set.seed(123)
fviz_nbclust(X_scaled, kmeans, method = "wss") +
  ggtitle("Elbow method (WSS)")
fviz_nbclust(X_scaled, kmeans, method = "silhouette") +
  ggtitle("Silhouette method")
The elbow plot shows a sharp decrease in within-cluster sum of squares
for small values of k, followed by a noticeable flattening around k=4.
Beyond this point, adding more clusters results in only marginal
improvements, suggesting that four clusters provide a reasonable
trade-off between model complexity and compactness.
The silhouette method indicates the highest average silhouette width for k=2, suggesting a coarse separation of players into two broad groups. However, such a solution is overly simplistic from a basketball perspective. While silhouette values decrease for larger k, the solution with k=4 still shows acceptable separation and allows for a more meaningful, role-based interpretation of players.
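The two diagnostics can also be computed directly, which makes it clearer what the plots summarize. The sketch below is an illustration on synthetic data, not on the NBA table: `toy` is a hypothetical set of three well-separated two-dimensional groups, and the range `ks` and `nstart = 25` are arbitrary choices.

```r
library(cluster)  # silhouette()

set.seed(123)

# Synthetic data (assumption): three well-separated groups of 30 points each
toy <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(60, mean = 3, sd = 0.3), ncol = 2),
  matrix(rnorm(60, mean = 6, sd = 0.3), ncol = 2)
)
toy <- scale(toy)

ks <- 2:6

# Elbow criterion: total within-cluster sum of squares for each k
wss <- sapply(ks, function(k) {
  kmeans(toy, centers = k, nstart = 25)$tot.withinss
})

# Silhouette criterion: mean silhouette width for each k
avg_sil <- sapply(ks, function(k) {
  cl <- kmeans(toy, centers = k, nstart = 25)$cluster
  mean(silhouette(cl, dist(toy))[, "sil_width"])
})

diagnostics <- data.frame(k = ks, wss = wss, avg_sil = avg_sil)
diagnostics

# k with the largest average silhouette width
best_k <- diagnostics$k[which.max(diagnostics$avg_sil)]
best_k
```

On clean synthetic data both criteria tend to agree. On real player statistics they can disagree, as they do here, and the final choice of k then also depends on interpretability.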
set.seed(123)
k <- 4
km <- kmeans(X_scaled, centers = k, nstart = 50)
km$size
## [1] 56 146 40 90
library(cluster)
sil <- silhouette(km$cluster, dist(X_scaled))
mean_silhouette <- mean(sil[, 3])
mean_silhouette
## [1] 0.2722838
For k=4, the average silhouette width is approximately 0.27, indicating weak-to-moderate cluster separation, which is acceptable here given the increased granularity of the solution.
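To make the silhouette value easier to interpret, the sketch below computes it by hand for a single point. For point i, a(i) is the mean distance to the other members of its own cluster, b(i) is the mean distance to the members of the nearest other cluster, and the silhouette width is (b(i) - a(i)) / max(a(i), b(i)). The four points and their labels are hypothetical:

```r
library(cluster)  # silhouette()

# Four hypothetical points in two obvious clusters
pts <- matrix(c(0, 0,
                0, 1,
                5, 5,
                5, 6), ncol = 2, byrow = TRUE)
cl <- c(1, 1, 2, 2)

d <- as.matrix(dist(pts))
i <- 1

# a(i): mean distance from point i to the other members of its own cluster
a_i <- mean(d[i, cl == cl[i] & seq_along(cl) != i])

# b(i): mean distance to the nearest other cluster (only one other cluster here)
b_i <- mean(d[i, cl != cl[i]])

# Silhouette width of point i, computed by hand
s_manual <- (b_i - a_i) / max(a_i, b_i)

# Same quantity from the cluster package
s_pkg <- as.numeric(silhouette(cl, dist(pts))[i, "sil_width"])

c(manual = s_manual, package = s_pkg)
```

Values close to 1 mean a point sits much nearer to its own cluster than to any other, values near 0 mean it lies between clusters, and negative values suggest a possible misassignment.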
fviz_cluster(km, data = X_scaled, geom = "point") +
  ggtitle("K-means clusters of NBA players (scaled features)")
The cluster visualization, which projects the scaled features onto their first two principal components, shows a visible separation between groups, particularly between interior-oriented and playmaking-oriented players, suggesting that the clustering captures meaningful differences in playing styles.
cluster_summary <- nba_f %>%
  mutate(cluster = factor(km$cluster)) %>%
  group_by(cluster) %>%
  summarise(
    n = n(),
    PTS = mean(PTS),
    AST = mean(AST),
    TRB = mean(TRB),
    STL = mean(STL),
    BLK = mean(BLK),
    TOV = mean(TOV),
    .groups = "drop"
  ) %>%
  arrange(cluster)
cluster_summary
## # A tibble: 4 × 8
## cluster n PTS AST TRB STL BLK TOV
## <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 56 21.8 5.84 6.25 1.26 0.534 2.81
## 2 2 146 7.88 1.57 3.53 0.652 0.371 0.851
## 3 3 40 13.3 2.14 8.38 0.762 1.35 1.52
## 4 4 90 13.3 3.40 3.97 1.02 0.381 1.59
Based on the cluster means, the four clusters can be interpreted as follows. Cluster 1 (n = 56) combines the highest scoring (21.8 PTS) with the highest assists (5.84 AST) and turnovers (2.81 TOV), corresponding to high-usage scorers and primary playmakers. Cluster 3 (n = 40) shows elevated rebounds (8.38 TRB) and blocks (1.35 BLK), representing interior and rim-protecting players. Cluster 4 (n = 90) displays more balanced statistics with moderate scoring (13.3 PTS), above-average assists (3.40 AST), and relatively higher steals (1.02 STL), suggesting perimeter-oriented or defensive role contributors. Cluster 2 (n = 146), the largest group, has the lowest mean on every variable, consistent with lower-usage role players.
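The cluster means in original units can also be recovered from the fitted model itself, because `kmeans()` stores centers in scaled units and `scale()` keeps the means and standard deviations it used. The sketch below illustrates this on synthetic data; `toy` and its two groups are hypothetical and only loosely resemble scorers and rebounders:

```r
set.seed(123)

# Synthetic data (assumption): two groups with different scoring and rebounding
toy <- data.frame(
  PTS = c(rnorm(20, mean = 25, sd = 2), rnorm(20, mean = 8, sd = 2)),
  TRB = c(rnorm(20, mean = 5, sd = 1), rnorm(20, mean = 9, sd = 1))
)

toy_scaled <- scale(toy)
km_toy <- kmeans(toy_scaled, centers = 2, nstart = 25)

# Undo the z-score: multiply by the column sd, then add the column mean
centers_orig <- sweep(km_toy$centers, 2, attr(toy_scaled, "scaled:scale"), `*`)
centers_orig <- sweep(centers_orig, 2, attr(toy_scaled, "scaled:center"), `+`)
centers_orig

# Per-cluster means computed directly from the unscaled data
group_means <- aggregate(toy, by = list(cluster = km_toy$cluster), FUN = mean)
group_means
```

Both tables should contain the same values, since a k-means center is the mean of its members and standardization is a linear transformation. Cluster numbering is arbitrary and can change with the seed, so interpretations should refer to the profile of each cluster rather than to its label.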
results <- nba_f %>%
  mutate(cluster = factor(km$cluster)) %>%
  select(Player, cluster, PTS, AST, TRB, STL, BLK, TOV) %>%
  arrange(cluster, desc(PTS))
results %>%
  group_by(cluster) %>%
  slice_head(n = 10)
## # A tibble: 40 × 8
## # Groups: cluster [4]
## Player cluster PTS AST TRB STL BLK TOV
## <chr> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Shai Gilgeous-Alexander 1 32.7 6.4 5 1.7 1 2.4
## 2 Giannis Antetokounmpo 1 30.4 6.5 11.9 0.9 1.2 3.1
## 3 Nikola Jokić 1 29.6 10.2 12.7 1.8 0.6 3.3
## 4 Luka Dončić 1 28.2 7.7 8.2 1.8 0.4 3.6
## 5 Anthony Edwards 1 27.6 4.5 5.7 1.2 0.6 3.2
## 6 Jayson Tatum 1 26.8 6 8.7 1.1 0.5 2.9
## 7 Kevin Durant 1 26.6 4.2 6 0.8 1.2 3.1
## 8 Tyrese Maxey 1 26.3 6.1 3.3 1.8 0.4 2.4
## 9 Cade Cunningham 1 26.1 9.1 6.1 1 0.8 4.4
## 10 Jalen Brunson 1 26 7.3 2.9 0.9 0.1 2.5
## # ℹ 30 more rows
The clustering results suggest that even a small set of basic per-game statistics can separate NBA players into interpretable groups. K-means provides a simple segmentation that aligns with common roles in basketball, such as scorers, playmakers, and interior players. Further extensions could include comparing the results with advanced metrics or adding features (e.g., shooting efficiency) while controlling for scale.