1. Introduction

Traditional basketball analysis often reduces players to points per game or positional labels. In practice, “importance on the court” is much broader: some players matter because they initiate the offense, others because they anchor the defense, sustain high efficiency, or simply because coaches trust them with heavy minutes. These dimensions overlap and are rarely captured by a single statistic.

This project uses unsupervised learning to discover data-driven “importance profiles” from player statistics. The goal is not to reproduce positions, but to uncover different ways players provide value using engineered features that represent offensive involvement, creation, efficiency, defense, versatility, and playing time. The workflow of this project is:

  1. Clustering in the engineered importance-feature space
  2. PCA to identify dominant dimensions and remove redundancy
  3. Re-clustering in the reduced space
  4. Comparison of the two solutions

This design answers a practical question: does dimension reduction clarify and stabilize the structure discovered by clustering, or does it fundamentally change it? If the main profiles persist but cluster quality improves, PCA can be seen as a refinement step that makes the importance structure more visible and interpretable.

2. Data and preprocessing

The dataset contains player average statistics from the Spanish ACB 2024/25 season. The data were collected manually by copying the 2024-2025 Spanish ACB Stats - Averages table from the RealGM website (https://basketball.realgm.com/international/league/4/Spanish-ACB/stats/2025/Averages/All/All/points/All/desc/1/Regular_Season). Each row represents a player’s per-game averages and related box-score indicators for the season.

# Packages used throughout the analysis
library(dplyr)

# Load ACB stats - 2024-25.csv data
acb <- read.csv("ACB stats - 2024-25.csv", sep = ",", stringsAsFactors = FALSE, check.names = FALSE)
head(acb)
##   ID              Player Team GP  MPG PPG FGM FGA   FG% 3PM 3PA   3P% FTM FTA
## 1  1    Aaron Doornekamp  TEN 39 22.6 7.6 2.4 5.4 0.438 1.8 4.2 0.427 1.1 1.2
## 2  2 Aaron Patrick Ganal  AND  6  5.9 1.7 0.3 1.3 0.250 0.0 0.5 0.000 1.0 1.3
## 3  3          Adam Hanga  JOV 35 23.6 7.3 2.5 6.3 0.387 1.3 3.9 0.341 1.0 1.8
## 4  4        Adam Somogyi  BRE  9  6.6 1.8 0.6 2.1 0.263 0.1 0.8 0.143 0.6 0.7
## 5  5  Adrian de la Torre  JOV  3  2.1 0.0 0.0 0.3 0.000 0.0 0.0 0.000 0.0 0.0
## 6  6         A.J. Durham  SJG 34 21.4 9.9 3.2 7.9 0.404 1.0 3.4 0.307 2.6 3.3
##     FT% ORB DRB RPG APG SPG BPG TOV  PF
## 1 0.854 0.8 3.3 4.1 1.4 0.9 0.3 0.4 2.7
## 2 0.750 0.3 0.5 0.8 0.8 0.2 0.0 0.5 1.2
## 3 0.565 0.8 3.2 4.0 2.5 0.8 0.4 1.2 2.7
## 4 0.833 0.1 0.7 0.8 0.8 0.2 0.0 0.3 0.4
## 5 0.000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.3
## 6 0.784 0.4 1.6 2.0 2.3 0.6 0.3 1.4 2.4

Because the table is copied from a league leaderboard, it includes a real-world complication: players who changed teams mid-season appear more than once. These are not data errors - each row represents a separate stint.

# Check for "duplicates" (mid-season transfers): if a player changed team mid-season, they appear more than once
if ("Player" %in% names(acb)) {
  dup_players <- acb %>%
    dplyr::count(Player, name = "n") %>%
    dplyr::filter(n > 1)
  
  if (nrow(dup_players) > 0) {
    message("Players with multiple rows detected (likely transfers):")
    print(dup_players)
  } else {
    message("No duplicated players found.")
  }
}
##                         Player n
## 1                 Amida Brimah 2
## 2                Arturs Kurucs 2
## 3              Jaime Fernandez 2
## 4                  Jay Sorolla 2
## 5                 Kaiser Gates 2
## 6 Keye Van Der Vuurst De Vries 2
## 7                Omar Silverio 2
## 8              Ousmane N'Diaye 2
## 9                 Pep Busquets 2

However, unsupervised learning requires one observation per entity, so the analysis aggregates stints into a single player-season record. Per-game statistics are combined with minutes-weighted averaging, while shooting volume is reconstructed from season totals (per-game values multiplied by games played) so that the recomputed shooting percentages remain internally consistent.

# Transfers handling: aggregate to player-season level

# Stint minutes (used as weights)
acb <- acb %>% mutate(stint_minutes = GP * MPG)

# Aggregate to one row per Player
acb_player <- acb %>%
  group_by(Player) %>%
  summarise(
    Team = dplyr::first(Team),

    # Shooting totals across the season (per-game * games), computed
    # before GP is overwritten by its season total further below
    FGA_tot = sum(FGA * GP, na.rm = TRUE),
    FGM_tot = sum(FGM * GP, na.rm = TRUE),
    `3PA_tot` = sum(`3PA` * GP, na.rm = TRUE),
    `3PM_tot` = sum(`3PM` * GP, na.rm = TRUE),
    FTA_tot = sum(FTA * GP, na.rm = TRUE),
    FTM_tot = sum(FTM * GP, na.rm = TRUE),

    # Games played and total minutes across all stints
    GP = sum(GP, na.rm = TRUE),
    total_minutes = sum(stint_minutes, na.rm = TRUE),

    # Minutes per game as a minutes-weighted average
    MPG = weighted.mean(MPG, w = stint_minutes, na.rm = TRUE),

    # Per-game counting stats (minutes-weighted averages)
    PPG = weighted.mean(PPG, w = stint_minutes, na.rm = TRUE),
    RPG = weighted.mean(RPG, w = stint_minutes, na.rm = TRUE),
    APG = weighted.mean(APG, w = stint_minutes, na.rm = TRUE),
    SPG = weighted.mean(SPG, w = stint_minutes, na.rm = TRUE),
    BPG = weighted.mean(BPG, w = stint_minutes, na.rm = TRUE),
    TOV = weighted.mean(TOV, w = stint_minutes, na.rm = TRUE),
    PF  = weighted.mean(PF,  w = stint_minutes, na.rm = TRUE),
    ORB = weighted.mean(ORB, w = stint_minutes, na.rm = TRUE),
    DRB = weighted.mean(DRB, w = stint_minutes, na.rm = TRUE)
  ) %>%
  ungroup() %>%
  # Converting season totals back to per-game to match the original scale
  mutate(
    FGA = ifelse(GP > 0, FGA_tot / GP, NA_real_),
    FGM = ifelse(GP > 0, FGM_tot / GP, NA_real_),
    `3PA` = ifelse(GP > 0, `3PA_tot` / GP, NA_real_),
    `3PM` = ifelse(GP > 0, `3PM_tot` / GP, NA_real_),
    FTA = ifelse(GP > 0, FTA_tot / GP, NA_real_),
    FTM = ifelse(GP > 0, FTM_tot / GP, NA_real_)
  ) %>%
  # Recalculating shooting percentages from totals to be consistent
  mutate(
    `FG%` = ifelse(FGA_tot > 0, FGM_tot / FGA_tot, NA_real_),
    `3P%` = ifelse(`3PA_tot` > 0, `3PM_tot` / `3PA_tot`, NA_real_),
    `FT%` = ifelse(FTA_tot > 0, FTM_tot / FTA_tot, NA_real_)
  )

# Aggregated dataset
data0 <- acb_player
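
A quick sanity check (a minimal sketch) confirms that each player now appears exactly once:

# Verify one row per player after aggregating stints
stopifnot(!any(duplicated(data0$Player)))
cat("Players after aggregation:", nrow(data0), "\n")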

After aggregation, the dataset contains one row per player-season. We can proceed with exploratory data analysis, importance-feature engineering and the clustering/PCA pipeline.

3. Exploratory Data Analysis (after aggregation)

3.1 Distribution of variables

After aggregating mid-season transfers into season-level player profiles, univariate distributions of key numeric variables were examined:

eda_vars <- c("GP","MPG","PPG","FGA","FGM","FTA","FTM","RPG","APG","SPG","BPG","TOV","PF","ORB","DRB","3PA","3PM")
eda_vars <- eda_vars[eda_vars %in% names(data0)]
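
The plotting code is not shown here; a minimal sketch that produces faceted histograms of these variables (assuming the tidyr and ggplot2 packages are available) is:

library(tidyr)     # pivot_longer()
library(ggplot2)

# One histogram per variable, free scales because the variables differ widely in range
data0 %>%
  select(all_of(eda_vars)) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "value") %>%
  ggplot(aes(x = value)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ variable, scales = "free") +
  ggtitle("EDA (post-aggregation): Distributions of key variables")
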
EDA (post-aggregation): Distributions of key variables

Most offensive volume statistics (e.g. 3PA, FTA, PPG) exhibit strong right skewness, indicating that a small number of players carry a disproportionately high offensive load, while the majority operate in lower-usage roles. Playmaking metrics such as assists per game display a similar pattern, with a clear separation between primary initiators and non-creators.

Defensive and rebounding variables present mixed distributions: defensive rebounds are approximately bell-shaped, whereas offensive rebounds, steals and blocks are highly skewed, reflecting specialist behavior. Playing time (MPG) shows a relatively stable unimodal distribution, supporting its use as a proxy for coach trust and on-court importance.

Overall, the observed heterogeneity, skewness and scale differences across variables support the use of feature standardization, principal component analysis and clustering techniques. The distributions suggest the presence of latent structure rather than random variation, motivating subsequent clusterability testing and unsupervised learning.

3.2 Correlation heatmap
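
The heatmap itself is not reproduced in this text; a minimal sketch of how it can be generated from the aggregated variables (assuming the eda_vars vector defined above and the reshape2 and ggplot2 packages) is:

# Correlation matrix of the aggregated numeric variables
cor_mat <- cor(data0[, eda_vars], use = "pairwise.complete.obs")

# Heatmap via reshape2::melt() + ggplot2
library(reshape2)
library(ggplot2)
cor_long <- melt(cor_mat)
ggplot(cor_long, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile() +
  scale_fill_gradient2(low = "cyan", mid = "white", high = "red", limits = c(-1, 1)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  xlab(NULL) + ylab(NULL) +
  ggtitle("Correlation heatmap of aggregated player statistics")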

A correlation heatmap of the aggregated player-level variables reveals strong positive relationships among offensive volume statistics such as points per game, field-goal attempts, field-goal makes, and free-throw attempts. This indicates that these variables capture a common latent dimension related to offensive usage and scoring responsibility. Similarly, three-point attempts and makes exhibit near-perfect correlation, reflecting shooting specialization.

In contrast, playmaking (assists per game), rebounding, and defensive activity (steals and blocks) display weaker or more variable correlations with scoring measures, suggesting the presence of additional, partially independent dimensions of player contribution. Rebounding metrics, particularly defensive rebounds, form a distinct correlation structure, while offensive rebounds and defensive events behave as specialist indicators.

The presence of highly correlated feature groups alongside less correlated complementary variables provides strong motivation for dimensionality reduction via principal component analysis. PCA allows correlated offensive metrics to be summarized into a small number of components while preserving orthogonal information related to playmaking and defensive impact, thereby enabling more meaningful clustering of player importance profiles.

4. Feature engineering - defining “importance on the court”

A compact set of features intended to capture importance on the court, beyond position, was engineered:

data <- data0 %>%
  mutate(
    # True shooting % proxy (using per-game PPG and per-game FGA/FTA)
    ts_proxy = ifelse(
      (FGA + 0.44 * FTA) > 0,
      PPG / (2 * (FGA + 0.44 * FTA)),
      NA_real_
    ),

    # Usage proxy: shot attempts per minute
    usage_proxy = ifelse(MPG > 0, FGA / MPG, NA_real_),

    # Creation impact: assists scaled by efficiency
    creation_impact = APG * ts_proxy,

    # Defensive impact: rebounding + steals + blocks
    defensive_impact = (ORB + DRB) + SPG + BPG,

    # Efficiency impact: TS proxy
    efficiency_impact = ts_proxy,

    # Minutes-based importance: season total minutes
    importance_minutes = total_minutes
  )

# Versatility index: how balanced PPG, APG, RPG, SPG, BPG are
versatility_stats <- c("PPG", "APG", "RPG", "SPG", "BPG")

# Shannon entropy of a player's stat mix: 0 when all production comes from one
# category, up to log(5) ~ 1.61 when it is perfectly balanced across the five
entropy_row <- function(x) {
  x[is.na(x)] <- 0
  x[x < 0] <- 0

  if (sum(x) == 0) return(0)

  p <- x / sum(x)
  p <- p[p > 0]

  -sum(p * log(p))
}

data$versatility_index <- apply(data[, versatility_stats, drop = FALSE], 1, entropy_row)
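
As a quick illustration on hypothetical stat lines (not taken from the dataset), a balanced contributor scores higher on this index than a one-dimensional scorer:

# Hypothetical per-game lines in the order PPG, APG, RPG, SPG, BPG
entropy_row(c(10, 5, 5, 2, 1))   # balanced contributor -> entropy closer to log(5) ~ 1.61
entropy_row(c(20, 1, 1, 0, 0))   # scoring specialist   -> much lower entropy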

To remove outliers and improve the stability of per-game estimates, low-minute players were filtered out in the next step:

# Keeping only players with enough total minutes for stable estimates
min_minutes <- 100
data_filt <- data %>% filter(total_minutes >= min_minutes)

cat("Players after minute filter:", nrow(data_filt), "\n")
## Players after minute filter: 244

The engineered features were then selected to form the importance-feature matrix and standardized so that each contributes equally to distance-based methods, providing a consistent input for clustering, PCA and their assessment.

importance_features <- c(
  "usage_proxy",
  "creation_impact",
  "efficiency_impact",
  "defensive_impact",
  "versatility_index",
  "importance_minutes"
)

# Keep only players with complete importance features
# (logical filtering keeps X and data_filt aligned row by row)
complete_rows <- complete.cases(data_filt[, importance_features])
data_filt <- data_filt[complete_rows, ]
X <- data_filt %>% select(all_of(importance_features))

# Scaling for clustering/PCA/Hopkins
X_scaled <- scale(X)

5. Initial clustering

5.1 The Hopkins statistic

# install.packages("hopkins")   # run once if the package is not installed
library(hopkins)

set.seed(2024)
hop_value <- hopkins::hopkins(X_scaled, m = nrow(X_scaled) - 1)
cat("Hopkins statistic (importance feature space):", hop_value, "\n")
## Hopkins statistic (importance feature space): 0.9939613

Prior to clustering, the Hopkins statistic was calculated to assess whether the importance-feature space exhibits a meaningful tendency to form clusters rather than random structure. Values near 0.5 indicate spatial randomness, while values approaching 1 indicate a strong clustering tendency. The result, H = 0.9939613, therefore signals a very strong cluster tendency and justifies applying the intended k-means clustering to the engineered importance features.

5.2 Choice of the optimal number of clusters

set.seed(2024)

library(factoextra)   # fviz_nbclust(), fviz_eig(); attaches ggplot2

fviz_nbclust(X_scaled, kmeans, method = "wss") +
  ggtitle("Elbow method for k (importance features)")

fviz_nbclust(X_scaled, kmeans, method = "silhouette") +
  ggtitle("Silhouette method for k (importance features)")

k <- 4

The optimal number of clusters was selected by jointly considering the elbow method and the silhouette coefficient. The elbow plot shows a clear change in slope at k = 4, indicating that four clusters capture the dominant structure of the importance-feature space. The silhouette curve flattens between k = 4 and k = 6, with only marginal improvements beyond four clusters, so in keeping with parsimony and interpretability k = 4 was chosen as the final number of clusters.

5.3 Interpretation of importance-based clusters

The clustering solution revealed four distinct player importance profiles.

set.seed(2024)
km_raw <- kmeans(X_scaled, centers = k, nstart = 50)
data_filt$cluster_raw <- factor(km_raw$cluster)

cluster_profiles_raw <- data_filt %>%
  group_by(cluster_raw) %>%
  summarise(across(all_of(importance_features), \(x) mean(x, na.rm = TRUE)))

print(cluster_profiles_raw)
## # A tibble: 4 × 7
##   cluster_raw usage_proxy creation_impact efficiency_impact defensive_impact
##   <fct>             <dbl>           <dbl>             <dbl>            <dbl>
## 1 1                 0.405           0.640             0.516             2.71
## 2 2                 0.239           0.548             0.563             3.31
## 3 3                 0.364           2.27              0.579             3.38
## 4 4                 0.314           0.740             0.614             5.37
## # ℹ 2 more variables: versatility_index <dbl>, importance_minutes <dbl>
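
To make the numeric profiles above more tangible, the highest-minute players in each cluster can be listed; a minimal sketch (output omitted here) is:

# Top-5 players per raw-feature cluster by total season minutes
data_filt %>%
  group_by(cluster_raw) %>%
  slice_max(importance_minutes, n = 5) %>%
  ungroup() %>%
  select(cluster_raw, Player, Team, importance_minutes, MPG, PPG, APG, RPG) %>%
  arrange(cluster_raw, desc(importance_minutes))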

The first cluster consists of high-usage scorers with relatively low efficiency, defensive impact and versatility. Although these players take a large share of offensive possessions, their overall importance is limited by lower efficiency and reduced playing time.

The second cluster contains low-usage role players who contribute efficiently across multiple dimensions but receive limited minutes on the court. Their importance lies in reliability and balance rather than offensive control of the game.

The third cluster represents primary creators and offensive engines. These players combine high usage with exceptional creation impact, solid efficiency and most importantly - the highest playing time, indicating central importance within team structures as leaders.

The fourth cluster is characterized by the highest defensive impact, excellent efficiency and substantial playing time, despite only moderate offensive usage. These players function as defensive anchors and high-impact two-way contributors whose importance is driven by defensive presence rather than the number of shots taken.
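
The pairwise projections discussed below are not reproduced in this text; a minimal sketch of how they can be drawn (assuming ggplot2, attached via factoextra above) is:

# Pairwise projections of the raw-feature clusters
ggplot(data_filt, aes(x = usage_proxy, y = importance_minutes, colour = cluster_raw)) +
  geom_point() +
  ggtitle("Usage vs playing time by raw-feature cluster")

ggplot(data_filt, aes(x = creation_impact, y = defensive_impact, colour = cluster_raw)) +
  geom_point() +
  ggtitle("Creation vs defensive impact by raw-feature cluster")

ggplot(data_filt, aes(x = usage_proxy, y = efficiency_impact, colour = cluster_raw)) +
  geom_point() +
  ggtitle("Efficiency vs usage by raw-feature cluster")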

Based on these pairwise projections of key features, the structure of the importance-based clustering was assessed prior to dimensionality reduction. In the usage versus playing time space, the clusters clearly separate players with high offensive involvement but limited minutes from those with sustained on-court responsibility, highlighting that importance cannot be determined by usage alone. The creation versus defensive impact plot reveals orthogonal sources of importance, visibly separating primary attackers from defensively dominant contributors. The efficiency versus usage projection further distinguishes volume-driven scorers from efficient high-impact players. These visualizations confirm that the clusters correspond to distinct and interpretable importance profiles rather than arbitrary partitions, motivating the subsequent application of principal component analysis to further understand and refine the structure.

6. PCA of importance features and secondary clustering

6.1 Dimensionality reduction (PCA)

Principal component analysis was applied to the standardized importance-feature matrix to reduce the number of dimensions and reveal latent structure. The first two components explain the majority - 56.6% - of the total variance.

# X_scaled is already standardized, so no further scaling is applied inside prcomp()
pca_res <- prcomp(X_scaled, center = TRUE, scale. = FALSE)
summary(pca_res)

fviz_eig(pca_res, addlabels = TRUE) +
  ggtitle("PCA on importance features: explained variance")
## Importance of components:
##                           PC1    PC2    PC3    PC4     PC5     PC6
## Standard deviation     1.3501 1.2551 1.0789 0.9417 0.64440 0.36835
## Proportion of Variance 0.3038 0.2626 0.1940 0.1478 0.06921 0.02261
## Cumulative Proportion  0.3038 0.5664 0.7604 0.9082 0.97739 1.00000
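
The component interpretations below follow from the loading matrix, which can be inspected directly; a short sketch (using the factoextra variable map) is:

# Loadings of the importance features on the first two components
round(pca_res$rotation[, 1:2], 2)

# Variable correlation circle for PC1/PC2
fviz_pca_var(pca_res, repel = TRUE) +
  ggtitle("PCA variable map (importance features)")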

The first principal component (PC1) captures overall on-court importance: high playing time, efficiency, defensive impact and offensive creation. Therefore, players with high PC1 scores represent high-impact contributors, who are heavily relied upon by their teams across multiple areas.

The second principal component (PC2) distinguishes between different expressions of importance, separating ball-dominant offensive creators from players whose importance is driven more by efficiency and non-scoring contributions.

6.2 Clustering in the reduced space

The number of clusters was reassessed after projecting the data onto the first four principal components, which together explain over 90% of the total variance and therefore keep the information loss within a commonly used bound.

pc_scores <- pca_res$x[, 1:4, drop = FALSE]

fviz_nbclust(pc_scores, kmeans, method = "wss") +
  ggtitle("Elbow method for k (PCA scores, first 4 PCs)")

fviz_nbclust(pc_scores, kmeans, method = "silhouette") +
  ggtitle("Silhouette method for k (PCA scores, first 4 PCs)")

The elbow plot shows a clear change in slope around k = 4, indicating diminishing returns in within-cluster compactness beyond that point. The silhouette score reaches its maximum at k = 6, but the improvement over k = 4 is marginal, while higher values of k would reduce interpretability and break up previously coherent importance profiles. This suggests that PCA enhances cluster compactness and separation by removing redundancy among the importance features, while preserving the underlying four-cluster structure in this case. Consequently, four clusters were retained for both the original and PCA-based clustering solutions.

set.seed(2024)
km_pca <- kmeans(pc_scores, centers = k, nstart = 50)
data_filt$cluster_pca <- factor(km_pca$cluster)

cluster_profiles_pca <- data_filt %>%
  group_by(cluster_pca) %>%
  summarise(
    across(all_of(importance_features), \(x) mean(x, na.rm = TRUE)),
    n_players = n(), .groups = "drop"
  )

print(cluster_profiles_pca)
## # A tibble: 4 × 8
##   cluster_pca usage_proxy creation_impact efficiency_impact defensive_impact
##   <fct>             <dbl>           <dbl>             <dbl>            <dbl>
## 1 1                 0.314           0.740             0.614             5.37
## 2 2                 0.364           2.27              0.579             3.38
## 3 3                 0.405           0.640             0.516             2.71
## 4 4                 0.239           0.548             0.563             3.31
## # ℹ 3 more variables: versatility_index <dbl>, importance_minutes <dbl>,
## #   n_players <int>
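
To back the claim that dimension reduction improves cluster quality, the average silhouette widths of the two partitions can be compared directly; a minimal sketch (assuming the cluster package) is:

library(cluster)   # silhouette()

# Average silhouette width in the original importance-feature space
sil_raw <- silhouette(km_raw$cluster, dist(X_scaled))
# Average silhouette width in the PCA-reduced space (first 4 PCs)
sil_pca <- silhouette(km_pca$cluster, dist(pc_scores))

cat("Mean silhouette (raw features):", mean(sil_raw[, "sil_width"]), "\n")
cat("Mean silhouette (PCA scores):  ", mean(sil_pca[, "sil_width"]), "\n")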

7. Comparison between raw-feature and PCA-based clustering

To evaluate the effect of dimension reduction on the clustering structure, the two k-means solutions obtained from the original importance-feature space and from the PCA-reduced space were compared.

# Cross-tabulation of cluster memberships from the two solutions
tab_compare <- table(
  RawCluster = data_filt$cluster_raw,
  PCACluster = data_filt$cluster_pca
)
print(tab_compare)

library(reshape2)   # melt()
tab_melt <- melt(tab_compare)
tab_melt

ggplot(tab_melt, aes(x = RawCluster, y = PCACluster, fill = value)) +
  geom_tile() +
  geom_text(aes(label = value), color = "black") +
  scale_fill_gradient(low = "cyan", high = "red") +
  ggtitle("Transition between raw-feature and PCA-based clusters") +
  xlab("Clusters on importance features") +
  ylab("Clusters on PCA scores")

library(mclust)   # adjustedRandIndex()
ari_value <- adjustedRandIndex(data_filt$cluster_raw, data_filt$cluster_pca)
cat("Adjusted Rand Index between raw and PCA clusterings:", ari_value)
##           PCACluster
## RawCluster  1  2  3  4
##          1  0  0 53  0
##          2  0  0  0 77
##          3  0 40  0  0
##          4 74  0  0  0
##    RawCluster PCACluster value
## 1           1          1     0
## 2           2          1     0
## 3           3          1     0
## 4           4          1    74
## 5           1          2     0
## 6           2          2     0
## 7           3          2    40
## 8           4          2     0
## 9           1          3    53
## 10          2          3     0
## 11          3          3     0
## 12          4          3     0
## 13          1          4     0
## 14          2          4    77
## 15          3          4     0
## 16          4          4     0

## Adjusted Rand Index between raw and PCA clusterings: 1

The transition matrix reveals a one-to-one correspondence between the clusters of the two solutions, with no players switching groups. Differences in cluster labels arise solely from the arbitrary label assignment of k-means and do not reflect structural changes in the partition, which is confirmed by an Adjusted Rand Index of 1.00. This indicates that retaining the first four principal components, accounting for roughly 90% of the variance, preserved the full cluster structure.

8. Conclusion

This project demonstrated that player importance in professional basketball in the Spanish League is multidimensional and cannot be reduced to scoring output or positional labels alone. By engineering features that capture offensive involvement, playmaking, efficiency, defensive contribution, versatility and playing time, clustering revealed four distinct and interpretable importance profiles: primary offensive engines, volume scorers, defensive anchors and low-usage versatile role players. These profiles reflect different ways players contribute to team success and highlight that importance manifests through multiple, partially independent dimensions.

Principal component analysis proved to be an effective refinement step. Retaining the first four principal components preserved over 90% of the original variance and resulted in a clustering solution that was identical to the raw-feature clustering up to label permutation, confirming the robustness of the discovered structure. At the same time, PCA clarified latent relationships among importance dimensions and reduced redundancy among highly correlated features, improving interpretability without altering the underlying grouping of players. Overall, the results support the use of unsupervised learning combined with feature engineering and dimensionality reduction as a meaningful framework for understanding player importance beyond old-school box-score summaries.