League of Legends is one of the most played competitive games in the world, featuring over 160 unique champions each with distinct combat roles, resource mechanics, and stat profiles. Balancing such a large and diverse roster is a significant design challenge and one that data-driven analysis can help illuminate. This project applies unsupervised learning to a dataset of League of Legends champion base statistics sourced from Kaggle (Cute Dango, League of Legends Champions dataset, available at kaggle.com/datasets/cutedango/league-of-legends-champions) to discover whether champions naturally cluster into distinct statistical archetypes, and which features drive those groupings.
The workflow combines three complementary approaches:
Hard clustering (K-Means, Hierarchical) to identify stable, discrete champion archetypes.
Soft clustering (Fuzzy C-Means) to quantify champion hybridity, that is, how strongly each champion belongs to one archetype versus another.
Dimensionality reduction (PCA, MDS, UMAP, t-SNE, SOM) to visualize the structure of the feature space and validate clustering results across multiple independent methods.
Research questions
How many distinct champion archetypes emerge when clustering by numerical base statistics?
Which attributes best discriminate between champion types, and what do they reveal about Riot’s design philosophy?
Can fuzzy membership scores meaningfully quantify champion hybridity and identify statistically unusual designs?
Why this matters
Understanding the statistical structure of champion design has practical implications for game balance, champion categorization, and the study of how design constraints shape a competitive roster. It also serves as a case study in applying unsupervised learning to a real-world dataset where ground truth labels exist (champion roles) but are deliberately withheld to test whether the data alone recovers meaningful structure.
library(tidyverse)
library(cluster)
library(factoextra)
library(NbClust)
library(e1071)
library(gridExtra)
library(ggrepel)
library(DT)
library(mclust)
library(ggplot2)
library(corrplot)
lol <- read.csv("LoL_champions.csv")
cat("Dataset loaded successfully!\n")
## Dataset loaded successfully!
cat("Total champions:", nrow(lol), "\n")
## Total champions: 167
cat("Total variables:", ncol(lol), "\n")
## Total variables: 24
To ensure the clustering model discovers patterns based purely on numerical combat statistics, rather than pre-existing classifications, several columns were removed or transformed before analysis.
Removed columns: Tags (champion type, e.g. Mage, Assassin), Role (lane assignment, e.g. top, jungle, mid, bot), and Name were all dropped. Including these would risk grouping champions by their assigned identity rather than their underlying stat profiles.
Resource type encoding: Resourse.type was converted into a binary variable (resource_bin), where 1 = Mana and 0 = any other resource. Mana is by far the most common resource in the dataset; the remaining champions either use a champion-specific resource or none at all, so a binary encoding captures the most meaningful distinction without introducing unnecessary categories.
Range deduplication: Range.type (a categorical melee/ranged label) was removed in favour of the numeric Attack.range variable, which carries the same information in a more precise and model-friendly form. Keeping both would artificially double the weight of range in the distance calculations.
After these transformations, the dataset contains no missing values. It was standardised to mean = 0, sd = 1 before clustering so that all features contribute on a comparable scale regardless of their original units.
The final processed dataset contains 167 champions and 20 numerical features, which served as the input for all subsequent clustering and dimensionality reduction methods.
# Drop identity columns (Name, Tags, Role) that would leak predefined groupings,
# encode the resource type as a binary flag, and drop Range.type because
# Attack.range already carries the same information.
lol_clean <- lol %>%
  select(-Name, -Tags, -Role) %>%
  mutate(resource_bin = ifelse(Resourse.type == "Mana", 1, 0)) %>%
  select(-Range.type, -Resourse.type)
cat("Missing values present:", any(is.na(lol_clean)), "\n")
## Missing values present: FALSE
if (any(is.na(lol_clean))) {
lol_clean <- lol_clean %>% drop_na()
}
cat("\nProcessed data structure:\n")
##
## Processed data structure:
str(lol_clean)
## 'data.frame': 167 obs. of 20 variables:
## $ Base.HP : int 650 590 600 630 685 685 550 560 580 640 ...
## $ HP.per.lvl : int 114 104 119 107 120 94 92 102 102 101 ...
## $ Base.mana : int 0 418 200 350 350 285 495 418 348 280 ...
## $ Mana.per.lvl : num 0 25 0 40 40 40 45 25 42 35 ...
## $ Movement.speed : int 345 330 345 330 330 335 325 335 325 325 ...
## $ Base.armor : int 38 21 23 26 47 33 21 19 26 26 ...
## $ Armor.per.lvl : num 4.8 4.7 4.7 4.7 4.7 4 4.9 4.7 4.2 4.6 ...
## $ Base.magic.resistance : int 32 30 37 30 32 32 30 30 30 30 ...
## $ Magic.resistance.per.lvl : num 2.05 1.3 2.05 1.3 2.05 2.05 1.3 1.3 1.3 1.3 ...
## $ Attack.range : int 175 550 125 500 125 125 600 625 550 600 ...
## $ HP.regeneration : num 3 2.5 9 3.75 8.5 9 5.5 5.5 3.25 3.5 ...
## $ HP.regeneration.per.lvl : num 0.5 0.6 0.9 0.65 0.85 0.85 0.55 0.55 0.55 0.55 ...
## $ Mana.regeneration : num 0 8 50 8.2 8.5 7.4 8 8 6.5 7 ...
## $ Mana.regeneration.per.lvl: num 0 0.8 0 0.7 0.8 0.55 0.8 0.8 0.4 0.65 ...
## $ Attack.damage : int 60 53 62 52 62 57 51 50 55 59 ...
## $ Attack.damage.per.lvl : num 5 3 3.3 3 3.75 3.8 3.2 2.65 2.3 2.95 ...
## $ Attack.speed.per.lvl : num 2.5 2.2 3.2 4 2.12 ...
## $ Attack.speed : num 0.651 0.668 0.625 0.638 0.625 0.736 0.658 0.61 0.64 0.658 ...
## $ AS.ratio : num 0.651 0.625 0.625 0.4 0.625 0.638 0.625 0.625 0.64 0.658 ...
## $ resource_bin : num 0 1 0 1 1 1 1 1 1 1 ...
lol_scaled <- scale(lol_clean)
cat("\n✓ Data standardized (mean=0, sd=1)\n")
##
## ✓ Data standardized (mean=0, sd=1)
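To illustrate why the standardisation step matters for distance-based clustering, here is a minimal sketch in base R. It uses only two features and the first four values of each as printed in the str() output above; the object names (toy, hp_share_raw, hp_share_scaled) are introduced for illustration and are not part of the original analysis.

```r
# Toy data: first four values of two features, as printed by str(lol_clean) above
toy <- data.frame(
  Base.HP  = c(650, 590, 600, 630),
  AS.ratio = c(0.651, 0.625, 0.625, 0.400)
)

# Euclidean distances before and after z-scoring
d_raw      <- as.matrix(dist(toy))
toy_scaled <- scale(toy)                 # (x - mean) / sd, column by column
d_scaled   <- as.matrix(dist(toy_scaled))

# Share of the squared distance between rows 1 and 4 that comes from Base.HP
hp_share_raw <- (toy$Base.HP[1] - toy$Base.HP[4])^2 / d_raw[1, 4]^2
hp_share_scaled <- (toy_scaled[1, "Base.HP"] - toy_scaled[4, "Base.HP"])^2 /
  d_scaled[1, 4]^2

print(round(c(raw = hp_share_raw, scaled = hp_share_scaled), 4))
```

On the raw scale, Base.HP accounts for almost all of the distance simply because its units are larger; after scaling, the large AS.ratio difference between these two rows is no longer drowned out.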
Before running clustering, we test whether the numeric feature space shows non-random structure (i.e., whether clustering is meaningful) using the Hopkins statistic.
set.seed(123)
res <- get_clust_tendency(lol_scaled, n = nrow(lol_scaled)-1, graph = TRUE)
cat("Hopkins Statistic:", round(res$hopkins_stat, 4), "\n")
## Hopkins Statistic: 0.7601
# Interpretation
if(res$hopkins_stat > 0.7) {
cat("Interpretation: The score above 0.7 indicates a strong clustering tendency (highly non-random structure).\n")
} else if(res$hopkins_stat >= 0.5) {
cat("Interpretation: The score above 0.5 suggests some structure exists, but clusters may not be sharply defined.\n")
} else {
cat("Interpretation: The score below 0.5 gives no evidence of clustering tendency (data close to random or evenly spaced).\n")
}
## Interpretation: The score above 0.7 indicates a strong clustering tendency (highly non-random structure).
# Visual Assessment of Cluster Tendency (VAT)
res$plot +
labs(title = "Visual Assessment of Cluster Tendency (VAT)",
subtitle = paste("Hopkins Statistic:", round(res$hopkins_stat, 4)),
caption = "Interpretation: The presence of distinct red square-shaped blocks along the diagonal \nconfirms the existence of natural clusters in the League of Legends champion data.")
The VAT plot displays the ordered dissimilarity matrix of the champions.
The distinct red blocks visible along the diagonal represent groups of
champions that are highly similar to each other but different from other
groups. Because these blocks are clearly defined, the plot visually
supports the presence of natural archetypes (clusters), in line with the
high Hopkins statistic, and justifies the further use of K-means and
Hierarchical clustering. Having established that the data is
clusterable, we can move on to choosing the number of clusters.
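To make the Hopkins statistic less of a black box, the following base R sketch implements one common formulation: it compares nearest-neighbour distances from uniformly generated points (u) with nearest-neighbour distances from sampled real points (w), and returns sum(u) / (sum(u) + sum(w)). The function name hopkins_sketch and the toy data are assumptions made for illustration; get_clust_tendency() may differ in sampling and implementation details, so the values are not expected to match it exactly.

```r
# One common formulation of the Hopkins statistic (illustrative sketch only).
# Values near 1 suggest clustered data; values near 0.5 suggest random data.
hopkins_sketch <- function(X, m = 10) {
  X <- as.matrix(X)
  n <- nrow(X)
  d <- ncol(X)

  # Distance from point p to its nearest neighbour among the rows of Y
  nn_dist <- function(p, Y) sqrt(min(rowSums(sweep(Y, 2, p)^2)))

  # m artificial points drawn uniformly inside the bounding box of the data
  mins <- apply(X, 2, min)
  maxs <- apply(X, 2, max)
  U <- matrix(sapply(seq_len(d), function(j) runif(m, mins[j], maxs[j])),
              nrow = m)

  # m real points sampled without replacement
  idx <- sample(n, m)

  u <- sapply(seq_len(m), function(i) nn_dist(U[i, ], X))
  w <- sapply(idx, function(i) nn_dist(X[i, ], X[-i, , drop = FALSE]))

  sum(u) / (sum(u) + sum(w))
}

set.seed(42)
# Two tight, well-separated blobs versus structureless uniform noise
blobs <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.1), ncol = 2),
  matrix(rnorm(100, mean = 5, sd = 0.1), ncol = 2)
)
noise <- matrix(runif(200), ncol = 2)

h_blobs <- hopkins_sketch(blobs, m = 20)
h_noise <- hopkins_sketch(noise, m = 20)
print(round(c(blobs = h_blobs, noise = h_noise), 3))
```

The clustered toy data should score well above the uniform noise, which is the same contrast the 0.76 reported above is meant to capture.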
fviz_nbclust(lol_scaled, kmeans, method = "wss") +
labs(title = "Elbow Method for Optimal K",
subtitle = "Look for the 'elbow' where improvement slows") +
theme_minimal() +
geom_vline(xintercept = 3, linetype = "dashed", color = "red")
The Elbow Method was used to determine the optimal number of clusters by
plotting the Total Within-Cluster Sum of Squares (WSS) against the
number of clusters (K). As shown in the plot, a distinct ‘elbow’ or
‘knee’ point is visible at K = 3, where the rate of decrease in WSS
begins to level off significantly. To clearly highlight this transition,
a red dashed line has been added at K = 3. Because the elbow is a visual
heuristic, K = 3 is treated here as a parsimonious starting point that
balances model complexity against cluster compactness.
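The quantity behind the elbow plot can be computed directly. The sketch below, on synthetic data with three well-separated groups (an assumption made purely for illustration), collects the total within-cluster sum of squares for K = 1 to 6 and the drop gained by each additional cluster.

```r
set.seed(123)
# Synthetic data: three blobs centred at (0, 0), (6, 6) and (0, 6)
toy <- rbind(
  cbind(rnorm(30, 0), rnorm(30, 0)),
  cbind(rnorm(30, 6), rnorm(30, 6)),
  cbind(rnorm(30, 0), rnorm(30, 6))
)

# WSS for K = 1 is the total sum of squares around the column means
wss_k1 <- sum(scale(toy, center = TRUE, scale = FALSE)^2)

# WSS for K = 2..6 from multi-start k-means
wss <- c(
  wss_k1,
  sapply(2:6, function(k) kmeans(toy, centers = k, nstart = 25)$tot.withinss)
)

# Improvement gained by each additional cluster; the elbow is where it collapses
drops <- -diff(wss)
print(data.frame(K = 1:6, WSS = round(wss, 1), Drop = c(NA, round(drops, 1))))
```

In this toy case the drop from K = 2 to K = 3 is far larger than the drop from K = 3 to K = 4, which is the pattern the elbow plot is read for.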
fviz_nbclust(lol_scaled, kmeans, method = "silhouette") +
labs(title = "Silhouette Heuristic for K",
subtitle = "Higher average silhouette width indicates better clustering") +
theme_minimal() +
geom_vline(xintercept = 3, linetype = "dashed", color = "red")
The Silhouette Method was employed to validate cluster quality by
measuring how well each object lies within its cluster. The blue dashed
line marks the number of clusters with the highest average silhouette
width in this plot (K=2). For the purpose of this research, and to
maintain better interpretability of champion archetypes, we have
highlighted our previously selected K=3 with a red dashed line. K=3
allows a more granular and meaningful separation of League of Legends
roles (e.g., distinguishing between tanks, mages, and marksmen) without
a substantial drop in silhouette quality. Note that this plot is a quick
heuristic: the multi-start silhouette scan reported later in this
analysis ranks K=3 above K=2, so the K=2 optimum shown here should not
be over-interpreted.
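For reference, the silhouette width of a point is s = (b - a) / max(a, b), where a is its mean distance to the other members of its own cluster and b is its mean distance to the nearest other cluster. The base R sketch below computes it by hand on toy data; the function name silhouette_widths is hypothetical, and returning 0 for singleton clusters is a convention assumed here.

```r
# Hand-rolled silhouette widths (illustrative; cluster::silhouette is the
# function actually used in this analysis).
silhouette_widths <- function(X, cl) {
  D  <- as.matrix(dist(X))
  ks <- sort(unique(cl))
  sapply(seq_len(nrow(D)), function(i) {
    own <- cl == cl[i]
    own[i] <- FALSE                      # exclude the point itself
    if (!any(own)) return(0)             # singleton cluster: assumed convention
    a <- mean(D[i, own])                 # cohesion: mean distance within cluster
    b <- min(sapply(setdiff(ks, cl[i]),  # separation: nearest other cluster
                    function(k) mean(D[i, cl == k])))
    (b - a) / max(a, b)
  })
}

set.seed(7)
toy <- rbind(
  matrix(rnorm(50, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(50, mean = 5, sd = 0.5), ncol = 2)
)
labels <- rep(1:2, each = 25)

sil_toy <- silhouette_widths(toy, labels)
print(round(mean(sil_toy), 3))
```

A negative value means a point sits, on average, closer to another cluster than to its own, which is how the negative silhouettes discussed in this analysis should be read.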
To reach a more robust conclusion regarding the optimal number of clusters, the NbClust package was used. It is a validation framework that implements up to 30 indices (such as Hubert, Duda, and Beale) and reports the number of clusters each one recommends; with the default index = "all" used here, a few computationally expensive indices are skipped.
set.seed(123)
nb_result <- NbClust(
lol_scaled,
distance = "euclidean",
min.nc = 2, max.nc = 10,
method = "kmeans"
)
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 1 proposed 2 as the best number of clusters
## * 1 proposed 4 as the best number of clusters
## * 2 proposed 9 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 9
##
##
## *******************************************************************
min_k <- 2
max_k <- 10
row1 <- as.numeric(nb_result$Best.nc[1, ])
row2 <- as.numeric(nb_result$Best.nc[2, ])
is_valid_k_row <- function(x, min_k, max_k) {
x <- x[is.finite(x)]
if (length(x) == 0) return(FALSE)
mean(x >= min_k & x <= max_k & abs(x - round(x)) < 1e-8) > 0.8
}
k_votes <- if (is_valid_k_row(row1, min_k, max_k)) row1 else row2
k_votes <- k_votes[is.finite(k_votes)]
k_votes <- k_votes[k_votes >= min_k & k_votes <= max_k]
k_votes <- round(k_votes)
votes_table <- sort(table(k_votes), decreasing = TRUE)
k_nbclust <- as.integer(names(votes_table)[1])
cat("NbClust consensus (majority vote): K =", k_nbclust, "\n\n")
## NbClust consensus (majority vote): K = 3
cat("Votes per K:\n")
## Votes per K:
print(votes_table)
## k_votes
## 3 2 5 9 8
## 7 4 4 4 1
The final validation using the NbClust package adds statistical support for our clustering structure. By evaluating many internal indices at once, the procedure produces a distribution of ‘votes’ for the optimal number of clusters.
Tallying the per-index recommendations stored in Best.nc, K = 3 received the most votes (7 of 20). K=2, K=5, and K=9 each received 4 votes, so K=3 is a plurality choice rather than an outright majority. NbClust's own printed summary above names a different K based on only a handful of indices, which is why the votes were re-tallied directly from Best.nc. The K=3 result also aligns with the ‘elbow’ observed in our previous analysis.
The Hubert and D indices are graphical methods: the ‘knee’, or significant peak in the second-differences plot, is read off visually and was taken as further support for the 3-cluster solution. Choosing K=3 keeps the champion archetypes statistically distinct while remaining broad enough to represent the core gameplay roles in League of Legends.
cat("Elbow suggests K ≈ 3\n")
## Elbow suggests K ≈ 3
cat("NbClust majority suggests K =", k_nbclust, "\n")
## NbClust majority suggests K = 3
cat("The initial silhouette heuristic plot may suggest a smaller K (often K=2).\n")
## The initial silhouette heuristic plot may suggest a smaller K (often K=2).
cat("We start with K=3 for interpretability, then re-check using detailed silhouette diagnostics.\n\n")
## We start with K=3 for interpretability, then re-check using detailed silhouette diagnostics.
k_final <- 3
set.seed(123)
kmeans_result <- kmeans(lol_scaled, centers = k_final, nstart = 25)
lol$cluster <- factor(kmeans_result$cluster)
cat("K used:", k_final, "\n")
## K used: 3
print(setNames(table(lol$cluster), paste("Clusters size",names(table(lol$cluster)))))
## Clusters size 1 Clusters size 2 Clusters size 3
## 78 27 62
fviz_cluster(
kmeans_result, data = lol_scaled,
geom = "point",
ellipse.type = "convex",
palette = c("coral", "darkgreen", "steelblue"),
ggtheme = theme_minimal()
) +
labs(
title = paste0("K-Means Clustering (K=", k_final, ")"),
subtitle = paste(nrow(lol), "champions clustered in", ncol(lol_clean), "dimensional space")
)
sil <- silhouette(kmeans_result$cluster, dist(lol_scaled))
fviz_silhouette(sil) +
labs(
title = paste0("Silhouette Plot for K-Means (K=", k_final, ")"),
subtitle = "K=3 initial solution"
)
## cluster size ave.sil.width
## 1 1 78 0.30
## 2 2 27 0.12
## 3 3 62 0.29
avg3 <- mean(sil[, 3])
neg3 <- sum(sil[, 3] < 0)
cat(" SILHOUETTE SUMMARY (K=3) \n")
## SILHOUETTE SUMMARY (K=3)
cat("Avg silhouette:", round(avg3, 4), "\n")
## Avg silhouette: 0.2687
cat("Negative silhouettes:", neg3, "\n\n")
## Negative silhouettes: 2
clusters <- sort(unique(sil[, 1]))
sil_summary <- data.frame(
Cluster = clusters,
Size = sapply(clusters, function(i) sum(sil[, 1] == i)),
Avg_Silhouette = sapply(clusters, function(i) mean(sil[sil[, 1] == i, 3]))
)
print(sil_summary, row.names = FALSE)
## Cluster Size Avg_Silhouette
## 1 78 0.3042883
## 2 27 0.1223551
## 3 62 0.2877652
weak_cluster <- sil_summary$Cluster[which.min(sil_summary$Avg_Silhouette)]
weak_value <- min(sil_summary$Avg_Silhouette)
cat("Interpretation:\n")
## Interpretation:
cat("K=3 is interpretable, but one cluster is relatively weak (lowest avg silhouette).\n")
## K=3 is interpretable, but one cluster is relatively weak (lowest avg silhouette).
cat("Weakest cluster:", weak_cluster, "avg silhouette =", round(weak_value, 3), "\n")
## Weakest cluster: 2 avg silhouette = 0.122
A detailed examination of the silhouette widths reveals that Cluster 2 is markedly weaker than the others (average width 0.12 versus 0.30 and 0.29). In addition, two champions exhibit negative silhouette values, meaning they are on average closer to a neighboring cluster than to their assigned group. For this reason, we scan the average silhouette width across other values of K.
set.seed(123)
ks <- 2:10
avg_sil_by_k <- sapply(ks, function(k) {
km <- kmeans(lol_scaled, centers = k, nstart = 25)
sil_k <- silhouette(km$cluster, dist(lol_scaled))
mean(sil_k[, 3])
})
sil_df <- data.frame(K = ks, Avg_Silhouette = as.numeric(avg_sil_by_k))
print(sil_df, row.names = FALSE)
## K Avg_Silhouette
## 2 0.2475940
## 3 0.2687396
## 4 0.2890218
## 5 0.1928662
## 6 0.1973759
## 7 0.1411024
## 8 0.1397115
## 9 0.1460029
## 10 0.1501829
ggplot(sil_df, aes(x = K, y = Avg_Silhouette)) +
geom_line(linewidth = 1) +
geom_point(size = 2) +
geom_vline(xintercept = k_final, linetype = "dashed") +
theme_minimal() +
labs(
title = "Average Silhouette Width by K",
subtitle = "Higher is better; dashed line shows current K",
y = "Average silhouette width"
)
set.seed(123)
km4 <- kmeans(lol_scaled, centers = 4, nstart = 25)
sil4 <- silhouette(km4$cluster, dist(lol_scaled))
avg4 <- mean(sil4[, 3])
neg4 <- sum(sil4[, 3] < 0)
seeds <- 1:20
avg_sils4 <- sapply(seeds, function(s) {
set.seed(s)
km <- kmeans(lol_scaled, centers = 4, nstart = 25)
sil_k <- silhouette(km$cluster, dist(lol_scaled))
mean(sil_k[, 3])
})
cat("\n DECISION CHECK (K=3 vs K=4) \n")
##
## DECISION CHECK (K=3 vs K=4)
cat("K=3: avg silhouette =", round(avg3, 4), "| negatives =", neg3, "\n")
## K=3: avg silhouette = 0.2687 | negatives = 2
cat("K=4: avg silhouette =", round(avg4, 4), "| negatives =", neg4, "\n")
## K=4: avg silhouette = 0.289 | negatives = 0
cat("K=4 stability (20 seeds): mean =", round(mean(avg_sils4), 4),
"| sd =", round(sd(avg_sils4), 4), "\n\n")
## K=4 stability (20 seeds): mean = 0.289 | sd = 0
k_final <- 4
kmeans_result <- km4
lol$cluster <- factor(kmeans_result$cluster)
After evaluating a range of possible cluster counts (K=2 to 10), we switched from K=3 to K=4. Although K=3 was initially supported by the Elbow method and NbClust, the silhouette scan shows that K=4 has the highest average silhouette width (0.289 versus 0.269 at K=3), removes the two negative silhouettes, and yields the same average silhouette width across 20 random seeds. We now examine the silhouette plot for K=4.
fviz_cluster(
kmeans_result, data = lol_scaled,
geom = "point",
ellipse.type = "convex",
ggtheme = theme_minimal()
) +
labs(
title = paste0("K-Means Clustering (Final K=", k_final, ")"),
subtitle = paste(nrow(lol), "champions clustered in", ncol(lol_clean), "dimensional space")
)
sil_final <- silhouette(kmeans_result$cluster, dist(lol_scaled))
fviz_silhouette(sil_final) +
labs(
title = paste0("Silhouette Plot for Final K-Means (K=", k_final, ")"),
subtitle = "Final chosen solution"
)
## cluster size ave.sil.width
## 1 1 62 0.29
## 2 2 22 0.22
## 3 3 78 0.30
## 4 4 5 0.37
cat("\n FINAL SILHOUETTE SUMMARY (K=4) \n")
##
## FINAL SILHOUETTE SUMMARY (K=4)
cat("Avg silhouette:", round(mean(sil_final[, 3]), 4), "\n")
## Avg silhouette: 0.289
cat("Negative silhouettes:", sum(sil_final[, 3] < 0), "\n")
## Negative silhouettes: 0
The decision to finalize the model with K=4 is driven by better silhouette diagnostics and improved cluster cohesion. As shown in the final silhouette summary, the Average Silhouette Width reached 0.289, and the weakest cluster now averages 0.22, compared with 0.12 for the weakest cluster at K=3.
A key factor in this selection is the elimination of negative silhouette values: every champion is now, on average, closer to the members of its own cluster than to those of the nearest neighboring cluster. While the 2D visualization shows a slight overlap between two clusters, this is likely a projection artifact of reducing 20-dimensional data to a two-dimensional plane. The silhouette widths remain moderate (around 0.2 to 0.4), so the groups are separated but not sharply so. Furthermore, the emergence of a small, highly specialized Cluster 4 (n=5, average silhouette 0.37) is particularly valuable. These extreme profiles would have been absorbed into larger groups in a coarser model, but here they form a distinct archetype that we expect to see clearly separated in the subsequent 3D dimension reduction analysis.
To validate the robustness of our K=4 solution, we performed Hierarchical Clustering and compared the results with the previous K-Means partition. The comparison reveals an exceptional degree of consistency between the two different algorithmic approaches.
dist_matrix <- dist(lol_scaled, method = "euclidean")
hclust_result <- hclust(dist_matrix, method = "ward.D2")
hclust_clusters <- cutree(hclust_result, k = k_final)
lol$hclust_cluster <- factor(hclust_clusters)
cat("K used:", k_final, "\n")
## K used: 4
cat("Cluster sizes:\n")
## Cluster sizes:
print(table(lol$hclust_cluster))
##
## 1 2 3 4
## 22 77 5 63
fviz_dend(
hclust_result, k = k_final,
cex = 0.4,
color_labels_by_k = TRUE,
rect = TRUE,
main = paste0("Hierarchical Clustering Dendrogram (K=", k_final, ")"),
xlab = "Champions",
ylab = "Height (Distance)"
)
comparison_table <- table(KMeans = lol$cluster, Hierarchical = lol$hclust_cluster)
cat("K-MEANS VS HIERARCHICAL (CROSS-TAB) \n\n")
## K-MEANS VS HIERARCHICAL (CROSS-TAB)
print(comparison_table)
## Hierarchical
## KMeans 1 2 3 4
## 1 0 0 0 62
## 2 22 0 0 0
## 3 0 77 0 1
## 4 0 0 5 0
naive_agreement <- sum(diag(comparison_table)) / nrow(lol)
cat("Naive diagonal agreement (label IDs must match):",
round(naive_agreement * 100, 1), "%\n")
## Naive diagonal agreement (label IDs must match): 0 %
if (!requireNamespace("gtools", quietly = TRUE)) install.packages("gtools")
perms <- gtools::permutations(n = k_final, r = k_final, v = 1:k_final)
best_agreement <- -Inf
best_perm <- NULL
kmeans_int <- as.integer(lol$cluster)
for (i in seq_len(nrow(perms))) {
p <- perms[i, ]
mapped_h <- p[hclust_clusters]
agree <- mean(kmeans_int == mapped_h)
if (agree > best_agreement) {
best_agreement <- agree
best_perm <- p
}
}
cat("Best label-matched agreement:", round(best_agreement * 100, 1), "%\n")
## Best label-matched agreement: 99.4 %
cat("Best mapping (Hierarchical old label -> KMeans label):\n")
## Best mapping (Hierarchical old label -> KMeans label):
print(setNames(best_perm, 1:k_final))
## 1 2 3 4
## 2 3 4 1
ari <- mclust::adjustedRandIndex(kmeans_int, hclust_clusters)
cat("Adjusted Rand Index (ARI; label-invariant):", round(ari, 3), "\n\n")
## Adjusted Rand Index (ARI; label-invariant): 0.978
Key Findings from the Cross-Tabulation
Label switching: The ‘naive’ diagonal agreement of 0% is merely a result of arbitrary cluster ID assignment (e.g., K-Means ‘Cluster 1’ corresponds to Hierarchical ‘Cluster 4’). After re-mapping the labels, the best label-matched agreement is 99.4%.
Statistical robustness: The Adjusted Rand Index (ARI) of 0.978 indicates near-perfect overlap. Since ARI is invariant to label permutations, this score confirms that both methods identify essentially the same underlying champion archetypes.
Cluster stability: Out of 167 champions, only one was classified differently between the two methods (K-Means Cluster 3 vs. Hierarchical Cluster 4).
Conclusion: The convergence of these two techniques, centroid-based K-Means and agglomerative hierarchical clustering, supports the stability of the partition. Because Ward linkage also minimizes within-cluster variance, some agreement is expected, but a near-identical partition suggests that the clusters are not artifacts of a single algorithm and represent stable archetypes in the League of Legends dataset.
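As a cross-check on the figures above, the Adjusted Rand Index can be computed by hand from the contingency table alone, using the standard pair-counting formula. The sketch below re-enters the cross-tabulation printed earlier; the function name adjusted_rand_from_table is hypothetical and is not the mclust implementation.

```r
# ARI from a contingency table using the pair-counting formula:
# (observed - expected) / (maximum - expected), counted over pairs of points.
adjusted_rand_from_table <- function(tab) {
  tab <- as.matrix(tab)
  n <- sum(tab)
  sum_cells <- sum(choose(tab, 2))            # pairs grouped together in both
  sum_rows  <- sum(choose(rowSums(tab), 2))   # pairs grouped together in rows
  sum_cols  <- sum(choose(colSums(tab), 2))   # pairs grouped together in cols
  expected  <- sum_rows * sum_cols / choose(n, 2)
  maximum   <- (sum_rows + sum_cols) / 2
  (sum_cells - expected) / (maximum - expected)
}

# Cross-tabulation copied from the K-Means vs Hierarchical output above
crosstab <- matrix(
  c( 0,  0, 0, 62,
    22,  0, 0,  0,
     0, 77, 0,  1,
     0,  0, 5,  0),
  nrow = 4, byrow = TRUE,
  dimnames = list(KMeans = 1:4, Hierarchical = 1:4)
)

ari_value <- adjusted_rand_from_table(crosstab)

# Matched agreement: valid as a shortcut here only because each row's
# largest cell falls in a different column
matched <- sum(apply(crosstab, 1, max)) / sum(crosstab)

print(round(c(ARI = ari_value, matched_agreement = matched), 3))
```

Both values agree with the 0.978 and 99.4% reported above, which confirms that they follow from the cross-tabulation itself rather than from any label ordering.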
Why a separate K for fuzzy?
While K-Means and Hierarchical clustering aim to find hard partitions,
where each champion belongs to exactly one group, Fuzzy C-Means (FCM)
optimizes for soft memberships. This allows a champion to have a degree
of belonging (0 to 1) to multiple clusters simultaneously.
Because the underlying mathematical objective differs, we evaluate the optimal number of clusters (K fuzzy) independently. The goal in fuzzy clustering is not just to minimize distance, but to ensure that the resulting membership grades are informative. If K is too high, memberships often become ‘diluted’ or almost uniform (e.g., 0.25 across four groups), providing no real insight into a champion’s identity. Therefore, we select K fuzzy by identifying the point where clusters remain distinct and the ‘fuzziness’ effectively highlights true hybrid characters rather than creating statistical noise.
set.seed(123)
ks <- 2:10
m_baseline <- 2.0 # baseline m used only for scanning K
fuzzy_metrics <- lapply(ks, function(k) {
fm <- cmeans(lol_scaled, centers = k, m = m_baseline, iter.max = 300)
U <- fm$membership
maxmem <- apply(U, 1, max)
entropy <- apply(U, 1, function(x) -sum(x * log(x + 1e-10)))
entropy_norm <- entropy / log(k)
data.frame(
K = k,
Avg_MaxMembership = mean(maxmem),
Avg_Uncertainty = mean(1 - maxmem),
Avg_EntropyNorm = mean(entropy_norm)
)
})
fuzzy_df <- dplyr::bind_rows(fuzzy_metrics) %>%
dplyr::arrange(K)
knitr::kable(
fuzzy_df, digits = 4,
caption = paste0("Fuzzy K scan (baseline m = ", m_baseline,
"). Higher Avg_MaxMembership and lower Avg_EntropyNorm are better.")
)
| K | Avg_MaxMembership | Avg_Uncertainty | Avg_EntropyNorm |
|---|---|---|---|
| 2 | 0.6843 | 0.3157 | 0.8768 |
| 3 | 0.4727 | 0.5273 | 0.9245 |
| 4 | 0.3423 | 0.6577 | 0.9384 |
| 5 | 0.2802 | 0.7198 | 0.9450 |
| 6 | 0.2282 | 0.7718 | 0.9523 |
| 7 | 0.1925 | 0.8075 | 0.9606 |
| 8 | 0.1760 | 0.8240 | 0.9581 |
| 9 | 0.1576 | 0.8424 | 0.9623 |
| 10 | 0.1674 | 0.8326 | 0.9437 |
best_k_maxmem <- fuzzy_df$K[which.max(fuzzy_df$Avg_MaxMembership)]
best_k_entropy <- fuzzy_df$K[which.min(fuzzy_df$Avg_EntropyNorm)]
K_fuzzy <- best_k_maxmem
Model selection (K)
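The scan over K=2..10 indicates that K = 2 yields the most informative
memberships (highest average max-membership, 0.68, and lowest normalized
entropy, 0.88). For larger K, memberships become increasingly uniform,
suggesting that the data support only a small number of fuzzy
prototypes. One caveat: the max-membership cannot fall below 1/K, so its
floor drops as K grows, which by itself favours small K. The normalized
entropy is rescaled by log(K), making it the fairer comparison, and it
also favours K = 2.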
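To show how the two membership diagnostics behave, the sketch below applies the same definitions used in the scan (max membership, 1 - max membership, and entropy divided by log(K)) to three hand-made membership vectors for K = 4. The vectors and the helper name membership_summary are illustrative assumptions, not output from the champion data.

```r
# Diagnostics for a membership matrix U (rows = observations, columns = clusters)
membership_summary <- function(U) {
  U <- as.matrix(U)
  k <- ncol(U)
  max_mem <- apply(U, 1, max)
  # Small constant avoids log(0), mirroring the scan code above
  entropy <- apply(U, 1, function(u) -sum(u * log(u + 1e-10)))
  data.frame(
    max_membership = max_mem,
    uncertainty    = 1 - max_mem,
    entropy_norm   = entropy / log(k)   # 0 = crisp, 1 = fully uniform
  )
}

# Hand-made examples: a clear member, a two-way hybrid, and a uniform split
U_toy <- rbind(
  crisp   = c(0.97, 0.01, 0.01, 0.01),
  hybrid  = c(0.50, 0.48, 0.01, 0.01),
  uniform = c(0.25, 0.25, 0.25, 0.25)
)

res <- membership_summary(U_toy)
print(round(res, 3))
```

The uniform row reaches a normalized entropy of essentially 1 while its max membership sits at the 1/K floor of 0.25, which is the "diluted" pattern described above for large K.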
m_val <- 1.5
cat("Chosen m for fuzzy c-means:", m_val, "\n")
## Chosen m for fuzzy c-means: 1.5
Moderate fuzziness for outlier detection: For the fuzziness exponent, we selected m=1.5. This value was chosen as a middle ground: an initial test at m=1.3 (derived from previous benchmarks) produced memberships too crisp to distinguish hybrids, whereas the m=2.0 baseline used for the K scan gives much softer memberships. At m=1.5 the overlap is moderate, which suits outlier and hybrid detection: champions on the boundary between the two main archetypes stand out without every champion appearing ambiguous.
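The effect of m can be seen from the standard fuzzy c-means membership update, in which membership in cluster j is proportional to the distance to centre j raised to the power -2/(m-1), normalised over all centres. The base R sketch below evaluates that formula for fixed, made-up centres; fcm_membership is a hypothetical helper, and e1071::cmeans may differ in implementation details.

```r
# Standard FCM membership update for fixed centres (illustrative sketch).
# Assumes no point coincides exactly with a centre (distance > 0).
fcm_membership <- function(X, centers, m) {
  X <- as.matrix(X)
  centers <- as.matrix(centers)
  # Squared Euclidean distance from every point to every centre (n x k)
  D2 <- matrix(
    sapply(seq_len(nrow(centers)),
           function(j) rowSums(sweep(X, 2, centers[j, ])^2)),
    nrow = nrow(X)
  )
  W <- D2^(-1 / (m - 1))   # equals distance^(-2 / (m - 1))
  W / rowSums(W)           # normalise so each row sums to 1
}

# Two made-up centres and two points: one near centre 1, one exactly halfway
centers <- rbind(c(0, 0), c(4, 0))
points  <- rbind(near = c(1, 0), halfway = c(2, 0))

u_m13 <- fcm_membership(points, centers, m = 1.3)
u_m15 <- fcm_membership(points, centers, m = 1.5)
u_m20 <- fcm_membership(points, centers, m = 2.0)

# Membership of each point in the first cluster under each exponent
print(round(cbind(m1.3 = u_m13[, 1], m1.5 = u_m15[, 1], m2.0 = u_m20[, 1]), 4))
```

Lower m pushes the nearby point toward a membership of 1, while the exactly equidistant point stays at 0.5 regardless of m; the choice of m therefore controls how wide the band of "hybrid" champions around the boundary appears.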
set.seed(123)
fuzzy_result <- cmeans(lol_scaled, centers = K_fuzzy, m = m_val, iter.max = 300)
lol$fuzzy_cluster <- factor(fuzzy_result$cluster)
membership <- fuzzy_result$membership
lol$max_membership <- apply(membership, 1, max)
lol$uncertainty <- 1 - lol$max_membership
entropy <- apply(membership, 1, function(x) -sum(x * log(x + 1e-10)))
lol$entropy_norm <- entropy / log(K_fuzzy)
cat("Uncertainty Statistics Summary\n")
## Uncertainty Statistics Summary
summary(lol$uncertainty)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01773 0.06688 0.11848 0.16004 0.23617 0.49706
cat("Normalized Entropy Summary\n")
## Normalized Entropy Summary
summary(lol$entropy_norm)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1285 0.3542 0.5250 0.5616 0.7886 1.0000
Following the selection of K=2 and a fuzziness parameter m=1.5, we conducted a deep-dive into the membership structures to evaluate the “soft” boundaries between champion archetypes.
centers <- round(fuzzy_result$centers, 2)
knitr::kable(centers, caption = "Fuzzy c-means cluster centers (standardized).")
| Base.HP | HP.per.lvl | Base.mana | Mana.per.lvl | Movement.speed | Base.armor | Armor.per.lvl | Base.magic.resistance | Magic.resistance.per.lvl | Attack.range | HP.regeneration | HP.regeneration.per.lvl | Mana.regeneration | Mana.regeneration.per.lvl | Attack.damage | Attack.damage.per.lvl | Attack.speed.per.lvl | Attack.speed | AS.ratio | resource_bin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| -0.40 | -0.18 | 0.37 | 0.03 | -0.54 | -0.62 | -0.04 | -0.43 | -0.78 | 0.79 | -0.50 | -0.41 | 0.05 | 0.27 | -0.59 | -0.32 | -0.06 | -0.11 | -0.09 | 0.27 |
| 0.44 | 0.18 | -0.34 | 0.02 | 0.59 | 0.66 | 0.07 | 0.47 | 0.82 | -0.81 | 0.56 | 0.44 | -0.07 | -0.24 | 0.63 | 0.35 | 0.04 | 0.11 | 0.13 | -0.21 |
if (!("Attack.range" %in% colnames(fuzzy_result$centers))) {
stop("Attack.range not found in fuzzy_result$centers. Check preprocessing column names.")
}
range_center <- fuzzy_result$centers[, "Attack.range"]
ranged_id <- which.max(range_center)
melee_id <- which.min(range_center)
label_map <- rep(NA_character_, K_fuzzy)
label_map[ranged_id] <- "Ranged/Mana-like"
label_map[melee_id] <- "Melee/Tanky-like"
lol$fuzzy_cluster_label <- factor(
label_map[as.integer(lol$fuzzy_cluster)],
levels = c("Melee/Tanky-like", "Ranged/Mana-like")
)
knitr::kable(
data.frame(Cluster_ID = 1:K_fuzzy, Label = label_map),
caption = "Fuzzy cluster label mapping based on Attack.range in the centers."
)
| Cluster_ID | Label |
|---|---|
| 1 | Ranged/Mana-like |
| 2 | Melee/Tanky-like |
par(mfrow = c(1, 2), mar = c(5, 5, 4, 2))
hist(lol$uncertainty,
breaks = 30,
col = "steelblue",
border = "white",
main = paste0("Uncertainty Distribution (K_fuzzy=", K_fuzzy, ", m=", m_val, ")"),
xlab = "Uncertainty (1 - max membership)",
ylab = "Frequency")
abline(v = median(lol$uncertainty), col = "red", lty = 2, lwd = 2)
hist(lol$entropy_norm,
breaks = 30,
col = "coral",
border = "white",
main = paste0("Normalized Entropy (K_fuzzy=", K_fuzzy, ", m=", m_val, ")"),
xlab = "Normalized entropy",
ylab = "Frequency")
abline(v = median(lol$entropy_norm), col = "red", lty = 2, lwd = 2)
par(mfrow = c(1, 1))
cat("UNCERTAINTY STATISTICS (FUZZY) \n\n")
## UNCERTAINTY STATISTICS (FUZZY)
cat("Uncertainty summary:\n")
## Uncertainty summary:
print(summary(lol$uncertainty))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01773 0.06688 0.11848 0.16004 0.23617 0.49706
cat("Normalized entropy summary:\n")
## Normalized entropy summary:
print(summary(lol$entropy_norm))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1285 0.3542 0.5250 0.5616 0.7886 1.0000
cat("Correlation (uncertainty vs entropy): ",
round(cor(lol$uncertainty, lol$entropy_norm), 3), "\n", sep = "")
## Correlation (uncertainty vs entropy): 0.964
The initial scan across different values of K indicated that a binary split provides the most informative fuzzy prototypes. As shown in the quality table:
Avg Max Membership (0.68): At K=2, champions show the strongest peak membership, meaning the algorithm can distinguish between the two primary poles (Melee/Tanky vs. Ranged/Mana-based).
Entropy and Uncertainty: As K increases, the Normalized Entropy rises from 0.88 at K=2 to roughly 0.94 to 0.96 for K≥4, indicating that memberships become too diluted to be meaningful.
The uncertainty (calculated as 1−max membership) and the Normalized Entropy are highly correlated (r=0.964). With only two clusters this is expected, because both are increasing functions of the smaller of the two memberships; they are therefore interchangeable indicators of “statistical hybrids.”
Core Archetypes: The majority of champions exhibit low uncertainty (Median = 0.118), meaning they fit firmly into one of the two main prototypes.
The Hybrid “Bridge”: Champions with uncertainty values approaching 0.5 (Max = 0.497) are the true “misfits” of the dataset. These characters possess nearly equal membership in both the Ranged/Mana-like and Melee/Tanky-like clusters.
Strategic Interpretation: While our previous K=4 hard clustering (K-Means) provided more granular groups, this fuzzy K=2 model reveals the fundamental spectrum underlying the roster. High-uncertainty champions have base-stat profiles that blend traits of both poles, for example survivability combined with range.
Why transition to PCA?
Clustering 167 champions across 20 statistical dimensions (HP, Mana, Armor, Attack Speed, etc.) yields a structure that cannot be plotted or inspected directly. PCA reduces this complexity by transforming the original correlated variables into a small number of uncorrelated principal components. This is particularly effective here because many champion stats are naturally linked (e.g., high Armor often correlates with high HP), allowing PCA to capture much of a champion’s profile in a simplified space.
pca_result <- prcomp(lol_scaled, scale. = FALSE)
lol$PC1 <- pca_result$x[, 1]
lol$PC2 <- pca_result$x[, 2]
var_explained <- summary(pca_result)$importance[2, 1:2] * 100
cat("PCA VARIANCE EXPLAINED\n\n")
## PCA VARIANCE EXPLAINED
cat("PC1 explains", round(var_explained[1], 1), "%\n")
## PC1 explains 28 %
cat("PC2 explains", round(var_explained[2], 1), "%\n")
## PC2 explains 15.5 %
cat("PC1+PC2 explain", round(sum(var_explained[1:2]), 1), "%\n")
## PC1+PC2 explain 43.5 %
The PCA results show that the first two components capture 43.5% of the dataset’s total variance. PC1 is the most dominant, explaining 28% of the variance, and likely represents the fundamental ‘Tankiness vs. Squishiness’ or ‘Melee vs. Ranged’ spectrum. PC2 adds another 15.5%, capturing secondary differences (the loadings below point mainly to mana-related statistics). Although 43.5% is moderate, meaning more than half of the variance lies outside this plane, it is enough to visualize our 4 clusters in 2D while retaining the main structural information.
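The variance shares quoted above come from the component standard deviations: each share is sdev^2 divided by the sum of all sdev^2. The sketch below reproduces that calculation on synthetic data (three correlated features plus one independent one, an assumption for illustration) and adds a cumulative view for judging how many components a given threshold would require.

```r
set.seed(1)
n <- 200
latent <- rnorm(n)

# Synthetic features: three driven by one latent factor, one pure noise
toy <- cbind(
  f1 = latent + rnorm(n, sd = 0.3),
  f2 = latent + rnorm(n, sd = 0.3),
  f3 = -latent + rnorm(n, sd = 0.3),
  f4 = rnorm(n)
)

# Standardise first, as in the main analysis, then run PCA without rescaling
pca_toy <- prcomp(scale(toy), scale. = FALSE)

# Proportion of variance per component and its running total
var_share <- pca_toy$sdev^2 / sum(pca_toy$sdev^2)
cum_share <- cumsum(var_share)

# Smallest number of components reaching 80% of the variance
n_comp_80 <- which(cum_share >= 0.80)[1]

print(round(rbind(share = var_share, cumulative = cum_share), 3))
```

The same cumulative view applied to pca_result would show how many components are needed before a 2D picture stops leaving most of the variance out, which is useful context for the 43.5% figure.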
loadings <- as.data.frame(pca_result$rotation[, 1:2])
loadings$Feature <- rownames(loadings)
top_pc1 <- loadings |> dplyr::arrange(dplyr::desc(abs(PC1))) |> head(10)
top_pc2 <- loadings |> dplyr::arrange(dplyr::desc(abs(PC2))) |> head(10)
knitr::kable(top_pc1, digits = 3, caption = "Top 10 absolute loadings for PC1")
| | PC1 | PC2 | Feature |
|---|---|---|---|
| Attack.range | 0.378 | 0.065 | Attack.range |
| Magic.resistance.per.lvl | -0.375 | -0.103 | Magic.resistance.per.lvl |
| Attack.damage | -0.324 | -0.114 | Attack.damage |
| Base.armor | -0.304 | -0.067 | Base.armor |
| Movement.speed | -0.299 | 0.013 | Movement.speed |
| Base.mana | 0.245 | -0.353 | Base.mana |
| HP.regeneration | -0.241 | -0.237 | HP.regeneration |
| Base.magic.resistance | -0.233 | -0.153 | Base.magic.resistance |
| Base.HP | -0.221 | -0.177 | Base.HP |
| resource_bin | 0.219 | -0.444 | resource_bin |
knitr::kable(top_pc2, digits = 3, caption = "Top 10 absolute loadings for PC2")
| | PC1 | PC2 | Feature |
|---|---|---|---|
| Mana.per.lvl | 0.104 | -0.466 | Mana.per.lvl |
| resource_bin | 0.219 | -0.444 | resource_bin |
| Mana.regeneration.per.lvl | 0.214 | -0.433 | Mana.regeneration.per.lvl |
| Base.mana | 0.245 | -0.353 | Base.mana |
| HP.regeneration | -0.241 | -0.237 | HP.regeneration |
| HP.regeneration.per.lvl | -0.192 | -0.221 | HP.regeneration.per.lvl |
| HP.per.lvl | -0.110 | -0.204 | HP.per.lvl |
| Base.HP | -0.221 | -0.177 | Base.HP |
| Base.magic.resistance | -0.233 | -0.153 | Base.magic.resistance |
| Attack.damage.per.lvl | -0.187 | -0.121 | Attack.damage.per.lvl |
The PCA loadings show which specific statistics drive the separation on the plots, the underlying “DNA” of the champion archetypes.
PC1: The “Frontline vs. Backline” Spectrum
The first principal component (28% of variance) acts as the primary axis for combat positioning:
Attack.range (0.378) is the strongest positive driver. Champions on the positive side of PC1 are characterized by long reach and the safety it brings.
Magic.resistance.per.lvl (-0.375), Attack.damage (-0.324), and Base.armor (-0.304) are the strongest negative drivers; the first is almost as large in magnitude as Attack.range.
This captures the fundamental “Range vs. Durability” trade-off. PC1 separates the “glass cannon” marksmen and mages from the sturdy tanks and brawlers, which have higher base damage and defensive scaling.
PC2: The “Resource Dependency and Scaling” Axis
The second principal component (15.5% of variance) captures champion resources and growth over time:
Mana Scaling: this axis is dominated by mana-related stats, specifically Mana.per.lvl (-0.466), resource_bin (-0.444), Mana.regeneration.per.lvl (-0.433), and Base.mana (-0.353).
Sustain and Durability: HP.regeneration (-0.237) and HP.per.lvl (-0.204) pull in the same negative direction, but only about half as strongly.
PC2 therefore distinguishes champions mainly by resource profile. Champions on the negative end of this axis are typically mana-dependent “scalers” who rely on large resource pools, whereas the positive end identifies champions with unique or more static resource profiles.
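When reading loadings, it helps to remember that a champion’s score on a component is just the loading-weighted sum of its (centered) stats, and that the sign of a component is arbitrary. The sketch below illustrates both points on a small synthetic matrix; the feature names and data are hypothetical stand-ins, not the champion dataset.

```r
set.seed(2)
# Hypothetical standardized features for 50 toy "champions".
toy <- scale(matrix(rnorm(50 * 4), 50, 4,
                    dimnames = list(NULL, c("range", "armor", "mana", "hp"))))
pca_toy <- prcomp(toy, scale. = FALSE)

# Scores are the centered data projected onto the loading vectors.
manual_scores <- toy %*% pca_toy$rotation
all.equal(unname(manual_scores), unname(pca_toy$x))

# Flipping the sign of a loading vector mirrors the scores
# but leaves the explained variance unchanged.
flipped <- toy %*% (-pca_toy$rotation[, 1])
c(original = var(pca_toy$x[, 1]), flipped = var(flipped[, 1]))
```

This is why “positive PC1 = ranged” is a property of this particular fit: a rerun could mirror the axis without changing any conclusion.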
ggplot(lol, aes(x = PC1, y = PC2, color = cluster)) +
geom_point(size = 3, alpha = 0.7) +
theme_minimal() +
labs(
title = "PCA (colored by final K-means clusters)",
subtitle = paste0("Final K = ", k_final),
x = paste0("PC1 (", round(var_explained[1], 1), "%)"),
y = paste0("PC2 (", round(var_explained[2], 1), "%)")
)
This spatial separation is consistent with the convex hull visualization
presented earlier in Section 5, where the same four groupings appeared
as non-overlapping regions, with the exception of Cluster 4 (purple),
which partially overlaps Cluster 2 (green). This suggests that these
champions sit on the boundary of the melee archetype rather than forming
a fully independent group.
ggplot(lol, aes(x = PC1, y = PC2, color = fuzzy_cluster_label)) +
geom_point(size = 3, alpha = 0.7) +
scale_color_manual(values = c("Melee/Tanky-like" = "steelblue",
"Ranged/Mana-like" = "coral")) +
theme_minimal() +
labs(
title = "PCA (colored by fuzzy c-means labels)",
subtitle = paste0("Fuzzy K = ", K_fuzzy, " (main axis)"),
x = paste0("PC1 (", round(var_explained[1], 1), "%)"),
y = paste0("PC2 (", round(var_explained[2], 1), "%)"),
color = "Fuzzy cluster"
)
Switching to the fuzzy K=2 view reveals the fundamental binary structure
underlying all four K-Means archetypes. The Melee/Tanky-like (blue) and
Ranged/Mana-like (orange) groups are cleanly separated along PC1 with
almost no overlap, confirming that the ranged vs. melee axis is the
single strongest organising principle in the dataset. Compared to the
K=4 plot, this view collapses the green/pink distinction and the small
purple cluster into one melee pole, which shows that while those
sub-groups are statistically distinct, they all share the same
fundamental identity. The handful of points near PC1 ≈ 0 are the genuine
borderline cases, champions whose stat profiles sit between both worlds
and whose in-game design likely reflects a hybrid archetype such as a
bruiser or battlemage.
ggplot(lol, aes(x = PC1, y = PC2, color = uncertainty)) +
geom_point(size = 3, alpha = 0.7) +
scale_color_viridis_c(option = "plasma", direction = -1) +
theme_minimal() +
labs(
title = "Champion Versatility (Fuzzy): Uncertainty (PCA view)",
subtitle = "Higher = more mixed memberships (more hybrid / potential misfit)",
x = paste0("PC1 (", round(var_explained[1], 1), "%)"),
y = paste0("PC2 (", round(var_explained[2], 1), "%)")
)
The uncertainty plot adds the most analytically rich layer to the PCA
analysis. Champions deep in either the melee (far left) or ranged (far
right) zones are overwhelmingly yellow, meaning the algorithm assigns
them to their archetype with high confidence. The highest-uncertainty
points (dark purple) are not randomly scattered: they concentrate near
the boundary zone around PC1 ≈ 0 and within the upper portion of the
melee cluster (high PC2), which indicates that the fuzzy membership
scores are geometrically meaningful rather than statistical noise.
Notably, the isolated outlier at the very top (PC2 ≈ 5.5) shows
near-maximum uncertainty despite sitting in melee territory on PC1,
suggesting that this champion has an unusual stat profile that straddles
both archetypes. It is a strong candidate for further investigation in
the hybrid analysis section.
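The concentration of uncertainty near the boundary follows from how fuzzy c-means assigns memberships: they depend only on the relative distances to the cluster prototypes. The sketch below applies the standard membership formula to a one-dimensional toy axis. The prototype positions (-2 and +2) and the fuzzifier m = 2 are assumptions for illustration, not values fitted in this analysis.

```r
# Standard fuzzy c-means membership of one point, given its distances
# to the K prototypes. m is the fuzzifier (assumed to be 2 here).
fcm_membership <- function(d, m = 2) {
  # A point sitting exactly on a prototype belongs fully to it.
  if (any(d == 0)) return(as.numeric(d == 0) / sum(d == 0))
  w <- d^(-2 / (m - 1))
  w / sum(w)
}

# Two hypothetical prototypes on a 1-D axis: melee at -2, ranged at +2.
centers <- c(melee = -2, ranged = 2)
pc1     <- seq(-3, 3, by = 1)

u <- t(sapply(pc1, function(x) fcm_membership(abs(x - centers))))
colnames(u) <- names(centers)
uncertainty <- 1 - apply(u, 1, max)

round(data.frame(pc1, u, uncertainty), 3)
```

In this toy setting the uncertainty peaks at 0.5 exactly halfway between the prototypes and falls off quickly on either side, which mirrors the pattern around PC1 ≈ 0 in the plot.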
The three PCA views together tell a coherent story about champion design in League of Legends. The K=4 plot shows that the four archetypes occupy distinct regions of the feature space, while the fuzzy K=2 view reveals that all four ultimately reduce to a single fundamental axis: the melee/ranged divide encoded in PC1. The uncertainty plot then adds precision to this picture by showing that most champions are unambiguous members of one pole, and that hybridity is a genuine but rare statistical property concentrated at the boundary between the two worlds. Taken together, these three views support the clustering pipeline: the hard clusters capture meaningful sub-structure, the fuzzy model captures the underlying spectrum, and the uncertainty scores identify the champions whose stat profiles are hardest to classify.
To understand how champion attributes relate to each other, we visualize:
1) a feature correlation heatmap (pairwise correlations)
2) a PCA variable correlation circle, which shows how features align with PC1/PC2.
corr_mat <- cor(lol_clean, use = "pairwise.complete.obs")
op <- par(mar = c(1, 1, 2, 4))
corrplot(
corr_mat,
method = "color",
type = "upper",
order = "hclust",
diag = FALSE,
# readability
tl.col = "black",
tl.cex = 0.75,
tl.srt = 45,
tl.offset = 0.8,
insig = "blank",
addCoef.col = NULL,
col = colorRampPalette(c("#B2182B", "#F7F7F7", "#2166AC"))(200),
cl.cex = 0.9
)
title("Correlation heatmap (numeric features)", cex.main = 1.1)
par(op)
fviz_pca_var(
pca_result,
col.var = "contrib", # color by contribution to PC1/PC2
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE
) +
labs(
title = "PCA variable correlation circle",
subtitle = "Vectors show how champion statistics relate to PC1/PC2; color = contribution"
) +
theme_minimal()
The correlation heatmap reveals two distinct clusters
of inter-related features that directly explain why PCA found such clean
separation. The most striking pattern is the strong negative correlation
between Attack.range and the melee-side stats - Base.HP, Attack.damage,
Base.armor, Magic.resistance.per.lvl, and Movement.speed all correlate
negatively with range, confirming that high reach and physical
durability are genuinely opposing design principles in League of
Legends. On the other side, the mana-related features (Mana.per.lvl,
Base.mana, Mana.regeneration.per.lvl, and resource_bin) form a tight
positive cluster among themselves, meaning champions that use mana tend
to have consistently high values across all mana stats rather than just
one. This internal correlation structure is what makes PCA effective
here: because the features are not independent but grouped into
meaningful blocks, each block can be summarized by a single component,
and the first two components alone retain 43.5% of the variance.
The correlation circle provides a visual summary of what the loadings table already showed, in a more intuitive form. Attack.range points strongly to the right (positive PC1) and is the longest, most orange vector on that axis, marking it as the highest-contributing feature on PC1. Directly opposing it on the left are Magic.resistance.per.lvl, Movement.speed, Base.armor, and Attack.damage, all pointing in the same direction and therefore positively correlated with each other; together they define the melee/tanky pole. The mana cluster (Mana.per.lvl, resource_bin, Mana.regeneration.per.lvl, Base.mana) points downward and slightly right, nearly perpendicular to the melee vectors. This indicates that mana dependency is a largely separate dimension of champion design rather than simply a consequence of being ranged or melee, although the slight rightward tilt shows that mana users lean ranged. Features like Attack.speed and AS.ratio point almost straight upward with short vectors, meaning they contribute little to either PC1 or PC2 and are not particularly useful for distinguishing champion archetypes in this two-dimensional view.
Together, the correlation heatmap and the variable correlation circle confirm that the feature space is not random - it is structured around two orthogonal design principles that Riot Games appears to have built into champion statistics: the ranged vs. melee divide on PC1, and the resource dependency spectrum on PC2. This structure is what makes unsupervised clustering effective on this dataset.
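The link between the loadings table and the correlation circle can be made explicit. For standardized features, the arrow coordinates in a correlation circle are the loadings multiplied by each component’s standard deviation, which equals the correlation between each feature and each component. The sketch below checks this on synthetic data with two correlated feature blocks; the feature names are hypothetical, and the claim that this matches what fviz_pca_var draws for a prcomp object is our understanding rather than something verified here.

```r
set.seed(3)
# Toy standardized data with two correlated blocks of features.
n  <- 100
f1 <- rnorm(n); f2 <- rnorm(n)
toy <- scale(cbind(
  range      =  f1 + rnorm(n, sd = 0.5),
  armor      = -f1 + rnorm(n, sd = 0.5),
  mana       =  f2 + rnorm(n, sd = 0.5),
  mana_regen =  f2 + rnorm(n, sd = 0.5)
))
pca_toy <- prcomp(toy, scale. = FALSE)

# Arrow coordinates: loading times the component's standard deviation.
var_coord <- sweep(pca_toy$rotation[, 1:2], 2, pca_toy$sdev[1:2], `*`)

# For unit-variance features these equal the feature-component correlations.
var_cor <- cor(toy, pca_toy$x[, 1:2])
all.equal(var_coord, var_cor)

# Percentage contribution of each feature to each component
# (squared loadings; each column sums to 100).
contrib <- 100 * pca_toy$rotation[, 1:2]^2
round(contrib, 1)
```

Arrows that nearly reach the unit circle belong to features well represented in the PC1/PC2 plane; short arrows, like Attack.speed in the real plot, carry most of their variance in later components.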
This section complements PCA with:
- Classical MDS (distance-preserving, metric)
- UMAP (non-linear embedding)
- t-SNE (non-linear embedding, local neighborhoods)
- SOM (grid-based 2D mapping)
Each embedding is visualized colored by:
1) final K-means clusters (K = 4)
2) fuzzy uncertainty (hybrid / misfit signal)
D <- dist(lol_scaled, method = "euclidean")
mds <- cmdscale(D, k = 2, eig = TRUE)
lol$MDS1 <- mds$points[, 1]
lol$MDS2 <- mds$points[, 2]
mds_var <- mds$eig[mds$eig > 0]
prop_2d <- sum(mds_var[1:2]) / sum(mds_var)
cat("MDS: approx. proportion captured by 2D =", round(prop_2d, 3), "\n")
## MDS: approx. proportion captured by 2D = 0.435
ggplot(lol, aes(x = MDS1, y = MDS2, color = cluster)) +
geom_point(size = 3, alpha = 0.7) +
theme_minimal() +
labs(
title = "Classical MDS (2D) colored by K-means clusters",
subtitle = paste0("K-means K = ", k_final, " | 2D capture ≈ ", round(prop_2d * 100, 1), "%"),
x = "MDS1", y = "MDS2"
)
The Classical MDS embedding captures approximately 43.5% of the total
variance in 2D, identical to PCA. This is expected: classical MDS
applied to Euclidean distances is mathematically equivalent to PCA, so
the two embeddings coincide up to a reflection of the axes. The four
clusters are clearly separated, with Cluster 3 (teal) occupying the left
side, Clusters 1 (pink) and 2 (green) on the right but separated along
MDS2, and Cluster 4 (purple) again appearing as a small isolated group
near the green cluster. Because of this equivalence, the agreement with
PCA is a consistency check on the computation rather than independent
validation; what it does show is that the 2D picture reflects Euclidean
distances between champions in the original 20-dimensional feature
space.
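The match between the MDS and PCA figures (both 43.5%) can be reproduced on any dataset, because classical MDS on Euclidean distances recovers the PCA scores up to a sign flip. The sketch below demonstrates this on a synthetic matrix; the data and object names are illustrative only.

```r
set.seed(4)
toy <- scale(matrix(rnorm(60 * 6), 60, 6))

pca_toy <- prcomp(toy, scale. = FALSE)
mds_toy <- cmdscale(dist(toy), k = 2, eig = TRUE)

# Axes agree up to sign, so absolute correlations should be 1.
axis_cor <- abs(c(cor(pca_toy$x[, 1], mds_toy$points[, 1]),
                  cor(pca_toy$x[, 2], mds_toy$points[, 2])))

# Share captured by two dimensions, computed both ways.
pca_share <- sum(pca_toy$sdev[1:2]^2) / sum(pca_toy$sdev^2)
pos_eig   <- mds_toy$eig[mds_toy$eig > 0]
mds_share <- sum(pos_eig[1:2]) / sum(pos_eig)

round(c(axis1 = axis_cor[1], axis2 = axis_cor[2],
        pca_share = pca_share, mds_share = mds_share), 6)
```

MDS becomes an independent view only when a non-Euclidean dissimilarity is used, for example Manhattan or Gower distance.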
ggplot(lol, aes(x = MDS1, y = MDS2, color = uncertainty)) +
geom_point(size = 3, alpha = 0.7) +
scale_color_viridis_c(option = "plasma", direction = -1) +
theme_minimal() +
labs(
title = "Classical MDS (2D) colored by fuzzy uncertainty",
subtitle = "Higher uncertainty = more hybrid / potential misfit",
x = "MDS1", y = "MDS2"
)
The uncertainty overlay on the MDS plot reinforces the same pattern seen
in PCA: high-uncertainty champions (dark purple) concentrate in the
transitional zone between clusters rather than at the extremes. The teal
cluster on the left is predominantly yellow, confirming that ranged
champions form a tight, unambiguous group with low hybridity. The most
uncertain points appear scattered along the upper portion of the plot
and at the boundary between the pink and green clusters on the right
side, suggesting that the melee/tanky archetypes produce more hybrid
edge cases than the ranged group. The isolated dark purple point at the
very top (MDS2 ≈ 5) appears consistently as an outlier across both PCA
and MDS, further confirming this is a genuinely unusual champion worth
identifying by name in the hybrid analysis section.
if (!requireNamespace("uwot", quietly = TRUE)) install.packages("uwot")
library(uwot)
set.seed(123)
umap_xy <- uwot::umap(
lol_scaled,
n_neighbors = 15,
min_dist = 0.10,
metric = "euclidean"
)
lol$UMAP1 <- umap_xy[, 1]
lol$UMAP2 <- umap_xy[, 2]
ggplot(lol, aes(x = UMAP1, y = UMAP2, color = cluster)) +
geom_point(size = 3, alpha = 0.7) +
theme_minimal() +
labs(
title = "UMAP (2D) colored by K-means clusters",
subtitle = paste0("K-means K = ", k_final, " | non-linear embedding"),
x = "UMAP1", y = "UMAP2"
)
UMAP, as a non-linear embedding, shows something that PCA and MDS could
not show as clearly: the four clusters are separated in local
neighborhood structure, not only along the main linear axes. The groups
appear as isolated islands with almost no overlap. Because UMAP tends to
emphasize local structure and can exaggerate gaps between groups, this
is best read as evidence that each champion’s nearest neighbors mostly
share its cluster, not as a measure of how far apart the clusters are.
Particularly striking is that Cluster 4 (purple) now separates cleanly
from Cluster 2 (green), appearing as its own island rather than an
embedded sub-group, which suggests that these champions are not just
statistical outliers within the melee archetype but a different local
neighborhood in feature space. The single teal point sitting near the
pink cluster is the only cross-boundary case visible, consistent with
the borderline champion identified in previous plots.
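Because a 2D embedding can make groups look more separated than they are, the local-neighborhood claim can also be checked directly in the original feature space. One simple option, sketched below on synthetic data, is a k-nearest-neighbor purity score: for each point, the share of its k nearest neighbors that carry the same cluster label. The helper knn_purity and the toy data are hypothetical and not part of the original pipeline.

```r
set.seed(5)
# Toy data: three well-separated groups in 5 dimensions, with k-means labels.
centers <- rbind(c(4, 0, 0, 0, 0), c(0, 4, 0, 0, 0), c(0, 0, 4, 0, 0))
toy <- do.call(rbind, lapply(1:3, function(g) {
  sweep(matrix(rnorm(40 * 5), 40, 5), 2, centers[g, ], `+`)
}))
labels <- kmeans(toy, centers = 3, nstart = 20)$cluster

# For each point, the share of its k nearest neighbours with the same label.
knn_purity <- function(x, labels, k = 10) {
  d <- as.matrix(dist(x))
  diag(d) <- Inf                      # a point is not its own neighbour
  sapply(seq_len(nrow(d)), function(i) {
    nn <- order(d[i, ])[seq_len(k)]
    mean(labels[nn] == labels[i])
  })
}

purity <- knn_purity(toy, labels, k = 10)
tapply(purity, labels, mean)          # mean purity per cluster
```

Applied to lol_scaled and lol$cluster, per-cluster purity close to 1 would support the UMAP picture, while champions with low purity would be natural candidates for borderline cases.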
ggplot(lol, aes(x = UMAP1, y = UMAP2, color = uncertainty)) +
geom_point(size = 3, alpha = 0.7) +
scale_color_viridis_c(option = "plasma", direction = -1) +
theme_minimal() +
labs(
title = "UMAP (2D) colored by fuzzy uncertainty",
subtitle = "Higher uncertainty = champions between prototypes",
x = "UMAP1", y = "UMAP2"
)
The uncertainty overlay on the UMAP reveals a particularly interesting
pattern: the two isolated lower clusters (green and purple in the
previous plot) are almost entirely dark purple, meaning these champions
have the highest fuzzy uncertainty in the entire dataset despite forming
their own tight local neighborhoods. This is an important finding: UMAP
says they are internally similar to each other, but the fuzzy model says
they don’t belong cleanly to either the melee or ranged prototype. This
combination suggests these are champions with a distinctive stat profile
that sits between both worlds: not random noise, but a coherent hybrid
archetype that neither the melee nor ranged pole fully captures. The
pink and teal clusters, on the other hand, show a gradient from orange
to yellow toward their cores, indicating that champions at the heart of
each archetype are the most statistically unambiguous.
if (!requireNamespace("Rtsne", quietly = TRUE)) install.packages("Rtsne")
library(Rtsne)
set.seed(123)
# Perplexity must be < (n-1)/3
perp <- 20
tsne_out <- Rtsne(
lol_scaled,
dims = 2,
perplexity = perp,
pca = TRUE,
check_duplicates = FALSE,
verbose = FALSE
)
lol$TSNE1 <- tsne_out$Y[, 1]
lol$TSNE2 <- tsne_out$Y[, 2]
ggplot(lol, aes(x = TSNE1, y = TSNE2, color = cluster)) +
geom_point(size = 3, alpha = 0.7) +
theme_minimal() +
labs(
title = "t-SNE (2D) colored by K-means clusters",
subtitle = paste0("perplexity = ", perp, " | K-means K = ", k_final),
x = "t-SNE 1", y = "t-SNE 2"
)
The t-SNE embedding (perplexity = 20) shows the same four-cluster
structure again, this time through a different non-linear algorithm.
All four groups appear as spatially distinct regions with minimal
overlap, consistent with UMAP and PCA. Cluster 3 (teal) is the most
spread out and Cluster 2 (green) forms the most compact island at the
top, which hints that the ranged group is internally more diverse and
the green melee/tanky group more homogeneous. Cluster 4 (purple) again
appears as a small isolated group sitting between the green and pink
clusters, reinforcing the finding from UMAP that these champions occupy
a distinct position in feature space. As with all t-SNE results,
neither the absolute distances between clusters nor the apparent sizes
of clusters should be over-interpreted, but the grouping and separation
are consistent with the other methods.
ggplot(lol, aes(x = TSNE1, y = TSNE2, color = uncertainty)) +
geom_point(size = 3, alpha = 0.7) +
scale_color_viridis_c(option = "plasma", direction = -1) +
theme_minimal() +
labs(
title = "t-SNE (2D) colored by fuzzy uncertainty",
subtitle = paste0("perplexity = ", perp, " | higher uncertainty = more hybrid / potential misfit"),
x = "t-SNE 1", y = "t-SNE 2"
)
The uncertainty overlay on t-SNE confirms and sharpens the findings from
UMAP. The green cluster (top center) is almost entirely purple, meaning
these champions, despite forming a tight, internally coherent group, are
statistically the most hybrid in the dataset, sitting between both fuzzy
prototypes. The same applies to the small purple Cluster 4 points
visible at t-SNE coordinates around (0, 3), which also show high
uncertainty. In contrast, the teal cluster (bottom left) is
predominantly yellow at its core, confirming that ranged champions are
the most archetypal and unambiguous group. The pink cluster (right)
shows a clear gradient, yellow at the center, shifting to orange and
pink toward the edges, which suggests that while most melee/tanky
champions fit their archetype well, those at the periphery of the group
are the most likely candidates for hybrid or bruiser classifications.
This gradient pattern also appeared in the UMAP embedding, and the
agreement between the two non-linear embeddings supports the validity
of the fuzzy uncertainty scores.
topN <- 10
misfits <- lol |> dplyr::arrange(dplyr::desc(uncertainty)) |> head(topN)
ggplot(lol, aes(x = TSNE1, y = TSNE2, color = cluster)) +
geom_point(size = 2.5, alpha = 0.6) +
geom_point(data = misfits, aes(x = TSNE1, y = TSNE2), color = "black", size = 3.2) +
ggrepel::geom_text_repel(
data = misfits, aes(label = Name),
size = 3, max.overlaps = Inf
) +
theme_minimal() +
labs(
title = "t-SNE with top hybrid/misfit champions highlighted",
subtitle = paste0("perplexity = ", perp, " | black points = highest uncertainty"),
x = "t-SNE 1", y = "t-SNE 2"
)
The labeled plot identifies the ten most statistically hybrid champions by name, and the selection is readily interpretable from a game knowledge perspective. Bel’Veth, Vladimir, Gnar, Briar, and Kled all appear within the green cluster but with high uncertainty. These are champions whose kits blend melee combat with unconventional resource mechanics or transforming playstyles, making them genuine design hybrids that the algorithm flags as difficult to classify. Kennen sits isolated in the Cluster 4 (purple) zone, which makes sense as he is a ranged champion with melee-like durability stats. Nilah and Ryze appear at the edge of the pink cluster, both being statistically unusual within their archetype: Nilah as a melee ADC and Ryze as a mana-scaling champion with atypically high resource stats. Thresh sits deep in the teal cluster but is flagged as uncertain, which likely reflects his unusual defensive design: he gains armor from collected souls rather than from per-level growth. Jhin appears as an isolated outlier at the far bottom of the teal group, consistent with his exceptionally unusual attack speed mechanics, which make his stat profile unlike any other ranged champion in the dataset.
if (!requireNamespace("kohonen", quietly = TRUE)) install.packages("kohonen")
library(kohonen)
set.seed(123)
X <- as.matrix(lol_scaled)
som_grid <- somgrid(xdim = 10, ydim = 10, topo = "hexagonal")
som_model <- kohonen::som(
X = X,
grid = som_grid,
rlen = 200,
alpha = c(0.05, 0.01),
keep.data = TRUE
)
lol$SOM_unit <- som_model$unit.classif
total_nodes <- som_model$grid$xdim * som_model$grid$ydim
node_counts <- as.integer(table(factor(lol$SOM_unit, levels = 1:total_nodes)))
empty_nodes <- sum(node_counts == 0)
cat("SOM nodes:", total_nodes, "\n")
## SOM nodes: 100
cat("Empty nodes:", empty_nodes, "(", round(empty_nodes / total_nodes * 100, 1), "% )\n\n")
## Empty nodes: 22 ( 22 % )
codes <- som_model$codes[[1]]
bmu_codes <- codes[lol$SOM_unit, , drop = FALSE]
qe <- sqrt(rowSums((as.matrix(lol_scaled) - bmu_codes)^2))
cat("Mean quantization error:", round(mean(qe), 3), "\n")
## Mean quantization error: 1.82
cat("Median quantization error:", round(median(qe), 3), "\n")
## Median quantization error: 1.642
cat("Max quantization error:", round(max(qe), 3), "\n")
## Max quantization error: 5.497
plot(som_model, type = "dist.neighbours",
main = "SOM U-Matrix (neighbor distances)")
The U-Matrix shows neighbor distances across the 10×10 SOM grid, where
lighter colors indicate greater distance between adjacent nodes,
essentially revealing the “walls” between clusters. The two bright
white/yellow regions visible in the upper-center and left-center areas
of the grid mark the boundaries between champion archetypes, confirming
that the clusters are not gradually merging but separated by genuine
gaps in feature space. The predominantly red background indicates that
most nodes within each region are densely packed and internally similar,
which is consistent with the tight cluster structure seen in all
previous methods.
plot(
som_model, type = "mapping",
main = "SOM mapping (points colored by K-means cluster)",
pchs = 19,
col = as.integer(lol$cluster)
)
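# add.cluster.boundaries() expects one label per SOM node (not per champion),
# so label each node by the K-means centroid nearest to its codebook vector.
# Assumes the K-means clusters were fitted on lol_scaled.
cl_int <- as.integer(lol$cluster)
centroids <- rowsum(as.matrix(lol_scaled), cl_int) / as.vector(table(cl_int))
node_cluster <- apply(codes, 1, function(cb) which.min(colSums((t(centroids) - cb)^2)))
add.cluster.boundaries(som_model, node_cluster)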
The mapping plot shows that the four K-means clusters occupy largely
separate regions of the SOM grid with minimal mixing. Note that this
plot uses base R colors via col = as.integer(lol$cluster), so Clusters 1
to 4 are drawn in black, pink, green, and blue rather than in the ggplot
colors of the earlier figures. Green (Cluster 3) dominates the left and
upper-left portion, black (Cluster 1) covers the lower-right, pink
(Cluster 2) sits in the upper-right, and the small blue (Cluster 4)
group appears as an isolated pair of nodes in the center, again
pointing to its status as a distinct archetype. The thick black
boundaries added by add.cluster.boundaries() align well with where
champions of different colors actually separate, supporting the K=4
solution in this entirely different representational framework.
node_unc <- tapply(lol$uncertainty, lol$SOM_unit, mean)
prop <- rep(NA_real_, total_nodes)
prop[as.integer(names(node_unc))] <- as.numeric(node_unc)
if (!requireNamespace("viridisLite", quietly = TRUE)) install.packages("viridisLite")
plot(
som_model, type = "property",
property = prop,
palette.name = viridisLite::viridis,
main = "SOM nodes colored by mean fuzzy uncertainty"
)
The uncertainty map reveals a striking spatial pattern: the highest
uncertainty nodes (yellow) are not uniformly distributed but
concentrated in specific transition zones between the cluster regions,
particularly in the center and lower-left of the grid. This means the
SOM has placed the most hybrid champions at the boundaries between
archetypes on the map, which is what a well-trained map should do. The
deep purple nodes (low uncertainty) dominate the corners and edges,
indicating that the most archetypal champions sit at the periphery of
the map, away from the transition zones. With 100 nodes for 167
champions (fewer than two champions per node on average), 22% empty
nodes is unsurprising, and the mean quantization error of 1.82 (median
1.642, in units of the scaled features) suggests that the map has
learned a reasonable, if coarse, representation of the feature space.
Across all five dimensionality reduction methods (PCA, MDS, UMAP, t-SNE, and SOM), the results are consistent and mutually reinforcing. In every embedding the K-means clusters occupy separate regions, with Cluster 3 (ranged/mana-dependent) and Clusters 1 and 2 (melee/tanky sub-archetypes) forming the primary division, and the small Cluster 4 appearing as a coherent isolated group rather than noise. The linear methods (PCA and MDS, which are equivalent for Euclidean distances) show that this structure is present in Euclidean distances, while the non-linear methods (UMAP and t-SNE) additionally show that the clusters have tight local neighborhood structure with very little overlap. The SOM reproduces the same topology through a different learning mechanism, with the U-Matrix boundaries broadly aligning with the K-means partition. The fuzzy uncertainty signal is equally consistent across all views: hybrid champions appear at or near the geometric boundaries between clusters regardless of which method is used, which indicates that the uncertainty scores measure a real property of the data rather than an artifact of any single embedding. Taken together, these results provide multi-method support that the champion archetypes identified in this analysis are stable, interpretable, and present in the League of Legends stat design.
specialists <- lol %>%
arrange(uncertainty) %>%
select(Name, uncertainty, max_membership, fuzzy_cluster_label) %>%
head(10)
knitr::kable(
specialists, digits = 3,
col.names = c("Champion", "Uncertainty", "Max Membership", "Fuzzy label"),
caption = "Top 10 Specialist Champions (Lowest Uncertainty)"
)
| Champion | Uncertainty | Max Membership | Fuzzy label |
|---|---|---|---|
| Twisted Fate | 0.018 | 0.982 | Ranged/Mana-like |
| Syndra | 0.020 | 0.980 | Ranged/Mana-like |
| Ahri | 0.025 | 0.975 | Ranged/Mana-like |
| Zilean | 0.027 | 0.973 | Ranged/Mana-like |
| Jarvan IV | 0.028 | 0.972 | Melee/Tanky-like |
| Varus | 0.029 | 0.971 | Ranged/Mana-like |
| Xin Zhao | 0.030 | 0.970 | Melee/Tanky-like |
| Malzahar | 0.032 | 0.968 | Ranged/Mana-like |
| Poppy | 0.032 | 0.968 | Melee/Tanky-like |
| Ziggs | 0.032 | 0.968 | Ranged/Mana-like |
Specialist Champions
The ten most archetypal champions are dominated by the Ranged/Mana-like group (seven of ten), with Twisted Fate (uncertainty = 0.018, max membership = 0.982) and Syndra (0.020) being the most statistically “pure” champions in the entire dataset. Their stat profiles (high attack range, large mana pools, strong mana scaling) align so closely with the ranged prototype that the fuzzy model assigns them with near-certainty. The three melee representatives in this list (Jarvan IV, Xin Zhao, Poppy) are almost as unambiguous within their archetype, characterized by high base armor, magic resistance scaling, and low attack range, with essentially no deviation toward ranged stats.
hybrids <- lol %>%
arrange(desc(uncertainty)) %>%
select(Name, uncertainty, max_membership, fuzzy_cluster_label, cluster, hclust_cluster) %>%
head(15)
knitr::kable(
hybrids, digits = 3,
col.names = c("Champion", "Uncertainty", "Max Membership", "Fuzzy label", "K-means", "Hierarchical"),
caption = "Top Hybrid / Potential Misfit Champions (Highest Uncertainty)"
)
| Champion | Uncertainty | Max Membership | Fuzzy label | K-means | Hierarchical |
|---|---|---|---|---|---|
| Briar | 0.497 | 0.503 | Melee/Tanky-like | 2 | 1 |
| Gnar | 0.472 | 0.528 | Melee/Tanky-like | 2 | 1 |
| Thresh | 0.443 | 0.557 | Ranged/Mana-like | 3 | 2 |
| Vladimir | 0.441 | 0.559 | Ranged/Mana-like | 2 | 1 |
| Kled | 0.436 | 0.564 | Melee/Tanky-like | 2 | 1 |
| Jhin | 0.426 | 0.574 | Ranged/Mana-like | 3 | 2 |
| Nilah | 0.418 | 0.582 | Melee/Tanky-like | 1 | 4 |
| Bel’Veth | 0.410 | 0.590 | Melee/Tanky-like | 2 | 1 |
| Kennen | 0.404 | 0.596 | Ranged/Mana-like | 4 | 3 |
| Ryze | 0.399 | 0.601 | Ranged/Mana-like | 3 | 4 |
| Wukong | 0.398 | 0.602 | Melee/Tanky-like | 1 | 4 |
| Rakan | 0.392 | 0.608 | Melee/Tanky-like | 1 | 4 |
| Graves | 0.389 | 0.611 | Melee/Tanky-like | 1 | 4 |
| Taric | 0.386 | 0.614 | Melee/Tanky-like | 1 | 4 |
| Kassadin | 0.371 | 0.629 | Ranged/Mana-like | 3 | 2 |
Hybrid Champions
The hybrid table tells a more nuanced story. Briar (0.497) and Gnar (0.472) have max memberships barely above 0.5, meaning the algorithm is almost unable to decide which archetype they belong to; they are as close to a true statistical midpoint as possible. The K-Means and Hierarchical columns need care when compared, because cluster numbers are arbitrary labels that are not aligned across methods. In this table, K-Means 2 corresponds to Hierarchical 1, K-Means 3 to Hierarchical 2, K-Means 1 to Hierarchical 4, and K-Means 4 to Hierarchical 3, so differing numbers (e.g. Vladimir: 2 vs 1, Nilah: 1 vs 4) do not indicate disagreement. Under this correspondence, the only champion in the list that the two methods genuinely place in different groups is Ryze (K-Means 3, but Hierarchical 4 alongside the K-Means 1 champions), which independently supports his borderline status. The ten highest-ranked champions have a clear in-game reason for their hybridity, as discussed in the t-SNE section, reinforcing that these are genuine design patterns rather than data artefacts.
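Cluster numbers from two different algorithms are arbitrary labels, so they have to be aligned before any disagreement can be read off. A minimal sketch of that alignment is shown below, using the 15 label pairs from the hybrid table above as input. Matching each K-means cluster to its most frequent hierarchical label is one simple choice among several and is an assumption of this sketch.

```r
# Label pairs copied from the hybrid table above.
champs <- c("Briar", "Gnar", "Thresh", "Vladimir", "Kled", "Jhin", "Nilah",
            "Bel'Veth", "Kennen", "Ryze", "Wukong", "Rakan", "Graves",
            "Taric", "Kassadin")
km <- c(2, 2, 3, 2, 2, 3, 1, 2, 4, 3, 1, 1, 1, 1, 3)
hc <- c(1, 1, 2, 1, 1, 2, 4, 1, 3, 4, 4, 4, 4, 4, 2)

# Cross-tabulate the two partitions.
tab <- table(kmeans = km, hclust = hc)
tab

# Match each K-means cluster to its most frequent hierarchical label.
majority <- colnames(tab)[apply(tab, 1, which.max)]
names(majority) <- rownames(tab)
majority

# Champions whose hierarchical label differs from the matched one.
expected_h <- majority[as.character(km)]
mismatch   <- champs[expected_h != as.character(hc)]
mismatch
```

On these 15 rows only Ryze is flagged. Running the same cross-tabulation on the full lol$cluster and lol$hclust_cluster columns would give the complete list of genuine cross-method disagreements.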
The interactive table provides a complete per-champion summary of all clustering results, combining K-Means assignment, Hierarchical clustering label, fuzzy archetype, uncertainty score, max membership, and all dimensionality reduction coordinates in one searchable view. It is sorted by uncertainty by default, so the most statistically ambiguous champions appear first. This table serves as a reference tool: readers can search for any specific champion to inspect how consistently it was classified across all methods, or filter by fuzzy label to compare uncertainty distributions within each archetype. When comparing the K-Means and Hierarchical columns, keep in mind that cluster numbers are arbitrary and not aligned across methods, so pairs such as Nilah (KM1, H4) or Kennen (KM4, H3) are not disagreements in themselves. A champion is worth examining when its label pair departs from the pairing shared by most members of its K-Means cluster (e.g. Ryze: KM3 with H4, where the other KM3 champions in the hybrid table have H2), as this kind of cross-method disagreement is an independent signal of borderline archetype membership that complements the fuzzy uncertainty scores.
champion_summary <- lol %>%
select(
Name, cluster, hclust_cluster, fuzzy_cluster_label,
uncertainty, max_membership, entropy_norm,
PC1, PC2, MDS1, MDS2, UMAP1, UMAP2, TSNE1, TSNE2, SOM_unit
) %>%
mutate(
cluster = paste0("KM", as.integer(cluster)),
hclust_cluster = paste0("H", as.integer(hclust_cluster))
) %>%
arrange(desc(uncertainty))
datatable(
champion_summary,
options = list(pageLength = 15, autoWidth = TRUE),
caption = "Interactive Champion Results (sorted by uncertainty)",
filter = "top",
rownames = FALSE
) %>%
formatRound(
columns = c(
"uncertainty", "max_membership", "entropy_norm",
"PC1", "PC2", "MDS1", "MDS2",
"UMAP1", "UMAP2", "TSNE1", "TSNE2"
),
digits = 3
)
The most fundamental discovery of this analysis is that League of Legends champion design is organized around two orthogonal statistical principles that together explain 43.5% of all variance.
The first and strongest is the ranged vs. melee divide. Attack.range has the largest loading on PC1, and it correlates negatively with almost every durability stat (armor, magic resistance, base HP) as well as with movement speed, which indicates that Riot Games has built a systematic trade-off between reach and survivability into champion design. The second principle is resource dependency: mana-related stats form a tightly correlated cluster that is largely independent of the ranged/melee axis, meaning a champion’s resource identity is a separate design dimension from their combat positioning.
The hard clustering analysis reveals that these two principles together produce four stable archetypes: a large ranged/mana-dependent group (marksmen and mages), two distinct melee sub-archetypes separated by resource scaling and sustain, and a small but genuine fourth group of champions with extreme stat profiles that a coarser model (such as K = 2) absorbs into a larger cluster.
The fuzzy analysis adds an important nuance: while most champions fit cleanly into one archetype (median uncertainty 0.118), a meaningful minority are genuine statistical hybrids. The labeled t-SNE plot identifies these by name: Jhin, Nilah, Vladimir, Kennen, Thresh, and Gnar are among the champions whose stat profiles most straddle the boundary between archetypes, and in every case the statistical hybridity reflects a real design decision, such as a melee ADC, a manaless support, a transforming champion, or an unconventional resource mechanic.
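To make the hybridity scores concrete, the sketch below derives per-champion scores from a fuzzy membership matrix. The membership values are invented, and the definitions used (uncertainty as one minus the largest membership, entropy divided by its maximum) are assumptions chosen for illustration; the scores reported in this analysis may be defined differently.

```r
# Hypothetical membership matrix: one row per champion, one column per
# fuzzy cluster, each row summing to 1.
u <- matrix(
  c(0.98, 0.02,
    0.90, 0.10,
    0.55, 0.45,
    0.50, 0.50),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("clear_a", "clear_b", "hybrid_a", "hybrid_b"),
                  c("c1", "c2"))
)
K <- ncol(u)

# Strength of the dominant archetype for each champion.
max_membership <- apply(u, 1, max)

# Assumed definition: uncertainty is whatever membership is left over.
uncertainty <- 1 - max_membership

# Shannon entropy of each row, divided by log(K) so that 0 means a crisp
# assignment and 1 means membership is spread evenly over all clusters.
entropy_norm <- apply(u, 1, function(p) {
  terms <- ifelse(p > 0, p * log(p), 0)  # treat 0 * log(0) as 0
  -sum(terms) / log(K)
})

scores <- data.frame(max_membership, uncertainty, entropy_norm)
print(round(scores[order(-scores$uncertainty), ], 3))
```

Under these definitions a champion split evenly between two clusters has uncertainty 0.5 and normalized entropy 1, while a champion with membership 0.98 in one cluster scores close to 0 on both.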
Whether these hybrids are genuine or merely artifacts of the data is an important methodological question. Several potential data issues could in principle produce apparent hybrids artificially. Champions with missing or imputed stats, those added late in the game’s lifecycle with non-standard base values, or champions whose mechanics are not fully captured by static base stats (such as transformation champions like Gnar or resource-agnostic champions like Thresh) could appear as statistical outliers for reasons unrelated to their actual gameplay identity.
However, several factors suggest the hybrid signal in this analysis is genuine rather than artifactual. First, the same champions appear as high-uncertainty cases consistently across all five dimensionality reduction methods; if the hybridity were noise or a data error, it would be unlikely to reproduce so reliably. Second, in every identified case the statistical ambiguity has a clear in-game explanation: Nilah is a melee champion designed to fill the ADC role, Jhin has a deliberately unique attack speed mechanic, Thresh uses souls instead of mana, and Kennen is a ranged champion with unusually high base durability.
Third, the uncertainty scores are continuous and graded rather than binary, which is more consistent with genuine design variation than with data errors, which would tend to produce sharp outliers.
The main genuine limitation is that base stats alone do not capture the full complexity of a champion’s kit: abilities, scaling, and itemization are not included in this dataset. Some champions may therefore appear statistically hybrid simply because their power is concentrated in their abilities rather than their base stats, which is a dataset limitation rather than a reflection of true archetype ambiguity.
This analysis set out to answer three research questions using unsupervised learning on League of Legends champion statistics, and all three can now be answered with confidence.
How many distinct champion archetypes exist? The optimal hard partition is K=4, supported by convergent evidence from the Elbow method, NbClust consensus, silhouette analysis, and near-perfect agreement (ARI = 0.978) between K-Means and Hierarchical clustering. These four archetypes are stable, reproducible across algorithms, and visible in every dimensionality reduction method applied.
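For readers unfamiliar with the agreement measure cited above, the following sketch computes the Adjusted Rand Index between a K-Means and a Hierarchical partition of simulated data. The helper `adjusted_rand()` is a hand-written illustration so that the example needs only base R, and the Ward linkage is an assumption; neither is claimed to be the exact procedure behind the 0.978 figure.

```r
# Adjusted Rand Index from the contingency table of two labelings.
adjusted_rand <- function(a, b) {
  tab   <- table(a, b)
  comb2 <- function(x) x * (x - 1) / 2          # "choose 2", vectorized
  sum_ij   <- sum(comb2(tab))                   # pairs together in both
  sum_a    <- sum(comb2(rowSums(tab)))          # pairs together in a
  sum_b    <- sum(comb2(colSums(tab)))          # pairs together in b
  expected <- sum_a * sum_b / comb2(length(a))  # agreement expected by chance
  max_idx  <- (sum_a + sum_b) / 2
  (sum_ij - expected) / (max_idx - expected)
}

# Simulated data: four well-separated groups in two dimensions.
set.seed(1)
toy <- rbind(
  matrix(rnorm(60, mean = 0,  sd = 0.5), ncol = 2),
  matrix(rnorm(60, mean = 5,  sd = 0.5), ncol = 2),
  matrix(rnorm(60, mean = 10, sd = 0.5), ncol = 2),
  matrix(rnorm(60, mean = 15, sd = 0.5), ncol = 2)
)

# Two independent hard partitions with K = 4.
km <- kmeans(toy, centers = 4, nstart = 25)$cluster
hc <- cutree(hclust(dist(toy), method = "ward.D2"), k = 4)

ari <- adjusted_rand(km, hc)
print(ari)
```

The index is 1 when two partitions are identical up to a relabeling of clusters and close to 0 when they agree no more than chance would predict, which is why it suits comparisons between algorithms whose cluster numbers are arbitrary.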
Which attributes best discriminate between champion types? Attack.range is the single most discriminative feature in the dataset, driving the primary axis of separation (PC1, 28% of variance) and negatively correlating with almost every durability stat. The secondary axis (PC2, 15.5%) is defined by mana dependency and resource scaling, revealing that champion identity is structured around two orthogonal design dimensions that appear to be deliberately built into Riot Games’ champion design philosophy.
Can hybridity be quantified using fuzzy memberships? Yes. The Fuzzy C-Means model (K=2, m=1.5) produces uncertainty scores that are geometrically meaningful, consistent across all five dimensionality reduction methods, and interpretable by name: champions like Briar, Gnar, Thresh, and Jhin score highest not by accident but because their in-game design deliberately blends statistical properties from both archetypes.
Taken together, the results suggest that League of Legends champion design is not arbitrary but follows an underlying statistical grammar with two primary axes and four stable role families. The unsupervised approach used here recovers this structure without any prior knowledge of champion roles or tags, which validates both the methodology and the insight. Future work could extend this analysis by incorporating ability data, patch history, or win-rate statistics to move from descriptive archetype discovery toward predictive modeling of champion performance and balance.
Beyond academic interest, this analysis has a direct practical application for players. A player who enjoys a particular champion can use the clustering and uncertainty scores to find statistically similar alternatives: champions that share the same fundamental stat profile and therefore likely feel similar to play. For example, a player who enjoys Twisted Fate (the most archetypal Ranged/Mana-like champion) can look for low-uncertainty neighbors in the same cluster, while a player drawn to hybrid champions like Gnar or Vladimir might find other high-uncertainty champions equally satisfying due to their similarly flexible stat designs. The interactive table provided in this report makes this kind of personalized exploration directly accessible.
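One possible way to turn this idea into a lookup is sketched below. The helper `similar_to()`, the toy table, and the uncertainty cutoff of 0.2 are all hypothetical; the sketch ranks same-cluster, low-uncertainty champions by Euclidean distance in standardized stat space, which is only one of several reasonable notions of similarity.

```r
# Toy table with invented names and values, standing in for the real data.
toy <- data.frame(
  Name         = c("ChampA", "ChampB", "ChampC", "ChampD", "ChampE", "ChampF"),
  cluster      = c("KM1", "KM1", "KM1", "KM2", "KM2", "KM2"),
  uncertainty  = c(0.05, 0.08, 0.40, 0.10, 0.06, 0.12),
  attack_range = c(550, 525, 575, 125, 150, 175),
  base_hp      = c(600, 610, 640, 690, 700, 660),
  armor        = c(20, 22, 28, 38, 36, 33),
  stringsAsFactors = FALSE
)

# Hypothetical helper: nearest low-uncertainty champions in the same cluster.
similar_to <- function(data, name, features, n = 2, max_uncertainty = 0.2) {
  # Standardize the features so each contributes on the same scale.
  scaled <- scale(data[, features])
  rownames(scaled) <- data$Name

  target <- data[data$Name == name, ]
  candidates <- data[
    data$Name != name &
      data$cluster == target$cluster &
      data$uncertainty <= max_uncertainty, ]

  # Euclidean distance from each candidate to the target champion.
  diffs <- sweep(scaled[candidates$Name, , drop = FALSE], 2, scaled[name, ])
  candidates$distance <- sqrt(rowSums(diffs^2))

  head(candidates[order(candidates$distance),
                  c("Name", "uncertainty", "distance")], n)
}

features <- c("attack_range", "base_hp", "armor")
like_a <- similar_to(toy, "ChampA", features)
like_d <- similar_to(toy, "ChampD", features)
print(like_a)
print(like_d)
```

Raising `max_uncertainty` would instead surface hybrid champions, which matches the second use case described above.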
sessionInfo()
## R version 4.5.2 (2025-10-31)
## Platform: aarch64-apple-darwin20
## Running under: macOS Sequoia 15.5
##
## Matrix products: default
## BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.1
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: Europe/Warsaw
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] kohonen_3.0.13 Rtsne_0.17 uwot_0.2.4 Matrix_1.7-4
## [5] corrplot_0.95 mclust_6.1.2 DT_0.34.0 ggrepel_0.9.6
## [9] gridExtra_2.3 e1071_1.7-17 NbClust_3.0.1 factoextra_1.0.7
## [13] cluster_2.1.8.1 lubridate_1.9.4 forcats_1.0.1 stringr_1.6.0
## [17] dplyr_1.1.4 purrr_1.2.1 readr_2.1.6 tidyr_1.3.2
## [21] tibble_3.3.0 ggplot2_4.0.1 tidyverse_2.0.0
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.6 xfun_0.55 bslib_0.9.0 htmlwidgets_1.6.4
## [5] rstatix_0.7.3 lattice_0.22-7 tzdb_0.5.0 crosstalk_1.2.2
## [9] vctrs_0.6.5 tools_4.5.2 generics_0.1.4 proxy_0.4-27
## [13] pkgconfig_2.0.3 RColorBrewer_1.1-3 S7_0.2.1 lifecycle_1.0.4
## [17] FNN_1.1.4.1 compiler_4.5.2 farver_2.1.2 carData_3.0-5
## [21] htmltools_0.5.9 class_7.3-23 sass_0.4.10 yaml_2.3.12
## [25] Formula_1.2-5 pillar_1.11.1 car_3.1-3 ggpubr_0.6.2
## [29] jquerylib_0.1.4 cachem_1.1.0 viridis_0.6.5 abind_1.4-8
## [33] RSpectra_0.16-2 gtools_3.9.5 tidyselect_1.2.1 digest_0.6.39
## [37] stringi_1.8.7 reshape2_1.4.5 labeling_0.4.3 fastmap_1.2.0
## [41] grid_4.5.2 cli_3.6.5 magrittr_2.0.4 broom_1.0.11
## [45] withr_3.0.2 scales_1.4.0 backports_1.5.0 timechange_0.3.0
## [49] rmarkdown_2.30 otel_0.2.0 ggsignif_0.6.4 hms_1.1.4
## [53] evaluate_1.0.5 knitr_1.51 viridisLite_0.4.2 rlang_1.1.6
## [57] dendextend_1.19.1 Rcpp_1.1.0 glue_1.8.0 rstudioapi_0.17.1
## [61] jsonlite_2.0.0 R6_2.6.1 plyr_1.8.9