Clustering is an unsupervised learning task in which we aim to identify latent group structure in data without using class labels during model training. In practical analytics, clustering can be used to:

- segment customers into behavioral groups,
- identify product archetypes,
- detect anomalous observations,
- build compressed representations for downstream modeling.
Can we discover meaningful clusters from penguin physical measurements, and which clustering approach produces the most coherent grouping under internal validation criteria?
This report compares multiple clustering methods and evaluates them using internal validation metrics. The dataset includes a known species label; it is not used for clustering, but it is used at the end for interpretation and an optional external sanity check.
We use the palmerpenguins dataset, which contains measurements of penguins
from three species (Adelie, Chinstrap, and Gentoo). We focus on four numeric
morphology variables: bill_length_mm, bill_depth_mm, flipper_length_mm, and body_mass_g.
required_pkgs <- c(
"palmerpenguins", "dplyr", "tidyr", "ggplot2",
"cluster", "dbscan", "mclust", "GGally"
)
missing <- required_pkgs[!vapply(required_pkgs, requireNamespace, quietly = TRUE, FUN.VALUE = logical(1))]
if (length(missing) > 0) {
  stop(
    "Missing R packages: ", paste(missing, collapse = ", "),
    "\nPlease install them first, e.g.:\n",
    "install.packages(c(", paste0('"', missing, '"', collapse = ", "),
    "), repos = 'https://cloud.r-project.org')"
  )
}
invisible(lapply(required_pkgs, library, character.only = TRUE))
data("penguins", package = "palmerpenguins")
penguins_raw <- penguins
dplyr::glimpse(penguins_raw)
## Rows: 344
## Columns: 8
## $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
## $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
## $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
## $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
## $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
## $ sex <fct> male, female, female, NA, female, male, female, male…
## $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
The dataset contains missing values in some numeric fields. For
clustering, we need a complete numeric matrix.
Here we use a simple, transparent approach: complete-case
filtering on the selected numeric variables.
num_vars <- c("bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g")
penguins_cln <- penguins_raw %>%
dplyr::select(species, island, sex, dplyr::all_of(num_vars)) %>%
tidyr::drop_na(dplyr::all_of(num_vars))
n_removed <- nrow(penguins_raw) - nrow(penguins_cln)
cat("Rows removed due to missingness:", n_removed, "\n")
## Rows removed due to missingness: 2
dplyr::glimpse(penguins_cln)
## Rows: 342
## Columns: 7
## $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
## $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
## $ sex <fct> male, female, female, female, male, female, male, NA…
## $ bill_length_mm <dbl> 39.1, 39.5, 40.3, 36.7, 39.3, 38.9, 39.2, 34.1, 42.0…
## $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, 19.3, 20.6, 17.8, 19.6, 18.1, 20.2…
## $ flipper_length_mm <int> 181, 186, 195, 193, 190, 181, 195, 193, 190, 186, 18…
## $ body_mass_g <int> 3750, 3800, 3250, 3450, 3650, 3625, 4675, 3475, 4250…
penguins_cln %>%
  dplyr::summarise(
    n = dplyr::n(),
    dplyr::across(
      dplyr::all_of(num_vars),
      list(mean = mean, sd = sd, min = min, max = max)
    )
  )
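The one-row result is wide. As a readability sketch (using the already-loaded tidyr; the regex assumes the variable_stat column names produced by across()), we can reshape it into one row per variable:

penguins_cln %>%
  dplyr::summarise(
    dplyr::across(
      dplyr::all_of(num_vars),
      list(mean = mean, sd = sd, min = min, max = max)
    )
  ) %>%
  # Split names like "bill_length_mm_mean" into variable + statistic
  tidyr::pivot_longer(
    dplyr::everything(),
    names_to = c("variable", "stat"),
    names_pattern = "(.*)_(mean|sd|min|max)"
  ) %>%
  tidyr::pivot_wider(names_from = stat, values_from = value)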
Clusters often become visible in pairwise scatterplots.
GGally::ggpairs(
  penguins_cln,
  columns = num_vars,
  mapping = aes(alpha = 0.5)
)
Distance-based clustering is sensitive to scale. We standardize the numeric variables to mean 0 and sd 1.
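Concretely, scale() transforms each variable x column-wise to z = (x - mean(x)) / sd(x), so all four features contribute on a comparable scale to the Euclidean distances used below.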
X <- penguins_cln %>%
dplyr::select(all_of(num_vars)) %>%
as.data.frame()
X_scaled <- scale(X)
X_scaled <- as.matrix(X_scaled)
dist_mat <- dist(X_scaled, method = "euclidean")
set.seed(123)
We compare candidate k values using:

- the elbow method (total within-cluster sum of squares),
- the average silhouette width (higher is better).
k_grid <- 2:6
kmeans_stats <- lapply(k_grid, function(k) {
km <- kmeans(X_scaled, centers = k, nstart = 50)
sil <- cluster::silhouette(km$cluster, dist_mat)
data.frame(
k = k,
tot_withinss = km$tot.withinss,
avg_silhouette = mean(sil[, 3])
)
}) %>% dplyr::bind_rows()
kmeans_stats
ggplot(kmeans_stats, aes(x = k, y = tot_withinss)) +
geom_line() + geom_point() +
labs(title = "Elbow method", x = "Number of clusters (k)", y = "Total within-cluster SS")
ggplot(kmeans_stats, aes(x = k, y = avg_silhouette)) +
geom_line() + geom_point() +
labs(title = "Average silhouette by k", x = "Number of clusters (k)", y = "Average silhouette")
We select k = 3 as a strong candidate because it typically balances interpretability and internal validity for this dataset. Adjust k if your elbow and silhouette diagnostics point strongly elsewhere.
k_chosen <- 3
km3 <- kmeans(X_scaled, centers = k_chosen, nstart = 100)
table(km3$cluster)
##
## 1 2 3
## 132 123 87
K-means centers are expressed in the scaled feature space. For interpretability, we also convert them back to original units.
centers_scaled <- km3$centers
centers_scaled
## bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## 1 -1.0465260 0.4858415 -0.8899121 -0.7694891
## 2 0.6562677 -1.0983711 1.1571696 1.0901639
## 3 0.6600059 0.8157307 -0.2857869 -0.3737654
centers_original <- sweep(centers_scaled, 2, attr(X_scaled, "scaled:scale"), `*`)
centers_original <- sweep(centers_original, 2, attr(X_scaled, "scaled:center"), `+`)
centers_original <- as.data.frame(centers_original)
centers_original$cluster <- 1:nrow(centers_original)
centers_original
We use PCA only for visualization.
pca <- prcomp(X_scaled, center = FALSE, scale. = FALSE)
pca_df <- as.data.frame(pca$x[, 1:2])
pca_df$cluster_kmeans <- factor(km3$cluster)
ggplot(pca_df, aes(x = PC1, y = PC2, color = cluster_kmeans)) +
geom_point(alpha = 0.8) +
labs(title = "K-means clusters visualized in PCA space", color = "Cluster")
Hierarchical clustering provides a dendrogram and can be a robust alternative to k-means.
hc <- hclust(dist_mat, method = "ward.D2")
plot(hc, labels = FALSE, main = "Hierarchical clustering dendrogram (Ward.D2)")
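# Optional: outline the chosen k = 3 cut directly on the dendrogram
rect.hclust(hc, k = k_chosen, border = "grey40")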
hc_cut <- cutree(hc, k = k_chosen)
table(hc_cut)
## hc_cut
## 1 2 3
## 162 123 57
pca_df$cluster_hc <- factor(hc_cut)
ggplot(pca_df, aes(x = PC1, y = PC2, color = cluster_hc)) +
geom_point(alpha = 0.8) +
labs(title = "Hierarchical clusters (Ward) in PCA space", color = "Cluster")
Model-based clustering assumes the data are generated from a mixture of Gaussian distributions and selects the number of components by BIC.
mc <- mclust::Mclust(X_scaled)
summary(mc)
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust VEE (ellipsoidal, equal shape and orientation) model with 4 components:
##
## log-likelihood n df BIC ICL
## -1154.901 342 32 -2496.516 -2535.667
##
## Clustering table:
## 1 2 3 4
## 154 66 57 65
mc_cluster <- mc$classification
table(mc_cluster)
## mc_cluster
## 1 2 3 4
## 154 66 57 65
pca_df$cluster_mclust <- factor(mc_cluster)
ggplot(pca_df, aes(x = PC1, y = PC2, color = cluster_mclust)) +
geom_point(alpha = 0.8) +
labs(title = "Model-based clusters (Mclust) in PCA space", color = "Cluster")
DBSCAN can discover non-spherical clusters and mark outliers as noise
(cluster 0).
We choose eps using a simple data-driven heuristic based on
kNN distances.
k_nn <- 5
knn_dist <- dbscan::kNNdist(X_scaled, k = k_nn)
# Heuristic: pick eps around a high quantile of kNN distances.
eps_candidate <- as.numeric(stats::quantile(knn_dist, 0.90))
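# Visual check of the heuristic: the knee of the sorted kNN-distance curve
# is the usual guide for eps; the dashed line marks our quantile candidate.
dbscan::kNNdistplot(X_scaled, k = k_nn)
abline(h = eps_candidate, lty = 2)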
db <- dbscan::dbscan(X_scaled, eps = eps_candidate, minPts = k_nn + 1)
table(db$cluster)
##
## 0 1 2
## 9 213 120
pca_df$cluster_dbscan <- factor(db$cluster)
ggplot(pca_df, aes(x = PC1, y = PC2, color = cluster_dbscan)) +
geom_point(alpha = 0.8) +
labs(title = "DBSCAN clustering in PCA space (0 = noise)", color = "Cluster")
We compare clusterings using:

- average silhouette width (higher is better),
- the Calinski–Harabasz (CH) index (higher is better),
- the Davies–Bouldin (DB) index (lower is better).
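For reference, the silhouette of an observation i is s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the mean distance from i to the other points in its cluster and b(i) is the smallest mean distance from i to the points of any other cluster.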
The functions below implement CH and DB in a transparent way.
avg_silhouette <- function(cl, dist_obj) {
cl <- as.integer(cl)
if (length(unique(cl)) < 2) return(NA_real_)
sil <- cluster::silhouette(cl, dist_obj)
mean(sil[, 3])
}
calinski_harabasz <- function(X, cl) {
cl <- as.integer(cl)
if (length(unique(cl)) < 2) return(NA_real_)
n <- nrow(X)
k <- length(unique(cl))
overall_mean <- colMeans(X)
# Between-cluster and within-cluster sum of squares
B <- 0
W <- 0
for (g in unique(cl)) {
Xg <- X[cl == g, , drop = FALSE]
ng <- nrow(Xg)
cg <- colMeans(Xg)
B <- B + ng * sum((cg - overall_mean)^2)
W <- W + sum(rowSums((Xg - matrix(cg, nrow = ng, ncol = ncol(Xg), byrow = TRUE))^2))
}
(B / (k - 1)) / (W / (n - k))
}
davies_bouldin <- function(X, cl) {
cl <- as.integer(cl)
groups <- sort(unique(cl))
if (length(groups) < 2) return(NA_real_)
centroids <- lapply(groups, function(g) colMeans(X[cl == g, , drop = FALSE]))
centroids <- do.call(rbind, centroids)
# Scatter: mean distance to centroid within each cluster
S <- sapply(seq_along(groups), function(i) {
g <- groups[i]
Xg <- X[cl == g, , drop = FALSE]
c_i <- centroids[i, ]
mean(sqrt(rowSums((Xg - matrix(c_i, nrow = nrow(Xg), ncol = ncol(Xg), byrow = TRUE))^2)))
})
# Pairwise centroid distances
M <- as.matrix(dist(centroids))
R <- matrix(NA_real_, nrow = length(groups), ncol = length(groups))
for (i in seq_along(groups)) {
for (j in seq_along(groups)) {
if (i != j) R[i, j] <- (S[i] + S[j]) / M[i, j]
}
}
D <- apply(R, 1, max, na.rm = TRUE)
mean(D)
}
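As an optional sanity check, the hand-rolled CH index can be compared against an existing implementation. This sketch assumes the fpc package, which is not among the required packages above:

# Compare our CH value to fpc::cluster.stats (only if fpc is installed)
if (requireNamespace("fpc", quietly = TRUE)) {
  cs <- fpc::cluster.stats(dist_mat, as.integer(km3$cluster))
  print(c(ch_ours = calinski_harabasz(X_scaled, km3$cluster), ch_fpc = cs$ch))
}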
# For DBSCAN, exclude noise (cluster 0) when computing metrics
db_nonnoise_idx <- which(db$cluster != 0)
metrics <- dplyr::tibble(
method = c("K-means (k=3)", "Hierarchical Ward (k=3)", paste0("Mclust (G=", mc$G, ")"), "DBSCAN (non-noise)"),
n_used = c(nrow(X_scaled), nrow(X_scaled), nrow(X_scaled), length(db_nonnoise_idx)),
avg_silhouette = c(
avg_silhouette(km3$cluster, dist_mat),
avg_silhouette(hc_cut, dist_mat),
avg_silhouette(mc_cluster, dist_mat),
avg_silhouette(db$cluster[db_nonnoise_idx], dist(X_scaled[db_nonnoise_idx, , drop = FALSE]))
),
calinski_harabasz = c(
calinski_harabasz(X_scaled, km3$cluster),
calinski_harabasz(X_scaled, hc_cut),
calinski_harabasz(X_scaled, mc_cluster),
calinski_harabasz(X_scaled[db_nonnoise_idx, , drop = FALSE], db$cluster[db_nonnoise_idx])
),
davies_bouldin = c(
davies_bouldin(X_scaled, km3$cluster),
davies_bouldin(X_scaled, hc_cut),
davies_bouldin(X_scaled, mc_cluster),
davies_bouldin(X_scaled[db_nonnoise_idx, , drop = FALSE], db$cluster[db_nonnoise_idx])
)
)
metrics
The dataset includes a species label. We did not use
it in clustering, but we can use it to interpret results.
We compute the Adjusted Rand Index (ARI), where 1 is perfect agreement
and 0 is random agreement.
species <- penguins_cln$species
ari_kmeans <- mclust::adjustedRandIndex(species, km3$cluster)
ari_hc <- mclust::adjustedRandIndex(species, hc_cut)
ari_mclust <- mclust::adjustedRandIndex(species, mc_cluster)
# For DBSCAN, treat noise as its own label (0)
ari_dbscan <- mclust::adjustedRandIndex(species, db$cluster)
dplyr::tibble(
method = c("K-means", "Hierarchical Ward", "Mclust", "DBSCAN"),
ARI_vs_species = c(ari_kmeans, ari_hc, ari_mclust, ari_dbscan)
)
# Simple cross-tabs for interpretation
table(Species = species, KMeans = km3$cluster)
## KMeans
## Species 1 2 3
## Adelie 127 0 24
## Chinstrap 5 0 63
## Gentoo 0 123 0
table(Species = species, Ward = hc_cut)
## Ward
## Species 1 2 3
## Adelie 151 0 0
## Chinstrap 11 0 57
## Gentoo 0 123 0
table(Species = species, Mclust = mc_cluster)
## Mclust
## Species 1 2 3 4
## Adelie 150 0 0 1
## Chinstrap 4 0 0 64
## Gentoo 0 66 57 0
table(Species = species, DBSCAN = db$cluster)
## DBSCAN
## Species 0 1 2
## Adelie 1 150 0
## Chinstrap 5 63 0
## Gentoo 3 0 120
A limitation of DBSCAN here is its sensitivity to the eps choice; its hyperparameters (eps, minPts) can require careful tuning.

sessionInfo()
## R version 4.5.1 (2025-06-13 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26200)
##
## Matrix products: default
## LAPACK version 3.12.1
##
## locale:
## [1] LC_COLLATE=Chinese (Simplified)_China.utf8
## [2] LC_CTYPE=Chinese (Simplified)_China.utf8
## [3] LC_MONETARY=Chinese (Simplified)_China.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=Chinese (Simplified)_China.utf8
##
## time zone: Europe/Warsaw
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] GGally_2.4.0 mclust_6.1.2 dbscan_1.2.4
## [4] cluster_2.1.8.1 ggplot2_4.0.0 tidyr_1.3.1
## [7] dplyr_1.1.4 palmerpenguins_0.1.1
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.6 jsonlite_2.0.0 compiler_4.5.1 tidyselect_1.2.1
## [5] Rcpp_1.1.0 ggstats_0.12.0 jquerylib_0.1.4 scales_1.4.0
## [9] yaml_2.3.10 fastmap_1.2.0 R6_2.6.1 labeling_0.4.3
## [13] generics_0.1.4 knitr_1.50 tibble_3.3.0 bslib_0.9.0
## [17] pillar_1.11.1 RColorBrewer_1.1-3 rlang_1.1.6 cachem_1.1.0
## [21] xfun_0.54 sass_0.4.10 S7_0.2.0 cli_3.6.5
## [25] withr_3.0.2 magrittr_2.0.4 digest_0.6.37 grid_4.5.1
## [29] rstudioapi_0.17.1 lifecycle_1.0.4 vctrs_0.6.5 evaluate_1.0.5
## [33] glue_1.8.0 farver_2.1.2 codetools_0.2-20 rmarkdown_2.30
## [37] purrr_1.1.0 tools_4.5.1 pkgconfig_2.0.3 htmltools_0.5.8.1