Clustering is an unsupervised learning task in which we aim to identify latent group structure in data without using class labels during model training. In practical analytics, clustering can be used to:

- segment customers into behavioral groups,
- identify product archetypes,
- detect anomalous observations,
- build compressed representations for downstream modeling.
Can we discover meaningful clusters from penguin physical measurements, and which clustering approach produces the most coherent grouping under internal validation criteria?
This report compares multiple clustering methods and evaluates them using internal validation metrics. The dataset includes a known species label; it is not used for clustering, but it is used at the end for interpretation and an optional external sanity check.
We use the palmerpenguins dataset, which contains measurements of penguins
from three species (Adelie, Chinstrap, and Gentoo). We focus on four numeric
morphology variables: bill_length_mm, bill_depth_mm, flipper_length_mm, and body_mass_g.
required_pkgs <- c(
"palmerpenguins", "dplyr", "tidyr", "ggplot2",
"cluster", "dbscan", "mclust", "GGally"
)
missing <- required_pkgs[!vapply(required_pkgs, requireNamespace, quietly = TRUE, FUN.VALUE = logical(1))]
if (length(missing) > 0) {
  stop(
    "Missing R packages: ", paste(missing, collapse = ", "),
    "\nPlease install them first, e.g.:\n",
    "install.packages(c(", paste0('"', missing, '"', collapse = ", "),
    "), repos = 'https://cloud.r-project.org')"
  )
}
invisible(lapply(required_pkgs, library, character.only = TRUE))
data("penguins", package = "palmerpenguins")
penguins_raw <- penguins
dplyr::glimpse(penguins_raw)
## Rows: 344
## Columns: 8
## $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
## $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
## $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
## $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
## $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
## $ sex <fct> male, female, female, NA, female, male, female, male…
## $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
The dataset contains missing values in some numeric fields. For
clustering, we need a complete numeric matrix.
Here we use a simple, transparent approach: complete-case
filtering on the selected numeric variables.
num_vars <- c("bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g")
penguins_cln <- penguins_raw %>%
dplyr::select(species, island, sex, dplyr::all_of(num_vars)) %>%
tidyr::drop_na(dplyr::all_of(num_vars))
n_removed <- nrow(penguins_raw) - nrow(penguins_cln)
cat("Rows removed due to missingness:", n_removed, "\n")
## Rows removed due to missingness: 2
dplyr::glimpse(penguins_cln)
## Rows: 342
## Columns: 7
## $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
## $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
## $ sex <fct> male, female, female, female, male, female, male, NA…
## $ bill_length_mm <dbl> 39.1, 39.5, 40.3, 36.7, 39.3, 38.9, 39.2, 34.1, 42.0…
## $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, 19.3, 20.6, 17.8, 19.6, 18.1, 20.2…
## $ flipper_length_mm <int> 181, 186, 195, 193, 190, 181, 195, 193, 190, 186, 18…
## $ body_mass_g <int> 3750, 3800, 3250, 3450, 3650, 3625, 4675, 3475, 4250…
penguins_cln %>%
  dplyr::summarise(
    n = dplyr::n(),
    dplyr::across(
      dplyr::all_of(num_vars),
      list(mean = mean, sd = sd, min = min, max = max)
    )
  )
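The one-row result is wide. As a readability sketch (using the already-loaded tidyr; the regex assumes the variable_stat column names produced by across()), we can reshape it into one row per variable:

penguins_cln %>%
  dplyr::summarise(
    dplyr::across(
      dplyr::all_of(num_vars),
      list(mean = mean, sd = sd, min = min, max = max)
    )
  ) %>%
  # Split names like "bill_length_mm_mean" into variable + statistic
  tidyr::pivot_longer(
    dplyr::everything(),
    names_to = c("variable", "stat"),
    names_pattern = "(.*)_(mean|sd|min|max)"
  ) %>%
  tidyr::pivot_wider(names_from = stat, values_from = value)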
Clusters often become visible in pairwise scatterplots.
GGally::ggpairs(
  penguins_cln,
  columns = num_vars,
  mapping = aes(alpha = 0.5)
)
Distance-based clustering is sensitive to scale. We standardize the numeric variables to mean 0 and sd 1.
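Concretely, scale() transforms each variable x column-wise to z = (x - mean(x)) / sd(x), so all four features contribute on a comparable scale to the Euclidean distances used below.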
X <- penguins_cln %>%
dplyr::select(all_of(num_vars)) %>%
as.data.frame()
X_scaled <- scale(X)
X_scaled <- as.matrix(X_scaled)
dist_mat <- dist(X_scaled, method = "euclidean")
set.seed(123)
We compare candidate k values using:

- the elbow method (total within-cluster sum of squares),
- the average silhouette width (higher is better).
k_grid <- 2:6
kmeans_stats <- lapply(k_grid, function(k) {
km <- kmeans(X_scaled, centers = k, nstart = 50)
sil <- cluster::silhouette(km$cluster, dist_mat)
data.frame(
k = k,
tot_withinss = km$tot.withinss,
avg_silhouette = mean(sil[, 3])
)
}) %>% dplyr::bind_rows()
kmeans_stats
ggplot(kmeans_stats, aes(x = k, y = tot_withinss)) +
geom_line() + geom_point() +
labs(title = "Elbow method", x = "Number of clusters (k)", y = "Total within-cluster SS")
ggplot(kmeans_stats, aes(x = k, y = avg_silhouette)) +
geom_line() + geom_point() +
labs(title = "Average silhouette by k", x = "Number of clusters (k)", y = "Average silhouette")
We select k = 3 as a strong candidate because it typically balances interpretability and internal validity for this dataset. Adjust k if your elbow and silhouette diagnostics point strongly elsewhere.
k_chosen <- 3
km3 <- kmeans(X_scaled, centers = k_chosen, nstart = 100)
table(km3$cluster)
##
## 1 2 3
## 132 123 87
K-means centers are expressed in the scaled feature space. For interpretability, we also convert them back to original units.
centers_scaled <- km3$centers
centers_scaled
## bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## 1 -1.0465260 0.4858415 -0.8899121 -0.7694891
## 2 0.6562677 -1.0983711 1.1571696 1.0901639
## 3 0.6600059 0.8157307 -0.2857869 -0.3737654
centers_original <- sweep(centers_scaled, 2, attr(X_scaled, "scaled:scale"), `*`)
centers_original <- sweep(centers_original, 2, attr(X_scaled, "scaled:center"), `+`)
centers_original <- as.data.frame(centers_original)
centers_original$cluster <- 1:nrow(centers_original)
centers_original
We use PCA only for visualization.
pca <- prcomp(X_scaled, center = FALSE, scale. = FALSE)
pca_df <- as.data.frame(pca$x[, 1:2])
pca_df$cluster_kmeans <- factor(km3$cluster)
ggplot(pca_df, aes(x = PC1, y = PC2, color = cluster_kmeans)) +
geom_point(alpha = 0.8) +
labs(title = "K-means clusters visualized in PCA space", color = "Cluster")
Hierarchical clustering provides a dendrogram and can be a robust alternative to k-means.
hc <- hclust(dist_mat, method = "ward.D2")
plot(hc, labels = FALSE, main = "Hierarchical clustering dendrogram (Ward.D2)")
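# Optional: outline the chosen k = 3 cut directly on the dendrogram
rect.hclust(hc, k = k_chosen, border = "grey40")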
hc_cut <- cutree(hc, k = k_chosen)
table(hc_cut)
## hc_cut
## 1 2 3
## 162 123 57
pca_df$cluster_hc <- factor(hc_cut)
ggplot(pca_df, aes(x = PC1, y = PC2, color = cluster_hc)) +
geom_point(alpha = 0.8) +
labs(title = "Hierarchical clusters (Ward) in PCA space", color = "Cluster")
Model-based clustering assumes the data are generated from a mixture of Gaussian distributions and selects the number of components by BIC.
mc <- mclust::Mclust(X_scaled)
summary(mc)
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust VEE (ellipsoidal, equal shape and orientation) model with 4 components:
##
## log-likelihood n df BIC ICL
## -1154.901 342 32 -2496.516 -2535.667
##
## Clustering table:
## 1 2 3 4
## 154 66 57 65
mc_cluster <- mc$classification
table(mc_cluster)
## mc_cluster
## 1 2 3 4
## 154 66 57 65
pca_df$cluster_mclust <- factor(mc_cluster)
ggplot(pca_df, aes(x = PC1, y = PC2, color = cluster_mclust)) +
geom_point(alpha = 0.8) +
labs(title = "Model-based clusters (Mclust) in PCA space", color = "Cluster")
DBSCAN can discover non-spherical clusters and mark outliers as noise
(cluster 0).
We choose eps using a simple data-driven heuristic based on
kNN distances.
k_nn <- 5
knn_dist <- dbscan::kNNdist(X_scaled, k = k_nn)
# Heuristic: pick eps around a high quantile of kNN distances.
eps_candidate <- as.numeric(stats::quantile(knn_dist, 0.90))
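# Visual check of the heuristic: the knee of the sorted kNN-distance curve
# is the usual guide for eps; the dashed line marks our quantile candidate.
dbscan::kNNdistplot(X_scaled, k = k_nn)
abline(h = eps_candidate, lty = 2)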
db <- dbscan::dbscan(X_scaled, eps = eps_candidate, minPts = k_nn + 1)
table(db$cluster)
##
## 0 1 2
## 9 213 120
pca_df$cluster_dbscan <- factor(db$cluster)
ggplot(pca_df, aes(x = PC1, y = PC2, color = cluster_dbscan)) +
geom_point(alpha = 0.8) +
labs(title = "DBSCAN clustering in PCA space (0 = noise)", color = "Cluster")
We compare clusterings using:

- average silhouette width (higher is better),
- the Calinski–Harabasz (CH) index (higher is better),
- the Davies–Bouldin (DB) index (lower is better).
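For reference, the silhouette of an observation i is s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the mean distance from i to the other points in its cluster and b(i) is the smallest mean distance from i to the points of any other cluster.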
The functions below implement CH and DB in a transparent way.
avg_silhouette <- function(cl, dist_obj) {
cl <- as.integer(cl)
if (length(unique(cl)) < 2) return(NA_real_)
sil <- cluster::silhouette(cl, dist_obj)
mean(sil[, 3])
}
calinski_harabasz <- function(X, cl) {
cl <- as.integer(cl)
if (length(unique(cl)) < 2) return(NA_real_)
n <- nrow(X)
k <- length(unique(cl))
overall_mean <- colMeans(X)
# Between-cluster and within-cluster sum of squares
B <- 0
W <- 0
for (g in unique(cl)) {
Xg <- X[cl == g, , drop = FALSE]
ng <- nrow(Xg)
cg <- colMeans(Xg)
B <- B + ng * sum((cg - overall_mean)^2)
W <- W + sum(rowSums((Xg - matrix(cg, nrow = ng, ncol = ncol(Xg), byrow = TRUE))^2))
}
(B / (k - 1)) / (W / (n - k))
}
davies_bouldin <- function(X, cl) {
cl <- as.integer(cl)
groups <- sort(unique(cl))
if (length(groups) < 2) return(NA_real_)
centroids <- lapply(groups, function(g) colMeans(X[cl == g, , drop = FALSE]))
centroids <- do.call(rbind, centroids)
# Scatter: mean distance to centroid within each cluster
S <- sapply(seq_along(groups), function(i) {
g <- groups[i]
Xg <- X[cl == g, , drop = FALSE]
c_i <- centroids[i, ]
mean(sqrt(rowSums((Xg - matrix(c_i, nrow = nrow(Xg), ncol = ncol(Xg), byrow = TRUE))^2)))
})
# Pairwise centroid distances
M <- as.matrix(dist(centroids))
R <- matrix(NA_real_, nrow = length(groups), ncol = length(groups))
for (i in seq_along(groups)) {
for (j in seq_along(groups)) {
if (i != j) R[i, j] <- (S[i] + S[j]) / M[i, j]
}
}
D <- apply(R, 1, max, na.rm = TRUE)
mean(D)
}
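As an optional sanity check, the hand-rolled CH index can be compared against an existing implementation. This sketch assumes the fpc package, which is not among the required packages above:

# Compare our CH value to fpc::cluster.stats (only if fpc is installed)
if (requireNamespace("fpc", quietly = TRUE)) {
  cs <- fpc::cluster.stats(dist_mat, as.integer(km3$cluster))
  print(c(ch_ours = calinski_harabasz(X_scaled, km3$cluster), ch_fpc = cs$ch))
}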
# For DBSCAN, exclude noise (cluster 0) when computing metrics
db_nonnoise_idx <- which(db$cluster != 0)
metrics <- dplyr::tibble(
method = c("K-means (k=3)", "Hierarchical Ward (k=3)", paste0("Mclust (G=", mc$G, ")"), "DBSCAN (non-noise)"),
n_used = c(nrow(X_scaled), nrow(X_scaled), nrow(X_scaled), length(db_nonnoise_idx)),
avg_silhouette = c(
avg_silhouette(km3$cluster, dist_mat),
avg_silhouette(hc_cut, dist_mat),
avg_silhouette(mc_cluster, dist_mat),
avg_silhouette(db$cluster[db_nonnoise_idx], dist(X_scaled[db_nonnoise_idx, , drop = FALSE]))
),
calinski_harabasz = c(
calinski_harabasz(X_scaled, km3$cluster),
calinski_harabasz(X_scaled, hc_cut),
calinski_harabasz(X_scaled, mc_cluster),
calinski_harabasz(X_scaled[db_nonnoise_idx, , drop = FALSE], db$cluster[db_nonnoise_idx])
),
davies_bouldin = c(
davies_bouldin(X_scaled, km3$cluster),
davies_bouldin(X_scaled, hc_cut),
davies_bouldin(X_scaled, mc_cluster),
davies_bouldin(X_scaled[db_nonnoise_idx, , drop = FALSE], db$cluster[db_nonnoise_idx])
)
)
metrics
The dataset includes a species label. We did not use
it in clustering, but we can use it to interpret results.
We compute the Adjusted Rand Index (ARI), where 1 is perfect agreement
and 0 is random agreement.
species <- penguins_cln$species
ari_kmeans <- mclust::adjustedRandIndex(species, km3$cluster)
ari_hc <- mclust::adjustedRandIndex(species, hc_cut)
ari_mclust <- mclust::adjustedRandIndex(species, mc_cluster)
# For DBSCAN, treat noise as its own label (0)
ari_dbscan <- mclust::adjustedRandIndex(species, db$cluster)
dplyr::tibble(
method = c("K-means", "Hierarchical Ward", "Mclust", "DBSCAN"),
ARI_vs_species = c(ari_kmeans, ari_hc, ari_mclust, ari_dbscan)
)
# Simple cross-tabs for interpretation
table(Species = species, KMeans = km3$cluster)
## KMeans
## Species 1 2 3
## Adelie 127 0 24
## Chinstrap 5 0 63
## Gentoo 0 123 0
table(Species = species, Ward = hc_cut)
## Ward
## Species 1 2 3
## Adelie 151 0 0
## Chinstrap 11 0 57
## Gentoo 0 123 0
table(Species = species, Mclust = mc_cluster)
## Mclust
## Species 1 2 3 4
## Adelie 150 0 0 1
## Chinstrap 4 0 0 64
## Gentoo 0 66 57 0
table(Species = species, DBSCAN = db$cluster)
## DBSCAN
## Species 0 1 2
## Adelie 1 150 0
## Chinstrap 5 63 0
## Gentoo 3 0 120
A limitation of DBSCAN here is its sensitivity to the eps choice; its hyperparameters (eps, minPts) can require careful tuning.

sessionInfo()
## R version 4.5.1 (2025-06-13 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26200)
##
## Matrix products: default
## LAPACK version 3.12.1
##
## locale:
## [1] LC_COLLATE=Chinese (Simplified)_China.utf8
## [2] LC_CTYPE=Chinese (Simplified)_China.utf8
## [3] LC_MONETARY=Chinese (Simplified)_China.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=Chinese (Simplified)_China.utf8
##
## time zone: Europe/Warsaw
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] GGally_2.4.0 mclust_6.1.2 dbscan_1.2.4
## [4] cluster_2.1.8.1 ggplot2_4.0.0 tidyr_1.3.1
## [7] dplyr_1.1.4 palmerpenguins_0.1.1
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.6 jsonlite_2.0.0 compiler_4.5.1 tidyselect_1.2.1
## [5] Rcpp_1.1.0 ggstats_0.12.0 jquerylib_0.1.4 scales_1.4.0
## [9] yaml_2.3.10 fastmap_1.2.0 R6_2.6.1 labeling_0.4.3
## [13] generics_0.1.4 knitr_1.50 tibble_3.3.0 bslib_0.9.0
## [17] pillar_1.11.1 RColorBrewer_1.1-3 rlang_1.1.6 cachem_1.1.0
## [21] xfun_0.54 sass_0.4.10 S7_0.2.0 cli_3.6.5
## [25] withr_3.0.2 magrittr_2.0.4 digest_0.6.37 grid_4.5.1
## [29] rstudioapi_0.17.1 lifecycle_1.0.4 vctrs_0.6.5 evaluate_1.0.5
## [33] glue_1.8.0 farver_2.1.2 codetools_0.2-20 rmarkdown_2.30
## [37] purrr_1.1.0 tools_4.5.1 pkgconfig_2.0.3 htmltools_0.5.8.1