I’m doing K-means clustering with the penguins dataset.
K-means groups data points into clusters based on similarity.
Goal: See if we can identify the 3 penguin species just from their physical measurements, without being told which is which.
2-8-2026
The basic idea is that we want to group data points so that points in the same group are close to each other.
The math behind it:
\[\text{Minimize: } \sum_{i=1}^{k} \sum_{x \in S_i} \|x - \mu_i\|^2\]
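This says: for each cluster, add up the squared distances from every point to its cluster center, then sum over all clusters and try to make that total as small as possible.
Here \(k\) is the number of clusters, \(S_i\) is the set of points in cluster \(i\), \(\mu_i\) is the center (mean) of cluster \(i\), and \(\|x - \mu_i\|^2\) is the squared Euclidean distance from point \(x\) to that center.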
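As a small worked example of the formula, here is a sketch that computes the total by hand in base R. The four points and their cluster assignment are made up for illustration; they are not from the penguins data.

```r
# Toy data: four 2-D points in two obvious groups (made-up values)
x <- matrix(c(1, 1,
              1, 2,
              5, 5,
              6, 5), ncol = 2, byrow = TRUE)
cluster <- c(1, 1, 2, 2)  # hypothetical assignment of points to clusters

# mu_i: the mean of the points in each cluster
centers <- rbind(colMeans(x[cluster == 1, , drop = FALSE]),
                 colMeans(x[cluster == 2, , drop = FALSE]))

# Sum of squared distances from each point to its own cluster center
wcss <- sum((x - centers[cluster, ])^2)
wcss  # 1: each of the four points is 0.5 from its center, and 4 * 0.5^2 = 1
```

This total is the quantity that `kmeans()` reports as `tot.withinss`.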
How it works:
The algorithm stops when the cluster assignments no longer change, or when the changes are very small.
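The loop can be sketched in a few lines of base R. This is the textbook version (Lloyd's algorithm), written only for illustration: `simple_kmeans` is a hypothetical helper, the toy data are simulated, and R's own `kmeans()` uses the Hartigan-Wong variant by default, so this is not what runs under the hood.

```r
# Minimal sketch of the standard k-means loop (Lloyd's algorithm).
# simple_kmeans is a hypothetical helper, not part of any package.
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # Start from k randomly chosen data points as the initial centers
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  cluster <- rep(0L, nrow(x))
  for (iter in seq_len(max_iter)) {
    # Assign each point to its nearest center (squared Euclidean distance)
    d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    new_cluster <- max.col(-d, ties.method = "first")
    # Stop when no assignment changes
    if (all(new_cluster == cluster)) break
    cluster <- new_cluster
    # Move each center to the mean of the points assigned to it
    for (j in seq_len(k)) {
      if (any(cluster == j)) {
        centers[j, ] <- colMeans(x[cluster == j, , drop = FALSE])
      }
    }
  }
  list(cluster = cluster, centers = centers, iterations = iter)
}

# Simulated example: two well-separated blobs of 20 points each
set.seed(1)
toy <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2))
fit <- simple_kmeans(toy, k = 2)
table(fit$cluster)
```

Because the starting centers are random, different seeds can give different results, which is why `kmeans()` offers `nstart` to try several starts and keep the best one.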
| species | island | bill length (mm) | bill depth (mm) | flipper length (mm) | body mass (g) | sex | year |
|---|---|---|---|---|---|---|---|
| Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male | 2007 |
| Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female | 2007 |
| Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female | 2007 |
| Adelie | Torgersen | 36.7 | 19.3 | 193 | 3450 | female | 2007 |
| Adelie | Torgersen | 39.3 | 20.6 | 190 | 3650 | male | 2007 |
| Adelie | Torgersen | 38.9 | 17.8 | 181 | 3625 | female | 2007 |
Data from penguins on islands in Antarctica. 3 species: Adelie, Chinstrap, and Gentoo.
Measurements: bill length, bill depth, flipper length, body mass.
You can see natural groupings: penguins of the same species cluster together.
Gentoo penguins are larger across most measurements.
Here is the code I used to run the algorithm. K-means works on distances, so I had to scale the data first so the large body mass numbers (in grams) didn’t overpower the much smaller bill and flipper measurements (in mm).
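library(dplyr)

# penguins_clean is assumed to be the palmerpenguins data with incomplete rows removed,
# e.g. penguins_clean <- tidyr::drop_na(palmerpenguins::penguins)

# select features
df <- penguins_clean %>%
  select(bill_length_mm, bill_depth_mm,
         flipper_length_mm, body_mass_g)

# scale the data (each column gets mean 0 and sd 1)
df_scaled <- scale(df)

# k-means with k=3; nstart = 25 keeps the best of 25 random starts
set.seed(123)
k_fit <- kmeans(df_scaled, centers = 3, nstart = 25)

# add cluster labels
penguins_clustered <- penguins_clean %>%
  mutate(cluster = as.factor(k_fit$cluster))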
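To see what the scaling step is doing, here is a small sketch using only the six rows shown in the table earlier (bill depth and body mass). The data frame name `sample_rows` is made up for this example.

```r
# First six rows from the table: bill depth (mm) and body mass (g)
sample_rows <- data.frame(
  bill_depth_mm = c(18.7, 17.4, 18.0, 19.3, 20.6, 17.8),
  body_mass_g   = c(3750, 3800, 3250, 3450, 3650, 3625)
)

# Raw spread: body mass varies by hundreds of grams, bill depth by about a millimetre
sapply(sample_rows, sd)

# After scale(), every column has mean 0 and standard deviation 1,
# so no single measurement dominates the distance calculation
scaled <- scale(sample_rows)
round(colMeans(scaled), 10)
apply(scaled, 2, sd)
```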
The clusters are clearly separated even in 2D space.
Clusters vs actual species:
| species | cluster 1 | cluster 2 | cluster 3 |
|---|---|---|---|
| Adelie | 22 | 124 | 0 |
| Chinstrap | 63 | 5 | 0 |
| Gentoo | 0 | 0 | 119 |
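Results are good. Each species is mostly in its own cluster: all 119 Gentoo are in cluster 3, 124 of 146 Adelie are in cluster 2, and 63 of 68 Chinstrap are in cluster 1. The main confusion is 22 Adelie grouped with the Chinstraps.
Cluster 1 = Chinstrap, Cluster 2 = Adelie, Cluster 3 = Gentoo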
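The overall agreement can be checked directly from the counts in the table. This is a small sketch that re-enters those counts by hand (`tab` and `purity` are names chosen for this example); with the fitted objects available, `table(penguins_clustered$species, penguins_clustered$cluster)` should produce the same matrix.

```r
# Counts copied from the species-by-cluster table
tab <- matrix(c(22, 124,   0,
                63,   5,   0,
                 0,   0, 119),
              nrow = 3, byrow = TRUE,
              dimnames = list(species = c("Adelie", "Chinstrap", "Gentoo"),
                              cluster = c("1", "2", "3")))

# Dominant species in each cluster
apply(tab, 2, function(col) names(which.max(col)))

# Share of penguins that fall in the cluster dominated by their own species
purity <- sum(apply(tab, 2, max)) / sum(tab)
round(purity, 3)  # 306 / 333, about 0.919
```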
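One problem: k has to be chosen beforehand. The elbow method helps. Plot the total within-cluster sum of squares against k and look for the point where adding more clusters stops giving a large drop: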
The “elbow” is at k=3.
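A minimal sketch of how such an elbow curve can be computed in base R. `elbow_wss` is a hypothetical helper and the data here are simulated stand-ins with three groups; for the penguins, the scaled matrix `df_scaled` from the code above would be passed in instead.

```r
# elbow_wss is a hypothetical helper: total within-cluster sum of squares for k = 1..k_max
elbow_wss <- function(x, k_max = 8, nstart = 25) {
  sapply(seq_len(k_max), function(k) {
    kmeans(x, centers = k, nstart = nstart)$tot.withinss
  })
}

# Made-up stand-in data with three groups of 50 points each
set.seed(123)
toy <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
             matrix(rnorm(100, mean = 6), ncol = 2),
             cbind(rnorm(50, mean = 0), rnorm(50, mean = 6)))
wss <- elbow_wss(scale(toy))

# The elbow is where the curve bends and further increases in k help little
plot(seq_along(wss), wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

The elbow is a visual judgment rather than a formal test, so it is worth checking that the chosen k also makes sense for the data.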
Main takeaways:
Limitations:
palmerpenguins R package
R documentation for kmeans()
ggplot2 and plotly documentation