Clustering Penguins with K-means

2-8-2026

What I’m Presenting Today

I’m doing K-means clustering with the penguins dataset.

K-means groups data points into clusters based on similarity.

Goal: See if we can identify the 3 penguin species just from their physical measurements, without being told which is which.

How Does K-means Work?

The basic idea is that we want to group data points so that points in the same group are close to each other.

The math behind it:

\[\text{Minimize: } \sum_{i=1}^{k} \sum_{x \in S_i} \|x - \mu_i\|^2\]

This says: for each cluster, add up the squared distances from its points to the cluster center, and try to make that total as small as possible.

Where \(k\) is the number of clusters, \(S_i\) is the set of points in cluster \(i\), \(\mu_i\) is the center (average) of cluster \(i\), and \(\|x - \mu_i\|^2\) is the squared Euclidean distance.
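As a quick numerical check of the objective, the sketch below computes the within-cluster sum of squares by hand on made-up two-dimensional points and compares it with the tot.withinss value that kmeans() reports. The toy data and the helper name wcss are assumptions for illustration only.

```r
# Toy data (made up): two well-separated groups of 2-D points
set.seed(1)
x <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(40, mean = 5, sd = 0.5), ncol = 2)
)

# Hypothetical helper: sum of squared distances from each point
# to the mean of the cluster it is assigned to
wcss <- function(x, cluster) {
  total <- 0
  for (i in unique(cluster)) {
    pts <- x[cluster == i, , drop = FALSE]
    mu <- colMeans(pts)                       # cluster center
    total <- total + sum(sweep(pts, 2, mu)^2) # squared distances to it
  }
  total
}

fit <- kmeans(x, centers = 2, nstart = 10)
manual <- wcss(x, fit$cluster)

# The hand-computed value should match what kmeans() reports
all.equal(manual, fit$tot.withinss)
```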

The Steps of the Algorithm

How it works:

1. Pick k random points as initial centers
2. Assign each point to its nearest center
3. Recalculate each center as the average of the points in its cluster, where \(n_i\) is the number of points in cluster \(i\): \[\mu_i^{(new)} = \frac{1}{n_i} \sum_{x \in S_i} x\]
4. Repeat steps 2-3 until the centers stop moving

The algorithm stops when the assignments no longer change or the centers move by less than a small tolerance.
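The four steps above can be sketched in a few lines of R. This is a minimal illustration under stated assumptions, not what kmeans() runs by default (its default is the Hartigan-Wong algorithm). The function name simple_kmeans, its arguments, and the toy data are all hypothetical, and the sketch does not handle clusters that become empty.

```r
# Minimal sketch of the four steps (hypothetical helper, no empty-cluster handling)
simple_kmeans <- function(x, k, max_iter = 100, tol = 1e-8) {
  x <- as.matrix(x)
  # Step 1: pick k random data points as initial centers
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  for (iter in seq_len(max_iter)) {
    # Step 2: squared distance from every point to every center,
    # then assign each point to the nearest one
    d2 <- sapply(seq_len(k), function(i) rowSums(sweep(x, 2, centers[i, ])^2))
    cluster <- max.col(-d2, ties.method = "first")
    # Step 3: recompute each center as the mean of its points
    new_centers <- t(sapply(seq_len(k), function(i) {
      colMeans(x[cluster == i, , drop = FALSE])
    }))
    # Step 4: stop once the centers have (almost) stopped moving
    shift <- max(abs(new_centers - centers))
    centers <- new_centers
    if (shift < tol) break
  }
  list(cluster = cluster, centers = centers, iterations = iter)
}

# Usage on made-up data: two well-separated 2-D groups
set.seed(42)
toy <- rbind(
  matrix(rnorm(60, mean = 0), ncol = 2),
  matrix(rnorm(60, mean = 6), ncol = 2)
)
res <- simple_kmeans(toy, k = 2)
table(res$cluster)
```

In practice kmeans() is the better choice, since it also supports multiple random starts through nstart.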

About the Penguins Dataset

species	island	bill length mm	bill depth mm	flipper length mm	body mass g	sex	year
Adelie	Torgersen	39.1	18.7	181	3750	male	2007
Adelie	Torgersen	39.5	17.4	186	3800	female	2007
Adelie	Torgersen	40.3	18.0	195	3250	female	2007
Adelie	Torgersen	36.7	19.3	193	3450	female	2007
Adelie	Torgersen	39.3	20.6	190	3650	male	2007
Adelie	Torgersen	38.9	17.8	181	3625	female	2007

Data from penguins on three islands in the Palmer Archipelago, Antarctica. 3 species: Adelie, Chinstrap, and Gentoo.

Measurements: bill length, bill depth, flipper length, body mass.

Looking at the Data First

You can see natural groupings, species cluster together.

Comparing All the Measurements

Gentoo penguins are larger across most measurements.

Running K-means on the Penguin Data

Here is the code I used to run the algorithm. I had to scale the data first so the large body mass numbers (in grams) didn’t overpower the much smaller bill and flipper measurements (in mm).

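# load packages
library(dplyr)
library(tidyr)
library(palmerpenguins)

# assumption: penguins_clean is the penguins data with incomplete rows dropped
penguins_clean <- penguins %>% drop_na()

# select the four numeric features
df <- penguins_clean %>%
  select(bill_length_mm, bill_depth_mm,
         flipper_length_mm, body_mass_g)

# scale each column to mean 0 and standard deviation 1
df_scaled <- scale(df)

# k-means with k=3; nstart = 25 keeps the best of 25 random starts
set.seed(123)
k_fit <- kmeans(df_scaled, centers = 3, nstart = 25)

# add cluster labels back to the cleaned data
penguins_clustered <- penguins_clean %>%
  mutate(cluster = as.factor(k_fit$cluster))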
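To show what the scaling step does, here is a small sketch on made-up values with penguin-like magnitudes (these are not rows from the dataset, and the object names are illustrative). Before scaling, body mass has by far the largest spread, so it would dominate any distance calculation. After scale(), every column has mean 0 and standard deviation 1.

```r
# Made-up values with penguin-like magnitudes (not real measurements)
toy <- data.frame(
  bill_length_mm = c(39.1, 46.5, 50.0, 37.8),
  body_mass_g    = c(3750, 5200, 3700, 3400)
)

# Before scaling: the spread of body mass dwarfs the spread of bill length
sapply(toy, sd)

# After scaling: both columns are on the same footing
toy_scaled <- scale(toy)
round(colMeans(toy_scaled), 10)  # column means are 0
apply(toy_scaled, 2, sd)         # column standard deviations are 1
```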

Cluster Visualization

The clusters are clearly separated even in 2D space.

How Well Did It Work?

Clusters vs actual species:

##            
##               1   2   3
##   Adelie     22 124   0
##   Chinstrap  63   5   0
##   Gentoo      0   0 119

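Results are good: 306 of 333 penguins (about 92%) fall in the cluster dominated by their own species. Gentoo is separated perfectly, while 22 Adelie and 5 Chinstrap end up in each other's cluster.

Cluster 1 = Chinstrap, Cluster 2 = Adelie, Cluster 3 = Gentoo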
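The cluster-to-species mapping can be checked directly from the table. The sketch below re-enters the reported counts by hand, finds the dominant species in each cluster, and computes the share of penguins that sit in the cluster dominated by their own species. The object names are illustrative.

```r
# Counts copied from the species-by-cluster table above
tab <- matrix(
  c(22, 124,   0,
    63,   5,   0,
     0,   0, 119),
  nrow = 3, byrow = TRUE,
  dimnames = list(species = c("Adelie", "Chinstrap", "Gentoo"),
                  cluster = c("1", "2", "3"))
)

# Dominant species in each cluster (largest count per column)
majority <- rownames(tab)[apply(tab, 2, which.max)]
majority
# "Chinstrap" "Adelie" "Gentoo"

# Share of penguins in the cluster dominated by their own species
agreement <- sum(apply(tab, 2, max)) / sum(tab)
round(agreement, 3)
# 0.919
```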

Finding the Right Number of Clusters

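Problem: k has to be chosen before running the algorithm. The elbow method helps: plot the total within-cluster sum of squares for a range of k values and look for the point where the curve bends.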

The “elbow” is at k=3.
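An elbow plot can be produced along these lines. This is a sketch on made-up data with three well-separated groups, standing in for the scaled penguin features. The object names and the range of k are assumptions, not the code behind the plot in this talk.

```r
# Made-up data: three well-separated 2-D groups
set.seed(123)
toy <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 6), ncol = 2),
  cbind(rnorm(50, mean = 0), rnorm(50, mean = 6))
)

# Total within-cluster sum of squares for k = 1..8
ks <- 1:8
wss <- sapply(ks, function(k) {
  kmeans(toy, centers = k, nstart = 25)$tot.withinss
})

# The elbow is where the curve stops dropping sharply
plot(ks, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

With the real data, toy would be replaced by the scaled feature matrix.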

Summary

Main takeaways:

K-means groups data into k clusters based on similarity
Need to standardize data first when features are on different scales
Worked well on penguin data - found the 3 species
Elbow method helps pick k

Limitations:

Need to pick k beforehand
Results depend on the random starting centers (use set.seed for reproducibility and nstart to try several starts)
Assumes spherical clusters
Sensitive to outliers
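The reproducibility point can be illustrated with a short sketch on made-up, overlapping data (the object names are illustrative). Resetting the seed before each call makes kmeans() pick the same starting centers, so the two fits are identical.

```r
# Made-up 2-D data with two overlapping groups
set.seed(1)
toy <- rbind(
  matrix(rnorm(100), ncol = 2),
  matrix(rnorm(100, mean = 2), ncol = 2)
)

# Same seed, same call: identical cluster assignments
set.seed(123)
fit_a <- kmeans(toy, centers = 3)
set.seed(123)
fit_b <- kmeans(toy, centers = 3)
identical(fit_a$cluster, fit_b$cluster)
```

Without the set.seed calls, the two fits may start from different centers and can label or even split the points differently.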

References

palmerpenguins R package

R documentation for kmeans()

ggplot2 and plotly documentation