2-8-2026

What I’m Presenting Today

I’m doing K-means clustering with the penguins dataset.

K-means groups data points into clusters based on similarity.

Goal: See if we can identify the 3 penguin species just from their physical measurements, without being told which is which.

How Does K-means Work?

The basic idea is that we want to group data points so that points in the same group are close to each other.

The math behind it:

\[\text{Minimize: } \sum_{i=1}^{k} \sum_{x \in S_i} \|x - \mu_i\|^2\]

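This says: for each cluster, add up the squared distances from its points to the cluster center, then choose the clusters so that the total across all k clusters is as small as possible.

Here \(S_i\) is the set of points in cluster \(i\), \(\mu_i\) is the center (average) of cluster \(i\), and \(\|x - \mu_i\|^2\) is the squared Euclidean distance from point \(x\) to that center.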
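To make the formula concrete, here is a minimal sketch that computes the objective by hand. The six 2-D points, the `assignment` vector, and the helper name `wcss` are made up for illustration; only `kmeans()` and its `tot.withinss` field come from base R.

```r
# Toy data: six 2-D points in two obvious groups (illustrative, not penguin data)
x <- rbind(c(1, 1), c(1, 2), c(2, 1),
           c(8, 8), c(8, 9), c(9, 8))
assignment <- c(1, 1, 1, 2, 2, 2)  # which cluster each row belongs to

# Hypothetical helper: total within-cluster sum of squared distances
wcss <- function(x, assignment) {
  total <- 0
  for (i in unique(assignment)) {
    pts <- x[assignment == i, , drop = FALSE]   # points in cluster i (S_i)
    mu  <- colMeans(pts)                        # cluster center (mu_i)
    total <- total + sum(sweep(pts, 2, mu)^2)   # add squared distances to mu_i
  }
  total
}

objective <- wcss(x, assignment)

# kmeans() reports the same quantity as tot.withinss;
# here it is started from one point in each group
fit_check <- kmeans(x, centers = x[c(1, 4), ])
c(by_hand = objective, from_kmeans = fit_check$tot.withinss)
```

Both numbers should agree, which shows that `tot.withinss` is the quantity K-means is trying to minimize.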

The Steps of the Algorithm

How it works:

  1. Pick k random points as initial centers
  2. Assign each point to nearest center
  3. Recalculate each center as the average of the points in its cluster, where \(n_i\) is the number of points in \(S_i\): \[\mu_i^{(new)} = \frac{1}{n_i} \sum_{x \in S_i} x\]
  4. Repeat steps 2-3 until centers stop moving

The algorithm stops when the assignments no longer change, or when the centers move by less than a small tolerance.
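The steps above can be sketched in a few lines of base R. This is a teaching sketch, not what `kmeans()` runs internally (its default is the Hartigan-Wong algorithm). The function name `simple_kmeans`, its arguments `max_iter` and `tol`, and the simulated `toy` data are all assumptions made for this example.

```r
# Hypothetical helper: a bare-bones version of the four steps
simple_kmeans <- function(x, k, max_iter = 100, tol = 1e-8) {
  x <- as.matrix(x)
  # Step 1: pick k random data points as the initial centers
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  for (iter in seq_len(max_iter)) {
    # Step 2: squared distance from every point to every center,
    # then assign each point to its nearest center
    d2 <- sapply(seq_len(k), function(i) rowSums(sweep(x, 2, centers[i, ])^2))
    cluster <- max.col(-d2, ties.method = "first")
    # Step 3: recalculate each center as the mean of its points
    new_centers <- centers
    for (i in seq_len(k)) {
      members <- x[cluster == i, , drop = FALSE]
      if (nrow(members) > 0) new_centers[i, ] <- colMeans(members)
    }
    # Step 4: stop once the centers have (almost) stopped moving
    shift <- max(abs(new_centers - centers))
    centers <- new_centers
    if (shift < tol) break
  }
  list(cluster = cluster, centers = centers, iterations = iter)
}

# Usage on simulated data: two well-separated blobs of 20 points each
set.seed(1)
toy <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2))
fit <- simple_kmeans(toy, k = 2)
table(fit$cluster)
```

On data this cleanly separated, each blob should end up in its own cluster after a handful of iterations.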

About the Penguins Dataset

species  island     bill length (mm)  bill depth (mm)  flipper length (mm)  body mass (g)  sex     year
Adelie   Torgersen  39.1              18.7             181                  3750           male    2007
Adelie   Torgersen  39.5              17.4             186                  3800           female  2007
Adelie   Torgersen  40.3              18.0             195                  3250           female  2007
Adelie   Torgersen  36.7              19.3             193                  3450           female  2007
Adelie   Torgersen  39.3              20.6             190                  3650           male    2007
Adelie   Torgersen  38.9              17.8             181                  3625           female  2007

Data from penguins on three islands in the Palmer Archipelago, Antarctica. 3 species: Adelie, Chinstrap, and Gentoo.

Measurements: bill length, bill depth, flipper length, body mass.

Looking at the Data First

You can see natural groupings: penguins of the same species cluster together.

Comparing All the Measurements

Gentoo penguins are larger across most measurements.

Running K-means on the Penguin Data

Here is the code I used to run the algorithm. I had to scale the data first so that the large body mass values (in grams) didn’t overpower the much smaller bill measurements (in mm).

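library(dplyr)  # provides %>%, select(), mutate()

# penguins_clean is assumed to be the palmerpenguins data with
# incomplete rows removed earlier (e.g. via na.omit(penguins))

# select the four numeric features
df <- penguins_clean %>%
  select(bill_length_mm, bill_depth_mm,
         flipper_length_mm, body_mass_g)

# scale each column to mean 0 and standard deviation 1
df_scaled <- scale(df)

# k-means with k = 3; nstart = 25 tries 25 random starts and keeps the best
set.seed(123)
k_fit <- kmeans(df_scaled, centers = 3, nstart = 25)

# add cluster labels back onto the original rows
penguins_clustered <- penguins_clean %>%
  mutate(cluster = as.factor(k_fit$cluster))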
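To show why the scaling step matters, here is a small sketch using only the six rows from the dataset preview (two columns, for brevity). The data frame name `measurements` and the variable names `scaled` and `manual` are mine; `scale()` itself is base R.

```r
# Two columns from the six preview rows shown earlier
measurements <- data.frame(
  bill_length_mm = c(39.1, 39.5, 40.3, 36.7, 39.3, 38.9),
  body_mass_g    = c(3750, 3800, 3250, 3450, 3650, 3625)
)

# Before scaling, body mass varies on a far larger numeric range than bill length
apply(measurements, 2, sd)

# scale() subtracts each column's mean and divides by its standard deviation
scaled <- scale(measurements)

# The same thing done by hand, to show there is no magic involved
manual <- sapply(measurements, function(col) (col - mean(col)) / sd(col))

# After scaling, every column has mean 0 and standard deviation 1
round(colMeans(scaled), 10)
apply(scaled, 2, sd)
```

Once every column has the same spread, each measurement contributes comparably to the distances K-means uses.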

Cluster Visualization

The clusters are clearly separated even in 2D space.

How Well Did It Work?

Clusters vs actual species:

##            
##               1   2   3
##   Adelie     22 124   0
##   Chinstrap  63   5   0
##   Gentoo      0   0 119

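Results are good: 306 of the 333 penguins (about 92%) fall in the cluster dominated by their own species. Gentoo is separated perfectly; the errors are 22 Adelie grouped with the Chinstraps and 5 Chinstrap grouped with the Adelies.

Cluster 1 = Chinstrap, Cluster 2 = Adelie, Cluster 3 = Gentoo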
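One way to put a number on the table above is to re-enter the counts and work out which species dominates each cluster. This is a sketch: the names `confusion`, `majority_species`, and `purity` are my own, and "purity" here just means the share of penguins that sit in the cluster dominated by their species.

```r
# Counts copied from the species-by-cluster table above
confusion <- matrix(
  c(22, 124,   0,
    63,   5,   0,
     0,   0, 119),
  nrow = 3, byrow = TRUE,
  dimnames = list(species = c("Adelie", "Chinstrap", "Gentoo"),
                  cluster = c("1", "2", "3"))
)

# For each cluster (column), find the species with the largest count
majority_species <- rownames(confusion)[apply(confusion, 2, which.max)]
majority_species

# Share of all penguins that fall in the cluster dominated by their species
purity <- sum(apply(confusion, 2, max)) / sum(confusion)
round(purity, 3)
```

Cluster numbers are arbitrary labels, so the match to species always has to be read off the counts like this.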

Finding the Right Number of Clusters

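Problem: we need to pick k beforehand. The elbow method helps: plot the total within-cluster sum of squares against k and look for the point where adding more clusters stops giving a big improvement.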

The “elbow” is at k=3.
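The elbow plot can be produced roughly like this. It is a sketch on simulated data with three blobs, not the penguin data: the names `toy`, `ks`, and `wss` are assumptions, and the real analysis would pass the scaled penguin measurements instead.

```r
# Simulated data: three well-separated blobs of 50 points each
set.seed(123)
toy <- rbind(
  matrix(rnorm(100, mean = 0,  sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 5,  sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 10, sd = 0.5), ncol = 2)
)

# Fit k-means for k = 1..8 and record the total within-cluster sum of squares
ks  <- 1:8
wss <- sapply(ks, function(k) kmeans(toy, centers = k, nstart = 25)$tot.withinss)

# The "elbow" is where the curve stops dropping steeply
plot(ks, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

With three true groups, the curve should fall sharply up to k = 3 and flatten afterwards.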

Summary

Main takeaways:

  • K-means groups data into k clusters based on similarity
  • Need to standardize data first
  • Worked well on the penguin data: the clusters matched the 3 species for about 92% of penguins
  • Elbow method helps pick k

Limitations:

  • Need to pick k beforehand
  • Results depend on the random starting centers (use set.seed for reproducibility and nstart to try several starts)
  • Assumes spherical clusters
  • Sensitive to outliers
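A quick sketch of the reproducibility point, on simulated data (the names `toy`, `fit_a`, `fit_b`, and `fit_multi` are made up for this example): the same seed gives the same clustering, and `nstart` asks `kmeans()` to try several random starts and keep the best one.

```r
# Simulated data with no strong cluster structure, so the start matters more
set.seed(42)
toy <- matrix(rnorm(200), ncol = 2)

# Same seed before each call: identical random starts, identical result
set.seed(1)
fit_a <- kmeans(toy, centers = 3)
set.seed(1)
fit_b <- kmeans(toy, centers = 3)
identical(fit_a$cluster, fit_b$cluster)

# nstart = 25 runs 25 random starts and keeps the lowest tot.withinss
set.seed(2)
fit_multi <- kmeans(toy, centers = 3, nstart = 25)
c(single_start = fit_a$tot.withinss, best_of_25 = fit_multi$tot.withinss)
```

Setting the seed makes a run repeatable; `nstart` makes it less likely that the repeatable result is a poor local optimum.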

References

palmerpenguins R package

R documentation for kmeans()

ggplot2 and plotly documentation