Introduction

This is a tongue-in-cheek example of understanding and implementing a machine learning technique known as k-means clustering. We’ll create a fake dataset and use k-means to see whether we can identify basic girls based on the total number of bathroom selfies on their social media profiles and their average weekly visits to Starbucks. Given that the dataset is fabricated, of course we’ll be able to see very distinct clusters (which is usually not the case in real life), but the point of this exercise is to understand how the k-means algorithm works and how you can use it for knowledge discovery.

Clustering

Clustering is an unsupervised machine learning task that automatically divides data into clusters. Unlike supervised learning where we know what we’re trying to predict, clustering is instead used for knowledge discovery. That is, it allows us to discover new insights or latent variables within our data.

Clustering is based on the principle that the points inside a cluster should be very similar to each other, but very different from those outside. Clustering can provide insights into patterns and relationships we may not have even thought of looking for.

K-Means Clustering

The k-means algorithm is perhaps the most commonly used clustering method in machine learning. It involves assigning \(n\) examples to one of \(k\) clusters, where \(k\) is a number that has been chosen ahead of time. The k-means algorithm is actually pretty simple.

  1. Randomly pick \(k\) centroids (the k-means).
  2. Assign each data point to the centroid it’s nearest to. This is usually done by calculating the Euclidean distance over all of the features. If there are \(m\) features, the Euclidean distance between example \(x\) and example \(y\) can be calculated as \[ dist(x,y)=\sqrt{\sum_{i=1}^{m}(x_i-y_i)^2} \]
  3. Recompute each centroid based on the average position of the points assigned to it.
  4. Iterate until data points stop changing assignment to centroids.
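
To make those steps concrete, here’s a bare-bones sketch of the algorithm in R. The function name simpleKMeans is made up for illustration, it ignores edge cases like empty clusters, and it isn’t what we’ll actually use later (R’s built-in kmeans() does all of this for us).

simpleKMeans <- function(data, k, max.iter = 100) {
  data <- as.matrix(data)
  # Step 1: randomly pick k data points as the initial centroids
  centroids <- data[sample(nrow(data), k), , drop = FALSE]
  assignment <- rep(0, nrow(data))
  for (iter in 1:max.iter) {
    # Step 2: assign each point to its nearest centroid (Euclidean distance)
    new.assignment <- apply(data, 1, function(x) which.min(colSums((t(centroids) - x)^2)))
    # Step 4: stop once assignments no longer change
    if (all(new.assignment == assignment)) break
    assignment <- new.assignment
    # Step 3: recompute each centroid as the mean of the points assigned to it
    for (j in 1:k) {
      centroids[j, ] <- colMeans(data[assignment == j, , drop = FALSE])
    }
  }
  list(cluster = assignment, centers = centroids)
}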

The goal is to minimize the differences within each cluster and maximize the differences between clusters. Also, note that choosing \(k\) doesn’t have to be arbitrary. Ideally, you would have some sort of a-priori knowledge about the true groupings (in this case, we’re trying to separate basic girls from non-basic girls, so we can choose \(k=2\)). Without prior knowledge, however, there are other methods for choosing \(k\). One rule of thumb is setting \(k\) equal to the square root of \(n/2\), where \(n\) is the number of observations. Another is known as the elbow method, which I won’t get into in depth here, but you can click on the link to understand how it works.
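
Just to give the flavor of the elbow method, here’s a rough sketch in R. It assumes you already have a data frame of numeric features (called df here as a placeholder); the idea is to run kmeans() for a range of \(k\) values and look for the “bend” where the total within-cluster sum of squares stops dropping quickly.

# Elbow method sketch: df stands in for a data frame of numeric features
wss <- sapply(1:10, function(k) kmeans(df, centers = k, nstart = 10)$tot.withinss)
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
# The "elbow" is the k where adding more clusters stops helping much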

Creating Fake Data

So, I’m terribly sorry I didn’t take the time to run an actual survey of girls and ask them how many bathroom selfies they have and how many times they visit Starbucks in an average week. To make up for that, let’s create some fake data. First, let’s write a function that takes the number of observations we’d like and generates fake data accordingly.

createData <- function(n) {
  set.seed(10)
  bathroom.selfies <- vector()
  starbucks.visits <- vector()
  # First half of the observations: lots of selfies and Starbucks visits
  for (i in 1:(n/2)) {
    bathroom.selfies <- append(bathroom.selfies, round(rnorm(n = 1, mean = 215, sd = 15)))
    starbucks.visits <- append(starbucks.visits, round(rnorm(n = 1, mean = 15, sd = 3)))
  }
  # Second half: far fewer of both
  for (j in 1:(n/2)) {
    bathroom.selfies <- append(bathroom.selfies, round(rnorm(n = 1, mean = 100, sd = 25)))
    starbucks.visits <- append(starbucks.visits, round(rnorm(n = 1, mean = 8, sd = 2.5)))
  }
  df <- data.frame(bathroom.selfies, starbucks.visits)
  return(df)
}

Notice that the above function ensures that there will be two very distinct groupings, since the means used for the first half of the observations are completely different from those used for the second half. Let’s see how this fake data looks. Also, let’s pretend we didn’t create this fake data. Play along here. Let’s imagine that we legit were able to get this data for realskis. Let’s examine our data first.

library(ggplot2)
girls <- createData(1000)
head(girls, n = 10)
##    bathroom.selfies starbucks.visits
## 1               215               14
## 2               194               13
## 3               219               16
## 4               197               14
## 5               191               14
## 6               232               17
## 7               211               18
## 8               226               15
## 9               201               14
## 10              229               16
summary(girls)
##  bathroom.selfies starbucks.visits
##  Min.   : 12      Min.   : 0.00   
##  1st Qu.: 99      1st Qu.: 8.00   
##  Median :171      Median :11.00   
##  Mean   :157      Mean   :11.55   
##  3rd Qu.:215      3rd Qu.:15.00   
##  Max.   :268      Max.   :26.00

Let’s visualize.

library(ggplot2)
ggplot(girls, aes(y = bathroom.selfies, x = starbucks.visits)) +  geom_point(col='red')

Based on the scatter plot, there are obviously two distinct clusters. And, if you know anything about basic girls, you can probably guess which cluster is the “basic girl” cluster. However, we haven’t really discovered any new knowledge based on this plot alone (well, given the function we created, we technically have, but remember, we’re pretending this is real data). Let’s implement k-means and set \(k=2\).

girls.kmeans <- kmeans(girls, centers = 2)
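
Before attaching the assignments to our data, it’s worth a quick look at what kmeans() hands back. A few of the useful pieces (your exact numbers will vary, since the starting centroids are chosen at random):

girls.kmeans$centers       # final centroid coordinates for each cluster
girls.kmeans$size          # how many observations landed in each cluster
girls.kmeans$tot.withinss  # total within-cluster sum of squares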

The one thing to note about clustering is that it doesn’t assign a meaningful category to each cluster. It only assigns a numeric value indicating which cluster each observation belongs to. You have to be the one to make sense of the data based on the results of the k-means clustering algorithm. Let’s assign the clusters to our original dataset and see if we can gain any new knowledge from each cluster.

girls$cluster <- girls.kmeans$cluster
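
The per-cluster means below can be computed in any number of ways; one option (not necessarily how it was done here) is a quick aggregate():

aggregate(cbind(bathroom.selfies, starbucks.visits) ~ cluster, data = girls, FUN = mean)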

The mean of bathroom.selfies per cluster:

##   cluster bathroom.selfies
## 1       1        214.35172
## 2       2         97.88934

The mean of starbucks.visits per cluster:

##   cluster starbucks.visits
## 1       1        14.951724
## 2       2         8.035533

Looking at our data, we notice that cluster 1 has a bathroom.selfies mean of about 214 and cluster 2 has a mean of about 98. For starbucks.visits, cluster 1 has a mean of about 15 and cluster 2 has a mean of about 8. This data is meaningful (well, sorta. We already knew what the means would be based on the function we created). Girls belonging to cluster 1 take roughly twice as many bathroom selfies and make roughly twice as many weekly Starbucks visits as girls in cluster 2. I think we spotted our basic girls. Let’s rename our clusters accordingly and visualize our new data.

girls$cluster <- factor(girls$cluster, levels = c(1, 2), labels = c('Basic', 'Not Basic'))
g <- ggplot(girls, aes(y = bathroom.selfies, x = starbucks.visits, col = cluster, shape = cluster))
g + geom_point() + stat_ellipse(aes(y = bathroom.selfies, x = starbucks.visits, fill = cluster), 
                                geom = 'polygon', alpha = 0.2, level = 0.99)
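
One small caveat before wrapping up: kmeans() numbers its clusters arbitrarily, so cluster 1 won’t necessarily be the high-selfie group on every run. A slightly more defensive labeling (just a sketch) would check the centroids first:

# Label whichever cluster has the higher mean bathroom.selfies as "Basic"
basic.id <- which.max(girls.kmeans$centers[, "bathroom.selfies"])
girls$cluster <- factor(ifelse(girls.kmeans$cluster == basic.id, "Basic", "Not Basic"),
                        levels = c("Basic", "Not Basic"))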

Conclusion

Using k-means clustering and some fake data, we were able to spot the basic girls. Note that you can use as many features as you want for the k-means algorithm. Some other useful features for spotting basic girls might be: how many pairs of Uggs they own, how many hashtags they use per Instagram post, how often they watch Sex and the City, etc. The point is, k-means isn’t limited to just two features. K-means clustering is very useful for knowledge discovery. By assigning each observation to a cluster, you can learn some of the shared characteristics within those clusters and perhaps gain new insights or discover latent variables.