Greg, over at ŷhat, did a really nice job introducing basic K-Means clustering with Principal Component Analysis to look at commonalities between sets of customers.

I thought it would be interesting to have an R version of it available in the event someone wanted to see a comparison of the solutions between the two languages (and, it’s a really compact example that I might be able to use in the classes I’m teaching at some point).

You’ll need to keep their post handy since I’m only riffing off of them, not pilfering their content or idea (and they deserve your eyeballs as they contribute quite a bit to the data science community on a regular basis).

The Data

We’ll start by reading in the same data set, but first we’ll load some packages we’ll need to help us with the analyses & visualizations:

library(readxl)    # free data from excel hades
library(dplyr)     # sane data manipulation
library(tidyr)     # sane data munging
library(viridis)   # sane colors
library(ggplot2)   # needs no introduction
library(ggfortify) # super-helpful for plotting non-"standard" stats objects

NOTE: To use ggfortify you’ll need to devtools::install_github("sinhrks/ggfortify") since it’s not on CRAN.

Now, we’ll read in the file, taking care to only download it once from the ŷhat servers:

url <- "http://blog.yhathq.com/static/misc/data/WineKMC.xlsx"
fil <- basename(url)
if (!file.exists(fil)) download.file(url, fil)

Reading in and cleaning up the sheets looks really similar to the Python example:

offers <- read_excel(fil, sheet = 1)
colnames(offers) <- c("offer_id", "campaign", "varietal", "min_qty", "discount", "origin", "past_peak")
head(offers)
offer_id  campaign  varietal            min_qty  discount  origin       past_peak
       1  January   Malbec                   72        56  France       FALSE
       2  January   Pinot Noir               72        17  France       FALSE
       3  February  Espumante               144        32  Oregon       TRUE
       4  February  Champagne                72        48  France       TRUE
       5  February  Cabernet Sauvignon      144        44  New Zealand  TRUE
       6  March     Prosecco                144        86  Chile        FALSE
transactions <- read_excel(fil, sheet = 2)
colnames(transactions) <- c("customer_name", "offer_id")
transactions$n <- 1
head(transactions)
customer_name  offer_id  n
Smith                 2  1
Smith                24  1
Johnson              17  1
Johnson              24  1
Johnson              26  1
Williams             18  1

The dplyr package offers a very SQL-esque way of manipulating data and, combined with the tidyr package, makes quick work of getting the data into the binary wide-form we need:

# join the offers and transactions table
left_join(offers, transactions, by="offer_id") %>%
  # get the number of times each customer responded to a given offer
  count(customer_name, offer_id, wt=n) %>%
  # change it from long to wide
  spread(offer_id, n) %>%
  # and fill in the NAs that get generated as a result
  mutate_each(funs(ifelse(is.na(.), 0, .))) -> dat
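
If you want to spot-check the result (a quick sanity check I’m adding here, not part of the original walkthrough), dat should now have one row per customer plus a 0/1 column for each offer:

# one row per customer, a 0/1 column for each offer_id
dim(dat)
head(dat[, 1:6])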

Clustering our customers

With the data in shape we can perform the same K-Means clustering:

fit <- kmeans(dat[,-1], 5, iter.max=1000)
table(fit$cluster)
 1  2  3  4  5
44  7 24 17  8
barplot(table(fit$cluster), col="maroon")
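
The choice of 5 clusters comes straight from the original post. If you want to eyeball that choice yourself, one common trick (my addition, not something the ŷhat folks did) is an “elbow” plot of the total within-cluster sum of squares over a range of k values:

# compute total within-cluster sum of squares for k = 2..10
wss <- sapply(2:10, function(k) {
  kmeans(dat[,-1], k, iter.max=1000, nstart=25)$tot.withinss
})
plot(2:10, wss, type="b", xlab="Number of clusters (k)",
     ylab="Total within-cluster sum of squares")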

Visualizing the clusters

We run the same Principal Component Analysis to get the scatterplot:

pca <- prcomp(dat[,-1])
pca_dat <- mutate(fortify(pca), col=fit$cluster)
ggplot(pca_dat) +
  geom_point(aes(x=PC1, y=PC2, fill=factor(col)), size=3, col="#7f7f7f", shape=21) +
  scale_fill_viridis(name="Cluster", discrete=TRUE) + theme_bw(base_family="Helvetica")
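
Since the scatterplot only shows the first two principal components, it’s worth a quick peek (another aside from me) at how much of the variance they actually capture:

# proportion of variance explained by the first two PCs
summary(pca)$importance[, 1:2]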

We can use the handy autoplot feature of ggfortify to do all that for us, tho:

autoplot(fit, data=dat[,-1], frame=TRUE, frame.type='norm')
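
(If I’m reading the ggfortify docs right, frame=TRUE draws an outline around each cluster, and frame.type='norm' passes through to ggplot2’s stat_ellipse, so you get multivariate-normal ellipses rather than convex hulls.)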

Digging deeper into the clusters

The cluster-based customer introspection is equally easy:

transactions %>% 
  left_join(data_frame(customer_name=dat$customer_name, 
                       cluster=fit$cluster)) %>% 
  left_join(offers) -> customer_clusters

customer_clusters %>% 
  mutate(is_4=(cluster==4)) %>% 
  count(is_4, varietal) -> varietal_4
varietal_4
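
Since varietal_4 comes out in long form, you can optionally (my tweak, not in the original) spread it back out to put the cluster 4 counts next to everyone else’s:

# side-by-side counts: cluster 4 (TRUE) vs. everyone else (FALSE)
spread(varietal_4, is_4, n, fill=0)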

And, we too can look at how the mean of the min_qty field breaks out between cluster 4 and the rest. It seems like members of cluster 4 like to buy in bulk!

customer_clusters %>% 
  mutate(is_4=(cluster==4)) %>% 
  group_by(is_4) %>% 
  summarise_each(funs(mean), min_qty, discount) -> mean_4
mean_4
is_4   min_qty  discount
FALSE    53.66     58.69
TRUE     74.16     62.12

Fin

Remember to check out all the things the ŷhat folks had to say about the analyses and the links they provided. The code for this Rmd is on github. I’ve also published a Jupyter Notebook of the ŷhat example in that gist (minus the plots as the Python ggplot module was giving me issues).

(NOTE that any numerical discrepancies are due to the fact that the random seed used for the K-Means cluster computation was not provided in the blog post. Run this code or the Python example many times to see the different results you’ll get with the clusters.)