Greg, over at \(\hat{y}hat\), did a really nice job introducing basic K-Means clustering with Principal Component Analysis to look at commonalities between sets of customers.
I thought it would be interesting to have an R-verison of it available in the event someone wanted to see a comparison of the solution between the two languages (and, it’s a really compact example that I might be able to use in the classes I’m teaching at some point).
You’ll need to keep their post handy since I’m only riffing off of them, not pilfering their content or idea (and they deserve your eyeballs as they contribute quite a bit to the data science community on a regular basis).
We’ll start by reading in the same data set, but first we’ll load some pacakges we’ll need to help us with the analyses & visualizations:
library(readxl) # free data from excel hades
library(dplyr) # sane data manipulation
library(tidyr) # sane data munging
library(viridis) # sane colors
library(ggplot2) # needs no introduction
library(ggfortify) # super-helpful for plotting non-"standard" stats objects
NOTE: To use
ggfortifyyou’ll need todevtools::install_github("sinhrks/ggfortify)since it’s not in CRAN.
Now, we’ll read in the file, taking care to only download it once from the \(\hat{y}hat\) servers:
url <- "http://blog.yhathq.com/static/misc/data/WineKMC.xlsx"
fil <- basename(url)
if (!file.exists(fil)) download.file(url, fil)
Reading in and cleaning up the sheets looks really similar to the Python example:
offers <- read_excel(fil, sheet = 1)
colnames(offers) <- c("offer_id", "campaign", "varietal", "min_qty", "discount", "origin", "past_peak")
head(offers)
| offer_id | campaign | varietal | min_qty | discount |
|---|---|---|---|---|
| 1 | January | Malbec | 72 | 56 |
| 2 | January | Pinot Noir | 72 | 17 |
| 3 | February | Espumante | 144 | 32 |
| 4 | February | Champagne | 72 | 48 |
| 5 | February | Cabernet Sauvignon | 144 | 44 |
| 6 | March | Prosecco | 144 | 86 |
| origin | past_peak |
|---|---|
| France | FALSE |
| France | FALSE |
| Oregon | TRUE |
| France | TRUE |
| New Zealand | TRUE |
| Chile | FALSE |
transactions <- read_excel(fil, sheet = 2)
colnames(transactions) <- c("customer_name", "offer_id")
transactions$n <- 1
head(transactions)
| customer_name | offer_id | n |
|---|---|---|
| Smith | 2 | 1 |
| Smith | 24 | 1 |
| Johnson | 17 | 1 |
| Johnson | 24 | 1 |
| Johnson | 26 | 1 |
| Williams | 18 | 1 |
The dplyr package offers a very SQL-esque way of manipulating data and—combined with the tidyr package—makes quick work of getting the data into the the binary wide-form we need:
# join the offers and transactions table
left_join(offers, transactions, by="offer_id") %>%
# get the number of times each customer responded to a given offer
count(customer_name, offer_id, wt=n) %>%
# change it from long to wide
spread(offer_id, n) %>%
# and fill in the NAs that get generated as a result
mutate_each(funs(ifelse(is.na(.), 0, .))) -> dat
With the data in shape we can perform the same K-Means clustering:
fit <- kmeans(dat[,-1], 5, iter.max=1000)
table(fit$cluster)
| 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|
| 44 | 7 | 24 | 17 | 8 |
barplot(table(fit$cluster), col="maroon")
The same Principal Component Analysis to get the scatterplot:
pca <- prcomp(dat[,-1])
pca_dat <- mutate(fortify(pca), col=fit$cluster)
ggplot(pca_dat) +
geom_point(aes(x=PC1, y=PC2, fill=factor(col)), size=3, col="#7f7f7f", shape=21) +
scale_fill_viridis(name="Cluster", discrete=TRUE) + theme_bw(base_family="Helvetica")
We can use the handy autoplot feature of ggfortify to do all that for us, tho:
autoplot(fit, data=dat[,-1], frame=TRUE, frame.type='norm')
The cluster-based customer introspection is also equally as easy:
transactions %>%
left_join(data_frame(customer_name=dat$customer_name,
cluster=fit$cluster)) %>%
left_join(offers) -> customer_clusters
customer_clusters %>%
mutate(is_4=(cluster==4)) %>%
count(is_4, varietal) -> varietal_4
varietal_4
And, we too can look at how the mean of the min_qty field breaks out between 4 vs. non-4. It seems like members of cluster 4 like to by in bulk!
customer_clusters %>%
mutate(is_4=(cluster==4)) %>%
group_by(is_4) %>%
summarise_each(funs(mean), min_qty, discount) -> mean_4
mean_4
| is_4 | min_qty | discount |
|---|---|---|
| FALSE | 53.66 | 58.69 |
| TRUE | 74.16 | 62.12 |
Remember to check out all the things the \(\hat{y}hat\) folks had to say about the analyses and the links they provided. The code for this Rmd is on github. I’ve also published an Jupyter Notebook of the \(\hat{y}hat\) example in that gist (minus the plots as the Python ggplot module was giving me issues).
(NOTE that any numerical discrepancies are due to the fact that the random seed used for the K-Means cluster computation was not provided in the blog post. Run this code or the Python example many times to see the different results you’ll get with the clusters.)