Greg, over at ŷhat, did a really nice job introducing basic K-Means clustering with Principal Component Analysis to look at commonalities between sets of customers.

I thought it would be interesting to have an R version of it available in the event someone wanted to see a comparison of the solutions between the two languages (and, it’s a really compact example that I might be able to use in the classes I’m teaching at some point).

You’ll need to keep their post handy since I’m only riffing off of them, not pilfering their content or idea (and they deserve your eyeballs as they contribute quite a bit to the data science community on a regular basis).

The Data

We’ll start by reading in the same data set, but first we’ll load some packages we’ll need to help us with the analyses & visualizations:

library(readxl)    # free data from excel hades
library(dplyr)     # sane data manipulation
library(tidyr)     # sane data munging
library(viridis)   # sane colors
library(ggplot2)   # needs no introduction
library(ggfortify) # super-helpful for plotting non-"standard" stats objects

NOTE: To use ggfortify you’ll need to devtools::install_github("sinhrks/ggfortify") since it’s not on CRAN.

Now, we’ll read in the file, taking care to only download it once from the ŷhat servers:

url <- "http://blog.yhathq.com/static/misc/data/WineKMC.xlsx"
fil <- basename(url)
if (!file.exists(fil)) download.file(url, fil)

Reading in and cleaning up the sheets looks really similar to the Python example:

offers <- read_excel(fil, sheet = 1)
colnames(offers) <- c("offer_id", "campaign", "varietal", "min_qty", "discount", "origin", "past_peak")
head(offers)
offer_id  campaign  varietal            min_qty  discount  origin       past_peak
       1  January   Malbec                   72        56  France       FALSE
       2  January   Pinot Noir               72        17  France       FALSE
       3  February  Espumante               144        32  Oregon       TRUE
       4  February  Champagne                72        48  France       TRUE
       5  February  Cabernet Sauvignon      144        44  New Zealand  TRUE
       6  March     Prosecco                144        86  Chile        FALSE
transactions <- read_excel(fil, sheet = 2)
colnames(transactions) <- c("customer_name", "offer_id")
transactions$n <- 1
head(transactions)
customer_name  offer_id  n
Smith                 2  1
Smith                24  1
Johnson              17  1
Johnson              24  1
Johnson              26  1
Williams             18  1

The dplyr package offers a very SQL-esque way of manipulating data and, combined with the tidyr package, makes quick work of getting the data into the binary wide-form we need:

# join the offers and transactions table
left_join(offers, transactions, by="offer_id") %>%
  # get the number of times each customer responded to a given offer
  count(customer_name, offer_id, wt=n) %>%
  # change it from long to wide
  spread(offer_id, n) %>%
  # and fill in the NAs that get generated as a result
  mutate_each(funs(ifelse(is.na(.), 0, .))) -> dat
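
If you want to spot-check the result (a quick sanity check I’m adding here, not part of the original walkthrough), dat should now have one row per customer plus a 0/1 column for each offer:

# one row per customer, a 0/1 column for each offer_id
dim(dat)
head(dat[, 1:6])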

Clustering our customers

With the data in shape we can perform the same K-Means clustering:

fit <- kmeans(dat[,-1], 5, iter.max=1000)
table(fit$cluster)
 1  2  3  4  5
44  7 24 17  8
barplot(table(fit$cluster), col="maroon")
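
The choice of 5 clusters comes straight from the original post. If you want to eyeball that choice yourself, one common trick (my addition, not something the ŷhat folks did) is an “elbow” plot of the total within-cluster sum of squares over a range of k values:

# compute total within-cluster sum of squares for k = 2..10
wss <- sapply(2:10, function(k) {
  kmeans(dat[,-1], k, iter.max=1000, nstart=25)$tot.withinss
})
plot(2:10, wss, type="b", xlab="Number of clusters (k)",
     ylab="Total within-cluster sum of squares")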

Visualizing the clusters

We run the same Principal Component Analysis to get the scatterplot:

pca <- prcomp(dat[,-1])
pca_dat <- mutate(fortify(pca), col=fit$cluster)
ggplot(pca_dat) +
  geom_point(aes(x=PC1, y=PC2, fill=factor(col)), size=3, col="#7f7f7f", shape=21) +
  scale_fill_viridis(name="Cluster", discrete=TRUE) + theme_bw(base_family="Helvetica")
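
Since the scatterplot only shows the first two principal components, it’s worth a quick peek (another aside from me) at how much of the variance they actually capture:

# proportion of variance explained by the first two PCs
summary(pca)$importance[, 1:2]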

We can use the handy autoplot feature of ggfortify to do all that for us, tho:

autoplot(fit, data=dat[,-1], frame=TRUE, frame.type='norm')
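
(If I’m reading the ggfortify docs right, frame=TRUE draws an outline around each cluster, and frame.type='norm' passes through to ggplot2’s stat_ellipse, so you get multivariate-normal ellipses rather than convex hulls.)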

Digging deeper into the clusters

The cluster-based customer introspection is equally easy:

transactions %>% 
  left_join(data_frame(customer_name=dat$customer_name, 
                       cluster=fit$cluster)) %>% 
  left_join(offers) -> customer_clusters

customer_clusters %>% 
  mutate(is_4=(cluster==4)) %>% 
  count(is_4, varietal) -> varietal_4
varietal_4
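
Since varietal_4 comes out in long form, you can optionally (my tweak, not in the original) spread it back out to put the cluster 4 counts next to everyone else’s:

# side-by-side counts: cluster 4 (TRUE) vs. everyone else (FALSE)
spread(varietal_4, is_4, n, fill=0)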

And, we too can look at how the mean of the min_qty field breaks out between cluster 4 and the rest. It seems like members of cluster 4 like to buy in bulk!

customer_clusters %>% 
  mutate(is_4=(cluster==4)) %>% 
  group_by(is_4) %>% 
  summarise_each(funs(mean), min_qty, discount) -> mean_4
mean_4
is_4   min_qty  discount
FALSE    53.66     58.69
TRUE     74.16     62.12

Fin

Remember to check out all the things the ŷhat folks had to say about the analyses and the links they provided. The code for this Rmd is on github. I’ve also published a Jupyter Notebook of the ŷhat example in that gist (minus the plots as the Python ggplot module was giving me issues).

(NOTE that any numerical discrepancies are due to the fact that the random seed used for the K-Means cluster computation was not provided in the blog post. Run this code or the Python example many times to see the different results you’ll get with the clusters.)