library(data.table)
library(dplyr)
library(ggplot2)

Read the first 1M rows of hmm_source.csv, keeping only columns 1, 2, 4 and 6

full<-fread("hmm_source.csv", select=c(1,2,4,6), data.table = FALSE, nrows=1000000)
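
Selecting columns by position breaks silently if the file layout changes. A minimal sketch of the same read using column names, assuming the four columns kept are the ones used later in this analysis (customer_id, reference_date, total_purchases, ispromo). The toy file and the `extra_*` columns are made up for illustration.

```r
library(data.table)

# Toy stand-in for hmm_source.csv; the "extra_*" columns are hypothetical.
tmp <- tempfile(fileext = ".csv")
toy <- data.frame(
  customer_id     = c(1, 1, 2),
  reference_date  = c("2015-01-01", "2015-02-01", "2015-01-01"),
  extra_a         = c("x", "y", "z"),
  total_purchases = c(0, 2, 0),
  extra_b         = c(9, 9, 9),
  ispromo         = c(1, 0, 1)
)
write.csv(toy, tmp, row.names = FALSE)

# Same call shape as above, but selecting by name instead of by position.
wanted <- c("customer_id", "reference_date", "total_purchases", "ispromo")
full_toy <- fread(tmp, select = wanted, data.table = FALSE, nrows = 1000000)

names(full_toy)
```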

Find the active users, i.e. users with at least one purchase within 3 years

actives<-unique(subset(full, total_purchases>0, select=customer_id))

About 17% of the customers have bought at least once in 3 years!
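
The share of active customers is the number of distinct active customer ids divided by the number of distinct customer ids in the sample. A small sketch of that calculation on made-up data with the same column names (`full_toy` and `actives_toy` are illustrative names, not objects from the analysis):

```r
# Toy data with the same column names as `full`; values are made up.
full_toy <- data.frame(
  customer_id     = c(1, 1, 2, 2, 3, 4, 5, 6),
  total_purchases = c(0, 2, 0, 0, 1, 0, NA, 0)
)

# Same filter as above: customers with at least one row where total_purchases > 0.
# subset() drops rows where the condition is NA.
actives_toy <- unique(subset(full_toy, total_purchases > 0, select = customer_id))

# Share of active customers among all distinct customers.
share_active <- nrow(actives_toy) / length(unique(full_toy$customer_id))
round(100 * share_active, 1)
```

With the figures reported in the summary (4802 active users out of roughly 28k), the same ratio comes to about 17%.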

Create the reduced data set by keeping only the rows of the active users

reduced<-merge(full,actives)

reduced[is.na(reduced)]<-0
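
To make the two steps above concrete, here is a sketch on made-up data. It assumes, as in the code above, that customer_id is the only column shared by the two data frames, so merge() acts as an inner join on it. All rows of an active customer are kept, including months without a purchase.

```r
# Toy data; values are made up.
full_toy <- data.frame(
  customer_id     = c(1, 1, 2, 2, 3),
  total_purchases = c(0, 2, 0, 0, NA),
  ispromo         = c(1, NA, 1, 1, 0)
)
actives_toy <- unique(subset(full_toy, total_purchases > 0, select = customer_id))

# merge() joins on customer_id and keeps only the rows that match an active customer.
reduced_toy <- merge(full_toy, actives_toy, by = "customer_id")

# Replace the remaining missing values with 0, as in the step above.
reduced_toy[is.na(reduced_toy)] <- 0
reduced_toy
```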

###In case we need to save it
### write.csv(reduced, "reduced.csv")

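Take the total purchases per customer and the number of months in which they received a catalog (the sum of ispromo, which equals the number of distinct months only if there is one row per customer per month)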

by_customerid<-reduced%>%
  select(-reference_date)%>%
  group_by(customer_id)%>%
  summarise(Total_Purchases=sum(total_purchases), Catalogs=sum(ispromo))
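
The aggregation can be checked on a small example. The sketch below uses made-up data and adds a hypothetical `Catalog_Months` column, which counts distinct months explicitly. It differs from `Catalogs` only when a customer has more than one row for the same month.

```r
library(dplyr)

# Toy data; values are made up. Customer 1 has two rows for the same month.
reduced_toy <- data.frame(
  customer_id     = c(1, 1, 1, 2, 2),
  reference_date  = c("2015-01", "2015-01", "2015-02", "2015-01", "2015-02"),
  total_purchases = c(1, 0, 2, 0, 3),
  ispromo         = c(1, 1, 1, 0, 1)
)

by_customer_toy <- reduced_toy %>%
  group_by(customer_id) %>%
  summarise(
    Total_Purchases = sum(total_purchases),
    Catalogs        = sum(ispromo),                             # counts promo rows
    Catalog_Months  = n_distinct(reference_date[ispromo == 1])  # counts distinct promo months
  )
by_customer_toy
```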

Run k-means with 4 clusters on Total_Purchases and Catalogs (the cluster numbering depends on the random start)

cls<-kmeans(by_customerid[,c(2,3)],4)
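
Because kmeans() picks its starting centers at random, rerunning the line above can renumber the clusters or change them slightly. A sketch of one way to make the run repeatable, on made-up data; the seed values and `nstart = 25` are arbitrary choices for illustration, not settings used in this analysis.

```r
set.seed(42)  # arbitrary seed, used only to generate the toy data

# Toy data with the same two columns; values are made up.
toy <- data.frame(
  Total_Purchases = c(rnorm(50, 2, 0.5), rnorm(50, 30, 3)),
  Catalogs        = c(rnorm(50, 10, 1),  rnorm(50, 28, 2))
)

# Setting the same seed before each call gives the same result.
# nstart = 25 tries several random starts and keeps the best solution.
set.seed(1)
fit_a <- kmeans(toy, centers = 2, nstart = 25)
set.seed(1)
fit_b <- kmeans(toy, centers = 2, nstart = 25)

identical(fit_a$cluster, fit_b$cluster)
```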

The centers of the clusters and the sizes

cls$centers
##   Total_Purchases Catalogs
## 1       31.218935 28.50296
## 2        2.128508 10.23289
## 3     1602.000000  0.00000
## 4        5.506344 24.11572
cls$size
## [1]  169 2031    1 2601
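
The share of active users in each cluster follows directly from the sizes printed above. A short sketch of the arithmetic (the `cluster_*` names are added here only for readability):

```r
# Cluster sizes as printed by cls$size above.
sizes <- c(169, 2031, 1, 2601)

# Total number of active users.
sum(sizes)

# Share of active users in each cluster, in percent.
shares <- round(100 * sizes / sum(sizes), 1)
names(shares) <- paste0("cluster_", 1:4)
shares
```

This gives about 3.5%, 42.3%, 0% and 54.2% for clusters 1 to 4.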

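We can plot the data to see the clusters. The y-axis is limited to 0-150 purchases so that the outlier does not flatten the plot; points outside that range are dropped, which is what the warning below reports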

by_customerid$Cluster<-as.factor(cls$cluster)

ggplot(by_customerid, aes(Catalogs, Total_Purchases, color=Cluster))+geom_point()+scale_y_continuous(limits = c(0, 150))+ggtitle("Clusters of Customers")
## Warning: Removed 2 rows containing missing values (geom_point).

Summary

I took a sample of 1M rows, which contained around 28k users. Of those users I kept only the 4802 active ones and ran a cluster analysis with 4 clusters. Cluster 3 is an outlier: it contains a single user with 1602 purchases within 3 years, and I ignore that user in this analysis.

Cluster 1 holds 3.5% of the active users. On average they made 31 purchases within 36 months and received 28 catalogs in different months. These users should definitely receive a catalog.

Cluster 4 holds 54% of the active users. On average they bought 5.5 times within 36 months and received 24 catalogs. These users can receive a catalog, because they have roughly a 15% chance of buying in any given month (5.5 purchases over 36 months).

Finally, cluster 2 holds the remaining 42%, who on average made about two purchases in 3 years and received about 10 catalogs. We can avoid sending them a catalog, since they have only around a 6% probability of buying in any given month.
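
The 15% and 6% figures are consistent with dividing the mean number of purchases in a cluster by the 36 months of the window. That reading is an assumption on my part, and it treats average purchases per month as a rough proxy for the chance of buying in a month. A sketch of the arithmetic using the centers printed earlier:

```r
# Mean purchases per customer over the 36-month window, from cls$centers above.
# Cluster 3 (the single outlier) is left out.
mean_purchases <- c(cluster_1 = 31.218935, cluster_2 = 2.128508, cluster_4 = 5.506344)
months <- 36

# Average purchases per month, expressed in percent.
monthly_rate <- mean_purchases / months
round(100 * monthly_rate, 1)
```

For cluster 1 the rate is close to one purchase per month, which is why it stands out as the group most worth mailing.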