library(data.table)
library(dplyr)
library(ggplot2)
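# Read the first 1M rows, keeping only columns 1, 2, 4 and 6
full <- fread("hmm_source.csv", select = c(1, 2, 4, 6), data.table = FALSE, nrows = 1000000)
# Customers with at least one purchase
actives <- unique(subset(full, total_purchases > 0, select = customer_id))
# Keep only the rows that belong to active customers
reduced <- merge(full, actives)
# Treat missing values as zero
reduced[is.na(reduced)] <- 0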
###In case we need to save it
### write.csv(reduced, "reduced.csv")
by_customerid <- reduced %>%
  select(-reference_date) %>%
  group_by(customer_id) %>%
  summarise(Total_Purchases = sum(total_purchases), Catalogs = sum(ispromo))
cls<-kmeans(by_customerid[,c(2,3)],4)
cls$centers
## Total_Purchases Catalogs
## 1 31.218935 28.50296
## 2 2.128508 10.23289
## 3 1602.000000 0.00000
## 4 5.506344 24.11572
cls$size
## [1] 169 2031 1 2601
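kmeans() starts from randomly chosen centers, so the cluster numbers (1 to 4) can change from one run to the next even when the groups themselves stay the same. The sketch below shows one way to make a run repeatable. The toy data frame, the seed value and the nstart setting are assumptions made for illustration only and are not part of the analysis above.

```r
# Hypothetical toy data standing in for by_customerid[, c(2, 3)]
toy <- data.frame(
  Total_Purchases = c(1, 2, 3, 30, 31, 32),
  Catalogs        = c(10, 11, 9, 28, 27, 29)
)

# Fixing the seed right before each call makes the random starts identical;
# nstart = 10 tries several starts and keeps the best one
set.seed(1)  # arbitrary seed, chosen only for this sketch
fit_a <- kmeans(toy, centers = 2, nstart = 10)
set.seed(1)
fit_b <- kmeans(toy, centers = 2, nstart = 10)

# Same seed, same data: the cluster labels match exactly
identical(fit_a$cluster, fit_b$cluster)
```

Without a fixed seed, refer to clusters by their centers or sizes rather than by number.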
# Store the cluster label as a column so ggplot can map it by name
by_customerid$cluster <- as.factor(cls$cluster)
ggplot(by_customerid, aes(Catalogs, Total_Purchases, color = cluster)) +
  geom_point() +
  scale_y_continuous(limits = c(0, 150)) +
  ggtitle("Clusters of Customers")
## Warning: Removed 2 rows containing missing values (geom_point).
I took a sample of 1M rows, which contained around 28k users. From those users I kept only the 4,802 active ones and ran a cluster analysis with 4 clusters. There was one outlier: cluster 3 contains a single user with 1,602 purchases within 3 years. I ignore that user for this analysis.

Cluster 1 holds 3.5% of the active customers. They average about 31 purchases within 36 months and received about 28 catalogs in different months. Those users should definitely receive a catalog.

Cluster 4 holds 54% of the active customers. They bought about 5.5 times within 36 months and received about 24 catalogs. Those users can receive a catalog because they have a 15% probability of buying in the coming months.

Finally, cluster 2 holds the remaining 42%, who made about two purchases in 3 years. We can avoid sending them a catalog, since they have around a 6% probability of buying in the coming months.
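
The percentages quoted above follow directly from the printed cls$size. As a quick check, the sketch below recomputes them from the sizes copied out of the output. The variable names sizes, share and outlier are introduced here for illustration only.

```r
# Cluster sizes copied from the cls$size output (clusters 1 to 4)
sizes <- c(169, 2031, 1, 2601)

# Number of active customers that were clustered
sum(sizes)

# Share of each cluster in percent, rounded to one decimal
share <- round(100 * sizes / sum(sizes), 1)
share

# Index of the singleton cluster, i.e. the outlier customer
outlier <- which(sizes == 1)
outlier
```

With live results, the same calculation works on cls$size directly in place of the hard-coded vector.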