CRISA is an Asian research agency that specializes in tracking consumer purchase behavior in consumer goods. CRISA has both transaction data and household data. CRISA has two categories of clients: (1) advertising agencies and (2) consumer goods manufacturers.
Business Problem: Segment the market on two key sets of variables more directly related to the purchase process and to brand loyalty: (1) purchase behavior and (2) basis of purchase.
Analytics Problem: Use clustering methods to segment the data.
Use k-means clustering to identify clusters of households based on: The variables that describe purchase behavior (including brand loyalty).
Based on the visualization of the clustering, either 2 or 3 clusters appear to separate the data without much overlap. When k=3, much of the 3rd cluster sits inside the other two clusters. Thus, I would say that k=2 is best visually. We will apply three different methods for choosing the number of clusters later.
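library(factoextra) #fviz_cluster, fviz_nbclust
library(ggplot2) #ggtitle, ggplot
library(caret) #createDataPartition, confusionMatrix
library(randomForest) #randomForest
#normalize (Soap is assumed to be already loaded as an all-numeric data frame)
#note: this scales every column, including Member id
Soap.norm = scale(Soap)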
purch.behavior = Soap[,c(12:31)]
basis.purch = Soap[,c(32:46)]
#Purchase Behavior (including brand loyalty)
purch.behavior.norm = scale(purch.behavior)
#Basis of Purchase
basis.purch.norm = scale(basis.purch)
set.seed(1)
k2.0 <- kmeans(purch.behavior.norm, centers = 2)
set.seed(1)
k3.0 <- kmeans(purch.behavior.norm, centers = 3)
set.seed(1)
k4.0 <- kmeans(purch.behavior.norm, centers = 4)
set.seed(1)
k5.0 <- kmeans(purch.behavior.norm, centers = 5)
p1.0 <- fviz_cluster(k2.0, geom = "point", data = purch.behavior.norm) + ggtitle("k = 2")
p2.0 <- fviz_cluster(k3.0, geom = "point", data = purch.behavior.norm) + ggtitle("k = 3")
p3.0 <- fviz_cluster(k4.0, geom = "point", data = purch.behavior.norm) + ggtitle("k = 4")
p4.0 <- fviz_cluster(k5.0, geom = "point", data = purch.behavior.norm) + ggtitle("k = 5")
gridExtra::grid.arrange(p1.0, p2.0, p3.0, p4.0, nrow = 2)
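A single kmeans() run depends on its random starting centers, which is why the seed is reset before each call above. The following is a minimal sketch, on simulated stand-in data rather than the Soap data, of how the nstart argument keeps the best of several random starts. The object names (x, single, multi) are illustrative only.

```r
# Simulated stand-in for a scaled feature matrix (NOT the Soap data):
# three well-separated groups in two dimensions.
set.seed(1)
x <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 6), ncol = 2),
  cbind(rnorm(50, mean = 0), rnorm(50, mean = 6))
)
x <- scale(x)

# One random start: the result can be a local optimum.
set.seed(1)
single <- kmeans(x, centers = 3, nstart = 1)

# 25 random starts: kmeans() keeps the solution with the lowest
# total within-cluster sum of squares.
set.seed(1)
multi <- kmeans(x, centers = 3, nstart = 25)

c(single = single$tot.withinss, multi = multi$tot.withinss)
```

On data this clean the two results will usually agree; on harder data, several starts make the reported clusters less dependent on the seed.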
Use k-means clustering to identify clusters of households based on: The variables that describe the basis for purchase.
Based on the visualization of the clustering, 3 clusters appear to separate the data without much overlap. We will check this later using three different methods.
set.seed(1)
k2.1 <- kmeans(basis.purch.norm, centers = 2)
set.seed(1)
k3.1 <- kmeans(basis.purch.norm, centers = 3)
set.seed(1)
k4.1 <- kmeans(basis.purch.norm, centers = 4)
set.seed(1)
k5.1 <- kmeans(basis.purch.norm, centers = 5)
p1.1 <- fviz_cluster(k2.1, geom = "point", data = basis.purch.norm) + ggtitle("k = 2")
p2.1 <- fviz_cluster(k3.1, geom = "point", data = basis.purch.norm) + ggtitle("k = 3")
p3.1 <- fviz_cluster(k4.1, geom = "point", data = basis.purch.norm) + ggtitle("k = 4")
p4.1 <- fviz_cluster(k5.1, geom = "point", data = basis.purch.norm) + ggtitle("k = 5")
gridExtra::grid.arrange(p1.1, p2.1, p3.1, p4.1, nrow = 2)
Use k-means clustering to identify clusters of households based on: The variables that describe purchase behavior and the basis for purchase.
For the combined variable set, the choice again appears to be between 2 and 3 clusters. Visually, k=3 is probably best.
set.seed(1)
k2.2 <- kmeans(Soap.norm, centers = 2)
set.seed(1)
k3.2 <- kmeans(Soap.norm, centers = 3)
set.seed(1)
k4.2 <- kmeans(Soap.norm, centers = 4)
set.seed(1)
k5.2 <- kmeans(Soap.norm, centers = 5)
p1.2 <- fviz_cluster(k2.2, geom = "point", data = Soap.norm) + ggtitle("k = 2")
p2.2 <- fviz_cluster(k3.2, geom = "point", data = Soap.norm) + ggtitle("k = 3")
p3.2 <- fviz_cluster(k4.2, geom = "point", data = Soap.norm) + ggtitle("k = 4")
p4.2 <- fviz_cluster(k5.2, geom = "point", data = Soap.norm) + ggtitle("k = 5")
gridExtra::grid.arrange(p1.2, p2.2, p3.2, p4.2, nrow = 2)
Using the elbow method, silhouette method, and gap statistic to find the optimal number of clusters for the purchase behavior variables.
fviz_nbclust(purch.behavior.norm, kmeans, method = "wss")
fviz_nbclust(purch.behavior.norm, kmeans, method = "silhouette")
fviz_nbclust(purch.behavior.norm, kmeans, method = "gap_stat")
Across the three methods, the silhouette method suggests k=2 and the gap statistic suggests k=3. It is hard to conclude anything from the elbow plot. Thus, either k=2 or k=3 should work for purchase behavior.
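To make the silhouette criterion concrete, here is a minimal sketch that computes the average silhouette width by hand for several values of k. It uses simulated stand-in data, not the Soap data, and avg_silhouette is a hypothetical helper written for this illustration; fviz_nbclust() does its own computation internally.

```r
# Average silhouette width for a clustering, written out by hand.
# For point i: a = mean distance to the other points in its own cluster,
# b = smallest mean distance to the points of any other cluster,
# silhouette = (b - a) / max(a, b).
avg_silhouette <- function(x, cl) {
  d <- as.matrix(dist(x))
  ids <- unique(cl)
  s <- numeric(nrow(d))
  for (i in seq_len(nrow(d))) {
    own <- cl == cl[i]
    own[i] <- FALSE                 # exclude the point itself
    if (!any(own)) next             # singleton cluster: width stays 0
    a <- mean(d[i, own])
    b <- min(sapply(setdiff(ids, cl[i]),
                    function(k) mean(d[i, cl == k])))
    s[i] <- (b - a) / max(a, b)
  }
  mean(s)
}

# Simulated stand-in data (NOT the Soap data): two separated groups.
set.seed(1)
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 10), ncol = 2))

sil <- sapply(2:5, function(k) {
  avg_silhouette(x, kmeans(x, centers = k, nstart = 25)$cluster)
})
names(sil) <- paste0("k=", 2:5)
sil
```

The value of k with the largest average width is the one the silhouette method would pick; values near 1 mean points sit much closer to their own cluster than to the nearest other one.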
Using the elbow method, silhouette method, and gap statistic to find the optimal number of clusters for the basis of purchase variables.
fviz_nbclust(basis.purch.norm, kmeans, method = "wss")
fviz_nbclust(basis.purch.norm, kmeans, method = "silhouette")
fviz_nbclust(basis.purch.norm, kmeans, method = "gap_stat")
I would say the elbow method again doesn’t really support any firm conclusion; the other two suggest that k=2 or k=3 could be good for the number of clusters.
Using the elbow method, silhouette method, and gap statistic to find the optimal number of clusters for all variables combined.
fviz_nbclust(Soap.norm, kmeans, method = "wss")
fviz_nbclust(Soap.norm, kmeans, method = "silhouette")
fviz_nbclust(Soap.norm, kmeans, method = "gap_stat")
Looking at all of the data for purchase behavior and basis of purchase, the elbow method suggests k=4, the silhouette method k=2, and the gap statistic k=4. Looking back at the plots, with 4 clusters the entire 4th cluster sits inside cluster 2. I would still lean towards k=4 because two of the three methods point to it.
Below are the means of each variable for the first cluster.
### Exploring Clusters
purch.behavior = Soap[,c(2:31)]
purch.behavior$Cluster = k2.0$cluster
#Partition into clusters
C1.0 = subset(purch.behavior, Cluster == "1")
C2.0 = subset(purch.behavior, Cluster == "2")
format(colMeans(C1.0), scientific = F)
## SEC FEH MT
## " 2.71753247" " 1.90584416" " 7.54220779"
## SEX AGE EDU
## " 1.59415584" " 3.16558442" " 3.39935065"
## HS CHILD CS
## " 3.90259740" " 3.40584416" " 0.86363636"
## Affluence Index No. of Brands Brand Runs
## " 13.01623377" " 2.75974026" " 9.02597403"
## Total Volume No. of Trans Value
## "10862.01298701" " 21.89935065" " 1059.32581169"
## Trans / Brand Runs Vol/Tran Avg. Price
## " 3.28863636" " 492.24159091" " 10.63211039"
## Pur Vol No Promo - % Pur Vol Promo 6 % Pur Vol Other Promo %
## " 0.95474808" " 0.01438278" " 0.03086914"
## Br. Cd. 57, 144 Br. Cd. 55 Br. Cd. 272
## " 0.23491356" " 0.20934939" " 0.01133486"
## Br. Cd. 286 Br. Cd. 24 Br. Cd. 481
## " 0.03840560" " 0.01041140" " 0.01208363"
## Br. Cd. 352 Br. Cd. 5 Others 999
## " 0.04081926" " 0.01284040" " 0.42984191"
## Cluster
## " 1.00000000"
Below are the means of each variable for the second cluster.
format(colMeans(C2.0), scientific = F)
## SEC FEH MT
## " 2.27054795" " 2.19863014" " 8.84931507"
## SEX AGE EDU
## " 1.89041096" " 3.26369863" " 4.72260274"
## HS CHILD CS
## " 4.49657534" " 3.05136986" " 1.00342466"
## Affluence Index No. of Brands Brand Runs
## " 21.24315068" " 4.56164384" " 22.84589041"
## Total Volume No. of Trans Value
## "13025.21232877" " 40.91438356" " 1630.68325342"
## Trans / Brand Runs Vol/Tran Avg. Price
## " 1.91017123" " 333.63123288" " 13.10321918"
## Pur Vol No Promo - % Pur Vol Promo 6 % Pur Vol Other Promo %
## " 0.86898145" " 0.09476070" " 0.03625785"
## Br. Cd. 57, 144 Br. Cd. 55 Br. Cd. 272
## " 0.12993093" " 0.04496154" " 0.05617157"
## Br. Cd. 286 Br. Cd. 24 Br. Cd. 481
## " 0.02924639" " 0.02871958" " 0.04052024"
## Br. Cd. 352 Br. Cd. 5 Others 999
## " 0.02728314" " 0.02382510" " 0.61919148"
## Cluster
## " 2.00000000"
Notice that there is a large difference in Brand Runs (~9 vs. ~23). The average value is quite a bit greater for cluster 2 (~1631 vs. ~1059), with a higher average purchase price as well (~13.1 vs. ~10.6). Note that cluster 2 has a higher overall promotional purchase rate, about 13.1% vs. about 4.5% for cluster 1 (Promo 6 plus Other Promo). The number of transactions for cluster 2 is almost double that of cluster 1 (~41 vs. ~22).
Interesting observations: cluster 1 uses fewer brands overall and has fewer brand runs, but more transactions per brand run (~3.3 vs. ~1.9). If Brand Runs counts the number of runs of consecutive purchases of the same brand, this suggests that cluster 1 households stay with a brand longer, while cluster 2 households spread many more transactions across more brands, i.e. any loyalty in cluster 2 is to multiple brands. A question that comes to mind is: are there more people in these households?
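The promotional shares and the transaction ratio can be rechecked directly from the cluster means printed above. This is only a worked-arithmetic sketch; the vector names are illustrative and the numbers are copied from the two colMeans() outputs.

```r
# Cluster means copied from the colMeans() output above.
no_promo    <- c(cluster1 = 0.95474808, cluster2 = 0.86898145)
promo_6     <- c(cluster1 = 0.01438278, cluster2 = 0.09476070)
other_promo <- c(cluster1 = 0.03086914, cluster2 = 0.03625785)

# Share of purchase volume bought on any promotion, in percent.
promo_share <- promo_6 + other_promo
round(100 * promo_share, 1)

# Sanity check: the three shares should account for all purchase volume.
no_promo + promo_share

# Average number of transactions: cluster 2 relative to cluster 1.
trans_ratio <- 40.91438356 / 21.89935065
round(trans_ratio, 2)
```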
Below are the sizes of the clusters, the cluster centers, and the total within-cluster sum of squares.
k2.0$size
## [1] 308 292
k2.0$centers #note these are normalized
## No. of Brands Brand Runs Total Volume No. of Trans Value
## 1 -0.5551189 -0.6469201 -0.1354834 -0.5310062 -0.314849
## 2 0.5855363 0.6823678 0.1429072 0.5601025 0.332101
## Trans / Brand Runs Vol/Tran Avg. Price Pur Vol No Promo - %
## 1 0.2575565 0.3102991 -0.3213184 0.3493170
## 2 -0.2716692 -0.3273017 0.3389249 -0.3684577
## Pur Vol Promo 6 % Pur Vol Other Promo % Br. Cd. 57, 144 Br. Cd. 55
## 1 -0.4206724 -0.03643105 0.216099 0.3080240
## 2 0.4437230 0.03842727 -0.227940 -0.3249021
## Br. Cd. 272 Br. Cd. 286 Br. Cd. 24 Br. Cd. 481 Br. Cd. 352 Br. Cd. 5
## 1 -0.2398908 0.03948456 -0.1116980 -0.1548665 0.05419250 -0.07872644
## 2 0.2530355 -0.04164810 0.1178185 0.1633524 -0.05716195 0.08304022
## Others 999
## 1 -0.3099314
## 2 0.3269140
k2.0$tot.withinss
## [1] 10692.33
We can see the sizes of the clusters are pretty even. Also, the clusters are separated quite nicely across a lot of the variables. Note that if we increase the number of clusters, the total within-cluster sum of squares does get better, but it doesn’t improve drastically.
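One way to judge whether the improvement from adding clusters is "drastic" is to look at the relative drop in the total within-cluster sum of squares from one k to the next. The sketch below does this on simulated stand-in data (three groups, not the Soap data); ks, wss and rel_drop are names chosen for the illustration.

```r
# Simulated stand-in for a scaled feature matrix (NOT the Soap data).
set.seed(1)
x <- scale(rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 6), ncol = 2),
  cbind(rnorm(50, mean = 0), rnorm(50, mean = 6))
))

# Total within-cluster sum of squares for k = 2..5.
ks  <- 2:5
wss <- sapply(ks, function(k) kmeans(x, centers = k, nstart = 25)$tot.withinss)
names(wss) <- paste0("k=", ks)
round(wss, 1)

# Drop when going from k to k + 1, as a fraction of the value at k.
rel_drop <- -diff(wss) / head(wss, -1)
round(rel_drop, 3)
```

A large relative drop followed by small ones is the "elbow"; when all the drops are of similar, modest size, as the text describes for the Soap data, the elbow plot gives little guidance.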
Below are the means of each variable for the first basis-of-purchase cluster.
basis.purch = Soap[,c(2:11,32:46)]
basis.purch$Cluster = k2.1$cluster
#Partition into clusters
C1.1 = subset(basis.purch, Cluster == "1")
C2.1 = subset(basis.purch, Cluster == "2")
format(colMeans(C1.1), scientific = F)
## SEC FEH MT SEX AGE
## " 1.89302326" " 1.84651163" " 7.64186047" " 1.67906977" " 3.25116279"
## EDU HS CHILD CS Affluence Index
## " 4.60000000" " 3.67906977" " 3.37209302" " 0.87906977" "20.77674419"
## Pr Cat 1 Pr Cat 2 Pr Cat 3 Pr Cat 4 PropCat 5
## " 0.55901305" " 0.40408272" " 0.01208310" " 0.02482113" " 0.31030259"
## PropCat 6 PropCat 7 PropCat 8 PropCat 9 PropCat 10
## " 0.12026980" " 0.19410519" " 0.15067337" " 0.03992765" " 0.04482603"
## PropCat 11 PropCat 12 PropCat 13 PropCat 14 PropCat 15
## " 0.01994279" " 0.01280167" " 0.06227482" " 0.01144191" " 0.03343419"
## Cluster
## " 1.00000000"
Below are the means of each variable for the second cluster.
format(colMeans(C2.1), scientific = F)
## SEC FEH MT SEX AGE
## " 2.838961039" " 2.161038961" " 8.477922078" " 1.771428571" " 3.192207792"
## EDU HS CHILD CS Affluence Index
## " 3.732467532" " 4.477922078" " 3.155844156" " 0.961038961" "14.922077922"
## Pr Cat 1 Pr Cat 2 Pr Cat 3 Pr Cat 4 PropCat 5
## " 0.122686222" " 0.542875839" " 0.210193816" " 0.124244123" " 0.539177357"
## PropCat 6 PropCat 7 PropCat 8 PropCat 9 PropCat 10
## " 0.076717247" " 0.042630920" " 0.040764257" " 0.025711585" " 0.006523370"
## PropCat 11 PropCat 12 PropCat 13 PropCat 14 PropCat 15
## " 0.034630295" " 0.002540059" " 0.004086944" " 0.206307523" " 0.020910442"
## Cluster
## " 2.000000000"
Cluster 1 has most of its volume purchased under premium soaps and popular soaps (~56% and ~40%, together ~96%), while cluster 2 buys mostly popular soaps (~54%). Thus, it seems like cluster 1 is less focused on price and more focused on popularity or the premium label.
Looking at purchases by selling proposition, the largest share of purchases was made under “Beauty” (PropCat 5) for both clusters, but cluster 2 favors it more, by about 23 percentage points (~54% vs. ~31%), and it also favors Carbolic (PropCat 14). Cluster 1 has a higher preference for PropCat 6, PropCat 7, and PropCat 8, which are Health, Herbal, and Freshness.
k2.1$size
## [1] 215 385
k2.1$centers #note these are normalized
## Pr Cat 1 Pr Cat 2 Pr Cat 3 Pr Cat 4 PropCat 5 PropCat 6 PropCat 7
## 1 0.9967576 -0.2858243 -0.4742713 -0.3328171 -0.4642392 0.16801648 0.4965058
## 2 -0.5566309 0.1596162 0.2648528 0.1858589 0.2592504 -0.09382739 -0.2772695
## PropCat 8 PropCat 9 PropCat 10 PropCat 11 PropCat 12 PropCat 13
## 1 0.4622663 0.14521800 0.3206181 -0.09571174 0.2502437 0.3911774
## 2 -0.2581487 -0.08109576 -0.1790465 0.05344941 -0.1397465 -0.2184497
## PropCat 14 PropCat 15
## 1 -0.4699908 0.09174436
## 2 0.2624624 -0.05123386
k2.1$tot.withinss
## [1] 8082.713
The cluster sizes are a little more skewed; the first cluster is noticeably smaller (215 vs. 385). We can see the centers are fairly close on some of the variables (PropCat 9, 11, and 15), which the above means also show.
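Rather than scanning the centers table by eye, the variables can be ranked by how far apart the two normalized centers are. The following is a minimal sketch on simulated stand-in data (not the Soap data), where only v1 and v2 truly differ between groups; all names are hypothetical.

```r
# Simulated stand-in data (NOT the Soap data): v1 differs strongly between
# the two groups, v2 mildly, v3 and v4 not at all.
set.seed(1)
n   <- 100
grp <- rep(1:2, each = n)
x <- cbind(
  v1 = rnorm(2 * n, mean = ifelse(grp == 1, -2, 2)),
  v2 = rnorm(2 * n, mean = ifelse(grp == 1, 0.5, -0.5)),
  v3 = rnorm(2 * n),
  v4 = rnorm(2 * n)
)
x  <- scale(x)
km <- kmeans(x, centers = 2, nstart = 25)

# Absolute gap between the two normalized centers, largest first.
gap <- sort(abs(km$centers[1, ] - km$centers[2, ]), decreasing = TRUE)
round(gap, 2)
```

Variables at the bottom of this ranking, like PropCat 9, 11, and 15 in the output above, contribute little to telling the two clusters apart.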
We will choose k=4 based on the above methods and visualizations. Listed below, in order, are the four clusters’ mean values for each variable, then the cluster sizes, their centers, and the total within-cluster sum of squares.
#assumed definition: SoapCluster is not created anywhere above; columns 2:46 match the variables printed below (everything except Member id)
SoapCluster = Soap[,c(2:46)]
SoapCluster$Cluster = k4.2$cluster
#Partition into clusters
C1.2 = subset(SoapCluster, Cluster == "1")
C2.2 = subset(SoapCluster, Cluster == "2")
C3.2 = subset(SoapCluster, Cluster == "3")
C4.2 = subset(SoapCluster, Cluster == "4")
format(colMeans(C1.2), scientific = F)
## SEC FEH MT
## " 3.4705882353" " 2.3382352941" " 8.6911764706"
## SEX AGE EDU
## " 1.7500000000" " 3.0000000000" " 2.5147058824"
## HS CHILD CS
## " 4.7794117647" " 3.2941176471" " 1.0147058824"
## Affluence Index No. of Brands Brand Runs
## " 9.2647058824" " 2.9264705882" " 8.9117647059"
## Total Volume No. of Trans Value
## "14468.6029411765" " 27.3529411765" " 992.1014705882"
## Trans / Brand Runs Vol/Tran Avg. Price
## " 5.5048529412" " 555.6422058824" " 6.8422058824"
## Pur Vol No Promo - % Pur Vol Promo 6 % Pur Vol Other Promo %
## " 0.9307966573" " 0.0168661572" " 0.0523371855"
## Br. Cd. 57, 144 Br. Cd. 55 Br. Cd. 272
## " 0.0288208121" " 0.7621998982" " 0.0019585298"
## Br. Cd. 286 Br. Cd. 24 Br. Cd. 481
## " 0.0089379349" " 0.0029901375" " 0.0041173348"
## Br. Cd. 352 Br. Cd. 5 Others 999
## " 0.0022367635" " 0.0077038105" " 0.1810347787"
## Pr Cat 1 Pr Cat 2 Pr Cat 3
## " 0.0536513749" " 0.1114074551" " 0.7966113492"
## Pr Cat 4 PropCat 5 PropCat 6
## " 0.0383298208" " 0.1035686825" " 0.0522436098"
## PropCat 7 PropCat 8 PropCat 9
## " 0.0065686732" " 0.0040699554" " 0.0184133419"
## PropCat 10 PropCat 11 PropCat 12
## " 0.0006637574" " 0.0042723511" " 0.0020614373"
## PropCat 13 PropCat 14 PropCat 15
## " 0.0029901375" " 0.7894704542" " 0.0156775996"
## Cluster
## " 1.0000000000"
format(colMeans(C2.2), scientific = F)
## SEC FEH MT
## " 2.420581655" " 2.277404922" " 9.167785235"
## SEX AGE EDU
## " 1.961968680" " 3.288590604" " 4.794183445"
## HS CHILD CS
## " 4.646532438" " 2.984340045" " 1.044742729"
## Affluence Index No. of Brands Brand Runs
## " 20.422818792" " 3.874720358" " 18.015659955"
## Total Volume No. of Trans Value
## "12629.087248322" " 34.704697987" " 1507.357740492"
## Trans / Brand Runs Vol/Tran Avg. Price
## " 2.340782998" " 397.112416107" " 12.436554810"
## Pur Vol No Promo - % Pur Vol Promo 6 % Pur Vol Other Promo %
## " 0.907125363" " 0.063117400" " 0.029757238"
## Br. Cd. 57, 144 Br. Cd. 55 Br. Cd. 272
## " 0.207717969" " 0.042142711" " 0.036946364"
## Br. Cd. 286 Br. Cd. 24 Br. Cd. 481
## " 0.015357284" " 0.017473600" " 0.031841449"
## Br. Cd. 352 Br. Cd. 5 Others 999
## " 0.042768050" " 0.021710743" " 0.583943825"
## Pr Cat 1 Pr Cat 2 Pr Cat 3
## " 0.303942410" " 0.543546123" " 0.049017534"
## Pr Cat 4 PropCat 5 PropCat 6
## " 0.103493933" " 0.494976879" " 0.101319698"
## PropCat 7 PropCat 8 PropCat 9
## " 0.114148333" " 0.090570142" " 0.033899704"
## PropCat 10 PropCat 11 PropCat 12
## " 0.022138138" " 0.036199984" " 0.006099238"
## PropCat 13 PropCat 14 PropCat 15
## " 0.023414192" " 0.046484561" " 0.030749131"
## Cluster
## " 2.000000000"
format(colMeans(C3.2), scientific = F)
## SEC FEH MT
## " 2.016393443" " 0.000000000" " 0.000000000"
## SEX AGE EDU
## " 0.000000000" " 2.770491803" " 0.000000000"
## HS CHILD CS
## " 0.000000000" " 5.000000000" " 0.000000000"
## Affluence Index No. of Brands Brand Runs
## " 0.000000000" " 2.524590164" " 7.131147541"
## Total Volume No. of Trans Value
## "3484.180327869" " 10.180327869" " 450.881147541"
## Trans / Brand Runs Vol/Tran Avg. Price
## " 1.305901639" " 363.824426230" " 13.284590164"
## Pur Vol No Promo - % Pur Vol Promo 6 % Pur Vol Other Promo %
## " 0.915312335" " 0.039789310" " 0.044898355"
## Br. Cd. 57, 144 Br. Cd. 55 Br. Cd. 272
## " 0.215260757" " 0.103247243" " 0.049418420"
## Br. Cd. 286 Br. Cd. 24 Br. Cd. 481
## " 0.018265489" " 0.056675020" " 0.011835254"
## Br. Cd. 352 Br. Cd. 5 Others 999
## " 0.013209642" " 0.006242285" " 0.525845890"
## Pr Cat 1 Pr Cat 2 Pr Cat 3
## " 0.410894855" " 0.421080104" " 0.109410812"
## Pr Cat 4 PropCat 5 PropCat 6
## " 0.058614229" " 0.453368570" " 0.078422521"
## PropCat 7 PropCat 8 PropCat 9
## " 0.088343093" " 0.107452007" " 0.027454539"
## PropCat 10 PropCat 11 PropCat 12
## " 0.035732279" " 0.011835254" " 0.014021988"
## PropCat 13 PropCat 14 PropCat 15
## " 0.068384622" " 0.109410812" " 0.005574316"
## Cluster
## " 3.000000000"
format(colMeans(C4.2), scientific = F)
## SEC FEH MT
## " 2.458333333" " 2.166666667" " 9.083333333"
## SEX AGE EDU
## " 1.958333333" " 3.541666667" " 4.666666667"
## HS CHILD CS
## " 4.708333333" " 3.208333333" " 0.958333333"
## Affluence Index No. of Brands Brand Runs
## " 18.875000000" " 4.041666667" " 14.875000000"
## Total Volume No. of Trans Value
## "12802.500000000" " 29.083333333" " 1403.179166667"
## Trans / Brand Runs Vol/Tran Avg. Price
## " 2.931250000" " 481.021666667" " 11.085833333"
## Pur Vol No Promo - % Pur Vol Promo 6 % Pur Vol Other Promo %
## " 0.966322075" " 0.013020710" " 0.020657215"
## Br. Cd. 57, 144 Br. Cd. 55 Br. Cd. 272
## " 0.098023038" " 0.026788142" " 0.009604473"
## Br. Cd. 286 Br. Cd. 24 Br. Cd. 481
## " 0.490924591" " 0.005067665" " 0.013275419"
## Br. Cd. 352 Br. Cd. 5 Others 999
## " 0.019325102" " 0.012601325" " 0.324390246"
## Pr Cat 1 Pr Cat 2 Pr Cat 3
## " 0.118619306" " 0.819094709" " 0.032000710"
## Pr Cat 4 PropCat 5 PropCat 6
## " 0.030285276" " 0.764396707" " 0.073662641"
## PropCat 7 PropCat 8 PropCat 9
## " 0.053567392" " 0.032201235" " 0.016808472"
## PropCat 10 PropCat 11 PropCat 12
## " 0.001189221" " 0.017770884" " 0.000350140"
## PropCat 13 PropCat 14 PropCat 15
## " 0.005067665" " 0.031323203" " 0.003662441"
## Cluster
## " 4.000000000"
k4.2$size
## [1] 68 447 61 24
k4.2$centers #note these are normalized
## Member id SEC FEH MT SEX AGE
## 1 -0.99660572 0.86739677 0.2554282 0.1194147 0.01798996 -0.2464889
## 2 0.09245715 -0.07097471 0.2018315 0.2303923 0.34484503 0.0869535
## 3 0.33338888 -0.43219025 -1.8047556 -1.9043115 -2.68050476 -0.5116665
## 4 0.25433839 -0.03723673 0.1042617 0.2107278 0.33923934 0.3793618
## EDU HS CHILD CS Affluence Index No. of Brands
## 1 -0.6980023 0.2555313 0.04994149 0.16366538 -0.6796880 -0.4495739
## 2 0.3428534 0.1977600 -0.20457740 0.22286622 0.2982292 0.1506946
## 3 -1.8462679 -1.8223924 1.45152536 -1.83625981 -1.4916636 -0.7039754
## 4 0.2846266 0.2246289 -0.02054045 0.05255842 0.1625756 0.2563763
## Brand Runs Total Volume No. of Trans Value Trans / Brand Runs
## 1 -0.65790552 0.32866279 -0.2180717 -0.39096736 1.1084151
## 2 0.21776536 0.09192829 0.2037822 0.19245910 -0.1063471
## 3 -0.82917667 -1.08496568 -1.2034598 -1.00379302 -0.5036628
## 4 -0.08432341 0.11424546 -0.1187794 0.07449734 0.1203474
## Vol/Tran Avg. Price Pur Vol No Promo - % Pur Vol Promo 6 %
## 1 0.56516439 -1.3339241 0.14886931 -0.3939659
## 2 -0.07211252 0.1608021 -0.04923405 0.1034265
## 3 -0.20592755 0.3873845 0.01928214 -0.1474471
## 4 0.26519586 -0.2000904 0.44617897 -0.4353204
## Pur Vol Other Promo % Br. Cd. 57, 144 Br. Cd. 55 Br. Cd. 272 Br. Cd. 286
## 1 0.26179643 -0.6555996 2.4366090 -0.34297239 -0.2215413
## 2 -0.05187722 0.1010714 -0.3357551 0.04167722 -0.1646784
## 3 0.15845850 0.1329747 -0.1004905 0.17879263 -0.1389174
## 4 -0.17829193 -0.3628990 -0.3948733 -0.25891434 4.0479181
## Br. Cd. 24 Br. Cd. 481 Br. Cd. 352 Br. Cd. 5 Others 999 Pr Cat 1
## 1 -0.20473308 -0.24401301 -0.26320534 -0.15437056 -1.1467522 -0.80240522
## 2 -0.02316431 0.06623267 0.07022421 0.05190302 0.2083644 0.08866816
## 3 0.46827575 -0.15764593 -0.17293725 -0.17589376 0.0129618 0.46943479
## 4 -0.17868860 -0.14152980 -0.12262859 -0.08224715 -0.6646003 -0.57110971
## Pr Cat 2 Pr Cat 3 Pr Cat 4 PropCat 5 PropCat 6 PropCat 7
## 1 -1.2251317 2.4526968 -0.26234414 -1.11773989 -0.24096622 -0.46148609
## 2 0.1617674 -0.3364740 0.07760803 0.11952955 0.05408589 0.08806258
## 3 -0.2312732 -0.1111549 -0.15652318 -0.01199731 -0.08357505 -0.04375821
## 4 1.0461087 -0.3999614 -0.30431142 0.97118499 -0.11219210 -0.22140276
## PropCat 8 PropCat 9 PropCat 10 PropCat 11 PropCat 12 PropCat 13
## 1 -0.4986668 -0.19728072 -0.25548610 -0.25485532 -0.157936401 -0.22994082
## 2 0.0683111 0.04925563 0.02465078 0.06939063 -0.004480698 -0.01596054
## 3 0.1789657 -0.05334868 0.20198863 -0.17804911 0.296621646 0.45518906
## 4 -0.3142763 -0.22282957 -0.24863134 -0.11776891 -0.222973870 -0.20817482
## PropCat 14 PropCat 15
## 1 2.4544320 -0.11097497
## 2 -0.3382738 0.06109023
## 3 -0.1017491 -0.22631983
## 4 -0.3952617 -0.24814688
k4.2$tot.withinss
## [1] 22123.18
We can see the clusters differ in size. Below we use tables to compare a few of the key variables across the clusters.
Looking at the reported sex for each cluster.
SEX Sex of homemaker
0 Not Specified
1 Male
2 Female
table(C1.2$SEX)
##
## 0 1 2
## 7 3 58
table(C2.2$SEX)
##
## 1 2
## 17 430
table(C3.2$SEX)
##
## 0
## 61
table(C4.2$SEX)
##
## 1 2
## 1 23
An interesting observation is that the 3rd cluster consists entirely of households whose sex is ‘Not Specified’.
Below we compare each cluster by the reported age.
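The four separate tables can also be read as one cross-tabulation. The sketch below rebuilds it from the counts printed above, so it runs without the original data; with the report's own objects the equivalent call would presumably be table(SoapCluster$Cluster, SoapCluster$SEX), which is an assumption about how SoapCluster is laid out.

```r
# Counts copied from the four table() outputs above
# (cluster 1: 7/3/58, cluster 2: 17/430, cluster 3: 61, cluster 4: 1/23).
counts <- c(7, 3, 58, 17, 430, 61, 1, 23)
sex_by_cluster <- data.frame(
  Cluster = rep(c(1, 1, 1, 2, 2, 3, 4, 4), times = counts),
  SEX     = rep(c(0, 1, 2, 1, 2, 0, 1, 2), times = counts)
)

# One table: rows are clusters, columns are SEX codes.
tab <- table(Cluster = sex_by_cluster$Cluster, SEX = sex_by_cluster$SEX)
tab

# Row percentages: distribution of SEX within each cluster.
round(100 * prop.table(tab, margin = 1), 1)
```

Row percentages make clusters of very different sizes comparable, which raw counts do not.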
AGE Age of homemaker
1 Up to 24
2 25-34
3 35-44
4 45+
table(C1.2$AGE)
##
## 1 2 3 4
## 4 20 16 28
table(C2.2$AGE)
##
## 1 2 3 4
## 9 82 127 229
table(C3.2$AGE)
##
## 1 2 3 4
## 2 24 21 14
table(C4.2$AGE)
##
## 2 3 4
## 3 5 16
There appears to be no strong trend across the clusters here.
Below we compare each cluster by the reported presence of children.
CHILD Presence of children in household
1 Children up to age 6 present (only)
2 Children 7-14 present (only)
3 Both
4 None
5 Not specified
table(C1.2$CHILD)
##
## 1 2 3 4 5
## 6 16 5 34 7
table(C2.2$CHILD)
##
## 1 2 3 4
## 52 122 54 219
table(C3.2$CHILD)
##
## 5
## 61
table(C4.2$CHILD)
##
## 1 2 3 4
## 1 7 2 14
Below we compare each cluster by the reported educational status.
EDU Education of homemaker
1 Illiterate
2 Literate, but no formal schooling
3 Up to 4 years of school
4 5-9 years of school
5 10-12 years of school
6 Some college
7 College graduate
8 Some graduate school
9 Graduate or professional school degree
0 Not specified
table(C1.2$EDU)
##
## 0 1 2 3 4 5 7
## 10 21 1 8 18 9 1
table(C2.2$EDU)
##
## 0 1 2 3 4 5 6 7 8 9
## 2 27 7 22 111 174 21 69 13 1
table(C3.2$EDU)
##
## 0
## 61
table(C4.2$EDU)
##
## 1 2 3 4 5 6 7 9
## 1 1 3 7 6 2 3 1
Below we compare each cluster by the reported socioeconomic class.
SEC Socioeconomic class (1 = high, 4 = low)
1 A
2 B
3 C
4 D/E
table(C1.2$SEC)
##
## 1 2 3 4
## 1 11 11 45
table(C2.2$SEC)
##
## 1 2 3 4
## 115 121 119 92
table(C3.2$SEC)
##
## 1 2 3 4
## 29 12 10 10
table(C4.2$SEC)
##
## 1 2 3 4
## 5 6 10 3
Below we compare each cluster by the reported affluence index.
table(C1.2$'Affluence Index')
##
## 0 1 2 4 5 6 7 8 9 10 11 12 13 14 15 17 18 19 23 25
## 8 1 3 6 4 2 3 3 1 7 7 3 4 4 3 3 2 1 2 1
table(C2.2$'Affluence Index')
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
## 5 3 1 4 2 3 5 8 12 19 21 17 27 17 28 22 16 19 24 10 13 12 9 8 10 14
## 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 53
## 10 11 8 11 11 6 5 5 8 6 4 8 4 3 2 1 1 3 3 1 1 2 1 1 1 1
table(C3.2$'Affluence Index')
##
## 0
## 61
table(C4.2$'Affluence Index')
##
## 6 11 12 13 14 15 16 17 20 25 32 37 38
## 1 1 4 1 3 3 2 2 1 1 2 1 2
Below we compare each cluster by the reported household size.
HS Household size (number of people in the household)
table(C1.2$HS)
##
## 0 2 3 4 5 6 7 8 9 10 15
## 7 4 6 8 23 8 3 5 2 1 1
table(C2.2$HS)
##
## 1 2 3 4 5 6 7 8 9 10 12 15
## 2 35 64 130 114 55 18 12 11 3 1 2
table(C3.2$HS)
##
## 0
## 61
table(C4.2$HS)
##
## 2 3 4 5 6 7 8 12
## 2 3 9 5 2 1 1 1
Comparing all of the above tables, we can see that the clusters differ greatly in size. An interesting observation is that cluster 3 contains almost all of the unspecified responses, so we do not really know much about that cluster demographically. The plots further below show where cluster 3 is clumped; its households vary in value, but the majority are of lower value.
We can also see that the average total volume varies by cluster. Cluster 1 has the largest volume (14468) but a value of only 992; about 7% of its purchase volume is bought on promotion, and its households seem to be more loyal to a specific brand (Br. Cd. 55 accounts for about 76% of their volume). Cluster 2 has a total volume of 12629 and the highest value, 1507. These customers use more brands and have many more brand runs, so any loyalty appears to be spread across multiple brands (note that these households generally have 3-6 members). Cluster 2 is also the most motivated by promotions: about 9% of its purchase volume is bought on promotion, of which about 6 percentage points come from Promo 6. Cluster 3 is the unspecified cluster and has the lowest total volume and value. Cluster 4 households use about 4 brands on average. Their volume and value are high and their promotion usage is low. These seem to be the customers who will buy the product no matter what; interestingly, their affluence index is lower than that of cluster 2.
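The per-cluster comparison above is easier to make from a single summary table than from four separate colMeans() listings. Below is a minimal sketch using aggregate() on a small simulated data frame; the data and the column names (TotalVolume, Value, Promo6, OtherPromo) are hypothetical stand-ins, not the Soap columns.

```r
# Hypothetical stand-in data (NOT the Soap data).
set.seed(1)
demo <- data.frame(
  Cluster     = sample(1:4, 200, replace = TRUE),
  TotalVolume = round(runif(200, 1000, 20000)),
  Value       = round(runif(200, 100, 2500), 2),
  Promo6      = runif(200, 0, 0.10),
  OtherPromo  = runif(200, 0, 0.05)
)

# Combined promotional share per household.
demo$PromoShare <- demo$Promo6 + demo$OtherPromo

# One row per cluster: mean volume, value and promotional share.
summary_by_cluster <- aggregate(
  cbind(TotalVolume, Value, PromoShare) ~ Cluster,
  data = demo, FUN = mean
)

# Add the cluster sizes so averages can be read alongside counts.
summary_by_cluster$n <- as.vector(table(demo$Cluster))
summary_by_cluster
```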
Shown below are some interesting plots that compare the value to brand runs and to the affluence index.
ggplot(SoapCluster, aes(x = `Brand Runs` ,y = Value, color = factor(Cluster)))+geom_point()
ggplot(SoapCluster, aes(x = `Affluence Index` ,y = `Value`, color = factor(Cluster)))+geom_point()
ggplot(SoapCluster, aes(x = HS ,y = `Value`, color = factor(Cluster)))+geom_point()
From the above clusters using both purchase behavior and basis of purchase, I’d consider cluster 2 to be the “success” segment because it has the highest response to promotional purchases, so we’d want to send promotions to this group. These households also have the highest average value and, as by far the largest cluster, account for most of the total volume. Thus we should apply predictive techniques based on the given demographics in order to classify consumers into the clusters.
Using a regression model to predict the cluster (market segment) from the given demographics. Note that the glm() call below does not set family, so it fits a Gaussian linear model to the numeric cluster label rather than a logistic regression; the predictions are rounded to the nearest cluster number. Below are both the lift chart and confusion matrix for the predictive model.
#Partition into 60-40
set.seed(1)
Cluster2.2 = Soap
Cluster2.2$Cluster = k4.2$cluster
#Cluster2 = subset(Cluster2.2, Cluster == "2")
Cluster2Value <- Cluster2.2[,c(2:11,47)] #demographics w/ Cluster
n = nrow(Cluster2Value)
ind = createDataPartition(1:n,p = 0.6)[[1]]
train.data = Cluster2Value[ind, ]
valid.data = Cluster2Value[-ind, ]
#Fit the model (no family is set, so glm() fits a Gaussian linear model on the numeric cluster label)
model1 = glm(Cluster ~ ., data = train.data)
predicted.prob = predict(model1, newdata = valid.data, type = "response")
predicted.label = round(predicted.prob,0)
observed.label = valid.data$Cluster
pp4 = predicted.prob
actual = observed.label
plot.lift(observed.label, predicted.prob)
CM1 = confusionMatrix(factor(predicted.label), factor(valid.data$Cluster))
CM1
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3 4
## 1 2 1 0 0
## 2 24 174 0 8
## 3 4 2 25 0
## 4 0 0 0 0
##
## Overall Statistics
##
## Accuracy : 0.8375
## 95% CI : (0.7846, 0.8818)
## No Information Rate : 0.7375
## P-Value [Acc > NIR] : 0.0001547
##
## Kappa : 0.5383
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3 Class: 4
## Sensitivity 0.066667 0.9831 1.0000 0.00000
## Specificity 0.995238 0.4921 0.9721 1.00000
## Pos Pred Value 0.666667 0.8447 0.8065 NaN
## Neg Pred Value 0.881857 0.9118 1.0000 0.96667
## Prevalence 0.125000 0.7375 0.1042 0.03333
## Detection Rate 0.008333 0.7250 0.1042 0.00000
## Detection Prevalence 0.012500 0.8583 0.1292 0.00000
## Balanced Accuracy 0.530952 0.7376 0.9860 0.50000
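The glm() call above does not set family, so R uses the Gaussian default. If the goal is to predict membership in the “success” segment, one option is a binary logistic regression for cluster 2 versus all other clusters. The following is only a sketch under that assumption, on simulated stand-in data; the data frame, the InCluster2 target and the coefficients used to simulate it are hypothetical.

```r
# Simulated stand-in demographics (NOT the Soap data).
set.seed(1)
n <- 600
demo <- data.frame(
  SEC = sample(1:4, n, replace = TRUE),
  EDU = sample(0:9, n, replace = TRUE),
  HS  = sample(1:10, n, replace = TRUE)
)

# Simulated binary target: 1 = household is in the segment of interest.
lin <- -1 + 0.4 * demo$EDU - 0.3 * demo$SEC
demo$InCluster2 <- rbinom(n, size = 1, prob = plogis(lin))

# 60/40 split into training and validation rows.
ind   <- sample(seq_len(n), size = round(0.6 * n))
train <- demo[ind, ]
valid <- demo[-ind, ]

# family = binomial is what makes this a logistic regression.
fit <- glm(InCluster2 ~ SEC + EDU + HS, data = train, family = binomial)

# Predicted probabilities, then class labels at a 0.5 cutoff.
prob  <- predict(fit, newdata = valid, type = "response")
label <- as.integer(prob > 0.5)

table(Predicted = label, Observed = valid$InCluster2)
mean(label == valid$InCluster2)
```

With type = "response" the predictions are probabilities between 0 and 1, which is also what a lift chart expects as a score.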
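Using a random forest model to predict the cluster (market segment) from the given demographics. Because Cluster is numeric, randomForest() fits a regression forest here, and the predictions are rounded to the nearest cluster number. Below are both the lift chart and confusion matrix for the predictive model.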
colnames(train.data)[10]="AffluenceIndex" #rename column, randomForest was having an error with the space
colnames(valid.data)[10]="AffluenceIndex"
set.seed(1)
rf = randomForest(Cluster ~ .,
data = train.data,
ntree = 5000, # Number of trees to grow.
mtry = 5, # Number of variables randomly sampled as candidates at each split.
nodesize = 25, # Minimum size of terminal nodes.
maxnodes = NULL,
importance = TRUE,
proximity = TRUE
)
rf.pred = predict(rf, valid.data)
rf.pred = round(rf.pred,0)
plot.lift(observed.label, rf.pred)
CM2 = confusionMatrix(factor(rf.pred), factor(valid.data$Cluster))
CM2
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3 4
## 1 2 2 0 0
## 2 24 174 0 8
## 3 4 1 25 0
## 4 0 0 0 0
##
## Overall Statistics
##
## Accuracy : 0.8375
## 95% CI : (0.7846, 0.8818)
## No Information Rate : 0.7375
## P-Value [Acc > NIR] : 0.0001547
##
## Kappa : 0.5382
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3 Class: 4
## Sensitivity 0.066667 0.9831 1.0000 0.00000
## Specificity 0.990476 0.4921 0.9767 1.00000
## Pos Pred Value 0.500000 0.8447 0.8333 NaN
## Neg Pred Value 0.881356 0.9118 1.0000 0.96667
## Prevalence 0.125000 0.7375 0.1042 0.03333
## Detection Rate 0.008333 0.7250 0.1042 0.00000
## Detection Prevalence 0.016667 0.8583 0.1250 0.00000
## Balanced Accuracy 0.528571 0.7376 0.9884 0.50000
Comparing the two models, both show only a very mild lift. The random forest model has about a 1.3x lift out to the 12th percentile and the glm model about a 1.4x lift out to roughly the 10th percentile. The confusion matrices are nearly identical (accuracy 0.8375 for both), and neither model ever predicts cluster 4. The random forest is probably slightly better because its lift extends farther into the data.
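The lift figures quoted above come from plot.lift(), whose definition is not shown in this report. As a rough illustration of what such a chart summarizes, here is a sketch of cumulative lift by decile for a binary outcome. cumulative_lift is a hypothetical helper, not the report's function, and the data are simulated.

```r
# Hypothetical helper: cumulative lift when targeting the top-scored rows.
# observed: 0/1 outcome; score: model score, higher = more likely a 1.
cumulative_lift <- function(observed, score, groups = 10) {
  ord <- order(score, decreasing = TRUE)
  obs <- observed[ord]
  n   <- length(obs)
  cutoffs  <- ceiling(seq_len(groups) * n / groups)  # rows in top 10%, 20%, ...
  rate_top <- cumsum(obs)[cutoffs] / cutoffs         # response rate among those rows
  data.frame(
    percent_targeted = 100 * seq_len(groups) / groups,
    lift = rate_top / mean(obs)                      # relative to the overall rate
  )
}

# Simulated scores and outcomes (NOT the Soap data).
set.seed(1)
score    <- runif(500)
observed <- rbinom(500, size = 1, prob = score)

lift_tbl <- cumulative_lift(observed, score)
lift_tbl
```

A lift of 1.3 at the 10% row would mean the top-scored tenth of households responds 1.3 times as often as households overall; by construction the lift at 100% is 1.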