Background

CRISA is an Asian research agency that specializes in tracking consumer purchase behavior in consumer goods. CRISA has both transaction data and household data. CRISA has two categories of clients: (1) advertising agencies and (2) consumer goods manufacturers.

Business Problem: Segment the market on two key sets of variables that relate directly to the purchase process and to brand loyalty: (1) purchase behavior and (2) basis of purchase.

Analytics Problem: Use clustering methods to segment the data.

Clustering - Purchase Behavior

Use k-means clustering to identify clusters of households based on the variables that describe purchase behavior (including brand loyalty).

Based on the cluster plots, either 2 or 3 clusters separate the data without much overlap. With k=3, much of the 3rd cluster lies inside the other two, so visually k=2 looks best. Three methods for choosing the number of clusters are applied later.

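# Packages used in this report
library(factoextra)  # fviz_cluster(), fviz_nbclust()
library(ggplot2)     # ggtitle(), ggplot()
library(caret)       # createDataPartition(), confusionMatrix()

# Soap is assumed to be the household data frame loaded earlier (step not shown)

# Normalize all columns; note this includes Member id and the demographics
Soap.norm = scale(Soap)

# Purchase behavior (including brand loyalty): columns 12-31
purch.behavior = Soap[,c(12:31)]
purch.behavior.norm = scale(purch.behavior)

# Basis of purchase: columns 32-46
basis.purch = Soap[,c(32:46)]
basis.purch.norm = scale(basis.purch)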
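
As a quick illustration of what the normalization step does, the sketch below checks scale() against a hand-computed z-score. The toy data frame and its values are hypothetical and are not taken from the Soap data.

```r
# Toy data (hypothetical values, not from the Soap data set)
x <- data.frame(volume = c(100, 250, 400, 550), price = c(8, 10, 12, 14))

# scale() centers each column on its mean and divides by its standard deviation
x.norm <- scale(x)

# The same z-scores computed by hand
x.manual <- sapply(x, function(col) (col - mean(col)) / sd(col))

all.equal(as.numeric(x.norm), as.numeric(x.manual))
colMeans(x.norm)       # approximately 0 for every column
apply(x.norm, 2, sd)   # 1 for every column
```

Because every column ends up with unit variance, variables measured on large scales (such as Total Volume) do not dominate the Euclidean distances that k-means relies on.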

set.seed(1)
k2.0 <- kmeans(purch.behavior.norm, centers = 2)
set.seed(1)
k3.0 <- kmeans(purch.behavior.norm, centers = 3)
set.seed(1)
k4.0 <- kmeans(purch.behavior.norm, centers = 4)
set.seed(1)
k5.0 <- kmeans(purch.behavior.norm, centers = 5)

p1.0 <- fviz_cluster(k2.0, geom = "point", data = purch.behavior.norm) + ggtitle("k = 2")
p2.0 <- fviz_cluster(k3.0, geom = "point",  data = purch.behavior.norm) + ggtitle("k = 3")
p3.0 <- fviz_cluster(k4.0, geom = "point",  data = purch.behavior.norm) + ggtitle("k = 4")
p4.0 <- fviz_cluster(k5.0, geom = "point",  data = purch.behavior.norm) + ggtitle("k = 5")

gridExtra::grid.arrange(p1.0, p2.0, p3.0, p4.0, nrow = 2)
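
Each fit above uses a single random start. A minimal sketch of the same call with several random starts is given below, on synthetic two-group data that only stands in for purch.behavior.norm; whether multiple starts would change the results reported here is not something the original output can tell us.

```r
set.seed(1)
# Synthetic two-group data (hypothetical), standing in for purch.behavior.norm
toy <- scale(rbind(matrix(rnorm(100, mean = 0), ncol = 2),
                   matrix(rnorm(100, mean = 3), ncol = 2)))

set.seed(1)
single <- kmeans(toy, centers = 2)               # one random start, as above
set.seed(1)
multi  <- kmeans(toy, centers = 2, nstart = 25)  # best of 25 random starts

# Compare the total within-cluster sum of squares of the two fits
c(single = single$tot.withinss, multi = multi$tot.withinss)
```

With nstart greater than 1, kmeans() keeps the start with the lowest total within-cluster sum of squares, which makes the solution less sensitive to the seed.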

Clustering - Basis for Purchase

Use k-means clustering to identify clusters of households based on the variables that describe the basis for purchase.

Based on the cluster plots, 3 clusters appear to separate the data without much overlap. The number of clusters is examined later using three different methods.

set.seed(1)
k2.1 <- kmeans(basis.purch.norm, centers = 2)
set.seed(1)
k3.1 <- kmeans(basis.purch.norm, centers = 3)
set.seed(1)
k4.1 <- kmeans(basis.purch.norm, centers = 4)
set.seed(1)
k5.1 <- kmeans(basis.purch.norm, centers = 5)

p1.1 <- fviz_cluster(k2.1, geom = "point", data = basis.purch.norm) + ggtitle("k = 2")
p2.1 <- fviz_cluster(k3.1, geom = "point",  data = basis.purch.norm) + ggtitle("k = 3")
p3.1 <- fviz_cluster(k4.1, geom = "point",  data = basis.purch.norm) + ggtitle("k = 4")
p4.1 <- fviz_cluster(k5.1, geom = "point",  data = basis.purch.norm) + ggtitle("k = 5")

gridExtra::grid.arrange(p1.1, p2.1, p3.1, p4.1, nrow = 2)

Clustering - Purchase Behavior and Basis for Purchase

Use k-means clustering to identify clusters of households based on the variables that describe purchase behavior and the basis for purchase.

With both sets of variables combined, the choice again appears to be between 2 and 3 clusters. Visually, k=3 probably would be best. Note that Soap.norm scales every column of Soap, so these fits also include Member id and the demographic variables.

set.seed(1)
k2.2 <- kmeans(Soap.norm, centers = 2)
set.seed(1)
k3.2 <- kmeans(Soap.norm, centers = 3)
set.seed(1)
k4.2 <- kmeans(Soap.norm, centers = 4)
set.seed(1)
k5.2 <- kmeans(Soap.norm, centers = 5)

p1.2 <- fviz_cluster(k2.2, geom = "point", data = Soap.norm) + ggtitle("k = 2")
p2.2 <- fviz_cluster(k3.2, geom = "point",  data = Soap.norm) + ggtitle("k = 3")
p3.2 <- fviz_cluster(k4.2, geom = "point",  data = Soap.norm) + ggtitle("k = 4")
p4.2 <- fviz_cluster(k5.2, geom = "point",  data = Soap.norm) + ggtitle("k = 5")

gridExtra::grid.arrange(p1.2, p2.2, p3.2, p4.2, nrow = 2)
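
When the data have more than two columns, fviz_cluster() draws the points on the first two principal components, so overlap in these plots is overlap in a two-dimensional projection. The sketch below, on hypothetical random data rather than Soap.norm, shows one way to check how much of the total variance such a projection captures.

```r
set.seed(1)
toy <- scale(matrix(rnorm(600 * 6), ncol = 6))   # hypothetical 6-variable data

pca <- prcomp(toy)
var.share <- pca$sdev^2 / sum(pca$sdev^2)        # variance share per component

# Share of total variance visible in a two-dimensional PCA plot
sum(var.share[1:2])
```

If the first two components carry only a small share of the variance, clusters that look mixed in the plot may still be separated along other dimensions.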

Which Clusters?

Optimal Clustering For Purchase Behavior

Using the elbow method, silhouette method, and gap statistic to view optimal clusters.

fviz_nbclust(purch.behavior.norm, kmeans, method = "wss")

fviz_nbclust(purch.behavior.norm, kmeans, method = "silhouette")

fviz_nbclust(purch.behavior.norm, kmeans, method = "gap_stat")

Of the three methods, the silhouette method suggests that k=2 is best here and the gap statistic suggests that k=3 is best. The elbow plot is hard to conclude anything from. Thus, using either k=2 or k=3 should work for purchase behavior.
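
To make the elbow and silhouette criteria concrete, here is a minimal sketch that computes both curves by hand. It uses synthetic data and the cluster package (an assumption, since the report itself only calls fviz_nbclust()); the numbers it produces say nothing about the Soap data.

```r
library(cluster)  # silhouette(); ships with standard R installations

set.seed(1)
toy <- scale(rbind(matrix(rnorm(100, mean = 0), ncol = 2),
                   matrix(rnorm(100, mean = 4), ncol = 2)))
d <- dist(toy)    # pairwise Euclidean distances, needed for silhouettes

ks <- 2:6
wss <- numeric(length(ks))
avg.sil <- numeric(length(ks))
for (i in seq_along(ks)) {
  fit <- kmeans(toy, centers = ks[i], nstart = 25)
  wss[i] <- fit$tot.withinss                                     # elbow curve
  avg.sil[i] <- mean(silhouette(fit$cluster, d)[, "sil_width"])  # silhouette curve
}

data.frame(k = ks, wss = wss, avg.sil = avg.sil)
ks[which.max(avg.sil)]   # k with the highest average silhouette width
```

The elbow method looks for the k after which wss stops dropping quickly, while the silhouette method simply takes the k with the largest average width, which is why the latter usually gives a clearer answer.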

Optimal Clustering For Basis of Purchase

Using the elbow method, silhouette method, and gap statistic to view optimal clusters.

fviz_nbclust(basis.purch.norm, kmeans, method = "wss")

fviz_nbclust(basis.purch.norm, kmeans, method = "silhouette")

fviz_nbclust(basis.purch.norm, kmeans, method = "gap_stat")

The elbow method again gives no clear conclusion; the other two methods suggest that k=2 or k=3 could be good for the number of clusters.

Optimal Clustering For Purchase Behavior and Basis of Purchase

Using the elbow method, silhouette method, and gap statistic to view optimal clusters.

fviz_nbclust(Soap.norm, kmeans, method = "wss")

fviz_nbclust(Soap.norm, kmeans, method = "silhouette")

fviz_nbclust(Soap.norm, kmeans, method = "gap_stat")

Looking at all of the data for purchase behavior and basis of purchase, the elbow method suggests k=4, the silhouette method k=2, and the gap statistic k=4. Looking back at the k=4 plot, the entire 4th cluster sits inside cluster 2. I would still lean towards k=4 because two of the three methods favor it.
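
The fviz_nbclust() calls above are not preceded by set.seed(), so the random k-means starts and the simulated reference data of the gap statistic can differ between runs. A sketch of a reproducible gap-statistic computation with cluster::clusGap() follows; the data are synthetic and the choices K.max = 6 and B = 50 are illustrative assumptions.

```r
library(cluster)  # clusGap() and maxSE()

set.seed(1)
toy <- scale(rbind(matrix(rnorm(100, mean = 0), ncol = 2),
                   matrix(rnorm(100, mean = 4), ncol = 2)))

set.seed(1)  # the gap statistic uses simulated reference data, so fix the seed
gap <- clusGap(toy, FUNcluster = kmeans, K.max = 6, B = 50, nstart = 25)

gap$Tab                                        # logW, E.logW, gap, SE.sim for k = 1..6
maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"])   # suggested k under the default rule
```

Fixing the seed immediately before the call makes it possible to rerun the report and get the same suggested k.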

Clusters for Purchase Behavior

Below are the means of each variable for the first cluster.

### Exploring Clusters

# Demographics (columns 2-11) plus purchase behavior (columns 12-31), unscaled
purch.behavior = Soap[,c(2:31)]
purch.behavior$Cluster = k2.0$cluster

#Partition into clusters

C1.0 = subset(purch.behavior, Cluster == "1")
C2.0 = subset(purch.behavior, Cluster == "2")

format(colMeans(C1.0), scientific = F)
##                   SEC                   FEH                    MT 
##      "    2.71753247"      "    1.90584416"      "    7.54220779" 
##                   SEX                   AGE                   EDU 
##      "    1.59415584"      "    3.16558442"      "    3.39935065" 
##                    HS                 CHILD                    CS 
##      "    3.90259740"      "    3.40584416"      "    0.86363636" 
##       Affluence Index         No. of Brands            Brand Runs 
##      "   13.01623377"      "    2.75974026"      "    9.02597403" 
##          Total Volume         No. of  Trans                 Value 
##      "10862.01298701"      "   21.89935065"      " 1059.32581169" 
##    Trans / Brand Runs              Vol/Tran            Avg. Price 
##      "    3.28863636"      "  492.24159091"      "   10.63211039" 
##  Pur Vol No Promo - %     Pur Vol Promo 6 % Pur Vol Other Promo % 
##      "    0.95474808"      "    0.01438278"      "    0.03086914" 
##       Br. Cd. 57, 144            Br. Cd. 55           Br. Cd. 272 
##      "    0.23491356"      "    0.20934939"      "    0.01133486" 
##           Br. Cd. 286            Br. Cd. 24           Br. Cd. 481 
##      "    0.03840560"      "    0.01041140"      "    0.01208363" 
##           Br. Cd. 352             Br. Cd. 5            Others 999 
##      "    0.04081926"      "    0.01284040"      "    0.42984191" 
##               Cluster 
##      "    1.00000000"

Below are the means of each variable for the second cluster.

format(colMeans(C2.0), scientific = F)
##                   SEC                   FEH                    MT 
##      "    2.27054795"      "    2.19863014"      "    8.84931507" 
##                   SEX                   AGE                   EDU 
##      "    1.89041096"      "    3.26369863"      "    4.72260274" 
##                    HS                 CHILD                    CS 
##      "    4.49657534"      "    3.05136986"      "    1.00342466" 
##       Affluence Index         No. of Brands            Brand Runs 
##      "   21.24315068"      "    4.56164384"      "   22.84589041" 
##          Total Volume         No. of  Trans                 Value 
##      "13025.21232877"      "   40.91438356"      " 1630.68325342" 
##    Trans / Brand Runs              Vol/Tran            Avg. Price 
##      "    1.91017123"      "  333.63123288"      "   13.10321918" 
##  Pur Vol No Promo - %     Pur Vol Promo 6 % Pur Vol Other Promo % 
##      "    0.86898145"      "    0.09476070"      "    0.03625785" 
##       Br. Cd. 57, 144            Br. Cd. 55           Br. Cd. 272 
##      "    0.12993093"      "    0.04496154"      "    0.05617157" 
##           Br. Cd. 286            Br. Cd. 24           Br. Cd. 481 
##      "    0.02924639"      "    0.02871958"      "    0.04052024" 
##           Br. Cd. 352             Br. Cd. 5            Others 999 
##      "    0.02728314"      "    0.02382510"      "    0.61919148" 
##               Cluster 
##      "    2.00000000"

Notice that there is a large difference in the number of Brand Runs (~9 vs ~23). The value is quite a bit greater for cluster 2 (~1631 vs. ~1059), with a higher average price of purchase as well (~13.1 vs. ~10.6). Note that cluster 2 has an overall higher promotional purchase rate of about 13.1% vs. cluster 1’s 4.5%. The number of transactions for cluster 2 is almost double that of cluster 1 (~41 vs. ~22).

Interesting observations: cluster 1 uses fewer brands overall (~2.8 vs. ~4.6) and has fewer brand runs, with more transactions per brand run (~3.3 vs. ~1.9). Cluster 2 households buy from many more brands and move between them more often, so they look loyal to a set of brands rather than to a single one. A question that comes to mind is: are there more people in the household? (Mean household size is ~4.5 for cluster 2 vs. ~3.9 for cluster 1.)
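
The promotional purchase rates can be recomputed from the printed cluster means. The short check below only adds the two promotion shares (Pur Vol Promo 6 % and Pur Vol Other Promo %) for each cluster; the variable names promo.c1 and promo.c2 are introduced here for illustration.

```r
# Promotional share = Promo 6 share + Other Promo share (means printed above)
promo.c1 <- 0.01438278 + 0.03086914
promo.c2 <- 0.09476070 + 0.03625785

round(100 * c(cluster1 = promo.c1, cluster2 = promo.c2), 1)
# cluster1 cluster2
#      4.5     13.1

# Cross-check against the no-promotion share: the three shares sum to 1
1 - c(cluster1 = 0.95474808, cluster2 = 0.86898145)
```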

Below are the sizes of the clusters, the cluster centers, and the total within-cluster sum of squares.

k2.0$size
## [1] 308 292
k2.0$centers  #note these are normalized
##   No. of Brands Brand Runs Total Volume No. of  Trans     Value
## 1    -0.5551189 -0.6469201   -0.1354834    -0.5310062 -0.314849
## 2     0.5855363  0.6823678    0.1429072     0.5601025  0.332101
##   Trans / Brand Runs   Vol/Tran Avg. Price Pur Vol No Promo - %
## 1          0.2575565  0.3102991 -0.3213184            0.3493170
## 2         -0.2716692 -0.3273017  0.3389249           -0.3684577
##   Pur Vol Promo 6 % Pur Vol Other Promo % Br. Cd. 57, 144 Br. Cd. 55
## 1        -0.4206724           -0.03643105        0.216099  0.3080240
## 2         0.4437230            0.03842727       -0.227940 -0.3249021
##   Br. Cd. 272 Br. Cd. 286 Br. Cd. 24 Br. Cd. 481 Br. Cd. 352   Br. Cd. 5
## 1  -0.2398908  0.03948456 -0.1116980  -0.1548665  0.05419250 -0.07872644
## 2   0.2530355 -0.04164810  0.1178185   0.1633524 -0.05716195  0.08304022
##   Others 999
## 1 -0.3099314
## 2  0.3269140
k2.0$tot.withinss
## [1] 10692.33

The cluster sizes are fairly even, and the clusters are separated quite nicely across many of the variables. Note that if we increase the number of clusters, the total within-cluster sum of squares does decrease, but not drastically.
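
To put the total within-cluster sum of squares in context, it can be compared with the total sum of squares. The arithmetic below assumes that all 20 purchase-behavior columns were scaled to unit variance by scale(), so each column contributes n - 1 to the total; with the fitted object the same ratio would be k2.0$betweenss / k2.0$totss.

```r
n <- 308 + 292        # households, from k2.0$size
p <- 20               # purchase-behavior variables used in k2.0
totss <- (n - 1) * p  # each scaled column has sum of squares n - 1

tot.withinss <- 10692.33            # k2.0$tot.withinss printed above
between.share <- 1 - tot.withinss / totss
round(between.share, 4)             # share of variation between the two clusters
```

Under that assumption roughly 11% of the total variation lies between the two clusters, which is consistent with the plots showing two groups that touch rather than two widely separated ones.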

Clusters for Basis of Purchase

Below are the means of each variable for the first cluster.

# Demographics (columns 2-11) plus basis of purchase (columns 32-46), unscaled
basis.purch = Soap[,c(2:11,32:46)]
basis.purch$Cluster = k2.1$cluster

#Partition into clusters

C1.1 = subset(basis.purch, Cluster == "1")
C2.1 = subset(basis.purch, Cluster == "2")


format(colMeans(C1.1), scientific = F)
##             SEC             FEH              MT             SEX             AGE 
##   " 1.89302326"   " 1.84651163"   " 7.64186047"   " 1.67906977"   " 3.25116279" 
##             EDU              HS           CHILD              CS Affluence Index 
##   " 4.60000000"   " 3.67906977"   " 3.37209302"   " 0.87906977"   "20.77674419" 
##        Pr Cat 1        Pr Cat 2        Pr Cat 3        Pr Cat 4       PropCat 5 
##   " 0.55901305"   " 0.40408272"   " 0.01208310"   " 0.02482113"   " 0.31030259" 
##       PropCat 6       PropCat 7       PropCat 8       PropCat 9      PropCat 10 
##   " 0.12026980"   " 0.19410519"   " 0.15067337"   " 0.03992765"   " 0.04482603" 
##      PropCat 11      PropCat 12      PropCat 13      PropCat 14      PropCat 15 
##   " 0.01994279"   " 0.01280167"   " 0.06227482"   " 0.01144191"   " 0.03343419" 
##         Cluster 
##   " 1.00000000"

Below are the means of each variable for the second cluster.

format(colMeans(C2.1), scientific = F)
##             SEC             FEH              MT             SEX             AGE 
##  " 2.838961039"  " 2.161038961"  " 8.477922078"  " 1.771428571"  " 3.192207792" 
##             EDU              HS           CHILD              CS Affluence Index 
##  " 3.732467532"  " 4.477922078"  " 3.155844156"  " 0.961038961"  "14.922077922" 
##        Pr Cat 1        Pr Cat 2        Pr Cat 3        Pr Cat 4       PropCat 5 
##  " 0.122686222"  " 0.542875839"  " 0.210193816"  " 0.124244123"  " 0.539177357" 
##       PropCat 6       PropCat 7       PropCat 8       PropCat 9      PropCat 10 
##  " 0.076717247"  " 0.042630920"  " 0.040764257"  " 0.025711585"  " 0.006523370" 
##      PropCat 11      PropCat 12      PropCat 13      PropCat 14      PropCat 15 
##  " 0.034630295"  " 0.002540059"  " 0.004086944"  " 0.206307523"  " 0.020910442" 
##         Cluster 
##  " 2.000000000"

Cluster 1 has most of its volume purchased under premium soaps and popular soaps (~56% and ~40%, together ~96%), while cluster 2 is mostly popular soaps (~54%). Thus, it seems like cluster 1 is less focused on price and more focused on popularity or the premium label.

Looking at purchases by selling proposition, “Beauty” (PropCat 5) has the largest share in both clusters, but cluster 2 favors it more, by about 23 percentage points (~54% vs. ~31%), and it also favors Carbolic (PropCat 14). Cluster 1 has a higher preference for PropCat 6, PropCat 7, and PropCat 8, which are Health, Herbal, and Freshness.

k2.1$size
## [1] 215 385
k2.1$centers  #note these are normalized
##     Pr Cat 1   Pr Cat 2   Pr Cat 3   Pr Cat 4  PropCat 5   PropCat 6  PropCat 7
## 1  0.9967576 -0.2858243 -0.4742713 -0.3328171 -0.4642392  0.16801648  0.4965058
## 2 -0.5566309  0.1596162  0.2648528  0.1858589  0.2592504 -0.09382739 -0.2772695
##    PropCat 8   PropCat 9 PropCat 10  PropCat 11 PropCat 12 PropCat 13
## 1  0.4622663  0.14521800  0.3206181 -0.09571174  0.2502437  0.3911774
## 2 -0.2581487 -0.08109576 -0.1790465  0.05344941 -0.1397465 -0.2184497
##   PropCat 14  PropCat 15
## 1 -0.4699908  0.09174436
## 2  0.2624624 -0.05123386
k2.1$tot.withinss
## [1] 8082.713

The cluster sizes are a little more skewed; the first cluster is noticeably smaller (215 vs. 385). The centers are fairly close on some of the variables (PropCat 9, 11, and 15), as the means above also show.
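
The per-cluster means above were produced by subsetting each cluster and calling colMeans(). As a sketch of an alternative, aggregate() can return one row of means per cluster in a single call; the small data frame below is hypothetical and its column names are illustrative.

```r
# Hypothetical mini data set: two numeric variables plus a cluster label
toy <- data.frame(
  Value   = c(900, 1100, 1500, 1700, 1000, 1600),
  Brands  = c(2, 3, 5, 4, 3, 5),
  Cluster = c(1, 1, 2, 2, 1, 2)
)

# One row of column means per cluster, replacing subset() + colMeans() per cluster
profile <- aggregate(. ~ Cluster, data = toy, FUN = mean)
profile
```

Having the clusters side by side in one table makes differences such as ~56% vs. ~12% in Pr Cat 1 easier to spot than scanning separate printouts.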

Combining Purchase Behavior and Basis of Purchase

We choose k=4 based on the above methods and visualizations. Listed below, in order, are the mean values of each variable for the 4 clusters, then the cluster sizes, the cluster centers, and the total within-cluster sum of squares.

# SoapCluster is not defined in the code shown; columns 2-46 are assumed here
# because they match the variables in the printed means (SEC through PropCat 15)
SoapCluster = Soap[,c(2:46)]
SoapCluster$Cluster = k4.2$cluster

#Partition into clusters

C1.2 = subset(SoapCluster, Cluster == "1")
C2.2 = subset(SoapCluster, Cluster == "2")
C3.2 = subset(SoapCluster, Cluster == "3")
C4.2 = subset(SoapCluster, Cluster == "4")

format(colMeans(C1.2), scientific = F)
##                   SEC                   FEH                    MT 
##    "    3.4705882353"    "    2.3382352941"    "    8.6911764706" 
##                   SEX                   AGE                   EDU 
##    "    1.7500000000"    "    3.0000000000"    "    2.5147058824" 
##                    HS                 CHILD                    CS 
##    "    4.7794117647"    "    3.2941176471"    "    1.0147058824" 
##       Affluence Index         No. of Brands            Brand Runs 
##    "    9.2647058824"    "    2.9264705882"    "    8.9117647059" 
##          Total Volume         No. of  Trans                 Value 
##    "14468.6029411765"    "   27.3529411765"    "  992.1014705882" 
##    Trans / Brand Runs              Vol/Tran            Avg. Price 
##    "    5.5048529412"    "  555.6422058824"    "    6.8422058824" 
##  Pur Vol No Promo - %     Pur Vol Promo 6 % Pur Vol Other Promo % 
##    "    0.9307966573"    "    0.0168661572"    "    0.0523371855" 
##       Br. Cd. 57, 144            Br. Cd. 55           Br. Cd. 272 
##    "    0.0288208121"    "    0.7621998982"    "    0.0019585298" 
##           Br. Cd. 286            Br. Cd. 24           Br. Cd. 481 
##    "    0.0089379349"    "    0.0029901375"    "    0.0041173348" 
##           Br. Cd. 352             Br. Cd. 5            Others 999 
##    "    0.0022367635"    "    0.0077038105"    "    0.1810347787" 
##              Pr Cat 1              Pr Cat 2              Pr Cat 3 
##    "    0.0536513749"    "    0.1114074551"    "    0.7966113492" 
##              Pr Cat 4             PropCat 5             PropCat 6 
##    "    0.0383298208"    "    0.1035686825"    "    0.0522436098" 
##             PropCat 7             PropCat 8             PropCat 9 
##    "    0.0065686732"    "    0.0040699554"    "    0.0184133419" 
##            PropCat 10            PropCat 11            PropCat 12 
##    "    0.0006637574"    "    0.0042723511"    "    0.0020614373" 
##            PropCat 13            PropCat 14            PropCat 15 
##    "    0.0029901375"    "    0.7894704542"    "    0.0156775996" 
##               Cluster 
##    "    1.0000000000"
format(colMeans(C2.2), scientific = F)
##                   SEC                   FEH                    MT 
##     "    2.420581655"     "    2.277404922"     "    9.167785235" 
##                   SEX                   AGE                   EDU 
##     "    1.961968680"     "    3.288590604"     "    4.794183445" 
##                    HS                 CHILD                    CS 
##     "    4.646532438"     "    2.984340045"     "    1.044742729" 
##       Affluence Index         No. of Brands            Brand Runs 
##     "   20.422818792"     "    3.874720358"     "   18.015659955" 
##          Total Volume         No. of  Trans                 Value 
##     "12629.087248322"     "   34.704697987"     " 1507.357740492" 
##    Trans / Brand Runs              Vol/Tran            Avg. Price 
##     "    2.340782998"     "  397.112416107"     "   12.436554810" 
##  Pur Vol No Promo - %     Pur Vol Promo 6 % Pur Vol Other Promo % 
##     "    0.907125363"     "    0.063117400"     "    0.029757238" 
##       Br. Cd. 57, 144            Br. Cd. 55           Br. Cd. 272 
##     "    0.207717969"     "    0.042142711"     "    0.036946364" 
##           Br. Cd. 286            Br. Cd. 24           Br. Cd. 481 
##     "    0.015357284"     "    0.017473600"     "    0.031841449" 
##           Br. Cd. 352             Br. Cd. 5            Others 999 
##     "    0.042768050"     "    0.021710743"     "    0.583943825" 
##              Pr Cat 1              Pr Cat 2              Pr Cat 3 
##     "    0.303942410"     "    0.543546123"     "    0.049017534" 
##              Pr Cat 4             PropCat 5             PropCat 6 
##     "    0.103493933"     "    0.494976879"     "    0.101319698" 
##             PropCat 7             PropCat 8             PropCat 9 
##     "    0.114148333"     "    0.090570142"     "    0.033899704" 
##            PropCat 10            PropCat 11            PropCat 12 
##     "    0.022138138"     "    0.036199984"     "    0.006099238" 
##            PropCat 13            PropCat 14            PropCat 15 
##     "    0.023414192"     "    0.046484561"     "    0.030749131" 
##               Cluster 
##     "    2.000000000"
format(colMeans(C3.2), scientific = F)
##                   SEC                   FEH                    MT 
##      "   2.016393443"      "   0.000000000"      "   0.000000000" 
##                   SEX                   AGE                   EDU 
##      "   0.000000000"      "   2.770491803"      "   0.000000000" 
##                    HS                 CHILD                    CS 
##      "   0.000000000"      "   5.000000000"      "   0.000000000" 
##       Affluence Index         No. of Brands            Brand Runs 
##      "   0.000000000"      "   2.524590164"      "   7.131147541" 
##          Total Volume         No. of  Trans                 Value 
##      "3484.180327869"      "  10.180327869"      " 450.881147541" 
##    Trans / Brand Runs              Vol/Tran            Avg. Price 
##      "   1.305901639"      " 363.824426230"      "  13.284590164" 
##  Pur Vol No Promo - %     Pur Vol Promo 6 % Pur Vol Other Promo % 
##      "   0.915312335"      "   0.039789310"      "   0.044898355" 
##       Br. Cd. 57, 144            Br. Cd. 55           Br. Cd. 272 
##      "   0.215260757"      "   0.103247243"      "   0.049418420" 
##           Br. Cd. 286            Br. Cd. 24           Br. Cd. 481 
##      "   0.018265489"      "   0.056675020"      "   0.011835254" 
##           Br. Cd. 352             Br. Cd. 5            Others 999 
##      "   0.013209642"      "   0.006242285"      "   0.525845890" 
##              Pr Cat 1              Pr Cat 2              Pr Cat 3 
##      "   0.410894855"      "   0.421080104"      "   0.109410812" 
##              Pr Cat 4             PropCat 5             PropCat 6 
##      "   0.058614229"      "   0.453368570"      "   0.078422521" 
##             PropCat 7             PropCat 8             PropCat 9 
##      "   0.088343093"      "   0.107452007"      "   0.027454539" 
##            PropCat 10            PropCat 11            PropCat 12 
##      "   0.035732279"      "   0.011835254"      "   0.014021988" 
##            PropCat 13            PropCat 14            PropCat 15 
##      "   0.068384622"      "   0.109410812"      "   0.005574316" 
##               Cluster 
##      "   3.000000000"
format(colMeans(C4.2), scientific = F)
##                   SEC                   FEH                    MT 
##     "    2.458333333"     "    2.166666667"     "    9.083333333" 
##                   SEX                   AGE                   EDU 
##     "    1.958333333"     "    3.541666667"     "    4.666666667" 
##                    HS                 CHILD                    CS 
##     "    4.708333333"     "    3.208333333"     "    0.958333333" 
##       Affluence Index         No. of Brands            Brand Runs 
##     "   18.875000000"     "    4.041666667"     "   14.875000000" 
##          Total Volume         No. of  Trans                 Value 
##     "12802.500000000"     "   29.083333333"     " 1403.179166667" 
##    Trans / Brand Runs              Vol/Tran            Avg. Price 
##     "    2.931250000"     "  481.021666667"     "   11.085833333" 
##  Pur Vol No Promo - %     Pur Vol Promo 6 % Pur Vol Other Promo % 
##     "    0.966322075"     "    0.013020710"     "    0.020657215" 
##       Br. Cd. 57, 144            Br. Cd. 55           Br. Cd. 272 
##     "    0.098023038"     "    0.026788142"     "    0.009604473" 
##           Br. Cd. 286            Br. Cd. 24           Br. Cd. 481 
##     "    0.490924591"     "    0.005067665"     "    0.013275419" 
##           Br. Cd. 352             Br. Cd. 5            Others 999 
##     "    0.019325102"     "    0.012601325"     "    0.324390246" 
##              Pr Cat 1              Pr Cat 2              Pr Cat 3 
##     "    0.118619306"     "    0.819094709"     "    0.032000710" 
##              Pr Cat 4             PropCat 5             PropCat 6 
##     "    0.030285276"     "    0.764396707"     "    0.073662641" 
##             PropCat 7             PropCat 8             PropCat 9 
##     "    0.053567392"     "    0.032201235"     "    0.016808472" 
##            PropCat 10            PropCat 11            PropCat 12 
##     "    0.001189221"     "    0.017770884"     "    0.000350140" 
##            PropCat 13            PropCat 14            PropCat 15 
##     "    0.005067665"     "    0.031323203"     "    0.003662441" 
##               Cluster 
##     "    4.000000000"
k4.2$size
## [1]  68 447  61  24
k4.2$centers  #note these are normalized
##     Member id         SEC        FEH         MT         SEX        AGE
## 1 -0.99660572  0.86739677  0.2554282  0.1194147  0.01798996 -0.2464889
## 2  0.09245715 -0.07097471  0.2018315  0.2303923  0.34484503  0.0869535
## 3  0.33338888 -0.43219025 -1.8047556 -1.9043115 -2.68050476 -0.5116665
## 4  0.25433839 -0.03723673  0.1042617  0.2107278  0.33923934  0.3793618
##          EDU         HS       CHILD          CS Affluence Index No. of Brands
## 1 -0.6980023  0.2555313  0.04994149  0.16366538      -0.6796880    -0.4495739
## 2  0.3428534  0.1977600 -0.20457740  0.22286622       0.2982292     0.1506946
## 3 -1.8462679 -1.8223924  1.45152536 -1.83625981      -1.4916636    -0.7039754
## 4  0.2846266  0.2246289 -0.02054045  0.05255842       0.1625756     0.2563763
##    Brand Runs Total Volume No. of  Trans       Value Trans / Brand Runs
## 1 -0.65790552   0.32866279    -0.2180717 -0.39096736          1.1084151
## 2  0.21776536   0.09192829     0.2037822  0.19245910         -0.1063471
## 3 -0.82917667  -1.08496568    -1.2034598 -1.00379302         -0.5036628
## 4 -0.08432341   0.11424546    -0.1187794  0.07449734          0.1203474
##      Vol/Tran Avg. Price Pur Vol No Promo - % Pur Vol Promo 6 %
## 1  0.56516439 -1.3339241           0.14886931        -0.3939659
## 2 -0.07211252  0.1608021          -0.04923405         0.1034265
## 3 -0.20592755  0.3873845           0.01928214        -0.1474471
## 4  0.26519586 -0.2000904           0.44617897        -0.4353204
##   Pur Vol Other Promo % Br. Cd. 57, 144 Br. Cd. 55 Br. Cd. 272 Br. Cd. 286
## 1            0.26179643      -0.6555996  2.4366090 -0.34297239  -0.2215413
## 2           -0.05187722       0.1010714 -0.3357551  0.04167722  -0.1646784
## 3            0.15845850       0.1329747 -0.1004905  0.17879263  -0.1389174
## 4           -0.17829193      -0.3628990 -0.3948733 -0.25891434   4.0479181
##    Br. Cd. 24 Br. Cd. 481 Br. Cd. 352   Br. Cd. 5 Others 999    Pr Cat 1
## 1 -0.20473308 -0.24401301 -0.26320534 -0.15437056 -1.1467522 -0.80240522
## 2 -0.02316431  0.06623267  0.07022421  0.05190302  0.2083644  0.08866816
## 3  0.46827575 -0.15764593 -0.17293725 -0.17589376  0.0129618  0.46943479
## 4 -0.17868860 -0.14152980 -0.12262859 -0.08224715 -0.6646003 -0.57110971
##     Pr Cat 2   Pr Cat 3    Pr Cat 4   PropCat 5   PropCat 6   PropCat 7
## 1 -1.2251317  2.4526968 -0.26234414 -1.11773989 -0.24096622 -0.46148609
## 2  0.1617674 -0.3364740  0.07760803  0.11952955  0.05408589  0.08806258
## 3 -0.2312732 -0.1111549 -0.15652318 -0.01199731 -0.08357505 -0.04375821
## 4  1.0461087 -0.3999614 -0.30431142  0.97118499 -0.11219210 -0.22140276
##    PropCat 8   PropCat 9  PropCat 10  PropCat 11   PropCat 12  PropCat 13
## 1 -0.4986668 -0.19728072 -0.25548610 -0.25485532 -0.157936401 -0.22994082
## 2  0.0683111  0.04925563  0.02465078  0.06939063 -0.004480698 -0.01596054
## 3  0.1789657 -0.05334868  0.20198863 -0.17804911  0.296621646  0.45518906
## 4 -0.3142763 -0.22282957 -0.24863134 -0.11776891 -0.222973870 -0.20817482
##   PropCat 14  PropCat 15
## 1  2.4544320 -0.11097497
## 2 -0.3382738  0.06109023
## 3 -0.1017491 -0.22631983
## 4 -0.3952617 -0.24814688
k4.2$tot.withinss
## [1] 22123.18

The clusters differ in size. Below, tables compare a few of the key variables across the clusters to show some relationships.

Looking at the reported sex in each cluster.
SEX Sex of homemaker
0 Not specified
1 Male
2 Female

table(C1.2$SEX)
## 
##  0  1  2 
##  7  3 58
table(C2.2$SEX)
## 
##   1   2 
##  17 430
table(C3.2$SEX)
## 
##  0 
## 61
table(C4.2$SEX)
## 
##  1  2 
##  1 23

An interesting observation is that the 3rd cluster consists only of ‘not specified’ households.
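
The four separate table() calls can be collapsed into a single cross-tabulation. The sketch below rebuilds the labels from the counts printed above so that it runs on its own; with the real data the equivalent call would be table(SoapCluster$Cluster, SoapCluster$SEX).

```r
# Rebuild the cluster and SEX labels from the counts printed above
counts  <- c(7, 3, 58, 17, 430, 61, 1, 23)
cluster <- rep(c(1, 1, 1, 2, 2, 3, 4, 4), times = counts)
sex     <- rep(c(0, 1, 2, 1, 2, 0, 1, 2), times = counts)

tab <- table(Cluster = cluster, SEX = sex)
tab

# Row percentages: share of each SEX code within a cluster
round(100 * prop.table(tab, margin = 1), 1)
```

The row percentages make it immediate that cluster 3 is 100% code 0, while the other clusters are mostly code 2.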

Below we compare the reported age in each cluster.
AGE Age of homemaker
1 Up to 24
2 25-34
3 35-44
4 45+

table(C1.2$AGE)
## 
##  1  2  3  4 
##  4 20 16 28
table(C2.2$AGE)
## 
##   1   2   3   4 
##   9  82 127 229
table(C3.2$AGE)
## 
##  1  2  3  4 
##  2 24 21 14
table(C4.2$AGE)
## 
##  2  3  4 
##  3  5 16

There appear to be no clear age trends across the clusters.

Below we compare the reported presence of children in each cluster.
CHILD Presence of children in household
1 Children up to age 6 present (only)
2 Children 7-14 present (only)
3 Both
4 None
5 Not specified

table(C1.2$CHILD)
## 
##  1  2  3  4  5 
##  6 16  5 34  7
table(C2.2$CHILD)
## 
##   1   2   3   4 
##  52 122  54 219
table(C3.2$CHILD)
## 
##  5 
## 61
table(C4.2$CHILD)
## 
##  1  2  3  4 
##  1  7  2 14

Below we compare the reported educational status in each cluster.
EDU Education of homemaker
1 Illiterate
2 Literate, but no formal schooling
3 Up to 4 years of school
4 5-9 years of school
5 10-12 years of school
6 Some college
7 College graduate
8 Some graduate school
9 Graduate or professional school degree
0 Not specified

table(C1.2$EDU)
## 
##  0  1  2  3  4  5  7 
## 10 21  1  8 18  9  1
table(C2.2$EDU)
## 
##   0   1   2   3   4   5   6   7   8   9 
##   2  27   7  22 111 174  21  69  13   1
table(C3.2$EDU)
## 
##  0 
## 61
table(C4.2$EDU)
## 
## 1 2 3 4 5 6 7 9 
## 1 1 3 7 6 2 3 1

Below we compare the reported socioeconomic status in each cluster.
SEC Socioeconomic class (1 = high, 4 = low)
1 A
2 B
3 C
4 D/E

table(C1.2$SEC)
## 
##  1  2  3  4 
##  1 11 11 45
table(C2.2$SEC)
## 
##   1   2   3   4 
## 115 121 119  92
table(C3.2$SEC)
## 
##  1  2  3  4 
## 29 12 10 10
table(C4.2$SEC)
## 
##  1  2  3  4 
##  5  6 10  3

Below we compare the reported affluence index in each cluster.

table(C1.2$'Affluence Index')
## 
##  0  1  2  4  5  6  7  8  9 10 11 12 13 14 15 17 18 19 23 25 
##  8  1  3  6  4  2  3  3  1  7  7  3  4  4  3  3  2  1  2  1
table(C2.2$'Affluence Index')
## 
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 
##  5  3  1  4  2  3  5  8 12 19 21 17 27 17 28 22 16 19 24 10 13 12  9  8 10 14 
## 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 53 
## 10 11  8 11 11  6  5  5  8  6  4  8  4  3  2  1  1  3  3  1  1  2  1  1  1  1
table(C3.2$'Affluence Index')
## 
##  0 
## 61
table(C4.2$'Affluence Index')
## 
##  6 11 12 13 14 15 16 17 20 25 32 37 38 
##  1  1  4  1  3  3  2  2  1  1  2  1  2

Below we compare the reported household size in each cluster.
HS Household size (number of people in the household)

table(C1.2$HS)
## 
##  0  2  3  4  5  6  7  8  9 10 15 
##  7  4  6  8 23  8  3  5  2  1  1
table(C2.2$HS)
## 
##   1   2   3   4   5   6   7   8   9  10  12  15 
##   2  35  64 130 114  55  18  12  11   3   1   2
table(C3.2$HS)
## 
##  0 
## 61
table(C4.2$HS)
## 
##  2  3  4  5  6  7  8 12 
##  2  3  9  5  2  1  1  1

Comparing all of the above tables, we can see that the size of the clusters are very different. An interesting observation is that cluster 3 takes almost all of the unspecified results, thus not really know a lot about that specific cluster. Below there are a few plots that show cluster 3 and where they are clumped. They also have varying degree of value, but the majority of those are of less value.

We can also see that the total volume for each cluster varies, Cluster 1 has the largest volume of 14468, but the value is 992, with about 7% of people using promotions to influence their purchases and seem to more loyal to a specific brand. Cluster 2 has a total volume of 12629 and the high value of 1507. These customers use more brands, but have a lot high brand run, so appear to be loyal, but to multiple different brands (note that these customers have in general 3-6 members in the household). The 2nd cluster is more motivated by promotions with about 10% of the purchases using promo codes and of that 10%, about 6% is using the promo 6 code. i.e. they are buying more volume to get that promo code. Cluster 3 is the unspecified cluster and has the total volume and value. Cluster 4 appears to have people focusing in on about 4 brands on average. The volume is high and the value is high with low promotion usage. These seem to be the customers who will buy the product no matter what, interestingly the affluence index is lower than that of cluster 2.

Shown below are some interesting plots that compare value to brand runs, to the affluence index, and to household size.

ggplot(SoapCluster, aes(x = `Brand Runs`, y = Value, color = factor(Cluster))) + geom_point()

ggplot(SoapCluster, aes(x = `Affluence Index`, y = Value, color = factor(Cluster))) + geom_point()

ggplot(SoapCluster, aes(x = HS, y = Value, color = factor(Cluster))) + geom_point()

From the above clusters, which use both purchase behavior and basis of purchase, I’d consider cluster 2 the “success” segment because it has a higher response to promotional purchases, so this is the group we’d want to send promotions to. It also has the highest value and, being by far the largest cluster (447 of 600 households), likely accounts for most of the volume. Thus we should apply predictive techniques based on the given demographics in order to classify our consumers into the different clusters.

Predictive Modeling

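We use logistic regression to predict the cluster (market segment) from the given demographics. Note that glm() is called here without a family argument, so the model actually fit is a Gaussian linear model on the numeric cluster label, and its predictions are rounded to the nearest cluster number. Below are both the lift chart and the confusion matrix for this model.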

#Partition into 60-40 
set.seed(1)

Cluster2.2 = Soap
Cluster2.2$Cluster =  k4.2$cluster
#Cluster2 = subset(Cluster2.2, Cluster == "2")
Cluster2Value <- Cluster2.2[,c(2:11,47)] #demographics w/ Cluster
n = nrow(Cluster2Value)

ind = createDataPartition(1:n,p = 0.6)[[1]]

train.data = Cluster2Value[ind, ]

valid.data = Cluster2Value[-ind, ]

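#Fit the model. Without a family argument, glm() fits a Gaussian linear model,
#so Cluster is treated as a numeric response rather than as a class label.
model1 = glm(Cluster ~ ., data = train.data)

#For a Gaussian model, type = "response" returns the fitted numeric value.
predicted.prob = predict(model1, newdata = valid.data, type = "response")

#Round to the nearest integer to obtain a cluster label.
predicted.label = round(predicted.prob,0)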

observed.label = valid.data$Cluster

pp4 = predicted.prob 
actual = observed.label 

plot.lift(observed.label, predicted.prob) 

CM1 = confusionMatrix(factor(predicted.label), factor(valid.data$Cluster))
CM1
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   1   2   3   4
##          1   2   1   0   0
##          2  24 174   0   8
##          3   4   2  25   0
##          4   0   0   0   0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8375          
##                  95% CI : (0.7846, 0.8818)
##     No Information Rate : 0.7375          
##     P-Value [Acc > NIR] : 0.0001547       
##                                           
##                   Kappa : 0.5383          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 4
## Sensitivity          0.066667   0.9831   1.0000  0.00000
## Specificity          0.995238   0.4921   0.9721  1.00000
## Pos Pred Value       0.666667   0.8447   0.8065      NaN
## Neg Pred Value       0.881857   0.9118   1.0000  0.96667
## Prevalence           0.125000   0.7375   0.1042  0.03333
## Detection Rate       0.008333   0.7250   0.1042  0.00000
## Detection Prevalence 0.012500   0.8583   0.1292  0.00000
## Balanced Accuracy    0.530952   0.7376   0.9860  0.50000
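
The `glm(Cluster ~ ., data = train.data)` call above has no `family` argument, so R falls back to the Gaussian family. If the aim is to flag the "success" segment (cluster 2), one option is to recode the target as a 0/1 indicator and pass `family = binomial`, which is what makes the fit a logistic regression. The following is only a sketch on simulated data: the columns `HS`, `Affluence`, and `IsCluster2`, and the rule that generates the labels, are all invented for illustration.

```r
# Simulated stand-in for the demographic data; all values are invented.
set.seed(1)
n   <- 600
toy <- data.frame(
  HS        = sample(1:8, n, replace = TRUE),
  Affluence = round(runif(n, 0, 50))
)
# Invented rule: larger, more affluent households are more often "cluster 2".
p              <- plogis(-3 + 0.4 * toy$HS + 0.05 * toy$Affluence)
toy$IsCluster2 <- rbinom(n, size = 1, prob = p)

# 60/40 split, mirroring the partition used above.
ind   <- sample(seq_len(n), size = round(0.6 * n))
train <- toy[ind, ]
valid <- toy[-ind, ]

# family = binomial is what makes this a logistic regression.
fit  <- glm(IsCluster2 ~ HS + Affluence, data = train, family = binomial)
prob <- predict(fit, newdata = valid, type = "response")  # P(cluster 2)
pred <- as.integer(prob > 0.5)

# Rows are predictions, columns are observed labels.
print(table(Predicted = pred, Observed = valid$IsCluster2))
```

With a binary target, `type = "response"` returns probabilities between 0 and 1, which is the kind of score a lift chart is designed to rank.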

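We use a random forest model to predict the cluster (market segment) from the given demographics. Because Cluster is numeric, randomForest() runs in regression mode, and its predictions are rounded to the nearest cluster number. Below are both the lift chart and the confusion matrix for this model.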

colnames(train.data)[10]="AffluenceIndex"  #rename column, randomForest was having an error with the space
colnames(valid.data)[10]="AffluenceIndex"
set.seed(1)
rf = randomForest(Cluster ~ .,
                  data = train.data,
                  ntree = 5000, # Number of trees to grow.
                  mtry = 5, # Number of variables randomly sampled as candidates at each split.
                  nodesize = 25, # Minimum size of terminal nodes.
                  maxnodes = NULL,
                  importance = TRUE,
                  proximity = TRUE
)

#Cluster is numeric, so the forest predicts a number; round it to a cluster label.
rf.pred = predict(rf, valid.data)
rf.pred = round(rf.pred,0)

plot.lift(observed.label, rf.pred) 

CM2 = confusionMatrix(factor(rf.pred), factor(valid.data$Cluster))
CM2
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   1   2   3   4
##          1   2   2   0   0
##          2  24 174   0   8
##          3   4   1  25   0
##          4   0   0   0   0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8375          
##                  95% CI : (0.7846, 0.8818)
##     No Information Rate : 0.7375          
##     P-Value [Acc > NIR] : 0.0001547       
##                                           
##                   Kappa : 0.5382          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 4
## Sensitivity          0.066667   0.9831   1.0000  0.00000
## Specificity          0.990476   0.4921   0.9767  1.00000
## Pos Pred Value       0.500000   0.8447   0.8333      NaN
## Neg Pred Value       0.881356   0.9118   1.0000  0.96667
## Prevalence           0.125000   0.7375   0.1042  0.03333
## Detection Rate       0.008333   0.7250   0.1042  0.00000
## Detection Prevalence 0.016667   0.8583   0.1250  0.00000
## Balanced Accuracy    0.528571   0.7376   0.9884  0.50000
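
The headline statistics in both confusion matrices can be reproduced by hand, which makes it easier to see how similar the two models are. In this sketch the cell counts are copied from the printed output above, and `summarise` is a helper name introduced here for illustration:

```r
# Confusion matrices copied from the output above
# (rows = prediction, columns = reference).
labels <- as.character(1:4)
cm.glm <- matrix(c( 2,   1,  0, 0,
                   24, 174,  0, 8,
                    4,   2, 25, 0,
                    0,   0,  0, 0),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(Prediction = labels, Reference = labels))

# The random forest matrix differs in only two cells.
cm.rf <- cm.glm
cm.rf["1", "2"] <- 2
cm.rf["3", "2"] <- 1

summarise <- function(cm) {
  list(
    accuracy    = sum(diag(cm)) / sum(cm),      # correct / total
    sensitivity = diag(cm) / colSums(cm),       # recall per reference class
    nir         = max(colSums(cm)) / sum(cm)    # always guess the largest class
  )
}

print(summarise(cm.glm))
print(summarise(cm.rf))
```

Both matrices have 201 correct predictions out of 240, so the accuracy is 0.8375 in each case, and the no-information rate of 0.7375 is simply the share of cluster 2 in the validation set (177 of 240).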

Summary

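Comparing the two models, we can see that both show only a very mild lift. The random forest model has about a 1.3x lift out to the 12th percentile, and the logistic model has about a 1.4x lift out to about the 10th percentile. The confusion matrices are nearly identical: both models reach an accuracy of 0.8375 against a no-information rate of 0.7375, both identify only 2 of the 30 cluster 1 households (sensitivity 0.067), and neither ever predicts cluster 4. With so little separating them, the random forest is probably the better choice because its lift extends farther into the data.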
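
For reference, a lift figure such as "1.4x at the 10th percentile" is the response rate among the top-scored 10% of households divided by the overall response rate. The sketch below computes that ratio for a binary target. It is independent of the `plot.lift` helper used above, the function name `lift.at` is introduced here for illustration, and the labels and scores are simulated:

```r
# Cumulative lift at the top `frac` of households, ranked by model score.
# `actual` is a 0/1 indicator; `score` is any ranking score.
lift.at <- function(actual, score, frac) {
  k   <- ceiling(frac * length(actual))
  top <- order(score, decreasing = TRUE)[seq_len(k)]
  mean(actual[top]) / mean(actual)
}

# Simulated example; labels and scores are invented.
set.seed(1)
actual <- rbinom(240, size = 1, prob = 0.3)
score  <- actual + rnorm(240)   # noisy but informative score

fracs <- c(0.10, 0.12, 0.50, 1.00)
print(sapply(fracs, function(f) lift.at(actual, score, f)))
```

By construction the lift at 100% of the data is exactly 1, so a model is only useful for targeting over the range where the curve stays above 1.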