Using Hierarchical Clustering for Market Segmentation

Notice:

No package is needed for this analysis.

Keep in mind that no programmer can avoid errors. I strongly agree with this quote from Codecademy: “Errors in your code mean you’re trying to do something cool.”

https://news.codecademy.com/errors-in-code-think-differently/

Segmentation

Objective: divide the target market or customers on the basis of significant features, so that a company can sell more products with lower marketing expenses.

Market segmentation

Market segmentation is a strategy that divides a broad target market of customers into smaller, more similar groups, and then designs a marketing strategy specifically for each group. Clustering is a common technique for market segmentation since it automatically finds similar groups given a data set.

Create a product that speaks to the needs and wants of the target market

For example: iPhone and iPad

Once the product is created, the ball is in the marketing team’s court. As mentioned above, the team uses market segmentation techniques so that the product is positioned to the segment of customers with a high propensity to buy.

Examples of Objectives

1. Identify the type of customers who would respond to a particular offer

2. Identify high spenders among customers who will use the e-commerce channel for festive shopping

3. Identify customers who will default on their credit obligation for a loan or credit card

Example

The dataset (segmentation.csv) contains information on consumers’ perceptions of a brand in the apparel industry. The purpose of the case analysis is to gain a better understanding of the consumer segments for the brand, in the hope that this understanding will allow the brand to develop effective segment- or product-specific advertising campaigns.

Technical introduction - Why use hierarchical clustering instead of K-means?

In this tutorial, our goal is to present an intuitive segmentation example using R and its hclust function. We will divide the target market or customers on the basis of significant features, which could help a company sell more products with lower marketing expenses. The dataset (segmentation.csv) is available in the course folder. The hclust function is part of the stats package, which ships with R and is loaded by default, so nothing extra needs to be installed.

Agglomerative hierarchical cluster analysis is commonly used because of the stability of its solution. Briefly, the algorithm works as follows:

• Assign each data point to its own cluster.
• Identify the closest two clusters and combine them into one cluster.
• Repeat the previous step until all the data points are in a single cluster.

K-means cluster analysis differs from hierarchical cluster analysis in two ways. First, K-means might offer a different solution every time you change the ordering of your data. Second, K-means solutions are not nested: if you try a 3-cluster solution first and a 4-cluster solution next, the structure that the 3-cluster solution revealed is probably gone. In a hierarchical solution, the 4-cluster result is always obtained by splitting one of the 3 clusters.
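The merge steps and the nesting of hierarchical solutions can be traced on a tiny example. The sketch below uses made-up data (five hypothetical customers, a to e, each with a single rating); it is only an illustration and does not use the course dataset.

```r
# Toy data (hypothetical): five customers scored on a single rating
x <- c(a = 1.0, b = 1.5, c = 5.0, d = 5.2, e = 9.0)

# hclust() starts with every point in its own cluster and merges the
# two closest clusters at each step until only one cluster remains.
hc <- hclust(dist(x))

# Each row of hc$merge is one merge step; negative numbers are single
# observations, positive numbers refer to clusters formed in earlier rows.
print(hc$merge)
print(hc$height)   # distance at which each merge happened

# Cutting the same tree at 3 and at 4 clusters gives nested solutions:
# every 4-cluster group sits inside exactly one 3-cluster group.
k3 <- cutree(hc, k = 3)
k4 <- cutree(hc, k = 4)
nested <- table(k3, k4)
print(nested)
```

In the cross-table, each column (a 4-cluster group) has a single non-zero row, which is what "nested" means here.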

Questions for you

1. How many clusters do we have?

2. How many observations do you have in each cluster, respectively?

3. List the cluster member IDs for each cluster.

4. What are common characteristics of the customers in Group 1?

# Set your own working directory. We had several in-class tutorials about this;
# a quick Google search also helps if you are not sure how to do it yet.
setwd("C:/Users/zxu3/Documents/R/segmentation")

# Read the CSV file so that the data can be used in the steps below.
mydata = read.csv("Segmentation.csv")

# Open the data. Note that some students will see an Excel option in "Import Dataset";
# those who do not will need to save the original data as a TXT (tab-delimited) file
# and import that as a text file.
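After importing, it is worth checking that the data arrived as expected. The sketch below writes a tiny stand-in file to a temporary location and reads it back; the file contents and the name `toy` are made up for illustration and are not the course data.

```r
# Write a tiny stand-in file (hypothetical contents) to a temporary location
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(ID = 1:3, Price = c(5, 4, 6), Fabric = c(2, 6, 5)),
          tmp, row.names = FALSE)

toy <- read.csv(tmp)

str(toy)            # column names and types
print(dim(toy))     # number of rows and columns
print(anyNA(toy))   # TRUE if any value is missing
```

The same three checks can be run on mydata once it has been read in.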

# In the following step, you will standardize your data (i.e., rescale each
# variable to a mean of 0 and a standard deviation of 1).
# You can use the scale function from base R, a generic function whose
# default method centers and/or scales the columns of a numeric matrix.
# The first column (ID) is dropped because it should not enter the analysis.

use = scale(mydata[,-c(1)], center = TRUE, scale = TRUE)
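To see what scale does, here is a minimal sketch on a made-up data frame (the object `toy` and its values are hypothetical). It checks that each standardized column ends up with mean 0 and standard deviation 1.

```r
# Toy data frame (hypothetical values); the first column is an ID
toy <- data.frame(ID = 1:5,
                  Price = c(2, 4, 4, 5, 7),
                  Fabric = c(10, 20, 30, 40, 50))

# Drop the ID column, then center each column on its mean and divide by its sd
toy.std <- scale(toy[, -1], center = TRUE, scale = TRUE)

print(round(colMeans(toy.std), 10))   # column means after scaling
print(apply(toy.std, 2, sd))          # column standard deviations after scaling

# The centering and scaling values are kept as attributes
print(attr(toy.std, "scaled:center"))
print(attr(toy.std, "scaled:scale"))
```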

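dist.use = dist(use)   # Euclidean distances between the standardized observations

# Note: the next line computes distances between the rows of the distance matrix,
# not between the observations themselves. The output shown below comes from this
# call; the more common form is hclust(dist(use)), which may group the data differently.
d <- dist(as.matrix(dist.use))   # find distance matrix
seg.hclust <- hclust(d)          # apply hierarchical clustering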
plot(seg.hclust)
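Both dist and hclust rely on default settings when called with a single argument. The sketch below, on made-up data (the object names are hypothetical), makes those defaults visible and shows how a different linkage rule could be requested.

```r
# Toy data (hypothetical): standardized scores for six customers
set.seed(1)
toy <- scale(matrix(rnorm(12), ncol = 2))

d.toy <- dist(toy)                                  # Euclidean distance is the default
hc.default  <- hclust(d.toy)                        # complete linkage is the default
hc.complete <- hclust(d.toy, method = "complete")
hc.average  <- hclust(d.toy, method = "average")

print(attr(d.toy, "method"))
print(hc.default$method)
print(identical(hc.default$merge, hc.complete$merge))

# Later merge heights generally differ between linkage rules,
# even though both trees start from the same distances.
print(round(rbind(complete = hc.default$height,
                  average  = hc.average$height), 2))
```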

groups.3 = cutree(seg.hclust,3)
# A good first step is to use the table function to see how
# many observations are in each cluster.
table(groups.3)

## groups.3
##   1   2   3 
##   9 170  42
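cutree can cut the tree either at a chosen number of clusters (k) or at a chosen height on the dendrogram (h). A minimal sketch on made-up data, with hypothetical object names:

```r
# Toy data (hypothetical): six observations in two well-separated groups
toy <- matrix(c(1, 1,
                1, 2,
                2, 1,
                8, 8,
                8, 9,
                9, 8), ncol = 2, byrow = TRUE)
toy.hc <- hclust(dist(toy))

# Cut by number of clusters (k) or by height on the dendrogram (h)
by.k <- cutree(toy.hc, k = 2)
by.h <- cutree(toy.hc, h = 5)

print(table(by.k))               # cluster sizes
print(prop.table(table(by.k)))   # cluster sizes as shares of the sample
print(all(by.k == by.h))         # do both cuts give the same grouping?
```

Reporting shares alongside counts makes it easier to spot very small clusters.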

groups.3   # cluster assignment of each observation (first row: observation, second row: cluster)

##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20 
##   1   2   1   3   2   2   2   2   3   2   3   3   3   2   2   1   3   2   3   2 
##  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40 
##   2   2   2   2   2   2   2   2   2   2   3   2   2   3   2   2   2   2   2   2 
##  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60 
##   2   2   2   2   2   2   2   3   2   3   2   2   3   2   2   2   3   3   1   2 
##  61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80 
##   2   2   2   2   3   3   2   2   2   2   2   2   2   2   3   2   3   2   2   2 
##  81  82  83  84  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100 
##   3   2   2   2   2   3   3   2   2   2   2   2   2   2   3   2   2   1   2   3 
## 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 
##   1   3   2   3   3   2   2   2   3   2   3   3   2   2   2   3   2   3   2   2 
## 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 
##   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2 
## 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 
##   2   2   1   2   2   2   2   2   3   2   2   3   2   2   3   2   2   2   2   2 
## 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 
##   2   2   2   2   2   2   1   2   3   1   2   3   2   2   2   2   3   2   2   2 
## 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 
##   2   2   2   2   2   2   2   2   2   2   3   2   2   3   2   2   2   2   2   3 
## 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 
##   2   2   2   2   2   3   2   2   2   2   2   2   2   2   3   2   2   2   2   2 
## 221 
##   2

#In the following step, we will find the members in each cluster or group.
mydata$ID[groups.3 == 1]

## [1]   1   3  16  59  98 101 143 167 170

mydata$ID[groups.3 == 2]

##   [1]   2   5   6   7   8  10  14  15  18  20  21  22  23  24  25  26  27  28
##  [19]  29  30  32  33  35  36  37  38  39  40  41  42  43  44  45  46  47  49
##  [37]  51  52  54  55  56  60  61  62  63  64  67  68  69  70  71  72  73  74
##  [55]  76  78  79  80  82  83  84  85  88  89  90  91  92  93  94  96  97  99
##  [73] 103 106 107 108 110 113 114 115 117 119 120 121 122 123 124 125 126 127
##  [91] 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 144 145 146
## [109] 147 148 150 151 153 154 156 157 158 159 160 161 162 163 164 165 166 168
## [127] 171 173 174 175 176 178 179 180 181 182 183 184 185 186 187 188 189 190
## [145] 192 193 195 196 197 198 199 201 202 203 204 205 207 208 209 210 211 212
## [163] 213 214 216 217 218 219 220 221

mydata$ID[groups.3 == 3]

##  [1]   4   9  11  12  13  17  19  31  34  48  50  53  57  58  65  66  75  77  81
## [20]  86  87  95 100 102 104 105 109 111 112 116 118 149 152 155 169 172 177 191
## [39] 194 200 206 215
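Instead of indexing once per cluster, the member lists can be produced in one call with split. The sketch below uses short made-up vectors (`ids` and `groups` are hypothetical stand-ins for mydata$ID and groups.3).

```r
# Toy stand-ins (hypothetical) for mydata$ID and groups.3
ids    <- 1:8
groups <- c(1, 2, 2, 3, 2, 1, 3, 2)

# split() returns one vector of IDs per cluster in a single call
members <- split(ids, groups)
print(members)

# lengths() gives the cluster sizes, matching table(groups)
print(lengths(members))
```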

# We can look at the medians (or means) of the variables in each cluster.
# Question for you: why would we do that, and how can it be done?
# You could do it in Excel, but that is too tedious.
# The aggregate function is well suited for this task.
# To see how aggregate works, run ?aggregate and read the help page.
# Do you think the mean or the median should be used? Why?
?aggregate

## starting httpd help server ... done

aggregate(mydata,list(groups.3),median)

##   Group.1    ID Fashn Price Convnience ShpTime Fitness Perceptn ChNoise
## 1       1  98.0     1     5          5       5       3        3       6
## 2       2 122.5     3     5          3       4       6        5       4
## 3       3  91.0     2     5          4       5       6        4       5
##   RetailEx KnowdgStaf Brand4Slf Brand4Els Populr StrDisp SaleStaf Fabric Cut
## 1      6.0          6         1         1    1.0       1        2    1.0   4
## 2      6.0          5         4         4    4.0       4        5    6.0   6
## 3      5.5          5         2         2    1.5       2        4    5.5   6
##   Seam ShpOHngr ShpOBody Colrs Match
## 1  1.0        1      4.0     4     1
## 2  5.5        4      7.0     6     4
## 3  5.0        4      6.5     5     2

aggregate(mydata,list(groups.3),mean)

##   Group.1        ID    Fashn    Price Convnience  ShpTime  Fitness Perceptn
## 1       1  84.22222 2.111111 5.555556   5.000000 5.000000 4.000000 3.222222
## 2       2 116.28235 3.352941 4.623529   3.582353 3.905882 5.917647 4.411765
## 3       3  95.35714 2.833333 4.666667   4.238095 4.595238 5.309524 3.928571
##    ChNoise RetailEx KnowdgStaf Brand4Slf Brand4Els   Populr  StrDisp SaleStaf
## 1 4.666667 5.444444   5.111111  1.666667  1.444444 1.777778 2.000000 3.444444
## 2 4.070588 5.300000   4.676471  4.111765  4.423529 4.064706 3.676471 4.535294
## 3 4.452381 5.119048   4.523810  2.761905  2.619048 2.404762 3.023810 4.119048
##     Fabric      Cut     Seam ShpOHngr ShpOBody    Colrs    Match
## 1 2.888889 3.333333 1.444444 2.333333 3.777778 3.222222 1.333333
## 2 5.658824 6.017647 5.341176 4.364706 6.458824 5.482353 4.323529
## 3 4.857143 5.785714 4.761905 3.619048 6.190476 4.595238 3.190476
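One way to think about the mean-versus-median question is to see how a single extreme answer moves each summary. The sketch below uses made-up ratings (the values and the cluster vector `cl` are hypothetical), with one unusually low Price rating in cluster 1.

```r
# Toy ratings (hypothetical values) with one extreme answer in cluster 1
toy <- data.frame(Price  = c(5, 5, 6, 1, 3, 4, 3),
                  Fabric = c(1, 2, 1, 7, 6, 6, 5))
cl  <- c(1, 1, 1, 1, 2, 2, 2)

# One row per cluster, one column per variable
toy.mean   <- aggregate(toy, by = list(cluster = cl), FUN = mean)
toy.median <- aggregate(toy, by = list(cluster = cl), FUN = median)

print(toy.mean)
print(toy.median)
```

In cluster 1 the mean is pulled toward the extreme answer while the median stays with the majority, which is one reason medians are often preferred for small clusters of rating-scale data.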

Do more with less using the scale function

We could standardize our data using one line of code: use = scale(mydata[,-c(1)], center = TRUE, scale = TRUE).

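That said, the original code shown below was not used this time because it is more verbose. Note that it centers each variable on its median and scales by its median absolute deviation (MAD), so it does not give the same values as the mean and standard deviation standardization above.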

use = mydata[,-c(1)] # we exclude the 1st column since ID should not be included in the analysis.

medians = apply(use,2,median)
mads = apply(use,2,mad)
# we now pass scale a matrix or data frame to be standardized
use = scale(use,center=medians,scale=mads)
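The two standardizations can be compared side by side. This is a minimal sketch on a made-up column with one outlier (the object `toy` and its values are hypothetical); it follows the same apply and scale calls as the snippet above.

```r
# Toy column (hypothetical values) with one outlier
toy <- data.frame(Price = c(3, 4, 4, 5, 5, 20))

# Mean / standard deviation standardization (center = TRUE, scale = TRUE)
z.mean <- scale(toy, center = TRUE, scale = TRUE)

# Median / MAD standardization, as in the longer snippet above
medians <- apply(toy, 2, median)
mads    <- apply(toy, 2, mad)
z.med   <- scale(toy, center = medians, scale = mads)

print(round(cbind(mean.sd = z.mean[, 1], median.mad = z.med[, 1]), 2))
```

The median and MAD version is less affected by the outlier when placing the ordinary values, but it gives the outlier itself a much larger standardized score.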

You can use Excel to standardize your data as well.