Using Hierarchical Clustering & PCA for Market Segmentation and Targeting

Segmentation

Objective: divide the target market or customer base on the basis of a few significant features, so that a company can sell more products with lower marketing expenses. A potentially interesting question is whether some products (or customers) are more alike than others.

Market segmentation

Market segmentation is a strategy that divides a broad target market of customers into smaller, more similar groups, and then designs a marketing strategy specifically for each group. Clustering is a common technique for market segmentation since it automatically finds similar groups given a data set.

Create a product that meets the needs & wants of the target market

Imagine that you are the Director of Customer Relationships at Apple and you want to understand consumers’ attitudes towards the iPhone 12 and Google’s Pixel 5. Once the product is created, the ball is in the marketing team’s court. As mentioned above, to understand which groups of customers will be interested in which kinds of features, marketers use a market segmentation strategy. Cluster analysis is designed to address this problem. Doing this helps position the product to the segment of customers with a high propensity to buy.

# Building distance function and plotting the trees (dendrograms)
# Hierarchical clustering (using the function hclust) is an informative way to visualize the data.
# We will see if we could discover subgroups among the variables or among the observations.

library(readr)
library(ggplot2)

mydata <- read_csv("customer_segmentation.csv") # load your dataset

## Rows: 22 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (15): ID, CS_helpful, Recommend, Come_again, All_Products, Profesionalis...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

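use <- scale(mydata[, -1], center = TRUE, scale = TRUE)  # standardize every column except ID
dist_use <- dist(use)           # Euclidean distances between customers
# Note: the next line takes distances between the rows of the distance matrix,
# not of the scaled data. The conventional call is hclust(dist(use)).
# The output shown below comes from the version used here.
d <- dist(as.matrix(dist_use))  # find distance matrix
seg.hclust <- hclust(d)         # apply hierarchical clustering (complete linkage by default)
plot(seg.hclust)                # draw the dendrogram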
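The chunk above clusters on distances computed from the distance matrix itself. For comparison, here is a minimal sketch of the more conventional pipeline (scale, one call to dist(), then hclust()) next to the two-step version. It uses the built-in USArrests data as a stand-in, since the course CSV is not bundled here, and the object names are illustrative assumptions.

```r
# Minimal sketch on the built-in USArrests data (a stand-in for the course CSV).
x <- scale(USArrests)                  # standardize each column
d_single <- dist(x)                    # Euclidean distances between rows
hc_single <- hclust(d_single)          # complete linkage (the hclust default)

# Same steps as the tutorial chunk: distances between rows of the distance matrix
d_double <- dist(as.matrix(d_single))
hc_double <- hclust(d_double)

# Cross-tabulate the 3-cluster solutions; the two trees need not agree
groups_single <- cutree(hc_single, k = 3)
groups_double <- cutree(hc_double, k = 3)
print(table(single = groups_single, double = groups_double))
```

If the off-diagonal cells of the table are non-empty, the two approaches assign some observations to different clusters, which is worth knowing before interpreting the segments.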

# Identifying cluster memberships for each cluster
# Imagine that your goal is to find some profitable customers to target.
# cutree() assigns each customer to one of 3 clusters, and table() shows how many customers fall in each.

groups.3 = cutree(seg.hclust, 3)
table(groups.3)

## groups.3
##  1  2  3 
## 17  2  3

# In the following step, we will find the members in each cluster or group.
mydata$ID[groups.3 == 1]

##  [1]  1  2  3  6  7  8  9 10 11 12 13 14 15 16 17 18 21

mydata$ID[groups.3 == 2]

## [1]  4 22

mydata$ID[groups.3 == 3]

## [1]  5 19 20
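The three lookups above can also be done in one step. The sketch below uses split() on a small, made-up set of IDs and cluster labels (both vectors are hypothetical, not taken from the survey data) to return the members of every cluster at once.

```r
# Hypothetical toy data: 6 customer IDs and their cluster labels
ids <- c(101, 102, 103, 104, 105, 106)
groups <- c(1, 2, 1, 3, 2, 1)

# One call returns a named list with the IDs that belong to each cluster
members <- split(ids, groups)
print(members)
```

With the tutorial objects, the equivalent call would presumably be split(mydata$ID, groups.3).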

# Identifying common features of each cluster using the aggregate function
aggregate(mydata, list(groups.3), median)

##   Group.1 ID CS_helpful Recommend Come_again All_Products Profesionalism
## 1       1 11          1       1.0        1.0            2            1.0
## 2       2 13          3       2.5        1.5            3            1.5
## 3       3 19          2       1.0        3.0            3            2.0
##   Limitation Online_grocery delivery Pick_up Find_items other_shops Gender Age
## 1          1              2        2     3.0          1         2.0      1 2.0
## 2          2              3        3     2.5          2         1.5      1 2.5
## 3          1              2        3     1.0          2         3.0      2 2.0
##   Education
## 1         2
## 2         5
## 3         2

aggregate(mydata, list(groups.3), mean)

##   Group.1       ID CS_helpful Recommend Come_again All_Products Profesionalism
## 1       1 10.76471   1.294118  1.117647   1.235294     1.823529       1.235294
## 2       2 13.00000   3.000000  2.500000   1.500000     3.000000       1.500000
## 3       3 14.66667   2.333333  1.666667   2.666667     3.000000       2.333333
##   Limitation Online_grocery delivery  Pick_up Find_items other_shops   Gender
## 1   1.352941       2.235294 2.235294 2.705882   1.294118    2.647059 1.176471
## 2   2.000000       3.000000 3.000000 2.500000   2.000000    1.500000 1.000000
## 3   2.000000       2.000000 3.000000 1.000000   2.000000    3.000000 2.000000
##        Age Education
## 1 2.411765  3.117647
## 2 2.500000  5.000000
## 3 2.666667  2.333333

aggregate(mydata[,-1], list(groups.3), median)

##   Group.1 CS_helpful Recommend Come_again All_Products Profesionalism
## 1       1          1       1.0        1.0            2            1.0
## 2       2          3       2.5        1.5            3            1.5
## 3       3          2       1.0        3.0            3            2.0
##   Limitation Online_grocery delivery Pick_up Find_items other_shops Gender Age
## 1          1              2        2     3.0          1         2.0      1 2.0
## 2          2              3        3     2.5          2         1.5      1 2.5
## 3          1              2        3     1.0          2         3.0      2 2.0
##   Education
## 1         2
## 2         5
## 3         2

aggregate(mydata[,-1], list(groups.3), mean)

##   Group.1 CS_helpful Recommend Come_again All_Products Profesionalism
## 1       1   1.294118  1.117647   1.235294     1.823529       1.235294
## 2       2   3.000000  2.500000   1.500000     3.000000       1.500000
## 3       3   2.333333  1.666667   2.666667     3.000000       2.333333
##   Limitation Online_grocery delivery  Pick_up Find_items other_shops   Gender
## 1   1.352941       2.235294 2.235294 2.705882   1.294118    2.647059 1.176471
## 2   2.000000       3.000000 3.000000 2.500000   2.000000    1.500000 1.000000
## 3   2.000000       2.000000 3.000000 1.000000   2.000000    3.000000 2.000000
##        Age Education
## 1 2.411765  3.117647
## 2 2.500000  5.000000
## 3 2.666667  2.333333

cluster_means <- aggregate(mydata[,-1], list(groups.3), mean)

# Exporting cluster analysis results from RStudio Cloud as CSV files (which can be opened in Excel)
write.csv(groups.3, "clusterID.csv")
write.csv(cluster_means, "cluster_means.csv")

Downloading your solutions manually

First, select the files (“clusterID.csv” & “cluster_means.csv”) and put a checkmark before each file.

Second, click the gear icon on the right side of your pane and export the data.

Finding means or medians of each variable (factor) for each cluster

Imagine that your goal is to find some profitable customers to target. Using the mean or the median function, you can see the characteristics of each sub-group. Now it is time to use your domain expertise.

Discussion Questions for you (about 20 words per question)

How many observations do we have in each cluster? Answer: 17 in cluster 1, 2 in cluster 2, and 3 in cluster 3.
We can look at the medians (or means) for the variables in each cluster. Why is this important? Answer: Looking at the medians or means helps us understand the typical customer profile in each cluster. This allows us to interpret the differences between groups and identify which clusters might contain different types of customers.
Should the mean or the median be used when analyzing the differences among clusters? Why?

Answer: The median is often preferred when the data contain outliers or are skewed, because it gives a better sense of the “typical” value and is less affected by outliers.

Now we need to understand the common characteristics of each cluster. Our goal is to build a targeting strategy using the profiles of each cluster. What summary measures of each cluster are appropriate in a descriptive sense?

Answer: I believe cluster means and medians are appropriate. In addition, looking at the range of the data might help us understand the variability within each cluster. These summaries help define the target customer for each cluster.

Are there any major differences between K-means clustering (https://rpubs.com/utjimmyx/kmeans) and hierarchical clustering? Which one do you like better? Why? You may refer to the assigned readings.

Answer: Hierarchical clustering produces a visual tree (dendrogram), while the k-means result in this exercise is shown as a bar graph of cluster sizes. Personally, I like k-means clustering because it is much easier to visualize. When comparing the two methods, the hierarchical clustering dendrogram looks quite cluttered.
Do a keyword search using “cluster analysis.” How many relevant job titles are there?

Answer: I saw 3 relevant job titles, all paying at or near six figures and all full-time positions. The titles are Cluster FP&A Analyst, Cluster Keywords and URLs into Topically Relevant Groups, and Data Science Manager.
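The answers above mention means, medians, and ranges as cluster summaries. As a rough illustration of why the choice matters, the sketch below computes all three per cluster on a small, made-up data frame (the toy values and column names are hypothetical) in which one cluster contains an outlier.

```r
# Hypothetical toy data: one survey item scored by 8 customers in 2 clusters
toy <- data.frame(
  cluster = c(1, 1, 1, 1, 2, 2, 2, 2),
  score   = c(1, 1, 2, 2, 1, 2, 2, 15)   # 15 is an outlier in cluster 2
)

# Mean, median, and range (max minus min) of the score within each cluster
profile <- data.frame(
  cluster = sort(unique(toy$cluster)),
  mean    = as.numeric(tapply(toy$score, toy$cluster, mean)),
  median  = as.numeric(tapply(toy$score, toy$cluster, median)),
  range   = as.numeric(tapply(toy$score, toy$cluster, function(v) diff(range(v))))
)
print(profile)
```

In cluster 2 the single outlier pulls the mean up to 5 while the median stays at 2, and the range of 14 flags the spread that neither average shows on its own.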

Intro

Principal Component Analysis (PCA) summarizes the features in a dataset with a smaller set of uncorrelated components that capture most of the variance, and it can be used in conjunction with cluster analysis.

PCA is also a popular machine learning technique for dimension reduction, and it is often used as a guide for feature selection. Imagine that you have more than 100 features or factors. It is useful to select the most important features for further analysis.

The basic idea when using PCA as a tool for feature selection is to select variables according to the magnitude (from largest to smallest in absolute values) of their coefficients (loadings).
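One way to sketch this idea in code is to rank variables by the absolute size of their loadings on the first principal component. The example uses the built-in USArrests data as a stand-in for the survey data, and the top-two cutoff is an arbitrary choice for illustration, not a rule from the readings.

```r
# Sketch on the built-in USArrests data (a stand-in for the survey data)
pca_demo <- prcomp(USArrests, scale. = TRUE)

# Loadings (coefficients) of each variable on the first principal component
pc1_loadings <- pca_demo$rotation[, "PC1"]

# Rank variables by the absolute size of their PC1 loading, largest first
ranked <- sort(abs(pc1_loadings), decreasing = TRUE)
print(ranked)

# Keep the top two variables as a simple, illustrative cutoff
top_vars <- names(ranked)[1:2]
print(top_vars)
```

In practice the ranking is usually repeated for each component that is retained, since a variable with a small loading on PC1 can still dominate PC2 or PC3.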

#install.packages('dplyr')
library(dplyr) # sane data manipulation

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyr) # sane data munging
library(ggplot2) # needs no introduction
library(ggfortify) # super-helpful for plotting non-"standard" stats objects

#identifying your working directory
getwd() #confirm your working directory is accurate

## [1] "/cloud/project"

library(readr)

##  mydata <-read_csv('Segmentation.csv')

mydata <-read_csv('customer_segmentation.csv')

## Rows: 22 Columns: 15

## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (15): ID, CS_helpful, Recommend, Come_again, All_Products, Profesionalis...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# read csv file #This allows you to read the data from my Github site.

#Open the data. Note that some students will see an Excel option in "Import Dataset";
#those that do not will need to save the original data as a csv and import that as a text file.
#rm(list = ls()) #used to clean your working environment
fit <- kmeans(mydata[,-1], 3, iter.max=1000)
# Exclude the first column since it is an ID rather than a factor or variable.
# 3 is the number of clusters requested.
# kmeans() starts from random centers, so cluster labels and sizes can change between runs.
table(fit$cluster)

## 
##  1  2  3 
## 10  3  9

barplot(table(fit$cluster), col="#336699") #plot
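Because kmeans() starts from random centers, the cluster sizes above may not reproduce exactly on another run. The following sketch shows one common way to make the result repeatable, using set.seed() and several random starts, and then cross-tabulates the k-means clusters against a hierarchical solution. It runs on the built-in iris measurements as a stand-in. The seed value, nstart = 25, and the scaling step are assumptions added for illustration (the tutorial chunk clusters the unscaled data).

```r
# Sketch on the built-in iris measurements (a stand-in for the survey data)
x <- scale(iris[, 1:4])            # standardize so no variable dominates

set.seed(123)                      # arbitrary seed, fixes the random starts
fit_a <- kmeans(x, centers = 3, nstart = 25, iter.max = 1000)

set.seed(123)                      # same seed, so the same solution is returned
fit_b <- kmeans(x, centers = 3, nstart = 25, iter.max = 1000)

print(table(fit_a$cluster))
print(identical(fit_a$cluster, fit_b$cluster))

# Compare against a 3-cluster cut of a hierarchical tree on the same data
hc_groups <- cutree(hclust(dist(x)), k = 3)
print(table(kmeans = fit_a$cluster, hclust = hc_groups))
```

The cross-tabulation gives a quick view of how far the two methods agree. Cluster numbers are arbitrary labels, so agreement shows up as each row concentrating in one column rather than as matching numbers.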

pca <- prcomp(mydata[,-1], scale. = TRUE) # principal component analysis on standardized variables
pca_data <- mutate(fortify(pca), col=fit$cluster)
# We want to examine the cluster membership of each observation (see the last column, col)

ggplot(pca_data) + geom_point(aes(x=PC1, y=PC2, fill=factor(col)),
size=3, col="#7f7f7f", shape=21) + theme_bw(base_family="Helvetica")

autoplot(fit, data=mydata[,-1], frame=TRUE, frame.type='norm')

## Too few points to calculate an ellipse

names(pca)

## [1] "sdev"     "rotation" "center"   "scale"    "x"

pca$center

##     CS_helpful      Recommend     Come_again   All_Products Profesionalism 
##       1.590909       1.318182       1.454545       2.090909       1.409091 
##     Limitation Online_grocery       delivery        Pick_up     Find_items 
##       1.500000       2.272727       2.409091       2.454545       1.454545 
##    other_shops         Gender            Age      Education 
##       2.590909       1.272727       2.454545       3.181818

pca$scale

##     CS_helpful      Recommend     Come_again   All_Products Profesionalism 
##      0.7341397      0.6463350      0.7385489      1.0649879      0.5903261 
##     Limitation Online_grocery       delivery        Pick_up     Find_items 
##      0.8017837      0.7672969      0.7341397      1.0568269      0.6709817 
##    other_shops         Gender            Age      Education 
##      1.4026876      0.4558423      0.7385489      1.6223547

pca$rotation

##                         PC1         PC2         PC3          PC4          PC5
## CS_helpful     -0.488254060  0.18353687  0.09973845  0.045221127  0.092443591
## Recommend      -0.330197677  0.13991354 -0.19892372  0.358613745 -0.208505096
## Come_again     -0.326085356 -0.34041476 -0.18584895  0.116146481 -0.342514053
## All_Products   -0.237688878 -0.33206544  0.30137894  0.022875225 -0.066485862
## Profesionalism -0.369807437  0.03477990 -0.41101054 -0.149688188  0.001503016
## Limitation     -0.276227449  0.18864661  0.36353878 -0.334396804 -0.017461769
## Online_grocery -0.043475182  0.32978681 -0.14782950  0.422865900  0.019831184
## delivery       -0.351938301  0.28759967  0.12110867  0.150376344  0.006723563
## Pick_up         0.208402706  0.44334883  0.09799661 -0.011935578 -0.138495611
## Find_items     -0.240648470 -0.08690804  0.51908591 -0.153694840  0.085804597
## other_shops     0.087708302 -0.24033344  0.09192695  0.002751194 -0.738531498
## Gender         -0.196617487 -0.28135924 -0.35122683 -0.257036171  0.306921574
## Age             0.056826085 -0.36201176  0.08767070  0.349708269  0.387112312
## Education       0.004030129 -0.14223843  0.26258524  0.554568267  0.097308148
##                        PC6          PC7         PC8         PC9        PC10
## CS_helpful     -0.11077913  0.035353541  0.13007878 -0.43856718  0.09590230
## Recommend      -0.09553144  0.200038529 -0.01130160  0.43984794  0.62683843
## Come_again     -0.06572910  0.024522862 -0.23986864 -0.10307364 -0.19352387
## All_Products    0.46023149  0.245244527  0.28514611 -0.25163505  0.07413083
## Profesionalism  0.09677131  0.297360901  0.20638892 -0.09904767 -0.23742562
## Limitation     -0.29652333 -0.331945940 -0.14649416 -0.25432284  0.32279594
## Online_grocery  0.35598881 -0.554513343  0.34468239 -0.11197454 -0.07743250
## delivery        0.15452242 -0.085950762 -0.58313191  0.17757789 -0.44900412
## Pick_up         0.41357158  0.220929987 -0.11529403 -0.09148473  0.18348083
## Find_items      0.22151682 -0.015221196  0.20963596  0.57238758 -0.10243200
## other_shops     0.11847361 -0.333249591 -0.04002334 -0.04516252  0.05022230
## Gender          0.15664439 -0.471694070  0.01241550  0.19824069  0.17283668
## Age             0.26951115  0.008307255 -0.45046829 -0.20951026  0.27670798
## Education      -0.42807889 -0.042929384  0.24348136  0.02132896 -0.18341535
##                       PC11        PC12        PC13        PC14
## CS_helpful      0.08499678  0.12853926  0.13765569 -0.65780467
## Recommend       0.10152978  0.06719730 -0.01896875  0.09433582
## Come_again     -0.05106820 -0.69346597 -0.10901925 -0.08073348
## All_Products    0.26555413  0.12536909 -0.39652455  0.26816734
## Profesionalism -0.48073471  0.20344701  0.29530718  0.32314938
## Limitation     -0.17311939 -0.13086687 -0.01435426  0.45614659
## Online_grocery  0.10539622 -0.22720433  0.15130596  0.17638419
## delivery        0.12003990  0.30862260 -0.18974545  0.07741658
## Pick_up        -0.52442325 -0.19195723 -0.32143825 -0.20177844
## Find_items     -0.16039580 -0.22254458  0.32134565 -0.15561551
## other_shops    -0.18306875  0.39928130  0.19565336 -0.13485229
## Gender         -0.21563958  0.12285325 -0.42084814 -0.20852942
## Age            -0.19550324  0.02689677  0.38447466  0.05500715
## Education      -0.45140171  0.12388542 -0.30897450  0.02713011

dim(pca$x)

## [1] 22 14

biplot(pca, scale=0)

pca$rotation=-pca$rotation
pca$x=-pca$x
biplot(pca, scale=0)
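Flipping the signs of both the loadings and the scores, as done above, changes the orientation of the biplot but not the information in it. A minimal sketch of why, using the built-in USArrests data as a stand-in: the data reconstructed from scores and loadings are identical before and after the flip.

```r
pca_demo <- prcomp(USArrests, scale. = TRUE)

# Reconstruct the standardized data from scores and loadings
recon_original <- pca_demo$x %*% t(pca_demo$rotation)

# Flip the sign of both the loadings and the scores
recon_flipped <- (-pca_demo$x) %*% t(-pca_demo$rotation)

# The two reconstructions match, and both recover the scaled data
same_reconstruction <- isTRUE(all.equal(recon_original, recon_flipped))
recovers_data <- isTRUE(all.equal(recon_original, scale(USArrests),
                                  check.attributes = FALSE))
print(c(same_reconstruction, recovers_data))
```

This is why the sign of a principal component is a matter of convention, and why different software may report mirrored loadings for the same data.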

pca$sdev

##  [1] 1.7762774 1.5392773 1.3417626 1.2574520 1.1217199 1.0080006 0.7824326
##  [8] 0.7619842 0.6731043 0.6257669 0.5998688 0.4509321 0.4352687 0.1717997

pca.var=pca$sdev^2
pca.var

##  [1] 3.15516153 2.36937455 1.80032691 1.58118560 1.25825561 1.01606526
##  [7] 0.61220073 0.58061988 0.45306947 0.39158416 0.35984260 0.20333971
## [13] 0.18945885 0.02951515

pve=pca.var/sum(pca.var)
pve

##  [1] 0.225368681 0.169241039 0.128594779 0.112941828 0.089875401 0.072576090
##  [7] 0.043728623 0.041472849 0.032362105 0.027970297 0.025703043 0.014524265
## [13] 0.013532775 0.002108225

plot(pve, xlab="Principal Component", ylab="Proportion of Variance Explained", ylim=c(0,1),type='b')

plot(cumsum(pve), xlab="Principal Component", ylab="Cumulative Proportion of Variance Explained", ylim=c(0,1),type='b')

write.csv(pca_data, "pca_data.csv")
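The scree and cumulative plots above are usually read to decide how many components to keep. As a rough sketch, the code below takes the pve values printed earlier and finds the smallest number of components whose cumulative share of variance reaches a chosen threshold. The 80% cutoff is an illustrative assumption, not a rule from the readings.

```r
# Proportion of variance explained, copied from the pve output above
pve <- c(0.225368681, 0.169241039, 0.128594779, 0.112941828, 0.089875401,
         0.072576090, 0.043728623, 0.041472849, 0.032362105, 0.027970297,
         0.025703043, 0.014524265, 0.013532775, 0.002108225)

cum_pve <- cumsum(pve)

# Smallest number of components whose cumulative share reaches 80%
# (the 80% cutoff is an illustrative assumption)
n_components <- which(cum_pve >= 0.80)[1]
print(n_components)
print(round(cum_pve[1:n_components], 3))
```

With these values, the first six components explain just under 80% of the variance (about 79.9%), so the seventh is needed to pass the threshold.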

Discussion Questions for you (50 words per question)

1. Think about at least one question you could answer using this result. Answer:

What are the main characteristics that differentiate customer groups?

2. Interpret the PCA graphs according to the required reading (p.385-p.399) https://www.statlearning.com/ (page number required).

According to page 508, PCA graphs are a tool that we can use for exploratory analysis. From these graphs we can see that different clusters tend to favor different attributes. In addition, each axis is a principal component, and the direction and length of each attribute’s arrow show how strongly that attribute contributes to the components, allowing us to analyze the data more efficiently.

References

Cluster analysis - reading (p.385-p.399) https://www.statlearning.com/ Hint: you can download the free version of this book from this website.

Comparison of similarity coefficients used for cluster analysis with dominant markers in maize (Zea mays L) https://www.scielo.br/scielo.php?script=sci_arttext&pid=S1415-47572004000100014&lng=en&nrm=iso

Principal Component Methods in R: Practical Guide http://www.sthda.com/english/articles/31-principal-component-methods-in-r-practical-guide/118-principal-component-analysis-in-r-prcomp-vs-princomp/

Principal component analysis - reading (p.404-p.405) https://www.statlearning.com/ Hint: you can download the free version from this website.

https://online.stat.psu.edu/stat505/lesson/11/11.4