Using Hierarchical Clustering and PCA for Market Segmentation and Targeting

Learning objectives

By the end of this lab session, you should be able to:

Understand how cloud computing works.
Understand how to import your own data to the cloud environment.
Create descriptive statistics to understand the frequency distributions of your data.
Understand how hierarchical cluster analysis works.
Perform a basic cluster analysis using RStudio Cloud.
Understand how to interpret your cluster analysis results.
Understand how to export your final results from the cloud environment to your own computer.
Understand how to use some basic packages and custom functions to process your data.

Reading

For hierarchical clustering and exploratory data analysis, read Chapter 12, “Cluster Analysis,” from An Introduction to Statistical Learning with Applications in R by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani.

Reference: James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning with Applications in R.

Segmentation

Market segmentation divides a broad target market into smaller, more similar groups. Clustering is a common technique for market segmentation because it automatically finds similar groups in a data set.

Dataset

The file used in this analysis is customer_segmentation (1).csv.

Importing data

mydata <- read_csv("customer_segmentation (1).csv")
str(mydata)

## spc_tbl_ [22 × 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ID            : num [1:22] 1 2 3 4 5 6 7 8 9 10 ...
##  $ CS_helpful    : num [1:22] 2 1 2 3 2 1 2 1 1 1 ...
##  $ Recommend     : num [1:22] 2 2 1 3 1 1 1 1 1 1 ...
##  $ Come_again    : num [1:22] 2 1 1 2 3 3 1 1 1 1 ...
##  $ All_Products  : num [1:22] 2 1 1 4 5 2 2 2 2 1 ...
##  $ Profesionalism: num [1:22] 2 1 1 1 2 1 2 1 2 1 ...
##  $ Limitation    : num [1:22] 2 1 2 2 1 1 1 2 1 1 ...
##  $ Online_grocery: num [1:22] 2 2 3 3 2 1 2 1 2 3 ...
##  $ delivery      : num [1:22] 3 3 3 3 3 2 2 1 1 2 ...
##  $ Pick_up       : num [1:22] 4 3 2 2 1 1 2 2 3 2 ...
##  $ Find_items    : num [1:22] 1 1 1 2 2 1 1 2 1 1 ...
##  $ other_shops   : num [1:22] 2 2 3 2 3 4 1 4 1 1 ...
##  $ Gender        : num [1:22] 1 1 1 1 2 1 1 1 2 2 ...
##  $ Age           : num [1:22] 2 2 2 3 4 2 2 2 2 2 ...
##  $ Education     : num [1:22] 2 2 2 5 2 5 3 2 1 2 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ID = col_double(),
##   ..   CS_helpful = col_double(),
##   ..   Recommend = col_double(),
##   ..   Come_again = col_double(),
##   ..   All_Products = col_double(),
##   ..   Profesionalism = col_double(),
##   ..   Limitation = col_double(),
##   ..   Online_grocery = col_double(),
##   ..   delivery = col_double(),
##   ..   Pick_up = col_double(),
##   ..   Find_items = col_double(),
##   ..   other_shops = col_double(),
##   ..   Gender = col_double(),
##   ..   Age = col_double(),
##   ..   Education = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

summary(mydata)

##        ID          CS_helpful      Recommend       Come_again   
##  Min.   : 1.00   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.: 6.25   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:1.000  
##  Median :11.50   Median :1.000   Median :1.000   Median :1.000  
##  Mean   :11.50   Mean   :1.591   Mean   :1.318   Mean   :1.455  
##  3rd Qu.:16.75   3rd Qu.:2.000   3rd Qu.:1.000   3rd Qu.:2.000  
##  Max.   :22.00   Max.   :3.000   Max.   :3.000   Max.   :3.000  
##   All_Products   Profesionalism    Limitation  Online_grocery     delivery    
##  Min.   :1.000   Min.   :1.000   Min.   :1.0   Min.   :1.000   Min.   :1.000  
##  1st Qu.:1.250   1st Qu.:1.000   1st Qu.:1.0   1st Qu.:2.000   1st Qu.:2.000  
##  Median :2.000   Median :1.000   Median :1.0   Median :2.000   Median :3.000  
##  Mean   :2.091   Mean   :1.409   Mean   :1.5   Mean   :2.273   Mean   :2.409  
##  3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.:2.0   3rd Qu.:3.000   3rd Qu.:3.000  
##  Max.   :5.000   Max.   :3.000   Max.   :4.0   Max.   :3.000   Max.   :3.000  
##     Pick_up        Find_items     other_shops        Gender     
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:2.000   1st Qu.:1.000   1st Qu.:1.250   1st Qu.:1.000  
##  Median :2.000   Median :1.000   Median :2.000   Median :1.000  
##  Mean   :2.455   Mean   :1.455   Mean   :2.591   Mean   :1.273  
##  3rd Qu.:3.000   3rd Qu.:2.000   3rd Qu.:3.750   3rd Qu.:1.750  
##  Max.   :5.000   Max.   :3.000   Max.   :5.000   Max.   :2.000  
##       Age          Education    
##  Min.   :2.000   Min.   :1.000  
##  1st Qu.:2.000   1st Qu.:2.000  
##  Median :2.000   Median :2.500  
##  Mean   :2.455   Mean   :3.182  
##  3rd Qu.:3.000   3rd Qu.:5.000  
##  Max.   :4.000   Max.   :5.000

head(mydata)

## # A tibble: 6 × 15
##      ID CS_helpful Recommend Come_again All_Products Profesionalism Limitation
##   <dbl>      <dbl>     <dbl>      <dbl>        <dbl>          <dbl>      <dbl>
## 1     1          2         2          2            2              2          2
## 2     2          1         2          1            1              1          1
## 3     3          2         1          1            1              1          2
## 4     4          3         3          2            4              1          2
## 5     5          2         1          3            5              2          1
## 6     6          1         1          3            2              1          1
## # ℹ 8 more variables: Online_grocery <dbl>, delivery <dbl>, Pick_up <dbl>,
## #   Find_items <dbl>, other_shops <dbl>, Gender <dbl>, Age <dbl>,
## #   Education <dbl>

Standardizing the data

use <- scale(mydata[, -1], center = TRUE, scale = TRUE)
head(use)

##      CS_helpful  Recommend Come_again All_Products Profesionalism Limitation
## [1,]  0.5572385  1.0548991  0.7385489  -0.08536162      1.0009877  0.6236096
## [2,] -0.8049001  1.0548991 -0.6154575  -1.02433946     -0.6929915 -0.6236096
## [3,]  0.5572385 -0.4922862 -0.6154575  -1.02433946     -0.6929915  0.6236096
## [4,]  1.9193772  2.6020844  0.7385489   1.79259406     -0.6929915  0.6236096
## [5,]  0.5572385 -0.4922862  2.0925553   2.73157191      1.0009877 -0.6236096
## [6,] -0.8049001 -0.4922862  2.0925553  -0.08536162     -0.6929915 -0.6236096
##      Online_grocery   delivery    Pick_up Find_items other_shops    Gender
## [1,]     -0.3554390  0.8049001  1.4623535 -0.6774335  -0.4212692 -0.598293
## [2,]     -0.3554390  0.8049001  0.5161248 -0.6774335  -0.4212692 -0.598293
## [3,]      0.9478374  0.8049001 -0.4301040 -0.6774335   0.2916479 -0.598293
## [4,]      0.9478374  0.8049001 -0.4301040  0.8129201  -0.4212692 -0.598293
## [5,]     -0.3554390  0.8049001 -1.3763327  0.8129201   0.2916479  1.595448
## [6,]     -1.6587154 -0.5572385 -1.3763327 -0.6774335   1.0045650 -0.598293
##             Age  Education
## [1,] -0.6154575 -0.7284586
## [2,] -0.6154575 -0.7284586
## [3,] -0.6154575 -0.7284586
## [4,]  0.7385489  1.1207055
## [5,]  2.0925553 -0.7284586
## [6,] -0.6154575  1.1207055

Building the distance matrix and plotting the dendrogram

dist_matrix <- dist(use)
seg.hclust <- hclust(dist_matrix, method = "complete")
plot(seg.hclust, main = "Hierarchical Clustering Dendrogram")

Identifying cluster memberships

groups.3 <- cutree(seg.hclust, k = 3)
table(groups.3)

## groups.3
##  1  2  3 
## 17  3  2

mydata$ID[groups.3 == 1]

##  [1]  1  2  3  6  7  8  9 10 11 12 13 14 15 16 17 18 21

mydata$ID[groups.3 == 2]

## [1]  4 20 22

mydata$ID[groups.3 == 3]

## [1]  5 19

Identifying common features of each cluster

aggregate(mydata[, -1], list(Cluster = groups.3), median)

##   Cluster CS_helpful Recommend Come_again All_Products Profesionalism
## 1       1        1.0         1        1.0            2              1
## 2       2        3.0         3        2.0            2              2
## 3       3        2.5         1        2.5            4              2
##   Limitation Online_grocery delivery Pick_up Find_items other_shops Gender Age
## 1        1.0            2.0        2       3        1.0         2.0      1   2
## 2        2.0            3.0        3       2        2.0         2.0      1   2
## 3        2.5            1.5        3       1        2.5         2.5      2   3
##   Education
## 1         2
## 2         5
## 3         2

aggregate(mydata[, -1], list(Cluster = groups.3), mean)

##   Cluster CS_helpful Recommend Come_again All_Products Profesionalism
## 1       1   1.294118  1.117647   1.235294     1.823529       1.235294
## 2       2   2.666667  2.666667   2.000000     2.333333       2.000000
## 3       3   2.500000  1.000000   2.500000     4.000000       2.000000
##   Limitation Online_grocery delivery  Pick_up Find_items other_shops   Gender
## 1   1.352941       2.235294 2.235294 2.705882   1.294118    2.647059 1.176471
## 2   1.666667       3.000000 3.000000 2.000000   1.666667    2.333333 1.333333
## 3   2.500000       1.500000 3.000000 1.000000   2.500000    2.500000 2.000000
##        Age Education
## 1 2.411765  3.117647
## 2 2.333333  4.333333
## 3 3.000000  2.000000

cluster_means <- aggregate(mydata[, -1], list(Cluster = groups.3), mean)
cluster_means

##   Cluster CS_helpful Recommend Come_again All_Products Profesionalism
## 1       1   1.294118  1.117647   1.235294     1.823529       1.235294
## 2       2   2.666667  2.666667   2.000000     2.333333       2.000000
## 3       3   2.500000  1.000000   2.500000     4.000000       2.000000
##   Limitation Online_grocery delivery  Pick_up Find_items other_shops   Gender
## 1   1.352941       2.235294 2.235294 2.705882   1.294118    2.647059 1.176471
## 2   1.666667       3.000000 3.000000 2.000000   1.666667    2.333333 1.333333
## 3   2.500000       1.500000 3.000000 1.000000   2.500000    2.500000 2.000000
##        Age Education
## 1 2.411765  3.117647
## 2 2.333333  4.333333
## 3 3.000000  2.000000

Exporting cluster analysis results

write.csv(data.frame(ID = mydata$ID, Cluster = groups.3), "clusterID.csv", row.names = FALSE)
write.csv(cluster_means, "cluster_means.csv", row.names = FALSE)

Discussion Questions

How many observations do we have in each cluster?
Answer: Your answer here.
Why is it important to look at the medians or means for the variables in each cluster?
Answer: Your answer here.
Should mean or median be used when analyzing the differences among clusters? Why?
Answer: Your answer here.
What summary measures of each cluster are appropriate for building a targeting strategy?
Answer: Your answer here.
What are the major differences between K-means clustering and hierarchical clustering? Which one do you prefer, and why?
Answer: Your answer here.
Do a keyword search using “cluster analysis.” How many relevant job titles are there?
Answer: Your answer here.

Advanced Question

Should we use mydata or mydata[, -1] with the aggregate() function? Why?
Answer: Your answer here.

Principal Component Analysis (PCA)

Principal Component Analysis (PCA) helps identify the most important features in a data set and can be used alongside cluster analysis.

fit <- kmeans(mydata[, -1], centers = 3, iter.max = 1000)
table(fit$cluster)

## 
##  1  2  3 
## 10  9  3

barplot(table(fit$cluster), main = "Cluster Sizes")

pca <- prcomp(mydata[, -1], scale. = TRUE)
pca_data <- mutate(fortify(pca), Cluster = fit$cluster)
head(pca_data)

##   CS_helpful Recommend Come_again All_Products Profesionalism Limitation
## 1          2         2          2            2              2          2
## 2          1         2          1            1              1          1
## 3          2         1          1            1              1          2
## 4          3         3          2            4              1          2
## 5          2         1          3            5              2          1
## 6          1         1          3            2              1          1
##   Online_grocery delivery Pick_up Find_items other_shops Gender Age Education
## 1              2        3       4          1           2      1   2         2
## 2              2        3       3          1           2      1   2         2
## 3              3        3       2          1           3      1   2         2
## 4              3        3       2          2           2      1   3         5
## 5              2        3       1          2           3      2   4         2
## 6              1        2       1          1           4      1   2         5
##          PC1         PC2         PC3          PC4         PC5        PC6
## 1 -1.1406348  1.59673228 -0.63410651 -0.280229354  0.86904958  0.1165197
## 2  0.9629067  1.40574597 -0.65120805  0.161359334  0.31849449 -0.2611656
## 3  0.2728698  1.51351625  0.02598032 -0.184584915  0.26137384 -0.4770065
## 4 -2.8618142  0.08875892  1.45597311  2.476371371  0.07321515  0.1029394
## 5 -2.5614717 -3.40830515 -0.15453453 -0.006377382 -0.54039721  2.1556752
## 6  0.6417214 -2.31064647 -0.10485209  0.238576376  1.63185131 -1.9389726
##          PC7        PC8          PC9       PC10       PC11        PC12
## 1 -1.2286092  0.8188759  0.417757947 -0.3716437 -0.4388740  0.40759767
## 2 -0.6182053  0.9968371 -1.127019947 -0.2594187  0.7916425 -0.23877274
## 3  1.2264580  0.4550988  0.559657368  0.4159135  1.0374887 -0.31685456
## 4 -0.3114923 -0.3468342  0.003376173 -1.4482133  0.9383764  0.23674647
## 5 -0.1574697  0.7216815  0.681397975  0.5632122  0.4154912  0.09390993
## 6 -0.4725143  0.4047234  0.016685289  1.0671387  0.3414491  0.58800229
##         PC13        PC14 Cluster
## 1 -0.4397615  0.19628938       1
## 2 -0.2855143  0.02442650       1
## 3  0.5542694 -0.12396750       1
## 4 -0.2928463 -0.09311831       2
## 5 -0.0083104  0.06374110       3
## 6 -0.5465203 -0.18393394       2

PCA plot

ggplot(pca_data, aes(x = PC1, y = PC2, fill = factor(Cluster))) +
  geom_point(size = 3, color = "gray40", shape = 21) +
  theme_bw() +
  labs(title = "PCA Plot with Cluster Membership", fill = "Cluster")

K-means plot

autoplot(fit, data = mydata[, -1], frame = TRUE, frame.type = "norm")

PCA outputs

names(pca)

## [1] "sdev"     "rotation" "center"   "scale"    "x"

pca$center

##     CS_helpful      Recommend     Come_again   All_Products Profesionalism 
##       1.590909       1.318182       1.454545       2.090909       1.409091 
##     Limitation Online_grocery       delivery        Pick_up     Find_items 
##       1.500000       2.272727       2.409091       2.454545       1.454545 
##    other_shops         Gender            Age      Education 
##       2.590909       1.272727       2.454545       3.181818

pca$scale

##     CS_helpful      Recommend     Come_again   All_Products Profesionalism 
##      0.7341397      0.6463350      0.7385489      1.0649879      0.5903261 
##     Limitation Online_grocery       delivery        Pick_up     Find_items 
##      0.8017837      0.7672969      0.7341397      1.0568269      0.6709817 
##    other_shops         Gender            Age      Education 
##      1.4026876      0.4558423      0.7385489      1.6223547

pca$rotation

##                         PC1         PC2         PC3          PC4          PC5
## CS_helpful     -0.488254060  0.18353687  0.09973845  0.045221127 -0.092443591
## Recommend      -0.330197677  0.13991354 -0.19892372  0.358613745  0.208505096
## Come_again     -0.326085356 -0.34041476 -0.18584895  0.116146481  0.342514053
## All_Products   -0.237688878 -0.33206544  0.30137894  0.022875225  0.066485862
## Profesionalism -0.369807437  0.03477990 -0.41101054 -0.149688188 -0.001503016
## Limitation     -0.276227449  0.18864661  0.36353878 -0.334396804  0.017461769
## Online_grocery -0.043475182  0.32978681 -0.14782950  0.422865900 -0.019831184
## delivery       -0.351938301  0.28759967  0.12110867  0.150376344 -0.006723563
## Pick_up         0.208402706  0.44334883  0.09799661 -0.011935578  0.138495611
## Find_items     -0.240648470 -0.08690804  0.51908591 -0.153694840 -0.085804597
## other_shops     0.087708302 -0.24033344  0.09192695  0.002751194  0.738531498
## Gender         -0.196617487 -0.28135924 -0.35122683 -0.257036171 -0.306921574
## Age             0.056826085 -0.36201176  0.08767070  0.349708269 -0.387112312
## Education       0.004030129 -0.14223843  0.26258524  0.554568267 -0.097308148
##                        PC6          PC7         PC8         PC9        PC10
## CS_helpful     -0.11077913 -0.035353541 -0.13007878  0.43856718 -0.09590230
## Recommend      -0.09553144 -0.200038529  0.01130160 -0.43984794 -0.62683843
## Come_again     -0.06572910 -0.024522862  0.23986864  0.10307364  0.19352387
## All_Products    0.46023149 -0.245244527 -0.28514611  0.25163505 -0.07413083
## Profesionalism  0.09677131 -0.297360901 -0.20638892  0.09904767  0.23742562
## Limitation     -0.29652333  0.331945940  0.14649416  0.25432284 -0.32279594
## Online_grocery  0.35598881  0.554513343 -0.34468239  0.11197454  0.07743250
## delivery        0.15452242  0.085950762  0.58313191 -0.17757789  0.44900412
## Pick_up         0.41357158 -0.220929987  0.11529403  0.09148473 -0.18348083
## Find_items      0.22151682  0.015221196 -0.20963596 -0.57238758  0.10243200
## other_shops     0.11847361  0.333249591  0.04002334  0.04516252 -0.05022230
## Gender          0.15664439  0.471694070 -0.01241550 -0.19824069 -0.17283668
## Age             0.26951115 -0.008307255  0.45046829  0.20951026 -0.27670798
## Education      -0.42807889  0.042929384 -0.24348136 -0.02132896  0.18341535
##                       PC11        PC12        PC13        PC14
## CS_helpful      0.08499678 -0.12853926  0.13765569 -0.65780467
## Recommend       0.10152978 -0.06719730 -0.01896875  0.09433582
## Come_again     -0.05106820  0.69346597 -0.10901925 -0.08073348
## All_Products    0.26555413 -0.12536909 -0.39652455  0.26816734
## Profesionalism -0.48073471 -0.20344701  0.29530718  0.32314938
## Limitation     -0.17311939  0.13086687 -0.01435426  0.45614659
## Online_grocery  0.10539622  0.22720433  0.15130596  0.17638419
## delivery        0.12003990 -0.30862260 -0.18974545  0.07741658
## Pick_up        -0.52442325  0.19195723 -0.32143825 -0.20177844
## Find_items     -0.16039580  0.22254458  0.32134565 -0.15561551
## other_shops    -0.18306875 -0.39928130  0.19565336 -0.13485229
## Gender         -0.21563958 -0.12285325 -0.42084814 -0.20852942
## Age            -0.19550324 -0.02689677  0.38447466  0.05500715
## Education      -0.45140171 -0.12388542 -0.30897450  0.02713011

dim(pca$x)

## [1] 22 14

biplot(pca, scale = 0)

Reverse PCA signs for alternative view

pca$rotation <- -pca$rotation
pca$x <- -pca$x
biplot(pca, scale = 0)

Variance explained

pca.var <- pca$sdev^2
pca.var

##  [1] 3.15516153 2.36937455 1.80032691 1.58118560 1.25825561 1.01606526
##  [7] 0.61220073 0.58061988 0.45306947 0.39158416 0.35984260 0.20333971
## [13] 0.18945885 0.02951515

pve <- pca.var / sum(pca.var)
pve

##  [1] 0.225368681 0.169241039 0.128594779 0.112941828 0.089875401 0.072576090
##  [7] 0.043728623 0.041472849 0.032362105 0.027970297 0.025703043 0.014524265
## [13] 0.013532775 0.002108225

plot(pve,
     xlab = "Principal Component",
     ylab = "Proportion of Variance Explained",
     ylim = c(0, 1),
     type = "b")

plot(cumsum(pve),
     xlab = "Principal Component",
     ylab = "Cumulative Proportion of Variance Explained",
     ylim = c(0, 1),
     type = "b")

Export PCA results

write.csv(pca_data, "pca_data.csv", row.names = FALSE)

PCA Discussion Questions

Think about at least one question you could answer using this result. Please cite the original source.
Answer: Your answer here.
Interpret the PCA graphs according to the required reading.
Answer: Your answer here.

References

James, G., Witten, D., Hastie, T., & Tibshirani, R. An Introduction to Statistical Learning with Applications in R.
Principal Component Methods in R: Practical Guide.
Penn State STAT 505 PCA notes.