By the end of this lab session, you should be able to:
For hierarchical clustering and exploratory data analysis, read Chapter 12, “Cluster Analysis,” from An Introduction to Statistical Learning with Applications in R by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani.
Reference: James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning with Applications in R.
Market segmentation divides a broad target market into smaller, more similar groups. Clustering is a common technique for market segmentation because it automatically finds similar groups in a data set.
The file used in this analysis is
customer_segmentation (1).csv.
mydata <- read_csv("customer_segmentation (1).csv")
str(mydata)
## spc_tbl_ [22 × 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ID : num [1:22] 1 2 3 4 5 6 7 8 9 10 ...
## $ CS_helpful : num [1:22] 2 1 2 3 2 1 2 1 1 1 ...
## $ Recommend : num [1:22] 2 2 1 3 1 1 1 1 1 1 ...
## $ Come_again : num [1:22] 2 1 1 2 3 3 1 1 1 1 ...
## $ All_Products : num [1:22] 2 1 1 4 5 2 2 2 2 1 ...
## $ Profesionalism: num [1:22] 2 1 1 1 2 1 2 1 2 1 ...
## $ Limitation : num [1:22] 2 1 2 2 1 1 1 2 1 1 ...
## $ Online_grocery: num [1:22] 2 2 3 3 2 1 2 1 2 3 ...
## $ delivery : num [1:22] 3 3 3 3 3 2 2 1 1 2 ...
## $ Pick_up : num [1:22] 4 3 2 2 1 1 2 2 3 2 ...
## $ Find_items : num [1:22] 1 1 1 2 2 1 1 2 1 1 ...
## $ other_shops : num [1:22] 2 2 3 2 3 4 1 4 1 1 ...
## $ Gender : num [1:22] 1 1 1 1 2 1 1 1 2 2 ...
## $ Age : num [1:22] 2 2 2 3 4 2 2 2 2 2 ...
## $ Education : num [1:22] 2 2 2 5 2 5 3 2 1 2 ...
## - attr(*, "spec")=
## .. cols(
## .. ID = col_double(),
## .. CS_helpful = col_double(),
## .. Recommend = col_double(),
## .. Come_again = col_double(),
## .. All_Products = col_double(),
## .. Profesionalism = col_double(),
## .. Limitation = col_double(),
## .. Online_grocery = col_double(),
## .. delivery = col_double(),
## .. Pick_up = col_double(),
## .. Find_items = col_double(),
## .. other_shops = col_double(),
## .. Gender = col_double(),
## .. Age = col_double(),
## .. Education = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
summary(mydata)
## ID CS_helpful Recommend Come_again
## Min. : 1.00 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.: 6.25 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000
## Median :11.50 Median :1.000 Median :1.000 Median :1.000
## Mean :11.50 Mean :1.591 Mean :1.318 Mean :1.455
## 3rd Qu.:16.75 3rd Qu.:2.000 3rd Qu.:1.000 3rd Qu.:2.000
## Max. :22.00 Max. :3.000 Max. :3.000 Max. :3.000
## All_Products Profesionalism Limitation Online_grocery delivery
## Min. :1.000 Min. :1.000 Min. :1.0 Min. :1.000 Min. :1.000
## 1st Qu.:1.250 1st Qu.:1.000 1st Qu.:1.0 1st Qu.:2.000 1st Qu.:2.000
## Median :2.000 Median :1.000 Median :1.0 Median :2.000 Median :3.000
## Mean :2.091 Mean :1.409 Mean :1.5 Mean :2.273 Mean :2.409
## 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:2.0 3rd Qu.:3.000 3rd Qu.:3.000
## Max. :5.000 Max. :3.000 Max. :4.0 Max. :3.000 Max. :3.000
## Pick_up Find_items other_shops Gender
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:1.250 1st Qu.:1.000
## Median :2.000 Median :1.000 Median :2.000 Median :1.000
## Mean :2.455 Mean :1.455 Mean :2.591 Mean :1.273
## 3rd Qu.:3.000 3rd Qu.:2.000 3rd Qu.:3.750 3rd Qu.:1.750
## Max. :5.000 Max. :3.000 Max. :5.000 Max. :2.000
## Age Education
## Min. :2.000 Min. :1.000
## 1st Qu.:2.000 1st Qu.:2.000
## Median :2.000 Median :2.500
## Mean :2.455 Mean :3.182
## 3rd Qu.:3.000 3rd Qu.:5.000
## Max. :4.000 Max. :5.000
head(mydata)
## # A tibble: 6 × 15
## ID CS_helpful Recommend Come_again All_Products Profesionalism Limitation
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 2 2 2 2 2 2
## 2 2 1 2 1 1 1 1
## 3 3 2 1 1 1 1 2
## 4 4 3 3 2 4 1 2
## 5 5 2 1 3 5 2 1
## 6 6 1 1 3 2 1 1
## # ℹ 8 more variables: Online_grocery <dbl>, delivery <dbl>, Pick_up <dbl>,
## # Find_items <dbl>, other_shops <dbl>, Gender <dbl>, Age <dbl>,
## # Education <dbl>
use <- scale(mydata[, -1], center = TRUE, scale = TRUE)
head(use)
## CS_helpful Recommend Come_again All_Products Profesionalism Limitation
## [1,] 0.5572385 1.0548991 0.7385489 -0.08536162 1.0009877 0.6236096
## [2,] -0.8049001 1.0548991 -0.6154575 -1.02433946 -0.6929915 -0.6236096
## [3,] 0.5572385 -0.4922862 -0.6154575 -1.02433946 -0.6929915 0.6236096
## [4,] 1.9193772 2.6020844 0.7385489 1.79259406 -0.6929915 0.6236096
## [5,] 0.5572385 -0.4922862 2.0925553 2.73157191 1.0009877 -0.6236096
## [6,] -0.8049001 -0.4922862 2.0925553 -0.08536162 -0.6929915 -0.6236096
## Online_grocery delivery Pick_up Find_items other_shops Gender
## [1,] -0.3554390 0.8049001 1.4623535 -0.6774335 -0.4212692 -0.598293
## [2,] -0.3554390 0.8049001 0.5161248 -0.6774335 -0.4212692 -0.598293
## [3,] 0.9478374 0.8049001 -0.4301040 -0.6774335 0.2916479 -0.598293
## [4,] 0.9478374 0.8049001 -0.4301040 0.8129201 -0.4212692 -0.598293
## [5,] -0.3554390 0.8049001 -1.3763327 0.8129201 0.2916479 1.595448
## [6,] -1.6587154 -0.5572385 -1.3763327 -0.6774335 1.0045650 -0.598293
## Age Education
## [1,] -0.6154575 -0.7284586
## [2,] -0.6154575 -0.7284586
## [3,] -0.6154575 -0.7284586
## [4,] 0.7385489 1.1207055
## [5,] 2.0925553 -0.7284586
## [6,] -0.6154575 1.1207055
dist_matrix <- dist(use)
seg.hclust <- hclust(dist_matrix, method = "complete")
plot(seg.hclust, main = "Hierarchical Clustering Dendrogram")
groups.3 <- cutree(seg.hclust, k = 3)
table(groups.3)
## groups.3
## 1 2 3
## 17 3 2
mydata$ID[groups.3 == 1]
## [1] 1 2 3 6 7 8 9 10 11 12 13 14 15 16 17 18 21
mydata$ID[groups.3 == 2]
## [1] 4 20 22
mydata$ID[groups.3 == 3]
## [1] 5 19
aggregate(mydata[, -1], list(Cluster = groups.3), median)
## Cluster CS_helpful Recommend Come_again All_Products Profesionalism
## 1 1 1.0 1 1.0 2 1
## 2 2 3.0 3 2.0 2 2
## 3 3 2.5 1 2.5 4 2
## Limitation Online_grocery delivery Pick_up Find_items other_shops Gender Age
## 1 1.0 2.0 2 3 1.0 2.0 1 2
## 2 2.0 3.0 3 2 2.0 2.0 1 2
## 3 2.5 1.5 3 1 2.5 2.5 2 3
## Education
## 1 2
## 2 5
## 3 2
aggregate(mydata[, -1], list(Cluster = groups.3), mean)
## Cluster CS_helpful Recommend Come_again All_Products Profesionalism
## 1 1 1.294118 1.117647 1.235294 1.823529 1.235294
## 2 2 2.666667 2.666667 2.000000 2.333333 2.000000
## 3 3 2.500000 1.000000 2.500000 4.000000 2.000000
## Limitation Online_grocery delivery Pick_up Find_items other_shops Gender
## 1 1.352941 2.235294 2.235294 2.705882 1.294118 2.647059 1.176471
## 2 1.666667 3.000000 3.000000 2.000000 1.666667 2.333333 1.333333
## 3 2.500000 1.500000 3.000000 1.000000 2.500000 2.500000 2.000000
## Age Education
## 1 2.411765 3.117647
## 2 2.333333 4.333333
## 3 3.000000 2.000000
cluster_means <- aggregate(mydata[, -1], list(Cluster = groups.3), mean)
cluster_means
## Cluster CS_helpful Recommend Come_again All_Products Profesionalism
## 1 1 1.294118 1.117647 1.235294 1.823529 1.235294
## 2 2 2.666667 2.666667 2.000000 2.333333 2.000000
## 3 3 2.500000 1.000000 2.500000 4.000000 2.000000
## Limitation Online_grocery delivery Pick_up Find_items other_shops Gender
## 1 1.352941 2.235294 2.235294 2.705882 1.294118 2.647059 1.176471
## 2 1.666667 3.000000 3.000000 2.000000 1.666667 2.333333 1.333333
## 3 2.500000 1.500000 3.000000 1.000000 2.500000 2.500000 2.000000
## Age Education
## 1 2.411765 3.117647
## 2 2.333333 4.333333
## 3 3.000000 2.000000
write.csv(data.frame(ID = mydata$ID, Cluster = groups.3), "clusterID.csv", row.names = FALSE)
write.csv(cluster_means, "cluster_means.csv", row.names = FALSE)
How many observations do we have in each cluster?
Answer: Your answer here.
Why is it important to look at the medians or means for the
variables in each cluster?
Answer: Your answer here.
Should mean or median be used when analyzing the differences
among clusters? Why?
Answer: Your answer here.
What summary measures of each cluster are appropriate for
building a targeting strategy?
Answer: Your answer here.
What are the major differences between K-means clustering and
hierarchical clustering? Which one do you prefer, and why?
Answer: Your answer here.
Do a keyword search using “cluster analysis.” How many relevant
job titles are there?
Answer: Your answer here.
Should we use mydata or mydata[, -1] with
the aggregate() function? Why?
Answer: Your answer here.
Principal Component Analysis (PCA) helps identify the most important features in a data set and can be used alongside cluster analysis.
fit <- kmeans(mydata[, -1], centers = 3, iter.max = 1000)
table(fit$cluster)
##
## 1 2 3
## 10 9 3
barplot(table(fit$cluster), main = "Cluster Sizes")
pca <- prcomp(mydata[, -1], scale. = TRUE)
pca_data <- mutate(fortify(pca), Cluster = fit$cluster)
head(pca_data)
## CS_helpful Recommend Come_again All_Products Profesionalism Limitation
## 1 2 2 2 2 2 2
## 2 1 2 1 1 1 1
## 3 2 1 1 1 1 2
## 4 3 3 2 4 1 2
## 5 2 1 3 5 2 1
## 6 1 1 3 2 1 1
## Online_grocery delivery Pick_up Find_items other_shops Gender Age Education
## 1 2 3 4 1 2 1 2 2
## 2 2 3 3 1 2 1 2 2
## 3 3 3 2 1 3 1 2 2
## 4 3 3 2 2 2 1 3 5
## 5 2 3 1 2 3 2 4 2
## 6 1 2 1 1 4 1 2 5
## PC1 PC2 PC3 PC4 PC5 PC6
## 1 -1.1406348 1.59673228 -0.63410651 -0.280229354 0.86904958 0.1165197
## 2 0.9629067 1.40574597 -0.65120805 0.161359334 0.31849449 -0.2611656
## 3 0.2728698 1.51351625 0.02598032 -0.184584915 0.26137384 -0.4770065
## 4 -2.8618142 0.08875892 1.45597311 2.476371371 0.07321515 0.1029394
## 5 -2.5614717 -3.40830515 -0.15453453 -0.006377382 -0.54039721 2.1556752
## 6 0.6417214 -2.31064647 -0.10485209 0.238576376 1.63185131 -1.9389726
## PC7 PC8 PC9 PC10 PC11 PC12
## 1 -1.2286092 0.8188759 0.417757947 -0.3716437 -0.4388740 0.40759767
## 2 -0.6182053 0.9968371 -1.127019947 -0.2594187 0.7916425 -0.23877274
## 3 1.2264580 0.4550988 0.559657368 0.4159135 1.0374887 -0.31685456
## 4 -0.3114923 -0.3468342 0.003376173 -1.4482133 0.9383764 0.23674647
## 5 -0.1574697 0.7216815 0.681397975 0.5632122 0.4154912 0.09390993
## 6 -0.4725143 0.4047234 0.016685289 1.0671387 0.3414491 0.58800229
## PC13 PC14 Cluster
## 1 -0.4397615 0.19628938 1
## 2 -0.2855143 0.02442650 1
## 3 0.5542694 -0.12396750 1
## 4 -0.2928463 -0.09311831 2
## 5 -0.0083104 0.06374110 3
## 6 -0.5465203 -0.18393394 2
ggplot(pca_data, aes(x = PC1, y = PC2, fill = factor(Cluster))) +
geom_point(size = 3, color = "gray40", shape = 21) +
theme_bw() +
labs(title = "PCA Plot with Cluster Membership", fill = "Cluster")
autoplot(fit, data = mydata[, -1], frame = TRUE, frame.type = "norm")
names(pca)
## [1] "sdev" "rotation" "center" "scale" "x"
pca$center
## CS_helpful Recommend Come_again All_Products Profesionalism
## 1.590909 1.318182 1.454545 2.090909 1.409091
## Limitation Online_grocery delivery Pick_up Find_items
## 1.500000 2.272727 2.409091 2.454545 1.454545
## other_shops Gender Age Education
## 2.590909 1.272727 2.454545 3.181818
pca$scale
## CS_helpful Recommend Come_again All_Products Profesionalism
## 0.7341397 0.6463350 0.7385489 1.0649879 0.5903261
## Limitation Online_grocery delivery Pick_up Find_items
## 0.8017837 0.7672969 0.7341397 1.0568269 0.6709817
## other_shops Gender Age Education
## 1.4026876 0.4558423 0.7385489 1.6223547
pca$rotation
## PC1 PC2 PC3 PC4 PC5
## CS_helpful -0.488254060 0.18353687 0.09973845 0.045221127 -0.092443591
## Recommend -0.330197677 0.13991354 -0.19892372 0.358613745 0.208505096
## Come_again -0.326085356 -0.34041476 -0.18584895 0.116146481 0.342514053
## All_Products -0.237688878 -0.33206544 0.30137894 0.022875225 0.066485862
## Profesionalism -0.369807437 0.03477990 -0.41101054 -0.149688188 -0.001503016
## Limitation -0.276227449 0.18864661 0.36353878 -0.334396804 0.017461769
## Online_grocery -0.043475182 0.32978681 -0.14782950 0.422865900 -0.019831184
## delivery -0.351938301 0.28759967 0.12110867 0.150376344 -0.006723563
## Pick_up 0.208402706 0.44334883 0.09799661 -0.011935578 0.138495611
## Find_items -0.240648470 -0.08690804 0.51908591 -0.153694840 -0.085804597
## other_shops 0.087708302 -0.24033344 0.09192695 0.002751194 0.738531498
## Gender -0.196617487 -0.28135924 -0.35122683 -0.257036171 -0.306921574
## Age 0.056826085 -0.36201176 0.08767070 0.349708269 -0.387112312
## Education 0.004030129 -0.14223843 0.26258524 0.554568267 -0.097308148
## PC6 PC7 PC8 PC9 PC10
## CS_helpful -0.11077913 -0.035353541 -0.13007878 0.43856718 -0.09590230
## Recommend -0.09553144 -0.200038529 0.01130160 -0.43984794 -0.62683843
## Come_again -0.06572910 -0.024522862 0.23986864 0.10307364 0.19352387
## All_Products 0.46023149 -0.245244527 -0.28514611 0.25163505 -0.07413083
## Profesionalism 0.09677131 -0.297360901 -0.20638892 0.09904767 0.23742562
## Limitation -0.29652333 0.331945940 0.14649416 0.25432284 -0.32279594
## Online_grocery 0.35598881 0.554513343 -0.34468239 0.11197454 0.07743250
## delivery 0.15452242 0.085950762 0.58313191 -0.17757789 0.44900412
## Pick_up 0.41357158 -0.220929987 0.11529403 0.09148473 -0.18348083
## Find_items 0.22151682 0.015221196 -0.20963596 -0.57238758 0.10243200
## other_shops 0.11847361 0.333249591 0.04002334 0.04516252 -0.05022230
## Gender 0.15664439 0.471694070 -0.01241550 -0.19824069 -0.17283668
## Age 0.26951115 -0.008307255 0.45046829 0.20951026 -0.27670798
## Education -0.42807889 0.042929384 -0.24348136 -0.02132896 0.18341535
## PC11 PC12 PC13 PC14
## CS_helpful 0.08499678 -0.12853926 0.13765569 -0.65780467
## Recommend 0.10152978 -0.06719730 -0.01896875 0.09433582
## Come_again -0.05106820 0.69346597 -0.10901925 -0.08073348
## All_Products 0.26555413 -0.12536909 -0.39652455 0.26816734
## Profesionalism -0.48073471 -0.20344701 0.29530718 0.32314938
## Limitation -0.17311939 0.13086687 -0.01435426 0.45614659
## Online_grocery 0.10539622 0.22720433 0.15130596 0.17638419
## delivery 0.12003990 -0.30862260 -0.18974545 0.07741658
## Pick_up -0.52442325 0.19195723 -0.32143825 -0.20177844
## Find_items -0.16039580 0.22254458 0.32134565 -0.15561551
## other_shops -0.18306875 -0.39928130 0.19565336 -0.13485229
## Gender -0.21563958 -0.12285325 -0.42084814 -0.20852942
## Age -0.19550324 -0.02689677 0.38447466 0.05500715
## Education -0.45140171 -0.12388542 -0.30897450 0.02713011
dim(pca$x)
## [1] 22 14
biplot(pca, scale = 0)
pca$rotation <- -pca$rotation
pca$x <- -pca$x
biplot(pca, scale = 0)
pca.var <- pca$sdev^2
pca.var
## [1] 3.15516153 2.36937455 1.80032691 1.58118560 1.25825561 1.01606526
## [7] 0.61220073 0.58061988 0.45306947 0.39158416 0.35984260 0.20333971
## [13] 0.18945885 0.02951515
pve <- pca.var / sum(pca.var)
pve
## [1] 0.225368681 0.169241039 0.128594779 0.112941828 0.089875401 0.072576090
## [7] 0.043728623 0.041472849 0.032362105 0.027970297 0.025703043 0.014524265
## [13] 0.013532775 0.002108225
plot(pve,
xlab = "Principal Component",
ylab = "Proportion of Variance Explained",
ylim = c(0, 1),
type = "b")
plot(cumsum(pve),
xlab = "Principal Component",
ylab = "Cumulative Proportion of Variance Explained",
ylim = c(0, 1),
type = "b")
write.csv(pca_data, "pca_data.csv", row.names = FALSE)
Think about at least one question you could answer using this
result. Please cite the original source.
Answer: Your answer here.
Interpret the PCA graphs according to the required reading.
Answer: Your answer here.