A survey was conducted to study consumers' preferences regarding coffee, with the aim of clustering the respondents by those preferences. Respondents were asked to rate their agreement with the following statements on a 7-point scale (1: strongly disagree, 7: strongly agree).
x1: Coffee should have an attractive aroma and be hot
x2: Coffee is not good for health
x3: I take coffee with breakfast or snacks
x4: I try to get the best offers when buying coffee
x5: I don’t care about coffee
x6: The cost of coffee matters to me
We load the dplyr package and convert the data, assumed to be already available as ‘coffee’, to a tibble.
library(dplyr)
coffee <- as_tibble(coffee) # tbl_df() is deprecated in current dplyr
coffee
## Source: local data frame [20 x 7]
##
## Case.No. x1 x2 x3 x4 x5 x6
## 1 1 6 4 7 3 2 3
## 2 2 2 3 1 4 5 4
## 3 3 7 2 6 4 1 3
## 4 4 4 6 4 5 3 6
## 5 5 1 3 2 2 6 4
## 6 6 6 4 6 3 3 4
## 7 7 5 3 6 3 3 4
## 8 8 7 3 7 4 1 4
## 9 9 2 4 3 3 6 3
## 10 10 3 5 3 6 4 6
## 11 11 1 3 2 3 5 3
## 12 12 5 4 5 4 2 4
## 13 13 2 2 1 5 4 4
## 14 14 4 6 4 6 4 7
## 15 15 6 5 4 2 1 4
## 16 16 3 5 4 6 4 7
## 17 17 4 4 7 2 2 5
## 18 18 3 7 2 6 4 3
## 19 19 4 6 3 7 2 7
## 20 20 2 3 2 4 7 2
glimpse(coffee)
## Variables:
## $ Case.No. (int) 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16...
## $ x1 (int) 6, 2, 7, 4, 1, 6, 5, 7, 2, 3, 1, 5, 2, 4, 6, 3, 4, 3,...
## $ x2 (int) 4, 3, 2, 6, 3, 4, 3, 3, 4, 5, 3, 4, 2, 6, 5, 5, 4, 7,...
## $ x3 (int) 7, 1, 6, 4, 2, 6, 6, 7, 3, 3, 2, 5, 1, 4, 4, 4, 7, 2,...
## $ x4 (int) 3, 4, 4, 5, 2, 3, 3, 4, 3, 6, 3, 4, 5, 6, 2, 6, 2, 6,...
## $ x5 (int) 2, 5, 1, 3, 6, 3, 3, 1, 6, 4, 5, 2, 4, 4, 1, 4, 2, 4,...
## $ x6 (int) 3, 4, 3, 6, 4, 4, 4, 4, 3, 6, 3, 4, 4, 7, 4, 7, 5, 3,...
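The listing above assumes that an object named ‘coffee’ already exists in the workspace; how it was read in is not shown. As a minimal sketch, the same 20 responses can be rebuilt by hand from the printed table. The use of matrix() and data.frame() here is an assumption for illustration, not necessarily how the data was originally loaded.

```r
# Rebuild the survey responses from the printed table (one row per respondent).
# Assumption: the original data source is not shown, so the values are typed in by hand.
responses <- matrix(c(
  6, 4, 7, 3, 2, 3,
  2, 3, 1, 4, 5, 4,
  7, 2, 6, 4, 1, 3,
  4, 6, 4, 5, 3, 6,
  1, 3, 2, 2, 6, 4,
  6, 4, 6, 3, 3, 4,
  5, 3, 6, 3, 3, 4,
  7, 3, 7, 4, 1, 4,
  2, 4, 3, 3, 6, 3,
  3, 5, 3, 6, 4, 6,
  1, 3, 2, 3, 5, 3,
  5, 4, 5, 4, 2, 4,
  2, 2, 1, 5, 4, 4,
  4, 6, 4, 6, 4, 7,
  6, 5, 4, 2, 1, 4,
  3, 5, 4, 6, 4, 7,
  4, 4, 7, 2, 2, 5,
  3, 7, 2, 6, 4, 3,
  4, 6, 3, 7, 2, 7,
  2, 3, 2, 4, 7, 2
), ncol = 6, byrow = TRUE, dimnames = list(NULL, paste0("x", 1:6)))
storage.mode(responses) <- "integer"  # the printed columns are integers

# Add the case identifier as the first column, as in the printed data.
coffee <- data.frame(Case.No. = 1:20, responses)
str(coffee)
```

With ‘coffee’ defined this way, the remaining steps can be run as written.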
We remove the variable ‘Case.No.’ before the cluster analysis: it is only a respondent identifier, and leaving it in would let it contribute to the distances between respondents.
coffee_1 <- coffee %>%
  select(x1:x6)
coffee_1
## Source: local data frame [20 x 6]
##
## x1 x2 x3 x4 x5 x6
## 1 6 4 7 3 2 3
## 2 2 3 1 4 5 4
## 3 7 2 6 4 1 3
## 4 4 6 4 5 3 6
## 5 1 3 2 2 6 4
## 6 6 4 6 3 3 4
## 7 5 3 6 3 3 4
## 8 7 3 7 4 1 4
## 9 2 4 3 3 6 3
## 10 3 5 3 6 4 6
## 11 1 3 2 3 5 3
## 12 5 4 5 4 2 4
## 13 2 2 1 5 4 4
## 14 4 6 4 6 4 7
## 15 6 5 4 2 1 4
## 16 3 5 4 6 4 7
## 17 4 4 7 2 2 5
## 18 3 7 2 6 4 3
## 19 4 6 3 7 2 7
## 20 2 3 2 4 7 2
We perform hierarchical clustering with Ward's method on the Euclidean distances between respondents.
d <- dist(coffee_1, method = "euclidean") # distance matrix
fit <- hclust(d, method = "ward.D") # "ward" was renamed "ward.D" in R 3.1.0
plot(fit)
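To make the distance step concrete, here is a small sketch that computes the Euclidean distance between respondents 1 and 2 by hand and checks it against dist(). The two response vectors are copied from the printed table; the variable names are illustrative only.

```r
# Responses of respondents 1 and 2, copied from the printed table.
r1 <- c(x1 = 6, x2 = 4, x3 = 7, x4 = 3, x5 = 2, x6 = 3)
r2 <- c(x1 = 2, x2 = 3, x3 = 1, x4 = 4, x5 = 5, x6 = 4)

# Euclidean distance: square root of the summed squared differences,
# here sqrt(16 + 1 + 36 + 1 + 9 + 1) = sqrt(64) = 8.
by_hand <- sqrt(sum((r1 - r2)^2))

# dist() computes the same quantity for every pair of rows.
via_dist <- as.numeric(dist(rbind(r1, r2), method = "euclidean"))

all.equal(by_hand, via_dist)
```

Because every statement uses the same 7-point scale, the raw differences are comparable across statements without rescaling.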
We inspect the dendrogram to decide how many clusters to keep, then cut the tree into that many groups.
groups <- cutree(fit, k = 3) # cut tree into 3 clusters
groups
## [1] 1 2 1 3 2 1 1 1 2 3 2 1 2 3 1 3 1 3 3 2
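The choice of k = 3 is read off the dendrogram by eye. As a rough sketch of how that choice could be checked, the merge heights and the cluster sizes for a few candidate values of k can be compared, and the chosen clusters outlined on the plot. The data are retyped from the printed table, and the range of candidate values (2 to 4) is an assumption for illustration.

```r
# Survey responses retyped from the printed table (assumption: source file not shown).
responses <- matrix(c(
  6, 4, 7, 3, 2, 3,   2, 3, 1, 4, 5, 4,   7, 2, 6, 4, 1, 3,   4, 6, 4, 5, 3, 6,
  1, 3, 2, 2, 6, 4,   6, 4, 6, 3, 3, 4,   5, 3, 6, 3, 3, 4,   7, 3, 7, 4, 1, 4,
  2, 4, 3, 3, 6, 3,   3, 5, 3, 6, 4, 6,   1, 3, 2, 3, 5, 3,   5, 4, 5, 4, 2, 4,
  2, 2, 1, 5, 4, 4,   4, 6, 4, 6, 4, 7,   6, 5, 4, 2, 1, 4,   3, 5, 4, 6, 4, 7,
  4, 4, 7, 2, 2, 5,   3, 7, 2, 6, 4, 3,   4, 6, 3, 7, 2, 7,   2, 3, 2, 4, 7, 2
), ncol = 6, byrow = TRUE, dimnames = list(NULL, paste0("x", 1:6)))

d <- dist(responses, method = "euclidean")
fit <- hclust(d, method = "ward.D")  # current name of the older "ward" method

# Heights of the last few merges; a large jump suggests stopping before that merge.
tail(fit$height, 5)

# Cluster sizes for a few candidate numbers of clusters.
sizes <- lapply(2:4, function(k) table(cutree(fit, k = k)))
names(sizes) <- paste0("k=", 2:4)
sizes

# Outline the three chosen clusters on the dendrogram.
plot(fit)
rect.hclust(fit, k = 3, border = "red")
```

This is only a visual and numerical aid; the final number of clusters remains a judgment call.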
We go ahead and add the new variable ‘groups’ to the original dataset.
coffee_clust <- data.frame(coffee, groups)
coffee_clust
## Case.No. x1 x2 x3 x4 x5 x6 groups
## 1 1 6 4 7 3 2 3 1
## 2 2 2 3 1 4 5 4 2
## 3 3 7 2 6 4 1 3 1
## 4 4 4 6 4 5 3 6 3
## 5 5 1 3 2 2 6 4 2
## 6 6 6 4 6 3 3 4 1
## 7 7 5 3 6 3 3 4 1
## 8 8 7 3 7 4 1 4 1
## 9 9 2 4 3 3 6 3 2
## 10 10 3 5 3 6 4 6 3
## 11 11 1 3 2 3 5 3 2
## 12 12 5 4 5 4 2 4 1
## 13 13 2 2 1 5 4 4 2
## 14 14 4 6 4 6 4 7 3
## 15 15 6 5 4 2 1 4 1
## 16 16 3 5 4 6 4 7 3
## 17 17 4 4 7 2 2 5 1
## 18 18 3 7 2 6 4 3 3
## 19 19 4 6 3 7 2 7 3
## 20 20 2 3 2 4 7 2 2
We compute the mean response to each statement within each cluster to characterize the groups.
coffee_clust %>%
  group_by(groups) %>%
  summarise(across(x1:x6, mean)) # summarise_each() and funs() are deprecated
## Source: local data frame [3 x 7]
##
## groups x1 x2 x3 x4 x5 x6
## 1 1 5.750 3.625 6.000 3.125 1.875 3.875
## 2 2 1.667 3.000 1.833 3.500 5.500 3.333
## 3 3 3.500 5.833 3.333 6.000 3.500 6.000
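We then see which group has the highest mean for each statement, x1 to x6. Group 1 scores highest on x1 (aroma and temperature) and x3 (coffee with breakfast or snacks) and lowest on x5, so these respondents care most about coffee. Group 2 scores highest on x5 and lowest on x1 and x3, so these respondents are largely indifferent to coffee. Group 3 scores highest on x2 (health concerns), x4 (offers) and x6 (cost), so these respondents are health-conscious and price-sensitive. Based on these profiles we can segment the customers and choose which group to target to expand the coffee market.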
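The comparison of group means can also be done programmatically. The sketch below retypes the printed table of cluster means and reports, for each statement, which group has the highest and the lowest mean. The object names are illustrative assumptions, and the means are the rounded values shown in the output above.

```r
# Cluster means copied from the printed summary (rounded to three decimals).
cluster_means <- matrix(c(
  5.750, 3.625, 6.000, 3.125, 1.875, 3.875,
  1.667, 3.000, 1.833, 3.500, 5.500, 3.333,
  3.500, 5.833, 3.333, 6.000, 3.500, 6.000
), nrow = 3, byrow = TRUE,
dimnames = list(c("1", "2", "3"), paste0("x", 1:6)))

# For each statement (column), the row index equals the group label.
top_cluster    <- apply(cluster_means, 2, which.max)  # strongest agreement
lowest_cluster <- apply(cluster_means, 2, which.min)  # weakest agreement

rbind(highest = top_cluster, lowest = lowest_cluster)
```

Reading across the result gives the same profiles as inspecting the table by eye: group 1 leads on x1 and x3, group 2 on x5, and group 3 on x2, x4 and x6.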