Cluster Analysis - Hierarchical

A survey was conducted to study consumers' preferences for coffee, and the respondents are to be clustered on those preferences. They were asked to rate their agreement with the following statements on a 7-point scale (1: strongly disagree, 7: strongly agree).
x1: Coffee shall have attractive aroma and should be hot
x2: Coffee is not good for health
x3: I take coffee with breakfast or snacks
x4: I try to get the best offers when buying coffee
x5: I don’t care about coffee
x6: The cost of coffee matters to me

We load the dplyr package and wrap the data, already in the workspace as coffee, in a dplyr table.

library(dplyr)

coffee <- tbl_df(coffee)
coffee

## Source: local data frame [20 x 7]
## 
##    Case.No. x1 x2 x3 x4 x5 x6
## 1         1  6  4  7  3  2  3
## 2         2  2  3  1  4  5  4
## 3         3  7  2  6  4  1  3
## 4         4  4  6  4  5  3  6
## 5         5  1  3  2  2  6  4
## 6         6  6  4  6  3  3  4
## 7         7  5  3  6  3  3  4
## 8         8  7  3  7  4  1  4
## 9         9  2  4  3  3  6  3
## 10       10  3  5  3  6  4  6
## 11       11  1  3  2  3  5  3
## 12       12  5  4  5  4  2  4
## 13       13  2  2  1  5  4  4
## 14       14  4  6  4  6  4  7
## 15       15  6  5  4  2  1  4
## 16       16  3  5  4  6  4  7
## 17       17  4  4  7  2  2  5
## 18       18  3  7  2  6  4  3
## 19       19  4  6  3  7  2  7
## 20       20  2  3  2  4  7  2

glimpse(coffee)

## Variables:
## $ Case.No. (int) 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16...
## $ x1       (int) 6, 2, 7, 4, 1, 6, 5, 7, 2, 3, 1, 5, 2, 4, 6, 3, 4, 3,...
## $ x2       (int) 4, 3, 2, 6, 3, 4, 3, 3, 4, 5, 3, 4, 2, 6, 5, 5, 4, 7,...
## $ x3       (int) 7, 1, 6, 4, 2, 6, 6, 7, 3, 3, 2, 5, 1, 4, 4, 4, 7, 2,...
## $ x4       (int) 3, 4, 4, 5, 2, 3, 3, 4, 3, 6, 3, 4, 5, 6, 2, 6, 2, 6,...
## $ x5       (int) 2, 5, 1, 3, 6, 3, 3, 1, 6, 4, 5, 2, 4, 4, 1, 4, 2, 4,...
## $ x6       (int) 3, 4, 3, 6, 4, 4, 4, 4, 3, 6, 3, 4, 4, 7, 4, 7, 5, 3,...
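The step that reads the survey into coffee is not shown above. As a sketch for anyone who wants to reproduce the analysis, the data frame can be rebuilt by hand from the printed values. This construction is an assumption about how to recreate the object, not the original loading step, and it stores the ratings as doubles rather than integers.

```r
# Rebuild the 20 x 7 survey table from the values printed above.
# Assumption: this replaces the unseen loading step of the original post.
coffee <- data.frame(
  Case.No. = 1:20,
  x1 = c(6, 2, 7, 4, 1, 6, 5, 7, 2, 3, 1, 5, 2, 4, 6, 3, 4, 3, 4, 2),
  x2 = c(4, 3, 2, 6, 3, 4, 3, 3, 4, 5, 3, 4, 2, 6, 5, 5, 4, 7, 6, 3),
  x3 = c(7, 1, 6, 4, 2, 6, 6, 7, 3, 3, 2, 5, 1, 4, 4, 4, 7, 2, 3, 2),
  x4 = c(3, 4, 4, 5, 2, 3, 3, 4, 3, 6, 3, 4, 5, 6, 2, 6, 2, 6, 7, 4),
  x5 = c(2, 5, 1, 3, 6, 3, 3, 1, 6, 4, 5, 2, 4, 4, 1, 4, 2, 4, 2, 7),
  x6 = c(3, 4, 3, 6, 4, 4, 4, 4, 3, 6, 3, 4, 4, 7, 4, 7, 5, 3, 7, 2)
)

# Quick structural check: 20 respondents, an id column and six ratings.
str(coffee)
```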

We drop the identifier ‘Case.No.’ so that only the six ratings enter the distance calculation.

coffee_1 <- coffee %>%
  select(x1:x6)

coffee_1

## Source: local data frame [20 x 6]
## 
##    x1 x2 x3 x4 x5 x6
## 1   6  4  7  3  2  3
## 2   2  3  1  4  5  4
## 3   7  2  6  4  1  3
## 4   4  6  4  5  3  6
## 5   1  3  2  2  6  4
## 6   6  4  6  3  3  4
## 7   5  3  6  3  3  4
## 8   7  3  7  4  1  4
## 9   2  4  3  3  6  3
## 10  3  5  3  6  4  6
## 11  1  3  2  3  5  3
## 12  5  4  5  4  2  4
## 13  2  2  1  5  4  4
## 14  4  6  4  6  4  7
## 15  6  5  4  2  1  4
## 16  3  5  4  6  4  7
## 17  4  4  7  2  2  5
## 18  3  7  2  6  4  3
## 19  4  6  3  7  2  7
## 20  2  3  2  4  7  2

We perform Ward hierarchical clustering on the Euclidean distances between respondents.

d <- dist(coffee_1, method = "euclidean") # distance matrix
fit <- hclust(d, method = "ward.D") # "ward" was renamed "ward.D" in R 3.1.0
plot(fit)

[Figure: cluster dendrogram of the 20 respondents, from plot(fit)]
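R's hclust offers two Ward variants. According to its documentation, "ward.D" matches the older "ward" option, while "ward.D2" squares the dissimilarities before updating and is the one that implements Ward's criterion on plain Euclidean distances. The sketch below compares the two three-cluster solutions on the same data. The data frame is rebuilt from the printed values, and the names fit_d, fit_d2 and agreement are illustrative, not from the original analysis.

```r
# Ratings rebuilt from the printed table (assumption: stands in for coffee_1).
coffee_1 <- data.frame(
  x1 = c(6, 2, 7, 4, 1, 6, 5, 7, 2, 3, 1, 5, 2, 4, 6, 3, 4, 3, 4, 2),
  x2 = c(4, 3, 2, 6, 3, 4, 3, 3, 4, 5, 3, 4, 2, 6, 5, 5, 4, 7, 6, 3),
  x3 = c(7, 1, 6, 4, 2, 6, 6, 7, 3, 3, 2, 5, 1, 4, 4, 4, 7, 2, 3, 2),
  x4 = c(3, 4, 4, 5, 2, 3, 3, 4, 3, 6, 3, 4, 5, 6, 2, 6, 2, 6, 7, 4),
  x5 = c(2, 5, 1, 3, 6, 3, 3, 1, 6, 4, 5, 2, 4, 4, 1, 4, 2, 4, 2, 7),
  x6 = c(3, 4, 3, 6, 4, 4, 4, 4, 3, 6, 3, 4, 4, 7, 4, 7, 5, 3, 7, 2)
)

d <- dist(coffee_1, method = "euclidean")

# Fit both Ward variants on the same distance matrix.
fit_d  <- hclust(d, method = "ward.D")
fit_d2 <- hclust(d, method = "ward.D2")

# Cross-tabulate the three-cluster memberships of the two fits.
agreement <- table(ward.D = cutree(fit_d, k = 3),
                   ward.D2 = cutree(fit_d2, k = 3))
print(agreement)
```

If every row of the table has a single non-zero cell, the two variants assign the respondents to the same three groups.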

We cut the dendrogram into the number of clusters we want, here three.

groups <- cutree(fit, k = 3) # cut tree into 3 clusters
groups

##  [1] 1 2 1 3 2 1 1 1 2 3 2 1 2 3 1 3 1 3 3 2
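The choice of k = 3 is made by eye from the dendrogram. One common heuristic, sketched here as an aid rather than as the method used above, is to look at the heights of the last few merges: a large jump suggests cutting just below it. The data frame is rebuilt from the printed values, and the names top_heights and sizes are illustrative.

```r
# Ratings rebuilt from the printed table (assumption: stands in for coffee_1).
coffee_1 <- data.frame(
  x1 = c(6, 2, 7, 4, 1, 6, 5, 7, 2, 3, 1, 5, 2, 4, 6, 3, 4, 3, 4, 2),
  x2 = c(4, 3, 2, 6, 3, 4, 3, 3, 4, 5, 3, 4, 2, 6, 5, 5, 4, 7, 6, 3),
  x3 = c(7, 1, 6, 4, 2, 6, 6, 7, 3, 3, 2, 5, 1, 4, 4, 4, 7, 2, 3, 2),
  x4 = c(3, 4, 4, 5, 2, 3, 3, 4, 3, 6, 3, 4, 5, 6, 2, 6, 2, 6, 7, 4),
  x5 = c(2, 5, 1, 3, 6, 3, 3, 1, 6, 4, 5, 2, 4, 4, 1, 4, 2, 4, 2, 7),
  x6 = c(3, 4, 3, 6, 4, 4, 4, 4, 3, 6, 3, 4, 4, 7, 4, 7, 5, 3, 7, 2)
)

fit <- hclust(dist(coffee_1, method = "euclidean"), method = "ward.D")

# Heights of the last five merges, largest first; a big gap between
# consecutive values hints at a natural place to cut the tree.
top_heights <- rev(fit$height)[1:5]
print(top_heights)

# Cluster sizes for a few candidate values of k.
sizes <- lapply(2:4, function(k) table(cutree(fit, k = k)))
names(sizes) <- paste0("k=", 2:4)
print(sizes)
```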

We go ahead and add the new variable ‘groups’ to the original dataset.

coffee_clust <- data.frame(coffee, groups)
coffee_clust

##    Case.No. x1 x2 x3 x4 x5 x6 groups
## 1         1  6  4  7  3  2  3      1
## 2         2  2  3  1  4  5  4      2
## 3         3  7  2  6  4  1  3      1
## 4         4  4  6  4  5  3  6      3
## 5         5  1  3  2  2  6  4      2
## 6         6  6  4  6  3  3  4      1
## 7         7  5  3  6  3  3  4      1
## 8         8  7  3  7  4  1  4      1
## 9         9  2  4  3  3  6  3      2
## 10       10  3  5  3  6  4  6      3
## 11       11  1  3  2  3  5  3      2
## 12       12  5  4  5  4  2  4      1
## 13       13  2  2  1  5  4  4      2
## 14       14  4  6  4  6  4  7      3
## 15       15  6  5  4  2  1  4      1
## 16       16  3  5  4  6  4  7      3
## 17       17  4  4  7  2  2  5      1
## 18       18  3  7  2  6  4  3      3
## 19       19  4  6  3  7  2  7      3
## 20       20  2  3  2  4  7  2      2

We compute the mean of each statement within each cluster to characterize the clusters.

coffee_clust %>%
  group_by(groups) %>%
  summarise_each(funs(mean),x1, x2, x3, x4, x5, x6)

## Source: local data frame [3 x 7]
## 
##   groups    x1    x2    x3    x4    x5    x6
## 1      1 5.750 3.625 6.000 3.125 1.875 3.875
## 2      2 1.667 3.000 1.833 3.500 5.500 3.333
## 3      3 3.500 5.833 3.333 6.000 3.500 6.000

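We then look at which group has the highest mean for each statement, x1 to x6. In the table above, cluster 1 has the highest means on x1 and x3, cluster 2 on x5, and cluster 3 on x2, x4 and x6. Based on this, we can segment the customers and choose which group to target to expand the coffee market.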
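Reading off the highest mean per statement can also be done in code. The following base R sketch rebuilds the ratings and the cluster labels from the values printed earlier (an assumption made so the example runs on its own) and reports, for each statement, the cluster with the largest mean. The names cluster_means and top_cluster are illustrative.

```r
# Ratings and cluster labels copied from the printed output above.
coffee_clust <- data.frame(
  x1 = c(6, 2, 7, 4, 1, 6, 5, 7, 2, 3, 1, 5, 2, 4, 6, 3, 4, 3, 4, 2),
  x2 = c(4, 3, 2, 6, 3, 4, 3, 3, 4, 5, 3, 4, 2, 6, 5, 5, 4, 7, 6, 3),
  x3 = c(7, 1, 6, 4, 2, 6, 6, 7, 3, 3, 2, 5, 1, 4, 4, 4, 7, 2, 3, 2),
  x4 = c(3, 4, 4, 5, 2, 3, 3, 4, 3, 6, 3, 4, 5, 6, 2, 6, 2, 6, 7, 4),
  x5 = c(2, 5, 1, 3, 6, 3, 3, 1, 6, 4, 5, 2, 4, 4, 1, 4, 2, 4, 2, 7),
  x6 = c(3, 4, 3, 6, 4, 4, 4, 4, 3, 6, 3, 4, 4, 7, 4, 7, 5, 3, 7, 2),
  groups = c(1, 2, 1, 3, 2, 1, 1, 1, 2, 3, 2, 1, 2, 3, 1, 3, 1, 3, 3, 2)
)

# Mean rating of every statement within each cluster.
cluster_means <- aggregate(. ~ groups, data = coffee_clust, FUN = mean)

# For each statement, the label of the cluster with the highest mean.
top_cluster <- sapply(cluster_means[, -1],
                      function(m) cluster_means$groups[which.max(m)])
print(top_cluster)
# x1 x2 x3 x4 x5 x6
#  1  3  1  3  2  3
```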

Loy

Saturday, December 13, 2014