In this tutorial, we will use an example dataset,
osn_cluster.RData, to learn about cluster analysis. You can
find the following objects in the MSR package:
- osn: a data frame that you will use for clustering.
- cluster_mean: a function to calculate the means of variables across clusters.
- elbow_plot: a function that produces an elbow plot based on hierarchical clustering.
The data osn describe several characteristics of users
of an online social network (OSN) similar to Facebook. To preserve
privacy, the characteristics are transformed (standardized or
normalized). All variables are continuous.
- intrinsic: the intrinsic preferences of users towards the OSN;
- habit: to what extent users form a persistent habit of using the OSN;
- si_followers: how susceptible users are to the influence of their followers;
- si_friends: how susceptible users are to the influence of their friends;
- n_followers: (logged) no. of followers of users;
- n_friends: (logged) no. of friends of users.
An overview of the variables in the data:
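A sketch of how to produce this overview, assuming the osn data frame has been loaded (head() prints the first six rows):

# print the first six rows of the osn data frame
head(osn)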
## intrinsic habit si_followers si_friends n_followers n_friends
## 1 -0.4004632 0.1706246 1.0169258 0.8815938 3.135494 2.079442
## 2 1.4292221 1.8998624 0.8256795 -1.2979507 4.094345 3.761200
## 3 -0.4469281 -0.6304754 1.0156293 0.6537481 4.962845 4.653960
## 4 0.7341009 2.0418772 0.5320340 -0.1849733 1.791759 3.610918
## 5 3.6584396 1.8108407 0.2677690 -0.3595886 2.833213 2.708050
## 6 -0.8601568 2.7794144 3.0153064 -0.4454919 2.484907 2.484907
Step 1 - Selecting A Distance Measure
As discussed in class, the most important criterion for selecting a
distance measure is the measurement level. In our data, the variables
are all continuous. In theory, you may choose different measures that
are suitable for continuous variables. In our course, we will stick with
Euclidean distance. To calculate the measure, we use a function called
dist. The function takes in two key inputs:
- a data frame of the variables that you are using to calculate the distance measure;
- the distance measure to be used (method =). As we are using the Euclidean measure, we set method = "euclidean".
The function outputs an object of class dist, i.e., a
distance matrix. This will be used as an input to the cluster analysis
next.
Let’s check out the distance matrix between the first 5 users in the data.
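A minimal sketch of how to compute and inspect this matrix (the object name osn_dist matches the hclust() call in Step 2; converting to a matrix with as.matrix is one way to print a small part of it):

# compute the Euclidean distance matrix for all users
osn_dist <- dist(osn, method = "euclidean")
# view the distances between the first 5 users
as.matrix(osn_dist)[1:5, 1:5]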
## 1 2 3 4 5
## 1 0.000000 3.856515 3.265454 3.211316 4.664043
## 2 3.856515 0.000000 3.913958 2.674470 2.978036
## 3 3.265454 3.913958 0.000000 4.540631 5.720324
## 4 3.211316 2.674470 4.540631 0.000000 3.256571
## 5 4.664043 2.978036 5.720324 3.256571 0.000000
A distance matrix has the same number of rows and columns, with each
row (and column) representing one user in the OSN data. A distance
matrix is also symmetric: the distance between, for example, User 1
and User 2 (3.856515) is the same as the distance between
User 2 and User 1 (3.856515). Lastly, the distance from a
user to him/herself is 0, as seen on the diagonal above.
Step 2 - Selecting A Clustering Procedure
For our course, we will focus on hierarchical clustering and follow the agglomeration procedure. In addition, we use Ward’s method to decide how to combine clusters during the agglomeration. Throughout the course, we will stick to this combination.
After selecting the procedure, we run the hierarchical
clustering. Within R, you can use the function
hclust for hierarchical clustering. The
hclust function takes in two main inputs:
- a distance matrix object, usually created from the dist function;
- the clustering method (method =), for example, single (single linkage), complete (complete linkage), or ward.D2 (Ward’s method).
The function hclust outputs an object of class
hclust, which records the hierarchical clustering process.
For more details on hclust, please run ?hclust
or help(hclust) in R.
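A sketch of the call that produces the output below; osn_dist is the distance matrix from Step 1, and the result is stored in results, which we reuse in Step 3:

# hierarchical clustering with Ward's method on the Euclidean distances
results <- hclust(d = osn_dist, method = "ward.D2")
# print a short summary of the clustering
results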
##
## Call:
## hclust(d = osn_dist, method = "ward.D2")
##
## Cluster method : ward.D2
## Distance : euclidean
## Number of objects: 120
Step 3 - Determining the No. of Clusters
To determine the no. of clusters, we need to check the performance of the clustering at different stages of the hierarchical clustering. As discussed in class, a fundamental criterion for good clustering is homogeneity within clusters.
From the clustering results, we can obtain a measure of
within-cluster variation called height in
results. Given the homogeneity within
clusters criterion, a clustering is good if the within-cluster
variation is small (or height is small).
Within R, height is a vector of within-cluster variation
at different stages of the agglomeration procedure. Since we use the
agglomeration procedure, we start from the
initial stage, in which each and every consumer is in its own
cluster. In the osn data, this means we start with the
120 users in 120 clusters (each user in one cluster).
We gradually combine users and clusters until the end stage, in which
all consumers are in 1 cluster.
The values in height are arranged in the same order as
the agglomeration procedure. Therefore, we have the following two
observations:
- For the last value of height, we have 1 cluster; for the second-to-last, 2 clusters; for the third-to-last, 3 clusters…
- The within-cluster variation keeps increasing from the initial stage to the end stage. This is because at the initial stage, everyone is in their own cluster and we have the lowest within-cluster variation (0, or no within-cluster variation). In contrast, at the end stage, all consumers are in one cluster and we have the highest within-cluster variation.
Let’s check what height looks like. You will see
that it increases monotonically. Note that with 120 users, the
agglomeration makes 119 merges, so height contains 119 values.
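A sketch of how to extract these values (results is the hclust object from Step 2); we store them in within_cluster_variation, which we use again for the elbow plot below:

# within-cluster variation at each step of the agglomeration
within_cluster_variation <- results$height
within_cluster_variation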
## [1] 0.4780642 0.5293206 0.5612478 0.5750117 0.6032959 0.6612906
## [7] 0.7162642 0.7533355 0.7746848 0.7884269 0.7889602 0.8050062
## [13] 0.8052550 0.8409981 0.8599442 0.8631694 0.8637981 0.8882851
## [19] 0.9070446 0.9140462 0.9244842 0.9256939 0.9731676 1.0579897
## [25] 1.0591372 1.0634935 1.0652339 1.0684773 1.0711527 1.0785229
## [31] 1.0978222 1.1231807 1.1973810 1.2106981 1.2179649 1.2433307
## [37] 1.2516266 1.2531895 1.2551041 1.2636073 1.2695350 1.2721478
## [43] 1.2780557 1.2914777 1.2968819 1.3018216 1.3354961 1.3628748
## [49] 1.3701394 1.3747623 1.3779429 1.3981233 1.4179611 1.4281549
## [55] 1.4425011 1.4608043 1.5504601 1.5833322 1.6001816 1.6021762
## [61] 1.6079447 1.6513533 1.7203470 1.7383765 1.7557230 1.7806412
## [67] 1.8293025 1.8479023 1.8732198 1.9335654 1.9346242 2.0849719
## [73] 2.1028503 2.1446351 2.1812540 2.2086823 2.2134725 2.2731686
## [79] 2.2821196 2.3556332 2.3865849 2.4174510 2.4254254 2.4258883
## [85] 2.4355663 2.5026708 2.5856381 2.6873940 2.7382265 2.9284161
## [91] 2.9475014 3.1660506 3.2016780 3.2874132 3.3595585 3.4215631
## [97] 3.5111259 3.5501055 3.6923737 3.7185953 3.8237255 4.2440969
## [103] 4.3289215 4.7527831 4.9086087 4.9938644 5.2502399 5.4908632
## [109] 5.7717751 6.2197486 6.6424875 7.5016324 7.8763714 8.0262806
## [115] 8.5655845 8.8949807 9.8969601 13.9601297 16.5567531
To see this more clearly, we can draw an elbow plot of these values.
To create an elbow plot with within_cluster_variation,
you can use the function elbow_plot shared in the
MSR package. The function takes height from
the hclust() results as input and outputs an elbow plot
with the no. of clusters ranging, by default, from 1 to 10. The reason
for checking 1 to 10 clusters is that, in practice, we only want a few
clusters. It is not useful to have too many clusters; for example,
having 100 clusters would not be useful for marketers. As a
rule of thumb, the no. of clusters should not be larger than 10.
# we already created a variable called within_cluster_variation
# within_cluster_variation <- results$height, so we use this variable
elbow_plot(within_cluster_variation)

Now we apply the elbow criterion to find the elbow point. For a point to be an elbow point, it must be that:
- before the elbow point, we have a big decrease in within-cluster variation;
- after the elbow point, we have a small decrease in within-cluster variation.
A closer look shows that no_of_clusters = 3
is the elbow point, as marked in the plot below.
Note: in practice, the choice of the no. of clusters is a rather complex decision. Of course, it makes sense to use the elbow plot as the statistical basis. However, in practice, the elbow point may not be so obvious. In that case, managers must rely on their own knowledge and understanding of consumers to make a decision. We always allow for some subjectivity in this decision.
Step 4 - Validating the Clustering
After determining the no. of clusters, we first obtain the clustering
when the no. of clusters is 3 (the elbow point). For this purpose, we
use a function named cutree (please use
?cutree to see the details). This function takes two key
inputs:
- a hierarchical clustering obtained from hclust;
- the no. of clusters to use (e.g., 3 in our case).
The function outputs a vector telling you which cluster each and
every consumer belongs to. In our case, we name the output
clust_3. clust_3 is a vector of 120 values, with
each value telling you which cluster a user belongs to. For example, the
first value of clust_3 is 1, implying the first
user in our data belongs to cluster 1.
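A sketch of this step (results is the clustering from Step 2; we store the output in clust_3):

# cut the tree into 3 clusters
clust_3 <- cutree(results, k = 3)
# print the cluster membership of all 120 users
clust_3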
## [1] 1 2 3 1 2 1 3 3 1 3 3 1 1 3 1 1 1 2 1 3 1 3 2 1 3 1 3 3 2 3 3 1 3 3 2 3 1
## [38] 2 1 2 1 1 1 2 2 3 3 3 1 1 2 2 2 1 1 2 3 1 3 2 2 3 3 2 3 1 1 1 1 3 3 1 1 2
## [75] 1 1 1 2 1 3 1 2 1 3 2 1 1 1 3 1 3 1 1 3 1 1 3 2 2 1 1 1 2 2 3 1 3 1 1 3 1
## [112] 3 1 1 1 1 2 3 2 1
Since clust_3 is a discrete variable with three levels
(i.e., 3 clusters), we should let R know it’s a factor by using the
as.factor function. For more details on
as.factor, please check the help file with
?as.factor in your R console.
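A sketch of the conversion; str() is one way to produce the check below:

# convert the cluster assignment into a factor with 3 levels
clust_3 <- as.factor(clust_3)
# check the structure of the factor
str(clust_3)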
## Factor w/ 3 levels "1","2","3": 1 2 3 1 2 1 3 3 1 3 ...
To validate this solution, clust_3, we check whether all
variables in the osn data indeed differ significantly across the 3
clusters in clust_3. This is an application of the
clustering criterion heterogeneity between clusters:
we want our clusters to be as different from each other as possible.
If a variable does not differ across clusters, we will remove it from
the interpretation in the next step. Given that the variables in osn
are all continuous, we run an ANOVA to compare the means of the variables
across the 3 clusters (groups) in clust_3. For example, we compare
the average no. of friends across the 3 clusters.
You may run the ANOVA for each variable one by one. For example, to
compare the average no. of friends across the 3 clusters, we can use
aov(n_friends ~ clust_3, osn). An easier approach is to run
the ANOVA for all variables in osn with
cbind(), as follows.
Note: cbind is a useful function which you can use
within an R formula. It combines different variables, so you can run an
ANOVA with multiple dependent variables in one go.
# run anova on all the variables
anova_osn <- aov(cbind(intrinsic,
habit,
si_followers,
si_friends,
n_followers,
n_friends) ~ clust_3, data = osn)
# get a summary of anova results
summary(anova_osn)

## Response intrinsic :
## Df Sum Sq Mean Sq F value Pr(>F)
## clust_3 2 31.946 15.9728 20.866 1.778e-08 ***
## Residuals 117 89.561 0.7655
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response habit :
## Df Sum Sq Mean Sq F value Pr(>F)
## clust_3 2 24.849 12.4244 11.676 2.381e-05 ***
## Residuals 117 124.502 1.0641
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response si_followers :
## Df Sum Sq Mean Sq F value Pr(>F)
## clust_3 2 40.348 20.1742 20.617 2.137e-08 ***
## Residuals 117 114.488 0.9785
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response si_friends :
## Df Sum Sq Mean Sq F value Pr(>F)
## clust_3 2 47.482 23.7408 48.476 4.628e-16 ***
## Residuals 117 57.300 0.4897
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response n_followers :
## Df Sum Sq Mean Sq F value Pr(>F)
## clust_3 2 57.938 28.9690 52.33 < 2.2e-16 ***
## Residuals 117 64.770 0.5536
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response n_friends :
## Df Sum Sq Mean Sq F value Pr(>F)
## clust_3 2 31.943 15.9717 22.666 4.788e-09 ***
## Residuals 117 82.444 0.7046
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the results, all p-values of the
F-statistics are smaller than 0.05, so all variables
differ significantly across the 3 clusters in clust_3.
Therefore, we will include all variables in the interpretation of the
clusters.
Step 5 - Interpreting Clusters
To interpret the clusters, we first obtain the means of variables
across the 3 clusters in clust_3. To do this, you are given
a function cluster_mean in the MSR package.
You can use this function to obtain cluster means of all the variables
in the osn data.
The function takes in an ANOVA object and outputs the cluster means.
In our analysis, the ANOVA results are stored in anova_osn
from Step 4. We will use it as the input of the function.
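A sketch of the call, using cluster_mean from the MSR package as described above:

# calculate cluster means of all variables from the ANOVA results
cluster_mean(anova_osn)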
## cluster_1 cluster_2 cluster_3
## intrinsic 0.08165014 1.3019292 0.9125530
## habit 1.42790041 1.5471999 0.4864840
## si_followers 0.34646713 1.1677341 -0.4335153
## si_friends 0.26362564 -0.9936173 0.7091600
## n_followers 2.71598773 3.9012587 4.2257967
## n_friends 2.83797460 3.2689184 4.0345711
Note: remember to remove insignificant variables from this table before doing the actual interpretation.
To name the clusters, we first compare the clusters along the different
variables, one variable at a time. We can mark the highest and
lowest cluster means of each variable. For example, for
intrinsic, the intrinsic preferences for the OSN,
cluster_1 has the lowest mean and cluster_2 the
highest. If we mark the highest means in red and the lowest in blue, we
get a table as below (you can do this in Excel or other
software).
From the table, we can interpret the clusters. For example,
Cluster 1 has the lowest no. of followers and friends, as well as
the lowest intrinsic preferences. This means Cluster 1 users are not
strongly attracted to the online social network, and they are not
well connected on the social network. So, we name Cluster 1
Lurkers (you may come up with other names as you see fit). You
can follow a similar procedure for Clusters 2 and 3.
Overall, make sure your cluster names are:
- Accurate: reflecting the features of clusters.
- Catchy: easy to remember and communicate.