In this tutorial, we will use an example dataset,
osn_cluster.RData, to learn about cluster analysis. You can
find the following objects in the MSR package:
- osn: a data frame that you will use for clustering.
- cluster_mean: a function to calculate the means of variables across clusters.
- elbow_plot: a function that produces an elbow plot based on hierarchical clustering.
The data osn describe several characteristics of users
of an online social network (OSN) similar to Facebook. To preserve
privacy, the characteristics are transformed (standardized or
normalized). All variables are continuous.
- intrinsic: the intrinsic preferences of users towards the OSN;
- habit: to what extent users form a persistent habit of using the OSN;
- si_followers: how susceptible users are to the influence of their followers;
- si_friends: how susceptible users are to the influence of their friends;
- n_followers: (logged) no. of followers of users;
- n_friends: (logged) no. of friends of users.
An overview of the variables in the data:
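A sketch of how to produce this overview, assuming the osn data frame has been loaded (head() prints the first six rows):

# print the first six rows of the osn data frame
head(osn)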
## intrinsic habit si_followers si_friends n_followers n_friends
## 1 -0.4004632 0.1706246 1.0169258 0.8815938 3.135494 2.079442
## 2 1.4292221 1.8998624 0.8256795 -1.2979507 4.094345 3.761200
## 3 -0.4469281 -0.6304754 1.0156293 0.6537481 4.962845 4.653960
## 4 0.7341009 2.0418772 0.5320340 -0.1849733 1.791759 3.610918
## 5 3.6584396 1.8108407 0.2677690 -0.3595886 2.833213 2.708050
## 6 -0.8601568 2.7794144 3.0153064 -0.4454919 2.484907 2.484907
Step 1 - Selecting A Distance Measure
As discussed in class, the most important criterion for selecting a
distance measure is the measurement level. In our data, the variables
are all continuous. In theory, you may choose different measures that
are suitable for continuous variables. In our course, we will stick with
Euclidean distance. To calculate the measure, we use a function called
dist. The function takes in two key inputs:
- a data frame of the variables that you are using to calculate the distance measure;
- the distance measure to be used (method =). As we are using the Euclidean measure, we set method = "euclidean".
The function outputs an object of class dist, i.e., a
distance matrix. This will be used as an input to the cluster analysis
next.
Let’s check out the distance matrix between the first 5 users in the data.
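A minimal sketch of how to compute and inspect this matrix (the object name osn_dist matches the hclust() call in Step 2; converting to a matrix with as.matrix is one way to print a small part of it):

# compute the Euclidean distance matrix for all users
osn_dist <- dist(osn, method = "euclidean")
# view the distances between the first 5 users
as.matrix(osn_dist)[1:5, 1:5]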
## 1 2 3 4 5
## 1 0.000000 3.856515 3.265454 3.211316 4.664043
## 2 3.856515 0.000000 3.913958 2.674470 2.978036
## 3 3.265454 3.913958 0.000000 4.540631 5.720324
## 4 3.211316 2.674470 4.540631 0.000000 3.256571
## 5 4.664043 2.978036 5.720324 3.256571 0.000000
A distance matrix has the same number of rows and columns, with each
row (and column) representing one user in the OSN data. A distance
matrix is also symmetric: the distance between, for example, User 1
and User 2 (3.856515) is the same as the distance between
User 2 and User 1 (3.856515). Lastly, the distance from a
user to him/herself is 0, as seen on the diagonal above.
Step 2 - Selecting A Clustering Procedure
For our course, we will focus on hierarchical clustering and follow the agglomeration procedure. In addition, we use Ward’s method to decide how to combine clusters during the agglomeration. Throughout the course, we will stick to this combination.
After selecting the procedure, we run the hierarchical
clustering. Within R, you can use the function
hclust for hierarchical clustering. The
hclust function takes in two main inputs:
- a distance matrix object, usually created from the dist function;
- the clustering method (method =), for example, single (single linkage), complete (complete linkage), or ward.D2 (Ward’s method).
The function hclust outputs an object of class
hclust, which records the hierarchical clustering process.
For more details on hclust, please run ?hclust
or help(hclust) in R.
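A sketch of the call that produces the output below; osn_dist is the distance matrix from Step 1, and the result is stored in results, which we reuse in Step 3:

# hierarchical clustering with Ward's method on the Euclidean distances
results <- hclust(d = osn_dist, method = "ward.D2")
# print a short summary of the clustering
results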
##
## Call:
## hclust(d = osn_dist, method = "ward.D2")
##
## Cluster method : ward.D2
## Distance : euclidean
## Number of objects: 120
Step 3 - Determining the No. of Clusters
To determine the no. of clusters, we need to check the performance of the clustering at different stages of the hierarchical clustering. As discussed in class, a fundamental criterion for good clustering is homogeneity within clusters.
From the clustering results, we can obtain a measure of
within-cluster variation called height in
results. Given the homogeneity within
clusters criterion, a clustering is good if the within-cluster
variation is small (or height is small).
Within R, height is a vector of within-cluster variation
at different stages of the agglomeration procedure. Since we use the
agglomeration procedure, we start from the
initial stage, in which each and every consumer is in its own
cluster. In the osn data, this means we start with the
120 users in 120 clusters (each user in one cluster).
We gradually combine users and clusters until the end stage, in which
all consumers are in 1 cluster.
The values in height are arranged in the same order as
the agglomeration procedure. Therefore, we have the following two
observations:
- For the last value of height, we have 1 cluster; for the second-to-last, 2 clusters; for the third-to-last, 3 clusters…
- The within-cluster variation keeps increasing from the initial stage to the end stage. This is because at the initial stage, everyone is in their own cluster and we have the lowest within-cluster variation (0, or no within-cluster variation). In contrast, at the end stage, all consumers are in one cluster and we have the highest within-cluster variation.
Let’s check what height looks like. You will see
that it increases monotonically. Note that with 120 users, the
agglomeration makes 119 merges, so height contains 119 values.
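A sketch of how to extract these values (results is the hclust object from Step 2); we store them in within_cluster_variation, which we use again for the elbow plot below:

# within-cluster variation at each step of the agglomeration
within_cluster_variation <- results$height
within_cluster_variation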
## [1] 0.4780642 0.5293206 0.5612478 0.5750117 0.6032959 0.6612906
## [7] 0.7162642 0.7533355 0.7746848 0.7884269 0.7889602 0.8050062
## [13] 0.8052550 0.8409981 0.8599442 0.8631694 0.8637981 0.8882851
## [19] 0.9070446 0.9140462 0.9244842 0.9256939 0.9731676 1.0579897
## [25] 1.0591372 1.0634935 1.0652339 1.0684773 1.0711527 1.0785229
## [31] 1.0978222 1.1231807 1.1973810 1.2106981 1.2179649 1.2433307
## [37] 1.2516266 1.2531895 1.2551041 1.2636073 1.2695350 1.2721478
## [43] 1.2780557 1.2914777 1.2968819 1.3018216 1.3354961 1.3628748
## [49] 1.3701394 1.3747623 1.3779429 1.3981233 1.4179611 1.4281549
## [55] 1.4425011 1.4608043 1.5504601 1.5833322 1.6001816 1.6021762
## [61] 1.6079447 1.6513533 1.7203470 1.7383765 1.7557230 1.7806412
## [67] 1.8293025 1.8479023 1.8732198 1.9335654 1.9346242 2.0849719
## [73] 2.1028503 2.1446351 2.1812540 2.2086823 2.2134725 2.2731686
## [79] 2.2821196 2.3556332 2.3865849 2.4174510 2.4254254 2.4258883
## [85] 2.4355663 2.5026708 2.5856381 2.6873940 2.7382265 2.9284161
## [91] 2.9475014 3.1660506 3.2016780 3.2874132 3.3595585 3.4215631
## [97] 3.5111259 3.5501055 3.6923737 3.7185953 3.8237255 4.2440969
## [103] 4.3289215 4.7527831 4.9086087 4.9938644 5.2502399 5.4908632
## [109] 5.7717751 6.2197486 6.6424875 7.5016324 7.8763714 8.0262806
## [115] 8.5655845 8.8949807 9.8969601 13.9601297 16.5567531
To see this more clearly, we can draw an elbow plot of these values.
To create an elbow plot with within_cluster_variation,
you can use the function elbow_plot shared in the
MSR package. The function takes height from
the hclust() results as input and outputs an elbow plot
with the no. of clusters ranging, by default, from 1 to 10. The reason
for checking 1 to 10 clusters is that, in practice, we only want a few
clusters. It is not useful to have too many clusters; for example,
having 100 clusters would not be useful for marketers. As a
rule of thumb, the no. of clusters should not be larger than 10.
# we already created a variable called within_cluster_variation
# within_cluster_variation <- results$height, so we use this variable
elbow_plot(within_cluster_variation)

Now we apply the elbow criterion to find the elbow point. For a point to be an elbow point, it must be that:
- before the elbow point, we have a big decrease in within-cluster variation;
- after the elbow point, we have a small decrease in within-cluster variation.
A closer look shows that no_of_clusters = 3
is the elbow point, as marked in the plot below.
Note: in practice, the choice of the no. of clusters is a rather complex decision. Of course, it makes sense to use the elbow plot as the statistical basis. However, in practice, the elbow point may not be so obvious. In that case, managers must rely on their own knowledge and understanding of consumers to make a decision. We always allow for some subjectivity in this decision.
Step 4 - Validating the Clustering
After determining the no. of clusters, we first obtain the clustering
when the no. of clusters is 3 (the elbow point). For this purpose, we
use a function named cutree (please use
?cutree to see the details). This function takes two key
inputs:
- a hierarchical clustering obtained from hclust;
- the no. of clusters to use (e.g., 3 in our case).
The function outputs a vector telling you which cluster each and
every consumer belongs to. In our case, we name the output
clust_3. clust_3 is a vector of 120 values, with
each value telling you which cluster a user belongs to. For example, the
first value of clust_3 is 1, implying the first
user in our data belongs to cluster 1.
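A sketch of this step (results is the clustering from Step 2; we store the output in clust_3):

# cut the tree into 3 clusters
clust_3 <- cutree(results, k = 3)
# print the cluster membership of all 120 users
clust_3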
## [1] 1 2 3 1 2 1 3 3 1 3 3 1 1 3 1 1 1 2 1 3 1 3 2 1 3 1 3 3 2 3 3 1 3 3 2 3 1
## [38] 2 1 2 1 1 1 2 2 3 3 3 1 1 2 2 2 1 1 2 3 1 3 2 2 3 3 2 3 1 1 1 1 3 3 1 1 2
## [75] 1 1 1 2 1 3 1 2 1 3 2 1 1 1 3 1 3 1 1 3 1 1 3 2 2 1 1 1 2 2 3 1 3 1 1 3 1
## [112] 3 1 1 1 1 2 3 2 1
Since clust_3 is a discrete variable with three levels
(i.e., 3 clusters), we should let R know it’s a factor by using the
as.factor function. For more details on
as.factor, please check the help file with
?as.factor in your R console.
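A sketch of the conversion; str() is one way to produce the check below:

# convert the cluster assignment into a factor with 3 levels
clust_3 <- as.factor(clust_3)
# check the structure of the factor
str(clust_3)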
## Factor w/ 3 levels "1","2","3": 1 2 3 1 2 1 3 3 1 3 ...
To validate this solution, clust_3, we check whether all
variables in the osn data indeed differ significantly across the 3
clusters in clust_3. This is an application of the
clustering criterion heterogeneity between clusters:
we want our clusters to be as different from each other as possible.
If a variable does not differ across clusters, we will remove it from
the interpretation in the next step. Given that the variables in osn
are all continuous, we run an ANOVA to compare the means of the variables
across the 3 clusters (groups) in clust_3. For example, we compare
the average no. of friends across the 3 clusters.
You may run the ANOVA for each variable one by one. For example, to
compare the average no. of friends across the 3 clusters, we can use
aov(n_friends ~ clust_3, osn). An easier approach is to run
the ANOVA for all variables in osn with
cbind(), as follows.
Note: cbind is a useful function which you can use
within an R formula. It combines different variables, so you can run an
ANOVA with multiple dependent variables in one go.
# run anova on all the variables
anova_osn <- aov(cbind(intrinsic,
habit,
si_followers,
si_friends,
n_followers,
n_friends) ~ clust_3, data = osn)
# get a summary of anova results
summary(anova_osn)

## Response intrinsic :
## Df Sum Sq Mean Sq F value Pr(>F)
## clust_3 2 31.946 15.9728 20.866 1.778e-08 ***
## Residuals 117 89.561 0.7655
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response habit :
## Df Sum Sq Mean Sq F value Pr(>F)
## clust_3 2 24.849 12.4244 11.676 2.381e-05 ***
## Residuals 117 124.502 1.0641
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response si_followers :
## Df Sum Sq Mean Sq F value Pr(>F)
## clust_3 2 40.348 20.1742 20.617 2.137e-08 ***
## Residuals 117 114.488 0.9785
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response si_friends :
## Df Sum Sq Mean Sq F value Pr(>F)
## clust_3 2 47.482 23.7408 48.476 4.628e-16 ***
## Residuals 117 57.300 0.4897
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response n_followers :
## Df Sum Sq Mean Sq F value Pr(>F)
## clust_3 2 57.938 28.9690 52.33 < 2.2e-16 ***
## Residuals 117 64.770 0.5536
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response n_friends :
## Df Sum Sq Mean Sq F value Pr(>F)
## clust_3 2 31.943 15.9717 22.666 4.788e-09 ***
## Residuals 117 82.444 0.7046
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the results, all p-values of the
F-statistics are smaller than 0.05, so all variables
differ significantly across the 3 clusters in clust_3.
Therefore, we will include all variables in the interpretation of the
clusters.
Step 5 - Interpreting Clusters
To interpret the clusters, we first obtain the means of variables
across the 3 clusters in clust_3. To do this, you are given
a function cluster_mean in the MSR package.
You can use this function to obtain cluster means of all the variables
in the osn data.
The function takes in an ANOVA object and outputs the cluster means.
In our analysis, the ANOVA results are stored in anova_osn
from Step 4. We will use it as the input of the function.
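A sketch of the call, using cluster_mean from the MSR package as described above:

# calculate cluster means of all variables from the ANOVA results
cluster_mean(anova_osn)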
## cluster_1 cluster_2 cluster_3
## intrinsic 0.08165014 1.3019292 0.9125530
## habit 1.42790041 1.5471999 0.4864840
## si_followers 0.34646713 1.1677341 -0.4335153
## si_friends 0.26362564 -0.9936173 0.7091600
## n_followers 2.71598773 3.9012587 4.2257967
## n_friends 2.83797460 3.2689184 4.0345711
Note: remember to remove insignificant variables from this table before doing the actual interpretation.
To name the clusters, we first compare the clusters along the different
variables, one variable at a time. We can mark the highest and
lowest cluster means of each variable. For example, for
intrinsic, the intrinsic preferences for the OSN,
cluster_1 has the lowest mean and cluster_2 the
highest. If we mark the highest means in red and the lowest in blue, we
get a table as below (you can do this in Excel or other
software).
From the table, we can interpret the clusters. For example,
Cluster 1 has the lowest no. of followers and friends, as well as
the lowest intrinsic preferences. This means Cluster 1 users are not
strongly attracted to the online social network, and they are not
well connected on the social network. So, we name Cluster 1
Lurkers (you may come up with other names as you see fit). You
can follow a similar procedure for Clusters 2 and 3.
Overall, make sure your cluster names are:
- Accurate: reflecting the features of clusters.
- Catchy: easy to remember and communicate.