Please download the data TiVo.RData from Canvas. The
description of the data is in the slides of the case discussion.
# load the MSR package
library(MSR)
# load the data from the hard disc
load("TiVo.RData")
The data include both continuous variables and binary variables. As a workaround, we will use “Euclidean distance” with “Ward’s method” in the hierarchical clustering.
In addition, because some variables have a rather large scale (large mean and standard deviation), we need to rescale these variables to avoid putting too much weight on them in our distance measure.
To do this, we apply the scale() function to the
data frame cluster_data to standardize all variables, so
that they share the same scale. For more information on scale(),
run ?scale in your command line.
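To see why rescaling matters, the following is a minimal sketch on a made-up data frame (the `toy` data and its column names are assumptions for illustration, not part of the TiVo data). It compares Euclidean distances before and after scale().

```r
# Toy data: income is on a much larger scale than the binary male flag
toy <- data.frame(
  income = c(30, 60, 90, 45),
  male   = c(0, 1, 0, 1)
)

# Unscaled distances are driven almost entirely by income
dist(toy, method = "euclidean")

# scale() centers each column at 0 and divides it by its standard deviation
toy_scaled <- scale(toy)
round(colMeans(toy_scaled), 10)  # column means after scaling
apply(toy_scaled, 2, sd)         # column standard deviations after scaling

# After scaling, both variables contribute on a comparable footing
dist(toy_scaled, method = "euclidean")
```

In the unscaled distances, a difference of 1 in the binary variable is negligible next to income differences of 15 or more; after scaling, both columns have mean 0 and standard deviation 1.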
# the distance matrix with Euclidean measure
# don't forget to scale the cluster data
tivo_dist <- dist(scale(cluster_data), method = "euclidean")
# running the cluster with Ward's method
tivo_cluster <- hclust(tivo_dist, method = "ward.D2")
tivo_cluster
##
## Call:
## hclust(d = tivo_dist, method = "ward.D2")
##
## Cluster method : ward.D2
## Distance : euclidean
## Number of objects: 1000
To decide the no. of clusters, we create an elbow plot with the
elbow_plot function. This is the same function we used in the
practical session. It takes the height vector from
tivo_cluster and outputs an elbow plot, with the default no.
of clusters set to \(1,2,3,...,10\).
elbow_plot(
# getting the height vector from tivo_cluster
tivo_cluster$height
)
Here, we observe an elbow point at 4 clusters. From 3 clusters to 4 clusters, there is a big decrease in the within-cluster variation. In contrast, from 4 clusters to 5 clusters, there is only a small decrease. Applying the elbow criterion, we therefore choose 4 clusters.
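For readers without access to the MSR package, a similar plot can be sketched in base R. This is only an approximation under the assumption that the plot is built from the largest merge heights; the actual elbow_plot function may compute or display things differently. The data here are simulated, not the TiVo data.

```r
set.seed(1)
# Simulated data with three well-separated groups (illustration only)
sim <- rbind(
  matrix(rnorm(60, mean = 0),  ncol = 2),
  matrix(rnorm(60, mean = 6),  ncol = 2),
  matrix(rnorm(60, mean = 12), ncol = 2)
)
sim_cluster <- hclust(dist(scale(sim)), method = "ward.D2")

# hclust stores merge heights in increasing order. After reversing,
# element k is the height of the merge that reduces k + 1 clusters to k.
top_heights <- rev(sim_cluster$height)[1:10]

plot(1:10, top_heights, type = "b",
     xlab = "Number of clusters", ylab = "Merge height")
```

The elbow is the point after which the curve flattens, that is, where adding one more cluster no longer reduces the merge height by much.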
First, given the chosen no. of clusters, we obtain the clustering
results at 4 clusters with the cutree function.
clust_4 <- cutree(tivo_cluster,4)
# convert clust_4 to a factor
clust_4 <- as.factor(clust_4)
str(clust_4)
## Factor w/ 4 levels "1","2","3","4": 1 1 2 1 1 3 3 3 2 3 ...
## - attr(*, "names")= chr [1:1000] "1" "2" "3" "4" ...
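A quick way to inspect a cluster assignment is to count the observations per cluster with table(). The sketch below uses R's built-in USArrests data as a stand-in, since the TiVo data are not reproduced here; the object names `demo_cluster` and `demo_clust_4` are made up for this illustration.

```r
# Hierarchical clustering on a built-in data set (stand-in for the TiVo data)
demo_cluster <- hclust(dist(scale(USArrests), method = "euclidean"),
                       method = "ward.D2")

# Cut the tree at 4 clusters and convert the assignment to a factor
demo_clust_4 <- as.factor(cutree(demo_cluster, k = 4))

# Number of observations in each cluster
table(demo_clust_4)
```

The same call, table(clust_4), would give the cluster sizes for the TiVo solution.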
Next, we would validate the clustering by checking whether the characteristics differ significantly across the 4 clusters. The test must match the measurement level of each variable: for continuous variables, we can use ANOVA; for binary variables, we can use a chi-square test. This is beyond our discussion, so the validation step is omitted here.
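Although validation is not carried out on the TiVo data here, the two tests mentioned can be illustrated on simulated data. Everything in this sketch (the variables `age` and `male`, the cluster labels, and the effect sizes) is invented for illustration and says nothing about the TiVo results.

```r
set.seed(42)
n <- 200

# Simulated cluster labels and two characteristics that depend on them
cluster <- factor(sample(1:4, n, replace = TRUE))
age     <- rnorm(n, mean = 30 + 5 * as.integer(cluster), sd = 8)
male    <- rbinom(n, size = 1, prob = c(0.2, 0.4, 0.6, 0.8)[cluster])

# Continuous variable: one-way ANOVA of age across clusters
anova_fit <- aov(age ~ cluster)
summary(anova_fit)

# Binary variable: chi-square test on the cluster-by-gender table
chi_fit <- chisq.test(table(cluster, male))
chi_fit
```

A small p-value in either test would indicate that the characteristic differs across clusters, which supports the usefulness of the cluster solution for that variable.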
For simplicity, we will just use all variables in the interpretation.
To do so, we need to obtain the cluster means of all the variables. A
convenient approach is to use a data package such as
dplyr or data.table. I will omit the code
here as it is beyond the scope of our course.
##     Variables                                                    Cluster_1 Cluster_2 Cluster_3 Cluster_4
##  1: Gender: Females                                                   0.53      0.30      0.48      0.54
##  2: Gender: Males                                                     0.47      0.70      0.52      0.46
##  3: Education: none                                                   0.25      0.14      0.84      0.21
##  4: Education: BA                                                     0.25      0.57      0.16      0.23
##  5: Education: MA                                                     0.25      0.14      0.00      0.33
##  6: Education: PhD                                                    0.25      0.15      0.00      0.23
##  7: Annual Income (x1000 $)                                          48.09     60.32     29.86     29.97
##  8: Age                                                              53.51     50.32     52.12     26.40
##  9: Purchasing Decision-maker: single                                 0.10      0.25      0.50      0.96
## 10: Purchasing Decision-maker: family                                 0.90      0.75      0.50      0.04
## 11: Purchasing Location: discount                                     0.00      0.03      0.50      0.36
## 12: Purchasing Location: web (ebay)                                   0.00      0.00      0.00      0.30
## 13: Purchasing Location: retail                                       0.00      0.03      0.50      0.34
## 14: Purchasing Location: mass-consumer electronics                    1.00      0.00      0.00      0.00
## 15: Purchasing Location: specialty stores                             0.00      0.94      0.00      0.00
## 16: Monthly Electronics Spend                                        41.41     56.07     16.84     31.65
## 17: Purchasing Frequency (every x months)                            29.94      9.51     24.80     24.17
## 18: TV Viewing (hours/day)                                            6.32      1.12      1.01      2.91
## 19: Favorite Feature: cool gadget                                     0.00      0.35      0.33      0.04
## 20: Favorite Feature: programming/interactive features                0.00      0.00      0.00      0.90
## 21: Favorite Feature: saving favorite shows to watch as a family      1.00      0.00      0.00      0.00
## 22: Favorite Feature: schedule control                                0.00      0.32      0.33      0.03
## 23: Favorite Feature: time shifting                                   0.00      0.33      0.33      0.03
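A table of cluster means like the one above can also be produced without extra packages. The following is a minimal base-R sketch using aggregate() on a made-up data frame; the `toy` data, its columns, and the cluster labels are assumptions for illustration and are not the code used to produce the table above.

```r
set.seed(7)
# Made-up data: two continuous variables and one binary variable
toy <- data.frame(
  age    = round(rnorm(12, mean = 40, sd = 10)),
  income = round(rnorm(12, mean = 50, sd = 15), 2),
  male   = rbinom(12, size = 1, prob = 0.5)
)
# Made-up cluster assignment: three clusters of four observations each
toy_clust <- factor(rep(1:3, each = 4))

# Mean of every variable within each cluster
cluster_means <- aggregate(toy, by = list(cluster = toy_clust), FUN = mean)

# Transpose so that variables are rows and clusters are columns
round(t(cluster_means[, -1]), 2)
```

For a binary (0/1) variable, the cluster mean is simply the share of observations in that cluster with the value 1, which is how the proportions in the table above should be read.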