The k-means algorithm looks for cluster of data with similar attributes to group them together, which helps ensure the internal homogeneity among market segments. This algorithm is used because in this assignment, I am provided with the number of segments to aim for (4 and 7), which makes determining the value of k (number of segments) effortless.
tivo <- read.csv("C:/Users/trang/OneDrive/Minerva/2nd year/B110/tivo cluster.csv", header = TRUE)
alltivo <- read.csv("C:/Users/trang/OneDrive/Minerva/2nd year/B110/alltivo.csv")
str(tivo)
## 'data.frame': 1002 obs. of 14 variables:
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Education : int 0 0 1 3 0 1 0 0 0 0 ...
## $ Annual.Income..x1000... : Factor w/ 47 levels "","#REF!","21",..: 30 27 39 32 27 13 15 11 38 12 ...
## $ Age : Factor w/ 65 levels "","#REF!","18",..: 15 21 51 63 37 57 47 15 45 44 ...
## $ Purchasing.Location : int 3 3 4 3 3 2 0 2 4 0 ...
## $ Monthly.Electronics.Spend : int 35 35 64 33 45 14 18 23 74 16 ...
## $ Annual.Spending.on.Electronics : int 420 420 768 396 540 168 216 276 888 192 ...
## $ Electronics.Spending.as...of.income : Factor w/ 143 levels "","0.10%","0.11%",..: 51 56 95 43 82 21 30 60 119 29 ...
## $ Monthly.Household.Spend : int 150 163 103 154 161 21 40 75 358 78 ...
## $ Annual.Household.spend : int 1800 1956 1236 1848 1932 252 480 900 4296 936 ...
## $ Disposable.Income..after.electronics..hh.: int 46780 43624 55996 48756 43528 30580 32304 27824 51816 28872 ...
## $ Purchasing.Frequency..every.x.months. : int 13 26 13 22 47 32 41 9 1 25 ...
## $ Technology.Adoption : int 0 0 1 0 0 1 1 1 1 1 ...
## $ TV.Viewing..hours.day. : int 2 10 0 5 2 1 0 1 0 0 ...
For the first segment scheme, the attribute chosen to divide the sample into different segments is TV Viewing time (hrs/day). This is a behaviorial attribute, which can be a better reflection of customrs’ need than demographics attribute.
#Cluster based on TV time
set.seed(123)
first_cluster <- kmeans(na.omit(tivo$TV.Viewing..hours.day.),centers = 7) #disregard NAs when doing clustering
first_cluster #give insightful details on size, centers,...
## K-means clustering with 7 clusters of sizes 41, 264, 537, 89, 18, 44, 7
##
## Cluster means:
## [,1]
## 1 6.7073171
## 2 2.1212121
## 3 0.6741155
## 4 4.4943820
## 5 13.2222222
## 6 10.2500000
## 7 12.0000000
##
## Clustering vector:
## [1] 2 6 3 4 2 3 3 3 3 3 1 4 3 2 3 3 3 2 4 3 1 3 4 1 1 2 2 3 2 3 3 3 3 3
## [35] 4 6 3 3 4 3 2 3 3 3 3 3 2 3 2 3 3 3 3 4 2 3 3 3 4 2 2 2 4 2 3 3 2 3
## [69] 3 3 2 2 2 3 1 1 2 3 3 2 3 3 2 3 3 2 5 3 3 3 5 3 3 2 3 1 7 3 3 3 3 5
## [103] 2 3 3 3 3 3 2 2 3 2 3 3 3 6 1 3 2 3 3 3 4 5 3 2 4 3 2 3 2 3 3 3 3 3
## [137] 3 2 2 1 6 2 3 6 2 4 3 2 2 2 3 3 3 4 3 4 1 7 2 3 3 2 2 3 3 3 4 6 3 4
## [171] 6 3 2 2 3 3 5 3 5 4 3 2 6 2 3 3 3 3 3 3 4 5 2 2 2 3 3 2 6 3 2 3 3 3
## [205] 3 2 3 3 3 3 2 2 3 2 3 2 2 2 3 4 3 3 3 3 2 3 2 2 5 7 4 3 6 1 3 4 4 3
## [239] 3 6 3 3 3 3 3 3 3 3 4 3 2 3 3 3 2 4 2 3 6 3 3 3 2 3 3 2 7 3 2 3 3 3
## [273] 3 2 1 4 3 3 3 2 3 3 2 3 2 6 3 2 1 2 2 3 3 3 3 4 2 2 3 2 2 6 3 2 3 2
## [307] 3 3 3 1 3 2 3 3 3 2 2 3 4 3 3 3 3 3 3 4 2 2 3 2 2 3 3 3 3 2 3 2 3 6
## [341] 3 3 3 2 2 2 3 3 3 2 3 3 2 2 2 3 3 3 3 2 2 2 4 3 3 3 2 2 6 3 3 2 3 3
## [375] 3 4 1 3 3 3 2 3 1 3 3 2 3 6 3 3 2 2 2 3 4 3 6 3 2 3 3 2 3 2 6 3 4 3
## [409] 6 2 2 3 3 2 3 4 3 3 3 3 2 3 3 3 3 2 3 3 1 3 3 2 4 3 3 3 1 3 3 5 3 3
## [443] 3 4 2 2 3 2 2 2 1 2 3 3 3 3 4 1 3 3 2 2 4 6 3 3 3 3 2 2 3 3 2 1 3 3
## [477] 3 2 2 2 2 3 4 6 4 3 3 3 2 2 3 1 3 3 6 3 3 2 2 3 3 2 2 3 3 2 3 3 6 2
## [511] 4 3 2 3 3 2 3 3 3 3 3 6 2 3 3 3 3 3 4 2 2 4 3 3 3 3 1 3 3 3 3 3 4 3
## [545] 2 6 3 3 2 3 3 4 3 3 3 4 3 6 4 3 3 3 3 2 2 2 2 3 1 3 3 2 3 2 3 4 4 6
## [579] 3 4 2 3 3 4 3 3 6 3 3 7 4 4 3 3 2 3 3 2 2 3 3 3 2 3 3 3 3 3 2 3 2 3
## [613] 3 3 3 3 3 3 2 3 2 3 3 3 3 3 2 2 2 3 3 4 3 3 6 2 3 2 2 3 2 2 2 2 3 4
## [647] 6 2 5 3 6 3 1 3 3 2 3 4 3 3 3 3 3 3 1 6 3 3 3 5 3 3 2 3 2 5 3 2 2 5
## [681] 3 3 2 3 2 6 3 3 2 2 3 2 2 3 3 3 2 3 3 3 3 2 3 3 3 3 3 1 3 3 3 3 2 3
## [715] 3 3 3 2 2 3 3 6 3 2 2 3 3 3 2 3 3 3 3 3 4 3 2 3 1 3 4 3 3 4 4 2 2 2
## [749] 2 3 3 1 2 3 4 1 3 2 3 3 4 3 1 4 3 2 3 2 5 3 3 3 3 3 3 3 3 2 3 2 3 4
## [783] 2 2 4 3 3 2 3 4 2 2 1 3 2 2 3 2 3 1 3 4 2 3 3 4 3 3 3 3 3 4 7 4 3 2
## [817] 3 2 2 3 2 3 2 2 2 3 3 6 6 2 3 3 2 4 3 3 3 2 3 2 3 3 2 2 5 3 1 3 3 3
## [851] 2 5 3 2 3 3 2 3 2 3 3 6 6 3 3 2 2 3 3 3 7 1 5 3 3 3 4 3 2 3 6 4 2 3
## [885] 2 2 3 4 3 4 2 2 1 2 6 3 3 3 3 3 2 2 2 3 3 4 3 3 2 3 3 1 6 4 3 4 3 4
## [919] 3 3 2 3 2 4 3 3 1 2 2 3 2 4 3 3 3 4 3 3 3 3 3 3 1 2 3 4 3 3 3 1 2 2
## [953] 2 2 4 3 3 2 3 3 4 3 3 2 3 3 2 4 2 3 4 2 3 3 3 6 1 2 4 3 2 2 3 3 3 3
## [987] 4 4 5 2 2 2 2 3 3 2 3 3 6 3
##
## Within cluster sum of squares by cluster:
## [1] 28.487805 28.121212 117.970205 22.247191 3.111111 26.250000
## [7] 0.000000
## (between_SS / total_SS = 97.3 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"
## Create a dataframe of participant IDs and Cluster number they belong to (1:7)
o = order(first_cluster$cluster)
df_firstcluster <- data.frame(tivo$ID[o], first_cluster$cluster[o])
Nex, I will create a visualization to see the distribution of the 7 segments. The graph below shows that the external heterogeneity across segments is strong as IDs of the same color stick together into a block shape.
plot(tivo$TV.Viewing..hours.day., type="n", xlim=c(3,1000))
text(x=tivo$TV.Viewing..hours.day., labels=tivo$ID, col=first_cluster$cluster+1)
In order to investigate the other attribute characteristics of 7 formed clusters, I will store the partcipant IDs corresponding to each segment (1:7) into 7 vectors. Afterward, these 7 vectors of IDs will serve as filters to achieve the desired subset.
list_id1 <- with(df_firstcluster,df_firstcluster$tivo.ID.o.[df_firstcluster$first_cluster.cluster.o.==1])
length(list_id1) #check if it's done correctly, should return 41 here because the first cluster has 41 participants
## [1] 41
list_id2 <- (with(df_firstcluster,df_firstcluster$tivo.ID.o.[df_firstcluster$first_cluster.cluster.o.==2]))
list_id3 <- (with(df_firstcluster,df_firstcluster$tivo.ID.o.[df_firstcluster$first_cluster.cluster.o.==3]))
list_id4 <- (with(df_firstcluster,df_firstcluster$tivo.ID.o.[df_firstcluster$first_cluster.cluster.o.==4]))
list_id5 <- (with(df_firstcluster,df_firstcluster$tivo.ID.o.[df_firstcluster$first_cluster.cluster.o.==5]))
list_id6 <- (with(df_firstcluster,df_firstcluster$tivo.ID.o.[df_firstcluster$first_cluster.cluster.o.==6]))
list_id7 <- (with(df_firstcluster,df_firstcluster$tivo.ID.o.[df_firstcluster$first_cluster.cluster.o.==7]))
Next, I’ll examine annual income, Monthly electronics spend, Purchasig locations and Favorite features of each segment
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
#Segment 1
tivo_clus1 <- tivo %>% filter(tivo$ID %in% list_id1)
tivo_clus1$ID
## [1] 11 21 24 25 75 76 96 117 140 157 234 275 289 310 377 383 429
## [18] 437 451 458 474 492 537 569 653 665 708 739 752 756 763 793 800 847
## [35] 872 893 912 927 943 950 977
mean(tivo_clus1$Annual.Income..x1000...)
## Warning in mean.default(tivo_clus1$Annual.Income..x1000...): argument is
## not numeric or logical: returning NA
## [1] NA
mean(tivo_clus1$Monthly.Electronics.Spend)
## [1] 40.31707
#location
table(tivo_clus1$Purchasing.Location)
##
## 0 1 2 3
## 4 4 4 29
#appealing feature
alltivo1 <- alltivo %>% filter(alltivo$ID %in% list_id1)
table(alltivo1$Purchasing.Location)
##
## discount
## 0 4
## mass-consumer electronics retail
## 29 4
## specialty stores web (ebay)
## 0 4
table(alltivo1$Favorite.feature)
##
##
## 0
## cool gadget
## 0
## programming/interactive features
## 12
## saving favorite shows to watch as a family
## 29
## schedule control
## 0
## time shifting
## 0
#Segment 2
tivo_clus2 <- tivo %>% filter(tivo$ID %in% list_id2) #filter the ids with the Ids in list1
mean(tivo_clus2$TV.Viewing..hours.day.)
## [1] 2.121212
mean(tivo_clus2$Technology.Adoption)
## [1] 0.8674242
mean(tivo_clus2$Annual.Income..x1000...)
## Warning in mean.default(tivo_clus2$Annual.Income..x1000...): argument is
## not numeric or logical: returning NA
## [1] NA
mean(tivo_clus2$Monthly.Electronics.Spend)
## [1] 30.37121
table(tivo_clus2$Purchasing.Location)
##
## 0 1 2 3 4
## 96 14 69 35 50
#appealing feature
alltivo2 <- alltivo %>% filter(alltivo$ID %in% list_id2)
table(alltivo2$Purchasing.Location)
##
## discount
## 0 96
## mass-consumer electronics retail
## 35 69
## specialty stores web (ebay)
## 50 14
table(alltivo2$Favorite.feature)
##
##
## 0
## cool gadget
## 60
## programming/interactive features
## 45
## saving favorite shows to watch as a family
## 35
## schedule control
## 66
## time shifting
## 58
#Segment 3
tivo_clus3 <- tivo %>% filter(tivo$ID %in% list_id3) #filter the ids with the Ids in list1
mean(tivo_clus3$TV.Viewing..hours.day.)
## [1] 0.6741155
mean(tivo_clus3$Technology.Adoption)
## [1] 0.9534451
mean(tivo_clus3$Annual.Income..x1000...)
## Warning in mean.default(tivo_clus3$Annual.Income..x1000...): argument is
## not numeric or logical: returning NA
## [1] NA
mean(tivo_clus3$Monthly.Electronics.Spend)
## [1] 28.18436
#location
table(tivo_clus3$Purchasing.Location)
##
## 0 1 2 3 4
## 182 8 202 25 120
#appealing feature
alltivo3 <- alltivo %>% filter(alltivo$ID %in% list_id3)
table(alltivo3$Favorite.feature)
##
##
## 0
## cool gadget
## 168
## programming/interactive features
## 26
## saving favorite shows to watch as a family
## 25
## schedule control
## 155
## time shifting
## 163
#Segment 4
tivo_clus4 <- tivo %>% filter(tivo$ID %in% list_id4) #filter the ids with the Ids in list1
tivo_clus4$ID
## [1] 4 12 19 23 35 39 54 59 63 123 127 146 154 156 167 170 180
## [18] 191 220 231 236 237 249 256 276 296 319 326 363 376 395 407 416 433
## [35] 444 457 463 483 485 511 529 532 543 552 556 559 576 577 580 584 591
## [52] 592 632 646 658 735 741 744 745 755 761 764 782 785 790 802 806 812
## [69] 814 834 877 882 888 890 906 914 916 918 924 932 936 946 955 961 968
## [86] 971 979 987 988
mean(tivo_clus4$TV.Viewing..hours.day.)
## [1] 4.494382
mean(tivo_clus4$Technology.Adoption)
## [1] 0.5280899
mean(tivo_clus4$Annual.Income..x1000...)
## Warning in mean.default(tivo_clus4$Annual.Income..x1000...): argument is
## not numeric or logical: returning NA
## [1] NA
mean(tivo_clus4$Monthly.Electronics.Spend)
## [1] 37.11236
#location
table(tivo_clus4$Purchasing.Location)
##
## 0 1 2 3
## 11 17 19 42
#appealing feature
alltivo4 <- alltivo %>% filter(alltivo$ID %in% list_id4)
table(alltivo4$Favorite.feature)
##
##
## 0
## cool gadget
## 0
## programming/interactive features
## 47
## saving favorite shows to watch as a family
## 42
## schedule control
## 0
## time shifting
## 0
#Segment 5
tivo_clus5 <- tivo %>% filter(tivo$ID %in% list_id5) #filter the ids with the Ids in list1
mean(tivo_clus5$TV.Viewing..hours.day.)
## [1] 13.22222
mean(tivo_clus5$Annual.Income..x1000...)
## Warning in mean.default(tivo_clus5$Annual.Income..x1000...): argument is
## not numeric or logical: returning NA
## [1] NA
mean(tivo_clus5$Monthly.Electronics.Spend)
## [1] 42.5
#location
table(tivo_clus5$Purchasing.Location)
##
## 3
## 18
#appealing feature
alltivo5 <- alltivo %>% filter(alltivo$ID %in% list_id5)
table(alltivo5$Favorite.feature)
##
##
## 0
## cool gadget
## 0
## programming/interactive features
## 0
## saving favorite shows to watch as a family
## 18
## schedule control
## 0
## time shifting
## 0
#Segment 6
tivo_clus6 <- tivo %>% filter(tivo$ID %in% list_id6) #filter the ids with the Ids in list1
mean(tivo_clus6$TV.Viewing..hours.day.)
## [1] 10.25
mean(tivo_clus6$Annual.Income..x1000...)
## Warning in mean.default(tivo_clus6$Annual.Income..x1000...): argument is
## not numeric or logical: returning NA
## [1] NA
mean(tivo_clus6$Monthly.Electronics.Spend)
## [1] 40.97727
#location
table(tivo_clus6$Purchasing.Location)
##
## 3
## 44
#appealing feature
alltivo6 <- alltivo %>% filter(alltivo$ID %in% list_id6)
table(alltivo6$Favorite.feature)
##
##
## 0
## cool gadget
## 0
## programming/interactive features
## 0
## saving favorite shows to watch as a family
## 44
## schedule control
## 0
## time shifting
## 0
#Segment 7
tivo_clus7 <- tivo %>% filter(tivo$ID %in% list_id7) #filter the ids with the Ids in list1
mean(tivo_clus7$Annual.Income..x1000...)
## Warning in mean.default(tivo_clus7$Annual.Income..x1000...): argument is
## not numeric or logical: returning NA
## [1] NA
mean(tivo_clus7$Monthly.Electronics.Spend)
## [1] 40.71429
#location
table(tivo_clus7$Purchasing.Location)
##
## 3
## 7
#appealing feature
alltivo7 <- alltivo %>% filter(alltivo$ID %in% list_id7)
table(alltivo7$Favorite.feature)
##
##
## 0
## cool gadget
## 0
## programming/interactive features
## 0
## saving favorite shows to watch as a family
## 7
## schedule control
## 0
## time shifting
## 0
Comment: The first segmentation scheme consists of 7 segments based on TV viewing (hours/day). It is clear from the description of the segment that the heterogeneity across these segments is not strong. The first three segments can be grouped together as low income, discount buying group while the last three segments can also be grouped together as high income, mass-consumer electronics buying location groups. Even though the condition of external heterogeneity is not met, having these segments still provide some meaningful insights. It seems that the higher-income group buys electronics as a means of family relaxation and connection while lower-income group purchases electronics for personal use and preferences. Also, as the higher-income group only shops in mass-consumer electronics stores, the company can channel a good amount of budget for marketing in these stores if this is their target segment.