Market segmentation using k-means clustering algorithm

The k-means algorithm looks for cluster of data with similar attributes to group them together, which helps ensure the internal homogeneity among market segments. This algorithm is used because in this assignment, I am provided with the number of segments to aim for (4 and 7), which makes determining the value of k (number of segments) effortless.

7 segment scheme

tivo <- read.csv("C:/Users/trang/OneDrive/Minerva/2nd year/B110/tivo cluster.csv", header = TRUE)
alltivo <- read.csv("C:/Users/trang/OneDrive/Minerva/2nd year/B110/alltivo.csv")
str(tivo)

## 'data.frame':    1002 obs. of  14 variables:
##  $ ID                                       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Education                                : int  0 0 1 3 0 1 0 0 0 0 ...
##  $ Annual.Income..x1000...                  : Factor w/ 47 levels "","#REF!","21",..: 30 27 39 32 27 13 15 11 38 12 ...
##  $ Age                                      : Factor w/ 65 levels "","#REF!","18",..: 15 21 51 63 37 57 47 15 45 44 ...
##  $ Purchasing.Location                      : int  3 3 4 3 3 2 0 2 4 0 ...
##  $ Monthly.Electronics.Spend                : int  35 35 64 33 45 14 18 23 74 16 ...
##  $ Annual.Spending.on.Electronics           : int  420 420 768 396 540 168 216 276 888 192 ...
##  $ Electronics.Spending.as...of.income      : Factor w/ 143 levels "","0.10%","0.11%",..: 51 56 95 43 82 21 30 60 119 29 ...
##  $ Monthly.Household.Spend                  : int  150 163 103 154 161 21 40 75 358 78 ...
##  $ Annual.Household.spend                   : int  1800 1956 1236 1848 1932 252 480 900 4296 936 ...
##  $ Disposable.Income..after.electronics..hh.: int  46780 43624 55996 48756 43528 30580 32304 27824 51816 28872 ...
##  $ Purchasing.Frequency..every.x.months.    : int  13 26 13 22 47 32 41 9 1 25 ...
##  $ Technology.Adoption                      : int  0 0 1 0 0 1 1 1 1 1 ...
##  $ TV.Viewing..hours.day.                   : int  2 10 0 5 2 1 0 1 0 0 ...

For the first segment scheme, the attribute chosen to divide the sample into different segments is TV Viewing time (hrs/day). This is a behaviorial attribute, which can be a better reflection of customrs’ need than demographics attribute.

#Cluster based on TV time
set.seed(123)
first_cluster <- kmeans(na.omit(tivo$TV.Viewing..hours.day.),centers = 7) #disregard NAs when doing clustering
first_cluster #give insightful details on size, centers,...

## K-means clustering with 7 clusters of sizes 41, 264, 537, 89, 18, 44, 7
## 
## Cluster means:
##         [,1]
## 1  6.7073171
## 2  2.1212121
## 3  0.6741155
## 4  4.4943820
## 5 13.2222222
## 6 10.2500000
## 7 12.0000000
## 
## Clustering vector:
##    [1] 2 6 3 4 2 3 3 3 3 3 1 4 3 2 3 3 3 2 4 3 1 3 4 1 1 2 2 3 2 3 3 3 3 3
##   [35] 4 6 3 3 4 3 2 3 3 3 3 3 2 3 2 3 3 3 3 4 2 3 3 3 4 2 2 2 4 2 3 3 2 3
##   [69] 3 3 2 2 2 3 1 1 2 3 3 2 3 3 2 3 3 2 5 3 3 3 5 3 3 2 3 1 7 3 3 3 3 5
##  [103] 2 3 3 3 3 3 2 2 3 2 3 3 3 6 1 3 2 3 3 3 4 5 3 2 4 3 2 3 2 3 3 3 3 3
##  [137] 3 2 2 1 6 2 3 6 2 4 3 2 2 2 3 3 3 4 3 4 1 7 2 3 3 2 2 3 3 3 4 6 3 4
##  [171] 6 3 2 2 3 3 5 3 5 4 3 2 6 2 3 3 3 3 3 3 4 5 2 2 2 3 3 2 6 3 2 3 3 3
##  [205] 3 2 3 3 3 3 2 2 3 2 3 2 2 2 3 4 3 3 3 3 2 3 2 2 5 7 4 3 6 1 3 4 4 3
##  [239] 3 6 3 3 3 3 3 3 3 3 4 3 2 3 3 3 2 4 2 3 6 3 3 3 2 3 3 2 7 3 2 3 3 3
##  [273] 3 2 1 4 3 3 3 2 3 3 2 3 2 6 3 2 1 2 2 3 3 3 3 4 2 2 3 2 2 6 3 2 3 2
##  [307] 3 3 3 1 3 2 3 3 3 2 2 3 4 3 3 3 3 3 3 4 2 2 3 2 2 3 3 3 3 2 3 2 3 6
##  [341] 3 3 3 2 2 2 3 3 3 2 3 3 2 2 2 3 3 3 3 2 2 2 4 3 3 3 2 2 6 3 3 2 3 3
##  [375] 3 4 1 3 3 3 2 3 1 3 3 2 3 6 3 3 2 2 2 3 4 3 6 3 2 3 3 2 3 2 6 3 4 3
##  [409] 6 2 2 3 3 2 3 4 3 3 3 3 2 3 3 3 3 2 3 3 1 3 3 2 4 3 3 3 1 3 3 5 3 3
##  [443] 3 4 2 2 3 2 2 2 1 2 3 3 3 3 4 1 3 3 2 2 4 6 3 3 3 3 2 2 3 3 2 1 3 3
##  [477] 3 2 2 2 2 3 4 6 4 3 3 3 2 2 3 1 3 3 6 3 3 2 2 3 3 2 2 3 3 2 3 3 6 2
##  [511] 4 3 2 3 3 2 3 3 3 3 3 6 2 3 3 3 3 3 4 2 2 4 3 3 3 3 1 3 3 3 3 3 4 3
##  [545] 2 6 3 3 2 3 3 4 3 3 3 4 3 6 4 3 3 3 3 2 2 2 2 3 1 3 3 2 3 2 3 4 4 6
##  [579] 3 4 2 3 3 4 3 3 6 3 3 7 4 4 3 3 2 3 3 2 2 3 3 3 2 3 3 3 3 3 2 3 2 3
##  [613] 3 3 3 3 3 3 2 3 2 3 3 3 3 3 2 2 2 3 3 4 3 3 6 2 3 2 2 3 2 2 2 2 3 4
##  [647] 6 2 5 3 6 3 1 3 3 2 3 4 3 3 3 3 3 3 1 6 3 3 3 5 3 3 2 3 2 5 3 2 2 5
##  [681] 3 3 2 3 2 6 3 3 2 2 3 2 2 3 3 3 2 3 3 3 3 2 3 3 3 3 3 1 3 3 3 3 2 3
##  [715] 3 3 3 2 2 3 3 6 3 2 2 3 3 3 2 3 3 3 3 3 4 3 2 3 1 3 4 3 3 4 4 2 2 2
##  [749] 2 3 3 1 2 3 4 1 3 2 3 3 4 3 1 4 3 2 3 2 5 3 3 3 3 3 3 3 3 2 3 2 3 4
##  [783] 2 2 4 3 3 2 3 4 2 2 1 3 2 2 3 2 3 1 3 4 2 3 3 4 3 3 3 3 3 4 7 4 3 2
##  [817] 3 2 2 3 2 3 2 2 2 3 3 6 6 2 3 3 2 4 3 3 3 2 3 2 3 3 2 2 5 3 1 3 3 3
##  [851] 2 5 3 2 3 3 2 3 2 3 3 6 6 3 3 2 2 3 3 3 7 1 5 3 3 3 4 3 2 3 6 4 2 3
##  [885] 2 2 3 4 3 4 2 2 1 2 6 3 3 3 3 3 2 2 2 3 3 4 3 3 2 3 3 1 6 4 3 4 3 4
##  [919] 3 3 2 3 2 4 3 3 1 2 2 3 2 4 3 3 3 4 3 3 3 3 3 3 1 2 3 4 3 3 3 1 2 2
##  [953] 2 2 4 3 3 2 3 3 4 3 3 2 3 3 2 4 2 3 4 2 3 3 3 6 1 2 4 3 2 2 3 3 3 3
##  [987] 4 4 5 2 2 2 2 3 3 2 3 3 6 3
## 
## Within cluster sum of squares by cluster:
## [1]  28.487805  28.121212 117.970205  22.247191   3.111111  26.250000
## [7]   0.000000
##  (between_SS / total_SS =  97.3 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"

## Create a dataframe of participant IDs and Cluster number they belong to (1:7)
o = order(first_cluster$cluster)
df_firstcluster <- data.frame(tivo$ID[o], first_cluster$cluster[o])

Nex, I will create a visualization to see the distribution of the 7 segments. The graph below shows that the external heterogeneity across segments is strong as IDs of the same color stick together into a block shape.

plot(tivo$TV.Viewing..hours.day., type="n", xlim=c(3,1000))
text(x=tivo$TV.Viewing..hours.day., labels=tivo$ID, col=first_cluster$cluster+1)

In order to investigate the other attribute characteristics of 7 formed clusters, I will store the partcipant IDs corresponding to each segment (1:7) into 7 vectors. Afterward, these 7 vectors of IDs will serve as filters to achieve the desired subset.

list_id1 <- with(df_firstcluster,df_firstcluster$tivo.ID.o.[df_firstcluster$first_cluster.cluster.o.==1])
length(list_id1) #check if it's done correctly, should return 41 here because the first cluster has 41 participants

## [1] 41

list_id2 <- (with(df_firstcluster,df_firstcluster$tivo.ID.o.[df_firstcluster$first_cluster.cluster.o.==2]))
list_id3 <- (with(df_firstcluster,df_firstcluster$tivo.ID.o.[df_firstcluster$first_cluster.cluster.o.==3]))
list_id4 <- (with(df_firstcluster,df_firstcluster$tivo.ID.o.[df_firstcluster$first_cluster.cluster.o.==4]))
list_id5 <- (with(df_firstcluster,df_firstcluster$tivo.ID.o.[df_firstcluster$first_cluster.cluster.o.==5]))
list_id6 <- (with(df_firstcluster,df_firstcluster$tivo.ID.o.[df_firstcluster$first_cluster.cluster.o.==6]))
list_id7 <- (with(df_firstcluster,df_firstcluster$tivo.ID.o.[df_firstcluster$first_cluster.cluster.o.==7]))

Next, I’ll examine annual income, Monthly electronics spend, Purchasig locations and Favorite features of each segment

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

#Segment 1
tivo_clus1 <- tivo %>% filter(tivo$ID %in% list_id1) 


tivo_clus1$ID

##  [1]  11  21  24  25  75  76  96 117 140 157 234 275 289 310 377 383 429
## [18] 437 451 458 474 492 537 569 653 665 708 739 752 756 763 793 800 847
## [35] 872 893 912 927 943 950 977

mean(tivo_clus1$Annual.Income..x1000...)

## Warning in mean.default(tivo_clus1$Annual.Income..x1000...): argument is
## not numeric or logical: returning NA

## [1] NA

mean(tivo_clus1$Monthly.Electronics.Spend)

## [1] 40.31707

#location
table(tivo_clus1$Purchasing.Location)

## 
##  0  1  2  3 
##  4  4  4 29

#appealing feature
alltivo1 <- alltivo %>% filter(alltivo$ID %in% list_id1)
table(alltivo1$Purchasing.Location)

## 
##                                            discount 
##                         0                         4 
## mass-consumer electronics                    retail 
##                        29                         4 
##          specialty stores                web (ebay) 
##                         0                         4

table(alltivo1$Favorite.feature)

## 
##                                            
##                                          0 
##                                cool gadget 
##                                          0 
##           programming/interactive features 
##                                         12 
## saving favorite shows to watch as a family 
##                                         29 
##                           schedule control 
##                                          0 
##                              time shifting 
##                                          0

#Segment 2

tivo_clus2 <- tivo %>% filter(tivo$ID %in% list_id2) #filter the ids with the Ids in list1
mean(tivo_clus2$TV.Viewing..hours.day.)

## [1] 2.121212

mean(tivo_clus2$Technology.Adoption)

## [1] 0.8674242

mean(tivo_clus2$Annual.Income..x1000...)

## Warning in mean.default(tivo_clus2$Annual.Income..x1000...): argument is
## not numeric or logical: returning NA

## [1] NA

mean(tivo_clus2$Monthly.Electronics.Spend)

## [1] 30.37121

table(tivo_clus2$Purchasing.Location)

## 
##  0  1  2  3  4 
## 96 14 69 35 50

#appealing feature
alltivo2 <- alltivo %>% filter(alltivo$ID %in% list_id2)
table(alltivo2$Purchasing.Location)

## 
##                                            discount 
##                         0                        96 
## mass-consumer electronics                    retail 
##                        35                        69 
##          specialty stores                web (ebay) 
##                        50                        14

table(alltivo2$Favorite.feature)

## 
##                                            
##                                          0 
##                                cool gadget 
##                                         60 
##           programming/interactive features 
##                                         45 
## saving favorite shows to watch as a family 
##                                         35 
##                           schedule control 
##                                         66 
##                              time shifting 
##                                         58

#Segment 3

tivo_clus3 <- tivo %>% filter(tivo$ID %in% list_id3) #filter the ids with the Ids in list1
mean(tivo_clus3$TV.Viewing..hours.day.)

## [1] 0.6741155

mean(tivo_clus3$Technology.Adoption)

## [1] 0.9534451

mean(tivo_clus3$Annual.Income..x1000...)

## Warning in mean.default(tivo_clus3$Annual.Income..x1000...): argument is
## not numeric or logical: returning NA

## [1] NA

mean(tivo_clus3$Monthly.Electronics.Spend)

## [1] 28.18436

#location
table(tivo_clus3$Purchasing.Location)

## 
##   0   1   2   3   4 
## 182   8 202  25 120

#appealing feature
alltivo3 <- alltivo %>% filter(alltivo$ID %in% list_id3)
table(alltivo3$Favorite.feature)

## 
##                                            
##                                          0 
##                                cool gadget 
##                                        168 
##           programming/interactive features 
##                                         26 
## saving favorite shows to watch as a family 
##                                         25 
##                           schedule control 
##                                        155 
##                              time shifting 
##                                        163

#Segment 4

tivo_clus4 <- tivo %>% filter(tivo$ID %in% list_id4) #filter the ids with the Ids in list1
tivo_clus4$ID

##  [1]   4  12  19  23  35  39  54  59  63 123 127 146 154 156 167 170 180
## [18] 191 220 231 236 237 249 256 276 296 319 326 363 376 395 407 416 433
## [35] 444 457 463 483 485 511 529 532 543 552 556 559 576 577 580 584 591
## [52] 592 632 646 658 735 741 744 745 755 761 764 782 785 790 802 806 812
## [69] 814 834 877 882 888 890 906 914 916 918 924 932 936 946 955 961 968
## [86] 971 979 987 988

mean(tivo_clus4$TV.Viewing..hours.day.)

## [1] 4.494382

mean(tivo_clus4$Technology.Adoption)

## [1] 0.5280899

mean(tivo_clus4$Annual.Income..x1000...)

## Warning in mean.default(tivo_clus4$Annual.Income..x1000...): argument is
## not numeric or logical: returning NA

## [1] NA

mean(tivo_clus4$Monthly.Electronics.Spend)

## [1] 37.11236

#location
table(tivo_clus4$Purchasing.Location)

## 
##  0  1  2  3 
## 11 17 19 42

#appealing feature
alltivo4 <- alltivo %>% filter(alltivo$ID %in% list_id4)
table(alltivo4$Favorite.feature)

## 
##                                            
##                                          0 
##                                cool gadget 
##                                          0 
##           programming/interactive features 
##                                         47 
## saving favorite shows to watch as a family 
##                                         42 
##                           schedule control 
##                                          0 
##                              time shifting 
##                                          0

#Segment 5
tivo_clus5 <- tivo %>% filter(tivo$ID %in% list_id5) #filter the ids with the Ids in list1
mean(tivo_clus5$TV.Viewing..hours.day.)

## [1] 13.22222

mean(tivo_clus5$Annual.Income..x1000...)

## Warning in mean.default(tivo_clus5$Annual.Income..x1000...): argument is
## not numeric or logical: returning NA

## [1] NA

mean(tivo_clus5$Monthly.Electronics.Spend)

## [1] 42.5

#location
table(tivo_clus5$Purchasing.Location)

## 
##  3 
## 18

#appealing feature
alltivo5 <- alltivo %>% filter(alltivo$ID %in% list_id5)
table(alltivo5$Favorite.feature)

## 
##                                            
##                                          0 
##                                cool gadget 
##                                          0 
##           programming/interactive features 
##                                          0 
## saving favorite shows to watch as a family 
##                                         18 
##                           schedule control 
##                                          0 
##                              time shifting 
##                                          0

#Segment 6
tivo_clus6 <- tivo %>% filter(tivo$ID %in% list_id6) #filter the ids with the Ids in list1
mean(tivo_clus6$TV.Viewing..hours.day.)

## [1] 10.25

mean(tivo_clus6$Annual.Income..x1000...)

## Warning in mean.default(tivo_clus6$Annual.Income..x1000...): argument is
## not numeric or logical: returning NA

## [1] NA

mean(tivo_clus6$Monthly.Electronics.Spend)

## [1] 40.97727

#location
table(tivo_clus6$Purchasing.Location)

## 
##  3 
## 44

#appealing feature
alltivo6 <- alltivo %>% filter(alltivo$ID %in% list_id6)
table(alltivo6$Favorite.feature)

## 
##                                            
##                                          0 
##                                cool gadget 
##                                          0 
##           programming/interactive features 
##                                          0 
## saving favorite shows to watch as a family 
##                                         44 
##                           schedule control 
##                                          0 
##                              time shifting 
##                                          0

#Segment 7
tivo_clus7 <- tivo %>% filter(tivo$ID %in% list_id7) #filter the ids with the Ids in list1
mean(tivo_clus7$Annual.Income..x1000...)

## Warning in mean.default(tivo_clus7$Annual.Income..x1000...): argument is
## not numeric or logical: returning NA

## [1] NA

mean(tivo_clus7$Monthly.Electronics.Spend)

## [1] 40.71429

#location
table(tivo_clus7$Purchasing.Location)

## 
## 3 
## 7

#appealing feature
alltivo7 <- alltivo %>% filter(alltivo$ID %in% list_id7)
table(alltivo7$Favorite.feature)

## 
##                                            
##                                          0 
##                                cool gadget 
##                                          0 
##           programming/interactive features 
##                                          0 
## saving favorite shows to watch as a family 
##                                          7 
##                           schedule control 
##                                          0 
##                              time shifting 
##                                          0

Comment: The first segmentation scheme consists of 7 segments based on TV viewing (hours/day). It is clear from the description of the segment that the heterogeneity across these segments is not strong. The first three segments can be grouped together as low income, discount buying group while the last three segments can also be grouped together as high income, mass-consumer electronics buying location groups. Even though the condition of external heterogeneity is not met, having these segments still provide some meaningful insights. It seems that the higher-income group buys electronics as a means of family relaxation and connection while lower-income group purchases electronics for personal use and preferences. Also, as the higher-income group only shops in mass-consumer electronics stores, the company can channel a good amount of budget for marketing in these stores if this is their target segment.