Market segmentation involves taking a set of individuals and partitioning them into smaller groups or segments. Cluster analysis is one of the simplest ways to partition n observations into k clusters or groups.
In this assignment, we will analyze data from the Ford survey using factor analysis to understand the relationships among potential demand drivers and then use a clustering algorithm to form segments. You will i) apply factor analysis to the Ford dataset to extract meaning from a set of potential purchase motivations; ii) create an index score that summarizes factor information; iii) run a cluster analysis to form market segments; iv) name each segment; and v) assign each respondent to a segment for further analysis.
Compile your answers into a .html file and upload it to CarmenCanvas.
The instructor will provide instructions on how to compile to a .html file. Insert
an R code chunk when necessary to answer questions.
a. Read the data (Ford_Data.csv) into your R console using the following command.
FordData = read.csv("Ford_Data.csv",sep=",",header = TRUE)
varname = colnames(FordData)
b. Create a subset Motivations by selecting the
motivation variables using grepl(). The Excel
file Ford_Motivation Variable coding.xlsx (under Modules -> Ford
Data) describes each variable. You may find the
following code helpful.
# The command below will extract responses for Q2 and Q3 (Motivational variable and frustration variables)
Motivations.temp = FordData[ , grepl("Q2_", names(FordData) )| grepl("Q3_", names( FordData ) ) ]
#Motivations.temp[,19]
Motivations = Motivations.temp[,-c(19,33)]; rm(Motivations.temp)
colnames(Motivations) = c("trans","roadtrip","rackuser","haul","command","reliable","handlecond","concern","adjslow","expressme","lookright","dontcarecar","tech_driveconsist","tech_defensive","hearothers","comfortable","panoramicview", "other","bad_GM", "highbill", "difficulttodrive", "notcomfortable", "hassletoshare","difficulttoload", "dontlikecoldcar", "likelytogetlost", "notenoughroom", "notenoughtowcap", "notenoughstable", "notenoughhauling", "other_frus" )
# The line above renames the variables to words that are meaningful to us
c. Create a vector freq that indicates how
often each motivation was checked off. (Hint: use ?apply for
help)
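As a small illustration of how grepl() selects columns by name, the sketch below uses a made-up data frame (the column names are hypothetical and not taken from Ford_Data.csv). It also shows that an unanchored pattern such as "Q2_" matches anywhere in a name, which may or may not matter for the real file.

```r
# Hypothetical toy survey; the column names are invented for illustration.
toy <- data.frame(
  Q1_1    = c(1, 0, 1),
  Q2_1    = c(0, 1, 1),
  Q2_2    = c(1, 1, 0),
  Q3_1    = c(0, 0, 1),
  PreQ2_1 = c(1, 0, 0)
)

# Unanchored patterns, as in the assignment code: "Q2_" also matches "PreQ2_1".
loose <- grepl("Q2_", names(toy)) | grepl("Q3_", names(toy))

# Anchored alternative: "^" ties the match to the start of the name.
strict <- grepl("^Q[23]_", names(toy))

names(toy)[loose]
names(toy)[strict]
```

If every column in the real data starts with its question number, the two versions select the same columns.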
freq = apply(Motivations, 2, sum) ##execute summation column-wise (2)
#apply(Motivations, 1, sum) ##execute summation row-wise (1)
d. Let \(m\) index the motivations and let \(nhh\) be the number of respondents. Use the following equation to calculate the choice probability of each motivation in R. \[\text{Choice probability}_m = \frac{\text{Freq}_m}{nhh}\]
freq/dim(FordData)[1]
## trans roadtrip rackuser haul
## 0.191956124 0.345521024 0.129798903 0.107861060
## command reliable handlecond concern
## 0.363802559 0.466179159 0.568555759 0.360146252
## adjslow expressme lookright dontcarecar
## 0.067641682 0.292504570 0.471663620 0.054844607
## tech_driveconsist tech_defensive hearothers comfortable
## 0.493601463 0.497257770 0.113345521 0.758683729
## panoramicview other bad_GM highbill
## 0.442413163 0.009140768 0.360146252 0.206581353
## difficulttodrive notcomfortable hassletoshare difficulttoload
## 0.049360146 0.058500914 0.137111517 0.102376600
## dontlikecoldcar likelytogetlost notenoughroom notenoughtowcap
## 0.170018282 0.063985375 0.047531993 0.043875686
## notenoughstable notenoughhauling other_frus
## 0.065813528 0.074954296 0.367458867
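The frequency and choice-probability steps can be checked on a tiny example. This is a minimal sketch on hypothetical 0/1 data (four respondents, three motivations), not the Ford responses; it also shows that colSums() and colMeans() give the same numbers for check-off data.

```r
# Hypothetical 0/1 check-off data: 4 respondents, 3 motivations.
toy <- matrix(c(1, 0, 1,
                1, 1, 0,
                0, 0, 1,
                1, 0, 1),
              ncol = 3, byrow = TRUE,
              dimnames = list(NULL, c("comfortable", "haul", "reliable")))

nhh  <- nrow(toy)            # number of respondents
freq <- apply(toy, 2, sum)   # column-wise counts, as in part c
prob <- freq / nhh           # choice probability per motivation

# For 0/1 data these shortcuts give the same numbers.
same_freq <- isTRUE(all.equal(freq, colSums(toy)))
same_prob <- isTRUE(all.equal(prob, colMeans(toy)))
prob
```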
e. What are the top 3 motivations for consumers to purchase an SUV? (Hint: use ?order for help)
sort(freq, decreasing = TRUE)
## comfortable handlecond tech_defensive tech_driveconsist
## 415 311 272 270
## lookright reliable panoramicview other_frus
## 258 255 242 201
## command concern bad_GM roadtrip
## 199 197 197 189
## expressme highbill trans dontlikecoldcar
## 160 113 105 93
## hassletoshare rackuser hearothers haul
## 75 71 62 59
## difficulttoload notenoughhauling adjslow notenoughstable
## 56 41 37 36
## likelytogetlost notcomfortable dontcarecar difficulttodrive
## 35 32 30 27
## notenoughroom notenoughtowcap other
## 26 24 5
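The hint mentions order(), while the code above uses sort(). As a sketch of the order() route, the example below copies five of the counts from the output above and picks the top three; order() returns positions, which is handy when the same ranking has to index another vector.

```r
# Counts copied from the sorted output above (five motivations only, for brevity).
freq <- c(lookright = 258, comfortable = 415, tech_driveconsist = 270,
          handlecond = 311, tech_defensive = 272)

# order() returns positions, so it can index freq (or any parallel vector).
top3_idx <- order(freq, decreasing = TRUE)[1:3]
top3 <- freq[top3_idx]
names(top3)
```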
a. Use the provided R command to create the hierarchical
clustering tree. How many potential clusters are there in the Motivations
subset? (Hint: use hclust() with the distance matrix of
Motivations, dist(Motivations), to create a hierarchical
clustering tree).
# we will use some graphics so that we can choose the number of groups to form.
Distance = dist(Motivations)
# There are 547 choose 2 (547C2) pairs of distance
# 547 * 546 / 2
out = hclust(Distance, method="complete")
plot(out,main="Complete Linkage", xlab="", sub="", cex =.9)
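Reading the number of clusters off the dendrogram is a judgment call. One way to make a candidate choice concrete is cutree(), which cuts the tree into a requested number of groups. The sketch below uses hypothetical, well-separated points rather than the Motivations data, and also confirms the pair count mentioned in the comment above.

```r
# Hypothetical points: rows 1-4 sit near (0, 0), rows 5-8 near (5, 5).
x <- rbind(cbind(c(0, 0.1, 0.2, 0.1), c(0, 0.1, 0, 0.2)),
           cbind(c(5, 5.1, 5.2, 5.1), c(5, 5.1, 5, 5.2)))

hc <- hclust(dist(x), method = "complete")   # same linkage as above
groups <- cutree(hc, k = 2)                  # cut the tree into 2 groups
table(groups)

# Number of pairwise distances among 547 respondents.
n_pairs <- choose(547, 2)
n_pairs
```

On the real data, trying a few values of k and comparing the group sizes can help decide which cut is usable.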
b. We will run a factor analysis to reduce the dimension of
Motivations.
b-i. Create a variable
nfactor that sets the number of factors based on your
answer from a.
nfactor = 4 ## Change based on your answer from a
# This means that we are going to reduce 31 variables to 4 representations.
b-ii. Use the following command to run a factor analysis.
(Hint: use factanal() and the number of clusters from Q1 to
find the factors).
FA = factanal(Motivations, nfactor, rotation="varimax")
b-iii. What are the important features of each potential factor according to the loadings? (In other words, how do we access the important factors? Hint: FA$loadings)
FA$loadings
##
## Loadings:
## Factor1 Factor2 Factor3 Factor4
## trans 0.302
## roadtrip 0.420 0.213
## rackuser 0.307 0.322
## haul 0.160 0.338
## command 0.481 0.107
## reliable 0.367 0.135
## handlecond 0.483 0.126 -0.149
## concern 0.289 -0.143 0.262
## adjslow 0.105 0.153 0.388
## expressme 0.401 0.126
## lookright 0.426
## dontcarecar -0.146 0.268
## tech_driveconsist 0.471
## tech_defensive 0.492
## hearothers 0.115 0.108 0.317
## comfortable 0.420
## panoramicview 0.453 0.127
## other -0.187
## bad_GM 0.125 0.584 0.138
## highbill 0.321 0.120 0.232
## difficulttodrive 0.360 0.282
## notcomfortable 0.244 0.287
## hassletoshare 0.114 0.224 0.329
## difficulttoload 0.134 0.404 0.177
## dontlikecoldcar 0.135 0.268 0.299
## likelytogetlost 0.117 0.259
## notenoughroom 0.390 0.156
## notenoughtowcap 0.435
## notenoughstable 0.319 0.364
## notenoughhauling 0.119 0.503
## other_frus -0.935 -0.205 -0.279
##
## Factor1 Factor2 Factor3 Factor4
## SS loadings 2.403 1.569 1.532 1.182
## Proportion Var 0.078 0.051 0.049 0.038
## Cumulative Var 0.078 0.128 0.178 0.216
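The three summary rows at the bottom of the output are linked by simple arithmetic. factanal() works with standardized variables, so the total variance equals the number of variables (31 here), and each factor's share is its SS loadings divided by 31. A minimal check, using the rounded values printed above:

```r
# SS loadings copied from the factanal output above; p is the number of variables.
ss_loadings <- c(Factor1 = 2.403, Factor2 = 1.569, Factor3 = 1.532, Factor4 = 1.182)
p <- 31

prop_var <- ss_loadings / p    # share of total variance per factor
cum_var  <- cumsum(prop_var)   # running total across factors

round(rbind(prop_var, cum_var), 3)
```

The four factors together account for about 22% of the variance, which is worth keeping in mind when interpreting the segments.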
c. Identify the high loadings for each factor by determining the cut points p1, p2, p3, and p4, respectively. Use the following commands to create the linear combinations as a summary.
p1= 0.45;
F1 = FA$loadings[,c(1)]*(abs(FA$loadings[,1])>=p1)
F1[F1!=0] # this command prints the variables with absolute factor loadings >= p1
p2= 0.3;
F2 = FA$loadings[,c(2)]*(abs(FA$loadings[,2])>=p2)
p3= 0.4;
F3 = FA$loadings[,c(3)]*(abs(FA$loadings[,3])>=p3)
p4 =0.3; ## Change according to your judgement.
F4 = FA$loadings[,c(4)]*(abs(FA$loadings[,4])>=p4)
AdjFloading = cbind(F1, F2, F3, F4)
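The four near-identical lines above can also be written as one vectorized step. The sketch below is an alternative formulation on a hypothetical loadings matrix (the numbers are invented, not the Ford results); sweep() compares each column with its own cut point, and multiplying by the resulting TRUE/FALSE matrix zeroes out the loadings below the cut.

```r
# Hypothetical loadings for 4 variables on 2 factors (not the Ford results).
L <- matrix(c( 0.50,  0.10,
               0.20,  0.35,
              -0.60,  0.05,
               0.44, -0.31),
            ncol = 2, byrow = TRUE,
            dimnames = list(c("v1", "v2", "v3", "v4"), c("F1", "F2")))
cuts <- c(0.45, 0.30)   # one cut point per factor

# Compare each column with its own cut point, then zero out the rest.
keep <- sweep(abs(L), 2, cuts, FUN = ">=")
Adj  <- L * keep
Adj
```

Negative loadings survive the cut when their absolute value is large enough, which is why abs() is used in the comparison.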
d. Name the factors according to the loadings of each motivation and state your reasons.
Factor 1: Driving experience. With the cut point p1 = 0.45, this factor keeps five motivations: 1. command (0.481). 2. handlecond (0.483). 3. tech_driveconsist (0.471). 4. tech_defensive (0.492). 5. panoramicview (0.453).
Factor 2: Financially careful. This factor contains three statements: 1. SUVs have bad gas mileage. 2. SUVs have high repair bills. 3. Some other frustration (with a negative factor loading).
Factor 3: Hauler. This factor contains the following statements: 1. Not enough hauling capacity. 2. handlecond, concern, and other frustration.
Factor 4: Safety concern. This factor contains three statements: 1. SUVs have slow adjustment. 2. SUVs are not very stable when driving. 3. Sharing the vehicle is a hassle.
a. Create a matrix score that weights each
motivation by its factor loading after applying the cut points from the previous question
(AdjFloading).
score = as.matrix(Motivations) %*% AdjFloading
score[1:5, ] # Print the first five rows, the factor scores for the first five respondents
## F1 F2 F3 F4
## [1,] 0.0000000 -0.9352282 0.0000000 0.000000
## [2,] 1.8978694 0.5840602 0.4044405 1.080627
## [3,] 0.9623882 0.5840602 0.0000000 0.000000
## [4,] 0.4916807 0.0000000 0.4044405 0.000000
## [5,] 2.3792941 0.9054136 0.4044405 0.000000
# score
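Because the responses are 0/1, each score is simply the sum of the retained loadings of the items a respondent checked. The sketch below illustrates this for Factor 2 with the loadings rounded as printed in the table; it assumes that only bad_GM, highbill, and other_frus pass the cut point p2 = 0.3, and the three respondents are hypothetical.

```r
# Retained Factor 2 loadings, rounded as printed in the loadings table.
# Assumption: only these three variables pass the cut point p2 = 0.3.
f2 <- c(bad_GM = 0.584, highbill = 0.321, other_frus = -0.935)

# Hypothetical respondents (rows) and their 0/1 answers to the three items.
answers <- rbind(r1 = c(0, 0, 1),
                 r2 = c(1, 0, 0),
                 r3 = c(1, 1, 0))
colnames(answers) <- names(f2)

# Matrix product = sum of the retained loadings of the checked items.
score_f2 <- drop(answers %*% f2)
score_f2
```

Up to rounding, these three values line up with the F2 entries printed for rows 1, 2, and 5 above.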
b. Create a variable stdscore that standardizes
the score.
nhh = dim(Motivations)[1]
smean = matrix(apply(score, 2, mean), ncol = nfactor, nrow = nhh, byrow = TRUE)
ssd = matrix(apply(score, 2, sd), ncol = nfactor, nrow = nhh, byrow = TRUE)
stdscore = (score - smean) / ssd ## Standardized factor scores for all 547 respondents
a. The following R command runs the k-means clustering algorithm (see the description below) on the standardized factor scores. We will explore the results.
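The manual standardization above should agree with base R's scale(), which by default subtracts each column mean and divides by the sample standard deviation. A minimal sketch on a hypothetical score matrix:

```r
# Hypothetical score matrix: 5 respondents, 2 factors.
score <- cbind(F1 = c(0, 1.9, 1.0, 0.5, 2.4),
               F2 = c(-0.9, 0.6, 0.6, 0, 0.9))

# Manual standardization, mirroring the smean/ssd approach.
n <- nrow(score)
smean <- matrix(colMeans(score), nrow = n, ncol = ncol(score), byrow = TRUE)
ssd   <- matrix(apply(score, 2, sd), nrow = n, ncol = ncol(score), byrow = TRUE)
manual <- (score - smean) / ssd

# scale() centers each column and divides by its sample sd by default.
auto <- scale(score)
agree <- isTRUE(all.equal(manual, auto, check.attributes = FALSE))
agree
```

After standardizing, every column has mean 0 and standard deviation 1, so no factor dominates the k-means distances just because of its scale.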
set.seed(1234)
cluster = kmeans(stdscore, centers=nfactor)
## kmeans uses 'stdscore' as the input and creates the number of groups given by 'centers', which is the user's choice
## in this case, we will form 4 groups.
cluster$cluster
## [1] 2 4 1 3 3 1 4 3 4 3 2 2 3 4 2 1 4 3 3 1 2 2 4 3 1 1 4 3 1 2 4 2 3 4 4 2 1
## [38] 2 2 4 1 2 2 4 2 2 1 1 2 1 1 2 3 3 2 3 2 4 1 4 2 3 3 4 1 1 1 4 1 4 3 1 4 2
## [75] 4 3 1 1 2 3 1 2 3 1 1 1 4 3 3 1 2 1 2 4 4 2 3 3 2 2 2 3 2 2 4 2 4 3 4 3 2
## [112] 1 4 1 3 4 2 1 1 4 3 1 3 1 2 3 3 2 1 3 2 2 3 4 1 3 2 4 4 1 2 2 4 1 2 2 2 1
## [149] 2 3 2 4 4 2 4 4 4 2 2 3 4 1 3 1 1 4 1 1 1 2 1 1 2 3 2 1 2 2 1 4 2 2 1 3 2
## [186] 3 1 2 2 2 4 1 4 4 2 2 2 4 2 4 4 2 4 3 2 2 1 4 4 4 2 2 4 4 2 2 2 2 1 3 2 2
## [223] 3 2 4 2 2 1 3 2 3 4 2 2 1 3 4 1 4 3 2 3 3 2 2 2 4 4 2 2 1 4 4 2 2 2 2 1 2
## [260] 1 2 1 1 1 3 1 2 4 2 3 2 4 2 2 2 2 1 4 4 4 4 1 1 1 4 3 4 3 1 4 3 1 3 1 2 2
## [297] 2 4 2 2 1 4 4 1 3 1 2 4 2 2 1 2 2 2 2 1 2 4 2 4 4 3 1 1 2 2 4 2 1 2 1 3 1
## [334] 4 2 1 1 2 3 4 1 2 4 2 1 2 4 3 1 4 2 1 2 2 2 1 2 1 4 1 2 3 1 4 3 4 1 2 1 2
## [371] 3 4 2 4 2 1 4 4 4 2 1 4 1 2 2 4 4 2 1 2 3 1 1 4 2 4 1 2 4 3 3 4 2 4 3 2 2
## [408] 2 4 2 1 4 1 1 1 4 4 1 1 2 2 4 2 4 3 4 4 1 2 2 4 2 2 4 4 3 3 3 2 1 2 4 4 2
## [445] 1 1 1 2 4 2 3 2 4 4 3 1 4 2 4 4 4 1 3 4 1 4 2 3 2 2 2 2 1 2 2 4 4 2 1 4 3
## [482] 2 1 1 2 2 3 4 2 1 4 4 2 3 4 2 4 3 4 1 1 1 1 4 3 2 2 2 2 2 1 2 4 2 2 2 2 1
## [519] 1 2 2 2 1 2 1 4 2 2 4 2 4 1 2 4 1 4 2 2 1 2 2 2 4 1 3 3 3
# in my case, the first resp is assigned to 2, the second resp is assigned to 4
table(cluster$cluster)
##
## 1 2 3 4
## 131 196 83 137
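k-means starts from randomly chosen centers, so the labels and sometimes the groups depend on the seed. A common safeguard is the nstart argument of kmeans(), which reruns the algorithm from several random starts and keeps the best solution. The sketch below uses hypothetical, well-separated points rather than stdscore.

```r
# Hypothetical points: rows 1-4 sit near (0, 0), rows 5-8 near (5, 5).
x <- rbind(cbind(c(0, 0.1, 0.2, 0.1), c(0, 0.1, 0, 0.2)),
           cbind(c(5, 5.1, 5.2, 5.1), c(5, 5.1, 5, 5.2)))

set.seed(1234)
fit <- kmeans(x, centers = 2, nstart = 25)   # keep the best of 25 random starts

fit$size           # points per cluster, the same counts as table(fit$cluster)
fit$tot.withinss   # total within-cluster sum of squares (lower is tighter)
```

The cluster numbers themselves carry no meaning; only the grouping does, so a different seed may swap the labels.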
b. How many respondents are there in each cluster?
c. Interpret the cluster results and match each cluster with each factor. Please state your reasons.
## cluster$cluster: a vector containing the group assignment for each resp.
# in my case, the first resp is assigned to 2, the second resp is assigned to 4.
table(cluster$cluster)
##
## 1 2 3 4
## 131 196 83 137
cluster$centers ## these are the coordinates of the centers of the clusters
## F1 F2 F3 F4
## 1 -0.58571163 0.7224308 -0.4050114 -0.4062587
## 2 -0.11805942 -1.2174161 -0.4050114 -0.4948663
## 3 -0.05792136 0.6690821 2.0351384 0.1930893
## 4 0.76405360 0.6455570 -0.2662610 0.9794692
# cluster 1 is a group centered at (-0.6, 0.7, -0.4, -0.4)
# cluster 2 is a group centered at (-0.1, -1.2, -0.4, -0.5)...
# Cluster 3 has the largest factor score of 2.0 for factor 3,
# This means that those who are assigned to cluster 3 are explained as "haulers"(with complaints)
# Cluster 2 has the "largest" factor score in absolute value, -1.2, for factor 2,
# so we match Cluster label 2 with Factor 2
# based on my results, there are 196 respondents in this cluster; because the center is negative, they score low on factor 2, which the loadings tie to other_frus (-0.935) rather than to bad gas mileage or high bills
# if we keep repeating this, the last cluster will unfortunately not necessarily be matched with the factor on which it has its largest score, but we will keep it as is
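The matching rule described in the comments can be applied to all clusters at once. The sketch below copies the centers printed above and finds, for each cluster, the factor with the largest signed score and the factor with the largest score in absolute value. This is only one possible matching rule, not the required one.

```r
# Cluster centers copied from the cluster$centers output above.
centers <- matrix(c(-0.58571163,  0.7224308, -0.4050114, -0.4062587,
                    -0.11805942, -1.2174161, -0.4050114, -0.4948663,
                    -0.05792136,  0.6690821,  2.0351384,  0.1930893,
                     0.76405360,  0.6455570, -0.2662610,  0.9794692),
                  ncol = 4, byrow = TRUE,
                  dimnames = list(c("1", "2", "3", "4"), c("F1", "F2", "F3", "F4")))

by_max <- apply(centers, 1, which.max)                       # largest signed score
by_abs <- apply(centers, 1, function(v) which.max(abs(v)))   # largest magnitude
rbind(by_max, by_abs)
```

Under the magnitude rule, clusters 1 and 2 both point to factor 2 (one above average, one below), which illustrates why a one-to-one match between clusters and factors is not guaranteed.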