Market segmentation involves taking a set of individuals and partitioning them into smaller groups, or segments. Cluster analysis, and k-means in particular, is one of the simplest ways to partition n observations into k clusters or groups.

In this assignment, we will analyze data from the Ford survey using factor analysis to understand the relationships among potential demand drivers, and then use a clustering algorithm to form segments. You will i) apply factor analysis to the Ford dataset to extract meaning from a set of potential purchase motivations; ii) create an index score that summarizes the factor information; iii) run a cluster analysis to form market segments; iv) name each segment; and v) assign each respondent to a segment for further analysis.

Compile your answers into a .html file and upload it to CarmenCanvas. The instructor will provide instructions on how to compile to a .html file. Insert an R code chunk when necessary to answer questions.

1. This question explores the Ford dataset.

a. Read the data (Ford_Data.csv) into your R console using the following command.

FordData = read.csv("Ford_Data.csv",sep=",",header = TRUE) 
varname = colnames(FordData)

b. Create a subset Motivations by selecting the motivation and frustration variables (Q2 and Q3) using grepl(). The Excel file Ford_Motivation Variable coding.xlsx (under Modules -> Ford Data) describes each variable. You may find the following code helpful.

# The command below will extract responses for Q2 and Q3 (Motivational variable and frustration variables)
Motivations.temp = FordData[ , grepl("Q2_", names(FordData) )| grepl("Q3_", names( FordData ) ) ]
#Motivations.temp[,19]
Motivations = Motivations.temp[,-c(19,33)]; rm(Motivations.temp)

colnames(Motivations) = c("trans","roadtrip","rackuser","haul","command","reliable","handlecond","concern","adjslow","expressme","lookright","dontcarecar","tech_driveconsist","tech_defensive","hearothers","comfortable","panoramicview", "other","bad_GM", "highbill", "difficulttodrive", "notcomfortable", "hassletoshare","difficulttoload", "dontlikecoldcar", "likelytogetlost", "notenoughroom", "notenoughtowcap", "notenoughstable", "notenoughhauling", "other_frus" )
# This will change the name of the variables to word that are meaningful to us
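
The two grepl() calls above can also be written as one anchored pattern. The sketch below is only an illustration on a made-up data frame (the column names in toy, including preQ2_1, are hypothetical and are not from Ford_Data.csv); it shows that "^Q[23]_" keeps names that start with Q2_ or Q3_ and skips names that merely contain those strings.

```r
# Toy data frame; all column names here are made up for illustration.
toy = data.frame(ID = 1:3,
                 Q1_1 = c(1, 0, 1),
                 Q2_1 = c(0, 1, 1),
                 Q2_2 = c(1, 1, 0),
                 Q3_1 = c(0, 0, 1),
                 preQ2_1 = c(1, 1, 1))

# "^" anchors the match at the start of the name; "[23]" matches a 2 or a 3.
keep = grepl("^Q[23]_", names(toy))
sub = toy[ , keep]
names(sub)

# The unanchored pattern also picks up "preQ2_1".
names(toy)[grepl("Q2_", names(toy))]
```

Whether anchoring changes anything on the real survey file depends on its column names, so check names(FordData) before switching patterns.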

c. Create a vector freq that indicates how often each motivation was checked off. (Hint: use ?apply for help)

freq = apply(Motivations, 2, sum) ##execute summation column-wise (2)
#apply(Motivations, 1, sum) ##execute summation row-wise (1)

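d. Let \(m\) index the motivations. Use the following equation to calculate the choice probability of each motivation in R, where \(\text{Freq}_m\) is the number of respondents who checked motivation \(m\) and \(nhh\) is the number of respondents. \[\text{Choice probability}_m = \frac{\text{Freq}_m}{nhh}\]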

freq/dim(FordData)[1]
##             trans          roadtrip          rackuser              haul 
##       0.191956124       0.345521024       0.129798903       0.107861060 
##           command          reliable        handlecond           concern 
##       0.363802559       0.466179159       0.568555759       0.360146252 
##           adjslow         expressme         lookright       dontcarecar 
##       0.067641682       0.292504570       0.471663620       0.054844607 
## tech_driveconsist    tech_defensive        hearothers       comfortable 
##       0.493601463       0.497257770       0.113345521       0.758683729 
##     panoramicview             other            bad_GM          highbill 
##       0.442413163       0.009140768       0.360146252       0.206581353 
##  difficulttodrive    notcomfortable     hassletoshare   difficulttoload 
##       0.049360146       0.058500914       0.137111517       0.102376600 
##   dontlikecoldcar   likelytogetlost     notenoughroom   notenoughtowcap 
##       0.170018282       0.063985375       0.047531993       0.043875686 
##   notenoughstable  notenoughhauling        other_frus 
##       0.065813528       0.074954296       0.367458867
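
Because every response is coded 0 or 1, the frequency is a column sum and the choice probability is a column mean. The following is a minimal sketch on a made-up 4-by-3 matrix (toy and its column names are hypothetical) that checks both shortcuts against the apply() approach used above.

```r
# Toy 0/1 response matrix: 4 respondents, 3 hypothetical motivations.
toy = matrix(c(1, 0, 1,
               1, 1, 0,
               0, 0, 1,
               1, 1, 1),
             nrow = 4, byrow = TRUE,
             dimnames = list(NULL, c("a", "b", "c")))

freq_toy = apply(toy, 2, sum)      # column-wise counts, as in part c
prob_toy = freq_toy / nrow(toy)    # Freq / nhh, as in part d

# colSums() and colMeans() give the same numbers in one call each.
colSums(toy)
colMeans(toy)
```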

e. What are the top 3 motivations for consumers to purchase an SUV? (Hint: use ?order for help)

sort(freq, decreasing = TRUE)
##       comfortable        handlecond    tech_defensive tech_driveconsist 
##               415               311               272               270 
##         lookright          reliable     panoramicview        other_frus 
##               258               255               242               201 
##           command           concern            bad_GM          roadtrip 
##               199               197               197               189 
##         expressme          highbill             trans   dontlikecoldcar 
##               160               113               105                93 
##     hassletoshare          rackuser        hearothers              haul 
##                75                71                62                59 
##   difficulttoload  notenoughhauling           adjslow   notenoughstable 
##                56                41                37                36 
##   likelytogetlost    notcomfortable       dontcarecar  difficulttodrive 
##                35                32                30                27 
##     notenoughroom   notenoughtowcap             other 
##                26                24                 5
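
The hint mentions order(), while the code above uses sort(). As a small sketch of the order() route, the vector below copies five of the counts printed above (freq_toy is an illustrative name, not part of the assignment code) and pulls out the names of the three largest.

```r
# Five of the counts from the sort() output above, in arbitrary order.
freq_toy = c(comfortable = 415, trans = 105, handlecond = 311,
             tech_defensive = 272, haul = 59)

# order() returns positions, so index names() with the first three positions.
top3 = names(freq_toy)[order(freq_toy, decreasing = TRUE)[1:3]]
top3
```

On the full freq vector, keep in mind that it also contains the Q3 frustration items; restricting to the first 18 entries limits the ranking to the Q2 motivations.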

2. This question explores the factor analysis using motivations variables.

a. Use the provided R command to create the hierarchical clustering tree. How many potential clusters are there in the Motivations subset? (Hint: use hclust() on the distance matrix of the motivations, dist(Motivations), to create a hierarchical clustering tree.)

# we will use some graphics so that we can choose the number of groups to form.
Distance = dist(Motivations)
# There are choose(547, 2) = 547 * 546 / 2 = 149331 pairwise distances
out = hclust(Distance, method="complete") 
plot(out,main="Complete Linkage", xlab="", sub="", cex =.9)
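
Reading the number of clusters off a dendrogram is a judgment call. One way to check a candidate number is to cut the tree with cutree() and look at the group sizes. The sketch below uses simulated points (toy is made-up data with two well-separated groups, not the survey) purely to show the mechanics.

```r
set.seed(1)
# Two made-up groups of 10 points each, far apart in two dimensions.
toy = rbind(matrix(rnorm(20, mean = 0, sd = 0.1), ncol = 2),
            matrix(rnorm(20, mean = 5, sd = 0.1), ncol = 2))

tree = hclust(dist(toy), method = "complete")

# Cut the tree into k groups; each row gets a group label.
groups = cutree(tree, k = 2)
table(groups)
```

On the assignment data the analogous call would be cutree(out, k = 4); very unbalanced group sizes can be a sign that a different k is worth trying.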

b. We will run a factor analysis to reduce the dimension of Motivations.

b-i. Create a variable nfactor that sets the number of factors based on your answer from a.

nfactor = 4 ## Change based on your answer from a
# This means that we are going to reduce 31 variables to 4 factors.

b-ii. Use the following command to run a factor analysis. (Hint: use factanal() with the number of clusters from part a as the number of factors.)

FA = factanal(Motivations, nfactor, rotation="varimax") 

b-iii. What are the important features of each potential factor according to the loadings? (In other words, how do we assess which variables matter for each factor? Hint: FA$loadings)

FA$loadings
## 
## Loadings:
##                   Factor1 Factor2 Factor3 Factor4
## trans              0.302                         
## roadtrip           0.420           0.213         
## rackuser           0.307           0.322         
## haul               0.160           0.338         
## command            0.481           0.107         
## reliable           0.367   0.135                 
## handlecond         0.483   0.126  -0.149         
## concern            0.289          -0.143   0.262 
## adjslow            0.105           0.153   0.388 
## expressme          0.401           0.126         
## lookright          0.426                         
## dontcarecar       -0.146                   0.268 
## tech_driveconsist  0.471                         
## tech_defensive     0.492                         
## hearothers         0.115   0.108           0.317 
## comfortable        0.420                         
## panoramicview      0.453                   0.127 
## other             -0.187                         
## bad_GM             0.125   0.584   0.138         
## highbill                   0.321   0.120   0.232 
## difficulttodrive                   0.360   0.282 
## notcomfortable                     0.244   0.287 
## hassletoshare      0.114   0.224           0.329 
## difficulttoload            0.134   0.404   0.177 
## dontlikecoldcar    0.135   0.268           0.299 
## likelytogetlost            0.117           0.259 
## notenoughroom                      0.390   0.156 
## notenoughtowcap                    0.435         
## notenoughstable                    0.319   0.364 
## notenoughhauling           0.119   0.503         
## other_frus                -0.935  -0.205  -0.279 
## 
##                Factor1 Factor2 Factor3 Factor4
## SS loadings      2.403   1.569   1.532   1.182
## Proportion Var   0.078   0.051   0.049   0.038
## Cumulative Var   0.078   0.128   0.178   0.216

c. Identify the high loadings for each factor by choosing cut points p1, p2, p3, and p4, respectively. Use the following commands to keep only the high loadings; these define the linear combinations used as a summary.

p1= 0.45;  
F1 = FA$loadings[,c(1)]*(abs(FA$loadings[,1])>=p1) 
F1[F1!=0] # this command prints the variables whose absolute factor loading is >= p1

p2= 0.3;
F2 = FA$loadings[,c(2)]*(abs(FA$loadings[,2])>=p2) 

p3= 0.4; 
F3 = FA$loadings[,c(3)]*(abs(FA$loadings[,3])>=p3) 

p4 =0.3; ## Change according to your judgement.
F4 = FA$loadings[,c(4)]*(abs(FA$loadings[,4])>=p4) 

AdjFloading = cbind(F1, F2, F3, F4)
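
To see which variables survive each cut point, it can help to list the nonzero rows of every column at once. The sketch below is an illustration on a small made-up loading matrix (L, cuts, and the rounded values are assumptions chosen for the example); it applies one cut point per column with sweep() and then lists the surviving variable names.

```r
# Made-up loading matrix: 4 variables, 2 factors.
L = matrix(c(0.48,  0.00,
             0.13,  0.58,
             0.00, -0.94,
             0.30,  0.00),
           ncol = 2, byrow = TRUE,
           dimnames = list(c("command", "bad_GM", "other_frus", "trans"),
                           c("F1", "F2")))
cuts = c(0.45, 0.30)   # one cut point per factor

# TRUE where |loading| >= the cut point of that column; FALSE zeroes the loading.
adj = L * sweep(abs(L), 2, cuts, ">=")

# Names of the variables kept for each factor.
kept = lapply(colnames(adj), function(f) rownames(adj)[adj[, f] != 0])
names(kept) = colnames(adj)
kept
```

The same lapply() call works on AdjFloading and gives the variable list needed to name each factor.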

d. Name the factors according to the loadings of each motivation and state your reasons.

Factor 1: Driving experience. With p1 = 0.45, this factor keeps five statements: 1. command (0.481), 2. handlecond (0.483), 3. tech_driveconsist (0.471), 4. tech_defensive (0.492), 5. panoramicview (0.453). All five are Q2 motivations about how the vehicle drives and handles, not frustrations.

Factor 2: Financially careful. With p2 = 0.3, this factor keeps three statements: 1. SUVs have bad gas mileage (bad_GM, 0.584), 2. SUVs have high repair bills (highbill, 0.321), 3. some other frustration (other_frus, -0.935, a negative factor loading).

Factor 3: Hauler. With p3 = 0.4, this factor keeps three statements: 1. difficult to load (difficulttoload, 0.404), 2. not enough towing capacity (notenoughtowcap, 0.435), 3. not enough hauling (notenoughhauling, 0.503).

Factor 4: Safety concern. With p4 = 0.3, this factor keeps four statements: 1. SUVs have slow adjustment (adjslow, 0.388), 2. SUVs are not very stable when driving (notenoughstable, 0.364), 3. sharing causes trouble (hassletoshare, 0.329), 4. hearothers (0.317).

3. This question creates factor scores for each respondent by summarizing large loadings of each factor.

a. Create a matrix score that weights each motivation by its factor loading after applying the cut points from question 2c (AdjFloading).

score = as.matrix(Motivations) %*% AdjFloading
score[1:5, ] # Print the first five rows, the factor scores for the first five respondents
##             F1         F2        F3       F4
## [1,] 0.0000000 -0.9352282 0.0000000 0.000000
## [2,] 1.8978694  0.5840602 0.4044405 1.080627
## [3,] 0.9623882  0.5840602 0.0000000 0.000000
## [4,] 0.4916807  0.0000000 0.4044405 0.000000
## [5,] 2.3792941  0.9054136 0.4044405 0.000000
# score

b. Create a variable stdscore that standardizes the score.

nhh = dim(Motivations)[1]
smean = matrix(apply(score, 2, mean), ncol = nfactor, nrow = nhh, byrow = TRUE)
ssd = matrix(apply(score, 2, sd), ncol = nfactor, nrow = nhh, byrow = TRUE)
stdscore = (score - smean) / ssd  ## Standardized factor scores for all 547 respondents
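
The manual standardization above subtracts each column mean and divides by each column standard deviation. Base R's scale() does the same thing by default. The following sketch checks this on simulated numbers (score_toy is made-up data, not the real score matrix).

```r
set.seed(1)
# Made-up score matrix: 10 respondents, 4 factors.
score_toy = matrix(rnorm(40, mean = 2, sd = 3), ncol = 4)
n = nrow(score_toy)
k = ncol(score_toy)

# Manual route, mirroring the code above.
smean_toy = matrix(apply(score_toy, 2, mean), ncol = k, nrow = n, byrow = TRUE)
ssd_toy = matrix(apply(score_toy, 2, sd), ncol = k, nrow = n, byrow = TRUE)
manual = (score_toy - smean_toy) / ssd_toy

# scale() centers and divides by the sample standard deviation by default.
via_scale = scale(score_toy)

all.equal(manual, via_scale, check.attributes = FALSE)
round(colMeans(manual), 10)   # each column now has mean 0
apply(manual, 2, sd)          # and standard deviation 1
```

Note that scale() attaches the centers and scales as attributes, which is why the comparison ignores attributes.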

4. This question explores the cluster analysis using the summarized index score obtained from question 3.

a. The following R command conducts k-means clustering algorithm (See description below) based on the standardized factor scores. We will explore the results.

# Fix the random seed so that the k-means starting centers are reproducible
set.seed(1234)
cluster = kmeans(stdscore, centers=nfactor)
## kmeans() partitions the rows of 'stdscore' into a number of groups chosen by the user;
## in this case, we will form 4 groups.
cluster$cluster
##   [1] 2 4 1 3 3 1 4 3 4 3 2 2 3 4 2 1 4 3 3 1 2 2 4 3 1 1 4 3 1 2 4 2 3 4 4 2 1
##  [38] 2 2 4 1 2 2 4 2 2 1 1 2 1 1 2 3 3 2 3 2 4 1 4 2 3 3 4 1 1 1 4 1 4 3 1 4 2
##  [75] 4 3 1 1 2 3 1 2 3 1 1 1 4 3 3 1 2 1 2 4 4 2 3 3 2 2 2 3 2 2 4 2 4 3 4 3 2
## [112] 1 4 1 3 4 2 1 1 4 3 1 3 1 2 3 3 2 1 3 2 2 3 4 1 3 2 4 4 1 2 2 4 1 2 2 2 1
## [149] 2 3 2 4 4 2 4 4 4 2 2 3 4 1 3 1 1 4 1 1 1 2 1 1 2 3 2 1 2 2 1 4 2 2 1 3 2
## [186] 3 1 2 2 2 4 1 4 4 2 2 2 4 2 4 4 2 4 3 2 2 1 4 4 4 2 2 4 4 2 2 2 2 1 3 2 2
## [223] 3 2 4 2 2 1 3 2 3 4 2 2 1 3 4 1 4 3 2 3 3 2 2 2 4 4 2 2 1 4 4 2 2 2 2 1 2
## [260] 1 2 1 1 1 3 1 2 4 2 3 2 4 2 2 2 2 1 4 4 4 4 1 1 1 4 3 4 3 1 4 3 1 3 1 2 2
## [297] 2 4 2 2 1 4 4 1 3 1 2 4 2 2 1 2 2 2 2 1 2 4 2 4 4 3 1 1 2 2 4 2 1 2 1 3 1
## [334] 4 2 1 1 2 3 4 1 2 4 2 1 2 4 3 1 4 2 1 2 2 2 1 2 1 4 1 2 3 1 4 3 4 1 2 1 2
## [371] 3 4 2 4 2 1 4 4 4 2 1 4 1 2 2 4 4 2 1 2 3 1 1 4 2 4 1 2 4 3 3 4 2 4 3 2 2
## [408] 2 4 2 1 4 1 1 1 4 4 1 1 2 2 4 2 4 3 4 4 1 2 2 4 2 2 4 4 3 3 3 2 1 2 4 4 2
## [445] 1 1 1 2 4 2 3 2 4 4 3 1 4 2 4 4 4 1 3 4 1 4 2 3 2 2 2 2 1 2 2 4 4 2 1 4 3
## [482] 2 1 1 2 2 3 4 2 1 4 4 2 3 4 2 4 3 4 1 1 1 1 4 3 2 2 2 2 2 1 2 4 2 2 2 2 1
## [519] 1 2 2 2 1 2 1 4 2 2 4 2 4 1 2 4 1 4 2 2 1 2 2 2 4 1 3 3 3
# in my case, the first respondent is assigned to cluster 2 and the second to cluster 4
 table(cluster$cluster)
## 
##   1   2   3   4 
## 131 196  83 137
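
k-means starts from randomly chosen centers, so the labels and sizes can change with the seed. A common safeguard is the nstart argument of kmeans(), which tries several random starts and keeps the best one. The sketch below runs this on simulated points (toy and the segments data frame are made up for illustration) and also shows one way to store each respondent's segment for later analysis.

```r
set.seed(1234)
# Two made-up groups of 20 points each.
toy = rbind(matrix(rnorm(40, mean = 0, sd = 0.2), ncol = 2),
            matrix(rnorm(40, mean = 4, sd = 0.2), ncol = 2))

# Try 25 random starts and keep the solution with the lowest within-cluster sum of squares.
fit = kmeans(toy, centers = 2, nstart = 25)

# Keep the assignment next to a respondent id for further analysis.
segments = data.frame(id = seq_len(nrow(toy)), segment = factor(fit$cluster))
table(segments$segment)
```

For the assignment data, the corresponding step would be something like FordData$segment = cluster$cluster, assuming the rows of stdscore are in the same order as the rows of FordData.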

b. How many respondents are there in each cluster?

c. Interpret the cluster results and match each cluster with each factor. Please state your reasons.

## cluster$cluster is a vector containing the group assignment for each respondent.
# in my case, the first respondent is assigned to cluster 2 and the second to cluster 4.
table(cluster$cluster)
## 
##   1   2   3   4 
## 131 196  83 137
cluster$centers ## these are the coordinates of the centers of the clusters
##            F1         F2         F3         F4
## 1 -0.58571163  0.7224308 -0.4050114 -0.4062587
## 2 -0.11805942 -1.2174161 -0.4050114 -0.4948663
## 3 -0.05792136  0.6690821  2.0351384  0.1930893
## 4  0.76405360  0.6455570 -0.2662610  0.9794692
# cluster 1 is a group centered at (-0.6, 0.7, -0.4, -0.4)
# cluster 2 is a group centered at (-0.1, -1.2, -0.4, -0.5)...

# Cluster 3 has the largest factor score of 2.0 for factor 3,
# This means that those who are assigned to cluster 3 are explained as "haulers"(with complaints)

# Cluster 2 has its largest score in absolute value, -1.2, on factor 2,
# so we match cluster 2 with factor 2.
# The sign is negative, and other_frus loads at -0.935 on factor 2, so based on my results the 196 respondents
# in this cluster lean toward "other" frustrations rather than bad gas mileage or high repair bills.

# If we keep matching this way, the last cluster will not necessarily be matched with the factor on which it scores highest, but we will keep the assignment as it is.
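
The matching above can be sketched in code by asking, for each cluster, which factor has the largest center in absolute value. The matrix below copies the cluster centers printed above, rounded to three decimals; centers_toy and dominant are illustrative names, and taking the absolute value is one possible matching rule, not the only one.

```r
# Cluster centers from the output above, rounded to three decimals.
centers_toy = matrix(c(-0.586,  0.722, -0.405, -0.406,
                       -0.118, -1.217, -0.405, -0.495,
                       -0.058,  0.669,  2.035,  0.193,
                        0.764,  0.646, -0.266,  0.979),
                     nrow = 4, byrow = TRUE,
                     dimnames = list(1:4, c("F1", "F2", "F3", "F4")))

# For each cluster (row), the factor with the largest |center|.
dominant = colnames(centers_toy)[apply(abs(centers_toy), 1, which.max)]
dominant
```

Under this rule clusters 1 and 2 both point to factor 2, with opposite signs, which illustrates why the clusters and factors do not always pair off one-to-one.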