Project Overview:

Develop a market segmentation system, to show your knowledge of factor analysis/principal component analysis. Data used is from the Simmons National Consumer Study dataset.

This segmentation system uses CNN cable tv and cnn.com viewers. The Spanish version of CNN was not included in this project. For this project, I chose a total of 6 variables, of which three of those questions belong to the Brand Loyalty group, and the other 3 questions belonging to the Career Achievers groups in order to demonstrate my knowledge of factor analysis. The factor analysis reduced all 6 questions into 2 factors. Then 4 more driver variables were added to the two factor analysis results to run a k-means clustering. Based on the diagnostic statistic “cubic clustering criterion” (ccc), 4 clusters were chosen as the ideal number of clusters. Lastly, five new variables were added as non-drivers to use to explain the make-up of the observers in each cluster.

Import Data and clean the data removing non-CNN viewers

NOTE: Data was previsouly obtain, using the read.fwf() function in R, and saved in the files used below for faster execution.

##Get CNN - All observations in dataset (viewers/non-viewers)
setwd("C:/Users/Deeandrea Burgos/Dropbox/Graduate School/Data Analytics/Database and Practicum/Exam #1/Raw Datasets")
cnn <- read.csv("cnnViewers.csv",head=TRUE, sep = ",")
factors <- read.csv("factors.csv", head=TRUE, sep=",")
cluster <- read.csv("cluster1.csv",head=TRUE, sep = ",")
nonDriver <- read.csv("nonDriver.csv", head=TRUE, sep = ",")
edu <- read.csv("education.csv", head=TRUE, sep = ",")

##Merge all questions together
master <- cbind(cnn,factors,cluster,nonDriver,edu)

##Remove non-CNN viewers
master <- master[cnn == 1,]
master <- master[-c(1)]

Questions Chosen For Factor Analysis

Factor 1: Brand Loyalty Questions: 1. I have a few favorite clothes brands that I always stick with. (brandStick) 2. Regardless of the type of clothing I’m shopping for, I normally look for my favorite brands first. (brandFirst) 3. I usually only shop at my favorite stores because I know they have the brands I like. (brandShop)

Factor 2: Career Achiever Questions: 1. I am a workaholic. (work) 2. I want to get to the top of my career. (career) 3. I am willing to sacrifice time with my family in order to get ahead. (ahead)

Questions Chosen For the Clustering

  1. Political outlook. (political)
  2. I am interested in international events. (international)
  3. It is important to be well informed about things. (informed)
  4. I worry about violence and crime. (crime)

Non-driver questions to see the “make-up” of the clusters

  1. How you consider yourself: Intelligent (smart, bright, well informed)?
  2. How many hours a week (on avg.) do you use a computer at work?
  3. Do you think that in the coming months the American economy will be better or worse than it is now?
  4. Education: no formal schooling -> graduate degree (11 levels)
  5. Marital status: Married, widowed, divorced, separated, single

Find duplicate enteries, if any

##Find duplicates of answers amongst all questions
#Factors and Cluster questions
anyDuplicateFC <- matrix(,nrow=nrow(master), ncol = 10)
for(i in 1:10){
  anyDuplicateFC[,i]=rowSums(master[,(1 + (5*(i-1))):(5*i)])
}
colnames(anyDuplicateFC) <- c("dF1Q1","dF1Q2","dF1Q3","dF2Q1","dF2Q2","dF2Q3","dC1","dC2","dC3","dC4")
table(anyDuplicateFC)
## anyDuplicateFC
##     0     1 
##  3275 49425
#Non-Driver questions
dupQ1 <- rowSums(master[,51:55])
#table(dupQ1)
dupQ2 <- rowSums(master[,56:61])
dupQ3 <- rowSums(master[,62:66])
dupQ4 <- rowSums(master[,67:71])
dupEdu <- rowSums(master[,72:82])

anyDuplicateND <- cbind(dupQ1,dupQ2, dupQ3, dupQ4, dupEdu)
table(anyDuplicateND)
## anyDuplicateND
##     0     1 
##  4478 21872
allDuplicates <- cbind(anyDuplicateFC,anyDuplicateND)
newMaster <- cbind(master,allDuplicates)

#Remove any possible duplicates
newMaster <- newMaster[!(newMaster$dF1Q1>1),];newMaster <- newMaster[!(newMaster$dF1Q2>1),];newMaster <- newMaster[!(newMaster$dF1Q3>1),]
newMaster <- newMaster[!(newMaster$dF2Q1>1),];newMaster <- newMaster[!(newMaster$dF2Q2>1),];newMaster <- newMaster[!(newMaster$dF2Q3>1),]
newMaster <- newMaster[!(newMaster$dC1>1),];newMaster <- newMaster[!(newMaster$dC2>1),];newMaster <- newMaster[!(newMaster$dC3>1),];newMaster <- newMaster[!(newMaster$dC4>1),]
newMaster <- newMaster[!(newMaster$dupQ1>1),];newMaster <- newMaster[!(newMaster$dupQ2>1),];newMaster <- newMaster[!(newMaster$dupQ3>1),];newMaster <- newMaster[!(newMaster$dupQ4>1),]
newMaster <- newMaster[!(newMaster$dupEdu>1),]

Calculate variable importance for all variables: Factors, Clusters, Non-Drivers

##Importance calculations
#NOTE: All variables are clean at this point
importanceFC <- matrix(,nrow = nrow(newMaster), ncol = 10)
for(i in 1:10){
  importanceFC[,i] = (newMaster[,1+(5*(i-1))]*5) + (newMaster[,2+(5*(i-1))]*4) + (newMaster[,3+(5*(i-1))]*3) + (newMaster[,4+(5*(i-1))]*2) + (newMaster[,5+(5*(i-1))]*1)
}
colnames(importanceFC) <- c("brandStick","brandFirst","brandShop","work","career","ahead","political","international","informed","crime")
table(importanceFC)
## importanceFC
##     0     1     2     3     4     5 
##  3275  5898  5535 11334 13556 13102
importanceND <- matrix(,nrow=nrow(newMaster), ncol = 4)#Number of columns is number of non-driver questions
importanceND[,1] <- (newMaster[,51]*5) + (newMaster[,52]*4) + (newMaster[,53]*3) + (newMaster[,54]*2) + (newMaster[,55]*1)
importanceND[,2] <- (newMaster[,56]*21) + (newMaster[,57]*17.5) + (newMaster[,58]*12) + (newMaster[,59]*7) + (newMaster[,60]*3) + (newMaster[,61]*2)#using a mean midpoint
importanceND[,3] <- (newMaster[,62]*1) + (newMaster[,63]*2) + (newMaster[,64]*3) + (newMaster[,65]*4) + (newMaster[,66]*5)
#importanceND[,4] <- (newMaster[,67]*5) + (newMaster[,68]*4) + (newMaster[,69]*3) + (newMaster[,70]*2) + (newMaster[,71]*1)

importanceND[,4] <- (newMaster[,72]*0) + (newMaster[,73]*8) + (newMaster[,74]*10) + (newMaster[,75]*12) + (newMaster[,76]*12.5)+ (newMaster[,77]*13) + (newMaster[,78]*14) + (newMaster[,79]*15) + (newMaster[,80]*16)+ (newMaster[,81]*17) + (newMaster[,82]*18)
colnames(importanceND) <- cbind("smart", "computer","economy","education")

masterClus <- data.frame(cbind(importanceFC,importanceND))

Principal Component for the two factors: Brand Loyalty and Career Acheivers

#Verimax Principal Component
library(psych)
fit <- principal(masterClus[,1:6],nfactors=2, rotate = "varimax", scores = TRUE)
print.psych(fit, sort=T) #prints the sorted loadings
## Principal Components Analysis
## Call: principal(r = masterClus[, 1:6], nfactors = 2, rotate = "varimax", 
##     scores = TRUE)
## Standardized loadings (pattern matrix) based upon correlation matrix
##            item  RC2   RC1   h2   u2 com
## career        5 0.79  0.12 0.64 0.36   1
## work          4 0.76  0.06 0.59 0.41   1
## ahead         6 0.73 -0.01 0.54 0.46   1
## brandStick    1 0.07  0.83 0.70 0.30   1
## brandFirst    2 0.07  0.83 0.69 0.31   1
## brandShop     3 0.07  0.60 0.36 0.64   1
## 
##                        RC2  RC1
## SS loadings           1.76 1.75
## Proportion Var        0.29 0.29
## Cumulative Var        0.29 0.59
## Proportion Explained  0.50 0.50
## Cumulative Proportion 0.50 1.00
## 
## Mean item complexity =  1
## Test of the hypothesis that 2 components are sufficient.
## 
## The root mean square of the residuals (RMSR) is  0.13 
##  with the empirical chi square  2675.45  with prob <  0 
## 
## Fit based upon off diagonal values = 0.74
round(cor(fit$scores)) #Exam scores thoroughly to check if indeed uncorrelated
##     RC2 RC1
## RC2   1   0
## RC1   0   1
biplot(fit)

plot(fit$values, type = "o", ylab = "eigenvalues", xlab="Principal Components", 
     main = "Screeplot for Factors:\n BrandLoyalty & Career Achievers")

#Put the principal components, Fac1 and Fac2, in its own data frame per component
fact = data.frame(fit$scores)

#colnames(fact) <- c("PCcareer", "PCbrand")
fact1_brand = data.frame(fact$RC1)
colnames(fact1_brand) <- c("brandLoyalty_FAC1")
fact2_career = data.frame(fact$RC2)
colnames(fact2_career) <- c("careerAchievers_FAC2")

Cluster Variables

#Make cluster variable
cluster <- cbind(fact1_brand,fact2_career,masterClus[,7:10])

#Overview of result of clusters 2 to 15
clus <- (nrow(cluster)-1)*sum(apply(cluster,2,var))
for(i in 2:15) clus[i] <- sum(kmeans(cluster, centers=i)$withinss)
plot(1:15, clus, typeb="b", xlab="Number of Clusters", ylab="Within groups sum of squares")

#K-means cluster analysis - manually
set.seed(123)
fitC3 <- kmeans(cluster, 3, nstart=20) #3 to 7 cluster solution
set.seed(123)
fitC4 <- kmeans(cluster, 4, nstart=20)
set.seed(123)
fitC5 <- kmeans(cluster, 5, nstart=20)
set.seed(123)
fitC6 <- kmeans(cluster, 6, nstart=20)
set.seed(123)
fitC7 <- kmeans(cluster, 7, nstart=20)

#get cluster means label them as low, medium, high and compare the two.NOTE: Factors are on different scale from the cluster variables that are 1-5
clusterK3Means <- aggregate(cluster, by=list(fitC3$cluster),FUN=mean)
colnames(clusterK3Means)[1] <- c("Cluster Num") #Rename only first column
clusterK4Means <- aggregate(cluster, by=list(fitC4$cluster),FUN=mean)
colnames(clusterK4Means)[1] <- c("Cluster Num") #Rename only first column

#3-Clusters
clusterK3Means[,2:3] <- ifelse(clusterK3Means[,2:3]< (-0.3), "low", ifelse(clusterK3Means[,2:3]<0.3 & clusterK3Means[,2:3]>=(-0.3), "medium","high"))
clusterK3Means[,4:7] <- ifelse(clusterK3Means[,4:7]< 1.7, "low", ifelse(clusterK3Means[,4:7]<3.4 & clusterK3Means[,4:7]>=1.7, "medium","high"))
#Cluster 2 is very different from 1 and 3. Cluster 1 and 3 only differ by two driver variables

#4-Clusters
clusterK4Means[,2:3] <- ifelse(clusterK4Means[,2:3]< (-0.3), "low", ifelse(clusterK4Means[,2:3]<0.3 & clusterK4Means[,2:3]>=(-0.3), "medium","high"))
clusterK4Means[,4:7] <- ifelse(clusterK4Means[,4:7]< 1.7, "low", ifelse(clusterK4Means[,4:7]<3.4 & clusterK4Means[,4:7]>=1.7, "medium","high"))
#Here can see that cluster 2 and 4 differ by only one driver variable. 

library(useful)
## Loading required package: ggplot2
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha

library(factoextra)
#Plots take approximately 4 minutes to execute each
plot(fitC3, data=cluster, title = "K-Means = 3") #multidimensional scaling to project the data into two dimensions and color code the points according to cluster membership

plot(fitC4, data=cluster, title = "K-Means = 4")

plot(fitC5, data=cluster, title = "K-Means = 5")

fviz_cluster(fitC3, data=cluster, geom="point", stand=FALSE, frame.type="norm", main="3-Cluster Plot")

fviz_cluster(fitC4, data=cluster, geom="point", stand=FALSE, frame.type="norm", main="4-Cluster Plot")

#To get Cubic Clustering Criterion
library(NbClust)
c_ccc <- NbClust(cluster, distance = "euclidean", min.nc=3, max.nc=8, method="kmeans", index = "ccc")
plot(3:8, c_ccc$All.index, xlab = "Clusters", ylab = "CCC", type = "line", main = "Diagnostic Statistics")

#says to choose 4 clusters

Create means table for non-drivers and describe the viewers of the different clusters

Determine the number of clusters

#Create means table for non-drivers
clustering <- cbind(importanceFC,importanceND, newMaster[62:66]) #Marital status included
#clustering <- cbind(importanceFC,importanceND) #education level included, no marital

clusNum3 <- data.frame(fitC3$cluster) #This gets the cluster numbers for each obs. when its for 3 clusters
colnames(clusNum3) <- c("Cluster Num")
clusNum3 <- cbind(clusNum3,clustering)
clusNum3[clusNum3 == 0] <- NA

#Only make variables that are 0/1, into 0 and 1 vs 1 and NA
clusNum3$Married[is.na(clusNum3$Married)] <- 0
clusNum3$Widowed[is.na(clusNum3$Widowed)] <- 0
clusNum3$Divorced[is.na(clusNum3$Divorced)] <- 0
clusNum3$Separated[is.na(clusNum3$Separated)] <- 0
clusNum3$Single[is.na(clusNum3$Single)] <- 0

clusID3 <- aggregate(.~clusNum3$`Cluster Num`, data = clusNum3, FUN=mean,na.rm=TRUE)
#########
#Choosing to do 4 clusters but reviewed the 3 cluster
clusNum4 <- data.frame(fitC4$cluster)#This gets the cluster numbers for each obs. when its for 4 clusters
colnames(clusNum4) <- c("Cluster Num")
clusNum4 <- cbind(clusNum4,clustering)
clusNum4[clusNum4 == 0] <- NA

clusNum4$Married[is.na(clusNum4$Married)] <- 0
clusNum4$Widowed[is.na(clusNum4$Widowed)] <- 0
clusNum4$Divorced[is.na(clusNum4$Divorced)] <- 0
clusNum4$Separated[is.na(clusNum4$Separated)] <- 0
clusNum4$Single[is.na(clusNum4$Single)] <- 0

clusID4 <- aggregate(.~clusNum4$`Cluster Num`, data = clusNum4, FUN=mean, na.rm=TRUE)
clusID4
##   clusNum4$`Cluster Num` Cluster Num brandStick brandFirst brandShop
## 1                      1           1   3.725275   3.095238  3.945055
## 2                      2           2   3.760431   3.323015  3.915209
## 3                      3           3   2.200000   1.400000  3.400000
## 4                      4           4   3.630252   3.233193  3.951681
##       work   career    ahead political international informed    crime
## 1 2.673993 3.047619 2.476190  3.296703      2.117216 4.263736 3.652015
## 2 2.833109 3.468371 2.694482  3.674293      4.172275 4.702557 4.005384
## 3 2.000000 1.200000 2.600000  2.000000      1.600000 1.200000 1.800000
## 4 2.596639 3.138655 2.491597  1.655462      4.298319 4.701681 3.739496
##      smart computer  economy education   Married    Widowed   Divorced
## 1 4.032967 15.46703 1.893773  14.94322 0.7032967 0.01831502 0.10989011
## 2 4.251682 15.33244 1.788694  15.52221 0.7321669 0.02691790 0.09421265
## 3 2.200000  5.20000 2.800000  12.20000 0.4000000 0.20000000 0.00000000
## 4 4.376050 16.32038 2.098739  16.14706 0.6575630 0.02100840 0.10084034
##     Separated    Single
## 1 0.018315018 0.1501832
## 2 0.013458950 0.1332436
## 3 0.000000000 0.4000000
## 4 0.006302521 0.2142857

Based on the results, cluster 3 is significantly different from the other groups. The other clusters are fairly similar, meaning there is enough overlap that I’m a little conserned that the clusters are not as unique as I was hoping for. In this scenario, I would consider in trying other questions to see if I can obtain more distinct clusters.

DIFFERENCES AND SIMILARITIES BETWEEN THE SEGMENTS

Cluster 1

(Married with little concern about anything around them) 60% of responders in this group are married with an average time of 13 hours using a computer for work, and an average education of 11th grade. This cluster believes on average the American economy will be somewhat worse in the next 12-months. On average, most people in this cluster feel they are agree a little that they are smart. People in this group also are neutral when it comes to worrying about crime and violence. People in this group disagree a little that they are interested in international events.

Cluster 2

(Natural worriers about things going on around them) 65% of responders in this group are married with an average time of 12.6 hours using a computer for work, and an average education of 11th grade. This cluster believes on average the American economy will be significantly worse in the next 12-months. On average, most people in this cluster feel they are agree a little that they are smart. People in this group also are agree a little when it comes to worrying about crime and violence. People in this group agree a little that they are interested in international events.

Cluster 3

(Single low educated) This cluster is the most different of all of them. 66% of responders are single with an average time of 5.6 hours using a computer for work, and average education of 10th grade, which is the lowest amongst the clusters. This cluster believes the American economy will be fairly neutral in the coming 12 months. On average, people in this group disagree slightly that they are smart. People in this group hardly ever worry about crime and violence. People in this group disagree a little that they are interested in international events.

Cluster 4

(The agree a little group) 57% of responders in this group are married with an average time of 12.6 hours using a computer for work, and an average education of 12th grade (Finished high school). This cluster believes on average the American economy will be somewhat worse in the next 12-months. On average, most people in this cluster feel they are agree a little that they are smart. People in this group also are agree a little when it comes to worrying about crime and violence. People in this group agree a little that they are interested in international events.