Data pre-processing

Load libraries:

library(dplyr)
library(flexclust)
library(factoextra)
library(pca3d)
library(ggcorrplot)
library(psych)
library(ggplot2)
library(tidyr)

The dataset was extracted from a survey comprising 598 observations and 52 variables. Each variable corresponds to a survey question about the respondents’ demographic information (age, gender, marital status, number of cars owned by the family, etc.) or their opinion on the current energy crisis and the use of public transportation. More details can be found at http://bit.ly/masst123

masst <- read.csv("E:/MSBA/5 Mutivariate Analysis/masst.csv", header=TRUE)

View the data:

colnames(masst)[1] <- "V1" # correct a wrong column name
head(masst)
##   V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20
## 1 12 16 25  1  8 16  1  1  1   1   1   1   1   1   1   1   2   3   2   4
## 2 12 20 16 16 12 20  3  3  3   3   4   4   4   4   4   5   5   5   2   1
## 3 16  9 12 20 20 10  5  4  3   3   2   1   1   1   1   1   1   1   3   2
## 4  9 16  9 16 12 15  1  2  2   3   4   4   5   5   5   5   5   5   5   2
## 5 15 20  3  1  3 20  1  1  2   2   3   3   3   4   4   4   4   5   4   4
## 6 10 15 15 20 12 25  5  5  5   5   5   5   5   5   5   5   5   5   4   2
##   V21 V22 V23 V24 V25 V26 V27 V28 V29 V30 V31 V32 V33 V34 V35 V36 V37 V38
## 1   3   3   3   4   4   2   2   4   2   1   1   1   3   4   3   3   4   4
## 2   2   4   4   3   4   1   3   4   1   2   2   1   4   3   2   4   3   5
## 3   2   4   4   2   4   2   3   4   2   1   2   2   2   4   4   3   4   2
## 4   1   5   5   5   3   2   4   5   2   2   2   1   5   4   4   2   4   4
## 5   4   5   4   1   3   4   4   4   2   2   2   3   4   4   1   2   3   2
## 6   2   4   4   2   5   1   4   1   1   4   2   1   4   2   2   2   2   5
##   V39 V40 V41 V42 V43 V44 V45 V46 V47 V48 V49 V50 V51 V52
## 1   2   5   1   1   1   2   1   3   6   5   3   1   8   1
## 2   2   5   2   2   1   1   0   1   6  NA   5   1   6   1
## 3   2   5   2   1   1   2   0   2   4   1   5   2   3   1
## 4   1   5   1   1   1   3   1   3   6   5   4   1   6   1
## 5   4   5   2   1   1   2   0   4   3   1   5   4   8   1
## 6   4   4   1   1   1   2   2   2   2   2   5   2   5   1

Check missing patterns:

# Number of unique values per column
lapply(masst, function(x) length(unique(x)))

# Share of missing values per column
missing_values <- masst %>% summarize_all(~ sum(is.na(.)) / n())
missing_values <- gather(missing_values, key = "feature", value = "missing_pct")
missing_values %>%
  ggplot(aes(x = reorder(feature, -missing_pct), y = missing_pct)) +
  geom_bar(stat = "identity", fill = "orange3") +
  coord_flip() + theme_bw()

Most of the missing values come from V48 (spouse’s education), since many respondents were unmarried when they completed the questionnaire. I therefore drop this column entirely rather than lose all of those rows.
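
As a quick sanity check before dropping the column, we can look directly at the share of respondents who left V48 blank:

# proportion of missing answers in V48 (spouse's education)
mean(is.na(masst$V48))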

masst <- masst[, -48] # drop V48
dim(masst) # check the remaining dimensions
## [1] 598  51

For this problem I don’t want to introduce bias by imputing missing cells, so let’s simply remove the incomplete observations:
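
Before dropping anything, it is worth counting how many rows are incomplete (the same number follows from comparing dim() before and after):

# number of rows that na.omit() will remove
sum(!complete.cases(masst))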

masst <- na.omit(masst)
dim(masst)
## [1] 386  51

The features are on very different scales. To make them comparable, let’s standardize the data and then explore the dataset through a correlation heatmap. Standardizing is necessary for cluster analysis because clustering techniques are built around measuring distances between observations, so any variable on a disproportionately larger scale than the others would dominate the outcome.
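
Here is a minimal toy illustration (hypothetical numbers, not from the survey): a variable measured in large units swamps the Euclidean distance until the data are standardized.

# hypothetical two-person example: income dwarfs age in the raw distance
toy <- data.frame(age = c(25, 60), income = c(30000, 90000))
dist(toy)        # essentially just the income difference
dist(scale(toy)) # both variables now contribute comparably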

masst <- scale(masst) # standardize: mean 0, sd 1 per column
correl <- cor(masst, use = "complete.obs")
ggcorrplot(correl)

As can be seen, variables V7 through V18 are highly correlated because they ask for essentially the same information.
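
A quick look at the correlation submatrix confirms the block structure:

# inspect the highly correlated block V7 through V18
round(correl[paste0("V", 7:18), paste0("V", 7:18)], 2)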

Hierarchical clustering

Create a distance matrix (Euclidean by default; the single-linkage choice comes in at the clustering step):

D_matrix <- dist(masst)

Plot a dendrogram to see how observations cluster together:

hc <- hclust(D_matrix, method = "single") # single linkage
plot(hc, main = "Mass Transportation Dendrogram", xlab = "Observations", ylab = "Distance")

As can be seen from the plot, this dataset has too many observations to be presented well in a dendrogram. It is also a case where single linkage performs poorly: because of its chaining effect, most observations get merged into one large cluster while a handful of outliers become standalone clusters. So we skip the single-linkage method and proceed to k-means.
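
Cutting the single-linkage tree at, say, five clusters makes the chaining problem explicit:

# cluster sizes when the single-linkage tree is cut into 5 groups:
# expect one dominant cluster plus near-singletons
table(cutree(hc, k = 5))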

k-means clustering

The first step is to determine a reasonable number of clusters using the total within-cluster sum of squares (the elbow method).

set.seed(1)
fviz_nbclust(masst, kmeans, method = "wss")

The plot above shows that the total within-cluster sum of squares decreases only slowly after five clusters (the elbow point), so let’s go with 5 clusters. Still, identifying the elbow can be subjective; in practice it may require domain expertise and consideration of the purpose of the analysis.
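
As a cross-check, factoextra also supports the average-silhouette criterion, which picks the k that maximizes the mean silhouette width:

# cross-check the elbow choice with the silhouette method
set.seed(1)
fviz_nbclust(masst, kmeans, method = "silhouette")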

Run the k-means algorithm:

set.seed(1)
kcl <- kmeans(masst, centers = 5)
kcl

Let’s see how many observations each cluster has:

kcl$size
## [1] 81 60 79 84 82

Since the dataset has many variables, a heatmap of the cluster centers gives a better picture of what each cluster looks like:

heatmap(kcl$centers, xlab="Variables", ylab="Cluster",
        col = colorRampPalette(c("blue", "white", "red"))(100),
        Colv = NA, Rowv = NA, scale = "column")

Color code: blue = low, white = moderate, red = high. From there we can see that Cluster 1 scores high on questions 7 through 11, meaning these respondents are quick to switch to mass transportation after only a slight increase in the gas price. In contrast, respondents in Cluster 4 are slower to change their means of transportation, willing to switch only when the price rises steeply. Cluster 5 seems to be the group least likely to move to public transportation if gas becomes more expensive, while Clusters 2 and 3 are largely moderate in their preference for public transportation.

Cluster 2 scores very low on V44 through V46 and on V50 and V51, the questions about the number of adults, children, cars and full-time workers in the household and about household income. We can deduce that this group consists mostly of young, unmarried people.

Another interesting insight comes from the last question (V52), which reveals the main ethnic group of each cluster. Because this question is coded low for Caucasian respondents and high for other groups, the map suggests that Cluster 5 consists mostly of white people, Clusters 2 and 3 are dominated by African Americans, and Cluster 4 is a mix of people from other groups.
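
Rather than reading all 51 columns off the heatmap, the most extreme center values can be pulled out directly (a small helper on the fitted kcl object; columns of the result correspond to clusters 1 to 5):

# top 5 most distinctive variables per cluster, by |standardized center|
apply(kcl$centers, 1, function(ctr) names(sort(abs(ctr), decreasing = TRUE))[1:5])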

Projecting onto principal components

Let’s project the clusters onto the two largest PCs to see how they relate to each other. Note that with this many variables, the first two PCs explain only a little over 20% of the data variance, so PCA will not give a clear-cut interpretation of the whole dataset.

sigma <- var(masst)

# eigendecomposition of the covariance matrix of the standardized data
eigenvalues <- eigen(sigma)$values
eigenvectors <- eigen(sigma)$vectors

# proportion of variance explained by each PC
percentvars <- eigenvalues / sum(eigenvalues)
cumsum(percentvars)[2] # variance captured by the first two PCs

# loadings of the first 8 PCs; eigen() stores eigenvectors as columns,
# so transpose to get one PC per row
pcs <- t(eigenvectors[, 1:8])
colnames(pcs) <- colnames(masst)

pc1 <- pcs[1, ] # loadings of PC1
pc1 <- sort(pc1, decreasing = TRUE)
head(pc1) # the 6 highest positive loadings in PC1
##       V38       V25       V32        V5       V17       V26 
## 0.3837394 0.3347478 0.2840092 0.2479738 0.2384378 0.2204601
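
One caveat: sorting the raw loadings surfaces only the large positive ones, and eigenvector signs are arbitrary, so variables with large negative loadings matter just as much. Ranking by absolute value avoids overlooking them:

# rank PC1 variables by |loading| so strong negative contributors show up too
head(sort(abs(pcs[1, ]), decreasing = TRUE))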

As can be seen, PC1 is driven by variables such as V38, V25, V32, V5, V17 and V26. People who score high on these questions tend to hold strong opinions against the electricity and petroleum companies, strongly agreeing that these firms have not done all they could to solve the energy problem.

pc2 <- pcs[2, ] # loadings of PC2
pc2 <- sort(pc2, decreasing = TRUE)
head(pc2) # the 6 highest positive loadings in PC2
##       V39        V5        V8       V30        V7       V32 
## 0.2807093 0.2759152 0.2747267 0.2200885 0.2024919 0.1943976

PC2 is represented by variables such as V39, V5, V8 and V30. People who score high on these variables tend to live farther from school or work than the average respondent and to be more price sensitive, so they are more willing to use mass transportation when there is even a slight increase in the gas price.

Let’s see how the clusters relate to each other when projected onto the two largest PCs:

pca <- prcomp(masst, scale. = TRUE) # scale. = TRUE is redundant here, since masst is already standardized
pca2d(pca, group = kcl$cluster, show.labels = TRUE, fancy = TRUE)

In the plot, Cluster 1 is shown in orange, Cluster 2 in light blue, Cluster 3 in green, Cluster 4 in yellow and Cluster 5 in dark blue.

As can be seen, Clusters 3 and 4 are less likely to adopt mass transportation, and Clusters 1, 2 and 3 include many people who disagree with the petroleum and electricity industries’ practices.
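
For comparison, factoextra offers a one-line ggplot view of the same projection, coloring observations by their k-means assignment:

# alternative view: fviz_cluster projects onto the first two PCs by default
fviz_cluster(kcl, data = masst, geom = "point") + theme_bw()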

Overall, we would need more PCs to explain the data variance more fully and to draw more detailed insights from the visualization.
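
How many components that would take can be read off the cumulative variance explained by the prcomp fit above:

# cumulative proportion of variance explained by the first ten PCs
summary(pca)$importance["Cumulative Proportion", 1:10]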