ANLY 699 - Cluster Analysis Assignment

============================================================================================================
About: This document is also available at http://rpubs.com/sherloconan/615149
Source: https://zhuanlan.zhihu.com/p/37856153

Goals

In this assignment you will work with the dataset from your 699 project and perform a cluster analysis.

Submission Format

Submit 2 files: Rmarkdown and a knitted Rmarkdown (html or pdf).
Text should be entered outside of code blocks (do not use #comments to describe your figures).
Format your graphs properly: captions, title, axis labels.

Tasks

Find the optimal number of clusters (elbow, gap or silhouette methods). [ - 10pts]
Perform the K-Means cluster analysis. [ - 10pts]
Perform the hierarchical analysis. [ - 10pts]
Describe the results. [ - 10pts]

Project Progress

The EDA parts are available at RPubs - part I and RPubs - part II. The modeling part is available at RPubs - part III.

The project dataset is the Bitcoin price as a time series, which does not lend itself to this kind of cluster analysis. The clustering is therefore run on another example dataset. That dataset mixes numeric, nominal, and binary variables, so a dissimilarity matrix is computed first with daisy() from the cluster package, which switches to Gower's coefficient automatically when some columns are not numeric. The cluster analysis is then run on that matrix.

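# stringsAsFactors = TRUE keeps the text columns as factors, as in the summary below
data <- read.csv("~/Documents/HU/ANLY 699-90-O/699 R/M2_IFI_Data_Product.csv", stringsAsFactors = TRUE)
summary(data[-1])  # the first column is left out of the summary and of the clustering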

##       Age        Gender         Habitat        Income       Married  
##  Min.   :18.00   F:300   City_center:269   Min.   : 50352   No :204  
##  1st Qu.:30.00   M:300   Rural      : 96   1st Qu.:138388   Yes:396  
##  Median :42.00           Small_town :173   Median :200416            
##  Mean   :42.40           Suburban   : 62   Mean   :220366            
##  3rd Qu.:55.25                             3rd Qu.:288706            
##  Max.   :67.00                             Max.   :505040            
##     Children      Car      Savings_Account Current_Account  Loan    
##  Min.   :0.000   No :304   No :186         No :145         No :391  
##  1st Qu.:0.000   Yes:296   Yes:414         Yes:455         Yes:209  
##  Median :1.000                                                      
##  Mean   :1.012                                                      
##  3rd Qu.:2.000                                                      
##  Max.   :3.000                                                      
##  Family_Quotient  Product  
##  Min.   : 20283   No :326  
##  1st Qu.: 98773   Yes:274  
##  Median :155820            
##  Mean   :178188            
##  3rd Qu.:231578            
##  Max.   :492432

library(cluster)  # provides daisy()
dmatrix <- daisy(data[-1])  # large dissimilarity object: 600 * 599 / 2 = 179,700 pairwise values
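
To make the dissimilarity concrete, here is a minimal base-R sketch of Gower's coefficient for one pair of observations, assuming equal variable weights and no missing values. The toy data frame and the helper gower_pair() are hypothetical illustrations, not part of the project code. Numeric variables contribute the absolute difference divided by the variable's range, and nominal or binary variables contribute 0 for a match and 1 for a mismatch.

```r
# Toy data (hypothetical values, not taken from the project dataset)
toy <- data.frame(
  Age     = c(20, 40, 60),
  Gender  = factor(c("F", "M", "F")),
  Married = factor(c("No", "Yes", "Yes"))
)

# Hypothetical helper: Gower dissimilarity between rows i and j of df
gower_pair <- function(df, i, j) {
  per_var <- sapply(df, function(col) {
    if (is.numeric(col)) {
      abs(col[i] - col[j]) / diff(range(col))  # range-scaled absolute difference
    } else {
      as.numeric(col[i] != col[j])             # 0 if the categories match, 1 otherwise
    }
  })
  mean(per_var)  # equal-weight average over the variables
}

gower_pair(toy, 1, 2)  # (0.5 + 1 + 1) / 3
gower_pair(toy, 2, 3)  # (0.5 + 1 + 0) / 3 = 0.5
```
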

Elbow Method

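Fig. 1 plots the average total within-cluster sum of squares against the number of clusters (k). The elbow method looks at how much of the variance each additional cluster explains. The optimal k is at the “elbow”, the point where the curve flattens and adding another cluster no longer reduces the within-cluster sum of squares by much.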
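
As a small illustration of the quantity on the vertical axis, the sketch below computes the total within-cluster sum of squares by hand for a toy partition and compares it with the tot.withinss value reported by kmeans(). The toy values and the names toy_x, toy_cluster, and km_toy are assumptions made for this example only.

```r
# Toy one-dimensional data (hypothetical values, not the project data)
toy_x <- c(1, 2, 3, 9, 10, 11)
toy_cluster <- c(1, 1, 1, 2, 2, 2)

# Within-cluster sum of squares: squared deviations from each cluster's own mean
within_ss <- tapply(toy_x, toy_cluster, function(v) sum((v - mean(v))^2))
total_within_ss <- sum(within_ss)

# kmeans() started from explicit centers at 1 and 11 reaches the same partition
km_toy <- kmeans(matrix(toy_x, ncol = 1), centers = matrix(c(1, 11), ncol = 1))

total_within_ss      # hand-computed value
km_toy$tot.withinss  # value reported by kmeans()
```
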

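# For each k in 2:20, average tot.withinss over 15 randomly initialized k-means runs
avg.totw.ss <- numeric(19)  # one slot per k
for (k in 2:20) {
  totw.ss <- numeric(15)
  for (trial in 1:15) {
    runs <- kmeans(dmatrix, centers=k)  # kmeans() coerces dmatrix to a 600 x 600 matrix
    totw.ss[trial] <- runs$tot.withinss
  }
  avg.totw.ss[k-1] <- mean(totw.ss)  # k = 2 is stored at index 1
}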

library(ggplot2)
ggplot(data.frame(x = 2:20, y = avg.totw.ss), aes(x, y)) +
  geom_line() +
  geom_point(size = 2.5) +
  labs(x = "Number of Clusters (starting with k=2)", y = "Average Total Within Sum of Squares") +
  ggtitle("Fig. 1. Plot of TotW.SS by k in Clustering") +
  theme_classic()

K-Means Cluster

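K-means clustering is a method of vector quantization, originally from signal processing. It aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, which serves as the prototype of the cluster. Here kmeans() receives the dissimilarity object and coerces it to a 600 × 600 matrix, so each observation is represented by its vector of dissimilarities to all observations rather than by its original variables.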
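
The “nearest mean” idea can be sketched as a single assignment-and-update pass. This is a simplified Lloyd-style iteration in base R and not the Hartigan-Wong algorithm that kmeans() uses by default. The function lloyd_step() and the toy points are hypothetical, and the sketch assumes that no cluster ends up empty.

```r
# Hypothetical helper: one assignment + update pass of a Lloyd-style k-means
lloyd_step <- function(x, centers) {
  k <- nrow(centers)
  # Squared Euclidean distance from every row of x to every center (n x k matrix)
  d2 <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
  # Assignment: each observation goes to the cluster with the nearest center
  cluster <- max.col(-d2, ties.method = "first")
  # Update: each center becomes the mean of the observations assigned to it
  new_centers <- do.call(rbind, lapply(seq_len(k), function(j) {
    colMeans(x[cluster == j, , drop = FALSE])
  }))
  list(cluster = cluster, centers = new_centers)
}

# Toy points (hypothetical): two well-separated groups in the plane
toy_pts <- rbind(c(1, 1), c(1, 2), c(2, 1), c(8, 8), c(8, 9), c(9, 8))
step <- lloyd_step(toy_pts, centers = toy_pts[c(1, 4), ])
step$cluster
step$centers
```

Repeating the pass until the assignments stop changing gives the basic algorithm. The random choice of starting centers is why the code in this section fixes a seed.
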

for (k in 4:8) {
  set.seed(699)  # same seed for every k so that each run is reproducible
  km <- kmeans(dmatrix, centers = k)
  data[[paste0("km", k)]] <- km$cluster  # appends columns km4, ..., km8
}

Hierarchical Cluster

Hierarchical clustering seeks to build a hierarchy of clusters. There are two types:
Agglomerative: a “bottom-up” approach in which each observation starts in its own cluster and pairs of clusters are merged as one moves up the hierarchy.
Divisive: a “top-down” approach in which all observations start in one cluster and splits are performed recursively as one moves down the hierarchy.
The analysis here is agglomerative: hclust() merges clusters step by step, using Ward's criterion (method "ward.D2").
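
The bottom-up process can be followed on a handful of points. The sketch below uses hypothetical one-dimensional values and the name hc_toy, both chosen for illustration only. It builds the tree with hclust() and then cuts it into a fixed number of groups with cutree().

```r
# Toy one-dimensional data (hypothetical values, not the project data)
toy_vals <- c(a = 1, b = 2, c = 6, d = 7, e = 20)

# Agglomerative clustering on pairwise Euclidean distances, Ward's criterion
hc_toy <- hclust(dist(toy_vals), method = "ward.D2")

hc_toy$merge           # merge history: negative entries are single observations
cutree(hc_toy, k = 3)  # a and b together, c and d together, e on its own
cutree(hc_toy, k = 2)  # a, b, c, d together, e on its own
```

Cutting the same tree at different k never requires refitting, which is why the hierarchical code in this section calls hclust() once and cutree() several times.
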

set.seed(699)
hc <- hclust(dmatrix, method="ward.D2")  # agglomerative clustering, Ward's criterion
for (k in 4:8) {
  data[[paste0("hc", k)]] <- cutree(hc, k)  # cut the tree into k groups: hc4, ..., hc8
}

Result

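The number of clusters is chosen as k=4. A two-way table cross-tabulates cluster membership against a categorical variable to show how each class is distributed over the clusters. The table here uses the k-means clusters and the variable indicating whether the customer acquired the product. Every cluster is pure with respect to Product: clusters 1 and 4 contain only customers who acquired the product (144 and 130), while clusters 2 and 3 contain only customers who did not (130 and 196).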

table(data$km4, data$Product)

##    
##      No Yes
##   1   0 144
##   2 130   0
##   3 196   0
##   4   0 130
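
When clusters are less clear-cut, the same kind of two-way table is often easier to read as row proportions. The sketch below uses made-up label vectors (toy_km and toy_product are hypothetical, not columns of the dataset) and prop.table() with margin = 1, so that each row sums to 1.

```r
# Hypothetical cluster labels and outcomes for six customers
toy_km <- c(1, 1, 2, 2, 2, 3)
toy_product <- factor(c("Yes", "Yes", "No", "No", "Yes", "No"))

tab <- table(toy_km, toy_product)     # counts per cluster and class
row_share <- prop.table(tab, margin = 1)  # share of each class within each cluster
row_share
```

A row close to 0 or 1 indicates a cluster dominated by one class, and a row near 0.5 indicates a cluster that does not separate the classes.
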