============================================================================================================
About: This document is also available at http://rpubs.com/sherloconan/615149
Source: https://zhuanlan.zhihu.com/p/37856153

 

Goals

In this assignment you will be working with the dataset from your 699 project. You will perform a cluster analysis.

Submission Format

Tasks

  1. Find the optimal number of clusters (elbow, gap or silhouette methods). [ - 10pts]

  2. Perform the K-Means cluster analysis. [ - 10pts]

  3. Perform the hierarchical analysis. [ - 10pts]

  4. Describe the results. [ - 10pts]

 

Project Progress

The EDA parts are available at RPubs - part I and RPubs - part II. The modeling part is available at RPubs - part III.

The project dataset is the Bitcoin price as a time series, so a cluster analysis cannot be conducted on it. Hence, the clustering is run on another example dataset. That dataset contains nominal and binary variables alongside numeric ones, so a dissimilarity matrix is calculated first and the cluster analysis is run on that matrix.

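library(cluster)  # daisy()
library(ggplot2)  # ggplot()
# stringsAsFactors=TRUE keeps the text columns as factors (not the default since R 4.0)
data <- read.csv("~/Documents/HU/ANLY 699-90-O/699 R/M2_IFI_Data_Product.csv", stringsAsFactors=TRUE)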
summary(data[-1])
##       Age        Gender         Habitat        Income       Married  
##  Min.   :18.00   F:300   City_center:269   Min.   : 50352   No :204  
##  1st Qu.:30.00   M:300   Rural      : 96   1st Qu.:138388   Yes:396  
##  Median :42.00           Small_town :173   Median :200416            
##  Mean   :42.40           Suburban   : 62   Mean   :220366            
##  3rd Qu.:55.25                             3rd Qu.:288706            
##  Max.   :67.00                             Max.   :505040            
##     Children      Car      Savings_Account Current_Account  Loan    
##  Min.   :0.000   No :304   No :186         No :145         No :391  
##  1st Qu.:0.000   Yes:296   Yes:414         Yes:455         Yes:209  
##  Median :1.000                                                      
##  Mean   :1.012                                                      
##  3rd Qu.:2.000                                                      
##  Max.   :3.000                                                      
##  Family_Quotient  Product  
##  Min.   : 20283   No :326  
##  1st Qu.: 98773   Yes:274  
##  Median :155820            
##  Mean   :178188            
##  3rd Qu.:231578            
##  Max.   :492432
dmatrix <- daisy(data[-1]) # pairwise dissimilarities for 600 rows: 600*599/2 = 179,700 elements
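
To make the dissimilarity step concrete, here is a minimal sketch on a hypothetical three-row data frame (toy, by.hand and n.pairs are illustrative names, not part of the project). It assumes Gower's coefficient, which the cluster documentation says daisy() uses when some columns are not numeric; the metric is passed explicitly here to avoid relying on that default.

```r
library(cluster)  # provides daisy()

# Hypothetical toy data, not the project dataset
toy <- data.frame(
  Age    = c(20, 30, 40),
  Gender = factor(c("F", "M", "F"))
)
d <- daisy(toy, metric = "gower")
print(d)

# Pair (1, 2) by hand: Age contributes |20 - 30| / (40 - 20),
# Gender contributes 1 because the levels differ; the two are averaged.
by.hand <- (abs(20 - 30) / (40 - 20) + 1) / 2

# A dissimilarity object stores n * (n - 1) / 2 pairs; for n = 600 rows:
n.pairs <- 600 * 599 / 2
```

The value of n.pairs matches the 179,700 elements noted for dmatrix.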

 

Elbow Method

Fig. 1 plots the average total within-cluster sum of squares against the number of clusters (k). The elbow method looks at how much of the variance is explained as k grows. The optimal k is taken at the "elbow" of the curve, where the rate of decrease in the within-cluster sum of squares starts to level off.

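# For each k, average tot.withinss over 15 random starts to smooth out
# the dependence of kmeans() on its initial centers.
avg.totw.ss <- numeric(19)  # one slot per k in 2:20
for (k in 2:20) {
  totw.ss <- numeric(15)
  for (trial in 1:15) {
    runs <- kmeans(dmatrix, centers=k)  # kmeans() coerces dmatrix to a full 600 x 600 matrix
    totw.ss[trial] <- runs$tot.withinss
  }
  avg.totw.ss[k-1] <- mean(totw.ss)  # index k-1 because k starts at 2
}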

ggplot(data.frame(x=2:20, y=avg.totw.ss), aes(x, y)) +
  geom_line() +
  geom_point(size=2.5) +
  labs(x="Number of Clusters (starting with k=2)", y="Average Total Within Sum of Squares") +
  ggtitle("Fig. 1. Plot of TotW.SS by k in Clustering") +
  theme_classic()
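
Task 1 also names the silhouette method. The sketch below shows one way it could look on a dissimilarity matrix. It swaps in pam() from the cluster package as the partitioning method, because pam() accepts dissimilarities directly; this is an assumption for illustration, not the procedure used in this report. The toy data and the names ks, avg.sil and best.k are hypothetical.

```r
library(cluster)  # daisy(), pam()

set.seed(1)
# Hypothetical toy data: two well-separated groups of mixed-type records
toy <- data.frame(
  Income  = c(rnorm(20, mean = 100, sd = 5), rnorm(20, mean = 300, sd = 5)),
  Married = factor(rep(c("No", "Yes"), each = 20))
)
d <- daisy(toy, metric = "gower")

# Average silhouette width for each candidate k; larger is better
ks <- 2:6
avg.sil <- sapply(ks, function(k) pam(d, k, diss = TRUE)$silinfo$avg.width)

# Candidate k with the widest average silhouette
best.k <- ks[which.max(avg.sil)]
print(data.frame(k = ks, avg.sil = avg.sil))
```

On the project data, the same loop could be run with dmatrix in place of d.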

 

K-Means Cluster

k-means clustering is a method of vector quantization that originated in signal processing. It partitions n observations into k clusters so that each observation belongs to the cluster with the nearest mean (the cluster centroid), which serves as the prototype of that cluster.
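
The "nearest mean" property can be checked directly on a fitted model. The following is a minimal sketch on hypothetical two-dimensional toy data (x, d2 and nearest are illustrative names): after kmeans() converges, every point should sit in the cluster whose mean is closest, and the sum of those closest squared distances should equal tot.withinss.

```r
set.seed(699)
# Hypothetical toy data: two tight groups of 2-D points
x <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(40, mean = 3, sd = 0.3), ncol = 2))
km <- kmeans(x, centers = 2, nstart = 5)

# Squared Euclidean distance from every point to every cluster mean
d2 <- sapply(1:2, function(j) rowSums(sweep(x, 2, km$centers[j, ])^2))

# Index of the closest mean for each point
nearest <- apply(d2, 1, which.min)

print(all(nearest == km$cluster))
print(all.equal(sum(apply(d2, 1, min)), km$tot.withinss))
```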

for (k in 4:8) {
  set.seed(699)  # same seed for every k, so each run is reproducible
  km <- kmeans(dmatrix, k)
  data[[paste0("km", k)]] <- km$cluster  # adds columns km4, ..., km8
}

 

Hierarchical Cluster

Hierarchical clustering seeks to build a hierarchy of clusters. There are two types:
Agglomerative: a “bottom-up” approach in which each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
Divisive: a “top-down” approach in which all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.
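
The two approaches can be compared side by side. This is a minimal sketch on hypothetical toy data, using hclust() for the agglomerative tree and diana() from the cluster package for the divisive one; diana() is not used elsewhere in this report and appears here only for illustration.

```r
library(cluster)  # diana()

set.seed(699)
# Hypothetical toy data: two tight groups of 2-D points
x <- rbind(matrix(rnorm(20, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(20, mean = 5, sd = 0.3), ncol = 2))
d <- dist(x)

agg <- hclust(d, method = "ward.D2")  # bottom-up: merge the closest clusters
div <- diana(d)                       # top-down: split the largest cluster

# Cut both trees into two groups and cross-tabulate the memberships
agg.k2 <- cutree(agg, k = 2)
div.k2 <- cutree(as.hclust(div), k = 2)
print(table(agg.k2, div.k2))
```

With groups this far apart, both trees should recover the same two-way split; on real data the two methods can disagree.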

set.seed(699)
hc <- hclust(dmatrix, method="ward.D2")  # agglomerative tree built with Ward's criterion
for (k in 4:8) {
  data[[paste0("hc", k)]] <- cutree(hc, k)  # cut the tree into k groups; adds hc4, ..., hc8
}

 

Result

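The number of clusters is chosen as k=4. A two-way table of cluster membership against a categorical variable shows how each class is distributed across the clusters. The example below crosses the k-means clusters with Product, the variable indicating whether the customer acquired the product. Each cluster contains only one class: clusters 1 and 4 hold only customers who acquired the product (144 + 130 = 274), and clusters 2 and 3 hold only those who did not (130 + 196 = 326).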

table(data$km4, data$Product)
##    
##      No Yes
##   1   0 144
##   2 130   0
##   3 196   0
##   4   0 130
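
The same table can also be summarized as row proportions and a single purity score. This is a minimal sketch that re-enters the counts printed above by hand, so it runs without the project data; tab, prop and purity are illustrative names.

```r
# Counts copied from the table above (rows: km4 cluster, columns: Product)
tab <- matrix(c(0, 130, 196, 0,    # column "No"
                144, 0, 0, 130),   # column "Yes"
              nrow = 4,
              dimnames = list(cluster = as.character(1:4),
                              Product = c("No", "Yes")))

# Share of each Product class within each cluster (rows sum to 1)
prop <- prop.table(tab, margin = 1)
print(prop)

# Purity: fraction of observations that belong to the majority class of their cluster
purity <- sum(apply(tab, 1, max)) / sum(tab)
print(purity)
```

The column totals of tab agree with the Product counts in the summary output (326 No, 274 Yes).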