============================================================================================================
About: This document is also available at http://rpubs.com/sherloconan/615149
Source: https://zhuanlan.zhihu.com/p/37856153
In this assignment you will be working with the dataset from your 699 project. You will perform a cluster analysis.
Find the optimal number of clusters (elbow, gap or silhouette methods). [ - 10pts]
Perform the K-Means cluster analysis. [ - 10pts]
Perform the hierarchical analysis. [ - 10pts]
Describe the results. [ - 10pts]
The EDA parts are available at RPubs - part I and RPubs - part II. The modeling part is available at RPubs - part III.
The project dataset is the Bitcoin price as a time series, so the cluster analysis cannot be conducted on it. Hence, the clustering runs on another example dataset. That dataset contains nominal and binary variables alongside numeric ones, so a dissimilarity matrix is computed first and the cluster analysis runs on that matrix.
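# Read text columns as factors (the default before R 4.0.0), as the summary below assumes
data <- read.csv("~/Documents/HU/ANLY 699-90-O/699 R/M2_IFI_Data_Product.csv", stringsAsFactors=TRUE)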
summary(data[-1])
##       Age        Gender         Habitat        Income       Married
##  Min.   :18.00   F:300   City_center:269   Min.   : 50352   No :204
##  1st Qu.:30.00   M:300   Rural      : 96   1st Qu.:138388   Yes:396
##  Median :42.00           Small_town :173   Median :200416
##  Mean   :42.40           Suburban   : 62   Mean   :220366
##  3rd Qu.:55.25                             3rd Qu.:288706
##  Max.   :67.00                             Max.   :505040
##     Children      Car      Savings_Account  Current_Account   Loan
##  Min.   :0.000   No :304   No :186          No :145           No :391
##  1st Qu.:0.000   Yes:296   Yes:414          Yes:455           Yes:209
##  Median :1.000
##  Mean   :1.012
##  3rd Qu.:2.000
##  Max.   :3.000
##  Family_Quotient   Product
##  Min.   : 20283    No :326
##  1st Qu.: 98773    Yes:274
##  Median :155820
##  Mean   :178188
##  3rd Qu.:231578
##  Max.   :492432
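library(cluster) # provides daisy()
# With non-numeric columns present, daisy() uses Gower's dissimilarity
dmatrix <- daisy(data[-1]) # large dissimilarity: 600 * 599 / 2 = 179,700 elements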
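To make the dissimilarity step concrete, here is a minimal sketch on a hypothetical three-row data frame (the `toy` object and its values are made up for illustration and are not part of the project data). It assumes Gower's coefficient, which `daisy()` applies to mixed numeric and factor columns: a numeric column contributes the absolute difference divided by the column range, a factor column contributes 0 for a match and 1 for a mismatch, and the contributions are averaged.

```r
library(cluster)

# Hypothetical toy data: one numeric and one binary (factor) variable
toy <- data.frame(Age     = c(20, 40, 60),
                  Married = factor(c("No", "Yes", "Yes")))

d <- daisy(toy, metric = "gower")

# Hand computation for rows 1 and 2:
#   Age:     |20 - 40| / (60 - 20) = 0.5
#   Married: "No" vs "Yes"         = 1
#   Gower dissimilarity            = (0.5 + 1) / 2 = 0.75
by.hand <- (abs(20 - 40) / (60 - 20) + 1) / 2

# A dissimilarity object stores n * (n - 1) / 2 values (3 here, 179,700 for n = 600)
n.pairs <- length(d)
```

The same counting rule gives the 179,700 elements noted above for the 600 customers.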
Fig. 1 plots the average total within-cluster sum of squares against the number of clusters (k). The elbow method looks at how much the within-cluster variation falls as k grows. The optimal k is chosen at the “elbow”, where the rate of decrease starts to level off.
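avg.totw.ss <- numeric(19) # one entry per k in 2:20
for (k in 2:20) {
  totw.ss <- numeric(15)
  for (trial in 1:15) { # average over 15 random starts
    # kmeans() calls as.matrix() on dmatrix, so each row of the full
    # dissimilarity matrix is treated as that observation's feature vector
    runs <- kmeans(dmatrix, centers=k)
    totw.ss[trial] <- runs$tot.withinss
  }
  avg.totw.ss[k-1] <- mean(totw.ss)
}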
library(ggplot2)
ggplot(data.frame(x=2:20, y=avg.totw.ss), aes(x, y)) +
  geom_line() +
  geom_point(size=2.5) +
  labs(x="Number of Clusters (starting with k=2)", y="Average Total Within Sum of Squares") +
  ggtitle("Fig. 1. Plot of TotW.SS by k in Clustering") +
  theme_classic()
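The assignment also lists the silhouette method as a way to choose k. The sketch below shows one way it could be done on a dissimilarity matrix; it is not what was run for this report. It swaps k-means for PAM (`pam()` from the cluster package), which accepts a dissimilarity object directly, and it uses simulated data (`toy`) as a stand-in for the customer table.

```r
library(cluster)

set.seed(699)
# Hypothetical toy data: two well-separated groups of 20 points in 2-D
toy <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 6), ncol = 2))
d <- dist(toy)

# Average silhouette width for each candidate k
ks <- 2:6
avg.sil <- sapply(ks, function(k) {
  cl <- pam(d, k = k, cluster.only = TRUE)  # cluster labels only
  mean(silhouette(cl, d)[, "sil_width"])
})
names(avg.sil) <- ks

# The k with the largest average silhouette width is the candidate optimum
best.k <- ks[which.max(avg.sil)]
```

With the real data, `d` would be replaced by `dmatrix`; unlike the elbow plot, the criterion is a maximum rather than a visual bend.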
k-means clustering is a method of vector quantization, originally from signal processing. It aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster.
for (k in 4:8) {
  set.seed(699) # same seed for every k, for reproducibility
  km <- kmeans(dmatrix, k)
  data <- data.frame(data, km$cluster) # append the labels as a new column
}
colnames(data)[14:18] <- c("km4","km5","km6","km7","km8")
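The nearest-mean rule in the description of k-means can be checked directly. The following is a small sketch on simulated numeric data (the `toy` matrix is hypothetical): after fitting, every point should carry the label of its closest center, and the sum of the squared distances to those centers should match `tot.withinss`, the quantity averaged for Fig. 1.

```r
set.seed(699)
# Hypothetical toy data: two well-separated groups of 20 points in 2-D
toy <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 6), ncol = 2))

km.toy <- kmeans(toy, centers = 2, nstart = 10)

# Squared Euclidean distance from every point to each fitted center
d2 <- sapply(1:2, function(j) rowSums(sweep(toy, 2, km.toy$centers[j, ])^2))

# Index of the closest center for each point
nearest <- apply(d2, 1, which.min)

# Each point is labeled with its nearest center
labels.match <- all(nearest == km.toy$cluster)

# Total within-cluster sum of squares, recomputed by hand
tot.by.hand <- sum(apply(d2, 1, min))
```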
Hierarchical clustering seeks to build a hierarchy of clusters. There are two types:
Agglomerative: a “bottom-up” approach in which each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
Divisive: a “top-down” approach in which all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.
# hclust() is agglomerative and deterministic, so no seed is needed
hc <- hclust(dmatrix, method="ward.D2")
for (k in 4:8) {
  data <- data.frame(data, cutree(hc, k)) # cut the same tree into k clusters
}
colnames(data)[19:23] <- c("hc4","hc5","hc6","hc7","hc8")
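The two hierarchical approaches can be tried side by side. This is a sketch on simulated data (the `toy` matrix is hypothetical): `hclust()` builds the agglomerative tree as above, while `diana()` from the cluster package is used here as one available divisive implementation. It also illustrates why a single tree can serve all k in 4:8: cuts of the same tree are nested.

```r
library(cluster)  # diana() for the divisive variant

set.seed(699)
# Hypothetical toy data: two well-separated groups of 10 points in 2-D
toy <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
             matrix(rnorm(20, mean = 6), ncol = 2))
d <- dist(toy)

hc.toy <- hclust(d, method = "ward.D2")  # agglomerative (bottom-up)
dv.toy <- as.hclust(diana(d))            # divisive (top-down)

# Cuts of one tree are nested: each of the 3 clusters lies inside one of the 2
c2 <- cutree(hc.toy, k = 2)
c3 <- cutree(hc.toy, k = 3)
nested <- all(rowSums(table(c3, c2) > 0) == 1)

# Cross-tabulate the two approaches at k = 2 (label numbers are arbitrary)
agree <- table(agglomerative = c2, divisive = cutree(dv.toy, k = 2))
```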
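The number of clusters is chosen as k=4. The two-way table below cross-tabulates the k-means cluster labels (km4) against Product, the variable indicating whether the customer acquired the product, to show how each class is distributed across the clusters. Note that Product was itself one of the variables in data[-1] used to compute dmatrix, so the clusters are not independent of it.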
table(data$km4, data$Product)
##
##      No Yes
##   1   0 144
##   2 130   0
##   3 196   0
##   4   0 130
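When the cells are less clear-cut than in the table above, row proportions are easier to read than raw counts. Here is a minimal sketch with made-up label vectors (`cluster` and `product` are hypothetical, not the report's columns), using `prop.table()` on the same kind of two-way table.

```r
# Hypothetical labels for six customers
cluster <- c(1, 1, 2, 2, 3, 3)
product <- c("Yes", "Yes", "No", "No", "No", "Yes")

tab <- table(cluster, product)

# Share of each Product class within each cluster (every row sums to 1)
row.share <- prop.table(tab, margin = 1)
```

On the real data the equivalent call would be `prop.table(table(data$km4, data$Product), margin = 1)`.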