CASE STUDY

A shopping mall wants to segment the customers recorded over the last week. There are four customer variables: gender (1 = Male, 2 = Female), age, annual income (in thousands of dollars), and spending score (1-100).

Loading data

dataMall <- read.csv("data/Mall_Customers.csv")
head(dataMall)
##   Gender Age AnnualIncome SpendingScore
## 1   Male  19           15            39
## 2   Male  21           15            81
## 3 Female  20           16             6
## 4 Female  23           16            77
## 5 Female  31           17            40
## 6 Female  22           17            76

A quick scatter plot analysis

library(ggplot2)
ggplot(dataMall, aes(AnnualIncome, Age, color = Gender)) + geom_point()

library(ggplot2)
ggplot(dataMall, aes(AnnualIncome, SpendingScore, color = Gender)) + geom_point()
Hierarchical Clustering

Hierarchical clustering is based on the connectivity model of clusters. The steps involved in the clustering process are:
1. Start with N clusters, where N is the number of elements (i.e., assign each element to its own cluster). At this point the distances (similarities) between the clusters are equal to the distances (similarities) between the items they contain.
2. Merge the pair of clusters that are closest to each other (the most similar clusters); the first iteration reduces the number of clusters to N - 1.
3. Compute the distances (similarities) again and merge the closest pair.
4. Repeat steps 2 and 3 until all data points end up in one cluster.
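As a toy illustration of this merging behaviour (three made-up one-dimensional points, not the mall data), hclust() with its default complete linkage merges the closest pair first:

x <- c(1, 2, 10)
d <- dist(x)       # pairwise distances: 1, 9, 8
hc <- hclust(d)    # first merge joins points 1 and 2 (distance 1)
hc$merge           # merge order
hc$height          # merge heights: 1, then max(9, 8) = 9 under complete linkage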

# Apply the hierarchical clustering algorithm (on Age and AnnualIncome)

clusters <- hclust(dist(dataMall[, 2:3]))
clusters
## 
## Call:
## hclust(d = dist(dataMall[, 2:3]))
## 
## Cluster method   : complete 
## Distance         : euclidean 
## Number of objects: 200

#Plot the dendrogram

plot(clusters)

Now we can see that there are a number of possible places where we could cut the tree into clusters. We will show the cross-tabulation against gender for two, three, and five clusters.

# Create different numbers of clusters

clusterCut_2 <- cutree(clusters, 2)
#Tabulate the clustering distribution against gender
table(clusterCut_2, dataMall$Gender)
##             
## clusterCut_2 Female Male
##            1     94   70
##            2     18   18

Cluster 1 consists of 70 males and 94 females; cluster 2 consists of 18 males and 18 females.

clusterCut_3 <- cutree(clusters, 3)
#Tabulate the clustering distribution against gender
table(clusterCut_3, dataMall$Gender)
##             
## clusterCut_3 Female Male
##            1     39   25
##            2     55   45
##            3     18   18

Cluster 1 consists of 25 males and 39 females; cluster 2 consists of 45 males and 55 females; cluster 3 consists of 18 males and 18 females.

clusterCut_5 <- cutree(clusters, 5)
#Tabulate the clustering distribution against gender
table(clusterCut_5, dataMall$Gender)
##             
## clusterCut_5 Female Male
##            1     25   16
##            2     14    9
##            3     28   23
##            4     27   22
##            5     18   18
library(ggplot2)
ggplot(dataMall, aes(AnnualIncome, SpendingScore)) +
  geom_point(alpha = 0.4, size = 1.5) +
  geom_point(col = clusterCut_5)  # overlay colored by the five hierarchical clusters

You can see that the genders largely overlap within the clusters created by hclust. In the next section, we apply another clustering algorithm to the same data and see how the results look.

K-MEANS CLUSTERING

CENTROID-BASED ALGORITHM

The aim of the K-means algorithm is to divide M points in N dimensions into K clusters so that the within-cluster sum of squares is minimized. It is not practical to require that the solution has the minimal sum of squares over all possible partitions, except when M and N are small and K = 2. Instead we seek "local" optima: solutions such that no movement of a point from one cluster to another reduces the within-cluster sum of squares.
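Written out, with clusters C_1, ..., C_K and centroids (means) mu_1, ..., mu_K, the quantity being minimized is the total within-cluster sum of squares:

\[
\mathrm{WCSS} = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2 .
\]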

Algorithm: in its simplest form, the algorithm has two steps (a rough sketch of one iteration is shown below):
• Assignment: assign each observation to the cluster that gives the minimum within-cluster sum of squares (WCSS).
• Update: update each centroid by taking the mean of all the observations in its cluster.
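The chunk below is only an illustration of one such iteration on the two columns used later (with three clusters, chosen arbitrarily); it is not the implementation used by kmeans(), which repeats these two steps until the assignments stop changing.

set.seed(1)
X <- as.matrix(dataMall[, 2:3])
centers <- X[sample(nrow(X), 3), ]                    # start from 3 random points
# Assignment step: label each point with its nearest centroid
d <- as.matrix(dist(rbind(centers, X)))[-(1:3), 1:3]  # point-to-centroid distances
labels <- max.col(-d)                                 # column index of the smallest distance
# Update step: move each centroid to the mean of its assigned points
centers <- apply(X, 2, function(col) tapply(col, labels, mean))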

Unlike with hierarchical clustering, to find the optimal value of k (the number of clusters) here we will use an elbow curve. The curve shows the percentage of variance explained as a function of the number of clusters.

#Elbow Curve

wss <- (nrow(dataMall) - 1) * sum(apply(dataMall[, 2:3], 2, var))
for (i in 2:15) {
  wss[i] <- sum(kmeans(dataMall[, 2:3], centers = i)$withinss)
}
plot(1:15, wss, type = "b", xlab = "Number of Clusters",
     ylab = "Within groups sum of squares")

Five clusters

set.seed(917)
#Run k-means clustering on the dataset with five clusters
Cluster_kmean <- kmeans(dataMall[, 2:3], 5, nstart = 20)
#Tabulate the cross distribution
table(Cluster_kmean$cluster,dataMall$Gender)
##    
##     Female Male
##   1     29   29
##   2     13    7
##   3     26   24
##   4     17   14
##   5     27   14
Cluster_kmean$cluster <- factor(Cluster_kmean$cluster)
ggplot(dataMall, aes(SpendingScore, AnnualIncome)) +
  geom_point(alpha = 0.4, size = 1.5) +
  geom_point(col = Cluster_kmean$cluster)  # overlay colored by the five k-means clusters

Density-Based Clustering

Density-based spatial clustering of applications with noise (DBSCAN) is a data clustering algorithm proposed by Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu in 1996. The algorithm takes a parametric approach, with two parameters:
• eps: the radius of the neighborhood around a data point p.
• minPts: the minimum number of data points we want in a neighborhood to define a cluster.

Once these parameters are defined, the algorithm divides the data points into three categories:
• Core points: a point p is a core point if at least minPts points (including p itself) are within distance eps of it.
• Border points: a point q is a border point of p if it is reachable from the core point p, that is, there is a path p1, ..., pn with p1 = p and pn = q where each pi+1 is directly reachable from pi (all the points on the path must be core points, with the possible exception of q).
• Outliers: all points not reachable from any other point are outliers.

The steps in DBSCAN are simple once the definitions above are in place:
1. Pick at random a point that is not yet assigned to a cluster and calculate its neighborhood. If the neighborhood contains at least minPts points, start a cluster around that point; otherwise, mark it as an outlier.
2. Once all the core points have been found, expand the clusters to include the border points.
3. Repeat these steps until every point is assigned either to a cluster or marked as an outlier.

A commonly used rule of thumb for the minimum number of points (minPts) is (number of dimensions) + 1, while the neighborhood radius (eps) is chosen from the point just below the elbow of the kNN distance plot.
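As a sketch of that choice, the dbscan package provides kNNdistplot(); k = 5 is used here only to match the minPts value below, and the horizontal line simply marks the eps = 8 value chosen in the next chunk.

library(dbscan)
# Distance from every point to its 5th nearest neighbor; look for the elbow
kNNdistplot(dataMall[, 2:3], k = 5)
abline(h = 8, lty = 2)  # eps = 8 is the radius used in the clustering below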

library(dbscan)
## Warning: package 'dbscan' was built under R version 4.1.2
cluster_dbscan <- dbscan(dataMall[, 2:3], eps = 8, minPts = 5)
cluster_dbscan
## DBSCAN clustering for 200 objects.
## Parameters: eps = 8, minPts = 5
## The clustering contains 1 cluster(s) and 17 noise points.
## 
##   0   1 
##  17 183 
## 
## Available fields: cluster, eps, minPts
#Display the hull plot
hullplot(dataMall[, 2:3], cluster_dbscan$cluster)

Internal Evaluation

When a clustering result is evaluated based on the data that was clustered, it is called an internal evaluation. These methods usually assign the best score to the algorithm that produces clusters with high similarity within a cluster and low similarity between clusters.

Dunn Index

J. Dunn proposed this index in 1974 in his paper "Well Separated Clusters and Optimal Fuzzy Partitions," published in the Journal of Cybernetics.[7] The Dunn index aims to identify dense and well-separated clusters. It is defined as the ratio of the minimal intercluster distance to the maximal intracluster distance.
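Written as a formula, with d(C_i, C_j) the distance between clusters C_i and C_j and diam(C_k) the largest distance between two points of cluster C_k:

\[
D = \frac{\min_{i \neq j} d(C_i, C_j)}{\max_{k} \operatorname{diam}(C_k)} .
\]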

install.packages("clValid")

library(clValid)
## Warning: package 'clValid' was built under R version 4.1.2
## Loading required package: cluster
#Showing for hierarchical cluster with clusters = 3
dunn(dist(dataMall[,2:3]), clusterCut_3)
## [1] 0.08657642
library(clValid)
#Showing for hierarchical cluster with clusters = 5
dunn(dist(dataMall[,2:4]), clusterCut_5)
## [1] 0.03739185

The Dunn index takes values between zero and infinity and should be maximized; higher values are more desirable. Here the values are quite low, suggesting that these are not good clusters.

Silhouette Coefficient

The silhouette coefficient contrasts the average distance to elements in the same cluster with the average distance to elements in other clusters. Objects with a high silhouette value are considered well clustered; objects with a low value may be outliers.
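For a single observation i, with a(i) the average distance to the other members of its own cluster and b(i) the smallest average distance to the members of any other cluster, the silhouette value is

\[
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}},
\]

which lies between -1 and 1; the plot below shows these values per cluster.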

library(cluster)
#Showing for hierarchical cluster with clusters = 3
sk <- silhouette(clusterCut_3, dist(dataMall[, 2:3]))
plot(sk)

External Evaluation

External evaluation is similar to evaluation on test data: the data used for testing is not used for training the model. The test data is evaluated against labels assigned by experts or some third-party benchmark, and the clustering results on these already labeled items tell us how well the clusters grouped our data. Because the metric depends on external inputs, it is called external evaluation. The method is straightforward if we know what the actual clusters should look like; in practice, when our data is already labeled before clustering, we can perform external evaluation on the same data we used for clustering.

Rand Measure

The Rand index is similar to the classification rate in multi-class classification problems. It measures how many of the items grouped by the clustering and by the expert (labels) agree and how many differ. If we assume the expert labels (or external labels) to be correct, this measures the correct classification rate.
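As a minimal sketch (the helper rand_index() below is our own illustration, not a package function), the Rand index can be computed directly as the fraction of point pairs on which the clustering and the external labels agree, i.e., pairs that are either placed together by both or separated by both:

#Illustrative helper (not from a package): fraction of pairs on which two labelings agree
rand_index <- function(labels_a, labels_b) {
  same_a <- outer(labels_a, labels_a, "==")
  same_b <- outer(labels_b, labels_b, "==")
  agree  <- same_a == same_b
  sum(agree[upper.tri(agree)]) / choose(length(labels_a), 2)
}
#Example: compare the three hierarchical clusters with the labeled Gender column
rand_index(clusterCut_3, as.integer(factor(dataMall$Gender)))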