A shopping mall wants to segment the customers who visited during the past week. The customer data contains four variables: gender (1/Male, 2/Female), age, annual income (in thousands of dollars), and spending score (1-100).
Loading data
dataMall <- read.csv("data/Mall_Customers.csv")
head(dataMall)
## Gender Age AnnualIncome SpendingScore
## 1 Male 19 15 39
## 2 Male 21 15 81
## 3 Female 20 16 6
## 4 Female 23 16 77
## 5 Female 31 17 40
## 6 Female 22 17 76
A quick scatter plot analysis
library(ggplot2)
ggplot(dataMall, aes(AnnualIncome, Age, color = Gender)) + geom_point()
library(ggplot2)
ggplot(dataMall, aes(AnnualIncome, SpendingScore, color = Gender)) + geom_point()
# Apply the hierarchical clustering algorithm
clusters <- hclust(dist(dataMall[, 2:3]))
clusters
##
## Call:
## hclust(d = dist(dataMall[, 2:3]))
##
## Cluster method : complete
## Distance : euclidean
## Number of objects: 200
# Plot the dendrogram
plot(clusters)
Now we can see that there are a number of possible heights at which the dendrogram can be cut to form clusters; a sketch marking one such cut on the dendrogram follows. We will then show the cross-tabulation against Gender for two, three, and five clusters.
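If it helps to see where such a cut would land, base R's rect.hclust() can draw the chosen clusters directly on the dendrogram that was just plotted; the choice of k = 5 below is only an example, not a recommendation.
# Draw boxes around a candidate cut of the dendrogram (k = 5 as an example)
rect.hclust(clusters, k = 5, border = "red")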
# Create different numbers of clusters
clusterCut_2 <- cutree(clusters, 2)
# Tabulate the clustering distribution against the actual gender
table(clusterCut_2, dataMall$Gender)
##
## clusterCut_2 Female Male
## 1 94 70
## 2 18 18
Cluster 1 consists of 70 males and 94 females; cluster 2 consists of 18 males and 18 females.
clusterCut_3 <- cutree(clusters, 3)
# Tabulate the clustering distribution against the actual gender
table(clusterCut_3, dataMall$Gender)
##
## clusterCut_3 Female Male
## 1 39 25
## 2 55 45
## 3 18 18
Cluster 1 consists of 25 males and 39 females, cluster 2 of 45 males and 55 females, and cluster 3 of 18 males and 18 females.
clusterCut_5 <- cutree(clusters, 5)
# Tabulate the clustering distribution against the actual gender
table(clusterCut_5, dataMall$Gender)
##
## clusterCut_5 Female Male
## 1 25 16
## 2 14 9
## 3 28 23
## 4 27 22
## 5 18 18
library(ggplot2)
ggplot(dataMall, aes(AnnualIncome, SpendingScore)) +
  geom_point(alpha = 0.4, size = 1.5) + geom_point(col = clusterCut_5) +
  scale_color_manual(values = c('red', 'black'))
You can see that the genders overlap heavily across the clusters created by hclust. In the next section, we apply another clustering algorithm, k-means, to the same data and see how the results compare.
The k-means algorithm, in its simplest form, has two steps:
• Assignment: Assign each observation to the cluster that gives the minimum within-cluster sum of squares (WCSS).
• Update: Update each centroid by taking the mean of all the observations in the cluster.
A minimal from-scratch sketch of these two steps is shown below.
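This sketch is for illustration only (the analysis below uses base R's kmeans()); the helper name simple_kmeans and the fixed number of iterations are our own choices.
simple_kmeans <- function(x, k, iters = 10) {
  x <- as.matrix(x)
  centers <- x[sample(nrow(x), k), , drop = FALSE]  # random initial centroids
  for (step in seq_len(iters)) {
    # Assignment step: each point joins the cluster whose centroid is nearest,
    # which minimizes the within-cluster sum of squares (WCSS)
    d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    cluster <- apply(d, 1, which.min)
    # Update step: each centroid becomes the mean of its assigned observations
    for (j in seq_len(k)) {
      if (any(cluster == j)) centers[j, ] <- colMeans(x[cluster == j, , drop = FALSE])
    }
  }
  list(cluster = cluster, centers = centers)
}
# e.g. simple_kmeans(dataMall[, 2:3], k = 5)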
Unlike the hierarchical clustering above, to find the optimal value for k (the number of clusters) here we will use an elbow curve. The curve shows the percentage of variance explained as a function of the number of clusters.
# Elbow Curve
wss <- (nrow(dataMall) - 1) * sum(apply(dataMall[, 2:3], 2, var))
for (i in 2:15) {
  wss[i] <- sum(kmeans(dataMall[, 2:3], centers = i)$withinss)
}
plot(1:15, wss, type = "b", xlab = "Number of Clusters",
     ylab = "Within groups sum of squares")
We now run k-means with five clusters.
set.seed(917)
#Run k-means cluster of the dataset
Cluster_kmean <- kmeans(dataMall[, 2:3], 5, nstart = 20)
# Tabulate the cross distribution
table(Cluster_kmean$cluster, dataMall$Gender)
##
## Female Male
## 1 29 29
## 2 13 7
## 3 26 24
## 4 17 14
## 5 27 14
Cluster_kmean$cluster <- factor(Cluster_kmean$cluster)
ggplot(dataMall, aes(SpendingScore, AnnualIncome)) +
  geom_point(alpha = 0.4, size = 1.5) + geom_point(col = Cluster_kmean$cluster) +
  scale_color_manual(values = c('red', 'black'))
Density-based spatial clustering of applications with noise (DBSCAN) is a data clustering algorithm proposed by Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu in 1996. The algorithm takes a parametric approach, with two parameters:
• eps: The radius of the neighborhood around a data point p.
• minPts: The minimum number of data points we want in a neighborhood to define a cluster.
Once these parameters are defined, the algorithm divides the data points into three types:
• Core points: A point p is a core point if at least minPts points are within distance eps (the maximum radius of the neighborhood) of it, including p itself.
• Border points: A point q is a border point if it is reachable from a core point p, i.e. there is a path p1, ..., pn with p1 = p and pn = q, where each pi+1 is directly reachable from pi (all the points on the path must be core points, with the possible exception of q).
• Outliers: All points not reachable from any other point are outliers.
The steps in DBSCAN are simple once these concepts are defined:
1. Pick at random a point that is not yet assigned to a cluster and calculate its neighborhood. If the neighborhood contains at least minPts points, form a cluster around it; otherwise, mark the point as an outlier.
2. Once all the core points are found, expand the clusters to include the border points.
3. Repeat these steps until every point is either assigned to a cluster or marked as an outlier.
A commonly used rule of thumb for the minimum number of points (minPts) is (number of dimensions) + 1. The neighborhood radius (eps) is then chosen from the point just below the elbow of the kNN distance plot, as sketched below.
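The dbscan package provides kNNdistplot() for locating that elbow; in the sketch below, k is set to match the minPts used later, and the horizontal line at 8 simply marks the eps value chosen for the dbscan() call.
library(dbscan)
# Sorted distance of every point to its 5th nearest neighbour;
# the knee of this curve suggests a value for eps
kNNdistplot(dataMall[, 2:3], k = 5)
abline(h = 8, lty = 2)  # candidate eps used in the dbscan() call below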
library(dbscan)
## Warning: package 'dbscan' was built under R version 4.1.2
cluster_dbscan <- dbscan(dataMall[, 2:3], eps = 8, minPts = 5)
cluster_dbscan
## DBSCAN clustering for 200 objects.
## Parameters: eps = 8, minPts = 5
## The clustering contains 1 cluster(s) and 17 noise points.
##
## 0 1
## 17 183
##
## Available fields: cluster, eps, minPts
# Display the hull plot
hullplot(dataMall[, 2:3], cluster_dbscan$cluster)
When a clustering result is evaluated based on the data that was clustered, it is called an internal evaluation. These methods usually assign the best score to the algorithm that produces clusters with high similarity within a cluster and low similarity between clusters.
# Dunn Index
J. Dunn proposed this index in 1974 in his paper "Well Separated Clusters and Optimal Fuzzy Partitions," published in the Journal of Cybernetics.[7] The Dunn index aims to identify dense and well-separated clusters. It is defined as the ratio between the minimal inter-cluster distance and the maximal intra-cluster distance; a from-scratch sketch of that ratio follows.
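This sketch is for intuition only; the helper name dunn_manual is our own, while the clValid package used next provides the implementation we actually rely on.
# From-scratch sketch of the Dunn index: closest pair of points in different
# clusters divided by the widest pair of points within a single cluster
dunn_manual <- function(d, labels) {
  d <- as.matrix(d)
  same <- outer(labels, labels, "==")    # TRUE where two points share a cluster
  inter <- min(d[!same])                 # minimal inter-cluster distance
  intra <- max(d[same & upper.tri(d)])   # maximal intra-cluster distance (diameter)
  inter / intra
}
# e.g. dunn_manual(dist(dataMall[, 2:3]), clusterCut_3)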
install.packages("clValid")
library(clValid)
## Warning: package 'clValid' was built under R version 4.1.2
## Loading required package: cluster
#Showing for hierarchical cluster with clusters = 3
dunn(dist(dataMall[,2:3]), clusterCut_3)
## [1] 0.08657642
library(clValid)
#Showing for hierarchical cluster with clusters = 5
dunn(dist(dataMall[,2:4]), clusterCut_5)
## [1] 0.03739185
The Dunn index takes values between zero and infinity and should be maximized: higher scores are more desirable. Here the values are quite low, suggesting the clusters are not dense and well separated.
# Silhouette Coefficient
The silhouette coefficient contrasts the average distance to elements in the same cluster with the average distance to elements in other clusters: for an observation i, s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the average distance from i to the other members of its own cluster and b(i) is the average distance from i to the members of the nearest other cluster. Objects with a high silhouette value are considered well clustered; objects with a low value may be outliers.
library(cluster)
# Showing for the hierarchical cluster with clusters = 3
sk <- silhouette(clusterCut_3, dist(dataMall[, 2:3]))
plot(sk)
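Besides the plot, the average silhouette width can be extracted numerically; sk is a matrix whose sil_width column holds the per-observation silhouette values.
# Overall average silhouette width (closer to 1 is better)
mean(sk[, "sil_width"])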
## External Evaluation
External evaluations are similar to evaluations done on test data: the data used for testing is not used for training the model. The test data is evaluated against labels assigned by experts or by some third-party benchmark. Clustering results on these already-labeled items then provide a metric for how well the clusters group our data. Because the metric depends on external inputs, it is called external evaluation. The method is straightforward if we know what the actual clusters should look like. In our case, the data already carries a label (Gender) before clustering, and hence we can perform an external evaluation on the same data we used for clustering.
# Rand Measure
The Rand index is similar to the classification rate in multi-class classification problems. It measures how many of the items grouped together by the clustering and by the expert (labeled) partition are common and how many differ. If we assume the expert (external) labels to be correct, this measures the correct classification rate; a from-scratch sketch follows.
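As a sketch of that idea, the Rand index can be computed directly by checking, over all pairs of observations, whether the clustering and the external label (here Gender) agree on putting the pair together; the helper name rand_index is our own.
# From-scratch sketch of the Rand index over all pairs of observations
rand_index <- function(labels, clusters) {
  pairs <- combn(length(labels), 2)              # all pairs of row indices
  same_label   <- labels[pairs[1, ]]   == labels[pairs[2, ]]
  same_cluster <- clusters[pairs[1, ]] == clusters[pairs[2, ]]
  mean(same_label == same_cluster)               # share of pairs on which both agree
}
# e.g. rand_index(dataMall$Gender, clusterCut_3)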