This report describes the use of R for clustering methods, an unsupervised learning approach, applied to a real dataset. It covers preprocessing of the data, a comparison of three different clustering algorithms, and an evaluation metric (the silhouette) for two of them. The clustering methods are k-means, partitioning around medoids (PAM, also known as k-medoids), and hierarchical clustering.
The dataset contains the nutrient content of pizzas per 100 grams. Each id number represents a different pizza, and each letter (A, B, C, D, E, F, G, H, I, J) represents a pizza producer brand.
Brand: Pizza brand
Id: Sample analysed.
Mois: Amount of water per 100 grams in the sample.
Prot: Amount of protein per 100 grams in the sample.
Fat: Amount of fat per 100 grams in the sample.
Ash: Amount of ash per 100 grams in the sample.
Sodium: Amount of sodium per 100 grams in the sample.
Carb: Amount of carbohydrates per 100 grams in the sample.
Cal: Amount of calories per 100 grams in the sample.
Number of Instances: 300
Attribute Characteristics: Real
Number of Attributes: 9
Missing Values: No missing values
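This claim can be verified with a one-line check once the data is loaded (a minimal sketch, not in the original report):
sum(is.na(pizza))  # should return 0 if there are truly no missing values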
In this study, the main variables analyzed are id, mois, prot, fat, ash, sodium, carb and cal. The dataset is imported into R as pizza, and the scaled version of the dataset is saved as pizs. The dataset file is tab-delimited text (*.txt).
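The analyses below also rely on several add-on packages. The original report does not show the setup, so the following sketch infers the packages from the functions used later (the clustertend choice is an assumption based on the $H component printed by hopkins()):
library(factoextra)   # eclust(), fviz_nbclust(), fviz_cluster(), fviz_silhouette()
library(cluster)      # pam(), agnes(), silhouette()
library(clustertend)  # hopkins()
library(purrr)        # map_dbl()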
This section presents descriptive statistics for the unscaled form of the data.
summary(pizza)
## brand id mois prot
## Length:300 Min. :14003 Min. :25.00 Min. : 6.98
## Class :character 1st Qu.:14094 1st Qu.:30.90 1st Qu.: 8.06
## Mode :character Median :24021 Median :43.30 Median :10.44
## Mean :20841 Mean :40.90 Mean :13.37
## 3rd Qu.:24110 3rd Qu.:49.12 3rd Qu.:20.02
## Max. :34045 Max. :57.22 Max. :28.48
## fat ash sodium carb
## Min. : 4.38 Min. :1.170 Min. :0.2500 Min. : 0.510
## 1st Qu.:14.77 1st Qu.:1.450 1st Qu.:0.4500 1st Qu.: 3.467
## Median :17.14 Median :2.225 Median :0.4900 Median :23.245
## Mean :20.23 Mean :2.633 Mean :0.6694 Mean :22.865
## 3rd Qu.:21.43 3rd Qu.:3.592 3rd Qu.:0.7025 3rd Qu.:41.337
## Max. :47.20 Max. :5.430 Max. :1.7900 Max. :48.640
## cal
## Min. :2.180
## 1st Qu.:2.910
## Median :3.215
## Mean :3.271
## 3rd Qu.:3.520
## Max. :5.080
The data preparation is shown step by step below:
1. At the beginning of the analysis, the data is imported as follows:
pizza <- read.delim("pizza.txt", stringsAsFactors = FALSE)
head(pizza)
## brand id mois prot fat ash sodium carb cal
## 1 A 14069 27.82 21.43 44.87 5.11 1.77 0.77 4.93
## 2 A 14053 28.49 21.26 43.89 5.34 1.79 1.02 4.84
## 3 A 14025 28.35 19.99 45.78 5.08 1.63 0.80 4.95
## 4 A 14016 30.55 20.15 43.13 4.79 1.61 1.38 4.74
## 5 A 14005 30.49 21.28 41.65 4.82 1.64 1.76 4.67
## 6 A 14075 31.14 20.23 42.31 4.92 1.65 1.40 4.67
2. Cluster tendency is assessed using Hopkins' statistic:
hopkins(pizza[,2:9], n=nrow(pizza[,2:9])-1)
## $H
## [1] 0.002383577
1-0.002383577
## [1] 0.9976164
With this implementation, an H value close to 0 (equivalently, 1 - H close to 1) indicates strong cluster tendency; the resulting value of about 0.998 shows that the dataset is highly suitable for clustering analysis.
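For context, Hopkins' statistic computed on uniformly random data spanning the same ranges should land near 0.5, indicating no cluster structure. A baseline check of this kind (illustrative; the random_df name is not from the original analysis):
set.seed(123)
random_df <- as.data.frame(apply(pizza[, 2:9], 2,
                                 function(x) runif(length(x), min(x), max(x))))
hopkins(random_df, n = nrow(random_df) - 1)  # expected to be near 0.5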
3. To obtain scaled data, the scale function is used, and the scaled dataset is named pizs.
pizs <- scale(pizza[,2:9])
The scaled dataset then looks as follows:
head(pizs)
## id mois prot fat ash sodium carb
## [1,] -0.9725866 -1.369526 1.252089 2.745255 1.950635 2.971721 -1.225463
## [2,] -0.9748845 -1.299391 1.225669 2.636070 2.131776 3.025723 -1.211598
## [3,] -0.9789058 -1.314046 1.028292 2.846640 1.927007 2.593708 -1.223800
## [4,] -0.9801984 -1.083752 1.053158 2.551397 1.698611 2.539707 -1.191630
## [5,] -0.9817782 -1.090033 1.228777 2.386506 1.722238 2.620709 -1.170554
## [6,] -0.9717249 -1.021991 1.065591 2.460039 1.800996 2.647710 -1.190521
## cal
## [1,] 2.675659
## [2,] 2.530505
## [3,] 2.707915
## [4,] 2.369224
## [5,] 2.256327
## [6,] 2.256327
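scale() standardizes each column to zero mean and unit variance. As an illustrative sanity check (not part of the original analysis), the same transformation can be reproduced manually:
manual <- sweep(as.matrix(pizza[, 2:9]), 2, colMeans(pizza[, 2:9]), "-")
manual <- sweep(manual, 2, apply(pizza[, 2:9], 2, sd), "/")
all.equal(manual, pizs, check.attributes = FALSE)  # should be TRUE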
In this project, k-means clustering is performed using the Euclidean distance metric. First, the optimal number of clusters is determined using the average silhouette method.
fviz_nbclust(pizs, kmeans, method = "silhouette") + theme_classic()
As the plot shows, the best option is 3 clusters; k-means clustering with Euclidean distance therefore continues with 3 clusters below.
wcke<-eclust(pizs, "kmeans", hc_metric="euclidean",k=3)
fviz_cluster(wcke, geom = "point", ellipse.type = "norm", ggtheme = theme_minimal())
The summary of k-means clustering with the Euclidean distance metric is shown below.
summary(wcke)
## Length Class Mode
## cluster 300 -none- numeric
## centers 24 -none- numeric
## totss 1 -none- numeric
## withinss 3 -none- numeric
## tot.withinss 1 -none- numeric
## betweenss 1 -none- numeric
## size 3 -none- numeric
## iter 1 -none- numeric
## ifault 1 -none- numeric
## clust_plot 9 gg list
## silinfo 3 -none- list
## nbclust 1 -none- numeric
## data 2400 -none- numeric
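To relate the clusters back to the original measurement units, one option (not shown in the original report) is to average the unscaled nutrient variables within each cluster:
aggregate(pizza[, 3:9], by = list(cluster = wcke$cluster), FUN = mean)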
The silhouette value is used to check the quality of the clusters. It measures how similar an observation is to its own cluster compared with the other clusters: for observation i, s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the mean distance to the other members of its own cluster and b(i) is the mean distance to the members of the nearest other cluster. The value lies between -1 and 1, and values close to 1 mean the observations are well matched to their clusters. The average silhouette width for this experiment is 0.48, which indicates that the dataset clusters reasonably well.
sile<-silhouette(wcke$cluster, dist(pizs))
fviz_silhouette(sile)
## cluster size ave.sil.width
## 1 1 151 0.32
## 2 2 29 0.76
## 3 3 120 0.61
Here the best clustering quality is obtained in the 2nd cluster (green).
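The 0.48 figure quoted above is the size-weighted mean of the per-cluster values in the table; it can also be read directly off the silhouette object (a sketch):
mean(sile[, "sil_width"])  # overall average silhouette width, about 0.48 here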
Another clustering method used in this study is PAM. It is an adaptation of the k-means algorithm, but it is more robust and less sensitive to outliers because the cluster centers (the medoids) are actual observations from the data. The distance metric is again Euclidean in this analysis.
First, the optimal number of clusters is determined using the average silhouette method.
fviz_nbclust(pizs, pam, method = "silhouette") + theme_classic()
The method suggests 10 clusters; however, the number of clusters is set to 4 in order to obtain a more interpretable clustering:
pam.res <- eclust(pizs, "pam", k = 4, hc_metric = "euclidean")  # fit PAM with k = 4
fviz_cluster(pam.res, geom = "point", ellipse.type = "norm", ggtheme = theme_minimal())
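Since PAM medoids are actual observations, their row indices (stored in id.med) can be used to inspect the prototype pizzas in the original data (a sketch):
pizza[pam.res$id.med, ]  # the four rows chosen as cluster medoids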
For comparison, the stand-alone pam() function from the cluster package is also fitted, here with k = 3:
pizs.pam <- pam(pizs, 3)
pizs.pam
## Medoids:
## ID id mois prot fat ash sodium
## [1,] 23 0.4732154 -1.0492077 1.1199866 2.45446807 1.8324986 2.5937084
## [2,] 107 0.4579919 0.7439488 1.2334394 0.09141019 1.0843043 0.1636256
## [3,] 139 0.4679016 -0.4410209 -0.8662149 -0.60825993 -0.9240069 -0.5653993
## carb cal
## [1,] -1.1949583 2.27245510
## [2,] -0.9564632 -0.48545705
## [3,] 0.9104540 -0.09838166
## Clustering vector:
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2
## [38] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [75] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [112] 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [149] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [186] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [223] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3
## [260] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [297] 3 3 3 3
## Objective function:
## build swap
## 1.754077 1.620609
##
## Available components:
## [1] "medoids" "id.med" "clustering" "objective" "isolation"
## [6] "clusinfo" "silinfo" "diss" "call" "data"
pam.res$medoids
## id mois prot fat ash sodium
## [1,] 0.4732154 -1.0492077 1.1199866 2.45446807 1.8324986 2.5937084
## [2,] 0.4579919 0.7439488 1.2334394 0.09141019 1.0843043 0.1636256
## [3,] 0.4559813 -0.7330761 -0.8724315 -0.48904862 -1.0500186 -0.6734030
## [4,] 0.4625877 0.8036160 -0.5616019 -0.47010851 -0.2624456 -0.1873864
## carb cal
## [1,] -1.19495831 2.2724551
## [2,] -0.95646323 -0.4854571
## [3,] 1.01694485 0.1757967
## [4,] 0.02691297 -0.8080199
pam.res$clusinfo
## size max_diss av_diss diameter separation
## [1,] 29 1.710058 1.124369 3.076119 3.362456
## [2,] 90 2.457287 1.619764 4.343469 1.725482
## [3,] 120 2.417467 1.235228 4.066620 1.215986
## [4,] 61 2.069652 1.227803 3.479344 1.215986
Looking at the structure of the 4 clusters, cluster 3 contains the most observations (120) and cluster 1 the fewest (29).
The average silhouette width is 0.47 in the PAM clustering analysis with the Euclidean distance metric; the silhouette results of k-means and PAM are thus only slightly different.
sile<-silhouette(pam.res$cluster, dist(pizs))
fviz_silhouette(sile)
## cluster size ave.sil.width
## 1 1 29 0.73
## 2 2 90 0.36
## 3 3 120 0.50
## 4 4 61 0.45
Here the best clustering quality is obtained in the 1st cluster (0.73), while the 2nd cluster has the worst quality among the clusters, with a silhouette width of 0.36.
For hierarchical clustering, several linkage methods are first compared via their agglomerative coefficients; the distance matrix is then computed with the Euclidean metric and the analysis proceeds with 3 clusters, as follows:
m <- c( "average", "single", "complete", "ward")
names(m) <- c( "average", "single", "complete", "ward")
ac <- function(x) {
  agnes(pizs, method = x)$ac
}
map_dbl(m, ac)
## average single complete ward
## 0.9609417 0.9365627 0.9704849 0.9938937
Ward gives the largest agglomerative coefficient here (0.9938937), so ward.D2 is chosen as the linkage method for hierarchical clustering.
d <- dist(pizs, method = "euclidean")
res.hc <- hclust(d, method = "ward.D2")
grp <- cutree(res.hc, k = 3)
plot(res.hc, cex = 0.6, labels = pizza$id)  # full dendrogram
plot(res.hc, labels = pizza$id, main = 'Hclust Dendrogram')
rect.hclust(res.hc, k = 3, border = 2:5)  # highlight the 3 clusters
After examining the dendrogram, the optimal cut point is determined at a height of about 20, which corresponds to 3 clusters in this experiment; 2 of these clusters consist mainly of outliers.
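As a quick cross-method check (not in the original report), the hierarchical cluster assignments can be compared with the k-means assignments in a contingency table:
table(hclust = grp, kmeans = wcke$cluster)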
In conclusion, 3 clustering methods (k-means, PAM and hierarchical clustering) with the Euclidean distance metric are analyzed in this report. The analysis starts with descriptive statistics and scaling of the data. Cluster tendency is measured using Hopkins' statistic, and the optimal number of clusters is found using the average silhouette method for the k-means and PAM algorithms. For this dataset, k-means and PAM with the Euclidean metric give only slightly different clusterings and silhouette widths, although k-means has the better average silhouette width (0.48 versus 0.47). The three algorithms give different clustering results, and for future work k-means appears to give the more accurate clustering solution for this dataset. In addition, 3 clusters are obtained in hierarchical clustering with the Euclidean distance metric; the cluster proportions are reasonable and the dendrogram shape supports cutting the tree at 3 clusters.
1. The dataset is obtained from data.world: https://data.world/sdhilip/pizza-datasets