INTRODUCTION

This report describes the use of R for clustering methods, an unsupervised learning approach, on a real dataset. It mainly covers preprocessing of the data, a comparison of three different clustering algorithms, and an evaluation metric (the silhouette) for two of them. The clustering methods are, respectively, k-means, partitioning around medoids (PAM, also known as k-medoids), and hierarchical clustering.

DATA

Description of the Dataset

This dataset contains pizza nutrient contents per 100 grams. Every id number represents a different pizza sample, and every letter (A, B, C, D, E, F, G, H, I, J) represents a pizza producer brand.

  • Brand: Pizza brand

  • Id: Sample analysed.

  • Mois: Amount of water per 100 grams in the sample.

  • Prot: Amount of protein per 100 grams in the sample.

  • Fat: Amount of fat per 100 grams in the sample.

  • Ash: Amount of ash per 100 grams in the sample.

  • Sodium: Amount of sodium per 100 grams in the sample.

  • Carb: Amount of carbohydrates per 100 grams in the sample.

  • Cal: Amount of calories per 100 grams in the sample.

Number of Instances: 300

Attribute Characteristics: Real (brand is categorical and id is an integer identifier)

Number of Attributes: 9

Missing Values: No missing values

In this study, the main nutrient variables are analyzed: id, mois, prot, fat, ash, sodium, carb, and cal. The dataset is imported into R as pizza, and the scaled version of the dataset is saved as pizs. The source file is tab-delimited text (*.txt).

Descriptive Statistics

This section includes descriptive statistics for the unscaled form of the data.

summary(pizza)
##     brand                 id             mois            prot      
##  Length:300         Min.   :14003   Min.   :25.00   Min.   : 6.98  
##  Class :character   1st Qu.:14094   1st Qu.:30.90   1st Qu.: 8.06  
##  Mode  :character   Median :24021   Median :43.30   Median :10.44  
##                     Mean   :20841   Mean   :40.90   Mean   :13.37  
##                     3rd Qu.:24110   3rd Qu.:49.12   3rd Qu.:20.02  
##                     Max.   :34045   Max.   :57.22   Max.   :28.48  
##       fat             ash            sodium            carb       
##  Min.   : 4.38   Min.   :1.170   Min.   :0.2500   Min.   : 0.510  
##  1st Qu.:14.77   1st Qu.:1.450   1st Qu.:0.4500   1st Qu.: 3.467  
##  Median :17.14   Median :2.225   Median :0.4900   Median :23.245  
##  Mean   :20.23   Mean   :2.633   Mean   :0.6694   Mean   :22.865  
##  3rd Qu.:21.43   3rd Qu.:3.592   3rd Qu.:0.7025   3rd Qu.:41.337  
##  Max.   :47.20   Max.   :5.430   Max.   :1.7900   Max.   :48.640  
##       cal       
##  Min.   :2.180  
##  1st Qu.:2.910  
##  Median :3.215  
##  Mean   :3.271  
##  3rd Qu.:3.520  
##  Max.   :5.080

Data Preparation

The data preparation is shown below step by step:
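Before these steps, the required packages must be loaded. They are not shown in the original report, so the following block is an assumption inferred from the function calls used later (hopkins() returning a list with $H points to the clustertend package; the fviz_* and eclust functions come from factoextra):

library(clustertend)  # hopkins()
library(factoextra)   # eclust(), fviz_nbclust(), fviz_cluster(), fviz_silhouette()
library(cluster)      # pam(), agnes(), silhouette()
library(purrr)        # map_dbl()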

1. At the beginning of the analysis, the data is imported as follows:

pizza <- read.delim("pizza.txt", stringsAsFactors = FALSE)
head(pizza)
##   brand    id  mois  prot   fat  ash sodium carb  cal
## 1     A 14069 27.82 21.43 44.87 5.11   1.77 0.77 4.93
## 2     A 14053 28.49 21.26 43.89 5.34   1.79 1.02 4.84
## 3     A 14025 28.35 19.99 45.78 5.08   1.63 0.80 4.95
## 4     A 14016 30.55 20.15 43.13 4.79   1.61 1.38 4.74
## 5     A 14005 30.49 21.28 41.65 4.82   1.64 1.76 4.67
## 6     A 14075 31.14 20.23 42.31 4.92   1.65 1.40 4.67

2. Cluster tendency has been calculated using the Hopkins statistic:

hopkins(pizza[,2:9], n=nrow(pizza[,2:9])-1)
## $H
## [1] 0.002383577
1-0.002383577
## [1] 0.9976164

The result 1 - H ≈ 0.998 shows that the dataset is highly suitable for clustering analysis (this hopkins() implementation returns values near 0 for clusterable data, so values of 1 - H near 1 indicate strong cluster structure).
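As an optional complementary check, not part of the original pipeline, the ordered dissimilarity matrix (VAT image) can be visualized with factoextra; dark blocks along the diagonal suggest clusterable structure:

fviz_dist(dist(scale(pizza[,2:9])), show_labels = FALSE)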

3. To obtain scaled data, the scale function is used, and the scaled dataset is named pizs.

pizs <- scale(pizza[,2:9])
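scale() centers each column to mean 0 and rescales it to unit standard deviation; a quick sanity check (a sketch, output omitted):

round(colMeans(pizs), 3)      # column means should all be ~0
round(apply(pizs, 2, sd), 3)  # column standard deviations should all be 1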

After scaling, the dataset looks as shown below:

head(pizs)
##              id      mois     prot      fat      ash   sodium      carb
## [1,] -0.9725866 -1.369526 1.252089 2.745255 1.950635 2.971721 -1.225463
## [2,] -0.9748845 -1.299391 1.225669 2.636070 2.131776 3.025723 -1.211598
## [3,] -0.9789058 -1.314046 1.028292 2.846640 1.927007 2.593708 -1.223800
## [4,] -0.9801984 -1.083752 1.053158 2.551397 1.698611 2.539707 -1.191630
## [5,] -0.9817782 -1.090033 1.228777 2.386506 1.722238 2.620709 -1.170554
## [6,] -0.9717249 -1.021991 1.065591 2.460039 1.800996 2.647710 -1.190521
##           cal
## [1,] 2.675659
## [2,] 2.530505
## [3,] 2.707915
## [4,] 2.369224
## [5,] 2.256327
## [6,] 2.256327

CLUSTERING ANALYSIS

K-means

In this project, k-means clustering is performed using the Euclidean distance metric. First, the optimal number of clusters is determined with the average silhouette method (method = "silhouette" in fviz_nbclust).

fviz_nbclust(pizs, kmeans, method = "silhouette") + theme_classic()

As the plot shows, the best option is 3 clusters; k-means clustering with the Euclidean distance therefore continues with k = 3 below.

wcke<-eclust(pizs, "kmeans", hc_metric="euclidean",k=3)

fviz_cluster(wcke, geom = "point", ellipse.type = "norm", ggtheme = theme_minimal())

The summary of k-means clustering with the Euclidean distance metric is shown below.

summary(wcke)
##              Length Class  Mode   
## cluster       300   -none- numeric
## centers        24   -none- numeric
## totss           1   -none- numeric
## withinss        3   -none- numeric
## tot.withinss    1   -none- numeric
## betweenss       1   -none- numeric
## size            3   -none- numeric
## iter            1   -none- numeric
## ifault          1   -none- numeric
## clust_plot      9   gg     list   
## silinfo         3   -none- list   
## nbclust         1   -none- numeric
## data         2400   -none- numeric

Silhouette

The silhouette value is used to check the quality of the clusters. It measures how similar an object is to its own cluster compared with how dissimilar it is to the other clusters, and it takes values between -1 and 1; a value close to 1 means the observations in a cluster are well fitted, so results closer to 1 imply higher clustering quality. The average silhouette width for this experiment is 0.48, which indicates that the clustering fits the dataset reasonably well.
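For a single observation i, the silhouette value is defined as s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the mean distance from i to the other members of its own cluster and b(i) is the smallest mean distance from i to the members of any other cluster. A minimal sketch of this computation for the first observation (dmat, cl, a, and b are illustrative names):

dmat <- as.matrix(dist(pizs))
cl <- wcke$cluster
i <- 1
a <- mean(dmat[i, cl == cl[i] & seq_along(cl) != i])   # mean within-cluster distance
b <- min(sapply(setdiff(unique(cl), cl[i]),
                function(k) mean(dmat[i, cl == k])))   # nearest other cluster
(b - a) / max(a, b)  # should match silhouette(cl, dist(pizs))[i, "sil_width"]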

sile<-silhouette(wcke$cluster, dist(pizs))

fviz_silhouette(sile)
##   cluster size ave.sil.width
## 1       1  151          0.32
## 2       2   29          0.76
## 3       3  120          0.61

Here the best clustering result is obtained for the 2nd cluster (green), with an average silhouette width of 0.76.

Partitioning Around Medoids (PAM)

Another clustering method used in this study is PAM, an adaptation of the k-means algorithm that is more robust and less sensitive to outliers. In PAM, the cluster centers (medoids) are selected observations from the data. The distance metric is again Euclidean in this analysis.
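Unlike k-means centroids, which are averages of cluster members, PAM medoids are actual rows of the data. A minimal sketch illustrating this property (pm is an illustrative name):

pm <- pam(pizs, k = 4)
pm$id.med                             # row indices of the medoids in pizs
all(pm$medoids == pizs[pm$id.med, ])  # TRUE: each medoid is an observation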

First, the optimal number of clusters is determined with the average silhouette method.

fviz_nbclust(pizs, pam, method = "silhouette") + theme_classic()

The method suggests 10 clusters; however, the number of clusters is set to 4 in order to obtain a more interpretable clustering:

pam.res <- eclust(pizs, "pam", k = 4, hc_metric="euclidean") #plotting of clusters 

fviz_cluster(pam.res, geom = "point", ellipse.type = "norm", ggtheme = theme_minimal())

# a separate PAM run with k = 3, shown for comparison with the 4-cluster solution
pizs.pam <- pam(pizs, 3)
pizs.pam
## Medoids:
##       ID        id       mois       prot         fat        ash     sodium
## [1,]  23 0.4732154 -1.0492077  1.1199866  2.45446807  1.8324986  2.5937084
## [2,] 107 0.4579919  0.7439488  1.2334394  0.09141019  1.0843043  0.1636256
## [3,] 139 0.4679016 -0.4410209 -0.8662149 -0.60825993 -0.9240069 -0.5653993
##            carb         cal
## [1,] -1.1949583  2.27245510
## [2,] -0.9564632 -0.48545705
## [3,]  0.9104540 -0.09838166
## Clustering vector:
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2
##  [38] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [75] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [112] 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [149] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [186] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [223] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3
## [260] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [297] 3 3 3 3
## Objective function:
##    build     swap 
## 1.754077 1.620609 
## 
## Available components:
##  [1] "medoids"    "id.med"     "clustering" "objective"  "isolation" 
##  [6] "clusinfo"   "silinfo"    "diss"       "call"       "data"
pam.res$medoids
##             id       mois       prot         fat        ash     sodium
## [1,] 0.4732154 -1.0492077  1.1199866  2.45446807  1.8324986  2.5937084
## [2,] 0.4579919  0.7439488  1.2334394  0.09141019  1.0843043  0.1636256
## [3,] 0.4559813 -0.7330761 -0.8724315 -0.48904862 -1.0500186 -0.6734030
## [4,] 0.4625877  0.8036160 -0.5616019 -0.47010851 -0.2624456 -0.1873864
##             carb        cal
## [1,] -1.19495831  2.2724551
## [2,] -0.95646323 -0.4854571
## [3,]  1.01694485  0.1757967
## [4,]  0.02691297 -0.8080199
pam.res$clusinfo
##      size max_diss  av_diss diameter separation
## [1,]   29 1.710058 1.124369 3.076119   3.362456
## [2,]   90 2.457287 1.619764 4.343469   1.725482
## [3,]  120 2.417467 1.235228 4.066620   1.215986
## [4,]   61 2.069652 1.227803 3.479344   1.215986

Looking at the structure of the 4 clusters, cluster 3 has the most observations (120) and cluster 1 the fewest (29).

Silhouette

The average silhouette width is 0.47 for PAM clustering with the Euclidean distance metric, so there is only a slight difference between k-means and PAM in terms of the silhouette result.

sile<-silhouette(pam.res$cluster, dist(pizs))
fviz_silhouette(sile)
##   cluster size ave.sil.width
## 1       1   29          0.73
## 2       2   90          0.36
## 3       3  120          0.50
## 4       4   61          0.45

Here the best clustering result is obtained for the 1st cluster, with an average silhouette width of 0.73, while the 2nd cluster has the worst quality among the clusters at 0.36.

Hierarchical Clustering Analysis

In the first step, the agglomerative coefficients of several linkage methods are compared; the distance matrix is then computed with the Euclidean metric and hierarchical clustering is carried out with 3 clusters, as follows:

m <- c( "average", "single", "complete", "ward")
names(m) <- c( "average", "single", "complete", "ward")
ac <- function(x) {
 + agnes(pizs, method = x)$ac}

map_dbl(m, ac)
##   average    single  complete      ward 
## 0.9609417 0.9365627 0.9704849 0.9938937

Ward has the largest agglomerative coefficient here (0.9938937), so ward.D2 is chosen as the method for hierarchical clustering.

d <- dist(pizs, method = "euclidean")

res.hc <- hclust(d, method = "ward.D2")

grp <- cutree(res.hc, k = 3)

plot(res.hc, cex = 0.6, labels = pizza$id)

plot(res.hc, labels = pizza$id, main = "Hclust Dendrogram")

rect.hclust(res.hc, k = 3, border = 2:5)  # highlight the 3 clusters on the dendrogram

After analyzing the dendrogram, the optimal cut point is determined at a height of about 20, which corresponds to 3 clusters in this experiment; 2 of these clusters consist mainly of outliers.
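As a quick check (a sketch; output not shown in the original report), the sizes of the three clusters and their relation to the producer brands can be tabulated:

table(grp)               # sizes of the 3 clusters from cutree
table(grp, pizza$brand)  # cross-tabulation of clusters versus brands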

CONCLUSION

In conclusion, 3 clustering methods (k-means, PAM, and hierarchical clustering) with the Euclidean distance metric are analyzed in this report. The analysis starts with descriptive statistics and scaling of the data. Cluster tendency is measured with the Hopkins statistic, and the optimal number of clusters is found with the average silhouette method for the k-means and PAM algorithms. For this dataset there is no dramatic difference between k-means and PAM: with the Euclidean distance metric they give only slightly different silhouette and clustering results, although k-means achieves the better average silhouette width (0.48 versus 0.47). Nevertheless, the 3 algorithms produce different cluster assignments, and for future study k-means gave the more accurate clustering solution of the two partitioning methods. In addition, 3 clusters are obtained with hierarchical clustering using the Euclidean distance metric; the cluster proportions are good, and the dendrogram shape also lends itself to cutting the tree into 3 clusters.

REFERENCES

1. Pizza datasets. data.world. https://data.world/sdhilip/pizza-datasets