Unsupervised Learning - Clustering

Daria Ivanushenko

Introduction

This paper describes the strength of fragrance of 20 different perfumes. The data were obtained with a handheld odor meter (OMX-GR sensor), one measurement per second over a 28-second period. It is interesting to compare perfumes of famous brands and to see the differences between them. Below I provide some technical explanation of how the odor meter works and why it was chosen for this experiment.

An odor meter (olfactometer) is a device used to detect and measure odor dilution. It indicates the relative strength and the odor classification numerically by comparing the odor gas with purified air.

Review of the Data set

setwd("C:/Users/daria/OneDrive/Desktop/datasets")
# loading the data; the working directory was set above
library(readxl)
perfume_data <- read_excel("perfume_data.xlsx")

The data set used for this project was taken from a dataset repository. The data consist of the names of the 20 perfumes and 28 time points (T1-T28) at which each fragrance was measured, one measurement per second. Below you can find the first six rows of the data, the dimensions of the data frame, and a summary of each of the variables.

head(perfume_data)
## # A tibble: 6 x 29
##   Name      T1    T2    T3    T4    T5    T6    T7    T8    T9   T10   T11   T12
##   <chr>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ajayeb 64558 64556 64543 64543 64541 64543 64543 64541 64541 64541 64541 64541
## 2 ajmal  60502 60489 61485 60487 61485 61513 60515 60500 60500 60487 60500 61526
## 3 amreaj 57040 57040 57040 58041 58041 58041 58041 57042 57042 58043 58043 58043
## 4 aood   71083 72087 71091 71095 71099 72103 71099 72099 72099 73098 72094 73094
## 5 asgar~ 68209 68209 68216 68216 68223 68223 68223 68223 68230 68230 68230 68230
## 6 bukho~ 71046 71046 71046 71046 71046 71046 71046 71046 71046 70046 69046 70046
## # ... with 16 more variables: T13 <dbl>, T14 <dbl>, T15 <dbl>, T16 <dbl>,
## #   T17 <dbl>, T18 <dbl>, T19 <dbl>, T20 <dbl>, T21 <dbl>, T22 <dbl>,
## #   T23 <dbl>, T24 <dbl>, T25 <dbl>, T26 <dbl>, T27 <dbl>, T28 <dbl>
summary(perfume_data)
##      Name                 T1              T2              T3       
##  Length:20          Min.   :46014   Min.   :46014   Min.   :46014  
##  Class :character   1st Qu.:63057   1st Qu.:63054   1st Qu.:63804  
##  Mode  :character   Median :67665   Median :67703   Median :67710  
##                     Mean   :67879   Mean   :67978   Mean   :67978  
##                     3rd Qu.:71692   3rd Qu.:72021   3rd Qu.:71680  
##                     Max.   :85056   Max.   :85056   Max.   :85056  
##        T4              T5              T6              T7       
##  Min.   :46014   Min.   :46014   Min.   :46015   Min.   :46015  
##  1st Qu.:63801   1st Qu.:63049   1st Qu.:63799   1st Qu.:63907  
##  Median :67674   Median :67671   Median :67714   Median :67717  
##  Mean   :67927   Mean   :67977   Mean   :68077   Mean   :68030  
##  3rd Qu.:71211   3rd Qu.:71670   3rd Qu.:72025   3rd Qu.:72143  
##  Max.   :85056   Max.   :85056   Max.   :85056   Max.   :85056  
##        T8              T9             T10             T11       
##  Min.   :46015   Min.   :46015   Min.   :46015   Min.   :46015  
##  1st Qu.:63796   1st Qu.:64156   1st Qu.:63046   1st Qu.:63046  
##  Median :67674   Median :67674   Median :67674   Median :67717  
##  Mean   :67929   Mean   :67979   Mean   :67929   Mean   :67776  
##  3rd Qu.:72024   3rd Qu.:72024   3rd Qu.:72274   3rd Qu.:72023  
##  Max.   :85056   Max.   :85056   Max.   :85056   Max.   :85056  
##       T12             T13             T14             T15       
##  Min.   :46015   Min.   :46015   Min.   :46015   Min.   :46015  
##  1st Qu.:63046   1st Qu.:63549   1st Qu.:63043   1st Qu.:64146  
##  Median :67497   Median :67714   Median :67714   Median :67492  
##  Mean   :68029   Mean   :67979   Mean   :67878   Mean   :67927  
##  3rd Qu.:72715   3rd Qu.:72215   3rd Qu.:72215   3rd Qu.:72212  
##  Max.   :85056   Max.   :85056   Max.   :85056   Max.   :85056  
##       T16             T17             T18             T19       
##  Min.   :46015   Min.   :46015   Min.   :46015   Min.   :46015  
##  1st Qu.:63397   1st Qu.:63397   1st Qu.:64156   1st Qu.:64156  
##  Median :67673   Median :67673   Median :67473   Median :67476  
##  Mean   :67924   Mean   :67971   Mean   :68071   Mean   :68070  
##  3rd Qu.:72214   3rd Qu.:72214   3rd Qu.:72211   3rd Qu.:72211  
##  Max.   :85056   Max.   :85056   Max.   :85056   Max.   :85056  
##       T20             T21             T22             T23       
##  Min.   :46015   Min.   :46015   Min.   :46015   Min.   :46015  
##  1st Qu.:64156   1st Qu.:64156   1st Qu.:64156   1st Qu.:64156  
##  Median :67480   Median :67214   Median :66968   Median :67720  
##  Mean   :67969   Mean   :68068   Mean   :68070   Mean   :68071  
##  3rd Qu.:71460   3rd Qu.:72138   3rd Qu.:72217   3rd Qu.:72138  
##  Max.   :85056   Max.   :85056   Max.   :85056   Max.   :85056  
##       T24             T25             T26             T27       
##  Min.   :46015   Min.   :46015   Min.   :46015   Min.   :46015  
##  1st Qu.:63541   1st Qu.:63039   1st Qu.:63789   1st Qu.:63039  
##  Median :67419   Median :67970   Median :67462   Median :67461  
##  Mean   :67819   Mean   :67968   Mean   :68068   Mean   :67916  
##  3rd Qu.:71463   3rd Qu.:71463   3rd Qu.:72216   3rd Qu.:72216  
##  Max.   :85056   Max.   :85056   Max.   :85056   Max.   :85056  
##       T28       
##  Min.   :46015  
##  1st Qu.:62789  
##  Median :67724  
##  Mean   :67917  
##  3rd Qu.:72213  
##  Max.   :85056
dim(perfume_data)
## [1] 20 29

Preparation of Data

In order to perform the clustering analysis we need to remove the first column of our data set, as it contains string values.

data = perfume_data[2:29]
names = perfume_data[,1]
# Now we have 28 columns
dim(data)
## [1] 20 28
head(data)
## # A tibble: 6 x 28
##      T1    T2    T3    T4    T5    T6    T7    T8    T9   T10   T11   T12   T13
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 64558 64556 64543 64543 64541 64543 64543 64541 64541 64541 64541 64541 64541
## 2 60502 60489 61485 60487 61485 61513 60515 60500 60500 60487 60500 61526 60528
## 3 57040 57040 57040 58041 58041 58041 58041 57042 57042 58043 58043 58043 58043
## 4 71083 72087 71091 71095 71099 72103 71099 72099 72099 73098 72094 73094 72091
## 5 68209 68209 68216 68216 68223 68223 68223 68223 68230 68230 68230 68230 68230
## 6 71046 71046 71046 71046 71046 71046 71046 71046 71046 70046 69046 70046 70046
## # ... with 15 more variables: T14 <dbl>, T15 <dbl>, T16 <dbl>, T17 <dbl>,
## #   T18 <dbl>, T19 <dbl>, T20 <dbl>, T21 <dbl>, T22 <dbl>, T23 <dbl>,
## #   T24 <dbl>, T25 <dbl>, T26 <dbl>, T27 <dbl>, T28 <dbl>

Clustering tendency

Before starting the cluster analysis I would like to conduct a Hopkins test to assess the clustering tendency of the data set. The null hypothesis states that the data are uniformly randomly distributed, so no clusters can be defined. The data are highly clusterable when the Hopkins statistic is close to 1.

library(factoextra)
get_clust_tendency(data, 2, graph=TRUE, gradient=list(low="red", mid="white", high="blue"))
## $hopkins_stat
## [1] 0.8088706
## 
## $plot

The value of 0.8088706, being close to 1, shows that the perfume data set is highly clusterable.

Due to the small sample the dissimilarity plot is not very clear, but blocks of colour are visible and the data appear ordered, which is a sign that clusters can be found in the data set.
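
For intuition, the Hopkins statistic can also be reproduced by hand: it compares nearest-neighbour distances of uniformly generated points with those of sampled real observations. The sketch below is illustrative only and assumes the factoextra orientation, where values near 1 indicate clusterable data; get_clust_tendency() performs an equivalent computation internally.

# a minimal by-hand Hopkins statistic (illustrative sketch)
set.seed(123)
X <- as.matrix(data)
m <- 5 # number of sampled points; must be smaller than nrow(X)
# nearest-real-neighbour distances for m uniform random points
lo <- apply(X, 2, min); hi <- apply(X, 2, max)
U <- sapply(seq_along(lo), function(j) runif(m, lo[j], hi[j]))
u_d <- apply(U, 1, function(p) min(sqrt(colSums((t(X) - p)^2))))
# nearest-neighbour distances for m randomly sampled real observations
idx <- sample(nrow(X), m)
w_d <- sapply(idx, function(i) min(sqrt(colSums((t(X[-i, ]) - X[i, ])^2))))
sum(u_d) / (sum(u_d) + sum(w_d)) # close to 1 => clusterable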

Optimal number of clusters

In this chapter I show the results of the silhouette method for identifying the optimal number of clusters. I chose the silhouette method because it is the one most often used by other researchers.

The optimal number of clusters is the one with the highest average silhouette statistic:

  • Optimal number of clusters for k-means method: 3
  • Optimal number of clusters for pam method: 2
  • Optimal number of clusters for hierarchical methods: 5
library(FunCluster)
## Warning: package 'cluster' was built under R version 4.0.4
library(cluster)
library(NbClust)
library(gridExtra)
k1 = fviz_nbclust(data, FUNcluster = kmeans, method = "silhouette") + theme_minimal()
k1

k2 = fviz_nbclust(data, FUNcluster = cluster::pam, method = "silhouette") + theme_minimal()
k2

k3 = fviz_nbclust(data, FUNcluster = hcut, method = "silhouette") + theme_minimal()
k3
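
For intuition, the silhouette curves above can be reproduced directly with cluster::silhouette(). The sketch below does it for k-means only and is illustrative, assuming the default Euclidean distance.

# average silhouette width for k-means with k = 2..10 (a sketch)
set.seed(123)
d_eu <- dist(data)
avg_sil <- sapply(2:10, function(k) {
  km <- kmeans(data, centers = k, nstart = 25)
  mean(silhouette(km$cluster, d_eu)[, "sil_width"])
})
names(avg_sil) <- 2:10
avg_sil # the k with the largest value is the suggested number of clusters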

K-Means Clustering

library(gridExtra)
km1 = eclust(data, "kmeans", hc_metric = "euclidean", k = 3, graph = FALSE)
a1 = fviz_silhouette(km1)
##   cluster size ave.sil.width
## 1       1    3          0.72
## 2       2    4          0.31
## 3       3   13          0.54
b1 = fviz_cluster(km1, data = data, ellipse.type = "convex", main = "K-means/3 clusters/all data") + theme_minimal()
grid.arrange(a1, b1, ncol=2)

To sum up, we divided our data into 3 clusters: the green cluster contains fragrances with the least perceptible odor, the odor of the fragrances in the red cluster is perceived as very strong, and the rest fall into the blue cluster.

The graphs above show that the average silhouette statistic in cluster 2 is lower than in the other two clusters. An average silhouette of 0.31 for cluster 2 means that its elements are poorly matched to their own cluster, which can indicate that the clustering configuration has too few or too many clusters.
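
To see which perfumes drag the average down, the per-observation silhouette widths can be inspected. This is a sketch, assuming eclust() stores them in the silinfo component, as current factoextra versions do.

# per-observation silhouette widths of the weak cluster 2 (a sketch)
sil <- km1$silinfo$widths
sil[sil[, "cluster"] == 2, ]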

To understand whether changing the number of clusters improves the results, I next cluster the data into 4 clusters.

km2 = eclust(data, "kmeans", hc_metric = "euclidean", k = 4, graph = FALSE)
a2 = fviz_silhouette(km2)
##   cluster size ave.sil.width
## 1       1    3          0.65
## 2       2    9          0.45
## 3       3    7          0.62
## 4       4    1          0.00
b2 = fviz_cluster(km2, data = data, ellipse.type = "convex", main = "K-means/4 clusters/all data") + theme_minimal()
grid.arrange(a2, b2, ncol = 2)

Clustering the data into 4 clusters shows that the 12th observation seems to be an outlier. In the next step I will repeat the clustering without the 12th element, as it could be one of the reasons for the low silhouette statistic.

km3 = eclust(data[-12,], "kmeans", hc_metric = "euclidean", k = 4, graph = FALSE)
a3 = fviz_silhouette(km3)
##   cluster size ave.sil.width
## 1       1    7          0.44
## 2       2    3          0.61
## 3       3    4          0.50
## 4       4    5          0.66
b3 = fviz_cluster(km3, data = data[-12,], ellipse.type = "convex", main = "K-means/4 clusters/without outlier") + theme_minimal()
grid.arrange(a3, b3, ncol = 2)

After removing the outlier, which was the cause of the poorly matched elements, the average silhouette statistics are higher.
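
To attach names to the final partition, the cluster vector can be split by perfume. A small sketch; note that the 12th name is dropped to match the reduced data.

# perfume names grouped by their k-means cluster (a sketch)
split(perfume_data$Name[-12], km3$cluster)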

PAM clustering

cm1 = eclust(data, "pam", k = 2, graph = FALSE)
pam1 = fviz_silhouette(cm1)
##   cluster size ave.sil.width
## 1       1   17          0.53
## 2       2    3          0.77
pam2 = fviz_cluster(cm1, data = data, ellipse.type = "convex", main = "PAM/2 clusters/all data") + theme_minimal()
grid.arrange(pam1, pam2, ncol = 2)

Taking into account the silhouette statistics, PAM clustering was executed for 2 clusters. The PAM method gives good results, but it is worth executing it without the outlier as well.
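
Unlike k-means, PAM uses actual observations as cluster centres (medoids), which makes the result easy to interpret. A quick sketch, assuming the object returned by eclust() keeps the standard pam fields:

# the medoid perfumes, i.e. the most representative observation per cluster
perfume_data$Name[cm1$id.med]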

cm2 = eclust(data, "pam", k = 4, graph = FALSE)
pam3 = fviz_silhouette(cm2)
##   cluster size ave.sil.width
## 1       1    9          0.45
## 2       2    7          0.62
## 3       3    1          0.00
## 4       4    3          0.65
pam4 = fviz_cluster(cm2, data = data, ellipse.type = "convex", main = "PAM/4 clusters/all data") + theme_minimal()
grid.arrange(pam3, pam4, ncol = 2)

cm3 = eclust(data[-12,], "pam", k = 2, graph = FALSE)
pam5 = fviz_silhouette(cm3)
##   cluster size ave.sil.width
## 1       1   16          0.60
## 2       2    3          0.75
pam6 = fviz_cluster(cm3, data = data[-12,], ellipse.type = "convex", main = "PAM/2 clusters/without outlier") + theme_minimal()
grid.arrange(pam5, pam6, ncol = 2)

On the “PAM/4 clusters/all data” plot the 12th observation behaves abnormally compared to the rest of the data. Taking one more glance at the data shows that the 12th observation (solidmusk) has low values compared to the others. Removing this observation from the dataset improves the results: the average silhouette statistic is higher.
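
A quick numeric check supports this. The sketch below compares the mean sensor reading per perfume; if observation 12 is indeed the outlier, it should appear at the bottom.

# mean sensor reading per perfume (illustrative sketch)
avg <- rowMeans(data)
names(avg) <- perfume_data$Name
head(sort(avg), 3)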

Hierarchical Clustering

library(stats)
library(ClustGeo)
d = dist(data)
hc = hclust(d, method = "complete")
plot(hc, main = "Cluster Dendrogram - complete method")
rect.hclust(hc, k = 5)

# cutting the tree to obtain cluster labels for later comparison
cl_5 = cutree(hc, k = 5)
cl_2 = cutree(hc, k = 2)
options("scipen" = 100, "digits" = 4)

One can see from the dendrogram that the 12th observation could be an outlier in our dataset, as it does not belong to any of the predefined clusters. Therefore the same procedure will be repeated on the data with the 12th observation omitted, since it is not included in the initial clustering. A dendrogram is a good pre-analysis tool before trying any clustering algorithm, as it shows the whole picture of the data.
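
The 5-cluster cut computed above (cl_5) can also be mapped back to the perfume names, which makes the dendrogram easier to read. A minimal sketch:

# perfumes grouped by their hierarchical cluster label (a sketch)
split(perfume_data$Name, cl_5)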

d = dist(data[-12,])
hc = hclust(d, method = "complete")
plot(hc, main = "Cluster Dendrogram - complete method")
rect.hclust(hc, k = 6)

#checking the quality of the partitioning. 
inertia = matrix(0, 4, 3)
cl_5 = cutree(hc, k = 5)
cl_2 = cutree(hc, k = 2)
cl_6 = cutree(hc, k = 6)
options("scipen"=100, "digits"=4)

inertia[1, 1] = withindiss(d, cl_5)
inertia[2, 1] = inertdiss(d)
inertia[3, 1] = inertia[1, 1] / inertia[2, 1]
inertia[4, 1] = 1 - inertia[3, 1]

inertia[1, 2] = withindiss(d, cl_6)
inertia[2, 2] = inertdiss(d)
inertia[3, 2] = inertia[1, 2] / inertia[2, 2]
inertia[4, 2] = 1 - inertia[3, 2]

inertia[1, 3] = withindiss(d, cl_2)
inertia[2, 3] = inertdiss(d)
inertia[3, 3] = inertia[1, 3] / inertia[2, 3]
inertia[4, 3] = 1 - inertia[3, 3]

colnames(inertia) = c("5 clusters", "6 clusters", "2 clusters")
rownames(inertia) = c("intra-clust", "total", "percentage", "Q")
inertia
##                   5 clusters       6 clusters      2 clusters
## intra-clust   58380581.28421   39874861.14386  510409677.5239
## total       1432878440.49861 1432878440.49861 1432878440.4986
## percentage           0.04074          0.02783          0.3562
## Q                    0.95926          0.97217          0.6438

After executing hierarchical clustering on the data without the 12th observation, we can notice some improvements. I computed the inertia for 6, 5 and 2 clusters. Partitioning into 6 clusters improves the results: the within-cluster diversity is almost 3% and the between-cluster diversity is 97%, compared to 4% and 96% respectively for the 5-cluster partitioning.
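
The repeated blocks above can be condensed into a small helper, which also makes the definition explicit: Q = 1 - within/total. A sketch, assuming the ClustGeo functions behave as called above:

# partition quality for several cuts in one call (a sketch)
partition_quality <- function(cl, d) {
  w <- withindiss(d, cl) # within-cluster (intra) dissimilarity
  tot <- inertdiss(d)    # total dissimilarity of the data
  c(intra = w, total = tot, percentage = w / tot, Q = 1 - w / tot)
}
sapply(list(`5 clusters` = cl_5, `6 clusters` = cl_6, `2 clusters` = cl_2),
       partition_quality, d = d)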

Model-Based clustering in R

library(mclust)
mc = Mclust(data)
summary(mc)
## ---------------------------------------------------- 
## Gaussian finite mixture model fitted by EM algorithm 
## ---------------------------------------------------- 
## 
## Mclust EII (spherical, equal volume) model with 9 components: 
## 
##  log-likelihood  n  df    BIC    ICL
##           -4638 20 261 -10058 -10058
## 
## Clustering table:
## 1 2 3 4 5 6 7 8 9 
## 4 2 1 2 4 1 2 1 3
fviz_mclust(mc, "classification", geom = "point", 
            pointsize = 1.5, palette = "jco")

Model-based clustering assumes that the data come from specific distributions. The algorithm tries to fit different distributions to the data, so that each cluster corresponds to a different distribution, under the assumption that the observations were generated by different distributions. Compared to the previous methods, it estimates the probability that each observation belongs to each cluster. After computing the model-based clustering we can notice that our data are not clustered well, probably because they come from the same distribution, so only a few points end up grouped together. The summary shows that a spherical, equal-volume (EII) model was fitted with 9 components in total. I think the small sample of 20 observations makes this method difficult to analyze. The graph shows the 9 components from which the data were presumably generated; two bigger clusters (1 and 5) are represented as ellipses.
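
The model selection mclust performed can be inspected directly. A short sketch using mclustBIC(), which tabulates the BIC for every covariance model and number of components:

# BIC surface underlying the Mclust() choice above (a sketch)
bic <- mclustBIC(data)
summary(bic) # top models by BIC
plot(bic)    # BIC curves per covariance model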

Statistics in clustered groups

library(flexclust)
d1<-cclust(data[-12,], 4, dist="euclidean") 
stripes(d1)

From the graph we can say that in the 3rd cluster the fragrances are the most distant from the centroid within their cluster. The first and second clusters consist of only 3 observations, and observations in the first cluster are very distant from the centroid.
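
The stripes plot can be backed with numbers. The sketch below computes each observation's Euclidean distance to its assigned centroid, assuming flexclust's clusters() and parameters() accessors:

# distance of every observation to its own centroid (a sketch)
cl <- clusters(d1)
ctr <- parameters(d1) # one centroid per row
d2c <- sqrt(rowSums((as.matrix(data[-12, ]) - ctr[cl, ])^2))
tapply(d2c, cl, summary) # spread of distances within each cluster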