This paper describes the strength of fragrance of 20 different perfumes. The data were obtained with a handheld odor meter (OMX-GR sensor), which took one reading per second over a 28-second period. It is interesting to compare perfumes of famous brands and to see the differences between them. Below I provide some technical explanation of how the odor meter works and why it was chosen for this experiment.
An odor meter (olfactometer) is a device used to detect and measure odor dilution. It indicates the relative strength and classification of an odor numerically by comparing the odor gas with purified air.
setwd("C:/Users/daria/OneDrive/Desktop/datasets")
# loading the data; the working directory was set above
library(readxl)
perfume_data <- read_excel("perfume_data.xlsx")
The data set used for this project was taken from a dataset repository. It consists of the names of the 20 perfumes and 28 time points (T1-T28) at which each fragrance was measured, one measurement per second. Below you can find the first six rows of the data, the dimensions of the data frame, and a summary of each variable.
head(perfume_data)
## # A tibble: 6 x 29
## Name T1 T2 T3 T4 T5 T6 T7 T8 T9 T10 T11 T12
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ajayeb 64558 64556 64543 64543 64541 64543 64543 64541 64541 64541 64541 64541
## 2 ajmal 60502 60489 61485 60487 61485 61513 60515 60500 60500 60487 60500 61526
## 3 amreaj 57040 57040 57040 58041 58041 58041 58041 57042 57042 58043 58043 58043
## 4 aood 71083 72087 71091 71095 71099 72103 71099 72099 72099 73098 72094 73094
## 5 asgar~ 68209 68209 68216 68216 68223 68223 68223 68223 68230 68230 68230 68230
## 6 bukho~ 71046 71046 71046 71046 71046 71046 71046 71046 71046 70046 69046 70046
## # ... with 16 more variables: T13 <dbl>, T14 <dbl>, T15 <dbl>, T16 <dbl>,
## # T17 <dbl>, T18 <dbl>, T19 <dbl>, T20 <dbl>, T21 <dbl>, T22 <dbl>,
## # T23 <dbl>, T24 <dbl>, T25 <dbl>, T26 <dbl>, T27 <dbl>, T28 <dbl>
summary(perfume_data)
## Name T1 T2 T3
## Length:20 Min. :46014 Min. :46014 Min. :46014
## Class :character 1st Qu.:63057 1st Qu.:63054 1st Qu.:63804
## Mode :character Median :67665 Median :67703 Median :67710
## Mean :67879 Mean :67978 Mean :67978
## 3rd Qu.:71692 3rd Qu.:72021 3rd Qu.:71680
## Max. :85056 Max. :85056 Max. :85056
## T4 T5 T6 T7
## Min. :46014 Min. :46014 Min. :46015 Min. :46015
## 1st Qu.:63801 1st Qu.:63049 1st Qu.:63799 1st Qu.:63907
## Median :67674 Median :67671 Median :67714 Median :67717
## Mean :67927 Mean :67977 Mean :68077 Mean :68030
## 3rd Qu.:71211 3rd Qu.:71670 3rd Qu.:72025 3rd Qu.:72143
## Max. :85056 Max. :85056 Max. :85056 Max. :85056
## T8 T9 T10 T11
## Min. :46015 Min. :46015 Min. :46015 Min. :46015
## 1st Qu.:63796 1st Qu.:64156 1st Qu.:63046 1st Qu.:63046
## Median :67674 Median :67674 Median :67674 Median :67717
## Mean :67929 Mean :67979 Mean :67929 Mean :67776
## 3rd Qu.:72024 3rd Qu.:72024 3rd Qu.:72274 3rd Qu.:72023
## Max. :85056 Max. :85056 Max. :85056 Max. :85056
## T12 T13 T14 T15
## Min. :46015 Min. :46015 Min. :46015 Min. :46015
## 1st Qu.:63046 1st Qu.:63549 1st Qu.:63043 1st Qu.:64146
## Median :67497 Median :67714 Median :67714 Median :67492
## Mean :68029 Mean :67979 Mean :67878 Mean :67927
## 3rd Qu.:72715 3rd Qu.:72215 3rd Qu.:72215 3rd Qu.:72212
## Max. :85056 Max. :85056 Max. :85056 Max. :85056
## T16 T17 T18 T19
## Min. :46015 Min. :46015 Min. :46015 Min. :46015
## 1st Qu.:63397 1st Qu.:63397 1st Qu.:64156 1st Qu.:64156
## Median :67673 Median :67673 Median :67473 Median :67476
## Mean :67924 Mean :67971 Mean :68071 Mean :68070
## 3rd Qu.:72214 3rd Qu.:72214 3rd Qu.:72211 3rd Qu.:72211
## Max. :85056 Max. :85056 Max. :85056 Max. :85056
## T20 T21 T22 T23
## Min. :46015 Min. :46015 Min. :46015 Min. :46015
## 1st Qu.:64156 1st Qu.:64156 1st Qu.:64156 1st Qu.:64156
## Median :67480 Median :67214 Median :66968 Median :67720
## Mean :67969 Mean :68068 Mean :68070 Mean :68071
## 3rd Qu.:71460 3rd Qu.:72138 3rd Qu.:72217 3rd Qu.:72138
## Max. :85056 Max. :85056 Max. :85056 Max. :85056
## T24 T25 T26 T27
## Min. :46015 Min. :46015 Min. :46015 Min. :46015
## 1st Qu.:63541 1st Qu.:63039 1st Qu.:63789 1st Qu.:63039
## Median :67419 Median :67970 Median :67462 Median :67461
## Mean :67819 Mean :67968 Mean :68068 Mean :67916
## 3rd Qu.:71463 3rd Qu.:71463 3rd Qu.:72216 3rd Qu.:72216
## Max. :85056 Max. :85056 Max. :85056 Max. :85056
## T28
## Min. :46015
## 1st Qu.:62789
## Median :67724
## Mean :67917
## 3rd Qu.:72213
## Max. :85056
dim(perfume_data)
## [1] 20 29
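Before dropping the name column, a quick sanity check (a sketch, not part of the original report) confirms that the data are complete and the perfume names are unique:
sum(is.na(perfume_data))          # number of missing readings (0 expected)
anyDuplicated(perfume_data$Name)  # 0 if no perfume name is duplicated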
In order to perform cluster analysis we need to remove the first column of the data set, as it contains string values.
data = perfume_data[2:29]
names = perfume_data[,1]
# Now we have 28 columns
dim(data)
## [1] 20 28
head(data)
## # A tibble: 6 x 28
## T1 T2 T3 T4 T5 T6 T7 T8 T9 T10 T11 T12 T13
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 64558 64556 64543 64543 64541 64543 64543 64541 64541 64541 64541 64541 64541
## 2 60502 60489 61485 60487 61485 61513 60515 60500 60500 60487 60500 61526 60528
## 3 57040 57040 57040 58041 58041 58041 58041 57042 57042 58043 58043 58043 58043
## 4 71083 72087 71091 71095 71099 72103 71099 72099 72099 73098 72094 73094 72091
## 5 68209 68209 68216 68216 68223 68223 68223 68223 68230 68230 68230 68230 68230
## 6 71046 71046 71046 71046 71046 71046 71046 71046 71046 70046 69046 70046 70046
## # ... with 15 more variables: T14 <dbl>, T15 <dbl>, T16 <dbl>, T17 <dbl>,
## # T18 <dbl>, T19 <dbl>, T20 <dbl>, T21 <dbl>, T22 <dbl>, T23 <dbl>,
## # T24 <dbl>, T25 <dbl>, T26 <dbl>, T27 <dbl>, T28 <dbl>
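As an optional step (data_named is an illustrative name, not used in the rest of the report), the perfume names could be carried over as row names, so that the cluster plots below would label points by name instead of row number:
# keep the perfume names as row names for more readable plot labels
data_named <- as.data.frame(data)
rownames(data_named) <- perfume_data$Name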
Before starting the cluster analysis I would like to conduct a Hopkins test to assess the clustering tendency of the data set. The null hypothesis states that the data are uniformly randomly distributed, so that no clusters can be defined. The data are highly clusterable when the Hopkins statistic is close to 1.
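For intuition, here is a minimal sketch of how the Hopkins statistic can be computed by hand; get_clust_tendency() below does this for us, and hopkins_by_hand, m and the seed are illustrative choices, not part of the report. It compares nearest-neighbour distances from m uniformly generated points (u) with those from m sampled data points (w); H = sum(u) / (sum(u) + sum(w)) approaches 1 for clusterable data.
hopkins_by_hand <- function(X, m = 5, seed = 123) {
  set.seed(seed)
  X <- as.matrix(X)
  n <- nrow(X)
  # nearest-data-point distances for m points drawn uniformly over the data range
  U <- apply(X, 2, function(col) runif(m, min(col), max(col)))
  u <- apply(U, 1, function(p) min(sqrt(colSums((t(X) - p)^2))))
  # nearest-neighbour distances for m randomly sampled data points
  idx <- sample(n, m)
  w <- sapply(idx, function(i) min(sqrt(colSums((t(X[-i, , drop = FALSE]) - X[i, ])^2))))
  sum(u) / (sum(u) + sum(w))
}
hopkins_by_hand(data)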
library(factoextra)
get_clust_tendency(data, 2, graph=TRUE, gradient=list(low="red", mid="white", high="blue"))
## $hopkins_stat
## [1] 0.8088706
##
## $plot
The Hopkins statistic of about 0.81 is close to 1, which indicates that the perfume data set is highly clusterable.
Due to the small sample the dissimilarity plot is not very clear or accurate, but blocks of color are visible and the data appear ordered, which can be a sign that clusters exist in the data set.
In this chapter I show the results of the silhouette method for identifying the optimal number of clusters. I chose the silhouette method because it is the one most often used by other researchers: the optimal number of clusters is the one with the highest average silhouette statistic.
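For each observation i the silhouette width is s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the mean distance to the other members of its own cluster and b(i) the mean distance to the nearest other cluster. A minimal sketch of computing the average width for one candidate k (km_demo and k = 3 are illustrative choices, not the report's):
km_demo <- kmeans(data, centers = 3, nstart = 25)  # a candidate partition
sil <- cluster::silhouette(km_demo$cluster, dist(data))
mean(sil[, "sil_width"])  # the statistic fviz_nbclust() plots for each k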
library(FunCluster)
## Warning: package 'cluster' was built under R version 4.0.4
library(cluster)
library(NbClust)
library(gridExtra)
k1 = fviz_nbclust(data, FUNcluster = kmeans, method = "silhouette") + theme_minimal()
k1
k2 = fviz_nbclust(data, FUNcluster = cluster::pam, method = "silhouette") + theme_minimal()
k2
k3 = fviz_nbclust(data, FUNcluster = hcut, method = "silhouette") + theme_minimal()
k3
km1 = eclust(data, "kmeans", hc_metric = "euclidean", k = 3, graph = FALSE)
a1 = fviz_silhouette(km1)
## cluster size ave.sil.width
## 1 1 3 0.72
## 2 2 4 0.31
## 3 3 13 0.54
b1 = fviz_cluster(km1, data = data, ellipse.type = "convex", main = "K-means/3 clusters/all data") + theme_minimal()
grid.arrange(a1, b1, ncol=2)
To sum up, we divided our data into 3 clusters: the green cluster contains fragrances with the least perceptible odor, fragrances in the red cluster are perceived as very strong, and the rest fall into the blue cluster.
The graphs above show that the average silhouette statistic in cluster 2 is lower than in the other two clusters. An average silhouette of 0.31 for cluster 2 means that its elements are poorly matched to their own cluster, which can indicate that the configuration has too few or too many clusters.
To check whether changing the number of clusters gives better results, I next clustered the data into 4 clusters.
km2 = eclust(data, "kmeans", hc_metric = "euclidean", k = 4, graph = FALSE)
a2 = fviz_silhouette(km2)
## cluster size ave.sil.width
## 1 1 3 0.65
## 2 2 9 0.45
## 3 3 7 0.62
## 4 4 1 0.00
b2 = fviz_cluster(km2, data = data, ellipse.type = "convex", main = "K-means/4 clusters/all data") + theme_minimal()
grid.arrange(a2, b2, ncol = 2)
Clustering the data into 4 clusters shows that the 12th observation appears to be an outlier. In the next step I will repeat the clustering without the 12th element, as it could be one reason for the low silhouette statistic.
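Before refitting, the suspected outlier can be double-checked directly (a quick sketch, not in the original analysis):
names[12, ]              # the perfume name of observation 12
rowMeans(data)[12]       # its average reading ...
summary(rowMeans(data))  # ... compared with the other fragrances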
km3 = eclust(data[-12,], "kmeans", hc_metric = "euclidean", k = 4, graph = FALSE)
a3 = fviz_silhouette(km3)
## cluster size ave.sil.width
## 1 1 7 0.44
## 2 2 3 0.61
## 3 3 4 0.50
## 4 4 5 0.66
b3 = fviz_cluster(km3, data = data[-12,], ellipse.type = "convex", main = "K-means/4 clusters/euclidean") + theme_minimal()
grid.arrange(a3, b3, ncol = 2)
After removing the outlier, which was the cause of the poorly matched elements, the average silhouette is now higher.
cm1 = eclust(data, "pam", k = 2, graph = FALSE)
pam1 = fviz_silhouette(cm1)
## cluster size ave.sil.width
## 1 1 17 0.53
## 2 2 3 0.77
pam2 = fviz_cluster(cm1, data = data, ellipse.type = "convex", main = "PAM/2 clusters/all data") + theme_minimal()
grid.arrange(pam1, pam2, ncol = 2)
Taking into account the results of the silhouette statistics, PAM clustering was executed for 2 clusters. The PAM method gives good results, but it is worth also executing it without the outlier.
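Unlike k-means, PAM uses actual observations (medoids) as cluster centres, so the centres can be read off the fitted object; a small sketch (these fields come from cluster::pam, which eclust() wraps):
cm1$medoids          # the two medoid rows of the fitted PAM object
names[cm1$id.med, ]  # which perfumes serve as medoids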
cm2 = eclust(data, "pam", k = 4, graph = FALSE)
pam3 = fviz_silhouette(cm2)
## cluster size ave.sil.width
## 1 1 9 0.45
## 2 2 7 0.62
## 3 3 1 0.00
## 4 4 3 0.65
pam4 = fviz_cluster(cm2, data = data, ellipse.type = "convex", main = "PAM/4 clusters/all data") + theme_minimal()
grid.arrange(pam3, pam4, ncol = 2)
cm3 = eclust(data[-12,], "pam", k = 2, graph = FALSE)
pam5 = fviz_silhouette(cm3)
## cluster size ave.sil.width
## 1 1 16 0.60
## 2 2 3 0.75
pam6 = fviz_cluster(cm3, data = data[-12,], ellipse.type = "convex", main = "PAM/2 clusters/outlier removed") + theme_minimal()
grid.arrange(pam5, pam6, ncol = 2)
On the “PAM/4 clusters/all data” plot the 12th observation behaves abnormally compared to the rest of the data. Glancing at the data once more shows that the 12th observation (solidmusk) has low values compared with the others. Removing this one observation from the dataset improves our results: the average silhouette statistic is higher.
library(stats)
library(ClustGeo)
d = dist(data)
hc = hclust(d, method = "complete")
plot(hc, main = "Cluster Dendrogram - complete method")
rect.hclust(hc, k = 5)
# partition labels for 5 and 2 clusters; the quality of the partitioning
# is assessed below, once the outlier has been removed
cl_5 = cutree(hc, k = 5)
cl_2 = cutree(hc, k = 2)
options("scipen" = 100, "digits" = 4)
One can see from the dendrogram that the 12th observation could be an outlier in our dataset, as it does not belong to any of the predefined clusters. Therefore the same procedure will be repeated with the 12th observation omitted, since it is not absorbed by the initial clustering. A dendrogram is a good pre-analysis tool before trying any clustering algorithm, as it shows the whole picture of the data.
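As a side check (a sketch, not in the original analysis), the choice of complete linkage can be compared against other linkages via the agglomerative coefficient from cluster::agnes, where values closer to 1 indicate a stronger clustering structure:
sapply(c("average", "single", "complete", "ward"),
       function(m) cluster::agnes(data[-12, ], method = m)$ac)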
d = dist(data[-12,])
hc = hclust(d, method = "complete")
plot(hc, main = "Cluster Dendrogram - complete method")
rect.hclust(hc, k = 6)
#checking the quality of the partitioning.
inertia = matrix(0, 4, 3)
cl_5 = cutree(hc, k = 5)
cl_2 = cutree(hc, k = 2)
cl_6 = cutree(hc, k = 6)
options("scipen"=100, "digits"=4)
# within-cluster and total pseudo-inertia for the 5-, 6- and 2-cluster
# partitions, plus the share of diversity explained (Q = 1 - within/total)
partitions = list(cl_5, cl_6, cl_2)
for (j in 1:3) {
  inertia[1, j] = withindiss(d, partitions[[j]])
  inertia[2, j] = inertdiss(d)
  inertia[3, j] = inertia[1, j] / inertia[2, j]
  inertia[4, j] = 1 - inertia[3, j]
}
colnames(inertia) = c("5 clusters", "6 clusters", "2 clusters")
rownames(inertia) = c("intra-clust", "total", "percentage", "Q")
inertia
## 5 clusters 6 clusters 2 clusters
## intra-clust 58380581.28421 39874861.14386 510409677.5239
## total 1432878440.49861 1432878440.49861 1432878440.4986
## percentage 0.04074 0.02783 0.3562
## Q 0.95926 0.97217 0.6438
After executing hierarchical clustering on the data excluding the 12th observation, we can notice some improvements. I computed the inertia decomposition for 5, 6 and 2 clusters. Partitioning into 6 clusters improves the results: within-cluster diversity is almost 3% and between-cluster diversity is 97%, compared with 4% and 96% respectively for the 5-cluster partition.
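The quality measure in the last row is simply Q(k) = 1 - within-cluster pseudo-inertia / total pseudo-inertia, so the same numbers can be reproduced in one line (a sketch using the d and hc objects defined above; Q is an illustrative helper name):
Q <- function(k) 1 - withindiss(d, cutree(hc, k)) / inertdiss(d)
sapply(c(2, 5, 6), Q)  # Q for 2, 5 and 6 clusters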
library(mclust)
mc = Mclust(data)
summary(mc)
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust EII (spherical, equal volume) model with 9 components:
##
## log-likelihood n df BIC ICL
## -4638 20 261 -10058 -10058
##
## Clustering table:
## 1 2 3 4 5 6 7 8 9
## 4 2 1 2 4 1 2 1 3
fviz_mclust(mc, "classification", geom = "point",
pointsize = 1.5, palette = "jco")
Model-based clustering assumes that the data come from a mixture of specific distributions. The algorithm tries to fit different distributions to the data, so that each cluster corresponds to a different component distribution, and, unlike the previous methods, it estimates the probability that each observation belongs to each cluster. After computing the model-based clustering we can see that the data are not clustered convincingly, perhaps because most observations come from the same underlying distribution, so only a few points are separated into distinct groups. The summary shows that a spherical, equal-volume (EII) model with 9 components in total was fitted to the data. I think the small sample of 20 observations makes this method difficult to interpret here: the plot suggests 9 component distributions from which the data were generated, with the two larger clusters (1 and 5) drawn as ellipses.
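To see why Mclust settled on the EII model with 9 components, the BIC surface it optimised can be inspected directly (standard mclust tools, shown here as a sketch):
plot(mc, what = "BIC")  # BIC for each covariance model and number of components
summary(mc$BIC)         # the top-ranked model/component combinations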
library(flexclust)
d1<-cclust(data[-12,], 4, dist="euclidean")
stripes(d1)
From the graph we can say that the fragrances in the 3rd cluster are the most distant from the centroid within their cluster. The first and second clusters consist of only 3 observations, and observations in the first cluster are very distant from the centroid.
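A numeric counterpart of the stripes plot can be computed by hand (a sketch; X, cl and dists are illustrative names): the distance of each observation to its assigned centroid, summarised per cluster.
X <- as.matrix(data[-12, ])
cl <- clusters(d1)                                    # cluster assignments
dists <- sqrt(rowSums((X - parameters(d1)[cl, ])^2))  # distance to own centroid
table(cl)                # cluster sizes
tapply(dists, cl, mean)  # average distance to the centroid, per cluster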