Choice of topic
I decided to study the nutritional value of coffee. Clustering could serve as an introduction to customer recommendations (coffee selection based on caloric and caffeine preferences).
Dataset information
The analyzed coffee nutrient content comes from the Polish Starbucks website (link to the nutrition information). During preprocessing, every drink without kcal/kJ values was removed (all natural teas). kcal was kept as the more popular unit of energy, so the kJ (kilojoules) column was deleted. The Natural and Added Sugars columns were also omitted, since Total Sugars values are given.
To make the analysis more convenient, I concatenated the coffee names with their sizes and milk types (if any). This produced unique names, which can later be turned into the row names.
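The preprocessing itself is not shown in this report; a minimal sketch of the steps described above, assuming a hypothetical raw file and hypothetical raw column names (Name, Size, Milk, kJ, Natural_sugars, Added_sugars), could look like this:
library(dplyr)
library(tidyr)  # unite() concatenates the name parts

raw <- read.csv("raw_starbucks_menu.csv")  # hypothetical raw export
coffee <- raw %>%
  filter(!is.na(kcal)) %>%  # drop drinks without kcal/kJ (the natural teas)
  select(-kJ, -Natural_sugars, -Added_sugars) %>%  # keep kcal as the energy unit
  unite("Produkt", Name, Size, Milk, sep = "_", na.rm = TRUE)  # unique drink names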
Let’s see some basic information about the dataset:
coffee <- read.csv("Polish_starbucks_coffees.csv")
summary(coffee)
## Produkt Size_oz kcal
## CAFFE_AMERICANO_Grande : 1 Min. : 1.00 Min. : 0.0
## CAFFE_AMERICANO_Short : 1 1st Qu.:10.00 1st Qu.:157.0
## CAFFE_AMERICANO_Tall : 1 Median :12.00 Median :228.0
## CAFFE_AMERICANO_Venti : 1 Mean :13.52 Mean :234.7
## CAFFE_LATTE_Grande_Almond : 1 3rd Qu.:16.00 3rd Qu.:312.0
## CAFFE_LATTE_Grande_Coconut: 1 Max. :20.00 Max. :690.0
## (Other) :1011
## Tluszcz Tluszcze_nasycone Weglowodany Blonnik
## Min. : 0.000 Min. : 0.000 Min. : 0.00 Min. : 0.0000
## 1st Qu.: 4.900 1st Qu.: 2.200 1st Qu.:21.40 1st Qu.: 0.0000
## Median : 9.800 Median : 5.400 Median :32.50 Median : 0.3000
## Mean : 9.728 Mean : 5.487 Mean :32.49 Mean : 0.7487
## 3rd Qu.:13.600 3rd Qu.: 7.600 3rd Qu.:43.20 3rd Qu.: 0.9000
## Max. :40.400 Max. :24.300 Max. :78.80 Max. :10.4000
##
## Bialko Sol Cukier Kofeina
## Min. : 0.000 Min. :0.0000 Min. : 0.00 Min. : 0.00
## 1st Qu.: 1.400 1st Qu.:0.2300 1st Qu.:19.80 1st Qu.: 20.00
## Median : 3.600 Median :0.3300 Median :29.70 Median : 83.00
## Mean : 4.797 Mean :0.3677 Mean :29.58 Mean : 96.45
## 3rd Qu.: 7.200 3rd Qu.:0.4900 3rd Qu.:40.30 3rd Qu.:150.00
## Max. :19.600 Max. :1.6000 Max. :68.70 Max. :469.00
##
Now I transform the first column (coffee names with sizes) into record IDs (row names), using dplyr and tibble:
library(dplyr)
library(tibble)  # remove_rownames() and column_to_rownames() come from tibble

coffee <- coffee %>%
  remove_rownames() %>%
  column_to_rownames(var = "Produkt")
We can look at the correlations between the variables; this will help in selecting variables for clustering.
Correlation plot:
library(PerformanceAnalytics)  # chart.Correlation()

chart.Correlation(coffee, histogram = TRUE)
Heatmap:
col <- colorRampPalette(c("blue", "white", "red"))(20)
res <- cor(coffee)
round(res, 2)
## Size_oz kcal Tluszcz Tluszcze_nasycone Weglowodany Blonnik
## Size_oz 1.00 0.49 0.35 0.32 0.49 0.22
## kcal 0.49 1.00 0.88 0.80 0.93 0.45
## Tluszcz 0.35 0.88 1.00 0.94 0.69 0.52
## Tluszcze_nasycone 0.32 0.80 0.94 1.00 0.62 0.45
## Weglowodany 0.49 0.93 0.69 0.62 1.00 0.29
## Blonnik 0.22 0.45 0.52 0.45 0.29 1.00
## Bialko 0.43 0.48 0.33 0.24 0.35 0.27
## Sol 0.43 0.68 0.54 0.50 0.70 0.02
## Cukier 0.47 0.89 0.64 0.57 0.99 0.18
## Kofeina 0.48 0.00 -0.04 -0.04 -0.01 0.00
## Bialko Sol Cukier Kofeina
## Size_oz 0.43 0.43 0.47 0.48
## kcal 0.48 0.68 0.89 0.00
## Tluszcz 0.33 0.54 0.64 -0.04
## Tluszcze_nasycone 0.24 0.50 0.57 -0.04
## Weglowodany 0.35 0.70 0.99 -0.01
## Blonnik 0.27 0.02 0.18 0.00
## Bialko 1.00 0.27 0.31 0.29
## Sol 0.27 1.00 0.72 0.00
## Cukier 0.31 0.72 1.00 -0.06
## Kofeina 0.29 0.00 -0.06 1.00
heatmap(x = res, col = col, symm = TRUE)
For more intuitive clustering and analysis we choose two variables: kcal and Bialko (protein). They are not highly correlated with each other and are of great importance in people's diets (especially for those watching their figure).
coffee_kcal_protein <- coffee[c(2, 7)]  # keep only the kcal and Bialko columns
chart.Correlation(coffee_kcal_protein,histogram = TRUE)
First, it is worth assessing the clustering tendency:
library(clustertend)  # hopkins()

hopkins(coffee, n = nrow(coffee)-1)
## $H
## [1] 0.08713878
hopkins(coffee_kcal_protein, n = nrow(coffee_kcal_protein)-1)
## $H
## [1] 0.2311419
library(factoextra)  # get_clust_tendency() and the fviz_* functions used below

clust_tendency_kcal_protein <- get_clust_tendency(coffee_kcal_protein, n = nrow(coffee_kcal_protein)-1, graph = FALSE)
clust_tendency_kcal_protein$hopkins_stat
## [1] 0.7701866
The hopkins() values are well below 0.5, which is good: in the clustertend convention, an H close to 0 means the data very likely contains statistically significant clusters. Note that get_clust_tendency() from the factoextra package now computes the Hopkins statistic as 1 − H, so its output (about 0.77 here) tells the same story read the other way around: values close to 1 indicate clusterable data.
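For reference, one common formulation of the Hopkins statistic (in the convention factoextra uses, where values near 1 indicate clusterable data) is

$$H = \frac{\sum_{i=1}^{m} u_i}{\sum_{i=1}^{m} u_i + \sum_{i=1}^{m} w_i}$$

where u_i is the distance from the i-th uniformly generated point to its nearest real observation, and w_i is the distance from the i-th sampled real observation to its nearest neighbour in the data (some implementations additionally raise these distances to the power of the data dimension). Clustered data makes the w_i small, pushing H toward 1; clustertend's hopkins() returns the complementary value, which approaches 0 instead.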
Here is a plot of the cluster tendency:
#plot cluster tendency
get_clust_tendency(coffee, 2, graph = TRUE,
                   gradient = list(low = "red", mid = "white", high = "blue"),
                   seed = 123)
## $hopkins_stat
## [1] 0.9632006
##
## $plot
Here is the cluster tendency plot for the kcal-and-protein-only data:
get_clust_tendency(coffee_kcal_protein, 2, graph = TRUE,
                   gradient = list(low = "red", mid = "white", high = "blue"),
                   seed = 123)
## $hopkins_stat
## [1] 0.6678308
##
## $plot
Before we start estimating the number of clusters k, let’s see how the relations between the main variables look:
library(plotly)  # interactive 3D scatter plots

coffee.frame <- as.data.frame(coffee)
plot_ly(coffee.frame, x = ~kcal, y = ~Bialko, z = ~Kofeina, color = ~Kofeina, type = "scatter3d", mode = "markers")
plot_ly(coffee.frame, x = ~kcal, y = ~Bialko, z = ~Blonnik, color = ~Kofeina, type = "scatter3d", mode = "markers")
plot_ly(coffee.frame, x = ~kcal, y = ~Cukier, z = ~Tluszcz, color = ~Kofeina, type = "scatter3d", mode = "markers")
plot_ly(coffee.frame, x = ~kcal, y = ~Bialko, z = ~Tluszcz, color = ~Kofeina, type = "scatter3d", mode = "markers")
Now we can estimate the number of clusters.
Elbow method:
fviz_nbclust(coffee_kcal_protein, kmeans, method = "wss")+
labs(subtitle = "Elbow method")
Silhouette method:
fviz_nbclust(coffee_kcal_protein, pam, method = "silhouette")+
labs(subtitle = "Silhouette method")
Gap statistic method:
fviz_nbclust(coffee_kcal_protein, kmeans, method = "gap_stat", nboot = 500)+
labs(subtitle = "Gap statistic method")
Estimation using NbClust():
library(NbClust)

NbEstimation <- NbClust(data = coffee_kcal_protein, distance = "manhattan", max.nc = 6, method = "median")
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 5 proposed 2 as the best number of clusters
## * 7 proposed 3 as the best number of clusters
## * 3 proposed 4 as the best number of clusters
## * 7 proposed 5 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 3
##
##
## *******************************************************************
NbEstimation$All.index
## KL CH Hartigan CCC Scott Marriot TrCovW TraceW
## 2 0.0899 943.343 1492.4942 27.2708 2295.845 422232944613 2.681384e+13 7456416
## 3 4.5294 1909.599 169.0217 24.6207 3250.486 371594645426 4.285570e+12 3018257
## 4 0.3593 1540.091 928.5518 12.8182 3425.588 556125198378 3.168388e+12 2587030
## 5 2.8392 2443.568 10.3641 17.2639 4095.896 449518698030 8.466935e+11 1349777
## 6 3.4775 1974.994 117.1976 8.7329 4106.453 640621708372 8.296173e+11 1336094
## Friedman Rubin Cindex DB Silhouette Duda Pseudot2 Beale Ratkowsky
## 2 11.4020 9.4479 0.2834 0.6839 0.5116 0.3417 1651.2520 1.9245 0.3746
## 3 27.4638 23.3405 0.1881 0.6477 0.5069 0.3968 237.1842 1.5107 0.3705
## 4 30.7104 27.2311 0.3021 0.5675 0.4884 0.3392 1201.9918 1.9450 0.3401
## 5 57.1508 52.1922 0.2541 0.5906 0.5022 0.5593 8.6669 0.7222 0.3192
## 6 57.6722 52.7267 0.2541 0.6050 0.4965 0.4505 174.4357 1.2114 0.2917
## Ball Ptbiserial Frey McClain Dunn Hubert SDindex Dindex SDbw
## 2 3728207.8 0.5537 0.9534 0.1585 0.0112 0 0.0218 69.5227 1.1172
## 3 1006085.7 0.6344 -0.0105 0.4076 0.0066 0 0.0211 44.7771 0.6735
## 4 646757.5 0.6393 1.0149 0.4068 0.0107 0 0.0204 42.5137 0.4248
## 5 269955.3 0.5850 0.7283 0.6049 0.0130 0 0.0322 30.3184 0.3279
## 6 222682.3 0.5850 0.5981 0.6050 0.0130 0 0.0322 30.1793 0.3625
NbEstimation$All.CriticalValues
## CritValue_Duda CritValue_PseudoT2 Fvalue_Beale
## 2 0.5975 577.3314 0.1463
## 3 0.4854 165.4076 0.2224
## 4 0.5825 442.2237 0.1434
## 5 -0.0027 -4016.9908 0.4968
## 6 0.4768 156.9456 0.2993
NbEstimation$Best.nc
## KL CH Hartigan CCC Scott Marriot
## Number_clusters 3.0000 5.000 3.000 2.0000 3.0000 5
## Value_Index 4.5294 2443.568 1323.473 27.2708 954.6414 297709510689
## TrCovW TraceW Friedman Rubin Cindex DB Silhouette
## Number_clusters 3.000000e+00 3 5.0000 5.0000 3.0000 4.0000 2.0000
## Value_Index 2.252827e+13 4006932 26.4404 -24.4265 0.1881 0.5675 0.5116
## Duda PseudoT2 Beale Ratkowsky Ball PtBiserial Frey
## Number_clusters 5.0000 NA 2.0000 2.0000 3 4.0000 1
## Value_Index 0.5593 NA 1.9245 0.3746 2722122 0.6393 NA
## McClain Dunn Hubert SDindex Dindex SDbw
## Number_clusters 2.0000 5.000 0 4.0000 0 5.0000
## Value_Index 0.1585 0.013 0 0.0204 0 0.3279
The output of NbClust() suggests division into 3 clusters.
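If we want that majority vote programmatically instead of reading it off the console, the Best.nc matrix returned by NbClust() can be tallied directly; a small sketch using only the NbEstimation object created above:
best_k <- NbEstimation$Best.nc["Number_clusters", ]  # best k proposed by each index
table(best_k)  # tally the votes; 0/NA entries come from graphical indices with no proposal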
Estimation using pamk():
library(fpc)  # pamk()

pamk.best <- pamk(coffee_kcal_protein, krange = 1:10, criterion = "asw", usepam = TRUE,
                  scaling = FALSE, alpha = 0.001, diss = inherits(coffee_kcal_protein, "dist"), critout = FALSE)
pamk.best$nc
## [1] 2
pamk.best$pamobject$medoids
## kcal Bialko
## VANILLA_CREAM_FRAPPUCCINO®_Mini_Lactose_free_milk 165 2.4
## GINGERBREAD_COFFEE_FRAPPUCCINO®_Tall_Soy 323 5.5
pamk.best$pamobject$silinfo$clus.avg.widths
## [1] 0.5657424 0.5169302
The output of pamk() suggests division into 2 clusters.
It is difficult to pick a specific k, so based on the plots above we will compute 2, 3, and 4 clusters.
K-means algorithm:
coffee_clst_kmeans1 <- kmeans(coffee_kcal_protein, 2, nstart = 25)
coffee_clst_kmeans2 <- kmeans(coffee_kcal_protein, 3, nstart = 25)
coffee_clst_kmeans3 <- kmeans(coffee_kcal_protein, 4, nstart = 25)
fviz_cluster(coffee_clst_kmeans1, data = coffee_kcal_protein,
             geom = "point",
             ellipse.type = "euclid",
             main = "Coffees",
             ellipse.alpha = 0.2)
fviz_cluster(coffee_clst_kmeans2, data = coffee_kcal_protein,
             geom = "point",
             ellipse.type = "euclid",
             main = "Coffees",
             ellipse.alpha = 0.2)
fviz_cluster(coffee_clst_kmeans3, data = coffee_kcal_protein,
             geom = "point",
             ellipse.type = "euclid",
             main = "Coffees",
             ellipse.alpha = 0.2)
PAM algorithm:
library(cluster)  # pam(), and later agnes(), diana(), clusGap()

coffee_clst_pam1 <- pam(coffee_kcal_protein, 2)
coffee_clst_pam2 <- pam(coffee_kcal_protein, 3)
coffee_clst_pam3 <- pam(coffee_kcal_protein, 4)
fviz_cluster(coffee_clst_pam1, data = coffee_kcal_protein,
             geom = "point",
             ellipse.type = "t",
             main = "Coffees",
             ellipse.alpha = 0.2)
fviz_cluster(coffee_clst_pam2, data = coffee_kcal_protein,
             geom = "point",
             ellipse.type = "t",
             main = "Coffees",
             ellipse.alpha = 0.2)
fviz_cluster(coffee_clst_pam3, data = coffee_kcal_protein,
             geom = "point",
             ellipse.type = "t",
             main = "Coffees",
             ellipse.alpha = 0.2)
Four clusters look nice, but it is still quite difficult to choose the optimal number of clusters.
coffee_clst_kmeans1$centers
## kcal Bialko
## 1 344.0161 6.546774
## 2 153.3859 3.494168
coffee_clst_kmeans1$size
## [1] 434 583
coffee_clst_kmeans2$centers
## kcal Bialko
## 1 239.50509 4.410794
## 2 397.89655 7.806466
## 3 98.02041 3.066667
coffee_clst_kmeans2$size
## [1] 491 232 294
coffee_clst_kmeans3$centers
## kcal Bialko
## 1 198.75275 3.854670
## 2 77.98174 2.894977
## 3 432.64384 8.699315
## 4 299.08681 5.455556
coffee_clst_kmeans3$size
## [1] 364 219 146 288
We compute silhouette plots for the PAM clusterings:
fviz_silhouette(coffee_clst_pam1)
## cluster size ave.sil.width
## 1 1 570 0.57
## 2 2 447 0.52
fviz_silhouette(coffee_clst_pam2)
## cluster size ave.sil.width
## 1 1 253 0.58
## 2 2 457 0.58
## 3 3 307 0.45
fviz_silhouette(coffee_clst_pam3)
## cluster size ave.sil.width
## 1 1 221 0.57
## 2 2 335 0.56
## 3 3 273 0.55
## 4 4 188 0.39
The sizes of the clusters are similar. Two clusters look best, but it is worth considering splitting the data into 3 or 4 clusters for greater business value; see the comparison sketched below.
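To back this up with a single number per k, we can compare the overall average silhouette widths of the three PAM solutions; a short sketch using only the pam objects computed above:
# overall average silhouette width for k = 2, 3, 4
sapply(list(k2 = coffee_clst_pam1, k3 = coffee_clst_pam2, k4 = coffee_clst_pam3),
       function(p) p$silinfo$avg.width)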
To be more confident about the result, we create another clustering, this time without assuming the number k at the beginning. We use the hierarchical clustering algorithms agnes() and diana(). agnes() works in a bottom-up (agglomerative) manner, while diana() works top-down (divisive), so it is in a sense AGNES reversed.
DIANA is typically better suited to dividing data into a few large clusters, while AGNES is used more often to find small clusters. We are after large clusters, so I will use the DIANA algorithm.
coffee_clst_agnes <- agnes(coffee_kcal_protein)  # agglomerative (bottom-up)
coffee_clst_diana <- diana(coffee_kcal_protein)  # divisive (top-down)
pltree(coffee_clst_diana, cex = 0.01, hang = -1, main = "Dendrogram of diana")  # tiny cex keeps the ~1000 leaf labels from cluttering the plot
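As a quick numeric comparison of the two algorithms, we can look at the agglomerative coefficient reported by agnes() and the divisive coefficient reported by diana() (both lie between 0 and 1; values closer to 1 indicate more pronounced clustering structure):
coffee_clst_agnes$ac  # agglomerative coefficient of AGNES
coffee_clst_diana$dc  # divisive coefficient of DIANA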
Now we can cut the tree into 2, 3 or 4 clusters. We can clearly see that there are 2 or 4 big branches. Considering the business value that more diverse clusters can create, I choose k = 4.
We use cutree() to divide the data into clusters:
sub_coffee_clst_diana <- cutree(coffee_clst_diana, k = 4)
table(sub_coffee_clst_diana)
## sub_coffee_clst_diana
## 1 2 3 4
## 590 282 122 23
Plot of 4 new clusters:
fviz_cluster(list(data = coffee_kcal_protein, cluster = sub_coffee_clst_diana),
             geom = "point", ellipse.alpha = 0.3)
We can also try to estimate k using the elbow and silhouette methods. fviz_gap_stat() also works with hierarchical clustering.
fviz_nbclust(coffee_kcal_protein, FUN = hcut, method = "wss")
fviz_nbclust(coffee_kcal_protein, FUN = hcut, method = "silhouette")
gap_stat <- clusGap(coffee_kcal_protein, FUN = hcut, nstart = 25, K.max = 10, B = 50)
fviz_gap_stat(gap_stat)
Many methods point to 2 as the best k, but that would probably bring less business value. A sketch of how the chosen clusters could feed the recommendation idea from the introduction follows below.
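Returning to that recommendation idea, here is a minimal sketch of how the DIANA clusters could drive a simple recommendation; the customer preference (~150 kcal, ~5 g protein) is a hypothetical example:
# hypothetical customer preference: about 150 kcal and 5 g of protein
pref <- c(kcal = 150, Bialko = 5)

# mean kcal/protein profile of each DIANA cluster
centers <- aggregate(coffee_kcal_protein, by = list(cluster = sub_coffee_clst_diana), FUN = mean)

# assign the customer to the cluster with the nearest profile
dists <- sqrt((centers$kcal - pref["kcal"])^2 + (centers$Bialko - pref["Bialko"])^2)
best_cluster <- centers$cluster[which.min(dists)]

# recommend a few drinks from that cluster
head(rownames(coffee_kcal_protein)[sub_coffee_clst_diana == best_cluster])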