Choice of topic

I decided to study the nutritional value of Starbucks coffee drinks. Clustering could serve as an introduction to customer recommendations (suggesting coffees based on caloric and caffeine preferences).

Dataset information

The analyzed coffee nutrient content comes from the Polish Starbucks website (link to the nutrition information). During preprocessing, every drink without kcal/kJ values was removed (all the natural teas). Kilocalories were kept as the more common unit of energy, so the kJ (kilojoule) column was dropped. The Natural Sugars and Added Sugars columns were also omitted, since Total Sugars is already provided.

For more convenient analysis, I concatenated each coffee's name with its size and type of milk (if any). This produced unique names, which can later be turned into row names.
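
The preprocessing itself happened before this report; a minimal sketch of the steps described above might look like this (the raw file name and the columns kJ, Cukry_naturalne, Cukry_dodane, Size and Milk are assumptions about the source export, not its real layout):

library(dplyr)

raw <- read.csv("raw_starbucks_export.csv")   # hypothetical raw export

coffee <- raw %>%
  filter(!is.na(kcal)) %>%                          # drop drinks without kcal/kJ (natural teas)
  select(-kJ, -Cukry_naturalne, -Cukry_dodane) %>%  # keep kcal and Total Sugars only
  mutate(Produkt = paste(Produkt, Size, sep = "_"),
         Produkt = ifelse(Milk == "", Produkt,      # append milk type only when present
                          paste(Produkt, Milk, sep = "_"))) %>%
  select(-Size, -Milk)

write.csv(coffee, "Polish_starbucks_coffees.csv", row.names = FALSE)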

Let’s see some basic information about the dataset:

coffee <- read.csv("Polish_starbucks_coffees.csv")
summary(coffee)
##                        Produkt        Size_oz           kcal      
##  CAFFE_AMERICANO_Grande    :   1   Min.   : 1.00   Min.   :  0.0  
##  CAFFE_AMERICANO_Short     :   1   1st Qu.:10.00   1st Qu.:157.0  
##  CAFFE_AMERICANO_Tall      :   1   Median :12.00   Median :228.0  
##  CAFFE_AMERICANO_Venti     :   1   Mean   :13.52   Mean   :234.7  
##  CAFFE_LATTE_Grande_Almond :   1   3rd Qu.:16.00   3rd Qu.:312.0  
##  CAFFE_LATTE_Grande_Coconut:   1   Max.   :20.00   Max.   :690.0  
##  (Other)                   :1011                                  
##     Tluszcz       Tluszcze_nasycone  Weglowodany       Blonnik       
##  Min.   : 0.000   Min.   : 0.000    Min.   : 0.00   Min.   : 0.0000  
##  1st Qu.: 4.900   1st Qu.: 2.200    1st Qu.:21.40   1st Qu.: 0.0000  
##  Median : 9.800   Median : 5.400    Median :32.50   Median : 0.3000  
##  Mean   : 9.728   Mean   : 5.487    Mean   :32.49   Mean   : 0.7487  
##  3rd Qu.:13.600   3rd Qu.: 7.600    3rd Qu.:43.20   3rd Qu.: 0.9000  
##  Max.   :40.400   Max.   :24.300    Max.   :78.80   Max.   :10.4000  
##                                                                      
##      Bialko            Sol             Cukier         Kofeina      
##  Min.   : 0.000   Min.   :0.0000   Min.   : 0.00   Min.   :  0.00  
##  1st Qu.: 1.400   1st Qu.:0.2300   1st Qu.:19.80   1st Qu.: 20.00  
##  Median : 3.600   Median :0.3300   Median :29.70   Median : 83.00  
##  Mean   : 4.797   Mean   :0.3677   Mean   :29.58   Mean   : 96.45  
##  3rd Qu.: 7.200   3rd Qu.:0.4900   3rd Qu.:40.30   3rd Qu.:150.00  
##  Max.   :19.600   Max.   :1.6000   Max.   :68.70   Max.   :469.00  
## 

Now I turn the first column (coffee names with sizes) into record IDs (row names). Note that remove_rownames() and column_to_rownames() actually come from the tibble package, loaded here alongside dplyr:

library(dplyr)
library(tibble)  # remove_rownames(), column_to_rownames()

coffee <- coffee %>% 
  remove_rownames() %>% 
  column_to_rownames(var = "Produkt")

We can look at the correlations between the variables. This will help when selecting variables for clustering.

Correlation plot:

library(PerformanceAnalytics)  # chart.Correlation()

chart.Correlation(coffee, histogram = TRUE)

Heatmap:

col <- colorRampPalette(c("blue", "white", "red"))(20)
res <- cor(coffee)
round(res, 2)
##                   Size_oz kcal Tluszcz Tluszcze_nasycone Weglowodany Blonnik
## Size_oz              1.00 0.49    0.35              0.32        0.49    0.22
## kcal                 0.49 1.00    0.88              0.80        0.93    0.45
## Tluszcz              0.35 0.88    1.00              0.94        0.69    0.52
## Tluszcze_nasycone    0.32 0.80    0.94              1.00        0.62    0.45
## Weglowodany          0.49 0.93    0.69              0.62        1.00    0.29
## Blonnik              0.22 0.45    0.52              0.45        0.29    1.00
## Bialko               0.43 0.48    0.33              0.24        0.35    0.27
## Sol                  0.43 0.68    0.54              0.50        0.70    0.02
## Cukier               0.47 0.89    0.64              0.57        0.99    0.18
## Kofeina              0.48 0.00   -0.04             -0.04       -0.01    0.00
##                   Bialko  Sol Cukier Kofeina
## Size_oz             0.43 0.43   0.47    0.48
## kcal                0.48 0.68   0.89    0.00
## Tluszcz             0.33 0.54   0.64   -0.04
## Tluszcze_nasycone   0.24 0.50   0.57   -0.04
## Weglowodany         0.35 0.70   0.99   -0.01
## Blonnik             0.27 0.02   0.18    0.00
## Bialko              1.00 0.27   0.31    0.29
## Sol                 0.27 1.00   0.72    0.00
## Cukier              0.31 0.72   1.00   -0.06
## Kofeina             0.29 0.00  -0.06    1.00
heatmap(x = res, col = col, symm = TRUE)

For more intuitive clustering and analysis, we choose two variables: kcal and Bialko (protein). They are not highly correlated with each other (r = 0.48) and both matter greatly in people's diets (especially for those watching their figure).

coffee_kcal_protein <- coffee[, c("kcal", "Bialko")]
chart.Correlation(coffee_kcal_protein, histogram = TRUE)

First, it is worth assessing the clustering tendency:

library(clustertend)  # hopkins()
library(factoextra)   # get_clust_tendency(), fviz_* functions

hopkins(coffee, n = nrow(coffee)-1)
## $H
## [1] 0.08713878
hopkins(coffee_kcal_protein, n = nrow(coffee_kcal_protein)-1)
## $H
## [1] 0.2311419
clust_tendency_kcal_protein <- get_clust_tendency(coffee_kcal_protein, n = nrow(coffee_kcal_protein)-1, graph = FALSE)
clust_tendency_kcal_protein$hopkins_stat
## [1] 0.7701866

Note that the two functions use opposite conventions. hopkins() from the clustertend package returns values where H well below 0.5 means the data very likely contains statistically significant clusters. get_clust_tendency() from factoextra now computes the Hopkins statistic as 1 − H, so there it is values close to 1 that indicate clusters. The results agree: hopkins() returns 0.23 for the two selected variables, and get_clust_tendency() returns roughly 1 − 0.23 ≈ 0.77 (the small difference comes from random sampling).
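
For intuition, here is a minimal sketch of the standard Hopkins statistic: it compares nearest-neighbour distances of uniformly generated points against those of the real points (values near 1 suggest clusters, near 0.5 a uniform cloud). This is an illustration, not the exact implementation used by either package:

hopkins_sketch <- function(X, m = 50, seed = 123) {
  set.seed(seed)
  X <- as.matrix(X)
  n <- nrow(X); p <- ncol(X)
  # w_i: distance from a sampled real point to its nearest other real point
  idx <- sample(n, m)
  w <- sapply(idx, function(i) {
    min(sqrt(rowSums((X[-i, , drop = FALSE] -
                      matrix(X[i, ], n - 1, p, byrow = TRUE))^2)))
  })
  # u_i: distance from a uniform point in the bounding box to its nearest real point
  rng <- apply(X, 2, range)
  U <- sapply(seq_len(p), function(j) runif(m, rng[1, j], rng[2, j]))
  u <- apply(U, 1, function(pt) {
    min(sqrt(rowSums((X - matrix(pt, n, p, byrow = TRUE))^2)))
  })
  # Plain distances here; some definitions raise them to the power p
  sum(u) / (sum(u) + sum(w))
}
# hopkins_sketch(coffee_kcal_protein)  # should land near 1 - 0.23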

Here is a plot of the cluster tendency for the full dataset:

#plot cluster tendency
get_clust_tendency(coffee, 2, graph = TRUE,
                   gradient = list(low = "red", mid = "white", high = "blue"),
                   seed = 123)
## $hopkins_stat
## [1] 0.9632006
## 
## $plot

And here is the same plot for the kcal-and-protein-only data:

get_clust_tendency(coffee_kcal_protein, 2, graph = TRUE,
                   gradient = list(low = "red", mid = "white", high = "blue"),
                   seed = 123)
## $hopkins_stat
## [1] 0.6678308
## 
## $plot

Before we start estimating the number of clusters k, let’s look at the relations between the main variables:

library(plotly)  # plot_ly()

coffee.frame <- as.data.frame(coffee)
plot_ly(coffee.frame, x = ~kcal, y = ~Bialko, z = ~Kofeina, color = ~Kofeina, type = "scatter3d", mode = "markers")
plot_ly(coffee.frame, x = ~kcal, y = ~Bialko, z = ~Blonnik, color = ~Kofeina, type = "scatter3d", mode = "markers")
plot_ly(coffee.frame, x = ~kcal, y = ~Cukier, z = ~Tluszcz, color = ~Kofeina, type = "scatter3d", mode = "markers")
plot_ly(coffee.frame, x = ~kcal, y = ~Bialko, z = ~Tluszcz, color = ~Kofeina, type = "scatter3d", mode = "markers")

Now we can estimate the number of clusters.

Elbow method:

fviz_nbclust(coffee_kcal_protein, kmeans, method = "wss") +
  labs(subtitle = "Elbow method")

Silhouette method:

library(cluster)  # pam(), and later agnes(), diana(), clusGap()

fviz_nbclust(coffee_kcal_protein, pam, method = "silhouette") +
  labs(subtitle = "Silhouette method")

Gap statistic method (run here on the full dataset):

fviz_nbclust(coffee, kmeans, method = "gap_stat", nboot = 500) +
  labs(subtitle = "gap_stat")

Estimation using NbClust():

library(NbClust)

NbEstimation <- NbClust(data = coffee_kcal_protein, distance = "manhattan", max.nc = 6, method = "median")

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
## 

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 5 proposed 2 as the best number of clusters 
## * 7 proposed 3 as the best number of clusters 
## * 3 proposed 4 as the best number of clusters 
## * 7 proposed 5 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  3 
##  
##  
## *******************************************************************
NbEstimation$All.index
##       KL       CH  Hartigan     CCC    Scott      Marriot       TrCovW  TraceW
## 2 0.0899  943.343 1492.4942 27.2708 2295.845 422232944613 2.681384e+13 7456416
## 3 4.5294 1909.599  169.0217 24.6207 3250.486 371594645426 4.285570e+12 3018257
## 4 0.3593 1540.091  928.5518 12.8182 3425.588 556125198378 3.168388e+12 2587030
## 5 2.8392 2443.568   10.3641 17.2639 4095.896 449518698030 8.466935e+11 1349777
## 6 3.4775 1974.994  117.1976  8.7329 4106.453 640621708372 8.296173e+11 1336094
##   Friedman   Rubin Cindex     DB Silhouette   Duda  Pseudot2  Beale Ratkowsky
## 2  11.4020  9.4479 0.2834 0.6839     0.5116 0.3417 1651.2520 1.9245    0.3746
## 3  27.4638 23.3405 0.1881 0.6477     0.5069 0.3968  237.1842 1.5107    0.3705
## 4  30.7104 27.2311 0.3021 0.5675     0.4884 0.3392 1201.9918 1.9450    0.3401
## 5  57.1508 52.1922 0.2541 0.5906     0.5022 0.5593    8.6669 0.7222    0.3192
## 6  57.6722 52.7267 0.2541 0.6050     0.4965 0.4505  174.4357 1.2114    0.2917
##        Ball Ptbiserial    Frey McClain   Dunn Hubert SDindex  Dindex   SDbw
## 2 3728207.8     0.5537  0.9534  0.1585 0.0112      0  0.0218 69.5227 1.1172
## 3 1006085.7     0.6344 -0.0105  0.4076 0.0066      0  0.0211 44.7771 0.6735
## 4  646757.5     0.6393  1.0149  0.4068 0.0107      0  0.0204 42.5137 0.4248
## 5  269955.3     0.5850  0.7283  0.6049 0.0130      0  0.0322 30.3184 0.3279
## 6  222682.3     0.5850  0.5981  0.6050 0.0130      0  0.0322 30.1793 0.3625
NbEstimation$All.CriticalValues
##   CritValue_Duda CritValue_PseudoT2 Fvalue_Beale
## 2         0.5975           577.3314       0.1463
## 3         0.4854           165.4076       0.2224
## 4         0.5825           442.2237       0.1434
## 5        -0.0027         -4016.9908       0.4968
## 6         0.4768           156.9456       0.2993
NbEstimation$Best.nc
##                     KL       CH Hartigan     CCC    Scott      Marriot
## Number_clusters 3.0000    5.000    3.000  2.0000   3.0000            5
## Value_Index     4.5294 2443.568 1323.473 27.2708 954.6414 297709510689
##                       TrCovW  TraceW Friedman    Rubin Cindex     DB Silhouette
## Number_clusters 3.000000e+00       3   5.0000   5.0000 3.0000 4.0000     2.0000
## Value_Index     2.252827e+13 4006932  26.4404 -24.4265 0.1881 0.5675     0.5116
##                   Duda PseudoT2  Beale Ratkowsky    Ball PtBiserial Frey
## Number_clusters 5.0000       NA 2.0000    2.0000       3     4.0000    1
## Value_Index     0.5593       NA 1.9245    0.3746 2722122     0.6393   NA
##                 McClain  Dunn Hubert SDindex Dindex   SDbw
## Number_clusters  2.0000 5.000      0  4.0000      0 5.0000
## Value_Index      0.1585 0.013      0  0.0204      0 0.3279

The output of NbClust() suggests a division into 3 clusters (although 5 received equally many votes, the majority rule settles on 3).
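
As a cross-check, we can tabulate the votes ourselves from Best.nc, whose first row holds each index’s suggested k (zeros and NAs belong to the graphical indices):

table(NbEstimation$Best.nc["Number_clusters", ])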

Estimation using pamk():

library(fpc)  # pamk()

pamk.best <- pamk(coffee_kcal_protein, krange = 1:10, criterion = "asw", usepam = TRUE,
                  scaling = FALSE, alpha = 0.001, diss = inherits(coffee, "dist"), critout = FALSE)
pamk.best$nc
## [1] 2
pamk.best$pamobject$medoids
##                                                   kcal Bialko
## VANILLA_CREAM_FRAPPUCCINO®_Mini_Lactose_free_milk  165    2.4
## GINGERBREAD_COFFEE_FRAPPUCCINO®_Tall_Soy           323    5.5
pamk.best$pamobject$silinfo$clus.avg.widths
## [1] 0.5657424 0.5169302

The output of pamk() suggests a division into 2 clusters.

It is difficult to choose a specific k, so based on the plots and estimates above we compute solutions with 2, 3 and 4 clusters.

K-means algorithm:

coffee_clst_kmeans1 <- kmeans(coffee_kcal_protein, 2, nstart = 25)
coffee_clst_kmeans2 <- kmeans(coffee_kcal_protein, 3, nstart = 25)
coffee_clst_kmeans3 <- kmeans(coffee_kcal_protein, 4, nstart = 25)

fviz_cluster(coffee_clst_kmeans1, data = coffee_kcal_protein,
             geom = "point",
             ellipse.type = "euclid",
             main = "Coffees",
             ellipse.alpha = 0.2)

fviz_cluster(coffee_clst_kmeans2, data = coffee_kcal_protein,
             geom = "point",
             ellipse.type = "euclid",
             main = "Coffees",
             ellipse.alpha = 0.2)

fviz_cluster(coffee_clst_kmeans3, data = coffee_kcal_protein,
             geom = "point",
             ellipse.type = "euclid",
             main = "Coffees",
             ellipse.alpha = 0.2)

PAM algorithm:

coffee_clst_pam1 <- pam(coffee_kcal_protein,2)
coffee_clst_pam2 <- pam(coffee_kcal_protein,3)
coffee_clst_pam3 <- pam(coffee_kcal_protein,4)

fviz_cluster(coffee_clst_pam1, data = coffee_kcal_protein,
             geom = "point",
             ellipse.type = "t",
             main = "Coffees",
             ellipse.alpha = 0.2)

fviz_cluster(coffee_clst_pam2, data = coffee_kcal_protein,
             geom = "point",
             ellipse.type = "t",
             main = "Coffees",
             ellipse.alpha = 0.2)

fviz_cluster(coffee_clst_pam3, data = coffee_kcal_protein,
             geom = "point",
             ellipse.type = "t",
             main = "Coffees",
             ellipse.alpha = 0.2)

Four clusters look nice, but it is still quite difficult to choose the optimal number.
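
As an extra, rough comparison (not part of the original analysis), we can look at the share of variance explained by each k-means partition; it always grows with k, so we look for diminishing returns:

# between_SS / total_SS for each fit
sapply(list(k2 = coffee_clst_kmeans1,
            k3 = coffee_clst_kmeans2,
            k4 = coffee_clst_kmeans3),
       function(km) km$betweenss / km$totss)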

coffee_clst_kmeans1$centers
##       kcal   Bialko
## 1 344.0161 6.546774
## 2 153.3859 3.494168
coffee_clst_kmeans1$size
## [1] 434 583
coffee_clst_kmeans2$centers
##        kcal   Bialko
## 1 239.50509 4.410794
## 2 397.89655 7.806466
## 3  98.02041 3.066667
coffee_clst_kmeans2$size
## [1] 491 232 294
coffee_clst_kmeans3$centers
##        kcal   Bialko
## 1 198.75275 3.854670
## 2  77.98174 2.894977
## 3 432.64384 8.699315
## 4 299.08681 5.455556
coffee_clst_kmeans3$size
## [1] 364 219 146 288

We compute silhouette plots for the PAM clusterings:

fviz_silhouette(coffee_clst_pam1)
##   cluster size ave.sil.width
## 1       1  570          0.57
## 2       2  447          0.52

fviz_silhouette(coffee_clst_pam2)
##   cluster size ave.sil.width
## 1       1  253          0.58
## 2       2  457          0.58
## 3       3  307          0.45

fviz_silhouette(coffee_clst_pam3)
##   cluster size ave.sil.width
## 1       1  221          0.57
## 2       2  335          0.56
## 3       3  273          0.55
## 4       4  188          0.39

The cluster sizes are similar. Two clusters look best by silhouette width, but it is worth considering 3 or 4 clusters for greater business value.
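
The per-cluster widths above can be condensed into one number per solution, the overall average silhouette width, which pam() already stores:

c(k2 = coffee_clst_pam1$silinfo$avg.width,
  k3 = coffee_clst_pam2$silinfo$avg.width,
  k4 = coffee_clst_pam3$silinfo$avg.width)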

To be more confident about the result, we also cluster without fixing k in advance, using the hierarchical algorithms agnes() and diana(). AGNES works bottom-up (agglomerative), while DIANA works top-down (divisive), making it a kind of reversed AGNES.

DIANA is commonly used when looking for a few large clusters, while AGNES is better at identifying many small ones. Since we are after large clusters, I will use DIANA.
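
As a side check (not in the original analysis), AGNES reports an agglomerative coefficient and DIANA a divisive coefficient, both from the already-loaded cluster package; values closer to 1 suggest stronger clustering structure:

# Agglomerative coefficient for several AGNES linkages
sapply(c("average", "single", "complete", "ward"),
       function(m) agnes(coffee_kcal_protein, method = m)$ac)
# Divisive coefficient of the DIANA fit
diana(coffee_kcal_protein)$dc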

coffee_clst_agnes <- agnes(coffee_kcal_protein)  # AGNES fit
coffee_clst_diana <- diana(coffee_kcal_protein)  # DIANA fit
pltree(coffee_clst_diana, cex = 0.01, hang = -1, main = "Dendrogram of diana")  # plot the tree

Now we can cut the tree into 2, 3 or 4 clusters. The dendrogram clearly shows 2 or 4 large branches. Considering the business value that more diverse clusters can create, I choose k = 4.
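
To preview this cut before committing to it, factoextra’s fviz_dend() can draw the k = 4 borders directly on the dendrogram (labels hidden for readability):

fviz_dend(coffee_clst_diana, k = 4, show_labels = FALSE, rect = TRUE)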

We use cutree() to divide the data into clusters:

sub_coffee_clst_diana <- cutree(coffee_clst_diana, k = 4)
table(sub_coffee_clst_diana)
## sub_coffee_clst_diana
##   1   2   3   4 
## 590 282 122  23

Plot of the 4 new clusters:

fviz_cluster(list(data = coffee_kcal_protein, cluster = sub_coffee_clst_diana),
             geom = "point", ellipse.alpha = 0.3)

We can also estimate k for hierarchical clustering using the elbow and silhouette methods; the gap statistic (via clusGap() and fviz_gap_stat()) works for it as well.

fviz_nbclust(coffee_kcal_protein, FUN = hcut, method = "wss")

fviz_nbclust(coffee_kcal_protein, FUN = hcut, method = "silhouette")

gap_stat <- clusGap(coffee_kcal_protein, FUN = hcut, nstart = 25, K.max = 10, B = 50)
fviz_gap_stat(gap_stat)

Many methods point to k = 2 as the best number of clusters, but that choice would probably carry less business value.
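
As a closing sketch of the business angle from the introduction, the chosen clusters could be profiled and turned into a simple recommendation rule; the "light cluster" logic below is an illustrative assumption, not part of the analysis above:

# Average profile of each DIANA cluster
cluster_profile <- coffee_kcal_protein %>%
  mutate(cluster = sub_coffee_clst_diana) %>%
  group_by(cluster) %>%
  summarise(n = n(), mean_kcal = mean(kcal), mean_protein = mean(Bialko))
cluster_profile

# Toy rule: for a calorie-conscious customer, suggest drinks
# from the cluster with the lowest mean kcal
light <- cluster_profile$cluster[which.min(cluster_profile$mean_kcal)]
head(rownames(coffee_kcal_protein)[sub_coffee_clst_diana == light])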