Introduction

Data source: https://github.com/jldbc/coffee-quality-database/tree/master, scraped from Coffee Quality Institute’s reviews.

I am an uneducated coffee enthusiast - I love trying new coffee, but I never know which bean bag to choose next. In the world of specialty coffee, the descriptions for consumer are often built on subjective descriptions, which sometimes aren’t informative enough. As a Data-Sciencist in training I decided to change this using data-driven approach to discover similarities in my previous and potential next morning drink.

The goal is to create clusters of beans by their expert-rated taste attributes (Aroma, Flavor, Aftertaste, Acidity, Body, Balance, and Sweetness) and then check whether information that’s actually readable on a coffee bag (mainly country of origin on the cheaper bags, also growing altitude/processing method for high end products) matters for which flavour profile cluster a coffee lands in.

If everything goes well, this study will make a choice of the next coffee bag a little easier by providing the ability to compare the previous coffee experience with a potential one, without relying on imprecise descriptions, but based on data.

Installing and loading necessary packages

library(hopkins)
library(cluster)
library(factoextra)
library(flexclust)
library(dplyr)
library(psych)
library(readr)
library(corrplot)
library(fpc)

Data

arabica_data <- read_csv("arabica_data_cleaned.csv")
arabica_data
data <- arabica_data %>% select(c("Country.of.Origin","altitude_mean_meters","Processing.Method","Aroma","Flavor","Aftertaste","Acidity","Body","Balance","Sweetness"))

missing.data <- data[!complete.cases(data),]
data.numeric <- data[, sapply(data, is.numeric) & names(data) != "altitude_mean_meters"]
data.matrix <- as.matrix(data.numeric)

There are some coffees with missing data on the processing method or mean alitude of their cultivation, one with missing country of origin. This is not influencing our clustering data, so they will be kept, only less valuable in interpretation.

head(data)
## # A tibble: 6 × 10
##   Country.of.Origin altitude_mean_meters Processing.Method Aroma Flavor
##   <chr>                            <dbl> <chr>             <dbl>  <dbl>
## 1 Ethiopia                          2075 Washed / Wet       8.67   8.83
## 2 Ethiopia                          2075 Washed / Wet       8.75   8.67
## 3 Guatemala                         1700 <NA>               8.42   8.5 
## 4 Ethiopia                          2000 Natural / Dry      8.17   8.58
## 5 Ethiopia                          2075 Washed / Wet       8.25   8.5 
## 6 Brazil                              NA Natural / Dry      8.58   8.42
## # ℹ 5 more variables: Aftertaste <dbl>, Acidity <dbl>, Body <dbl>,
## #   Balance <dbl>, Sweetness <dbl>

Taste attributes (description based on https://www.kaggle.com/datasets/fatihb/coffee-quality-data-cqi):

  • Aroma – The smell of the coffee, both dry (grounds) and wet (after brewing), indicating fragrance and scent complexity

  • Flavor – The overall taste impression of the coffee, combining aroma, acidity, sweetness, bitterness, and other flavor notes

  • Aftertaste – The lingering taste left in the mouth after swallowing; its duration and pleasantness are evaluated

  • Acidity – The bright, lively, or tangy quality of the coffee (not sourness), contributing to vibrancy and structure

  • Body – The physical mouthfeel or weight of the coffee on the palate (e.g., light, creamy, heavy)

  • Balance – How harmoniously all flavor attributes (acidity, sweetness, body, etc.) work together without one overpowering the others

  • Sweetness – The natural pleasant sweetness in coffee, often associated with ripe fruit or sugar-like qualities, without harshness or bitterness

K-means clustering

The first, classic approach to this problem is k-means clustering - an algorithm that partitions a dataset into distinct clusters by minimizing the distance of chosen metric between data points and their respective cluster centroids. There are three main methods or choosing the optimal number of clusters in this method: silhouette width, gap statistic and elbow method.

Silhouette score measures each data point fits its assigned cluster compared to neighboring clusters. For each point it calculates how similar it is to its own cluster (a) versus how similar it is to the nearest other cluster (b). The silhouette score formula is (b - a)/max(a, b). The higher the value, the better are the clusters.

The elbow method runs k-means on different values of k (numbers of clusters), calculates the Total Within-Cluster Sum of Squares (WSS) for each k and then plots the k vs WSS which is used for decision process. The goal is to identify the specific point where this decrease transitions from a sharp decline to a marginal, linear plateau. This signifies that adding more clusters no longer provides a significant improvement in data fit, so there is no point in adding more clusters.

Gap statistic compares the observed within-cluster dispersion to an expected reference distribution of the data. It measures the difference between the clustering results on the clustered dataset and a synthetic one, where no natural clusters exist. The optimal number of clusters is defined as the value of k that maximizes the gap, indicating that the clustering structure is significantly stronger than what would be found in a random distribution of points.

Let’s start with an arbitrary number of clusters (3) to explore the dataset.

km1 <- eclust(data.matrix, "kmeans", hc_metric="euclidean", k=3)

fviz_cluster(km1)

km1$centers
##      Aroma   Flavor Aftertaste  Acidity     Body  Balance Sweetness
## 1 6.780714 6.464286   6.394286 6.614286 6.620000 6.447857  5.953571
## 2 7.346169 7.259521   7.129411 7.305672 7.313904 7.261215  9.945893
## 3 7.735080 7.723833   7.609536 7.713966 7.681180 7.721936  9.945915
# km1$silinfo

Silouette score seems quite low, meaning the cluster structure is weak. There is an issue - coffees have similar, very high scores for sweetness, which creates an outlier cluster for less sweet coffees, otherwise dissimilar. Sweetness is typically very high and it doesn’t offer any additional information here, Let’s get rid of it form the dataset along with a few coffees with exceptionally low sweetness score.

data.numeric <- data.numeric %>% filter(Sweetness >= 8)
data.numeric <- data.numeric %>% select(-c("Sweetness"))
data.matrix <- as.matrix(data.numeric)
data.matrix
km2 <- eclust(data.matrix, "kmeans", hc_metric="euclidean", k=3)

km2$centers
##      Aroma   Flavor Aftertaste  Acidity     Body  Balance
## 1 7.887124 7.900970   7.793478 7.894849 7.826120 7.900100
## 2 7.235185 7.100236   6.970370 7.202896 7.230303 7.110000
## 3 7.577582 7.546871   7.424253 7.532361 7.521437 7.541807
# km2$silinfo

K-means is sensitive to outliers, as it uses the mean as the cluster center. As we can observe, a single extreme point can distort the entire cluster significantly. K-medoids algorithm is far more robust to outliers. In the case of k-means, to obtain better clustering it will be removed. Observations nr 680 and 876 seem to completely distort the cluster structure.

data.numeric.filtered <- data.numeric[-c(680, 876), ]
data.matrix.filtered <- as.matrix(data.numeric.filtered)

km3 <- eclust(data.matrix.filtered , "kmeans", hc_metric="euclidean", k=3)

km3$centers
##      Aroma   Flavor Aftertaste  Acidity     Body  Balance
## 1 7.887124 7.900970   7.793478 7.894849 7.826120 7.900100
## 2 7.235185 7.100236   6.970370 7.202896 7.230303 7.110000
## 3 7.580899 7.546291   7.423210 7.531854 7.524237 7.540628
# km3$silinfo

# silhouette plot for k-means
sil<-silhouette(km3$cluster, dist(data.matrix.filtered))
fviz_silhouette(sil)
##   cluster size ave.sil.width
## 1       1  299          0.26
## 2       2  297          0.29
## 3       3  701          0.34

As we can see, visualisation and silhouette score improved, but the structure still seems artifical, as I chose the initial number of clusters arbitrarily. In order to choose the most appropriate number of clusters, gap statistic and elbow method need to be reviewed.

Optimal number of clusters

fviz_nbclust(data.matrix.filtered, FUNcluster=kmeans, method = "silhouette")

fviz_nbclust(data.matrix.filtered, FUNcluster=kmeans, method="gap_stat") + theme_classic()

fviz_nbclust(data.matrix.filtered, FUNcluster=kmeans, method="wss")

Silhouette method suggests a 2-cluster structure, but gap statistic suggests 1 cluster and elbow method shows no clear bend, which supports the hypothesis of no distinct clusters more.

km4 <- eclust(data.matrix.filtered, "kmeans", hc_metric="euclidean", k=2)

sil<-silhouette(km4$cluster, dist(data.matrix.filtered))
fviz_silhouette(sil)
##   cluster size ave.sil.width
## 1       1  758          0.38
## 2       2  539          0.34

The visual analysis seems to be in line more with elbow method conclusion - the clusters aren’t distinct.

d1<- flexclust::cclust(data.matrix.filtered, 2, dist="euclidean")
stripes(d1) #flexclust::

This plot illustrates the dispersion within two clusters, where the height of each bar represents the maximum distance of an individual observation from its group’s centroid. Cluster 1 exhibits a higher degree of internal variance and a larger range of distance compared to cluster 2.

K-medoids (Partitioning Around Medoids) clustering

K -medoid divides a dataset into clusters by selecting actual, central data points called medoids as representatives. Unlike k-means, it is less sensitive to outliers because it minimizes the sum of absolute dissimilarities rather than squared distances.

pam1<-eclust(data.matrix, "pam", k=2)

fviz_silhouette(pam1)
##   cluster size ave.sil.width
## 1       1  678          0.38
## 2       2  621          0.32

Optimal number of clusters

fviz_nbclust(data.matrix, FUNcluster=cluster::pam, method = "silhouette")

fviz_nbclust(data.matrix, FUNcluster=cluster::pam, method="gap_stat") + theme_classic()

fviz_nbclust(data.matrix, FUNcluster=cluster::pam, method="wss")

# Calinsky - Harabasz 

pamk.best <- pamk(data.matrix, krange=2:10, criterion="ch", usepam=TRUE, scaling=FALSE, alpha=0.001, diss=inherits(data.matrix, "dist"), critout=FALSE) # fpc::pamk()
pamk.best
## $pamobject
## Medoids:
##       ID Aroma Flavor Aftertaste Acidity Body Balance
## [1,] 297  7.67   7.67       7.58    7.75 7.67    7.67
## [2,] 921  7.42   7.33       7.25    7.42 7.42    7.33
## Clustering vector:
##    [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##   [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##   [75] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [112] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [149] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [186] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [223] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [260] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [297] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [334] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1
##  [371] 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [408] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [445] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [482] 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1
##  [519] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2
##  [556] 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 1 1 1
##  [593] 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [630] 1 1 1 2 1 2 2 2 2 2 2 1 2 1 1 1 1 2 1 1 2 2 2 1 2 2 2 2 1 2 2 2 2 2 1 2 1
##  [667] 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 1 2 2 1 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [704] 2 2 2 2 2 2 2 2 2 1 2 2 2 1 2 2 2 2 1 2 2 1 2 2 1 2 2 1 2 2 2 2 2 2 1 2 2
##  [741] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2
##  [778] 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 1 2
##  [815] 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1
##  [852] 1 1 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2
##  [889] 1 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2
##  [926] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1
##  [963] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 1 2 2 2 2 2 2 2
## [1000] 2 2 2 2 2 2 1 2 2 2 2 2 1 2 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [1037] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2
## [1074] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2
## [1111] 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [1148] 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [1185] 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2
## [1222] 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2
## [1259] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [1296] 2 2 2 2
## Objective function:
##     build      swap 
## 0.5701878 0.5295290 
## 
## Available components:
##  [1] "medoids"    "id.med"     "clustering" "objective"  "isolation" 
##  [6] "clusinfo"   "silinfo"    "diss"       "call"       "data"      
## 
## $nc
## [1] 2
## 
## $crit
##  [1]   0.0000 971.3197 867.7275 731.2089 623.9307 547.1169 519.7610 470.9029
##  [9] 439.2365 407.5513

The silhouette score and elbow method give us similar information as in case of k-means algorithm, but gap statistic points to 4 clusters. Calinsky - Harabasz Index is defined as the ratio of the between-cluster separation (BCSS) to the within-cluster dispersion (WCSS), normalized by their number of degrees of freedom (Wikipedia). It also suggests 2 clusters. These mixed messages are worrying in terms of dataset clusterability.

pam2<-eclust(data.matrix, "pam", k=4)

fviz_silhouette(pam2)
##   cluster size ave.sil.width
## 1       1  233          0.13
## 2       2  418          0.26
## 3       3  233          0.21
## 4       4  415          0.22

Gap statistic suggested solution performs very low in terms of visual identification and silhouette score. Both k-means and k-medoids perform poorly, as there are no clear visible clusters.

fviz_cluster(pam1, geom="point", ellipse.type="norm") # factoextra::

cluster_summary <- describeBy(data.matrix, group = pam1$cluster)
cluster_summary
## 
##  Descriptive statistics by group 
## INDICES: 1
##            vars   n mean   sd median trimmed  mad  min  max range skew kurtosis
## Aroma         1 678 7.75 0.24   7.75    7.74 0.25 7.08 8.75  1.67 0.74     1.34
## Flavor        2 678 7.75 0.23   7.75    7.72 0.25 7.17 8.83  1.66 1.32     2.83
## Aftertaste    3 678 7.63 0.23   7.58    7.62 0.24 7.00 8.67  1.67 0.97     2.24
## Acidity       4 678 7.74 0.25   7.75    7.72 0.25 6.75 8.75  2.00 0.59     1.34
## Body          5 678 7.70 0.22   7.67    7.69 0.24 7.00 8.58  1.58 0.57     1.19
## Balance       6 678 7.74 0.25   7.67    7.71 0.24 7.17 8.75  1.58 1.19     2.16
##              se
## Aroma      0.01
## Flavor     0.01
## Aftertaste 0.01
## Acidity    0.01
## Body       0.01
## Balance    0.01
## ------------------------------------------------------------ 
## INDICES: 2
##            vars   n mean   sd median trimmed  mad  min  max range  skew
## Aroma         1 621 7.37 0.26   7.42    7.39 0.24 5.08 8.00  2.92 -1.72
## Flavor        2 621 7.28 0.27   7.33    7.31 0.24 6.08 8.00  1.92 -1.21
## Aftertaste    3 621 7.16 0.28   7.25    7.18 0.25 6.17 7.83  1.66 -1.09
## Acidity       4 621 7.33 0.23   7.33    7.34 0.24 5.25 7.83  2.58 -1.73
## Body          5 621 7.33 0.23   7.33    7.35 0.24 5.25 8.08  2.83 -1.64
## Balance       6 621 7.29 0.28   7.33    7.29 0.24 6.17 8.33  2.16 -0.24
##            kurtosis   se
## Aroma          9.90 0.01
## Flavor         2.20 0.01
## Aftertaste     1.65 0.01
## Acidity       10.50 0.01
## Body          11.28 0.01
## Balance        1.81 0.01

The best shot at clustering shows that the data doesn’t vary a lot between clusters and that they are heavily overlapping. This is quite disappointing.

Dataset clusterability

Hopkin’s statistic

Hopkin’s statistic is a way of measuring the cluster tendency of a dataset. If individuals are aggregated, then its value approaches 1, and if they are randomly distributed along the value tends to 0.5 (Wikipedia). This could be a hint why do we have trouble with clustering this dataset.

data <- arabica_data %>% select(c("Country.of.Origin","altitude_mean_meters","Processing.Method","Aroma","Flavor","Aftertaste","Acidity","Body","Balance","Sweetness"))
data.numeric <- data[, sapply(data, is.numeric) & names(data) != "altitude_mean_meters"]
data.numeric <- data.numeric %>% filter(Sweetness >= 8)
data.matrix <- as.matrix(data.numeric)

data.numeric.cut <- data.numeric %>% select(-c("Sweetness"))
data.matrix.cut <- as.matrix(data.numeric.cut)
get_clust_tendency(t(data.matrix), 5, graph=TRUE, gradient=list(low="red", mid="white", high="blue"), seed = 2026) # factoextra::
## $hopkins_stat
## [1] 0.8476718
## 
## $plot

For dataset including sweetness score, the visualization of Hopkins statistic suggests that these data possess clustering tendency, characterized by distinct structural density rather than a uniform or random distribution. The Hopkins statistic heatmap displays a sharp contrast between the deep blue blocks of high similarity and the peach-colored parts indicating high distance between potential clusters. The Hopkins statistic values is quite high at ~0,85. These results seem to validate clustering on this dataset, but let’s compare what happens when we exclude sweetness once more.

get_clust_tendency(t(data.matrix.cut), 5, graph=TRUE, gradient=list(low="red", mid="white", high="blue"), seed = 2026) # factoextra::
## $hopkins_stat
## [1] 0.5403408
## 
## $plot

The statistic value, dropped sharply to 0,54. This suggests that this data clustered well only with sweetness included, but that is a mistake, as it only detected coffees highly dissimilar in terms of sweetness, more detecting outliers than clusters. In this case clustering forces artificial boundaries.

Conclusions

High-quality arabica coffees tend to be surprisingly balanced and similar in their sensory profiles with a few outliers. This discourages any further analysis.

Bilbiography: