Data source: https://github.com/jldbc/coffee-quality-database/tree/master, scraped from Coffee Quality Institute’s reviews.
I am an uneducated coffee enthusiast - I love trying new coffee, but I never know which bean bag to choose next. In the world of specialty coffee, the descriptions for consumer are often built on subjective descriptions, which sometimes aren’t informative enough. As a Data-Sciencist in training I decided to change this using data-driven approach to discover similarities in my previous and potential next morning drink.
The goal is to create clusters of beans by their expert-rated taste attributes (Aroma, Flavor, Aftertaste, Acidity, Body, Balance, and Sweetness) and then check whether information that’s actually readable on a coffee bag (mainly country of origin on the cheaper bags, also growing altitude/processing method for high end products) matters for which flavour profile cluster a coffee lands in.
If everything goes well, this study will make a choice of the next coffee bag a little easier by providing the ability to compare the previous coffee experience with a potential one, without relying on imprecise descriptions, but based on data.
arabica_data <- read_csv("arabica_data_cleaned.csv")
arabica_data
data <- arabica_data %>% select(c("Country.of.Origin","altitude_mean_meters","Processing.Method","Aroma","Flavor","Aftertaste","Acidity","Body","Balance","Sweetness"))
missing.data <- data[!complete.cases(data),]
data.numeric <- data[, sapply(data, is.numeric) & names(data) != "altitude_mean_meters"]
data.matrix <- as.matrix(data.numeric)There are some coffees with missing data on the processing method or mean alitude of their cultivation, one with missing country of origin. This is not influencing our clustering data, so they will be kept, only less valuable in interpretation.
## # A tibble: 6 × 10
## Country.of.Origin altitude_mean_meters Processing.Method Aroma Flavor
## <chr> <dbl> <chr> <dbl> <dbl>
## 1 Ethiopia 2075 Washed / Wet 8.67 8.83
## 2 Ethiopia 2075 Washed / Wet 8.75 8.67
## 3 Guatemala 1700 <NA> 8.42 8.5
## 4 Ethiopia 2000 Natural / Dry 8.17 8.58
## 5 Ethiopia 2075 Washed / Wet 8.25 8.5
## 6 Brazil NA Natural / Dry 8.58 8.42
## # ℹ 5 more variables: Aftertaste <dbl>, Acidity <dbl>, Body <dbl>,
## # Balance <dbl>, Sweetness <dbl>
Taste attributes (description based on https://www.kaggle.com/datasets/fatihb/coffee-quality-data-cqi):
Aroma – The smell of the coffee, both dry (grounds) and wet (after brewing), indicating fragrance and scent complexity
Flavor – The overall taste impression of the coffee, combining aroma, acidity, sweetness, bitterness, and other flavor notes
Aftertaste – The lingering taste left in the mouth after swallowing; its duration and pleasantness are evaluated
Acidity – The bright, lively, or tangy quality of the coffee (not sourness), contributing to vibrancy and structure
Body – The physical mouthfeel or weight of the coffee on the palate (e.g., light, creamy, heavy)
Balance – How harmoniously all flavor attributes (acidity, sweetness, body, etc.) work together without one overpowering the others
Sweetness – The natural pleasant sweetness in coffee, often associated with ripe fruit or sugar-like qualities, without harshness or bitterness
The first, classic approach to this problem is k-means clustering - an algorithm that partitions a dataset into distinct clusters by minimizing the distance of chosen metric between data points and their respective cluster centroids. There are three main methods or choosing the optimal number of clusters in this method: silhouette width, gap statistic and elbow method.
Silhouette score measures each data point fits its assigned cluster compared to neighboring clusters. For each point it calculates how similar it is to its own cluster (a) versus how similar it is to the nearest other cluster (b). The silhouette score formula is (b - a)/max(a, b). The higher the value, the better are the clusters.
The elbow method runs k-means on different values of k (numbers of clusters), calculates the Total Within-Cluster Sum of Squares (WSS) for each k and then plots the k vs WSS which is used for decision process. The goal is to identify the specific point where this decrease transitions from a sharp decline to a marginal, linear plateau. This signifies that adding more clusters no longer provides a significant improvement in data fit, so there is no point in adding more clusters.
Gap statistic compares the observed within-cluster dispersion to an expected reference distribution of the data. It measures the difference between the clustering results on the clustered dataset and a synthetic one, where no natural clusters exist. The optimal number of clusters is defined as the value of k that maximizes the gap, indicating that the clustering structure is significantly stronger than what would be found in a random distribution of points.
Let’s start with an arbitrary number of clusters (3) to explore the dataset.
## Aroma Flavor Aftertaste Acidity Body Balance Sweetness
## 1 6.780714 6.464286 6.394286 6.614286 6.620000 6.447857 5.953571
## 2 7.346169 7.259521 7.129411 7.305672 7.313904 7.261215 9.945893
## 3 7.735080 7.723833 7.609536 7.713966 7.681180 7.721936 9.945915
Silouette score seems quite low, meaning the cluster structure is weak. There is an issue - coffees have similar, very high scores for sweetness, which creates an outlier cluster for less sweet coffees, otherwise dissimilar. Sweetness is typically very high and it doesn’t offer any additional information here, Let’s get rid of it form the dataset along with a few coffees with exceptionally low sweetness score.
data.numeric <- data.numeric %>% filter(Sweetness >= 8)
data.numeric <- data.numeric %>% select(-c("Sweetness"))
data.matrix <- as.matrix(data.numeric)
data.matrix## Aroma Flavor Aftertaste Acidity Body Balance
## 1 7.887124 7.900970 7.793478 7.894849 7.826120 7.900100
## 2 7.235185 7.100236 6.970370 7.202896 7.230303 7.110000
## 3 7.577582 7.546871 7.424253 7.532361 7.521437 7.541807
K-means is sensitive to outliers, as it uses the mean as the cluster center. As we can observe, a single extreme point can distort the entire cluster significantly. K-medoids algorithm is far more robust to outliers. In the case of k-means, to obtain better clustering it will be removed. Observations nr 680 and 876 seem to completely distort the cluster structure.
data.numeric.filtered <- data.numeric[-c(680, 876), ]
data.matrix.filtered <- as.matrix(data.numeric.filtered)
km3 <- eclust(data.matrix.filtered , "kmeans", hc_metric="euclidean", k=3)## Aroma Flavor Aftertaste Acidity Body Balance
## 1 7.887124 7.900970 7.793478 7.894849 7.826120 7.900100
## 2 7.235185 7.100236 6.970370 7.202896 7.230303 7.110000
## 3 7.580899 7.546291 7.423210 7.531854 7.524237 7.540628
# km3$silinfo
# silhouette plot for k-means
sil<-silhouette(km3$cluster, dist(data.matrix.filtered))
fviz_silhouette(sil)## cluster size ave.sil.width
## 1 1 299 0.26
## 2 2 297 0.29
## 3 3 701 0.34
As we can see, visualisation and silhouette score improved, but the structure still seems artifical, as I chose the initial number of clusters arbitrarily. In order to choose the most appropriate number of clusters, gap statistic and elbow method need to be reviewed.
Silhouette method suggests a 2-cluster structure, but gap statistic suggests 1 cluster and elbow method shows no clear bend, which supports the hypothesis of no distinct clusters more.
## cluster size ave.sil.width
## 1 1 758 0.38
## 2 2 539 0.34
The visual analysis seems to be in line more with elbow method
conclusion - the clusters aren’t distinct.
This plot illustrates the dispersion within two clusters, where the height of each bar represents the maximum distance of an individual observation from its group’s centroid. Cluster 1 exhibits a higher degree of internal variance and a larger range of distance compared to cluster 2.
K -medoid divides a dataset into clusters by selecting actual, central data points called medoids as representatives. Unlike k-means, it is less sensitive to outliers because it minimizes the sum of absolute dissimilarities rather than squared distances.
## cluster size ave.sil.width
## 1 1 678 0.38
## 2 2 621 0.32
# Calinsky - Harabasz
pamk.best <- pamk(data.matrix, krange=2:10, criterion="ch", usepam=TRUE, scaling=FALSE, alpha=0.001, diss=inherits(data.matrix, "dist"), critout=FALSE) # fpc::pamk()
pamk.best## $pamobject
## Medoids:
## ID Aroma Flavor Aftertaste Acidity Body Balance
## [1,] 297 7.67 7.67 7.58 7.75 7.67 7.67
## [2,] 921 7.42 7.33 7.25 7.42 7.42 7.33
## Clustering vector:
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [75] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [112] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [149] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [186] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [223] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [260] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [297] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [334] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1
## [371] 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [408] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [445] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [482] 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1
## [519] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2
## [556] 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 1 1 1
## [593] 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [630] 1 1 1 2 1 2 2 2 2 2 2 1 2 1 1 1 1 2 1 1 2 2 2 1 2 2 2 2 1 2 2 2 2 2 1 2 1
## [667] 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 1 2 2 1 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2
## [704] 2 2 2 2 2 2 2 2 2 1 2 2 2 1 2 2 2 2 1 2 2 1 2 2 1 2 2 1 2 2 2 2 2 2 1 2 2
## [741] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2
## [778] 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 1 2
## [815] 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1
## [852] 1 1 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2
## [889] 1 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2
## [926] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1
## [963] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 1 2 2 2 2 2 2 2
## [1000] 2 2 2 2 2 2 1 2 2 2 2 2 1 2 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [1037] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2
## [1074] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2
## [1111] 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [1148] 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [1185] 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2
## [1222] 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2
## [1259] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [1296] 2 2 2 2
## Objective function:
## build swap
## 0.5701878 0.5295290
##
## Available components:
## [1] "medoids" "id.med" "clustering" "objective" "isolation"
## [6] "clusinfo" "silinfo" "diss" "call" "data"
##
## $nc
## [1] 2
##
## $crit
## [1] 0.0000 971.3197 867.7275 731.2089 623.9307 547.1169 519.7610 470.9029
## [9] 439.2365 407.5513
The silhouette score and elbow method give us similar information as in case of k-means algorithm, but gap statistic points to 4 clusters. Calinsky - Harabasz Index is defined as the ratio of the between-cluster separation (BCSS) to the within-cluster dispersion (WCSS), normalized by their number of degrees of freedom (Wikipedia). It also suggests 2 clusters. These mixed messages are worrying in terms of dataset clusterability.
## cluster size ave.sil.width
## 1 1 233 0.13
## 2 2 418 0.26
## 3 3 233 0.21
## 4 4 415 0.22
Gap statistic suggested solution performs very low in terms of visual identification and silhouette score. Both k-means and k-medoids perform poorly, as there are no clear visible clusters.
##
## Descriptive statistics by group
## INDICES: 1
## vars n mean sd median trimmed mad min max range skew kurtosis
## Aroma 1 678 7.75 0.24 7.75 7.74 0.25 7.08 8.75 1.67 0.74 1.34
## Flavor 2 678 7.75 0.23 7.75 7.72 0.25 7.17 8.83 1.66 1.32 2.83
## Aftertaste 3 678 7.63 0.23 7.58 7.62 0.24 7.00 8.67 1.67 0.97 2.24
## Acidity 4 678 7.74 0.25 7.75 7.72 0.25 6.75 8.75 2.00 0.59 1.34
## Body 5 678 7.70 0.22 7.67 7.69 0.24 7.00 8.58 1.58 0.57 1.19
## Balance 6 678 7.74 0.25 7.67 7.71 0.24 7.17 8.75 1.58 1.19 2.16
## se
## Aroma 0.01
## Flavor 0.01
## Aftertaste 0.01
## Acidity 0.01
## Body 0.01
## Balance 0.01
## ------------------------------------------------------------
## INDICES: 2
## vars n mean sd median trimmed mad min max range skew
## Aroma 1 621 7.37 0.26 7.42 7.39 0.24 5.08 8.00 2.92 -1.72
## Flavor 2 621 7.28 0.27 7.33 7.31 0.24 6.08 8.00 1.92 -1.21
## Aftertaste 3 621 7.16 0.28 7.25 7.18 0.25 6.17 7.83 1.66 -1.09
## Acidity 4 621 7.33 0.23 7.33 7.34 0.24 5.25 7.83 2.58 -1.73
## Body 5 621 7.33 0.23 7.33 7.35 0.24 5.25 8.08 2.83 -1.64
## Balance 6 621 7.29 0.28 7.33 7.29 0.24 6.17 8.33 2.16 -0.24
## kurtosis se
## Aroma 9.90 0.01
## Flavor 2.20 0.01
## Aftertaste 1.65 0.01
## Acidity 10.50 0.01
## Body 11.28 0.01
## Balance 1.81 0.01
The best shot at clustering shows that the data doesn’t vary a lot between clusters and that they are heavily overlapping. This is quite disappointing.
Hopkin’s statistic is a way of measuring the cluster tendency of a dataset. If individuals are aggregated, then its value approaches 1, and if they are randomly distributed along the value tends to 0.5 (Wikipedia). This could be a hint why do we have trouble with clustering this dataset.
data <- arabica_data %>% select(c("Country.of.Origin","altitude_mean_meters","Processing.Method","Aroma","Flavor","Aftertaste","Acidity","Body","Balance","Sweetness"))
data.numeric <- data[, sapply(data, is.numeric) & names(data) != "altitude_mean_meters"]
data.numeric <- data.numeric %>% filter(Sweetness >= 8)
data.matrix <- as.matrix(data.numeric)
data.numeric.cut <- data.numeric %>% select(-c("Sweetness"))
data.matrix.cut <- as.matrix(data.numeric.cut)get_clust_tendency(t(data.matrix), 5, graph=TRUE, gradient=list(low="red", mid="white", high="blue"), seed = 2026) # factoextra::## $hopkins_stat
## [1] 0.8476718
##
## $plot
For dataset including sweetness score, the visualization of Hopkins statistic suggests that these data possess clustering tendency, characterized by distinct structural density rather than a uniform or random distribution. The Hopkins statistic heatmap displays a sharp contrast between the deep blue blocks of high similarity and the peach-colored parts indicating high distance between potential clusters. The Hopkins statistic values is quite high at ~0,85. These results seem to validate clustering on this dataset, but let’s compare what happens when we exclude sweetness once more.
get_clust_tendency(t(data.matrix.cut), 5, graph=TRUE, gradient=list(low="red", mid="white", high="blue"), seed = 2026) # factoextra::## $hopkins_stat
## [1] 0.5403408
##
## $plot
The statistic value, dropped sharply to 0,54. This suggests that this data clustered well only with sweetness included, but that is a mistake, as it only detected coffees highly dissimilar in terms of sweetness, more detecting outliers than clusters. In this case clustering forces artificial boundaries.
High-quality arabica coffees tend to be surprisingly balanced and similar in their sensory profiles with a few outliers. This discourages any further analysis.
https://en.wikipedia.org/wiki/Calinski%E2%80%93Harabasz_index
Materials from Unsupervised Learning course at University of Warsaw, Faculty of Economic Sciences