Dimension Reduction and Clustering on Coffee Quality Measures

Coffee is inseparable part of exam session season and for may people, inseparable part of every-day life. This study is based on dataset from Kaggle [https://www.kaggle.com/datasets/volpatto/coffee-quality-database-from-cqi/data] that gathers 1339 observations from Coffee Quality Institute about multiple features of different coffee beans.

The analysis aims to: a) perform dimension reduction on coffee quality features using Principal Component Analysis, b) investigate if different types of coffee described in the dataset differ significantly or are rather uniform. c) see if results from dimension reduction using PCA and clustering have something in common and drive conclusions form it.

Data preprocessing an analysis

I created a subset of data containing only numeric values for parameters quality that I want to analyze and dropped NA values as they were appearing in variable pointing at average harvest altitude, which on my opinion may be a relevant factor. Moreover, only 230 rows with NAs appeared, which doesn’t make big impact on the dataset. Additionally I also manually fixed 2 outliers, which were obviously mistakenly 100 times bigger than they should.

coffee<-read.csv("merged_data_cleaned.csv")

coffee_clean <- coffee %>%
  select(Aroma, Flavor, Aftertaste, Acidity, Body, Balance, Uniformity, Clean.Cup, Sweetness, Cupper.Points, Moisture, altitude_mean_meters) %>%
  na.omit() 
# Correct altitude values based on their original incorrect values
coffee_clean$altitude_mean_meters[which(coffee_clean$altitude_mean_meters == 190164)] <- 1901.64
coffee_clean$altitude_mean_meters[which(coffee_clean$altitude_mean_meters == 110000)] <- 1100
summary(coffee_clean)

##      Aroma           Flavor        Aftertaste       Acidity     
##  Min.   :0.000   Min.   :0.000   Min.   :0.000   Min.   :0.000  
##  1st Qu.:7.420   1st Qu.:7.330   1st Qu.:7.250   1st Qu.:7.330  
##  Median :7.580   Median :7.580   Median :7.420   Median :7.500  
##  Mean   :7.571   Mean   :7.521   Mean   :7.394   Mean   :7.529  
##  3rd Qu.:7.750   3rd Qu.:7.750   3rd Qu.:7.580   3rd Qu.:7.750  
##  Max.   :8.750   Max.   :8.830   Max.   :8.670   Max.   :8.750  
##       Body          Balance        Uniformity       Clean.Cup     
##  Min.   :0.000   Min.   :0.000   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.:7.330   1st Qu.:7.330   1st Qu.:10.000   1st Qu.:10.000  
##  Median :7.500   Median :7.500   Median :10.000   Median :10.000  
##  Mean   :7.507   Mean   :7.506   Mean   : 9.868   Mean   : 9.851  
##  3rd Qu.:7.670   3rd Qu.:7.750   3rd Qu.:10.000   3rd Qu.:10.000  
##  Max.   :8.580   Max.   :8.750   Max.   :10.000   Max.   :10.000  
##    Sweetness      Cupper.Points       Moisture       altitude_mean_meters
##  Min.   : 0.000   Min.   : 0.000   Min.   :0.00000   Min.   :    1       
##  1st Qu.:10.000   1st Qu.: 7.250   1st Qu.:0.10000   1st Qu.: 1100       
##  Median :10.000   Median : 7.500   Median :0.11000   Median : 1311       
##  Mean   : 9.874   Mean   : 7.489   Mean   :0.09205   Mean   : 1337       
##  3rd Qu.:10.000   3rd Qu.: 7.750   3rd Qu.:0.12000   3rd Qu.: 1600       
##  Max.   :10.000   Max.   :10.000   Max.   :0.20000   Max.   :11000

I also standarized data so to avoid potential problems that may appear because scale difference.

preproc <- preProcess(coffee_clean, method=c("center", "scale"))
coffee_scaled <- predict(preproc, coffee_clean)

Correlation

I created correlation matrix to investigate the relationship of the variables

Variables Aftertaste, Balance and Flavor are triggering the highest correlation score, which may negatively impact further analysis, so I decided to drop them. This leaves us with 10 dimensions.

Principal Component Analysis (PCA)

The aim for using PCA is to reduce dimensions of data but keep as much original information as possible in the same time - models usually are considered satisfactory when cumulative variance explained by Principal Components exceeds the threshold of 70%.

pca1<-prcomp(coffee_done, center=FALSE, scale.=FALSE) 
summary(pca1)

## Importance of components:
##                           PC1    PC2    PC3    PC4     PC5    PC6     PC7
## Standard deviation     2.0008 1.1281 0.9891 0.9532 0.74260 0.7029 0.53184
## Proportion of Variance 0.4448 0.1414 0.1087 0.1010 0.06127 0.0549 0.03143
## Cumulative Proportion  0.4448 0.5862 0.6949 0.7959 0.85716 0.9121 0.94348
##                            PC8     PC9
## Standard deviation     0.52364 0.48419
## Proportion of Variance 0.03047 0.02605
## Cumulative Proportion  0.97395 1.00000

First 4 PCs explain close to 80% of the data, which would seem as a good choice, but let’s check Kaiser’s Stopping Rule and elbowpoint.

coffee_done.cov<-cov(coffee_done)
coffee_done.eigen<-eigen(coffee_done.cov)
eigenvalues <-coffee_done.eigen$values
eigenvalues

## [1] 4.0033025 1.2726928 0.9783224 0.9086513 0.5514587 0.4940823 0.2828514
## [8] 0.2741945 0.2344440

Kaiser’s criterion suggests that only first 2 PCs should be chosen, as 3rd one already hav eigen value below1, which means that it doesn’t explain more than a single variable itself, but then model may be over simplified as it would explain only 58.3% of variance, so deciding for 3 PCs is a better option in that case. In theory, 3 PCs does not meet theoretical threshold of 70% to assess the dimension reduction as satisfactory, but it is close enough to 70% to let me decide for keeping 3 components.

Graphical analysis

Positively correlated variables are oriented in the same direction on the plot, while negatively correlated variables point in opposite directions.

Additionally, the distance of a variable from the origin indicates how well it is represented — variables farther from the origin have a stronger representation in the principal component space.

This being said, we can see that Aroma, Acidity, Body and Cupper Points have strong positive correlation (as Cupper Points is a general grade of the coffee cup, this is not unexpected), similarly Uniformity, Clean Cup and Sweetness. Moreover we can see that altitude is not that strongly represented in these dimensions.

As previously said, I’d rather keep analysis with 3 PCs, so let’s see 3D plot.

Here it’s visible that altitude is main trigger of the PC3. We can also visualize contribution of variables to each PC separately:

Main contributors of PC1 are: Acidity, Aroma, Cupper Points and Body, which suggests that this PC identifies general coffee quality, the second component is about sweetness and texture (moisture, uniformity and Clean Cup) and third component is triggered mainly by altitude with smaller contribution of moisture.

Individual observations

Let’s see how individual observations with similar profile group on the plot:

PCA Conclusions

PCA provided us with some valuable and interesting observations about, how variables group with each other. Based on plots we may come to the conclusion that coffee sweetness and texture is not very much correlated with it’s final quality grade (Cupper Points), however acidity is very strongly. Moreover from 2D plot we may also conclude that the higher beans grow, the better end grade of the coffee, but 3D plot provided additional insight that altitude (however still positively correlated with Cupper Points) triggers mainly 3rd dimension

Clustering

For the formality, I checked Hopkins statistic to be sure that data is clusterable and is not randomly distributed. I tried different clustering methods of differently processed data frames to compare their results and drive conclusion which would be the best and how PCA impacts it.

hopkins_test_clean <- get_clust_tendency(coffee_clean, n = nrow(coffee_clean) - 1, graph = TRUE)
print(hopkins_test_clean$hopkins_stat)

## [1] 0.9923984

# 0.9923984 ->good for clustering

hopkins_pca <- get_clust_tendency(pca1$x[, 1:3], n = nrow(pca1$x) - 1, graph = TRUE)
print(hopkins_pca$hopkins_stat)

## [1] 0.9791923

#0.9791923 -> good for clustering

hopkins_pca_2 <- get_clust_tendency(pca1$x[, 1:2], n = nrow(pca1$x) - 1, graph = TRUE)
print(hopkins_pca_2$hopkins_stat)

## [1] 0.9830112

#0.9830112 -> good for clustering

Let’s start with KMeans.

First I determined optimal number of clusters using elbow point method and Silhouette method.

For coffee_clean df (Without outliers and extremely high correlated variables, but before scaling and PCA):

opt_elbow_clean <- Optimal_Clusters_KMeans(coffee_clean, max_clusters = 10, plot_clusters = TRUE)

opt_sil_clean <- Optimal_Clusters_KMeans(coffee_clean, max_clusters = 10, plot_clusters = TRUE, criterion="silhouette")

Here elbow point method suggests 2 or 3 clusters, while Silhouette clearly stand for 2

For data transformed by PCA:

opt_pca_elbow <- Optimal_Clusters_KMeans(pca1$x[,1:3], max_clusters = 10, plot_clusters = TRUE)

opt_pca_sil <- Optimal_Clusters_KMeans(pca1$x[,1:3], max_clusters = 10, plot_clusters = TRUE, criterion="silhouette")

For PCA transformed data situation is basically the same. Elbow point analysis would drive us towards conclusion about 2 or 3 clusters, and Silhouette chart decides for 2. In this case, I’m confident to decide that 2 is the best option

Let’s compare if clusters on clean data differ significantly from ones on PCA.

set.seed(123)  
kmeans_pca <- kmeans(pca1$x[, 1:3], centers = 2, nstart = 25, iter.max = 100)  # Assuming k=2

kmeans_clean <- kmeans(coffee_clean, centers = 2, nstart = 25, iter.max = 100)  # Assuming k=2

clean_summary <- aggregate(coffee_clean, by = list(Cluster = kmeans_clean$cluster), mean)

pca_summary <- aggregate(coffee_clean, by = list(Cluster = kmeans_pca$cluster), mean)

print(clean_summary)

##   Cluster    Aroma   Flavor Aftertaste  Acidity     Body  Balance Uniformity
## 1       1 7.523664 7.472185   7.339714 7.477294 7.472050 7.452353   9.847966
## 2       2 7.624883 7.576848   7.457782 7.589183 7.546887 7.567588   9.890914
##   Clean.Cup Sweetness Cupper.Points   Moisture altitude_mean_meters
## 1  9.799328  9.867983      7.429714 0.09467227             1014.931
## 2  9.911751  9.880428      7.557471 0.08900778             1710.506

print(pca_summary)

##   Cluster    Aroma   Flavor Aftertaste  Acidity     Body  Balance Uniformity
## 1       1 7.331652 7.252149   7.125045 7.284140 7.276448 7.232692   9.775023
## 2       2 7.728906 7.698651   7.572954 7.691514 7.659340 7.686717   9.929400
##   Clean.Cup Sweetness Cupper.Points   Moisture altitude_mean_meters
## 1  9.692172  9.868348      7.153190 0.10726244             1177.881
## 2  9.956972  9.877331      7.711409 0.08196402             1442.969

In both scenarios clusters are very similar, which suggests that PCA indeed kept a majority of information.

Cluster 2 generally seems like “better quality coffee”, however differences are very minor, I wouldn’t resign from interpreting them as this is the specification of the dataset. Each variable has very narrow spread in values, except altitude. We also may suppose that the main driver of clusters is altitude as it’s where clusters show the highest difference but it also is a variable with the videst spread.

##Visualisation of KMeans clusters

However I decided to use 3 PCs from PCA dimension reduction, I wanted to perform the visualization also on set with 2 PCs to explore differences.

Clusters are indeed mainly defined by first dimension, which indicates coffee quality, as assumed before. We can also see some outliers that weren’t detected before.

Visually it may seem like PCA with only 2 PCs works the best for kmeans,as there’s no overlap but it’s very obvious result as it takes only 2 dimensions which is simpler to visualize. Overlap on the first plot may be trigger by many dimensions and small spread of particular variables.

To compare clustering quality I’m checking average Silhouette width for all 3 scenarios.

##   cluster size ave.sil.width
## 1       1  595          0.48
## 2       2  514          0.53

##   cluster size ave.sil.width
## 1       1  442          0.27
## 2       2  667          0.35

##   cluster size ave.sil.width
## 1       1  663          0.45
## 2       2  446          0.33

Original (clean) data shows the highest average Silhouette width, meaning they’d be the best choice.

I decided to replicate the same process for CLARA method to compare results.

##   cluster size ave.sil.width
## 1       1  445          0.56
## 2       2  664          0.48

##   cluster size ave.sil.width
## 1       1  567          0.33
## 2       2  542          0.28

##   cluster size ave.sil.width
## 1       1  571          0.45
## 2       2  538          0.31

Results from CLARA match exactly the ones from KMeans, but we can see that CLARA deals slightly better with visualizing clusters on clean data.

Let’s see slightly different plot for CLARA and compare medoid points for each variable.

## Call:     clara(x = coffee_clean, k = 2, metric = "euclidean", stand = FALSE, samples = 6, sampsize = 50, trace = 0, medoids.x = TRUE, rngR = FALSE, pamLike = FALSE, correct.d = TRUE) 
## Medoids:
##      Aroma Flavor Aftertaste Acidity Body Balance Uniformity Clean.Cup
## 244   7.67   7.83       7.58    7.92 7.75    7.67         10        10
## 1004  7.50   7.25       7.00    7.25 7.50    7.25         10        10
##      Sweetness Cupper.Points Moisture altitude_mean_meters
## 244         10          7.58     0.12              1706.88
## 1004        10          7.25     0.11              1219.20
## Objective function:   227.9888
## Clustering vector:    Named int [1:1109] 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 2 1 1 ...
##  - attr(*, "names")= chr [1:1109] "1" "2" "3" "4" "5" "8" "9" ...
## Cluster sizes:            410 699 
## Best sample:
##  [1] 10   44   57   79   88   102  110  190  208  210  236  244  289  464  467 
## [16] 478  481  497  546  577  578  611  612  683  701  717  730  732  775  786 
## [31] 797  802  848  851  853  860  916  925  958  979  1004 1108 1120 1122 1124
## [46] 1135 1150 1187 1206 1239
## 
## Available components:
##  [1] "sample"     "medoids"    "i.med"      "clustering" "objective" 
##  [6] "clusinfo"   "diss"       "call"       "silinfo"    "data"

Three variables show same medoids across clusters, which indicates why the quality of clustering on this dataset isn’t as good as I hoped. Those are variables responsible for PC2 from dimension reduction, which makes it more obvious, why clusters are based mostly on first dimension.

DBSCAN

#clean
kNNdistplot(coffee_clean, k = 11)
abline(h =90 , col = "grey", lty = 2)

set.seed(123)
db_orig <- dbscan(coffee_clean, eps = 90, MinPts = 11)  # Adjust `eps` as needed

fviz_cluster(list(data = coffee_clean, cluster = db_orig$cluster)) +
  ggtitle("DBSCAN Clustering on Original Data")

#pca
kNNdistplot(pca_3pc, k = 10)
abline(h =2 , col = "grey", lty = 2)

set.seed(123)
db_orig <- dbscan(pca_3pc, eps = 2, MinPts = 10)  # Adjust `eps` as needed

fviz_cluster(list(data = pca_3pc, cluster = db_orig$cluster)) +
  ggtitle("DBSCAN Clustering on OPA (3PCs)")

#pca2pc

kNNdistplot(pca_2pc, k = 10)
abline(h =2 , col = "grey", lty = 2)

set.seed(123)
db_orig <- dbscan(pca_2pc, eps = 2, MinPts = 10)  # Adjust `eps` as needed

fviz_cluster(list(data = pca_2pc, cluster = db_orig$cluster)) +
  ggtitle("DBSCAN Clustering on PCA (2PCs)")

DBSCAN seems to not see clusters at all, most of tries it shown 1 cluster plus noise. Only in clean data it sees 2 clusters, however second one is much smaller. The visualization again may fail because of many dimensions and little differences in values of the variables

Clearer visualization of clean data clusters:

res.fpc <- fpc::dbscan(coffee_clean, eps = 90 , MinPts = 11)

hullplot(coffee_clean, res.fpc, solid=TRUE, alpha=0.7)

As the widest spread variable is altitude,it may be worth checking how clusters look on the world map. For that I added back column Country of origin to coffee_clean, which was preprocessed earlier to show only numeric data for analysis purposes, and standarized it in terms of country naming convention to match map package data.

Data preparation code:

db_orig <- dbscan(coffee_clean, eps = 90, MinPts = 11)  # Adjust `eps` as needed
coffee_clean$cluster_dbscan <- db_orig$cluster 

coffee_clean_with_country <- coffee %>%
  filter(complete.cases(select(coffee, Aroma, Flavor, Aftertaste, Acidity, 
                               Body, Balance, Uniformity, Clean.Cup, Sweetness, 
                               Cupper.Points, Moisture, altitude_mean_meters))) %>%
  select(Country.of.Origin) %>%
  mutate(cluster_dbscan = coffee_clean$cluster_dbscan)  
country_clusters <- coffee_clean_with_country %>%
  group_by(Country.of.Origin, cluster_dbscan) %>%
  summarise(count = n(), .groups = "drop")

country_clusters <- country_clusters %>%
  group_by(Country.of.Origin) %>%
  mutate(percent = count / sum(count) * 100)

world_map <- map_data("world")


country_clusters <- country_clusters %>%
  mutate(Country.of.Origin = case_when(
    Country.of.Origin == "United States" ~ "USA",
    Country.of.Origin == "Puerto Rico" ~ "Puerto Rico", 
    Country.of.Origin == "Tanzania" ~ "Tanzania",
    Country.of.Origin == "Ivory Coast" ~ "Ivory Coast",
    Country.of.Origin == "Cote d?Ivoire" ~ "Ivory Coast",
    Country.of.Origin == "Taiwan" ~ "Taiwan",
    Country.of.Origin == "Myanmar" ~ "Myanmar",
    Country.of.Origin == "Vietnam" ~ "Vietnam",
    Country.of.Origin == "Zambia" ~ "Zambia",
    Country.of.Origin == "El Salvador" ~ "El Salvador",
    Country.of.Origin == "Papua New Guinea" ~ "Papua New Guinea",
    Country.of.Origin == "Costa Rica" ~ "Costa Rica",
    Country.of.Origin == "Ethiopia" ~ "Ethiopia",
    Country.of.Origin == "Guatemala" ~ "Guatemala",
    TRUE ~ Country.of.Origin  # Keep others unchanged
  ))

country_clusters_percent <- country_clusters %>%
  group_by(`Country.of.Origin`, cluster_dbscan) %>%
  summarise(count = n(), .groups = "drop") %>%  # Count occurrences per country
  group_by(`Country.of.Origin`) %>%
  mutate(percent = (count / sum(count)) * 100)  # Convert to percentage

world_clusters <- world_map %>%
  left_join(country_clusters_percent, by = c("region" = "Country.of.Origin"))

#Country plot

We can clearly see that there are some countries with observations in only single cluster but also some in only one cluster. Sticking to the previous interpretation that Cluster 1 stands for slightly better quality of coffee than cluster 2 and definetely has higher mean altitude of beans harvesting, this makes perfect sense. Countries visible in both clusters are ones of the best known quality coffee producers: Mexico, Gwatemala, Costarica, Ekwador, Colombia, Brazil and Kenia. Their presence in both clusters may be interpreted as visualization of common marketing practice. If the country is known from their very good coffee, it produces both - very good and just good as people will buy it anyway just because it’s form this country, especially if it’s a bit cheaper.

For countries marked on red in first cluster where around 90-100% of observations lands in this cluster, I would say that some of them are not big exporters of the coffee, so one thing is that there is probably less observations from these countries. Second thing is that, if they already export some beans or get them assessed by Coffee Quality Institute, these are probably very good beans. These countries would be Zambia, Cambodia, Malawi,Rwanda, Burundi. The rest of them I would interpret as just very good coffee quality, however China surprised me here personally, as I never heard of Chinease coffee, but after doing some research, it suits this interpretation.

As separate category I would interpret USA, India, Myanmar/Birma and Indonesia as they’re appearing both in cluster 1 and cluster 0 which should be interpreted as noise. Those countries have the biggest number of observations across dataset and their visibility in both cluster 1 and 0 may be caused by the disproportion of representation across dataset comparing to other countries.

Conclusion

The project aimed to analyse coffee quality parameters using dimension reduction (Principal Component Analysis) and various clustering methods (KMeans, CLARA, DBSCAN).

Reducing dimensionality allowed to discover that it’s possible to explain nearly 70% of the data using only 3 dimensions, which makes visualization easier and computing faster. First dimension combined mainly features responsible for overall quality (Cupper Points, Aroma) adn sensory attributes (Body, Acidity), while second dimension focused around texture (Moisture, Cup) and sweetness. Third dimension also contained Moisture but in a smaller propotion as it’s main driver was altitude. Moreover, visualization showed that altitude and moisture impact coffee quality in differen directions.

Clustering performed on both clean data and PCA-transformed data showed very similar results, which suggests that PCA indeed kept most of data information. Clustering on clean data showed however higher average Silhouette width, which was expected as it had 100% of information, not only 70%. There were 2 clusters detected, one significanly smaller than another, it was recognized that bigger one indicates slightly higher coffee quality and higher altitude of harvest. DBSCAN also enabled to visualize interesting split of clusters among different countries, where some of them were strongly associated with one cluster only, but some ahd divided observations between both clusters or showed some noise in observations.

While, I’m happy with results of dimension reduction, I had different expectations for clustering when starting the project. There are some areas for future improvements, maybe different way of preparing the data may help the fact that small variances in values of specific variables were interpreted as a noise, or experimenting with different clustering methods and/or combinig them (using hybrid-models) would work better for given data.

Dimension reduction and clustering on coffee quality measures

Katarzyna Pelka

2025-02-20