Coffee is inseparable part of exam session season and for may people, inseparable part of every-day life. This study is based on dataset from Kaggle [https://www.kaggle.com/datasets/volpatto/coffee-quality-database-from-cqi/data] that gathers 1339 observations from Coffee Quality Institute about multiple features of different coffee beans.
The analysis aims to: a) perform dimension reduction on coffee quality features using Principal Component Analysis, b) investigate if different types of coffee described in the dataset differ significantly or are rather uniform. c) see if results from dimension reduction using PCA and clustering have something in common and drive conclusions form it.
I created a subset of data containing only numeric values for parameters quality that I want to analyze and dropped NA values as they were appearing in variable pointing at average harvest altitude, which on my opinion may be a relevant factor. Moreover, only 230 rows with NAs appeared, which doesn’t make big impact on the dataset. Additionally I also manually fixed 2 outliers, which were obviously mistakenly 100 times bigger than they should.
coffee<-read.csv("merged_data_cleaned.csv")
coffee_clean <- coffee %>%
select(Aroma, Flavor, Aftertaste, Acidity, Body, Balance, Uniformity, Clean.Cup, Sweetness, Cupper.Points, Moisture, altitude_mean_meters) %>%
na.omit()
# Correct altitude values based on their original incorrect values
coffee_clean$altitude_mean_meters[which(coffee_clean$altitude_mean_meters == 190164)] <- 1901.64
coffee_clean$altitude_mean_meters[which(coffee_clean$altitude_mean_meters == 110000)] <- 1100
summary(coffee_clean)
## Aroma Flavor Aftertaste Acidity
## Min. :0.000 Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:7.420 1st Qu.:7.330 1st Qu.:7.250 1st Qu.:7.330
## Median :7.580 Median :7.580 Median :7.420 Median :7.500
## Mean :7.571 Mean :7.521 Mean :7.394 Mean :7.529
## 3rd Qu.:7.750 3rd Qu.:7.750 3rd Qu.:7.580 3rd Qu.:7.750
## Max. :8.750 Max. :8.830 Max. :8.670 Max. :8.750
## Body Balance Uniformity Clean.Cup
## Min. :0.000 Min. :0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.:7.330 1st Qu.:7.330 1st Qu.:10.000 1st Qu.:10.000
## Median :7.500 Median :7.500 Median :10.000 Median :10.000
## Mean :7.507 Mean :7.506 Mean : 9.868 Mean : 9.851
## 3rd Qu.:7.670 3rd Qu.:7.750 3rd Qu.:10.000 3rd Qu.:10.000
## Max. :8.580 Max. :8.750 Max. :10.000 Max. :10.000
## Sweetness Cupper.Points Moisture altitude_mean_meters
## Min. : 0.000 Min. : 0.000 Min. :0.00000 Min. : 1
## 1st Qu.:10.000 1st Qu.: 7.250 1st Qu.:0.10000 1st Qu.: 1100
## Median :10.000 Median : 7.500 Median :0.11000 Median : 1311
## Mean : 9.874 Mean : 7.489 Mean :0.09205 Mean : 1337
## 3rd Qu.:10.000 3rd Qu.: 7.750 3rd Qu.:0.12000 3rd Qu.: 1600
## Max. :10.000 Max. :10.000 Max. :0.20000 Max. :11000
| I also standarized data so to avoid potential problems that may appear because scale difference. |
preproc <- preProcess(coffee_clean, method=c("center", "scale"))
coffee_scaled <- predict(preproc, coffee_clean)
I created correlation matrix to investigate the relationship of the variables
Variables Aftertaste, Balance and Flavor are triggering the highest correlation score, which may negatively impact further analysis, so I decided to drop them. This leaves us with 10 dimensions.
The aim for using PCA is to reduce dimensions of data but keep as much original information as possible in the same time - models usually are considered satisfactory when cumulative variance explained by Principal Components exceeds the threshold of 70%.
pca1<-prcomp(coffee_done, center=FALSE, scale.=FALSE)
summary(pca1)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.0008 1.1281 0.9891 0.9532 0.74260 0.7029 0.53184
## Proportion of Variance 0.4448 0.1414 0.1087 0.1010 0.06127 0.0549 0.03143
## Cumulative Proportion 0.4448 0.5862 0.6949 0.7959 0.85716 0.9121 0.94348
## PC8 PC9
## Standard deviation 0.52364 0.48419
## Proportion of Variance 0.03047 0.02605
## Cumulative Proportion 0.97395 1.00000
First 4 PCs explain close to 80% of the data, which would seem as a good choice, but let’s check Kaiser’s Stopping Rule and elbowpoint.
coffee_done.cov<-cov(coffee_done)
coffee_done.eigen<-eigen(coffee_done.cov)
eigenvalues <-coffee_done.eigen$values
eigenvalues
## [1] 4.0033025 1.2726928 0.9783224 0.9086513 0.5514587 0.4940823 0.2828514
## [8] 0.2741945 0.2344440
Kaiser’s criterion suggests that only first 2 PCs should be chosen, as 3rd one already hav eigen value below1, which means that it doesn’t explain more than a single variable itself, but then model may be over simplified as it would explain only 58.3% of variance, so deciding for 3 PCs is a better option in that case. In theory, 3 PCs does not meet theoretical threshold of 70% to assess the dimension reduction as satisfactory, but it is close enough to 70% to let me decide for keeping 3 components.
Positively correlated variables are oriented in the same direction on the plot, while negatively correlated variables point in opposite directions.
Additionally, the distance of a variable from the origin indicates how well it is represented — variables farther from the origin have a stronger representation in the principal component space.
This being said, we can see that Aroma, Acidity, Body and Cupper Points have strong positive correlation (as Cupper Points is a general grade of the coffee cup, this is not unexpected), similarly Uniformity, Clean Cup and Sweetness. Moreover we can see that altitude is not that strongly represented in these dimensions.
As previously said, I’d rather keep analysis with 3 PCs, so let’s see 3D plot.
Here it’s visible that altitude is main trigger of the PC3. We can also visualize contribution of variables to each PC separately:
Main contributors of PC1 are: Acidity, Aroma, Cupper Points and Body, which suggests that this PC identifies general coffee quality, the second component is about sweetness and texture (moisture, uniformity and Clean Cup) and third component is triggered mainly by altitude with smaller contribution of moisture.
Let’s see how individual observations with similar profile group on the plot:
PCA provided us with some valuable and interesting observations about, how variables group with each other. Based on plots we may come to the conclusion that coffee sweetness and texture is not very much correlated with it’s final quality grade (Cupper Points), however acidity is very strongly. Moreover from 2D plot we may also conclude that the higher beans grow, the better end grade of the coffee, but 3D plot provided additional insight that altitude (however still positively correlated with Cupper Points) triggers mainly 3rd dimension
For the formality, I checked Hopkins statistic to be sure that data is clusterable and is not randomly distributed. I tried different clustering methods of differently processed data frames to compare their results and drive conclusion which would be the best and how PCA impacts it.
hopkins_test_clean <- get_clust_tendency(coffee_clean, n = nrow(coffee_clean) - 1, graph = TRUE)
print(hopkins_test_clean$hopkins_stat)
## [1] 0.9923984
# 0.9923984 ->good for clustering
hopkins_pca <- get_clust_tendency(pca1$x[, 1:3], n = nrow(pca1$x) - 1, graph = TRUE)
print(hopkins_pca$hopkins_stat)
## [1] 0.9791923
#0.9791923 -> good for clustering
hopkins_pca_2 <- get_clust_tendency(pca1$x[, 1:2], n = nrow(pca1$x) - 1, graph = TRUE)
print(hopkins_pca_2$hopkins_stat)
## [1] 0.9830112
#0.9830112 -> good for clustering
Let’s start with KMeans.
First I determined optimal number of clusters using elbow point method and Silhouette method.
For coffee_clean df (Without outliers and extremely high correlated variables, but before scaling and PCA):
opt_elbow_clean <- Optimal_Clusters_KMeans(coffee_clean, max_clusters = 10, plot_clusters = TRUE)
opt_sil_clean <- Optimal_Clusters_KMeans(coffee_clean, max_clusters = 10, plot_clusters = TRUE, criterion="silhouette")
Here elbow point method suggests 2 or 3 clusters, while Silhouette clearly stand for 2
For data transformed by PCA:
opt_pca_elbow <- Optimal_Clusters_KMeans(pca1$x[,1:3], max_clusters = 10, plot_clusters = TRUE)
opt_pca_sil <- Optimal_Clusters_KMeans(pca1$x[,1:3], max_clusters = 10, plot_clusters = TRUE, criterion="silhouette")
For PCA transformed data situation is basically the same. Elbow point analysis would drive us towards conclusion about 2 or 3 clusters, and Silhouette chart decides for 2. In this case, I’m confident to decide that 2 is the best option
Let’s compare if clusters on clean data differ significantly from ones on PCA.
set.seed(123)
kmeans_pca <- kmeans(pca1$x[, 1:3], centers = 2, nstart = 25, iter.max = 100) # Assuming k=2
kmeans_clean <- kmeans(coffee_clean, centers = 2, nstart = 25, iter.max = 100) # Assuming k=2
clean_summary <- aggregate(coffee_clean, by = list(Cluster = kmeans_clean$cluster), mean)
pca_summary <- aggregate(coffee_clean, by = list(Cluster = kmeans_pca$cluster), mean)
print(clean_summary)
## Cluster Aroma Flavor Aftertaste Acidity Body Balance Uniformity
## 1 1 7.523664 7.472185 7.339714 7.477294 7.472050 7.452353 9.847966
## 2 2 7.624883 7.576848 7.457782 7.589183 7.546887 7.567588 9.890914
## Clean.Cup Sweetness Cupper.Points Moisture altitude_mean_meters
## 1 9.799328 9.867983 7.429714 0.09467227 1014.931
## 2 9.911751 9.880428 7.557471 0.08900778 1710.506
print(pca_summary)
## Cluster Aroma Flavor Aftertaste Acidity Body Balance Uniformity
## 1 1 7.331652 7.252149 7.125045 7.284140 7.276448 7.232692 9.775023
## 2 2 7.728906 7.698651 7.572954 7.691514 7.659340 7.686717 9.929400
## Clean.Cup Sweetness Cupper.Points Moisture altitude_mean_meters
## 1 9.692172 9.868348 7.153190 0.10726244 1177.881
## 2 9.956972 9.877331 7.711409 0.08196402 1442.969
In both scenarios clusters are very similar, which suggests that PCA indeed kept a majority of information.
Cluster 2 generally seems like “better quality coffee”, however differences are very minor, I wouldn’t resign from interpreting them as this is the specification of the dataset. Each variable has very narrow spread in values, except altitude. We also may suppose that the main driver of clusters is altitude as it’s where clusters show the highest difference but it also is a variable with the videst spread.
##Visualisation of KMeans clusters
However I decided to use 3 PCs from PCA dimension reduction, I wanted to perform the visualization also on set with 2 PCs to explore differences.
Clusters are indeed mainly defined by first dimension, which indicates coffee quality, as assumed before. We can also see some outliers that weren’t detected before.
Visually it may seem like PCA with only 2 PCs works the best for kmeans,as there’s no overlap but it’s very obvious result as it takes only 2 dimensions which is simpler to visualize. Overlap on the first plot may be trigger by many dimensions and small spread of particular variables.
To compare clustering quality I’m checking average Silhouette width for all 3 scenarios.
## cluster size ave.sil.width
## 1 1 595 0.48
## 2 2 514 0.53
## cluster size ave.sil.width
## 1 1 442 0.27
## 2 2 667 0.35
## cluster size ave.sil.width
## 1 1 663 0.45
## 2 2 446 0.33
Original (clean) data shows the highest average Silhouette width, meaning they’d be the best choice.
I decided to replicate the same process for CLARA method to compare results.
## cluster size ave.sil.width
## 1 1 445 0.56
## 2 2 664 0.48
## cluster size ave.sil.width
## 1 1 567 0.33
## 2 2 542 0.28
## cluster size ave.sil.width
## 1 1 571 0.45
## 2 2 538 0.31
Results from CLARA match exactly the ones from KMeans, but we can see that CLARA deals slightly better with visualizing clusters on clean data.
Let’s see slightly different plot for CLARA and compare medoid points for each variable.
## Call: clara(x = coffee_clean, k = 2, metric = "euclidean", stand = FALSE, samples = 6, sampsize = 50, trace = 0, medoids.x = TRUE, rngR = FALSE, pamLike = FALSE, correct.d = TRUE)
## Medoids:
## Aroma Flavor Aftertaste Acidity Body Balance Uniformity Clean.Cup
## 244 7.67 7.83 7.58 7.92 7.75 7.67 10 10
## 1004 7.50 7.25 7.00 7.25 7.50 7.25 10 10
## Sweetness Cupper.Points Moisture altitude_mean_meters
## 244 10 7.58 0.12 1706.88
## 1004 10 7.25 0.11 1219.20
## Objective function: 227.9888
## Clustering vector: Named int [1:1109] 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 2 1 1 ...
## - attr(*, "names")= chr [1:1109] "1" "2" "3" "4" "5" "8" "9" ...
## Cluster sizes: 410 699
## Best sample:
## [1] 10 44 57 79 88 102 110 190 208 210 236 244 289 464 467
## [16] 478 481 497 546 577 578 611 612 683 701 717 730 732 775 786
## [31] 797 802 848 851 853 860 916 925 958 979 1004 1108 1120 1122 1124
## [46] 1135 1150 1187 1206 1239
##
## Available components:
## [1] "sample" "medoids" "i.med" "clustering" "objective"
## [6] "clusinfo" "diss" "call" "silinfo" "data"
Three variables show same medoids across clusters, which indicates why the quality of clustering on this dataset isn’t as good as I hoped. Those are variables responsible for PC2 from dimension reduction, which makes it more obvious, why clusters are based mostly on first dimension.
#clean
kNNdistplot(coffee_clean, k = 11)
abline(h =90 , col = "grey", lty = 2)
set.seed(123)
db_orig <- dbscan(coffee_clean, eps = 90, MinPts = 11) # Adjust `eps` as needed
fviz_cluster(list(data = coffee_clean, cluster = db_orig$cluster)) +
ggtitle("DBSCAN Clustering on Original Data")
#pca
kNNdistplot(pca_3pc, k = 10)
abline(h =2 , col = "grey", lty = 2)
set.seed(123)
db_orig <- dbscan(pca_3pc, eps = 2, MinPts = 10) # Adjust `eps` as needed
fviz_cluster(list(data = pca_3pc, cluster = db_orig$cluster)) +
ggtitle("DBSCAN Clustering on OPA (3PCs)")
#pca2pc
kNNdistplot(pca_2pc, k = 10)
abline(h =2 , col = "grey", lty = 2)
set.seed(123)
db_orig <- dbscan(pca_2pc, eps = 2, MinPts = 10) # Adjust `eps` as needed
fviz_cluster(list(data = pca_2pc, cluster = db_orig$cluster)) +
ggtitle("DBSCAN Clustering on PCA (2PCs)")
DBSCAN seems to not see clusters at all, most of tries it shown 1 cluster plus noise. Only in clean data it sees 2 clusters, however second one is much smaller. The visualization again may fail because of many dimensions and little differences in values of the variables
Clearer visualization of clean data clusters:
res.fpc <- fpc::dbscan(coffee_clean, eps = 90 , MinPts = 11)
hullplot(coffee_clean, res.fpc, solid=TRUE, alpha=0.7)
As the widest spread variable is altitude,it may be worth checking how clusters look on the world map. For that I added back column Country of origin to coffee_clean, which was preprocessed earlier to show only numeric data for analysis purposes, and standarized it in terms of country naming convention to match map package data.
Data preparation code:
db_orig <- dbscan(coffee_clean, eps = 90, MinPts = 11) # Adjust `eps` as needed
coffee_clean$cluster_dbscan <- db_orig$cluster
coffee_clean_with_country <- coffee %>%
filter(complete.cases(select(coffee, Aroma, Flavor, Aftertaste, Acidity,
Body, Balance, Uniformity, Clean.Cup, Sweetness,
Cupper.Points, Moisture, altitude_mean_meters))) %>%
select(Country.of.Origin) %>%
mutate(cluster_dbscan = coffee_clean$cluster_dbscan)
country_clusters <- coffee_clean_with_country %>%
group_by(Country.of.Origin, cluster_dbscan) %>%
summarise(count = n(), .groups = "drop")
country_clusters <- country_clusters %>%
group_by(Country.of.Origin) %>%
mutate(percent = count / sum(count) * 100)
world_map <- map_data("world")
country_clusters <- country_clusters %>%
mutate(Country.of.Origin = case_when(
Country.of.Origin == "United States" ~ "USA",
Country.of.Origin == "Puerto Rico" ~ "Puerto Rico",
Country.of.Origin == "Tanzania" ~ "Tanzania",
Country.of.Origin == "Ivory Coast" ~ "Ivory Coast",
Country.of.Origin == "Cote d?Ivoire" ~ "Ivory Coast",
Country.of.Origin == "Taiwan" ~ "Taiwan",
Country.of.Origin == "Myanmar" ~ "Myanmar",
Country.of.Origin == "Vietnam" ~ "Vietnam",
Country.of.Origin == "Zambia" ~ "Zambia",
Country.of.Origin == "El Salvador" ~ "El Salvador",
Country.of.Origin == "Papua New Guinea" ~ "Papua New Guinea",
Country.of.Origin == "Costa Rica" ~ "Costa Rica",
Country.of.Origin == "Ethiopia" ~ "Ethiopia",
Country.of.Origin == "Guatemala" ~ "Guatemala",
TRUE ~ Country.of.Origin # Keep others unchanged
))
country_clusters_percent <- country_clusters %>%
group_by(`Country.of.Origin`, cluster_dbscan) %>%
summarise(count = n(), .groups = "drop") %>% # Count occurrences per country
group_by(`Country.of.Origin`) %>%
mutate(percent = (count / sum(count)) * 100) # Convert to percentage
world_clusters <- world_map %>%
left_join(country_clusters_percent, by = c("region" = "Country.of.Origin"))
#Country plot
We can clearly see that there are some countries with observations in only single cluster but also some in only one cluster. Sticking to the previous interpretation that Cluster 1 stands for slightly better quality of coffee than cluster 2 and definetely has higher mean altitude of beans harvesting, this makes perfect sense. Countries visible in both clusters are ones of the best known quality coffee producers: Mexico, Gwatemala, Costarica, Ekwador, Colombia, Brazil and Kenia. Their presence in both clusters may be interpreted as visualization of common marketing practice. If the country is known from their very good coffee, it produces both - very good and just good as people will buy it anyway just because it’s form this country, especially if it’s a bit cheaper.
For countries marked on red in first cluster where around 90-100% of observations lands in this cluster, I would say that some of them are not big exporters of the coffee, so one thing is that there is probably less observations from these countries. Second thing is that, if they already export some beans or get them assessed by Coffee Quality Institute, these are probably very good beans. These countries would be Zambia, Cambodia, Malawi,Rwanda, Burundi. The rest of them I would interpret as just very good coffee quality, however China surprised me here personally, as I never heard of Chinease coffee, but after doing some research, it suits this interpretation.
As separate category I would interpret USA, India, Myanmar/Birma and Indonesia as they’re appearing both in cluster 1 and cluster 0 which should be interpreted as noise. Those countries have the biggest number of observations across dataset and their visibility in both cluster 1 and 0 may be caused by the disproportion of representation across dataset comparing to other countries.
The project aimed to analyse coffee quality parameters using dimension reduction (Principal Component Analysis) and various clustering methods (KMeans, CLARA, DBSCAN).
Reducing dimensionality allowed to discover that it’s possible to explain nearly 70% of the data using only 3 dimensions, which makes visualization easier and computing faster. First dimension combined mainly features responsible for overall quality (Cupper Points, Aroma) adn sensory attributes (Body, Acidity), while second dimension focused around texture (Moisture, Cup) and sweetness. Third dimension also contained Moisture but in a smaller propotion as it’s main driver was altitude. Moreover, visualization showed that altitude and moisture impact coffee quality in differen directions.
Clustering performed on both clean data and PCA-transformed data showed very similar results, which suggests that PCA indeed kept most of data information. Clustering on clean data showed however higher average Silhouette width, which was expected as it had 100% of information, not only 70%. There were 2 clusters detected, one significanly smaller than another, it was recognized that bigger one indicates slightly higher coffee quality and higher altitude of harvest. DBSCAN also enabled to visualize interesting split of clusters among different countries, where some of them were strongly associated with one cluster only, but some ahd divided observations between both clusters or showed some noise in observations.
While, I’m happy with results of dimension reduction, I had different expectations for clustering when starting the project. There are some areas for future improvements, maybe different way of preparing the data may help the fact that small variances in values of specific variables were interpreted as a noise, or experimenting with different clustering methods and/or combinig them (using hybrid-models) would work better for given data.