High-dimensional sensory datasets are difficult to interpret due to complex interdependencies between variables. Dimension reduction techniques provide a structured way to uncover latent structure while preserving essential information.
This study compares Principal Component Analysis (PCA) and Multidimensional Scaling (MDS) using coffee sensory evaluation data. While PCA maximizes explained variance, MDS preserves pairwise distances. The goal is to examine whether these fundamentally different approaches produce distinct geometric representations.
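# Packages used below: dplyr (%>% and select), corrplot, factoextra (fviz_*), ggplot2.
library(dplyr)
library(corrplot)
library(factoextra)
library(ggplot2)
# Note: coffee_theme, used in the plots, is a ggplot2 theme that is not defined in this section.
species_palette <- c("Arabica" = "#1f77b4", "Robusta" = "#d62728")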
coffee <- read.csv("merged_data_cleaned.csv")
sensory_vars <- coffee %>%
  select(Aroma, Flavor, Aftertaste, Acidity, Body, Balance,
         Uniformity, Clean.Cup, Sweetness,
         Cupper.Points, Total.Cup.Points, Moisture)
species <- coffee$Species
sensory_scaled <- scale(sensory_vars)
cor_matrix <- cor(sensory_scaled)
corrplot(cor_matrix,
         method = "color",
         type = "upper",
         tl.cex = 0.7,
         tl.col = "black",
         order = "hclust")
The correlation matrix reveals a clear block structure among flavor-related attributes such as Flavor, Aftertaste, Aroma, Body, Balance, and Total Cup Points, all of which exhibit strong positive correlations. This clustering pattern suggests that these variables move together along a common quality gradient.
In contrast, Moisture shows weaker and in some cases slightly negative associations with the main sensory attributes, indicating that it behaves differently from the core quality-related variables. Overall, the strong positive interdependencies support the presence of a dominant latent quality dimension underlying the sensory data.
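The block structure described above can also be read off numerically by ranking variable pairs by the absolute value of their correlation. The sketch below is illustrative only: it uses synthetic data (three noisy copies of a shared "quality" signal plus one unrelated variable), not the coffee dataset, and the names `toy`, `cor_toy`, and `pairs_tbl` are hypothetical.

```r
# Synthetic data: three variables share a latent signal, one does not.
# The column names echo the report's variables, but the values are simulated.
set.seed(7)
n <- 100
quality <- rnorm(n)
toy <- data.frame(
  Flavor     = quality + rnorm(n, sd = 0.3),
  Aftertaste = quality + rnorm(n, sd = 0.3),
  Balance    = quality + rnorm(n, sd = 0.5),
  Moisture   = rnorm(n)
)

cor_toy <- cor(toy)

# Keep each variable pair once by indexing the upper triangle.
upper <- which(upper.tri(cor_toy), arr.ind = TRUE)
pairs_tbl <- data.frame(
  var1 = rownames(cor_toy)[upper[, "row"]],
  var2 = colnames(cor_toy)[upper[, "col"]],
  r    = cor_toy[upper]
)

# Strongest associations first, regardless of sign.
pairs_tbl <- pairs_tbl[order(abs(pairs_tbl$r), decreasing = TRUE), ]
print(pairs_tbl)
```

Applied to `cor_matrix`, the same ranking would list the pairs that form the visible block in the correlation plot first and the Moisture pairs last, if the pattern described above holds.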
The PCA results show that the first principal component (PC1) captures 58.85% of the total variance, pointing to a strong latent structure in the sensory dataset. The second component (PC2) contributes 12.60% and the third (PC3) 7.65%. Together, the first two components capture 71.45% of the total variance, so a two-dimensional representation summarizes a substantial share of the data structure. The cumulative variance reaches about 84% by the fourth component, indicating that most of the variability is concentrated in the first few dimensions. At the other extreme, the twelfth component has essentially zero variance (standard deviation 0.004), and its loadings contrast Total Cup Points (0.914) with the ten individual sensory scores; this is consistent with Total Cup Points being almost an exact linear combination of those scores.
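The percentages quoted above follow directly from the component standard deviations: each component's variance is its squared standard deviation, and the proportions are those variances divided by their sum. A minimal sketch, using the standard deviations copied from the `summary(pca_model)` output in this report (the object names `sdev`, `prop_var`, `cum_var`, and `variance_table` are illustrative):

```r
# Standard deviations copied from the summary(pca_model) output in this report.
sdev <- c(2.6574, 1.2299, 0.95794, 0.7676, 0.69174, 0.60750,
          0.55903, 0.51504, 0.46948, 0.41925, 0.31207, 0.004182)

# Each component's variance (eigenvalue) is its squared standard deviation.
eigenvalues <- sdev^2

# Proportion and cumulative proportion of the total variance.
prop_var <- eigenvalues / sum(eigenvalues)
cum_var  <- cumsum(prop_var)

variance_table <- data.frame(
  component  = paste0("PC", seq_along(sdev)),
  proportion = round(prop_var, 4),
  cumulative = round(cum_var, 4)
)
print(head(variance_table, 4))
```

Because the twelve variables were standardized, the eigenvalues sum to approximately 12, one unit of variance per variable.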
Examining the loadings, most flavor-related variables such as Aroma (−0.318), Flavor (−0.349), Aftertaste (−0.345), Acidity (−0.322), and Total Cup Points (−0.369) show relatively large and similar contributions to PC1. This consistent pattern suggests that PC1 represents a global sensory quality dimension. Because the sign of a principal component is arbitrary, the negative loadings simply mean that coffees with higher sensory scores receive lower PC1 values. In contrast, PC2 is more strongly associated with Sweetness (0.536), Clean Cup (0.474), and Uniformity (0.459), indicating that this component captures a secondary structure related to cleanliness and sweetness; Moisture also loads moderately on PC2 (0.340). Moisture loads strongly on PC3 (−0.922), suggesting that it behaves differently from the core flavor attributes and forms a largely independent dimension.
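One way to make this reading of the loadings reproducible is to rank the variables on each component by absolute loading. The sketch below uses the PC1 to PC3 loadings from the `pca_model$rotation` output in this report, rounded to three decimals; `top_loadings()` is a hypothetical helper written for this illustration, not part of any package.

```r
# Loadings for PC1-PC3, rounded from the pca_model$rotation output in this report.
loadings <- matrix(
  c(-0.318, -0.132, -0.111,
    -0.349, -0.123, -0.096,
    -0.345, -0.143, -0.055,
    -0.322, -0.146, -0.136,
    -0.307, -0.156, -0.103,
    -0.330, -0.124, -0.002,
    -0.212,  0.459,  0.149,
    -0.202,  0.474,  0.254,
    -0.163,  0.536,  0.085,
    -0.315, -0.148, -0.021,
    -0.369,  0.159,  0.036,
     0.066,  0.340, -0.922),
  ncol = 3, byrow = TRUE,
  dimnames = list(
    c("Aroma", "Flavor", "Aftertaste", "Acidity", "Body", "Balance",
      "Uniformity", "Clean.Cup", "Sweetness", "Cupper.Points",
      "Total.Cup.Points", "Moisture"),
    c("PC1", "PC2", "PC3")
  )
)

# Hypothetical helper: the n variables with the largest absolute loading
# on one component, keeping the original signs.
top_loadings <- function(loadings, component, n = 3) {
  column <- loadings[, component]
  column[order(abs(column), decreasing = TRUE)][seq_len(n)]
}

top_pc1 <- top_loadings(loadings, "PC1")
top_pc2 <- top_loadings(loadings, "PC2")
top_pc3 <- top_loadings(loadings, "PC3")
print(list(PC1 = top_pc1, PC2 = top_pc2, PC3 = top_pc3))
```

Ranking by absolute value rather than by signed value matters here, since the direction of each component is arbitrary.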
pca_model <- prcomp(sensory_scaled, center = FALSE, scale. = FALSE)
summary(pca_model)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.6574 1.2299 0.95794 0.7676 0.69174 0.60750 0.55903
## Proportion of Variance 0.5885 0.1260 0.07647 0.0491 0.03988 0.03075 0.02604
## Cumulative Proportion 0.5885 0.7145 0.79099 0.8401 0.87997 0.91072 0.93676
## PC8 PC9 PC10 PC11 PC12
## Standard deviation 0.51504 0.46948 0.41925 0.31207 0.004182
## Proportion of Variance 0.02211 0.01837 0.01465 0.00812 0.000000
## Cumulative Proportion 0.95887 0.97724 0.99188 1.00000 1.000000
pca_model$rotation
## PC1 PC2 PC3 PC4 PC5
## Aroma -0.31817882 -0.1319324 -0.110760024 0.016218215 0.02549520
## Flavor -0.34874830 -0.1225097 -0.096349898 0.039612497 0.05911413
## Aftertaste -0.34480415 -0.1431004 -0.055091052 0.052051398 0.04949966
## Acidity -0.32201603 -0.1463529 -0.135952501 -0.066854710 -0.10307270
## Body -0.30660900 -0.1558051 -0.102739921 -0.241671021 -0.07213774
## Balance -0.33008061 -0.1237548 -0.002274809 -0.050061222 -0.01851892
## Uniformity -0.21210784 0.4592216 0.148943787 0.285175301 -0.78330649
## Clean.Cup -0.20162707 0.4744611 0.254095386 0.465121905 0.57417639
## Sweetness -0.16270840 0.5356531 0.085069750 -0.767702415 0.13025996
## Cupper.Points -0.31505314 -0.1484964 -0.020887098 0.163363899 0.11416728
## Total.Cup.Points -0.36866953 0.1591176 0.036137597 0.007521486 0.03349051
## Moisture 0.06555463 0.3403030 -0.922302752 0.127684133 0.05351153
## PC6 PC7 PC8 PC9
## Aroma 0.256613251 0.68764326 -0.4704940808 -0.04501265
## Flavor 0.178182284 0.07674974 0.0603175684 0.05034773
## Aftertaste 0.099795200 -0.04147535 0.0470794727 0.21950291
## Acidity 0.087740161 0.15743860 0.8180068678 -0.09654110
## Body -0.715907563 -0.02698049 -0.1491580932 -0.50510032
## Balance -0.325859695 -0.24538782 -0.1365644106 0.74021714
## Uniformity 0.011933051 -0.01472985 -0.0571025363 -0.03347511
## Clean.Cup -0.231186622 0.11074897 0.1080918128 -0.06069239
## Sweetness 0.205440372 -0.06051772 -0.0259197695 0.01736866
## Cupper.Points 0.410656073 -0.64223661 -0.2145368931 -0.35741727
## Total.Cup.Points -0.001986957 -0.01184465 -0.0006012022 -0.01459578
## Moisture -0.039966977 -0.07108534 -0.0332356624 0.04270291
## PC10 PC11 PC12
## Aroma 0.286242505 0.120096483 -9.860467e-02
## Flavor -0.435516968 -0.777639078 -1.039661e-01
## Aftertaste -0.645093903 0.602496567 -1.058879e-01
## Acidity 0.334949317 0.091559148 -9.897016e-02
## Body -0.086705671 0.007501729 -9.670181e-02
## Balance 0.349540626 -0.079145487 -1.070684e-01
## Uniformity -0.034197323 -0.007446221 -1.451963e-01
## Clean.Cup 0.050865794 0.012144603 -1.995920e-01
## Sweetness -0.009950068 0.016112157 -1.607171e-01
## Cupper.Points 0.257285816 0.047587511 -1.232563e-01
## Total.Cup.Points 0.013267113 0.006455946 9.141690e-01
## Moisture 0.011344226 0.019494488 2.905536e-06
# Scree plot
fviz_eig(pca_model, addlabels = TRUE) +
  coffee_theme +
  labs(
    title = "Scree Plot",
    x = "Principal Component",
    y = "Explained Variance (%)"
  )
# Loading plot
fviz_pca_var(pca_model, repel = TRUE) +
  coffee_theme +
  labs(
    title = "PCA Variable Loadings",
    subtitle = "Directions indicate how sensory attributes contribute to components"
  )
# PCA scores
pca_scores <- as.data.frame(pca_model$x)
pca_scores$Species <- species
ggplot(pca_scores, aes(PC1, PC2, color = Species, fill = Species)) +
  geom_point(alpha = 0.55, size = 2) +
  stat_ellipse(type = "norm", level = 0.95, geom = "polygon", alpha = 0.12, color = NA) +
  stat_ellipse(type = "norm", level = 0.95, linewidth = 0.7) +
  scale_color_manual(values = species_palette, na.value = "grey50") +
  scale_fill_manual(values = species_palette, na.value = "grey50") +
  coffee_theme +
  labs(
    title = "PCA Projection of Coffee Sensory Data",
    subtitle = "Ellipses show 95% concentration by Species",
    x = "PC1",
    y = "PC2"
  )
I observe that coffee sensory evaluations do not form strictly distinct groups but instead align along a continuous quality dimension. The dominance of the first principal component highlights how strongly correlated the flavor-related attributes are. Therefore, dimensionality reduction provides a meaningful and interpretable summary of the sensory structure.
dist_matrix <- dist(sensory_scaled, method = "euclidean")
mds_model <- cmdscale(dist_matrix, k = 2, eig = TRUE)
mds_scores <- as.data.frame(mds_model$points)
colnames(mds_scores) <- c("Dim1", "Dim2")
mds_scores$Species <- species
ggplot(mds_scores, aes(Dim1, Dim2, color = Species, fill = Species)) +
  geom_point(alpha = 0.55, size = 2) +
  stat_ellipse(type = "norm", level = 0.95, geom = "polygon", alpha = 0.12, color = NA) +
  stat_ellipse(type = "norm", level = 0.95, linewidth = 0.7) +
  scale_color_manual(values = species_palette, na.value = "grey50") +
  scale_fill_manual(values = species_palette, na.value = "grey50") +
  coffee_theme +
  labs(
    title = "Euclidean MDS",
    x = "Dimension 1",
    y = "Dimension 2"
  )
dist_matrix_manhattan <- dist(sensory_scaled, method = "manhattan")
mds_model_manhattan <- cmdscale(dist_matrix_manhattan, k = 2)
mds_scores_manhattan <- as.data.frame(mds_model_manhattan)
colnames(mds_scores_manhattan) <- c("Dim1", "Dim2")
mds_scores_manhattan$Species <- species
ggplot(mds_scores_manhattan, aes(Dim1, Dim2, color = Species, fill = Species)) +
  geom_point(alpha = 0.55, size = 2) +
  stat_ellipse(type = "norm", level = 0.95, geom = "polygon", alpha = 0.12, color = NA) +
  stat_ellipse(type = "norm", level = 0.95, linewidth = 0.7) +
  scale_color_manual(values = species_palette, na.value = "grey50") +
  scale_fill_manual(values = species_palette, na.value = "grey50") +
  coffee_theme +
  labs(
    title = "Manhattan MDS",
    x = "Dimension 1",
    y = "Dimension 2"
  )
From my perspective, both Euclidean and Manhattan MDS confirm that coffee sensory evaluations are structured along a continuous dimension rather than forming sharply separated groups. Even though Manhattan distance stretches the geometry and slightly emphasizes species differences, the fundamental pattern does not change. Therefore, I interpret the sensory space as being driven more by gradual variation in quality attributes than by strict categorical separation.
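How well a two-dimensional MDS solution represents each distance matrix can be checked with the goodness-of-fit values that `cmdscale()` returns when `eig = TRUE`. The sketch below runs on synthetic data, not the coffee dataset, and the names `toy`, `fit_euclid`, `fit_manhattan`, and `gof` are illustrative. Classical MDS on a non-Euclidean metric such as Manhattan can yield negative eigenvalues, which is why `GOF` reports two variants.

```r
# Synthetic data: three correlated variables plus two noise variables.
set.seed(1)
n <- 80
latent <- rnorm(n)
toy <- scale(cbind(
  latent + rnorm(n, sd = 0.5),
  latent + rnorm(n, sd = 0.5),
  latent + rnorm(n, sd = 0.5),
  rnorm(n),
  rnorm(n)
))

# Classical MDS in two dimensions under each metric; eig = TRUE also
# returns the eigenvalues and the goodness-of-fit vector GOF.
fit_euclid    <- cmdscale(dist(toy, method = "euclidean"), k = 2, eig = TRUE)
fit_manhattan <- cmdscale(dist(toy, method = "manhattan"), k = 2, eig = TRUE)

# GOF[1] divides by the sum of absolute eigenvalues,
# GOF[2] divides by the sum of positive eigenvalues only.
gof <- rbind(euclidean = fit_euclid$GOF, manhattan = fit_manhattan$GOF)
colnames(gof) <- c("abs_eigenvalues", "positive_eigenvalues")
print(round(gof, 4))

# Count clearly negative eigenvalues in the Manhattan solution.
n_negative <- sum(fit_manhattan$eig < -1e-8)
print(n_negative)
```

For Euclidean distances the two goodness-of-fit values coincide, because all eigenvalues are non-negative up to rounding error. A gap between them in the Manhattan row would indicate how far that distance matrix departs from a Euclidean configuration.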
When the Euclidean metric is used, variance-oriented PCA and distance-based classical MDS yield essentially the same configuration: classical MDS applied to Euclidean distances between centered observations reproduces the principal component scores up to a reflection of the axes. The two approaches therefore diverge only once a different metric is chosen, and the configuration is sensitive to that choice because each metric alters the geometric relationships among observations. Although the Manhattan distance slightly sharpens the separation between species, the overlap remains considerable, indicating that the groups are not strongly distinct in the reduced-dimensional space.
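The agreement between PCA and Euclidean classical MDS can be verified numerically. The following sketch uses synthetic data rather than the coffee dataset (the names `toy`, `pca_scores_toy`, and `mds_points_toy` are illustrative) and compares absolute coordinates, since the sign of each axis is arbitrary in both methods.

```r
# Synthetic data, not the coffee dataset: three correlated variables and one noise variable.
set.seed(42)
n <- 60
latent <- rnorm(n)
toy <- cbind(
  a = latent + rnorm(n, sd = 0.4),
  b = latent + rnorm(n, sd = 0.4),
  c = latent + rnorm(n, sd = 0.6),
  d = rnorm(n)
)
toy_scaled <- scale(toy)

# First two principal component scores.
pca_scores_toy <- prcomp(toy_scaled)$x[, 1:2]

# Classical MDS on Euclidean distances, two dimensions.
mds_points_toy <- cmdscale(dist(toy_scaled, method = "euclidean"), k = 2)

# Axis signs are arbitrary in both methods, so compare absolute coordinates.
max_abs_diff <- max(abs(abs(pca_scores_toy) - abs(mds_points_toy)))
print(max_abs_diff)
```

The largest discrepancy should be at the level of floating-point error, which is why the Euclidean MDS plot and the PCA score plot look alike apart from possible mirroring.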
To sum up, this study applied Principal Component Analysis (PCA), alongside MDS, to explore the structure of sensory evaluation variables. The results show that the first component accounts for 58.85% of the total variance, indicating that a dominant latent dimension drives most of the variation in the dataset. The second and third components account for 12.60% and 7.65% of the variance, respectively. Together, the first three components explain about 79% of the total variability, meaning that most of the information in the dataset can be represented in a reduced-dimensional space. The loadings show that several sensory attributes move in similar directions, indicating shared patterns across evaluation criteria. This suggests that the sensory scores are not independent of each other but are structured around broader latent characteristics.
Overall, PCA provides a clear and interpretable low-dimensional representation of the data, making it easier to understand the main patterns of variation while retaining most of the information (about 71% of the variance in two dimensions and about 79% in three).