Modern datasets often contain numerous interrelated variables that interact in complex and non-linear ways. Understanding such high-dimensional data is challenging because relationships between variables cannot easily be visualized or interpreted in their original form. Dimension reduction techniques address this problem by transforming high-dimensional data into lower-dimensional representations while preserving meaningful structural relationships.
This project applies a pure nonlinear dimension reduction technique, t-Distributed Stochastic Neighbor Embedding (t-SNE), to the Wine Quality dataset. Unlike linear techniques such as Principal Component Analysis (PCA), t-SNE preserves local neighborhood structure and captures complex, non-linear patterns within the data. The objective is to explore whether wines with similar physicochemical properties cluster together according to their quality ratings when projected into two dimensions.
In addition to structural visualization, this study incorporates Association Rule Mining to identify interpretable relationships between categorized physicochemical properties and wine quality scores. While t-SNE reveals hidden geometric structure in the data, association rules provide explicit IF–THEN relationships that explain how combinations of chemical properties relate to specific quality outcomes.
By combining nonlinear dimension reduction with rule-based pattern discovery, this analysis provides both visual insight and interpretable knowledge about the relationship between chemical composition and perceived wine quality.
The Wine Quality dataset contains physicochemical measurements of red wine samples along with a quality score assigned by experts. The dataset includes 11 numeric features such as acidity, alcohol content, and pH, and a discrete quality rating ranging from 3 to 8.
This dataset is commonly used in machine learning and exploratory data analysis to study how chemical properties influence perceived wine quality.
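# Load the packages used throughout the analysis
library(Rtsne)      # t-SNE
library(ggplot2)    # plotting
library(arules)     # association rule mining
library(arulesViz)  # rule visualization
# Read the semicolon-delimited red wine data from the UCI repository
wine <- read.csv(
  "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv",
  sep = ";"
)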
# Separate features and quality
wine_features <- wine[, -ncol(wine)]
wine_quality <- wine$quality
# Scale numeric features
wine_scaled <- scale(wine_features)
# Identify and remove duplicate rows
unique_rows <- !duplicated(wine_scaled)
wine_unique <- wine_scaled[unique_rows, ]
quality_unique <- wine_quality[unique_rows]
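Rtsne stops with an error when its input contains duplicate rows unless `check_duplicates = FALSE` is set, which is presumably why duplicates are dropped here. The sketch below uses a small made-up matrix (not the wine data) to show how `duplicated()` flags repeated rows and how reusing the same logical index keeps the labels aligned with the rows that remain. All `toy_*` names are illustrative.

```r
# Toy feature matrix (hypothetical values); rows 1 and 3 are identical
toy_features <- matrix(
  c(1.0, 2.0,
    3.0, 4.0,
    1.0, 2.0,
    5.0, 6.0),
  ncol = 2, byrow = TRUE
)
toy_labels <- c(5, 6, 5, 7)

# Scaling is column-wise, so identical rows stay identical afterwards
toy_scaled <- scale(toy_features)

# duplicated() on a matrix compares whole rows; the first occurrence is kept
keep <- !duplicated(toy_scaled)

# Apply the same index to features and labels so they stay in step
toy_unique <- toy_scaled[keep, , drop = FALSE]
labels_unique <- toy_labels[keep]

print(keep)
print(nrow(toy_unique))
```

In the report's data, the same step reduces the 1599 samples to the 1359 rows reported in the Rtsne log.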
set.seed(123)
# Run t-SNE; Rtsne's default PCA preprocessing is left on ("Performing PCA" in the log)
tsne_result <- Rtsne(
  wine_unique,
  dims = 2,
  perplexity = 30,
  max_iter = 500,
  verbose = TRUE
)
## Performing PCA
## Read the 1359 x 11 data matrix successfully!
## OpenMP is working. 1 threads.
## Using no_dims = 2, perplexity = 30.000000, and theta = 0.500000
## Computing input similarities...
## Building tree...
## Done in 0.22 seconds (sparsity = 0.093749)!
## Learning embedding...
## Iteration 50: error is 73.471960 (50 iterations in 0.15 seconds)
## Iteration 100: error is 70.789156 (50 iterations in 0.14 seconds)
## Iteration 150: error is 70.783163 (50 iterations in 0.14 seconds)
## Iteration 200: error is 70.577097 (50 iterations in 0.13 seconds)
## Iteration 250: error is 70.521138 (50 iterations in 0.12 seconds)
## Iteration 300: error is 1.465603 (50 iterations in 0.12 seconds)
## Iteration 350: error is 1.275684 (50 iterations in 0.14 seconds)
## Iteration 400: error is 1.218266 (50 iterations in 0.13 seconds)
## Iteration 450: error is 1.192088 (50 iterations in 0.13 seconds)
## Iteration 500: error is 1.180989 (50 iterations in 0.13 seconds)
## Fitting performed in 1.33 seconds.
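The first log line, `Performing PCA`, reflects Rtsne's default `pca = TRUE` preprocessing. Assuming the package default of `initial_dims = 50` was not overridden, a dataset with only 11 columns keeps all of its principal components, so the step amounts to a rotation rather than a reduction. The sketch below uses random stand-in data (an assumption, not the wine measurements) to illustrate that a full PCA rotation leaves pairwise Euclidean distances, from which t-SNE builds its input similarities, unchanged.

```r
set.seed(42)

# Stand-in data: 20 samples, 11 features, scaled like the wine features
toy <- scale(matrix(rnorm(20 * 11), nrow = 20, ncol = 11))

# Full PCA: with no components dropped, the scores are a rotation of the data
pca_scores <- prcomp(toy, center = TRUE, scale. = FALSE)$x

# Compare pairwise distances before and after the rotation
max_abs_diff <- max(abs(dist(toy) - dist(pca_scores)))

print(ncol(pca_scores))  # number of components retained
print(max_abs_diff)      # expected to be at floating-point noise level
```

Passing `pca = FALSE` to `Rtsne()` would skip this preprocessing step entirely.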
# Create data frame for plotting
tsne_df <- data.frame(
  Dim1 = tsne_result$Y[, 1],
  Dim2 = tsne_result$Y[, 2],
  Quality = factor(quality_unique)
)
ggplot(tsne_df, aes(Dim1, Dim2, color = Quality)) +
  geom_point(alpha = 0.7, size = 2) +
  theme_minimal() +
  labs(
    title = "t-SNE Projection of Wine Quality Data",
    subtitle = "Duplicate observations removed prior to dimension reduction",
    x = "t-SNE Dimension 1",
    y = "t-SNE Dimension 2",
    color = "Wine Quality"
  ) +
  scale_color_viridis_d(option = "plasma")
While t-SNE provides a visual representation of structural relationships in high-dimensional space, association rule mining identifies interpretable patterns between physicochemical properties and wine quality.
Association rules reveal relationships of the form:
IF (conditions) THEN (outcome)
In this analysis, we explore rules that predict wine quality based on categorized chemical properties.
# Create categorical version of dataset
wine_cat <- wine
wine_cat[, -ncol(wine_cat)] <- lapply(
  wine_cat[, -ncol(wine_cat)],
  function(x) {
    cut(
      x,
      breaks = quantile(x, probs = c(0, 0.33, 0.66, 1), na.rm = TRUE),
      include.lowest = TRUE,
      labels = c("Low", "Medium", "High")
    )
  }
)
wine_cat$quality <- as.factor(wine_cat$quality)
wine_trans <- as(wine_cat, "transactions")
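To make the categorization concrete, the sketch below applies the same `cut()`/`quantile()` pattern to a made-up alcohol vector and then builds the `variable=level` item labels that appear in the rules (for example `alcohol=Low`). The toy values and the helper name `to_terciles` are illustrative assumptions; the label construction mimics, rather than calls, the arules coercion.

```r
# Hypothetical helper mirroring the binning used in the report
to_terciles <- function(x) {
  cut(
    x,
    breaks = quantile(x, probs = c(0, 0.33, 0.66, 1), na.rm = TRUE),
    include.lowest = TRUE,
    labels = c("Low", "Medium", "High")
  )
}

# Made-up alcohol values (% vol), not taken from the dataset
alcohol <- c(9.0, 9.4, 9.8, 10.2, 10.6, 11.0, 11.4, 11.8, 12.2)
alcohol_level <- to_terciles(alcohol)

# Item labels in the "variable=level" style seen in the rule output
items <- paste0("alcohol=", alcohol_level)

print(table(alcohol_level))
print(items[c(1, 9)])
```

Because the breaks are data-driven, a variable with many tied values can yield non-unique breaks, in which case `cut()` raises an error; that did not happen for the wine features here, since all 33 feature items were created.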
# Mine rules whose right-hand side is restricted to a quality level
rules <- apriori(
  wine_trans,
  parameter = list(supp = 0.05, conf = 0.6),
  appearance = list(
    rhs = paste0("quality=", unique(wine_cat$quality)),
    default = "lhs"
  )
)
## Apriori
##
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.6    0.1    1 none FALSE            TRUE       5    0.05      1
##  maxlen target  ext
##      10  rules TRUE
##
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
##
## Absolute minimum support count: 79
##
## set item appearances ...[6 item(s)] done [0.00s].
## set transactions ...[39 item(s), 1599 transaction(s)] done [0.00s].
## sorting and recoding items ... [36 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.01s].
## writing ... [106 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
rules_sorted <- sort(rules, by = "lift", decreasing = TRUE)
inspect(head(rules_sorted, 10))
##      lhs                            rhs         support    confidence coverage   lift     count
## [1]  {total.sulfur.dioxide=High,
##       sulphates=Low,
##       alcohol=Low}               => {quality=5} 0.06191370 0.8761062  0.07066917 2.057113    99
## [2]  {citric.acid=Medium,
##       sulphates=Low,
##       alcohol=Low}               => {quality=5} 0.05440901 0.8365385  0.06504065 1.964207    87
## [3]  {total.sulfur.dioxide=High,
##       pH=Low,
##       alcohol=Low}               => {quality=5} 0.06128831 0.8305085  0.07379612 1.950049    98
## [4]  {volatile.acidity=High,
##       total.sulfur.dioxide=High,
##       alcohol=Low}               => {quality=5} 0.05003127 0.8247423  0.06066291 1.936509    80
## [5]  {density=High,
##       sulphates=Low,
##       alcohol=Low}               => {quality=5} 0.05378361 0.8190476  0.06566604 1.923138    86
## [6]  {chlorides=High,
##       total.sulfur.dioxide=High,
##       alcohol=Low}               => {quality=5} 0.06066291 0.8151261  0.07442151 1.913930    97
## [7]  {chlorides=Medium,
##       sulphates=Low,
##       alcohol=Low}               => {quality=5} 0.05503440 0.8073394  0.06816760 1.895647    88
## [8]  {total.sulfur.dioxide=High,
##       density=High,
##       alcohol=Low}               => {quality=5} 0.05878674 0.8034188  0.07317073 1.886442    94
## [9]  {chlorides=High,
##       pH=Low,
##       alcohol=Low}               => {quality=5} 0.06066291 0.8016529  0.07567230 1.882295    97
## [10] {free.sulfur.dioxide=High,
##       total.sulfur.dioxide=High,
##       alcohol=Low}               => {quality=5} 0.07879925 0.7974684  0.09881176 1.872470   126
plot(rules_sorted[1:10], method = "graph", engine = "htmlwidget")
The association rule analysis identifies combinations of physicochemical properties that are strongly associated with specific wine quality ratings.
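Rules are evaluated using three main metrics: support (the share of all wines that satisfy both the antecedent and the consequent), confidence (the share of wines satisfying the antecedent that also satisfy the consequent), and lift (confidence divided by the overall share of wines with that consequent, so values above 1 indicate a positive association).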
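These metrics can be cross-checked by hand against the printed output. The sketch below copies the figures for rule [1] from the rule table (count 99, coverage 0.07066917, lift 2.057113) together with the 1599 transactions reported by `apriori`, then recomputes support and confidence and backs out the overall share of the consequent. The variable names are illustrative.

```r
# Figures copied from the apriori output for rule [1]
n_transactions <- 1599        # total wines (transactions)
rule_count     <- 99          # wines matching both antecedent and consequent
coverage       <- 0.07066917  # share of wines matching the antecedent
lift_reported  <- 2.057113

# Support: share of all transactions that satisfy the whole rule
support <- rule_count / n_transactions

# Confidence: support of the rule divided by support of the antecedent
confidence <- support / coverage

# Lift = confidence / support(consequent), so the consequent's overall
# share can be backed out from the reported lift
rhs_support <- confidence / lift_reported

print(round(support, 4))
print(round(confidence, 4))
print(round(rhs_support, 4))
```

The recomputed support and confidence agree with the printed values, and the implied share of wines rated 5 is roughly 43%, so a lift of about 2 means the antecedent approximately doubles that share.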
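High-lift rules indicate chemical profiles under which a given quality score is considerably more likely than its overall frequency in the data.
In the ten highest-lift rules shown above, every consequent is quality=5 and every antecedent includes alcohol=Low, most often combined with total.sulfur.dioxide=High or sulphates=Low. The strongest rule, {total.sulfur.dioxide=High, sulphates=Low, alcohol=Low} => {quality=5}, covers 99 wines with a confidence of 0.88 and a lift of 2.06. Rules pointing to higher ratings (6 or 7) do not appear among these ten and would need to be inspected separately among the 106 rules mined.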
These findings complement the t-SNE visualization by providing interpretable, rule-based insights into how physicochemical properties influence wine quality.
Conclusion
This project combined nonlinear dimension reduction and association rule mining to analyze the Wine Quality dataset from both structural and interpretative perspectives.
The t-SNE analysis reduced the original 11-dimensional feature space to a two-dimensional representation. The Rtsne log shows that the package's default PCA preprocessing step was run; with only 11 input features this step is expected to retain all components, so the reduction to two dimensions is attributable to t-SNE itself. The resulting visualization revealed meaningful clustering patterns, indicating that wines with similar physicochemical characteristics tend to group together. Higher-quality wines showed noticeable grouping, suggesting that certain chemical profiles are associated with improved sensory evaluation. However, some overlap between quality levels highlights the complexity of wine assessment and the influence of multiple interacting variables.
Complementing the visualization, association rule mining identified specific combinations of categorized chemical properties that are strongly associated with particular quality scores. Using support, confidence, and lift, the analysis revealed interpretable patterns: in the ten highest-lift rules, low alcohol combined with low sulphates, high total sulfur dioxide, or high volatile acidity was associated with a quality score of 5. These rules state explicit, checkable conditions under which a given quality outcome is more likely than its overall frequency.
Overall, the integration of t-SNE and association rule mining demonstrates the value of combining exploratory visualization techniques with rule-based knowledge discovery. While t-SNE uncovers hidden structural relationships in high-dimensional space, association rules translate those patterns into interpretable relationships. Together, these methods provide a comprehensive understanding of how physicochemical properties influence wine quality.
Cortez, P., Cerdeira, A., Almeida, F., Matos, T., & Reis, J. (2009). Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, 47(4), 547–553. https://doi.org/10.1016/j.dss.2009.05.016
Hahsler, M., Grün, B., & Hornik, K. (2005). arules — A computational environment for mining association rules and frequent item sets. Journal of Statistical Software, 14(15), 1–25. https://doi.org/10.18637/jss.v014.i15
Hahsler, M. (2017). arulesViz: Interactive visualization of association rules with R. The R Journal, 9(2), 163–175. https://doi.org/10.32614/RJ-2017-047
van der Maaten, L., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9, 2579–2605.
Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Lin Pedersen, T., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., Takahashi, K., Vaughan, D., Wilke, C., Woo, K., & Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686