Association Rules and Pure Dimension Reduction Using t-SNE on Wine Quality Data

Introduction

Modern datasets often contain numerous interrelated variables that interact in complex and non-linear ways. Understanding such high-dimensional data is challenging because relationships between variables cannot easily be visualized or interpreted in their original form. Dimension reduction techniques address this problem by transforming high-dimensional data into lower-dimensional representations while preserving meaningful structural relationships.

This project applies a pure nonlinear dimension reduction technique, t-Distributed Stochastic Neighbor Embedding (t-SNE), to the Wine Quality dataset. Unlike linear techniques such as Principal Component Analysis (PCA), t-SNE preserves local neighborhood structure and captures complex, non-linear patterns within the data. The objective is to explore whether wines with similar physicochemical properties cluster together according to their quality ratings when projected into two dimensions.

In addition to structural visualization, this study incorporates Association Rule Mining to identify interpretable relationships between categorized physicochemical properties and wine quality scores. While t-SNE reveals hidden geometric structure in the data, association rules provide explicit IF–THEN relationships that explain how combinations of chemical properties relate to specific quality outcomes.

By combining nonlinear dimension reduction with rule-based pattern discovery, this analysis provides both visual insight and interpretable knowledge about the relationship between chemical composition and perceived wine quality.

Dataset Description

The Wine Quality dataset contains physicochemical measurements of 1,599 red wine samples along with a quality score assigned by experts. The dataset includes 11 numeric features such as acidity, alcohol content, and pH, and a discrete quality rating that ranges from 3 to 8 in this sample.

This dataset is commonly used in machine learning and exploratory data analysis to study how chemical properties influence perceived wine quality.

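# Packages used throughout the analysis
library(Rtsne)      # t-SNE
library(ggplot2)    # plotting
library(arules)     # association rule mining
library(arulesViz)  # rule visualization

# Load the red wine data (the file is semicolon-separated)
wine <- read.csv(
  "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv",
  sep = ";"
)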

# Separate features and quality
wine_features <- wine[, -ncol(wine)]
wine_quality <- wine$quality

# Scale numeric features to mean 0 and standard deviation 1
wine_scaled <- scale(wine_features)

# Identify and remove duplicate rows (by default, Rtsne stops with an error on duplicates)
unique_rows <- !duplicated(wine_scaled)
wine_unique <- wine_scaled[unique_rows, ]
quality_unique <- wine_quality[unique_rows]
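
The preprocessing above can be illustrated on a small example. The sketch below uses a made-up four-row matrix (the values and column names are hypothetical, not taken from the dataset) to show what scale() and duplicated() do, and why the quality labels must be filtered with the same logical mask as the features.

```r
# Toy matrix standing in for the wine features (values are made up).
toy <- matrix(
  c(7.4, 0.70, 9.4,
    7.8, 0.88, 9.8,
    7.4, 0.70, 9.4,   # exact copy of row 1
    11.2, 0.28, 9.8),
  ncol = 3, byrow = TRUE,
  dimnames = list(NULL, c("acidity", "volatile", "alcohol"))
)
toy_quality <- c(5, 5, 5, 6)  # hypothetical labels, one per row

# scale() centers each column to mean 0 and rescales it to standard deviation 1.
toy_scaled <- scale(toy)
col_means <- colMeans(toy_scaled)
col_sds <- apply(toy_scaled, 2, sd)

# duplicated() on a matrix compares whole rows and flags repeats after the first.
keep <- as.vector(!duplicated(toy_scaled))

# Apply the same mask to features and labels so they stay aligned.
toy_unique <- toy_scaled[keep, , drop = FALSE]
toy_labels <- toy_quality[keep]
```

Because rows 1 and 3 are identical, only three rows remain, and the label vector shrinks with them.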


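set.seed(123)

# Note: Rtsne's pca argument defaults to TRUE, so a PCA preprocessing step runs first
# (see "Performing PCA" in the log below); pass pca = FALSE to skip it.
tsne_result <- Rtsne(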
  wine_unique,
  dims = 2,
  perplexity = 30,
  max_iter = 500,
  verbose = TRUE
)

## Performing PCA
## Read the 1359 x 11 data matrix successfully!
## OpenMP is working. 1 threads.
## Using no_dims = 2, perplexity = 30.000000, and theta = 0.500000
## Computing input similarities...
## Building tree...
## Done in 0.22 seconds (sparsity = 0.093749)!
## Learning embedding...
## Iteration 50: error is 73.471960 (50 iterations in 0.15 seconds)
## Iteration 100: error is 70.789156 (50 iterations in 0.14 seconds)
## Iteration 150: error is 70.783163 (50 iterations in 0.14 seconds)
## Iteration 200: error is 70.577097 (50 iterations in 0.13 seconds)
## Iteration 250: error is 70.521138 (50 iterations in 0.12 seconds)
## Iteration 300: error is 1.465603 (50 iterations in 0.12 seconds)
## Iteration 350: error is 1.275684 (50 iterations in 0.14 seconds)
## Iteration 400: error is 1.218266 (50 iterations in 0.13 seconds)
## Iteration 450: error is 1.192088 (50 iterations in 0.13 seconds)
## Iteration 500: error is 1.180989 (50 iterations in 0.13 seconds)
## Fitting performed in 1.33 seconds.
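
The "error" values in the log are understood here to be the t-SNE cost, a Kullback-Leibler (KL) divergence between neighbor similarities in the original space and in the embedding. The sharp drop between iterations 250 and 300 is consistent with the end of the early-exaggeration phase, which in Rtsne ends at iteration 250 by default (stop_lying_iter) and inflates the reported cost while it is active. The following base-R sketch shows how such a cost can be computed for toy data. It is a simplified illustration, not the Rtsne implementation: it assumes a single fixed bandwidth sigma for every point, whereas t-SNE tunes the bandwidth per point to match the chosen perplexity, and all function names are hypothetical.

```r
set.seed(1)
# Hypothetical toy data: 6 points in 4 dimensions and a random 2-D layout.
X <- matrix(rnorm(6 * 4), nrow = 6)
Y <- matrix(rnorm(6 * 2), nrow = 6)

# High-dimensional similarities: Gaussian kernel on squared distances.
# Assumption: one fixed sigma for all points (real t-SNE tunes it per point).
high_dim_similarities <- function(X, sigma = 1) {
  D2 <- as.matrix(dist(X))^2
  W <- exp(-D2 / (2 * sigma^2))
  diag(W) <- 0
  P_cond <- W / rowSums(W)               # conditional p_{j|i}; rows sum to 1
  (P_cond + t(P_cond)) / (2 * nrow(X))   # symmetrized joint p_ij; sums to 1
}

# Low-dimensional similarities: heavy-tailed Student-t kernel (1 d.f.).
low_dim_similarities <- function(Y) {
  D2 <- as.matrix(dist(Y))^2
  W <- 1 / (1 + D2)
  diag(W) <- 0
  W / sum(W)
}

# KL divergence between the two similarity distributions.
kl_divergence <- function(P, Q) {
  keep <- P > 0
  sum(P[keep] * log(P[keep] / Q[keep]))
}

P <- high_dim_similarities(X)
Q <- low_dim_similarities(Y)
kl_random <- kl_divergence(P, Q)    # cost of an arbitrary layout
kl_perfect <- kl_divergence(P, P)   # identical distributions give 0
```

t-SNE moves the points in Y by gradient descent to lower this cost, which is why the logged error decreases as the iterations proceed.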

# Create data frame for plotting
tsne_df <- data.frame(
  Dim1 = tsne_result$Y[,1],
  Dim2 = tsne_result$Y[,2],
  Quality = factor(quality_unique)
)

ggplot(tsne_df, aes(Dim1, Dim2, color = Quality)) +
  geom_point(alpha = 0.7, size = 2) +
  theme_minimal() +
  labs(
    title = "t-SNE Projection of Wine Quality Data",
    subtitle = "Duplicate observations removed prior to dimension reduction",
    x = "t-SNE Dimension 1",
    y = "t-SNE Dimension 2",
    color = "Wine Quality"
  ) +
  scale_color_viridis_d(option = "plasma")

Association Rule Analysis

While t-SNE provides a visual representation of structural relationships in high-dimensional space, association rule mining identifies interpretable patterns between physicochemical properties and wine quality.

Association rules reveal relationships of the form:

IF (conditions) THEN (outcome)

In this analysis, we explore rules that predict wine quality based on categorized chemical properties.

# Create categorical version of dataset
wine_cat <- wine

wine_cat[, -ncol(wine_cat)] <- lapply(
  wine_cat[, -ncol(wine_cat)],
  function(x) {
    cut(
      x,
      breaks = quantile(x, probs = c(0, 0.33, 0.66, 1), na.rm = TRUE),
      include.lowest = TRUE,
      labels = c("Low", "Medium", "High")
    )
  }
)

wine_cat$quality <- as.factor(wine_cat$quality)

wine_trans <- as(wine_cat, "transactions")
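
To make the binning step concrete, the sketch below applies the same cut() and quantile() calls to a short vector of made-up alcohol readings (hypothetical values, not taken from the dataset). After conversion to transactions, each factor level becomes an item of the form column=level, such as alcohol=Low, which is the notation that appears in the rules.

```r
# Hypothetical alcohol readings (made-up values).
x <- c(9.0, 9.2, 9.5, 9.8, 10.1, 10.5, 11.0, 11.4, 12.0, 12.8)

# Cut points at the minimum, the 33rd and 66th percentiles, and the maximum.
breaks <- quantile(x, probs = c(0, 0.33, 0.66, 1), na.rm = TRUE)

# include.lowest = TRUE keeps the minimum value in the "Low" bin;
# without it the smallest observation would become NA.
bins <- cut(x, breaks = breaks, include.lowest = TRUE,
            labels = c("Low", "Medium", "High"))
bin_counts <- table(bins)

# Guard worth running on real data: cut() stops with an error if two
# quantiles coincide, which can happen for features with many tied values.
breaks_are_unique <- anyDuplicated(breaks) == 0
```

Quantile-based bins hold roughly equal numbers of observations, so "Low", "Medium", and "High" are relative to this sample rather than to any absolute chemical threshold.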

rules <- apriori(
  wine_trans,
  parameter = list(supp = 0.05, conf = 0.6),
  appearance = list(
    rhs = paste0("quality=", unique(wine_cat$quality)),
    default = "lhs"
  )
)

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.6    0.1    1 none FALSE            TRUE       5    0.05      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 79 
## 
## set item appearances ...[6 item(s)] done [0.00s].
## set transactions ...[39 item(s), 1599 transaction(s)] done [0.00s].
## sorting and recoding items ... [36 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.01s].
## writing ... [106 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

rules_sorted <- sort(rules, by = "lift", decreasing = TRUE)

inspect(head(rules_sorted, 10))

##      lhs                             rhs            support confidence   coverage     lift count
## [1]  {total.sulfur.dioxide=High,                                                                
##       sulphates=Low,                                                                            
##       alcohol=Low}                => {quality=5} 0.06191370  0.8761062 0.07066917 2.057113    99
## [2]  {citric.acid=Medium,                                                                       
##       sulphates=Low,                                                                            
##       alcohol=Low}                => {quality=5} 0.05440901  0.8365385 0.06504065 1.964207    87
## [3]  {total.sulfur.dioxide=High,                                                                
##       pH=Low,                                                                                   
##       alcohol=Low}                => {quality=5} 0.06128831  0.8305085 0.07379612 1.950049    98
## [4]  {volatile.acidity=High,                                                                    
##       total.sulfur.dioxide=High,                                                                
##       alcohol=Low}                => {quality=5} 0.05003127  0.8247423 0.06066291 1.936509    80
## [5]  {density=High,                                                                             
##       sulphates=Low,                                                                            
##       alcohol=Low}                => {quality=5} 0.05378361  0.8190476 0.06566604 1.923138    86
## [6]  {chlorides=High,                                                                           
##       total.sulfur.dioxide=High,                                                                
##       alcohol=Low}                => {quality=5} 0.06066291  0.8151261 0.07442151 1.913930    97
## [7]  {chlorides=Medium,                                                                         
##       sulphates=Low,                                                                            
##       alcohol=Low}                => {quality=5} 0.05503440  0.8073394 0.06816760 1.895647    88
## [8]  {total.sulfur.dioxide=High,                                                                
##       density=High,                                                                             
##       alcohol=Low}                => {quality=5} 0.05878674  0.8034188 0.07317073 1.886442    94
## [9]  {chlorides=High,                                                                           
##       pH=Low,                                                                                   
##       alcohol=Low}                => {quality=5} 0.06066291  0.8016529 0.07567230 1.882295    97
## [10] {free.sulfur.dioxide=High,                                                                 
##       total.sulfur.dioxide=High,                                                                
##       alcohol=Low}                => {quality=5} 0.07879925  0.7974684 0.09881176 1.872470   126

plot(rules_sorted[1:10], method = "graph", engine = "htmlwidget")

Interpretation of Association Rules

The association rule analysis identifies combinations of physicochemical properties that are strongly associated with specific wine quality ratings.

Rules are evaluated using three main metrics:

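Support: Proportion of all wines that satisfy both the conditions and the outcome
Confidence: Proportion of the wines satisfying the conditions that also show the outcome
Lift: Confidence divided by the overall frequency of the outcome; values above 1 mean the outcome is more common under the conditions than in the dataset as a whole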

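High-lift rules identify chemical profiles under which a given quality score is considerably more common than in the dataset overall.

For example, the top-ranked rule, {total.sulfur.dioxide=High, sulphates=Low, alcohol=Low} => {quality=5}, applies to about 6% of the wines (support 0.062), holds for about 88% of the wines that match its conditions (confidence 0.876), and has a lift of 2.06, meaning that quality 5 is roughly twice as common among these wines as in the full dataset. All ten highest-lift rules predict quality 5 and include Low alcohol, most often together with High total sulfur dioxide or Low sulphates.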
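
The three metrics are linked by simple arithmetic, which can be checked against row [1] of the inspect() output. The sketch below copies the count, coverage, and lift from that row and the number of transactions from the apriori log; the variable names are chosen for illustration only.

```r
# Values copied from row [1] of the inspect() output and the apriori log.
n_transactions <- 1599        # transactions reported in the log
rule_count     <- 99          # wines matching both conditions and outcome
coverage       <- 0.07066917  # share of wines matching the conditions
reported_lift  <- 2.057113

# Support: share of all wines matching the whole rule.
support <- rule_count / n_transactions

# Confidence: share of condition-matching wines that also have the outcome.
confidence <- support / coverage

# Lift = confidence / P(outcome), rearranged to recover the base rate
# of quality = 5 implied by the reported numbers.
rhs_support <- confidence / reported_lift
```

The recovered values match the table (support 0.0619, confidence 0.876), and the implied base rate of about 0.426 shows that quality 5 is the outcome for roughly 43% of the wines, so a confidence of 0.876 is about twice that baseline.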

These findings complement the t-SNE visualization by providing interpretable, rule-based insights into how physicochemical properties influence wine quality.

Conclusion

This project combined nonlinear dimension reduction and association rule mining to analyze the Wine Quality dataset from both structural and interpretative perspectives.

The t-SNE analysis reduced the original 11-dimensional feature space to a two-dimensional representation. Although the log shows that Rtsne ran its default PCA preprocessing step, that step should not have reduced the dimensionality here, because the default number of retained components (initial_dims = 50) exceeds the 11 available features; passing pca = FALSE would skip the step entirely. The resulting visualization revealed meaningful clustering patterns, indicating that wines with similar physicochemical characteristics tend to group together. Higher-quality wines showed noticeable grouping, suggesting that certain chemical profiles are associated with improved sensory evaluation. However, some overlap between quality levels highlights the complexity of wine assessment and the influence of multiple interacting variables.

Complementing the visualization, association rule mining identified specific combinations of categorized chemical properties that are strongly associated with particular quality scores. Using support, confidence, and lift metrics, the analysis revealed interpretable patterns involving alcohol content, acidity levels, and sulphates. These rules describe associations rather than causal effects, but they explicitly identify conditions under which certain quality outcomes are more likely.

Overall, the integration of t-SNE and association rule mining demonstrates the value of combining exploratory visualization techniques with rule-based knowledge discovery. While t-SNE uncovers hidden structural relationships in high-dimensional space, association rules translate those patterns into interpretable relationships. Together, these methods provide a comprehensive understanding of how physicochemical properties influence wine quality.

References

Cortez, P., Cerdeira, A., Almeida, F., Matos, T., & Reis, J. (2009). Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, 47(4), 547–553. https://doi.org/10.1016/j.dss.2009.05.016

Hahsler, M., Grün, B., & Hornik, K. (2005). arules — A computational environment for mining association rules and frequent item sets. Journal of Statistical Software, 14(15), 1–25. https://doi.org/10.18637/jss.v014.i15

Hahsler, M., & Chelluboina, S. (2020). arulesViz: Interactive visualization of association rules with R. The R Journal, 12(1), 169–184. https://doi.org/10.32614/RJ-2020-013

van der Maaten, L., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9, 2579–2605.

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Lin Pedersen, T., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., Takahashi, K., Vaughan, D., Wilke, C., Woo, K., & Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686