Introduction

Modern datasets often contain a large number of variables that interact in complex and non-linear ways. Visualizing and interpreting such high-dimensional data presents a significant challenge in data analysis and machine learning. Dimension reduction techniques address this challenge by transforming high-dimensional data into lower-dimensional representations while preserving important structural relationships.

Among these techniques, t-Distributed Stochastic Neighbor Embedding (t-SNE) has become a popular method for exploratory data analysis due to its ability to capture non-linear patterns and preserve local neighborhood structure. Unlike linear methods such as Principal Component Analysis (PCA), t-SNE focuses on maintaining similarities between nearby observations, making it particularly effective for visualizing clusters and patterns in complex datasets.
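As a rough illustration of why t-SNE emphasizes local structure, the sketch below compares the two kernels described by van der Maaten and Hinton (2008): a Gaussian kernel for similarities in the original space and a Student-t kernel with one degree of freedom in the embedding. The distances and the fixed bandwidth of 1 are hypothetical values chosen for illustration. The actual algorithm tunes a per-point bandwidth from the perplexity and normalizes the similarities into probabilities, neither of which is shown here.

```r
# Toy sketch (not the Rtsne implementation): compare how a Gaussian kernel
# and a Student-t kernel weight the same pairwise distances.
d <- c(0.5, 1, 2, 4)             # hypothetical pairwise distances

gauss   <- exp(-d^2 / 2)         # Gaussian kernel, bandwidth sigma = 1 (assumed)
student <- 1 / (1 + d^2)         # Student-t kernel, one degree of freedom

# Unnormalized similarities side by side
round(rbind(distance = d, gaussian = gauss, student_t = student), 4)
```

At larger distances the Student-t kernel keeps far more weight than the Gaussian. Equivalently, a given small similarity corresponds to a larger distance under the heavy-tailed kernel, which lets moderately dissimilar points sit far apart in the two-dimensional map.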

This project applies a pure dimension reduction approach using t-SNE to the Wine Quality dataset, which consists of physicochemical measurements of red wine samples and corresponding quality ratings assigned by expert tasters. The objective of the analysis is to explore whether wines with similar quality scores exhibit similar physicochemical characteristics when projected into a two-dimensional space.

The resulting visualization provides insights into the relationship between chemical composition and perceived wine quality and highlights the potential of non-linear dimension reduction methods in exploratory data analysis.



Dataset Description

The Wine Quality dataset contains physicochemical measurements of red wine samples along with a quality score assigned by experts. The dataset includes 11 numeric features such as acidity, alcohol content, and pH, and a discrete quality rating. Ratings are given on a 0 to 10 scale, and the values observed in this dataset range from 3 to 8.

This dataset is commonly used in machine learning and exploratory data analysis to study how chemical properties influence perceived wine quality.

# Load the packages used below
library(Rtsne)    # t-SNE implementation
library(ggplot2)  # plotting

wine <- read.csv(
  "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv",
  sep = ";"
)

# Separate features and quality
wine_features <- wine[, -ncol(wine)]
wine_quality <- wine$quality


# Scale numeric features

wine_scaled <- scale(wine_features)
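Standardizing matters here because t-SNE works from pairwise distances, so a feature measured on a large numeric scale would otherwise dominate them. The small sketch below shows what scale() does on a toy data frame; the column names and values are hypothetical and are not taken from the wine data.

```r
# Toy data with very different scales (hypothetical values)
toy <- data.frame(
  alcohol      = c(9.4, 9.8, 10.5, 12.1),
  total_sulfur = c(34, 67, 54, 60)
)

# scale() centers each column to mean 0 and rescales it to standard deviation 1
toy_scaled <- scale(toy)

col_means <- colMeans(toy_scaled)       # approximately 0 for every column
col_sds   <- apply(toy_scaled, 2, sd)   # 1 for every column

round(rbind(mean = col_means, sd = col_sds), 6)
```

After scaling, each feature contributes to Euclidean distances on a comparable footing.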


# Identify and remove duplicate rows
unique_rows <- !duplicated(wine_scaled)
wine_unique <- wine_scaled[unique_rows, ]
quality_unique <- wine_quality[unique_rows]
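Duplicates are removed because Rtsne checks for identical rows by default (its check_duplicates argument) and stops if it finds any. The toy sketch below, using made-up values, shows how duplicated() flags repeated rows of a matrix and why the same logical index must also be applied to the labels so that they stay aligned with the remaining rows.

```r
# Toy matrix in which row 3 repeats row 1 (hypothetical values)
m      <- rbind(c(1, 2), c(3, 4), c(1, 2), c(5, 6))
labels <- c("a", "b", "a", "c")

# duplicated() compares whole rows of a matrix; the first occurrence is kept
keep <- !duplicated(m)

m_unique      <- m[keep, , drop = FALSE]
labels_unique <- labels[keep]   # same index keeps labels aligned with rows

keep
labels_unique
```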


set.seed(123)

tsne_result <- Rtsne(
  wine_unique,
  dims = 2,
  perplexity = 30,
  max_iter = 500,
  verbose = TRUE
)
## Performing PCA
## Read the 1359 x 11 data matrix successfully!
## OpenMP is working. 1 threads.
## Using no_dims = 2, perplexity = 30.000000, and theta = 0.500000
## Computing input similarities...
## Building tree...
## Done in 0.37 seconds (sparsity = 0.093749)!
## Learning embedding...
## Iteration 50: error is 73.471960 (50 iterations in 0.24 seconds)
## Iteration 100: error is 70.789156 (50 iterations in 0.22 seconds)
## Iteration 150: error is 70.783163 (50 iterations in 0.21 seconds)
## Iteration 200: error is 70.577097 (50 iterations in 0.20 seconds)
## Iteration 250: error is 70.521138 (50 iterations in 0.21 seconds)
## Iteration 300: error is 1.465603 (50 iterations in 0.28 seconds)
## Iteration 350: error is 1.275684 (50 iterations in 0.25 seconds)
## Iteration 400: error is 1.218266 (50 iterations in 0.19 seconds)
## Iteration 450: error is 1.192088 (50 iterations in 0.26 seconds)
## Iteration 500: error is 1.180989 (50 iterations in 0.20 seconds)
## Fitting performed in 2.25 seconds.
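
The perplexity = 30 setting in the call above can be read as a soft target for the number of effective neighbours each point considers. In the original t-SNE formulation it is defined as 2 raised to the Shannon entropy of a point's conditional similarity distribution. A minimal base R sketch of that definition follows; the perplexity() helper is a hypothetical illustration and not a function from Rtsne.

```r
# Perplexity of a discrete probability distribution p:
# 2 raised to the Shannon entropy (in bits) of p.
perplexity <- function(p) {
  p <- p[p > 0]                 # zero-probability entries contribute nothing
  2^(-sum(p * log2(p)))
}

# A distribution spread evenly over 30 neighbours
uniform_30 <- rep(1 / 30, 30)
perp_uniform <- perplexity(uniform_30)

# A distribution concentrated on a few neighbours has lower perplexity
peaked <- c(0.7, 0.2, 0.1, rep(0, 27))
perp_peaked <- perplexity(peaked)

c(uniform = perp_uniform, peaked = perp_peaked)
```

Separately, the sharp drop in reported error between iterations 250 and 300 in the log is most likely an effect of the early exaggeration phase ending (Rtsne's stop_lying_iter defaults to 250), rather than a sudden improvement in the embedding.
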
# Create data frame for plotting
tsne_df <- data.frame(
  Dim1 = tsne_result$Y[,1],
  Dim2 = tsne_result$Y[,2],
  Quality = factor(quality_unique)
)

ggplot(tsne_df, aes(Dim1, Dim2, color = Quality)) +
  geom_point(alpha = 0.7, size = 2) +
  theme_minimal() +
  labs(
    title = "t-SNE Projection of Wine Quality Data",
    subtitle = "Duplicate observations removed prior to dimension reduction",
    x = "t-SNE Dimension 1",
    y = "t-SNE Dimension 2",
    color = "Wine Quality"
  ) +
  scale_color_viridis_d(option = "plasma")
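
Visual impressions of clustering can be checked with a simple number. One option, sketched here under stated assumptions, is the average fraction of each point's k nearest neighbours in the embedding that share its label, compared against the same score with shuffled labels. The knn_label_agreement() helper and the synthetic data are hypothetical; in this report the inputs would be tsne_result$Y and quality_unique. This check is not part of the original analysis.

```r
# Mean fraction of each point's k nearest neighbours (in the embedding)
# that carry the same label as the point itself.
knn_label_agreement <- function(embedding, labels, k = 10) {
  d <- as.matrix(dist(embedding))
  diag(d) <- Inf                          # a point is not its own neighbour
  agree <- vapply(seq_len(nrow(d)), function(i) {
    nn <- order(d[i, ])[seq_len(k)]       # indices of the k closest points
    mean(labels[nn] == labels[i])
  }, numeric(1))
  mean(agree)
}

# Synthetic stand-in for an embedding: two well-separated groups
set.seed(1)
embedding <- rbind(
  matrix(rnorm(100, mean = 0),  ncol = 2),
  matrix(rnorm(100, mean = 10), ncol = 2)
)
labels <- rep(c("low", "high"), each = 50)

score_true     <- knn_label_agreement(embedding, labels, k = 5)
score_shuffled <- knn_label_agreement(embedding, sample(labels), k = 5)

# Well-separated groups should score close to 1; shuffled labels near chance
c(true_labels = score_true, shuffled_labels = score_shuffled)
```

The shuffled baseline matters for the wine data because the quality classes are unbalanced, so some neighbour agreement is expected by chance alone.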

Conclusion

This project demonstrated a pure dimension reduction approach using t-SNE on the Wine Quality dataset, reducing the 11 standardized features to two dimensions. Note that Rtsne applies a PCA preprocessing step by default, as the "Performing PCA" line in the log shows. Because the data have only 11 features, that step should retain all components, so it rotates the data rather than discarding dimensions before t-SNE.

The results indicate that wines with similar quality scores tend to cluster together, particularly for higher-quality wines. This is consistent with physicochemical properties being related to perceived wine quality, though overlaps between groups highlight the complexity of sensory evaluation. Because t-SNE is an exploratory visualization method, these patterns are suggestive rather than conclusive.

Overall, t-SNE proved to be an effective exploratory tool for visualizing non-linear relationships in high-dimensional data. This approach can be extended to other datasets where understanding hidden structure and similarity patterns is essential.

References

Maaten, L. van der, & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9, 2579–2605.

Cortez, P., Cerdeira, A., Almeida, F., Matos, T., & Reis, J. (2009). Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, 47(4), 547–553.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., … Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., François, R., … Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686.