Predicting wine quality from chemical measurements has become an important task for producers, regulators, and researchers interested in improving consistency and market value. Producers stand to benefit by forecasting whether wines will be rated as “High” or “Low” quality before bottling, allowing interventions during the earliest stages of fermentation and blending. This project explores the Red Wine Quality dataset with two approaches: classification on the original chemical features and classification after dimensionality reduction with principal component analysis (PCA). We compare Random Forest (RF) and support vector machine (SVM) models in both settings to examine how dimensionality reduction affects predictive accuracy and interpretability.
The dataset consists of eleven physicochemical measurements, such as acidity, alcohol content, sulfur dioxide levels, chlorides, and pH, all of which contribute to taste and aroma. Because many of these variables are correlated, models built on the raw features may suffer from multicollinearity and redundant information. This motivates the use of PCA to extract a smaller number of uncorrelated components while still capturing much of the variance in the data.
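As a concrete illustration, here is a minimal sketch of the data-loading and correlation check. The file name, the semicolon delimiter, and the quality threshold used to define the High/Low labels are our assumptions (a cutoff at a quality score of 6 is chosen because it yields the reasonably balanced classes described later), not values fixed by the text.

```python
import pandas as pd

# Assumed file name and delimiter for the UCI red-wine file; adjust to your copy.
wine = pd.read_csv("winequality-red.csv", sep=";")

# Assumed threshold: scores of 6 and above count as "High" quality, below as "Low".
wine["quality_label"] = (wine["quality"] >= 6).map({True: "High", False: "Low"})

# Pairwise correlations among the eleven physicochemical predictors,
# a quick check on the redundancy that motivates PCA.
predictors = wine.drop(columns=["quality", "quality_label"])
print(predictors.corr().round(2))
```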
Our analysis begins with exploratory data analysis (EDA). The dataset has no missing values, and the distribution of high- vs. low-quality wines is reasonably balanced, so we do not need oversampling or class-weight adjustments. The distributions of alcohol and volatile acidity differ markedly between the two groups: high-quality wines have higher alcohol levels and lower volatile acidity than low-quality wines, consistent with established principles in enology. The distribution of alcohol across both quality groups is shown below:
Alcohol appears to be one of the strongest drivers of wine quality, with high-quality wines concentrated at higher alcohol percentages. Volatile acidity, which is associated with sour, vinegary notes, shows the opposite relationship: the lower it is, the better the quality, as the second histogram illustrates.
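A minimal EDA sketch along these lines, reusing the `wine` data frame and `quality_label` column assumed above:

```python
import matplotlib.pyplot as plt

# Confirm there are no missing values and check the High/Low balance.
print(wine.isna().sum())
print(wine["quality_label"].value_counts(normalize=True))

# Overlaid histograms of alcohol and volatile acidity by quality group.
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for label, group in wine.groupby("quality_label"):
    axes[0].hist(group["alcohol"], bins=30, alpha=0.5, label=label)
    axes[1].hist(group["volatile acidity"], bins=30, alpha=0.5, label=label)
axes[0].set_xlabel("alcohol (% vol)")
axes[1].set_xlabel("volatile acidity (g/L)")
axes[0].legend()
plt.tight_layout()
plt.show()
```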
These early insights guide feature interpretation and set expectations for the models. A correlation heatmap (not shown here) supports the same findings: alcohol is positively correlated with quality, and volatile acidity is negatively correlated. We then fit two predictive models on the original features: a Random Forest classifier and an SVM with a radial (RBF) kernel. Random Forest captures nonlinear relationships, quantifies feature importance directly, and tolerates correlated predictors; its feature-importance plot identifies alcohol as the most influential predictor and residual sugar as one of the least. The SVM, tuned over a grid of cost and gamma values on the full set of original features, captures similar nonlinear patterns and achieves comparably good results.
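A hedged sketch of both baseline models follows. The train/test split, the number of trees, and the `C`/`gamma` grids (scikit-learn's `C` plays the role of the cost parameter) are illustrative choices rather than the exact settings used in the project.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X = wine.drop(columns=["quality", "quality_label"])
y = wine["quality_label"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Random Forest on the original features; feature importances come for free.
rf = RandomForestClassifier(n_estimators=500, random_state=42).fit(X_train, y_train)
print("RF accuracy:", rf.score(X_test, y_test))
print(sorted(zip(X.columns, rf.feature_importances_), key=lambda t: -t[1]))

# RBF-kernel SVM tuned over assumed cost (C) and gamma grids.
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
grid = GridSearchCV(
    svm,
    {"svc__C": [0.1, 1, 10, 100], "svc__gamma": [0.01, 0.1, 1]},
    cv=5,
)
grid.fit(X_train, y_train)
print("SVM accuracy:", grid.score(X_test, y_test), grid.best_params_)
```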
To assess whether dimensionality reduction improves performance, we apply PCA to the standardized numeric predictors. The first two principal components capture less than half of the total variance, which suggests that information is dispersed across many dimensions. For modeling, we retain eight principal components. Below is a PCA scatterplot of the first two components.
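A sketch of the PCA step under the same assumptions, standardizing first and then keeping eight components as described:

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize the predictors, then fit PCA on the training data only.
scaler = StandardScaler().fit(X_train)
pca = PCA(n_components=8).fit(scaler.transform(X_train))

# Cumulative explained variance; the first two components stay under half the total.
print(pca.explained_variance_ratio_.cumsum().round(3))

Z_train = pca.transform(scaler.transform(X_train))
Z_test = pca.transform(scaler.transform(X_test))

# Scatterplot of the first two components, colored by quality group.
for label in ["High", "Low"]:
    mask = (y_train == label).to_numpy()
    plt.scatter(Z_train[mask, 0], Z_train[mask, 1], s=10, alpha=0.5, label=label)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.legend()
plt.show()
```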
While some structure is visible, the separation between high- and low-quality wines is limited when the data are projected onto just two dimensions. This suggests that although PCA simplifies visual analysis, it may not enhance predictive power on this dataset. Models trained on the PCA-transformed data perform similarly to those trained on the original features; Random Forest performs slightly better without PCA, consistent with the known ability of tree-based models to handle multicollinearity. Support vector machines show comparable accuracy with and without PCA, but the PCA version sacrifices interpretability, since principal components do not map directly onto chemical properties.
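Refitting both models on the retained components follows the same pattern (again a sketch, reusing the assumed split and grids from above):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Random Forest refit on the eight retained principal components.
rf_pca = RandomForestClassifier(n_estimators=500, random_state=42)
rf_pca.fit(Z_train, y_train)
print("RF + PCA accuracy:", rf_pca.score(Z_test, y_test))

# RBF-kernel SVM on the components, tuned over the same assumed grids.
svm_pca = GridSearchCV(
    make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    {"svc__C": [0.1, 1, 10, 100], "svc__gamma": [0.01, 0.1, 1]},
    cv=5,
).fit(Z_train, y_train)
print("SVM + PCA accuracy:", svm_pca.score(Z_test, y_test))
```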
These results offer practical guidance for wine producers. A Random Forest model could be deployed on the original features as a diagnostic tool, flagging potentially low-quality batches well before production is complete. The importance of alcohol and volatile acidity points to intervention targets: optimizing fermentation to raise alcohol levels, or controlling microbial activity to reduce volatile acidity.
In conclusion, this project demonstrates how machine learning and PCA together offer a powerful framework for evaluating wine quality. While PCA does not significantly improve accuracy in this case, it remains useful for reducing redundancy and visualizing high-dimensional structures. Combining traditional supervised models with dimensionality reduction techniques provides a richer understanding of the data and highlights the trade-offs between accuracy, interpretability, and computational efficiency in real-world decision-making.