Note to Dr. Ong:
Because our group was reorganized on June 14, we started this project late. After we discussed the situation with you, you kindly granted us an extra week, for which we are very grateful.
We sincerely apologize for not completing the preparation for the presentation and demonstration on the original schedule. We sent you an email and a Teams message this week to confirm whether our topic meets your requirements, but we have not yet received a response.
We therefore decided to proceed with the RMarkdown section and submit it on time. We hope to receive your feedback soon confirming whether our topic meets your requirements or whether adjustments are needed, and we will follow up and deliver our presentation once we hear from you.

1. Introduction

Wine quality assessment is typically a subjective process that relies on the experience and judgment of professional wine tasters. However, with the help of structured physicochemical attribute data, machine learning methods offer the possibility of objectively predicting wine quality.

This project uses the white wine dataset from the UCI Machine Learning Repository to predict wine quality with data science methods. We pose two questions:
1. Regression problem: predict wine quality scores from physicochemical attributes.
2. Classification problem: classify wines as high or low quality using the same chemical features.

To this end, we will build regression and classification models, evaluate their performance, and analyze the importance of each feature in wine quality prediction.

2. Dataset Description

The dataset used in this project is the white wine quality dataset from the UCI Machine Learning Repository. It contains 4,898 samples, where each row represents a white wine sample with 11 physicochemical attributes and a quality score rated by human tasters.

The dataset has the following variables:

fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol, quality

There are no missing values, and all variables are numeric. This makes the dataset well-suited for regression and classification tasks.
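
This is easy to verify once the data are loaded (see Section 3.1); a minimal sanity check:

# Quick sanity check: no missing values, all columns numeric
sum(is.na(wine))               # expected: 0
all(sapply(wine, is.numeric))  # expected: TRUE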

3. Exploratory Data Analysis

In this section, we explore the distribution of wine quality scores and investigate how different physicochemical attributes relate to wine quality. We also generate correlation plots to identify which features may have the strongest influence on the target variable.

3.1 Load and Prepare the Dataset

We created a binary variable high_quality for the classification task, labeling samples with a score of 7 or higher as 1 (high quality) and the rest as 0 (low quality).

This classification resulted in 1,060 samples being categorized as high quality and 3,838 as low quality, indicating a significant class imbalance in the dataset.

Although this imbalance poses challenges for model training, we retained this threshold to reflect real-world standards for high-quality wine. To address the imbalance, we will rely on the confusion matrix and metrics such as the F1-score during model evaluation, rather than on accuracy alone.

# Load the packages used throughout this report
library(ggplot2)       # plotting
library(corrplot)      # correlation heatmap
library(caret)         # data partitioning and confusion matrices
library(randomForest)  # random forest models
library(Metrics)       # rmse() and mae()

wine <- read.csv("winequality-white.csv", sep = ";")

# Normalize the column names (read.csv has already replaced spaces with dots)
colnames(wine) <- tolower(gsub(" ", "_", colnames(wine)))

# Create the classification label (score >= 7 marks a high-quality wine)
wine$high_quality <- as.factor(ifelse(wine$quality >= 7, 1, 0))

head(wine, 10)
##    fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1            7.0             0.27        0.36           20.7     0.045
## 2            6.3             0.30        0.34            1.6     0.049
## 3            8.1             0.28        0.40            6.9     0.050
## 4            7.2             0.23        0.32            8.5     0.058
## 5            7.2             0.23        0.32            8.5     0.058
## 6            8.1             0.28        0.40            6.9     0.050
## 7            6.2             0.32        0.16            7.0     0.045
## 8            7.0             0.27        0.36           20.7     0.045
## 9            6.3             0.30        0.34            1.6     0.049
## 10           8.1             0.22        0.43            1.5     0.044
##    free.sulfur.dioxide total.sulfur.dioxide density   ph sulphates alcohol
## 1                   45                  170  1.0010 3.00      0.45     8.8
## 2                   14                  132  0.9940 3.30      0.49     9.5
## 3                   30                   97  0.9951 3.26      0.44    10.1
## 4                   47                  186  0.9956 3.19      0.40     9.9
## 5                   47                  186  0.9956 3.19      0.40     9.9
## 6                   30                   97  0.9951 3.26      0.44    10.1
## 7                   30                  136  0.9949 3.18      0.47     9.6
## 8                   45                  170  1.0010 3.00      0.45     8.8
## 9                   14                  132  0.9940 3.30      0.49     9.5
## 10                  28                  129  0.9938 3.22      0.45    11.0
##    quality high_quality
## 1        6            0
## 2        6            0
## 3        6            0
## 4        6            0
## 5        6            0
## 6        6            0
## 7        6            0
## 8        6            0
## 9        6            0
## 10       6            0
# Check the distribution of the class labels
table(wine$high_quality)
## 
##    0    1 
## 3838 1060
prop.table(table(wine$high_quality))
## 
##         0         1 
## 0.7835851 0.2164149

3.2 Distribution of Wine Quality

The bar chart below shows the distribution of wine quality scores in the dataset. Most white wines are rated 5 or 6, with relatively few rated 7 or higher. This indicates that the distribution is concentrated on medium-quality wines.

# Plot the distribution of quality scores
ggplot(wine, aes(x = quality)) +
  geom_bar(fill = "blue", color = "black") +
  labs(title = "Distribution of Wine Quality Scores",
       x = "Quality Score",
       y = "Count") +
  theme_minimal()

3.3 Correlation Heatmap

The heatmap below shows the correlation coefficients between variables in the white wine dataset. Most features exhibit weak to moderate correlations, indicating that the dataset does not suffer from severe multicollinearity issues.

Several relationships are worth noting:
1. residual.sugar and density exhibit a strong positive correlation (0.84), which makes sense since sugar content affects wine density.
2. free.sulfur.dioxide and total.sulfur.dioxide are positively correlated (0.62), as the latter includes the former.
3. alcohol is negatively correlated with density (-0.78) and weakly negatively correlated with chlorides and volatile.acidity.

Overall, although most variables show low correlation, the heatmap helps identify a few strong linear relationships that may influence wine quality. In particular, the roles of alcohol, residual sugar, and sulfur dioxide will be further explored during the modeling phase.

# Select only the 11 numeric physicochemical predictors
numeric_features <- wine[, 1:11]

# Compute the correlation matrix
cor_matrix <- cor(numeric_features)

# Draw the correlation heatmap
corrplot(cor_matrix, 
         method = "color", 
         type = "upper", 
         col = colorRampPalette(c("white", "salmon", "red", "darkred"))(200),
         tl.col = "black",
         addCoef.col = "black",
         tl.cex = 0.8, 
         number.cex = 0.7)

3.4 Feature vs. Quality

To further explore the relationship between wine quality and its attributes, we will use box plots to visualize changes in feature values at different quality levels. These visualizations will help us identify which features may serve as good predictor variables in subsequent modeling.

3.4.1 Alcohol vs. Quality

The boxplot shows that alcohol content increases with wine quality. Higher-rated wines (scores 7 to 9) tend to have significantly higher alcohol concentrations than lower-rated wines (scores 3 to 5).

The median alcohol level for quality scores 7 and above is clearly higher, and the spread is narrower, suggesting that alcohol is not only a strong positive indicator of quality, but also more consistent in high-quality wines. This pattern supports using alcohol as a key predictive feature in both regression and classification models.

# Boxplot of alcohol content by quality score
ggplot(wine, aes(x = as.factor(quality), y = alcohol)) +
  geom_boxplot(fill = "skyblue") +
  labs(title = "Alcohol Content by Wine Quality",
       x = "Quality Score",
       y = "Alcohol (%)") +
  theme_minimal()

3.4.2 Volatile Acidity vs. Quality

The boxplot reveals a negative relationship between volatile acidity and wine quality. Wines with lower quality scores (3–5) exhibit higher levels of volatile acidity, with greater variability and more outliers. In contrast, high-quality wines (scores 7–9) tend to have lower and more consistent volatile acidity levels.

This suggests that volatile acidity may be a strong negative indicator of wine quality, as higher acidity can lead to undesirable taste characteristics. The trend supports including this variable in both regression and classification models.

# Boxplot of volatile acidity by quality score
ggplot(wine, aes(x = as.factor(quality), y = volatile.acidity)) +
  geom_boxplot(fill = "lightgreen") +
  labs(title = "Volatile Acidity by Wine Quality",
       x = "Quality Score",
       y = "Volatile Acidity") +
  theme_minimal()

3.4.3 Sulphates vs. Quality

The boxplot indicates that sulphate levels show a slight increase with wine quality, although the trend is less distinct compared to alcohol or volatile acidity. Wines rated 7 and above tend to have marginally higher and more variable sulphate content.

This suggests that sulphates may play a secondary role in wine quality—possibly contributing to stability and preservation—but are not a primary driver of quality scores.

# Boxplot of sulphates by quality score
ggplot(wine, aes(x = as.factor(quality), y = sulphates)) +
  geom_boxplot(fill = "salmon") +
  labs(title = "Sulphates by Wine Quality",
       x = "Quality Score",
       y = "Sulphates") +
  theme_minimal()

3.5 Summary

Through exploratory data analysis (EDA), we gained important insights into the distribution of the dataset and the relationships between variables. The quality-score distribution shows that most wines are rated 5 or 6, reflecting an imbalanced but realistic pattern.

The correlation heatmap revealed notable relationships between certain variables. Alcohol content showed a moderate positive correlation with quality, while volatile acidity and density exhibited negative correlations.

Box plots further confirm these trends: alcohol content increases with quality, volatile acidity decreases with quality, and sulphates are slightly higher in high-scoring samples. This indicates that certain chemical properties are indeed associated with the scores.

Overall, this EDA helps identify key predictive variables and confirms that the dataset is suitable for subsequent regression and classification modeling.

4. Modeling

4.1 Partitioning the Dataset

For regression modeling, we separated the dataset into a training set (80%) and a test set (20%). The split was stratified on the target variable quality so that the score distribution remains consistent across both subsets.

set.seed(123)

# Create the partition index, stratified on quality
split_index <- createDataPartition(wine$quality, p = 0.8, list = FALSE)

# Split into training and test sets
train_data <- wine[split_index, ]
test_data <- wine[-split_index, ]

4.2 Regression Model Construction

To address the first research question, predicting wine quality scores from physicochemical attributes, we constructed two regression models: a linear regression model and a random forest regression model.

The linear regression model serves as the benchmark model, offering good interpretability and simplicity, while the random forest model can capture complex interactions between variables. Both models were trained on the previously divided training set and then evaluated.

4.2.1 Linear Regression Model

We trained a linear regression model on the training set, with wine quality as the dependent variable and all physicochemical properties as independent variables; the derived high_quality label is excluded, since it directly encodes the target. The model was then used to make predictions on the test set, and its performance is evaluated in Section 4.3.

# Fit a linear regression model on the training set to predict quality
# (the derived high_quality label is excluded, since it encodes the target)
lm_model <- lm(quality ~ . - high_quality, data = train_data)

# Generate predictions on the test set
lm_predictions <- predict(lm_model, newdata = test_data)

4.2.2 Random Forest Regression Model

We trained a random forest regression model using the training dataset, with wine quality as the target variable. This model captures complex nonlinear relationships among the predictors. It was trained using 500 decision trees, and prediction was performed on the test set. The model’s performance will be evaluated later in Section 4.3.

# Fit a random forest regression model using all physicochemical features
# (high_quality is excluded for the same leakage reason as above)
rf_model <- randomForest(quality ~ . - high_quality, data = train_data, ntree = 500, importance = TRUE)

# Predict on the test set
rf_predictions <- predict(rf_model, newdata = test_data)
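
Because the forest was fitted with importance = TRUE, per-feature importance (one of the goals stated in the Introduction) can be inspected directly once training finishes; a minimal sketch:

# Inspect which features drive the predictions (output not shown)
importance(rf_model)   # %IncMSE and IncNodePurity for each feature
varImpPlot(rf_model, main = "Feature Importance (RF Regression)")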

4.3 Regression Model Evaluation

The evaluation results indicate that the random forest regression model outperforms the linear regression model on every metric. It has a lower RMSE (0.447) and MAE (0.326), indicating smaller prediction errors, and its R² of 0.749 exceeds the linear model's 0.658, indicating a better fit to the test data.

This suggests that the random forest model is better at capturing the nonlinear relationships between the physicochemical characteristics of wine and its score.
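
For reference, with $y_i$ the observed score, $\hat{y}_i$ the predicted score, and $n$ the number of test samples, the error metrics are defined as

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}, \qquad \mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert y_i - \hat{y}_i \rvert,$$

while R² is computed here as the squared Pearson correlation between predictions and actual values, $R^2 = \mathrm{cor}(y, \hat{y})^2$, matching the evaluation function below.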

# Evaluation helper: given predictions and actual values, return RMSE, MAE, and R²
evaluate_regression <- function(predicted, actual) {
  rmse_val <- rmse(actual, predicted)
  mae_val <- mae(actual, predicted)
  r2_val <- cor(actual, predicted)^2
  
  return(data.frame(
    RMSE = rmse_val,
    MAE = mae_val,
    R2 = r2_val
  ))
}

# Evaluate the linear regression model
eval_lm <- evaluate_regression(lm_predictions, test_data$quality)

# Evaluate the random forest model
eval_rf <- evaluate_regression(rf_predictions, test_data$quality)

# Combine the results into a single table
regression_results <- rbind(
  Linear_Regression = eval_lm,
  Random_Forest = eval_rf
)

# Display the results
print(regression_results)
##                        RMSE       MAE        R2
## Linear_Regression 0.5213397 0.4215201 0.6582827
## Random_Forest     0.4473360 0.3258701 0.7494542
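
Beyond the summary metrics, a predicted-versus-actual plot helps visualize the fit; a minimal sketch reusing the objects created above:

# Predicted vs. actual quality for the random forest model (sketch)
ggplot(data.frame(actual = test_data$quality, predicted = rf_predictions),
       aes(x = actual, y = predicted)) +
  geom_jitter(width = 0.15, alpha = 0.3) +
  geom_abline(slope = 1, intercept = 0, color = "red") +
  labs(title = "Random Forest: Predicted vs. Actual Quality",
       x = "Actual Quality Score",
       y = "Predicted Quality Score") +
  theme_minimal()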

4.4 Classification Model Construction

In this section, we construct classification models to address the second research question:
To classify wines as high or low quality using chemical features.

To achieve this, we first transformed the original quality scores into a binary variable high_quality, where values equal to or above 7 are labeled as 1 (high quality), and others as 0 (low quality).
We then split the dataset into training and testing sets (80% / 20%) to ensure unbiased evaluation.

Two classification algorithms are applied and compared:
1. Logistic Regression: a linear, interpretable model that serves as the baseline classifier.
2. Random Forest: a powerful ensemble model capable of capturing nonlinear relationships and feature interactions.

The following subsections present the construction of these models individually.

4.4.1 Logistic Regression Model

In this subsection, we build a logistic regression model to classify wines as high quality (1) or not (0). We first re-create the binary target variable high_quality, labeling samples with a quality score ≥ 7 as 1.

The dataset is then randomly split into a training set (80%) and a test set (20%) to enable proper model evaluation.
A logistic regression model is trained using all physicochemical features.
The model outputs probabilities, which are converted into binary class predictions using a 0.5 threshold.

# Create the target variable high_quality: score >= 7 is high quality (1), otherwise 0
wine$high_quality <- ifelse(wine$quality >= 7, 1, 0)

# Split into training (80%) and test (20%) sets
set.seed(111)
split_index <- sample(1:nrow(wine), 0.8 * nrow(wine))
train_data_cls <- wine[split_index, ]
test_data_cls <- wine[-split_index, ]

# Fit the logistic regression model (quality itself is excluded from the predictors)
logit_model <- glm(high_quality ~ . - quality, data = train_data_cls, family = "binomial")

# Predict probabilities on the test set
logit_probs <- predict(logit_model, newdata = test_data_cls, type = "response")

# Convert probabilities to 0/1 labels using a 0.5 threshold
logit_preds <- ifelse(logit_probs > 0.5, 1, 0)
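
Note that this split uses simple random sampling rather than stratification. Given the class imbalance described in Section 3.1, a stratified split would be a reasonable alternative; a hypothetical variant (not the one used for the results reported below):

# Hypothetical stratified alternative: preserve the high/low class ratio in both subsets
split_index_strat <- createDataPartition(as.factor(wine$high_quality), p = 0.8, list = FALSE)
train_strat <- wine[split_index_strat, ]
test_strat  <- wine[-split_index_strat, ]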

4.4.2 Random Forest Classification Model

In this subsection, we construct a Random Forest classification model to predict whether a wine is high-quality or not.
The model uses all physicochemical features to predict the binary target high_quality.

We train the model using 500 decision trees to ensure stability and robustness.
Predictions are then made on the test dataset, generating final class labels for evaluation in the next section.

# Fit the random forest classification model (quality itself is excluded)
rf_cls_model <- randomForest(
  as.factor(high_quality) ~ . - quality, 
  data = train_data_cls, 
  ntree = 500
)

# Predict class labels on the test set
rf_cls_preds <- predict(rf_cls_model, newdata = test_data_cls)

4.5 Classification Model Evaluation

This subsection evaluates the performance of the two classification models developed in Section 4.4:
Logistic Regression and Random Forest Classifier.

To assess how well each model classifies high-quality wines (quality ≥ 7), we use several metrics derived from the confusion matrix: Accuracy, Sensitivity (Recall), Specificity, Precision, Balanced Accuracy, and F1-Score.
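
For reference, with TP, FP, TN, and FN denoting the confusion-matrix counts (positive class = high quality), these metrics are defined as

$$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}, \qquad \text{Sensitivity} = \frac{TP}{TP + FN}, \qquad \text{Specificity} = \frac{TN}{TN + FP},$$

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Balanced Accuracy} = \frac{\text{Sensitivity} + \text{Specificity}}{2}, \qquad F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Sensitivity}}{\text{Precision} + \text{Sensitivity}}.$$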

By comparing these metrics, we aim to determine which model is more effective at identifying high-quality wines.

4.5.1 Extract Evaluation Metrics

To ensure robustness, we define a custom extraction function with error handling, so that a missing or misnamed field yields NA instead of an error.
This step prepares the evaluation results for tabular comparison in the next subsection.

# Build the confusion matrices (positive class = 1, high quality)
conf_matrix_logit <- confusionMatrix(as.factor(logit_preds), as.factor(test_data_cls$high_quality), positive = "1")
conf_matrix_rf <- confusionMatrix(rf_cls_preds, as.factor(test_data_cls$high_quality), positive = "1")

# Safe extraction helper: returns NA instead of failing on a missing field
safe_get <- function(metric_list, name) {
  if (name %in% names(metric_list)) {
    return(round(as.numeric(metric_list[[name]]), 4))
  } else {
    return(NA)
  }
}

# Logistic regression metrics
logit_accuracy <- safe_get(conf_matrix_logit$overall, "Accuracy")
logit_sensitivity <- safe_get(conf_matrix_logit$byClass, "Sensitivity")
logit_specificity <- safe_get(conf_matrix_logit$byClass, "Specificity")
logit_precision <- safe_get(conf_matrix_logit$byClass, "Pos Pred Value")
logit_balacc <- safe_get(conf_matrix_logit$byClass, "Balanced Accuracy")
logit_f1 <- if (!is.na(logit_precision + logit_sensitivity) && (logit_precision + logit_sensitivity) != 0) {
  round(2 * (logit_precision * logit_sensitivity) / (logit_precision + logit_sensitivity), 4)
} else { NA }

# Random forest metrics
rf_accuracy <- safe_get(conf_matrix_rf$overall, "Accuracy")
rf_sensitivity <- safe_get(conf_matrix_rf$byClass, "Sensitivity")
rf_specificity <- safe_get(conf_matrix_rf$byClass, "Specificity")
rf_precision <- safe_get(conf_matrix_rf$byClass, "Pos Pred Value")
rf_balacc <- safe_get(conf_matrix_rf$byClass, "Balanced Accuracy")
rf_f1 <- if (!is.na(rf_precision + rf_sensitivity) && (rf_precision + rf_sensitivity) != 0) {
  round(2 * (rf_precision * rf_sensitivity) / (rf_precision + rf_sensitivity), 4)
} else { NA }

4.5.2 Generate Final Comparison Table

In this subsection, we organize the evaluation metrics extracted from the two models into a structured comparison table.

# Load knitr for table output
library(knitr)

# Build the comparison table
comparison_table <- data.frame(
  Metric = c("Accuracy", "Sensitivity", "Specificity", "Precision", "Balanced Accuracy", "F1-Score"),
  Logistic_Regression = c(logit_accuracy, logit_sensitivity, logit_specificity, logit_precision, logit_balacc, logit_f1),
  Random_Forest = c(rf_accuracy, rf_sensitivity, rf_specificity, rf_precision, rf_balacc, rf_f1)
)

# Render as a tidy table
knitr::kable(comparison_table, caption = "Comparison of Classification Model Metrics")

Table: Comparison of Classification Model Metrics

| Metric            | Logistic_Regression | Random_Forest |
|:------------------|--------------------:|--------------:|
| Accuracy          |              0.8143 |        0.8816 |
| Sensitivity       |              0.3267 |        0.5545 |
| Specificity       |              0.9409 |        0.9666 |
| Precision         |              0.5893 |        0.8116 |
| Balanced Accuracy |              0.6338 |        0.7605 |
| F1-Score          |              0.4204 |        0.6589 |

4.5.3 Evaluation Summary

The classification results show that the Random Forest model outperforms the Logistic Regression model on every evaluation metric, including accuracy, sensitivity (recall), and F1-score. It handles the class imbalance more effectively and captures complex patterns in the data. Logistic regression, while simple and interpretable, underperforms in detecting the minority (high-quality) class.

5. Conclusion

This study explored the relationship between the physicochemical properties of white wine and its quality. By constructing regression and classification models, we confirmed that chemical composition can be used to predict wine scores and to classify wines as high or low quality. Among the models tested, random forest performed best on both tasks, demonstrating higher accuracy and robustness. These results support the feasibility of data-driven wine quality assessment and offer a practical reference for stakeholders in the wine industry who seek to improve product quality through scientific analysis.