2023-10-30

Introduction

We will explore a dataset that covers various chemical attributes of wines and their associated quality ratings. The dataset comprises fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol, and quality.

The dataset's primary focus is the chemical composition of wines and their resulting quality ratings. Analyzing these attributes can provide insight into what constitutes a high-quality wine.

Problems to Address

  • Predicting Wine Quality: By applying regression and classification models, we aim to predict the quality ratings from the provided chemical features.

  • Identifying Key Factors: We seek to identify which attributes contribute most to higher or lower quality ratings.

  • Quality Classification: By applying a clustering method, we aim to group wines into distinct quality categories.

  • Optimal Chemical Composition: We seek to determine the combination of chemical properties that leads to higher-quality wines.

Data Exploration

summary(wine_data)
##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar  
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600  
##  1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700   1st Qu.: 1.700  
##  Median : 6.800   Median :0.2600   Median :0.3200   Median : 5.200  
##  Mean   : 6.855   Mean   :0.2782   Mean   :0.3342   Mean   : 6.391  
##  3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900   3rd Qu.: 9.900  
##  Max.   :14.200   Max.   :1.1000   Max.   :1.6600   Max.   :65.800  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide    density      
##  Min.   :0.00900   Min.   :  2.00      Min.   :  9.0        Min.   :0.9871  
##  1st Qu.:0.03600   1st Qu.: 23.00      1st Qu.:108.0        1st Qu.:0.9917  
##  Median :0.04300   Median : 34.00      Median :134.0        Median :0.9937  
##  Mean   :0.04577   Mean   : 35.31      Mean   :138.4        Mean   :0.9940  
##  3rd Qu.:0.05000   3rd Qu.: 46.00      3rd Qu.:167.0        3rd Qu.:0.9961  
##  Max.   :0.34600   Max.   :289.00      Max.   :440.0        Max.   :1.0390  
##        pH          sulphates         alcohol         quality     
##  Min.   :2.720   Min.   :0.2200   Min.   : 8.00   Min.   :3.000  
##  1st Qu.:3.090   1st Qu.:0.4100   1st Qu.: 9.50   1st Qu.:5.000  
##  Median :3.180   Median :0.4700   Median :10.40   Median :6.000  
##  Mean   :3.188   Mean   :0.4898   Mean   :10.51   Mean   :5.878  
##  3rd Qu.:3.280   3rd Qu.:0.5500   3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :3.820   Max.   :1.0800   Max.   :14.20   Max.   :9.000

Pre-Processing Data

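library(tidyverse)

# Replace each missing value with the mean of its column
wine_data <- wine_data %>%
  mutate(across(everything(), ~ ifelse(is.na(.x), mean(.x, na.rm = TRUE), .x)))

# Standardize every column to mean 0 and standard deviation 1 (z-scores)
wine_data_scaled <- as.data.frame(scale(wine_data))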

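This code handles missing values by replacing each one with the mean of its column. It then standardizes every column with scale(), which by default centers each variable to mean 0 and rescales it to standard deviation 1 (z-scores, not min-max scaling). Putting the variables on a common scale helps algorithms that are sensitive to the magnitude of the inputs, such as distance-based clustering.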
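
As a minimal sketch of the two pre-processing steps, the base R example below applies mean imputation and then standardization to a small made-up vector. The values and the helper name min_max are illustrative assumptions, not part of the wine data or the original code. Min-max scaling is included only for contrast with what scale() does by default.

```r
# Toy column with one missing value (made-up numbers, not from the wine data)
x <- c(2, 4, NA, 6, 8)

# Mean imputation: replace NA with the mean of the observed values
x_imputed <- ifelse(is.na(x), mean(x, na.rm = TRUE), x)

# Z-score standardization, which is what scale() does by default
x_z <- as.numeric(scale(x_imputed))

# Min-max scaling to [0, 1], shown for contrast (hypothetical helper)
min_max <- function(v) (v - min(v)) / (max(v) - min(v))
x_mm <- min_max(x_imputed)

print(x_imputed)
print(round(x_z, 3))
print(round(x_mm, 3))
```

The standardized values have mean 0 and standard deviation 1 but no fixed range, whereas the min-max values always fall between 0 and 1.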

Visualize Data Distributions (Acidity)

Exploring the scatter plot matrix for ‘Fixed Acidity’, ‘Volatile Acidity’, ‘Citric Acid’, and ‘pH’ in the wine data set presents a preliminary insight into potential associations between these attributes. These plots reveal varied patterns and potential relationships among these key chemical components. While some variables exhibit potential trends that might suggest correlations, others seem relatively independent or display less apparent connections.
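
A scatter plot matrix of this kind can be produced with base R's pairs(), and cor() puts numbers on what the panels suggest. The sketch below uses simulated stand-in data with the same column names, because the original plotting code is not shown. The simulated values, and the negative link between fixed acidity and pH built into them, are assumptions made for illustration only.

```r
# Simulated stand-in for the four acidity-related columns (values are made up)
set.seed(1)
n <- 100
toy <- data.frame(
  fixed.acidity    = rnorm(n, mean = 6.9, sd = 0.8),
  volatile.acidity = rnorm(n, mean = 0.28, sd = 0.1),
  citric.acid      = rnorm(n, mean = 0.33, sd = 0.12)
)
# Make pH depend negatively on fixed acidity so that one panel shows a trend
toy$pH <- 3.9 - 0.1 * toy$fixed.acidity + rnorm(n, sd = 0.1)

acidity_cols <- c("fixed.acidity", "volatile.acidity", "citric.acid", "pH")

# Scatter plot matrix: one panel per pair of variables
pairs(toy[, acidity_cols], main = "Acidity-related attributes (simulated)")

# Pairwise Pearson correlations quantify the visual impressions
acidity_cor <- cor(toy[, acidity_cols])
print(round(acidity_cor, 2))
```

With the real data, toy would be replaced by wine_data, and the correlation matrix would show which of the apparent trends are strong and which pairs are close to independent.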

Multiple Linear Regression

multiple_lm <- lm(quality ~ ., data = wine_data)
summary(multiple_lm)
## 
## Call:
## lm(formula = quality ~ ., data = wine_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8348 -0.4934 -0.0379  0.4637  3.1143 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           1.502e+02  1.880e+01   7.987 1.71e-15 ***
## fixed.acidity         6.552e-02  2.087e-02   3.139  0.00171 ** 
## volatile.acidity     -1.863e+00  1.138e-01 -16.373  < 2e-16 ***
## citric.acid           2.209e-02  9.577e-02   0.231  0.81759    
## residual.sugar        8.148e-02  7.527e-03  10.825  < 2e-16 ***
## chlorides            -2.473e-01  5.465e-01  -0.452  0.65097    
## free.sulfur.dioxide   3.733e-03  8.441e-04   4.422 9.99e-06 ***
## total.sulfur.dioxide -2.857e-04  3.781e-04  -0.756  0.44979    
## density              -1.503e+02  1.907e+01  -7.879 4.04e-15 ***
## pH                    6.863e-01  1.054e-01   6.513 8.10e-11 ***
## sulphates             6.315e-01  1.004e-01   6.291 3.44e-10 ***
## alcohol               1.935e-01  2.422e-02   7.988 1.70e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7514 on 4886 degrees of freedom
## Multiple R-squared:  0.2819, Adjusted R-squared:  0.2803 
## F-statistic: 174.3 on 11 and 4886 DF,  p-value: < 2.2e-16

Analysis Summary

  • Model Significance: The overall model is statistically significant (F = 174.3 on 11 and 4886 degrees of freedom, p < 2.2e-16), indicating that at least one of the predictors is related to the response variable.
  • Model Fit: The model explains approximately 28.19% of the variance in the wine quality ratings based on the chemical attributes considered (multiple R-squared: 0.2819; adjusted R-squared: 0.2803).
  • Residuals: The residuals (prediction errors) are roughly centered on zero (median -0.0379, quartiles -0.4934 and 0.4637), and the residual standard error is 0.7514. The summary output alone does not establish that the residuals are normally distributed.
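
The fit statistics in the summary follow directly from the residuals. For the wine model, with R-squared = 0.2819, n = 4898 observations (4886 residual degrees of freedom plus 12 estimated coefficients), and p = 11 predictors, the adjusted value is 1 - (1 - 0.2819) × 4897 / 4886 ≈ 0.2803, which matches the output. The sketch below reproduces these quantities by hand on simulated data and adds a separate normality check. The toy data and variable names are assumptions for illustration.

```r
# Simulated regression data (made up; stands in for the wine data)
set.seed(42)
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- 1 + 0.5 * toy$x1 - 0.3 * toy$x2 + rnorm(n)

fit <- lm(y ~ ., data = toy)
s <- summary(fit)

res <- residuals(fit)
p <- length(coef(fit)) - 1               # number of predictors
rss <- sum(res^2)                        # residual sum of squares
tss <- sum((toy$y - mean(toy$y))^2)      # total sum of squares

r2 <- 1 - rss / tss
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
rse <- sqrt(rss / (n - p - 1))           # residual standard error

# Normality is a separate question: inspect a Q-Q plot and run a formal test
qqnorm(res)
qqline(res)
sw <- shapiro.test(res)                  # requires between 3 and 5000 values

print(c(r2 = r2, adj_r2 = adj_r2, rse = rse))
print(sw$p.value)
```

Least-squares residuals from a model with an intercept always average to zero, so a mean near zero says nothing about normality. The Q-Q plot or a test such as Shapiro-Wilk is what addresses that question.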

Consideration:

  • Volatile acidity (t = -16.37), residual sugar (t = 10.83), alcohol (t = 7.99), and density (t = -7.88) have the largest absolute t values, followed by pH, sulphates, and free sulfur dioxide. All of these are significant at the 0.001 level, and fixed acidity is significant at the 0.01 level.
  • Citric acid (p = 0.818), chlorides (p = 0.651), and total sulfur dioxide (p = 0.450) are not statistically significant in this model.
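
One way to read off which predictors matter is to sort the coefficient table by absolute t value and filter by p-value, rather than scanning the printed summary. The sketch below does this on simulated data. The predictor names x_strong, x_weak, and x_noise are hypothetical, and the 0.05 threshold is a conventional choice rather than something fixed by the analysis above.

```r
# Simulated data: one strong effect, one weak effect, one pure-noise predictor
set.seed(7)
n <- 300
toy <- data.frame(x_strong = rnorm(n), x_weak = rnorm(n), x_noise = rnorm(n))
toy$y <- 2 * toy$x_strong + 0.3 * toy$x_weak + rnorm(n)

fit <- lm(y ~ ., data = toy)

# Coefficient table as a data frame, without the intercept row
coefs <- as.data.frame(summary(fit)$coefficients)
coefs <- coefs[rownames(coefs) != "(Intercept)", ]

# Order predictors by absolute t value, largest first
ranked <- coefs[order(-abs(coefs[["t value"]])), ]

# Predictors whose p-value falls below the chosen threshold
significant <- rownames(ranked)[ranked[["Pr(>|t|)"]] < 0.05]

print(ranked)
print(significant)
```

The t value measures how many standard errors an estimate lies from zero, so it is comparable across predictors measured in different units, unlike the raw coefficients.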

Visualization

Correlation Matrices

Clustering Datasets

selected_variables <- wine_data[, c("volatile.acidity", "alcohol", "pH", "sulphates")]
scaled_features <- scale(selected_variables)
k <- 3
set.seed(123)
kmeans_model <- kmeans(scaled_features, centers = k)
cluster_assignments <- kmeans_model$cluster
wine_data$cluster <- as.factor(cluster_assignments)
table(wine_data$cluster)
## 
##    1    2    3 
## 1339 1301 2258

We select a smaller subset of variables (volatile acidity, alcohol, pH, and sulphates) for clustering and standardize them before running k-means with k = 3. This focused approach could lead to more distinct and interpretable clusters, since all four variables are significant predictors of quality in the regression above.
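
The choice of k = 3 is not justified in the analysis above. A common heuristic for picking k is the elbow method: run k-means for several values of k and look for the point where the total within-cluster sum of squares stops dropping sharply. The sketch below illustrates this on simulated data with three well-separated groups. The data, the range of k, and nstart = 20 are assumptions for illustration, not settings taken from the original analysis.

```r
# Three simulated, well-separated groups in two dimensions (made-up data)
set.seed(123)
toy <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 4, sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 8, sd = 0.5), ncol = 2)
)
scaled <- scale(toy)

# Total within-cluster sum of squares for k = 1..6
ks <- 1:6
wss <- sapply(ks, function(k) {
  # nstart = 20 reruns k-means from 20 random starts and keeps the best fit
  kmeans(scaled, centers = k, nstart = 20)$tot.withinss
})

# The "elbow" is where adding another cluster stops helping much
plot(ks, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")

print(round(wss, 2))
```

On the wine data, scaled would be replaced by scaled_features. Real data rarely shows an elbow as clean as this simulation, so the plot is a guide rather than a definitive answer.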

Clustering (visualize)

cluster_centers <- aggregate(wine_data[, c("volatile.acidity", "alcohol", "pH", "sulphates")],
                            by = list(cluster = wine_data$cluster), FUN = mean)

library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
ggparcoord(cluster_centers, columns = 2:5, groupColumn = 1) +
  labs(title = "Cluster Centers in Multidimensional Space")
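
Because the stated goal is to group wines into quality categories, a natural follow-up is to check how the clusters relate to the quality ratings. The sketch below shows one way to do that with aggregate() and table(). It uses simulated data in which quality loosely increases with alcohol, so the numbers are made up and say nothing about the actual wine clusters.

```r
# Simulated data: two alcohol groups, with quality loosely tied to alcohol
set.seed(123)
n <- 150
toy <- data.frame(
  alcohol = c(rnorm(n / 2, mean = 9.5, sd = 0.5),
              rnorm(n / 2, mean = 12, sd = 0.5))
)
raw_quality <- 0.6 * toy$alcohol - 0.5 + rnorm(n, sd = 0.7)
toy$quality <- round(pmin(pmax(raw_quality, 3), 9))   # clamp to the 3..9 range

# Cluster on the standardized feature, as in the analysis above
km <- kmeans(scale(toy["alcohol"]), centers = 2, nstart = 20)
toy$cluster <- as.factor(km$cluster)

# Mean quality per cluster, and the full cluster-by-quality cross-tabulation
quality_by_cluster <- aggregate(quality ~ cluster, data = toy, FUN = mean)
quality_table <- table(cluster = toy$cluster, quality = toy$quality)

print(quality_by_cluster)
print(quality_table)
```

Applied to wine_data, the same two calls would show whether the three clusters differ in quality or mainly in chemical profile.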

Conclusion

This analysis of the wine dataset combined summary statistics, multiple linear regression, and k-means clustering to examine the relationship between chemical composition and quality ratings. The regression explained about 28% of the variance in quality, with volatile acidity and alcohol among the strongest predictors. The clustering analysis produced three groups based on volatile acidity, alcohol content, pH, and sulphates, and the parallel coordinate plot of cluster centers showed noticeable differences in the chemical profiles among clusters. Together, these findings point to the importance of specific chemical components, such as volatile acidity and alcohol content, in relation to wine quality. Further exploration and validation of the cluster characteristics, including how they relate to the quality ratings, could offer useful guidance for the wine production industry in identifying the chemical features associated with higher-quality wines.