Dimension reduction based on wine quality

Objective and Methodology

The primary objective of this study is to perform dimensionality reduction using the Principal Component Analysis (PCA) method. The analysis is conducted on a dataset sourced from Kaggle [https://www.kaggle.com/datasets/uciml/red-wine-quality-cortez-et-al-2009], focusing on the quality assessment of red wine. The dataset originates from the publication: P. Cortez, A. Cerdeira, F. Almeida, T. Matos, and J. Reis. “Modeling wine preferences by data mining from physicochemical properties.” Decision Support Systems, Elsevier, 47(4):547–553, 2009.

The dataset comprises 1,599 observations and 12 numerical variables, with input features derived from physicochemical tests and an output variable reflecting sensory evaluations of wine quality.

The input variables include:

1. Fixed acidity

2. Volatile acidity

3. Citric acid

4. Residual sugar

5. Chlorides

6. Free sulfur dioxide

7. Total sulfur dioxide

8. Density

9. pH

10. Sulphates

11. Alcohol

The output variable is:

12. Quality (scored on a scale from 0 to 10; the scores observed in this dataset range from 3 to 8).

Preprocessing and Analysis

Initial preprocessing confirmed that the dataset contains only numerical variables and no missing values. The data were then standardized, and a correlation matrix was computed to identify and remove highly correlated variables, thereby reducing redundancy.

wine<-read.csv("winequality-red.csv", sep=",", dec=".", header=TRUE)
summary(wine) #no missing data
##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide    density      
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00       Min.   :0.9901  
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00       1st Qu.:0.9956  
##  Median :0.07900   Median :14.00       Median : 38.00       Median :0.9968  
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47       Mean   :0.9967  
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00       3rd Qu.:0.9978  
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00       Max.   :1.0037  
##        pH          sulphates         alcohol         quality     
##  Min.   :2.740   Min.   :0.3300   Min.   : 8.40   Min.   :3.000  
##  1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50   1st Qu.:5.000  
##  Median :3.310   Median :0.6200   Median :10.20   Median :6.000  
##  Mean   :3.311   Mean   :0.6581   Mean   :10.42   Mean   :5.636  
##  3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :4.010   Max.   :2.0000   Max.   :14.90   Max.   :8.000
str(wine) # all  variables are numeric/integer
## 'data.frame':    1599 obs. of  12 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

Standardization

Before further analysis, the data were standardized (centered and scaled to unit variance), so that variables measured on larger scales do not dominate the principal components solely because of their units.

library(caret) # provides preProcess()
preproc <- preProcess(wine, method=c("center", "scale")) # center to mean 0, scale to sd 1
wine.s <- predict(preproc, wine)
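As a quick illustration of what centering and scaling does, the sketch below applies base R's scale() to the first ten values of two columns shown in the str(wine) output above. The name toy is hypothetical and not part of the original analysis. scale() is used here instead of caret's preProcess() only to keep the example free of dependencies; for centering plus scaling the two should give the same result.

```r
# First ten values of two columns, copied from the str(wine) output above.
# `toy` is a hypothetical name used only for this illustration.
toy <- data.frame(
  total.sulfur.dioxide = c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102),
  alcohol              = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
)

# Raw standard deviations differ greatly between the two columns,
# so the column on the larger scale would dominate an unscaled PCA.
raw_sd <- sapply(toy, sd)

# Standardize: subtract the column mean, divide by the column standard deviation.
toy_s <- scale(toy, center = TRUE, scale = TRUE)

print(round(raw_sd, 3))
print(round(colMeans(toy_s), 10))      # approximately 0 for each column
print(round(apply(toy_s, 2, sd), 10))  # 1 for each column
```

After standardization both columns have the same unit variance, which is the property the PCA below relies on.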

Correlation

A correlation matrix was computed to check for visible relationships between the variables.

Strong relationships were found between:

  • fixed.acidity and citric.acid/density/pH
  • volatile.acidity and citric.acid
  • free.sulfur.dioxide and total.sulfur.dioxide

The dataset contains both positively and negatively correlated variables. Because of their high correlation with other variables (absolute value above 0.6), “fixed.acidity”, “citric.acid” and “free.sulfur.dioxide” were excluded from further analysis.

wine1 <- wine.s[, !names(wine.s) %in% c("fixed.acidity","free.sulfur.dioxide", "citric.acid")]
# the dataset no longer contains highly correlated variables
##                      volatile.acidity residual.sugar chlorides
## volatile.acidity                1.000          0.032     0.159
## residual.sugar                  0.032          1.000     0.213
## chlorides                       0.159          0.213     1.000
## total.sulfur.dioxide            0.094          0.145     0.130
## density                         0.025          0.422     0.411
## pH                              0.234         -0.090    -0.234
## sulphates                      -0.326          0.038     0.021
## alcohol                        -0.225          0.117    -0.285
## quality                        -0.381          0.032    -0.190
##                      total.sulfur.dioxide density     pH sulphates alcohol
## volatile.acidity                    0.094   0.025  0.234    -0.326  -0.225
## residual.sugar                      0.145   0.422 -0.090     0.038   0.117
## chlorides                           0.130   0.411 -0.234     0.021  -0.285
## total.sulfur.dioxide                1.000   0.129 -0.010    -0.001  -0.258
## density                             0.129   1.000 -0.312     0.161  -0.462
## pH                                 -0.010  -0.312  1.000    -0.080   0.180
## sulphates                          -0.001   0.161 -0.080     1.000   0.207
## alcohol                            -0.258  -0.462  0.180     0.207   1.000
## quality                            -0.197  -0.177 -0.044     0.377   0.479
##                      quality
## volatile.acidity      -0.381
## residual.sugar         0.032
## chlorides             -0.190
## total.sulfur.dioxide  -0.197
## density               -0.177
## pH                    -0.044
## sulphates              0.377
## alcohol                0.479
## quality                1.000
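The screening for highly correlated pairs described above can be sketched in base R. This is not the author's code: it runs on simulated data (a, b and c are hypothetical variables, with b built to be strongly correlated with a) and uses the 0.6 cutoff mentioned in the text.

```r
# Simulated data, for illustration only: `b` is constructed to correlate with `a`.
set.seed(1)
n <- 200
a <- rnorm(n)
toy <- data.frame(a = a, b = a + rnorm(n, sd = 0.3), c = rnorm(n))

r <- cor(toy)

# Look only at the upper triangle so that each pair is reported once
# and the diagonal (always 1) is ignored.
idx <- which(abs(r) > 0.6 & upper.tri(r), arr.ind = TRUE)

high_pairs <- data.frame(
  var1 = rownames(r)[idx[, "row"]],
  var2 = colnames(r)[idx[, "col"]],
  r    = round(r[idx], 3)
)
print(high_pairs)
```

One variable from each flagged pair can then be dropped by name, as was done for wine1 above.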

Checking the dimensions of the dataset

## [1] 1599    9

After preprocessing, the dataset contains 1,599 observations of 9 variables (8 physicochemical variables plus quality).

Principal Component Analysis (PCA)

PCA was employed to reduce the dimensionality of the dataset while retaining the maximum amount of variance and information. The ultimate goal of this dimensionality reduction process is to decrease the dataset’s size, thus simplifying the analysis and visualization, while preserving as much of the original information as possible.

pca1<-prcomp(wine1, center=FALSE, scale.=FALSE) # data were already centered and scaled
pca1
## Standard deviations (1, .., p=9):
## [1] 1.4854575 1.3627168 1.0795932 0.9965988 0.9409623 0.8034452 0.7357594
## [8] 0.6600119 0.5194195
## 
## Rotation (n x k) = (9 x 9):
##                              PC1         PC2         PC3         PC4
## volatile.acidity     -0.18862945  0.48105372 -0.08809078 -0.28057759
## residual.sugar       -0.20053190 -0.13768811  0.73893952 -0.13414784
## chlorides            -0.32759206 -0.26475007 -0.41223446 -0.40187129
## total.sulfur.dioxide -0.25433290  0.05062584  0.35475199 -0.53841056
## density              -0.49167252 -0.18120005  0.19600821  0.32771617
## pH                    0.28297685  0.40799818  0.05591064 -0.33668842
## sulphates            -0.05510541 -0.52518937 -0.22981855 -0.40838526
## alcohol               0.52687790 -0.15333385  0.18445900 -0.24980863
## quality               0.38697469 -0.42230630  0.14844281  0.04420701
##                              PC5          PC6         PC7          PC8
## volatile.acidity      0.39019933 -0.204400794  0.66190763  0.101053625
## residual.sugar        0.39784638 -0.143964365 -0.23382900  0.015944189
## chlorides             0.27968116 -0.322648421 -0.30754381 -0.418937315
## total.sulfur.dioxide -0.66353250  0.001721278  0.13921341 -0.113004981
## density               0.24055357  0.374308156  0.18718796  0.003222442
## pH                    0.25791313  0.638595743 -0.25044985 -0.299684819
## sulphates             0.06521357  0.452207221  0.17049947  0.433983211
## alcohol               0.19892063 -0.284249090 -0.02756719  0.385860954
## quality               0.05252029  0.013768618  0.51540687 -0.611722053
##                              PC9
## volatile.acidity      0.06444947
## residual.sugar        0.37797353
## chlorides            -0.19681929
## total.sulfur.dioxide -0.21115883
## density              -0.58871490
## pH                   -0.10058166
## sulphates             0.27457012
## alcohol              -0.57657900
## quality               0.07157087
## Importance of components:
##                           PC1    PC2    PC3    PC4     PC5     PC6     PC7
## Standard deviation     1.4855 1.3627 1.0796 0.9966 0.94096 0.80345 0.73576
## Proportion of Variance 0.2452 0.2063 0.1295 0.1104 0.09838 0.07172 0.06015
## Cumulative Proportion  0.2452 0.4515 0.5810 0.6914 0.78975 0.86147 0.92162
##                           PC8     PC9
## Standard deviation     0.6600 0.51942
## Proportion of Variance 0.0484 0.02998
## Cumulative Proportion  0.9700 1.00000

The results show that the first 3 principal components already explain almost 60% of the variance (58.1%), while the first 4 explain nearly 70% (69.1%).
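The "Proportion of Variance" and "Cumulative Proportion" rows can be reproduced by hand from the standard deviations reported by prcomp. The sketch below copies those standard deviations from the output above; the variable names are illustrative.

```r
# Standard deviations of the components, copied from the prcomp output above.
sdev <- c(1.4854575, 1.3627168, 1.0795932, 0.9965988, 0.9409623,
          0.8034452, 0.7357594, 0.6600119, 0.5194195)

# The variance of each component is its squared standard deviation (its eigenvalue).
eigenvalues <- sdev^2

# Share of the total variance carried by each component, and the running total.
prop_var <- eigenvalues / sum(eigenvalues)
cum_var  <- cumsum(prop_var)

print(round(rbind(prop_var, cum_var), 4))
```

The third and fourth entries of cum_var correspond to the 58.1% and 69.1% quoted above.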

The optimal number of components can be chosen using Kaiser’s Stopping Rule.

According to Kaiser’s Stopping Rule, only components with an eigenvalue greater than 1 should be retained. An eigenvalue of 1 means that the component carries as much information as a single standardized variable, so for dimension reduction only components with eigenvalues above 1 are worth keeping.

wine1.cov<-cov(wine1)
wine1.eigen<-eigen(wine1.cov)  
wine1.eigen$values
## [1] 2.2065839 1.8569970 1.1655214 0.9932092 0.8854100 0.6455242 0.5413419
## [8] 0.4356157 0.2697966

According to Kaiser’s rule, 3 components should be chosen. The eigenvalue of component 4 (0.993) is close to 1 but still below it.
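Applying Kaiser’s rule amounts to counting the eigenvalues above 1. The sketch below uses the eigenvalues printed above; it also checks that their mean is 1, which is why 1 is the natural cutoff for standardized data. The variable names are illustrative.

```r
# Eigenvalues of the covariance matrix, copied from the output above.
eigenvalues <- c(2.2065839, 1.8569970, 1.1655214, 0.9932092, 0.8854100,
                 0.6455242, 0.5413419, 0.4356157, 0.2697966)

# With 9 standardized variables the eigenvalues sum to 9, so the average is 1:
# a component above 1 carries more variance than an average single variable.
print(round(mean(eigenvalues), 6))

# Kaiser's rule: keep components whose eigenvalue exceeds 1.
keep   <- which(eigenvalues > 1)
n_keep <- length(keep)
print(keep)
print(n_keep)
```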

Unfortunately, 3 components explain only about 58% of the variance, whereas the cumulative explained variance should be between 70% and 90% to be considered satisfactory. Three components therefore fall short of this range.
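The number of components needed to reach the 70% and 90% marks can be read off the cumulative proportions. A minimal sketch, using the eigenvalues reported earlier and illustrative variable names:

```r
# Eigenvalues copied from the output of eigen(cov(wine1)) shown earlier.
eigenvalues <- c(2.2065839, 1.8569970, 1.1655214, 0.9932092, 0.8854100,
                 0.6455242, 0.5413419, 0.4356157, 0.2697966)

cum_var <- cumsum(eigenvalues) / sum(eigenvalues)

# Smallest number of components whose cumulative variance reaches each threshold.
k_70 <- which(cum_var >= 0.70)[1]
k_90 <- which(cum_var >= 0.90)[1]

print(c(k_70 = k_70, k_90 = k_90))
print(round(cum_var[3:5], 4))  # cumulative variance for 3, 4 and 5 components
```

Under these figures the 70% to 90% range corresponds to keeping between 5 and 7 components, while 4 components fall just short of 70%.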

An alternative method to determine the number of principal components is to look at a scree plot, which is a plot of the eigenvalues ordered from largest to smallest. The number of components is chosen at the point beyond which the remaining eigenvalues are all relatively small and of comparable size (Jolliffe 2002; Peres-Neto, Jackson, and Somers 2005).
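A scree plot can be drawn with base R graphics from the eigenvalues reported earlier. This is only a sketch with illustrative names; the original report may have used a different plotting function. The dashed line marks Kaiser's cutoff of 1, and the successive drops between eigenvalues can help in reading the plot.

```r
# Eigenvalues copied from the output of eigen(cov(wine1)) shown earlier.
eigenvalues <- c(2.2065839, 1.8569970, 1.1655214, 0.9932092, 0.8854100,
                 0.6455242, 0.5413419, 0.4356157, 0.2697966)

# Scree plot: eigenvalues in decreasing order against the component number.
plot(seq_along(eigenvalues), eigenvalues, type = "b",
     xlab = "Principal component", ylab = "Eigenvalue", main = "Scree plot")
abline(h = 1, lty = 2)  # Kaiser's cutoff, for reference

# Size of the drop from each component to the next.
drops <- -diff(eigenvalues)
print(round(drops, 3))
```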

Based on the cumulative proportion of variance, the 70% threshold is exceeded only with five principal components. However, four components explain 69.14% of the variance, which is almost 70%, and the eigenvalue of the fourth component falls only slightly below 1 (Kaiser’s Stopping Rule). Three components explain 58.10% of the variance, which could lead to an overly simplified model, whereas the 69.14% achieved by the first four components might be sufficient. Therefore, 4 components were selected as a compromise between the 70% cumulative-variance guideline and Kaiser’s rule.

Analysis of the given components

Below is a correlation plot of the variables. Positively correlated variables point to the same side of the plot, while negatively correlated variables point to opposite sides.

Moreover, the distance between a variable and the origin shows how well that variable is represented on the map: variables further from the origin are better represented.
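The positions on such a plot can be computed by hand. For standardized data, the coordinate of a variable on a component is its loading multiplied by the standard deviation of that component, and the squared distance from the origin measures how well the variable is represented. The sketch below uses the PC1 and PC2 loadings and standard deviations from the prcomp output above; the object names are illustrative.

```r
vars <- c("volatile.acidity", "residual.sugar", "chlorides",
          "total.sulfur.dioxide", "density", "pH", "sulphates",
          "alcohol", "quality")

# Standard deviations and loadings of PC1 and PC2, copied from the output above.
sdev12 <- c(1.4854575, 1.3627168)
loadings12 <- cbind(
  PC1 = c(-0.18862945, -0.20053190, -0.32759206, -0.25433290, -0.49167252,
           0.28297685, -0.05510541,  0.52687790,  0.38697469),
  PC2 = c( 0.48105372, -0.13768811, -0.26475007,  0.05062584, -0.18120005,
           0.40799818, -0.52518937, -0.15333385, -0.42230630)
)
rownames(loadings12) <- vars

# Coordinates on the plot: each column of loadings times that component's sdev.
coords <- sweep(loadings12, 2, sdev12, `*`)

# Squared distance from the origin: quality of representation on the PC1-PC2 plane.
cos2 <- rowSums(coords^2)

print(round(coords, 3))
print(round(sort(cos2, decreasing = TRUE), 3))
```

Variables with opposite signs on a component, such as alcohol and volatile.acidity on PC1, end up on opposite sides of the plot.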

Plot of variables:

Variables:

In this case, examples of positively correlated variables are residual sugar, chlorides and density, while pH and residual sugar appear negatively correlated. The plot suggests that wine quality is positively correlated with alcohol content but negatively correlated with volatile acidity. It also shows that chlorides, residual sugar and total sulfur dioxide lie close to each other, which agrees with the results for the 3rd component obtained later in the graph “Contribution of variables to Dim-3”. No variables lie near the center of the plot, which suggests that they are reasonably well represented.

Graph of individuals:

Individuals with a similar profile are grouped together.

We can observe many individuals grouped together in the center.

Summary:

Variables correlated positively are plotted next to each other, while those negatively correlated are on opposite sides of the plot.

Plots with Contribution of variables based on dimension

library(factoextra) # fviz_contrib()
library(gridExtra)  # grid.arrange()
dim1 <- fviz_contrib(pca1, "var", axes=1, xtickslab.rt=45, fill="purple", color="purple")
dim2 <- fviz_contrib(pca1, "var", axes=2, xtickslab.rt=45, fill="purple", color="purple")
dim3 <- fviz_contrib(pca1, "var", axes=3, xtickslab.rt=45, fill="purple", color="purple")
dim4 <- fviz_contrib(pca1, "var", axes=4, xtickslab.rt=45, fill="purple", color="purple")
grid.arrange(dim1, dim2, dim3, dim4, ncol = 2)

The 1st principal component is dominated by alcohol, density and quality. The 2nd is dominated by sulphates, volatile acidity, quality and pH. The 3rd is dominated by residual sugar, chlorides and total sulfur dioxide, while the 4th is dominated by total sulfur dioxide, sulphates, chlorides and pH.
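The contributions for the first component can be checked by hand. The sketch assumes that a variable's contribution to a component is its squared loading expressed in percent, which is how such plots are commonly computed, and that the reference level is the uniform share 100/9. The loadings are the PC1 column of the prcomp output above; the object names are illustrative.

```r
# PC1 loadings, copied from the prcomp output above.
pc1 <- c(volatile.acidity     = -0.18862945,
         residual.sugar       = -0.20053190,
         chlorides            = -0.32759206,
         total.sulfur.dioxide = -0.25433290,
         density              = -0.49167252,
         pH                   =  0.28297685,
         sulphates            = -0.05510541,
         alcohol              =  0.52687790,
         quality              =  0.38697469)

# Each loading vector has unit length, so the squared loadings sum to 1
# and can be read as percentage contributions.
contrib <- 100 * pc1^2

# Contribution each variable would have if all contributed equally.
expected <- 100 / length(pc1)

above <- sort(contrib[contrib > expected], decreasing = TRUE)
print(round(sort(contrib, decreasing = TRUE), 1))
print(round(above, 1))
```

Under this assumption the variables above the reference level on PC1 are alcohol, density and quality, in line with the description above.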

Based on these charts, three components might be enough, since together they cover all the variables. The 4th component could be omitted, because every variable that contributes strongly to it already appears among the main contributors to the first three components.

Summary

The main aim of this paper was to check whether wine quality data could be described by a smaller number of variables while preserving as much information as possible. To achieve that, PCA was used for dimension reduction. The results suggested that 4 components are a reasonable choice, since they retain almost 70% of the variance. In other words, 4 components in place of the 9 analysed variables (about 44% of the dimensions) explain roughly 70% of the variance, and 3 components retain almost 60% of the information in the analysed dataset. Dimension reduction can therefore be a powerful tool for the analysis of large datasets, and it can be applied to food-related data such as these.