The primary objective of this study is to perform dimensionality reduction using the Principal Component Analysis (PCA) method. The analysis is conducted on a dataset sourced from Kaggle [https://www.kaggle.com/datasets/uciml/red-wine-quality-cortez-et-al-2009], focusing on the quality assessment of red wine. The dataset originates from the publication: P. Cortez, A. Cerdeira, F. Almeida, T. Matos, and J. Reis. “Modeling wine preferences by data mining from physicochemical properties.” Decision Support Systems, Elsevier, 47(4):547–553, 2009.
The dataset comprises 1,599 observations and 12 numerical variables, with input features derived from physicochemical tests and an output variable reflecting sensory evaluations of wine quality.
The input variables include:
1. Fixed acidity
2. Volatile acidity
3. Citric acid
4. Residual sugar
5. Chlorides
6. Free sulfur dioxide
7. Total sulfur dioxide
8. Density
9. pH
10. Sulphates
11. Alcohol
The output variable is:
12. Quality (scored on a scale from 0 to 10).
Initial preprocessing confirmed that the dataset contains only numerical variables and no missing values. The data were then standardized, and a correlation matrix was used to identify and remove highly correlated variables, thereby reducing redundancy.
wine <- read.csv("winequality-red.csv", sep = ",", dec = ".", header = TRUE)
summary(wine)  # no missing data
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free.sulfur.dioxide total.sulfur.dioxide density
## Min. :0.01200 Min. : 1.00 Min. : 6.00 Min. :0.9901
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00 1st Qu.:0.9956
## Median :0.07900 Median :14.00 Median : 38.00 Median :0.9968
## Mean :0.08747 Mean :15.87 Mean : 46.47 Mean :0.9967
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00 3rd Qu.:0.9978
## Max. :0.61100 Max. :72.00 Max. :289.00 Max. :1.0037
## pH sulphates alcohol quality
## Min. :2.740 Min. :0.3300 Min. : 8.40 Min. :3.000
## 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50 1st Qu.:5.000
## Median :3.310 Median :0.6200 Median :10.20 Median :6.000
## Mean :3.311 Mean :0.6581 Mean :10.42 Mean :5.636
## 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :4.010 Max. :2.0000 Max. :14.90 Max. :8.000
str(wine) # all variables are numeric/integer
## 'data.frame': 1599 obs. of 12 variables:
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
Standardization
The data were standardized (centered and scaled to unit variance) before further analysis, so that no feature dominates solely because of its scale.
library(caret)  # provides preProcess()
preproc <- preProcess(wine, method = c("center", "scale"))
wine.s <- predict(preproc, wine)
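As a minimal sketch of what centering and scaling does, the following uses base R's `scale()` on a small made-up data frame (the values are illustrative only and are not taken from the wine data). Each column ends up with mean 0 and standard deviation 1, which is the transformation requested with `method = c("center", "scale")`.

```r
# Toy data on very different scales (values are made up for illustration)
toy <- data.frame(
  density = c(0.9978, 0.9968, 0.9970, 0.9980, 0.9956),
  total.sulfur.dioxide = c(34, 67, 54, 60, 102)
)

# Subtract each column mean and divide by the column standard deviation
toy.s <- as.data.frame(scale(toy, center = TRUE, scale = TRUE))

round(colMeans(toy.s), 10)  # column means are numerically zero
apply(toy.s, 2, sd)         # column standard deviations are 1
```

After this step, a variable measured in tens (total sulfur dioxide) and one that barely moves around 1 (density) enter the analysis with equal weight.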
Correlation
A correlation matrix was computed to see whether any relationships between variables are visible.
The dataset contains both positively and negatively correlated variables. Because of high correlations (absolute value above 0.6), the variables “fixed.acidity”, “citric.acid” and “free.sulfur.dioxide” were excluded from further analysis.
wine1 <- wine.s[, !names(wine.s) %in% c("fixed.acidity","free.sulfur.dioxide", "citric.acid")]
#dataset now does not contain highly correlated data
## volatile.acidity residual.sugar chlorides
## volatile.acidity 1.000 0.032 0.159
## residual.sugar 0.032 1.000 0.213
## chlorides 0.159 0.213 1.000
## total.sulfur.dioxide 0.094 0.145 0.130
## density 0.025 0.422 0.411
## pH 0.234 -0.090 -0.234
## sulphates -0.326 0.038 0.021
## alcohol -0.225 0.117 -0.285
## quality -0.381 0.032 -0.190
## total.sulfur.dioxide density pH sulphates alcohol
## volatile.acidity 0.094 0.025 0.234 -0.326 -0.225
## residual.sugar 0.145 0.422 -0.090 0.038 0.117
## chlorides 0.130 0.411 -0.234 0.021 -0.285
## total.sulfur.dioxide 1.000 0.129 -0.010 -0.001 -0.258
## density 0.129 1.000 -0.312 0.161 -0.462
## pH -0.010 -0.312 1.000 -0.080 0.180
## sulphates -0.001 0.161 -0.080 1.000 0.207
## alcohol -0.258 -0.462 0.180 0.207 1.000
## quality -0.197 -0.177 -0.044 0.377 0.479
## quality
## volatile.acidity -0.381
## residual.sugar 0.032
## chlorides -0.190
## total.sulfur.dioxide -0.197
## density -0.177
## pH -0.044
## sulphates 0.377
## alcohol 0.479
## quality 1.000
Checking the dimensions of the dataset
## [1] 1599 9
After preprocessing, the dataset contains 1,599 observations of 9 variables.
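The correlation-based filtering described above can also be done programmatically rather than by reading a plot. The sketch below is only an illustration under stated assumptions: the data are synthetic, the variable names `x1`, `x2`, `x3` are hypothetical, and the 0.6 cutoff mirrors the threshold used in the text.

```r
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.3)  # built to correlate strongly with x1
x3 <- rnorm(n)                 # independent of x1 and x2
toy <- data.frame(x1, x2, x3)

cm <- cor(toy)

# Keep each pair once (upper triangle) and flag |r| above the 0.6 cutoff
idx <- which(abs(cm) > 0.6 & upper.tri(cm), arr.ind = TRUE)
high <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r    = cm[idx]
)
high
```

One variable from each flagged pair can then be dropped, which is the same idea as the manual exclusion of the three wine variables.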
PCA was employed to reduce the dimensionality of the dataset while retaining the maximum amount of variance and information. The ultimate goal of this dimensionality reduction process is to decrease the dataset’s size, thus simplifying the analysis and visualization, while preserving as much of the original information as possible.
pca1<-prcomp(wine1, center=FALSE, scale.=FALSE)
pca1
## Standard deviations (1, .., p=9):
## [1] 1.4854575 1.3627168 1.0795932 0.9965988 0.9409623 0.8034452 0.7357594
## [8] 0.6600119 0.5194195
##
## Rotation (n x k) = (9 x 9):
## PC1 PC2 PC3 PC4
## volatile.acidity -0.18862945 0.48105372 -0.08809078 -0.28057759
## residual.sugar -0.20053190 -0.13768811 0.73893952 -0.13414784
## chlorides -0.32759206 -0.26475007 -0.41223446 -0.40187129
## total.sulfur.dioxide -0.25433290 0.05062584 0.35475199 -0.53841056
## density -0.49167252 -0.18120005 0.19600821 0.32771617
## pH 0.28297685 0.40799818 0.05591064 -0.33668842
## sulphates -0.05510541 -0.52518937 -0.22981855 -0.40838526
## alcohol 0.52687790 -0.15333385 0.18445900 -0.24980863
## quality 0.38697469 -0.42230630 0.14844281 0.04420701
## PC5 PC6 PC7 PC8
## volatile.acidity 0.39019933 -0.204400794 0.66190763 0.101053625
## residual.sugar 0.39784638 -0.143964365 -0.23382900 0.015944189
## chlorides 0.27968116 -0.322648421 -0.30754381 -0.418937315
## total.sulfur.dioxide -0.66353250 0.001721278 0.13921341 -0.113004981
## density 0.24055357 0.374308156 0.18718796 0.003222442
## pH 0.25791313 0.638595743 -0.25044985 -0.299684819
## sulphates 0.06521357 0.452207221 0.17049947 0.433983211
## alcohol 0.19892063 -0.284249090 -0.02756719 0.385860954
## quality 0.05252029 0.013768618 0.51540687 -0.611722053
## PC9
## volatile.acidity 0.06444947
## residual.sugar 0.37797353
## chlorides -0.19681929
## total.sulfur.dioxide -0.21115883
## density -0.58871490
## pH -0.10058166
## sulphates 0.27457012
## alcohol -0.57657900
## quality 0.07157087
summary(pca1)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.4855 1.3627 1.0796 0.9966 0.94096 0.80345 0.73576
## Proportion of Variance 0.2452 0.2063 0.1295 0.1104 0.09838 0.07172 0.06015
## Cumulative Proportion 0.2452 0.4515 0.5810 0.6914 0.78975 0.86147 0.92162
## PC8 PC9
## Standard deviation 0.6600 0.51942
## Proportion of Variance 0.0484 0.02998
## Cumulative Proportion 0.9700 1.00000
The results show that the first three principal components explain about 58% of the variance, while the first four explain about 69%.
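The proportions in the summary follow directly from the component standard deviations. As a short worked example, the sketch below re-derives them from the nine standard deviations printed in the output above; the variable names are chosen here for illustration.

```r
# Standard deviations of the nine components, copied from the output above
sdev <- c(1.4854575, 1.3627168, 1.0795932, 0.9965988, 0.9409623,
          0.8034452, 0.7357594, 0.6600119, 0.5194195)

var.explained <- sdev^2 / sum(sdev^2)   # proportion of variance per component
cum.explained <- cumsum(var.explained)  # cumulative proportion

round(rbind(var.explained, cum.explained), 4)
```

The third and fourth entries of `cum.explained` round to 0.5810 and 0.6914, matching the cumulative proportions reported in the summary.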
The optimal number of components can be chosen with Kaiser’s Stopping Rule.
According to this rule, only components with eigenvalues greater than 1 should be retained. An eigenvalue of 1 means that a component carries the same amount of information as a single standardized variable, so for dimension reduction only components with eigenvalues above 1 are kept.
wine1.cov<-cov(wine1)
wine1.eigen<-eigen(wine1.cov)
wine1.eigen$values
## [1] 2.2065839 1.8569970 1.1655214 0.9932092 0.8854100 0.6455242 0.5413419
## [8] 0.4356157 0.2697966
According to Kaiser’s rule, three components should be chosen. The eigenvalue of component 4 (0.993) is close to 1 but still below it.
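The cutoff of 1 makes sense because, for standardized data, the eigenvalues sum to the number of variables, so 1 is the average eigenvalue. The sketch below illustrates this on synthetic data (the matrix `raw` is hypothetical and unrelated to the wine data) and also checks that the eigenvalues of the covariance matrix equal the squared standard deviations reported by `prcomp`.

```r
set.seed(42)
# Hypothetical data: 100 observations of 5 variables, two of them related
raw <- matrix(rnorm(500), ncol = 5)
raw[, 2] <- raw[, 1] + rnorm(100, sd = 0.5)
z <- scale(raw)                      # standardize, as done for the wine data

ev <- eigen(cov(z))$values           # eigenvalues of the covariance matrix
pc <- prcomp(z, center = FALSE, scale. = FALSE)

all.equal(ev, pc$sdev^2)             # eigenvalues equal the squared sdev
sum(ev)                              # equals the number of variables (5)
n.keep <- sum(ev > 1)                # Kaiser's rule: keep eigenvalues > 1
n.keep
```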
Unfortunately, three components explain only 58% of the variance. The cumulative explained variance should be between 70% and 90% to be considered satisfactory, so three components fall short of this range.
An alternative way to determine the number of principal components is a scree plot, which shows the eigenvalues ordered from largest to smallest. The number of components is chosen at the point beyond which the remaining eigenvalues are all relatively small and of comparable size (Jolliffe 2002; Peres-Neto, Jackson, and Somers 2005).
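A scree plot of this kind can be drawn with base R graphics. The sketch below is not necessarily the plot used in this study; it simply plots the eigenvalues reported in the Kaiser's rule output above and adds a reference line at 1.

```r
# Eigenvalues copied from the output above
ev <- c(2.2065839, 1.8569970, 1.1655214, 0.9932092, 0.8854100,
        0.6455242, 0.5413419, 0.4356157, 0.2697966)

plot(seq_along(ev), ev, type = "b",
     xlab = "Principal component", ylab = "Eigenvalue",
     main = "Scree plot")
abline(h = 1, lty = 2)  # Kaiser's cutoff, for reference

# Drop in eigenvalue from each component to the next
drops <- -diff(ev)
round(drops, 3)
```

With factoextra loaded, `fviz_eig(pca1)` would draw a comparable chart directly from the PCA object.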
Based on the cumulative proportion of variance, the 70% threshold is exceeded only with five principal components. However, four components explain 69.14% of the variance, which is very close to 70%, and the fourth eigenvalue (0.993) falls only slightly below Kaiser’s cutoff of 1. The first three components explain 58.10%, which could lead to an overly simplified model. Four components were therefore selected as a compromise between the cumulative variance criterion and Kaiser’s rule.
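The trade-off between the two criteria can be checked numerically. In the sketch below, `n_for_target` is a hypothetical helper written for this illustration; it returns the smallest number of components whose cumulative share of variance reaches a given target.

```r
# Eigenvalues copied from the Kaiser's rule output above
ev <- c(2.2065839, 1.8569970, 1.1655214, 0.9932092, 0.8854100,
        0.6455242, 0.5413419, 0.4356157, 0.2697966)
cum <- cumsum(ev) / sum(ev)

# Hypothetical helper: smallest number of components reaching a target share
n_for_target <- function(cum, target) which(cum >= target)[1]

n_for_target(cum, 0.70)  # 5
n_for_target(cum, 0.69)  # 4
sum(ev > 1)              # 3 (Kaiser's rule)
```

A strict 70% target gives five components and Kaiser's rule gives three, so four sits between the two criteria.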
Below is a correlation plot. Positively correlated variables point to the same side of the plot, while negatively correlated variables point to opposite sides.
Moreover, the distance between a variable and the origin shows how well that variable is represented on the map: variables further from the origin are better represented.
Plot of variables:
Variables:
In this case, positively correlated variables include residual sugar, chlorides and density, while pH and residual sugar are negatively correlated. The plot suggests that wine quality is positively correlated with alcohol content and negatively correlated with volatile acidity. Chlorides, residual sugar and total sulfur dioxide lie close to each other, which corresponds to the results for the third component obtained later on, in the chart “Contribution of variables to Dim-3”. No variable lies near the center of the plot, which suggests that the variables are reasonably well represented.
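The coordinates behind such a variable plot can be computed by hand. The sketch below uses synthetic data with hypothetical variables `v1` to `v4`, and assumes the common convention that a variable's coordinate on a component is its loading multiplied by the component's standard deviation. For standardized data this equals the correlation between the variable and the component, and the squared distance from the origin measures how well the variable is represented.

```r
set.seed(7)
# Hypothetical data: 150 observations of 4 variables
raw <- matrix(rnorm(600), ncol = 4,
              dimnames = list(NULL, c("v1", "v2", "v3", "v4")))
raw[, "v2"] <- raw[, "v1"] + rnorm(150, sd = 0.4)   # v2 tracks v1
raw[, "v4"] <- -raw[, "v3"] + rnorm(150, sd = 0.4)  # v4 opposes v3
z  <- scale(raw)
pc <- prcomp(z, center = FALSE, scale. = FALSE)

# Variable coordinates: loading multiplied by the component sdev
coords <- sweep(pc$rotation, 2, pc$sdev, "*")

# For standardized data these equal the variable-component correlations
all.equal(coords, cor(z, pc$x), check.attributes = FALSE)

# Squared distance from the origin on the PC1-PC2 map
cos2.12 <- rowSums(coords[, 1:2]^2)
round(cos2.12, 3)
```

Values of `cos2.12` close to 1 correspond to arrows that nearly reach the unit circle, while values near 0 correspond to variables sitting near the center.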
Graph of individuals:
Individuals with a similar profile are grouped together.
We can observe many individuals grouped together in the center.
Summary:
Variables correlated positively are plotted next to each other, while those negatively correlated are on opposite sides of the plot.
library(factoextra)  # provides fviz_contrib()
library(gridExtra)   # provides grid.arrange()
dim1 <- fviz_contrib(pca1, "var", axes=1, xtickslab.rt=45, fill = "purple",color = "purple")
dim2 <- fviz_contrib(pca1, "var", axes=2, xtickslab.rt=45, fill = "purple",color = "purple")
dim3 <- fviz_contrib(pca1, "var", axes=3, xtickslab.rt=45, fill = "purple",color = "purple")
dim4 <- fviz_contrib(pca1, "var", axes=4, xtickslab.rt=45, fill = "purple",color = "purple")
grid.arrange(dim1, dim2, dim3, dim4, ncol = 2)
The first principal component is driven mainly by alcohol, density and quality. The second is driven by sulphates, volatile acidity, quality and pH. The third is driven by residual sugar, chlorides and total sulfur dioxide, while the fourth is driven by total sulfur dioxide, sulphates, chlorides and pH.
Based on the charts, three components could be enough, since together they cover all the variables. The fourth component could be omitted, because every variable that contributes to it already appears in the previous charts.
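The contribution charts can be cross-checked by hand. As a sketch, the percentage contribution of a variable to a component is taken here as its squared loading divided by the sum of squared loadings, and the reference level is the contribution each variable would have if all nine contributed equally. The loadings below are the PC1 column of the rotation matrix printed earlier.

```r
# PC1 loadings copied from the rotation matrix above
pc1 <- c(volatile.acidity = -0.18862945, residual.sugar = -0.20053190,
         chlorides = -0.32759206, total.sulfur.dioxide = -0.25433290,
         density = -0.49167252, pH = 0.28297685, sulphates = -0.05510541,
         alcohol = 0.52687790, quality = 0.38697469)

contrib  <- 100 * pc1^2 / sum(pc1^2)  # contribution of each variable, in %
expected <- 100 / length(pc1)         # value if all contributed equally

round(sort(contrib, decreasing = TRUE), 1)
names(contrib)[contrib > expected]    # density, alcohol, quality
```

Only alcohol, density and quality exceed the equal-contribution level of about 11.1%, which agrees with the reading of the first chart.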
The main aim of this paper was to check whether the wine quality data could be described by a smaller number of variables while preserving as much information as possible. To achieve that, PCA was used for dimensionality reduction. The results suggest that four components are a reasonable choice, since they retain about 69% of the variance. In other words, 4 components in place of the 9 retained variables (about 44%) explain close to 70% of the variance, and 3 components still keep about 58% of the information in the preprocessed dataset. Dimensionality reduction can therefore be a powerful tool for analyzing large datasets, including food-related data such as this one.