The aim of this project is to use PCA (Principal Component Analysis) for dimension reduction on the Wine Quality data. With many observations and many features, it is difficult to interpret what the data tell us and how the wines differ from one another. Reducing the dimensionality makes the data easier to understand and visualize. The goal is to decrease the number of dimensions while preserving as much information as possible.
The dataset comes from the paper “P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.” and can be found on the UC Irvine Machine Learning Repository website (https://archive.ics.uci.edu/dataset/186/wine+quality). It describes variants of the red “Vinho Verde” wine from the Minho region in northwestern Portugal. Each wine is described by 11 features based on physicochemical tests: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol.
# Loading of the data
dane <- read.csv("Wine.csv", dec=".", header=TRUE, fileEncoding = "windows-1252")
head(dane)
##   fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1           7.4             0.70        0.00            1.9     0.076
## 2           7.8             0.88        0.00            2.6     0.098
## 3           7.8             0.76        0.04            2.3     0.092
## 4          11.2             0.28        0.56            1.9     0.075
## 5           7.4             0.70        0.00            1.9     0.076
## 6           7.4             0.66        0.00            1.8     0.075
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  11                   34  0.9978 3.51      0.56     9.4
## 2                  25                   67  0.9968 3.20      0.68     9.8
## 3                  15                   54  0.9970 3.26      0.65     9.8
## 4                  17                   60  0.9980 3.16      0.58     9.8
## 5                  11                   34  0.9978 3.51      0.56     9.4
## 6                  13                   40  0.9978 3.51      0.56     9.4
# Check whether the data are complete (share of missing values per column)
library(scales)  # percent() is assumed to come from the scales package
missing_in_cols <- sapply(dane, function(x) sum(is.na(x))/nrow(dane))
percent(missing_in_cols)
##        fixed.acidity     volatile.acidity          citric.acid
##                 "0%"                 "0%"                 "0%"
##       residual.sugar            chlorides  free.sulfur.dioxide
##                 "0%"                 "0%"                 "0%"
## total.sulfur.dioxide              density                   pH
##                 "0%"                 "0%"                 "0%"
##            sulphates              alcohol
##                 "0%"                 "0%"
Because the variables are measured on different scales, the data are normalized so that every variable contributes on a comparable scale.
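A minimal sketch of this step, assuming the normalization is a z-score standardization done with base R's `scale()` (the original code for this step is not shown, so this is an assumption). The small data frame reuses three columns of the six rows printed by `head(dane)`; the names `sample_rows` and `sample_scaled` are illustrative only.

```r
# Three columns of the six rows shown by head(dane); illustrative subset only.
sample_rows <- data.frame(
  fixed.acidity        = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4),
  total.sulfur.dioxide = c(34, 67, 54, 60, 34, 40),
  pH                   = c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51)
)

# Centre every column to mean 0 and rescale it to standard deviation 1.
sample_scaled <- scale(sample_rows, center = TRUE, scale = TRUE)

# After standardization all columns share the same scale.
print(round(colMeans(sample_scaled), 10))
print(apply(sample_scaled, 2, sd))
```

Without this step, a variable such as total sulfur dioxide (values in the tens) would dominate the variance compared with pH (values around 3).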
Next, we check the correlations between the variables.
There are only a few statistically significant relationships between the variables. The main variable behind most of the high correlations is fixed acidity, which has a strong positive correlation with density and citric acid and a strong negative correlation with pH.
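A correlation check of this kind can be sketched with base R's `cor()` and `cor.test()`. The built-in `USArrests` data set is used here purely as a stand-in so the example runs without `Wine.csv`; the object names `X`, `R` and `ct` are illustrative, and the 0.05 threshold is an assumed convention rather than something stated in the original.

```r
# Stand-in data (assumption): built-in USArrests replaces the wine data.
X <- USArrests

# Pearson correlation matrix for all pairs of variables.
R <- cor(X)
print(round(R, 2))

# Significance test for a single pair; a p-value below 0.05 is a common
# threshold for calling a correlation statistically significant.
ct <- cor.test(X$Murder, X$Assault)
print(ct$estimate)
print(ct$p.value)
```

For the wine data the same calls would be applied to the data frame `dane`, for example to the pair fixed acidity and pH.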
PCA is a statistical technique used for dimensionality reduction while preserving as much variability in the data as possible. It transforms a dataset with possibly correlated variables into a smaller set of uncorrelated variables called principal components. These principal components are linear combinations of the original variables, ranked by the amount of variance they capture from the data (their eigenvalues).
## PC1 PC2 PC3 PC4
## fixed.acidity 0.48943617 0.110276486 -0.12339211 0.229494292
## volatile.acidity -0.23855443 -0.275251916 -0.44959770 -0.079416197
## citric.acid 0.46367336 0.151893857 0.23816825 0.079642298
## residual.sugar 0.14607440 -0.272057736 0.10122305 0.372892803
## chlorides 0.21227289 -0.148097153 -0.09182809 -0.666290176
## free.sulfur.dioxide -0.03633677 -0.513173176 0.42922675 0.043819328
## total.sulfur.dioxide 0.02350296 -0.569265381 0.32287357 0.034817927
## density 0.39506064 -0.233977990 -0.33947107 0.174401704
## pH -0.43861675 -0.006601052 0.05729530 0.003976728
## sulphates 0.24287570 0.037791051 0.28005347 -0.550506798
## alcohol -0.11328995 0.386559616 0.47095374 0.122755757
## PC5 PC6 PC7 PC8
## fixed.acidity -0.08262836 -0.10139410 0.35021991 0.17745372
## volatile.acidity 0.21860545 -0.41161967 0.53366584 0.07860653
## citric.acid -0.05846339 -0.06937100 -0.10535120 0.37785841
## residual.sugar 0.73211707 -0.04913308 -0.29072890 -0.29982988
## chlorides 0.24660399 -0.30430493 -0.37028355 0.35702801
## free.sulfur.dioxide -0.15908955 0.01390770 0.11653463 0.20419537
## total.sulfur.dioxide -0.22229841 -0.13584831 0.09392513 -0.01831724
## density 0.15695068 0.39098088 0.17040495 0.23914889
## pH 0.26751571 0.52217130 0.02512456 0.56156817
## sulphates 0.22616176 0.38166207 0.44749507 -0.37445342
## alcohol 0.35083320 -0.36141052 0.32785901 0.21769157
## PC9 PC10 PC11
## fixed.acidity 0.194567484 0.24908701 -0.639722770
## volatile.acidity -0.128872645 -0.36615745 -0.002527657
## citric.acid -0.380903481 -0.62176402 0.071303174
## residual.sugar 0.007402285 -0.09292845 -0.183930715
## chlorides 0.111424805 0.21773086 -0.053068991
## free.sulfur.dioxide 0.635693443 -0.24813204 0.052178157
## total.sulfur.dioxide -0.592298277 0.37052182 -0.069227454
## density 0.020751757 0.24016825 0.567171757
## pH -0.166777801 0.01044687 -0.340780959
## sulphates -0.058404653 -0.11228134 -0.069246461
## alcohol 0.037384716 0.30331255 0.314470798
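A loading matrix like the one above can be produced with base R's `prcomp()`. The format of the printed output suggests that function was used, but the fitting code is not shown, so this is an assumption. The sketch below uses the built-in `USArrests` data as a stand-in, and `pca_demo` is an illustrative name.

```r
# Stand-in data (assumption): built-in USArrests replaces the wine data.
X <- USArrests

# PCA on centred and scaled variables (equivalent to PCA on the correlation matrix).
pca_demo <- prcomp(X, center = TRUE, scale. = TRUE)

# Loadings: one column per principal component, one row per original variable.
print(pca_demo$rotation)

# The loading vectors are orthonormal, so this is (numerically) the identity matrix.
print(round(crossprod(pca_demo$rotation), 10))

# The component scores are uncorrelated with each other.
print(round(cor(pca_demo$x), 10))
```

Each column of `rotation` holds the weights of the linear combination that defines one component, which is how the table above should be read.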
The first approach is to retain components with eigenvalues greater than 1 (the Kaiser criterion), because an eigenvalue below 1 indicates that the component explains less variance than a single standardized original variable.
## [1] 3.09826049 1.92594488 1.55137399 1.21332843 0.95929127 0.65955338
## [7] 0.58381193 0.42296415 0.34462890 0.18142991 0.05941268
There are 4 components with eigenvalues over 1. The fifth component comes close to the threshold (0.96) but does not reach it, so this criterion selects 4 components.
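The eigenvalue rule can be reproduced directly from the eigenvalues printed above. This is a small sketch; the vector `eig` is typed in from that output and `n_kaiser` is an illustrative name.

```r
# Eigenvalues as reported in the output above.
eig <- c(3.09826049, 1.92594488, 1.55137399, 1.21332843, 0.95929127,
         0.65955338, 0.58381193, 0.42296415, 0.34462890, 0.18142991,
         0.05941268)

# For standardized data the eigenvalues sum to the number of variables (11),
# so a value of 1 corresponds to the variance of one original variable.
print(sum(eig))

# Keep the components whose eigenvalue exceeds 1.
n_kaiser <- sum(eig > 1)
print(n_kaiser)
```

If `pca.s` is a `prcomp` object (an assumption, since the fitting code is not shown), the same eigenvalues would be available as `pca.s$sdev^2`.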
The second approach uses the scree plot, which visualizes the eigenvalues of the components in descending order. According to the scree plot method, the optimal number of components is the number of bars preceding the point where the line connecting the eigenvalues bends (the “elbow”).
library(factoextra)  # provides fviz_eig() and the other fviz_*() functions
fviz_eig(pca.s, choice = "eigenvalue", ncp = 25, barfill = "slateblue", barcolor = "grey2", linecolor = "black", addlabels = TRUE)
The plot doesn’t give an obvious answer. There is no clear point where the line bends, so we cannot determine the number of components from it.
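A rough numeric companion to the scree plot is the size of the drop between consecutive eigenvalues: a clear elbow would show up as one large drop followed by consistently small ones. The sketch below is only illustrative, not a formal test; `eig` is typed in from the eigenvalue output earlier in this report and `drops` is an illustrative name.

```r
# Eigenvalues as reported earlier in this report.
eig <- c(3.09826049, 1.92594488, 1.55137399, 1.21332843, 0.95929127,
         0.65955338, 0.58381193, 0.42296415, 0.34462890, 0.18142991,
         0.05941268)

# Decrease in eigenvalue from each component to the next.
drops <- -diff(eig)
names(drops) <- paste0("PC", 1:10, "->PC", 2:11)
print(round(drops, 3))
```

The largest drop is the very first one, and the later drops shrink unevenly rather than levelling off at a single point, which is in line with the plot being inconclusive.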
Finally, we check the cumulative percentage of variance explained by the components.
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.7602 1.3878 1.2455 1.1015 0.97943 0.81213 0.76408
## Proportion of Variance 0.2817 0.1751 0.1410 0.1103 0.08721 0.05996 0.05307
## Cumulative Proportion 0.2817 0.4567 0.5978 0.7081 0.79529 0.85525 0.90832
## PC8 PC9 PC10 PC11
## Standard deviation 0.65036 0.58705 0.42595 0.2437
## Proportion of Variance 0.03845 0.03133 0.01649 0.0054
## Cumulative Proportion 0.94678 0.97811 0.99460 1.0000
fviz_eig(pca.s, ncp = 25, barfill = "slateblue", barcolor = "grey2", linecolor = "black", addlabels = TRUE)
The plot indicates that 4 components explain just over 70% of the variance (70.8%), which meets the 70% minimum we were aiming for. This agrees with the eigenvalue criterion, while the scree plot was inconclusive, so we keep 4 principal components as our final outcome.
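The cumulative proportions in the summary table follow directly from the eigenvalues. A minimal sketch, with `eig` typed in from the eigenvalue output earlier in this report and the names `cum_var`, `threshold` and `n_components` chosen for illustration:

```r
# Eigenvalues as reported earlier in this report.
eig <- c(3.09826049, 1.92594488, 1.55137399, 1.21332843, 0.95929127,
         0.65955338, 0.58381193, 0.42296415, 0.34462890, 0.18142991,
         0.05941268)

# Share of total variance per component, and its running total.
prop_var <- eig / sum(eig)
cum_var  <- cumsum(prop_var)
print(round(cum_var, 4))

# Smallest number of components reaching the 70% minimum used in the text.
threshold    <- 0.70
n_components <- which(cum_var >= threshold)[1]
print(n_components)
```

Three components stay below the threshold (about 60%), so the fourth is the first to cross it.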
The variable plot shows how the wine’s attributes relate to the first two principal components. Each variable is drawn as a vector from the origin: its direction shows how it relates to the other variables, and its length shows its impact (the longer the vector, the greater the variable’s impact).
fviz_pca_var(pca.s, col.var="contrib")+
  scale_color_gradient2(low="blue", mid="yellow", high="red", midpoint=10)
Contribution of each attribute to each of the four components.
library(gridExtra)  # provides grid.arrange()
# One contribution bar chart per retained component
PC1 <- fviz_contrib(pca.s, choice = "var", axes = 1, fill = "#1AA7EC", color = "#1AA7EC")
PC2 <- fviz_contrib(pca.s, choice = "var", axes = 2, fill = "#1AA7EC", color = "#1AA7EC")
PC3 <- fviz_contrib(pca.s, choice = "var", axes = 3, fill = "#1AA7EC", color = "#1AA7EC")
PC4 <- fviz_contrib(pca.s, choice = "var", axes = 4, fill = "#1AA7EC", color = "#1AA7EC")
grid.arrange(PC1, PC2, PC3, PC4, ncol = 2, nrow = 2)
The plots show that each component is driven mainly by three to five variables.
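The contributions can also be checked by hand. The sketch below assumes the usual definition for PCA variables: a variable's contribution to a component is its squared loading as a percentage of the component's total squared loadings. The PC1 loadings are typed in from the loading table earlier in this report; `load_pc1`, `contrib_pc1` and `expected` are illustrative names.

```r
# PC1 loadings as reported in the loading table earlier in this report.
load_pc1 <- c(
  fixed.acidity        =  0.48943617,
  volatile.acidity     = -0.23855443,
  citric.acid          =  0.46367336,
  residual.sugar       =  0.14607440,
  chlorides            =  0.21227289,
  free.sulfur.dioxide  = -0.03633677,
  total.sulfur.dioxide =  0.02350296,
  density              =  0.39506064,
  pH                   = -0.43861675,
  sulphates            =  0.24287570,
  alcohol              = -0.11328995
)

# Contribution (%) of each variable to PC1: squared loading over the total.
contrib_pc1 <- 100 * load_pc1^2 / sum(load_pc1^2)
print(round(sort(contrib_pc1, decreasing = TRUE), 2))

# Average contribution if all 11 variables contributed equally.
expected <- 100 / length(load_pc1)

# Variables contributing more than that average.
print(names(contrib_pc1)[contrib_pc1 > expected])
```

For PC1 this singles out fixed acidity, citric acid, density and pH, which fits the range of three to five main variables per component.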
This project demonstrates the effectiveness of dimension reduction. Using PCA, we reduced the number of dimensions from 11 to 4 while still explaining over 70% of the variance. Additionally, we identified which variables are the key contributors to each component.