Wine Attributes’ Analysis with PCA

Load packages

library(readr)
library(dplyr)
library(corrplot)
library(stats)
library(ggplot2)
library(factoextra)
library(psych)
library(caret)
library(gridExtra)
library(scales)
library(Hmisc)
library(ggfortify)

Introduction

The aim of this project is to use PCA (Principal Component Analysis) for dimensionality reduction on the Wine Quality data. With many observations and many features per observation, it is difficult to see what the data tell us and how the wines differ from one another. Reducing the dimensionality makes the data easier to understand and visualize. The goal is to reduce the number of dimensions while preserving as much information as possible.

Dataset

The dataset comes from the paper “P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.” and is available from the UC Irvine Machine Learning Repository (https://archive.ics.uci.edu/dataset/186/wine+quality). It describes variants of the red “Vinho Verde” wine from Minho, in the northwest of Portugal. Each wine is described by 11 features based on physicochemical tests: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol.

# Loading of the data
dane <- read.csv("Wine.csv", dec=".", header=TRUE, fileEncoding = "windows-1252")
head(dane)

##   fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1           7.4             0.70        0.00            1.9     0.076
## 2           7.8             0.88        0.00            2.6     0.098
## 3           7.8             0.76        0.04            2.3     0.092
## 4          11.2             0.28        0.56            1.9     0.075
## 5           7.4             0.70        0.00            1.9     0.076
## 6           7.4             0.66        0.00            1.8     0.075
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  11                   34  0.9978 3.51      0.56     9.4
## 2                  25                   67  0.9968 3.20      0.68     9.8
## 3                  15                   54  0.9970 3.26      0.65     9.8
## 4                  17                   60  0.9980 3.16      0.58     9.8
## 5                  11                   34  0.9978 3.51      0.56     9.4
## 6                  13                   40  0.9978 3.51      0.56     9.4

#Check if the data is complete
missing_in_cols <- sapply(dane, function(x) sum(is.na(x))/nrow(dane))
percent(missing_in_cols)

##        fixed.acidity     volatile.acidity          citric.acid 
##                 "0%"                 "0%"                 "0%" 
##       residual.sugar            chlorides  free.sulfur.dioxide 
##                 "0%"                 "0%"                 "0%" 
## total.sulfur.dioxide              density                   pH 
##                 "0%"                 "0%"                 "0%" 
##            sulphates              alcohol 
##                 "0%"                 "0%"

#Histogram
hist.data.frame(dane)

Because the variables are measured on different scales, we standardize the data: each variable is centered to mean 0 and scaled to unit variance.

preproc1 <- preProcess(dane, method=c("center", "scale"))
dane.s <- predict(preproc1, dane)
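As a rough illustration of what centering and scaling do, the same transformation can be sketched with base R's `scale()`. The data frame `toy` below is made up for the example and is not the wine data.

```r
# Toy data with very different scales (hypothetical values, not the wine data)
set.seed(1)
toy <- data.frame(
  a = rnorm(50, mean = 8, sd = 1.7),
  b = rnorm(50, mean = 46, sd = 33),
  c = rnorm(50, mean = 0.997, sd = 0.002)
)

# scale() subtracts each column's mean and divides by its standard deviation
toy.s <- as.data.frame(scale(toy, center = TRUE, scale = TRUE))

round(colMeans(toy.s), 10)  # column means are 0 up to floating-point error
sapply(toy.s, sd)           # column standard deviations are 1
```

After this step every variable contributes one unit of variance, so no variable dominates the PCA merely because of its measurement unit.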

Checking correlations between the variables.

corrplot(cor(dane.s), type = "lower", order = "hclust", tl.col = "#0041C2", tl.cex = 0.5)

There are few strong correlations between the variables. The variable involved in most of them is fixed acidity, which is strongly positively correlated with density and citric acid and strongly negatively correlated with pH.
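Strong pairs can also be listed numerically rather than read off the plot. The sketch below uses made-up data and an arbitrary cut-off of 0.6 for the absolute correlation. Both are assumptions for illustration, not values taken from the analysis.

```r
# Hypothetical data: y is built from x, z is independent of both
set.seed(42)
x <- rnorm(200)
toy <- data.frame(x = x, y = x + rnorm(200, sd = 0.3), z = rnorm(200))

r <- cor(toy)
r[upper.tri(r, diag = TRUE)] <- NA             # keep each pair only once

cutoff <- 0.6                                  # assumed threshold for "strong"
strong <- which(abs(r) > cutoff, arr.ind = TRUE)

pairs <- data.frame(
  var1 = rownames(r)[strong[, "row"]],
  var2 = colnames(r)[strong[, "col"]],
  r    = r[strong]
)
pairs
```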

Principal Component Analysis

PCA is a statistical technique for dimensionality reduction that preserves as much of the variability in the data as possible. It transforms a dataset with possibly correlated variables into a smaller set of uncorrelated variables called principal components. These principal components are linear combinations of the original variables, ranked by the amount of variance they capture from the data (their eigenvalues).

pca.s<- prcomp(dane.s, center=FALSE, scale.=FALSE)
pca.s$rotation

##                              PC1          PC2         PC3          PC4
## fixed.acidity         0.48943617  0.110276486 -0.12339211  0.229494292
## volatile.acidity     -0.23855443 -0.275251916 -0.44959770 -0.079416197
## citric.acid           0.46367336  0.151893857  0.23816825  0.079642298
## residual.sugar        0.14607440 -0.272057736  0.10122305  0.372892803
## chlorides             0.21227289 -0.148097153 -0.09182809 -0.666290176
## free.sulfur.dioxide  -0.03633677 -0.513173176  0.42922675  0.043819328
## total.sulfur.dioxide  0.02350296 -0.569265381  0.32287357  0.034817927
## density               0.39506064 -0.233977990 -0.33947107  0.174401704
## pH                   -0.43861675 -0.006601052  0.05729530  0.003976728
## sulphates             0.24287570  0.037791051  0.28005347 -0.550506798
## alcohol              -0.11328995  0.386559616  0.47095374  0.122755757
##                              PC5         PC6         PC7         PC8
## fixed.acidity        -0.08262836 -0.10139410  0.35021991  0.17745372
## volatile.acidity      0.21860545 -0.41161967  0.53366584  0.07860653
## citric.acid          -0.05846339 -0.06937100 -0.10535120  0.37785841
## residual.sugar        0.73211707 -0.04913308 -0.29072890 -0.29982988
## chlorides             0.24660399 -0.30430493 -0.37028355  0.35702801
## free.sulfur.dioxide  -0.15908955  0.01390770  0.11653463  0.20419537
## total.sulfur.dioxide -0.22229841 -0.13584831  0.09392513 -0.01831724
## density               0.15695068  0.39098088  0.17040495  0.23914889
## pH                    0.26751571  0.52217130  0.02512456  0.56156817
## sulphates             0.22616176  0.38166207  0.44749507 -0.37445342
## alcohol               0.35083320 -0.36141052  0.32785901  0.21769157
##                               PC9        PC10         PC11
## fixed.acidity         0.194567484  0.24908701 -0.639722770
## volatile.acidity     -0.128872645 -0.36615745 -0.002527657
## citric.acid          -0.380903481 -0.62176402  0.071303174
## residual.sugar        0.007402285 -0.09292845 -0.183930715
## chlorides             0.111424805  0.21773086 -0.053068991
## free.sulfur.dioxide   0.635693443 -0.24813204  0.052178157
## total.sulfur.dioxide -0.592298277  0.37052182 -0.069227454
## density               0.020751757  0.24016825  0.567171757
## pH                   -0.166777801  0.01044687 -0.340780959
## sulphates            -0.058404653 -0.11228134 -0.069246461
## alcohol               0.037384716  0.30331255  0.314470798
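To make the link between the loadings and the eigenvalues concrete, the following sketch runs PCA "by hand" on a small simulated data set (not the wine data) and compares it with `prcomp()`. The variable names are hypothetical.

```r
# Simulated data: x2 is correlated with x1, x3 is independent
set.seed(123)
n  <- 100
x1 <- rnorm(n)
toy <- data.frame(x1 = x1, x2 = 0.8 * x1 + rnorm(n, sd = 0.5), x3 = rnorm(n))
toy.s <- scale(toy)                       # standardize, as in the analysis

e   <- eigen(cov(toy.s))                  # eigen-decomposition of the covariance matrix
pca <- prcomp(toy.s, center = FALSE, scale. = FALSE)

# Eigenvalues equal the squared standard deviations of the components
all.equal(e$values, pca$sdev^2)

# Eigenvectors equal the loadings, up to the sign of each column
all.equal(abs(e$vectors), abs(unname(pca$rotation)))

# Scores are linear combinations of the variables and are mutually uncorrelated
scores <- toy.s %*% e$vectors
round(cor(scores), 10)
```

The sign of a component is arbitrary, which is why the comparison uses absolute values.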

Choosing number of components

Kaiser Criterion

The Kaiser criterion retains components with eigenvalues greater than 1. For standardized data each original variable has variance 1, so an eigenvalue below 1 indicates that the component explains less variance than a single original variable.

dane.eigen<-eigen(cov(dane.s))
dane.eigen$values

##  [1] 3.09826049 1.92594488 1.55137399 1.21332843 0.95929127 0.65955338
##  [7] 0.58381193 0.42296415 0.34462890 0.18142991 0.05941268

Four components have eigenvalues above 1. The fifth (0.959) comes close to the threshold but does not reach it, so the Kaiser criterion selects 4 components.
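The count can be reproduced directly from the eigenvalues printed above. This is a small sketch with the values typed in by hand, and `kaiser_k` is a name introduced here for illustration.

```r
# Eigenvalues copied from the output above
eigenvalues <- c(3.09826049, 1.92594488, 1.55137399, 1.21332843, 0.95929127,
                 0.65955338, 0.58381193, 0.42296415, 0.34462890, 0.18142991,
                 0.05941268)

# For 11 standardized variables the eigenvalues sum to about 11
sum(eigenvalues)

# Kaiser criterion: count the eigenvalues greater than 1
kaiser_k <- sum(eigenvalues > 1)
kaiser_k
```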

Scree Plot

The second approach uses the scree plot, which shows the eigenvalues of the components in descending order. According to this method, the optimal number of components is the number of bars preceding the point where the line connecting the eigenvalues bends (the “elbow”).

fviz_eig(pca.s, choice = "eigenvalue", ncp = 25, barfill = "slateblue", barcolor = "grey2", linecolor = "black",  addlabels = TRUE)

The plot doesn’t give an obvious answer. The line has no clear elbow, so we cannot determine the number of components from it.

Percentage of explained variance

The last thing we check is the cumulative percentage of variance explained by the components.

summary(pca.s)

## Importance of components:
##                           PC1    PC2    PC3    PC4     PC5     PC6     PC7
## Standard deviation     1.7602 1.3878 1.2455 1.1015 0.97943 0.81213 0.76408
## Proportion of Variance 0.2817 0.1751 0.1410 0.1103 0.08721 0.05996 0.05307
## Cumulative Proportion  0.2817 0.4567 0.5978 0.7081 0.79529 0.85525 0.90832
##                            PC8     PC9    PC10   PC11
## Standard deviation     0.65036 0.58705 0.42595 0.2437
## Proportion of Variance 0.03845 0.03133 0.01649 0.0054
## Cumulative Proportion  0.94678 0.97811 0.99460 1.0000

fviz_eig(pca.s,  ncp = 25, barfill = "slateblue", barcolor = "grey2", linecolor = "black",  addlabels = TRUE)

The summary shows that 4 components explain 70.8% of the variance, just above the 70% minimum we were looking for. This agrees with the Kaiser criterion, while the scree plot was inconclusive, so we keep 4 principal components as our final outcome.
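The cumulative proportions can be recomputed from the eigenvalues reported in the Kaiser criterion section. The sketch below types them in by hand and assumes a 70% threshold. `threshold` and `k` are names introduced for illustration.

```r
# Eigenvalues copied from the eigen() output earlier in the document
eigenvalues <- c(3.09826049, 1.92594488, 1.55137399, 1.21332843, 0.95929127,
                 0.65955338, 0.58381193, 0.42296415, 0.34462890, 0.18142991,
                 0.05941268)

prop     <- eigenvalues / sum(eigenvalues)  # proportion of variance per component
cum_prop <- cumsum(prop)                    # cumulative proportion
round(cum_prop, 4)

threshold <- 0.70                           # assumed minimum share of variance
k <- which(cum_prop >= threshold)[1]        # first component count reaching it
k
```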

Analysis of components

The variable plot shows how the wine attributes relate to the first two principal components. The direction of each vector shows how the variables relate to one another, and its length shows the strength of the variable’s influence (the longer the vector, the greater the variable’s impact).

fviz_pca_var(pca.s, col.var="contrib")+
  scale_color_gradient2(low="blue", mid="yellow", high="red", midpoint=10)

autoplot(pca.s, loadings=TRUE, loadings.colour='blue', loadings.label=TRUE, loadings.label.size=3)

Next we look at the contribution of each attribute to each of the four retained components.

PC1 <- fviz_contrib(pca.s, choice = "var", axes = 1,fill = "#1AA7EC",color = "#1AA7EC")
PC2 <- fviz_contrib(pca.s, choice = "var", axes = 2,fill = "#1AA7EC",color = "#1AA7EC")
PC3 <- fviz_contrib(pca.s, choice = "var", axes = 3,fill = "#1AA7EC",color = "#1AA7EC")
PC4 <- fviz_contrib(pca.s, choice = "var", axes = 4,fill = "#1AA7EC",color = "#1AA7EC")
grid.arrange(PC1, PC2, PC3, PC4,ncol=2, nrow=2)

The plots show that each component is driven mainly by three to five variables:

PC1: fixed acidity, citric acid, density, pH
PC2: total sulfur dioxide, free sulfur dioxide, alcohol
PC3: alcohol, volatile acidity, free sulfur dioxide, density, total sulfur dioxide
PC4: chlorides, sulphates, residual sugar
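The PC1 list can be checked by hand from the loadings printed earlier. The sketch assumes that a variable's contribution to a single component is its squared loading expressed as a percentage, and that the reference level is the uniform share 100/11. This is our understanding of what `fviz_contrib()` plots, not something stated in the output.

```r
# PC1 loadings copied from pca.s$rotation above
pc1 <- c(fixed.acidity = 0.48943617, volatile.acidity = -0.23855443,
         citric.acid = 0.46367336, residual.sugar = 0.14607440,
         chlorides = 0.21227289, free.sulfur.dioxide = -0.03633677,
         total.sulfur.dioxide = 0.02350296, density = 0.39506064,
         pH = -0.43861675, sulphates = 0.24287570, alcohol = -0.11328995)

contrib  <- 100 * pc1^2 / sum(pc1^2)  # contribution of each variable, in percent
expected <- 100 / length(pc1)         # share each variable would have if all were equal

# Variables contributing more than the uniform share
top <- sort(contrib[contrib > expected], decreasing = TRUE)
round(top, 2)
```

The variables that exceed the uniform share are the four listed for PC1 above.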

Conclusion

This project demonstrates how effective dimensionality reduction can be. Using PCA, we reduced the data from 11 dimensions to 4 while retaining over 70% of the variance. Additionally, we identified which variables are the key contributors to each component.