Introduction

The aim of the Principal component analysis is reducing the dimensionality of large datasets and still keeping most of the information in the large set. In practice we create new set of variables (principal components) that are the linear combinations of the original observed variables such that the first component accounts for the largest variance in the data, the second component accounts for the second largest variance in the data and so on. The number of principal components is less than or equal to the number of original variables. The principal components are orthogonal to each other, so there is no redundant information. In this project the PCA will be carried and the results of hierarchical clustering will be compared between PCA and whole dataset.

Dataset

Dataset used in this report includes information about 1143 wines. Each wine is described by 13 characteristics. The dataset can be fount at Kaggle website (https://www.kaggle.com/yasserh/wine-quality-dataset).

Variable Value
fixed acidity Fixed acidity value
volatile acidity Volatile acidity value
citric acid Citric acid value
residual sugar Residual sugar value
chlorides Chlorides value
free sulfur dioxide Free sulfur dioxide value
total sulfur dioxide Total sulfur dioxide value
density Density value
pH pH value
sulphates Sulphates value
alcohol % of alcohol
quality assessmnet of quality 1-10
#Basic statistics
library(kableExtra)
summary(wine1) %>% kable()  %>%
  kable_material(c("hover", "striped")) %>%
  scroll_box(width="940px")
fixed.acidity volatile.acidity citric.acid residual.sugar chlorides free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol quality Id
Min. : 4.600 Min. :0.1200 Min. :0.0000 Min. : 0.900 Min. :0.01200 Min. : 1.00 Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300 Min. : 8.40 Min. :3.000 Min. : 0
1st Qu.: 7.100 1st Qu.:0.3925 1st Qu.:0.0900 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 21.00 1st Qu.:0.9956 1st Qu.:3.205 1st Qu.:0.5500 1st Qu.: 9.50 1st Qu.:5.000 1st Qu.: 411
Median : 7.900 Median :0.5200 Median :0.2500 Median : 2.200 Median :0.07900 Median :13.00 Median : 37.00 Median :0.9967 Median :3.310 Median :0.6200 Median :10.20 Median :6.000 Median : 794
Mean : 8.311 Mean :0.5313 Mean :0.2684 Mean : 2.532 Mean :0.08693 Mean :15.62 Mean : 45.91 Mean :0.9967 Mean :3.311 Mean :0.6577 Mean :10.44 Mean :5.657 Mean : 805
3rd Qu.: 9.100 3rd Qu.:0.6400 3rd Qu.:0.4200 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 61.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10 3rd Qu.:6.000 3rd Qu.:1210
Max. :15.900 Max. :1.5800 Max. :1.0000 Max. :15.500 Max. :0.61100 Max. :68.00 Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000 Max. :14.90 Max. :8.000 Max. :1597

Most variables are continuous thus the PCA not MCA will be used.

Dataset preparation

#checking for missing values
any(is.na(wine1))
## [1] FALSE
missing_in_cols <- sapply(wine1, function(x) sum(is.na(x))/nrow(wine1))

We do not have missing values in data

#droping unnecessary variables
wine<- wine1[,1:11]
#Data normalization
#install.packages("caret")
library(caret)

preproc <- preProcess(wine, method=c("center", "scale"))
winenorm <- predict(preproc, wine)
summary(winenorm)
##  fixed.acidity     volatile.acidity    citric.acid       residual.sugar    
##  Min.   :-2.1236   Min.   :-2.28988   Min.   :-1.36443   Min.   :-1.20372  
##  1st Qu.:-0.6930   1st Qu.:-0.77290   1st Qu.:-0.90685   1st Qu.:-0.46622  
##  Median :-0.2352   Median :-0.06312   Median :-0.09337   Median :-0.24496  
##  Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.00000   Mean   : 0.00000  
##  3rd Qu.: 0.4514   3rd Qu.: 0.60491   3rd Qu.: 0.77096   3rd Qu.: 0.05004  
##  Max.   : 4.3425   Max.   : 5.83779   Max.   : 3.71982   Max.   : 9.56389  
##    chlorides        free.sulfur.dioxide total.sulfur.dioxide    density        
##  Min.   :-1.58529   Min.   :-1.4258     Min.   :-1.2176      Min.   :-3.45983  
##  1st Qu.:-0.35823   1st Qu.:-0.8405     1st Qu.:-0.7600      1st Qu.:-0.60279  
##  Median :-0.16783   Median :-0.2552     Median :-0.2719      Median :-0.02619  
##  Mean   : 0.00000   Mean   : 0.0000     Mean   : 0.0000      Mean   : 0.00000  
##  3rd Qu.: 0.06489   3rd Qu.: 0.5253     3rd Qu.: 0.4602      3rd Qu.: 0.57899  
##  Max.   :11.08730   Max.   : 5.1104     Max.   : 7.4152      Max.   : 3.61524  
##        pH              sulphates          alcohol       
##  Min.   :-3.644836   Min.   :-1.9232   Min.   :-1.8870  
##  1st Qu.:-0.676702   1st Qu.:-0.6321   1st Qu.:-0.8706  
##  Median :-0.006478   Median :-0.2213   Median :-0.2237  
##  Mean   : 0.000000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.568000   3rd Qu.: 0.4243   3rd Qu.: 0.6079  
##  Max.   : 4.461681   Max.   : 7.8774   Max.   : 4.1193

Correlation matrix

PCA is particularly useful when the variables are highly correlated. Correlation indicates that there is redundancy in the data. We can solve this problem by replacing a group of correlated variables with a new variable. The plot below indicates that in our dataset we indeed have quite large correlation for few variables. For example variable fixed.acidity seems to be highly correlated with citric.acid, density and pH.

corrplot(cor<-cor(winenorm, method="pearson")) 
corrplot(cor)

PCA

Now we will compute and determine the number of PCs

# PCA with prcomp:: (non-rotated) & preditions
pca<-prcomp(winenorm, center=FALSE, scale.=FALSE) # stats::
summary(pca)
## Importance of components:
##                           PC1    PC2    PC3    PC4     PC5     PC6     PC7
## Standard deviation     1.7766 1.3705 1.2545 1.1007 0.97888 0.81570 0.74672
## Proportion of Variance 0.2869 0.1708 0.1431 0.1101 0.08711 0.06049 0.05069
## Cumulative Proportion  0.2869 0.4577 0.6007 0.7109 0.79798 0.85847 0.90916
##                           PC8     PC9    PC10    PC11
## Standard deviation     0.6473 0.58694 0.42099 0.24178
## Proportion of Variance 0.0381 0.03132 0.01611 0.00531
## Cumulative Proportion  0.9473 0.97857 0.99469 1.00000

Eigen values

The eigen values measure the variance retained by each PC and can be used to determine the number of PC. Eigen value > 1 indicate us that the PCs account for more variance than accounted by one of the original variables in standardized data, and is used to determine cutoff point for which we keep PCs.

library("factoextra")
eig.val <-  get_eigenvalue(pca)
eig.val
##        eigenvalue variance.percent cumulative.variance.percent
## Dim.1  3.15615793       28.6923448                    28.69234
## Dim.2  1.87826140       17.0751036                    45.76745
## Dim.3  1.57365209       14.3059281                    60.07338
## Dim.4  1.21150477       11.0136798                    71.08706
## Dim.5  0.95821355        8.7110323                    79.79809
## Dim.6  0.66536918        6.0488107                    85.84690
## Dim.7  0.55759739        5.0690672                    90.91597
## Dim.8  0.41905879        3.8096253                    94.72559
## Dim.9  0.34449397        3.1317633                    97.85736
## Dim.10 0.17723144        1.6111950                    99.46855
## Dim.11 0.05845948        0.5314498                   100.00000

The proportion of variation explained by each eigenvalue is given in the second column. 71% of the variation is explained by the first four eigenvalues.For the first four PCs the eigen values are larger than 1, thus we should retain 4 PCs.

Scree plot

Another way to determine the number of PC is to look at the Scree Plot.

# eigen values on y-axis
fviz_eig(pca, choice='eigenvalue', addlabels    =   TRUE, barcolor = "dark blue") 

var <-  get_pca_var(pca)
var
## Principal Component Analysis Results for variables
##  ===================================================
##   Name       Description                                    
## 1 "$coord"   "Coordinates for the variables"                
## 2 "$cor"     "Correlations between variables and dimensions"
## 3 "$cos2"    "Cos2 for the variables"                       
## 4 "$contrib" "contributions of the variables"
#   Coordinates
head(var$coord)
##                           Dim.1      Dim.2       Dim.3       Dim.4       Dim.5
## fixed.acidity        0.86223204 -0.1402402  0.15351508 -0.25981155 -0.09532211
## volatile.acidity    -0.40353299  0.3957808  0.55662339  0.08506953  0.25211893
## citric.acid          0.81735044 -0.2007884 -0.30965306 -0.07024316 -0.07603340
## residual.sugar       0.31002053  0.3457238 -0.11460606 -0.44826819  0.68633585
## chlorides            0.39949476  0.2098688  0.06593714  0.71838231  0.26493375
## free.sulfur.dioxide -0.08501148  0.7085767 -0.53726758 -0.05592439 -0.14612960
##                            Dim.6       Dim.7       Dim.8        Dim.9
## fixed.acidity        0.084318190 -0.25825249  0.12179708 -0.101822013
## volatile.acidity     0.339406432 -0.39480941  0.05413647  0.060401232
## citric.acid          0.072892301  0.07082927  0.24481300  0.203643373
## residual.sugar       0.020243140  0.22901729 -0.20636376 -0.004925393
## chlorides            0.260156435  0.28149531  0.21214546 -0.049370175
## free.sulfur.dioxide -0.007772404 -0.07453572  0.13366656 -0.382528173
##                          Dim.10       Dim.11
## fixed.acidity       -0.12082333  0.152082021
## volatile.acidity     0.14910179  0.002580409
## citric.acid          0.27040640 -0.012975643
## residual.sugar       0.04233413  0.044332141
## chlorides           -0.10609899  0.011278046
## free.sulfur.dioxide  0.08752290 -0.009137291
#   Cos2:   quality on  the factore map
head(var$cos2)
##                           Dim.1      Dim.2       Dim.3       Dim.4       Dim.5
## fixed.acidity       0.743444090 0.01966731 0.023566879 0.067502041 0.009086305
## volatile.acidity    0.162838875 0.15664248 0.309829599 0.007236825 0.063563956
## citric.acid         0.668061746 0.04031597 0.095885017 0.004934102 0.005781079
## residual.sugar      0.096112730 0.11952497 0.013134548 0.200944367 0.471056901
## chlorides           0.159596066 0.04404492 0.004347706 0.516073137 0.070189893
## free.sulfur.dioxide 0.007226952 0.50208092 0.288656452 0.003127537 0.021353860
##                            Dim.6       Dim.7       Dim.8        Dim.9
## fixed.acidity       7.109557e-03 0.066694350 0.014834529 0.0103677222
## volatile.acidity    1.151967e-01 0.155874473 0.002930758 0.0036483089
## citric.acid         5.313288e-03 0.005016785 0.059933406 0.0414706233
## residual.sugar      4.097847e-04 0.052448917 0.042586002 0.0000242595
## chlorides           6.768137e-02 0.079239611 0.045005697 0.0024374142
## free.sulfur.dioxide 6.041027e-05 0.005555574 0.017866749 0.1463278034
##                          Dim.10       Dim.11
## fixed.acidity       0.014598278 2.312894e-02
## volatile.acidity    0.022231343 6.658510e-06
## citric.acid         0.073119619 1.683673e-04
## residual.sugar      0.001792178 1.965339e-03
## chlorides           0.011256995 1.271943e-04
## free.sulfur.dioxide 0.007660257 8.349008e-05
#   Contributions   to  the principal   components
head(var$contrib)
##                          Dim.1     Dim.2      Dim.3      Dim.4      Dim.5
## fixed.acidity       23.5553513  1.047102  1.4975915  5.5717520  0.9482547
## volatile.acidity     5.1594020  8.339759 19.6885704  0.5973418  6.6335898
## citric.acid         21.1669302  2.146451  6.0931522  0.4072705  0.6033184
## residual.sugar       3.0452446  6.363596  0.8346539 16.5863454 49.1599081
## chlorides            5.0566565  2.344983  0.2762813 42.5976973  7.3250784
## free.sulfur.dioxide  0.2289794 26.731152 18.3430921  0.2581531  2.2285074
##                           Dim.6      Dim.7      Dim.8        Dim.9    Dim.10
## fixed.acidity        1.06851316 11.9610226  3.5399636  3.009551168  8.236844
## volatile.acidity    17.31320437 27.9546632  0.6993667  1.059034181 12.543678
## citric.acid          0.79854729  0.8997146 14.3019090 12.038127553 41.256572
## residual.sugar       0.06158757  9.4062343 10.1622979  0.007042068  1.011208
## chlorides           10.17200270 14.2109006 10.7397096  0.707534644  6.351579
## free.sulfur.dioxide  0.00907921  0.9963415  4.2635423 42.476158374  4.322177
##                          Dim.11
## fixed.acidity       39.56405409
## volatile.acidity     0.01138996
## citric.acid          0.28800686
## residual.sugar       3.36188185
## chlorides            0.21757689
## free.sulfur.dioxide  0.14281701

Correlation circle and quality of representation

The correlation circle shows the relationship between variables - positively correlated variables are grouped together, while negatively correlated variables are positioned on opposite sides of the plot origin. Variables that are away from the origin are well represented on the factor map. Square cosine (cos2) is the quality of representation of the variables on a factor map. High cos2 means a good representation of the variable on the principal component, while low cos2 dicates that the variable is not perfectly represented by the PCs. Below plot is a correlation circle. the red color means high cos2 and blue color low cos2. Variables that are closed to the center of the plot are less important for the first components.

#correlation circle
fviz_pca_var(pca,   col.var =   "cos2", gradient.cols   =   c("#00AFBB",    "#E7B800",  "#FC4E03"), 
                                                    repel   =   TRUE)

the graph below indicates that for most variables first five PCs (Dim.1 - Dim.5) are a good representation

library("corrplot")
corrplot(var$cos2,  is.corr=FALSE)

#individual observations quality of representation.
# unlabeled observations in two dimensions with coloured quality of representation
fviz_pca_ind(pca, col.ind="cos2", geom="point", gradient.cols=c("white", "#2E9FDF", "#FC4E07" ))

Contributions of variables to PCs

Variables that are correlated with our four PCs are the most important in explaining the variability in the data set. Variables that are not correlated with any PC or correlated with the last dimensions are variables with low contribution and might be removed. We will use the function corrplot() to find the most contributing variables.

Contribution for each PC:

library("corrplot")
corrplot(var$contrib,   is.corr=FALSE)  

Contribution to first four PCs:

# contributions of individual variables to PC
library(gridExtra)
#   Contributions   of  variables   to  PC1
a<-fviz_contrib(pca, "var", axes=1, xtickslab.rt=90) # default angle=45°
#   Contributions   of  variables   to  PC12
b<-fviz_contrib(pca, "var", axes=2, xtickslab.rt=90)
#   Contributions   of  variables   to  PC3
c<-fviz_contrib(pca, "var", axes=3, xtickslab.rt=90) 
#   Contributions   of  variables   to  PC4
d<-fviz_contrib(pca, "var", axes=4, xtickslab.rt=90) 
grid.arrange(a,b, ncol=2,  top='Contribution to the PC1 and PC2')

grid.arrange(c,d, ncol=2, top='Contribution to the PC3 and PC4')

The most contributing variables to the first component are fixed.acidity, citric.acid, pH and density, to the second component: total.sulfur.dioxide, free.sulfur.dioxide and alcohol. For the third PC important are 6 variables, the most important is alcohol volatile.acidity and free.sulfur.dioxide. For the fourth PC the most contributing are:chlorides, sulphates and residual.sugar.

Total contribution to PC1-PC4:

fviz_contrib(pca,   choice  =   "var",  axes    =   1:4,    top =   10)

Generally, the most contributing variables to our four PCs are: fixed.acidity, citric.acid, density, free.sulfur.dioxide, total.sulfur.dioxide and chlorides.

Hierachical clustering

Now we will compare the results of hierarchical clustering for the trees generated with first four PCs and all variables included in the data set. For hierarchical clustering we will use euclidean distance measure and Ward’s linkage method.

Comparison of two dendograms

#   Compute hierarchical    clustering on initial dataset
res.dist <- dist(winenorm, method = "euclidean")
hc <- hclust(res.dist, method = "ward.D2")

#   Compute hierarchical    clustering including 4 PCs
pc_transform = as.data.frame(-pca$x[,1:4])
res.dist1 <- dist(pc_transform, method = "euclidean")
hc1 <- hclust(res.dist1, method = "ward.D2")


# create two dendrograms
library(dendextend)
dend1 <- as.dendrogram (hc)
dend2 <- as.dendrogram (hc1)
dend_list <- dendlist(dend1, dend2)
tanglegram(dend1, dend2)

tanglegram(dend1, dend2,
highlight_distinct_edges = FALSE, # Turn-off dashed lines
common_subtrees_color_lines = FALSE, # Turn-off line colors
common_subtrees_color_branches = TRUE, # Color common branches
main = paste("entanglement =", round(entanglement(dend_list), 2))
)

To verify the quality of the alignment of two trees we can use the entanglement. It takes the values between 0 and 1, the lower the entanglement the better the alignment. The entanglement for our trees is equal to 0.38. The data set is quite large and the tanglegram is not transparent, therefore we will calculate Correlation matrix between dendograms. We will use cor.dendlist() and compute “Baker” and “Cophenetic” correlation. The valueS of these metrics range between -1 to 1, With near 0 values meaning that the two trees are not statistically similar.

# Cophenetic correlation coefficient
cor_cophenetic(dend1, dend2)
## [1] 0.4972264

The Cophenetic correlation between trees is 0.4972264

# Baker correlation coefficient
cor_bakers_gamma(dend1, dend2)
## [1] 0.4716416

The Baker correlation between trees is 0.4716416

Conclusions

This project aimed to conduct the dimension reduction and find out whether we can retain as much information as possible while limiting the number of variables. Since the variables used dataset were continuous, PCA was applied. According to the obtained results we can explain over 71% of variation in our data using 4 principal components. The fixed.acidity variable has the highest contribution to determined dimensions. Based on the hierarchical clustering carried out on first four PCs and 11 variables included in the dataset, the entanglement was calculated, its value equal to 0.38 indicates rather good alignment.

References

https://www.datanovia.com/en/wp-content/uploads/dn-tutorials/book-preview/clustering_en_preview.pdf
https://tuxdoc.com/download/practical-guide-to-principal-component-methods-in-r-multivariate-analysis-book-2-4_pdf
https://www.kaggle.com/yasserh/wine-quality-dataset
http://www.personal.psu.edu/jxb14/M554/factor/efa.pdf