Unsupervised learning - Dimension reduction (PCA)

Introduction

The aim of the Principal component analysis is reducing the dimensionality of large datasets and still keeping most of the information in the large set. In practice we create new set of variables (principal components) that are the linear combinations of the original observed variables such that the first component accounts for the largest variance in the data, the second component accounts for the second largest variance in the data and so on. The number of principal components is less than or equal to the number of original variables. The principal components are orthogonal to each other, so there is no redundant information. In this project the PCA will be carried and the results of hierarchical clustering will be compared between PCA and whole dataset.

Dataset

Dataset used in this report includes information about 1143 wines. Each wine is described by 13 characteristics. The dataset can be fount at Kaggle website (https://www.kaggle.com/yasserh/wine-quality-dataset).

Variable	Value
fixed acidity	Fixed acidity value
volatile acidity	Volatile acidity value
citric acid	Citric acid value
residual sugar	Residual sugar value
chlorides	Chlorides value
free sulfur dioxide	Free sulfur dioxide value
total sulfur dioxide	Total sulfur dioxide value
density	Density value
pH	pH value
sulphates	Sulphates value
alcohol	% of alcohol
quality	assessmnet of quality 1-10

#Basic statistics
library(kableExtra)
summary(wine1) %>% kable()  %>%
  kable_material(c("hover", "striped")) %>%
  scroll_box(width="940px")

fixed.acidity	volatile.acidity	citric.acid	residual.sugar	chlorides	free.sulfur.dioxide	total.sulfur.dioxide	density	pH	sulphates	alcohol	quality	Id
Min. : 4.600	Min. :0.1200	Min. :0.0000	Min. : 0.900	Min. :0.01200	Min. : 1.00	Min. : 6.00	Min. :0.9901	Min. :2.740	Min. :0.3300	Min. : 8.40	Min. :3.000	Min. : 0
1st Qu.: 7.100	1st Qu.:0.3925	1st Qu.:0.0900	1st Qu.: 1.900	1st Qu.:0.07000	1st Qu.: 7.00	1st Qu.: 21.00	1st Qu.:0.9956	1st Qu.:3.205	1st Qu.:0.5500	1st Qu.: 9.50	1st Qu.:5.000	1st Qu.: 411
Median : 7.900	Median :0.5200	Median :0.2500	Median : 2.200	Median :0.07900	Median :13.00	Median : 37.00	Median :0.9967	Median :3.310	Median :0.6200	Median :10.20	Median :6.000	Median : 794
Mean : 8.311	Mean :0.5313	Mean :0.2684	Mean : 2.532	Mean :0.08693	Mean :15.62	Mean : 45.91	Mean :0.9967	Mean :3.311	Mean :0.6577	Mean :10.44	Mean :5.657	Mean : 805
3rd Qu.: 9.100	3rd Qu.:0.6400	3rd Qu.:0.4200	3rd Qu.: 2.600	3rd Qu.:0.09000	3rd Qu.:21.00	3rd Qu.: 61.00	3rd Qu.:0.9978	3rd Qu.:3.400	3rd Qu.:0.7300	3rd Qu.:11.10	3rd Qu.:6.000	3rd Qu.:1210
Max. :15.900	Max. :1.5800	Max. :1.0000	Max. :15.500	Max. :0.61100	Max. :68.00	Max. :289.00	Max. :1.0037	Max. :4.010	Max. :2.0000	Max. :14.90	Max. :8.000	Max. :1597

Most variables are continuous thus the PCA not MCA will be used.

Dataset preparation

#checking for missing values
any(is.na(wine1))

## [1] FALSE

missing_in_cols <- sapply(wine1, function(x) sum(is.na(x))/nrow(wine1))

We do not have missing values in data

#droping unnecessary variables
wine<- wine1[,1:11]

#Data normalization
#install.packages("caret")
library(caret)

preproc <- preProcess(wine, method=c("center", "scale"))
winenorm <- predict(preproc, wine)
summary(winenorm)

##  fixed.acidity     volatile.acidity    citric.acid       residual.sugar    
##  Min.   :-2.1236   Min.   :-2.28988   Min.   :-1.36443   Min.   :-1.20372  
##  1st Qu.:-0.6930   1st Qu.:-0.77290   1st Qu.:-0.90685   1st Qu.:-0.46622  
##  Median :-0.2352   Median :-0.06312   Median :-0.09337   Median :-0.24496  
##  Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.00000   Mean   : 0.00000  
##  3rd Qu.: 0.4514   3rd Qu.: 0.60491   3rd Qu.: 0.77096   3rd Qu.: 0.05004  
##  Max.   : 4.3425   Max.   : 5.83779   Max.   : 3.71982   Max.   : 9.56389  
##    chlorides        free.sulfur.dioxide total.sulfur.dioxide    density        
##  Min.   :-1.58529   Min.   :-1.4258     Min.   :-1.2176      Min.   :-3.45983  
##  1st Qu.:-0.35823   1st Qu.:-0.8405     1st Qu.:-0.7600      1st Qu.:-0.60279  
##  Median :-0.16783   Median :-0.2552     Median :-0.2719      Median :-0.02619  
##  Mean   : 0.00000   Mean   : 0.0000     Mean   : 0.0000      Mean   : 0.00000  
##  3rd Qu.: 0.06489   3rd Qu.: 0.5253     3rd Qu.: 0.4602      3rd Qu.: 0.57899  
##  Max.   :11.08730   Max.   : 5.1104     Max.   : 7.4152      Max.   : 3.61524  
##        pH              sulphates          alcohol       
##  Min.   :-3.644836   Min.   :-1.9232   Min.   :-1.8870  
##  1st Qu.:-0.676702   1st Qu.:-0.6321   1st Qu.:-0.8706  
##  Median :-0.006478   Median :-0.2213   Median :-0.2237  
##  Mean   : 0.000000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.568000   3rd Qu.: 0.4243   3rd Qu.: 0.6079  
##  Max.   : 4.461681   Max.   : 7.8774   Max.   : 4.1193

Correlation matrix

PCA is particularly useful when the variables are highly correlated. Correlation indicates that there is redundancy in the data. We can solve this problem by replacing a group of correlated variables with a new variable. The plot below indicates that in our dataset we indeed have quite large correlation for few variables. For example variable fixed.acidity seems to be highly correlated with citric.acid, density and pH.

corrplot(cor<-cor(winenorm, method="pearson")) 
corrplot(cor)

PCA

Now we will compute and determine the number of PCs

# PCA with prcomp:: (non-rotated) & preditions
pca<-prcomp(winenorm, center=FALSE, scale.=FALSE) # stats::
summary(pca)

## Importance of components:
##                           PC1    PC2    PC3    PC4     PC5     PC6     PC7
## Standard deviation     1.7766 1.3705 1.2545 1.1007 0.97888 0.81570 0.74672
## Proportion of Variance 0.2869 0.1708 0.1431 0.1101 0.08711 0.06049 0.05069
## Cumulative Proportion  0.2869 0.4577 0.6007 0.7109 0.79798 0.85847 0.90916
##                           PC8     PC9    PC10    PC11
## Standard deviation     0.6473 0.58694 0.42099 0.24178
## Proportion of Variance 0.0381 0.03132 0.01611 0.00531
## Cumulative Proportion  0.9473 0.97857 0.99469 1.00000

Eigen values

The eigen values measure the variance retained by each PC and can be used to determine the number of PC. Eigen value > 1 indicate us that the PCs account for more variance than accounted by one of the original variables in standardized data, and is used to determine cutoff point for which we keep PCs.

library("factoextra")
eig.val <-  get_eigenvalue(pca)
eig.val

##        eigenvalue variance.percent cumulative.variance.percent
## Dim.1  3.15615793       28.6923448                    28.69234
## Dim.2  1.87826140       17.0751036                    45.76745
## Dim.3  1.57365209       14.3059281                    60.07338
## Dim.4  1.21150477       11.0136798                    71.08706
## Dim.5  0.95821355        8.7110323                    79.79809
## Dim.6  0.66536918        6.0488107                    85.84690
## Dim.7  0.55759739        5.0690672                    90.91597
## Dim.8  0.41905879        3.8096253                    94.72559
## Dim.9  0.34449397        3.1317633                    97.85736
## Dim.10 0.17723144        1.6111950                    99.46855
## Dim.11 0.05845948        0.5314498                   100.00000

The proportion of variation explained by each eigenvalue is given in the second column. 71% of the variation is explained by the first four eigenvalues.For the first four PCs the eigen values are larger than 1, thus we should retain 4 PCs.

Scree plot

Another way to determine the number of PC is to look at the Scree Plot.

# eigen values on y-axis
fviz_eig(pca, choice='eigenvalue', addlabels    =   TRUE, barcolor = "dark blue")

var <-  get_pca_var(pca)
var

## Principal Component Analysis Results for variables
##  ===================================================
##   Name       Description                                    
## 1 "$coord"   "Coordinates for the variables"                
## 2 "$cor"     "Correlations between variables and dimensions"
## 3 "$cos2"    "Cos2 for the variables"                       
## 4 "$contrib" "contributions of the variables"

#   Coordinates
head(var$coord)

##                           Dim.1      Dim.2       Dim.3       Dim.4       Dim.5
## fixed.acidity        0.86223204 -0.1402402  0.15351508 -0.25981155 -0.09532211
## volatile.acidity    -0.40353299  0.3957808  0.55662339  0.08506953  0.25211893
## citric.acid          0.81735044 -0.2007884 -0.30965306 -0.07024316 -0.07603340
## residual.sugar       0.31002053  0.3457238 -0.11460606 -0.44826819  0.68633585
## chlorides            0.39949476  0.2098688  0.06593714  0.71838231  0.26493375
## free.sulfur.dioxide -0.08501148  0.7085767 -0.53726758 -0.05592439 -0.14612960
##                            Dim.6       Dim.7       Dim.8        Dim.9
## fixed.acidity        0.084318190 -0.25825249  0.12179708 -0.101822013
## volatile.acidity     0.339406432 -0.39480941  0.05413647  0.060401232
## citric.acid          0.072892301  0.07082927  0.24481300  0.203643373
## residual.sugar       0.020243140  0.22901729 -0.20636376 -0.004925393
## chlorides            0.260156435  0.28149531  0.21214546 -0.049370175
## free.sulfur.dioxide -0.007772404 -0.07453572  0.13366656 -0.382528173
##                          Dim.10       Dim.11
## fixed.acidity       -0.12082333  0.152082021
## volatile.acidity     0.14910179  0.002580409
## citric.acid          0.27040640 -0.012975643
## residual.sugar       0.04233413  0.044332141
## chlorides           -0.10609899  0.011278046
## free.sulfur.dioxide  0.08752290 -0.009137291

#   Cos2:   quality on  the factore map
head(var$cos2)

##                           Dim.1      Dim.2       Dim.3       Dim.4       Dim.5
## fixed.acidity       0.743444090 0.01966731 0.023566879 0.067502041 0.009086305
## volatile.acidity    0.162838875 0.15664248 0.309829599 0.007236825 0.063563956
## citric.acid         0.668061746 0.04031597 0.095885017 0.004934102 0.005781079
## residual.sugar      0.096112730 0.11952497 0.013134548 0.200944367 0.471056901
## chlorides           0.159596066 0.04404492 0.004347706 0.516073137 0.070189893
## free.sulfur.dioxide 0.007226952 0.50208092 0.288656452 0.003127537 0.021353860
##                            Dim.6       Dim.7       Dim.8        Dim.9
## fixed.acidity       7.109557e-03 0.066694350 0.014834529 0.0103677222
## volatile.acidity    1.151967e-01 0.155874473 0.002930758 0.0036483089
## citric.acid         5.313288e-03 0.005016785 0.059933406 0.0414706233
## residual.sugar      4.097847e-04 0.052448917 0.042586002 0.0000242595
## chlorides           6.768137e-02 0.079239611 0.045005697 0.0024374142
## free.sulfur.dioxide 6.041027e-05 0.005555574 0.017866749 0.1463278034
##                          Dim.10       Dim.11
## fixed.acidity       0.014598278 2.312894e-02
## volatile.acidity    0.022231343 6.658510e-06
## citric.acid         0.073119619 1.683673e-04
## residual.sugar      0.001792178 1.965339e-03
## chlorides           0.011256995 1.271943e-04
## free.sulfur.dioxide 0.007660257 8.349008e-05

#   Contributions   to  the principal   components
head(var$contrib)

##                          Dim.1     Dim.2      Dim.3      Dim.4      Dim.5
## fixed.acidity       23.5553513  1.047102  1.4975915  5.5717520  0.9482547
## volatile.acidity     5.1594020  8.339759 19.6885704  0.5973418  6.6335898
## citric.acid         21.1669302  2.146451  6.0931522  0.4072705  0.6033184
## residual.sugar       3.0452446  6.363596  0.8346539 16.5863454 49.1599081
## chlorides            5.0566565  2.344983  0.2762813 42.5976973  7.3250784
## free.sulfur.dioxide  0.2289794 26.731152 18.3430921  0.2581531  2.2285074
##                           Dim.6      Dim.7      Dim.8        Dim.9    Dim.10
## fixed.acidity        1.06851316 11.9610226  3.5399636  3.009551168  8.236844
## volatile.acidity    17.31320437 27.9546632  0.6993667  1.059034181 12.543678
## citric.acid          0.79854729  0.8997146 14.3019090 12.038127553 41.256572
## residual.sugar       0.06158757  9.4062343 10.1622979  0.007042068  1.011208
## chlorides           10.17200270 14.2109006 10.7397096  0.707534644  6.351579
## free.sulfur.dioxide  0.00907921  0.9963415  4.2635423 42.476158374  4.322177
##                          Dim.11
## fixed.acidity       39.56405409
## volatile.acidity     0.01138996
## citric.acid          0.28800686
## residual.sugar       3.36188185
## chlorides            0.21757689
## free.sulfur.dioxide  0.14281701

Correlation circle and quality of representation

The correlation circle shows the relationship between variables - positively correlated variables are grouped together, while negatively correlated variables are positioned on opposite sides of the plot origin. Variables that are away from the origin are well represented on the factor map. Square cosine (cos2) is the quality of representation of the variables on a factor map. High cos2 means a good representation of the variable on the principal component, while low cos2 dicates that the variable is not perfectly represented by the PCs. Below plot is a correlation circle. the red color means high cos2 and blue color low cos2. Variables that are closed to the center of the plot are less important for the first components.

#correlation circle
fviz_pca_var(pca,   col.var =   "cos2", gradient.cols   =   c("#00AFBB",    "#E7B800",  "#FC4E03"), 
                                                    repel   =   TRUE)

the graph below indicates that for most variables first five PCs (Dim.1 - Dim.5) are a good representation

library("corrplot")
corrplot(var$cos2,  is.corr=FALSE)

#individual observations quality of representation.
# unlabeled observations in two dimensions with coloured quality of representation
fviz_pca_ind(pca, col.ind="cos2", geom="point", gradient.cols=c("white", "#2E9FDF", "#FC4E07" ))

Contributions of variables to PCs

Variables that are correlated with our four PCs are the most important in explaining the variability in the data set. Variables that are not correlated with any PC or correlated with the last dimensions are variables with low contribution and might be removed. We will use the function corrplot() to find the most contributing variables.

Contribution for each PC:

library("corrplot")
corrplot(var$contrib,   is.corr=FALSE)

Contribution to first four PCs:

# contributions of individual variables to PC
library(gridExtra)
#   Contributions   of  variables   to  PC1
a<-fviz_contrib(pca, "var", axes=1, xtickslab.rt=90) # default angle=45Â°
#   Contributions   of  variables   to  PC12
b<-fviz_contrib(pca, "var", axes=2, xtickslab.rt=90)
#   Contributions   of  variables   to  PC3
c<-fviz_contrib(pca, "var", axes=3, xtickslab.rt=90) 
#   Contributions   of  variables   to  PC4
d<-fviz_contrib(pca, "var", axes=4, xtickslab.rt=90) 
grid.arrange(a,b, ncol=2,  top='Contribution to the PC1 and PC2')

grid.arrange(c,d, ncol=2, top='Contribution to the PC3 and PC4')

The most contributing variables to the first component are fixed.acidity, citric.acid, pH and density, to the second component: total.sulfur.dioxide, free.sulfur.dioxide and alcohol. For the third PC important are 6 variables, the most important is alcohol volatile.acidity and free.sulfur.dioxide. For the fourth PC the most contributing are:chlorides, sulphates and residual.sugar.

Total contribution to PC1-PC4:

fviz_contrib(pca,   choice  =   "var",  axes    =   1:4,    top =   10)

Generally, the most contributing variables to our four PCs are: fixed.acidity, citric.acid, density, free.sulfur.dioxide, total.sulfur.dioxide and chlorides.

Hierachical clustering

Now we will compare the results of hierarchical clustering for the trees generated with first four PCs and all variables included in the data set. For hierarchical clustering we will use euclidean distance measure and Ward’s linkage method.

Comparison of two dendograms

#   Compute hierarchical    clustering on initial dataset
res.dist <- dist(winenorm, method = "euclidean")
hc <- hclust(res.dist, method = "ward.D2")

#   Compute hierarchical    clustering including 4 PCs
pc_transform = as.data.frame(-pca$x[,1:4])
res.dist1 <- dist(pc_transform, method = "euclidean")
hc1 <- hclust(res.dist1, method = "ward.D2")


# create two dendrograms
library(dendextend)
dend1 <- as.dendrogram (hc)
dend2 <- as.dendrogram (hc1)
dend_list <- dendlist(dend1, dend2)
tanglegram(dend1, dend2)

tanglegram(dend1, dend2,
highlight_distinct_edges = FALSE, # Turn-off dashed lines
common_subtrees_color_lines = FALSE, # Turn-off line colors
common_subtrees_color_branches = TRUE, # Color common branches
main = paste("entanglement =", round(entanglement(dend_list), 2))
)

To verify the quality of the alignment of two trees we can use the entanglement. It takes the values between 0 and 1, the lower the entanglement the better the alignment. The entanglement for our trees is equal to 0.38. The data set is quite large and the tanglegram is not transparent, therefore we will calculate Correlation matrix between dendograms. We will use cor.dendlist() and compute “Baker” and “Cophenetic” correlation. The valueS of these metrics range between -1 to 1, With near 0 values meaning that the two trees are not statistically similar.

# Cophenetic correlation coefficient
cor_cophenetic(dend1, dend2)

## [1] 0.4972264

The Cophenetic correlation between trees is 0.4972264

# Baker correlation coefficient
cor_bakers_gamma(dend1, dend2)

## [1] 0.4716416

The Baker correlation between trees is 0.4716416

Conclusions

This project aimed to conduct the dimension reduction and find out whether we can retain as much information as possible while limiting the number of variables. Since the variables used dataset were continuous, PCA was applied. According to the obtained results we can explain over 71% of variation in our data using 4 principal components. The fixed.acidity variable has the highest contribution to determined dimensions. Based on the hierarchical clustering carried out on first four PCs and 11 variables included in the dataset, the entanglement was calculated, its value equal to 0.38 indicates rather good alignment.

References

https://www.datanovia.com/en/wp-content/uploads/dn-tutorials/book-preview/clustering_en_preview.pdf
https://tuxdoc.com/download/practical-guide-to-principal-component-methods-in-r-multivariate-analysis-book-2-4_pdf
https://www.kaggle.com/yasserh/wine-quality-dataset
http://www.personal.psu.edu/jxb14/M554/factor/efa.pdf