The aim of the Principal component analysis is reducing the dimensionality of large datasets and still keeping most of the information in the large set. In practice we create new set of variables (principal components) that are the linear combinations of the original observed variables such that the first component accounts for the largest variance in the data, the second component accounts for the second largest variance in the data and so on. The number of principal components is less than or equal to the number of original variables. The principal components are orthogonal to each other, so there is no redundant information. In this project the PCA will be carried and the results of hierarchical clustering will be compared between PCA and whole dataset.
Dataset used in this report includes information about 1143 wines. Each wine is described by 13 characteristics. The dataset can be fount at Kaggle website (https://www.kaggle.com/yasserh/wine-quality-dataset).
| Variable | Value |
|---|---|
| fixed acidity | Fixed acidity value |
| volatile acidity | Volatile acidity value |
| citric acid | Citric acid value |
| residual sugar | Residual sugar value |
| chlorides | Chlorides value |
| free sulfur dioxide | Free sulfur dioxide value |
| total sulfur dioxide | Total sulfur dioxide value |
| density | Density value |
| pH | pH value |
| sulphates | Sulphates value |
| alcohol | % of alcohol |
| quality | assessmnet of quality 1-10 |
#Basic statistics
library(kableExtra)
summary(wine1) %>% kable() %>%
kable_material(c("hover", "striped")) %>%
scroll_box(width="940px")
| fixed.acidity | volatile.acidity | citric.acid | residual.sugar | chlorides | free.sulfur.dioxide | total.sulfur.dioxide | density | pH | sulphates | alcohol | quality | Id | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Min. : 4.600 | Min. :0.1200 | Min. :0.0000 | Min. : 0.900 | Min. :0.01200 | Min. : 1.00 | Min. : 6.00 | Min. :0.9901 | Min. :2.740 | Min. :0.3300 | Min. : 8.40 | Min. :3.000 | Min. : 0 | |
| 1st Qu.: 7.100 | 1st Qu.:0.3925 | 1st Qu.:0.0900 | 1st Qu.: 1.900 | 1st Qu.:0.07000 | 1st Qu.: 7.00 | 1st Qu.: 21.00 | 1st Qu.:0.9956 | 1st Qu.:3.205 | 1st Qu.:0.5500 | 1st Qu.: 9.50 | 1st Qu.:5.000 | 1st Qu.: 411 | |
| Median : 7.900 | Median :0.5200 | Median :0.2500 | Median : 2.200 | Median :0.07900 | Median :13.00 | Median : 37.00 | Median :0.9967 | Median :3.310 | Median :0.6200 | Median :10.20 | Median :6.000 | Median : 794 | |
| Mean : 8.311 | Mean :0.5313 | Mean :0.2684 | Mean : 2.532 | Mean :0.08693 | Mean :15.62 | Mean : 45.91 | Mean :0.9967 | Mean :3.311 | Mean :0.6577 | Mean :10.44 | Mean :5.657 | Mean : 805 | |
| 3rd Qu.: 9.100 | 3rd Qu.:0.6400 | 3rd Qu.:0.4200 | 3rd Qu.: 2.600 | 3rd Qu.:0.09000 | 3rd Qu.:21.00 | 3rd Qu.: 61.00 | 3rd Qu.:0.9978 | 3rd Qu.:3.400 | 3rd Qu.:0.7300 | 3rd Qu.:11.10 | 3rd Qu.:6.000 | 3rd Qu.:1210 | |
| Max. :15.900 | Max. :1.5800 | Max. :1.0000 | Max. :15.500 | Max. :0.61100 | Max. :68.00 | Max. :289.00 | Max. :1.0037 | Max. :4.010 | Max. :2.0000 | Max. :14.90 | Max. :8.000 | Max. :1597 |
Most variables are continuous thus the PCA not MCA will be used.
#checking for missing values
any(is.na(wine1))
## [1] FALSE
missing_in_cols <- sapply(wine1, function(x) sum(is.na(x))/nrow(wine1))
We do not have missing values in data
#droping unnecessary variables
wine<- wine1[,1:11]
#Data normalization
#install.packages("caret")
library(caret)
preproc <- preProcess(wine, method=c("center", "scale"))
winenorm <- predict(preproc, wine)
summary(winenorm)
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. :-2.1236 Min. :-2.28988 Min. :-1.36443 Min. :-1.20372
## 1st Qu.:-0.6930 1st Qu.:-0.77290 1st Qu.:-0.90685 1st Qu.:-0.46622
## Median :-0.2352 Median :-0.06312 Median :-0.09337 Median :-0.24496
## Mean : 0.0000 Mean : 0.00000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 0.4514 3rd Qu.: 0.60491 3rd Qu.: 0.77096 3rd Qu.: 0.05004
## Max. : 4.3425 Max. : 5.83779 Max. : 3.71982 Max. : 9.56389
## chlorides free.sulfur.dioxide total.sulfur.dioxide density
## Min. :-1.58529 Min. :-1.4258 Min. :-1.2176 Min. :-3.45983
## 1st Qu.:-0.35823 1st Qu.:-0.8405 1st Qu.:-0.7600 1st Qu.:-0.60279
## Median :-0.16783 Median :-0.2552 Median :-0.2719 Median :-0.02619
## Mean : 0.00000 Mean : 0.0000 Mean : 0.0000 Mean : 0.00000
## 3rd Qu.: 0.06489 3rd Qu.: 0.5253 3rd Qu.: 0.4602 3rd Qu.: 0.57899
## Max. :11.08730 Max. : 5.1104 Max. : 7.4152 Max. : 3.61524
## pH sulphates alcohol
## Min. :-3.644836 Min. :-1.9232 Min. :-1.8870
## 1st Qu.:-0.676702 1st Qu.:-0.6321 1st Qu.:-0.8706
## Median :-0.006478 Median :-0.2213 Median :-0.2237
## Mean : 0.000000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.568000 3rd Qu.: 0.4243 3rd Qu.: 0.6079
## Max. : 4.461681 Max. : 7.8774 Max. : 4.1193
PCA is particularly useful when the variables are highly correlated. Correlation indicates that there is redundancy in the data. We can solve this problem by replacing a group of correlated variables with a new variable. The plot below indicates that in our dataset we indeed have quite large correlation for few variables. For example variable fixed.acidity seems to be highly correlated with citric.acid, density and pH.
corrplot(cor<-cor(winenorm, method="pearson"))
corrplot(cor)
Now we will compute and determine the number of PCs
# PCA with prcomp:: (non-rotated) & preditions
pca<-prcomp(winenorm, center=FALSE, scale.=FALSE) # stats::
summary(pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.7766 1.3705 1.2545 1.1007 0.97888 0.81570 0.74672
## Proportion of Variance 0.2869 0.1708 0.1431 0.1101 0.08711 0.06049 0.05069
## Cumulative Proportion 0.2869 0.4577 0.6007 0.7109 0.79798 0.85847 0.90916
## PC8 PC9 PC10 PC11
## Standard deviation 0.6473 0.58694 0.42099 0.24178
## Proportion of Variance 0.0381 0.03132 0.01611 0.00531
## Cumulative Proportion 0.9473 0.97857 0.99469 1.00000
The eigen values measure the variance retained by each PC and can be used to determine the number of PC. Eigen value > 1 indicate us that the PCs account for more variance than accounted by one of the original variables in standardized data, and is used to determine cutoff point for which we keep PCs.
library("factoextra")
eig.val <- get_eigenvalue(pca)
eig.val
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 3.15615793 28.6923448 28.69234
## Dim.2 1.87826140 17.0751036 45.76745
## Dim.3 1.57365209 14.3059281 60.07338
## Dim.4 1.21150477 11.0136798 71.08706
## Dim.5 0.95821355 8.7110323 79.79809
## Dim.6 0.66536918 6.0488107 85.84690
## Dim.7 0.55759739 5.0690672 90.91597
## Dim.8 0.41905879 3.8096253 94.72559
## Dim.9 0.34449397 3.1317633 97.85736
## Dim.10 0.17723144 1.6111950 99.46855
## Dim.11 0.05845948 0.5314498 100.00000
The proportion of variation explained by each eigenvalue is given in the second column. 71% of the variation is explained by the first four eigenvalues.For the first four PCs the eigen values are larger than 1, thus we should retain 4 PCs.
Another way to determine the number of PC is to look at the Scree Plot.
# eigen values on y-axis
fviz_eig(pca, choice='eigenvalue', addlabels = TRUE, barcolor = "dark blue")
var <- get_pca_var(pca)
var
## Principal Component Analysis Results for variables
## ===================================================
## Name Description
## 1 "$coord" "Coordinates for the variables"
## 2 "$cor" "Correlations between variables and dimensions"
## 3 "$cos2" "Cos2 for the variables"
## 4 "$contrib" "contributions of the variables"
# Coordinates
head(var$coord)
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
## fixed.acidity 0.86223204 -0.1402402 0.15351508 -0.25981155 -0.09532211
## volatile.acidity -0.40353299 0.3957808 0.55662339 0.08506953 0.25211893
## citric.acid 0.81735044 -0.2007884 -0.30965306 -0.07024316 -0.07603340
## residual.sugar 0.31002053 0.3457238 -0.11460606 -0.44826819 0.68633585
## chlorides 0.39949476 0.2098688 0.06593714 0.71838231 0.26493375
## free.sulfur.dioxide -0.08501148 0.7085767 -0.53726758 -0.05592439 -0.14612960
## Dim.6 Dim.7 Dim.8 Dim.9
## fixed.acidity 0.084318190 -0.25825249 0.12179708 -0.101822013
## volatile.acidity 0.339406432 -0.39480941 0.05413647 0.060401232
## citric.acid 0.072892301 0.07082927 0.24481300 0.203643373
## residual.sugar 0.020243140 0.22901729 -0.20636376 -0.004925393
## chlorides 0.260156435 0.28149531 0.21214546 -0.049370175
## free.sulfur.dioxide -0.007772404 -0.07453572 0.13366656 -0.382528173
## Dim.10 Dim.11
## fixed.acidity -0.12082333 0.152082021
## volatile.acidity 0.14910179 0.002580409
## citric.acid 0.27040640 -0.012975643
## residual.sugar 0.04233413 0.044332141
## chlorides -0.10609899 0.011278046
## free.sulfur.dioxide 0.08752290 -0.009137291
# Cos2: quality on the factore map
head(var$cos2)
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
## fixed.acidity 0.743444090 0.01966731 0.023566879 0.067502041 0.009086305
## volatile.acidity 0.162838875 0.15664248 0.309829599 0.007236825 0.063563956
## citric.acid 0.668061746 0.04031597 0.095885017 0.004934102 0.005781079
## residual.sugar 0.096112730 0.11952497 0.013134548 0.200944367 0.471056901
## chlorides 0.159596066 0.04404492 0.004347706 0.516073137 0.070189893
## free.sulfur.dioxide 0.007226952 0.50208092 0.288656452 0.003127537 0.021353860
## Dim.6 Dim.7 Dim.8 Dim.9
## fixed.acidity 7.109557e-03 0.066694350 0.014834529 0.0103677222
## volatile.acidity 1.151967e-01 0.155874473 0.002930758 0.0036483089
## citric.acid 5.313288e-03 0.005016785 0.059933406 0.0414706233
## residual.sugar 4.097847e-04 0.052448917 0.042586002 0.0000242595
## chlorides 6.768137e-02 0.079239611 0.045005697 0.0024374142
## free.sulfur.dioxide 6.041027e-05 0.005555574 0.017866749 0.1463278034
## Dim.10 Dim.11
## fixed.acidity 0.014598278 2.312894e-02
## volatile.acidity 0.022231343 6.658510e-06
## citric.acid 0.073119619 1.683673e-04
## residual.sugar 0.001792178 1.965339e-03
## chlorides 0.011256995 1.271943e-04
## free.sulfur.dioxide 0.007660257 8.349008e-05
# Contributions to the principal components
head(var$contrib)
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
## fixed.acidity 23.5553513 1.047102 1.4975915 5.5717520 0.9482547
## volatile.acidity 5.1594020 8.339759 19.6885704 0.5973418 6.6335898
## citric.acid 21.1669302 2.146451 6.0931522 0.4072705 0.6033184
## residual.sugar 3.0452446 6.363596 0.8346539 16.5863454 49.1599081
## chlorides 5.0566565 2.344983 0.2762813 42.5976973 7.3250784
## free.sulfur.dioxide 0.2289794 26.731152 18.3430921 0.2581531 2.2285074
## Dim.6 Dim.7 Dim.8 Dim.9 Dim.10
## fixed.acidity 1.06851316 11.9610226 3.5399636 3.009551168 8.236844
## volatile.acidity 17.31320437 27.9546632 0.6993667 1.059034181 12.543678
## citric.acid 0.79854729 0.8997146 14.3019090 12.038127553 41.256572
## residual.sugar 0.06158757 9.4062343 10.1622979 0.007042068 1.011208
## chlorides 10.17200270 14.2109006 10.7397096 0.707534644 6.351579
## free.sulfur.dioxide 0.00907921 0.9963415 4.2635423 42.476158374 4.322177
## Dim.11
## fixed.acidity 39.56405409
## volatile.acidity 0.01138996
## citric.acid 0.28800686
## residual.sugar 3.36188185
## chlorides 0.21757689
## free.sulfur.dioxide 0.14281701
The correlation circle shows the relationship between variables - positively correlated variables are grouped together, while negatively correlated variables are positioned on opposite sides of the plot origin. Variables that are away from the origin are well represented on the factor map. Square cosine (cos2) is the quality of representation of the variables on a factor map. High cos2 means a good representation of the variable on the principal component, while low cos2 dicates that the variable is not perfectly represented by the PCs. Below plot is a correlation circle. the red color means high cos2 and blue color low cos2. Variables that are closed to the center of the plot are less important for the first components.
#correlation circle
fviz_pca_var(pca, col.var = "cos2", gradient.cols = c("#00AFBB", "#E7B800", "#FC4E03"),
repel = TRUE)
the graph below indicates that for most variables first five PCs (Dim.1 - Dim.5) are a good representation
library("corrplot")
corrplot(var$cos2, is.corr=FALSE)
#individual observations quality of representation.
# unlabeled observations in two dimensions with coloured quality of representation
fviz_pca_ind(pca, col.ind="cos2", geom="point", gradient.cols=c("white", "#2E9FDF", "#FC4E07" ))
Variables that are correlated with our four PCs are the most important in explaining the variability in the data set. Variables that are not correlated with any PC or correlated with the last dimensions are variables with low contribution and might be removed. We will use the function corrplot() to find the most contributing variables.
Contribution for each PC:
library("corrplot")
corrplot(var$contrib, is.corr=FALSE)
Contribution to first four PCs:
# contributions of individual variables to PC
library(gridExtra)
# Contributions of variables to PC1
a<-fviz_contrib(pca, "var", axes=1, xtickslab.rt=90) # default angle=45°
# Contributions of variables to PC12
b<-fviz_contrib(pca, "var", axes=2, xtickslab.rt=90)
# Contributions of variables to PC3
c<-fviz_contrib(pca, "var", axes=3, xtickslab.rt=90)
# Contributions of variables to PC4
d<-fviz_contrib(pca, "var", axes=4, xtickslab.rt=90)
grid.arrange(a,b, ncol=2, top='Contribution to the PC1 and PC2')
grid.arrange(c,d, ncol=2, top='Contribution to the PC3 and PC4')
The most contributing variables to the first component are fixed.acidity, citric.acid, pH and density, to the second component: total.sulfur.dioxide, free.sulfur.dioxide and alcohol. For the third PC important are 6 variables, the most important is alcohol volatile.acidity and free.sulfur.dioxide. For the fourth PC the most contributing are:chlorides, sulphates and residual.sugar.
Total contribution to PC1-PC4:
fviz_contrib(pca, choice = "var", axes = 1:4, top = 10)
Generally, the most contributing variables to our four PCs are: fixed.acidity, citric.acid, density, free.sulfur.dioxide, total.sulfur.dioxide and chlorides.
Now we will compare the results of hierarchical clustering for the trees generated with first four PCs and all variables included in the data set. For hierarchical clustering we will use euclidean distance measure and Ward’s linkage method.
# Compute hierarchical clustering on initial dataset
res.dist <- dist(winenorm, method = "euclidean")
hc <- hclust(res.dist, method = "ward.D2")
# Compute hierarchical clustering including 4 PCs
pc_transform = as.data.frame(-pca$x[,1:4])
res.dist1 <- dist(pc_transform, method = "euclidean")
hc1 <- hclust(res.dist1, method = "ward.D2")
# create two dendrograms
library(dendextend)
dend1 <- as.dendrogram (hc)
dend2 <- as.dendrogram (hc1)
dend_list <- dendlist(dend1, dend2)
tanglegram(dend1, dend2)
tanglegram(dend1, dend2,
highlight_distinct_edges = FALSE, # Turn-off dashed lines
common_subtrees_color_lines = FALSE, # Turn-off line colors
common_subtrees_color_branches = TRUE, # Color common branches
main = paste("entanglement =", round(entanglement(dend_list), 2))
)
To verify the quality of the alignment of two trees we can use the entanglement. It takes the values between 0 and 1, the lower the entanglement the better the alignment. The entanglement for our trees is equal to 0.38. The data set is quite large and the tanglegram is not transparent, therefore we will calculate Correlation matrix between dendograms. We will use cor.dendlist() and compute “Baker” and “Cophenetic” correlation. The valueS of these metrics range between -1 to 1, With near 0 values meaning that the two trees are not statistically similar.
# Cophenetic correlation coefficient
cor_cophenetic(dend1, dend2)
## [1] 0.4972264
The Cophenetic correlation between trees is 0.4972264
# Baker correlation coefficient
cor_bakers_gamma(dend1, dend2)
## [1] 0.4716416
The Baker correlation between trees is 0.4716416
This project aimed to conduct the dimension reduction and find out whether we can retain as much information as possible while limiting the number of variables. Since the variables used dataset were continuous, PCA was applied. According to the obtained results we can explain over 71% of variation in our data using 4 principal components. The fixed.acidity variable has the highest contribution to determined dimensions. Based on the hierarchical clustering carried out on first four PCs and 11 variables included in the dataset, the entanglement was calculated, its value equal to 0.38 indicates rather good alignment.
https://www.datanovia.com/en/wp-content/uploads/dn-tutorials/book-preview/clustering_en_preview.pdf
https://tuxdoc.com/download/practical-guide-to-principal-component-methods-in-r-multivariate-analysis-book-2-4_pdf
https://www.kaggle.com/yasserh/wine-quality-dataset
http://www.personal.psu.edu/jxb14/M554/factor/efa.pdf