Wine Data - Principal Component Analysis (PCA)

PCA for Wine Data

The data has the following 13 attributes:
1. Alcohol
2. Malic_acid
3. Ash
4. Alcalinity_of_ash
5. Magnesium
6. Total_phenols
7. Flavanoids
8. Nonflavanoid_phenols
9. Proanthocyanins
10. Color_intensity
11. Hue
12. OD280_OD315_of_diluted_wines
13. Proline

All the variables provided are continuous.

Reading data from .csv file

library(factoextra)

## Loading required package: ggplot2

## Welcome! Related Books: `Practical Guide To Cluster Analysis in R` at https://goo.gl/13EFCZ

library(cluster)
library(fpc)
library(NbClust)

setwd("E:/ISB/Residency/3/DM1/Assignment/IndividualAssignment1-8July2017")

winedf = read.csv("Wine_PCA_Analysis.csv")

Building PCA Summary

W.pca <- princomp(winedf[,-1], cor = TRUE, scores = TRUE, covmat = NULL)

summary(W.pca)

## Importance of components:
##                           Comp.1    Comp.2    Comp.3    Comp.4     Comp.5
## Standard deviation     2.1692972 1.5801816 1.2025273 0.9586313 0.92370351
## Proportion of Variance 0.3619885 0.1920749 0.1112363 0.0706903 0.06563294
## Cumulative Proportion  0.3619885 0.5540634 0.6652997 0.7359900 0.80162293
##                            Comp.6     Comp.7     Comp.8     Comp.9
## Standard deviation     0.80103498 0.74231281 0.59033665 0.53747553
## Proportion of Variance 0.04935823 0.04238679 0.02680749 0.02222153
## Cumulative Proportion  0.85098116 0.89336795 0.92017544 0.94239698
##                           Comp.10    Comp.11    Comp.12     Comp.13
## Standard deviation     0.50090167 0.47517222 0.41081655 0.321524394
## Proportion of Variance 0.01930019 0.01736836 0.01298233 0.007952149
## Cumulative Proportion  0.96169717 0.97906553 0.99204785 1.000000000

Observation

According to the summary above (Importance of components), the first 7 principal components account for about 89% of the total variance in the data (cumulative proportion 0.893). The 13 components can therefore be reduced to 7 for further analysis while retaining roughly 90% of the information. The remaining components can be included if a more accurate analysis, forecast, or prediction is required.
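The cumulative proportions can be recomputed directly from the standard deviations printed by summary(W.pca). The sketch below copies those values from the output above; components_needed is a hypothetical helper introduced only for illustration.

```r
# Standard deviations copied from the summary(W.pca) output above
sdev <- c(2.1692972, 1.5801816, 1.2025273, 0.9586313, 0.92370351,
          0.80103498, 0.74231281, 0.59033665, 0.53747553,
          0.50090167, 0.47517222, 0.41081655, 0.321524394)

# Each component's variance is its squared standard deviation
prop_var <- sdev^2 / sum(sdev^2)
cum_var  <- cumsum(prop_var)

# Hypothetical helper: smallest number of components whose cumulative
# share of variance reaches the target
components_needed <- function(cum_var, target) which(cum_var >= target)[1]

round(cum_var[7], 4)               # share retained by the first 7 components
components_needed(cum_var, 0.89)   # 7
components_needed(cum_var, 0.90)   # 8
```

With a strict 90% threshold, 8 components would be needed; 7 components fall just short at about 89.3%.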

plot(W.pca)

biplot(W.pca)

Cluster Analysis - All Variables

no_of_Clusters = NbClust(winedf, distance = "euclidean", min.nc = 2, max.nc = 10, method = "complete", index ="all")

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
##

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 6 proposed 2 as the best number of clusters 
## * 5 proposed 3 as the best number of clusters 
## * 1 proposed 4 as the best number of clusters 
## * 7 proposed 7 as the best number of clusters 
## * 1 proposed 8 as the best number of clusters 
## * 4 proposed 10 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  7 
##  
##  
## *******************************************************************

# Plot bar chart for the clusters
fviz_nbclust(no_of_Clusters) + theme_minimal()

## Among all indices: 
## ===================
## * 2 proposed  0 as the best number of clusters
## * 6 proposed  2 as the best number of clusters
## * 5 proposed  3 as the best number of clusters
## * 1 proposed  4 as the best number of clusters
## * 7 proposed  7 as the best number of clusters
## * 1 proposed  8 as the best number of clusters
## * 4 proposed  10 as the best number of clusters
## 
## Conclusion
## =========================
## * According to the majority rule, the best number of clusters is  7 .

## Warning: Installed Rcpp (0.12.10) different from Rcpp used to build dplyr (0.12.11).
## Please reinstall dplyr to avoid random crashes or undefined behavior.
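The majority rule reported by NbClust is a simple tally of votes. As a small sketch, the vote counts below are copied from the output above (excluding the two graphical indices that fviz_nbclust lists under 0), so the margin of the winning value can be inspected.

```r
# Votes copied from the NbClust output: names are candidate cluster counts,
# values are the number of indices proposing that count
votes <- c("2" = 6, "3" = 5, "4" = 1, "7" = 7, "8" = 1, "10" = 4)

# Majority rule: pick the cluster count with the most votes
best_k <- as.integer(names(votes)[which.max(votes)])
best_k                         # 7

# Share of the votes each candidate received
round(votes / sum(votes), 2)
```

Seven clusters win with 7 of 24 votes, only one more than the 6 votes for two clusters, so the choice is not clear-cut.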

Hierarchical clustering - All Variables

hclust.complete = eclust(winedf, "hclust", k = 7, method = "complete", graph = FALSE) 
fviz_dend(hclust.complete, rect = TRUE, show_labels = FALSE)
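For readers without factoextra, complete-linkage clustering of this kind can be sketched in base R. This is a minimal sketch under assumptions, not the eclust implementation: it uses the built-in mtcars data as a stand-in because the wine CSV is not available here, and k = 7 simply mirrors the value used above.

```r
# Stand-in data (assumption): built-in mtcars, standardized column-wise
# so that no single column dominates the Euclidean distance
x <- scale(mtcars)

# Pairwise Euclidean distances, then complete-linkage agglomeration
hc <- hclust(dist(x, method = "euclidean"), method = "complete")

# Cut the tree into k groups; k = 7 mirrors the value used in the text
k <- 7
groups <- cutree(hc, k = k)

table(groups)   # cluster sizes
```

Calling plot(hc) would draw the corresponding dendrogram.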

K-Means clustering - All Variables

km.7 = eclust(winedf, "kmeans", k = 7, nstart = 25, graph = FALSE)
fviz_cluster(km.7, geom = "point", ellipse.type = "norm")
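The nstart = 25 argument in the call above asks k-means to try 25 random initializations and keep the best one. A minimal base R sketch of that behaviour follows, assuming the built-in mtcars data as a stand-in and an illustrative choice of 3 centers.

```r
# Stand-in data (assumption): built-in mtcars, standardized column-wise
x <- scale(mtcars)

set.seed(123)   # k-means starts from random centers, so fix the seed

# Keep the best of 25 random starts; centers = 3 is only illustrative
km <- kmeans(x, centers = 3, nstart = 25)

km$size           # observations per cluster
km$tot.withinss   # total within-cluster sum of squares (lower is tighter)
```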

Cluster Analysis - PCA Suggested Components

The number of clusters suggested by the NbClust function is 7.

winedf.pca = winedf[,2:14]
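Note that winedf[,2:14] keeps the 13 original attributes, the same columns passed to princomp as winedf[,-1]. If the intent is to cluster on the retained principal components instead, the component scores are stored in the princomp result. The sketch below assumes the built-in mtcars data as a stand-in; with the objects above, the analogous expression would be W.pca$scores[, 1:7].

```r
# Stand-in data (assumption): built-in mtcars, 32 rows x 11 numeric columns
pca <- princomp(mtcars, cor = TRUE, scores = TRUE)

# pca$scores holds each observation's coordinates on the principal components
n_keep <- 7                             # components retained in the text
scores_kept <- pca$scores[, 1:n_keep]

dim(scores_kept)                        # rows x retained components
colnames(scores_kept)                   # Comp.1 ... Comp.7

# Component scores are mutually uncorrelated by construction
round(cor(scores_kept), 3)
```

A matrix such as scores_kept could then be supplied to the clustering calls in place of the original columns.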

no_of_Clusters = NbClust(winedf.pca, distance = "euclidean", min.nc = 2, max.nc = 10, method = "complete", index ="all")

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
##

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 6 proposed 2 as the best number of clusters 
## * 5 proposed 3 as the best number of clusters 
## * 1 proposed 4 as the best number of clusters 
## * 7 proposed 7 as the best number of clusters 
## * 1 proposed 8 as the best number of clusters 
## * 4 proposed 10 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  7 
##  
##  
## *******************************************************************

# Plot bar chart for the clusters
fviz_nbclust(no_of_Clusters) + theme_minimal()

## Among all indices: 
## ===================
## * 2 proposed  0 as the best number of clusters
## * 6 proposed  2 as the best number of clusters
## * 5 proposed  3 as the best number of clusters
## * 1 proposed  4 as the best number of clusters
## * 7 proposed  7 as the best number of clusters
## * 1 proposed  8 as the best number of clusters
## * 4 proposed  10 as the best number of clusters
## 
## Conclusion
## =========================
## * According to the majority rule, the best number of clusters is  7 .

Hierarchical clustering - PCA Suggested Components

hclust.complete = eclust(winedf.pca, "hclust", k = 7, method = "complete", graph = FALSE) 
fviz_dend(hclust.complete, rect = TRUE, show_labels = FALSE)

K-Means clustering - PCA Suggested Components

km.7 = eclust(winedf.pca, "kmeans", k = 7, nstart = 25, graph = FALSE)
fviz_cluster(km.7, geom = "point", ellipse.type = "norm")

Observation(s):

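When PCA was applied to the full set of 13 variables, it showed that the first 7 principal components retain about 90% of the information. We then ran NbClust and plotted dendrograms for both data sets, and in both cases the suggested number of clusters is 7 and the dendrograms appear identical. Note that the second data set as coded, winedf[,2:14], contains the 13 original attributes with the first column dropped rather than the first 7 component scores, so the identical results reflect the removal of that one column.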
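Whether two clusterings really agree can be checked numerically rather than by eye. The sketch below is illustrative only: same_partition is a hypothetical helper, the built-in mtcars data stands in for the wine data, and k = 3 is an arbitrary cluster count.

```r
# Hypothetical helper: TRUE when two label vectors describe the same
# grouping, even if the cluster numbers are permuted
same_partition <- function(a, b) {
  tab <- table(a, b) > 0
  all(rowSums(tab) == 1) && all(colSums(tab) == 1)
}

# Stand-in data (assumption): built-in mtcars
pca <- princomp(mtcars, cor = TRUE, scores = TRUE)
k <- 3   # illustrative cluster count

# Cluster once on all component scores and once on the first 7 only
full    <- cutree(hclust(dist(pca$scores), method = "complete"), k = k)
reduced <- cutree(hclust(dist(pca$scores[, 1:7]), method = "complete"), k = k)

# Cross-tabulate the two assignments; a single non-zero cell in every
# row and column means the groupings match
table(full = full, reduced = reduced)
same_partition(full, reduced)
```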