The data has the following 13 attributes:
1. Alcohol
2. Malic_acid
3. Ash
4. Alcalinity_of_ash
5. Magnesium
6. Total_phenols
7. Flavanoids
8. Nonflavanoid_phenols
9. Proanthocyanins
10. Color_intensity
11. Hue
12. OD280_OD315_of_diluted_wines
13. Proline
All the variables provided are continuous.
library(factoextra)
## Loading required package: ggplot2
library(cluster)
library(fpc)
library(NbClust)
setwd("E:/ISB/Residency/3/DM1/Assignment/IndividualAssignment1-8July2017")
winedf = read.csv("Wine_PCA_Analysis.csv")
W.pca <- princomp(winedf[,-1], cor = TRUE, scores = TRUE, covmat = NULL)
summary(W.pca)
## Importance of components:
##                           Comp.1    Comp.2    Comp.3    Comp.4     Comp.5
## Standard deviation     2.1692972 1.5801816 1.2025273 0.9586313 0.92370351
## Proportion of Variance 0.3619885 0.1920749 0.1112363 0.0706903 0.06563294
## Cumulative Proportion  0.3619885 0.5540634 0.6652997 0.7359900 0.80162293
##                            Comp.6     Comp.7     Comp.8     Comp.9
## Standard deviation     0.80103498 0.74231281 0.59033665 0.53747553
## Proportion of Variance 0.04935823 0.04238679 0.02680749 0.02222153
## Cumulative Proportion  0.85098116 0.89336795 0.92017544 0.94239698
##                           Comp.10    Comp.11    Comp.12     Comp.13
## Standard deviation     0.50090167 0.47517222 0.41081655 0.321524394
## Proportion of Variance 0.01930019 0.01736836 0.01298233 0.007952149
## Cumulative Proportion  0.96169717 0.97906553 0.99204785 1.000000000
As per the summary above (Importance of components), the first 7 principal components account for about 89% of the total variance in the data (cumulative proportion 0.893). Hence the 13 original variables can be reduced to 7 components for further analysis while retaining roughly 90% of the information. The remaining components can be included if a more accurate analysis, forecast, or prediction is needed.
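The cumulative proportions in the summary can be reproduced by hand from the standard deviations, which makes the choice of 7 components easier to check. The sketch below copies the standard deviations from the output above; the variable names (sdev, prop_var, cum_var, n_90) are illustrative and not part of the original analysis.

```r
# Standard deviations copied from the summary(W.pca) output above
sdev <- c(2.1692972, 1.5801816, 1.2025273, 0.9586313, 0.92370351,
          0.80103498, 0.74231281, 0.59033665, 0.53747553,
          0.50090167, 0.47517222, 0.41081655, 0.321524394)

# Each component's variance is its squared standard deviation
prop_var <- sdev^2 / sum(sdev^2)   # proportion of variance per component
cum_var  <- cumsum(prop_var)       # cumulative proportion of variance

# Share of variance retained by the first 7 components
round(cum_var[7], 4)

# Smallest number of components whose cumulative share reaches 90%
n_90 <- which(cum_var >= 0.90)[1]
n_90
```

With these figures the first 7 components fall just short of a strict 90% threshold, so 7 is a rounding choice; a strict cut-off at 90% would keep one more component.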
plot(W.pca)
biplot(W.pca)
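The call to princomp above uses cor = TRUE, so the components are derived from the correlation matrix, that is, from variables rescaled to unit variance. The sketch below illustrates this on a stand-in data set (the four numeric columns of the built-in iris data, an assumption made only so the example runs without the wine file); the names x, pca and eig are illustrative.

```r
# Stand-in data (assumption): four numeric columns of the built-in iris data
x <- iris[, 1:4]

# PCA on the correlation matrix, as in the princomp call above
pca <- princomp(x, cor = TRUE)

# Eigen-decomposition of the correlation matrix
eig <- eigen(cor(x))

# Component standard deviations are the square roots of the eigenvalues
same_sdev <- isTRUE(all.equal(unname(pca$sdev), sqrt(eig$values)))
same_sdev

# The component variances sum to the number of variables
total_var <- sum(pca$sdev^2)
total_var
```

For the wine data the same reasoning implies that the 13 component variances sum to 13, which is why each proportion of variance in the summary equals the squared standard deviation divided by 13.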
no_of_Clusters = NbClust(winedf, distance = "euclidean", min.nc = 2, max.nc = 10, method = "complete", index ="all")
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 6 proposed 2 as the best number of clusters
## * 5 proposed 3 as the best number of clusters
## * 1 proposed 4 as the best number of clusters
## * 7 proposed 7 as the best number of clusters
## * 1 proposed 8 as the best number of clusters
## * 4 proposed 10 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 7
##
##
## *******************************************************************
# Plot bar chart for the clusters
fviz_nbclust(no_of_Clusters) + theme_minimal()
## Among all indices:
## ===================
## * 2 proposed 0 as the best number of clusters
## * 6 proposed 2 as the best number of clusters
## * 5 proposed 3 as the best number of clusters
## * 1 proposed 4 as the best number of clusters
## * 7 proposed 7 as the best number of clusters
## * 1 proposed 8 as the best number of clusters
## * 4 proposed 10 as the best number of clusters
##
## Conclusion
## =========================
## * According to the majority rule, the best number of clusters is 7 .
hclust.complete = eclust(winedf, "hclust", k = 7, method = "complete", graph = FALSE)
fviz_dend(hclust.complete, rect = TRUE, show_labels = FALSE)
# k = 7 to match the NbClust recommendation and the object name km.7
km.7 = eclust(winedf, "kmeans", k = 7, nstart = 25, graph = FALSE)
# ellipse.type replaces the deprecated frame.type argument
fviz_cluster(km.7, geom = "point", ellipse.type = "norm")
The number of clusters suggested by the NbClust function is 7.
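The majority rule behind this recommendation is a simple vote count across indices. The sketch below reproduces it from the counts printed above; the names votes and best_k are illustrative. With a fitted NbClust object, the per-index choices should be available in its Best.nc component, but that is not needed here.

```r
# Vote counts copied from the NbClust summary above:
# names are candidate numbers of clusters, values are how many indices chose them
votes <- c("0" = 2, "2" = 6, "3" = 5, "4" = 1, "7" = 7, "8" = 1, "10" = 4)

# Majority rule: pick the candidate with the most votes
best_k <- as.integer(names(votes)[which.max(votes)])
best_k

# Margin over the runner-up, to show how close the vote is
margin <- max(votes) - sort(votes, decreasing = TRUE)[[2]]
margin
```

The winner leads the runner-up (2 clusters) by a single vote, so the choice of 7 is not strongly separated from the alternatives.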
winedf.pca = winedf[,2:14]
no_of_Clusters = NbClust(winedf.pca, distance = "euclidean", min.nc = 2, max.nc = 10, method = "complete", index ="all")
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 6 proposed 2 as the best number of clusters
## * 5 proposed 3 as the best number of clusters
## * 1 proposed 4 as the best number of clusters
## * 7 proposed 7 as the best number of clusters
## * 1 proposed 8 as the best number of clusters
## * 4 proposed 10 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 7
##
##
## *******************************************************************
# Plot bar chart for the clusters
fviz_nbclust(no_of_Clusters) + theme_minimal()
## Among all indices:
## ===================
## * 2 proposed 0 as the best number of clusters
## * 6 proposed 2 as the best number of clusters
## * 5 proposed 3 as the best number of clusters
## * 1 proposed 4 as the best number of clusters
## * 7 proposed 7 as the best number of clusters
## * 1 proposed 8 as the best number of clusters
## * 4 proposed 10 as the best number of clusters
##
## Conclusion
## =========================
## * According to the majority rule, the best number of clusters is 7 .
hclust.complete = eclust(winedf.pca, "hclust", k = 7, method = "complete", graph = FALSE)
fviz_dend(hclust.complete, rect = TRUE, show_labels = FALSE)
# k = 7 to match the NbClust recommendation and the object name km.7
km.7 = eclust(winedf.pca, "kmeans", k = 7, nstart = 25, graph = FALSE)
# ellipse.type replaces the deprecated frame.type argument
fviz_cluster(km.7, geom = "point", ellipse.type = "norm")
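When PCA was applied to the full set of 13 variables, it indicated that about 90% of the variance is captured by the first 7 principal components. We then ran NbClust and plotted dendrograms for both data sets used above, winedf and winedf.pca, and found that 7 clusters are suggested in each case and that the dendrograms appear identical. Note that winedf.pca as defined above contains columns 2 to 14 of the original data rather than the scores of the first 7 components, so this comparison does not yet test the reduced data.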
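A comparison between the full data and the reduced component scores could be sketched as follows. This is only an illustration under stated assumptions: it uses the built-in iris data as a stand-in so that it runs without the wine file, keeps 2 components and 3 clusters instead of 7 and 7, and uses base R (hclust, cutree) rather than eclust. All names (x, pca, n_comp, k, scores, hc_full, hc_pca, agreement) are hypothetical.

```r
# Stand-in data (assumption): four numeric columns of the built-in iris data
x <- iris[, 1:4]

# PCA on the correlation matrix, keeping the component scores
pca <- princomp(x, cor = TRUE, scores = TRUE)

n_comp <- 2   # number of components to keep (7 in the wine analysis)
k      <- 3   # number of clusters to cut (7 in the wine analysis)

# Reduced data: scores on the first n_comp components
scores <- pca$scores[, 1:n_comp, drop = FALSE]

# Complete-linkage hierarchical clustering on full (standardized) and reduced data
hc_full <- hclust(dist(scale(x)), method = "complete")
hc_pca  <- hclust(dist(scores),   method = "complete")

# Cut both trees into k clusters and cross-tabulate the assignments
cl_full <- cutree(hc_full, k = k)
cl_pca  <- cutree(hc_pca,  k = k)
agreement <- table(full = cl_full, reduced = cl_pca)
agreement
```

For the wine data, the reduced input would presumably be W.pca$scores[, 1:7]. A cross-tabulation with most counts concentrated in one cell per row indicates that the two clusterings largely agree, which is a firmer check than comparing dendrograms by eye.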