This wine dataset is a result of chemical analysis of wines grown in a particular area. The analysis determined the quantities of 13 constituents found in each of the three types of wines. The attributes are: Alcohol, Malic acid, Ash, Alkalinity of ash, Magnesium, Total phenols, Flavonoids, Non-Flavonoid phenols, Proanthocyanins, Color intensity, Hue, OD280/OD315 of diluted wines, and Proline. The data set has 178 observations and no missing values.
Load thedata set
Wine <- read.csv("file:///C:/Users/badal/Desktop/datset_/Wine.csv")
head(Wine)
str(Wine)
'data.frame': 178 obs. of 14 variables:
$ Alcohol : num 14.2 13.2 13.2 14.4 13.2 ...
$ Malic_Acid : num 1.71 1.78 2.36 1.95 2.59 1.76 1.87 2.15 1.64 1.35 ...
$ Ash : num 2.43 2.14 2.67 2.5 2.87 2.45 2.45 2.61 2.17 2.27 ...
$ Ash_Alcanity : num 15.6 11.2 18.6 16.8 21 15.2 14.6 17.6 14 16 ...
$ Magnesium : int 127 100 101 113 118 112 96 121 97 98 ...
$ Total_Phenols : num 2.8 2.65 2.8 3.85 2.8 3.27 2.5 2.6 2.8 2.98 ...
$ Flavanoids : num 3.06 2.76 3.24 3.49 2.69 3.39 2.52 2.51 2.98 3.15 ...
$ Nonflavanoid_Phenols: num 0.28 0.26 0.3 0.24 0.39 0.34 0.3 0.31 0.29 0.22 ...
$ Proanthocyanins : num 2.29 1.28 2.81 2.18 1.82 1.97 1.98 1.25 1.98 1.85 ...
$ Color_Intensity : num 5.64 4.38 5.68 7.8 4.32 6.75 5.25 5.05 5.2 7.22 ...
$ Hue : num 1.04 1.05 1.03 0.86 1.04 1.05 1.02 1.06 1.08 1.01 ...
$ OD280 : num 3.92 3.4 3.17 3.45 2.93 2.85 3.58 3.58 2.85 3.55 ...
$ Proline : int 1065 1050 1185 1480 735 1450 1290 1295 1045 1045 ...
$ Customer_Segment : int 1 1 1 1 1 1 1 1 1 1 ...
summary(Wine)
Alcohol Malic_Acid Ash Ash_Alcanity Magnesium Total_Phenols Flavanoids
Min. :11.03 Min. :0.740 Min. :1.360 Min. :10.60 Min. : 70.00 Min. :0.980 Min. :0.340
1st Qu.:12.36 1st Qu.:1.603 1st Qu.:2.210 1st Qu.:17.20 1st Qu.: 88.00 1st Qu.:1.742 1st Qu.:1.205
Median :13.05 Median :1.865 Median :2.360 Median :19.50 Median : 98.00 Median :2.355 Median :2.135
Mean :13.00 Mean :2.336 Mean :2.367 Mean :19.49 Mean : 99.74 Mean :2.295 Mean :2.029
3rd Qu.:13.68 3rd Qu.:3.083 3rd Qu.:2.558 3rd Qu.:21.50 3rd Qu.:107.00 3rd Qu.:2.800 3rd Qu.:2.875
Max. :14.83 Max. :5.800 Max. :3.230 Max. :30.00 Max. :162.00 Max. :3.880 Max. :5.080
Nonflavanoid_Phenols Proanthocyanins Color_Intensity Hue OD280 Proline
Min. :0.1300 Min. :0.410 Min. : 1.280 Min. :0.4800 Min. :1.270 Min. : 278.0
1st Qu.:0.2700 1st Qu.:1.250 1st Qu.: 3.220 1st Qu.:0.7825 1st Qu.:1.938 1st Qu.: 500.5
Median :0.3400 Median :1.555 Median : 4.690 Median :0.9650 Median :2.780 Median : 673.5
Mean :0.3619 Mean :1.591 Mean : 5.058 Mean :0.9574 Mean :2.612 Mean : 746.9
3rd Qu.:0.4375 3rd Qu.:1.950 3rd Qu.: 6.200 3rd Qu.:1.1200 3rd Qu.:3.170 3rd Qu.: 985.0
Max. :0.6600 Max. :3.580 Max. :13.000 Max. :1.7100 Max. :4.000 Max. :1680.0
Customer_Segment
Min. :1.000
1st Qu.:1.000
Median :2.000
Mean :1.938
3rd Qu.:3.000
Max. :3.000
any(is.na(Wine))
[1] FALSE
Wine_scale <- scale(Wine[-1])
dim(Wine_scale)
[1] 178 13
head(Wine_scale,3)
Malic_Acid Ash Ash_Alcanity Magnesium Total_Phenols Flavanoids Nonflavanoid_Phenols Proanthocyanins
[1,] -0.56066822 0.2313998 -1.1663032 1.90852151 0.8067217 1.0319081 -0.6577078 1.2214385
[2,] -0.49800856 -0.8256672 -2.4838405 0.01809398 0.5670481 0.7315653 -0.8184106 -0.5431887
[3,] 0.02117152 1.1062139 -0.2679823 0.08810981 0.8067217 1.2121137 -0.4970050 2.1299594
Color_Intensity Hue OD280 Proline Customer_Segment
[1,] 0.2510088 0.3611585 1.8427215 1.0101594 -1.210529
[2,] -0.2924962 0.4049085 1.1103172 0.9625263 -1.210529
[3,] 0.2682629 0.3174085 0.7863692 1.3912237 -1.210529
library(cluster)
library(factoextra)
library(ggplot2)
A fundamental question is how to determine the value of the parameter
. If we looks at the percentage of variance explained as a function of the number of clusters: One should choose a number of clusters so that adding another cluster doesn’t give much better modeling of the data. More precisely, if one plots the percentage of variance explained by the clusters against the number of clusters, the first clusters will add much information (explain a lot of variance), but at some point the marginal gain will drop, giving an angle in the graph. The number of clusters is chosen at this point, hence the “elbow criterion”.
wssplot <- function(data, nc=15, seed=1234){
wss <- (nrow(data)-1)*sum(apply(data,2,var))
for (i in 2:nc){
set.seed(seed)
wss[i] <- sum(kmeans(data, centers=i)$withinss)}
plot(1:nc, wss, type="b", xlab="Number of Clusters",
ylab="Within groups sum of squares")}
wssplot(Wine_scale, nc=10)
fviz_nbclust(Wine_scale, kmeans, method = 'wss') +
geom_vline(xintercept =3, linetype=5, col= "darkred")
k.means <- kmeans(Wine_scale, 3,nstart = 25) # k = 4
k.means
K-means clustering with 3 clusters of sizes 61, 49, 68
Cluster means:
Malic_Acid Ash Ash_Alcanity Magnesium Total_Phenols Flavanoids Nonflavanoid_Phenols Proanthocyanins
1 -0.3496504 -0.4749050 0.1821600 -0.63500454 -0.1786906 -0.03931453 0.09838569 -0.1419158
2 0.9024258 0.2485092 0.5820616 -0.05049296 -0.9857762 -1.23271740 0.71482528 -0.7474990
3 -0.3366204 0.2469449 -0.5828349 0.60602106 0.8706347 0.92354910 -0.60335244 0.6659457
Color_Intensity Hue OD280 Proline Customer_Segment
1 -0.8885415 0.4498060 0.2139764 -0.8242592 0.07973544
2 0.9857177 -1.1879477 -1.2978785 -0.3789756 1.34366784
3 0.0867774 0.4525187 0.7432866 1.0124943 -1.03975861
Clustering vector:
[1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
[57] 3 3 3 1 1 1 1 1 1 1 1 1 1 3 1 1 1 3 3 1 1 1 3 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 3 1 1 3 1 1 1 1 1 1 1 1 1 1 3 3 1
[113] 1 1 1 1 1 1 1 1 1 3 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[169] 2 2 2 2 2 2 2 2 2 2
Within cluster sum of squares by cluster:
[1] 451.3039 284.5561 459.7529
(between_SS / total_SS = 48.0 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss" "betweenss" "size"
[8] "iter" "ifault"
# Centroids:
k.means$centers
Malic_Acid Ash Ash_Alcanity Magnesium Total_Phenols Flavanoids Nonflavanoid_Phenols Proanthocyanins
1 -0.3496504 -0.4749050 0.1821600 -0.63500454 -0.1786906 -0.03931453 0.09838569 -0.1419158
2 0.9024258 0.2485092 0.5820616 -0.05049296 -0.9857762 -1.23271740 0.71482528 -0.7474990
3 -0.3366204 0.2469449 -0.5828349 0.60602106 0.8706347 0.92354910 -0.60335244 0.6659457
Color_Intensity Hue OD280 Proline Customer_Segment
1 -0.8885415 0.4498060 0.2139764 -0.8242592 0.07973544
2 0.9857177 -1.1879477 -1.2978785 -0.3789756 1.34366784
3 0.0867774 0.4525187 0.7432866 1.0124943 -1.03975861
# Clusters:
k.means$cluster
[1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
[57] 3 3 3 1 1 1 1 1 1 1 1 1 1 3 1 1 1 3 3 1 1 1 3 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 3 1 1 3 1 1 1 1 1 1 1 1 1 1 3 3 1
[113] 1 1 1 1 1 1 1 1 1 3 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[169] 2 2 2 2 2 2 2 2 2 2
Determining cluster size that is, the number of points in each cluster.
# Cluster size:
k.means$size
[1] 61 49 68
2D representation of clusters
clusplot(Wine_scale, k.means$cluster, main='2D representation of the Cluster',
color=TRUE, shade=TRUE,
labels=2, lines=0)
fviz_cluster(object = k.means, # kmeans object
data = Wine_scale, # data used for clustering
ellipse.type = "norm",
geom = "point",
palette = "jco",
main = "",
ggtheme = theme_minimal())
The optimal number of clusters that are 3 in numbers and visualize K-mean clustering.