A loose definition of clustering is “the process of organizing objects into groups whose members are similar in some way”. A cluster is therefore a collection of objects that are similar to one another and dissimilar to the objects in other clusters.
Consider a data set that records, for a number of different wines, details such as alcohol content, ash and alkalinity. We want to identify which wines are similar and which are not, based on these measurements.
Here we will use the k-means clustering algorithm to achieve this.
#1. Loading the dataset:
rm(list=ls())
library(rattle)
## Rattle: A free graphical interface for data mining with R.
## Version 4.1.0 Copyright (c) 2006-2015 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
library(digest)
library(stringi)
library(cluster)
# Note: in newer rattle releases the wine data may live in the rattle.data package instead.
data(wine, package="rattle")
summary(wine)
##  Type     Alcohol          Malic            Ash           Alcalinity
##  1:59   Min.   :11.03   Min.   :0.740   Min.   :1.360   Min.   :10.60
##  2:71   1st Qu.:12.36   1st Qu.:1.603   1st Qu.:2.210   1st Qu.:17.20
##  3:48   Median :13.05   Median :1.865   Median :2.360   Median :19.50
##         Mean   :13.00   Mean   :2.336   Mean   :2.367   Mean   :19.49
##         3rd Qu.:13.68   3rd Qu.:3.083   3rd Qu.:2.558   3rd Qu.:21.50
##         Max.   :14.83   Max.   :5.800   Max.   :3.230   Max.   :30.00
##    Magnesium         Phenols        Flavanoids    Nonflavanoids
##  Min.   : 70.00   Min.   :0.980   Min.   :0.340   Min.   :0.1300
##  1st Qu.: 88.00   1st Qu.:1.742   1st Qu.:1.205   1st Qu.:0.2700
##  Median : 98.00   Median :2.355   Median :2.135   Median :0.3400
##  Mean   : 99.74   Mean   :2.295   Mean   :2.029   Mean   :0.3619
##  3rd Qu.:107.00   3rd Qu.:2.800   3rd Qu.:2.875   3rd Qu.:0.4375
##  Max.   :162.00   Max.   :3.880   Max.   :5.080   Max.   :0.6600
##  Proanthocyanins     Color             Hue            Dilution
##  Min.   :0.410   Min.   : 1.280   Min.   :0.4800   Min.   :1.270
##  1st Qu.:1.250   1st Qu.: 3.220   1st Qu.:0.7825   1st Qu.:1.938
##  Median :1.555   Median : 4.690   Median :0.9650   Median :2.780
##  Mean   :1.591   Mean   : 5.058   Mean   :0.9574   Mean   :2.612
##  3rd Qu.:1.950   3rd Qu.: 6.200   3rd Qu.:1.1200   3rd Qu.:3.170
##  Max.   :3.580   Max.   :13.000   Max.   :1.7100   Max.   :4.000
##     Proline
##  Min.   : 278.0
##  1st Qu.: 500.5
##  Median : 673.5
##  Mean   : 746.9
##  3rd Qu.: 985.0
##  Max.   :1680.0
#2. Note that the variables have very different means and variances because they are measured in different units. They must be standardized (i.e., scaled) to make them comparable. Standardization transforms each variable so that it has mean zero and standard deviation one. Since we don't want the k-means result to depend on an arbitrary choice of units, we start by scaling the data with the R function scale(). The first column, Type, is the known wine type and is left out:
wine2<- scale(wine[,2:14])
head(wine2)
## Alcohol Malic Ash Alcalinity Magnesium Phenols
## [1,] 1.5143408 -0.56066822 0.2313998 -1.1663032 1.90852151 0.8067217
## [2,] 0.2455968 -0.49800856 -0.8256672 -2.4838405 0.01809398 0.5670481
## [3,] 0.1963252 0.02117152 1.1062139 -0.2679823 0.08810981 0.8067217
## [4,] 1.6867914 -0.34583508 0.4865539 -0.8069748 0.92829983 2.4844372
## [5,] 0.2948684 0.22705328 1.8352256 0.4506745 1.27837900 0.8067217
## [6,] 1.4773871 -0.51591132 0.3043010 -1.2860793 0.85828399 1.5576991
## Flavanoids Nonflavanoids Proanthocyanins Color Hue
## [1,] 1.0319081 -0.6577078 1.2214385 0.2510088 0.3611585
## [2,] 0.7315653 -0.8184106 -0.5431887 -0.2924962 0.4049085
## [3,] 1.2121137 -0.4970050 2.1299594 0.2682629 0.3174085
## [4,] 1.4623994 -0.9791134 1.0292513 1.1827317 -0.4263410
## [5,] 0.6614853 0.2261576 0.4002753 -0.3183774 0.3611585
## [6,] 1.3622851 -0.1755994 0.6623487 0.7298108 0.4049085
## Dilution Proline
## [1,] 1.8427215 1.01015939
## [2,] 1.1103172 0.96252635
## [3,] 0.7863692 1.39122370
## [4,] 1.1807407 2.32800680
## [5,] 0.4483365 -0.03776747
## [6,] 0.3356589 2.23274072
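As a quick sanity check of what scale() does, the sketch below standardizes a small matrix by hand and compares the result with scale(). The matrix `m` and its values are made up for illustration and are not taken from the wine data.

```r
# Toy matrix (made-up values) with columns on very different scales
m <- cbind(a = c(1, 2, 3, 4), b = c(100, 300, 200, 600))

# Manual standardization: subtract the column mean, divide by the column sd
manual <- sweep(m, 2, colMeans(m), "-")
manual <- sweep(manual, 2, apply(m, 2, sd), "/")

# scale() does the same and stores the means and sds as attributes
auto <- scale(m)

all.equal(manual, auto, check.attributes = FALSE)  # TRUE
round(colMeans(auto), 10)   # column means are 0
apply(auto, 2, sd)          # column standard deviations are 1
```

The attributes "scaled:center" and "scaled:scale" on the result keep the original means and standard deviations, which is useful if values need to be converted back to the original units later.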
#3. Determine the optimal number of clusters in the data
#Partitioning methods such as k-means require the user to specify the number of clusters to generate. A simple approach is to run the clustering algorithm for a range of values of k and plot the WSS (within-cluster sum of squares) against the number of clusters. The location of a bend (knee, or elbow) in the plot is generally taken as an indicator of the appropriate number of clusters.
wine2 <- na.omit(wine2)
# For k = 1 the WSS equals the total sum of squares of the scaled data.
wss2 <- (nrow(wine2)-1)*sum(apply(wine2,2,var))
# For k = 2..10, record the total within-cluster sum of squares.
for(i in 2:10) wss2[i] <- sum(kmeans(wine2,centers=i)$withinss)
plot(1:10, wss2, type="b", xlab="Number of Clusters", ylab="Within groups sum of squares",
     main="Assessing the Optimal Number of Clusters with the Elbow Method", pch=20, cex=2)
#The bend (knee) in the graph occurs at k=3.
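Because kmeans() starts from random centers, the elbow curve can change slightly from run to run. The following is a minimal sketch of a reproducible variant, assuming the built-in iris measurements as stand-in data (not the wine data); the seed value and nstart = 25 are illustrative choices.

```r
# Stand-in data: the four numeric iris columns, standardized
x <- scale(iris[, 1:4])

set.seed(123)  # fix the random starts so the curve is reproducible

# k = 1: total sum of squares; k = 2..10: best of 25 random starts each
wss <- c((nrow(x) - 1) * sum(apply(x, 2, var)),
         sapply(2:10, function(k) kmeans(x, centers = k, nstart = 25)$tot.withinss))

# Reduction in WSS gained by each additional cluster;
# the elbow is roughly where these gains flatten out
round(-diff(wss), 1)
```

Looking at the successive drops in WSS gives a numeric counterpart to reading the bend off the plot.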
#4. Compute k-means clustering with k=3 (nstart = 25 tries 25 random starting configurations and keeps the best one)
km3<-kmeans(wine2,3,nstart = 25)
km3
## K-means clustering with 3 clusters of sizes 51, 62, 65
##
## Cluster means:
## Alcohol Malic Ash Alcalinity Magnesium Phenols
## 1 0.1644436 0.8690954 0.1863726 0.5228924 -0.07526047 -0.97657548
## 2 0.8328826 -0.3029551 0.3636801 -0.6084749 0.57596208 0.88274724
## 3 -0.9234669 -0.3929331 -0.4931257 0.1701220 -0.49032869 -0.07576891
## Flavanoids Nonflavanoids Proanthocyanins Color Hue
## 1 -1.21182921 0.72402116 -0.77751312 0.9388902 -1.1615122
## 2 0.97506900 -0.56050853 0.57865427 0.1705823 0.4726504
## 3 0.02075402 -0.03343924 0.05810161 -0.8993770 0.4605046
## Dilution Proline
## 1 -1.2887761 -0.4059428
## 2 0.7770551 1.1220202
## 3 0.2700025 -0.7517257
##
## Clustering vector:
## [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [36] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 1 3 3 3 3 3 3 3 3
## [71] 3 3 3 2 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3
## [106] 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 3 2 3 3 3 3 3 3 3 3 1 1 1 1 1 1 1 1 1 1
## [141] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [176] 1 1 1
##
## Within cluster sum of squares by cluster:
## [1] 326.3537 385.6983 558.6971
## (between_SS / total_SS = 44.8 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"
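The "between_SS / total_SS" figure in the printed output can be recomputed from the components of the kmeans object, since the total sum of squares splits into a within-cluster part and a between-cluster part. The sketch below checks this identity on the built-in iris data (an assumption made only to keep the example self-contained) and then redoes the arithmetic for the wine output above, where the scaled data has 178 rows and 13 unit-variance columns, so the total sum of squares is 177 × 13 = 2301.

```r
x <- scale(iris[, 1:4])   # built-in data, used here only for illustration
set.seed(42)
km <- kmeans(x, centers = 3, nstart = 25)

# Total SS = within-cluster SS + between-cluster SS
all.equal(km$totss, km$tot.withinss + km$betweenss)

# Share of the total sum of squares explained by the cluster assignment
km$betweenss / km$totss

# Same arithmetic with the numbers printed for the wine clustering above
within_wine <- c(326.3537, 385.6983, 558.6971)
total_wine  <- 177 * 13
ratio_wine  <- 1 - sum(within_wine) / total_wine
round(100 * ratio_wine, 1)   # 44.8
```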
#5. Now let us cross-check the clusters against the known wine types. Ideally, each of the three wine types should fall into its own cluster. In the table below, rows are clusters and columns are wine types.
table(km3$cluster, wine$Type)
##
##      1  2  3
##   1  0  3 48
##   2 59  3  0
##   3  0 65  0
#The table is very close to the ideal scenario: only 6 of the 178 wines (all of Type 2) fall outside the cluster dominated by their type.
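One way to put a number on this agreement is cluster purity: label each cluster with its most frequent wine type and count the share of wines that match. This is a minimal sketch using the counts copied from the table above; the names `tab` and `purity` are introduced here for illustration.

```r
# Counts copied from the cross-tabulation (rows = clusters, columns = wine types)
tab <- matrix(c( 0,  3, 48,
                59,  3,  0,
                 0, 65,  0),
              nrow = 3, byrow = TRUE,
              dimnames = list(cluster = c("1", "2", "3"),
                              type    = c("1", "2", "3")))

# Purity: sum of the largest count in each cluster, divided by the total
matched <- sum(apply(tab, 1, max))
purity  <- matched / sum(tab)
purity                 # 172 / 178, about 0.966

# Wines that do not belong to their cluster's dominant type
sum(tab) - matched     # 6
```

Purity ignores the arbitrary numbering of the clusters, which matters here because cluster 1 corresponds mostly to Type 3, not Type 1.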
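#6. Let us look at the mean value of each measurement within each cluster, in the original units, e.g. the mean alcohol content in cluster 1 or the mean ash content in cluster 3. Because the clusters line up closely with the wine types (cluster 1 is mostly Type 3, cluster 2 mostly Type 1, cluster 3 entirely Type 2), these means approximately describe the types as well.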
wine_types<-aggregate(wine[-1], by=list(cluster=km3$cluster), mean)
wine_types
## cluster Alcohol Malic Ash Alcalinity Magnesium Phenols
## 1 1 13.13412 3.307255 2.417647 21.24118 98.66667 1.683922
## 2 2 13.67677 1.997903 2.466290 17.46290 107.96774 2.847581
## 3 3 12.25092 1.897385 2.231231 20.06308 92.73846 2.247692
## Flavanoids Nonflavanoids Proanthocyanins Color Hue Dilution
## 1 0.8188235 0.4519608 1.145882 7.234706 0.6919608 1.696667
## 2 3.0032258 0.2920968 1.922097 5.453548 1.0654839 3.163387
## 3 2.0500000 0.3576923 1.624154 2.973077 1.0627077 2.803385
## Proline
## 1 619.0588
## 2 1100.2258
## 3 510.1692
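The per-cluster means in original units are the same information as the "Cluster means" printed by kmeans(), only on a different scale. The sketch below, which assumes the built-in iris data purely for illustration, shows that undoing the standardization of the k-means centers gives the same numbers as aggregate() on the raw columns.

```r
raw <- iris[, 1:4]            # built-in data, for illustration only
x   <- scale(raw)
set.seed(1)
km  <- kmeans(x, centers = 3, nstart = 25)

# Option 1: average the raw columns within each cluster
by_cluster <- aggregate(raw, by = list(cluster = km$cluster), FUN = mean)

# Option 2: undo the scaling of the k-means centers
# (multiply by the stored sd, then add the stored mean)
back <- sweep(km$centers, 2, attr(x, "scaled:scale"), "*")
back <- sweep(back, 2, attr(x, "scaled:center"), "+")

all.equal(as.matrix(by_cluster[, -1]), back, check.attributes = FALSE)
```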
#7. Plotting the clustering result:
clusplot(wine2,km3$cluster, main='2D representation of the Cluster solution',color=TRUE, shade=TRUE,labels=2, lines=0)
#From this plot we can see which wines are alike and which are different. For example, according to the plot, wines 84 and 135 are similar.
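According to its help page, clusplot() represents the observations using principal components (or multidimensional scaling), so the 2D picture is a projection of the 13-dimensional data. As a rough sketch of that idea, assuming the built-in iris data and base R's prcomp() in place of whatever clusplot() does internally, the projection and the share of variance it retains can be computed directly:

```r
x <- scale(iris[, 1:4])   # built-in data, for illustration only
set.seed(1)
km <- kmeans(x, centers = 3, nstart = 25)

pca    <- prcomp(x)        # principal components of the scaled data
coords <- pca$x[, 1:2]     # coordinates on the first two components

# Share of the total variance that the 2D view retains
explained <- sum(pca$sdev[1:2]^2) / sum(pca$sdev^2)
explained

# Plain scatter plot of the same kind of projection, coloured by cluster
plot(coords, col = km$cluster, pch = 20,
     xlab = "Component 1", ylab = "Component 2")
```

When the retained share is low, points that look close in the plot may still be far apart in the full data, so apparent similarity in the 2D view is worth checking against the cluster assignment.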
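#8. Finally, let us list each wine's type alongside its assigned cluster, sorted by cluster. Note that the row numbers in the output are positions in the sorted list, not the original wine indices.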
no=order(km3$cluster)
cluster=data.frame(wine$Type[no],km3$cluster[no])
cluster
## wine.Type.no. km3.cluster.no.
## 1 2 1
## 2 2 1
## 3 2 1
## 4 3 1
## 5 3 1
## 6 3 1
## 7 3 1
## 8 3 1
## 9 3 1
## 10 3 1
## 11 3 1
## 12 3 1
## 13 3 1
## 14 3 1
## 15 3 1
## 16 3 1
## 17 3 1
## 18 3 1
## 19 3 1
## 20 3 1
## 21 3 1
## 22 3 1
## 23 3 1
## 24 3 1
## 25 3 1
## 26 3 1
## 27 3 1
## 28 3 1
## 29 3 1
## 30 3 1
## 31 3 1
## 32 3 1
## 33 3 1
## 34 3 1
## 35 3 1
## 36 3 1
## 37 3 1
## 38 3 1
## 39 3 1
## 40 3 1
## 41 3 1
## 42 3 1
## 43 3 1
## 44 3 1
## 45 3 1
## 46 3 1
## 47 3 1
## 48 3 1
## 49 3 1
## 50 3 1
## 51 3 1
## 52 1 2
## 53 1 2
## 54 1 2
## 55 1 2
## 56 1 2
## 57 1 2
## 58 1 2
## 59 1 2
## 60 1 2
## 61 1 2
## 62 1 2
## 63 1 2
## 64 1 2
## 65 1 2
## 66 1 2
## 67 1 2
## 68 1 2
## 69 1 2
## 70 1 2
## 71 1 2
## 72 1 2
## 73 1 2
## 74 1 2
## 75 1 2
## 76 1 2
## 77 1 2
## 78 1 2
## 79 1 2
## 80 1 2
## 81 1 2
## 82 1 2
## 83 1 2
## 84 1 2
## 85 1 2
## 86 1 2
## 87 1 2
## 88 1 2
## 89 1 2
## 90 1 2
## 91 1 2
## 92 1 2
## 93 1 2
## 94 1 2
## 95 1 2
## 96 1 2
## 97 1 2
## 98 1 2
## 99 1 2
## 100 1 2
## 101 1 2
## 102 1 2
## 103 1 2
## 104 1 2
## 105 1 2
## 106 1 2
## 107 1 2
## 108 1 2
## 109 1 2
## 110 1 2
## 111 2 2
## 112 2 2
## 113 2 2
## 114 2 3
## 115 2 3
## 116 2 3
## 117 2 3
## 118 2 3
## 119 2 3
## 120 2 3
## 121 2 3
## 122 2 3
## 123 2 3
## 124 2 3
## 125 2 3
## 126 2 3
## 127 2 3
## 128 2 3
## 129 2 3
## 130 2 3
## 131 2 3
## 132 2 3
## 133 2 3
## 134 2 3
## 135 2 3
## 136 2 3
## 137 2 3
## 138 2 3
## 139 2 3
## 140 2 3
## 141 2 3
## 142 2 3
## 143 2 3
## 144 2 3
## 145 2 3
## 146 2 3
## 147 2 3
## 148 2 3
## 149 2 3
## 150 2 3
## 151 2 3
## 152 2 3
## 153 2 3
## 154 2 3
## 155 2 3
## 156 2 3
## 157 2 3
## 158 2 3
## 159 2 3
## 160 2 3
## 161 2 3
## 162 2 3
## 163 2 3
## 164 2 3
## 165 2 3
## 166 2 3
## 167 2 3
## 168 2 3
## 169 2 3
## 170 2 3
## 171 2 3
## 172 2 3
## 173 2 3
## 174 2 3
## 175 2 3
## 176 2 3
## 177 2 3
## 178 2 3
#9. We can apply this methodology to customer segmentation, where customers are divided into clusters based on their behavior. The method has wide application in marketing, consumer finance, etc.