Clustering/Segmentation - This method has wide application in consumer marketing and consumer banking

A loose definition of clustering could be “the process of organizing objects into groups whose members are similar in some way”. A cluster is therefore a collection of objects that are “similar” to one another and “dissimilar” to the objects belonging to other clusters.

Let us take a data set in which different wines and their individual details, such as alcohol content, ash, alkalinity, etc., are given. We need to identify which wines are similar and which are not, based on the given data.

Here we will use a clustering algorithm to achieve this.

#1. Loading the dataset:
rm(list=ls())
library(rattle)

## Rattle: A free graphical interface for data mining with R.
## Version 4.1.0 Copyright (c) 2006-2015 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.

library(digest)
library(stringi)
library(cluster)
data(wine, package="rattle")

summary(wine)

##  Type      Alcohol          Malic            Ash          Alcalinity   
##  1:59   Min.   :11.03   Min.   :0.740   Min.   :1.360   Min.   :10.60  
##  2:71   1st Qu.:12.36   1st Qu.:1.603   1st Qu.:2.210   1st Qu.:17.20  
##  3:48   Median :13.05   Median :1.865   Median :2.360   Median :19.50  
##         Mean   :13.00   Mean   :2.336   Mean   :2.367   Mean   :19.49  
##         3rd Qu.:13.68   3rd Qu.:3.083   3rd Qu.:2.558   3rd Qu.:21.50  
##         Max.   :14.83   Max.   :5.800   Max.   :3.230   Max.   :30.00  
##    Magnesium         Phenols        Flavanoids    Nonflavanoids   
##  Min.   : 70.00   Min.   :0.980   Min.   :0.340   Min.   :0.1300  
##  1st Qu.: 88.00   1st Qu.:1.742   1st Qu.:1.205   1st Qu.:0.2700  
##  Median : 98.00   Median :2.355   Median :2.135   Median :0.3400  
##  Mean   : 99.74   Mean   :2.295   Mean   :2.029   Mean   :0.3619  
##  3rd Qu.:107.00   3rd Qu.:2.800   3rd Qu.:2.875   3rd Qu.:0.4375  
##  Max.   :162.00   Max.   :3.880   Max.   :5.080   Max.   :0.6600  
##  Proanthocyanins     Color             Hue            Dilution    
##  Min.   :0.410   Min.   : 1.280   Min.   :0.4800   Min.   :1.270  
##  1st Qu.:1.250   1st Qu.: 3.220   1st Qu.:0.7825   1st Qu.:1.938  
##  Median :1.555   Median : 4.690   Median :0.9650   Median :2.780  
##  Mean   :1.591   Mean   : 5.058   Mean   :0.9574   Mean   :2.612  
##  3rd Qu.:1.950   3rd Qu.: 6.200   3rd Qu.:1.1200   3rd Qu.:3.170  
##  Max.   :3.580   Max.   :13.000   Max.   :1.7100   Max.   :4.000  
##     Proline      
##  Min.   : 278.0  
##  1st Qu.: 500.5  
##  Median : 673.5  
##  Mean   : 746.9  
##  3rd Qu.: 985.0  
##  Max.   :1680.0

#2. Note that the variables have very different means and variances. This is because the variables are measured in different units; they must be standardized (i.e., scaled) to make them comparable. Standardization transforms the variables so that they have mean zero and standard deviation one. As we don't want the k-means algorithm to depend on an arbitrary variable unit, we start by scaling the data using the R function scale() as follows:

wine2<- scale(wine[,2:14])
head(wine2)

##        Alcohol       Malic        Ash Alcalinity  Magnesium   Phenols
## [1,] 1.5143408 -0.56066822  0.2313998 -1.1663032 1.90852151 0.8067217
## [2,] 0.2455968 -0.49800856 -0.8256672 -2.4838405 0.01809398 0.5670481
## [3,] 0.1963252  0.02117152  1.1062139 -0.2679823 0.08810981 0.8067217
## [4,] 1.6867914 -0.34583508  0.4865539 -0.8069748 0.92829983 2.4844372
## [5,] 0.2948684  0.22705328  1.8352256  0.4506745 1.27837900 0.8067217
## [6,] 1.4773871 -0.51591132  0.3043010 -1.2860793 0.85828399 1.5576991
##      Flavanoids Nonflavanoids Proanthocyanins      Color        Hue
## [1,]  1.0319081    -0.6577078       1.2214385  0.2510088  0.3611585
## [2,]  0.7315653    -0.8184106      -0.5431887 -0.2924962  0.4049085
## [3,]  1.2121137    -0.4970050       2.1299594  0.2682629  0.3174085
## [4,]  1.4623994    -0.9791134       1.0292513  1.1827317 -0.4263410
## [5,]  0.6614853     0.2261576       0.4002753 -0.3183774  0.3611585
## [6,]  1.3622851    -0.1755994       0.6623487  0.7298108  0.4049085
##       Dilution     Proline
## [1,] 1.8427215  1.01015939
## [2,] 1.1103172  0.96252635
## [3,] 0.7863692  1.39122370
## [4,] 1.1807407  2.32800680
## [5,] 0.4483365 -0.03776747
## [6,] 0.3356589  2.23274072
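
To make the standardization step concrete, here is a minimal sketch on a small made-up matrix (the names x, z and z_manual and the numbers are hypothetical, not taken from the wine data). It checks that scale() gives the same result as subtracting each column's mean and dividing by its standard deviation.

```r
# Hypothetical toy matrix: two columns measured on very different scales
x <- cbind(a = c(1, 2, 3, 4), b = c(100, 300, 200, 400))

# Standardize with scale(): centers each column and divides by its sd
z <- scale(x)

# The same transformation written out by hand, column by column
z_manual <- apply(x, 2, function(col) (col - mean(col)) / sd(col))

# Both versions agree; attributes added by scale() are ignored in the comparison
print(all.equal(z, z_manual, check.attributes = FALSE))

# After scaling, every column has mean 0 and standard deviation 1
print(round(colMeans(z), 10))
print(apply(z, 2, sd))
```

After this transformation both columns contribute on an equal footing to the Euclidean distances that k-means relies on.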

#3. Determine the optimal number of clusters in the data

#Partitioning methods such as k-means require the user to specify the number of clusters to be generated. Here, we provide a simple solution. The idea is to run the clustering algorithm for different numbers of clusters k. Next, the WSS (within-cluster sum of squares) is plotted against the number of clusters. The location of a bend (knee) in the plot is generally considered an indicator of the appropriate number of clusters.

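# Drop rows with missing values (the summary above shows none, so this is only a safeguard)
wine2<-na.omit(wine2)

# For k = 1 the within-groups sum of squares equals the total sum of squares
wss2 <- (nrow(wine2)-1)*sum(apply(wine2,2,var))

# For k = 2..10, run k-means and record the total within-groups sum of squares
for(i in 2:10) wss2[i] <- sum(kmeans(wine2,centers=i)$withinss)

# Plot WSS against k and look for the bend
plot(1:10, wss2, type="b", xlab="Number of Clusters",ylab="Within groups sum of squares",main="Assessing the Optimal Number of Clusters with the Elbow Method",pch=20, cex=2)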

#The bend (knee) in the graph occurs at k=3
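
The elbow computation can also be sketched in a self-contained way. The example below is an illustration under stated assumptions: it uses the built-in iris measurements as stand-in data instead of the wine data, fixes a seed, and adds nstart and iter.max, none of which appear in the original code. It also confirms that the k = 1 value equals the closed-form total sum of squares used above.

```r
set.seed(42)  # assumption: fixed seed so the random starts are reproducible

# Stand-in data (not the wine data): scaled numeric columns of built-in iris
x <- scale(iris[, 1:4])

# Total within-cluster sum of squares for k = 1..6
wss <- sapply(1:6, function(k) {
  kmeans(x, centers = k, nstart = 25, iter.max = 50)$tot.withinss
})

# Closed form for k = 1: (n - 1) times the sum of the column variances
total_ss <- (nrow(x) - 1) * sum(apply(x, 2, var))
print(all.equal(wss[1], total_ss))

# Elbow plot: look for the k after which the curve flattens
plot(1:6, wss, type = "b", pch = 20,
     xlab = "Number of Clusters", ylab = "Within groups sum of squares")
```

Using several random starts per k makes each point on the curve less dependent on an unlucky initialization.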

#4. Compute k-means clustering

km3<-kmeans(wine2,3,nstart = 25)

km3

## K-means clustering with 3 clusters of sizes 51, 62, 65
## 
## Cluster means:
##      Alcohol      Malic        Ash Alcalinity   Magnesium     Phenols
## 1  0.1644436  0.8690954  0.1863726  0.5228924 -0.07526047 -0.97657548
## 2  0.8328826 -0.3029551  0.3636801 -0.6084749  0.57596208  0.88274724
## 3 -0.9234669 -0.3929331 -0.4931257  0.1701220 -0.49032869 -0.07576891
##    Flavanoids Nonflavanoids Proanthocyanins      Color        Hue
## 1 -1.21182921    0.72402116     -0.77751312  0.9388902 -1.1615122
## 2  0.97506900   -0.56050853      0.57865427  0.1705823  0.4726504
## 3  0.02075402   -0.03343924      0.05810161 -0.8993770  0.4605046
##     Dilution    Proline
## 1 -1.2887761 -0.4059428
## 2  0.7770551  1.1220202
## 3  0.2700025 -0.7517257
## 
## Clustering vector:
##   [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [36] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 1 3 3 3 3 3 3 3 3
##  [71] 3 3 3 2 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3
## [106] 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 3 2 3 3 3 3 3 3 3 3 1 1 1 1 1 1 1 1 1 1
## [141] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [176] 1 1 1
## 
## Within cluster sum of squares by cluster:
## [1] 326.3537 385.6983 558.6971
##  (between_SS / total_SS =  44.8 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"
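
The between_SS / total_SS figure in the output above can be re-derived by hand. The sketch below copies the within-cluster sums of squares from the printed output and relies on the fact that each standardized column has variance one, so the total sum of squares is (n - 1) * p. The variable names are illustrative only.

```r
# Within-cluster sums of squares, copied from the printed km3 output
withinss <- c(326.3537, 385.6983, 558.6971)

# Total sum of squares of the standardized data:
# n = 178 wines, p = 13 variables, each column has variance 1
totss <- (178 - 1) * 13

# What the clusters explain is the total minus what remains inside them
betweenss <- totss - sum(withinss)

# Share of the total variation explained by the clustering, in percent
explained <- round(100 * betweenss / totss, 1)
print(explained)
```

The result agrees with the 44.8 % reported by kmeans(): a little under half of the total variation is captured by the three clusters.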

#5. Now let us cross-check the clusters against the wine types. Ideally, each of the three wine types should fall into its own cluster.
table(km3$cluster, wine$Type)

##    
##      1  2  3
##   1  0  3 48
##   2 59  3  0
##   3  0 65  0

# The actual result in the table above is very close to the ideal scenario: only 6 of the 178 wines (all of type 2) fall outside the cluster dominated by their type.
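
The agreement between clusters and wine types can be summarized in a single number. The sketch below re-enters the cross-table printed above and computes the share of wines that belong to the dominant type of their cluster (often called purity). This measure and the names tab and matched are our additions for illustration, not part of the original analysis.

```r
# Cross-table copied from the output above: rows are clusters, columns are wine types
tab <- matrix(c( 0,  3, 48,
                59,  3,  0,
                 0, 65,  0),
              nrow = 3, byrow = TRUE,
              dimnames = list(cluster = 1:3, type = 1:3))

# For each cluster, count the wines of its most frequent type
matched <- sum(apply(tab, 1, max))

# Share of wines that sit in the cluster dominated by their own type
purity <- matched / sum(tab)
print(matched)
print(round(purity, 3))
```

Because cluster numbers are arbitrary labels, the comparison takes the row-wise maximum instead of the diagonal of the table.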

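#6. Let us look at the mean value of each variable within each cluster, e.g., the mean alcohol content in cluster 1, the mean ash in cluster 3, etc. The means are computed on the original (unscaled) data. Because the clusters closely match the wine types, they approximate the per-type means.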

wine_types<-aggregate(wine[-1], by=list(cluster=km3$cluster), mean)
wine_types

##   cluster  Alcohol    Malic      Ash Alcalinity Magnesium  Phenols
## 1       1 13.13412 3.307255 2.417647   21.24118  98.66667 1.683922
## 2       2 13.67677 1.997903 2.466290   17.46290 107.96774 2.847581
## 3       3 12.25092 1.897385 2.231231   20.06308  92.73846 2.247692
##   Flavanoids Nonflavanoids Proanthocyanins    Color       Hue Dilution
## 1  0.8188235     0.4519608        1.145882 7.234706 0.6919608 1.696667
## 2  3.0032258     0.2920968        1.922097 5.453548 1.0654839 3.163387
## 3  2.0500000     0.3576923        1.624154 2.973077 1.0627077 2.803385
##     Proline
## 1  619.0588
## 2 1100.2258
## 3  510.1692
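
For readers unfamiliar with aggregate(), here is a minimal sketch of the same call pattern on a tiny made-up data frame. The data frame toy, the grouping vector grp and all values are hypothetical.

```r
# Hypothetical measurements for four wines
toy <- data.frame(alcohol = c(13, 14, 12, 12.5),
                  ash     = c(2.4, 2.6, 2.2, 2.0))

# Hypothetical cluster assignment, one label per row of toy
grp <- c(1, 1, 2, 2)

# Mean of every column within each cluster, as in the wine example
means <- aggregate(toy, by = list(cluster = grp), FUN = mean)
print(means)
```

Each row of the result holds one cluster label followed by the per-cluster mean of every column.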

#7. Plotting the clustering result:

clusplot(wine2,km3$cluster, main='2D representation of the Cluster solution',color=TRUE, shade=TRUE,labels=2, lines=0)

#From this plot we can see which wines are alike and which are different. For example, the plot suggests that wines 84 and 135 are similar; both are assigned to cluster 1.
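
To our understanding, clusplot() draws the observations on the first two principal components of the data. A rough equivalent can be sketched with prcomp() from base R. The example uses the built-in iris measurements as stand-in data and a fixed seed, both assumptions made for illustration, and it does not reproduce the ellipses and shading that clusplot() adds.

```r
set.seed(1)  # assumption: fixed seed so the cluster labels are reproducible

# Stand-in data (not the wine data): scaled numeric columns of built-in iris
x <- scale(iris[, 1:4])
km <- kmeans(x, centers = 3, nstart = 25)

# Project the observations onto the first two principal components
pc <- prcomp(x)
scores <- pc$x[, 1:2]

# Scatter plot of the projection, colored by cluster membership
plot(scores, col = km$cluster, pch = 20,
     xlab = "Component 1", ylab = "Component 2",
     main = "Clusters on the first two principal components")

# Label each point with its row number, similar to labels=2 in clusplot()
text(scores, labels = seq_len(nrow(scores)), pos = 3, cex = 0.6)
```

Points that lie close together in this projection are similar on the two directions that carry the most variance, which is how the plot should be read.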

#8. Finally, let us map each individual wine to its cluster.
no=order(km3$cluster)
cluster=data.frame(wine$Type[no],km3$cluster[no])
cluster

##     wine.Type.no. km3.cluster.no.
## 1               2               1
## 2               2               1
## 3               2               1
## 4               3               1
## 5               3               1
## 6               3               1
## 7               3               1
## 8               3               1
## 9               3               1
## 10              3               1
## 11              3               1
## 12              3               1
## 13              3               1
## 14              3               1
## 15              3               1
## 16              3               1
## 17              3               1
## 18              3               1
## 19              3               1
## 20              3               1
## 21              3               1
## 22              3               1
## 23              3               1
## 24              3               1
## 25              3               1
## 26              3               1
## 27              3               1
## 28              3               1
## 29              3               1
## 30              3               1
## 31              3               1
## 32              3               1
## 33              3               1
## 34              3               1
## 35              3               1
## 36              3               1
## 37              3               1
## 38              3               1
## 39              3               1
## 40              3               1
## 41              3               1
## 42              3               1
## 43              3               1
## 44              3               1
## 45              3               1
## 46              3               1
## 47              3               1
## 48              3               1
## 49              3               1
## 50              3               1
## 51              3               1
## 52              1               2
## 53              1               2
## 54              1               2
## 55              1               2
## 56              1               2
## 57              1               2
## 58              1               2
## 59              1               2
## 60              1               2
## 61              1               2
## 62              1               2
## 63              1               2
## 64              1               2
## 65              1               2
## 66              1               2
## 67              1               2
## 68              1               2
## 69              1               2
## 70              1               2
## 71              1               2
## 72              1               2
## 73              1               2
## 74              1               2
## 75              1               2
## 76              1               2
## 77              1               2
## 78              1               2
## 79              1               2
## 80              1               2
## 81              1               2
## 82              1               2
## 83              1               2
## 84              1               2
## 85              1               2
## 86              1               2
## 87              1               2
## 88              1               2
## 89              1               2
## 90              1               2
## 91              1               2
## 92              1               2
## 93              1               2
## 94              1               2
## 95              1               2
## 96              1               2
## 97              1               2
## 98              1               2
## 99              1               2
## 100             1               2
## 101             1               2
## 102             1               2
## 103             1               2
## 104             1               2
## 105             1               2
## 106             1               2
## 107             1               2
## 108             1               2
## 109             1               2
## 110             1               2
## 111             2               2
## 112             2               2
## 113             2               2
## 114             2               3
## 115             2               3
## 116             2               3
## 117             2               3
## 118             2               3
## 119             2               3
## 120             2               3
## 121             2               3
## 122             2               3
## 123             2               3
## 124             2               3
## 125             2               3
## 126             2               3
## 127             2               3
## 128             2               3
## 129             2               3
## 130             2               3
## 131             2               3
## 132             2               3
## 133             2               3
## 134             2               3
## 135             2               3
## 136             2               3
## 137             2               3
## 138             2               3
## 139             2               3
## 140             2               3
## 141             2               3
## 142             2               3
## 143             2               3
## 144             2               3
## 145             2               3
## 146             2               3
## 147             2               3
## 148             2               3
## 149             2               3
## 150             2               3
## 151             2               3
## 152             2               3
## 153             2               3
## 154             2               3
## 155             2               3
## 156             2               3
## 157             2               3
## 158             2               3
## 159             2               3
## 160             2               3
## 161             2               3
## 162             2               3
## 163             2               3
## 164             2               3
## 165             2               3
## 166             2               3
## 167             2               3
## 168             2               3
## 169             2               3
## 170             2               3
## 171             2               3
## 172             2               3
## 173             2               3
## 174             2               3
## 175             2               3
## 176             2               3
## 177             2               3
## 178             2               3
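
The listing above is sorted by cluster, so the row numbers no longer identify the original wines. A small variation keeps the original row index as its own column. The sketch below shows this on made-up vectors; the names type, cl, idx and mapping and the values are hypothetical.

```r
# Hypothetical wine types and cluster assignments for five wines
type <- factor(c(1, 2, 3, 1, 2))
cl   <- c(2, 3, 1, 2, 1)

# Row indices sorted by cluster, as in the wine example
idx <- order(cl)

# Keep the original row index next to the type and the cluster
mapping <- data.frame(wine = idx, type = type[idx], cluster = cl[idx])
print(mapping)
```

With the wine column in place, any row of the sorted listing can be traced back to the corresponding row of the original data.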

#9. We can apply this methodology to customer segmentation, where we divide customers into different clusters based on their behavior. This method has wide application in marketing, consumer finance, etc.

Raviteja

February 24, 2016