library(readr)
wholesalecustomers <- read_csv("~/Documents/School files/MS Program Spring 2016/Classes 2017/Fall 2017/CSE 891/HW/HW5/wholesalecustomers.csv")
## Parsed with column specification:
## cols(
## Channel = col_integer(),
## Region = col_integer(),
## Fresh = col_integer(),
## Milk = col_integer(),
## Grocery = col_integer(),
## Frozen = col_integer(),
## Detergents_Paper = col_integer(),
## Delicassen = col_integer()
## )
# Drop the first column twice: Channel, then Region, leaving the six spending variables
wholesalecustomers[1] = NULL
wholesalecustomers[1] = NULL
library(class)
KM1 = kmeans(wholesalecustomers, 3)
KM1
## K-means clustering with 3 clusters of sizes 330, 50, 60
##
## Cluster means:
## Fresh Milk Grocery Frozen Detergents_Paper Delicassen
## 1 8253.47 3824.603 5280.455 2572.661 1773.058 1137.497
## 2 8000.04 18511.420 27573.900 1996.680 12407.360 2252.020
## 3 35941.40 6044.450 6288.617 6713.967 1039.667 3049.467
##
## Clustering vector:
## [1] 1 1 1 1 3 1 1 1 1 2 1 1 3 1 3 1 1 1 1 1 1 1 3 2 3 1 1 1 2 3 1 1 1 3 1
## [36] 1 3 1 2 3 3 1 1 2 1 2 2 2 1 2 1 1 3 1 3 1 2 1 1 1 1 2 1 1 1 2 1 1 1 1
## [71] 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 2 2 3 1 3 1 1 2 1 1 1 1 1 1 1 1 1 1 3 1
## [106] 1 1 1 1 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 3 3 1 1 1 3 1 1 1 1 1 1 1 1 1 1
## [141] 1 3 3 1 1 2 1 1 1 3 1 1 1 1 1 2 1 1 1 1 1 1 1 2 1 2 1 1 1 1 1 2 1 2 1
## [176] 1 3 1 1 1 1 3 1 3 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1 2 2 3 1 1 2 1 1 1 2
## [211] 1 2 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1 1 1 1 3 3 3 1 1 1
## [246] 1 1 1 1 1 1 2 1 3 1 3 1 1 3 3 1 1 3 1 1 2 2 1 2 1 1 1 1 3 1 1 3 1 1 1
## [281] 1 1 3 3 3 3 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 2 1 1 2 1 2 1 1 2 1 3 2 1 1
## [316] 1 1 1 1 2 1 1 1 1 3 3 1 1 1 1 1 2 1 2 1 3 1 1 1 1 1 1 1 2 1 1 1 3 1 2
## [351] 1 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1 1 1 1 3 1 1 3 1 3 1 2
## [386] 1 1 1 1 1 1 1 1 3 1 1 1 1 1 1 1 3 3 3 1 1 3 2 1 1 1 1 1 1 1 1 1 1 2 1
## [421] 1 1 3 1 1 1 1 3 1 1 1 1 1 1 1 3 3 2 1 1
##
## Within cluster sum of squares by cluster:
## [1] 28184318853 26382784678 25765310312
## (between_SS / total_SS = 49.0 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"
KM1$centers
## Fresh Milk Grocery Frozen Detergents_Paper Delicassen
## 1 8253.47 3824.603 5280.455 2572.661 1773.058 1137.497
## 2 8000.04 18511.420 27573.900 1996.680 12407.360 2252.020
## 3 35941.40 6044.450 6288.617 6713.967 1039.667 3049.467
KM1$tot.withinss
## [1] 80332413843
KM1$betweenss
## [1] 77263443323
KM1$totss
## [1] 157595857166
Running k-means once with 3 random initial centroids produces very unevenly sized clusters: 330, 50, and 60 points. Clusters 2 and 3 are similar in size, but their centroids differ: cluster 2 is high in Milk, Grocery, and Detergents_Paper, while cluster 3 is high in Fresh and Frozen. Cluster 1 holds the remaining, lower-spending customers. The overall SSE (tot.withinss, about 8.0e10) is not that great: between_SS / total_SS is only 49.0%, so the clustering accounts for about half of the total variation and does not separate the data points very well. The large withinss values indicate that the points in each cluster are far from their centroids. The betweenss, which measures the size-weighted squared distance of each centroid from the overall mean, is also large (about 7.7e10), which is generally what we want to see when distinguishing between clusters.
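The three quantities printed above are linked: the total sum of squares splits into a within-cluster part and a between-cluster part. A minimal check in R, using only the values copied from the output above (the variable names are illustrative and not part of the original analysis):

```r
# Values copied from KM1$tot.withinss, KM1$betweenss, and KM1$totss above
tot_withinss <- 80332413843
betweenss    <- 77263443323
totss        <- 157595857166

# totss = tot.withinss + betweenss
decomposition_holds <- isTRUE(all.equal(tot_withinss + betweenss, totss))
print(decomposition_holds)

# Share of total variation captured between clusters, in percent;
# this is the figure kmeans() reports as 49.0 %
ratio_pct <- round(100 * betweenss / totss, 1)
print(ratio_pct)
```

A ratio close to 100% would mean most of the variation lies between clusters rather than inside them.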
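k <- 2:50
# tot.withinss exceeds the integer range, so store the results as doubles
SSE.Runs <- numeric(length(k))
for (v in k) {
  KM = kmeans(wholesalecustomers, v)
  SSE.Runs[v - 1] <- KM$tot.withinss  # k starts at 2, so the run for k = v goes in slot v - 1
}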
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, round.POSIXt, trunc.POSIXt, units
plot(k, SSE.Runs, type = 'b')
minor.tick(nx=10)
There does not appear to be a very strong elbow, but it appears to be somewhere near k = 5.
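Part of the reason the elbow is hard to read may be that each kmeans() call above uses a single random start, so the curve can be noisy from run to run. One possible way to smooth it is sketched below on synthetic stand-in data (an assumption, not the wholesale customers file): fix the seed and use the nstart argument so that each k keeps the best of several random starts.

```r
set.seed(1)  # make the random starts reproducible

# Synthetic stand-in data (assumption): three groups of 100 points in 2 dimensions
x <- rbind(matrix(rnorm(200, mean = 0),  ncol = 2),
           matrix(rnorm(200, mean = 5),  ncol = 2),
           matrix(rnorm(200, mean = 10), ncol = 2))

k <- 2:10
# For each k, run kmeans from 25 random starts and keep the lowest total SSE
sse <- sapply(k, function(v) kmeans(x, centers = v, nstart = 25)$tot.withinss)

plot(k, sse, type = "b", xlab = "k", ylab = "Total within-cluster SS")
```

The same pattern could be applied to the wholesalecustomers data frame by replacing x, at the cost of a longer run time.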
KM2 = kmeans(wholesalecustomers, 5)
KM2
## K-means clustering with 5 clusters of sizes 113, 227, 71, 5, 24
##
## Cluster means:
## Fresh Milk Grocery Frozen Detergents_Paper Delicassen
## 1 20600.283 3787.832 5089.841 3989.071 1130.142 1639.071
## 2 5655.819 3567.793 4513.040 2386.529 1437.559 1005.031
## 3 5207.831 13191.028 20321.718 1674.028 9036.380 1937.944
## 4 25603.000 43460.600 61472.200 2636.000 29974.200 2708.800
## 5 48777.375 6607.375 6197.792 9462.792 932.125 4435.333
##
## Clustering vector:
## [1] 2 2 2 1 1 2 2 2 2 3 2 2 1 1 1 2 2 2 1 2 1 2 1 3 1 1 2 1 3 5 1 2 1 1 2
## [36] 2 1 1 3 5 1 1 3 3 2 3 3 4 2 3 2 2 5 3 1 2 3 3 1 2 2 4 2 3 2 3 2 1 2 2
## [71] 1 1 2 1 2 1 2 3 2 2 2 3 2 1 2 4 4 5 2 1 2 1 3 1 3 2 2 2 2 2 3 3 2 5 1
## [106] 1 2 3 2 3 2 3 1 1 1 2 2 2 1 2 1 2 2 2 5 5 1 1 2 5 2 2 1 2 2 2 2 2 1 2
## [141] 1 1 5 2 1 3 2 2 2 1 1 2 1 2 2 3 3 1 2 3 2 2 1 3 2 3 2 2 2 2 3 3 2 3 2
## [176] 2 5 2 2 2 2 5 2 5 2 2 2 2 2 3 1 1 2 3 2 1 1 2 2 2 3 3 1 2 2 3 2 2 2 3
## [211] 1 3 2 2 2 3 3 1 3 2 1 2 2 2 2 2 1 2 2 2 2 2 1 2 1 2 2 1 2 5 1 1 1 2 2
## [246] 3 2 1 1 2 2 3 2 1 2 1 2 2 5 5 2 2 1 2 3 3 3 1 3 1 2 2 2 5 2 2 1 2 2 1
## [281] 2 2 5 1 5 5 2 1 1 5 2 2 2 3 1 2 1 2 2 2 1 3 2 3 3 2 3 1 2 3 2 1 3 2 2
## [316] 3 2 2 2 3 2 2 1 1 1 5 2 2 1 2 2 3 1 4 1 1 1 2 2 2 2 2 2 3 2 2 3 1 2 3
## [351] 2 3 2 3 1 2 1 3 2 2 1 2 2 2 2 2 2 2 1 2 5 1 2 1 2 2 3 5 2 2 1 1 1 2 3
## [386] 2 2 1 2 2 2 2 2 1 2 2 2 2 2 2 2 1 1 1 1 2 1 3 2 2 2 2 2 2 2 2 3 2 3 2
## [421] 2 1 1 1 1 2 3 1 2 2 2 2 1 2 1 1 5 3 2 2
##
## Within cluster sum of squares by cluster:
## [1] 9394958498 10804478229 11008166107 5682449098 16226867469
## (between_SS / total_SS = 66.3 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"
As the number of clusters grows, the SSE will generally decrease, because fewer points are allocated to each centroid. The withinss measure shrinks because the points in a cluster sit closer to their centroid, resulting in less error. However, a lower SSE does not always mean the groups are clustered better. With 5 clusters, 3 of them (sizes 71, 5, and 24) are small compared to the other 2 (sizes 113 and 227). In the output above, cluster 5 is very heavy in Fresh. Cluster 4, with only 5 customers, is very heavy in Grocery, Milk, and Detergents_Paper. Cluster 3 is heavy in Grocery and Milk and low in Fresh. Cluster 2 is more evenly distributed and low across the board, with Fresh as its largest category. Cluster 1 is heavy in Fresh. Note that k-means cluster numbers are arbitrary and can change from run to run.
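To make "heavy in" concrete, one option is to convert each centroid into shares of its own total spending. The sketch below hard-codes the cluster means printed above for KM2; the names centers, shares, and dominant are illustrative, not part of the original analysis.

```r
# Cluster means copied from the KM2 output above (one row per cluster)
centers <- matrix(c(
  20600.283,  3787.832,  5089.841, 3989.071,  1130.142, 1639.071,
   5655.819,  3567.793,  4513.040, 2386.529,  1437.559, 1005.031,
   5207.831, 13191.028, 20321.718, 1674.028,  9036.380, 1937.944,
  25603.000, 43460.600, 61472.200, 2636.000, 29974.200, 2708.800,
  48777.375,  6607.375,  6197.792, 9462.792,   932.125, 4435.333),
  nrow = 5, byrow = TRUE,
  dimnames = list(as.character(1:5),
                  c("Fresh", "Milk", "Grocery", "Frozen",
                    "Detergents_Paper", "Delicassen")))

# Each row divided by its row total: the share of spending per category
shares <- prop.table(centers, margin = 1)
print(round(100 * shares, 1))

# Category with the largest mean in each cluster
dominant <- colnames(centers)[apply(centers, 1, which.max)]
print(dominant)
```

With the live object, prop.table(KM2$centers, margin = 1) would give the same table without retyping the numbers.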
library(fpc)
plotcluster(wholesalecustomers, KM2$cluster)
T1 = data.frame(wholesalecustomers$Frozen, wholesalecustomers$Fresh)
KM3 = kmeans(T1, 5)
KM3
## K-means clustering with 5 clusters of sizes 7, 3, 136, 245, 49
##
## Cluster means:
## wholesalecustomers.Frozen wholesalecustomers.Fresh
## 1 11348.429 68409.714
## 2 44137.333 26959.333
## 3 3168.640 15467.404
## 4 2018.498 4147.886
## 5 4374.122 32665.020
##
## Clustering vector:
## [1] 3 4 4 3 3 4 3 4 4 4 4 3 5 3 5 3 4 4 3 4 3 4 5 5 3 3 3 3 4 5 3 4 3 5 4
## [36] 4 5 3 4 1 5 3 3 4 4 4 4 5 3 4 4 4 5 4 5 4 4 4 3 4 4 5 4 4 4 4 4 3 4 4
## [71] 3 3 4 3 4 3 4 3 3 4 4 4 3 3 3 3 3 5 4 5 3 3 4 2 4 4 4 4 4 3 3 4 4 1 3
## [106] 3 4 4 4 4 3 3 3 3 3 3 3 4 3 4 3 4 3 3 5 1 3 3 4 5 4 4 3 4 4 4 4 4 3 4
## [141] 3 5 5 3 3 3 4 4 4 5 3 4 3 4 4 4 4 3 4 4 4 3 3 4 4 3 4 4 4 4 4 4 4 4 4
## [176] 4 5 3 3 4 3 1 4 2 4 4 4 4 4 4 3 3 4 4 4 3 5 4 3 4 4 4 5 4 4 4 4 4 4 4
## [211] 3 3 4 4 4 4 4 3 4 4 3 4 4 4 4 3 3 4 4 4 3 4 5 4 3 4 4 3 4 5 3 5 3 3 4
## [246] 4 4 3 3 4 4 4 4 5 3 5 3 4 1 1 4 4 3 4 4 4 4 3 3 3 4 4 4 5 4 4 5 3 3 3
## [281] 4 3 5 5 1 5 4 3 3 5 4 4 4 4 3 4 3 4 4 4 3 4 4 4 4 4 4 3 4 4 4 5 4 3 3
## [316] 4 4 4 3 4 4 4 3 3 5 2 4 4 3 4 3 3 3 4 3 5 3 3 4 4 4 4 4 4 4 4 4 5 4 4
## [351] 4 4 4 4 3 4 3 4 4 4 3 4 4 4 4 4 4 4 3 4 5 3 4 3 4 4 4 5 4 4 5 3 5 4 3
## [386] 3 4 3 4 4 4 4 4 3 3 4 4 3 3 4 4 5 5 5 3 4 5 4 4 4 4 4 4 4 4 4 4 4 4 4
## [421] 4 3 5 3 3 3 3 5 4 4 4 4 3 4 3 5 5 3 3 4
##
## Within cluster sum of squares by cluster:
## [1] 2863349761 796778949 3417004370 3682236400 3300677414
## (between_SS / total_SS = 82.5 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"
plotcluster(T1, KM3$cluster)
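To determine the most natural clusters, we would really need to know the objective or motive for clustering to begin with, because the definition of the most natural clusters changes with the overall business goal. However, when clustering on only two attributes, Frozen and Fresh, we achieve a much higher between_SS / total_SS ratio on the first run (82.5%, versus 66.3% for 5 clusters on all six attributes), which corresponds to a lower SSE relative to the total. The two figures are computed over different sets of attributes, so the comparison is only indicative. This could also help differentiate customers into, for example, those who buy mostly Frozen foods and those who buy mostly Fresh foods.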
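The comparison between the two 5-cluster runs can be reproduced from the printed output alone. The sketch below sums the per-cluster withinss values shown above for KM2 and KM3 and backs out an approximate total sum of squares from the rounded ratios; the variable names are illustrative, and the implied totals are only as precise as the one-decimal percentages.

```r
# Per-cluster within sums of squares copied from the KM2 and KM3 output above
withinss_all6 <- c(9394958498, 10804478229, 11008166107, 5682449098, 16226867469)
withinss_two  <- c(2863349761, 796778949, 3417004370, 3682236400, 3300677414)

# Total SSE for each run
sse_all6 <- sum(withinss_all6)
sse_two  <- sum(withinss_two)

# between_SS / total_SS = 1 - SSE / totss, so totss is roughly SSE / (1 - ratio)
totss_all6 <- sse_all6 / (1 - 0.663)  # should land near KM1$totss (same six columns)
totss_two  <- sse_two  / (1 - 0.825)  # total SS of Frozen and Fresh only

print(c(sse_all6 = sse_all6, sse_two = sse_two))
print(c(totss_all6 = totss_all6, totss_two = totss_two))
```

The two-attribute run starts from a smaller total sum of squares, which is why its raw SSE cannot be compared directly with the six-attribute run.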