7. K-Means in R (10 pts)

a. Load the data set wholesalecustomers.csv (available on the class website) and examine the data.

library(readr)
wholesalecustomers <- read_csv("~/Documents/School files/MS Program Spring 2016/Classes 2017/Fall 2017/CSE 891/HW/HW5/wholesalecustomers.csv")
## Parsed with column specification:
## cols(
##   Channel = col_integer(),
##   Region = col_integer(),
##   Fresh = col_integer(),
##   Milk = col_integer(),
##   Grocery = col_integer(),
##   Frozen = col_integer(),
##   Detergents_Paper = col_integer(),
##   Delicassen = col_integer()
## )

b. Remove the categorical attributes.

wholesalecustomers[1] = NULL   # drop Channel (the first column)
wholesalecustomers[1] = NULL   # drop Region (now the first column)
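
An equivalent, arguably safer way (run on the freshly loaded data) is to drop the two categorical columns by name rather than by position; a minimal alternative sketch:

wholesalecustomers <- wholesalecustomers[, setdiff(names(wholesalecustomers),
                                                   c("Channel", "Region"))]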

c. Run k-means clustering for k=3.

# kmeans() is part of the base stats package, so no extra library is needed
KM1 = kmeans(wholesalecustomers, 3)
KM1
## K-means clustering with 3 clusters of sizes 330, 50, 60
## 
## Cluster means:
##      Fresh      Milk   Grocery   Frozen Detergents_Paper Delicassen
## 1  8253.47  3824.603  5280.455 2572.661         1773.058   1137.497
## 2  8000.04 18511.420 27573.900 1996.680        12407.360   2252.020
## 3 35941.40  6044.450  6288.617 6713.967         1039.667   3049.467
## 
## Clustering vector:
##   [1] 1 1 1 1 3 1 1 1 1 2 1 1 3 1 3 1 1 1 1 1 1 1 3 2 3 1 1 1 2 3 1 1 1 3 1
##  [36] 1 3 1 2 3 3 1 1 2 1 2 2 2 1 2 1 1 3 1 3 1 2 1 1 1 1 2 1 1 1 2 1 1 1 1
##  [71] 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 2 2 3 1 3 1 1 2 1 1 1 1 1 1 1 1 1 1 3 1
## [106] 1 1 1 1 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 3 3 1 1 1 3 1 1 1 1 1 1 1 1 1 1
## [141] 1 3 3 1 1 2 1 1 1 3 1 1 1 1 1 2 1 1 1 1 1 1 1 2 1 2 1 1 1 1 1 2 1 2 1
## [176] 1 3 1 1 1 1 3 1 3 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1 2 2 3 1 1 2 1 1 1 2
## [211] 1 2 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1 1 1 1 3 3 3 1 1 1
## [246] 1 1 1 1 1 1 2 1 3 1 3 1 1 3 3 1 1 3 1 1 2 2 1 2 1 1 1 1 3 1 1 3 1 1 1
## [281] 1 1 3 3 3 3 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 2 1 1 2 1 2 1 1 2 1 3 2 1 1
## [316] 1 1 1 1 2 1 1 1 1 3 3 1 1 1 1 1 2 1 2 1 3 1 1 1 1 1 1 1 2 1 1 1 3 1 2
## [351] 1 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1 1 1 1 3 1 1 3 1 3 1 2
## [386] 1 1 1 1 1 1 1 1 3 1 1 1 1 1 1 1 3 3 3 1 1 3 2 1 1 1 1 1 1 1 1 1 1 2 1
## [421] 1 1 3 1 1 1 1 3 1 1 1 1 1 1 1 3 3 2 1 1
## 
## Within cluster sum of squares by cluster:
## [1] 28184318853 26382784678 25765310312
##  (between_SS / total_SS =  49.0 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"
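
Note that kmeans() starts from random centroids, so the cluster sizes and labels above can change from run to run. A sketch for making the result reproducible and less sensitive to the starting point (the seed value is arbitrary and was not used for the output above):

set.seed(42)                                   # arbitrary seed for reproducibility
KM1 <- kmeans(wholesalecustomers, centers = 3, nstart = 25)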

d. Examine the results: size of each cluster, cluster centers, SSE and separation.

KM1$centers
##      Fresh      Milk   Grocery   Frozen Detergents_Paper Delicassen
## 1  8253.47  3824.603  5280.455 2572.661         1773.058   1137.497
## 2  8000.04 18511.420 27573.900 1996.680        12407.360   2252.020
## 3 35941.40  6044.450  6288.617 6713.967         1039.667   3049.467
KM1$tot.withinss
## [1] 80332413843
KM1$betweenss
## [1] 77263443323
KM1$totss
## [1] 157595857166

When simply running k-means with 3 random centroids, the clusters come out very unevenly sized: 330, 50, and 60 points. Looking at the centroids, clusters 2 and 3 are alike in that both are small, high-spending groups (cluster 2 heavy in Milk, Grocery, and Detergents_Paper; cluster 3 heavy in Fresh), while cluster 1 absorbs the bulk of the low-spending customers. The overall fit is not that great: the withinss values are very large, indicating that the points in each cluster are far from their centroids. The betweenss, which measures how far the centroids are from the overall mean, is also large, which is generally what we want when distinguishing between clusters, but between_SS / total_SS is still only 49.0%.
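
These quantities can be pulled out individually from the fitted object; a short sketch of the components discussed above:

KM1$size                    # points per cluster: shows how uneven the sizes are
KM1$withinss                # SSE within each cluster (cohesion)
KM1$betweenss / KM1$totss   # share of variance between clusters (separation), ~0.49 here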

e. Run k-means for k between 2 and 50 and plot the SSE vs k. Based on the plot what is a good choice of k?

k <- 2:50
SSE.Runs <- numeric(length(k))            # total within-cluster SSE for each k
for (v in k) {
  KM <- kmeans(wholesalecustomers, v)
  SSE.Runs[v - 1] <- KM$tot.withinss      # k = 2 fills slot 1, k = 3 slot 2, ...
}
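
Because each call above uses a single random start, the resulting curve can be bumpy. A hypothetical variant (not what was run here) that takes the best of several starts per k would give a smoother, more monotone curve:

SSE.Smooth <- sapply(k, function(v) kmeans(wholesalecustomers, v, nstart = 25)$tot.withinss)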
library(Hmisc) 
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
## 
##     format.pval, round.POSIXt, trunc.POSIXt, units
plot(k, SSE.Runs, type = 'b')
minor.tick(nx=10)

There doesn't appear to be a very strong elbow, but it looks to be somewhere near k = 5.
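
One way to make the elbow judgment less subjective is to look at how much each additional cluster buys; a small sketch computed from the SSE.Runs vector above:

# percentage drop in SSE when moving from k to k+1; the elbow is
# roughly where these drops level off
round(100 * -diff(SSE.Runs) / head(SSE.Runs, -1), 1)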

f. Rerun k-means for the selected k. Interpret the results.

KM2 = kmeans(wholesalecustomers, 5)
KM2
## K-means clustering with 5 clusters of sizes 113, 227, 71, 5, 24
## 
## Cluster means:
##       Fresh      Milk   Grocery   Frozen Detergents_Paper Delicassen
## 1 20600.283  3787.832  5089.841 3989.071         1130.142   1639.071
## 2  5655.819  3567.793  4513.040 2386.529         1437.559   1005.031
## 3  5207.831 13191.028 20321.718 1674.028         9036.380   1937.944
## 4 25603.000 43460.600 61472.200 2636.000        29974.200   2708.800
## 5 48777.375  6607.375  6197.792 9462.792          932.125   4435.333
## 
## Clustering vector:
##   [1] 2 2 2 1 1 2 2 2 2 3 2 2 1 1 1 2 2 2 1 2 1 2 1 3 1 1 2 1 3 5 1 2 1 1 2
##  [36] 2 1 1 3 5 1 1 3 3 2 3 3 4 2 3 2 2 5 3 1 2 3 3 1 2 2 4 2 3 2 3 2 1 2 2
##  [71] 1 1 2 1 2 1 2 3 2 2 2 3 2 1 2 4 4 5 2 1 2 1 3 1 3 2 2 2 2 2 3 3 2 5 1
## [106] 1 2 3 2 3 2 3 1 1 1 2 2 2 1 2 1 2 2 2 5 5 1 1 2 5 2 2 1 2 2 2 2 2 1 2
## [141] 1 1 5 2 1 3 2 2 2 1 1 2 1 2 2 3 3 1 2 3 2 2 1 3 2 3 2 2 2 2 3 3 2 3 2
## [176] 2 5 2 2 2 2 5 2 5 2 2 2 2 2 3 1 1 2 3 2 1 1 2 2 2 3 3 1 2 2 3 2 2 2 3
## [211] 1 3 2 2 2 3 3 1 3 2 1 2 2 2 2 2 1 2 2 2 2 2 1 2 1 2 2 1 2 5 1 1 1 2 2
## [246] 3 2 1 1 2 2 3 2 1 2 1 2 2 5 5 2 2 1 2 3 3 3 1 3 1 2 2 2 5 2 2 1 2 2 1
## [281] 2 2 5 1 5 5 2 1 1 5 2 2 2 3 1 2 1 2 2 2 1 3 2 3 3 2 3 1 2 3 2 1 3 2 2
## [316] 3 2 2 2 3 2 2 1 1 1 5 2 2 1 2 2 3 1 4 1 1 1 2 2 2 2 2 2 3 2 2 3 1 2 3
## [351] 2 3 2 3 1 2 1 3 2 2 1 2 2 2 2 2 2 2 1 2 5 1 2 1 2 2 3 5 2 2 1 1 1 2 3
## [386] 2 2 1 2 2 2 2 2 1 2 2 2 2 2 2 2 1 1 1 1 2 1 3 2 2 2 2 2 2 2 2 3 2 3 2
## [421] 2 1 1 1 1 2 3 1 2 2 2 2 1 2 1 1 5 3 2 2
## 
## Within cluster sum of squares by cluster:
## [1]  9394958498 10804478229 11008166107  5682449098 16226867469
##  (between_SS / total_SS =  66.3 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"

As the number of clusters grows, the SSE will almost by definition improve, because fewer points are allocated to each centroid: the withinss measure shrinks as each point sits closer to its cluster center, leaving less error. However, a lower SSE does not always mean the groups are clustered best. With 5 clusters, three of them are small compared to the other two. Cluster 4 (only 5 customers) is extreme across the board, especially in Milk, Grocery, and Detergents_Paper; cluster 5 is very heavy in Fresh and Frozen; cluster 3 is heavy in Milk, Grocery, and Detergents_Paper; cluster 1 is heavy in Fresh; and cluster 2 is the large, low-spending group with relatively even spending across categories.
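
To see which categories dominate each centroid at a glance, one can normalize the centers by the overall column means; a minimal sketch:

# values well above 1 mark the categories a cluster over-indexes on
round(sweep(KM2$centers, 2, colMeans(wholesalecustomers), "/"), 2)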

g. Visualize the clusters obtained.

library(fpc)
# plotcluster() projects the data onto discriminant coordinates and colors points by cluster
plotcluster(wholesalecustomers, KM2$cluster)
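
A pairwise scatterplot colored by cluster is another quick way to see which attribute pairs the clusters separate on; a simple alternative sketch:

pairs(wholesalecustomers, col = KM2$cluster, pch = 20)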

h. Can you find a set of variables to base clustering on that produces more natural clusters than those obtained in the previous step?

# cluster on just two of the product categories: Frozen and Fresh
T1 = data.frame(wholesalecustomers$Frozen, wholesalecustomers$Fresh)
KM3 = kmeans(T1, 5)
KM3
## K-means clustering with 5 clusters of sizes 7, 3, 136, 245, 49
## 
## Cluster means:
##   wholesalecustomers.Frozen wholesalecustomers.Fresh
## 1                 11348.429                68409.714
## 2                 44137.333                26959.333
## 3                  3168.640                15467.404
## 4                  2018.498                 4147.886
## 5                  4374.122                32665.020
## 
## Clustering vector:
##   [1] 3 4 4 3 3 4 3 4 4 4 4 3 5 3 5 3 4 4 3 4 3 4 5 5 3 3 3 3 4 5 3 4 3 5 4
##  [36] 4 5 3 4 1 5 3 3 4 4 4 4 5 3 4 4 4 5 4 5 4 4 4 3 4 4 5 4 4 4 4 4 3 4 4
##  [71] 3 3 4 3 4 3 4 3 3 4 4 4 3 3 3 3 3 5 4 5 3 3 4 2 4 4 4 4 4 3 3 4 4 1 3
## [106] 3 4 4 4 4 3 3 3 3 3 3 3 4 3 4 3 4 3 3 5 1 3 3 4 5 4 4 3 4 4 4 4 4 3 4
## [141] 3 5 5 3 3 3 4 4 4 5 3 4 3 4 4 4 4 3 4 4 4 3 3 4 4 3 4 4 4 4 4 4 4 4 4
## [176] 4 5 3 3 4 3 1 4 2 4 4 4 4 4 4 3 3 4 4 4 3 5 4 3 4 4 4 5 4 4 4 4 4 4 4
## [211] 3 3 4 4 4 4 4 3 4 4 3 4 4 4 4 3 3 4 4 4 3 4 5 4 3 4 4 3 4 5 3 5 3 3 4
## [246] 4 4 3 3 4 4 4 4 5 3 5 3 4 1 1 4 4 3 4 4 4 4 3 3 3 4 4 4 5 4 4 5 3 3 3
## [281] 4 3 5 5 1 5 4 3 3 5 4 4 4 4 3 4 3 4 4 4 3 4 4 4 4 4 4 3 4 4 4 5 4 3 3
## [316] 4 4 4 3 4 4 4 3 3 5 2 4 4 3 4 3 3 3 4 3 5 3 3 4 4 4 4 4 4 4 4 4 5 4 4
## [351] 4 4 4 4 3 4 3 4 4 4 3 4 4 4 4 4 4 4 3 4 5 3 4 3 4 4 4 5 4 4 5 3 5 4 3
## [386] 3 4 3 4 4 4 4 4 3 3 4 4 3 3 4 4 5 5 5 3 4 5 4 4 4 4 4 4 4 4 4 4 4 4 4
## [421] 4 3 5 3 3 3 3 5 4 4 4 4 3 4 3 5 5 3 3 4
## 
## Within cluster sum of squares by cluster:
## [1] 2863349761  796778949 3417004370 3682236400 3300677414
##  (between_SS / total_SS =  82.5 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"
plotcluster(T1, KM3$cluster)

To determine the most natural clusters, we would really need to know the objective or motive for clustering to begin with, because the definition of the most natural clusters changes with the overall business goal. However, when clustering on attributes like Frozen and Fresh, we achieve a much higher between_SS / total_SS ratio (82.5%) on the first run. This could also help differentiate customers into, for example, those who buy mostly Frozen foods versus those who buy mostly Fresh foods.
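
One caveat worth noting: k-means is sensitive to scale, and Fresh has a much larger range than Frozen, so it dominates the Euclidean distances. A hypothetical follow-up (results not shown here) would standardize the two variables before clustering:

KM4 <- kmeans(scale(T1), 5)      # KM4 is a hypothetical name, not run above
KM4$betweenss / KM4$totss        # separation on the standardized data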