Section 21: K-Means Clustering

Importing the Mall dataset

df <-  read.csv("G:\\RStudio\\udemy\\ml\\Machine Learning AZ\\Part 4 - Clustering\\Section 24 - K-Means Clustering\\K_Means\\Mall_Customers.csv")
head(df)

Selecting the fields to work with: annual income (column 4) and spending score (column 5)

df <- df[,4:5]
head(df)
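Selecting by position works only while the column order stays the same. As a hedged alternative, the same two fields can be selected by name. The names used below are the ones that appear in the cluster means output later in this section; the toy data frame, its first three column names, and all of its values are invented for illustration.

```r
# Toy stand-in for the Mall data. The values and the first three column
# names are hypothetical; only the last two names come from the notebook output.
toy <- data.frame(
  id    = 1:4,
  group = c("a", "b", "b", "a"),
  age   = c(19, 21, 20, 23),
  Annual.Income..k..     = c(15, 15, 16, 16),
  Spending.Score..1.100. = c(39, 81, 6, 77)
)

# Select the two clustering features by name instead of by position
features <- c("Annual.Income..k..", "Spending.Score..1.100.")
by_name <- toy[, features]

# Position-based selection, as used above
by_position <- toy[, 4:5]

# Both approaches return the same data frame for this column order
identical(by_name, by_position)
```

Selecting by name fails with an error if a column is missing, which is usually easier to notice than silently clustering the wrong columns.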

Use the Elbow method to find the optimal number of clusters

# Compute the within-cluster sum of squares (WCSS) for k = 1 to 10
set.seed(6)
wcss <- numeric(10)
for (i in 1:10) {
  wcss[i] <- sum(kmeans(df, i)$withinss)
}
plot(1:10, wcss, type = "b", main = "The Elbow Method",
     xlab = "Number of clusters", ylab = "WCSS")


Based on the elbow plot, the optimal number of clusters is 5: beyond that point, adding more clusters reduces the WCSS only slightly.
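To make the WCSS quantity concrete, here is a minimal sketch that computes it by hand and compares it with what kmeans() reports. It runs on synthetic data (the `toy` data frame and its values are made up), not on the Mall dataset, so the numbers it produces say nothing about the result above.

```r
set.seed(6)

# Synthetic two-column data with three loose groups (invented for illustration)
toy <- data.frame(
  x = c(rnorm(20, 0), rnorm(20, 5), rnorm(20, 10)),
  y = c(rnorm(20, 0), rnorm(20, 5), rnorm(20, 0))
)

fit <- kmeans(toy, centers = 3, nstart = 10)

# WCSS by hand: squared distance from each point to its assigned center
assigned_centers <- fit$centers[fit$cluster, ]
wcss_manual <- sum((as.matrix(toy) - assigned_centers)^2)

# kmeans() stores the same total in tot.withinss,
# which is also the sum of the per-cluster withinss values
all.equal(wcss_manual, fit$tot.withinss)
all.equal(sum(fit$withinss), fit$tot.withinss)

# WCSS for k = 1..6, the values an elbow plot would show
wcss <- sapply(1:6, function(k) kmeans(toy, centers = k, nstart = 10)$tot.withinss)
wcss
```

Using `tot.withinss` directly is equivalent to `sum(...$withinss)` in the loop above. Passing `nstart` in the elbow loop is an optional choice that makes each WCSS value less dependent on a single random start.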

Applying K-Means to the Mall dataset

set.seed(29)
# Fit 5 clusters; nstart = 10 tries 10 random starts and keeps the best fit
kmeans <- kmeans(df, 5, iter.max = 300, nstart = 10)
kmeans
K-means clustering with 5 clusters of sizes 22, 81, 23, 39, 35

Cluster means:
  Annual.Income..k.. Spending.Score..1.100.
1           25.72727               79.36364
2           55.29630               49.51852
3           26.30435               20.91304
4           86.53846               82.12821
5           88.20000               17.11429

Clustering vector:
  [1] 3 1 3 1 3 1 3 1 3 1 3 1 3 1 3 1 3 1 3 1 3 1 3 1 3 1 3 1 3 1 3 1 3 1 3 1 3 1 3 1 3 1 3 2 3 1 2 2 2 2 2 2 2
 [54] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[107] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 4 5 4 2 4 5 4 5 4 2 4 5 4 5 4 5 4 5 4 2 4 5 4 5 4 5 4 5 4 5 4 5 4 5 4 5
[160] 4 5 4 5 4 5 4 5 4 5 4 5 4 5 4 5 4 5 4 5 4 5 4 5 4 5 4 5 4 5 4 5 4 5 4 5 4 5 4 5 4

Within cluster sum of squares by cluster:
[1]  3519.455  9875.111  5098.696 13444.051 12511.143
 (between_SS / total_SS =  83.5 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss" "betweenss"    "size"        
[8] "iter"         "ifault"      
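The `between_SS / total_SS` line in the output can be reproduced from the listed components: it is `betweenss / totss`, and because the total sum of squares splits into a within-cluster and a between-cluster part, it also equals `1 - tot.withinss / totss`. In the output above, the cluster sizes add up to 200, the length of the clustering vector. The sketch below checks these relationships on synthetic data; `toy` and its values are invented, so its percentage will differ from the 83.5 % above.

```r
set.seed(29)

# Synthetic data with two well-separated groups (invented for illustration)
toy <- data.frame(
  x = c(rnorm(25, 0), rnorm(25, 6)),
  y = c(rnorm(25, 0), rnorm(25, 6))
)

fit <- kmeans(toy, centers = 2, iter.max = 300, nstart = 10)

# Total sum of squares = within-cluster part + between-cluster part
all.equal(fit$totss, fit$tot.withinss + fit$betweenss)

# Share of the total variation captured by the cluster assignment,
# the quantity printed as "between_SS / total_SS"
explained <- fit$betweenss / fit$totss
round(100 * explained, 1)

# Cluster sizes sum to the number of rows that were clustered
sum(fit$size) == nrow(toy)
```

A higher share means tighter, better-separated clusters, but it always grows with the number of clusters, so it is not a substitute for the elbow plot.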

Visualizing the Clusters

library(cluster)
clusplot(df, kmeans$cluster,
         lines = 0,
         shade = TRUE,
         color = TRUE,
         labels = 2,
         plotchar = FALSE,
         span = TRUE,
         main = "Clusters of clients",
         xlab = "Annual Income",
         ylab = "Spending Score")
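clusplot() draws the observations on principal components. Since only two variables are clustered here, a plain scatter plot on the original axes is a possible alternative. This is a minimal sketch using base graphics on synthetic data; the `toy` data frame, its column names `income` and `score`, and its values are assumptions made for illustration.

```r
set.seed(29)

# Synthetic income/score data with two groups (invented for illustration)
toy <- data.frame(
  income = c(rnorm(20, 25, 5), rnorm(20, 85, 5)),
  score  = c(rnorm(20, 80, 5), rnorm(20, 20, 5))
)

fit <- kmeans(toy, centers = 2, nstart = 10)

# Scatter plot on the original axes, one color per cluster
plot(toy$income, toy$score,
     col = fit$cluster, pch = 19,
     main = "Clusters of clients (toy data)",
     xlab = "Annual Income", ylab = "Spending Score")

# Mark the cluster centers with a larger star symbol
points(fit$centers, pch = 8, cex = 2)
```

On the real data the same calls would take the two selected columns and the fitted cluster vector in place of `toy` and `fit`.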
