K Means Clustering

K Means Intuition

Lecture 137 https://www.udemy.com/machinelearning/learn/lecture/5714416

https://en.wikipedia.org/wiki/K-means_clustering

Understanding K Means Random Initialization Trap

Lecture 138 https://www.udemy.com/machinelearning/learn/lecture/5714420 This references the idea that our selection of the initial centroids can impact the final clusters we end up with. The solution is K-Means++, an improved initialization scheme that, thankfully, is built into most of the k-means implementations available, so we don't have to write it ourselves. With initialization handled, we run k-means for a range of cluster counts and compute the WCSS for each, which we then interpret to choose the best number of clusters.
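To see the trap and the fix concretely, here is a minimal sketch on toy data. One caveat: base R's kmeans() does not implement K-Means++ itself; its nstart argument instead re-runs the random initialization several times and keeps the fit with the lowest total WCSS, which guards against the same trap.

set.seed(1)
# toy data: three well-separated blobs in two dimensions
toy = rbind(matrix(rnorm(100, mean = 0),  ncol = 2),
            matrix(rnorm(100, mean = 5),  ncol = 2),
            matrix(rnorm(100, mean = 10), ncol = 2))
single = kmeans(toy, centers = 3, nstart = 1)   # one random start: can land on a bad local optimum
multi  = kmeans(toy, centers = 3, nstart = 25)  # best of 25 random starts
single$tot.withinss  # may be much larger if the initialization was unlucky
multi$tot.withinss   # reliably near the optimum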

WCSS - Within Cluster Sum of Squares

K Means Selecting the number of clusters

Lecture 139 https://www.udemy.com/machinelearning/learn/lecture/5714426

How do we choose the number of clusters? The within-cluster sum of squares is the answer. The method we’ll use is the Elbow method ;) we calculate the Within Cluster Sum of Squares for each candidate number of clusters and look for the bend in the curve.
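For reference, the quantity the elbow method tracks can be written as

WCSS = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2

where C_k is the set of points assigned to cluster k and \mu_k is that cluster's centroid. WCSS always shrinks as K grows (with one cluster per point it hits zero), so we look for the elbow where adding another cluster stops paying off, rather than for the minimum.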

knitr::include_graphics("/Users/markloessi/Machine_Learning/TheElbowMethodClusteringNums.png")
(Figure: the Elbow Method for choosing the number of clusters.)

K Means Clustering R

Lecture 142 https://www.udemy.com/machinelearning/learn/lecture/5685594

Why Clustering

Clustering is similar to classification, but the basis is different. In clustering you don’t know in advance what you are looking for; you are trying to identify segments or clusters in your data. When you run clustering algorithms on your dataset, unexpected things can pop up: structures, clusters, and groupings you would never have thought of otherwise.

Import data

dataset = read.csv('Mall_Customers.csv')

Quick look

summary(dataset)
##    CustomerID        Genre          Age        Annual.Income..k..
##  Min.   :  1.00   Female:112   Min.   :18.00   Min.   : 15.00    
##  1st Qu.: 50.75   Male  : 88   1st Qu.:28.75   1st Qu.: 41.50    
##  Median :100.50                Median :36.00   Median : 61.50    
##  Mean   :100.50                Mean   :38.85   Mean   : 60.56    
##  3rd Qu.:150.25                3rd Qu.:49.00   3rd Qu.: 78.00    
##  Max.   :200.00                Max.   :70.00   Max.   :137.00    
##  Spending.Score..1.100.
##  Min.   : 1.00         
##  1st Qu.:34.75         
##  Median :50.00         
##  Mean   :50.20         
##  3rd Qu.:73.00         
##  Max.   :99.00

Another look

head(dataset)
##   CustomerID  Genre Age Annual.Income..k.. Spending.Score..1.100.
## 1          1   Male  19                 15                     39
## 2          2   Male  21                 15                     81
## 3          3 Female  20                 16                      6
## 4          4 Female  23                 16                     77
## 5          5 Female  31                 17                     40
## 6          6 Female  22                 17                     76

Build array

Now we subset the dataset down to the two columns we want to cluster on: Annual Income (column 4) and Spending Score (column 5).

dataset = dataset[4:5]

Quick look

summary(dataset)
##  Annual.Income..k.. Spending.Score..1.100.
##  Min.   : 15.00     Min.   : 1.00         
##  1st Qu.: 41.50     1st Qu.:34.75         
##  Median : 61.50     Median :50.00         
##  Mean   : 60.56     Mean   :50.20         
##  3rd Qu.: 78.00     3rd Qu.:73.00         
##  Max.   :137.00     Max.   :99.00

Another look

head(dataset)
##   Annual.Income..k.. Spending.Score..1.100.
## 1                 15                     39
## 2                 15                     81
## 3                 16                      6
## 4                 16                     77
## 5                 17                     40
## 6                 17                     76

Splitting the dataset into the Training set and Test set - won’t be done for KMeans, since clustering is unsupervised and there is no outcome to validate against

Feature Scaling - won’t be done for KMeans
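A note on why scaling is skipped: k-means is distance-based, so a feature on a much larger scale would dominate the distance calculation. Here Annual Income (15 to 137) and Spending Score (1 to 99) are on comparable scales, so we can get away without it. If scaling were needed, a minimal sketch with base R's scale() (not applied in this walkthrough):

# standardize each column to mean 0, standard deviation 1
dataset_scaled = scale(dataset)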

Using the elbow method to find the ‘optimal’ number of clusters

The elbow method works by computing the Within-Cluster Sum of Squares (WCSS) across a range of cluster counts

set.seed(6)
# make an empty vector we'll populate via our loop
wcss = vector()
# run k-means for each candidate number of clusters (1 through 10) and record the WCSS
for (i in 1:10) wcss[i] = sum(kmeans(dataset, i)$withinss)
plot(1:10,
     wcss,
     type = 'b', # for lines and points
     main = paste('The Elbow Method'),
     xlab = 'Number of clusters',
     ylab = 'WCSS')

Let’s interpret the plot:

knitr::include_graphics("R_KMeans_ElbowMethod.png")
(Figure: the elbow plot for the mall customers data; the bend suggests 5 clusters.)

Fitting K-Means to the dataset

And setting the number of clusters to 5, based on the elbow in our analysis above

set.seed(29)
# fit k-means with 5 centers; nstart = 10 re-runs the initialization 10 times and keeps the best fit
kmeans = kmeans(x = dataset, centers = 5, iter.max = 300, nstart = 10)
# vector of cluster assignments (1 through 5), one per customer
y_kmeans = kmeans$cluster
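It's worth poking at the fitted object before plotting; a quick look at a few of the components kmeans() returns (output omitted):

kmeans$centers       # 5 x 2 matrix of cluster centroids (income, spending score)
kmeans$size          # how many customers landed in each cluster
kmeans$tot.withinss  # total WCSS for this fit, the quantity the elbow method tracks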

Visualising the clusters

This code only applies to two-dimensional clustering.

library(cluster)
clusplot(dataset,
         y_kmeans,
         lines = 0,
         shade = TRUE,
         color = TRUE,
         labels = 2,
         plotchar = FALSE,
         span = TRUE,
         main = paste('Clusters of customers'),
         xlab = 'Annual Income',
         ylab = 'Spending Score')
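As a cross-check, the same picture can be drawn with base graphics alone; a minimal sketch with points colored by cluster and centroids marked as crosses (no ellipses or labels, unlike clusplot):

plot(dataset$Annual.Income..k..,
     dataset$Spending.Score..1.100.,
     col = y_kmeans,
     pch = 19,
     main = 'Clusters of customers',
     xlab = 'Annual Income',
     ylab = 'Spending Score')
points(kmeans$centers, col = 1:5, pch = 4, cex = 2, lwd = 2)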

=========================
GitHub files: https://github.com/ghettocounselor

Useful PDF for common questions in the lectures:
https://github.com/ghettocounselor/Machine_Learning/blob/master/Machine-Learning-A-Z-Q-A.pdf