K Means Intuition Lecture 137 https://www.udemy.com/machinelearning/learn/lecture/5714416
Lecture 138 https://www.udemy.com/machinelearning/learn/lecture/5714420
This lecture covers the random initialization trap: our selection of the initial centroids can change the clusters we end up with. The standard solution is K-Means++, a seeding scheme that spreads the initial centroids apart. Some libraries run it in the background by default (scikit-learn’s KMeans, for example). Base R’s kmeans() instead picks random rows as the starting centers, so there we reduce the risk with the nstart argument, which tries several random starts and keeps the best one. We then use the WCSS of these fits to decide on the best number of clusters.
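To make the K-Means++ idea concrete, here is a minimal sketch of the seeding step in base R. The helper name kmeanspp_centers and the toy data are made up for illustration; this is not a function from base R or from the course. It relies only on the fact that kmeans() accepts a matrix of starting centers through its centers argument.

```r
# Hypothetical helper, not part of base R: pick k starting centers
# from the rows of x using the k-means++ rule.
kmeanspp_centers <- function(x, k) {
  x <- as.matrix(x)
  n <- nrow(x)
  idx <- integer(k)
  # first center: one row chosen uniformly at random
  idx[1] <- sample.int(n, 1)
  # squared distance from every row to its nearest chosen center
  d2 <- rowSums(sweep(x, 2, x[idx[1], ])^2)
  for (j in seq_len(k - 1) + 1) {
    # next center: sampled with probability proportional to d2
    idx[j] <- sample.int(n, 1, prob = d2)
    d2 <- pmin(d2, rowSums(sweep(x, 2, x[idx[j], ])^2))
  }
  x[idx, , drop = FALSE]
}

# Toy data (assumption): three well separated blobs in two dimensions
set.seed(1)
toy <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 5), ncol = 2),
             matrix(rnorm(40, mean = 10), ncol = 2))

start_centers <- kmeanspp_centers(toy, 3)
fit_pp <- kmeans(toy, centers = start_centers)
```

Because a row that is already a center has distance zero, it cannot be drawn again, so the starting centers are distinct, which kmeans() requires.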
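K Means Selecting the number of clusters Lecture 139 https://www.udemy.com/machinelearning/learn/lecture/5714426
The Within Cluster Sum of Squares (WCSS) is the answer: for each point, take the squared distance to the centroid of its own cluster, then add these up over all clusters. The method we’ll use is the Elbow method ;) We calculate the WCSS for a range of cluster counts and pick the count where adding more clusters stops lowering it by much.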
knitr::include_graphics("/Users/markloessi/Machine_Learning/TheElbowMethodClusteringNums.png")
The Elbow Method for choosing the number of clusters
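As a rough check on what WCSS means, the sketch below computes it by hand on made-up toy data and compares it with the values kmeans() reports. The data and variable names are assumptions for illustration only.

```r
# Toy data (assumption): two blobs in two dimensions
set.seed(42)
toy <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
             matrix(rnorm(60, mean = 4), ncol = 2))
fit <- kmeans(toy, centers = 2, nstart = 5)

# WCSS by hand: squared distance of each point to the centroid of its
# own cluster, summed over all clusters
wcss_manual <- sum(sapply(seq_len(nrow(fit$centers)), function(k) {
  pts <- toy[fit$cluster == k, , drop = FALSE]
  sum(sweep(pts, 2, fit$centers[k, ])^2)
}))

# Both comparisons should agree up to floating point error
all.equal(wcss_manual, fit$tot.withinss)
all.equal(wcss_manual, sum(fit$withinss))
```

The second comparison shows that summing the per-cluster withinss values gives the same number as tot.withinss.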
K Means Clustering R
Lecture 142 https://www.udemy.com/machinelearning/learn/lecture/5685594
# Why Clustering
Clustering is similar to classification, but the basis is different: classification learns from labeled examples, while clustering has no labels to learn from. In clustering you don’t know what you are looking for, and you are trying to identify segments or clusters in your data. When you run clustering algorithms on your dataset, unexpected things can pop up, such as structures, clusters and groupings you would never have thought of otherwise.
# Import data
dataset = read.csv('Mall_Customers.csv')
Quick look
summary(dataset)
##    CustomerID        Genre          Age        Annual.Income..k..
##  Min.   :  1.00   Female:112   Min.   :18.00   Min.   : 15.00
##  1st Qu.: 50.75   Male  : 88   1st Qu.:28.75   1st Qu.: 41.50
##  Median :100.50                Median :36.00   Median : 61.50
##  Mean   :100.50                Mean   :38.85   Mean   : 60.56
##  3rd Qu.:150.25                3rd Qu.:49.00   3rd Qu.: 78.00
##  Max.   :200.00                Max.   :70.00   Max.   :137.00
##  Spending.Score..1.100.
##  Min.   : 1.00
##  1st Qu.:34.75
##  Median :50.00
##  Mean   :50.20
##  3rd Qu.:73.00
##  Max.   :99.00
Another look
head(dataset)
##   CustomerID  Genre Age Annual.Income..k.. Spending.Score..1.100.
## 1          1   Male  19                 15                     39
## 2          2   Male  21                 15                     81
## 3          3 Female  20                 16                      6
## 4          4 Female  23                 16                     77
## 5          5 Female  31                 17                     40
## 6          6 Female  22                 17                     76
Now we keep only the two columns we want to cluster on: annual income and spending score.
dataset = dataset[4:5]
Quick look
summary(dataset)
##  Annual.Income..k.. Spending.Score..1.100.
##  Min.   : 15.00     Min.   : 1.00
##  1st Qu.: 41.50     1st Qu.:34.75
##  Median : 61.50     Median :50.00
##  Mean   : 60.56     Mean   :50.20
##  3rd Qu.: 78.00     3rd Qu.:73.00
##  Max.   :137.00     Max.   :99.00
Another look
head(dataset)
##   Annual.Income..k.. Spending.Score..1.100.
## 1                 15                     39
## 2                 15                     81
## 3                 16                      6
## 4                 16                     77
## 5                 17                     40
## 6                 17                     76
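Selecting columns by position works, but it breaks silently if the column order in the CSV ever changes. A small sketch of selecting by name instead follows. The data frame customers is a stand-in built from the six rows printed by the first head(dataset) call, so the example runs without the CSV file; the variable names are assumptions.

```r
# Stand-in for the real dataset: the six rows shown by head(dataset)
customers <- data.frame(
  CustomerID = 1:6,
  Genre = c("Male", "Male", "Female", "Female", "Female", "Female"),
  Age = c(19, 21, 20, 23, 31, 22),
  Annual.Income..k.. = c(15, 15, 16, 16, 17, 17),
  Spending.Score..1.100. = c(39, 81, 6, 77, 40, 76)
)

# Select the two clustering features by name rather than by position
features <- customers[, c("Annual.Income..k..", "Spending.Score..1.100.")]

# Same columns as the positional version used in these notes
all.equal(features, customers[4:5])
```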
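Splitting the dataset into the Training set and Test set - won’t be done for KMeans, since there is no label to predict and test against.
Feature Scaling - won’t be done here, since both columns are on similar scales (income runs from 15 to 137, spending score from 1 to 99). In general K-Means is distance based, so features on very different scales should be scaled first.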
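For data where the features are on very different scales, one common option is to standardize each column with scale() before clustering. The sketch below uses invented numbers (income in raw dollars next to a 1 to 100 score) purely for illustration. It is not a step these notes apply to the mall dataset.

```r
# Invented data: income in dollars dwarfs the 1-100 score numerically
toy <- data.frame(income_usd = c(15000, 16000, 17000, 60000, 78000, 137000),
                  score = c(39, 81, 6, 50, 73, 99))

# scale() centers each column to mean 0 and rescales it to sd 1
toy_scaled <- scale(toy)

colMeans(toy_scaled)      # close to 0 for every column
apply(toy_scaled, 2, sd)  # 1 for every column

# Cluster on the standardized features so both contribute comparably
set.seed(6)
fit_scaled <- kmeans(toy_scaled, centers = 2, nstart = 10)
```

Without scaling, the distance between two customers here would be driven almost entirely by income.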
The elbow method via determining the Within Clusters Sum of Squares
set.seed(6)
# make an empty vector that the loop below fills in
wcss = vector()
# compute the total WCSS for 1 to 10 clusters
for (i in 1:10) wcss[i] <- sum(kmeans(dataset, i)$withinss)
plot(1:10,
     wcss,
     type = 'b', # 'b' draws both lines and points
     main = 'The Elbow Method',
     xlab = 'Number of clusters',
     ylab = 'WCSS')
Let’s interpret: we look for the “elbow”, the point after which adding another cluster barely lowers the WCSS.
knitr::include_graphics("R_KMeans_ElbowMethod.png")
The Elbow Method plot for our dataset
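A single random start per cluster count can make the elbow curve a little noisy. One possible variation, sketched here on made-up toy data, wraps the loop in a helper that keeps the best of several starts for each count. The function name elbow_wcss and its defaults are assumptions for illustration.

```r
# Hypothetical helper: total WCSS for k = 1..max_k, keeping the best of
# `nstart` random starts for each k
elbow_wcss <- function(x, max_k = 10, nstart = 10) {
  sapply(seq_len(max_k), function(k) {
    kmeans(x, centers = k, nstart = nstart, iter.max = 50)$tot.withinss
  })
}

# Toy data (assumption): three blobs, so the elbow should sit near 3
set.seed(6)
toy <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
             matrix(rnorm(100, mean = 6), ncol = 2),
             matrix(rnorm(100, mean = 12), ncol = 2))

wcss_toy <- elbow_wcss(toy)
plot(seq_along(wcss_toy), wcss_toy,
     type = 'b',
     main = 'The Elbow Method',
     xlab = 'Number of clusters',
     ylab = 'WCSS')
```

Here tot.withinss is used directly; it is the same quantity as summing withinss.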
Based on the elbow plot, we set the number of clusters to 5 and fit the model.
set.seed(29)
# store the fit under its own name so it does not mask the kmeans() function
kmeans_fit = kmeans(x = dataset, centers = 5, iter.max = 300, nstart = 10)
y_kmeans = kmeans_fit$cluster
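Besides the cluster labels, the object returned by kmeans() carries a few other fields worth a look. The sketch below runs on made-up toy data with five blobs, so it does not depend on the mall dataset; the variable names are assumptions.

```r
# Toy data (assumption): five blobs along the diagonal
set.seed(29)
blob_means <- c(0, 5, 10, 15, 20)
toy <- do.call(rbind, lapply(blob_means, function(m) {
  matrix(rnorm(40, mean = m), ncol = 2)
}))

fit <- kmeans(x = toy, centers = 5, iter.max = 300, nstart = 10)

fit$centers          # one row per cluster: the centroid coordinates
fit$size             # number of points in each cluster
table(fit$cluster)   # same counts, derived from the labels
# share of the total sum of squares explained by the clustering
fit$betweenss / fit$totss
```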
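This plotting code only makes sense for clustering on 2 features, as here. clusplot() draws the points on two principal component axes, so with more than two features the income and spending labels below would no longer apply.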
library(cluster)
clusplot(dataset,
         y_kmeans,
         lines = 0,
         shade = TRUE,
         color = TRUE,
         labels = 2,
         plotchar = FALSE,
         span = TRUE,
         main = 'Clusters of customers',
         xlab = 'Annual Income',
         ylab = 'Spending Score')
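If you would rather see the clusters on the original feature axes, a plain scatter plot colored by cluster is one alternative to clusplot(). This is a sketch on made-up toy data using only base graphics; the column names income and score are placeholders, not the real dataset columns.

```r
# Toy data (assumption): two blobs with placeholder column names
set.seed(29)
toy <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
             matrix(rnorm(60, mean = 6), ncol = 2))
colnames(toy) <- c('income', 'score')

fit <- kmeans(toy, centers = 2, nstart = 10)

# Points colored by cluster label, on the raw feature axes
plot(toy,
     col = fit$cluster,
     pch = 19,
     main = 'Clusters on the original axes',
     xlab = 'income',
     ylab = 'score')
# Mark each centroid with a star
points(fit$centers, pch = 8, cex = 2)
```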
=========================
Github files; https://github.com/ghettocounselor
Useful PDF for common questions in Lectures;
https://github.com/ghettocounselor/Machine_Learning/blob/master/Machine-Learning-A-Z-Q-A.pdf