Let me walk you through a quick look at some very basic clustering methods (my guess is you’ll cover these in much more detail in the Customer Analytics course).
I’m going to use the built-in R dataset ‘mtcars’ for this purpose. So, here goes …
mydata <- mtcars # built-in R dataset
head(mydata) # view top few rows of dataset
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
K-means, the most popular clustering method, requires you to specify the number of clusters to extract in advance.
We can use a fit criterion to let the data guide us toward the optimal number of clusters.
# Determine number of clusters
set.seed(0) # set seed for reproducible results
wss <- (nrow(mydata)-1)*sum(apply(mydata, 2, var)) # wss[1] = total SS, i.e. the within-group SS for a single cluster
for (i in 2:15) wss[i] <- sum( # checking model fit for 2 to 15 clusters
  kmeans(mydata, centers = i)$withinss) # note use of the kmeans() func
plot(1:15, wss, type = "b",
     xlab = "Number of Clusters",
     ylab = "Within groups sum of squares")
From the plot we see that the optimal number of clusters is 2 (look for the sharpest ‘elbow’ bend).
So in what follows below, I outline a 2-cluster solution.
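(If you’d rather not eyeball the plot, here’s a rough heuristic of my own, not part of any standard recipe: the elbow sits where the decrease in wss slows down most sharply, i.e. where the second difference of the wss curve peaks.)
elbow <- which.max(diff(diff(wss))) + 1 # +1 because diff() shortens the vector by one
elbow # should come out as 2 here if the bend at k = 2 is indeed the sharpest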
# K-Means Cluster Analysis
fit <- kmeans(mydata, 2) # 2 cluster solution
# get cluster means
aggregate(mydata, by = list(fit$cluster), FUN = mean) # using aggregate() func to characterize cluster means
## Group.1 mpg cyl disp hp drat wt qsec
## 1 1 15.10000 8.000000 353.1000 209.21429 3.229286 3.999214 16.77214
## 2 2 23.97222 4.777778 135.5389 98.05556 3.882222 2.609056 18.68611
## vs am gear carb
## 1 0.0000000 0.1428571 3.285714 3.500000
## 2 0.7777778 0.6111111 4.000000 2.277778
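The means are only part of the story; the object returned by kmeans() carries a few other handy summaries (see ?kmeans for the full list of components):
fit$size # number of cars in each cluster
fit$betweenss / fit$totss # share of total variance explained by the 2-cluster split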
# append cluster assignment
mydata <- data.frame(mydata, fit$cluster) # put cluster number as identifier in a separate column
head(mydata)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
## fit.cluster
## Mazda RX4 2
## Mazda RX4 Wag 2
## Datsun 710 2
## Hornet 4 Drive 2
## Hornet Sportabout 1
## Valiant 2
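A quick visual sanity check I like to do (my addition, not part of the recipe above): plot two of the variables and colour the points by cluster membership.
# colour points by cluster assignment (1 = black, 2 = red in the base R palette)
plot(mydata$hp, mydata$mpg,
     col = mydata$fit.cluster, pch = 19,
     xlab = "Horsepower (hp)", ylab = "Miles per gallon (mpg)")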
There is a wide range of hierarchical clustering approaches, e.g., Ward’s method shown below.
# Ward Hierarchical Clustering
d <- dist(mydata[, names(mtcars)], method = "euclidean") # distance matrix (dropping the appended fit.cluster column)
fit <- hclust(d, method = "ward.D")
plot(fit) # display dendrogram
groups <- cutree(fit, k = 2) # cut tree into 2 clusters
# draw dendrogram with red borders around the 2 clusters
rect.hclust(fit, k = 2, border = "red")
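As one last check of my own (a sketch, assuming mydata still carries the fit.cluster column appended earlier), we can cross-tabulate the hierarchical groups against the k-means labels to see how much the two methods agree:
table(hclust = groups, kmeans = mydata$fit.cluster) # rows = Ward groups, cols = k-means clusters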
There’s a lot more to clustering, but I’ll stop here for now since this is all I need to set up what I want to cover next.