Cluster Analysis in R

Let me walk you quickly through some very basic clustering methods (my guess is that you’ll cover these in much more detail in the Customer Analytics course).

I’m going to use an inbuilt R dataset ‘mtcars’ for this purpose. So, here goes …

mydata = mtcars  # inbuilt R dataset.
head(mydata)  # view top few rows of dataset
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

K-means, one of the most popular clustering methods, requires you to specify in advance the number of clusters to extract.

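We can use a fit criterion to let the data guide us on the optimal number of clusters: compute the within-groups sum of squares for a range of cluster counts and see where it stops dropping sharply.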

# Determine number of clusters
set.seed(0)   # set seed for reproducible work
wss <- (nrow(mydata)-1)*sum(apply(mydata,2,var))  # wss[1]: total sum of squares, i.e. within-group SS with a single cluster

for (i in 2:15) wss[i] <- sum(      # checking model fit for 2 to 15 clusters
                            kmeans(mydata,  centers = i)$withinss)  # note use of kmeans() func

plot(1:15, wss, type="b", 
     xlab="Number of Clusters",
     ylab="Within groups sum of squares")
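One caveat on the loop above: kmeans() starts from randomly chosen centers, so a single run per cluster count can land in a poor local optimum and make the curve bumpy. Here is a minimal sketch of the same elbow computation with several random starts per value of k. The names `x`, `max_k` and `wss_multi`, and the choice of `nstart = 25`, are my own assumptions for illustration and not part of the code above.

```r
# Sketch: elbow curve using multiple random starts per cluster count
set.seed(0)                       # reproducible random starts
x <- mtcars                       # original data, numeric columns only
max_k <- 15                       # largest number of clusters to try

# For each k, keep the best of 25 random starts and record the
# total within-cluster sum of squares reported by kmeans()
wss_multi <- sapply(1:max_k, function(k) {
  kmeans(x, centers = k, nstart = 25)$tot.withinss
})

plot(1:max_k, wss_multi, type = "b",
     xlab = "Number of Clusters",
     ylab = "Within groups sum of squares")
```

The `tot.withinss` component is simply the sum of the per-cluster `withinss` values, so the quantity plotted is the same as before; only the number of starts differs.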

From the plot we see that the sharpest ‘elbow bend’ is at 2 clusters, so we take the optimal number of clusters to be 2.

So, in what follows, I outline a 2-cluster solution.

# K-Means Cluster Analysis
fit <- kmeans(mydata, 2) # 2 cluster solution

# get cluster means 
aggregate(mydata, by = list(fit$cluster), FUN = mean) # using aggregate() func to characterize cluster means
##   Group.1      mpg      cyl     disp        hp     drat       wt     qsec
## 1       1 15.10000 8.000000 353.1000 209.21429 3.229286 3.999214 16.77214
## 2       2 23.97222 4.777778 135.5389  98.05556 3.882222 2.609056 18.68611
##          vs        am     gear     carb
## 1 0.0000000 0.1428571 3.285714 3.500000
## 2 0.7777778 0.6111111 4.000000 2.277778
# append cluster assignment
mydata <- data.frame(mydata, fit$cluster)  # put cluster number as identifier in a separate column
head(mydata)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
##                   fit.cluster
## Mazda RX4                   2
## Mazda RX4 Wag               2
## Datsun 710                  2
## Hornet 4 Drive              2
## Hornet Sportabout           1
## Valiant                     2
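A side note on scale: the mtcars variables are measured in very different units (disp runs into the hundreds, wt stays in single digits), so Euclidean k-means on the raw data is driven mostly by the large-variance columns. Below is a minimal sketch of the same 2-cluster fit on standardized variables. This is not what was done above, so the cluster assignments need not match; the names `scaled` and `fit_scaled` and the `nstart = 25` setting are assumptions for illustration.

```r
# Sketch: k-means on standardized variables
set.seed(0)
scaled <- scale(mtcars)           # each column centered to mean 0, scaled to sd 1

fit_scaled <- kmeans(scaled, centers = 2, nstart = 25)

# characterize the clusters on the original measurement scale
aggregate(mtcars, by = list(cluster = fit_scaled$cluster), FUN = mean)

# share of total sum of squares accounted for by the 2-cluster partition
fit_scaled$betweenss / fit_scaled$totss
```

Whether to standardize is a judgment call: do it when no variable should count for more just because of its units.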

There is a wide range of hierarchical clustering approaches; Ward’s hierarchical method is shown below.

# Ward Hierarchical Clustering
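d <- dist(mtcars, method = "euclidean") # distance matrix on the original variables (mydata now carries the k-means cluster column)
fit <- hclust(d, method = "ward.D") # note: "ward.D2" is the option that implements Ward's criterion on unsquared distances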

plot(fit) # display dendrogram
groups <- cutree(fit, k = 2) # cut tree into 2 clusters

# draw dendrogram with red borders around the 2 clusters 
rect.hclust(fit, k = 2, border ="red")
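A natural follow-up is to check how far the two methods agree. Here is a minimal sketch that cross-tabulates a 2-cluster k-means solution against the 2-group cut of a Ward tree, both computed on the original mtcars columns. The names `km`, `hc`, `hc_groups` and `agreement`, and `nstart = 25`, are assumptions for illustration.

```r
# Sketch: compare k-means and hierarchical cluster assignments
set.seed(0)
km <- kmeans(mtcars, centers = 2, nstart = 25)

hc <- hclust(dist(mtcars, method = "euclidean"), method = "ward.D")
hc_groups <- cutree(hc, k = 2)    # integer group label for each car

# rows: k-means clusters, columns: Ward groups
agreement <- table(kmeans = km$cluster, ward = hc_groups)
print(agreement)
```

Cluster labels are arbitrary, so cluster 1 in one method need not be cluster 1 in the other. If the methods agree, most cars fall into just two cells of the table, one per row and one per column.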

There’s a lot more to clustering, but I’ll stop here for now because this is all I need for what I want to cover next.

Sudhir