Clustering Methods in R

Install the packages we will need first; we will load each one with library() as we go.

toInstall <- c("cluster", "fpc", "mclust")
install.packages(toInstall, dependencies=TRUE)

Data Preparation

# Today, we'll use 2005 data on crime, in 50 US states & DC (Crime2005.rda). 
# For variable names, see "Crime.txt", in the "Documents, Data, and Class Materials" page.
Crime <- read.csv(file.choose(), header=TRUE)
rownames(Crime) <- Crime[,1] # Turn state names into observation IDs

Remember this part. We will revisit the original data at the end of each clustering adventure.

Now, copy the “Crime” data into a new object called “mydata” so that we don’t confuse the two. mydata will be our working data; we’ll come back to the data set titled “Crime” later.

mydata <- Crime[,3:8] # subset data set to just include the data you want to cluster
mydata <- na.omit(mydata) # listwise deletion of missing
mydata <- scale(mydata) # standardize variables
# Data.scale  <- attr(mydata, "scaled:scale")  # optional: save the scaling factors for later use
# Data.center <- attr(mydata, "scaled:center") # optional: save the centering values for later use

That is enough data preparation.

Hierarchical clustering

Hierarchical clustering can be a fast and easy way to assess the clusters in your data. Reading the dendrogram that we usually plot to display the hierarchy does get a little annoying when the number of observations is really large.

The thing to keep in mind is that there are a number of different approaches to calculating distances, and the right one depends on the type of data being used. Here is a small selection of what is available in R.

Distance “method” options:

Measure method="" Best used for… Calculation
Euclidean distance euclidean continuous data edist(x, y) <- \(\sqrt{((x_1 -y_1)^2 + (x_2 - y_2)^2 + ...)%}\)
Hamming distance hamming categorical data hdist(x, y) <- \(\sum{((x_1 \neq y_1) + (x_2 \neq y_2) + ...)}\)
Manhattan distance manhattan city block distance between the two vectors mdist(x, y) <- \(\sum{(|(x_1 -y_1)| + |(x_2 - y_2)| + ...)}\)
Canberra distance canberra non-negative values (e.g., counts)) cdist(x, y ) <- \(\sum( \frac {|x-y|}{|x+y|})\)
Jaccard asymmetric binary distance binary binary data (Using the table below) j.dist(x,y) <- \(\frac{b + c}{b + c + d}\)

Note: the function “vegdist” in the package “vegan” offers 12 more distance options. You can also use a correlation matrix.


d <- dist(mydata, method="euclidean") # First, construct a distance matrix.
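
The same call works for the other measures in the table; only the method argument changes. A brief sketch (the object name d_manhattan is just illustrative):

d_manhattan <- dist(mydata, method="manhattan") # city-block distances on the standardized data

The canberra and binary options are requested the same way, but they are intended for non-negative counts and 0/1 data respectively, so they are not a natural fit for these standardized crime rates.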

Additional option: Mixed data types

If you have mixed categorical and numeric data, then you can still create a distance matrix. For those situations, we use “Gower scaling.”

Gower scaling calculates the differences between observations on a 0 to 1.0 scale using a variety of comparison methods, as appropriate. Qualitative, quantitative and mixed differences are calculated on the same scale, resulting in a dissimilarity matrix.

library(cluster)
gower_d <- daisy(mydata, metric = "gower")

You will note that we do not use the Gower metric in the functions below, because all of the variables in this analysis are continuous. If you do have mixed data, substitute gower_d (from the calculation above) for d in the hclust function below, as sketched here.
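
For instance, here is a minimal sketch of that substitution (the object name fit_gower is illustrative, and it assumes your working data really do contain mixed variable types):

fit_gower <- hclust(gower_d, method="ward.D2") # cluster on the Gower dissimilarities instead of d
plot(fit_gower, cex=0.4)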


Similarity/dissimilarity “method” options for hierarchical clustering:

Once you have created a distance matrix, it is time to use it in a hierarchical clustering model. This brings you to your next decision point: deciding what you mean by “similarity.”

There are many options for how to calculate the similarity between clusters. The ward.D and ward.D2 methods are popular and generally effective, but here are a few other options if you wish. Each entry below describes the method’s process and its typical result.

- single: measures the distance between the two closest points in each cluster. Result: generally better for identifying outliers that don’t cluster well.
- complete: measures the distance between the two most distant points in each cluster. Result: generally produces tighter clusters.
- centroid: measures the distance between the centers of the clusters. Result: generally works better for data with fewer similarities.
- median: measures the distance between each cluster’s median point. Result: similar to centroid, but weighted toward where the majority of observations are located.
- average: measures the average (mean) distance between the observations in the two clusters, weighted by the number of observations in each cluster. Result: generally similar to complete linkage, but better at incorporating outliers.
- mcquitty: similar to average, but does not take the number of points in each cluster into account. Result: generally similar to single linkage.
- ward.D: minimizes within-cluster variance (the sum of errors); clusters are combined according to the smallest between-cluster distance. Result: generally produces more compact clusters.
- ward.D2: same as ward.D, but the differences are squared (the sum of squared errors). Result: emphasizes the differences identified in ward.D, making clusters easier to differentiate.

I have tried to offer some general, if vague, guidance here. But most people recommend that you try out a few combinations of the distance options and the similarity options presented above to see what makes your clusters more interpretable; one quick way to compare linkage choices is sketched after the next paragraph.

Don’t just find one method and use it for every data set you receive. Remember, a solution isn’t “best” just because it works in more situations.
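
One quick way to compare candidates, sketched below, is the cophenetic correlation: how faithfully the dendrogram from each linkage preserves the original distance matrix d. Values closer to 1 are better, but treat this as one diagnostic among several, not the final word.

# Compare several linkage methods by how well each dendrogram preserves the distances in d
linkages <- c("single", "complete", "average", "ward.D", "ward.D2")
sapply(linkages, function(m) {
  fit_m <- hclust(d, method=m)
  cor(d, cophenetic(fit_m)) # cophenetic correlation; closer to 1 is better
})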

Here, we try a couple of options for the sake of comparison.

fit <- hclust(d, method="ward.D")   # Unmodified distances
fit2 <- hclust(d, method="ward.D2")  # Squared distances

Plot it. You do not need to use the rectangles to identify the various clusters. But it does help if you are communicating the information to people who have never read a dendrogram.

Here, we can see the differences between ward.D and ward.D2. The groupings are the same for this data set. But ward.D2 produces more differentiation between groups.

par(mfrow=c(1,2)) # Set the plot to 1 row, 2 columns
  # Plot 1
plot(fit, cex=0.4) # cex refers to font size in the plot
rect.hclust(fit, k=5) # k refers to the number of clusters
  # Plot 2
plot(fit2, cex=0.4)
rect.hclust(fit2, k=5) # Rectangles for the second dendrogram (fit2)
par(mfrow=c(1,1)) # Set the plot window back to one plot

An alternate visualization: Clusterplot

This option allows you to plot the similarities and differences in two-dimensional space. Keep in mind that the plot uses only the first two principal components, so some of the separation between clusters may not be visible in two dimensions.

Depending on the number of observations, you may wish to suppress the labels. This will help you to clean up the visualization for presentation.

- labels=1: identify points individually (no cluster numbers)
- labels=2: identify all points & clusters
- labels=3: identify all points only
- labels=4: identify all clusters only
- labels=5: identify individual points (with cluster numbers)

clusters <- 5
groups <- cutree(fit, k=clusters) # Assign observations to groups

Take a look at which states fall into which groups.

groups
## AK AL AR AZ CA CO CT DE FL GA HI ID IL IN IO KS KY LA MA MD ME MI MN MO MS MT 
##  1  2  2  3  3  1  1  1  3  2  3  1  3  1  4  1  2  2  1  1  4  1  4  1  2  4 
## NC ND NE NH NJ NM NV NY OH OK OR PA RI SC SD TN TX UT VA VT WA WI WV WY DC 
##  2  4  4  4  1  2  3  3  1  2  1  1  3  2  4  2  2  1  1  4  1  4  2  4  5

Next, plot it for a succinct representation of how these states are similar and different.

library(cluster)
clusplot(mydata, groups, color=TRUE, shade=TRUE, labels=2, lines=0)

Bootstrapping to Test Cluster Adequacy

So far, things may look good. But how do you know whether the grouping you found is a valid representation of cohesive clusters in this data set?

Remember the idea of bootstrapping? We can randomly resample, with replacement, from the data set and rerun the clustering algorithm. This will give us a much better idea of how stable the findings are.

library(fpc)
clusters <- 5
set.seed(8675309) # Set the seed before resampling so the bootstrap results are reproducible
clus.boot <- clusterboot(mydata, 
                         B=1000, # Number of bootstrap resamples
                         clustermethod=hclustCBI, # for hierarchical clustering 
                         method="ward.D", # use what we used in "hclust"
                         k=clusters, 
                         count=FALSE) # Show progress on screen?
# clus.boot

Build a simple table to evaluate the results. Keep in mind that you are looking for Average Jaccard measures greater than 0.6.

Read this as a reflection of how frequently the observations in each cluster end up grouped together across the bootstrap resamples. If they cluster the same way most of the time, regardless of which observations are resampled, then the average Jaccard value will be high.

When there is roughly an even chance as to whether the various observations will cluster together, then the average Jaccard value will be around 0.5. We are looking for some value that indicates that there is a much better than even chance that the solution will be consistent, so we use 0.6 or better as the low end of what is acceptable.
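
To make the idea concrete, here is a toy illustration of the Jaccard similarity between two versions of a cluster. The state lists are hypothetical and are not taken from our results:

original  <- c("AZ", "CA", "FL", "GA", "HI") # hypothetical cluster membership
resampled <- c("AZ", "CA", "FL", "NV")       # hypothetical membership after one resample
length(intersect(original, resampled)) / length(union(original, resampled)) # 3 shared / 6 distinct = 0.5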

Rule of thumb:

- AvgJaccard < 0.6 is unstable
- AvgJaccard > 0.85 is highly stable

AvgJaccard <- clus.boot$bootmean       # mean Jaccard similarity for each cluster across the resamples
Instability <- clus.boot$bootbrd/1000  # proportion of the 1,000 resamples in which the cluster dissolved
Clusters <- c(1:clusters)
Eval <- cbind(Clusters, AvgJaccard, Instability)
Eval
##      Clusters AvgJaccard Instability
## [1,]        1  0.7189252       0.094
## [2,]        2  0.8330209       0.086
## [3,]        3  0.4792701       0.636
## [4,]        4  0.8297518       0.114
## [5,]        5  0.6360000       0.364

Cluster three is very unstable, indicating that this is not a good solution.

Maybe another number of clusters would be better. It is okay to reconsider at this point and try a different number of clusters.
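
For example, you can re-cut the existing tree at a different number of clusters and rerun the stability check before committing (k = 4 below is only an illustration, not a recommendation):

groups4 <- cutree(fit, k=4) # re-cut the same dendrogram into four groups
table(groups4)              # how many states fall into each group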

If, however, you find this to be an adequate representation of the data, you can save the cluster designations back into the data set for later use.

hierarchical_clusters <- as.factor(groups) # First, make it a factor so that it is not confused with numeric data
state <- row.names(mydata)   # Extract row names from the working data. If we removed 
                              # any rows from the original data due to missingness, then
                              # we will need this to merge the clusters with the original data.
# Combine the two into a data frame
HC_results <- data.frame(state, hierarchical_clusters) # A data frame (rather than cbind) keeps the factor intact

# Merge the original "Crime" data with the new groups
Crime <- merge(Crime,          # Original data
               HC_results,     # Cluster data
               by.x="STATE",   # "State" variable, as it appears in original data
               by.y="state")   # "State" variable, as it appears in cluster data


k-means clustering

k-means clustering demands that you specify the number of clusters up front. To get an idea of how many clusters there are, use:

library(mclust)
guess <- Mclust(mydata)
summary(guess)
## ---------------------------------------------------- 
## Gaussian finite mixture model fitted by EM algorithm 
## ---------------------------------------------------- 
## 
## Mclust EEE (ellipsoidal, equal volume, shape and orientation) model with 6
## components: 
## 
##  log-likelihood  n df       BIC       ICL
##       -233.0663 51 62 -709.9057 -713.3701
## 
## Clustering table:
##  1  2  3  4  5  6 
##  1 12  7 19 11  1

Mclust found six clusters to be optimal for this data set. The table displays the cluster numbers and the count of observations that fall into each.

If you would like something like the “elbow method” (analogous to a flipped scree plot), you can plot the results and select “1: BIC” from the interactive menu. The BIC plot appears as the first of the plots, below.

plot(guess) 
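
If you only want the BIC panel and would rather skip the interactive menu, you can request it directly:

plot(guess, what="BIC") # BIC by number of components, for each covariance model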

Use a combination of these methods to decide for yourself what the optimal value of “k” should be when you run the k-means algorithm. Then, try a few out and compare the results.
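
For example, a common companion to the Mclust guess is the within-cluster sum-of-squares “elbow” plot. This is a sketch rather than part of the original analysis; the seed, the range of k, and nstart=25 are arbitrary choices here:

set.seed(123) # illustrative seed, for reproducibility only
wss <- sapply(1:10, function(k) {
  kmeans(mydata, centers=k, iter.max=100, nstart=25)$tot.withinss
})
plot(1:10, wss, type="b",
     xlab="Number of clusters (k)",
     ylab="Total within-cluster sum of squares") # look for the bend (the "elbow")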


Actually performing k-means clustering:

Fine. Mclust suggests that there are six clusters. Let’s try that this time.

clusters <- 6
k_fit <- kmeans(mydata, 
              clusters,     # number of clusters
              iter.max=100, # max steps in the fitting process
              nstart=100)   # number of random starts

Visualize:

Cluster Plot against 1st 2 principal components

library(cluster)
clusplot(mydata, 
         k_fit$cluster, 
         color=TRUE, 
         shade=TRUE, 
         labels=2, 
         lines=0)

Centroid Plot against 1st 2 discriminant functions

library(fpc)
plotcluster(mydata, k_fit$cluster)

Bootstrapping to evaluate cluster adequacy:

As with hierarchical clustering, we would like to know whether the groupings we found using k-means are a valid representation of cohesive clusters in this data set.

Again, with bootstrapping, we randomly resample, with replacement, from the data set and rerun the clustering algorithm. This will give us a much better idea of how stable the findings are.

clusters <- 6
set.seed(8675309) # As before, set the seed so the bootstrap resamples are reproducible
clus.boot <- clusterboot(mydata, 
                         B=1000, 
                         clustermethod=kmeansCBI, # for k-means clustering
                         k=clusters, 
                         count=FALSE) # Show progress on screen?

Now, let’s evaluate the results. Build a simple table, and keep in mind that you are still looking for Average Jaccard measures greater than 0.6.

Rule of thumb:

- AvgJaccard < 0.6 is unstable
- AvgJaccard > 0.85 is highly stable

AvgJaccard <- clus.boot$bootmean
Instability <- clus.boot$bootbrd/1000
Clusters <- c(1:clusters)
Eval <- cbind(Clusters, AvgJaccard, Instability)
Eval # AvgJaccard <0.6 is unstable & >0.85 is highly stable
##      Clusters AvgJaccard Instability
## [1,]        1  0.7727072       0.171
## [2,]        2  0.6076994       0.371
## [3,]        3  0.6156260       0.350
## [4,]        4  0.7072295       0.156
## [5,]        5  0.6521250       0.348
## [6,]        6  0.5167375       0.621


Cluster six falls below the 0.6 threshold (and has high instability), so this six-cluster solution is also questionable. If you still like what you see, you can save the cluster designations back into the data set for later use.

k_means_clusters <- as.factor(k_fit$cluster) # First, make it a factor so that it is not confused with numeric data
state <- row.names(mydata)   # Extract row names from the working data. If we removed 
                              # any rows from the original data due to missingness, then
                              # we will need this to merge the clusters with the original data.
# Combine the two into a data frame
KMC_results <- data.frame(state, k_means_clusters) # A data frame (rather than cbind) keeps the factor intact

# Merge the original "Crime" data with the new groups
Crime <- merge(Crime,          # Original data
               KMC_results,    # Cluster data
               by.x="STATE",   # "State" variable, as it appears in original data
               by.y="state")   # "State" variable, as it appears in cluster data






Just the scripts

mydata <- read.csv(file.choose(), header=TRUE)
# Prepare the data for clustering
rownames(mydata) <- mydata[,1] # Turn state names into observation IDs
mydata <- mydata[,3:8] # subset data set to just include the data you want to cluster (columns 3 to 8, as above)
mydata <- na.omit(mydata) # listwise deletion of missing
mydata <- scale(mydata) # standardize variables
Data.scale<-attr(mydata, "scaled:scale")
Data.center <- attr(mydata, "scaled:center")



### Hierarchical clustering
d <- dist(mydata, method="euclidean") # First, construct a distance matrix.
fit <- hclust(d, method="ward.D2")  # Next, cluster 
  # plot the dendrogram
plot(fit)
clusters <- 5
rect.hclust(fit, k=clusters) # draw the rectangles
  # Assign observations to groups
groups <- cutree(fit, k=clusters) 
  # Produce a cluster plot if you like
library(cluster)
clusplot(mydata, groups, color=TRUE, shade=TRUE, labels=2, lines=0)
  # Bootstrap the results to check for cluster adequacy
library(fpc)
clusters <- 5
clus.boot <- clusterboot(mydata, 
                         B=1000, # Number of bootstrap resamples
                         clustermethod=hclustCBI, # for agglomerative hierarchical clustering 
                         method="ward.D", # for hclustCBI
                         k=clusters, 
                         count=FALSE) # Show progress on screen?
clus.boot$bootmean # AvgJaccard <0.6 is unstable & >0.85 is highly stable
  # If you like what you see, then save the results into a data frame (as in the walkthrough above).
  # mydata is a scaled matrix, so do not try to append a factor column to it directly.
HC_results <- data.frame(state = row.names(mydata), hierarchical_clusters = as.factor(groups))



### K-Means clustering
# Start with some guess of how many clusters there are 
  # or use mclust
library(mclust)
guess <- Mclust(mydata)
summary(guess)

# Following Mclust suggestion: 6 clusters
clusters <- 6
k_fit <- kmeans(mydata, 
              clusters,     # number of clusters
              iter.max=100, # max steps in the fitting process
              nstart=100)   # number of random starts
  # Plot it
library(cluster)
clusplot(mydata, 
         k_fit$cluster, 
         color=TRUE, 
         shade=TRUE, 
         labels=2, 
         lines=0)
  # Bootstrap results to evaluate clusters
clusters <- 6
clus.boot <- clusterboot(mydata, 
                         B=1000, 
                         clustermethod=kmeansCBI, # for k-means clustering
                         k=clusters, 
                         count=FALSE) # Show progress on screen?
clus.boot$bootmean # AvgJaccard <0.6 is unstable & >0.85 is highly stable
  # Again, if you like the results, save them.
  # mydata is a scaled matrix, so store the cluster assignments in a separate data frame (as above)
KMC_results <- data.frame(state = row.names(mydata), k_means_clusters = as.factor(k_fit$cluster))