Clustering with k-means
k-means: how well did you do earlier?
Remember the seeds dataset from Chapter 2 which you clustered using the k-means method? Well, we’ve found the labels of the seeds. They are stored in the vector seeds_type; there were indeed three types of seeds!
While clusters are built without using the true labels, if you happen to have them it is interesting to see how well the clusters you made correspond to these true labels.
It’s up to you now to cluster the instances in seeds and compare the resulting clusters with seeds_type. Both objects are available in your workspace.
# seeds and seeds_type are pre-loaded in your workspace
seeds <- read.csv("seeds.csv")
seeds_type_a <- c(1,2,3)
seeds_type <- rep(seeds_type_a, each=70)
# Set random seed. Don't remove this line.
set.seed(100)
# Do k-means clustering with three clusters, repeat 20 times: seeds_km
seeds_km <- kmeans(seeds, centers = 3, nstart = 20)
# Print out seeds_km
seeds_km
## K-means clustering with 3 clusters of sizes 77, 72, 61
##
## Cluster means:
## area perimeter compactness length width asymmetry groove_length
## 1 11.96442 13.27481 0.8522000 5.229286 2.872922 4.759740 5.088519
## 2 14.64847 14.46042 0.8791667 5.563778 3.277903 2.648933 5.192319
## 3 18.72180 16.29738 0.8850869 6.208934 3.722672 3.603590 6.066098
##
## Clustering vector:
## [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 1 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2
## [36] 2 2 3 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 2 2 2 2 2 1
## [71] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3
## [106] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 2 3 3 3 3 3 3 3 2 2 2 2 3 2 2 2
## [141] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [176] 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1
##
## Within cluster sum of squares by cluster:
## [1] 195.7453 207.4648 184.1086
## (between_SS / total_SS = 78.4 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"
# Compare clusters with actual seed types. Set k-means clusters as rows
table(seeds_km$cluster, seeds_type)
## seeds_type
## 1 2 3
## 1 9 0 68
## 2 60 10 2
## 3 1 60 0
# Plot the length as a function of the width. Color by cluster
plot(seeds$width, seeds$length, xlab = "Width", ylab = "Length", col = seeds_km$cluster)

- If you have a look at the table that was generated, you clearly see three groups with 60 elements or more. Overall, you can say that the clusters you formed adequately represent the different types of seeds. These larger groups capture the correspondence between the clusters and the actual types, e.g. cluster 1 corresponds to seeds_type 3.
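To put a rough number on that correspondence, the sketch below (not part of the original exercise; conf_mat is just an illustrative name) assigns each cluster to its majority seed type and computes the share of seeds that land in that majority type. With the table above, this works out to roughly 0.9.
# Sketch: proportion of seeds that fall in their cluster's majority seed type
conf_mat <- table(seeds_km$cluster, seeds_type)
sum(apply(conf_mat, 1, max)) / sum(conf_mat)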
The influence of starting centroids
If you call kmeans() without specifying your centroids, R will randomly assign them for you. In this exercise, you will see the influence of these starting centroids yourself using the seeds dataset.
To compare the clusters of two cluster models, you can again use table(). If every row and every column contains exactly one non-zero count, the two clusterings overlap completely. If this is not the case, some objects were placed in different clusters.
# Set random seed. Don't remove this line.
set.seed(100)
# Apply kmeans to seeds twice: seeds_km_1 and seeds_km_2
seeds_km_1 <- kmeans(seeds, 5, nstart=1)
seeds_km_2 <- kmeans(seeds, 5, nstart=1)
# Print the ratio of the total within-cluster sums of squares of the two clusterings
seeds_km_1$tot.withinss / seeds_km_2$tot.withinss
## [1] 1.006146
# Compare the resulting clusters
table(seeds_km_1$cluster, seeds_km_2$cluster)
##
## 1 2 3 4 5
## 1 0 16 0 0 18
## 2 0 56 0 0 0
## 3 0 0 12 0 0
## 4 0 0 0 18 41
## 5 33 0 3 13 0
- As you can see, some clusters remained the same while others changed. For example, cluster 5 from seeds_km_1 completely contains cluster 1 from seeds_km_2 (33 objects). Cluster 4 from seeds_km_1 is split: 18 objects went to seeds_km_2’s fourth cluster and 41 to its fifth. For consistent and reliable results, you should set nstart > 1 or provide a prior estimate of your centroids, as sketched below.
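Below is a minimal sketch of those two remedies on the seeds data already in the workspace; using the first five rows as starting centroids is purely illustrative, not a recommendation.
# Remedy 1: more random restarts, kmeans() keeps the best of 20 runs
seeds_km_stable <- kmeans(seeds, centers = 5, nstart = 20)
# Remedy 2: explicit starting centroids, passed as a matrix or data frame of initial centers
seeds_km_manual <- kmeans(seeds, centers = seeds[1:5, ])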
Making a scree plot!
Let’s move on to some new data: school results! You’re given a dataset school_result containing school-level data recording reading and arithmetic scores for each school’s 4th and 6th graders (Source: cluster.datasets package). We’re wondering whether it’s possible to define distinct groups of schools based on their scores and, if so, how many groups we should consider.
Your job is to cluster the schools based on their scores with k-means, for different values of k. On each run, you’ll record the ratio of the within cluster sum of squares to the total sum of squares. The scree plot will tell you which k is optimal!
# The dataset school_result is pre-loaded
school_result <- read.csv("school_result.csv")
# Set random seed. Don't remove this line.
set.seed(100)
# Explore the structure of your data
str(school_result)
## 'data.frame': 25 obs. of 4 variables:
## $ reading.4 : num 2.7 3.9 4.8 3.1 3.4 3.1 4.6 3.1 3.8 5.2 ...
## $ arithmetic.4: num 3.2 3.8 4.1 3.5 3.7 3.4 4.4 3.3 3.7 4.9 ...
## $ reading.6 : num 4.5 5.9 6.8 4.3 5.1 4.1 6.6 4 4.7 8.2 ...
## $ arithmetic.6: num 4.8 6.2 5.5 4.6 5.6 4.7 6.1 4.9 4.9 6.9 ...
# Initialize ratio_ss: a vector of length 7 that contains all zeros
ratio_ss <- rep(0, 7)
# Finish the for-loop
for (k in 1:7) {
# Apply k-means to school_result: school_km
school_km <- kmeans(school_result, k, nstart = 20)
# Save the ratio of tot.withinss to totss (WSS/TSS) in the kth element of ratio_ss. Both values are found in the school_km object.
ratio_ss[k] <- school_km$tot.withinss / school_km$totss
}
# Make a scree plot with type "b" (connecting the points) and xlab "k"
plot(ratio_ss, type = "b", xlab = "k")

What is your optimal k?
In the last exercise you made a scree plot, showing the ratio of the within cluster sum of squares to the total sum of squares.
You want to choose k such that your clusters are compact and well separated. However, ratio_ss keeps decreasing as k increases. Hence, if you were to minimize this ratio as a function of k, you’d end up with a cluster for every school, which is a meaningless result. You should choose k such that increasing it further no longer has a significant impact on ratio_ss. The elbow in the scree plot will help you identify this turning point.
Can you tell which of the following values of k will provide a meaningful clustering with overall compact and well separated clusters? The ratio, ratio_ss, is still in your workspace.
- 3 or 4
- The plot shows a considerable drop for k equal to 3. Furthermore, the ratio_ss for k equal to 2 is higher than 20%.
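If you want the elbow to stand out, an optional sketch (not part of the exercise) is to redraw the scree plot and mark the chosen k with a vertical line; the value 3 below simply reflects the drop noted in the feedback above.
plot(ratio_ss, type = "b", xlab = "k")
abline(v = 3, lty = 2)  # candidate elbow at k = 3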
Performance and scaling issues
When to standardize your data?
It’s always important to assess the scale of your variables prior to clustering. If your variables are on different scales, it’s best to either rescale some of them or simply standardize them all.
Which of the following datasets should best be fully standardized? Before you answer this question, you can take a look at the summary of each dataset (Source: cluster.datasets package). They are all available in the workspace.
school_result <- read.csv("school_result.csv")
head(school_result)
## reading.4 arithmetic.4 reading.6 arithmetic.6
## 1 2.7 3.2 4.5 4.8
## 2 3.9 3.8 5.9 6.2
## 3 4.8 4.1 6.8 5.5
## 4 3.1 3.5 4.3 4.6
## 5 3.4 3.7 5.1 5.6
## 6 3.1 3.4 4.1 4.7
moon_data<-read.csv("moon_data.csv")
head(moon_data)
## X distance period
## 1 Earth.1 239000 655.0
## 2 Mars.1 5800 7.7
## 3 Mars.2 14600 30.0
## 4 Jupiter.01 112000 12.0
## 5 Jupiter.02 262000 42.0
## 6 Jupiter.03 417000 85.0
run_record<-read.csv("run_record.csv")
head(run_record)
## X X100m X200m X400m X800m X1500m X5000m X10000m marathon
## 1 Argentina 10.23 20.37 46.18 106.2 220.8 799.8 1659.0 7774.2
## 2 Australia 9.93 20.06 44.38 104.4 211.8 775.8 1651.8 7650.6
## 3 Austria 10.15 20.45 45.80 106.2 214.8 795.6 1663.2 7933.2
## 4 Belgium 10.14 20.19 45.02 103.8 214.2 769.8 1612.2 7632.0
## 5 Bermuda 10.27 20.30 45.26 107.4 222.0 878.4 1829.4 8782.2
## 6 Brazil 10.00 19.89 44.29 102.0 214.2 808.8 1687.8 7563.0
cereal_data<-read.csv("cereal_data.csv")
head(cereal_data)
## X Calories Potassium Sodium
## 1 ACCheerios 110 70 180
## 2 Cheerios 110 105 290
## 3 CocoaPuffs 110 55 180
## 4 CountChocula 110 65 180
## 5 GoldenGrahams 110 45 280
## 6 HoneyNutCheerios 110 90 250
- 1 school_result, containing scores (from 0 to 10) of schools.
- 2 moon_data, containing the distance to the planet (in miles) and the orbit’s period (in earth days).
- 3 run_record, which contains the Olympic run records for 50 countries, from the 100m and 200m up to the marathon. The records are all denoted in seconds. (correct answer)
- 4 cereal_data, listing the amount of calories, sodium and potassium several cereals contain.
Indeed! The scales of the run records differ substantially. Moreover, they all share the same unit, so standardizing will not make the interpretation of the clusters any harder!
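As a quick sanity check of that claim, a sketch like the following compares the spread of the run_record columns, assuming run_record is the data frame read above with the country names in column X; the printed head already suggests 100m times of about 10 seconds versus marathon times of several thousand seconds.
# Standard deviation of each numeric column of run_record
sapply(run_record[, -1], sd)
# Or just look at the two extremes
range(run_record$X100m)
range(run_record$marathon)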
Standardized vs non-standardized clustering (1)
You agreed that the run_record data should be standardized prior to clustering, but is it truly that important?
In this exercise you’ll cluster the countries based on their unstandardized records and calculate Dunn’s index; a sketch of what that could look like follows below. Take your time to check out the dataset; you’ll be using it in the upcoming exercises.
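The code for this first exercise isn’t shown in these notes, so here is a hedged sketch of how run_km (used in the comparison table further down) might have been produced, assuming run_record2 holds the numeric record columns of run_record and that dunn() comes from the clValid package; dunn_km_unsc is just a name for this sketch.
# Sketch, not the original exercise solution
library(clValid)                               # provides dunn()
run_record2 <- run_record[, -1]                # drop the country-name column (assumption)
run_km <- kmeans(run_record2, 5, nstart = 20)  # k-means on the unstandardized records
plot(run_record2$marathon, run_record2$X100m, col = run_km$cluster,
     xlab = "marathon", ylab = "100m", main = "Run Records")
dunn_km_unsc <- dunn(clusters = run_km$cluster, Data = run_record2)
dunn_km_unsc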
Standardized vs non-standardized clustering (2)
You expected it already, the unstandardized clusters don’t produce satisfying results. Let’s see if standardizing helps!
Your job is to standardize the run_record dataset and apply the k-means algorithm again. Calculate Dunn’s index and compare the results!
# The dataset run_record as well as run_km are available
# Set random seed. Don't remove this line.
set.seed(1)
# Standardize the numeric records (run_record2), transform to a data frame: run_record_sc
run_record_sc <- as.data.frame(scale(run_record2))
# Cluster run_record_sc using k-means: run_km_sc. 5 groups, let R start over 20 times
run_km_sc <- kmeans(run_record_sc, 5, nstart = 20)
# Plot records on 100m as function of the marathon. Color using the clusters in run_km_sc
plot(run_record2$marathon, run_record2$X100m, col = run_km_sc$cluster,
xlab = "marathon", ylab ="100m", main = "Run Records")

# Compare the resulting clusters in a nice table
table(run_km$cluster, run_km_sc$cluster)
##
## 1 2 3 4 5
## 1 0 6 0 0 0
## 2 7 0 10 0 1
## 3 1 0 10 0 13
## 4 2 2 0 0 0
## 5 0 0 0 2 0
# Calculate Dunn's index: dunn_km_sc. Print it. (dunn() is provided by the clValid package.)
dunn_km_sc <- dunn(clusters = run_km_sc$cluster, Data = run_record_sc)
dunn_km_sc
## [1] 0.1453556
- The plot now shows the influence of the 100m records on the resulting clusters! Dunn’s index is clear about it: the standardized clusters are more compact and/or better separated!
Hierarchical Clustering
Which of the following statements is false?
- Single linkage hierarchical clustering can lead to undesirable chaining of your objects.
- K-means is overall computationally less intensive than bottom-up hierarchical clustering.
- The dendrogram is a tree that represents the hierarchical clustering. It tells you which clusters merged when.
- In complete linkage, the distance between two clusters is taken to be the minimal distance from any member of one cluster to any member of the other cluster.
Indeed, this last statement is the false one! In complete linkage, the distance between two clusters is actually the maximal distance from any member of one cluster to any member of the other; the minimal distance is what defines single linkage.
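A tiny numeric illustration of the two linkage definitions, using four made-up points on a line (not course data):
x <- matrix(c(0, 0, 1, 0, 10, 0, 12, 0), ncol = 2, byrow = TRUE)  # four 2-D points
d <- as.matrix(dist(x))   # pairwise Euclidean distances
min(d[1:2, 3:4])          # single linkage distance between {1,2} and {3,4}: 9
max(d[1:2, 3:4])          # complete linkage distance between the same clusters: 12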
Single Hierarchical Clustering
Let’s return to the Olympic records example. You’ve already clustered the countries using the k-means algorithm, but that gave you a fixed number of clusters. We’re interested in more!
In this exercise, you’ll apply the hierarchical method to cluster the countries. Of course, you’ll be working with the standardized data. Make sure to visualize your results!
# The dataset run_record_sc has been loaded in your workspace
run_record_sc <- read.csv("run_record_sc.csv")
# Calculate the Euclidean distance matrix of run_record_sc using dist(); it uses the Euclidean method by default. Assign it to run_dist.
# (The warning below presumably comes from a non-numeric country-name column in the CSV being coerced to NA.)
run_dist <- dist(run_record_sc, method = "euclidean")
## Warning in dist(run_record_sc, method = "euclidean"): NAs introduced by
## coercion
# Use the run_dist matrix to cluster your data hierarchically, based on single-linkage. Use hclust() with two arguments. Assign it to run_single.
run_single <- hclust(run_dist, method = "single")
# Cut the tree using cutree() at 5 clusters. Assign the result to memb_single.
memb_single <- cutree(run_single, k = 5)
# Make a dendrogram of run_single using plot(). If you pass a hierarchical clustering object to plot(), it will draw the dendrogram of this clustering.
plot(run_single)
# Draw boxes around the 5 clusters using rect.hclust(). Set the border argument to 2:6, for different colors.
rect.hclust(run_single, k=5, border=2:6)

- However, it appears that the two island nations Samoa and the Cook Islands, which are not known for their sporting performances, have each been placed in their own group. Maybe we’re dealing with some chaining issues? Let’s try a different linkage method in the next exercise!
Complete Hierarchical Clustering
The clusters of the last exercise weren’t truly satisfying. The single-linkage method appears to be placing each outlier in its own cluster. Let’s see if complete-linkage agrees with this clustering!
In this exercise, you’ll repeat some steps from the last exercise, but this time for the complete-linkage method. Visualize your results and compare with the single-linkage results. A model solution to the previous exercise is already available to inspire you. It’s up to you to add code for complete-linkage.
# run_record_sc is pre-loaded
# Code for single-linkage
#run_dist <- dist(run_record_sc, method = "euclidean")
#run_single <- hclust(run_dist, method = "single")
#memb_single <- cutree(run_single, 5)
#plot(run_single)
#rect.hclust(run_single, k = 5, border = 2:6)
# Apply hclust() to run_dist: run_complete
run_complete <- hclust(run_dist, method = "complete")
# Apply cutree() to run_complete: memb_complete
memb_complete <- cutree(run_complete, 5)
# Apply plot() on run_complete to draw the dendrogram
plot(run_complete)
# Apply rect.hclust() on run_complete to draw the boxes
rect.hclust(run_complete, k = 5, border = 2:6)

# table() the clusters memb_single and memb_complete. Put memb_single in the rows
table(memb_single, memb_complete)
## memb_complete
## memb_single 1 2 3 4 5
## 1 27 7 14 0 1
## 2 0 0 0 1 0
## 3 0 0 0 0 1
## 4 0 0 0 0 2
## 5 0 0 0 1 0
- Compare the two plots. The five clusters differ significantly from the single-linkage clusters. The one big cluster you had before is now split up into four medium-sized clusters. Have a look at the table you generated as well!
Hierarchical vs k-means
So you’ve clustered the countries based on their Olympic run performances using three different methods: k-means clustering, hierarchical clustering with single linkage and hierarchical clustering with complete linkage. You can ask yourself: which method returns the best separated and the most compact clusters?
Let’s use Dunn’s index. Remember, it returns the ratio of the minimum intercluster distance to the maximum intracluster diameter. The dunn() function in R requires the argument clusters, indicating the cluster partitioning, the Data, and a method to determine the distance. In this case that’s “euclidean”, which is the default.
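To make that definition concrete, here is a small illustrative re-implementation (a sketch, not the clValid code) that computes the same ratio from a distance matrix d and a cluster vector cl.
manual_dunn <- function(d, cl) {
  d <- as.matrix(d)
  inter <- min(d[outer(cl, cl, "!=")])            # smallest distance between points in different clusters
  intra <- max(sapply(unique(cl), function(k) {   # largest within-cluster diameter
    idx <- which(cl == k)
    if (length(idx) < 2) 0 else max(d[idx, idx])
  }))
  inter / intra
}
# usage sketch: manual_dunn(dist(your_data), your_clusters)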
Your job is to calculate Dunn’s index for all three clusterings and compare the clusters to each other. The R objects you calculated in the previous exercises are already available in the workspace.
# run_record_sc, run_km_sc, memb_single and memb_complete are pre-calculated
# Set random seed. Don't remove this line.
set.seed(100)
# Dunn's index for k-means: dunn_km
dunn_km <- dunn(clusters = run_km_sc$cluster, Data = run_record_sc)
## Warning in dist(Data, method = method): NAs introduced by coercion
# Dunn's index for single-linkage: dunn_single
dunn_single <- dunn(clusters = memb_single, Data = run_record_sc)
## Warning in dist(Data, method = method): NAs introduced by coercion
# Dunn's index for complete-linkage: dunn_complete
dunn_complete <- dunn(clusters = memb_complete, Data = run_record_sc)
## Warning in dist(Data, method = method): NAs introduced by coercion
# Compare k-means with single-linkage
table(run_km_sc$cluster, memb_single)
## memb_single
## 1 2 3 4 5
## 1 9 0 1 0 0
## 2 6 0 0 2 0
## 3 20 0 0 0 0
## 4 0 1 0 0 1
## 5 14 0 0 0 0
# Compare k-means with complete-linkage
table(run_km_sc$cluster, memb_complete)
## memb_complete
## 1 2 3 4 5
## 1 0 0 8 0 2
## 2 0 0 6 0 2
## 3 20 0 0 0 0
## 4 0 0 0 2 0
## 5 7 7 0 0 0
- The table shows that the clusters obtained from the complete linkage method are similar to those of k-means. Now it’s up to you to compare the values of Dunn’s index and decide which method returns the best separated and most compact clusters.
Interpreting Dunn’s Index
Do you remember what Dunn’s index measured? Let’s put that to the test!
Which of the following statements about your clusterings of the countries in run_record_sc is correct? Use the results of Dunn’s index, found in dunn_km, dunn_single and dunn_complete in your workspace.
dunn_km
## [1] 0.1453556
dunn_single
## [1] 0.2921946
dunn_complete
## [1] 0.1808437
- The complete-linkage method returned the lowest ratio of minimal intercluster distance to minimal cluster diameter. (Remember, Dunn’s index returns the ratio of minimal intercluster distance to maximal cluster diameter.)
- Based on Dunn’s index, the complete-linkage method returned the most compact and separated clusters. (Remember, Dunn’s index returns the ratio of minimal intercluster distance to maximal cluster diameter. A lower Dunn’s index indicates that the minimal intercluster distance decreased and/or the maximal cluster diameter increased, so at least one pair of clusters became less separated and/or some cluster became less compact.)
- The single-linkage method returned the highest ratio of minimal intercluster distance to maximal cluster diameter. (Correct! Are you satisfied with this result? The single-linkage method that caused the chaining effects actually returned the most compact and separated clusters. If you think about it, this can make sense: single linkage puts every outlier in its own cluster, which increases the intercluster distances and reduces the diameters, hence giving a higher Dunn’s index. You could therefore conclude that the single-linkage method did a fine job of identifying the outliers; a quick check of this is sketched below. However, if you’d like to report your clusters to the local newspapers, then complete linkage or k-means are probably the better choice!)
- Based on Dunn’s index, the single-linkage method returned the least compact and separated clusters. (Remember, Dunn’s index returns the ratio of minimal intercluster distance to maximal cluster diameter. A higher Dunn’s index indicates that the minimal intercluster distance increased and/or the maximal cluster diameter decreased, so at least one pair of clusters became more separated and/or the clusters became more compact overall.)
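As a quick check of the outlier argument in the correct option above, a small sketch (variable names chosen for illustration) lists the singleton clusters produced by single linkage and the rows of run_record_sc they contain.
sizes <- table(memb_single)                        # countries per single-linkage cluster
singletons <- as.integer(names(sizes)[sizes == 1]) # cluster labels of size one
which(memb_single %in% singletons)                 # row numbers of the isolated countries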
Clustering US states based on criminal activity
You’ve seen that different clustering methods can return entirely different clusters, each with their own interpretation and uses. It’s time to put your skills, both the programming and the interpretation, to the test!
Your client has provided you with a dataset, crime_data, containing info on the crimes committed in each of the 50 US states and the percentage of urban population (Source: Edureka). He’d like you to group the states into 4 clusters. He didn’t specify which similarity measure to use, but the Euclidean distance seems acceptable, don’t you agree?
You decide to try out two techniques: k-means and single-linkage hierarchical clustering. You then want to compare the results by calculating the Dunn’s indices to make a conclusion. Which clustering will you deliver to your client?
# crime_data2 is pre-loaded (presumably the numeric columns of crime_data, without the state names)
crime_data2 <- read.csv("crime_data2.csv")
# Set random seed. Don't remove this line.
set.seed(1)
# Scale the dataset: crime_data_sc
crime_data_sc <- scale(crime_data2)
# Perform k-means clustering: crime_km
crime_km <- kmeans(crime_data_sc, 4, nstart = 20)
# Perform single-linkage hierarchical clustering
## Calculate the distance matrix: dist_matrix
dist_matrix <- dist(crime_data_sc, method = "euclidean")
## Calculate the clusters using hclust(): crime_single
crime_single <- hclust(dist_matrix, method = "single")
## Cut the clusters using cutree(): memb_single
memb_single <- cutree(crime_single, k = 4)
# Calculate the Dunn's index for both clusterings: dunn_km, dunn_single
dunn_km <- dunn(clusters = crime_km$cluster, Data = crime_data_sc)
dunn_single <- dunn(clusters = memb_single, Data = crime_data_sc)
# Print out the results
dunn_km
## [1] 0.1604403
dunn_single
## [1] 0.2438734
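Before deciding which clustering to deliver, a possible final check (a sketch, following the same reasoning as in the run-records example) is to compare the cluster sizes: if single linkage again isolates a few outlier states in tiny clusters, its higher Dunn’s index may be less useful for reporting than the k-means partition.
table(crime_km$cluster)  # sizes of the k-means clusters
table(memb_single)       # sizes of the single-linkage clusters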