Clustering with k-means
k-means: how well did you do earlier?
Remember the seeds dataset from Chapter 2 which you clustered using the k-means method? Well, we’ve found the labels of the seeds. They are stored in the vector seeds_type; there were indeed three types of seeds!
While clusters are built without using the true labels, if you happen to have them it is interesting to see how well the clusters you made correspond to these true labels.
It’s up to you now to cluster the instances in seeds and compare the resulting clusters with seeds_type. Both objects are available in your workspace.
# seeds and seeds_type are pre-loaded in your workspace
seeds <- read.csv("seeds.csv")
seeds_type_a <- c(1,2,3)
seeds_type <- rep(seeds_type_a, each=70)
# Set random seed. Don't remove this line.
set.seed(100)
# Do k-means clustering with three clusters, repeat 20 times: seeds_km
seeds_km <- kmeans(seeds, centers = 3, nstart = 20)
# Print out seeds_km
seeds_km
## K-means clustering with 3 clusters of sizes 77, 72, 61
##
## Cluster means:
## area perimeter compactness length width asymmetry groove_length
## 1 11.96442 13.27481 0.8522000 5.229286 2.872922 4.759740 5.088519
## 2 14.64847 14.46042 0.8791667 5.563778 3.277903 2.648933 5.192319
## 3 18.72180 16.29738 0.8850869 6.208934 3.722672 3.603590 6.066098
##
## Clustering vector:
## [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 1 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2
## [36] 2 2 3 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 2 2 2 2 2 1
## [71] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3
## [106] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 2 3 3 3 3 3 3 3 2 2 2 2 3 2 2 2
## [141] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [176] 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1
##
## Within cluster sum of squares by cluster:
## [1] 195.7453 207.4648 184.1086
## (between_SS / total_SS = 78.4 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"
# Compare clusters with actual seed types. Set k-means clusters as rows
table(seeds_km$cluster, seeds_type)
## seeds_type
## 1 2 3
## 1 9 0 68
## 2 60 10 2
## 3 1 60 0
# Plot the length as a function of the width. Color by cluster
plot(seeds$width, seeds$length, xlab = "Width", ylab = "Length", col = seeds_km$cluster)

- If you have a look at the table that was generated, you clearly see three groups with 60 elements or more. Overall, you can say that the clusters you formed adequately represent the different types of seeds. These larger groups capture the correspondence between the clusters and the actual types, e.g. cluster 1 corresponds to seeds_type 3.
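To put a rough number on that correspondence, the sketch below (not part of the original exercise; conf_mat is just an illustrative name) assigns each cluster to its majority seed type and computes the share of seeds that land in that majority type. With the table above, this works out to roughly 0.9.
# Sketch: proportion of seeds that fall in their cluster's majority seed type
conf_mat <- table(seeds_km$cluster, seeds_type)
sum(apply(conf_mat, 1, max)) / sum(conf_mat)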
The influence of starting centroids
If you call kmeans() without specifying your centroids, R will randomly assign them for you. In this exercise, you will see the influence of these starting centroids yourself using the seeds dataset.
To compare the clusters of two cluster models, you can again use table(). If every row and every column contains exactly one non-zero count, the two clusterings overlap completely. If this is not the case, some objects were placed in different clusters.
# Set random seed. Don't remove this line.
set.seed(100)
# Apply kmeans to seeds twice: seeds_km_1 and seeds_km_2
seeds_km_1 <- kmeans(seeds, 5, nstart=1)
seeds_km_2 <- kmeans(seeds, 5, nstart=1)
# Print the ratio of the total within-cluster sums of squares of the two clusterings
seeds_km_1$tot.withinss / seeds_km_2$tot.withinss
## [1] 1.006146
# Compare the resulting clusters
table(seeds_km_1$cluster, seeds_km_2$cluster)
##
## 1 2 3 4 5
## 1 0 16 0 0 18
## 2 0 56 0 0 0
## 3 0 0 12 0 0
## 4 0 0 0 18 41
## 5 33 0 3 13 0
- As you can see, some clusters remained the same while others changed. For example, cluster 5 from seeds_km_1 completely contains cluster 1 from seeds_km_2 (33 objects). Cluster 4 from seeds_km_1 is split: 18 objects went to seeds_km_2’s fourth cluster and 41 to its fifth. For consistent and reliable results, you should set nstart > 1 or provide a prior estimate of your centroids, as sketched below.
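Below is a minimal sketch of those two remedies on the seeds data already in the workspace; using the first five rows as starting centroids is purely illustrative, not a recommendation.
# Remedy 1: more random restarts, kmeans() keeps the best of 20 runs
seeds_km_stable <- kmeans(seeds, centers = 5, nstart = 20)
# Remedy 2: explicit starting centroids, passed as a matrix or data frame of initial centers
seeds_km_manual <- kmeans(seeds, centers = seeds[1:5, ])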
Making a scree plot!
Let’s move on to some new data: school results! You’re given a dataset school_result containing school-level data recording reading and arithmetic scores for each school’s 4th and 6th graders (Source: cluster.datasets package). We’re wondering whether it’s possible to define distinct groups of schools based on their scores and, if so, how many groups we should consider.
Your job is to cluster the schools based on their scores with k-means, for different values of k. On each run, you’ll record the ratio of the within cluster sum of squares to the total sum of squares. The scree plot will tell you which k is optimal!
# The dataset school_result is pre-loaded
school_result <- read.csv("school_result.csv")
# Set random seed. Don't remove this line.
set.seed(100)
# Explore the structure of your data
str(school_result)
## 'data.frame': 25 obs. of 4 variables:
## $ reading.4 : num 2.7 3.9 4.8 3.1 3.4 3.1 4.6 3.1 3.8 5.2 ...
## $ arithmetic.4: num 3.2 3.8 4.1 3.5 3.7 3.4 4.4 3.3 3.7 4.9 ...
## $ reading.6 : num 4.5 5.9 6.8 4.3 5.1 4.1 6.6 4 4.7 8.2 ...
## $ arithmetic.6: num 4.8 6.2 5.5 4.6 5.6 4.7 6.1 4.9 4.9 6.9 ...
# Initialize ratio_ss: a vector of length 7 that contains all zeros
ratio_ss <- rep(0, 7)
# Finish the for-loop
for (k in 1:7) {
# Apply k-means to school_result: school_km
school_km <- kmeans(school_result, k, nstart = 20)
# Save the ratio of tot.withinss to totss (WSS/TSS) in the kth element of ratio_ss. Both values are found in the school_km object.
ratio_ss[k] <- school_km$tot.withinss / school_km$totss
}
# Make a scree plot with type "b" (connecting the points) and xlab "k"
plot(ratio_ss, type = "b", xlab = "k")

What is your optimal k?
In the last exercise you made a scree plot, showing the ratio of the within cluster sum of squares to the total sum of squares.
You want to choose k such that your clusters are compact and well separated. However, ratio_ss keeps decreasing as k increases. Hence, if you were to minimize this ratio as a function of k, you’d end up with a cluster for every school, which is a meaningless result. You should choose k such that increasing it further no longer has a significant impact on ratio_ss. The elbow in the scree plot will help you identify this turning point.
Can you tell which of the following values of k will provide a meaningful clustering with overall compact and well separated clusters? The ratio, ratio_ss, is still in your workspace.
- 3 or 4
- The plot shows a considerable drop for k equal to 3. Furthermore, the ratio_ss for k equal to 2 is higher than 20%.
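If you want the elbow to stand out, an optional sketch (not part of the exercise) is to redraw the scree plot and mark the chosen k with a vertical line; the value 3 below simply reflects the drop noted in the feedback above.
plot(ratio_ss, type = "b", xlab = "k")
abline(v = 3, lty = 2)  # candidate elbow at k = 3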
Performance and scaling issues
When to standardize your data?
It’s always important to assess the scale of your variables prior to clustering. If your variables are on different scales, it’s best to either rescale some of them or simply standardize them all.
Which of the following datasets should best be fully standardized? Before you answer this question, you can take a look at the summary of each dataset (Source: cluster.datasets package). They are all available in the workspace.
school_result <- read.csv("school_result.csv")
head(school_result)
## reading.4 arithmetic.4 reading.6 arithmetic.6
## 1 2.7 3.2 4.5 4.8
## 2 3.9 3.8 5.9 6.2
## 3 4.8 4.1 6.8 5.5
## 4 3.1 3.5 4.3 4.6
## 5 3.4 3.7 5.1 5.6
## 6 3.1 3.4 4.1 4.7
moon_data<-read.csv("moon_data.csv")
head(moon_data)
## X distance period
## 1 Earth.1 239000 655.0
## 2 Mars.1 5800 7.7
## 3 Mars.2 14600 30.0
## 4 Jupiter.01 112000 12.0
## 5 Jupiter.02 262000 42.0
## 6 Jupiter.03 417000 85.0
run_record<-read.csv("run_record.csv")
head(run_record)
## X X100m X200m X400m X800m X1500m X5000m X10000m marathon
## 1 Argentina 10.23 20.37 46.18 106.2 220.8 799.8 1659.0 7774.2
## 2 Australia 9.93 20.06 44.38 104.4 211.8 775.8 1651.8 7650.6
## 3 Austria 10.15 20.45 45.80 106.2 214.8 795.6 1663.2 7933.2
## 4 Belgium 10.14 20.19 45.02 103.8 214.2 769.8 1612.2 7632.0
## 5 Bermuda 10.27 20.30 45.26 107.4 222.0 878.4 1829.4 8782.2
## 6 Brazil 10.00 19.89 44.29 102.0 214.2 808.8 1687.8 7563.0
cereal_data<-read.csv("cereal_data.csv")
head(cereal_data)
## X Calories Potassium Sodium
## 1 ACCheerios 110 70 180
## 2 Cheerios 110 105 290
## 3 CocoaPuffs 110 55 180
## 4 CountChocula 110 65 180
## 5 GoldenGrahams 110 45 280
## 6 HoneyNutCheerios 110 90 250
- 1 school_result, containing scores (from 0 to 10) of schools.
- 2 moon_data, containing the distance to the planet (in miles) and the orbit’s period (in earth days).
- 3 run_record, which contains the Olympic run records for 50 countries, from the 100m and 200m up to the marathon. The records are all denoted in seconds. (correct answer)
- 4 cereal_data, listing the amount of calories, sodium and potassium several cereals contain.
Indeed! The scales of the run records differ substantially. Moreover, they all share the same unit, so standardizing will not make the interpretation of the clusters any harder!
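As a quick sanity check of that claim, a sketch like the following compares the spread of the run_record columns, assuming run_record is the data frame read above with the country names in column X; the printed head already suggests 100m times of about 10 seconds versus marathon times of several thousand seconds.
# Standard deviation of each numeric column of run_record
sapply(run_record[, -1], sd)
# Or just look at the two extremes
range(run_record$X100m)
range(run_record$marathon)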
Standardized vs non-standardized clustering (1)
You agreed that the run_record data should be standardized prior to clustering, but is it truly that important?
In this exercise you’ll cluster the countries based on their unstandardized records and calculate Dunn’s index; a sketch of what that could look like follows below. Take your time to check out the dataset; you’ll be using it in the upcoming exercises.
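The code for this first exercise isn’t shown in these notes, so here is a hedged sketch of how run_km (used in the comparison table further down) might have been produced, assuming run_record2 holds the numeric record columns of run_record and that dunn() comes from the clValid package; dunn_km_unsc is just a name for this sketch.
# Sketch, not the original exercise solution
library(clValid)                               # provides dunn()
run_record2 <- run_record[, -1]                # drop the country-name column (assumption)
run_km <- kmeans(run_record2, 5, nstart = 20)  # k-means on the unstandardized records
plot(run_record2$marathon, run_record2$X100m, col = run_km$cluster,
     xlab = "marathon", ylab = "100m", main = "Run Records")
dunn_km_unsc <- dunn(clusters = run_km$cluster, Data = run_record2)
dunn_km_unsc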
Standardized vs non-standardized clustering (2)
You expected it already, the unstandardized clusters don’t produce satisfying results. Let’s see if standardizing helps!
Your job is to standardize the run_record dataset and apply the k-means algorithm again. Calculate Dunn’s index and compare the results!
# The dataset run_record as well as run_km are available
# Set random seed. Don't remove this line.
set.seed(1)
# Standardize the numeric records (run_record2), transform to a data frame: run_record_sc
run_record_sc <- as.data.frame(scale(run_record2))
# Cluster run_record_sc using k-means: run_km_sc. 5 groups, let R start over 20 times
run_km_sc <- kmeans(run_record_sc, 5, nstart = 20)
# Plot records on 100m as function of the marathon. Color using the clusters in run_km_sc
plot(run_record2$marathon, run_record2$X100m, col = run_km_sc$cluster,
xlab = "marathon", ylab ="100m", main = "Run Records")

# Compare the resulting clusters in a nice table
table(run_km$cluster, run_km_sc$cluster)
##
## 1 2 3 4 5
## 1 0 6 0 0 0
## 2 7 0 10 0 1
## 3 1 0 10 0 13
## 4 2 2 0 0 0
## 5 0 0 0 2 0
# Calculate Dunn's index: dunn_km_sc. Print it. (dunn() is provided by the clValid package.)
dunn_km_sc <- dunn(clusters = run_km_sc$cluster, Data = run_record_sc)
dunn_km_sc
## [1] 0.1453556
- The plot now shows the influence of the 100m records on the resulting clusters! Dunn’s index is clear about it: the standardized clusters are more compact and/or better separated!
Hierarchical Clustering
Which of the following statements is false?
- Single linkage hierarchical clustering can lead to undesirable chaining of your objects.
- K-means is overall computationally less intensive than bottom-up hierarchical clustering.
- The dendrogram is a tree that represents the hierarchical clustering. It tells you which clusters merged when.
- In complete linkage, the distance between two clusters is taken to be the minimal distance from any member of one cluster to any member of the other cluster.
Indeed, this last statement is the false one! In complete linkage, the distance between two clusters is actually the maximal distance from any member of one cluster to any member of the other; the minimal distance is what defines single linkage.
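A tiny numeric illustration of the two linkage definitions, using four made-up points on a line (not course data):
x <- matrix(c(0, 0, 1, 0, 10, 0, 12, 0), ncol = 2, byrow = TRUE)  # four 2-D points
d <- as.matrix(dist(x))   # pairwise Euclidean distances
min(d[1:2, 3:4])          # single linkage distance between {1,2} and {3,4}: 9
max(d[1:2, 3:4])          # complete linkage distance between the same clusters: 12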
Single Hierarchical Clustering
Let’s return to the Olympic records example. You’ve already clustered the countries using the k-means algorithm, but that gave you a fixed number of clusters. We’re interested in more!
In this exercise, you’ll apply the hierarchical method to cluster the countries. Of course, you’ll be working with the standardized data. Make sure to visualize your results!
# The dataset run_record_sc has been loaded in your workspace
run_record_sc <- read.csv("run_record_sc.csv")
# Calculate the Euclidean distance matrix of run_record_sc using dist(); it uses the Euclidean method by default. Assign it to run_dist.
# (The warning below presumably comes from a non-numeric country-name column in the CSV being coerced to NA.)
run_dist <- dist(run_record_sc, method = "euclidean")
## Warning in dist(run_record_sc, method = "euclidean"): NAs introduced by
## coercion
# Use the run_dist matrix to cluster your data hierarchically, based on single-linkage. Use hclust() with two arguments. Assign it to run_single.
run_single <- hclust(run_dist, method = "single")
# Cut the tree using cutree() at 5 clusters. Assign the result to memb_single.
memb_single <- cutree(run_single, k = 5)
# Make a dendrogram of run_single using plot(). If you pass a hierarchical clustering object to plot(), it will draw the dendrogram of this clustering.
plot(run_single)
# Draw boxes around the 5 clusters using rect.hclust(). Set the border argument to 2:6, for different colors.
rect.hclust(run_single, k=5, border=2:6)

- However, it appears that the two island nations Samoa and the Cook Islands, which are not known for their sporting performances, have each been placed in their own group. Maybe we’re dealing with some chaining issues? Let’s try a different linkage method in the next exercise!
Complete Hierarchical Clustering
The clusters of the last exercise weren’t truly satisfying. The single-linkage method appears to be placing each outlier in its own cluster. Let’s see if complete-linkage agrees with this clustering!
In this exercise, you’ll repeat some steps from the last exercise, but this time for the complete-linkage method. Visualize your results and compare with the single-linkage results. A model solution to the previous exercise is already available to inspire you. It’s up to you to add code for complete-linkage.
# run_record_sc is pre-loaded
# Code for single-linkage
#run_dist <- dist(run_record_sc, method = "euclidean")
#run_single <- hclust(run_dist, method = "single")
#memb_single <- cutree(run_single, 5)
#plot(run_single)
#rect.hclust(run_single, k = 5, border = 2:6)
# Apply hclust() to run_dist: run_complete
run_complete <- hclust(run_dist, method = "complete")
# Apply cutree() to run_complete: memb_complete
memb_complete <- cutree(run_complete, 5)
# Apply plot() on run_complete to draw the dendrogram
plot(run_complete)
# Apply rect.hclust() on run_complete to draw the boxes
rect.hclust(run_complete, k = 5, border = 2:6)

# table() the clusters memb_single and memb_complete. Put memb_single in the rows
table(memb_single, memb_complete)
## memb_complete
## memb_single 1 2 3 4 5
## 1 27 7 14 0 1
## 2 0 0 0 1 0
## 3 0 0 0 0 1
## 4 0 0 0 0 2
## 5 0 0 0 1 0
- Compare the two plots. The five clusters differ significantly from the single-linkage clusters. The one big cluster you had before is now split up into four medium-sized clusters. Have a look at the table you generated as well!
Hierarchical vs k-means
So you’ve clustered the countries based on their Olympic run performances using three different methods: k-means clustering, hierarchical clustering with single linkage and hierarchical clustering with complete linkage. You can ask yourself: which method returns the best separated and the most compact clusters?
Let’s use Dunn’s index. Remember, it returns the ratio of the minimum intercluster distance to the maximum intracluster diameter. The dunn() function in R requires the argument clusters, indicating the cluster partitioning, the Data, and a method to determine the distance. In this case that’s “euclidean”, which is the default.
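To make that definition concrete, here is a small illustrative re-implementation (a sketch, not the clValid code) that computes the same ratio from a distance matrix d and a cluster vector cl.
manual_dunn <- function(d, cl) {
  d <- as.matrix(d)
  inter <- min(d[outer(cl, cl, "!=")])            # smallest distance between points in different clusters
  intra <- max(sapply(unique(cl), function(k) {   # largest within-cluster diameter
    idx <- which(cl == k)
    if (length(idx) < 2) 0 else max(d[idx, idx])
  }))
  inter / intra
}
# usage sketch: manual_dunn(dist(your_data), your_clusters)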
Your job is to calculate Dunn’s index for all three clusterings and compare the clusters to each other. The R objects you calculated in the previous exercises are already available in the workspace.
# run_record_sc, run_km_sc, memb_single and memb_complete are pre-calculated
# Set random seed. Don't remove this line.
set.seed(100)
# Dunn's index for k-means: dunn_km
dunn_km <- dunn(clusters = run_km_sc$cluster, Data = run_record_sc)
## Warning in dist(Data, method = method): NAs introduced by coercion
# Dunn's index for single-linkage: dunn_single
dunn_single <- dunn(clusters = memb_single, Data = run_record_sc)
## Warning in dist(Data, method = method): NAs introduced by coercion
# Dunn's index for complete-linkage: dunn_complete
dunn_complete <- dunn(clusters = memb_complete, Data = run_record_sc)
## Warning in dist(Data, method = method): NAs introduced by coercion
# Compare k-means with single-linkage
table(run_km_sc$cluster, memb_single)
## memb_single
## 1 2 3 4 5
## 1 9 0 1 0 0
## 2 6 0 0 2 0
## 3 20 0 0 0 0
## 4 0 1 0 0 1
## 5 14 0 0 0 0
# Compare k-means with complete-linkage
table(run_km_sc$cluster, memb_complete)
## memb_complete
## 1 2 3 4 5
## 1 0 0 8 0 2
## 2 0 0 6 0 2
## 3 20 0 0 0 0
## 4 0 0 0 2 0
## 5 7 7 0 0 0
- The table shows that the clusters obtained from the complete linkage method are similar to those of k-means. Now it’s up to you to compare the values of Dunn’s index and decide which method returns the best separated and most compact clusters.
Interpreting Dunn’s Index
Do you remember what Dunn’s index measured? Let’s put that to the test!
Which of the following statements about your clusterings of the countries in run_record_sc is correct? Use the results of Dunn’s index, found in dunn_km, dunn_single and dunn_complete in your workspace.
dunn_km
## [1] 0.1453556
dunn_single
## [1] 0.2921946
dunn_complete
## [1] 0.1808437
- The complete-linkage method returned the lowest ratio of minimal intercluster distance to minimal cluster diameter. (Remember, Dunn’s index returns the ratio of minimal intercluster distance to maximal cluster diameter.)
- Based on Dunn’s index, the complete-linkage method returned the most compact and separated clusters. (Remember, Dunn’s index returns the ratio of minimal intercluster distance to maximal cluster diameter. A lower Dunn’s index indicates that the minimal intercluster distance decreased and/or the maximal cluster diameter increased, so at least one pair of clusters became less separated and/or some cluster became less compact.)
- The single-linkage method returned the highest ratio of minimal intercluster distance to maximal cluster diameter. (Correct! Are you satisfied with this result? The single-linkage method that caused the chaining effects actually returned the most compact and separated clusters. If you think about it, this can make sense: single linkage puts every outlier in its own cluster, which increases the intercluster distances and reduces the diameters, hence giving a higher Dunn’s index. You could therefore conclude that the single-linkage method did a fine job of identifying the outliers; a quick check of this is sketched below. However, if you’d like to report your clusters to the local newspapers, then complete linkage or k-means are probably the better choice!)
- Based on Dunn’s index, the single-linkage method returned the least compact and separated clusters. (Remember, Dunn’s index returns the ratio of minimal intercluster distance to maximal cluster diameter. A higher Dunn’s index indicates that the minimal intercluster distance increased and/or the maximal cluster diameter decreased, so at least one pair of clusters became more separated and/or the clusters became more compact overall.)
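As a quick check of the outlier argument in the correct option above, a small sketch (variable names chosen for illustration) lists the singleton clusters produced by single linkage and the rows of run_record_sc they contain.
sizes <- table(memb_single)                        # countries per single-linkage cluster
singletons <- as.integer(names(sizes)[sizes == 1]) # cluster labels of size one
which(memb_single %in% singletons)                 # row numbers of the isolated countries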
Clustering US states based on criminal activity
You’ve seen that different clustering methods can return entirely different clusters, each with their own interpretation and uses. It’s time to put your skills, both the programming and the interpretation, to the test!
Your client has provided you with a dataset, crime_data, containing info on the crimes committed in each of the 50 US states and the percentage of urban population (Source: Edureka). He’d like you to group the states into 4 clusters. He didn’t specify which similarity measure to use, but the Euclidean distance seems acceptable, don’t you agree?
You decide to try out two techniques: k-means and single-linkage hierarchical clustering. You then want to compare the results by calculating the Dunn’s indices to make a conclusion. Which clustering will you deliver to your client?
# crime_data2 is pre-loaded (presumably the numeric columns of crime_data, without the state names)
crime_data2 <- read.csv("crime_data2.csv")
# Set random seed. Don't remove this line.
set.seed(1)
# Scale the dataset: crime_data_sc
crime_data_sc <- scale(crime_data2)
# Perform k-means clustering: crime_km
crime_km <- kmeans(crime_data_sc, 4, nstart = 20)
# Perform single-linkage hierarchical clustering
## Calculate the distance matrix: dist_matrix
dist_matrix <- dist(crime_data_sc, method = "euclidean")
## Calculate the clusters using hclust(): crime_single
crime_single <- hclust(dist_matrix, method = "single")
## Cut the clusters using cutree(): memb_single
memb_single <- cutree(crime_single, k = 4)
# Calculate the Dunn's index for both clusterings: dunn_km, dunn_single
dunn_km <- dunn(clusters = crime_km$cluster, Data = crime_data_sc)
dunn_single <- dunn(clusters = memb_single, Data = crime_data_sc)
# Print out the results
dunn_km
## [1] 0.1604403
dunn_single
## [1] 0.2438734
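Before deciding which clustering to deliver, a possible final check (a sketch, following the same reasoning as in the run-records example) is to compare the cluster sizes: if single linkage again isolates a few outlier states in tiny clusters, its higher Dunn’s index may be less useful for reporting than the k-means partition.
table(crime_km$cluster)  # sizes of the k-means clusters
table(memb_single)       # sizes of the single-linkage clusters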