Lab 3: Clustering

Introduction.

The purpose of this lab is to practice using the K-means algorithm to cluster similar vectors. When working with datasets, be sure to use the “scale” function (as we did in class) unless otherwise instructed. Unless otherwise instructed, also set “nstart = 25” when using the kmeans() function: the algorithm then tries 25 random sets of initial group representatives and keeps the best result, which makes the result less sensitive to the random starting point. You will need the “stats” and “factoextra” packages installed and loaded for this lab (see class notes). Once the packages are installed you don’t need to install them again; simply load them by running the lines of code below after uncommenting them.

# library(stats)
# library(factoextra)
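
As a reminder of the workflow described above, here is a minimal sketch of scaling a dataset and calling kmeans() with nstart = 25. The built-in USArrests dataset, the choice of 4 groups, and the seed are assumptions made only for illustration; they are not part of this lab.

```r
# Illustrative only: USArrests, centers = 4 and the seed are assumptions.
data("USArrests")
df = scale(USArrests)     # center each column at 0 and rescale to sd 1

set.seed(123)             # makes the random starts reproducible
km = kmeans(df, centers = 4, nstart = 25)   # keeps the best of 25 random starts

km$size          # number of observations in each group
km$centers       # final group representatives (in scaled units)
km$tot.withinss  # total within-group sum of squared distances
```

Note that kmeans() reports tot.withinss as a sum of squared distances. If the definition of J^clust in your class notes averages over the number of vectors, divide by nrow(df).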

Use the following vectors in Exercises 1-6.

x1 = c(0,1)
x2 = c(1,1)
x3 = c(0,0)
x4 = c(4,5)
x5 = c(5,3)
x6 = c(0,7)
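
kmeans() expects the data as a matrix or data frame with one vector per row. One possible way to stack the six vectors is sketched below; the name X is just a placeholder.

```r
x1 = c(0,1)
x2 = c(1,1)
x3 = c(0,0)
x4 = c(4,5)
x5 = c(5,3)
x6 = c(0,7)

# Stack the vectors so that each one becomes a row of a 6 x 2 matrix.
X = rbind(x1, x2, x3, x4, x5, x6)

dim(X)        # rows = vectors, columns = coordinates
rownames(X)   # rbind() keeps the vector names as row names
```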

Exercise 1.

Suppose group representative vectors are randomly assigned to be z1 = x1, z2 = x2, and z3 = x5. This implies that there are __ total groups.

Exercise 2.

Calculate and report J^clust for the groupings that result from the representative vectors in the previous exercise. (Hint: do this by hand like we did in class.)
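
If you want to check a hand calculation, the sketch below shows one way to do Step 1 (assign each vector to its nearest representative) and then add up the squared distances. It uses hypothetical toy points P and representatives Z, not the lab's vectors. Whether J^clust is the sum or the average of these squared distances depends on the definition in your class notes, so both are shown.

```r
# Hypothetical toy data (NOT the lab's vectors): 4 points, 2 representatives.
P = rbind(c(0, 0), c(0, 2), c(6, 0), c(6, 2))
Z = rbind(c(0, 1), c(5, 1))

# Squared distance from every point (rows) to every representative (columns).
sq_dist = sapply(1:nrow(Z), function(j) rowSums(sweep(P, 2, Z[j, ])^2))

group  = apply(sq_dist, 1, which.min)        # Step 1: nearest representative
min_sq = sq_dist[cbind(1:nrow(P), group)]    # squared distance to own representative

sum(min_sq)    # sum of squared distances
mean(min_sq)   # the same quantity averaged over the number of points
```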

Exercise 3.

Now run the K-means algorithm once; that is, complete one full iteration (iteration 1), treating the grouping from the previous exercise as Step 1. The updated version of the third group representative is now z3 = <,>. (Hint: do this by hand like we did in class.)
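
The update step replaces each group representative with the mean of the vectors currently assigned to its group. A minimal sketch of that step, again on hypothetical toy points (P and the assignments in group are assumptions, not the lab's data):

```r
# Hypothetical toy data (NOT the lab's vectors).
P = rbind(c(0, 0), c(0, 2), c(6, 0), c(6, 2))
group = c(1, 1, 2, 2)   # group assignments assumed to come from Step 1

# Step 2: each representative becomes the mean of the vectors in its group.
Z_new = t(sapply(sort(unique(group)),
                 function(j) colMeans(P[group == j, , drop = FALSE])))
Z_new   # one row per group
```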

Exercise 4.

Now that you have updated the group representatives in the previous exercise, do Step 1 of a second iteration of the K-means algorithm. There are now ___ vectors in group 2.

Exercise 5.

If you ran the K-means algorithm another time, the vectors ____ (will, will not) change groups because the groupings after the previous exercise appear to be _____ (optimal, sub-optimal).

Exercise 6.

Use the kmeans() function to cluster the 6 vectors into groups of similar vectors. Do not use the scale() function. The final J^clust value for these groupings is ___.

Exercise 7.

Use the kmeans() function to cluster the Superior Court Judges in the “USJudgeRatings” dataset into 4 groups. One of the final groups contains “Cohen”, “Bracken”, “Sidor” and “______” (just last name).
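
To see which judges end up together, it helps to list the row names by group. The sketch below is one possible way to do that; the seed and the object names df_judges and km_judges are assumptions, and the group numbering can differ from run to run.

```r
data("USJudgeRatings")
df_judges = scale(USJudgeRatings)

set.seed(123)   # assumed seed, only for reproducibility
km_judges = kmeans(df_judges, centers = 4, nstart = 25)

# The row names of USJudgeRatings hold the judges' names;
# split() lists those names separately for each group.
split(names(km_judges$cluster), km_judges$cluster)
```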

The Iris Dataset.

The “Iris” dataset is a classic in statistics; it even has its own Wikipedia page (https://en.wikipedia.org/wiki/Iris_flower_data_set). Ronald Fisher (widely considered the father of modern statistics; he developed many of the techniques that are fundamental to the discipline) introduced it to demonstrate the effectiveness of his Linear Discriminant Analysis (LDA) method. LDA is a fundamental classification method: for the iris problem it uses the 4 numerical variables as inputs and predicts which iris species a flower belongs to. It is a Supervised Learning technique because it makes predictions using labeled data (the dataset includes the observed iris species for each flower). We will revisit this dataset in the second half of the class when we study classification methods.

In this exercise we will remove the labels (strip the “Species” variable from the dataset) and try to group the flowers without them. This may seem like a strange thing to do, but make sure you understand why we need to do this in order to use this dataset to practice clustering!

To remove the species label from the dataset run these lines of code when you import and scale the data:

data("iris") 
df_iris = scale(iris[,-5]) #instead of scale(iris)

The index [,-5] drops the 5th column (“Species”) from “iris”, so only the 4 numerical columns are passed to scale().
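
A quick way to confirm that the label column is gone and that scaling worked as expected is sketched below, using only base R.

```r
data("iris")
df_iris = scale(iris[, -5])

dim(iris)           # original data, including the Species column
dim(df_iris)        # scaled copy without the Species column
colnames(df_iris)   # only the numerical measurements remain

round(colMeans(df_iris), 10)   # each column is centered at 0
apply(df_iris, 2, sd)          # each column has standard deviation 1
```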

Exercise 8.

Cluster the scaled iris data (df_iris) with kmeans() using k = 2 and then k = 3. Which value of k appears to give more distinctly separated clusters?
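
One possible way to compare the two choices visually is with fviz_cluster() from the “factoextra” package, sketched below. The seed and the object names km2 and km3 are assumptions, and the plots are only drawn if “factoextra” is installed.

```r
data("iris")
df_iris = scale(iris[, -5])

set.seed(123)   # assumed seed, only for reproducibility
km2 = kmeans(df_iris, centers = 2, nstart = 25)
km3 = kmeans(df_iris, centers = 3, nstart = 25)

# Plot each clustering; with 4 variables, fviz_cluster() draws the points
# on the first two principal components.
if (requireNamespace("factoextra", quietly = TRUE)) {
  library(factoextra)
  print(fviz_cluster(km2, data = df_iris))
  print(fviz_cluster(km3, data = df_iris))
}
```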

Exercise 9.

We modified this dataset so it doesn’t have labels, but the original data was labeled. Given what we know about those labels, we know that there are actually ___ distinct groups in this dataset; one for each value of the _____ variable. (Hint: you can view the dataset directly by typing “iris” into your console.)

Exercise 10.

If we run kmeans() with 5 groups we get a final value of J^clust = ___.

Exercise 11.

Extra Credit. A character in this scene from the movie “Don’t Look Up” refers to an “algorithm” that sounds suspiciously like K-means with k = ___. https://www.youtube.com/watch?v=Ik7OBoQ_Kcw