Topics for today!

  1. Overview of Unsupervised Learning
  2. Kmeans algorithm and ideas
  3. Kmeans in R
  4. The Elbow Method
  5. Plotting clusters

Data for today

setwd("~/Desktop/R Materials/mih140/Lecture 24 - Unsupervised Learning II")
cereal = read.table("BreakfastCereals.txt", sep = "\t", header = T) # 43 Cereals from Europe
# Let's look at the cereal data
summary(cereal)
##     Brand           Manufacturer          Calories        Protein     
##  Length:43          Length:43          Min.   : 50.0   Min.   :1.000  
##  Class :character   Class :character   1st Qu.:100.0   1st Qu.:2.000  
##  Mode  :character   Mode  :character   Median :110.0   Median :2.000  
##                                        Mean   :107.9   Mean   :2.465  
##                                        3rd Qu.:110.0   3rd Qu.:3.000  
##                                        Max.   :160.0   Max.   :6.000  
##       Fat             Sodium          Fiber           Carbs      
##  Min.   :0.0000   Min.   :  0.0   Min.   :0.000   Min.   : 1.00  
##  1st Qu.:0.0000   1st Qu.:145.0   1st Qu.:0.500   1st Qu.:12.00  
##  Median :1.0000   Median :190.0   Median :1.000   Median :14.00  
##  Mean   :0.9767   Mean   :180.5   Mean   :1.714   Mean   :14.26  
##  3rd Qu.:1.5000   3rd Qu.:220.0   3rd Qu.:2.850   3rd Qu.:17.00  
##  Max.   :3.0000   Max.   :320.0   Max.   :9.000   Max.   :22.00  
##      Sugar          Potassium          Group      
##  Min.   : 0.000   Min.   : 15.00   Min.   :1.000  
##  1st Qu.: 3.000   1st Qu.: 37.50   1st Qu.:1.000  
##  Median : 8.000   Median : 60.00   Median :2.000  
##  Mean   : 7.605   Mean   : 84.42   Mean   :1.744  
##  3rd Qu.:12.000   3rd Qu.:110.00   3rd Qu.:2.000  
##  Max.   :15.000   Max.   :320.00   Max.   :3.000

Topic 1: Overview of Unsupervised Learning

Motivating Question: What's a good grouping of the cereals? For instance, you might wonder which cereals are similarly nutritious so you can categorize them as Healthy, Okay, or Pure Sugar. This sort of classification can determine how advertisers market their cereals, how they should be displayed, etc. It is also just interesting.

To answer these questions we will use Unsupervised Learning (here, in the form known as Clustering)

Topic 2: KMeans

K-means is an algorithm that identifies clusters of similar points. It takes as input k, the desired number of clusters.

Algorithm: kmeans()

  0. Choose the number of clusters to find, k.
  1. Randomly place each observation into one of the k clusters.
  2. Find the center of each cluster.
  3. Reassign observations to clusters based on the nearest center.
     3a. If some observations changed clusters, go to 2.
     3b. If no observations changed clusters, STOP.
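
To make these steps concrete, here is a minimal from-scratch sketch of the loop in R on made-up 2-D data. The names X, k, and labels are just for illustration; in practice you would call the built-in kmeans(), which also handles details like empty clusters.

# A from-scratch sketch of the loop above, on toy data (illustration only)
set.seed(1)
X = matrix(rnorm(40), ncol = 2)                # 20 made-up 2-D observations
k = 3
labels = sample(1:k, nrow(X), replace = TRUE)  # Step 1: random assignment

repeat {
  # Step 2: the center of each cluster is the mean of its points
  centers = t(sapply(1:k, function(j) colMeans(X[labels == j, , drop = FALSE])))
  # Step 3: distances from every observation to every center, then reassign
  d = as.matrix(dist(rbind(centers, X)))[-(1:k), 1:k]
  new_labels = apply(d, 1, which.min)
  if (all(new_labels == labels)) break         # Step 3b: nothing moved, STOP
  labels = new_labels                          # Step 3a: something moved, go again
}
labels                                         # final cluster assignments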

Some notes about kmeans - scaling data

Note: You must scale your numeric features before running kmeans if you want them to be equally important in measuring similarity.

Let's illustrate this with a short example. Note that kmeans uses the distance between points to measure similarity. This means that features on large scales will disproportionately dominate the distance metric unless you scale them!

Example: Suppose you look at just the calories and protein of a cereal, and for some reason the calories are measured in plain calories rather than the kcal we are used to. Suppose you have the following three cereals:

Ex. cereal 1: 10 p, 125000 cal
Ex. cereal 2: 35 p, 120000 cal
Ex. cereal 3: 11 p, 121000 cal

d(c1, c2) = sqrt{(10 - 35)^2 + (125000 - 120000)^2} ≈ 5000, a big number, thus dissimilar
d(c1, c3) = sqrt{(10 - 11)^2 + (125000 - 121000)^2} ≈ 4000, a big number, thus dissimilar
d(c2, c3) = sqrt{(35 - 11)^2 + (120000 - 121000)^2} ≈ 1000, a small number, thus similar

Then kmeans will decide that Cereal 2 and Cereal 3 are similar, even though we should think that Cereal 1 and Cereal 3 are similar! This is why we must scale our variables.
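
A quick way to see the fix in R (the data frame ex below just encodes the three made-up cereals): compare the pairwise distances before and after scaling.

# The three example cereals from above
ex = data.frame(Protein = c(10, 35, 11), Calories = c(125000, 120000, 121000))
dist(ex)         # unscaled: calories dominate, so cereals 2 and 3 look closest
dist(scale(ex))  # scaled: cereals 1 and 3 now look closest, as they should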

Topic 3: Kmeans in R

cereal.std = data.frame(scale(cereal[,3:10])) # Standardize the numeric columns: mean 0, sd 1
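
If you want to sanity-check the scaling, each column of cereal.std should now have mean (essentially) 0 and standard deviation 1:

round(colMeans(cereal.std), 10) # all ~0
apply(cereal.std, 2, sd)        # all 1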

# As an example, let's look at k = 4
kmeans_4 = kmeans(cereal.std, centers = 4)
kmeans_4
## K-means clustering with 4 clusters of sizes 10, 4, 13, 16
## 
## Cluster means:
##     Calories    Protein        Fat      Sodium       Fiber      Carbs
## 1  0.5848156  0.6014750  0.7776716  0.08249710  1.21495417 -0.2010289
## 2 -1.7348303  0.2331667 -0.5948607 -2.27821163 -0.02164945 -1.1171310
## 3 -0.1735302  0.3748237 -0.4508888  0.69330984 -0.18303851  1.0240542
## 4  0.2091911 -0.7387578  0.0290176 -0.04532203 -0.60521520 -0.4271182
##        Sugar  Potassium
## 1  0.3957733  1.4231111
## 2 -1.2355099 -0.2370093
## 3 -0.9811518 -0.2995511
## 4  0.8587050 -0.5868069
## 
## Clustering vector:
##  [1] 4 3 4 4 4 4 3 4 3 1 1 3 1 3 4 3 4 1 4 3 4 1 3 4 4 2 1 3 1 4 1 3 3 1 3 4 3 4
## [39] 4 1 2 2 2
## 
## Within cluster sum of squares by cluster:
## [1] 64.47685 28.15471 43.22834 29.82076
##  (between_SS / total_SS =  50.7 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"
# kmeans_4$cluster gets the vector of cluster labels
# kmeans_4$tot.withinss gets the total within-cluster sum of squares (the variance within the clusters)
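
A couple of optional ways to poke at the fit, using just the components listed above:

table(kmeans_4$cluster)               # how many cereals landed in each cluster
split(cereal$Brand, kmeans_4$cluster) # which brands fell into which cluster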

Now we have a clustering and it’s somewhat interesting. Question: Was k = 4 the right choice for the number of clusters? How can we decide? Ans (in this class): We will use the elbow method!

Idea: The best number of clusters is the one where, if you had one fewer cluster you'd be missing out, and if you had one more cluster that extra cluster would be kind of pointless. To capture this idea we plot the total within-cluster sum of squares against the number of clusters and look for where this measure stops dropping quickly.

Topic 4: The elbow method for choosing the best k

# To find the best value of k, let's plot the within-cluster variance against the number of clusters.
# We'll try k from 1 to 15. nstart = 20 reruns kmeans from 20 random starting
# assignments and keeps the best fit, which makes the elbow curve less noisy.
cluster_perf = rep(0, 15)

for(i in 1:15){
  curr_clust = kmeans(cereal.std, centers = i, nstart = 20)
  cluster_perf[i] = curr_clust$tot.withinss
}
plot(1:15, cluster_perf, xlab = "# of Clusters", ylab = "Tot SS", main = "# Clust vs Performance")
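
# Optional: type = "b" connects the points, which can make the elbow easier to spot
plot(1:15, cluster_perf, type = "b", xlab = "# of Clusters", ylab = "Tot SS", main = "# Clust vs Performance")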

# Maybe we conclude that the best number of clusters is 6

best_clust = kmeans(cereal.std, centers = 6, nstart = 20)

Finally, once we have our nice clustering, we can try to understand and visualize it using a number of methods.

Topic 5: Understanding and plotting your clusters with aggregate, pairs

# To understand what's going on in your clusters (here the 6-cluster solution, best_clust)
boxplot(cereal.std$Protein ~ best_clust$cluster)
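
# Optional cosmetic variant with axis labels, so the plot is self-explanatory
boxplot(cereal.std$Protein ~ best_clust$cluster, xlab = "Cluster", ylab = "Protein (standardized)")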

# More generally to see all the information use aggregate
# SYNTAX: aggregate(data.frame, list_of_labels, funct)
aggregate(cereal[,3:10], by = list(best_clust$cluster), median) # Note: here we are looking at the unnormalized data
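
# Optional: naming the grouping variable labels the output column "cluster" instead of the default "Group.1"
aggregate(cereal[,3:10], by = list(cluster = best_clust$cluster), median)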
# Another way
pairs(cereal[,3:5], col = as.factor(best_clust$cluster)) # Another way to look at the points, colored by their cluster labels

Note: Sometimes points are stacked on top of one another, making them hard to visualize. We can solve this using the jitter function, which adds a small amount of random noise to each observation.
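
For intuition, here is jitter on a tiny vector; the numbers in the comment are just one possible outcome, since the noise is random.

jitter(c(2, 2, 2, 3, 3)) # e.g. 2.03 1.96 2.01 2.99 3.05 -- the ties no longer overlap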

# Can add some jitter to make plots easier to read
plot(data = cereal, Calories ~ Protein, col = best_clust$cluster) # Points are stacked and unreadable

cereal[,3] = jitter(cereal[,3])
cereal[,4] = jitter(cereal[,4])
cereal[,5] = jitter(cereal[,5])

plot(data = cereal, Calories ~ Protein, col = best_clust$cluster) # After jittering, much easier to see!

# Similarly pairs gets better too!
pairs(cereal[,3:5], col = as.factor(best_clust$cluster))