setwd("~/Desktop/R Materials/mih140/Lecture 24 - Unsupervised Learning II")
cereal = read.table("BreakfastCereals.txt", sep = "\t", header = T) # 43 Cereals from Europe
# Let's look at the cereal data
summary(cereal)
## Brand Manufacturer Calories Protein
## Length:43 Length:43 Min. : 50.0 Min. :1.000
## Class :character Class :character 1st Qu.:100.0 1st Qu.:2.000
## Mode :character Mode :character Median :110.0 Median :2.000
## Mean :107.9 Mean :2.465
## 3rd Qu.:110.0 3rd Qu.:3.000
## Max. :160.0 Max. :6.000
## Fat Sodium Fiber Carbs
## Min. :0.0000 Min. : 0.0 Min. :0.000 Min. : 1.00
## 1st Qu.:0.0000 1st Qu.:145.0 1st Qu.:0.500 1st Qu.:12.00
## Median :1.0000 Median :190.0 Median :1.000 Median :14.00
## Mean :0.9767 Mean :180.5 Mean :1.714 Mean :14.26
## 3rd Qu.:1.5000 3rd Qu.:220.0 3rd Qu.:2.850 3rd Qu.:17.00
## Max. :3.0000 Max. :320.0 Max. :9.000 Max. :22.00
## Sugar Potassium Group
## Min. : 0.000 Min. : 15.00 Min. :1.000
## 1st Qu.: 3.000 1st Qu.: 37.50 1st Qu.:1.000
## Median : 8.000 Median : 60.00 Median :2.000
## Mean : 7.605 Mean : 84.42 Mean :1.744
## 3rd Qu.:12.000 3rd Qu.:110.00 3rd Qu.:2.000
## Max. :15.000 Max. :320.00 Max. :3.000
Motivating Question: What's a good grouping of the cereals? For instance, you might wonder which cereals are similarly nutritious so you can categorize them as Healthy, Okay, or Pure Sugar. This sort of classification can determine how advertisers market their cereals, how they should be displayed on shelves, etc. It is also just interesting.
To answer these questions we will use clustering, a form of unsupervised learning.
K-means is an algorithm that identifies clusters of similar points. It takes as input k, the desired number of clusters.
Algorithm: kmeans()
0. Choose k, the number of clusters to find.
1. Randomly place each observation into one of the k clusters.
2. Find the center of each cluster.
3. Reassign observations to clusters based on the nearest center.
3a. If some observations changed clusters, go to 2.
3b. If no observations changed clusters, STOP.
A toy implementation of this loop is sketched below.
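Here is a minimal sketch of that loop in base R, purely for illustration: the function name my_kmeans is made up, and it skips details the real kmeans() handles, such as empty clusters and multiple random restarts.
# Toy k-means: repeat steps 2-3 above until no labels change
my_kmeans = function(X, k, max_iter = 100){
  X = as.matrix(X)
  labels = sample(1:k, nrow(X), replace = TRUE)  # 1. random initial assignment
  for(iter in 1:max_iter){
    # 2. compute the center (column means) of each cluster
    centers = sapply(1:k, function(j) colMeans(X[labels == j, , drop = FALSE]))
    # 3. reassign each observation to its nearest center (squared Euclidean distance)
    dists = apply(centers, 2, function(ctr) rowSums(sweep(X, 2, ctr)^2))
    new_labels = max.col(-dists)          # column index of the smallest distance per row
    if(all(new_labels == labels)) break   # 3b. no changes -> STOP
    labels = new_labels                   # 3a. some changed -> back to step 2
  }
  labels
}
# e.g. my_kmeans(cereal.std, 4) should give labels broadly similar to kmeans()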
Note: You must scale your numeric features before running kmeans if you want them to be equally important in measuring similarity.
Let's illustrate this with a short example. Note that kmeans uses the distance between points to measure similarity. This means that features on large scales will disproportionately dominate the distance metric unless you scale them!
Example: Suppose you look at just the calories and protein of some cereals, and for some reason the calories are measured in plain calories rather than the kcal we are used to. Suppose you have the following three cereals:
Ex. Cereal 1: protein = 10, calories = 125000
Ex. Cereal 2: protein = 35, calories = 120000
Ex. Cereal 3: protein = 11, calories = 121000
d(c1, c2) = sqrt((10 - 35)^2 + (125000 - 120000)^2) ≈ 5000, a big number, thus dissimilar
d(c1, c3) = sqrt((10 - 11)^2 + (125000 - 121000)^2) ≈ 4000, a big number, thus dissimilar
d(c2, c3) = sqrt((35 - 11)^2 + (120000 - 121000)^2) ≈ 1000, a (relatively) small number, thus similar
Then kmeans will decide that Cereal 2 and Cereal 3 are the similar pair, even though intuitively Cereal 1 and Cereal 3 (nearly identical protein and calories) are the similar ones! This is why we must scale our variables.
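We can check this intuition directly with dist(), which computes pairwise Euclidean distances. The toy data frame below is made up to match the example above:
# Distances between the three toy cereals, before and after scaling
toy = data.frame(Protein = c(10, 35, 11), Calories = c(125000, 120000, 121000))
dist(toy)        # cereals 2 and 3 look closest -- the calorie column dominates
dist(scale(toy)) # after scaling, cereals 1 and 3 are closest, matching intuition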
cereal.std = data.frame(scale(cereal[,3:10]))
# As an example, let's try k = 4
kmeans_4 = kmeans(cereal.std, centers = 4)
kmeans_4
## K-means clustering with 4 clusters of sizes 10, 4, 13, 16
##
## Cluster means:
## Calories Protein Fat Sodium Fiber Carbs
## 1 0.5848156 0.6014750 0.7776716 0.08249710 1.21495417 -0.2010289
## 2 -1.7348303 0.2331667 -0.5948607 -2.27821163 -0.02164945 -1.1171310
## 3 -0.1735302 0.3748237 -0.4508888 0.69330984 -0.18303851 1.0240542
## 4 0.2091911 -0.7387578 0.0290176 -0.04532203 -0.60521520 -0.4271182
## Sugar Potassium
## 1 0.3957733 1.4231111
## 2 -1.2355099 -0.2370093
## 3 -0.9811518 -0.2995511
## 4 0.8587050 -0.5868069
##
## Clustering vector:
## [1] 4 3 4 4 4 4 3 4 3 1 1 3 1 3 4 3 4 1 4 3 4 1 3 4 4 2 1 3 1 4 1 3 3 1 3 4 3 4
## [39] 4 1 2 2 2
##
## Within cluster sum of squares by cluster:
## [1] 64.47685 28.15471 43.22834 29.82076
## (between_SS / total_SS = 50.7 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
# kmeans_4$cluster gets the vector of cluster labels
# kmeans_4$tot.withinss gets the total within-cluster sum of squares (a measure of how tight the clusters are)
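For instance, a quick way to inspect a clustering is to split the cereal names (the Brand column) by their cluster labels and count how many cereals fall in each cluster:
# Which cereals landed in which cluster, and how big is each cluster?
split(cereal$Brand, kmeans_4$cluster)
table(kmeans_4$cluster)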
Now we have a clustering, and it's somewhat interesting. Question: Was k = 4 the right choice for the number of clusters? How can we decide? Answer (in this class): we will use the elbow method!
Idea: The best number of clusters is the one where, if you had one fewer cluster you'd be missing out, and if you had one more cluster that extra cluster would be kind of pointless. To capture this idea we plot the total within-cluster sum of squares against the number of clusters and look for the point where this measure stops dropping quickly (the "elbow").
# To find the best value of k, let's plot the within-cluster variance against the number of clusters.
# We'll try k from 1 to 15
cluster_perf = rep(0, 15)
set.seed(1) # kmeans starts from a random assignment, so fix the seed for a reproducible elbow plot
for(i in 1:15){
  curr_clust = kmeans(cereal.std, centers = i, nstart = 20) # nstart = 20 keeps the best of 20 random starts
  cluster_perf[i] = curr_clust$tot.withinss
}
plot(1:15, cluster_perf, type = "b", xlab = "# of Clusters", ylab = "Total Within-Cluster SS", main = "# Clust vs Performance")
# Maybe we conclude from the elbow plot that the best choice is k = 6
best_clust = kmeans(cereal.std, centers = 6, nstart = 20)
Finally, once we have a clustering we like, we can try to understand and visualize it using a number of methods.
# To understand what's going on in your clusters (here best_clust, the k = 6 solution),
# look at how each feature varies across clusters. For example, protein:
boxplot(cereal.std$Protein ~ best_clust$cluster, xlab = "Cluster", ylab = "Protein (standardized)")
# More generally to see all the information use aggregate
# SYNTAX: aggregate(data.frame, list_of_labels, funct)
aggregate(cereal[,3:10], by = list(cluster = best_clust$cluster), median) # Note: we are looking at the unnormalized data here
# Another way: scatterplots of each pair of features, colored by cluster label
pairs(cereal[,3:5], col = as.factor(best_clust$cluster))
Note: Sometimes points are stacked on top of one another, making them hard to visualize. We can mitigate this using the jitter function, which adds a small amount of random noise to each observation.
# Can add some jitter to make plots easier to read
plot(Calories ~ Protein, data = cereal, col = best_clust$cluster) # Points are stacked and hard to read
# Jitter the Calories, Protein, and Fat columns (note: this overwrites the
# original values with slightly noisy ones -- fine for plotting purposes)
cereal[,3] = jitter(cereal[,3])
cereal[,4] = jitter(cereal[,4])
cereal[,5] = jitter(cereal[,5])
plot(Calories ~ Protein, data = cereal, col = best_clust$cluster) # After jittering, much easier to see!
# Similarly pairs gets better too!
pairs(cereal[,3:5], col = as.factor(best_clust$cluster)) # Same pairs plot, now with jittered points