library(readr)
library(dplyr)
library(ggplot2)
library(stringr)

Chapter 1: Unsupervised learning in R

1.1: Quiz - Identify Clustering Problems

The k-means algorithm is one common approach to clustering. Learn how the algorithm works under the hood, implement k-means clustering in R, visualize and interpret the results, and select the number of clusters when it’s not known ahead of time. By the end of the chapter, you’ll have applied k-means clustering to a fun “real-world” dataset!

Identify clustering problems

Which of the following are clustering problems?

  1. Determining how many features it takes to describe most of the variability in data

  2. Determining the natural groupings of houses for sale based on size, number of bedrooms, etc.

  3. Visualizing 13 dimensional data (data with 13 features)

  4. Determining if there are common patterns in the demographics of people at a commerce site

  5. Predicting if someone will click on a web advertisement

Answer the question

50 XP

Possible Answers

  1. 1, 3, and 5

  2. 2 and 3

  3. 1, 2, and 4

  4. 2 and 4 [ans]

  5. All 5 are clustering problems

1.2: k-means clustering

We have created some two-dimensional data and stored it in a variable called x in your workspace. The scatter plot on the right is a visual representation of the data.

In this exercise, your task is to create a k-means model of the x data using 3 clusters, then to look at the structure of the resulting model using the summary() function.

Instructions

100 XP

  • Fit a k-means model to x using 3 centers and run the k-means algorithm 20 times. Store the result in km.out.

  • Inspect the result with the summary() function.

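# Create the k-means model: km.out
# Note: the column names in the printed model (X3.37095845, X1.995379232)
# suggest the CSV has no header row, so read.csv() treated the first
# observation as a header; header = FALSE would likely keep all rows.
x <- read.csv("unsupervised_learning_data_x.csv")
km.out <- kmeans(x, centers = 3, nstart = 20)
# Inspect the result
summary(km.out)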
             Length Class  Mode   
cluster      299    -none- numeric
centers        6    -none- numeric
totss          1    -none- numeric
withinss       3    -none- numeric
tot.withinss   1    -none- numeric
betweenss      1    -none- numeric
size           3    -none- numeric
iter           1    -none- numeric
ifault         1    -none- numeric
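
The summary() output above only lists the length of each component, because a k-means model object is stored as a list. As a rough sketch of what those lengths mean, the example below fits the same kind of model to synthetic data and inspects a few components directly. The matrix x_demo and its three Gaussian blobs are assumptions made for illustration; they are not the course data.

```r
# Synthetic two-dimensional data: three blobs of 100 points each.
# x_demo is a stand-in for illustration, not the course's x.
set.seed(1)
x_demo <- rbind(
  matrix(rnorm(200, mean = 0), ncol = 2),
  matrix(rnorm(200, mean = 5), ncol = 2),
  matrix(rnorm(200, mean = -5), ncol = 2)
)

# 3 centers; nstart = 20 tries 20 random initializations and keeps
# the fit with the lowest total within-cluster sum of squares.
km_demo <- kmeans(x_demo, centers = 3, nstart = 20)

length(km_demo$cluster)  # one label per row of x_demo
dim(km_demo$centers)     # one row per cluster, one column per feature
km_demo$size             # number of points in each cluster
```

Read this way, the length of 6 reported for centers in the summary above corresponds to 3 clusters times 2 features.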

1.3: Results of kmeans()

The kmeans() function produces several outputs. In the video, we discussed one output of modeling, the cluster membership.

In this exercise, you will access the cluster component directly. This is useful anytime you need the cluster membership for each observation of the data used to build the clustering model. A future exercise will show an example of how this cluster membership might be used to help communicate the results of k-means modeling.

k-means models also have a print method that gives a human-friendly summary of the basic modeling results. You can call it with print() or by simply typing the name of the model.

Instructions

100 XP

  • The k-means model you built in the last exercise, km.out, is still available in your workspace.

  • Print a list of the cluster membership to the console.

  • Use a print method to print out the km.out model.

# Print the cluster membership component of the model
km.out$cluster
  [1] 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 2 3 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 2 2 2 2 3 2 2 2 2 2
 [49] 2 2 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 [97] 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[145] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[193] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[241] 1 1 1 1 1 1 1 1 1 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 3 3 2 3 3 3 3 3 3 2 3 3
[289] 3 3 3 3 2 3 3 3 2 3 3
# Print the km.out object
km.out
K-means clustering with 3 clusters of sizes 150, 97, 52

Cluster means:
  X3.37095845 X1.995379232
1  -5.0556758   1.96991743
2   2.2052160   2.05168141
3   0.6642455  -0.09132968

Clustering vector:
  [1] 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 2 3 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 2 2 2 2 3 2 2 2 2 2
 [49] 2 2 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 [97] 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[145] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[193] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[241] 1 1 1 1 1 1 1 1 1 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 3 3 2 3 3 3 3 3 3 2 3 3
[289] 3 3 3 3 2 3 3 3 2 3 3

Within cluster sum of squares by cluster:
[1] 295.16925 147.29959  95.50625
 (between_SS / total_SS =  87.1 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss" "betweenss"   
[7] "size"         "iter"         "ifault"      

Remarks: Take a look at all the different components of a k-means model object as you may need to access them in later exercises. Because printing the whole model object to the console outputs many different things, you may wish to instead print a specific component of the model object using the $ operator.
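
As a minimal sketch of the $ access described above, the example below pulls out individual components and recomputes the between_SS / total_SS percentage that print() reports. It uses synthetic data (x_demo and km_demo are hypothetical names, not objects from the course workspace), so the numbers will differ from the output shown earlier.

```r
# Synthetic stand-in data (an assumption for illustration only).
set.seed(42)
x_demo <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 4), ncol = 2)
)
km_demo <- kmeans(x_demo, centers = 2, nstart = 20)

# Pull out individual components with $ instead of printing everything.
km_demo$centers       # coordinates of each cluster center
km_demo$tot.withinss  # equals sum(km_demo$withinss)

# The percentage shown by print() is betweenss / totss.
pct_between <- 100 * km_demo$betweenss / km_demo$totss
round(pct_between, 1)
```

A higher percentage means more of the total variation is explained by the separation between clusters rather than the spread within them.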

1.4: Visualizing and interpreting results of kmeans()

One of the more intuitive ways to interpret the results of k-means models is by plotting the data as a scatter plot and using color to label the samples’ cluster membership. In this exercise, you will use the standard plot() function to accomplish this.

To create a scatter plot, you can pass data with two features (i.e. columns) to plot() with an extra argument col = km.out$cluster, which sets the color of each point in the scatter plot according to its cluster membership.

Instructions

100 XP

  • x and km.out are available in your workspace. Use the plot() function to create a scatter plot of the data x:

  • Color the dots on the scatter plot by setting the col argument to the cluster component in km.out.

  • Title the plot “k-means with 3 clusters” using the main argument to plot().

  • Ensure there are no axis labels by specifying "" for both the xlab and ylab arguments to plot().

# Scatter plot of x
plot(x, 
  col = km.out$cluster,
  main = "k-means with 3 clusters",
  xlab = "",
  ylab = "")
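
One possible extension, sketched here on synthetic data rather than the course's x, is to overlay the fitted cluster centers on the same scatter plot. The objects x_demo and km_demo are assumptions for illustration. Because cluster numbering is arbitrary, the specific colors can change between runs, but row i of the centers always belongs to cluster i.

```r
# Synthetic stand-in data (an assumption; the course's x is not used here).
set.seed(7)
x_demo <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 4), ncol = 2),
  matrix(rnorm(100, mean = 8), ncol = 2)
)
km_demo <- kmeans(x_demo, centers = 3, nstart = 20)

# Color points by cluster label, as in the exercise.
plot(x_demo,
     col = km_demo$cluster,
     main = "k-means with 3 clusters",
     xlab = "",
     ylab = "")

# Overlay the fitted centers; color index i matches cluster i.
points(km_demo$centers, col = 1:3, pch = 8, cex = 2)
```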
