types of machine learning
Unsupervised learning - dimensionality reduction
finding homogeneous subgroups within larger group
finding patterns in the features of the data
visualization of high dimensional data
pre-processing before supervised learning
Challenges and benefits
k-means clustering
We have created some two-dimensional data and stored it in a variable called x in your workspace. The scatter plot below is a visual representation of the data.
In this exercise, your task is to create a k-means model of the x data using 3 clusters, then to look at the structure of the resulting model using the summary() function.
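The variable x is supplied by the exercise workspace and is not defined in these notes. To follow along locally, a stand-in could be simulated as in the sketch below. The group sizes and centers are assumptions chosen to loosely resemble the printed output (300 observations in two columns); this is not the course's actual data.

```r
# Hypothetical stand-in for the workspace variable x (not the course data)
set.seed(42)
x <- rbind(
  cbind(rnorm(100, mean = 2),   rnorm(100, mean = 2)),  # group near (2, 2)
  cbind(rnorm(150, mean = -5),  rnorm(150, mean = 2)),  # group near (-5, 2)
  cbind(rnorm(50,  mean = 0.5), rnorm(50,  mean = 0))   # group near (0.5, 0)
)
dim(x)  # 300 rows, 2 columns
```

Because the data are simulated, cluster sizes and centers from kmeans() will not match the numbers printed in these notes exactly.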
# Create the k-means model: km.out
km.out <- kmeans(x, centers = 3, nstart = 20)
# Inspect the result
summary(km.out)
Length Class Mode
cluster 300 -none- numeric
centers 6 -none- numeric
totss 1 -none- numeric
withinss 3 -none- numeric
tot.withinss 1 -none- numeric
betweenss 1 -none- numeric
size 3 -none- numeric
iter 1 -none- numeric
ifault 1 -none- numeric
Results of kmeans()
The kmeans() function produces several outputs. In the video, we discussed one output of modeling, the cluster membership.
In this exercise, you will access the cluster component directly. This is useful anytime you need the cluster membership for each observation of the data used to build the clustering model. A future exercise will show an example of how this cluster membership might be used to help communicate the results of k-means modeling.
k-means models also have a print method to give a human friendly output of basic modeling results. This is available by using print() or simply typing the name of the model.
# Print the cluster membership component of the model
km.out$cluster
[1] 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1 1 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3
[38] 3 3 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3
[75] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
[112] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[149] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[186] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[223] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 1 3 3 3 3
[260] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 1 3 3 1 3 3 3 3 3 3 1 3 3 3 3 3 3 1 3 3
[297] 3 1 3 3
# Print the km.out object
km.out
K-means clustering with 3 clusters of sizes 98, 150, 52
Cluster means:
[,1] [,2]
1 2.2171113 2.05110690
2 -5.0556758 1.96991743
3 0.6642455 -0.09132968
Clustering vector:
[1] 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1 1 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3
[38] 3 3 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3
[75] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
[112] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[149] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[186] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[223] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 1 3 3 3 3
[260] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 1 3 3 1 3 3 3 3 3 3 1 3 3 3 3 3 3 1 3 3
[297] 3 1 3 3
Within cluster sum of squares by cluster:
[1] 148.64781 295.16925 95.50625
(between_SS / total_SS = 87.2 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss"
[6] "betweenss" "size" "iter" "ifault"
Take a look at all the different components of a k-means model object as you may need to access them in later exercises. Because printing the whole model object to the console outputs many different things, you may wish to instead print a specific component of the model object using the $ operator. Great work!
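As a small illustration of pulling out single components with the $ operator, the sketch below fits a model on simulated data. The names toy and km.toy are made up for this example and are not part of the exercise.

```r
# Simulated two-group data; 'toy' and 'km.toy' are illustrative names only
set.seed(1)
toy <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
             matrix(rnorm(100, mean = 5), ncol = 2))
km.toy <- kmeans(toy, centers = 2, nstart = 20)

km.toy$size            # number of observations in each cluster
km.toy$centers         # one row of feature means per cluster
km.toy$tot.withinss    # total within-cluster sum of squares
table(km.toy$cluster)  # same counts as km.toy$size, built from the labels
```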
Visualizing and interpreting results of kmeans()
One of the more intuitive ways to interpret the results of k-means models is by plotting the data as a scatter plot and using color to label the samples’ cluster membership. In this exercise, you will use the standard plot() function to accomplish this.
To create a scatter plot, you can pass data with two features (i.e. columns) to plot() with an extra argument col = km.out$cluster, which sets the color of each point in the scatter plot according to its cluster membership.
# Scatter plot of x
plot(x, col = km.out$cluster,
main = "k-means with 3 clusters",
xlab = "", ylab = "")
Excellent! Let’s see how the kmeans() function works under the hood in the next video.
Objectives
Model Selection
Handling random algorithms
In the video, you saw how kmeans() randomly initializes the centers of clusters. This random initialization can result in assigning observations to different cluster labels. Also, the random initialization can result in finding different local minima for the k-means algorithm. This exercise will demonstrate both results.
At the top of each plot, the measure of model quality (the total within-cluster sum of squares error) is shown as the title. Look for the model(s) with the lowest error; these are the best fits among the six runs.
Because kmeans() picks its starting cluster centers at random, it is important to set the random number generator seed for reproducibility.
# Set up 2 x 3 plotting grid
par(mfrow = c(2, 3))
# Set seed
set.seed(1)
for(i in 1:6) {
# Run kmeans() on x with three clusters and one start
km.out <- kmeans(x, centers = 3, nstart = 1)
# Plot clusters
plot(x, col = km.out$cluster,
main = km.out$tot.withinss,
xlab = "", ylab = "")
}
Interesting! Because of the random initialization of the k-means algorithm, there’s quite some variation in cluster assignments among the six models.
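The effect of the seed can be checked directly. The sketch below uses simulated data (the name toy and the seed values are arbitrary choices for illustration) and compares fits made with the same and with different seeds.

```r
set.seed(1)
toy <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
             matrix(rnorm(100, mean = 3), ncol = 2))

# Same seed and a single random start: the two fits are identical
set.seed(123); fit.a <- kmeans(toy, centers = 3, nstart = 1)
set.seed(123); fit.b <- kmeans(toy, centers = 3, nstart = 1)
identical(fit.a$cluster, fit.b$cluster)

# A different seed may relabel the clusters or reach another local minimum
set.seed(456); fit.c <- kmeans(toy, centers = 3, nstart = 1)
c(fit.a$tot.withinss, fit.c$tot.withinss)
table(fit.a$cluster, fit.c$cluster)  # cross-tabulate the two labelings
```

If the cross-tabulation has one large count per row but off the diagonal, the two runs found the same grouping under different label numbers.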
Selecting number of clusters
The k-means algorithm assumes the number of clusters as part of the input. If you know the number of clusters in advance (e.g. due to certain business constraints) this makes setting the number of clusters easy. However, as you saw in the video, if you do not know the number of clusters and need to determine it, you will need to run the algorithm multiple times, each time with a different number of clusters. From this, you can observe how a measure of model quality changes with the number of clusters.
In this exercise, you will run kmeans() multiple times to see how model quality changes as the number of clusters changes. Plots displaying this information help to determine the number of clusters and are often referred to as scree plots.
The ideal plot will have an elbow where the quality measure improves more slowly as the number of clusters increases. This indicates that the quality of the model is no longer improving substantially as the model complexity (i.e. number of clusters) increases. In other words, the elbow indicates the number of clusters inherent in the data.
# Initialize total within sum of squares error: wss
wss <- 0
# For 1 to 15 cluster centers
for (i in 1:15) {
km.out <- kmeans(x, centers = i, nstart = 20)
# Save total within sum of squares to wss variable
wss[i] <- km.out$tot.withinss
}
# Plot total within sum of squares vs. number of clusters
plot(1:15, wss, type = "b",
xlab = "Number of Clusters",
ylab = "Within groups sum of squares")
# Set k equal to the number of clusters corresponding to the elbow location
k <- 2 # 3 is probably OK, too
Looking at the scree plot, it looks like there are inherently 2 or 3 clusters in the data. Awesome job!
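Reading an elbow by eye is subjective, so it can help to look at how much each additional cluster reduces the error. The following is a sketch on simulated data with three well-separated groups; toy and rel.drop are illustrative names, and the relative drop is just one informal aid, not a rule from the course.

```r
set.seed(1)
toy <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
             matrix(rnorm(100, mean = 6), ncol = 2),
             cbind(rnorm(50, mean = 0), rnorm(50, mean = 6)))

# Total within-cluster sum of squares for k = 1..6 (sapply instead of a loop)
wss <- sapply(1:6, function(k) {
  kmeans(toy, centers = k, nstart = 20)$tot.withinss
})

# Fraction of the previous error removed by adding one more cluster
rel.drop <- -diff(wss) / head(wss, -1)
round(rel.drop, 2)
```

Large values followed by small ones mark the point where extra clusters stop paying off, which is the same information the elbow conveys visually.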
Data challenges
Practical matters: working with real data
Dealing with real data is often more challenging than dealing with synthetic data. Synthetic data helps with learning new concepts and techniques, but the next few exercises will deal with data that is closer to the type of real data you might find in your professional or academic pursuits.
The first challenge with the Pokemon data is that there is no pre-determined number of clusters. You will determine the appropriate number of clusters, keeping in mind that in real data the elbow in the scree plot might be less of a sharp elbow than in synthetic data. Use your judgement on making the determination of the number of clusters.
The second part of this exercise includes plotting the outcomes of the clustering on two dimensions, or features, of the data. These features were chosen somewhat arbitrarily for this exercise. Think about how you would use plotting and clustering to communicate interesting groups of Pokemon to other people.
An additional note: this exercise utilizes the iter.max argument to kmeans(). As you’ve seen, kmeans() is an iterative algorithm, repeating over and over until some stopping criterion is reached. The default maximum number of iterations for kmeans() is 10, which is not enough for the algorithm to converge on this data, so we’ll set iter.max to 50 to overcome this issue. To see what happens when kmeans() does not converge, try running the example with a lower number of iterations (e.g. 3). This is another example of what might happen when you encounter real data and use real cases.
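One way to check for convergence problems programmatically is to watch for a warning and to inspect the iter and ifault components of the fitted model. The sketch below uses simulated data and a deliberately tiny iter.max; whether a warning actually fires depends on the data and the random start, so no particular outcome is claimed.

```r
set.seed(1)
toy <- matrix(rnorm(2000), ncol = 4)  # 500 observations, 4 features

warned <- FALSE
km.short <- withCallingHandlers(
  kmeans(toy, centers = 5, iter.max = 1),
  warning = function(w) {
    warned <<- TRUE                 # record that kmeans() complained
    invokeRestart("muffleWarning")  # keep the console quiet
  }
)

warned           # TRUE if kmeans() issued a warning during the fit
km.short$iter    # iterations reported by the algorithm
km.short$ifault  # non-zero values flag a possible algorithm problem
```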
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2 v purrr 0.3.4
## v tibble 3.0.4 v dplyr 1.0.2
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
pokemon <- read.csv("_data/Pokemon.csv")
pokemon <- pokemon %>% select(HitPoints, Attack, Defense, SpecialAttack, SpecialDefense, Speed) %>% as.matrix()
# Initialize total within sum of squares error: wss
wss <- 0
# Look over 1 to 15 possible clusters
for (i in 1:15) {
# Fit the model: km.out
km.out <- kmeans(pokemon, centers = i, nstart = 20, iter.max = 50)
# Save the within cluster sum of squares
wss[i] <- km.out$tot.withinss
}
# Produce a scree plot
plot(1:15, wss, type = "b",
xlab = "Number of Clusters",
ylab = "Within groups sum of squares")
# Select number of clusters (2, 3, 4 probably OK)
k <- 3
# Build model with k clusters: km.out
km.out <- kmeans(pokemon, centers = k, nstart = 20, iter.max = 50)
# View the resulting model
km.out
## K-means clustering with 3 clusters of sizes 175, 270, 355
##
## Cluster means:
## HitPoints Attack Defense SpecialAttack SpecialDefense Speed
## 1 79.30857 97.29714 108.93143 66.71429 87.04571 57.29143
## 2 81.90370 96.15926 77.65556 104.12222 86.87778 94.71111
## 3 54.68732 56.93239 53.64507 52.02254 53.04789 53.58873
##
## Clustering vector:
## [1] 3 3 2 2 3 3 2 2 2 3 3 1 2 3 3 3 3 3 3 2 3 3 2 2 3 3 3 2 3 2 3 2 3 1 3 3 1
## [38] 3 3 2 3 2 3 2 3 3 3 2 3 3 2 3 1 3 2 3 3 3 2 3 2 3 2 3 2 3 3 1 3 2 2 2 3 1
## [75] 1 3 3 2 3 2 3 1 1 3 2 3 1 1 3 2 3 3 2 3 1 3 1 3 1 3 2 2 2 1 3 1 3 1 3 2 3
## [112] 2 3 1 1 1 3 3 1 3 1 3 1 1 1 3 2 3 1 3 2 2 2 2 2 2 1 1 1 3 1 1 1 3 3 2 2 2
## [149] 3 3 1 3 1 2 2 1 2 2 2 3 3 2 2 2 2 2 3 3 1 3 3 2 3 3 1 3 3 3 2 3 3 3 3 2 3
## [186] 2 3 3 3 3 3 3 2 3 3 2 2 1 3 3 1 2 3 3 2 3 3 3 3 3 1 2 1 3 1 2 3 3 2 3 1 3
## [223] 1 1 1 3 1 3 1 1 1 1 1 3 3 1 3 1 3 1 3 3 2 3 2 1 3 2 2 2 3 1 2 2 3 3 1 3 3
## [260] 3 1 2 2 2 1 3 3 1 1 2 2 2 3 3 2 2 3 3 2 2 3 3 1 1 3 3 3 3 3 3 3 3 3 3 3 2
## [297] 3 3 2 3 3 3 1 3 3 2 2 3 3 3 1 3 3 2 3 2 3 3 3 2 3 1 3 1 3 3 3 1 3 1 3 1 1
## [334] 1 3 3 2 3 2 2 3 3 3 3 3 3 1 3 2 2 3 2 3 2 1 1 3 2 3 3 3 2 3 2 3 1 2 2 2 2
## [371] 1 3 1 3 1 3 1 3 1 3 1 3 2 3 1 3 2 2 3 1 1 3 2 2 3 3 2 2 3 3 2 3 1 1 1 3 3
## [408] 1 2 2 3 1 1 2 1 1 1 2 2 2 2 2 2 1 2 2 2 2 2 2 1 2 3 1 1 3 3 2 3 3 2 3 3 2
## [445] 3 3 3 3 3 3 2 3 2 3 1 3 1 3 1 1 1 2 3 1 3 3 2 3 2 3 1 2 3 2 3 2 2 2 2 3 2
## [482] 3 3 2 3 1 3 3 3 3 1 3 3 2 2 3 3 2 2 3 1 3 1 3 2 1 3 2 3 3 2 1 2 2 1 1 1 2
## [519] 2 2 2 1 2 1 2 2 2 2 1 1 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 3
## [556] 3 2 3 3 2 3 3 2 3 3 3 3 1 3 2 3 2 3 2 3 2 3 1 3 3 2 3 2 3 1 1 3 2 3 2 1 1
## [593] 3 1 1 3 3 2 1 1 3 3 2 3 3 2 3 2 3 2 2 3 3 2 3 1 2 2 3 1 3 1 2 3 1 3 1 3 2
## [630] 3 1 3 2 3 2 3 3 1 3 3 2 3 2 3 3 2 3 2 2 3 1 3 1 3 2 1 3 2 3 1 3 1 1 3 3 2
## [667] 3 2 3 3 2 3 1 1 3 1 2 3 2 1 3 2 1 3 1 3 1 1 3 1 3 1 2 1 3 3 2 3 2 2 2 2 2
## [704] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 1 1 3 3 2 3 3 2 3 3 3 3 2 3 3 3 3 2 3 3 2
## [741] 3 2 3 1 2 3 2 2 3 1 2 1 3 1 3 2 3 1 3 1 3 1 3 2 3 2 3 1 3 2 2 2 2 1 3 2 2
## [778] 1 3 1 3 3 3 3 1 1 1 1 3 1 3 2 2 2 1 1 2 2 2 2
##
## Within cluster sum of squares by cluster:
## [1] 709020.5 1018348.0 812079.9
## (between_SS / total_SS = 40.8 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
# Plot of Defense vs. Speed by cluster membership
plot(pokemon[, c("Defense", "Speed")],
col = km.out$cluster,
main = paste("k-means clustering of Pokemon with", k, "clusters"),
xlab = "Defense", ylab = "Speed")
Nice job! You’re really getting the hang of k-means clustering quickly!
Chapter review
Hierarchical clustering
Hierarchical clustering with results
In this exercise, you will create your first hierarchical clustering model using the hclust() function.
We have created some data that has two dimensions and placed it in a variable called x. Your task is to create a hierarchical clustering model of x. Remember from the video that the first step to hierarchical clustering is determining the dissimilarity (distance) between observations, which you will do with the dist() function.
You will look at the structure of the resulting model using the summary() function.
# Create hierarchical clustering model: hclust.out
hclust.out <- hclust(dist(x))
# Inspect the result
summary(hclust.out)
Length Class Mode
merge 98 -none- numeric
height 49 -none- numeric
order 50 -none- numeric
labels 0 -none- NULL
method 1 -none- character
call 2 -none- call
dist.method 1 -none- character
Awesome! Now that you’ve made your first hierarchical clustering model, let’s learn how to use it to solve problems.
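To see what dist() passes to hclust(), here is a tiny worked example with three points whose Euclidean distances are easy to verify by hand (a 3-4-5 triangle scaled along one line). The points are made up for illustration.

```r
# Three points on a line through the origin: a=(0,0), b=(3,4), c=(6,8)
pts <- rbind(a = c(0, 0), b = c(3, 4), c = c(6, 8))

d <- dist(pts)  # Euclidean distance by default
d               # lower triangle only: a-b = 5, a-c = 10, b-c = 5
as.matrix(d)    # full symmetric matrix with zeros on the diagonal
```

dist() returns distances, so larger values mean less similar observations.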
Cutting the tree
Remember from the video that cutree() is the R function that cuts a hierarchical model. The h and k arguments to cutree() allow you to cut the tree based on a certain height h or a certain number of clusters k.
In this exercise, you will use cutree() to cut the hierarchical model you created earlier based on each of these two criteria.
# Cut by height
cutree(hclust.out, h = 7)
[1] 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 3 3 3 3 3 3 3 2 2 2
[39] 2 2 2 2 2 2 2 2 2 2 2 2
# Cut by number of clusters
cutree(hclust.out, k = 3)
[1] 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 3 3 3 3 3 3 3 2 2 2
[39] 2 2 2 2 2 2 2 2 2 2 2 2
If you’re wondering what the output means, remember, there are 50 observations in the original dataset x. The output of each cutree() call represents the cluster assignments for each observation in the original dataset. Great work!
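The two cuts above return the same assignments because a height and a number of clusters are two ways of naming the same level of the tree. The sketch below shows the link on simulated data (toy, hc and h.mid are illustrative names): any height between two consecutive merge heights corresponds to a fixed number of clusters.

```r
set.seed(1)
toy <- matrix(rnorm(40), ncol = 2)  # 20 observations
hc <- hclust(dist(toy))             # complete linkage is the default
n <- nrow(toy)

# Merge heights are stored in increasing order; after n - 3 merges,
# three clusters remain, so pick a height just above that merge
h.mid <- mean(hc$height[c(n - 3, n - 2)])

by.height <- cutree(hc, h = h.mid)
by.k <- cutree(hc, k = 3)
identical(by.height, by.k)  # the two cuts agree
```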
Linking clusters in hierarchical clustering
how is distance between clusters determined? Rules?
Four methods to determine which cluster should be linked
Practical matters
Linkage methods
In this exercise, you will produce hierarchical clustering models using different linkages and plot the dendrogram for each, observing the overall structure of the trees.
You’ll be asked to interpret the results in the next exercise.
# Cluster using complete linkage: hclust.complete
hclust.complete <- hclust(dist(x), method = "complete")
# Cluster using average linkage: hclust.average
hclust.average <- hclust(dist(x), method = "average")
# Cluster using single linkage: hclust.single
hclust.single <- hclust(dist(x), method = "single")
# Plot dendrogram of hclust.complete
plot(hclust.complete, main = "Complete")
# Plot dendrogram of hclust.average
plot(hclust.average, main = "Average")
# Plot dendrogram of hclust.single
plot(hclust.single, main = "Single")
Before moving on, make sure to toggle through the plots to compare and contrast the three dendrograms you created. You’ll learn about the implications of these differences in the next exercise. Excellent work!
Whether you want balanced or unbalanced trees for your hierarchical clustering model depends on the context of the problem you’re trying to solve. Balanced trees are essential if you want an even number of observations assigned to each cluster. On the other hand, if you want to detect outliers, for example, an unbalanced tree is more desirable because pruning an unbalanced tree can result in most observations assigned to one cluster and only a few observations assigned to other clusters.
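The linkage methods used above differ only in how they turn the pairwise distances between two clusters into a single cluster-to-cluster distance. A minimal sketch with two hand-made clusters (A and B are hypothetical points) computes the three linkages directly and compares them with the top merge height from hclust().

```r
A <- rbind(c(0, 0), c(0, 1))  # cluster A
B <- rbind(c(3, 0), c(4, 0))  # cluster B
d <- dist(rbind(A, B))

# Pairwise distances between members of A (rows) and members of B (columns)
pair.d <- as.matrix(d)[1:2, 3:4]

by.hand <- c(complete = max(pair.d),   # farthest pair
             single   = min(pair.d),   # closest pair
             average  = mean(pair.d))  # mean over all pairs
by.hand

# The last merge in each tree joins A and B, at the linkage distance
sapply(c("complete", "single", "average"),
       function(m) max(hclust(d, method = m)$height))
```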
Practical matters: scaling
Recall from the video that clustering real data may require scaling the features if they have different distributions. So far in this chapter, you have been working with synthetic data that did not need scaling.
In this exercise, you will go back to working with “real” data, the pokemon dataset introduced in the first chapter. You will observe the distribution (mean and standard deviation) of each feature, scale the data accordingly, then produce a hierarchical clustering model using the complete linkage method.
# View column means
colMeans(pokemon)
## HitPoints Attack Defense SpecialAttack SpecialDefense
## 69.25875 79.00125 73.84250 72.82000 71.90250
## Speed
## 68.27750
# View column standard deviations
apply(pokemon, 2, sd)
## HitPoints Attack Defense SpecialAttack SpecialDefense
## 25.53467 32.45737 31.18350 32.72229 27.82892
## Speed
## 29.06047
# Scale the data
pokemon.scaled <- scale(pokemon)
# Create hierarchical clustering model: hclust.pokemon
hclust.pokemon <- hclust(dist(pokemon.scaled), method = "complete")
Let’s quickly recap what you just did. You first checked to see if the column means and standard deviations vary. Because they do, you scaled the data, converted the scaled data to a distance matrix with dist() and passed it into the hclust() function. Great work!
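To confirm what scale() does, the sketch below standardizes a simulated matrix whose two columns sit on very different scales (the column names small and large are made up) and checks the resulting means and standard deviations.

```r
set.seed(1)
toy <- cbind(small = rnorm(100, mean = 0.1, sd = 0.01),
             large = rnorm(100, mean = 500, sd = 200))

toy.scaled <- scale(toy)  # subtract column means, divide by column sds

round(colMeans(toy.scaled), 10)  # both columns now have mean 0
apply(toy.scaled, 2, sd)         # and standard deviation 1
```

After scaling, each feature contributes on a comparable footing to the Euclidean distances that dist() computes.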
Comparing kmeans() and hclust()
Comparing k-means and hierarchical clustering, you’ll see the two methods produce different cluster memberships. This is because the two algorithms make different assumptions about how the data is generated. In a more advanced course, we could choose to use one model over another based on the quality of the models’ assumptions, but for now, it’s enough to observe that they are different.
This exercise will have you compare results from the two models on the pokemon dataset to see how they differ.
# Apply cutree() to hclust.pokemon: cut.pokemon
cut.pokemon <- cutree(hclust.pokemon, k = 3)
# Compare methods
table(km.out$cluster, cut.pokemon)
## cut.pokemon
## 1 2 3
## 1 171 3 1
## 2 267 3 0
## 3 350 5 0
Looking at the table, it looks like the hierarchical clustering model assigns most of the observations to cluster 1, while the k-means algorithm distributes the observations relatively evenly among all clusters. It’s important to note that there’s no consensus on which method produces better clusters. The job of the analyst in unsupervised clustering is to observe the cluster assignments and make a judgment call as to which method provides more insights into the data. Excellent job!
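When reading a cross-tabulation like the one above, keep in mind that cluster numbers are arbitrary, so agreement does not have to sit on the diagonal. The sketch below uses simulated two-group data and an informal agreement measure; the data, the names, and the measure itself are assumptions for illustration, not part of the exercise.

```r
set.seed(1)
toy <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
             matrix(rnorm(100, mean = 5), ncol = 2))

km.toy <- kmeans(toy, centers = 2, nstart = 20)
hc.toy <- cutree(hclust(dist(toy), method = "complete"), k = 2)

tab <- table(kmeans = km.toy$cluster, hclust = hc.toy)
tab

# With two clusters there are only two ways to pair up the labels;
# report the share of observations that agree under the better pairing
on.diag <- sum(diag(tab))
max(on.diag, sum(tab) - on.diag) / sum(tab)
```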
Two methods of clustering
dimensionality reduction
PCA using prcomp()
In this exercise, you will create your first PCA model and observe the diagnostic results.
We have loaded the Pokemon data from earlier, which has four dimensions, and placed it in a variable called pokemon. Your task is to create a PCA model of the data, then to inspect the resulting model using the summary() function.
pokemon <- read.csv("_data/Pokemon.csv", row.names = "Name")
(pokemon <- pokemon %>% select(HitPoints, Attack, Defense, Speed) %>% slice(1:50) %>% as.matrix())
## HitPoints Attack Defense Speed
## Bulbasaur 45 49 49 45
## Ivysaur 60 62 63 60
## Venusaur 80 82 83 80
## VenusaurMega Venusaur 80 100 123 80
## Charmander 39 52 43 65
## Charmeleon 58 64 58 80
## Charizard 78 84 78 100
## CharizardMega Charizard X 78 130 111 100
## CharizardMega Charizard Y 78 104 78 100
## Squirtle 44 48 65 43
## Wartortle 59 63 80 58
## Blastoise 79 83 100 78
## BlastoiseMega Blastoise 79 103 120 78
## Caterpie 45 30 35 45
## Metapod 50 20 55 30
## Butterfree 60 45 50 70
## Weedle 40 35 30 50
## Kakuna 45 25 50 35
## Beedrill 65 90 40 75
## BeedrillMega Beedrill 65 150 40 145
## Pidgey 40 45 40 56
## Pidgeotto 63 60 55 71
## Pidgeot 83 80 75 101
## PidgeotMega Pidgeot 83 80 80 121
## Rattata 30 56 35 72
## Raticate 55 81 60 97
## Spearow 40 60 30 70
## Fearow 65 90 65 100
## Ekans 35 60 44 55
## Arbok 60 85 69 80
## Pikachu 35 55 40 90
## Raichu 60 90 55 110
## Sandshrew 50 75 85 40
## Sandslash 75 100 110 65
## Nidoran♀ 55 47 52 41
## Nidorina 70 62 67 56
## Nidoqueen 90 92 87 76
## Nidoran♂ 46 57 40 50
## Nidorino 61 72 57 65
## Nidoking 81 102 77 85
## Clefairy 70 45 48 35
## Clefable 95 70 73 60
## Vulpix 38 41 40 65
## Ninetales 73 76 75 100
## Jigglypuff 115 45 20 20
## Wigglytuff 140 70 45 45
## Zubat 40 45 35 55
## Golbat 75 80 70 90
## Oddish 45 50 55 30
## Gloom 60 65 70 40
# Perform scaled PCA: pr.out
pr.out <- prcomp(pokemon, scale = TRUE)
# Inspect model output
summary(pr.out)
## Importance of components:
## PC1 PC2 PC3 PC4
## Standard deviation 1.5467 0.9441 0.7490 0.39431
## Proportion of Variance 0.5981 0.2228 0.1402 0.03887
## Cumulative Proportion 0.5981 0.8209 0.9611 1.00000
The first two principal components describe around 82% of the variance.
Additional results of PCA
PCA models in R produce additional diagnostic and output components:
center: the column means used to center the data, or FALSE if the data weren’t centered
scale: the column standard deviations used to scale the data, or FALSE if the data weren’t scaled
rotation: the directions of the principal component vectors in terms of the original features/variables. This information allows you to define new data in terms of the original principal components
x: the value of each observation in the original dataset projected to the principal components
You can access these the same as other model components. For example, use pr.out$rotation to access the rotation component.
Calling dim() on pr.out$rotation and pokemon, you can see they have different dimensions.
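The relationship between rotation and x can be checked numerically. The sketch below uses R's built-in USArrests data rather than the pokemon matrix (an assumption made so the example is self-contained) and rebuilds the scores by multiplying the scaled data by the rotation matrix.

```r
pr.toy <- prcomp(USArrests, scale. = TRUE)

dim(pr.toy$rotation)  # features x components
dim(pr.toy$x)         # observations x components

# Projecting the centered and scaled data onto the loadings gives x
scores <- scale(USArrests) %*% pr.toy$rotation
all.equal(scores, pr.toy$x, check.attributes = FALSE)
```

The same multiplication, using the stored center and scale values, is how new observations can be expressed in terms of the existing principal components.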
Interpreting biplots (1)
As stated in the video, the biplot() function plots both the principal components loadings and the mapping of the observations to their first two principal component values. The next couple of exercises will check your interpretation of the biplot() visualization.
Question: Using the biplot() of the pr.out model, which two original variables have approximately the same loadings in the first two principal components?
Answer: Attack and HitPoints
Interpreting biplots (2)
In the last exercise, you saw that Attack and HitPoints have approximately the same loadings in the first two principal components.
Question: Again using the biplot() of the pr.out model, which two Pokemon are the least similar in terms of the second principal component?
Answer: Kadabra and Torkoal
Variance explained
The second common plot type for understanding PCA models is a scree plot. A scree plot shows the variance explained as the number of principal components increases. Sometimes the cumulative variance explained is plotted as well.
In this and the next exercise, you will prepare data from the pr.out model you created at the beginning of the chapter for use in a scree plot. Preparing the data for plotting is required because base R’s screeplot() function plots the variance of each component, not the proportion of variance explained.
# Variability of each principal component: pr.var
pr.var <- pr.out$sdev^2
# Variance explained by each principal component: pve
pve <- pr.var / sum(pr.var)
Visualize variance explained
Now you will create a scree plot showing the proportion of variance explained by each principal component, as well as the cumulative proportion of variance explained.
Recall from the video that these plots can help to determine the number of principal components to retain. One way to determine the number of principal components to retain is by looking for an elbow in the scree plot showing that as the number of principal components increases, the rate at which variance is explained decreases substantially. In the absence of a clear elbow, you can use the scree plot as a guide for setting a threshold.
# Plot variance explained for each principal component
plot(pve, xlab = "Principal Component",
ylab = "Proportion of Variance Explained",
ylim = c(0, 1), type = "b")
# Plot cumulative proportion of variance explained
plot(cumsum(pve), xlab = "Principal Component",
ylab = "Cumulative Proportion of Variance Explained",
ylim = c(0, 1), type = "b")
Awesome! Notice that when the number of principal components is equal to the number of original features in the data, the cumulative proportion of variance explained is 1.
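When there is no clear elbow, a threshold on the cumulative proportion of variance explained can be turned into a component count with a couple of lines. This is a sketch on the built-in USArrests data; the 80% target is an arbitrary assumption, not a recommendation from the course.

```r
pr.toy <- prcomp(USArrests, scale. = TRUE)
pve <- pr.toy$sdev^2 / sum(pr.toy$sdev^2)

threshold <- 0.80  # assumed target share of variance to retain

# Index of the first component at which the cumulative share reaches the target
n.keep <- which(cumsum(pve) >= threshold)[1]
n.keep
round(cumsum(pve), 3)
```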
Practical issues: scaling
You saw in the video that scaling your data before doing PCA changes the results of the PCA modeling. Here, you will perform PCA with and without scaling, then visualize the results using biplots.
Sometimes scaling is appropriate when the variances of the variables are substantially different. This is commonly the case when variables have different units of measurement, for example, degrees Fahrenheit (temperature) and miles (distance). Making the decision to use scaling is an important step in performing a principal component analysis.
# Mean of each variable
colMeans(pokemon)
## HitPoints Attack Defense Speed
## 63.10 69.10 62.10 69.16
# Standard deviation of each variable
apply(pokemon, 2, sd)
## HitPoints Attack Defense Speed
## 21.30847 25.77552 23.80383 25.98301
# PCA model with scaling: pr.with.scaling
pr.with.scaling <- prcomp(pokemon, scale = TRUE)
# PCA model without scaling: pr.without.scaling
pr.without.scaling <- prcomp(pokemon, scale = FALSE)
# Create biplots of both for comparison
biplot(pr.with.scaling)
biplot(pr.without.scaling)
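Good job! This feedback refers to a version of the data that also includes a Total column, which contains much more variation, on average, than the other four columns, so it has a disproportionate effect on the PCA model when scaling is not performed. After scaling the data, there’s a much more even distribution of the loading vectors. In the four-column subset shown above, the standard deviations are already similar (roughly 21 to 26), so the two biplots differ less.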
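The effect of one high-variance column is easy to reproduce on the built-in USArrests data, where Assault has a far larger standard deviation than the other variables. The sketch below (using USArrests is an assumption made for a self-contained example) compares the first loading vector with and without scaling.

```r
apply(USArrests, 2, sd)  # Assault varies far more than the other columns

pr.raw <- prcomp(USArrests, scale. = FALSE)
pr.scaled <- prcomp(USArrests, scale. = TRUE)

# Absolute PC1 loadings: dominated by one variable when unscaled
round(abs(pr.raw$rotation[, 1]), 2)
round(abs(pr.scaled$rotation[, 1]), 2)
```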
Objectives
Analysis
Unsupervised learning is open-ended
url <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_1903/datasets/WisconsinCancer.csv"
# Download the data: wisc.df
wisc.df <- read.csv(url)
# Convert the features of the data: wisc.data
wisc.data <- as.matrix(wisc.df[3:32])
# Set the row names of wisc.data
row.names(wisc.data) <- wisc.df$id
# Create diagnosis vector
diagnosis <- as.numeric(wisc.df$diagnosis == "M")
str(wisc.data)
## num [1:569, 1:30] 18 20.6 19.7 11.4 20.3 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:569] "842302" "842517" "84300903" "84348301" ...
## ..$ : chr [1:30] "radius_mean" "texture_mean" "perimeter_mean" "area_mean" ...
head(wisc.data)
## radius_mean texture_mean perimeter_mean area_mean smoothness_mean
## 842302 17.99 10.38 122.80 1001.0 0.11840
## 842517 20.57 17.77 132.90 1326.0 0.08474
## 84300903 19.69 21.25 130.00 1203.0 0.10960
## 84348301 11.42 20.38 77.58 386.1 0.14250
## 84358402 20.29 14.34 135.10 1297.0 0.10030
## 843786 12.45 15.70 82.57 477.1 0.12780
## compactness_mean concavity_mean concave.points_mean symmetry_mean
## 842302 0.27760 0.3001 0.14710 0.2419
## 842517 0.07864 0.0869 0.07017 0.1812
## 84300903 0.15990 0.1974 0.12790 0.2069
## 84348301 0.28390 0.2414 0.10520 0.2597
## 84358402 0.13280 0.1980 0.10430 0.1809
## 843786 0.17000 0.1578 0.08089 0.2087
## fractal_dimension_mean radius_se texture_se perimeter_se area_se
## 842302 0.07871 1.0950 0.9053 8.589 153.40
## 842517 0.05667 0.5435 0.7339 3.398 74.08
## 84300903 0.05999 0.7456 0.7869 4.585 94.03
## 84348301 0.09744 0.4956 1.1560 3.445 27.23
## 84358402 0.05883 0.7572 0.7813 5.438 94.44
## 843786 0.07613 0.3345 0.8902 2.217 27.19
## smoothness_se compactness_se concavity_se concave.points_se
## 842302 0.006399 0.04904 0.05373 0.01587
## 842517 0.005225 0.01308 0.01860 0.01340
## 84300903 0.006150 0.04006 0.03832 0.02058
## 84348301 0.009110 0.07458 0.05661 0.01867
## 84358402 0.011490 0.02461 0.05688 0.01885
## 843786 0.007510 0.03345 0.03672 0.01137
## symmetry_se fractal_dimension_se radius_worst texture_worst
## 842302 0.03003 0.006193 25.38 17.33
## 842517 0.01389 0.003532 24.99 23.41
## 84300903 0.02250 0.004571 23.57 25.53
## 84348301 0.05963 0.009208 14.91 26.50
## 84358402 0.01756 0.005115 22.54 16.67
## 843786 0.02165 0.005082 15.47 23.75
## perimeter_worst area_worst smoothness_worst compactness_worst
## 842302 184.60 2019.0 0.1622 0.6656
## 842517 158.80 1956.0 0.1238 0.1866
## 84300903 152.50 1709.0 0.1444 0.4245
## 84348301 98.87 567.7 0.2098 0.8663
## 84358402 152.20 1575.0 0.1374 0.2050
## 843786 103.40 741.6 0.1791 0.5249
## concavity_worst concave.points_worst symmetry_worst
## 842302 0.7119 0.2654 0.4601
## 842517 0.2416 0.1860 0.2750
## 84300903 0.4504 0.2430 0.3613
## 84348301 0.6869 0.2575 0.6638
## 84358402 0.4000 0.1625 0.2364
## 843786 0.5355 0.1741 0.3985
## fractal_dimension_worst
## 842302 0.11890
## 842517 0.08902
## 84300903 0.08758
## 84348301 0.17300
## 84358402 0.07678
## 843786 0.12440
head(diagnosis)
## [1] 1 1 1 1 1 1
Great work! You’ve successfully prepared the data for exploratory data analysis.
Performing PCA
The next step in your analysis is to perform PCA on wisc.data.
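You saw in the last chapter that it’s important to check if the data need to be scaled before performing PCA. Recall two common reasons for scaling data: the input variables use different units of measurement, or they have substantially different variances.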
# Check column means and standard deviations
colMeans(wisc.data)
## radius_mean texture_mean perimeter_mean
## 1.412729e+01 1.928965e+01 9.196903e+01
## area_mean smoothness_mean compactness_mean
## 6.548891e+02 9.636028e-02 1.043410e-01
## concavity_mean concave.points_mean symmetry_mean
## 8.879932e-02 4.891915e-02 1.811619e-01
## fractal_dimension_mean radius_se texture_se
## 6.279761e-02 4.051721e-01 1.216853e+00
## perimeter_se area_se smoothness_se
## 2.866059e+00 4.033708e+01 7.040979e-03
## compactness_se concavity_se concave.points_se
## 2.547814e-02 3.189372e-02 1.179614e-02
## symmetry_se fractal_dimension_se radius_worst
## 2.054230e-02 3.794904e-03 1.626919e+01
## texture_worst perimeter_worst area_worst
## 2.567722e+01 1.072612e+02 8.805831e+02
## smoothness_worst compactness_worst concavity_worst
## 1.323686e-01 2.542650e-01 2.721885e-01
## concave.points_worst symmetry_worst fractal_dimension_worst
## 1.146062e-01 2.900756e-01 8.394582e-02
apply(wisc.data, 2, sd)
## radius_mean texture_mean perimeter_mean
## 3.524049e+00 4.301036e+00 2.429898e+01
## area_mean smoothness_mean compactness_mean
## 3.519141e+02 1.406413e-02 5.281276e-02
## concavity_mean concave.points_mean symmetry_mean
## 7.971981e-02 3.880284e-02 2.741428e-02
## fractal_dimension_mean radius_se texture_se
## 7.060363e-03 2.773127e-01 5.516484e-01
## perimeter_se area_se smoothness_se
## 2.021855e+00 4.549101e+01 3.002518e-03
## compactness_se concavity_se concave.points_se
## 1.790818e-02 3.018606e-02 6.170285e-03
## symmetry_se fractal_dimension_se radius_worst
## 8.266372e-03 2.646071e-03 4.833242e+00
## texture_worst perimeter_worst area_worst
## 6.146258e+00 3.360254e+01 5.693570e+02
## smoothness_worst compactness_worst concavity_worst
## 2.283243e-02 1.573365e-01 2.086243e-01
## concave.points_worst symmetry_worst fractal_dimension_worst
## 6.573234e-02 6.186747e-02 1.806127e-02
# Execute PCA, scaling if appropriate: wisc.pr
wisc.pr <- prcomp(wisc.data, scale = TRUE)
# Look at summary of results
summary(wisc.pr)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 3.6444 2.3857 1.67867 1.40735 1.28403 1.09880 0.82172
## Proportion of Variance 0.4427 0.1897 0.09393 0.06602 0.05496 0.04025 0.02251
## Cumulative Proportion 0.4427 0.6324 0.72636 0.79239 0.84734 0.88759 0.91010
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 0.69037 0.6457 0.59219 0.5421 0.51104 0.49128 0.39624
## Proportion of Variance 0.01589 0.0139 0.01169 0.0098 0.00871 0.00805 0.00523
## Cumulative Proportion 0.92598 0.9399 0.95157 0.9614 0.97007 0.97812 0.98335
## PC15 PC16 PC17 PC18 PC19 PC20 PC21
## Standard deviation 0.30681 0.28260 0.24372 0.22939 0.22244 0.17652 0.1731
## Proportion of Variance 0.00314 0.00266 0.00198 0.00175 0.00165 0.00104 0.0010
## Cumulative Proportion 0.98649 0.98915 0.99113 0.99288 0.99453 0.99557 0.9966
## PC22 PC23 PC24 PC25 PC26 PC27 PC28
## Standard deviation 0.16565 0.15602 0.1344 0.12442 0.09043 0.08307 0.03987
## Proportion of Variance 0.00091 0.00081 0.0006 0.00052 0.00027 0.00023 0.00005
## Cumulative Proportion 0.99749 0.99830 0.9989 0.99942 0.99969 0.99992 0.99997
## PC29 PC30
## Standard deviation 0.02736 0.01153
## Proportion of Variance 0.00002 0.00000
## Cumulative Proportion 1.00000 1.00000
Interpreting PCA results
Now you’ll use some visualizations to better understand your PCA model. You were introduced to one of these visualizations, the biplot, in an earlier chapter.
You’ll run into some common challenges with using biplots on real-world data containing a non-trivial number of observations and variables, then you’ll look at some alternative visualizations. You are encouraged to experiment with additional visualizations before moving on to the next exercise.
# Create a biplot of wisc.pr
biplot(wisc.pr)
# Scatter plot observations by components 1 and 2
plot(wisc.pr$x[, c(1, 2)], col = (diagnosis + 1),
xlab = "PC1", ylab = "PC2")
# Repeat for components 1 and 3
plot(wisc.pr$x[, c(1, 3)], col = (diagnosis + 1),
xlab = "PC1", ylab = "PC3")
# Do additional data exploration of your choosing below (optional)
Excellent work! Because principal component 2 explains more variance in the original data than principal component 3, you can see that the first plot has a cleaner cut separating the two subgroups.
Variance explained
In this exercise, you will produce scree plots showing the proportion of variance explained as the number of principal components increases. The data from PCA must be prepared for these plots, because base R's screeplot() shows the variance of each component rather than the proportion of variance explained.
As you look at these plots, ask yourself if there’s an elbow in the amount of variance explained that might lead you to pick a natural number of principal components. If an obvious elbow does not exist, as is typical in real-world datasets, consider how else you might determine the number of principal components to retain based on the scree plot.
# Set up 1 x 2 plotting grid
par(mfrow = c(1, 2))
# Calculate variability of each component
pr.var <- wisc.pr$sdev^2
# Variance explained by each principal component: pve
pve <- pr.var / sum(pr.var)
# Plot variance explained for each principal component
plot(pve, xlab = "Principal Component",
ylab = "Proportion of Variance Explained",
ylim = c(0, 1), type = "b")
# Plot cumulative proportion of variance explained
plot(cumsum(pve), xlab = "Principal Component",
ylab = "Cumulative Proportion of Variance Explained",
ylim = c(0, 1), type = "b")
Great work! Before moving on, answer the following question: What is the minimum number of principal components needed to explain 80% of the variance in the data? Write it down as you may need this in the next exercise :)
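One way to answer this kind of question without reading the plot by eye is to look for the first component whose cumulative proportion crosses the threshold. The sketch below hard-codes the cumulative proportions printed by summary(wisc.pr) earlier (first ten components only), and the helper name min.components is hypothetical. With the fitted model in your workspace, cumsum(pve) could be passed instead.

```r
# Cumulative proportion of variance for PC1..PC10, copied from summary(wisc.pr)
cum.pve <- c(0.4427, 0.6324, 0.72636, 0.79239, 0.84734,
             0.88759, 0.91010, 0.92598, 0.9399, 0.95157)

# Smallest number of components whose cumulative proportion reaches `threshold`
min.components <- function(cum.pve, threshold) {
  which(cum.pve >= threshold)[1]
}

min.components(cum.pve, 0.80)  # 5
min.components(cum.pve, 0.90)  # 7
min.components(cum.pve, 0.95)  # 10
```

The 90% result lines up with the seven components used for clustering on the PCA output later in this case study.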
Review thus far
Next steps
Hierarchical clustering of case data
The goal of this exercise is to do hierarchical clustering of the observations. Recall from Chapter 2 that this type of clustering does not assume in advance the number of natural groups that exist in the data.
As part of the preparation for hierarchical clustering, the distances between all pairs of observations are computed. Furthermore, there are different ways to link clusters together, with single, complete, and average being the most common linkage methods.
# Scale the wisc.data data: data.scaled
data.scaled <- scale(wisc.data)
# Calculate the (Euclidean) distances: data.dist
data.dist <- dist(data.scaled)
# Create a hierarchical clustering model: wisc.hclust
wisc.hclust <- hclust(data.dist, method = "complete")
# Plot the resulting dendrogram
plot(wisc.hclust)
Nice! Let’s continue to the next exercise.
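To get a feel for how the linkage method changes the result, the sketch below fits single, complete, and average linkage to the same distance matrix and compares cluster sizes after cutting each tree. It uses R's built-in iris measurements as a stand-in for wisc.data, and the choice of three clusters is arbitrary.

```r
# Stand-in data: the four numeric columns of the built-in iris dataset
toy.dist <- dist(scale(iris[, 1:4]))

# Fit one tree per linkage method
methods <- c("single", "complete", "average")
trees <- lapply(methods, function(m) hclust(toy.dist, method = m))
names(trees) <- methods

# Cluster sizes after cutting each tree into 3 groups
sizes <- lapply(trees, function(tr) table(cutree(tr, k = 3)))
sizes
```

Single linkage often yields one large cluster plus a few very small ones, which is one reason complete and average linkage are common defaults. Whether that pattern appears depends on the data.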
Selecting number of clusters
In this exercise, you will compare the outputs from your hierarchical clustering model to the actual diagnoses. Normally when performing unsupervised learning like this, a target variable isn’t available. We do have it with this dataset, however, so it can be used to check the performance of the clustering model.
When performing supervised learning—that is, when you’re trying to predict some target variable of interest and that target variable is available in the original data—using clustering to create new features may or may not improve the performance of the final model. This exercise will help you determine if, in this case, hierarchical clustering provides a promising new feature.
# Cut tree so that it has 4 clusters: wisc.hclust.clusters
wisc.hclust.clusters <- cutree(wisc.hclust, k = 4)
# Compare cluster membership to actual diagnoses
table(wisc.hclust.clusters, diagnosis)
##                     diagnosis
## wisc.hclust.clusters   0   1
##                    1  12 165
##                    2   2   5
##                    3 343  40
##                    4   0   2
Four clusters were picked after some exploration. Before moving on, you may want to explore how different numbers of clusters affect the ability of the hierarchical clustering to separate the different diagnoses. Great job!
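A rough way to carry out this exploration is to cut the same tree at several values of k and record, for each, the fraction of observations that fall in the majority class of their cluster (often called purity). The sketch below uses iris and its Species column as stand-ins for wisc.data and diagnosis. The helper purity is hypothetical and not part of the course code.

```r
# Stand-in tree and labels (iris instead of the Wisconsin data)
toy.hclust <- hclust(dist(scale(iris[, 1:4])), method = "complete")
toy.labels <- iris$Species

# Fraction of observations belonging to the majority label of their cluster
purity <- function(clusters, labels) {
  tab <- table(clusters, labels)
  sum(apply(tab, 1, max)) / sum(tab)
}

# Evaluate purity for k = 2, ..., 10
ks <- 2:10
purity.by.k <- sapply(ks, function(k) purity(cutree(toy.hclust, k = k), toy.labels))
names(purity.by.k) <- ks
round(purity.by.k, 3)
```

Purity can only stay the same or increase as k grows, because each cut refines the previous one. It is therefore more useful for spotting where extra clusters stop helping than for picking k on its own.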
k-means clustering and comparing results
As you now know, there are two main types of clustering: hierarchical and k-means.
In this exercise, you will create a k-means clustering model on the Wisconsin breast cancer data and compare the results to the actual diagnoses and the results of your hierarchical clustering model. Take some time to see how each clustering model performs in terms of separating the two diagnoses and how the clustering models compare to each other.
# Create a k-means model on wisc.data: wisc.km
wisc.km <- kmeans(scale(wisc.data), centers = 2, nstart = 20)
# Compare k-means to actual diagnoses
table(wisc.km$cluster, diagnosis)
##    diagnosis
##       0   1
##   1  14 175
##   2 343  37
# Compare k-means to hierarchical clustering
table(wisc.hclust.clusters, wisc.km$cluster)
##
## wisc.hclust.clusters   1   2
##                    1 160  17
##                    2   7   0
##                    3  20 363
##                    4   2   0
Nice! Looking at the second table you generated, it looks like clusters 1, 2, and 4 from the hierarchical clustering model can be interpreted as the cluster 1 equivalent from the k-means algorithm, and cluster 3 can be interpreted as the cluster 2 equivalent.
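That reading of the table can be reproduced by taking, for each hierarchical cluster, the k-means cluster that holds most of its observations. The sketch below hard-codes the counts from the table above, and the names cross.tab and best.match are made up for illustration. With both models in your workspace, the table() call itself could be used instead.

```r
# Counts copied from table(wisc.hclust.clusters, wisc.km$cluster)
cross.tab <- matrix(c(160,  17,
                        7,   0,
                       20, 363,
                        2,   0),
                    nrow = 4, byrow = TRUE,
                    dimnames = list(hclust = as.character(1:4),
                                    kmeans = as.character(1:2)))

# For each hierarchical cluster, the k-means cluster with the largest overlap
best.match <- apply(cross.tab, 1, which.max)
best.match
```

Hierarchical clusters 1, 2, and 4 map to k-means cluster 1, and hierarchical cluster 3 maps to k-means cluster 2.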
Clustering on PCA results
In this final exercise, you will put together several steps you used earlier and, in doing so, you will experience some of the creativity that is typical in unsupervised learning.
Recall from earlier exercises that the PCA model required significantly fewer features to describe 80% and 95% of the variability of the data. In addition to normalizing data and potentially avoiding overfitting, PCA also decorrelates the variables, which sometimes improves the performance of other modeling techniques.
Let’s see if PCA improves or degrades the performance of hierarchical clustering.
# Create a hierarchical clustering model on the first 7 PCs (about 91% of the variance): wisc.pr.hclust
wisc.pr.hclust <- hclust(dist(wisc.pr$x[, 1:7]), method = "complete")
# Cut model into 4 clusters: wisc.pr.hclust.clusters
wisc.pr.hclust.clusters <- cutree(wisc.pr.hclust, k = 4)
# Compare to actual diagnoses
table(diagnosis, wisc.pr.hclust.clusters)
##          wisc.pr.hclust.clusters
## diagnosis   1   2   3   4
##         0   5 350   2   0
##         1 113  97   0   2
# Compare to k-means and hierarchical
table(diagnosis, wisc.hclust.clusters)
##          wisc.hclust.clusters
## diagnosis   1   2   3   4
##         0  12   2 343   0
##         1 165   5  40   2
table(diagnosis, wisc.km$cluster)
##
## diagnosis   1   2
##         0  14 343
##         1 175  37
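One hedged way to put a single number on each comparison is the purity of the clustering: assign every cluster its majority diagnosis and count the fraction of observations that agree. The sketch below hard-codes the three tables printed above (rows are diagnosis 0 and 1, columns are clusters). The helper table.purity is hypothetical, and other metrics could rank the models differently.

```r
# Counts copied from the three tables above (rows: diagnosis 0, 1)
pca.hclust <- matrix(c(  5, 350,   2, 0,
                       113,  97,   0, 2), nrow = 2, byrow = TRUE)
raw.hclust <- matrix(c( 12,   2, 343, 0,
                       165,   5,  40, 2), nrow = 2, byrow = TRUE)
raw.kmeans <- matrix(c( 14, 343,
                       175,  37), nrow = 2, byrow = TRUE)

# Fraction of observations that match their cluster's majority diagnosis
table.purity <- function(tab) sum(apply(tab, 2, max)) / sum(tab)

results <- c(pca.hclust = table.purity(pca.hclust),
             raw.hclust = table.purity(raw.hclust),
             raw.kmeans = table.purity(raw.kmeans))
round(results, 3)
```

On these counts, the four-cluster cut on the first seven principal components scores lower (about 0.821) than hierarchical clustering (about 0.905) or k-means (about 0.910) on the scaled data, largely because cluster 2 absorbs 97 of the 212 diagnosis-1 cases.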
Case study wrap-up