Hierarchical clustering is another popular method for clustering. The goal of this chapter is to go over how it works, how to use it, and how it compares to k-means clustering.

library(readr)
library(dplyr)
library(ggplot2)
library(stringr)

2.1: Hierarchical clustering with results

In this exercise, you will create your first hierarchical clustering model using the hclust() function.

We have created some two-dimensional data and placed it in a variable called x. Your task is to create a hierarchical clustering model of x. Remember from the video that the first step in hierarchical clustering is measuring how dissimilar the observations are. You will do this with the dist() function, which computes the pairwise distances between rows (Euclidean by default).

You will look at the structure of the resulting model using the summary() function.

x <- read.csv("Datacamp_R_Unsupervised_Learning_Chapter2_x.csv")
head(x)
# Create hierarchical clustering model: hclust.out
hclust.out <- hclust(dist(x))
# Inspect the result
summary(hclust.out)
            Length Class  Mode     
merge       98     -none- numeric  
height      49     -none- numeric  
order       50     -none- numeric  
labels       0     -none- NULL     
method       1     -none- character
call         2     -none- call     
dist.method  1     -none- character

2.2: Cutting the tree

Remember from the video that cutree() is the R function that cuts a hierarchical model. The h and k arguments to cutree() allow you to cut the tree based on a certain height h or a certain number of clusters k.

In this exercise, you will use cutree() to cut the hierarchical model you created earlier based on each of these two criteria.

# Cut by height
cutree(hclust.out, h = 7) 
 [1] 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[50] 2
plot(hclust.out)
abline(h = 7, col = "red")

# Cut by number of clusters
cutree(hclust.out, k = 3)
 [1] 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[50] 2

Remark: The original dataset x has 50 observations, so each cutree() call returns a vector of 50 cluster assignments, one per observation, in the original row order. Here, cutting at height 7 and asking for 3 clusters produce the same assignments.

2.3: Linkage methods

In this exercise, you will produce hierarchical clustering models using different linkages and plot the dendrogram for each, observing the overall structure of the trees.

You’ll be asked to interpret the results in the next exercise.

# Cluster using complete linkage: hclust.complete
hclust.complete <- hclust(dist(x), method = "complete")
# Cluster using average linkage: hclust.average
hclust.average <- hclust(dist(x), method = "average")
# Cluster using single linkage: hclust.single
hclust.single <- hclust(dist(x), method = "single")
# Plot dendrogram of hclust.complete
plot(hclust.complete, main = "Complete")
abline(h = 7, col = "red")

# Plot dendrogram of hclust.average
plot(hclust.average, main = "Average")
abline(h = 4.5, col = "red")

# Plot dendrogram of hclust.single
plot(hclust.single, main = "Single")
abline(h = 1.5, col = "red")

# Cut by height
#cutree(hclust.complete, h = 7)
# Cut by number of clusters
#cutree(hclust.complete, k = 3)

Remarks: Whether you want balanced or unbalanced trees for your hierarchical clustering model depends on the context of the problem you’re trying to solve. Balanced trees are preferable if you want a roughly equal number of observations assigned to each cluster. If you want to detect outliers, on the other hand, an unbalanced tree is more desirable, because cutting it tends to put most observations in one cluster and only a few in the others. Complete and average linkage tend to produce more balanced trees, while single linkage often produces unbalanced ones.

2.4: Practical matters: scaling

Recall from the video that clustering real data may require scaling the features if they have different distributions. So far in this chapter, you have been working with synthetic data that did not need scaling.

In this exercise, you will go back to working with “real” data, the pokemon dataset introduced in the first chapter. You will observe the distribution (mean and standard deviation) of each feature, scale the data accordingly, then produce a hierarchical clustering model using the complete linkage method.

pokemon_raw <- read_csv('Pokemon.csv')
Parsed with column specification:
cols(
  `#` = col_integer(),
  Name = col_character(),
  `Type 1` = col_character(),
  `Type 2` = col_character(),
  Total = col_integer(),
  HP = col_integer(),
  Attack = col_integer(),
  Defense = col_integer(),
  `Sp. Atk` = col_integer(),
  `Sp. Def` = col_integer(),
  Speed = col_integer(),
  Generation = col_integer(),
  Legendary = col_logical()
)
#head(pokemon_raw)
pokemon <- pokemon_raw %>% select(6:11)
head(pokemon)
#str(pokemon)
# View column means
colMeans(pokemon)
      HP   Attack  Defense  Sp. Atk  Sp. Def    Speed 
69.25875 79.00125 73.84250 72.82000 71.90250 68.27750 
# View column standard deviations
apply(pokemon,2,sd)
      HP   Attack  Defense  Sp. Atk  Sp. Def    Speed 
25.53467 32.45737 31.18350 32.72229 27.82892 29.06047 
# Scale the data
pokemon.scaled<-scale(pokemon)
# Create hierarchical clustering model: hclust.pokemon
hclust.pokemon<-hclust(dist(pokemon.scaled), method="complete")
# Apply cutree() to hclust.pokemon: cut.pokemon
cut.pokemon<-cutree(hclust.pokemon,k=3)
##############################################################
# Initialize total within sum of squares error: wss
wss <- 0
# Loop over 1 to 15 possible clusters
for (i in 1:15) {
  # Fit the model: km.pokemon
  km.pokemon <- kmeans(pokemon, centers = i, nstart = 20, iter.max = 50)
  # Save the within cluster sum of squares
  wss[i] <- km.pokemon$tot.withinss
}
# Produce a scree plot
plot(1:15, wss, type = "b", 
     xlab = "Number of Clusters", 
     ylab = "Within groups sum of squares")

# Select number of clusters
k <- 3
# Build model with k clusters: km.pokemon (fit on the unscaled pokemon data)
km.pokemon <- kmeans(pokemon, centers = k, nstart = 20, iter.max = 50)
#####################################################################
# Compare methods
table(km.pokemon$cluster, cut.pokemon)
   cut.pokemon
      1   2   3
  1 267   3   0
  2 171   3   1
  3 350   5   0

Remarks: In the table, rows are k-means clusters and columns are hierarchical clusters. The hierarchical clustering model assigns almost all observations (788 of 800) to cluster 1, while the k-means algorithm distributes the observations more evenly among its three clusters. Note that km.pokemon was fit on the unscaled pokemon data while hclust.pokemon was fit on pokemon.scaled, so the table reflects the difference in scaling as well as the difference in algorithm. There’s no consensus on which method produces better clusters. The job of the analyst in unsupervised clustering is to examine the cluster assignments and make a judgment call as to which method provides more insight into the data.

---
title: "Datacamp R - Unsupervised Learning in R : Chapter 2 (Hierarchical clustering)"
author: "Chen Weiqiang"
date: "November 28, 2018"
output: html_notebook
---


Hierarchical clustering is another popular method for clustering. The goal of this chapter is to go over how it works, how to use it, and how it compares to k-means clustering.

```{r}
library(readr)
library(dplyr)
library(ggplot2)
library(stringr)
```

# 2.1: Hierarchical clustering with results

In this exercise, you will create your first hierarchical clustering model using the hclust() function.

We have created some two-dimensional data and placed it in a variable called x. Your task is to create a hierarchical clustering model of x. Remember from the video that the first step in hierarchical clustering is measuring how dissimilar the observations are. You will do this with the dist() function, which computes the pairwise distances between rows (Euclidean by default).

You will look at the structure of the resulting model using the summary() function.

Instructions

- Fit a hierarchical clustering model to x using the hclust() function. Store the result in hclust.out.

- Inspect the result with the summary() function.

```{r}
x <- read.csv("Datacamp_R_Unsupervised_Learning_Chapter2_x.csv")
head(x)
# Create hierarchical clustering model: hclust.out
hclust.out <- hclust(dist(x))

# Inspect the result
summary(hclust.out)
```
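
To make the structure of an hclust object more concrete, here is a minimal sketch on made-up data. The matrix `toy` and the seed are assumptions for illustration only, not the course's x. With n observations there are n - 1 merges, so for 50 observations summary() reports a merge matrix with 49 x 2 = 98 entries, 49 heights, and an order vector of length 50.

```r
# Hypothetical toy data: 10 points in two dimensions (not the course's x)
set.seed(1)
toy <- matrix(rnorm(20), ncol = 2)

# dist() returns the pairwise Euclidean distances (its default method)
d <- dist(toy)
length(d)            # one distance per pair of rows: choose(10, 2)

# hclust() uses complete linkage unless a method is given
hc <- hclust(d)
hc$method            # linkage actually used
dim(hc$merge)        # n - 1 merges, each joining two clusters
length(hc$height)    # one height per merge
length(hc$order)     # one entry per observation
```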

# 2.2: Cutting the tree

Remember from the video that cutree() is the R function that cuts a hierarchical model. The h and k arguments to cutree() allow you to cut the tree based on a certain height h or a certain number of clusters k.

In this exercise, you will use cutree() to cut the hierarchical model you created earlier based on each of these two criteria.

Instructions

- The hclust.out model you created earlier is available in your workspace.

- Cut the hclust.out model at height 7.

- Cut the hclust.out model to create 3 clusters.

```{r}
# Cut by height
cutree(hclust.out, h = 7) 

plot(hclust.out)
abline(h = 7, col = "red")

# Cut by number of clusters
cutree(hclust.out, k = 3)
```

Remark: The original dataset x has 50 observations, so each cutree() call returns a vector of 50 cluster assignments, one per observation, in the original row order. Here, cutting at height 7 and asking for 3 clusters produce the same assignments.
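
The relationship between the h and k arguments can be sketched on made-up data. Everything below (`toy`, the seed, the group means) is an assumption for illustration. The idea is that any height between the merge that leaves 3 clusters and the merge that leaves 2 clusters yields the same partition as k = 3.

```r
# Hypothetical toy data: three well-separated groups of 5 points each
set.seed(2)
toy <- rbind(matrix(rnorm(10, mean = 0,  sd = 0.3), ncol = 2),
             matrix(rnorm(10, mean = 5,  sd = 0.3), ncol = 2),
             matrix(rnorm(10, mean = 10, sd = 0.3), ncol = 2))
hc <- hclust(dist(toy))

# Cut by number of clusters: one label per observation
by_k <- cutree(hc, k = 3)
length(by_k)
table(by_k)          # cluster sizes

# hc$height is sorted; merge n - 3 leaves 3 clusters, merge n - 2 leaves 2.
# A height halfway between the two therefore cuts the tree into 3 clusters.
n <- nrow(toy)
h_cut <- mean(hc$height[c(n - 3, n - 2)])
by_h <- cutree(hc, h = h_cut)

all(by_k == by_h)    # same assignments either way
```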

# 2.3: Linkage methods

In this exercise, you will produce hierarchical clustering models using different linkages and plot the dendrogram for each, observing the overall structure of the trees.

You'll be asked to interpret the results in the next exercise.

Instructions

- Produce three hierarchical clustering models on x using the "complete", "average", and "single" linkage methods, respectively.

- Plot a dendrogram for each model, using titles of "Complete", "Average", and "Single", respectively.

```{r}
# Cluster using complete linkage: hclust.complete
hclust.complete <- hclust(dist(x), method = "complete")

# Cluster using average linkage: hclust.average
hclust.average <- hclust(dist(x), method = "average")

# Cluster using single linkage: hclust.single
hclust.single <- hclust(dist(x), method = "single")

# Plot dendrogram of hclust.complete
plot(hclust.complete, main = "Complete")
abline(h = 7, col = "red")

# Plot dendrogram of hclust.average
plot(hclust.average, main = "Average")
abline(h = 4.5, col = "red")

# Plot dendrogram of hclust.single
plot(hclust.single, main = "Single")
abline(h = 1.5, col = "red")

# Cut by height
#cutree(hclust.complete, h = 7)
# Cut by number of clusters
#cutree(hclust.complete, k = 3)


```

Remarks: Whether you want balanced or unbalanced trees for your hierarchical clustering model depends on the context of the problem you're trying to solve. Balanced trees are preferable if you want a roughly equal number of observations assigned to each cluster. If you want to detect outliers, on the other hand, an unbalanced tree is more desirable, because cutting it tends to put most observations in one cluster and only a few in the others. Complete and average linkage tend to produce more balanced trees, while single linkage often produces unbalanced ones.
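
As a small illustration of balanced versus unbalanced trees, the sketch below uses eight hand-picked points on a line whose gaps grow steadily. The data and variable names are assumptions chosen so the result can be followed by hand. Single linkage chains neighbouring points together and splits off only the last one, while complete linkage splits the points evenly.

```r
# Hypothetical one-dimensional data with steadily growing gaps
pts <- matrix(c(0, 1, 2.1, 3.3, 4.6, 6.0, 7.5, 9.1), ncol = 1)
d <- dist(pts)

hc.complete <- hclust(d, method = "complete")
hc.single   <- hclust(d, method = "single")

# Cluster sizes when each tree is cut into 2 clusters
sizes.complete <- as.vector(table(cutree(hc.complete, k = 2)))
sizes.single   <- as.vector(table(cutree(hc.single,   k = 2)))

sizes.complete   # 4 4: a balanced split
sizes.single     # 7 1: the last point is isolated
```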

# 2.4: Practical matters: scaling

Recall from the video that clustering real data may require scaling the features if they have different distributions. So far in this chapter, you have been working with synthetic data that did not need scaling.

In this exercise, you will go back to working with "real" data, the pokemon dataset introduced in the first chapter. You will observe the distribution (mean and standard deviation) of each feature, scale the data accordingly, then produce a hierarchical clustering model using the complete linkage method.

Instructions

- The data is stored in the pokemon object in your workspace.

- Observe the mean of each variable in pokemon using the colMeans() function.

- Observe the standard deviation of each variable using the apply() and sd() functions. Since the variables are the columns of your matrix, make sure to specify 2 as the MARGIN argument to apply().

- Scale the pokemon data using the scale() function and store the result in pokemon.scaled.

- Create a hierarchical clustering model of the pokemon.scaled data using the complete linkage method. Manually specify the method argument and store the result in hclust.pokemon.

```{r}
pokemon_raw <- read_csv('Pokemon.csv')
#head(pokemon_raw)

pokemon <- pokemon_raw %>% select(6:11)
head(pokemon)
#str(pokemon)


# View column means
colMeans(pokemon)

# View column standard deviations
apply(pokemon,2,sd)

# Scale the data
pokemon.scaled<-scale(pokemon)

# Create hierarchical clustering model: hclust.pokemon
hclust.pokemon<-hclust(dist(pokemon.scaled), method="complete")
```
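
To see what scale() does and why it matters for distance-based clustering, here is a minimal sketch on a made-up data frame. The column names `small` and `large` are hypothetical. Before scaling, the Euclidean distances are driven almost entirely by the column with the larger spread. After scaling, each column has mean 0 and standard deviation 1, so both contribute comparably.

```r
# Hypothetical data: two features with very different spreads
toy <- data.frame(small = c(1.5, 1.6, 1.7, 1.8),
                  large = c(500, 800, 600, 900))

colMeans(toy)
apply(toy, 2, sd)

# Unscaled distances are dominated by the 'large' column
round(dist(toy), 1)

# scale() centers each column and divides it by its standard deviation
toy.scaled <- scale(toy)
scaled.means <- colMeans(toy.scaled)     # approximately 0
scaled.sds   <- apply(toy.scaled, 2, sd) # 1

round(dist(toy.scaled), 2)
```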



```{r}
# Apply cutree() to hclust.pokemon: cut.pokemon
cut.pokemon<-cutree(hclust.pokemon,k=3)
##############################################################
# Initialize total within sum of squares error: wss
wss <- 0

# Loop over 1 to 15 possible clusters
for (i in 1:15) {
  # Fit the model: km.pokemon
  km.pokemon <- kmeans(pokemon, centers = i, nstart = 20, iter.max = 50)
  # Save the within cluster sum of squares
  wss[i] <- km.pokemon$tot.withinss
}

# Produce a scree plot
plot(1:15, wss, type = "b", 
     xlab = "Number of Clusters", 
     ylab = "Within groups sum of squares")

# Select number of clusters
k <- 3

# Build model with k clusters: km.pokemon (fit on the unscaled pokemon data)
km.pokemon <- kmeans(pokemon, centers = k, nstart = 20, iter.max = 50)
#####################################################################
# Compare methods
table(km.pokemon$cluster, cut.pokemon)
```

Remarks: In the table, rows are k-means clusters and columns are hierarchical clusters. The hierarchical clustering model assigns almost all of the observations to cluster 1, while the k-means algorithm distributes the observations more evenly among its three clusters. Note that km.pokemon was fit on the unscaled pokemon data while hclust.pokemon was fit on pokemon.scaled, so the table reflects the difference in scaling as well as the difference in algorithm. There's no consensus on which method produces better clusters. The job of the analyst in unsupervised clustering is to examine the cluster assignments and make a judgment call as to which method provides more insight into the data.
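
For reading such a cross-tabulation, the sketch below fits both methods to the same scaled, made-up data, so that only the algorithm differs. The names `toy`, `km.toy` and `cut.toy` and the seed are assumptions, not part of the exercise. Cluster labels are arbitrary, so agreement between the two methods shows up as a single non-zero cell in each row rather than as a diagonal.

```r
# Hypothetical toy data: three well-separated groups of 10 points each
set.seed(42)
toy <- rbind(matrix(rnorm(20, mean = 0,  sd = 0.3), ncol = 2),
             matrix(rnorm(20, mean = 5,  sd = 0.3), ncol = 2),
             matrix(rnorm(20, mean = 10, sd = 0.3), ncol = 2))
toy.scaled <- scale(toy)

# Fit both methods on the same scaled data
km.toy  <- kmeans(toy.scaled, centers = 3, nstart = 20)
cut.toy <- cutree(hclust(dist(toy.scaled), method = "complete"), k = 3)

# Rows: k-means clusters; columns: hierarchical clusters
tab <- table(kmeans = km.toy$cluster, hclust = cut.toy)
tab
```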