Let’s load some libraries:
library('ggplot2')
library('cluster')
library('apcluster')
Here’s the spread_data from Nic’s doc:
group_size = 30
spread_data = data.frame(x = c(rnorm(group_size), rnorm(group_size, 0),
rnorm(group_size, 10), rnorm(group_size, 20),
rnorm(group_size, 20)),
y = c(rnorm(group_size), rnorm(group_size, 20),
rnorm(group_size, 10), rnorm(group_size, 20),
rnorm(group_size)))
spread_data$kind = "Spread"
ggplot(spread_data, aes(x=x, y=y, color=kind)) + geom_point()
Let’s count how many clusters we get when we cut an agnes dendrogram at a given height.
spread_agnes = agnes(spread_data[,1:2], stand=FALSE, method="average")
# Number of clusters that are more than 10 apart on average (average linkage).
# Add 1 because the sum counts the merges above the cut height, not the clusters.
sum(spread_agnes$height > 10) + 1
## [1] 5
plot(spread_agnes, which.plots=2)
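As a sanity check on the "count the heights, add 1" trick, here is a minimal sketch that compares it with cutting the tree directly. The toy data, the seed, and the cutoff of 5 are assumptions made up for this example; `as.hclust()` and `cutree()` are the standard conversions from the cluster and stats packages.

```r
library(cluster)

set.seed(1)  # assumed seed, only so the sketch is repeatable
# Hypothetical toy data: three well-separated 2-D groups
toy <- data.frame(x = c(rnorm(20, 0), rnorm(20, 10), rnorm(20, 20)),
                  y = c(rnorm(20, 0), rnorm(20, 10), rnorm(20, 0)))

toy_agnes <- agnes(toy, stand = FALSE, method = "average")

# Approach used above: count the merges above the cutoff, then add 1
n_by_height <- sum(toy_agnes$height > 5) + 1

# Alternative: cut the equivalent hclust tree at the same height
# and count the distinct cluster labels
cut_labels <- cutree(as.hclust(toy_agnes), h = 5)
n_by_cutree <- length(unique(cut_labels))

n_by_height == n_by_cutree
```

The `cutree()` route also gives you the cluster label of every point, which the height count alone does not.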
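If you use a different linkage, the merge heights change, so the same cutoff can give a different result. Single linkage, for example, uses the distance between the closest pair of points rather than the average.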
spread_agnes_single = agnes(spread_data[,1:2], stand=FALSE, method="single")
# Number of clusters whose closest points are more than 10 apart (single linkage).
# Add 1 because the sum counts the merges above the cut height, not the clusters.
sum(spread_agnes_single$height > 10) + 1
## [1] 3
plot(spread_agnes_single, which.plots=2)
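To see why the two linkages disagree, it helps to compute both distances by hand for one pair of clusters. This is a small sketch with made-up points (the clusters `a` and `b` are hypothetical and not taken from spread_data), using only base R.

```r
# Two hypothetical clusters of 2-D points
a <- rbind(c(0, 0), c(1, 0))
b <- rbind(c(4, 0), c(9, 0))

# Pairwise Euclidean distances between the points of a (rows) and b (columns)
d <- as.matrix(dist(rbind(a, b)))[1:2, 3:4]

single_link  <- min(d)   # closest pair, (1,0) to (4,0): 3
average_link <- mean(d)  # mean of 4, 9, 3, 8: 6
```

With a cutoff of 5, these two clusters would stay separate under average linkage but would already be merged under single linkage.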
Let’s add a scaled copy of the data, with every coordinate divided by 100 so that all distances are 1/100 of the originals.
scaled_data <- as.data.frame(scale(spread_data[, 1:2], center=FALSE, scale=c(100, 100)))
scaled_data$kind = "Scaled"
all_data <- rbind(spread_data, scaled_data)
ggplot(all_data, aes(x=x, y=y, color=kind)) + geom_point()
Now let’s run the same clustering on the scaled data. The merge heights are 100 times smaller, so the same cutoff of 10 gives a different number of clusters.
scaled_agnes = agnes(scaled_data[,1:2], stand=FALSE, method="average")
# Same cutoff as before, but every merge height is now 1/100 of its original value.
# Add 1 because the sum counts the merges above the cut height, not the clusters.
sum(scaled_agnes$height > 10) + 1
## [1] 1
plot(scaled_agnes, which.plots=2)
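One way to read this result: the heights are in the units of the data, so a fixed cutoff only makes sense if it is rescaled along with the data. The sketch below illustrates that on made-up data (the toy data frame, the seed, and the variable names are assumptions for this example, not part of the analysis above).

```r
library(cluster)

set.seed(1)  # assumed seed, only so the sketch is repeatable
# Hypothetical toy data: two well-separated 2-D groups
toy <- data.frame(x = c(rnorm(20, 0), rnorm(20, 20)),
                  y = c(rnorm(20, 0), rnorm(20, 20)))
toy_scaled <- toy / 100  # same scaling as above: every coordinate divided by 100

toy_agnes        <- agnes(toy, stand = FALSE, method = "average")
toy_scaled_agnes <- agnes(toy_scaled, stand = FALSE, method = "average")

# The merge heights shrink by the same factor as the data
heights_match <- isTRUE(all.equal(sort(toy_agnes$height),
                                  sort(toy_scaled_agnes$height) * 100))

# Dividing the cutoff by 100 as well gives back the original count
cutoff   <- 10
n_orig   <- sum(toy_agnes$height > cutoff) + 1
n_scaled <- sum(toy_scaled_agnes$height > cutoff / 100) + 1
```

So the cluster count from a height cutoff is not scale-invariant, but it is consistent as long as the cutoff is expressed in the same units as the data.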