Anna Shirokanova
15 11 2021
There are two classic approaches to cluster analysis: top-down (K-means, PAM) and bottom-up (hierarchical clustering).
K-means is very fast but keeps things simple (sometimes too simple).
Hierarchical clustering, once built, can quickly produce solutions for various numbers of clusters, and there is a choice of linkage criteria to match the data. Because it works bottom-up, it requires more memory and computation time when there are > 30,000 observations.
(Source: https://jmonlong.github.io/Hippocamplus/2018/02/13/tsne-and-clustering/).
Let’s look at iris data:
We know the species from the beginning and thus can compare clustering solutions against it
In real life, the ‘species’ is unknown, and the solution has to be tested against reality
In this dataset, the object features are metric only.
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
Petal and Sepal
Petal.Length and Petal.Width are similar within the same species but vary considerably between different species:
## item group1 vars n mean sd
## Sepal.Length1 1 setosa 1 50 5.006 0.3524897
## Sepal.Length2 2 versicolor 1 50 5.936 0.5161711
## Sepal.Length3 3 virginica 1 50 6.588 0.6358796
## Sepal.Width1 4 setosa 2 50 3.428 0.3790644
## Sepal.Width2 5 versicolor 2 50 2.770 0.3137983
## Sepal.Width3 6 virginica 2 50 2.974 0.3224966
## Petal.Length1 7 setosa 3 50 1.462 0.1736640
## Petal.Length2 8 versicolor 3 50 4.260 0.4699110
## Petal.Length3 9 virginica 3 50 5.552 0.5518947
## Petal.Width1 10 setosa 4 50 0.246 0.1053856
## Petal.Width2 11 versicolor 4 50 1.326 0.1977527
## Petal.Width3 12 virginica 4 50 2.026 0.2746501
Let us visualize their distribution:
library(ggplot2)
ggplot(iris, aes(Petal.Length, Petal.Width, color = Species)) +
geom_point() +
theme_classic()
set.seed(20) # to ensure reproducibility (could be omitted, the number will be random)
library(sjPlot)
tab_corr(iris[-5])
|  | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width |
|---|---|---|---|---|
| Sepal.Length |  | -0.118 | 0.872*** | 0.818*** |
| Sepal.Width | -0.118 |  | -0.428*** | -0.366*** |
| Petal.Length | 0.872*** | -0.428*** |  | 0.963*** |
| Petal.Width | 0.818*** | -0.366*** | 0.963*** |  |
| Computed correlation used pearson-method with listwise-deletion. | ||||
irisCluster <- kmeans(iris[ , c("Petal.Length", "Petal.Width")], # Petal.Length and Petal.Width
3, # how many groups to locate
nstart = 20 # R will try 20 different random starting assignments
# and then select the one with the lowest within cluster variation.
)
Let's compare the results of clustering against the pre-set classification:
##
## setosa versicolor virginica
## 1 0 48 4
## 2 50 0 0
## 3 0 2 46
Six objects (2 + 4) have been misidentified.
What if we standardize the data first?
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v tibble 3.1.5 v dplyr 1.0.7
## v tidyr 1.1.4 v stringr 1.4.0
## v readr 2.0.2 v forcats 0.5.1
## v purrr 0.3.4
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x ggplot2::%+%() masks psych::%+%()
## x ggplot2::alpha() masks psych::alpha()
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
iris2 <- iris[ , c("Petal.Length", "Petal.Width")] %>%
scale() %>%
as.data.frame()
irisCluster2 <- kmeans(iris2,
3,
nstart = 20)
table(irisCluster2$cluster, iris$Species)
##
## setosa versicolor virginica
## 1 50 0 0
## 2 0 2 46
## 3 0 48 4
We can use the 'elbow method' to determine the optimal k.
To do this, we run several models with k = 1, ..., 10 and compare the within-cluster sum of squares (the sum of squared distances from each observation to its cluster centroid) for each k.
A low within-cluster sum of squares indicates high intra-cluster similarity, i.e. tight, homogeneous clusters.
The optimal number of clusters (k) is where the 'elbow' occurs, i.e. the point after which the within-sum-of-squares stops decreasing dramatically.
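A minimal sketch of the elbow computation (assuming the scaled petal data iris2 created above; the code actually used for the plot is not shown):
wss <- sapply(1:10, function(k) kmeans(iris2, centers = k, nstart = 20)$tot.withinss)
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares") # the 'elbow' marks a candidate k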
The silhouette width statistic:
"The average silhouette approach measures the quality of a clustering. That is, it determines how well each object lies within its cluster.
A high average silhouette width indicates a good clustering."
Source: https://uc-r.github.io/kmeans_clustering
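As a sketch (assuming the factoextra package), the average silhouette width across k can be plotted in one call:
library(factoextra)
fviz_nbclust(iris2, kmeans, method = "silhouette") # average silhouette width for k = 2..10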
The gap statistic:
"The approach can be applied to any clustering method (i.e. K-means clustering, hierarchical clustering).
The gap statistic compares the total intracluster variation for different values of k with their expected values under null reference distribution of the data (i.e. a distribution with no obvious clustering)." Source: https://uc-r.github.io/kmeans_clustering#gap
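A sketch of the gap statistic for the same data, using cluster::clusGap with bootstrapped reference datasets (the number of bootstraps B is an assumption):
library(cluster)
gap_stat <- clusGap(iris2, FUNcluster = kmeans, nstart = 20, K.max = 10, B = 50)
fviz_gap_stat(gap_stat) # plot from factoextra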
According to our scree plot (elbow plot), the optimal k is 2 or 3.
To visualize the solution in two dimensions, use:
fviz_cluster(irisCluster, data = iris[ ,3:4],
ellipse.type = "convex",
palette = "jco",
             repel = TRUE)
Compare this solution to the one based on ALL the variables:
Conclusion: k-means works well when domain knowledge is rich.
If little is known, use the most differentiating variables for clustering. The decision on the number of clusters can be based on metrics.
In this example, irises are best clustered using 2 of the 4 variables.
library(palmerpenguins)
library(psych)
describe(penguins) # flipper length and body mass have much larger variances
## vars n mean sd median trimmed mad min max
## species* 1 344 1.92 0.89 2.00 1.90 1.48 1.0 3.0
## island* 2 344 1.66 0.73 2.00 1.58 1.48 1.0 3.0
## bill_length_mm 3 342 43.92 5.46 44.45 43.91 7.04 32.1 59.6
## bill_depth_mm 4 342 17.15 1.97 17.30 17.17 2.22 13.1 21.5
## flipper_length_mm 5 342 200.92 14.06 197.00 200.34 16.31 172.0 231.0
## body_mass_g 6 342 4201.75 801.95 4050.00 4154.01 889.56 2700.0 6300.0
## sex* 7 333 1.50 0.50 2.00 1.51 0.00 1.0 2.0
## year 8 344 2008.03 0.82 2008.00 2008.04 1.48 2007.0 2009.0
## range skew kurtosis se
## species* 2.0 0.16 -1.73 0.05
## island* 2.0 0.61 -0.91 0.04
## bill_length_mm 27.5 0.05 -0.89 0.30
## bill_depth_mm 8.4 -0.14 -0.92 0.11
## flipper_length_mm 59.0 0.34 -1.00 0.76
## body_mass_g 3600.0 0.47 -0.74 43.36
## sex* 1.0 -0.02 -2.01 0.03
## year 2.0 -0.05 -1.51 0.04
## bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## Min. :32.10 Min. :13.10 Min. :172.0 Min. :2700
## 1st Qu.:39.23 1st Qu.:15.60 1st Qu.:190.0 1st Qu.:3550
## Median :44.45 Median :17.30 Median :197.0 Median :4050
## Mean :43.92 Mean :17.15 Mean :200.9 Mean :4202
## 3rd Qu.:48.50 3rd Qu.:18.70 3rd Qu.:213.0 3rd Qu.:4750
## Max. :59.60 Max. :21.50 Max. :231.0 Max. :6300
## NA's :2 NA's :2 NA's :2 NA's :2
|  | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g |
|---|---|---|---|---|
| bill_length_mm |  | -0.235*** | 0.656*** | 0.595*** |
| bill_depth_mm | -0.235*** |  | -0.584*** | -0.472*** |
| flipper_length_mm | 0.656*** | -0.584*** |  | 0.871*** |
| body_mass_g | 0.595*** | -0.472*** | 0.871*** |  |
| Computed correlation used pearson-method with listwise-deletion. | ||||
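The object peng used below is not created in the code shown; presumably it holds the four numeric measurements with incomplete rows dropped, e.g. (a sketch):
peng <- penguins %>%
  dplyr::select(contains("mm"), contains("body")) %>% # the four body measurements
  na.omit()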
peng_sc <- as.data.frame(scale(peng))
tab_corr(peng_sc) # if you scale the data, the correlations would remain the same
|  | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g |
|---|---|---|---|---|
| bill_length_mm |  | -0.235*** | 0.656*** | 0.595*** |
| bill_depth_mm | -0.235*** |  | -0.584*** | -0.472*** |
| flipper_length_mm | 0.656*** | -0.584*** |  | 0.871*** |
| body_mass_g | 0.595*** | -0.472*** | 0.871*** |  |
| Computed correlation used pearson-method with listwise-deletion. | ||||
## vars n mean sd median trimmed mad min max range skew
## bill_length_mm 1 342 0 1 0.10 0.00 1.29 -2.17 2.87 5.04 0.05
## bill_depth_mm 2 342 0 1 0.08 0.01 1.13 -2.05 2.20 4.25 -0.14
## flipper_length_mm 3 342 0 1 -0.28 -0.04 1.16 -2.06 2.14 4.20 0.34
## body_mass_g 4 342 0 1 -0.19 -0.06 1.11 -1.87 2.62 4.49 0.47
## kurtosis se
## bill_length_mm -0.89 0.05
## bill_depth_mm -0.92 0.05
## flipper_length_mm -1.00 0.05
## body_mass_g -0.74 0.05
In the gap-statistic plot, the error bars come from the simulated reference datasets. The gap is the difference between the within-cluster variation expected for hypothetical unclustered data and the variation observed in the dataset at hand.
pCluster <- kmeans(peng[, c(1, 3)], 3)
penguins1 <- penguins %>%
dplyr::select(species, contains("mm"), contains("body")) %>%
na.omit()
table(pCluster$cluster, penguins1$species) # 55 misfits
##
## Adelie Chinstrap Gentoo
## 1 38 54 1
## 2 111 9 0
## 3 2 5 122
pCluster <- kmeans(peng, 3) # on all 4 variables, 112 misfits
table(pCluster$cluster, penguins1$species)
##
## Adelie Chinstrap Gentoo
## 1 0 0 81
## 2 53 22 42
## 3 98 46 0
##
## Adelie Chinstrap Gentoo
## 1 14 5 114
## 2 137 63 9
Over to you!
The choice of the metric may have a large impact depending on what data you have.
iris.use <- subset(iris, select = -Species) # select all variables but the "Species"
rownames(iris.use) <- c()
d <- dist(iris.use, method = 'euclidean') # calculate a distance matrix on selected variables
See an example here:
Figure. Three clustering methods for the target-looking data.
What distance to choose?
| distance | metric description |
|---|---|
| euclidean | usual distance between the two vectors (interval variables) |
| manhattan | absolute distance between the two vectors |
| canberra | for non-negative values (e.g., counts) |
| minkowski | can be used for both ordinal and quantitative variables |
| gower | the weighted mean of the contributions of each variable (for mixed types) |
| binary | Jaccard distance for dichotomous variables |
Q: How to choose the correct distance?
A: Learning + practice.
“An appropriate dissimilarity measure is far more important in obtaining success with clustering than choice of clustering algorithm.” https://stats.stackexchange.com/questions/3713/choosing-a-clustering-method
Read more here: http://www.sthda.com/english/wiki/clarifying-distance-measures-unsupervised-machine-learning#distances-and-scaling
Once you have the distance matrix, a plot with distances between observations and a dendrogram can be produced:
In this plot, clusters are squares of red color.
How many clusters can you see here, 2 or 3?
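Such a plot can be produced, for example, with factoextra's fviz_dist (a sketch; the original call is not shown):
library(factoextra)
fviz_dist(d, show_labels = FALSE) # ordered heatmap of the distance matrix; tight clusters appear as red squares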
library(cluster)
hfit_ward <- agnes(d, method = "ward") # Ward: merge the two clusters giving the smallest increase in total within-cluster variance
hfit_average <- agnes(d, method = "average") # average linkage: average distance between all pairs of observations, one from each cluster
hfit_complete <- agnes(d, method = "complete") # complete linkage: maximal distance between observations, one from each cluster
Click here to read more about these methods: https://www.stat.cmu.edu/~cshalizi/350/lectures/08/lecture-08.pdf
##
## setosa versicolor virginica
## 1 50 0 0
## 2 0 49 15
## 3 0 1 35
##
## setosa versicolor virginica
## 1 50 0 0
## 2 0 50 14
## 3 0 0 36
##
## setosa versicolor virginica
## 1 50 0 0
## 2 0 23 49
## 3 0 27 1
In our case, the average linkage method (2nd table) seems to work best, as it produces the fewest misclassified observations.
In the real world, you may look at the dendrogram to evaluate how well a particular method has clustered the data:
h_avg_cut <- hcut(d, k = 3, hc_method = "average")
fviz_cluster(h_avg_cut, iris.use, ellipse.type = "convex")
Let's further visualise the results!
cl <- cutree(hfit_average, k = 3) # cut the dendrogram to obtain exactly 3 clusters
library(dplyr)
iris <- mutate(iris, cluster = cl) # add cluster membership as a column to the data frame
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species cluster
## 1 5.1 3.5 1.4 0.2 setosa 1
## 2 4.9 3.0 1.4 0.2 setosa 1
## 3 4.7 3.2 1.3 0.2 setosa 1
## 4 4.6 3.1 1.5 0.2 setosa 1
## 5 5.0 3.6 1.4 0.2 setosa 1
## 6 5.4 3.9 1.7 0.4 setosa 1
library(magrittr)
iris[ ,-5] %>%
group_by(cluster) %>%
  summarise_all(mean) # get the mean values of all variables by cluster
## # A tibble: 3 x 5
## cluster Sepal.Length Sepal.Width Petal.Length Petal.Width
## <int> <dbl> <dbl> <dbl> <dbl>
## 1 1 5.01 3.43 1.46 0.246
## 2 2 5.93 2.76 4.41 1.44
## 3 3 6.85 3.08 5.79 2.10
Show the result on the dendrogram by coloring the branches (optional)
A more elaborate option:
library(dendextend)
dend <- as.dendrogram(hfit_average)
dend_col <- color_branches(dend, k = 3)
labels_colors(dend_col) <- "white"
plot(dend_col)
This is how the dendrogram was cut to obtain three clusters.
Yet another possibility is to cut the dendrogram at a given height. The height in a dendrogram reflects the combination of linkage method and distance.
Here, the dendrogram is built from a Euclidean distance matrix with average linkage. Cutting it at a height of 1.8 means that clusters are merged only as long as the average Euclidean distance between the groups being merged does not exceed 1.8:
h_avg <- hclust(d, method = "average")
cl2 <- cutree(h_avg, h = 1.8)
dend <- as.dendrogram(h_avg)
dend_col <- color_branches(dend, h = 1.8)
plot(dend_col, labels = FALSE)
Better dendrograms: http://stackoverflow.com/questions/14118033/horizontal-dendrogram-in-r-with-labels
A bonus below (optional)
The ‘banner plot’ below:
“The white area on the left of the banner plot represents the unclustered data, while the white lines that stick into the red area show the heights at which the clusters were formed. Since we don't want to include too many clusters that joined together at similar heights, it looks like three clusters, at a height of about 2, is a good solution. It's clear from the banner plot that if we lowered the height to, say, 1.5, we'd create a fourth cluster with only a few observations.
The banner plot is just an alternative to the dendrogram."
Link on banner plot: https://www.stat.berkeley.edu/~s133/Cluster2a.html
Link on agglomeration coefficient: https://datascience.stackexchange.com/questions/18860/how-to-interpret-agglomerative-coefficient-agnes-function
Clustering data of mixed types (e.g., continuous, ordinal, and nominal) is often of interest.
Read more: https://www.r-bloggers.com/clustering-mixed-data-types-in-r/
For illustration, use the College dataset: https://rdrr.io/cran/ISLR/man/College.html
All the variables:
set.seed(2021) # for reproducibility
library(ISLR) # the college dataset
library(Rtsne) # for t-SNE plot
library(tidyverse)
## Private Apps Accept Enroll Top10perc Top25perc
## Abilene Christian University Yes 1660 1232 721 23 52
## Adelphi University Yes 2186 1924 512 16 29
## Adrian College Yes 1428 1097 336 22 50
## F.Undergrad P.Undergrad Outstate Room.Board Books
## Abilene Christian University 2885 537 7440 3300 450
## Adelphi University 2683 1227 12280 6450 750
## Adrian College 1036 99 11250 3750 400
## Personal PhD Terminal S.F.Ratio perc.alumni Expend
## Abilene Christian University 2200 70 78 18.1 12 7041
## Adelphi University 1500 29 30 12.2 16 10527
## Adrian College 1165 53 66 12.9 30 8735
## Grad.Rate
## Abilene Christian University 60
## Adelphi University 56
## Adrian College 54
Let’s evaluate the distributions:
What is the college with a graduation rate of 118%?
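The outlier can be located with a simple filter (a sketch of the likely call, which is not shown above):
College[College$Grad.Rate > 100, ]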
## Private Apps Accept Enroll Top10perc Top25perc F.Undergrad
## Cazenovia College Yes 3847 3433 527 9 35 1010
## P.Undergrad Outstate Room.Board Books Personal PhD Terminal
## Cazenovia College 12 9384 4840 600 500 22 47
## S.F.Ratio perc.alumni Expend Grad.Rate
## Cazenovia College 14.3 20 7697 118
This is a right-skewed variable which we will now transform into a new one:
college_clean = College %>%
mutate(
name = row.names(.),
# make college names a variable
accept_rate = Accept / Apps,
# ratio of accepted students to applications
isElite = cut(
Top10perc,
# how many top-10% students are enrolled, % of all students
breaks = c(0, 50, 100),
labels = c("Not Elite", "Elite"),
include.lowest = TRUE
)
) %>%
mutate(isElite = factor(isElite)) %>% # label colleges with many good students
select(name, accept_rate, Outstate, Enroll, # "elites"
         Grad.Rate, Private, isElite)
Enroll has a heavy right skew, which may negatively affect the distance matrix calculation, so we are going to use it log-transformed.
## Rows: 777
## Columns: 7
## $ name <chr> "Abilene Christian University", "Adelphi University", "Adr~
## $ accept_rate <dbl> 0.7421687, 0.8801464, 0.7682073, 0.8369305, 0.7564767, 0.8~
## $ Outstate <dbl> 7440, 12280, 11250, 12960, 7560, 13500, 13290, 13868, 1559~
## $ Enroll <dbl> 721, 512, 336, 137, 55, 158, 103, 489, 227, 172, 472, 484,~
## $ Grad.Rate <dbl> 60, 56, 54, 59, 15, 55, 63, 73, 80, 52, 73, 76, 74, 68, 55~
## $ Private <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes~
## $ isElite <fct> Not Elite, Not Elite, Not Elite, Elite, Not Elite, Not Eli~
Create a distance object (taking out the name of the school in the 1st column)
library(cluster)
gower_dist <- daisy(college_clean[ , -1],
metric = "gower",
type = list(logratio = 3)) # It means: "Use Enroll as log(Enroll)"
summary(gower_dist)
## 301476 dissimilarities, summarized :
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00186 0.10344 0.23587 0.23145 0.32714 0.77735
## Metric : mixed ; Types = I, I, I, I, N, N
## Number of objects : 777
Check the attributes to ensure the correct methods are being used (I = interval, N = nominal)
The most similar pair of colleges:
gower_mat <- as.matrix(gower_dist) # convert the dissimilarity object to a matrix first
college_clean[
  which(gower_mat == min(gower_mat[gower_mat != min(gower_mat)]), # lowest observed value
        arr.ind = TRUE)[1, ], ]
## name accept_rate
## University of St. Thomas MN University of St. Thomas MN 0.8784638 11712
## John Carroll University John Carroll University 0.8711276 11700
## Enroll Grad.Rate Private isElite
## University of St. Thomas MN 828 89 Yes Not Elite
## John Carroll University 820 89 Yes Not Elite
The most dissimilar pair of colleges:
college_clean[
which(gower_mat == max(gower_mat[gower_mat != max(gower_mat)]), # highest observed value
arr.ind = TRUE)[1, ], ]## name
## University of Sci. and Arts of Oklahoma University of Sci. and Arts of Oklahoma
## Harvard University Harvard University
## accept_rate Outstate Enroll Grad.Rate
## University of Sci. and Arts of Oklahoma 0.9824561 3687 208 43
## Harvard University 0.1561486 18485 1606 100
## Private isElite
## University of Sci. and Arts of Oklahoma No Not Elite
## Harvard University Yes Elite
These pairs make sense.
We will use partitioning around medoids (PAM), because it can work directly with a custom distance matrix.
Q: How does PAM work?
A: PAM is an iterative algorithm. A 'medoid' is the observation within a cluster whose average dissimilarity to all other observations in that cluster is the lowest; it serves as the cluster's representative. PAM works well with n < 10,000 observations.
Q: How many clusters should be retained in the solution?
A: Look at the silhouette width metric. This is an internal validation metric, an aggregated measure of how similar an observation is to its own cluster, compared to its closest neighboring cluster. The metric can range from -1 to 1, where higher values are better.
sil_width <- c(NA)
for(i in 2:10){
pam_fit <- pam(gower_dist,
diss = TRUE,
k = i)
sil_width[i] <- pam_fit$silinfo$avg.width
}
Plot the silhouette width (larger value is better):
plot(1:10, sil_width,
xlab = "Number of clusters", xaxt='n',
ylab = "Silhouette Width",
ylim = c(0,1))
axis(1, at = seq(2, 10, by = 1), las=2)
lines(1:10, sil_width)
The same graph in one line:
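One compact alternative (a sketch, not necessarily the call used here; it simply wraps the loop above):
sw <- sapply(2:10, function(k) pam(gower_dist, diss = TRUE, k = k)$silinfo$avg.width)
plot(2:10, sw, type = "b", xlab = "Number of clusters", ylab = "Silhouette width")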
Conclusion: select the 3-cluster solution, as it has the highest silhouette width.
pam_fit <- pam(gower_dist, diss = TRUE, k = 3) # re-fit PAM with the chosen k = 3 (the loop above leaves pam_fit at k = 10)
pam_results <- college_clean %>%
dplyr::select(-name) %>%
mutate(cluster = pam_fit$clustering) %>%
group_by(cluster) %>%
do(the_summary = summary(.))
pam_results$the_summary
## [[1]]
## accept_rate Outstate Enroll Grad.Rate Private
## Min. :0.3283 Min. : 2340 Min. : 35.0 Min. : 15.00 No : 0
## 1st Qu.:0.7225 1st Qu.: 8842 1st Qu.: 194.8 1st Qu.: 56.00 Yes:500
## Median :0.8004 Median :10905 Median : 308.0 Median : 67.50
## Mean :0.7820 Mean :11200 Mean : 418.6 Mean : 66.97
## 3rd Qu.:0.8581 3rd Qu.:13240 3rd Qu.: 484.8 3rd Qu.: 78.25
## Max. :1.0000 Max. :21700 Max. :4615.0 Max. :118.00
## isElite cluster
## Not Elite:500 Min. :1
## Elite : 0 1st Qu.:1
## Median :1
## Mean :1
## 3rd Qu.:1
## Max. :1
##
## [[2]]
## accept_rate Outstate Enroll Grad.Rate Private
## Min. :0.1545 Min. : 5224 Min. : 137.0 Min. : 54.00 No : 4
## 1st Qu.:0.4135 1st Qu.:13850 1st Qu.: 391.0 1st Qu.: 77.00 Yes:65
## Median :0.5329 Median :17238 Median : 601.0 Median : 89.00
## Mean :0.5392 Mean :16225 Mean : 882.5 Mean : 84.78
## 3rd Qu.:0.6988 3rd Qu.:18590 3rd Qu.:1191.0 3rd Qu.: 94.00
## Max. :0.9605 Max. :20100 Max. :4893.0 Max. :100.00
## isElite cluster
## Not Elite: 0 Min. :2
## Elite :69 1st Qu.:2
## Median :2
## Mean :2
## 3rd Qu.:2
## Max. :2
##
## [[3]]
## accept_rate Outstate Enroll Grad.Rate Private
## Min. :0.3746 Min. : 2580 Min. : 153 Min. : 10.00 No :208
## 1st Qu.:0.6423 1st Qu.: 5295 1st Qu.: 694 1st Qu.: 46.00 Yes: 0
## Median :0.7458 Median : 6598 Median :1302 Median : 54.50
## Mean :0.7315 Mean : 6698 Mean :1615 Mean : 55.42
## 3rd Qu.:0.8368 3rd Qu.: 7748 3rd Qu.:2184 3rd Qu.: 65.00
## Max. :1.0000 Max. :15516 Max. :6392 Max. :100.00
## isElite cluster
## Not Elite:199 Min. :3
## Elite : 9 1st Qu.:3
## Median :3
## Mean :3
## 3rd Qu.:3
## Max. :3
college_clean$cluster <- as.factor(pam_fit$clustering)
describeBy(college_clean, college_clean$cluster)
##
## Descriptive statistics by group
## group: 1
## vars n mean sd median trimmed mad min max
## name* 1 500 250.50 144.48 250.5 250.50 185.32 1.00 500
## accept_rate 2 500 0.78 0.11 0.8 0.79 0.10 0.33 1
## Outstate 3 500 11199.56 3330.84 10905.0 11033.57 3157.94 2340.00 21700
## Enroll 4 500 418.57 430.94 308.0 342.51 199.41 35.00 4615
## Grad.Rate 5 500 66.97 16.20 67.5 67.42 17.05 15.00 118
## Private* 6 500 2.00 0.00 2.0 2.00 0.00 2.00 2
## isElite* 7 500 1.00 0.00 1.0 1.00 0.00 1.00 1
## cluster* 8 500 1.00 0.00 1.0 1.00 0.00 1.00 1
## range skew kurtosis se
## name* 499.00 0.00 -1.21 6.46
## accept_rate 0.67 -0.94 1.32 0.01
## Outstate 19360.00 0.45 0.07 148.96
## Enroll 4580.00 4.68 31.74 19.27
## Grad.Rate 103.00 -0.33 0.26 0.72
## Private* 0.00 NaN NaN 0.00
## isElite* 0.00 NaN NaN 0.00
## cluster* 0.00 NaN NaN 0.00
## ------------------------------------------------------------
## group: 2
## vars n mean sd median trimmed mad min max
## name* 1 69 35.00 20.06 35.00 35.00 25.20 1.00 69.00
## accept_rate 2 69 0.54 0.20 0.53 0.54 0.21 0.15 0.96
## Outstate 3 69 16224.51 3210.41 17238.00 16563.04 2345.47 5224.00 20100.00
## Enroll 4 69 882.48 813.12 601.00 740.65 464.05 137.00 4893.00
## Grad.Rate 5 69 84.78 11.98 89.00 85.72 11.86 54.00 100.00
## Private* 6 69 1.94 0.24 2.00 2.00 0.00 1.00 2.00
## isElite* 7 69 2.00 0.00 2.00 2.00 0.00 2.00 2.00
## cluster* 8 69 2.00 0.00 2.00 2.00 0.00 2.00 2.00
## range skew kurtosis se
## name* 68.00 0.00 -1.25 2.42
## accept_rate 0.81 0.11 -0.91 0.02
## Outstate 14876.00 -1.09 0.61 386.49
## Enroll 4756.00 2.41 7.56 97.89
## Grad.Rate 46.00 -0.65 -0.64 1.44
## Private* 1.00 -3.70 11.87 0.03
## isElite* 0.00 NaN NaN 0.00
## cluster* 0.00 NaN NaN 0.00
## ------------------------------------------------------------
## group: 3
## vars n mean sd median trimmed mad min max
## name* 1 208 104.50 60.19 104.50 104.50 77.10 1.00 208
## accept_rate 2 208 0.73 0.14 0.75 0.74 0.14 0.37 1
## Outstate 3 208 6697.75 1980.73 6598.50 6558.74 1801.36 2580.00 15516
## Enroll 4 208 1614.72 1246.02 1302.00 1445.40 1034.11 153.00 6392
## Grad.Rate 5 208 55.42 13.98 54.50 55.17 13.34 10.00 100
## Private* 6 208 1.00 0.00 1.00 1.00 0.00 1.00 1
## isElite* 7 208 1.04 0.20 1.00 1.00 0.00 1.00 2
## cluster* 8 208 3.00 0.00 3.00 3.00 0.00 3.00 3
## range skew kurtosis se
## name* 207.00 0.00 -1.22 4.17
## accept_rate 0.63 -0.35 -0.41 0.01
## Outstate 12936.00 0.90 2.01 137.34
## Enroll 6239.00 1.44 2.29 86.40
## Grad.Rate 90.00 0.17 0.37 0.97
## Private* 0.00 NaN NaN 0.00
## isElite* 1.00 4.46 17.95 0.01
## cluster* 0.00 NaN NaN 0.00
Cluster 1 is mainly Private/Not Elite with medium levels of out of state tuition and smaller levels of enrollment.
Cluster 2, on the other hand, is mainly Private/Elite with lower levels of acceptance rates, high levels of out of state tuition, and high graduation rates.
Finally, cluster 3 is mainly Public/Not Elite with the lowest levels of tuition, largest levels of enrollment, and lowest graduation rate.
Hint: medoids serve as exemplars of each cluster:
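The medoid rows shown below can be pulled directly from the PAM object (a sketch; with a dissimilarity matrix as input, pam_fit$medoids contains the indices of the medoid observations):
college_clean[pam_fit$medoids, ]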
## name accept_rate
## Saint Francis College Saint Francis College 0.7877629
## Barnard College Barnard College 0.5616987
## Grand Valley State University Grand Valley State University 0.7525653
## Outstate Enroll Grad.Rate Private isElite
## Saint Francis College 10880 284 69 Yes Not Elite
## Barnard College 17926 531 91 Yes Elite
## Grand Valley State University 6108 1561 57 No Not Elite
## cluster
## Saint Francis College 1
## Barnard College 2
## Grand Valley State University 3
To visualize many variables in a lower dimensional space, use t-distributed stochastic neighborhood embedding, or t-SNE.
It tries to preserve local structure so as to make clusters visible in 2D or 3D.
set.seed(42)
tsne_obj <- Rtsne(gower_dist, is_distance = TRUE)
tsne_data <- tsne_obj$Y %>%
# $Y is a matrix containing the new representations for the objects
data.frame() %>%
setNames(c("X", "Y")) %>%
mutate(cluster = factor(pam_fit$clustering),
name = college_clean$name)
ggplot(aes(x = X, y = Y), data = tsne_data) +
  geom_point(aes(color = cluster))
Have you noticed the small group?
It consists of larger, more competitive public schools: not numerous enough to warrant an additional cluster according to the silhouette width, but these 13 schools certainly have distinct characteristics.
tsne_data %>%
filter(X > 12 & X < 15, # cluster coordinates may differ
Y > -15 & Y < -12) %>%
left_join(college_clean, by = "name") %>%
collect %>%
  .[["name"]]
## [1] "College of William and Mary"
## [2] "Georgia Institute of Technology"
## [3] "SUNY at Binghamton"
## [4] "SUNY College at Geneseo"
## [5] "Trenton State College"
## [6] "University of California at Berkeley"
## [7] "University of California at Irvine"
## [8] "University of Florida"
## [9] "University of Illinois - Urbana"
## [10] "University of Michigan at Ann Arbor"
## [11] "University of Minnesota at Morris"
## [12] "University of North Carolina at Chapel Hill"
## [13] "University of Virginia"
Explore the data: http://colleges.usnews.rankingsandreviews.com/best-colleges/william-and-mary-3705
Let us repeat the analysis using 4 clusters and see whether the results are satisfactory.
pam_fit2 <- pam(gower_dist, diss = TRUE, k = 4)
pam_results2 <- college_clean %>%
dplyr::select(-name) %>%
mutate(cluster = pam_fit2$clustering) %>%
group_by(cluster) %>%
do(the_summary = summary(.))
pam_results2$the_summary
## [[1]]
## accept_rate Outstate Enroll Grad.Rate Private
## Min. :0.3283 Min. : 2340 Min. : 35.0 Min. : 15.00 No : 0
## 1st Qu.:0.7423 1st Qu.: 8031 1st Qu.: 155.2 1st Qu.: 50.00 Yes:266
## Median :0.8128 Median : 9185 Median : 216.0 Median : 58.00
## Mean :0.7927 Mean : 9292 Mean : 274.1 Mean : 57.36
## 3rd Qu.:0.8674 3rd Qu.:10500 3rd Qu.: 297.2 3rd Qu.: 65.75
## Max. :1.0000 Max. :21700 Max. :4615.0 Max. :100.00
## isElite cluster
## Not Elite:265 Min. :1
## Elite : 1 1st Qu.:1
## Median :1
## Mean :1
## 3rd Qu.:1
## Max. :1
##
## [[2]]
## accept_rate Outstate Enroll Grad.Rate Private
## Min. :0.4387 Min. : 4990 Min. : 96.0 Min. : 46.0 No : 0
## 1st Qu.:0.7086 1st Qu.:11600 1st Qu.: 338.0 1st Qu.: 70.0 Yes:235
## Median :0.7882 Median :12925 Median : 465.0 Median : 78.0
## Mean :0.7699 Mean :13354 Mean : 580.9 Mean : 77.8
## 3rd Qu.:0.8488 3rd Qu.:14995 3rd Qu.: 665.5 3rd Qu.: 84.0
## Max. :1.0000 Max. :19964 Max. :3810.0 Max. :118.0
## isElite cluster
## Not Elite:235 Min. :2
## Elite : 0 1st Qu.:2
## Median :2
## Mean :2
## 3rd Qu.:2
## Max. :2
##
## [[3]]
## accept_rate Outstate Enroll Grad.Rate Private
## Min. :0.1545 Min. : 5224 Min. : 137.0 Min. : 59.00 No : 4
## 1st Qu.:0.4132 1st Qu.:14232 1st Qu.: 411.2 1st Qu.: 77.00 Yes:64
## Median :0.5308 Median :17267 Median : 616.5 Median : 89.50
## Mean :0.5355 Mean :16318 Mean : 893.3 Mean : 85.24
## 3rd Qu.:0.6900 3rd Qu.:18599 3rd Qu.:1197.5 3rd Qu.: 94.25
## Max. :0.9605 Max. :20100 Max. :4893.0 Max. :100.00
## isElite cluster
## Not Elite: 0 Min. :3
## Elite :68 1st Qu.:3
## Median :3
## Mean :3
## 3rd Qu.:3
## Max. :3
##
## [[4]]
## accept_rate Outstate Enroll Grad.Rate Private
## Min. :0.3746 Min. : 2580 Min. : 153 Min. : 10.00 No :208
## 1st Qu.:0.6423 1st Qu.: 5295 1st Qu.: 694 1st Qu.: 46.00 Yes: 0
## Median :0.7458 Median : 6598 Median :1302 Median : 54.50
## Mean :0.7315 Mean : 6698 Mean :1615 Mean : 55.42
## 3rd Qu.:0.8368 3rd Qu.: 7748 3rd Qu.:2184 3rd Qu.: 65.00
## Max. :1.0000 Max. :15516 Max. :6392 Max. :100.00
## isElite cluster
## Not Elite:199 Min. :4
## Elite : 9 1st Qu.:4
## Median :4
## Mean :4
## 3rd Qu.:4
## Max. :4
## name accept_rate
## University of Charleston University of Charleston 0.7844575
## Merrimack College Merrimack College 0.7778900
## Barnard College Barnard College 0.5616987
## Grand Valley State University Grand Valley State University 0.7525653
## Outstate Enroll Grad.Rate Private isElite
## University of Charleston 9500 204 57 Yes Not Elite
## Merrimack College 12500 514 80 Yes Not Elite
## Barnard College 17926 531 91 Yes Elite
## Grand Valley State University 6108 1561 57 No Not Elite
## cluster
## University of Charleston 1
## Merrimack College 1
## Barnard College 2
## Grand Valley State University 3
set.seed(42)
tsne_obj2 <- Rtsne(gower_dist, is_distance = TRUE)
tsne_data2 <- tsne_obj2$Y %>%
# $Y is a matrix containing the new representations for the objects
data.frame() %>%
setNames(c("X", "Y")) %>%
mutate(cluster = factor(pam_fit2$clustering),
name = college_clean$name)
ggplot(aes(x = X, y = Y), data = tsne_data2) +
  geom_point(aes(color = cluster))
Nope.
A 4-cluster solution is less satisfactory than a 3-cluster one.
However, the smallest cluster as represented by t-SNE can be re-coded manually.
levels(tsne_data$cluster) <- c("1", "2", "3", "4")
tsne_data$cluster[tsne_data$X > 12 &
tsne_data$X < 15 & tsne_data$Y > -15 & tsne_data$Y < -12] <- 4
ggplot(aes(x = X, y = Y), data = tsne_data) +
  geom_point(aes(color = cluster))
##
## Descriptive statistics by group
## group: 1
## vars n mean sd median trimmed mad min max
## name* 1 500 250.50 144.48 250.5 250.50 185.32 1.00 500
## accept_rate 2 500 0.78 0.11 0.8 0.79 0.10 0.33 1
## Outstate 3 500 11199.56 3330.84 10905.0 11033.57 3157.94 2340.00 21700
## Enroll 4 500 418.57 430.94 308.0 342.51 199.41 35.00 4615
## Grad.Rate 5 500 66.97 16.20 67.5 67.42 17.05 15.00 118
## Private* 6 500 2.00 0.00 2.0 2.00 0.00 2.00 2
## isElite* 7 500 1.00 0.00 1.0 1.00 0.00 1.00 1
## cluster* 8 500 1.00 0.00 1.0 1.00 0.00 1.00 1
## range skew kurtosis se
## name* 499.00 0.00 -1.21 6.46
## accept_rate 0.67 -0.94 1.32 0.01
## Outstate 19360.00 0.45 0.07 148.96
## Enroll 4580.00 4.68 31.74 19.27
## Grad.Rate 103.00 -0.33 0.26 0.72
## Private* 0.00 NaN NaN 0.00
## isElite* 0.00 NaN NaN 0.00
## cluster* 0.00 NaN NaN 0.00
## ------------------------------------------------------------
## group: 2
## vars n mean sd median trimmed mad min max
## name* 1 65 33.00 18.91 33.00 33.00 23.72 1.00 65.00
## accept_rate 2 65 0.54 0.20 0.53 0.54 0.22 0.15 0.96
## Outstate 3 65 16433.52 3163.36 17475.00 16844.92 2157.18 5224.00 20100.00
## Enroll 4 65 752.12 544.77 573.00 677.53 424.02 137.00 2505.00
## Grad.Rate 5 65 84.57 12.20 89.00 85.53 11.86 54.00 100.00
## Private* 6 65 2.00 0.00 2.00 2.00 0.00 2.00 2.00
## isElite* 7 65 2.00 0.00 2.00 2.00 0.00 2.00 2.00
## cluster* 8 65 2.00 0.00 2.00 2.00 0.00 2.00 2.00
## range skew kurtosis se
## name* 64.00 0.00 -1.26 2.35
## accept_rate 0.81 0.06 -0.93 0.03
## Outstate 14876.00 -1.28 1.20 392.37
## Enroll 2368.00 1.32 1.36 67.57
## Grad.Rate 46.00 -0.61 -0.74 1.51
## Private* 0.00 NaN NaN 0.00
## isElite* 0.00 NaN NaN 0.00
## cluster* 0.00 NaN NaN 0.00
## ------------------------------------------------------------
## group: 3
## vars n mean sd median trimmed mad min max
## name* 1 199 100.00 57.59 100.00 100.00 74.13 1.00 199
## accept_rate 2 199 0.74 0.14 0.75 0.74 0.14 0.37 1
## Outstate 3 199 6649.42 1969.22 6597.00 6516.04 1848.80 2580.00 15516
## Enroll 4 199 1577.97 1219.52 1292.00 1412.39 1009.65 153.00 6392
## Grad.Rate 5 199 54.64 13.63 54.00 54.41 13.34 10.00 100
## Private* 6 199 1.00 0.00 1.00 1.00 0.00 1.00 1
## isElite* 7 199 1.00 0.00 1.00 1.00 0.00 1.00 1
## cluster* 8 199 3.00 0.00 3.00 3.00 0.00 3.00 3
## range skew kurtosis se
## name* 198.00 0.00 -1.22 4.08
## accept_rate 0.63 -0.37 -0.34 0.01
## Outstate 12936.00 0.91 2.14 139.59
## Enroll 6239.00 1.47 2.50 86.45
## Grad.Rate 90.00 0.20 0.63 0.97
## Private* 0.00 NaN NaN 0.00
## isElite* 0.00 NaN NaN 0.00
## cluster* 0.00 NaN NaN 0.00
## ------------------------------------------------------------
## group: 4
## vars n mean sd median trimmed mad min max
## name* 1 13 7.00 3.89 7.00 7.00 4.45 1.00 13.00
## accept_rate 2 13 0.54 0.14 0.53 0.54 0.17 0.34 0.78
## Outstate 3 13 9323.77 3108.54 8400.00 9098.73 2833.25 5391.00 15732.00
## Enroll 4 13 2603.69 1541.66 2478.00 2505.00 1697.58 588.00 5705.00
## Grad.Rate 5 13 77.46 11.98 80.00 78.27 10.38 51.00 95.00
## Private* 6 13 1.00 0.00 1.00 1.00 0.00 1.00 1.00
## isElite* 7 13 2.00 0.00 2.00 2.00 0.00 2.00 2.00
## cluster* 8 13 2.69 0.48 3.00 2.73 0.00 2.00 3.00
## range skew kurtosis se
## name* 12.00 0.00 -1.48 1.08
## accept_rate 0.44 0.22 -1.51 0.04
## Outstate 10341.00 0.49 -1.08 862.15
## Enroll 5117.00 0.51 -0.90 427.58
## Grad.Rate 44.00 -0.52 -0.42 3.32
## Private* 0.00 NaN NaN 0.00
## isElite* 0.00 NaN NaN 0.00
## cluster* 1.00 -0.74 -1.56 0.13
To conclude, there are four clusters among the 777 colleges.
Cluster 1 are mainly private, non-elite colleges with medium levels of out-of-state tuition and smaller levels of student enrollment.
Cluster 2 consists of private, elite colleges with lower acceptance rates and higher rates of out-of-state tuition and graduation.
Cluster 3 is mainly public and non-elite, with the lowest rates of tuition and graduation but the largest levels of enrollment.
Lastly, Cluster 4 consists of 13 public colleges with rather high enrollment, low acceptance rates, medium tuition, and high graduation rates.
Q: So, what would our college clusters look like when plotted with principal components?
A: Let’s see:
pam_fit <- pam(college_clean[, -1], k = 3) # pass the raw data here
fviz_cluster(pam_fit,
data = college_clean,
palette = "Set2",
ggtheme = theme_minimal(),
geom = "point")pam_fit <- pam(college_clean[, -1], k = 2) # pass the raw data here
fviz_cluster(pam_fit,
data = college_clean,
palette = "Set2",
ggtheme = theme_minimal(),
geom = "point")It doesn’t look good to me.
Table of Difference between PCA and t-SNE (Source: https://www.geeksforgeeks.org/difference-between-pca-vs-t-sne/)
| PCA | t-SNE |
|---|---|
| is a linear dimensionality-reduction technique | is a non-linear dimensionality-reduction technique |
| tries to preserve the global structure of the data | tries to preserve the local structure of the data* |
| does not involve hyperparameters | has hyperparameters (perplexity, learning rate, number of steps) |
| is highly affected by outliers | can handle outliers |
| is deterministic (can be reproduced on another dataset) | is a non-deterministic (randomised) algorithm |
| rotates vectors to preserve variance | minimises the distances between points based on Gaussian similarities |
| the number of components is chosen using eigenvalues | preserves distances rather than variance, controlled by hyperparameters |
*"t-SNE preserves only local similarities, whereas PCA preserves large pairwise distances to maximize variance" (https://medium.com/analytics-vidhya/pca-vs-t-sne-17bcd882bf3d)
Bottom line:
The best comparative discussion so far, in my opinion: https://stats.stackexchange.com/questions/238538/are-there-cases-where-pca-is-more-suitable-than-t-sne
Q: What is t-SNE in practice?
A:
- t-SNE is a method of non-linear dimensionality reduction (it can separate data that cannot be separated by any straight line)
- t-SNE is iterative, so unlike PCA you cannot apply it to another dataset (e.g., in a train/test split)
When to apply: t-SNE is mostly used to understand high-dimensional data and project it into a low-dimensional space (2D or 3D)
! It is non-deterministic and iterative, so each run can produce a different result (Source: https://towardsdatascience.com/t-sne-clearly-explained-d84c537f53a)
! You cannot see relative sizes of clusters in a t-SNE plot (Source: https://distill.pub/2016/misread-tsne/)
Example: An empirical comparison on the pictures of sign language (Source: https://towardsdatascience.com/dimensionality-reduction-for-data-visualization-pca-vs-tsne-vs-umap-be4aa7b1cb29)
Q: How to interpret t-SNE axes? Do they compare to principal components?
A: “Individual axes in t-SNE have no meaning at all. Algorithms such as MDS, SNE, t-SNE, etc. only care about pairwise distances between points.” https://stats.stackexchange.com/questions/348927/what-is-the-meaning-of-the-axes-in-t-sne
“Unlike PCA, axes in the low dimensional space don’t have a particular meaning. In fact, one could arbitrarily rotate the low dimensional points and the t-SNE cost function wouldn’t change. The relevant information is in the relative distances between low dimensional points. t-SNE captures structure in the sense that neighboring points in the input space will tend to be neighbors in the low dimensional space.”
“t-SNE does not scale well with size of dataset, while PCA does” https://stats.stackexchange.com/questions/238538/are-there-cases-where-pca-is-more-suitable-than-t-sne
Q: What is the best use of t-SNE?
A: “use t-SNE for visualization (and try different parameters to get something visually pleasing!), but rather do not run clustering afterwards” https://stats.stackexchange.com/questions/263539/clustering-on-the-output-of-t-sne
One way that t-SNE visualizations can be useful is by combining them with external information. https://stats.stackexchange.com/questions/331745/how-to-interpret-t-sne-plot
Fuzzy clustering: finding cluster solutions with cluster-membership probabilities
library(cluster)
fuzzy_fit <- fanny(iris[ ,3:4], 3, metric = "euclidean", stand = FALSE)
head(fuzzy_fit$membership, 3)
## [,1] [,2] [,3]
## [1,] 0.9717502 0.01658896 0.01166088
## [2,] 0.9717502 0.01658896 0.01166088
## [3,] 0.9460213 0.03157233 0.02240638
fviz_cluster(fuzzy_fit,
ellipse.type = "norm",
repel = TRUE,
palette = "jco",
ggtheme = theme_minimal(),
legend = "right")Another function for fuzzy C-means:
https://cran.r-project.org/web/packages/ppclust/vignettes/fcm.html
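The membership degrees shown below were presumably produced with ppclust's fcm() along these lines (a sketch; the exact call is not shown):
library(ppclust)
fcm_fit <- fcm(iris[ ,3:4], centers = 3) # fuzzy C-means on Petal.Length and Petal.Width
head(fcm_fit$u, 3)                       # membership degrees of the first three observations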
## Cluster 1 Cluster 2 Cluster 3
## 1 0.001993004 0.9971282 0.0008788244
## 2 0.014852798 0.9787709 0.0063762576
## 3 0.012332674 0.9821882 0.0054790852
## Cluster 1 Cluster 2 Cluster 3
## 0.74697950 0.02831727 0.22470322
##
## 1 2 3
## 1 0 50 0
## 2 64 0 0
## 3 0 0 36
fviz_cluster(ppclust2(fcm_fit, "kmeans"), data = iris[ ,3:4],
ellipse.type = "norm",
repel = TRUE,
palette = "jco",
ggtheme = theme_minimal(),
legend = "right")Read more about DBSCAN: http://www.sthda.com/english/wiki/wiki.php?id_contents=7940
“Two important parameters are required for DBSCAN: epsilon ("eps") and minimum points ("MinPts"). The parameter eps defines the radius of the neighborhood around a point x. It's called the ε-neighborhood of x. The parameter MinPts is the minimum number of neighbors within the "eps" radius."
Data source: https://www.r-bloggers.com/chaining-effect-in-clustering/
In what follows, I replicate the example of a Christmas tree-based dataset clustered with DBSCAN.
DBSCAN requires setting hyperparameters, which is why below I will try and compare different solutions to find the best one.
First, I reproduce the data picture with 500 points.
How would hierarchical clustering work here? Would it be able to locate 24 clusters?
library(factoextra)
centers <- as.data.frame(km$centers) # these are the 500 data points that you see
d <- dist(centers, method ="euclidean")
hc <- hclust(d, method = "single")
g <- fviz_dend(hc, k=24, show_labels = F) # we get a dendrogram
How do these clusters correspond to the data?
library(ggrepel)
cluster <- cutree(hc, k = 24)
c <- centers %>% bind_cols(as.data.frame(cluster)) %>%
group_by(cluster) %>%
summarise(x = first(x), y = first(y))
g <- centers %>% bind_cols(as.data.frame(cluster)) %>%
ggplot(aes(x, y, color = as.factor(cluster))) +
geom_point() +
geom_label_repel(data = c,
aes(x, y, label = cluster),
box.padding = 1) +
theme(legend.position = "none")
g
Yes, it is a success.
Let’s compare the result to DBSCAN.
To find out the optimal eps parameter, one can “calculate the average of the distances of every point to its k nearest neighbors. Next, these k-distances are plotted in an ascending order. The aim is to determine the “knee”, which corresponds to the optimal eps parameter. A knee corresponds to a threshold where a sharp change occurs along the k-distance curve."
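A sketch of such a k-distance plot for the 500 points (assuming the dbscan package; the cut-off drawn here is an assumption chosen by eye):
library(dbscan)
kNNdistplot(centers, k = 2) # sorted distances to the 2nd nearest neighbour
abline(h = 20, lty = 2)     # a candidate eps where the curve bends sharply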
dbscan(centers,
eps = 28, # the radius of neighborhood around a point x
       MinPts = 2) # the minimum number of neighbors within "eps" radius
## DBSCAN clustering for 500 objects.
## Parameters: eps = 28, minPts = 2
## The clustering contains 21 cluster(s) and 0 noise points.
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## 18 14 12 131 57 3 11 5 15 11 78 11 8 89 11 3 3 8 6 3
## 21
## 3
##
## Available fields: cluster, eps, minPts
This solution finds only 21 clusters (and no noise points), short of the 24 we expect.
Let us try varying the parameters and compare the results:
## DBSCAN clustering for 500 objects.
## Parameters: eps = 40, minPts = 2
## The clustering contains 16 cluster(s) and 0 noise points.
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 157 17 12 57 3 11 93 11 11 8 89 11 3 8 6 3
##
## Available fields: cluster, eps, minPts
## DBSCAN clustering for 500 objects.
## Parameters: eps = 30, minPts = 2
## The clustering contains 19 cluster(s) and 0 noise points.
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
## 18 14 12 136 57 3 11 93 11 11 8 89 11 3 3 8 6 3 3
##
## Available fields: cluster, eps, minPts
## DBSCAN clustering for 500 objects.
## Parameters: eps = 28, minPts = 2
## The clustering contains 21 cluster(s) and 0 noise points.
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## 18 14 12 131 57 3 11 5 15 11 78 11 8 89 11 3 3 8 6 3
## 21
## 3
##
## Available fields: cluster, eps, minPts
res4 <- dbscan(centers, eps = 20, MinPts = 2) # 23 clusters plus a few noise points = the best solution, see the plots
res4
## DBSCAN clustering for 500 objects.
## Parameters: eps = 20, minPts = 2
## The clustering contains 23 cluster(s) and 4 noise points.
##
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
## 4 18 14 12 130 57 3 11 5 15 11 52 19 11 8 86 11 7 3 8
## 20 21 22 23
## 6 3 3 3
##
## Available fields: cluster, eps, minPts
## DBSCAN clustering for 500 objects.
## Parameters: eps = 10, minPts = 2
## The clustering contains 27 cluster(s) and 367 noise points.
##
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
## 367 18 5 15 4 3 5 11 9 3 7 10 5 2 3 8 2 2 2 2
## 20 21 22 23 24 25 26 27
## 2 2 3 2 2 2 2 2
##
## Available fields: cluster, eps, minPts
h_avg_cut2 <- hcut(centers, k = 24, hc_method = "average")
fviz_cluster(h_avg_cut2, centers, ellipse.type = "convex")
hfit_ward2 <- hcut(centers, k = 24, hc_method = "ward.D2")
fviz_cluster(hfit_ward2, centers, ellipse.type = "convex")
Compared to k-means or hierarchical clustering with average or Ward linkage, the density-based DBSCAN method is much more efficient here and comparable to single-linkage hierarchical clustering.
Classify US states by crime types
## Murder Assault UrbanPop Rape
## Alabama 13.2 236 58 21.2
## Alaska 10.0 263 48 44.5
## Arizona 8.1 294 80 31.0
## Arkansas 8.8 190 50 19.5
## California 9.0 276 91 40.6
## Colorado 7.9 204 78 38.7
Tasks for practice:
## Murder Assault UrbanPop Rape
## Min. : 0.800 Min. : 45.0 Min. :32.00 Min. : 7.30
## 1st Qu.: 4.075 1st Qu.:109.0 1st Qu.:54.50 1st Qu.:15.07
## Median : 7.250 Median :159.0 Median :66.00 Median :20.10
## Mean : 7.788 Mean :170.8 Mean :65.54 Mean :21.23
## 3rd Qu.:11.250 3rd Qu.:249.0 3rd Qu.:77.75 3rd Qu.:26.18
## Max. :17.400 Max. :337.0 Max. :91.00 Max. :46.00
## [1] "Murder" "Assault" "UrbanPop" "Rape" "name"
## Murder Assault UrbanPop Rape
## Min. :-1.6044 Min. :-1.5090 Min. :-2.31714 Min. :-1.4874
## 1st Qu.:-0.8525 1st Qu.:-0.7411 1st Qu.:-0.76271 1st Qu.:-0.6574
## Median :-0.1235 Median :-0.1411 Median : 0.03178 Median :-0.1209
## Mean : 0.0000 Mean : 0.0000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.7949 3rd Qu.: 0.9388 3rd Qu.: 0.84354 3rd Qu.: 0.5277
## Max. : 2.2069 Max. : 1.9948 Max. : 1.75892 Max. : 2.6444
## name
## Length:50
## Class :character
## Mode :character
##
##
##
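The data frames df (raw USArrests plus a state-name column) and df1 (the standardized version) are not created in the code shown; a minimal sketch of how they were presumably built:
df <- USArrests
df$name <- rownames(USArrests)         # keep the state names as a variable
df1 <- as.data.frame(scale(USArrests)) # standardized copy of the four crime variables
df1$name <- rownames(USArrests)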
d <- dist(df[,-5], method = 'euclidean')
heatmap(
as.matrix(d),
symm = T,
distfun = function(x)
as.dist(x)
)
d1 <- dist(df1[,-5], method = 'euclidean')
heatmap(
as.matrix(d1),
symm = T,
distfun = function(x)
as.dist(x)
)
library(cluster)
h_ward <- agnes(d, method = "ward")
h_average <- agnes(d, method = "average")
h_complete <- agnes(d, method = "complete")
h_single <- agnes(d, method = "single")
library(factoextra)
fviz_dend(h_ward, show_labels = F, rect_border = T)
h_ave_cut <- hcut(d, k = 3, hc_method = "average")
fviz_cluster(h_ave_cut, data = df[,-5], ellipse.type = "convex")
cl <- cutree(h_average, k = 3)
df <-
mutate(df, cluster = cl) # add cluster membership as a column to the data frame
head(df)
## Murder Assault UrbanPop Rape name cluster
## Alabama 13.2 236 58 21.2 Alabama 1
## Alaska 10.0 263 48 44.5 Alaska 1
## Arizona 8.1 294 80 31.0 Arizona 1
## Arkansas 8.8 190 50 19.5 Arkansas 2
## California 9.0 276 91 40.6 California 1
## Colorado 7.9 204 78 38.7 Colorado 2
## # A tibble: 3 x 5
## cluster Murder Assault UrbanPop Rape
## <int> <dbl> <dbl> <dbl> <dbl>
## 1 1 11.8 273. 68.3 28.4
## 2 2 8.21 173. 70.6 22.8
## 3 3 4.27 87.6 59.8 14.4
library(cluster)
h_ward <- agnes(d1, method = "ward")
h_average <- agnes(d1, method = "average")
h_complete <- agnes(d1, method = "complete")
h_single <- agnes(d1, method = "single")
library(factoextra)
fviz_dend(h_ward, show_labels = F, rect_border = T)
h_ave_cut <- hcut(d1, k = 2, hc_method = "average")
fviz_cluster(h_ave_cut, data = df1[,-5], ellipse.type = "convex")
cl <- cutree(h_average, k = 2)
df1 <-
mutate(df1, cluster = cl) # add cluster membership as a column to the data frame
head(df1)
## Murder Assault UrbanPop Rape name cluster
## Alabama 1.24256408 0.7828393 -0.5209066 -0.003416473 Alabama 1
## Alaska 0.50786248 1.1068225 -1.2117642 2.484202941 Alaska 1
## Arizona 0.07163341 1.4788032 0.9989801 1.042878388 Arizona 1
## Arkansas 0.23234938 0.2308680 -1.0735927 -0.184916602 Arkansas 2
## California 0.27826823 1.2628144 1.7589234 2.067820292 California 1
## Colorado 0.02571456 0.3988593 0.8608085 1.864967207 Colorado 1
## # A tibble: 2 x 5
## cluster Murder Assault UrbanPop Rape
## <int> <dbl> <dbl> <dbl> <dbl>
## 1 1 1.00 1.01 0.198 0.847
## 2 2 -0.670 -0.676 -0.132 -0.565
set.seed(1119)
df_cl <- kmeans(df[ , 1:4], 3, nstart = 20)
df$cluster <- df_cl$cluster
df_scaled <- as.data.frame(cbind(scale(df[,1:4]), df$cluster))
head(df_scaled)
## Murder Assault UrbanPop Rape V5
## Alabama 1.24256408 0.7828393 -0.5209066 -0.003416473 1
## Alaska 0.50786248 1.1068225 -1.2117642 2.484202941 1
## Arizona 0.07163341 1.4788032 0.9989801 1.042878388 1
## Arkansas 0.23234938 0.2308680 -1.0735927 -0.184916602 3
## California 0.27826823 1.2628144 1.7589234 2.067820292 1
## Colorado 0.02571456 0.3988593 0.8608085 1.864967207 3
names(df_scaled) <- c("Murder", "Assault", "UrbanPop", "Rape", "cluster")
describeBy(df_scaled, df$cluster)
##
## Descriptive statistics by group
## group: 1
## vars n mean sd median trimmed mad min max range skew kurtosis
## Murder 1 16 0.92 0.64 0.91 0.95 0.54 -0.43 1.91 2.34 -0.28 -0.74
## Assault 2 16 1.22 0.37 1.08 1.20 0.29 0.78 1.99 1.21 0.84 -0.48
## UrbanPop 3 16 0.19 1.07 0.38 0.20 1.13 -1.49 1.76 3.25 -0.32 -1.40
## Rape 4 16 0.76 1.03 0.61 0.72 0.86 -0.58 2.64 3.22 0.40 -1.10
## cluster 5 16 1.00 0.00 1.00 1.00 0.00 1.00 1.00 0.00 NaN NaN
## se
## Murder 0.16
## Assault 0.09
## UrbanPop 0.27
## Rape 0.26
## cluster 0.00
## ------------------------------------------------------------
## group: 2
## vars n mean sd median trimmed mad min max range skew
## Murder 1 20 -0.81 0.53 -0.97 -0.85 0.49 -1.60 0.44 2.04 0.57
## Assault 2 20 -1.00 0.34 -0.92 -0.98 0.38 -1.51 -0.61 0.90 -0.30
## UrbanPop 3 20 -0.40 0.96 -0.42 -0.38 0.82 -2.32 1.21 3.52 -0.13
## Rape 4 20 -0.73 0.50 -0.71 -0.74 0.55 -1.49 0.18 1.67 0.23
## cluster 5 20 2.00 0.00 2.00 2.00 0.00 2.00 2.00 0.00 NaN
## kurtosis se
## Murder -0.66 0.12
## Assault -1.63 0.08
## UrbanPop -0.96 0.22
## Rape -1.20 0.11
## cluster NaN 0.00
## ------------------------------------------------------------
## group: 3
## vars n mean sd median trimmed mad min max range skew kurtosis
## Murder 1 14 0.10 0.91 -0.03 0.01 0.70 -1.01 2.21 3.21 0.82 -0.24
## Assault 2 14 0.03 0.27 -0.04 0.02 0.31 -0.31 0.48 0.79 0.33 -1.51
## UrbanPop 3 14 0.35 0.82 0.24 0.37 0.92 -1.07 1.62 2.69 0.06 -1.31
## Rape 4 14 0.17 0.79 0.20 0.16 0.64 -1.38 1.86 3.25 0.12 -0.24
## cluster 5 14 3.00 0.00 3.00 3.00 0.00 3.00 3.00 0.00 NaN NaN
## se
## Murder 0.24
## Assault 0.07
## UrbanPop 0.22
## Rape 0.21
## cluster 0.00
df_scaled$cluster <- as.factor(df_scaled$cluster)
levels(df_scaled$cluster) <- c("Criminal urbanized", "Peaceful rural", "Peaceful urbanized")
rownames(df) <- df$name
head(df)
## Murder Assault UrbanPop Rape name cluster
## Alabama 13.2 236 58 21.2 Alabama 1
## Alaska 10.0 263 48 44.5 Alaska 1
## Arizona 8.1 294 80 31.0 Arizona 1
## Arkansas 8.8 190 50 19.5 Arkansas 3
## California 9.0 276 91 40.6 California 1
## Colorado 7.9 204 78 38.7 Colorado 3
##
## Descriptive statistics by group
## group: Criminal urbanized
## vars n mean sd median trimmed mad min max range skew
## Murder 1 16 11.81 2.80 11.75 11.93 2.37 5.9 16.1 10.2 -0.28
## Assault 2 16 272.56 31.05 261.00 270.57 24.46 236.0 337.0 101.0 0.84
## UrbanPop 3 16 68.31 15.49 71.00 68.43 16.31 44.0 91.0 47.0 -0.32
## Rape 4 16 28.38 9.60 26.95 28.01 8.08 15.8 46.0 30.2 0.40
## name* 5 16 8.50 4.76 8.50 8.50 5.93 1.0 16.0 15.0 0.00
## cluster* 6 16 1.00 0.00 1.00 1.00 0.00 1.0 1.0 0.0 NaN
## kurtosis se
## Murder -0.74 0.70
## Assault -0.48 7.76
## UrbanPop -1.40 3.87
## Rape -1.10 2.40
## name* -1.43 1.19
## cluster* NaN 0.00
## ------------------------------------------------------------
## group: Peaceful rural
## vars n mean sd median trimmed mad min max range skew
## Murder 1 20 4.27 2.30 3.55 4.09 2.15 0.8 9.7 8.9 0.57
## Assault 2 20 87.55 28.16 94.00 88.75 31.88 45.0 120.0 75.0 -0.30
## UrbanPop 3 20 59.75 13.92 59.50 60.06 11.86 32.0 83.0 51.0 -0.13
## Rape 4 20 14.39 4.67 14.55 14.28 5.11 7.3 22.9 15.6 0.23
## name* 5 20 10.50 5.92 10.50 10.50 7.41 1.0 20.0 19.0 0.00
## cluster* 6 20 2.00 0.00 2.00 2.00 0.00 2.0 2.0 0.0 NaN
## kurtosis se
## Murder -0.66 0.52
## Assault -1.63 6.30
## UrbanPop -0.96 3.11
## Rape -1.20 1.04
## name* -1.38 1.32
## cluster* NaN 0.00
## ------------------------------------------------------------
## group: Peaceful urbanized
## vars n mean sd median trimmed mad min max range skew
## Murder 1 14 8.21 3.94 7.65 7.85 3.04 3.4 17.4 14.0 0.82
## Assault 2 14 173.29 22.18 167.50 172.50 25.95 145.0 211.0 66.0 0.33
## UrbanPop 3 14 70.64 11.85 69.00 70.83 13.34 50.0 89.0 39.0 0.06
## Rape 4 14 22.84 7.40 23.10 22.73 6.00 8.3 38.7 30.4 0.12
## name* 5 14 7.50 4.18 7.50 7.50 5.19 1.0 14.0 13.0 0.00
## cluster* 6 14 3.00 0.00 3.00 3.00 0.00 3.0 3.0 0.0 NaN
## kurtosis se
## Murder -0.24 1.05
## Assault -1.51 5.93
## UrbanPop -1.31 3.17
## Rape -0.24 1.98
## name* -1.46 1.12
## cluster* NaN 0.00
df_cl$cluster <- df$cluster
fviz_cluster(df_cl, data = df[, 1:4], stand = T, show.clust.cent = T,
palette = "Set2", ggtheme = theme_minimal(), geom = "text", repel = T)## Saving 8 x 6 in image
# Representative states have average values. I would answer this question by running a PAM and calling the medoids:
pam_us <- pam(d, diss = TRUE, k = 3)
df[pam_us$medoids, ]
## Murder Assault UrbanPop Rape name cluster
## Michigan 12.1 255 74 35.1 Michigan Criminal urbanized
## Missouri 9.0 178 70 28.2 Missouri Peaceful urbanized
## Nebraska 4.3 102 62 16.5 Nebraska Peaceful rural
# Since this is a dataset by states, you can also colour the states by cluster:
library(usmap)
library(ggplot2)
names(df) <- c("Murder", "Assault", "UrbanPop", "Rape", "state", "cluster")
plot_usmap(data = df, values = "cluster") +
theme(legend.position = "top",
legend.box = "vertical") +
scale_fill_discrete(name = "Cluster") +
labs(title = "US Clusters by Crime",
subtitle = "urban areas show more crime",
caption = "Data: USArrests data")