Anna Shirokanova
15 11 2021
There are two classic approaches to cluster analysis: top-down (K-means, PAM) and bottom-up (hierarchical clustering).
K-means is very fast but keeps things simple (sometimes too simple).
Hierarchical clustering, once built, can quickly produce solutions for various numbers of clusters, and there is a choice of linkage criteria to match the data. Because it works bottom-up, it requires more memory and computation time when there are > 30,000 observations.
(Source: https://jmonlong.github.io/Hippocamplus/2018/02/13/tsne-and-clustering/).
Let’s look at iris data:
We know the species from the beginning and thus can compare clustering solutions against it
In real life, the ‘species’ is unknown, and the solution has to be tested against reality
In this dataset, the object features are metric only.
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
Petal and Sepal
Petal.Length and Petal.Width are similar within the same species but vary considerably between different species:
## item group1 vars n mean sd
## Sepal.Length1 1 setosa 1 50 5.006 0.3524897
## Sepal.Length2 2 versicolor 1 50 5.936 0.5161711
## Sepal.Length3 3 virginica 1 50 6.588 0.6358796
## Sepal.Width1 4 setosa 2 50 3.428 0.3790644
## Sepal.Width2 5 versicolor 2 50 2.770 0.3137983
## Sepal.Width3 6 virginica 2 50 2.974 0.3224966
## Petal.Length1 7 setosa 3 50 1.462 0.1736640
## Petal.Length2 8 versicolor 3 50 4.260 0.4699110
## Petal.Length3 9 virginica 3 50 5.552 0.5518947
## Petal.Width1 10 setosa 4 50 0.246 0.1053856
## Petal.Width2 11 versicolor 4 50 1.326 0.1977527
## Petal.Width3 12 virginica 4 50 2.026 0.2746501
Let us visualize their distribution:
library(ggplot2)
ggplot(iris, aes(Petal.Length, Petal.Width, color = Species)) +
geom_point() +
theme_classic()
set.seed(20) # to ensure reproducibility (could be omitted, the number will be random)
library(sjPlot)
tab_corr(iris[-5])
|  | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width |
|---|---|---|---|---|
| Sepal.Length |  | -0.118 | 0.872*** | 0.818*** |
| Sepal.Width | -0.118 |  | -0.428*** | -0.366*** |
| Petal.Length | 0.872*** | -0.428*** |  | 0.963*** |
| Petal.Width | 0.818*** | -0.366*** | 0.963*** |  |
| Computed correlation used pearson-method with listwise-deletion. | ||||
irisCluster <- kmeans(iris[ , c("Petal.Length", "Petal.Width")], # Petal.Length and Petal.Width
3, # how many groups to locate
nstart = 20 # R will try 20 different random starting assignments
# and then select the one with the lowest within cluster variation.
)
Let's compare the results of clustering against the pre-set classification:
##
## setosa versicolor virginica
## 1 0 48 4
## 2 50 0 0
## 3 0 2 46
Six objects (2 + 4) have been misidentified.
What if we standardize the data first?
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v tibble 3.1.5 v dplyr 1.0.7
## v tidyr 1.1.4 v stringr 1.4.0
## v readr 2.0.2 v forcats 0.5.1
## v purrr 0.3.4
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x ggplot2::%+%() masks psych::%+%()
## x ggplot2::alpha() masks psych::alpha()
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
iris2 <- iris[ , c("Petal.Length", "Petal.Width")] %>%
scale() %>%
as.data.frame()
irisCluster2 <- kmeans(iris2,
3,
nstart = 20)
table(irisCluster2$cluster, iris$Species)
##
## setosa versicolor virginica
## 1 50 0 0
## 2 0 2 46
## 3 0 48 4
We can use the 'elbow method' to determine the optimal k.
To do this, we run several models with k = 1, ..., 10 and compare the within-cluster sum of squares (the sum of squared distances from each observation to its cluster centroid) for each k.
A low within-cluster sum of squares indicates high intra-cluster similarity, i.e. tight, homogeneous clusters.
The optimal number of clusters (k) is where the 'elbow' occurs, i.e. the point after which the within-sum-of-squares stops decreasing dramatically.
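A minimal sketch of the elbow computation (assuming the scaled petal data iris2 created above; the code actually used for the plot is not shown):
wss <- sapply(1:10, function(k) kmeans(iris2, centers = k, nstart = 20)$tot.withinss)
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares") # the 'elbow' marks a candidate k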
The silhouette width statistic:
"The average silhouette approach measures the quality of a clustering. That is, it determines how well each object lies within its cluster.
A high average silhouette width indicates a good clustering."
Source: https://uc-r.github.io/kmeans_clustering
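As a sketch (assuming the factoextra package), the average silhouette width across k can be plotted in one call:
library(factoextra)
fviz_nbclust(iris2, kmeans, method = "silhouette") # average silhouette width for k = 2..10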
The gap statistic:
"The approach can be applied to any clustering method (i.e. K-means clustering, hierarchical clustering).
The gap statistic compares the total intracluster variation for different values of k with their expected values under null reference distribution of the data (i.e. a distribution with no obvious clustering)." Source: https://uc-r.github.io/kmeans_clustering#gap
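A sketch of the gap statistic for the same data, using cluster::clusGap with bootstrapped reference datasets (the number of bootstraps B is an assumption):
library(cluster)
gap_stat <- clusGap(iris2, FUNcluster = kmeans, nstart = 20, K.max = 10, B = 50)
fviz_gap_stat(gap_stat) # plot from factoextra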
According to our scree plot (elbow plot), the optimal k is 2 or 3.
To visualize the solution in two dimensions, use:
fviz_cluster(irisCluster, data = iris[ ,3:4],
ellipse.type = "convex",
palette = "jco",
             repel = TRUE)
Compare this solution to the one based on ALL the variables:
Conclusion: k-means works well when domain knowledge is rich.
If little is known, use the most differentiating variables for clustering. The decision on the number of clusters can be based on metrics.
In this example, irises are best clustered using 2 of the 4 variables.
library(palmerpenguins)
library(psych)
describe(penguins) # flipper length and body mass have much larger variances
## vars n mean sd median trimmed mad min max
## species* 1 344 1.92 0.89 2.00 1.90 1.48 1.0 3.0
## island* 2 344 1.66 0.73 2.00 1.58 1.48 1.0 3.0
## bill_length_mm 3 342 43.92 5.46 44.45 43.91 7.04 32.1 59.6
## bill_depth_mm 4 342 17.15 1.97 17.30 17.17 2.22 13.1 21.5
## flipper_length_mm 5 342 200.92 14.06 197.00 200.34 16.31 172.0 231.0
## body_mass_g 6 342 4201.75 801.95 4050.00 4154.01 889.56 2700.0 6300.0
## sex* 7 333 1.50 0.50 2.00 1.51 0.00 1.0 2.0
## year 8 344 2008.03 0.82 2008.00 2008.04 1.48 2007.0 2009.0
## range skew kurtosis se
## species* 2.0 0.16 -1.73 0.05
## island* 2.0 0.61 -0.91 0.04
## bill_length_mm 27.5 0.05 -0.89 0.30
## bill_depth_mm 8.4 -0.14 -0.92 0.11
## flipper_length_mm 59.0 0.34 -1.00 0.76
## body_mass_g 3600.0 0.47 -0.74 43.36
## sex* 1.0 -0.02 -2.01 0.03
## year 2.0 -0.05 -1.51 0.04
## bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## Min. :32.10 Min. :13.10 Min. :172.0 Min. :2700
## 1st Qu.:39.23 1st Qu.:15.60 1st Qu.:190.0 1st Qu.:3550
## Median :44.45 Median :17.30 Median :197.0 Median :4050
## Mean :43.92 Mean :17.15 Mean :200.9 Mean :4202
## 3rd Qu.:48.50 3rd Qu.:18.70 3rd Qu.:213.0 3rd Qu.:4750
## Max. :59.60 Max. :21.50 Max. :231.0 Max. :6300
## NA's :2 NA's :2 NA's :2 NA's :2
|  | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g |
|---|---|---|---|---|
| bill_length_mm |  | -0.235*** | 0.656*** | 0.595*** |
| bill_depth_mm | -0.235*** |  | -0.584*** | -0.472*** |
| flipper_length_mm | 0.656*** | -0.584*** |  | 0.871*** |
| body_mass_g | 0.595*** | -0.472*** | 0.871*** |  |
| Computed correlation used pearson-method with listwise-deletion. | ||||
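The object peng used below is not created in the code shown; presumably it holds the four numeric measurements with incomplete rows dropped, e.g. (a sketch):
peng <- penguins %>%
  dplyr::select(contains("mm"), contains("body")) %>% # the four body measurements
  na.omit()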
peng_sc <- as.data.frame(scale(peng))
tab_corr(peng_sc) # if you scale the data, the correlations would remain the same
|  | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g |
|---|---|---|---|---|
| bill_length_mm |  | -0.235*** | 0.656*** | 0.595*** |
| bill_depth_mm | -0.235*** |  | -0.584*** | -0.472*** |
| flipper_length_mm | 0.656*** | -0.584*** |  | 0.871*** |
| body_mass_g | 0.595*** | -0.472*** | 0.871*** |  |
| Computed correlation used pearson-method with listwise-deletion. | ||||
## vars n mean sd median trimmed mad min max range skew
## bill_length_mm 1 342 0 1 0.10 0.00 1.29 -2.17 2.87 5.04 0.05
## bill_depth_mm 2 342 0 1 0.08 0.01 1.13 -2.05 2.20 4.25 -0.14
## flipper_length_mm 3 342 0 1 -0.28 -0.04 1.16 -2.06 2.14 4.20 0.34
## body_mass_g 4 342 0 1 -0.19 -0.06 1.11 -1.87 2.62 4.49 0.47
## kurtosis se
## bill_length_mm -0.89 0.05
## bill_depth_mm -0.92 0.05
## flipper_length_mm -1.00 0.05
## body_mass_g -0.74 0.05
In the gap-statistic plot, the error bars come from the simulated reference datasets. The gap is the difference between the within-cluster variation expected for hypothetical unclustered data and the variation observed in the dataset at hand.
pCluster <- kmeans(peng[, c(1, 3)], 3)
penguins1 <- penguins %>%
dplyr::select(species, contains("mm"), contains("body")) %>%
na.omit()
table(pCluster$cluster, penguins1$species) # 55 misfits
##
## Adelie Chinstrap Gentoo
## 1 38 54 1
## 2 111 9 0
## 3 2 5 122
pCluster <- kmeans(peng, 3) # on all 4 variables, 112 misfits
table(pCluster$cluster, penguins1$species)
##
## Adelie Chinstrap Gentoo
## 1 0 0 81
## 2 53 22 42
## 3 98 46 0
##
## Adelie Chinstrap Gentoo
## 1 14 5 114
## 2 137 63 9
Over to you!
The choice of the metric may have a large impact depending on what data you have.
iris.use <- subset(iris, select = -Species) # select all variables but the "Species"
rownames(iris.use) <- c()
d <- dist(iris.use, method = 'euclidean') # calculate a distance matrix on selected variables
See an example here:
Figure. Three clustering methods for the target-looking data.
What distance to choose?
| distance | metric description |
|---|---|
| euclidean | usual distance between the two vectors (interval variables) |
| manhattan | absolute distance between the two vectors |
| canberra | for non-negative values (e.g., counts) |
| minkowski | can be used for both ordinal and quantitative variables |
| gower | the weighted mean of the contributions of each variable (for mixed types) |
| binary | Jaccard distance for dichotomous variables |
Q: How to choose the correct distance?
A: Learning + practice.
“An appropriate dissimilarity measure is far more important in obtaining success with clustering than choice of clustering algorithm.” https://stats.stackexchange.com/questions/3713/choosing-a-clustering-method
Read more here: http://www.sthda.com/english/wiki/clarifying-distance-measures-unsupervised-machine-learning#distances-and-scaling
Once you have the distance matrix, a plot with distances between observations and a dendrogram can be produced:
In this plot, clusters are squares of red color.
How many clusters can you see here, 2 or 3?
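Such a plot can be produced, for example, with factoextra's fviz_dist (a sketch; the original call is not shown):
library(factoextra)
fviz_dist(d, show_labels = FALSE) # ordered heatmap of the distance matrix; tight clusters appear as red squares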
library(cluster)
hfit_ward <- agnes(d, method = "ward") # Ward: merge the two clusters giving the smallest increase in total within-cluster variance
hfit_average <- agnes(d, method = "average") # average linkage: average distance between all pairs of observations, one from each cluster
hfit_complete <- agnes(d, method = "complete") # complete linkage: maximal distance between observations, one from each cluster
Click here to read more about these methods: https://www.stat.cmu.edu/~cshalizi/350/lectures/08/lecture-08.pdf
##
## setosa versicolor virginica
## 1 50 0 0
## 2 0 49 15
## 3 0 1 35
##
## setosa versicolor virginica
## 1 50 0 0
## 2 0 50 14
## 3 0 0 36
##
## setosa versicolor virginica
## 1 50 0 0
## 2 0 23 49
## 3 0 27 1
In our case, the average linkage method (2nd table) seems to work best, as it produces the fewest misclassified observations.
In the real world, you may look at the dendrogram to evaluate how well a particular method has clustered the data:
h_avg_cut <- hcut(d, k = 3, hc_method = "average")
fviz_cluster(h_avg_cut, iris.use, ellipse.type = "convex")
Let's further visualise the results!
cl <- cutree(hfit_average, k = 3) # cut the dendrogram to obtain exactly 3 clusters
library(dplyr)
iris <- mutate(iris, cluster = cl) # add cluster membership as a column to the data frame
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species cluster
## 1 5.1 3.5 1.4 0.2 setosa 1
## 2 4.9 3.0 1.4 0.2 setosa 1
## 3 4.7 3.2 1.3 0.2 setosa 1
## 4 4.6 3.1 1.5 0.2 setosa 1
## 5 5.0 3.6 1.4 0.2 setosa 1
## 6 5.4 3.9 1.7 0.4 setosa 1
library(magrittr)
iris[ ,-5] %>%
group_by(cluster) %>%
  summarise_all(mean) # get the mean values of all variables by cluster
## # A tibble: 3 x 5
## cluster Sepal.Length Sepal.Width Petal.Length Petal.Width
## <int> <dbl> <dbl> <dbl> <dbl>
## 1 1 5.01 3.43 1.46 0.246
## 2 2 5.93 2.76 4.41 1.44
## 3 3 6.85 3.08 5.79 2.10
Show the result on the dendrogram by coloring the branches (optional)
A more elaborate option:
library(dendextend)
dend <- as.dendrogram(hfit_average)
dend_col <- color_branches(dend, k = 3)
labels_colors(dend_col) <- "white"
plot(dend_col)
This is how the dendrogram was cut to obtain three clusters.
Yet another possibility is to cut the dendrogram at a given height. The height in a dendrogram reflects the combination of linkage method and distance.
Here, the dendrogram is built from a Euclidean distance matrix with average linkage. Cutting it at a height of 1.8 means that clusters are merged only as long as the average Euclidean distance between the groups being merged does not exceed 1.8:
h_avg <- hclust(d, method = "average")
cl2 <- cutree(h_avg, h = 1.8)
dend <- as.dendrogram(h_avg)
dend_col <- color_branches(dend, h = 1.8)
plot(dend_col, labels = FALSE)
Better dendrograms: http://stackoverflow.com/questions/14118033/horizontal-dendrogram-in-r-with-labels
A bonus below (optional)
The ‘banner plot’ below:
“The white area on the left of the banner plot represents the unclustered data, while the white lines that stick into the red area show the heights at which the clusters were formed. Since we don't want to include too many clusters that joined together at similar heights, it looks like three clusters, at a height of about 2, is a good solution. It's clear from the banner plot that if we lowered the height to, say, 1.5, we'd create a fourth cluster with only a few observations.
The banner plot is just an alternative to the dendrogram."
Link on banner plot: https://www.stat.berkeley.edu/~s133/Cluster2a.html
Link on agglomeration coefficient: https://datascience.stackexchange.com/questions/18860/how-to-interpret-agglomerative-coefficient-agnes-function
Clustering data of mixed types (e.g., continuous, ordinal, and nominal) is often of interest.
Read more: https://www.r-bloggers.com/clustering-mixed-data-types-in-r/
For illustration, use the College dataset: https://rdrr.io/cran/ISLR/man/College.html
All the variables:
set.seed(2021) # for reproducibility
library(ISLR) # the college dataset
library(Rtsne) # for t-SNE plot
library(tidyverse)
## Private Apps Accept Enroll Top10perc Top25perc
## Abilene Christian University Yes 1660 1232 721 23 52
## Adelphi University Yes 2186 1924 512 16 29
## Adrian College Yes 1428 1097 336 22 50
## F.Undergrad P.Undergrad Outstate Room.Board Books
## Abilene Christian University 2885 537 7440 3300 450
## Adelphi University 2683 1227 12280 6450 750
## Adrian College 1036 99 11250 3750 400
## Personal PhD Terminal S.F.Ratio perc.alumni Expend
## Abilene Christian University 2200 70 78 18.1 12 7041
## Adelphi University 1500 29 30 12.2 16 10527
## Adrian College 1165 53 66 12.9 30 8735
## Grad.Rate
## Abilene Christian University 60
## Adelphi University 56
## Adrian College 54
Let’s evaluate the distributions:
What is the college with a graduation rate of 118%?
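The outlier can be located with a simple filter (a sketch of the likely call, which is not shown above):
College[College$Grad.Rate > 100, ]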
## Private Apps Accept Enroll Top10perc Top25perc F.Undergrad
## Cazenovia College Yes 3847 3433 527 9 35 1010
## P.Undergrad Outstate Room.Board Books Personal PhD Terminal
## Cazenovia College 12 9384 4840 600 500 22 47
## S.F.Ratio perc.alumni Expend Grad.Rate
## Cazenovia College 14.3 20 7697 118
This is a right-skewed variable which we will now transform into a new one:
college_clean = College %>%
mutate(
name = row.names(.),
# make college names a variable
accept_rate = Accept / Apps,
# ratio of accepted students to applications
isElite = cut(
Top10perc,
# how many top-10% students are enrolled, % of all students
breaks = c(0, 50, 100),
labels = c("Not Elite", "Elite"),
include.lowest = TRUE
)
) %>%
mutate(isElite = factor(isElite)) %>% # label colleges with many good students
select(name, accept_rate, Outstate, Enroll, # "elites"
         Grad.Rate, Private, isElite)
Enroll has a heavy right skew, which may negatively affect the distance matrix calculation, so we are going to use it log-transformed.
## Rows: 777
## Columns: 7
## $ name <chr> "Abilene Christian University", "Adelphi University", "Adr~
## $ accept_rate <dbl> 0.7421687, 0.8801464, 0.7682073, 0.8369305, 0.7564767, 0.8~
## $ Outstate <dbl> 7440, 12280, 11250, 12960, 7560, 13500, 13290, 13868, 1559~
## $ Enroll <dbl> 721, 512, 336, 137, 55, 158, 103, 489, 227, 172, 472, 484,~
## $ Grad.Rate <dbl> 60, 56, 54, 59, 15, 55, 63, 73, 80, 52, 73, 76, 74, 68, 55~
## $ Private <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes~
## $ isElite <fct> Not Elite, Not Elite, Not Elite, Elite, Not Elite, Not Eli~
Create a distance object (taking out the name of the school in the 1st column)
library(cluster)
gower_dist <- daisy(college_clean[ , -1],
metric = "gower",
type = list(logratio = 3)) # It means: "Use Enroll as log(Enroll)"
summary(gower_dist)
## 301476 dissimilarities, summarized :
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00186 0.10344 0.23587 0.23145 0.32714 0.77735
## Metric : mixed ; Types = I, I, I, I, N, N
## Number of objects : 777
Check the attributes to ensure the correct methods are being used (I = interval, N = nominal)
The most similar pair of colleges:
gower_mat <- as.matrix(gower_dist) # convert the dissimilarity object to a matrix first
college_clean[
  which(gower_mat == min(gower_mat[gower_mat != min(gower_mat)]), # lowest observed value
        arr.ind = TRUE)[1, ], ]
## name accept_rate
## University of St. Thomas MN University of St. Thomas MN 0.8784638 11712
## John Carroll University John Carroll University 0.8711276 11700
## Enroll Grad.Rate Private isElite
## University of St. Thomas MN 828 89 Yes Not Elite
## John Carroll University 820 89 Yes Not Elite
The most dissimilar pair of colleges:
college_clean[
which(gower_mat == max(gower_mat[gower_mat != max(gower_mat)]), # highest observed value
arr.ind = TRUE)[1, ], ]## name
## University of Sci. and Arts of Oklahoma University of Sci. and Arts of Oklahoma
## Harvard University Harvard University
## accept_rate Outstate Enroll Grad.Rate
## University of Sci. and Arts of Oklahoma 0.9824561 3687 208 43
## Harvard University 0.1561486 18485 1606 100
## Private isElite
## University of Sci. and Arts of Oklahoma No Not Elite
## Harvard University Yes Elite
These pairs make sense.
We will use partitioning around medoids (PAM), because it can work directly with a custom distance matrix.
Q: How does PAM work?
A: PAM is an iterative algorithm. A 'medoid' is the observation within a cluster whose average dissimilarity to all other observations in that cluster is the lowest; it serves as the cluster's representative. PAM works well with n < 10,000 observations.
Q: How many clusters should be retained in the solution?
A: Look at the silhouette width metric. This is an internal validation metric, an aggregated measure of how similar an observation is to its own cluster, compared to its closest neighboring cluster. The metric can range from -1 to 1, where higher values are better.
sil_width <- c(NA)
for(i in 2:10){
pam_fit <- pam(gower_dist,
diss = TRUE,
k = i)
sil_width[i] <- pam_fit$silinfo$avg.width
}
Plot the silhouette width (larger value is better):
plot(1:10, sil_width,
xlab = "Number of clusters", xaxt='n',
ylab = "Silhouette Width",
ylim = c(0,1))
axis(1, at = seq(2, 10, by = 1), las=2)
lines(1:10, sil_width)
The same graph in one line:
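One compact alternative (a sketch, not necessarily the call used here; it simply wraps the loop above):
sw <- sapply(2:10, function(k) pam(gower_dist, diss = TRUE, k = k)$silinfo$avg.width)
plot(2:10, sw, type = "b", xlab = "Number of clusters", ylab = "Silhouette width")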
Conclusion: select the 3-cluster solution, as it has the highest silhouette width.
pam_fit <- pam(gower_dist, diss = TRUE, k = 3) # re-fit PAM with the chosen k = 3 (the loop above leaves pam_fit at k = 10)
pam_results <- college_clean %>%
dplyr::select(-name) %>%
mutate(cluster = pam_fit$clustering) %>%
group_by(cluster) %>%
do(the_summary = summary(.))
pam_results$the_summary
## [[1]]
## accept_rate Outstate Enroll Grad.Rate Private
## Min. :0.3283 Min. : 2340 Min. : 35.0 Min. : 15.00 No : 0
## 1st Qu.:0.7225 1st Qu.: 8842 1st Qu.: 194.8 1st Qu.: 56.00 Yes:500
## Median :0.8004 Median :10905 Median : 308.0 Median : 67.50
## Mean :0.7820 Mean :11200 Mean : 418.6 Mean : 66.97
## 3rd Qu.:0.8581 3rd Qu.:13240 3rd Qu.: 484.8 3rd Qu.: 78.25
## Max. :1.0000 Max. :21700 Max. :4615.0 Max. :118.00
## isElite cluster
## Not Elite:500 Min. :1
## Elite : 0 1st Qu.:1
## Median :1
## Mean :1
## 3rd Qu.:1
## Max. :1
##
## [[2]]
## accept_rate Outstate Enroll Grad.Rate Private
## Min. :0.1545 Min. : 5224 Min. : 137.0 Min. : 54.00 No : 4
## 1st Qu.:0.4135 1st Qu.:13850 1st Qu.: 391.0 1st Qu.: 77.00 Yes:65
## Median :0.5329 Median :17238 Median : 601.0 Median : 89.00
## Mean :0.5392 Mean :16225 Mean : 882.5 Mean : 84.78
## 3rd Qu.:0.6988 3rd Qu.:18590 3rd Qu.:1191.0 3rd Qu.: 94.00
## Max. :0.9605 Max. :20100 Max. :4893.0 Max. :100.00
## isElite cluster
## Not Elite: 0 Min. :2
## Elite :69 1st Qu.:2
## Median :2
## Mean :2
## 3rd Qu.:2
## Max. :2
##
## [[3]]
## accept_rate Outstate Enroll Grad.Rate Private
## Min. :0.3746 Min. : 2580 Min. : 153 Min. : 10.00 No :208
## 1st Qu.:0.6423 1st Qu.: 5295 1st Qu.: 694 1st Qu.: 46.00 Yes: 0
## Median :0.7458 Median : 6598 Median :1302 Median : 54.50
## Mean :0.7315 Mean : 6698 Mean :1615 Mean : 55.42
## 3rd Qu.:0.8368 3rd Qu.: 7748 3rd Qu.:2184 3rd Qu.: 65.00
## Max. :1.0000 Max. :15516 Max. :6392 Max. :100.00
## isElite cluster
## Not Elite:199 Min. :3
## Elite : 9 1st Qu.:3
## Median :3
## Mean :3
## 3rd Qu.:3
## Max. :3
college_clean$cluster <- as.factor(pam_fit$clustering)
describeBy(college_clean, college_clean$cluster)
##
## Descriptive statistics by group
## group: 1
## vars n mean sd median trimmed mad min max
## name* 1 500 250.50 144.48 250.5 250.50 185.32 1.00 500
## accept_rate 2 500 0.78 0.11 0.8 0.79 0.10 0.33 1
## Outstate 3 500 11199.56 3330.84 10905.0 11033.57 3157.94 2340.00 21700
## Enroll 4 500 418.57 430.94 308.0 342.51 199.41 35.00 4615
## Grad.Rate 5 500 66.97 16.20 67.5 67.42 17.05 15.00 118
## Private* 6 500 2.00 0.00 2.0 2.00 0.00 2.00 2
## isElite* 7 500 1.00 0.00 1.0 1.00 0.00 1.00 1
## cluster* 8 500 1.00 0.00 1.0 1.00 0.00 1.00 1
## range skew kurtosis se
## name* 499.00 0.00 -1.21 6.46
## accept_rate 0.67 -0.94 1.32 0.01
## Outstate 19360.00 0.45 0.07 148.96
## Enroll 4580.00 4.68 31.74 19.27
## Grad.Rate 103.00 -0.33 0.26 0.72
## Private* 0.00 NaN NaN 0.00
## isElite* 0.00 NaN NaN 0.00
## cluster* 0.00 NaN NaN 0.00
## ------------------------------------------------------------
## group: 2
## vars n mean sd median trimmed mad min max
## name* 1 69 35.00 20.06 35.00 35.00 25.20 1.00 69.00
## accept_rate 2 69 0.54 0.20 0.53 0.54 0.21 0.15 0.96
## Outstate 3 69 16224.51 3210.41 17238.00 16563.04 2345.47 5224.00 20100.00
## Enroll 4 69 882.48 813.12 601.00 740.65 464.05 137.00 4893.00
## Grad.Rate 5 69 84.78 11.98 89.00 85.72 11.86 54.00 100.00
## Private* 6 69 1.94 0.24 2.00 2.00 0.00 1.00 2.00
## isElite* 7 69 2.00 0.00 2.00 2.00 0.00 2.00 2.00
## cluster* 8 69 2.00 0.00 2.00 2.00 0.00 2.00 2.00
## range skew kurtosis se
## name* 68.00 0.00 -1.25 2.42
## accept_rate 0.81 0.11 -0.91 0.02
## Outstate 14876.00 -1.09 0.61 386.49
## Enroll 4756.00 2.41 7.56 97.89
## Grad.Rate 46.00 -0.65 -0.64 1.44
## Private* 1.00 -3.70 11.87 0.03
## isElite* 0.00 NaN NaN 0.00
## cluster* 0.00 NaN NaN 0.00
## ------------------------------------------------------------
## group: 3
## vars n mean sd median trimmed mad min max
## name* 1 208 104.50 60.19 104.50 104.50 77.10 1.00 208
## accept_rate 2 208 0.73 0.14 0.75 0.74 0.14 0.37 1
## Outstate 3 208 6697.75 1980.73 6598.50 6558.74 1801.36 2580.00 15516
## Enroll 4 208 1614.72 1246.02 1302.00 1445.40 1034.11 153.00 6392
## Grad.Rate 5 208 55.42 13.98 54.50 55.17 13.34 10.00 100
## Private* 6 208 1.00 0.00 1.00 1.00 0.00 1.00 1
## isElite* 7 208 1.04 0.20 1.00 1.00 0.00 1.00 2
## cluster* 8 208 3.00 0.00 3.00 3.00 0.00 3.00 3
## range skew kurtosis se
## name* 207.00 0.00 -1.22 4.17
## accept_rate 0.63 -0.35 -0.41 0.01
## Outstate 12936.00 0.90 2.01 137.34
## Enroll 6239.00 1.44 2.29 86.40
## Grad.Rate 90.00 0.17 0.37 0.97
## Private* 0.00 NaN NaN 0.00
## isElite* 1.00 4.46 17.95 0.01
## cluster* 0.00 NaN NaN 0.00
Cluster 1 is mainly Private/Not Elite with medium levels of out of state tuition and smaller levels of enrollment.
Cluster 2, on the other hand, is mainly Private/Elite with lower levels of acceptance rates, high levels of out of state tuition, and high graduation rates.
Finally, cluster 3 is mainly Public/Not Elite with the lowest levels of tuition, largest levels of enrollment, and lowest graduation rate.
Hint: medoids serve as exemplars of each cluster:
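The medoid rows shown below can be pulled directly from the PAM object (a sketch; with a dissimilarity matrix as input, pam_fit$medoids contains the indices of the medoid observations):
college_clean[pam_fit$medoids, ]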
## name accept_rate
## Saint Francis College Saint Francis College 0.7877629
## Barnard College Barnard College 0.5616987
## Grand Valley State University Grand Valley State University 0.7525653
## Outstate Enroll Grad.Rate Private isElite
## Saint Francis College 10880 284 69 Yes Not Elite
## Barnard College 17926 531 91 Yes Elite
## Grand Valley State University 6108 1561 57 No Not Elite
## cluster
## Saint Francis College 1
## Barnard College 2
## Grand Valley State University 3
To visualize many variables in a lower dimensional space, use t-distributed stochastic neighborhood embedding, or t-SNE.
It tries to preserve local structure so as to make clusters visible in 2D or 3D.
set.seed(42)
tsne_obj <- Rtsne(gower_dist, is_distance = TRUE)
tsne_data <- tsne_obj$Y %>%
# $Y is a matrix containing the new representations for the objects
data.frame() %>%
setNames(c("X", "Y")) %>%
mutate(cluster = factor(pam_fit$clustering),
name = college_clean$name)
ggplot(aes(x = X, y = Y), data = tsne_data) +
  geom_point(aes(color = cluster))
Have you noticed the small group?
It consists of larger, more competitive public schools: not numerous enough to warrant an additional cluster according to the silhouette width, but these 13 schools certainly have distinct characteristics.
tsne_data %>%
filter(X > 12 & X < 15, # cluster coordinates may differ
Y > -15 & Y < -12) %>%
left_join(college_clean, by = "name") %>%
collect %>%
  .[["name"]]
## [1] "College of William and Mary"
## [2] "Georgia Institute of Technology"
## [3] "SUNY at Binghamton"
## [4] "SUNY College at Geneseo"
## [5] "Trenton State College"
## [6] "University of California at Berkeley"
## [7] "University of California at Irvine"
## [8] "University of Florida"
## [9] "University of Illinois - Urbana"
## [10] "University of Michigan at Ann Arbor"
## [11] "University of Minnesota at Morris"
## [12] "University of North Carolina at Chapel Hill"
## [13] "University of Virginia"
Explore the data: http://colleges.usnews.rankingsandreviews.com/best-colleges/william-and-mary-3705
Let us repeat the analysis using 4 clusters and see whether the results are satisfactory.
pam_fit2 <- pam(gower_dist, diss = TRUE, k = 4)
pam_results2 <- college_clean %>%
dplyr::select(-name) %>%
mutate(cluster = pam_fit2$clustering) %>%
group_by(cluster) %>%
do(the_summary = summary(.))
pam_results2$the_summary
## [[1]]
## accept_rate Outstate Enroll Grad.Rate Private
## Min. :0.3283 Min. : 2340 Min. : 35.0 Min. : 15.00 No : 0
## 1st Qu.:0.7423 1st Qu.: 8031 1st Qu.: 155.2 1st Qu.: 50.00 Yes:266
## Median :0.8128 Median : 9185 Median : 216.0 Median : 58.00
## Mean :0.7927 Mean : 9292 Mean : 274.1 Mean : 57.36
## 3rd Qu.:0.8674 3rd Qu.:10500 3rd Qu.: 297.2 3rd Qu.: 65.75
## Max. :1.0000 Max. :21700 Max. :4615.0 Max. :100.00
## isElite cluster
## Not Elite:265 Min. :1
## Elite : 1 1st Qu.:1
## Median :1
## Mean :1
## 3rd Qu.:1
## Max. :1
##
## [[2]]
## accept_rate Outstate Enroll Grad.Rate Private
## Min. :0.4387 Min. : 4990 Min. : 96.0 Min. : 46.0 No : 0
## 1st Qu.:0.7086 1st Qu.:11600 1st Qu.: 338.0 1st Qu.: 70.0 Yes:235
## Median :0.7882 Median :12925 Median : 465.0 Median : 78.0
## Mean :0.7699 Mean :13354 Mean : 580.9 Mean : 77.8
## 3rd Qu.:0.8488 3rd Qu.:14995 3rd Qu.: 665.5 3rd Qu.: 84.0
## Max. :1.0000 Max. :19964 Max. :3810.0 Max. :118.0
## isElite cluster
## Not Elite:235 Min. :2
## Elite : 0 1st Qu.:2
## Median :2
## Mean :2
## 3rd Qu.:2
## Max. :2
##
## [[3]]
## accept_rate Outstate Enroll Grad.Rate Private
## Min. :0.1545 Min. : 5224 Min. : 137.0 Min. : 59.00 No : 4
## 1st Qu.:0.4132 1st Qu.:14232 1st Qu.: 411.2 1st Qu.: 77.00 Yes:64
## Median :0.5308 Median :17267 Median : 616.5 Median : 89.50
## Mean :0.5355 Mean :16318 Mean : 893.3 Mean : 85.24
## 3rd Qu.:0.6900 3rd Qu.:18599 3rd Qu.:1197.5 3rd Qu.: 94.25
## Max. :0.9605 Max. :20100 Max. :4893.0 Max. :100.00
## isElite cluster
## Not Elite: 0 Min. :3
## Elite :68 1st Qu.:3
## Median :3
## Mean :3
## 3rd Qu.:3
## Max. :3
##
## [[4]]
## accept_rate Outstate Enroll Grad.Rate Private
## Min. :0.3746 Min. : 2580 Min. : 153 Min. : 10.00 No :208
## 1st Qu.:0.6423 1st Qu.: 5295 1st Qu.: 694 1st Qu.: 46.00 Yes: 0
## Median :0.7458 Median : 6598 Median :1302 Median : 54.50
## Mean :0.7315 Mean : 6698 Mean :1615 Mean : 55.42
## 3rd Qu.:0.8368 3rd Qu.: 7748 3rd Qu.:2184 3rd Qu.: 65.00
## Max. :1.0000 Max. :15516 Max. :6392 Max. :100.00
## isElite cluster
## Not Elite:199 Min. :4
## Elite : 9 1st Qu.:4
## Median :4
## Mean :4
## 3rd Qu.:4
## Max. :4
## name accept_rate
## University of Charleston University of Charleston 0.7844575
## Merrimack College Merrimack College 0.7778900
## Barnard College Barnard College 0.5616987
## Grand Valley State University Grand Valley State University 0.7525653
## Outstate Enroll Grad.Rate Private isElite
## University of Charleston 9500 204 57 Yes Not Elite
## Merrimack College 12500 514 80 Yes Not Elite
## Barnard College 17926 531 91 Yes Elite
## Grand Valley State University 6108 1561 57 No Not Elite
## cluster
## University of Charleston 1
## Merrimack College 1
## Barnard College 2
## Grand Valley State University 3
set.seed(42)
tsne_obj2 <- Rtsne(gower_dist, is_distance = TRUE)
tsne_data2 <- tsne_obj2$Y %>%
# $Y is a matrix containing the new representations for the objects
data.frame() %>%
setNames(c("X", "Y")) %>%
mutate(cluster = factor(pam_fit2$clustering),
name = college_clean$name)
ggplot(aes(x = X, y = Y), data = tsne_data2) +
  geom_point(aes(color = cluster))
Nope.
A 4-cluster solution is less satisfactory than a 3-cluster one.
However, the smallest cluster as represented by t-SNE can be re-coded manually.
levels(tsne_data$cluster) <- c("1", "2", "3", "4")
tsne_data$cluster[tsne_data$X > 12 &
tsne_data$X < 15 & tsne_data$Y > -15 & tsne_data$Y < -12] <- 4
ggplot(aes(x = X, y = Y), data = tsne_data) +
  geom_point(aes(color = cluster))
##
## Descriptive statistics by group
## group: 1
## vars n mean sd median trimmed mad min max
## name* 1 500 250.50 144.48 250.5 250.50 185.32 1.00 500
## accept_rate 2 500 0.78 0.11 0.8 0.79 0.10 0.33 1
## Outstate 3 500 11199.56 3330.84 10905.0 11033.57 3157.94 2340.00 21700
## Enroll 4 500 418.57 430.94 308.0 342.51 199.41 35.00 4615
## Grad.Rate 5 500 66.97 16.20 67.5 67.42 17.05 15.00 118
## Private* 6 500 2.00 0.00 2.0 2.00 0.00 2.00 2
## isElite* 7 500 1.00 0.00 1.0 1.00 0.00 1.00 1
## cluster* 8 500 1.00 0.00 1.0 1.00 0.00 1.00 1
## range skew kurtosis se
## name* 499.00 0.00 -1.21 6.46
## accept_rate 0.67 -0.94 1.32 0.01
## Outstate 19360.00 0.45 0.07 148.96
## Enroll 4580.00 4.68 31.74 19.27
## Grad.Rate 103.00 -0.33 0.26 0.72
## Private* 0.00 NaN NaN 0.00
## isElite* 0.00 NaN NaN 0.00
## cluster* 0.00 NaN NaN 0.00
## ------------------------------------------------------------
## group: 2
## vars n mean sd median trimmed mad min max
## name* 1 65 33.00 18.91 33.00 33.00 23.72 1.00 65.00
## accept_rate 2 65 0.54 0.20 0.53 0.54 0.22 0.15 0.96
## Outstate 3 65 16433.52 3163.36 17475.00 16844.92 2157.18 5224.00 20100.00
## Enroll 4 65 752.12 544.77 573.00 677.53 424.02 137.00 2505.00
## Grad.Rate 5 65 84.57 12.20 89.00 85.53 11.86 54.00 100.00
## Private* 6 65 2.00 0.00 2.00 2.00 0.00 2.00 2.00
## isElite* 7 65 2.00 0.00 2.00 2.00 0.00 2.00 2.00
## cluster* 8 65 2.00 0.00 2.00 2.00 0.00 2.00 2.00
## range skew kurtosis se
## name* 64.00 0.00 -1.26 2.35
## accept_rate 0.81 0.06 -0.93 0.03
## Outstate 14876.00 -1.28 1.20 392.37
## Enroll 2368.00 1.32 1.36 67.57
## Grad.Rate 46.00 -0.61 -0.74 1.51
## Private* 0.00 NaN NaN 0.00
## isElite* 0.00 NaN NaN 0.00
## cluster* 0.00 NaN NaN 0.00
## ------------------------------------------------------------
## group: 3
## vars n mean sd median trimmed mad min max
## name* 1 199 100.00 57.59 100.00 100.00 74.13 1.00 199
## accept_rate 2 199 0.74 0.14 0.75 0.74 0.14 0.37 1
## Outstate 3 199 6649.42 1969.22 6597.00 6516.04 1848.80 2580.00 15516
## Enroll 4 199 1577.97 1219.52 1292.00 1412.39 1009.65 153.00 6392
## Grad.Rate 5 199 54.64 13.63 54.00 54.41 13.34 10.00 100
## Private* 6 199 1.00 0.00 1.00 1.00 0.00 1.00 1
## isElite* 7 199 1.00 0.00 1.00 1.00 0.00 1.00 1
## cluster* 8 199 3.00 0.00 3.00 3.00 0.00 3.00 3
## range skew kurtosis se
## name* 198.00 0.00 -1.22 4.08
## accept_rate 0.63 -0.37 -0.34 0.01
## Outstate 12936.00 0.91 2.14 139.59
## Enroll 6239.00 1.47 2.50 86.45
## Grad.Rate 90.00 0.20 0.63 0.97
## Private* 0.00 NaN NaN 0.00
## isElite* 0.00 NaN NaN 0.00
## cluster* 0.00 NaN NaN 0.00
## ------------------------------------------------------------
## group: 4
## vars n mean sd median trimmed mad min max
## name* 1 13 7.00 3.89 7.00 7.00 4.45 1.00 13.00
## accept_rate 2 13 0.54 0.14 0.53 0.54 0.17 0.34 0.78
## Outstate 3 13 9323.77 3108.54 8400.00 9098.73 2833.25 5391.00 15732.00
## Enroll 4 13 2603.69 1541.66 2478.00 2505.00 1697.58 588.00 5705.00
## Grad.Rate 5 13 77.46 11.98 80.00 78.27 10.38 51.00 95.00
## Private* 6 13 1.00 0.00 1.00 1.00 0.00 1.00 1.00
## isElite* 7 13 2.00 0.00 2.00 2.00 0.00 2.00 2.00
## cluster* 8 13 2.69 0.48 3.00 2.73 0.00 2.00 3.00
## range skew kurtosis se
## name* 12.00 0.00 -1.48 1.08
## accept_rate 0.44 0.22 -1.51 0.04
## Outstate 10341.00 0.49 -1.08 862.15
## Enroll 5117.00 0.51 -0.90 427.58
## Grad.Rate 44.00 -0.52 -0.42 3.32
## Private* 0.00 NaN NaN 0.00
## isElite* 0.00 NaN NaN 0.00
## cluster* 1.00 -0.74 -1.56 0.13
To conclude, there are four clusters among the 777 colleges.
Cluster 1 are mainly private, non-elite colleges with medium levels of out-of-state tuition and smaller levels of student enrollment.
Cluster 2 consists of private, elite colleges with lower acceptance rates and higher rates of out-of-state tuition and graduation.
Cluster 3 is mainly public and non-elite, with the lowest rates of tuition and graduation but the largest levels of enrollment.
Lastly, Cluster 4 consists of 13 public colleges with rather high enrollment, low acceptance rates, medium tuition, and high graduation rates.
Q: So, what would our college clusters look like when plotted with principal components?
A: Let’s see:
pam_fit <- pam(college_clean[, -1], k = 3) # pass the raw data here
fviz_cluster(pam_fit,
data = college_clean,
palette = "Set2",
ggtheme = theme_minimal(),
geom = "point")pam_fit <- pam(college_clean[, -1], k = 2) # pass the raw data here
fviz_cluster(pam_fit,
data = college_clean,
palette = "Set2",
ggtheme = theme_minimal(),
geom = "point")It doesn’t look good to me.
Table of Difference between PCA and t-SNE (Source: https://www.geeksforgeeks.org/difference-between-pca-vs-t-sne/)
| PCA | t-SNE |
|---|---|
| is a linear dimensionality-reduction technique | is a non-linear dimensionality-reduction technique |
| tries to preserve the global structure of the data | tries to preserve the local structure of the data* |
| does not involve hyperparameters | has hyperparameters (perplexity, learning rate, number of steps) |
| is highly affected by outliers | can handle outliers |
| is deterministic (can be reproduced on another dataset) | is a non-deterministic (randomised) algorithm |
| rotates vectors to preserve variance | minimises the distances between points based on Gaussian similarities |
| the number of components is chosen using eigenvalues | preserves distances rather than variance, controlled by hyperparameters |
*"t-SNE preserves only local similarities, whereas PCA preserves large pairwise distances to maximize variance" (https://medium.com/analytics-vidhya/pca-vs-t-sne-17bcd882bf3d)
Bottom line:
The best comparative discussion so far, in my opinion: https://stats.stackexchange.com/questions/238538/are-there-cases-where-pca-is-more-suitable-than-t-sne
Q: What is t-SNE in practice?
A:
- t-SNE is a method of non-linear dimensionality reduction (it can separate data that cannot be separated by any straight line)
- t-SNE is iterative, so unlike PCA you cannot apply it to another dataset (e.g., in a train/test split)
When to apply: t-SNE is mostly used to understand high-dimensional data and project it into a low-dimensional space (2D or 3D)
! It is non-deterministic and iterative, so each run can produce a different result (Source: https://towardsdatascience.com/t-sne-clearly-explained-d84c537f53a)
! You cannot see relative sizes of clusters in a t-SNE plot (Source: https://distill.pub/2016/misread-tsne/)
Example: An empirical comparison on the pictures of sign language (Source: https://towardsdatascience.com/dimensionality-reduction-for-data-visualization-pca-vs-tsne-vs-umap-be4aa7b1cb29)
Q: How to interpret t-SNE axes? Do they compare to principal components?
A: “Individual axes in t-SNE have no meaning at all. Algorithms such as MDS, SNE, t-SNE, etc. only care about pairwise distances between points.” https://stats.stackexchange.com/questions/348927/what-is-the-meaning-of-the-axes-in-t-sne
“Unlike PCA, axes in the low dimensional space don’t have a particular meaning. In fact, one could arbitrarily rotate the low dimensional points and the t-SNE cost function wouldn’t change. The relevant information is in the relative distances between low dimensional points. t-SNE captures structure in the sense that neighboring points in the input space will tend to be neighbors in the low dimensional space.”
“t-SNE does not scale well with size of dataset, while PCA does” https://stats.stackexchange.com/questions/238538/are-there-cases-where-pca-is-more-suitable-than-t-sne
Q: What is the best use of t-SNE?
A: “use t-SNE for visualization (and try different parameters to get something visually pleasing!), but rather do not run clustering afterwards” https://stats.stackexchange.com/questions/263539/clustering-on-the-output-of-t-sne
One way that t-SNE visualizations can be useful is by combining them with external information. https://stats.stackexchange.com/questions/331745/how-to-interpret-t-sne-plot
Fuzzy clustering: finding cluster solutions with cluster-membership probabilities
library(cluster)
fuzzy_fit <- fanny(iris[ ,3:4], 3, metric = "euclidean", stand = FALSE)
head(fuzzy_fit$membership, 3)
## [,1] [,2] [,3]
## [1,] 0.9717502 0.01658896 0.01166088
## [2,] 0.9717502 0.01658896 0.01166088
## [3,] 0.9460213 0.03157233 0.02240638
fviz_cluster(fuzzy_fit,
ellipse.type = "norm",
repel = TRUE,
palette = "jco",
ggtheme = theme_minimal(),
legend = "right")Another function for fuzzy C-means:
https://cran.r-project.org/web/packages/ppclust/vignettes/fcm.html
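The membership degrees shown below were presumably produced with ppclust's fcm() along these lines (a sketch; the exact call is not shown):
library(ppclust)
fcm_fit <- fcm(iris[ ,3:4], centers = 3) # fuzzy C-means on Petal.Length and Petal.Width
head(fcm_fit$u, 3)                       # membership degrees of the first three observations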
## Cluster 1 Cluster 2 Cluster 3
## 1 0.001993004 0.9971282 0.0008788244
## 2 0.014852798 0.9787709 0.0063762576
## 3 0.012332674 0.9821882 0.0054790852
## Cluster 1 Cluster 2 Cluster 3
## 0.74697950 0.02831727 0.22470322
##
## 1 2 3
## 1 0 50 0
## 2 64 0 0
## 3 0 0 36
fviz_cluster(ppclust2(fcm_fit, "kmeans"), data = iris[ ,3:4],
ellipse.type = "norm",
repel = TRUE,
palette = "jco",
ggtheme = theme_minimal(),
legend = "right")Read more about DBSCAN: http://www.sthda.com/english/wiki/wiki.php?id_contents=7940
“Two important parameters are required for DBSCAN: epsilon ("eps") and minimum points ("MinPts"). The parameter eps defines the radius of the neighborhood around a point x. It's called the ε-neighborhood of x. The parameter MinPts is the minimum number of neighbors within the "eps" radius."
Data source: https://www.r-bloggers.com/chaining-effect-in-clustering/
In what follows, I replicate the example of a Christmas tree-based dataset clustered with DBSCAN.
DBSCAN requires setting hyperparameters, which is why below I will try and compare different solutions to find the best one.
First, I reproduce the data picture with 500 points.
How would hierarchical clustering work here? Would it be able to locate 24 clusters?
library(factoextra)
centers <- as.data.frame(km$centers) # these are the 500 data points that you see
d <- dist(centers, method ="euclidean")
hc <- hclust(d, method = "single")
g <- fviz_dend(hc, k=24, show_labels = F) # we get a dendrogram
How do these clusters correspond to the data?
library(ggrepel)
cluster <- cutree(hc, k = 24)
c <- centers %>% bind_cols(as.data.frame(cluster)) %>%
group_by(cluster) %>%
summarise(x = first(x), y = first(y))
g <- centers %>% bind_cols(as.data.frame(cluster)) %>%
ggplot(aes(x, y, color = as.factor(cluster))) +
geom_point() +
geom_label_repel(data = c,
aes(x, y, label = cluster),
box.padding = 1) +
theme(legend.position = "none")
g
Yes, it is a success.
Let’s compare the result to DBSCAN.
To find out the optimal eps parameter, one can “calculate the average of the distances of every point to its k nearest neighbors. Next, these k-distances are plotted in an ascending order. The aim is to determine the “knee”, which corresponds to the optimal eps parameter. A knee corresponds to a threshold where a sharp change occurs along the k-distance curve."
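A sketch of such a k-distance plot for the 500 points (assuming the dbscan package; the cut-off drawn here is an assumption chosen by eye):
library(dbscan)
kNNdistplot(centers, k = 2) # sorted distances to the 2nd nearest neighbour
abline(h = 20, lty = 2)     # a candidate eps where the curve bends sharply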
dbscan(centers,
eps = 28, # the radius of neighborhood around a point x
       MinPts = 2) # the minimum number of neighbors within "eps" radius
## DBSCAN clustering for 500 objects.
## Parameters: eps = 28, minPts = 2
## The clustering contains 21 cluster(s) and 0 noise points.
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## 18 14 12 131 57 3 11 5 15 11 78 11 8 89 11 3 3 8 6 3
## 21
## 3
##
## Available fields: cluster, eps, minPts
This solution finds only 21 clusters (and no noise points), short of the 24 we expect.
Let us try varying the parameters and compare the results:
## DBSCAN clustering for 500 objects.
## Parameters: eps = 40, minPts = 2
## The clustering contains 16 cluster(s) and 0 noise points.
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 157 17 12 57 3 11 93 11 11 8 89 11 3 8 6 3
##
## Available fields: cluster, eps, minPts
## DBSCAN clustering for 500 objects.
## Parameters: eps = 30, minPts = 2
## The clustering contains 19 cluster(s) and 0 noise points.
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
## 18 14 12 136 57 3 11 93 11 11 8 89 11 3 3 8 6 3 3
##
## Available fields: cluster, eps, minPts
## DBSCAN clustering for 500 objects.
## Parameters: eps = 28, minPts = 2
## The clustering contains 21 cluster(s) and 0 noise points.
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## 18 14 12 131 57 3 11 5 15 11 78 11 8 89 11 3 3 8 6 3
## 21
## 3
##
## Available fields: cluster, eps, minPts
res4 <- dbscan(centers, eps = 20, MinPts = 2) # 23 clusters plus a few noise points = the best solution, see the plots
res4
## DBSCAN clustering for 500 objects.
## Parameters: eps = 20, minPts = 2
## The clustering contains 23 cluster(s) and 4 noise points.
##
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
## 4 18 14 12 130 57 3 11 5 15 11 52 19 11 8 86 11 7 3 8
## 20 21 22 23
## 6 3 3 3
##
## Available fields: cluster, eps, minPts
## DBSCAN clustering for 500 objects.
## Parameters: eps = 10, minPts = 2
## The clustering contains 27 cluster(s) and 367 noise points.
##
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
## 367 18 5 15 4 3 5 11 9 3 7 10 5 2 3 8 2 2 2 2
## 20 21 22 23 24 25 26 27
## 2 2 3 2 2 2 2 2
##
## Available fields: cluster, eps, minPts
h_avg_cut2 <- hcut(centers, k = 24, hc_method = "average")
fviz_cluster(h_avg_cut2, centers, ellipse.type = "convex")
hfit_ward2 <- hcut(centers, k = 24, hc_method = "ward.D2")
fviz_cluster(hfit_ward2, centers, ellipse.type = "convex")
Compared to k-means or hierarchical clustering with average or Ward linkage, the density-based DBSCAN method is much more efficient here and comparable to single-linkage hierarchical clustering.
Classify US states by crime types
## Murder Assault UrbanPop Rape
## Alabama 13.2 236 58 21.2
## Alaska 10.0 263 48 44.5
## Arizona 8.1 294 80 31.0
## Arkansas 8.8 190 50 19.5
## California 9.0 276 91 40.6
## Colorado 7.9 204 78 38.7
Tasks for practice:
## Murder Assault UrbanPop Rape
## Min. : 0.800 Min. : 45.0 Min. :32.00 Min. : 7.30
## 1st Qu.: 4.075 1st Qu.:109.0 1st Qu.:54.50 1st Qu.:15.07
## Median : 7.250 Median :159.0 Median :66.00 Median :20.10
## Mean : 7.788 Mean :170.8 Mean :65.54 Mean :21.23
## 3rd Qu.:11.250 3rd Qu.:249.0 3rd Qu.:77.75 3rd Qu.:26.18
## Max. :17.400 Max. :337.0 Max. :91.00 Max. :46.00
## [1] "Murder" "Assault" "UrbanPop" "Rape" "name"
## Murder Assault UrbanPop Rape
## Min. :-1.6044 Min. :-1.5090 Min. :-2.31714 Min. :-1.4874
## 1st Qu.:-0.8525 1st Qu.:-0.7411 1st Qu.:-0.76271 1st Qu.:-0.6574
## Median :-0.1235 Median :-0.1411 Median : 0.03178 Median :-0.1209
## Mean : 0.0000 Mean : 0.0000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.7949 3rd Qu.: 0.9388 3rd Qu.: 0.84354 3rd Qu.: 0.5277
## Max. : 2.2069 Max. : 1.9948 Max. : 1.75892 Max. : 2.6444
## name
## Length:50
## Class :character
## Mode :character
##
##
##
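The data frames df (raw USArrests plus a state-name column) and df1 (the standardized version) are not created in the code shown; a minimal sketch of how they were presumably built:
df <- USArrests
df$name <- rownames(USArrests)         # keep the state names as a variable
df1 <- as.data.frame(scale(USArrests)) # standardized copy of the four crime variables
df1$name <- rownames(USArrests)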
d <- dist(df[,-5], method = 'euclidean')
heatmap(
as.matrix(d),
symm = T,
distfun = function(x)
as.dist(x)
)
d1 <- dist(df1[,-5], method = 'euclidean')
heatmap(
as.matrix(d1),
symm = T,
distfun = function(x)
as.dist(x)
)
library(cluster)
h_ward <- agnes(d, method = "ward")
h_average <- agnes(d, method = "average")
h_complete <- agnes(d, method = "complete")
h_single <- agnes(d, method = "single")
library(factoextra)
fviz_dend(h_ward, show_labels = F, rect_border = T)
h_ave_cut <- hcut(d, k = 3, hc_method = "average")
fviz_cluster(h_ave_cut, data = df[,-5], ellipse.type = "convex")
cl <- cutree(h_average, k = 3)
df <-
mutate(df, cluster = cl) # add cluster membership as a column to the data frame
head(df)
## Murder Assault UrbanPop Rape name cluster
## Alabama 13.2 236 58 21.2 Alabama 1
## Alaska 10.0 263 48 44.5 Alaska 1
## Arizona 8.1 294 80 31.0 Arizona 1
## Arkansas 8.8 190 50 19.5 Arkansas 2
## California 9.0 276 91 40.6 California 1
## Colorado 7.9 204 78 38.7 Colorado 2
## # A tibble: 3 x 5
## cluster Murder Assault UrbanPop Rape
## <int> <dbl> <dbl> <dbl> <dbl>
## 1 1 11.8 273. 68.3 28.4
## 2 2 8.21 173. 70.6 22.8
## 3 3 4.27 87.6 59.8 14.4
library(cluster)
h_ward <- agnes(d1, method = "ward")
h_average <- agnes(d1, method = "average")
h_complete <- agnes(d1, method = "complete")
h_single <- agnes(d1, method = "single")
library(factoextra)
fviz_dend(h_ward, show_labels = F, rect_border = T)
h_ave_cut <- hcut(d1, k = 2, hc_method = "average")
fviz_cluster(h_ave_cut, data = df1[,-5], ellipse.type = "convex")
cl <- cutree(h_average, k = 2)
df1 <-
mutate(df1, cluster = cl) # add cluster membership as a column to the data frame
head(df1)
## Murder Assault UrbanPop Rape name cluster
## Alabama 1.24256408 0.7828393 -0.5209066 -0.003416473 Alabama 1
## Alaska 0.50786248 1.1068225 -1.2117642 2.484202941 Alaska 1
## Arizona 0.07163341 1.4788032 0.9989801 1.042878388 Arizona 1
## Arkansas 0.23234938 0.2308680 -1.0735927 -0.184916602 Arkansas 2
## California 0.27826823 1.2628144 1.7589234 2.067820292 California 1
## Colorado 0.02571456 0.3988593 0.8608085 1.864967207 Colorado 1
## # A tibble: 2 x 5
## cluster Murder Assault UrbanPop Rape
## <int> <dbl> <dbl> <dbl> <dbl>
## 1 1 1.00 1.01 0.198 0.847
## 2 2 -0.670 -0.676 -0.132 -0.565
set.seed(1119)
df_cl <- kmeans(df[ , 1:4], 3, nstart = 20)
df$cluster <- df_cl$cluster
df_scaled <- as.data.frame(cbind(scale(df[,1:4]), df$cluster))
head(df_scaled)
## Murder Assault UrbanPop Rape V5
## Alabama 1.24256408 0.7828393 -0.5209066 -0.003416473 1
## Alaska 0.50786248 1.1068225 -1.2117642 2.484202941 1
## Arizona 0.07163341 1.4788032 0.9989801 1.042878388 1
## Arkansas 0.23234938 0.2308680 -1.0735927 -0.184916602 3
## California 0.27826823 1.2628144 1.7589234 2.067820292 1
## Colorado 0.02571456 0.3988593 0.8608085 1.864967207 3
names(df_scaled) <- c("Murder", "Assault", "UrbanPop", "Rape", "cluster")
describeBy(df_scaled, df$cluster)
##
## Descriptive statistics by group
## group: 1
## vars n mean sd median trimmed mad min max range skew kurtosis
## Murder 1 16 0.92 0.64 0.91 0.95 0.54 -0.43 1.91 2.34 -0.28 -0.74
## Assault 2 16 1.22 0.37 1.08 1.20 0.29 0.78 1.99 1.21 0.84 -0.48
## UrbanPop 3 16 0.19 1.07 0.38 0.20 1.13 -1.49 1.76 3.25 -0.32 -1.40
## Rape 4 16 0.76 1.03 0.61 0.72 0.86 -0.58 2.64 3.22 0.40 -1.10
## cluster 5 16 1.00 0.00 1.00 1.00 0.00 1.00 1.00 0.00 NaN NaN
## se
## Murder 0.16
## Assault 0.09
## UrbanPop 0.27
## Rape 0.26
## cluster 0.00
## ------------------------------------------------------------
## group: 2
## vars n mean sd median trimmed mad min max range skew
## Murder 1 20 -0.81 0.53 -0.97 -0.85 0.49 -1.60 0.44 2.04 0.57
## Assault 2 20 -1.00 0.34 -0.92 -0.98 0.38 -1.51 -0.61 0.90 -0.30
## UrbanPop 3 20 -0.40 0.96 -0.42 -0.38 0.82 -2.32 1.21 3.52 -0.13
## Rape 4 20 -0.73 0.50 -0.71 -0.74 0.55 -1.49 0.18 1.67 0.23
## cluster 5 20 2.00 0.00 2.00 2.00 0.00 2.00 2.00 0.00 NaN
## kurtosis se
## Murder -0.66 0.12
## Assault -1.63 0.08
## UrbanPop -0.96 0.22
## Rape -1.20 0.11
## cluster NaN 0.00
## ------------------------------------------------------------
## group: 3
## vars n mean sd median trimmed mad min max range skew kurtosis
## Murder 1 14 0.10 0.91 -0.03 0.01 0.70 -1.01 2.21 3.21 0.82 -0.24
## Assault 2 14 0.03 0.27 -0.04 0.02 0.31 -0.31 0.48 0.79 0.33 -1.51
## UrbanPop 3 14 0.35 0.82 0.24 0.37 0.92 -1.07 1.62 2.69 0.06 -1.31
## Rape 4 14 0.17 0.79 0.20 0.16 0.64 -1.38 1.86 3.25 0.12 -0.24
## cluster 5 14 3.00 0.00 3.00 3.00 0.00 3.00 3.00 0.00 NaN NaN
## se
## Murder 0.24
## Assault 0.07
## UrbanPop 0.22
## Rape 0.21
## cluster 0.00
df_scaled$cluster <- as.factor(df_scaled$cluster)
levels(df_scaled$cluster) <- c("Criminal urbanized", "Peaceful rural", "Peaceful urbanized")
rownames(df) <- df$name
head(df)
## Murder Assault UrbanPop Rape name cluster
## Alabama 13.2 236 58 21.2 Alabama 1
## Alaska 10.0 263 48 44.5 Alaska 1
## Arizona 8.1 294 80 31.0 Arizona 1
## Arkansas 8.8 190 50 19.5 Arkansas 3
## California 9.0 276 91 40.6 California 1
## Colorado 7.9 204 78 38.7 Colorado 3
##
## Descriptive statistics by group
## group: Criminal urbanized
## vars n mean sd median trimmed mad min max range skew
## Murder 1 16 11.81 2.80 11.75 11.93 2.37 5.9 16.1 10.2 -0.28
## Assault 2 16 272.56 31.05 261.00 270.57 24.46 236.0 337.0 101.0 0.84
## UrbanPop 3 16 68.31 15.49 71.00 68.43 16.31 44.0 91.0 47.0 -0.32
## Rape 4 16 28.38 9.60 26.95 28.01 8.08 15.8 46.0 30.2 0.40
## name* 5 16 8.50 4.76 8.50 8.50 5.93 1.0 16.0 15.0 0.00
## cluster* 6 16 1.00 0.00 1.00 1.00 0.00 1.0 1.0 0.0 NaN
## kurtosis se
## Murder -0.74 0.70
## Assault -0.48 7.76
## UrbanPop -1.40 3.87
## Rape -1.10 2.40
## name* -1.43 1.19
## cluster* NaN 0.00
## ------------------------------------------------------------
## group: Peaceful rural
## vars n mean sd median trimmed mad min max range skew
## Murder 1 20 4.27 2.30 3.55 4.09 2.15 0.8 9.7 8.9 0.57
## Assault 2 20 87.55 28.16 94.00 88.75 31.88 45.0 120.0 75.0 -0.30
## UrbanPop 3 20 59.75 13.92 59.50 60.06 11.86 32.0 83.0 51.0 -0.13
## Rape 4 20 14.39 4.67 14.55 14.28 5.11 7.3 22.9 15.6 0.23
## name* 5 20 10.50 5.92 10.50 10.50 7.41 1.0 20.0 19.0 0.00
## cluster* 6 20 2.00 0.00 2.00 2.00 0.00 2.0 2.0 0.0 NaN
## kurtosis se
## Murder -0.66 0.52
## Assault -1.63 6.30
## UrbanPop -0.96 3.11
## Rape -1.20 1.04
## name* -1.38 1.32
## cluster* NaN 0.00
## ------------------------------------------------------------
## group: Peaceful urbanized
## vars n mean sd median trimmed mad min max range skew
## Murder 1 14 8.21 3.94 7.65 7.85 3.04 3.4 17.4 14.0 0.82
## Assault 2 14 173.29 22.18 167.50 172.50 25.95 145.0 211.0 66.0 0.33
## UrbanPop 3 14 70.64 11.85 69.00 70.83 13.34 50.0 89.0 39.0 0.06
## Rape 4 14 22.84 7.40 23.10 22.73 6.00 8.3 38.7 30.4 0.12
## name* 5 14 7.50 4.18 7.50 7.50 5.19 1.0 14.0 13.0 0.00
## cluster* 6 14 3.00 0.00 3.00 3.00 0.00 3.0 3.0 0.0 NaN
## kurtosis se
## Murder -0.24 1.05
## Assault -1.51 5.93
## UrbanPop -1.31 3.17
## Rape -0.24 1.98
## name* -1.46 1.12
## cluster* NaN 0.00
df_cl$cluster <- df$cluster
fviz_cluster(df_cl, data = df[, 1:4], stand = T, show.clust.cent = T,
palette = "Set2", ggtheme = theme_minimal(), geom = "text", repel = T)## Saving 8 x 6 in image
# Representative states have average values. I would answer this question by running a PAM and calling the medoids:
pam_us <- pam(d, diss = TRUE, k = 3)
df[pam_us$medoids, ]
## Murder Assault UrbanPop Rape name cluster
## Michigan 12.1 255 74 35.1 Michigan Criminal urbanized
## Missouri 9.0 178 70 28.2 Missouri Peaceful urbanized
## Nebraska 4.3 102 62 16.5 Nebraska Peaceful rural
# Since this is a dataset by states, you can also colour the states by cluster:
library(usmap)
library(ggplot2)
names(df) <- c("Murder", "Assault", "UrbanPop", "Rape", "state", "cluster")
plot_usmap(data = df, values = "cluster") +
theme(legend.position = "top",
legend.box = "vertical") +
scale_fill_discrete(name = "Cluster") +
labs(title = "US Clusters by Crime",
subtitle = "urban areas show more crime",
caption = "Data: USArrests data")