Clusters 102

Anna Shirokanova

10 02 2020

Example 1: Fisher’s Irises

In this dataset, the object features are metric only.

  1. Start by describing the data and looking at the variables that differ most between groups
library(datasets)
head(iris, 3)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
library(psych)
desc <- describeBy(iris, iris$Species, mat = TRUE)

Petal.Length and Petal.Width are similar within the same species but vary considerably between different species:

desc[1:12, 1:6]
##               item     group1 vars  n  mean        sd
## Sepal.Length1    1     setosa    1 50 5.006 0.3524897
## Sepal.Length2    2 versicolor    1 50 5.936 0.5161711
## Sepal.Length3    3  virginica    1 50 6.588 0.6358796
## Sepal.Width1     4     setosa    2 50 3.428 0.3790644
## Sepal.Width2     5 versicolor    2 50 2.770 0.3137983
## Sepal.Width3     6  virginica    2 50 2.974 0.3224966
## Petal.Length1    7     setosa    3 50 1.462 0.1736640
## Petal.Length2    8 versicolor    3 50 4.260 0.4699110
## Petal.Length3    9  virginica    3 50 5.552 0.5518947
## Petal.Width1    10     setosa    4 50 0.246 0.1053856
## Petal.Width2    11 versicolor    4 50 1.326 0.1977527
## Petal.Width3    12  virginica    4 50 2.026 0.2746501
tapply(iris$Petal.Length, iris$Species, mean)
##     setosa versicolor  virginica 
##      1.462      4.260      5.552
tapply(iris$Petal.Width, iris$Species, mean)
##     setosa versicolor  virginica 
##      0.246      1.326      2.026

Let us visualize their distribution:

library(ggplot2)
ggplot(iris, aes(Petal.Length, Petal.Width, color = Species)) + 
  geom_point() +
  theme_classic()

PARTITIONING: K-means clustering (1)

  1. Let’s try to cluster the irises using the top-down approach
set.seed(20) # ensures the random starting assignments are reproducible (may be omitted)
irisCluster <- kmeans(iris[ , 3:4],  # Petal.Length and Petal.Width
                      3, # how many groups to locate
                      nstart = 20 # R will try 20 different random starting assignments 
                      # and then select the one with the lowest within cluster variation.
)

Let’s compare the results of clustering against the pre-set classification:

table(irisCluster$cluster, iris$Species)
##    
##     setosa versicolor virginica
##   1      0         48         4
##   2      0          2        46
##   3     50          0         0

Six objects (2 + 4) have been misidentified.

We can use the ‘elbow method’ to determine the optimal k.

To do this, we run several models with k = 1:10 and compare the total within-cluster sum of squares (the sum of squared distances between observations and their cluster centroid) in each case.

A low within-cluster sum of squares indicates high intra-cluster similarity, i.e. compact clusters.

The optimal number of clusters (k) is where the elbow occurs, i.e. the point after which increasing k no longer reduces the within-cluster sum of squares substantially.
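For reference, the within-cluster sum of squares behind the elbow plot can be computed by hand; a minimal sketch (fviz_nbclust below performs essentially the same computation):

# Sketch: total within-cluster sum of squares for k = 1:10
wss <- sapply(1:10, function(k) {
  kmeans(iris[ , 3:4], centers = k, nstart = 20)$tot.withinss
})
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")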

library(factoextra)
library(cluster)
fviz_nbclust(iris[ , 3:4], kmeans, method = "wss")

The silhouette width statistic:

"The average silhouette approach measures the quality of a clustering. That is, it determines how well each object lies within its cluster.

A high average silhouette width indicates a good clustering." Source: https://uc-r.github.io/kmeans_clustering
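As a cross-check, the average silhouette width of the 3-cluster k-means solution above can be computed directly with cluster::silhouette(); a minimal sketch (the value should roughly match the k = 3 point in the plot below):

library(cluster)
sil <- silhouette(irisCluster$cluster, dist(iris[ , 3:4]))
mean(sil[ , "sil_width"]) # average silhouette width for k = 3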

fviz_nbclust(iris[ , 3:4], kmeans, method = "silhouette")

The gap statistic:

"The approach can be applied to any clustering method (i.e. K-means clustering, hierarchical clustering).

The gap statistic compares the total intracluster variation for different values of k with their expected values under null reference distribution of the data (i.e. a distribution with no obvious clustering)." Source: https://uc-r.github.io/kmeans_clustering#gap
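The gap statistic itself can be computed with cluster::clusGap(); a minimal sketch, assuming B = 50 bootstrap reference sets is enough for illustration:

gap <- clusGap(iris[ , 3:4], FUNcluster = kmeans, nstart = 20, K.max = 10, B = 50)
print(gap, method = "firstmax") # suggested k under the 'firstmax' rule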

fviz_nbclust(iris[ , 3:4], kmeans, method = "gap_stat")

According to our scree plot (elbow plot), the optimal k is 2 or 3.

PARTITIONING: K-means clustering (2)

To visualize the solution in two dimensions, use:

fviz_cluster(irisCluster, data = iris[ ,3:4],
  ellipse.type = "convex",
  palette = "jco",
  repel = TRUE)

fviz_cluster(irisCluster, data = iris[ ,3:4],
   palette = "Set2", 
   ggtheme = theme_minimal())

Compare this solution to the one based on ALL the variables:

fviz_cluster(irisCluster, data = iris[, -5],
   palette = "Set2", ggtheme = theme_minimal())

Conclusion: k-means works well when domain knowledge is rich.

If little is known, use the most discriminating variables for clustering. The decision on the number of clusters can be based on metrics such as the elbow, silhouette width, or gap statistic.

In this example, irises can be clustered best by using 2 variables out of 4.

HIERARCHICAL: agnes - Agglomerative Hierarchical clustering algorithm

  1. Now, let’s try to cluster the irises using the bottom-up approach:

The choice of the metric may have a large impact depending on what data you have.

iris.use <- subset(iris, select = -Species) # select all variables but the "Species"
d <- dist(iris.use, method = 'euclidean') # calculate a distance matrix on selected variables

See an example here:

Figure. Three clustering methods for the target-looking data.

Source: https://www.semanticscholar.org/paper/U*C%3A-Self-organized-Clustering-with-Emergent-Maps-Ultsch/4b222ec0a8f7d016e5702ae79ec20707fa9a54f9

What distance to choose?

Distance metrics and when to use them:

euclidean - usual distance between two vectors (interval variables)
manhattan - sum of absolute differences between two vectors
canberra - for non-negative values (e.g., counts)
minkowski - a generalization suitable for both ordinal and quantitative variables
gower - the weighted mean of the contributions of each variable (for mixed types)
binary - Jaccard distance for dichotomous variables

Read more here: http://www.sthda.com/english/wiki/clarifying-distance-measures-unsupervised-machine-learning#distances-and-scaling
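A small sketch comparing some of these choices on the first five irises (Gower requires daisy() from the cluster package; dist() covers the others):

library(cluster)
x <- iris[1:5, 3:4]
dist(x, method = "euclidean")
dist(x, method = "manhattan")
dist(x, method = "minkowski", p = 3)
as.matrix(daisy(iris[1:5, 3:5], metric = "gower")) # mixed: two numeric variables + Species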

Once you have the distance matrix, a plot with distances between observations and a dendrogram can be produced:

heatmap(as.matrix(d), symm = T,
        distfun = function(x) as.dist(x)) 

In this plot, clusters appear as red squares along the diagonal.

How many clusters can you see here, 2 or 3?

On to hierarchical clustering

  1. Get the cluster solution
library(cluster)
hfit_ward <- agnes(d, method = "ward") # merges the pair of clusters giving the smallest increase in within-cluster variance
hfit_average <- agnes(d, method = "average") # distance between clusters = average distance over all pairs of their observations
hfit_complete <- agnes(d, method = "complete") # distance between clusters = maximal distance between their observations

Read more on methods: https://www.stat.cmu.edu/~cshalizi/350/lectures/08/lecture-08.pdf

  1. In this case, the species are known, so classification quality can be compared:
table(cutree(hfit_ward, 3), iris$Species) # 16 incorrect
##    
##     setosa versicolor virginica
##   1     50          0         0
##   2      0         49        15
##   3      0          1        35
table(cutree(hfit_average, 3), iris$Species) # 14 incorrect
##    
##     setosa versicolor virginica
##   1     50          0         0
##   2      0         50        14
##   3      0          0        36
table(cutree(hfit_complete, 3), iris$Species) # 24 incorrect
##    
##     setosa versicolor virginica
##   1     50          0         0
##   2      0         23        49
##   3      0         27         1

In our case, the average linkage method (2nd table) seems to work best, as it produces the fewest misclassified observations.
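Agreement with the known species labels can also be quantified; a hedged sketch using the adjusted Rand index (this assumes the mclust package, which is not used elsewhere in these notes):

library(mclust)
# 1 = perfect agreement with the species labels, 0 = agreement expected by chance
adjustedRandIndex(cutree(hfit_average, 3), iris$Species)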

h_avg_cut <- hcut(d, k = 3, hc_method = "average")
fviz_cluster(h_avg_cut, iris.use, ellipse.type = "convex")

Let’s further visualise the result!

Dendrograms to visualize results

cl <- cutree(hfit_average, k = 3) # cut the dendrogram to obtain exactly 3 clusters 

library(dplyr)
iris <- mutate(iris, cluster = cl) # add cluster membership as a column to the data frame
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species cluster
## 1          5.1         3.5          1.4         0.2  setosa       1
## 2          4.9         3.0          1.4         0.2  setosa       1
## 3          4.7         3.2          1.3         0.2  setosa       1
## 4          4.6         3.1          1.5         0.2  setosa       1
## 5          5.0         3.6          1.4         0.2  setosa       1
## 6          5.4         3.9          1.7         0.4  setosa       1
library(magrittr)
iris[ ,-5] %>%
  group_by(cluster) %>% 
  summarise_all(mean) # get the mean values of all variables by cluster
## # A tibble: 3 x 5
##   cluster Sepal.Length Sepal.Width Petal.Length Petal.Width
##     <int>        <dbl>       <dbl>        <dbl>       <dbl>
## 1       1         5.01        3.43         1.46       0.246
## 2       2         5.93        2.76         4.41       1.44 
## 3       3         6.85        3.08         5.79       2.10

Show the result on the dendrogram by coloring the branches (optional)

A lazy option:

library(factoextra)
fviz_dend(hfit_average, 
          show_labels = FALSE, 
          rect_border = TRUE)

A more elaborate option:

library(dendextend)
dend <- as.dendrogram(hfit_average)
dend_col <- color_branches(dend, k = 3)
plot(dend_col)

This is how the dendrogram was cut to obtain three clusters.

Yet another possibility is to cut the dendrogram at a given height. The height at which branches merge reflects the dissimilarity between the merged clusters, as defined by the chosen distance measure and linkage method.

Here, the dendrogram results from a Euclidean distance matrix and average linkage. Cutting the dendrogram at a height of 1.8 means that the average Euclidean distance between observations in any two merged groups does not exceed 1.8:

h_avg <- hclust(d, method = "average")
cl2 <- cutree(h_avg, h = 1.8)
dend <- as.dendrogram(h_avg)
dend_col <- color_branches(dend, h = 1.8)
plot(dend_col)

plot(h_avg)
rect.hclust(h_avg, k = 3, border = "green")

Better dendrograms: http://stackoverflow.com/questions/14118033/horizontal-dendrogram-in-r-with-labels

A bonus below (optional)

The ‘banner plot’ below:

"The white area on the left of the banner plot represents the unclustered data, while the white lines that stick into the red area show the heights at which the clusters were formed. Since we don’t want to include too many clusters that joined together at similar heights, it looks like three clusters, at a height of about 2, is a good solution. It’s clear from the banner plot that if we lowered the height to, say, 1.5, we’d create a fourth cluster with only a few observations.

The banner plot is just an alternative to the dendrogram."

Link on banner plot: https://www.stat.berkeley.edu/~s133/Cluster2a.html

Link on agglomeration coefficient: https://datascience.stackexchange.com/questions/18860/how-to-interpret-agglomerative-coefficient-agnes-function

plot(hfit_average)
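A sketch of how to request only the banner plot and read off the agglomerative coefficient stored in the agnes object (values close to 1 indicate strong clustering structure):

plot(hfit_average, which.plots = 1) # 1 = banner plot, 2 = dendrogram
hfit_average$ac # agglomerative coefficient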

Clustering Mixed Data Types in R

Clustering data of mixed types (e.g., continuous, ordinal, and nominal) is often of interest.

Read more: https://www.r-bloggers.com/clustering-mixed-data-types-in-r/

For illustration, use the College dataset:

set.seed(2020) # for reproducibility

library(ISLR) # the college dataset
library(Rtsne) # for t-SNE plot
library(tidyverse)

Clustering Mixed Data Types in R (2)

  1. Clean the data:
head(College)
##                              Private Apps Accept Enroll Top10perc Top25perc
## Abilene Christian University     Yes 1660   1232    721        23        52
## Adelphi University               Yes 2186   1924    512        16        29
## Adrian College                   Yes 1428   1097    336        22        50
## Agnes Scott College              Yes  417    349    137        60        89
## Alaska Pacific University        Yes  193    146     55        16        44
## Albertson College                Yes  587    479    158        38        62
##                              F.Undergrad P.Undergrad Outstate Room.Board Books
## Abilene Christian University        2885         537     7440       3300   450
## Adelphi University                  2683        1227    12280       6450   750
## Adrian College                      1036          99    11250       3750   400
## Agnes Scott College                  510          63    12960       5450   450
## Alaska Pacific University            249         869     7560       4120   800
## Albertson College                    678          41    13500       3335   500
##                              Personal PhD Terminal S.F.Ratio perc.alumni Expend
## Abilene Christian University     2200  70       78      18.1          12   7041
## Adelphi University               1500  29       30      12.2          16  10527
## Adrian College                   1165  53       66      12.9          30   8735
## Agnes Scott College               875  92       97       7.7          37  19016
## Alaska Pacific University        1500  76       72      11.9           2  10922
## Albertson College                 675  67       73       9.4          11   9727
##                              Grad.Rate
## Abilene Christian University        60
## Adelphi University                  56
## Adrian College                      54
## Agnes Scott College                 59
## Alaska Pacific University           15
## Albertson College                   55
hist(College$Top10perc)

college_clean = College %>%
  mutate(name = row.names(.), # make college names a variable
         accept_rate = Accept/Apps, # share of applicants who were accepted
         isElite = cut(Top10perc, # % of new students from the top 10% of their high-school class
                       breaks = c(0, 50, 100),
                       labels = c("Not Elite", "Elite"),
                       include.lowest = TRUE)) %>%
  mutate(isElite = factor(isElite)) %>% # label colleges with many top students as "elites"
  select(name, accept_rate, Outstate, Enroll,
         Grad.Rate, Private, isElite)
glimpse(college_clean)
## Observations: 777
## Variables: 7
## $ name        <chr> "Abilene Christian University", "Adelphi University", "...
## $ accept_rate <dbl> 0.7421687, 0.8801464, 0.7682073, 0.8369305, 0.7564767, ...
## $ Outstate    <dbl> 7440, 12280, 11250, 12960, 7560, 13500, 13290, 13868, 1...
## $ Enroll      <dbl> 721, 512, 336, 137, 55, 158, 103, 489, 227, 172, 472, 4...
## $ Grad.Rate   <dbl> 60, 56, 54, 59, 15, 55, 63, 73, 80, 52, 73, 76, 74, 68,...
## $ Private     <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, ...
## $ isElite     <fct> Not Elite, Not Elite, Not Elite, Elite, Not Elite, Not ...
hist(college_clean$Enroll)

Clustering Mixed Data Types in R (3)

  1. Then we have to use a distance metric that can handle mixed data types.

Create a distance object (taking out the name of the school in the 1st column)

library(cluster)
gower_dist <- daisy(college_clean[ , -1],
                    metric = "gower",
                    type = list(logratio = 3)) # Enroll = log(Enroll)

summary(gower_dist)
## 301476 dissimilarities, summarized :
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00186 0.10344 0.23587 0.23145 0.32714 0.77735 
## Metric :  mixed ;  Types = I, I, I, I, N, N 
## Number of objects : 777

Check attributes to ensure the correct methods are being used (I = interval, N = nominal)
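A minimal sketch of such a check: the daisy() result stores the detected variable types and the metric as attributes:

attr(gower_dist, "Types")  # "I" = interval, "N" = nominal
attr(gower_dist, "Metric")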

Clustering Mixed Data Types in R (4)

  1. Look at the most similar and dissimilar pairs in the data to see if they make sense
gower_mat <- as.matrix(gower_dist)

The most similar pair of colleges:

college_clean[
  which(gower_mat == min(gower_mat[gower_mat != min(gower_mat)]), # lowest of those not equal to zero
        arr.ind = TRUE)[1, ], ]
##                            name accept_rate Outstate Enroll Grad.Rate Private
## 682 University of St. Thomas MN   0.8784638    11712    828        89     Yes
## 284     John Carroll University   0.8711276    11700    820        89     Yes
##       isElite
## 682 Not Elite
## 284 Not Elite

The most dissimilar pair of colleges:

college_clean[
  which(gower_mat == max(gower_mat[gower_mat != max(gower_mat)]),
        arr.ind = TRUE)[1, ], ] # arr.ind = TRUE returns row/column indices
##                                        name accept_rate Outstate Enroll
## 673 University of Sci. and Arts of Oklahoma   0.9824561     3687    208
## 251                      Harvard University   0.1561486    18485   1606
##     Grad.Rate Private   isElite
## 673        43      No Not Elite
## 251       100     Yes     Elite

These pairs make sense.

Clustering Mixed Data Types in R (5)

  1. Choose the clustering algorithm

We will use partitioning around medoids (PAM) for handling a custom distance matrix.

Q: How does PAM work?

A: PAM is an iterative algorithm. A ‘medoid’ is the observation in a cluster whose average dissimilarity to all other observations in that cluster is minimal; the clusters are built around these representative observations. PAM works well for datasets of up to roughly 10,000 observations.

Q: How many clusters should be retained in the solution?

A: Look at the silhouette width metric. This is an internal validation metric, an aggregated measure of how similar an observation is to its own cluster, compared to its closest neighboring cluster. The metric can range from -1 to 1, where higher values are better.

  1. Calculate the average silhouette width for several candidate solutions using PAM:
sil_width <- c(NA)

for(i in 2:10){
  pam_fit <- pam(gower_dist,
                 diss = TRUE,
                 k = i)
  sil_width[i] <- pam_fit$silinfo$avg.width
}

Plot the silhouette width (a larger value is better):

plot(1:10, sil_width,
     xlab = "Number of clusters", xaxt='n',
     ylab = "Silhouette Width",
     ylim = c(0,1))
axis(1, at = seq(2, 10, by = 1), las=2)
lines(1:10, sil_width)

Conclusion: select the 3-cluster solution, as it has the highest silhouette width.

pam_fit <- pam(gower_dist, diss = TRUE, k = 3)
pam_fit$data <- college_clean[ , -1]

Clustering Mixed Data Types in R (6)

  1. Interpret the solution:
pam_results <- college_clean %>%
  dplyr::select(-name) %>%
  mutate(cluster = pam_fit$clustering) %>%
  group_by(cluster) %>%
  do(the_summary = summary(.))
pam_results$the_summary
## [[1]]
##   accept_rate        Outstate         Enroll         Grad.Rate      Private  
##  Min.   :0.3283   Min.   : 2340   Min.   :  35.0   Min.   : 15.00   No :  0  
##  1st Qu.:0.7225   1st Qu.: 8842   1st Qu.: 194.8   1st Qu.: 56.00   Yes:500  
##  Median :0.8004   Median :10905   Median : 308.0   Median : 67.50            
##  Mean   :0.7820   Mean   :11200   Mean   : 418.6   Mean   : 66.97            
##  3rd Qu.:0.8581   3rd Qu.:13240   3rd Qu.: 484.8   3rd Qu.: 78.25            
##  Max.   :1.0000   Max.   :21700   Max.   :4615.0   Max.   :118.00            
##       isElite       cluster 
##  Not Elite:500   Min.   :1  
##  Elite    :  0   1st Qu.:1  
##                  Median :1  
##                  Mean   :1  
##                  3rd Qu.:1  
##                  Max.   :1  
## 
## [[2]]
##   accept_rate        Outstate         Enroll         Grad.Rate      Private 
##  Min.   :0.1545   Min.   : 5224   Min.   : 137.0   Min.   : 54.00   No : 4  
##  1st Qu.:0.4135   1st Qu.:13850   1st Qu.: 391.0   1st Qu.: 77.00   Yes:65  
##  Median :0.5329   Median :17238   Median : 601.0   Median : 89.00           
##  Mean   :0.5392   Mean   :16225   Mean   : 882.5   Mean   : 84.78           
##  3rd Qu.:0.6988   3rd Qu.:18590   3rd Qu.:1191.0   3rd Qu.: 94.00           
##  Max.   :0.9605   Max.   :20100   Max.   :4893.0   Max.   :100.00           
##       isElite      cluster 
##  Not Elite: 0   Min.   :2  
##  Elite    :69   1st Qu.:2  
##                 Median :2  
##                 Mean   :2  
##                 3rd Qu.:2  
##                 Max.   :2  
## 
## [[3]]
##   accept_rate        Outstate         Enroll       Grad.Rate      Private  
##  Min.   :0.3746   Min.   : 2580   Min.   : 153   Min.   : 10.00   No :208  
##  1st Qu.:0.6423   1st Qu.: 5295   1st Qu.: 694   1st Qu.: 46.00   Yes:  0  
##  Median :0.7458   Median : 6598   Median :1302   Median : 54.50            
##  Mean   :0.7315   Mean   : 6698   Mean   :1615   Mean   : 55.42            
##  3rd Qu.:0.8368   3rd Qu.: 7748   3rd Qu.:2184   3rd Qu.: 65.00            
##  Max.   :1.0000   Max.   :15516   Max.   :6392   Max.   :100.00            
##       isElite       cluster 
##  Not Elite:199   Min.   :3  
##  Elite    :  9   1st Qu.:3  
##                  Median :3  
##                  Mean   :3  
##                  3rd Qu.:3  
##                  Max.   :3
college_clean$cluster <- as.factor(pam_fit$clustering)
describeBy(college_clean, college_clean$cluster)
## 
##  Descriptive statistics by group 
## group: 1
##             vars   n     mean      sd  median  trimmed     mad     min   max
## name*          1 500      NaN      NA      NA      NaN      NA     Inf  -Inf
## accept_rate    2 500     0.78    0.11     0.8     0.79    0.10    0.33     1
## Outstate       3 500 11199.56 3330.84 10905.0 11033.57 3157.94 2340.00 21700
## Enroll         4 500   418.57  430.94   308.0   342.51  199.41   35.00  4615
## Grad.Rate      5 500    66.97   16.20    67.5    67.42   17.05   15.00   118
## Private*       6 500     2.00    0.00     2.0     2.00    0.00    2.00     2
## isElite*       7 500     1.00    0.00     1.0     1.00    0.00    1.00     1
## cluster*       8 500     1.00    0.00     1.0     1.00    0.00    1.00     1
##                range  skew kurtosis     se
## name*           -Inf    NA       NA     NA
## accept_rate     0.67 -0.94     1.32   0.01
## Outstate    19360.00  0.45     0.07 148.96
## Enroll       4580.00  4.68    31.74  19.27
## Grad.Rate     103.00 -0.33     0.26   0.72
## Private*        0.00   NaN      NaN   0.00
## isElite*        0.00   NaN      NaN   0.00
## cluster*        0.00   NaN      NaN   0.00
## ------------------------------------------------------------ 
## group: 2
##             vars  n     mean      sd   median  trimmed     mad     min      max
## name*          1 69      NaN      NA       NA      NaN      NA     Inf     -Inf
## accept_rate    2 69     0.54    0.20     0.53     0.54    0.21    0.15     0.96
## Outstate       3 69 16224.51 3210.41 17238.00 16563.04 2345.47 5224.00 20100.00
## Enroll         4 69   882.48  813.12   601.00   740.65  464.05  137.00  4893.00
## Grad.Rate      5 69    84.78   11.98    89.00    85.72   11.86   54.00   100.00
## Private*       6 69     1.94    0.24     2.00     2.00    0.00    1.00     2.00
## isElite*       7 69     2.00    0.00     2.00     2.00    0.00    2.00     2.00
## cluster*       8 69     2.00    0.00     2.00     2.00    0.00    2.00     2.00
##                range  skew kurtosis     se
## name*           -Inf    NA       NA     NA
## accept_rate     0.81  0.11    -0.91   0.02
## Outstate    14876.00 -1.09     0.61 386.49
## Enroll       4756.00  2.41     7.56  97.89
## Grad.Rate      46.00 -0.65    -0.64   1.44
## Private*        1.00 -3.70    11.87   0.03
## isElite*        0.00   NaN      NaN   0.00
## cluster*        0.00   NaN      NaN   0.00
## ------------------------------------------------------------ 
## group: 3
##             vars   n    mean      sd  median trimmed     mad     min   max
## name*          1 208     NaN      NA      NA     NaN      NA     Inf  -Inf
## accept_rate    2 208    0.73    0.14    0.75    0.74    0.14    0.37     1
## Outstate       3 208 6697.75 1980.73 6598.50 6558.74 1801.36 2580.00 15516
## Enroll         4 208 1614.72 1246.02 1302.00 1445.40 1034.11  153.00  6392
## Grad.Rate      5 208   55.42   13.98   54.50   55.17   13.34   10.00   100
## Private*       6 208    1.00    0.00    1.00    1.00    0.00    1.00     1
## isElite*       7 208    1.04    0.20    1.00    1.00    0.00    1.00     2
## cluster*       8 208    3.00    0.00    3.00    3.00    0.00    3.00     3
##                range  skew kurtosis     se
## name*           -Inf    NA       NA     NA
## accept_rate     0.63 -0.35    -0.41   0.01
## Outstate    12936.00  0.90     2.01 137.34
## Enroll       6239.00  1.44     2.29  86.40
## Grad.Rate      90.00  0.17     0.37   0.97
## Private*        0.00   NaN      NaN   0.00
## isElite*        1.00  4.46    17.95   0.01
## cluster*        0.00   NaN      NaN   0.00

Cluster 1 is mainly Private/Not Elite with medium levels of out-of-state tuition and lower enrollment.

Cluster 2, on the other hand, is mainly Private/Elite with lower acceptance rates, high out-of-state tuition, and high graduation rates.

Finally, Cluster 3 is mainly Public/Not Elite with the lowest tuition, the largest enrollment, and the lowest graduation rate.

Hint: medoids serve as exemplars of each cluster:

college_clean[pam_fit$medoids, ]
##                              name accept_rate Outstate Enroll Grad.Rate Private
## 492         Saint Francis College   0.7877629    10880    284        69     Yes
## 38                Barnard College   0.5616987    17926    531        91     Yes
## 234 Grand Valley State University   0.7525653     6108   1561        57      No
##       isElite cluster
## 492 Not Elite       1
## 38      Elite       2
## 234 Not Elite       3

Clustering Mixed Data Types in R (7)

  1. Visualization for the delivery of results:

To visualize many variables in a lower dimensional space, use t-distributed stochastic neighborhood embedding, or t-SNE.

It tries to preserve local structure so as to make clusters visible in 2D or 3D.
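The main tuning parameter of t-SNE is the perplexity, roughly the effective number of neighbours each point takes into account; the code below uses the default of 30. A hedged sketch of how one might compare settings:

# Sketch: lower perplexity emphasizes very local structure, higher values give
# a smoother, more global layout (results are stochastic, hence set.seed).
set.seed(42)
tsne_local  <- Rtsne(gower_dist, is_distance = TRUE, perplexity = 10)
tsne_global <- Rtsne(gower_dist, is_distance = TRUE, perplexity = 50)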

set.seed(42)
tsne_obj <- Rtsne(gower_dist, is_distance = TRUE)

tsne_data <- tsne_obj$Y %>% # $Y is a matrix containing the new representations for the objects
  data.frame() %>%
  setNames(c("X", "Y")) %>%
  mutate(cluster = factor(pam_fit$clustering),
         name = college_clean$name)

ggplot(aes(x = X, y = Y), data = tsne_data) +
  geom_point(aes(color = cluster))

Have you noticed the small group?

It consists of larger, more competitive public schools. The group is not large enough to warrant an additional cluster according to the silhouette width, but these 13 schools certainly have distinct characteristics.

tsne_data %>%
  filter(X > 12 & X < 15,   # cluster coordinates may differ
         Y > -15 & Y < -12) %>% 
  left_join(college_clean, by = "name") %>%
  collect %>%
  .[["name"]]
##  [1] "College of William and Mary"                
##  [2] "Georgia Institute of Technology"            
##  [3] "SUNY at Binghamton"                         
##  [4] "SUNY College at Geneseo"                    
##  [5] "Trenton State College"                      
##  [6] "University of California at Berkeley"       
##  [7] "University of California at Irvine"         
##  [8] "University of Florida"                      
##  [9] "University of Illinois - Urbana"            
## [10] "University of Michigan at Ann Arbor"        
## [11] "University of Minnesota at Morris"          
## [12] "University of North Carolina at Chapel Hill"
## [13] "University of Virginia"

Explore the data: http://colleges.usnews.rankingsandreviews.com/best-colleges/william-and-mary-3705

Let us repeat the analysis with 4 clusters and see whether the results are satisfactory.

Clustering Mixed Data - 2

heatmap(gower_mat, symm = T,
        distfun = function(x) as.dist(x)) 

pam_fit2 <- pam(gower_dist, diss = TRUE, k = 4)

pam_results2 <- college_clean %>%
  dplyr::select(-name) %>%
  mutate(cluster = pam_fit2$clustering) %>%
  group_by(cluster) %>%
  do(the_summary = summary(.))

college_clean[pam_fit2$medoids, ]
##                              name accept_rate Outstate Enroll Grad.Rate Private
## 609      University of Charleston   0.7844575     9500    204        57     Yes
## 363             Merrimack College   0.7778900    12500    514        80     Yes
## 38                Barnard College   0.5616987    17926    531        91     Yes
## 234 Grand Valley State University   0.7525653     6108   1561        57      No
##       isElite cluster
## 609 Not Elite       1
## 363 Not Elite       1
## 38      Elite       2
## 234 Not Elite       3
tsne_obj2 <- Rtsne(gower_dist, is_distance = TRUE)

tsne_data2 <- tsne_obj2$Y %>% # $Y is a matrix containing the new representations for the objects
  data.frame() %>%
  setNames(c("X", "Y")) %>%
  mutate(cluster = factor(pam_fit2$clustering),
         name = college_clean$name)

ggplot(aes(x = X, y = Y), data = tsne_data2) +
  geom_point(aes(color = cluster))

Nope.

Clustering Mixed Data - 2 (2)

Gower distance combined with PAM, visualized with t-SNE, is currently one of the most practical approaches to clustering mixed data. In this example, however, the 4-cluster solution is less satisfactory than the 3-cluster one.

The smallest cluster as represented by t-SNE can be re-coded manually.

levels(tsne_data$cluster) <- c("1", "2", "3", "4")
tsne_data$cluster[tsne_data$X >  12 & tsne_data$X < 15 &    tsne_data$Y > -15 & tsne_data$Y < -12] <- 4
ggplot(aes(x = X, y = Y), data = tsne_data) +
  geom_point(aes(color = cluster))

describeBy(college_clean, tsne_data$cluster)
## 
##  Descriptive statistics by group 
## group: 1
##             vars   n     mean      sd  median  trimmed     mad     min   max
## name*          1 500      NaN      NA      NA      NaN      NA     Inf  -Inf
## accept_rate    2 500     0.78    0.11     0.8     0.79    0.10    0.33     1
## Outstate       3 500 11199.56 3330.84 10905.0 11033.57 3157.94 2340.00 21700
## Enroll         4 500   418.57  430.94   308.0   342.51  199.41   35.00  4615
## Grad.Rate      5 500    66.97   16.20    67.5    67.42   17.05   15.00   118
## Private*       6 500     2.00    0.00     2.0     2.00    0.00    2.00     2
## isElite*       7 500     1.00    0.00     1.0     1.00    0.00    1.00     1
## cluster*       8 500     1.00    0.00     1.0     1.00    0.00    1.00     1
##                range  skew kurtosis     se
## name*           -Inf    NA       NA     NA
## accept_rate     0.67 -0.94     1.32   0.01
## Outstate    19360.00  0.45     0.07 148.96
## Enroll       4580.00  4.68    31.74  19.27
## Grad.Rate     103.00 -0.33     0.26   0.72
## Private*        0.00   NaN      NaN   0.00
## isElite*        0.00   NaN      NaN   0.00
## cluster*        0.00   NaN      NaN   0.00
## ------------------------------------------------------------ 
## group: 2
##             vars  n     mean      sd   median  trimmed     mad     min      max
## name*          1 65      NaN      NA       NA      NaN      NA     Inf     -Inf
## accept_rate    2 65     0.54    0.20     0.53     0.54    0.22    0.15     0.96
## Outstate       3 65 16433.52 3163.36 17475.00 16844.92 2157.18 5224.00 20100.00
## Enroll         4 65   752.12  544.77   573.00   677.53  424.02  137.00  2505.00
## Grad.Rate      5 65    84.57   12.20    89.00    85.53   11.86   54.00   100.00
## Private*       6 65     2.00    0.00     2.00     2.00    0.00    2.00     2.00
## isElite*       7 65     2.00    0.00     2.00     2.00    0.00    2.00     2.00
## cluster*       8 65     2.00    0.00     2.00     2.00    0.00    2.00     2.00
##                range  skew kurtosis     se
## name*           -Inf    NA       NA     NA
## accept_rate     0.81  0.06    -0.93   0.03
## Outstate    14876.00 -1.28     1.20 392.37
## Enroll       2368.00  1.32     1.36  67.57
## Grad.Rate      46.00 -0.61    -0.74   1.51
## Private*        0.00   NaN      NaN   0.00
## isElite*        0.00   NaN      NaN   0.00
## cluster*        0.00   NaN      NaN   0.00
## ------------------------------------------------------------ 
## group: 3
##             vars   n    mean      sd  median trimmed     mad     min   max
## name*          1 199     NaN      NA      NA     NaN      NA     Inf  -Inf
## accept_rate    2 199    0.74    0.14    0.75    0.74    0.14    0.37     1
## Outstate       3 199 6649.42 1969.22 6597.00 6516.04 1848.80 2580.00 15516
## Enroll         4 199 1577.97 1219.52 1292.00 1412.39 1009.65  153.00  6392
## Grad.Rate      5 199   54.64   13.63   54.00   54.41   13.34   10.00   100
## Private*       6 199    1.00    0.00    1.00    1.00    0.00    1.00     1
## isElite*       7 199    1.00    0.00    1.00    1.00    0.00    1.00     1
## cluster*       8 199    3.00    0.00    3.00    3.00    0.00    3.00     3
##                range  skew kurtosis     se
## name*           -Inf    NA       NA     NA
## accept_rate     0.63 -0.37    -0.34   0.01
## Outstate    12936.00  0.91     2.14 139.59
## Enroll       6239.00  1.47     2.50  86.45
## Grad.Rate      90.00  0.20     0.63   0.97
## Private*        0.00   NaN      NaN   0.00
## isElite*        0.00   NaN      NaN   0.00
## cluster*        0.00   NaN      NaN   0.00
## ------------------------------------------------------------ 
## group: 4
##             vars  n    mean      sd  median trimmed     mad     min      max
## name*          1 13     NaN      NA      NA     NaN      NA     Inf     -Inf
## accept_rate    2 13    0.54    0.14    0.53    0.54    0.17    0.34     0.78
## Outstate       3 13 9323.77 3108.54 8400.00 9098.73 2833.25 5391.00 15732.00
## Enroll         4 13 2603.69 1541.66 2478.00 2505.00 1697.58  588.00  5705.00
## Grad.Rate      5 13   77.46   11.98   80.00   78.27   10.38   51.00    95.00
## Private*       6 13    1.00    0.00    1.00    1.00    0.00    1.00     1.00
## isElite*       7 13    2.00    0.00    2.00    2.00    0.00    2.00     2.00
## cluster*       8 13    2.69    0.48    3.00    2.73    0.00    2.00     3.00
##                range  skew kurtosis     se
## name*           -Inf    NA       NA     NA
## accept_rate     0.44  0.22    -1.51   0.04
## Outstate    10341.00  0.49    -1.08 862.15
## Enroll       5117.00  0.51    -0.90 427.58
## Grad.Rate      44.00 -0.52    -0.42   3.32
## Private*        0.00   NaN      NaN   0.00
## isElite*        0.00   NaN      NaN   0.00
## cluster*        1.00 -0.74    -1.56   0.13

To conclude, there are four clusters among the 777 colleges.

Cluster 1 consists mainly of private, non-elite colleges with medium levels of out-of-state tuition and lower enrollment.

Cluster 2 consists of private, elite colleges with lower acceptance rates and higher out-of-state tuition and graduation rates.

Cluster 3 is mainly public and non-elite, with the lowest tuition and graduation rates but the largest enrollment.

Lastly, Cluster 4 consists of 13 public colleges with rather high enrollment, low acceptance rates, medium tuition, and high graduation rates.

Read more on t-SNE: https://jmonlong.github.io/Hippocamplus/2018/02/13/tsne-and-clustering/

Optional: Fuzzy C-means

Finding cluster solutions with cluster membership probabilities.

library(cluster)
fuzzy_fit <- fanny(iris[ ,3:4], 3, metric = "euclidean", stand = FALSE)
head(fuzzy_fit$membership, 3)
##           [,1]       [,2]       [,3]
## [1,] 0.9717502 0.01658896 0.01166088
## [2,] 0.9717502 0.01658896 0.01166088
## [3,] 0.9460213 0.03157233 0.02240638
fviz_cluster(fuzzy_fit, 
             ellipse.type = "norm", 
             repel = TRUE,
             palette = "jco", 
             ggtheme = theme_minimal(),
             legend = "right")
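Each row of the membership matrix sums to 1, and taking the column with the largest value gives the crisp assignment that fanny also stores in fuzzy_fit$clustering; a minimal sketch:

crisp <- apply(fuzzy_fit$membership, 1, which.max) # harden fuzzy memberships
table(crisp, iris$Species)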

Read: https://cran.r-project.org/web/packages/ppclust/vignettes/fcm.html

library(ppclust)
fcm_fit <- fcm(iris[ ,-5], centers = 3)
head(fcm_fit$u, 3)
##     Cluster 1 Cluster 2    Cluster 3
## 1 0.001993004 0.9971282 0.0008788244
## 2 0.014852798 0.9787709 0.0063762576
## 3 0.012332674 0.9821882 0.0054790852
fcm_fit$u[102,]
##  Cluster 1  Cluster 2  Cluster 3 
## 0.75255612 0.03000091 0.21744297
fcm_fit$u[115,]
##  Cluster 1  Cluster 2  Cluster 3 
## 0.60544111 0.04407181 0.35048708
fcm_fit$u[143,]
##  Cluster 1  Cluster 2  Cluster 3 
## 0.75255612 0.03000091 0.21744297
table(cutree(h_avg, 3), fcm_fit$cluster) # the same partition, with permuted cluster labels
##    
##      1  2  3
##   1  0 50  0
##   2 64  0  0
##   3  0  0 36
fviz_cluster(ppclust2(fcm_fit, "kmeans"), data = iris[ ,3:4],
             ellipse.type = "norm", 
             repel = TRUE,
             palette = "jco", 
             ggtheme = theme_minimal(),
             legend = "right")

Optional: DBSCAN

About: http://www.sthda.com/english/wiki/wiki.php?id_contents=7940

Data source: https://www.r-bloggers.com/chaining-effect-in-clustering/

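The data preparation for this example is not shown. The code below only assumes that `km` is a k-means pre-clustering with 500 centers fitted on a two-dimensional dataset with columns x and y, as in the linked post; the simulated stand-in here merely illustrates that structure and will not reproduce the cluster counts reported below.

# Hypothetical stand-in for the omitted preparation step (assumption):
set.seed(123)
sim <- data.frame(x = runif(30000, 0, 1000),
                  y = runif(30000, 0, 1000))
km <- kmeans(sim, centers = 500, nstart = 1, iter.max = 100)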

library(factoextra)
centers <- as.data.frame(km$centers) # use the k-means centers as the objects to cluster (two-stage clustering)
d <- dist(centers, method = "euclidean")
hc <- hclust(d, method = "single") # single linkage is prone to the chaining effect
g <- fviz_dend(hc, k = 24, show_labels = FALSE)
g

library(ggrepel)

cluster <- cutree(hc, k = 24) # assign each k-means center to one of 24 clusters

c <- centers %>% bind_cols(as.data.frame(cluster)) %>%
    group_by(cluster) %>%
    summarise(x = first(x), y = first(y)) # one representative point per cluster, used for labelling


g <- centers %>% bind_cols(as.data.frame(cluster)) %>%
    ggplot(aes(x, y, color = as.factor(cluster))) +
    geom_point() +
    geom_label_repel(data = c, aes(x, y, label = cluster), box.padding = 1) +
    theme(legend.position = "none")
library(fpc)
library(dbscan)
kNNdistplot(centers, k = 24) # elbow at ca.150

dbscan(centers, 
       eps = 150, # the radius of neighborhood around a point x
       MinPts = 2) # the minimum number of neighbors within “eps” radius
## DBSCAN clustering for 500 objects.
## Parameters: eps = 150, minPts = 2
## The clustering contains 1 cluster(s) and 0 noise points.
## 
##   1 
## 500 
## 
## Available fields: cluster, eps, minPts

This solution results in only 1 cluster.

Let us vary the parameters and compare the results:

res <- dbscan(centers, eps = 80, MinPts = 2) # 2 clusters
res
## DBSCAN clustering for 500 objects.
## Parameters: eps = 80, minPts = 2
## The clustering contains 2 cluster(s) and 0 noise points.
## 
##   1   2 
## 444  56 
## 
## Available fields: cluster, eps, minPts
plot(centers, col = res$cluster + 1L, pch = res$cluster + 1L)

res2 <- dbscan(centers, eps = 50, MinPts = 2) # 6 clusters
res2
## DBSCAN clustering for 500 objects.
## Parameters: eps = 50, minPts = 2
## The clustering contains 6 cluster(s) and 0 noise points.
## 
##   1   2   3   4   5   6 
## 157  82 197  29   8  27 
## 
## Available fields: cluster, eps, minPts
plot(centers, col = res2$cluster + 1L, pch = res2$cluster + 1L)

res3 <- dbscan(centers, eps = 30, MinPts = 2) # 21 clusters
res3
## DBSCAN clustering for 500 objects.
## Parameters: eps = 30, minPts = 2
## The clustering contains 21 cluster(s) and 0 noise points.
## 
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20 
##   8  64 127   3  11  11 110   8  66  10   7   7  11   4   9   6  14  10   8   4 
##  21 
##   2 
## 
## Available fields: cluster, eps, minPts
plot(centers, col = res3$cluster + 1L, pch = res3$cluster + 1L)

res4 <- dbscan(centers, eps = 20, MinPts = 2) # 24 clusters = the best solution, see plots
res4
## DBSCAN clustering for 500 objects.
## Parameters: eps = 20, minPts = 2
## The clustering contains 24 cluster(s) and 0 noise points.
## 
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20 
##   8  46 127   3  11  11 107   8  11  66   7  10   7   7  11   4   9   6  14  10 
##  21  22  23  24 
##   8   4   3   2 
## 
## Available fields: cluster, eps, minPts
plot(centers, col = res4$cluster + 1L, pch = res4$cluster + 1L)

res5 <- dbscan(centers, eps = 10, MinPts = 2) # 35 small clusters + 398 noise points
res5
## DBSCAN clustering for 500 objects.
## Parameters: eps = 10, minPts = 2
## The clustering contains 35 cluster(s) and 398 noise points.
## 
##   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19 
## 398   3   4   7   2   2   3   3   2   2   3   4   2   2   3   2  10   2   2   2 
##  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35 
##   3   3   3   2   5   4   2   2   3   2   2   3   2   2   2   2 
## 
## Available fields: cluster, eps, minPts
plot(centers, col = res5$cluster + 1L, pch = res5$cluster + 1L)

Compared to k-means or traditional hierarchical clustering, the density-based DBSCAN method is much more efficient here.

h_avg_cut2 <- hcut(centers, k = 24, hc_method = "average")
fviz_cluster(h_avg_cut2, centers, ellipse.type = "convex")

hfit_ward2 <- hcut(centers, k = 24, hc_method = "ward.D2")
fviz_cluster(hfit_ward2, centers, ellipse.type = "convex")

Practice problem sets:

Classify US states by crime types

data(USArrests)
head(USArrests)
##            Murder Assault UrbanPop Rape
## Alabama      13.2     236       58 21.2
## Alaska       10.0     263       48 44.5
## Arizona       8.1     294       80 31.0
## Arkansas      8.8     190       50 19.5
## California    9.0     276       91 40.6
## Colorado      7.9     204       78 38.7

Tasks for practice:

  1. Cluster states using hierarchical clustering

  2. Pick the best k and cluster US states with k-means

  3. Visualise your results showing labels

  4. What are the representative states for each cluster?
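A hedged starting point for the first task: the USArrests variables are on very different scales, so standardize them before computing distances.

df <- scale(USArrests) # standardize: Assault is in the hundreds, Murder around 10
d_states <- dist(df, method = "euclidean")
hc_states <- hclust(d_states, method = "ward.D2")
plot(hc_states, cex = 0.6)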

Next steps: