Anna Shirokanova
10 02 2020
We know the species from the beginning and can therefore compare clustering solutions against the true grouping.
In real life, the ‘species’ is unknown, and the solution has to be tested against reality.
In this dataset, the object features are metric only.
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
Petal.Length and Petal.Width are similar within the same species but vary considerably between different species:
## item group1 vars n mean sd
## Sepal.Length1 1 setosa 1 50 5.006 0.3524897
## Sepal.Length2 2 versicolor 1 50 5.936 0.5161711
## Sepal.Length3 3 virginica 1 50 6.588 0.6358796
## Sepal.Width1 4 setosa 2 50 3.428 0.3790644
## Sepal.Width2 5 versicolor 2 50 2.770 0.3137983
## Sepal.Width3 6 virginica 2 50 2.974 0.3224966
## Petal.Length1 7 setosa 3 50 1.462 0.1736640
## Petal.Length2 8 versicolor 3 50 4.260 0.4699110
## Petal.Length3 9 virginica 3 50 5.552 0.5518947
## Petal.Width1 10 setosa 4 50 0.246 0.1053856
## Petal.Width2 11 versicolor 4 50 1.326 0.1977527
## Petal.Width3 12 virginica 4 50 2.026 0.2746501
## setosa versicolor virginica
## 1.462 4.260 5.552
## setosa versicolor virginica
## 0.246 1.326 2.026
Let us visualize their distribution:
library(ggplot2)
ggplot(iris, aes(Petal.Length, Petal.Width, color = Species)) +
geom_point() +
  theme_classic()

set.seed(20) # to ensure reproducibility (could be omitted; the starting assignment is then random)
irisCluster <- kmeans(iris[ , 3:4], # Petal.Length and Petal.Width
                      3,            # how many groups to locate
                      nstart = 20   # R will try 20 different random starting assignments
                                    # and then select the one with the lowest within-cluster variation.
)

Let’s compare the results of clustering against the pre-set classification:
##
## setosa versicolor virginica
## 1 0 48 4
## 2 0 2 46
## 3 50 0 0
Six objects (2 + 4) have been misidentified.
We can use the ‘elbow method’ to determine the optimal k.
To do this, we run several models with k = [1:10] and compare the within-cluster sum of squares to the cluster centroid in each case.
A low within-cluster sum of squares (withinss) indicates high intra-cluster similarity, i.e. observations lie close to their cluster centroid.
The optimal number of clusters (k) is where the ‘elbow’ occurs, i.e. the point after which adding more clusters no longer decreases the within-cluster sum of squares dramatically.
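The elbow curve can be computed directly with kmeans(); a minimal sketch, assuming the same two petal variables as above:

set.seed(20)
wss <- sapply(1:10, function(k) {
  kmeans(iris[ , 3:4], centers = k, nstart = 20)$tot.withinss # total within-cluster sum of squares
})
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")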
The silhouette width statistic:
"The average silhouette approach measures the quality of a clustering. That is, it determines how well each object lies within its cluster.
A high average silhouette width indicates a good clustering." Source: https://uc-r.github.io/kmeans_clustering
The gap statistic:
"The approach can be applied to any clustering method (i.e. K-means clustering, hierarchical clustering).
The gap statistic compares the total intracluster variation for different values of k with their expected values under null reference distribution of the data (i.e. a distribution with no obvious clustering)." Source: https://uc-r.github.io/kmeans_clustering#gap
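Both statistics can be computed with the factoextra and cluster packages; a sketch (not run in this document), assuming the same petal variables:

library(factoextra)
library(cluster)
fviz_nbclust(iris[ , 3:4], kmeans, method = "silhouette") # average silhouette width for k = 2..10
gap_stat <- clusGap(iris[ , 3:4], FUN = kmeans, nstart = 20,
                    K.max = 10, B = 50)                   # gap statistic with 50 bootstrap reference sets
fviz_gap_stat(gap_stat)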
According to our scree plot (elbow plot), the optimal k is 2 or 3.
To visualize the solution in two dimensions, use:
library(factoextra)
fviz_cluster(irisCluster, data = iris[ , 3:4],
             ellipse.type = "convex",
             palette = "jco",
             repel = TRUE)

Compare this solution to the one based on ALL the variables:
Conclusion: k-means works well when domain knowledge is rich.
If little is known, use the variables that differ most across observations for clustering. The decision on the number of clusters can be based on validation metrics.
In this example, irises can be clustered best by using 2 variables out of 4.
The choice of the metric may have a large impact depending on what data you have.
iris.use <- subset(iris, select = -Species) # select all variables but the "Species"
d <- dist(iris.use, method = 'euclidean') # calculate a distance matrix on the selected variables

See an example here:
Figure. Three clustering methods for the target-looking data.
What distance to choose?
| distance | metric description |
|---|---|
| euclidean | usual distance between two vectors (interval variables) |
| manhattan | absolute (city-block) distance between two vectors |
| canberra | for non-negative values (e.g., counts) |
| minkowski | can be used for both ordinal and quantitative variables |
| gower | the weighted mean of the contributions of each variable (for mixed types) |
| binary | Jaccard distance for dichotomous variables |
Read more here: http://www.sthda.com/english/wiki/clarifying-distance-measures-unsupervised-machine-learning#distances-and-scaling
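As an illustration of how these distances are computed in R (a sketch; note that scaling the variables first often matters for distance-based methods, as the link above discusses):

library(cluster)
x <- scale(iris[ , 1:4])               # standardise the numeric variables
d_euc <- dist(x, method = "euclidean") # Euclidean distance
d_man <- dist(x, method = "manhattan") # Manhattan (absolute) distance
d_gow <- daisy(iris, metric = "gower") # Gower distance also handles the nominal Species column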
Once you have the distance matrix, a plot with distances between observations and a dendrogram can be produced:
In this plot, clusters appear as red squares along the diagonal.
How many clusters can you see here, 2 or 3?
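An ordered distance-matrix heat map of this kind can be drawn, for instance, with factoextra (a sketch using the distance object d created above):

library(factoextra)
fviz_dist(d, show_labels = FALSE) # similar observations form red squares along the diagonal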
library(cluster)
hfit_ward <- agnes(d, method = "ward")         # Ward: merge the pair of clusters that minimally increases total within-cluster variance
hfit_average <- agnes(d, method = "average")   # average linkage: average distance between all pairs of observations in two clusters
hfit_complete <- agnes(d, method = "complete") # complete linkage: maximal distance between observations in two clusters

Read more on methods: https://www.stat.cmu.edu/~cshalizi/350/lectures/08/lecture-08.pdf
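The cross-tabulations below (clusters vs. the true species) can be reproduced along these lines (a sketch, cutting each tree into three clusters as in the code further below):

table(cutree(hfit_ward, k = 3), iris$Species)
table(cutree(hfit_average, k = 3), iris$Species)
table(cutree(hfit_complete, k = 3), iris$Species)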
##
## setosa versicolor virginica
## 1 50 0 0
## 2 0 49 15
## 3 0 1 35
##
## setosa versicolor virginica
## 1 50 0 0
## 2 0 50 14
## 3 0 0 36
##
## setosa versicolor virginica
## 1 50 0 0
## 2 0 23 49
## 3 0 27 1
In our case, the average linkage method (2nd table) seems to work best, as it produces the fewest misclassified observations.
h_avg_cut <- hcut(d, k = 3, hc_method = "average")
fviz_cluster(h_avg_cut, iris.use, ellipse.type = "convex")

Let’s further visualise the result!
cl <- cutree(hfit_average, k = 3) # cut the dendrogram to obtain exactly 3 clusters
library(dplyr)
iris <- mutate(iris, cluster = cl) # add cluster membership as a column to the data frame
head(iris)

## Sepal.Length Sepal.Width Petal.Length Petal.Width Species cluster
## 1 5.1 3.5 1.4 0.2 setosa 1
## 2 4.9 3.0 1.4 0.2 setosa 1
## 3 4.7 3.2 1.3 0.2 setosa 1
## 4 4.6 3.1 1.5 0.2 setosa 1
## 5 5.0 3.6 1.4 0.2 setosa 1
## 6 5.4 3.9 1.7 0.4 setosa 1
library(magrittr)
iris[ ,-5] %>%
group_by(cluster) %>%
  summarise_all(mean) # get the mean values of all variables by cluster

## # A tibble: 3 x 5
## cluster Sepal.Length Sepal.Width Petal.Length Petal.Width
## <int> <dbl> <dbl> <dbl> <dbl>
## 1 1 5.01 3.43 1.46 0.246
## 2 2 5.93 2.76 4.41 1.44
## 3 3 6.85 3.08 5.79 2.10
Show the result on the dendrogram by coloring the branches (optional)
A lazy option:
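For instance, a one-liner with factoextra (a sketch; the original chunk is not shown, and fviz_dend accepts agnes objects):

library(factoextra)
fviz_dend(hfit_average, k = 3, rect = TRUE, show_labels = FALSE)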
A more elaborate option:
library(dendextend)
dend <- as.dendrogram(hfit_average)
dend_col <- color_branches(dend, k = 3)
plot(dend_col)

This is how the dendrogram was cut to obtain three clusters.
Yet another possibility is to cut at a height level. The height in a dendrogram is interpreted as a combination of method and distance.
Here, the dendrogram results from a matrix of Euclidean distances and average-linkage clustering. Cutting the dendrogram at a height of 1.8 means that the average Euclidean distance between observations in any two merged clusters does not exceed 1.8:
h_avg <- hclust(d, method = "average")
cl2 <- cutree(h_avg, h = 1.8)
dend <- as.dendrogram(h_avg)
dend_col <- color_branches(dend, h = 1.8)
plot(dend_col)

Better dendrograms: http://stackoverflow.com/questions/14118033/horizontal-dendrogram-in-r-with-labels
A bonus below (optional)
The ‘banner plot’ below:
"The white area on the left of the banner plot represents the unclustered data while the white lines that stick into the red are show the heights at which the clusters were formed. Since we don’t want to include too many clusters that joined together at similar heights, it looks like three clusters, at a height of about 2 is a good solution. It’s clear from the banner plot that if we lowered the height to, say 1.5, we’d create a fourth cluster with only a few observations.
The banner plot is just an alternative to the dendogram"
Link on banner plot: https://www.stat.berkeley.edu/~s133/Cluster2a.html
Link on agglomeration coefficient: https://datascience.stackexchange.com/questions/18860/how-to-interpret-agglomerative-coefficient-agnes-function
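A banner plot and the agglomerative coefficient for the average-linkage solution can be obtained from the agnes object directly (a sketch):

plot(hfit_average, which.plots = 1) # banner plot (plot.agnes)
hfit_average$ac                     # agglomerative coefficient; values closer to 1 suggest stronger clustering structure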
Clustering data of mixed types (e.g., continuous, ordinal, and nominal) is often of interest.
Read more: https://www.r-bloggers.com/clustering-mixed-data-types-in-r/
For illustration, we use the College dataset (777 US colleges; available in the ISLR package):
## Private Apps Accept Enroll Top10perc Top25perc
## Abilene Christian University Yes 1660 1232 721 23 52
## Adelphi University Yes 2186 1924 512 16 29
## Adrian College Yes 1428 1097 336 22 50
## Agnes Scott College Yes 417 349 137 60 89
## Alaska Pacific University Yes 193 146 55 16 44
## Albertson College Yes 587 479 158 38 62
## F.Undergrad P.Undergrad Outstate Room.Board Books
## Abilene Christian University 2885 537 7440 3300 450
## Adelphi University 2683 1227 12280 6450 750
## Adrian College 1036 99 11250 3750 400
## Agnes Scott College 510 63 12960 5450 450
## Alaska Pacific University 249 869 7560 4120 800
## Albertson College 678 41 13500 3335 500
## Personal PhD Terminal S.F.Ratio perc.alumni Expend
## Abilene Christian University 2200 70 78 18.1 12 7041
## Adelphi University 1500 29 30 12.2 16 10527
## Adrian College 1165 53 66 12.9 30 8735
## Agnes Scott College 875 92 97 7.7 37 19016
## Alaska Pacific University 1500 76 72 11.9 2 10922
## Albertson College 675 67 73 9.4 11 9727
## Grad.Rate
## Abilene Christian University 60
## Adelphi University 56
## Adrian College 54
## Agnes Scott College 59
## Alaska Pacific University 15
## Albertson College 55
college_clean = College %>%
mutate(name = row.names(.), # make college names a variable
accept_rate = Accept/Apps, # share of applicants who were accepted
isElite = cut(Top10perc, # % of new students from the top 10% of their high-school class
breaks = c(0, 50, 100),
labels = c("Not Elite", "Elite"),
include.lowest = TRUE)) %>%
mutate(isElite = factor(isElite)) %>% # label colleges with many good students
select(name, accept_rate, Outstate, Enroll, # "elites"
Grad.Rate, Private, isElite)
glimpse(college_clean)

## Observations: 777
## Variables: 7
## $ name <chr> "Abilene Christian University", "Adelphi University", "...
## $ accept_rate <dbl> 0.7421687, 0.8801464, 0.7682073, 0.8369305, 0.7564767, ...
## $ Outstate <dbl> 7440, 12280, 11250, 12960, 7560, 13500, 13290, 13868, 1...
## $ Enroll <dbl> 721, 512, 336, 137, 55, 158, 103, 489, 227, 172, 472, 4...
## $ Grad.Rate <dbl> 60, 56, 54, 59, 15, 55, 63, 73, 80, 52, 73, 76, 74, 68,...
## $ Private <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, ...
## $ isElite <fct> Not Elite, Not Elite, Not Elite, Elite, Not Elite, Not ...
Create a distance object (taking out the name of the school in the 1st column)
library(cluster)
gower_dist <- daisy(college_clean[ , -1],
metric = "gower",
type = list(logratio = 3)) # Enroll = log(Enroll)
summary(gower_dist)

## 301476 dissimilarities, summarized :
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00186 0.10344 0.23587 0.23145 0.32714 0.77735
## Metric : mixed ; Types = I, I, I, I, N, N
## Number of objects : 777
Check attributes to ensure the correct methods are being used (I = interval, N = nominal)
The most similar pair of colleges:
gower_mat <- as.matrix(gower_dist) # convert the dissimilarity object into a plain matrix

college_clean[
  which(gower_mat == min(gower_mat[gower_mat != min(gower_mat)]), # the lowest dissimilarity that is not zero (not a self-comparison)
        arr.ind = TRUE)[1, ], ]

## name accept_rate Outstate Enroll Grad.Rate Private
## 682 University of St. Thomas MN 0.8784638 11712 828 89 Yes
## 284 John Carroll University 0.8711276 11700 820 89 Yes
## isElite
## 682 Not Elite
## 284 Not Elite
The most dissimilar pair of colleges:
college_clean[
  which(gower_mat == max(gower_mat[gower_mat != max(gower_mat)]),
        arr.ind = TRUE)[1, ], ] # arr.ind = TRUE makes which() return array (row, column) indices

## name accept_rate Outstate Enroll
## 673 University of Sci. and Arts of Oklahoma 0.9824561 3687 208
## 251 Harvard University 0.1561486 18485 1606
## Grad.Rate Private isElite
## 673 43 No Not Elite
## 251 100 Yes Elite
These pairs make sense.
We will use partitioning around medoids (PAM) for handling a custom distance matrix.
Q: How does PAM work?
A: PAM is an iterative algorithm. A ‘medoid’ is the observation in a cluster whose average dissimilarity to all other observations in that cluster is minimal, so it serves as the cluster’s exemplar. PAM works well with up to roughly 10,000 observations, since it requires the full dissimilarity matrix.
Q: How many clusters should be retained in the solution?
A: Look at the silhouette width metric. This is an internal validation metric, an aggregated measure of how similar an observation is to its own cluster, compared to its closest neighboring cluster. The metric can range from -1 to 1, where higher values are better.
sil_width <- c(NA)
for(i in 2:10){
pam_fit <- pam(gower_dist,
diss = TRUE,
k = i)
sil_width[i] <- pam_fit$silinfo$avg.width
}

Plot the silhouette width (a larger value is better):
plot(1:10, sil_width,
xlab = "Number of clusters", xaxt='n',
ylab = "Silhouette Width",
ylim = c(0,1))
axis(1, at = seq(2, 10, by = 1), las=2)
lines(1:10, sil_width)

Conclusion: select the 3-cluster solution, as it has the highest average silhouette width.
pam_fit <- pam(gower_dist, diss = TRUE, k = 3) # refit PAM with the chosen k = 3
pam_results <- college_clean %>%
  dplyr::select(-name) %>%
  mutate(cluster = pam_fit$clustering) %>%
group_by(cluster) %>%
do(the_summary = summary(.))
pam_results$the_summary

## [[1]]
## accept_rate Outstate Enroll Grad.Rate Private
## Min. :0.3283 Min. : 2340 Min. : 35.0 Min. : 15.00 No : 0
## 1st Qu.:0.7225 1st Qu.: 8842 1st Qu.: 194.8 1st Qu.: 56.00 Yes:500
## Median :0.8004 Median :10905 Median : 308.0 Median : 67.50
## Mean :0.7820 Mean :11200 Mean : 418.6 Mean : 66.97
## 3rd Qu.:0.8581 3rd Qu.:13240 3rd Qu.: 484.8 3rd Qu.: 78.25
## Max. :1.0000 Max. :21700 Max. :4615.0 Max. :118.00
## isElite cluster
## Not Elite:500 Min. :1
## Elite : 0 1st Qu.:1
## Median :1
## Mean :1
## 3rd Qu.:1
## Max. :1
##
## [[2]]
## accept_rate Outstate Enroll Grad.Rate Private
## Min. :0.1545 Min. : 5224 Min. : 137.0 Min. : 54.00 No : 4
## 1st Qu.:0.4135 1st Qu.:13850 1st Qu.: 391.0 1st Qu.: 77.00 Yes:65
## Median :0.5329 Median :17238 Median : 601.0 Median : 89.00
## Mean :0.5392 Mean :16225 Mean : 882.5 Mean : 84.78
## 3rd Qu.:0.6988 3rd Qu.:18590 3rd Qu.:1191.0 3rd Qu.: 94.00
## Max. :0.9605 Max. :20100 Max. :4893.0 Max. :100.00
## isElite cluster
## Not Elite: 0 Min. :2
## Elite :69 1st Qu.:2
## Median :2
## Mean :2
## 3rd Qu.:2
## Max. :2
##
## [[3]]
## accept_rate Outstate Enroll Grad.Rate Private
## Min. :0.3746 Min. : 2580 Min. : 153 Min. : 10.00 No :208
## 1st Qu.:0.6423 1st Qu.: 5295 1st Qu.: 694 1st Qu.: 46.00 Yes: 0
## Median :0.7458 Median : 6598 Median :1302 Median : 54.50
## Mean :0.7315 Mean : 6698 Mean :1615 Mean : 55.42
## 3rd Qu.:0.8368 3rd Qu.: 7748 3rd Qu.:2184 3rd Qu.: 65.00
## Max. :1.0000 Max. :15516 Max. :6392 Max. :100.00
## isElite cluster
## Not Elite:199 Min. :3
## Elite : 9 1st Qu.:3
## Median :3
## Mean :3
## 3rd Qu.:3
## Max. :3
college_clean$cluster <- as.factor(pam_fit$clustering)
library(psych) # for describeBy()
describeBy(college_clean, college_clean$cluster)

##
## Descriptive statistics by group
## group: 1
## vars n mean sd median trimmed mad min max
## name* 1 500 NaN NA NA NaN NA Inf -Inf
## accept_rate 2 500 0.78 0.11 0.8 0.79 0.10 0.33 1
## Outstate 3 500 11199.56 3330.84 10905.0 11033.57 3157.94 2340.00 21700
## Enroll 4 500 418.57 430.94 308.0 342.51 199.41 35.00 4615
## Grad.Rate 5 500 66.97 16.20 67.5 67.42 17.05 15.00 118
## Private* 6 500 2.00 0.00 2.0 2.00 0.00 2.00 2
## isElite* 7 500 1.00 0.00 1.0 1.00 0.00 1.00 1
## cluster* 8 500 1.00 0.00 1.0 1.00 0.00 1.00 1
## range skew kurtosis se
## name* -Inf NA NA NA
## accept_rate 0.67 -0.94 1.32 0.01
## Outstate 19360.00 0.45 0.07 148.96
## Enroll 4580.00 4.68 31.74 19.27
## Grad.Rate 103.00 -0.33 0.26 0.72
## Private* 0.00 NaN NaN 0.00
## isElite* 0.00 NaN NaN 0.00
## cluster* 0.00 NaN NaN 0.00
## ------------------------------------------------------------
## group: 2
## vars n mean sd median trimmed mad min max
## name* 1 69 NaN NA NA NaN NA Inf -Inf
## accept_rate 2 69 0.54 0.20 0.53 0.54 0.21 0.15 0.96
## Outstate 3 69 16224.51 3210.41 17238.00 16563.04 2345.47 5224.00 20100.00
## Enroll 4 69 882.48 813.12 601.00 740.65 464.05 137.00 4893.00
## Grad.Rate 5 69 84.78 11.98 89.00 85.72 11.86 54.00 100.00
## Private* 6 69 1.94 0.24 2.00 2.00 0.00 1.00 2.00
## isElite* 7 69 2.00 0.00 2.00 2.00 0.00 2.00 2.00
## cluster* 8 69 2.00 0.00 2.00 2.00 0.00 2.00 2.00
## range skew kurtosis se
## name* -Inf NA NA NA
## accept_rate 0.81 0.11 -0.91 0.02
## Outstate 14876.00 -1.09 0.61 386.49
## Enroll 4756.00 2.41 7.56 97.89
## Grad.Rate 46.00 -0.65 -0.64 1.44
## Private* 1.00 -3.70 11.87 0.03
## isElite* 0.00 NaN NaN 0.00
## cluster* 0.00 NaN NaN 0.00
## ------------------------------------------------------------
## group: 3
## vars n mean sd median trimmed mad min max
## name* 1 208 NaN NA NA NaN NA Inf -Inf
## accept_rate 2 208 0.73 0.14 0.75 0.74 0.14 0.37 1
## Outstate 3 208 6697.75 1980.73 6598.50 6558.74 1801.36 2580.00 15516
## Enroll 4 208 1614.72 1246.02 1302.00 1445.40 1034.11 153.00 6392
## Grad.Rate 5 208 55.42 13.98 54.50 55.17 13.34 10.00 100
## Private* 6 208 1.00 0.00 1.00 1.00 0.00 1.00 1
## isElite* 7 208 1.04 0.20 1.00 1.00 0.00 1.00 2
## cluster* 8 208 3.00 0.00 3.00 3.00 0.00 3.00 3
## range skew kurtosis se
## name* -Inf NA NA NA
## accept_rate 0.63 -0.35 -0.41 0.01
## Outstate 12936.00 0.90 2.01 137.34
## Enroll 6239.00 1.44 2.29 86.40
## Grad.Rate 90.00 0.17 0.37 0.97
## Private* 0.00 NaN NaN 0.00
## isElite* 1.00 4.46 17.95 0.01
## cluster* 0.00 NaN NaN 0.00
Cluster 1 is mainly Private/Not Elite, with medium levels of out-of-state tuition and smaller enrollment.
Cluster 2, on the other hand, is mainly Private/Elite, with lower acceptance rates, high out-of-state tuition, and high graduation rates.
Finally, Cluster 3 is mainly Public/Not Elite, with the lowest tuition, the largest enrollment, and the lowest graduation rate.
Hint: medoids serve as exemplars of each cluster:
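The medoid table below was presumably produced with a call analogous to the one used later for pam_fit2:

college_clean[pam_fit$medoids, ]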
## name accept_rate Outstate Enroll Grad.Rate Private
## 492 Saint Francis College 0.7877629 10880 284 69 Yes
## 38 Barnard College 0.5616987 17926 531 91 Yes
## 234 Grand Valley State University 0.7525653 6108 1561 57 No
## isElite cluster
## 492 Not Elite 1
## 38 Elite 2
## 234 Not Elite 3
To visualize many variables in a lower-dimensional space, use t-distributed stochastic neighbor embedding, or t-SNE.
It tries to preserve local structure so as to make clusters visible in 2D or 3D.
library(Rtsne)
set.seed(42)
tsne_obj <- Rtsne(gower_dist, is_distance = TRUE)
tsne_data <- tsne_obj$Y %>% # $Y is a matrix containing the new representations for the objects
data.frame() %>%
setNames(c("X", "Y")) %>%
mutate(cluster = factor(pam_fit$clustering),
name = college_clean$name)
ggplot(aes(x = X, y = Y), data = tsne_data) +
  geom_point(aes(color = cluster))

Have you noticed the small group?
It consists of larger, more competitive public schools. The group is not big enough to warrant a separate cluster according to the silhouette width, but these 13 schools clearly have distinct characteristics.
tsne_data %>%
filter(X > 12 & X < 15, # cluster coordinates may differ
Y > -15 & Y < -12) %>%
left_join(college_clean, by = "name") %>%
collect %>%
.[["name"]]## [1] "College of William and Mary"
## [2] "Georgia Institute of Technology"
## [3] "SUNY at Binghamton"
## [4] "SUNY College at Geneseo"
## [5] "Trenton State College"
## [6] "University of California at Berkeley"
## [7] "University of California at Irvine"
## [8] "University of Florida"
## [9] "University of Illinois - Urbana"
## [10] "University of Michigan at Ann Arbor"
## [11] "University of Minnesota at Morris"
## [12] "University of North Carolina at Chapel Hill"
## [13] "University of Virginia"
Explore the data: http://colleges.usnews.rankingsandreviews.com/best-colleges/william-and-mary-3705
Let us repeat the analysis with 4 clusters and see whether the results are satisfactory.
pam_fit2 <- pam(gower_dist, diss = TRUE, k = 4)
pam_results2 <- college_clean %>%
dplyr::select(-name) %>%
mutate(cluster = pam_fit2$clustering) %>%
group_by(cluster) %>%
do(the_summary = summary(.))
college_clean[pam_fit2$medoids, ]

## name accept_rate Outstate Enroll Grad.Rate Private
## 609 University of Charleston 0.7844575 9500 204 57 Yes
## 363 Merrimack College 0.7778900 12500 514 80 Yes
## 38 Barnard College 0.5616987 17926 531 91 Yes
## 234 Grand Valley State University 0.7525653 6108 1561 57 No
## isElite cluster
## 609 Not Elite 1
## 363 Not Elite 1
## 38 Elite 2
## 234 Not Elite 3
tsne_obj2 <- Rtsne(gower_dist, is_distance = TRUE)
tsne_data2 <- tsne_obj2$Y %>% # $Y is a matrix containing the new representations for the objects
data.frame() %>%
setNames(c("X", "Y")) %>%
mutate(cluster = factor(pam_fit2$clustering),
name = college_clean$name)
ggplot(aes(x = X, y = Y), data = tsne_data2) +
  geom_point(aes(color = cluster))

Nope.
Gower distance + t-SNE is the best instrument for clustering mixed data at the moment. However, a 4-cluster solution is less satisfactory than a 3-cluster one.
The smallest cluster as represented by t-SNE can be re-coded manually.
levels(tsne_data$cluster) <- c("1", "2", "3", "4")
tsne_data$cluster[tsne_data$X > 12 & tsne_data$X < 15 & tsne_data$Y > -15 & tsne_data$Y < -12] <- 4
ggplot(aes(x = X, y = Y), data = tsne_data) +
  geom_point(aes(color = cluster))

##
## Descriptive statistics by group
## group: 1
## vars n mean sd median trimmed mad min max
## name* 1 500 NaN NA NA NaN NA Inf -Inf
## accept_rate 2 500 0.78 0.11 0.8 0.79 0.10 0.33 1
## Outstate 3 500 11199.56 3330.84 10905.0 11033.57 3157.94 2340.00 21700
## Enroll 4 500 418.57 430.94 308.0 342.51 199.41 35.00 4615
## Grad.Rate 5 500 66.97 16.20 67.5 67.42 17.05 15.00 118
## Private* 6 500 2.00 0.00 2.0 2.00 0.00 2.00 2
## isElite* 7 500 1.00 0.00 1.0 1.00 0.00 1.00 1
## cluster* 8 500 1.00 0.00 1.0 1.00 0.00 1.00 1
## range skew kurtosis se
## name* -Inf NA NA NA
## accept_rate 0.67 -0.94 1.32 0.01
## Outstate 19360.00 0.45 0.07 148.96
## Enroll 4580.00 4.68 31.74 19.27
## Grad.Rate 103.00 -0.33 0.26 0.72
## Private* 0.00 NaN NaN 0.00
## isElite* 0.00 NaN NaN 0.00
## cluster* 0.00 NaN NaN 0.00
## ------------------------------------------------------------
## group: 2
## vars n mean sd median trimmed mad min max
## name* 1 65 NaN NA NA NaN NA Inf -Inf
## accept_rate 2 65 0.54 0.20 0.53 0.54 0.22 0.15 0.96
## Outstate 3 65 16433.52 3163.36 17475.00 16844.92 2157.18 5224.00 20100.00
## Enroll 4 65 752.12 544.77 573.00 677.53 424.02 137.00 2505.00
## Grad.Rate 5 65 84.57 12.20 89.00 85.53 11.86 54.00 100.00
## Private* 6 65 2.00 0.00 2.00 2.00 0.00 2.00 2.00
## isElite* 7 65 2.00 0.00 2.00 2.00 0.00 2.00 2.00
## cluster* 8 65 2.00 0.00 2.00 2.00 0.00 2.00 2.00
## range skew kurtosis se
## name* -Inf NA NA NA
## accept_rate 0.81 0.06 -0.93 0.03
## Outstate 14876.00 -1.28 1.20 392.37
## Enroll 2368.00 1.32 1.36 67.57
## Grad.Rate 46.00 -0.61 -0.74 1.51
## Private* 0.00 NaN NaN 0.00
## isElite* 0.00 NaN NaN 0.00
## cluster* 0.00 NaN NaN 0.00
## ------------------------------------------------------------
## group: 3
## vars n mean sd median trimmed mad min max
## name* 1 199 NaN NA NA NaN NA Inf -Inf
## accept_rate 2 199 0.74 0.14 0.75 0.74 0.14 0.37 1
## Outstate 3 199 6649.42 1969.22 6597.00 6516.04 1848.80 2580.00 15516
## Enroll 4 199 1577.97 1219.52 1292.00 1412.39 1009.65 153.00 6392
## Grad.Rate 5 199 54.64 13.63 54.00 54.41 13.34 10.00 100
## Private* 6 199 1.00 0.00 1.00 1.00 0.00 1.00 1
## isElite* 7 199 1.00 0.00 1.00 1.00 0.00 1.00 1
## cluster* 8 199 3.00 0.00 3.00 3.00 0.00 3.00 3
## range skew kurtosis se
## name* -Inf NA NA NA
## accept_rate 0.63 -0.37 -0.34 0.01
## Outstate 12936.00 0.91 2.14 139.59
## Enroll 6239.00 1.47 2.50 86.45
## Grad.Rate 90.00 0.20 0.63 0.97
## Private* 0.00 NaN NaN 0.00
## isElite* 0.00 NaN NaN 0.00
## cluster* 0.00 NaN NaN 0.00
## ------------------------------------------------------------
## group: 4
## vars n mean sd median trimmed mad min max
## name* 1 13 NaN NA NA NaN NA Inf -Inf
## accept_rate 2 13 0.54 0.14 0.53 0.54 0.17 0.34 0.78
## Outstate 3 13 9323.77 3108.54 8400.00 9098.73 2833.25 5391.00 15732.00
## Enroll 4 13 2603.69 1541.66 2478.00 2505.00 1697.58 588.00 5705.00
## Grad.Rate 5 13 77.46 11.98 80.00 78.27 10.38 51.00 95.00
## Private* 6 13 1.00 0.00 1.00 1.00 0.00 1.00 1.00
## isElite* 7 13 2.00 0.00 2.00 2.00 0.00 2.00 2.00
## cluster* 8 13 2.69 0.48 3.00 2.73 0.00 2.00 3.00
## range skew kurtosis se
## name* -Inf NA NA NA
## accept_rate 0.44 0.22 -1.51 0.04
## Outstate 10341.00 0.49 -1.08 862.15
## Enroll 5117.00 0.51 -0.90 427.58
## Grad.Rate 44.00 -0.52 -0.42 3.32
## Private* 0.00 NaN NaN 0.00
## isElite* 0.00 NaN NaN 0.00
## cluster* 1.00 -0.74 -1.56 0.13
To conclude, there are four clusters among the 777 colleges.
Cluster 1 consists mainly of private, non-elite colleges with medium levels of out-of-state tuition and smaller enrollment.
Cluster 2 consists of private, elite colleges with lower acceptance rates and higher out-of-state tuition and graduation rates.
Cluster 3 is mainly public and non-elite, with the lowest tuition and graduation rates but the largest enrollment.
Lastly, Cluster 4 consists of 13 public colleges with rather high enrollment, low acceptance rates, medium tuition, and high graduation rates.
Read more on t-SNE: https://jmonlong.github.io/Hippocamplus/2018/02/13/tsne-and-clustering/
Fuzzy clustering: finding cluster solutions with cluster membership probabilities.
library(cluster)
fuzzy_fit <- fanny(iris[ ,3:4], 3, metric = "euclidean", stand = FALSE)
head(fuzzy_fit$membership, 3)

## [,1] [,2] [,3]
## [1,] 0.9717502 0.01658896 0.01166088
## [2,] 0.9717502 0.01658896 0.01166088
## [3,] 0.9460213 0.03157233 0.02240638
fviz_cluster(fuzzy_fit,
ellipse.type = "norm",
repel = TRUE,
palette = "jco",
ggtheme = theme_minimal(),
legend = "right")Read: https://cran.r-project.org/web/packages/ppclust/vignettes/fcm.html
## Cluster 1 Cluster 2 Cluster 3
## 1 0.001993004 0.9971282 0.0008788244
## 2 0.014852798 0.9787709 0.0063762576
## 3 0.012332674 0.9821882 0.0054790852
## Cluster 1 Cluster 2 Cluster 3
## 0.75255612 0.03000091 0.21744297
## Cluster 1 Cluster 2 Cluster 3
## 0.60544111 0.04407181 0.35048708
## Cluster 1 Cluster 2 Cluster 3
## 0.75255612 0.03000091 0.21744297
##
## 1 2 3
## 1 0 50 0
## 2 64 0 0
## 3 0 0 36
fviz_cluster(ppclust2(fcm_fit, "kmeans"), data = iris[ ,3:4],
ellipse.type = "norm",
repel = TRUE,
palette = "jco",
ggtheme = theme_minimal(),
legend = "right")About: http://www.sthda.com/english/wiki/wiki.php?id_contents=7940
Data source: https://www.r-bloggers.com/chaining-effect-in-clustering/
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 3018750)
library(factoextra)
# 'km' is assumed to come from a k-means fit with many centres on the chaining-effect data
# from the source above (the fitting chunk is not shown here).
centers <- as.data.frame(km$centers)
d <- dist(centers, method = "euclidean")
hc <- hclust(d, method = "single")
g <- fviz_dend(hc, k = 24, show_labels = FALSE)
g

library(ggrepel)
cluster <- cutree(hc, k = 24)
c <- centers %>% bind_cols(as.data.frame(cluster)) %>%
group_by(cluster) %>%
summarise(x=first(x), y=first(y))
g <- centers %>% bind_cols(as.data.frame(cluster)) %>%
ggplot(aes(x, y, color=as.factor(cluster))) +
geom_point() +
geom_label_repel(data=c, aes(x, y, label=cluster), box.padding = 1) +
  theme(legend.position="none")

dbscan(centers, # dbscan() comes from the dbscan (or fpc) package
eps = 150, # the radius of neighborhood around a point x
       MinPts = 2) # the minimum number of neighbors within the “eps” radius

## DBSCAN clustering for 500 objects.
## Parameters: eps = 150, minPts = 2
## The clustering contains 1 cluster(s) and 0 noise points.
##
## 1
## 500
##
## Available fields: cluster, eps, minPts
This solution results in only 1 cluster.
Let us try out varying the parameters and comparing the results:
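A common heuristic for choosing eps is the k-nearest-neighbour distance plot; a sketch, assuming the dbscan package (the ‘knee’ of the curve suggests a candidate eps):

library(dbscan)
kNNdistplot(centers, k = 2) # distance to the 2nd nearest neighbour, sorted
abline(h = 20, lty = 2)     # an illustrative candidate eps near the knee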
## DBSCAN clustering for 500 objects.
## Parameters: eps = 80, minPts = 2
## The clustering contains 2 cluster(s) and 0 noise points.
##
## 1 2
## 444 56
##
## Available fields: cluster, eps, minPts
## DBSCAN clustering for 500 objects.
## Parameters: eps = 50, minPts = 2
## The clustering contains 6 cluster(s) and 0 noise points.
##
## 1 2 3 4 5 6
## 157 82 197 29 8 27
##
## Available fields: cluster, eps, minPts
## DBSCAN clustering for 500 objects.
## Parameters: eps = 30, minPts = 2
## The clustering contains 21 cluster(s) and 0 noise points.
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## 8 64 127 3 11 11 110 8 66 10 7 7 11 4 9 6 14 10 8 4
## 21
## 2
##
## Available fields: cluster, eps, minPts
res4 <- dbscan(centers, eps = 20, MinPts = 2) # yields 24 clusters; the best solution here, see the plots
res4

## DBSCAN clustering for 500 objects.
## Parameters: eps = 20, minPts = 2
## The clustering contains 24 cluster(s) and 0 noise points.
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## 8 46 127 3 11 11 107 8 11 66 7 10 7 7 11 4 9 6 14 10
## 21 22 23 24
## 8 4 3 2
##
## Available fields: cluster, eps, minPts
## DBSCAN clustering for 500 objects.
## Parameters: eps = 10, minPts = 2
## The clustering contains 35 cluster(s) and 398 noise points.
##
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
## 398 3 4 7 2 2 3 3 2 2 3 4 2 2 3 2 10 2 2 2
## 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35
## 3 3 3 2 5 4 2 2 3 2 2 3 2 2 2 2
##
## Available fields: cluster, eps, minPts
Compared to k-means or traditional hierarchical clustering, the density-based DBSCAN method is much more efficient here.
h_avg_cut2 <- hcut(centers, k = 24, hc_method = "average")
fviz_cluster(h_avg_cut2, centers, ellipse.type = "convex")

hfit_ward2 <- hcut(centers, k = 24, hc_method = "ward.D2")
fviz_cluster(hfit_ward2, centers, ellipse.type = "convex")

Classify US states by crime types
## Murder Assault UrbanPop Rape
## Alabama 13.2 236 58 21.2
## Alaska 10.0 263 48 44.5
## Arizona 8.1 294 80 31.0
## Arkansas 8.8 190 50 19.5
## California 9.0 276 91 40.6
## Colorado 7.9 204 78 38.7
Tasks for practice:
Cluster states using hierarchical clustering
Pick the best k and cluster US states with k-means
Visualise your results showing labels
What are the representative states for each cluster?
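A possible starting point for these tasks (a sketch; the USArrests data printed above is assumed, and k = 4 is only a placeholder for the k you choose):

df <- scale(USArrests)                            # standardise: the variables are on very different scales
d_states <- dist(df, method = "euclidean")
hc_states <- hclust(d_states, method = "ward.D2") # hierarchical clustering of the states
plot(hc_states, cex = 0.6)

library(factoextra)
fviz_nbclust(df, kmeans, method = "silhouette")   # pick the best k

set.seed(123)
km_states <- kmeans(df, centers = 4, nstart = 20) # replace 4 with your chosen k
fviz_cluster(km_states, data = df, repel = TRUE)  # state names are used as labels
# Representative states: e.g., the observations closest to each k-means centre, or use pam() and inspect the medoids.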