In preparation for, and in anticipation of, proper socio-economic data, we test-drive the t-SNE algorithm on the IRIS dataset. Questions to love.borjeson at gmail.com
t-Distributed Stochastic Neighbor Embedding (t-SNE) is a machine learning technique for nonlinear dimensionality reduction, often used as a basis for clustering and visualization. The analytical end-goals of t-SNE are similar to those of e.g. Self Organizing Maps, though the methods differ. Starting from a high-dimensional dataset, t-SNE tries to find the most probable neighbors of each datapoint and place them close together in a low-dimensional (i.e. 2 or 3 dimensional) space. Wikipedia should get you pretty far in terms of a basic understanding of the statistical mechanisms and the machine learning steps and iterations.
A lot of attention is given to the plots. It’s not just for show: t-SNEs are decidedly graphical/visual in nature and should be interpreted mainly, albeit not exclusively, by visual inspection.
A peculiarity of t-SNE is that the dimensions of the representation, or cluster sizes and positions for that matter, really don’t have a meaningful interpretation. So put away your scarf, stub out your Gauloise and stop thinking of “dimensions of power” or anything like that. Rather: have a seat, put the kettle on and have the Browns over for tea: this is, after all, about finding clusters of neighbors.
It is very easy to apply t-SNE and it is highly effective. It is, however, very hard to evaluate t-SNE performance on an unknown, high-dimensional material. If parameters are off, you could very well end up with “clusters” that are pure artifacts of the method rather than of structures in the data. Most critical is the “perplexity” setting; more on that below.
Use t-SNE with care and verify the results with some other method (e.g. k-means or Self Organizing Maps).
prescript
#If you, like me, run your code in RStudio, this could save you 1.5 seconds :-):
library(rstudioapi)
setwd(dirname(rstudioapi::callFun("getActiveDocumentContext")$path))
script
library(Rtsne)
library(caret)
library(plotly)
library(GGally)
#A closer look at IRIS
iris_complete <- iris[complete.cases(iris),] #only complete cases... the iris dataset floats around in the sky with diamonds.
iris_unique <- unique(iris_complete) # Remove duplicates.
iris_scale <- as.data.frame(scale(iris_unique[,1:4])) #Scale.
#On IRIS, and perhaps many other datasets, scaling is not always the best way to go.
#In the IRIS dataset, the dimensions with a larger range (and hence larger impact on the model) discriminate better between the species.
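If you want to see what the scaling choice does in this particular case, a quick side-by-side comparison is easy to set up (a sketch; the seed, perplexity and theta simply mirror the settings used further down):
set.seed(44)
m_raw <- Rtsne(iris_unique[,1:4], pca=TRUE, check_duplicates=FALSE, perplexity=20, theta=0.5, dims=2) #unscaled
set.seed(44)
m_scl <- Rtsne(iris_scale, pca=TRUE, check_duplicates=FALSE, perplexity=20, theta=0.5, dims=2) #scaled
par(mfrow=c(1,2), bg = "#fffff8", family = "serif")
plot(m_raw$Y, col=iris_unique$Species, asp=1, main="Unscaled")
plot(m_scl$Y, col=iris_unique$Species, asp=1, main="Scaled")
par(mfrow=c(1,1)) #reset the plotting grid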
knitr::kable(
summary(iris_unique[1:5, 1:5]), caption = 'The Iris dataset.'
)
The Iris dataset.
| Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
|---|---|---|---|---|
| Min. :4.60 | Min. :3.00 | Min. :1.3 | Min. :0.2 | setosa :5 |
| 1st Qu.:4.70 | 1st Qu.:3.10 | 1st Qu.:1.4 | 1st Qu.:0.2 | versicolor:0 |
| Median :4.90 | Median :3.20 | Median :1.4 | Median :0.2 | virginica :0 |
| Mean :4.86 | Mean :3.28 | Mean :1.4 | Mean :0.2 | NA |
| 3rd Qu.:5.00 | 3rd Qu.:3.50 | 3rd Qu.:1.4 | 3rd Qu.:0.2 | NA |
| Max. :5.10 | Max. :3.60 | Max. :1.5 | Max. :0.2 | NA |
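Note that, as written, summary(iris_unique[1:5, 1:5]) only summarizes the first five rows (hence “setosa :5” above). Summarizing the full dataset is a one-line change:
knitr::kable(
summary(iris_unique), caption = 'The Iris dataset, all rows.'
)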
ggpairs(iris_unique, title = "IRIS")
The Iris dataset. Again.
set.seed(44)
m1 <- Rtsne(iris_unique[,1:4], pca=TRUE, check_duplicates=FALSE, perplexity=20, theta=0.5, dims=2) # Run TSNE
#pca=TRUE -> run an initial PCA step before t-SNE.
#perplexity: typically between 5 and 50, but smaller for smaller datasets. Crucial but slippery.
#theta = speed/accuracy trade-off (increase for less accuracy). 0.5 is ok.
# Show the objects in the 2D tsne representation
par(bg = "#fffff8", family = "serif", family = "serif")
plot(m1$Y,col=iris_unique$Species, asp=1)
Representation of the most probable neighbors in a two dimensional space.
Should you really cluster on top of a t-SNE? Given the output of a t-SNE, I would say yes, if you really cannot survive with just visual inspection. HOWEVER: we run standard k-means and HAC here, which kind of works for a material that has low dimensionality to begin with. In most other cases, you would do better with a clustering algorithm that is less Euclidean and more about density, such as DBSCAN.
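For completeness, here is what a density-based alternative could look like (a sketch, assuming the dbscan package is installed; the eps and minPts values are illustrative and would need tuning, e.g. via dbscan::kNNdistplot):
library(dbscan)
fit_db <- dbscan(m1$Y, eps = 2, minPts = 5) #cluster 0 = points flagged as noise
par(bg = "#fffff8", family = "serif")
plot(m1$Y, col = fit_db$cluster + 1, asp = 1) #+1 so noise points get a visible color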
Within groups sum of squares.
## getting the two dimensions into a data.frame
df_m1 <- as.data.frame(m1$Y)
df_m1_withcluster <- df_m1 #A copy, to keep things simple further down: the clustering can then be run on df_m1 without specifying variables.
## Creating k-means clustering model, and assigning the result to the data used to create the tsne
set.seed(44)
wss <- (nrow(df_m1)-1)*sum(apply(df_m1,2,var)) #within group sum of squares
for (i in 1:10) wss[i] <- sum(kmeans(df_m1,
  centers=i)$withinss) #i = candidate number of clusters, here 1 to 10.
par(mar=c(5.1,4.1,4.1,2.1), bg = "#fffff8", family = "serif") #par sets or adjusts plotting parameters. Not always necessary. "mar" = margin.
plot(1:10, wss, type="b", xlab="Number of Clusters", #match i
ylab="Within groups sum of squares", main="Within cluster sum of squares (WCSS)")
Within groups sum of squares.
set.seed(44)
fit_kmeans <- kmeans(df_m1, 3)
df_m1_withcluster$cl_kmeans <- as.factor(fit_kmeans$cluster)
wss <- (nrow(df_m1)-1)*sum(apply(iris_scale,2,var)) #within group sum of squares
for (i in 1:10) wss[i] <- sum(kmeans(iris_scale,
  centers=i)$withinss) #i = candidate number of clusters, here 1 to 10.
par(mar=c(5.1,4.1,4.1,2.1), bg = "#fffff8", family = "serif") #par sets or adjusts plotting parameters. Not always necessary. "mar" = margin.
plot(1:10, wss, type="b", xlab="Number of Clusters", #match i
ylab="Within groups sum of squares", main="Within cluster sum of squares (WCSS)")
set.seed(44)
fit_kmeansdirect <- kmeans(iris_scale, 3) #directly on iris
df_m1_withcluster$cl_kmeansdirect <- as.factor(fit_kmeansdirect$cluster)
Agglomerative clusters.
#Creating hierarchical cluster model, and assigning the result to the data used to create the tsne
fit_hierarchical <- hclust(dist(df_m1, method = "euclidean"), method = "ward.D2")
fit_hierarchicaldirect <- hclust(dist(iris_scale, method = "euclidean"), method = "ward.D2") #directly on iris
par(bg = "#fffff8", family = "serif")
plot(fit_hierarchical)
Agglomerative clusters.
par(bg = "#fffff8", family = "serif")
plot(fit_hierarchicaldirect)
#Setting 3 clusters as output
df_m1_withcluster$cl_hierarchical <- as.factor(cutree(fit_hierarchical, k=3))
df_m1_withcluster$cl_hierarchicaldirect <- as.factor(cutree(fit_hierarchicaldirect, k=3))
plot_cluster=function(data, var_cluster, palette)
{
ggplot(data, aes_string(x="V1", y="V2", color=var_cluster)) +
geom_point(size=1) +
guides(colour=guide_legend(override.aes=list(size=3))) +
xlab("") + ylab("") +
ggtitle("") +
theme_light(base_size=20) +
theme(text = element_text(size=10, family="serif")) +
theme(axis.text.x=element_blank(),
axis.text.y=element_blank(),
legend.direction = "horizontal",
legend.position = "bottom",
legend.box = "horizontal") +
scale_colour_brewer(palette = palette)
}
df_m1_withcluster$true_species <- as.factor(iris_unique$Species) #add true species to the cluster dataset
plot_k <- plot_cluster(df_m1_withcluster, "cl_kmeans", "Set1")
plot_kc <- plot_cluster(df_m1_withcluster, "cl_kmeansdirect", "Set1")
plot_h <- plot_cluster(df_m1_withcluster, "cl_hierarchical", "Set1")
plot_hc <- plot_cluster(df_m1_withcluster, "cl_hierarchicaldirect", "Set1")
plot_s <- plot_cluster(df_m1_withcluster, "true_species", "Set1")
library(gridExtra)
grid.arrange(plot_k, plot_kc, plot_s, ncol=3)
Comparing t-SNE/clusters to true species.
grid.arrange(plot_h, plot_hc, plot_s, ncol=3)
Comparing t-SNE/clusters to true species.
Apparently, the ordering of the cluster labels is different for k-means, but that does not mean that the clustering is off: the labels are arbitrary. There are several ways to solve that, and the results are pretty clear already. Nonetheless:
df_m1_withcluster$cl_kmeans <- ifelse(df_m1_withcluster$cl_kmeans == "2", "4", df_m1_withcluster$cl_kmeans) #4 is free, change 2:s into 4:s.
df_m1_withcluster$cl_kmeans <- ifelse(df_m1_withcluster$cl_kmeans == "3", "2", df_m1_withcluster$cl_kmeans) #2 is free, change 3:s into 2:s.
df_m1_withcluster$cl_kmeans <- ifelse(df_m1_withcluster$cl_kmeans == "1", "3", df_m1_withcluster$cl_kmeans) #3 is free, change 1:s into 3:s.
df_m1_withcluster$cl_kmeans <- ifelse(df_m1_withcluster$cl_kmeans == "4", "1", df_m1_withcluster$cl_kmeans) #1 is free, change 4:s into 1:s.
df_m1_withcluster$cl_kmeansdirect <- ifelse(df_m1_withcluster$cl_kmeansdirect == "2", "4", df_m1_withcluster$cl_kmeansdirect) #4 is free, change 2:s into 4:s.
df_m1_withcluster$cl_kmeansdirect <- ifelse(df_m1_withcluster$cl_kmeansdirect == "3", "2", df_m1_withcluster$cl_kmeansdirect) #2 is free, change 3:s into 2:s.
df_m1_withcluster$cl_kmeansdirect <- ifelse(df_m1_withcluster$cl_kmeansdirect == "1", "3", df_m1_withcluster$cl_kmeansdirect) #3 is free, change 1:s into 3:s.
df_m1_withcluster$cl_kmeansdirect <- ifelse(df_m1_withcluster$cl_kmeansdirect == "4", "1", df_m1_withcluster$cl_kmeansdirect) #1 is free, change 4:s into 1:s.
plot_k <- plot_cluster(df_m1_withcluster, "cl_kmeans", "Set1")
plot_kc <- plot_cluster(df_m1_withcluster, "cl_kmeansdirect", "Set1")
grid.arrange(plot_k, plot_kc, plot_s, ncol=3)
Re-ordered plots.
The k-means solution on top of the t-SNE is insanely good.
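If you want numbers to back that impression up, a simple cross-tabulation of the (re-ordered) clusters against the true species does the job:
table(kmeans_on_tsne = df_m1_withcluster$cl_kmeans, species = df_m1_withcluster$true_species)
table(kmeans_direct = df_m1_withcluster$cl_kmeansdirect, species = df_m1_withcluster$true_species)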
The more dimensions we retain, the harder the results are to interpret, while there is no guarantee that the structure of the original data is better preserved.
Agglomerative clusters.
set.seed(44)
m2 <- Rtsne(iris_unique[,1:4], pca=TRUE, check_duplicates=FALSE, perplexity=20, theta=0.5, dims=3) # Run TSNE
df_m2 <- as.data.frame(m2$Y)
## Creating hierarchical cluster model, and assigning the result to the data used to create the tsne
fit_hierarchical3d <- hclust(dist(df_m2, method = "euclidean"), method = "ward.D2")
par(bg = "#fffff8", family = "serif")
plot(fit_hierarchical3d)
df_m2$cluster <- as.factor(cutree(fit_hierarchical3d, k=3))
df_m2$species <- as.factor(df_m1_withcluster$true_species)
df_m2$cluster_species <- paste(df_m2$species,": cluster", df_m2$cluster)
t <- list(
family = "serif",
size = 11,
color = "#852E2E")
p <- plot_ly(type = 'scatter3d', mode = 'markers', colors = "Accent", color = df_m2$cluster_species) %>%
add_trace(
x = df_m2$V1,
y = df_m2$V2,
z = df_m2$V3,
marker = list(
size = 3),
name = df_m2$cluster_species,
text = paste("Species: ", df_m2$species),
showlegend = T
) %>%
layout(
title = "3d t-SNA",
titlefont = list(
size = 10
),
paper_bgcolor = "#fffff8",
font = t,
xaxis = list(
zeroline = F
),
yaxis = list(
hoverformat = '.2f',
zeroline = F
)
)
p
Representation of the most probable neighbors in a three dimensional space.
Looks pretty OK, especially since versicolor comes out at 50 out of 50.
The perplexity setting balances between capturing the local (low perplexity) and the global (high perplexity) structure, usually in the interval between 5 and 50. Will the Kullback-Leibler (KL) divergence give us a hint on an appropriate level of perplexity?
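For reference, the quantity Rtsne reports as itercosts (the “KL-scores” below) is the t-SNE objective, i.e. the Kullback-Leibler divergence between the high-dimensional neighbor probabilities P and their low-dimensional counterparts Q:

$$\mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}$$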
Repeat <- 100; # Number of random starts
per5 <- sapply(1:Repeat, function(u) {set.seed(u);
Rtsne(iris_unique[,1:4], pca=TRUE, check_duplicates=FALSE, perplexity=5, theta=0.5, dims=2)}, simplify = FALSE)
per20 <- sapply(1:Repeat, function(u) {set.seed(u);
Rtsne(iris_unique[,1:4], pca=TRUE, check_duplicates=FALSE, perplexity=20, theta=0.5, dims=2)}, simplify = FALSE)
per40 <- sapply(1:Repeat, function(u) {set.seed(u);
Rtsne(iris_unique[,1:4], pca=TRUE, check_duplicates=FALSE, perplexity=40, theta=0.5, dims=2)}, simplify = FALSE)
costs <- c( sapply(per5, function(u) min(u$itercosts)),
sapply(per20, function(u) min(u$itercosts)),
sapply(per40, function(u) min(u$itercosts)))
perplexities <- c( rep(5,Repeat), rep(20,Repeat), rep(40,Repeat))
par(bg = "#fffff8", family = "serif")
plot(density(costs[perplexities == 5]), xlim= c(0,0.6), ylim=c(0,200), lwd= 2,
main='KL-scores', col='grey') ; grid()
lines(density(costs[perplexities == 20]), col='orange', lwd= 2);
lines(density(costs[perplexities == 40]), col='darkgreen', lwd= 2);
legend('topright', col=c('grey','orange','darkgreen'),
c('perplexity = 5', 'perplexity = 20', 'perplexity = 40'), lwd = 2)
KL-scores per different perplexities on the same dataset.
KL-scores will not help us choose an appropriate perplexity level. KL-scores will, however, let us choose the best model GIVEN a certain perplexity level.
When perplexity is increased, the KL divergence will drop, regardless of the quality of the dimensionality reduction. Higher perplexity is not always better, but a lower KL-score is better for any given level of perplexity.
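Picking the best of the repeated runs for a fixed perplexity then amounts to taking the one with the lowest final KL-score (a sketch, reusing the per20 list created above):
best_idx <- which.min(sapply(per20, function(u) min(u$itercosts)))
m_best <- per20[[best_idx]] #the lowest-KL run at perplexity 20
par(bg = "#fffff8", family = "serif")
plot(m_best$Y, col = iris_unique$Species, asp = 1)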