t-SNE Help

Load codes and packages

setwd("~/Desktop/course 2020/615/tSNE_Function")
if (!require('Rtsne')) install.packages('Rtsne')
if (!require('Rcpp')) install.packages('Rcpp')
if (!require('animation')) install.packages('animation')
if (!require('stats')) install.packages('stats')
if (!require('NMI')) install.packages('NMI')
library(stats)
library(Rtsne)
library(Rcpp)
library(NMI)
library(animation)
source("my_tsne.R")
source("cpp.R")
source("my_LTSA.R")

Introduction

Here we will introduce the parameter of our function. And later we will display how to use our code via some examples.

t-SNE R code

my_tsne<-function(x,q=2,perplexity = 50, maxit = 1000, tol= 1e-9,record_history = FALSE,pca=TRUE,d=20){}

“x” here is the input raw data matrix, with every row as a record. “q” is the dimension of visualization. “maxit” is the default number of iterations, and “tol” is the tolerence for convergence. “pca” is whether we should use PCA before t-SNE, defalut is TRUE. “d” is the number of dimensions for PCA.

“perplexity” is a hyper parameter about \(\sigma\), controling “the neighborhood number” in visualization. We can “two rings & perplexity” gif (https://gph.is/g/ZdXVGW9) to understand it, where the true data is two rings tangled with each other in 3 dimensional space. A smaller “perplexity” means the algorithm will focus more on the local structure while a bigger “perplexity” will pay more attension to the global structure. Usually it is setted to 5 - 50.

“record_history” means whether we will record the iteration history or not. If “record_history=FALSE”, then the output will be a list with two elements, where the first one is the final visualization result and the second one is the record of loss/objective function. If “record_history=TRUE”, the output result will be every ten steps’ visualization results.

t-SNE Rcpp code

We also develop a Rcpp code to speed up the iteration and computation. Due to time limit, it only outputs the final visualization.

tsne_cpp<-function(x,q=2,perplexity = 50, maxit = 1000,d=20,pca=TRUE){}

LTSA code

When we have prior information that the data is dense / continuous, and there are few outliers, we can use LTSA for visualization. (e.g. for manifold structure data)

LTSA<-function(x,d=2,k=120){}

“x” here is the input raw data matrix, with every row as a record. “d” is the dimension of visualization “k” is the number of k-nearest-neighbor we will use.

The output is a matrix about the visualization result.

Visualization of Gaussian Mixed model

Here is a comparison between different algorithms, we can see that t-SNE has a significantly better performance than the other two.

set.seed(123)

x<-generate_data(300,4)
res1<-Rtsne(x,pca = FALSE)
res2<-tsne_cpp(x,pca = TRUE,maxit = 1000)
res_pca<-prcomp(x)
res3<-res_pca$x[,1:2]
res4<-LTSA(x,d=2,k=120)

par(mfrow=c(2,2))
plot(res1$Y,main = "Rtsne",xlab = "t-SNE 1",ylab = "t-SNE 2")
plot(res2,main = "My t-SNE with Rcpp",xlab = "t-SNE 1",ylab = "t-SNE 2")
plot(res3,main = "PCA",xlab = "PCA 1",ylab = "PCA 2")
plot(res4,main = "LTSA",xlab = "LTSA 1",ylab = "LTSA 2")

Comparison of two t-SNE functions

Let’s do the simulation in the setting of Gaussian Mixed Model, where we can adjust the scale and variance of the clustering data.

The clustering parameters are setted as: there are \(p\) informative variables Y, and \(5\times p\) noisy variables Z. The sample size is setted to 300, and there are 4 clusters for clear visualization.

For sample i in group k, \(Y_{ij}\sim N(\mu_j,\sigma_j^2),Z_{ij} \sim N(0,1)\).

where the \(\mu_j\),\(\sigma_j^2\) are initialized as

\[\mu_j \sim N(0,6 \times scale), \sigma^2_j \sim 1+\chi^2_1 \]

When scale is big, it is easier for the algorithm to detect the difference between different groups; and if scale is small, it is harder.

set.seed(12345)

x<-generate_data(300,4,mean_scale = 0.6)
res<-prcomp(x)
x<-res$x[,1:5]
res1<-Rtsne(x,pca = FALSE)
res2<-tsne_cpp(x,pca = FALSE)

km1<-kmeans(res1$Y,4,iter.max = 50,nstart = 5)
km2<-kmeans(res2,4,iter.max = 50,nstart = 5)
nmi1<-NMI(cbind(1:300,km1$cluster),cbind(1:300,rep(1:4,each =75)))
nmi2<-NMI(cbind(1:300,km2$cluster),cbind(1:300,rep(1:4,each =75)))

par(mfrow=c(2,2))
plot(res1$Y,main = paste("Rtsne, scale = 0.6,NMI =",round(nmi1$value,2)) ,xlab = "t-SNE 1",ylab = "t-SNE 2",col = rep(1:4,each =75) ,pch = km1$cluster)

plot(res2,main = paste("My t-SNE, scale = 0.6,NMI =",round(nmi2$value,2)) ,xlab = "t-SNE 1",ylab = "t-SNE 2",col = rep(1:4,each =75) ,pch = km2$cluster)


x<-generate_data(300,4,mean_scale = 0.4)
res<-prcomp(x)
x<-res$x[,1:5]
res1<-Rtsne(x,pca = FALSE)
res2<-tsne_cpp(x,pca = FALSE)

km1<-kmeans(res1$Y,4,iter.max = 50,nstart = 5)
km2<-kmeans(res2,4,iter.max = 50,nstart = 5)
nmi1<-NMI(cbind(1:300,km1$cluster),cbind(1:300,rep(1:4,each =75)))
nmi2<-NMI(cbind(1:300,km2$cluster),cbind(1:300,rep(1:4,each =75)))

plot(res1$Y,main = paste("Rtsne, scale = 0.4,NMI =",round(nmi1$value,2)) ,xlab = "t-SNE 1",ylab = "t-SNE 2",col = rep(1:4,each =75) ,pch = km1$cluster)

plot(res2,main = paste("My t-SNE, scale = 0.4,NMI =",round(nmi2$value,2)) ,xlab = "t-SNE 1",ylab = "t-SNE 2",col = rep(1:4,each =75) ,pch = km2$cluster)

We use NMI (Normalized Mutual Information) as criteria of accuracy (the closer to 1, the better) based on the visualization result, where colors denote the true group, and point types denote kmeans result. According to the visualization, we can see that two functions have quite close performance, regardless of the scale.

Time comparison between Rtsne and My t-SNE (with Rcpp)

The t-SNE code without Rcpp is quite slow, as we have shown in the report. Here we only compare the Rtsne and our code with Rcpp.

set.seed(123)
t_R <- NULL
t_cpp <- NULL
for (i in 1:5){
  x<-generate_data(300,4)
  t_R <- cbind(t_R,system.time(
    res1<-Rtsne(x,pca = FALSE)
    ))
  t_cpp <- cbind(t_cpp,system.time(
    res2<-tsne_cpp(x,pca = FALSE,maxit = 1000)
    ))
}

C <- list(t_R[1,],t_cpp[1,])
names(C) <- c(paste("R_tsne\n n=" , length(t_R[1,]) , sep=""), 
              paste("My_cpp_tsne\n n=" , length(t_cpp[1,]) , sep=""))
par(mgp=c(3,3,0))
boxplot(C , col="#69b3a2" , ylab="user time (s)" , main = "Gaussian Mixed Model")

par(mfrow=c(1,2))
plot(res1$Y,main = "Rtsne",xlab = "t-SNE 1",ylab = "t-SNE 2")
plot(res2,main = "My code with Rcpp",xlab = "t-SNE 1",ylab = "t-SNE 2")

Dynamic Visualization

For dynamic visualization, please set “record_history = TRUE”, and the result will be the visualization result every 10 iterations. You can use the following code to generate a gif. The generated gif will appear in the working directory.

set.seed(123)
x<-generate_data(300,4)

res<- my_tsne(x,q=2,perplexity = 50,maxit = 1000,record_history = TRUE,pca=TRUE,d = 20)

saveGIF({
  
  for(ii in 1:length(res)){
    plot(res[[ii]],main = paste("iteration",ii*10),xlab = "t-SNE 1", ylab = "t-SNE 2")
  }
  
}, movie.name = "Dynamic Visualization.gif", interval = 0.1, nmax = 30, 
ani.width = 600)