This article illustrates how to visualize high-dimensional data/vectors in R with the t-SNE and Barnes-Hut-SNE techniques. t-SNE does a good job of visualizing complex multi-dimensional data, as well as revealing unsupervised cluster structure in the data. t-SNE stands for t-Distributed Stochastic Neighbor Embedding, and its main aim is dimensionality reduction: it maps multi-dimensional data into a 2D (or 3D) representation while preserving the 'structure' (patterns) of the original dataset. Visualizing high-dimensional data in this way allows us to look for patterns in the dataset. Stochastic Neighbor Embedding (SNE) starts by converting the high-dimensional Euclidean distances between data points into conditional probabilities that represent similarities.
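Concretely, in van der Maaten and Hinton's formulation, the similarity of point $x_j$ to point $x_i$ is the conditional probability

$$p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)},$$

where each bandwidth $\sigma_i$ is tuned so that the distribution over the neighbors of $x_i$ has a user-chosen perplexity. In the low-dimensional map, t-SNE measures similarity with a heavy-tailed Student-t kernel,

$$q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}},$$

and the map points $y_i$ are found by minimizing the Kullback-Leibler divergence $C = \sum_i \sum_j p_{ij} \log(p_{ij}/q_{ij})$ by gradient descent.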
t-SNE, a non-linear dimensionality reduction algorithm, finds patterns in the data by identifying clusters based on the similarity of data points across their many features. Note, however, that it is not a clustering algorithm; it is a dimensionality reduction algorithm whose output can then be inspected for cluster structure.
It has been extensively applied in image processing, NLP, genomics, and speech processing, and it has been used to improve the analysis of brain and heart scans.
While t-SNE itself is computationally heavy, a faster version exists that uses what is known as the Barnes-Hut approximation, and it is this faster version that allows t-SNE to be applied to large real-world datasets. In the Rtsne package, the theta argument controls this trade-off, as the sketch below illustrates.
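A minimal sketch of that trade-off on the built-in iris data (illustrative only; exact_fit and bh_fit are names introduced here, not part of the analysis that follows):

library(Rtsne)
X <- as.matrix(iris[, 1:4])
## theta = 0 requests exact t-SNE (quadratic in n); theta > 0 enables the
## Barnes-Hut approximation (default 0.5), which scales roughly as n log n
exact_fit <- Rtsne(X, theta = 0.0, check_duplicates = FALSE)
bh_fit <- Rtsne(X, theta = 0.5, check_duplicates = FALSE)

t-SNE for Exploratory Data Analysis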
The data for the analysis in this project is the Crowdsourced Mapping dataset from the UCI Machine Learning Repository.
Abstract: Crowdsourced data from OpenStreetMap is used to automate the classification of satellite images into different land cover classes (impervious, farm, forest, grass, orchard, water). The training set used here has 29 variables and 10,545 observations.
Data Set Information:
This dataset was derived from geospatial data from two sources: 1) Landsat time-series satellite imagery from the years 2014-2015, and 2) crowdsourced georeferenced polygons with land cover labels obtained from OpenStreetMap. The crowdsourced polygons cover only a small part of the image area and are used to extract training data from the image for classifying the rest of the image. The main challenge with the dataset is that both the imagery and the crowdsourced data contain noise (due to cloud cover in the images and inaccurate labeling/digitizing of polygons).
Files in the zip folder:
- The 'training.csv' file contains the training data for classification. Do not use this file to evaluate classification accuracy, because it contains noise (many class labeling errors).
- The 'testing.csv' file contains testing data to evaluate classification accuracy. This file does not contain any class labeling errors.
Attribute Information:
- class: the land cover class (impervious, farm, forest, grass, orchard, water) [note: this is the target variable to classify].
- max_ndvi: the maximum NDVI (normalized difference vegetation index) value derived from the time series of satellite images.
- 20150720_N - 20140101_N: NDVI values extracted from satellite images acquired between January 2014 and July 2015, in reverse chronological order (dates given in the format yyyymmdd).
# install.packages(c("Rtsne", "tsne", "rio", "tidyverse", "RColorBrewer", "plotly"))
library(Rtsne)        # Barnes-Hut implementation of t-SNE
library(tsne)         # pure-R t-SNE implementation
library(tidyverse)    # data manipulation and ggplot2
library(rio)          # import() for reading many file formats
library(RColorBrewer) # colour palettes
library(knitr)        # report rendering
library(plotly)       # interactive 3-D plots
## importing the training data
train<- import("/Users/nanaakwasiabayieboateng/Documents/memphisclassesbooks/DataMiningscience/Clustering/Crowdsourced Mapping/training.csv")
head(train)
train$class <- as.factor(train$class)  ## convert the target variable to a factor
The data set contains 29 variables and 10545 observations.
str(train)
'data.frame': 10545 obs. of 29 variables:
$ class : Factor w/ 6 levels "farm","forest",..: 6 6 6 6 6 2 6 6 6 6 ...
$ max_ndvi : num 998 914 3801 952 1232 ...
$ 20150720_N: num 637.6 634.2 1671.3 58 72.5 ...
$ 20150602_N: num 659 594 1207 -1599 -1221 ...
$ 20150517_N: num -1882 -1626 450 211 380 ...
$ 20150501_N: num -1924 -1672 1071 -1053 -1257 ...
$ 20150415_N: num 998 914 546 579 516 ...
$ 20150330_N: num -1740 -692 1078 -1565 -1413 ...
$ 20150314_N: num 630 708 215 -858 -803 ...
$ 20150226_N: num -1628 -1671 850 730 683 ...
$ 20150210_N: num -1326 -1409 1284 -3162 -2829 ...
$ 20150125_N: num -944 -989 1305 -1522 -1268 ...
$ 20150109_N: num 277 214 542 433 461 ...
$ 20141117_N: num -206.8 -75.6 922.6 228.2 317.5 ...
$ 20141101_N: num 536 893 890 555 405 ...
$ 20141016_N: num 749 401 836 531 564 ...
$ 20140930_N: num -483 -390 1824 952 1232 ...
$ 20140813_N: num 492 394 1670 -1075 -118 ...
$ 20140626_N: num 656 667 2307 546 683 ...
$ 20140610_N: num -921 -955 1562 -1026 -1814 ...
$ 20140525_N: num -1043 -934 1566 369 156 ...
$ 20140509_N: num -1942 -625 2208 -1787 -1190 ...
$ 20140423_N: num 267 120 1057 -1228 -924 ...
$ 20140407_N: num 367 365 385 305 432 ...
$ 20140322_N: num 452 477 301 291 283 ...
$ 20140218_N: num 211 221 294 369 298 ...
$ 20140202_N: num -2203 -2250 2763 -2202 -2197 ...
$ 20140117_N: num -1180 -1361 151 600 626 ...
$ 20140101_N: num 434 524 3801 -1344 -827 ...
## Curating the database for analysis with both t-SNE and PCA
Labels<-train$class
## for plotting
colors = rainbow(length(unique(Labels)))
names(colors) = unique(Labels)
## Executing the algorithm on curated data
tsne <- Rtsne(train[,-1], dims = 2, perplexity=30, verbose=TRUE, max_iter = 500,check_duplicates=FALSE)
Read the 10545 x 28 data matrix successfully!
Using no_dims = 2, perplexity = 30.000000, and theta = 0.500000
Computing input similarities...
Normalizing input...
Building tree...
- point 0 of 10545
- point 10000 of 10545
Done in 7.02 seconds (sparsity = 0.011878)!
Learning embedding...
Iteration 50: error is 98.507910 (50 iterations in 5.78 seconds)
Iteration 100: error is 89.068958 (50 iterations in 6.25 seconds)
Iteration 150: error is 85.403221 (50 iterations in 4.97 seconds)
Iteration 200: error is 84.928282 (50 iterations in 4.85 seconds)
Iteration 250: error is 84.820367 (50 iterations in 4.65 seconds)
Iteration 300: error is 3.080956 (50 iterations in 4.72 seconds)
Iteration 350: error is 2.596294 (50 iterations in 5.00 seconds)
Iteration 400: error is 2.317933 (50 iterations in 4.84 seconds)
Iteration 450: error is 2.129001 (50 iterations in 5.05 seconds)
Iteration 500: error is 1.993635 (50 iterations in 5.17 seconds)
Fitting performed in 51.29 seconds.
exeTimeTsne<- system.time(Rtsne(train[,-1], dims = 2, perplexity=30, verbose=TRUE, max_iter = 500,check_duplicates=FALSE))
Read the 10545 x 28 data matrix successfully!
Using no_dims = 2, perplexity = 30.000000, and theta = 0.500000
Computing input similarities...
Normalizing input...
Building tree...
- point 0 of 10545
- point 10000 of 10545
Done in 5.38 seconds (sparsity = 0.011878)!
Learning embedding...
Iteration 50: error is 98.507908 (50 iterations in 5.16 seconds)
Iteration 100: error is 89.404647 (50 iterations in 5.02 seconds)
Iteration 150: error is 85.436631 (50 iterations in 4.70 seconds)
Iteration 200: error is 84.860116 (50 iterations in 4.75 seconds)
Iteration 250: error is 84.712320 (50 iterations in 4.73 seconds)
Iteration 300: error is 3.081531 (50 iterations in 4.76 seconds)
Iteration 350: error is 2.596316 (50 iterations in 4.98 seconds)
Iteration 400: error is 2.316512 (50 iterations in 4.97 seconds)
Iteration 450: error is 2.130918 (50 iterations in 5.02 seconds)
Iteration 500: error is 1.996556 (50 iterations in 5.07 seconds)
Fitting performed in 49.18 seconds.
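Note that the two runs above report slightly different final errors (1.993635 vs. 1.996556) because t-SNE starts from a random initialization; the sharp drop in error around iteration 300 simply marks the end of the early-exaggeration phase, after which the unexaggerated KL divergence is reported. For reproducible embeddings, fix the seed before the call (a minimal sketch of the same call as above):

## fix the RNG seed so repeated Rtsne() runs return the same embedding
set.seed(42)
tsne <- Rtsne(train[,-1], dims = 2, perplexity = 30, verbose = TRUE, max_iter = 500, check_duplicates = FALSE)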
## Plotting
colourCount <- length(unique(train$class))
getPalette <- colorRampPalette(brewer.pal(9, "Set1"))  ## palette generator for an optional manual colour scale (see commented line below)
df=data_frame(Y1=tsne$Y[,1],Y2=tsne$Y[,2],label=Labels)
plot(df$Y1,df$Y2,col=colors[Labels], main="tsne")
ggplot(df,aes(Y1,Y2,colour=label))+geom_point()
#scale_fill_manual(values = c("#D53E4F","#FC8D59","#FEE08B","#FFFFBF","#E6F598","#99D594","#3288BD"))
#c("#D53E4F","#FC8D59","#FEE08B","#FFFFBF","#E6F598","#99D594","#3288BD")
tsne$Y %>% as.data.frame() %>% head() %>% rename(Y1 = V1, Y2 = V2)  ## peek at the first rows of the 2-D embedding
exeTimeTsne
user system elapsed
54.927 1.089 56.031
# execution time for PCA, for comparison
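For a rough comparison, PCA can be timed the same way (a minimal sketch; exeTimePca is a name introduced here):

exeTimePca <- system.time(princomp(train[,-1]))
exeTimePca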
## Executing the algorithm
tsne2 <- Rtsne(train[,-1], dims = 3, perplexity = 30, verbose = TRUE, max_iter = 500, check_duplicates = FALSE)
Read the 10545 x 28 data matrix successfully!
Using no_dims = 3, perplexity = 30.000000, and theta = 0.500000
Computing input similarities...
Normalizing input...
Building tree...
- point 0 of 10545
- point 10000 of 10545
Done in 5.73 seconds (sparsity = 0.011878)!
Learning embedding...
Iteration 50: error is 98.507907 (50 iterations in 10.80 seconds)
Iteration 100: error is 88.255190 (50 iterations in 12.70 seconds)
Iteration 150: error is 84.751021 (50 iterations in 9.00 seconds)
Iteration 200: error is 84.285947 (50 iterations in 8.92 seconds)
Iteration 250: error is 84.203537 (50 iterations in 8.82 seconds)
Iteration 300: error is 2.762951 (50 iterations in 8.96 seconds)
Iteration 350: error is 2.286740 (50 iterations in 9.81 seconds)
Iteration 400: error is 2.032735 (50 iterations in 10.04 seconds)
Iteration 450: error is 1.868553 (50 iterations in 10.38 seconds)
Iteration 500: error is 1.752392 (50 iterations in 10.44 seconds)
Fitting performed in 99.87 seconds.
df=data_frame(Y1=tsne2$Y[,1],Y2=tsne2$Y[,2],Y3=tsne2$Y[,3],label=Labels)
plot(df$Y1,df$Y2,col=colors[Labels], main="tsne")
ggplot(df,aes(Y1,Y2,colour=label))+geom_point()
# display the 3-D t-SNE embedding interactively with plotly
p <- plot_ly(df, x = ~Y1, y = ~Y2, z = ~Y3, color = ~label, colors = c('#D53E4F','#FC8D59','#FEE08B','#FFFFBF','#E6F598','#99D594')) %>%
add_markers() %>%
layout(scene = list(xaxis = list(title = 'Y1'),
yaxis = list(title = 'Y2'),
zaxis = list(title = 'Y3')))
p
# compare to PCA
dev.new()
pca1 = princomp(train[,-1])
pca = pca1$scores[,1:2]
df=data_frame(Comp.1=pca[,1],Comp.2=pca[,2],label=Labels)
ggplot(df,aes(Comp.2,Comp.1,colour=label))+geom_point()+labs(y="Component 1",title="Principal Component Analysis",x="Component 2")
ggplot(df,aes(Comp.2,Comp.1,colour=label))+geom_text(aes(label=label))+labs(y="Component 1",title="Principal Component Analysis",x="Component 2")
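As an aside, princomp() above computes components from the covariance matrix of the raw NDVI values, which is reasonable here since all predictors share the NDVI scale. A standardized, SVD-based alternative is easy to swap in (a sketch; pca2 and df2 are names introduced here, and it assumes no zero-variance columns):

pca2 <- prcomp(train[,-1], center = TRUE, scale. = TRUE)  ## SVD-based PCA on standardized predictors
df2 <- data_frame(Comp.1 = pca2$x[,1], Comp.2 = pca2$x[,2], label = Labels)
ggplot(df2, aes(Comp.2, Comp.1, colour = label)) + geom_point() + labs(y = "Component 1", x = "Component 2", title = "PCA on standardized predictors")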