library(bigMap)
# load aux. stuff
source('../primes.R')
The Primes1M data set is a structured representation of the first 10^6 integers based on their prime factor decomposition. The underlying matrix is highly sparse: it has 78498 dimensions (one per prime below 10^6) and its values are the powers in each prime factorisation. We store it in a compact format in which odd columns hold the column identifiers (primes) of the nonzero entries in the original matrix and even columns hold the corresponding powers. Since no integer up to 10^6 has more than 7 distinct prime factors (2·3·5·7·11·13·17 = 510510, and a further factor of 19 exceeds 10^6), this results in a matrix with 10^6 rows and 14 columns.
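As an illustration of the compact format described above, here is a minimal sketch that builds such rows for a handful of small integers. The helpers `factorise()` and `compact.row()` are hypothetical names introduced here, not part of bigMap or `primes.R`, and two details are assumptions: that the odd columns store the primes themselves (rather than their position among the primes) and that unused columns are padded with zeros.

```r
# Prime factorisation of a positive integer by trial division,
# returned as a list with the primes and their powers.
factorise <- function(n) {
  primes <- integer(0)
  powers <- integer(0)
  p <- 2L
  while (n > 1L && p * p <= n) {
    if (n %% p == 0L) {
      k <- 0L
      while (n %% p == 0L) {
        n <- n %/% p
        k <- k + 1L
      }
      primes <- c(primes, p)
      powers <- c(powers, k)
    }
    p <- p + 1L
  }
  if (n > 1L) {  # whatever is left is itself prime
    primes <- c(primes, as.integer(n))
    powers <- c(powers, 1L)
  }
  list(primes = primes, powers = powers)
}

# Compact row: odd positions hold primes, even positions hold powers,
# zero-padded up to `width` columns (assumed padding convention).
compact.row <- function(n, width = 14L) {
  f <- factorise(n)
  row <- integer(width)
  idx <- seq_along(f$primes)
  row[2L * idx - 1L] <- f$primes
  row[2L * idx] <- f$powers
  row
}

# One row per integer, 14 columns each.
X <- t(sapply(1:12, compact.row))
X[12, 1:4]  # 12 = 2^2 * 3^1, stored as the pairs (2, 2) and (3, 1)
```

Trial division is only meant for this toy range; for the full 10^6 integers a sieve would be the more natural choice.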
This data set is inspired by the work of John Williamson https://johnhw.github.io/umap_primes/index.md.html, though we note some differences:
in Williamson’s work each factor is represented as either a zero or a one, indicating only divisibility by that factor, whereas we use the actual powers so that each integer has a unique representation;
in Williamson’s work affinities are based on cosine similarity, whereas we use Euclidean distances.
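The two differences listed above can be seen on a toy example. The sketch below uses dense vectors over the primes 2, 3 and 5 only; `power.vec()`, `cosine()` and `euclid()` are hypothetical helpers written for this illustration and are not how bigMap computes affinities internally.

```r
# Dense toy vectors of prime powers over the primes (2, 3, 5).
primes <- c(2, 3, 5)
power.vec <- function(n) sapply(primes, function(p) {
  k <- 0
  while (n %% p == 0) {
    n <- n / p
    k <- k + 1
  }
  k
})

v2 <- power.vec(2)  # (1, 0, 0)
v4 <- power.vec(4)  # (2, 0, 0)
v6 <- power.vec(6)  # (1, 1, 0)

# Binary (divisibility) encoding: 2 and 4 collapse to the same vector.
identical(v2 > 0, v4 > 0)

cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))
euclid <- function(a, b) sqrt(sum((a - b)^2))

cosine(v2, v4)  # equals 1: cosine similarity ignores the magnitude of the powers
euclid(v2, v4)  # equals 1: Euclidean distance separates 2 from 4
euclid(v2, v6)  # equals 1: 6 differs from 2 by one extra factor of 3
```

With power-valued vectors, cosine similarity still cannot tell 2 from 4, whereas the Euclidean distance can, which is one way to read the combination of choices made here.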
load('../P1M.RData')
We use the script below to run pt-SNE on this data set on an HPC platform with 101 cores and 8 GB per core:
# ./P1M_start.R:
# Run pt-SNE on Primes1M
# Compute kNP and HL-correlation
# +++ load package
library(bigMap)
# +++ load data
load('./P1M.RData')
# +++ start MPI cluster
threads <- 100
mpi.cl <- bdm.mpi.start(threads)
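# +++ abort with a clear message if the cluster could not be started
# (return() is not valid at the top level of a script)
if (is.null(mpi.cl)) stop('MPI cluster could not be started')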
# +++ run pt-SNE
ppx.list <- c(1000, 5000, 10000, 20000, 30000, 40000, 50000, 100000)
m.list <- lapply(ppx.list, function(ppx){
# +++ compute betas
m <- bdm.init(P1M, is.sparse = TRUE, ppx = ppx, threads = threads, mpi.cl = mpi.cl)
# +++ skip re-exporting data to workers
# (this assignment is local to the function; the global P1M is kept for the next perplexity)
P1M <- NULL
# +++ run ptSNE
m <- bdm.ptsne(P1M, m, theta = 0.5, layers = 2, threads = threads, mpi.cl = mpi.cl)
# +++ compute kNP
m <- bdm.knp(P1M, m, k.max = 250000, sampling = 0.25, threads = threads, mpi.cl = mpi.cl)
# +++ compute hlC
m <- bdm.hlCorr(P1M, m, threads = threads, mpi.cl = mpi.cl)
# +++ return embedding
m
})
# +++ save
save(m.list, file = './P1M_list.RData')
# +++ stop cluster
bdm.mpi.stop(mpi.cl)
Submit:
$ qsub -pe ompi 101 -l h_vmem=8G Rsnow ~/P1M_start.R
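# load the list of embeddings computed on the cluster
load('./P1M_list.RData')
# show the cost of each embedding (results assigned to nulL only to suppress printing)
nulL <- lapply(m.list, function(m) bdm.cost(m))
# show each embedding with primes.plot(), defined in ../primes.R
nulL <- lapply(m.list, function(m) primes.plot(P1M, m))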