SVD package comparison

irlba vs RSpectra

I compared two SVD packages, irlba and RSpectra, in terms of their speed and accuracy.

Speed

Speed is measured as the time taken to compute the 300 largest singular values of a large document-feature matrix (dfm).
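The benchmark script itself is not included here. As a rough sketch of how such a timing could be taken, the code below runs both packages on the same sparse matrix and records elapsed time with system.time(). The random matrix, its dimensions, and k = 50 are assumptions chosen to keep the example quick; they are not the corpus or the settings behind the results loaded below.

```r
library(Matrix)

set.seed(1)
# Hypothetical stand-in for a dfm: 2,000 documents x 5,000 features, about 1% non-zero
mat <- rsparsematrix(2000, 5000, density = 0.01)
k <- 50  # the comparison in this document uses 300 components

# Time a truncated SVD with each package on the same input
time_irlba <- system.time(res_irlba <- irlba::irlba(mat, nv = k))[["elapsed"]]
time_rspectra <- system.time(res_rspectra <- RSpectra::svds(mat, k = k))[["elapsed"]]

c(irlba = time_irlba, rspectra = time_rspectra)

# Sanity check: largest absolute difference between the two sets of singular values
max(abs(res_irlba$d - res_rspectra$d))
```

Repeating the measurement several times per matrix size, as the saved results appear to do, helps average out run-to-run variation.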

require(tidyr)

## Loading required package: tidyr

time_wide <- readRDS('LSS comparison_speed_1-10k_10x.RDS')
time_long <- gather(time_wide, 'type', 'sec', 2:3)
head(time_long)

##   size  type     sec
## 1 1000 irlba 179.680
## 2 1000 irlba 167.802
## 3 1000 irlba 179.567
## 4 1000 irlba 140.083
## 5 1000 irlba 174.122
## 6 1000 irlba 126.421

plot(time_long$size, time_long$sec, col = ifelse(time_long$type == 'irlba', 'red', 'black'),
     xlab = 'Corpus size', ylab = 'Execution time (sec)')
abline(lm(sec ~ size, data = subset(time_long, type == 'irlba')), col = 'red')
abline(lm(sec ~ size, data = subset(time_long, type == 'rspectra')), col = 'black')
grid()
legend('topleft', c('irlba', 'rspectra'), col = c('red', 'black'), pch = 1)

Accuracy

SVD algorithms often produce noticeably different outputs, so I tested accuracy by decomposing the document dimension of a dfm into 300 components and calculating word-to-word similarity. I selected 25 commonly used English words (seed words) and extracted the 100 words most similar to each of them in the decomposed dfm. The extracted words were then stemmed and compared with the seed word’s stem to assess the quality of the proximity estimation: the more of the top 100 words that share the seed word’s stem, the better the estimate.
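The procedure above can be sketched on a toy example. The code below is only an illustration under stated assumptions: the feature list, the random counts, k = 5, cosine similarity as the similarity measure, and the helper count_stem_matches() are all hypothetical and are not taken from the original test script. Because the counts are random, the resulting number carries no meaning; the point is the sequence of steps.

```r
library(RSpectra)
library(SnowballC)

set.seed(1)
# Hypothetical toy dfm: 200 documents x 12 features filled with random counts
feats <- c("run", "running", "runs", "walk", "walking", "walked",
           "jump", "jumping", "jumps", "talk", "talking", "talked")
dfm_toy <- matrix(as.numeric(rpois(200 * length(feats), 0.5)),
                  nrow = 200, ncol = length(feats),
                  dimnames = list(NULL, feats))

# Reduce the document dimension to k components; each column of emb is a word vector
k <- 5
emb <- t(svds(dfm_toy, k = k)$v)
colnames(emb) <- feats

# Cosine similarity between all pairs of word vectors
unit <- sweep(emb, 2, sqrt(colSums(emb ^ 2)), "/")
sim <- crossprod(unit)

# Count how many of the n words most similar to the seed word share its stem
count_stem_matches <- function(sim, seed, n) {
  others <- sim[seed, colnames(sim) != seed]
  nearest <- names(sort(others, decreasing = TRUE))[seq_len(n)]
  sum(wordStem(nearest, language = "english") == wordStem(seed, language = "english"))
}

n_match <- count_stem_matches(sim, "run", n = 5)
n_match
```

Swapping svds() for irlba::irlba(dfm_toy, nv = k) and repeating the count gives the kind of side-by-side comparison plotted below.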

accu_wide <- readRDS('LSS comparison_1-10k_10x.RDS')
accu_wide$size <- accu_wide$n * 1000
accu_long <- gather(accu_wide, 'type', 'stem', -size)
# keep only the irlba ('sent3') and RSpectra ('sent3_rs') results
accu_long <- accu_long[accu_long$type %in% c('sent3', 'sent3_rs'),]

plot(accu_long$size, accu_long$stem, 
     col = ifelse(accu_long$type == 'sent3', 'red', 'black'),
     pch = ifelse(accu_long$type == 'sent3', 1, 3),
     xlab = 'Corpus size', ylab = 'Number of stems in top 100')

irlba_lm <- lm(stem ~ size + I(size ^ 2), data = subset(accu_long, type == 'sent3'))
rspectra_lm <- lm(stem ~ size + I(size ^ 2), data = subset(accu_long, type == 'sent3_rs'))

lines(1:10 * 1000, predict(irlba_lm, newdata = data.frame(size = 1:10 * 1000)), col = 'red')
lines(1:10 * 1000, predict(rspectra_lm, newdata = data.frame(size = 1:10 * 1000)))

grid()
legend('topleft', c('irlba', 'rspectra'), col = c('red', 'black'), pch = c(1, 3))

KoheiWatanabe

16 November 2017
