I compared two SVD packages, irlba and RSpectra, in terms of their speed and accuracy.
Speed is measured by the time it takes to compute the 300 largest singular values of a large document-feature matrix (dfm).
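As a rough sketch of how such a timing can be taken (not the exact benchmark script behind the results in this post), the snippet below times one truncated SVD call per package with `system.time()`. The helper `time_svd()`, the random Poisson matrix standing in for a dfm, and the small `k` are all assumptions made for illustration. The package calls only run if both irlba and RSpectra are installed.

```r
# Hypothetical helper: run one truncated SVD and return the elapsed seconds.
time_svd <- function(svd_fun, x, k) {
  unname(system.time(svd_fun(x, k))["elapsed"])
}

set.seed(1)
# Small random count matrix standing in for a document-feature matrix
# (assumption: the real benchmark uses a much larger sparse dfm).
x <- matrix(rpois(200 * 500, lambda = 0.1), nrow = 200, ncol = 500)
k <- 10  # the post computes 300 singular values; kept small here

if (requireNamespace("irlba", quietly = TRUE) &&
    requireNamespace("RSpectra", quietly = TRUE)) {
  sec <- c(
    irlba    = time_svd(function(x, k) irlba::irlba(x, nv = k), x, k),
    rspectra = time_svd(function(x, k) RSpectra::svds(x, k), x, k)
  )
  print(sec)
}
```

Repeating the call several times per corpus size, as in the data loaded below, smooths out run-to-run variation in the timings.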
require(tidyr)
## Loading required package: tidyr
time_wide <- readRDS('LSS comparison_speed_1-10k_10x.RDS')
time_long <- gather(time_wide, 'type', 'sec', 2:3)
head(time_long)
## size type sec
## 1 1000 irlba 179.680
## 2 1000 irlba 167.802
## 3 1000 irlba 179.567
## 4 1000 irlba 140.083
## 5 1000 irlba 174.122
## 6 1000 irlba 126.421
plot(time_long$size, time_long$sec, col = ifelse(time_long$type == 'irlba', 'red', 'black'),
     xlab = 'Corpus size', ylab = 'Execution time (sec)')
abline(lm(sec ~ size, data = subset(time_long, type == 'irlba')), col = 'red')
abline(lm(sec ~ size, data = subset(time_long, type == 'rspectra')), col = 'black')
grid()
legend('topleft', c('irlba', 'rspectra'), col = c('red', 'black'), pch = 1)
SVD algorithms often produce noticeably different outputs, so I tested accuracy by decomposing the document dimension of a dfm into 300 components and calculating word-to-word similarity in the reduced space. I selected 25 commonly used English words (seed words) and extracted the 100 most similar words to each of them from the decomposed dfm. The extracted words were then stemmed and compared with the seed word’s stem to assess the quality of proximity estimation: the more of the top 100 words that share the seed’s stem, the better the estimation.
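The similarity step can be illustrated on a toy matrix. This is a minimal sketch in base R, assuming cosine similarity between word vectors taken from the truncated SVD. The toy dfm, the helper `most_similar()`, and the use of base `svd()` instead of irlba or RSpectra are illustrative assumptions, and the stemming step is left out.

```r
# Toy document-feature matrix: rows are documents, columns are words.
dfm_toy <- matrix(
  c(2, 1, 0, 0,
    1, 2, 0, 0,
    3, 2, 0, 1,
    0, 0, 2, 1,
    0, 0, 1, 2,
    0, 1, 3, 2),
  nrow = 6, byrow = TRUE,
  dimnames = list(paste0("doc", 1:6), c("tax", "taxes", "peace", "peaceful"))
)

k <- 2  # the post uses 300 components
dec <- svd(dfm_toy, nu = 0, nv = k)

# One k-dimensional vector per word: right singular vectors scaled by
# the corresponding singular values.
word_vec <- dec$v %*% diag(dec$d[1:k], nrow = k)
rownames(word_vec) <- colnames(dfm_toy)

# Cosine similarity between all pairs of word vectors.
unit <- word_vec / sqrt(rowSums(word_vec^2))
sim <- unit %*% t(unit)

# Hypothetical helper: the n words most similar to a seed, excluding the seed.
most_similar <- function(sim, seed, n) {
  s <- sim[seed, colnames(sim) != seed]
  names(sort(s, decreasing = TRUE))[seq_len(min(n, length(s)))]
}

most_similar(sim, "tax", 1)
```

In the actual test, the same lookup is done with `n = 100` for each of the 25 seed words, and the returned words are stemmed before being compared with the seed's stem.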
accu_wide <- readRDS('LSS comparison_1-10k_10x.RDS')
accu_wide$size <- accu_wide$n * 1000
accu_long <- gather(accu_wide, 'type', 'stem', -size)
accu_long <- accu_long[accu_long$type %in% c('sent3', 'sent3_rs'),]
plot(accu_long$size, accu_long$stem,
col = ifelse(accu_long$type == 'sent3', 'red', 'black'),
pch = ifelse(accu_long$type == 'sent3', 1, 3),
xlab = 'Corpus size', ylab = 'Number of stems in top 100')
irlba_lm <- lm(stem ~ size + I(size ^ 2), data = subset(accu_long, type == 'sent3'))
rspectra_lm <- lm(stem ~ size + I(size ^ 2), data = subset(accu_long, type == 'sent3_rs'))
lines(1:10 * 1000, predict(irlba_lm, newdata = data.frame(size = 1:10 * 1000)), col = 'red')
lines(1:10 * 1000, predict(rspectra_lm, newdata = data.frame(size = 1:10 * 1000)))
grid()
legend('topleft', c('irlba', 'rspectra'), col = c('red', 'black'), pch = c(1, 3))