…
Computation Time and the Singular Value Decomposition
…
In video 50 from the series Mining Massive Datasets - Stanford University (https://youtu.be/K38wVcdNuFc), a drawback of the Singular Value Decomposition is mentioned: the decomposition may be denser than the original matrix, slowing computation. We want to explore a related question: does the density of the original matrix affect the time needed to build a model using the Singular Value Decomposition?
The original matrix is the joke database from project 2. The original dataset contained 101 jokes and 23,500 users, with 72% non-NA values. For NA values in this and all subsequent matrices, we changed NA to -2, assuming that, on average, users were less likely to rate items they disliked. We computed 7 sets of 19 models each.
Each primary set contained models built with between 5 and 95 features in the feature vector, in steps of 5. To get a sense of how the number of features affected the RMSE, we graphed three of these sets. Strictly speaking, 95 features is too large a request for the irlba method with only 101 jokes (the package issues a warning to this effect). However, we wanted to see where the gains from adding features would level off as the feature set grew.
The irlba package uses implicitly restarted Lanczos bidiagonalization. This method computes a truncated set of singular triplets faster than a full SVD. Techniques the package uses to accomplish this include Lanczos bidiagonalization, implicit restarting, and harmonic Ritz values. The end result is the usual three-part decomposition: matrices of left and right singular vectors and a vector of singular values. (https://youtu.be/3y_0-v9w_kY, http://www.math.uri.edu/~jbaglama/papers/paper14.pdf)
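As an illustration of the interface (a minimal sketch on a small random matrix, not the call used for the joke data below), irlba() returns the truncated triplet directly: a vector d of singular values and matrices u and v of left and right singular vectors.

library(irlba)
set.seed(1)
M <- matrix(rnorm(200 * 50), nrow = 200)   # toy stand-in for a ratings matrix
k <- 10                                    # number of singular triplets requested
decomp <- irlba(M, nu = k, nv = k)
length(decomp$d)                           # k singular values
dim(decomp$u); dim(decomp$v)               # 200 x k and 50 x k singular-vector matrices
M_k <- decomp$u %*% diag(decomp$d) %*% t(decomp$v)   # rank-k approximation of M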
The original set required each user to have rated at least 36 jokes. For the secondary sets, we raised this minimum from 40 to 90 in steps of 10, with a final set requiring all 101 jokes to be rated. Each set is more informationally dense than the one before it; a hedged sketch of the densifier() helper used to build them appears after the calls below.
## [1] "original matrix density"
## [1] 0.7200308
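The code that printed this density is not part of this excerpt; a minimal sketch, assuming joke_ratings holds the raw ratings with NA for missing values (and that any user-id column has been excluded), would be:

print("original matrix density")
print(mean(!is.na(joke_ratings)))   # share of non-NA cells in the ratings matrix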
#------------------------------------------------------------------------
densifier(joke_ratings,"filtered1",40)
densifier(joke_ratings,"filtered2",50)
densifier(joke_ratings,"filtered3",60)
densifier(joke_ratings,"filtered4",70)
densifier(joke_ratings,"filtered5",80)
densifier(joke_ratings,"filtered6",90)
densifier(joke_ratings,"filtered7",101)
#------------------------------------------------------------------------
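densifier() itself is defined outside this excerpt. Based on how it is called above (source matrix, a name for the filtered set, and a minimum rating count), one hedged reconstruction is the sketch below; the assignment into the global environment and the printed density are assumptions, not the author's confirmed implementation.

# Hypothetical reconstruction of densifier(); not the original definition.
densifier <- function(ratings, new_name, min_ratings){
  rated <- rowSums(!is.na(ratings))            # ratings per user (a user-id column, if present, should be excluded)
  filtered <- ratings[rated >= min_ratings, ]  # keep users with at least min_ratings ratings
  assign(new_name, filtered, envir = .GlobalEnv)
  print(paste(new_name, "density:", round(mean(!is.na(filtered)), 4)))
}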

Our first set of models, built from the sparsest starting matrix, shows a convex relationship between feature-set size and RMSE. For this model, the tradeoff suggests that a feature vector of around 35 would be optimal, with the cost growing as we move in either direction. An RMSE under 3 is acceptable for joke data rated from -10 to 10: the cost of a bad recommendation is low, and novelty or serendipity would be more important qualities.
library(irlba)  # provides the irlba() decomposition used below

# RMSE of a predicted ratings matrix against joke_ratings4 (NAs ignored).
RMSE <- function(aMatrix){
  sqrt(mean((aMatrix - joke_ratings4)^2, na.rm = TRUE))
}
#---------------------------
joke_ratings4 <- joke_ratings4[ , -1]                       # drop the user-id column
jokeMeans <- apply(joke_ratings4, 2, mean, na.rm = TRUE)    # per-joke means
jokeMeans[!is.finite(jokeMeans)] <- -2                      # jokes with no ratings default to -2
userMeans <- apply(joke_ratings4, 1, mean, na.rm = TRUE)    # per-user means
overall.mean <- mean(jokeMeans, na.rm = TRUE)
joke_ratings4[is.na(joke_ratings4)] <- overall.mean         # fill remaining NAs with the overall mean
userBias <- userMeans - overall.mean
jokeBias <- jokeMeans - overall.mean
error.rates4 <- as.matrix(rep(0, 19))
time.a <- Sys.time()
for (i in seq(5, 95, 5)){
  decomp_4 <- irlba(joke_ratings4, nu = i, nv = i)
  # Rank-i reconstruction, splitting the singular values between the two factors,
  # with the user means added back in.
  irlbaPredictb <- userMeans +
    (decomp_4$u %*% diag(sqrt(decomp_4$d), i, i)) %*% (diag(sqrt(decomp_4$d), i, i) %*% t(decomp_4$v))
  error.rates4[(i / 5)] <- RMSE(irlbaPredictb)
}
time.b <- Sys.time()
density.speed[4, 2] <- as.numeric(difftime(time.b, time.a, units = "secs"))  # elapsed seconds for the 19 models


…
For the less sparse starting matrices, the shape of the convex curve differs. The densest matrix has a gently changing but, overall, steepest slope: its average cost in RMSE is higher for small feature vectors and lower for large ones. Graphed together, the curves show that the range where the sparse matrix shines is where the feature vectors are small, while the lowest RMSE overall is obtained with a large feature set and the fully dense matrix.
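The comparison plots themselves are generated outside this excerpt. As a minimal sketch, assuming the other sets' error-rate vectors (e.g., error.rates1 and error.rates7, names not in the original) were built the same way as error.rates4 above, three of the curves could be overlaid like this:

feature_sizes <- seq(5, 95, 5)
matplot(feature_sizes, cbind(error.rates1, error.rates4, error.rates7),
        type = "b", pch = 1, lty = 1, col = 1:3,
        xlab = "number of features", ylab = "RMSE")
legend("topright", legend = paste("set", c(1, 4, 7)), col = 1:3, lty = 1)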

Finally, we look at the time cost as a function of the initial matrix's sparsity. Computation time appears to increase as density increases, but the relationship is not clear-cut; in this case, the added time to run all 19 models is under a second. Either way, these models are fit much more quickly than the direct model-creation methods used in project 2. We can see the tradeoff between time and feature-set size, and how sparsity affects that tradeoff. In other contexts this would be an important consideration for model optimization; for this dataset, however, the differences are small and the most accurate model should be used.
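The density.speed table filled in above is also created outside this excerpt. A hedged sketch of one possible layout, assuming one row per ratings set with its density in the first column and run time in seconds in the second:

# Hypothetical layout; not the original definition of density.speed.
density.speed <- matrix(NA_real_, nrow = 7, ncol = 2,
                        dimnames = list(NULL, c("density", "seconds")))
# ... each modelling block then records its own timing, e.g. density.speed[4, 2] above ...
plot(density.speed[, "density"], density.speed[, "seconds"],
     xlab = "matrix density", ylab = "time to fit 19 models (seconds)")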
