Implementing a Recommender System on Spark

The goal of this project is to give you practice working with a distributed recommender system. For this assignment, it is sufficient to build out your application on a single node.

Source: For this project I’ll use the Jester joke dataset that I used for project 4. It has the highest rating density of the example datasets I was able to find online: http://eigentaste.berkeley.edu/dataset/

Spark

This was an iterative process. I found out the hard way that Spark needs the data frame columns to be numeric, so I coerce them with as.numeric in this step.
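The coercion step can be sketched in base R as follows. The data frame and its column names (user_id, joke_id, rating) are hypothetical stand-ins, not the actual Jester layout or the code used for this project.

```r
# Hypothetical stand-in for the ratings data: every column was read as character.
ratings <- data.frame(
  user_id = c("1", "1", "2"),
  joke_id = c("5", "7", "5"),
  rating  = c("-7.82", "8.79", "4.08"),
  stringsAsFactors = FALSE
)

# Coerce every column to numeric; `ratings[] <-` keeps the data frame structure.
ratings[] <- lapply(ratings, as.numeric)
str(ratings)

# Caveat: calling as.numeric on a factor returns the level codes, not the values.
f <- factor(c("10", "20"))
as.numeric(f)                # 1 2
as.numeric(as.character(f))  # 10 20
```

If any column arrives as a factor, converting through as.character first avoids silently replacing ratings with level codes.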

Recommenderlab ALS

At this stage, I’ll run the same model in the R recommenderlab package for comparison.
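For context, here is a minimal base R sketch of what ALS matrix factorization does. It is an illustration only, not the implementation used by Spark or recommenderlab; the toy matrix, the function name als_factorize, and the rank, lambda and iters values are all assumptions chosen for the example.

```r
# Toy ratings matrix: rows are users, columns are items, NA means "not rated".
set.seed(42)
R <- matrix(c( 5,  3, NA,  1,
               4, NA, NA,  1,
               1,  1, NA,  5,
               1, NA,  5,  4,
              NA,  1,  5,  4), nrow = 5, byrow = TRUE)

# Hypothetical helper: alternate ridge-regression solves for user and item factors.
als_factorize <- function(R, rank = 2, lambda = 0.1, iters = 20) {
  n_users <- nrow(R)
  n_items <- ncol(R)
  U <- matrix(rnorm(n_users * rank, sd = 0.1), n_users, rank)
  V <- matrix(rnorm(n_items * rank, sd = 0.1), n_items, rank)
  ridge <- lambda * diag(rank)
  for (it in seq_len(iters)) {
    # Hold item factors fixed and solve for each user's factors.
    for (u in seq_len(n_users)) {
      obs <- which(!is.na(R[u, ]))
      if (length(obs) == 0) next
      Vo <- V[obs, , drop = FALSE]
      U[u, ] <- solve(crossprod(Vo) + ridge, crossprod(Vo, R[u, obs]))
    }
    # Hold user factors fixed and solve for each item's factors.
    for (i in seq_len(n_items)) {
      obs <- which(!is.na(R[, i]))
      if (length(obs) == 0) next
      Uo <- U[obs, , drop = FALSE]
      V[i, ] <- solve(crossprod(Uo) + ridge, crossprod(Uo, R[obs, i]))
    }
  }
  list(U = U, V = V)
}

fit  <- als_factorize(R)
pred <- fit$U %*% t(fit$V)                       # predictions for every user/item pair
rmse <- sqrt(mean((pred - R)^2, na.rm = TRUE))   # error on the observed ratings only
```

The entries of pred that line up with NA cells in R are the predicted ratings a recommender would rank to produce suggestions.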

Spark took 12.18 to create the ALS model and only 2.14 to save predictions, 14.32 in total.
Recommenderlab, by contrast, took 0.06 to build the model but 271.53 to save predictions, 271.59 in total.
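One common way to collect per-stage timings like these in base R is system.time. The sketch below assumes that approach and uses a hypothetical placeholder function, slow_step, in place of the real model-building or prediction call.

```r
# Hypothetical placeholder for an expensive step such as fitting a model.
slow_step <- function(n) {
  Sys.sleep(0.05)          # simulate work
  sum(sqrt(seq_len(n)))
}

# system.time evaluates the expression once and reports user, system and elapsed time.
timing  <- system.time(result <- slow_step(1e5))
elapsed <- timing[["elapsed"]]   # wall-clock seconds for this stage

cat("Stage took", round(elapsed, 2), "seconds\n")
```

Timing the build and predict stages separately matters here because the two libraries spend their time in different stages.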

Summary

It took a lot of trial and error to get Spark up and running properly, but now that it runs, it is much faster end to end. Recommenderlab built its model almost instantly (0.06 versus 12.18 for Spark) but took far longer to make predictions (271.53 versus 2.14). It’s also worth pointing out that ALS matrix factorization is the only recommendation algorithm available in Spark, whereas the recommenderlab package offers more variety.
For larger datasets, I’d definitely say it makes sense to prepare the environment for Spark and use it, and perhaps even to experiment with additional nodes or move to a hosted environment.
For something smaller, recommenderlab or similar would do the job.

References: measuring-function-execution-time-in-r (https://stackoverflow.com/a/33375008)