DATA 612 | Project 5

Implementing a Recommender System on Spark

The goal of this project is to give you practice working with a distributed recommender system. For this assignment, it is sufficient to build out your application on a single node.

Source: For this project I’ll use the Jester joke dataset I used for project 4. It has the highest density of the example datasets I was able to find online: http://eigentaste.berkeley.edu/dataset/

# Load recommenderlab, which provides the Jester5k data
library(recommenderlab)

# Import 5K x 100 realRatingMatrix of jokes
data(Jester5k)

# Summary of ratings per user
summary(rowCounts(Jester5k))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    36.0    53.0    72.0    72.4   100.0   100.0

# Matrix size
dim(Jester5k)
## [1] 5000  100

# Number of ratings
nratings(Jester5k)
## [1] 362106
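The density mentioned above can be checked from the figures just printed. A quick sketch in base R, using only the matrix dimensions and rating count reported by dim() and nratings(); the variable names are my own:

```r
# Values copied from the dim() and nratings() output above
n_users   <- 5000
n_items   <- 100
n_ratings <- 362106

# Density = observed ratings / all possible user-item pairs
density <- n_ratings / (n_users * n_items)
round(density, 3)
```

Roughly 72% of the cells are filled, which agrees with the mean of 72.4 ratings per user out of 100 jokes.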

Spark

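# Load sparklyr (Spark interface), dplyr (%>% and collect) and tictoc (timing)
library(sparklyr)
library(dplyr)
library(tictoc)

# Set up Spark environment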
# Sys.setenv(JAVA_HOME = "C:/PROGRA~2/Java/jre1.8.0_211")

# Establish connection
sc <- spark_connect(master = "local")

This was an iterative process. I found out the hard way that Spark needs the user and item columns to be numeric, so I convert them with as.numeric in this step:

jester.df <- as(Jester5k, 'data.frame')

jester.df$user <- as.numeric(jester.df$user)
jester.df$item <- as.numeric(jester.df$item)

# And then copying it over to Spark:
spark_jester_df <- sdf_copy_to(sc, jester.df, "spark_jester_df", overwrite = TRUE)
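One thing to keep in mind about the as.numeric conversion above: if the user and item columns come out of the data frame as factors (an assumption worth checking with str(jester.df)), as.numeric returns the level index rather than anything parsed from the label. That is fine for ALS, but a lookup table is needed to map codes back to the original IDs. A minimal sketch with made-up IDs:

```r
# Toy stand-in for the user column; these IDs are invented for illustration
user <- factor(c("u20", "u3", "u20", "u7"))

# as.numeric() on a factor returns the level index, not digits from the label
user_code <- as.numeric(user)

# Lookup table to translate codes back to the original IDs later
user_lookup <- data.frame(code = seq_along(levels(user)),
                          id   = levels(user),
                          stringsAsFactors = FALSE)

# Recover the original ID for each row from its numeric code
recovered <- user_lookup$id[match(user_code, user_lookup$code)]
identical(recovered, as.character(user))
```

The same pattern would apply to the item column if joke labels need to be restored after predictions are collected.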

# Split into training and testing datasets. Then store in Spark
jester_df_splits <- sdf_random_split(spark_jester_df, training=0.8, testing=0.2, seed = 613)

spark_train <- sdf_copy_to(sc, jester_df_splits$training, "spark_train", overwrite = TRUE)
spark_test <- sdf_copy_to(sc, jester_df_splits$testing, "spark_test", overwrite = TRUE)

# Train ALS model while monitoring elapsed time
tic()

model <- ml_als(spark_train,
                user_col = "user",
                item_col = "item",
                rating_col = "rating",
                max_iter = 5)

elapsed <- toc(quiet = TRUE)
spark_build_elapsed <- elapsed$toc - elapsed$tic
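The tic()/toc() pattern is repeated for every timed step in this project. As an alternative sketch, the same measurement could be wrapped in a small base R helper. The name time_it is hypothetical and not part of tictoc or sparklyr, and Sys.sleep stands in for the real workload:

```r
# Hypothetical helper: evaluates an expression and returns elapsed seconds
time_it <- function(expr) {
  start <- proc.time()[["elapsed"]]
  force(expr)  # the lazily passed expression is evaluated here
  proc.time()[["elapsed"]] - start
}

# Usage with a stand-in workload instead of ml_als()
elapsed_secs <- time_it(Sys.sleep(0.2))
elapsed_secs
```

Because R evaluates arguments lazily, the expression only runs inside the helper, so the measured interval covers the work itself.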


# Calculate predictions against test set from Spark model and "collect" it back to local R
tic()

spark_predictions <- model$.jobj %>%
  invoke("transform", spark_dataframe(spark_test)) %>%
  collect()

elapsed <- toc(quiet = TRUE)
spark_predict_elapsed <- elapsed$toc - elapsed$tic
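This project compares run times only, so the collected predictions are not scored. If accuracy were also of interest, a minimal sketch might look like the following. It assumes the collected data frame has rating and prediction columns, and that some predictions may come back as NaN, for example for users or items that did not appear in the training split. The toy values below are made up for illustration:

```r
# Toy stand-in for the collected predictions; values are invented
preds <- data.frame(rating     = c(2.5, -1.0, 7.2, 0.4),
                    prediction = c(2.0,  0.0, NaN, 1.4))

# Drop rows where no prediction was produced (NaN)
scored <- preds[!is.nan(preds$prediction), ]

# Root mean squared error over the rows that were scored
rmse <- sqrt(mean((scored$rating - scored$prediction)^2))
rmse
```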

spark_disconnect(sc)
## NULL

Recommenderlab ALS

At this stage, I’ll run the same model in the R recommenderlab package for comparison.

# Split training from testing data sets
eval_sets <- evaluationScheme(data = Jester5k,
                              method = "split",
                              train = 0.8,
                              given = 30,
                              goodRating = 3,
                              k = 5)

tic()

# Run ALS model
model2 <- Recommender(data = getData(eval_sets, "train"), method = "ALS")

elapsed <- toc(quiet = TRUE)
rec_build_elapsed <- elapsed$toc - elapsed$tic


# And save predictions
tic()

predictions2 <- predict(model2, getData(eval_sets, "known"), type = "ratingMatrix")

elapsed <- toc(quiet = TRUE)
rec_predict_elapsed <- elapsed$toc - elapsed$tic

It takes 12.18 seconds to build the Spark ALS model and only 2.14 seconds to generate predictions.
In recommenderlab, by contrast, it takes 0.06 seconds to build the model and 271.53 seconds to generate predictions.
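To compare the two end to end, the four timings can be put side by side. A small sketch in base R using only the numbers reported above; the data frame and column names are my own:

```r
# Elapsed seconds as reported in the text above
timings <- data.frame(
  framework = c("Spark", "recommenderlab"),
  build     = c(12.18, 0.06),
  predict   = c(2.14, 271.53)
)

# Total time per framework (build + predict)
timings$total <- timings$build + timings$predict
timings

# How many times longer recommenderlab took end to end
ratio <- timings$total[2] / timings$total[1]
round(ratio, 1)
```

On these figures Spark finishes in about 14 seconds in total versus about 272 seconds for recommenderlab, roughly a 19-fold difference, even though recommenderlab's build step alone is far quicker.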

Summary

It took a lot of trial and error to get Spark up and running properly, but now that it runs, it is much faster end to end: about 14 seconds in total versus about 272 seconds for recommenderlab. The two stages tell different stories, though. Recommenderlab builds its model almost instantly (0.06 seconds versus 12.18 for Spark) but takes far longer to make predictions (271.53 seconds versus 2.14), which suggests that most of its computation happens at prediction time. It’s also worth pointing out that ALS matrix factorization is the only collaborative filtering algorithm available in Spark, whereas the recommenderlab package offers more variety.
For larger datasets, I’d definitely say it makes sense to set up and use Spark, and perhaps even to experiment with additional nodes or a hosted environment.
For something smaller, recommenderlab or a similar package would do the job.

References: Measuring function execution time in R (https://stackoverflow.com/a/33375008)