Source: For this project I’ll use the Jester joke dataset I used for project 4. It has the highest density of the example datasets I was able to find online: http://eigentaste.berkeley.edu/dataset/
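# Load the packages and data used below
library(sparklyr)
library(dplyr)
library(tictoc)
library(recommenderlab)
data(Jester5k)
# Set up Spark environment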
# Sys.setenv(JAVA_HOME = "C:/PROGRA~2/Java/jre1.8.0_211")
# Establish connection
sc <- spark_connect(master = "local")
This was an iterative process. I found out the hard way that Spark needs the data frame columns to be numeric, so I’m converting them with as.numeric in this step:
jester.df <- as(Jester5k, 'data.frame')
jester.df$user <- as.numeric(jester.df$user)
jester.df$item <- as.numeric(jester.df$item)
# And then copying it over to Spark:
spark_jester_df <- sdf_copy_to(sc, jester.df, "spark_jester_df", overwrite = T)
# Split into training and testing datasets. Then store in Spark
jester_df_splits <- sdf_random_split(spark_jester_df, training=0.8, testing=0.2, seed = 613)
spark_train <- sdf_copy_to(sc, jester_df_splits$training, "spark_train", overwrite = T)
spark_test <- sdf_copy_to(sc, jester_df_splits$testing, "spark_test", overwrite = T)
# Train ALS model while monitoring elapsed time
tic()
model <- ml_als(spark_train,
user_col = "user",
item_col = "item",
rating_col = "rating",
max_iter = 5)
elapsed <- toc(quiet = T)
spark_build_elapsed <- elapsed$toc-elapsed$tic
# Calculate predictions against test set from Spark model and "collect" it back to local R
tic()
spark_predictions <- model$.jobj %>%
invoke("transform", spark_dataframe(spark_test)) %>%
collect()
elapsed <- toc(quiet = T)
spark_predict_elapsed <- elapsed$toc-elapsed$tic
spark_disconnect(sc)
## NULL
At this stage, I’ll run the same model with the R recommenderlab package for comparison:
# Split training from testing data sets
eval_sets <- evaluationScheme(data = Jester5k,
method = "split",
train = 0.8,
given = 30,
goodRating = 3,
k = 5)
tic()
# Run ALS model
model2 <- Recommender(data = getData(eval_sets, "train"), method = "ALS")
elapsed <- toc(quiet = T)
rec_build_elapsed <- elapsed$toc-elapsed$tic
# And save predictions
tic()
predictions2 <- predict(model2, getData(eval_sets, "known"), type = "ratingMatrix")
elapsed <- toc(quiet = T)
rec_predict_elapsed <- elapsed$toc-elapsed$tic
It takes 12.18 seconds to create the Spark ALS model, and only 2.14 seconds to compute predictions.
In recommenderlab, however, it takes 0.06 seconds to build the model and 271.53 seconds to compute predictions.
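To make the comparison easier to read, the four timings can be put side by side. The sketch below is only illustrative: it hard-codes the figures reported above rather than reading spark_build_elapsed and the other variables, and the names timings, total and ratio are my own.

```r
# Timings in seconds, copied from the figures reported above (hard-coded for illustration).
timings <- data.frame(
  engine  = c("Spark (sparklyr)", "recommenderlab"),
  build   = c(12.18, 0.06),
  predict = c(2.14, 271.53)
)

# End-to-end time per engine: model build plus prediction.
timings$total <- timings$build + timings$predict
print(timings)

# How many times longer recommenderlab takes end to end than Spark.
ratio <- timings$total[2] / timings$total[1]
print(round(ratio, 1))
```

End to end that is 14.32 seconds for Spark against 271.59 seconds for recommenderlab, roughly a 19-fold difference, almost all of it in the prediction step.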
It took a lot of trial and error to finally get Spark up and running properly, but now that it runs, it is much faster where it matters. By the timings above, Spark takes longer to build the model, but recommenderlab takes far longer to make predictions, so Spark comes out well ahead overall. It’s also worth pointing out that the only recommender algorithm available in Spark is ALS matrix factorization, whereas the recommenderlab package has more variety.
For larger datasets, I’d definitely say it makes sense to set up and use Spark, and perhaps even to tinker with additional nodes or move to a hosted environment.
For something smaller, recommenderlab or similar would do the job.
References: Measuring function execution time in R (Stack Overflow): https://stackoverflow.com/a/33375008
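The tic()/toc() pattern is repeated for every timed step in this write-up. As a rough sketch of how it could be wrapped up, here is a small helper; note that it swaps in base R’s system.time() instead of tictoc, and that time_expr is a hypothetical name of my own, not something from either package.

```r
# Hypothetical helper (not part of tictoc or sparklyr): evaluate an expression
# and return the elapsed wall-clock time in seconds.
time_expr <- function(expr) {
  # system.time() forces the lazily evaluated argument, which runs in the
  # caller's environment, so assignments made inside `expr` persist there.
  unname(system.time(expr)["elapsed"])
}

# Usage: time a computation while keeping its result.
elapsed_secs <- time_expr(total <- sum(sqrt(seq_len(1e6))))
print(elapsed_secs)
```

With a helper like this, each timed step becomes a single line, e.g. timing the model build and keeping the fitted model in one call.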