1 Implementing a Recommender System on Spark

In our previous assignment, we experimented with accuracy measures and incorporated serendipity into our recommender system. In its original form, the dataset we used for that assignment had approximately 1.5 million ratings, which we scaled down considerably.

In this assignment, we attempt to use the full dataset with Spark and then compare its performance with that of the models we built in Project 4 on the scaled-down dataset.

# packages for this chunk (assumed; the setup chunk is not shown)
library(tidyverse)
# dataset of ratings
training_raw <- read_csv("https://raw.githubusercontent.com/bsvmelo/Data612-Summer-2020/master/Project_4/training_subset.csv") %>% data.frame()

## Parsed with column specification:
## cols(
##   userId = col_double(),
##   movieId = col_double(),
##   rating = col_double(),
##   timestamp = col_double()
## )

# keep the first three columns: userId, movieId, rating
ratings_raw <- select(training_raw, 1:3)

1.1 Move Dataframe to Spark

Once the data frame was loaded into local memory, we connected to Spark and copied it to a Spark data frame:

# install Spark
spark_install(version = "2.4", hadoop_version = "2.7")
# create connection to Spark
sc <- spark_connect(master = "local")  
# copy the local data frame to Spark
ratingsSprk <- sdf_copy_to(sc, ratings_raw, "ratings_sprk", overwrite = TRUE)

1.2 Create Recommender Model in Spark

After loading the data frame into Spark, we built a recommender system using the ml_als function available in sparklyr, which builds collaborative filtering models using Alternating Least Squares (ALS).
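To illustrate the idea behind ALS, here is a minimal base R sketch on a toy matrix. It is not sparklyr's or Spark's implementation: the function name `als_toy`, the toy ratings, and the values for `rank`, `lambda`, and `iters` are all assumptions made for illustration. The sketch alternates between two ridge regressions, solving for the user factors with the item factors held fixed and then the reverse.

```r
# Toy ALS sketch in base R (illustrative only; not Spark's implementation).
set.seed(137)

# Small user x item rating matrix; NA marks an unobserved rating.
R <- matrix(c( 5,  3, NA,  1,
               4, NA, NA,  1,
               1,  1, NA,  5,
               1, NA,  5,  4,
              NA,  1,  5,  4),
            nrow = 5, byrow = TRUE)

als_toy <- function(R, rank = 2, lambda = 0.1, iters = 20) {
  n_users <- nrow(R)
  n_items <- ncol(R)
  U <- matrix(rnorm(n_users * rank, sd = 0.1), n_users, rank)  # user factors
  V <- matrix(rnorm(n_items * rank, sd = 0.1), n_items, rank)  # item factors
  reg <- lambda * diag(rank)                                   # ridge penalty
  for (it in seq_len(iters)) {
    # Hold V fixed: solve a ridge regression for each user's factors.
    for (u in seq_len(n_users)) {
      obs <- which(!is.na(R[u, ]))
      Vo <- V[obs, , drop = FALSE]
      U[u, ] <- solve(crossprod(Vo) + reg, crossprod(Vo, R[u, obs]))
    }
    # Hold U fixed: solve a ridge regression for each item's factors.
    for (i in seq_len(n_items)) {
      obs <- which(!is.na(R[, i]))
      Uo <- U[obs, , drop = FALSE]
      V[i, ] <- solve(crossprod(Uo) + reg, crossprod(Uo, R[obs, i]))
    }
  }
  list(U = U, V = V, pred = U %*% t(V))
}

fit <- als_toy(R)
observed <- !is.na(R)
# Error on the observed cells only; the NA cells are what the model fills in.
train_rmse <- sqrt(mean((fit$pred[observed] - R[observed])^2))
```

The entries of `fit$pred` at the NA positions of `R` are the predicted ratings. Each inner solve is a small linear system, which is part of why the method distributes well across users and items.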

We then used that model to make predictions and compiled accuracy metrics. In a later section, we compare these metrics with the same metrics generated by running the models exactly as we did in Project 4.

# create Spark recommender model
partitions <- ratingsSprk %>% sdf_random_split(training = 0.8, test = 0.2, seed = 137)
spark_mdl <- ml_als(partitions$training, rating ~ userId + movieId,
                    rating_col = "rating", user_col = "userId", item_col = "movieId",
                    cold_start_strategy = "drop")
summary(spark_mdl)

##                Length Class             Mode       
## pipeline_model  5     ml_pipeline_model list       
## formula         1     -none-            character  
## dataset         2     tbl_spark         list       
## pipeline        5     ml_pipeline       list       
## model          11     ml_als_model      list       
## .jobj           2     spark_jobj        environment

# predictions
sparkPredict <- ml_predict(spark_mdl,partitions$test)
# create top-10 item recommendations for each user
sparkRec <- ml_recommend(spark_mdl, type = "items", n = 10)
# check accuracy
rmse <- ml_regression_evaluator(sparkPredict, metric_name = "rmse")
mae <- ml_regression_evaluator(sparkPredict, metric_name = "mae")
mse <- ml_regression_evaluator(sparkPredict, metric_name = "mse")
als_spark <- c(RMSE = rmse, MSE = mse, MAE = mae)
# disconnect
spark_disconnect(sc)
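The three metrics requested from the evaluator above have simple definitions, which can be written out by hand in base R. The vectors `actual` and `predicted` below are made-up toy values for illustration, not output from the Spark model.

```r
# Toy ratings and predictions (hypothetical values, for illustration only).
actual    <- c(4, 3, 5, 2, 1)
predicted <- c(3.5, 3, 4, 2.5, 2)

err <- predicted - actual

toy_metrics <- c(
  RMSE = sqrt(mean(err^2)),  # root mean squared error
  MSE  = mean(err^2),        # mean squared error
  MAE  = mean(abs(err))      # mean absolute error
)
toy_metrics
```

RMSE is the square root of MSE, so the two always rank models the same way. MAE penalizes large errors less heavily and can rank them differently.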

1.3 Compare to Results from Project 4 Dataset

In the section below, we recreate the models and generate predictions just as we did in Project 4:

# coerce to a realRatingMatrix after removing duplicate rows
ratings_distinct <- distinct(ratings_raw)
ratings <- as(ratings_distinct, "realRatingMatrix")
# keep only movies that have been rated more than 200 times
ratings1 <- ratings[, colCounts(ratings) > 200]
# create evaluation scheme
eval_sets <- evaluationScheme(data = ratings1, method = "cross-validation", k = 4, given = 5, goodRating = 3)
# build UBCF model and SVD model
ubcf_rec <- Recommender(getData(eval_sets, "train"), "UBCF", param = list(normalize = "center", method = "cosine"))
svd_rec <- Recommender(getData(eval_sets, "train"), "SVD", param = list(normalize = "center", k = 10))
# Make predictions with each model
ubcf_pred <- predict(ubcf_rec, getData(eval_sets, "known"), type = "ratings")
svd_pred <- predict(svd_rec, getData(eval_sets, "known"), type = "ratings")
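To make the UBCF step above more concrete, here is a rough base R sketch of user-based collaborative filtering with mean-centering and cosine similarity. It is not recommenderlab's implementation, which has its own neighbourhood size and weighting options. The toy matrix and the helper names `cosine_sim` and `predict_ubcf` are assumptions made for illustration.

```r
# Toy user x item matrix; NA = not rated (hypothetical data).
R <- matrix(c( 5,  4, NA,  1,
               4,  5,  2, NA,
               1,  2,  5,  4,
              NA,  1,  4,  5),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("u", 1:4), paste0("m", 1:4)))

# Center each user's ratings on that user's mean (normalize = "center").
user_means <- rowMeans(R, na.rm = TRUE)
Rc <- R - user_means

# Cosine similarity over the items both users have rated.
cosine_sim <- function(a, b) {
  both <- !is.na(a) & !is.na(b)
  if (sum(both) == 0) return(0)
  denom <- sqrt(sum(a[both]^2)) * sqrt(sum(b[both]^2))
  if (denom == 0) 0 else sum(a[both] * b[both]) / denom
}

# Predict a rating as the user's mean plus a similarity-weighted average
# of the centered ratings from positively similar users who rated the item.
predict_ubcf <- function(user, item) {
  others <- setdiff(which(!is.na(R[, item])), user)
  sims <- vapply(others, function(o) cosine_sim(Rc[user, ], Rc[o, ]), numeric(1))
  keep <- sims > 0
  if (!any(keep)) return(user_means[[user]])
  user_means[[user]] + sum(sims[keep] * Rc[others[keep], item]) / sum(sims[keep])
}

pred_u1_m3 <- predict_ubcf(1, 3)
```

In this toy data only u2 is positively similar to u1, so the prediction for m3 is u1's mean shifted by u2's centered rating of that movie.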

1.4 Compare the UBCF and SVD Recommender Models to Spark Model

Now we can compare the results from the ALS model built using Spark with the UBCF and SVD models we created in Project 4:

# Table showing error calcs for UBCF vs SVD
ubcf_er <- calcPredictionAccuracy(ubcf_pred, getData(eval_sets, "unknown"))
svd_er <- calcPredictionAccuracy(svd_pred, getData(eval_sets, "unknown"))
# RMSE, MSE and MAE
k_Method <- c("ALS","UBCF-Cosine","SVD")
k_table_p <- data.frame(rbind(als_spark, ubcf_er, svd_er))
rownames(k_table_p) <- k_Method
k_table_p <- k_table_p[order(k_table_p$RMSE), ]
kable(k_table_p) %>% kable_styling()

Method       RMSE       MSE        MAE
ALS          0.6484466  0.4204830  0.4863415
UBCF-Cosine  0.8843518  0.7820781  0.6635512
SVD          0.8926244  0.7967784  0.6693957

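The ALS model has the lowest error on all three metrics: its RMSE of 0.648 compares with 0.884 for UBCF and 0.893 for SVD. The two evaluations are not identical, however. ALS was scored on a random 20% hold-out of the ratings, while UBCF and SVD were scored with 4-fold cross-validation (given = 5) on movies with more than 200 ratings.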
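As a quick consistency check, the reported figures can be re-entered by hand to confirm that each MSE equals the square of the corresponding RMSE, and to express the gap between models as a relative reduction in RMSE. The data frame `results` and the derived names below are introduced here for illustration. The numbers are copied from the table above.

```r
# Reported metrics, copied from the comparison table.
results <- data.frame(
  RMSE = c(0.6484466, 0.8843518, 0.8926244),
  MSE  = c(0.4204830, 0.7820781, 0.7967784),
  MAE  = c(0.4863415, 0.6635512, 0.6693957),
  row.names = c("ALS", "UBCF-Cosine", "SVD")
)

# MSE should match RMSE^2 up to rounding in the printed values.
max_gap <- max(abs(results$RMSE^2 - results$MSE))

# Relative RMSE reduction of ALS versus each Project 4 model.
rmse_reduction <- 1 - results["ALS", "RMSE"] /
  results[c("UBCF-Cosine", "SVD"), "RMSE"]
names(rmse_reduction) <- c("UBCF-Cosine", "SVD")
round(rmse_reduction, 3)
```

Under these figures, the ALS RMSE is roughly a quarter lower than that of either Project 4 model.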

2 References

https://towardsdatascience.com/prototyping-a-recommender-system-step-by-step-part-2-alternating-least-square-als-matrix-4a76c58714a1

DATA 612 - Summer 2020 - Project 5 | Implementing a Recommender System on Spark

Bruno de Melo and Leland Randles

July 9, 2020