In our previous assignment, we experimented with accuracy measures and incorporated serendipity into our recommender system. The dataset we used for that assignment, in its original form, had approximately 1.5 million ratings, which we scaled down considerably.
In this assignment, we will attempt to use the full dataset with Spark, then compare its performance to the models we built in Project 4 using the scaled-down dataset.
# dataset of ratings
training_raw <- read_csv("https://raw.githubusercontent.com/bsvmelo/Data612-Summer-2020/master/Project_4/training_subset.csv") %>% data.frame()
## Parsed with column specification:
## cols(
## userId = col_double(),
## movieId = col_double(),
## rating = col_double(),
## timestamp = col_double()
## )
Once the data frame was loaded into the local R session, we connected to Spark and copied it into a Spark DataFrame.
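The connection step itself is not shown in this write-up. A minimal sketch of what it might look like with sparklyr, assuming a local Spark installation is available; the helper name `connect_and_copy` and the table name `"ratings"` are hypothetical and not taken from the original project:

```r
# Hypothetical helper: connect to Spark and copy a local data frame into it.
# Assumes sparklyr is installed and a local Spark install exists
# (for example via sparklyr::spark_install()).
connect_and_copy <- function(df, master = "local", name = "ratings") {
  sc <- sparklyr::spark_connect(master = master)
  # overwrite = TRUE replaces any existing Spark table with the same name
  tbl <- sparklyr::sdf_copy_to(sc, df, name = name, overwrite = TRUE)
  list(sc = sc, tbl = tbl)
}
```

Calling `spark <- connect_and_copy(training_raw)` and then assigning `sc <- spark$sc` and `ratingsSprk <- spark$tbl` would give objects with the names used in the code that follows, assuming `training_raw` is the data frame that was meant to be copied.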
After loading the data frame to Spark, we built a recommender system using the ml_als function available in sparklyr, which builds collaborative filtering models using Alternating Least Squares (ALS).
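To make the idea behind ALS concrete, here is a toy sketch in base R. It is not Spark's implementation (which is distributed and handles regularization differently); it only illustrates the alternating step: hold the item factors fixed and solve a small ridge regression for each user, then hold the user factors fixed and do the same for each item. The function name `als_sketch`, the toy matrix, and the parameter values are all illustrative assumptions.

```r
# Toy ALS for explicit ratings; NA marks a missing rating.
als_sketch <- function(R, rank = 2, lambda = 0.1, iters = 20, seed = 137) {
  set.seed(seed)
  n_users <- nrow(R)
  n_items <- ncol(R)
  U <- matrix(rnorm(n_users * rank, sd = 0.1), n_users, rank)  # user factors
  V <- matrix(rnorm(n_items * rank, sd = 0.1), n_items, rank)  # item factors
  ridge <- lambda * diag(rank)

  for (it in seq_len(iters)) {
    # Fix item factors; solve a regularized least-squares problem per user.
    for (u in seq_len(n_users)) {
      obs <- which(!is.na(R[u, ]))
      if (length(obs) == 0) next
      Vo <- V[obs, , drop = FALSE]
      U[u, ] <- solve(crossprod(Vo) + ridge, crossprod(Vo, R[u, obs]))
    }
    # Fix user factors; solve a regularized least-squares problem per item.
    for (i in seq_len(n_items)) {
      obs <- which(!is.na(R[, i]))
      if (length(obs) == 0) next
      Uo <- U[obs, , drop = FALSE]
      V[i, ] <- solve(crossprod(Uo) + ridge, crossprod(Uo, R[obs, i]))
    }
  }
  list(U = U, V = V, fitted = U %*% t(V))
}

# Made-up 4 x 4 ratings matrix (users in rows, items in columns).
R <- matrix(c(5, 4, NA, 1,
              4, NA, 1, 1,
              1, 1, NA, 5,
              NA, 1, 5, 4), nrow = 4, byrow = TRUE)
fit <- als_sketch(R)

# Error on the observed entries only; the NA cells are the predictions of interest.
train_rmse <- sqrt(mean((fit$fitted - R)^2, na.rm = TRUE))
```

The entries of `fit$fitted` that correspond to `NA` cells in `R` play the role of predicted ratings, which is what `ml_predict` returns for the Spark model on a much larger scale.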
We then used that model to make predictions and compiled accuracy metrics. In a later section, we will compare these metrics to the same metrics generated by running the model exactly as we had in Project 4.
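For reference, the three metrics can be written out directly. The sketch below uses base R and made-up numbers, and the helper name `accuracy_metrics` is hypothetical; it simply mirrors the RMSE, MSE, MAE order used in the comparison table later on.

```r
# Hypothetical helper: error metrics from actual and predicted ratings.
accuracy_metrics <- function(actual, predicted) {
  err <- predicted - actual
  c(RMSE = sqrt(mean(err^2)),  # root mean squared error
    MSE  = mean(err^2),        # mean squared error
    MAE  = mean(abs(err)))     # mean absolute error
}

# Made-up ratings, for illustration only.
actual    <- c(4, 3, 5, 2)
predicted <- c(3.5, 3, 4, 3)
metrics <- accuracy_metrics(actual, predicted)
```

Because RMSE is the square root of MSE, the two always rank models in the same order; MAE can differ because it penalizes large errors less heavily.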
# create Spark recommender model
partitions <- ratingsSprk %>% sdf_random_split(training = 0.8, test = 0.2, seed = 137)
spark_mdl <- ml_als(partitions$training, rating ~ userId + movieId, rating_col = "rating", user_col = "userId", item_col = "movieId", cold_start_strategy = "drop")
summary(spark_mdl)
## Length Class Mode
## pipeline_model 5 ml_pipeline_model list
## formula 1 -none- character
## dataset 2 tbl_spark list
## pipeline 5 ml_pipeline list
## model 11 ml_als_model list
## .jobj 2 spark_jobj environment
# predictions
sparkPredict <- ml_predict(spark_mdl, partitions$test)
# create recommendation
sparkRec <- ml_recommend(spark_mdl, type = c("items", "users"), n = 10)
# check accuracy
rmse <- ml_regression_evaluator(sparkPredict, metric_name = "rmse")
mae <- ml_regression_evaluator(sparkPredict, metric_name = "mae")
mse <- ml_regression_evaluator(sparkPredict, metric_name = "mse")
als_spark <- c(rmse, mse, mae)
# disconnect
spark_disconnect(sc)
In the section below, we recreate the model and generate predictions just as we did in Project 4:
# coercing into realRatingMatrix
t <- distinct(ratings_raw)
ratings <- as(t, "realRatingMatrix")
# Subsetting training set with movies that have been rated more than 200 times
ratings1 <- ratings[,colCounts(ratings) > 200]
# create evaluation scheme
eval_sets <- evaluationScheme(data = ratings1, method = "cross-validation", k = 4, given = 5, goodRating = 3)
# build UBCF model and SVD model
ubcf_rec <- Recommender(getData(eval_sets, "train"), "UBCF", param = list(normalize = "center", method = "cosine"))
svd_rec <- Recommender(getData(eval_sets, "train"), "SVD", param = list(normalize = "center", k = 10))
# Make predictions with each model
ubcf_pred <- predict(ubcf_rec, getData(eval_sets, "known"), type = "ratings")
svd_pred <- predict(svd_rec, getData(eval_sets, "known"), type = "ratings")
Now we can compare the results from the ALS model built using Spark with the UBCF and SVD models we created in Project 4:
# Table showing error calcs for UBCF vs SVD
ubcf_er <- calcPredictionAccuracy(ubcf_pred, getData(eval_sets, "unknown"))
svd_er <- calcPredictionAccuracy(svd_pred, getData(eval_sets, "unknown"))
# RMSE, MSE and MAE
k_Method <- c("ALS","UBCF-Cosine","SVD")
k_table_p <- data.frame(rbind(als_spark,ubcf_er, svd_er))
rownames(k_table_p) <- k_Method
k_table_p <- k_table_p[order(k_table_p$RMSE ),]
kable(k_table_p) %>% kable_styling()
| | RMSE | MSE | MAE |
|---|---|---|---|
| ALS | 0.6484466 | 0.4204830 | 0.4863415 |
| UBCF-Cosine | 0.8843518 | 0.7820781 | 0.6635512 |
| SVD | 0.8926244 | 0.7967784 | 0.6693957 |
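As the table shows, the ALS model has the lowest RMSE, MSE and MAE of the three: its RMSE is 0.648, compared with 0.884 for UBCF and 0.893 for SVD. The comparison is not strictly like-for-like, since the ALS metrics come from a random 80/20 split with cold-start rows dropped, while the UBCF and SVD metrics come from the recommenderlab evaluation scheme (given = 5, restricted to movies with more than 200 ratings).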
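To express the gap in relative terms, the RMSE values can be compared directly. A small sketch in base R, with the numbers copied from the table above; the variable names are illustrative:

```r
# RMSE values copied from the comparison table.
rmse_tbl <- c(ALS = 0.6484466, `UBCF-Cosine` = 0.8843518, SVD = 0.8926244)

# Relative RMSE reduction of ALS against each of the other two models.
rel_reduction <- (rmse_tbl[-1] - rmse_tbl["ALS"]) / rmse_tbl[-1]
pct_reduction <- round(100 * rel_reduction, 1)
```

On these figures, the ALS RMSE is roughly 27% lower than either the UBCF or the SVD RMSE.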