This project requires using Spark for the comparison. The first obstacle I encountered was installing a Spark version that is compatible with the sparklyr package for R. I am going to use the MovieLense data set for the comparison. As I did more research on sparklyr, I found that it can fit a recommender system with ALS (alternating least squares).
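To make the idea behind ALS concrete, here is a minimal base R sketch of alternating least squares on a tiny ratings matrix. It is an illustration only, not Spark's implementation: the toy ratings, the function name `als_sketch`, and the values of `rank`, `lambda`, and `iters` are all assumptions made for this example.

```r
# Toy ratings matrix: rows are users, columns are items, NA = not rated.
# These numbers are made up for illustration only.
R <- matrix(c(5, 3, NA, 1,
              4, NA, NA, 1,
              1, 1, NA, 5,
              1, NA, NA, 4,
              NA, 1, 5, 4),
            nrow = 5, byrow = TRUE)

# Hypothetical helper: factor R into user factors U and item factors V
# by alternating regularized least-squares solves.
als_sketch <- function(R, rank = 2, lambda = 0.1, iters = 20, seed = 1) {
  set.seed(seed)
  U <- matrix(rnorm(nrow(R) * rank, sd = 0.1), nrow(R), rank)  # user factors
  V <- matrix(rnorm(ncol(R) * rank, sd = 0.1), ncol(R), rank)  # item factors
  ridge <- lambda * diag(rank)
  for (it in seq_len(iters)) {
    # Hold V fixed and solve a small ridge regression for each user
    for (u in seq_len(nrow(R))) {
      obs <- which(!is.na(R[u, ]))
      Vo <- V[obs, , drop = FALSE]
      U[u, ] <- solve(crossprod(Vo) + ridge, crossprod(Vo, R[u, obs]))
    }
    # Hold U fixed and solve a small ridge regression for each item
    for (i in seq_len(ncol(R))) {
      obs <- which(!is.na(R[, i]))
      Uo <- U[obs, , drop = FALSE]
      V[i, ] <- solve(crossprod(Uo) + ridge, crossprod(Uo, R[obs, i]))
    }
  }
  U %*% t(V)  # predicted rating for every user/item pair
}

pred <- als_sketch(R)
# RMSE on the ratings that were actually observed
train_rmse <- sqrt(mean((R - pred)^2, na.rm = TRUE))
```

Each pass only solves small `rank`-by-`rank` linear systems, one per user and one per item, which is why the method splits naturally across Spark workers.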
library(sparklyr)
library(recommenderlab)
library(dplyr)
# Connect to a local Spark instance
sc <- spark_connect(master = "local")
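If the connection fails because no compatible Spark version is installed, one way around it is to check what sparklyr has already downloaded and install a specific version before connecting. The helper below is a sketch under assumptions: the name `connect_local_spark` is hypothetical, and the right version string depends on your sparklyr release.

```r
library(sparklyr)

# Hypothetical helper: connect to a local Spark of a given version,
# installing it through sparklyr first if it is not present yet.
connect_local_spark <- function(version) {
  installed <- spark_installed_versions()
  have_it <- any(startsWith(as.character(installed$spark), version))
  if (!have_it) {
    spark_install(version = version)  # downloads Spark, needs network access
  }
  spark_connect(master = "local", version = version)
}

# Example call; "3.4" is an assumed version, pick one that
# spark_available_versions() lists for your setup:
# sc <- connect_local_spark("3.4")
```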
data(MovieLense, package = "recommenderlab")
movielense <- MovieLense
movies <- as(movielense,"data.frame")
head(movies)
##     user                                                 item rating
## 1      1                                     Toy Story (1995)      5
## 453    1                                     GoldenEye (1995)      3
## 584    1                                    Four Rooms (1995)      4
## 674    1                                    Get Shorty (1995)      3
## 883    1                                       Copycat (1995)      3
## 969    1 Shanghai Triad (Yao a yao yao dao waipo qiao) (1995)      5
# Map each movie title to an integer id (factor levels follow the sorted titles)
movies <- transform(movies, itemid = as.numeric(factor(item)))
# Spark's ALS needs numeric user and item ids; going through as.character keeps
# the original user ids whether `user` is a factor or a character vector
movies$user <- as.numeric(as.character(movies$user))
movies <- movies %>% select(-item)
head(movies)
##     user rating itemid
## 1      1      5   1525
## 453    1      3    618
## 584    1      4    555
## 674    1      3    594
## 883    1      3    344
## 969    1      5   1318
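Because the `item` column is dropped, the numeric `itemid` can no longer be traced back to a movie title. A small lookup table built before dropping the column would keep that mapping. The sketch below uses a made-up three-row data frame, and the names `ratings` and `item_lookup` are assumptions for illustration:

```r
# Made-up sample in the same shape as `movies` before `item` is dropped
ratings <- data.frame(
  user = c(1, 1, 2),
  item = c("Toy Story (1995)", "GoldenEye (1995)", "Toy Story (1995)"),
  rating = c(5, 3, 4),
  stringsAsFactors = FALSE
)
ratings$itemid <- as.numeric(factor(ratings$item))

# One row per movie: itemid -> title
item_lookup <- unique(ratings[, c("itemid", "item")])
item_lookup <- item_lookup[order(item_lookup$itemid), ]

# Translate ids (for example from a prediction table) back to titles
titles <- item_lookup$item[match(c(2, 1), item_lookup$itemid)]
```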
# Wide user-by-item rating matrix: one row per user, one column per itemid
movies_wide <- reshape(movies, idvar = "user", timevar = "itemid", direction = "wide") %>%
  arrange(user)
rownames(movies_wide) <- movies_wide$user
movies_wide <- movies_wide %>% select(-user)
# Copy the ratings table to Spark
sp_movies <- sdf_copy_to(sc, movies, "spmovies", overwrite = TRUE)
# Random 70/30 split into training and test sets
partitions <- sp_movies %>% sdf_random_split(training = 0.7, test = 0.3)
sp_movies_training <- partitions$training
sp_movies_test <- partitions$test
head(sp_movies_training)
## # Source: spark<?> [?? x 3]
##    user rating itemid
##   <dbl>  <dbl>  <dbl>
## 1     1      1     33
## 2     1      1     48
## 3     1      1    111
## 4     1      1    134
## 5     1      1    135
## 6     1      1    136
# Fit ALS with 10 latent factors on the training split
model <- ml_als(sp_movies_training, rating_col = "rating", user_col = "user",
                item_col = "itemid", rank = 10)
# Score the test split and pull the result into a local data frame
predictions <- ml_predict(model, sp_movies_test)
predictions <- data.frame(predictions)
# Per-rating error and squared error
predictions$difference <- predictions$rating - predictions$prediction
predictions$difference_square <- predictions$difference^2
head(predictions)
##   user rating itemid prediction difference difference_square
## 1  868      4     12   4.256855 -0.2568555        0.06597474
## 2  503      5     13   2.500457  2.4995427        6.24771377
## 3   17      5     14   2.523404  2.4765956        6.13352596
## 4  759      3     14   3.278862 -0.2788620        0.07776401
## 5   52      4     18   3.610735  0.3892653        0.15152747
## 6  232      4     18   3.413062  0.5869379        0.34449610
# RMSE on the test split (rows with a missing prediction are ignored)
sqrt(mean(predictions$difference_square, na.rm = TRUE))
## [1] 0.9361416
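The `na.rm = TRUE` matters because, by default, Spark's ALS returns `NaN` for users or items that appear only in the test split, and those rows are silently left out of the RMSE. The sketch below makes that step explicit and also counts the dropped rows. The function name `rmse` and the three sample ratings are made up for illustration:

```r
# Hypothetical helper: RMSE over the rows that have a usable prediction
rmse <- function(actual, predicted) {
  keep <- is.finite(predicted)
  sqrt(mean((actual[keep] - predicted[keep])^2))
}

# Made-up example: the third prediction is missing (cold-start case)
actual <- c(4, 5, 3)
predicted <- c(3, 5, NaN)

score <- rmse(actual, predicted)        # squared errors 1 and 0, mean 0.5
n_dropped <- sum(!is.finite(predicted)) # rows excluded from the score
```

Reporting the number of dropped rows next to the RMSE shows how much of the test set the score actually covers.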
Conclusion: Working with Spark was a great experience, although I still have to use recommenderlab for further recommender system experiments. sparklyr has some inconvenient parts: loading the packages and starting the program takes a little while, but computing the ALS predictions is very fast. In contrast, recommenderlab takes little time to start but somewhat longer to compute ALS. In terms of accuracy, the Spark RMSE is about 0.94, which is pretty good and better than the RMSE of around 1.2 from collaborative filtering in my Project 4. Overall, considering accuracy, run time, and the size of the data set together, Spark may be a good place to start.