Project 5 Data 612

This project requires using Spark for the comparison. The first obstacle I encountered was installing a Spark version that works with the sparklyr package for R. I am going to use the MovieLense data set from recommenderlab for the comparison. As I did more research on sparklyr, I found that it can build a recommender system with ALS (alternating least squares).

library(sparklyr)
library(recommenderlab)
library(dplyr)
# Connect to a local Spark instance
sc <- spark_connect(master = "local")
data(MovieLense, package = "recommenderlab")

movielense <- MovieLense
movies <- as(movielense,"data.frame")
head(movies)

##     user                                                 item rating
## 1      1                                     Toy Story (1995)      5
## 453    1                                     GoldenEye (1995)      3
## 584    1                                    Four Rooms (1995)      4
## 674    1                                    Get Shorty (1995)      3
## 883    1                                       Copycat (1995)      3
## 969    1 Shanghai Triad (Yao a yao yao dao waipo qiao) (1995)      5

# Spark's ALS needs numeric IDs, so map each movie title to an integer code
movies <- transform(movies, itemid = as.numeric(factor(item)))
# convert user to numeric as well (for a factor this yields the level codes)
movies$user <- as.numeric(movies$user)
# drop the title column now that itemid replaces it
movies <- movies %>% select(-item)
head(movies)

##     user rating itemid
## 1      1      5   1525
## 453    1      3    618
## 584    1      4    555
## 674    1      3    594
## 883    1      3    344
## 969    1      5   1318
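
Because the title column is dropped, any later result carries only `itemid`. Here is a minimal sketch, assuming a small hypothetical `toy` data frame rather than the real data, of how `as.numeric(factor())` assigns the codes and how a lookup table could map them back to titles. The names `toy`, `item_lookup`, and `recovered` are illustrative only.

```r
# Hypothetical toy ratings; the titles are taken from the head() output above
toy <- data.frame(
  user   = c(1, 1, 2),
  item   = c("Toy Story (1995)", "GoldenEye (1995)", "Toy Story (1995)"),
  rating = c(5, 3, 4),
  stringsAsFactors = FALSE
)

# factor() sorts the distinct titles; as.numeric() returns each title's position
item_factor <- factor(toy$item)
toy$itemid  <- as.numeric(item_factor)

# Lookup table: one row per code, so an itemid can be translated back to a title
item_lookup <- data.frame(
  itemid = seq_along(levels(item_factor)),
  item   = levels(item_factor),
  stringsAsFactors = FALSE
)

# Recover the title for each rating row from its itemid
recovered <- item_lookup$item[match(toy$itemid, item_lookup$itemid)]
```

Keeping a table like `item_lookup` before dropping `item` would make it possible to report predictions by movie title instead of by code.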

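# Wide user-by-item rating table (one column per itemid); the Spark steps below use the long format instead
movies_wide <- reshape(movies, idvar = "user", timevar = "itemid", direction = "wide") %>%
    arrange(user)
rownames(movies_wide) <- movies_wide$user
movies_wide <- movies_wide %>% select(-user)

# Copy the long-format table to Spark and split it 70/30 into training and test sets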
sp_movies <- sdf_copy_to(sc,movies,"spmovies",overwrite = TRUE)
partitions <- sp_movies %>% sdf_random_split(training = 0.7, test = 0.3)
sp_movies_training <- partitions$training
sp_movies_test <- partitions$test
head(sp_movies_training)

## # Source: spark<?> [?? x 3]
##    user rating itemid
##   <dbl>  <dbl>  <dbl>
## 1     1      1     33
## 2     1      1     48
## 3     1      1    111
## 4     1      1    134
## 5     1      1    135
## 6     1      1    136

# Fit ALS on the training set with 10 latent factors
model <- ml_als(sp_movies_training, rating_col = "rating", user_col = "user", item_col = "itemid", rank = 10)
predictions <- ml_predict(model, sp_movies_test)

# Bring the predictions into R and compute the error for each row
predictions <- data.frame(predictions)
predictions$difference <- predictions$rating - predictions$prediction
predictions$difference_square <- predictions$difference^2

head(predictions)

##   user rating itemid prediction difference difference_square
## 1  868      4     12   4.256855 -0.2568555        0.06597474
## 2  503      5     13   2.500457  2.4995427        6.24771377
## 3   17      5     14   2.523404  2.4765956        6.13352596
## 4  759      3     14   3.278862 -0.2788620        0.07776401
## 5   52      4     18   3.610735  0.3892653        0.15152747
## 6  232      4     18   3.413062  0.5869379        0.34449610

# RMSE on the test set; na.rm = TRUE skips rows with no prediction
sqrt(mean(predictions$difference_square, na.rm = TRUE))

## [1] 0.9361416
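
The RMSE step can also be wrapped in a small helper. This is only a sketch: `rmse` is a hypothetical function name, and the six values are the rows printed by `head(predictions)`, so the result describes those six rows and not the full test set. By default Spark's ALS returns NaN for users or items that did not appear in the training split, which is why missing predictions are dropped.

```r
# Hypothetical helper: root mean squared error, ignoring missing predictions
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2, na.rm = TRUE))
}

# The six rows shown by head(predictions) above
actual    <- c(4, 5, 5, 3, 4, 4)
predicted <- c(4.256855, 2.500457, 2.523404, 3.278862, 3.610735, 3.413062)

rmse_head <- rmse(actual, predicted)

# A NaN prediction (as for an unseen user or item) is skipped by na.rm = TRUE
rmse_with_gap <- rmse(c(actual, 3), c(predicted, NaN))
```

Both calls give the same value, because the row with the NaN prediction contributes nothing to the mean.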

Conclusion: Working with Spark was a great experience, although I still need recommenderlab for further experiments with recommender systems. sparklyr has some inconvenient parts: loading the packages and starting the program takes a little while, but computing the ALS predictions is very fast. In contrast, recommenderlab starts quickly but takes a little longer to compute ALS. On accuracy, the Spark RMSE is about 0.94, which is pretty good and better than the RMSE of around 1.2 from collaborative filtering in my Project 4. Overall, considering the accuracy, the running time, and the size of the data set, Spark is a good place to start.
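
To put numbers on the timing comparison in the conclusion, each step could be timed with base R's `system.time()`. The sketch below assumes the step fits in a single expression; `elapsed_seconds` is a hypothetical helper, and the `Sys.sleep()` call only stands in for a real step such as the `ml_als()` fit.

```r
# Hypothetical helper: wall-clock seconds taken to evaluate an expression
elapsed_seconds <- function(expr) {
  unname(system.time(expr)["elapsed"])
}

# Sys.sleep() is a stand-in for a real step, e.g. the ml_als() call
fit_time <- elapsed_seconds(Sys.sleep(0.2))
```

Timing the same step in Spark and in recommenderlab this way would give comparable figures.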

Vivian Kong

7/2/2019