Recommender System for Movies - Implementation on Spark

GroupLens Research has collected and made available rating data sets from the MovieLens web site (http://movielens.org). The data sets were collected over various periods of time. The selected dataset has ~100K movie ratings (1-5) from ~600 users on ~9000 movies.

Users were selected at random for inclusion. All selected users had rated at least 20 movies. No demographic information is included. Each user is represented by an id, and no other information is provided.

The data are contained in the files links.csv, movies.csv, ratings.csv and tags.csv.
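As a rough sketch of what loading involves, the snippet below parses rows in the ratings.csv layout (userId, movieId, rating, timestamp) and counts ratings, distinct users and distinct movies. It is written in Python rather than the R used for the analysis. The `summarize_ratings` helper and the four sample rows are made up for illustration and are not taken from the dataset.

```python
import csv
import io

# Made-up sample in the ratings.csv layout; not real MovieLens data.
SAMPLE = """userId,movieId,rating,timestamp
1,1,4.0,964982703
1,3,4.0,964981247
2,1,3.0,1445714835
3,50,5.0,1306463578
"""

def summarize_ratings(handle):
    """Return (n_ratings, n_users, n_movies, min_rating, max_rating)."""
    users, movies, ratings = set(), set(), []
    for row in csv.DictReader(handle):
        users.add(int(row["userId"]))
        movies.add(int(row["movieId"]))
        ratings.append(float(row["rating"]))
    return len(ratings), len(users), len(movies), min(ratings), max(ratings)

summary = summarize_ratings(io.StringIO(SAMPLE))
print(summary)
```

To run it on the real file, pass `open("ratings.csv", newline="")` in place of the in-memory sample.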

Loading datasets, Package Installation

The sparklyr and tictoc packages were installed for this analysis.
Recommender based on Alternating Least Squares (ALS) - Spark Implementation
Evaluation output from the Spark run (MSE, RMSE and MAE, in that order, followed by the elapsed time):
## [1] 0.7716416
## [1] 0.8784313
## [1] 0.6782136
## 46.45 sec elapsed
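For readers unfamiliar with ALS, the idea can be sketched in a few lines of Python. This is a deliberately simplified illustration and not the Spark implementation: it uses a single latent factor per user and per movie (rank 1), so each least-squares update reduces to a scalar formula. The function names, the regularization value and the toy ratings are all assumptions made for the example.

```python
import math
import random

def objective(ratings, u, v, reg):
    """Regularized squared error that each ALS update minimizes in turn."""
    fit = sum((r - u[i] * v[j]) ** 2 for i, j, r in ratings)
    penalty = reg * (sum(x * x for x in u) + sum(x * x for x in v))
    return fit + penalty

def als_rank1(ratings, n_users, n_items, reg=0.1, iters=10, seed=0):
    """Rank-1 ALS on (user, item, rating) triples."""
    rng = random.Random(seed)
    u = [rng.uniform(0.5, 1.5) for _ in range(n_users)]
    v = [rng.uniform(0.5, 1.5) for _ in range(n_items)]
    history = [objective(ratings, u, v, reg)]
    for _ in range(iters):
        # Hold item factors fixed; each user factor has a closed-form ridge solution.
        for i in range(n_users):
            rated = [(j, r) for a, j, r in ratings if a == i]
            u[i] = sum(r * v[j] for j, r in rated) / (reg + sum(v[j] ** 2 for j, _ in rated))
        # Hold user factors fixed; solve for each item factor the same way.
        for j in range(n_items):
            rated = [(i, r) for i, b, r in ratings if b == j]
            v[j] = sum(r * u[i] for i, r in rated) / (reg + sum(u[i] ** 2 for i, _ in rated))
        history.append(objective(ratings, u, v, reg))
    return u, v, history

# Made-up (user, movie, rating) triples, for illustration only.
TOY = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 2.0), (2, 2, 1.0)]
u, v, history = als_rank1(TOY, n_users=3, n_items=3)
train_rmse = math.sqrt(sum((r - u[i] * v[j]) ** 2 for i, j, r in TOY) / len(TOY))
print(f"objective: {history[0]:.3f} -> {history[-1]:.3f}, training RMSE: {train_rmse:.3f}")
```

Because each half-step solves its subproblem exactly, the objective never increases from one sweep to the next. Production implementations use more than one factor and solve a small linear system per user and per item instead of a scalar division.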
Model            RMSE        MSE         MAE
recommenderlab   1.0021374   1.0042793   0.7664364
Spark            0.8784313   0.7716416   0.6782136
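The three metrics are closely related: MSE is the mean of the squared errors, RMSE is its square root, and MAE is the mean of the absolute errors. The Python sketch below computes them for a handful of made-up ratings and then checks that the RMSE and MSE values reported for both models are consistent with each other. The `regression_metrics` helper and the toy ratings are assumptions for illustration; the reported numbers are copied from the comparison.

```python
import math

def regression_metrics(actual, predicted):
    """Return RMSE, MSE and MAE for paired actual and predicted ratings."""
    errors = [p - a for a, p in zip(actual, predicted)]
    mse = sum(e * e for e in errors) / len(errors)
    mae = sum(abs(e) for e in errors) / len(errors)
    return {"RMSE": math.sqrt(mse), "MSE": mse, "MAE": mae}

# Made-up ratings and predictions, for illustration only.
toy = regression_metrics([4, 3, 5, 2], [3.5, 3, 4, 3])
print(toy)

# Reported (RMSE, MSE) pairs: squaring the RMSE should give the MSE.
reported = {
    "recommenderlab": (1.0021374, 1.0042793),
    "Spark": (0.8784313, 0.7716416),
}
consistent = {name: abs(rmse ** 2 - mse) < 1e-6 for name, (rmse, mse) in reported.items()}
print(consistent)
```

Both reported pairs pass the check, so the RMSE and MSE columns agree with each other for each model.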
Conclusion

The Spark implementation of the ALS-based recommender model outperformed recommenderlab in both main areas: execution time and accuracy.

The speed advantage was expected, since it is one of the main benefits of a distributed data/analytics platform. Even when installed as a single local node, Spark uses multithreading to parallelize the computation.

Spark ran roughly 6x faster than recommenderlab.

(R can also be run as a parallel engine, using RStudio Server and the multithreaded execution in Microsoft R Open.)

Spark running in a multi-server, high-density environment, such as the cloud, can scale well beyond a single machine, which makes it possible to address the most challenging data science and machine learning problems. Cloud elasticity makes it possible to minimize compute and maintenance costs while maximizing efficiency.

The Spark implementation was also more accurate: 0.88 RMSE versus 1.00 RMSE for recommenderlab, about 12% lower. I suspect Spark has a better ALS implementation, but the exact reason is not clear to me.