Overview
The goal of this project is to practice working with a distributed recommender system. I extend my Project 3 recommender system with Spark and Sparklyr by adding a model to the original analysis. Project 3 included UBCF and SVD models; here we use Spark to build an ALS model. We then compare the accuracy of the three approaches and discuss the Spark / Sparklyr experience and its potential benefits. Part 1 includes the original Project 3 analysis. Part 2 includes the new ALS model built with Spark and Sparklyr. We take a step-by-step approach with this new tool.
Part 1 - The Original Project 3 Work
The Data Set
The data set is courtesy of the MovieLens project and was downloaded from https://grouplens.org/datasets/movielens/. Please note: I reduced the size of the movie matrix so that my circa 1990 Mac Mini could handle the load. The data set comprises two files, rates and titles. We used the skimr package to explore the data. SVD models require no missing data, and skimr lets us know where we stand in that regard.
Split the Data
To train and test our models, we need to split our data into training and testing sets. We use an 80-20 split, with given set to 7 and goodRating set to 3.5.
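To make the split concrete, here is a minimal Python sketch of what an 80-20 split with given = 7 and a good-rating threshold of 3.5 can mean. It is not the recommenderlab code used in this project: the function names (split_ratings, is_good) and the toy ratings are invented for illustration, and Python is used only so the example runs without any R packages. The assumptions are that users are divided 80-20 between training and testing, that each test user reveals 7 ratings to the model while the rest are held out for scoring, and that a rating of 3.5 or higher counts as good.

```python
import random


def split_ratings(ratings, train_frac=0.8, given=7, seed=42):
    """Split users into train/test; for test users keep `given` ratings as known.

    ratings: dict mapping user -> dict mapping item -> rating (hypothetical layout).
    Returns (train, known, unknown).
    """
    rng = random.Random(seed)
    users = sorted(ratings)
    rng.shuffle(users)
    n_train = int(round(train_frac * len(users)))
    train = {u: dict(ratings[u]) for u in users[:n_train]}
    known, unknown = {}, {}
    for u in users[n_train:]:
        items = sorted(ratings[u])
        rng.shuffle(items)
        # The model sees `given` ratings; the remainder is held out for scoring.
        known[u] = {i: ratings[u][i] for i in items[:given]}
        unknown[u] = {i: ratings[u][i] for i in items[given:]}
    return train, known, unknown


def is_good(rating, good_rating=3.5):
    """A rating at or above the threshold counts as a 'good' rating."""
    return rating >= good_rating


# Toy data (invented): 10 users, each rating the same 10 items.
toy = {f"u{u}": {f"m{i}": float((u + i) % 5 + 1) for i in range(10)}
       for u in range(10)}
train, known, unknown = split_ratings(toy)
print(len(train), len(known), len(unknown))
```

The good-rating threshold mainly matters when recommendations are scored as hits or misses; error measures such as RMSE use the raw ratings.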
Build Models
We will build a User-Based Collaborative Filtering (UBCF) model and an SVD model. We will compare each model’s performance, based on RMSE, as well as the time required to build and predict under each methodology.
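For reference, the error measures reported in this analysis (RMSE, MSE and MAE) can be computed as in the short Python sketch below. The function name accuracy and the sample numbers are invented for illustration; this is not the code that produced the results in this write-up.

```python
import math


def accuracy(predicted, actual):
    """Return RMSE, MSE and MAE for two equal-length lists of ratings."""
    errors = [p - a for p, a in zip(predicted, actual)]
    mse = sum(e * e for e in errors) / len(errors)
    mae = sum(abs(e) for e in errors) / len(errors)
    return {"RMSE": math.sqrt(mse), "MSE": mse, "MAE": mae}


# Invented example: four predicted ratings against the held-out true ratings.
scores = accuracy([3.5, 4.0, 2.0, 5.0], [4.0, 4.0, 3.0, 4.5])
print(scores)
```

Because RMSE is the square root of MSE, the two columns of a results table can be checked against each other: for the UBCF row reported in this write-up, 0.8682038 squared is approximately 0.7537779.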
User-Based Collaborative Model
See Table 1 below for the performance results of the UBCF model.
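As a rough illustration of the idea behind user-based collaborative filtering, the Python sketch below predicts a missing rating as a similarity-weighted average of the ratings given by the most similar users. This is a simplified stand-in, not recommenderlab's implementation: the function names, the choice of plain cosine similarity over co-rated items, the neighborhood size k and the toy ratings are all assumptions made for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two users over the items both have rated."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    norm_a = math.sqrt(sum(a[i] ** 2 for i in common))
    norm_b = math.sqrt(sum(b[i] ** 2 for i in common))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def predict_ubcf(ratings, user, item, k=2):
    """Similarity-weighted average of the k most similar users who rated `item`."""
    neighbors = [(cosine(ratings[user], r), r[item])
                 for other, r in ratings.items()
                 if other != user and item in r]
    top = sorted(neighbors, reverse=True)[:k]
    total = sum(sim for sim, _ in top)
    if total <= 0:
        return None  # no usable neighbors
    return sum(sim * rating for sim, rating in top) / total


# Toy ratings (invented): "ann" has not rated item "m3".
toy = {
    "ann": {"m1": 5, "m2": 4},
    "bob": {"m1": 5, "m2": 4, "m3": 2},
    "cat": {"m1": 1, "m2": 2, "m3": 5},
}
print(predict_ubcf(toy, "ann", "m3", k=1))
print(predict_ubcf(toy, "ann", "m3", k=2))
```

With k = 1 the prediction simply copies the rating of the single most similar user; with a larger neighborhood it is pulled toward the other neighbors' ratings. Production implementations usually center each user's ratings first so that similarity reflects taste rather than overall rating level.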
Singular Value Decomposition (SVD) Model
Next we create the SVD model. After some tuning, k = 20 was used in the final SVD model. Table 2 below sets forth the performance of the SVD model.
UBCF Prediction
| Movie |
| Pulp Fiction (1994) |
| Godfather, The (1972) |
| Taxi Driver (1976) |
| Silence of the Lambs, The (1991) |
| Star Wars: Episode IV - A New Hope (1977) |
SVD Prediction
| Movie |
| Butch Cassidy and the Sundance Kid (1969) |
| Crying Game, The (1992) |
| Star Wars: Episode IV - A New Hope (1977) |
| Raising Arizona (1987) |
| Godfather: Part II, The (1974) |
UBCF vs SVD
The two approaches yield similar results. Each algorithm recommended a Star Wars movie and a Godfather movie. I can also see similarities between Taxi and Raising Arizona (light hearted and funny). From here, UBCF recommended two great, albeit violent, movies, while SVD went with Butch Cassidy and the Sundance Kid and The Crying Game. These pairs don’t seem to be too closely related.
Part 2 - ALS With Spark and Sparklyr
ALS Model Using Spark
We already have UBCF and SVD models, so now we will use Spark to create an ALS model. Here are the steps for this analysis.
Establish Connection to Spark Server - local in this case.
Prepare the data
We are using the same data set as above, just assigning some Spark-inspired names to the variables.
Create training and test data frames
Move to Spark server
This is the key step: copying the data to the Spark server for processing. The move to Spark is accomplished with a simple copy command (sdf_copy_to).
Build ALS model in spark using the ml_als command
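To show what alternating least squares does under the hood, here is a small, single-machine Python sketch. It is not the Spark implementation behind ml_als and does not reproduce this project's results: the function names, the toy ratings, the rank of 2 and the plain ridge penalty are assumptions for illustration, and Spark's version differs in details such as how regularization is scaled and how the work is distributed. The idea is to hold the item factors fixed and solve a small regularized least-squares problem for each user, then hold the user factors fixed and do the same for each item, and repeat.

```python
import random


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting.

    A is small and positive definite here (ridge term on the diagonal).
    """
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def update(fixed, ratings_by_row, rank, reg):
    """Re-fit one side's factors while the other side's factors stay fixed."""
    out = {}
    for row, observed in ratings_by_row.items():
        A = [[reg if i == j else 0.0 for j in range(rank)] for i in range(rank)]
        b = [0.0] * rank
        for other, rating in observed.items():
            f = fixed[other]
            for i in range(rank):
                b[i] += rating * f[i]
                for j in range(rank):
                    A[i][j] += f[i] * f[j]
        out[row] = solve(A, b)
    return out


def predict(users, items, u, i):
    """Predicted rating is the dot product of the user and item factors."""
    return sum(a * b for a, b in zip(users[u], items[i]))


def objective(triples, users, items, reg):
    """Squared error on observed ratings plus the ridge penalty."""
    sq = sum((r - predict(users, items, u, i)) ** 2 for u, i, r in triples)
    pen = sum(x * x for f in users.values() for x in f)
    pen += sum(x * x for f in items.values() for x in f)
    return sq + reg * pen


def als(triples, rank=2, reg=0.1, iters=20, seed=1):
    """Alternate between user and item updates; return factors and loss history."""
    rng = random.Random(seed)
    by_user, by_item = {}, {}
    for u, i, r in triples:
        by_user.setdefault(u, {})[i] = r
        by_item.setdefault(i, {})[u] = r
    items = {i: [rng.random() for _ in range(rank)] for i in by_item}
    users, history = {}, []
    for _ in range(iters):
        users = update(items, by_user, rank, reg)
        items = update(users, by_item, rank, reg)
        history.append(objective(triples, users, items, reg))
    return users, items, history


# Toy (user, item, rating) triples, invented for the example.
toy = [("u0", "m0", 5), ("u0", "m1", 5), ("u0", "m2", 1),
       ("u1", "m0", 5), ("u1", "m1", 5), ("u1", "m3", 1),
       ("u2", "m0", 1), ("u2", "m2", 5), ("u2", "m3", 5),
       ("u3", "m1", 1), ("u3", "m2", 5), ("u3", "m3", 5)]
users, items, history = als(toy)
print([round(h, 3) for h in history[:5]])       # loss after each of the first sweeps
print(round(predict(users, items, "u0", "m3"), 2))  # a rating u0 never gave
```

Each update solves its sub-problem exactly, so the regularized loss cannot increase from one sweep to the next. Because every user (and every item) can be solved independently of the others, the updates are easy to spread across workers, which is one reason ALS suits a distributed engine.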
Compare Results and Discuss Spark
The UBCF and SVD models performed better than the ALS model (RMSE of about 0.87 versus 1.03), but that is not the headline.
Spark and Sparklyr give the average R programmer the ability to harness the power and speed of Spark while staying in the friendly confines of R. In my experience, Spark did not seem to build the model any faster than recommenderlab; however, predictions were significantly faster. This appears to be because Spark loads everything into memory, so once you have a working model it sits in memory, ready to respond. As a result, Spark would be a much better platform for deploying a system at scale.
In conclusion, moving to a distributed architecture would seem advisable when data sets are large, processing is computationally demanding and / or minimizing processing time is critical.
| Model | RMSE | MSE | MAE |
| RecLabs_UBCF | 0.8682038 | 0.7537779 | 0.6706194 |
| RecLab_SVD | 0.8731257 | 0.7623486 | 0.6761709 |
| Spark ALS | 1.0261927 | 1.0530714 | 0.7928117 |