The goal of this project is to give you practice beginning to work with a distributed recommender system. It is sufficient for this assignment to build out your application on a single node.
Adapt one of your recommendation systems to work with Apache Spark and compare the performance with your previous iteration.
Consider the efficiency of the system and the added complexity of using Spark. You may complete the
assignment using PySpark (Python), SparkR (R), sparklyr (R), or Scala.
Please include in your conclusion: For your given recommender system’s data, algorithm(s), and
(envisioned) implementation, at what point would you see moving to a distributed platform such as Spark becoming necessary?
You may work on any platform of your choosing, including Databricks Community Edition or in local mode. You are encouraged but not required to work in a small group on this project.
For Project 5, I will build on Project 4 using the same MovieLens dataset and evaluate the models' accuracy using RMSE, MSE and MAE.
The data used here is the MovieLense data set shipped with the recommenderlab package. As the summary below shows, it contains 99,392 ratings of 1,664 movies by 943 users.
## 943 x 1664 rating matrix of class 'realRatingMatrix' with 99392 ratings.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 19.0 32.0 64.0 105.4 147.5 735.0
## [1] "data" "normalize"
## [1] "realRatingMatrix"
## attr(,"package")
## [1] "recommenderlab"
## [1] "Toy Story (1995)"
## [2] "GoldenEye (1995)"
## [3] "Four Rooms (1995)"
## [4] "Get Shorty (1995)"
## [5] "Copycat (1995)"
## [6] "Shanghai Triad (Yao a yao yao dao waipo qiao) (1995)"
## [7] "Twelve Monkeys (1995)"
## [8] "Babe (1995)"
## [9] "Dead Man Walking (1995)"
## [10] "Richard III (1995)"
## Formal class 'realRatingMatrix' [package "recommenderlab"] with 2 slots
## ..@ data :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## .. .. ..@ i : int [1:99392] 0 1 4 5 9 12 14 15 16 17 ...
## .. .. ..@ p : int [1:1665] 0 452 583 673 882 968 994 1386 1605 1904 ...
## .. .. ..@ Dim : int [1:2] 943 1664
## .. .. ..@ Dimnames:List of 2
## .. .. .. ..$ : chr [1:943] "1" "2" "3" "4" ...
## .. .. .. ..$ : chr [1:1664] "Toy Story (1995)" "GoldenEye (1995)" "Four Rooms (1995)" "Get Shorty (1995)" ...
## .. .. ..@ x : num [1:99392] 5 4 4 4 4 3 1 5 4 5 ...
## .. .. ..@ factors : list()
## ..@ normalize: NULL
ratingvalues <- as.vector(MovieLense@data)
unique(ratingvalues) # Ratings are whole numbers from 1 to 5; a 0 marks a user-movie pair with no rating.
## [1] 5 4 0 3 1 2
## ratingvalues
## 0 1 2 3 4 5
## 1469760 6059 11307 27002 33947 21077
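Because as.vector() expands the sparse matrix, the 0 entries counted above are unrated user-movie cells rather than ratings. A minimal sketch, using a small made-up vector in place of ratingvalues, of how the zeros could be dropped before tabulating or plotting, and how the sparsity follows from the counts printed above:

```r
# Toy stand-in for as.vector(MovieLense@data); the values are made up.
# A 0 marks a user-movie pair with no rating, not a rating of zero.
toy_ratingvalues <- c(5, 4, 0, 3, 0, 0, 1, 2, 0, 5)

# Keep only observed ratings before tabulating or plotting.
observed <- toy_ratingvalues[toy_ratingvalues != 0]
print(table(observed))

# Share of empty cells (sparsity) in the toy vector.
toy_sparsity <- mean(toy_ratingvalues == 0)
print(toy_sparsity)

# The same ratio from the counts printed above:
# 1,469,760 empty cells out of 943 * 1664 cells in total.
movielense_sparsity <- 1469760 / (943 * 1664)
print(round(movielense_sparsity, 3))
```

Passing only the observed values to hist() would keep the empty cells from dominating the plot.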
hist(ratingvalues,
     breaks = 5,
     main = "Distribution of Ratings",
     xlab = "Ratings",
     col = "grey",
     freq = TRUE
)
# Pre-processing
Convert data to numeric
## user item rating
## 1 1 Toy Story (1995) 5
## 453 1 GoldenEye (1995) 3
## 584 1 Four Rooms (1995) 4
## 674 1 Get Shorty (1995) 3
## 883 1 Copycat (1995) 3
## 969 1 Shanghai Triad (Yao a yao yao dao waipo qiao) (1995) 5
ratingsMat <- sparseMatrix(i = movies$user, j = movies$itemid, x = movies$rating,
dims = c(length(unique(movies$user)), length(unique(movies$itemid))),
dimnames = list(paste("u", 1:length(unique(movies$user)), sep = ""),
paste("m", 1:length(unique(movies$itemid)), sep = "")))
ratingsReal <- new("realRatingMatrix", data = ratingsMat)
ratingsReal
## 943 x 1664 rating matrix of class 'realRatingMatrix' with 99392 ratings.
I’m selecting the users who have rated at least 100 movies and the movies that have been rated at least 150 times.
Use 80% of the users for training the model (the scheme below reports a training set proportion of 0.800)
## Evaluation scheme with 10 items given
## Method: 'split' with 1 run(s).
## Training set proportion: 0.800
## Good ratings: >=4.000000
## Data set: 358 x 200 rating matrix of class 'realRatingMatrix' with 32713 ratings.
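The filtering and split summarized above could be set up roughly as follows. This is a sketch under assumptions: it starts from recommenderlab's built-in MovieLense object instead of the report's own matrix, the seed is arbitrary, and the object names are illustrative, so the resulting dimensions may differ from the 358 x 200 shown.

```r
library(recommenderlab)
data(MovieLense)

# Assumed filter: users with at least 100 ratings, movies rated at least 150 times.
dense <- MovieLense[rowCounts(MovieLense) >= 100,
                    colCounts(MovieLense) >= 150]
# given = 10 needs more than 10 ratings per user, so drop any user below that.
dense <- dense[rowCounts(dense) > 10, ]

set.seed(1)  # arbitrary seed so the split is reproducible
scheme <- evaluationScheme(dense, method = "split", train = 0.8,
                           given = 10, goodRating = 4)
print(scheme)

train_set   <- getData(scheme, "train")    # users used to fit the models
known_set   <- getData(scheme, "known")    # 10 ratings per test user shown to the model
unknown_set <- getData(scheme, "unknown")  # held-out ratings used for scoring
```

With method = "split", whole users are assigned to the training or test side, and each test user's ratings are then divided into the "known" and "unknown" parts.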
ModelAlgorithms <- list(
  "ALS" = list(name="ALS", param=list(normalize = "Z-score")),
  "Popular" = list(name="POPULAR", param=list(normalize = "Z-score")),
  # minRating is not a UBCF parameter in this recommenderlab version; it triggers the warning below
  "UserBased" = list(name="UBCF", param=list(normalize = "Z-score",
                                             method="Cosine",
                                             nn=50, minRating=3)),
  # "IBCF2" is not a registered method (item-based CF is "IBCF"), so this entry fails below
  "ItemBased" = list(name="IBCF2", param=list(normalize = "Z-score"))
)
## ALS run fold/sample [model time/prediction time]
## 1 [0.02sec/7.05sec]
## POPULAR run fold/sample [model time/prediction time]
## 1 [0.04sec/0.02sec]
## UBCF run fold/sample [model time/prediction time]
## 1
## Warning: Unknown parameters: minRating
## Available parameter (with default values):
## method = cosine
## nn = 25
## sample = FALSE
## weighted = TRUE
## normalize = center
## min_matching_items = 0
## min_predictive_items = 0
## verbose = FALSE
## [0sec/0.17sec]
## IBCF2 run fold/sample [model time/prediction time]
## 1
## Timing stopped at: 0 0 0
## Error in .local(data, ...) :
## Recommender method IBCF2 not implemented for data type realRatingMatrix .
## Warning in .local(x, method, ...):
## Recommender 'ItemBased' has failed and has been removed from the results!
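The output above shows that the item-based model failed because "IBCF2" is not a recommender method registered for realRatingMatrix data (the item-based method is registered as "IBCF"), so it was removed from the results. The minRating parameter passed to UBCF was also reported as unknown.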
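For comparison, an algorithm list that avoids both messages might look like the sketch below. It is an illustration, not the run reported here: it uses the registered name "IBCF", drops minRating, leaves out ALS to keep the run short, and rebuilds an assumed evaluation scheme from recommenderlab's built-in MovieLense data.

```r
library(recommenderlab)
data(MovieLense)

# Assumed data preparation, mirroring the thresholds described in the text.
dense <- MovieLense[rowCounts(MovieLense) >= 100,
                    colCounts(MovieLense) >= 150]
dense <- dense[rowCounts(dense) > 10, ]

set.seed(1)  # arbitrary seed
scheme <- evaluationScheme(dense, method = "split", train = 0.8,
                           given = 10, goodRating = 4)

# "IBCF" is the registered item-based method; minRating is omitted.
algorithms <- list(
  "Popular"   = list(name = "POPULAR", param = list(normalize = "Z-score")),
  "UserBased" = list(name = "UBCF", param = list(normalize = "Z-score",
                                                 method = "cosine", nn = 50)),
  "ItemBased" = list(name = "IBCF", param = list(normalize = "Z-score"))
)

# type = "ratings" scores predicted ratings with RMSE, MSE and MAE.
results <- evaluate(scheme, algorithms, type = "ratings")
print(lapply(results, avg))
```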
ALS seems to be the worst-performing model, with the highest RMSE, MSE and MAE. Therefore, I will implement ALS in Spark to see whether its performance improves.
## [1] "C:\\Users\\Emahayz_Pro\\AppData\\Local/spark/spark-2.4.3-bin-hadoop2.7"
## spark hadoop
## 1 2.4.3 2.7
## dir
## 1 C:\\Users\\Emahayz_Pro\\AppData\\Local/spark/spark-2.4.3-bin-hadoop2.7
I have connected to Spark locally and will copy the data to Spark. First, I will convert the data to the format sparklyr requires, as shown below.
Split for training and testing
View the Training set
## # Source: spark<?> [?? x 4]
## user item rating itemid
## <dbl> <chr> <dbl> <dbl>
## 1 1 12 Angry Men (1957) 5 4
## 2 1 20,000 Leagues Under the Sea (1954) 3 7
## 3 1 2001: A Space Odyssey (1968) 4 8
## 4 1 Abyss, The (1989) 3 18
## 5 1 Air Bud (1997) 1 33
## 6 1 Akira (1988) 4 37
View the Testing set
## # Source: spark<?> [?? x 4]
## user item rating itemid
## <dbl> <chr> <dbl> <dbl>
## 1 1 101 Dalmatians (1996) 2 3
## 2 1 Ace Ventura: Pet Detective (1994) 3 19
## 3 1 Aladdin (1992) 4 38
## 4 1 Alien (1979) 5 43
## 5 1 Austin Powers: International Man of Mystery (1997) 4 106
## 6 1 Bad Boys (1995) 2 117
## 7 1 Blade Runner (1982) 5 186
## 8 1 Cape Fear (1991) 3 264
## 9 1 Clerks (1994) 5 316
## 10 1 Copycat (1995) 3 344
## # ... with more rows
Spark Model
## user item rating itemid prediction
## 1 839 8 Heads in a Duffel Bag (1997) 1 12 3.660742
## 2 225 8 Seconds (1994) 4 13 3.738235
## 3 254 8 Seconds (1994) 4 13 2.838886
## 4 429 8 Seconds (1994) 2 13 3.453169
## 5 707 A Chef in Love (1996) 4 14 4.257363
## 6 794 A Chef in Love (1996) 4 14 3.984950
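The connect, copy, split, fit and predict steps that produce output like the above can be sketched with sparklyr as follows. The helper name, its arguments, the seed and the cold_start_strategy setting are assumptions for illustration; the report's own chunk may differ.

```r
# Hypothetical helper; the name and arguments are illustrative, not from the report.
# Expects a local data frame with numeric columns user, itemid and rating.
fit_spark_als <- function(ratings_df, train_share = 0.8, seed = 42) {
  # Local-mode connection; closed again when the function exits.
  sc <- sparklyr::spark_connect(master = "local")
  on.exit(sparklyr::spark_disconnect(sc), add = TRUE)

  # Copy the local data frame into Spark.
  ratings_tbl <- dplyr::copy_to(sc, ratings_df, "ratings", overwrite = TRUE)

  # Random split into training and test partitions.
  splits <- sparklyr::sdf_random_split(ratings_tbl,
                                       training = train_share,
                                       test = 1 - train_share,
                                       seed = seed)

  # ALS on the numeric ids; "drop" removes test rows whose user or item
  # was not seen in training, which would otherwise get NaN predictions.
  model <- sparklyr::ml_als(splits$training, rating ~ user + itemid,
                            cold_start_strategy = "drop")

  # Score the test partition and bring the result back as a local data frame.
  predictions <- sparklyr::ml_predict(model, splits$test)
  dplyr::collect(predictions)
}
```

Calling fit_spark_als() on the converted ratings data frame would need a local Spark installation, for example one set up with sparklyr::spark_install().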
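SparkMSE <- mean((Sparkpred$rating - Sparkpred$prediction)^2)
SparkRMSE <- sqrt(SparkMSE)
# Column names are case-sensitive: the original run used Sparkpred$Rating here, which is NULL, so the MAE reported below is NaN
SparkMAE <- mean(abs(Sparkpred$rating - Sparkpred$prediction))
Disconnect from Spark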
ModelName <- ("Spark Model")
Model_RMSE <- ("0.9214")
Model_MSE <- ("0.8490")
Model_MAE <- ("NaN")
Model_Performance <- data.frame(ModelName,Model_RMSE,Model_MSE,Model_MAE)
Model_Performance
## ModelName Model_RMSE Model_MSE Model_MAE
## 1 Spark Model 0.9214 0.8490 NaN
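The three error metrics in the table can be illustrated on a small made-up data frame. The values below are invented for the example; the last line shows why a misspelled column name such as Rating yields NaN for the MAE.

```r
# Made-up predictions standing in for the collected Spark output.
toy_pred <- data.frame(rating     = c(1, 4, 4, 2),
                       prediction = c(3.5, 3.5, 3.0, 3.5))

errors <- toy_pred$rating - toy_pred$prediction
mse  <- mean(errors^2)     # mean squared error
rmse <- sqrt(mse)          # root mean squared error
mae  <- mean(abs(errors))  # mean absolute error
print(c(RMSE = rmse, MSE = mse, MAE = mae))

# A misspelled column name returns NULL, so the difference is an empty
# vector and its mean is NaN.
mae_typo <- mean(abs(toy_pred$Rating - toy_pred$prediction))
print(mae_typo)
```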
The purpose of this exercise was to build a recommender system and to practice working with a distributed recommender system.
For the given data set, ALS performed worst compared to the Popular and User Based models. Given that poor performance, I decided to build an ALS model with sparklyr to see whether a distributed recommender system would give better ALS performance.
Performance improved with Spark: the Spark ALS model reached an RMSE of 0.9214 and an MSE of 0.8490, both lower than the values for the recommenderlab ALS model. The MAE could not be compared because it evaluated to NaN. The distributed recommender system therefore shows an advantage here, even though the implementation is more complex.
Breese JS, Heckerman D, Kadie C (1998). “Empirical Analysis of Predictive Algorithms for Collaborative Filtering.” In Uncertainty in Artificial Intelligence. Proceedings of the Fourteenth Conference, pp. 43-52.
Kohavi, Ron (1995). “A study of cross-validation and bootstrap for accuracy estimation and model selection”. Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, pp. 1137-1143.
Koren, Y., Bell, R., & Volinsky, C. (2009). Matrix Factorization Techniques for Recommender Systems. Computer, 42(8), 30–37. https://doi.org/10.1109/MC.2009.263
Michael Hahsler (2016). recommenderlab: A Framework for Developing and Testing Recommendation Algorithms, R package. https://CRAN.R-project.org/package=recommenderlab