library(tidyverse)
library(kableExtra)
library(knitr)
library(recommenderlab)
library(dplyr)
library(ggplot2)         
library(ggrepel)         
library(tictoc)
library(sparklyr)

Goal

The goal of this project is to give you practice beginning to work with a distributed recommender system.

Deliverables

  1. It is sufficient for this assignment to build out your application on a single node.

  2. Adapt one of your recommendation systems to work with Apache Spark and compare the performance with your previous iteration.

  3. Consider the efficiency of the system and the added complexity of using Spark.

  4. You may complete the assignment using PySpark (Python), SparkR (R), sparklyr (R), or Scala.

Thesis

For this project, I will continue to work on the Amazon Product Reviews data set for video game titles. In addition, I will be using the sparklyr package to host a local instance of Spark.

I expect that efficiency will improve significantly even in a local instance, where the calculations are still performed on the same machine.

Dataset

Amazon Open Data Product Reviews (Index, Readme, Registry).

Amazon Customer Reviews (a.k.a. Product Reviews) is one of Amazon’s iconic products. In a period of over two decades since the first review in 1995, millions of Amazon customers have contributed over a hundred million reviews to express opinions and describe their experiences regarding products on the Amazon.com website. This makes Amazon Customer Reviews a rich source of information for academic researchers in the fields of Natural Language Processing (NLP), Information Retrieval (IR), and Machine Learning (ML), amongst others. Accordingly, we are releasing this data to further research in multiple disciplines related to understanding customer product experiences. Specifically, this dataset was constructed to represent a sample of customer evaluations and opinions, variation in the perception of a product across geographical regions, and promotional intent or bias in reviews.

Since there are tons of products on Amazon, I decided to focus on Video Game Reviews.

I posted my partial copy of the Amazon Video Game Product Reviews data to my GitHub.

Data Import

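# CSV Import
# stringsAsFactors = TRUE keeps UserId and ProductId as factors; this stopped being
# the default in R 4.0, and the as.integer()/levels() calls below rely on it
vg_ratings <- read.csv("https://raw.githubusercontent.com/josephsimone/Data-612/master/project_4/ratings_Amazon.csv",
                       stringsAsFactors = TRUE)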

Data Preparation for recommenderlab

# Sparse Matrix
vg_ratings_matrix <- sparseMatrix(as.integer(vg_ratings$UserId), as.integer(vg_ratings$ProductId), x = vg_ratings$Rating)
colnames(vg_ratings_matrix) <- levels(vg_ratings$ProductId)
rownames(vg_ratings_matrix) <- levels(vg_ratings$UserId)

For the recommendation system to work in recommenderlab, I had to build the sparse matrix manually. The sparsity was too high, so I had to subset the ratings so that every user has a “sufficient” number of reviews.
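The filtering step described above is not shown in the code, so the following is only a minimal sketch of one way it could be done, using a small made-up data frame. The toy data, the `min_reviews` threshold, and the helper names `filter_sparse_users` and `sparsity` are assumptions for illustration, not the values or functions used in this project. It also assumes each user rates a product at most once.

```r
# Toy ratings in the same long format as vg_ratings (values are made up)
toy_ratings <- data.frame(
  UserId    = c("u1", "u1", "u1", "u2", "u3", "u3"),
  ProductId = c("p1", "p2", "p3", "p1", "p2", "p3"),
  Rating    = c(5, 4, 3, 2, 5, 1)
)

# Hypothetical helper: keep only users with at least `min_reviews` ratings
filter_sparse_users <- function(ratings, min_reviews) {
  reviews_per_user <- table(ratings$UserId)
  keep <- names(reviews_per_user)[reviews_per_user >= min_reviews]
  ratings[ratings$UserId %in% keep, , drop = FALSE]
}

# Hypothetical helper: share of empty cells in the implied user x item matrix
sparsity <- function(ratings) {
  n_cells <- length(unique(ratings$UserId)) * length(unique(ratings$ProductId))
  1 - nrow(ratings) / n_cells
}

filtered <- filter_sparse_users(toy_ratings, min_reviews = 2)
sparsity(toy_ratings)  # sparsity before filtering
sparsity(filtered)     # sparsity after dropping users with too few reviews
```

Dropping the user with a single review lowers the sparsity of the toy matrix, which is the effect the manual subsetting is meant to achieve.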

# Matrix Creation
vg_real_matrix <- as(vg_ratings_matrix, "realRatingMatrix")
# Setup
set.seed(123)
e <- evaluationScheme(vg_real_matrix, method = "split", train = 0.8, given=5, goodRating = 3)
train <- getData(e, "train")
known <- getData(e, "known")
unknown <- getData(e, "unknown")
# Training
tic()
rALS_model <- Recommender(train, method = "ALS")
train_time <- toc(quiet = TRUE)

ALS Model with recommenderlab

# Model with R
tic()
ALS_rpred <- predict(rALS_model, newdata = known, type = "ratings")
predict_time <- toc(quiet = TRUE)

# Elapsed times in seconds
Train <- round(train_time$toc - train_time$tic, 2)
Predict <- round(predict_time$toc - predict_time$tic, 2)

count <- data.frame(Method = "recommenderlab", Train = Train, Predict = Predict)
# Metrics
rals_accuracy <- calcPredictionAccuracy(ALS_rpred, unknown)

Spark ALS Model

Much like with recommenderlab, similar modeling can be done with Spark. First, however, you must set up a local Spark instance to host the environment. From there, you can import the data frames you want to process into Spark. You can then fit models, calculate predictions, and compare results, all within the Apache Spark environment.

To connect to Apache Spark, I will be using the sparklyr package, an R interface to Apache Spark, a fast and general engine for big data processing. The package supports connecting to local and remote Apache Spark clusters, provides a dplyr-compatible back end, and provides an interface to Spark’s built-in machine learning algorithms.

# Spark Connection
c <- spark_connect(master = "local")

Data Preparation for Spark

# Data Prep
sp_df <- vg_ratings
sp_df$UserId <- as.integer(sp_df$UserId)
sp_df$ProductId <- as.integer(sp_df$ProductId)

Unlike recommenderlab, Spark did not require me to convert the data into a sparse matrix before it could be processed.

# Setup
temp_train <- sample(x = c(TRUE, FALSE), size = nrow(sp_df),
                      replace = TRUE, prob = c(0.8, 0.2))
spark_train <- sp_df[temp_train, ]
spark_test <- sp_df[!temp_train, ]
# Migration
spark_train <- sdf_copy_to(c, spark_train, "train_ratings", overwrite = TRUE)
spark_test <- sdf_copy_to(c, spark_test, "test_ratings", overwrite = TRUE)
# Modeling with Spark
tic()
als_spark <- ml_als(spark_train, max_iter = 5, nonnegative = TRUE, 
                   rating_col = "Rating", user_col = "UserId", item_col = "ProductId")
train_time <- toc(quiet = TRUE)
# Prediction
tic()
predict_spark <- als_spark$.jobj %>%
  invoke("transform", spark_dataframe(spark_test)) %>%
  collect()
predict_time <- toc(quiet = TRUE)

count <- rbind(count, data.frame(Method = "Spark", 
                                   Train = round(train_time$toc - train_time$tic, 2), 
                                   Predict = round(predict_time$toc - predict_time$tic, 2)))

# Drop NA predictions (users or products in the test split that were absent from the training split)
predict_spark <- predict_spark[!is.na(predict_spark$prediction), ]
# Metrics
spark_mse<- mean((predict_spark$Rating - predict_spark$prediction)^2)
spark_rsme <- sqrt(spark_mse)
spark_mae <- mean(abs(predict_spark$Rating - predict_spark$prediction))
# Spark Disconnect
spark_disconnect(c)

Evaluation of Both ALS Models

metrics <- rbind(rals_accuracy, data.frame(RMSE = spark_rsme, MSE = spark_mse, MAE = spark_mae))
rownames(metrics) <- c("ALS recommenderlab Model", "ALS Spark Model")
knitr::kable(metrics, format = "html") %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))

Model                      RMSE      MSE       MAE
ALS recommenderlab Model   1.383589  1.914317  1.076951
ALS Spark Model            1.350034  1.822591  1.039740
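For reference, the three error metrics in the table above can be written out in a few lines of base R. This is a sketch on made-up vectors. The names `actual`, `predicted`, and `error_metrics` are hypothetical and are not objects from the analysis.

```r
# Hypothetical helper returning the same three metrics reported in the table
error_metrics <- function(actual, predicted) {
  err <- actual - predicted
  mse <- mean(err^2)                       # mean squared error
  c(RMSE = sqrt(mse), MSE = mse, MAE = mean(abs(err)))
}

actual    <- c(5, 3, 4, 1)  # made-up observed ratings
predicted <- c(4, 3, 2, 2)  # made-up predicted ratings
metrics_example <- error_metrics(actual, predicted)
metrics_example
```

Because RMSE and MSE square the errors, they penalize a few large misses more heavily than MAE does, which is why all three are reported side by side.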
knitr::kable(count, format = "html", row.names = FALSE) %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))
Method          Train   Predict
recommenderlab   0.00    285.91
Spark            8.69      2.11

Conclusion

The main theme to point out is the difference in performance between these two models. With both sets of calculations running on the same hardware, Apache Spark delegates resources more efficiently, which results in significantly faster prediction times than recommenderlab (2.11 versus 285.91 seconds). Training the model takes a bit longer in Spark (8.69 versus 0.00 seconds), because during this time Spark is distributing jobs among the allocated resources. The near-zero recommenderlab training time suggests that most of its work happens at prediction time, so the combined train and predict time is the fairer comparison.

Apache Spark improved the overall performance of the ALS model, even when hosted as a local instance. This is a major draw of using a distributed system. However, Spark has the more complex implementation process.
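To put the timing table in a single number, the sketch below adds up the train and predict times copied from that table and takes their ratio. The variable names are hypothetical, and the times are taken to be in seconds, as reported by tictoc.

```r
# Timings copied from the table above (seconds)
timings <- data.frame(
  Method  = c("recommenderlab", "Spark"),
  Train   = c(0.00, 8.69),
  Predict = c(285.91, 2.11)
)
timings$Total <- timings$Train + timings$Predict

# Ratio of total recommenderlab time to total Spark time
speedup <- timings$Total[1] / timings$Total[2]
round(speedup, 1)
```

On these numbers the end-to-end Spark run is roughly 26 times faster. This reflects a single run on one machine, so it should be read as an indication rather than a benchmark.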