# Libraries
library(recommenderlab)
library(dplyr)
library(tidyr)
library(tictoc)
library(sparklyr)
Connecting to a local Spark instance.
sc <- spark_connect(master = "local")
As requested, I will use the same dataset as in Project 4: video game reviews from the Amazon open data repository. I will use the sparklyr library and a local instance of Spark. My theory is that, even though a local instance runs on the same hardware, data handling will be more efficient because Spark can spread the work across the available CPU cores.
The dataset is provided by Amazon Open Data and describes product ratings on a scale of 1 to 5 stars. It is listed, among other useful datasets, at https://s3.amazonaws.com/amazon-reviews-pds/tsv/index.txt; for a fuller description of the datasets, see https://s3.amazonaws.com/amazon-reviews-pds/readme.html.
For this project I took a subset of the more than 1 million records available in the dataset. The sparsity of the matrix was too high, so I adjusted it manually, adding some ratings so that every user had “enough” for the recommender to work.
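To make the sparsity concern concrete, here is a minimal sketch of how sparsity and the number of ratings per user could be checked. The `toy` data frame and its column names are hypothetical stand-ins, not the Amazon subset, and the calculation is only one reasonable way to define sparsity (the share of empty user-item cells).

```r
library(Matrix)

# Toy ratings (hypothetical data, not the Amazon subset)
toy <- data.frame(
  user   = c(1, 1, 2, 3, 3, 3),
  item   = c(1, 3, 2, 1, 2, 4),
  rating = c(5, 3, 4, 2, 5, 1)
)

# Users in rows, items in columns, ratings as the stored values
m <- sparseMatrix(i = toy$user, j = toy$item, x = toy$rating)

# Share of user-item cells that hold no rating
sparsity <- 1 - nnzero(m) / (nrow(m) * ncol(m))

# Number of ratings per user, to see which users fall below a chosen minimum
ratings_per_user <- rowSums(m != 0)
```

A check like this matters here because an evaluation scheme that withholds a fixed number of ratings per user can only work with users who have rated more items than that number.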
# Data import
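# UserId and ProductId must be read as factors (the default before R 4.0),
# because the data prep step relies on as.integer() and levels() of those columns
ratings <- read.csv("https://raw.githubusercontent.com/sortega7878/DATA612/master/ratings_Amazon.csv",
stringsAsFactors = TRUE)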
Instead of the methods used in the prior project, I will benchmark ALS in two ways: once with regular R code (recommenderlab) and once with Spark.
# Data prep
ratingsMatrix <- sparseMatrix(as.integer(ratings$UserId), as.integer(ratings$ProductId), x = ratings$Rating)
colnames(ratingsMatrix) <- levels(ratings$ProductId)
rownames(ratingsMatrix) <- levels(ratings$UserId)
amazon <- as(ratingsMatrix, "realRatingMatrix")
# Train/test split
set.seed(88)
eval <- evaluationScheme(amazon, method = "split", train = 0.8, given = 5, goodRating = 3)
train <- getData(eval, "train")
known <- getData(eval, "known")
unknown <- getData(eval, "unknown")
# Training
tic()
modelALS <- Recommender(train, method = "ALS")
train_time <- toc(quiet = TRUE)
# Predicting
tic()
predALS <- predict(modelALS, newdata = known, type = "ratings")
predict_time <- toc(quiet = TRUE)
# Elapsed times in seconds
Training <- round(train_time$toc - train_time$tic, 2)
Predicting <- round(predict_time$toc - predict_time$tic, 2)
timing <- data.frame(Method = "recommenderlab", Training = Training, Predicting = Predicting)
# Accuracy
accALS <- calcPredictionAccuracy(predALS, unknown)
Similar modeling can be done with Spark. The general process is simple: set up a local Spark instance, copy the relevant data frames into Spark, fit the model, run predictions, and compare results. As in previous projects, the data is split into training and testing sets (80/20 split) and results are evaluated mostly using RMSE.
# Connection
sc <- spark_connect(master = "local")
## Re-using existing Spark connection to local
# Prepare data
spark_df <- ratings
spark_df$UserId <- as.integer(spark_df$UserId)
spark_df$ProductId <- as.integer(spark_df$ProductId)
# Split for training and testing
which_train <- sample(x = c(TRUE, FALSE), size = nrow(spark_df),
replace = TRUE, prob = c(0.8, 0.2))
train_df <- spark_df[which_train, ]
test_df <- spark_df[!which_train, ]
# Move to Spark
spark_train <- sdf_copy_to(sc, train_df, "train_ratings", overwrite = TRUE)
spark_test <- sdf_copy_to(sc, test_df, "test_ratings", overwrite = TRUE)
# Build model
tic()
sparkALS <- ml_als(spark_train, max_iter = 5, nonnegative = TRUE,
rating_col = "Rating", user_col = "UserId", item_col = "ProductId")
train_time <- toc(quiet = TRUE)
# Run prediction
tic()
sparkPred <- sparkALS$.jobj %>%
invoke("transform", spark_dataframe(spark_test)) %>%
collect()
predict_time <- toc(quiet = TRUE)
timing <- rbind(timing, data.frame(Method = "Spark",
Training = round(train_time$toc - train_time$tic, 2),
Predicting = round(predict_time$toc - predict_time$tic, 2)))
sparkPred <- sparkPred[!is.na(sparkPred$prediction), ] # Remove NaN due to data set splitting
# Calculate error
mseSpark <- mean((sparkPred$Rating - sparkPred$prediction)^2)
rmseSpark <- sqrt(mseSpark)
maeSpark <- mean(abs(sparkPred$Rating - sparkPred$prediction))
# Disconnect
spark_disconnect(sc)
## NULL
Because the splits are randomized, the two models were not evaluated on exactly the same test data. Even so, the RMSE values are fairly close (slightly better in Spark). The difference may be attributable to the data split, and I am not sure whether Spark itself has anything to do with it.
accuracy <- rbind(accALS, data.frame(RMSE = rmseSpark, MSE = mseSpark, MAE = maeSpark))
rownames(accuracy) <- c("recommenderlab ALS", "Spark ALS")
knitr::kable(accuracy, format = "html") %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))
| | RMSE | MSE | MAE |
|---|---|---|---|
| recommenderlab ALS | 1.386013 | 1.921033 | 1.084437 |
| Spark ALS | 1.353287 | 1.831384 | 1.047647 |
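For reference, the three error metrics in the table above can be computed from paired vectors of actual and predicted ratings. The sketch below uses a hypothetical helper, `error_metrics()`, and toy values rather than the Amazon data. It drops pairs with no prediction, which mirrors how missing Spark predictions are removed before the error is calculated.

```r
# Hypothetical helper: RMSE, MSE and MAE for paired rating vectors
error_metrics <- function(actual, predicted) {
  keep <- !is.na(actual) & !is.na(predicted)  # drop pairs without a usable prediction (NA or NaN)
  err  <- actual[keep] - predicted[keep]
  c(RMSE = sqrt(mean(err^2)), MSE = mean(err^2), MAE = mean(abs(err)))
}

# Toy values, not taken from the Amazon data
actual    <- c(5, 3, 4, 1, 2)
predicted <- c(4, 3, 5, NaN, 4)

metrics <- error_metrics(actual, predicted)
```

RMSE penalizes large misses more heavily than MAE, so a model with a few badly wrong predictions can look worse on RMSE while looking similar on MAE.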
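The big difference is in performance. Although both pipelines were executed on the same hardware, Spark seems to make better use of the available resources: its prediction step was dramatically faster than the traditional method's, even though its training step was slower. The near-zero training time for recommenderlab suggests that most of its work is deferred to the prediction step, so the total time per method is probably the fairer comparison. Spark distributes jobs across the available processor cores, which speeds up the work. Times are in seconds.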
knitr::kable(timing, format = "html", row.names = FALSE) %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))
| Method | Training | Predicting |
|---|---|---|
| recommenderlab | 0.01 | 522.08 |
| Spark | 15.70 | 3.33 |
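Since the two methods spend their time in different phases, it can help to compare total elapsed time rather than each phase on its own. The sketch below re-enters the values from the timing table by hand and assumes they are in seconds, as reported by tictoc. The `Total` column and the `speedup` variable are illustrative additions, not part of the original analysis.

```r
# Values copied from the timing table above (assumed to be seconds)
timing <- data.frame(
  Method     = c("recommenderlab", "Spark"),
  Training   = c(0.01, 15.70),
  Predicting = c(522.08, 3.33)
)

# Total elapsed time per method
timing$Total <- timing$Training + timing$Predicting

# How many times faster Spark was end to end in this single run
speedup <- timing$Total[timing$Method == "recommenderlab"] /
  timing$Total[timing$Method == "Spark"]
round(speedup, 1)
```

This is a single run on one machine, so the ratio should be read as indicative only.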
Even running only a local instance, Spark improved overall performance. This is clearly the biggest advantage of distributed processing. The biggest disadvantage is also fairly obvious: a more complex implementation. I believe this is the main tradeoff.
The most dramatic uses of technologies like Spark happen in the cloud, with billions or trillions of data points, where that power is really needed to draw insights from an ocean of data. It is always important to bear in mind the objective, the response times actually required, and how often the results need to be updated. Hardware is no longer the limitation, but the reality is that costs can get hefty really fast.