Overview

The following is a use-case implementation of Spark on 64-bit Windows 10 via sparklyr. A prior recommendation system was developed with the Jester5k dataset using the recommenderlab library in R and RStudio. A speed test was designed to compare that recommenderlab method with the Spark method; however, the two methods use different algorithms and time different steps, so take these results with a giant grain of salt. Without an existing local Spark installation, about 95% of the time committed went towards setting up an environment. There are too many resources to list that helped in my quest to make Spark happy on my PC.

The Jester data set (https://grouplens.org/datasets/jester/) “contains a sample of 5000 users from the anonymous ratings data from the Jester Online Joke Recommender System collected between April 1999 and May 2003.” User-based collaborative filtering approaches will be analyzed and compared using evaluation metrics, and a preferred option will be selected and optimized. A method for introducing variance, or “diversity”, will also be evaluated, and the Root Mean Square Error of each solution will be calculated and contrasted.


Data Loading and Preparation

set.seed(222)
# Libraries
library(recommenderlab)
library(sparklyr)
## Warning: package 'sparklyr' was built under R version 3.4.1
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.4.1
# Data loading
data(Jester5k)

Constructing the Original Model

# Constrain ratings to the valid range [-10, 10]
Jester5k@data@x[Jester5k@data@x[] < -10] <- -10
Jester5k@data@x[Jester5k@data@x[] > 10] <- 10

# Keep only users (rows) with more than 80 ratings and jokes (columns) with more than 20 ratings
Jester5k_r <- Jester5k[rowCounts(Jester5k) > 80,  colCounts(Jester5k) > 20]

start_time<- Sys.time()
e <- evaluationScheme(Jester5k_r, method = "split",train = 0.8, given = 30, goodRating = 3, k=5)
e
## Evaluation scheme with 30 items given
## Method: 'split' with 5 run(s).
## Training set proportion: 0.800
## Good ratings: >=3.000000
## Data set: 1701 x 100 rating matrix of class 'realRatingMatrix' with 167062 ratings.
# Creation of the model - U(ser) B(ased) C(ollaborative) F(iltering)
Rec.model <- Recommender(getData(e, "train"), "UBCF", parameter = list(method = "pearson", normalize = "Z-score", nn=25))

#Making predictions 
prediction_UJ <- predict(Rec.model, getData(e, "known"), type="ratings",n=10)

# set all predictions that fall outside the valid range to the boundary values
prediction_UJ@data@x[prediction_UJ@data@x[] < -10] <- -10
prediction_UJ@data@x[prediction_UJ@data@x[] > 10] <- 10

# Fix the unit so the elapsed time is always reported in seconds
timeA <- difftime(Sys.time(), start_time, units = "secs")

Without Spark, the basic recommender takes 9.864074 seconds.

Constructing the Spark Environment

#hardcoding environment for posterity
java_path <- normalizePath('C:/JAVA/jre1.8.0_131')
spark_path <- normalizePath("C:/Users/Robert/AppData/Local/rstudio/spark/Cache/spark-2.0.1-bin-hadoop2.7/")
Sys.setenv(JAVA_HOME = java_path)
Sys.setenv(SPARK_HOME = spark_path)
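
Since most of the effort went into the environment, it can help to confirm that both variables point at real directories before connecting. The following is a minimal sketch in base R; the helper name check_home_dirs is hypothetical and is not part of sparklyr.

```r
# Hypothetical helper (not from sparklyr): report whether each environment
# variable is set and points at an existing directory.
check_home_dirs <- function(vars = c("JAVA_HOME", "SPARK_HOME")) {
  paths <- Sys.getenv(vars, unset = "")
  # Drop trailing separators, which some Windows file checks reject
  paths <- sub("[/\\\\]+$", "", paths)
  ok <- nzchar(paths) & dir.exists(paths)
  names(ok) <- vars
  ok
}

status <- check_home_dirs()
if (!all(status)) {
  message("Not set or not a directory: ",
          paste(names(status)[!status], collapse = ", "))
}
```

If either entry is FALSE, the connection step is likely to fail, so the path is worth fixing first.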

#setting up data
jDF <- as(Jester5k, 'data.frame')
jDF$user <- as.numeric(jDF$user)
jDF$item <- as.numeric(jDF$item)

#connecting to local spark
sc2 <- spark_connect(master='local')

start_time<- Sys.time()

jDF <- sdf_copy_to(sc2, jDF, "spark_jester", overwrite = TRUE)

implicit_model <- ml_als_factorization(jDF, 
                                       iter.max = 5, 
                                       regularization.parameter = 0.01, 
                                       implicit.preferences = TRUE, 
                                       alpha = 1.0)

implicit_predictions <- implicit_model$.model %>%
  invoke("transform", spark_dataframe(jDF)) %>%
  collect()

end_times <- Sys.time()

# Fix the unit so the elapsed time is always reported in seconds
timeB <- difftime(end_times, start_time, units = "secs")

head(implicit_predictions)
## # A tibble: 6 x 4
##    user  item rating prediction
##   <dbl> <dbl>  <dbl>      <dbl>
## 1    12    12   9.08   6.261599
## 2    13    12  -9.61  -9.069301
## 3    14    12  -9.51  -7.208566
## 4    18    12  -9.13  -9.587784
## 5    38    12  -5.10  -5.624005
## 6    46    12  -3.83  -5.032250
spark_disconnect(sc2)

With Spark, the basic recommender takes 1.0977349 seconds.
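
The overview mentions contrasting Root Mean Square Errors. As a minimal sketch, RMSE can be computed from the rating and prediction columns of the collected predictions. The rmse helper below is a hypothetical name, and the six rows are the ones printed by head(implicit_predictions) above, so the result only illustrates the calculation and says nothing about overall model quality.

```r
# Hypothetical helper: root mean square error between two numeric vectors
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2, na.rm = TRUE))
}

# The six rows shown in the head() output, used purely as sample input
sample_predictions <- data.frame(
  rating     = c(9.08, -9.61, -9.51, -9.13, -5.10, -3.83),
  prediction = c(6.261599, -9.069301, -7.208566,
                 -9.587784, -5.624005, -5.032250)
)

sample_rmse <- rmse(sample_predictions$rating, sample_predictions$prediction)
print(sample_rmse)
```

On the full result this would be rmse(implicit_predictions$rating, implicit_predictions$prediction). Because those predictions come from transforming the same data the model was fit on, that figure is an in-sample error. For the recommenderlab model, calcPredictionAccuracy(prediction_UJ, getData(e, "unknown")) reports RMSE on the held-out ratings.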