1.0 Objective

Adapt one of your recommendation systems to work with Apache Spark and compare the performance with your previous iteration. Consider the efficiency of the system and the added complexity of using Spark.

2.0 Data Sourcing and Loading

Our dataset contains social networking, tagging, and music artist listening information for a set of 2,000 users of the Last.fm online music system.

# required packages
library(data.table)  # fread for fast tab-delimited reads
library(pander)      # table rendering
# user-artist play counts
data_set <- fread('user_artists.dat', header = TRUE, sep = '\t')
data_set <- as.data.frame(data_set)
# artist listing
artist_ds <- fread('artists.dat', header = TRUE, sep = '\t')
artist_ds <- as.data.frame(artist_ds)
# create programmer-friendly column names
colnames(data_set) <- c('userID', 'artistID', 'listeningCount')
pander(head(data_set))
 userID   artistID   listeningCount
-------- ---------- ----------------
      2         51            13883
      2         52            11690
      2         53            11351
      2         54            10300
      2         55             8983
      2         56             6152
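
Since the artist listing is loaded but only artist IDs appear in the play counts, it can be handy to attach readable artist names. The following is a minimal sketch, assuming artist_ds follows the usual HetRec Last.fm layout with columns id and name:

# join play counts with artist names (assumes artist_ds has 'id' and 'name' columns)
named_plays <- merge(data_set, artist_ds[, c('id', 'name')],
                     by.x = 'artistID', by.y = 'id')
pander(head(named_plays[order(-named_plays$listeningCount), ]))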

3.0 Starting Spark and H2O, and Loading the Data

library(sparklyr)    # Spark connection and sdf_* helpers
library(rsparkling)  # bridges Spark DataFrames and H2O frames
library(h2o)         # GBM modeling
library(dplyr)       # %>%, select, collect
sc <- spark_connect(master = "local")
# copy the data into Spark
data_tbl <- sdf_copy_to(sc, data_set, overwrite = TRUE)
data_tbl
## # Source:   table<data_set> [?? x 3]
## # Database: spark_connection
##    userID artistID listeningCount
##     <int>    <int>          <int>
##  1      2       51          13883
##  2      2       52          11690
##  3      2       53          11351
##  4      2       54          10300
##  5      2       55           8983
##  6      2       56           6152
##  7      2       57           5955
##  8      2       58           4616
##  9      2       59           4337
## 10      2       60           4147
## # ... with 9.282e+04 more rows
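
Because data_tbl is backed by Spark, dplyr verbs on it are translated to Spark SQL and executed inside Spark; only the final result is pulled back into R. As an illustrative check (not part of the original pipeline), listening activity can be summarised per user:

# total plays and number of artists per user, computed inside Spark
data_tbl %>%
  group_by(userID) %>%
  summarise(artists = n(), totalPlays = sum(listeningCount, na.rm = TRUE)) %>%
  arrange(desc(totalPlays)) %>%
  head(5)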

4.0 Split the Dataset into Train and Test

We decided to use 80% of our dataset for training and the remaining 20% for testing.

# split into 80% training / 20% test partitions
partitions <- data_tbl %>% sdf_partition(training = 0.8, test = 0.2, seed = 1099)
# create H2O frames for the training and test datasets
training <- as_h2o_frame(sc, partitions$training, strict_version_check = FALSE)
test <- as_h2o_frame(sc, partitions$test, strict_version_check = FALSE)
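
As a quick sanity check (a sketch, not part of the original run), the row counts on the Spark side and the H2O side of the bridge should agree:

# roughly 80/20 of the ~93K rows; Spark and H2O counts should match
sdf_nrow(partitions$training)  # rows in the Spark training partition
h2o.nrow(training)             # rows in the corresponding H2O frame
h2o.nrow(test)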

5.0 Creating a GBM Model with H2O

We implemented a Gradient Boosting Machine (GBM) model from the H2O package. GBM is a forward-learning ensemble method. The model trains on the H2O frame we derived from the data held in the local Spark connection (sc).

h2o.no_progress()
start.time <- Sys.time()
# response y = 3 (listeningCount); predictors x = c(1, 2) (userID, artistID)
gbm_model <- h2o.gbm(y = 3, x = c(1, 2), training_frame = training,
                     ntrees = 1000, max_depth = 4, learn_rate = 0.01, seed = 1122)
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken
## Time difference of 11.2543 secs
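
The fitted model object can be inspected beyond its wall-clock time. As an illustrative follow-up, H2O exposes per-feature importance and training-set error directly on the model:

# relative importance of userID and artistID in the fitted GBM
h2o.varimp(gbm_model)
# RMSE on the training frame
h2o.rmse(gbm_model)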

6.0 Prediction

We used H2O's predict function on our test dataset. Once again the data is held in the local Spark connection (sc), so we convert the H2O predictions back to a Spark DataFrame and collect both the predicted and actual values into R.

pred <- h2o.predict(gbm_model, newdata = test)
# convert from H2O frame to Spark DataFrame, then collect the single
# prediction column into a plain R vector
predicted <- as_spark_dataframe(sc, pred, strict_version_check = FALSE) %>%
  collect() %>%
  `[[`(1)
actual <- partitions$test %>%
  select(listeningCount) %>%
  collect() %>%
  `[[`("listeningCount")
# produce a data.frame that contains our actual and predicted values
data <- data.frame(
  predicted = predicted,
  actual    = actual
)
pander(head(data))
 predicted   actual
----------- --------
      5973     4616
      4684     4337
      3333     3644
      3370     3312
      3203     2619
      1524     2120

7.0 Model Performance

perfm <- h2o.performance(gbm_model, newdata = test)
perfm
## H2ORegressionMetrics: gbm
## 
## MSE:  19259208
## RMSE:  4388.531
## MAE:  784.8457
## RMSLE:  1.754279
## Mean Residual Deviance :  19259208
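
These figures can be cross-checked against the collected predictions from the previous section. A minimal sketch in base R, reusing the data frame built above; the results should match h2o.performance up to rounding:

# recompute the headline metrics from the collected predictions
mse  <- mean((data$predicted - data$actual)^2)
rmse <- sqrt(mse)
mae  <- mean(abs(data$predicted - data$actual))
c(MSE = mse, RMSE = rmse, MAE = mae)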

8.0 Conclusion

Reimplementing our music recommendation system with Spark and H2O was extremely challenging but rewarding. The biggest difficulty we faced was installing and configuring Spark and H2O: Spark comes bundled with different versions of Hadoop, and finding the right combination for our environment took trial and error across all of the available combinations. Once we were up and running, execution with Spark was fast. In our previous projects we had to limit the dataset to 500 users and 500 plays because anything more resulted in hours of run time; Spark handled all ~93,000 user-artist play records in roughly 12 seconds.

When it came to choosing a model from the Spark package, we did not find a built-in recommender function that fit our needs, so we ended up implementing a Gradient Boosting Machine (GBM) model from the H2O package.

Based on this experience, we would conclude that utilizing a distributed platform such as Spark is necessary when the dataset is large, even though we only leveraged the local implementation here.

The performance of our model was poor, as its MSE is high. That is understandable, because the model we utilized may not be the best fit for modeling user-artist play counts. It would be interesting to try other model options from Spark and H2O to see whether they model our dataset domain better and hence deliver better performance.

Dataset Credits