# Required libraries
library(recommenderlab)  
library(tidyverse)           
library(ggthemes)
library(kableExtra)
library(skimr)
library(ggrepel)         
library(tictoc)
library(sparklyr)

Overview

The goal of this project is to practice working with a distributed recommender system. I build on my Project 3 recommender system and use Spark and Sparklyr to add a model to the original analysis. Project 3 included UBCF and SVD models; here we use Spark to build an ALS model. We then compare the accuracy of the three approaches and discuss the Spark / Sparklyr experience and its potential benefits. Part 1 contains the original Project 3 analysis. Part 2 contains the new ALS model built with Spark and Sparklyr, which we work through step by step.

Part 1 - The Original Project 3 Work

The Data Set

The data set is courtesy of the MovieLens project and was downloaded from https://grouplens.org/datasets/movielens/. Please note: I reduced the size of the movie matrix so that my circa 1990 Mac Mini could handle the load. The data set consists of two files, ratings and titles. We use the skimr package to explore the data. SVD models require no missing data, and skimr will let us know where we stand in that regard.

# Data import
setwd("C:/Users/mutue/OneDrive/Documents/Data612")
ratings <- read.csv('ratings150.csv')
titles <- read.csv('movies.csv')

Convert to matrix

First we convert our data into the required format - a `realRatingMatrix`. The end result is a 150 by 4332 rating matrix with more than 18,000 ratings.

movieMatrix <- ratings %>%
  select(-timestamp) %>%
  spread(movieId, rating)


row.names(movieMatrix) <- movieMatrix[,1]
movieMatrix <- as.matrix(movieMatrix[-c(1)])
movieRealMatrix <- as(movieMatrix, "realRatingMatrix")
movieRealMatrix

## 150 x 4332 rating matrix of class 'realRatingMatrix' with 18262 ratings.
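
To make the reshaping step concrete, here is a minimal base-R sketch on a made-up five-row ratings table. The `toy` data frame and its values are assumptions for illustration, not rows from the MovieLens file. It builds the same kind of user-by-movie matrix without `spread()`, and then uses the dimensions reported above to show how sparse the real matrix is.

```r
# Toy long-format ratings: one row per (user, movie) pair.
# Values are invented for illustration only.
toy <- data.frame(
  userId  = c(1, 1, 2, 3, 3),
  movieId = c(10, 20, 10, 20, 30),
  rating  = c(4.0, 3.5, 5.0, 2.0, 4.5)
)

users  <- sort(unique(toy$userId))
movies <- sort(unique(toy$movieId))

# Start with an all-NA matrix: NA means "this user did not rate this movie".
wide <- matrix(NA_real_, nrow = length(users), ncol = length(movies),
               dimnames = list(as.character(users), as.character(movies)))

# Fill in the observed ratings by (row, column) index pairs.
idx <- cbind(match(toy$userId, users), match(toy$movieId, movies))
wide[idx] <- toy$rating

print(wide)

# Share of filled cells in the real matrix, from the figures reported above.
density <- 18262 / (150 * 4332)
print(round(density, 4))
```

Only about 2.8% of the cells in the real matrix hold a rating, which is why recommenderlab stores the data in a sparse `realRatingMatrix` rather than a dense matrix.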

Split the Data

To train and test our models, we need to split our data into training and testing sets. We use an 80-20 split, with `given` set to 20 and `goodRating` set to 3.5.

# Train/test split
set.seed(7)
eval <- evaluationScheme(movieRealMatrix, method = "split", train = 0.8, given = 20, goodRating = 3.5)
train <- getData(eval, "train")
known <- getData(eval, "known")
unknown <- getData(eval, "unknown")
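
The `given` argument can be hard to picture. The sketch below imitates the idea in base R for a single hypothetical test user: `given` of the user's ratings are kept as the "known" part shown to the model, and the rest are held back as the "unknown" part used for scoring. The function name `split_given` and the toy ratings are assumptions for illustration; this is not recommenderlab's internal code.

```r
# Split one user's ratings into a "known" part (shown to the model)
# and an "unknown" part (held back for scoring).
# `ratings` is a named numeric vector with NA for unrated movies.
split_given <- function(ratings, given) {
  rated <- which(!is.na(ratings))
  stopifnot(length(rated) > given)   # need at least one rating left to score
  keep <- rated[sample.int(length(rated), given)]

  known <- ratings
  known[setdiff(rated, keep)] <- NA  # hide everything except the kept items

  unknown <- ratings
  unknown[keep] <- NA                # hide the kept items

  list(known = known, unknown = unknown)
}

set.seed(7)
user <- c(m1 = 4, m2 = NA, m3 = 5, m4 = 3, m5 = NA, m6 = 1)  # invented ratings
parts <- split_given(user, given = 2)

sum(!is.na(parts$known))    # number of ratings shown to the model
sum(!is.na(parts$unknown))  # number of ratings held back
```

Each rated movie ends up in exactly one of the two parts, so the model is never scored on a rating it was shown.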

Build Models

We will build a User-Based Collaborative Filtering (UBCF) model and an SVD model. We will compare the models on RMSE, as well as on the time each needs to train and to predict.

User-Based Collaborative Model

Table 1 below shows the performance results of the UBCF model.

# UBCF model

tic("UBCF Model - Training")
modelUBCF <- Recommender(train, method = "UBCF")
toc(log = TRUE, quiet = TRUE)
tic("UBCF Model - Predicting")
predUBCF <- predict(modelUBCF, newdata = known, type = "ratings")
toc(log = TRUE, quiet = TRUE)

Table 1. UBCF Performance Results.

Metric	Value
RMSE	0.8682038
MSE	0.7537779
MAE	0.6706194
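
For reference, the three metrics in Table 1 follow the usual definitions. The sketch below computes them in base R on two short made-up vectors. The helper name `error_metrics` and the numbers fed to it are assumptions for illustration; the table itself comes from recommenderlab.

```r
# Compute RMSE, MSE and MAE for predicted vs. actual ratings,
# ignoring pairs where either value is missing.
error_metrics <- function(actual, predicted) {
  ok  <- !is.na(actual) & !is.na(predicted)
  err <- actual[ok] - predicted[ok]
  mse <- mean(err^2)
  c(RMSE = sqrt(mse), MSE = mse, MAE = mean(abs(err)))
}

actual    <- c(4, 3, 5, NA, 2)      # invented ratings
predicted <- c(3.5, 3, 4, 4.2, 3)   # invented predictions

m <- error_metrics(actual, predicted)
print(m)

# Sanity check against Table 1: RMSE is the square root of MSE.
table1_rmse <- sqrt(0.7537779)
print(table1_rmse)
```

RMSE penalizes large misses more heavily than MAE, which is why the two can rank models differently when a few predictions are far off.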

Singular Value Decomposition (SVD) Model

Next we create the SVD model. After some tuning, k = 20 was used in the final SVD model. Table 2 below sets forth the performance of the SVD model.

# SVD model

tic("SVD Model - Training")
modelSVD <- Recommender(train, method = "SVD", parameter = list(k = 20))
toc(log = TRUE, quiet = TRUE)
tic("SVD Model - Predicting")
predSVD <- predict(modelSVD, newdata = known, type = "ratings")
toc(log = TRUE, quiet = TRUE)

Table 2. SVD Performance Results

Metric	Value
RMSE	0.8731257
MSE	0.7623486
MAE	0.6761709

Performance Assessment

The UBCF model outperformed the SVD model by a narrow margin. The RMSE for the UBCF model was 0.868 versus 0.873 for the SVD model, a virtual tie from a performance perspective. Next we look at how fast the models train and how fast they produce predictions. The run-time log below compares the two alternatives.

The results of the speed comparison are interesting. In the run logged below, the UBCF model trained far faster than the SVD model (0.01 seconds vs 0.22 seconds). However, the UBCF model also took more than 10x longer to produce predictions (0.92 seconds vs 0.08 seconds).

These results make a strong argument for the SVD model, because models are trained infrequently but are called upon for predictions often. Given the near-equal RMSE and the much faster predictions, the SVD model would be a good choice for a production system.

Run Time
UBCF Model - Training: 0.01 sec elapsed
UBCF Model - Predicting: 0.92 sec elapsed
SVD Model - Training: 0.22 sec elapsed
SVD Model - Predicting: 0.08 sec elapsed
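
The trade-off can be put in rough numbers. Assuming, purely for illustration, that a deployed model is trained once and then asked for the same batch of predictions `n_batches` times, total time is training time plus `n_batches` times prediction time. The sketch below plugs in the elapsed times logged above; the function name `total_time` and the batch counts are hypothetical.

```r
# Elapsed times (seconds) copied from the run-time log above.
run_time <- data.frame(
  model   = c("UBCF", "SVD"),
  train   = c(0.01, 0.22),
  predict = c(0.92, 0.08)
)

# Total time for one training run followed by n_batches prediction batches.
total_time <- function(train, predict, n_batches) {
  train + n_batches * predict
}

for (n in c(1, 10, 100)) {
  run_time[[paste0("total_", n)]] <-
    total_time(run_time$train, run_time$predict, n)
}
print(run_time)

# How many times slower UBCF is at prediction in this run.
predict_ratio <- run_time$predict[1] / run_time$predict[2]
print(predict_ratio)
```

Under this assumption the SVD model is already ahead after a single prediction batch, and the gap widens with every additional batch.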

Model Predictions

Now we will make some movie predictions to see if the models produce similar results. Since it’s the 22nd of June, we’ll pick the 22nd user and see how she rated her movies.

Our movie rater appears to be somewhat generous, or simply someone who likes movies. Of the 22 movies rated, 19 were rated either 4 or 5. Dumb & Dumber got a rating of 1, and Pulp Fiction and Ace Ventura each earned a 3. This could indicate that our movie rater does not like violent or comedy movies. There does appear to be a preference for action/suspense, drama and feel-good movies.

Movie	Rating
Dumb & Dumber (Dumb and Dumber) (1994)	1
Pulp Fiction (1994)	3
Ace Ventura: Pet Detective (1994)	3
Crimson Tide (1995)	4
Waterworld (1995)	4
Interview with the Vampire: The Vampire Chronicles (1994)	4
Shawshank Redemption, The (1994)	4
True Lies (1994)	4
Cliffhanger (1993)	4
Beauty and the Beast (1991)	4
Apollo 13 (1995)	5
Batman Forever (1995)	5
Die Hard: With a Vengeance (1995)	5
Net, The (1995)	5
Outbreak (1995)	5
Stargate (1994)	5
Star Trek: Generations (1994)	5
While You Were Sleeping (1995)	5
Clear and Present Danger (1994)	5
Aladdin (1992)	5
Dances with Wolves (1990)	5
Batman (1989)	5

UBCF Prediction

mov_recommend1 <- as.data.frame(predUBCF@data[22, ])
colnames(mov_recommend1) <- c("Rating")
mov_recommend1$movieId <- as.integer(rownames(mov_recommend1))
mov_recommend1 <- mov_recommend1 %>% arrange(desc(Rating)) %>% head(5) %>% 
  inner_join (titles, by="movieId") %>%
  select(Movie = "title")


kable(mov_recommend1) %>%
  kable_styling()

Movie
Pulp Fiction (1994)
Godfather, The (1972)
Taxi Driver (1976)
Silence of the Lambs, The (1991)
Star Wars: Episode IV - A New Hope (1977)

SVD Prediction

mov_recommend2 <- as.data.frame(predSVD@data[22, ])
colnames(mov_recommend2) <- c("Rating")
mov_recommend2$movieId <- as.integer(rownames(mov_recommend2))
mov_recommend2 <- mov_recommend2 %>% arrange(desc(Rating)) %>% head(5) %>% 
  inner_join (titles, by="movieId") %>%
  select(Movie = "title")


kable(mov_recommend2) %>%
  kable_styling()

Movie
Butch Cassidy and the Sundance Kid (1969)
Crying Game, The (1992)
Star Wars: Episode IV - A New Hope (1977)
Raising Arizona (1987)
Godfather: Part II, The (1974)

UBCF vs SVD

Part 2 - ALS With Spark and Sparklyr

ALS Model Using Spark

We already have UBCF and SVD models, so now we will use Spark to create an ALS model. Here are the steps for this analysis.

Establish Connection to Spark Server - local in this case.

# Connection
sc <- spark_connect(master = "local")

Prepare the data

We are using the same data set as above, just assigning some Spark-inspired names to the variables.

# Prepare data
spark_df <- ratings
spark_df$userId <- as.integer(spark_df$userId)
spark_df$movieId <- as.integer(spark_df$movieId)

Create training and test data frames

# Split for training and testing
which_train <- sample(x = c(TRUE, FALSE), size = nrow(spark_df),
                      replace = TRUE, prob = c(0.8, 0.2))
train_df <- spark_df[which_train, ]
test_df <- spark_df[!which_train, ]
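
One consequence of a purely random row split is that some users or movies in the test rows may never appear in the training rows. ALS has no learned factors for them, so with its default settings Spark returns `NaN` predictions for those rows. The sketch below, on a made-up ratings table, fixes the random seed so the split is repeatable and flags such rows before anything is sent to Spark. The `toy` data, the helper `flag_cold` and the seed value are assumptions for illustration.

```r
# Invented ratings table with the same column names as the real data.
toy <- data.frame(
  userId  = c(1, 1, 1, 2, 2, 3, 3, 4),
  movieId = c(10, 20, 30, 10, 20, 20, 30, 40),
  rating  = c(4, 3, 5, 2, 4, 5, 3, 1)
)

# TRUE for each test row whose user or movie never occurs in training.
flag_cold <- function(test, train) {
  !(test$userId %in% train$userId) | !(test$movieId %in% train$movieId)
}

set.seed(42)  # arbitrary seed so the split can be reproduced
in_train <- sample(c(TRUE, FALSE), size = nrow(toy),
                   replace = TRUE, prob = c(0.8, 0.2))
toy_train <- toy[in_train, ]
toy_test  <- toy[!in_train, ]

toy_test$cold <- flag_cold(toy_test, toy_train)
print(toy_test)
```

Counting the flagged rows before scoring shows how much of the test set will be dropped, which matters when comparing error metrics across models evaluated on different splits.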

Move to Spark server

This is the key step: copying the data to the Spark server for processing. The move to Spark is accomplished with a simple copy command (`sdf_copy_to`).

# Move to Spark
spark_train <- sdf_copy_to(sc, train_df, "train_ratings", overwrite = TRUE)
spark_test <- sdf_copy_to(sc, test_df, "test_ratings", overwrite = TRUE)

Build the ALS model in Spark using the `ml_als` command

# Build model
sparkALS <- ml_als(spark_train, max_iter = 5, nonnegative = TRUE, 
                   rating_col = "rating", user_col = "userId", item_col = "movieId")

Model Performance

We use the model to make predictions to assess its performance. We’ll use the same metrics that were used above for the UBCF and SVD models: MSE, RMSE, and MAE.

# Run prediction

sparkPred <- sparkALS$.jobj %>%
  invoke("transform", spark_dataframe(spark_test)) %>%
  collect()


sparkPred <- sparkPred[!is.na(sparkPred$prediction), ] # Remove NaN due to data set splitting

# Calculate error
mseSpark <- mean((sparkPred$rating - sparkPred$prediction)^2)
rmseSpark <- sqrt(mseSpark)
maeSpark <- mean(abs(sparkPred$rating - sparkPred$prediction))



# Disconnect
spark_disconnect(sc)

Display the Spark ALS Model Performance Results

accuracy <- data.frame(RMSE = rmseSpark, MSE = mseSpark, MAE = maeSpark)
rownames(accuracy) <-  "Spark ALS"
kable(accuracy) %>%
  kable_styling()

	RMSE	MSE	MAE
Spark ALS	1.026193	1.053071	0.7928117

Compare Results and Discuss Spark

The UBCF and SVD models performed better than the ALS model, but that is not the headline.

Spark and Sparklyr give the average R programmer the ability to harness the power and speed of Spark while staying in the friendly confines of R. In my experience, Spark did not seem to build the model any faster than recommenderlab; however, predictions were significantly faster. This is because the Spark approach loads everything into memory, so once you have a working model it just sits there in memory, ready to respond. As a result, Spark would be a much better platform for deploying a system at scale.

In conclusion, moving to a distributed architecture would seem advisable when data sets are large, processing is computationally demanding and / or minimizing processing time is critical.

# Combine the accuracy metrics of all three models
accuracy <- rbind(
  calcPredictionAccuracy(predUBCF, unknown),
  calcPredictionAccuracy(predSVD, unknown),
  data.frame(RMSE = rmseSpark, MSE = mseSpark, MAE = maeSpark)
)
rownames(accuracy) <- c("RecLabs_UBCF", "RecLab_SVD", "Spark ALS")
kable(accuracy) %>%
  kable_styling()

	RMSE	MSE	MAE
RecLabs_UBCF	0.8682038	0.7537779	0.6706194
RecLab_SVD	0.8731257	0.7623486	0.6761709
Spark ALS	1.0261927	1.0530714	0.7928117
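
To put the gaps in the table above in relative terms, the sketch below divides each model's RMSE by the best one. The values are copied from the table; only the variable names are new.

```r
# RMSE values copied from the comparison table above.
rmse <- c(UBCF = 0.8682038, SVD = 0.8731257, ALS = 1.0261927)

# Percentage by which each model's RMSE exceeds the lowest RMSE.
pct_over_best <- (rmse / min(rmse) - 1) * 100
print(round(pct_over_best, 1))
```

SVD trails UBCF by well under 1%, while ALS trails by roughly 18%. The ALS figure comes from a different random split than the recommenderlab models, so the comparison is indicative rather than exact.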

DATA 612 Project 5

Jim Mundy
