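This analysis relies on the packages below; recommenderlab attaches Matrix, arules, proxy, and registry as dependencies, while dplyr, tidyr, and tictoc handle data wrangling and timing. A minimal setup sketch (the package-version warnings from the original run are suppressed):
# Load the packages used throughout; recommenderlab pulls in Matrix,
# arules, proxy, and registry automatically
library(recommenderlab)
library(dplyr)   # data wrangling and the %>% pipe
library(tidyr)   # reshaping
library(tictoc)  # timing of training and prediction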
The dataset I will be using is provided by Amazon Open Data and describes product ratings on a scale of 1 to 5 stars. You can find it, among other useful datasets, at https://s3.amazonaws.com/amazon-reviews-pds/tsv/index.txt; for a better description of the datasets, see https://s3.amazonaws.com/amazon-reviews-pds/readme.html.
For the purposes of this project I took a subset of the 1-million-plus records available in the dataset. I also had to adjust the matrix manually: sparsity was too high, so I added some ratings so that every user had "enough" for the recommender to work.
# Importing data
ratings <- read.csv("https://raw.githubusercontent.com/sortega7878/DATA612/master/ratings_Amazon.csv")
# Build a sparse user x product rating matrix; UserId and ProductId are
# read in as factors (the default in R 3.x), so their integer codes
# index the rows and columns
rMatrix <- sparseMatrix(as.integer(ratings$UserId), as.integer(ratings$ProductId), x = ratings$Rating)
colnames(rMatrix) <- levels(ratings$ProductId)
rownames(rMatrix) <- levels(ratings$UserId)
# Coerce to recommenderlab's realRatingMatrix class
rAmazon <- as(rMatrix, "realRatingMatrix")
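Given the manual densification described above, it is worth sanity-checking the matrix we just built. A small inspection sketch (nratings() and rowCounts() are recommenderlab accessors; the exact numbers depend on the subset used):
# Overall density: observed ratings over all possible user-product pairs
nratings(rAmazon) / (nrow(rAmazon) * ncol(rAmazon))
# Ratings per user: every user needs at least `given` (3, below) ratings
summary(rowCounts(rAmazon))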
The next step is to split the dataset into train and test sets, and to set up a data structure for recording processing times so we can compare performance. In the evaluation scheme below, given = 3 means three ratings from each test user are handed to the recommender while the rest are withheld for evaluation, and goodRating = 3 is the threshold at or above which a rating counts as positive when the ROC statistics are computed.
# Train/test split
set.seed(60)
eval <- evaluationScheme(rAmazon, method = "split", train = 0.8, given = 3, goodRating = 3)
train <- getData(eval, "train")
known <- getData(eval, "known")
unknown <- getData(eval, "unknown")
# Set up data frame for timing
timing <- data.frame(Model=factor(), Training=double(), Predicting=double())
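A quick way to confirm the split behaved as expected is to compare the dimensions of the three pieces (an optional check, not part of the original code):
# train holds roughly 80% of users; known and unknown cover the same test
# users, with 3 ratings per user in known and the remainder in unknown
dim(train)
dim(known)
dim(unknown)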
We are going to build and compare several models so we can measure them against each other: user-based collaborative filtering, random recommendations, and singular value decomposition (SVD).
# ---------------- USER BASED COLLABORATIVE FILTERING ----------------
malgorithm <- "UBCF"
# Training
tic()
modelUBCF <- Recommender(train, method = malgorithm)
t <- toc(quiet = TRUE)
train_time <- round(t$toc - t$tic, 2)
# Predicting
tic()
predUBCF <- predict(modelUBCF, newdata = known, type = "ratings")
t <- toc(quiet = TRUE)
predict_time <- round(t$toc - t$tic, 2)
timing <- rbind(timing, data.frame(Model = as.factor(malgorithm),
Training = as.double(train_time),
Predicting = as.double(predict_time)))
# Accuracy
accUBCF <- calcPredictionAccuracy(predUBCF, unknown)
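Before moving on to the next model, it can help to peek at a few predicted ratings. Predictions in recommenderlab coerce to a plain matrix, with NA wherever an item was already in the known set (a quick inspection sketch, not part of the original pipeline):
# Look at a corner of the predicted-ratings matrix; NAs mark items the
# model was given as known and therefore did not predict
as(predUBCF, "matrix")[1:3, 1:5]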
# ---------------- RANDOM ----------------
malgorithm <- "RANDOM"
# Training
tic()
modelRandom <- Recommender(train, method = malgorithm)
t <- toc(quiet = TRUE)
train_time <- round(t$toc - t$tic, 2)
# Predicting
tic()
predRandom <- predict(modelRandom, newdata = known, type = "ratings")
t <- toc(quiet = TRUE)
predict_time <- round(t$toc - t$tic, 2)
timing <- rbind(timing, data.frame(Model = as.factor(malgorithm),
Training = as.double(train_time),
Predicting = as.double(predict_time)))
# Accuracy
accRandom <- calcPredictionAccuracy(predRandom, unknown)
# ---------------- SVD ----------------
malgorithm <- "SVD"
# Training
tic()
modelSVD <- Recommender(train, method = malgorithm, parameter = list(k = 50))
t <- toc(quiet = TRUE)
train_time <- round(t$toc - t$tic, 2)
# Predicting
tic()
predSVD <- predict(modelSVD, newdata = known, type = "ratings")
t <- toc(quiet = TRUE)
predict_time <- round(t$toc - t$tic, 2)
timing <- rbind(timing, data.frame(Model = as.factor(malgorithm),
Training = as.double(train_time),
Predicting = as.double(predict_time)))
# Accuracy
accSVD <- calcPredictionAccuracy(predSVD, unknown)
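The three blocks above repeat the same train/time/predict/time pattern. A small helper could remove the duplication; here is a sketch of one way to factor it out (time_model is a hypothetical name, not part of the original code):
# Hypothetical helper wrapping the repeated train/predict/timing steps
time_model <- function(method, train, known, parameter = NULL) {
  tic()
  model <- Recommender(train, method = method, parameter = parameter)
  t <- toc(quiet = TRUE)
  training <- round(t$toc - t$tic, 2)
  tic()
  pred <- predict(model, newdata = known, type = "ratings")
  t <- toc(quiet = TRUE)
  predicting <- round(t$toc - t$tic, 2)
  list(pred = pred,
       timing = data.frame(Model = as.factor(method),
                           Training = training, Predicting = predicting))
}
# Example: resSVD <- time_model("SVD", train, known, parameter = list(k = 50))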
accuracy <- rbind(accUBCF, accRandom)
accuracy <- rbind(accuracy, accSVD)
rownames(accuracy) <- c("UBCF", "Random", "SVD")
knitr::kable(accuracy, format = "html") %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))
| | RMSE | MSE | MAE |
|---|---|---|---|
| UBCF | 1.212853 | 1.471013 | 0.8825426 |
| Random | 1.454458 | 2.115447 | 1.0470317 |
| SVD | 1.207088 | 1.457062 | 0.8585245 |
Reviewing the accuracy numbers for the three models, we see that UBCF and SVD are very close together, with SVD only slightly better. The Random model is noticeably worse. It is not surprising that random recommendations are less accurate than recommendations based on prior ratings.
Next we can review the ROC curve and the Precision-Recall plot for all three models. Again SVD performs better than UBCF and considerably better than the Random model. There is room for a discussion about performance here: while SVD is far slower than UBCF to train, its prediction times are dramatically faster, so the choice comes back to the environment where the model will run and the resources available.
models <- list(
"UBCF" = list(name = "UBCF", param = NULL),
"Random" = list(name = "RANDOM", param = NULL),
"SVD" = list(name = "SVD", param = list(k = 50))
)
evalResults <- evaluate(x = eval, method = models, n = c(1, 5, 10, 30, 60))
## UBCF run fold/sample [model time/prediction time]
## 1 [0.02sec/265.88sec]
## RANDOM run fold/sample [model time/prediction time]
## 1 [0sec/7.44sec]
## SVD run fold/sample [model time/prediction time]
## 1 [17.26sec/8.75sec]
The ROC curves tell a different story: despite the accuracy and training/prediction numbers, they position UBCF as the model to follow in this case. For a player like Amazon, the extra computing cost may not be an obstacle to improving accuracy; however, that will not be the case for every organization.
# ROC Curve
plot(evalResults,
annotate = TRUE, legend = "topleft", main = "ROC Curve")
# Precision-Recall Plot
plot(evalResults, "prec/rec",
annotate = TRUE, legend = "topright", main = "Precision-Recall")
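If the exact TPR/FPR or precision/recall values behind these curves are needed, they can be pulled out of the evaluation results directly (a sketch; avg() averages over folds, which is trivial here since the split produced a single fold):
# Averaged confusion-matrix statistics per top-N cutoff for one model
avg(evalResults[["UBCF"]])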
Finally, it is important to consider training and prediction time. From the table below we can see that the UBCF model can be created fairly quickly, but predicting results takes considerable time. The Random model is pretty efficient all around. The SVD model takes longer to build than to predict, but altogether it is quicker than the UBCF model. This may be a factor in some projects.
rownames(timing) <- timing$Model
knitr::kable(timing[, 2:3], format = "html") %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))
| | Training | Predicting |
|---|---|---|
| UBCF | 0.03 | 448.99 |
| RANDOM | 0.00 | 4.40 |
| SVD | 17.75 | 5.86 |
We covered multiple models and measured different performance parameters such as precision and processing time. One of the discussion items is which elements are valuable to the organization: being the fastest at either training or predicting will not necessarily be the whole answer. Perhaps the environment requires constant updates of the training data, in which case a model that trains quickly will be the best fit; in another case an answer is needed instantly, so prediction speed is what matters. Everyone wants better accuracy metrics, but for most organizations the challenge will be balancing that requirement against the cost of computing.
Extra datasets: http://www.ee.columbia.edu/~cylin/course/bigdata/getdatasetinfo.html
Amazon Open Data Repository: https://registry.opendata.aws/
Amazon Customer Reviews Dataset: https://registry.opendata.aws/amazon-reviews/
Implementing a Recommender System with SageMaker, MXNet, and Gluon
Making Video Recommendations Using Neural Networks and Embeddings