The goal here is not only to produce top recommendations but also to determine which approach delivers the more accurate predictions. The alternating least squares (ALS) factorization method is applied in two cases: first to the entire dataset, and then to the same data split into a training set (80%) and a testing set (20%). The evaluation metrics are area under the curve (AUC), mean absolute error (MAE), and root-mean-squared error (RMSE).
The data come from the recommenderlabBX package: the Book-Crossing dataset, whose ratings range from 0 to 11. The dataset is converted into a data frame and then copied into the sparklyr connection.
library(sparklyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
sc <- spark_connect(master = "local")
library(recommenderlabBX)
## Loading required package: recommenderlab
## Loading required package: Matrix
## Loading required package: arules
##
## Attaching package: 'arules'
## The following object is masked from 'package:dplyr':
##
## recode
## The following objects are masked from 'package:base':
##
## abbreviate, write
## Loading required package: proxy
##
## Attaching package: 'proxy'
## The following object is masked from 'package:Matrix':
##
## as.matrix
## The following objects are masked from 'package:stats':
##
## as.dist, dist
## The following object is masked from 'package:base':
##
## as.matrix
## Loading required package: registry
data(BX)
books <- as(BX, 'data.frame')
# user and item come out as factors; map them to numeric ids, which
# ml_als_factorization() requires
books$user <- as.numeric(books$user)
books$item <- as.numeric(books$item)
books_ratings <- sdf_copy_to(sc, books, 'books_ratings', overwrite = TRUE)
books_ratings
## Source: query [1.15e+06 x 3]
## Database: spark connection master=local[4] app=sparklyr local=TRUE
##
## user item rating
## <dbl> <dbl> <dbl>
## 1 41397 32093 0
## 2 92674 56388 0
## 3 96814 255 5
## 4 96814 10522 0
## 5 96814 72083 0
## 6 96814 84173 0
## 7 96814 90612 0
## 8 96814 95590 0
## 9 96814 162098 0
## 10 96814 164454 0
## # ... with 1.15e+06 more rows
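As an aside, the scale of the table can be checked directly against Spark before modelling (purely illustrative; sparklyr translates n_distinct to COUNT(DISTINCT ...) in Spark SQL):
# Illustrative only: count distinct users and items in the Spark table
books_ratings %>%
  summarise(users = n_distinct(user), items = n_distinct(item))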
The alternating least squares method is used to predict how each user would rate a particular book.
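For context, ALS factorizes the sparse user-item rating matrix into low-rank user and item factor matrices. A standard form of the regularized objective (Spark's implementation follows this general shape, though its defaults and regularization details may differ) is

$$\min_{U,V} \sum_{(u,i) \in \mathcal{K}} \left( r_{ui} - \mathbf{u}_u^{\top} \mathbf{v}_i \right)^2 + \lambda \left( \sum_u \lVert \mathbf{u}_u \rVert^2 + \sum_i \lVert \mathbf{v}_i \rVert^2 \right)$$

where $\mathcal{K}$ is the set of observed (user, item) pairs, $r_{ui}$ is an observed rating, $\mathbf{u}_u$ and $\mathbf{v}_i$ are the latent factor vectors, and $\lambda$ is the regularization weight. The algorithm alternates between solving for the user factors with the item factors held fixed and vice versa; each step reduces to ordinary least squares, hence the name.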
model <- ml_als_factorization(books_ratings)
# Score the same data with the fitted model and bring the results into R
predictions1 <- model$.model %>%
  invoke("transform", spark_dataframe(books_ratings)) %>%
  collect()
data.frame(predictions1)[1:10,]
## user item rating prediction
## 1 100725 12 0 0.000000
## 2 100725 13 0 0.000000
## 3 90796 14 0 0.000000
## 4 4690 18 8 7.827173
## 5 55936 25 7 6.958733
## 6 58713 37 9 8.878883
## 7 69201 38 0 0.000000
## 8 55737 46 0 0.000000
## 9 29940 50 0 0.000000
## 10 101439 52 7 6.918979
Not all of the results are valid: some predictions are negative and some exceed 11. For this reason, the predictions were clipped (negative values are set to 0; values greater than 11 are set to 11).
p1 <- predictions1$prediction
p1[p1 < 0] <- 0    # clip below the rating floor
p1[p1 > 11] <- 11  # clip above the rating ceiling
predictions1$prediction <- p1
data.frame(predictions1)[1:10,]
## user item rating prediction
## 1 100725 12 0 0.000000
## 2 100725 13 0 0.000000
## 3 90796 14 0 0.000000
## 4 4690 18 8 7.827173
## 5 55936 25 7 6.958733
## 6 58713 37 9 8.878883
## 7 69201 38 0 0.000000
## 8 55737 46 0 0.000000
## 9 29940 50 0 0.000000
## 10 101439 52 7 6.918979
The ratings and predictions are then averaged by book. None of the mean predictions exceeds 10.4.
by_item_1 <- group_by(predictions1, item)
mean_items_1 <- data.frame(summarise(by_item_1, mean(rating), mean(prediction)))
colnames(mean_items_1) <- c('item', 'mean rating', 'mean prediction')
data.frame(mean_items_1[order(-mean_items_1$`mean prediction`),])[1:10,]
## item mean.rating mean.prediction
## 20919 20919 10 10.39867
## 8644 8644 10 10.36455
## 19151 19151 10 10.36455
## 83822 83822 10 10.36455
## 154154 154154 10 10.36455
## 154227 154227 10 10.36455
## 155319 155319 10 10.36455
## 202079 202079 10 10.36455
## 246469 246469 10 10.36455
## 255702 255702 10 10.36455
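As a quick illustration of turning these scores into per-user recommendations, the collected predictions can simply be filtered and sorted (user 4690 is taken from the sample output above, purely for illustration):
# Top-5 predicted books for a single user (id 4690 chosen from the output above)
predictions1 %>%
  filter(user == 4690) %>%
  arrange(desc(prediction)) %>%
  head(5)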
The dataset is split into a training set (80%) and a testing set (20%).
partitions <- books_ratings %>% sdf_partition(training = 0.8, test = 0.2, seed = 1099)
The alternating least squares factorization method is applied to the training set and then used to predict the results for the testing set.
fit <- partitions$training %>% ml_als_factorization()
predictions2raw <- fit$.model %>%
invoke("transform", spark_dataframe(partitions$test)) %>%
collect()
data.frame(predictions2raw)[1:10,]
## user item rating prediction
## 1 69201 38 0 NaN
## 2 50722 70 5 NaN
## 3 55116 73 3 NaN
## 4 84483 101 8 NaN
## 5 69779 186 10 0
## 6 85388 190 8 0
## 7 10648 218 8 NaN
## 8 31568 225 0 NaN
## 9 60023 244 0 NaN
## 10 63474 254 7 0
Several 'NaN' values appear in the prediction column: Spark's ALS cannot score users or items that occur in the test split but were never seen during training (the cold-start problem). Roughly a quarter of the rows were affected, and these were subsetted out.
p2 <- subset(predictions2raw, !is.nan(prediction))  # drop the rows Spark could not score
nrow(p2)/nrow(predictions2raw)
## [1] 0.7582754
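As a rough, unverified check of the cold-start explanation, one could ask how many of the dropped rows involve an item that never appears in the training partition (the same check could be repeated for users):
# Hypothetical check: proportion of NaN rows whose item is absent from training
nan_rows <- subset(predictions2raw, is.nan(prediction))
train_items <- partitions$training %>% distinct(item) %>% collect()
mean(!(nan_rows$item %in% train_items$item))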
As in the previous case, the predictions were clipped by replacing negative values with 0 and values greater than 11 with 11.
prediction <- p2$prediction
prediction[prediction < 0] <- 0    # clip below the rating floor
prediction[prediction > 11] <- 11  # clip above the rating ceiling
predictions2 <- data.frame(p2[, 1:3], prediction)
predictions2[1:10,]
## user item rating prediction
## 1 69779 186 10 0.00000000
## 2 85388 190 8 0.00000000
## 3 63474 254 7 0.00000000
## 4 81126 328 0 0.00000000
## 5 25432 333 5 0.54312873
## 6 52991 406 0 0.08731084
## 7 6912 411 0 0.00000000
## 8 77548 413 7 0.00000000
## 9 16244 484 7 0.00000000
## 10 6131 499 7 2.64732623
The ratings and predictions are again averaged by book. Unlike the previous case, the highest mean predictions reach the ceiling of 11.
by_item <- group_by(predictions2, item)
meanitems <- data.frame(summarise(by_item, mean(rating), mean(prediction)))
colnames(meanitems) <- c('item', 'mean rating', 'mean prediction')
pmeanitems <- meanitems[order(-meanitems$`mean prediction`),]
pmeanitems[1:10,]
## item mean rating mean prediction
## 1507 7829 7 11
## 1764 8488 5 11
## 2040 9534 8 11
## 2063 9585 0 11
## 3788 14407 9 11
## 4310 17614 8 11
## 6109 25581 0 11
## 6169 25848 0 11
## 6398 26445 10 11
## 6720 27634 8 11
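Finally, both sets of predictions are scored with the Metrics package. For reference, the two error measures are

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{k=1}^{n} \left( r_k - \hat{r}_k \right)^2}, \qquad \mathrm{MAE} = \frac{1}{n} \sum_{k=1}^{n} \left| r_k - \hat{r}_k \right|,$$

where $r_k$ is an observed rating and $\hat{r}_k$ the corresponding clipped prediction. One caveat: auc() from Metrics is designed for a binary actual vector, so the AUC figures below, computed on 0-11 ratings, are better read as rough rank-agreement indicators than as a classical ROC area.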
library(Metrics)
library(knitr)
# Metrics for the model fit and scored on the entire dataset
a1 <- matrix(c(rmse(predictions1$rating, predictions1$prediction),
               mae(predictions1$rating, predictions1$prediction),
               auc(predictions1$rating, predictions1$prediction)))
rownames(a1) <- c('Root-Mean-Squared Error', 'Mean Absolute Error', 'Area Under Curve')
# Metrics for the train/test split
a2 <- matrix(c(rmse(predictions2$rating, predictions2$prediction),
               mae(predictions2$rating, predictions2$prediction),
               auc(predictions2$rating, predictions2$prediction)))
rownames(a2) <- rownames(a1)
kable(data.frame(a1, a2), col.names = c('Entire Set', 'Training/Testing'))
|                         | Entire Set | Training/Testing |
|-------------------------|-----------:|-----------------:|
| Root-Mean-Squared Error |  1.2438687 |        4.1084945 |
| Mean Absolute Error     |  0.5733166 |        2.8033784 |
| Area Under Curve        |  0.5345535 |        0.5179848 |
According to these metrics, it is better to perform alternating least squares factorization on the entire dataset than to split it into training and testing sets and evaluate the model on the held-out portion.
Not only are the root-mean-squared error and the mean absolute error lower; there are also no 'NaN' values in the prediction results.