The goal here is not only to produce top recommendations but also to determine which approach delivers the more accurate predictions. The alternating least squares (ALS) factorization method is applied in two cases: first to the entire dataset, and then to the same data split into a training set (80%) and a testing set (20%). The evaluation metrics are area under the curve (AUC), mean absolute error (MAE), and root-mean-squared error (RMSE).
The data come from the recommenderlabBX package: the Book-Crossing dataset, whose ratings range from 0 to 11. The dataset is converted into a data frame and then copied into the sparklyr connection.
library(sparklyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
sc <- spark_connect(master = "local")
library(recommenderlabBX)
## Loading required package: recommenderlab
## Loading required package: Matrix
## Loading required package: arules
##
## Attaching package: 'arules'
## The following object is masked from 'package:dplyr':
##
## recode
## The following objects are masked from 'package:base':
##
## abbreviate, write
## Loading required package: proxy
##
## Attaching package: 'proxy'
## The following object is masked from 'package:Matrix':
##
## as.matrix
## The following objects are masked from 'package:stats':
##
## as.dist, dist
## The following object is masked from 'package:base':
##
## as.matrix
## Loading required package: registry
data(BX)
books <- as(BX, 'data.frame')
# user and item come out as factors; map them to numeric ids, which
# ml_als_factorization() requires
books$user <- as.numeric(books$user)
books$item <- as.numeric(books$item)
books_ratings <- sdf_copy_to(sc, books, 'books_ratings', overwrite = TRUE)
books_ratings
## Source: query [1.15e+06 x 3]
## Database: spark connection master=local[4] app=sparklyr local=TRUE
##
## user item rating
## <dbl> <dbl> <dbl>
## 1 41397 32093 0
## 2 92674 56388 0
## 3 96814 255 5
## 4 96814 10522 0
## 5 96814 72083 0
## 6 96814 84173 0
## 7 96814 90612 0
## 8 96814 95590 0
## 9 96814 162098 0
## 10 96814 164454 0
## # ... with 1.15e+06 more rows
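As an aside, the scale of the table can be checked directly against Spark before modelling (purely illustrative; sparklyr translates n_distinct to COUNT(DISTINCT ...) in Spark SQL):
# Illustrative only: count distinct users and items in the Spark table
books_ratings %>%
  summarise(users = n_distinct(user), items = n_distinct(item))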
The alternating least squares method is used to predict how each user would rate a particular book.
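For context, ALS factorizes the sparse user-item rating matrix into low-rank user and item factor matrices. A standard form of the regularized objective (Spark's implementation follows this general shape, though its defaults and regularization details may differ) is

$$\min_{U,V} \sum_{(u,i) \in \mathcal{K}} \left( r_{ui} - \mathbf{u}_u^{\top} \mathbf{v}_i \right)^2 + \lambda \left( \sum_u \lVert \mathbf{u}_u \rVert^2 + \sum_i \lVert \mathbf{v}_i \rVert^2 \right)$$

where $\mathcal{K}$ is the set of observed (user, item) pairs, $r_{ui}$ is an observed rating, $\mathbf{u}_u$ and $\mathbf{v}_i$ are the latent factor vectors, and $\lambda$ is the regularization weight. The algorithm alternates between solving for the user factors with the item factors held fixed and vice versa; each step reduces to ordinary least squares, hence the name.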
model <- ml_als_factorization(books_ratings)
# Score the same data with the fitted model and bring the results into R
predictions1 <- model$.model %>%
  invoke("transform", spark_dataframe(books_ratings)) %>%
  collect()
data.frame(predictions1)[1:10,]
## user item rating prediction
## 1 100725 12 0 0.000000
## 2 100725 13 0 0.000000
## 3 90796 14 0 0.000000
## 4 4690 18 8 7.827173
## 5 55936 25 7 6.958733
## 6 58713 37 9 8.878883
## 7 69201 38 0 0.000000
## 8 55737 46 0 0.000000
## 9 29940 50 0 0.000000
## 10 101439 52 7 6.918979
Not all of the results are valid: some predictions are negative and some exceed 11. For this reason, the predictions were clipped (negative values are set to 0; values greater than 11 are set to 11).
p1 <- predictions1$prediction
p1[p1 < 0] <- 0    # clip below the rating floor
p1[p1 > 11] <- 11  # clip above the rating ceiling
predictions1$prediction <- p1
data.frame(predictions1)[1:10,]
## user item rating prediction
## 1 100725 12 0 0.000000
## 2 100725 13 0 0.000000
## 3 90796 14 0 0.000000
## 4 4690 18 8 7.827173
## 5 55936 25 7 6.958733
## 6 58713 37 9 8.878883
## 7 69201 38 0 0.000000
## 8 55737 46 0 0.000000
## 9 29940 50 0 0.000000
## 10 101439 52 7 6.918979
The ratings and predictions are then averaged by book. None of the mean predictions exceeds 10.4.
by_item_1 <- group_by(predictions1, item)
mean_items_1 <- data.frame(summarise(by_item_1, mean(rating), mean(prediction)))
colnames(mean_items_1) <- c('item', 'mean rating', 'mean prediction')
data.frame(mean_items_1[order(-mean_items_1$`mean prediction`),])[1:10,]
## item mean.rating mean.prediction
## 20919 20919 10 10.39867
## 8644 8644 10 10.36455
## 19151 19151 10 10.36455
## 83822 83822 10 10.36455
## 154154 154154 10 10.36455
## 154227 154227 10 10.36455
## 155319 155319 10 10.36455
## 202079 202079 10 10.36455
## 246469 246469 10 10.36455
## 255702 255702 10 10.36455
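As a quick illustration of turning these scores into per-user recommendations, the collected predictions can simply be filtered and sorted (user 4690 is taken from the sample output above, purely for illustration):
# Top-5 predicted books for a single user (id 4690 chosen from the output above)
predictions1 %>%
  filter(user == 4690) %>%
  arrange(desc(prediction)) %>%
  head(5)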
The dataset is split into a training set (80%) and a testing set (20%).
partitions <- books_ratings %>% sdf_partition(training = 0.8, test = 0.2, seed = 1099)
The alternating least squares factorization method is applied to the training set and then used to predict the results for the testing set.
fit <- partitions$training %>% ml_als_factorization()
predictions2raw <- fit$.model %>%
invoke("transform", spark_dataframe(partitions$test)) %>%
collect()
data.frame(predictions2raw)[1:10,]
## user item rating prediction
## 1 69201 38 0 NaN
## 2 50722 70 5 NaN
## 3 55116 73 3 NaN
## 4 84483 101 8 NaN
## 5 69779 186 10 0
## 6 85388 190 8 0
## 7 10648 218 8 NaN
## 8 31568 225 0 NaN
## 9 60023 244 0 NaN
## 10 63474 254 7 0
Several 'NaN' values appear in the prediction column: Spark's ALS cannot score users or items that occur in the test split but were never seen during training (the cold-start problem). Roughly a quarter of the rows were affected, and these were subsetted out.
p2 <- subset(predictions2raw, !is.nan(prediction))  # drop the rows Spark could not score
nrow(p2)/nrow(predictions2raw)
## [1] 0.7582754
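As a rough, unverified check of the cold-start explanation, one could ask how many of the dropped rows involve an item that never appears in the training partition (the same check could be repeated for users):
# Hypothetical check: proportion of NaN rows whose item is absent from training
nan_rows <- subset(predictions2raw, is.nan(prediction))
train_items <- partitions$training %>% distinct(item) %>% collect()
mean(!(nan_rows$item %in% train_items$item))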
As in the previous case, the predictions were clipped by replacing negative values with 0 and values greater than 11 with 11.
prediction <- p2$prediction
prediction[prediction < 0] <- 0    # clip below the rating floor
prediction[prediction > 11] <- 11  # clip above the rating ceiling
predictions2 <- data.frame(p2[, 1:3], prediction)
predictions2[1:10,]
## user item rating prediction
## 1 69779 186 10 0.00000000
## 2 85388 190 8 0.00000000
## 3 63474 254 7 0.00000000
## 4 81126 328 0 0.00000000
## 5 25432 333 5 0.54312873
## 6 52991 406 0 0.08731084
## 7 6912 411 0 0.00000000
## 8 77548 413 7 0.00000000
## 9 16244 484 7 0.00000000
## 10 6131 499 7 2.64732623
The ratings and predictions are again averaged by book. Unlike the previous case, the highest mean predictions reach the ceiling of 11.
by_item <- group_by(predictions2, item)
meanitems <- data.frame(summarise(by_item, mean(rating), mean(prediction)))
colnames(meanitems) <- c('item', 'mean rating', 'mean prediction')
pmeanitems <- meanitems[order(-meanitems$`mean prediction`),]
pmeanitems[1:10,]
## item mean rating mean prediction
## 1507 7829 7 11
## 1764 8488 5 11
## 2040 9534 8 11
## 2063 9585 0 11
## 3788 14407 9 11
## 4310 17614 8 11
## 6109 25581 0 11
## 6169 25848 0 11
## 6398 26445 10 11
## 6720 27634 8 11
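Finally, both sets of predictions are scored with the Metrics package. For reference, the two error measures are

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{k=1}^{n} \left( r_k - \hat{r}_k \right)^2}, \qquad \mathrm{MAE} = \frac{1}{n} \sum_{k=1}^{n} \left| r_k - \hat{r}_k \right|,$$

where $r_k$ is an observed rating and $\hat{r}_k$ the corresponding clipped prediction. One caveat: auc() from Metrics is designed for a binary actual vector, so the AUC figures below, computed on 0-11 ratings, are better read as rough rank-agreement indicators than as a classical ROC area.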
library(Metrics)
library(knitr)
# Metrics for the model fit and scored on the entire dataset
a1 <- matrix(c(rmse(predictions1$rating, predictions1$prediction),
               mae(predictions1$rating, predictions1$prediction),
               auc(predictions1$rating, predictions1$prediction)))
rownames(a1) <- c('Root-Mean-Squared Error', 'Mean Absolute Error', 'Area Under Curve')
# Metrics for the train/test split
a2 <- matrix(c(rmse(predictions2$rating, predictions2$prediction),
               mae(predictions2$rating, predictions2$prediction),
               auc(predictions2$rating, predictions2$prediction)))
rownames(a2) <- rownames(a1)
kable(data.frame(a1, a2), col.names = c('Entire Set', 'Training/Testing'))
|                         | Entire Set | Training/Testing |
|-------------------------|-----------:|-----------------:|
| Root-Mean-Squared Error |  1.2438687 |        4.1084945 |
| Mean Absolute Error     |  0.5733166 |        2.8033784 |
| Area Under Curve        |  0.5345535 |        0.5179848 |
According to these metrics, it is better to perform alternating least squares factorization on the entire dataset than to split it into training and testing sets and evaluate the model on the held-out portion.
Not only are the root-mean-squared error and the mean absolute error lower; there are also no 'NaN' values in the prediction results.