When a user gives feed back or rate to a certain book that they read, this collection of feedback can be represented in a form of a matrix. Where each row represents each users, while each column represents different books. Obviously the matrix will be sparse since not everyone is going to read every book, due to different interests.

One strength of matrix factorization is the fact that it can incorporate implicit feedback, information that are not directly given but can be derived by analyzing user behavior. With this advantage, we can estimate and predict if a user is going to like a certain book that they never read. If that rating (assumed) is high, we can recommend that that book to the user.

One method, Singular Value Method is what will be used going forward in this project. Let there be matrix A \(M \times N\). The matrix can be viewed as a dot product between two matrix with each matrices having dimensions of \(M \times K\) and \(K \times N\). One downside of SVD is that it does not work well with missing values.

library(recommenderlab)
library(tidyverse)
library(kableExtra)

Review of how user-item books were created in Project(2).

book_ratings <- read.csv("https://raw.githubusercontent.com/zygmuntz/goodbooks-10k/master/ratings.csv", sep = ",", header = T, stringsAsFactors = F)

book_titles <- read.csv("https://raw.githubusercontent.com/zygmuntz/goodbooks-10k/master/books.csv", sep = ",", header = T, stringsAsFactors = F) %>% select(book_id, title)

book_titles$book_id <- as.factor(book_titles$book_id)
book_ratings$user_id <- as.factor(book_ratings$user_id)
book_ratings$book_id <- as.factor(book_ratings$book_id)

bmatrix <- as(book_ratings, "realRatingMatrix")
bmatrix <- bmatrix[rowCounts(bmatrix) > 150, colCounts(bmatrix) > 300]
bmatrix
## 4011 x 4147 rating matrix of class 'realRatingMatrix' with 582842 ratings.

Training and Test Sets

So we are going to split the data.80:20, train and test respectively, keeping 20 items and running the evaluation once.

## min(rowCounts(bmatrix)) = 75 so we can keep 25 items per user
bmat_eval <- evaluationScheme(data = bmatrix, method = "split", train = 0.8, given = 25, goodRating = 3, k = 1) 

Let’s see how the different sets are in the evalution

eval_train <- getData(bmat_eval, "train") #train
eval_train
## 3208 x 4147 rating matrix of class 'realRatingMatrix' with 465786 ratings.
eval_known <- getData(bmat_eval, "known") # test data that we know 
eval_known
## 803 x 4147 rating matrix of class 'realRatingMatrix' with 20075 ratings.
eval_unknown <- getData(bmat_eval, "unknown")#test data that we do not know
eval_unknown
## 803 x 4147 rating matrix of class 'realRatingMatrix' with 96981 ratings.
qplot(rowCounts(eval_unknown)) + geom_histogram(binwidth = 10) + ggtitle("Unknown Items by the readers")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Evaluating the recommender system with SVD. Build recommender:

system.time({
  
books_svd <- Recommender(data = getData(bmat_eval, "train"), method = "SVD") 
})
##    user  system elapsed 
##    2.35    0.39    2.82
books_svd
## Recommender of type 'SVD' for 'realRatingMatrix' 
## learned using 3208 users.

Compare it to IBCF

system.time({
imodel <- Recommender(data = getData(bmat_eval, "train"), method = "IBCF") 
})
##    user  system elapsed 
##  609.10  305.67  915.74
imodel
## Recommender of type 'IBCF' for 'realRatingMatrix' 
## learned using 3208 users.

The SVD method is clearly faster.

Predict model

#as topNlist
books_svd_pred <- predict(object = books_svd, newdata = eval_known, n = 10) %>% list()
books_svd_pred
## [[1]]
## Recommendations as 'topNList' with n = 10 for 803 users.
#as real rating matrix
books_svd_pred_ <- predict(object = books_svd, newdata = eval_known, n = 10, type = "ratings")
books_svd_pred_
## 803 x 4147 rating matrix of class 'realRatingMatrix' with 3309966 ratings.
#ibcf
booksibcf_pred <- predict(object = imodel, newdata = eval_known, n = 10, type = "ratings")
svd_pred <- function(i){
p <- books_svd_pred[[1]]@ratings[[i]]
r <- data.frame("name" = as.factor(i))
r <- inner_join(r, book_titles, by = c("name" = "book_id")) %>% select(title)
print(paste("Ratings for --", r))
return(p)
}

Let’s pick some users and predict their ratings

svd_pred(1)
## Warning: Column `name`/`book_id` joining factors with different levels,
## coercing to character vector
## [1] "Ratings for -- The Hunger Games (The Hunger Games, #1)"
##  [1] 3.110284 3.100933 3.097932 3.096504 3.095380 3.095059 3.095027
##  [8] 3.093767 3.093691 3.093329
svd_pred(5)
## Warning: Column `name`/`book_id` joining factors with different levels,
## coercing to character vector
## [1] "Ratings for -- The Great Gatsby"
##  [1] 3.624007 3.603465 3.597189 3.561628 3.540180 3.536442 3.536245
##  [8] 3.517575 3.516860 3.501264
svd_pred(200)
## Warning: Column `name`/`book_id` joining factors with different levels,
## coercing to character vector
## [1] "Ratings for -- And Then There Were None"
##  [1] 4.037318 4.022901 4.019216 4.015127 4.013435 4.004952 3.939503
##  [8] 3.936291 3.932991 3.927677
svd_pred(18)
## Warning: Column `name`/`book_id` joining factors with different levels,
## coercing to character vector
## [1] "Ratings for -- Harry Potter and the Prisoner of Azkaban (Harry Potter, #3)"
##  [1] 3.822550 3.815094 3.808116 3.787699 3.783683 3.781402 3.780454
##  [8] 3.777911 3.773492 3.770998
svd_pred(400)
## Warning: Column `name`/`book_id` joining factors with different levels,
## coercing to character vector
## [1] "Ratings for -- Neuromancer"
##  [1] 3.685309 3.679514 3.676269 3.656057 3.588399 3.582025 3.580939
##  [8] 3.557832 3.551406 3.549157

To measure Accuracy

SVD

eval_svd_pred <- calcPredictionAccuracy(x = books_svd_pred_, data = eval_unknown, byUser = TRUE) 

kable(head(eval_svd_pred)) %>% kable_styling(bootstrap_options = "striped", font_size = 12, full_width = F)
RMSE MSE MAE
35 0.7990776 0.6385251 0.6067715
143 1.0108130 1.0217429 0.8559298
202 0.8055430 0.6488995 0.6614300
307 0.9705449 0.9419574 0.8277673
338 1.2079899 1.4592396 1.0824372
368 0.8609464 0.7412288 0.6116328

Item Based

eval_ibcf_pred <- calcPredictionAccuracy(x = booksibcf_pred, data = eval_unknown, byUser = TRUE)

kable(head(eval_ibcf_pred)) %>% kable_styling(bootstrap_options = "striped", font_size = 12, full_width = F)
RMSE MSE MAE
35 1.0000000 1.000000 1.0000000
143 1.5118579 2.285714 1.4285714
202 0.8660254 0.750000 0.8333333
307 0.5000000 0.250000 0.2500000
338 0.7071068 0.500000 0.5000000
368 0.8660254 0.750000 0.7500000

Summary

Collaborative Filtering takes a longer time to learn the data than does SVD. According to the evaluation of the predictions, the SVD model is more accurate with smaller error rate.