This document shows how to use packages for building recommender systems in R: recommenderlab, recosystem, SlopeOne and SVDApproximation. It also compares the performance of several algorithms on the MovieLens 1M dataset.
First of all, the packages mentioned above need to be installed and loaded:
install.packages("recommenderlab")
install.packages("recosystem")
library(devtools)
install_github(repo = "SlopeOne", username = "tarashnot")
install_github(repo = "SVDApproximation", username = "tarashnot")
library(recommenderlab)
library(recosystem)
library(SlopeOne)
library(SVDApproximation)
library(data.table)
library(RColorBrewer)
library(ggplot2)
MovieLens is a project of GroupLens Research, a research lab in the Department of Computer Science and Engineering at the University of Minnesota, running since 1997. The project is focused on gathering research data on personalized recommendations. MovieLens is a recommender system and virtual community website that recommends movies for its users to watch, based on their film preferences, using collaborative filtering.
The MovieLens 1M dataset contains approximately 1 million ratings of 3706 movies from 6040 users. Ratings range from 1 to 5. General statistics of this dataset can be seen in the figure below (grey lines show median values). The level of rating matrix sparsity is 0.045, i.e., only about 4.5% of the entries of the rating matrix are known.
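The sparsity level follows directly from these counts (using the exact rating count of 1000209, reported later by recommenderlab):
1000209 / (6040 * 3706)
[1] 0.04468363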
This dataset is bundled with SVDApproximation, so it can be accessed without any downloads once the package is loaded:
data(ratings)
head(ratings)
user item rating
1: 1 1 5
2: 6 1 4
3: 8 1 4
4: 9 1 5
5: 10 1 5
6: 18 1 4
The ratings table contains the IDs of users and items (in our case, movies) and the corresponding ratings.
Statistics of the ratings data:
visualize_ratings(ratings_table = ratings)
There are two main categories of collaborative filtering algorithms: memory-based and model-based methods.
Memory-based methods store all ratings and make recommendations based on the relation between the queried user-item pair and the rest of the rating matrix. Model-based methods first fit a parametrized predictive model to the rating matrix and then issue recommendations based on the fitted model.
The two most popular memory-based methods are user-based and item-based collaborative filtering. These are examples of neighborhood-based methods, which look at the ratings of similar users or items and make recommendations based on a weighted sum of the nearest users'/items' ratings. The user-based CF method is built on the assumption that if two users rate some items similarly, they will rate the remaining items similarly as well. The same holds for item-based CF, but from the item perspective.
Model-based methods, on the other hand, build parametrized models and recommend the items with the highest predicted rating returned by the model. For example, the Slope One method learns a set of simple predictors (one for each pair of items), each consisting of a single constant: the average difference between the ratings of the two items. With this method, fast computation and reasonable accuracy can easily be achieved. Another example of this class of methods is SVD Approximation. In this approach, the rating matrix is decomposed via Singular Value Decomposition and then reconstructed keeping only the r most significant components, which makes it possible to predict the missing values of the rating matrix.
The Most Popular (item average) approach computes the average rating of each item from the available ratings and predicts each unknown rating as the item's average. As a result, the predicted missing ratings of each item are the same for every user.
Algorithm:
1. Calculate the average rating of each item;
2. Predict the missing ratings in R (the rating matrix) as the corresponding item averages (a toy sketch follows below).
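As a toy sketch of this idea (made-up ratings; this is not the recommenderlab implementation), item averages can be computed directly with data.table:
toy <- data.table(user = c(1, 1, 2, 3), item = c(1, 2, 1, 2), rating = c(5, 3, 4, 4))
item_avg <- toy[, .(avg_rating = mean(rating)), by = item]   #step 1: per-item means
item_avg
item avg_rating
1: 1 4.5
2: 2 3.5
#step 2: every user who has not rated an item gets its avg_rating as the prediction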
To work with the recommenderlab package, the data first needs to be converted to sparse format:
#sparseMatrix comes from the Matrix package, which is loaded together with recommenderlab
sparse_ratings <- sparseMatrix(i = ratings$user, j = ratings$item, x = ratings$rating,
                               dims = c(length(unique(ratings$user)), length(unique(ratings$item))),
                               dimnames = list(paste("u", 1:length(unique(ratings$user)), sep = ""),
                                               paste("m", 1:length(unique(ratings$item)), sep = "")))
sparse_ratings[1:10, 1:10]
10 x 10 sparse Matrix of class "dgCMatrix"
m1 m2 m3 m4 m5 m6 m7 m8 m9 m10
u1 5 . . . . . . . . .
u2 . . . . . . . . . .
u3 . . . . . . . . . .
u4 . . . . . . . . . .
u5 . . . . . 2 . . . .
u6 4 . . . . . . . . .
u7 . . . . . 4 . . . .
u8 4 . . 3 . . . . . .
u9 5 . . . . . . . . .
u10 5 5 . . . . 4 . . .
recommenderlab works with realRatingMatrix objects, which can easily be created from a sparse matrix:
real_ratings <- new("realRatingMatrix", data = sparse_ratings)
real_ratings
6040 x 3706 rating matrix of class 'realRatingMatrix' with 1000209 ratings.
To predict ratings, we first need to build a Recommender. To avoid skewness in predictions, the data needs to be normalized before applying any algorithm. Here, normalization is done by users' averages (the mean of each user's known ratings is subtracted from all of them).
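What "center" normalization does can be sketched on a single user's ratings (toy numbers):
u <- c(5, 3, 4)
u - mean(u)   #the user's mean, 4, is subtracted from each known rating
[1] 1 -1 0
The model with this normalization is then built as follows: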
model <- Recommender(real_ratings, method = "POPULAR", param=list(normalize = "center"))
Predictions of ratings can then be made, for example, for the first 5 users:
prediction <- predict(model, real_ratings[1:5], type="ratings")
as(prediction, "matrix")[,1:5]
m1 m2 m3 m4 m5
u1 NA 3.389036 3.461714 3.491661 2.726688
u2 4.668176 3.389036 3.461714 3.491661 2.726688
u3 4.668176 3.389036 3.461714 3.491661 2.726688
u4 4.668176 3.389036 3.461714 3.491661 2.726688
u5 4.668176 3.389036 3.461714 3.491661 2.726688
As we can see, predictions of the missing ratings of the first 5 users have been calculated. Since we used the item-average approach, the prediction for each item is the same for every user.
The RMSE of this approach is calculated below:
set.seed(1)
e <- evaluationScheme(real_ratings, method="split", train=0.8, given=-5)
#5 ratings of 20% of users are excluded for testing
model <- Recommender(getData(e, "train"), "POPULAR")
prediction <- predict(model, getData(e, "known"), type="ratings")
rmse_popular <- calcPredictionAccuracy(prediction, getData(e, "unknown"))[1]
rmse_popular
RMSE
1.086751
User-based CF forms predictions by aggregating ratings from the closest users (nearest neighbors). The nearest neighbors are defined by similarity between users, which is calculated from the available ratings. It is important to understand that this method works under the assumption that users with similar past ratings will rate new items similarly.
There are many different similarity measures that can be used for training such a recommender. The most popular for collaborative filtering are Pearson correlation and Cosine similarity; both are easy to compute, as the toy sketch below shows.
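A toy sketch in base R (made-up rating vectors of two users over four co-rated items):
u1 <- c(5, 3, 4, 4)
u2 <- c(4, 2, 5, 3)
sum(u1 * u2) / (sqrt(sum(u1^2)) * sqrt(sum(u2^2)))   #Cosine similarity
[1] 0.9715365
cor(u1, u2, method = "pearson")   #Pearson correlation
[1] 0.6324555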
Algorithm:
1. Calculate the similarity between user u and all other users, using any preferred similarity measure;
2. Select the top n users with the highest similarity to user u;
3. Calculate predictions of the unknown ratings of user u as the average of the available ratings of the n closest users, or as their similarity-weighted average (a toy sketch follows below).
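Step 3 can be illustrated with toy numbers: given the similarities of the n = 3 nearest neighbors and their ratings of the target item, the similarity-weighted prediction is:
sims <- c(0.9, 0.7, 0.4)       #similarities of the 3 closest users to u
r_nn <- c(5, 4, 3)             #their ratings of the target item
sum(sims * r_nn) / sum(sims)   #weighted prediction
[1] 4.25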
To find the best value of n, a separate validation set or cross-validation can be used. In our experiment we used Cosine similarity and selected n = 50 based on cross-validation. Results for different values of n are shown below. As the experiment was run multiple times (to achieve more stable results), all estimates are shown as boxplots:
Even though we selected n = 50 as the best-estimated value, we can see that n = 20 is also good enough.
Predicting ratings and evaluating UBCF can be done in the same way as for the Most Popular approach:
#Building model
model <- Recommender(real_ratings, method = "UBCF",
param=list(normalize = "center", method="Cosine", nn=50))
#Making predictions
prediction <- predict(model, real_ratings[1:5], type="ratings")
as(prediction, "matrix")[,1:5]
m1 m2 m3 m4 m5
[1,] NA 4.230440 4.198770 4.150188 4.175816
[2,] 3.922250 3.648546 3.677559 3.710457 3.753160
[3,] 3.985497 3.807452 3.849233 3.891652 3.860626
[4,] 4.170688 4.157193 4.190476 4.190476 4.195399
[5,] 3.300253 3.084914 3.132698 3.118013 3.146465
#Estimating RMSE
set.seed(1)
model <- Recommender(getData(e, "train"), method = "UBCF",
param=list(normalize = "center", method="Cosine", nn=50))
prediction <- predict(model, getData(e, "known"), type="ratings")
rmse_ubcf <- calcPredictionAccuracy(prediction, getData(e, "unknown"))[1]
rmse_ubcf
RMSE
0.9896001
This time the result is noticeably better.
The item-based CF approach is very similar to user-based CF, but here similarity is computed between items rather than users. The assumption is that users will prefer items similar to other items they like.
Algorithm:
1. Calculate the similarity matrix between all items based on the available users' ratings, using any preferred similarity measure;
2. For user u:
2.1 Keep only the n items closest to each item;
2.2 Calculate the predicted rating of each item as the similarity-weighted average of user u's available ratings of the closest items.
As with user-based CF, we used Cosine similarity and selected n = 350 based on cross-validation. Results for different values of n are shown in the figure below:
This plot shows somewhat different behavior than we saw for user-based CF. First of all, it is possible to achieve a much better result by increasing the number of nearest items. Moreover, the best value of n is much higher than for user-based CF.
Item-based CF achieves a lower RMSE on the test set than user-based CF, which makes it more suitable for this dataset.
Predicting ratings and evaluating IBCF can be done in the same way as for the user-based CF or Most Popular approaches:
#Building model
model <- Recommender(real_ratings, method = "IBCF",
param=list(normalize = "center", method="Cosine", k=350))
#Making predictions
prediction <- predict(model, real_ratings[1:5], type="ratings")
as(prediction, "matrix")[,1:8]
m1 m2 m3 m4 m5 m6 m7 m8
u1 NA NA 4.000000 NA NA 4.400783 NA 3.73582
u2 4.264205 3.185612 3.172429 3.202051 3.581299 4.014510 3.253853 3.00000
u3 4.034784 3.794960 4.475967 NA 3.477293 4.171651 4.000000 5.00000
u4 4.319038 NA NA NA NA 4.318060 NA NA
u5 3.561104 1.783488 1.000000 2.638740 1.524747 NA 2.851991 3.00000
#Estimating RMSE
set.seed(1)
model <- Recommender(getData(e, "train"), method = "IBCF",
param=list(normalize = "center", method="Cosine", k=350))
prediction <- predict(model, getData(e, "known"), type="ratings")
rmse_ibcf <- calcPredictionAccuracy(prediction, getData(e, "unknown"))[1]
rmse_ibcf
RMSE
0.9378255
Slope One was introduced by Daniel Lemire and Anna Maclachlan in the paper 'Slope One Predictors for Online Rating-Based Collaborative Filtering'. It is one of the simplest ways to perform collaborative filtering based on item similarity, which makes it very easy to implement and use, while its accuracy is comparable to that of more complicated and resource-intensive algorithms.
The Slope One method operates on the average differences of ratings between pairs of items and makes predictions from their weighted combination. For more details, see Wikipedia.
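As a toy illustration of the idea (made-up ratings; this is not the SlopeOne package API): two users rated both items, so the average difference between item 2 and item 1 can be learned and applied to a third user who rated only item 1.
train <- data.frame(user = c("A", "A", "B", "B"),
                    item = c(1, 2, 1, 2),
                    rating = c(1, 1.5, 3, 4))
#average difference (item 2 minus item 1) over users who rated both
diff_21 <- mean(train$rating[train$item == 2] - train$rating[train$item == 1])
diff_21
[1] 0.75
#user C rated item 1 as 2, so the predicted rating of item 2 is
2 + diff_21
[1] 2.75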
The SlopeOne package works with data.table objects. It is quite fast, but for huge datasets it requires a lot of RAM.
First, we need to change the names and types of the variables in the ratings dataset to match what the package expects:
names(ratings) <- c("user_id", "item_id", "rating")
ratings <- data.table(ratings)
ratings[, user_id := as.character(user_id)]
ratings[, item_id := as.character(item_id)]
setkey(ratings, user_id, item_id)
Let's split the data into train and test sets:
set.seed(1)
in_train <- rep(TRUE, nrow(ratings))
in_train[sample(1:nrow(ratings), size = round(0.2 * length(unique(ratings$user_id)), 0) * 5)] <- FALSE
ratings_train <- ratings[(in_train)]
ratings_test <- ratings[(!in_train)]
Ratings normalization:
ratings_train_norm <- normalize_ratings(ratings_train)
Building Slope One model:
model <- build_slopeone(ratings_train_norm$ratings)
Making predictions for test set:
predictions <- predict_slopeone(model,
ratings_test[ , c(1, 2), with = FALSE],
ratings_train_norm$ratings)
unnormalized_predictions <- unnormalize_ratings(normalized = ratings_train_norm,
ratings = predictions)
Calculating RMSE on test set:
rmse_slopeone <- sqrt(mean((unnormalized_predictions$predicted_rating - ratings_test$rating) ^ 2))
rmse_slopeone
[1] 0.8962808
Matrix Factorization is a popular technique for solving the recommender system problem. The main idea is to approximate the matrix R (m x n) by the product of two matrices of lower dimension: P (k x m) and Q (k x n).
Matrix P represents the latent factors of users: each k-element column of P describes one user. Likewise, each k-element column of matrix Q describes one item. So, to find the rating of item i by user u, we simply need to compute the dot product of two vectors: P[,u]' x Q[,i]. Short and full descriptions of the package are available here.
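A toy sketch of this reconstruction (random factor matrices, just to show the dimensions and the dot product):
P <- matrix(runif(2 * 3), nrow = 2)   #k = 2 latent factors, m = 3 users
Q <- matrix(runif(2 * 4), nrow = 2)   #k = 2 latent factors, n = 4 items
R_hat <- t(P) %*% Q                   #m x n matrix of predicted ratings
all.equal(R_hat[1, 2], sum(P[, 1] * Q[, 2]))   #rating of item 2 by user 1
[1] TRUE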
The recosystem package works with data saved on disk as 3 space-separated columns with no headers:
data(ratings)
set.seed(1)
in_train <- rep(TRUE, nrow(ratings))
in_train[sample(1:nrow(ratings), size = round(0.2 * length(unique(ratings$user)), 0) * 5)] <- FALSE
ratings_train <- ratings[(in_train)]
ratings_test <- ratings[(!in_train)]
write.table(ratings_train, file = "trainset.txt", sep = " ", row.names = FALSE, col.names = FALSE)
write.table(ratings_test, file = "testset.txt", sep = " ", row.names = FALSE, col.names = FALSE)
The next step is to build a Recommender object:
r = Reco()
Now we can tune the parameters of the Recommender: the number of latent factors (dim), the gradient descent learning rate (lrate), and the penalty parameter that guards against overfitting (cost). To make this easier and faster to run, cost can be set to some small value; the tuned dim will then adapt to it:
opts <- r$tune("trainset.txt", opts = list(dim = c(1:20), lrate = c(0.05),
nthread = 4, cost = c(0), niter = 200, nfold = 10, verbose = FALSE))
Now the model can be trained with the best tuned parameters:
r$train("trainset.txt", opts = c(opts$min, nthread = 4, niter = 500, verbose = FALSE))
A visualization of the number of latent factors vs. cross-validation RMSE is shown below:
Making predictions on the test set and calculating the RMSE:
outfile = tempfile()
r$predict("testset.txt", outfile)
scores_real <- read.table("testset.txt", header = FALSE, sep = " ")$V3
scores_pred <- scan(outfile)
rmse_mf <- sqrt(mean((scores_real-scores_pred) ^ 2))
rmse_mf
[1] 0.8605387
In this approach, the rating matrix is decomposed using Singular Value Decomposition and then reconstructed keeping only the first r components, which makes it possible to predict the missing values. The SVD Approximation method is similar to the Matrix Factorization described above, but here the latent factors of users and items are retrieved in a different way.
Algorithm:
1. Replace all missing values with the items' averages;
2. Normalize the matrix by subtracting the users' averages (calculated from the initial rating matrix, not the filled-in one);
3. Perform Singular Value Decomposition of R;
4. Keeping only the first r columns of matrix U, the first r rows and columns of matrix S, and the first r columns of matrix V, reconstruct the matrix R: U[,1:r] x S[1:r,1:r] x V[,1:r]' (a sketch of these steps follows below).
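A minimal sketch of these four steps on a toy matrix, using base R's svd() (an illustration of the algorithm, not the SVDApproximation implementation):
R <- matrix(c(5,  4, NA,  1,
              4, NA,  1,  1,
              1,  1,  5, NA), nrow = 3, byrow = TRUE)
user_means <- rowMeans(R, na.rm = TRUE)   #computed from the initial matrix
item_means <- colMeans(R, na.rm = TRUE)
R_filled <- R
for (j in 1:ncol(R)) R_filled[is.na(R[, j]), j] <- item_means[j]   #step 1
R_norm <- R_filled - user_means   #step 2: row-wise subtraction of user means
s <- svd(R_norm)                  #step 3
r <- 2                            #number of latent factors kept
R_hat <- s$u[, 1:r] %*% diag(s$d[1:r]) %*% t(s$v[, 1:r]) + user_means   #step 4
#R_hat now contains predictions for the originally missing entries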
Here r denotes the number of latent factors of the decomposition; the best value of this parameter can be found using cross-validation or a separate validation set of ratings. For our dataset, the smallest RMSE on the validation set corresponds to r = 24. But, as with UBCF, we could decrease this number to 10-15 at almost no loss of performance. The first 24 principal components explain 17% of the variability, which is enough for prediction; using more components would cause overfitting.
set.seed(1)
mtx <- split_ratings(ratings_table = ratings,
proportion = c(0.7, 0.15, 0.15))
Now we can build the model:
model <- svd_build(mtx)
The next step is to find the best r:
model_tunes <- svd_tune(model, r = 2:50)
Training and validation RMSE, along with the best r, can be observed in the plot below.
model_tunes$train_vs_valid
Testing on the test set:
rmse_svd <- svd_rmse(model, r = model_tunes$r_best, rmse_type = c("test"))
rmse_svd
[1] 0.9313392
We performed 10 simulations for each algorithm. RMSE estimates for the different algorithms can be viewed below:
Among the memory-based approaches, item-based CF shows the best result, performing much better than Most Popular and user-based CF.
The best performance overall was shown by Matrix Factorization with Stochastic Gradient Descent (0.873). Moreover, for this method it is enough to use only 5-8 factors to achieve a good result. Slope One also performs well and, in comparison with Matrix Factorization, needs much less time for training.
The next plot compares two other characteristics of the algorithms: training time (using the described R libraries) and tuning complexity. Brighter rectangles stand for better values (higher accuracy, lower training time, lower tuning complexity). Even though Matrix Factorization is the best from the accuracy perspective, it is harder to tune and train.