Collaborative Filtering with recommenderlab

Overview: What is Collaborative Filtering (CF)?

CF algorithms are machine learning methods that predict which items a user is likely to find interesting, based on that user’s known preferences within a set of items and on the preferences of other users. CF became widely known through the Netflix Prize competition, launched in 2006. Basically, the problem to be addressed was how to recommend movies to users who had little or no previous history on the platform.

To illustrate how it works in R, we’ll use a small dataset of movies and ratings obtained from IMDb, together with the recommenderlab package and its built-in functions. If any of the libraries below is missing, install it with install.packages("package_not_installed").

library(dplyr)
library(recommenderlab)
library(reshape2)
library(kableExtra)
library(ggplot2)
library(DT)
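
The check for missing packages can also be scripted. The following is a minimal sketch in base R; the helper name missing_packages is hypothetical and not part of any of the packages above, and the install call is left commented out so nothing is installed by accident.

```r
# Hypothetical helper: return the packages from `pkgs` that cannot be loaded
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(
  c("dplyr", "recommenderlab", "reshape2", "kableExtra", "ggplot2", "DT")
)
print(to_install)

# Uncomment to install whatever is missing:
# if (length(to_install) > 0) install.packages(to_install)
```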

Data Description & Pre-processing

Let’s load the data. We have two files: “movies.csv”, with each movie’s ID, title, and genres, and “ratings.csv”, with every user rating and its timestamp. A small sample of each is shown below.

movies <- read.csv("movies.csv", header = TRUE)
ratings <- read.csv("ratings.csv", header = TRUE)
head(movies)
head(ratings)

##   movieId                              title
## 1       1                   Toy Story (1995)
## 2       2                     Jumanji (1995)
## 3       3            Grumpier Old Men (1995)
## 4       4           Waiting to Exhale (1995)
## 5       5 Father of the Bride Part II (1995)
## 6       6                        Heat (1995)
##                                        genres
## 1 Adventure|Animation|Children|Comedy|Fantasy
## 2                  Adventure|Children|Fantasy
## 3                              Comedy|Romance
## 4                        Comedy|Drama|Romance
## 5                                      Comedy
## 6                       Action|Crime|Thriller

##   userId movieId rating  timestamp
## 1      1      31    2.5 1260759144
## 2      1    1029    3.0 1260759179
## 3      1    1061    3.0 1260759182
## 4      1    1129    2.0 1260759185
## 5      1    1172    4.0 1260759205
## 6      1    1263    2.0 1260759151

The next step is to reshape the ratings. recommenderlab functions work with one specific layout only: one row per user and one column per item, with the ratings as cell values. This reshaping can be done easily with the dcast function from the reshape2 library. The timestamp column is dropped since it’s not needed for this analysis.

ratings_m <- dcast(ratings[, -4], userId ~ movieId, value.var = "rating")

userId	1	2	3	4	5	6	7	8	9	10	11
1	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
2	NA	NA	NA	NA	NA	NA	NA	NA	NA	4	NA
3	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
4	NA	NA	NA	NA	NA	NA	NA	NA	NA	4	NA
5	NA	NA	4	NA	NA	NA	NA	NA	NA	NA	NA
6	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
7	3	NA	NA	NA	NA	NA	NA	NA	NA	3	NA
8	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
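
To make the reshaping step concrete, here is a small sketch on made-up data. The toy_long data frame and its values are hypothetical and are not taken from ratings.csv; only the dcast call mirrors the one used above.

```r
library(reshape2)

# Hypothetical toy ratings in long format (one row per user/movie pair)
toy_long <- data.frame(
  userId  = c(1, 1, 2, 3, 3),
  movieId = c(10, 20, 10, 20, 30),
  rating  = c(4.0, 3.5, 5.0, 2.0, 4.5)
)

# One row per user, one column per movie; pairs without a rating become NA
toy_wide <- dcast(toy_long, userId ~ movieId, value.var = "rating")
print(toy_wide)
```

User 2 never rated movies 20 and 30, so those two cells come out as NA, just like most cells in the real table.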

recommenderlab requires a specific sparse matrix class, realRatingMatrix, to build a recommender. We set the ID column as the row names, drop it from the data, and convert the result. When printed, this matrix format shows a dot “.” in place of NA values.

rownames(ratings_m) <- ratings_m[, 1]
ratings_m2 <- as.matrix(ratings_m[, -1])
ratings_m2 <- as(ratings_m2, "realRatingMatrix")
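
The same conversion can be tried on a tiny matrix to see what the resulting object holds. This is only a sketch: the toy_m matrix is hypothetical, while as(), rowCounts() and nratings() are the recommenderlab functions.

```r
library(recommenderlab)

# Hypothetical 3 users x 3 movies matrix; NA marks a missing rating
toy_m <- matrix(
  c(4,  NA, 3.5,
    5,  NA, NA,
    NA, 2,  4.5),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("1", "2", "3"), c("10", "20", "30"))
)

toy_r <- as(toy_m, "realRatingMatrix")

print(toy_r)             # summary line with dimensions and number of ratings
print(rowCounts(toy_r))  # ratings per user
print(nratings(toy_r))   # total number of stored ratings
print(as(toy_r, "matrix"))  # back to a regular matrix with NA
```

Only the five observed ratings are stored, which is what keeps the object small when most user/movie pairs are empty.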

It’s important to get an overview of this kind of data first. For example: what is the minimum number of items rated by a user? What is the average rating score?

avuser <- round(mean(rowMeans(ratings_m[, -1], na.rm = TRUE)), 2)

row_counts <- data.frame(freq = rowCounts(ratings_m2))

row_counts %>% ggplot(aes(x = freq)) +
  geom_histogram(color = "darkred", fill = "gray") +
  ggtitle("Number of movies rated by users") + theme_gray() +
  annotate("text", label = paste("Min. movies rated:", min(row_counts$freq)), x = 1750, y = 250, hjust = 0, vjust = 1) +
  annotate("text", label = paste("Average rating:", avuser), x = 1750, y = 220, hjust = 0, vjust = 1)
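
Note that avuser is the mean of the per-user average ratings, which is generally not the same number as the mean over all individual ratings. A small base R sketch with made-up values (the toy matrix is hypothetical) shows the difference:

```r
# Hypothetical ratings: user 1 rated four movies, user 2 rated only one
toy <- rbind(
  c(5, 5, 5, 5),
  c(1, NA, NA, NA)
)

# Each user counts once, regardless of how many movies they rated
mean_of_user_means <- mean(rowMeans(toy, na.rm = TRUE))  # (5 + 1) / 2 = 3

# Each rating counts once, so heavy raters weigh more
global_mean <- mean(toy, na.rm = TRUE)                   # 21 / 5 = 4.2

print(c(mean_of_user_means = mean_of_user_means, global_mean = global_mean))
```

Which of the two is more useful depends on whether the question is about a typical user or a typical rating.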

Building a Recommender

Now let’s build a recommender. Several algorithms are available, but only User-Based and Item-Based Collaborative Filtering are shown here. For more information, see the package authors’ documentation page. The main difference between these algorithms is how the weighted averages used to estimate missing user ratings are computed: the user-based approach averages over the nearest (most similar) users, while the item-based approach averages over the nearest items. The image below illustrates this difference.

[Image: difference between user-based and item-based collaborative filtering]

source: http://cuihelei.blogspot.com/2012/09/the-difference-among-three.html
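
The two weighted averages can be sketched in a few lines of base R. This is a simplified illustration under stated assumptions, not recommenderlab’s implementation: it uses every available neighbor instead of only the nearest ones, skips normalization, and the function names (cosine_sim, predict_user_based, predict_item_based) and the toy matrix are hypothetical.

```r
# Cosine similarity computed on the positions where both vectors have a rating
cosine_sim <- function(a, b) {
  both <- !is.na(a) & !is.na(b)
  if (!any(both)) return(NA_real_)
  sum(a[both] * b[both]) / (sqrt(sum(a[both]^2)) * sqrt(sum(b[both]^2)))
}

# User-based: average the target item's ratings from other users,
# weighted by how similar each of those users is to the target user
predict_user_based <- function(r, user, item) {
  others <- setdiff(which(!is.na(r[, item])), user)
  sims <- sapply(others, function(u) cosine_sim(r[user, ], r[u, ]))
  sum(sims * r[others, item]) / sum(sims)
}

# Item-based: average the target user's own ratings of other items,
# weighted by how similar each of those items is to the target item
predict_item_based <- function(r, user, item) {
  rated <- setdiff(which(!is.na(r[user, ])), item)
  sims <- sapply(rated, function(j) cosine_sim(r[, item], r[, j]))
  sum(sims * r[user, rated]) / sum(sims)
}

# Hypothetical ratings: rows are users, columns are items
r <- matrix(
  c(5, 4, NA,
    4, 5, 2,
    1, 2, 5),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("u", 1:3), paste0("i", 1:3))
)

# Estimate the missing rating of user 1 for item 3 both ways
ub <- predict_user_based(r, user = 1, item = 3)
ib <- predict_item_based(r, user = 1, item = 3)
print(c(user_based = ub, item_based = ib))
```

The two estimates differ because they draw on different evidence: the user-based one only looks at what users 2 and 3 thought of item 3, while the item-based one only looks at user 1’s own ratings of items 1 and 2.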

To control how a recommender is evaluated, evaluationScheme defines how the data is split, using the following parameters:

method: how the data is partitioned, for example “split” or “cross-validation”
train: proportion of the data used for training
given: number of ratings per test user that are given to the recommender; that many ratings are picked at random for each test user, and the remaining ones are withheld to evaluate how well they are predicted
goodRating: threshold from which a rating is considered “positive”; as our ratings go up to 5, a rating of 3 or more is considered positive for a movie/item
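
The effect of these parameters is easiest to see on a small synthetic matrix. In the sketch below, the toy data (50 users who each rated 20 of 30 movies) is made up for illustration; evaluationScheme, getData and rowCounts are the recommenderlab functions.

```r
library(recommenderlab)

set.seed(42)

# Hypothetical data: 50 users, 30 movies, each user rated 20 random movies
n_users <- 50
n_items <- 30
toy <- matrix(NA_real_, n_users, n_items,
              dimnames = list(paste0("u", 1:n_users), paste0("m", 1:n_items)))
for (u in 1:n_users) {
  toy[u, sample(n_items, 20)] <- sample(1:5, 20, replace = TRUE)
}
toy_r <- as(toy, "realRatingMatrix")

scheme <- evaluationScheme(toy_r, method = "split", train = 0.8,
                           given = 10, goodRating = 3)

train   <- getData(scheme, "train")    # users used to fit the model
known   <- getData(scheme, "known")    # test users: ratings shown to the model
unknown <- getData(scheme, "unknown")  # test users: ratings held back for scoring

print(dim(train))
print(table(rowCounts(known)))    # every test user has `given` known ratings
print(table(rowCounts(unknown)))  # the rest of each test user's ratings
```

Because every toy user has exactly 20 ratings and given is 10, each test user ends up with 10 known and 10 unknown ratings.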

We’ll hold out the last 11 users to illustrate how the movie recommendation works. Movie ratings are also normalized using the Z-score (centered on each row’s mean and divided by that row’s standard deviation), which helps to control for outliers and for users who rate consistently high or low. The method parameter selects the similarity measure used to compare users or items; “cosine”, “pearson”, “jaccard” and others can be used.
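
As an illustration of the row-wise Z-score, here is a base R sketch on made-up ratings. The helper zscore_rows and the toy matrix are hypothetical, and recommenderlab’s own normalization may differ in details.

```r
# Hypothetical helper: center each row on its mean and divide by its std. dev.
zscore_rows <- function(m) {
  mu    <- rowMeans(m, na.rm = TRUE)
  sigma <- apply(m, 1, sd, na.rm = TRUE)
  (m - mu) / sigma  # vectors recycle down the columns, so row i uses mu[i]
}

# Two users with the same taste but a different use of the rating scale
toy <- rbind(
  harsh    = c(1, 2, 3, NA),
  generous = c(3, 4, 5, NA)
)

z <- zscore_rows(toy)
print(z)
```

After normalization both rows read -1, 0, 1, NA, so the two users look identical to a similarity measure even though their raw ratings never coincide.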

NOTE: IBCF can take up to 40 minutes to train (tested on an Intel i7 6700K at stock 4.0 GHz), because the computation runs on a single core in R; this package is not optimized for multi-core parallel computing. If you only want to try the model for educational purposes, use fewer movies by drawing a small random sample of them.
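
One way to draw such a sample is to pick random columns of the rating matrix and then drop users who are left with too few ratings for the evaluation scheme. The sketch below runs on synthetic data (the toy matrix, the sample size of 50 and the variable names are assumptions); on the real data the same two subsetting steps would be applied to ratings_m2.

```r
library(recommenderlab)

set.seed(123)

# Hypothetical data: 40 users, 200 movies, each user rated 60 random movies
toy <- matrix(NA_real_, 40, 200,
              dimnames = list(paste0("u", 1:40), paste0("m", 1:200)))
for (u in 1:40) {
  toy[u, sample(200, 60)] <- sample(1:5, 60, replace = TRUE)
}
toy_r <- as(toy, "realRatingMatrix")

# Keep a random sample of 50 movies (columns)
keep_movies <- sample(ncol(toy_r), 50)
small <- toy_r[, keep_movies]

# Keep only users who still have more ratings than `given`
given <- 10
small <- small[rowCounts(small) > given, ]

print(dim(small))
```

Filtering on rowCounts matters because sampling columns removes ratings, and a scheme with given = 10 needs every user to have more than 10 of them.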

e <- evaluationScheme(ratings_m2[1:660, ], method = "split", train = 0.8, given = 10, goodRating = 3)

# User-based collaborative filtering
UBCF <- Recommender(getData(e, "train"), "UBCF",
                    param = list(normalize = "Z-score", method = "Cosine"))

# Item-based collaborative filtering
IBCF <- Recommender(getData(e, "train"), "IBCF",
                    param = list(normalize = "Z-score", method = "Cosine"))

Now that our models have finished training, it’s time to look at error measures such as RMSE, MSE and MAE to find out which is the better movie recommender. We predict ratings for the test users from their known ratings and compare these predictions with the withheld (unknown) real values.

p1 <- predict(UBCF, getData(e, "known"), type="ratings")

p2 <- predict(IBCF, getData(e, "known"), type="ratings")

error <- rbind(
   UBCF_rat = calcPredictionAccuracy(p1, getData(e, "unknown")),
   IBCF_rat = calcPredictionAccuracy(p2, getData(e, "unknown"))
   )

	RMSE	MSE	MAE
UBCF_rat	1.000837	1.001675	0.7872649
IBCF_rat	1.183332	1.400276	0.8952206
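
For reference, the three measures in the table can be computed by hand. The sketch below uses four made-up rating pairs (the actual and predicted vectors are hypothetical) and plain base R rather than calcPredictionAccuracy.

```r
# Hypothetical withheld ratings and the model's predictions for them
actual    <- c(4, 3, 5, 2)
predicted <- c(3.5, 3, 4, 3)

err <- predicted - actual  # -0.5, 0, -1, 1

mse  <- mean(err^2)        # (0.25 + 0 + 1 + 1) / 4 = 0.5625
rmse <- sqrt(mse)          # 0.75
mae  <- mean(abs(err))     # (0.5 + 0 + 1 + 1) / 4 = 0.625

print(c(RMSE = rmse, MSE = mse, MAE = mae))
```

RMSE is always the square root of MSE, which can be checked against the table above: 1.000837 squared is about 1.001675.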

User-Based Collaborative Filtering outperformed the item-based approach on all three error measures. Thus, we’ll use it to predict the top 5 movies that best suit users who were included in neither the training nor the testing set.

# Predict the best 5 movies for the held-out users (rows 661 onward)
recom <- predict(UBCF, ratings_m2[661:nrow(ratings_m2), ], n = 5)
list_recom <- as(recom, "list")
head(list_recom, 3)

## $`661`
## [1] "110" "356" "593" "296" "457"
## 
## $`662`
## [1] "260"  "608"  "858"  "50"   "8961"
## 
## $`663`
## [1] "2858" "2959" "296"  "4993" "1197"

Now we can look up each movie’s title in the movies table using the dplyr function left_join.

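data_recom <- melt(list_recom)
colnames(data_recom) <- c("movieId", "userId")
# Convert via character so a factor column yields the real IDs, not level codes
data_recom$movieId <- as.integer(as.character(data_recom$movieId))
data_recom <- data_recom %>% left_join(movies, by = "movieId")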

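Note: the sample output below appears to come from a run in which movieId was stored as a factor, so as.integer returned factor level codes (1 to 10) rather than the real IDs (110, 356, ...). The titles shown therefore do not match the recommendation lists printed earlier; converting with as.integer(as.character(...)) avoids this.

movieId	userId	title	genres
1	661	Toy Story (1995)	Adventure|Animation|Children|Comedy|Fantasy
3	661	Grumpier Old Men (1995)	Comedy|Romance
5	661	Father of the Bride Part II (1995)	Comedy
2	661	Jumanji (1995)	Adventure|Children|Fantasy
4	661	Waiting to Exhale (1995)	Comedy|Drama|Romance
6	662	Heat (1995)	Action|Crime|Thriller
8	662	Tom and Huck (1995)	Adventure|Children
9	662	Sudden Death (1995)	Action
7	662	Sabrina (1995)	Comedy|Romance
10	662	GoldenEye (1995)	Action|Adventure|Thriller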
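
One thing to watch in the ID conversion step is the column type. If the IDs arrive as a factor (as can happen with older R defaults), as.integer returns the internal level codes instead of the numbers written in the labels. A short base R sketch using the first recommendation list from above:

```r
# The recommended movie IDs for user 661, as character strings
ids_chr <- c("110", "356", "593", "296", "457")

# The same IDs stored as a factor; levels are sorted as text
ids_fct <- factor(ids_chr)

wrong <- as.integer(ids_fct)                # level codes: 1 3 5 2 4
right <- as.integer(as.character(ids_fct))  # real IDs: 110 356 593 296 457

print(wrong)
print(right)
```

Joining on the level codes would silently attach the titles of movies 1 to 5 instead of the recommended ones, so converting through as.character first is the safer habit.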

Final Comments

Here we’ve shown how to build CF recommenders with the recommenderlab package. However, there are parameters and evaluation methods that we skipped. For more details, see the authors’ official documentation.

Written by: JHON PARRA

Stats Col

March 2019
