Overview: What is Collaborative Filtering (CF)?

Collaborative Filtering (CF) refers to machine learning algorithms that predict which items a user might find interesting based on that user's preferences over a set of items. CF became widely known after the Netflix Prize competition (launched in 2006), whose core problem was how to recommend movies to users with little or no previous history on the platform.
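
To make the idea concrete, here is a minimal, purely illustrative sketch in base R (the ratings below are invented): a user's missing rating is estimated as a similarity-weighted average of other users' ratings for that item, which is the core computation behind CF.

# Toy user x item rating matrix (NA = not rated); values are made up for illustration
R <- matrix(c(5, 4, NA, 1,
              4, 5,  2, 1,
              1, 2,  5, 4),
            nrow = 3, byrow = TRUE,
            dimnames = list(paste0("user", 1:3), paste0("item", 1:4)))

# Cosine similarity between user1 and every other user (treating NA as 0)
R0  <- ifelse(is.na(R), 0, R)
sim <- apply(R0[-1, ], 1, function(u) sum(R0[1, ] * u) / (sqrt(sum(R0[1, ]^2)) * sqrt(sum(u^2))))

# Estimate user1's missing rating for item3 as a similarity-weighted average
sum(sim * R[-1, "item3"]) / sum(sim)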

To illustrate how it works in R, we'll use a small dataset of movies and ratings obtained from IMDb, working with the recommenderlab package and its built-in functions. If any of the libraries below is missing, install it with install.packages().


library(dplyr)
library(recommenderlab)
library(reshape2)
library(kableExtra)
library(ggplot2)
library(DT)


Data Description & Pre-processing

Let's load the data. We have two files: "movies.csv", which contains each movie's ID, title, and genres, and "ratings.csv", which contains every user rating along with its timestamp. A small sample of each file is shown below.

movies  <- read.csv("movies.csv", header = TRUE)
ratings <- read.csv("ratings.csv", header = TRUE)


##   movieId                              title
## 1       1                   Toy Story (1995)
## 2       2                     Jumanji (1995)
## 3       3            Grumpier Old Men (1995)
## 4       4           Waiting to Exhale (1995)
## 5       5 Father of the Bride Part II (1995)
## 6       6                        Heat (1995)
##                                        genres
## 1 Adventure|Animation|Children|Comedy|Fantasy
## 2                  Adventure|Children|Fantasy
## 3                              Comedy|Romance
## 4                        Comedy|Drama|Romance
## 5                                      Comedy
## 6                       Action|Crime|Thriller
##   userId movieId rating  timestamp
## 1      1      31    2.5 1260759144
## 2      1    1029    3.0 1260759179
## 3      1    1061    3.0 1260759182
## 4      1    1129    2.0 1260759185
## 5      1    1172    4.0 1260759205
## 6      1    1263    2.0 1260759151


The next step is to reshape the ratings. recommenderlab functions expect a specific format: one row per user and one column per item, with each cell holding that user's rating for that item. This can be done easily with the dcast function from the reshape2 package. The timestamp column is removed since it is not needed for this analysis.

ratings_m <- dcast(ratings[, -4], userId ~ movieId, value.var = "rating")   # drop timestamp, spread to user x movie
userId    1    2    3    4    5    6    7    8    9   10   11
     1   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
     2   NA   NA   NA   NA   NA   NA   NA   NA   NA    4   NA
     3   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
     4   NA   NA   NA   NA   NA   NA   NA   NA   NA    4   NA
     5   NA   NA    4   NA   NA   NA   NA   NA   NA   NA   NA
     6   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
     7    3   NA   NA   NA   NA   NA   NA   NA   NA    3   NA
     8   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA


To build a recommender, the data must be converted to recommenderlab's sparse matrix class, realRatingMatrix. We set the userId column as the row names and then remove it. In this sparse format, missing (NA) ratings are displayed as a dot ".".

rownames(ratings_m) <- ratings_m[, 1]              # use userId as row names
ratings_m2 <- as.matrix(ratings_m[, -1])           # drop the userId column
ratings_m2 <- as(ratings_m2, "realRatingMatrix")   # convert to recommenderlab's sparse class
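
As a quick optional check (not part of the original workflow), we can peek at the underlying sparse matrix to see the dot notation and count how many ratings it stores.

getRatingMatrix(ratings_m2)[1:5, 1:10]   # missing ratings print as "."
nratings(ratings_m2)                     # total number of stored ratings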


It is important to get an overview of this kind of data: what is the minimum number of items rated by a user? What is the global average rating? And so on.

avuser <- round(mean(rowMeans(ratings_m[, -1], na.rm = TRUE)), 2)   # global average rating

row_counts <- data.frame(freq = rowCounts(ratings_m2))              # movies rated per user

row_counts %>% ggplot(aes(x = freq)) + geom_histogram(color = "darkred", fill = "gray") +
  ggtitle("Number of movies rated by users") + theme_gray() +
  geom_text(data = data.frame(), aes(label = paste("Min. movies rated:", min(row_counts$freq)), x = 1750, y = 250), hjust = 0, vjust = 1) +
  geom_text(data = data.frame(), aes(label = paste("Average rating:", avuser), x = 1750, y = 220), hjust = 0, vjust = 1)

Building a Recommender

Now let's build a recommender. Several algorithms are available, but only User-Based (UBCF) and Item-Based (IBCF) Collaborative Filtering will be shown here; for more information see the authors' documentation page. The main difference between the two is how the weighted averages used to estimate missing ratings are computed: one averages over the nearest users, while the other takes advantage of the nearest items. The image below illustrates this difference.

[Figure: difference between user-based and item-based collaborative filtering]

source: http://cuihelei.blogspot.com/2012/09/the-difference-among-three.html
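
To see this difference directly in recommenderlab, one can compute the two similarity matrices that the weighted averages are built from; this is only a small illustration on a handful of users and items, not part of the final model.

# User-user similarities (what UBCF averages over), for the first 5 users
similarity(ratings_m2[1:5, ], method = "cosine", which = "users")

# Item-item similarities (what IBCF averages over), for the first 5 items
similarity(ratings_m2[, 1:5], method = "cosine", which = "items")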

To control how a recommender is trained and evaluated, evaluationScheme sets up the experiment with the following parameters:

  • method: either a single train/test split ("split") or cross-validation ("cross-validation"); a cross-validation sketch follows this list
  • train: proportion of the data used for training
  • given: number of items given per test user; for each user in the test set, given randomly selected ratings are handed to the recommender and its predictions are evaluated on the rest
  • goodRating: threshold above which a rating is considered "positive"; since our ratings range from 1 to 5, 3 is taken as a positive value for a movie/item
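
Only the split method is used in this tutorial. For reference, a k-fold cross-validation scheme would be set up as sketched here, assuming the same given and goodRating values (k = 5 is an arbitrary choice for illustration).

# Alternative evaluation scheme: 5-fold cross-validation (illustration only, not run below)
e_cv <- evaluationScheme(ratings_m2[1:660, ], method = "cross-validation",
                         k = 5, given = 10, goodRating = 3)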


We'll use the last 11 users to illustrate how the movie recommendation works, so only the first 660 users enter the evaluation scheme. Ratings are also normalized with the Z-score (each row is mean-centered and divided by its standard deviation), which helps to control for outliers. The method parameter chooses the similarity measure used to build the neighborhood: "cosine", "pearson", "jaccard", and others can be used.
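
To see what the Z-score option does before training anything, one can normalize the rating matrix directly and check that each user's ratings are now centered; this is just an optional sanity check.

ratings_norm <- normalize(ratings_m2, method = "Z-score")   # center each row and divide by its std. dev.
summary(rowMeans(ratings_norm))                             # row means should now be approximately 0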

NOTE: IBCF can take up to 40 minutes to train (tested on an Intel i7 6700K at stock 4.0 GHz) due to R's single-core restriction; this package is not optimized for multi-core parallel computing. If you just want to try the model for educational purposes, use fewer movies by taking a small random sample of them, as sketched below.
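
A minimal sketch of that subsampling (the seed and the number of movies kept are arbitrary choices for illustration):

set.seed(123)                                      # for reproducibility
keep_movies   <- sample(ncol(ratings_m2), 1000)    # keep 1000 random movies
ratings_small <- ratings_m2[, keep_movies]         # smaller matrix to experiment with
dim(ratings_small)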

e <- evaluationScheme(ratings_m2[1:660,], method="split", train=0.8, given=10, goodRating=3)


UBCF <- Recommender(getData(e, "train"), "UBCF",
                    param = list(normalize = "Z-score", method = "Cosine"))

IBCF <- Recommender(getData(e, "train"), "IBCF",
                    param = list(normalize = "Z-score", method = "Cosine"))

Now that the models are trained, we can compare error metrics such as RMSE, MSE, and MAE to find out which movie recommender works best. We generate predictions from the "known" part of the test users' ratings and compare them against the held-out "unknown" ratings.
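
For reference, the three metrics reported below by calcPredictionAccuracy are defined as follows; this sketch on made-up vectors only illustrates the formulas.

true_r <- c(4, 3, 5, 2)               # hypothetical held-out ratings
pred_r <- c(3.5, 3, 4, 2.5)           # hypothetical predictions
mse  <- mean((true_r - pred_r)^2)     # mean squared error
rmse <- sqrt(mse)                     # root mean squared error
mae  <- mean(abs(true_r - pred_r))    # mean absolute error
c(RMSE = rmse, MSE = mse, MAE = mae)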


p1 <- predict(UBCF, getData(e, "known"), type="ratings")

p2 <- predict(IBCF, getData(e, "known"), type="ratings")

error <- rbind(
   UBCF_rat = calcPredictionAccuracy(p1, getData(e, "unknown")),
   IBCF_rat = calcPredictionAccuracy(p2, getData(e, "unknown"))
   )
             RMSE      MSE       MAE
UBCF_rat 1.000837 1.001675 0.7872649
IBCF_rat 1.183332 1.400276 0.8952206


It seems User-Based Collaborative Filtering outperformed the Item-Based approach. We will therefore use it to predict the top 5 unseen movies, those that best suit each user, for the users left out of both the training and testing sets.

recom <- predict(UBCF, ratings_m2[661:nrow(ratings_m2), ], n = 5)   ### predict the top 5 movies for each held-out user (661 onwards)
list_recom=as(recom,"list")
head(list_recom,3)
## $`661`
## [1] "110" "356" "593" "296" "457"
## 
## $`662`
## [1] "260"  "608"  "858"  "50"   "8961"
## 
## $`663`
## [1] "2858" "2959" "296"  "4993" "1197"

Now we can look up each recommended movie's title and genres in movies.csv using dplyr's left_join.

data_recom <- melt(list_recom)                       # flatten the recommendation list into a long data frame
colnames(data_recom) <- c("movieId", "userId")
data_recom$movieId <- as.integer(data_recom$movieId)
data_recom <- data_recom %>% left_join(movies, by = "movieId")   # attach titles and genres
movieId userId title                                genres
      1    661 Toy Story (1995)                     Adventure|Animation|Children|Comedy|Fantasy
      3    661 Grumpier Old Men (1995)              Comedy|Romance
      5    661 Father of the Bride Part II (1995)   Comedy
      2    661 Jumanji (1995)                       Adventure|Children|Fantasy
      4    661 Waiting to Exhale (1995)             Comedy|Drama|Romance
      6    662 Heat (1995)                          Action|Crime|Thriller
      8    662 Tom and Huck (1995)                  Adventure|Children
      9    662 Sudden Death (1995)                  Action
      7    662 Sabrina (1995)                       Comedy|Romance
     10    662 GoldenEye (1995)                     Action|Adventure|Thriller

Final Comments

Here we've shown how to build CF recommenders with the recommenderlab package. However, there are parameters and evaluation methods that we skipped; for more details, see the authors' official documentation.



Written by: JHON PARRA