Collaborative Filtering (CF) algorithms are machine learning methods that predict which items a customer is likely to find interesting, based on that customer’s preferences within a set of items. CF became famous after the Netflix contest in 2007. Basically, the problem they intended to address was how to recommend movies to users who had little or no previous history on the platform.
To illustrate how it works in R, we’ll use a small dataset of movies and ratings obtained from IMDb, together with the recommenderlab package and its built-in functions. If you don’t have any of the libraries below, run install.packages("package_not_installed") to install them.
library(dplyr)
library(recommenderlab)
library(reshape2)
library(kableExtra)
library(ggplot2)
library(DT)
Let’s load the data. We have two files: “movies.csv”, with each movie’s ID, title and genres, and “ratings.csv”, with all user ratings and their timestamps. A small sample of each is shown.
movies=read.csv("movies.csv",header=TRUE)
ratings=read.csv("ratings.csv",header=TRUE)
head(movies)   # first rows of each file
head(ratings)
##   movieId                              title
## 1       1                   Toy Story (1995)
## 2       2                     Jumanji (1995)
## 3       3            Grumpier Old Men (1995)
## 4       4           Waiting to Exhale (1995)
## 5       5 Father of the Bride Part II (1995)
## 6       6                        Heat (1995)
##                                        genres
## 1 Adventure|Animation|Children|Comedy|Fantasy
## 2                  Adventure|Children|Fantasy
## 3                              Comedy|Romance
## 4                        Comedy|Drama|Romance
## 5                                      Comedy
## 6                       Action|Crime|Thriller

##   userId movieId rating  timestamp
## 1      1      31    2.5 1260759144
## 2      1    1029    3.0 1260759179
## 3      1    1061    3.0 1260759182
## 4      1    1129    2.0 1260759185
## 5      1    1172    4.0 1260759205
## 6      1    1263    2.0 1260759151
The next thing to do is to change the format of the ratings. recommenderlab functions work with one specific layout only: users in rows and items in columns, with each cell holding that user’s rating for that item. This can be done easily with the dcast function from the reshape2 library. The timestamp is removed since it’s not needed for this analysis.
ratings_m=dcast(ratings[,-4], userId~movieId,value.var="rating")
| userId | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 2 | NA | NA | NA | NA | NA | NA | NA | NA | NA | 4 | NA |
| 3 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 4 | NA | NA | NA | NA | NA | NA | NA | NA | NA | 4 | NA |
| 5 | NA | NA | 4 | NA | NA | NA | NA | NA | NA | NA | NA |
| 6 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 7 | 3 | NA | NA | NA | NA | NA | NA | NA | NA | 3 | NA |
| 8 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
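To see what this long-to-wide reshape does, here is a minimal sketch on a toy data frame. The values are made up for illustration, and it uses base R's tapply as a stand-in for dcast so that it runs without extra packages.

```r
# Toy long-format ratings (hypothetical values, not taken from ratings.csv)
toy <- data.frame(
  userId  = c(1, 1, 2, 3, 3),
  movieId = c(10, 20, 10, 20, 30),
  rating  = c(4, 3, 5, 2, 4.5)
)

# One row per user, one column per movie; user/movie pairs without a rating become NA
wide <- with(toy, tapply(rating, list(userId = userId, movieId = movieId), mean))
print(wide)
```

Each user/movie pair appears at most once here, so the mean simply passes the single rating through.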
A recommender must be built from a specific sparse matrix class, realRatingMatrix. We set the user ID column as the row names and drop it before converting. This matrix format displays missing (NA) values as a dot “.”.
rownames(ratings_m)=ratings_m[,1]
ratings_m2=as.matrix(ratings_m[,-1])
ratings_m2 <- as(ratings_m2,"realRatingMatrix")
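As a small illustration of the conversion, the sketch below builds a 2 x 3 matrix with made-up ratings and coerces it the same way. The user and item names (u1, i1, ...) are hypothetical.

```r
library(recommenderlab)

# Hypothetical ratings: 2 users x 3 items, NA = not rated
m <- matrix(c(5, NA, NA, 3, 4, NA), nrow = 2,
            dimnames = list(c("u1", "u2"), c("i1", "i2", "i3")))

r <- as(m, "realRatingMatrix")
getRatingMatrix(r)  # sparse view of the ratings; missing entries print as "."
rowCounts(r)        # number of ratings per user
```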
It’s important to get an overview of this kind of data first: what is the minimum number of items rated by a user? What is the average rating across users? And so on.
# mean of the per-user average ratings, and number of movies rated by each user
avuser=round(mean(rowMeans(ratings_m[,-1],na.rm = TRUE)),2)
row_counts=data.frame(freq=rowCounts(ratings_m2))
stats_labels=c(paste("Min. movies rated:",min(row_counts$freq)), paste("Average rating:",avuser))
row_counts %>% ggplot(aes(x=freq)) +
  geom_histogram(bins=30, color="darkred", fill="gray") +
  ggtitle("Number of movies rated by users") + theme_gray() +
  annotate("text", x=1750, y=c(250,220), label=stats_labels, hjust=0, vjust=1)
Now let’s build a recommender. We can use different algorithms; however, only User-Based and Item-Based Collaborative Filtering are going to be shown. For more info, go to the author’s documentation page. The main difference between these algorithms is how the weighted averages that estimate missing user ratings are computed: one uses the ratings of the nearest (most similar) users, while the other takes advantage of the nearest items. The image right below illustrates this.
source: http://cuihelei.blogspot.com/2012/09/the-difference-among-three.html
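To make the weighted-average idea concrete, here is a minimal base R sketch on a made-up 4 x 4 matrix. It is not recommenderlab's implementation: the function names, the data, and the choice to use every available neighbour (instead of only the nearest ones) are assumptions made for illustration.

```r
# Hypothetical ratings: rows are users, columns are items, NA = not rated
r <- matrix(c(5, 4, NA, 1,
              4, 5, 3, 1,
              1, 2, 5, 4,
              5, 5, 4, NA),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("u", 1:4), paste0("i", 1:4)))

# Cosine similarity computed over the positions rated in both vectors
cosine_sim <- function(a, b) {
  both <- !is.na(a) & !is.na(b)
  sum(a[both] * b[both]) / sqrt(sum(a[both]^2) * sum(b[both]^2))
}

# User-based: average other users' ratings of the item, weighted by user similarity
predict_user_based <- function(r, user, item) {
  others <- setdiff(rownames(r)[!is.na(r[, item])], user)
  w <- sapply(others, function(o) cosine_sim(r[user, ], r[o, ]))
  sum(w * r[others, item]) / sum(w)
}

# Item-based: average the user's own ratings of other items, weighted by item similarity
predict_item_based <- function(r, user, item) {
  rated <- setdiff(colnames(r)[!is.na(r[user, ])], item)
  w <- sapply(rated, function(j) cosine_sim(r[, item], r[, j]))
  sum(w * r[user, rated]) / sum(w)
}

pred_user <- predict_user_based(r, "u1", "i3")
pred_item <- predict_item_based(r, "u1", "i3")
print(c(user_based = pred_user, item_based = pred_item))
```

Both functions fill the same missing cell (user u1, item i3); they differ only in whether the weights come from similar rows or similar columns.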
To control how a recommender is evaluated, evaluationScheme takes the following parameters:
method: how the data are partitioned, either a simple split or cross-validation
train: proportion of the data used for training
given: number of items given to the recommender for each test user; it selects that many of the user’s ratings at random, and the remaining ratings are withheld to evaluate how well they are predicted
goodRating: threshold at or above which a rating is considered “positive”; as our ratings go up to 5, a rating of 3 or more is considered positive for a movie/item
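The roles of given and goodRating can be sketched in base R for a single test user. The ratings, IDs and variable names below are hypothetical; this only mimics the idea and is not how recommenderlab implements it internally.

```r
set.seed(42)

# Hypothetical ratings of one test user, named by movie ID
user_ratings <- c("10" = 4, "20" = 3.5, "30" = 5, "40" = 2, "50" = 4.5, "60" = 1)
given <- 3        # how many ratings the recommender gets to see
good_rating <- 3  # threshold for a "positive" rating

# Randomly pick `given` ratings as the known part; the rest is withheld
known_ids <- sample(names(user_ratings), given)
known   <- user_ratings[known_ids]
unknown <- user_ratings[setdiff(names(user_ratings), known_ids)]

# Withheld ratings at or above the threshold count as positives
is_good <- unknown >= good_rating

print(known)
print(unknown)
print(is_good)
```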
We’ll hold out the last 11 users (rows 661 onward) to illustrate how the movie recommendation works, so the models are trained and tested on the first 660 users only. Movie ratings are also normalized using the Z-score (mean-centered by row and divided by the row’s standard deviation); this helps to control for outliers. The method parameter sets the similarity measure to compute: “cosine”, “pearson”, “jaccard” and others can be used.
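A row-wise Z-score can be written in a few lines of base R. This is a sketch on made-up ratings, with a hypothetical helper name (z_score_rows); recommenderlab's own normalization may differ in details.

```r
# Hypothetical ratings: 2 users x 4 items, NA = not rated
r <- matrix(c(5, 4, NA, 1,
              2, NA, 3, 2),
            nrow = 2, byrow = TRUE)

# Center each row on its mean and divide by its standard deviation, ignoring NAs
z_score_rows <- function(r) {
  mu    <- rowMeans(r, na.rm = TRUE)
  sigma <- apply(r, 1, sd, na.rm = TRUE)
  (r - mu) / sigma  # vectors recycle down the columns, so row i uses mu[i] and sigma[i]
}

z <- z_score_rows(r)
print(round(z, 3))
```

After this step every user's ratings have mean 0 and standard deviation 1, and missing entries stay NA.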
NOTE: IBCF can take up to 40 minutes to train (tested on an Intel i7 6700K at stock 4.0 GHz) because R runs it on a single core; this package is not optimized for multi-core parallel computing. If you only want to try the model for educational purposes, use fewer movies by drawing a small random sample of them.
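Drawing a random sample of movies could look like the base R sketch below, shown on a made-up matrix. One thing to watch, assumed here to matter for the evaluation step: after dropping columns, some users may be left with too few ratings, so the sketch also keeps only users with more than given ratings.

```r
set.seed(1)

# Hypothetical 6 users x 8 movies matrix with some cells unrated (NA)
ratings <- matrix(sample(c(NA, 1:5), 48, replace = TRUE), nrow = 6,
                  dimnames = list(paste0("u", 1:6), paste0("m", 1:8)))

# Keep a random sample of 4 movies
keep_movies <- sample(colnames(ratings), 4)
small <- ratings[, keep_movies, drop = FALSE]

# Keep only users who still have more than `given` ratings in the sample
given <- 2
small <- small[rowSums(!is.na(small)) > given, , drop = FALSE]
print(small)
```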
e <- evaluationScheme(ratings_m2[1:660,], method="split", train=0.8, given=10, goodRating=3)
UBCF <- Recommender(getData(e, "train"), "UBCF",
param=list(normalize = "Z-score",method="Cosine"))
IBCF <- Recommender(getData(e, "train"), "IBCF",param=list(normalize = "Z-score",method="Cosine"))
Now that our models have finished training, it’s time to look at error measures such as RMSE, MSE and MAE to find out which is the better movie recommender. For each test user, we predict ratings from the “known” part of their ratings and compare them with the withheld “unknown” ratings.
p1 <- predict(UBCF, getData(e, "known"), type="ratings")
p2 <- predict(IBCF, getData(e, "known"), type="ratings")
error <- rbind(
UBCF_rat = calcPredictionAccuracy(p1, getData(e, "unknown")),
IBCF_rat = calcPredictionAccuracy(p2, getData(e, "unknown"))
)
| | RMSE | MSE | MAE |
|---|---|---|---|
| UBCF_rat | 1.000837 | 1.001675 | 0.7872649 |
| IBCF_rat | 1.183332 | 1.400276 | 0.8952206 |
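For reference, the three error measures can be computed by hand as in the sketch below. The actual and predicted values are made up and unrelated to the results in the table.

```r
# Hypothetical withheld ratings and the corresponding predictions
actual    <- c(4, 3.5, 5, 2, 4.5)
predicted <- c(3.8, 3.0, 4.2, 2.9, 4.4)

err  <- predicted - actual
mse  <- mean(err^2)      # mean squared error
rmse <- sqrt(mse)        # root mean squared error, in rating units
mae  <- mean(abs(err))   # mean absolute error

print(round(c(RMSE = rmse, MSE = mse, MAE = mae), 4))
```

Since RMSE is the square root of MSE, the two always rank models the same way; MAE is less sensitive to a few large errors.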
It seems User-Based Collaborative Filtering outperformed the Item-Based approach, with lower error on all three measures. Thus, we’ll use it to predict the top 5 unseen movies for users included in neither the training nor the testing set.
recom <- predict(UBCF, ratings_m2[661:(nrow(ratings_m2)),], n=5) # top 5 recommendations for each held-out user (rows 661 onward)
list_recom=as(recom,"list")
head(list_recom,3)
## $`661`
## [1] "110" "356" "593" "296" "457"
##
## $`662`
## [1] "260" "608" "858" "50" "8961"
##
## $`663`
## [1] "2858" "2959" "296" "4993" "1197"
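Conceptually, a top-N list is just the highest predicted ratings among the movies a user has not rated yet. A minimal base R sketch, with made-up scores and a hypothetical helper name (top_n):

```r
# Hypothetical predicted ratings for one user, named by movie
predicted <- c(m1 = 4.6, m2 = 4.4, m3 = 4.1, m4 = 4.8, m5 = 3.9, m6 = 3.2)
already_rated <- c("m4")

# Drop movies the user has already rated, then take the n best predictions
top_n <- function(predicted, already_rated, n) {
  candidates <- predicted[!names(predicted) %in% already_rated]
  ranked <- names(sort(candidates, decreasing = TRUE))
  ranked[seq_len(min(n, length(ranked)))]
}

top3 <- top_n(predicted, already_rated, n = 3)
print(top3)
```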
Now we can look up each movie’s title in the movies data using the dplyr function left_join.
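data_recom=melt(list_recom)
colnames(data_recom)=c("movieId","userId")
# go through as.character() so a factor column gives the real IDs, not its level codes
data_recom$movieId=as.integer(as.character(data_recom$movieId))
data_recom=data_recom %>% left_join(movies,by="movieId")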
| movieId | userId | title | genres |
|---|---|---|---|
| 1 | 661 | Toy Story (1995) | Adventure|Animation|Children|Comedy|Fantasy |
| 3 | 661 | Grumpier Old Men (1995) | Comedy|Romance |
| 5 | 661 | Father of the Bride Part II (1995) | Comedy |
| 2 | 661 | Jumanji (1995) | Adventure|Children|Fantasy |
| 4 | 661 | Waiting to Exhale (1995) | Comedy|Drama|Romance |
| 6 | 662 | Heat (1995) | Action|Crime|Thriller |
| 8 | 662 | Tom and Huck (1995) | Adventure|Children |
| 9 | 662 | Sudden Death (1995) | Action |
| 7 | 662 | Sabrina (1995) | Comedy|Romance |
| 10 | 662 | GoldenEye (1995) | Action|Adventure|Thriller |
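The movieId values in the table above (1 to 10) do not match the IDs recommended earlier (110, 356, 593, ...). A likely explanation, stated here as an assumption about how melt returned the IDs, is that the column was a factor: calling as.integer directly on a factor returns its internal level codes rather than the labels. The sketch below reproduces the effect with the five IDs recommended to user 661.

```r
# The five IDs recommended to user 661, stored as a factor
ids <- factor(c("110", "356", "593", "296", "457"))

codes    <- as.integer(ids)                # level codes: 1 3 5 2 4
real_ids <- as.integer(as.character(ids))  # actual IDs: 110 356 593 296 457

print(codes)
print(real_ids)
```

The codes 1, 3, 5, 2, 4 are exactly the movieId values listed for user 661 in the table, so the titles shown there belong to movies 1 to 10 rather than to the recommended ones. Converting through as.character before as.integer avoids this.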
Here we’ve shown how to build CF algorithms with the recommenderlab package. However, there are parameters and evaluation methods that we skipped. For more details, see the author’s official documentation.
Written by: JHON PARRA