Introduction

The goal of this assignment is to compare the accuracy of several recommender system algorithms. I will build functions to evaluate the models and compare their accuracies in order to identify the best one.

# load libraries
library(tidyverse)
library(kableExtra)
library(knitr)
library(recommenderlab)

Preparing the data

The dataset I will work with is hosted on GroupLens and was generated by the Information Retrieval Group at Universidad Autónoma de Madrid. It contains social networking, tagging, and music artist listening information for a set of roughly 2,000 users of the Last.fm online music system. For the purpose of this assignment, I will work with only three variables from the dataset: userID, artistID, and the artist name.

Importing the data

First, I create a temporary folder and download the zip file directly from GroupLens. Then I extract the file user_artists.dat and print its first few rows. user_artists.dat contains a user ID and an artist ID; every row represents a "user listened to artist" relationship. I also extract artists.dat, which maps artist IDs to artist names.

# create temporary folder to download the file
td = tempdir()
tf = tempfile(tmpdir = td, fileext = ".zip")
download.file("http://files.grouplens.org/datasets/hetrec2011/hetrec2011-lastfm-2k.zip", tf)

# load the listening data and drop the play-count column (only the user-artist
# relationship is needed for a binary matrix)
user_artists <- read.table(unz(tf, "user_artists.dat"), header = T, sep = '\t')
user_artists$weight <- NULL

# load the artist lookup table, keep only the id and name columns,
# and store the id as character so it can be matched to item labels later
artists <- read.table(unz(tf, "artists.dat"), sep = "\t", quote = "", header = T,  fill = T)
artists <- artists[, -c(3:4)]
artists$id <- as(artists$id, "character")

# display the data
head(user_artists) %>% 
  kable() %>% 
  kable_styling("striped", full_width = F)
| userID | artistID |
|-------:|---------:|
|      2 |       51 |
|      2 |       52 |
|      2 |       53 |
|      2 |       54 |
|      2 |       55 |
|      2 |       56 |

Defining a rating matrix

The dataset has 1892 users and 17632 artists. A matrix of this magnitude will not work on my current hardware, so I will sample only 30% of the data and create a binary rating matrix. A binary rating matrix is the most appropriate structure here: 1 means the user listened to the artist, 0 means the user did not.

I also keep only the most relevant part of the data: artists listened to by at least a minimum number of users, since artists with very few listeners could produce biased results, and users who have listened to at least a minimum number of artists, for the same reason.

Lastly, I plot the user-item rating matrix.

# reshape the data into a wide user x artist table, sample 30% of the rows (users)
# and coerce the result to a binary rating matrix (non-zero cells become TRUE)
set.seed(123)

user_artists <- user_artists %>% 
  select(userID, artistID) %>% 
  spread(artistID, -userID, fill = 0) %>% 
  select(-userID) %>% 
  as_tibble() %>% 
  sample_frac(0.30, replace = TRUE) %>% 
  as.matrix() %>% 
  as("binaryRatingMatrix")

# keep users with at least 25 listened artists and artists with at least 5 listeners
rating_matrix <- user_artists[rowCounts(user_artists) >= 25, colCounts(user_artists) >= 5]
rating_matrix
## 557 x 1123 rating matrix of class 'binaryRatingMatrix' with 18245 ratings.
# plot matrix
image(rating_matrix, main = "Binary rating matrix")

### Build the model

To measure performance across models more accurately, I will use the k-fold method to split the data. The evaluationScheme function does this by splitting the data into several chunks, holding one chunk out as the test set, and evaluating accuracy on it. I will split the dataset into 5 chunks; for each test user, all but 1 randomly selected item are given to the recommender, and the single withheld item is used for evaluation. Lastly, I set goodRating to 1, the threshold at which a rating is considered good (in a binary matrix every observed rating is 1).

# split dataset using k-fold method
set.seed(1)
scheme <- rating_matrix %>% 
  evaluationScheme(method = "cross", k = 5, given = -1, goodRating = 1)
scheme
## Evaluation scheme using all-but-1 items
## Method: 'cross-validation' with 5 run(s).
## Good ratings: >=1.000000
## Data set: 557 x 1123 rating matrix of class 'binaryRatingMatrix' with 18245 ratings.

In order to compare different models, we first need to define them. Each model is stored in a list with its name and parameters. I will use the Random, Popular, UBCF, and IBCF methods. Following the textbook's advice, I will also build an IBCF model with the distance method set to Jaccard, since the rating matrix is binary. Lastly, I evaluate the recommender models' performance as a function of n, the number of items recommended to each user.

# create list of algorithms
algorithms <- list( "IBCF Jaccard" = list(name = "IBCF", param = list(method = "Jaccard")),
                    "Random Items" = list(name = "RANDOM", param = NULL),
                    "Popular Items" = list(name  = "POPULAR", param = NULL),
                    "IBCF" = list(name = "IBCF", param = list(k = 5)),
                    "UBCF" = list(name = "UBCF", param = list(method = "Cosine", nn = 500)))

# evaluate the recommender models
results <- evaluate(scheme, method = algorithms, n = seq(10, 100, 20))
## IBCF run fold/sample [model time/prediction time]
##   1  [3.77sec/0.07sec] 
##   2  [3.35sec/0.06sec] 
##   3  [3.19sec/0.1sec] 
##   4  [3.3sec/0.05sec] 
##   5  [3.29sec/0.03sec] 
## RANDOM run fold/sample [model time/prediction time]
##   1  [0sec/0.05sec] 
##   2  [0sec/0.06sec] 
##   3  [0sec/0.07sec] 
##   4  [0sec/0.05sec] 
##   5  [0sec/0.06sec] 
## POPULAR run fold/sample [model time/prediction time]
##   1  [0sec/0.17sec] 
##   2  [0.01sec/0.19sec] 
##   3  [0sec/0.19sec] 
##   4  [0sec/0.17sec] 
##   5  [0sec/0.17sec] 
## IBCF run fold/sample [model time/prediction time]
##   1  [3.21sec/0.02sec] 
##   2  [3.4sec/0.02sec] 
##   3  [3.33sec/0.03sec] 
##   4  [2.97sec/0.01sec] 
##   5  [2.92sec/0.03sec] 
## UBCF run fold/sample [model time/prediction time]
##   1  [0sec/0.53sec] 
##   2  [0sec/0.51sec] 
##   3  [0sec/0.53sec] 
##   4  [0sec/0.53sec] 
##   5  [0sec/0.53sec]

Evaluating the recommendations

We can evaluate the recommendations by comparing them with the withheld items that have a positive rating. I will extract the confusion matrix from the results object. A confusion matrix is a table often used to describe the performance of a classification model; from it we obtain precision (the share of recommended items that are relevant), recall/TPR (the share of relevant items that are recommended), and FPR (the share of non-relevant items that are recommended). We cannot tell at a glance which model performs best, so I will also plot the results.

# create function to average the confusion matrix over the folds
avg_conf_matr <- function(results){
  tmp <- results %>%
    getConfusionMatrix() %>%   # one confusion matrix per fold
    as.list() 
  # element-wise average across folds, then keep the key metrics
  as.data.frame(Reduce("+", tmp) / length(tmp)) %>%
    mutate(n = seq(10, 100, 20)) %>%   # the values of n passed to evaluate()
    select('n', 'precision', 'recall', 'TPR', 'FPR') 
}

# get confusion matrix from the results
results_tbl <- results %>% 
  map(avg_conf_matr) %>% 
  enframe() %>% 
  unnest()

# print confusion matrix 
results_tbl %>% kable() %>% kable_styling("striped", full_width = T)
| name | n | precision | recall | TPR | FPR |
|------|--:|----------:|-------:|----:|----:|
| IBCF Jaccard | 10 | 0.0109735 | 0.1097345 | 0.1097345 | 0.0088070 |
| IBCF Jaccard | 30 | 0.0066350 | 0.1982301 | 0.1982301 | 0.0265014 |
| IBCF Jaccard | 50 | 0.0057382 | 0.2831858 | 0.2831858 | 0.0441296 |
| IBCF Jaccard | 70 | 0.0051011 | 0.3504425 | 0.3504425 | 0.0617530 |
| IBCF Jaccard | 90 | 0.0048681 | 0.4283186 | 0.4283186 | 0.0793481 |
| Random Items | 10 | 0.0014159 | 0.0141593 | 0.0141593 | 0.0088921 |
| Random Items | 30 | 0.0007670 | 0.0230088 | 0.0230088 | 0.0266937 |
| Random Items | 50 | 0.0007434 | 0.0371681 | 0.0371681 | 0.0444905 |
| Random Items | 70 | 0.0007332 | 0.0513274 | 0.0513274 | 0.0622873 |
| Random Items | 90 | 0.0007473 | 0.0672566 | 0.0672566 | 0.0800826 |
| Popular Items | 10 | 0.0083186 | 0.0831858 | 0.0831858 | 0.0088306 |
| Popular Items | 30 | 0.0050737 | 0.1522124 | 0.1522124 | 0.0265786 |
| Popular Items | 50 | 0.0041416 | 0.2070796 | 0.2070796 | 0.0443392 |
| Popular Items | 70 | 0.0038938 | 0.2725664 | 0.2725664 | 0.0620903 |
| Popular Items | 90 | 0.0034808 | 0.3132743 | 0.3132743 | 0.0798635 |
| IBCF | 10 | 0.0158059 | 0.1575221 | 0.1575221 | 0.0087109 |
| IBCF | 30 | 0.0102169 | 0.3008850 | 0.3008850 | 0.0260885 |
| IBCF | 50 | 0.0090693 | 0.4336283 | 0.4336283 | 0.0427253 |
| IBCF | 70 | 0.0086940 | 0.5539823 | 0.5539823 | 0.0577672 |
| IBCF | 90 | 0.0080949 | 0.6088496 | 0.6088496 | 0.0698288 |
| UBCF | 10 | 0.0215929 | 0.2159292 | 0.2159292 | 0.0087124 |
| UBCF | 30 | 0.0122124 | 0.3663717 | 0.3663717 | 0.0263879 |
| UBCF | 50 | 0.0092035 | 0.4601770 | 0.4601770 | 0.0441138 |
| UBCF | 70 | 0.0075853 | 0.5309735 | 0.5309735 | 0.0618602 |
| UBCF | 90 | 0.0066077 | 0.5946903 | 0.5946903 | 0.0796129 |

We can compare the models by plotting their ROC curves. A good performance index is the area under the ROC curve (AUC). Even without computing it exactly, we can see that the plain IBCF model (the one without the Jaccard parameter) has the largest area, but only slightly. In the precision-recall chart, UBCF yields the best precision at small values of n. Since UBCF is very close to IBCF on the ROC curves, I consider UBCF the best-performing technique overall.

# roc curves chart
results_tbl %>% 
  ggplot(aes(FPR, TPR, color = fct_reorder2(as.factor(name), FPR, TPR))) +
  geom_line() +
  geom_label(aes(label = n)) +
  labs(title = "ROC Curves", color = "Models") +
  theme_minimal()
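
As a rough numerical check on the ROC comparison, an approximate AUC can also be computed directly from results_tbl with the trapezoidal rule. This is a minimal sketch: it only covers the FPR range spanned by the evaluated values of n, so it understates the full AUC.

# approximate AUC per model: trapezoidal rule over the (FPR, TPR) points
results_tbl %>% 
  group_by(name) %>% 
  arrange(FPR, .by_group = TRUE) %>% 
  summarise(approx_auc = sum(diff(FPR) * (head(TPR, -1) + tail(TPR, -1)) / 2)) %>% 
  arrange(desc(approx_auc))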

# recall / precision curves chart
results_tbl %>%
  ggplot(aes(recall, precision, color = fct_reorder2(as.factor(name), precision, recall))) +
  geom_line() +
  geom_label(aes(label = n)) +
  labs(title = "Precision-Recall Curves", color = "Models") +
  theme_minimal()

### Serendipity

Serendipity measures how surprising the successful or relevant recommendations are. For example, serendipity is when you start a stand-up show on Netflix by an unknown comedian from a foreign country and become a fan by the end of the show. Fortunately, this dataset has other variables that I can use to increase serendipity.

The dataset folder has a file called user_friends.dat, which contains the friend relations between users. To increase serendipity, I will take the top artist of a user's friend and recommend it to that user. However, we cannot measure the accuracy of this approach until it has been tested. Upon testing, I can validate whether it worked by counting how many times the recommended friend's top artist was played. If those play counts are above the average for the regular recommendations and the share of users who return to the artist is not too low, I would call this a successful implementation of serendipity.
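
A minimal sketch of this idea follows. It assumes user_friends.dat has userID and friendID columns, and it re-reads user_artists.dat with its weight (play count) column, which I dropped earlier, to find each friend's most-played artist.

# sketch: recommend one friend's most-played artist to each user
user_friends <- read.table(unz(tf, "user_friends.dat"), header = T, sep = "\t")
plays <- read.table(unz(tf, "user_artists.dat"), header = T, sep = "\t")

friend_top <- plays %>% 
  group_by(userID) %>% 
  slice_max(weight, n = 1, with_ties = FALSE) %>%   # each user's most-played artist
  ungroup() %>% 
  select(friendID = userID, friend_top_artist = artistID)

serendipity_recs <- user_friends %>% 
  inner_join(friend_top, by = "friendID") %>%       # attach each friend's top artist
  group_by(userID) %>% 
  slice_sample(n = 1) %>%                           # keep one friend's suggestion per user
  ungroup()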

Prediction

Finally, I train an IBCF model with k = 50 on the training portion of the evaluation scheme, generate top-10 recommendations from each test user's known items, and display the recommended artists for the first user.

# make prediction
recomm <- Recommender(getData(scheme, 'train'), method = "IBCF", param = list(k = 50))
pred <- predict(recomm, newdata = getData(scheme, 'known'), n = 10)

# prediction for user 1 (as(pred, "list") maps the internal item indices back to artist IDs)
artists %>% 
  filter(id %in% as(pred, "list")[[1]]) %>% 
  kable(col.names = c("ID", "Artist")) %>% 
  kable_styling("striped", full_width = F)
| ID  | Artist                    |
|-----|---------------------------|
| 36  | KMFDM                     |
| 119 | billy gomberg + offthesky |
| 125 | aslope                    |
| 130 | Philippe Lamy             |
| 156 | VAST                      |
| 224 | Yeong-wook Jo             |
| 671 | Alkistis Protopsalti      |
| 970 | Infernal                  |
| 971 | Zoé                       |
| 973 | Hevia                     |

Conclusion

Besides serendipity, I can increase novelty by recommending lesser-known artists to a user. But, as mentioned previously, without testing the model online I cannot evaluate its accuracy.
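
A minimal sketch of such a novelty filter is shown below; it simply keeps the recommended artists whose listener counts in the rating matrix fall in the bottom quartile (the 25% cutoff is an arbitrary choice for illustration).

# sketch: keep only the least-listened-to artists among a user's recommendations
popularity  <- colCounts(rating_matrix)                          # listeners per artist
novel_items <- names(popularity)[popularity <= quantile(popularity, 0.25)]
novel_recs  <- intersect(as(pred, "list")[[1]], novel_items)     # novel picks for user 1
novel_recs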

Another metric would be to evaluate the predicted ratings themselves. In order to recommend items to new users, collaborative filtering estimates the ratings of items that have not yet been consumed and then recommends the top-rated ones. To pick the best model, I would compare the root mean squared error (RMSE) of all the algorithms and choose the one with the lowest score.
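
A minimal sketch of that comparison, assuming a realRatingMatrix is rebuilt from the weight (play count) column that I dropped earlier, might look like this:

# sketch: compare rating-prediction error (RMSE / MSE / MAE) instead of top-N accuracy
plays <- read.table(unz(tf, "user_artists.dat"), header = T, sep = "\t")   # userID, artistID, weight
real_matrix <- as(plays, "realRatingMatrix")     # columns interpreted as user, item, rating
real_scheme <- evaluationScheme(real_matrix[rowCounts(real_matrix) >= 25, ],
                                method = "cross", k = 5, given = -1)
rmse_results <- evaluate(real_scheme,
                         method = list("UBCF" = list(name = "UBCF", param = NULL),
                                       "IBCF" = list(name = "IBCF", param = NULL)),
                         type = "ratings")
avg(rmse_results)   # averaged RMSE, MSE and MAE per algorithm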

In conclusion, in order to achieve my goal of increasing novelty, I would have to design an online evaluation environment. In this environment, I would recommend 5 artists to each user every week: four similar artists and one unknown artist. After releasing the following week's 5 recommendations, I would evaluate the data collected from the previous week. I would repeat these steps for several weeks, until I have sufficient data to analyze whether the recommended unknown artists were successful. If the majority of the recommended unknown artists were listened to, the implementation is successful.