Introduction

The goal of this assignment is to compare the accuracy of several recommender system algorithms. I will build functions to evaluate the models and compare their accuracies in order to identify the best one.

# load libraries
library(tidyverse)
library(kableExtra)
library(knitr)
library(recommenderlab)

Preparing the data

The dataset I will work with is hosted on GroupLens and was generated by the Information Retrieval Group at Universidad Autónoma de Madrid. It contains social networking, tagging, and music artist listening information for a set of roughly 2,000 users of the Last.fm online music system. For the purpose of this assignment, I will work with only three variables from the dataset: userID, artistID, and the artist name.

Importing the data

First, I create a temporary folder and download the zip file directly from GroupLens. Then I extract the file user_artists.dat and print its first few rows. user_artists.dat contains a user ID and an artist ID; every row represents a "user listened to artist" relationship. I also extract artists.dat, which maps artist IDs to artist names.

# create temporary folder to download the file
td = tempdir()
tf = tempfile(tmpdir = td, fileext = ".zip")
download.file("http://files.grouplens.org/datasets/hetrec2011/hetrec2011-lastfm-2k.zip", tf)

# load the listening data and drop the play-count column (only the user-artist
# relationship is needed for a binary matrix)
user_artists <- read.table(unz(tf, "user_artists.dat"), header = T, sep = '\t')
user_artists$weight <- NULL

# load the artist lookup table, keep only the id and name columns,
# and store the id as character so it can be matched to item labels later
artists <- read.table(unz(tf, "artists.dat"), sep = "\t", quote = "", header = T,  fill = T)
artists <- artists[, -c(3:4)]
artists$id <- as(artists$id, "character")

# display the data
head(user_artists) %>% 
  kable() %>% 
  kable_styling("striped", full_width = F)
| userID | artistID |
|-------:|---------:|
|      2 |       51 |
|      2 |       52 |
|      2 |       53 |
|      2 |       54 |
|      2 |       55 |
|      2 |       56 |

Defining a rating matrix

The dataset has 1892 users and 17632 artists. A matrix of this magnitude will not work on my current hardware, so I will sample only 30% of the data and create a binary rating matrix. A binary rating matrix is the most appropriate structure here: 1 means the user listened to the artist, 0 means the user did not.

I also keep only the most relevant part of the data: artists listened to by at least a minimum number of users, since artists with very few listeners could produce biased results, and users who have listened to at least a minimum number of artists, for the same reason.

Lastly, I plot the user-item rating matrix.

# reshape the data into a wide user x artist table, sample 30% of the rows (users)
# and coerce the result to a binary rating matrix (non-zero cells become TRUE)
set.seed(123)

user_artists <- user_artists %>% 
  select(userID, artistID) %>% 
  spread(artistID, -userID, fill = 0) %>% 
  select(-userID) %>% 
  as_tibble() %>% 
  sample_frac(0.30, replace = TRUE) %>% 
  as.matrix() %>% 
  as("binaryRatingMatrix")

# keep users with at least 25 listened artists and artists with at least 5 listeners
rating_matrix <- user_artists[rowCounts(user_artists) >= 25, colCounts(user_artists) >= 5]
rating_matrix
## 557 x 1123 rating matrix of class 'binaryRatingMatrix' with 18245 ratings.
# plot matrix
image(rating_matrix, main = "Binary rating matrix")

### Build the model

To measure performance across models more accurately, I will use the k-fold method to split the data. The evaluationScheme function does this by splitting the data into several chunks, holding one chunk out as the test set, and evaluating accuracy on it. I will split the dataset into 5 chunks; for each test user, all but 1 randomly selected item are given to the recommender, and the single withheld item is used for evaluation. Lastly, I set goodRating to 1, the threshold at which a rating is considered good (in a binary matrix every observed rating is 1).

# split dataset using k-fold method
set.seed(1)
scheme <- rating_matrix %>% 
  evaluationScheme(method = "cross", k = 5, given = -1, goodRating = 1)
scheme
## Evaluation scheme using all-but-1 items
## Method: 'cross-validation' with 5 run(s).
## Good ratings: >=1.000000
## Data set: 557 x 1123 rating matrix of class 'binaryRatingMatrix' with 18245 ratings.

In order to compare different models, we first need to define them. Each model is stored in a list with its name and parameters. I will use the Random, Popular, UBCF, and IBCF methods. Following the textbook's advice, I will also build an IBCF model with the distance method set to Jaccard, since the rating matrix is binary. Lastly, I evaluate the recommender models' performance as a function of n, the number of items recommended to each user.

# create list of algorithms
algorithms <- list( "IBCF Jaccard" = list(name = "IBCF", param = list(method = "Jaccard")),
                    "Random Items" = list(name = "RANDOM", param = NULL),
                    "Popular Items" = list(name  = "POPULAR", param = NULL),
                    "IBCF" = list(name = "IBCF", param = list(k = 5)),
                    "UBCF" = list(name = "UBCF", param = list(method = "Cosine", nn = 500)))

# evaluate the recommender models
results <- evaluate(scheme, method = algorithms, n = seq(10, 100, 20))
## IBCF run fold/sample [model time/prediction time]
##   1  [3.77sec/0.07sec] 
##   2  [3.35sec/0.06sec] 
##   3  [3.19sec/0.1sec] 
##   4  [3.3sec/0.05sec] 
##   5  [3.29sec/0.03sec] 
## RANDOM run fold/sample [model time/prediction time]
##   1  [0sec/0.05sec] 
##   2  [0sec/0.06sec] 
##   3  [0sec/0.07sec] 
##   4  [0sec/0.05sec] 
##   5  [0sec/0.06sec] 
## POPULAR run fold/sample [model time/prediction time]
##   1  [0sec/0.17sec] 
##   2  [0.01sec/0.19sec] 
##   3  [0sec/0.19sec] 
##   4  [0sec/0.17sec] 
##   5  [0sec/0.17sec] 
## IBCF run fold/sample [model time/prediction time]
##   1  [3.21sec/0.02sec] 
##   2  [3.4sec/0.02sec] 
##   3  [3.33sec/0.03sec] 
##   4  [2.97sec/0.01sec] 
##   5  [2.92sec/0.03sec] 
## UBCF run fold/sample [model time/prediction time]
##   1  [0sec/0.53sec] 
##   2  [0sec/0.51sec] 
##   3  [0sec/0.53sec] 
##   4  [0sec/0.53sec] 
##   5  [0sec/0.53sec]

Evaluating the recommendations

We can evaluate the recommendations by comparing them with the withheld items that have a positive rating. I will extract the confusion matrix from the results object. A confusion matrix is a table often used to describe the performance of a classification model; from it we obtain precision (the share of recommended items that are relevant), recall/TPR (the share of relevant items that are recommended), and FPR (the share of non-relevant items that are recommended). We cannot tell at a glance which model performs best, so I will also plot the results.

# create function to average the confusion matrix over the folds
avg_conf_matr <- function(results){
  tmp <- results %>%
    getConfusionMatrix() %>%   # one confusion matrix per fold
    as.list() 
  # element-wise average across folds, then keep the key metrics
  as.data.frame(Reduce("+", tmp) / length(tmp)) %>%
    mutate(n = seq(10, 100, 20)) %>%   # the values of n passed to evaluate()
    select('n', 'precision', 'recall', 'TPR', 'FPR') 
}

# get confusion matrix from the results
results_tbl <- results %>% 
  map(avg_conf_matr) %>% 
  enframe() %>% 
  unnest()

# print confusion matrix 
results_tbl %>% kable() %>% kable_styling("striped", full_width = T)
| name | n | precision | recall | TPR | FPR |
|------|--:|----------:|-------:|----:|----:|
| IBCF Jaccard | 10 | 0.0109735 | 0.1097345 | 0.1097345 | 0.0088070 |
| IBCF Jaccard | 30 | 0.0066350 | 0.1982301 | 0.1982301 | 0.0265014 |
| IBCF Jaccard | 50 | 0.0057382 | 0.2831858 | 0.2831858 | 0.0441296 |
| IBCF Jaccard | 70 | 0.0051011 | 0.3504425 | 0.3504425 | 0.0617530 |
| IBCF Jaccard | 90 | 0.0048681 | 0.4283186 | 0.4283186 | 0.0793481 |
| Random Items | 10 | 0.0014159 | 0.0141593 | 0.0141593 | 0.0088921 |
| Random Items | 30 | 0.0007670 | 0.0230088 | 0.0230088 | 0.0266937 |
| Random Items | 50 | 0.0007434 | 0.0371681 | 0.0371681 | 0.0444905 |
| Random Items | 70 | 0.0007332 | 0.0513274 | 0.0513274 | 0.0622873 |
| Random Items | 90 | 0.0007473 | 0.0672566 | 0.0672566 | 0.0800826 |
| Popular Items | 10 | 0.0083186 | 0.0831858 | 0.0831858 | 0.0088306 |
| Popular Items | 30 | 0.0050737 | 0.1522124 | 0.1522124 | 0.0265786 |
| Popular Items | 50 | 0.0041416 | 0.2070796 | 0.2070796 | 0.0443392 |
| Popular Items | 70 | 0.0038938 | 0.2725664 | 0.2725664 | 0.0620903 |
| Popular Items | 90 | 0.0034808 | 0.3132743 | 0.3132743 | 0.0798635 |
| IBCF | 10 | 0.0158059 | 0.1575221 | 0.1575221 | 0.0087109 |
| IBCF | 30 | 0.0102169 | 0.3008850 | 0.3008850 | 0.0260885 |
| IBCF | 50 | 0.0090693 | 0.4336283 | 0.4336283 | 0.0427253 |
| IBCF | 70 | 0.0086940 | 0.5539823 | 0.5539823 | 0.0577672 |
| IBCF | 90 | 0.0080949 | 0.6088496 | 0.6088496 | 0.0698288 |
| UBCF | 10 | 0.0215929 | 0.2159292 | 0.2159292 | 0.0087124 |
| UBCF | 30 | 0.0122124 | 0.3663717 | 0.3663717 | 0.0263879 |
| UBCF | 50 | 0.0092035 | 0.4601770 | 0.4601770 | 0.0441138 |
| UBCF | 70 | 0.0075853 | 0.5309735 | 0.5309735 | 0.0618602 |
| UBCF | 90 | 0.0066077 | 0.5946903 | 0.5946903 | 0.0796129 |

We can compare the models by plotting their ROC curves. A good performance index is the area under the ROC curve (AUC). Even without computing it exactly, we can see that the plain IBCF model (the one without the Jaccard parameter) has the largest area, but only slightly. In the precision-recall chart, UBCF yields the best precision at small values of n. Since UBCF is very close to IBCF on the ROC curves, I consider UBCF the best-performing technique overall.

# roc curves chart
results_tbl %>% 
  ggplot(aes(FPR, TPR, color = fct_reorder2(as.factor(name), FPR, TPR))) +
  geom_line() +
  geom_label(aes(label = n)) +
  labs(title = "ROC Curves", color = "Models") +
  theme_minimal()
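
As a rough numerical check on the ROC comparison, an approximate AUC can also be computed directly from results_tbl with the trapezoidal rule. This is a minimal sketch: it only covers the FPR range spanned by the evaluated values of n, so it understates the full AUC.

# approximate AUC per model: trapezoidal rule over the (FPR, TPR) points
results_tbl %>% 
  group_by(name) %>% 
  arrange(FPR, .by_group = TRUE) %>% 
  summarise(approx_auc = sum(diff(FPR) * (head(TPR, -1) + tail(TPR, -1)) / 2)) %>% 
  arrange(desc(approx_auc))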

# recall / precision curves chart
results_tbl %>%
  ggplot(aes(recall, precision, color = fct_reorder2(as.factor(name), precision, recall))) +
  geom_line() +
  geom_label(aes(label = n)) +
  labs(title = "Precision-Recall Curves", color = "Models") +
  theme_minimal()

### Serendipity

Serendipity measures how surprising the successful or relevant recommendations are. For example, serendipity is when you start a stand-up show on Netflix by an unknown comedian from a foreign country and become a fan by the end of the show. Fortunately, this dataset has other variables that I can use to increase serendipity.

The dataset folder has a file called user_friends.dat, which contains the friend relations between users. To increase serendipity, I will take the top artist of a user's friend and recommend it to that user. However, we cannot measure the accuracy of this approach until it has been tested. Upon testing, I can validate whether it worked by counting how many times the recommended friend's top artist was played. If those play counts are above the average for the regular recommendations and the share of users who return to the artist is not too low, I would call this a successful implementation of serendipity.
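
A minimal sketch of this idea follows. It assumes user_friends.dat has userID and friendID columns, and it re-reads user_artists.dat with its weight (play count) column, which I dropped earlier, to find each friend's most-played artist.

# sketch: recommend one friend's most-played artist to each user
user_friends <- read.table(unz(tf, "user_friends.dat"), header = T, sep = "\t")
plays <- read.table(unz(tf, "user_artists.dat"), header = T, sep = "\t")

friend_top <- plays %>% 
  group_by(userID) %>% 
  slice_max(weight, n = 1, with_ties = FALSE) %>%   # each user's most-played artist
  ungroup() %>% 
  select(friendID = userID, friend_top_artist = artistID)

serendipity_recs <- user_friends %>% 
  inner_join(friend_top, by = "friendID") %>%       # attach each friend's top artist
  group_by(userID) %>% 
  slice_sample(n = 1) %>%                           # keep one friend's suggestion per user
  ungroup()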

Prediction

Finally, I train an IBCF model with k = 50 on the training portion of the evaluation scheme, generate top-10 recommendations from each test user's known items, and display the recommended artists for the first user.

# make prediction
recomm <- Recommender(getData(scheme, 'train'), method = "IBCF", param = list(k = 50))
pred <- predict(recomm, newdata = getData(scheme, 'known'), n = 10)

# prediction for user 1 (as(pred, "list") maps the internal item indices back to artist IDs)
artists %>% 
  filter(id %in% as(pred, "list")[[1]]) %>% 
  kable(col.names = c("ID", "Artist")) %>% 
  kable_styling("striped", full_width = F)
| ID  | Artist                    |
|-----|---------------------------|
| 36  | KMFDM                     |
| 119 | billy gomberg + offthesky |
| 125 | aslope                    |
| 130 | Philippe Lamy             |
| 156 | VAST                      |
| 224 | Yeong-wook Jo             |
| 671 | Alkistis Protopsalti      |
| 970 | Infernal                  |
| 971 | Zoé                       |
| 973 | Hevia                     |

Conclusion

Besides serendipity, I can increase novelty by recommending lesser-known artists to a user. But, as mentioned previously, without testing the model online I cannot evaluate its accuracy.
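
A minimal sketch of such a novelty filter is shown below; it simply keeps the recommended artists whose listener counts in the rating matrix fall in the bottom quartile (the 25% cutoff is an arbitrary choice for illustration).

# sketch: keep only the least-listened-to artists among a user's recommendations
popularity  <- colCounts(rating_matrix)                          # listeners per artist
novel_items <- names(popularity)[popularity <= quantile(popularity, 0.25)]
novel_recs  <- intersect(as(pred, "list")[[1]], novel_items)     # novel picks for user 1
novel_recs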

Another metric would be to evaluate the predicted ratings themselves. In order to recommend items to new users, collaborative filtering estimates the ratings of items that have not yet been consumed and then recommends the top-rated ones. To pick the best model, I would compare the root mean squared error (RMSE) of all the algorithms and choose the one with the lowest score.
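
A minimal sketch of that comparison, assuming a realRatingMatrix is rebuilt from the weight (play count) column that I dropped earlier, might look like this:

# sketch: compare rating-prediction error (RMSE / MSE / MAE) instead of top-N accuracy
plays <- read.table(unz(tf, "user_artists.dat"), header = T, sep = "\t")   # userID, artistID, weight
real_matrix <- as(plays, "realRatingMatrix")     # columns interpreted as user, item, rating
real_scheme <- evaluationScheme(real_matrix[rowCounts(real_matrix) >= 25, ],
                                method = "cross", k = 5, given = -1)
rmse_results <- evaluate(real_scheme,
                         method = list("UBCF" = list(name = "UBCF", param = NULL),
                                       "IBCF" = list(name = "IBCF", param = NULL)),
                         type = "ratings")
avg(rmse_results)   # averaged RMSE, MSE and MAE per algorithm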

In conclusion, in order to achieve my goal of increasing novelty, I would have to design an online evaluation environment. In this environment, I would recommend 5 artists to each user every week: four similar artists and one unknown artist. After releasing the following week's 5 recommendations, I would evaluate the data collected from the previous week. I would repeat these steps for several weeks, until I have sufficient data to analyze whether the recommended unknown artists were successful. If the majority of the recommended unknown artists were listened to, the implementation is successful.