The goal of this assignment is to give you practice working with accuracy and other recommender system metrics.
The dataset used is the Amazon Digital Music dataset: http://jmcauley.ucsd.edu/data/amazon/links.html. This dataset includes 478,235 users, 266,414 items (music/albums) and ratings that range from 1 to 5.
library(dplyr)
library(tidyr)
library(ggplot2)
library(recommenderlab)
library(reshape2)
library(knitr)
# Load the two halves of the ratings file and combine them
MP1 = read.csv("https://raw.githubusercontent.com/ErindaB/Data-612/master/Amazon_Music_P1.csv", header = TRUE)
MP2 = read.csv("https://raw.githubusercontent.com/ErindaB/Data-612/master/Amazon_Music_P2.csv", header = TRUE)
music = rbind(MP1, MP2)
# Keep only the user, item and rating columns
music = music[, 1:3]
kable(head(music))
| user | item | rating |
|---|---|---|
| A2EFCYXHNK06IS | 5555991584 | 5 |
| A1WR23ER5HMAA9 | 5555991584 | 5 |
| A2IR4Q0GPAFJKW | 5555991584 | 4 |
| A2V0KUVAB9HSYO | 5555991584 | 4 |
| A1J0GL9HCA7ELW | 5555991584 | 5 |
| A3EBHHCZO6V2A4 | 5555991584 | 5 |
Because the full matrix was too sparse to retain and evaluate all of the ratings, it was narrowed down to the users and items with the most activity.
set.seed(101)
# Keep the 1,835 most active users...
a = data.frame(head(sort(table(music$user), decreasing = TRUE), 1835))
colnames(a) = c("user", "count")
user_merge = merge(music, a, by = "user")
# ...and, of their ratings, the 988 most-rated items
b = data.frame(head(sort(table(user_merge$item), decreasing = TRUE), 988))
colnames(b) = c("item", "count")
item_merge = merge(user_merge, b, by = "item")
final = subset(item_merge, select = c("user", "item", "rating"))
# Convert to a recommenderlab realRatingMatrix and drop users/items with 5 or fewer ratings
data = as(final, "realRatingMatrix")
data = data[rowCounts(data) > 5, colCounts(data) > 5]
data
## 744 x 988 rating matrix of class 'realRatingMatrix' with 18227 ratings.
print(paste0("Minimum number of ratings: ", min(rowCounts(data))))
## [1] "Minimum number of ratings: 6"
The heatmap of users and music items, together with the histograms that follow, helps show how sparse the matrix is and whether any patterns exist within the data.
image(data, main = "Heatmap of Users and Music Items")
Overall, rating frequency decreases from 5 stars down to 1; the 1- and 2-star counts combined roughly match the 3-star count. The histograms that follow should therefore show users and items clustering around the 4-5 star mark.
qplot(final$rating, geom = "histogram", main = "Histogram of Ratings", xlab = "Rating Scores", binwidth = 0.5, fill = I("cornflowerblue"), col = I("black"))
The average ratings per music item sit primarily in the 4-5 star range, with 4.5 being the most frequent value; few items average below 3 stars.
# Average rating per music item
new = final %>%
  group_by(item) %>%
  summarise(avg_rating = mean(rating))
qplot(new$avg_rating, geom = "histogram", main = "Histogram of Music Ratings", xlab = "Average Rating Score Per Music Item", binwidth = 0.25, fill = I("firebrick"), col = I("black"))
Interestingly, many users give 5 stars on average, and very few give consistently low scores. Although a 5-star average is the most common, the mean/median user average sits around the 4.5-star mark.
# Average rating per user
new2 = final %>%
  group_by(user) %>%
  summarise(avg_rating = mean(rating))
qplot(new2$avg_rating, geom = "histogram", main = "Histogram of User Ratings", xlab = "Average Rating Score Per User", binwidth = 0.25, fill = I("plum"), col = I("black"))
A recommender system was created and its performance was evaluated through accuracy scores (RMSE, MSE and MAE). The data was split into training and test datasets (80 and 20 percent, respectively).
# 80/20 train/test split: each test user keeps 5 "given" ratings for the models,
# the rest are held out as "unknown"; ratings of 4 or more count as good
evaluation = evaluationScheme(data, method = "split", train = 0.8, given = 5, goodRating = 4)
# Evaluation datasets
ev_train = getData(evaluation, "train")
ev_known = getData(evaluation, "known")
ev_unknown = getData(evaluation, "unknown")
The primary recommender system was a User-Based Collaborative Filter (UBCF), which makes recommendations by measuring the similarity between users.
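To illustrate what "similarity between users" means here, the snippet below (an illustrative sketch, not part of the original pipeline) uses recommenderlab's similarity() function to compute cosine similarities among the first few training users:
# Cosine similarity between the first five training users (sketch)
user_sim = similarity(ev_train[1:5, ], method = "cosine", which = "users")
round(as.matrix(user_sim), 2)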
# User-User
ubcf_train = Recommender(ev_train, "UBCF")
ubcf_preds = predict(ubcf_train, ev_known, type = "ratings")
ubcf_preds
## 149 x 988 rating matrix of class 'realRatingMatrix' with 62910 ratings.
# Item-Item
ibcf_train = Recommender(ev_train, "IBCF")
ibcf_preds = predict(ibcf_train, ev_known, type = "ratings")
# Popular
pop_train = Recommender(ev_train, "POPULAR")
pop_preds = predict(pop_train, ev_known, type = "ratings")
# SVD
svd_train = Recommender(ev_train, "SVD")
svd_preds = predict(svd_train, ev_known, type = "ratings")
The accuracy measures, shown in the table below, indicate that the Popular model was the most successful, with the SVD model a close second. By RMSE, the IBCF and UBCF models underperformed in comparison, although IBCF did record the lowest MAE.
accuracy = rbind(
UBCF = calcPredictionAccuracy(ubcf_preds, ev_unknown),
IBCF = calcPredictionAccuracy(ibcf_preds, ev_unknown),
SVD = calcPredictionAccuracy(svd_preds, ev_unknown),
POPULAR = calcPredictionAccuracy(pop_preds, ev_unknown)
)
acc_df = round(as.data.frame(accuracy), 3)
kable(acc_df[order(acc_df$RMSE),])
|  | RMSE | MSE | MAE |
|---|---|---|---|
| POPULAR | 0.915 | 0.838 | 0.674 |
| SVD | 0.956 | 0.914 | 0.693 |
| IBCF | 1.070 | 1.146 | 0.659 |
| UBCF | 1.092 | 1.193 | 0.807 |
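As a sanity check, these measures can be reproduced by hand from the predicted and held-out ratings. The sketch below does this for the UBCF model and should closely match the UBCF row of the table above (it computes the metrics over all held-out ratings at once rather than per user):
# Hand-computed accuracy for the UBCF predictions (sketch)
pred_mat = as(ubcf_preds, "matrix")   # NA where no prediction was made
true_mat = as(ev_unknown, "matrix")   # NA where no held-out rating exists
err = pred_mat - true_mat             # defined only where both values exist
MAE  = mean(abs(err), na.rm = TRUE)
MSE  = mean(err^2, na.rm = TRUE)
RMSE = sqrt(MSE)
round(c(RMSE = RMSE, MSE = MSE, MAE = MAE), 3)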
However, the ROC plots tell a different story: the IBCF model does not perform as poorly as the accuracy table suggests, and around the 0.07 mark on the FPR axis it overtakes the Popular model (the most successful model of the group). Note that the UBCF model failed under this cross-validation setup and was dropped from the results, as the warning in the output below shows.
eval_sets = evaluationScheme(data = data, method = "cross-validation", k = 4, given = 5, goodRating = 4)
mult_models = list(
UBCF = list(name = "UBCF", param = list(method = "pearson")),
IBCF = list(name = "IBCF", param = list(method = "pearson")),
Popular = list(name = "POPULAR", param = NULL),
SVD = list(name = "SVD", param = NULL)
)
# Testing models
models = evaluate(eval_sets, mult_models, n= c(1, 5, seq(10, 100, 10)))
## UBCF run fold/sample [model time/prediction time]
## 1
## Timing stopped at: 0.14 0 0.14
## Error in neighbors[, x] : incorrect number of dimensions
## IBCF run fold/sample [model time/prediction time]
## 1 [8.35sec/0.03sec]
## 2 [8.91sec/0.1sec]
## 3 [7.78sec/0.06sec]
## 4 [8.06sec/0.07sec]
## POPULAR run fold/sample [model time/prediction time]
## 1 [0sec/0.38sec]
## 2 [0sec/0.44sec]
## 3 [0sec/0.36sec]
## 4 [0sec/0.38sec]
## SVD run fold/sample [model time/prediction time]
## 1 [0.11sec/0.18sec]
## 2 [0.06sec/0.24sec]
## 3 [0.41sec/0.13sec]
## 4 [0.06sec/0.16sec]
## Warning in .local(x, method, ...):
## Recommender 'UBCF' has failed and has been removed from the results!
# Plotting models
plot(models, annotate = T, legend="topleft")
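The crossover described above can also be inspected numerically. In recent recommenderlab versions, avg() averages a model's confusion matrices across the four folds, and the TPR/FPR columns are the coordinates plotted on the ROC curve (a sketch; the UBCF entry is absent because that run failed):
# Averaged TPR/FPR across the 4 folds for the two leading models (sketch)
avg(models[["IBCF"]])[, c("TPR", "FPR")]
avg(models[["Popular"]])[, c("TPR", "FPR")]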
Since the Popular system was the most successful, the most popular recommendations were extracted. The recommendation lists of the first five users show that the top-20 recommendations are essentially identical; variation only appears further down the lists.
# Top-20 Popular recommendations for the first 100 users in the training data
pres = predict(pop_train, 1:100, data = ev_train, n = 20)
pres@items[1:5]
## $A36L3NRW858ROT
## [1] 151 375 319 107 442 41 369 475 142 781 43 27 245 42 362 186 40
## [18] 483 38 44
##
## $A1IANEBSMVGHS9
## [1] 151 375 319 107 442 41 369 475 142 781 43 27 245 42 362 186 40
## [18] 483 38 44
##
## $A22KYWYEL6GBES
## [1] 151 375 319 107 442 41 369 475 142 781 43 27 245 42 362 186 40
## [18] 483 38 44
##
## $A38ZTUNQJVGP6S
## [1] 151 375 319 107 442 41 369 475 142 781 43 27 245 42 362 186 40
## [18] 483 38 44
##
## $A2SPI5WNZLOJ2U
## [1] 151 375 319 107 442 41 369 475 142 781 43 27 245 42 362 186 40
## [18] 483 38 44
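The numbers above are column positions in the rating matrix rather than Amazon item IDs. A short sketch for translating the first user's list back to item IDs, assuming the indices refer to the column ordering of ev_train; as(pres, "list") should return the labels directly as well:
# Translate the first user's recommended column indices back to item IDs (sketch)
head(colnames(ev_train)[pres@items[[1]]])
# Equivalent, using the built-in coercion to a list of item labels
head(as(pres, "list")[[1]])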
Since the top-20 lists were largely identical across users, one user was selected at random and their 20 recommended items were removed from the matrix, so that more "serendipitous" (less popular) recommendations could be generated by the other algorithms.
# Randomly pick one of the 100 users and keep their 20 recommended column indices
values = unlist(as.vector(head(sample(pres@items[1:100]), 1)), use.names = FALSE)
values
## [1] 151 375 319 107 442 41 369 475 142 781 43 27 245 42 362 186 40
## [18] 483 38 44
data2 = data[,-values]
evaluation2 = evaluationScheme(data2, method="split", train=0.8, given=5, goodRating=4)
#Evaluation datasets
ev_train2 = getData(evaluation2, "train")
ev_known2 = getData(evaluation2, "known")
ev_unknown2 = getData(evaluation2, "unknown")
# User-User
seren_ubcf_train = Recommender(ev_train2, "UBCF")
seren_ubcf_preds = predict(seren_ubcf_train, ev_known2, type = "ratings")
# Item-Item
seren_ibcf_train = Recommender(ev_train2, "IBCF")
seren_ibcf_preds = predict(seren_ibcf_train, ev_known2, type = "ratings")
# Popular
seren_pop_train = Recommender(ev_train2, "POPULAR")
seren_pop_preds = predict(seren_pop_train, ev_known2, type = "ratings")
# SVD
seren_svd_train = Recommender(ev_train2, "SVD")
seren_svd_preds = predict(seren_svd_train, ev_known2, type = "ratings")
The UBCF, IBCF and SVD algorithms were applied to this "less popular" matrix to create recommendations. The Popular algorithm was run as well, although its results are unlikely to carry much meaning here. The ROC plot below shows the Popular model performing best until roughly the 0.05 FPR mark, where it is overtaken by the item-based model; the SVD model underperforms, comparatively, but not by much. As before, the UBCF model failed during cross-validation and is absent from the plot.
eval_sets2 = evaluationScheme(data = data2, method = "cross-validation", k = 4, given = 5, goodRating = 4)
mult_models2 = list(
seren_UBCF = list(name = "UBCF", param = list(method = "pearson")),
seren_IBCF = list(name = "IBCF", param = list(method = "pearson")),
seren_Popular = list(name = "POPULAR", param = NULL),
seren_SVD = list(name = "SVD", param = NULL)
)
# Testing models
models2 = evaluate(eval_sets2, mult_models2, n= c(1, 5, seq(10, 100, 10)))
## UBCF run fold/sample [model time/prediction time]
## 1
## Timing stopped at: 0.19 0 0.19
## Error in neighbors[, x] : incorrect number of dimensions
## IBCF run fold/sample [model time/prediction time]
## 1 [8.15sec/0.38sec]
## 2 [7.44sec/0.08sec]
## 3 [7.83sec/0.05sec]
## 4 [7.45sec/0.06sec]
## POPULAR run fold/sample [model time/prediction time]
## 1 [0sec/0.43sec]
## 2 [0sec/0.41sec]
## 3 [0sec/0.37sec]
## 4 [0sec/0.36sec]
## SVD run fold/sample [model time/prediction time]
## 1 [0.06sec/0.16sec]
## 2 [0.06sec/0.48sec]
## 3 [0.11sec/0.15sec]
## 4 [0.11sec/0.22sec]
## Warning in .local(x, method, ...):
## Recommender 'seren_UBCF' has failed and has been removed from the results!
# Plotting models
plot(models2, annotate = T, legend="topleft")
Comparing the "serendipitous" models with their regular counterparts in the table below, each one posts a higher RMSE than its regular version, and the "serendipitous" IBCF model falls furthest behind the regular Popular, UBCF and SVD models. Since the "serendipitous" Popular model's results cannot really be trusted (the most popular items were deliberately removed), it is probably safest to stick with the "serendipitous" UBCF algorithm, even though the "serendipitous" SVD model posts a slightly lower RMSE.
accuracy2 = rbind(
seren_UBCF = calcPredictionAccuracy(seren_ubcf_preds, ev_unknown2),
seren_IBCF = calcPredictionAccuracy(seren_ibcf_preds, ev_unknown2),
seren_SVD = calcPredictionAccuracy(seren_svd_preds, ev_unknown2),
seren_POPULAR = calcPredictionAccuracy(seren_pop_preds, ev_unknown2)
)
acc_df2 = round(as.data.frame(accuracy2), 3)
comp = rbind(acc_df, acc_df2)
kable(comp[order(comp$RMSE),])
|  | RMSE | MSE | MAE |
|---|---|---|---|
| POPULAR | 0.915 | 0.838 | 0.674 |
| SVD | 0.956 | 0.914 | 0.693 |
| seren_POPULAR | 1.037 | 1.076 | 0.782 |
| IBCF | 1.070 | 1.146 | 0.659 |
| UBCF | 1.092 | 1.193 | 0.807 |
| seren_SVD | 1.105 | 1.221 | 0.823 |
| seren_UBCF | 1.177 | 1.386 | 0.891 |
| seren_IBCF | 1.397 | 1.952 | 0.954 |
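The table can be summarised by pairing each "serendipitous" model with its regular counterpart and computing the change in RMSE; the sketch below assumes the acc_df and acc_df2 data frames created above are still in memory.
# RMSE change after removing the 20 most popular items (positive = worse)
rmse_compare = data.frame(
  regular = acc_df[c("UBCF", "IBCF", "SVD", "POPULAR"), "RMSE"],
  seren   = acc_df2[c("seren_UBCF", "seren_IBCF", "seren_SVD", "seren_POPULAR"), "RMSE"],
  row.names = c("UBCF", "IBCF", "SVD", "POPULAR")
)
rmse_compare$change = rmse_compare$seren - rmse_compare$regular
rmse_compare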
Overall, setting the Popular model aside, a user-to-user algorithm remains the preferred approach for this dataset, though removing the 20 most popular music items led to higher (worse) RMSE values across the board.
One way to measure serendipity (or recommendation success) in an online environment is to track whether a recommended album or track is actually played by the user, or whether the user goes on to explore music by a recommended artist. This could include measuring a shift in music preferences, using a diversity metric to see whether the user's taste has expanded into something they had not tried before. In practice, it could be recorded by counting how many times a user clicks on a newly recommended artist or album, or by measuring how often, and for how long, a recommended track is listened to.
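As a purely hypothetical sketch of how such tracking could be summarised, the snippet below imagines a small online log with one row per recommendation shown; neither the logs data frame nor its columns exist anywhere in this project, and the play-rate and serendipity definitions are only illustrative.
# Hypothetical online log: one row per recommendation shown to a user.
# None of these columns come from the Amazon dataset; they only illustrate
# how a play rate and a "new artist" serendipity measure could be computed.
logs = data.frame(
  user       = c("u1", "u1", "u2", "u2", "u3"),
  played     = c(TRUE, FALSE, TRUE, TRUE, FALSE),
  new_artist = c(TRUE, FALSE, FALSE, TRUE, FALSE)
)
play_rate   = mean(logs$played)                    # recommendation success rate
serendipity = mean(logs$played & logs$new_artist)  # played AND outside prior taste
c(play_rate = play_rate, serendipity = serendipity)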