The goal of this assignment is to give you practice working with accuracy and other recommender system metrics.
The dataset used is the Amazon Digital Music dataset: http://jmcauley.ucsd.edu/data/amazon/links.html. This dataset includes 478,235 users, 266,414 items (music/albums) and ratings that range from 1 to 5.
library(dplyr)
library(tidyr)
library(ggplot2)
library(recommenderlab)
library(reshape2)
library(knitr)
# Load the two halves of the ratings file and combine them
MP1 = read.csv("https://raw.githubusercontent.com/ErindaB/Data-612/master/Amazon_Music_P1.csv", header = TRUE)
MP2 = read.csv("https://raw.githubusercontent.com/ErindaB/Data-612/master/Amazon_Music_P2.csv", header = TRUE)
music = rbind(MP1, MP2)
# Keep only the user, item and rating columns
music = music[, 1:3]
kable(head(music))
| user | item | rating |
|---|---|---|
| A2EFCYXHNK06IS | 5555991584 | 5 |
| A1WR23ER5HMAA9 | 5555991584 | 5 |
| A2IR4Q0GPAFJKW | 5555991584 | 4 |
| A2V0KUVAB9HSYO | 5555991584 | 4 |
| A1J0GL9HCA7ELW | 5555991584 | 5 |
| A3EBHHCZO6V2A4 | 5555991584 | 5 |
Because the full matrix was too sparse to retain and evaluate all of the ratings, it was narrowed down to the users and items with the most activity.
set.seed(101)
# Keep the 1,835 most active users...
a = data.frame(head(sort(table(music$user), decreasing = TRUE), 1835))
colnames(a) = c("user", "count")
user_merge = merge(music, a, by = "user")
# ...and, of their ratings, the 988 most-rated items
b = data.frame(head(sort(table(user_merge$item), decreasing = TRUE), 988))
colnames(b) = c("item", "count")
item_merge = merge(user_merge, b, by = "item")
final = subset(item_merge, select = c("user", "item", "rating"))
# Convert to a recommenderlab realRatingMatrix and drop users/items with 5 or fewer ratings
data = as(final, "realRatingMatrix")
data = data[rowCounts(data) > 5, colCounts(data) > 5]
data
## 744 x 988 rating matrix of class 'realRatingMatrix' with 18227 ratings.
print(paste0("Minimum number of ratings: ", min(rowCounts(data))))
## [1] "Minimum number of ratings: 6"
The heatmap of users and music items, together with the histograms that follow, helps show how sparse the matrix is and whether any patterns exist within the data.
image(data, main = "Heatmap of Users and Music Items")
Overall, rating frequency decreases from 5 stars down to 1; the 1- and 2-star counts combined roughly match the 3-star count. The histograms that follow should therefore show users and items clustering around the 4-5 star mark.
qplot(final$rating, geom = "histogram", main = "Histogram of Ratings", xlab = "Rating Scores", binwidth = 0.5, fill = I("cornflowerblue"), col = I("black"))
The average ratings per music item sit primarily in the 4-5 star range, with 4.5 being the most frequent value; few items average below 3 stars.
# Average rating per music item
new = final %>%
  group_by(item) %>%
  summarise(avg_rating = mean(rating))
qplot(new$avg_rating, geom = "histogram", main = "Histogram of Music Ratings", xlab = "Average Rating Score Per Music Item", binwidth = 0.25, fill = I("firebrick"), col = I("black"))
Interestingly, many users give 5 stars on average, and very few give consistently low scores. Although a 5-star average is the most common, the mean/median user average sits around the 4.5-star mark.
# Average rating per user
new2 = final %>%
  group_by(user) %>%
  summarise(avg_rating = mean(rating))
qplot(new2$avg_rating, geom = "histogram", main = "Histogram of User Ratings", xlab = "Average Rating Score Per User", binwidth = 0.25, fill = I("plum"), col = I("black"))
A recommender system was created and its performance was evaluated through accuracy scores (RMSE, MSE and MAE). The data was split into training and test datasets (80 and 20 percent, respectively).
# 80/20 train/test split: each test user keeps 5 "given" ratings for the models,
# the rest are held out as "unknown"; ratings of 4 or more count as good
evaluation = evaluationScheme(data, method = "split", train = 0.8, given = 5, goodRating = 4)
# Evaluation datasets
ev_train = getData(evaluation, "train")
ev_known = getData(evaluation, "known")
ev_unknown = getData(evaluation, "unknown")
The primary recommender system was a User-Based Collaborative Filter (UBCF), which makes recommendations by measuring the similarity between users.
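To illustrate what "similarity between users" means here, the snippet below (an illustrative sketch, not part of the original pipeline) uses recommenderlab's similarity() function to compute cosine similarities among the first few training users:
# Cosine similarity between the first five training users (sketch)
user_sim = similarity(ev_train[1:5, ], method = "cosine", which = "users")
round(as.matrix(user_sim), 2)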
# User-User
ubcf_train = Recommender(ev_train, "UBCF")
ubcf_preds = predict(ubcf_train, ev_known, type = "ratings")
ubcf_preds
## 149 x 988 rating matrix of class 'realRatingMatrix' with 62910 ratings.
# Item-Item
ibcf_train = Recommender(ev_train, "IBCF")
ibcf_preds = predict(ibcf_train, ev_known, type = "ratings")
# Popular
pop_train = Recommender(ev_train, "POPULAR")
pop_preds = predict(pop_train, ev_known, type = "ratings")
# SVD
svd_train = Recommender(ev_train, "SVD")
svd_preds = predict(svd_train, ev_known, type = "ratings")
The accuracy measures, shown in the table below, indicate that the Popular model was the most successful, with the SVD model a close second. By RMSE, the IBCF and UBCF models underperformed in comparison, although IBCF did record the lowest MAE.
accuracy = rbind(
UBCF = calcPredictionAccuracy(ubcf_preds, ev_unknown),
IBCF = calcPredictionAccuracy(ibcf_preds, ev_unknown),
SVD = calcPredictionAccuracy(svd_preds, ev_unknown),
POPULAR = calcPredictionAccuracy(pop_preds, ev_unknown)
)
acc_df = round(as.data.frame(accuracy), 3)
kable(acc_df[order(acc_df$RMSE),])
|  | RMSE | MSE | MAE |
|---|---|---|---|
| POPULAR | 0.915 | 0.838 | 0.674 |
| SVD | 0.956 | 0.914 | 0.693 |
| IBCF | 1.070 | 1.146 | 0.659 |
| UBCF | 1.092 | 1.193 | 0.807 |
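As a sanity check, these measures can be reproduced by hand from the predicted and held-out ratings. The sketch below does this for the UBCF model and should closely match the UBCF row of the table above (it computes the metrics over all held-out ratings at once rather than per user):
# Hand-computed accuracy for the UBCF predictions (sketch)
pred_mat = as(ubcf_preds, "matrix")   # NA where no prediction was made
true_mat = as(ev_unknown, "matrix")   # NA where no held-out rating exists
err = pred_mat - true_mat             # defined only where both values exist
MAE  = mean(abs(err), na.rm = TRUE)
MSE  = mean(err^2, na.rm = TRUE)
RMSE = sqrt(MSE)
round(c(RMSE = RMSE, MSE = MSE, MAE = MAE), 3)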
However, the ROC plots tell a different story: the IBCF model does not perform as poorly as the accuracy table suggests, and around the 0.07 mark on the FPR axis it overtakes the Popular model (the most successful model of the group). Note that the UBCF model failed under this cross-validation setup and was dropped from the results, as the warning in the output below shows.
eval_sets = evaluationScheme(data = data, method = "cross-validation", k = 4, given = 5, goodRating = 4)
mult_models = list(
UBCF = list(name = "UBCF", param = list(method = "pearson")),
IBCF = list(name = "IBCF", param = list(method = "pearson")),
Popular = list(name = "POPULAR", param = NULL),
SVD = list(name = "SVD", param = NULL)
)
# Testing models
models = evaluate(eval_sets, mult_models, n= c(1, 5, seq(10, 100, 10)))
## UBCF run fold/sample [model time/prediction time]
## 1
## Timing stopped at: 0.14 0 0.14
## Error in neighbors[, x] : incorrect number of dimensions
## IBCF run fold/sample [model time/prediction time]
## 1 [8.35sec/0.03sec]
## 2 [8.91sec/0.1sec]
## 3 [7.78sec/0.06sec]
## 4 [8.06sec/0.07sec]
## POPULAR run fold/sample [model time/prediction time]
## 1 [0sec/0.38sec]
## 2 [0sec/0.44sec]
## 3 [0sec/0.36sec]
## 4 [0sec/0.38sec]
## SVD run fold/sample [model time/prediction time]
## 1 [0.11sec/0.18sec]
## 2 [0.06sec/0.24sec]
## 3 [0.41sec/0.13sec]
## 4 [0.06sec/0.16sec]
## Warning in .local(x, method, ...):
## Recommender 'UBCF' has failed and has been removed from the results!
# Plotting models
plot(models, annotate = T, legend="topleft")
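The crossover described above can also be inspected numerically. In recent recommenderlab versions, avg() averages a model's confusion matrices across the four folds, and the TPR/FPR columns are the coordinates plotted on the ROC curve (a sketch; the UBCF entry is absent because that run failed):
# Averaged TPR/FPR across the 4 folds for the two leading models (sketch)
avg(models[["IBCF"]])[, c("TPR", "FPR")]
avg(models[["Popular"]])[, c("TPR", "FPR")]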
Since the Popular system was the most successful, the most popular recommendations were extracted. The recommendation lists of the first five users show that the top-20 recommendations are essentially identical; variation only appears further down the lists.
# Top-20 Popular recommendations for the first 100 users in the training data
pres = predict(pop_train, 1:100, data = ev_train, n = 20)
pres@items[1:5]
## $A36L3NRW858ROT
## [1] 151 375 319 107 442 41 369 475 142 781 43 27 245 42 362 186 40
## [18] 483 38 44
##
## $A1IANEBSMVGHS9
## [1] 151 375 319 107 442 41 369 475 142 781 43 27 245 42 362 186 40
## [18] 483 38 44
##
## $A22KYWYEL6GBES
## [1] 151 375 319 107 442 41 369 475 142 781 43 27 245 42 362 186 40
## [18] 483 38 44
##
## $A38ZTUNQJVGP6S
## [1] 151 375 319 107 442 41 369 475 142 781 43 27 245 42 362 186 40
## [18] 483 38 44
##
## $A2SPI5WNZLOJ2U
## [1] 151 375 319 107 442 41 369 475 142 781 43 27 245 42 362 186 40
## [18] 483 38 44
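The numbers above are column positions in the rating matrix rather than Amazon item IDs. A short sketch for translating the first user's list back to item IDs, assuming the indices refer to the column ordering of ev_train; as(pres, "list") should return the labels directly as well:
# Translate the first user's recommended column indices back to item IDs (sketch)
head(colnames(ev_train)[pres@items[[1]]])
# Equivalent, using the built-in coercion to a list of item labels
head(as(pres, "list")[[1]])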
Since the top-20 lists were largely identical across users, one user was selected at random and their 20 recommended items were removed from the matrix, so that more "serendipitous" (less popular) recommendations could be generated by the other algorithms.
# Randomly pick one of the 100 users and keep their 20 recommended column indices
values = unlist(as.vector(head(sample(pres@items[1:100]), 1)), use.names = FALSE)
values
## [1] 151 375 319 107 442 41 369 475 142 781 43 27 245 42 362 186 40
## [18] 483 38 44
data2 = data[,-values]
evaluation2 = evaluationScheme(data2, method="split", train=0.8, given=5, goodRating=4)
#Evaluation datasets
ev_train2 = getData(evaluation2, "train")
ev_known2 = getData(evaluation2, "known")
ev_unknown2 = getData(evaluation2, "unknown")
# User-User
seren_ubcf_train = Recommender(ev_train2, "UBCF")
seren_ubcf_preds = predict(seren_ubcf_train, ev_known2, type = "ratings")
# Item-Item
seren_ibcf_train = Recommender(ev_train2, "IBCF")
seren_ibcf_preds = predict(seren_ibcf_train, ev_known2, type = "ratings")
# Popular
seren_pop_train = Recommender(ev_train2, "POPULAR")
seren_pop_preds = predict(seren_pop_train, ev_known2, type = "ratings")
# SVD
seren_svd_train = Recommender(ev_train2, "SVD")
seren_svd_preds = predict(seren_svd_train, ev_known2, type = "ratings")
The UBCF, IBCF and SVD algorithms were applied to this "less popular" matrix to create recommendations. The Popular algorithm was run as well, although its results are unlikely to carry much meaning here. The ROC plot below shows the Popular model performing best until roughly the 0.05 FPR mark, where it is overtaken by the item-based model; the SVD model underperforms, comparatively, but not by much. As before, the UBCF model failed during cross-validation and is absent from the plot.
eval_sets2 = evaluationScheme(data = data2, method = "cross-validation", k = 4, given = 5, goodRating = 4)
mult_models2 = list(
seren_UBCF = list(name = "UBCF", param = list(method = "pearson")),
seren_IBCF = list(name = "IBCF", param = list(method = "pearson")),
seren_Popular = list(name = "POPULAR", param = NULL),
seren_SVD = list(name = "SVD", param = NULL)
)
# Testing models
models2 = evaluate(eval_sets2, mult_models2, n= c(1, 5, seq(10, 100, 10)))
## UBCF run fold/sample [model time/prediction time]
## 1
## Timing stopped at: 0.19 0 0.19
## Error in neighbors[, x] : incorrect number of dimensions
## IBCF run fold/sample [model time/prediction time]
## 1 [8.15sec/0.38sec]
## 2 [7.44sec/0.08sec]
## 3 [7.83sec/0.05sec]
## 4 [7.45sec/0.06sec]
## POPULAR run fold/sample [model time/prediction time]
## 1 [0sec/0.43sec]
## 2 [0sec/0.41sec]
## 3 [0sec/0.37sec]
## 4 [0sec/0.36sec]
## SVD run fold/sample [model time/prediction time]
## 1 [0.06sec/0.16sec]
## 2 [0.06sec/0.48sec]
## 3 [0.11sec/0.15sec]
## 4 [0.11sec/0.22sec]
## Warning in .local(x, method, ...):
## Recommender 'seren_UBCF' has failed and has been removed from the results!
# Plotting models
plot(models2, annotate = T, legend="topleft")
Comparing the "serendipitous" models with their regular counterparts in the table below, each one posts a higher RMSE than its regular version, and the "serendipitous" IBCF model falls furthest behind the regular Popular, UBCF and SVD models. Since the "serendipitous" Popular model's results cannot really be trusted (the most popular items were deliberately removed), it is probably safest to stick with the "serendipitous" UBCF algorithm, even though the "serendipitous" SVD model posts a slightly lower RMSE.
accuracy2 = rbind(
seren_UBCF = calcPredictionAccuracy(seren_ubcf_preds, ev_unknown2),
seren_IBCF = calcPredictionAccuracy(seren_ibcf_preds, ev_unknown2),
seren_SVD = calcPredictionAccuracy(seren_svd_preds, ev_unknown2),
seren_POPULAR = calcPredictionAccuracy(seren_pop_preds, ev_unknown2)
)
acc_df2 = round(as.data.frame(accuracy2), 3)
comp = rbind(acc_df, acc_df2)
kable(comp[order(comp$RMSE),])
|  | RMSE | MSE | MAE |
|---|---|---|---|
| POPULAR | 0.915 | 0.838 | 0.674 |
| SVD | 0.956 | 0.914 | 0.693 |
| seren_POPULAR | 1.037 | 1.076 | 0.782 |
| IBCF | 1.070 | 1.146 | 0.659 |
| UBCF | 1.092 | 1.193 | 0.807 |
| seren_SVD | 1.105 | 1.221 | 0.823 |
| seren_UBCF | 1.177 | 1.386 | 0.891 |
| seren_IBCF | 1.397 | 1.952 | 0.954 |
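The table can be summarised by pairing each "serendipitous" model with its regular counterpart and computing the change in RMSE; the sketch below assumes the acc_df and acc_df2 data frames created above are still in memory.
# RMSE change after removing the 20 most popular items (positive = worse)
rmse_compare = data.frame(
  regular = acc_df[c("UBCF", "IBCF", "SVD", "POPULAR"), "RMSE"],
  seren   = acc_df2[c("seren_UBCF", "seren_IBCF", "seren_SVD", "seren_POPULAR"), "RMSE"],
  row.names = c("UBCF", "IBCF", "SVD", "POPULAR")
)
rmse_compare$change = rmse_compare$seren - rmse_compare$regular
rmse_compare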
Overall, setting the Popular model aside, a user-to-user algorithm remains the preferred approach for this dataset, though removing the 20 most popular music items led to higher (worse) RMSE values across the board.
One way to measure serendipity (or recommendation success) in an online environment is to track whether a recommended album or track is actually played by the user, or whether the user goes on to explore music by a recommended artist. This could include measuring a shift in music preferences, using a diversity metric to see whether the user's taste has expanded into something they had not tried before. In practice, it could be recorded by counting how many times a user clicks on a newly recommended artist or album, or by measuring how often, and for how long, a recommended track is listened to.
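As a purely hypothetical sketch of how such tracking could be summarised, the snippet below imagines a small online log with one row per recommendation shown; neither the logs data frame nor its columns exist anywhere in this project, and the play-rate and serendipity definitions are only illustrative.
# Hypothetical online log: one row per recommendation shown to a user.
# None of these columns come from the Amazon dataset; they only illustrate
# how a play rate and a "new artist" serendipity measure could be computed.
logs = data.frame(
  user       = c("u1", "u1", "u2", "u2", "u3"),
  played     = c(TRUE, FALSE, TRUE, TRUE, FALSE),
  new_artist = c(TRUE, FALSE, FALSE, TRUE, FALSE)
)
play_rate   = mean(logs$played)                    # recommendation success rate
serendipity = mean(logs$played & logs$new_artist)  # played AND outside prior taste
c(play_rate = play_rate, serendipity = serendipity)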