The goal of this assignment is to give you practice working with accuracy and other recommender system metrics.
Deliverables
1. As in your previous assignments, compare the accuracy of at least two recommender system algorithms against your offline data.
2. Implement support for at least one business or user experience goal, such as increased serendipity, novelty, or diversity.
3. Compare and report on any change in accuracy before and after you've made the change in #2.
4. As part of your textual conclusion, discuss one or more additional experiments that could be performed and/or metrics that could be evaluated only if online evaluation were possible. Also, briefly propose how you would design a reasonable online evaluation environment.
• The dataset contains product ratings for beauty products sold on Amazon.com; it was downloaded from Kaggle.com.
• The original set contains 2,023,070 observations and 4 variables: User ID, Product ID, Rating (from 1 to 5), and Timestamp. It covers 1,210,271 users and 249,274 products. To make the set more manageable, it has been reduced to a smaller subset.
• The final ratings dataset used is a 3562 x 12057 rating matrix of class ‘realRatingMatrix’ with 68565 ratings.
Now let's walk through the process of reducing the data from the original dataset.
Step-1) Import original file and select sample for project
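The import code is not shown in the write-up; a minimal sketch of what it likely looked like, assuming the Kaggle export file name and the column names used in the exploration below:
library(Matrix)          # sparseMatrix()
library(recommenderlab)  # realRatingMatrix, Recommender(), evaluationScheme()
library(tictoc)          # tic()/toc() timing used later in the report
ratings <- read.csv("ratings_Beauty.csv", stringsAsFactors = TRUE)  # assumed file name
colnames(ratings) <- c("UserId", "ProductId", "Rating", "Timestamp")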
Step-2) Explore
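The output below was presumably produced along these lines (a sketch; the exact chunk is not shown):
head(ratings)
class(ratings$UserId)
class(ratings$ProductId)
class(ratings$Rating)
class(ratings$Timestamp)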
UserId ProductId Rating Timestamp
1 A39HTATAQ9V7YF 0205616461 5 1369699200
2 A3JM6GV9MNOF9X 0558925278 3 1355443200
3 A1Z513UWSAAO0F 0558925278 5 1404691200
4 A1WMRR494NWEWV 0733001998 4 1382572800
5 A3IAAVS479H7M7 0737104473 1 1274227200
6 AKJHHD5VEH7VG 0762451459 5 1404518400
[1] "factor"
[1] "factor"
[1] "numeric"
[1] "integer"
Step-3) Convert to realRatingMatrix
ratingsMatrix <- sparseMatrix(as.integer(ratings$UserId), as.integer(ratings$ProductId),
x = ratings$Rating)
colnames(ratingsMatrix) <- levels(ratings$ProductId)
rownames(ratingsMatrix) <- levels(ratings$UserId)
amazon <- as(ratingsMatrix, "realRatingMatrix")
Step-4) Explore
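The exploration output below can be produced by printing the matrix and tabulating per-user and per-item rating counts; a sketch of the likely calls:
amazon
table(rowCounts(amazon))  # distribution of the number of ratings per user
table(colCounts(amazon))  # distribution of the number of ratings per product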
1210271 x 249274 rating matrix of class 'realRatingMatrix' with 2023070 ratings.
1 2 3 4 5 6 7 8 9 10
887401 175875 64336 30285 16187 9827 6324 4260 3181 2275
11 12 13 14 15 16 17 18 19 20
1745 1402 1012 912 677 550 462 411 323 277
21 22 23 24 25 26 27 28 29 30
262 238 198 160 132 105 129 95 94 82
31 32 33 34 35 36 37 38 39 40
68 53 64 51 49 45 32 41 37 38
41 42 43 44 45 46 47 48 49 50
30 35 29 20 13 25 25 19 19 17
51 52 53 54 55 56 57 58 59 60
13 18 13 15 14 11 15 6 6 13
61 62 63 64 65 66 67 68 69 70
6 9 11 3 10 4 4 5 6 6
71 72 73 74 75
5 2 6 5 7
[ reached getOption("max.print") -- omitted 78 entries ]
1 2 3 4 5 6 7 8 9 10
103484 42209 22334 13902 9623 7214 5592 4404 3574 3059
11 12 13 14 15 16 17 18 19 20
2542 2267 2024 1657 1526 1410 1208 1096 1054 912
21 22 23 24 25 26 27 28 29 30
869 810 723 663 591 545 508 517 468 432
31 32 33 34 35 36 37 38 39 40
381 388 361 338 332 347 276 269 267 261
41 42 43 44 45 46 47 48 49 50
261 245 221 226 177 204 188 177 167 174
51 52 53 54 55 56 57 58 59 60
156 138 151 157 132 142 112 119 105 120
61 62 63 64 65 66 67 68 69 70
103 125 116 111 83 94 99 86 89 92
71 72 73 74 75
89 78 84 85 79
[ reached getOption("max.print") -- omitted 513 entries ]
Step-5) Select Subset 1 and Subset 2
10320 x 12057 rating matrix of class 'realRatingMatrix' with 111871 ratings.
amazonShort <- amazon[, colCounts(amazon) > 30]
amazonShort <- amazonShort[rowCounts(amazonShort) > 10, ]
amazonShort
3562 x 12057 rating matrix of class 'realRatingMatrix' with 68565 ratings.
Step-6) Check and Remove Empty Rows/Columns
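The tabulations and the removal step below are not shown as code in the original; a plausible sketch (the object name amazonFinal is an assumption):
table(rowCounts(amazonShort))  # ratings per user after filtering
table(colCounts(amazonShort))  # ratings per product; a count of 0 marks an empty column
amazonFinal <- amazonShort[, colCounts(amazonShort) > 0]  # drop empty columns
amazonFinal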
11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
598 479 335 274 264 184 163 135 118 99 84 73 73 66 50 54 26 34
29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46
36 31 31 30 22 18 17 15 11 20 15 13 11 8 7 9 8 5
47 48 49 50 51 52 53 54 55 56 57 58 60 61 62 63 64 65
10 10 4 5 8 4 7 6 4 2 3 3 1 2 4 4 1 2
66 67 68 69 70 71 72 73 74 75 76 77 78 79 81 85 86 90
4 4 2 2 1 1 1 3 2 5 2 5 1 3 3 2 1 1
91 93 94
2 1 1
[ reached getOption("max.print") -- omitted 18 entries ]
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
2410 2404 1849 1231 918 587 451 323 217 192 147 114 96 77 51
15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
60 48 48 49 30 36 39 24 18 26 20 20 17 23 18
30 31 32 33 34 35 36 37 38 39 40 41 42 43 44
19 20 21 15 16 10 14 16 10 14 8 12 10 8 9
45 46 47 48 49 50 51 52 53 54 55 56 57 58 59
15 11 9 9 15 13 5 4 6 10 4 9 13 11 5
60 61 62 63 64 65 66 68 69 70 71 72 73 74 75
6 12 6 6 5 8 4 7 2 6 2 6 5 3 8
[ reached getOption("max.print") -- omitted 41 entries ]
3562 x 9647 rating matrix of class 'realRatingMatrix' with 68565 ratings.
Step-7) Convert to data frame and save as CSV file
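The conversion and export code is not shown; a minimal sketch, assuming the amazonFinal object from the previous step and the ratings_final.csv file name used below:
ratings_final <- as(amazonFinal, "data.frame")
colnames(ratings_final) <- c("UserId", "ProductId", "Rating")
write.csv(ratings_final, "ratings_final.csv", row.names = FALSE)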
Import the ratings_final dataset:
ratings <- read.csv("https://raw.githubusercontent.com/PriyaShaji/Data612/master/Project%204/ratings_final.csv")
ratingsMatrix <- sparseMatrix(as.integer(ratings$UserId), as.integer(ratings$ProductId),
x = ratings$Rating)
colnames(ratingsMatrix) <- levels(ratings$ProductId)
rownames(ratingsMatrix) <- levels(ratings$UserId)
amazon <- as(ratingsMatrix, "realRatingMatrix")
Split the dataset into test and train sets to build the model.
# Train/test split
set.seed(88)
eval <- evaluationScheme(amazon, method = "split", train = 0.8, given = 5, goodRating = 3)
train <- getData(eval, "train")
known <- getData(eval, "known")
unknown <- getData(eval, "unknown")
# Set up data frame for timing
timing <- data.frame(Model = factor(), Training = double(), Predicting = double())
Now, let's build three different models.
USER BASED COLLABORATIVE FILTERING
model_method <- "UBCF"
# Training
tic()
modelUBCF <- Recommender(train, method = model_method)
t <- toc(quiet = TRUE)
train_time <- round(t$toc - t$tic, 2)
# Predicting
tic()
predUBCF <- predict(modelUBCF, newdata = known, type = "ratings")
t <- toc(quiet = TRUE)
predict_time <- round(t$toc - t$tic, 2)
timing <- rbind(timing, data.frame(Model = as.factor(model_method), Training = as.double(train_time),
Predicting = as.double(predict_time)))
# Accuracy
accUBCF <- calcPredictionAccuracy(predUBCF, unknown)
# resultsUBCF <- evaluate(x = eval, method = model_method, n = c(1, 5, 10,
# 30, 60))
RANDOM
model_method <- "RANDOM"
# Training
tic()
modelRandom <- Recommender(train, method = model_method)
t <- toc(quiet = TRUE)
train_time <- round(t$toc - t$tic, 2)
# Predicting
tic()
predRandom <- predict(modelRandom, newdata = known, type = "ratings")
t <- toc(quiet = TRUE)
predict_time <- round(t$toc - t$tic, 2)
timing <- rbind(timing, data.frame(Model = as.factor(model_method), Training = as.double(train_time),
Predicting = as.double(predict_time)))
# Accuracy
accRandom <- calcPredictionAccuracy(predRandom, unknown)
# resultsRandom <- evaluate(x = eval, method = model_method, n = c(1, 5, 10,
# 30, 60))
SVD
model_method <- "SVD"
# Training
tic()
modelSVD <- Recommender(train, method = model_method, parameter = list(k = 50))
t <- toc(quiet = TRUE)
train_time <- round(t$toc - t$tic, 2)
# Predicting
tic()
predSVD <- predict(modelSVD, newdata = known, type = "ratings")
t <- toc(quiet = TRUE)
predict_time <- round(t$toc - t$tic, 2)
timing <- rbind(timing, data.frame(Model = as.factor(model_method), Training = as.double(train_time),
Predicting = as.double(predict_time)))
# Accuracy
accSVD <- calcPredictionAccuracy(predSVD, unknown)
# resultsSVD <- evaluate(x = eval, method = model_method, n = c(1, 5, 10,
# 30, 60))
Now that we have built all three models, we can compare their accuracy.
accuracy <- rbind(accUBCF, accRandom)
accuracy <- rbind(accuracy, accSVD)
rownames(accuracy) <- c("UBCF", "Random", "SVD")
knitr::kable(accuracy, format = "html") %>% kableExtra::kable_styling(bootstrap_options = c("striped",
    "hover"))
| | RMSE | MSE | MAE |
|---|---|---|---|
| UBCF | 1.129307 | 1.275334 | 0.838039 |
| Random | 1.423150 | 2.025355 | 1.025171 |
| SVD | 1.123959 | 1.263285 | 0.827643 |
Reviewing the accuracy scores above for the UBCF, Random, and SVD models, we see that Random has the lowest accuracy, whereas the UBCF and SVD figures are quite close to each other. It is not surprising that random recommendations are not as accurate as recommendations based on prior ratings.
Now we can review the ROC curve and Precision-Recall plot for all three models.
models <- list(UBCF = list(name = "UBCF", param = NULL), Random = list(name = "RANDOM",
param = NULL), SVD = list(name = "SVD", param = list(k = 50)))
evalResults <- evaluate(x = eval, method = models, n = c(1, 5, 10, 30, 60))
UBCF run fold/sample [model time/prediction time]
1 [0.006sec/109.981sec]
RANDOM run fold/sample [model time/prediction time]
1 [0.001sec/4.169sec]
SVD run fold/sample [model time/prediction time]
1 [12.857sec/4.995sec]
# Precision-Recall Plot
plot(evalResults, "prec/rec", annotate = TRUE, legend = "topright", main = "Precision-Recall")
UBCF performs better than SVD and considerably better than the Random model.
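The ROC curve mentioned above can be produced in the same way; a sketch (recommenderlab's default plot for evaluation results is the ROC curve, TPR vs. FPR):
# ROC Plot
plot(evalResults, annotate = TRUE, legend = "topleft", main = "ROC Curve")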
• Now, let us look at the training and prediction times.
• From the table below we can see that the UBCF model can be created fairly quickly, but predicting results takes considerable time. The Random model is pretty efficient all around. The SVD model takes longer to build than to predict, but altogether it is quicker than the UBCF model.
rownames(timing) <- timing$Model
knitr::kable(timing[, 2:3], format = "html") %>% kableExtra::kable_styling(bootstrap_options = c("striped",
    "hover"))
| | Training | Predicting |
|---|---|---|
| UBCF | 0.02 | 108.18 |
| RANDOM | 0.00 | 2.31 |
| SVD | 12.70 | 3.44 |
• Since the UBCF and SVD models' accuracy scores were similar and both performed better than the Random model, let's build a hybrid model that uses UBCF as the base and blends in a small random component.
• It may not always be desirable to recommend only the products that a user is likely to rate most highly. Recommending somewhat unexpected products may improve the user experience, expand user preferences, and provide additional knowledge about a user.
• To make sure that most of the recommendations are still likely to be highly rated, we only allow a very minor influence from the Random model (a 0.99 vs. 0.01 weighting between the UBCF and Random models).
model_Hybrid <- HybridRecommender(modelUBCF, modelRandom, weights = c(0.99,
0.01))
pred_Hybrid <- predict(model_Hybrid, newdata = known, type = "ratings")
(accHybrid <- calcPredictionAccuracy(pred_Hybrid, unknown))
     RMSE       MSE       MAE
1.3008284 1.6921544 0.9147718
Comparison of the accuracy
The accuracy has gone down. It is not as bad as with the purely random model, but clearly not as good as the UBCF or SVD models. However, the goal here is to influence the user experience rather than to build the most accurate model, so we need to employ different metrics.
Let us look at top 10 recommendations for the first user in the test set.
pUBCF <- predict(modelUBCF, newdata = known[1], type = "topNList")
pHybrid <- predict(model_Hybrid, newdata = known[1], type = "topNList")
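The two lists below were presumably displayed by coercing the topNList objects to plain lists (a sketch; the exact call is not shown):
as(pUBCF, "list")
as(pHybrid, "list")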
[1] 8871 4298 8747 9564 8868 2076 1181 9547 7390 9560
$A100ZQDV7L8PVV
[1] 4298 8871 8747 8868 2076 9547 9560 1181 8643 9564
As we can see, the hybrid model includes most of the items recommended by the UBCF model, but there are some new items and the order is different.
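To go beyond a single user, one way to quantify how much the hybrid diverges from pure UBCF is the average overlap of their top-10 lists across all test users. This is an illustrative sketch, not part of the original report:
topUBCF <- as(predict(modelUBCF, newdata = known, type = "topNList", n = 10), "list")
topHybrid <- as(predict(model_Hybrid, newdata = known, type = "topNList", n = 10), "list")
# Fraction of shared items per user: 1 = identical lists, lower values = more novelty injected
overlap <- mapply(function(a, b) length(intersect(a, b)) / 10, topUBCF, topHybrid)
mean(overlap)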
• In this project we have built three different recommender system algorithms and compared the accuracy of all three models. A similar process can be employed to compare additional models or to tune model parameters to find the optimal model.
Additional experiments that could be performed
• One approach to measuring the success of diversification is A/B testing. Users are randomly divided into two groups and each group is offered a slightly different experience. For instance, one group may get recommendations only from the UBCF model while the other group gets recommendations from the hybrid model.
• User experience is then measured in some way. The least intrusive way is to monitor user interaction. In this example of Amazon products, a click on a recommendation contributed by the random element of the model would indicate that the hybrid model provides valuable recommendations.
• Of course, it is possible to track other metrics - products bought, time spent on a product page, amount spent, etc. The basic idea is to see whether the hybrid model provides a meaningful improvement over the basic model.
• It is important to have objective measures when building and optimizing data science models. The evaluation of a model that returns highly relevant but redundant recommendations should reflect that such a model may score poorly on user experience.
• One approach to measuring diversity is described in Novelty and Diversity in Information Retrieval Evaluation. This or a similar measurement should be incorporated in projects of this type; a rough sketch of one such measure is shown below.
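A minimal sketch of one such measure - intra-list diversity based on item-item cosine similarity - follows. This code is illustrative and not part of the original report; for a catalogue of this size the full similarity matrix is large, so it may need to be restricted to the recommended items:
# Top-10 lists for all test users
topN <- as(predict(modelUBCF, newdata = known, type = "topNList", n = 10), "list")
# Item-item cosine similarity computed on the training data
itemSim <- as.matrix(similarity(train, method = "cosine", which = "items"))
dimnames(itemSim) <- list(colnames(train), colnames(train))  # make sure item labels are attached
# Intra-list diversity: average pairwise dissimilarity (1 - similarity) within each list
intraListDiversity <- sapply(topN, function(items) {
  idx <- match(items, colnames(itemSim))
  idx <- idx[!is.na(idx)]
  if (length(idx) < 2) return(NA)
  sims <- itemSim[idx, idx]
  1 - mean(sims[upper.tri(sims)])
})
mean(intraListDiversity, na.rm = TRUE)  # higher values indicate more diverse recommendation lists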