Objective

The goal of this assignment is give you practice working with accuracy and other recommender system metrics.

Deliverables

  1. As in your previous assignments, compare the accuracy of at least two recommender system algorithms against your offline data.

  2. Implement support for at least one business or user experience goal such as increased serendipity, novelty, or diversity.

  3. Compare and report on any change in accuracy before and after you’ve made the change in #2.

  4. As part of your textual conclusion, discuss one or more additional experiments that could be performed and/or metrics that could be evaluated only if online evaluation was possible. Also, briefly propose how you would design a reasonable online evaluation environment.

Load required libraries

library(recommenderlab)  # Matrix/recommender functions
## Warning: package 'recommenderlab' was built under R version 3.6.3
## Loading required package: Matrix
## Loading required package: arules
## Warning: package 'arules' was built under R version 3.6.3
## 
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
## 
##     abbreviate, write
## Loading required package: proxy
## Warning: package 'proxy' was built under R version 3.6.3
## 
## Attaching package: 'proxy'
## The following object is masked from 'package:Matrix':
## 
##     as.matrix
## The following objects are masked from 'package:stats':
## 
##     as.dist, dist
## The following object is masked from 'package:base':
## 
##     as.matrix
## Loading required package: registry
## Registered S3 methods overwritten by 'registry':
##   method               from 
##   print.registry_field proxy
##   print.registry_entry proxy
library(dplyr)           # Data manipulation
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:arules':
## 
##     intersect, recode, setdiff, setequal, union
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)           # Data manipulation
## Warning: package 'tidyr' was built under R version 3.6.3
## 
## Attaching package: 'tidyr'
## The following objects are masked from 'package:Matrix':
## 
##     expand, pack, unpack
library(ggplot2)         # Plotting
library(tictoc)          # Operation timing

Dataset Information

• The dataset is a product ratings for beauty products sold on Amazon.com. The dataset was downloaded from Kaggle.com.

• Original set contains 2,023,070 observations and 4 variables - User ID, Product ID, Rating (from 1 to 5), and Time Stamp. It covers 1,210,271 users and 249,274 products. In order to make the set more manageable it has been reduced to a smaller subset.

• The final ratings dataset used consists of 3562 x 12057 rating matrix of class ‘realRatingMatrix’ with 68565 ratings.

Subsetting Dataset

Now let’s see the process of reducing the data from the main dataset

Step 1) Import original file and select sample for project

ratings <- read.csv("E:/github/MS/DATA612/Project_4/ratings_Beauty.csv")

Step-2) Explore

head(ratings)
##           UserId  ProductId Rating  Timestamp
## 1 A39HTATAQ9V7YF 0205616461      5 1369699200
## 2 A3JM6GV9MNOF9X 0558925278      3 1355443200
## 3 A1Z513UWSAAO0F 0558925278      5 1404691200
## 4 A1WMRR494NWEWV 0733001998      4 1382572800
## 5 A3IAAVS479H7M7 0737104473      1 1274227200
## 6  AKJHHD5VEH7VG 0762451459      5 1404518400
class(ratings$UserId); class(ratings$ProductId); class(ratings$Rating); class(ratings$Timestamp)
## [1] "factor"
## [1] "factor"
## [1] "numeric"
## [1] "integer"
hist(ratings$Rating, col = "Blue")

Step-3) Convert to realRatingMatrix

ratingsMatrix <- sparseMatrix(as.integer(ratings$UserId), as.integer(ratings$ProductId), x = ratings$Rating)
colnames(ratingsMatrix) <- levels(ratings$ProductId)
rownames(ratingsMatrix) <- levels(ratings$UserId)
amazon <- as(ratingsMatrix, "realRatingMatrix")

Step-4) Explore

amazon 
## 1210271 x 249274 rating matrix of class 'realRatingMatrix' with 2023070 ratings.
hist(rowCounts(amazon), col = "Green")

table(rowCounts(amazon))
## 
##      1      2      3      4      5      6      7      8      9     10     11 
## 887401 175875  64336  30285  16187   9827   6324   4260   3181   2275   1745 
##     12     13     14     15     16     17     18     19     20     21     22 
##   1402   1012    912    677    550    462    411    323    277    262    238 
##     23     24     25     26     27     28     29     30     31     32     33 
##    198    160    132    105    129     95     94     82     68     53     64 
##     34     35     36     37     38     39     40     41     42     43     44 
##     51     49     45     32     41     37     38     30     35     29     20 
##     45     46     47     48     49     50     51     52     53     54     55 
##     13     25     25     19     19     17     13     18     13     15     14 
##     56     57     58     59     60     61     62     63     64     65     66 
##     11     15      6      6     13      6      9     11      3     10      4 
##     67     68     69     70     71     72     73     74     75     76     77 
##      4      5      6      6      5      2      6      5      7      5      5 
##     78     79     80     81     82     83     84     85     86     87     88 
##      2      3      2      8      3      1      5      1      2      4      1 
##     89     90     91     92     93     94     95     96     97     98     99 
##      2      3      1      2      5      2      3      1      1      2      2 
##    102    103    104    105    107    108    109    110    112    113    114 
##      2      1      3      1      1      1      3      1      2      1      2 
##    115    116    117    118    120    122    125    127    129    130    131 
##      1      1      2      2      1      1      1      2      1      1      1 
##    132    134    135    137    139    141    145    150    151    152    154 
##      1      1      2      1      1      1      1      1      1      1      1 
##    155    164    168    170    172    173    182    186    205    209    211 
##      1      1      1      1      1      1      1      1      1      1      1 
##    225    249    259    269    275    276    278    326    336    389 
##      1      1      1      1      1      1      1      1      1      1
hist(colCounts(amazon), col = "Yellow")

table(colCounts(amazon))
## 
##      1      2      3      4      5      6      7      8      9     10     11 
## 103484  42209  22334  13902   9623   7214   5592   4404   3574   3059   2542 
##     12     13     14     15     16     17     18     19     20     21     22 
##   2267   2024   1657   1526   1410   1208   1096   1054    912    869    810 
##     23     24     25     26     27     28     29     30     31     32     33 
##    723    663    591    545    508    517    468    432    381    388    361 
##     34     35     36     37     38     39     40     41     42     43     44 
##    338    332    347    276    269    267    261    261    245    221    226 
##     45     46     47     48     49     50     51     52     53     54     55 
##    177    204    188    177    167    174    156    138    151    157    132 
##     56     57     58     59     60     61     62     63     64     65     66 
##    142    112    119    105    120    103    125    116    111     83     94 
##     67     68     69     70     71     72     73     74     75     76     77 
##     99     86     89     92     89     78     84     85     79     74     66 
##     78     79     80     81     82     83     84     85     86     87     88 
##     66     63     55     72     60     62     62     58     51     64     56 
##     89     90     91     92     93     94     95     96     97     98     99 
##     52     61     56     50     47     41     36     49     46     49     37 
##    100    101    102    103    104    105    106    107    108    109    110 
##     32     34     39     40     26     46     46     34     36     27     29 
##    111    112    113    114    115    116    117    118    119    120    121 
##     32     39     26     25     29     33     28     22     23     19     27 
##    122    123    124    125    126    127    128    129    130    131    132 
##     22     21     28     24     16     24     25     18     19     20     16 
##    133    134    135    136    137    138    139    140    141    142    143 
##     24     27     21     15     31     21     23     14      9     16     19 
##    144    145    146    147    148    149    150    151    152    153    154 
##     17     24     15      8     16     10     11     13     16     11     15 
##    155    156    157    158    159    160    161    162    163    164    165 
##     16      5     18     16     13     12      6     10     14     16     14 
##    166    167    168    169    170    171    172    173    174    175    176 
##     17     10     19     18     13      8      8     10     11      4     13 
##    177    178    179    180    181    182    183    184    185    186    187 
##     12      7     11     14     10      9      9      8     12      9     10 
##    188    189    190    191    192    193    194    195    196    197    198 
##     11      9     15      6      5      5      8      8      7      7     12 
##    199    200    201    202    203    204    205    206    207    208    209 
##      9      6      4      8      4      5     12      9      8      5      4 
##    210    211    212    213    214    215    216    217    218    219    220 
##     10      6      5     10      9      8      6      5     10      6      4 
##    221    222    223    224    225    226    227    228    229    230    231 
##      4      4      5      9      5      1      5      5      9      7      8 
##    232    233    234    235    236    237    238    239    240    241    242 
##      1      8      7     11      8      4      4      4      7      7      5 
##    243    244    245    246    247    248    249    250    251    252    253 
##      4      6      3      2      3      7      5      5      5      3      7 
##    254    255    256    257    258    259    260    261    262    263    264 
##      3      8      8      4      5      2      7      4      3      3      6 
##    265    266    267    268    269    272    273    274    275    276    277 
##      5      3      3      3      3      4      2      2      1      5      2 
##    278    279    281    282    283    284    285    286    287    288    289 
##      3      3      7      4      2      2      4      3      3      4      6 
##    290    291    292    293    294    295    296    297    298    299    300 
##      3      2      3      3      3      5      2      2      4      1      6 
##    301    302    303    305    306    307    308    309    310    311    312 
##      3      2      3      4      3      4      1      3      1      5      3 
##    313    315    316    318    319    320    321    322    324    326    327 
##      3      5      4      2      3      1      4      2      1      1      3 
##    328    329    330    331    332    333    334    335    336    337    338 
##      1      1      5      2      2      2      1      2      3      3      3 
##    339    340    341    342    345    346    347    348    349    350    351 
##      5      2      3      1      2      1      1      2      1      3      2 
##    352    353    354    355    358    359    360    362    363    364    366 
##      1      1      2      6      2      1      3      3      2      2      1 
##    368    369    370    372    375    376    377    379    380    381    383 
##      3      3      3      2      1      2      5      2      2      1      1 
##    384    386    387    388    389    391    392    394    395    396    397 
##      1      2      3      1      1      1      1      1      1      1      1 
##    398    399    400    402    404    405    406    407    409    411    412 
##      3      2      1      1      1      1      2      1      1      1      4 
##    413    414    415    416    418    419    422    423    426    427    429 
##      2      3      3      1      2      2      1      2      1      1      1 
##    430    431    432    434    435    436    438    441    442    443    446 
##      3      1      2      1      1      1      3      1      2      2      2 
##    447    452    455    456    458    459    460    461    462    463    465 
##      2      2      1      1      1      2      3      2      1      3      2 
##    466    467    468    472    473    474    478    479    480    481    483 
##      1      1      3      1      1      3      1      1      4      1      1 
##    488    489    494    495    496    497    498    499    500    501    502 
##      1      1      1      2      2      1      2      1      2      3      1 
##    503    506    507    509    510    511    512    513    514    515    519 
##      2      1      2      1      3      1      2      1      1      1      1 
##    520    523    524    534    537    539    544    545    550    551    553 
##      1      1      3      1      1      1      1      2      2      1      1 
##    554    557    558    563    565    581    584    585    587    590    591 
##      1      2      3      2      1      1      1      2      1      1      1 
##    594    595    597    598    599    600    601    605    607    609    614 
##      1      1      1      1      2      1      1      1      1      1      1 
##    616    618    619    624    639    643    644    653    656    661    662 
##      1      1      1      1      2      1      1      1      1      1      1 
##    665    666    668    671    672    678    680    682    685    686    687 
##      1      1      1      1      1      1      1      1      1      1      2 
##    689    691    698    700    704    706    707    709    713    714    720 
##      1      1      1      1      1      1      1      1      2      1      1 
##    725    729    734    735    736    738    746    755    757    758    765 
##      1      1      1      1      1      1      1      1      1      1      1 
##    766    768    773    782    784    789    795    798    810    818    828 
##      1      1      2      1      1      1      1      1      1      1      1 
##    834    845    865    880    883    885    888    896    899    925    926 
##      1      1      1      1      1      1      1      1      1      1      1 
##    943    945    946    981    992   1046   1051   1061   1074   1079   1105 
##      1      1      1      1      1      1      1      1      1      1      1 
##   1108   1135   1136   1153   1159   1163   1323   1330   1333   1341   1347 
##      1      1      1      1      1      1      1      1      1      1      1 
##   1349   1379   1430   1468   1475   1558   1589   1838   1885   1918   2041 
##      2      1      1      1      1      1      1      1      1      1      1 
##   2088   2143   2477   2869   7533 
##      1      1      1      1      1

Step-5) Select Subset 1 and Subset 2

( amazonShort <- amazon[rowCounts(amazon) > 10, colCounts(amazon) > 30] )
## 10320 x 12057 rating matrix of class 'realRatingMatrix' with 111871 ratings.
amazonShort <- amazon[ , colCounts(amazon) > 30]
amazonShort <- amazonShort[rowCounts(amazonShort) > 10, ]
amazonShort
## 3562 x 12057 rating matrix of class 'realRatingMatrix' with 68565 ratings.

Step-6) Check and Remove Empty Lines

table(rowCounts(amazonShort))
## 
##  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30 
## 598 479 335 274 264 184 163 135 118  99  84  73  73  66  50  54  26  34  36  31 
##  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50 
##  31  30  22  18  17  15  11  20  15  13  11   8   7   9   8   5  10  10   4   5 
##  51  52  53  54  55  56  57  58  60  61  62  63  64  65  66  67  68  69  70  71 
##   8   4   7   6   4   2   3   3   1   2   4   4   1   2   4   4   2   2   1   1 
##  72  73  74  75  76  77  78  79  81  85  86  90  91  93  94  95  97  98  99 100 
##   1   3   2   5   2   5   1   3   3   2   1   1   2   1   1   1   1   1   1   1 
## 101 103 105 110 111 112 114 116 119 126 152 185 187 
##   1   1   2   1   1   1   1   1   1   1   1   1   1
table(colCounts(amazonShort))
## 
##    0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
## 2410 2404 1849 1231  918  587  451  323  217  192  147  114   96   77   51   60 
##   16   17   18   19   20   21   22   23   24   25   26   27   28   29   30   31 
##   48   48   49   30   36   39   24   18   26   20   20   17   23   18   19   20 
##   32   33   34   35   36   37   38   39   40   41   42   43   44   45   46   47 
##   21   15   16   10   14   16   10   14    8   12   10    8    9   15   11    9 
##   48   49   50   51   52   53   54   55   56   57   58   59   60   61   62   63 
##    9   15   13    5    4    6   10    4    9   13   11    5    6   12    6    6 
##   64   65   66   68   69   70   71   72   73   74   75   76   77   78   79   80 
##    5    8    4    7    2    6    2    6    5    3    8    4    4    2    4    6 
##   81   82   83   84   85   86   87   88   89   90   91   92   93   95   96   98 
##    4    7    6    5    2    4    1    4    1    1    4    1    2    1    1    1 
##   99  102  103  105  106  107  112  116  122  136  141  143  146  147  148  160 
##    1    1    1    2    2    1    1    1    1    1    1    1    1    1    1    1 
##  164  167  174  177 
##    1    1    1    1
( amazonShort <- amazonShort[ , colCounts(amazonShort) != 0] )
## 3562 x 9647 rating matrix of class 'realRatingMatrix' with 68565 ratings.

Step-7) Convert to data frame and save as CSV file

df <- as.data.frame(as.matrix(amazonShort@data))
df$UserId <- rownames(df)
df <- df %>% gather(key = ProductId, value = Rating, -UserId) %>% filter(Rating != 0)
write.csv(df, "E:/github/MS/DATA612/Project_4/ratings_final.csv", row.names = FALSE)

Data import and Data Preparation

Import the ratings_final dataset:

ratings <- read.csv("https://raw.githubusercontent.com/vijay564/DATA612/master/Project_4/dataset/ratings_final.csv")
ratingsMatrix <- sparseMatrix(as.integer(ratings$UserId), as.integer(ratings$ProductId), x = ratings$Rating)
colnames(ratingsMatrix) <- levels(ratings$ProductId)
rownames(ratingsMatrix) <- levels(ratings$UserId)
amazon <- as(ratingsMatrix, "realRatingMatrix")

Train and Test sets

Split the dataset into test and train sets to build the model.

# Train/test split
set.seed(88)
eval <- evaluationScheme(amazon, method = "split", train = 0.8, given = 5, goodRating = 3)
train <- getData(eval, "train")
known <- getData(eval, "known")
unknown <- getData(eval, "unknown")
# Set up data frame for timing
timing <- data.frame(Model=factor(), Training=double(), Predicting=double())

Recommender Models

Now, Let’s build three different models

USER BASED COLLABORATIVE FILTERING

model_method <- "UBCF"
# Training
tic()
modelUBCF <- Recommender(train, method = model_method)
t <- toc(quiet = TRUE)
train_time <- round(t$toc - t$tic, 2)
# Predicting
tic()
predUBCF <- predict(modelUBCF, newdata = known, type = "ratings")
t <- toc(quiet = TRUE)
predict_time <- round(t$toc - t$tic, 2)
timing <- rbind(timing, data.frame(Model = as.factor(model_method), 
                                   Training = as.double(train_time), 
                                   Predicting = as.double(predict_time))) 
# Accuracy
accUBCF <- calcPredictionAccuracy(predUBCF, unknown)
#resultsUBCF <- evaluate(x = eval, method = model_method, n = c(1, 5, 10, 30, 60))

RANDOM

model_method <- "RANDOM"
# Training
tic()
modelRandom <- Recommender(train, method = model_method)
t <- toc(quiet = TRUE)
train_time <- round(t$toc - t$tic, 2)
# Predicting
tic()
predRandom <- predict(modelRandom, newdata = known, type = "ratings")
t <- toc(quiet = TRUE)
predict_time <- round(t$toc - t$tic, 2)
timing <- rbind(timing, data.frame(Model = as.factor(model_method), 
                                   Training = as.double(train_time), 
                                   Predicting = as.double(predict_time))) 
# Accuracy
accRandom <- calcPredictionAccuracy(predRandom, unknown)
#resultsRandom <- evaluate(x = eval, method = model_method, n = c(1, 5, 10, 30, 60))

SVD

model_method <- "SVD"
# Training
tic()
modelSVD <- Recommender(train, method = model_method, parameter = list(k = 50))
t <- toc(quiet = TRUE)
train_time <- round(t$toc - t$tic, 2)
# Predicting
tic()
predSVD <- predict(modelSVD, newdata = known, type = "ratings")
t <- toc(quiet = TRUE)
predict_time <- round(t$toc - t$tic, 2)
timing <- rbind(timing, data.frame(Model = as.factor(model_method), 
                                   Training = as.double(train_time), 
                                   Predicting = as.double(predict_time))) 
# Accuracy
accSVD <- calcPredictionAccuracy(predSVD, unknown)
#resultsSVD <- evaluate(x = eval, method = model_method, n = c(1, 5, 10, 30, 60))

Compairing Models

As we have build all three models for the dataset, now we can proceed with compairing the accuracy for all three models

accuracy <- rbind(accUBCF, accRandom)
accuracy <- rbind(accuracy, accSVD)
rownames(accuracy) <- c("UBCF", "Random", "SVD")
knitr::kable(accuracy, format = "html") %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))
RMSE MSE MAE
UBCF 1.080407 1.167279 0.7939583
Random 1.353367 1.831603 0.9672497
SVD 1.076808 1.159516 0.7809777

As we review the accuracy scores above for UBCF, Random, SVD models, we see that Random has the lowest accuracy than UBCF and SVD. Whereas, UBCF and SVD models accuracy figures are quite close to each other. It is not surprising that random recommendations are not as accurate as recommendations based on prior ratings.

ROC curve

Now we can we can review ROC curve and Precision-Recall plot for all three models.

models <- list(
  "UBCF" = list(name = "UBCF", param = NULL),
  "Random" = list(name = "RANDOM", param = NULL),
  "SVD" = list(name = "SVD", param = list(k = 50))
  )
evalResults <- evaluate(x = eval, method = models, n = c(1, 5, 10, 30, 60))
## UBCF run fold/sample [model time/prediction time]
##   1  [0.01sec/125.61sec] 
## RANDOM run fold/sample [model time/prediction time]
##   1  [0sec/5.05sec] 
## SVD run fold/sample [model time/prediction time]
##   1  [12.55sec/5.39sec]
# ROC Curve
plot(evalResults, 
     annotate = TRUE, legend = "topleft", main = "ROC Curve")

# Precision-Recall Plot
plot(evalResults, "prec/rec", 
     annotate = TRUE, legend = "topright", main = "Precision-Recall")

UBCF performs better than SVD and considerably better than the Random model.

• Now, Let us see the training and prediction time.

• From the table below we can see that the UBCF model can be created fairly quickly, but predicting results takes considerable time. The Random model is pretty efficient all around. The SVD model takes longer to build than to predict, but altogether it is quicker than the UBCF model.

rownames(timing) <- timing$Model
knitr::kable(timing[, 2:3], format = "html") %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))
Training Predicting
UBCF 0.01 130.64
RANDOM 0.00 2.53
SVD 12.72 3.34

Implement Support for Business/User Experiance Goal

• Since UBCF and SVD models’s accuracy scores were similar and they also performed better compared to Random model, let’s create a hybrid model consisting of UBCF and SVD models.

• It may not always be desirable to recommend products that are likely to be most highly rated by a user. Recommending somewhat unexpected products may improve user experience, expand user preferences, provide additional knowledge about a user.

• In order to make sure that most of recommendations are still likely to be highly rated we only allow very minor influence of the Random model (0.99 vs. 0.01 weight between UBCF and Random models).

model_Hybrid <- HybridRecommender(
    modelUBCF, 
    modelRandom, 
    weights = c(0.99, 0.01))
pred_Hybrid <- predict(model_Hybrid, newdata = known, type = "ratings")
( accHybrid <- calcPredictionAccuracy(pred_Hybrid, unknown) )
##      RMSE       MSE       MAE 
## 1.4019483 1.9654589 0.9436949

Comprison of the accuracy

The accuracy has gone down. It is not as bad as with purely random model, but clearly not as good as UBCF or SVD models. However, the goal here is to influence user experience rather than make the most accurate model, so we need to employ different metrics.

Let us look at top 10 recommendations for the first user in the test set.

pUBCF <- predict(modelUBCF, newdata = known[1], type = "topNList")
pHybrid <- predict(model_Hybrid, newdata = known[1], type = "topNList")
pUBCF@items
## $A103WXT3CHVY0H
##  [1] 2569   41  485 2870 2882   32 5148  435 2294 2914
pHybrid@items
## $A103WXT3CHVY0H
##  [1] 2569  485   41 2870   32 2914 2294 2187 1743 6187

Now as we see, the Hybrid model includes most of the items recommended by the UBCF model, but there are new items and the order is different.

Conclusion

• In this project we have build three different recommender system algorithms and compared the accuracy of all the three different models. Similar process can be employed to compare additional models or to adjust model parameters to find the most optimal model.

Additional experiments that could be performed

• One of the approaches in measuring success of diversification may be A/B testing. Users are randomly divided into two groups and each group is offered a slightly different experience. For instance, one group may get recommendations only from the UBCF model while the other group will get recommendations from the hybrid model.

• User experience is measured in some way. The least instrusive way is to monitor user interaction. In this example of Amazon products, a click on a recommendation suggested by the random element of the model will point to the fact that the hybrid model provides valuable recommendations.

• Of course, it is possible to track other metrics - products bought, time spent on product page, amount spent, etc. The basic idea is to see if the hybrid model provides meaningful improvement to the basic model.

• It is important to have objective measures when building and optimizing data science models. Evaluation of a model that returns highly relevant, but redundant recommendations should reflect that the model may score poorly in user experience.

• One of the approaches to measure diversity is described in Novelty and Diversity in Information Retrieval Evaluation. This or similar measurement should be incorporated in projects of this type.