Project 4

The dataset contains product ratings for beauty products sold on Amazon.

• Original set contains 2,023,070 observations and 4 variables. It covers 1,210,271 users and 249,274 products. We will work with a subset of the data, otherwise running the project takes up long time.

 ratings <- read.csv("Beauty.csv")

Lets look at the first few records and ratings distribution

head(ratings)

##           UserId  ProductId Rating  Timestamp
## 1 A39HTATAQ9V7YF 0205616461      5 1369699200
## 2 A3JM6GV9MNOF9X 0558925278      3 1355443200
## 3 A1Z513UWSAAO0F 0558925278      5 1404691200
## 4 A1WMRR494NWEWV 0733001998      4 1382572800
## 5 A3IAAVS479H7M7 0737104473      1 1274227200
## 6  AKJHHD5VEH7VG 0762451459      5 1404518400

hist(ratings$Rating)

Step-3) Convert to realRatingMatrix

ratingsMatrix <- sparseMatrix(as.integer(ratings$UserId), as.integer(ratings$ProductId), 
    x = ratings$Rating)
colnames(ratingsMatrix) <- levels(ratings$ProductId)
rownames(ratingsMatrix) <- levels(ratings$UserId)
amazon <- as(ratingsMatrix, "realRatingMatrix")

Step-4) Explore

amazon

## 1210271 x 249274 rating matrix of class 'realRatingMatrix' with 2023070 ratings.

hist(rowCounts(amazon))

table(rowCounts(amazon))

## 
##      1      2      3      4      5      6      7      8      9     10     11 
## 887401 175875  64336  30285  16187   9827   6324   4260   3181   2275   1745 
##     12     13     14     15     16     17     18     19     20     21     22 
##   1402   1012    912    677    550    462    411    323    277    262    238 
##     23     24     25     26     27     28     29     30     31     32     33 
##    198    160    132    105    129     95     94     82     68     53     64 
##     34     35     36     37     38     39     40     41     42     43     44 
##     51     49     45     32     41     37     38     30     35     29     20 
##     45     46     47     48     49     50     51     52     53     54     55 
##     13     25     25     19     19     17     13     18     13     15     14 
##     56     57     58     59     60     61     62     63     64     65     66 
##     11     15      6      6     13      6      9     11      3     10      4 
##     67     68     69     70     71     72     73     74     75     76     77 
##      4      5      6      6      5      2      6      5      7      5      5 
##     78     79     80     81     82     83     84     85     86     87     88 
##      2      3      2      8      3      1      5      1      2      4      1 
##     89     90     91     92     93     94     95     96     97     98     99 
##      2      3      1      2      5      2      3      1      1      2      2 
##    102    103    104    105    107    108    109    110    112    113    114 
##      2      1      3      1      1      1      3      1      2      1      2 
##    115    116    117    118    120    122    125    127    129    130    131 
##      1      1      2      2      1      1      1      2      1      1      1 
##    132    134    135    137    139    141    145    150    151    152    154 
##      1      1      2      1      1      1      1      1      1      1      1 
##    155    164    168    170    172    173    182    186    205    209    211 
##      1      1      1      1      1      1      1      1      1      1      1 
##    225    249    259    269    275    276    278    326    336    389 
##      1      1      1      1      1      1      1      1      1      1

hist(colCounts(amazon))

table(colCounts(amazon))

## 
##      1      2      3      4      5      6      7      8      9     10     11 
## 103484  42209  22334  13902   9623   7214   5592   4404   3574   3059   2542 
##     12     13     14     15     16     17     18     19     20     21     22 
##   2267   2024   1657   1526   1410   1208   1096   1054    912    869    810 
##     23     24     25     26     27     28     29     30     31     32     33 
##    723    663    591    545    508    517    468    432    381    388    361 
##     34     35     36     37     38     39     40     41     42     43     44 
##    338    332    347    276    269    267    261    261    245    221    226 
##     45     46     47     48     49     50     51     52     53     54     55 
##    177    204    188    177    167    174    156    138    151    157    132 
##     56     57     58     59     60     61     62     63     64     65     66 
##    142    112    119    105    120    103    125    116    111     83     94 
##     67     68     69     70     71     72     73     74     75     76     77 
##     99     86     89     92     89     78     84     85     79     74     66 
##     78     79     80     81     82     83     84     85     86     87     88 
##     66     63     55     72     60     62     62     58     51     64     56 
##     89     90     91     92     93     94     95     96     97     98     99 
##     52     61     56     50     47     41     36     49     46     49     37 
##    100    101    102    103    104    105    106    107    108    109    110 
##     32     34     39     40     26     46     46     34     36     27     29 
##    111    112    113    114    115    116    117    118    119    120    121 
##     32     39     26     25     29     33     28     22     23     19     27 
##    122    123    124    125    126    127    128    129    130    131    132 
##     22     21     28     24     16     24     25     18     19     20     16 
##    133    134    135    136    137    138    139    140    141    142    143 
##     24     27     21     15     31     21     23     14      9     16     19 
##    144    145    146    147    148    149    150    151    152    153    154 
##     17     24     15      8     16     10     11     13     16     11     15 
##    155    156    157    158    159    160    161    162    163    164    165 
##     16      5     18     16     13     12      6     10     14     16     14 
##    166    167    168    169    170    171    172    173    174    175    176 
##     17     10     19     18     13      8      8     10     11      4     13 
##    177    178    179    180    181    182    183    184    185    186    187 
##     12      7     11     14     10      9      9      8     12      9     10 
##    188    189    190    191    192    193    194    195    196    197    198 
##     11      9     15      6      5      5      8      8      7      7     12 
##    199    200    201    202    203    204    205    206    207    208    209 
##      9      6      4      8      4      5     12      9      8      5      4 
##    210    211    212    213    214    215    216    217    218    219    220 
##     10      6      5     10      9      8      6      5     10      6      4 
##    221    222    223    224    225    226    227    228    229    230    231 
##      4      4      5      9      5      1      5      5      9      7      8 
##    232    233    234    235    236    237    238    239    240    241    242 
##      1      8      7     11      8      4      4      4      7      7      5 
##    243    244    245    246    247    248    249    250    251    252    253 
##      4      6      3      2      3      7      5      5      5      3      7 
##    254    255    256    257    258    259    260    261    262    263    264 
##      3      8      8      4      5      2      7      4      3      3      6 
##    265    266    267    268    269    272    273    274    275    276    277 
##      5      3      3      3      3      4      2      2      1      5      2 
##    278    279    281    282    283    284    285    286    287    288    289 
##      3      3      7      4      2      2      4      3      3      4      6 
##    290    291    292    293    294    295    296    297    298    299    300 
##      3      2      3      3      3      5      2      2      4      1      6 
##    301    302    303    305    306    307    308    309    310    311    312 
##      3      2      3      4      3      4      1      3      1      5      3 
##    313    315    316    318    319    320    321    322    324    326    327 
##      3      5      4      2      3      1      4      2      1      1      3 
##    328    329    330    331    332    333    334    335    336    337    338 
##      1      1      5      2      2      2      1      2      3      3      3 
##    339    340    341    342    345    346    347    348    349    350    351 
##      5      2      3      1      2      1      1      2      1      3      2 
##    352    353    354    355    358    359    360    362    363    364    366 
##      1      1      2      6      2      1      3      3      2      2      1 
##    368    369    370    372    375    376    377    379    380    381    383 
##      3      3      3      2      1      2      5      2      2      1      1 
##    384    386    387    388    389    391    392    394    395    396    397 
##      1      2      3      1      1      1      1      1      1      1      1 
##    398    399    400    402    404    405    406    407    409    411    412 
##      3      2      1      1      1      1      2      1      1      1      4 
##    413    414    415    416    418    419    422    423    426    427    429 
##      2      3      3      1      2      2      1      2      1      1      1 
##    430    431    432    434    435    436    438    441    442    443    446 
##      3      1      2      1      1      1      3      1      2      2      2 
##    447    452    455    456    458    459    460    461    462    463    465 
##      2      2      1      1      1      2      3      2      1      3      2 
##    466    467    468    472    473    474    478    479    480    481    483 
##      1      1      3      1      1      3      1      1      4      1      1 
##    488    489    494    495    496    497    498    499    500    501    502 
##      1      1      1      2      2      1      2      1      2      3      1 
##    503    506    507    509    510    511    512    513    514    515    519 
##      2      1      2      1      3      1      2      1      1      1      1 
##    520    523    524    534    537    539    544    545    550    551    553 
##      1      1      3      1      1      1      1      2      2      1      1 
##    554    557    558    563    565    581    584    585    587    590    591 
##      1      2      3      2      1      1      1      2      1      1      1 
##    594    595    597    598    599    600    601    605    607    609    614 
##      1      1      1      1      2      1      1      1      1      1      1 
##    616    618    619    624    639    643    644    653    656    661    662 
##      1      1      1      1      2      1      1      1      1      1      1 
##    665    666    668    671    672    678    680    682    685    686    687 
##      1      1      1      1      1      1      1      1      1      1      2 
##    689    691    698    700    704    706    707    709    713    714    720 
##      1      1      1      1      1      1      1      1      2      1      1 
##    725    729    734    735    736    738    746    755    757    758    765 
##      1      1      1      1      1      1      1      1      1      1      1 
##    766    768    773    782    784    789    795    798    810    818    828 
##      1      1      2      1      1      1      1      1      1      1      1 
##    834    845    865    880    883    885    888    896    899    925    926 
##      1      1      1      1      1      1      1      1      1      1      1 
##    943    945    946    981    992   1046   1051   1061   1074   1079   1105 
##      1      1      1      1      1      1      1      1      1      1      1 
##   1108   1135   1136   1153   1159   1163   1323   1330   1333   1341   1347 
##      1      1      1      1      1      1      1      1      1      1      1 
##   1349   1379   1430   1468   1475   1558   1589   1838   1885   1918   2041 
##      2      1      1      1      1      1      1      1      1      1      1 
##   2088   2143   2477   2869   7533 
##      1      1      1      1      1

Step-5) Subsetting

(amazonShort <- amazon[rowCounts(amazon) > 10, colCounts(amazon) > 30])

## 10320 x 12057 rating matrix of class 'realRatingMatrix' with 111871 ratings.

amazonShort <- amazon[, colCounts(amazon) > 30]
amazonShort <- amazonShort[rowCounts(amazonShort) > 10, ]
amazonShort

## 3562 x 12057 rating matrix of class 'realRatingMatrix' with 68565 ratings.

Step-6) Remove Empty Lines

table(rowCounts(amazonShort))

## 
##  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30 
## 598 479 335 274 264 184 163 135 118  99  84  73  73  66  50  54  26  34  36  31 
##  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50 
##  31  30  22  18  17  15  11  20  15  13  11   8   7   9   8   5  10  10   4   5 
##  51  52  53  54  55  56  57  58  60  61  62  63  64  65  66  67  68  69  70  71 
##   8   4   7   6   4   2   3   3   1   2   4   4   1   2   4   4   2   2   1   1 
##  72  73  74  75  76  77  78  79  81  85  86  90  91  93  94  95  97  98  99 100 
##   1   3   2   5   2   5   1   3   3   2   1   1   2   1   1   1   1   1   1   1 
## 101 103 105 110 111 112 114 116 119 126 152 185 187 
##   1   1   2   1   1   1   1   1   1   1   1   1   1

table(colCounts(amazonShort))

## 
##    0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
## 2410 2404 1849 1231  918  587  451  323  217  192  147  114   96   77   51   60 
##   16   17   18   19   20   21   22   23   24   25   26   27   28   29   30   31 
##   48   48   49   30   36   39   24   18   26   20   20   17   23   18   19   20 
##   32   33   34   35   36   37   38   39   40   41   42   43   44   45   46   47 
##   21   15   16   10   14   16   10   14    8   12   10    8    9   15   11    9 
##   48   49   50   51   52   53   54   55   56   57   58   59   60   61   62   63 
##    9   15   13    5    4    6   10    4    9   13   11    5    6   12    6    6 
##   64   65   66   68   69   70   71   72   73   74   75   76   77   78   79   80 
##    5    8    4    7    2    6    2    6    5    3    8    4    4    2    4    6 
##   81   82   83   84   85   86   87   88   89   90   91   92   93   95   96   98 
##    4    7    6    5    2    4    1    4    1    1    4    1    2    1    1    1 
##   99  102  103  105  106  107  112  116  122  136  141  143  146  147  148  160 
##    1    1    1    2    2    1    1    1    1    1    1    1    1    1    1    1 
##  164  167  174  177 
##    1    1    1    1

(amazonShort <- amazonShort[, colCounts(amazonShort) != 0])

## 3562 x 9647 rating matrix of class 'realRatingMatrix' with 68565 ratings.

Step-7) Convert to data frame

ratings <- as.data.frame(as.matrix(amazonShort@data))
ratings$UserId <- rownames(ratings)
ratings <- ratings %>% gather(key = ProductId, value = Rating, -UserId) %>% filter(Rating != 
    0)

Data import and Data Preparation

ratings <- read.csv("finalratings.csv")
ratingsMatrix <- sparseMatrix(as.integer(ratings$UserId), as.integer(ratings$ProductId), 
    x = ratings$Rating)
colnames(ratingsMatrix) <- levels(ratings$ProductId)
rownames(ratingsMatrix) <- levels(ratings$UserId)
amazon <- as(ratingsMatrix, "realRatingMatrix")

Train and Test sets

set.seed(1)
eval <- evaluationScheme(amazon, method = "split", train = 0.8, given = 5, goodRating = 3)
train <- getData(eval, "train")
known <- getData(eval, "known")
unknown <- getData(eval, "unknown")

timing <- data.frame(Model = factor(), Training = double(), Predicting = double())

Models

USER BASED COLLABORATIVE FILTERING

model_method <- "UBCF"

# Train
tic()
modelUBCF <- Recommender(train, method = model_method)
t <- toc(quiet = TRUE)
train_time <- round(t$toc - t$tic, 2)

# Predict
tic()
predUBCF <- predict(modelUBCF, newdata = known, type = "ratings")
t <- toc(quiet = TRUE)
predict_time <- round(t$toc - t$tic, 2)

timing <- rbind(timing, data.frame(Model = as.factor(model_method), Training = as.double(train_time), 
    Predicting = as.double(predict_time)))
# Accuracy
accuracyUBCF <- calcPredictionAccuracy(predUBCF, unknown)

RANDOM

model_method <- "RANDOM"

# Training
tic()
modelRandom <- Recommender(train, method = model_method)
t <- toc(quiet = TRUE)
train_time <- round(t$toc - t$tic, 2)

# Predicting
tic()
predRandom <- predict(modelRandom, newdata = known, type = "ratings")
t <- toc(quiet = TRUE)
predict_time <- round(t$toc - t$tic, 2)

timing <- rbind(timing, data.frame(Model = as.factor(model_method), Training = as.double(train_time), 
    Predicting = as.double(predict_time)))

# Accuracy
accuracyRandom <- calcPredictionAccuracy(predRandom, unknown)

SVD

model_method <- "SVD"

# Train
tic()
modelSVD <- Recommender(train, method = model_method, parameter = list(k = 50))
t <- toc(quiet = TRUE)
train_time <- round(t$toc - t$tic, 2)

# Predict
tic()
predSVD <- predict(modelSVD, newdata = known, type = "ratings")
t <- toc(quiet = TRUE)
predict_time <- round(t$toc - t$tic, 2)

timing <- rbind(timing, data.frame(Model = as.factor(model_method), Training = as.double(train_time), 
    Predicting = as.double(predict_time)))

# Accuracy
accuracySVD <- calcPredictionAccuracy(predSVD, unknown)

Compairing Models

Lets compare the accuracy of all three models

Based on the table below we see that UBCF and SVD have similar RMSE. As expected Random model have the least accuracy out of the three models.

accuracy <- rbind(accuracyUBCF, accuracyRandom)
accuracy <- rbind(accuracy, accuracySVD)
rownames(accuracy) <- c("UBCF", "Random", "SVD")
knitr::kable(accuracy, format = "html") %>% kableExtra::kable_styling(bootstrap_options = c("striped", 
    "hover"))

	RMSE	MSE	MAE
UBCF	1.114454	1.242008	0.8056955
Random	1.389105	1.929612	0.9923223
SVD	1.113120	1.239035	0.7978551

ROC curve

Lets review ROC curve and Precision-Recall plot for all three models.

models <- list(UBCF = list(name = "UBCF", param = NULL), Random = list(name = "RANDOM", 
    param = NULL), SVD = list(name = "SVD", param = list(k = 50)))
evalResults <- evaluate(x = eval, method = models, n = c(1, 5, 10, 30, 60))

## UBCF run fold/sample [model time/prediction time]
##   1  [0sec/147.61sec] 
## RANDOM run fold/sample [model time/prediction time]
##   1  [0sec/4.65sec] 
## SVD run fold/sample [model time/prediction time]
##   1  [10.65sec/5.04sec]

plot(evalResults, annotate = TRUE, legend = "topleft", main = "ROC Curve")

# Precision-Recall Plot
plot(evalResults, "prec/rec", annotate = TRUE, legend = "topright", main = "Precision-Recall")

UBCF is better than SVD and alot better than Random model. As expected, Random model is the worse out of the models.

•Training and prediction time.

• From the table below we can see that the UBCF model can be trained quickly, but prediction takes significant amount of time. The Random model is very fast for training and predictions. The SVD model takes longer to train than to predict.

rownames(timing) <- timing$Model
knitr::kable(timing[, 2:3], format = "html") %>% kableExtra::kable_styling(bootstrap_options = c("striped", 
    "hover"))

	Training	Predicting
UBCF	0.12	156.02
RANDOM	0.00	2.69
SVD	11.28	3.08

Implement Support for Business/User Experiance Goal

• Accuracy for UBCF and SVD models were similar and these are alot better than Random model. We will create a hybrid model consisting of UBCF and SVD models.

• Recommending products that are likely to be rated high by a user might not be desirable. Recommending unexpected products may improve user experience, expand user preferences, provide additional knowledge.

• To make sure that most of recommendations are still likely be highly rated, we only allow very minor influence of the Random model (0.99 vs. 0.01 weight between UBCF and Random models).

model_Hybrid <- HybridRecommender(modelUBCF, modelRandom, weights = c(0.99, 
    0.01))
pred_Hybrid <- predict(model_Hybrid, newdata = known, type = "ratings")
(accHybrid <- calcPredictionAccuracy(pred_Hybrid, unknown))

##      RMSE       MSE       MAE 
## 1.3230367 1.7504262 0.9006966

Compare Accuracy

Accuracy of the Hybrid model is less than SVD and UBCF, but it is better than random model. The goal here is to influence user experience rather than make the most accurate model, so we need to use different metrics.

Below is the prediction for first user in the test set. Hybrid model includes most of the items recommended by the UBCF model and also inlcudes new items and the order is different.

pUBCF <- predict(modelUBCF, newdata = known[1], type = "topNList")
pHybrid <- predict(model_Hybrid, newdata = known[1], type = "topNList")
pUBCF@items

## $A03364251DGXSGA9PSR99
##  [1] 6591 5924 6203 7215 1167 2760 5096 6705 5078 6729

pHybrid@items

## $A03364251DGXSGA9PSR99
##  [1] 6203 7215 6591 2760 6705 2880 8827 1982 5096 1167

We built three different recommender systems and compared the accuracy of those different models, similar process can be used to compare additional models or to adjust model parameters to find the most optimal model.

Additional experiments that worth exploring

• One of the approaches in measuring success of diversification is by using A/B testing. Users are randomly divided into two groups and each group is offered a slightly different experience. For instance, one group may get recommendations only from the UBCF model while the other group will get recommendations from the hybrid model.

• Track/explore other metrics - products bought, time spent on product page, amount spent, etc. We would like to explore if hybrid model provides meaningful improvement over basic model.

• Evaluation of a model that returns highly relevant, but redundant recommendations may score poorly in user experience.

• Measure diversity, as described in Novelty and Diversity in Information Retrieval Evaluation or similar measurements should be incorporated in projects of this type.