Project 4 | Accuracy and Beyond
Project Objectives
The goal of this assignment is to give you practice working with accuracy and other recommender-system metrics.
In this assignment you’re asked to do at least one or (if you like) both of the following:
- Work in a small group, and/or
- Choose a different dataset to work with from your previous projects.
Deliverables
- As in your previous assignments, compare the accuracy of at least two recommender system algorithms against your offline data.
- Implement support for at least one business or user experience goal such as increased serendipity, novelty, or diversity.
- Compare and report on any change in accuracy before and after you’ve made the change in #2.
- As part of your textual conclusion, discuss one or more additional experiments that could be performed and/or metrics that could be evaluated only if online evaluation were possible. Also, briefly propose how you would design a reasonable online evaluation environment.
You’ll find some of the material discussed in this week’s reading helpful in completing this project. You may also want to look at papers online, or search for “recsys” on YouTube or slideshare.net.
Libraries
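The original library chunk was not echoed, so the following is a minimal reconstruction, inferred from the functions called later in this report:
# Packages inferred from the calls used throughout this report
library(Matrix)          # sparseMatrix()
library(recommenderlab)  # Recommender(), evaluationScheme(), calcPredictionAccuracy(), ...
library(tictoc)          # tic()/toc() timing
library(ggplot2)         # qplot()
library(knitr)           # kable()
library(kableExtra)      # kable_styling(), row_spec(), scroll_box()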
Data Preparation and Exploration
We gathered data from https://www.kaggle.com/skillsmuggler/amazon-ratings?select=ratings_Beauty.csv. This is an Amazon dataset obtained from Kaggle.com, described as follows:
This is a dataset of over 2 million customer reviews and ratings of Beauty-related products sold on Amazon’s website.
It contains:
the unique UserId (Customer Identification),
the product ASIN (Amazon’s unique product identification code for each product),
Ratings (ranging from 1-5 based on customer satisfaction) and
the Timestamp of the rating (in UNIX time)
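The loading chunk was likewise not echoed; below is a sketch of how the CSV might have been read. The file path is an assumption, and stringsAsFactors = TRUE matters because the matrix-building step later relies on UserId and ProductId being factors.
# Assumed loading step; the path is illustrative
amzonratings <- read.csv("ratings_Beauty.csv", stringsAsFactors = TRUE)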
Preview data
# Preview the ratings data
kable(head(amzonratings, n = 10L)) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
row_spec(0, bold = T, color = "white", background = "#fc5e5e") %>%
scroll_box(width = "100%", height = "200px")

| UserId | ProductId | Rating | Timestamp |
|---|---|---|---|
| A39HTATAQ9V7YF | 0205616461 | 5 | 1369699200 |
| A3JM6GV9MNOF9X | 0558925278 | 3 | 1355443200 |
| A1Z513UWSAAO0F | 0558925278 | 5 | 1404691200 |
| A1WMRR494NWEWV | 0733001998 | 4 | 1382572800 |
| A3IAAVS479H7M7 | 0737104473 | 1 | 1274227200 |
| AKJHHD5VEH7VG | 0762451459 | 5 | 1404518400 |
| A1BG8QW55XHN6U | 1304139212 | 5 | 1371945600 |
| A22VW0P4VZHDE3 | 1304139220 | 5 | 1373068800 |
| A3V3RE4132GKRO | 130414089X | 5 | 1401840000 |
| A327B0I7CYTEJC | 130414643X | 4 | 1389052800 |
A dimension check confirms 2,023,070 rows and 4 columns:
## [1] 2023070 4
Create Matrix
We'll convert the dataset to an object of class realRatingMatrix. In the following code-chunk, through a couple of steps, we convert amzonratings into a realRatingMatrix called amazon for downstream analysis (see page 33 of Building a Recommendation System with R).
amazonmatrix <- sparseMatrix(as.integer(amzonratings$UserId), as.integer(amzonratings$ProductId),
x = amzonratings$Rating)
colnames(amazonmatrix) <- levels(amzonratings$ProductId)
rownames(amazonmatrix) <- levels(amzonratings$UserId)
(amazon <- as(amazonmatrix, "realRatingMatrix"))

## 1210271 x 249274 rating matrix of class 'realRatingMatrix' with 2023070 ratings.
At this point, we’ll take stock of the important characteristics of amazon: its dimensions and a random sample of its ratings.
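The chunk that produced the output below was not echoed; one plausible sketch is shown here. getData.frame() is recommenderlab's converter to a user/item/rating data frame, which matches the columns shown.
dim(amazon)
amazon_df <- getData.frame(amazon)       # columns: user, item, rating
amazon_df[sample(nrow(amazon_df), 6), ]  # a random sample of six ratings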
## [1] 1210271 249274
## user item rating
## 1609862 A00008821J0F472NDY6A2 B007T8XPC6 5
## 299511 A000186437REL8X2RW8UW B000JD4N9M 5
## 1791795 A0002574WYJMBWKNCPY8 B00AH4TJNS 3
## 1557548 A00029263J863WSR0TDRS B00766SGE8 5
## 898365 A00031961JI1CBNV98TW B002OO16QC 5
## 1544065 A000325234LCBTFVL1QK4 B006ZN1B1S 5
Check and Remove Empty Lines
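The hidden chunk behind the output below was presumably along these lines, tabulating ratings per user and per product with recommenderlab's rowCounts()/colCounts() and then dropping any empty rows or columns:
# Distribution of ratings per user and per product (output condensed below)
table(rowCounts(amazon))
table(colCounts(amazon))
# Drop empty rows/columns, if any; here none exist, so dimensions are unchanged
amazon <- amazon[rowCounts(amazon) > 0, colCounts(amazon) > 0]
amazon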
##      1      2      3      4      5      6      7      8      9     10
## 887401 175875  64336  30285  16187   9827   6324   4260   3181   2275
## ... (ratings-per-user distribution truncated; maximum 389 ratings by a single user)
##
##      1      2      3      4      5      6      7      8      9     10
## 103484  42209  22334  13902   9623   7214   5592   4404   3574   3059
## ... (ratings-per-product distribution truncated; maximum 7533 ratings for a single product)
Both distributions are heavily long-tailed: 887,401 users rated only a single product, and 103,484 products received only a single rating. Since every row and column has at least one rating, nothing is removed and the matrix dimensions are unchanged:
## 1210271 x 249274 rating matrix of class 'realRatingMatrix' with 2023070 ratings.
Subset
Since the original matrix holds over 2 million ratings from 1.2 million users, working with it directly consumes too much processing time. To reduce processing time, we subset the matrix in the code-chunk below, keeping products with more than 30 ratings and users with more than 10 ratings.
## 10320 x 12057 rating matrix of class 'realRatingMatrix' with 111871 ratings.
amazonsub <- amazon[, colCounts(amazon) > 30]
amazonsub <- amazonsub[rowCounts(amazonsub) > 10, ]
amazonsub

## 3562 x 12057 rating matrix of class 'realRatingMatrix' with 68565 ratings.
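As a quick sanity check on sparsity, the density of the retained matrix can be computed; a sketch, using recommenderlab's nratings() count of observed ratings:
# Fraction of user-item cells that are observed: roughly 0.16%
nratings(amazonsub) / (nrow(amazonsub) * ncol(amazonsub))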
Data Visualization
Exploring the rating values
# Vectorize the ratings and list the unique values
vector_ratings <- as.vector(amazonsub@data)
unique(vector_ratings)

## [1] 0 5 2 1 4 3
# The ratings are in the range 0-5. Let's count the occurrences of each of them.
table_ratings <- table(vector_ratings)
kable(table_ratings)

| vector_ratings | Freq |
|---|---|
| 0 | 42878469 |
| 1 | 3114 |
| 2 | 3966 |
| 3 | 8273 |
| 4 | 16020 |
| 5 | 37192 |
A rating equal to 0 represents a missing value, so we’ll purge the zero ratings from vector_ratings.
vector_ratings <- vector_ratings[vector_ratings != 0]
vector_ratings <- factor(vector_ratings)
qplot(vector_ratings) + ggtitle("Distribution of the ratings")

Recommendation algorithms
Split the dataset into a training set (80%) and a testing set (20%). In the scheme below, given = 5 means five ratings from each test user are revealed to the recommender (the "known" set) while the rest are held out for evaluation (the "unknown" set), and goodRating = 3 treats ratings of 3 and above as positive.
# Train/test split
set.seed(88)
evaluation <- evaluationScheme(amazonsub, method = "split", train = 0.8, given = 5, goodRating = 3)
train <- getData(evaluation, "train")
known <- getData(evaluation, "known")
unknown <- getData(evaluation, "unknown")
# Set up data frame for timing
timing <- data.frame(Model = factor(), Training = double(), Predicting = double())

User-Based Collaborative Filtering
method_name <- "UBCF"
# Training
tic()
UBCF_model <- Recommender(train, method = method_name)
t <- toc(quiet = TRUE)
train_time <- round(t$toc - t$tic, 2)
# Predicting
tic()
UBCF_prediction <- predict(UBCF_model, newdata = known, type = "ratings")
t <- toc(quiet = TRUE)
predict_time <- round(t$toc - t$tic, 2)
timing <- rbind(timing, data.frame(Model = as.factor(method_name), Training = as.double(train_time), Predicting = as.double(predict_time)))
# Accuracy
UBCF_accuracy <- calcPredictionAccuracy(UBCF_prediction, unknown)

Random
method_name <- "RANDOM"
# Training
tic()
Random_model <- Recommender(train, method = method_name)
t <- toc(quiet = TRUE)
train_time <- round(t$toc - t$tic, 2)
# Predicting
tic()
Random_prediction <- predict(Random_model, newdata = known, type = "ratings")
t <- toc(quiet = TRUE)
predict_time <- round(t$toc - t$tic, 2)
timing <- rbind(timing, data.frame(Model = as.factor(method_name), Training = as.double(train_time),
Predicting = as.double(predict_time)))
# Accuracy
Random_accuracy <- calcPredictionAccuracy(Random_prediction, unknown)

SVD
method_name <- "SVD"
# Training
tic()
SVD_model <- Recommender(train, method = method_name, parameter = list(k = 50))
t <- toc(quiet = TRUE)
train_time <- round(t$toc - t$tic, 2)
# Predicting
tic()
SVD_prediction <- predict(SVD_model, newdata = known, type = "ratings")
t <- toc(quiet = TRUE)
predict_time <- round(t$toc - t$tic, 2)
timing <- rbind(timing, data.frame(Model = as.factor(method_name), Training = as.double(train_time), Predicting = as.double(predict_time)))
# Accuracy
SVD_accuracy <- calcPredictionAccuracy(SVD_prediction, unknown)

Comparing Models
accuracy <- rbind(UBCF_accuracy, Random_accuracy)
accuracy <- rbind(accuracy, SVD_accuracy)
rownames(accuracy) <- c("UBCF", "Random", "SVD")
#
kable(accuracy) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
row_spec(0, bold = T, color = "white", background = "#fc5e5e") %>%
scroll_box(width = "100%", height = "200px")

|  | RMSE | MSE | MAE |
|---|---|---|---|
| UBCF | 1.080407 | 1.167279 | 0.7939583 |
| Random | 1.355398 | 1.837103 | 0.9711710 |
| SVD | 1.076807 | 1.159515 | 0.7809779 |
In the comparison table, we see that Random has the highest RMSE, implying the lowest accuracy, while UBCF and SVD have nearly identical accuracy. This is expected: random recommendations, not being based on historical data, cannot match the accuracy of models that learn from it.
ROC Curve
In the following, we’ll review the ROC curve and precision-recall plot for all three models.
model_list <- list(UBCF = list(name = "UBCF", param = NULL), Random = list(name = "RANDOM", param = NULL), SVD = list(name = "SVD", param = list(k = 50)))
evaluate_results <- evaluate(x = evaluation, method = model_list, n = c(1, 5, 10, 30, 60))

## UBCF run fold/sample [model time/prediction time]
## 1 [0sec/232.24sec]
## RANDOM run fold/sample [model time/prediction time]
## 1 [0sec/9.03sec]
## SVD run fold/sample [model time/prediction time]
## 1 [19.19sec/8.63sec]
# Precision-Recall Plot
plot(evaluate_results, "prec/rec", annotate = TRUE, legend = "topright", main = "Precision-Recall")

We observe that UBCF performs better than SVD, and considerably better than Random.
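Since this subsection is titled "ROC Curve", the curve itself can be drawn from the same evaluation results; recommenderlab's default plot method for evaluation results plots TPR against FPR:
# ROC curve (TPR vs. FPR) for the same three models
plot(evaluate_results, annotate = TRUE, legend = "bottomright", main = "ROC Curve")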
rownames(timing) <- timing$Model
knitr::kable(timing[, 2:3], format = "html") %>% kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))

|  | Training | Predicting |
|---|---|---|
| UBCF | 0.02 | 212.44 |
| RANDOM | 0.00 | 4.45 |
| SVD | 21.97 | 7.09 |
The table shows that while the UBCF model trains almost instantly, its prediction is slow, because UBCF defers the real work to prediction time. The Random model has essentially no training cost, so the table effectively shows only its prediction time. SVD, by contrast, takes longer to train but predicts quickly; overall, SVD takes far less total time than UBCF.
Implementation Support for Business or User Experience Goal
We observed that the UBCF and SVD models had similar accuracy scores, both well ahead of the Random model. Let's now create a hybrid model combining the UBCF and Random models.
Recommending products that a user is likely to rate highly is not always desirable on its own. Users also appreciate recommendations for products they would not have consciously imagined, so serendipitous recommendations may be welcome. We therefore want to inject some entropy (randomness) and mildly disrupt the routine of recommending only predictably high-rated products. To accommodate the best of both worlds, high predicted ratings for preferred items and mild serendipity, we distribute the weights 0.99 vs. 0.01 between the UBCF and Random models.
hybrid_model <- HybridRecommender(UBCF_model, Random_model, weights = c(0.99, 0.01))
Hybrid_prediction <- predict(hybrid_model, newdata = known, type = "ratings")
(Hybrid_accuracy <- calcPredictionAccuracy(Hybrid_prediction, unknown))

##      RMSE       MSE       MAE
## 1.4017672 1.9649514 0.9435311
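To tune the accuracy/serendipity trade-off, one could sweep the weight assigned to the Random component; a minimal sketch, with an illustrative weight grid:
# Illustrative sweep over the Random component's weight
for (w in c(0.01, 0.05, 0.10)) {
  h <- HybridRecommender(UBCF_model, Random_model, weights = c(1 - w, w))
  p <- predict(h, newdata = known, type = "ratings")
  print(c(weight = w, RMSE = unname(calcPredictionAccuracy(p, unknown)["RMSE"])))
}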
Accuracy Comparison
As a consequence of the injected randomness, accuracy has decreased (RMSE rose from 1.08 to 1.40), as expected. However, the goal here is not exclusively the highest accuracy, but a balance between accuracy and a good user experience.
Now, let's take the top-10 recommendations for the first user in the test set.
UBCF_recom <- predict(UBCF_model, newdata = known[1], type = "topNList")
Hybrid_recom <- predict(hybrid_model, newdata = known[1], type = "topNList")

## $A103WXT3CHVY0H
## [1] 3234 46 580 3609 3624 37 6415 520 2883 3674

## $A103WXT3CHVY0H
## [1] 3234 3624 3609 46 37 10521 2883 7715 7576 2749

The two lists (UBCF first, hybrid second) agree on their leading items, but the hybrid swaps in a few different products, which is exactly the mild serendipity we built in.
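The printing chunk was not echoed; the item indices shown above can be pulled from the topNList objects' items slot, which also allows a crude novelty check on how many products the hybrid swaps in:
UBCF_recom@items    # column indices of the top-10 items for the first test user
Hybrid_recom@items
# Crude novelty proxy: how many hybrid picks are absent from the UBCF list?
length(setdiff(Hybrid_recom@items[[1]], UBCF_recom@items[[1]]))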
Conclusion
In this project, we built four recommender systems (UBCF, Random, SVD, and a hybrid of UBCF and Random) and compared their accuracies. We could likewise build additional models, compare them, and settle on the best-performing one.
Additional experiments
One approach to online evaluation is measuring diversification in choices through A/B testing. On Amazon, users are offered a set of recommendations, and it is possible to record how many of the recommended products a user clicks, or even purchases. The set of clicked products is a clue to a person's tendencies. For example, we could serve the pure UBCF model to a control group and the hybrid model to a treatment group, and compare click-through and conversion rates between the two.
It would also be possible to show a different, attractive pop-up ad to interrupt the user's clicking activity. If the user waits for the ad to finish or clicks it away, and persists in clicking one or another of the offered items, then his or her desire to evaluate the products is strong.
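A minimal sketch of how click-through logs from such an A/B test might be summarized, assuming a hypothetical ab_log data frame with one row per recommendation impression and columns arm, clicked, and purchased:
# Hypothetical log: arm ("control"/"hybrid"), clicked (0/1), purchased (0/1)
ab_summary <- aggregate(cbind(clicked, purchased) ~ arm, data = ab_log, FUN = mean)
ab_summary  # per-arm click-through and conversion rates
# Test whether the click-through rates differ significantly between arms
prop.test(xtabs(clicked ~ arm, ab_log), xtabs(~ arm, ab_log))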