In this project we implement a recommender system for Amazon healthcare products using user-based and item-based collaborative filtering.
First, we load the necessary libraries for this project.
library(recommenderlab)
## Loading required package: Matrix
## Loading required package: arules
##
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
##
## abbreviate, write
## Loading required package: proxy
##
## Attaching package: 'proxy'
## The following object is masked from 'package:Matrix':
##
## as.matrix
## The following objects are masked from 'package:stats':
##
## as.dist, dist
## The following object is masked from 'package:base':
##
## as.matrix
## Loading required package: registry
## Registered S3 methods overwritten by 'registry':
## method from
## print.registry_field proxy
## print.registry_entry proxy
library(tidyr)
##
## Attaching package: 'tidyr'
## The following objects are masked from 'package:Matrix':
##
## expand, pack, unpack
library(caTools)
library(ggplot2)
We sourced our data from Amazon product reviews written between May 1996 and July 2014, which are available at the following link: http://jmcauley.ucsd.edu/data/amazon/links.html
Here we load our dataset of 2.9 million reviews and add column names. The users are identified by unique reviewer IDs and the products are coded with Amazon Standard Identification Numbers (ASINs).
hc <- read.csv("ratings_Health_and_Personal_Care.csv")
colnames(hc) <- c('user', 'product', 'rating', 'time') #add column names
head(hc)
## user product rating time
## 1 A3FYN0SZYWN74 0615208479 5 1228089600
## 2 A2J0WRZSAAHUAP 0615269990 5 1396742400
## 3 A38RKP6G5P8J63 0615269990 5 1386115200
## 4 ARENM677YXZKX 0615269990 2 1398297600
## 5 AMBJQQSRCAOHS 0615315860 5 1362614400
## 6 A2KX3GMQY9LS9N 0615315860 5 1358035200
To work with a smaller subset of this data, we remove the time column, select the first 100 rows, and convert the table from long to wide so that there is a row for each reviewer and a column for each product, producing a user-item matrix. The resulting matrix is sparse: most user-product cells are empty because each user has rated only a few items, which is a common issue when building recommender systems. The resulting data set has 99 users and 29 healthcare products; the wide table has 30 columns because the first one holds the user ID.
hc2 <- head(hc[, 1:3], n = 100) #remove the time column and select the first 100 rows
hc3 <- spread(hc2, product, rating) #convert table from long to wide
head(hc3)
## user 0615208479 0615269990 0615315860 0615406394 0615836828
## 1 A102TGNH1D915Z NA NA NA NA NA
## 2 A108X7C0O8VLPK NA NA NA NA NA
## 3 A10BUG0VJRIQLU NA NA NA NA NA
## 4 A11O3IHGGJBH67 NA NA NA NA NA
## 5 A12V35OD8T4ZVP NA NA NA 5 NA
## 6 A13UM8WCUC1LDD NA NA NA NA NA
## 0641710577 0641864507 0681504498 0705394638 0736789928 076493211X 0767196767
## 1 NA NA 3 NA NA NA NA
## 2 NA NA NA NA NA NA NA
## 3 NA NA NA NA NA NA NA
## 4 NA 5 NA NA NA NA NA
## 5 NA NA NA NA NA NA NA
## 6 NA NA NA NA NA NA NA
## 0767196813 0895640163 0898004640 0898004659 0898004667 0929619730 093926322X
## 1 NA NA NA NA NA NA NA
## 2 NA NA NA NA NA NA NA
## 3 NA NA NA NA NA NA NA
## 4 NA NA NA NA NA NA NA
## 5 NA NA NA NA NA NA NA
## 6 NA NA NA NA NA NA NA
## 0945523181 0972433503 0974959200 0976202107 0978559088 0979047714 0979679737
## 1 NA NA NA NA NA NA NA
## 2 NA NA 5 NA NA NA NA
## 3 NA NA NA NA NA NA NA
## 4 NA NA NA NA NA NA NA
## 5 NA NA NA NA NA NA NA
## 6 NA NA NA NA NA NA NA
## 097968191X 0982614926 0982828004
## 1 NA NA NA
## 2 NA NA NA
## 3 5 NA NA
## 4 NA NA NA
## 5 NA NA NA
## 6 4 NA NA
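To put a number on the sparsity mentioned above, here is a small sketch of the arithmetic. It assumes that each of the 100 selected rows is a distinct user-product pair (the long-to-wide conversion would have failed otherwise) and counts the 29 product columns visible in the output; the variable names are illustrative only.

```r
# Counts taken from the text and output above
n_users    <- 99    # rows of the wide table
n_products <- 29    # product columns, excluding the user ID column
n_ratings  <- 100   # rows kept from the long table (assumed to be distinct pairs)

# Share of user-product cells that actually hold a rating
density <- n_ratings / (n_users * n_products)
round(100 * density, 1)   # about 3.5 percent of cells are filled

# The same quantity for any numeric rating matrix with NA for "not rated"
toy <- matrix(c(5, NA, NA, NA, 4, NA), nrow = 2)  # hypothetical toy values
mean(!is.na(toy))
```

In other words, roughly 96 to 97 percent of the cells are empty, which is worth keeping in mind when reading the recommendations later on.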
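In order to use the recommenderlab library, we first have to convert our data frame into a realRatingMatrix. Note that the user column is still part of the matrix at this point, so it is coerced to NA (hence the warning below) and counted as one of the 30 columns.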
hc_matrix <- as.matrix(hc3)
hc_RRM <- as(hc_matrix, "realRatingMatrix")
## Warning in storage.mode(from) <- "double": NAs introduced by coercion
dim(hc_RRM)
## [1] 99 30
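The coercion warning above suggests that the user column is being carried into the rating matrix. A minimal sketch of one way to avoid that, assuming a wide table shaped like hc3: move the user IDs into the row names and keep only the numeric product columns before converting. The toy values and the names `wide` and `rating_matrix` are hypothetical and not part of the original analysis.

```r
# Toy wide table with the same layout as hc3: one ID column, then one column per product
wide <- data.frame(
  user = c("u1", "u2", "u3"),
  p1   = c(5, NA, NA),
  p2   = c(NA, 3, NA),
  p3   = c(NA, NA, 4),
  stringsAsFactors = FALSE
)

# Drop the ID column so the matrix is purely numeric, then keep the IDs as row names
rating_matrix <- as.matrix(wide[, -1])
rownames(rating_matrix) <- wide$user

dim(rating_matrix)          # users x products, without the ID column
is.numeric(rating_matrix)   # no character-to-double coercion needed

# With recommenderlab loaded, this matrix could then be converted with
# as(rating_matrix, "realRatingMatrix"); that step is left out here so the
# sketch runs with base R alone.
```

Applied to the real data, this would yield a 99 x 29 matrix whose rows are labelled by reviewer ID.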
We then explore the ratings data. Below is the distribution of our user ratings. Most cells of the matrix are empty because it is sparse, so we drop them before plotting (missing ratings are stored as 0 in the sparse representation). Among the observed ratings, the most frequent is 5, with nearly 75 occurrences, followed by 4 and then 1. This matches the intuition that users tend to rate items that they either really love or strongly dislike.
vector_ratings <- as.vector(hc_RRM@data)
vector_ratings <- vector_ratings[vector_ratings != 0]
vector_ratings <- factor(vector_ratings)
qplot(vector_ratings) + ggtitle("Distribution of ratings for healthcare products on Amazon")
We also plot the distribution of the average rating that each product received. Most products received high ratings.
average_ratings <- colMeans(hc_RRM)
qplot(average_ratings, binwidth = 0.1) + ggtitle("Distribution of the average product rating")
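Next, we start to build our recommender system by splitting our data into training and test sets, using about 80% of our users to train our engine and holding out about 20% to test it. Each user is assigned at random, so the split is only approximately 80/20 and changes from run to run.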
which_train <- sample(x = c(TRUE, FALSE), size = nrow(hc_RRM), replace = TRUE, prob = c(0.8, 0.2))
hc_train <- hc_RRM[which_train, ]
hc_test <- hc_RRM[!which_train, ]
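Because the split above is drawn at random, results differ between runs. A minimal sketch of a reproducible variant follows, assuming 99 users as reported by dim(hc_RRM); the seed value and the name `train_idx` are arbitrary choices for illustration. It also fixes the training share at exactly 80% of the users (rounded down) rather than sampling each user independently.

```r
set.seed(123)   # arbitrary seed, chosen only so the split can be reproduced
n_users <- 99   # number of rows in the rating matrix

# Draw exactly 80% of the user indices (rounded down) for training
train_idx   <- sample(n_users, size = floor(0.8 * n_users))
which_train <- seq_len(n_users) %in% train_idx

sum(which_train)    # 79 training users
sum(!which_train)   # 20 test users
```

The logical vector `which_train` can then be used exactly as in the code above to subset the rating matrix into training and test sets.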
The first model we train applies user-based collaborative filtering. We normalize the ratings to remove bias due to users who tend to give consistently high or low ratings; centering makes each user's average rating 0. We use cosine similarity to identify similar users, computed for every pair of user rating vectors. We set the nn parameter to 5 so that predictions for each user are based on the 5 users they are most similar to.
ub_hc <- Recommender(data = hc_train, method = "UBCF", parameter = list(nn = 5, normalize = "center", method = "cosine"))
ub_hc
## Recommender of type 'UBCF' for 'realRatingMatrix'
## learned using 81 users.
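To make the two ingredients of the user-based model concrete, here is a small base R sketch of centering and cosine similarity on a toy matrix. It is an illustration under stated assumptions, not recommenderlab's internal implementation: the toy ratings, the helper `cosine_sim`, and the choice to compare users only on items both have rated are all assumptions made here.

```r
# Toy ratings: rows are users, columns are products, NA means "not rated"
ratings <- rbind(
  u1 = c(5, 3, NA, 1),
  u2 = c(4, NA, 2, 1),
  u3 = c(1, 1, 5, NA)
)
colnames(ratings) <- paste0("p", 1:4)

# Centering: subtract each user's mean rating so every user averages 0
centered <- ratings - rowMeans(ratings, na.rm = TRUE)

# Cosine similarity restricted to the items that both users rated
cosine_sim <- function(a, b) {
  both <- !is.na(a) & !is.na(b)
  if (!any(both)) return(NA_real_)   # no overlap, similarity undefined
  sum(a[both] * b[both]) /
    (sqrt(sum(a[both]^2)) * sqrt(sum(b[both]^2)))
}

# Similarity of u1 to every user (including itself)
sims <- apply(centered, 1, function(other) cosine_sim(centered["u1", ], other))

# Nearest neighbours of u1, most similar first, excluding u1
neighbours <- sort(sims[names(sims) != "u1"], decreasing = TRUE)
neighbours
```

With nn = 5, the model would keep the five highest entries of such a ranking for each user. When two users share no rated items the similarity is undefined, which is a frequent situation in a matrix as sparse as ours.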
Next, we retrieve 5 healthcare product recommendations for each user in our test set (18 users in the run shown here, since 81 of the 99 were used for training).
When we examined the first test user, they had only one rating, a 5 for product 0978559088. Searching this product’s ASIN (https://amazon-asin.com/) shows that it is “Nutrihill Resveratrol Lozenges”, a product described as having “many health benefits such as protecting the heart and circulatory system, lowering cholesterol, and protecting against clots which can cause heart attacks and stroke.” User-based collaborative filtering suggested products such as 0615208479 (Brain Fitness Exercises Software) and 0615406394 (Aphrodite Reborn - Women’s Stories of Hope, Courage and Cancer) for this user. Because the train/test split is random and unseeded, the output printed below comes from a different draw: there the first test user’s single rating is a 5 for product 0974959200, and the recommendations are the five ASINs listed at the end.
ub_pred <- predict(object = ub_hc, newdata = hc_test, n = 5)
ub_matrix <- sapply(ub_pred@items, function(x) {colnames(hc_RRM)[x]} )
as(hc_test, "matrix")[1,]
## user 0615208479 0615269990 0615315860 0615406394 0615836828 0641710577
## NA NA NA NA NA NA NA
## 0641864507 0681504498 0705394638 0736789928 076493211X 0767196767 0767196813
## NA NA NA NA NA NA NA
## 0895640163 0898004640 0898004659 0898004667 0929619730 093926322X 0945523181
## NA NA NA NA NA NA NA
## 0972433503 0974959200 0976202107 0978559088 0979047714 0979679737 097968191X
## NA 5 NA NA NA NA NA
## 0982614926 0982828004
## NA NA
ub_matrix[1]
## [[1]]
## [1] "0615208479" "0615269990" "0615315860" "0615836828" "0641710577"
The second model implements item-based collaborative filtering, which recommends to each user the items most similar, in terms of the ratings they received, to the items that user has already rated. We normalize the data, keep the 5 most similar items for each item (k = 5), and compare items using cosine similarity.
ib_hc <- Recommender(data = hc_train, method = "IBCF", parameter = list(k = 5, normalize = "center", method = "cosine"))
ib_hc
## Recommender of type 'IBCF' for 'realRatingMatrix'
## learned using 81 users.
We then pulled recommendations for the users in our test set using this item-based collaborative filtering recommender. Unfortunately, only user #4 received any recommendations, and only for one product. This user gave a rating of 5 to product 0898004667 (Peacock Gift Wrapping Paper) and was recommended only product 0898004659, for which we could not find a listing through a reverse ASIN search. In the differently drawn split printed below, user #4 instead rated product 0767196813 with a 5 and received no recommendations at all (character(0)).
ib_pred <- predict(object = ib_hc, newdata = hc_test, n = 5)
ib_matrix <- sapply(ib_pred@items, function(x) {colnames(hc_RRM)[x]} )
as(hc_test, "matrix")[4,]
## user 0615208479 0615269990 0615315860 0615406394 0615836828 0641710577
## NA NA NA NA NA NA NA
## 0641864507 0681504498 0705394638 0736789928 076493211X 0767196767 0767196813
## NA NA NA NA NA NA 5
## 0895640163 0898004640 0898004659 0898004667 0929619730 093926322X 0945523181
## NA NA NA NA NA NA NA
## 0972433503 0974959200 0976202107 0978559088 0979047714 0979679737 097968191X
## NA NA NA NA NA NA NA
## 0982614926 0982828004
## NA NA
ib_matrix[4]
## [[1]]
## character(0)
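One possible explanation for the thin item-based results, offered here as an assumption rather than something verified on this data: the similarity between two items can only be estimated from users who rated both, and such overlaps are rare when most users have a single rating. The sketch below counts co-ratings on a toy matrix; the values and the names `rated` and `co_rated` are hypothetical.

```r
# Toy ratings: rows are users, columns are products, NA means "not rated"
ratings <- rbind(
  u1 = c(5, NA, NA),
  u2 = c(NA, 4, NA),
  u3 = c(NA, 5, 2)
)
colnames(ratings) <- c("p1", "p2", "p3")

# 1 where a rating exists, 0 otherwise
rated <- !is.na(ratings)

# Entry [i, j] counts the users who rated both product i and product j
co_rated <- crossprod(rated * 1)
diag(co_rated) <- 0   # an item's overlap with itself is not informative
co_rated

# Items that share no rater with any other item cannot be linked to anything
isolated <- colnames(co_rated)[rowSums(co_rated) == 0]
isolated
```

Here only p2 and p3 share a rater, so an item-based model would have nothing to suggest to a user who rated only p1. Running the same count on the training matrix would show how many products are isolated in this way.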
Overall, a user-based collaborative filtering recommender appears more informative here than an item-based one. The former produced 5 recommendations for each of 14 users in our test set, while the latter produced only a single item recommendation for one of our users. In future projects we will evaluate our recommender systems more thoroughly using metrics such as RMSE.
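As a pointer for that future evaluation, RMSE is the square root of the mean squared difference between predicted and actual ratings. A minimal base R sketch follows, with a hypothetical helper `rmse` and made-up toy values; recommenderlab also provides calcPredictionAccuracy() for this purpose, which is not shown here.

```r
# Root mean squared error over the positions where both values are present
rmse <- function(actual, predicted) {
  keep <- !is.na(actual) & !is.na(predicted)
  sqrt(mean((actual[keep] - predicted[keep])^2))
}

# Toy example: held-out ratings versus model predictions
actual    <- c(5, 4, NA, 1)   # NA marks a rating that was never given
predicted <- c(4, 4, 3, 3)

rmse(actual, predicted)   # errors of 1, 0 and -2 give sqrt(5 / 3)
```

Lower values indicate predictions closer to the held-out ratings, and the measure is expressed on the same 1 to 5 scale as the ratings themselves.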