In this project we implement a recommender system for Amazon healthcare products using matrix factorization, specifically singular value decomposition (SVD).
First, we load the necessary libraries for this project.
library(recommenderlab)
## Loading required package: Matrix
## Warning: package 'Matrix' was built under R version 3.6.3
## Loading required package: arules
## Warning: package 'arules' was built under R version 3.6.3
##
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
##
## abbreviate, write
## Loading required package: proxy
## Warning: package 'proxy' was built under R version 3.6.3
##
## Attaching package: 'proxy'
## The following object is masked from 'package:Matrix':
##
## as.matrix
## The following objects are masked from 'package:stats':
##
## as.dist, dist
## The following object is masked from 'package:base':
##
## as.matrix
## Loading required package: registry
## Registered S3 methods overwritten by 'registry':
## method from
## print.registry_field proxy
## print.registry_entry proxy
library(tidyr)
## Warning: package 'tidyr' was built under R version 3.6.3
##
## Attaching package: 'tidyr'
## The following objects are masked from 'package:Matrix':
##
## expand, pack, unpack
library(caTools)
## Warning: package 'caTools' was built under R version 3.6.3
library(ggplot2)
We sourced our data from Amazon product reviews spanning May 1996 to July 2014, which can be found at the following link: http://jmcauley.ucsd.edu/data/amazon/links.html
Here we load our dataset of 2.9 million reviews and add column names. The users are identified by unique reviewer IDs and the products are coded with Amazon Standard Identification Numbers (ASINs).
hc <- read.csv("ratings_Health_and_Personal_Care.csv", header = FALSE) #the raw ratings file has no header row
colnames(hc) <- c('user', 'product', 'rating', 'time') #add column names
head(hc)
## user product rating time
## 1 A3FYN0SZYWN74 0615208479 5 1228089600
## 2 A2J0WRZSAAHUAP 0615269990 5 1396742400
## 3 A38RKP6G5P8J63 0615269990 5 1386115200
## 4 ARENM677YXZKX 0615269990 2 1398297600
## 5 AMBJQQSRCAOHS 0615315860 5 1362614400
## 6 A2KX3GMQY9LS9N 0615315860 5 1358035200
We removed the time column and converted our table from long to wide so that there would be a row for each reviewer and a column for each product, producing a user-item matrix.
Compared to our previous projects, we ingest more reviews this time by taking the first 50,000 reviews in the dataset. The resulting matrix of ratings is sparse, meaning that the vast majority of user-item cells are empty because most users rate only a handful of products; this is a common issue when building recommender systems. Our resulting data set has 47,786 users and 2,026 healthcare products.
hc2 <- head(hc[, 1:3], n = 50000)
hc3 <- spread(hc2, product, rating) #convert table from long to wide
#head(hc3)
In order to use the recommenderlab library, we first have to convert our data frame into a realRatingMatrix object.
hc_matrix <- as.matrix(hc3) #the character user IDs are coerced along with the ratings, which triggers the warning below
hc_RRM <- as(hc_matrix, "realRatingMatrix")
## Warning in storage.mode(from) <- "double": NAs introduced by coercion
dim(hc_RRM)
## [1] 47786 2026
We then reduce the sparsity of our matrix by keeping only users and items with at least 3 ratings each, which leaves us with 1,675 users and 1,264 items.
hc_RRM2 <- hc_RRM[rowCounts(hc_RRM) > 2, colCounts(hc_RRM) > 2]
dim(hc_RRM2)
## [1] 1675 1264
We then explore our ratings data. Below is the distribution of the user ratings. Because our matrix is sparse, most of its entries are missing; among the ratings that were actually given, the most frequent is a 5, with nearly 2,500 of them in our matrix, followed by 4 and then 1. This follows the intuition that users tend to rate items that they either really love or strongly dislike.
vector_ratings <- as.vector(hc_RRM2@data)
vector_ratings <- vector_ratings[vector_ratings != 0]
vector_ratings <- factor(vector_ratings)
qplot(vector_ratings) + ggtitle("Distribution of ratings for healthcare products on Amazon")
We also plotted the distribution of the average rating that our 1,264 products received. Most products received high ratings. (Because we filtered users and items in a single step, 456 products end up with no remaining ratings; their averages are undefined and are dropped by ggplot, which produces the warnings below.)
average_ratings <- colMeans(hc_RRM2)
qplot(average_ratings) + stat_bin(binwidth = 0.1) + ggtitle("Distribution of the average product rating")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 456 rows containing non-finite values (stat_bin).
## Warning: Removed 456 rows containing non-finite values (stat_bin).
Next, we start to build our recommender system by splitting our data into training and test sets using 4-fold cross-validation. On the 1-5 rating scale, we set the threshold between good and bad ratings at 3. We also set the "given" parameter to 1, the minimum number of ratings per user in our filtered dataset; this is the number of ratings from each test user that is given to the recommender, with the rest withheld for evaluation.
percentage_training <- 0.8 #only relevant for the "split" method; unused here since we use cross-validation
min(rowCounts(hc_RRM2))
## [1] 1
items_to_keep <- 1
rating_threshold <- 3
eval_sets <- evaluationScheme(data=hc_RRM2, method="cross-validation", k=4, given=items_to_keep, goodRating=rating_threshold)
Singular value decomposition (SVD) is a dimensionality-reduction matrix factorization technique that decomposes the rating matrix into the product of three matrices, R ≈ U Σ V^T, whose leading singular vectors capture latent user and item factors. SVD cannot be computed on a matrix with missing values, so recommenderlab imputes the missing entries, and we normalize our ratings beforehand to remove bias due to users who tend to give high or low ratings; normalization makes the average rating of each user 0. We train this model using normalization as a pre-processing technique.
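To make the decomposition concrete, here is a minimal sketch using base R's svd() on a small, fully observed toy matrix (the ratings are made up for illustration); keeping only the two largest singular values gives a low-rank approximation of the original ratings.
toy <- matrix(c(5, 4, 1,
                4, 5, 2,
                1, 2, 5,
                2, 1, 4), nrow = 4, byrow = TRUE) #4 users x 3 items, no missing values
dec <- svd(toy) #returns u, d (singular values) and v, with toy = u %*% diag(d) %*% t(v)
k <- 2 #number of latent factors to keep
approx <- dec$u[, 1:k] %*% diag(dec$d[1:k]) %*% t(dec$v[, 1:k]) #rank-2 approximation of toy
round(approx, 2)
The recommenderlab "svd" method applies the same idea to the normalized and imputed real rating matrix.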
svd_rec <- Recommender(data=getData(eval_sets, "train"), method="svd")
We then use this trained model to predict product ratings for our users and evaluate these predictions. The RMSE for this recommender system was about 3.06.
svd_pred <- predict(object=svd_rec, newdata=getData(eval_sets, "known"), n=5, type="ratings")
svd_accuracy <- calcPredictionAccuracy(x=svd_pred, data=getData(eval_sets, "unknown"), byUser=FALSE)
svd_accuracy
## RMSE MSE MAE
## 3.061650 9.373702 2.245651
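For reference, these accuracy metrics simply summarize the differences between predicted and withheld ratings; a minimal sketch with made-up vectors shows how they are computed.
pred <- c(4.2, 3.8, 2.5, 5.0) #hypothetical predicted ratings
actual <- c(5, 3, 1, 5) #hypothetical withheld ratings
err <- pred - actual
c(RMSE = sqrt(mean(err^2)), MSE = mean(err^2), MAE = mean(abs(err)))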
We compare this SVD recommender engine with one that uses item-based collaborative filtering (IBCF). IBCF recommends to each user the items whose rating patterns are most similar to those of the items the user has already rated. In this model, we normalized our data, kept the 5 most similar items for each item, and measured item similarity with cosine similarity.
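As a rough illustration of the similarity measure (with made-up rating vectors; recommenderlab computes this internally on the normalized matrix), the cosine similarity of two items is the cosine of the angle between their rating vectors.
item_a <- c(5, 4, NA, 1) #hypothetical ratings of item A by four users
item_b <- c(4, 5, 2, NA) #hypothetical ratings of item B by the same users
keep <- !is.na(item_a) & !is.na(item_b) #compare only users who rated both items
a <- item_a[keep]; b <- item_b[keep]
sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2))) #cosine similarity
The IBCF model itself is trained below.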
ib_rec <- Recommender(data=getData(eval_sets, "train"), method = "IBCF", parameter = list(k = 5, normalize = "center", method = "cosine"))
When we first ran this model, its test-set RMSE was roughly 2. In the knitted output below, however, the accuracy metrics come out as NaN, most likely because the IBCF model with only 5 neighbors per item did not produce predictions for any of the withheld ratings in this run (see also the note at the end).
ib_pred <- predict(object=ib_rec, newdata=getData(eval_sets, "known"), n=5, type="ratings")
ib_accuracy <- calcPredictionAccuracy(x=ib_pred, data=getData(eval_sets, "unknown"), byUser=FALSE)
ib_accuracy
## RMSE MSE MAE
## NaN NaN NaN
After comparing the RMSE of our recommender engines, we concluded that the engine using item-based collaborative filtering performs better than the engine using SVD. As we saw in previous projects, user-based collaborative filtering yielded even better results than item-based collaborative filtering, so that remains our best engine so far for this dataset of Amazon healthcare product ratings.
(Note that some of the metric results may differ when these results are published to RPubs.)