Building a Production-Ready Book Recommendation Engine
Author: Jim Mundy
Date: July 2020
I have selected the Goodreads data set for my final project. The data set has approximately 1 million ratings of 10 thousand books. Per the Final Project requirements, I seek to produce a production-quality system that provides quality book recommendations. The steps I will follow to achieve these objectives are set forth below:
We use the readr package and the read_csv function to import the data. The data set consists of four CSV files:
library(readr)

books <- read_csv('books.csv')
ratings <- read_csv('ratings_cleaned.csv')
book_tags <- read_csv('book_tags.csv')
tags <- read_csv('tags.csv')

After loading the data, we take a quick glimpse to make sure everything loaded correctly. Overall the data looks good.
## Rows: 10,000
## Columns: 23
## $ id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13...
## $ book_id <dbl> 2767052, 3, 41865, 2657, 4671, 11870085, ...
## $ best_book_id <dbl> 2767052, 3, 41865, 2657, 4671, 11870085, ...
## $ work_id <dbl> 2792775, 4640799, 3212258, 3275794, 24549...
## $ books_count <dbl> 272, 491, 226, 487, 1356, 226, 969, 360, ...
## $ isbn <chr> "439023483", "439554934", "316015849", "6...
## $ isbn13 <dbl> 9.780439e+12, 9.780440e+12, 9.780316e+12,...
## $ authors <chr> "Suzanne Collins", "J.K. Rowling, Mary Gr...
## $ original_publication_year <dbl> 2008, 1997, 2005, 1960, 1925, 2012, 1937,...
## $ original_title <chr> "The Hunger Games", "Harry Potter and the...
## $ title <chr> "The Hunger Games (The Hunger Games, #1)"...
## $ language_code <chr> "eng", "eng", "en-US", "eng", "eng", "eng...
## $ average_rating <dbl> 4.34, 4.44, 3.57, 4.25, 3.89, 4.26, 4.25,...
## $ ratings_count <dbl> 4780653, 4602479, 3866839, 3198671, 26836...
## $ work_ratings_count <dbl> 4942365, 4800065, 3916824, 3340896, 27737...
## $ work_text_reviews_count <dbl> 155254, 75867, 95009, 72586, 51992, 14073...
## $ ratings_1 <dbl> 66715, 75504, 456191, 60427, 86236, 47994...
## $ ratings_2 <dbl> 127936, 101676, 436802, 117415, 197621, 9...
## $ ratings_3 <dbl> 560092, 455024, 793319, 446835, 606158, 3...
## $ ratings_4 <dbl> 1481305, 1156318, 875073, 1001952, 936012...
## $ ratings_5 <dbl> 2706317, 3011543, 1355439, 1714267, 94771...
## $ image_url <chr> "https://images.gr-assets.com/books/14473...
## $ small_image_url <chr> "https://images.gr-assets.com/books/14473...
## Rows: 960,595
## Columns: 4
## $ book_id <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ user_id <dbl> 314, 439, 588, 1169, 1185, 2077, 2487, 2900, 3662, 3922, 53...
## $ rating <dbl> 5, 3, 5, 4, 4, 4, 4, 5, 4, 5, 5, 3, 5, 5, 3, 1, 4, 5, 4, 4,...
## $ N <dbl> 181, 173, 186, 187, 190, 180, 193, 190, 185, 188, 181, 188,...
To develop a better understanding of the data, we use the skim function on the books and ratings data. Next we use ggplot to perform some additional EDA and to evaluate what, if any, preprocessing is required. The skim summaries, charts, and resulting observations follow.
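A minimal sketch of how those summaries are produced, assuming the skimr package is installed and loaded:

library(skimr)

# summary statistics for the books and ratings data (output shown below)
skim(books)
skim(ratings)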
| Name | books |
|---|---|
| Number of rows | 10000 |
| Number of columns | 23 |
| Column type frequency: | |
| character | 7 |
| numeric | 16 |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| isbn | 700 | 0.93 | 7 | 10 | 0 | 9300 | 0 |
| authors | 0 | 1.00 | 3 | 742 | 0 | 4664 | 0 |
| original_title | 590 | 0.94 | 1 | 196 | 0 | 9258 | 0 |
| title | 0 | 1.00 | 2 | 186 | 0 | 9964 | 0 |
| language_code | 1084 | 0.89 | 2 | 5 | 0 | 25 | 0 |
| image_url | 0 | 1.00 | 52 | 88 | 0 | 6669 | 0 |
| small_image_url | 0 | 1.00 | 52 | 86 | 0 | 6669 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1.00 | 5.000500e+03 | 2.886900e+03 | 1.00 | 2.500750e+03 | 5.000500e+03 | 7.500250e+03 | 1.000000e+04 | ▇▇▇▇▇ |
| book_id | 0 | 1.00 | 5.264697e+06 | 7.575462e+06 | 1.00 | 4.627575e+04 | 3.949655e+05 | 9.382225e+06 | 3.328864e+07 | ▇▂▁▁▁ |
| best_book_id | 0 | 1.00 | 5.471214e+06 | 7.827330e+06 | 1.00 | 4.791175e+04 | 4.251235e+05 | 9.636113e+06 | 3.553423e+07 | ▇▂▁▁▁ |
| work_id | 0 | 1.00 | 8.646183e+06 | 1.175106e+07 | 87.00 | 1.008841e+06 | 2.719525e+06 | 1.451775e+07 | 5.639960e+07 | ▇▂▁▁▁ |
| books_count | 0 | 1.00 | 7.571000e+01 | 1.704700e+02 | 1.00 | 2.300000e+01 | 4.000000e+01 | 6.700000e+01 | 3.455000e+03 | ▇▁▁▁▁ |
| isbn13 | 585 | 0.94 | 9.755044e+12 | 4.428619e+11 | 195170342.00 | 9.780316e+12 | 9.780452e+12 | 9.780831e+12 | 9.790008e+12 | ▁▁▁▁▇ |
| original_publication_year | 21 | 1.00 | 1.981990e+03 | 1.525800e+02 | -1750.00 | 1.990000e+03 | 2.004000e+03 | 2.011000e+03 | 2.017000e+03 | ▁▁▁▁▇ |
| average_rating | 0 | 1.00 | 4.000000e+00 | 2.500000e-01 | 2.47 | 3.850000e+00 | 4.020000e+00 | 4.180000e+00 | 4.820000e+00 | ▁▁▃▇▁ |
| ratings_count | 0 | 1.00 | 5.400124e+04 | 1.573700e+05 | 2716.00 | 1.356875e+04 | 2.115550e+04 | 4.105350e+04 | 4.780653e+06 | ▇▁▁▁▁ |
| work_ratings_count | 0 | 1.00 | 5.968732e+04 | 1.678038e+05 | 5510.00 | 1.543875e+04 | 2.383250e+04 | 4.591500e+04 | 4.942365e+06 | ▇▁▁▁▁ |
| work_text_reviews_count | 0 | 1.00 | 2.919960e+03 | 6.124380e+03 | 3.00 | 6.940000e+02 | 1.402000e+03 | 2.744250e+03 | 1.552540e+05 | ▇▁▁▁▁ |
| ratings_1 | 0 | 1.00 | 1.345040e+03 | 6.635630e+03 | 11.00 | 1.960000e+02 | 3.910000e+02 | 8.850000e+02 | 4.561910e+05 | ▇▁▁▁▁ |
| ratings_2 | 0 | 1.00 | 3.110880e+03 | 9.717120e+03 | 30.00 | 6.560000e+02 | 1.163000e+03 | 2.353250e+03 | 4.368020e+05 | ▇▁▁▁▁ |
| ratings_3 | 0 | 1.00 | 1.147589e+04 | 2.854645e+04 | 323.00 | 3.112000e+03 | 4.894000e+03 | 9.287000e+03 | 7.933190e+05 | ▇▁▁▁▁ |
| ratings_4 | 0 | 1.00 | 1.996570e+04 | 5.144736e+04 | 750.00 | 5.405750e+03 | 8.269500e+03 | 1.602350e+04 | 1.481305e+06 | ▇▁▁▁▁ |
| ratings_5 | 0 | 1.00 | 2.378981e+04 | 7.976889e+04 | 754.00 | 5.334000e+03 | 8.836000e+03 | 1.730450e+04 | 3.011543e+06 | ▇▁▁▁▁ |
| Name | ratings |
|---|---|
| Number of rows | 960595 |
| Number of columns | 4 |
| Column type frequency: | |
| numeric | 4 |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| book_id | 0 | 1 | 4890.38 | 2862.32 | 1 | 2414 | 4847 | 7336 | 10000 | ▇▇▇▇▇ |
| user_id | 0 | 1 | 25593.48 | 15209.95 | 1 | 12372 | 25047 | 38511 | 53424 | ▇▇▇▇▆ |
| rating | 0 | 1 | 3.85 | 0.98 | 1 | 3 | 4 | 5 | 5 | ▁▂▆▇▆ |
| N | 0 | 1 | 56.71 | 48.58 | 3 | 19 | 42 | 80 | 200 | ▇▃▂▁▁ |
The following charts facilitate a deeper understanding of the data set.
There are relatively few 1 and 2 ratings. Four was the most popular rating.
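For reference, a minimal ggplot2 sketch of that rating-distribution chart; the styling is an assumption, not the report's actual figure code:

ratings %>%
  ggplot(aes(x = factor(rating))) +
  geom_bar(fill = "steelblue") +
  labs(title = "Distribution of Ratings", x = "Rating", y = "Number of ratings")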
The ratings-per-user chart has a long tail, with many users providing only a handful (1 to 5) of ratings and far fewer users providing up to 200 ratings.
The data set includes more than 25 different genres. The most popular genres include fantasy, romance and mystery; sports, manga and cookbooks are the least represented genres.
Finally, the table below sets forth the Top 10 books by average rating. Three different Harry Potter books made the Top 10.
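That table can be produced along the following lines; this is a dplyr sketch of the idea, not the exact code behind the report's table:

books %>%
  arrange(desc(average_rating)) %>%
  select(title, authors, average_rating, ratings_count) %>%
  head(10)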
The EDA leads me to implement one preprocessing task: removing duplicate user/book rating pairs, if any.
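A minimal sketch of that step with dplyr, keeping only the first rating for each user/book pair (the tie-breaking rule is an assumption):

ratings <- ratings %>%
  distinct(user_id, book_id, .keep_all = TRUE)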
We will utilize the recommenderlab package to build and evaluate our recommendation engine. We will build a user-based collaborative filtering (UBCF) model as our baseline model. We will also consider alternative algorithms in an effort to find the algorithm that produces the best possible recommendations. The key steps in our analysis include:
First we use dplyr and tidyr to restructure the ratings data frame. Then we convert the data frame to a matrix.
dimension_names <- list(user_id = sort(unique(ratings$user_id)), book_id = sort(unique(ratings$book_id)))
ratingmat <- spread(select(ratings, book_id, user_id, rating), book_id, rating) %>%
  select(-user_id)
ratingmat <- as.matrix(ratingmat)
dimnames(ratingmat) <- dimension_names
# ratingmat[1:5, 1:5]

Next we replace NAs with zeros and convert the result to a sparse matrix, then use recommenderlab's realRatingMatrix class (data = sparse_ratings) to create the rating matrix used for modeling. The sparse format reduces the memory footprint of the model. We have just under 1 million (960,595) ratings in our ratings matrix.
library(recommenderlab)

ratingmat0 <- ratingmat
ratingmat0[is.na(ratingmat0)] <- 0
sparse_ratings <- as(ratingmat0, "sparseMatrix")
rm(ratingmat0)
gc()

# wrap the sparse matrix in recommenderlab's realRatingMatrix class
real_ratings <- new("realRatingMatrix", data = sparse_ratings)
real_ratings

## 45016 x 10000 rating matrix of class 'realRatingMatrix' with 960595 ratings.
Now we build the model using the Recommender function, the UBCF algorithm and the Pearson method to calculate similarity.
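The call looks roughly like the sketch below; the nn value shown is only an assumption (tuning follows shortly), and real_ratings is the realRatingMatrix built above.

model <- Recommender(real_ratings, method = "UBCF",
                     parameter = list(method = "pearson", nn = 40))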
Now we designate a user (user 4763) and use the predict function to make a prediction.
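A sketch of the prediction step, assuming the model object above; how the predictions are formatted into the results table is an assumption.

# locate user 4763 in the rating matrix
current_user <- which(rownames(real_ratings) == "4763")

# predicted ratings for the books this user has not yet rated
pred <- predict(model, real_ratings[current_user, ], type = "ratings")

# the ten highest predicted ratings (book ids are the names)
top10 <- sort(as(pred, "matrix")[1, ], decreasing = TRUE)[1:10]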
Here are the prediction results for user 4763
We'll utilize 10-fold cross-validation and different values of nn to tune our recommendation model. We'll set k to 10 and given to -1. Setting given to -1 means all but one rating per user is used to make the predictions, and performance is then evaluated on that single held-out rating for each user.
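The evaluation scheme referenced in the tuning code below is set up along these lines; the 500-user subset mirrors the comparison run later in the report, and treating that subset size as the one used here is an assumption.

scheme <- evaluationScheme(real_ratings[1:500, ], method = "cross-validation",
                           k = 10, given = -1, goodRating = 5)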
Now we tune our model with nn set to 20, 30, 40, and 50. Tuning results are plotted below. We see that nn = 40 produces the best results, but it is not significantly better than nn = 20, 30, or 50.
algorithms <- list("random" = list(name = "RANDOM", param = NULL),
"UBCF_20" = list(name = "UBCF", param = list(nn = 20)),
"UBCF_30" = list(name = "UBCF", param = list(nn = 30)),
"UBCF_40" = list(name = "UBCF", param = list(nn = 40)),
"UBCF_50" = list(name = "UBCF", param = list(nn = 50))
)
# evaluate the algorithms with the given scheme
results <- evaluate(scheme, algorithms, type = "ratings")

## RANDOM run fold/sample [model time/prediction time]
## 1 [0.01sec/0.43sec]
## 2 [0sec/0.33sec]
## 3 [0sec/0.34sec]
## 4 [0sec/0.34sec]
## 5 [0sec/0.31sec]
## 6 [0sec/0.32sec]
## 7 [0sec/0.32sec]
## 8 [0sec/0.34sec]
## 9 [0sec/0.34sec]
## 10 [0.02sec/0.31sec]
## UBCF run fold/sample [model time/prediction time]
## 1 [0.02sec/2.17sec]
## 2 [0sec/2.18sec]
## 3 [0sec/2.29sec]
## 4 [0.01sec/2.38sec]
## 5 [0sec/2.8sec]
## 6 [0sec/3.14sec]
## 7 [0sec/2.29sec]
## 8 [0sec/2.37sec]
## 9 [0.02sec/2.5sec]
## 10 [0sec/2.79sec]
## UBCF run fold/sample [model time/prediction time]
## 1 [0sec/2sec]
## 2 [0sec/1.99sec]
## 3 [0sec/2.02sec]
## 4 [0.02sec/2.04sec]
## 5 [0.02sec/1.98sec]
## 6 [0sec/2.03sec]
## 7 [0sec/2.01sec]
## 8 [0sec/2sec]
## 9 [0sec/2.06sec]
## 10 [0sec/2.27sec]
## UBCF run fold/sample [model time/prediction time]
## 1 [0.02sec/2.04sec]
## 2 [0sec/2.03sec]
## 3 [0.01sec/2.03sec]
## 4 [0sec/2.12sec]
## 5 [0.01sec/2.19sec]
## 6 [0.02sec/2.28sec]
## 7 [0.02sec/2.21sec]
## 8 [0.01sec/2.91sec]
## 9 [0sec/2.63sec]
## 10 [0.02sec/2.65sec]
## UBCF run fold/sample [model time/prediction time]
## 1 [0.01sec/2.13sec]
## 2 [0sec/2.23sec]
## 3 [0sec/2.07sec]
## 4 [0sec/2.03sec]
## 5 [0.02sec/2.04sec]
## 6 [0.02sec/2.18sec]
## 7 [0sec/2.13sec]
## 8 [0sec/2.09sec]
## 9 [0sec/2.13sec]
## 10 [0sec/2.12sec]
# restructure results output
tmp <- lapply(results, function(x) slot(x, "results"))
res <- tmp %>%
lapply(function(x) unlist(lapply(x, function(x) unlist(x@cm[ ,"RMSE"])))) %>%
as.data.frame() %>%
gather(key = "Algorithm", value = "RMSE")

The UBCF model had similar RMSE values across the different values of nn. The best performance was achieved when nn was set to 40.
res %>%
ggplot(aes(Algorithm, RMSE, fill = Algorithm)) +
geom_bar(stat = "summary") + geom_errorbar(stat = "summary", width = 0.3, size = 0.8) +
coord_cartesian(ylim = c(0.6, 1.3)) + guides(fill = FALSE) +
ggtitle("Tuning Results") +
theme_fivethirtyeight()

## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`
Given the SVD model's ability to deploy at scale and our desire to deploy the model in Shiny, we compare our baseline UBCF model to an SVD model below. All things being equal, SVD would likely be the better algorithm for a production system because of its ability to make predictions quickly.
scheme <- evaluationScheme(real_ratings[1:500,], method = "cross-validation", k = 10, given = -1, goodRating = 5)
algorithms <- list(
"UBCF" = list(name = "UBCF"),
"SVD" = list(name = "SVD")
)
results <- evaluate(scheme, algorithms, type = "ratings", progress = FALSE)
# restructure results output
tmp <- lapply(results, function(x) slot(x, "results"))
res <- tmp %>%
lapply(function(x) unlist(lapply(x, function(x) unlist(x@cm[ ,"RMSE"])))) %>%
as.data.frame() %>%
gather(key = "Algorithm", value = "RMSE")

UBCF outperforms SVD; since all things are not equal after all, we will deploy our engine using our baseline UBCF model.
res %>%
mutate(Algorithm=factor(Algorithm, levels = c("UBCF", "SVD"))) %>%
ggplot(aes(Algorithm, RMSE, fill = Algorithm)) + geom_bar(stat = "summary") +
geom_errorbar(stat = "summary", width = 0.3, size = 0.8) + coord_cartesian(ylim = c(0.6, 1.3)) +
guides(fill = FALSE) +
ggtitle("UBCF vs SVD") +
theme_fivethirtyeight()

We have picked our baseline UBCF model to deploy. The problem with that choice is that recommenderlab is more of an educational/analytical tool and is not ideally suited to production systems, because it is rather slow at producing predictions. Normally, one might turn to a Spark (and sparklyr) deployment strategy. The Spark solution is well suited to production systems because it holds everything in memory and produces predictions quickly and efficiently.
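For context only, a Spark deployment might look roughly like the sketch below. It uses sparklyr's ALS matrix-factorization routine, which is Spark MLlib's collaborative-filtering algorithm rather than UBCF; the connection settings and hyperparameters are assumptions, not something built for this project.

library(sparklyr)

sc <- spark_connect(master = "local")

# move the ratings into Spark and fit an ALS collaborative-filtering model
ratings_tbl <- copy_to(sc, ratings, "ratings", overwrite = TRUE)
als_model <- ml_als(ratings_tbl, rating_col = "rating",
                    user_col = "user_id", item_col = "book_id", rank = 10)

# score the same table to obtain predicted ratings
predictions <- ml_predict(als_model, ratings_tbl)

spark_disconnect(sc)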
While I was researching final project alternatives, however, I came across a blog post that demonstrates an efficient R implementation of User Based Collaborative Filtering.
The blog post sets forth the following steps to implement an improved UBCF model:
Two main optimization steps are summarized below:
The calculation of similarities is optimized for large sparse matrices.
The k nearest neighbors on the similarity matrix are not calculated in a loop, but with an optimized implementation: all values of the similarity matrix are grouped by column, the k-th highest value in each group (column) is found, and only the k highest values per column are kept and used for predictions. A rough sketch of this idea appears after the next paragraph.
You can see some similarities to matrix decomposition/reduction utilized by SVD models.
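Here is a rough illustration of that second step, assuming a sparse user-user similarity matrix sim; this is a sketch of the idea, not the blog post's actual code.

library(Matrix)

keep_top_k <- function(sim, k = 40) {
  # triplet representation of the sparse matrix: row i, column j, value x
  entries <- summary(sim)
  # within each column, keep only the k largest similarity values
  kept <- do.call(rbind, lapply(split(entries, entries$j), function(col) {
    col[order(col$x, decreasing = TRUE)[seq_len(min(k, nrow(col)))], ]
  }))
  sparseMatrix(i = kept$i, j = kept$j, x = kept$x, dims = dim(sim))
}

Predictions can then be computed as (normalized) products of this truncated similarity matrix with the rating matrix, rather than user by user in a loop.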
Two functions did all the heavy lifting in this approach: cf_algorithm.R (collaborative filtering) and similarity_measures.R. The link below is to my GitHub repository, where these functions can be reviewed.
List of resources leveraged in the project: