Final Project


Good Reads Book Recommender

Building a Production-Ready Book Recommendation Engine


Author: Jim Mundy

Date: July 2020


Game Plan


I have selected the Good Reads data set for my final project. The data set has approximately 1 million ratings and 10 thousand items. Per the Final Project requirements, I seek to produce a production-quality system that provides quality book recommendations. The steps I will follow to achieve these objectives are set forth below:


  • Load Data and Perform EDA
  • Execute Required Preprocessing Tasks That Result From EDA
  • Build a Book Recommender Engine That Produces Quality Recommendations
  • Optimize the Engine For Production
  • Deploy the Recommender Engine Using Shiny




Load Data

Loading the Data


We use the readr package and the read_csv function to import the data. The data set consists of four csv files:

  • books - book meta data including title, author, isbn number, etc.
  • ratings - includes book_id, user_id and rating.
  • book_tags - join file comprised of book_id, tag_id and count.
  • tags - look-up table for tag names comprised of tag_id and tag_name.


After loading the data, we take a quick glimpse to make sure everything loaded correctly. Overall the data looks good.
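A minimal sketch of the import step is shown below (the file names and the data/ folder are assumptions based on the standard goodbooks-10k layout):

```r
library(readr)
library(dplyr)

# File names and paths are illustrative
books     <- read_csv("data/books.csv")
ratings   <- read_csv("data/ratings.csv")
book_tags <- read_csv("data/book_tags.csv")
tags      <- read_csv("data/tags.csv")

# Quick glimpse to confirm everything loaded correctly
glimpse(books)
glimpse(ratings)
```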

Books

## Rows: 10,000
## Columns: 23
## $ id                        <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13...
## $ book_id                   <dbl> 2767052, 3, 41865, 2657, 4671, 11870085, ...
## $ best_book_id              <dbl> 2767052, 3, 41865, 2657, 4671, 11870085, ...
## $ work_id                   <dbl> 2792775, 4640799, 3212258, 3275794, 24549...
## $ books_count               <dbl> 272, 491, 226, 487, 1356, 226, 969, 360, ...
## $ isbn                      <chr> "439023483", "439554934", "316015849", "6...
## $ isbn13                    <dbl> 9.780439e+12, 9.780440e+12, 9.780316e+12,...
## $ authors                   <chr> "Suzanne Collins", "J.K. Rowling, Mary Gr...
## $ original_publication_year <dbl> 2008, 1997, 2005, 1960, 1925, 2012, 1937,...
## $ original_title            <chr> "The Hunger Games", "Harry Potter and the...
## $ title                     <chr> "The Hunger Games (The Hunger Games, #1)"...
## $ language_code             <chr> "eng", "eng", "en-US", "eng", "eng", "eng...
## $ average_rating            <dbl> 4.34, 4.44, 3.57, 4.25, 3.89, 4.26, 4.25,...
## $ ratings_count             <dbl> 4780653, 4602479, 3866839, 3198671, 26836...
## $ work_ratings_count        <dbl> 4942365, 4800065, 3916824, 3340896, 27737...
## $ work_text_reviews_count   <dbl> 155254, 75867, 95009, 72586, 51992, 14073...
## $ ratings_1                 <dbl> 66715, 75504, 456191, 60427, 86236, 47994...
## $ ratings_2                 <dbl> 127936, 101676, 436802, 117415, 197621, 9...
## $ ratings_3                 <dbl> 560092, 455024, 793319, 446835, 606158, 3...
## $ ratings_4                 <dbl> 1481305, 1156318, 875073, 1001952, 936012...
## $ ratings_5                 <dbl> 2706317, 3011543, 1355439, 1714267, 94771...
## $ image_url                 <chr> "https://images.gr-assets.com/books/14473...
## $ small_image_url           <chr> "https://images.gr-assets.com/books/14473...

Ratings

## Rows: 960,595
## Columns: 4
## $ book_id <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ user_id <dbl> 314, 439, 588, 1169, 1185, 2077, 2487, 2900, 3662, 3922, 53...
## $ rating  <dbl> 5, 3, 5, 4, 4, 4, 4, 5, 4, 5, 5, 3, 5, 5, 3, 1, 4, 5, 4, 4,...
## $ N       <dbl> 181, 173, 186, 187, 190, 180, 193, 190, 185, 188, 181, 188,...

EDA & Preprocessing

EDA & Preprocessing


To develop a better understanding of the data, we use the skim function on the books and ratings data. Next we use ggplot to perform some additional EDA and to evaluate what, if any, preprocessing is required (a sketch of this EDA code appears after the list of observations below). Here are some observations:

  • The fields with missing data include: ISBN, original title, language code and original publication year.
  • The missing data shouldn’t impact my recommender engine’s performance.
  • The ratings table is 100% complete, with ratings from 1 to 5.
  • The vast majority of ratings fall between 3 and 5.
  • We will need to remove duplicate data.
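A minimal sketch of this EDA code, assuming the books and ratings data frames loaded above (plot titles and binning are illustrative):

```r
library(dplyr)
library(skimr)
library(ggplot2)

# Summary statistics for both tables
skim(books)
skim(ratings)

# Distribution of ratings: most fall between 3 and 5
ggplot(ratings, aes(x = rating)) +
  geom_bar() +
  labs(title = "Count of Ratings", x = "Rating", y = "Count")

# Number of ratings per user: a long-tailed distribution
ratings %>%
  count(user_id) %>%
  ggplot(aes(x = n)) +
  geom_histogram(bins = 50) +
  labs(title = "Number of Ratings per User", x = "Ratings per user", y = "Users")
```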


Books

Data summary
Name books
Number of rows 10000
Number of columns 23
_______________________
Column type frequency:
character 7
numeric 16
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
isbn 700 0.93 7 10 0 9300 0
authors 0 1.00 3 742 0 4664 0
original_title 590 0.94 1 196 0 9258 0
title 0 1.00 2 186 0 9964 0
language_code 1084 0.89 2 5 0 25 0
image_url 0 1.00 52 88 0 6669 0
small_image_url 0 1.00 52 86 0 6669 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
id 0 1.00 5.000500e+03 2.886900e+03 1.00 2.500750e+03 5.000500e+03 7.500250e+03 1.000000e+04 ▇▇▇▇▇
book_id 0 1.00 5.264697e+06 7.575462e+06 1.00 4.627575e+04 3.949655e+05 9.382225e+06 3.328864e+07 ▇▂▁▁▁
best_book_id 0 1.00 5.471214e+06 7.827330e+06 1.00 4.791175e+04 4.251235e+05 9.636113e+06 3.553423e+07 ▇▂▁▁▁
work_id 0 1.00 8.646183e+06 1.175106e+07 87.00 1.008841e+06 2.719525e+06 1.451775e+07 5.639960e+07 ▇▂▁▁▁
books_count 0 1.00 7.571000e+01 1.704700e+02 1.00 2.300000e+01 4.000000e+01 6.700000e+01 3.455000e+03 ▇▁▁▁▁
isbn13 585 0.94 9.755044e+12 4.428619e+11 195170342.00 9.780316e+12 9.780452e+12 9.780831e+12 9.790008e+12 ▁▁▁▁▇
original_publication_year 21 1.00 1.981990e+03 1.525800e+02 -1750.00 1.990000e+03 2.004000e+03 2.011000e+03 2.017000e+03 ▁▁▁▁▇
average_rating 0 1.00 4.000000e+00 2.500000e-01 2.47 3.850000e+00 4.020000e+00 4.180000e+00 4.820000e+00 ▁▁▃▇▁
ratings_count 0 1.00 5.400124e+04 1.573700e+05 2716.00 1.356875e+04 2.115550e+04 4.105350e+04 4.780653e+06 ▇▁▁▁▁
work_ratings_count 0 1.00 5.968732e+04 1.678038e+05 5510.00 1.543875e+04 2.383250e+04 4.591500e+04 4.942365e+06 ▇▁▁▁▁
work_text_reviews_count 0 1.00 2.919960e+03 6.124380e+03 3.00 6.940000e+02 1.402000e+03 2.744250e+03 1.552540e+05 ▇▁▁▁▁
ratings_1 0 1.00 1.345040e+03 6.635630e+03 11.00 1.960000e+02 3.910000e+02 8.850000e+02 4.561910e+05 ▇▁▁▁▁
ratings_2 0 1.00 3.110880e+03 9.717120e+03 30.00 6.560000e+02 1.163000e+03 2.353250e+03 4.368020e+05 ▇▁▁▁▁
ratings_3 0 1.00 1.147589e+04 2.854645e+04 323.00 3.112000e+03 4.894000e+03 9.287000e+03 7.933190e+05 ▇▁▁▁▁
ratings_4 0 1.00 1.996570e+04 5.144736e+04 750.00 5.405750e+03 8.269500e+03 1.602350e+04 1.481305e+06 ▇▁▁▁▁
ratings_5 0 1.00 2.378981e+04 7.976889e+04 754.00 5.334000e+03 8.836000e+03 1.730450e+04 3.011543e+06 ▇▁▁▁▁

Ratings

Data summary
Name ratings
Number of rows 960595
Number of columns 4
_______________________
Column type frequency:
numeric 4
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
book_id 0 1 4890.38 2862.32 1 2414 4847 7336 10000 ▇▇▇▇▇
user_id 0 1 25593.48 15209.95 1 12372 25047 38511 53424 ▇▇▇▇▆
rating 0 1 3.85 0.98 1 3 4 5 5 ▁▂▆▇▆
N 0 1 56.71 48.58 3 19 42 80 200 ▇▃▂▁▁

ggplot2 EDA

The following charts facilitate a deeper understanding of the data set.


Count of Ratings

There are relatively few 1 and 2 ratings. Four was the most popular rating.

Number of ratings per user

The ratings-per-user chart has a long tail, with many users providing only a handful (1 to 5) of ratings and far fewer users providing up to 200 ratings.

Distribution of Genres

The data set includes more than 25 different genres. The most popular genres include fantasy, romance and mystery. Sports, manga and cookbooks were the least represented genres.


Top 10 rated books

Finally, the table below sets forth the Top 10 books by average rating. Three different Harry Potter books made the Top 10.




Preprocessing

The EDA leads me to implement one preprocessing task: removing duplicate user/book rating pairs, if any.
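A sketch of that step with dplyr, assuming a duplicate is defined as a repeated user_id/book_id pair:

```r
library(dplyr)

# Keep only one rating per user/book pair
ratings <- ratings %>%
  distinct(user_id, book_id, .keep_all = TRUE)
```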

Build Engine

Build and Evaluate Recommendation Engine


We will utilize recommenderlab to build and evaluate our recommendation engine. We will build a user-based collaborative filtering (UBCF) model as our baseline. We will also consider alternative algorithms in an effort to find the algorithm that produces the best possible recommendations. The key steps in our analysis include:

  • Create ratings matrix
  • Build UBCF Model
  • Create Prediction of UBCF Model
  • Evaluate and Tune UBCF Model
  • Evaluate Alternative Algorithms
  • Pick the Best Algorithm


Create Rating Matrix

First we use dplyr to restructure the ratings data frame. Then we convert the data frame to a matrix.

Next we replace NAs with zeros and use recommenderlab to create a sparse rating matrix (sparse_ratings). The sparse matrix format reduces the memory footprint of the model. We have just under 1 million (960,595) ratings in our ratings matrix.

## 45016 x 10000 rating matrix of class 'realRatingMatrix' with 960595 ratings.
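A sketch of how such a matrix can be built, assuming the user and book ids are recoded to consecutive row/column indices (the exact restructuring code may differ):

```r
library(Matrix)
library(recommenderlab)

# Recode ids to consecutive indices for the matrix dimensions
user_index <- as.integer(factor(ratings$user_id))
book_index <- as.integer(factor(ratings$book_id))

# Sparse user x item matrix; unrated cells stay empty
sparse_ratings <- sparseMatrix(
  i = user_index,
  j = book_index,
  x = ratings$rating
)

# Wrap it in recommenderlab's realRatingMatrix class
ratings_matrix <- new("realRatingMatrix", data = sparse_ratings)
ratings_matrix
```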

Build Model

Now we build the model using the Recommender function, the UBCF algorithm and the Pearson method to calculate similarity.
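A sketch of the model-fitting call (the nn value shown is illustrative; tuning is covered below):

```r
# User-based collaborative filtering with Pearson similarity
ubcf_model <- Recommender(
  ratings_matrix,
  method = "UBCF",
  param  = list(method = "pearson", nn = 30)
)
```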

Create Predictions

Now we designate a user (user 4763) and use the predict function to make a prediction.
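A sketch of the prediction step, assuming user 4763 corresponds to row 4763 of the rating matrix:

```r
# Top-10 book recommendations for one user
pred <- predict(ubcf_model, ratings_matrix[4763, ], type = "topNList", n = 10)

# Recommended item (column) indices, which can be joined back to book titles
as(pred, "list")
```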

Here are the prediction results for user 4763:


Tune the model

We’ll utilize 10-fold cross-validation and different values of nn (the neighborhood size) to tune our recommendation model. We’ll set k to 10 and given to -1. A given of negative 1 means that all but one of each user’s ratings are used to make predictions; performance is then evaluated on the single held-out rating for each user.

Now we tune our model with values of nn set to 20, 30, 40 and 50. Tuning results are plotted below. We see that nn = 40 produces the best results, but it’s not significantly better than nn = 20, 30 or 50.
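A sketch of the evaluation setup using recommenderlab's evaluationScheme and evaluate functions (the algorithm list mirrors the nn values above plus a random baseline):

```r
# 10-fold cross-validation; given = -1 holds out one rating per user
scheme <- evaluationScheme(
  ratings_matrix,
  method = "cross-validation",
  k      = 10,
  given  = -1
)

# Random baseline plus UBCF with different neighborhood sizes
algorithms <- list(
  "RANDOM"       = list(name = "RANDOM", param = NULL),
  "UBCF nn = 20" = list(name = "UBCF", param = list(method = "pearson", nn = 20)),
  "UBCF nn = 30" = list(name = "UBCF", param = list(method = "pearson", nn = 30)),
  "UBCF nn = 40" = list(name = "UBCF", param = list(method = "pearson", nn = 40)),
  "UBCF nn = 50" = list(name = "UBCF", param = list(method = "pearson", nn = 50))
)

# Evaluate predicted-rating accuracy for each algorithm
results <- evaluate(scheme, algorithms, type = "ratings")
```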

## RANDOM run fold/sample [model time/prediction time]
##   1  [0.01sec/0.43sec] 
##   2  [0sec/0.33sec] 
##   3  [0sec/0.34sec] 
##   4  [0sec/0.34sec] 
##   5  [0sec/0.31sec] 
##   6  [0sec/0.32sec] 
##   7  [0sec/0.32sec] 
##   8  [0sec/0.34sec] 
##   9  [0sec/0.34sec] 
##   10  [0.02sec/0.31sec] 
## UBCF run fold/sample [model time/prediction time]
##   1  [0.02sec/2.17sec] 
##   2  [0sec/2.18sec] 
##   3  [0sec/2.29sec] 
##   4  [0.01sec/2.38sec] 
##   5  [0sec/2.8sec] 
##   6  [0sec/3.14sec] 
##   7  [0sec/2.29sec] 
##   8  [0sec/2.37sec] 
##   9  [0.02sec/2.5sec] 
##   10  [0sec/2.79sec] 
## UBCF run fold/sample [model time/prediction time]
##   1  [0sec/2sec] 
##   2  [0sec/1.99sec] 
##   3  [0sec/2.02sec] 
##   4  [0.02sec/2.04sec] 
##   5  [0.02sec/1.98sec] 
##   6  [0sec/2.03sec] 
##   7  [0sec/2.01sec] 
##   8  [0sec/2sec] 
##   9  [0sec/2.06sec] 
##   10  [0sec/2.27sec] 
## UBCF run fold/sample [model time/prediction time]
##   1  [0.02sec/2.04sec] 
##   2  [0sec/2.03sec] 
##   3  [0.01sec/2.03sec] 
##   4  [0sec/2.12sec] 
##   5  [0.01sec/2.19sec] 
##   6  [0.02sec/2.28sec] 
##   7  [0.02sec/2.21sec] 
##   8  [0.01sec/2.91sec] 
##   9  [0sec/2.63sec] 
##   10  [0.02sec/2.65sec] 
## UBCF run fold/sample [model time/prediction time]
##   1  [0.01sec/2.13sec] 
##   2  [0sec/2.23sec] 
##   3  [0sec/2.07sec] 
##   4  [0sec/2.03sec] 
##   5  [0.02sec/2.04sec] 
##   6  [0.02sec/2.18sec] 
##   7  [0sec/2.13sec] 
##   8  [0sec/2.09sec] 
##   9  [0sec/2.13sec] 
##   10  [0sec/2.12sec]

The UBCF model performed similarly across the different values of nn. The best performance was achieved when nn was set to 40.

## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`

Optimize

Optimize the Engine for Production


We have picked our baseline UBCF model to deploy. The problem with that choice is that recommenderlab is more of an educational/analytical tool and not ideally suited for production systems, because it is rather slow at producing predictions. Normally, one might turn to a Spark (with sparklyr) deployment strategy. The Spark solution is well suited for production systems because it holds everything in memory and produces predictions quickly and efficiently.

While I was researching final project alternatives, however, I came across a blog post that demonstrates an efficient R implementation of User Based Collaborative Filtering.

Blog Post - Improved UBCF

The blog post sets forth the following steps to implement an improved UBCF model:

  • Take a ratings matrix.
  • Normalize the matrix (optional).
  • Calculate similarities between users.
  • Use the k nearest neighbor approach (keep only the k most similar users).
  • Calculate predictions and denormalize them in case normalization was performed.

Two main optimization steps are summarized below:

The calculation of similarities is optimized for large sparse matrices.

k nearest neighbors on similarity matrices were not calculated in a loop, but rather using an optimized implementation. All the values from the similarity matrix were grouped by column. In each group (column), the k-th highest value was found, and only the k highest values per column in the similarity matrix were kept and utilized for predictions.
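An illustrative, loop-based sketch of that idea follows; the blog's actual implementation groups the sparse matrix slots in a vectorized way, but the effect is the same: only the k largest similarities per column survive.

```r
library(Matrix)

# Keep only the k highest values in each column of a sparse similarity matrix
keep_top_k_per_column <- function(sim, k) {
  sim <- as(sim, "CsparseMatrix")
  for (j in seq_len(ncol(sim))) {
    n_vals <- sim@p[j + 1] - sim@p[j]          # stored entries in column j
    if (n_vals > k) {
      idx  <- (sim@p[j] + 1):sim@p[j + 1]      # positions of column j within sim@x
      vals <- sim@x[idx]
      threshold <- sort(vals, decreasing = TRUE)[k]
      sim@x[idx[vals < threshold]] <- 0        # zero out everything below the k-th value
    }
  }
  drop0(sim)                                   # drop explicit zeros to stay sparse
}
```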

You can see some similarities to matrix decomposition/reduction utilized by SVD models.

Two functions did all the heavy lifting in this approach: cf_algorithm.R (collaborative filtering) and similarity_measures.R. The link below is to my GitHub repository, where these functions can be reviewed.


Live Engine

Resources

List of resources leveraged in the project: