Building a Production-Ready Book Recommendation Engine
Author: Jim Mundy
Date: July 2020
I have selected the Goodreads data set for my final project. The data set has approximately 1 million ratings of 10 thousand books. Per the Final Project requirements, I seek to produce a production-quality system that provides quality book recommendations. The steps I will follow to achieve these objectives are set forth below:
We use the readr package and the read_csv function to import the data. The data set consists of four CSV files:
library(readr)

books <- read_csv('books.csv')
ratings <- read_csv('ratings_cleaned.csv')
book_tags <- read_csv('book_tags.csv')
tags <- read_csv('tags.csv')

After loading the data, we take a quick glimpse to make sure everything loaded correctly. Overall the data looks good.
## Rows: 10,000
## Columns: 23
## $ id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13...
## $ book_id <dbl> 2767052, 3, 41865, 2657, 4671, 11870085, ...
## $ best_book_id <dbl> 2767052, 3, 41865, 2657, 4671, 11870085, ...
## $ work_id <dbl> 2792775, 4640799, 3212258, 3275794, 24549...
## $ books_count <dbl> 272, 491, 226, 487, 1356, 226, 969, 360, ...
## $ isbn <chr> "439023483", "439554934", "316015849", "6...
## $ isbn13 <dbl> 9.780439e+12, 9.780440e+12, 9.780316e+12,...
## $ authors <chr> "Suzanne Collins", "J.K. Rowling, Mary Gr...
## $ original_publication_year <dbl> 2008, 1997, 2005, 1960, 1925, 2012, 1937,...
## $ original_title <chr> "The Hunger Games", "Harry Potter and the...
## $ title <chr> "The Hunger Games (The Hunger Games, #1)"...
## $ language_code <chr> "eng", "eng", "en-US", "eng", "eng", "eng...
## $ average_rating <dbl> 4.34, 4.44, 3.57, 4.25, 3.89, 4.26, 4.25,...
## $ ratings_count <dbl> 4780653, 4602479, 3866839, 3198671, 26836...
## $ work_ratings_count <dbl> 4942365, 4800065, 3916824, 3340896, 27737...
## $ work_text_reviews_count <dbl> 155254, 75867, 95009, 72586, 51992, 14073...
## $ ratings_1 <dbl> 66715, 75504, 456191, 60427, 86236, 47994...
## $ ratings_2 <dbl> 127936, 101676, 436802, 117415, 197621, 9...
## $ ratings_3 <dbl> 560092, 455024, 793319, 446835, 606158, 3...
## $ ratings_4 <dbl> 1481305, 1156318, 875073, 1001952, 936012...
## $ ratings_5 <dbl> 2706317, 3011543, 1355439, 1714267, 94771...
## $ image_url <chr> "https://images.gr-assets.com/books/14473...
## $ small_image_url <chr> "https://images.gr-assets.com/books/14473...
## Rows: 960,595
## Columns: 4
## $ book_id <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ user_id <dbl> 314, 439, 588, 1169, 1185, 2077, 2487, 2900, 3662, 3922, 53...
## $ rating <dbl> 5, 3, 5, 4, 4, 4, 4, 5, 4, 5, 5, 3, 5, 5, 3, 1, 4, 5, 4, 4,...
## $ N <dbl> 181, 173, 186, 187, 190, 180, 193, 190, 185, 188, 181, 188,...
To develop a better understanding of the data, we use the skim function on the books and ratings data. Next we use ggplot to perform some additional EDA and to evaluate what, if any, preprocessing is required. The skim summaries, charts, and resulting observations follow.
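A minimal sketch of how those summaries are produced, assuming the skimr package is installed and loaded:

library(skimr)

# summary statistics for the books and ratings data (output shown below)
skim(books)
skim(ratings)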
| Name | books |
|---|---|
| Number of rows | 10000 |
| Number of columns | 23 |
| Column type frequency: | |
| character | 7 |
| numeric | 16 |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| isbn | 700 | 0.93 | 7 | 10 | 0 | 9300 | 0 |
| authors | 0 | 1.00 | 3 | 742 | 0 | 4664 | 0 |
| original_title | 590 | 0.94 | 1 | 196 | 0 | 9258 | 0 |
| title | 0 | 1.00 | 2 | 186 | 0 | 9964 | 0 |
| language_code | 1084 | 0.89 | 2 | 5 | 0 | 25 | 0 |
| image_url | 0 | 1.00 | 52 | 88 | 0 | 6669 | 0 |
| small_image_url | 0 | 1.00 | 52 | 86 | 0 | 6669 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1.00 | 5.000500e+03 | 2.886900e+03 | 1.00 | 2.500750e+03 | 5.000500e+03 | 7.500250e+03 | 1.000000e+04 | ▇▇▇▇▇ |
| book_id | 0 | 1.00 | 5.264697e+06 | 7.575462e+06 | 1.00 | 4.627575e+04 | 3.949655e+05 | 9.382225e+06 | 3.328864e+07 | ▇▂▁▁▁ |
| best_book_id | 0 | 1.00 | 5.471214e+06 | 7.827330e+06 | 1.00 | 4.791175e+04 | 4.251235e+05 | 9.636113e+06 | 3.553423e+07 | ▇▂▁▁▁ |
| work_id | 0 | 1.00 | 8.646183e+06 | 1.175106e+07 | 87.00 | 1.008841e+06 | 2.719525e+06 | 1.451775e+07 | 5.639960e+07 | ▇▂▁▁▁ |
| books_count | 0 | 1.00 | 7.571000e+01 | 1.704700e+02 | 1.00 | 2.300000e+01 | 4.000000e+01 | 6.700000e+01 | 3.455000e+03 | ▇▁▁▁▁ |
| isbn13 | 585 | 0.94 | 9.755044e+12 | 4.428619e+11 | 195170342.00 | 9.780316e+12 | 9.780452e+12 | 9.780831e+12 | 9.790008e+12 | ▁▁▁▁▇ |
| original_publication_year | 21 | 1.00 | 1.981990e+03 | 1.525800e+02 | -1750.00 | 1.990000e+03 | 2.004000e+03 | 2.011000e+03 | 2.017000e+03 | ▁▁▁▁▇ |
| average_rating | 0 | 1.00 | 4.000000e+00 | 2.500000e-01 | 2.47 | 3.850000e+00 | 4.020000e+00 | 4.180000e+00 | 4.820000e+00 | ▁▁▃▇▁ |
| ratings_count | 0 | 1.00 | 5.400124e+04 | 1.573700e+05 | 2716.00 | 1.356875e+04 | 2.115550e+04 | 4.105350e+04 | 4.780653e+06 | ▇▁▁▁▁ |
| work_ratings_count | 0 | 1.00 | 5.968732e+04 | 1.678038e+05 | 5510.00 | 1.543875e+04 | 2.383250e+04 | 4.591500e+04 | 4.942365e+06 | ▇▁▁▁▁ |
| work_text_reviews_count | 0 | 1.00 | 2.919960e+03 | 6.124380e+03 | 3.00 | 6.940000e+02 | 1.402000e+03 | 2.744250e+03 | 1.552540e+05 | ▇▁▁▁▁ |
| ratings_1 | 0 | 1.00 | 1.345040e+03 | 6.635630e+03 | 11.00 | 1.960000e+02 | 3.910000e+02 | 8.850000e+02 | 4.561910e+05 | ▇▁▁▁▁ |
| ratings_2 | 0 | 1.00 | 3.110880e+03 | 9.717120e+03 | 30.00 | 6.560000e+02 | 1.163000e+03 | 2.353250e+03 | 4.368020e+05 | ▇▁▁▁▁ |
| ratings_3 | 0 | 1.00 | 1.147589e+04 | 2.854645e+04 | 323.00 | 3.112000e+03 | 4.894000e+03 | 9.287000e+03 | 7.933190e+05 | ▇▁▁▁▁ |
| ratings_4 | 0 | 1.00 | 1.996570e+04 | 5.144736e+04 | 750.00 | 5.405750e+03 | 8.269500e+03 | 1.602350e+04 | 1.481305e+06 | ▇▁▁▁▁ |
| ratings_5 | 0 | 1.00 | 2.378981e+04 | 7.976889e+04 | 754.00 | 5.334000e+03 | 8.836000e+03 | 1.730450e+04 | 3.011543e+06 | ▇▁▁▁▁ |
| Name | ratings |
|---|---|
| Number of rows | 960595 |
| Number of columns | 4 |
| Column type frequency: | |
| numeric | 4 |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| book_id | 0 | 1 | 4890.38 | 2862.32 | 1 | 2414 | 4847 | 7336 | 10000 | ▇▇▇▇▇ |
| user_id | 0 | 1 | 25593.48 | 15209.95 | 1 | 12372 | 25047 | 38511 | 53424 | ▇▇▇▇▆ |
| rating | 0 | 1 | 3.85 | 0.98 | 1 | 3 | 4 | 5 | 5 | ▁▂▆▇▆ |
| N | 0 | 1 | 56.71 | 48.58 | 3 | 19 | 42 | 80 | 200 | ▇▃▂▁▁ |
The following charts facilitate a deeper understanding of the data set.
There are relatively few 1 and 2 ratings. Four was the most popular rating.
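For reference, a minimal ggplot2 sketch of that rating-distribution chart; the styling is an assumption, not the report's actual figure code:

ratings %>%
  ggplot(aes(x = factor(rating))) +
  geom_bar(fill = "steelblue") +
  labs(title = "Distribution of Ratings", x = "Rating", y = "Number of ratings")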
The ratings-per-user chart has a long tail, with many users providing only a handful (1 to 5) of ratings and far fewer users providing up to 200 ratings.
The data set includes more than 25 different genres. The most popular genres include fantasy, romance and mystery; sports, manga and cookbooks are the least represented genres.
Finally, the table below sets forth the Top 10 books by average rating. Three different Harry Potter books made the Top 10.
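That table can be produced along the following lines; this is a dplyr sketch of the idea, not the exact code behind the report's table:

books %>%
  arrange(desc(average_rating)) %>%
  select(title, authors, average_rating, ratings_count) %>%
  head(10)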
The EDA leads me to implement one preprocessing task: removing duplicate user/book rating pairs, if any.
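A minimal sketch of that step with dplyr, keeping only the first rating for each user/book pair (the tie-breaking rule is an assumption):

ratings <- ratings %>%
  distinct(user_id, book_id, .keep_all = TRUE)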
We will utilize the recommenderlab package to build and evaluate our recommendation engine. We will build a user-based collaborative filtering (UBCF) model as our baseline model. We will also consider alternative algorithms in an effort to find the algorithm that produces the best possible recommendations. The key steps in our analysis include:
First we use dplyr and tidyr to restructure the ratings data frame. Then we convert the data frame to a matrix.
dimension_names <- list(user_id = sort(unique(ratings$user_id)), book_id = sort(unique(ratings$book_id)))
ratingmat <- spread(select(ratings, book_id, user_id, rating), book_id, rating) %>%
  select(-user_id)
ratingmat <- as.matrix(ratingmat)
dimnames(ratingmat) <- dimension_names
# ratingmat[1:5, 1:5]

Next we replace NAs with zeros and convert the result to a sparse matrix, then use recommenderlab's realRatingMatrix class (data = sparse_ratings) to create the rating matrix used for modeling. The sparse format reduces the memory footprint of the model. We have just under 1 million (960,595) ratings in our ratings matrix.
library(recommenderlab)

ratingmat0 <- ratingmat
ratingmat0[is.na(ratingmat0)] <- 0
sparse_ratings <- as(ratingmat0, "sparseMatrix")
rm(ratingmat0)
gc()

# wrap the sparse matrix in recommenderlab's realRatingMatrix class
real_ratings <- new("realRatingMatrix", data = sparse_ratings)
real_ratings

## 45016 x 10000 rating matrix of class 'realRatingMatrix' with 960595 ratings.
Now we build the model using the Recommender function, the UBCF algorithm and the Pearson method to calculate similarity.
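The call looks roughly like the sketch below; the nn value shown is only an assumption (tuning follows shortly), and real_ratings is the realRatingMatrix built above.

model <- Recommender(real_ratings, method = "UBCF",
                     parameter = list(method = "pearson", nn = 40))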
Now we designate a user (user 4763) and use the predict function to make a prediction.
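A sketch of the prediction step, assuming the model object above; how the predictions are formatted into the results table is an assumption.

# locate user 4763 in the rating matrix
current_user <- which(rownames(real_ratings) == "4763")

# predicted ratings for the books this user has not yet rated
pred <- predict(model, real_ratings[current_user, ], type = "ratings")

# the ten highest predicted ratings (book ids are the names)
top10 <- sort(as(pred, "matrix")[1, ], decreasing = TRUE)[1:10]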
Here are the prediction results for user 4763
We'll utilize 10-fold cross-validation and different values of nn to tune our recommendation model. We'll set k to 10 and given to -1. Setting given to -1 means all but one rating per user is used to make the predictions, and performance is then evaluated on that single held-out rating for each user.
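The evaluation scheme referenced in the tuning code below is set up along these lines; the 500-user subset mirrors the comparison run later in the report, and treating that subset size as the one used here is an assumption.

scheme <- evaluationScheme(real_ratings[1:500, ], method = "cross-validation",
                           k = 10, given = -1, goodRating = 5)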
Now we tune our model with nn set to 20, 30, 40, and 50. Tuning results are plotted below. We see that nn = 40 produces the best results, but it is not significantly better than nn = 20, 30, or 50.
algorithms <- list("random" = list(name = "RANDOM", param = NULL),
"UBCF_20" = list(name = "UBCF", param = list(nn = 20)),
"UBCF_30" = list(name = "UBCF", param = list(nn = 30)),
"UBCF_40" = list(name = "UBCF", param = list(nn = 40)),
"UBCF_50" = list(name = "UBCF", param = list(nn = 50))
)
# evaluate the algorithms with the given scheme
results <- evaluate(scheme, algorithms, type = "ratings")

## RANDOM run fold/sample [model time/prediction time]
## 1 [0.01sec/0.43sec]
## 2 [0sec/0.33sec]
## 3 [0sec/0.34sec]
## 4 [0sec/0.34sec]
## 5 [0sec/0.31sec]
## 6 [0sec/0.32sec]
## 7 [0sec/0.32sec]
## 8 [0sec/0.34sec]
## 9 [0sec/0.34sec]
## 10 [0.02sec/0.31sec]
## UBCF run fold/sample [model time/prediction time]
## 1 [0.02sec/2.17sec]
## 2 [0sec/2.18sec]
## 3 [0sec/2.29sec]
## 4 [0.01sec/2.38sec]
## 5 [0sec/2.8sec]
## 6 [0sec/3.14sec]
## 7 [0sec/2.29sec]
## 8 [0sec/2.37sec]
## 9 [0.02sec/2.5sec]
## 10 [0sec/2.79sec]
## UBCF run fold/sample [model time/prediction time]
## 1 [0sec/2sec]
## 2 [0sec/1.99sec]
## 3 [0sec/2.02sec]
## 4 [0.02sec/2.04sec]
## 5 [0.02sec/1.98sec]
## 6 [0sec/2.03sec]
## 7 [0sec/2.01sec]
## 8 [0sec/2sec]
## 9 [0sec/2.06sec]
## 10 [0sec/2.27sec]
## UBCF run fold/sample [model time/prediction time]
## 1 [0.02sec/2.04sec]
## 2 [0sec/2.03sec]
## 3 [0.01sec/2.03sec]
## 4 [0sec/2.12sec]
## 5 [0.01sec/2.19sec]
## 6 [0.02sec/2.28sec]
## 7 [0.02sec/2.21sec]
## 8 [0.01sec/2.91sec]
## 9 [0sec/2.63sec]
## 10 [0.02sec/2.65sec]
## UBCF run fold/sample [model time/prediction time]
## 1 [0.01sec/2.13sec]
## 2 [0sec/2.23sec]
## 3 [0sec/2.07sec]
## 4 [0sec/2.03sec]
## 5 [0.02sec/2.04sec]
## 6 [0.02sec/2.18sec]
## 7 [0sec/2.13sec]
## 8 [0sec/2.09sec]
## 9 [0sec/2.13sec]
## 10 [0sec/2.12sec]
# restructure results output
tmp <- lapply(results, function(x) slot(x, "results"))
res <- tmp %>%
lapply(function(x) unlist(lapply(x, function(x) unlist(x@cm[ ,"RMSE"])))) %>%
as.data.frame() %>%
gather(key = "Algorithm", value = "RMSE")

The UBCF model had similar RMSE values across the different values of nn. The best performance was achieved when nn was set to 40.
res %>%
ggplot(aes(Algorithm, RMSE, fill = Algorithm)) +
geom_bar(stat = "summary") + geom_errorbar(stat = "summary", width = 0.3, size = 0.8) +
coord_cartesian(ylim = c(0.6, 1.3)) + guides(fill = FALSE) +
ggtitle("Tuning Results") +
theme_fivethirtyeight()

## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`
Given the SVD model's ability to deploy at scale and our desire to deploy the model in Shiny, we compare our baseline UBCF model to an SVD model below. All things being equal, SVD would likely be the better algorithm for a production system because of its ability to make predictions quickly.
scheme <- evaluationScheme(real_ratings[1:500,], method = "cross-validation", k = 10, given = -1, goodRating = 5)
algorithms <- list(
"UBCF" = list(name = "UBCF"),
"SVD" = list(name = "SVD")
)
results <- evaluate(scheme, algorithms, type = "ratings", progress = FALSE)
# restructure results output
tmp <- lapply(results, function(x) slot(x, "results"))
res <- tmp %>%
lapply(function(x) unlist(lapply(x, function(x) unlist(x@cm[ ,"RMSE"])))) %>%
as.data.frame() %>%
gather(key = "Algorithm", value = "RMSE")

UBCF outperforms SVD; since all things are not equal after all, we will deploy our engine using our baseline UBCF model.
res %>%
mutate(Algorithm=factor(Algorithm, levels = c("UBCF", "SVD"))) %>%
ggplot(aes(Algorithm, RMSE, fill = Algorithm)) + geom_bar(stat = "summary") +
geom_errorbar(stat = "summary", width = 0.3, size = 0.8) + coord_cartesian(ylim = c(0.6, 1.3)) +
guides(fill = FALSE) +
ggtitle("UBCF vs SVD") +
theme_fivethirtyeight()

We have picked our baseline UBCF model to deploy. The problem with that choice is that recommenderlab is more of an educational/analytical tool and is not ideally suited to production systems, because it is rather slow at producing predictions. Normally, one might turn to a Spark (and sparklyr) deployment strategy. The Spark solution is well suited to production systems because it holds everything in memory and produces predictions quickly and efficiently.
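For context only, a Spark deployment might look roughly like the sketch below. It uses sparklyr's ALS matrix-factorization routine, which is Spark MLlib's collaborative-filtering algorithm rather than UBCF; the connection settings and hyperparameters are assumptions, not something built for this project.

library(sparklyr)

sc <- spark_connect(master = "local")

# move the ratings into Spark and fit an ALS collaborative-filtering model
ratings_tbl <- copy_to(sc, ratings, "ratings", overwrite = TRUE)
als_model <- ml_als(ratings_tbl, rating_col = "rating",
                    user_col = "user_id", item_col = "book_id", rank = 10)

# score the same table to obtain predicted ratings
predictions <- ml_predict(als_model, ratings_tbl)

spark_disconnect(sc)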
While I was researching final project alternatives, however, I came across a blog post that demonstrates an efficient R implementation of User Based Collaborative Filtering.
The blog post sets forth the following steps to implement an improved UBCF model:
Two main optimization steps are summarized below:
The calculation of similarities is optimized for large sparse matrices.
The k nearest neighbors on the similarity matrix are not calculated in a loop, but with an optimized implementation: all values of the similarity matrix are grouped by column, the k-th highest value in each group (column) is found, and only the k highest values per column are kept and used for predictions. A rough sketch of this idea appears after the next paragraph.
You can see some similarities to matrix decomposition/reduction utilized by SVD models.
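Here is a rough illustration of that second step, assuming a sparse user-user similarity matrix sim; this is a sketch of the idea, not the blog post's actual code.

library(Matrix)

keep_top_k <- function(sim, k = 40) {
  # triplet representation of the sparse matrix: row i, column j, value x
  entries <- summary(sim)
  # within each column, keep only the k largest similarity values
  kept <- do.call(rbind, lapply(split(entries, entries$j), function(col) {
    col[order(col$x, decreasing = TRUE)[seq_len(min(k, nrow(col)))], ]
  }))
  sparseMatrix(i = kept$i, j = kept$j, x = kept$x, dims = dim(sim))
}

Predictions can then be computed as (normalized) products of this truncated similarity matrix with the rating matrix, rather than user by user in a loop.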
Two functions did all the heavy lifting in this approach: cf_algorithm.R (collaborative filtering) and similarity_measures.R. The link below is to my GitHub repository, where these functions can be reviewed.
List of resources leveraged in the project: