Goodreads Recommender

Introduction

The “goodbooks-10k” dataset consists of 6 million ratings from 50 thousand users across 10 thousand books. It was sourced from Goodreads. The ratings data is supplemented with tags and book metadata, lending itself to a hybrid collaborative filtering and content-based recommender. We further supplemented this data with text blurbs describing each book, pulled using the Wikipedia API.

We will build 3 different recommenders, one for each data source:

A standard ALS recommender using the ratings
An LDA-based content recommender using the text content
A seq2seq-style content recommender using the tags

The final recommendations will be a mixed hybrid, a union of these 3 sets.

Load Data

These files are relatively small, so we’ll load them from GitHub, then copy them to Spark.

library(tidyverse)
library(sparklyr)
library(kableExtra)
library(keras)
set.seed(7) #many of our models are stochastic

conf <- spark_config()
conf$`sparklyr.cores.local` <- 16
conf$`sparklyr.shell.driver-memory` <- "24G"
conf$spark.memory.fraction <- 0.9
spark_conn <- spark_connect('local', config = conf)




fp <- 'https://raw.githubusercontent.com/TheFedExpress/DATA612/master/Final%20Project/tidy_words.csv'
words_local <- read_csv(fp)
book_to_id <- read_csv('https://raw.githubusercontent.com/TheFedExpress/DATA612/master/Final%20Project/book_to_id.csv')
tags <- read_csv('https://raw.githubusercontent.com/zygmuntz/goodbooks-10k/master/book_tags.csv')
tag_descs <- read_csv('https://raw.githubusercontent.com/zygmuntz/goodbooks-10k/master/tags.csv')
books <- read_csv('https://raw.githubusercontent.com/zygmuntz/goodbooks-10k/master/books.csv')
ratings <- read_csv('https://raw.githubusercontent.com/zygmuntz/goodbooks-10k/master/ratings.csv')
ratings_sc <- copy_to(spark_conn, ratings, overwrite = TRUE)
descs <- copy_to(spark_conn, words_local, overwrite = TRUE)

Basic Data Exploration

ratings %>%
  group_by(book_id) %>%
  summarize(read_count = n()) %>%
  ggplot() + geom_bar(aes(x = book_id, y = read_count), stat = 'identity') +
  labs(title = 'Long Tail of Preferences', y = 'Rating Count', x = 'Book') +
  theme_minimal() + #apply the complete theme first so it does not reset the tweak below
  theme(
    axis.text.x = element_blank()
  )

This is part of the justification for our hybrid model. The two content-based pieces should be able to recommend niche titles in the long tail.

ratings %>%
  group_by(user_id) %>%
  summarize(read_count = n()) %>%
  ggplot() + geom_bar(aes(x = reorder(user_id, read_count), y = read_count), stat = 'identity') +
  labs(title = 'User Frequency', y = 'Book Count', x = 'User') +
  theme_minimal() + #apply the complete theme first so it does not reset the tweak below
  theme(
    axis.text.x = element_blank()
  )

In this dataset, we won’t really suffer from the cold start problem, given the healthy number of ratings each user has. However, if deployed, our content-based components would be especially useful for new users.

LDA Model

Book titles were used to query the Wikipedia API for relevant pages. Not all book titles could be found. Unpopular books and those not written in English were naturally filtered out, but over 85% of all titles were located. The data was collected in Python and preprocessed using the gensim library. This made it easier to stem words and remove words that occurred in over 50% of the documents. These clean documents were exported to a CSV for processing with the Spark ml_lda function.

LDA

The parameters required a bit of tuning. With default parameters, nearly all the weight was concentrated in two topics. After looking at gensim’s defaults and a bit of trial and error, the topic distribution was much improved (see “Topic Distribution by Document” below).

features <- descs %>%
  ft_tokenizer("text", "tokens") %>%
  ft_count_vectorizer("tokens", "features")


vec_model <- ml_pipeline( ft_tokenizer(spark_conn, "text", "tokens"), ft_count_vectorizer(spark_conn, "tokens", "features")) %>%
  ml_fit(descs)

vocab_key <- ml_vocabulary(ml_stage(vec_model, 'count_vectorizer')) %>% data.frame() %>%
  rownames_to_column('termIndices') %>%
  rename('word' = '.') %>%
  mutate(termIndices = as.integer(termIndices),
    termIndices = termIndices - 1
  )


lda_mod <-  ml_lda(features, k = 50, optimizer = 'online', learning_offset = 1, learning_decay = .5,
                   doc_concentration = .0005, optimize_doc_concentration = TRUE)

Word Distribution by Topic

Printing the topics takes a bit of work since the “ml_describe_topics” function returns token indexes, not actual words. Examining two of our most popular topics, we see that they are coherent, but won’t always translate to user tastes.

topic_descriptions <- ml_describe_topics(lda_mod) %>%
  collect() %>%
  unnest(termIndices, termWeights) %>%
  mutate(topic = topic + 1)

topic_descriptions$termIndices <- unlist(topic_descriptions$termIndices)
topic_descriptions <- topic_descriptions %>% left_join(vocab_key, 'termIndices')

filter(topic_descriptions, topic %in% c(6, 28)) %>%
  head(20) %>%
  kable() %>% kable_styling(bootstrap_options = 'striped')

topic	termIndices	termWeights	word
6	1	0.0268077465608327	seri
6	29	0.01686876784859	televis
6	27	0.0157436088877961	base
6	133	0.0125640776727025	air
6	111	0.011674862632556	episod
6	82	0.0102914734765798	season
6	132	0.00922415035485263	premier
6	26	0.00880663440243078	unit
6	81	0.00877372544181316	drama
6	0	0.00858964017968547	film
28	1	0.0468798577886867	seri
28	29	0.0151578746619639	televis
28	82	0.0142504182448538	season
28	18	0.0104330331458864	charact
28	215	0.00936227265308494	dead
28	111	0.00846953237484499	episod
28	2219	0.00829962725418106	dinosaur
28	13	0.00821679421880839	includ
28	273	0.00762519986117736	detect
28	73	0.00732577421758458	creat
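
The index-to-word lookup used above can be illustrated on a toy vocabulary. This is a minimal base-R sketch rather than the pipeline's own code: `toy_vocab` and `toy_desc` are made-up names and values, and the only idea carried over is that the term indices are zero-based while R vectors are one-based.

```r
# Hypothetical vocabulary, in the order a count vectorizer might emit it
toy_vocab <- c("film", "seri", "charact", "novel")

# Hypothetical topic description with zero-based term indices
toy_desc <- data.frame(
  topic       = c(1, 1, 2),
  termIndices = c(1, 0, 3),
  termWeights = c(0.03, 0.01, 0.02)
)

# Shift by one because R indexes from 1
toy_desc$word <- toy_vocab[toy_desc$termIndices + 1]
print(toy_desc)
```

The main code takes the opposite route (it subtracts 1 from the R row names in `vocab_key` and joins), which should be equivalent.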

Topic Distribution by Document

Each document will have a length-50 vector (50 is the number of topics we chose), which can be considered its hidden dimensions. These vectors will be used to build a document similarity matrix, so we want a distribution that’s neither too uniform nor too top-heavy.

do_topic_temp <- ml_transform(lda_mod, features) %>%
  select(topicDistribution) %>%
  collect()


lda_features <- do.call(rbind, do_topic_temp$topicDistribution) 
colnames(lda_features) <- paste('topic', 1:50)


lda_long <- lda_features %>%
  data.frame() %>%
  gather(topic, value) 


lda_long$book_index <- 1:nrow(do_topic_temp)

lda_long %>% 
  filter(value >= .15) %>%
  group_by(topic) %>%
  summarise(n_books = n()) %>%
  arrange(desc(n_books)) %>%
  head(15) %>%
  ggplot() + geom_bar(aes(x = topic, y = n_books), stat = 'identity') +
  labs(title = 'Top 15 Frequent Topics') +
  coord_flip ()+
  theme_minimal()

This is a little more top-heavy than we would like, but still adequate.

Book Similarities

We build our similarity matrix using Pearson correlation, as it is the simplest to implement. Next, we match books with titles and get an idea of the coherence of our model by examining the correlation between the first 20 titles.

lda_sim <- lda_features %>% as.matrix() %>% t() %>% cor 
lda_subset <- lda_sim[1:20, 1:20]
lda_sim <- lda_sim %>%
  data.frame() %>%
  rownames_to_column('row_id') %>%
  mutate(row_id = as.integer(row_id)) %>%
  inner_join(book_to_id, c('row_id' = 'X1')) %>%
  select(-words)

row_to_book <- book_to_id %>%
  inner_join(books, 'book_id')





library(corrplot)

first_20 <- row_to_book %>% arrange(book_id) %>% head(20) %>% mutate(title = str_sub(title,1,25))

colnames(lda_subset) <- first_20$title
rownames(lda_subset) <- first_20$title

corrplot(lda_subset, order = 'hclust')
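
To make the correlation step concrete, here is a small sketch on made-up topic distributions. The matrix `toy_dist` and its values are hypothetical; the point is only that `cor()` correlates columns, so the book-by-topic matrix has to be transposed to get book-to-book similarities.

```r
# Hypothetical topic distributions: 3 books (rows) x 4 topics (columns)
toy_dist <- rbind(
  book_a = c(0.70, 0.10, 0.10, 0.10),
  book_b = c(0.60, 0.20, 0.10, 0.10),
  book_c = c(0.05, 0.05, 0.10, 0.80)
)

# cor() compares columns, so transpose to compare books rather than topics
toy_sim <- cor(t(toy_dist))
print(round(toy_sim, 2))
```

Books a and b share a dominant topic and should come out strongly correlated, while book c, dominated by a different topic, should not.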

Producing Recommendations

For both the seq2seq and LDA content-based recommendations, we’ll use the following simple algorithm:

Choose a user
Find the top 10 rated books by that user
Find the similarity vector (the similarity to all books) for each of those 10 books
Take the average of the similarity vectors
Sort the averaged similarity vector and pick the top n books

One of the drawbacks of this method is that producing recommendations for all users at once is too computationally expensive to be feasible. As a result, typical recommender evaluation metrics are difficult to produce for this algorithm.
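
The five steps listed above can be sketched on a toy similarity matrix. Everything here is hypothetical (`sim`, `user_ratings_toy`, `recommend_toy` and the numbers are invented for illustration), and it uses base R rather than the dplyr pipeline used in the functions below.

```r
# Hypothetical 5 x 5 book-to-book similarity matrix
books_toy <- paste0("book_", 1:5)
sim <- matrix(c(
  1.0, 0.8, 0.1, 0.3, 0.2,
  0.8, 1.0, 0.2, 0.4, 0.1,
  0.1, 0.2, 1.0, 0.7, 0.6,
  0.3, 0.4, 0.7, 1.0, 0.5,
  0.2, 0.1, 0.6, 0.5, 1.0
), nrow = 5, byrow = TRUE, dimnames = list(books_toy, books_toy))

# Hypothetical ratings for one user
user_ratings_toy <- c(book_1 = 5, book_2 = 4, book_3 = 1)

recommend_toy <- function(sim, user_ratings, n_top = 2, n_recs = 2) {
  # Steps 1-2: the user's n_top highest-rated books
  n_top <- min(n_top, length(user_ratings))
  top_books <- names(sort(user_ratings, decreasing = TRUE))[seq_len(n_top)]
  # Steps 3-4: average the similarity vectors of those books
  avg_sim <- colMeans(sim[top_books, , drop = FALSE])
  # Remove books the user has already rated
  avg_sim <- avg_sim[!(names(avg_sim) %in% names(user_ratings))]
  # Step 5: sort and keep the n_recs most similar books
  head(sort(avg_sim, decreasing = TRUE), n_recs)
}

toy_recs <- recommend_toy(sim, user_ratings_toy)
print(toy_recs)
```

Because each user needs their own averaged vector over the full catalog, the cost grows with users times books, which is the scaling concern mentioned above.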

#correlation_matrix <- similarity_input %>%
#  ml_corr()
  
user_ratings <- ratings %>%
  group_by(user_id) %>%
  summarise(n_ratings = n())


top_ratings <- function(named_user, ratings_df, n){
  ratings_df %>%
    filter(user_id == named_user) %>%
    arrange(desc(rating)) %>%
    head(n)
}

get_rated_books <- function(named_user, ratings_df){
  ratings_df %>%
    filter(user_id == named_user)
}

calc_user_lda <- function(user_id, ratings_df, similarity_df, k){
  similarity_df %>%
    inner_join(top_ratings(user_id, ratings_df, 10), 'book_id') %>%
    select(-c(book_id, rating, row_id, user_id)) %>%
    summarise_all(mean) %>%
    gather(row_index, rating) %>%
    mutate(row_index = str_sub(row_index, 2) %>% as.numeric()) %>%
    inner_join(book_to_id, c( 'row_index' = 'X1')) %>%
    anti_join(get_rated_books(user_id, ratings_df), 'book_id') %>% #remove books the user has rated 
    select(book_id, rating) %>%
    drop_na() %>%
    inner_join(books, 'book_id') %>%
    select(book_id, title, rating) %>%
    arrange(desc(rating)) %>%
    head(k)
}

calc_user_lda(1, ratings, lda_sim, 20) %>%
  kable() %>% kable_styling(bootstrap_options = 'striped')

book_id	title	rating
1574	The Left Hand of Darkness	0.6303631
2574	The Black Ice (Harry Bosch, #2; Harry Bosch Universe, #2)	0.6303631
5018	Theodore Boone: Kid Lawyer (Theodore Boone, #1)	0.6303631
78	The Devil Wears Prada (The Devil Wears Prada, #1)	0.6266320
3076	The French Lieutenant’s Woman	0.6174999
4234	Twilight (The Mediator, #6)	0.6174999
8354	Twilight (Warriors: The New Prophecy, #5)	0.6174999
9655	Treasure (Dirk Pitt, #9)	0.6174999
62	The Golden Compass (His Dark Materials, #1)	0.6136638
2439	Way of the Peaceful Warrior: A Book That Changes Lives	0.6136638
4876	The Silent Girl (Rizzoli & Isles, #9)	0.6136638
9340	Tell Me Three Things	0.6136638
562	The Way of Kings (The Stormlight Archive, #1)	0.6135204
2734	Dorothy Must Die (Dorothy Must Die, #1)	0.6122484
794	Doctor Sleep (The Shining, #2)	0.6120954
5165	It Happened One Autumn (Wallflowers, #2)	0.6116760
5270	The Christmas Box (The Christmas Box, #1)	0.6110079
145	Deception Point	0.6089313
8723	Before We Met	0.6086956
6042	True History of the Kelly Gang	0.6076559

There isn’t an obvious pattern here, but I’m also not an avid reader.

seq2seq Model

The tags were preprocessed, then fed into a simple neural network using only dense layers. The architecture was inspired by this article: https://towardsdatascience.com/creating-a-hybrid-content-collaborative-movie-recommender-using-deep-learning-cc8b431618af

The idea is that the middle layer, the encoding layer, becomes a low-dimensional representation of the set of tags for a particular book. Related tags are compressed into the same dimension, similar to the way SVD creates latent dimensions. For instance, the model should learn that “fantasy” and “sci-fi fantasy” are related because they have a number of co-occurrences.
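
The SVD analogy can be made concrete with a tiny example. To be clear, this is plain SVD in base R, not the neural network built below, and `tag_matrix` is an invented book-by-tag table in which the two fantasy tags tend to co-occur.

```r
# Hypothetical book x tag counts; the first two tags tend to co-occur
tag_matrix <- rbind(
  c(5, 4, 0),
  c(4, 5, 0),
  c(0, 0, 5),
  c(0, 1, 4)
)
colnames(tag_matrix) <- c("fantasy", "sci-fi fantasy", "romance")

s <- svd(tag_matrix)

# Keep the two strongest latent dimensions as each book's encoding
encoded <- s$u[, 1:2] %*% diag(s$d[1:2])

# Map the encodings back to tag space to see how much was lost
reconstructed <- encoded %*% t(s$v[, 1:2])

cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))
cos_fantasy_pair <- cosine(encoded[1, ], encoded[2, ])  # two fantasy-heavy books
cos_mixed_pair   <- cosine(encoded[1, ], encoded[3, ])  # fantasy-heavy vs romance-only
print(c(cos_fantasy_pair, cos_mixed_pair))
```

Three tag columns are squeezed into two dimensions with little reconstruction loss because the two fantasy tags carry nearly the same information. The encoder aims for the same effect, with nonlinear layers in place of a linear projection.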

Transforming Tag Data

The tag data is supplied in a bag-of-words-like format. We want to normalize the “count” to control for popularity and cast it into a wide format.

The following transformations are performed:

Create metadata lookup table
Filter low-information tags
TF-IDF scaling
Normalize by book (this step would be removed if we wanted to account for popularity)
Log scaling to correct the highly skewed distribution
Cast into wide form

tags_test <- tags %>%
  filter(goodreads_book_id <= 100) %>%
  inner_join(tag_descs, 'tag_id') %>%
  inner_join(books, 'goodreads_book_id') %>%
  select(goodreads_book_id, tag_id, tag_name, title) %>%
  arrange(goodreads_book_id, tag_id)

tags_expanded <- tags %>%
  inner_join(tag_descs, 'tag_id') %>%
  inner_join(books, 'goodreads_book_id') %>%
  select(goodreads_book_id, tag_id, tag_name, title, count) %>%
  filter(str_detect(tag_name, '\\d{4,}') == FALSE & str_detect(tag_name, '\\w+') == TRUE
         & str_detect(tag_name, 'book') == FALSE)

mean_counts <- tags_expanded %>%
  group_by(goodreads_book_id) %>%
  summarise(mean_count = mean(count))

tag_counts <- tags_expanded %>%
  group_by(tag_id) %>%
  summarise(freq = n()) %>%
  mutate(idf_weight = log(10000/(freq + 1))) %>%
  arrange(desc(freq)) %>%
  filter(freq >= 500) %>%
  select(idf_weight, tag_id)

tags_fixed <- tags_expanded %>%
  group_by(tag_id, goodreads_book_id) %>%
  summarise(tag_count = sum(count)) %>%
  ungroup() %>%
  inner_join(tag_counts, 'tag_id') %>%
  inner_join(mean_counts, 'goodreads_book_id') %>%
  mutate(tag_count = (tag_count/mean_count) * idf_weight) %>%
  inner_join(tag_descs, 'tag_id') %>%
  inner_join(books, 'goodreads_book_id') %>%
  select(tag_count, goodreads_book_id, tag_id, tag_name, title)

max_tag <- tags_fixed %>% select(tag_id) %>% distinct() %>% nrow()
max_count <- max(tags_fixed$tag_count, na.rm = TRUE)

book_count <- tags_fixed %>% select(goodreads_book_id) %>% distinct %>% nrow()

ggplot(tags_fixed) + geom_density(aes(x = tag_count)) + labs(title = 'Tag Count Raw') + 
  theme_minimal()

ggplot(tags_fixed) + geom_density(aes(x = log(tag_count))) + labs(title = 'Tag Count Logged') +
  theme_minimal()

bag_of_words <- tags_fixed %>%
  select(tag_id, tag_count, goodreads_book_id) %>%
  mutate(tag_count = log(tag_count + 1)) %>%
  spread(tag_id, tag_count) %>%
  replace(., is.na(.), 0) %>%
  arrange(goodreads_book_id) %>%
  select(-goodreads_book_id) %>%
  as.matrix ()
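
The scaling steps can be traced by hand on a tiny made-up tag table. This is a base-R sketch under stated assumptions: `toy_tags` and its counts are hypothetical, and the IDF formula mirrors the `log(N / (freq + 1))` form used above with N set to the number of toy books.

```r
# Hypothetical long-format tag counts for 4 books
toy_tags <- data.frame(
  book_id = c(1, 1, 2, 2, 3, 4),
  tag     = c("fantasy", "magic", "fantasy", "romance", "fantasy", "romance"),
  count   = c(100, 50, 10, 30, 8, 6),
  stringsAsFactors = FALSE
)
n_books <- length(unique(toy_tags$book_id))

# IDF weight: tags attached to many books get a smaller weight
tag_freq <- table(toy_tags$tag)
toy_tags$idf <- log(n_books / (as.numeric(tag_freq[toy_tags$tag]) + 1))

# Normalize by each book's mean count so popular books do not dominate
mean_count <- ave(toy_tags$count, toy_tags$book_id)

# Apply the IDF weight, then log-scale to tame the skew
toy_tags$scaled <- log1p(toy_tags$count / mean_count * toy_tags$idf)

# Cast to wide form: one row per book, one column per tag, zeros where absent
wide <- xtabs(scaled ~ book_id + tag, data = toy_tags)
print(round(wide, 3))
```

In this toy data "fantasy" appears on 3 of 4 books, so its weight is log(4/4) = 0 and it drops out entirely, which is the intended effect of IDF on near-universal tags.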

Construct NN

This is one of the possible areas for improvement, as my experience with deep learning is somewhat limited. We use two mirrored sequences, with the encoding layer sandwiched between them. The idea is for the network to learn a 25-dimensional vector that describes the state of the 400+ dimensional tag vector and can reproduce it.

encoder <- keras_model_sequential(name = 'encoder') 
encoder %>%
  layer_dense(units = 256, activation = 'relu', input_shape = c(max_tag))%>%
  layer_dropout(rate = .3) %>%
  layer_dense(units = 128, activation = 'relu', input_shape = c(256))%>%
  layer_dropout(rate = .3) %>%
  layer_dense(units = 64, activation = 'relu', input_shape = c(128))%>%
  layer_dropout(rate = .3) %>%
  layer_dense(units = 25,  activation = 'relu', name = 'tag_encodings', input_shape = c(64))

decoder <- keras_model_sequential()

decoder %>%
  layer_dense(units = 64, activation = 'relu', input_shape = c(25), name = 'embeddings_layer') %>%
  layer_dropout(rate = .3) %>%
  layer_dense(units = 128, activation = 'relu', input_shape = c(64))%>%
  layer_dropout(rate = .3) %>%
  layer_dense(units = 256, activation = 'relu', input_shape = c(128))%>%
  layer_dropout(rate = .3) %>%
  layer_dense(units = max_tag, input_shape = c(256)) %>%
  layer_activation('sigmoid', input_shape = c(max_tag))


model <- keras_model_sequential()
model %>%
  encoder %>%
  decoder %>%
  keras::compile(loss = 'mse', optimizer = 'adam', metrics = c('mse'))

model %>% fit( 
  bag_of_words[1:8000, ], 
  bag_of_words[1:8000, ], 
  epochs = 5, 
  batch_size = 10,
  shuffle = FALSE,
  verbose = FALSE,
  validation_data = list(bag_of_words[8001:10000, ], bag_of_words[8001:10000, ])
)

model_outputs <- get_layer(model, 'encoder') %>% get_layer('tag_encodings')
intermediate_layer_model <- keras_model(inputs = encoder$input,
                                        outputs = model_outputs$output)

intermediate_output <- predict(intermediate_layer_model, bag_of_words)

#encoding dimensions that are zero for every book provide no information and make similarity less accurate
fixed_embeddings <- intermediate_output[, colSums(intermediate_output != 0) > 0]

Sanity Check

Similar to the LDA model, we create a similarity matrix for all books using the 25 dimension encodings. The first 20 books are examined to determine the coherence of the model.

matrix_features <- fixed_embeddings %>% t() %>% cor()
matrix_features_small <- matrix_features[1:20, 1:20]


library(corrplot)

ordered_books <- books %>% arrange(goodreads_book_id)
first_20 <- ordered_books %>% arrange(goodreads_book_id) %>% head(20) %>% mutate(title = str_sub(title,1,20))

colnames(matrix_features_small) <- first_20$title
rownames(matrix_features_small) <- first_20$title

corrplot(matrix_features_small)

The pattern here is obvious, though it does help that there are so many Harry Potter books. The middle cluster consists of books related to travel and adventure. Harry Potter being related to Lord of the Rings is also a good sign.

Recommendations

Again similar to LDA, we produce recommendations using the similarity matrix and same algorithm.

row_to_good_reads <- ordered_books %>%
  rownames_to_column('row_id') %>%
  mutate(row_id = as.integer(row_id)) %>%
  select(row_id, goodreads_book_id)

good_reads_lookup <- select(books, goodreads_book_id, book_id)

book_simarilarity <- matrix_features %>%
  data.frame() %>%
  rownames_to_column('row_id') %>%
  mutate(row_id = as.integer(row_id)) %>%
  inner_join(row_to_good_reads, 'row_id') %>%
  inner_join(books, 'goodreads_book_id') %>%
  select(-goodreads_book_id)


calc_user_tags <- function(user_id, ratings_df, similarity_df, k, n = 10){
  
  similarity_df %>%
    inner_join(top_ratings(user_id, ratings_df, n), 'book_id') %>%
    select(starts_with('X')) %>%
    summarise_all(mean) %>%
    gather(row_index, rating) %>%
    mutate(row_index = str_sub(row_index, 2) %>% as.numeric()) %>%
    inner_join(row_to_good_reads, c( 'row_index' = 'row_id')) %>%
    inner_join(good_reads_lookup, 'goodreads_book_id') %>%
    anti_join(get_rated_books(user_id, ratings_df), 'book_id') %>% #remove books the user has rated 
    select(book_id, rating) %>%
    drop_na() %>%
    inner_join(books, 'book_id') %>%
    select(book_id, title, rating) %>%
    arrange(desc(rating)) %>%
    head(k)
}

calc_user_tags(1, ratings, book_simarilarity, 20)%>%
  kable() %>% kable_styling(bootstrap_options = 'striped')

book_id	title	rating
6413	Home (Gilead, #2)	0.6739000
4338	The Price of Salt	0.6727841
9612	The Charterhouse of Parma	0.6708106
930	Olive Kitteridge	0.6706485
7485	سینوهه	0.6704477
1435	Sophie’s Choice	0.6685887
9632	Falling Man	0.6678417
9761	How Green Was My Valley	0.6677593
8975	Girl With Curious Hair	0.6676609
5596	Quo Vadis	0.6672832
658	The Corrections	0.6667213
6727	Burmese Days	0.6665403
7783	The Leopard	0.6659457
4100	Tinkers	0.6656806
5456	How the García Girls Lost Their Accents	0.6654594
7974	Silence	0.6654357
4788	The Fortress of Solitude	0.6650515
5911	The Book of Illusions	0.6649912
3511	Eva Luna	0.6647221
669	The House of the Spirits	0.6646765

ALS Model

Using a standard Spark ALS implementation, constructing a single model was easier than expected, but tuning on a grid wasn’t practical when running Spark locally.

Optimize Parameters

Spark has built-in functions for grid search, allowing us to easily optimize RMSE. The ranks in the grid are higher than those we used with 100K-rating datasets, because the numbers of users and books are an order of magnitude larger than in previous projects.

estimator <- ml_pipeline(spark_conn) %>%
  ml_als(rating_col = 'rating', user_col = 'user_id', item_col = 'book_id', max_iter = 10, cold_start_strategy = 'drop')  

#als_grid <- list(als = list(rank = c(20, 30, 50), reg_param = c(.05, .1)))

als_grid <- list(als = list(rank = c(20,30,50)))
cv <- ml_cross_validator(
  spark_conn, 
  estimator = estimator,
  evaluator = ml_regression_evaluator(spark_conn, label_col = 'rating'), 
  estimator_param_maps = als_grid,
  num_folds = 2
)

als_cv <- ml_fit(cv, ratings_sc)
ml_validation_metrics(als_cv) %>% kable() %>% kable_styling(bootstrap_options = "striped", full_width = F)

rmse	rank_1
0.8321923	20
0.8319314	30
0.8334796	50

This RMSE is similar to ALS implementations on other ratings datasets, such as MovieLens. The optimal rank may be so low because of the relatively low book dimensionality.

Metrics at K

To assess the practical quality of the ALS portion of our recommender, we’ll look at precision and recall at two recommendation-list sizes (k = 10 and k = 20). This will give us an idea of how quickly the quality drops off. If recall increases while precision stays level, we would be more comfortable recommending at higher values of k.

metrics_at_k <- vector('list', length = 2)
for (k in 1:2){
  temp_dfs <- vector('list', length = 2)
  for (i in 1:2){ 
    
    set.seed(42 + i)
    partitioned_set <- ratings_sc %>%
      sdf_random_split(training = .8, testing = .2) 
    
    als_mod <- partitioned_set[[1]] %>%
      ml_als(rating_col = 'rating', user_col = 'user_id', item_col = 'book_id', max_iter = 10, rank = 20, reg_param = .1,
             implicit_prefs = TRUE)
    
    recs <- ml_recommend(als_mod, type = 'item', k*10) %>%
      full_join(partitioned_set[[2]], c('user_id', 'book_id'), suffix = c('_pred', '_act')) %>%
      mutate(truth_cat = ifelse(is.na(rating_pred) == 1 & is.na(rating_act) == 0, 'FN', '')) %>%
      mutate(truth_cat = ifelse(is.na(rating_pred) == 0 & is.na(rating_act) == 1, 'FP', truth_cat)) %>%
      mutate(truth_cat = ifelse(is.na(rating_pred) == 0 & is.na(rating_act) == 0, 'TP', truth_cat)) %>%
      group_by(truth_cat) %>%
      summarise(tot_obs = n()) %>%
      ungroup() %>%
      collect()
    
    recs_cm <- recs %>%
      spread(truth_cat, tot_obs) %>%
      mutate(
        precision = TP/(TP + FP),
        recall = TP/(TP + FN),
        F1 =  2*((precision*recall)/(precision + recall))
      )
    temp_dfs[[i]] <- recs_cm
  }
  summary_df <- bind_rows(temp_dfs) %>%
    summarise_all(mean) %>%
    add_column('k' = k*10)
  metrics_at_k[[k]] <- summary_df
}

metrics_at_k %>%
  bind_rows() %>% kable() %>% kable_styling(bootstrap_options = "striped", full_width = F)

FN	FP	TP	precision	recall	F1	k
1132308	471047.5	63192.5	0.1182849	0.0528586	0.0730659	10
1078839	951819.0	116661.0	0.1091841	0.0975834	0.1030583	20
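
For reference, the three metrics reduce to simple set arithmetic for a single user. The vectors `recommended` and `held_out` below are made up; the Spark code above computes the same counts across all users at once with a full join.

```r
# Hypothetical top-k recommendations and held-out test items for one user
recommended <- c("a", "b", "c", "d")
held_out    <- c("b", "d", "e")

tp <- length(intersect(recommended, held_out))  # recommended and in the test set
fp <- length(setdiff(recommended, held_out))    # recommended but not in the test set
fn <- length(setdiff(held_out, recommended))    # in the test set but not recommended

precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)
print(c(precision = precision, recall = recall, F1 = f1))
```

Note that a "false positive" here only means the book was not in the held-out ratings; the user may simply not have read it yet, so precision measured this way is a conservative estimate.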

In the first iteration, we didn’t use implicit_prefs, and the precision and recall were considerably lower. This is somewhat surprising given that we have explicit ratings. However, our goal is to predict the books users will read, not to optimize RMSE, so the confusion matrix statistics are more important. We will keep this parameter set to TRUE in our final recommender.

ALS Predictions

Again with user #1, we produce our recommendations for the ALS model.

final_als <- ml_als(ratings_sc, rating_col = 'rating', user_col = 'user_id', item_col = 'book_id', max_iter = 10, rank = 20, reg_param = .1, implicit_prefs = TRUE)


calc_als <- function(model, named_user, k){
  #use the model passed in rather than the global final_als
  ml_recommend(model, type = 'item', 50) %>%
    select(book_id, user_id, rating) %>%
    filter(user_id == named_user) %>%
    collect() %>%
    inner_join(books, 'book_id') %>%
    select(book_id, title, rating) %>%
    anti_join(get_rated_books(named_user, ratings), 'book_id') %>% #remove books the user has rated
    arrange(desc(rating)) %>%
    head(k)
}

calc_als(final_als, 1, 10) %>% kable() %>% kable_styling(bootstrap_options = "striped", full_width = F)

book_id	title	rating
5	The Great Gatsby	0.8578916
8	The Catcher in the Rye	0.8275669
118	The Joy Luck Club	0.7824225
26	The Da Vinci Code (Robert Langdon, #2)	0.7787534
15	The Diary of a Young Girl	0.7486388
195	The Guernsey Literary and Potato Peel Pie Society	0.7476287
291	Cutting for Stone	0.7297267
14	Animal Farm	0.7212889
1	The Hunger Games (The Hunger Games, #1)	0.7151593
63	Wuthering Heights	0.7139446

We now look at how much of the catalog was actually recommended.

#als_mod is the last model fit in the metrics loop above (trained on an 80% split)
all_recs <- ml_recommend(als_mod, type = 'item', 10) %>%
  group_by(book_id) %>%
  summarize(book_count = n()) %>%
  collect()

ggplot(all_recs) + geom_bar(aes(x = reorder(book_id, book_count), y = book_count), stat = 'identity') +
  labs(title = 'Recommendations by Book', y = 'Times Recommended', x = 'Book') +
  theme_minimal() + #apply the complete theme first so it does not reset the tweak below
  theme(
    axis.text.x = element_blank()
  )

Mixed Hybrid

We now bring everything together and create a function that will produce our final recommendations for a given user. It also prints the user’s top-rated books to give us an idea of the user’s tastes.

full_recs <- function (user_id, ratings_df, recs = 30){
  #each model contributes up to half of the requested total; the ranked union is trimmed below
  lda_recs <- calc_user_lda(user_id, ratings_df, lda_sim, ceiling(recs/2)) %>%
    add_column('source' = 'LDA')
  tag_recs <- calc_user_tags(user_id, ratings_df, book_simarilarity, ceiling(recs/2)) %>%
    add_column('source' = 'Tags')
  als_recs <- calc_als(final_als, user_id, ceiling(recs/2)) %>%
    add_column('source' = 'ALS')
  
  best_10 <- top_ratings(user_id, ratings_df, 10) %>%
    inner_join(books, 'book_id') %>%
    select(title, rating)
  full_set <- rbind(lda_recs, tag_recs, als_recs)
  #rank books suggested by several models first, then by score
  combined <- full_set %>%
    group_by(book_id) %>%
    summarise(book_count = n()) %>%
    ungroup() %>%
    inner_join(full_set, 'book_id') %>%
    arrange(desc(book_count), desc(rating)) %>%
    head(recs)
  
  print(best_10)
  return(combined)
}

full_recs(1, ratings) %>% kable() %>% kable_styling(bootstrap_options = "striped", full_width = F)

## # A tibble: 10 x 2
##    title                                                         rating
##    <chr>                                                          <dbl>
##  1 The Shadow of the Wind (The Cemetery of Forgotten Books,  #1)      5
##  2 Gilead (Gilead, #1)                                                5
##  3 The Kite Runner                                                    5
##  4 Peace Like a River                                                 5
##  5 Divine Secrets of the Ya-Ya Sisterhood                             5
##  6 The Alchemist                                                      5
##  7 To Kill a Mockingbird                                              5
##  8 Antigone (The Theban Plays, #3)                                    5
##  9 Ender's Game (Ender's Saga, #1)                                    5
## 10 The Death of Ivan Ilych                                            5

book_id	book_count	title	rating	source
5	1	The Great Gatsby	0.8578916	ALS
8	1	The Catcher in the Rye	0.8275669	ALS
118	1	The Joy Luck Club	0.7824225	ALS
26	1	The Da Vinci Code (Robert Langdon, #2)	0.7787534	ALS
15	1	The Diary of a Young Girl	0.7486388	ALS
195	1	The Guernsey Literary and Potato Peel Pie Society	0.7476287	ALS
291	1	Cutting for Stone	0.7297267	ALS
14	1	Animal Farm	0.7212889	ALS
1	1	The Hunger Games (The Hunger Games, #1)	0.7151593	ALS
63	1	Wuthering Heights	0.7139446	ALS
28	1	Lord of the Flies	0.7057348	ALS
9	1	Angels & Demons (Robert Langdon, #1)	0.7037891	ALS
172	1	Anna Karenina	0.7002230	ALS
2	1	Harry Potter and the Sorcerer’s Stone (Harry Potter, #1)	0.6899099	ALS
6413	1	Home (Gilead, #2)	0.6739000	Tags
4338	1	The Price of Salt	0.6727841	Tags
9612	1	The Charterhouse of Parma	0.6708106	Tags
930	1	Olive Kitteridge	0.6706485	Tags
7485	1	سینوهه	0.6704477	Tags
1435	1	Sophie’s Choice	0.6685887	Tags
9632	1	Falling Man	0.6678417	Tags
9761	1	How Green Was My Valley	0.6677593	Tags
8975	1	Girl With Curious Hair	0.6676609	Tags
5596	1	Quo Vadis	0.6672832	Tags
658	1	The Corrections	0.6667213	Tags
6727	1	Burmese Days	0.6665403	Tags
7783	1	The Leopard	0.6659457	Tags
4100	1	Tinkers	0.6656806	Tags
5456	1	How the García Girls Lost Their Accents	0.6654594	Tags
1574	1	The Left Hand of Darkness	0.6303631	LDA

Conclusion

The ALS model produced by far the best recommendations, but the other models are not without value. ALS is going to recommend more of the same in most cases. The two content-based models, even though not tuned, will provide the user with some nice under-the-radar titles.

Areas for improvement

With better sources of text data describing the books, the room for improvement with the LDA piece is immense.
Using more of the tags, and possibly passing a compressed TF-IDF matrix to Keras, would likely yield better tag encodings.
A more scalable algorithm is needed for making the content recommendations. The low-hanging fruit would be to take advantage of parallelization, but I would think a ground-up redesign might be necessary.