Introduction

The “goodbooks-10k” dataset consists of 6 million ratings from 50 thousand users on 10 thousand books. It was sourced from Goodreads. The ratings data is supplemented with tags and book metadata, lending itself to a hybrid collaborative filtering and content-based recommender. We further supplemented this data with text blurbs describing each book, pulled using the Wikipedia API.

We will build three different recommenders, one for each data source:

  1. A standard ALS recommender using the ratings
  2. An LDA-based content recommender using the text content
  3. A seq2seq-based content recommender using the tags

The final recommendations will be a mixed hybrid, a union of these 3 sets.

Load Data

These files are relatively small, so we’ll load them from GitHub, then copy them to Spark.

library(tidyverse)
library(sparklyr)
library(kableExtra)
library(keras)
set.seed(7) #many of our models are stochastic

conf <- spark_config()
conf$`sparklyr.cores.local` <- 16
conf$`sparklyr.shell.driver-memory` <- "24G"
conf$spark.memory.fraction <- 0.9
spark_conn <- spark_connect('local', config = conf)




fp <- 'https://raw.githubusercontent.com/TheFedExpress/DATA612/master/Final%20Project/tidy_words.csv'
words_local <- read_csv(fp)
book_to_id <- read_csv('https://raw.githubusercontent.com/TheFedExpress/DATA612/master/Final%20Project/book_to_id.csv')
tags <- read_csv('https://raw.githubusercontent.com/zygmuntz/goodbooks-10k/master/book_tags.csv')
tag_descs <- read_csv('https://raw.githubusercontent.com/zygmuntz/goodbooks-10k/master/tags.csv')
books <- read_csv('https://raw.githubusercontent.com/zygmuntz/goodbooks-10k/master/books.csv')
ratings <- read_csv('https://raw.githubusercontent.com/zygmuntz/goodbooks-10k/master/ratings.csv')
ratings_sc <- copy_to(spark_conn, ratings, overwrite = TRUE)
descs <- copy_to(spark_conn, words_local, overwrite = TRUE)

Basic Data Exploration

ratings %>%
  group_by(book_id) %>%
  summarize(read_count = n()) %>%
  ggplot() + geom_bar(aes(x = book_id, y = read_count), stat = 'identity') +
  labs(title = 'Long Tail of Preferences', y = 'Rating Count', x = 'Book') +
  theme_minimal() + #apply the complete theme first so it does not reset the tweak below
  theme(
    axis.text.x = element_blank()
  )

This is part of the justification for our hybrid model. The two content-based pieces should be able to recommend niche titles in the long tail.

ratings %>%
  group_by(user_id) %>%
  summarize(read_count = n()) %>%
  ggplot() + geom_bar(aes(x = reorder(user_id, read_count), y = read_count), stat = 'identity') +
  labs(title = 'User Frequency', y = 'Book Count', x = 'User') +
  theme_minimal() + #apply the complete theme first so it does not reset the tweak below
  theme(
    axis.text.x = element_blank()
  )

In this dataset, we won’t really suffer from the cold start problem, given the healthy number of ratings each user has. However, if deployed, our content-based components would be especially useful for new users.

LDA Model

Book titles were passed to the Wikipedia API to find relevant pages. Not all book titles could be found. Unpopular books and those not written in English were naturally filtered out, but over 85% of all titles were located. The data was collected in Python and preprocessed using the gensim library. This made it easier to stem words and remove words that occurred in over 50% of the documents. These clean documents were exported to a CSV for processing with the Spark ml_lda function.
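
The document-frequency filter described above can be sketched in a few lines. The original preprocessing was done in Python with gensim, so this base R version is only an illustration of the idea; the toy corpus and the variable names are assumptions, not part of the original pipeline.

```r
# Toy corpus: each element is one already-tokenized, stemmed document (made-up data).
toy_docs <- list(
  c("wizard", "school", "novel"),
  c("detect", "murder", "novel"),
  c("dragon", "wizard", "novel"),
  c("war", "histori", "soldier")
)

# Document frequency: share of documents that contain each term at least once.
term_table <- table(unlist(lapply(toy_docs, unique)))
doc_freq <- as.vector(term_table) / length(toy_docs)
names(doc_freq) <- names(term_table)

# Remove terms that occur in over 50% of the documents.
keep_terms <- names(doc_freq)[doc_freq <= 0.5]
filtered_docs <- lapply(toy_docs, function(tokens) tokens[tokens %in% keep_terms])
```

Here "novel" appears in three of the four documents and is dropped, while "wizard" (exactly half of the documents) is kept.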

LDA

The parameters required a bit of tuning. With default parameters, nearly all the weight was concentrated in two topics. After looking at gensim’s defaults and a bit of trial and error, the topic distribution was much improved (see “Topic Distribution by Document” below).

features <- descs %>%
  ft_tokenizer("text", "tokens") %>%
  ft_count_vectorizer("tokens", "features")


vec_model <- ml_pipeline( ft_tokenizer(spark_conn, "text", "tokens"), ft_count_vectorizer(spark_conn, "tokens", "features")) %>%
  ml_fit(descs)

vocab_key <- ml_vocabulary(ml_stage(vec_model, 'count_vectorizer')) %>% data.frame() %>%
  rownames_to_column('termIndices') %>%
  rename('word' = '.') %>%
  mutate(termIndices = as.integer(termIndices),
    termIndices = termIndices - 1
  )


lda_mod <-  ml_lda(features, k = 50, optimizer = 'online', learning_offset = 1, learning_decay = .5,
                   doc_concentration = .0005, optimize_doc_concentration = TRUE)

Word Distribution by Topic

Printing the topics takes a bit of work since the “ml_describe_topics” function returns token indexes, not actual words. Examining two of our most popular topics, we see that they are coherent, but won’t always translate to user tastes.
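
The index-to-word lookup is easiest to see on a toy example. This is a minimal base R sketch with a made-up four-word vocabulary and made-up weights; the key detail it shows is that Spark indexes terms from 0 while R vectors start at 1.

```r
# Hypothetical vocabulary, in the order a count vectorizer might assign it.
toy_vocab <- c("film", "seri", "novel", "war")

# Shift R's 1-based positions to Spark's 0-based term indices.
toy_vocab_key <- data.frame(termIndices = seq_along(toy_vocab) - 1L,
                            word = toy_vocab,
                            stringsAsFactors = FALSE)

# Toy topic description: term indices with their weights.
toy_topic <- data.frame(termIndices = c(1L, 0L, 3L),
                        termWeights = c(0.027, 0.009, 0.004))

# merge() with all.x = TRUE is the base R counterpart of a left join.
described <- merge(toy_topic, toy_vocab_key, by = "termIndices", all.x = TRUE)
described <- described[order(-described$termWeights), ]
```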

topic_descriptions <- ml_describe_topics(lda_mod) %>%
  collect() %>%
  unnest(termIndices, termWeights) %>%
  mutate(topic = topic + 1)

topic_descriptions$termIndices <- unlist(topic_descriptions$termIndices)
topic_descriptions <- topic_descriptions %>% left_join(vocab_key, 'termIndices')

filter(topic_descriptions, topic %in% c(6, 28)) %>%
  head(20) %>%
  kable() %>% kable_styling(bootstrap_options = 'striped')
topic  termIndices  termWeights          word
6      1            0.0268077465608327   seri
6      29           0.01686876784859     televis
6      27           0.0157436088877961   base
6      133          0.0125640776727025   air
6      111          0.011674862632556    episod
6      82           0.0102914734765798   season
6      132          0.00922415035485263  premier
6      26           0.00880663440243078  unit
6      81           0.00877372544181316  drama
6      0            0.00858964017968547  film
28     1            0.0468798577886867   seri
28     29           0.0151578746619639   televis
28     82           0.0142504182448538   season
28     18           0.0104330331458864   charact
28     215          0.00936227265308494  dead
28     111          0.00846953237484499  episod
28     2219         0.00829962725418106  dinosaur
28     13           0.00821679421880839  includ
28     273          0.00762519986117736  detect
28     73           0.00732577421758458  creat

Topic Distribution by Document

Each document will have a length-50 vector (50 is the number of topics we chose), which can be considered its hidden dimensions. These vectors will be used to build a book-to-book similarity matrix, so we want a distribution that is neither too uniform nor too top-heavy.

do_topic_temp <- ml_transform(lda_mod, features) %>%
  select(topicDistribution) %>%
  collect()


lda_features <- do.call(rbind, do_topic_temp$topicDistribution) 
colnames(lda_features) <- paste('topic', 1:50)


lda_long <- lda_features %>%
  data.frame() %>%
  gather(topic, value) 


lda_long$book_index <- 1:nrow(do_topic_temp)

lda_long %>% 
  filter(value >= .15) %>%
  group_by(topic) %>%
  summarise(n_books = n()) %>%
  arrange(desc(n_books)) %>%
  head(15) %>%
  ggplot() + geom_bar(aes(x = topic, y = n_books), stat = 'identity') +
  labs(title = 'Top 15 Frequent Topics') +
  coord_flip ()+
  theme_minimal()

This is a little more top-heavy than we would like, but still adequate.

Book Similarities

We build our similarity matrix using Pearson correlation, as it is the simplest to implement. Next, we match books with titles and get an idea of the coherence of our model by examining the correlation between the first 20 titles.
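
As a small illustration of this step, the sketch below correlates three toy books across four made-up topics. Because cor() works column by column, the matrix is transposed first so that books, not topics, are compared. All values and names here are assumptions for the example.

```r
# Toy topic distributions: 3 books (rows) x 4 topics (columns); each row sums to 1.
toy_topics <- rbind(
  book_a = c(0.70, 0.10, 0.10, 0.10),
  book_b = c(0.60, 0.20, 0.10, 0.10),
  book_c = c(0.10, 0.10, 0.10, 0.70)
)

# cor() correlates columns, so transpose to get one column per book.
toy_book_sim <- cor(t(toy_topics))
round(toy_book_sim, 2)
```

Books a and b concentrate on the same topic and come out strongly correlated, while book c, which loads on a different topic, is negatively correlated with both.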

lda_sim <- lda_features %>% as.matrix() %>% t() %>% cor 
lda_subset <- lda_sim[1:20, 1:20]
lda_sim <- lda_sim %>%
  data.frame() %>%
  rownames_to_column('row_id') %>%
  mutate(row_id = as.integer(row_id)) %>%
  inner_join(book_to_id, c('row_id' = 'X1')) %>%
  select(-words)

row_to_book <- book_to_id %>%
  inner_join(books, 'book_id')





library(corrplot)

first_20 <- row_to_book %>% arrange(book_id) %>% head(20) %>% mutate(title = str_sub(title,1,25))

colnames(lda_subset) <- first_20$title
rownames(lda_subset) <- first_20$title

corrplot(lda_subset, order = 'hclust')

Producing Recommendations

For both the seq2seq and LDA content-based recommendations, we’ll use the following simple algorithm:

  1. Choose a user
  2. Find the top 10 rated books by that user
  3. Find the similarity vector (the similarity to all other books) for each of the top 10
  4. Take the average of the similarity vectors
  5. Sort the averaged similarity vector and pick the top n

One of the drawbacks of this method is that producing recommendations for all users at once is too computationally expensive to be feasible. As a result, typical recommender evaluation metrics are difficult to produce for this algorithm.
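
The steps above can be sketched on a toy similarity matrix. This is a simplified base R version under stated assumptions: the five books, their similarity values, and the helper name recommend_similar are all made up for illustration, and the user's top-rated books are passed in directly rather than looked up from the ratings.

```r
# Toy symmetric similarity matrix for 5 books (made-up values).
toy_ids <- paste0("book_", 1:5)
toy_sim <- matrix(c(
  1.0, 0.8, 0.1, 0.3, 0.2,
  0.8, 1.0, 0.3, 0.4, 0.1,
  0.1, 0.3, 1.0, 0.7, 0.6,
  0.3, 0.4, 0.7, 1.0, 0.5,
  0.2, 0.1, 0.6, 0.5, 1.0
), nrow = 5, byrow = TRUE, dimnames = list(toy_ids, toy_ids))

recommend_similar <- function(sim, top_rated, already_rated, n) {
  # Steps 3-4: average the similarity vectors of the user's top-rated books.
  avg_sim <- colMeans(sim[top_rated, , drop = FALSE])
  # Remove books the user has already rated.
  candidates <- avg_sim[!(names(avg_sim) %in% already_rated)]
  # Step 5: sort the averaged vector and keep the top n.
  head(sort(candidates, decreasing = TRUE), n)
}

toy_recs <- recommend_similar(toy_sim,
                              top_rated = c("book_1", "book_2"),
                              already_rated = c("book_1", "book_2"),
                              n = 2)
```

Each call works on one user's rows only, which is why scoring every user at once requires repeating this averaging and sorting tens of thousands of times.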

#correlation_matrix <- similarity_input %>%
#  ml_corr()
  
user_ratings <- ratings %>%
  group_by(user_id) %>%
  summarise(n_ratings = n())


top_ratings <- function(named_user, ratings_df, n){
  ratings_df %>%
    filter(user_id == named_user) %>%
    arrange(desc(rating)) %>%
    head(n)
}

get_rated_books <- function(named_user, ratings_df){
  ratings_df %>%
    filter(user_id == named_user)
}

calc_user_lda <- function(user_id, ratings_df, similarity_df, k){
  similarity_df %>%
    inner_join(top_ratings(user_id, ratings_df, 10), 'book_id') %>%
    select(-c(book_id, rating, row_id, user_id)) %>%
    summarise_all(mean) %>%
    gather(row_index, rating) %>%
    mutate(row_index = str_sub(row_index, 2) %>% as.numeric()) %>%
    inner_join(book_to_id, c( 'row_index' = 'X1')) %>%
    anti_join(get_rated_books(user_id, ratings_df), 'book_id') %>% #remove books the user has rated 
    select(book_id, rating) %>%
    drop_na() %>%
    inner_join(books, 'book_id') %>%
    select(book_id, title, rating) %>%
    arrange(desc(rating)) %>%
    head(k)
}

calc_user_lda(1, ratings, lda_sim, 20) %>%
  kable() %>% kable_styling(bootstrap_options = 'striped')
book_id title rating
1574 The Left Hand of Darkness 0.6303631
2574 The Black Ice (Harry Bosch, #2; Harry Bosch Universe, #2) 0.6303631
5018 Theodore Boone: Kid Lawyer (Theodore Boone, #1) 0.6303631
78 The Devil Wears Prada (The Devil Wears Prada, #1) 0.6266320
3076 The French Lieutenant’s Woman 0.6174999
4234 Twilight (The Mediator, #6) 0.6174999
8354 Twilight (Warriors: The New Prophecy, #5) 0.6174999
9655 Treasure (Dirk Pitt, #9) 0.6174999
62 The Golden Compass (His Dark Materials, #1) 0.6136638
2439 Way of the Peaceful Warrior: A Book That Changes Lives 0.6136638
4876 The Silent Girl (Rizzoli & Isles, #9) 0.6136638
9340 Tell Me Three Things 0.6136638
562 The Way of Kings (The Stormlight Archive, #1) 0.6135204
2734 Dorothy Must Die (Dorothy Must Die, #1) 0.6122484
794 Doctor Sleep (The Shining, #2) 0.6120954
5165 It Happened One Autumn (Wallflowers, #2) 0.6116760
5270 The Christmas Box (The Christmas Box, #1) 0.6110079
145 Deception Point 0.6089313
8723 Before We Met 0.6086956
6042 True History of the Kelly Gang 0.6076559

There isn’t an obvious pattern here, but I’m also not an avid reader.

seq2seq Model

The tags were preprocessed, then fed into a simple neural network using only dense layers (in effect an autoencoder, since the network is trained to reproduce its own input). The architecture was inspired by this article: https://towardsdatascience.com/creating-a-hybrid-content-collaborative-movie-recommender-using-deep-learning-cc8b431618af

The idea is that the middle layer, the encoding layer, becomes a low-dimensional representation of the set of tags for a particular book. Related tags are compressed into the same dimension, similar to the way SVD creates latent dimensions. For instance, the model should learn that “fantasy” and “sci-fi fantasy” are related because they have a number of co-occurrences.
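
To make the SVD analogy concrete, here is a minimal base R sketch on a made-up book-by-tag matrix. It uses a truncated SVD rather than the neural network, so it only illustrates the general idea of co-occurring tags landing in the same latent dimension; the counts and tag names are assumptions for the example.

```r
# Toy book-by-tag counts (made up): the first two tags co-occur, as do the last two.
toy_tag_counts <- rbind(
  c(5, 4, 0, 0),
  c(4, 5, 0, 0),
  c(0, 0, 3, 4),
  c(0, 0, 4, 3)
)
colnames(toy_tag_counts) <- c("fantasy", "sci-fi-fantasy", "romance", "chick-lit")

# Truncated SVD: keep 2 latent dimensions instead of 4 tag columns.
decomp <- svd(toy_tag_counts)
k <- 2
tag_loadings <- decomp$v[, 1:k] %*% diag(decomp$d[1:k])
rownames(tag_loadings) <- colnames(toy_tag_counts)

# Cosine similarity between two tags in the latent space.
cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))
cos_related <- cosine(tag_loadings["fantasy", ], tag_loadings["sci-fi-fantasy", ])
cos_unrelated <- cosine(tag_loadings["fantasy", ], tag_loadings["romance", ])
```

In this toy case the two fantasy tags collapse onto the same latent dimension, while "fantasy" and "romance" end up orthogonal.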

Transforming Tag Data

The tag data is supplied in a bag-of-words-like format. We want to normalize the “count” to control for popularity and cast it into a wide format.

The following transformations are performed:

  1. Create metadata lookup table
  2. Filter low-information tags
  3. TF-IDF scaling
  4. Normalize by book (this step would be removed if we wanted to account for popularity)
  5. Cast into wide form
  6. Log scaling to correct the highly skewed distribution

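
Before the full pipeline, here is a toy base R sketch of steps 3 through 6 on seven made-up book/tag rows. It follows the same formulas as the code below (IDF with +1 smoothing, division by the book's mean count, log scaling), but the data and the variable names are assumptions for illustration only.

```r
# Toy long-format tag data (made-up counts): one row per book/tag pair.
toy_tags <- data.frame(
  book  = c("A", "A", "B", "B", "C", "D", "E"),
  tag   = c("fantasy", "classics", "fantasy", "romance", "fantasy", "romance", "classics"),
  count = c(100, 20, 10, 30, 5, 8, 2),
  stringsAsFactors = FALSE
)
n_books <- length(unique(toy_tags$book))

# Step 3: IDF weight; tags attached to many books get a smaller weight.
tag_freq <- table(toy_tags$tag)
idf <- log(n_books / (as.vector(tag_freq) + 1))
names(idf) <- names(tag_freq)

# Step 4: normalize each count by the book's mean count, then apply the IDF weight.
book_mean <- tapply(toy_tags$count, toy_tags$book, mean)
scaled <- toy_tags$count / as.vector(book_mean[toy_tags$book]) *
  as.vector(idf[toy_tags$tag])

# Step 6: log scaling (applied before the cast; missing pairs become 0 either way).
toy_tags$weight <- log1p(scaled)

# Step 5: cast into wide form, one row per book and one column per tag.
toy_wide <- xtabs(weight ~ book + tag, data = toy_tags)
```

The result is a 5 x 3 matrix in which book/tag pairs that never occur are 0, which is the shape the network expects as input.
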
tags_test <- tags %>%
  filter(goodreads_book_id <= 100) %>%
  inner_join(tag_descs, 'tag_id') %>%
  inner_join(books, 'goodreads_book_id') %>%
  select(goodreads_book_id, tag_id, tag_name, title) %>%
  arrange(goodreads_book_id, tag_id)

tags_expanded <- tags %>%
  inner_join(tag_descs, 'tag_id') %>%
  inner_join(books, 'goodreads_book_id') %>%
  select(goodreads_book_id, tag_id, tag_name, title, count) %>%
  filter(str_detect(tag_name, '\\d{4,}') == FALSE & str_detect(tag_name, '\\w+') == TRUE
         & str_detect(tag_name, 'book') == FALSE)

mean_counts <- tags_expanded %>%
  group_by(goodreads_book_id) %>%
  summarise(mean_count = mean(count))

tag_counts <- tags_expanded %>%
  group_by(tag_id) %>%
  summarise(freq = n()) %>%
  mutate(idf_weight = log(10000/(freq + 1))) %>%
  arrange(desc(freq)) %>%
  filter(freq >= 500) %>%
  select(idf_weight, tag_id)

tags_fixed <- tags_expanded %>%
  group_by(tag_id, goodreads_book_id) %>%
  summarise(tag_count = sum(count)) %>%
  ungroup() %>%
  inner_join(tag_counts, 'tag_id') %>%
  inner_join(mean_counts, 'goodreads_book_id') %>%
  mutate(tag_count = (tag_count/mean_count) * idf_weight) %>%
  inner_join(tag_descs, 'tag_id') %>%
  inner_join(books, 'goodreads_book_id') %>%
  select(tag_count, goodreads_book_id, tag_id, tag_name, title)

max_tag <- tags_fixed %>% select(tag_id) %>% distinct() %>% nrow()
max_count <- max(tags_fixed$tag_count, na.rm = TRUE)

book_count <- tags_fixed %>% select(goodreads_book_id) %>% distinct %>% nrow()

ggplot(tags_fixed) + geom_density(aes(x = tag_count)) + labs(title = 'Tag Count Raw') + 
  theme_minimal()

ggplot(tags_fixed) + geom_density(aes(x = log(tag_count))) + labs(title = 'Tag Count Logged') +
  theme_minimal()

bag_of_words <- tags_fixed %>%
  select(tag_id, tag_count, goodreads_book_id) %>%
  mutate(tag_count = log(tag_count + 1)) %>%
  spread(tag_id, tag_count) %>%
  replace(., is.na(.), 0) %>%
  arrange(goodreads_book_id) %>%
  select(-goodreads_book_id) %>%
  as.matrix ()

Construct NN

This is one of the possible areas for improvement, as my experience with deep learning is somewhat limited. We use two mirrored sequences of layers, with the encoding layer sandwiched between them. The idea is for the network to learn a 25-dimensional vector that describes the state of the 400+ dimensional tag vector and can reproduce it.

encoder <- keras_model_sequential(name = 'encoder') 
encoder %>%
  layer_dense(units = 256, activation = 'relu', input_shape = c(max_tag))%>%
  layer_dropout(rate = .3) %>%
  layer_dense(units = 128, activation = 'relu', input_shape = c(256))%>%
  layer_dropout(rate = .3) %>%
  layer_dense(units = 64, activation = 'relu', input_shape = c(128))%>%
  layer_dropout(rate = .3) %>%
  layer_dense(units = 25,  activation = 'relu', name = 'tag_encodings', input_shape = c(64))

decoder <- keras_model_sequential()

decoder %>%
  layer_dense(units = 64, activation = 'relu', input_shape = c(25), name = 'embeddings_layer') %>%
  layer_dropout(rate = .3) %>%
  layer_dense(units = 128, activation = 'relu', input_shape = c(64))%>%
  layer_dropout(rate = .3) %>%
  layer_dense(units = 256, activation = 'relu', input_shape = c(128))%>%
  layer_dropout(rate = .3) %>%
  layer_dense(units = max_tag, input_shape = c(256)) %>%
  layer_activation('sigmoid', input_shape = c(max_tag))


model <- keras_model_sequential()
model %>%
  encoder %>%
  decoder %>%
  keras::compile(loss = 'mse', optimizer = 'adam', metrics = c('mse'))

model %>% fit( 
  bag_of_words[1:8000, ], 
  bag_of_words[1:8000, ], 
  epochs = 5, 
  batch_size = 10,
  shuffle = FALSE,
  verbose = FALSE,
  validation_data = list(bag_of_words[8001:10000, ], bag_of_words[8001:10000, ])
)
model_outputs <- get_layer(model, 'encoder') %>% get_layer('tag_encodings')
intermediate_layer_model <- keras_model(inputs = encoder$input,
                                        outputs = model_outputs$output)

intermediate_output <- predict(intermediate_layer_model, bag_of_words)

#encoding dimensions that are zero for every book provide no information and make similarity less accurate
fixed_embeddings <- intermediate_output[, colSums(intermediate_output != 0) > 0]
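
A toy example shows why the all-zero dimensions matter for a correlation-based similarity. This is a base R sketch with made-up three-dimensional encodings, not output from the model.

```r
# Two toy encodings with opposite patterns across three informative dimensions.
enc_a <- c(1, 2, 3)
enc_b <- c(3, 2, 1)
cor_clean <- cor(enc_a, enc_b)     # -1

# The same encodings padded with three dimensions that are 0 for every book.
padded <- rbind(book_a = c(enc_a, 0, 0, 0), book_b = c(enc_b, 0, 0, 0))
cor_padded <- cor(padded["book_a", ], padded["book_b", ])     # 0.5

# Dropping the all-zero columns, as in the line above, restores the original value.
trimmed <- padded[, colSums(padded != 0) > 0]
cor_trimmed <- cor(trimmed["book_a", ], trimmed["book_b", ])
```

The shared zeros make two books with opposite profiles look positively correlated, so removing those columns before calling cor() gives a more honest similarity.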

Sanity Check

As with the LDA model, we create a similarity matrix for all books, this time using the 25-dimensional encodings. The first 20 books are examined to determine the coherence of the model.

matrix_features <- fixed_embeddings %>% t() %>% cor()
matrix_features_small <- matrix_features[1:20, 1:20]


library(corrplot)

ordered_books <- books %>% arrange(goodreads_book_id)
first_20 <- ordered_books %>% arrange(goodreads_book_id) %>% head(20) %>% mutate(title = str_sub(title,1,20))

colnames(matrix_features_small) <- first_20$title
rownames(matrix_features_small) <- first_20$title

corrplot(matrix_features_small)

The pattern here is obvious, though it does help that there are so many Harry Potter books. The middle cluster consists of books related to travel and adventure. Harry Potter being related to Lord of the Rings is also a good sign.

Recommendations

As with LDA, we produce recommendations using the similarity matrix and the same algorithm.

row_to_good_reads <- ordered_books %>%
  rownames_to_column('row_id') %>%
  mutate(row_id = as.integer(row_id)) %>%
  select(row_id, goodreads_book_id)

good_reads_lookup <- select(books, goodreads_book_id, book_id)

book_simarilarity <- matrix_features %>%
  data.frame() %>%
  rownames_to_column('row_id') %>%
  mutate(row_id = as.integer(row_id)) %>%
  inner_join(row_to_good_reads, 'row_id') %>%
  inner_join(books, 'goodreads_book_id') %>%
  select(-goodreads_book_id)


calc_user_tags <- function(user_id, ratings_df, similarity_df, k, n = 10){
  
  similarity_df %>%
    inner_join(top_ratings(user_id, ratings_df, n), 'book_id') %>%
    select(starts_with('X')) %>%
    summarise_all(mean) %>%
    gather(row_index, rating) %>%
    mutate(row_index = str_sub(row_index, 2) %>% as.numeric()) %>%
    inner_join(row_to_good_reads, c( 'row_index' = 'row_id')) %>%
    inner_join(good_reads_lookup, 'goodreads_book_id') %>%
    anti_join(get_rated_books(user_id, ratings_df), 'book_id') %>% #remove books the user has rated 
    select(book_id, rating) %>%
    drop_na() %>%
    inner_join(books, 'book_id') %>%
    select(book_id, title, rating) %>%
    arrange(desc(rating)) %>%
    head(k)
}

calc_user_tags(1, ratings, book_simarilarity, 20)%>%
  kable() %>% kable_styling(bootstrap_options = 'striped')
book_id title rating
6413 Home (Gilead, #2) 0.6739000
4338 The Price of Salt 0.6727841
9612 The Charterhouse of Parma 0.6708106
930 Olive Kitteridge 0.6706485
7485 سینوهه 0.6704477
1435 Sophie’s Choice 0.6685887
9632 Falling Man 0.6678417
9761 How Green Was My Valley 0.6677593
8975 Girl With Curious Hair 0.6676609
5596 Quo Vadis 0.6672832
658 The Corrections 0.6667213
6727 Burmese Days 0.6665403
7783 The Leopard 0.6659457
4100 Tinkers 0.6656806
5456 How the García Girls Lost Their Accents 0.6654594
7974 Silence 0.6654357
4788 The Fortress of Solitude 0.6650515
5911 The Book of Illusions 0.6649912
3511 Eva Luna 0.6647221
669 The House of the Spirits 0.6646765

ALS Model

Using a standard Spark ALS implementation, constructing a single model was easier than expected, but tuning on a grid wasn’t practical when running Spark locally.

Optimize Parameters

Spark has built-in functions for grid search, allowing us to easily optimize RMSE. The ranks in the grid are higher than those we used for 100K-rating datasets, because the number of users and books is an order of magnitude higher than in previous projects.

estimator <- ml_pipeline(spark_conn) %>%
  ml_als(rating_col = 'rating', user_col = 'user_id', item_col = 'book_id', max_iter = 10, cold_start_strategy = 'drop')  

#als_grid <- list(als = list(rank = c(20, 30, 50), reg_param = c(.05, .1)))

als_grid <- list(als = list(rank = c(20,30,50)))
cv <- ml_cross_validator(
  spark_conn, 
  estimator = estimator,
  evaluator = ml_regression_evaluator(spark_conn, label_col = 'rating'), 
  estimator_param_maps = als_grid,
  num_folds = 2
)

als_cv <- ml_fit(cv, ratings_sc)
ml_validation_metrics(als_cv) %>% kable() %>% kable_styling(bootstrap_options = "striped", full_width = F)
rmse       rank_1
0.8321923  20
0.8319314  30
0.8334796  50

This RMSE is similar to that of ALS implementations on other ratings datasets, such as MovieLens. RMSE barely changes across the three ranks; the relatively low book dimensionality may be why a low rank already does so well.

Metrics at K

To assess the practical quality of the ALS portion of our recommender, we’ll look at precision and recall at two levels of recommendations (K = 10 and K = 20). This will give us an idea of how quickly the quality drops off. If the recall increases but the precision stays level, we would be more comfortable at higher levels of K.
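
For a single user, the metrics can be sketched as follows. This is a simplified base R illustration with hypothetical book ids; the function name precision_recall_at_k is an assumption, and unlike the Spark code below, which pools true positives, false positives and false negatives across all users, it scores one recommendation list at a time.

```r
precision_recall_at_k <- function(recommended, relevant, k) {
  top_k <- head(recommended, k)
  tp <- length(intersect(top_k, relevant))  # recommended and in the held-out set
  fp <- length(top_k) - tp                  # recommended but not held out
  fn <- length(relevant) - tp               # held out but not recommended
  precision <- tp / (tp + fp)
  recall <- tp / (tp + fn)
  f1 <- if (precision + recall == 0) 0 else
    2 * precision * recall / (precision + recall)
  c(precision = precision, recall = recall, F1 = f1)
}

# Hypothetical ranked recommendations and held-out books for one user.
toy_recommended <- c(5, 8, 118, 26, 15, 195)
toy_relevant <- c(8, 15, 63, 14)

at_3 <- precision_recall_at_k(toy_recommended, toy_relevant, k = 3)
at_6 <- precision_recall_at_k(toy_recommended, toy_relevant, k = 6)
```

In this toy case precision stays at one third while recall doubles from 0.25 to 0.5 as K goes from 3 to 6, which is the pattern that would make a larger K attractive.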

metrics_at_k <- vector('list', length = 2)
for (k in 1:2){
  temp_dfs <- vector('list', length = 2)
  for (i in 1:2){ 
    
    set.seed(42 + i)
    partitioned_set <- ratings_sc %>%
      sdf_random_split(training = .8, testing = .2) 
    
    als_mod <- partitioned_set[[1]] %>%
      ml_als(rating_col = 'rating', user_col = 'user_id', item_col = 'book_id', max_iter = 10, rank = 20, reg_param = .1,
             implicit_prefs = TRUE)
    
    recs <- ml_recommend(als_mod, type = 'item', k*10) %>%
      full_join(partitioned_set[[2]], c('user_id', 'book_id'), suffix = c('_pred', '_act')) %>%
      mutate(truth_cat = ifelse(is.na(rating_pred) == 1 & is.na(rating_act) == 0, 'FN', '')) %>%
      mutate(truth_cat = ifelse(is.na(rating_pred) == 0 & is.na(rating_act) == 1, 'FP', truth_cat)) %>%
      mutate(truth_cat = ifelse(is.na(rating_pred) == 0 & is.na(rating_act) == 0, 'TP', truth_cat)) %>%
      group_by(truth_cat) %>%
      summarise(tot_obs = n()) %>%
      ungroup() %>%
      collect()
    
    recs_cm <- recs %>%
      spread(truth_cat, tot_obs) %>%
      mutate(
        precision = TP/(TP + FP),
        recall = TP/(TP + FN),
        F1 =  2*((precision*recall)/(precision + recall))
      )
    temp_dfs[[i]] <- recs_cm
  }
  summary_df <- bind_rows(temp_dfs) %>%
    summarise_all(mean) %>%
    add_column('k' = k*10)
  metrics_at_k[[k]] <- summary_df
}

metrics_at_k %>%
  bind_rows() %>% kable() %>% kable_styling(bootstrap_options = "striped", full_width = F)
FN       FP        TP        precision  recall     F1         k
1132308  471047.5  63192.5   0.1182849  0.0528586  0.0730659  10
1078839  951819.0  116661.0  0.1091841  0.0975834  0.1030583  20

In the first iteration, we didn’t use implicit preferences, and the precision and recall were considerably lower. This is somewhat surprising given that we have explicit ratings. However, our goal is to predict the books users will read, not to optimize RMSE, so the confusion matrix statistics are more important; we will keep implicit_prefs set to TRUE in our final recommender.

ALS Predictions

Again with user #1, we produce our recommendations for the ALS model.

final_als <- ml_als(ratings_sc, rating_col = 'rating', user_col = 'user_id', item_col = 'book_id', max_iter = 10, rank = 20, reg_param = .1, implicit_prefs = TRUE)


calc_als <- function(model, named_user, k){

  #use the model passed in rather than the global final_als
  ml_recommend(model, type = 'item', 50) %>%
    select(book_id, user_id, rating) %>%
    filter(user_id == named_user) %>%
    collect() %>%
    select(user_id, book_id, rating) %>%
    inner_join(books, 'book_id') %>%
    select(book_id, title, rating) %>%
    anti_join(get_rated_books(named_user, ratings), 'book_id') %>% #remove books the user has rated
    arrange(desc(rating)) %>% #make the ranking explicit before truncating
    head(k)
}

calc_als(final_als, 1, 10) %>% kable() %>% kable_styling(bootstrap_options = "striped", full_width = F)
book_id title rating
5 The Great Gatsby 0.8578916
8 The Catcher in the Rye 0.8275669
118 The Joy Luck Club 0.7824225
26 The Da Vinci Code (Robert Langdon, #2) 0.7787534
15 The Diary of a Young Girl 0.7486388
195 The Guernsey Literary and Potato Peel Pie Society 0.7476287
291 Cutting for Stone 0.7297267
14 Animal Farm 0.7212889
1 The Hunger Games (The Hunger Games, #1) 0.7151593
63 Wuthering Heights 0.7139446

Next, we look at how much of our total dataset the ALS model actually recommends.

#use the final model fit on all ratings, not the last model from the evaluation loop
all_recs <- ml_recommend(final_als, type = 'item', 10) %>%
  group_by(book_id) %>%
  summarize(book_count = n()) %>%
  collect()

ggplot(all_recs) + geom_bar(aes(x = reorder(book_id, book_count), y = book_count), stat = 'identity') +
  labs(title = 'Recommendations by Book', y = 'Book Count', x = 'Book') +
  theme_minimal() + #apply the complete theme first so it does not reset the tweak below
  theme(
    axis.text.x = element_blank()
  )

Mixed Hybrid

We now bring everything together and create a function that produces our final recommendations for a given user. It also prints the user’s top-rated books to give us an idea of the user’s tastes.
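
The combination step can be illustrated on toy data first. This base R sketch stacks three hypothetical recommendation lists, counts how many models proposed each book, and ranks by that count and then by score; the book ids and scores are made up.

```r
# Hypothetical top picks from each model (made-up scores).
toy_lda <- data.frame(book_id = c(10, 20), rating = c(0.63, 0.61), source = "LDA")
toy_tag <- data.frame(book_id = c(20, 30), rating = c(0.67, 0.66), source = "Tags")
toy_als <- data.frame(book_id = c(40, 10), rating = c(0.86, 0.78), source = "ALS")

toy_full <- rbind(toy_lda, toy_tag, toy_als)

# Count how many models proposed each book.
counts <- table(toy_full$book_id)
toy_full$book_count <- as.vector(counts[as.character(toy_full$book_id)])

# Rank by number of models first, then by score.
toy_ranked <- toy_full[order(-toy_full$book_count, -toy_full$rating), ]
```

Books proposed by two models rise to the top even when a single-model pick has a higher score. Note that a book proposed by two models appears once per source, and that the scores come from different models and are not on a common scale.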

full_recs <- function (user_id, ratings_df, recs = 30){
  #use the ratings_df argument throughout instead of the global ratings
  lda_recs <- calc_user_lda(user_id, ratings_df, lda_sim, ceiling(recs/2)) %>%
    add_column('source' = 'LDA')
  tag_recs <- calc_user_tags(user_id, ratings_df, book_simarilarity, ceiling(recs/2)) %>%
    add_column('source' = 'Tags')
  als_recs <- calc_als(final_als, user_id, ceiling(recs/2)) %>%
    add_column('source' = 'ALS')
  
  best_10 <- top_ratings(user_id, ratings_df, 10) %>%
    inner_join(books, 'book_id') %>%
    select(title, rating)
  full_set <- rbind(lda_recs, tag_recs, als_recs)
  #count how many models proposed each book, then rank by that count and by score
  final_recs <- full_set %>%
    group_by(book_id) %>%
    summarise(book_count = n()) %>%
    ungroup() %>%
    inner_join(full_set, 'book_id') %>%
    arrange(desc(book_count), desc(rating)) %>%
    head(recs)
  
  print(best_10)
  return(final_recs)
}

full_recs(1, ratings) %>% kable() %>% kable_styling(bootstrap_options = "striped", full_width = F)
## # A tibble: 10 x 2
##    title                                                         rating
##    <chr>                                                          <dbl>
##  1 The Shadow of the Wind (The Cemetery of Forgotten Books,  #1)      5
##  2 Gilead (Gilead, #1)                                                5
##  3 The Kite Runner                                                    5
##  4 Peace Like a River                                                 5
##  5 Divine Secrets of the Ya-Ya Sisterhood                             5
##  6 The Alchemist                                                      5
##  7 To Kill a Mockingbird                                              5
##  8 Antigone (The Theban Plays, #3)                                    5
##  9 Ender's Game (Ender's Saga, #1)                                    5
## 10 The Death of Ivan Ilych                                            5
book_id book_count title rating source
5 1 The Great Gatsby 0.8578916 ALS
8 1 The Catcher in the Rye 0.8275669 ALS
118 1 The Joy Luck Club 0.7824225 ALS
26 1 The Da Vinci Code (Robert Langdon, #2) 0.7787534 ALS
15 1 The Diary of a Young Girl 0.7486388 ALS
195 1 The Guernsey Literary and Potato Peel Pie Society 0.7476287 ALS
291 1 Cutting for Stone 0.7297267 ALS
14 1 Animal Farm 0.7212889 ALS
1 1 The Hunger Games (The Hunger Games, #1) 0.7151593 ALS
63 1 Wuthering Heights 0.7139446 ALS
28 1 Lord of the Flies 0.7057348 ALS
9 1 Angels & Demons (Robert Langdon, #1) 0.7037891 ALS
172 1 Anna Karenina 0.7002230 ALS
2 1 Harry Potter and the Sorcerer’s Stone (Harry Potter, #1) 0.6899099 ALS
6413 1 Home (Gilead, #2) 0.6739000 Tags
4338 1 The Price of Salt 0.6727841 Tags
9612 1 The Charterhouse of Parma 0.6708106 Tags
930 1 Olive Kitteridge 0.6706485 Tags
7485 1 سینوهه 0.6704477 Tags
1435 1 Sophie’s Choice 0.6685887 Tags
9632 1 Falling Man 0.6678417 Tags
9761 1 How Green Was My Valley 0.6677593 Tags
8975 1 Girl With Curious Hair 0.6676609 Tags
5596 1 Quo Vadis 0.6672832 Tags
658 1 The Corrections 0.6667213 Tags
6727 1 Burmese Days 0.6665403 Tags
7783 1 The Leopard 0.6659457 Tags
4100 1 Tinkers 0.6656806 Tags
5456 1 How the García Girls Lost Their Accents 0.6654594 Tags
1574 1 The Left Hand of Darkness 0.6303631 LDA

Conclusion

The ALS model produced by far the best recommendations, but the other models are not without value. ALS is going to recommend more of the same in most cases. The two content-based models, even though not tuned, will provide the user with some nice under-the-radar titles.

Areas for improvement

  • With better sources of text data describing the books, the room for improvement with the LDA piece is immense.
  • Using more of the tags, and possibly passing a compressed TF-IDF matrix to Keras, would likely yield better tag encodings.
  • The content recommendations need a more scalable algorithm. The low-hanging fruit would be to take advantage of parallelization, but I would think a ground-up redesign might be necessary.