The “goodbooks-10k” dataset consists of roughly 6 million ratings from about 50 thousand users across 10 thousand books. It was sourced from Goodreads. The ratings data is supplemented with tags and book metadata, lending itself to a hybrid of collaborative filtering and content-based recommendation. We further supplemented this data with text blurbs describing each book, pulled using the Wikipedia API.
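We will build 3 different recommenders, one for each dataset: an LDA topic model on the Wikipedia text, an autoencoder on the tags, and an ALS collaborative filter on the ratings.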
The final recommendations will be a mixed hybrid, a union of these 3 sets.
These files are relatively small, so we’ll load them from GitHub, then copy them to Spark.
library(tidyverse)
library(sparklyr)
library(kableExtra)
library(keras)
set.seed(7) #many of our models are stochastic
conf <- spark_config()
conf$`sparklyr.cores.local` <- 16
conf$`sparklyr.shell.driver-memory` <- "24G"
conf$spark.memory.fraction <- 0.9
spark_conn <- spark_connect('local', config = conf)
fp <- 'https://raw.githubusercontent.com/TheFedExpress/DATA612/master/Final%20Project/tidy_words.csv'
words_local <- read_csv(fp)
book_to_id <- read_csv('https://raw.githubusercontent.com/TheFedExpress/DATA612/master/Final%20Project/book_to_id.csv')
tags <- read_csv('https://raw.githubusercontent.com/zygmuntz/goodbooks-10k/master/book_tags.csv')
tag_descs <- read_csv('https://raw.githubusercontent.com/zygmuntz/goodbooks-10k/master/tags.csv')
books <- read_csv('https://raw.githubusercontent.com/zygmuntz/goodbooks-10k/master/books.csv')
ratings <- read_csv('https://raw.githubusercontent.com/zygmuntz/goodbooks-10k/master/ratings.csv')
ratings_sc <- copy_to(spark_conn, ratings, overwrite = T)
descs <- copy_to(spark_conn, words_local, overwrite = T)
ratings %>%
group_by(book_id) %>%
summarize(read_count = n()) %>%
ggplot() + geom_bar(aes(x = book_id, y = read_count), stat = 'identity') +
labs(title = 'Long Tail of Preferences', y = 'Rating Count', x = 'Book') +
theme_minimal() + # apply the complete theme first so it does not reset the tweak below
theme(
axis.text.x = element_blank()
)
This is part of the justification for our hybrid model. The two content-based pieces should be able to recommend niche titles in the long tail.
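To put a number on the long tail, one option is the cumulative share of ratings captured by the most-rated books. The following is a minimal sketch in base R on made-up data; `toy_ratings` and the 20% cutoff are assumptions for illustration, not part of the original analysis.

```r
# Toy ratings (hypothetical): book 1 is a blockbuster, books 2-5 sit in the tail.
toy_ratings <- data.frame(
  user_id = c(1, 2, 3, 4, 5, 6, 1, 2, 3, 4),
  book_id = c(1, 1, 1, 1, 1, 1, 2, 3, 4, 5)
)

# Ratings per book, sorted from most to least rated.
read_count <- sort(table(toy_ratings$book_id), decreasing = TRUE)

# Cumulative share of all ratings captured by the top-n books.
cum_share <- cumsum(as.numeric(read_count)) / sum(read_count)
names(cum_share) <- names(read_count)

# Share of ratings held by the most-rated 20% of books (at least one book).
top_n <- max(1, floor(0.2 * length(read_count)))
head_share <- cum_share[top_n]
print(head_share)
```

Swapping `toy_ratings` for the real `ratings` data frame should give the same summary for the full dataset, since only `book_id` is used.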
ratings %>%
group_by(user_id) %>%
summarize(read_count = n()) %>%
ggplot() + geom_bar(aes(x = reorder(user_id, read_count), y = read_count), stat = 'identity') +
labs(title = 'User Frequency', y = 'Book Count', x = 'User') +
theme_minimal() + # apply the complete theme first so it does not reset the tweak below
theme(
axis.text.x = element_blank()
)
Given the healthy number of ratings each user has, we won’t really suffer from the cold-start problem in this dataset. However, if deployed, our content-based models would be especially useful for new users.
Book titles were used as queries to the Wikipedia API to find relevant pages. Not all book titles could be found. Unpopular books and those not written in English were naturally filtered out, but over 85% of all titles were located. The data was collected in Python and preprocessed using the gensim library. This made it easier to stem words and remove words that occurred in over 50% of the documents. These clean documents were exported to a CSV for processing with the Spark ml_lda function.
The parameters required a bit of tuning. With default parameters, nearly all the weight was concentrated in two topics. After looking at gensim’s defaults and a bit of trial and error, the topic distribution was much improved (see the topic-frequency plot further down).
features <- descs %>%
ft_tokenizer("text", "tokens") %>%
ft_count_vectorizer("tokens", "features")
vec_model <- ml_pipeline( ft_tokenizer(spark_conn, "text", "tokens"), ft_count_vectorizer(spark_conn, "tokens", "features")) %>%
ml_fit(descs)
vocab_key <- ml_vocabulary(ml_stage(vec_model, 'count_vectorizer')) %>% data.frame() %>%
rownames_to_column('termIndices') %>%
rename('word' = '.') %>%
mutate(termIndices = as.integer(termIndices),
termIndices = termIndices - 1
)
lda_mod <- ml_lda(features, k = 50, optimizer = 'online', learning_offset = 1, learning_decay = .5,
doc_concentration = .0005, optimize_doc_concentration = TRUE)
Printing the topics takes a bit of work since the “ml_describe_topics” function returns token indexes, not actual words. Examining two of our most popular topics, we see that they are coherent, but won’t always translate to user tastes.
topic_descriptions <- ml_describe_topics(lda_mod) %>%
collect() %>%
unnest(c(termIndices, termWeights)) %>% # tidyr >= 1.0 syntax: unnest both list columns together
mutate(topic = topic + 1)
topic_descriptions$termIndices <- unlist(topic_descriptions$termIndices)
topic_descriptions <- topic_descriptions %>% left_join(vocab_key, 'termIndices')
filter(topic_descriptions, topic %in% c(6, 28)) %>%
head(20) %>%
kable() %>% kable_styling(bootstrap_options = 'striped')
| topic | termIndices | termWeights | word |
|---|---|---|---|
| 6 | 1 | 0.0268077465608327 | seri |
| 6 | 29 | 0.01686876784859 | televis |
| 6 | 27 | 0.0157436088877961 | base |
| 6 | 133 | 0.0125640776727025 | air |
| 6 | 111 | 0.011674862632556 | episod |
| 6 | 82 | 0.0102914734765798 | season |
| 6 | 132 | 0.00922415035485263 | premier |
| 6 | 26 | 0.00880663440243078 | unit |
| 6 | 81 | 0.00877372544181316 | drama |
| 6 | 0 | 0.00858964017968547 | film |
| 28 | 1 | 0.0468798577886867 | seri |
| 28 | 29 | 0.0151578746619639 | televis |
| 28 | 82 | 0.0142504182448538 | season |
| 28 | 18 | 0.0104330331458864 | charact |
| 28 | 215 | 0.00936227265308494 | dead |
| 28 | 111 | 0.00846953237484499 | episod |
| 28 | 2219 | 0.00829962725418106 | dinosaur |
| 28 | 13 | 0.00821679421880839 | includ |
| 28 | 273 | 0.00762519986117736 | detect |
| 28 | 73 | 0.00732577421758458 | creat |
Each document will have a length-50 vector (50 is the number of topics we chose), which can be considered its hidden dimensions. These vectors will be used to build a similarity matrix between documents, so we want a topic distribution that’s neither too uniform nor too top-heavy.
do_topic_temp <- ml_transform(lda_mod, features) %>%
select(topicDistribution) %>%
collect()
lda_features <- do.call(rbind, do_topic_temp$topicDistribution)
colnames(lda_features) <- paste('topic', 1:50)
lda_long <- lda_features %>%
data.frame() %>%
gather(topic, value)
lda_long$book_index <- 1:nrow(do_topic_temp)
lda_long %>%
filter(value >= .15) %>%
group_by(topic) %>%
summarise(n_books = n()) %>%
arrange(desc(n_books)) %>%
head(15) %>%
ggplot() + geom_bar(aes(x = topic, y = n_books), stat = 'identity') +
labs(title = 'Top 15 Frequent Topics') +
coord_flip ()+
theme_minimal()
This is a little more top-heavy than we would like, but still adequate.
We build our similarity matrix using Pearson correlation, as it is the simplest to implement. Next, we match books with titles and get an idea of the coherence of our model by examining the correlation between the first 20 titles.
lda_sim <- lda_features %>% as.matrix() %>% t() %>% cor
lda_subset <- lda_sim[1:20, 1:20]
lda_sim <- lda_sim %>%
data.frame() %>%
rownames_to_column('row_id') %>%
mutate(row_id = as.integer(row_id)) %>%
inner_join(book_to_id, c('row_id' = 'X1')) %>%
select(-words)
row_to_book <- book_to_id %>%
inner_join(books, 'book_id')
library(corrplot)
first_20 <- row_to_book %>% arrange(book_id) %>% head(20) %>% mutate(title = str_sub(title,1,25))
colnames(lda_subset) <- first_20$title
rownames(lda_subset) <- first_20$title
corrplot(lda_subset, order = 'hclust')
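The transpose before `cor()` matters because `cor()` correlates columns, and we want one correlation per pair of books rather than per pair of topics. As a small sketch on invented topic distributions (`toy_topics` is hypothetical, not drawn from the fitted model):

```r
# Toy topic distributions: 3 books (rows) x 4 topics (columns); rows sum to 1.
toy_topics <- rbind(
  c(0.70, 0.10, 0.10, 0.10),
  c(0.60, 0.20, 0.10, 0.10),
  c(0.05, 0.05, 0.10, 0.80)
)
rownames(toy_topics) <- c("book_a", "book_b", "book_c")

# cor() works column-wise, so transpose to put books in columns.
toy_sim <- cor(t(toy_topics))

# book_a and book_b lean on the same topic, so they correlate strongly;
# book_c is dominated by a different topic, so its correlation with book_a is negative.
print(round(toy_sim, 3))
```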
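For both the tag-autoencoder and LDA content-based recommendations, we’ll use the same simple algorithm: take the user’s top-rated books (10 by default), average the similarity rows for those books, drop the books the user has already rated, and recommend the k books with the highest mean similarity.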
One of the drawbacks of this method is that producing recommendations for all users at once is too computationally expensive to be feasible. As a result, typical recommender evaluation metrics are difficult to produce for this algorithm.
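Before the full implementation, here is a minimal base-R sketch of the procedure on a five-book toy similarity matrix. The names `sim`, `user_ratings_toy`, and `recommend_from_similarity` are hypothetical and only mirror the logic of the functions below; they are not the functions used in the rest of this report.

```r
# Toy item-item similarity matrix for five books (symmetric, 1 on the diagonal).
books_toy <- c("b1", "b2", "b3", "b4", "b5")
sim <- matrix(
  c(1.0, 0.9, 0.1, 0.8, 0.2,
    0.9, 1.0, 0.2, 0.7, 0.1,
    0.1, 0.2, 1.0, 0.3, 0.9,
    0.8, 0.7, 0.3, 1.0, 0.4,
    0.2, 0.1, 0.9, 0.4, 1.0),
  nrow = 5, byrow = TRUE, dimnames = list(books_toy, books_toy)
)

# One user's ratings (hypothetical).
user_ratings_toy <- data.frame(
  book_id = c("b1", "b2", "b3"),
  rating  = c(5, 4, 1),
  stringsAsFactors = FALSE
)

recommend_from_similarity <- function(user_ratings, sim, n_top = 2, k = 2) {
  # 1. Take the user's n_top highest-rated books.
  ranked <- user_ratings[order(-user_ratings$rating), ]
  liked <- head(ranked$book_id, n_top)
  # 2. Average the similarity rows of those books.
  scores <- colMeans(sim[liked, , drop = FALSE])
  # 3. Drop books the user has already rated.
  scores <- scores[!(names(scores) %in% user_ratings$book_id)]
  # 4. Return the k highest-scoring remaining books.
  head(sort(scores, decreasing = TRUE), k)
}

toy_recs <- recommend_from_similarity(user_ratings_toy, sim)
print(toy_recs)
```

The user liked b1 and b2, so b4 (similar to both) ranks ahead of b5. Each call only touches the rows for one user's favorites, which is why the approach is cheap per user but expensive to run for every user at once.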
#correlation_matrix <- similarity_input %>%
# ml_corr()
user_ratings <- ratings %>%
group_by(user_id) %>%
summarise(n_ratings = n())
top_ratings <- function(named_user, ratings_df, n){
ratings_df %>%
filter(user_id == named_user) %>%
arrange(desc(rating)) %>%
head(n)
}
get_rated_books <- function(named_user, ratings_df){
ratings_df %>%
filter(user_id == named_user)
}
calc_user_lda <- function(user_id, ratings_df, similarity_df, k){
similarity_df %>%
inner_join(top_ratings(user_id, ratings_df, 10), 'book_id') %>%
select(-c(book_id, rating, row_id, user_id)) %>%
summarise_all(mean) %>%
gather(row_index, rating) %>%
mutate(row_index = str_sub(row_index, 2) %>% as.numeric()) %>%
inner_join(book_to_id, c( 'row_index' = 'X1')) %>%
anti_join(get_rated_books(user_id, ratings_df), 'book_id') %>% #remove books the user has rated
select(book_id, rating) %>%
drop_na() %>%
inner_join(books, 'book_id') %>%
select(book_id, title, rating) %>%
arrange(desc(rating)) %>%
head(k)
}
calc_user_lda(1, ratings, lda_sim, 20) %>%
kable() %>% kable_styling(bootstrap_options = 'striped')
| book_id | title | rating |
|---|---|---|
| 1574 | The Left Hand of Darkness | 0.6303631 |
| 2574 | The Black Ice (Harry Bosch, #2; Harry Bosch Universe, #2) | 0.6303631 |
| 5018 | Theodore Boone: Kid Lawyer (Theodore Boone, #1) | 0.6303631 |
| 78 | The Devil Wears Prada (The Devil Wears Prada, #1) | 0.6266320 |
| 3076 | The French Lieutenant’s Woman | 0.6174999 |
| 4234 | Twilight (The Mediator, #6) | 0.6174999 |
| 8354 | Twilight (Warriors: The New Prophecy, #5) | 0.6174999 |
| 9655 | Treasure (Dirk Pitt, #9) | 0.6174999 |
| 62 | The Golden Compass (His Dark Materials, #1) | 0.6136638 |
| 2439 | Way of the Peaceful Warrior: A Book That Changes Lives | 0.6136638 |
| 4876 | The Silent Girl (Rizzoli & Isles, #9) | 0.6136638 |
| 9340 | Tell Me Three Things | 0.6136638 |
| 562 | The Way of Kings (The Stormlight Archive, #1) | 0.6135204 |
| 2734 | Dorothy Must Die (Dorothy Must Die, #1) | 0.6122484 |
| 794 | Doctor Sleep (The Shining, #2) | 0.6120954 |
| 5165 | It Happened One Autumn (Wallflowers, #2) | 0.6116760 |
| 5270 | The Christmas Box (The Christmas Box, #1) | 0.6110079 |
| 145 | Deception Point | 0.6089313 |
| 8723 | Before We Met | 0.6086956 |
| 6042 | True History of the Kelly Gang | 0.6076559 |
There isn’t an obvious pattern here, but I’m also not an avid reader.
The tags were preprocessed, then fed into a simple neural network using only dense layers. The architecture was inspired by this article: https://towardsdatascience.com/creating-a-hybrid-content-collaborative-movie-recommender-using-deep-learning-cc8b431618af
The idea is that the middle layer, the encoding layer, becomes a low-dimensional representation of the set of tags for a particular book. Related tags are compressed into the same dimension, similar to the way SVD creates latent dimensions. For instance, the model should learn that “fantasy” and “sci-fi fantasy” are related because they have a number of co-occurrences.
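To make the SVD comparison concrete, here is a small sketch that uses a plain SVD (not the autoencoder itself) on an invented book-by-tag matrix. The matrix and tag names are assumptions for illustration; the point is only that tags which co-occur end up pointing the same way in the latent space.

```r
# Toy book-by-tag counts (hypothetical): books 1-3 carry the two fantasy tags,
# books 4-6 carry the two romance tags.
tag_matrix <- rbind(
  c(3, 2, 0, 0),
  c(2, 3, 0, 0),
  c(4, 3, 0, 0),
  c(0, 0, 3, 2),
  c(0, 0, 2, 4),
  c(0, 0, 3, 3)
)
colnames(tag_matrix) <- c("fantasy", "sci-fi-fantasy", "romance", "chick-lit")

# Keep two latent dimensions; each row of tag_loadings describes one tag.
decomp <- svd(tag_matrix)
tag_loadings <- decomp$v[, 1:2]
rownames(tag_loadings) <- colnames(tag_matrix)

# Cosine similarity between tags in the 2-d latent space.
unit <- tag_loadings / sqrt(rowSums(tag_loadings^2))
tag_cosine <- unit %*% t(unit)
print(round(tag_cosine, 3))
```

The autoencoder plays the same role with nonlinear layers, so its encodings will not match an SVD exactly, but the intuition about co-occurring tags carries over.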
The tag data is supplied in a bag-of-words-like format. We want to normalize the “count” to control for popularity and cast it into a wide format.
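The following transformations are performed: tags that contain four or more consecutive digits or the word “book” are removed; only tags applied to at least 500 books are kept; each count is divided by the book’s mean tag count and multiplied by an IDF-style weight; and the result is log-transformed and spread into a wide book-by-tag matrix.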
tags_test <- tags %>%
filter(goodreads_book_id <= 100) %>%
inner_join(tag_descs, 'tag_id') %>%
inner_join(books, 'goodreads_book_id') %>%
select(goodreads_book_id, tag_id, tag_name, title) %>%
arrange(goodreads_book_id, tag_id)
tags_expanded <- tags %>%
inner_join(tag_descs, 'tag_id') %>%
inner_join(books, 'goodreads_book_id') %>%
select(goodreads_book_id, tag_id, tag_name, title, count) %>%
filter(!str_detect(tag_name, '\\d{4,}') & str_detect(tag_name, '\\w+')
& !str_detect(tag_name, 'book'))
mean_counts <- tags_expanded %>%
group_by(goodreads_book_id) %>%
summarise(mean_count = mean(count))
tag_counts <- tags_expanded %>%
group_by(tag_id) %>%
summarise(freq = n()) %>%
mutate(idf_weight = log(10000/(freq + 1))) %>%
arrange(desc(freq)) %>%
filter(freq >= 500) %>%
select(idf_weight, tag_id)
tags_fixed <- tags_expanded %>%
group_by(tag_id, goodreads_book_id) %>%
summarise(tag_count = sum(count)) %>%
ungroup() %>%
inner_join(tag_counts, 'tag_id') %>%
inner_join(mean_counts, 'goodreads_book_id') %>%
mutate(tag_count = (tag_count/mean_count) * idf_weight) %>%
inner_join(tag_descs, 'tag_id') %>%
inner_join(books, 'goodreads_book_id') %>%
select(tag_count, goodreads_book_id, tag_id, tag_name, title)
max_tag <- tags_fixed %>% select(tag_id) %>% distinct() %>% nrow()
max_count <- max(tags_fixed$tag_count, na.rm = TRUE)
book_count <- tags_fixed %>% select(goodreads_book_id) %>% distinct %>% nrow()
ggplot(tags_fixed) + geom_density(aes(x = tag_count)) + labs(title = 'Tag Count Raw') + # raw scale, for comparison with the logged plot below
theme_minimal()
ggplot(tags_fixed) + geom_density(aes(x = log(tag_count))) + labs(title = 'Tag Count Logged') +
theme_minimal()
bag_of_words <- tags_fixed %>%
select(tag_id, tag_count, goodreads_book_id) %>%
mutate(tag_count = log(tag_count + 1)) %>%
spread(tag_id, tag_count) %>%
replace(., is.na(.), 0) %>%
arrange(goodreads_book_id) %>%
select(-goodreads_book_id) %>%
as.matrix ()
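The weighting used above can be traced by hand on a tiny example. This is a base-R sketch on invented data: `toy_tags` is hypothetical, and `n_books` stands in for the constant 10000 used in the real pipeline.

```r
# Toy tag counts (hypothetical): book 1 is far more heavily tagged than the rest.
toy_tags <- data.frame(
  book_id = c(1, 1, 2, 2, 3, 4, 5),
  tag     = c("fantasy", "magic", "fantasy", "romance", "fantasy", "romance", "magic"),
  count   = c(100, 50, 10, 30, 8, 6, 4),
  stringsAsFactors = FALSE
)
n_books <- length(unique(toy_tags$book_id))

# Mean raw count per book: controls for how heavily a book is tagged overall.
mean_count <- tapply(toy_tags$count, toy_tags$book_id, mean)

# Number of books carrying each tag, and the matching IDF-style weight.
freq <- table(toy_tags$tag)
idf_weight <- setNames(log(n_books / (as.numeric(freq) + 1)), names(freq))

# Normalized, IDF-weighted count, then the log(x + 1) transform.
toy_tags$weight <-
  (toy_tags$count / as.numeric(mean_count[as.character(toy_tags$book_id)])) *
  as.numeric(idf_weight[toy_tags$tag])
toy_tags$log_weight <- log(toy_tags$weight + 1)

# Wide book-by-tag matrix with zeros where a book lacks a tag.
bag <- matrix(0, nrow = n_books, ncol = length(idf_weight),
              dimnames = list(names(mean_count), names(idf_weight)))
bag[cbind(as.character(toy_tags$book_id), toy_tags$tag)] <- toy_tags$log_weight
print(round(bag, 3))
```

Book 1's raw "fantasy" count is more than ten times book 3's, yet after dividing by each book's mean count the two weights land on a similar scale, which is the popularity control the normalization is meant to provide.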
This is one of the possible areas for improvement, as my experience with deep learning is somewhat limited. We use two mirrored sequences of layers, with the encoding layer sandwiched between them. The idea is for the network to learn a 25-dimensional vector that describes the state of the 400+ dimensional tag vector and can reproduce it.
encoder <- keras_model_sequential(name = 'encoder')
encoder %>%
layer_dense(units = 256, activation = 'relu', input_shape = c(max_tag))%>%
layer_dropout(rate = .3) %>%
layer_dense(units = 128, activation = 'relu', input_shape = c(256))%>%
layer_dropout(rate = .3) %>%
layer_dense(units = 64, activation = 'relu', input_shape = c(128))%>%
layer_dropout(rate = .3) %>%
layer_dense(units = 25, activation = 'relu', name = 'tag_encodings', input_shape = c(64))
decoder <- keras_model_sequential()
decoder %>%
layer_dense(units = 64, activation = 'relu', input_shape = c(25), name = 'embeddings_layer') %>%
layer_dropout(rate = .3) %>%
layer_dense(units = 128, activation = 'relu', input_shape = c(64))%>%
layer_dropout(rate = .3) %>%
layer_dense(units = 256, activation = 'relu', input_shape = c(128))%>%
layer_dropout(rate = .3) %>%
layer_dense(units = max_tag, input_shape = c(256)) %>%
layer_activation('sigmoid', input_shape = c(max_tag))
model <- keras_model_sequential()
model %>%
encoder %>%
decoder %>%
keras::compile(loss = 'mse', optimizer = 'adam', metrics = c('mse'))
model %>% fit(
bag_of_words[1:8000, ],
bag_of_words[1:8000, ],
epochs = 5,
batch_size = 10,
shuffle = FALSE,
verbose = FALSE,
validation_data = list(bag_of_words[8001:10000, ], bag_of_words[8001:10000, ])
)
model_outputs <- get_layer(model, 'encoder') %>% get_layer('tag_encodings')
intermediate_layer_model <- keras_model(inputs = encoder$input,
outputs = model_outputs$output)
intermediate_output <- predict(intermediate_layer_model, bag_of_words)
# vectors of all zeros provide no information and make similarity less accurate, so drop those columns
fixed_embeddings <- intermediate_output[, colSums(intermediate_output != 0) > 0]
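To see why the all-zero columns are worth dropping, consider how they affect a Pearson correlation between two books: every dead unit adds an identical (0, 0) point to both vectors, which pulls the correlation upward regardless of how the live units compare. A small sketch on invented encodings (`enc` is hypothetical, not output from the trained network):

```r
# Toy encodings (hypothetical): 3 books x 5 units, where units 3 and 4 never activate.
enc <- rbind(
  book_a = c(1.0, 0.2, 0, 0, 0.5),
  book_b = c(0.1, 1.0, 0, 0, 0.6),
  book_c = c(0.9, 0.3, 0, 0, 0.4)
)

# Keep only units that are non-zero for at least one book.
live <- colSums(enc != 0) > 0
enc_live <- enc[, live]

# Book-to-book correlation with and without the dead units.
sim_all  <- cor(t(enc))
sim_live <- cor(t(enc_live))

print(round(sim_all["book_a", "book_b"], 3))
print(round(sim_live["book_a", "book_b"], 3))
```

With the dead units included, book_a and book_b look roughly uncorrelated; once they are removed, the two books turn out to be strongly negatively correlated on the units that actually carry signal.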
Similar to the LDA model, we create a similarity matrix for all books using the 25-dimensional encodings. The first 20 books are examined to determine the coherence of the model.
matrix_features <- fixed_embeddings %>% t() %>% cor()
matrix_features_small <- matrix_features[1:20, 1:20]
library(corrplot)
ordered_books <- books %>% arrange(goodreads_book_id)
first_20 <- ordered_books %>% arrange(goodreads_book_id) %>% head(20) %>% mutate(title = str_sub(title,1,20))
colnames(matrix_features_small) <- first_20$title
rownames(matrix_features_small) <- first_20$title
corrplot(matrix_features_small)
The pattern here is obvious, though it does help that there are so many Harry Potter books. The middle cluster consists of books related to travel and adventure. Harry Potter being related to Lord of the Rings is also a good sign.
Again similar to LDA, we produce recommendations using the similarity matrix and same algorithm.
row_to_good_reads <- ordered_books %>%
rownames_to_column('row_id') %>%
mutate(row_id = as.integer(row_id)) %>%
select(row_id, goodreads_book_id)
good_reads_lookup <- select(books, goodreads_book_id, book_id)
book_simarilarity <- matrix_features %>%
data.frame() %>%
rownames_to_column('row_id') %>%
mutate(row_id = as.integer(row_id)) %>%
inner_join(row_to_good_reads, 'row_id') %>%
inner_join(books, 'goodreads_book_id') %>%
select(-goodreads_book_id)
calc_user_tags <- function(user_id, ratings_df, similarity_df, k, n = 10){
similarity_df %>%
inner_join(top_ratings(user_id, ratings_df, n), 'book_id') %>%
select(starts_with('X')) %>%
summarise_all(mean) %>%
gather(row_index, rating) %>%
mutate(row_index = str_sub(row_index, 2) %>% as.numeric()) %>%
inner_join(row_to_good_reads, c( 'row_index' = 'row_id')) %>%
inner_join(good_reads_lookup, 'goodreads_book_id') %>%
anti_join(get_rated_books(user_id, ratings_df), 'book_id') %>% #remove books the user has rated
select(book_id, rating) %>%
drop_na() %>%
inner_join(books, 'book_id') %>%
select(book_id, title, rating) %>%
arrange(desc(rating)) %>%
head(k)
}
calc_user_tags(1, ratings, book_simarilarity, 20)%>%
kable() %>% kable_styling(bootstrap_options = 'striped')
| book_id | title | rating |
|---|---|---|
| 6413 | Home (Gilead, #2) | 0.6739000 |
| 4338 | The Price of Salt | 0.6727841 |
| 9612 | The Charterhouse of Parma | 0.6708106 |
| 930 | Olive Kitteridge | 0.6706485 |
| 7485 | سینوهه | 0.6704477 |
| 1435 | Sophie’s Choice | 0.6685887 |
| 9632 | Falling Man | 0.6678417 |
| 9761 | How Green Was My Valley | 0.6677593 |
| 8975 | Girl With Curious Hair | 0.6676609 |
| 5596 | Quo Vadis | 0.6672832 |
| 658 | The Corrections | 0.6667213 |
| 6727 | Burmese Days | 0.6665403 |
| 7783 | The Leopard | 0.6659457 |
| 4100 | Tinkers | 0.6656806 |
| 5456 | How the García Girls Lost Their Accents | 0.6654594 |
| 7974 | Silence | 0.6654357 |
| 4788 | The Fortress of Solitude | 0.6650515 |
| 5911 | The Book of Illusions | 0.6649912 |
| 3511 | Eva Luna | 0.6647221 |
| 669 | The House of the Spirits | 0.6646765 |
Using a standard Spark ALS implementation, constructing a single model was easier than expected, but tuning on a grid wasn’t practical when running Spark locally.
Spark has built-in functions for grid search, allowing us to easily optimize RMSE. The ranks in the grid will be higher than when we were working with 100K-rating datasets, because the number of users and books is an order of magnitude higher than in previous projects.
estimator <- ml_pipeline(spark_conn) %>%
ml_als(rating_col = 'rating', user_col = 'user_id', item_col = 'book_id', max_iter = 10, cold_start_strategy = 'drop')
#als_grid <- list(als = list(rank = c(20, 30, 50), reg_param = c(.05, .1)))
als_grid <- list(als = list(rank = c(20,30,50)))
cv <- ml_cross_validator(
spark_conn,
estimator = estimator,
evaluator = ml_regression_evaluator(spark_conn, label_col = 'rating'),
estimator_param_maps = als_grid,
num_folds = 2
)
als_cv <- ml_fit(cv, ratings_sc)
ml_validation_metrics(als_cv) %>% kable() %>% kable_styling(bootstrap_options = "striped", full_width = F)
| rmse | rank_1 |
|---|---|
| 0.8321923 | 20 |
| 0.8319314 | 30 |
| 0.8334796 | 50 |
This RMSE is similar to ALS implementations on other ratings datasets, such as MovieLens. RMSE varies by less than 0.002 across the three ranks, so a low rank is a reasonable choice; the relatively low book dimensionality could be why a higher rank doesn’t help.
To assess the practical quality of the ALS portion of our recommender, we’ll look at precision and recall at two levels of recommendations (10 and 20). This will give us an idea of how quickly the quality drops off. If the recall increases but the precision stays level, we would be more comfortable at higher levels of K.
metrics_at_k <- vector('list', length = 2)
for (k in 1:2){
temp_dfs <- vector('list', length = 2)
for (i in 1:2){
set.seed(42 + i)
partitioned_set <- ratings_sc %>%
sdf_random_split(training = .8, testing = .2)
als_mod <- partitioned_set[[1]] %>%
ml_als(rating_col = 'rating', user_col = 'user_id', item_col = 'book_id', max_iter = 10, rank = 20, reg_param = .1,
implicit_prefs = TRUE)
recs <- ml_recommend(als_mod, type = 'item', k*10) %>%
full_join(partitioned_set[[2]], c('user_id', 'book_id'), suffix = c('_pred', '_act')) %>%
mutate(truth_cat = ifelse(is.na(rating_pred) == 1 & is.na(rating_act) == 0, 'FN', '')) %>%
mutate(truth_cat = ifelse(is.na(rating_pred) == 0 & is.na(rating_act) == 1, 'FP', truth_cat)) %>%
mutate(truth_cat = ifelse(is.na(rating_pred) == 0 & is.na(rating_act) == 0, 'TP', truth_cat)) %>%
group_by(truth_cat) %>%
summarise(tot_obs = n()) %>%
ungroup() %>%
collect()
recs_cm <- recs %>%
spread(truth_cat, tot_obs) %>%
mutate(
precision = TP/(TP + FP),
recall = TP/(TP + FN),
F1 = 2*((precision*recall)/(precision + recall))
)
temp_dfs[[i]] <- recs_cm
}
summary_df <- bind_rows(temp_dfs) %>%
summarise_all(mean) %>%
add_column('k' = k*10)
metrics_at_k[[k]] <- summary_df
}
metrics_at_k %>%
bind_rows() %>% kable() %>% kable_styling(bootstrap_options = "striped", full_width = F)
| FN | FP | TP | precision | recall | F1 | k |
|---|---|---|---|---|---|---|
| 1132308 | 471047.5 | 63192.5 | 0.1182849 | 0.0528586 | 0.0730659 | 10 |
| 1078839 | 951819.0 | 116661.0 | 0.1091841 | 0.0975834 | 0.1030583 | 20 |
In the first iteration, we didn’t use implicit prefs, and the precision and recall were considerably lower. This is somewhat surprising given that we have explicit ratings. Our goal is to predict the books users will read, not to optimize RMSE. The confusion matrix statistics are therefore more important, so we will keep this parameter set to TRUE in our final recommender.
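The precision and recall figures above pool true positives, false positives, and false negatives across all users. As a minimal base-R sketch of that bookkeeping on invented data (`recs_toy` and `test_toy` are hypothetical stand-ins for the recommendations and the held-out split):

```r
# Hypothetical top-k recommendations (k = 2 per user).
recs_toy <- data.frame(
  user_id = c(1, 1, 2, 2),
  book_id = c(10, 11, 10, 12)
)
# Held-out ratings: the books each user actually rated in the test split.
test_toy <- data.frame(
  user_id = c(1, 1, 1, 2),
  book_id = c(10, 13, 14, 12)
)

# Key each (user, book) pair so the two sets can be compared directly.
rec_keys  <- paste(recs_toy$user_id, recs_toy$book_id)
test_keys <- paste(test_toy$user_id, test_toy$book_id)

tp <- sum(rec_keys %in% test_keys)     # recommended and in the test set
fp <- sum(!(rec_keys %in% test_keys))  # recommended but not in the test set
fn <- sum(!(test_keys %in% rec_keys))  # in the test set but not recommended

precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)
print(c(precision = precision, recall = recall, f1 = f1))
```

Raising k adds rows to the recommendation set, which can only lower or hold the false-negative count, so recall tends to rise with k while precision depends on how good the extra recommendations are.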
Again with user #1, we produce our recommendations for the ALS model.
final_als <- ml_als(ratings_sc, rating_col = 'rating', user_col = 'user_id', item_col = 'book_id', max_iter = 10, rank = 20, reg_param = .1, implicit_prefs = TRUE)
calc_als <- function(model, named_user, k){
ml_recommend(model, type = 'item', 50) %>% # use the model passed in rather than the global final_als
select(book_id, user_id, rating) %>%
filter(named_user == user_id) %>%
collect() %>%
select(user_id, book_id, rating) %>%
inner_join(books, 'book_id') %>%
select(book_id, title, rating) %>%
anti_join(get_rated_books(named_user, ratings), 'book_id') %>%
head(k)
}
calc_als(final_als, 1, 10) %>% kable() %>% kable_styling(bootstrap_options = "striped", full_width = F)
| book_id | title | rating |
|---|---|---|
| 5 | The Great Gatsby | 0.8578916 |
| 8 | The Catcher in the Rye | 0.8275669 |
| 118 | The Joy Luck Club | 0.7824225 |
| 26 | The Da Vinci Code (Robert Langdon, #2) | 0.7787534 |
| 15 | The Diary of a Young Girl | 0.7486388 |
| 195 | The Guernsey Literary and Potato Peel Pie Society | 0.7476287 |
| 291 | Cutting for Stone | 0.7297267 |
| 14 | Animal Farm | 0.7212889 |
| 1 | The Hunger Games (The Hunger Games, #1) | 0.7151593 |
| 63 | Wuthering Heights | 0.7139446 |
We now see how much of our total dataset was recommended.
all_recs <- ml_recommend(final_als, type = 'item', 10) %>% # final model, not the last split model left over from the loop
group_by(book_id) %>%
summarize(book_count = n()) %>%
collect()
ggplot(all_recs) + geom_bar(aes(x = reorder(book_id, book_count), y = book_count), stat = 'identity') +
labs(title = 'Recommendations by Book', y = 'Book Count', x = 'Book') +
theme_minimal() + # apply the complete theme first so it does not reset the tweak below
theme(
axis.text.x = element_blank()
)
We now bring everything together and create a function that will produce our final recommendations for a given user. It also prints the user’s top-rated books to give us an idea of the user’s tastes.
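The merge step is simple enough to trace by hand. The following base-R sketch uses invented scores (`lda_toy`, `tags_toy`, and `als_toy` are hypothetical) and mirrors the ordering used in the function below: books recommended by more models first, then by score.

```r
# Hypothetical top picks from each of the three models.
lda_toy  <- data.frame(book_id = c(1, 2), rating = c(0.63, 0.61), source = "LDA")
tags_toy <- data.frame(book_id = c(2, 3), rating = c(0.67, 0.66), source = "Tags")
als_toy  <- data.frame(book_id = c(4, 3), rating = c(0.85, 0.70), source = "ALS")

# Union of the three sets, keeping one row per (book, source) pair.
full_set <- rbind(lda_toy, tags_toy, als_toy)

# How many of the three models recommended each book.
full_set$book_count <- as.integer(ave(full_set$rating, full_set$book_id, FUN = length))

# Books recommended by more models come first; ties break on the model score.
ranked <- full_set[order(-full_set$book_count, -full_set$rating), ]
print(ranked)
```

Because ties fall back on the raw score, a model whose scores run higher will fill the top of the list whenever no book is recommended by more than one model, which is consistent with ALS dominating the table for user 1.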
full_recs <- function (user_id, ratings_df, recs = 30){
lda_recs <- calc_user_lda(user_id, ratings_df, lda_sim, ceiling(recs/2)) %>% # use the ratings_df argument, not the global
add_column('source' = 'LDA')
tag_recs <- calc_user_tags(user_id, ratings_df, book_simarilarity, ceiling(recs/2)) %>%
add_column('source' = 'Tags')
als_recs <- calc_als(final_als, user_id, ceiling(recs/2)) %>%
add_column('source' = 'ALS')
best_10 <- top_ratings(user_id, ratings_df, 10) %>%
inner_join(books, 'book_id') %>%
select(title, rating)
full_set <- rbind(lda_recs, tag_recs, als_recs)
recs <- full_set %>%
group_by(book_id) %>%
summarise(book_count = n()) %>%
ungroup() %>%
inner_join(full_set, 'book_id') %>%
arrange(desc(book_count), desc(rating)) %>%
head(recs)
print(best_10)
return(recs)
}
full_recs(1, ratings) %>% kable() %>% kable_styling(bootstrap_options = "striped", full_width = F)
## # A tibble: 10 x 2
## title rating
## <chr> <dbl>
## 1 The Shadow of the Wind (The Cemetery of Forgotten Books, #1) 5
## 2 Gilead (Gilead, #1) 5
## 3 The Kite Runner 5
## 4 Peace Like a River 5
## 5 Divine Secrets of the Ya-Ya Sisterhood 5
## 6 The Alchemist 5
## 7 To Kill a Mockingbird 5
## 8 Antigone (The Theban Plays, #3) 5
## 9 Ender's Game (Ender's Saga, #1) 5
## 10 The Death of Ivan Ilych 5
| book_id | book_count | title | rating | source |
|---|---|---|---|---|
| 5 | 1 | The Great Gatsby | 0.8578916 | ALS |
| 8 | 1 | The Catcher in the Rye | 0.8275669 | ALS |
| 118 | 1 | The Joy Luck Club | 0.7824225 | ALS |
| 26 | 1 | The Da Vinci Code (Robert Langdon, #2) | 0.7787534 | ALS |
| 15 | 1 | The Diary of a Young Girl | 0.7486388 | ALS |
| 195 | 1 | The Guernsey Literary and Potato Peel Pie Society | 0.7476287 | ALS |
| 291 | 1 | Cutting for Stone | 0.7297267 | ALS |
| 14 | 1 | Animal Farm | 0.7212889 | ALS |
| 1 | 1 | The Hunger Games (The Hunger Games, #1) | 0.7151593 | ALS |
| 63 | 1 | Wuthering Heights | 0.7139446 | ALS |
| 28 | 1 | Lord of the Flies | 0.7057348 | ALS |
| 9 | 1 | Angels & Demons (Robert Langdon, #1) | 0.7037891 | ALS |
| 172 | 1 | Anna Karenina | 0.7002230 | ALS |
| 2 | 1 | Harry Potter and the Sorcerer’s Stone (Harry Potter, #1) | 0.6899099 | ALS |
| 6413 | 1 | Home (Gilead, #2) | 0.6739000 | Tags |
| 4338 | 1 | The Price of Salt | 0.6727841 | Tags |
| 9612 | 1 | The Charterhouse of Parma | 0.6708106 | Tags |
| 930 | 1 | Olive Kitteridge | 0.6706485 | Tags |
| 7485 | 1 | سینوهه | 0.6704477 | Tags |
| 1435 | 1 | Sophie’s Choice | 0.6685887 | Tags |
| 9632 | 1 | Falling Man | 0.6678417 | Tags |
| 9761 | 1 | How Green Was My Valley | 0.6677593 | Tags |
| 8975 | 1 | Girl With Curious Hair | 0.6676609 | Tags |
| 5596 | 1 | Quo Vadis | 0.6672832 | Tags |
| 658 | 1 | The Corrections | 0.6667213 | Tags |
| 6727 | 1 | Burmese Days | 0.6665403 | Tags |
| 7783 | 1 | The Leopard | 0.6659457 | Tags |
| 4100 | 1 | Tinkers | 0.6656806 | Tags |
| 5456 | 1 | How the García Girls Lost Their Accents | 0.6654594 | Tags |
| 1574 | 1 | The Left Hand of Darkness | 0.6303631 | LDA |
The ALS model produced by far the best recommendations, but the other models are not without value. ALS is going to recommend more of the same in most cases. The two content-based models, even though not tuned, will provide the user with some nice under-the-radar titles.
Areas for improvement