Overview

I will build a book recommendation system using the Goodreads 10K dataset, which contains book ratings on a scale of 1 to 5. I will compare four methods on this dataset: SVD, Item Based Collaborative Filtering (IBCF), User Based Collaborative Filtering (UBCF), and Popular. As with all book ratings datasets, this one is very sparse: there are 10K books and 53K users, so the roughly 6M ratings fill only about 1% of the 530M possible user-book pairs, which makes the matrix about 99% sparse. Because the dataset is this large, I will work with a subset of it to keep the analysis less time consuming.
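
The sparsity figure can be checked with a few lines of arithmetic. This is a minimal sketch that assumes the dimensions reported later in this analysis (53,424 users, 10,000 books, 5,976,479 ratings); the variable names are illustrative only.

```r
# Dimensions and rating count as reported by dim() later in this analysis
n_users   <- 53424
n_books   <- 10000
n_ratings <- 5976479

# Share of user-book pairs that have a rating, and its complement
density  <- n_ratings / (n_users * n_books)
sparsity <- 1 - density

round(c(density = density, sparsity = sparsity), 4)
```

The density comes out at a little over 1%, which is where the "about 99% sparse" figure comes from.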

The data can be found here: https://github.com/zygmuntz/goodbooks-10k

Analysis

Loading Packages

Loading recommenderlab, dplyr, ggplot2, and kableExtra packages.

options(warn=-1)
suppressMessages(library("kableExtra"))
suppressMessages(library("ggplot2"))
suppressMessages(library("recommenderlab"))
suppressMessages(library("dplyr"))

Loading the data

In the code below we are loading the data from two csv files that include Ratings and detailed Book Information.

BX10K <- read.csv(file="https://raw.githubusercontent.com/zygmuntz/goodbooks-10k/master/ratings.csv", header=TRUE, sep=",", stringsAsFactors = F)
BX10Ktitles <- read.csv("https://raw.githubusercontent.com/zygmuntz/goodbooks-10k/master/books.csv", header=TRUE, sep=",", stringsAsFactors = F)

Inspecting Ratings dataset

Let’s inspect the ratings dataset with the “dim” and “summary” functions. There are almost 6 million ratings on a scale of 1 to 5. The average rating is 3.92 and the median is 4, which tells me that books are rated fairly highly overall. With a first quartile of 3 and a third quartile of 5, at least three quarters of all ratings are 3 stars or higher, so the distribution is skewed toward the top of the scale rather than evenly spread.

dim(BX10K)
## [1] 5976479       3
summary(BX10K$rating)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    3.00    4.00    3.92    5.00    5.00
head(BX10K, 10) %>% kable(caption = "Ratings Data") %>% kable_styling("striped", full_width = TRUE)
Ratings Data
user_id book_id rating
1 258 5
2 4081 4
2 260 5
2 9296 5
2 2318 3
2 26 4
2 315 3
2 33 4
2 301 5
2 2686 5

Inspecting Book Titles data

Now let’s inspect our Book Titles dataset. The dataset includes some interesting details we can explore, such as publication year, average ratings, and languages.

dim(BX10Ktitles)
## [1] 10000    23
head(BX10Ktitles, 10) %>% kable(caption = "Books Data") %>% kable_styling("striped", full_width = TRUE)
Books Data
book_id goodreads_book_id best_book_id work_id books_count isbn isbn13 authors original_publication_year original_title title language_code average_rating ratings_count work_ratings_count work_text_reviews_count ratings_1 ratings_2 ratings_3 ratings_4 ratings_5 image_url small_image_url
1 2767052 2767052 2792775 272 439023483 9.780439e+12 Suzanne Collins 2008 The Hunger Games The Hunger Games (The Hunger Games, #1) eng 4.34 4780653 4942365 155254 66715 127936 560092 1481305 2706317 https://images.gr-assets.com/books/1447303603m/2767052.jpg https://images.gr-assets.com/books/1447303603s/2767052.jpg
2 3 3 4640799 491 439554934 9.780440e+12 J.K. Rowling, Mary GrandPré 1997 Harry Potter and the Philosopher’s Stone Harry Potter and the Sorcerer’s Stone (Harry Potter, #1) eng 4.44 4602479 4800065 75867 75504 101676 455024 1156318 3011543 https://images.gr-assets.com/books/1474154022m/3.jpg https://images.gr-assets.com/books/1474154022s/3.jpg
3 41865 41865 3212258 226 316015849 9.780316e+12 Stephenie Meyer 2005 Twilight Twilight (Twilight, #1) en-US 3.57 3866839 3916824 95009 456191 436802 793319 875073 1355439 https://images.gr-assets.com/books/1361039443m/41865.jpg https://images.gr-assets.com/books/1361039443s/41865.jpg
4 2657 2657 3275794 487 61120081 9.780061e+12 Harper Lee 1960 To Kill a Mockingbird To Kill a Mockingbird eng 4.25 3198671 3340896 72586 60427 117415 446835 1001952 1714267 https://images.gr-assets.com/books/1361975680m/2657.jpg https://images.gr-assets.com/books/1361975680s/2657.jpg
5 4671 4671 245494 1356 743273567 9.780743e+12 F. Scott Fitzgerald 1925 The Great Gatsby The Great Gatsby eng 3.89 2683664 2773745 51992 86236 197621 606158 936012 947718 https://images.gr-assets.com/books/1490528560m/4671.jpg https://images.gr-assets.com/books/1490528560s/4671.jpg
6 11870085 11870085 16827462 226 525478817 9.780525e+12 John Green 2012 The Fault in Our Stars The Fault in Our Stars eng 4.26 2346404 2478609 140739 47994 92723 327550 698471 1311871 https://images.gr-assets.com/books/1360206420m/11870085.jpg https://images.gr-assets.com/books/1360206420s/11870085.jpg
7 5907 5907 1540236 969 618260307 9.780618e+12 J.R.R. Tolkien 1937 The Hobbit or There and Back Again The Hobbit en-US 4.25 2071616 2196809 37653 46023 76784 288649 665635 1119718 https://images.gr-assets.com/books/1372847500m/5907.jpg https://images.gr-assets.com/books/1372847500s/5907.jpg
8 5107 5107 3036731 360 316769177 9.780317e+12 J.D. Salinger 1951 The Catcher in the Rye The Catcher in the Rye eng 3.79 2044241 2120637 44920 109383 185520 455042 661516 709176 https://images.gr-assets.com/books/1398034300m/5107.jpg https://images.gr-assets.com/books/1398034300s/5107.jpg
9 960 960 3338963 311 1416524797 9.781417e+12 Dan Brown 2000 Angels & Demons Angels & Demons (Robert Langdon, #1) en-CA 3.85 2001311 2078754 25112 77841 145740 458429 716569 680175 https://images.gr-assets.com/books/1303390735m/960.jpg https://images.gr-assets.com/books/1303390735s/960.jpg
10 1885 1885 3060926 3455 679783261 9.780680e+12 Jane Austen 1813 Pride and Prejudice Pride and Prejudice eng 4.24 2035490 2191465 49152 54700 86485 284852 609755 1155673 https://images.gr-assets.com/books/1320399351m/1885.jpg https://images.gr-assets.com/books/1320399351s/1885.jpg
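
The per-star counts in the table make it possible to check how average_rating relates to them. This is a small worked example, assuming that average_rating is the count-weighted mean of ratings_1 through ratings_5; the numbers for The Hunger Games appear to bear that out.

```r
# Star counts for "The Hunger Games", copied from ratings_1 ... ratings_5 above
stars  <- 1:5
counts <- c(66715, 127936, 560092, 1481305, 2706317)

# Count-weighted mean of the star values
avg <- sum(stars * counts) / sum(counts)
round(avg, 2)

# Share of ratings at each star level
round(counts / sum(counts), 3)
```

The counts sum to 4,942,365, which matches work_ratings_count for this book, and the weighted mean rounds to the 4.34 shown in average_rating.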

Top 10 Highest Rating

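Let’s find the Top 10 highest rated books. It is interesting that pretty much all of these are part of a series or a collection rather than standalone books, and several of them belong to the same series (Calvin and Hobbes, Harry Potter). The table has 11 rows rather than 10 because top_n() keeps ties, and three books are tied at 4.73.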

BX10Ktitles %>% 
  arrange(desc(average_rating)) %>% 
  top_n(10,wt = average_rating) %>% 
  select(title, original_publication_year, ratings_count, average_rating) %>%
  kable(caption = "Average Ratings") %>% kable_styling("striped", full_width = TRUE)
Average Ratings
title original_publication_year ratings_count average_rating
The Complete Calvin and Hobbes 2005 28900 4.82
Words of Radiance (The Stormlight Archive, #2) 2014 73572 4.77
Harry Potter Boxed Set, Books 1-5 (Harry Potter, #1-5) 2003 33220 4.77
ESV Study Bible 2002 8953 4.76
Mark of the Lion Trilogy 1993 9081 4.76
It’s a Magical World: A Calvin and Hobbes Collection 1996 22351 4.75
Harry Potter Boxset (Harry Potter, #1-7) 1998 190050 4.74
There’s Treasure Everywhere: A Calvin and Hobbes Collection 1996 16766 4.74
Harry Potter Collection (Harry Potter, #1-6) 2005 24618 4.73
The Authoritative Calvin and Hobbes: A Calvin and Hobbes Treasury 1990 16087 4.73
The Indispensable Calvin and Hobbes 1992 14597 4.73
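
If exactly ten rows are wanted, ties at the cut-off have to be broken explicitly. The following is a minimal base R sketch on a hypothetical toy data frame (the `toy` object and its values are made up for illustration, not taken from the dataset).

```r
# Hypothetical toy data: 12 books, three of them tied at the cut-off value 4.73
toy <- data.frame(
  title = paste("Book", 1:12),
  average_rating = c(4.82, 4.77, 4.77, 4.76, 4.76, 4.75,
                     4.74, 4.74, 4.73, 4.73, 4.73, 4.60),
  stringsAsFactors = FALSE
)

# order() leaves tied rows in their original relative order;
# head() then cuts the result to exactly ten rows
top10 <- head(toy[order(-toy$average_rating), ], 10)
nrow(top10)
```

Which of the tied books is dropped here depends only on row order, so a second sort key (for example ratings_count) may be a more meaningful tie-breaker.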

Highest number of ratings

Now I want to take a look at the books with the highest number of ratings, regardless of what those ratings are. I am going to assume that these books have the highest number of readers, whether those readers enjoyed them or not. The assumption looks reasonable, as all of these titles are bestsellers.

BX10Ktitles %>% 
  arrange(desc(ratings_count)) %>% 
  top_n(10,wt = ratings_count) %>% 
  select(title, original_publication_year, ratings_count, average_rating) %>%
  kable(caption = "Ratings Count") %>% kable_styling("striped", full_width = TRUE)
Ratings Count
title original_publication_year ratings_count average_rating
The Hunger Games (The Hunger Games, #1) 2008 4780653 4.34
Harry Potter and the Sorcerer’s Stone (Harry Potter, #1) 1997 4602479 4.44
Twilight (Twilight, #1) 2005 3866839 3.57
To Kill a Mockingbird 1960 3198671 4.25
The Great Gatsby 1925 2683664 3.89
The Fault in Our Stars 2012 2346404 4.26
The Hobbit 1937 2071616 4.25
The Catcher in the Rye 1951 2044241 3.79
Pride and Prejudice 1813 2035490 4.24
Angels & Demons (Robert Langdon, #1) 2000 2001311 3.85

Highest number of 1 star ratings

Now I want to take a look at the books with the highest number of 1 star ratings to see which books are most disliked.

BX10Ktitles %>% 
  arrange(desc(ratings_1)) %>% 
  top_n(10,wt = ratings_1) %>% 
  select(title, original_publication_year, ratings_1, average_rating) %>%
  kable(caption = "Worst Rated Books") %>% kable_styling("striped", full_width = TRUE)
Worst Rated Books
title original_publication_year ratings_1 average_rating
Twilight (Twilight, #1) 2005 456191 3.57
Fifty Shades of Grey (Fifty Shades, #1) 2011 165455 3.67
The Catcher in the Rye 1951 109383 3.79
New Moon (Twilight, #2) 2006 102837 3.52
Breaking Dawn (Twilight, #4) 2008 100994 3.70
Eat, Pray, Love 2006 100373 3.51
Lord of the Flies 1954 92779 3.64
The Great Gatsby 1925 86236 3.89
Eclipse (Twilight, #3) 2007 83094 3.69
Angels & Demons (Robert Langdon, #1) 2000 77841 3.85

Interestingly enough, this list includes one of my personal least favorite books, “Lord of the Flies”, as well as ALL the books from my beloved “Twilight” series (don’t judge if you haven’t read it). I guess a high number of 1 star ratings also says something about how popular a book is.

Publication Year

Let’s take a look at the average rating by original publication year.

ggplot(data = BX10Ktitles, mapping = aes(x = original_publication_year, y = average_rating)) +
    geom_point(alpha = 0.1, aes(color = language_code)) +
    xlim(1800,2020)

We can see that there are a lot more books in the dataset from recent years. However, the quality seems to have gone down: there are a lot more books from the most recent 2 decades rated below 3.5 stars, and hardly any from before 2000.

Language

Let’s see if there is a relationship between book language and average rating.

ggplot(data = BX10Ktitles, mapping = aes(y = average_rating, x = language_code)) +
  geom_boxplot() + theme(axis.text.x = element_text(angle = 45, hjust = 1))

It looks like books have pretty high ratings in general regardless of language, but some patterns are apparent. The more books there are in a language, the lower its average rating tends to be, and some of the most highly rated languages have very few books.

Recommendations

The next step is to develop some recommendation models and evaluate their performance.

Creating a realRatingMatrix

Now we can create a real rating matrix to make the data easier to work with.

BXMatrix <- as(BX10K, "realRatingMatrix")
dim(BXMatrix@data)
## [1] 53424 10000

Let’s take a look at the histogram of ratings distribution.

hist(getRatings(BXMatrix), main="Book ratings", breaks = c(0:5), col = c("red", "orange", "gray", "blue", "green"))

Let’s decrease the size of our dataset by randomly selecting 30% of the ratings. Ideally I wanted to grab 30% of users rather than 30% of individual ratings, but that was slowing down my machine too much. Given the size of the dataset, 30% of the ratings will work well enough.

BX10KSample<-BX10K %>% sample_frac(0.3, replace = FALSE)
BXMatrix <- as(BX10KSample, "realRatingMatrix")
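
For comparison, sampling by user instead of by rating could look roughly like the sketch below. It runs on a hypothetical toy data frame in the same long format as BX10K; the seed, the `toy` object, and the helper names are illustrative assumptions, and this is not the approach used in the rest of this analysis.

```r
set.seed(1)  # illustrative seed so the sketch is reproducible

# Hypothetical toy ratings in the same long format as BX10K
toy <- data.frame(
  user_id = rep(1:10, each = 4),
  book_id = rep(1:4, times = 10),
  rating  = sample(1:5, 40, replace = TRUE)
)

# Draw 30% of the distinct users, then keep every rating those users made
users      <- unique(toy$user_id)
keep_users <- sample(users, size = round(0.3 * length(users)))
toy_sample <- toy[toy$user_id %in% keep_users, ]

length(unique(toy_sample$user_id))
```

Sampling users keeps each selected user's full rating history intact, whereas sampling individual ratings thins out every user's history.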

I will narrow my dataset down to only the users that rated more than 50 books and the books that have more than 100 ratings in order to improve the quality of my recommendations.

BX_Ratings <- BXMatrix[rowCounts(BXMatrix) > 50, colCounts(BXMatrix) > 100]
BX_Ratings
## 2077 x 3739 rating matrix of class 'realRatingMatrix' with 99182 ratings.
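
The same kind of filter can be illustrated on a small dense matrix. This is a base R sketch with a hypothetical toy matrix and a threshold of 2 instead of 50 and 100; rowSums() and colSums() on the non-missing mask stand in for rowCounts() and colCounts().

```r
# Hypothetical toy user-by-book matrix; NA marks "not rated"
m <- matrix(c( 5, NA,  3,  4,
              NA, NA,  2, NA,
               4,  5, NA,  3,
               1, NA,  4,  5),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("u", 1:4), paste0("b", 1:4)))

# Ratings per user and per book
per_user <- rowSums(!is.na(m))
per_book <- colSums(!is.na(m))

# Keep users and books with more than 2 ratings; both counts are taken
# from the unfiltered matrix, as in the subsetting above
m_small <- m[per_user > 2, per_book > 2, drop = FALSE]
dim(m_small)
rowSums(!is.na(m_small))
```

Because both counts are computed before subsetting, a retained user can end up below the threshold once the sparse books are dropped (user u3 in the toy matrix keeps only 2 ratings). The same effect is visible above: 99,182 ratings across 2,077 users is fewer than 50 per user on average.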

Test/Train Split

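Let’s split the data into train and test sets using a 75%/25% split of users. For each test user, 5 ratings are given to the recommender (given = 5) and the rest are held out for evaluation; ratings of 3 or higher count as good (goodRating = 3).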

set.seed(11)
eval <- evaluationScheme(BX_Ratings, method = "split", train = 0.75, given=5, goodRating = 3)
train <- getData(eval, "train")
known <- getData(eval, "known")
unknown <- getData(eval, "unknown")
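
To make the known/unknown distinction concrete, here is a base R sketch of the idea for a single test user. The ratings vector, the seed, and the `_toy` names are hypothetical; this mimics the concept, not the internals of evaluationScheme().

```r
set.seed(1)  # illustrative seed

# Hypothetical test user with 8 rated books (book ids as names)
ratings <- c(b1 = 5, b2 = 3, b3 = 4, b4 = 2, b5 = 5, b6 = 4, b7 = 1, b8 = 3)

# Reveal 5 randomly chosen ratings to the model; hold out the rest for scoring
given        <- 5
known_idx    <- sample(seq_along(ratings), given)
known_toy    <- ratings[known_idx]
unknown_toy  <- ratings[-known_idx]

length(known_toy)
length(unknown_toy)
```

Predictions are made from the revealed ratings and then compared against the held-out ones, which is what the accuracy calculation in the next section does.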

Comparing SVD/IBCF/UBCF/Popular Methods

Let’s compare the SVD, IBCF, UBCF, and Popular recommenders. We will train all 4 methods and compare their RMSE results.

set.seed(44)
recomSVD <- Recommender(train, method = "SVD")
pred <- predict(object = recomSVD, newdata = known, type = "ratings")
eval_accuracy_SVD <- calcPredictionAccuracy(pred, unknown, byUser = FALSE)
set.seed(44)
recomIBCF <- Recommender(train, method = "IBCF")
pred <- predict(object = recomIBCF, newdata = known, type = "ratings")
eval_accuracy_IBCF <- calcPredictionAccuracy(pred, unknown, byUser = FALSE)
set.seed(44)
recomUBCF <- Recommender(train, method = "UBCF")
pred <- predict(object = recomUBCF, newdata = known, type = "ratings")
eval_accuracy_UBCF <- calcPredictionAccuracy(pred, unknown, byUser = FALSE)
set.seed(44)
recomPOP <- Recommender(train, method = "POPULAR")
pred <- predict(object = recomPOP, newdata = known, type = "ratings")
eval_accuracy_POP <- calcPredictionAccuracy(pred, unknown, byUser = FALSE)
rbind(eval_accuracy_SVD, eval_accuracy_IBCF, eval_accuracy_UBCF, eval_accuracy_POP)
##                         RMSE       MSE       MAE
## eval_accuracy_SVD  0.9985846 0.9971711 0.7760532
## eval_accuracy_IBCF 1.1969131 1.4326010 0.8592637
## eval_accuracy_UBCF 0.9956455 0.9913099 0.7749485
## eval_accuracy_POP  0.9815095 0.9633609 0.7671001
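
For reference, the three columns are standard error measures between predicted and actual ratings. The sketch below defines them in base R with hypothetical helper names and made-up example vectors, and then checks that the MSE column in the table is the square of the RMSE column.

```r
# Hypothetical helpers; pred and actual are numeric vectors of equal length
mse  <- function(pred, actual) mean((pred - actual)^2)
rmse <- function(pred, actual) sqrt(mse(pred, actual))
mae  <- function(pred, actual) mean(abs(pred - actual))

# Made-up example values, not taken from the dataset
actual <- c(5, 3, 4, 2)
pred   <- c(4, 3, 5, 4)
c(RMSE = rmse(pred, actual), MSE = mse(pred, actual), MAE = mae(pred, actual))

# Consistency check against the table above: RMSE squared should equal MSE
reported_rmse <- c(SVD = 0.9985846, IBCF = 1.1969131, UBCF = 0.9956455, POP = 0.9815095)
reported_mse  <- c(SVD = 0.9971711, IBCF = 1.4326010, UBCF = 0.9913099, POP = 0.9633609)
all(abs(reported_rmse^2 - reported_mse) < 1e-6)
```

RMSE penalizes large errors more heavily than MAE, which is why IBCF looks relatively worse on RMSE than on MAE.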

Popular has the lowest RMSE (0.98), followed closely by UBCF and SVD (both about 1.00); these three are within about 0.02 of each other. SVD was the fastest of the 4 methods, so I consider it the best trade-off between speed and accuracy. UBCF and Popular have a slightly lower RMSE but took significantly longer to run. IBCF took a very long time to calculate and has the worst RMSE (1.20).

Recommendations

The next and final step is to make some recommendations. We will take the first user in the test set and get the top 10 book recommendations based on their ratings.

set.seed(44)
pred2 <- predict(object = recomSVD, newdata = unknown, type = "topNList", n = 10)

recc1 <- pred2@items[[1]]
recc_book_user_1 <- pred2@itemLabels[recc1]

recc_book_user_1 <- as.data.frame(recc_book_user_1)
colnames(recc_book_user_1) <- "book_id"
recc_book_user_1 %>% kable(caption = "User1 Predictions") %>% kable_styling("striped", full_width = TRUE)
User1 Predictions
book_id
5
14
32
8
4
65
13
11
72
54
book_labels <- merge(recc_book_user_1, BX10Ktitles,
                             by = "book_id", all.x = TRUE, all.y = FALSE, sort = FALSE)
book_labels %>% kable(caption = "Books Recommendations") %>% kable_styling("striped", full_width = TRUE)
Books Recommendations
book_id goodreads_book_id best_book_id work_id books_count isbn isbn13 authors original_publication_year original_title title language_code average_rating ratings_count work_ratings_count work_text_reviews_count ratings_1 ratings_2 ratings_3 ratings_4 ratings_5 image_url small_image_url
5 4671 4671 245494 1356 743273567 9.780743e+12 F. Scott Fitzgerald 1925 The Great Gatsby The Great Gatsby eng 3.89 2683664 2773745 51992 86236 197621 606158 936012 947718 https://images.gr-assets.com/books/1490528560m/4671.jpg https://images.gr-assets.com/books/1490528560s/4671.jpg
14 7613 7613 2207778 896 452284244 9.780452e+12 George Orwell 1945 Animal Farm: A Fairy Story Animal Farm eng 3.87 1881700 1982987 35472 66854 135147 433432 698642 648912 https://images.gr-assets.com/books/1424037542m/7613.jpg https://images.gr-assets.com/books/1424037542s/7613.jpg
32 890 890 40283 373 142000671 9.780142e+12 John Steinbeck 1937 Of Mice and Men Of Mice and Men eng 3.84 1467496 1518741 24642 46630 110856 355169 532291 473795 https://images.gr-assets.com/books/1437235233m/890.jpg https://images.gr-assets.com/books/1437235233s/890.jpg
8 5107 5107 3036731 360 316769177 9.780317e+12 J.D. Salinger 1951 The Catcher in the Rye The Catcher in the Rye eng 3.79 2044241 2120637 44920 109383 185520 455042 661516 709176 https://images.gr-assets.com/books/1398034300m/5107.jpg https://images.gr-assets.com/books/1398034300s/5107.jpg
4 2657 2657 3275794 487 61120081 9.780061e+12 Harper Lee 1960 To Kill a Mockingbird To Kill a Mockingbird eng 4.25 3198671 3340896 72586 60427 117415 446835 1001952 1714267 https://images.gr-assets.com/books/1361975680m/2657.jpg https://images.gr-assets.com/books/1361975680s/2657.jpg
65 4981 4981 1683562 241 385333846 9.780385e+12 Kurt Vonnegut Jr. 1969 Slaughterhouse-Five, or The Children’s Crusade: A Duty-Dance with Death Slaughterhouse-Five eng 4.06 846488 891762 19646 24964 45518 152442 300948 367890 https://images.gr-assets.com/books/1440319389m/4981.jpg https://images.gr-assets.com/books/1440319389s/4981.jpg
13 5470 5470 153313 995 451524934 9.780452e+12 George Orwell, Erich Fromm, Celâl Üster 1949 Nineteen Eighty-Four 1984 eng 4.14 1956832 2053394 45518 41845 86425 324874 692021 908229 https://images.gr-assets.com/books/1348990566m/5470.jpg https://images.gr-assets.com/books/1348990566s/5470.jpg
11 77203 77203 3295919 283 1594480001 9.781594e+12 Khaled Hosseini 2003 The Kite Runner The Kite Runner eng 4.26 1813044 1878095 59730 34288 59980 226062 628174 929591 https://images.gr-assets.com/books/1484565687m/77203.jpg https://images.gr-assets.com/books/1484565687s/77203.jpg
72 11588 11588 849585 289 450040186 9.780450e+12 Stephen King 1977 The Shining The Shining (The Shining #1) eng 4.17 791850 830881 14936 18487 28981 123862 277393 382158 https://images.gr-assets.com/books/1353277730m/11588.jpg https://images.gr-assets.com/books/1353277730s/11588.jpg
54 11 386162 3078186 257 345391802 9.780345e+12 Douglas Adams 1979 The Hitchhiker’s Guide to the Galaxy The Hitchhiker’s Guide to the Galaxy (Hitchhiker’s Guide to the Galaxy, #1) en-US 4.20 936782 1006479 20345 21764 41962 145173 299579 498001 https://images.gr-assets.com/books/1327656754m/11.jpg https://images.gr-assets.com/books/1327656754s/11.jpg
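
An alternative way to attach titles while keeping the ranking order is a lookup with match(). This is a base R sketch on a small hand-built lookup table; the `rec_ids` and `titles` objects are illustrative and contain only the first five recommendations shown above.

```r
# Recommended ids in ranked order, copied from the "User1 Predictions" table
rec_ids <- c(5, 14, 32, 8, 4)

# Small lookup table built from titles shown in this analysis
titles <- data.frame(
  book_id = c(4, 5, 8, 14, 32),
  title   = c("To Kill a Mockingbird", "The Great Gatsby",
              "The Catcher in the Rye", "Animal Farm", "Of Mice and Men"),
  stringsAsFactors = FALSE
)

# match() returns, for each recommended id, its row position in the lookup
# table, so the result follows the ranking order rather than the table order
rec_titles <- titles$title[match(rec_ids, titles$book_id)]
rec_titles
```

With the real data, the item labels would likely need converting with as.integer() first, since they are stored as character strings.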

Conclusions

In summary, the Goodreads dataset was a very interesting dataset to explore, and the tools we have been equipped with during the semester were very useful for exploring, visualizing, and building a recommendation system with this data. One observation is that computational limitations are a serious concern, and alternative solutions need to be considered for future analysis. Another is that SVD seems to offer the best balance of speed and accuracy for making predictions, even though Popular and UBCF had a marginally lower RMSE. IBCF, by contrast, is a very time consuming method that did not give as good a result, so for this dataset comparing user tastes appears to work better than comparing items when building a recommendation system.