I will build a book recommendation system using the Goodreads 10K dataset, which contains book ratings on a scale of 1 to 5. I will compare four methods on this dataset: SVD, Item-Based Collaborative Filtering (IBCF), User-Based Collaborative Filtering (UBCF), and Popular. Like most book ratings datasets, this one is very sparse: 10K books and 53K users give about 530M possible ratings, of which only about 6M exist, so the matrix is about 99% empty. Even so, 6 million ratings is a lot to process, so I will work with a subset of the data to keep the analysis time manageable.
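The sparsity figure can be checked from the dimensions that `dim()` reports later in this analysis (5,976,479 ratings, 53,424 users, 10,000 books). A minimal sketch in base R; the variable names are illustrative:

```r
# Dimensions as reported by dim() elsewhere in this analysis
n_ratings <- 5976479
n_users   <- 53424
n_books   <- 10000

# Density: share of all user-book pairs that actually have a rating
density  <- n_ratings / (n_users * n_books)
sparsity <- 1 - density

# Sparsity as a percentage, rounded to one decimal place
round(100 * sparsity, 1)
```

With the exact counts the matrix comes out just under 99% empty, in line with the rounded estimate.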
The data can be found here: https://github.com/zygmuntz/goodbooks-10k
Loading recommenderlab, dplyr, ggplot2, and kableExtra packages.
options(warn=-1)
suppressMessages(library("kableExtra"))
suppressMessages(library("ggplot2"))
suppressMessages(library("recommenderlab"))
suppressMessages(library("dplyr"))
The code below loads the data from two CSV files: the ratings and the detailed book information.
BX10K <- read.csv(file="https://raw.githubusercontent.com/zygmuntz/goodbooks-10k/master/ratings.csv", header=TRUE, sep=",", stringsAsFactors = F)
BX10Ktitles <- read.csv("https://raw.githubusercontent.com/zygmuntz/goodbooks-10k/master/books.csv", header=TRUE, sep=",", stringsAsFactors = F)
Let’s inspect the ratings dataset with the “dim” and “summary” functions. There are almost 6 million ratings, on a scale of 1 to 5. The mean rating is 3.92 and the median is 4, which tells me that books are rated fairly highly overall. The ratings are concentrated at the top of the scale rather than evenly distributed: the first quartile is already 3 and the third quartile is 5.
dim(BX10K)
## [1] 5976479 3
summary(BX10K$rating)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 3.00 4.00 3.92 5.00 5.00
head(BX10K, 10) %>% kable(caption = "Ratings Data") %>% kable_styling("striped", full_width = TRUE)
user_id | book_id | rating |
---|---|---|
1 | 258 | 5 |
2 | 4081 | 4 |
2 | 260 | 5 |
2 | 9296 | 5 |
2 | 2318 | 3 |
2 | 26 | 4 |
2 | 315 | 3 |
2 | 33 | 4 |
2 | 301 | 5 |
2 | 2686 | 5 |
Now let’s inspect our book titles dataset. It includes some interesting details we can explore, such as publication year, average ratings, and languages.
dim(BX10Ktitles)
## [1] 10000 23
head(BX10Ktitles, 10) %>% kable(caption = "Books Data") %>% kable_styling("striped", full_width = TRUE)
book_id | goodreads_book_id | best_book_id | work_id | books_count | isbn | isbn13 | authors | original_publication_year | original_title | title | language_code | average_rating | ratings_count | work_ratings_count | work_text_reviews_count | ratings_1 | ratings_2 | ratings_3 | ratings_4 | ratings_5 | image_url | small_image_url |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 2767052 | 2767052 | 2792775 | 272 | 439023483 | 9.780439e+12 | Suzanne Collins | 2008 | The Hunger Games | The Hunger Games (The Hunger Games, #1) | eng | 4.34 | 4780653 | 4942365 | 155254 | 66715 | 127936 | 560092 | 1481305 | 2706317 | https://images.gr-assets.com/books/1447303603m/2767052.jpg | https://images.gr-assets.com/books/1447303603s/2767052.jpg |
2 | 3 | 3 | 4640799 | 491 | 439554934 | 9.780440e+12 | J.K. Rowling, Mary GrandPré | 1997 | Harry Potter and the Philosopher’s Stone | Harry Potter and the Sorcerer’s Stone (Harry Potter, #1) | eng | 4.44 | 4602479 | 4800065 | 75867 | 75504 | 101676 | 455024 | 1156318 | 3011543 | https://images.gr-assets.com/books/1474154022m/3.jpg | https://images.gr-assets.com/books/1474154022s/3.jpg |
3 | 41865 | 41865 | 3212258 | 226 | 316015849 | 9.780316e+12 | Stephenie Meyer | 2005 | Twilight | Twilight (Twilight, #1) | en-US | 3.57 | 3866839 | 3916824 | 95009 | 456191 | 436802 | 793319 | 875073 | 1355439 | https://images.gr-assets.com/books/1361039443m/41865.jpg | https://images.gr-assets.com/books/1361039443s/41865.jpg |
4 | 2657 | 2657 | 3275794 | 487 | 61120081 | 9.780061e+12 | Harper Lee | 1960 | To Kill a Mockingbird | To Kill a Mockingbird | eng | 4.25 | 3198671 | 3340896 | 72586 | 60427 | 117415 | 446835 | 1001952 | 1714267 | https://images.gr-assets.com/books/1361975680m/2657.jpg | https://images.gr-assets.com/books/1361975680s/2657.jpg |
5 | 4671 | 4671 | 245494 | 1356 | 743273567 | 9.780743e+12 | F. Scott Fitzgerald | 1925 | The Great Gatsby | The Great Gatsby | eng | 3.89 | 2683664 | 2773745 | 51992 | 86236 | 197621 | 606158 | 936012 | 947718 | https://images.gr-assets.com/books/1490528560m/4671.jpg | https://images.gr-assets.com/books/1490528560s/4671.jpg |
6 | 11870085 | 11870085 | 16827462 | 226 | 525478817 | 9.780525e+12 | John Green | 2012 | The Fault in Our Stars | The Fault in Our Stars | eng | 4.26 | 2346404 | 2478609 | 140739 | 47994 | 92723 | 327550 | 698471 | 1311871 | https://images.gr-assets.com/books/1360206420m/11870085.jpg | https://images.gr-assets.com/books/1360206420s/11870085.jpg |
7 | 5907 | 5907 | 1540236 | 969 | 618260307 | 9.780618e+12 | J.R.R. Tolkien | 1937 | The Hobbit or There and Back Again | The Hobbit | en-US | 4.25 | 2071616 | 2196809 | 37653 | 46023 | 76784 | 288649 | 665635 | 1119718 | https://images.gr-assets.com/books/1372847500m/5907.jpg | https://images.gr-assets.com/books/1372847500s/5907.jpg |
8 | 5107 | 5107 | 3036731 | 360 | 316769177 | 9.780317e+12 | J.D. Salinger | 1951 | The Catcher in the Rye | The Catcher in the Rye | eng | 3.79 | 2044241 | 2120637 | 44920 | 109383 | 185520 | 455042 | 661516 | 709176 | https://images.gr-assets.com/books/1398034300m/5107.jpg | https://images.gr-assets.com/books/1398034300s/5107.jpg |
9 | 960 | 960 | 3338963 | 311 | 1416524797 | 9.781417e+12 | Dan Brown | 2000 | Angels & Demons | Angels & Demons (Robert Langdon, #1) | en-CA | 3.85 | 2001311 | 2078754 | 25112 | 77841 | 145740 | 458429 | 716569 | 680175 | https://images.gr-assets.com/books/1303390735m/960.jpg | https://images.gr-assets.com/books/1303390735s/960.jpg |
10 | 1885 | 1885 | 3060926 | 3455 | 679783261 | 9.780680e+12 | Jane Austen | 1813 | Pride and Prejudice | Pride and Prejudice | eng | 4.24 | 2035490 | 2191465 | 49152 | 54700 | 86485 | 284852 | 609755 | 1155673 | https://images.gr-assets.com/books/1320399351m/1885.jpg | https://images.gr-assets.com/books/1320399351s/1885.jpg |
Let’s find the top 10 highest rated books. The table has 11 rows because top_n keeps ties, and three books are tied at 4.73. It is interesting that pretty much all of these are part of a series or collection rather than standalone books. Several belong to the same series, and some of the box sets overlap in content.
BX10Ktitles %>%
arrange(desc(average_rating)) %>%
top_n(10,wt = average_rating) %>%
select(title, original_publication_year, ratings_count, average_rating) %>%
kable(caption = "Average Ratings") %>% kable_styling("striped", full_width = TRUE)
title | original_publication_year | ratings_count | average_rating |
---|---|---|---|
The Complete Calvin and Hobbes | 2005 | 28900 | 4.82 |
Words of Radiance (The Stormlight Archive, #2) | 2014 | 73572 | 4.77 |
Harry Potter Boxed Set, Books 1-5 (Harry Potter, #1-5) | 2003 | 33220 | 4.77 |
ESV Study Bible | 2002 | 8953 | 4.76 |
Mark of the Lion Trilogy | 1993 | 9081 | 4.76 |
It’s a Magical World: A Calvin and Hobbes Collection | 1996 | 22351 | 4.75 |
Harry Potter Boxset (Harry Potter, #1-7) | 1998 | 190050 | 4.74 |
There’s Treasure Everywhere: A Calvin and Hobbes Collection | 1996 | 16766 | 4.74 |
Harry Potter Collection (Harry Potter, #1-6) | 2005 | 24618 | 4.73 |
The Authoritative Calvin and Hobbes: A Calvin and Hobbes Treasury | 1990 | 16087 | 4.73 |
The Indispensable Calvin and Hobbes | 1992 | 14597 | 4.73 |
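The 11 rows in the table above follow from how ties are ranked. A small base R sketch of that ranking rule, using the average ratings copied from the table (the variable names are illustrative, and this imitates the behavior of `top_n` rather than calling it):

```r
# Average ratings copied from the table above (11 rows)
avg <- c(4.82, 4.77, 4.77, 4.76, 4.76, 4.75, 4.74, 4.74, 4.73, 4.73, 4.73)

# Rank from highest to lowest; tied values share the smallest rank
rnk <- rank(-avg, ties.method = "min")

# Keeping every rank <= 10 retains all three books tied at 4.73
n_with_ties <- sum(rnk <= 10)

# Taking the first 10 values after sorting cuts the tie arbitrarily
n_strict <- length(head(sort(avg, decreasing = TRUE), 10))

c(with_ties = n_with_ties, strict = n_strict)
```

If exactly 10 rows are wanted, sorting and taking the first 10 is one option, at the cost of dropping one of the tied books arbitrarily.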
Now I want to look at the books with the highest number of ratings, regardless of what those ratings are. I am going to assume that these books have the most readers, whether or not those readers enjoyed them. The assumption looks reasonable, as all of these titles are bestsellers.
BX10Ktitles %>%
arrange(desc(ratings_count)) %>%
top_n(10,wt = ratings_count) %>%
select(title, original_publication_year, ratings_count, average_rating) %>%
kable(caption = "Ratings Count") %>% kable_styling("striped", full_width = TRUE)
title | original_publication_year | ratings_count | average_rating |
---|---|---|---|
The Hunger Games (The Hunger Games, #1) | 2008 | 4780653 | 4.34 |
Harry Potter and the Sorcerer’s Stone (Harry Potter, #1) | 1997 | 4602479 | 4.44 |
Twilight (Twilight, #1) | 2005 | 3866839 | 3.57 |
To Kill a Mockingbird | 1960 | 3198671 | 4.25 |
The Great Gatsby | 1925 | 2683664 | 3.89 |
The Fault in Our Stars | 2012 | 2346404 | 4.26 |
The Hobbit | 1937 | 2071616 | 4.25 |
The Catcher in the Rye | 1951 | 2044241 | 3.79 |
Pride and Prejudice | 1813 | 2035490 | 4.24 |
Angels & Demons (Robert Langdon, #1) | 2000 | 2001311 | 3.85 |
Now I want to take a look at the books with the highest number of 1 star ratings to see which books are most disliked.
BX10Ktitles %>%
arrange(desc(ratings_1)) %>%
top_n(10,wt = ratings_1) %>%
select(title, original_publication_year, ratings_1, average_rating) %>%
kable(caption = "Worst Rated Books") %>% kable_styling("striped", full_width = TRUE)
title | original_publication_year | ratings_1 | average_rating |
---|---|---|---|
Twilight (Twilight, #1) | 2005 | 456191 | 3.57 |
Fifty Shades of Grey (Fifty Shades, #1) | 2011 | 165455 | 3.67 |
The Catcher in the Rye | 1951 | 109383 | 3.79 |
New Moon (Twilight, #2) | 2006 | 102837 | 3.52 |
Breaking Dawn (Twilight, #4) | 2008 | 100994 | 3.70 |
Eat, Pray, Love | 2006 | 100373 | 3.51 |
Lord of the Flies | 1954 | 92779 | 3.64 |
The Great Gatsby | 1925 | 86236 | 3.89 |
Eclipse (Twilight, #3) | 2007 | 83094 | 3.69 |
Angels & Demons (Robert Langdon, #1) | 2000 | 77841 | 3.85 |
Interestingly enough, this list includes one of my personal least favorite books, “Lord of the Flies”, as well as ALL the books from my beloved “Twilight” series (don’t judge if you haven’t read it). I guess the high number of 1-star ratings also says something about how popular a book is.
Let’s take a look at the data by year.
ggplot(data = BX10Ktitles, mapping = aes(x = original_publication_year, y = average_rating)) +
geom_point(alpha = 0.1, aes(color = language_code)) +
xlim(1800,2020)
We can see that there are a lot more books in the dataset from recent years. However, the quality seems to have gone down: there are a lot more books from the last two decades rated below 3.5 stars, and hardly any before 2000. This may partly reflect the much larger number of recent books in the dataset.
Let’s see if there is a relationship between book language and average rating.
ggplot(data = BX10Ktitles, mapping = aes(y = average_rating, x = language_code)) +
geom_boxplot() + theme(axis.text.x = element_text(angle = 45, hjust = 1))
Books have pretty high ratings in general regardless of language, but some patterns are apparent. The more books there are in a language, the lower its average rating tends to be, and some of the most highly rated languages have very few books.
The next step is to develop some recommendation models and evaluate their performance.
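The relationship between the number of books in a language and its average rating can also be checked numerically rather than read off the boxplot. A minimal base R sketch, using only the language codes and average ratings of the ten books shown in the books table earlier; the `books` and `by_lang` names are illustrative, and the real check would run on the full `BX10Ktitles` data frame:

```r
# Language codes and average ratings of the first ten books shown earlier
books <- data.frame(
  language_code  = c("eng", "eng", "en-US", "eng", "eng",
                     "eng", "en-US", "eng", "en-CA", "eng"),
  average_rating = c(4.34, 4.44, 3.57, 4.25, 3.89,
                     4.26, 4.25, 3.79, 3.85, 4.24),
  stringsAsFactors = FALSE
)

# Number of books and mean rating per language
n_books     <- tapply(books$average_rating, books$language_code, length)
mean_rating <- tapply(books$average_rating, books$language_code, mean)

by_lang <- data.frame(
  language_code = names(n_books),
  n_books       = as.vector(n_books),
  mean_rating   = as.vector(mean_rating)
)

# Languages with the most books first
by_lang[order(-by_lang$n_books), ]
```

Ten rows are far too few to support any conclusion; the sketch only shows the shape of the summary.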
Now we can create a real rating matrix to make the data easier to work with.
BXMatrix <- as(BX10K, "realRatingMatrix")
dim(BXMatrix@data)
## [1] 53424 10000
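Conceptually, the conversion above turns the long table of (user, book, rating) rows into a matrix with one row per user and one column per book. A minimal base R sketch of that reshaping, using the first four rows of the ratings table shown earlier; it builds a dense matrix with NA for missing ratings, which is an assumption made for illustration and not how recommenderlab stores the data internally:

```r
# First rows of the ratings table shown earlier (long format)
ratings <- data.frame(
  user_id = c(1, 2, 2, 2),
  book_id = c(258, 4081, 260, 9296),
  rating  = c(5, 4, 5, 5)
)

users <- sort(unique(ratings$user_id))
books <- sort(unique(ratings$book_id))

# Users in rows, books in columns; NA marks "not rated"
m <- matrix(NA_real_, nrow = length(users), ncol = length(books),
            dimnames = list(as.character(users), as.character(books)))

# Fill each (user, book) cell with its rating using matrix indexing
m[cbind(match(ratings$user_id, users),
        match(ratings$book_id, books))] <- ratings$rating
m
```

A dense matrix like this would not be practical for the full 53,424 x 10,000 data, which is why a sparse rating matrix class is used.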
Let’s take a look at the histogram of ratings distribution.
hist(getRatings(BXMatrix), main="Book ratings", breaks = c(0:5), col = c("red", "orange", "gray", "blue", "green"))
Let’s decrease the size of the dataset by randomly selecting 30% of the ratings. Ideally I would sample 30% of users rather than 30% of individual ratings, but that slowed my machine down too much. Given the size of the dataset, 30% of the ratings will work well enough.
BX10KSample<-BX10K %>% sample_frac(0.3, replace = FALSE)
BXMatrix <- as(BX10KSample, "realRatingMatrix")
I will narrow the dataset down to users who rated more than 50 books and books that have more than 100 ratings, in order to improve the quality of the recommendations.
BX_Ratings <- BXMatrix[rowCounts(BXMatrix) > 50, colCounts(BXMatrix) > 100]
BX_Ratings
## 2077 x 3739 rating matrix of class 'realRatingMatrix' with 99182 ratings.
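Note that both thresholds are applied to counts taken from the full matrix, so a user can end up with fewer ratings than the threshold once sparsely rated books are dropped. The filtered matrix above has 99,182 ratings across 2,077 users, about 48 per user. A toy base R sketch of the effect, with hypothetical values and thresholds of 1 standing in for 50 and 100:

```r
# Toy matrix (hypothetical values): 3 users x 4 books, NA = not rated
m <- matrix(c(5,  NA, 4,  3,
              4,  2,  NA, NA,
              NA, NA, 5,  NA),
            nrow = 3, byrow = TRUE,
            dimnames = list(paste0("u", 1:3), paste0("b", 1:4)))

row_counts <- rowSums(!is.na(m))  # ratings per user
col_counts <- colSums(!is.na(m))  # ratings per book

# Both filters use counts from the full matrix, as in the code above
kept <- m[row_counts > 1, col_counts > 1, drop = FALSE]

# Recount after filtering: user u2 now has only one rating left
rowSums(!is.na(kept))
```

Whether to re-filter after subsetting is a design choice; the analysis here keeps the single-pass filter.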
Let’s split the data into train and test sets using a 75%/25% split.
set.seed(11)
eval <- evaluationScheme(BX_Ratings, method = "split", train = 0.75, given=5, goodRating = 3)
train <- getData(eval, "train")
known <- getData(eval, "known")
unknown <- getData(eval, "unknown")
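With `given = 5`, each test user has 5 of their ratings exposed to the model (the "known" part), and the remainder is withheld for scoring (the "unknown" part). A minimal base R sketch of that per-user split, with hypothetical ratings and illustrative variable names; it mimics the idea and is not recommenderlab's own implementation:

```r
set.seed(1)

# Hypothetical ratings of one test user: 8 rated books
user_ratings <- c(b1 = 5, b2 = 3, b3 = 4, b4 = 2,
                  b5 = 5, b6 = 4, b7 = 1, b8 = 3)

given <- 5

# Randomly pick which ratings the model is allowed to see
known_idx <- sample(length(user_ratings), given)

known_part   <- user_ratings[known_idx]    # shown to the model
unknown_part <- user_ratings[-known_idx]   # withheld for scoring

list(known = names(known_part), unknown = names(unknown_part))
```

Predictions are made from the known part and compared against the unknown part, which is what the accuracy calculations in this analysis do.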
Let’s compare the SVD, IBCF, UBCF, and Popular recommenders. We will fit all 4 methods and compare their RMSE results.
set.seed(44)
recomSVD <- Recommender(train, method = "SVD")
pred <- predict(object = recomSVD, newdata = known, type = "ratings")
eval_accuracy_SVD <- calcPredictionAccuracy(pred, unknown, byUser = FALSE)
set.seed(44)
recomIBCF <- Recommender(train, method = "IBCF")
pred <- predict(object = recomIBCF, newdata = known, type = "ratings")
eval_accuracy_IBCF <- calcPredictionAccuracy(pred, unknown, byUser = FALSE)
set.seed(44)
recomUBCF <- Recommender(train, method = "UBCF")
pred <- predict(object = recomUBCF, newdata = known, type = "ratings")
eval_accuracy_UBCF <- calcPredictionAccuracy(pred, unknown, byUser = FALSE)
set.seed(44)
recomPOP <- Recommender(train, method = "POPULAR")
pred <- predict(object = recomPOP, newdata = known, type = "ratings")
eval_accuracy_POP <- calcPredictionAccuracy(pred, unknown, byUser = FALSE)
rbind(eval_accuracy_SVD, eval_accuracy_IBCF, eval_accuracy_UBCF, eval_accuracy_POP)
## RMSE MSE MAE
## eval_accuracy_SVD 0.9985846 0.9971711 0.7760532
## eval_accuracy_IBCF 1.1969131 1.4326010 0.8592637
## eval_accuracy_UBCF 0.9956455 0.9913099 0.7749485
## eval_accuracy_POP 0.9815095 0.9633609 0.7671001
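For reference, the three columns in the table are standard error metrics over pairs of predicted and held-out ratings. A minimal base R sketch of their definitions, with hypothetical example values, followed by a sanity check that the reported MSE equals the reported RMSE squared:

```r
# Error metrics over paired predicted and actual ratings
rmse <- function(predicted, actual) sqrt(mean((predicted - actual)^2))
mse  <- function(predicted, actual) mean((predicted - actual)^2)
mae  <- function(predicted, actual) mean(abs(predicted - actual))

# Hypothetical predictions and held-out ratings
predicted <- c(4, 3, 5)
actual    <- c(5, 3, 4)
c(RMSE = rmse(predicted, actual),
  MSE  = mse(predicted, actual),
  MAE  = mae(predicted, actual))

# Values copied from the table above
reported_rmse <- c(SVD = 0.9985846, IBCF = 1.1969131,
                   UBCF = 0.9956455, POP = 0.9815095)
reported_mse  <- c(SVD = 0.9971711, IBCF = 1.4326010,
                   UBCF = 0.9913099, POP = 0.9633609)

# Largest gap between RMSE^2 and MSE; should be rounding noise only
max(abs(reported_rmse^2 - reported_mse))
```

Because RMSE squares the errors before averaging, it penalizes large misses more heavily than MAE does.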
Popular has the lowest RMSE (0.98), followed closely by UBCF and SVD (both about 1.00), so these three methods are nearly tied on accuracy. SVD was the fastest of the 4 methods, while UBCF and Popular took significantly longer, which makes SVD the best overall option for this analysis. IBCF took a very long time to calculate and has the worst RMSE (1.20).
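The running times are reported informally. One way to record them would be to wrap each fit in `system.time()`. A hedged sketch in base R; the workload below is a stand-in (base R `svd()` on a random matrix), not the recommender itself, so that it runs without the dataset:

```r
set.seed(44)

# Time an arbitrary fitting step and keep the wall-clock seconds
elapsed <- system.time({
  x <- matrix(rnorm(200 * 50), nrow = 200)  # stand-in data
  s <- svd(x)                               # stand-in for model fitting
})[["elapsed"]]

elapsed
```

In the analysis itself, the braces would instead wrap a call such as `Recommender(train, method = "SVD")`, repeated for each method.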
The next and final step is to make some recommendations. We will take the first user in the test set and get the top 10 book recommendations based on their ratings.
set.seed(44)
pred2 <- predict(object = recomSVD, newdata = unknown, type = "topNList", n = 10)
recc1 <- pred2@items[[1]]
recc_book_user_1 <- pred2@itemLabels[recc1]
recc_book_user_1 <- as.data.frame(recc_book_user_1)
colnames(recc_book_user_1) <- "book_id"
recc_book_user_1 %>% kable(caption = "User1 Predictions") %>% kable_styling("striped", full_width = TRUE)
book_id |
---|
5 |
14 |
32 |
8 |
4 |
65 |
13 |
11 |
72 |
54 |
book_labels <- merge(recc_book_user_1, BX10Ktitles,
by = "book_id", all.x = TRUE, all.y = FALSE, sort = FALSE)
book_labels %>% kable(caption = "Books Recommendations") %>% kable_styling("striped", full_width = TRUE)
book_id | goodreads_book_id | best_book_id | work_id | books_count | isbn | isbn13 | authors | original_publication_year | original_title | title | language_code | average_rating | ratings_count | work_ratings_count | work_text_reviews_count | ratings_1 | ratings_2 | ratings_3 | ratings_4 | ratings_5 | image_url | small_image_url |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5 | 4671 | 4671 | 245494 | 1356 | 743273567 | 9.780743e+12 | F. Scott Fitzgerald | 1925 | The Great Gatsby | The Great Gatsby | eng | 3.89 | 2683664 | 2773745 | 51992 | 86236 | 197621 | 606158 | 936012 | 947718 | https://images.gr-assets.com/books/1490528560m/4671.jpg | https://images.gr-assets.com/books/1490528560s/4671.jpg |
14 | 7613 | 7613 | 2207778 | 896 | 452284244 | 9.780452e+12 | George Orwell | 1945 | Animal Farm: A Fairy Story | Animal Farm | eng | 3.87 | 1881700 | 1982987 | 35472 | 66854 | 135147 | 433432 | 698642 | 648912 | https://images.gr-assets.com/books/1424037542m/7613.jpg | https://images.gr-assets.com/books/1424037542s/7613.jpg |
32 | 890 | 890 | 40283 | 373 | 142000671 | 9.780142e+12 | John Steinbeck | 1937 | Of Mice and Men | Of Mice and Men | eng | 3.84 | 1467496 | 1518741 | 24642 | 46630 | 110856 | 355169 | 532291 | 473795 | https://images.gr-assets.com/books/1437235233m/890.jpg | https://images.gr-assets.com/books/1437235233s/890.jpg |
8 | 5107 | 5107 | 3036731 | 360 | 316769177 | 9.780317e+12 | J.D. Salinger | 1951 | The Catcher in the Rye | The Catcher in the Rye | eng | 3.79 | 2044241 | 2120637 | 44920 | 109383 | 185520 | 455042 | 661516 | 709176 | https://images.gr-assets.com/books/1398034300m/5107.jpg | https://images.gr-assets.com/books/1398034300s/5107.jpg |
4 | 2657 | 2657 | 3275794 | 487 | 61120081 | 9.780061e+12 | Harper Lee | 1960 | To Kill a Mockingbird | To Kill a Mockingbird | eng | 4.25 | 3198671 | 3340896 | 72586 | 60427 | 117415 | 446835 | 1001952 | 1714267 | https://images.gr-assets.com/books/1361975680m/2657.jpg | https://images.gr-assets.com/books/1361975680s/2657.jpg |
65 | 4981 | 4981 | 1683562 | 241 | 385333846 | 9.780385e+12 | Kurt Vonnegut Jr. | 1969 | Slaughterhouse-Five, or The Children’s Crusade: A Duty-Dance with Death | Slaughterhouse-Five | eng | 4.06 | 846488 | 891762 | 19646 | 24964 | 45518 | 152442 | 300948 | 367890 | https://images.gr-assets.com/books/1440319389m/4981.jpg | https://images.gr-assets.com/books/1440319389s/4981.jpg |
13 | 5470 | 5470 | 153313 | 995 | 451524934 | 9.780452e+12 | George Orwell, Erich Fromm, Celâl Üster | 1949 | Nineteen Eighty-Four | 1984 | eng | 4.14 | 1956832 | 2053394 | 45518 | 41845 | 86425 | 324874 | 692021 | 908229 | https://images.gr-assets.com/books/1348990566m/5470.jpg | https://images.gr-assets.com/books/1348990566s/5470.jpg |
11 | 77203 | 77203 | 3295919 | 283 | 1594480001 | 9.781594e+12 | Khaled Hosseini | 2003 | The Kite Runner | The Kite Runner | eng | 4.26 | 1813044 | 1878095 | 59730 | 34288 | 59980 | 226062 | 628174 | 929591 | https://images.gr-assets.com/books/1484565687m/77203.jpg | https://images.gr-assets.com/books/1484565687s/77203.jpg |
72 | 11588 | 11588 | 849585 | 289 | 450040186 | 9.780450e+12 | Stephen King | 1977 | The Shining | The Shining (The Shining #1) | eng | 4.17 | 791850 | 830881 | 14936 | 18487 | 28981 | 123862 | 277393 | 382158 | https://images.gr-assets.com/books/1353277730m/11588.jpg | https://images.gr-assets.com/books/1353277730s/11588.jpg |
54 | 11 | 386162 | 3078186 | 257 | 345391802 | 9.780345e+12 | Douglas Adams | 1979 | The Hitchhiker’s Guide to the Galaxy | The Hitchhiker’s Guide to the Galaxy (Hitchhiker’s Guide to the Galaxy, #1) | en-US | 4.20 | 936782 | 1006479 | 20345 | 21764 | 41962 | 145173 | 299579 | 498001 | https://images.gr-assets.com/books/1327656754m/11.jpg | https://images.gr-assets.com/books/1327656754s/11.jpg |
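The merged table happens to keep the recommendation order, but the R documentation for `merge()` describes the row order with `sort = FALSE` as unspecified. When rank order matters, a lookup with `match()` keeps it explicitly. A minimal base R sketch, using the first three recommended ids and their titles from the table above; `rec_ids`, `titles`, and `rec_titles` are illustrative names:

```r
# Recommended ids in rank order (first three from the table above);
# item labels are character strings, so they are converted first
rec_ids <- as.integer(c("5", "14", "32"))

# Small stand-in for the books table, deliberately in a different order
titles <- data.frame(
  book_id = c(14L, 32L, 5L),
  title   = c("Animal Farm", "Of Mice and Men", "The Great Gatsby"),
  stringsAsFactors = FALSE
)

# match() gives, for each recommended id, its row in the lookup table
rec_titles <- titles[match(rec_ids, titles$book_id), c("book_id", "title")]
rec_titles$title
```

The same pattern would apply to the full `BX10Ktitles` data frame, selecting only the columns needed for display.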
In summary, the Goodreads dataset was a very interesting dataset to explore, and the tools we have been equipped with during the semester were very useful for exploring, visualizing, and building a recommendation system with this data. One observation is that computational limitations are a serious concern, and alternative solutions need to be considered for future analysis. Another is that SVD seems to be the most effective method for making predictions here once speed is taken into account, since its RMSE was within about 0.02 of the best result. IBCF, by contrast, is very time-consuming and produced the worst RMSE, so it seems better to compare user tastes rather than items when building a recommendation system on this data.