Final Project Planning Document (Proposal)

Book Recommender System

Deliverable
Planning Document: Find an interesting dataset and describe the system you plan to build. If you would like to use one of the datasets you have already worked with, you should add a unique element or incorporate additional data (e.g. explicit features you scrape from another source, like image analysis on movie posters). The overall goal, however, will be to produce quality recommendations by extracting insights from a large dataset.

Introduction

This project will explore the contents of the goodbooks dataset and develop a collaborative filtering recommender system for books. Collaborative filtering relies only on user preferences, expressed through the ratings that users provided, and does not take the features or content of the items into account. The recommender system will then be analyzed and evaluated to see how well it performs when recommending items.

Objective

To build a book recommender system using a large dataset.

Dataset

For the final project I will use the goodbooks dataset from Kaggle.
The dataset contains ratings for ten thousand popular books. Its description does not name a specific source and says only that the ratings were found on the internet. Generally, there are 100 ratings for each book, although some books have fewer. Ratings go from one to five. Both book IDs and user IDs are contiguous: book IDs run from 1 to 10000 and user IDs from 1 to 53424. Every user has made at least two ratings, and the median number of ratings per user is 8. The dataset also includes the books each user has marked to read, book metadata (author, year, etc.) and tags.
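
The figures quoted above (minimum and median ratings per user, rating range) can be checked directly from a long-format ratings table. The following is a minimal sketch in base R, assuming columns named user_id, book_id and rating; the helper summarise_ratings and the toy values are hypothetical and only illustrate the check.

```r
# Summarise a long-format ratings table with columns user_id, book_id, rating.
# summarise_ratings is a hypothetical helper, not part of any package.
summarise_ratings <- function(ratings) {
  per_user <- as.vector(table(ratings$user_id))  # ratings made by each user
  per_book <- as.vector(table(ratings$book_id))  # ratings received by each book
  n_users <- length(per_user)
  n_books <- length(per_book)
  list(
    n_users         = n_users,
    n_books         = n_books,
    min_per_user    = min(per_user),
    median_per_user = median(per_user),
    median_per_book = median(per_book),
    rating_range    = range(ratings$rating),
    # Share of the user-by-book matrix that actually holds a rating
    density         = nrow(ratings) / (n_users * n_books)
  )
}

# Toy data (hypothetical values) so the sketch runs without a download.
toy_ratings <- data.frame(
  user_id = c(1, 1, 1, 2, 2, 3, 3, 3, 3),
  book_id = c(1, 2, 3, 1, 3, 1, 2, 3, 4),
  rating  = c(5, 3, 4, 2, 5, 4, 4, 1, 3)
)
str(summarise_ratings(toy_ratings))
```

Calling summarise_ratings(ratings) on the full ratings table, once it is loaded, should reproduce the numbers in the description if the file matches it. The density value also shows how sparse the user-by-book matrix is, which matters for collaborative filtering.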

# Read the goodbooks files from the project's GitHub repository
books <- read.csv("https://raw.githubusercontent.com/ekhahm/Data612/master/Final%20project/books.csv")
ratings <- read.csv("https://raw.githubusercontent.com/ekhahm/Data612/master/Final%20project/ratings.csv")
btags <- read.csv("https://raw.githubusercontent.com/ekhahm/Data612/master/Final%20project/book_tags.csv")
tags <- read.csv("https://raw.githubusercontent.com/ekhahm/Data612/master/Final%20project/tags.csv")
# Preview the first rows of the book metadata
head(books)
##   id  book_id best_book_id  work_id books_count      isbn       isbn13
## 1  1  2767052      2767052  2792775         272 439023483 9.780439e+12
## 2  2        3            3  4640799         491 439554934 9.780440e+12
## 3  3    41865        41865  3212258         226 316015849 9.780316e+12
## 4  4     2657         2657  3275794         487  61120081 9.780061e+12
## 5  5     4671         4671   245494        1356 743273567 9.780743e+12
## 6  6 11870085     11870085 16827462         226 525478817 9.780525e+12
##                       authors original_publication_year
## 1             Suzanne Collins                      2008
## 2 J.K. Rowling, Mary GrandPré                      1997
## 3             Stephenie Meyer                      2005
## 4                  Harper Lee                      1960
## 5         F. Scott Fitzgerald                      1925
## 6                  John Green                      2012
##                             original_title
## 1                         The Hunger Games
## 2 Harry Potter and the Philosopher's Stone
## 3                                 Twilight
## 4                    To Kill a Mockingbird
## 5                         The Great Gatsby
## 6                   The Fault in Our Stars
##                                                      title language_code
## 1                  The Hunger Games (The Hunger Games, #1)           eng
## 2 Harry Potter and the Sorcerer's Stone (Harry Potter, #1)           eng
## 3                                  Twilight (Twilight, #1)         en-US
## 4                                    To Kill a Mockingbird           eng
## 5                                         The Great Gatsby           eng
## 6                                   The Fault in Our Stars           eng
##   average_rating ratings_count work_ratings_count work_text_reviews_count
## 1           4.34       4780653            4942365                  155254
## 2           4.44       4602479            4800065                   75867
## 3           3.57       3866839            3916824                   95009
## 4           4.25       3198671            3340896                   72586
## 5           3.89       2683664            2773745                   51992
## 6           4.26       2346404            2478609                  140739
##   ratings_1 ratings_2 ratings_3 ratings_4 ratings_5
## 1     66715    127936    560092   1481305   2706317
## 2     75504    101676    455024   1156318   3011543
## 3    456191    436802    793319    875073   1355439
## 4     60427    117415    446835   1001952   1714267
## 5     86236    197621    606158    936012    947718
## 6     47994     92723    327550    698471   1311871
##                                                     image_url
## 1  https://images.gr-assets.com/books/1447303603m/2767052.jpg
## 2        https://images.gr-assets.com/books/1474154022m/3.jpg
## 3    https://images.gr-assets.com/books/1361039443m/41865.jpg
## 4     https://images.gr-assets.com/books/1361975680m/2657.jpg
## 5     https://images.gr-assets.com/books/1490528560m/4671.jpg
## 6 https://images.gr-assets.com/books/1360206420m/11870085.jpg
##                                               small_image_url
## 1  https://images.gr-assets.com/books/1447303603s/2767052.jpg
## 2        https://images.gr-assets.com/books/1474154022s/3.jpg
## 3    https://images.gr-assets.com/books/1361039443s/41865.jpg
## 4     https://images.gr-assets.com/books/1361975680s/2657.jpg
## 5     https://images.gr-assets.com/books/1490528560s/4671.jpg
## 6 https://images.gr-assets.com/books/1360206420s/11870085.jpg

books.csv has metadata for each book (Goodreads IDs, authors, title, average rating, etc.).
ratings.csv contains book_id, user_id and rating for each rating.
to_read.csv provides IDs of the books marked “to read” by each user, as user_id, book_id pairs.
book_tags.csv contains tags/shelves/genres assigned by users to books. Tags in this file are represented by their IDs.
tags.csv translates tag IDs to names.
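
Because book_tags.csv stores only tag IDs, the two tag files have to be joined before the tags are readable. The sketch below shows one way to do that in base R on toy data. The column names goodreads_book_id, tag_id, count and tag_name are assumptions about the files, and all values are made up for illustration.

```r
# Toy stand-ins (hypothetical values) for book_tags.csv and tags.csv.
# Column names are assumed, not confirmed by the text above.
toy_btags <- data.frame(
  goodreads_book_id = c(1, 1, 2, 2, 2),
  tag_id            = c(10, 11, 10, 12, 11),
  count             = c(50, 20, 5, 40, 30)
)
toy_tags <- data.frame(
  tag_id   = c(10, 11, 12),
  tag_name = c("fantasy", "to-read", "classics"),
  stringsAsFactors = FALSE
)

# Attach the readable tag name to every (book, tag) pair.
named <- merge(toy_btags, toy_tags, by = "tag_id")

# Sort by book, then by descending count, and keep the first row per book,
# i.e. the tag applied most often to that book.
named <- named[order(named$goodreads_book_id, -named$count), ]
top_tag <- named[!duplicated(named$goodreads_book_id),
                 c("goodreads_book_id", "tag_name", "count")]
print(top_tag)
```

With the real data, the same merge would be applied to btags and tags as loaded earlier, provided the column names match the assumptions.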

Methodology

Collaborative filtering
SVD and ALS matrix factorization
Accuracy comparison between SVD and ALS using error metrics
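
The comparison outlined above could look roughly like the sketch below. It assumes the recommenderlab package, which the proposal does not name, and uses random toy ratings in place of ratings.csv so that it runs without a download. The split ratio, the number of given ratings, the number of factors and the helper score are illustrative choices, not settled design decisions.

```r
library(recommenderlab)  # assumed package; not named in the proposal

# Toy long-format ratings (hypothetical random values) standing in for ratings.csv.
set.seed(612)
toy <- expand.grid(user_id = 1:40, book_id = 1:25)
toy <- toy[sample(nrow(toy), 500), ]
toy$rating <- sample(1:5, nrow(toy), replace = TRUE)

# Convert (user, item, rating) triplets to recommenderlab's sparse rating matrix.
rmat <- as(toy[, c("user_id", "book_id", "rating")], "realRatingMatrix")

# Keep only users with more ratings than the number withheld as "given" below.
n_given <- 5
rmat <- rmat[rowCounts(rmat) > n_given, ]

# 80/20 split of users. For each test user, n_given ratings are shown to the
# model and the remaining ratings are held out for scoring.
scheme <- evaluationScheme(rmat, method = "split", train = 0.8,
                           given = n_given, goodRating = 4)

# Fit one model, predict the held-out ratings, and return RMSE, MSE and MAE.
score <- function(method, parameter = NULL) {
  model <- Recommender(getData(scheme, "train"), method = method,
                       parameter = parameter)
  pred <- predict(model, getData(scheme, "known"), type = "ratings")
  calcPredictionAccuracy(pred, getData(scheme, "unknown"))
}

results <- rbind(
  SVD = score("SVD", list(k = 5)),
  ALS = score("ALS", list(n_factors = 5))
)
print(results)
```

Because the toy ratings are random, the error values carry no meaning. The point is only the workflow: both models are trained on the same split and scored on the same held-out ratings, so their RMSE and MAE are directly comparable.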