Elina Azrilyan - Final Project

Loading Packages

Loading recommenderlab, dplyr, ggplot2, and kableExtra packages.

options(warn=-1)
suppressMessages(library("kableExtra"))
suppressMessages(library("ggplot2"))
suppressMessages(library("recommenderlab"))
suppressMessages(library("dplyr"))

Loading the data

In the code below we are loading the data from two csv files that include Ratings and detailed Book Information.

BX10K <- read.csv(file = "https://raw.githubusercontent.com/zygmuntz/goodbooks-10k/master/ratings.csv", header = TRUE, sep = ",", stringsAsFactors = FALSE)
BX10Ktitles <- read.csv(file = "https://raw.githubusercontent.com/zygmuntz/goodbooks-10k/master/books.csv", header = TRUE, sep = ",", stringsAsFactors = FALSE)

Inspecting Ratings dataset

Let’s inspect the ratings dataset with the “dim” and “summary” functions. There are almost 6 million ratings on a scale of 1 to 5. The mean rating is 3.92 and the median is 4, so books are rated fairly highly overall and the distribution leans toward the upper end of the scale: at least half of all ratings are 4 or 5.

dim(BX10K)

## [1] 5976479       3

summary(BX10K$rating)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    3.00    4.00    3.92    5.00    5.00

head(BX10K, 10) %>% kable(caption = "Ratings Data") %>% kable_styling("striped", full_width = TRUE)

Ratings Data
user_id	book_id	rating
1	258	5
2	4081	4
2	260	5
2	9296	5
2	2318	3
2	26	4
2	315	3
2	33	4
2	301	5
2	2686	5
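The summary statistics only hint at the shape of the distribution, so a frequency table per star value can help. The sketch below uses just the ten ratings shown in the table above as a stand-in; on the full data the same calls would be applied to BX10K$rating. The variable names are illustrative.

```r
# The ten ratings from the preview table above (a tiny stand-in for BX10K$rating).
ratings <- c(5, 4, 5, 5, 3, 4, 3, 4, 5, 5)

# Count each star value; factor() with explicit levels keeps unused values (1 and 2) visible.
rating_counts <- table(factor(ratings, levels = 1:5))

# Convert the counts into shares of all ratings.
rating_shares <- prop.table(rating_counts)

print(rating_counts)
print(round(rating_shares, 2))
```

With only ten rows this says nothing about the whole dataset; it just shows the calculation.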

Inspecting Book Titles data

Now let’s inspect our Book Titles dataset. The dataset includes some interesting details we can explore, such as publication year, average ratings, and languages.

dim(BX10Ktitles)

## [1] 10000    23

head(BX10Ktitles, 10) %>% kable(caption = "Books Data") %>% kable_styling("striped", full_width = TRUE)

Books Data
book_id	goodreads_book_id	best_book_id	work_id	books_count	isbn	isbn13	authors	original_publication_year	original_title	title	language_code	average_rating	ratings_count	work_ratings_count	work_text_reviews_count	ratings_1	ratings_2	ratings_3	ratings_4	ratings_5	image_url	small_image_url
1	2767052	2767052	2792775	272	439023483	9.780439e+12	Suzanne Collins	2008	The Hunger Games	The Hunger Games (The Hunger Games, #1)	eng	4.34	4780653	4942365	155254	66715	127936	560092	1481305	2706317	https://images.gr-assets.com/books/1447303603m/2767052.jpg	https://images.gr-assets.com/books/1447303603s/2767052.jpg
2	3	3	4640799	491	439554934	9.780440e+12	J.K. Rowling, Mary GrandPré	1997	Harry Potter and the Philosopher’s Stone	Harry Potter and the Sorcerer’s Stone (Harry Potter, #1)	eng	4.44	4602479	4800065	75867	75504	101676	455024	1156318	3011543	https://images.gr-assets.com/books/1474154022m/3.jpg	https://images.gr-assets.com/books/1474154022s/3.jpg
3	41865	41865	3212258	226	316015849	9.780316e+12	Stephenie Meyer	2005	Twilight	Twilight (Twilight, #1)	en-US	3.57	3866839	3916824	95009	456191	436802	793319	875073	1355439	https://images.gr-assets.com/books/1361039443m/41865.jpg	https://images.gr-assets.com/books/1361039443s/41865.jpg
4	2657	2657	3275794	487	61120081	9.780061e+12	Harper Lee	1960	To Kill a Mockingbird	To Kill a Mockingbird	eng	4.25	3198671	3340896	72586	60427	117415	446835	1001952	1714267	https://images.gr-assets.com/books/1361975680m/2657.jpg	https://images.gr-assets.com/books/1361975680s/2657.jpg
5	4671	4671	245494	1356	743273567	9.780743e+12	F. Scott Fitzgerald	1925	The Great Gatsby	The Great Gatsby	eng	3.89	2683664	2773745	51992	86236	197621	606158	936012	947718	https://images.gr-assets.com/books/1490528560m/4671.jpg	https://images.gr-assets.com/books/1490528560s/4671.jpg
6	11870085	11870085	16827462	226	525478817	9.780525e+12	John Green	2012	The Fault in Our Stars	The Fault in Our Stars	eng	4.26	2346404	2478609	140739	47994	92723	327550	698471	1311871	https://images.gr-assets.com/books/1360206420m/11870085.jpg	https://images.gr-assets.com/books/1360206420s/11870085.jpg
7	5907	5907	1540236	969	618260307	9.780618e+12	J.R.R. Tolkien	1937	The Hobbit or There and Back Again	The Hobbit	en-US	4.25	2071616	2196809	37653	46023	76784	288649	665635	1119718	https://images.gr-assets.com/books/1372847500m/5907.jpg	https://images.gr-assets.com/books/1372847500s/5907.jpg
8	5107	5107	3036731	360	316769177	9.780317e+12	J.D. Salinger	1951	The Catcher in the Rye	The Catcher in the Rye	eng	3.79	2044241	2120637	44920	109383	185520	455042	661516	709176	https://images.gr-assets.com/books/1398034300m/5107.jpg	https://images.gr-assets.com/books/1398034300s/5107.jpg
9	960	960	3338963	311	1416524797	9.781417e+12	Dan Brown	2000	Angels & Demons	Angels & Demons (Robert Langdon, #1)	en-CA	3.85	2001311	2078754	25112	77841	145740	458429	716569	680175	https://images.gr-assets.com/books/1303390735m/960.jpg	https://images.gr-assets.com/books/1303390735s/960.jpg
10	1885	1885	3060926	3455	679783261	9.780680e+12	Jane Austen	1813	Pride and Prejudice	Pride and Prejudice	eng	4.24	2035490	2191465	49152	54700	86485	284852	609755	1155673	https://images.gr-assets.com/books/1320399351m/1885.jpg	https://images.gr-assets.com/books/1320399351s/1885.jpg

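Top 10 Highest Rated Books

Let’s find the top 10 highest rated books. Interestingly, nearly all of them are collections or box sets from a series rather than standalone books, and several overlap: the list has three Harry Potter box sets and five Calvin and Hobbes collections. The table has 11 rows rather than 10 because top_n keeps ties, and three books are tied at 4.73.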

BX10Ktitles %>% 
  arrange(desc(average_rating)) %>% 
  top_n(10,wt = average_rating) %>% 
  select(title, original_publication_year, ratings_count, average_rating) %>%
  kable(caption = "Average Ratings") %>% kable_styling("striped", full_width = TRUE)

Average Ratings
title	original_publication_year	ratings_count	average_rating
The Complete Calvin and Hobbes	2005	28900	4.82
Words of Radiance (The Stormlight Archive, #2)	2014	73572	4.77
Harry Potter Boxed Set, Books 1-5 (Harry Potter, #1-5)	2003	33220	4.77
ESV Study Bible	2002	8953	4.76
Mark of the Lion Trilogy	1993	9081	4.76
It’s a Magical World: A Calvin and Hobbes Collection	1996	22351	4.75
Harry Potter Boxset (Harry Potter, #1-7)	1998	190050	4.74
There’s Treasure Everywhere: A Calvin and Hobbes Collection	1996	16766	4.74
Harry Potter Collection (Harry Potter, #1-6)	2005	24618	4.73
The Authoritative Calvin and Hobbes: A Calvin and Hobbes Treasury	1990	16087	4.73
The Indispensable Calvin and Hobbes	1992	14597	4.73
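The table above has 11 rows even though 10 were requested, because three books share the 10th-highest value of 4.73. The base R sketch below reproduces both behaviours on the 11 average ratings from the table; it is an illustration of the tie rule, not dplyr's internal code.

```r
# Average ratings of the 11 books in the table above, in the order shown.
avg <- c(4.82, 4.77, 4.77, 4.76, 4.76, 4.75, 4.74, 4.74, 4.73, 4.73, 4.73)

# Tie-keeping rule: keep every value at or above the 10th largest one.
cutoff <- sort(avg, decreasing = TRUE)[10]
with_ties <- avg[avg >= cutoff]

# Strict rule: sort and keep exactly the first 10 values.
exactly_ten <- head(sort(avg, decreasing = TRUE), 10)

length(with_ties)    # 11
length(exactly_ten)  # 10
```

In dplyr, slice_max(average_rating, n = 10, with_ties = FALSE) is one way to get the strict version, at the cost of dropping one of the tied books arbitrarily.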

Highest number of ratings

Now I want to look at the books with the highest number of ratings, regardless of how high those ratings are. I am going to assume that these books have the most readers, whether or not those readers enjoyed them. The assumption looks reasonable, as all of these titles are bestsellers.

BX10Ktitles %>% 
  arrange(desc(ratings_count)) %>% 
  top_n(10,wt = ratings_count) %>% 
  select(title, original_publication_year, ratings_count, average_rating) %>%
  kable(caption = "Ratings Count") %>% kable_styling("striped", full_width = TRUE)

Ratings Count
title	original_publication_year	ratings_count	average_rating
The Hunger Games (The Hunger Games, #1)	2008	4780653	4.34
Harry Potter and the Sorcerer’s Stone (Harry Potter, #1)	1997	4602479	4.44
Twilight (Twilight, #1)	2005	3866839	3.57
To Kill a Mockingbird	1960	3198671	4.25
The Great Gatsby	1925	2683664	3.89
The Fault in Our Stars	2012	2346404	4.26
The Hobbit	1937	2071616	4.25
The Catcher in the Rye	1951	2044241	3.79
Pride and Prejudice	1813	2035490	4.24
Angels & Demons (Robert Langdon, #1)	2000	2001311	3.85

Highest number of 1 star ratings

Now I want to take a look at the books with the highest number of 1 star ratings to see which books are most disliked.

BX10Ktitles %>% 
  arrange(desc(ratings_1)) %>% 
  top_n(10,wt = ratings_1) %>% 
  select(title, original_publication_year, ratings_1, average_rating) %>%
  kable(caption = "Worst Rated Books") %>% kable_styling("striped", full_width = TRUE)

Worst Rated Books
title	original_publication_year	ratings_1	average_rating
Twilight (Twilight, #1)	2005	456191	3.57
Fifty Shades of Grey (Fifty Shades, #1)	2011	165455	3.67
The Catcher in the Rye	1951	109383	3.79
New Moon (Twilight, #2)	2006	102837	3.52
Breaking Dawn (Twilight, #4)	2008	100994	3.70
Eat, Pray, Love	2006	100373	3.51
Lord of the Flies	1954	92779	3.64
The Great Gatsby	1925	86236	3.89
Eclipse (Twilight, #3)	2007	83094	3.69
Angels & Demons (Robert Langdon, #1)	2000	77841	3.85

Interestingly enough, this list includes one of my personal least favorite books, “Lord of the Flies”, as well as ALL the books from my beloved “Twilight” series (don’t judge if you haven’t read it). I guess a high number of 1 star ratings also says something about how popular a book is.
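Because raw 1 star counts grow with popularity, the share of 1 star ratings may be a fairer measure of how disliked a book is. The sketch below computes that share for four books whose full star breakdown appears in the tables above; the column names total and share_1 are introduced here for illustration.

```r
# Star counts copied from the Books Data table above (shortened titles).
books <- data.frame(
  title     = c("Twilight", "The Great Gatsby", "The Catcher in the Rye", "Angels & Demons"),
  ratings_1 = c(456191, 86236, 109383, 77841),
  ratings_2 = c(436802, 197621, 185520, 145740),
  ratings_3 = c(793319, 606158, 455042, 458429),
  ratings_4 = c(875073, 936012, 661516, 716569),
  ratings_5 = c(1355439, 947718, 709176, 680175),
  stringsAsFactors = FALSE
)

# Total number of star ratings per book, then the fraction that are 1 star.
star_cols <- paste0("ratings_", 1:5)
books$total <- rowSums(books[, star_cols])
books$share_1 <- books$ratings_1 / books$total

# Sort by share instead of raw count.
books <- books[order(books$share_1, decreasing = TRUE), ]
print(books[, c("title", "ratings_1", "share_1")])
```

Twilight stays on top by this measure too, but The Great Gatsby drops below Angels & Demons even though it has more 1 star ratings in absolute terms.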

Publication Year

Let’s take a look at the data by publication year.

ggplot(data = BX10Ktitles, mapping = aes(x = original_publication_year, y = average_rating)) +
    geom_point(alpha = 0.1, aes(color = language_code)) +
    xlim(1800,2020)

There are far more books in the dataset from recent years. There are also many more books rated below 3.5 stars from the last two decades, while hardly any from before 2000 score that low. This may mean that quality has gone down, but it may also simply reflect the much larger number of recent books in the dataset.
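One way to separate "more low-rated books" from "more books overall" is to compare the share of low-rated books per period rather than the raw number of points on the plot. The sketch below shows the calculation on the ten books from the Books Data preview; the 2000 cut-off and the 3.5 threshold come from the text, while the period labels are illustrative.

```r
# Publication year and average rating of the ten books in the preview table.
books <- data.frame(
  year           = c(2008, 1997, 2005, 1960, 1925, 2012, 1937, 1951, 2000, 1813),
  average_rating = c(4.34, 4.44, 3.57, 4.25, 3.89, 4.26, 4.25, 3.79, 3.85, 4.24)
)

# Label each book by period.
books$period <- ifelse(books$year < 2000, "before 2000", "2000 and later")

# Number of books per period, and the share of them rated below 3.5.
n_books   <- tapply(books$average_rating, books$period, length)
share_low <- tapply(books$average_rating < 3.5, books$period, mean)

print(n_books)
print(share_low)
```

None of these ten bestsellers falls below 3.5, so the shares are zero here; the comparison only becomes informative on the full BX10Ktitles data.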

Language

Let’s see if there is a relationship between book language and average rating.

ggplot(data = BX10Ktitles, mapping = aes(y = average_rating, x = language_code)) +
  geom_boxplot() + theme(axis.text.x = element_text(angle = 45, hjust = 1))

Books have fairly high ratings in general regardless of language, but some patterns are apparent. The more books the dataset has in a language, the lower that language’s average rating tends to be, and some very highly rated languages have a very low count of books.
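The preview of the Books Data shows that English appears under several codes (eng, en-US, en-CA), which splits one language across multiple boxes. The sketch below merges those three codes before counting and averaging, using the ten preview rows; treating them as one group is an assumption made for this comparison, and the new column name language is illustrative.

```r
# Language code and average rating of the ten books in the preview table.
books <- data.frame(
  language_code  = c("eng", "eng", "en-US", "eng", "eng", "eng", "en-US", "eng", "en-CA", "eng"),
  average_rating = c(4.34, 4.44, 3.57, 4.25, 3.89, 4.26, 4.25, 3.79, 3.85, 4.24),
  stringsAsFactors = FALSE
)

# Codes treated as English here (only the ones visible in the preview).
english_codes <- c("eng", "en-US", "en-CA")
books$language <- ifelse(books$language_code %in% english_codes, "en", books$language_code)

# Book counts before and after merging, and the mean rating per merged language.
counts_raw    <- table(books$language_code)
counts_merged <- table(books$language)
mean_by_lang  <- tapply(books$average_rating, books$language, mean)

print(counts_raw)
print(counts_merged)
print(mean_by_lang)
```

Showing the count next to each mean also makes it easier to see which highly rated languages rest on only a handful of books.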

Recommendations

The next step is to develop some recommendation models and evaluate their performance.

Creating a realRatingMatrix

Now we can create a real rating matrix to make the data easier to work with.

BXMatrix <- as(BX10K, "realRatingMatrix")
dim(BXMatrix@data)

## [1] 53424 10000
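The matrix dimensions and the rating count reported earlier are enough to estimate how sparse the matrix is. The short calculation below uses only numbers printed in this report; the variable names are illustrative.

```r
# Figures taken from the dim() outputs above.
n_users   <- 53424
n_books   <- 10000
n_ratings <- 5976479

# Fraction of user-book cells that actually hold a rating.
density <- n_ratings / (n_users * n_books)

# Average number of ratings per user.
ratings_per_user <- n_ratings / n_users

round(density, 4)           # about 0.0112, i.e. roughly 1.1% of cells are filled
round(ratings_per_user, 1)  # about 111.9
```

A matrix this sparse is the reason the later steps sample and filter the data before training.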

Let’s take a look at the histogram of ratings distribution.

hist(getRatings(BXMatrix), main="Book ratings", breaks = c(0:5), col = c("red", "orange", "gray", "blue", "green"))

Let’s decrease the size of our dataset by randomly selecting 30% of the ratings. Ideally I wanted to sample 30% of users rather than 30% of individual ratings, but that slowed my machine down too much. Given the size of the dataset, a 30% sample of ratings works well enough.

BX10KSample<-BX10K %>% sample_frac(0.3, replace = FALSE)
BXMatrix <- as(BX10KSample, "realRatingMatrix")
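For comparison, sampling by user keeps every rating of each selected user instead of thinning all users' histories. The sketch below shows one way to do that in base R on a small made-up data frame; the toy data and the names kept_users and ratings_sample are hypothetical, and this is not the approach used in the rest of this report.

```r
set.seed(11)

# Toy ratings: 10 hypothetical users who each rated the same 3 books.
ratings <- data.frame(
  user_id = rep(1:10, each = 3),
  book_id = rep(1:3, times = 10),
  rating  = sample(1:5, 30, replace = TRUE)
)

# Pick 30% of the distinct users, then keep all rows belonging to them.
user_ids   <- unique(ratings$user_id)
n_keep     <- round(0.3 * length(user_ids))
kept_users <- sample(user_ids, size = n_keep)
ratings_sample <- ratings[ratings$user_id %in% kept_users, ]

# Every kept user still has their complete rating history.
table(ratings_sample$user_id)
```

The row filter is a single vectorised %in% lookup, so on the real data the cost is dominated by the size of BX10K itself.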

I will narrow the dataset down to users who rated more than 50 books and books that have more than 100 ratings in order to improve the quality of the recommendations.

BX_Ratings <- BXMatrix[rowCounts(BXMatrix) > 50, colCounts(BXMatrix) > 100]
BX_Ratings

## 2077 x 3739 rating matrix of class 'realRatingMatrix' with 99182 ratings.
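A quick check on the printed summary shows that the filtered matrix averages fewer ratings per user and per book than the thresholds suggest. This is most likely because rowCounts and colCounts are evaluated on the matrix before subsetting, so dropping rows and columns together removes some of the ratings that were counted. The figures below come straight from the output above.

```r
# Figures taken from the printed summary of BX_Ratings.
n_users   <- 2077
n_books   <- 3739
n_ratings <- 99182

# Average ratings per remaining user and per remaining book.
mean_per_user <- n_ratings / n_users   # about 47.8, below the 50 threshold
mean_per_book <- n_ratings / n_books   # about 26.5, below the 100 threshold

round(c(per_user = mean_per_user, per_book = mean_per_book), 1)
```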

Test/Train Split

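Let’s split the data into training and test sets using a 75%/25% split. For each test user, 5 ratings are given to the recommender (given = 5) and the rest are held out for evaluation; ratings of 3 or higher count as good (goodRating = 3).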

set.seed(11)
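# Name the scheme eval_scheme so it does not mask base::eval().
eval_scheme <- evaluationScheme(BX_Ratings, method = "split", train = 0.75, given = 5, goodRating = 3)
train <- getData(eval_scheme, "train")      # ratings used to fit the models
known <- getData(eval_scheme, "known")      # test users' given ratings
unknown <- getData(eval_scheme, "unknown")  # test users' held-out ratings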
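The known and unknown sets come from splitting each test user's ratings: a fixed number are shown to the recommender and the rest are hidden for scoring. The base R sketch below illustrates that idea for a single made-up user with given = 5; it is a conceptual illustration with hypothetical ratings, not recommenderlab's internal code.

```r
set.seed(11)

# One hypothetical test user's ratings, named by book_id.
user_ratings <- c("4" = 5, "5" = 3, "8" = 4, "11" = 5, "13" = 4, "14" = 2, "32" = 4, "54" = 5)

# Number of ratings revealed to the recommender.
given <- 5

# Randomly choose which ratings are revealed; the remainder is held out.
known_idx    <- sample(seq_along(user_ratings), size = given)
known_part   <- user_ratings[known_idx]
unknown_part <- user_ratings[-known_idx]

known_part    # what the model sees for this user
unknown_part  # what the predictions are scored against
```

A user needs more than given ratings for any to remain in the unknown part, which is one reason to filter out users with few ratings first.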

Comparing SVD/IBCF/UBCF/Popular Methods

Let’s compare the SVD, IBCF, UBCF, and Popular recommenders. We will train all 4 methods and compare their RMSE.

set.seed(44)
recomSVD <- Recommender(train, method = "SVD")
pred <- predict(object = recomSVD, newdata = known, type = "ratings")
eval_accuracy_SVD <- calcPredictionAccuracy(pred, unknown, byUser = FALSE)

set.seed(44)
recomIBCF <- Recommender(train, method = "IBCF")
pred <- predict(object = recomIBCF, newdata = known, type = "ratings")
eval_accuracy_IBCF <- calcPredictionAccuracy(pred, unknown, byUser = FALSE)

set.seed(44)
recomUBCF <- Recommender(train, method = "UBCF")
pred <- predict(object = recomUBCF, newdata = known, type = "ratings")
eval_accuracy_UBCF <- calcPredictionAccuracy(pred, unknown, byUser = FALSE)

set.seed(44)
recomPOP <- Recommender(train, method = "POPULAR")
pred <- predict(object = recomPOP, newdata = known, type = "ratings")
eval_accuracy_POP <- calcPredictionAccuracy(pred, unknown, byUser = FALSE)

rbind(eval_accuracy_SVD, eval_accuracy_IBCF, eval_accuracy_UBCF, eval_accuracy_POP)

##                         RMSE       MSE       MAE
## eval_accuracy_SVD  0.9985846 0.9971711 0.7760532
## eval_accuracy_IBCF 1.1969131 1.4326010 0.8592637
## eval_accuracy_UBCF 0.9956455 0.9913099 0.7749485
## eval_accuracy_POP  0.9815095 0.9633609 0.7671001

Popular has the lowest RMSE (0.98), followed closely by UBCF and SVD (both about 1.00); the differences among these three are small. SVD was the fastest of the 4 methods, while UBCF and Popular took significantly longer, so SVD offers the best trade-off between accuracy and speed. IBCF took a very long time to calculate and has the worst RMSE (1.20).
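For reference, the three columns in the accuracy table follow the standard definitions of RMSE, MSE, and MAE. The sketch below writes them out in base R and applies them to a tiny made-up example; the function names and the toy vectors are illustrative, and the last line checks that the SVD row of the table is internally consistent.

```r
# Standard error metrics between actual and predicted ratings.
mse  <- function(actual, predicted) mean((actual - predicted)^2, na.rm = TRUE)
rmse <- function(actual, predicted) sqrt(mse(actual, predicted))
mae  <- function(actual, predicted) mean(abs(actual - predicted), na.rm = TRUE)

# Toy example: three held-out ratings and hypothetical predictions.
actual    <- c(5, 3, 4)
predicted <- c(4, 3, 2)

toy_mse  <- mse(actual, predicted)   # (1 + 0 + 4) / 3
toy_rmse <- rmse(actual, predicted)  # square root of the MSE
toy_mae  <- mae(actual, predicted)   # (1 + 0 + 2) / 3

# RMSE is the square root of MSE, so the SVD row should satisfy this.
svd_consistent <- abs(sqrt(0.9971711) - 0.9985846) < 1e-6
```

Because RMSE squares the errors, it penalises large misses more heavily than MAE, which is why the IBCF gap is wider in RMSE than in MAE.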

Recommendations

The next and final step is to make some recommendations. We will take the first user in the test set and get their top 10 book recommendations based on their ratings.

set.seed(44)
pred2 <- predict(object = recomSVD, newdata = unknown, type = "topNList", n = 10)

recc1 <- pred2@items[[1]]
recc_book_user_1 <- pred2@itemLabels[recc1]

recc_book_user_1 <- as.data.frame(recc_book_user_1)
colnames(recc_book_user_1) <- "book_id"
recc_book_user_1 %>% kable(caption = "User1 Predictions") %>% kable_styling("striped", full_width = TRUE)

User1 Predictions
book_id
5
14
32
8
4
65
13
11
72
54

book_labels <- merge(recc_book_user_1, BX10Ktitles,
                             by = "book_id", all.x = TRUE, all.y = FALSE, sort = FALSE)
book_labels %>% kable(caption = "Books Recommendations") %>% kable_styling("striped", full_width = TRUE)

Books Recommendations
book_id	goodreads_book_id	best_book_id	work_id	books_count	isbn	isbn13	authors	original_publication_year	original_title	title	language_code	average_rating	ratings_count	work_ratings_count	work_text_reviews_count	ratings_1	ratings_2	ratings_3	ratings_4	ratings_5	image_url	small_image_url
5	4671	4671	245494	1356	743273567	9.780743e+12	F. Scott Fitzgerald	1925	The Great Gatsby	The Great Gatsby	eng	3.89	2683664	2773745	51992	86236	197621	606158	936012	947718	https://images.gr-assets.com/books/1490528560m/4671.jpg	https://images.gr-assets.com/books/1490528560s/4671.jpg
14	7613	7613	2207778	896	452284244	9.780452e+12	George Orwell	1945	Animal Farm: A Fairy Story	Animal Farm	eng	3.87	1881700	1982987	35472	66854	135147	433432	698642	648912	https://images.gr-assets.com/books/1424037542m/7613.jpg	https://images.gr-assets.com/books/1424037542s/7613.jpg
32	890	890	40283	373	142000671	9.780142e+12	John Steinbeck	1937	Of Mice and Men	Of Mice and Men	eng	3.84	1467496	1518741	24642	46630	110856	355169	532291	473795	https://images.gr-assets.com/books/1437235233m/890.jpg	https://images.gr-assets.com/books/1437235233s/890.jpg
8	5107	5107	3036731	360	316769177	9.780317e+12	J.D. Salinger	1951	The Catcher in the Rye	The Catcher in the Rye	eng	3.79	2044241	2120637	44920	109383	185520	455042	661516	709176	https://images.gr-assets.com/books/1398034300m/5107.jpg	https://images.gr-assets.com/books/1398034300s/5107.jpg
4	2657	2657	3275794	487	61120081	9.780061e+12	Harper Lee	1960	To Kill a Mockingbird	To Kill a Mockingbird	eng	4.25	3198671	3340896	72586	60427	117415	446835	1001952	1714267	https://images.gr-assets.com/books/1361975680m/2657.jpg	https://images.gr-assets.com/books/1361975680s/2657.jpg
65	4981	4981	1683562	241	385333846	9.780385e+12	Kurt Vonnegut Jr.	1969	Slaughterhouse-Five, or The Children’s Crusade: A Duty-Dance with Death	Slaughterhouse-Five	eng	4.06	846488	891762	19646	24964	45518	152442	300948	367890	https://images.gr-assets.com/books/1440319389m/4981.jpg	https://images.gr-assets.com/books/1440319389s/4981.jpg
13	5470	5470	153313	995	451524934	9.780452e+12	George Orwell, Erich Fromm, Celâl Üster	1949	Nineteen Eighty-Four	1984	eng	4.14	1956832	2053394	45518	41845	86425	324874	692021	908229	https://images.gr-assets.com/books/1348990566m/5470.jpg	https://images.gr-assets.com/books/1348990566s/5470.jpg
11	77203	77203	3295919	283	1594480001	9.781594e+12	Khaled Hosseini	2003	The Kite Runner	The Kite Runner	eng	4.26	1813044	1878095	59730	34288	59980	226062	628174	929591	https://images.gr-assets.com/books/1484565687m/77203.jpg	https://images.gr-assets.com/books/1484565687s/77203.jpg
72	11588	11588	849585	289	450040186	9.780450e+12	Stephen King	1977	The Shining	The Shining (The Shining #1)	eng	4.17	791850	830881	14936	18487	28981	123862	277393	382158	https://images.gr-assets.com/books/1353277730m/11588.jpg	https://images.gr-assets.com/books/1353277730s/11588.jpg
54	11	386162	3078186	257	345391802	9.780345e+12	Douglas Adams	1979	The Hitchhiker’s Guide to the Galaxy	The Hitchhiker’s Guide to the Galaxy (Hitchhiker’s Guide to the Galaxy, #1)	en-US	4.20	936782	1006479	20345	21764	41962	145173	299579	498001	https://images.gr-assets.com/books/1327656754m/11.jpg	https://images.gr-assets.com/books/1327656754s/11.jpg
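merge does not promise to keep the recommendation order in every case, even with sort = FALSE, so an explicit lookup with match is a more predictable way to attach titles in ranked order. The sketch below uses the first five recommended ids and their titles from the table above; the small titles data frame stands in for BX10Ktitles, and the other names are illustrative.

```r
# First five recommended book ids, in ranked order (labels arrive as character).
recommended_ids <- c("5", "14", "32", "8", "4")

# Stand-in for BX10Ktitles with just the columns needed here.
titles <- data.frame(
  book_id = c(4, 5, 8, 14, 32),
  title   = c("To Kill a Mockingbird", "The Great Gatsby", "The Catcher in the Rye",
              "Animal Farm", "Of Mice and Men"),
  stringsAsFactors = FALSE
)

# match() returns, for each recommended id, its row position in titles,
# so indexing with it preserves the recommendation order.
rows <- match(as.integer(recommended_ids), titles$book_id)
recommended_titles <- titles$title[rows]

print(recommended_titles)
```

Selecting only the title and author columns at this point would also make the final table much easier to read than the full 23-column output.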