We will build a recommender system to recommend books to users.
The data comes from the Book-Crossing data set, which is available here: http://www2.informatik.uni-freiburg.de/~cziegler/BX/
The data was collected by Cai-Nicolas Ziegler (DBIS Freiburg) in a 4-week crawl (August / September 2004) of the Book-Crossing community, with kind permission from Ron Hornbaker, CTO of Humankind Systems. It contains 278,858 users (anonymized but with demographic information) providing 1,149,780 ratings (explicit / implicit) about 271,379 books.
The explicit book ratings range from 1 to 10, with higher values indicating a higher rating. Books that weren’t explicitly rated are designated with a zero value, which is how the data set encodes an implicit rating.
There are 3 data frames. The books data frame contains a list of books, ISBN, author and year of publication.
ISBN | Book.Title | Book.Author |
---|---|---|
195153448 | Classical Mythology | mark p. o. morford |
2005018 | Clara Callan | richard bruce wright |
60973129 | Decision in Normandy | carlo d’este |
374157065 | Flu: The Story of the Great Influenza Pandemic of 1918 and the Search for the Virus That Caused It | gina bari kolata |
393045218 | The Mummies of Urumchi | e. j. w. barber |
399135782 | The Kitchen God’s Wife | amy tan |
The ratings data frame contains the user id, ISBN and book rating.
User.ID | ISBN | Book.Rating |
---|---|---|
276725 | 3454510439 | 0 |
276726 | 155061224 | 5 |
276727 | 446520802 | 0 |
276729 | 5216561539 | 3 |
276729 | 521795028 | 6 |
276733 | 2080674722 | 0 |
The data from the different data frames needs to be joined. Some users rated over 5000 books, which is implausible. Other users rated hardly any books. To build a meaningful recommender system on a matrix that is not too sparse, we kept only the users who rated between 150 and 225 books.
The data frame needs to be arranged with each user as a different row and each column as a different book.
## 1903 2276 4017 4385 6251 6543
## 0.5050965 0.5360322 1.1265128 1.9013229 0.8060269 -0.2483192
1055607 | 0.1654739 |
2005557 | -0.8345261 |
2154900 | 0.1654739 |
2197154 | -0.8345261 |
2251760 | 2.1654739 |
2550563 | 1.1654739 |
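The filtering and reshaping steps described above can be sketched as follows. This is a minimal illustration in Python using only the standard library, not the R code behind this analysis. The function name `filter_and_pivot` and the use of `None` for a missing rating are assumptions made for the example; the thresholds 150 and 225 come from the text.

```python
from collections import Counter

def filter_and_pivot(ratings, min_count=150, max_count=225):
    """Keep users with min_count..max_count ratings and build a user x book table.

    ratings: list of (user_id, isbn, rating) tuples.
    Returns (matrix, users, books); matrix[i][j] is the rating given by
    users[i] to books[j], or None when that user has no rating for the book.
    """
    # Count how many books each user rated.
    counts = Counter(user for user, _, _ in ratings)
    kept = [(u, b, r) for u, b, r in ratings
            if min_count <= counts[u] <= max_count]

    # One row per remaining user, one column per remaining book.
    users = sorted({u for u, _, _ in kept})
    books = sorted({b for _, b, _ in kept})
    row = {u: i for i, u in enumerate(users)}
    col = {b: j for j, b in enumerate(books)}

    matrix = [[None] * len(books) for _ in users]
    for u, b, r in kept:
        matrix[row[u]][col[b]] = r
    return matrix, users, books
```

Most cells of the resulting table stay empty, which is the sparsity that the rest of the analysis has to work around.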
80% of the data will be used to train the model and 20% will be used to test it. The data first needs to be converted into a real rating matrix (the realRatingMatrix class of the recommenderlab package).
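As a rough illustration of the 80/20 split, the sketch below divides user IDs at random. It assumes the split is made by user, which the text does not state explicitly; the function name `split_users` and the fixed seed are hypothetical choices for the example, written in Python and not taken from the original R code.

```python
import random

def split_users(user_ids, train_fraction=0.8, seed=42):
    """Shuffle the user IDs and split them into training and test groups."""
    ids = list(user_ids)
    rng = random.Random(seed)   # fixed seed so the split is reproducible
    rng.shuffle(ids)
    cut = int(round(len(ids) * train_fraction))
    return ids[:cut], ids[cut:]
```

The model is then fitted on the training users only, and the test users are held back to check its recommendations.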
A user-based collaborative filter (UBCF) recommends the books that are most preferred by similar users. The similarity between users is measured by cosine similarity. The 5 best predicted books for each of the first 10 test users are shown below.
## [[1]]
## character(0)
##
## [[2]]
## [1] "679723161" "440211697" "451132378" "618002227" "671727796"
##
## [[3]]
## [1] "446310786" "8041052639" "833510266" "156528207" "679723161"
##
## [[4]]
## [1] "449005615" "385504209" "440222656" "446672211" "64408663"
##
## [[5]]
## character(0)
##
## [[6]]
## [1] "60977493" "60530839" "385121679" "425188809" "451160444"
##
## [[7]]
## [1] "446310786" "525947647" "812550706" "877017883" "446610038"
##
## [[8]]
## [1] "446310786" "446676098" "156528207" "553284789" "440219078"
##
## [[9]]
## [1] "60977493" "60530839" "385121679" "425188809" "451160444"
##
## [[10]]
## character(0)
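The idea behind the user-based filter can be sketched in a few lines. This is a simplified Python illustration, not the library implementation that produced the output above. It assumes each user is a dictionary mapping ISBN to rating, that cosine similarity is computed over the books two users have both rated, and that a book's score is the similarity-weighted mean of the neighbours' ratings. The names `cosine` and `recommend` and the neighbourhood size `k` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity over co-rated books; None when it is undefined."""
    common = [b for b in u if b in v]
    if not common:
        return None
    dot = sum(u[b] * v[b] for b in common)
    norm_u = math.sqrt(sum(u[b] ** 2 for b in common))
    norm_v = math.sqrt(sum(v[b] ** 2 for b in common))
    if norm_u == 0 or norm_v == 0:
        return None
    return dot / (norm_u * norm_v)

def recommend(target, others, n=5, k=25):
    """Return up to n ISBNs the target has not rated, ranked by predicted score."""
    sims = []
    for other in others:
        s = cosine(target, other)
        if s is not None and s > 0:
            sims.append((s, other))
    # Keep the k most similar users as the neighbourhood.
    neighbours = sorted(sims, key=lambda pair: pair[0], reverse=True)[:k]

    totals, weights = {}, {}
    for s, other in neighbours:
        for book, rating in other.items():
            if book in target:
                continue  # never recommend a book the user already rated
            totals[book] = totals.get(book, 0.0) + s * rating
            weights[book] = weights.get(book, 0.0) + s
    scores = {b: totals[b] / weights[b] for b in totals}
    return sorted(scores, key=lambda b: (-scores[b], b))[:n]
```

In this sketch a user with no usable neighbours gets an empty list, which is analogous to the `character(0)` entries in the output above.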
The irlba function was used to compute a truncated singular value decomposition (SVD). The number of features was varied from 2 to 20, and the root mean square error (RMSE) was calculated each time.
The following function takes a user ID and a book ISBN and returns the rating predicted by the SVD model.
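For reference, the error measure can be sketched as below. This Python snippet is only an illustration and assumes that the error is computed over the ratings that are actually known, with `None` marking a missing rating; the function name `rmse` is hypothetical.

```python
import math

def rmse(actual, predicted):
    """Root mean square error over the positions where a rating is known."""
    pairs = [(a, p) for a, p in zip(actual, predicted) if a is not None]
    if not pairs:
        raise ValueError("no known ratings to compare")
    return math.sqrt(sum((a - p) ** 2 for a, p in pairs) / len(pairs))
```

Repeating this calculation for each number of features from 2 to 20 gives one error value per setting, and the setting with the lowest error is the natural candidate to keep.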
## [1] "4385"
## [1] "6543"
## 679723161
## 9.780782
## 440211697
## 9.001933
## 451132378
## 9.253236
## 618002227
## 8.083212
## 671727796
## 7.819176
## [1] "12982"
## 446310786
## 9.021302
## 8041052639
## 8.769924
## 833510266
## 9.71206
## 156528207
## 9.939415
## 679723161
## 9.452563
## [1] "28591"
## 449005615
## 8.814483
## 385504209
## 8.599126
## 440222656
## 7.967524
## 446672211
## 8.434496
## 64408663
## 8.795404
## [1] "35433"
## [1] "51386"
## 60977493
## 9.468746
## 60530839
## 10
## 385121679
## 10
## 425188809
## 10
## 451160444
## 10
## [1] "56447"
## 446310786
## 10
## 525947647
## 10
## 812550706
## 7.831551
## 877017883
## 9.538768
## 446610038
## 9.783758
## [1] "76151"
## 446310786
## 9.710176
## 446676098
## 10
## 156528207
## 10
## 553284789
## 9.55149
## 440219078
## 10
## [1] "99441"
## 60977493
## 8.867078
## 60530839
## 10
## 385121679
## 10
## 425188809
## 10
## 451160444
## 10
## [1] "110483"
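A prediction from a truncated SVD can be sketched as the product of one user's factor row, the singular values, and one book's factor row. The Python snippet below is an assumption-laden illustration and not the function used to produce the output above: the name `predict_rating`, the optional `offset` (for example a user mean, if the ratings were centered before decomposition), and the clipping to the 1 to 10 rating scale are all hypothetical choices.

```python
def predict_rating(u_row, d, v_row, offset=0.0, low=1.0, high=10.0):
    """Predict one rating from rank-k SVD factors.

    u_row: the user's row of U (length k)
    d:     the k singular values
    v_row: the book's row of V (length k)
    """
    raw = sum(ui * di * vi for ui, di, vi in zip(u_row, d, v_row)) + offset
    # Assumption: keep the prediction inside the valid rating range.
    return max(low, min(high, raw))
```

For example, `predict_rating([1.0, 0.5], [4.0, 2.0], [2.0, 1.0])` evaluates 1.0·4.0·2.0 + 0.5·2.0·1.0 and returns 9.0.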
To compare the UBCF and SVD predictions, we took the first 10 users from the test set and calculated the SVD-predicted ratings for the top 5 books that the user-based collaborative filter chose for each of them. As you can see, the predicted ratings for these books are very high: the lowest is about 7.8, and many are 9 or 10. This suggests that the two models give similar results.
In order to build diversity into the recommendation system, we will compare the similarity between books. We will recommend a book that is most dissimilar to another book. To start, let’s look at the first 10 books and first 10 users.
Let’s focus on user 6543. We start with 20 books suggested by our UBCF method. The similarities for some pairs of these books are shown below in graphical and numeric forms.
## $`4385`
## character(0)
## $`6543`
## [1] "679723161" "440211697" "451132378" "618002227" "671727796"
## [6] "812550706" "385484518" "345361792" "413626709" "517123207"
## [11] "684854422" "836218981" "4251177439" "5532124639" "60609575"
## [16] "553071289" "805005889" "345323750" "670891576" "684807610"
## 679723161 440211697 451132378 618002227 671727796 812550706
## 440211697 NA
## 451132378 1.0000000 NA
## 618002227 1.0000000 NA 1.0000000
## 671727796 1.0000000 NA 1.0000000 NA
## 812550706 NA NA 1.0000000 0.9778024 1.0000000
## 385484518 1.0000000 NA 1.0000000 NA 1.0000000 NA
## 345361792 1.0000000 NA 1.0000000 NA 0.9938837 NA
## 413626709 1.0000000 NA 1.0000000 NA 1.0000000 NA
## 517123207 1.0000000 NA 1.0000000 NA 1.0000000 NA
## 684854422 1.0000000 NA 1.0000000 NA 1.0000000 NA
## 836218981 1.0000000 NA 1.0000000 NA 1.0000000 NA
## 4251177439 1.0000000 NA 1.0000000 NA 1.0000000 NA
## 5532124639 1.0000000 NA 1.0000000 NA 1.0000000 NA
## 60609575 NA NA NA NA NA NA
## 553071289 NA NA NA NA NA NA
## 805005889 NA NA NA NA NA NA
## 345323750 NA NA NA NA 1.0000000 NA
## 670891576 1.0000000 NA NA NA NA NA
## 684807610 NA NA NA NA NA NA
## 385484518 345361792 413626709 517123207 684854422 836218981
## 440211697
## 451132378
## 618002227
## 671727796
## 812550706
## 385484518
## 345361792 1.0000000
## 413626709 1.0000000 1.0000000
## 517123207 1.0000000 1.0000000 1.0000000
## 684854422 1.0000000 1.0000000 1.0000000 1.0000000
## 836218981 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
## 4251177439 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
## 5532124639 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
## 60609575 NA NA NA NA NA NA
## 553071289 NA NA NA NA NA NA
## 805005889 NA NA NA NA NA NA
## 345323750 NA 1.0000000 NA NA NA NA
## 670891576 NA NA NA NA NA NA
## 684807610 NA NA NA NA NA NA
## 4251177439 5532124639 60609575 553071289 805005889 345323750
## 440211697
## 451132378
## 618002227
## 671727796
## 812550706
## 385484518
## 345361792
## 413626709
## 517123207
## 684854422
## 836218981
## 4251177439
## 5532124639 1.0000000
## 60609575 NA NA
## 553071289 NA NA 1.0000000
## 805005889 NA NA 1.0000000 1.0000000
## 345323750 NA NA NA NA NA
## 670891576 NA NA NA NA NA NA
## 684807610 NA NA NA NA NA NA
## 670891576
## 440211697
## 451132378
## 618002227
## 671727796
## 812550706
## 385484518
## 345361792
## 413626709
## 517123207
## 684854422
## 836218981
## 4251177439
## 5532124639
## 60609575
## 553071289
## 805005889
## 345323750
## 670891576
## 684807610 1.0000000
We create 500 random models with the intention of keeping the books with the furthest cosine distance, so that we can offer a diversity of suggestions to users. The higher the cosine distance, the less related the books and the yellower the image appears.
We will predict 5 books for user 6543 by choosing books that are most dissimilar to that user’s previous book ratings. Because the data is so sparse, we chose to interpret NA as an indication of dissimilarity. Most of the calculated similarities were close to 1 because the data is sparse, and for many pairs no similarity could be calculated at all. The following are the books chosen.
## 684807610 345323750 440211697 60609575 517123207
## 684807610 0 NA NA NA NA
## 345323750 NA 0 NA NA NA
## 440211697 NA NA 0 NA NA
## 60609575 NA NA NA 0 NA
## 517123207 NA NA NA NA 0
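One way to read the random search described above is sketched below in Python. It is an interpretation, not the original code: it assumes that each of the 500 random models is a random subset of 5 candidate books, that the subset with the lowest total pairwise similarity is kept, and that a missing (NA) similarity counts as 0, meaning fully dissimilar. The names `total_similarity` and `most_diverse` are hypothetical.

```python
import itertools
import random

def total_similarity(subset, sim, missing=0.0):
    """Sum of pairwise similarities; a missing pair counts as `missing`."""
    total = 0.0
    for a, b in itertools.combinations(subset, 2):
        s = sim.get((a, b), sim.get((b, a)))
        total += missing if s is None else s
    return total

def most_diverse(candidates, sim, size=5, trials=500, seed=0):
    """Try `trials` random subsets and keep the least mutually similar one."""
    rng = random.Random(seed)
    best, best_score = None, float("inf")
    for _ in range(trials):
        subset = rng.sample(candidates, size)
        score = total_similarity(subset, sim)
        if score < best_score:
            best, best_score = subset, score
    return best, best_score
```

Under these assumptions a subset whose pairs are all NA scores 0 and always wins, which is consistent with the all-NA matrix shown above for the chosen books.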
The following are the ISBNs, titles and authors of the books chosen as recommendations.
User.ID | 6742939 | 6550061 | 2005557 | 6380263 | 6163041 |
---|---|---|---|---|---|
1903 | NA | NA | NA | NA | NA |
2276 | NA | NA | NA | NA | NA |
4017 | NA | NA | NA | NA | NA |
4385 | NA | NA | NA | NA | NA |
6251 | NA | NA | NA | NA | NA |
6543 | NA | NA | NA | NA | NA |
## ISBN Book.Title
## 12567 "6742939" "The Weirdstone of Brisingamen"
## 43690 "6163041" "Bahama Crisis Uk"
## 115315 "6380263" "Perfection of the Morning an Apprentices"
## 213867 "2005557" "A Blade of Grass"
## 220044 "6550061" "Splitting"
## Book.Author
## 12567 "alan garner"
## 43690 "desmond bagley"
## 115315 "sharon butala"
## 213867 "lewis desoto"
## 220044 "fay weldon"
## [1] "Thriftbooks.com reviews or descriptions of the diverse recommendations for user 6543:"
## [1] "------------------------------------------------------------"
## [1] "At times lyrical, this first novel of Lewis DeSoto begins with a great deal of potential. Here are two women who have lost--parents, husband. Here are two women in apartheid South Africa, one black and one white. DeSoto describes grief poignantly without being over the top, but he fails on two points: his dialogue is wooden and he often isn't as subtle as he could be, pointing out his lyricism to the reader too blatantly."
## [1] "------------------------------------------------------------"
## [1] "Beginning with his discovery of evidence that suggests that his wife and daughter were murdered to cover up a heroin smuggling operation, Bahamian hotel owner Thomas Mangan becomes convinced that someone is trying to destroy the Bahamian economy."
## [1] "------------------------------------------------------------"
## [1] "As a children's novel, this book is entirely successful. The plot is compelling, the characters are well-drawn, and it allows in just enough chaos and evil to make the final triumph of order and good truly satisfying. I have dozens of children's novels on my shelves with the same qualifications, but very few of them do I reread with the same frequency and pleasure as I reread both Tales of Alderley."
## [1] "------------------------------------------------------------"
## [1] "Sharon Butala has written a deeply personal book with universal application. She tells of her journey from a fulfilling but hectic urban life to one of isolation and introspection. She joins her new husband on a cattle ranch in southwest Saskatchewan and leaves behind her university teaching, her graduate studies, her support network of feminist friends, and her teenaged son."
The books recommended by the UBCF for user 6543 are different from the books chosen to add an element of diversity. The reviews shown above indicate that there is a real diversity among these choices. Accuracy was slightly lower for the diverse model: the original model was optimized mathematically, so substituting more diverse books meant choosing picks that were less optimal.
When we look at our user-based collaborative filtering model, the ROC curve is almost a straight line, so the model is not performing well. The curve also approaches 0.06 instead of 1. We think this is because of the sparsity of our data. The precision-recall curve drops off suddenly and then has a maximum later. Both precision and recall are small, which is also due to the sparsity of the data. As more ratings come in for books, the data frame would become less sparse.
## UBCF run fold/sample [model time/prediction time]
## 1 [0sec/0.78sec]
## Evaluation results for 1 folds/samples using method 'UBCF'.
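To make the two quantities behind the precision-recall curve concrete, here is a small Python sketch. It is an illustration only and assumes that a recommendation counts as a hit when the book is in the set of books the user actually liked in the held-out data; the function name `precision_recall` is hypothetical.

```python
def precision_recall(recommended, relevant):
    """Precision and recall for one user's top-N list."""
    recommended = list(recommended)
    relevant = set(relevant)
    hits = sum(1 for book in recommended if book in relevant)
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

With 5 recommendations of which 1 is among 4 relevant books, this returns a precision of 0.2 and a recall of 0.25. When users have very few held-out ratings among thousands of candidate books, hits are rare, so both values stay small.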
If our model were online, we could evaluate it in a more robust manner. Each method could be accompanied by A/B testing to find out how users respond to each type of system and to each recommendation. We could see whether users come back more often when given different types of recommendations.
Our data is quite sparse. We eliminated users who rated few books so that our model could be small enough to run. We could test which books are most useful to keep in order to have a smaller, more efficiently running model, and we could learn which books have the most engagement. Which suggestions cause a user to buy, or to spend more time on our site learning about the book? We could track users to find out how often they visit and how many recommendations they come back for. We could also find out which suggestions are clicked along with others. That might provide a new cosine distance measure for book similarity.
With this, and most of our methods, we will have a cold start problem because of the sparsity of our data. We could use methods to try to get users to rate more books. Perhaps a secondary recommender system could recommend books to ask a user to rate so that the sparsest areas could be filled in.
Sources:
https://medium.com/recombee-blog/evaluating-recommender-systems-choosing-the-best-one-for-your-business-c688ab781a35
https://talkroute.com/online-marketing-terms-decoded-seo-ctr-roi-and-all-the-rest/