DATA 643 Recommender Systems Assignment 4

Sarah Wigodsky and Dan Wigodsky

2018-07-03

Accuracy and Beyond

We will build a recommender system to recommend books to users.

The data comes from the Book-Crossing data set, which is available here: http://www2.informatik.uni-freiburg.de/~cziegler/BX/ The data set was mined by Cai-Nicolas Ziegler, DBIS Freiburg.

The data was collected by Cai-Nicolas Ziegler in a 4-week crawl (August / September 2004) from the Book-Crossing community with kind permission from Ron Hornbaker, CTO of Humankind Systems. It contains 278,858 users (anonymized but with demographic information) providing 1,149,780 ratings (explicit / implicit) about 271,379 books.

Explicit book ratings range from 1 to 10, with higher values indicating a higher rating. A value of zero marks a book that the user did not rate explicitly.

There are 3 data frames. The books data frame lists each book's ISBN, title, author and year of publication.

ISBN	Book.Title	Book.Author
195153448	Classical Mythology	mark p. o. morford
2005018	Clara Callan	richard bruce wright
60973129	Decision in Normandy	carlo d’este
374157065	Flu: The Story of the Great Influenza Pandemic of 1918 and the Search for the Virus That Caused It	gina bari kolata
393045218	The Mummies of Urumchi	e. j. w. barber
399135782	The Kitchen God’s Wife	amy tan

The ratings data frame contains the user id, ISBN and book rating.

User.ID	ISBN	Book.Rating
276725	3454510439	0
276726	155061224	5
276727	446520802	0
276729	5216561539	3
276729	521795028	6
276733	2080674722	0

The data from the different data frames needs to be joined. Some users rated over 5000 books, which is implausible. Other users rated hardly any books. We filtered the data frame for users who rated between 150 and 225 books, so that the recommender system would be meaningful and the matrix would not be too sparse.
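The report's own code is not shown, so the following is only a minimal Python sketch of this kind of activity filter. The toy rows, the function name `filter_by_activity` and the small bounds are illustrative assumptions; the report's bounds are 150 and 225. It also assumes that a zero rating does not count as a rated book.

```python
from collections import Counter

# Toy (user_id, isbn, rating) rows; IDs, ISBNs and bounds are made up for illustration.
ratings = [
    (1, "a", 5), (1, "b", 7), (1, "c", 0),
    (2, "a", 9),
    (3, "a", 4), (3, "b", 6), (3, "c", 8), (3, "d", 10),
]

def filter_by_activity(rows, low, high):
    """Keep rows from users whose count of rated (non-zero) books lies in [low, high]."""
    counts = Counter(user for user, _isbn, rating in rows if rating > 0)
    keep = {user for user, n in counts.items() if low <= n <= high}
    return [row for row in rows if row[0] in keep]

kept = filter_by_activity(ratings, low=2, high=3)
print(sorted({user for user, _isbn, _rating in kept}))  # [1]
```

User 2 is dropped for rating too few books and user 3 for rating too many, which mirrors the two-sided filter described above.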

The data frame needs to be arranged with each user as a different row and each column as a different book.
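As an illustration of that reshaping step, here is a small Python sketch (not the authors' code) that pivots long-format rows into a user-by-book matrix. The helper name `to_user_item_matrix` and the toy rows are assumptions, and zero ratings are treated as missing.

```python
def to_user_item_matrix(rows):
    """Pivot (user, isbn, rating) rows into one row per user and one column per book."""
    users = sorted({user for user, _isbn, _rating in rows})
    books = sorted({isbn for _user, isbn, _rating in rows})
    row_index = {user: i for i, user in enumerate(users)}
    col_index = {isbn: j for j, isbn in enumerate(books)}
    # None marks a book the user has not rated.
    matrix = [[None] * len(books) for _ in users]
    for user, isbn, rating in rows:
        if rating > 0:  # assumption: zero means "no explicit rating"
            matrix[row_index[user]][col_index[isbn]] = rating
    return users, books, matrix

users, books, matrix = to_user_item_matrix(
    [(1, "a", 5), (1, "b", 7), (3, "a", 4), (3, "c", 0)]
)
print(users, books)
print(matrix)
```

Most cells stay empty in a matrix like this, which is the sparsity that the later sections have to work around.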

Find Bias For Each User and Book

##       1903       2276       4017       4385       6251       6543 
##  0.5050965  0.5360322  1.1265128  1.9013229  0.8060269 -0.2483192

1055607	0.1654739
2005557	-0.8345261
2154900	0.1654739
2197154	-0.8345261
2251760	2.1654739
2550563	1.1654739
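The two listings above appear to be user biases followed by book biases. One common definition, assumed here, is the user's (or book's) mean rating minus the overall mean rating. The book values shown differ from one another by whole numbers, which is what one would see if each of those books had a single rating measured against an overall mean of about 7.83; that reading is an inference, not something the report states. A minimal Python sketch of this definition, with made-up data:

```python
def mean(values):
    return sum(values) / len(values)

def compute_biases(rows):
    """Return (global_mean, user_bias, book_bias) from (user, isbn, rating) rows."""
    rated = [(u, b, r) for u, b, r in rows if r > 0]  # assumption: zero is unrated
    global_mean = mean([r for _u, _b, r in rated])
    by_user, by_book = {}, {}
    for user, isbn, rating in rated:
        by_user.setdefault(user, []).append(rating)
        by_book.setdefault(isbn, []).append(rating)
    # Bias = how far this user's (or book's) average sits from the overall average.
    user_bias = {u: mean(v) - global_mean for u, v in by_user.items()}
    book_bias = {b: mean(v) - global_mean for b, v in by_book.items()}
    return global_mean, user_bias, book_bias

global_mean, user_bias, book_bias = compute_biases(
    [(1, "a", 8), (1, "b", 6), (2, "a", 10), (2, "b", 0)]
)
print(global_mean, user_bias, book_bias)
```

A positive user bias marks a generous rater, and a positive book bias marks a book rated above average.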

Break the Data into a Training Set and Testing Set

80% of the data will be used to train and 20% of the data will be used to test. The data needs to be converted into a realRatingMatrix (the rating-matrix class used by the recommenderlab package).
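A hedged Python sketch of such a split follows. It assumes the split is made over users, which matches the later reference to users from the test set; the function name `split_users` and the seed are illustrative choices, not taken from the report.

```python
import random

def split_users(user_ids, train_fraction=0.8, seed=42):
    """Shuffle users reproducibly, then cut them into training and test groups."""
    shuffled = list(user_ids)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

train_users, test_users = split_users(range(1, 11))
print(len(train_users), len(test_users))  # 8 2
```

Fixing the seed keeps the split reproducible, so that different models are compared on the same test users.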

User Based Collaborative Filtering - Cosine Similarity

A user-based collaborative filter (UBCF) recommends the books most preferred by similar users, where the similarity between users is measured by cosine similarity. The 5 best books predicted for each of the first 10 test users are shown below. An empty result, character(0), means that no recommendation was produced for that user.

## [[1]]
## character(0)
## 
## [[2]]
## [1] "679723161" "440211697" "451132378" "618002227" "671727796"
## 
## [[3]]
## [1] "446310786"  "8041052639" "833510266"  "156528207"  "679723161" 
## 
## [[4]]
## [1] "449005615" "385504209" "440222656" "446672211" "64408663" 
## 
## [[5]]
## character(0)
## 
## [[6]]
## [1] "60977493"  "60530839"  "385121679" "425188809" "451160444"
## 
## [[7]]
## [1] "446310786" "525947647" "812550706" "877017883" "446610038"
## 
## [[8]]
## [1] "446310786" "446676098" "156528207" "553284789" "440219078"
## 
## [[9]]
## [1] "60977493"  "60530839"  "385121679" "425188809" "451160444"
## 
## [[10]]
## character(0)
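To make the idea concrete, here is a simplified Python sketch of user-based filtering with cosine similarity. It is not the implementation that produced the lists above: the neighbourhood size `k`, the similarity-weighted scoring and the toy ratings are all assumptions, and similarity is computed only over books that both users rated.

```python
import math

def cosine(a, b):
    """Cosine similarity of two users over the books both rated; None if no overlap."""
    shared = [book for book in a if book in b]
    if not shared:
        return None
    dot = sum(a[book] * b[book] for book in shared)
    norm_a = math.sqrt(sum(a[book] ** 2 for book in shared))
    norm_b = math.sqrt(sum(b[book] ** 2 for book in shared))
    return dot / (norm_a * norm_b)

def recommend(target, others, n=5, k=2):
    """Top-n unseen books, scored by similarity-weighted ratings of the k nearest users."""
    sims = []
    for user_id, user_ratings in others.items():
        s = cosine(target, user_ratings)
        if s is not None:
            sims.append((s, user_id))
    neighbours = sorted(sims, reverse=True)[:k]
    scores, weights = {}, {}
    for s, user_id in neighbours:
        for book, rating in others[user_id].items():
            if book not in target:
                scores[book] = scores.get(book, 0.0) + s * rating
                weights[book] = weights.get(book, 0.0) + s
    ranked = sorted(scores, key=lambda book: scores[book] / weights[book], reverse=True)
    return ranked[:n]

others = {"u1": {"a": 9, "b": 8, "c": 10}, "u2": {"a": 2, "b": 9, "d": 7}}
print(recommend({"a": 9, "b": 8}, others))  # ['c', 'd']
print(recommend({"z": 5}, others))          # []
```

The second call returns an empty list because the target user shares no rated books with anyone, which is one way an empty recommendation such as character(0) can arise in sparse data.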

Singular Value Decomposition (SVD) With Different Features

The irlba function was used to do singular value decomposition. The number of features was changed from 2 to 20 and the root mean square error was calculated each time.
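The sketch below illustrates the general idea of a truncated SVD and of measuring RMSE as the number of features k grows. It is plain Python using power iteration with deflation, swapped in for illustration; it is not irlba and not the authors' code. The toy matrix is dense and the RMSE here is a reconstruction error over all entries, whereas the report may have measured error differently.

```python
import math
import random

def leading_triplet(a, iters=300, seed=0):
    """Leading singular value and vectors of matrix a, found by power iteration."""
    rng = random.Random(seed)
    rows, cols = len(a), len(a[0])
    zero = (0.0, [0.0] * rows, [0.0] * cols)
    v = [rng.random() + 0.1 for _ in range(cols)]
    for _ in range(iters):
        u = [sum(a[i][j] * v[j] for j in range(cols)) for i in range(rows)]  # u = A v
        w = [sum(a[i][j] * u[i] for i in range(rows)) for j in range(cols)]  # w = A^T u
        norm = math.sqrt(sum(x * x for x in w))
        if norm < 1e-12:  # nothing left to extract
            return zero
        v = [x / norm for x in w]
    u = [sum(a[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    sigma = math.sqrt(sum(x * x for x in u))
    if sigma < 1e-12:
        return zero
    return sigma, [x / sigma for x in u], v

def truncated_svd(a, k):
    """Rank-k approximation of a: extract k singular triplets one at a time."""
    rows, cols = len(a), len(a[0])
    residual = [[float(x) for x in row] for row in a]
    approx = [[0.0] * cols for _ in range(rows)]
    for _ in range(k):
        sigma, u, v = leading_triplet(residual)
        for i in range(rows):
            for j in range(cols):
                term = sigma * u[i] * v[j]
                approx[i][j] += term
                residual[i][j] -= term  # deflate so the next pass finds the next triplet
    return approx

def rmse(actual, predicted):
    """Root mean square error over all matrix entries."""
    sq = [(x - y) ** 2 for ra, rp in zip(actual, predicted) for x, y in zip(ra, rp)]
    return math.sqrt(sum(sq) / len(sq))

ratings = [[8, 9, 7, 10], [7, 9, 6, 9], [3, 2, 9, 4], [2, 3, 10, 3], [9, 8, 5, 10]]
errors = {k: rmse(ratings, truncated_svd(ratings, k)) for k in range(1, 5)}
for k, err in errors.items():
    print(k, round(err, 4))
```

On the matrix being decomposed, the error can only fall as k grows, so the interesting comparison is on held-out ratings, where too many features can start fitting noise.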

Singular Value Decomposition (SVD) With 10 Features

Rating Predictor

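The following function takes a user ID and a book ISBN and returns the rating predicted by the SVD model. In the output below, each user ID is printed first, followed by the predicted ratings for the books that the UBCF recommended to that user.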

## [1] "4385"
## [1] "6543"
## 679723161 
##  9.780782 
## 440211697 
##  9.001933 
## 451132378 
##  9.253236 
## 618002227 
##  8.083212 
## 671727796 
##  7.819176 
## [1] "12982"
## 446310786 
##  9.021302 
## 8041052639 
##   8.769924 
## 833510266 
##   9.71206 
## 156528207 
##  9.939415 
## 679723161 
##  9.452563 
## [1] "28591"
## 449005615 
##  8.814483 
## 385504209 
##  8.599126 
## 440222656 
##  7.967524 
## 446672211 
##  8.434496 
## 64408663 
## 8.795404 
## [1] "35433"
## [1] "51386"
## 60977493 
## 9.468746 
## 60530839 
##       10 
## 385121679 
##        10 
## 425188809 
##        10 
## 451160444 
##        10 
## [1] "56447"
## 446310786 
##        10 
## 525947647 
##        10 
## 812550706 
##  7.831551 
## 877017883 
##  9.538768 
## 446610038 
##  9.783758 
## [1] "76151"
## 446310786 
##  9.710176 
## 446676098 
##        10 
## 156528207 
##        10 
## 553284789 
##   9.55149 
## 440219078 
##        10 
## [1] "99441"
## 60977493 
## 8.867078 
## 60530839 
##       10 
## 385121679 
##        10 
## 425188809 
##        10 
## 451160444 
##        10 
## [1] "110483"
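The predictor itself is not shown. As a hedged illustration only, a common form adds the overall mean and the user and book biases to the dot product of the latent factor vectors, then clips the result to the rating scale. The several values of exactly 10 in the output above would be consistent with such a cap, but that is a guess. All names and numbers in this Python sketch are hypothetical.

```python
def predict_rating(global_mean, user_bias, book_bias, user_factors, book_factors,
                   low=1.0, high=10.0):
    """Baseline plus latent-factor interaction, clipped to the rating scale."""
    interaction = sum(p * q for p, q in zip(user_factors, book_factors))
    raw = global_mean + user_bias + book_bias + interaction
    return max(low, min(high, raw))

# The raw score here is above 10, so it is capped at the top of the scale.
print(predict_rating(7.8, 0.5, 1.2, [1.0, 0.5], [0.8, 0.4]))  # 10.0
# Here the raw score falls inside the scale and is returned unchanged.
print(predict_rating(7.0, 0.0, 0.5, [1.0], [0.25]))           # 7.75
```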

To compare the UBCF and SVD predictions, we took the first 10 users from the test set and used the SVD model to predict ratings for the top 5 books that the User Based Collaborative Filtering chose for each of them. The SVD-predicted ratings for these books are very high: the lowest is about 7.8 and many are 9 or 10. This suggests that the two models give similar results.

Diversity

In order to build diversity into the recommendation system, we will compare the similarity between books and recommend books that are most dissimilar to one another. To start, let’s look at the first 10 books and first 10 users.

Let’s focus on user 6543. We start with 20 books suggested by our UBCF method. The pairwise similarities between these books are shown below in graphical and numeric form.

## $`4385`
## character(0)

## $`6543`
##  [1] "679723161"  "440211697"  "451132378"  "618002227"  "671727796" 
##  [6] "812550706"  "385484518"  "345361792"  "413626709"  "517123207" 
## [11] "684854422"  "836218981"  "4251177439" "5532124639" "60609575"  
## [16] "553071289"  "805005889"  "345323750"  "670891576"  "684807610"

##            679723161 440211697 451132378 618002227 671727796 812550706
## 440211697         NA                                                  
## 451132378  1.0000000        NA                                        
## 618002227  1.0000000        NA 1.0000000                              
## 671727796  1.0000000        NA 1.0000000        NA                    
## 812550706         NA        NA 1.0000000 0.9778024 1.0000000          
## 385484518  1.0000000        NA 1.0000000        NA 1.0000000        NA
## 345361792  1.0000000        NA 1.0000000        NA 0.9938837        NA
## 413626709  1.0000000        NA 1.0000000        NA 1.0000000        NA
## 517123207  1.0000000        NA 1.0000000        NA 1.0000000        NA
## 684854422  1.0000000        NA 1.0000000        NA 1.0000000        NA
## 836218981  1.0000000        NA 1.0000000        NA 1.0000000        NA
## 4251177439 1.0000000        NA 1.0000000        NA 1.0000000        NA
## 5532124639 1.0000000        NA 1.0000000        NA 1.0000000        NA
## 60609575          NA        NA        NA        NA        NA        NA
## 553071289         NA        NA        NA        NA        NA        NA
## 805005889         NA        NA        NA        NA        NA        NA
## 345323750         NA        NA        NA        NA 1.0000000        NA
## 670891576  1.0000000        NA        NA        NA        NA        NA
## 684807610         NA        NA        NA        NA        NA        NA
##            385484518 345361792 413626709 517123207 684854422 836218981
## 440211697                                                             
## 451132378                                                             
## 618002227                                                             
## 671727796                                                             
## 812550706                                                             
## 385484518                                                             
## 345361792  1.0000000                                                  
## 413626709  1.0000000 1.0000000                                        
## 517123207  1.0000000 1.0000000 1.0000000                              
## 684854422  1.0000000 1.0000000 1.0000000 1.0000000                    
## 836218981  1.0000000 1.0000000 1.0000000 1.0000000 1.0000000          
## 4251177439 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
## 5532124639 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
## 60609575          NA        NA        NA        NA        NA        NA
## 553071289         NA        NA        NA        NA        NA        NA
## 805005889         NA        NA        NA        NA        NA        NA
## 345323750         NA 1.0000000        NA        NA        NA        NA
## 670891576         NA        NA        NA        NA        NA        NA
## 684807610         NA        NA        NA        NA        NA        NA
##            4251177439 5532124639  60609575 553071289 805005889 345323750
## 440211697                                                               
## 451132378                                                               
## 618002227                                                               
## 671727796                                                               
## 812550706                                                               
## 385484518                                                               
## 345361792                                                               
## 413626709                                                               
## 517123207                                                               
## 684854422                                                               
## 836218981                                                               
## 4251177439                                                              
## 5532124639  1.0000000                                                   
## 60609575           NA         NA                                        
## 553071289          NA         NA 1.0000000                              
## 805005889          NA         NA 1.0000000 1.0000000                    
## 345323750          NA         NA        NA        NA        NA          
## 670891576          NA         NA        NA        NA        NA        NA
## 684807610          NA         NA        NA        NA        NA        NA
##            670891576
## 440211697           
## 451132378           
## 618002227           
## 671727796           
## 812550706           
## 385484518           
## 345361792           
## 413626709           
## 517123207           
## 684854422           
## 836218981           
## 4251177439          
## 5532124639          
## 60609575            
## 553071289           
## 805005889           
## 345323750           
## 670891576           
## 684807610  1.0000000
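One possible reading of this matrix, offered as an assumption rather than a finding: if the similarity between two books is computed only over users who rated both, then a pair with no common rater yields NA, and a pair with exactly one common rater yields exactly 1, because two single positive numbers always point in the same direction. The Python sketch below, with made-up ratings, shows both effects.

```python
import math

def item_cosine(ratings_a, ratings_b):
    """Cosine similarity of two books over users who rated both; None if there are none."""
    shared = [user for user in ratings_a if user in ratings_b]
    if not shared:
        return None  # plays the role of NA in the matrix above
    dot = sum(ratings_a[u] * ratings_b[u] for u in shared)
    norm_a = math.sqrt(sum(ratings_a[u] ** 2 for u in shared))
    norm_b = math.sqrt(sum(ratings_b[u] ** 2 for u in shared))
    return dot / (norm_a * norm_b)

book_x = {"u1": 9}
book_y = {"u1": 3, "u2": 8}
book_z = {"u3": 7}

print(item_cosine(book_x, book_y))  # 1.0, although the single shared user rated them 9 and 3
print(item_cosine(book_x, book_z))  # None, since no user rated both books
```

Under this reading, a similarity of 1 in very sparse data says little about how alike two books really are.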

We create 500 random models with the intention of keeping the books with the furthest cosine distance from one another, to offer a diversity of suggestions to users. The higher the cosine distance (that is, the lower the cosine similarity), the less related the books and the yellower the image appears.

We will predict 5 books for user 6543 by choosing books that are most dissimilar to that user’s previous book ratings. Most of the calculated similarities were close to 1, and for many pairs no similarity could be calculated at all (NA), because the data is so sparse. We chose to interpret NA as an indication of dissimilarity. The following are the books chosen, labeled by ISBN.

##           684807610 345323750 440211697 60609575 517123207
## 684807610         0        NA        NA       NA        NA
## 345323750        NA         0        NA       NA        NA
## 440211697        NA        NA         0       NA        NA
## 60609575         NA        NA        NA        0        NA
## 517123207        NA        NA        NA       NA         0
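One way to read the 500 random models is as 500 random draws of candidate book sets, keeping the draw whose books are least similar to one another. The Python sketch below follows that interpretation, which is an assumption; the function names, the toy similarities and the set size are illustrative. Pairs with no recorded similarity (NA) are scored as 0, meaning fully dissimilar, in line with the choice described above.

```python
import itertools
import random

def mean_similarity(books, similarity):
    """Average pairwise similarity; pairs missing from the table count as 0."""
    pairs = list(itertools.combinations(sorted(books), 2))
    return sum(similarity.get(pair, 0.0) for pair in pairs) / len(pairs)

def most_diverse_subset(candidates, similarity, size=5, trials=500, seed=1):
    """Draw random subsets and keep the one with the lowest mean similarity."""
    rng = random.Random(seed)
    best, best_score = None, float("inf")
    for _ in range(trials):
        subset = rng.sample(candidates, size)
        score = mean_similarity(subset, similarity)
        if score < best_score:
            best, best_score = subset, score
    return sorted(best), best_score

# Books a, b and c are mutually similar; every other pair has no similarity on record.
similarity = {("a", "b"): 1.0, ("a", "c"): 1.0, ("b", "c"): 1.0}
chosen, score = most_diverse_subset(list("abcdef"), similarity, size=3)
print(chosen, score)
```

Random search is crude but cheap, and with 20 candidates and sets of 5 it avoids scoring all 15,504 possible sets.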

The following are the ISBNs, titles and authors of the books chosen.

	6742939	6550061	2005557	6380263	6163041
1903	NA	NA	NA	NA	NA
2276	NA	NA	NA	NA	NA
4017	NA	NA	NA	NA	NA
4385	NA	NA	NA	NA	NA
6251	NA	NA	NA	NA	NA
6543	NA	NA	NA	NA	NA

##        ISBN      Book.Title                                
## 12567  "6742939" "The Weirdstone of Brisingamen"           
## 43690  "6163041" "Bahama Crisis Uk"                        
## 115315 "6380263" "Perfection of the Morning an Apprentices"
## 213867 "2005557" "A Blade of Grass"                        
## 220044 "6550061" "Splitting"                               
##        Book.Author     
## 12567  "alan garner"   
## 43690  "desmond bagley"
## 115315 "sharon butala" 
## 213867 "lewis desoto"  
## 220044 "fay weldon"

## [1] "Thriftbooks.com reviews or descriptions of the diverse recommendations for user 6543:"

## [1] "------------------------------------------------------------"

## [1] "At times lyrical, this first novel of Lewis DeSoto begins with a great deal of potential. Here are two women who have lost--parents, husband. Here are two women in apartheid South Africa, one black and one white. DeSoto describes grief poignantly without being over the top, but he fails on two points: his dialogue is wooden and he often isn't as subtle as he could be, pointing out his lyricism to the reader too blatantly."

## [1] "------------------------------------------------------------"

## [1] "Beginning with his discovery of evidence that suggests that his wife and daughter were murdered to cover up a heroin smuggling operation, Bahamian hotel owner Thomas Mangan becomes convinced that someone is trying to destroy the Bahamian economy."

## [1] "------------------------------------------------------------"

## [1] "As a children's novel, this book is entirely successful. The plot is compelling, the characters are well-drawn, and it allows in just enough chaos and evil to make the final triumph of order and good truly satisfying. I have dozens of children's novels on my shelves with the same qualifications, but very few of them do I reread with the same frequency and pleasure as I reread both Tales of Alderley."

## [1] "------------------------------------------------------------"

## [1] "Sharon Butala has written a deeply personal book with universal application. She tells of her journey from a fulfilling but hectic urban life to one of isolation and introspection. She joins her new husband on a cattle ranch in southwest Saskatchewan and leaves behind her university teaching, her graduate studies, her support network of feminist friends, and her teenaged son."

The books recommended by the UBCF for user 6543 are different from the books chosen to add an element of diversity. The reviews shown above indicate that there is real diversity among these choices. Accuracy fell slightly for the diverse model: the original model was optimized mathematically, so substituting more diverse books meant choosing picks that were less optimal.

When we look at our user-based collaborative filtering model, the ROC curve is almost a straight line, which means the model is not performing well. The curve also tops out near 0.06 instead of approaching 1. We think this is because of the sparsity of our data. The precision-recall curve drops off suddenly and then reaches a maximum later. Both precision and recall are small, which is also owing to the sparsity of the data. As more ratings come in for books, the data frame would become less sparse.

## UBCF run fold/sample [model time/prediction time]
##   1  [0sec/0.78sec]

## Evaluation results for 1 folds/samples using method 'UBCF'.
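For reference, the quantities behind the ROC and precision-recall curves can be computed from a top-N list as in the Python sketch below. The function name `confusion_rates` and the example numbers are illustrative assumptions; the evaluation above was produced by the authors' own tooling, not by this code.

```python
def confusion_rates(recommended, relevant, catalogue_size):
    """Precision, recall (true positive rate) and false positive rate for one top-N list."""
    rec, rel = set(recommended), set(relevant)
    tp = len(rec & rel)               # recommended and actually liked
    fp = len(rec - rel)               # recommended but not liked
    fn = len(rel - rec)               # liked but never recommended
    tn = catalogue_size - tp - fp - fn
    return {
        "precision": tp / (tp + fp) if rec else 0.0,
        "recall": tp / (tp + fn) if rel else 0.0,
        "fpr": fp / (fp + tn) if (fp + tn) else 0.0,
    }

rates = confusion_rates(["a", "b", "c", "d", "e"], ["a", "x", "y", "z"], catalogue_size=100)
print(rates)
```

Each list length N gives one point on each curve: recall against the false positive rate for the ROC curve, and precision against recall for the precision-recall curve. When the catalogue is large and lists are short, recall can stay far below 1, which fits a curve that tops out at a small value.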

If our model were online, we could evaluate it in a more robust manner. Each method could be accompanied by A/B testing to find out how users respond to each type of system. We would want to find out how a user responds to our recommendations, and whether users come back more often when given different types of recommendations.

Our data is quite sparse. We eliminated users who rated few books so that our model could be small enough to run. We could test which books might be most useful in order to have a smaller, more efficiently running model.

We could learn which books have the most engagement. Which suggestions cause a user to buy, or to spend more time on our site learning about the book? We could track users to find out how often they visit and how many recommendations they come back for. We could also find out which suggestions are clicked along with others; that might provide a new cosine distance measure for book similarity.

With this, and most of our methods, we will have a cold start problem because of the sparsity of our data. We could use methods to encourage users to rate more books. Perhaps a secondary recommender system could recommend books to ask a user to rate, so that the sparsest areas could be filled in.

Sources:
https://medium.com/recombee-blog/evaluating-recommender-systems-choosing-the-best-one-for-your-business-c688ab781a35
https://talkroute.com/online-marketing-terms-decoded-seo-ctr-roi-and-all-the-rest/