For assignment 2, start with an existing dataset of user-item ratings, such as our toy books dataset, MovieLens, Jester [http://eigentaste.berkeley.edu/dataset/] or another dataset of your choosing. Implement at least two of these recommendation algorithms:

  • Content-Based Filtering
  • User-User Collaborative Filtering
  • Item-Item Collaborative Filtering

As an example of implementing a Content-Based recommender, you could build item profiles for a subset of MovieLens movies from scraping http://www.imdb.com/ or using the API at https://www.omdbapi.com/ (which has very recently instituted a small monthly fee). A more challenging method would be to pull movie summaries or reviews and apply tf-idf and/or topic modeling.

You should evaluate and compare different approaches, using different algorithms, normalization techniques, similarity methods, neighborhood sizes, etc. You don’t need to be exhaustive—these are just some suggested possibilities. You may use the course text’s recommenderlab or any other library that you want. Please provide at least one graph, and a textual summary of your findings and recommendations.

Get Data

The dataset has book ratings from 1-10 for thousands of users. The dataset is split into ratings data, books data and user data. Datasets were retrived from the below link:

Datset: http://www2.informatik.uni-freiburg.de/~cziegler/BX/

Review Data

The initial glimpse of the data shows that book ratings and books datasets have all the data we need. The user dataset is not required for our analysis as it does not provide any additional details for display. One note is User.ID is listed as an integer. We set the User.ID to a factor.

## Observations: 493,813
## Variables: 3
## $ User.ID     <int> 276725, 276726, 276727, 276729, 276729, 276733, 276736,...
## $ ISBN        <fct> 034545104X, 0155061224, 0446520802, 052165615X, 0521795...
## $ Book.Rating <int> 0, 5, 0, 3, 6, 0, 8, 6, 7, 10, 0, 0, 0, 0, 0, 0, 9, 0, ...
## Observations: 115,253
## Variables: 8
## $ ISBN                <fct> 0195153448, 0002005018, 0060973129, 0374157065,...
## $ Book.Title          <fct> Classical Mythology, Clara Callan, Decision in ...
## $ Book.Author         <fct> Mark P. O. Morford, Richard Bruce Wright, Carlo...
## $ Year.Of.Publication <fct> 2002, 2001, 1991, 1999, 1999, 1991, 2000, 1993,...
## $ Publisher           <fct> Oxford University Press, HarperFlamingo Canada,...
## $ Image.URL.S         <fct> http://images.amazon.com/images/P/0195153448.01...
## $ Image.URL.M         <fct> http://images.amazon.com/images/P/0195153448.01...
## $ Image.URL.L         <fct> http://images.amazon.com/images/P/0195153448.01...

Transform to Rating Matrix

The table for ratings shows that ratings of 0 is 64% of the data. We remove any 0 ratings and only use on the first 500 User.IDs.

## 
##      0      1      2      3      4      5      6      7      8      9     10 
## 317794    658   1184   2622   3802  21546  15222  31300  42046  26549  31090

The lengths are now, ISBN = 1619, User.ID = 397 and Book = 1690 .

The table for ratings are now:

## 
##   1   2   3   4   5   6   7   8   9  10 
##   1   9  26  36 166 145 335 380 286 306

We set seed to 123 and create a matrix with ISBN in the columns and User.ID in the rows and Book.Ratings as the data. We then convert the matrix to a realRatingMatrix and results in a matrix with below dimensions:

##  int [1:397, 1:1619] 5 5 5 5 6 6 7 6 6 10 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:397] "8" "9" "10" "12" ...
##   ..$ : chr [1:1619] "0002005018" "074322678X" "0887841740" "1552041778" ...
## 397 x 1619 rating matrix of class 'realRatingMatrix' with 642743 ratings.

Filter/Normalize/Review

In this section we are display of vector table of all values and inlcude a qplot of data in filtered and normalized state. The filtered state gives all values of ratings from 1 through 10. The normalized ranges are from 2 through -6. The normalized qplot gives us a good view of the weight of each rating compared to overall data. The histogram confirm that higher ratings are more prevelant in the first 500 sample of users. We also include a similarity view for 5 users.

## 
##      1      2      3      4      5      6      7      8      9     10 
##    380   3421   9888  13691  63139  55152 127413 144524 108784 116351

##            8         9        10        12
## 9  0.9600979                              
## 10 0.9584032 0.9604749                    
## 12 0.9545462 0.9584442 0.9610469          
## 14 0.9545735 0.9542256 0.9592832 0.9613886
## Formal class 'realRatingMatrix' [package "recommenderlab"] with 2 slots
##   ..@ data     :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
##   ..@ normalize: NULL

Split Data

The data is with 80% for training and 20% for testing.

#Split Data
which_train<- sample(x=c(TRUE, FALSE), size=nrow(srdata.mat), replace=TRUE, prob=c(0.8,0.2))
data_train <- srdata.mat[which_train,]
data_test <- srdata.mat[!which_train,]

Item Based Collaborative

This section we are using IBCF method for a Item Based model. The heatmap shows most of the counts are in row 15 column 25. The histogram confirms the high column count is in 25.

We use the books dataset here to extract the top 5 book titles based on the IBCF method.

## [1] "0375759778" "0061000280" "0425156842" "0551030682" "3257061269"
## [6] "9728440642"
##                                                                             Book.Title
## 1                                                                     Prague : A Novel
## 2 Sophie's World: A Novel About the History of Philosophy (Berkeley Signature Edition)
## 3                                                                  The Fly on the Wall
## 4                                                                       Der Alchimist.
## 5                          Sacred Diary of Adrian Plass, Christian Speaker Aged 45 3/4

User Based Collaborative

In the User Based Collaborative section we use the UBCF method. The heatmap shows most of the counts are in row 15 column 25. The histogram confirms the high column count is in 25.

We use the books dataset here to extract the top 5 book titles based on the IBCF method. Here the method normalizes the data as we can see in the heatmap. The histogram shows -30 may be the high count point.

## Recommender of type 'UBCF' for 'realRatingMatrix' 
## learned using 323 users.

## [1] "0007100221" "9722105507" "3466110564" "0425156842" "0590514776"
## [6] "2909051196"
## 323 x 1619 rating matrix of class 'realRatingMatrix' with 522937 ratings.
## Normalized using center on rows.
##                                                                             Book.Title
## 1 Sophie's World: A Novel About the History of Philosophy (Berkeley Signature Edition)
## 2                                                                         TERROR FIRMA
## 3                                           Meet the Stars of Buffy the Vampire Slayer
## 4                                                 Le fran�§ais : Cent difficult�©s

Conclusion

Although the display of books from training data for both UBCF was different from IBCF the test list show the book selections match. These dataset and methods need additaional analysis for better conclusion.

##                                                Book.Title
## 1                                            Clara Callan
## 2                 Where You'll Find Me: And Other Stories
## 3                                      The Middle Stories
## 4                                                Jane Doe
## 5            The Witchfinder (Amos Walker Mystery Series)
## 6 More Cunning Than Man: A Social History of Rats and Man
##                                              Book.Title.1
## 1                                            Clara Callan
## 2                 Where You'll Find Me: And Other Stories
## 3                                      The Middle Stories
## 4                                                Jane Doe
## 5            The Witchfinder (Amos Walker Mystery Series)
## 6 More Cunning Than Man: A Social History of Rats and Man
##                                                Book.Title
## 1                                            Clara Callan
## 2                 Where You'll Find Me: And Other Stories
## 3                                      The Middle Stories
## 4                                                Jane Doe
## 5            The Witchfinder (Amos Walker Mystery Series)
## 6 More Cunning Than Man: A Social History of Rats and Man
##                                              Book.Title.1
## 1                                            Clara Callan
## 2                 Where You'll Find Me: And Other Stories
## 3                                      The Middle Stories
## 4                                                Jane Doe
## 5            The Witchfinder (Amos Walker Mystery Series)
## 6 More Cunning Than Man: A Social History of Rats and Man

APPENDIX

Code used in analysis

knitr::opts_chunk$set(
    echo = FALSE,
    message = FALSE,
    warning = FALSE
)
#knitr::opts_chunk$set(echo = TRUE)
require(knitr)
library(ggplot2)
library(tidyr)
library(MASS)
library(psych)
library(kableExtra)
library(dplyr)
library(faraway)
library(gridExtra)
library(reshape2)
library(leaps)
library(pROC)
library(caret)
library(naniar)
library(pander)
library(pROC)
library(mlbench)
library(e1071)
library(fpp2)
library(mlr)
library(recommenderlab)

#memory.limit(size=100000)
urate<- read.csv("https://raw.githubusercontent.com/apag101/Data612/master/Projects/Project2/BX-Book-Ratings.csv", sep=";",header = TRUE)
#users<- read.csv("https://raw.githubusercontent.com/apag101/Data612/master/Projects/Project2/BX-Users.csv", sep=";", header = TRUE)
books<- read.csv("https://raw.githubusercontent.com/apag101/Data612/master/Projects/Project2/BX-Books.csv", sep=";", header = TRUE)
glimpse(urate)
#glimpse(users)
glimpse(books)

urate$User.ID<-as.factor(urate$User.ID)
tv<-table(as.vector(urate$Book.Rating))
rtv<-round(tv[1]/length(as.vector(urate$Book.Rating)),2)*100
tv

surate<-subset(urate, Book.Rating !=0 & as.integer(User.ID) <501)

lisbn<-length(unique(surate$ISBN))
luser<-length(unique(surate$User.ID))
lbook<-length(surate$Book.Rating)

table(as.vector(surate$Book.Rating))
set.seed(123)
sdata.mat <- matrix(data=surate$Book.Rating,ncol=length(unique(surate$ISBN)),nrow=length(unique(surate$User.ID)))
rownames(sdata.mat)<-c(paste(unique(surate$User.ID)))
colnames(sdata.mat)<-c(paste(unique(surate$ISBN)))
glimpse(sdata.mat)
srdata.mat<-as(sdata.mat, "realRatingMatrix")
srdata.mat
#Original filtered as Rating Matrix
#rdata.mat<-as(data.mat, "realRatingMatrix")
table(as.vector(srdata.mat@data))
vector_rates<-as.vector(srdata.mat@data)
vector_rates<-vector_rates[vector_rates !=0]
vector_rates<-factor(vector_rates)
qplot(vector_rates)+ggtitle("Distribution of ratings")
image(srdata.mat[1:100, 1:150])
similarity(srdata.mat[1:5,],method="cosine", which="User.ID")
glimpse(srdata.mat)

#Normalized
nrdata.mat<-normalize(srdata.mat)
image(nrdata.mat[1:100, 1:150])

#Split Data
which_train<- sample(x=c(TRUE, FALSE), size=nrow(srdata.mat), replace=TRUE, prob=c(0.8,0.2))
data_train <- srdata.mat[which_train,]
data_test <- srdata.mat[!which_train,]
#Item Based Collaborative Train
ibc_model<-Recommender(data= data_train, method = "IBCF", parameter = list(k=30))
model_details<-getModel(ibc_model)
n_items_top<-30
image(model_details$sim[1:n_items_top,1:n_items_top], main="Heatmap of the first rows and columns")
col_sums<-colSums(model_details$sim>0)
qplot(col_sums)+stat_bin(binwidth = 1) + ggtitle("Distribution of column count")
which_max<-order(col_sums, decreasing = TRUE)[1:6]
rownames(model_details$sim)[which_max]

books%>%
    filter (ISBN %in% c(colnames(model_details$sim)[which_max]))%>%
    select (Book.Title)

#Item Based Collaborative Predict Test
n_recommend <- 6
ibc_predict<- predict(object=ibc_model, newdata= data_test, n=n_recommend)
ibc_matrix <-sapply(ibc_predict@itemLabels, function(x){
    colnames(srdata.mat)[x]
})
ibc<-ibc_matrix[1:6]

i<-as.data.frame(
c(books%>%
    filter (ISBN %in% c(names(ibc)))%>%
    select (Book.Title),

books%>%
    filter (ISBN %in% c(colnames(data_test@data)[1:6]))%>%
    select (Book.Title)))
#User based Collaborative filtering
ubc_model<-Recommender(data= data_train, method = "UBCF")
ubc_model
model_details<-getModel(ubc_model)
n_items_top<-30
image(model_details$data[1:n_items_top,1:n_items_top], main="Heatmap of the first rows and columns")
col_sums<-colSums(model_details$data)
qplot(col_sums)+stat_bin(binwidth = 1) + ggtitle("Distribution of column count")
which_max<-order(col_sums, decreasing = TRUE)[1:6]
colnames(model_details$data)[which_max]
model_details$data

books%>%
    filter (ISBN %in% c(colnames(model_details$data)[which_max]))%>%
    select (Book.Title)


#User Based Collaborative Predict Test
n_recommend <- 6
ubc_predict<- predict(object=ubc_model, newdata= data_test, n=n_recommend)
ubc_matrix <-sapply(ubc_predict@itemLabels, function(x){
    rownames(srdata.mat)[x]
})

ubc<-ubc_matrix[1:6]

u<- as.data.frame(
c(books%>%
    filter (ISBN %in% c(names(ubc)))%>%
    select (Book.Title),

books%>%
    filter (ISBN %in% c(colnames(data_test@data)[1:6]))%>%
    select (Book.Title)))

i
u