For assignment 2, start with an existing dataset of user-item ratings, such as our toy books dataset, MovieLens, Jester [http://eigentaste.berkeley.edu/dataset/] or another dataset of your choosing. Implement at least two of these recommendation algorithms:
As an example of implementing a Content-Based recommender, you could build item profiles for a subset of MovieLens movies from scraping http://www.imdb.com/ or using the API at https://www.omdbapi.com/ (which has very recently instituted a small monthly fee). A more challenging method would be to pull movie summaries or reviews and apply tf-idf and/or topic modeling.
You should evaluate and compare different approaches, using different algorithms, normalization techniques, similarity methods, neighborhood sizes, etc. You don’t need to be exhaustive—these are just some suggested possibilities. You may use the course text’s recommenderlab or any other library that you want. Please provide at least one graph, and a textual summary of your findings and recommendations.
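As a sketch of how such a comparison could be set up with recommenderlab's evaluation tools (the realRatingMatrix srdata.mat is built later in this report; given = 5, goodRating = 7, and the n values are placeholder choices, not settings used for the results below):
# Sketch: compare top-N performance of several algorithms under a split protocol
scheme <- evaluationScheme(srdata.mat, method = "split", train = 0.8, given = 5, goodRating = 7)
algorithms <- list(
  IBCF   = list(name = "IBCF",   param = list(k = 30)),
  UBCF   = list(name = "UBCF",   param = NULL),
  Random = list(name = "RANDOM", param = NULL)
)
results <- evaluate(scheme, algorithms, type = "topNList", n = c(1, 3, 5, 10))
plot(results, annotate = TRUE)  # ROC-style curves, one per algorithm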
The dataset contains book ratings from 0 to 10 for thousands of users (0 ratings are removed later in the analysis). It is split into ratings data, books data, and user data. The datasets were retrieved from the link below:
Dataset: http://www2.informatik.uni-freiburg.de/~cziegler/BX/
The initial glimpse of the data shows that the book ratings and books datasets contain all the data we need. The user dataset is not required for our analysis, as it does not provide any additional details for display. Note that User.ID is read in as an integer; we convert it to a factor.
## Observations: 493,813
## Variables: 3
## $ User.ID <int> 276725, 276726, 276727, 276729, 276729, 276733, 276736,...
## $ ISBN <fct> 034545104X, 0155061224, 0446520802, 052165615X, 0521795...
## $ Book.Rating <int> 0, 5, 0, 3, 6, 0, 8, 6, 7, 10, 0, 0, 0, 0, 0, 0, 9, 0, ...
## Observations: 115,253
## Variables: 8
## $ ISBN <fct> 0195153448, 0002005018, 0060973129, 0374157065,...
## $ Book.Title <fct> Classical Mythology, Clara Callan, Decision in ...
## $ Book.Author <fct> Mark P. O. Morford, Richard Bruce Wright, Carlo...
## $ Year.Of.Publication <fct> 2002, 2001, 1991, 1999, 1999, 1991, 2000, 1993,...
## $ Publisher <fct> Oxford University Press, HarperFlamingo Canada,...
## $ Image.URL.S <fct> http://images.amazon.com/images/P/0195153448.01...
## $ Image.URL.M <fct> http://images.amazon.com/images/P/0195153448.01...
## $ Image.URL.L <fct> http://images.amazon.com/images/P/0195153448.01...
The ratings table shows that a rating of 0 accounts for 64% of the data. We remove all 0 ratings and keep only the first 500 User.IDs.
##
## 0 1 2 3 4 5 6 7 8 9 10
## 317794 658 1184 2622 3802 21546 15222 31300 42046 26549 31090
After filtering, there are 1,619 unique ISBNs, 397 unique User.IDs, and 1,690 ratings. The ratings table is now:
##
## 1 2 3 4 5 6 7 8 9 10
## 1 9 26 36 166 145 335 380 286 306
We set the seed to 123 and create a matrix with ISBNs as the columns, User.IDs as the rows, and Book.Rating as the values. We then convert this matrix to a realRatingMatrix, which results in a matrix with the dimensions below:
## int [1:397, 1:1619] 5 5 5 5 6 6 7 6 6 10 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:397] "8" "9" "10" "12" ...
## ..$ : chr [1:1619] "0002005018" "074322678X" "0887841740" "1552041778" ...
## 397 x 1619 rating matrix of class 'realRatingMatrix' with 642743 ratings.
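For reference, a minimal sketch of an alternative construction that places each filtered rating into its own (User.ID, ISBN) cell, using reshape2::acast (loaded in the code section); unobserved pairs become NA, which realRatingMatrix treats as missing. This sketch is not the construction used for the results shown here.
# Sketch: cast the long ratings table into a wide user x item matrix
alt.mat <- reshape2::acast(surate, User.ID ~ ISBN, value.var = "Book.Rating",
                           fun.aggregate = mean, fill = NA_real_)
alt.rrm <- as(alt.mat, "realRatingMatrix")  # NA cells become missing ratings
alt.rrm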
In this section we display a vector table of all rating values and include qplots of the data in both its filtered and normalized states. The filtered ratings take values from 1 through 10, while the normalized values range from roughly -6 to 2. The normalized qplot gives a good view of the weight of each rating relative to the overall data, and the histogram confirms that higher ratings are more prevalent in this first-500-user sample. We also include a cosine similarity view for the first 5 users; a sketch of an alternative Z-score normalization follows the output below.
##
## 1 2 3 4 5 6 7 8 9 10
## 380 3421 9888 13691 63139 55152 127413 144524 108784 116351
## 8 9 10 12
## 9 0.9600979
## 10 0.9584032 0.9604749
## 12 0.9545462 0.9584442 0.9610469
## 14 0.9545735 0.9542256 0.9592832 0.9613886
## Formal class 'realRatingMatrix' [package "recommenderlab"] with 2 slots
## ..@ data :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## ..@ normalize: NULL
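Beyond the row-centering used above, recommenderlab also supports Z-score normalization; a minimal sketch (not used for the figures in this report):
# Sketch: Z-score normalization as an alternative to centering
zrdata.mat <- normalize(srdata.mat, method = "Z-score")
image(zrdata.mat[1:100, 1:150], main = "Z-score normalized ratings")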
The data is split with 80% for training and 20% for testing.
#Split Data
which_train<- sample(x=c(TRUE, FALSE), size=nrow(srdata.mat), replace=TRUE, prob=c(0.8,0.2))
data_train <- srdata.mat[which_train,]
data_test <- srdata.mat[!which_train,]
In this section we build an item-based model using the IBCF method. The heatmap shows that most of the counts fall around row 15, column 25, and the histogram confirms that the highest column count is around 25. We use the books dataset to extract the book titles for the top items by similarity column count (six ISBNs are selected; five have matching titles in the books data). A sketch of pulling per-user recommendations from the prediction object follows the output below.
## [1] "0375759778" "0061000280" "0425156842" "0551030682" "3257061269"
## [6] "9728440642"
## Book.Title
## 1 Prague : A Novel
## 2 Sophie's World: A Novel About the History of Philosophy (Berkeley Signature Edition)
## 3 The Fly on the Wall
## 4 Der Alchimist.
## 5 Sacred Diary of Adrian Plass, Christian Speaker Aged 45 3/4
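A minimal sketch of reading per-user recommendations directly from the prediction object (the user index 1 is illustrative):
# Sketch: per-user recommended ISBNs from the topNList returned by predict()
ibc_list <- as(ibc_predict, "list")
ibc_list[[1]]                      # ISBNs recommended for the first test user
books %>%
  filter(ISBN %in% ibc_list[[1]]) %>%
  select(Book.Title)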
In the user-based collaborative filtering section we use the UBCF method. Here the method normalizes the ratings data, as the heatmap shows, and the histogram suggests the high count point is around -30. We again use the books dataset to extract the top book titles, this time based on the UBCF model.
## Recommender of type 'UBCF' for 'realRatingMatrix'
## learned using 323 users.
## [1] "0007100221" "9722105507" "3466110564" "0425156842" "0590514776"
## [6] "2909051196"
## 323 x 1619 rating matrix of class 'realRatingMatrix' with 522937 ratings.
## Normalized using center on rows.
## Book.Title
## 1 Sophie's World: A Novel About the History of Philosophy (Berkeley Signature Edition)
## 2 TERROR FIRMA
## 3 Meet the Stars of Buffy the Vampire Slayer
## 4 Le français : Cent difficultés
Although the books displayed from the training data differ between UBCF and IBCF, the test lists show that the book selections match. This dataset and these methods need additional analysis before drawing a firmer conclusion; a sketch of one such comparison follows the tables below.
## Book.Title
## 1 Clara Callan
## 2 Where You'll Find Me: And Other Stories
## 3 The Middle Stories
## 4 Jane Doe
## 5 The Witchfinder (Amos Walker Mystery Series)
## 6 More Cunning Than Man: A Social History of Rats and Man
## Book.Title.1
## 1 Clara Callan
## 2 Where You'll Find Me: And Other Stories
## 3 The Middle Stories
## 4 Jane Doe
## 5 The Witchfinder (Amos Walker Mystery Series)
## 6 More Cunning Than Man: A Social History of Rats and Man
## Book.Title
## 1 Clara Callan
## 2 Where You'll Find Me: And Other Stories
## 3 The Middle Stories
## 4 Jane Doe
## 5 The Witchfinder (Amos Walker Mystery Series)
## 6 More Cunning Than Man: A Social History of Rats and Man
## Book.Title.1
## 1 Clara Callan
## 2 Where You'll Find Me: And Other Stories
## 3 The Middle Stories
## 4 Jane Doe
## 5 The Witchfinder (Amos Walker Mystery Series)
## 6 More Cunning Than Man: A Social History of Rats and Man
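As a sketch of the additional comparison suggested above, recommenderlab's evaluation functions can quantify rating-prediction error for both models (given = 5 is a placeholder; the split here is independent of the one used earlier):
# Sketch: compare IBCF and UBCF prediction error (RMSE, MSE, MAE)
es <- evaluationScheme(srdata.mat, method = "split", train = 0.8, given = 5)
ibcf_fit <- Recommender(getData(es, "train"), method = "IBCF")
ubcf_fit <- Recommender(getData(es, "train"), method = "UBCF")
ibcf_pred <- predict(ibcf_fit, getData(es, "known"), type = "ratings")
ubcf_pred <- predict(ubcf_fit, getData(es, "known"), type = "ratings")
rbind(IBCF = calcPredictionAccuracy(ibcf_pred, getData(es, "unknown")),
      UBCF = calcPredictionAccuracy(ubcf_pred, getData(es, "unknown")))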
Code used in the analysis
knitr::opts_chunk$set(
echo = FALSE,
message = FALSE,
warning = FALSE
)
#knitr::opts_chunk$set(echo = TRUE)
require(knitr)
library(ggplot2)
library(tidyr)
library(MASS)
library(psych)
library(kableExtra)
library(dplyr)
library(faraway)
library(gridExtra)
library(reshape2)
library(leaps)
library(pROC)
library(caret)
library(naniar)
library(pander)
library(mlbench)
library(e1071)
library(fpp2)
library(mlr)
library(recommenderlab)
#memory.limit(size=100000)
urate<- read.csv("https://raw.githubusercontent.com/apag101/Data612/master/Projects/Project2/BX-Book-Ratings.csv", sep=";",header = TRUE)
#users<- read.csv("https://raw.githubusercontent.com/apag101/Data612/master/Projects/Project2/BX-Users.csv", sep=";", header = TRUE)
books<- read.csv("https://raw.githubusercontent.com/apag101/Data612/master/Projects/Project2/BX-Books.csv", sep=";", header = TRUE)
glimpse(urate)
#glimpse(users)
glimpse(books)
urate$User.ID<-as.factor(urate$User.ID)
tv<-table(as.vector(urate$Book.Rating))
rtv<-round(tv[1]/length(as.vector(urate$Book.Rating)),2)*100
tv
surate<-subset(urate, Book.Rating !=0 & as.integer(User.ID) <501)
lisbn<-length(unique(surate$ISBN))
luser<-length(unique(surate$User.ID))
lbook<-length(surate$Book.Rating)
table(as.vector(surate$Book.Rating))
set.seed(123)
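# Note: matrix() fills column-wise and recycles the 1,690 filtered ratings to populate all 397 x 1,619 cells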
sdata.mat <- matrix(data=surate$Book.Rating,ncol=length(unique(surate$ISBN)),nrow=length(unique(surate$User.ID)))
rownames(sdata.mat)<-c(paste(unique(surate$User.ID)))
colnames(sdata.mat)<-c(paste(unique(surate$ISBN)))
glimpse(sdata.mat)
srdata.mat<-as(sdata.mat, "realRatingMatrix")
srdata.mat
#Original filtered as Rating Matrix
#rdata.mat<-as(data.mat, "realRatingMatrix")
table(as.vector(srdata.mat@data))
vector_rates<-as.vector(srdata.mat@data)
vector_rates<-vector_rates[vector_rates !=0]
vector_rates<-factor(vector_rates)
qplot(vector_rates)+ggtitle("Distribution of ratings")
image(srdata.mat[1:100, 1:150])
similarity(srdata.mat[1:5,], method="cosine", which="users")
glimpse(srdata.mat)
#Normalized
nrdata.mat<-normalize(srdata.mat)
image(nrdata.mat[1:100, 1:150])
#Split Data
which_train<- sample(x=c(TRUE, FALSE), size=nrow(srdata.mat), replace=TRUE, prob=c(0.8,0.2))
data_train <- srdata.mat[which_train,]
data_test <- srdata.mat[!which_train,]
#Item Based Collaborative Train
ibc_model<-Recommender(data= data_train, method = "IBCF", parameter = list(k=30))
model_details<-getModel(ibc_model)
n_items_top<-30
image(model_details$sim[1:n_items_top,1:n_items_top], main="Heatmap of the first rows and columns")
col_sums<-colSums(model_details$sim>0)
qplot(col_sums)+stat_bin(binwidth = 1) + ggtitle("Distribution of column count")
which_max<-order(col_sums, decreasing = TRUE)[1:6]
rownames(model_details$sim)[which_max]
books%>%
filter (ISBN %in% c(colnames(model_details$sim)[which_max]))%>%
select (Book.Title)
#Item Based Collaborative Predict Test
n_recommend <- 6
ibc_predict<- predict(object=ibc_model, newdata= data_test, n=n_recommend)
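# itemLabels are ISBN strings, so colnames(srdata.mat)[x] returns NA; only the ISBN names of the result are used below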
ibc_matrix <-sapply(ibc_predict@itemLabels, function(x){
colnames(srdata.mat)[x]
})
ibc<-ibc_matrix[1:6]
i<-as.data.frame(
c(books%>%
filter (ISBN %in% c(names(ibc)))%>%
select (Book.Title),
books%>%
filter (ISBN %in% c(colnames(data_test@data)[1:6]))%>%
select (Book.Title)))
#User based Collaborative filtering
ubc_model<-Recommender(data= data_train, method = "UBCF")
ubc_model
model_details<-getModel(ubc_model)
n_items_top<-30
image(model_details$data[1:n_items_top,1:n_items_top], main="Heatmap of the first rows and columns")
col_sums<-colSums(model_details$data)
qplot(col_sums)+stat_bin(binwidth = 1) + ggtitle("Distribution of column count")
which_max<-order(col_sums, decreasing = TRUE)[1:6]
colnames(model_details$data)[which_max]
model_details$data
books%>%
filter (ISBN %in% c(colnames(model_details$data)[which_max]))%>%
select (Book.Title)
#User Based Collaborative Predict Test
n_recommend <- 6
ubc_predict<- predict(object=ubc_model, newdata= data_test, n=n_recommend)
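# As with IBCF above, only the ISBN names of this vector are used; rownames(srdata.mat)[x] itself yields NA for a character index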
ubc_matrix <-sapply(ubc_predict@itemLabels, function(x){
rownames(srdata.mat)[x]
})
ubc<-ubc_matrix[1:6]
u<- as.data.frame(
c(books%>%
filter (ISBN %in% c(names(ubc)))%>%
select (Book.Title),
books%>%
filter (ISBN %in% c(colnames(data_test@data)[1:6]))%>%
select (Book.Title)))
i
u