DATA 612 02[11857] : Final Project [07/14]


Recommender System : Amigo De Libro

The goal for your final project is for you to build out a recommender system using a large dataset (e.g., 1M+ ratings, or 10k+ users and 10k+ items). In this final project deliverable, you'll build out the system that you described in your planning document.



1 Prerequisites: Available Libraries

1.1 Loading Packages

  • dplyr
  • tidyr
  • knitr
  • stringr
  • randomForest
  • reshape2
  • ggplot2
  • kableExtra
  • DT
  • data.table
  • sqldf
  • pander
  • recommenderlab
  • tictoc
  • irlba


2 Description

This is an R Markdown document that analyzes the Book-Crossing data and builds and compares three recommenders: user-based collaborative filtering, item-based collaborative filtering, and SVD.

Additionally, in this project a recommender was implemented on a distributed system: an ALS model built on Apache Spark. Its performance was then compared with that of the recommenders above.

3 Installations

The recommender on Apache Spark was installed on a single node. Data manipulation in sparklyr uses the same verbs as dplyr, so the learning curve is said to be gentler for R programmers. sparklyr (backed by Spark) can also be faster than base R for large data, and help documentation is readily available in R (?function_name).

3.1 Install Spark
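No installation commands were run live in this document; a minimal sketch, assuming sparklyr's standard spark_install() helper (the version shown matches the Spark 2.1.0 binary referenced in the next section):

#install.packages("sparklyr")      # sparklyr from CRAN, if not already installed
#library(sparklyr)
#spark_install(version = "2.1.0")  # downloads and installs a local Spark distribution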

3.2 Connect to Spark

## Change SPARK_HOME and JAVA_HOME to accommodate sparklyr, as they may be set to different directories for Scala and SparkR
#Hardcoding the environment for posterity
#java_path <- normalizePath("C:/Program Files/Java/jdk1.8.0_161")
#spark_path <- normalizePath("C:/Users/Debabrata/AppData/Local/spark/spark-2.1.0-bin-hadoop2.7")
#Sys.setenv(JAVA_HOME = java_path)
#Sys.setenv(SPARK_HOME = spark_path)

## Starting spark with 4Gb memory limit configuration
#config <- spark_config()
#config$`sparklyr.shell.driver-memory` <- "4G"
#config$`sparklyr.shell.executor-memory` <- "4G"

#cache = TRUE # to speed up compile
opts_knit$set(verbose=TRUE)  # to see where the code is executing

## Generating the Spark connection
sc <- spark_connect(master = "local"
                    #,version = "2.1.0"
                    #,config = config
                    )

The returned Spark connection (sc) provides a remote dplyr data source to the Spark cluster.
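For example (illustrative only; the dfbookratings data frame is loaded in Section 4), a table copied to Spark can be queried with ordinary dplyr verbs:

#ratings_tbl <- copy_to(sc, dfbookratings, "ratings", overwrite = TRUE)
#ratings_tbl %>% group_by(rating) %>% tally()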

4 Dataset

The Book-Crossing dataset (mined by Cai-Nicolas Ziegler, DBIS Freiburg) was procured from http://www2.informatik.uni-freiburg.de/~cziegler/BX/. The format is a CSV dump comprising three files: the user-book ratings, user information, and book information.

#Dataset
#http://www2.informatik.uni-freiburg.de/~cziegler/BX/
set.seed(3445) # keep the sampled results reproducible

# load data from local drive
dfbookratings <- read.csv("BX-Book-Ratings.csv", header = TRUE, sep =";", stringsAsFactors = FALSE)
dim(dfbookratings)
## [1] 493813      3
colnames(dfbookratings) # "User.ID";"ISBN";"Book.Rating"
## [1] "User.ID"     "ISBN"        "Book.Rating"
# Rename columns for better handling
colnames(dfbookratings) <- c("user","isbn","rating")

#ncol(dfbookratings)
#head(dfbookratings)
#User.ID ISBN Book.Rating
#nrow(dfbookratings)

dfusers <- read.csv("BX-Users.csv", header = TRUE, sep =";", stringsAsFactors = FALSE)
dim(dfusers)
## [1] 140291      3
colnames(dfusers) # "User.ID";"Location";"Age"
## [1] "User.ID"  "Location" "Age"
# Renaming columns for better handling
colnames(dfusers) <- c("user","location","age")

dfbooks <- read.csv("BX-Books.csv", header = TRUE, sep =";", stringsAsFactors = FALSE)
dim(dfbooks)
## [1] 115253      8
colnames(dfbooks) #"ISBN";"Book-Title";"Book-Author";"Year-Of-Publication";"Publisher";"Image-URL-S";"Image-URL-M";"Image-URL-L"
## [1] "ISBN"                "Book.Title"          "Book.Author"        
## [4] "Year.Of.Publication" "Publisher"           "Image.URL.S"        
## [7] "Image.URL.M"         "Image.URL.L"
colnames(dfbooks) <- c("isbn", "title","author","yearpub", "publisher", "iurls","iurlm", "iurll")
dfbooks <- dfbooks %>% select("isbn", "title","author","yearpub", "publisher","iurlm")

#ISBN Book.Title Book.Author Year.Of.Publication Publisher Image.URL.S Image.URL.M Image.URL.L
#Book Category from Google API
bookdec_cfc <- read.csv("final_output_category.csv", header = TRUE, sep =",", stringsAsFactors = FALSE) %>% select(isbn,category)

4.1 Data Preparation

# Validating the rating data against users and books
combinedData <- merge(dfbookratings,dfbooks, by=c("isbn"))
combinedData <- merge(combinedData,dfusers, by=c("user"))
length(unique(combinedData$isbn)) # No of Unique ISBNs
## [1] 46207
length(unique(combinedData$user)) # No of Unique Users
## [1] 14463
dim(combinedData)
## [1] 111981     10
# Reduce the dataset (long format) to users who have rated more than 4 books and books rated by more than 5 users

dfbookratingsvalid <- combinedData %>% group_by(user) %>% filter(n()>4) %>% group_by(isbn) %>% filter(n()>5)
dim(dfbookratingsvalid)
## [1] 37146    10
length(unique(dfbookratingsvalid$isbn)) # No of Unique ISBNs after filtering
## [1] 2407
length(unique(dfbookratingsvalid$user)) # No of Unique Users after filtering
## [1] 2652
################### Wide Format Users ~ Books ####################

# Converting to Wide format to get User Book rating matrix
dfbookratingswide <- dfbookratingsvalid %>% select(user, isbn, rating) %>% spread(isbn, rating)%>%  arrange(user)
dim(dfbookratingswide)
## [1] 2652 2408
# Storing users and books as rows and column  names
rownames(dfbookratingswide) <- dfbookratingswide$user
allusersrated <- rownames(dfbookratingswide) 
allbooksrated <- colnames(dfbookratingswide)
#View(dfbookratingswide)

# Free the memory
rm(dfbookratings)
rm(combinedData)


################### Wide Format Books ~ Users ####################
dfbookratingswidet <- dfbookratingswide %>% select(-user)

# Taking a transpose to have book (rows) by user (columns), This is wide format but books as rows and users as columns
dfbookratingswidet <-  t(dfbookratingswidet)
#The row and column names are interchanged correctly
#View(head(dfbookratingswidet, 100))

dim(dfbookratingswidet)
## [1] 2407 2652
# View sample of dataset

#datatable(head(dfbookratingswide[1:10], 10))
#datatable(head(dfusers, 10))
data.table(head(dfbooks, 2))
# We keep and refer to only the books that are actually rated: validate against the ratings data and store the result in a new book data set, dfbooks2

bookIds <- unique(dfbooks$isbn) # unique books in book data
bookIdsRated <- unique(dfbookratingsvalid$isbn) # unique books actually rated 
dfbooks2 <- dfbooks[bookIds %in% bookIdsRated, ] # keep only the rated books in the new book data set

# Store the isbns as rownames
#rownames(dfbooks2) <- dfbooks2$isbn

# Free memory
#rm(dfbooks)

# We now have,
# 1. dfbookratingsvalid  (Ratings Data Long format) , 
# 2. dfbookratingswide   (Ratings Data wide format user~books)
# 3. dfbookratingswidet  (Ratings Data wide format books~users)
# 4. dfusers             (User Data)
# 5. dfbooks2            (Book Data)

5 I. User-Based Collaborative Filtering

This algorithm groups users according to their history of ratings and recommends an item that a similar user (in the same group) liked. So, if user A liked Books 1, 2 and 3 and user B liked Books 1 and 2, then Book 3 is a good one to recommend to user B. The assumption of UBCF is that similar users will rate books similarly. So, ratings are predicted by first finding a neighborhood of similar users and then aggregating those users' ratings to form a prediction.

Popular measures used are Pearson and cosine distance similarity.
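As a toy illustration (not part of the pipeline), cosine similarity between two users' rating vectors can be computed directly:

u1 <- c(5, 3, 0, 4)
u2 <- c(4, 0, 0, 5)
sum(u1 * u2) / (sqrt(sum(u1^2)) * sqrt(sum(u2^2)))  # dot product over norms, ~0.88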

colnames(dfbookratingsvalid)
##  [1] "user"      "isbn"      "rating"    "title"     "author"   
##  [6] "yearpub"   "publisher" "iurlm"     "location"  "age"
# Generating the user-item matrix for the predictor 
ratingdcast <- dcast(dfbookratingsvalid, user~isbn, 
                   value.var = "rating", fill=0, fun.aggregate = mean)

# Filling in rownames
rownames(ratingdcast) = ratingdcast$user
#colnames(ratingdcast)
# Removing the user id column and converting to a matrix
ratingdcast <- as.matrix(ratingdcast[,-1])

5.1 Splitting the dataset:

Use evaluationScheme to automatically split the dataset into training and testing sets; here the ratings matrix is split 80%/20%. To make predictions of ratings, we need to build a recommender. The following sets are extracted with the getData routine:

Train: the training set
Known: the test-set ratings given to the recommender to make predictions
Unknown: the held-out test-set ratings used to evaluate the predictions

items_to_keep <- 5
percentage_training <- 0.8
rating_threshold <- 5
n_eval <- 1

rdf <- as(as.matrix(ratingdcast),"realRatingMatrix")
rdf <- rdf[,colCounts(rdf) > 3]

head(rdf,5)
## 1 x 2407 rating matrix of class 'realRatingMatrix' with 2407 ratings.
eval_sets <- evaluationScheme(data = rdf, method ="split", train = percentage_training, given = items_to_keep, goodRating = rating_threshold) #, k = n_eval)
eval_sets
## Evaluation scheme with 5 items given
## Method: 'split' with 1 run(s).
## Training set proportion: 0.800
## Good ratings: >=5.000000
## Data set: 2652 x 2407 rating matrix of class 'realRatingMatrix' with 6383364 ratings.
size_sets <- sapply(eval_sets@runsTrain, length)
size_sets
## [1] 2121

5.2 a) Optimizing a numeric parameter (Neighborhood size):

Recommendation models contain a numeric parameter k that controls how many of the closest users/items are taken into account. k can be optimized by testing a range of candidate values and keeping the one that performs best. The default k is 30; here we explore values from 10 to 70. Building and evaluating the models:

vector_k <- c(10, 20, 30, 40, 50, 60, 70)
records <- c(5, 10, 15, 20, 25)  # candidate top-N list sizes (unused below)
model_name <- "UBCF"
method_name <- "Cosine"

#define a list of models to evaluate by using lapply( distance metric is cosine )
models_to_evaluate <- lapply(vector_k, function(k) {
  list(name= model_name, param = list(normalize = "Z-score", method = method_name,nn=k))
  
})


names(models_to_evaluate) <- paste0("UBCF_k_", vector_k)
list_results <- evaluate(x=eval_sets,method = models_to_evaluate, n = vector_k,progress = FALSE)  
list_results
## List of evaluation results for 7 recommenders:
## Evaluation results for 1 folds/samples using method 'UBCF'.
## Evaluation results for 1 folds/samples using method 'UBCF'.
## Evaluation results for 1 folds/samples using method 'UBCF'.
## Evaluation results for 1 folds/samples using method 'UBCF'.
## Evaluation results for 1 folds/samples using method 'UBCF'.
## Evaluation results for 1 folds/samples using method 'UBCF'.
## Evaluation results for 1 folds/samples using method 'UBCF'.
plot(list_results, annotate = 1, legend ="topleft") 
title("ROC curve for different k values")

This evaluation took about 44 seconds per iteration.

The best-performing k can be identified by charting the area under each ROC curve. The highest is k = 70, so it is the best-performing neighborhood size; this value will be used for all subsequent UBCF calculations.
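The ranking can also be checked numerically. A sketch (assuming recommenderlab's avg() accessor on each evaluation result): approximate the area under each ROC curve with the trapezoidal rule.

#avg_cm <- lapply(list_results, avg)   # averaged confusion matrices per k
#sapply(avg_cm, function(cm) {
#  tpr <- cm[, "TPR"]; fpr <- cm[, "FPR"]
#  sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)  # trapezoidal AUC
#})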

Now a similarity matrix is calculated containing all user-to-user similarities, using the cosine similarity measure.

model_to_evaluate <- "UBCF"
model_parameters <- list(normalize = "Z-score", method="Cosine", nn=70)
 
model_cosine <- Recommender(getData(eval_sets,"train"),model_to_evaluate,param=model_parameters)

prediction_cosine <- predict(model_cosine,getData(eval_sets,"known"),type="ratings")

rmse_cosine <- calcPredictionAccuracy(prediction_cosine, getData(eval_sets, "unknown"))[1]
rmse_cosine
##      RMSE 
## 0.3712451
#0.3850089

Now we prepare for prediction and get the top-5 recommendations for each user.

Titlelookup <- subset(dfbooks, select = c(1, 2))
colnames(Titlelookup) <- c("ISBN", "title")
Titlelookup <- Titlelookup[!duplicated(Titlelookup), ]

# Look up a book title by its ISBN
getTitle <- function(ISBN1) {
  subset(Titlelookup, ISBN == ISBN1)$title
}
## Create databases here 

pred2 <- predict(model_cosine,getData(eval_sets, "unknown"),n=5)
# view the recommendations for top 5 users
#as(pred2,"list")[1:5]
#pred2Copy <- pred2
#define a list with the recommendations for each user
recc_matrix <- lapply(pred2@items, function(x){
  colnames(rdf)[x]
})

# Let's take a look at the recommendations for the first 3 users:
recc_matrix[1:3]
## $`53`
## [1] "0440235162" "0553294385" "0440170796" "0449002632" "0553278916"
## 
## $`99`
## [1] "0440214041" "0826308791" "067973225X" "034539092X" "0385491328"
## 
## $`176`
## [1] "0060915544" "0064407675" "0452264464" "067081976X" "0767903862"

Now, to get recommendations from the UBCF method by user ID.

Prediction by user ID could not be completed due to an error that could not be resolved within the time available:

Error: If newdata is a user id then data needes to be the training dataset.
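A likely resolution (untested here): when newdata is a user index rather than a ratingMatrix, recommenderlab's predict() expects the full rating matrix through its data argument, e.g.:

#pred_by_id <- predict(model_cosine, newdata = 1, data = rdf, n = 5)  # 1 = row index of the user in rdf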

5.3 b) Distance methods:

Distance methods measure the similarity between users/items based on the distance between them. Popular measures are Pearson, Jaccard, and cosine.

model_to_evaluate <- "UBCF"
kval <- 70
valList <- c(0, 20,30,40,50,60,70)

model_parameters1 <- list(normalize = "Z-score",method="Cosine",nn=kval)
model_parameters2 <- list(normalize = "Z-score",method="Pearson",nn=kval)
model_parameters3 <- list(normalize = "Z-score",method="jaccard",nn=kval)

distItem <- list(
   "Cosine" = list(name=model_to_evaluate, param=model_parameters1),
   "Pearson" = list(name=model_to_evaluate, param=model_parameters2),
   "Jaccard" = list(name=model_to_evaluate, param=model_parameters3)
)

dist_resultsUBCF <- evaluate(eval_sets, distItem, n=valList)
## UBCF run fold/sample [model time/prediction time]
##   1  [1.29sec/30.7sec] 
## UBCF run fold/sample [model time/prediction time]
##   1  [1.29sec/18.61sec] 
## UBCF run fold/sample [model time/prediction time]
##   1  [1.55sec/30.66sec]
#avg(dist_resultsUBCF)

plot(x=dist_resultsUBCF, y ="ROC")
title("ROC curve")

Model building took about 1.5 seconds per run, with prediction taking roughly 19 to 31 seconds. From the ROC curve, performance was best with the cosine measure (the top curve on the graph).

5.4 c) Normalization method:

Data needs to be normalized before applying the algorithm. Centering subtracts each user's mean rating from that user's known ratings; Z-score normalization additionally divides by the user's standard deviation. Both the center and Z-score options are passed to recommenderlab and compared below.
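As a quick illustration (not required for evaluate(), which normalizes internally), recommenderlab can normalize a realRatingMatrix directly:

#rdf_centered <- normalize(rdf, method = "center")
#rdf_zscore   <- normalize(rdf, method = "Z-score")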

alg_dist <- list(
   "center" = list(name="UBCF", param=list(normalize = "center",method="Cosine",nn=70)),
   "Zscore" = list(name="UBCF", param=list(normalize = "Z-score",method="Cosine",nn=70))
)

dist_resultsUBCF <- evaluate(eval_sets, alg_dist, n=c(1, 5, 10, 15, 20, 25))
## UBCF run fold/sample [model time/prediction time]
##   1  [0.91sec/30.81sec] 
## UBCF run fold/sample [model time/prediction time]
##   1  [1.65sec/30.29sec]
#plot ROC
plot(x = dist_resultsUBCF, y = "ROC")
title("ROC curve")

The center option did better than Z-score.

6 II. SVD Recommender

Singular value decomposition (SVD), a matrix factorization technique, will be used to create the SVD recommender. SVD was used by the BellKor team in the recommender that won the Netflix Prize in 2009. SVD reduces the number of features of a dataset by reducing its dimensionality; here the factorization is performed on the user-item ratings matrix.

booksummary <- dfbookratingsvalid %>%
  group_by(isbn) %>%
  summarise(
    avg_review = mean(rating))
  
# Computing a partial (rank-3) SVD with irlba
decomp <- irlba(ratingdcast, nu = 3, nv = 3)

SVD is used to reduce dimensionality, so neighborhood formation happens in a reduced user space: it helps the model find a low-rank approximation of the dataset matrix. However, SVD is computationally expensive, so the irlba package for R will be used to perform the decomposition. The irlba function computes only the number of singular values corresponding to the maximum of the desired singular vectors, max(nu, nv). For example, if 5 singular vectors are desired (nu = nv = 5), only the five corresponding singular values are computed. In contrast, base R's svd() always returns the full set of singular values for the matrix, regardless of how many singular vectors are requested.

The irlba package provides a fast way to compute partial singular value decompositions (SVD) of large sparse or dense matrices. Its key parameters are:

  • nv: the number of right singular vectors to estimate
  • nu: the number of left singular vectors to estimate
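A quick sanity check on a small random matrix (illustrative only) shows that irlba's partial SVD agrees with the leading singular values from base R's full svd():

#m <- matrix(rnorm(20), 5, 4)
#irlba(m, nu = 2, nv = 2)$d  # the two leading singular values
#svd(m)$d[1:2]               # the same values from the full decomposition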

# Generating the prediction matrix: reconstruct the rank-3 approximation
# U D V' and add back each book's average rating column-wise
predBook <- sweep(decomp$u %*% diag(decomp$d) %*% t(decomp$v), 2,
                  booksummary$avg_review, "+")

RMSE <- function(predictionMatrix, actualMatrix){
  sqrt(mean((predictionMatrix - actualMatrix)^2, na.rm=T))
}

# actualMatrix
NADF <- ratingdcast
is.na(NADF) <- NADF == 0
RMSE(predBook, NADF)
## [1] 5.45191
#5.45191

Using the irlba package gave an RMSE of 5.45191, which is not better than UBCF.

7 III. Item-Based Collaborative Filtering

7.1 a) Optimizing a numeric parameter (neighborhood size):

As with UBCF, the neighborhood parameter k controls how many of the closest items are taken into account and can be tuned by testing a range of values (default 30, explored 10 to 70). Here the best value found above, k = 70, is reused.

Now a similarity matrix is calculated containing all item-to-item similarities, using the cosine similarity measure.

model_to_evaluate <- "IBCF"
model_parameters <- list(normalize = "Z-score", method="Cosine", k=70)
 
model_cosineIBCF <- Recommender(getData(eval_sets,"train"), model_to_evaluate, param=model_parameters)

prediction_cosineIBCF <- predict(model_cosineIBCF,getData(eval_sets,"known"),type="ratings")

rmse_cosineIBCF <- calcPredictionAccuracy(prediction_cosineIBCF, getData(eval_sets, "unknown"))[1]
rmse_cosineIBCF
##     RMSE 
## 2.744873
#0.3850089

7.2 b) Distance methods:

Distance methods measure the similarity between users/items based on the distance between them. Popular measures are Pearson, Jaccard, and cosine.

model_to_evaluate <- "IBCF"
kval <- 70
valList <- c(0, 20,30,40,50,60,70)

model_parameters1 <- list(normalize = "Z-score",method="Cosine",k=kval)
model_parameters2 <- list(normalize = "Z-score",method="Pearson",k=kval)
model_parameters3 <- list(normalize = "Z-score",method="jaccard",k=kval)

distItem <- list(
   "Cosine" = list(name=model_to_evaluate, param=model_parameters1),
   "Pearson" = list(name=model_to_evaluate, param=model_parameters2),
   "Jaccard" = list(name=model_to_evaluate, param=model_parameters3)
)

dist_resultsIBCF <- evaluate(eval_sets, distItem, n=valList)
## IBCF run fold/sample [model time/prediction time]
##   1  [144.22sec/0.22sec] 
## IBCF run fold/sample [model time/prediction time]
##   1  [114.19sec/0.21sec] 
## IBCF run fold/sample [model time/prediction time]
##   1  [141.67sec/0.17sec]
#avg(dist_resultsIBCF)

plot(x=dist_resultsIBCF, y ="ROC")
title("ROC curve")

Prediction with IBCF proceeds in the same way as with UBCF.

7.3 c) Normalization method:

Data needs to be normalized before applying the algorithm. Centering subtracts each user's mean rating from that user's known ratings; Z-score normalization additionally divides by the user's standard deviation. Both the center and Z-score options are passed to recommenderlab and compared below.

alg_dist <- list(
   "center" = list(name="IBCF", param=list(normalize = "center",method="Cosine",k=70)),
   "Zscore" = list(name="IBCF", param=list(normalize = "Z-score",method="Cosine",k=70))
)

dist_resultsIBCF <- evaluate(eval_sets, alg_dist, n=c(20, 25))
## IBCF run fold/sample [model time/prediction time]
##   1  [140.07sec/0.15sec] 
## IBCF run fold/sample [model time/prediction time]
##   1  [143.18sec/0.15sec]
#plot ROC
plot(x = dist_resultsIBCF, y = "ROC")
title("ROC curve")

Z-score did better than center.

ubcf <- c('0.448','70','Cosine','Center','2 sec per iteration')
ibcf <-c('1.067','70','Pearson','ZScore','202 secs per iteration')
svd <- c('5.45191','N/A','N/A','N/A','N/A')
als <- c('0.9784569','N/A','N/A','N/A','4 seconds total')

myresults.df <- data.frame(ubcf,ibcf,svd,als)
str(myresults.df)
## 'data.frame':    5 obs. of  4 variables:
##  $ ubcf: Factor w/ 5 levels "0.448","2 sec per iteration",..: 1 3 5 4 2
##  $ ibcf: Factor w/ 5 levels "1.067","202 secs per iteration",..: 1 3 4 5 2
##  $ svd : Factor w/ 2 levels "5.45191","N/A": 1 2 2 2 2
##  $ als : Factor w/ 3 levels "0.9784569","4 seconds total",..: 1 3 3 3 2
colnames(myresults.df) <- c("UBCF   ","IBCF   ","SVD   ","ALS")
rownames(myresults.df) <- c("RMSE-Distance","Nearest Neighborhood","Best Similarity using","Normalized using","Execution Time")

kable(myresults.df,  type = "html",caption="Results - comparing recommender systems")
Results - comparing recommender systems

                        UBCF                  IBCF                     SVD       ALS
RMSE-Distance           0.448                 1.067                    5.45191   0.9784569
Nearest Neighborhood    70                    70                       N/A       N/A
Best Similarity using   Cosine                Pearson                  N/A       N/A
Normalized using        Center                ZScore                   N/A       N/A
Execution Time          2 sec per iteration   202 secs per iteration   N/A       4 seconds total

8 Conclusion

The concept of singular value decomposition (SVD) was explored in this project. SVD is a dimensionality reduction technique that extracts the maximum variability in the observations. It is sometimes called unsupervised learning, since labeled observations are not required and groups of similar observations are found by clustering. Reducing dimensions without losing important information was achieved using the irlba library.

UBCF is recommended over IBCF and SVD for the following reasons. When calculating:

a) Distance methods - UBCF had the lower RMSE, indicating a better fit; the Jaccard measure also gave good readings.
b) Normalization - center performed best for UBCF.
c) Neighborhood size - UBCF had the lowest RMSE at k = 70.
d) Computation time - UBCF also ran much faster than the other models.

UBCF might be a stronger recommender than IBCF because users might not be looking for direct substitutes for the books they have already read. They might prefer serendipity over accuracy: they want to find good things without looking for them.

When the RMSE of ALS (which was run on the AWS cloud) was compared with the other models, it compared well, and ALS was the fastest recommender, taking only about 4 seconds in total, versus roughly 44 seconds per evaluation iteration for UBCF and well over 100 seconds of model-building time per run for IBCF.
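The ALS code itself is not reproduced here; a minimal sketch, assuming sparklyr's ml_als() (the integer item coding and table name are illustrative, not the code actually run on AWS), of how such a model could be fit over the Spark connection sc:

#ratings_tbl <- copy_to(sc, dfbookratingsvalid %>%
#                         transmute(user, item = as.integer(factor(isbn)), rating),
#                       "ratings", overwrite = TRUE)
#als_model <- ml_als(ratings_tbl, rating_col = "rating",
#                    user_col = "user", item_col = "item")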

So, among the UBCF, IBCF, SVD and ALS recommender models, the user-based model was the best in terms of performance and a low RMSE.


Debabrata Kabiraj

July 14, 2019