Overview : Final Project Recommender system.

As part of our Data612 , we need to build a working recommender system as our final project assignment.

Below are libraries used for thuis project
recommenderlab
dplyr
reshape2
kableExtra
tidyr
data.table

Data loading , preperation of relevant dataset

Data file is loaded from local drive as the data set was heavy and github doesn’t supports this heavy file. Due to local menmory issues curtailing the dataset to be loaded to only rows 10k . Loading all the 3 csv’s and joining then merged them into one big data frame using dplyr & tidyr functions like merge,filter,groupby etc. Finally we extract from this big data frame our rating DF which consists of user,isbn,ratings. We will be using this new data frame for our recommender models.

First we convert the data frame inot matrix and then into realRatingMatrix object. Data has been taken from Book Ratings

set.seed(3445) 
# to keep #s from the results the same

# load data from local drive
dfbookratings <- read.csv("ratings.csv", header = TRUE, sep =";", stringsAsFactors = FALSE ,nrows = 10000)

dim(dfbookratings)

## [1] 10000     3

colnames(dfbookratings) # "User.ID";"ISBN";"Book.Rating"

## [1] "User.ID"     "ISBN"        "Book.Rating"

colnames(dfbookratings) <- c("user","isbn","rating")

dfusers <- read.csv("users.csv", header = TRUE, sep =";", stringsAsFactors = FALSE)


colnames(dfusers) # "User.ID";"Location";"Age"

## [1] "User.ID"  "Location" "Age"

colnames(dfusers) <- c("user","location","age")

dfbooks <- read.csv("books.csv", header = TRUE, sep =";", stringsAsFactors = FALSE)


# Data Preparation

colnames(dfbooks) <- c("isbn", "title","author","yearpub", "publisher", "iurls","iurlm", "iurll")
dfbooks <- dfbooks %>% select("isbn", "title","author","yearpub", "publisher","iurlm")


combinedData <- merge(dfbookratings,dfbooks, by=c("isbn"))
combinedData <- merge(combinedData,dfusers, by=c("user"))
length(unique(combinedData$isbn)) # No of Unique ISBNs

## [1] 2705

dfbookratingsvalid <- combinedData %>% group_by(user) %>% filter(n()>4) %>% group_by(isbn) %>% filter(n()>5)



dfbookratingswide <- dfbookratingsvalid %>% select(user, isbn, rating) %>% spread(isbn, rating)%>%  arrange(user)



rownames(dfbookratingswide) <- dfbookratingswide$user

## Warning: Setting row names on a tibble is deprecated.

allusersrated <- rownames(dfbookratingswide) 
allbooksrated <- colnames(dfbookratingswide)

dfbookratingswidet <- dfbookratingswide %>% select(-user)



booksDF <- merge(dfbooks,dfbookratings,by ="isbn")



# defining the rating table by selecting the user,ISBN,rating columns
ratingDF <- select(booksDF,"user","isbn","rating")
data.table(head(ratingDF))

ratingDF$isbn <- as.integer(ratingDF$isbn)

## Warning: NAs introduced by coercion

## Warning: NAs introduced by coercion to integer range

ratingDF$user <- as.integer(ratingDF$user)



ratingDF <- na.omit(ratingDF)

ratings_wide <- reshape(ratingDF, idvar="user",timevar = "isbn", direction = "wide") %>% arrange(user)

colnames(ratings_wide)[colnames(ratings_wide)=="user"] <-"ratingDF.user"
names(ratings_wide) <- substring(names(ratings_wide),8)

ratings_wide_2 <- ratings_wide[,-1]
rownames(ratings_wide_2) <- ratings_wide[,1]
ratings_matrix_wide <- as.matrix(ratings_wide_2)
ratings_matrix_wide[is.na(ratings_matrix_wide)] <- 0
ratings_matrix <- as(ratings_matrix_wide, "realRatingMatrix")

Data Exploration

Exploring the data , extracting the matrix data into vector and then finding the unique ratings . Then summing the total ratings polled for each unique rating . Exteracting out the ratings for 0 and plotting to find the most common rating in data.

#Exploring the data first converting the matrix into vector and see how many unique value are there for ratings 
vector_ratings <- as.vector(ratings_matrix@data)

knitr::kable(unique(vector_ratings))%>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))

x
0
5
7
10
8
9
3
4
6
2
1

#displaying in tabular form the count of each unique value.
##### Need to open this when we run on another system as this system has very less RAM
table_ratings <- table(vector_ratings)
knitr::kable(table_ratings)%>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))

vector_ratings	Freq
0	1097527
1	3
2	5
3	15
4	23
5	84
6	66
7	164
8	185
9	140
10	132

# As per data description 0 are NA and we can even ignore anything less than 1 
vector_ratings<- factor(vector_ratings[vector_ratings != 0])

#visualize movie data ratngs using ggplot2
qplot(vector_ratings) + ggtitle("Distribution of the ratings")

rm(vector_ratings)
rm(table_ratings)

Clearly 0 is most common rating given, then exluding the zero and plotting it on graph gives us the representation that 8 is most common rating provided followed by 7 and 1 is least rating provided.

Finding the rating per book (column wise) and rating provided by per user(i.e. row wise)

view_per_books <- colCounts(ratings_matrix)

table_views <- data.frame(
  book = names(view_per_books),
  views = view_per_books
)
table_views <- table_views[order(table_views$views,decreasing = TRUE),]

knitr::kable(head(table_views))%>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))

	book	views
2005018	2005018	472
2740230	2740230	472
6379702	6379702	472
6485294	6485294	472
6543545	6543545	472
6546684	6546684	472

##row counts for user ratings

ratings_per_user <- rowCounts(ratings_matrix)

table_views <- data.frame(
  user = names(ratings_per_user),
  rating = ratings_per_user
)
table_views <- table_views[order(table_views$rating,decreasing = TRUE),]

knitr::kable(head(table_views))%>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))

	user	rating
2	2	2327
8	8	2327
9	9	2327
10	10	2327
12	12	2327
14	14	2327

From the above observation , we can clearly say the maximum column wise rating per book is 472 and maximum row wise rating provided by user is 2327.

Model Building

Spliting the data into training and testing set. We set 80% of data values into training set, and the rest 20% into testing set. Also set that any rating above 5 or equal to 5 is considered to be a good rating , and below 5 is bad ratings.

Creating a cluster of Models with evaluationScheme and using “split” ,ethod to run against the data and then evaluating them

percentage_training <- 0.8
items_to_keep <- 100
# above 5 is good ratings, below 5 is bad ratings.
rating_threshold <- 5
# times to run the evaluation
n_eval <- 1
evaluation_set <- evaluationScheme(data=ratings_matrix,method = "split", train = percentage_training, given = items_to_keep, goodRating = rating_threshold, k = n_eval)
evaluation_set

## Evaluation scheme with 100 items given
## Method: 'split' with 1 run(s).
## Training set proportion: 0.800
## Good ratings: >=5.000000
## Data set: 472 x 2327 rating matrix of class 'realRatingMatrix' with 1098344 ratings.

getData(evaluation_set,"train")

## 377 x 2327 rating matrix of class 'realRatingMatrix' with 877279 ratings.

models_to_evaluate <- list(
  IBCF_cos = list(name = "IBCF", param = list(method = "cosine")),
  IBCF_cor = list(name = "IBCF", param= list (method = "pearson")),
  UBCF_cos = list(name = "UBCF", param = list(method = "cosine")),
  UBCF_cor = list(name = "UBCF", param = list(method = "pearson")),
  random=list(name="RANDOM", param =NULL)
)
n_recommendations <- c (1,5, seq(10,50,10))
list_results <- evaluate(x = evaluation_set, method = models_to_evaluate, n = n_recommendations)

## IBCF run fold/sample [model time/prediction time]
##   1  [82.7sec/0.09sec] 
## IBCF run fold/sample [model time/prediction time]
##   1  [83.19sec/0.1sec] 
## UBCF run fold/sample [model time/prediction time]
##   1  [0.06sec/1.24sec] 
## UBCF run fold/sample [model time/prediction time]
##   1  [0.06sec/0.87sec] 
## RANDOM run fold/sample [model time/prediction time]
##   1  [0sec/0.31sec]

avg_matrices <-lapply(list_results,avg)
knitr::kable(head(avg_matrices$UBCF_cos[,1:8]))%>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))

	TP	FP	FN	TN	precision	recall	TPR	FPR
1	0.0105263	0.9894737	1.705263	2224.295	0.0105263	0.0068493	0.0068493	0.0004447
5	0.0210526	4.9789474	1.694737	2220.305	0.0042105	0.0205479	0.0205479	0.0022374
10	0.0210526	9.9789474	1.694737	2215.305	0.0021053	0.0205479	0.0205479	0.0044844
20	0.0315789	19.9684211	1.684211	2205.316	0.0015789	0.0216895	0.0216895	0.0089734
30	0.0526316	29.9473684	1.663158	2195.337	0.0017544	0.0259513	0.0259513	0.0134577
40	0.0736842	39.9263158	1.642105	2185.358	0.0018421	0.0343227	0.0343227	0.0179421

Plotting the output to draw ROC curve to find the best Prtediction Model from the above set of Models.

plot(list_results,annotate = 1, legend = "topleft",main="ROC Curve")

Looking at the above graph it is clear visible that UBCF model using cosine method is the one covering the maximum area unser the curve which signifies that this model is best.

plot(list_results,"prec/rec", annotate = 1, legend = "bottomright", main = "Precision-recall")

The good performance index is the area under the AUC curve. In our graph, the highest is UBCF with cosine distance, so it is the best-perfoming technique that we can use. For the Precision-recall graph, UBCF with cosine distance is still the top model.

Apache Spark ML implements alternating least squares (ALS) for collaborative filtering, a very popular algorithm for making recommendations

library(sparklyr)

sc <- spark_connect(master = "local")

sp_books <- sdf_copy_to(sc,ratingDF,"spbooks",overwrite = TRUE)
partitions <- sp_books %>% sdf_random_split(training = 0.8, test = 0.2)

sp_books_training <- partitions$training
sp_books_test <- partitions$test
head(sp_books_training)

sparkALS <- ml_als(sp_books_training, max_iter = 5, nonnegative = TRUE,         rating_col = "rating", user_col = "user", item_col = "isbn")

sparkPred <- sparkALS$.jobj %>%
  invoke("transform", spark_dataframe(sp_books_test)) %>%
  collect()

sparkPred <- sparkPred[!is.na(sparkPred$prediction), ]

predictions <- ml_predict(sparkALS, sp_books_test)

mseSpark <- mean((sparkPred$rating - sparkPred$prediction)^2)

rmseSpark <- sqrt(mseSpark)

maeSpark <- mean(abs(sparkPred$rating - sparkPred$prediction))

accuracy <- data.frame(RMSE = rmseSpark, MSE = mseSpark, MAE = maeSpark)

accuracy

predictions <- data.frame(predictions)
predictions$difference <- (predictions$rating - predictions$prediction)
predictions$difference_square <- (predictions$difference)^2

head(predictions)

sqrt(mean(predictions$difference_square,na.rm = TRUE))

## [1] 4.416426

Summary

We can clearly see that UBCF with cosine distance is clearly a better model in prediction.

Final_Project

Samriti Malhotra && Vishal Arora

7/14/2019

Overview : Final Project Recommender system.

Data loading , preperation of relevant dataset

Data Exploration

Model Building

Summary