As part of our Data612 course, we need to build a working recommender system for our final project assignment.
Below are the libraries used for this project:
recommenderlab
dplyr
reshape2
kableExtra
tidyr
data.table
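A minimal sketch of loading these packages (ggplot2 is an assumed addition here, since qplot() in the exploration section requires it; knitr is referenced later with an explicit namespace):
# load the packages used throughout this report
library(recommenderlab)
library(dplyr)
library(reshape2)
library(kableExtra)
library(tidyr)
library(data.table)
library(ggplot2) # assumed: provides qplot() used in the exploration section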
The data files are loaded from the local drive because the dataset is large and GitHub does not support files of this size. Due to local memory constraints, we curtail the load to the first 10,000 rows. We load all three CSVs and merge them into one big data frame using dplyr and tidyr functions such as merge, filter, and group_by. Finally, we extract from this big data frame our ratings data frame, which consists of user, isbn, and rating; we will use this new data frame for our recommender models.
First we convert the data frame into a matrix and then into a realRatingMatrix object. The data were taken from the Book Ratings dataset.
set.seed(3445) # fix the random seed so results are reproducible
# load the ratings data from the local drive (first 10,000 rows only)
dfbookratings <- read.csv("ratings.csv", header = TRUE, sep = ";", stringsAsFactors = FALSE, nrows = 10000)
dim(dfbookratings)
## [1] 10000 3
colnames(dfbookratings) # "User.ID";"ISBN";"Book.Rating"
## [1] "User.ID" "ISBN" "Book.Rating"
colnames(dfbookratings) <- c("user","isbn","rating")
dfusers <- read.csv("users.csv", header = TRUE, sep =";", stringsAsFactors = FALSE)
colnames(dfusers) # "User.ID";"Location";"Age"
## [1] "User.ID" "Location" "Age"
colnames(dfusers) <- c("user","location","age")
dfbooks <- read.csv("books.csv", header = TRUE, sep =";", stringsAsFactors = FALSE)
# Data Preparation
colnames(dfbooks) <- c("isbn", "title","author","yearpub", "publisher", "iurls","iurlm", "iurll")
dfbooks <- dfbooks %>% select("isbn", "title","author","yearpub", "publisher","iurlm")
combinedData <- merge(dfbookratings,dfbooks, by=c("isbn"))
combinedData <- merge(combinedData,dfusers, by=c("user"))
length(unique(combinedData$isbn)) # No of Unique ISBNs
## [1] 2705
# keep users with at least 5 ratings and books with at least 6 ratings
dfbookratingsvalid <- combinedData %>% group_by(user) %>% filter(n() > 4) %>% group_by(isbn) %>% filter(n() > 5)
dfbookratingswide <- dfbookratingsvalid %>% select(user, isbn, rating) %>% spread(isbn, rating)%>% arrange(user)
rownames(dfbookratingswide) <- dfbookratingswide$user
## Warning: Setting row names on a tibble is deprecated.
allusersrated <- rownames(dfbookratingswide)
allbooksrated <- colnames(dfbookratingswide)
dfbookratingswidet <- dfbookratingswide %>% select(-user)
booksDF <- merge(dfbooks,dfbookratings,by ="isbn")
# defining the rating table by selecting the user,ISBN,rating columns
ratingDF <- select(booksDF,"user","isbn","rating")
data.table(head(ratingDF))
ratingDF$isbn <- as.integer(ratingDF$isbn)
## Warning: NAs introduced by coercion
## Warning: NAs introduced by coercion to integer range
ratingDF$user <- as.integer(ratingDF$user)
ratingDF <- na.omit(ratingDF)
ratings_wide <- reshape(ratingDF, idvar="user",timevar = "isbn", direction = "wide") %>% arrange(user)
colnames(ratings_wide)[colnames(ratings_wide) == "user"] <- "ratingDF.user"
names(ratings_wide) <- substring(names(ratings_wide), 8) # strip the "rating." prefix that reshape() adds to the value columns
ratings_wide_2 <- ratings_wide[,-1] # drop the user column ...
rownames(ratings_wide_2) <- ratings_wide[,1] # ... and keep the user IDs as row names
ratings_matrix_wide <- as.matrix(ratings_wide_2)
ratings_matrix_wide[is.na(ratings_matrix_wide)] <- 0
ratings_matrix <- as(ratings_matrix_wide, "realRatingMatrix")
Exploring the data: we extract the matrix data into a vector and find the unique ratings, then count the total number of ratings polled for each unique value. Finally, we filter out the 0 ratings and plot the rest to find the most common rating in the data.
# Exploring the data: first convert the matrix into a vector and see how many unique rating values there are
vector_ratings <- as.vector(ratings_matrix@data)
knitr::kable(unique(vector_ratings))%>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))
| rating |
|--------|
| 0 |
| 5 |
| 7 |
| 10 |
| 8 |
| 9 |
| 3 |
| 4 |
| 6 |
| 2 |
| 1 |
# display the count of each unique rating value in tabular form
# NOTE: this chunk is memory-intensive; run it on a machine with more RAM if it fails locally
table_ratings <- table(vector_ratings)
knitr::kable(table_ratings)%>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))
| vector_ratings | Freq |
|----------------|---------|
| 0 | 1097527 |
| 1 | 3 |
| 2 | 5 |
| 3 | 15 |
| 4 | 23 |
| 5 | 84 |
| 6 | 66 |
| 7 | 164 |
| 8 | 185 |
| 9 | 140 |
| 10 | 132 |
# per the data description, a rating of 0 means NA, so we can ignore anything less than 1
vector_ratings<- factor(vector_ratings[vector_ratings != 0])
# visualize the book rating distribution using ggplot2
qplot(vector_ratings) + ggtitle("Distribution of the ratings")
rm(vector_ratings)
rm(table_ratings)
Clearly 0 is the most common rating given. Excluding the zeros and plotting the remainder shows that 8 is the most common rating provided, followed by 7, while 1 is the least common.
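As a quick programmatic check (a minimal sketch; it recomputes the vector because vector_ratings was removed above):
# recompute the nonzero ratings and report the most common value
nz <- as.vector(ratings_matrix@data)
nz <- nz[nz != 0]
names(sort(table(nz), decreasing = TRUE))[1] # "8", matching the frequency table above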
Finding the number of ratings per book (column-wise) and the number of ratings provided by each user (row-wise).
view_per_books <- colCounts(ratings_matrix)
table_views <- data.frame(
book = names(view_per_books),
views = view_per_books
)
table_views <- table_views[order(table_views$views,decreasing = TRUE),]
knitr::kable(head(table_views))%>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))
|  | book | views |
|---|------|-------|
| 2005018 | 2005018 | 472 |
| 2740230 | 2740230 | 472 |
| 6379702 | 6379702 | 472 |
| 6485294 | 6485294 | 472 |
| 6543545 | 6543545 | 472 |
| 6546684 | 6546684 | 472 |
# row counts for user ratings
ratings_per_user <- rowCounts(ratings_matrix)
table_views <- data.frame(
user = names(ratings_per_user),
rating = ratings_per_user
)
table_views <- table_views[order(table_views$rating,decreasing = TRUE),]
knitr::kable(head(table_views))%>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))
|  | user | rating |
|---|------|--------|
| 2 | 2 | 2327 |
| 8 | 8 | 2327 |
| 9 | 9 | 2327 |
| 10 | 10 | 2327 |
| 12 | 12 | 2327 |
| 14 | 14 | 2327 |
From the above observations, the most-rated books each have 472 ratings (column-wise) and the most active users have each rated 2,327 books (row-wise). Because we replaced the NAs with 0 before building the 472 x 2327 matrix, these maxima equal the matrix dimensions: every cell holds a (possibly zero) rating.
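A quick sanity check of these maxima (a minimal sketch using recommenderlab's count helpers):
# maximum number of ratings per book (columns) and per user (rows)
max(colCounts(ratings_matrix)) # 472, the number of users
max(rowCounts(ratings_matrix)) # 2327, the number of books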
Splitting the data into training and testing sets: we put 80% of the values into the training set and the remaining 20% into the testing set. We also define any rating greater than or equal to 5 as a good rating, and anything below 5 as a bad rating.
Creating a set of models with evaluationScheme, using the "split" method to run against the data, and then evaluating them.
percentage_training <- 0.8
items_to_keep <- 100
# ratings of 5 or above are good, below 5 are bad
rating_threshold <- 5
# times to run the evaluation
n_eval <- 1
evaluation_set <- evaluationScheme(data=ratings_matrix,method = "split", train = percentage_training, given = items_to_keep, goodRating = rating_threshold, k = n_eval)
evaluation_set
## Evaluation scheme with 100 items given
## Method: 'split' with 1 run(s).
## Training set proportion: 0.800
## Good ratings: >=5.000000
## Data set: 472 x 2327 rating matrix of class 'realRatingMatrix' with 1098344 ratings.
getData(evaluation_set,"train")
## 377 x 2327 rating matrix of class 'realRatingMatrix' with 877279 ratings.
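For reference, the same accessor exposes the test users' data (a minimal sketch of the standard recommenderlab split):
# the 100 "given" items per test user, and the held-out ratings used for scoring
getData(evaluation_set, "known")
getData(evaluation_set, "unknown")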
models_to_evaluate <- list(
IBCF_cos = list(name = "IBCF", param = list(method = "cosine")),
IBCF_cor = list(name = "IBCF", param= list (method = "pearson")),
UBCF_cos = list(name = "UBCF", param = list(method = "cosine")),
UBCF_cor = list(name = "UBCF", param = list(method = "pearson")),
random=list(name="RANDOM", param =NULL)
)
n_recommendations <- c(1, 5, seq(10, 50, 10))
list_results <- evaluate(x = evaluation_set, method = models_to_evaluate, n = n_recommendations)
## IBCF run fold/sample [model time/prediction time]
## 1 [82.7sec/0.09sec]
## IBCF run fold/sample [model time/prediction time]
## 1 [83.19sec/0.1sec]
## UBCF run fold/sample [model time/prediction time]
## 1 [0.06sec/1.24sec]
## UBCF run fold/sample [model time/prediction time]
## 1 [0.06sec/0.87sec]
## RANDOM run fold/sample [model time/prediction time]
## 1 [0sec/0.31sec]
avg_matrices <- lapply(list_results, avg)
knitr::kable(head(avg_matrices$UBCF_cos[,1:8]))%>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))
| n | TP | FP | FN | TN | precision | recall | TPR | FPR |
|---|----|----|----|----|-----------|--------|-----|-----|
| 1 | 0.0105263 | 0.9894737 | 1.705263 | 2224.295 | 0.0105263 | 0.0068493 | 0.0068493 | 0.0004447 |
| 5 | 0.0210526 | 4.9789474 | 1.694737 | 2220.305 | 0.0042105 | 0.0205479 | 0.0205479 | 0.0022374 |
| 10 | 0.0210526 | 9.9789474 | 1.694737 | 2215.305 | 0.0021053 | 0.0205479 | 0.0205479 | 0.0044844 |
| 20 | 0.0315789 | 19.9684211 | 1.684211 | 2205.316 | 0.0015789 | 0.0216895 | 0.0216895 | 0.0089734 |
| 30 | 0.0526316 | 29.9473684 | 1.663158 | 2195.337 | 0.0017544 | 0.0259513 | 0.0259513 | 0.0134577 |
| 40 | 0.0736842 | 39.9263158 | 1.642105 | 2185.358 | 0.0018421 | 0.0343227 | 0.0343227 | 0.0179421 |
Plotting the output as an ROC curve to find the best prediction model from the above set of models.
plot(list_results,annotate = 1, legend = "topleft",main="ROC Curve")
Looking at the graph above, it is clearly visible that the UBCF model with the cosine method covers the maximum area under the curve, which signifies that this model is the best.
plot(list_results,"prec/rec", annotate = 1, legend = "bottomright", main = "Precision-recall")
A good performance index is the area under the ROC curve (AUC). In our graph, the highest is UBCF with cosine distance, so it is the best-performing technique we can use. In the precision-recall graph, UBCF with cosine distance is still the top model.
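As a natural next step (a sketch, not part of the original run), the winning UBCF model can be trained on the training portion and used to produce top-N recommendations for the test users:
# train the winning UBCF (cosine) model and recommend 5 books per test user
ubcf_model <- Recommender(getData(evaluation_set, "train"), method = "UBCF",
                          parameter = list(method = "cosine"))
top5 <- predict(ubcf_model, getData(evaluation_set, "known"), n = 5)
as(top5, "list")[1:3] # recommended ISBNs for the first three test users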
Apache Spark ML implements alternating least squares (ALS) for collaborative filtering, a very popular algorithm for making recommendations.
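For reference (this formula is standard ALS, not shown in the original analysis), ALS factorizes the rating matrix into user factors $p_u$ and item factors $q_i$, alternating between least-squares solves for each while the other is held fixed, to minimize the regularized objective

$$\min_{P,\,Q} \sum_{(u,i)\ \mathrm{observed}} \left(r_{ui} - p_u^{\top} q_i\right)^2 + \lambda \left( \lVert p_u \rVert^2 + \lVert q_i \rVert^2 \right)$$

where $\lambda$ is the regularization parameter (reg_param in ml_als).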
library(sparklyr)
sc <- spark_connect(master = "local")
sp_books <- sdf_copy_to(sc,ratingDF,"spbooks",overwrite = TRUE)
partitions <- sp_books %>% sdf_random_split(training = 0.8, test = 0.2)
sp_books_training <- partitions$training
sp_books_test <- partitions$test
head(sp_books_training)
sparkALS <- ml_als(sp_books_training, max_iter = 5, nonnegative = TRUE, rating_col = "rating", user_col = "user", item_col = "isbn")
# score the test set by invoking transform() on the underlying Spark model object
sparkPred <- sparkALS$.jobj %>%
  invoke("transform", spark_dataframe(sp_books_test)) %>%
  collect()
# drop rows where ALS could not predict (users or items unseen in training)
sparkPred <- sparkPred[!is.na(sparkPred$prediction), ]
predictions <- ml_predict(sparkALS, sp_books_test)
mseSpark <- mean((sparkPred$rating - sparkPred$prediction)^2)
rmseSpark <- sqrt(mseSpark)
maeSpark <- mean(abs(sparkPred$rating - sparkPred$prediction))
accuracy <- data.frame(RMSE = rmseSpark, MSE = mseSpark, MAE = maeSpark)
accuracy
predictions <- data.frame(predictions)
predictions$difference <- (predictions$rating - predictions$prediction)
predictions$difference_square <- (predictions$difference)^2
head(predictions)
sqrt(mean(predictions$difference_square,na.rm = TRUE))
## [1] 4.416426
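Equivalently, the RMSE can be computed with sparklyr's built-in regression evaluator (a minimal sketch, assuming the default "prediction" output column), after which the Spark connection should be closed:
# alternative RMSE via sparklyr's evaluator, restricted to non-missing predictions
pred_tbl <- ml_predict(sparkALS, sp_books_test) %>% filter(!is.na(prediction))
ml_regression_evaluator(pred_tbl, label_col = "rating",
                        prediction_col = "prediction", metric_name = "rmse")
spark_disconnect(sc) # release the local Spark session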
Comparing the approaches, we can see that UBCF with cosine distance is the better prediction model: it dominated both the ROC and precision-recall comparisons, while the Spark ALS model's hold-out RMSE of about 4.42 is high for a 1-10 rating scale.