In my final project, I am building a recommender system for the goodbooks dataset. I only use the books.csv and ratings.csv files for this project. There are 10,000 books and 53,424 users in the data.

The first obstacle I encountered is that the files are really large. The ratings.csv file alone is about 69 MB, and this is the first time I have dealt with a relatively big file. Reading the full csv file into R takes a very long time and sometimes does not finish loading at all. So, to keep R running in a timely fashion, I only loaded the first 10,000 rows of ratings.csv, which greatly reduced the running time.
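As an aside, a faster reader would probably make the full file manageable; for example, data.table::fread (an alternative I did not actually use in this project) handles large csv files much faster than read.csv and also supports an nrows argument:

# Sketch only: a faster alternative to read.csv for the large ratings file
library(data.table)
ratings_full  <- fread("ratings.csv")                 # whole file
ratings_small <- fread("ratings.csv", nrows = 10000)  # or just the first 10,000 rows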

library(recommenderlab)
library(dplyr)
library(reshape2)
library(ggplot2)

# read the book metadata and only the first 10,000 rows of the ratings
books <- read.csv("/Users/xiaomengkong/Desktop/SPS/Data612/Final Project/books.csv")
ratings <- read.csv("/Users/xiaomengkong/Desktop/SPS/Data612/Final Project/ratings.csv", nrows = 10000)

# read.csv already returns data frames, so these calls just make that explicit
books <- data.frame(books)
ratings <- data.frame(ratings)

Part 1. Preparing data for “recommenderlab” and “sparklyr”

For “sparklyr”, I combined books.csv and ratings.csv into one data frame by joining on the shared “book_id” column. I chose a data frame because I later want to use sparklyr to run the ALS algorithm for prediction. After combining them, I kept only the columns I need for the rating matrix or data frame: “book_id”, “authors”, “title”, “user_id”, and “rating”. I keep the author and title columns because, when I make predictions with sparklyr, I want to see which book the system recommends rather than just a book_id that tells me nothing about the book. I call the new data frame “NewRatings” and will use it later.

# keep book_id, authors, and title (columns 1, 8, and 11 of books.csv)
books_new <- books %>% select(1, 8, 11)

# attach the book metadata to the ratings by book_id
NewRatings <- merge(books_new, ratings, by = "book_id")

head(NewRatings)
##   book_id                     authors
## 1       2 J.K. Rowling, Mary GrandPré
## 2       2 J.K. Rowling, Mary GrandPré
## 3       2 J.K. Rowling, Mary GrandPré
## 4       2 J.K. Rowling, Mary GrandPré
## 5       2 J.K. Rowling, Mary GrandPré
## 6       2 J.K. Rowling, Mary GrandPré
##                                                      title user_id rating
## 1 Harry Potter and the Sorcerer's Stone (Harry Potter, #1)      94      2
## 2 Harry Potter and the Sorcerer's Stone (Harry Potter, #1)      83      4
## 3 Harry Potter and the Sorcerer's Stone (Harry Potter, #1)     135      2
## 4 Harry Potter and the Sorcerer's Stone (Harry Potter, #1)      61      4
## 5 Harry Potter and the Sorcerer's Stone (Harry Potter, #1)     447      4
## 6 Harry Potter and the Sorcerer's Stone (Harry Potter, #1)     177      3
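
As a preview of that later sparklyr step, here is a rough sketch (assuming a local Spark connection and sparklyr's ml_als() with mostly default settings) of how NewRatings could be fed into ALS; the actual ALS run comes later in the project:

# Sketch only: pushing NewRatings into Spark and fitting ALS with sparklyr
library(sparklyr)
sc <- spark_connect(master = "local")
ratings_tbl <- copy_to(sc, NewRatings, "new_ratings", overwrite = TRUE)

als_model <- ml_als(ratings_tbl,
                    rating_col = "rating",
                    user_col   = "user_id",
                    item_col   = "book_id")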

Now I will prepare the wide form of my ratings matrix for “recommenderlab”. Here I just use the first 10,000 rows of ratings.csv since, in this case, I do not need the book titles for the prediction. In order to use the “recommenderlab” package, I have to convert the matrix into a “realRatingMatrix”.

# Change ratings into wide form

head(ratings)
##   user_id book_id rating
## 1       1     258      5
## 2       2    4081      4
## 3       2     260      5
## 4       2    9296      5
## 5       2    2318      3
## 6       2      26      4
ratings$book_id <- as.numeric(ratings$book_id)
ratings$user_id <- as.numeric(ratings$user_id)

# reshape from long form (user_id, book_id, rating) to wide form: one row per user, one column per book
ratings_wide <- reshape(ratings, idvar = "user_id", timevar = "book_id", direction = "wide") %>% arrange(user_id)

# rename user_id to "rating.user", then strip the 7-character "rating." prefix
# so the columns become "user" plus the bare book ids
colnames(ratings_wide)[colnames(ratings_wide) == "user_id"] <- "rating.user"
names(ratings_wide) <- substring(names(ratings_wide), 8)

# move the user ids into the row names and drop the user column
ratings_wide_2 <- ratings_wide[, -1]
rownames(ratings_wide_2) <- ratings_wide[, 1]
head(ratings_wide_2[,1:5])
##   258 4081 260 9296 2318
## 1   5   NA  NA   NA   NA
## 2  NA    4   5    5    3
## 4  NA   NA  NA   NA   NA
## 6  NA   NA  NA   NA   NA
## 8  NA   NA  NA   NA   NA
## 9  NA   NA  NA   NA   NA
ratings_matrix_wide <- as.matrix(ratings_wide_2)
head(ratings_matrix_wide[,1:5])
##   258 4081 260 9296 2318
## 1   5   NA  NA   NA   NA
## 2  NA    4   5    5    3
## 4  NA   NA  NA   NA   NA
## 6  NA   NA  NA   NA   NA
## 8  NA   NA  NA   NA   NA
## 9  NA   NA  NA   NA   NA
# replace the remaining NAs with 0 and coerce into recommenderlab's realRatingMatrix class
ratings_matrix_wide[is.na(ratings_matrix_wide)] <- 0
ratings_matrix <- as(ratings_matrix_wide, "realRatingMatrix")
head(ratings_matrix)
## 1 x 2510 rating matrix of class 'realRatingMatrix' with 2510 ratings.
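
As a side note, recommenderlab can also coerce a long-format data frame of user/item/rating triplets straight into a “realRatingMatrix”, which keeps unrated books as missing values rather than zeros. I did not take that route here, but a minimal sketch would be:

# Sketch only: build the realRatingMatrix directly from the long-format ratings
ratings_matrix_alt <- as(ratings[, c("user_id", "book_id", "rating")],
                         "realRatingMatrix")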

Part 2. Splitting the data into training and testing sets. I put 70% of the data into the training set and the remaining 30% into the testing set. Here I also define any rating of 3 or above as a good rating and anything below 3 as a bad rating.

percentage_training <- 0.7
items_to_keep <- 100
# above 3 is good ratings, below 3 is bad ratings.
rating_threshold <- 3
# times to run the evaluation
n_eval <- 1
evaluation_set <- evaluationScheme(data=ratings_matrix,method = "split", train = percentage_training, given = items_to_keep, goodRating = rating_threshold, k = n_eval)
evaluation_set
## Evaluation scheme with 100 items given
## Method: 'split' with 1 run(s).
## Training set proportion: 0.700
## Good ratings: >=3.000000
## Data set: 255 x 2510 rating matrix of class 'realRatingMatrix' with 640050 ratings.
getData(evaluation_set,"train")
## 178 x 2510 rating matrix of class 'realRatingMatrix' with 446780 ratings.
nrow(getData(evaluation_set,"train")) / nrow(ratings_matrix)
## [1] 0.6980392
# it is about 70% for our training set.
nrow(getData(evaluation_set,"known")) / nrow(ratings_matrix)
## [1] 0.3019608
# k-fold cross-validation is the most accurate evaluation approach, so I also set up a 5-fold scheme.
n_fold <- 5
evaluation_ratingSet <- evaluationScheme(data = ratings_matrix, method = "cross-validation", k = n_fold, given = items_to_keep,goodRating = rating_threshold)
evaluation_ratingSet
## Evaluation scheme with 100 items given
## Method: 'cross-validation' with 5 run(s).
## Good ratings: >=3.000000
## Data set: 255 x 2510 rating matrix of class 'realRatingMatrix' with 640050 ratings.
# Item-based collaborative filtering (IBCF); with the default parameter (NULL), IBCF recommends new items and predicts their ratings.

model_to_evaluate <- "IBCF"
model_parameters <- NULL
eval_recommender <- Recommender(data=getData(evaluation_set,"train"),method = model_to_evaluate, parameter= model_parameters)
items_to_recommend <- 5

eval_prediction <- predict(object = eval_recommender, newdata = getData(evaluation_set,"known"),n=items_to_recommend,type="ratings")
head(eval_prediction)
## 1 x 2510 rating matrix of class 'realRatingMatrix' with 1656 ratings.
qplot(rowCounts(eval_prediction)) +
  geom_histogram(binwidth = 30) +
  ggtitle("Distribution of books per user")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

From the distribution, I see a fair number of zeros in the books-per-user counts, since there are many NAs. Apart from that, the distribution looks bimodal, with one peak around 1,250 and another around 1,800.

eval_accuracy <- calcPredictionAccuracy( x = eval_prediction, data = getData(evaluation_set, "unknown"),byUser = TRUE)
head(eval_accuracy)
##         RMSE       MSE        MAE
## 2  0.5240051 0.2745813 0.07305819
## 10 0.7083334 0.5017362 0.12094694
## 15 0.8407079 0.7067898 0.18525537
## 28 0.6966476 0.4853179 0.11321166
## 32 0.7039326 0.4955211 0.13475733
## 36       NaN       NaN        NaN
qplot(eval_accuracy[,"RMSE"])+
  geom_histogram(binwidth = 0.2)+
  ggtitle("Distribution of the RMSE by user")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 27 rows containing non-finite values (stat_bin).

## Warning: Removed 27 rows containing non-finite values (stat_bin).

From the distribution, the RMSE by user is about 0.6, which is relatively small, so I consider this a pretty good prediction.
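
The table above reports accuracy per user; the same function can also return one aggregate figure across all test users by setting byUser = FALSE. A minimal sketch using the same objects:

# Aggregate RMSE/MSE/MAE over all test users instead of per user
calcPredictionAccuracy(x = eval_prediction,
                       data = getData(evaluation_set, "unknown"),
                       byUser = FALSE)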

results <- evaluate(x = evaluation_set,method = model_to_evaluate, n = seq(10,50,10))
## IBCF run fold/sample [model time/prediction time]
##   1  [36.732sec/0.132sec]
head(getConfusionMatrix(results)[[1]])
##           TP        FP       FN       TN  precision     recall        TPR
## 10 0.7272727  5.766234 31.81818 2371.688 0.11200000 0.02236695 0.02236695
## 20 1.2337662 11.753247 31.31169 2365.701 0.09500000 0.03411665 0.03411665
## 30 1.5714286 17.909091 30.97403 2359.545 0.08066667 0.04322779 0.04322779
## 40 1.9610390 24.012987 30.58442 2353.442 0.07550000 0.05196597 0.05196597
## 50 2.3506494 30.116883 30.19481 2347.338 0.07240000 0.06086717 0.06086717
##            FPR
## 10 0.002436857
## 20 0.004966320
## 30 0.007567455
## 40 0.010145905
## 50 0.012724186
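
As a quick sanity check on this table, precision and recall can be recomputed by hand from the raw counts: precision = TP / (TP + FP) and recall (which equals the TPR) = TP / (TP + FN). Using the n = 10 row:

# Check the n = 10 row of the confusion matrix by hand
tp <- 0.7272727; fp <- 5.766234; fn <- 31.81818
tp / (tp + fp)   # precision, about 0.112
tp / (tp + fn)   # recall / TPR, about 0.0224
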
plot(results, annotate = TRUE, main = "ROC curve")

plot(results,"prec/rec", annotate = TRUE, main="Precision-recall")