In my final project, I am trying to build a recommender system for the good books. I only took the books csv file and ratings csv file for this project. There are 10000 books and 53424 users in the file.

The first obstacle I encountered is the file is really large. The rating csv file itself is about 69MB. This is the first time I dealt with a relatively big file. When I read csv file in R, it takes a very long time and also not loading. So, to make R running in a time fashion, I only loaded first 10,000 rows in “rating.csv”. In this case, it efficiently reduced the running time.

```
library(recommenderlab)
library(dplyr)
library(reshape2)
library(ggplot2)
books <- read.csv("/Users/xiaomengkong/Desktop/SPS/Data612/Final Project/books.csv")
ratings <- read.csv("/Users/xiaomengkong/Desktop/SPS/Data612/Final Project/ratings.csv",nrows = 10000)
books <- data.frame(books)
ratings <- data.frame(ratings)
```

Part 1. Preparing data for “recommenderlab” and “sparklyr”

For “sparklyr”, I combined the books.csv and ratings.csv into 1 data frame by use the same “book_id. The reason why I am choosing data frame because later I would like to use Sparklyr to run ALS algorithm to do the prediction. After I combined them, I chose the columns that I need for the rating matrix or data frame. So I chose the columns of”book_id“,”authors“,”title“,”user_id“, and”rating" for my new dataframe. The reason why I keep these columns because when I do the prediction by using Sparklyr, I would like to see which book the system recommend instead of just the book-id which I did not know which book the system recommend. I called the new data frame is “NewRating” which I will use it later.

```
books_new <- books %>% select(1,8,11)
NewRatings <- merge(books_new,ratings,by ="book_id")
head(NewRatings)
```

```
## book_id authors
## 1 2 J.K. Rowling, Mary GrandPré
## 2 2 J.K. Rowling, Mary GrandPré
## 3 2 J.K. Rowling, Mary GrandPré
## 4 2 J.K. Rowling, Mary GrandPré
## 5 2 J.K. Rowling, Mary GrandPré
## 6 2 J.K. Rowling, Mary GrandPré
## title user_id rating
## 1 Harry Potter and the Sorcerer's Stone (Harry Potter, #1) 94 2
## 2 Harry Potter and the Sorcerer's Stone (Harry Potter, #1) 83 4
## 3 Harry Potter and the Sorcerer's Stone (Harry Potter, #1) 135 2
## 4 Harry Potter and the Sorcerer's Stone (Harry Potter, #1) 61 4
## 5 Harry Potter and the Sorcerer's Stone (Harry Potter, #1) 447 4
## 6 Harry Potter and the Sorcerer's Stone (Harry Potter, #1) 177 3
```

Now I am gonna prepare the wide form for my ratings matrix for “recommenderlab”. In here, I just use the ratings.csv with first 10000 rows since in this case, I don’t really need to know the book name for the prediction. In order to use the package of “recommenderlab”, I have to convert the matrix into “realRatingMatrix”.

```
# Change ratings into wide form
head(ratings)
```

```
## user_id book_id rating
## 1 1 258 5
## 2 2 4081 4
## 3 2 260 5
## 4 2 9296 5
## 5 2 2318 3
## 6 2 26 4
```

```
ratings$book_id <- as.numeric(ratings$book_id)
ratings$user_id <- as.numeric(ratings$user_id)
ratings_wide <- reshape(ratings, idvar="user_id",timevar = "book_id", direction = "wide") %>% arrange(user_id)
colnames(ratings_wide)[colnames(ratings_wide)=="user_id"] <-"rating.user"
names(ratings_wide) <- substring(names(ratings_wide),8)
ratings_wide_2 <- ratings_wide[,-1]
rownames(ratings_wide_2) <- ratings_wide[,1]
head(ratings_wide_2[,1:5])
```

```
## 258 4081 260 9296 2318
## 1 5 NA NA NA NA
## 2 NA 4 5 5 3
## 4 NA NA NA NA NA
## 6 NA NA NA NA NA
## 8 NA NA NA NA NA
## 9 NA NA NA NA NA
```

```
ratings_matrix_wide <- as.matrix(ratings_wide_2)
head(ratings_matrix_wide[,1:5])
```

```
## 258 4081 260 9296 2318
## 1 5 NA NA NA NA
## 2 NA 4 5 5 3
## 4 NA NA NA NA NA
## 6 NA NA NA NA NA
## 8 NA NA NA NA NA
## 9 NA NA NA NA NA
```

```
ratings_matrix_wide[is.na(ratings_matrix_wide)] <- 0
ratings_matrix <- as(ratings_matrix_wide, "realRatingMatrix")
head(ratings_matrix)
```

`## 1 x 2510 rating matrix of class 'realRatingMatrix' with 2510 ratings.`

Part 2. Split the data into training and testing set. I set 70% of data values into training set, and the rest 30% into testing set. Here, I also set that any rating is above 3 which are good ratings, and below 3 is bad ratings.

```
percentage_training <- 0.7
items_to_keep <- 100
# above 3 is good ratings, below 3 is bad ratings.
rating_threshold <- 3
# times to run the evaluation
n_eval <- 1
evaluation_set <- evaluationScheme(data=ratings_matrix,method = "split", train = percentage_training, given = items_to_keep, goodRating = rating_threshold, k = n_eval)
evaluation_set
```

```
## Evaluation scheme with 100 items given
## Method: 'split' with 1 run(s).
## Training set proportion: 0.700
## Good ratings: >=3.000000
## Data set: 255 x 2510 rating matrix of class 'realRatingMatrix' with 640050 ratings.
```

`getData(evaluation_set,"train")`

`## 178 x 2510 rating matrix of class 'realRatingMatrix' with 446780 ratings.`

`nrow(getData(evaluation_set,"train")) / nrow(ratings_matrix)`

`## [1] 0.6980392`

```
# it is about 70% for our training set.
nrow(getData(evaluation_set,"known")) / nrow(ratings_matrix)
```

`## [1] 0.3019608`

```
# k-fold is the most accurate approach.
n_fold <- 5
evaluation_ratingSet <- evaluationScheme(data = ratings_matrix, method = "cross-validation", k = n_fold, given = items_to_keep,goodRating = rating_threshold)
evaluation_ratingSet
```

```
## Evaluation scheme with 100 items given
## Method: 'cross-validation' with 5 run(s).
## Good ratings: >=3.000000
## Data set: 255 x 2510 rating matrix of class 'realRatingMatrix' with 640050 ratings.
```

```
# Item-based-collaborative filtering, default parameter is "Null", "IBCF recommend new items and predict their ratings."
model_to_evaluate <- "IBCF"
model_parameters <- NULL
eval_recommender <- Recommender(data=getData(evaluation_set,"train"),method = model_to_evaluate, parameter= model_parameters)
items_to_recommend <- 5
eval_prediction <- predict(object = eval_recommender, newdata = getData(evaluation_set,"known"),n=items_to_recommend,type="ratings")
head(eval_prediction)
```

`## 1 x 2510 rating matrix of class 'realRatingMatrix' with 1656 ratings.`

```
qplot(rowCounts(eval_prediction)) +
geom_histogram(binwidth = 30) +
ggtitle("Distribution of books per user")
```

`## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.`

From the distribution, I find a decent amount of “0” for the distribution of books per user since there are a lot of NA. Regardless of this, it looks like bimodal for the distribution, one peak point is around 1250 and another peak point is around 1800.

```
eval_accuracy <- calcPredictionAccuracy( x = eval_prediction, data = getData(evaluation_set, "unknown"),byUser = TRUE)
head(eval_accuracy)
```

```
## RMSE MSE MAE
## 2 0.5240051 0.2745813 0.07305819
## 10 0.7083334 0.5017362 0.12094694
## 15 0.8407079 0.7067898 0.18525537
## 28 0.6966476 0.4853179 0.11321166
## 32 0.7039326 0.4955211 0.13475733
## 36 NaN NaN NaN
```

```
qplot(eval_accuracy[,"RMSE"])+
geom_histogram(binwidth = 0.2)+
ggtitle("Distribution of the RMSE by user")
```

`## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.`

```
## Warning: Removed 27 rows containing non-finite values (stat_bin).
## Warning: Removed 27 rows containing non-finite values (stat_bin).
```

The RMSE by user from the distribution is about 0.6 which is a relatively small RMSE. I’ll consider it a pretty good prediction.

`results <- evaluate(x = evaluation_set,method = model_to_evaluate, n = seq(10,50,10))`

```
## IBCF run fold/sample [model time/prediction time]
## 1 [36.732sec/0.132sec]
```

`head(getConfusionMatrix(results)[[1]])`

```
## TP FP FN TN precision recall TPR
## 10 0.7272727 5.766234 31.81818 2371.688 0.11200000 0.02236695 0.02236695
## 20 1.2337662 11.753247 31.31169 2365.701 0.09500000 0.03411665 0.03411665
## 30 1.5714286 17.909091 30.97403 2359.545 0.08066667 0.04322779 0.04322779
## 40 1.9610390 24.012987 30.58442 2353.442 0.07550000 0.05196597 0.05196597
## 50 2.3506494 30.116883 30.19481 2347.338 0.07240000 0.06086717 0.06086717
## FPR
## 10 0.002436857
## 20 0.004966320
## 30 0.007567455
## 40 0.010145905
## 50 0.012724186
```

`plot(results, annotate = TRUE, main = "ROC curve")`

`plot(results,"prec/rec", annotate = TRUE, main="Precision-recall")`