library(pander)
library(ggplot2)
library(knitr)
library(dplyr)
This is a basic recommender system built on a sample ratings grid in which a small set of books has been rated by individual users.
Load the sample data set of book ratings and convert it into a matrix.
# load csv into data variable
data <- read.csv("book_ratings.csv",row.names = 1)
# convert data into a matrix
data <- as.matrix(data)
pander(data)
| | book1 | book2 | book3 | book4 | book5 |
|---|---|---|---|---|---|
| A | 3 | NA | 4 | 4 | 3 |
| B | 4 | 5 | 3 | NA | 2 |
| C | NA | 3 | 3 | 3 | 3 |
| D | 5 | 5 | 5 | 5 | NA |
| E | 2 | 3 | 3 | NA | 4 |
| F | 3 | 4 | 4 | 3 | 3 |
| G | 2 | 2 | NA | 2 | 2 |
| H | 3 | NA | 4 | 4 | 4 |
| I | 2 | 3 | 3 | NA | 4 |
| J | 3 | NA | 3 | 3 | 3 |
| K | 1 | 1 | 1 | NA | 1 |
| L | NA | 3 | 4 | 3 | 4 |
| M | 3 | NA | 4 | 4 | 3 |
| N | 4 | 4 | 3 | NA | 2 |
| O | NA | 3 | 4 | 4 | 4 |
| P | 5 | 5 | NA | 5 | 5 |
| Q | 2 | NA | 3 | 3 | 4 |
| R | 3 | NA | 3 | 3 | 3 |
| S | 2 | 2 | 2 | NA | 2 |
| T | 3 | NA | 4 | 4 | 4 |
Here are the three books with the highest mean rating:
means <- colMeans(data, na.rm = TRUE)
cols <- colnames(data)[order(means, decreasing = TRUE)[1:3]]
top3 <- data.frame(books = cols,stringsAsFactors = FALSE)
pander(top3)
| books |
|---|
| book4 |
| book3 |
| book2 |
Let's split the data set into two parts: training and test. We selected 12 ratings to hold out for testing and replaced them with NA in the training set, so that they are omitted from the training calculations. In the test data set we kept only those 12 held-out values; all other cells were replaced with NA.
test_rows <- c(1,3,4,5,6,7,14,13,19,20,12,14)
test_cols <- c(1,4,2,3,4,5,2,3,3,4,5,3)
test_indices <- cbind(test_rows,test_cols)
data_train <- data
data_train[test_indices] <- NA
data_test <- data
# blank every cell, then restore only the held-out ratings
data_test[] <- NA
data_test[test_indices] <- data[test_indices]
data_train
## book1 book2 book3 book4 book5
## A NA NA 4 4 3
## B 4 5 3 NA 2
## C NA 3 3 NA 3
## D 5 NA 5 5 NA
## E 2 3 NA NA 4
## F 3 4 4 NA 3
## G 2 2 NA 2 NA
## H 3 NA 4 4 4
## I 2 3 3 NA 4
## J 3 NA 3 3 3
## K 1 1 1 NA 1
## L NA 3 4 3 NA
## M 3 NA NA 4 3
## N 4 NA NA NA 2
## O NA 3 4 4 4
## P 5 5 NA 5 5
## Q 2 NA 3 3 4
## R 3 NA 3 3 3
## S 2 2 NA NA 2
## T 3 NA 4 NA 4
data_test
## book1 book2 book3 book4 book5
## A 3 NA NA NA NA
## B NA NA NA NA NA
## C NA NA NA 3 NA
## D NA 5 NA NA NA
## E NA NA 3 NA NA
## F NA NA NA 3 NA
## G NA NA NA NA 2
## H NA NA NA NA NA
## I NA NA NA NA NA
## J NA NA NA NA NA
## K NA NA NA NA NA
## L NA NA NA NA 4
## M NA NA 4 NA NA
## N NA 4 3 NA NA
## O NA NA NA NA NA
## P NA NA NA NA NA
## Q NA NA NA NA NA
## R NA NA NA NA NA
## S NA NA 2 NA NA
## T NA NA NA 4 NA
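The test cells above were chosen by hand. As an alternative, a reproducible random hold-out could be drawn along the lines sketched below. `split_ratings`, its arguments, and the toy matrix are hypothetical and are not part of the original analysis.

```r
# Hypothetical helper: hold out `n_test` observed ratings at random.
# Note: set.seed() here resets the global random number generator.
split_ratings <- function(ratings, n_test, seed = 42) {
  set.seed(seed)
  observed <- which(!is.na(ratings))          # linear indices of rated cells
  stopifnot(n_test < length(observed))
  held_out <- observed[sample.int(length(observed), n_test)]

  train <- ratings
  train[held_out] <- NA                       # hide the held-out ratings

  test <- ratings
  test[] <- NA
  test[held_out] <- ratings[held_out]         # keep only the held-out ratings

  list(train = train, test = test)
}

# Toy 3 x 3 grid, for illustration only
toy <- matrix(c( 3, NA, 4,
                 4,  5, 3,
                NA,  3, 3),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("A", "B", "C"), paste0("book", 1:3)))
parts <- split_ratings(toy, n_test = 2)

# Every observed rating should land in exactly one of the two sets
sum(!is.na(parts$train)) + sum(!is.na(parts$test)) == sum(!is.na(toy))
```

Fixing the seed keeps the split repeatable, which matters when the RMSE values from different runs are compared.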
Find the mean, bias and RMSE for both data sets. The two tables below show the mean rating per user and per book in the training set.
| A | B | C | D | E | F | G | H | I | J | K | L | M | N | O | P | Q | R | S | T |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3.667 | 3.5 | 3 | 5 | 3 | 3.5 | 2 | 3.75 | 3 | 3 | 1 | 3.333 | 3.333 | 3 | 3.75 | 5 | 3 | 3 | 2 | 3.667 |
| book1 | book2 | book3 | book4 | book5 |
|---|---|---|---|---|
| 2.938 | 3.091 | 3.429 | 3.636 | 3.176 |
Raw average rating over every observed user-item combination, for the test and training data sets respectively:
## [1] 3.333333
## [1] 3.231884
RMSE of the raw average for the training data set:
## [1] 1.037624
RMSE of the raw average for the test data set:
## [1] 0.8498366
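The code that produced the means and RMSE values above is not shown. The following is a minimal sketch that reproduces the reported numbers from the ratings grid and test cells listed earlier. The names `user_means`, `book_means`, `raw_train`, `rmse_train` and `rmse_test` match the variables used later in this document; `raw_test` is an assumed name. The reported test RMSE is reproduced only when the test ratings are compared with the test set's own raw average, so the sketch does the same.

```r
# Ratings grid copied from the table above (rows = users, columns = books)
data <- matrix(c(
   3, NA,  4,  4,  3,   # A
   4,  5,  3, NA,  2,   # B
  NA,  3,  3,  3,  3,   # C
   5,  5,  5,  5, NA,   # D
   2,  3,  3, NA,  4,   # E
   3,  4,  4,  3,  3,   # F
   2,  2, NA,  2,  2,   # G
   3, NA,  4,  4,  4,   # H
   2,  3,  3, NA,  4,   # I
   3, NA,  3,  3,  3,   # J
   1,  1,  1, NA,  1,   # K
  NA,  3,  4,  3,  4,   # L
   3, NA,  4,  4,  3,   # M
   4,  4,  3, NA,  2,   # N
  NA,  3,  4,  4,  4,   # O
   5,  5, NA,  5,  5,   # P
   2, NA,  3,  3,  4,   # Q
   3, NA,  3,  3,  3,   # R
   2,  2,  2, NA,  2,   # S
   3, NA,  4,  4,  4    # T
), ncol = 5, byrow = TRUE,
dimnames = list(LETTERS[1:20], paste0("book", 1:5)))

# Same hand-picked test cells as above: (row, column) pairs
test_indices <- cbind(c(1, 3, 4, 5, 6, 7, 14, 13, 19, 20, 12, 14),
                      c(1, 4, 2, 3, 4, 5,  2,  3,  3,  4,  5,  3))
data_train <- data
data_train[test_indices] <- NA
data_test <- data
data_test[] <- NA
data_test[test_indices] <- data[test_indices]

# Per-user and per-book means of the training ratings
user_means <- rowMeans(data_train, na.rm = TRUE)
book_means <- colMeans(data_train, na.rm = TRUE)

# Raw averages over all observed ratings in each set
raw_train <- mean(data_train, na.rm = TRUE)
raw_test  <- mean(data_test, na.rm = TRUE)    # assumed name

# RMSE when every rating is predicted by its own set's raw average
rmse_train <- sqrt(mean((data_train - raw_train)^2, na.rm = TRUE))
rmse_test  <- sqrt(mean((data_test - raw_test)^2, na.rm = TRUE))
```

Scoring the test ratings against `raw_train` instead would be the stricter choice, since the test average is not known at prediction time; that variant gives a slightly different test RMSE from the one reported here.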
User Bias
## user bias
user_bias <- user_means - raw_train
user_bias_df <- data.frame(as.list(user_bias))
user_bias_df <- tidyr::gather(user_bias_df,"user")
colnames(user_bias_df) <-c("User","Bias")
pander(user_bias_df)
| User | Bias |
|---|---|
| A | 0.4348 |
| B | 0.2681 |
| C | -0.2319 |
| D | 1.768 |
| E | -0.2319 |
| F | 0.2681 |
| G | -1.232 |
| H | 0.5181 |
| I | -0.2319 |
| J | -0.2319 |
| K | -2.232 |
| L | 0.1014 |
| M | 0.1014 |
| N | -0.2319 |
| O | 0.5181 |
| P | 1.768 |
| Q | -0.2319 |
| R | -0.2319 |
| S | -1.232 |
| T | 0.4348 |
Book Bias
#book bias
book_bias <- book_means - raw_train
book_bias_df <- data.frame(as.list(book_bias))
book_bias_df <- tidyr::gather(book_bias_df,"book")
colnames(book_bias_df) <-c("Book","Bias")
pander(book_bias_df)
| Book | Bias |
|---|---|
| book1 | -0.2944 |
| book2 | -0.141 |
| book3 | 0.1967 |
| book4 | 0.4045 |
| book5 | -0.05541 |
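In symbols, the two bias tables and the baseline prediction computed next can be written as follows. The notation is introduced here for illustration: $\mu$ is the raw average of the training ratings, $\bar{r}_u$ is the mean training rating of user $u$, and $\bar{r}_i$ is the mean training rating of book $i$.

$$
b_u = \bar{r}_u - \mu, \qquad
b_i = \bar{r}_i - \mu, \qquad
\hat{r}_{ui} = \mu + b_u + b_i
$$

For user A and book1, using the values above: $\hat{r}_{A,\text{book1}} = 3.2319 + 0.4348 + (-0.2944) \approx 3.372$.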
# baseline prediction = raw average + user bias + book bias
calBaseLine <- function(in_matrix, book_bias_in, user_bias_in, raw_average)
{
  # outer() builds the users x books grid of (user bias + book bias) sums
  bias_grid <- outer(unlist(user_bias_in), unlist(book_bias_in), "+")
  out_matrix <- raw_average[1] + bias_grid
  # keep the row and column labels of the input matrix
  dimnames(out_matrix) <- dimnames(in_matrix)
  return(out_matrix)
}
base_pred <- calBaseLine(data_train,book_bias,user_bias,raw_train)
pander(base_pred)
| | book1 | book2 | book3 | book4 | book5 |
|---|---|---|---|---|---|
| A | 3.372 | 3.526 | 3.863 | 4.071 | 3.611 |
| B | 3.206 | 3.359 | 3.697 | 3.904 | 3.445 |
| C | 2.706 | 2.859 | 3.197 | 3.404 | 2.945 |
| D | 4.706 | 4.859 | 5.197 | 5.404 | 4.945 |
| E | 2.706 | 2.859 | 3.197 | 3.404 | 2.945 |
| F | 3.206 | 3.359 | 3.697 | 3.904 | 3.445 |
| G | 1.706 | 1.859 | 2.197 | 2.404 | 1.945 |
| H | 3.456 | 3.609 | 3.947 | 4.154 | 3.695 |
| I | 2.706 | 2.859 | 3.197 | 3.404 | 2.945 |
| J | 2.706 | 2.859 | 3.197 | 3.404 | 2.945 |
| K | 0.7056 | 0.859 | 1.197 | 1.404 | 0.9446 |
| L | 3.039 | 3.192 | 3.53 | 3.738 | 3.278 |
| M | 3.039 | 3.192 | 3.53 | 3.738 | 3.278 |
| N | 2.706 | 2.859 | 3.197 | 3.404 | 2.945 |
| O | 3.456 | 3.609 | 3.947 | 4.154 | 3.695 |
| P | 4.706 | 4.859 | 5.197 | 5.404 | 4.945 |
| Q | 2.706 | 2.859 | 3.197 | 3.404 | 2.945 |
| R | 2.706 | 2.859 | 3.197 | 3.404 | 2.945 |
| S | 1.706 | 1.859 | 2.197 | 2.404 | 1.945 |
| T | 3.372 | 3.526 | 3.863 | 4.071 | 3.611 |
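Several baseline predictions fall outside the range of the observed ratings, for example 5.404 for user D on book4 and 0.7056 for user K on book1. One common remedy, which is not applied in the results reported here, is to clip predictions to the valid range. The sketch below assumes the rating scale runs from 1 to 5, which is the range seen in the grid; `clip_ratings` is a hypothetical helper.

```r
# Hypothetical helper: force predictions into the assumed 1-5 rating scale.
clip_ratings <- function(pred, lo = 1, hi = 5) {
  clipped <- pred
  clipped[] <- pmin(pmax(pred, lo), hi)   # `[]` keeps dim and dimnames
  clipped
}

# Three predictions taken from the table above
pred <- matrix(c(5.404, 0.7056, 3.372), nrow = 1,
               dimnames = list("example", c("D_book4", "K_book1", "A_book1")))
clip_ratings(pred)
```

Clipping can only move a prediction closer to a true rating inside the scale, so it would not increase the RMSE, but the figures reported below are for the unclipped predictions.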
## test data
# finding Error
data_err <- data_test - base_pred
# squaring error
data_err <- (data_err)^2
#finding average
data_rmse_test<- mean(data_err[test_indices])
# square root
data_rmse_test<- sqrt(data_rmse_test)
## training data
# finding Error
data_err_train <- data_train - base_pred
# squaring error
data_err_train <- (data_err_train)^2
#finding average
data_rmse_train <- mean(data_err_train,na.rm = TRUE)
# square root
data_rmse_train<- sqrt(data_rmse_train)
data_rmse_test
## [1] 0.5250754
data_rmse_train
## [1] 0.518253
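The four steps used above (error, square, average, square root) can be folded into one small function. This is a sketch, and `rmse` is a hypothetical helper name. `na.rm = TRUE` skips cells without a rating, so the same function serves both data sets.

```r
# Hypothetical helper: RMSE over the cells where both inputs have a value.
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2, na.rm = TRUE))
}

# Toy example: the NA cell is ignored, leaving errors of 1, 1 and 0
actual    <- matrix(c(3, NA, 5, 2), nrow = 2)
predicted <- matrix(c(2,  4, 4, 2), nrow = 2)
rmse(actual, predicted)
```

With the objects defined earlier, `rmse(data_test, base_pred)` and `rmse(data_train, base_pred)` should give the same values as `data_rmse_test` and `data_rmse_train`, because every cell outside the test indices is NA in `data_test`.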
Let's calculate the percentage improvement of the baseline predictor (raw average plus biases) over the raw average alone, using the RMSE values for both the test and training data sets.
The results show a 50% improvement in prediction for the training data set, but only a 38% improvement for the test data set. Both are positive; however, the training data set shows the larger gain, which is expected because the biases were estimated from the training ratings.
# Train data set
R1 <- rmse_train
Rb1 <- data_rmse_train
Per_Improv_Train <- (1-(Rb1/R1))*100
Per_Improv_Train
## [1] 50.0539
# Test data set
R2 <- rmse_test
Rb2 <- data_rmse_test
Per_Improv_Test <- (1-(Rb2/R2))*100
Per_Improv_Test
## [1] 38.21455
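Both calculations apply the same formula, so they could share one function. A minimal sketch, where `pct_improvement` is a hypothetical helper name and the inputs are the RMSE values reported above:

```r
# Hypothetical helper: percentage reduction in RMSE relative to a reference.
pct_improvement <- function(rmse_reference, rmse_model) {
  (1 - rmse_model / rmse_reference) * 100
}

# RMSE values reported above: raw average first, baseline predictor second
train_gain <- pct_improvement(1.037624, 0.518253)     # training set
test_gain  <- pct_improvement(0.8498366, 0.5250754)   # test set
round(c(train = train_gain, test = test_gain), 2)
```

A positive value means the model's RMSE is lower than the reference; zero means no change.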