DATA 612 Project 1 | Global Baseline Predictors and RMSE

Description

User-Book Rating Recommender System

This is a simple recommender system built from the ratings that 20 users gave to 5 books. It recommends books to a user based on other users' ratings.

Dataset

Step-1:

I made a toy dataset with 20 users as rows and 5 books as columns. Each rating is a whole number from 1 to 5, and NA marks a book the user has not rated.

Load Data

Step-2:

Load the necessary libraries

Step-3:

Load the User_Books dataset and create a user-item matrix

  book1 book2 book3 book4 book5
A 3 NA 4 4 3
B 4 5 3 NA 2
C NA 3 3 3 3
D 5 5 5 5 NA
E 2 3 3 NA 4
F 3 4 4 3 3
G 2 2 NA 2 2
H 3 NA 4 4 4
I 2 3 3 NA 4
J 3 NA 3 3 3
K 1 1 1 NA 1
L NA 3 4 3 4
M 3 NA 4 4 3
N 4 4 3 NA 2
O NA 3 4 4 4
P 5 5 NA 5 5
Q 2 NA 3 3 4
R 3 NA 3 3 3
S 2 2 2 NA 2
T 3 NA 4 4 4

Training and Test datasets

Step-4:

Break your ratings into separate training and test datasets.

Let's split the User_Books dataset into a training set and a test set. I selected 12 ratings to hold out for testing and replaced them with NA in the training set, so that they are omitted from the training calculations. The test set keeps only those 12 held-out ratings; every other cell is replaced with NA.
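The masking step described above can be sketched in a few lines. This is an illustrative Python version, not the R code behind this report: `split_ratings` and `held_out` are hypothetical names, `None` stands in for NA, and only rows A to E are used, with the held-out cells taken from the printed test table.

```python
# Illustrative sketch only; the report itself was produced in R.
# None plays the role of R's NA. Only rows A-E of the full matrix are used.
BOOKS = ["book1", "book2", "book3", "book4", "book5"]

ratings = {
    "A": [3, None, 4, 4, 3],
    "B": [4, 5, 3, None, 2],
    "C": [None, 3, 3, 3, 3],
    "D": [5, 5, 5, 5, None],
    "E": [2, 3, 3, None, 4],
}

# Held-out (user, book) cells, as they appear in the printed test table.
held_out = {("A", "book1"), ("C", "book4"), ("D", "book2"), ("E", "book3")}


def split_ratings(ratings, held_out, books=BOOKS):
    """Return (train, test): held-out cells go to test, all others stay in train."""
    train, test = {}, {}
    for user, row in ratings.items():
        train[user] = [None if (user, b) in held_out else r
                       for b, r in zip(books, row)]
        test[user] = [r if (user, b) in held_out else None
                      for b, r in zip(books, row)]
    return train, test


train, test = split_ratings(ratings, held_out)
print(train["A"])  # [None, None, 4, 4, 3]
print(test["D"])   # [None, 5, None, None, None]
```

Keeping both matrices the same shape means a held-out rating and its prediction can later be looked up by the same (user, book) position.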

Train Dataset

  book1 book2 book3 book4 book5
A    NA    NA     4     4     3
B     4     5     3    NA     2
C    NA     3     3    NA     3
D     5    NA     5     5    NA
E     2     3    NA    NA     4
F     3     4     4    NA     3
G     2     2    NA     2    NA
H     3    NA     4     4     4
I     2     3     3    NA     4
J     3    NA     3     3     3
K     1     1     1    NA     1
L    NA     3     4     3    NA
M     3    NA    NA     4     3
N     4    NA    NA    NA     2
O    NA     3     4     4     4
 [rows P through T not shown: output truncated at R's max.print limit]

Test Dataset

  book1 book2 book3 book4 book5
A     3    NA    NA    NA    NA
B    NA    NA    NA    NA    NA
C    NA    NA    NA     3    NA
D    NA     5    NA    NA    NA
E    NA    NA     3    NA    NA
F    NA    NA    NA     3    NA
G    NA    NA    NA    NA     2
H    NA    NA    NA    NA    NA
I    NA    NA    NA    NA    NA
J    NA    NA    NA    NA    NA
K    NA    NA    NA    NA    NA
L    NA    NA    NA    NA     4
M    NA    NA     4    NA    NA
N    NA     4     3    NA    NA
O    NA    NA    NA    NA    NA
 [rows P through T not shown: output truncated at R's max.print limit]

Calculations

Using training data, calculate the raw average (mean) rating for every user-item combination.

The raw average is the mean of all non-missing ratings in the user-item matrix.

Mean rating for each user in the User_Books train dataset

    A     B     C     D     E     F     G     H     I     J
3.667   3.5     3     5     3   3.5     2  3.75     3     3
    K     L     M     N     O     P     Q     R     S     T
    1 3.333 3.333     3  3.75     5     3     3     2 3.667

Mean rating for each book in the User_Books train dataset.

book1 book2 book3 book4 book5
2.938 3.091 3.429 3.636 3.176
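As a rough illustration of how these averages are formed, here is a Python sketch (not the R code used for this report) that computes the raw average, the per-user means and the per-book means while skipping missing values. It uses only training rows A to E, so the per-user means match the table above, but the raw average and the per-book means do not, since those depend on all 20 rows. The helper name `mean` is an assumption of the sketch.

```python
# Illustrative sketch only; None stands in for NA.
# Training rows A-E as printed in the Train Dataset table.
BOOKS = ["book1", "book2", "book3", "book4", "book5"]

train = {
    "A": [None, None, 4, 4, 3],
    "B": [4, 5, 3, None, 2],
    "C": [None, 3, 3, None, 3],
    "D": [5, None, 5, 5, None],
    "E": [2, 3, None, None, 4],
}


def mean(values):
    """Mean of the non-missing entries, or None if every entry is missing."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present) if present else None


# Raw average: one number over every observed rating.
raw_average = mean([r for row in train.values() for r in row])

# Row-wise and column-wise means.
user_means = {user: mean(row) for user, row in train.items()}
book_means = {book: mean([row[j] for row in train.values()])
              for j, book in enumerate(BOOKS)}

print(round(raw_average, 3))      # 3.625 for this five-row excerpt
print(round(user_means["A"], 3))  # 3.667, as in the table above
```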

Calculate the RMSE for raw average for both your training data and your test data.

Raw average rating for the test and train datasets, in that order

[1] 3.333333
[1] 3.231884

RMSE for Test and Train data sets

RMSE for train dataset

[1] 1.037624

RMSE for test dataset

[1] 0.8498366
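For reference, the RMSE of a constant prediction is the square root of the mean squared difference between each observed rating and that constant. The Python sketch below (illustrative only, with hypothetical helper names) applies this to rows A to E, so its output differs from the figures above. It scores both sets against the training raw average; the report prints a separate raw average for each dataset, so its test figure may have been computed against a different average.

```python
import math

# Illustrative sketch only; None stands in for NA. Rows A-E only.
train = {
    "A": [None, None, 4, 4, 3],
    "B": [4, 5, 3, None, 2],
    "C": [None, 3, 3, None, 3],
    "D": [5, None, 5, 5, None],
    "E": [2, 3, None, None, 4],
}
test = {
    "A": [3, None, None, None, None],
    "C": [None, None, None, 3, None],
    "D": [None, 5, None, None, None],
    "E": [None, None, 3, None, None],
}


def observed(matrix):
    """Flatten a user -> row mapping into a list of non-missing ratings."""
    return [r for row in matrix.values() for r in row if r is not None]


def rmse_constant(matrix, prediction):
    """RMSE when every observed rating is predicted by the same constant."""
    values = observed(matrix)
    return math.sqrt(sum((r - prediction) ** 2 for r in values) / len(values))


train_values = observed(train)
raw_average = sum(train_values) / len(train_values)  # 3.625 for this excerpt

print(round(rmse_constant(train, raw_average), 4))  # 0.9922
print(rmse_constant(test, raw_average))             # 0.875
```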

Using your training data, calculate the bias for each user and each item.

User Bias

User Bias
A 0.4348
B 0.2681
C -0.2319
D 1.768
E -0.2319
F 0.2681
G -1.232
H 0.5181
I -0.2319
J -0.2319
K -2.232
L 0.1014
M 0.1014
N -0.2319
O 0.5181
P 1.768
Q -0.2319
R -0.2319
S -1.232
T 0.4348

Book Bias

Book Bias
book1 -0.2944
book2 -0.141
book3 0.1967
book4 0.4045
book5 -0.05541
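Each bias is simply a mean minus the raw average of the training data. The Python sketch below (illustrative, with hypothetical names) reproduces a few of the values above from the printed means. It assumes 3.231884 is the training raw average, which is consistent with the reported biases (for user A, 3.667 - 3.232 is about 0.435). Because the printed means are rounded to three decimals, the results can differ from the tables in the last digit.

```python
# Illustrative sketch only, using means as printed in this report.
RAW_AVERAGE = 3.231884  # assumed to be the training raw average

user_means = {"A": 3.667, "B": 3.5, "C": 3.0, "D": 5.0, "K": 1.0}
book_means = {"book1": 2.938, "book2": 3.091, "book3": 3.429,
              "book4": 3.636, "book5": 3.176}


def biases(means, raw_average):
    """Bias of each user or book: its own mean minus the raw average."""
    return {name: m - raw_average for name, m in means.items()}


user_bias = biases(user_means, RAW_AVERAGE)
book_bias = biases(book_means, RAW_AVERAGE)

print(round(user_bias["D"], 3))      # 1.768
print(round(book_bias["book4"], 3))  # 0.404 (table shows 0.4045 from unrounded means)
```

A positive bias marks a generous rater or a well-liked book; a negative one marks a harsh rater or a less-liked book.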

From the raw average, and the appropriate user and item biases, calculate the baseline predictors for every user-item combination.

  book1 book2 book3 book4 book5
A 3.372 3.526 3.863 4.071 3.611
B 3.206 3.359 3.697 3.904 3.445
C 2.706 2.859 3.197 3.404 2.945
D 4.706 4.859 5.197 5.404 4.945
E 2.706 2.859 3.197 3.404 2.945
F 3.206 3.359 3.697 3.904 3.445
G 1.706 1.859 2.197 2.404 1.945
H 3.456 3.609 3.947 4.154 3.695
I 2.706 2.859 3.197 3.404 2.945
J 2.706 2.859 3.197 3.404 2.945
K 0.7056 0.859 1.197 1.404 0.9446
L 3.039 3.192 3.53 3.738 3.278
M 3.039 3.192 3.53 3.738 3.278
N 2.706 2.859 3.197 3.404 2.945
O 3.456 3.609 3.947 4.154 3.695
P 4.706 4.859 5.197 5.404 4.945
Q 2.706 2.859 3.197 3.404 2.945
R 2.706 2.859 3.197 3.404 2.945
S 1.706 1.859 2.197 2.404 1.945
T 3.372 3.526 3.863 4.071 3.611
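Each entry in this table is the raw average plus the user's bias plus the book's bias. A small Python sketch (illustrative only) rebuilds three of the cells from the printed biases. The `clip` option is an addition of the sketch, not something the report applies: the table keeps values such as 0.7056 and 5.404 that fall outside the 1 to 5 rating scale.

```python
# Illustrative sketch only, using values as printed in this report.
RAW_AVERAGE = 3.231884  # assumed to be the training raw average

user_bias = {"A": 0.4348, "D": 1.768, "K": -2.232}
book_bias = {"book1": -0.2944, "book4": 0.4045}


def baseline(user, book, clip=False):
    """Baseline predictor: raw average + user bias + book bias.

    clip=True (hypothetical option) limits the result to the 1-5 scale.
    """
    prediction = RAW_AVERAGE + user_bias[user] + book_bias[book]
    if clip:
        prediction = min(5.0, max(1.0, prediction))
    return prediction


print(round(baseline("A", "book1"), 3))   # 3.372
print(round(baseline("D", "book4"), 3))   # 5.404
print(baseline("K", "book1", clip=True))  # 1.0
```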

Calculate the RMSE for the baseline predictors for both your training data and your test data.

RMSE for test data

[1] 0.5250754

RMSE for train data

[1] 0.518253
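With baseline predictors, each rating is compared with its own predicted value rather than a single constant. The Python sketch below (illustrative, with a hypothetical `rmse` helper) does this for four held-out cells, pairing the ratings in the test table with the matching baseline predictions. It covers only 4 of the 12 test ratings, so it does not reproduce the test RMSE above.

```python
import math

# Illustrative sketch only: (actual rating, baseline prediction) pairs for
# four held-out cells, taken from the test and baseline tables in this report.
pairs = {
    ("A", "book1"): (3, 3.372),
    ("C", "book4"): (3, 3.404),
    ("D", "book2"): (5, 4.859),
    ("E", "book3"): (3, 3.197),
}


def rmse(pairs):
    """Root mean squared error over (actual, predicted) pairs."""
    squared = [(actual - predicted) ** 2 for actual, predicted in pairs.values()]
    return math.sqrt(sum(squared) / len(squared))


print(round(rmse(pairs), 3))  # 0.3 for these four cells only
```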

Summarizing results

Let's calculate the percentage improvement in RMSE from the raw average to the baseline predictors (raw average plus user and book biases), for both the training and test datasets.

The baseline predictors reduce the RMSE by about 50% on the training data (from 1.038 to 0.518), but by only about 38% on the test data (from 0.850 to 0.525). Both are improvements over the raw average, but the relative gain is larger on the training data, which is the data the biases were estimated from.

[1] 50.0539
[1] 38.21455
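These two percentages follow directly from the four RMSE values reported above. As a quick check, here is a Python sketch of the arithmetic (the function name `pct_improvement` is hypothetical):

```python
def pct_improvement(raw_rmse, baseline_rmse):
    """Percentage reduction in RMSE when moving from raw average to baseline."""
    return (1 - baseline_rmse / raw_rmse) * 100


# RMSE values as reported in this document.
train_gain = pct_improvement(1.037624, 0.518253)
test_gain = pct_improvement(0.8498366, 0.5250754)

print(round(train_gain, 2))  # 50.05
print(round(test_gain, 2))   # 38.21
```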

2019-06-09