This recommender system is based on randomly generated ratings. The technique used is a baseline predictor.
The data set contains 10 users with ratings across 20 items, but could easily be expanded to any size. The example is adapted from:
recommenderlab: A Framework for Developing and Testing Recommendation Algorithms, at http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.323.9961
First, I will generate a random 10x20 matrix in order to manually calculate averages, RMSE, biases and the baseline predictor.
# seeding
set.seed(1234)
# random matrix
m<-matrix(sample(c(as.numeric(1:5), NA), 200, replace=TRUE, prob=c(rep(.4/2,5),.6)),
          ncol=20,
          dimnames=list(user=paste("u", 1:10, sep=''), item=paste("i", 1:20, sep='')))
m
## item
## user i1 i2 i3 i4 i5 i6 i7 i8 i9 i10 i11 i12 i13 i14 i15 i16 i17 i18 i19 i20
## u1 NA 5 NA 3 4 NA 2 NA 1 NA NA 5 5 1 4 3 3 2 3 5
## u2 4 4 NA NA 5 NA NA 1 3 1 4 NA 1 5 1 NA 4 4 NA 1
## u3 4 NA NA NA NA 5 NA NA NA NA NA 1 1 NA NA NA 3 5 4 NA
## u4 4 1 NA 4 4 4 NA 2 4 NA NA NA 1 3 3 2 NA 3 NA 1
## u5 2 NA NA NA NA NA NA NA NA NA NA NA 3 1 NA NA NA NA 1 2
## u6 5 2 2 2 4 4 5 4 1 4 NA 1 NA 3 5 1 5 2 NA 2
## u7 NA NA 4 NA 5 3 NA 3 3 NA NA 1 NA 1 5 4 3 3 2 2
## u8 NA NA 1 NA 3 2 4 NA NA NA NA NA 4 4 4 1 NA 4 5 5
## u9 5 NA 2 1 NA NA NA NA NA NA 3 NA 3 5 1 3 2 NA NA 1
## u10 4 NA NA 2 2 2 4 5 1 5 NA 2 NA 2 4 4 NA NA 5 5
## [1] "realRatingMatrix"
## attr(,"package")
## [1] "recommenderlab"
This is the dimension of the user-item matrix.
## 10 x 20 rating matrix of class 'realRatingMatrix' with 117 ratings.
Exploring the dataset, plotting ratings distribution.
This is a rather balanced rating distribution.
Tabulating the ratings vector to count the missing ratings.
## vec_ratings
## 0 1 2 3 4 5
## 83 25 21 20 29 22
As per the package documentation, a 0 rating represents a missing value.
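For a plain base-R matrix, where missing ratings are stored as NA rather than 0, the same check can be sketched without the package. This is only an illustration: the matrix `toy` below is hypothetical and is not the data set above.

```r
# Hypothetical 3x4 ratings matrix (not the data set above); NA marks a missing rating
toy <- matrix(c( 5, NA,  3,  1,
                NA, NA,  4,  2,
                 2,  5, NA,  1),
              nrow = 3, byrow = TRUE)

# Frequency of each rating value, with NA counted as its own category
rating_counts <- table(as.vector(toy), useNA = "ifany")
rating_counts

# Number and share of missing ratings
n_missing <- sum(is.na(toy))
share_missing <- mean(is.na(toy))
```

Here `n_missing` is 4 out of 12 cells, so `share_missing` is one third.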
Building a heatmap.
Most users have rated many items; the only exception is perhaps user 5, with only 5 ratings.
The data set is split into training and testing subsets with a 90/10 ratio.
## Evaluation scheme with 5 items given
## Method: 'split' with 1 run(s).
## Training set proportion: 0.900
## Good ratings: >=3.000000
## Data set: 10 x 20 rating matrix of class 'realRatingMatrix' with 117 ratings.
Training set
## item
## user i1 i2 i3 i4 i5 i6 i7 i8 i9 i10 i11 i12 i13 i14 i15 i16 i17 i18 i19 i20
## u1 NA 5 NA 3 4 NA NA NA NA NA NA NA 5 NA 4 NA NA NA NA NA
## u2 4 NA NA NA NA NA NA NA 3 1 NA NA NA 5 1 NA NA NA NA NA
## u3 4 NA NA NA NA NA NA NA NA NA NA 1 1 NA NA NA 3 5 NA NA
## u4 4 NA NA 4 NA 4 NA 2 NA NA NA NA NA NA NA 2 NA NA NA NA
## u5 2 NA NA NA NA NA NA NA NA NA NA NA 3 1 NA NA NA NA 1 2
## u6 NA NA NA NA NA 4 5 NA NA NA NA 1 NA NA NA NA 5 2 NA NA
## u7 NA NA 4 NA NA NA NA 3 NA NA NA NA NA 1 5 NA NA 3 NA NA
## u8 NA NA NA NA 3 2 NA NA NA NA NA NA NA NA NA 1 NA NA 5 5
## u9 5 NA 2 NA NA NA NA NA NA NA NA NA NA 5 NA 3 NA NA NA 1
## u10 NA NA NA NA NA 2 4 NA NA NA NA NA NA 2 NA 4 NA NA 5 NA
Confirmation of percentage of users in the training set
## [1] 0.9
Testing set
## item
## user i1 i2 i3 i4 i5 i6 i7 i8 i9 i10 i11 i12 i13 i14 i15 i16 i17 i18 i19 i20
## u1 NA NA NA NA NA NA 2 NA 1 NA NA 5 NA 1 NA 3 3 2 3 5
## u2 NA 4 NA NA 5 NA NA 1 NA NA 4 NA 1 NA NA NA 4 4 NA 1
## u3 NA NA NA NA NA 5 NA NA NA NA NA NA NA NA NA NA NA NA 4 NA
## u4 NA 1 NA NA 4 NA NA NA 4 NA NA NA 1 3 3 NA NA 3 NA 1
## u5 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## u6 5 2 2 2 4 NA NA 4 1 4 NA NA NA 3 5 1 NA NA NA 2
## u7 NA NA NA NA 5 3 NA NA 3 NA NA 1 NA NA NA 4 3 NA 2 2
## u8 NA NA 1 NA NA NA 4 NA NA NA NA NA 4 4 4 NA NA 4 NA NA
## u9 NA NA NA 1 NA NA NA NA NA NA 3 NA 3 NA 1 NA 2 NA NA NA
## u10 4 NA NA 2 2 NA NA 5 1 5 NA 2 NA NA 4 NA NA NA NA 5
Confirmation of percentage of users in the testing set
## [1] 0.1
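Judging from the two matrices above, each user keeps five ratings in the training matrix and the remaining ratings appear in the testing matrix. A split of that shape can be sketched in base R as follows. This is an assumption about the structure of the split, not the package's own code; the helper `split_given` and the matrix `toy` are hypothetical.

```r
# Hypothetical helper: keep `given` observed ratings per user, withhold the rest
split_given <- function(ratings, given) {
  known <- unknown <- ratings
  for (u in seq_len(nrow(ratings))) {
    rated <- which(!is.na(ratings[u, ]))
    if (length(rated) == 0) next
    # never keep more ratings than the user actually has
    keep <- rated[sample.int(length(rated), min(given, length(rated)))]
    known[u, setdiff(seq_len(ncol(ratings)), keep)] <- NA
    unknown[u, keep] <- NA
  }
  list(known = known, unknown = unknown)
}

# Hypothetical 3x5 ratings matrix; NA marks a missing rating
toy <- matrix(c( 5, NA,  3,  1,  2,
                NA,  4, NA,  2,  5,
                 1,  2,  3,  4,  5),
              nrow = 3, byrow = TRUE)
set.seed(1)
parts <- split_given(toy, given = 2)
rowSums(!is.na(parts$known))   # two kept ratings per user
```

Every observed rating ends up in exactly one of `parts$known` and `parts$unknown`, so the two matrices are complementary.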
Raw average is the average of the entire dataset. We will calculate a raw average for the training and the test sets.
# Raw Averages
#Training set
train_set_vec<-as.vector(ev_train)
raw_avg_train_total<-mean(train_set_vec,na.rm=TRUE)
n_train<-length(train_set_vec)-sum(is.na(train_set_vec))
raw_avg_train_total
## [1] 3.12
#Test set
test_set_vec<-as.vector(ev_test)
raw_avg_test_total<-mean(test_set_vec,na.rm=TRUE)
n_test<-length(test_set_vec)-sum(is.na(test_set_vec))
raw_avg_test_total
## [1] 2.940299
RMSE stands for Root Mean Squared Error. It is the square root of the mean squared difference between the actual and predicted ratings; when the prediction is the raw average, it equals the (population) standard deviation of the ratings.
Calculating RMSE for training and test sets
#RMSE Training set
rmse_train<-sqrt(sum((train_set_vec-raw_avg_train_total)^2,na.rm = TRUE)/n_train)
rmse_train
## [1] 1.464787
#RMSE Test set
rmse_test<-sqrt(sum((test_set_vec-raw_avg_test_total)^2,na.rm = TRUE)/n_test)
rmse_test
## [1] 1.391666
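The calculation above can be wrapped in a small helper. This is a minimal sketch, not part of the original analysis: the function name `rmse` and the toy vectors are hypothetical, and the helper averages only over positions where both the actual and the predicted rating are present.

```r
# Hypothetical helper: RMSE over the positions where both values are present
rmse <- function(actual, predicted) {
  err <- actual - predicted
  sqrt(mean(err^2, na.rm = TRUE))
}

# Toy ratings with one missing value
actual <- c(1, 5, NA, 1, 5)
# Predict the mean of the observed ratings (3) for every position
predicted <- rep(mean(actual, na.rm = TRUE), length(actual))
result <- rmse(actual, predicted)
result
```

Each observed rating is 2 away from the mean of 3, so `result` is 2.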
Bias is defined as the difference between the average of each user / item and the raw average. This is only calculated for the training set.
#Bias Training set
bias_user_train<-rowMeans(ev_train,na.rm=TRUE)-raw_avg_train_total
bias_user_train
## u1 u2 u3 u4 u5 u6 u7 u8 u9 u10
## 1.08 -0.32 -0.32 0.08 -1.32 0.28 0.08 0.08 0.08 0.28
bias_item_train<-colMeans(ev_train,na.rm=TRUE)-raw_avg_train_total
bias_item_train
## i1 i2 i3 i4 i5 i6 i7
## 0.6800000 1.8800000 -0.1200000 0.3800000 0.3800000 -0.1200000 1.3800000
## i8 i9 i10 i11 i12 i13 i14
## -0.6200000 -0.1200000 -2.1200000 NaN -2.1200000 -0.1200000 -0.3200000
## i15 i16 i17 i18 i19 i20
## 0.2133333 -0.6200000 0.8800000 0.2133333 0.5466667 -0.4533333
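The NaN for i11 arises because no training rating exists for that item, so its mean is undefined. The toy example below reproduces both kinds of bias and this edge case. It is a sketch on hypothetical data: the matrix `ratings` and the names `raw_avg`, `user_bias` and `item_bias` are not taken from the analysis above.

```r
# Hypothetical 3x4 ratings matrix; item i4 has no ratings at all
ratings <- matrix(c( 5,  3, NA, NA,
                     4, NA,  2, NA,
                    NA,  1,  3, NA),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(user = paste0("u", 1:3),
                                  item = paste0("i", 1:4)))

# Raw average over all observed ratings
raw_avg <- mean(ratings, na.rm = TRUE)

# User bias: each user's mean rating minus the raw average
user_bias <- rowMeans(ratings, na.rm = TRUE) - raw_avg

# Item bias: each item's mean rating minus the raw average (NaN for i4)
item_bias <- colMeans(ratings, na.rm = TRUE) - raw_avg
item_bias
```

With a raw average of 3, the user biases are 1, 0 and -1, and the item biases are 1.5, -1, -0.5 and NaN.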
The baseline predictor, which is the predictor used in this exercise, is the sum of the raw average, the user bias and the item bias for each user-item combination, clipped to the 1-5 rating scale.
# Creating the prediction matrix, with a floor at 1 and a cap at 5
pred<-matrix(NA_real_, nrow=length(bias_user_train), ncol=length(bias_item_train))
for(i in seq_along(bias_user_train)){
  for(j in seq_along(bias_item_train)){
    # raw average + user bias + item bias
    a<-raw_avg_train_total+bias_user_train[i]+bias_item_train[j]
    # clip to the 1-5 rating scale
    a<-ifelse(a<1,1,ifelse(a>5,5,a))
    pred[i,j] <- a
  }
}
pred
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
## [1,] 4.88 5.00 4.08 4.58 4.58 4.08 5.00 3.58 4.08 2.08 NA 2.08 4.08
## [2,] 3.48 4.68 2.68 3.18 3.18 2.68 4.18 2.18 2.68 1.00 NA 1.00 2.68
## [3,] 3.48 4.68 2.68 3.18 3.18 2.68 4.18 2.18 2.68 1.00 NA 1.00 2.68
## [4,] 3.88 5.00 3.08 3.58 3.58 3.08 4.58 2.58 3.08 1.08 NA 1.08 3.08
## [5,] 2.48 3.68 1.68 2.18 2.18 1.68 3.18 1.18 1.68 1.00 NA 1.00 1.68
## [6,] 4.08 5.00 3.28 3.78 3.78 3.28 4.78 2.78 3.28 1.28 NA 1.28 3.28
## [7,] 3.88 5.00 3.08 3.58 3.58 3.08 4.58 2.58 3.08 1.08 NA 1.08 3.08
## [8,] 3.88 5.00 3.08 3.58 3.58 3.08 4.58 2.58 3.08 1.08 NA 1.08 3.08
## [9,] 3.88 5.00 3.08 3.58 3.58 3.08 4.58 2.58 3.08 1.08 NA 1.08 3.08
## [10,] 4.08 5.00 3.28 3.78 3.78 3.28 4.78 2.78 3.28 1.28 NA 1.28 3.28
## [,14] [,15] [,16] [,17] [,18] [,19] [,20]
## [1,] 3.88 4.413333 3.58 5.00 4.413333 4.746667 3.746667
## [2,] 2.48 3.013333 2.18 3.68 3.013333 3.346667 2.346667
## [3,] 2.48 3.013333 2.18 3.68 3.013333 3.346667 2.346667
## [4,] 2.88 3.413333 2.58 4.08 3.413333 3.746667 2.746667
## [5,] 1.48 2.013333 1.18 2.68 2.013333 2.346667 1.346667
## [6,] 3.08 3.613333 2.78 4.28 3.613333 3.946667 2.946667
## [7,] 2.88 3.413333 2.58 4.08 3.413333 3.746667 2.746667
## [8,] 2.88 3.413333 2.58 4.08 3.413333 3.746667 2.746667
## [9,] 2.88 3.413333 2.58 4.08 3.413333 3.746667 2.746667
## [10,] 3.08 3.613333 2.78 4.28 3.613333 3.946667 2.946667
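The same predictor can be built without explicit loops. The sketch below uses `outer()` to add every user bias to every item bias and `pmin()`/`pmax()` to clip the result. It runs on a hypothetical toy matrix, and all names in it are illustrative rather than taken from the analysis above.

```r
# Hypothetical 3x3 ratings matrix; NA marks a missing rating
ratings <- matrix(c( 5,  3, NA,
                     4, NA,  2,
                    NA,  1,  3),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(user = paste0("u", 1:3),
                                  item = paste0("i", 1:3)))

raw_avg   <- mean(ratings, na.rm = TRUE)               # 3
user_bias <- rowMeans(ratings, na.rm = TRUE) - raw_avg # 1, 0, -1
item_bias <- colMeans(ratings, na.rm = TRUE) - raw_avg # 1.5, -1, -0.5

# Raw average plus user bias plus item bias for every user-item pair
pred <- raw_avg + outer(user_bias, item_bias, "+")

# Clip to the 1-5 rating scale
pred <- pmin(pmax(pred, 1), 5)
pred
```

For user u1 and item i1 the unclipped value is 3 + 1 + 1.5 = 5.5, which the cap reduces to 5.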
With the baseline predictor, we calculate RMSE on both testing and training set.
#RMSE Training set
rmse_train_pred<-sqrt(sum((ev_train-pred)^2,na.rm = TRUE)/n_train)
rmse_train_pred
## [1] 1.149743
## [1] 1.704249
The RMSE improvement from the baseline predictor is measured against the raw-average RMSE.
## [1] 0.002150781
## [1] -0.002246104
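With the baseline predictor, the training set RMSE drops from 1.464787 to 1.149743, an improvement of about 21.5%, while the testing set RMSE rises from 1.391666 to 1.704249, which is about 22.5% worse. The two figures printed above equal these relative changes divided by a further factor of 100, so they should be read as 21.5% and -22.5% rather than 0.2%.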
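The relative change can be recomputed directly from the four RMSE values reported earlier. In this short sketch the vector names `rmse_raw`, `rmse_base` and `improvement` are hypothetical; the numbers are copied from the output above.

```r
# RMSE values copied from the output above
rmse_raw  <- c(train = 1.464787, test = 1.391666)  # raw-average predictor
rmse_base <- c(train = 1.149743, test = 1.704249)  # baseline predictor

# Relative improvement: positive means the baseline predictor lowers the error
improvement <- (rmse_raw - rmse_base) / rmse_raw

# Expressed in percent, rounded to one decimal place
improvement_pct <- round(100 * improvement, 1)
improvement_pct
```

This gives 21.5 for the training set and -22.5 for the testing set.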