Dataset


For this assignment, I will be using the Jester ratings dataset, which provides extensive user ratings for different jokes. The link below contains detailed information about the dataset.

Dataset Link: http://eigentaste.berkeley.edu/dataset/

Objective


In this project, we will recommend jokes to users using the recommenderlab package. We will provide these recommendations with the following two approaches:

  1. User-User Collaborative Filtering
  2. Item-Item Collaborative Filtering

Importing and cleaning data


Importing the CSV data, converting the complete data frame into a realRatingMatrix, and performing some cleanup on the dataset.

library(dplyr); library(ggplot2); library(recommenderlab)

#TODO: drop the all-NA joke columns programmatically (see the sketch below)

jester <- read.csv('data/jesterfinal151cols.csv', header = FALSE, sep = ",",
            stringsAsFactors = FALSE, na.strings = c('99')) %>%
  select(-1) %>%                                    #drop the per-user rating count
  select(-c(1, 2, 3, 4, 6, 9, 10, 11, 12, 14)) %>%  #drop unusable joke columns
  as.matrix() %>% as("realRatingMatrix")
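
As a minimal sketch of the TODO above (assuming the same raw CSV; keep and jester_auto are illustrative names), the all-NA joke columns could be detected programmatically instead of hard-coded:

raw <- read.csv('data/jesterfinal151cols.csv', header = FALSE, sep = ",",
                na.strings = c('99'))[-1]  #drop the per-user rating count
keep <- colSums(!is.na(raw)) > 0           #TRUE for columns with at least one rating
jester_auto <- as(as.matrix(raw[, keep]), "realRatingMatrix")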

#Verify the class
class(jester)
## [1] "realRatingMatrix"
## attr(,"package")
## [1] "recommenderlab"

Exploring data


Exploratory analysis on the recommender dataset.

Exploring rows and dimensions

#Dimensions - 50692 users on 140 jokes (after dropping columns)
dim(jester)
## [1] 50692   140
#Row counts -- match the original dataset
head(rowCounts(jester))
## [1] 62 34 18 82 27 46
#Column counts
head(colCounts(jester))
##    V6    V8    V9   V14   V16   V17 
##   566 50692 50692 50692 50692 50692
#Row sums
head(rowSums(jester))
## [1]  129.8750  127.1250 -122.7188  213.2188  -59.5625   45.1250
#Column sums
head(colSums(jester))
##         V6         V8         V9        V14        V16        V17 
##  -1151.594 -98976.656 -36320.812 -32912.438 -72524.188 -82650.594
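
Since the matrix is large and sparse, a heatmap of a small corner can make the sparsity visible. This is a sketch (not part of the original analysis) using the image() method that recommenderlab provides for realRatingMatrix objects:

#Sketch: visualize the rating pattern for the first 100 users and jokes
image(jester[1:100, 1:100], main = "Ratings of the first 100 users on the first 100 jokes")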

Exploring the values

#All cells of the sparse rating matrix; missing entries appear as 0
vector_ratings <- as.vector(jester@data)

#Minimum rating
min(vector_ratings)
## [1] -10
#Maximum rating
max(vector_ratings)
## [1] 10
#Round ratings to integers and convert to a factor for plotting
vector_ratings <- round(vector_ratings)
vector_ratings <- factor(vector_ratings)

qplot(vector_ratings) + ggtitle("Distribution of All ratings")

table(vector_ratings)
## vector_ratings
##     -10      -9      -8      -7      -6      -5      -4      -3      -2 
##   39380   44384   35639   45121   53275   53488   61817   61825   66268 
##      -1       0       1       2       3       4       5       6       7 
##   64077 5479813  119368  142391  129584  129751  112454  107795   94079 
##       8       9      10 
##   87861   89273   79237
vector_ratings <- vector_ratings[vector_ratings != 0]

qplot(vector_ratings) + ggtitle("Distribution of ratings without 0")

The charts and table above show that the 0 bucket vastly outnumbers every other rating: in the sparse matrix representation, missing entries appear as 0 (and ratings near 0 round into the same bucket), so we drop the zeros before plotting again.
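
As a sketch of an alternative (not used above), recommenderlab's getRatings() pulls out only the stored ratings, which avoids mixing missing entries into the distribution:

observed_ratings <- getRatings(jester)  #only the stored ratings, no missing entries
summary(observed_ratings)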

Exploring average ratings per user and per joke

Average ratings by each user

avg_ratings_user <- rowMeans(jester@data)

qplot(avg_ratings_user) + stat_bin(binwidth = 0.1) +
  ggtitle("Distribution of the average user rating")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The average user rating is centered around 0.

Average ratings by each item (joke)

avg_ratings_jokes <- colMeans(jester@data)

qplot(avg_ratings_jokes) + stat_bin(binwidth = 0.1) +
  ggtitle("Distribution of the average jokes rating")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Here again, the average joke rating is centered around 0, with most averages falling between -2 and 2.
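
One caveat: rowMeans() and colMeans() on the raw @data slot divide by all 140 columns, treating missing entries as 0 and pulling the averages toward 0. A sketch of the alternative (assuming recommenderlab's rowMeans/colMeans methods for realRatingMatrix, which average over observed ratings only):

head(rowMeans(jester))  #per-user mean over the jokes that user actually rated
head(colMeans(jester))  #per-joke mean over the users who actually rated it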

Data Preparation


Because this dataset is very sparse, we can remove users who have rated only a few jokes; keeping them only adds to the sparsity, and restricting the matrix should yield better results.

user_ratings <- table(rowCounts(jester)) %>% data.frame()
head(user_ratings)
##   Var1 Freq
## 1    8 7497
## 2    9 2197
## 3   10 1953
## 4   11 1784
## 5   12 1644
## 6   13 1764
#Possible filter (not applied here); see the sketch below
#jester[rowCounts(jester) > 10, colCounts(jester) > 50]
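
Applying that filter would look like the sketch below (the thresholds 10 and 50 are illustrative, not tuned; jester_dense is a hypothetical name):

jester_dense <- jester[rowCounts(jester) > 10, colCounts(jester) > 50]
dim(jester_dense)  #fewer users and jokes, but a denser matrix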

Data Modelling


We will build models on the dataset and test them using the methods below.

  1. Split the dataset into training and test sets by rows.
  2. Use the evaluationScheme() function provided by recommenderlab.

Item-Item CF


Item based CF - Test1:

Recommending 6 jokes for each user.

#Split between training and test
which_train <- sample(x = c(TRUE, FALSE), size = nrow(jester),
                      replace = TRUE, prob = c(0.8, 0.2))

jester_data_train <- jester[which_train, ]
jester_data_test <- jester[!which_train, ]

#Using cosine similarity
#Note: the model is fit on the test split here; ordinarily it would be
#fit on jester_data_train
recc_model_item <- Recommender(data = jester_data_test, method = "IBCF",
                               parameter = list(method = "Cosine"))

recc_predicted <- predict(object = recc_model_item, newdata = jester_data_test, n = 6)


#Below are the predicted joke column indices for each user
head(recc_predicted@items)
## [[1]]
## [1] 129  89  67  85  32 127
## 
## [[2]]
## [1] 129  67  60  85  24  72
## 
## [[3]]
## [1] 129 103 110 112 133 123
## 
## [[4]]
## [1] 68 23 24 31 36 49
## 
## [[5]]
## [1] 105 110  82 100  99  60
## 
## [[6]]
## [1]  76 130  84  50  28  47
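
To read the predictions as joke labels rather than column indices, the topNList object can be coerced to a list. A sketch (recc_labels is an illustrative name):

recc_labels <- as(recc_predicted, "list")
head(recc_labels, 2)  #character vectors of joke column names, one per user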

The model above predicts for every user in the test set, which makes the output hard to inspect. Let's re-run it for just two users.

Item based CF - Test2:

Verifying for just two users

#(jester_data_train1 is set aside but not used below)
jester_data_train1 <- jester[-c(2, 3), ]
jester_data_test1 <- jester[c(2, 3), ]

recc_model_item1 <- Recommender(data = jester_data_test1, method = "IBCF",
                                parameter = list(method = "Cosine"))

recc_predicted1 <- predict(object = recc_model_item1, newdata = jester_data_test1, n = 6)

#Joke labels recommended for users 2 and 3
recc_predicted1@itemLabels[recc_predicted1@items[[1]]]
## [1] "V34"  "V35"  "V38"  "V106" "V110" "V39"
recc_predicted1@itemLabels[recc_predicted1@items[[2]]]
## [1] "V22" "V27" "V28" "V29" "V30" "V31"

The output above shows the recommended jokes for these two users.

Item based CF - Validation of the model (RMSE):

Calculate RMSE for the model using evaluationScheme() from the recommenderlab package.

## create 90/10 split (known/unknown) 
split_data <- evaluationScheme(jester, method="split", train=0.9, k=1, given=8) 
#split_data 

## create a item-based CF recommender using training data 
recommender_model <- Recommender(getData(split_data, "train"), "IBCF",
                                 parameter = list(method = "Cosine"))
recommender_model
## Recommender of type 'IBCF' for 'realRatingMatrix' 
## learned using 45622 users.
## create predictions for the test data using known ratings 
predict_items <- predict(recommender_model, getData(split_data, "known"), type="ratings") 
predict_items 
## 5070 x 140 rating matrix of class 'realRatingMatrix' with 535447 ratings.
## compute error metrics averaged per user and then averaged over all recommendations
(accuracy <- calcPredictionAccuracy(predict_items, getData(split_data, "unknown")) )
##      RMSE       MSE       MAE 
##  5.495450 30.199976  4.049061
head(calcPredictionAccuracy(predict_items, getData(split_data, "unknown"), byUser=TRUE)) 
##          RMSE      MSE      MAE
## [1,] 3.769737 14.21092 2.992503
## [2,] 4.867222 23.68985 3.793029
## [3,] 5.963790 35.56679 4.254053
## [4,] 3.531931 12.47454 3.180298
## [5,] 5.251905 27.58250 3.832261
## [6,] 9.091616 82.65748 7.634548

The values above give the error metrics for the held-out ratings, averaged over all users and then per user. Let's try user-based CF and see whether the RMSE differs.
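
For a more robust estimate than a single split, the same evaluation could be repeated with cross-validation. A sketch with assumed parameters (k = 4 folds and given = 8, neither tuned):

cv_scheme <- evaluationScheme(jester, method = "cross-validation", k = 4, given = 8)
cv_results <- evaluate(cv_scheme, method = "IBCF", type = "ratings")
avg(cv_results)  #RMSE/MSE/MAE averaged over the folds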

User-User CF


User based CF - Test1:

Recommending 6 jokes for each user with user-based CF.

#Again fit on the test split, as in the IBCF run above
recc_model_user <- Recommender(data = jester_data_test, method = "UBCF",
                               parameter = list(method = "Pearson"))

recc_predicted <- predict(object = recc_model_user, newdata = jester_data_test, n = 6)

head(recc_predicted@items)
## [[1]]
## [1] 116  26  79  25  43  39
## 
## [[2]]
## [1] 119  94 128 107  98 117
## 
## [[3]]
## [1] 104 117 116 128 119 140
## 
## [[4]]
## [1] 119 117  87  58  39  94
## 
## [[5]]
## [1]  15 117  58  39 119  12
## 
## [[6]]
## [1] 117  98  43  95  25 128

User based CF - Test2:

As we did previously, verifying for two users.

recc_model_user1 <- Recommender(data = jester_data_test1, method = "UBCF",
                                parameter = list(method = "Pearson"))

recc_predicted1 <- predict(object = recc_model_user1, newdata = jester_data_test1, n = 6)

#Joke labels recommended for users 2 and 3
recc_predicted1@itemLabels[recc_predicted1@items[[1]]]
## [1] "V106" "V103" "V39"  "V40"  "V23"  "V24"
recc_predicted1@itemLabels[recc_predicted1@items[[2]]]
## [1] "V27" "V51" "V30" "V22" "V37" "V29"

This produces different recommendations than item-based CF.

User based CF - Validation of the model (RMSE):

## reuse the 90/10 split (known/unknown) created above

## create a user-based CF recommender using the training data; no similarity
## is specified, so recommenderlab's default for UBCF is used
recommender_model_UB <- Recommender(getData(split_data, "train"), "UBCF")
recommender_model_UB
## Recommender of type 'UBCF' for 'realRatingMatrix' 
## learned using 45622 users.
## create predictions for the test data using known ratings 
predict_items_UB <- predict(recommender_model_UB, getData(split_data, "known"), type="ratings") 
predict_items_UB
## 5070 x 140 rating matrix of class 'realRatingMatrix' with 669240 ratings.
## compute error metrics averaged per user and then averaged over all recommendations
(accuracy_UB <- calcPredictionAccuracy(predict_items_UB, getData(split_data, "unknown")))
##      RMSE       MSE       MAE 
##  4.517493 20.407747  3.414753
#Detailed accuracy
head(calcPredictionAccuracy(predict_items_UB, getData(split_data, "unknown"), byUser=TRUE)) 
##          RMSE      MSE      MAE
## [1,] 3.512195 12.33552 2.935330
## [2,] 4.013308 16.10664 3.087992
## [3,] 5.970101 35.64211 4.189411
## [4,] 4.930571 24.31053 4.401746
## [5,] 4.402177 19.37916 3.885412
## [6,] 9.563038 91.45169 7.788803

The Test1 and Test2 runs above used Pearson similarity, but note that this validation model was built with recommenderlab's default UBCF similarity. The RMSE is lower for user-based CF than for item-based CF.
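
To validate UBCF with Pearson similarity as in Test1 and Test2, the parameter would need to be set explicitly. A sketch (not run here; the _pear names are illustrative):

recommender_model_UB_pear <- Recommender(getData(split_data, "train"), "UBCF",
                                         parameter = list(method = "Pearson"))
predict_items_UB_pear <- predict(recommender_model_UB_pear,
                                 getData(split_data, "known"), type = "ratings")
calcPredictionAccuracy(predict_items_UB_pear, getData(split_data, "unknown"))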

User based CF - Validation of the model (RMSE) with Cosine:

recommender_model_UB_c <- Recommender(getData(split_data, "train"), "UBCF",
                                      parameter = list(method = "Cosine"))

recommender_model_UB_c
## Recommender of type 'UBCF' for 'realRatingMatrix' 
## learned using 45622 users.
## create predictions for the test data using known ratings 
predict_items_UB_c <- predict(recommender_model_UB_c, getData(split_data, "known"), type = "ratings") 
predict_items_UB_c
## 5070 x 140 rating matrix of class 'realRatingMatrix' with 669240 ratings.
## compute error metrics averaged per user and then averaged over all recommendations
(accuracy_UB_c <- calcPredictionAccuracy(predict_items_UB_c, getData(split_data, "unknown")))
##      RMSE       MSE       MAE 
##  4.517493 20.407747  3.414753
head(calcPredictionAccuracy(predict_items_UB_c, getData(split_data, "unknown"), byUser = TRUE)) 
##          RMSE      MSE      MAE
## [1,] 3.512195 12.33552 2.935330
## [2,] 4.013308 16.10664 3.087992
## [3,] 5.970101 35.64211 4.189411
## [4,] 4.930571 24.31053 4.401746
## [5,] 4.402177 19.37916 3.885412
## [6,] 9.563038 91.45169 7.788803
#Compare with the default-similarity run above
accuracy_UB
##      RMSE       MSE       MAE 
##  4.517493 20.407747  3.414753

Summary


We have built several recommenders on the Jester dataset. Below are some of the findings.

  1. User-based CF (UBCF) and item-based CF (IBCF) do not produce the same results; each recommends different jokes for the same user.
  2. IBCF is faster than UBCF.
  3. The RMSE of IBCF (5.50) is higher than that of UBCF (4.52).
  4. IBCF recommends the same jokes (a short tail) to most users, while UBCF recommends a wider variety of jokes (a longer tail).
  5. The UBCF validation runs with the default similarity and with cosine similarity produce identical RMSE values, which suggests the default UBCF similarity is cosine; a side-by-side comparison sketch follows below.
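
As a follow-up, these findings could be checked with a single multi-algorithm evaluation. A sketch with illustrative parameter choices (the algorithm names are labels of our choosing):

algorithms <- list(
  "IBCF Cosine"  = list(name = "IBCF", param = list(method = "Cosine")),
  "UBCF Cosine"  = list(name = "UBCF", param = list(method = "Cosine")),
  "UBCF Pearson" = list(name = "UBCF", param = list(method = "Pearson"))
)
comparison <- evaluate(split_data, algorithms, type = "ratings")
avg(comparison[["UBCF Pearson"]])  #error metrics for one of the models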