For this assignment, I will be using Jester ratings dataset. This provides extensive ratings from users for different jokes. Below link contains detailed information about the dataset.
Dataset Link: http://eigentaste.berkeley.edu/dataset/
In this project, we will perform jokes recommendations to users. We will be using recommenderlab package to provide these recommendations. We will follow below types to provide these recommendations
Importing the CSV data and converting the complete dataframe into realRatingMatrix. Also performing some cleanup activity on the dataset.
#**Need to implement whole NA columns
jester <- read.csv('data/jesterfinal151cols.csv',header=FALSE,sep=",",
stringsAsFactors = FALSE, na.strings = c('99')) %>% select(c(-1)) %>%
select(-c(1, 2, 3, 4, 6, 9, 10, 11, 12, 14)) %>% as.matrix() %>% as("realRatingMatrix")
#Verify the class
class(jester)## [1] "realRatingMatrix"
## attr(,"package")
## [1] "recommenderlab"
Exploratory analysis on the recommender dataset.
#Dimensions - 50692 users on 150 different jokes
dim(jester)## [1] 50692 140
#Row counts -- Matches with original dataset
head(rowCounts(jester))## [1] 62 34 18 82 27 46
#Column counts
head(colCounts(jester))## V6 V8 V9 V14 V16 V17
## 566 50692 50692 50692 50692 50692
#Row sums
head(rowSums(jester))## [1] 129.8750 127.1250 -122.7188 213.2188 -59.5625 45.1250
#Column sums
head(colSums(jester))## V6 V8 V9 V14 V16 V17
## -1151.594 -98976.656 -36320.812 -32912.438 -72524.188 -82650.594
vector_ratings <- as.vector(jester@data)
#Minimum rating
min(vector_ratings)## [1] -10
#Maximum rating
max(vector_ratings)## [1] 10
#0 ratings
vector_ratings <- round(vector_ratings)
vector_ratings <- factor(vector_ratings)
qplot(vector_ratings) + ggtitle("Distribution of All ratings")table(vector_ratings)## vector_ratings
## -10 -9 -8 -7 -6 -5 -4 -3 -2
## 39380 44384 35639 45121 53275 53488 61817 61825 66268
## -1 0 1 2 3 4 5 6 7
## 64077 5479813 119368 142391 129584 129751 112454 107795 94079
## 8 9 10
## 87861 89273 79237
vector_ratings <- vector_ratings[vector_ratings != 0]
qplot(vector_ratings) + ggtitle("Distribution of ratings without 0")From the above charts it clearly says that 0 rating is outnumbered by any ratings.
avg_ratings_user <- rowMeans(jester@data)
qplot(avg_ratings_user) + stat_bin(binwidth = 0.1) +
ggtitle("Distribution of the average user rating")## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The average user rating us centered around 0.
avg_ratings_jokes <- colMeans(jester@data)
qplot(avg_ratings_jokes) + stat_bin(binwidth = 0.1) +
ggtitle("Distribution of the average jokes rating")## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Here again the avearage jokes rating us centered around 0. There are average ratings around -2 to 2.
As this dataset is so sparse, we can restrict the low and less ratings from user. As this will increase the sparcity. Restricting will yield better results.
user_ratings <- table(rowCounts(jester)) %>% data.frame()
head(user_ratings)## Var1 Freq
## 1 8 7497
## 2 9 2197
## 3 10 1953
## 4 11 1784
## 5 12 1644
## 6 13 1764
#jester[rowCounts(jester) > 10, colCounts(jester) > 50]We will perform modelling from the dataset. Also test it with below methods.
Recommending 6 Jokes for each user
#Split between training and test
which_train <- sample(x = c(TRUE, FALSE), size = nrow(jester),replace = TRUE, prob = c(0.8, 0.2))
jester_data_train <- jester[which_train, ]
jester_data_test <- jester[!which_train, ]
#Using Cosine Similarity
recc_model_item <- Recommender(data = jester_data_test, method = "IBCF",parameter = list(method="Cosine"))
recc_predicted <- predict(object = recc_model_item, newdata = jester_data_test, n= 6)
#Below are the predictions of the joke number for each user.
head(recc_predicted@items)## [[1]]
## [1] 129 89 67 85 32 127
##
## [[2]]
## [1] 129 67 60 85 24 72
##
## [[3]]
## [1] 129 103 110 112 133 123
##
## [[4]]
## [1] 68 23 24 31 36 49
##
## [[5]]
## [1] 105 110 82 100 99 60
##
## [[6]]
## [1] 76 130 84 50 28 47
#length(recc_predicted@items)
# matrix(unlist(recc_predicted@items),ncol=6,byrow=T) %>% data.frame() %>% mutate(user = which(!which_train))
#
#
# length(unlist(recc_predicted@items))
# 10144*6
#
# which(!which_train)Above model predicts for all the users in the dataset. It is really hard to figure out the correct rating. So lets reapply for just two rows.
Verifying for just for two users
jester_data_train1 <- jester[-c(2,3), ]
jester_data_test1 <- jester[c(2,3), ]
recc_model_item1 <- Recommender(data = jester_data_test1, method = "IBCF",parameter = list(method="Cosine"))
recc_predicted1 <- predict(object = recc_model_item1, newdata = jester_data_test1, n= 6)
#Labels of recommendation for user 2
recc_predicted@itemLabels[recc_predicted1@items[[1]]]## [1] "V34" "V35" "V38" "V106" "V110" "V39"
recc_predicted@itemLabels[recc_predicted1@items[[2]]]## [1] "V22" "V27" "V28" "V29" "V30" "V31"
Above method shows the recommended jokes for just two users.
Calculate RMSE for the model using evaluate scheme in recommender package.
## create 90/10 split (known/unknown)
split_data <- evaluationScheme(jester, method="split", train=0.9, k=1, given=8)
#split_data
## create a item-based CF recommender using training data
recommender_model <- Recommender(getData(split_data, "train"), "IBCF",parameter = list(method="Cosine"))
recommender_model## Recommender of type 'IBCF' for 'realRatingMatrix'
## learned using 45622 users.
## create predictions for the test data using known ratings
predict_items <- predict(recommender_model, getData(split_data, "known"), type="ratings")
predict_items ## 5070 x 140 rating matrix of class 'realRatingMatrix' with 535447 ratings.
## compute error metrics averaged per user and then averaged over all recommendations
(accuracy <- calcPredictionAccuracy(predict_items, getData(split_data, "unknown")) )## RMSE MSE MAE
## 5.495450 30.199976 4.049061
head(calcPredictionAccuracy(predict_items, getData(split_data, "unknown"), byUser=TRUE)) ## RMSE MSE MAE
## [1,] 3.769737 14.21092 2.992503
## [2,] 4.867222 23.68985 3.793029
## [3,] 5.963790 35.56679 4.254053
## [4,] 3.531931 12.47454 3.180298
## [5,] 5.251905 27.58250 3.832261
## [6,] 9.091616 82.65748 7.634548
Above mentioned is the average RMSE for the entire dataset. Lets try User based CF and see if we get different RMSE.
Recommending 6 Jokes for each user by user based CF.
recc_model_item <- Recommender(data = jester_data_test, method = "UBCF",parameter = list(method="Pearson"))
recc_predicted <- predict(object = recc_model_item, newdata = jester_data_test, n= 6)
head(recc_predicted@items)## [[1]]
## [1] 116 26 79 25 43 39
##
## [[2]]
## [1] 119 94 128 107 98 117
##
## [[3]]
## [1] 104 117 116 128 119 140
##
## [[4]]
## [1] 119 117 87 58 39 94
##
## [[5]]
## [1] 15 117 58 39 119 12
##
## [[6]]
## [1] 117 98 43 95 25 128
As we perfomed previously, verifying for two users
recc_model_item1 <- Recommender(data = jester_data_test1, method = "UBCF",parameter = list(method="Pearson"))
recc_predicted1 <- predict(object = recc_model_item1, newdata = jester_data_test1, n= 6)
#Labels of recommendation for user 2
recc_predicted@itemLabels[recc_predicted1@items[[1]]]## [1] "V106" "V103" "V39" "V40" "V23" "V24"
recc_predicted@itemLabels[recc_predicted1@items[[2]]]## [1] "V27" "V51" "V30" "V22" "V37" "V29"
This provides different recommendations compared to item based CF.
## create 90/10 split (known/unknown)
## create a User-based CF recommender using training data
recommender_model_UB <- Recommender(getData(split_data, "train"), "UBCF")
recommender_model_UB## Recommender of type 'UBCF' for 'realRatingMatrix'
## learned using 45622 users.
## create predictions for the test data using known ratings
predict_items_UB <- predict(recommender_model_UB, getData(split_data, "known"), type="ratings")
predict_items_UB## 5070 x 140 rating matrix of class 'realRatingMatrix' with 669240 ratings.
## compute error metrics averaged per user and then averaged over all recommendations
(accuracy_UB <- calcPredictionAccuracy(predict_items_UB, getData(split_data, "unknown")))## RMSE MSE MAE
## 4.517493 20.407747 3.414753
#Detailed accuracy
head(calcPredictionAccuracy(predict_items_UB, getData(split_data, "unknown"), byUser=TRUE)) ## RMSE MSE MAE
## [1,] 3.512195 12.33552 2.935330
## [2,] 4.013308 16.10664 3.087992
## [3,] 5.970101 35.64211 4.189411
## [4,] 4.930571 24.31053 4.401746
## [5,] 4.402177 19.37916 3.885412
## [6,] 9.563038 91.45169 7.788803
Here all the calculations were performed with “pearson similarity”. RMSE is less in a user based CF.
recommender_model_UB_p <- Recommender(getData(split_data, "train"), "UBCF",parameter = list(method="Cosine"))
recommender_model_UB_p## Recommender of type 'UBCF' for 'realRatingMatrix'
## learned using 45622 users.
## create predictions for the test data using known ratings
predict_items_UB_p <- predict(recommender_model_UB_p, getData(split_data, "known"), type="ratings")
predict_items_UB_p## 5070 x 140 rating matrix of class 'realRatingMatrix' with 669240 ratings.
## compute error metrics averaged per user and then averaged over all recommendations
(accuracy_UB_p <- calcPredictionAccuracy(predict_items_UB_p, getData(split_data, "unknown"))
)## RMSE MSE MAE
## 4.517493 20.407747 3.414753
head(calcPredictionAccuracy(predict_items_UB_p, getData(split_data, "unknown"), byUser=TRUE)) ## RMSE MSE MAE
## [1,] 3.512195 12.33552 2.935330
## [2,] 4.013308 16.10664 3.087992
## [3,] 5.970101 35.64211 4.189411
## [4,] 4.930571 24.31053 4.401746
## [5,] 4.402177 19.37916 3.885412
## [6,] 9.563038 91.45169 7.788803
accuracy_UB## RMSE MSE MAE
## 4.517493 20.407747 3.414753
We have created different recommendations on the Jester dataset. Below are some of the findings.