By textbook, the collaborative filtering algorithms are based on measuring the similarity between users of between items. In the “recommenderlab”, I use Jester5k, the joke rating matrix for this project. To compute similarities, in this case, I followed the example from the textbook by comparing cosine and pearson.
library("recommenderlab")
library("ggplot2")
data(Jester5k)
Jester5k
## 5000 x 100 rating matrix of class 'realRatingMatrix' with 362106 ratings.
methods(class= class(Jester5k))
## [1] [ [<- binarize
## [4] calcPredictionAccuracy coerce colCounts
## [7] colMeans colSds colSums
## [10] denormalize dim dimnames
## [13] dimnames<- dissimilarity evaluationScheme
## [16] getData.frame getList getNormalize
## [19] getRatingMatrix getRatings getTopNLists
## [22] image normalize nratings
## [25] Recommender removeKnownRatings rowCounts
## [28] rowMeans rowSds rowSums
## [31] sample show similarity
## see '?methods' for accessing help and source code
# Computing the similarity matrix
# similarity users
similarity_users <- similarity(Jester5k[1:5, ], method="cosine", which = "users")
as.matrix(similarity_users)
## u2841 u15547 u15221 u15573 u21505
## u2841 0.0000000 0.500128118 0.15871815 0.197523538 0.2000859
## u15547 0.5001281 0.000000000 0.08944693 0.003513988 0.1138529
## u15221 0.1587182 0.089446933 0.00000000 0.133177034 0.3900443
## u15573 0.1975235 0.003513988 0.13317703 0.000000000 0.2182228
## u21505 0.2000859 0.113852906 0.39004428 0.218222818 0.0000000
# The more red the cell is, the more similar two users are.
image(as.matrix(similarity_users), main="Use cosine for similarity")
similarity_users <- similarity(Jester5k[1:5, ], method="pearson", which = "users")
as.matrix(similarity_users)
## u2841 u15547 u15221 u15573 u21505
## u2841 0.0000000 0.5000000 0.5000000 0.5 0.5043172
## u15547 0.5000000 0.0000000 0.5414964 0.5 0.5295880
## u15221 0.5000000 0.5414964 0.0000000 0.5 0.5892015
## u15573 0.5000000 0.5000000 0.5000000 0.0 0.5000000
## u21505 0.5043172 0.5295880 0.5892015 0.5 0.0000000
# The more red the cell is, the more similar two users are.
image(as.matrix(similarity_users), main="Use pearson for similarity")
To compute the similarities between users, I used both ways, cosine and pearson. By quora, “https://www.quora.com/In-what-scenario-is-using-Pearson-correlation-better-than-Cosine-similarity”, I found the difference between cosine and pearson. When users tend to have very differing sets of items, pearson would perform worse. As we can see from the heat map, cosine one seems to show more relations between different users. The more red cell is, the more similar two users are.
# similarity items
similarity_items <- similarity(Jester5k[ ,1:5], method = "cosine", which = "items")
as.matrix(similarity_items)
## j1 j2 j3 j4 j5
## j1 0.0000000 0.3839318 0.3916563 0.2357251 0.2115277
## j2 0.3839318 0.0000000 0.2764639 0.3020971 0.2234656
## j3 0.3916563 0.2764639 0.0000000 0.3532990 0.2179025
## j4 0.2357251 0.3020971 0.3532990 0.0000000 0.1874588
## j5 0.2115277 0.2234656 0.2179025 0.1874588 0.0000000
image(as.matrix(similarity_items),main="cosine for Item similarity")
similarity_items <- similarity(Jester5k[ ,1:5], method = "pearson", which = "items")
as.matrix(similarity_items)
## j1 j2 j3 j4 j5
## j1 0.0000000 0.6180590 0.6216113 0.5902784 0.5588692
## j2 0.6180590 0.0000000 0.5795824 0.5999586 0.5625990
## j3 0.6216113 0.5795824 0.0000000 0.6176980 0.5605612
## j4 0.5902784 0.5999586 0.6176980 0.0000000 0.5594754
## j5 0.5588692 0.5625990 0.5605612 0.5594754 0.0000000
image(as.matrix(similarity_items),main="pearson for Item similarity")
I used the same method from comparing users’ similarities to compare the similarities of items. I think cosine seems better again in this case.
# The histogram for the data
hist(getRatings(Jester5k),main="Distribution of ratings")
best <- which.max(colMeans(Jester5k))
cat(JesterJokes[best])
## A guy goes into confession and says to the priest, "Father, I'm 80 years old, widower, with 11 grandchildren. Last night I met two beautiful flight attendants. They took me home and I made love to both of them. Twice." The priest said: "Well, my son, when was the last time you were in confession?" "Never Father, I'm Jewish." "So then, why are you telling me?" "I'm telling everybody."
# Exploring the average ratings
average_ratings <- colMeans(Jester5k)
qplot(average_ratings) + stat_bin(binwidth = 0.1) +
ggtitle("Distribution of the average joke rating")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The highest value is around 1, and there is no joke rate for 10 averagely or -10. Since this matrix is only for 5k and there is no missing values because missing values count as 99 which does not appear in this data set so we do not have to worry about the ratings are biased because of the missing values.
Item-based collaborative filtering. By textbook, the core algorithm is based on these steps 1. For each two items, measure how similar they are in terms of having received similar ratings by similar users. 2. For each item, idenitfy the k-most similar items 3. For each user, identify the items that are most similar to the user’s purchases
building IBCF model 1.Training set and Test set 2.Applying the recommender model on the test set. I will use the model to recommend jokes to the users in the test set and specify 5 jokes to recommend to users.
ratings_jokes <- Jester5k[rowCounts(Jester5k) > 50,
colCounts(Jester5k) > 100]
# set the probability in the training set as 80%
which_train <- sample(x = c(TRUE, FALSE), size = nrow(ratings_jokes),
replace = TRUE, prob = c(0.8, 0.2))
head(which_train)
## [1] TRUE TRUE TRUE TRUE TRUE TRUE
# split training and testing set
train <- ratings_jokes[which_train,]
test <- ratings_jokes[!which_train,]
#Remommendation models IBCF
recommender_models <- recommenderRegistry$get_entries(dataType ="realRatingMatrix")
names(recommender_models)
## [1] "ALS_realRatingMatrix" "ALS_implicit_realRatingMatrix"
## [3] "IBCF_realRatingMatrix" "POPULAR_realRatingMatrix"
## [5] "RANDOM_realRatingMatrix" "RERECOMMEND_realRatingMatrix"
## [7] "SVD_realRatingMatrix" "SVDF_realRatingMatrix"
## [9] "UBCF_realRatingMatrix"
lapply(recommender_models, "[[", "description")
## $ALS_realRatingMatrix
## [1] "Recommender for explicit ratings based on latent factors, calculated by alternating least squares algorithm."
##
## $ALS_implicit_realRatingMatrix
## [1] "Recommender for implicit data based on latent factors, calculated by alternating least squares algorithm."
##
## $IBCF_realRatingMatrix
## [1] "Recommender based on item-based collaborative filtering."
##
## $POPULAR_realRatingMatrix
## [1] "Recommender based on item popularity."
##
## $RANDOM_realRatingMatrix
## [1] "Produce random recommendations (real ratings)."
##
## $RERECOMMEND_realRatingMatrix
## [1] "Re-recommends highly rated items (real ratings)."
##
## $SVD_realRatingMatrix
## [1] "Recommender based on SVD approximation with column-mean imputation."
##
## $SVDF_realRatingMatrix
## [1] "Recommender based on Funk SVD with gradient descend."
##
## $UBCF_realRatingMatrix
## [1] "Recommender based on user-based collaborative filtering."
recommender_models$IBCF_realRatingMatrix$parameters
## $k
## [1] 30
##
## $method
## [1] "Cosine"
##
## $normalize
## [1] "center"
##
## $normalize_sim_matrix
## [1] FALSE
##
## $alpha
## [1] 0.5
##
## $na_as_zero
## [1] FALSE
#Build IBCF Model, model_details$sim is similarity matrix
IBCF_model <- Recommender(data = train, method= "IBCF", parameter = list (k =30))
IBCF_model
## Recommender of type 'IBCF' for 'realRatingMatrix'
## learned using 3085 users.
model_detail <- getModel(IBCF_model)
dim(model_detail$sim)
## [1] 100 100
col_sums <- colSums(model_detail$sim > 0)
qplot(col_sums) + stat_bin(binwidth = 1) + ggtitle("Distribution of
the column count")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# In this section, we can find which are the jokes with the most elements.
which_max <- order(col_sums, decreasing = TRUE)[1:6]
rownames(model_detail$sim)[which_max]
## [1] "j44" "j16" "j57" "j58" "j36" "j53"
Applying the recommender model on the test set. (By textbook) 1. Extract the user rating of each purchase associated with this item. The rating is used as as a weight. 2. Extract the similarity of the item with each purchase associated with this item 3. Multiply each weight with the related similarity 4. Sum everything up.
n_recommended <- 5
predict_model <- predict(object = IBCF_model, newdata = test, n = n_recommended)
predict_model
## Recommendations as 'topNList' with n = 5 for 790 users.
class(predict_model)
## [1] "topNList"
## attr(,"package")
## [1] "recommenderlab"
slotNames(predict_model)
## [1] "items" "ratings" "itemLabels" "n"
# Example for the first user, recommendations
recc_user_1 <- predict_model@items[[1]]
jokes_user_1 <- predict_model@itemLabels[recc_user_1]
jokes_user_1
## [1] "j82" "j75" "j94" "j84" "j97"
IBCF bases on the similarity matrix. The model stors the 30-most (default number) similar. It will work very will with lots of data and big rating matrices.
recommender_models <- recommenderRegistry$get_entries(dataType =
"realRatingMatrix")
recommender_models$UBCF_realRatingMatrix$parameters
## $method
## [1] "cosine"
##
## $nn
## [1] 25
##
## $sample
## [1] FALSE
##
## $normalize
## [1] "center"
UBCF_model <- Recommender(data = train, method = "UBCF")
UBCF_model
## Recommender of type 'UBCF' for 'realRatingMatrix'
## learned using 3085 users.
model_details <- getModel(UBCF_model)
names(model_details)
## [1] "description" "data" "method" "nn" "sample"
## [6] "normalize" "verbose"
recc_predicted <- predict(object = UBCF_model,
newdata = test, n = n_recommended)
recc_predicted
## Recommendations as 'topNList' with n = 5 for 790 users.
recc_user_1 <- recc_predicted@items[[1]]
jokes_user_1 <- recc_predicted@itemLabels[recc_user_1]
jokes_user_1
## [1] "j72" "j100" "j83" "j80" "j73"
Compare different models with the recommendations for the first user, only “Joke80” is in both models. According to the textbook, UBCF has to keep the entire database, it does not work good with bigger dataset but its accuracy is slightly better than IBCF.