Project2 Data 612

By textbook, the collaborative filtering algorithms are based on measuring the similarity between users of between items. In the “recommenderlab”, I use Jester5k, the joke rating matrix for this project. To compute similarities, in this case, I followed the example from the textbook by comparing cosine and pearson.

library("recommenderlab")
library("ggplot2")


data(Jester5k)
Jester5k

## 5000 x 100 rating matrix of class 'realRatingMatrix' with 362106 ratings.

methods(class= class(Jester5k))

##  [1] [                      [<-                    binarize              
##  [4] calcPredictionAccuracy coerce                 colCounts             
##  [7] colMeans               colSds                 colSums               
## [10] denormalize            dim                    dimnames              
## [13] dimnames<-             dissimilarity          evaluationScheme      
## [16] getData.frame          getList                getNormalize          
## [19] getRatingMatrix        getRatings             getTopNLists          
## [22] image                  normalize              nratings              
## [25] Recommender            removeKnownRatings     rowCounts             
## [28] rowMeans               rowSds                 rowSums               
## [31] sample                 show                   similarity            
## see '?methods' for accessing help and source code

# Computing the similarity matrix
# similarity users
similarity_users <- similarity(Jester5k[1:5, ], method="cosine", which = "users")
as.matrix(similarity_users)

##            u2841      u15547     u15221      u15573    u21505
## u2841  0.0000000 0.500128118 0.15871815 0.197523538 0.2000859
## u15547 0.5001281 0.000000000 0.08944693 0.003513988 0.1138529
## u15221 0.1587182 0.089446933 0.00000000 0.133177034 0.3900443
## u15573 0.1975235 0.003513988 0.13317703 0.000000000 0.2182228
## u21505 0.2000859 0.113852906 0.39004428 0.218222818 0.0000000

# The more red the cell is, the more similar two users are. 
image(as.matrix(similarity_users), main="Use cosine for similarity")

similarity_users <- similarity(Jester5k[1:5, ], method="pearson", which = "users")
as.matrix(similarity_users)

##            u2841    u15547    u15221 u15573    u21505
## u2841  0.0000000 0.5000000 0.5000000    0.5 0.5043172
## u15547 0.5000000 0.0000000 0.5414964    0.5 0.5295880
## u15221 0.5000000 0.5414964 0.0000000    0.5 0.5892015
## u15573 0.5000000 0.5000000 0.5000000    0.0 0.5000000
## u21505 0.5043172 0.5295880 0.5892015    0.5 0.0000000

# The more red the cell is, the more similar two users are. 
image(as.matrix(similarity_users), main="Use pearson for similarity")

To compute the similarities between users, I used both ways, cosine and pearson. By quora, “https://www.quora.com/In-what-scenario-is-using-Pearson-correlation-better-than-Cosine-similarity”, I found the difference between cosine and pearson. When users tend to have very differing sets of items, pearson would perform worse. As we can see from the heat map, cosine one seems to show more relations between different users. The more red cell is, the more similar two users are.

# similarity items

similarity_items <- similarity(Jester5k[ ,1:5], method = "cosine", which = "items")
as.matrix(similarity_items)

##           j1        j2        j3        j4        j5
## j1 0.0000000 0.3839318 0.3916563 0.2357251 0.2115277
## j2 0.3839318 0.0000000 0.2764639 0.3020971 0.2234656
## j3 0.3916563 0.2764639 0.0000000 0.3532990 0.2179025
## j4 0.2357251 0.3020971 0.3532990 0.0000000 0.1874588
## j5 0.2115277 0.2234656 0.2179025 0.1874588 0.0000000

image(as.matrix(similarity_items),main="cosine for Item similarity")

similarity_items <- similarity(Jester5k[ ,1:5], method = "pearson", which = "items")
as.matrix(similarity_items)

##           j1        j2        j3        j4        j5
## j1 0.0000000 0.6180590 0.6216113 0.5902784 0.5588692
## j2 0.6180590 0.0000000 0.5795824 0.5999586 0.5625990
## j3 0.6216113 0.5795824 0.0000000 0.6176980 0.5605612
## j4 0.5902784 0.5999586 0.6176980 0.0000000 0.5594754
## j5 0.5588692 0.5625990 0.5605612 0.5594754 0.0000000

image(as.matrix(similarity_items),main="pearson for Item similarity")

I used the same method from comparing users’ similarities to compare the similarities of items. I think cosine seems better again in this case.

# The histogram for the data

hist(getRatings(Jester5k),main="Distribution of ratings")

best <- which.max(colMeans(Jester5k))
cat(JesterJokes[best])

## A guy goes into confession and says to the priest, "Father, I'm 80 years old, widower, with 11 grandchildren. Last night I met two beautiful flight attendants. They took me home and I made love to both of them. Twice." The priest said: "Well, my son, when was the last time you were in confession?" "Never Father, I'm Jewish." "So then, why are you telling me?" "I'm telling everybody."

# Exploring the average ratings

average_ratings <- colMeans(Jester5k)
qplot(average_ratings) + stat_bin(binwidth = 0.1) +
   ggtitle("Distribution of the average joke rating")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The highest value is around 1, and there is no joke rate for 10 averagely or -10. Since this matrix is only for 5k and there is no missing values because missing values count as 99 which does not appear in this data set so we do not have to worry about the ratings are biased because of the missing values.

Item-based collaborative filtering. By textbook, the core algorithm is based on these steps 1. For each two items, measure how similar they are in terms of having received similar ratings by similar users. 2. For each item, idenitfy the k-most similar items 3. For each user, identify the items that are most similar to the user’s purchases

building IBCF model 1.Training set and Test set 2.Applying the recommender model on the test set. I will use the model to recommend jokes to the users in the test set and specify 5 jokes to recommend to users.

ratings_jokes <- Jester5k[rowCounts(Jester5k) > 50,
   colCounts(Jester5k) > 100]
# set the probability in the training set as 80%
which_train <- sample(x = c(TRUE, FALSE), size = nrow(ratings_jokes),
   replace = TRUE, prob = c(0.8, 0.2))
head(which_train)

## [1] TRUE TRUE TRUE TRUE TRUE TRUE

# split training and testing set
train <- ratings_jokes[which_train,]
test <- ratings_jokes[!which_train,]

#Remommendation models IBCF
recommender_models <- recommenderRegistry$get_entries(dataType ="realRatingMatrix")
names(recommender_models)

## [1] "ALS_realRatingMatrix"          "ALS_implicit_realRatingMatrix"
## [3] "IBCF_realRatingMatrix"         "POPULAR_realRatingMatrix"     
## [5] "RANDOM_realRatingMatrix"       "RERECOMMEND_realRatingMatrix" 
## [7] "SVD_realRatingMatrix"          "SVDF_realRatingMatrix"        
## [9] "UBCF_realRatingMatrix"

lapply(recommender_models, "[[", "description")

## $ALS_realRatingMatrix
## [1] "Recommender for explicit ratings based on latent factors, calculated by alternating least squares algorithm."
## 
## $ALS_implicit_realRatingMatrix
## [1] "Recommender for implicit data based on latent factors, calculated by alternating least squares algorithm."
## 
## $IBCF_realRatingMatrix
## [1] "Recommender based on item-based collaborative filtering."
## 
## $POPULAR_realRatingMatrix
## [1] "Recommender based on item popularity."
## 
## $RANDOM_realRatingMatrix
## [1] "Produce random recommendations (real ratings)."
## 
## $RERECOMMEND_realRatingMatrix
## [1] "Re-recommends highly rated items (real ratings)."
## 
## $SVD_realRatingMatrix
## [1] "Recommender based on SVD approximation with column-mean imputation."
## 
## $SVDF_realRatingMatrix
## [1] "Recommender based on Funk SVD with gradient descend."
## 
## $UBCF_realRatingMatrix
## [1] "Recommender based on user-based collaborative filtering."

recommender_models$IBCF_realRatingMatrix$parameters

## $k
## [1] 30
## 
## $method
## [1] "Cosine"
## 
## $normalize
## [1] "center"
## 
## $normalize_sim_matrix
## [1] FALSE
## 
## $alpha
## [1] 0.5
## 
## $na_as_zero
## [1] FALSE

#Build IBCF Model, model_details$sim is similarity matrix

IBCF_model <- Recommender(data = train, method= "IBCF", parameter = list (k =30))
IBCF_model

## Recommender of type 'IBCF' for 'realRatingMatrix' 
## learned using 3085 users.

model_detail <- getModel(IBCF_model)
dim(model_detail$sim)

## [1] 100 100

col_sums <- colSums(model_detail$sim > 0)
qplot(col_sums) + stat_bin(binwidth = 1) + ggtitle("Distribution of
   the column count")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# In this section, we can find which are the jokes with the most elements. 
which_max <- order(col_sums, decreasing = TRUE)[1:6]
rownames(model_detail$sim)[which_max]

## [1] "j44" "j16" "j57" "j58" "j36" "j53"

Applying the recommender model on the test set. (By textbook) 1. Extract the user rating of each purchase associated with this item. The rating is used as as a weight. 2. Extract the similarity of the item with each purchase associated with this item 3. Multiply each weight with the related similarity 4. Sum everything up.

n_recommended <- 5
predict_model <- predict(object = IBCF_model, newdata = test, n = n_recommended)
predict_model

## Recommendations as 'topNList' with n = 5 for 790 users.

class(predict_model)

## [1] "topNList"
## attr(,"package")
## [1] "recommenderlab"

slotNames(predict_model)

## [1] "items"      "ratings"    "itemLabels" "n"

# Example for the first user, recommendations
recc_user_1 <- predict_model@items[[1]]
jokes_user_1 <- predict_model@itemLabels[recc_user_1]
jokes_user_1

## [1] "j82" "j75" "j94" "j84" "j97"

IBCF bases on the similarity matrix. The model stors the 30-most (default number) similar. It will work very will with lots of data and big rating matrices.

Building the UBCF model (User based collaborative filtering) By textbook Identify which items are similar in terms of having been purchased by the same people Recommend to a new user the items that are similar to its purchase. 1). Measue how similar each user is to the new one. 2). Identify the most similar users. 3). Rate the items purchased by the most similar users. The rating is the average rating among similar users 4). Pick the top-rated items.

recommender_models <- recommenderRegistry$get_entries(dataType =
   "realRatingMatrix")
recommender_models$UBCF_realRatingMatrix$parameters

## $method
## [1] "cosine"
## 
## $nn
## [1] 25
## 
## $sample
## [1] FALSE
## 
## $normalize
## [1] "center"

UBCF_model <- Recommender(data = train, method = "UBCF")
UBCF_model

## Recommender of type 'UBCF' for 'realRatingMatrix' 
## learned using 3085 users.

model_details <- getModel(UBCF_model)
names(model_details)

## [1] "description" "data"        "method"      "nn"          "sample"     
## [6] "normalize"   "verbose"

recc_predicted <- predict(object = UBCF_model,
   newdata = test, n = n_recommended)
recc_predicted

## Recommendations as 'topNList' with n = 5 for 790 users.

recc_user_1 <- recc_predicted@items[[1]]
jokes_user_1 <- recc_predicted@itemLabels[recc_user_1]
jokes_user_1

## [1] "j72"  "j100" "j83"  "j80"  "j73"

Compare different models with the recommendations for the first user, only “Joke80” is in both models. According to the textbook, UBCF has to keep the entire database, it does not work good with bigger dataset but its accuracy is slightly better than IBCF.

Project2 Data 612

Vivian Kong

6/12/2019