Jokes Recommendation Jester5k

Deliverables 1. Compare the accuracy of at least two recommender system algorithms against your offline data. 2. Implement support for at least one business or user experience goal such as increased serendipity, novelty, or diversity. 3. Compare and report on any change in accuracy before and after you’ve made the change in #2. 4. As part of your textual conclusion, discuss one or more additional experiments that could be performed and/or metrics that could be evaluated only if online evaluation was possible. Also, briefly propose how you would design a reasonable online evaluation environment.

By textbook, the collaborative filtering algorithms are based on measuring the similarity between users of between items. In the “recommenderlab”, I use Jester5k, the joke rating matrix for this project. This dataset Jester5k contains jokes ratings by users and joke items.

I used the IBCF and UBCF recommendations systems for the recommedation. For each system, I used the method cosine, and interchanged between the center normalization and the Z-score normalization. I compared the precision and prediction among different models, and finally, I applied some recommendations to certain users (user#16 in this case). ### Reading Data

library("recommenderlab")

## Loading required package: Matrix

## Loading required package: arules

## 
## Attaching package: 'arules'

## The following objects are masked from 'package:base':
## 
##     abbreviate, write

## Loading required package: proxy

## 
## Attaching package: 'proxy'

## The following object is masked from 'package:Matrix':
## 
##     as.matrix

## The following objects are masked from 'package:stats':
## 
##     as.dist, dist

## The following object is masked from 'package:base':
## 
##     as.matrix

## Loading required package: registry

## Registered S3 methods overwritten by 'registry':
##   method               from 
##   print.registry_field proxy
##   print.registry_entry proxy

library("ggplot2")
getwd()

## [1] "C:/Users/user/Documents"

# Lets quickly load and explore the dataset
data (Jester5k)
Jester5k

## 5000 x 100 rating matrix of class 'realRatingMatrix' with 362106 ratings.

 methods (class = class(Jester5k))

##  [1] [                      [<-                    binarize              
##  [4] calcPredictionAccuracy coerce                 colCounts             
##  [7] colMeans               colSds                 colSums               
## [10] denormalize            dim                    dimnames              
## [13] dimnames<-             dissimilarity          evaluationScheme      
## [16] getData.frame          getList                getNormalize          
## [19] getRatingMatrix        getRatings             getTopNLists          
## [22] image                  normalize              nratings              
## [25] Recommender            removeKnownRatings     rowCounts             
## [28] rowMeans               rowSds                 rowSums               
## [31] sample                 show                   similarity            
## see '?methods' for accessing help and source code

 JesterMatrix <- as(Jester5k, "matrix")
 dim(JesterMatrix)

## [1] 5000  100

Visualizing Data

# The histogram for the rating data
hist(getRatings(Jester5k),main="Distributions of Ratings")

As we can see, there are 5000 users in this matrix, 100 jokes. Thenumbers of ratings are 362106. The range of ratings is from -10 to 10

# Exploring the average ratings
average_ratings <- colMeans(Jester5k)

qplot(average_ratings) + stat_bin(binwidth = 0.2) +
   ggtitle("Average Joke Rating Distribution")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The most popular average rating per joke is 1.0, ranges from -4 to 4. The disbribution of average ratings is a little skewed to the right. Next, we are exploring the similarity matrix, which is a recommenderlab function that takes the realRatingMatrix and calculates a cosine similariyt wich aids in the invesgiagtion of model develoment. The darker the color, the more similar between the two.

## user user collaborative Filtering
#similarity matrix
similarity_users <- similarity(Jester5k[1:5, ], method = "cosine", which = "users")
image(as.matrix(similarity_users), main = "User Similarity")

## Similarity Matrix for item-item
## user user collaborative Filtering
#similarity matrix
similarity_items <- similarity(Jester5k[1:20, ], method = "cosine", which = "item")
image(as.matrix(similarity_users), main = "Item Similarity")

The entire ratings are vizualized in this gradient figure as well, using this function from data.table package:

library(data.table)
jmatrix <- matrix(as.vector(Jester5k@data), nrow = Jester5k@data@Dim[1], ncol = Jester5k@data@Dim[2])  ## 5000 BY 100

ggplot(melt(jmatrix), aes(Var1, Var2, fill = value)) + 
  geom_raster() + 
  scale_fill_gradientn(colours=c("#0000FFFF","#FFFFFFFF","#FF0000FF"), name = 'Rating') + 
  scale_x_discrete("USERS", breaks = NULL, labels = NULL) +   scale_y_discrete("ITEMS", breaks = NULL, labels = NULL) + 
  theme(legend.position = 'bottom') + 
  ggtitle('All Ratings')

## Warning in melt(jmatrix): The melt generic in data.table has been passed a
## matrix and will attempt to redirect to the relevant reshape2 method; please note
## that reshape2 is deprecated, and this redirection is now deprecated as well.
## To continue using melt methods from reshape2 while both libraries are attached,
## e.g. melt.list, you can prepend the namespace like reshape2::melt(jmatrix). In
## the next version, this warning will become an error.

You can see from the results in the graphic that is not showing accurately the user sentiment (the darker the color tone the closer is to the grade boundaries -10,10). Results are in the dataset have sparsity and the instances with no answer were transformed to 0 meaning that results bias towards that 0 in the graphic. The data is further reduced to users who have rated 50 mor and jokes, and jokes which were rated by more than 100 users. Therefore, to improve the accuracy of recommendations and reduce the noise.

Data Reduction and Splitting

ratings_jokes <- Jester5k[rowCounts(Jester5k) > 50, colCounts(Jester5k) > 100]
dim(ratings_jokes)  ## 3875 users by 100 items

## [1] 3875  100

set.seed(4567)
## kept at 5 unknown ratings, rest as known, 
evaldata <- evaluationScheme(ratings_jokes, method="split", train=0.9, given=-5, goodRating=1)

UBCF, Z Score Normalation

#Create UBCF Recommender Model. UBCF stands for User-Based Collaborative Filtering
## normalize by Z score
UBCF_model <- Recommender( getData(evaldata, "train"),
          method = "UBCF",
          param=list(normalize = "Z-score",    method="Cosine",    nn=35))

## top 4 Jokes Reco for User16
UserSelectivePred <- predict(UBCF_model,
                  Jester5k[16,],    
                  n=4, type = "topNList") 
# UserSelectivePred


# recc_user_16_2<- UserSelectivePred[[1]]

 # # cat(JesterJokes[recc_user_16[2]])
 # substr(JesterJokes[recc_user_16_2[2]],0,3000)

# Now predicting for the test set and checking accuracy of model
recom <- predict(UBCF_model,
                       getData(evaldata, "known"),
                      n=4, type = "ratings")
calcPredictionAccuracy(recom,  getData(evaldata, "unknown"))

##      RMSE       MSE       MAE 
##  4.087526 16.707866  3.146767

UBCF, Center Normalation

## change the normalization to center
UBCF_model2 <- Recommender( getData(evaldata, "train"),
          method = "UBCF",
          param=list(normalize = "center",    method="Cosine",    nn=35))

## top 4 Jokes Reco for User16
UserSelectivePred2 <- predict(UBCF_model2,
                  ratings_jokes[16,],    
                  n=4, type = "topNList")

UserSelectivePred2 <- as(UserSelectivePred2, "list")
# Get recommendation for User No16 only top 4 Jokes
UserSelectivePred2[[1]]

## character(0)

#Making predictions in ratings
prediction2 <- predict(UBCF_model2, ratings_jokes [1:6, ], type="ratings")
 # as(prediction2, "matrix")[,1:4]  ## why all NA

recom2 <- predict(UBCF_model2,
                       getData(evaldata, "known"),
                      n=4, type = "ratings")
calcPredictionAccuracy(recom2,  getData(evaldata, "unknown"))

##      RMSE       MSE       MAE 
##  4.098461 16.797381  3.194382

IBCF, Z Score Normalation

## IBCF Model
IBCF_model <- Recommender( getData(evaldata, "train"),
          method = "IBCF",
          param=list(normalize = "Z-score",     method="Cosine",     k=35))

# Now we will see top 4 Joke recommendation for User No16
UserSelectivePred.IBCF <- predict(IBCF_model,
                              Jester5k[16,],
                              n=4, type = "topNList")

UserSelectivePred.IBCF <- as(UserSelectivePred.IBCF, "list")
# Get recommendation for User No16 only top 4 Jokes, but why all NA
UserSelectivePred.IBCF[[1]]

## [1] "j82" "j77" "j90" "j79"

#Making predictions in ratings
prediction_IBCF <- predict(IBCF_model, ratings_jokes[1:6, ], type="ratings")
# as(prediction_IBCF, "matrix")[,1:4]

#  test set and calculate the RMSE
recom.IBCF <- predict(IBCF_model,
                   getData(evaldata, "known"),
                  n=4, type = "ratings")

calcPredictionAccuracy(recom.IBCF,  getData(evaldata, "unknown"))

##      RMSE       MSE       MAE 
##  4.514320 20.379089  3.624641

# Now we will see top 4 Joke recommendation for User No16
UserSelectivePred.IBCF <- predict(IBCF_model,
                      ratings_jokes[16,],
                      n=4, type = "topNList")
UserSelectivePred.IBCF <- as(UserSelectivePred.IBCF, "list")
# Get recommendation for User No16 only top 4 Jokes
UserSelectivePred.IBCF[[1]]

## character(0)

#Making predictions in ratings
prediction_IBCF3 <- predict(IBCF_model, Jester5k [1:6, ], type="ratings")
as(prediction_IBCF3, "matrix")[,1:3]

##        j1 j2 j3
## u2841  NA NA NA
## u15547 NA NA NA
## u15221 NA NA NA
## u15573 NA NA NA
## u21505 NA NA NA
## u15994 NA NA NA

IBCF, Center Normalization

## change the normalization to center
## IBCF Model
IBCF_model2 <- Recommender( getData(evaldata, "train"),
          method = "IBCF",
          param=list(normalize = "Center",     method="Cosine",     k=35))

# top 4 Joke recommendation for User No16
UserSelectivePred.IBCF2 <- predict(IBCF_model2,
                              ratings_jokes[16,],
                              n=4, type = "topNList")
UserSelectivePred.IBCF2 <- as(UserSelectivePred.IBCF2, "list")

# Get recommendation for User No16 only top 4 Jokes
UserSelectivePred.IBCF2[[1]]

## character(0)

#Making predictions in ratings
prediction2 <- predict(IBCF_model2, ratings_jokes [1:6, ], type="ratings")
# as(prediction2, "matrix")[,1:3]

#  test set and calculate the RMSE
recom.IBCF2 <- predict(IBCF_model2,
                   getData(evaldata, "known"),
                  n=4, type = "ratings")

calcPredictionAccuracy(recom.IBCF2,  getData(evaldata, "unknown"))

##      RMSE       MSE       MAE 
##  4.516414 20.397999  3.615932

# top 4 Joke recommendation for User No16
UserSelectivePred.IBCF2 <- predict(IBCF_model2,
                      ratings_jokes[16,],
                      n=4, type = "topNList")

UserSelectivePred.IBCF2 <- as(UserSelectivePred.IBCF2, "list")
# Get recommendation for User No16 only top 4 Jokes
UserSelectivePred.IBCF2[[1]]

## character(0)

From above 4 model run(UBCF Z-score, UBCF center, IBCF Z score, IBCF center), we see that RMSE is 1.111, 1.1126, 1.300, 1.194 respectively, which indicates that UBCF performs better than IBCF(UBCF have a smaller RMSE than IBDF). The center normalization method performs better than the Z score normalization.

ROC curve, Prec/Rec Curve

models_to_evaluate = list (
    IBCF_COS = list (name ='IBCF', param = list(method="cosine")),
    UBCF_COS =  list (name='UBCF', param=list(method="cosine")),
     random = list(name ="RANDOM", param = NULL))
                   
n_recommendations = c(1,5,10,20, 25)
results = evaluate (x=evaldata, method = models_to_evaluate, n = n_recommendations)

## IBCF run fold/sample [model time/prediction time]
##   1  [0.22sec/0.05sec] 
## UBCF run fold/sample [model time/prediction time]
##   1  [0.03sec/1.34sec] 
## RANDOM run fold/sample [model time/prediction time]
##   1  [0.02sec/0.03sec]

plot(results, y="ROC", annotate = 1, legend ="topleft")
title ("ROC Curve")

plot (results, y ='prec/rec', annotate=1)
title ("Precision-Recall")

From above, we see that the UBCF model performs better than IBCF model, when using cosine correlation as the choice of method. And UBCF runs much quicker than IBCF, so it is a much better choice among the two. Since UBCF is better, next we will examine within UBCF the choice of Cosine Correlation versus person’s correlation.

models_to_evaluate2 = list (
    UBCF_COS = list (name ='UBCF', param = list(method="cosine")),
    UBCF_pear =  list (name='UBCF', param=list(method="pearson")),
     random = list(name ="RANDOM", param = NULL))

n_recommendations2 = c(1,5,10,20, 25)
results2 = evaluate (x=evaldata, method = models_to_evaluate2, n = n_recommendations)

## UBCF run fold/sample [model time/prediction time]
##   1  [0.03sec/1.39sec] 
## UBCF run fold/sample [model time/prediction time]
##   1  [0.06sec/1.9sec] 
## RANDOM run fold/sample [model time/prediction time]
##   1  [0sec/0.04sec]

plot(results2, y="ROC", annotate = 1, legend ="topleft")
title ("ROC Curve")

plot (results2, y ='prec/rec', annotate=1)
title ("Precision-Recall")

We can see that the pearson’s correlation outperforms the cosine correlation in better areas under curve in ROC curve. This is the case for UBCF model. Their computational time is very similar.

Since the RMSE is lowest in the model of UBCF with Z-score normalization, among all other models. We decide to apply this model for the recommendated jokes for our user #1. We selected the top 4 jokes for this user, and print the first 3 jokes out. We used the substring function to extract the first 3000 characters so that the long jokes can be extracted without truncation.

Application on User1

set.seed(444)

## UBCF Recommender Model.## normalize by Z score
# UBCF_model <- Recommender( getData(evaldata, "train"),
#           method = "UBCF",
#           param=list(normalize = "Z-score",    method="Cosine",    nn=35))
# 
# ## top 4 Jokes Reco for User16
# UserSelectivePred <- predict(UBCF_model,
#                   Jester5k[16,],    
#                   n=4, type = "topNList") 

eval <- evaluationScheme(ratings_jokes, method="split", train=0.9, given=-5, goodRating=1)

train <- getData(eval, "train")
known <- getData(eval, "known")
unknown <- getData(eval, "unknown")

pred2 <- predict(object = UBCF_model, newdata = unknown, type = "topNList", n = 4)

# top 4 Jokes Reco for User1
recc_user_1 <- pred2@items[[1]]
movies_user_1 <- pred2@itemLabels[recc_user_1]

print (JesterJokes[movies_user_1[1]])

##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               j32 
## "A man arrives at the gates of heaven. St. Peter asks, \"Religion?\" The man says, \"Methodist.\" St. Peter looks down his list, and says, \"Go to room 24, but be very quiet as you pass room 8.\" Another man arrives at the gates of heaven. \"Religion?\" \"Baptist.\" \"Go to room 18, but be very quiet as you pass room 8.\" A third man arrives at the gates. \"Religion?\" \"Jewish.\" \"Go to room 11, but be very quiet as you pass room 8.\" The man says, \"I can understand there being different rooms for different religions, but why must I be quiet when I pass room 8?\" St. Peter tells him, \"Well the Catholics are in room 8, and they think they're the only ones here."

substr(JesterJokes[movies_user_1[1]], 1, 3000)  ## have the firt 3000 characters

##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               j32 
## "A man arrives at the gates of heaven. St. Peter asks, \"Religion?\" The man says, \"Methodist.\" St. Peter looks down his list, and says, \"Go to room 24, but be very quiet as you pass room 8.\" Another man arrives at the gates of heaven. \"Religion?\" \"Baptist.\" \"Go to room 18, but be very quiet as you pass room 8.\" A third man arrives at the gates. \"Religion?\" \"Jewish.\" \"Go to room 11, but be very quiet as you pass room 8.\" The man says, \"I can understand there being different rooms for different religions, but why must I be quiet when I pass room 8?\" St. Peter tells him, \"Well the Catholics are in room 8, and they think they're the only ones here."

# cat(JesterJokes[movies_user_1[2]])
substr(JesterJokes[movies_user_1[2]],0,3000)

##                                                                                                                                                                                                                                                                                                                                                                                                            j50 
## "A guy goes into confession and says to the priest, \"Father, I'm 80 years old, widower, with 11 grandchildren. Last night I met two beautiful flight attendants. They took me home and I made love to both of them. Twice.\" The priest said: \"Well, my son, when was the last time you were in confession?\" \"Never Father, I'm Jewish.\" \"So then, why are you telling me?\" \"I'm telling everybody.\""

substr(JesterJokes[movies_user_1[3]],0,3000)

##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             j53 
## "One Sunday morning William burst into the living room and said, \"Dad! Mom! I have some great news for you! I am getting married to the most beautiful girl in town. She lives a block away and her name is Susan.\" After dinner, William's dad took him aside. \"Son, I have to talk with you. Your mother and I have been married 30 years.. She's a wonderful wife but she has never offered much excitement in the bedroom, so I used to fool around with women a lot. Susan is actually your half-sister, and I'm afraid you can't marry her.\" William was heart-broken. After eight months he eventually started dating girls again. A year later he came home and very proudly announced, \"Dianne said yes! We're getting married in June.\" Again his father insisted on another private conversation and broke the sad news. \"Dianne is your half-sister too, William. I'm awfully sorry about this.\" William was furious! He finally decided to go to his mother with the news. \"Dad has done so much harm.. I guess I'm never going to get married,\" he complained. \"Every time I fall in love, Dad tells me the girl is my half-sister.\" His mother just shook her head. \"Don't pay any attention to what he says, dear. He's not really your father.\""