Project 4


Deliverables:

  1. As in previous assignments, compare the accuracy of at least two recommender system algorithms against your offline data.

  2. Implement support for at least one business or user experience goal such as increased serendipity, novelty, or diversity.

  3. Compare and report on any change in accuracy before and after you’ve made the change in #2.

  4. As part of your textual conclusion, discuss one or more additional experiments that could be performed and/or metrics that could be evaluated only if online evaluation were possible. Also, briefly propose how you would design a reasonable online evaluation environment.

Dataset and Setup

Similar to the previous projects, this assignment dives into the use of recommender systems via the R package recommenderlab. The vignette for this library can be found here. The dataset will be pulled directly from the recommenderlab package, and this project expands upon examples produced in Chapter 3 and Chapter 4 of Building a Recommendation System with R (Suresh K. Gorakala and Michele Usuelli, 2015). The seed is set to 1 for repeatability of this script.

library(tidyverse)
library(readr)
library(sqldf)
library(dplyr)
library(tidyr)
library(tidymodels)
library(tinytex)
library(recommenderlab)
library(kableExtra)
library(gridExtra)

set.seed(1)

The Jester5k dataset from recommenderlab contains 100 jokes and 5000 users who have rated some of these jokes from -10 (worst) to 10 (best). The data can be extracted from Jester5k as a matrix and used in our recommendation system with a few modifications that will be outlined herein. Jokes that a user did not rate appear as missing values (0’s when converted to a matrix), and it is important that these observations are omitted from our evaluation. The distribution of joke ratings can be seen in the following figure. The plot is skewed left and indicates that most jokes are rated between 1 and 8, with 2-4 being the most frequent ratings; it is relatively rare for a joke to obtain a score of 10. A sample of 3 jokes from the Jester5k dataset is shown below.

data_package <- data(package = "recommenderlab")
data_package$results[, "Item"]
## [1] "Jester5k"                    "JesterJokes (Jester5k)"     
## [3] "MSWeb"                       "MovieLense"                 
## [5] "MovieLenseMeta (MovieLense)"
data("Jester5k")
kable(head(as.data.frame(JesterJokes),3)) %>% 
  kable_styling(full_width = T,
                position = "center",
                bootstrap_options = c("hover", "condense","responsive")) 
JesterJokes
j1 A man visits the doctor. The doctor says “I have bad news for you.You have cancer and Alzheimer’s disease”. The man replies “Well,thank God I don’t have cancer!”
j2 This couple had an excellent relationship going until one day he came home from work to find his girlfriend packing. He asked her why she was leaving him and she told him that she had heard awful things about him. “What could they possibly have said to make you move out?” “They told me that you were a pedophile.” He replied, “That’s an awfully big word for a ten year old.”
j3 Q. What’s 200 feet long and has 4 teeth? A. The front row at a Willie Nelson Concert.
recommender_models <- recommenderRegistry$get_entries(dataType = 
                                                        "realRatingMatrix")


vector_ratings <- as.vector(Jester5k@data)

table_ratings <- table(vector_ratings)

vector_ratings <- vector_ratings[vector_ratings != 0]



vector_ratings %>% as.data.frame() %>%
  ggplot(aes(x = .))+
  geom_histogram(fill = "skyblue", color = "black", alpha = 0.3, binwidth = 2)+
  geom_density(aes(y = 2*..count..))+
  ggtitle("Distribution of Joke Ratings")+
  theme(panel.grid = element_blank(),
        panel.background = element_blank(),
        legend.position = 'none')+
  labs(x = "Count of Ratings",
       y = "Rating")

Looking deeper into the Jester5k dataset, we can examine which jokes are the most viewed and most popular. The 3 figures below highlight:

- Average joke ratings with all jokes considered.
- Average joke ratings restricted to jokes with at least 100 ratings.
- Average rating given per user.

Based on the first two plots, we can assume that the Jester5k dataset is more robust than the MovieLens dataset in terms of missing values: the second plot shows the same distribution restricted to jokes with at least 100 ratings, and it is identical to the first. The third figure plots the average rating per user, which follows a roughly normal distribution with a mode between 1 and 2.

views_per_joke<-colCounts(Jester5k)

average_joke_ratings<- colMeans(Jester5k)
average_user_ratings<- rowMeans(Jester5k)

average_joke_ratings_relevant <- average_joke_ratings[views_per_joke > 100]
p1<-qplot(average_joke_ratings,alpha = 0.5)+theme_bw()+theme(legend.position = 'none')
p2<-qplot(average_joke_ratings_relevant, alpha = 0.5 )+theme_bw()+theme(legend.position = 'none')
p3<-qplot(average_user_ratings, alpha = 0.5 )+theme_bw()+theme(legend.position = 'none')


grid.arrange(p1,p2,p3, ncol = 1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Recommender System

Setup and user experience goal

The recommender systems that will be evaluated in this section are Item-Based Collaborative Filtering (IBCF), User-Based Collaborative Filtering (UBCF), Singular Value Decomposition (SVD), and Alternating Least Squares (ALS). These methods are described in previous projects, which can be found on my Github here.


The dataset will be evaluated with cross-validation and split into train/test sets at an 80/20 ratio. The number of recommendations made to each user is set to 10; this parameter will be tweaked during an accuracy check to assess model performance.

The goal of our model is serendipity: it will only count jokes as good recommendations when they carry an above-average/median rating of 3 (goodRating = 3 below). This slightly decreases recommendation accuracy compared to using 0, but it makes better quality recommendations.

#filtering for users who have rated at least 50 jokes, and jokes with at least 100 ratings
ratings_jokes <- Jester5k[rowCounts(Jester5k) > 50,  
                             colCounts(Jester5k) > 100] 


train_percent<-0.8
kept_items<-15
rating_threshold<-3
n_eval<-3
no_recommendations<-10



eval_sets <- evaluationScheme(data = ratings_jokes, method = "cross-validation", 
                              train = train_percent, given = kept_items, goodRating = rating_threshold, k = n_eval)


#used to train
recc_train<- getData(eval_sets,'train')
#used to predict
recc_known<-getData(eval_sets,'known')
#used to test
recc_unknown<-getData(eval_sets,'unknown')


IBCF is a method that looks at similarities between items (jokes) and makes recommendations from them. The algorithm considers the items a user has already consumed (jokes read) and recommends similar items. Its core steps are: measuring how similar two items are based on whether they receive similar ratings from similar users, identifying the k most similar items for each item, and producing user-specific recommendations from each user’s purchase history. The results of the top 20 most recommended jokes are shown in the following figure.
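As a minimal sketch of the similarity computation at the core of IBCF (using recommenderlab’s similarity() helper; the five-joke subset is arbitrary and only for illustration):

#item-item cosine similarities for the first five jokes; IBCF keeps only
#the k most similar items per joke when it builds its model
item_sims <- similarity(Jester5k[, 1:5], method = "cosine", which = "items")
as.matrix(item_sims)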
UBCF is a method that looks at similarities between users and makes recommendations based on this similarity. The algorithm measures how similar each user is to every other user; the top similar users can be selected with an algorithm like k-nearest neighbors, or by a threshold on the similarity value. Each neighbor’s ratings are then weighted by the similarity coefficient in order to prioritize recommendations. We can apply the same code and thought process to the recommender function with method UBCF; the most recommended jokes can be found in the corresponding figure below.
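A comparable sketch for the user side; nn is recommenderlab’s neighborhood-size parameter for UBCF, and the value 25 below is only illustrative:

#user-user pearson similarities for a handful of users
user_sims <- similarity(Jester5k[1:5, ], method = "pearson", which = "users")
#a UBCF model restricted to the 25 most similar users per prediction
ubcf_knn <- Recommender(recc_train, method = "UBCF",
                        parameter = list(method = "pearson", nn = 25))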
Singular value decomposition (SVD) is a common dimensionality reduction technique that identifies latent semantic factors for retrieving information. Some of the difficulties associated with this technique can be attributed to the sparse data (missing values) often present in user-item matrices: filling huge matrices with missing values can be expensive or even misleading. Regularization models (penalty-based error minimization) can assist with this.
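recommenderlab also ships funkSVD(), a regularized (Funk) SVD that illustrates this penalty-based approach: gamma is the learning rate and lambda the regularization weight, and the 500-user subset and small k below are chosen only for illustration:

#coerce to a plain matrix; NAs mark the unrated jokes and are skipped by
#the gradient descent rather than imputed
m <- as(Jester5k[1:500, ], "matrix")
fsvd <- funkSVD(m, k = 5, gamma = 0.015, lambda = 0.001)
#reconstructed ratings (users x jokes) from the two factor matrices
r_hat <- tcrossprod(fsvd$U, fsvd$V)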

The Alternating Least Squares (ALS) algorithm factorizes a matrix into two factor matrices such that the original matrix is approximately equal to the transpose of the first factor multiplied by the second.
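Rather than guessing parameter names for ALS, we can query its registry entry, as we did for the full method list above; its parameters typically include the number of latent factors, the regularization weight lambda, and the number of alternating iterations:

#inspect the tunable parameters of the registered ALS method
recommenderRegistry$get_entry("ALS", dataType = "realRatingMatrix")$parameters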

recc_eval<-Recommender(recc_train,method = "IBCF", parameter = NULL)

recc_predict<-predict(object = recc_eval,newdata = recc_known, n = no_recommendations, type = "ratings")


recc_predicted<-predict(object = recc_eval,newdata=recc_known,n=no_recommendations)


recc_matrix <- sapply(recc_predicted@items, function(x){
  colnames(ratings_jokes)[x]
})



number_of_items<-recc_matrix %>% unlist() %>% table() %>% as.data.frame()


table_top <- data.frame("names.number_of_items_top." = number_of_items$.,  
                        "number_of_items_top"= number_of_items$Freq)

table_top %>%
  top_n(20) %>% arrange(desc(number_of_items_top)) %>% 
  ggplot(mapping = aes(x=fct_reorder(names.number_of_items_top.,-as.numeric(number_of_items_top)), y = as.numeric(number_of_items_top)))+
  geom_col(aes(fill = as.numeric(number_of_items_top)),color = 'black', alpha = 0.5)+
  theme(axis.text.x = element_text(angle = 90),
        legend.position = 'none',
        panel.grid = element_blank(),
        panel.background = element_blank())+
  labs(x = "Joke_ID",
       y = "Number of recommendations",
       title = "Top 20 Joke Recomendations")
## Selecting by number_of_items_top

topjokes<-data.frame()
#collect the top 3 IBCF jokes; the top UBCF jokes are appended below
topjokes<-rbind(topjokes,table_top %>%
  top_n(3) %>% arrange(desc(number_of_items_top)))
## Selecting by number_of_items_top

recc_eval<-Recommender(recc_train,method = "UBCF")

recc_predict<-predict(object = recc_eval,newdata = recc_known, n = no_recommendations, type = "ratings")


#this method will use the user rating as a weight, the similarity of the joke to other jokes, and multiply the weight by the similarities to come up with recommendations


recc_predicted<-predict(object = recc_eval,newdata=recc_known,n=no_recommendations)

recc_matrix <- sapply(recc_predicted@items, function(x){
  colnames(ratings_jokes)[x]
})


number_of_items<-recc_matrix %>% unlist() %>% table() %>% as.data.frame()

table_top <- data.frame("names.number_of_items_top." = number_of_items$.,  
                        "number_of_items_top"= number_of_items$Freq)


table_top %>%
  top_n(20) %>% 
  ggplot(mapping = aes(x=fct_reorder(names.number_of_items_top.,-as.numeric(number_of_items_top)), y = as.numeric(number_of_items_top)))+
  geom_col(aes(fill = as.numeric(number_of_items_top)),color = 'black', alpha = 0.5)+
  theme(axis.text.x = element_text(angle = 90),
        legend.position = 'none',
        panel.grid = element_blank(),
        panel.background = element_blank())+
  labs(x = "Joke_ID",
       y = "Number of recommendations",
       title = "Top 20 Joke Recomendations")
## Selecting by number_of_items_top

topjokes<-rbind(topjokes,table_top %>%
  top_n(3) %>% arrange(desc(number_of_items_top)))
## Selecting by number_of_items_top

Evaluating Recommenders

To evaluate the recommenders, we calculate the RMSE, MSE, and MAE for each model. For the collaborative filtering models (UBCF and IBCF) we use three similarity methods: Pearson, Jaccard, and cosine; for the factorization models (SVD and ALS) we use Z-score and center normalization. The table below summarizes each model’s errors: the best performing model was UBCF with Pearson correlation, followed by UBCF cosine and the two SVD variants, while IBCF Jaccard and ALS with centered normalization performed worst.

{
  UBCF_pearson_eval<-Recommender(recc_train,method = "UBCF", parameter = list(method = "pearson"))
  UBCF_jaccard_eval<-Recommender(recc_train,method = "UBCF", parameter = list(method = "jaccard"))
  UBCF_cosine_eval<- Recommender(recc_train,method = "UBCF", parameter = list(method = "cosine"))
  IBCF_pearson_eval<-Recommender(recc_train,method = "IBCF", parameter = list(method = "pearson"))
  IBCF_jaccard_eval<-Recommender(recc_train,method = "IBCF", parameter = list(method = "jaccard"))
  IBCF_cosine_eval<- Recommender(recc_train,method = "IBCF", parameter = list(method = "cosine"))
  SVD_Z_eval<-       Recommender(recc_train,method = "SVD", parameter = list(normalize = "Z-score"))
  SVD_center_eval<-  Recommender(recc_train,method = "SVD", parameter = list(normalize = "center"))
  ALS_Z_eval<-       Recommender(recc_train,method = "ALS", parameter = list(normalize = "Z-score"))
  ALS_center_eval<- Recommender(recc_train,method = "ALS", parameter = list(normalize = "center"))
}
#helper that returns the top-n recommendations for a fitted model
recommendations<-function(eval){
  predict(object = eval, newdata = recc_known, n = no_recommendations)
}

ModelErrors<-function(eval){
  
    recc_predicted<-predict(object = eval,newdata=recc_known,n=no_recommendations, type= "ratings")
    calcPredictionAccuracy(recc_predicted,recc_unknown, byUser = F)

}


rbind(
  UBCF_pearson= ModelErrors(UBCF_pearson_eval),
  UBCF_jaccard = ModelErrors(UBCF_jaccard_eval),
  UBCF_Cosine = ModelErrors(UBCF_cosine_eval),
  IBCF_pearson= ModelErrors(IBCF_pearson_eval),
  IBCF_jaccard = ModelErrors(IBCF_jaccard_eval),
  IBCF_Cosine = ModelErrors(IBCF_cosine_eval),
  SVD_Z= ModelErrors(SVD_Z_eval),
  SVD_Center= ModelErrors(SVD_center_eval),
  ALS_Z= ModelErrors(ALS_Z_eval),
  ALS_Center= ModelErrors(ALS_center_eval)
) %>% 
  kable() %>%   
  kable_styling(full_width = F,
                position = "center",
                bootstrap_options = c("hover", "condense","responsive")) 
Model         RMSE      MSE       MAE
UBCF_pearson  4.495049  20.20546  3.524677
UBCF_jaccard  4.641267  21.54136  3.670446
UBCF_Cosine   4.542053  20.63024  3.565330
IBCF_pearson  4.697234  22.06400  3.593574
IBCF_jaccard  5.688409  32.35800  4.324923
IBCF_Cosine   5.245499  27.51526  4.099757
SVD_Z         4.649408  21.61699  3.682991
SVD_Center    4.647978  21.60370  3.681730
ALS_Z         5.207312  27.11610  4.404687
ALS_Center    5.667964  32.12582  4.500189

There are many different recommender systems, and several metrics can be used to determine which models produce better results. Common metrics for model evaluation include: root mean square error (RMSE) as shown above, mean square error (MSE), mean absolute error (MAE), the receiver operating characteristic (ROC) curve, and the area under the ROC curve (AUC). The metric presented below is the ROC curve for all models. In addition to the models, a random or “guessing” recommender is produced in order to have a baseline against which to compare model performance. The plot below shows the results of this evaluation; the UBCF Pearson model performs the best amongst all of the different methods.

  models_to_evaluate <- list(
    IBCF_cosine = list(name = "IBCF", param = list(method = "cosine")), 
    IBCF_pearson = list(name = "IBCF", param = list(method ="pearson")), 
    IBCF_jaccard = list(name = "IBCF", param = list(method ="jaccard")), 
    UBCF_cosine = list(name = "UBCF", param = list(method ="cosine")), 
    UBCF_pearson = list(name = "UBCF", param = list(method ="pearson")), 
    UBCF_jaccard = list(name = "UBCF", param = list(method ="jaccard")), 
    SVD_Z = list(name = "SVD", param = list(normalize = "Z-score")),
    SVD_C = list(name = "SVD", param = list(normalize = "center")),
    ALS_N = list(name = "ALS", param = list(normalize = NULL)),
    ALS_C = list(name = "ALS", param = list(normalize = "center")),
    ALS_Z = list(name = "ALS", param = list(normalize = "Z-score")),
    random = list(name = "RANDOM", param=NULL)
  )
  
  n_recommendations <- c(1, 5, seq(10, 100, 10))

  list_results <- evaluate(x = eval_sets, method = models_to_evaluate, n= n_recommendations)
## IBCF run fold/sample [model time/prediction time]
##   1  [0.12sec/0.2sec] 
##   2  [0.13sec/0.19sec] 
##   3  [0.13sec/0.2sec] 
## IBCF run fold/sample [model time/prediction time]
##   1  [0.12sec/0.22sec] 
##   2  [0.14sec/0.16sec] 
##   3  [0.12sec/0.22sec] 
## IBCF run fold/sample [model time/prediction time]
##   1  [0.14sec/0.16sec] 
##   2  [0.16sec/0.19sec] 
##   3  [0.12sec/0.32sec] 
## UBCF run fold/sample [model time/prediction time]
##   1  [0sec/2.62sec] 
##   2  [0.02sec/2.74sec] 
##   3  [0sec/2.95sec] 
## UBCF run fold/sample [model time/prediction time]
##   1  [0sec/2.66sec] 
##   2  [0.01sec/2.68sec] 
##   3  [0.02sec/2.48sec] 
## UBCF run fold/sample [model time/prediction time]
##   1  [0.01sec/2.77sec] 
##   2  [0.02sec/2.56sec] 
##   3  [0sec/2.5sec] 
## SVD run fold/sample [model time/prediction time]
##   1  [0.09sec/0.19sec] 
##   2  [0.09sec/0.2sec] 
##   3  [0.07sec/0.19sec] 
## SVD run fold/sample [model time/prediction time]
##   1  [0.1sec/0.16sec] 
##   2  [0.03sec/0.15sec] 
##   3  [0.06sec/0.15sec] 
## ALS run fold/sample [model time/prediction time]
##   1  [0sec/86.17sec] 
##   2  [0sec/86.75sec] 
##   3  [0sec/86.02sec] 
## ALS run fold/sample [model time/prediction time]
##   1  [0sec/86.67sec] 
##   2  [0sec/87.8sec] 
##   3  [0sec/85.84sec] 
## ALS run fold/sample [model time/prediction time]
##   1  [0sec/85.92sec] 
##   2  [0.02sec/88.27sec] 
##   3  [0sec/90.77sec] 
## RANDOM run fold/sample [model time/prediction time]
##   1  [0sec/0.18sec] 
##   2  [0sec/0.18sec] 
##   3  [0.02sec/0.17sec]
  avg_matrices <- lapply(list_results, avg)
  plot(list_results, annotate = 1, legend = "topleft") 
  title("ROC curve")

To see the top jokes from some of our recommender models, we can match the joke IDs back to the Jester5k dataset and extract the joke text. The following table shows the resulting top jokes from a few models.

JesterJokesDF<-data.frame(JesterJokes)
JesterJokesDF$names.number_of_items_top.<-row.names(JesterJokesDF)
kable(JesterJokesDF[match(unique(topjokes$names.number_of_items_top.),JesterJokesDF$names.number_of_items_top.),]) %>% 
  kable_styling(full_width = T,
                position = "center",
                bootstrap_options = c("hover", "condense","responsive")) 
JesterJokes names.number_of_items_top.
j82 Q: How do you keep a computer programmer in the shower all day long? A: Give them a shampoo with a label that says “rinse, lather, repeat”. j82
j50 A guy goes into confession and says to the priest, “Father, I’m 80 years old, widower, with 11 grandchildren. Last night I met two beautiful flight attendants. They took me home and I made love to both of them. Twice.” The priest said: “Well, my son, when was the last time you were in confession?” “Never Father, I’m Jewish.” “So then, why are you telling me?” “I’m telling everybody.” j50
j32 A man arrives at the gates of heaven. St. Peter asks, “Religion?” The man says, “Methodist.” St. Peter looks down his list, and says, “Go to room 24, but be very quiet as you pass room 8.” Another man arrives at the gates of heaven. “Religion?” “Baptist.” “Go to room 18, but be very quiet as you pass room 8.” A third man arrives at the gates. “Religion?” “Jewish.” “Go to room 11, but be very quiet as you pass room 8.” The man says, “I can understand there being different rooms for different religions, but why must I be quiet when I pass room 8?” St. Peter tells him, "Well the Catholics are in room 8, and they think they’re the only ones here. j32
j36 A guy walks into a bar, orders a beer and says to the bartender, “Hey, I got this great Polish Joke…” The barkeep glares at him and says in a warning tone of voice: “Before you go telling that joke you better know that I’m Polish, both bouncers are Polish and so are most of my customers” “Okay” says the customer,“I’ll tell it very slowly.” j36

Another experiment that could be performed would be to look at the tf-idf of the jokes and compare similarities in their text. This may prove to be an interesting analysis; however, the jokes are relatively short texts, so it is unlikely to yield significant results.
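A minimal sketch of that experiment, assuming the tidytext package is available (unnest_tokens() and bind_tf_idf() are tidytext functions, and nothing below is part of the analysis above):

library(tidytext)

#tf-idf of each word within each joke; high scores flag words that are
#distinctive to a particular joke
joke_tfidf <- data.frame(joke_id = names(JesterJokes),
                         text = as.character(JesterJokes)) %>%
  unnest_tokens(word, text) %>%
  count(joke_id, word) %>%
  bind_tf_idf(word, joke_id, n) %>%
  arrange(desc(tf_idf))

head(joke_tfidf)

The resulting per-joke tf-idf vectors could then be compared pairwise (e.g., with cosine similarity) to see whether textually similar jokes are also co-recommended.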