Deliverables:
1. As in previous assignments, compare the accuracy of at least two recommender system algorithms against your offline data.
2. Implement support for at least one business or user experience goal, such as increased serendipity, novelty, or diversity.
3. Compare and report on any change in accuracy before and after you've made the change in #2.
4. As part of your textual conclusion, discuss one or more additional experiments that could be performed and/or metrics that could be evaluated only if online evaluation were possible. Also, briefly propose how you would design a reasonable online evaluation environment.
Similar to the previous projects, this assignment dives into the use of recommender systems via the R package recommenderlab. The vignette for this library can be found here. The dataset will be pulled directly from the recommenderlab package, and this project expands upon examples produced in Chapters 3 and 4 of Building a Recommendation System with R (Suresh K. Gorakala and Michele Usuelli, 2015). The seed is set to 1 for repeatability of this script.
library(tidyverse)
library(readr)
library(sqldf)
library(dplyr)
library(tidyr)
library(tidymodels)
library(tinytex)
library(recommenderlab)
library(kableExtra)
library(gridExtra)
set.seed(1)
The Jester5k dataset from recommenderlab contains 100 jokes and 5,000 users who have rated some of these jokes from -10 (worst) to 10 (best). The data can be extracted from Jester5k as a matrix and used in our recommendation system with a few modifications that will be outlined herein. Jokes that a user did not rate appear as NAs in the dataset (0s when converted to a matrix), and it is important that these observations are omitted from our evaluation. The distribution of joke ratings can be seen in the following figure. This plot is skewed left and indicates that most jokes are rated between 1 and 8, with 2-4 being the most frequent ratings; it is relatively rare for jokes to obtain a score of 10. A sample of 3 jokes from the Jester5k dataset is shown below.
data_package <- data(package = "recommenderlab")
data_package$results[, "Item"]
## [1] "Jester5k" "JesterJokes (Jester5k)"
## [3] "MSWeb" "MovieLense"
## [5] "MovieLenseMeta (MovieLense)"
data("Jester5k")
kable(head(as.data.frame(JesterJokes),3)) %>%
kable_styling(full_width = T,
position = "center",
bootstrap_options = c("hover", "condense","responsive"))
| | JesterJokes |
|---|---|
| j1 | A man visits the doctor. The doctor says “I have bad news for you.You have cancer and Alzheimer’s disease”. The man replies “Well,thank God I don’t have cancer!” |
| j2 | This couple had an excellent relationship going until one day he came home from work to find his girlfriend packing. He asked her why she was leaving him and she told him that she had heard awful things about him. “What could they possibly have said to make you move out?” “They told me that you were a pedophile.” He replied, “That’s an awfully big word for a ten year old.” |
| j3 | Q. What’s 200 feet long and has 4 teeth? A. The front row at a Willie Nelson Concert. |
recommender_models <- recommenderRegistry$get_entries(dataType =
"realRatingMatrix")
vector_ratings <- as.vector(Jester5k@data)
table_ratings <- table(vector_ratings)
vector_ratings <- vector_ratings[vector_ratings != 0]
vector_ratings %>% as.data.frame() %>%
ggplot(aes(x = .))+
geom_histogram(fill = "skyblue", color = "black", alpha = 0.3, binwidth = 2)+
geom_density(aes(y = 2*..count..))+
ggtitle("Distribution of Joke Ratings")+
theme(panel.grid = element_blank(),
panel.background = element_blank(),
legend.position = 'none')+
labs(x = "Count of Ratings",
y = "Rating")
Looking deeper into the Jester5k dataset, we can examine which jokes are the most viewed and most popular. A table of the most-rated jokes, ordered by popularity, is produced directly below, and the three figures that follow highlight:
- Average joke ratings with all jokes considered.
- Average joke ratings with an adjusted distribution (jokes with at least 100 ratings).
- Average user ratings.
Based on the first two plots, we can assume that the Jester5k dataset is more robust than the MovieLens dataset in terms of missing values: the second plot shows the same distribution restricted to jokes with at least 100 ratings, and it is identical to the first. The third figure plots the average user ratings, which follow a roughly normal distribution with a mode between 1 and 2.
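A minimal sketch producing that popularity table directly from the rating matrix, using recommenderlab's colCounts():
#ten most-rated jokes, ordered by the number of ratings received
head(sort(colCounts(Jester5k), decreasing = TRUE), 10)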
views_per_joke<-colCounts(Jester5k)
average_joke_ratings<- colMeans(Jester5k)
average_user_ratings<- rowMeans(Jester5k)
average_joke_ratings_relevant <- average_joke_ratings[views_per_joke > 100]
p1<-qplot(average_joke_ratings,alpha = 0.5)+theme_bw()+theme(legend.position = 'none')
p2<-qplot(average_joke_ratings_relevant, alpha = 0.5 )+theme_bw()+theme(legend.position = 'none')
p3<-qplot(average_user_ratings, alpha = 0.5 )+theme_bw()+theme(legend.position = 'none')
grid.arrange(p1,p2,p3, ncol = 1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The recommender systems that will be evaluated in this section are Item-Based Collaborative Filtering (IBCF), User-Based Collaborative Filtering (UBCF), Singular Value Decomposition (SVD), and Alternating Least Squares (ALS). These methods are described in previous projects, which can be found on my Github here.
The dataset will be evaluated with cross-validation and a train/test split at an 80/20 ratio. The number of recommendations made to each user is initially set to 10; this parameter will be varied during an accuracy check to assess model performance.
Our model targets the user-experience goal of serendipity: it will recommend jokes based on an above-average "good" rating threshold of 3 rather than 0. This slightly decreases recommendation accuracy compared to a threshold of 0, but it makes better-quality recommendations; a baseline scheme with the lower threshold is sketched after the setup code below so the change can be quantified.
#filtering for users who have rated at least 50 jokes, and jokes with at least 100 ratings
ratings_jokes <- Jester5k[rowCounts(Jester5k) > 50,
colCounts(Jester5k) > 100]
train_percent<-0.8
kept_items<-15
rating_threshold<-3
n_eval<-3
no_recommendations<-10
eval_sets <- evaluationScheme(data = ratings_jokes, method = "cross-validation",
train = train_percent, given = kept_items, goodRating = rating_threshold, k = n_eval)
#used to train
recc_train<- getData(eval_sets,'train')
#used to predict
recc_known<-getData(eval_sets,'known')
#used to test
recc_unknown<-getData(eval_sets,'unknown')
IBCF is a method that looks at similarities between items (jokes) and makes recommendations from them. The algorithm considers the items a user has already rated (jokes read) and recommends similar items. Its core steps are: measuring how similar two items are when they receive similar ratings from similar users, identifying the k most similar items for each item, and producing user-specific recommendations based on each user's history. The top 20 most recommended jokes are shown in the first figure below.
UBCF is a method that looks at similarities between users and makes recommendations to each user based on that similarity. The algorithm measures how similar each user is to every other user; the top similar users can be selected via an algorithm like k-nearest neighbors, or by applying a threshold to the similarity value. The ratings of these similar users are then weighted by their similarity coefficients to prioritize recommendations. We can apply the same code and thought process to the recommender function with method UBCF; the most recommended jokes can be found in the second figure below.
Singular value decomposition (SVD) is a common dimensionality-reduction technique that identifies latent factors for retrieving information. Some of the difficulties associated with this technique can be attributed to the sparse data (missing values) often present in user-item matrices: filling huge matrices of missing values can be expensive or even misleading. Regularization models (penalty-based error minimization) can assist with this.
The Alternating Least Squares (ALS) algorithm factorizes a matrix into two factors such that the original matrix is approximately equal to the transpose of the first factor matrix multiplied by the second factor matrix.
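Each of these methods exposes tuning knobs in recommenderlab that map onto the descriptions above: k similar items for IBCF, nn neighbors for UBCF, k latent factors for SVD, and n_factors plus the regularization weight lambda for ALS. A minimal sketch follows; the parameter values are illustrative assumptions, not tuned choices:
#illustrative parameterizations; the values shown are assumptions, not tuned
Recommender(recc_train, method = "IBCF", parameter = list(k = 30, method = "cosine"))
Recommender(recc_train, method = "UBCF", parameter = list(nn = 25, method = "pearson"))
Recommender(recc_train, method = "SVD", parameter = list(k = 10, normalize = "center"))
Recommender(recc_train, method = "ALS", parameter = list(n_factors = 10, lambda = 0.1))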
recc_eval<-Recommender(recc_train,method = "IBCF", parameter = NULL)
recc_predict<-predict(object = recc_eval,newdata = recc_known, n = no_recommendations, type = "ratings")
recc_predicted<-predict(object = recc_eval,newdata=recc_known,n=no_recommendations)
recc_matrix <- sapply(recc_predicted@items, function(x){
colnames(ratings_jokes)[x]
})
number_of_items<-recc_matrix %>% unlist() %>% table() %>% as.data.frame()
table_top <- data.frame("names.number_of_items_top." = number_of_items$.,
"number_of_items_top"= number_of_items$Freq)
table_top %>%
top_n(20) %>% arrange(desc(number_of_items_top)) %>%
ggplot(mapping = aes(x=fct_reorder(names.number_of_items_top.,-as.numeric(number_of_items_top)), y = as.numeric(number_of_items_top)))+
geom_col(aes(fill = as.numeric(number_of_items_top)),color = 'black', alpha = 0.5)+
theme(axis.text.x = element_text(angle = 90),
legend.position = 'none',
panel.grid = element_blank(),
panel.background = element_blank())+
labs(x = "Joke_ID",
y = "Number of recommendations",
title = "Top 20 Joke Recomendations")
## Selecting by number_of_items_top
topjokes<-data.frame()
topjokes<-rbind(topjokes,table_top %>%
top_n(3) %>% arrange(desc(number_of_items_top)))
## Selecting by number_of_items_top
topjokes<-c("j82",'j95','j79','j50','j32','j35')
recc_eval<-Recommender(recc_train,method = "UBCF")
recc_predict<-predict(object = recc_eval,newdata = recc_known, n = no_recommendations, type = "ratings")
#this method weights the ratings of the most similar users by their similarity to the target user to come up with recommendations
recc_predicted<-predict(object = recc_eval,newdata=recc_known,n=no_recommendations)
recc_matrix <- sapply(recc_predicted@items, function(x){
colnames(ratings_jokes)[x]
})
number_of_items<-recc_matrix %>% unlist() %>% table() %>% as.data.frame()
table_top <- data.frame("names.number_of_items_top." = number_of_items$.,
"number_of_items_top"= number_of_items$Freq)
table_top %>%
top_n(20) %>%
ggplot(mapping = aes(x=fct_reorder(names.number_of_items_top.,-as.numeric(number_of_items_top)), y = as.numeric(number_of_items_top)))+
geom_col(aes(fill = as.numeric(number_of_items_top)),color = 'black', alpha = 0.5)+
theme(axis.text.x = element_text(angle = 90),
legend.position = 'none',
panel.grid = element_blank(),
panel.background = element_blank())+
labs(x = "Joke_ID",
y = "Number of recommendations",
title = "Top 20 Joke Recomendations")
## Selecting by number_of_items_top
topjokes<-rbind(topjokes,table_top %>%
top_n(3) %>% arrange(desc(number_of_items_top)))
## Selecting by number_of_items_top
To evaluate the recommenders, we calculate the RMSE, MSE, and MAE for each model. For the collaborative filtering models (UBCF and IBCF), we use three different similarity methods: Pearson, Jaccard, and cosine. For the factorization models, we use Z-score and center normalization. The table below summarizes each model's errors: the best-performing models were UBCF with Pearson correlation, UBCF with cosine similarity, and UBCF with Jaccard similarity, while IBCF with Jaccard similarity and ALS with centered normalization performed the worst.
{
UBCF_pearson_eval<-Recommender(recc_train,method = "UBCF", parameter = list(method = "pearson"))
UBCF_jaccard_eval<-Recommender(recc_train,method = "UBCF", parameter = list(method = "jaccard"))
UBCF_cosine_eval<- Recommender(recc_train,method = "UBCF", parameter = list(method = "cosine"))
IBCF_pearson_eval<-Recommender(recc_train,method = "IBCF", parameter = list(method = "pearson"))
IBCF_jaccard_eval<-Recommender(recc_train,method = "IBCF", parameter = list(method = "jaccard"))
IBCF_cosine_eval<- Recommender(recc_train,method = "IBCF", parameter = list(method = "cosine"))
SVD_Z_eval<- Recommender(recc_train,method = "SVD", parameter = list(normalize = "Z-score"))
SVD_center_eval<- Recommender(recc_train,method = "SVD", parameter = list(normalize = "center"))
ALS_Z_eval<- Recommender(recc_train,method = "ALS", parameter = list(normalize = "Z-score"))
ALS_center_eval<- Recommender(recc_train,method = "ALS", parameter = list(normalize = "center"))
}
recommendations<-function(eval){
recc_predicted<-predict(object = eval,newdata=recc_known,n=no_recommendations)
}
recc_matrix <- sapply(recc_predicted@items, function(x){
colnames(ratings_jokes)[x]
})
ModelErrors<-function(eval){
recc_predicted<-predict(object = eval,newdata=recc_known,n=no_recommendations, type= "ratings")
calcPredictionAccuracy(recc_predicted,recc_unknown, byUser = F)
}
rbind(
UBCF_pearson= ModelErrors(UBCF_pearson_eval),
UBCF_jaccard = ModelErrors(UBCF_jaccard_eval),
UBCF_Cosine = ModelErrors(UBCF_cosine_eval),
IBCF_pearson= ModelErrors(IBCF_pearson_eval),
IBCF_jaccard = ModelErrors(IBCF_jaccard_eval),
IBCF_Cosine = ModelErrors(IBCF_cosine_eval),
SVD_Z= ModelErrors(SVD_Z_eval),
SVD_Center= ModelErrors(SVD_center_eval),
ALS_Z= ModelErrors(ALS_Z_eval),
ALS_Center= ModelErrors(ALS_center_eval)
) %>%
kable() %>%
kable_styling(full_width = F,
position = "center",
bootstrap_options = c("hover", "condense","responsive"))
| | RMSE | MSE | MAE |
|---|---|---|---|
| UBCF_pearson | 4.495049 | 20.20546 | 3.524677 |
| UBCF_jaccard | 4.641267 | 21.54136 | 3.670446 |
| UBCF_Cosine | 4.542053 | 20.63024 | 3.565330 |
| IBCF_pearson | 4.697234 | 22.06400 | 3.593574 |
| IBCF_jaccard | 5.688409 | 32.35800 | 4.324923 |
| IBCF_Cosine | 5.245499 | 27.51526 | 4.099757 |
| SVD_Z | 4.649408 | 21.61699 | 3.682991 |
| SVD_Center | 4.647978 | 21.60370 | 3.681730 |
| ALS_Z | 5.207312 | 27.11610 | 4.404687 |
| ALS_Center | 5.667964 | 32.12582 | 4.500189 |
There are many different recommender systems, and there are several ways to determine which models produce better results. Common metrics for model evaluation include: root mean square error (RMSE), as shown above, mean square error (MSE), mean absolute error (MAE), the receiver operating characteristic (ROC) curve, and the area under the ROC curve (AUC). The metric presented below is the ROC curve for all models. In addition to the models, a random or "guessing" recommender is produced in order to have a baseline against which to compare model performance. The plot below shows the results of this evaluation: the UBCF Pearson model performs best among all of the different methods.
models_to_evaluate <- list(
IBCF_cosine = list(name = "IBCF", param = list(method = "cosine")),
IBCF_pearson = list(name = "IBCF", param = list(method ="pearson")),
IBCF_jaccard = list(name = "IBCF", param = list(method ="jaccard")),
UBCF_cosine = list(name = "UBCF", param = list(method ="cosine")),
UBCF_pearson = list(name = "UBCF", param = list(method ="pearson")),
UBCF_jaccard = list(name = "UBCF", param = list(method ="jaccard")),
SVD_Z = list(name = "SVD", param = list(normalize = "Z-score")),
SVD_C = list(name = "SVD", param = list(normalize = "center")),
ALS_N = list(name = "ALS", param = list(normalize = NULL)),
ALS_C = list(name = "ALS", param = list(normalize = "center")),
ALS_Z = list(name = "ALS", param = list(normalize = "Z-score")),
random = list(name = "RANDOM", param=NULL)
)
n_recommendations <- c(1, 5, seq(10, 100, 10))
list_results <- evaluate(x = eval_sets, method = models_to_evaluate, n= n_recommendations)
## IBCF run fold/sample [model time/prediction time]
## 1 [0.12sec/0.2sec]
## 2 [0.13sec/0.19sec]
## 3 [0.13sec/0.2sec]
## IBCF run fold/sample [model time/prediction time]
## 1 [0.12sec/0.22sec]
## 2 [0.14sec/0.16sec]
## 3 [0.12sec/0.22sec]
## IBCF run fold/sample [model time/prediction time]
## 1 [0.14sec/0.16sec]
## 2 [0.16sec/0.19sec]
## 3 [0.12sec/0.32sec]
## UBCF run fold/sample [model time/prediction time]
## 1 [0sec/2.62sec]
## 2 [0.02sec/2.74sec]
## 3 [0sec/2.95sec]
## UBCF run fold/sample [model time/prediction time]
## 1 [0sec/2.66sec]
## 2 [0.01sec/2.68sec]
## 3 [0.02sec/2.48sec]
## UBCF run fold/sample [model time/prediction time]
## 1 [0.01sec/2.77sec]
## 2 [0.02sec/2.56sec]
## 3 [0sec/2.5sec]
## SVD run fold/sample [model time/prediction time]
## 1 [0.09sec/0.19sec]
## 2 [0.09sec/0.2sec]
## 3 [0.07sec/0.19sec]
## SVD run fold/sample [model time/prediction time]
## 1 [0.1sec/0.16sec]
## 2 [0.03sec/0.15sec]
## 3 [0.06sec/0.15sec]
## ALS run fold/sample [model time/prediction time]
## 1 [0sec/86.17sec]
## 2 [0sec/86.75sec]
## 3 [0sec/86.02sec]
## ALS run fold/sample [model time/prediction time]
## 1 [0sec/86.67sec]
## 2 [0sec/87.8sec]
## 3 [0sec/85.84sec]
## ALS run fold/sample [model time/prediction time]
## 1 [0sec/85.92sec]
## 2 [0.02sec/88.27sec]
## 3 [0sec/90.77sec]
## RANDOM run fold/sample [model time/prediction time]
## 1 [0sec/0.18sec]
## 2 [0sec/0.18sec]
## 3 [0.02sec/0.17sec]
avg_matrices <- lapply(list_results, avg)
plot(list_results, annotate = 1, legend = "topleft")
title("ROC curve")
To see the top jokes from some of our recommender models, we can match each joke ID back to the JesterJokes text and extract the jokes. The following table shows the resulting top jokes from a few models.
JesterJokesDF<-data.frame(JesterJokes)
JesterJokesDF$names.number_of_items_top.<-row.names(JesterJokesDF)
kable(JesterJokesDF[match(unique(topjokes$names.number_of_items_top.),JesterJokesDF$names.number_of_items_top.),]) %>%
kable_styling(full_width = T,
position = "center",
bootstrap_options = c("hover", "condensed", "responsive"))
| | JesterJokes | names.number_of_items_top. |
|---|---|---|
| j82 | Q: How do you keep a computer programmer in the shower all day long? A: Give them a shampoo with a label that says “rinse, lather, repeat”. | j82 |
| j50 | A guy goes into confession and says to the priest, “Father, I’m 80 years old, widower, with 11 grandchildren. Last night I met two beautiful flight attendants. They took me home and I made love to both of them. Twice.” The priest said: “Well, my son, when was the last time you were in confession?” “Never Father, I’m Jewish.” “So then, why are you telling me?” “I’m telling everybody.” | j50 |
| j32 | A man arrives at the gates of heaven. St. Peter asks, “Religion?” The man says, “Methodist.” St. Peter looks down his list, and says, “Go to room 24, but be very quiet as you pass room 8.” Another man arrives at the gates of heaven. “Religion?” “Baptist.” “Go to room 18, but be very quiet as you pass room 8.” A third man arrives at the gates. “Religion?” “Jewish.” “Go to room 11, but be very quiet as you pass room 8.” The man says, “I can understand there being different rooms for different religions, but why must I be quiet when I pass room 8?” St. Peter tells him, "Well the Catholics are in room 8, and they think they’re the only ones here. | j32 |
| j36 | A guy walks into a bar, orders a beer and says to the bartender, “Hey, I got this great Polish Joke…” The barkeep glares at him and says in a warning tone of voice: “Before you go telling that joke you better know that I’m Polish, both bouncers are Polish and so are most of my customers” “Okay” says the customer,“I’ll tell it very slowly.” | j36 |
Another experiment that could be performed would be to look at the tf-idf of the jokes and compare similarities in their text. This could prove to be an interesting analysis; however, since the jokes are relatively short texts, it may not yield significant results. A minimal sketch of how such an analysis might begin is shown below.
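A sketch of that experiment, assuming the tidytext package (not loaded above) for tokenizing the JesterJokes texts:
#hedged sketch: tf-idf over the joke texts (assumes the tidytext package)
library(tidytext)
joke_words <- data.frame(joke = names(JesterJokes),
                         text = unname(JesterJokes)) %>%
  unnest_tokens(word, text) %>%
  count(joke, word) %>%
  bind_tf_idf(word, joke, n)
#words most distinctive to a single joke have the highest tf-idf
head(joke_words %>% arrange(desc(tf_idf)))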