The overall goal of this project is to build a recommender system that predicts joke ratings. Although we initially aimed to use Spark on Databricks, we ran into a number of challenges that prevented us from continuing down that path. Instead, we have chosen to create a recommender based on a more advanced technique: a monolithic hybrid approach, whereby two recommender methods are integrated to produce a final model. The first model will be a traditional item-based collaborative filtering (IBCF) approach; each user will be recommended jokes that are similar to the ones they’ve already rated. The second model will be based purely on joke characteristics; we will extract information from the joke text (e.g., type of joke, length of joke) to create novel features. The end model will be a weighted combination of the similarity matrices from both models. We also tackle the “cold start” problem, in which a user has not previously rated any jokes, by incorporating a function that recommends the top \(n\) rated jokes from the dataset.
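To make the combination concrete: if \(S_{\text{IBCF}}\) denotes the item-item similarity matrix learned from the ratings and \(S_{\text{feat}}\) denotes the similarity matrix computed from the joke-text features, the hybrid model scores items with \(S_{\text{hybrid}} = w\,S_{\text{feat}} + (1 - w)\,S_{\text{IBCF}}\), where the weight \(w \in [0, 1]\) is selected later by comparing RMSE across candidate values.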
The Jester Dataset 3 will be used in this project. Jester is a joke recommender system developed by the University of California, Berkeley Laboratory of Automation Science and Engineering. The Jester dataset contains millions of joke ratings made by users of the recommender system, which learns from new users. Our chosen dataset (several Jester sets of different sizes are available) initially describes almost 55,000 users making over 1.8 million ratings of 150 different jokes. The Jester team collected these ratings from April 1999 to May 2003 and from November 2006 to March 2015. Ratings are real values ranging from -10.00 to +10.00 (the value “99” corresponds to “null”, i.e., “not rated”). Each row in the dataset represents a different user, and the first column gives the total number of jokes the individual has rated. The remaining 150 columns give the ratings for each joke. (More details are available on the Jester project website.) The joke text included in the jokeText file will also be used to facilitate feature engineering.
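Note that the setup chunk with the package imports isn’t shown in this write-up; based on the functions used below, the code assumes roughly the following libraries are loaded:
# Packages assumed loaded in the (unshown) setup chunk
library(readxl)          # read_xls()
library(recommenderlab)  # realRatingMatrix, evaluationScheme(), Recommender(), predict()
library(tm)              # VCorpus(), DocumentTermMatrix()
library(tidytext)        # tidy() for DocumentTermMatrix, get_sentiments()
library(dplyr)           # joins, mutate(), count(), select(), arrange()
library(tidyr)           # spread()
library(proxy)           # simil()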
First, we will load in the Jester dataset. We can see from the extraction that we have \(54,905\) users and \(150\) jokes (plus one column for the rated-joke count).
# Downloading ratings data to tempfile
dl <- tempfile()
download.file("http://eigentaste.berkeley.edu/dataset/JesterDataset3.zip", dl)
# Unzipping then reading into a tibble
con <- unzip(dl, "FINAL jester 2006-15.xls")
working <- data.frame(read_xls(con, col_names = FALSE))
dim(working)
## [1] 54905 151
We’ll start by removing the count column and the retired jokes (jokes that were removed from the Jester pool and therefore have no ratings). After this cleansing, we’re left with a total of \(128\) jokes.
# Removing count column
ratings_working <- working[-1]
# Adding column and row names
names(ratings_working) <- 1:dim(ratings_working)[2]
row.names(ratings_working) <- 1:nrow(ratings_working)
# Removing retired jokes (columns)
retired <- c("1","2","3","4","5","6","9","10","11","12","14","20","27","31","43","51","52","61","73","80","100","116") # per Jester website -- jokes removed and thus not rated
ratings_working[, retired] <- list(NULL)
dim(ratings_working)
## [1] 54905 128
The raw data represents non-rated jokes as the number 99, so we will replace these values with nulls. We’ll also create a large ratings matrix to be used in our IBCF model.
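The chunk for this step isn’t shown; a minimal sketch of what it presumably does (the object name finalRatings is taken from the later code, while the exact conversion is an assumption) is:
# Treat the 99 placeholder as missing rather than as a rating
ratings_working[ratings_working == 99] <- NA
# Convert to recommenderlab's sparse rating-matrix class for the IBCF model
finalRatings <- as(as.matrix(ratings_working), "realRatingMatrix")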
We can now dive a bit deeper into the ratings data.
First, let’s take a look at the number of jokes that each user has rated. The distribution is right skewed, meaning that most users have rated relatively few jokes and only a few users have rated a large share of them.
jokeCount <- rowCounts(finalRatings)
hist(jokeCount,
main = ' Distribution of Number of Jokes Rated per User',
xlab = 'Number of Jokes Rated',
ylab = 'Number of Users',
col = 'lightblue')
We can also look at the average rating for the jokes. We can see from the histogram that the mean rating is below 4 for all jokes.
mean_rating <- colMeans(finalRatings, na.rm = T)
hist(mean_rating,
main = 'Distribution of joke ratings',
xlab = 'Average Rating',
ylab = 'Number of users')
We will base our joke features on research by Rada Mihalcea and Stephen Pulman.
The joke text is included in the jokeText file, so we will import this into our session. Since we subset our original jokes, we’ll eliminate any jokes that should not be included in the feature set.
jokes <- data.frame(as.integer(colnames(ratings_working)))
colnames(jokes) <- c('JOKE_NUM')
# https://www.cs.ox.ac.uk/files/244/mihalcea.cicling07.pdf
jokeText <- data.frame(read_xls("jokeText.xls", col_names = FALSE))
colnames(jokeText) <- c('JOKE_TEXT')
jokeText$JOKE_NUM <- 1:nrow(jokeText)
jokeText <- inner_join(jokeText, jokes, by = 'JOKE_NUM')
Let’s create a few features from the joke text:
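The chunk that actually builds these features is not included above, so the following is only a sketch: the column names (JOKE_TYPE, JOKE_LENGTH, EXCITED_COUNT) come from the feature list shown later, but the specific definitions (a question/answer flag, character length, and exclamation-mark count) are assumed here:
# Hypothetical reconstruction of the text-derived features (definitions assumed)
jokeText <- jokeText %>%
  mutate(JOKE_TYPE = ifelse(grepl("\\?", JOKE_TEXT), 1, 0),      # assumed: 1 = question/answer-style joke, 0 = narrative
         JOKE_LENGTH = nchar(JOKE_TEXT),                         # joke length in characters
         EXCITED_COUNT = nchar(gsub("[^!]", "", JOKE_TEXT)))     # number of exclamation marks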
To analyze the joke sentiment, we’ll create a corpus for our joke text using the tm package. We’ll remove whitespace, stopwords, and punctuation and also transform the text to lowercase. This will allow us to create a DocumentTermMatrix which identifies the counts of words in each joke.
When we inspect the first few elements of the DocumentTermMatrix, we can see that it is very sparse: 99% of the counts are 0. This means that there are not a lot of common words used in all jokes.
# create a corpus of words in the text
jokeCorpus <- VCorpus(VectorSource(jokeText$JOKE_TEXT))
# no whitespace, all lowercase, remove stopwords
jokeCorpus <- tm_map(jokeCorpus, stripWhitespace)
jokeCorpus <- tm_map(jokeCorpus, content_transformer(tolower))
jokeCorpus <- tm_map(jokeCorpus, removeWords, stopwords("english"))
jokeCorpus <- tm_map(jokeCorpus, removePunctuation)
# create document-term matrix
dtm <- DocumentTermMatrix(jokeCorpus)
tm::inspect(dtm)
## <<DocumentTermMatrix (documents: 128, terms: 1870)>>
## Non-/sparse entries: 3430/235930
## Sparsity : 99%
## Maximal term length: 16
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs back engineer father man one replied said says time two
## 105 1 0 3 0 0 0 1 3 0 0
## 107 0 0 0 0 1 0 0 0 0 1
## 120 0 0 14 11 1 0 0 0 0 0
## 128 0 0 0 0 0 0 0 0 0 0
## 32 0 6 0 0 2 0 0 0 0 0
## 36 0 0 2 0 1 0 2 1 1 0
## 37 1 0 0 0 0 0 0 0 0 0
## 69 0 0 0 0 1 0 0 0 0 0
## 85 1 8 0 0 1 0 0 4 0 0
## 94 1 0 0 0 0 0 2 0 0 1
We will transform the DocumentTermMatrix into a format that will allow us to analyze the sentiment of each word. We can use the get_sentiments() function, which labels words as positive or negative using the lexicon of Bing Liu and collaborators.
# turn it into a one-term-per-document-per-row data frame
dtm_td <- tidy(dtm)
# sentiments per word
jokeSentiments <- dtm_td %>%
inner_join(get_sentiments("bing"),by = c(term = "word"))
# sentiments per doc
sentPerDoc <- jokeSentiments %>%
count(document, sentiment, wt = count) %>%
spread(sentiment, n, fill = 0) %>%
mutate(SENTIMENT_SCORE = positive - negative,
POSITIVE_SCORE = positive,
NEGATIVE_SCORE = negative,
SCORE_CATEGORY = ifelse(SENTIMENT_SCORE > 0, 1, 0)) %>%
select(document, SENTIMENT_SCORE, POSITIVE_SCORE, NEGATIVE_SCORE, SCORE_CATEGORY) %>%
arrange(SENTIMENT_SCORE)
# convert to a number
sentPerDoc$document <- as.integer(sentPerDoc$document)
We can now join this to our original jokes dataset to get the positive, negative, and total sentiment scores of each joke. We will replace nulls in jokes where no positive or negative words were found with 0 (neutral). We can confirm that the final dataset includes the original features and sentiment counts.
finalJokes <- left_join(jokeText,sentPerDoc, by = c('JOKE_NUM' = 'document'))
finalJokes[is.na(finalJokes)] <- 0
colnames(finalJokes)
## [1] "JOKE_TEXT" "JOKE_NUM" "JOKE_TYPE" "JOKE_LENGTH"
## [5] "EXCITED_COUNT" "SENTIMENT_SCORE" "POSITIVE_SCORE" "NEGATIVE_SCORE"
## [9] "SCORE_CATEGORY"
Finally, we can create a similarity matrix for our data based on all of the features we’ve defined. The higher the score, the more similar the two jokes are.
jokeFeatures <- finalJokes %>%
select(JOKE_TYPE, JOKE_LENGTH, EXCITED_COUNT, SENTIMENT_SCORE, POSITIVE_SCORE, NEGATIVE_SCORE, SCORE_CATEGORY)
jokeSims <- as.matrix(simil(jokeFeatures))
paste0('Min similarity score: ', min(jokeSims, na.rm = TRUE), ' | Max similarity score: ', max(jokeSims, na.rm = TRUE))
## [1] "Min similarity score: 0.316908195253023 | Max similarity score: 1"
We can also take a look at the distribution of similarity scores. Even the minimum score is above \(0.3\), so many of the jokes are fairly similar to one another:
hist(jokeSims,
main = 'Distribution of similarity scores',
xlab = 'Similarity Score',
ylab = 'Number of Scores')
Our final, hybrid model combines (1) item-based collaborative filtering with (2) the constructed features from the joke text. Each piece will be built out separately and later combined to create the final prediction. The process for building this model is based on Chapter 5 of the course text (Usuelli & Gorakala, 2015).
First, we can build our initial item-based recommender. We’ll define the following:
trainPct <- 0.8                              # share of users used for training
toKeep <- min(rowCounts(finalRatings)) - 3   # ratings given to the algorithm for each test user
ratingThreshold <- 1                         # minimum rating considered a "good" rating
nFold <- 3                                   # evaluation runs (not used by the "split" method below)
# define evaluation scheme
evalScheme <- evaluationScheme(finalRatings,
                               method = "split",
                               train = trainPct,
                               given = toKeep,
                               goodRating = ratingThreshold)
# models to compare
evalModels <- list(
  IBCF_cos = list(name = "IBCF", param = list(method = "cosine")),
  IBCF_pea = list(name = "IBCF", param = list(method = "pearson")),
  IBCF_jac = list(name = "IBCF", param = list(method = "jaccard"))
)
# number of recommendations
nRecs <- c(1, seq(3, 20, 1))
finalResults <- evaluate(x = evalScheme, method = evalModels, n = nRecs)
## IBCF run fold/sample [model time/prediction time]
## 1 [10.15sec/1.69sec]
## IBCF run fold/sample [model time/prediction time]
## 1 [5.15sec/1.09sec]
## IBCF run fold/sample [model time/prediction time]
## 1 [9.26sec/1.33sec]
We can see from the plot of the final results that the IBCF model with the Pearson similarity works best.
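The plotting call isn’t shown above; with recommenderlab’s evaluation results it would typically look something like this (a sketch, not the original code):
# Compare TPR/FPR curves for the three similarity measures
plot(finalResults, annotate = TRUE, legend = "topleft")
# Precision/recall view of the same results
plot(finalResults, "prec/rec", annotate = TRUE)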
# IBCF Model
ibcfRec <- Recommender(getData(evalScheme, "train"), 'IBCF', parameter = list(method = 'pearson', normalize = 'center'))
ibcfPred <- predict(ibcfRec,getData(evalScheme,'known'), type = 'ratings')
ibcfAcc <- calcPredictionAccuracy(ibcfPred, getData(evalScheme, "unknown"))
Next, we’ll extract the similarity matrix that the IBCF model learned for the jokes, which gives us an idea of how similar jokes are to each other. This matrix is based on the ratings – two jokes are considered more similar when users tend to rate them similarly. We can visualize this matrix as well – the darker colored squares represent jokes that are similar.
# Defining a similarity matrix object
ratings_dist <- as(ibcfRec@model$sim, "matrix")
# Visualize similarity matrix
image(ibcfRec@model$sim,
main = "Similarity matrix",
xlab = 'Joke Number',
ylab = 'Joke Number')
In the matrix, a score of \(0\) represents one of two things: either the joke is being compared to itself or the two jokes in question are very dissimilar. We can take a look at the distribution of the similarities and see that most values are either \(0\) or fall between \(0.5\) and \(0.6\).
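The histogram call itself isn’t included; a minimal sketch, using the ratings_dist matrix defined above, would be:
# Distribution of the learned item-item similarities
hist(ratings_dist,
     main = 'Distribution of IBCF similarity scores',
     xlab = 'Similarity Score',
     ylab = 'Number of Scores')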
Now that we’ve defined a similarity matrix for the jokes using the IBCF model, we can combine it with the similarity matrix for our defined features. We can identify what weighting works best.
accList <- c()
# try weights from 0 (all IBCF) to 1 (all text features) in steps of 0.1
for(i in seq(0, 1, 0.1)){
  finalMatrix <- jokeSims * i + ratings_dist * (1 - i)
  # replace the model's similarity matrix with the blended matrix
  ibcfRec@model$sim <- as(finalMatrix, "dgCMatrix")
  hybridPred <- predict(ibcfRec, getData(evalScheme, 'known'), type = 'ratings')
  hybridAcc <- calcPredictionAccuracy(hybridPred, getData(evalScheme, "unknown"))
  accList <- rbind(accList, c(i, hybridAcc))
}
We can see from the RMSE comparison that the best combination of IBCF / derived features is 70% / 30%. We will use this to make our final model, which will recommend the top 5 jokes per user.
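One caveat: the loop above leaves ibcfRec@model$sim at the last weight it tried (i = 1, i.e., features only), so unless it was reset in code not shown here, the chosen blend should be restored before the model is reused for the final recommendations. A minimal sketch using the 70/30 split identified above:
# Restore the winning blend: 30% text-feature similarity, 70% IBCF similarity
bestWeight <- 0.3
ibcfRec@model$sim <- as(jokeSims * bestWeight + ratings_dist * (1 - bestWeight), "dgCMatrix")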
plot(data.frame(accList)$V1, data.frame(accList)$RMSE,
main = 'RMSE by Weighting Factor',
xlab = 'Weighting Factor',
ylab = 'RMSE')
Because our model is based on items that are similar in nature, we need a user profile for it to recommend new jokes. This is a problem when we are presented with a user that has not rated any jokes. To deal with this problem, we will recommend the top \(n\) rated jokes as a default.
First, we can calculate the top 5 rated jokes from the dataset.
meanRatings <- data.frame(colMeans(ratings_working, na.rm=TRUE))
colnames(meanRatings) <- c('MEAN_RATING')
meanRatings$JOKE_NUM <- rownames(meanRatings)
top5 <- meanRatings %>%
arrange(desc(MEAN_RATING)) %>%
top_n(n=5, wt =MEAN_RATING) %>%
select(JOKE_NUM)
top5$JOKE_NUM <- as.integer(top5$JOKE_NUM)
top5.ls <- as.list(top5)
top5.ls
## $JOKE_NUM
## [1] 105 53 89 129 35
Next, we’ll create a function to calculate the predictions - if the number of ratings for the user profile is 0, the default predictions will be the top 5 jokes. Otherwise, the predictions will be based on the user’s rated items.
finalRecs <- function(profile){
  # check whether the profile contains any ratings at all
  anyRatings <- max(colCounts(profile))
  if(anyRatings == 0){
    # cold start: fall back to the overall top 5 rated jokes
    recs <- top5.ls
  } else{
    # otherwise use the hybrid IBCF model to predict the top 5 jokes
    preds <- predict(ibcfRec,
                     profile,
                     n = 5)
    recs <- preds@items
  }
  return(recs)
}
Finally, we can test this out with two test users - one that has no ratings and one that has a few ratings. We can verify that the output for the user with no ratings is the same as the top 5 jokes.
testUser <- matrix(data = rep(NA,128),
nrow = 1)
colnames(testUser) <- colnames(finalRatings)
rownames(testUser) <- 1111111
testUser.rm <- as(testUser, 'realRatingMatrix')
finalRecs(testUser.rm)
## $JOKE_NUM
## [1] 105 53 89 129 35
And we can also see what jokes are recommended to a user that has rated some jokes:
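The chunk that produced the call below isn’t included; one way to construct such a profile (the user id 4 and the specific ratings are illustrative assumptions) would be:
# Hypothetical second test user with a handful of ratings
testUser2 <- matrix(data = rep(NA, 128), nrow = 1)
colnames(testUser2) <- colnames(finalRatings)
rownames(testUser2) <- 4
testUser2[1, c("7", "8", "13")] <- c(9.5, 8.2, 7.9)  # illustrative ratings on the -10..10 scale
finalRecs(as(testUser2, 'realRatingMatrix'))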
## $`4`
## [1] 85 107 18 37 50
Our final recommender model incorporates item-based similarities from both a traditional IBCF technique and a similarity matrix derived from joke-text features. We performed a pseudo grid search to identify the optimal parameters for our IBCF model, with the Pearson similarity winning out. We were able to extract relevant features from the joke text that ultimately produced a superior model compared with IBCF alone. The final model was developed through an analysis of weightings to identify the best feature/IBCF combination. Finally, we were able to overcome the “cold start” problem by incorporating average joke ratings for new users with no previously rated jokes.
Our biggest challenge with this project was the limited support for R recommender packages on Spark. We had a stretch goal of both using an advanced approach (a self-developed hybrid system) and deploying on Spark, but this wasn’t feasible with the language we chose. This is why we ultimately decided to run the entire recommender locally while increasing complexity by tackling the “cold start” problem.
Our final RMSE of over 4 leaves a lot of room for improvement with this system. Incorporating user and item bias in the model might help to improve our scores. Additionally, we would like to experiment with different “cold start” techniques - we used the average rating for a joke to determine the recommended items for new users with no profiles, but this could potentially be replaced with, or combined with, joke popularity or median joke ratings. Finally, and most importantly, we think that porting the entire analysis to PySpark-compatible libraries would likely (1) increase final accuracy and (2) reduce computation time.
Mihalcea, R., & Pulman, S. (2007). Characterizing humour: An exploration of features in humorous texts. In Computational Linguistics and Intelligent Text Processing (Lecture Notes in Computer Science), pp. 337–347. doi:10.1007/978-3-540-70939-8_30
Usuelli, M., & Gorakala, S. K. (2015). Building a Recommendation System with R. Packt Publishing Limited.