DATA 643 Project 1: Coding a Recommender System

Motivation

The purpose of this exercise is to hand-code a recommender system, including the creation of a similarity function. The results of the hand-coded system are compared with the evaluation of the same model using an R package – in this case recommenderlab.

Data Utilized

To allow for ease of comparison between models, the Jester5k dataset from the recommenderlab package is utilized for the recommender system. According to the help file, the dataset includes “5000 users from the anonymous ratings data from the Jester Online Joke Recommender System.” The ratings contained in the data frame range between -10.00 and 10.00. Each user included has rated at least 36 jokes.

The dataset is stored as an item of class realRatingMatrix; it is converted to a data frame for use in the hand-coded system.

library(recommenderlab)
data(Jester5k)
# Jester5k@data is a sparse matrix in which unrated jokes are stored as 0
jester <- as.vector(Jester5k@data)
jester <- matrix(data = jester, nrow = Jester5k@data@Dim[1], ncol = Jester5k@data@Dim[2])
jester <- as.data.frame(jester)
# restore the user (row) and joke (column) labels
rownames(jester) <- Jester5k@data@Dimnames[[1]]
names(jester) <- Jester5k@data@Dimnames[[2]]

System Creation

The goal of the recommender system will be to provide 5 recommended jokes per user for the first twenty users in the dataset. Based on the information available and the dimensions of the dataset, item-based collaborative filtering will be utilized. Similarities will be calculated using the cosine method.

Similarity Function

In order to provide recommendations in an item-based collaborative filtering system, the similarities between items must be computed and saved into a matrix. For two vectors (each representing the ratings assigned to a joke in this case), the cosine similarity is given by the dot product of the vectors divided by the product of their norms. Because the norm function of the Matrix package does not handle vectors, a custom function is created that also handles potential NA values:

vec_similarity <- function(v1, v2) {
  dot_product <- sum(v1 * v2, na.rm = TRUE)
  norm1 <- sqrt(sum(v1^2, na.rm = TRUE))
  norm2 <- sqrt(sum(v2^2, na.rm = TRUE))
  dot_product / (norm1 * norm2)
}
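As a quick check of the NA handling, the function can be applied to short made-up vectors. The values below are illustrative only, not Jester data, and the names joke_a and joke_b are hypothetical. Any position where either vector is NA drops out of the dot product, while each norm is still taken over all of that vector's own non-NA entries.

```r
# Cosine similarity as defined above, repeated so the sketch is self-contained
vec_similarity <- function(v1, v2) {
  dot_product <- sum(v1 * v2, na.rm = TRUE)
  norm1 <- sqrt(sum(v1^2, na.rm = TRUE))
  norm2 <- sqrt(sum(v2^2, na.rm = TRUE))
  dot_product / (norm1 * norm2)
}

# Hypothetical mean-centered ratings of two jokes by four users
joke_a <- c(3, -4, NA, 0)
joke_b <- c(3, -4, 5, NA)

sim_self <- vec_similarity(joke_a, joke_a)  # a vector compared with itself
sim_ab   <- vec_similarity(joke_a, joke_b)  # positions 3 and 4 are skipped in the dot product
```

Here sim_self is 1 and sim_ab is about 0.707. The two jokes agree wherever both are rated, but the norm of joke_b also includes the rating that has no counterpart in joke_a, which pulls the similarity below 1.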

System Utilization

With the necessary similarity function created, the model is implemented.

Mean-Centering

To evaluate the item-item similarities, the ratings provided by users must be mean-centered so that each rating is viewed relative to that user's own average rating. To do this, the mean for each user (represented by a row in the dataset) is calculated and subtracted from each of that user's ratings. Since non-rated jokes are represented by values of 0, these values must first be replaced by NAs to avoid biasing the mean towards zero. The original Jester5k dataset contains no ratings of exactly 0.00, so this step does not lead to the loss of any information.

jester[jester == 0] <- NA
user_means <- rowMeans(jester, na.rm = TRUE)
for (i in 1:nrow(jester)) {
  jester[i, ] <- jester[i, ] - user_means[i]
}
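The row-by-row loop above can be slow on a data frame with 5000 rows. A minimal sketch of an equivalent vectorized form, using base R's sweep on a small made-up matrix, is shown below. The object names (toy, toy_means, toy_centered) are hypothetical and the values are not Jester data.

```r
# Hypothetical ratings: 3 users (rows) x 4 jokes (columns), 0 = not rated
toy <- matrix(c( 2, 0, 4, 6,
                -3, 1, 0, 2,
                 0, 0, 5, 7),
              nrow = 3, byrow = TRUE)

toy[toy == 0] <- NA                       # unrated jokes become NA
toy_means <- rowMeans(toy, na.rm = TRUE)  # per-user mean over rated jokes only
toy_centered <- sweep(toy, 1, toy_means)  # subtract each row's mean (default FUN is "-")
```

After centering, each user's rated jokes average to zero and unrated jokes remain NA. The same sweep call should apply to the full ratings once they are held in a numeric matrix.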

Creating the Similarity Matrix

The similarity matrix for the 100 jokes is created using the mean-centered data and the similarity function.

item_sim <- data.frame(matrix(NA, nrow = ncol(jester), ncol = ncol(jester)))
for (i in 1:nrow(item_sim)) {
  for(j in 1:ncol(item_sim)) {
    item_sim[i, j] <- vec_similarity(jester[, i], jester[, j])
  }
}
names(item_sim) <- names(jester)
rownames(item_sim) <- names(jester)
image(as.matrix(item_sim), axes = FALSE)
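Because vec_similarity drops NA terms from each of its sums, it gives the same result as replacing every NA with 0 and applying the usual cosine formula. Under that observation, the whole similarity matrix can be computed with a single matrix product instead of the double loop above. The sketch below checks the equivalence on a small made-up matrix; centered, filled, loop_sim and fast_sim are hypothetical names.

```r
# Cosine similarity as defined earlier, repeated so the sketch is self-contained
vec_similarity <- function(v1, v2) {
  dot_product <- sum(v1 * v2, na.rm = TRUE)
  norm1 <- sqrt(sum(v1^2, na.rm = TRUE))
  norm2 <- sqrt(sum(v2^2, na.rm = TRUE))
  dot_product / (norm1 * norm2)
}

# Hypothetical mean-centered ratings: 4 users (rows) x 3 jokes (columns)
centered <- matrix(c( 1, -1, NA,
                     -2, NA,  2,
                      3,  1, -4,
                     NA,  2,  2),
                   nrow = 4, byrow = TRUE,
                   dimnames = list(NULL, c("j1", "j2", "j3")))

# Loop version, one pair of columns at a time
n_items <- ncol(centered)
loop_sim <- matrix(NA_real_, n_items, n_items)
for (i in seq_len(n_items)) {
  for (j in seq_len(n_items)) {
    loop_sim[i, j] <- vec_similarity(centered[, i], centered[, j])
  }
}

# Vectorized version: zero-fill the NAs, then one cross-product
filled <- centered
filled[is.na(filled)] <- 0
norms <- sqrt(colSums(filled^2))                     # per-joke vector norms
fast_sim <- crossprod(filled) / outer(norms, norms)  # t(filled) %*% filled, scaled
```

On this toy input the two matrices agree to numerical precision. The vectorized form avoids repeated element-wise assignment into a data frame, which is likely where much of the loop's run time goes.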

The similarity matrix shows a wide range of similarity – the similarities (not including the diagonal with similarity = 1) range from -0.378 to 0.478.

Running the Model

To prepare the model, a subset of the jester dataset is utilized. Users with 20 or fewer unrated jokes are removed so that the recommendations are meaningful (the five best recommendations for someone who has rated all but five jokes will be the remaining five jokes, regardless of whether the user is expected to like them). For each of the first twenty users in this new dataset, a predicted rating for each unrated joke is produced using the twenty jokes most similar to it. Of these unrated jokes, the five with the highest predicted ratings are returned.

library(dplyr)

jester_rec <- subset(jester, rowSums(is.na(jester)) > 20)

manual_recs <- data.frame(matrix(nrow = 0, ncol = 5))
for (i in 1:20) {
  joke_name <- character(0)
  joke_guess <- numeric(0)
  for (j in 1:ncol(jester_rec)){
    if(is.na(jester_rec[i, j])) {
      joke_name <- c(joke_name, names(item_sim)[j])
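      # find the 20 jokes most similar to joke j (excluding joke j itself)
      sims <- item_sim[, j]
      sims_df <- data.frame(joke = names(item_sim), sims, stringsAsFactors = FALSE)
      sims_df <- sims_df %>%
        filter(joke != names(item_sim)[j]) %>%
        arrange(desc(sims)) %>%
        top_n(20, sims)
      # predicted (mean-centered) rating: similarity-weighted sum of the user's
      # ratings of those jokes; neighbors the user has not rated are skipped
      pred <- 0
      for (k in 1:20) {
        pred <- sum(pred, sims_df$sims[k] * jester_rec[i, sims_df$joke[k]], na.rm = TRUE)
      }
      # normalize by the total similarity of the selected neighbors
      joke_guess <- c(joke_guess, pred / sum(sims_df$sims, na.rm = TRUE))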
    }
  }
  best_jokes <- data.frame(joke_name, joke_guess, stringsAsFactors = FALSE)
  best_jokes <- best_jokes %>% arrange(desc(joke_guess))
  for (n in 1:5) {
    manual_recs[i, n] <- best_jokes$joke_name[n]
  }
  rownames(manual_recs)[i] <- rownames(jester_rec)[i]
}
names(manual_recs) <- as.character(1:5)
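The prediction step inside the loop can be restated as a small helper that works on named vectors for a single user and a single target joke. The sketch below is an illustration under stated assumptions, not the code used to produce the results: predict_rating and its arguments are hypothetical names, the numbers are made up, and the rated_only option is an added variant (normalizing only over neighbors the user has actually rated) that does not appear in the loop above.

```r
# Predict one user's mean-centered rating for `target` from its k most similar jokes.
# `user_ratings` and `sims` are named numeric vectors keyed by joke name.
predict_rating <- function(user_ratings, sims, target, k = 20, rated_only = FALSE) {
  sims <- sims[names(sims) != target]            # drop the joke's similarity to itself
  top <- head(sort(sims, decreasing = TRUE), k)  # k most similar jokes
  ratings <- user_ratings[names(top)]            # the user's ratings of those jokes
  numerator <- sum(top * ratings, na.rm = TRUE)  # unrated neighbors contribute nothing
  # rated_only = FALSE follows the loop above (denominator over all k neighbors);
  # rated_only = TRUE is an assumed variant using only neighbors the user has rated
  denominator <- if (rated_only) sum(top[!is.na(ratings)]) else sum(top)
  numerator / denominator
}

# Hypothetical values for one user and the unrated joke j1
user_ratings <- c(j1 = NA, j2 = 2, j3 = -1, j4 = NA, j5 = 4)
sims_to_j1   <- c(j1 = 1, j2 = 0.5, j3 = 0.3, j4 = 0.2, j5 = 0.1)

pred_loop_style <- predict_rating(user_ratings, sims_to_j1, target = "j1", k = 3)
pred_rated_only <- predict_rating(user_ratings, sims_to_j1, target = "j1", k = 3,
                                  rated_only = TRUE)
```

With k = 3 the neighbors are j2, j3 and j4, of which only j2 and j3 are rated. The loop-style prediction is 0.7 (0.7 divided by a total similarity of 1.0), while the rated-only variant gives 0.875 (0.7 divided by 0.8). Unrated neighbors therefore shrink the loop-style prediction towards zero.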

Results

The five recommended jokes for each of the first twenty users are presented below:

	1	2	3	4	5
u15547	j89	j93	j76	j81	j83
u21505	j100	j80	j81	j93	j89
u238	j11	j46	j53	j45	j25
u5809	j36	j26	j89	j72	j93
u16636	j71	j82	j99	j89	j97
u17322	j81	j89	j72	j100	j97
u13610	j89	j72	j93	j76	j87
u20906	j100	j98	j80	j85	j83
u7147	j14	j39	j11	j6	j22
u4662	j47	j6	j63	j38	j90
u7904	j26	j46	j81	j14	j76
u5462	j40	j98	j6	j47	j45
u13120	j98	j80	j83	j1	j90
u22827	j6	j89	j72	j93	j76
u20747	j89	j100	j80	j64	j37
u1143	j89	j72	j93	j91	j76
u7602	j93	j89	j72	j81	j91
u12658	j89	j11	j93	j6	j72
u5021	j89	j72	j93	j45	j76
u6457	j81	j44	j33	j37	j89

The most-recommended jokes are presented below:

Joke	n
j89	13
j93	9
j72	8
j76	6
j81	6

Prepackaged System

The recommender systems in the recommenderlab package automatically mean-center data and calculate similarities (using the specified method). Many of the functions in the package take inputs of the class realRatingMatrix – for this reason, the raw Jester5k dataset from the package is utilized.

Creating the Model

The model (including similarity matrix) is created using the full dataset. Arguments to the function are set to match those used in the manually-created model.

premade_model <- Recommender(data = Jester5k, method = "IBCF", parameter = list(method = "Cosine", k = 20))

Applying the Model

The model is then applied to the dataset to obtain five recommendations each for the first twenty users. Here, the dataset is again subsetted to exclude users with 20 or fewer unrated jokes.

Jester5k_rec <- Jester5k[100 - rowCounts(Jester5k) > 20]
premade_allrecs <- predict(object = premade_model, newdata = Jester5k_rec, n = 5)

Results

The five recommendations for the first twenty users are returned:

premade_recs <- data.frame(matrix(nrow = 20, ncol = 5))
rownames(premade_recs) <- names(premade_allrecs@items)[1:20]
for(i in 1:20) {
  for (j in 1:5) {
    premade_recs[i, j] <- paste0("j", premade_allrecs@items[[i]][j])
  }
}
names(premade_recs) <- as.character(1:5)

	1	2	3	4	5
u15547	j88	j91	j89	j93	j87
u21505	j92	j80	j78	j77	j73
u238	j25	j52	j45	j3	j11
u5809	j89	j76	j72	j93	j90
u16636	j77	j78	j82	j95	j96
u17322	j81	j89	j87	j72	j97
u13610	j88	j87	j91	j72	j94
u20906	j78	j80	j85	j83	j77
u7147	j22	j98	j14	j52	j39
u4662	j90	j63	j82	j84	j38
u7904	j90	j84	j82	j79	j98
u5462	j40	j98	j81	j10	j45
u13120	j73	j78	j80	j90	j77
u22827	j76	j91	j72	j89	j96
u20747	j80	j89	j78	j37	j100
u1143	j91	j72	j80	j76	j89
u7602	j93	j89	j97	j72	j81
u12658	j73	j79	j89	j93	j72
u5021	j96	j89	j93	j88	j76
u6457	j81	j95	j79	j87	j41

The most-recommended jokes are presented below:

Joke	n
j89	9
j72	7
j78	5
j80	5
j93	5

Comparison

The five most-recommended jokes of the two models have three jokes in common (j89, j72, and j93), including the most-recommended joke in both, j89:

A radio conversation of a US naval ship with Canadian authorities … Americans: Please divert your course 15 degrees to the North to avoid a collision. Canadians: Recommend you divert YOUR course 15 degrees to the South to avoid a collision. Americans: This is the Captain of a US Navy ship. I say again, divert YOUR course. Canadians: No. I say again, you divert YOUR course. Americans: This is the aircraft carrier USS LINCOLN, the second largest ship in the United States’ Atlantic Fleet. We are accompanied by three destroyers, three cruisers and numerous support vessels. I demand that you change your course 15 degrees north, that’s ONE FIVE DEGREES NORTH, or counter-measures will be undertaken to ensure the safety of this ship. Canadians: This is a lighthouse. Your call.

Of the 20 users evaluated, 13 (65%) have three or four recommendations common to both models. One user has zero matching recommendations; no user has fully matching recommendations. These results are presented in the histogram below:

The manual model completed in just over 60 seconds; the prebuilt model completed in 2.75 seconds, more than 20 times faster. This difference in run time indicates that the recommender system included in recommenderlab is likely the more practical choice. The package also contains other filtering techniques, which could be implemented and tested for precision without the need to build additional models.
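The per-user agreement reported in this section can be recomputed from the two recommendation tables. The sketch below does so for the first three users only, with their recommendations copied from the two Results tables; the object names manual, premade, overlap and overlap_counts are hypothetical.

```r
# Recommendations for the first three users, copied from the Results tables
manual <- rbind(u15547 = c("j89", "j93", "j76", "j81", "j83"),
                u21505 = c("j100", "j80", "j81", "j93", "j89"),
                u238   = c("j11", "j46", "j53", "j45", "j25"))
premade <- rbind(u15547 = c("j88", "j91", "j89", "j93", "j87"),
                 u21505 = c("j92", "j80", "j78", "j77", "j73"),
                 u238   = c("j25", "j52", "j45", "j3", "j11"))

# Number of jokes recommended to each user by both models
overlap <- sapply(rownames(manual), function(u) {
  length(intersect(manual[u, ], premade[u, ]))
})

# Distribution of overlap sizes from 0 to 5 (the data a histogram would show)
overlap_counts <- table(factor(overlap, levels = 0:5))
```

For these three users the overlaps are 2, 1 and 3 jokes respectively. With the full manual_recs and premade_recs data frames, each row would first need to be turned into a character vector, for example with unlist(manual_recs[i, ]).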


Dan Smilowitz

June 19, 2016
