The goal of this assignment is for you to try out different ways of implementing and configuring a recommender, and to evaluate your different approaches. For assignment 2, start with an existing dataset of user-item ratings, such as MovieLens, and implement at least two recommendation algorithms.
You should evaluate and compare different approaches, using different algorithms, normalization techniques, similarity methods, neighborhood sizes, etc. You don’t need to be exhaustive—these are just some suggested possibilities.
You may use the course text’s recommenderlab or any other library that you want. Please provide at least one graph, and a textual summary of your findings and recommendations.
In this project we will use the built-in MovieLense dataset from the recommenderlab library. We will implement and compare results from several recommendation algorithms. Our recommenders will predict movie ratings given user and movie profiles as well as users’ past ratings.
Our data is stored as a realRatingMatrix, where each row represents a user and each column represents a movie. We can see that there are a total of 99,392 ratings, so not every user has watched every movie. This means that our matrix is sparse – most cells have no information.
## 943 x 1664 rating matrix of class 'realRatingMatrix' with 99392 ratings.
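As a quick check (a sketch, not part of the original write-up), the sparsity can be confirmed directly from the counts: 99,392 ratings spread over 943 × 1,664 cells means only about 6% of the matrix is filled.
# rough sparsity check: share of user-movie cells that actually contain a rating
sum(rowCounts(MovieLense)) / (nrow(MovieLense) * ncol(MovieLense))
# ≈ 0.063, i.e. roughly 94% of the cells are empty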
With reference to code by Hahsler (2019), we explored the dataset to get a sense of how ratings are distributed across users and items. This will inform how we subset the data for modeling.
First, we can take a look at the number of ratings per user.
The data is severely right skewed, which indicates that most users have only rated a few movies.
Next, we can take a look at the number of ratings per movie.
This distribution is also severely right skewed, which means that most movies only have a few ratings.
This aligns with our original intuition – given the skewness of both distributions, it is clear that the dataset contains a lot of sparsity.
Next, we can look at the distribution of ratings.
# Visualise the distribution of observed ratings
# (`movielense` is assumed to hold the rating matrix in plain-matrix form)
movielense %>%
  as.vector() %>%
  as_tibble() %>%
  filter_all(any_vars(. != 0)) %>%   # drop the empty (unrated) cells
  ggplot(aes(value)) +
  geom_bar(fill = 'red') +
  labs(title = "Overall Distribution of Ratings", y = "", x = "Ratings") +
  theme_minimal()
The overall distribution of ratings is left skewed, indicating that people tend to rate movies positively. The mean rating is confirmed below:
## [1] 3.587565
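The chunk that produced this value is not shown above; one way to compute it (a minimal sketch) is to pull the vector of observed ratings directly:
# mean of the observed (non-missing) ratings only
mean(getRatings(MovieLense))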
The format of MovieLenseMeta is a data frame with movie title, year, IMDb URL and indicator variables for 19 genres.
## 'data.frame': 1664 obs. of 22 variables:
## $ title : chr "Toy Story (1995)" "GoldenEye (1995)" "Four Rooms (1995)" "Get Shorty (1995)" ...
## $ year : num 1995 1995 1995 1995 1995 ...
## $ url : chr "http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)" "http://us.imdb.com/M/title-exact?GoldenEye%20(1995)" "http://us.imdb.com/M/title-exact?Four%20Rooms%20(1995)" "http://us.imdb.com/M/title-exact?Get%20Shorty%20(1995)" ...
## $ unknown : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Action : int 0 1 0 1 0 0 0 0 0 0 ...
## $ Adventure : int 0 1 0 0 0 0 0 0 0 0 ...
## $ Animation : int 1 0 0 0 0 0 0 0 0 0 ...
## $ Children's : int 1 0 0 0 0 0 0 1 0 0 ...
## $ Comedy : int 1 0 0 1 0 0 0 1 0 0 ...
## $ Crime : int 0 0 0 0 1 0 0 0 0 0 ...
## $ Documentary: int 0 0 0 0 0 0 0 0 0 0 ...
## $ Drama : int 0 0 0 1 1 1 1 1 1 1 ...
## $ Fantasy : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Film-Noir : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Horror : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Musical : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Mystery : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Romance : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Sci-Fi : int 0 0 0 0 0 0 1 0 0 0 ...
## $ Thriller : int 0 1 1 0 1 0 0 0 0 0 ...
## $ War : int 0 0 0 0 0 0 0 0 0 1 ...
## $ Western : int 0 0 0 0 0 0 0 0 0 0 ...
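Since the genre columns are 0/1 indicators, a quick tally of how many movies carry each genre flag is straightforward (a small sketch, not part of the original exploration):
# count movies per genre; the first three columns are title, year, and url
sort(colSums(MovieLenseMeta[, -(1:3)]), decreasing = TRUE)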
We can also build a heatmap of the most relevant users and movies, consisting of the users that rated the most movies and the movies that have the most ratings. Dark columns represent highly rated movies and dark rows represent users that give high ratings. This indicates the need for normalization of the data – we don’t want the results skewed by users that predominantly rate high or low.
min_n_movies <- quantile(rowCounts(MovieLense), 0.99)
min_n_users  <- quantile(colCounts(MovieLense), 0.99)
image(MovieLense[rowCounts(MovieLense) > min_n_movies,
                 colCounts(MovieLense) > min_n_users],
      main = "Heatmap of the top users and movies")
Our data prep will involve selecting relevant data (movies that have many ratings and users that have rated many movies) and splitting the result into training and test sets. Once the data is cleaned up, we can start to build our recommender models.
It is obvious that the dataset is very sparse. It includes many users who rated very few movies and many movies that received very few ratings. To have a healthy baseline on which to build our recommendation models, we keep only those users who have rated at least 50 movies and those movies that have been rated by at least 100 users. A look at the distribution of average rating per user shows large variation across this subset.
# base population
most_rated <- MovieLense[rowCounts(MovieLense) > 50, colCounts(MovieLense) > 100]
average_ratings_per_user <- rowMeans(most_rated)
qplot(average_ratings_per_user) +
  stat_bin(binwidth = 0.1) +
  ggtitle("Distribution of the average rating per user")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Next, we can split the data into train and test sets. We are left with 462 records in the training set and 98 records in the test set.
# split the dataset into train and test sets (80/20)
# (no seed is set here, so the exact split will vary between runs)
which_train <- sample(x = c(TRUE, FALSE), size = nrow(most_rated), replace = TRUE, prob = c(0.8, 0.2))
train_set <- most_rated[which_train, ]
test_set  <- most_rated[!which_train, ]
paste0('Training Set: ', nrow(train_set), ' records | Test Set: ', nrow(test_set), ' records')
## [1] "Training Set: 462 records | Test Set: 98 records"
Now that we’ve cleaned up our data, we can configure and implement our recommender systems. There are 3 main filtering approaches: content-based, user-user collaborative, and item-item collaborative.
In content-based filtering, recommendations are driven by item attributes – a profile of the user’s preferences is built from the content of previously rated items, and the algorithm recommends new items that are similar to those items.
In collaborative filtering, recommendations are based on information about similar users (user-user) or similar items (item-item). The recommenderlab library supports three similarity measures that we will explore: cosine, Pearson, and Jaccard.
For this project, we will focus on user-user and item-item based collaborative filtering.
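To make the three similarity measures concrete, recommenderlab's similarity() function can be applied to a handful of users (an illustrative sketch, not part of the modeling pipeline):
# pairwise user similarities under each measure, for the first four users
similarity(MovieLense[1:4, ], method = "cosine",  which = "users")
similarity(MovieLense[1:4, ], method = "pearson", which = "users")
similarity(MovieLense[1:4, ], method = "jaccard", which = "users")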
User-based collaborative filtering: This recommends to a user the items that are most preferred by similar users.
The parameters for this approach are similar to those of item-item filtering: the similarity measure (cosine, Pearson, or Jaccard), the number of nearest neighbors, and the normalization technique.
We’ll start off with a simple, non-optimized user-user based model to get a feel for the functions. We’ll use the training set to fit the model and the test set to generate recommendations. Here, we’ve defined the number of recommendations as 5, the similarity method as cosine, and the number of neighbors as 5. We’ll plot the distribution of recommendations.
# recommend 5 movies to each user
n_recommended <- 5
# create model on training set
recc_model <- Recommender(data = train_set, method = "UBCF", parameter = list(method = 'cosine', nn = 5, normalize = 'center' ))
# predict on test set
recc_predict <- predict(object = recc_model, newdata = test_set, n = n_recommended)
# create a matrix of predictions
recc_matrix <- sapply(recc_predict@items, function(x){
  colnames(most_rated)[x]
})
# plot distribution of movie recommendations
numRec <- factor(table(recc_matrix))
qplot(numRec,
      main = 'Distribution of Movie Recommendations',
      ylab = 'Number of Movies',
      xlab = 'Times Recommended')
We can see that most movies are recommended only a few times.
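For a quick sanity check (a sketch; this inspection is not in the original output), the actual titles recommended to the first few test users can be pulled straight from the prediction object:
# recommended movie titles for the first three test users
head(as(recc_predict, "list"), 3)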
Now, we can optimize our model by tweaking the similarity method, number of neighbors, and normalization type. We’ll create a function to calculate and record the RMSE, MSE, and MAE for each set of parameters.
set.seed(200)
# function to calculate error for a given similarity method, neighborhood size,
# and normalization technique
errorCalc <- function(e, sm, nb, nrm, evl){
  # create a user-based CF recommender using training data
  rec <- Recommender(getData(e, "train"),
                     evl,
                     parameter = list(method = sm,
                                      nn = nb,
                                      normalize = nrm))
  # create predictions for the test data using known ratings
  pred <- predict(rec, getData(e, "known"), type = "ratings")
  # avg error metrics per user, averaged over all recommendations
  error.ubcf <- calcPredictionAccuracy(pred, getData(e, "unknown"))
  # top-N list metrics (n = 5 here; computed for reference but not returned)
  pred.top10 <- predict(rec, getData(e, "known"), type = "topNList", n = 5)
  error.ubcf.top10 <- calcPredictionAccuracy(pred.top10,
                                             getData(e, "unknown"),
                                             given = 5,
                                             goodRating = 3)
  error.ubcf <- c(error.ubcf, sm, nb, nrm)
  names(error.ubcf) <- c('RMSE', 'MSE', 'MAE', 'similarity',
                         'neighbors', 'norm')
  return(error.ubcf)
}
In this case, we’ll define an evaluation scheme that consists of 10-fold cross validation with a “good rating” defined as 3. We will then cycle through all parameter combinations to identify the optimal one.
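Before running the full grid search, a single call to errorCalc() can be sanity-checked on an ad-hoc 80/20 split scheme (a sketch; `e_check` below is hypothetical and separate from the cross-validation scheme defined next):
# hypothetical one-off check of errorCalc() with one parameter combination
e_check <- evaluationScheme(most_rated, method = "split", train = 0.8,
                            given = 10, goodRating = 3)
errorCalc(e_check, sm = 'cosine', nb = 25, nrm = 'center', evl = 'UBCF')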
# cycle through all parameter combinations
simMethod <- c('cosine', 'pearson', 'jaccard')
neighbors <- seq(5, 100, 5)
norm <- c('center', 'Z-score')
evalType <- "UBCF"
# 10-fold cross validation, good ratings = 3, withholding 10 items for eval
e <- evaluationScheme(most_rated, method = "cross-validation", k = 10, given = 10, goodRating = 3)
# create list for errors
errors.names <- c("RMSE", "MSE", "MAE", 'similarity', 'neighbors', 'norm')
ubcfErrors.ls <- vector("list", length(errors.names))
names(ubcfErrors.ls) <- errors.names
for (i in simMethod){
  for (j in neighbors){
    for (k in norm){
      ubcfErrors.ls <- rbind(ubcfErrors.ls, errorCalc(e, i, j, k, evalType))
    }
  }
}
We can now visualize the errors:
finalErrors.ubcf <- as.data.frame(ubcfErrors.ls) %>%
  filter(RMSE != 'NULL') %>%
  mutate(RMSE = as.double(RMSE),
         MSE = as.double(MSE),
         MAE = as.double(MAE))
finalErrors.ubcf$ERROR_MEAN <- rowMeans(subset(finalErrors.ubcf, select = c(RMSE, MSE, MAE)), na.rm = TRUE)
ggplot(finalErrors.ubcf, aes(1:nrow(finalErrors.ubcf))) +
  geom_line(aes(y = RMSE), colour = "red") +
  geom_line(aes(y = MSE), colour = "green") +
  geom_line(aes(y = MAE), colour = "blue") +
  geom_line(aes(y = ERROR_MEAN), colour = "purple") +
  xlab('Trial Run') +
  ylab('Error') +
  ggtitle('Errors across all runs')
To identify which set of parameters to use, we find the combination with the lowest mean error.
paramVals.ubcf <- finalErrors.ubcf %>%
  filter(ERROR_MEAN == min(finalErrors.ubcf$ERROR_MEAN))
paramVals.ubcf
##        RMSE       MSE      MAE similarity neighbors   norm ERROR_MEAN
## 1 0.9588247 0.9193448 0.748171    pearson       100 center  0.8754469
This is the set of parameters we will use for the final comparison.
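As a sketch (not part of the original code), fitting a single UBCF recommender with that winning combination would look like this:
# UBCF recommender with the best-performing parameters from the grid search
final_ubcf <- Recommender(train_set, method = "UBCF",
                          parameter = list(method = 'pearson', nn = 100,
                                           normalize = 'center'))
final_ubcf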
Item-based collaborative filtering: This recommends to a user the items that are most similar to the items the user has previously rated.
Here, we mimic our prior approach to user-based collaborative filtering.
# recommend 5 movies to each user
n_recommended_i <- 5
# create model on training set
recc_model_i <- Recommender(data = train_set, method = "IBCF", parameter = list(method = 'cosine', k = 5, normalize = 'center'))
# predict on test set
recc_predict_i <- predict(object = recc_model_i, newdata = test_set, n = n_recommended_i)
# create a matrix of predictions
recc_matrix_i <- sapply(recc_predict_i@items, function(x){
  colnames(most_rated)[x]
})
# plot distribution of movie recommendations
numRec_i <- factor(table(recc_matrix_i))
qplot(numRec_i,
      main = 'Distribution of Movie Recommendations',
      ylab = 'Number of Movies',
      xlab = 'Times Recommended')
The results are very similar to UBCF – a large number of movies are recommended only once.
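One useful by-product of IBCF is the learned item-item similarity matrix; a hedged sketch of how to peek at it via getModel():
# inspect the fitted IBCF model: it stores, for each movie, the similarities
# to its k most similar movies
ibcf_details <- getModel(recc_model_i)
dim(ibcf_details$sim)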
Next, we’ll define an IBCF-specific function that calculates the error associated with differing parameters.
# function to calculate error
errorCalc_i <- function(e, sm, kn, nrm, evl){
  set.seed(200)
  # create an item-based CF recommender using training data
  rec <- Recommender(getData(e, "train"), evl, parameter = list(method = sm, k = kn, normalize = nrm))
  # create predictions for the test data using known ratings
  pred <- predict(rec, getData(e, "known"), type = "ratings")
  # avg error metrics per user, averaged over all recommendations
  error.ibcf <- calcPredictionAccuracy(pred, getData(e, "unknown"))
  error.ibcf <- c(error.ibcf, sm, kn, nrm)
  names(error.ibcf) <- c('RMSE', 'MSE', 'MAE', 'similarity', 'neighbor items', 'norm')
  return(error.ibcf)
}
Finally, we’ll cycle through all combinations of our parameters and calculate the error associated with each.
# Model parameters
simMethod <- c('cosine', 'pearson', 'jaccard')
vector_k <- seq(5, 100, 5)
norm <- c('center', 'Z-score')
evalType <- "IBCF"
# 10-fold cross validation, good ratings = 3, withholding 10 items for eval
e <- evaluationScheme(most_rated, method = "cross-validation", k = 10, given = 10, goodRating = 3)
# create list for errors
errors.names <- c("RMSE", "MSE", "MAE", 'similarity', 'k-closest items', 'norm')
ibcfErrors.ls <- vector("list", length(errors.names))
names(ibcfErrors.ls) <- errors.names
for (i in simMethod) {
  for (j in vector_k) {
    for (k in norm){
      ibcfErrors.ls <- rbind(ibcfErrors.ls, errorCalc_i(e, i, j, k, evalType))
    }
  }
}
We can now identify which set of parameters yields the lowest average error. We will use this as the basis of comparison against the UBCF methodology.
# Pulling the lists of errors into a dataframe
finalErrors.ibcf <- as.data.frame(ibcfErrors.ls) %>%
  filter(RMSE != 'NULL') %>%
  mutate(RMSE = as.double(RMSE),
         MSE = as.double(MSE),
         MAE = as.double(MAE))
# Adding a column containing the mean error
finalErrors.ibcf$ERROR_MEAN <- rowMeans(subset(finalErrors.ibcf, select = c(RMSE, MSE, MAE)), na.rm = TRUE)
# Plotting the errors
ggplot(finalErrors.ibcf, aes(1:nrow(finalErrors.ibcf))) +
  geom_line(aes(y = RMSE), colour = "red") +
  geom_line(aes(y = MSE), colour = "green") +
  geom_line(aes(y = MAE), colour = "blue") +
  geom_line(aes(y = ERROR_MEAN), colour = "purple") +
  xlab('Trial Run') +
  ylab('Error') +
  ggtitle('Errors across all runs')
The combination of parameters that minimizes the mean error uses Pearson’s correlation, k = 100 closest items, and Z-score normalization. Already, we can see that the error from the IBCF methodology is much higher than from the UBCF methodology.
# Finding the parameter values that minimize the mean error
paramVals.ibcf <- finalErrors.ibcf %>%
  filter(ERROR_MEAN == min(finalErrors.ibcf$ERROR_MEAN))
paramVals.ibcf
##       RMSE      MSE       MAE similarity k-closest items    norm ERROR_MEAN
## 1 1.150618 1.323922 0.8954775    pearson             100 Z-score   1.123339
For our final evaluation, we compare the user-user and item-item based models, each optimized with those combinations of parameters that minimize mean error. We plot a ROC curve and Precision-recall for each model, and then examine the differences.
User-based collaborative filtering
# Evaluating the UBCF model (cosine similarity, nn = 100, centering)
results.ubcf <- evaluate(e, 'UBCF', parameter = list(method = 'cosine',
                                                     nn = 100,
                                                     normalize = 'center'))
## UBCF run fold/sample [model time/prediction time]
## 1 [0sec/0.17sec]
## 2 [0.02sec/0.17sec]
## 3 [0.02sec/0.15sec]
## 4 [0sec/0.22sec]
## 5 [0.02sec/0.22sec]
## 6 [0sec/0.19sec]
## 7 [0.02sec/0.17sec]
## 8 [0sec/0.15sec]
## 9 [0sec/0.17sec]
## 10 [0.02sec/0.37sec]
# Plotting True Positive Rate (TPR) against False Positive Rate (FPR)
plot(results.ubcf, annotate = TRUE, main = "ROC curve")
# Plotting Precision-recall
plot(results.ubcf, "prec/rec", annotate = TRUE, main = "Precision-recall - UBCF")
Item-based collaborative filtering
# Evaluating using the parameter values - Pearson's, k = 100, Z-score normalization
results.ibcf <- evaluate(e, 'IBCF', parameter = list(method = paramVals.ibcf[[4]],
                                                     k = as.integer(paramVals.ibcf[[5]]),
                                                     normalize = paramVals.ibcf[[6]]))
## IBCF run fold/sample [model time/prediction time]
## 1 [0.88sec/0.03sec]
## 2 [0.68sec/0.05sec]
## 3 [0.65sec/0.04sec]
## 4 [0.66sec/0.03sec]
## 5 [0.68sec/0.03sec]
## 6 [0.68sec/0.03sec]
## 7 [0.91sec/0.03sec]
## 8 [0.66sec/0.05sec]
## 9 [0.58sec/0.01sec]
## 10 [0.51sec/0.02sec]
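Before the side-by-side comparison, the fold-level confusion-matrix entries behind each model's curves can be inspected directly (a sketch, not in the original output):
# confusion matrix entries for the first cross-validation fold of the UBCF run
getConfusionMatrix(results.ubcf)[[1]]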
Comparison
Finally, we can compare the UBCF and IBCF models to each other.
# UBCF and IBCF models with optimal parameters
models_to_evaluate <- list(
  results.ubcf = list(name = "UBCF", param = list(method = 'cosine', nn = 100, normalize = 'center')),
  results.ibcf = list(name = "IBCF", param = list(method = paramVals.ibcf[[4]],
                                                  k = as.integer(paramVals.ibcf[[5]]),
                                                  normalize = paramVals.ibcf[[6]]))
)
# Evaluating at various numbers of recommendations
n_recommendations <- seq(5, 100, 5)
list_results <- evaluate(e, method = models_to_evaluate, n = n_recommendations)
## UBCF run fold/sample [model time/prediction time]
## 1 [0.02sec/0.12sec]
## 2 [0sec/0.11sec]
## 3 [0sec/0.12sec]
## 4 [0.02sec/0.29sec]
## 5 [0sec/0.12sec]
## 6 [0sec/0.14sec]
## 7 [0sec/0.16sec]
## 8 [0sec/0.38sec]
## 9 [0.01sec/0.18sec]
## 10 [0.02sec/0.17sec]
## IBCF run fold/sample [model time/prediction time]
## 1 [0.76sec/0.07sec]
## 2 [0.84sec/0.03sec]
## 3 [0.65sec/0.05sec]
## 4 [0.72sec/0.03sec]
## 5 [0.63sec/0.05sec]
## 6 [1.05sec/0.03sec]
## 7 [0.75sec/0.03sec]
## 8 [0.75sec/0.05sec]
## 9 [0.72sec/0.05sec]
## 10 [0.68sec/0.03sec]
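The averaged metrics behind the curves can also be tabulated per number of recommendations (a sketch following recommenderlab's standard avg() usage; not part of the original output):
# averaged evaluation metrics for each model, per number of recommendations
avg_matrices <- lapply(list_results, avg)
head(avg_matrices$results.ubcf)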
# Plotting and comparing the ROC curves for each model
plot(list_results, legend = "topleft", main = "ROC curve")
This analysis suggests that, for our subset of the MovieLense dataset, a user-based collaborative filtering model with optimized parameters outperforms its item-based collaborative filtering counterpart.
Hahsler, M. (2019, August 27). MovieLense Dataset (100k). Rdrr.Io. https://rdrr.io/cran/recommenderlab/man/MovieLense.html
R-bloggers. (2014, December 24). Recommender systems 101 – a step by step practical example in R. https://www.r-bloggers.com/recommender-systems-101-a-step-by-step-practical-example-in-r/