The goal of this assignment is for you to try out different ways of implementing and configuring a recommender, and to evaluate your different approaches. Start with an existing dataset of user-item ratings, such as our toy books dataset, MovieLens, Jester [http://eigentaste.berkeley.edu/dataset/] or another dataset of your choosing.
Implement at least two of these recommendation algorithms:
+ Content-Based Filtering
+ User-User Collaborative Filtering
+ Item-Item Collaborative Filtering
You should evaluate and compare different approaches, using different algorithms, normalization techniques, similarity methods, neighborhood sizes, etc. You don’t need to be exhaustive—these are just some suggested possibilities.
You may use the course text’s recommenderlab or any other library that you want. Please provide at least one graph, and a textual summary of your findings and recommendations.
Loading packages and the MovieLense dataset
Similar to our Building a Recommendation System with R book, I’ll use the MovieLense dataset for this project. In order to access this data, we’ll need to load the recommenderlab package and the MovieLense data stored within this package.
require(recommenderlab)
require(ggplot2)
data("MovieLense")
class(MovieLense)
## [1] "realRatingMatrix"
## attr(,"package")
## [1] "recommenderlab"
require(knitr)        # kable() comes from knitr; there is no separate 'kable' package
require(kableExtra)
require(devtools)
require(tidyverse)
require(dplyr)
require(stats)
Before we start to build our recommender systems, we can do some basic data exploration to get a better handle on our ratings. First, we can show the dimensions of our user-item matrix:
dim(MovieLense)
## [1] 943 1664
As we can see, there are 943 users and 1,664 movies in this dataset, which gives 1,569,152 possible user-item combinations (943 * 1664).
We can also take a look at the frequency of ratings for all movies on a 1-5 scale, with 0 indicating movies that the user has not rated.
vector_ratings <- as.vector(MovieLense@data)
table_ratings <- table(vector_ratings)
table_ratings
## vector_ratings
## 0 1 2 3 4 5
## 1469760 6059 11307 27002 33947 21077
From our table above, we can see that a large majority of movies are unrated by users (which is to be expected). However, there is also a pretty substantial number of ratings that we can use later to continue to build out our recommendations.
Now, we can filter out all of the values equal to 0 to isolate only the user-item combinations that actually contain ratings:
vector_ratings <- vector_ratings[vector_ratings != 0]
Then, we can create a quick histogram to see how users are typically rating the 1,664 movies.
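A quick sketch of such a histogram with ggplot2, treating each rating value as a category so that each score gets its own bar:
vector_ratings_factor <- factor(vector_ratings)  # one level per rating value (1-5)
qplot(vector_ratings_factor) +
  ggtitle("Distribution of the ratings") +
  xlab("Rating")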
From the histogram above, we can see that a large majority of ratings fall in the 3 to 5 range, with 4 being the most common rating for all movies.
Visualizing the matrix
We can also use the image function in the recommenderlab package to help visualize the user-item matrix. To do this, we can generate an image below of a small subset of our larger matrix. Here, I subsetted to show 48 users and their ratings for the first 82 movies in the matrix:
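A sketch of how this heatmap can be generated with the image method, assuming the subset is simply the first 48 rows and first 82 columns of the matrix:
image(MovieLense[1:48, 1:82], main = "Heatmap of the first 48 users and 82 movies")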
Later on, we’ll be utilizing our user-item matrix to generate recommendations on a test set of data. However, it’ll be important that we utilize ratings and data from non-sparse samples of our training set in order to do so. As an example, we can find the top five percent of users and movies in our matrix.
min_n_movies <- quantile(rowCounts(MovieLense), 0.95)
min_n_users <- quantile(colCounts(MovieLense), 0.95)
image(MovieLense[rowCounts(MovieLense) > min_n_movies, colCounts(MovieLense) > min_n_users], main = "Heatmap of top 5% of users and movies")
We can quickly see that this heatmap is much less sparse (fewer white cells) than our previous matrix, which showed an arbitrary subset of users and movies.
Average Ratings
We can also take a look at the distribution of average ratings across all of our movies:
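We first need the per-movie averages; here is a minimal sketch of computing and plotting them, using recommenderlab's colMeans (which averages only the ratings that are actually present for each movie):
average_ratings <- colMeans(MovieLense)  # per-movie average over existing ratings only
qplot(average_ratings, binwidth = 0.1) +
  ggtitle("Distribution of the average movie rating")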
As mentioned in the textbook, a few movies in this distribution are rated quite low (around 1) and a few quite high (around 5). These extremes likely reflect small sample sizes that skew the averages. To account for this, we can subset our data to only include movies that have been rated by at least 100 users:
views_per_movie <- colCounts(MovieLense)
average_ratings_update <- average_ratings[views_per_movie >= 100]
We can now re-run our histogram:
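A matching sketch for the filtered averages:
qplot(average_ratings_update, binwidth = 0.1) +
  ggtitle("Average rating of movies with at least 100 ratings")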
We can see that this is a bit cleaner, and shows a more normal distribution, removing movies that had uncharacteristically low or high ratings.
Similar to earlier, we'll need to select the most relevant data in order to set ourselves up well to build our recommenders. We will need to:
+ Ensure that our training dataset does not include movies that have been rated only a handful of times (the lack of data could bias the ratings of these movies).
+ Ensure that our training dataset does not include users who have rated only a few movies, since their ratings might be biased and we would not be able to compute reliable similarities with other users.
Establishing a threshold
In order to account for these two items, we'll start with a generic threshold similar to the one in our textbook. Our ratings matrix will include only:
+ users who have rated more than 50 movies, and
+ movies that have been rated by more than 100 users.
We can therefore filter the data with the following syntax:
min_n_movies <- 50
min_n_users <- 100
ratings_movies <- MovieLense[rowCounts(MovieLense) > min_n_movies, colCounts(MovieLense) > min_n_users]
ratings_movies
## 560 x 332 rating matrix of class 'realRatingMatrix' with 55298 ratings.
Our ratings matrix is now much smaller, reduced from 943 users to 560 users and from 1,664 movies to 332 movies. Although this subsets the matrix by quite a lot, the remaining data is still large enough to generate reliable recommendations.
Visualizing the top 5 percent of users and movies in the new matrix
min_movies <- quantile(rowCounts(ratings_movies), 0.95)
min_users <- quantile(colCounts(ratings_movies), 0.95)
image(ratings_movies[rowCounts(ratings_movies) > min_movies, colCounts(ratings_movies) > min_users])
A quick note on normalization
Before progressing, we need to take a look at the average ratings across our different users. It'll be helpful to normalize the ratings in order to limit bias, since users who habitually give high (or low) ratings can skew recommendations. Fortunately, the Recommender function in the recommenderlab package already does this for us. However, I'll quickly demonstrate one way to normalize our data by centering each user's ratings at zero, using the normalize function in R:
ratings_movies_norm <- normalize(ratings_movies)
This creates a matrix in which each user's ratings are centered at zero: instead of the original 1 to 5 scale, the values now range roughly from -4 to 4, with negative values indicating ratings below that user's average. We can see this in the updated image of the matrix below:
image(ratings_movies_norm[rowCounts(ratings_movies_norm) > min_movies, colCounts(ratings_movies_norm) > min_users])
Although a fair number of ratings are dark red or blue, indicating that they sit far from the new average of 0, we can confirm that the average rating for each user is now zero:
round(sum(rowMeans(ratings_movies_norm)), 2)
## [1] 0
After doing a bit of data exploration, and working through normalization, we can run our first recommender algorithm.
For our first algorithm, we’ll use an Item-based collaborative filtering method to measure similarities between movies. We can then use these similarity metrics to recommend similar movies to users based on a user’s previous ratings.
To do this, we first need to set up the training and testing sets. I’ll do this manually below, however in future projects we can utilize the built-in functions from the recommenderlab package to do this:
set.seed(123)
training_cast <- sample(x = c(TRUE, FALSE), size = nrow(ratings_movies), replace = TRUE, prob = c(0.8, 0.2))
head(training_cast)
## [1] TRUE TRUE TRUE FALSE FALSE TRUE
After setting a seed to keep the sample output consistent across runs of our code, we take the rows of our user-item matrix and perform an 80/20 training/testing split. The output shows the first six values of the vector, which we then use to split our ratings_movies matrix into training and testing datasets.
training_data <- ratings_movies[training_cast, ]
testing_data <- ratings_movies[!training_cast, ]
dim(training_data)
## [1] 452 332
dim(testing_data)
## [1] 108 332
We can see above that our datasets have been split to include 452 users and their corresponding ratings in the training dataset, and 108 users and their corresponding ratings in the testing dataset.
Now, we can use the built-in Recommender function in the recommenderlab package to run our first IBCF model.
ibcf <- Recommender(data = training_data, method = "IBCF", parameter = list(method="cosine", k=30, normalize = "center"))
We set the following parameters: the cosine method to compute similarities between items, k = 30 to keep the 30 most similar movies for each movie, and 'center' normalization of the ratings.
We can then take a look at some of the details of the IBCF model that we ran:
ibcf_details <- getModel(ibcf)
dim(ibcf_details$sim)
## [1] 332 332
As expected, our model generated a square similarity matrix with 332 rows and 332 columns. This matrix stores the cosine similarity for each pair of movies, and the model then keeps only the k = 30 most similar movies for each movie.
As you can see below, each row has 30 elements that are greater than 0 (the shaded gray cells). As mentioned in the textbook, the number of gray cells going down a column doesn't necessarily equal 30, since a column's count depends on how many times the corresponding movie appears in the top-30 lists of other movies.
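A sketch of this kind of heatmap, looking at the first 20 rows and columns of the similarity matrix (the retained neighbors show up as shaded, non-zero cells):
image(ibcf_details$sim[1:20, 1:20], main = "Heatmap of the first 20 rows and columns")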
To show this distribution across columns, we can create a histogram:
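A sketch of the histogram, counting for each movie how many neighbor lists it appears in:
col_sums <- colSums(ibcf_details$sim > 0)  # number of top-30 lists each movie appears in
qplot(col_sums) + ggtitle("Distribution of the column count")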
As we can see from this distribution, there are a few movies that are similar to most others, with column sums greater than 100 (meaning that they were neighbors to more than 100 other movies based on their Cosine Similarity scores).
We can isolate the top six movies that are most similar to all others in the following dataframe:
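One way to build such a table (a sketch that reuses the col_sums counts from above; kable comes from knitr):
top_similar <- head(sort(col_sums, decreasing = TRUE), 6)  # six largest column counts
kable(data.frame(col_sums = top_similar))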
| Movie | col_sums |
|---|---|
| Close Shave, A (1995) | 124 |
| Mimic (1997) | 120 |
| Jungle2Jungle (1997) | 115 |
| Batman Forever (1995) | 111 |
| Spawn (1997) | 109 |
| Down Periscope (1996) | 101 |
Applying model to the test set
With our model trained on our training dataset, we can now apply it to our test dataset of users. We’ll set n_recommended equal to six to recommend six movies to each user:
n_recommended <- 6
ibcf_predictions <- predict(object = ibcf, newdata = testing_data, n = n_recommended)
We now have recommendations for each user in our test set! We can save these in a matrix:
recc_matrix_ibcf <- sapply(ibcf_predictions@items, function(x){
colnames(ratings_movies)[x]
})
As an example of our output, the first user in our test set would have these six movies as recommendations:
recc_matrix_ibcf[, 1]
## [1] "Babe (1995)" "Phenomenon (1996)"
## [3] "Time to Kill, A (1996)" "Heat (1995)"
## [5] "African Queen, The (1951)" "Net, The (1995)"
Now, in order to identify the most recommended movies from our IBCF model, we can create a histogram:
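A sketch of this histogram, counting how many test users each movie was recommended to:
number_of_items_ibcf <- table(recc_matrix_ibcf)  # recommendations per movie across all test users
qplot(as.vector(number_of_items_ibcf), binwidth = 1) +
  ggtitle("Number of times each movie was recommended (IBCF)") +
  xlab("Times recommended")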
From above, we can see that a large majority of movies were only recommended to users 1 or 2 times.
The movies that are very far to the right on our distribution, ones that have been recommended to users at least 9 times, are the following:
| Movie | Times recommended |
|---|---|
| Beautiful Girls (1996) | 13 |
| Con Air (1997) | 11 |
| Kiss the Girls (1997) | 11 |
| Spawn (1997) | 11 |
| Cable Guy, The (1996) | 9 |
Running our IBCF model again, while calculating error
Although we have run our IBCF model above and generated movie recommendations for each user in our test dataset, it's difficult to determine how effective those recommendations are. Therefore, I'm going to run the IBCF model multiple times (using different neighborhood sizes and other parameters) and compute error metrics in order to identify the optimal IBCF model for our dataset. We can utilize the built-in recommenderlab functions to do this.
We can step back and utilize our already-subsetted data ratings_movies to set up our evaluation scheme. This time, instead of splitting our data into an 80/20 training and testing dataset, I'll use 10-fold cross-validation to create 10 equal test sets, as recommended in our textbook. Additionally, we can indicate that a 'good rating' is a 3 or higher:
e <- evaluationScheme(ratings_movies, method = "cross-validation", k=10, given=10, goodRating=3)
Then, similar to week one, we can utilize error calculations such as Root Mean Squared Error (RMSE), Mean Squared Error (MSE), and Mean Absolute Error (MAE) to determine how good our model(s) are at providing effective movie recommendations to users.
Instead of running our model in pieces as we did earlier, I'll combine everything into one function called errorOptimization, which does the following:
+ calls the Recommender function with the given parameters to create the recommender model
+ generates ratings predictions on the 'known' portion of the test set with the predict function
+ uses the calcPredictionAccuracy function to calculate the RMSE, MSE, and MAE
Here's the structure:
errorOptimization <- function(evalScheme, similarity_calc, neighborhood_size, normalization_technique, algorithm_type){
  # IBCF calls its neighborhood-size parameter 'k', while UBCF calls it 'nn'
  if (toupper(algorithm_type) == "IBCF") {
    params <- list(method = similarity_calc, k = neighborhood_size, normalize = normalization_technique)
  } else {
    params <- list(method = similarity_calc, nn = neighborhood_size, normalize = normalization_technique)
  }
  # build the recommender model on the training portion of the evaluation scheme
  recommender <- Recommender(getData(evalScheme, 'train'), algorithm_type, parameter = params)
  # create the predictions on the test set using the model
  predictions <- predict(recommender, getData(evalScheme, 'known'), type = 'ratings')
  # compute RMSE, MSE, and MAE against the held-out ('unknown') ratings
  error <- calcPredictionAccuracy(predictions, getData(evalScheme, 'unknown'))
  # store these error metrics, along with the parameters used, in one vector
  error_vector <- c(error, similarity_calc, neighborhood_size, normalization_technique)
  # create headers for the vector
  names(error_vector) <- c('RMSE', 'MSE', 'MAE', 'similarity_calc', 'neighbors', 'norm')
  return(error_vector)
}
Now, with the function ready to go, it will be interesting to see what happens to our error values as we vary the parameters. By creating vectors of the different similarity calculations, neighborhood sizes, and normalization techniques, we can collect every combination into one data frame and use it to see which configuration is optimal for our recommendations:
neighborhood_size <- seq(5, 100, 1)
normalization_techniques <- c('center', 'Z-score')
similarity_calcs <- c('cosine', 'pearson')
algorithm_type <- 'ibcf'
# create empty vector
errors_list <- c()
# cycle through the similarity, neighbor, and normalization vectors, append errors to empty vector
for (s in similarity_calcs){
for(k in neighborhood_size){
for(n in normalization_techniques){
errors_list <- rbind(errors_list, errorOptimization(e, s, k, n, algorithm_type))
}
}
}
# change the data types of the error values from characters to numbers
errors_df <- as.data.frame(errors_list)
errors_df$RMSE <- as.numeric(as.character(errors_df$RMSE))
errors_df$MSE <- as.numeric(as.character(errors_df$MSE))
errors_df$MAE <- as.numeric(as.character(errors_df$MAE))
errors_df$AVG_ERROR <- rowMeans(subset(errors_df, select=c(RMSE, MSE, MAE)))
Based on the different parameters and techniques included above, we can plot the average error of the runs that used 'center' normalization (blue) against the runs that used 'Z-score' normalization (red) to see which yielded higher errors:
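A sketch of how this plot can be generated, following the same pattern as the similarity-function plot shown later in the UBCF section (the intermediate data frames here are just helper names):
errors_df_center <- errors_df %>% filter(norm == 'center')
errors_df_zscore <- errors_df %>% filter(norm != 'center')
norm_df <- data.frame(center = errors_df_center$AVG_ERROR,
                      zscore = errors_df_zscore$AVG_ERROR)
ggplot(norm_df, aes(x = 1:nrow(norm_df))) +
  geom_line(aes(y = center), color = "blue") +
  geom_line(aes(y = zscore), color = "red") +
  ylab("Average Error (RMSE, MSE, MAE)") +
  xlab("Model runs") +
  ggtitle("Average Error by Normalization Technique")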
In the same way, we can compare the average error of the runs that used the 'cosine' similarity function (blue) against the runs that used 'pearson' (red):
From the visuals above, it looks like the pearson similarity method is more effective than cosine, while neither normalization technique clearly outperforms the other for our IBCF model. Now, we can find our optimal IBCF model by finding the minimum AVG_ERROR value across all of our model runs:
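A minimal sketch of pulling that row out of errors_df:
kable(errors_df[which.min(errors_df$AVG_ERROR), ])  # run with the lowest average error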
| RMSE | MSE | MAE | similarity_calc | neighbors | norm | AVG_ERROR |
|---|---|---|---|---|---|---|
| 1.069733 | 1.144328 | 0.8238457 | pearson | 100 | Z-score | 1.012635 |
We can hold onto these values to compare against the error outputs of our second algorithm, a User-Based Collaborative Filtering method.
Utilizing the same function above, we can change the Recommender method from 'ibcf' to 'ubcf' while keeping the same cross-validation process and error calculations. However, this time, instead of identifying similar movies that have been seen and rated by the same people and recommending movies based on those item similarities, we'll take a user-based approach: measure similarities between users, find each user's nearest neighbors, and recommend the movies that those similar users rated highly.
We’ll use the same training/testing dataset from earlier for this initial run:
ubcf <- Recommender(training_data, method='ubcf', parameter= list(method = 'cosine', nn = 10, normalize = 'center'))
Applying model to the test set
After this, we can use our model to make predictions on our test dataset from earlier:
ubcf_predictions <- predict(ubcf, testing_data, n_recommended)
We now have recommendations for each user in our test set! We can save these in a matrix:
recc_matrix_ubcf <- sapply(ubcf_predictions@items, function(x){
colnames(ratings_movies)[x]
})
As an example, the first user in our test set would have these six movies as recommendations:
recc_matrix_ubcf[,1]
## [1] "Good Will Hunting (1997)" "Edge, The (1997)"
## [3] "Devil's Advocate, The (1997)" "Wag the Dog (1997)"
## [5] "Nikita (La Femme Nikita) (1990)" "Contact (1997)"
Now, in order to identify the most recommended movies from our UBCF model, we can create a histogram:
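This follows the same pattern as the IBCF histogram above (a quick sketch):
number_of_items_ubcf <- table(recc_matrix_ubcf)
qplot(as.vector(number_of_items_ubcf), binwidth = 1) +
  ggtitle("Number of times each movie was recommended (UBCF)") +
  xlab("Times recommended")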
Interestingly, we can see that the UBCF model tends to recommend the same movies to a broader set of users (i.e., more frequently) than the IBCF model did.
We can check the most frequently recommended movies from our UBCF model:
| Movie | Times recommended |
|---|---|
| Good Will Hunting (1997) | 30 |
| Titanic (1997) | 28 |
| Schindler’s List (1993) | 23 |
| L.A. Confidential (1997) | 22 |
| Raiders of the Lost Ark (1981) | 19 |
As we can see here, movies such as Good Will Hunting and Titanic were recommended much more frequently to users than they were in the IBCF model.
Running our UBCF model again, while calculating error
Similar to the approach with IBCF, we'll rerun the UBCF model to better determine whether it is effective at recommending movies to users. Therefore, I'm going to run the UBCF model multiple times using the errorOptimization() function we created earlier.
We'll use the same evaluation scheme as earlier, with 10-fold cross-validation to create 10 equal test sets and a 'good rating' threshold of 3. Next, we can use the training data from our evaluation scheme to generate our UBCF model:
similarity_calcs <- c('cosine', 'pearson')
neighborhood_size <- seq(5, 100, 1)
normalization_techniques <- c('center', 'Z-score')
algorithm_type <- 'ubcf'
# create empty vector
errors_list_ubcf <- c()
# cycle through the similarity, neighbor, and normalization vectors, append errors to empty vector
for (s in similarity_calcs){
for(k in neighborhood_size){
for(n in normalization_techniques){
errors_list_ubcf <- rbind(errors_list_ubcf, errorOptimization(e, s, k, n, algorithm_type))
}
}
}
# change the data types of the error values from characters to numbers
errors_df_ubcf <- as.data.frame(errors_list_ubcf)
errors_df_ubcf$RMSE <- as.numeric(as.character(errors_df_ubcf$RMSE))
errors_df_ubcf$MSE <- as.numeric(as.character(errors_df_ubcf$MSE))
errors_df_ubcf$MAE <- as.numeric(as.character(errors_df_ubcf$MAE))
errors_df_ubcf$AVG_ERROR <- rowMeans(subset(errors_df_ubcf, select=c(RMSE, MSE, MAE)))
Visualizing Error
With our errors calculated across many runs of the UBCF model, we can generate our visuals again of a few parameters:
errors_df_cosine_ubcf <- errors_df_ubcf %>%
filter(similarity_calc == 'cosine')
errors_df_pearson_ubcf <- errors_df_ubcf %>%
filter(similarity_calc != 'cosine')
sim_calc_df_ubcf <- data.frame(cosine = errors_df_cosine_ubcf$AVG_ERROR,
                               pearson = errors_df_pearson_ubcf$AVG_ERROR)
ggplot(sim_calc_df_ubcf, aes(x = 1:nrow(sim_calc_df_ubcf))) +
  geom_line(aes(y = cosine), color = "blue") +
  geom_line(aes(y = pearson), color = "red") +
  ylab("Average Error (RMSE, MSE, MAE)") +
  xlab("Model runs") +
  ggtitle("Average Error by Similarity Function")
Similar to our evaluation of the IBCF errors, it looks like the pearson similarity method is more effective for UBCF as well, while neither normalization technique stands out. Now, we can find our optimal model by finding the minimum AVG_ERROR value across all of our model runs:
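As with IBCF, a one-line sketch pulls out the best run:
kable(errors_df_ubcf[which.min(errors_df_ubcf$AVG_ERROR), ])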
| RMSE | MSE | MAE | similarity_calc | neighbors | norm | AVG_ERROR |
|---|---|---|---|---|---|---|
| 0.9727797 | 0.9463003 | 0.7629914 | pearson | 98 | center | 0.8940238 |
After applying Item-Item and User-User Collaborative Filtering algorithms, I was able to generate movie recommendations for each user based on our original user-item matrix from the MovieLense dataset. Following some data exploration, I tidied the data to get it ready for the algorithms, briefly discussed normalization of ratings, and worked through ways to split the large realRatingMatrix into training and test datasets. Finally, I used the recommenderlab package to generate my models and computed error metrics to measure the effectiveness of both algorithms and their recommendations. Here are a few things I noticed:
Similar to many user-item matrices, the MovieLense data was quite sparse, so it was important to create a threshold that would subset the sparse matrix to limit bias.
Both models generated movie recommendations, and it was interesting to see that each algorithm recommended quite different movies for each user; there didn't seem to be much consistency in recommendations across the two methods. Although more in-depth investigation is needed to quantify the specific differences, the UBCF model recommended some of the same movies more frequently across a larger subset of users, while the IBCF model repeated the same movies far less often.
I noticed that when I was running error calculations on both models, the normalization technique, whether it was ‘center’ or ‘z-score’ didn’t tend to show major differences in error.
I also noticed that when running error calculations on both models, the similarity function used did tend to show differences in error – the pearson correlation method seemed to yield lower error than the cosine method for both algorithms.
Our k-nearest-neighbors input showed lower error at larger values, which we could see clearly from our plots. Additionally, the lowest-error runs of our IBCF and UBCF algorithms used neighborhood sizes of 100 and 98 respectively, indicating that larger neighborhoods were more effective for the MovieLense data when computing similarity measures.
Ultimately, after computing the final error values by averaging RMSE, MSE, and MAE and then taking the lowest average error across the two algorithms, it appears that the User-User Collaborative Filtering model generated recommendations with lower error. This is something I hope to explore further in future weeks, since we can evaluate both models more thoroughly (plot ROC curves, calculate precision-recall, etc.), but for now there is a clear difference in error between the two techniques.
Overall, this was my first time building a recommender system, and it was interesting to dive in using both of these techniques!