Data 612 Project 2 | Content-Based and Collaborative Filtering

Assignment Instructions

The goal of this assignment is for you to try out different ways of implementing and configuring a recommender, and to evaluate your different approaches. For assignment 2, start with an existing dataset of user-item ratings, such as our toy books dataset, MovieLens, Jester, or another dataset of your choosing. Implement at least two of these recommendation algorithms:

  • Content-Based Filtering
  • User-User Collaborative Filtering
  • Item-Item Collaborative Filtering

You should evaluate and compare different approaches, using different algorithms, normalization techniques, similarity methods, neighborhood sizes, etc. You don’t need to be exhaustive—these are just some suggested possibilities.

You may use the course text’s recommenderlab or any other library that you want. Please provide at least one graph, and a textual summary of your findings and recommendations.

Introduction

For this assignment, I used the MovieLense dataset that ships with the recommenderlab package to implement both user-based and item-based collaborative filtering.

Load the MovieLense Dataset

Each row of the MovieLense dataset corresponds to a user, each column corresponds to a movie, and each cell holds a rating. The dataset contains 943 users and 1664 movies.

library(recommenderlab)
library(ggplot2)

set.seed(150)
data(MovieLense)
show(MovieLense)
## 943 x 1664 rating matrix of class 'realRatingMatrix' with 99392 ratings.
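
Under the hood, a realRatingMatrix stores (user, item, rating) triplets. As a quick sketch, recommenderlab's coercion to data.frame exposes them directly:

# Inspect the first few (user, item, rating) triplets behind the matrix.
head(as(MovieLense, 'data.frame'))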

Data Exploration

Before implementing the recommendation models, I explored the data to get a sense of the rating distribution and to identify any inconsistencies.

Ratings Exploration

Ratings are integers ranging from 1 to 5; a value of 0 in the underlying matrix represents a missing rating. As the table below shows, missing values dominate the matrix, so we remove the zeros before examining the rating distribution.

# Flatten the rating matrix into a vector (zeros encode missing ratings).
ratings_vector <- as.vector(MovieLense@data)
ratings_table <- table(ratings_vector)
ratings_table
## ratings_vector
##       0       1       2       3       4       5 
## 1469760    6059   11307   27002   33947   21077
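
The zeros dominate because each user rates only a small fraction of the catalog. Using the counts already reported above, a quick sketch of the matrix density:

# Fraction of user-movie cells that contain an actual rating.
n_ratings <- sum(rowCounts(MovieLense))            # 99392 ratings
n_ratings / (nrow(MovieLense) * ncol(MovieLense))  # ~0.063, i.e. ~6% filled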

Remove the zero-valued (missing) ratings.

ratings_vector <- ratings_vector[ratings_vector != 0]
ratings_table <- table(ratings_vector)
ratings_table
## ratings_vector
##     1     2     3     4     5 
##  6059 11307 27002 33947 21077

Ratings Distribution

Now that we have removed the missing values, we can look at the distribution of ratings. As the histogram below shows, the most common ratings are 3, 4, and 5, with 4 being the most frequent.

ratings_vector <- factor(ratings_vector)
qplot(ratings_vector, geom = 'bar', fill = I('red')) +
  ggtitle('Distribution of Ratings') +
  xlab('Rating') +
  ylab('Count')

Most Viewed Movies

The next thing I wanted to explore was which movies received the most views. The bar chart below displays the 10 most viewed movies; Star Wars received the most views.

views_per_movie <- colCounts(MovieLense)
views_table <- data.frame(
  movie = names(views_per_movie),
  views = views_per_movie
)

views_table <- views_table[order(views_table$views, decreasing = TRUE), ]
ggplot(views_table[1:10, ], aes(x = reorder(movie, -views), y = views)) +
  geom_bar(stat = 'identity', fill = 'red') +
  theme(axis.text.x = element_text(angle = 50, hjust = 1)) +
  ggtitle('10 Most Viewed Movies') +
  xlab('Movie') +
  ylab('Number of Views')

Data Preparation

Data preparation involves pruning the data to improve the quality of our recommendations, and splitting it into training and test sets.

Prune The Data

Improve the reliability of recommendations by keeping only users who have rated more than 50 movies, and only movies that have been rated more than 100 times.

movie_ratings <- MovieLense[rowCounts(MovieLense) > 50, colCounts(MovieLense) > 100]
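
As a sanity check, one can confirm what survives the pruning; the user count should match the 560 users (449 + 111) seen in the split below:

# Dimensions of the pruned rating matrix (users x movies).
dim(movie_ratings)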

Average Rating Per User Distribution

A quick look at the distribution of average ratings per user shows substantial variation: some users consistently rate movies higher or lower than others.

average_ratings_per_user <- rowMeans(movie_ratings)
qplot(average_ratings_per_user) + ggtitle('Average Rating Per User Distribution') +
  xlab('Average Rating Per User') +
  ylab('Count')
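
One way to address this variation is to normalize ratings before modeling. The sketch below centers each user's ratings on their own mean via recommenderlab's normalize(); note that the UBCF and IBCF methods used later apply this centering by default:

# Center each user's ratings on their own mean (the default
# normalization for the models built below).
movie_ratings_norm <- normalize(movie_ratings, method = 'center')
summary(rowMeans(movie_ratings_norm))  # row means should now be roughly zero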

Split the data into training and test sets

After the split, we are left with 449 users in the training set and 111 in the test set.

train <- sample(x = c(TRUE, FALSE), size = nrow(movie_ratings), replace = TRUE, prob = c(0.8, 0.2))
training_data <- movie_ratings[train, ]
test_data <- movie_ratings[!train, ]

print(nrow(training_data))
## [1] 449
print(nrow(test_data))
## [1] 111

Recommendation Models

Now that the data has been cleaned and split into test and training sets, we can build our recommendation models.

Item-Based Collaborative Filtering

Build the item-based model

item_item_model <- Recommender(data = training_data, method = 'IBCF')
item_item_model
## Recommender of type 'IBCF' for 'realRatingMatrix' 
## learned using 449 users.
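
The model above uses recommenderlab's IBCF defaults (cosine similarity, k = 30 item neighbors). As a sketch, alternative configurations can be passed through the parameter list; the values below are illustrative, not tuned:

# Illustrative IBCF configuration: Pearson similarity, larger neighborhood.
item_item_tuned <- Recommender(
  data = training_data,
  method = 'IBCF',
  parameter = list(k = 50, method = 'pearson')
)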

Apply the model to the test data.

predicted <- predict(object = item_item_model, newdata = test_data, n = 6)

# Map the recommended item indices back to movie titles for each test user.
recommendation_matrix <- sapply(predicted@items, function(x) { colnames(movie_ratings)[x] })

# Recommendations for the first 4 test users.
recommendation_matrix[, 1:4]
##      5                             
## [1,] "Pulp Fiction (1994)"         
## [2,] "Devil's Advocate, The (1997)"
## [3,] "Fly Away Home (1996)"        
## [4,] "Firm, The (1993)"            
## [5,] "Outbreak (1995)"             
## [6,] "Con Air (1997)"              
##      12                                       14                         
## [1,] "Natural Born Killers (1994)"            "Babe (1995)"              
## [2,] "Professional, The (1994)"               "Cinema Paradiso (1988)"   
## [3,] "Quiz Show (1994)"                       "Raging Bull (1980)"       
## [4,] "What's Eating Gilbert Grape (1993)"     "Sting, The (1973)"        
## [5,] "Hudsucker Proxy, The (1994)"            "Dead Poets Society (1989)"
## [6,] "Snow White and the Seven Dwarfs (1937)" "Edge, The (1997)"         
##      21                                             
## [1,] "Fly Away Home (1996)"                         
## [2,] "Batman (1989)"                                
## [3,] "Emma (1996)"                                  
## [4,] "William Shakespeare's Romeo and Juliet (1996)"
## [5,] "Beauty and the Beast (1991)"                  
## [6,] "Beautiful Girls (1996)"

Distribution of Movie Recommendations

The distribution of recommendation counts shows that most movies are recommended only a few times, while a small number of titles appear in many users' lists.

items_count <- factor(table(recommendation_matrix))
qplot(items_count, ylab = 'Movie Count', xlab = 'Recommendation Count') +
  ggtitle('Distribution of Movie Recommendations')

items_count_sorted <- sort(items_count, decreasing = TRUE)
top_items <- head(items_count_sorted, n = 10)
top_items_table <- data.frame(top_items)
colnames(top_items_table) <- c('Times Recommended')
top_items_table
##                             Times Recommended
## Ace Ventura: Pet Detective (1994)          13
## Craft, The (1996)                          13
## Frighteners, The (1996)                    12
## Everyone Says I Love You (1996)             8
## Jungle2Jungle (1997)                        8
## That Thing You Do! (1996)                   8
## Con Air (1997)                              7
## Nightmare on Elm Street, A (1984)           7
## Welcome to the Dollhouse (1995)             7
## Beavis and Butt-head Do America (1996)      6

User-Based Collaborative Filtering

Build the user-based model

user_user_model <- Recommender(data = training_data, method = 'UBCF')
user_user_model
## Recommender of type 'UBCF' for 'realRatingMatrix' 
## learned using 449 users.
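
As with IBCF, UBCF's defaults (cosine similarity, nn = 25 user neighbors) can be overridden via the parameter list. A sketch with illustrative, untuned values:

# Illustrative UBCF configuration: Pearson similarity, larger neighborhood.
user_user_tuned <- Recommender(
  data = training_data,
  method = 'UBCF',
  parameter = list(nn = 50, method = 'pearson')
)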

Apply the model to the test data.

predicted <- predict(object = user_user_model, newdata = test_data, n = 6)
recommendation_matrix <- sapply(predicted@items, function(x) { colnames(movie_ratings)[x] })

# Recommendations for the first 4 test users.
recommendation_matrix[, 1:4]
##      5                           12                         
## [1,] "Good Will Hunting (1997)"  "Good Will Hunting (1997)" 
## [2,] "Titanic (1997)"            "Titanic (1997)"           
## [3,] "L.A. Confidential (1997)"  "Trainspotting (1996)"     
## [4,] "Amistad (1997)"            "Fargo (1996)"             
## [5,] "Godfather, The (1972)"     "As Good As It Gets (1997)"
## [6,] "As Good As It Gets (1997)" "Lost Highway (1997)"      
##      14                           21                                
## [1,] "Contact (1997)"             "Shawshank Redemption, The (1994)"
## [2,] "Babe (1995)"                "Princess Bride, The (1987)"      
## [3,] "Wrong Trousers, The (1993)" "Graduate, The (1967)"            
## [4,] "Good Will Hunting (1997)"   "Usual Suspects, The (1995)"      
## [5,] "Toy Story (1995)"           "Schindler's List (1993)"         
## [6,] "As Good As It Gets (1997)"  "Raiders of the Lost Ark (1981)"

Distribution of User-Based Movie Recommendations

Like the item-based distribution, the user-based distribution shows that most movies are recommended only a few times; however, a handful of popular titles (e.g., Good Will Hunting) dominate the recommendation lists.

items_count <- factor(table(recommendation_matrix))
qplot(items_count, ylab = 'Movie Count', xlab = 'Recommendation Count') +
  ggtitle('Distribution of Movie Recommendations')

items_count_sorted <- sort(items_count, decreasing = TRUE)
top_items <- head(items_count_sorted, n = 10)
top_items_table <- data.frame(top_items)
colnames(top_items_table) <- c('Times Recommended')
top_items_table
##                       Times Recommended
## Good Will Hunting (1997)             49
## Titanic (1997)                       42
## L.A. Confidential (1997)             40
## Schindler's List (1993)              28
## Silence of the Lambs, The (1991)     20
## Apt Pupil (1998)                     19
## Godfather, The (1972)                19
## Contact (1997)                       18
## As Good As It Gets (1997)            16
## Fargo (1996)                         16

Model Evaluation

To evaluate the models, I used recommenderlab's split method to compute the Root Mean Square Error (RMSE) for both the user-based and item-based models. With given = 15, fifteen of each test user's ratings are supplied to the recommender, and their remaining ratings are held out for error measurement. As the table below shows, the RMSE of the user-based model is lower than that of the item-based model, suggesting it is the more accurate of the two.
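
For reference, the RMSE over the set $T$ of held-out (user, movie) pairs is

$$\mathrm{RMSE} = \sqrt{\frac{1}{|T|}\sum_{(u,i) \in T}\left(\hat{r}_{ui} - r_{ui}\right)^2}$$

where $\hat{r}_{ui}$ is the predicted rating and $r_{ui}$ the observed rating.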

# 90/10 split; each test user's 15 'given' ratings are fed to the models,
# and their remaining ratings are held out for error measurement.
e <- evaluationScheme(movie_ratings, method = 'split', train = 0.9, given = 15)
ubcf <- Recommender(getData(e, 'train'), 'UBCF')
ibcf <- Recommender(getData(e, 'train'), 'IBCF')

predict_ubcf <- predict(ubcf, getData(e, 'known'), type = 'ratings')
predict_ibcf <- predict(ibcf, getData(e, 'known'), type = 'ratings')

ubcf_error <- calcPredictionAccuracy(predict_ubcf, getData(e, 'unknown'))
ibcf_error <- calcPredictionAccuracy(predict_ibcf, getData(e, 'unknown'))

error <- rbind(ubcf_error, ibcf_error)
rownames(error) <- c('UBCF','IBCF')
error
##           RMSE       MSE       MAE
## UBCF 0.9893546 0.9788225 0.7939271
## IBCF 1.3678246 1.8709441 1.0429857
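
The single split above compares only the default configurations. As a sketch, evaluate() can score several algorithm/parameter combinations against the same evaluation scheme in one pass; the configurations below are illustrative:

# Compare illustrative UBCF/IBCF variants on the same evaluation scheme.
algorithms <- list(
  'UBCF (cosine)'  = list(name = 'UBCF', param = list(method = 'cosine')),
  'UBCF (pearson)' = list(name = 'UBCF', param = list(method = 'pearson')),
  'IBCF (cosine)'  = list(name = 'IBCF', param = list(method = 'cosine')),
  'IBCF (pearson)' = list(name = 'IBCF', param = list(method = 'pearson'))
)
results <- evaluate(e, algorithms, type = 'ratings')
avg(results)  # average RMSE / MSE / MAE per configuration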

Conclusion

Based on this subset of the MovieLense dataset, the analysis above suggests that user-based collaborative filtering outperforms item-based collaborative filtering in terms of predictive accuracy, with an RMSE of roughly 0.99 versus 1.37 on a single 90/10 split.

Stephen Haslett

6/13/2020