Project 2 | Content-Based and Collaborative Filtering

Project Objectives

For assignment 2, start with an existing dataset of user-item ratings, such as our toy books dataset, MovieLens, Jester [http://eigentaste.berkeley.edu/dataset/] or another dataset of your choosing.

Implement at least two of these recommendation algorithms:
• Content-Based Filtering
• User-User Collaborative Filtering
• Item-Item Collaborative Filtering

You should evaluate and compare different approaches, using different algorithms, normalization techniques, similarity methods, neighborhood sizes, etc. You don’t need to be exhaustive—these are just some suggested possibilities.

You may use the course text’s recommenderlab or any other library that you want. Please provide at least one graph, and a textual summary of your findings and recommendations.

Data Preparation and Exploration

We took the data from the “recommended for education and development” section of https://grouplens.org/datasets/movielens/. The site offers two downloads; we chose the smaller one, because the larger one (named Full) is too large to upload to GitHub. The dataset is described as follows:

This dataset (ml-latest-small) describes 5-star rating and free-text tagging activity from MovieLens, a movie recommendation service. It contains 100836 ratings and 3683 tag applications across 9742 movies. These data were created by 610 users between March 29, 1996 and September 24, 2018. This dataset was generated on September 26, 2018. There are 4 *.csv files, of which we use two, movies.csv and ratings.csv, for our downstream analysis.

Citation

F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4: 19:1–19:19. https://doi.org/10.1145/2827872

Preview data

First 10 rows of movies.csv:

movieId title genres
1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
2 Jumanji (1995) Adventure|Children|Fantasy
3 Grumpier Old Men (1995) Comedy|Romance
4 Waiting to Exhale (1995) Comedy|Drama|Romance
5 Father of the Bride Part II (1995) Comedy
6 Heat (1995) Action|Crime|Thriller
7 Sabrina (1995) Comedy|Romance
8 Tom and Huck (1995) Adventure|Children
9 Sudden Death (1995) Action
10 GoldenEye (1995) Action|Adventure|Thriller

First 10 rows of ratings.csv:

userId movieId rating timestamp
1 1 4 964982703
1 3 4 964981247
1 6 4 964982224
1 47 5 964983815
1 50 5 964982931
1 70 3 964982400
1 101 5 964980868
1 110 4 964982176
1 151 5 964984041
1 157 5 964984100

Combine Data

Join movies with ratings on movieId

movieId userId rating timestamp title genres
1 1 4.0 964982703 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
1 555 4.0 978746159 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
1 232 3.5 1076955621 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
1 590 4.0 1258420408 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
1 601 4.0 1521467801 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
1 179 4.0 852114051 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
1 606 2.5 1349082950 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
1 328 5.0 1494210665 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
1 206 5.0 850763267 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
1 468 4.0 831400444 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
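
As an illustration of the join on movieId, here is a minimal plain-Python sketch. It is not the R code used to produce the table above; the function name `inner_join` and the tiny in-memory rows are assumptions made for the example.

```python
# Hypothetical rows mirroring the first lines of movies.csv and ratings.csv.
movies = [
    {"movieId": 1, "title": "Toy Story (1995)",
     "genres": "Adventure|Animation|Children|Comedy|Fantasy"},
    {"movieId": 3, "title": "Grumpier Old Men (1995)", "genres": "Comedy|Romance"},
]
ratings = [
    {"userId": 1, "movieId": 1, "rating": 4.0, "timestamp": 964982703},
    {"userId": 1, "movieId": 3, "rating": 4.0, "timestamp": 964981247},
    {"userId": 555, "movieId": 1, "rating": 4.0, "timestamp": 978746159},
]


def inner_join(ratings, movies, key="movieId"):
    """Attach each rating to its movie row; ratings without a match are dropped."""
    by_key = {m[key]: m for m in movies}
    return [{**r, **by_key[r[key]]} for r in ratings if r[key] in by_key]


joined = inner_join(ratings, movies)
for row in joined:
    print(row["movieId"], row["userId"], row["rating"], row["title"])
```

Each rating row keeps its own columns and gains title and genres from the matching movie, which is why Toy Story repeats once per user in the joined table.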

Create Matrix

Unlike the MovieLense dataset that ships with recommenderlab, our data is not already an object of class realRatingMatrix. In the following code chunk we therefore create a realRatingMatrix called moviematrix. Storing the data in this class lets us apply recommenderlab's helper functions to moviematrix (see page 33 of Building a Recommendation System with R).

## 610 x 9724 rating matrix of class 'realRatingMatrix' with 100836 ratings.
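
To make the idea of a rating matrix concrete, the following plain-Python sketch pivots (user, item, rating) triplets into a user-by-item table in which unrated cells are None. It only illustrates the data structure; it is not recommenderlab's realRatingMatrix, and the name `build_rating_matrix` is hypothetical.

```python
def build_rating_matrix(triplets):
    """Return (users, items, matrix) where matrix[u][i] is a rating or None."""
    users = sorted({u for u, _, _ in triplets})
    items = sorted({i for _, i, _ in triplets})
    row = {user: k for k, user in enumerate(users)}
    col = {item: j for j, item in enumerate(items)}
    matrix = [[None] * len(items) for _ in users]
    for u, i, r in triplets:
        matrix[row[u]][col[i]] = r
    return users, items, matrix


# A few (userId, movieId, rating) triplets taken from the tables above.
triplets = [(1, 1, 4.0), (1, 3, 4.0), (1, 6, 4.0), (555, 1, 4.0), (232, 1, 3.5)]
users, items, matrix = build_rating_matrix(triplets)
n_ratings = sum(v is not None for row_ in matrix for v in row_)
print(len(users), "x", len(items), "matrix with", n_ratings, "ratings")
```

Only items that appear in at least one rating get a column, which would explain why the matrix above has 9724 items although the dataset lists 9742 movies.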

At this point, we’ll take stock of the important characteristics of moviematrix.

## [1]  610 9724
##      user item rating
## 1       1    1      4
## 326     1    3      4
## 434     1    6      4
## 2108    1   47      5
## 2380    1   50      5
## 2860    1   70      3

Data Visualization

Exploring the values of the rating

##  [1] 4.0 0.0 4.5 2.5 3.5 3.0 5.0 0.5 2.0 1.5 1.0
vector_ratings Freq
0 5830804
0.5 1370
1 2811
1.5 1791
2 7551
2.5 5550
3 20047
3.5 13136
4 26818
4.5 8551
5 13211

A rating of 0 represents a missing value, so we remove the zeros from vector_ratings.
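
In the table above, 5830804 of the 610 x 9724 = 5931640 cells are zero, so roughly 98.3% of the matrix is empty. The sketch below shows in plain Python why the zeros have to go before summarising the ratings. The short `vector_ratings` list is made-up illustration data, not the real vector.

```python
from collections import Counter

# Hypothetical dense vector of ratings in which 0 marks "not rated".
vector_ratings = [4.0, 0.0, 4.5, 0.0, 3.5, 4.0, 0.0, 5.0, 0.5, 4.0]

# Keep only observed ratings; a zero is an absence, not a score.
observed = [r for r in vector_ratings if r != 0]
freq = Counter(observed)
mean_rating = sum(observed) / len(observed)

print(sorted(freq.items()))
print(round(mean_rating, 3))
```

Leaving the zeros in would pull any mean or histogram toward zero and hide the real distribution of scores.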

Selecting the most relevant data

When we explored the data, we noticed that the table contains:

  1. Movies that have been rated only a few times. Ratings from so few users may be unrepresentative, so we keep only movies that have been rated at least 50 times.
  2. Users who have rated only a few movies. Their ratings may be unrepresentative too, so we keep only users who have rated at least 50 movies.
## 378 x 436 rating matrix of class 'realRatingMatrix' with 36214 ratings.
## [1] 378 436

Now we have 378 users and 436 items with 36214 ratings.
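
The filtering step can be sketched in plain Python as follows. This is an illustration under assumptions: the helper name `filter_matrix` is hypothetical, the thresholds are shrunk to 2 to suit the toy data (the report uses 50), and counts are taken once on the unfiltered data, which may or may not match how the original R chunk did it.

```python
def filter_matrix(ratings, min_per_user, min_per_item):
    """ratings: dict {(user, item): rating}. Counts come from the unfiltered data."""
    per_user, per_item = {}, {}
    for (u, i) in ratings:
        per_user[u] = per_user.get(u, 0) + 1
        per_item[i] = per_item.get(i, 0) + 1
    return {
        (u, i): r
        for (u, i), r in ratings.items()
        if per_user[u] >= min_per_user and per_item[i] >= min_per_item
    }


# Toy data: u3 rated one movie, and movie "c" was rated once.
ratings = {
    ("u1", "a"): 4.0, ("u1", "b"): 3.0, ("u1", "c"): 5.0,
    ("u2", "a"): 2.0, ("u2", "b"): 4.5,
    ("u3", "a"): 3.5,
}
kept = filter_matrix(ratings, min_per_user=2, min_per_item=2)
print(sorted(kept))
```

Because both counts are computed before anything is removed, a user or movie that passed the threshold can end up with fewer ratings in the filtered matrix than the threshold suggests.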

Let’s build the chart again:

Item-Item Collaborative Filtering

Item-item collaborative filtering computes the similarity between items from the ratings users have given them, and then recommends items similar to those a user has already rated highly. The similarity between each pair of items is computed with one of the available similarity measures, and these similarity values are then used to predict ratings for user-item pairs that are absent from the data.
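
The idea can be sketched in a few lines of plain Python. This is a simplified illustration, not recommenderlab's IBCF implementation: it uses raw cosine similarity over co-rating users and a similarity-weighted mean, skips any rating normalization, and the names `cosine` and `predict` and the toy ratings are assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity of two items over the users who rated both."""
    common = [u for u in a if u in b]
    if not common:
        return 0.0
    dot = sum(a[u] * b[u] for u in common)
    norm_a = math.sqrt(sum(a[u] ** 2 for u in common))
    norm_b = math.sqrt(sum(b[u] ** 2 for u in common))
    return dot / (norm_a * norm_b)


def predict(item_ratings, user, target, k=2):
    """Similarity-weighted mean of the user's ratings on the k items most similar to target."""
    sims = [
        (cosine(item_ratings[target], item_ratings[other]), item_ratings[other][user])
        for other in item_ratings
        if other != target and user in item_ratings[other]
    ]
    top = sorted(sims, reverse=True)[:k]
    weight = sum(s for s, _ in top)
    return sum(s * r for s, r in top) / weight if weight else None


# Toy data: item -> {user: rating}. User u4 has not rated item A.
item_ratings = {
    "A": {"u1": 5, "u2": 4, "u3": 1},
    "B": {"u1": 4, "u2": 5, "u3": 2, "u4": 4},
    "C": {"u1": 1, "u2": 2, "u3": 5, "u4": 1},
}
sim_ab = cosine(item_ratings["A"], item_ratings["B"])
sim_ac = cosine(item_ratings["A"], item_ratings["C"])
pred = predict(item_ratings, "u4", "A")
print(round(sim_ab, 3), round(sim_ac, 3), round(pred, 3))
```

Item B is rated much like item A, so u4's rating of B carries more weight in the prediction for A than u4's rating of C.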

Training model

In the step below we train the model with k = 30, which is the default. Here k is the number of most similar items kept for each item.

## Recommender of type 'IBCF' for 'realRatingMatrix' 
## learned using 302 users.

Recommendations using test set

## Recommendations as 'topNList' with n = 6 for 76 users.
Movie Rating genres
Shawshank Redemption, The (1994) 5.0 Crime|Drama
Forrest Gump (1994) 5.0 Comedy|Drama|Romance|War
Blade Runner (1982) 5.0 Action|Sci-Fi|Thriller
One Flew Over the Cuckoo’s Nest (1975) 5.0 Drama
Hook (1991) 5.0 Adventure|Comedy|Fantasy
Kill Bill: Vol. 2 (2004) 5.0 Action|Drama|Thriller
Casino Royale (2006) 5.0 Action|Adventure|Thriller
Crouching Tiger, Hidden Dragon (Wo hu cang long) (2000) 4.5 Action|Drama|Romance
Traffic (2000) 4.5 Crime|Drama|Thriller
Mulholland Drive (2001) 4.5 Crime|Drama|Film-Noir|Mystery|Thriller
Bowling for Columbine (2002) 4.5 Documentary
Interview with the Vampire: The Vampire Chronicles (1994) 4.0 Drama|Horror
Gladiator (2000) 4.0 Action|Adventure|Drama
Chicken Run (2000) 4.0 Animation|Children|Comedy
Best in Show (2000) 4.0 Comedy
Lost in Translation (2003) 4.0 Comedy|Drama|Romance
Mystic River (2003) 4.0 Crime|Drama|Mystery
Kill Bill: Vol. 1 (2003) 4.0 Action|Crime|Thriller
Incredibles, The (2004) 4.0 Action|Adventure|Animation|Children|Comedy
Prestige, The (2006) 4.0 Drama|Mystery|Sci-Fi|Thriller
No Country for Old Men (2007) 4.0 Crime|Drama
Inglourious Basterds (2009) 4.0 Action|Drama|War
Fight Club (1999) 3.5 Action|Crime|Drama|Thriller
Monsters, Inc. (2001) 3.5 Adventure|Animation|Children|Comedy|Fantasy
Royal Tenenbaums, The (2001) 3.5 Comedy|Drama
Beautiful Mind, A (2001) 3.5 Drama|Romance
Bourne Identity, The (2002) 3.5 Action|Mystery|Thriller
Finding Nemo (2003) 3.5 Adventure|Animation|Children|Comedy
Eternal Sunshine of the Spotless Mind (2004) 3.5 Drama|Romance|Sci-Fi
Superbad (2007) 3.5 Comedy
Avatar (2009) 3.5 Action|Adventure|Sci-Fi|IMAX
Godfather, The (1972) 3.0 Crime|Drama
Memento (2000) 3.0 Mystery|Thriller
Shrek (2001) 3.0 Adventure|Animation|Children|Comedy|Fantasy|Romance
Dark Knight, The (2008) 3.0 Action|Crime|Drama|IMAX
O Brother, Where Art Thou? (2000) 2.5 Adventure|Comedy|Crime
Pirates of the Caribbean: The Curse of the Black Pearl (2003) 2.5 Action|Adventure|Comedy|Fantasy
Batman Begins (2005) 2.5 Action|Crime|IMAX
Departed, The (2006) 2.5 Crime|Drama|Thriller
Bourne Ultimatum, The (2007) 2.5 Action|Crime|Thriller
Million Dollar Baby (2004) 2.0 Drama
WALL·E (2008) 2.0 Adventure|Animation|Children|Romance|Sci-Fi
Up (2009) 2.0 Adventure|Animation|Children|Drama
Donnie Darko (2001) 1.5 Drama|Mystery|Sci-Fi|Thriller
28 Days Later (2002) 1.5 Action|Horror|Sci-Fi
Pan’s Labyrinth (Laberinto del fauno, El) (2006) 1.5 Drama|Fantasy|Thriller
Ratatouille (2007) 1.5 Animation|Children|Drama
Lord of the Rings: The Fellowship of the Ring, The (2001) 1.0 Adventure|Fantasy
Lord of the Rings: The Return of the King, The (2003) 1.0 Action|Adventure|Drama|Fantasy
High Fidelity (2000) 0.5 Comedy|Drama|Romance
Requiem for a Dream (2000) 0.5 Drama
Harry Potter and the Chamber of Secrets (2002) 0.5 Adventure|Fantasy
Big Fish (2003) 0.5 Drama|Fantasy|Romance
V for Vendetta (2006) 0.5 Action|Sci-Fi|Thriller|IMAX
Juno (2007) 0.5 Comedy|Drama|Romance
Iron Man (2008) 0.5 Action|Adventure|Sci-Fi
Slumdog Millionaire (2008) 0.5 Crime|Drama|Romance
Star Trek (2009) 0.5 Action|Adventure|Sci-Fi|IMAX
Hangover, The (2009) 0.5 Comedy|Crime
District 9 (2009) 0.5 Mystery|Sci-Fi|Thriller

Movie genres
Stargate (1994) Action|Adventure|Sci-Fi
Robin Hood: Men in Tights (1993) Comedy
Schindler’s List (1993) Drama|War
Alien (1979) Horror|Sci-Fi
The Devil’s Advocate (1997) Drama|Mystery|Thriller
Big (1988) Comedy|Drama|Fantasy|Romance

User-User Collaborative Filtering

Training the model

## Recommender of type 'UBCF' for 'realRatingMatrix' 
## learned using 302 users.

Recommendations using test set

## Recommendations as 'topNList' with n = 6 for 76 users.
Movie genres
Babe (1995) Children|Drama
Fugitive, The (1993) Thriller
Braveheart (1995) Action|Drama|War
Lion King, The (1994) Adventure|Animation|Children|Drama|Musical|IMAX
Star Wars: Episode IV - A New Hope (1977) Action|Adventure|Sci-Fi
Aladdin (1992) Adventure|Animation|Children|Comedy|Musical

Comparison of Recommender Models

## [1] 11
## Evaluation scheme with 10 items given
## Method: 'cross-validation' with 10 run(s).
## Good ratings: >=3.500000
## Data set: 378 x 436 rating matrix of class 'realRatingMatrix' with 36214 ratings.
model RMSE MSE MAE
item 1.2937300 1.6737372 0.955398
user 0.9555293 0.9130362 0.734025
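
For reference, the three error measures in this table can be computed as in the plain-Python sketch below. The function name `accuracy` and the example numbers are made up for illustration; this is not the recommenderlab code that produced the table.

```python
import math


def accuracy(predicted, actual):
    """Return RMSE, MSE and MAE for paired predicted and actual ratings."""
    errors = [p - a for p, a in zip(predicted, actual)]
    mse = sum(e * e for e in errors) / len(errors)
    mae = sum(abs(e) for e in errors) / len(errors)
    return {"RMSE": math.sqrt(mse), "MSE": mse, "MAE": mae}


# Hypothetical predictions and held-out ratings for four user-item pairs.
predicted = [3.5, 4.0, 2.0, 5.0]
actual = [4.0, 4.0, 3.0, 4.5]
scores = accuracy(predicted, actual)
print(scores)
```

RMSE is the square root of MSE, which the table above bears out: 1.29373 squared is about 1.67374 for the item-based model. Lower values are better on all three measures.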
## IBCF run fold/sample [model time/prediction time]
##   1  [0.35sec/0.03sec] 
##   2  [0.33sec/0.03sec] 
##   3  [0.34sec/0.03sec] 
##   4  [0.39sec/0.03sec]
n TP FP FN TN precision recall TPR FPR
10 1.416667 8.583333 59.61458 356.3854 0.1416667 0.0291171 0.0291171 0.0237326
20 2.947917 17.052083 58.08333 347.9167 0.1473958 0.0564288 0.0564288 0.0469233
30 4.645833 25.354167 56.38542 339.6146 0.1548611 0.0836695 0.0836695 0.0695461
40 6.031250 33.968750 55.00000 331.0000 0.1507812 0.1065546 0.1065546 0.0933082
50 7.489583 42.510417 53.54167 322.4583 0.1497917 0.1313989 0.1313989 0.1167724
60 8.572917 51.427083 52.45833 313.5417 0.1428819 0.1471139 0.1471139 0.1413827
## UBCF run fold/sample [model time/prediction time]
##   1  [0sec/0.12sec] 
##   2  [0sec/0.14sec] 
##   3  [0sec/0.13sec] 
##   4  [0sec/0.14sec]
n TP FP FN TN precision recall TPR FPR
10 2.854167 7.145833 58.17708 357.8229 0.2854167 0.0570086 0.0570086 0.0190046
20 4.875000 15.125000 56.15625 349.8438 0.2437500 0.0948137 0.0948137 0.0406465
30 6.687500 23.312500 54.34375 341.6562 0.2229167 0.1221262 0.1221262 0.0627652
40 8.645833 31.354167 52.38542 333.6146 0.2161458 0.1522087 0.1522087 0.0844114
50 10.510417 39.489583 50.52083 325.4792 0.2102083 0.1786933 0.1786933 0.1062689
60 12.197917 47.802083 48.83333 317.1667 0.2032986 0.2051133 0.2051133 0.1289054
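
The ratio columns in these tables follow the usual confusion-matrix definitions, sketched below in plain Python with made-up counts. The name `topn_metrics` is hypothetical. The precision column can be reproduced from the averaged counts shown (for example 2.854167 / 10 for UBCF at the first list length), but recall and FPR cannot be reproduced exactly that way, presumably because they are averaged per user before being reported.

```python
def topn_metrics(tp, fp, fn, tn):
    """Precision, recall (same as TPR) and false positive rate from confusion counts."""
    return {
        "precision": tp / (tp + fp),  # share of recommended items that were good
        "recall": tp / (tp + fn),     # share of good items that were recommended
        "FPR": fp / (fp + tn),        # share of other items that were recommended
    }


# Hypothetical counts for one user and a top-10 list.
m = topn_metrics(tp=3, fp=7, fn=12, tn=78)
print(m)
```

Longer recommendation lists raise recall and FPR together, which is the trade-off visible when reading down either table.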

## IBCF run fold/sample [model time/prediction time]
##   1  [0.42sec/0.05sec] 
##   2  [0.41sec/0.02sec] 
##   3  [0.57sec/0.01sec] 
##   4  [0.35sec/0.03sec] 
## IBCF run fold/sample [model time/prediction time]
##   1  [0.41sec/0.03sec] 
##   2  [0.41sec/0.03sec] 
##   3  [0.42sec/0.03sec] 
##   4  [0.41sec/0.03sec] 
## UBCF run fold/sample [model time/prediction time]
##   1  [0sec/0.15sec] 
##   2  [0.02sec/0.12sec] 
##   3  [0sec/0.12sec] 
##   4  [0.02sec/0.12sec] 
## UBCF run fold/sample [model time/prediction time]
##   1  [0sec/0.16sec] 
##   2  [0sec/0.14sec] 
##   3  [0sec/0.14sec] 
##   4  [0.02sec/0.12sec] 
## RANDOM run fold/sample [model time/prediction time]
##   1  [0sec/0.03sec] 
##   2  [0sec/0.03sec] 
##   3  [0sec/0.03sec] 
##   4  [0sec/0.04sec] 
## POPULAR run fold/sample [model time/prediction time]
##   1  [0sec/0.18sec] 
##   2  [0sec/0.28sec] 
##   3  [0sec/0.17sec] 
##   4  [0sec/0.17sec]

Summary

Building the movie recommender system gave us a better understanding of how such systems work. The textbook “Building a Recommendation System with R” is unclear in some places, so we had to search online for some of the implementation details.

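Pros and cons of User-Based Collaborative Filtering (UBCF) compared with Item-Based Collaborative Filtering (IBCF):

  • UBCF recommendations tend to complement the item the user was interacting with. Since users are often not looking for a direct substitute for a movie, UBCF can give more useful recommendations than IBCF. In our evaluation it also had the lower error (RMSE 0.96 versus 1.29 for IBCF) and the higher precision at every list length from 10 to 60.

  • UBCF is memory intensive. In the run logs above its model time is close to zero while its prediction time is several times that of IBCF, so most of the work happens at prediction time. With a very large number of users, processing time would be high.

  • UBCF relies on users’ historical choices to make future recommendations, so it assumes that users’ preferences stay largely constant over time.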

Forhad Akbar

6/14/2020