Project 2 | Content-Based and Collaborative Filtering

Objective
Data Loading and Data Preparation
Data Normalization
Algorithms Used
- Item-Item Collaborative Filtering
- User-User Collaborative Filtering
Summary

Objective

From an existing dataset of user-item ratings, such as our toy books dataset, MovieLens, Jester [http://eigentaste.berkeley.edu/dataset/] or another dataset of your choosing. Implement at least two of these recommendation algorithms:

• Content-Based Filtering

• User-User Collaborative Filtering

• Item-Item Collaborative Filtering

You should evaluate and compare different approaches, using different algorithms, normalization techniques, similarity methods, neighborhood sizes, etc. You don’t need to be exhaustive—these are just some suggested possibilities.

Data Loading and Data Preparation

Load the Libraries

library(recommenderlab)
library(reshape2)
library(RCurl)
library(pander)
library(ggplot2)
library(knitr)
library(dplyr)
library(tidyr)
library(ggplot2)

Load the Data

The data set is from MovieLens project and it was downloaded from [Movie Lens] (https://grouplens.org/datasets/movielens/)

ratings <- read.csv("https://raw.githubusercontent.com/PriyaShaji/Data612/master/Project_2/ratings.csv")

movies <- read.csv("https://raw.githubusercontent.com/PriyaShaji/Data612/master/Project_2/movies.csv")

knitr::kable(head(movies))

movieId	title	genres
1	Toy Story (1995)	Adventure\|Animation\|Children\|Comedy\|Fantasy
2	Jumanji (1995)	Adventure\|Children\|Fantasy
3	Grumpier Old Men (1995)	Comedy\|Romance
4	Waiting to Exhale (1995)	Comedy\|Drama\|Romance
5	Father of the Bride Part II (1995)	Comedy
6	Heat (1995)	Action\|Crime\|Thriller

knitr::kable(head(ratings))

userId	movieId	rating	timestamp
1	1	4	964982703
1	3	4	964981247
1	6	4	964982224
1	47	5	964983815
1	50	5	964982931
1	70	3	964982400

Convert to Matrix

Movie_Matrix <- ratings %>% select(-timestamp) %>% spread(movieId, rating)

row.names(Movie_Matrix) <- Movie_Matrix[, 1]

Movie_Matrix <- Movie_Matrix[-c(1)]
Movie_Matrix <- as(as.matrix(Movie_Matrix), "realRatingMatrix")

Movie_Matrix

610 x 9724 rating matrix of class 'realRatingMatrix' with 100836 ratings.

Our movie matrix contains 610 users and 9,724 items/movies.

Data Preparation

Let us examine the distribution of the ratings dataset

Ratings_distribution <- as.vector(Movie_Matrix@data)
Ratings_distribution <- Ratings_distribution[Ratings_distribution != 0]
ggplot() + aes(Ratings_distribution) + geom_histogram(binwidth = 0.5) + xlab("Ratings") + 
    ylab("No. of Ratings")

As we can see, that the dataset consists of some sparse data, which would not be appropriate to build our model. Therefore, we would proceed with users and items with most information.

(movies <- Movie_Matrix[rowCounts(Movie_Matrix) > 50, colCounts(Movie_Matrix) > 
    50])

378 x 436 rating matrix of class 'realRatingMatrix' with 36214 ratings.

Now we have 378 users and 436 items.

In general, any ratings matrix, especially movie ratings matrix, is bound to have some bias. Some users may give higher ratings than others.

To see the bias distribution of ratings, we will plot average rating per user.

avg <- rowMeans(movies)
ggplot() + aes(avg) + geom_histogram(binwidth = 0.1) + xlab("Average Rating") + 
    ylab(" No. of Ratings")

We can see from the avg ratings distribution plot below that it varies a lot.

Data Normalization

recommender lab normalizes the data when building a model. Let us normalize the ratings and confirm that all averages are 0 now to see what kind of effect it has.

movie_Normalization <- normalize(movies)
avg <- round(rowMeans(movie_Normalization), 5)
table(avg)

avg
  0 
378

min_Items <- quantile(rowCounts(movies), 0.95)
min_Users <- quantile(colCounts(movies), 0.95)

image(movies[rowCounts(movies) > min_Items, colCounts(movies) > min_Users], 
    main = "Heatmap of the Top Users and Movies (Non-Normalized")

image(movie_Normalization[rowCounts(movie_Normalization) > min_Items, colCounts(movie_Normalization) > 
    min_Users], main = "Heatmap of the Top Users and Movies (Normalized)")

Reviewing rows in two heatmaps above, we can see that after normalization, the average rating is more uniform. Visually it does appear that bias is reduced.

Algorithms Used

Item-Item Collaborative Filtering

Step 1)

Split the dataset into training set (80%) and testing set (20%).

set.seed(80)
train_set <- sample(x = c(TRUE, FALSE), size = nrow(movies), replace = TRUE, 
    prob = c(0.8, 0.2))

movie_Train <- movies[train_set, ]
movie_Test <- movies[!train_set, ]

Training the model

Step 2)

Now, after splitting the dataset into training and test sets, let’s create a model using training set.

(model <- Recommender(movie_Train, method = "IBCF"))

Recommender of type 'IBCF' for 'realRatingMatrix' 
learned using 302 users.

With recommenderlab package , we can train the model.

Examining the Similarity Matrix

Step 3) By examining the similarity matrix, we can find top ten movies that are similar to other movies.

similarityMatrix <- getModel(model)$sim
which_max <- order(colSums(similarityMatrix > 0), decreasing = TRUE)[1:10]
topMovies <- as.data.frame(as.integer(rownames(similarityMatrix)[which_max]))
colnames(topMovies) <- c("movieId")
data <- topMovies %>% inner_join(movies, by = "movieId") %>% select(Movie = "title")
knitr::kable(data, format = "html") %>% kableExtra::kable_styling(bootstrap_options = c("striped", 
    "hover"))

Movie
Disclosure (1994)
Piano, The (1993)
City Slickers II: The Legend of Curly’s Gold (1994)
Congo (1995)
Broken Arrow (1996)
Wild Wild West (1999)
First Knight (1995)
Eraser (1996)
Coneheads (1993)
Beverly Hills Cop III (1994)

Recommendations using test set

(pred <- predict(model, newdata = movie_Test, n = 6))

Recommendations as 'topNList' with n = 6 for 76 users.

Now, Let’s extract recommenders

Movie Ratings for first user

user1 <- as.data.frame(movie_Test@data[1, movie_Test@data[1, ] > 0])
colnames(user1) <- c("Rating")
user1[c("movieId")] <- as.integer(rownames(user1))
data <- movies %>% inner_join(user1, by = "movieId") %>% select(Movie = "title", 
    Rating) %>% arrange(desc(Rating))
knitr::kable(data, format = "html") %>% kableExtra::kable_styling(bootstrap_options = c("striped", 
    "hover"))

Movie	Rating
Shawshank Redemption, The (1994)	5.0
Forrest Gump (1994)	5.0
Blade Runner (1982)	5.0
One Flew Over the Cuckoo’s Nest (1975)	5.0
Hook (1991)	5.0
Kill Bill: Vol. 2 (2004)	5.0
Casino Royale (2006)	5.0
Crouching Tiger, Hidden Dragon (Wo hu cang long) (2000)	4.5
Traffic (2000)	4.5
Mulholland Drive (2001)	4.5
Bowling for Columbine (2002)	4.5
Interview with the Vampire: The Vampire Chronicles (1994)	4.0
Gladiator (2000)	4.0
Chicken Run (2000)	4.0
Best in Show (2000)	4.0
Lost in Translation (2003)	4.0
Mystic River (2003)	4.0
Kill Bill: Vol. 1 (2003)	4.0
Incredibles, The (2004)	4.0
Prestige, The (2006)	4.0
No Country for Old Men (2007)	4.0
Inglourious Basterds (2009)	4.0
Fight Club (1999)	3.5
Monsters, Inc. (2001)	3.5
Royal Tenenbaums, The (2001)	3.5
Beautiful Mind, A (2001)	3.5
Bourne Identity, The (2002)	3.5
Finding Nemo (2003)	3.5
Eternal Sunshine of the Spotless Mind (2004)	3.5
Superbad (2007)	3.5
Avatar (2009)	3.5
Godfather, The (1972)	3.0
Memento (2000)	3.0
Shrek (2001)	3.0
Dark Knight, The (2008)	3.0
O Brother, Where Art Thou? (2000)	2.5
Pirates of the Caribbean: The Curse of the Black Pearl (2003)	2.5
Batman Begins (2005)	2.5
Departed, The (2006)	2.5
Bourne Ultimatum, The (2007)	2.5
Million Dollar Baby (2004)	2.0
WALL·E (2008)	2.0
Up (2009)	2.0
Donnie Darko (2001)	1.5
28 Days Later (2002)	1.5
Pan’s Labyrinth (Laberinto del fauno, El) (2006)	1.5
Ratatouille (2007)	1.5
Lord of the Rings: The Fellowship of the Ring, The (2001)	1.0
Lord of the Rings: The Return of the King, The (2003)	1.0
High Fidelity (2000)	0.5
Requiem for a Dream (2000)	0.5
Harry Potter and the Chamber of Secrets (2002)	0.5
Big Fish (2003)	0.5
V for Vendetta (2006)	0.5
Juno (2007)	0.5
Iron Man (2008)	0.5
Slumdog Millionaire (2008)	0.5
Star Trek (2009)	0.5
Hangover, The (2009)	0.5
District 9 (2009)	0.5

First user recommendations are as follows. Of course, this evaluation is highly subjective, but I would only question the recommendation of the first user.

Recommendations for the first user

recommended <- pred@itemLabels[pred@items[[1]]]
recommended <- as.data.frame(as.integer(recommended))
colnames(recommended) <- c("movieId")
data <- recommended %>% inner_join(movies, by = "movieId") %>% select(Movie = "title")
knitr::kable(data, format = "html") %>% kableExtra::kable_styling(bootstrap_options = c("striped", 
    "hover"))

Movie
Stargate (1994)
Robin Hood: Men in Tights (1993)
Schindler’s List (1993)
Alien (1979)
The Devil’s Advocate (1997)
Big (1988)

User-User Collaborative Filtering

Training the model

Let’s create a user based model using the training set.

(model <- Recommender(movie_Train, method = "UBCF"))

Recommender of type 'UBCF' for 'realRatingMatrix' 
learned using 302 users.

Recommendations using test set

(pred <- predict(model, newdata = movie_Test, n = 6))

Recommendations as 'topNList' with n = 6 for 76 users.

Let us consider the first user and look at his/her recommendations. The first user gravitated towards more critically acclaimed dramas and these recommendations are among the best movies produced.

# Recommendations for the first user
recommended <- pred@itemLabels[pred@items[[1]]]
recommended <- as.data.frame(as.integer(recommended))
colnames(recommended) <- c("movieId")
data <- recommended %>% inner_join(movies, by = "movieId") %>% select(Movie = "title")
knitr::kable(data, format = "html") %>% kableExtra::kable_styling(bootstrap_options = c("striped", 
    "hover"))

Movie
Babe (1995)
Fugitive, The (1993)
Braveheart (1995)
Lion King, The (1994)
Star Wars: Episode IV - A New Hope (1977)
Aladdin (1992)

Normalization Test

Consider if we build the model without normalizing the data. Without normalization, recommendations include more usual suspects (movies very highly rated by majority of users) and therefore are more generic.

model <- Recommender(movie_Train, method = "UBCF", parameter = list(normalize = NULL))
pred <- predict(model, newdata = movie_Test, n = 6)
recommended <- pred@itemLabels[pred@items[[1]]]
recommended <- as.data.frame(as.integer(recommended))
colnames(recommended) <- c("movieId")
data <- recommended %>% inner_join(movies, by = "movieId") %>% select(Movie = "title")
knitr::kable(data, format = "html") %>% kableExtra::kable_styling(bootstrap_options = c("striped", 
    "hover"))

Movie
Babe (1995)
Fugitive, The (1993)
Jurassic Park (1993)
Silence of the Lambs, The (1991)
Braveheart (1995)
Pulp Fiction (1994)

Summary

For both Item-Item and User-User collaborative filtering, the recommendations which the first user got was almost similar. Movie ratings extracted for the first user consisted of drama genre movies. Therefore it was interesting to see that the movies recommended to the same user also consisted of drama genre movies.

The algorithms are carried out using recommenderlab package in R. It is a basic recommender system. More development and testing would be needed for a usable recommender system.

Comparing the results of UBCF with IBCF helps in understanding the algorithm better. UBCF needs to access the initial data, so it is a lazy-learning model. Since it needs to keep the entire database in memory, it doesn’t work well in the presence of a big rating matrix.

However, UBCF’s accuracy is proven to be slightly more accurate than IBCF, so it’s a good option if the dataset is not too big.