Project 2 | Content-Based and Collaborative Filtering
Objective
From an existing dataset of user-item ratings, such as our toy books dataset, MovieLens, Jester [http://eigentaste.berkeley.edu/dataset/] or another dataset of your choosing. Implement at least two of these recommendation algorithms:
• Content-Based Filtering
• User-User Collaborative Filtering
• Item-Item Collaborative Filtering
You should evaluate and compare different approaches, using different algorithms, normalization techniques, similarity methods, neighborhood sizes, etc. You don’t need to be exhaustive—these are just some suggested possibilities.
Data Loading and Data Preparation
Load the Libraries
Load the Data
The data set is from MovieLens project and it was downloaded from [Movie Lens] (https://grouplens.org/datasets/movielens/)
ratings <- read.csv("https://raw.githubusercontent.com/PriyaShaji/Data612/master/Project_2/ratings.csv")
movies <- read.csv("https://raw.githubusercontent.com/PriyaShaji/Data612/master/Project_2/movies.csv")| movieId | title | genres |
|---|---|---|
| 1 | Toy Story (1995) | Adventure|Animation|Children|Comedy|Fantasy |
| 2 | Jumanji (1995) | Adventure|Children|Fantasy |
| 3 | Grumpier Old Men (1995) | Comedy|Romance |
| 4 | Waiting to Exhale (1995) | Comedy|Drama|Romance |
| 5 | Father of the Bride Part II (1995) | Comedy |
| 6 | Heat (1995) | Action|Crime|Thriller |
| userId | movieId | rating | timestamp |
|---|---|---|---|
| 1 | 1 | 4 | 964982703 |
| 1 | 3 | 4 | 964981247 |
| 1 | 6 | 4 | 964982224 |
| 1 | 47 | 5 | 964983815 |
| 1 | 50 | 5 | 964982931 |
| 1 | 70 | 3 | 964982400 |
Convert to Matrix
Movie_Matrix <- ratings %>% select(-timestamp) %>% spread(movieId, rating)
row.names(Movie_Matrix) <- Movie_Matrix[, 1]
Movie_Matrix <- Movie_Matrix[-c(1)]
Movie_Matrix <- as(as.matrix(Movie_Matrix), "realRatingMatrix")
Movie_Matrix610 x 9724 rating matrix of class 'realRatingMatrix' with 100836 ratings.
Our movie matrix contains 610 users and 9,724 items/movies.
Data Preparation
Let us examine the distribution of the ratings dataset
Ratings_distribution <- as.vector(Movie_Matrix@data)
Ratings_distribution <- Ratings_distribution[Ratings_distribution != 0]
ggplot() + aes(Ratings_distribution) + geom_histogram(binwidth = 0.5) + xlab("Ratings") +
ylab("No. of Ratings")As we can see, that the dataset consists of some sparse data, which would not be appropriate to build our model. Therefore, we would proceed with users and items with most information.
378 x 436 rating matrix of class 'realRatingMatrix' with 36214 ratings.
Now we have 378 users and 436 items.
In general, any ratings matrix, especially movie ratings matrix, is bound to have some bias. Some users may give higher ratings than others.
To see the bias distribution of ratings, we will plot average rating per user.
avg <- rowMeans(movies)
ggplot() + aes(avg) + geom_histogram(binwidth = 0.1) + xlab("Average Rating") +
ylab(" No. of Ratings")We can see from the avg ratings distribution plot below that it varies a lot.
Data Normalization
recommender lab normalizes the data when building a model. Let us normalize the ratings and confirm that all averages are 0 now to see what kind of effect it has.
avg
0
378
min_Items <- quantile(rowCounts(movies), 0.95)
min_Users <- quantile(colCounts(movies), 0.95)
image(movies[rowCounts(movies) > min_Items, colCounts(movies) > min_Users],
main = "Heatmap of the Top Users and Movies (Non-Normalized")image(movie_Normalization[rowCounts(movie_Normalization) > min_Items, colCounts(movie_Normalization) >
min_Users], main = "Heatmap of the Top Users and Movies (Normalized)")Reviewing rows in two heatmaps above, we can see that after normalization, the average rating is more uniform. Visually it does appear that bias is reduced.
Algorithms Used
Item-Item Collaborative Filtering
Step 1)
Split the dataset into training set (80%) and testing set (20%).
set.seed(80)
train_set <- sample(x = c(TRUE, FALSE), size = nrow(movies), replace = TRUE,
prob = c(0.8, 0.2))
movie_Train <- movies[train_set, ]
movie_Test <- movies[!train_set, ]Training the model
Step 2)
Now, after splitting the dataset into training and test sets, let’s create a model using training set.
Recommender of type 'IBCF' for 'realRatingMatrix'
learned using 302 users.
With recommenderlab package , we can train the model.
Examining the Similarity Matrix
Step 3) By examining the similarity matrix, we can find top ten movies that are similar to other movies.
similarityMatrix <- getModel(model)$sim
which_max <- order(colSums(similarityMatrix > 0), decreasing = TRUE)[1:10]
topMovies <- as.data.frame(as.integer(rownames(similarityMatrix)[which_max]))
colnames(topMovies) <- c("movieId")
data <- topMovies %>% inner_join(movies, by = "movieId") %>% select(Movie = "title")
knitr::kable(data, format = "html") %>% kableExtra::kable_styling(bootstrap_options = c("striped",
"hover"))| Movie |
|---|
| Disclosure (1994) |
| Piano, The (1993) |
| City Slickers II: The Legend of Curly’s Gold (1994) |
| Congo (1995) |
| Broken Arrow (1996) |
| Wild Wild West (1999) |
| First Knight (1995) |
| Eraser (1996) |
| Coneheads (1993) |
| Beverly Hills Cop III (1994) |
Recommendations using test set
Recommendations as 'topNList' with n = 6 for 76 users.
Now, Let’s extract recommenders
Movie Ratings for first user
user1 <- as.data.frame(movie_Test@data[1, movie_Test@data[1, ] > 0])
colnames(user1) <- c("Rating")
user1[c("movieId")] <- as.integer(rownames(user1))
data <- movies %>% inner_join(user1, by = "movieId") %>% select(Movie = "title",
Rating) %>% arrange(desc(Rating))
knitr::kable(data, format = "html") %>% kableExtra::kable_styling(bootstrap_options = c("striped",
"hover"))| Movie | Rating |
|---|---|
| Shawshank Redemption, The (1994) | 5.0 |
| Forrest Gump (1994) | 5.0 |
| Blade Runner (1982) | 5.0 |
| One Flew Over the Cuckoo’s Nest (1975) | 5.0 |
| Hook (1991) | 5.0 |
| Kill Bill: Vol. 2 (2004) | 5.0 |
| Casino Royale (2006) | 5.0 |
| Crouching Tiger, Hidden Dragon (Wo hu cang long) (2000) | 4.5 |
| Traffic (2000) | 4.5 |
| Mulholland Drive (2001) | 4.5 |
| Bowling for Columbine (2002) | 4.5 |
| Interview with the Vampire: The Vampire Chronicles (1994) | 4.0 |
| Gladiator (2000) | 4.0 |
| Chicken Run (2000) | 4.0 |
| Best in Show (2000) | 4.0 |
| Lost in Translation (2003) | 4.0 |
| Mystic River (2003) | 4.0 |
| Kill Bill: Vol. 1 (2003) | 4.0 |
| Incredibles, The (2004) | 4.0 |
| Prestige, The (2006) | 4.0 |
| No Country for Old Men (2007) | 4.0 |
| Inglourious Basterds (2009) | 4.0 |
| Fight Club (1999) | 3.5 |
| Monsters, Inc. (2001) | 3.5 |
| Royal Tenenbaums, The (2001) | 3.5 |
| Beautiful Mind, A (2001) | 3.5 |
| Bourne Identity, The (2002) | 3.5 |
| Finding Nemo (2003) | 3.5 |
| Eternal Sunshine of the Spotless Mind (2004) | 3.5 |
| Superbad (2007) | 3.5 |
| Avatar (2009) | 3.5 |
| Godfather, The (1972) | 3.0 |
| Memento (2000) | 3.0 |
| Shrek (2001) | 3.0 |
| Dark Knight, The (2008) | 3.0 |
| O Brother, Where Art Thou? (2000) | 2.5 |
| Pirates of the Caribbean: The Curse of the Black Pearl (2003) | 2.5 |
| Batman Begins (2005) | 2.5 |
| Departed, The (2006) | 2.5 |
| Bourne Ultimatum, The (2007) | 2.5 |
| Million Dollar Baby (2004) | 2.0 |
| WALL·E (2008) | 2.0 |
| Up (2009) | 2.0 |
| Donnie Darko (2001) | 1.5 |
| 28 Days Later (2002) | 1.5 |
| Pan’s Labyrinth (Laberinto del fauno, El) (2006) | 1.5 |
| Ratatouille (2007) | 1.5 |
| Lord of the Rings: The Fellowship of the Ring, The (2001) | 1.0 |
| Lord of the Rings: The Return of the King, The (2003) | 1.0 |
| High Fidelity (2000) | 0.5 |
| Requiem for a Dream (2000) | 0.5 |
| Harry Potter and the Chamber of Secrets (2002) | 0.5 |
| Big Fish (2003) | 0.5 |
| V for Vendetta (2006) | 0.5 |
| Juno (2007) | 0.5 |
| Iron Man (2008) | 0.5 |
| Slumdog Millionaire (2008) | 0.5 |
| Star Trek (2009) | 0.5 |
| Hangover, The (2009) | 0.5 |
| District 9 (2009) | 0.5 |
First user recommendations are as follows. Of course, this evaluation is highly subjective, but I would only question the recommendation of the first user.
Recommendations for the first user
recommended <- pred@itemLabels[pred@items[[1]]]
recommended <- as.data.frame(as.integer(recommended))
colnames(recommended) <- c("movieId")
data <- recommended %>% inner_join(movies, by = "movieId") %>% select(Movie = "title")
knitr::kable(data, format = "html") %>% kableExtra::kable_styling(bootstrap_options = c("striped",
"hover"))| Movie |
|---|
| Stargate (1994) |
| Robin Hood: Men in Tights (1993) |
| Schindler’s List (1993) |
| Alien (1979) |
| The Devil’s Advocate (1997) |
| Big (1988) |
User-User Collaborative Filtering
Training the model
Let’s create a user based model using the training set.
Recommender of type 'UBCF' for 'realRatingMatrix'
learned using 302 users.
Recommendations using test set
Recommendations as 'topNList' with n = 6 for 76 users.
Let us consider the first user and look at his/her recommendations. The first user gravitated towards more critically acclaimed dramas and these recommendations are among the best movies produced.
# Recommendations for the first user
recommended <- pred@itemLabels[pred@items[[1]]]
recommended <- as.data.frame(as.integer(recommended))
colnames(recommended) <- c("movieId")
data <- recommended %>% inner_join(movies, by = "movieId") %>% select(Movie = "title")
knitr::kable(data, format = "html") %>% kableExtra::kable_styling(bootstrap_options = c("striped",
"hover"))| Movie |
|---|
| Babe (1995) |
| Fugitive, The (1993) |
| Braveheart (1995) |
| Lion King, The (1994) |
| Star Wars: Episode IV - A New Hope (1977) |
| Aladdin (1992) |
Normalization Test
Consider if we build the model without normalizing the data. Without normalization, recommendations include more usual suspects (movies very highly rated by majority of users) and therefore are more generic.
model <- Recommender(movie_Train, method = "UBCF", parameter = list(normalize = NULL))
pred <- predict(model, newdata = movie_Test, n = 6)
recommended <- pred@itemLabels[pred@items[[1]]]
recommended <- as.data.frame(as.integer(recommended))
colnames(recommended) <- c("movieId")
data <- recommended %>% inner_join(movies, by = "movieId") %>% select(Movie = "title")
knitr::kable(data, format = "html") %>% kableExtra::kable_styling(bootstrap_options = c("striped",
"hover"))| Movie |
|---|
| Babe (1995) |
| Fugitive, The (1993) |
| Jurassic Park (1993) |
| Silence of the Lambs, The (1991) |
| Braveheart (1995) |
| Pulp Fiction (1994) |
Summary
For both Item-Item and User-User collaborative filtering, the recommendations which the first user got was almost similar. Movie ratings extracted for the first user consisted of drama genre movies. Therefore it was interesting to see that the movies recommended to the same user also consisted of drama genre movies.
The algorithms are carried out using recommenderlab package in R. It is a basic recommender system. More development and testing would be needed for a usable recommender system.
Comparing the results of UBCF with IBCF helps in understanding the algorithm better. UBCF needs to access the initial data, so it is a lazy-learning model. Since it needs to keep the entire database in memory, it doesn’t work well in the presence of a big rating matrix.
However, UBCF’s accuracy is proven to be slightly more accurate than IBCF, so it’s a good option if the dataset is not too big.