Project 2: Content-Based and Collaborative Filtering

Introduction

This project is based on the recommenderlab package as described in the Building a Recommendation System with R book by Gorakala and Usuelli. It presents basic item-item and user-user collaborative filtering.

# Required libraries
library(recommenderlab)
library(dplyr)
library(tidyr)
library(ggplot2)

Data Set

The data set is courtesy of MovieLens project and it was downloaded from https://grouplens.org/datasets/movielens/.

# Data import
ratings <- read.csv(paste0("https://raw.githubusercontent.com/ilyakats/CUNY-DATA643/",
                           "master/Project%202/ml-latest-small/ratings.csv"))
titles <- read.csv(paste0("https://raw.githubusercontent.com/ilyakats/CUNY-DATA643/",
                          "master/Project%202/ml-latest-small/movies.csv"))

# Convert to matrix
movieMatrix <- ratings %>% 
  select(-timestamp) %>% 
  spread(movieId, rating)
row.names(movieMatrix) <- movieMatrix[,1]
movieMatrix <- movieMatrix[-c(1)]
movieMatrix <- as(as.matrix(movieMatrix), "realRatingMatrix")

movieMatrix

## 671 x 9066 rating matrix of class 'realRatingMatrix' with 100004 ratings.

Our movie matrix contains 671 users and 9,066 items/movies.

Exploration and Preparation

Let us take a quick look at the distribution of all ratings to make sure that there are no surprises. It seems that users favor whole number ratings. 4 is the most common rating which seems to be common for 5-star rating systems.

vRatings <- as.vector(movieMatrix@data)
vRatings <- vRatings[vRatings != 0]
ggplot() + aes(vRatings) + 
  geom_histogram(binwidth = 0.5) +
  xlab("Rating") + ylab("No of Ratings")

It does not make sense to use sparse data to build our model, so we will select only users and items with the most information.

( movies <- movieMatrix[rowCounts(movieMatrix) > 50, 
                        colCounts(movieMatrix) > 50] )

## 421 x 444 rating matrix of class 'realRatingMatrix' with 37915 ratings.

We are left with 421 users and 444 items. This is about two thirds of the original users and under 5% of the original items. However, this covers almost 38% of original ratings. It seems that there were a lot of items/movies with just a few ratings.

Any ratings matrix, especially movie ratings matrix, is bound to have some bias. Some users just give higher ratings than others. Consider average rating per user. We can see from the distribution plot below that it varies a lot.

avg <- rowMeans(movies)
ggplot() + aes(avg) + 
  geom_histogram(binwidth = 0.1) +
  xlab("Average Rating") + ylab("No of Ratings")

recommenderlab normalizes the data when building a model. Let us normalize the ratings and confirm that all averages are 0 now to see what kind of effect it has.

moviesNorm <- normalize(movies)
avg <- round(rowMeans(moviesNorm),5)
table(avg)

## avg
##   0 
## 421

minItems <- quantile(rowCounts(movies), 0.95)
minUsers <- quantile(colCounts(movies), 0.95)

image(movies[rowCounts(movies) > minItems, 
                 colCounts(movies) > minUsers], 
      main = "Heatmap of the Top Users and Movies (Non-Normalized")

image(moviesNorm[rowCounts(moviesNorm) > minItems, 
                 colCounts(moviesNorm) > minUsers], 
      main = "Heatmap of the Top Users and Movies (Normalized)")

Reviewing rows in two heatmaps above, we can see that after normalization, the average rating is more uniform. Visually it does appear that bias is reduced.

Item-Item Collaborative Filtering

Let us split our set into training set (80%) and testing set (20%).

set.seed(88)
which_train <- sample(x = c(TRUE, FALSE), size = nrow(movies),
                      replace = TRUE, prob = c(0.8, 0.2))

movieTrain <- movies[which_train, ]
movieTest <- movies[!which_train, ]

Training the Model

Now let us create a model using the training set.

( model <- Recommender(movieTrain, method = "IBCF") )

## Recommender of type 'IBCF' for 'realRatingMatrix' 
## learned using 338 users.

Interestingly, what can be the most time consuming and critical step - training the model - can be done in one line of code with recommenderlab package.

We can examine the similarity matrix and find top ten movies that are similar to other movies.

similarityMatrix <- getModel(model)$sim
which_max <- order(colSums(similarityMatrix > 0), decreasing = TRUE)[1:10]
topMovies <- as.data.frame(as.integer(rownames(similarityMatrix)[which_max]))
colnames(topMovies) <- c("movieId")
data <- topMovies %>% inner_join(titles, by = "movieId") %>% select(Movie = "title")
knitr::kable(data, format = "html") %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))

Movie
Disclosure (1994)
Beverly Hills Cop III (1994)
City Slickers II: The Legend of Curly’s Gold (1994)
Grumpier Old Men (1995)
Eraser (1996)
Perfect Storm, The (2000)
Judge Dredd (1995)
Coneheads (1993)
Congo (1995)
Seven Samurai (Shichinin no samurai) (1954)

Making Recommendations Using Test Set

( pred <- predict(model, newdata = movieTest, n = 6) )

## Recommendations as 'topNList' with n = 6 for 83 users.

Now we can extract recommendations. Consider the first user. Clearly a Star Wars fan and not a Trekkie.

# Movie ratings of the first user
user1 <- as.data.frame(movieTest@data[1,movieTest@data[1,]>0])
colnames(user1) <- c("Rating")
user1[c("movieId")] <- as.integer(rownames(user1))
data <- titles %>% 
  inner_join(user1, by = "movieId") %>% 
  select(Movie = "title", Rating) %>%
  arrange(desc(Rating))
knitr::kable(data, format = "html") %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))

Movie	Rating
Braveheart (1995)	5
Star Wars: Episode IV - A New Hope (1977)	5
Shawshank Redemption, The (1994)	5
Wallace & Gromit: A Close Shave (1995)	5
Wallace & Gromit: The Wrong Trousers (1993)	5
Star Wars: Episode V - The Empire Strikes Back (1980)	5
Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981)	5
Star Wars: Episode VI - Return of the Jedi (1983)	5
Amadeus (1984)	5
Glory (1989)	5
Babe (1995)	4
Birdcage, The (1996)	4
True Lies (1994)	4
Jurassic Park (1993)	4
Blade Runner (1982)	4
Nightmare Before Christmas, The (1993)	4
Aladdin (1992)	4
Dances with Wolves (1990)	4
Snow White and the Seven Dwarfs (1937)	4
2001: A Space Odyssey (1968)	4
Fish Called Wanda, A (1988)	4
Monty Python’s Life of Brian (1979)	4
Monty Python and the Holy Grail (1975)	4
Blues Brothers, The (1980)	4
Terminator, The (1984)	4
This Is Spinal Tap (1984)	4
Field of Dreams (1989)	4
Star Trek II: The Wrath of Khan (1982)	4
Toy Story (1995)	3
GoldenEye (1995)	3
Get Shorty (1995)	3
Happy Gilmore (1996)	3
Star Trek: Generations (1994)	3
Forrest Gump (1994)	3
Four Weddings and a Funeral (1994)	3
Lion King, The (1994)	3
Mask, The (1994)	3
Speed (1994)	3
Mrs. Doubtfire (1993)	3
Sleepless in Seattle (1993)	3
Terminator 2: Judgment Day (1991)	3
Batman (1989)	3
Beauty and the Beast (1991)	3
Truth About Cats & Dogs, The (1996)	3
Independence Day (a.k.a. ID4) (1996)	3
Die Hard (1988)	3
Willy Wonka & the Chocolate Factory (1971)	3
E.T. the Extra-Terrestrial (1982)	3
Princess Bride, The (1987)	3
Back to the Future (1985)	3
Young Frankenstein (1974)	3
Indiana Jones and the Last Crusade (1989)	3
When Harry Met Sally… (1989)	3
Star Trek III: The Search for Spock (1984)	3
Star Trek IV: The Voyage Home (1986)	3
Raising Arizona (1987)	3
Stargate (1994)	2
Eraser (1996)	2
Twister (1996)	1

His/her recommendations are as follows. Of course, this evaluation is highly subjective, but I would only question the recommendation of Casper.

# Recommendations for the first user
recommended <- pred@itemLabels[pred@items[[1]]]
recommended <- as.data.frame(as.integer(recommended))
colnames(recommended) <- c("movieId")
data <- recommended %>% inner_join(titles, by = "movieId") %>% select(Movie = "title")
knitr::kable(data, format = "html") %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))

Movie
Ace Ventura: When Nature Calls (1995)
Casper (1995)
Congo (1995)
Judge Dredd (1995)
Clerks (1994)
Legends of the Fall (1994)

User-User Collaborative Filtering

The setup and code for the user based collaborative filtering is very similar to the item-based collaborative filtering above.

Training the Model

Now let us create a user-based model using the training set.

( model <- Recommender(movieTrain, method = "UBCF") )

## Recommender of type 'UBCF' for 'realRatingMatrix' 
## learned using 338 users.

Making Recommendations Using Test Set

( pred <- predict(model, newdata = movieTest, n = 6) )

## Recommendations as 'topNList' with n = 6 for 83 users.

Again let us consider the first user and look at his/her recommendations. I would consider these as solid recommendations. The first user gravitated towards more critically acclaimed dramas and these recommendations are among the best movies produced.

# Recommendations for the first user
recommended <- pred@itemLabels[pred@items[[1]]]
recommended <- as.data.frame(as.integer(recommended))
colnames(recommended) <- c("movieId")
data <- recommended %>% inner_join(titles, by = "movieId") %>% select(Movie = "title")
knitr::kable(data, format = "html") %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))

Movie
Citizen Kane (1941)
Amelie (Fabuleux destin d’AmÃ©lie Poulain, Le) (2001)
Psycho (1960)
Eternal Sunshine of the Spotless Mind (2004)
12 Angry Men (1957)
Airplane! (1980)

Normalization Test

Consider if we build the model without normalizing the data. Only one movie is featured on both lists of recommendations for the first user - Eternal Sunshine of the Spotless Mind. I would argue that without normalization recommendations include more usual suspects (movies very highly rated by majority of users) and therefore are more generic. This is, of course, highly subjective and needs to be researched and tested using more objective means than this writer’s opinion.

model <- Recommender(movieTrain, method = "UBCF", parameter = list(normalize = NULL))
pred <- predict(model, newdata = movieTest, n = 6)
recommended <- pred@itemLabels[pred@items[[1]]]
recommended <- as.data.frame(as.integer(recommended))
colnames(recommended) <- c("movieId")
data <- recommended %>% inner_join(titles, by = "movieId") %>% select(Movie = "title")
knitr::kable(data, format = "html") %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))

Movie
American Beauty (1999)
Pulp Fiction (1994)
Silence of the Lambs, The (1991)
Lord of the Rings: The Fellowship of the Ring, The (2001)
Eternal Sunshine of the Spotless Mind (2004)
Clockwork Orange, A (1971)

Summary

This project presented the most basic approach to building a recommender system in R using the recommenderlab package. It describes some basic steps; however, it is important to note that this is a learning exercise. More development and testing would be needed for a usable recommender system. It would be particularly interesting to see the effect of additional features and to see the performance of the recommender when little information is known about user’s preference.

Project 2: Content-Based and Collaborative Filtering

Ilya Kats

June 19, 2018

Introduction

Data Set

Exploration and Preparation

Item-Item Collaborative Filtering

Training the Model

Making Recommendations Using Test Set

User-User Collaborative Filtering

Training the Model

Making Recommendations Using Test Set

Normalization Test

Summary