Project 2: Content-Based and Collaborative Filtering

Introduction

Start with an existing dataset of user-item ratings, such as our toy books dataset, MovieLens, Jester or another dataset of your choosing. Implement at least two two of these recommendation algorithms:

Content-Based Filtering
User-User Collaborative Filtering
Item-Item Collaborative Filtering

# Loading libraries
library(recommenderlab)
library(tidyr)
library(ggplot2)
library(dplyr)

Overview of data

The data set is MovieLens project and it was downloaded from https://grouplens.org/datasets/movielens/.

# Data import
ratings <- read.csv(paste0("https://raw.githubusercontent.com/ErindaB/Data-612/master/ratings.csv"))
titles <- read.csv(paste0("https://raw.githubusercontent.com/ErindaB/Data-612/master/movies.csv"))
# Convert to matrix
MovieMatrix <- ratings %>% 
  select(-timestamp) %>% 
  spread(movieId, rating)
row.names(MovieMatrix) <- MovieMatrix[,1]
MovieMatrix <- MovieMatrix[-c(1)]
MovieMatrix <- as(as.matrix(MovieMatrix), "realRatingMatrix")

MovieMatrix

## 671 x 9066 rating matrix of class 'realRatingMatrix' with 100004 ratings.

The movie matrix contains 671 users and 9,066 items/movies.

Data Exploration

vRatings <- as.vector(MovieMatrix@data)
vRatings <- vRatings[vRatings != 0]
ggplot() + aes(vRatings) + 
  geom_histogram(binwidth = 0.5) +
  xlab("Rating") + ylab("No of Ratings")

We will select only users and items with the most information since it does not make sense to use sparse data to build our model

( movies <- MovieMatrix[rowCounts(MovieMatrix) > 50, 
                        colCounts(MovieMatrix) > 50] )

## 421 x 444 rating matrix of class 'realRatingMatrix' with 37915 ratings.

We are left with 421 users and 444 items. This is about two thirds of the original users and under 5% of the original items. However, this covers almost 38% of original ratings. It seems that there were a lot of items/movies with just a few ratings.

Any ratings matrix, especially movie ratings matrix, is bound to have some bias. Some users just give higher ratings than others. Consider average rating per user. We can see from the distribution plot below that it varies a lot.

avg <- rowMeans(movies)
ggplot() + aes(avg) + 
  geom_histogram(binwidth = 0.1) +
  xlab("Average Rating") + ylab("No of Ratings")

recommenderlab normalizes the data when building a model. Let us normalize the ratings and confirm that all averages are 0 now to see what kind of effect it has.

moviesNorm <- normalize(movies)
avg <- round(rowMeans(moviesNorm),5)
table(avg)

## avg
##   0 
## 421

minItems <- quantile(rowCounts(movies), 0.95)
minUsers <- quantile(colCounts(movies), 0.95)

image(movies[rowCounts(movies) > minItems, 
                 colCounts(movies) > minUsers], 
      main = "Heatmap of the Top Users and Movies (Non-Normalized")

image(moviesNorm[rowCounts(moviesNorm) > minItems, 
                 colCounts(moviesNorm) > minUsers], 
      main = "Heatmap of the Top Users and Movies (Normalized)")

After normalization, the average rating is more uniform. Visually it does appear that bias is reduced.

Item-Item Collaborative Filtering

#Let us split our set into training set (80%) and testing set (20%). 
set.seed(123)
which_train <- sample(x = c(TRUE, FALSE), size = nrow(movies),
                      replace = TRUE, prob = c(0.8, 0.2))

movieTrain <- movies[which_train, ]
movieTest <- movies[!which_train, ]

Training the Model

Now let us create a model using the training set.

( model <- Recommender(movieTrain, method = "IBCF") )

## Recommender of type 'IBCF' for 'realRatingMatrix' 
## learned using 342 users.

We can examine the similarity matrix and find top ten movies that are similar to other movies.

similarityMatrix <- getModel(model)$sim
which_max <- order(colSums(similarityMatrix > 0), decreasing = TRUE)[1:10]
topMovies <- as.data.frame(as.integer(rownames(similarityMatrix)[which_max]))
colnames(topMovies) <- c("movieId")
data <- topMovies %>% inner_join(titles, by = "movieId") %>% select(Movie = "title")
knitr::kable(data, format = "html") %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))

Movie
Disclosure (1994)
Beverly Hills Cop III (1994)
City Slickers II: The Legend of Curly’s Gold (1994)
Grumpier Old Men (1995)
Maltese Falcon, The (1941)
Congo (1995)
Hoop Dreams (1994)
Eraser (1996)
I, Robot (2004)
Seven Samurai (Shichinin no samurai) (1954)

Making Recommendations Using Test Set

( pred <- predict(model, newdata = movieTest, n = 6) )

## Recommendations as 'topNList' with n = 6 for 79 users.

Now we can extract recommendations. Consider the first user. Clearly a Star Wars fan and not a Trekkie.

# Movie ratings of the first user
user1 <- as.data.frame(movieTest@data[1,movieTest@data[1,]>0])
colnames(user1) <- c("Rating")
user1[c("movieId")] <- as.integer(rownames(user1))
data <- titles %>% 
  inner_join(user1, by = "movieId") %>% 
  select(Movie = "title", Rating) %>%
  arrange(desc(Rating))
knitr::kable(data, format = "html") %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))

Movie	Rating
Pretty Woman (1990)	5.0
Sound of Music, The (1965)	5.0
Grease (1978)	5.0
Little Mermaid, The (1989)	5.0
Mrs. Doubtfire (1993)	4.5
Liar Liar (1997)	4.5
As Good as It Gets (1997)	4.5
There’s Something About Mary (1998)	4.5
Almost Famous (2000)	4.5
Miss Congeniality (2000)	4.5
Beautiful Mind, A (2001)	4.5
My Big Fat Greek Wedding (2002)	4.5
Spider-Man (2002)	4.5
Spider-Man 2 (2004)	4.5
Million Dollar Baby (2004)	4.5
Grumpier Old Men (1995)	4.0
Clueless (1995)	4.0
Happy Gilmore (1996)	4.0
Birdcage, The (1996)	4.0
Apollo 13 (1995)	4.0
Forrest Gump (1994)	4.0
Lion King, The (1994)	4.0
Mask, The (1994)	4.0
Speed (1994)	4.0
Dave (1993)	4.0
Home Alone (1990)	4.0
Beauty and the Beast (1991)	4.0
Wizard of Oz, The (1939)	4.0
Cinderella (1950)	4.0
Graduate, The (1967)	4.0
When Harry Met Sally… (1989)	4.0
Truman Show, The (1998)	4.0
Titanic (1997)	4.0
Wedding Singer, The (1998)	4.0
Rain Man (1988)	4.0
Breakfast Club, The (1985)	4.0
Antz (1998)	4.0
You’ve Got Mail (1998)	4.0
Austin Powers: The Spy Who Shagged Me (1999)	4.0
American Pie (1999)	4.0
Erin Brockovich (2000)	4.0
Pianist, The (2002)	4.0
Finding Nemo (2003)	4.0
28 Days Later (2002)	4.0
Love Actually (2003)	4.0
I, Robot (2004)	4.0
40-Year-Old Virgin, The (2005)	4.0
Dumb & Dumber (Dumb and Dumber) (1994)	3.5
Ace Ventura: Pet Detective (1994)	3.5
Aladdin (1992)	3.5
Nutty Professor, The (1996)	3.5
Vertigo (1958)	3.5
Jerry Maguire (1996)	3.5
Lost World: Jurassic Park, The (1997)	3.5
Exorcist, The (1973)	3.5
Bug’s Life, A (1998)	3.5
Office Space (1999)	3.5
Sixth Sense, The (1999)	3.5
Bowfinger (1999)	3.5
Ferris Bueller’s Day Off (1986)	3.5
Being John Malkovich (1999)	3.5
Toy Story 2 (1999)	3.5
Talented Mr. Ripley, The (1999)	3.5
Patriot, The (2000)	3.5
Meet the Parents (2000)	3.5
Cast Away (2000)	3.5
Shrek (2001)	3.5
Moulin Rouge (2001)	3.5
Bowling for Columbine (2002)	3.5
One Flew Over the Cuckoo’s Nest (1975)	3.0
Ocean’s Eleven (2001)	3.0
Harry Potter and the Chamber of Secrets (2002)	3.0
Godfather, The (1972)	2.5
Godfather: Part II, The (1974)	2.5
Lost in Translation (2003)	2.5

# Recommendations for the first user
recommended <- pred@itemLabels[pred@items[[1]]]
recommended <- as.data.frame(as.integer(recommended))
colnames(recommended) <- c("movieId")
data <- recommended %>% inner_join(titles, by = "movieId") %>% select(Movie = "title")
knitr::kable(data, format = "html") %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))

Movie
Batman Returns (1992)
Unbreakable (2000)
Father of the Bride Part II (1995)
Pirates of the Caribbean: Dead Man’s Chest (2006)
Shutter Island (2010)
City of God (Cidade de Deus) (2002)

User-User Collaborative Filtering

The setup and code for the user based collaborative filtering is very similar to the item-based collaborative filtering above.

Training the Model

Now let us create a user-based model using the training set.

( model <- Recommender(movieTrain, method = "UBCF") )

## Recommender of type 'UBCF' for 'realRatingMatrix' 
## learned using 342 users.

Making Recommendations Using Test Set

( pred <- predict(model, newdata = movieTest, n = 6) )

## Recommendations as 'topNList' with n = 6 for 79 users.

Again let us consider the first user and look at his/her recommendations. I would consider these as solid recommendations. The first user gravitated towards more critically acclaimed dramas and these recommendations are among the best movies produced.

# Recommendations for the first user
recommended <- pred@itemLabels[pred@items[[1]]]
recommended <- as.data.frame(as.integer(recommended))
colnames(recommended) <- c("movieId")
data <- recommended %>% inner_join(titles, by = "movieId") %>% select(Movie = "title")
knitr::kable(data, format = "html") %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))

Movie
Crow, The (1994)
Batman Begins (2005)
Like Water for Chocolate (Como agua para chocolate) (1992)
Hoop Dreams (1994)
Charlie’s Angels (2000)
Lawrence of Arabia (1962)

Summary

When comparing the results of UBCF with IBCF : UBCF needs to access the initial data, so it is a lazy-learning model. Since it needs to keep the entire database in memory, it doesn’t work well in the presence of a big rating matrix.

UBCF’s accuracy is proven to be slightly more accurate than IBCF, so it’s a good option if the dataset is not too big.

Also note that collaborative filtering has some limitations. When dealing with new users and/ or new items, the algorithm has these potential problems:

If the new user hasn’t seen any movie yet, neither the IBCF nor the UBCF is able to recommend any item. Unless the IBCF knows the items purchased by the new user, it can’t work. The UBCF needs to know which users have similar preferences to the new one, but we don’t know about its ratings.

If the new item hasn’t been purchased by anyone, it will never be recommended. IBCF matches items that have been purchased by the same users, so it won’t match the new item with any of the others. UBCF recommends to each user items purchased by similar users, and no one purchased the new item. So, the algorithm won’t recommend it to anyone.

Project 2: Content-Based and Collaborative Filtering

Erinda Budo

June 16, 2020

Introduction

Overview of data

Data Exploration

Item-Item Collaborative Filtering

Training the Model

Making Recommendations Using Test Set

User-User Collaborative Filtering

Training the Model

Making Recommendations Using Test Set

Summary