Data 612 Project 2 - Content-Based and Collaborative Filtering

if (!require("knitr")) install.packages("knitr")
if (!require("tidyverse")) install.packages("tidyverse")
if (!require("kableExtra")) install.packages("kableExtra")
if (!require("dplyr")) install.packages("dplyr")
if (!require("Matrix")) install.packages("Matrix")
if (!require("recommenderlab")) install.packages("recommenderlab")
if (!require("gridExtra")) install.packages("gridExtra")
if (!require("graphics")) install.packages("graphics")

Introduction

For assignment 2, start with an existing dataset of user-item ratings, such as our toy books dataset, MovieLens, Jester [http://eigentaste.berkeley.edu/dataset/] or another dataset of your choosing. Implement at least two of these recommendation algorithms:

• Content-Based Filtering • User-User Collaborative Filtering • Item-Item Collaborative Filtering

As an example of implementing a Content-Based recommender, you could build item profiles for a subset of MovieLens movies from scraping http://www.imdb.com/ or using the API at https://www.omdbapi.com/ (which has very recently instituted a small monthly fee). A more challenging method would be to pull movie summaries or reviews and apply tf-idf and/or topic modeling.

You should evaluate and compare different approaches, using different algorithms, normalization techniques, similarity methods, neighborhood sizes, etc. You don’t need to be exhaustive—these are just some suggested possibilities.

You may use the course text’s recommenderlab or any other library that you want.

Data

Load your data into (for example) I use MovieLens small datasets: 100,000 ratings and 3,600 tag applications applied to 9,000 movies by 600 users.

ratings_data <- read.csv('ratings.csv', stringsAsFactors = F)
movie_data <- read.csv('movies.csv', stringsAsFactors = F)

#Load results
head(ratings_data) %>% kable()%>%
  kable_styling(bootstrap_options = "striped", full_width = F)

userId	movieId	rating	timestamp
1	1	4	964982703
1	3	4	964981247
1	6	4	964982224
1	47	5	964983815
1	50	5	964982931
1	70	3	964982400

head(movie_data)%>% kable()%>%
  kable_styling(bootstrap_options = "striped", full_width = F)

movieId	title	genres
1	Toy Story (1995)	Adventure\|Animation\|Children\|Comedy\|Fantasy
2	Jumanji (1995)	Adventure\|Children\|Fantasy
3	Grumpier Old Men (1995)	Comedy\|Romance
4	Waiting to Exhale (1995)	Comedy\|Drama\|Romance
5	Father of the Bride Part II (1995)	Comedy
6	Heat (1995)	Action\|Crime\|Thriller

### Structure
str(ratings_data)

## 'data.frame':    100836 obs. of  4 variables:
##  $ userId   : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ movieId  : int  1 3 6 47 50 70 101 110 151 157 ...
##  $ rating   : num  4 4 4 5 5 3 5 4 5 5 ...
##  $ timestamp: int  964982703 964981247 964982224 964983815 964982931 964982400 964980868 964982176 964984041 964984100 ...

str(movie_data)

## 'data.frame':    9742 obs. of  3 variables:
##  $ movieId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ title  : chr  "Toy Story (1995)" "Jumanji (1995)" "Grumpier Old Men (1995)" "Waiting to Exhale (1995)" ...
##  $ genres : chr  "Adventure|Animation|Children|Comedy|Fantasy" "Adventure|Children|Fantasy" "Comedy|Romance" "Comedy|Drama|Romance" ...

### Summary
str(movie_data)

## 'data.frame':    9742 obs. of  3 variables:
##  $ movieId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ title  : chr  "Toy Story (1995)" "Jumanji (1995)" "Grumpier Old Men (1995)" "Waiting to Exhale (1995)" ...
##  $ genres : chr  "Adventure|Animation|Children|Comedy|Fantasy" "Adventure|Children|Fantasy" "Comedy|Romance" "Comedy|Drama|Romance" ...

summary(movie_data)

##     movieId          title              genres         
##  Min.   :     1   Length:9742        Length:9742       
##  1st Qu.:  3248   Class :character   Class :character  
##  Median :  7300   Mode  :character   Mode  :character  
##  Mean   : 42200                                        
##  3rd Qu.: 76232                                        
##  Max.   :193609

Transform Data

I used realRatingMatrix from ‘recommenderlab’ to transform data.

#### use pivot_wider to make a user matrix
ratings_data$userId <- as.factor(ratings_data$userId)

UI <- as(ratings_data, "realRatingMatrix")
dim(UI@data)

## [1]  610 9724

First, I will look at the similarity between items. The more Yellow the cell is, the more similar two items are. Note that the diagonal is Red, since it’s comparing each items with itself.

Items

#Items
similarity_items <- similarity(UI[,1:10], method = "cosine", which = "items")
# The more red the cell is, the more similar two users are. 
image(as.matrix(similarity_items), main="Items Similarity (cosine)")

Users

User based collaborative filtering algorithms are based on measuring the similarity between users. First, I will look at the similarity between users. The more yellow the cell is, the more similar two users are.

#Users
similarity_users <- similarity(UI[1:10, ], method = "cosine", which = "userId")

# The more red the cell is, the more similar two users are. 
image(as.matrix(similarity_users), main="User Similarity (cosine)")

Distribution of The Ratings

#Distribution plot
ratings_dis <- as.vector(UI@data)
ratings_dis <- ratings_dis[ratings_dis != 0]
ratings_dis <- as.data.frame(table(ratings_dis))

ggplot(ratings_dis, aes(x = ratings_dis, y= Freq)) + 
  geom_bar(stat= "identity") +
  geom_text(aes(label=Freq), position=position_stack(vjust = 0.5), color="white", size=4)+
  labs(y = "No. of Ratings", x = "Ratings") + ggtitle("Distribution of the ratings")+
  theme(plot.title = element_text(hjust = 0.5))

Users who rated only a few movies their ratings might be biased, so I can remove that.

movies <- UI[rowCounts(UI) > 50, colCounts(UI) > 
    50]

In general, any ratings matrix, especially movie ratings matrix, is bound to have some bias. Some users may give higher ratings than others.

To see the bias distribution of ratings, we will plot average rating per user.

avg <- rowMeans(movies)
ggplot() + aes(avg) + geom_histogram(binwidth = 0.1) + xlab("Average Rating") + 
    ylab(" No. of Ratings")

We can see from the avg ratings distribution plot below that it varies a lot.

recommender lab normalizes the data when building a model. Let us normalize the ratings and confirm that all averages are 0 now to see what kind of effect it has.

movie_Normalization <- normalize(movies)
avg <- round(rowMeans(movie_Normalization), 5)
table(avg)

## avg
##   0 
## 378

min_Items <- quantile(rowCounts(movies), 0.95)
min_Users <- quantile(colCounts(movies), 0.95)

image(movies[rowCounts(movies) > min_Items, colCounts(movies) > min_Users], 
    main = "Heatmap of the Top Users and Movies (Non-Normalized")

image(movie_Normalization[rowCounts(movie_Normalization) > min_Items, colCounts(movie_Normalization) > 
    min_Users], main = "Heatmap of the Top Users and Movies (Normalized)")

Split training and test datasets.

Split the dataset into training set (80%) and testing set (20%).

set.seed(10)
train_set <- sample(x = c(TRUE, FALSE), size = nrow(movies), replace = TRUE, 
    prob = c(0.8, 0.2))

movie_Train <- movies[train_set, ]
movie_Test <- movies[!train_set, ]

##Item-Item Collaborative Filtering

I build an item-item collaborative filtering where I recommend movies to users where their item’s ratings are similar.

Using your training data, let’s create a model using method IBCF .

IBCF_model <- Recommender(movie_Train, method = "IBCF")

### Recommendations using test set
IBCF_pred <- predict(IBCF_model, newdata = movie_Test, n = 6)

Now, Let’s extract recommenders Movie Ratings for first user

user1 <- as.data.frame(movie_Test@data[1, movie_Test@data[1, ] > 0])
colnames(user1) <- c("Rating")
user1[c("movieId")] <- as.integer(rownames(user1))

data <- movie_data %>% inner_join(user1, by = "movieId") %>% select(Movie = "title", 
    Rating) %>% arrange(desc(Rating))
knitr::kable(data, format = "html") %>% kableExtra::kable_styling(bootstrap_options = c("striped", 
    "hover"))

Movie	Rating
Usual Suspects, The (1995)	5
Star Wars: Episode IV - A New Hope (1977)	5
Pulp Fiction (1994)	5
Shawshank Redemption, The (1994)	5
Batman (1989)	5
Silence of the Lambs, The (1991)	5
Fargo (1996)	5
Casablanca (1942)	5
Die Hard (1988)	5
Willy Wonka & the Chocolate Factory (1971)	5
Princess Bride, The (1987)	5
Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981)	5
Mars Attacks! (1996)	5
Sabrina (1995)	4
Get Shorty (1995)	4
Leaving Las Vegas (1995)	4
Twelve Monkeys (a.k.a. 12 Monkeys) (1995)	4
Dead Man Walking (1995)	4
Birdcage, The (1996)	4
While You Were Sleeping (1995)	4
Dave (1993)	4
Schindler’s List (1993)	4
Sleepless in Seattle (1993)	4
Truth About Cats & Dogs, The (1996)	4
Rock, The (1996)	4
Fish Called Wanda, A (1988)	4
Star Wars: Episode V - The Empire Strikes Back (1980)	4
Star Trek: First Contact (1996)	4
Jerry Maguire (1996)	4
Toy Story (1995)	3
Grumpier Old Men (1995)	3
Heat (1995)	3
Clueless (1995)	3
Mr. Holland’s Opus (1995)	3
Broken Arrow (1996)	3
Batman Forever (1995)	3
What’s Eating Gilbert Grape (1993)	3
Crow, The (1994)	3
Four Weddings and a Funeral (1994)	3
Addams Family Values (1993)	3
Tombstone (1993)	3
Mission: Impossible (1996)	3
Dragonheart (1996)	3
Twister (1996)	3
Independence Day (a.k.a. ID4) (1996)	3
Cable Guy, The (1996)	3
Eraser (1996)	3
Monty Python’s Life of Brian (1979)	3
Scream (1996)	2
Nutty Professor, The (1996)	1

From the raw average, and the appropriate user and item biases, calculate the baseline predictors for every user-item combination.

recommended <- IBCF_pred@itemLabels[IBCF_pred@items[[1]]]
recommended <- as.data.frame(as.integer(recommended))
colnames(recommended) <- c("movieId")
data <- recommended %>% inner_join(movie_data, by = "movieId") %>% select(Movie = "title")
knitr::kable(data, format = "html") %>% kableExtra::kable_styling(bootstrap_options = c("striped", 
    "hover"))

Movie
Congo (1995)
Die Hard: With a Vengeance (1995)
Clear and Present Danger (1994)
Fugitive, The (1993)
Three Musketeers, The (1993)
Stand by Me (1986)

User-User Collaborative Filtering

Finally, I build a user-user collaborative filtering where I recommend movies to users based on how similar they are with other users.

UBCF_model <- Recommender(movie_Train, method = "UBCF")

####Recommendations using test set

UBCF_pred <- predict(UBCF_model, newdata = movie_Test, n = 6)

Let us consider the first user and look at his/her recommendations. The first user gravitated towards more critically acclaimed dramas and these recommendations are among the best movies produced.

# Recommendations for the first user
UBCF_recommended <- UBCF_pred@itemLabels[UBCF_pred@items[[1]]]
UBCF_recommended <- as.data.frame(as.integer(UBCF_recommended))
colnames(UBCF_recommended) <- c("movieId")

data <- UBCF_recommended %>% inner_join(movie_data, by = "movieId") %>% select(Movie = "title")
knitr::kable(data, format = "html") %>% kableExtra::kable_styling(bootstrap_options = c("striped", 
    "hover"))

Movie
Lord of the Rings: The Two Towers, The (2002)
Casino Royale (2006)
Matrix, The (1999)
Lord of the Rings: The Fellowship of the Ring, The (2001)
Gladiator (2000)
Dark Knight Rises, The (2012)

Normalized

Normalized_model <- Recommender(movie_Train, method = "UBCF", parameter = list(normalize = NULL))
Normalized_pred <- predict(Normalized_model, newdata = movie_Test, n = 6)
recommended <- Normalized_pred@itemLabels[Normalized_pred@items[[1]]]
recommended <- as.data.frame(as.integer(recommended))
colnames(recommended) <- c("movieId")
data <- recommended %>% inner_join(movie_data, by = "movieId") %>% select(Movie = "title")
knitr::kable(data, format = "html") %>% kableExtra::kable_styling(bootstrap_options = c("striped", 
    "hover"))

Movie
Matrix, The (1999)
Fight Club (1999)
American Beauty (1999)
Dark Knight, The (2008)
Memento (2000)
Lord of the Rings: The Fellowship of the Ring, The (2001)

Summary results.

For both Item-Item and User-User collaborative filtering, the recommendations which the user got was almost similar. Movie ratings extracted for the first user consisted of action genre movies. Therefore it was interesting to see that the movies recommended to the same user also consisted of same genre movies.