DATA 643 Project 2 | Content-Based and Collaborative Filtering

Dataset

I will use MovieLense, the 100k MovieLens ratings data set. The data was collected through the MovieLens web site (movielens.umn.edu) during the seven-month period from September 19th, 1997 through April 22nd, 1998. The data set contains about 100,000 ratings (1-5) from 943 users on 1664 movies.

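library(recommenderlab)             # MovieLense, Recommender(), similarity()
library(tidyverse)                  # %>%, ggplot2, tibble helpers
library(knitr); library(kableExtra) # kable(), kable_styling()
set.seed(1)
data("MovieLense")
movielense_dt <- MovieLense@data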

Data Exploration

In this section, I explore the dataset visually. I first look at the distribution of the ratings: a rating of 4 has the highest count. I then plot a heatmap of the rating matrix.

# distribution of ratings
# keep only observed ratings (0 marks a missing rating in the sparse matrix)
tibble(value = as.vector(movielense_dt)) %>% 
  filter(value != 0) %>% 
  ggplot(aes(value)) + 
  geom_bar() +
  labs(title = "Distribution of the ratings", y = "", x = "Ratings") +
  theme_minimal()

# heatmap of the rating matrix
image(movielense_dt, main = "Heatmap of the rating matrix")

Data Preparation

In this section I prepare the data for modeling. First, I select the most relevant data: movies that have been viewed only a few times may have biased ratings because of the lack of data, and the same holds for users who rated only a few movies. I do not normalize the dataset myself because the Recommender function that builds the model normalizes the data by default, in such a way that the average rating of each user is 0. Lastly, I split the dataset into a training set and a test set to build the model.

# selecting the most relevant data: users with more than 50 ratings, movies with more than 100 ratings
ratings_movies <- MovieLense[rowCounts(MovieLense) > 50, colCounts(MovieLense) > 100]

# split the dataset: each user goes to train with probability 0.8, otherwise to test
which_train <- sample(x = c(TRUE, FALSE), size = nrow(ratings_movies), replace = TRUE, prob = c(0.8, 0.2))
train <- ratings_movies[which_train, ]
test <- ratings_movies[!which_train, ]
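The centering mentioned in this section (each user's average rating becomes 0) can be illustrated on a small example. This is a minimal base-R sketch on a made-up 3x3 matrix; the matrix and the names `ratings`, `user_means`, and `centered` are assumptions for illustration, and this is not recommenderlab's internal code.

```r
# Toy ratings matrix (made-up values): rows are users, columns are movies,
# NA marks a movie the user has not rated.
ratings <- matrix(
  c( 5,  3, NA,
     4, NA,  2,
    NA,  1,  3),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("u1", "u2", "u3"), c("m1", "m2", "m3"))
)

# Mean rating of each user, ignoring unrated movies.
user_means <- rowMeans(ratings, na.rm = TRUE)

# Subtract each user's mean from that user's row.
centered <- sweep(ratings, 1, user_means, FUN = "-")

# After centering, every user's average rating is 0 (up to rounding).
rowMeans(centered, na.rm = TRUE)
```

Centering removes the effect of users who rate everything high or everything low, so that similarities compare preferences rather than rating habits.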

Model 1

Item-based collaborative filtering algorithms are based on measuring the similarity between items. First, I look at the similarity between the first four items. The more red the cell is, the more similar two items are. Note that the diagonal is yellow, since it compares each item with itself. Then I build an item-item collaborative filtering model, which recommends to each user movies that are similar to the ones that user has already rated highly.

# compute the item similarity matrix
similarity_items <- similarity(MovieLense[, 1:4], method = "cosine", which = "items")

# visualize the item similarity matrix
image(as.matrix(similarity_items), main = "Item similarity")
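To make the cosine measure used above concrete, here is a minimal base-R sketch that computes it by hand between the columns (items) of a small, fully observed, made-up matrix. The helper `cosine_items` is hypothetical, and recommenderlab's treatment of missing ratings may differ from this simplified version.

```r
# Toy ratings matrix (made-up values, no missing ratings):
# rows are users, columns are movies.
ratings <- matrix(
  c(5, 4, 1,
    4, 5, 2,
    1, 2, 5),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("u1", "u2", "u3"), c("m1", "m2", "m3"))
)

# Hypothetical helper: cosine similarity between every pair of columns.
cosine_items <- function(m) {
  cross <- crossprod(m)        # t(m) %*% m: dot products between item columns
  norms <- sqrt(diag(cross))   # Euclidean norm of each item column
  cross / outer(norms, norms)  # dot product divided by the product of norms
}

sim <- cosine_items(ratings)
round(sim, 3)
```

In this sketch m1 and m2 receive similar ratings from every user, so their similarity is higher than the similarity between m1 and m3.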

# build recommender model
recc_model1 <- Recommender(train, method = "IBCF", parameter = list(k = 30))
model_details1 <- getModel(recc_model1)

# prediction
recc_predicted1 <- predict(object = recc_model1, newdata = test, n = 6)
recc_matrix1 <- sapply(recc_predicted1@items, function(x) {colnames(ratings_movies)[x]})
recc_matrix1[, 1:3] %>% kable() %>% kable_styling(full_width = TRUE)

5	7	8
Babe (1995)	Edge, The (1997)	Mighty Aphrodite (1995)
African Queen, The (1951)	Men in Black (1997)	Sleepless in Seattle (1993)
Cape Fear (1991)	Cold Comfort Farm (1995)	Cold Comfort Farm (1995)
Alien: Resurrection (1997)	Everyone Says I Love You (1996)	Twister (1996)
Ace Ventura: Pet Detective (1994)	While You Were Sleeping (1995)	Star Trek IV: The Voyage Home (1986)
G.I. Jane (1997)	Peacemaker, The (1997)	Full Monty, The (1997)
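The idea behind the item-based recommendations above can be sketched in a few lines: score each movie a user has not rated by a similarity-weighted average of that user's existing ratings. This is a simplified illustration under stated assumptions, not the exact computation of the IBCF model: the similarity values and the helper `score_items` are made up, and the pruning to the k most similar items is omitted.

```r
# Hypothetical item-item similarity matrix (symmetric, made-up values).
items <- c("m1", "m2", "m3", "m4")
sim <- matrix(
  c(1.0, 0.9, 0.1, 0.4,
    0.9, 1.0, 0.2, 0.5,
    0.1, 0.2, 1.0, 0.3,
    0.4, 0.5, 0.3, 1.0),
  nrow = 4, byrow = TRUE, dimnames = list(items, items)
)

# One user's ratings; NA marks movies the user has not rated.
user <- c(m1 = 5, m2 = NA, m3 = 2, m4 = NA)

# Hypothetical helper: score every unrated movie for this user.
score_items <- function(sim, user) {
  rated   <- names(user)[!is.na(user)]
  unrated <- names(user)[is.na(user)]
  sapply(unrated, function(item) {
    w <- sim[item, rated]            # similarity of this movie to each rated movie
    sum(w * user[rated]) / sum(w)    # similarity-weighted average rating
  })
}

scores <- score_items(sim, user)

# Movies with the highest scores would be recommended first.
sort(scores, decreasing = TRUE)
```

Here m2 scores higher than m4 because it is very similar to m1, which the user rated 5.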

Model 2

User-based collaborative filtering algorithms are based on measuring the similarity between users. First, I look at the similarity between the first four users. The more red the cell is, the more similar two users are. Note that the diagonal is yellow, since it compares each user with itself. Finally, I build a user-user collaborative filtering model, which recommends movies to each user based on the ratings of the users most similar to them.

# compute the user similarity matrix
similarity_users <- similarity(MovieLense[1:4, ], method = "pearson", which = "users")

# visualize the user similarity matrix
image(as.matrix(similarity_users), main = "User similarity")
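As a rough illustration of the Pearson measure used above, the following base-R sketch correlates users over the movies that both have rated. The toy matrix is made up, and `cor()` with pairwise-complete observations is an assumption about how to handle missing ratings; recommenderlab's own computation may differ in detail.

```r
# Toy ratings matrix (made-up values): rows are users, columns are movies,
# NA marks a movie the user has not rated.
ratings <- matrix(
  c(5,  4, 1, NA,
    4,  5, 2,  1,
    1, NA, 5,  4),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("u1", "u2", "u3"), paste0("m", 1:4))
)

# cor() correlates columns, so transpose to put users in columns.
# "pairwise.complete.obs" uses, for each pair of users, only the movies
# that both of them rated.
sim_users <- cor(t(ratings), method = "pearson", use = "pairwise.complete.obs")
round(sim_users, 3)
```

Users u1 and u2 agree on which movies are good, so their correlation is positive, while u1 and u3 have opposite tastes on the movies they share.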

# build recommender model (UBCF takes the neighborhood size as nn, not k)
recc_model2 <- Recommender(train, method = "UBCF", parameter = list(nn = 25))

model_details2 <- getModel(recc_model2)

# prediction
recc_predicted2 <- predict(object = recc_model2, newdata = test, n = 6)
recc_matrix2 <- sapply(recc_predicted2@items, function(x) {colnames(ratings_movies)[x]})
recc_matrix2[, 1:3] %>% kable() %>% kable_styling(full_width = TRUE)

5	7	8
L.A. Confidential (1997)	Good Will Hunting (1997)	Good Will Hunting (1997)
Good Will Hunting (1997)	As Good As It Gets (1997)	Titanic (1997)
As Good As It Gets (1997)	Amistad (1997)	L.A. Confidential (1997)
Schindler’s List (1993)	Titanic (1997)	Scream 2 (1997)
Godfather, The (1972)	Big Night (1996)	Remains of the Day, The (1993)
Amistad (1997)	Trainspotting (1996)	Silence of the Lambs, The (1991)
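The user-based recommendations above follow a simple pattern: find the users most similar to the target user, then average their ratings for the movies the target has not rated. The sketch below shows that pattern on made-up data. The helper `predict_ubcf`, the similarity values, and the assumption that every neighbor has rated every candidate movie are simplifications for illustration, not the package's implementation.

```r
# Toy ratings matrix (made-up values); u1 has not rated m3 and m4.
ratings <- matrix(
  c(5, 4, NA, NA,
    5, 5,  4,  1,
    4, 4,  5,  2,
    1, 2,  1,  5),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("u", 1:4), paste0("m", 1:4))
)

# Hypothetical similarities between the target user u1 and the other users.
sim_to_u1 <- c(u2 = 0.9, u3 = 0.8, u4 = -0.7)

# Hypothetical helper: predict ratings for the target's unrated movies
# from the nn most similar users.
predict_ubcf <- function(ratings, target, sims, nn = 2) {
  neighbors <- names(sort(sims, decreasing = TRUE))[seq_len(nn)]
  unrated   <- colnames(ratings)[is.na(ratings[target, ])]
  w <- sims[neighbors]
  # Each neighbor's row is multiplied by that neighbor's similarity, then the
  # columns are summed and divided by the total weight.
  preds <- colSums(ratings[neighbors, unrated, drop = FALSE] * w) / sum(w)
  sort(preds, decreasing = TRUE)
}

preds <- predict_ubcf(ratings, "u1", sim_to_u1)
preds
```

With nn = 2 the dissimilar user u4 is ignored, so m3 (liked by both neighbors) ranks above m4.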