Start with an existing dataset of user-item ratings, such as our toy books dataset, MovieLens, Jester or another dataset of your choosing. Implement at least two of these recommendation algorithms:
1. Content-Based Filtering
2. User-User Collaborative Filtering
3. Item-Item Collaborative Filtering
# Loading libraries
library(knitr)
library(kableExtra)
library(tidyverse)
library(recommenderlab)
Although there are many datasets available for recommender systems, for the sake of simplicity I will use the MovieLens dataset. This data was collected through the MovieLens web site (movielens.umn.edu) during the seven-month period from September 19th, 1997 through April 22nd, 1998. The dataset contains about 100,000 ratings (1-5) from 943 users on 1664 movies. Movie metadata is also provided in MovieLenseMeta. It is a built-in dataset in the recommenderlab library.
# Loading the MovieLense dataset
set.seed(233)
data("MovieLense")
# Extracting the rating matrix into a separate object
movies <- MovieLense@data
# Data structure
dim(movies)
## [1] 943 1664
The MovieLense dataset contains ratings from 943 users on 1664 movies.
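Beyond the raw dimensions, a quick numeric summary of the actual ratings can be helpful. Below is a minimal sketch using recommenderlab's getRatings(), which returns only the non-missing ratings.
# Summary of the observed (non-missing) ratings
summary(getRatings(MovieLense))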
Let’s explore the distribution of users’ ratings via a bar chart and a heatmap.
# Creating a bar chart of ratings (zeros denote unrated entries and are dropped)
movies %>%
  as.vector() %>%
  tibble(value = .) %>%
  filter(value != 0) %>%
  ggplot() +
  geom_bar(aes(value)) +
  labs(title = "Movie Ratings by users", x = "Ratings", y = "Numbers") +
  theme_classic()
Most of the ratings are around 4, followed by 3. Now let’s create a heatmap to get an overall picture of the dataset.
# Creating heatmap for the entire dataset
image(movies, main="Heatmap - Ratings")
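The heatmap of the full 943 x 1664 matrix is very dense, so as a variation it can help to look only at the most active users and the most rated movies. Below is a small sketch; the cutoffs of 400 are arbitrary choices for illustration.
# Heatmap restricted to heavy raters and frequently rated movies
image(MovieLense[rowCounts(MovieLense) > 400, colCounts(MovieLense) > 400],
      main = "Heatmap - most active users and most rated movies")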
# Counting how many users rated each movie
table <- data.frame(
  movie = names(colCounts(MovieLense)),
  views = colCounts(MovieLense)
)
# First few movies and their view counts
table %>% head() %>% kable() %>% kable_styling()
movie | views |
---|---|
Toy Story (1995) | 452 |
GoldenEye (1995) | 131 |
Four Rooms (1995) | 90 |
Get Shorty (1995) | 209 |
Copycat (1995) | 86 |
Shanghai Triad (Yao a yao yao dao waipo qiao) (1995) | 26 |
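The table above simply shows the first six movies in the rating matrix. To see which titles actually have the most views, the counts could be sorted first, for example:
# Sorting the view counts to get the most-viewed movies
table %>% arrange(desc(views)) %>% head() %>% kable() %>% kable_styling()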
# Bar chart of views for the first ten movies
ggplot(table[1:10, ], aes(x = reorder(movie, views), y = views)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Views for the first ten movies", x = "Movies", y = "Views")
Data preparation is a significant step in creating accurate predictions and recommendations. According to Gorakala and Usuelli (2015), movies that have been viewed only a few times and users who have rated only a few movies introduce bias into the results. To avoid this bias, I will keep only users who have rated more than 50 movies and movies that have been watched more than 100 times. Normalization is also required, but I will not do it manually because the Recommender function does it by itself.
# Selecting users with more than 50 ratings and movies with more than 100 views
movies_s <- MovieLense[rowCounts(MovieLense) > 50, colCounts(MovieLense) > 100]
# Creating train and test dataset
which_movies <- sample(x=c(TRUE, FALSE), size= nrow(movies_s), replace = TRUE, prob=c(0.8, 0.2))
movies_train <- movies_s[which_movies, ] # Training
movies_test <- movies_s[!which_movies, ] # Testing
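As mentioned above, I rely on the Recommender function to normalize the ratings internally. For reference, below is a minimal sketch of how normalization could be applied explicitly with recommenderlab's normalize() function (centering is the default; "Z-score" is the other built-in option).
# Explicit normalization (not used further, since Recommender normalizes internally)
movies_norm <- normalize(movies_s, method = "center")
image(movies_norm[1:50, 1:50], main = "Heatmap - normalized ratings (first 50 users and movies)")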
Item-based collaborative filtering is a type of recommender system in which similarities between items are calculated to make recommendations. Let’s create a similarity matrix to see the similarity between items. I will create the similarity index for the first four movies.
# Create similarity index for the first four movies
sim_item <- similarity(MovieLense[, 1:4], method="cosine", which="items")
image(as.matrix(sim_item), main="Item similarity")
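To make the cosine method concrete, the same quantity can be approximated by hand for a pair of movies. Below is a rough sketch that treats unrated entries as zero (similarity() handles missing values differently, so the number will not match exactly).
# Manual cosine similarity between the first two movies (missing ratings treated as 0)
m <- as(MovieLense@data[, 1:2], "matrix")
sum(m[, 1] * m[, 2]) / (sqrt(sum(m[, 1]^2)) * sqrt(sum(m[, 2]^2)))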
# Creating recommender system - item based collaborative filtering
item_model <- Recommender(movies_train, method="IBCF", parameter= list(k=30))
model_detail <- getModel(item_model)
# Creating heatmap for few rows and columns
image(model_detail$sim[1:20, 1:20], main="Heatmap for few rows and columns")
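Because IBCF keeps only the k most similar items for each item, the stored similarity matrix should be sparse. Below is a quick sanity check, assuming the fitted model exposes k and the pruned similarity matrix as in current recommenderlab versions.
# Each item should retain at most k = 30 non-zero similarities
model_detail$k
summary(rowSums(model_detail$sim > 0))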
# Creating top-10 recommendations for the users in the test set
train_pred <- predict(object = item_model, newdata = movies_test, n = 10)
train_matrix <- sapply(train_pred@items, function(x) { colnames(movies_s)[x] })
train_matrix[, 1:5] %>% kable() %>% kable_styling()
User 1 | User 5 | User 13 | User 14 | User 16 |
---|---|---|---|---|
Heat (1995) | Army of Darkness (1993) | Lost World: Jurassic Park, The (1997) | Phenomenon (1996) | Taxi Driver (1976) |
In & Out (1997) | Godfather: Part II, The (1974) | Phenomenon (1996) | Henry V (1989) | Birdcage, The (1996) |
English Patient, The (1996) | In the Line of Fire (1993) | Firm, The (1993) | Sling Blade (1996) | Ed Wood (1994) |
Ransom (1996) | Ben-Hur (1959) | Frighteners, The (1996) | River Wild, The (1994) | Mask, The (1994) |
Ghost (1990) | Rock, The (1996) | Multiplicity (1996) | Maltese Falcon, The (1941) | Truth About Cats & Dogs, The (1996) |
Hunchback of Notre Dame, The (1996) | African Queen, The (1951) | Time to Kill, A (1996) | Lawrence of Arabia (1962) | Swingers (1996) |
Fried Green Tomatoes (1991) | Conan the Barbarian (1981) | Men in Black (1997) | Great Escape, The (1963) | Return of the Jedi (1983) |
E.T. the Extra-Terrestrial (1982) | Outbreak (1995) | Tin Cup (1996) | Philadelphia (1993) | Blues Brothers, The (1980) |
Scream 2 (1997) | Interview with the Vampire (1994) | Star Trek: Generations (1994) | Fifth Element, The (1997) | Raging Bull (1980) |
Pinocchio (1940) | Absolute Power (1997) | Crash (1996) | Batman Returns (1992) | Dead Poets Society (1989) |
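The same recommendations can also be pulled directly out of the topNList object, which avoids indexing into movies_s by hand. A short sketch:
# Coercing the topNList into a plain list of recommended titles per test user
pred_list <- as(train_pred, "list")
pred_list[1:2]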
User-based collaborative filtering is a recommender system that measures similarity between users. Below, I will create a similarity matrix and then build a model that predicts movies based on similarity among users.
# Creating similarity index and image
sim_users <- similarity(MovieLense[1:4, ], method="pearson", which="users")
image(as.matrix(sim_users), main="Similarity among users")
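Analogous to the cosine example for items, the Pearson method can be illustrated by hand for a pair of users, using only the movies that both of them rated. Below is a rough sketch (similarity() normalizes slightly differently, so the value may deviate).
# Manual Pearson correlation between the first two users over co-rated movies
u <- as(MovieLense@data[1:2, ], "matrix")
both <- u[1, ] > 0 & u[2, ] > 0
cor(u[1, both], u[2, both], method = "pearson")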
# Creating a recommender model - user based collaborative filtering
# (UBCF takes `nn`, the number of nearest neighbours, rather than `k`)
sim_model2 <- Recommender(movies_train, method="UBCF", parameter=list(nn=25))
model_detail2 <- getModel(sim_model2)
# Calculating top-10 predictions for the test users
train_ub_pred <- predict(object = sim_model2, newdata = movies_test, n = 10)
train_ub_matrix <- sapply(train_ub_pred@items, function(x) { colnames(movies_s)[x] })
train_ub_matrix[, 1:5] %>% kable() %>% kable_styling()
User 1 | User 5 | User 13 | User 14 | User 16 |
---|---|---|---|---|
Trainspotting (1996) | Good Will Hunting (1997) | Citizen Kane (1941) | Contact (1997) | Good Will Hunting (1997) |
Secrets & Lies (1996) | L.A. Confidential (1997) | Chasing Amy (1997) | Babe (1995) | Titanic (1997) |
Close Shave, A (1995) | Titanic (1997) | Gone with the Wind (1939) | Lawrence of Arabia (1962) | Amistad (1997) |
Leaving Las Vegas (1995) | Amistad (1997) | Wrong Trousers, The (1993) | Forrest Gump (1994) | Star Wars (1977) |
Magnificent Seven, The (1954) | As Good As It Gets (1997) | Fried Green Tomatoes (1991) | As Good As It Gets (1997) | Star Trek: First Contact (1996) |
L.A. Confidential (1997) | Godfather, The (1972) | Donnie Brasco (1997) | Murder at 1600 (1997) | Face/Off (1997) |
Titanic (1997) | Secrets & Lies (1996) | Reservoir Dogs (1992) | North by Northwest (1959) | Trainspotting (1996) |
Butch Cassidy and the Sundance Kid (1969) | Schindler’s List (1993) | Unforgiven (1992) | G.I. Jane (1997) | Mighty Aphrodite (1995) |
Donnie Brasco (1997) | Apt Pupil (1998) | Mr. Smith Goes to Washington (1939) | Apocalypse Now (1979) | Princess Bride, The (1987) |
Boot, Das (1981) | Primal Fear (1996) | It’s a Wonderful Life (1946) | Schindler’s List (1993) | Good, The Bad and The Ugly, The (1966) |
Both user-based and item-based collaborative filtering are popular recommender systems that are widely used by companies such as Netflix, Amazon, and YouTube; they help not only the company but also the customers by surfacing relevant products that a customer may like. One approach is based on user profiles and the other on item profiles, and both techniques are very useful. Data cleaning is an important step; if it is not done correctly, the entire prediction may be inaccurate. Luckily, the Recommender function takes care of normalizing the data. Once the data cleaning process is complete, the predictions can be made.
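As a possible next step, the two approaches could be compared quantitatively. Below is a rough sketch using recommenderlab's evaluation framework; the split, given and goodRating values are illustrative choices, not tuned.
# Hold-out evaluation comparing IBCF and UBCF on rating prediction error
scheme <- evaluationScheme(movies_s, method = "split", train = 0.8,
                           given = 15, goodRating = 4)
ibcf <- Recommender(getData(scheme, "train"), method = "IBCF")
ubcf <- Recommender(getData(scheme, "train"), method = "UBCF")
ibcf_pred <- predict(ibcf, getData(scheme, "known"), type = "ratings")
ubcf_pred <- predict(ubcf, getData(scheme, "known"), type = "ratings")
rbind(IBCF = calcPredictionAccuracy(ibcf_pred, getData(scheme, "unknown")),
      UBCF = calcPredictionAccuracy(ubcf_pred, getData(scheme, "unknown")))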