This report is part of the capstone project of HarvardX's Data Science Professional Certificate [1] program.
The R Markdown code used to generate this report and the PDF version are available on GitHub.
The HTML version is available on RPubs.
Recommendation systems play an important role in e-commerce and online streaming services such as Netflix, YouTube and Amazon. Making the right recommendation for the next product, song or movie increases user retention and satisfaction, leading to sales and profit growth. Companies competing for customer loyalty invest in systems that capture and analyze their users' preferences, and offer products or services with a higher likelihood of purchase.
The economic impact of this company-customer relationship is clear: Amazon is the largest online retailer by sales, and part of its success comes from its recommendation system and from marketing based on user preferences. In 2006, Netflix offered a one million dollar prize [2] to the person or group that could improve their recommendation system by at least 10%.
Recommendation systems are usually based on a rating scale from 1 to 5 stars, with 1 indicating the lowest satisfaction and 5 the highest. Other indicators can also be used as predictors, such as comments posted about previously used items; videos, music or links shared with friends; the percentage of a movie watched or of a song listened to; web pages visited and the time spent on each page; product categories; and any other interaction with the company's web site or application.
The primary goal of recommendation systems is to help users find what they want based on their preferences and previous interactions, and to predict the rating they would give to a new item. In this document, we create a movie recommendation system using the MovieLens dataset, applying the lessons learned during the HarvardX Data Science Professional Certificate [3] program.
This document is structured as follows. Chapter 1 describes the dataset and summarizes the goal of the project and key steps that were performed. In chapter 2 we explain the process and techniques used, such as data cleaning, data exploration and visualization, any insights gained, and the modeling approach. In chapter 3 we present the modeling results and discuss the model performance. We conclude in chapter 4 with a brief summary of the report, its limitations and future work.
GroupLens [4] is a research lab at the University of Minnesota that has collected and made available rating data for movies on the MovieLens web site [5].
The complete MovieLens dataset [6] consists of 27 million ratings of 58,000 movies by 280,000 users. The research presented in this report is based on a subset [7] of this dataset, with 10 million ratings of 10,000 movies by 72,000 users.
The evaluation of machine learning algorithms consists of comparing the predicted values with the actual outcomes. A loss function measures the difference between the two. There are other metrics, but they go beyond the scope of this document.
The most common loss functions in machine learning are the mean absolute error (MAE), mean squared error (MSE) and root mean squared error (RMSE).
Regardless of the loss function, when the predicted ratings match the observed ratings exactly, the error is zero and the algorithm is perfect.
The goal of this project is to create a recommendation system with RMSE lower than 0.8649. There’s no specific target for MSE and MAE.
The mean absolute error is the average of the absolute differences between the predicted values and the true values. The metric is linear, meaning that all errors are weighted equally. Thus, when predicting ratings on a scale of 1 to 5, the MAE treats an error of 2 as exactly twice as bad as an error of 1.
The MAE is given by this formula:
\[MAE=\frac{1}{N}\sum_{i=1}^{N}|\hat y_i - y_i|\] where \(N\) is the number of observations, \(\hat y_i\) is the predicted value and \(y_i\) is the true value.
The MSE is the average of the squared errors of the predictions. Larger errors are weighted more heavily than smaller ones: if the error doubles, the MSE quadruples. Conversely, errors smaller than 1 shrink; for example, an error of 0.5 contributes an MSE of only 0.25.
\[MSE=\frac{1}{N}\sum_{i=1}^{N}(\hat y_i-y_i)^2\]
The Root Mean Squared Error, RMSE, is the square root of the MSE. It is the typical metric to evaluate recommendation systems, and is defined by the formula:
\[RMSE=\sqrt{\frac{1}{N}\sum_{u,i}(\hat{y}_{u,i}-y_{u,i})^2}\]
where \(N\) is the number of ratings, \(y_{u,i}\) is the rating of movie \(i\) by user \(u\) and \(\hat{y}_{u,i}\) is the prediction of movie \(i\) by user \(u\).
Like the MSE, the RMSE penalizes large deviations and is appropriate when large errors are particularly undesirable. Unlike the MSE, the error has the same unit as the measurement.
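To see how these metrics weight errors differently, consider a small sketch in R with toy numbers (illustrative values, not from the MovieLens data): two errors of 1 and a single error of 2 produce the same MAE, but the RMSE penalizes the concentrated error.

# Two error profiles with the same total absolute error
errors_even <- c(1, 1)  # two errors of 1
errors_spike <- c(2, 0) # one error of 2, one perfect prediction
mean(abs(errors_even))      # MAE = 1
mean(abs(errors_spike))     # MAE = 1
sqrt(mean(errors_even^2))   # RMSE = 1
sqrt(mean(errors_spike^2))  # RMSE = 1.41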
The main steps of this data science project are described below.
First, we download the dataset from the MovieLens website and split it into two subsets, one for training and one for validation. The training subset is called edx and the validation subset is called validation. The edx set is then split again into separate training and testing subsets. When the model reaches the RMSE target on the testing set, we retrain the model on the entire edx set and use the validation set for the final validation. We treat the validation set as new data with unknown outcomes.
In the next step, we create charts, tables and summary statistics to understand how the features can affect the outcome. The information and insights obtained during exploration help to build the machine learning model.
Creating a recommendation system involves identifying the features that best help to predict the rating any given user will give to any movie. We start by building a very simple model, which is just the mean of the observed ratings. Then, user and movie effects are included in a linear model, improving the RMSE. Finally, the user and movie effects receive a regularization parameter that penalizes movies and users with few ratings.
Although the linear model with regularization achieves the desired RMSE, matrix factorization using the LIBMF [8] algorithm is also evaluated and provides an even better prediction. LIBMF is available through the R package recosystem [9].
In this section, we download the dataset and prepare it for analysis. We split the dataset into two parts: the training set, called edx, and the evaluation set, called validation, with 90% and 10% of the original dataset respectively.
Then, we split the edx set into two parts: the train set and the test set, with 90% and 10% of the edx set respectively. The model is built and trained on the train set and evaluated on the test set until the RMSE target is achieved; then we train the model again on the entire edx set and validate it on the validation set. This hold-out procedure simulates how the model would perform on new data.
if(!require(tidyverse))
install.packages("tidyverse", repos = "http://cran.us.r-project.org")
if(!require(caret))
install.packages("caret", repos = "http://cran.us.r-project.org")
if(!require(data.table))
install.packages("data.table", repos = "http://cran.us.r-project.org")
# MovieLens 10M dataset:
# https://grouplens.org/datasets/movielens/10m/
# http://files.grouplens.org/datasets/movielens/ml-10m.zip
dl <- tempfile()
download.file("http://files.grouplens.org/datasets/movielens/ml-10m.zip", dl)
ratings <- fread(text = gsub("::", "\t",
readLines(unzip(dl, "ml-10M100K/ratings.dat"))),
col.names = c("userId", "movieId", "rating", "timestamp"))
movies <- str_split_fixed(readLines(unzip(dl, "ml-10M100K/movies.dat")), "\\::", 3)
colnames(movies) <- c("movieId", "title", "genres")
movies <- as.data.frame(movies) %>%
mutate(movieId = as.numeric(levels(movieId))[movieId],
title = as.character(title),
genres = as.character(genres))
movielens <- left_join(ratings, movies, by = "movieId")
# 'Validation' set will be 10% of MovieLens data
set.seed(1, sample.kind="Rounding")
test_index <- createDataPartition(y = movielens$rating,
times = 1, p = 0.1, list = FALSE)
edx <- movielens[-test_index,]
temp <- movielens[test_index,]
# Make sure userId and movieId in 'validation' set are also in 'edx' set
validation <- temp %>%
semi_join(edx, by = "movieId") %>%
semi_join(edx, by = "userId")
# Add rows removed from 'validation' set back into 'edx' set
removed <- anti_join(temp, validation)
edx <- rbind(edx, removed)
rm(dl, ratings, movies, test_index, temp, movielens, removed)
The edx set is used for training and testing, while the validation set is held out for the final validation, simulating new data.
Here, we split the edx set into two parts: the training set and the test set.
The model is built on the training set and evaluated on the test set. When the model is complete, we use the validation set to calculate the final RMSE.
We use the same procedure used to create edx and validation sets.
The training set will be 90% of edx data and the test set will be the remaining 10%.
set.seed(1, sample.kind="Rounding")
test_index <- createDataPartition(y = edx$rating, times = 1, p = 0.1, list = FALSE)
train_set <- edx[-test_index,]
temp <- edx[test_index,]
# Make sure userId and movieId in test set are also in train set
test_set <- temp %>%
semi_join(train_set, by = "movieId") %>%
semi_join(train_set, by = "userId")
# Add rows removed from test set back into train set
removed <- anti_join(temp, test_set)
train_set <- rbind(train_set, removed)
rm(test_index, temp, removed)
Before building the model, we need to understand the structure of the data, the distribution of ratings and the relationships among the predictors. This information helps to build a better model.
str(edx)
## 'data.frame': 9000055 obs. of 6 variables:
## $ userId : int 1 1 1 1 1 1 1 1 1 1 ...
## $ movieId : num 122 185 292 316 329 355 356 362 364 370 ...
## $ rating : num 5 5 5 5 5 5 5 5 5 5 ...
## $ timestamp: int 838985046 838983525 838983421 838983392 838983392 838984474 838983653 838984885 838983707 838984596 ...
## $ title : chr "Boomerang (1992)" "Net, The (1995)" "Outbreak (1995)" "Stargate (1994)" ...
## $ genres : chr "Comedy|Romance" "Action|Crime|Thriller" "Action|Drama|Sci-Fi|Thriller" "Action|Adventure|Sci-Fi" ...
From this initial exploration, we discover that edx has 6 columns:

| Column | Class |
|---|---|
| userId | integer |
| movieId | numeric |
| rating | numeric |
| timestamp | integer |
| title | character |
| genres | character |
How many rows and columns are there in the edx dataset?
dim(edx)
## [1] 9000055 6
The next table shows the structure and content of the edx dataset.
The dataset is in tidy format, i.e. each row holds one observation and the column names are the features. The rating column is the desired outcome. The user information is stored in userId; the movie information is in both the movieId and title columns. The rating date is stored in timestamp, measured in seconds since January 1st, 1970. Each movie is tagged with one or more genres in the genres column.
head(edx)
|   | userId | movieId | rating | timestamp | title | genres |
|---|---|---|---|---|---|---|
| 1 | 1 | 122 | 5 | 838985046 | Boomerang (1992) | Comedy|Romance |
| 2 | 1 | 185 | 5 | 838983525 | Net, The (1995) | Action|Crime|Thriller |
| 4 | 1 | 292 | 5 | 838983421 | Outbreak (1995) | Action|Drama|Sci-Fi|Thriller |
| 5 | 1 | 316 | 5 | 838983392 | Stargate (1994) | Action|Adventure|Sci-Fi |
| 6 | 1 | 329 | 5 | 838983392 | Star Trek: Generations (1994) | Action|Adventure|Drama|Sci-Fi |
| 7 | 1 | 355 | 5 | 838984474 | Flintstones, The (1994) | Children|Comedy|Fantasy |
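As an illustration of the timestamp encoding, the first timestamp above converts to a readable date with lubridate (the same conversion used in the exploration code later in this report):

# Seconds since the Unix epoch to a date-time
lubridate::as_datetime(838985046, origin = "1970-01-01")
## [1] "1996-08-02 11:24:06 UTC"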
The next sections explore each feature and the outcome in more detail.
Along with the movie title, MovieLens provides the list of genres for each movie. Although this information could be used to make better predictions, this research doesn't use it. Still, it's worth exploring.
The data set contains 797 different combinations of genres. Here is the list of the first six.
edx %>% group_by(genres) %>%
summarise(n=n()) %>%
head()
| genres | n |
|---|---|
| (no genres listed) | 7 |
| Action | 24482 |
| Action|Adventure | 68688 |
| Action|Adventure|Animation|Children|Comedy | 7467 |
| Action|Adventure|Animation|Children|Comedy|Fantasy | 187 |
| Action|Adventure|Animation|Children|Comedy|IMAX | 66 |
The table above shows that several movies are classified in more than one genre. The table below counts the "|" separators in each genre combination (a count of 7 means 8 genres), sorted in descending order.
tibble(count = str_count(edx$genres, fixed("|")), genres = edx$genres) %>%
group_by(count, genres) %>%
summarise(n = n()) %>%
arrange(-count) %>%
head()
| count | genres | n |
|---|---|---|
| 7 | Action|Adventure|Comedy|Drama|Fantasy|Horror|Sci-Fi|Thriller | 256 |
| 6 | Adventure|Animation|Children|Comedy|Crime|Fantasy|Mystery | 10975 |
| 6 | Adventure|Animation|Children|Comedy|Drama|Fantasy|Mystery | 355 |
| 6 | Adventure|Animation|Children|Comedy|Fantasy|Musical|Romance | 515 |
| 5 | Action|Adventure|Animation|Children|Comedy|Fantasy | 187 |
| 5 | Action|Adventure|Animation|Children|Comedy|IMAX | 66 |
The ratings were collected over a period of almost 14 years.
library(lubridate)
tibble(`Initial Date` = date(as_datetime(min(edx$timestamp), origin="1970-01-01")),
`Final Date` = date(as_datetime(max(edx$timestamp), origin="1970-01-01"))) %>%
mutate(Period = duration(max(edx$timestamp)-min(edx$timestamp)))
| Initial Date | Final Date | Period |
|---|---|---|
| 1995-01-09 | 2009-01-05 | 441479727s (~13.99 years) |
if(!require(ggthemes))
install.packages("ggthemes", repos = "http://cran.us.r-project.org")
if(!require(scales))
install.packages("scales", repos = "http://cran.us.r-project.org")
edx %>% mutate(year = year(as_datetime(timestamp, origin="1970-01-01"))) %>%
ggplot(aes(x=year)) +
geom_histogram(color = "white") +
ggtitle("Rating Distribution Per Year") +
xlab("Year") +
ylab("Number of Ratings") +
scale_y_continuous(labels = comma) +
theme_economist()
Rating distribution per year.
The following table lists the days with the most ratings. Not surprisingly, the movies rated on those days are well-known blockbusters.
edx %>% mutate(date = date(as_datetime(timestamp, origin="1970-01-01"))) %>%
group_by(date, title) %>%
summarise(count = n()) %>%
arrange(-count) %>%
head(10)
| date | title | count |
|---|---|---|
| 1998-05-22 | Chasing Amy (1997) | 322 |
| 2000-11-20 | American Beauty (1999) | 277 |
| 1999-12-11 | Star Wars: Episode IV - A New Hope (a.k.a. Star Wars) (1977) | 254 |
| 1999-12-11 | Star Wars: Episode V - The Empire Strikes Back (1980) | 251 |
| 1999-12-11 | Star Wars: Episode VI - Return of the Jedi (1983) | 241 |
| 2005-03-22 | Lord of the Rings: The Two Towers, The (2002) | 239 |
| 2005-03-22 | Lord of the Rings: The Fellowship of the Ring, The (2001) | 227 |
| 2000-11-20 | Terminator 2: Judgment Day (1991) | 221 |
| 1999-12-11 | Matrix, The (1999) | 210 |
| 2000-11-20 | Jurassic Park (1993) | 201 |
Users can choose a rating from 0.5 to 5.0, for a total of 10 possible values. This is an unusual scale, and most movies receive whole-star ratings, as shown in the chart below.
Count the number of each rating:
edx %>% group_by(rating) %>% summarize(n=n())
| rating | n |
|---|---|
| 0.5 | 85374 |
| 1.0 | 345679 |
| 1.5 | 106426 |
| 2.0 | 711422 |
| 2.5 | 333010 |
| 3.0 | 2121240 |
| 3.5 | 791624 |
| 4.0 | 2588430 |
| 4.5 | 526736 |
| 5.0 | 1390114 |
The chart below shows how the ratings in edx are distributed across these values.
edx %>% group_by(rating) %>%
summarise(count=n()) %>%
ggplot(aes(x=rating, y=count)) +
geom_line() +
geom_point() +
scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x),
labels = trans_format("log10", math_format(10^.x))) +
ggtitle("Rating Distribution", subtitle = "Higher ratings are prevalent.") +
xlab("Rating") +
ylab("Count") +
theme_economist()
Whole-star values receive more ratings than half-star values.
There are 10677 different movies in the edx set. Intuitively, some of them are rated more often than others: many movies are watched by only a few users, while blockbusters tend to receive many ratings.
edx %>% group_by(movieId) %>%
summarise(n=n()) %>%
ggplot(aes(n)) +
geom_histogram(color = "white") +
scale_x_log10() +
ggtitle("Distribution of Movies",
subtitle = "The distribution is almost symetric.") +
xlab("Number of Ratings") +
ylab("Number of Movies") +
theme_economist()
10.7% of movies have fewer than 10 ratings.
There are 69878 different users in the edx set.
The majority of users rate few movies, while a few users rate more than a thousand.
5% of users rated fewer than 20 movies.
edx %>% group_by(userId) %>%
summarise(n=n()) %>%
arrange(n) %>%
head()
| userId | n |
|---|---|
| 62516 | 10 |
| 22170 | 12 |
| 15719 | 13 |
| 50608 | 13 |
| 901 | 14 |
| 1833 | 14 |
The user distribution is right skewed.
edx %>% group_by(userId) %>%
summarise(n=n()) %>%
ggplot(aes(n)) +
geom_histogram(color = "white") +
scale_x_log10() +
ggtitle("Distribution of Users",
subtitle="The distribution is right skewed.") +
xlab("Number of Ratings") +
ylab("Number of Users") +
scale_y_continuous(labels = comma) +
theme_economist()
5% of users rated fewer than 20 movies.
The heatmap below shows a random sample of 100 users and 100 movies from the user-movie matrix.
This matrix is sparse, with the vast majority of cells empty. Notice that a few movies have many more ratings, and that one or two users are more active than the rest.
users <- sample(unique(edx$userId), 100)
edx %>% filter(userId %in% users) %>%
select(userId, movieId, rating) %>%
mutate(rating = 1) %>%
spread(movieId, rating) %>%
select(sample(ncol(.), 100)) %>%
as.matrix() %>% t(.) %>%
image(1:100, 1:100, ., xlab = "Movies", ylab = "Users")
abline(h=0:100+0.5, v=0:100+0.5, col = "grey")
title("User x Movie Matrix")
As previously discussed, several features could be used to predict the rating for a given user. However, more predictors increase model complexity and require more computing resources, so in this research the estimated rating uses only the movie and user information.
train_set <- train_set %>% select(userId, movieId, rating, title)
test_set <- test_set %>% select(userId, movieId, rating, title)
A very simple model is to randomly predict the rating using the probability distribution observed during data exploration. For example, if we know that the probability of a rating of 3 is 10%, then we may guess that 10% of new ratings will be 3.
Such a prediction sets the worst error we should expect, so any other model must provide a better result.
The simplest model predicts that all users give the same rating to all movies and treats the movie-to-movie variation as randomly distributed error. Although the predicted rating could be any value, statistical theory says that the average minimizes the RMSE, so the initial prediction is just the average of all observed ratings, as described in this formula:
\[\hat Y_{u,i}=\mu+\epsilon_{i,u}\]
where \(\hat Y\) is the predicted rating, \(\mu\) is the mean of the observed ratings and \(\epsilon_{i,u}\) is the error term. Any value other than the mean increases the RMSE, so this is a good initial estimate.
Part of the movie-to-movie variability can be explained by the fact that different movies have different rating distributions. This is intuitive: some movies are more popular than others and public preference varies. This is called the movie effect, or movie bias, and is expressed as \(b_i\) in this formula:
\[\hat Y_{u,i}=\mu + b_i + \epsilon_{i,u}\]

The movie effect can be calculated as the mean of the differences between the observed ratings \(y_{u,i}\) and the overall mean \(\hat \mu\), taken over the \(n_i\) ratings of movie \(i\):

\[\hat b_i=\frac{1}{n_i}\sum_{u=1}^{n_i}(y_{u,i}-\hat \mu)\]

Similar to the movie effect, different users have different rating patterns. For example, some users like most movies and consistently rate them 4 or 5, while others dislike most movies and rate them 1 or 2. This is called the user effect, or user bias, and is expressed in this formula:

\[\hat b_u=\frac{1}{n_u}\sum_{i=1}^{n_u}(y_{u,i}-\hat b_i-\hat \mu)\]

The prediction model that includes the user effect becomes:

\[\hat Y_{u,i}=\mu+b_i+b_u+\epsilon_{u,i}\]

Movies can also be grouped into categories or genres with different rating distributions; in general, movies of the same genre receive similar ratings. This project does not evaluate the genre effect.
The linear model provides a good estimate of the ratings, but it doesn't account for the fact that many movies have very few ratings and that some users rate very few movies. For these movies and users the sample size is very small, which statistically leads to large estimation errors.
The estimates can be improved by adding a penalty factor that shrinks estimates based on small sample sizes and has little or no impact otherwise. The regularized movie and user effects are calculated with these formulas:
\[\hat b_i=\frac{1}{n_i+\lambda}\sum_{u=1}^{n_i}(y_{u,i}-\hat \mu)\]
\[\hat b_u=\frac{1}{n_u+\lambda}\sum_{i=1}^{n_u}( y_{u,i}-\hat b_i-\hat \mu)\]
For sample sizes \(n_i\) or \(n_u\) smaller than or comparable to \(\lambda\), the estimates \(\hat b_i\) and \(\hat b_u\) shrink toward zero, whereas for sample sizes much larger than \(\lambda\) they change very little.
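A quick numeric illustration of the shrinkage (toy numbers; \(\lambda=5\) is an assumed value, not the tuned one): a movie rated only twice, both times 1.5 stars above the mean, is shrunk heavily, while a movie with 200 such ratings barely changes.

# Toy illustration of regularization shrinkage (lambda = 5 assumed)
lambda <- 5
(2 * 1.5) / (2 + lambda)     # movie with 2 ratings: 0.43, pulled toward zero
(200 * 1.5) / (200 + lambda) # movie with 200 ratings: 1.46, close to 1.5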
An effective method to choose \(\lambda\) that minimizes the RMSE is running simulations with several values of \(\lambda\).
Matrix factorization is a widely used machine learning technique for predicting ratings in recommendation systems. The method became widely known during the Netflix Prize challenge [10].
The data can be arranged in a matrix with one user per row, one movie per column and the rating in each cell; the algorithm then attempts to fill in the missing cells. The table below provides a simple example of a \(4\times 5\) matrix.
|   | movie 1 | movie 2 | movie 3 | movie 4 | movie 5 |
|---|---|---|---|---|---|
| user 1 | ? | ? | 4 | ? | 3 |
| user 2 | 2 | ? | ? | 4 | ? |
| user 3 | ? | 3 | ? | ? | 5 |
| user 4 | 3 | ? | 2 | ? | ? |
The idea is to approximate a large rating matrix \(R_{m\times n}\) by the product of two lower-dimensional matrices \(P_{k\times m}\) and \(Q_{k\times n}\), such that
\[R\approx {P}' Q\]
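To make the idea concrete, the sketch below factorizes the toy matrix above by stochastic gradient descent on the observed entries only. This is a minimal illustration with assumed settings (k = 2 factors, learning rate 0.05), not the LIBMF algorithm used later.

# Toy 4x5 rating matrix from the table above (NA = unknown)
R <- matrix(c(NA, NA,  4, NA,  3,
               2, NA, NA,  4, NA,
              NA,  3, NA, NA,  5,
               3, NA,  2, NA, NA), nrow = 4, byrow = TRUE)
k <- 2 # number of latent factors (assumed)
set.seed(42)
P <- matrix(rnorm(k * nrow(R), sd = 0.1), k, nrow(R)) # k x m
Q <- matrix(rnorm(k * ncol(R), sd = 0.1), k, ncol(R)) # k x n
obs <- which(!is.na(R), arr.ind = TRUE) # observed cells only
for (iter in 1:5000) {
  idx <- obs[sample(nrow(obs), 1), ]
  u <- idx[1]; i <- idx[2]
  err <- R[u, i] - sum(P[, u] * Q[, i]) # residual for this cell
  p_old <- P[, u]
  P[, u] <- P[, u] + 0.05 * err * Q[, i] # gradient step for P
  Q[, i] <- Q[, i] + 0.05 * err * p_old  # gradient step for Q
}
round(t(P) %*% Q, 1) # approximates R and fills in the missing cells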
The R package recosystem [11] provides methods to decompose the rating matrix and estimate user ratings, using parallel matrix factorization.
This section presents the code and results of the models.
Here we define the loss functions.
# Define Mean Absolute Error (MAE)
MAE <- function(true_ratings, predicted_ratings){
mean(abs(true_ratings - predicted_ratings))
}
# Define Mean Squared Error (MSE)
MSE <- function(true_ratings, predicted_ratings){
mean((true_ratings - predicted_ratings)^2)
}
# Define Root Mean Squared Error (RMSE)
RMSE <- function(true_ratings, predicted_ratings){
sqrt(mean((true_ratings - predicted_ratings)^2))
}
The first model randomly predicts ratings using the probabilities observed in the training set. First, we estimate the probability of each rating from the training set; then we predict ratings for the test set and compare them with the actual ratings. Any model should perform better than this one.
Since the training set is only a sample of the entire population and we don't know the true distribution of ratings, a Monte Carlo simulation with replacement provides a good approximation of the rating distribution.
set.seed(4321, sample.kind = "Rounding")
# Create the probability of each rating
p <- function(x, y) mean(y == x)
rating <- seq(0.5,5,0.5)
# Estimate the probability of each rating with Monte Carlo simulation
B <- 10^3
M <- replicate(B, {
s <- sample(train_set$rating, 100, replace = TRUE)
sapply(rating, p, y= s)
})
prob <- sapply(1:nrow(M), function(x) mean(M[x,]))
# Predict random ratings
y_hat_random <- sample(rating, size = nrow(test_set),
replace = TRUE, prob = prob)
# Create a table with the error results
result <- tibble(Method = "Project Goal", RMSE = 0.8649, MSE = NA, MAE = NA)
result <- bind_rows(result,
tibble(Method = "Random prediction",
RMSE = RMSE(test_set$rating, y_hat_random),
MSE = MSE(test_set$rating, y_hat_random),
MAE = MAE(test_set$rating, y_hat_random)))
The RMSE of random prediction is very high.
result
| Method | RMSE | MSE | MAE |
|---|---|---|---|
| Project Goal | 0.864900 | NA | NA |
| Random prediction | 1.501245 | 2.253735 | 1.167597 |
We’re building the linear model based on the formula:
\[\hat y = \mu + b_i + b_u + \epsilon_{u,i}\]
The initial prediction is just the mean of the ratings, \(\mu\).
\[\hat y = \mu + \epsilon_{u,i}\]
# Mean of observed values
mu <- mean(train_set$rating)
# Update the error table
result <- bind_rows(result,
tibble(Method = "Mean",
RMSE = RMSE(test_set$rating, mu),
MSE = MSE(test_set$rating, mu),
MAE = MAE(test_set$rating, mu)))
# Show the RMSE improvement
result
| Method | RMSE | MSE | MAE |
|---|---|---|---|
| Project Goal | 0.864900 | NA | NA |
| Random prediction | 1.501245 | 2.253735 | 1.1675974 |
| Mean | 1.060054 | 1.123714 | 0.8551636 |
\(b_i\) is the movie effect (bias) for movie \(i\).
\[\hat y = \mu + b_i + \epsilon_{u,i}\]
# Movie effects (bi)
bi <- train_set %>%
group_by(movieId) %>%
summarize(b_i = mean(rating - mu))
head(bi)
| movieId | b_i |
|---|---|
| 1 | 0.4150040 |
| 2 | -0.3064057 |
| 3 | -0.3613952 |
| 4 | -0.6372808 |
| 5 | -0.4416058 |
| 6 | 0.3018943 |
The distribution of the movie effect is left skewed.
bi %>% ggplot(aes(x = b_i)) +
geom_histogram(bins=10, col = I("black")) +
ggtitle("Movie Effect Distribution") +
xlab("Movie effect") +
ylab("Count") +
scale_y_continuous(labels = comma) +
theme_economist()
Distribution of movie effects.
# Predict the rating with mean + bi
y_hat_bi <- mu + test_set %>%
left_join(bi, by = "movieId") %>%
.$b_i
# Calculate the RMSE
result <- bind_rows(result,
tibble(Method = "Mean + bi",
RMSE = RMSE(test_set$rating, y_hat_bi),
MSE = MSE(test_set$rating, y_hat_bi),
MAE = MAE(test_set$rating, y_hat_bi)))
# Show the RMSE improvement
result
| Method | RMSE | MSE | MAE |
|---|---|---|---|
| Project Goal | 0.8649000 | NA | NA |
| Random prediction | 1.5012445 | 2.2537350 | 1.1675974 |
| Mean | 1.0600537 | 1.1237139 | 0.8551636 |
| Mean + bi | 0.9429615 | 0.8891764 | 0.7370384 |
\(b_u\) is the user effect (bias) for user \(u\).
\[\hat y_{u,i} = \mu + b_i + b_u + \epsilon_{u,i}\]
Predict the rating with mean + bi + bu
# User effect (bu)
bu <- train_set %>%
left_join(bi, by = 'movieId') %>%
group_by(userId) %>%
summarize(b_u = mean(rating - mu - b_i))
# Prediction
y_hat_bi_bu <- test_set %>%
left_join(bi, by='movieId') %>%
left_join(bu, by='userId') %>%
mutate(pred = mu + b_i + b_u) %>%
.$pred
# Update the results table
result <- bind_rows(result,
tibble(Method = "Mean + bi + bu",
RMSE = RMSE(test_set$rating, y_hat_bi_bu),
MSE = MSE(test_set$rating, y_hat_bi_bu),
MAE = MAE(test_set$rating, y_hat_bi_bu)))
# Show the RMSE improvement
result
| Method | RMSE | MSE | MAE |
|---|---|---|---|
| Project Goal | 0.8649000 | NA | NA |
| Random prediction | 1.5012445 | 2.2537350 | 1.1675974 |
| Mean | 1.0600537 | 1.1237139 | 0.8551636 |
| Mean + bi | 0.9429615 | 0.8891764 | 0.7370384 |
| Mean + bi + bu | 0.8646843 | 0.7476789 | 0.6684369 |
The user effect is normally distributed.
train_set %>%
group_by(userId) %>%
filter(n() >= 100) %>%
summarize(b_u = mean(rating)) %>%
ggplot(aes(b_u)) +
geom_histogram(color = "black") +
ggtitle("User Effect Distribution") +
xlab("User Bias") +
ylab("Count") +
scale_y_continuous(labels = comma) +
theme_economist()
User effect is normally distributed.
The RMSE improved over the initial estimate based on the mean. However, we still need to check whether the model makes good rating predictions.
Check the 10 largest residual differences
train_set %>%
left_join(bi, by='movieId') %>%
mutate(residual = rating - (mu + b_i)) %>%
arrange(desc(abs(residual))) %>%
slice(1:10)
| userId | movieId | rating | title | b_i | residual |
|---|---|---|---|---|---|
| 26423 | 6483 | 5.0 | From Justin to Kelly (2003) | -2.638139 | 4.125683 |
| 5279 | 6371 | 5.0 | Pokémon Heroes (2003) | -2.472133 | 3.959677 |
| 57863 | 6371 | 5.0 | Pokémon Heroes (2003) | -2.472133 | 3.959677 |
| 2507 | 318 | 0.5 | Shawshank Redemption, The (1994) | 0.944111 | -3.956567 |
| 7708 | 318 | 0.5 | Shawshank Redemption, The (1994) | 0.944111 | -3.956567 |
| 9214 | 318 | 0.5 | Shawshank Redemption, The (1994) | 0.944111 | -3.956567 |
| 9568 | 318 | 0.5 | Shawshank Redemption, The (1994) | 0.944111 | -3.956567 |
| 9975 | 318 | 0.5 | Shawshank Redemption, The (1994) | 0.944111 | -3.956567 |
| 10749 | 318 | 0.5 | Shawshank Redemption, The (1994) | 0.944111 | -3.956567 |
| 13496 | 318 | 0.5 | Shawshank Redemption, The (1994) | 0.944111 | -3.956567 |
titles <- train_set %>%
select(movieId, title) %>%
distinct()
The movies with the largest \(b_i\), shown below, turn out to be obscure titles:
bi %>%
inner_join(titles, by = "movieId") %>%
arrange(-b_i) %>%
select(title) %>%
head()
| title |
|---|
| Hellhounds on My Trail (1999) |
| Satan’s Tango (Sátántangó) (1994) |
| Shadows of Forgotten Ancestors (1964) |
| Fighting Elegy (Kenka erejii) (1966) |
| Sun Alley (Sonnenallee) (1999) |
| Blue Light, The (Das Blaue Licht) (1932) |
The movies with the smallest \(b_i\) are also obscure titles:
bi %>%
inner_join(titles, by = "movieId") %>%
arrange(b_i) %>%
select(title) %>%
head()
| title |
|---|
| Besotted (2001) |
| Hi-Line, The (1999) |
| Accused (Anklaget) (2005) |
| Confessions of a Superhero (2007) |
| War of the Worlds 2: The Next Wave (2008) |
| SuperBabies: Baby Geniuses 2 (2004) |
These "best" and "worst" movies were rated by very few users. The counts of ratings for the ten movies with the largest \(b_i\) confirm this:
train_set %>% count(movieId) %>%
left_join(bi, by="movieId") %>%
arrange(desc(b_i)) %>%
slice(1:10) %>%
pull(n)
## [1] 1 1 1 1 1 1 4 2 4 4
With only one to four ratings each, these movie effect estimates are noisy, which motivates the regularization applied next.
Now, we regularize the user and movie effects by adding a penalty factor \(\lambda\), which is a tuning parameter. We define a range of values for \(\lambda\) and use a regularization function to pick the value that minimizes the RMSE.
regularization <- function(lambda, trainset, testset){
# Mean
mu <- mean(trainset$rating)
# Movie effect (bi)
b_i <- trainset %>%
group_by(movieId) %>%
summarize(b_i = sum(rating - mu)/(n()+lambda))
# User effect (bu)
b_u <- trainset %>%
left_join(b_i, by="movieId") %>%
filter(!is.na(b_i)) %>%
group_by(userId) %>%
summarize(b_u = sum(rating - b_i - mu)/(n()+lambda))
# Prediction: mu + bi + bu
predicted_ratings <- testset %>%
left_join(b_i, by = "movieId") %>%
left_join(b_u, by = "userId") %>%
filter(!is.na(b_i), !is.na(b_u)) %>%
mutate(pred = mu + b_i + b_u) %>%
pull(pred)
return(RMSE(predicted_ratings, testset$rating))
}
# Define a set of lambdas to tune
lambdas <- seq(0, 10, 0.25)
# Tune lambda
rmses <- sapply(lambdas,
regularization,
trainset = train_set,
testset = test_set)
# Plot the lambda vs RMSE
tibble(Lambda = lambdas, RMSE = rmses) %>%
ggplot(aes(x = Lambda, y = RMSE)) +
geom_point() +
ggtitle("Regularization",
subtitle = "Pick the penalization that gives the lowest RMSE.") +
theme_economist()
Choose the lambda that minimizes RMSE.
Next, we apply the best \(\lambda\) to the linear model.
# We pick the lambda that returns the lowest RMSE.
lambda <- lambdas[which.min(rmses)]
# Then, we calculate the predicted rating using the best parameters
# achieved from regularization.
mu <- mean(train_set$rating)
# Movie effect (bi)
b_i <- train_set %>%
group_by(movieId) %>%
summarize(b_i = sum(rating - mu)/(n()+lambda))
# User effect (bu)
b_u <- train_set %>%
left_join(b_i, by="movieId") %>%
group_by(userId) %>%
summarize(b_u = sum(rating - b_i - mu)/(n()+lambda))
# Prediction
y_hat_reg <- test_set %>%
left_join(b_i, by = "movieId") %>%
left_join(b_u, by = "userId") %>%
mutate(pred = mu + b_i + b_u) %>%
pull(pred)
# Update the result table
result <- bind_rows(result,
tibble(Method = "Regularized bi and bu",
RMSE = RMSE(test_set$rating, y_hat_reg),
MSE = MSE(test_set$rating, y_hat_reg),
MAE = MAE(test_set$rating, y_hat_reg)))
# Regularization made a small improvement in RMSE.
result
| Method | RMSE | MSE | MAE |
|---|---|---|---|
| Project Goal | 0.8649000 | NA | NA |
| Random prediction | 1.5012445 | 2.2537350 | 1.1675974 |
| Mean | 1.0600537 | 1.1237139 | 0.8551636 |
| Mean + bi | 0.9429615 | 0.8891764 | 0.7370384 |
| Mean + bi + bu | 0.8646843 | 0.7476789 | 0.6684369 |
| Regularized bi and bu | 0.8641362 | 0.7467313 | 0.6687070 |
Matrix factorization approximates the large user-movie matrix as the product of two lower-dimensional matrices. The train set is stored in tidy format, with one observation per row, so it must first be converted into a user-movie matrix. This code performs the transformation.
train_data <- train_set %>%
select(userId, movieId, rating) %>%
spread(movieId, rating) %>%
as.matrix()
The code above requires more memory than a commodity laptop can provide, so we use an alternative: the recosystem package, which offers a complete recommendation-system solution based on matrix factorization.
The package vignette [12] describes how to use recosystem:
Usage of recosystem

The usage of recosystem is quite simple, mainly consisting of the following steps:

1. Create a model object by calling Reco().
2. (Optionally) call the $tune() method to select the best tuning parameters from a set of candidate values.
3. Train the model by calling the $train() method. A number of parameters can be set inside the function, possibly coming from the result of $tune().
4. (Optionally) export the model via $output(), i.e. write the factorization matrices \(P\) and \(Q\) into files or return them as R objects.
5. Use the $predict() method to compute predicted values.

if(!require(recosystem))
install.packages("recosystem", repos = "http://cran.us.r-project.org")
set.seed(123, sample.kind = "Rounding") # This is a randomized algorithm
# Convert the train and test sets into recosystem input format
train_data <- with(train_set, data_memory(user_index = userId,
item_index = movieId,
rating = rating))
test_data <- with(test_set, data_memory(user_index = userId,
item_index = movieId,
rating = rating))
# Create the model object
r <- recosystem::Reco()
# Select the best tuning parameters
opts <- r$tune(train_data, opts = list(dim = c(10, 20, 30),
lrate = c(0.1, 0.2),
costp_l2 = c(0.01, 0.1),
costq_l2 = c(0.01, 0.1),
nthread = 4, niter = 10))
# Train the algorithm
r$train(train_data, opts = c(opts$min, nthread = 4, niter = 20))
## iter tr_rmse obj
## 0 0.9825 1.1036e+007
## 1 0.8740 8.9621e+006
## 2 0.8419 8.3303e+006
## 3 0.8208 7.9554e+006
## 4 0.8047 7.6948e+006
## 5 0.7921 7.5024e+006
## 6 0.7817 7.3538e+006
## 7 0.7728 7.2387e+006
## 8 0.7652 7.1393e+006
## 9 0.7587 7.0615e+006
## 10 0.7530 6.9936e+006
## 11 0.7479 6.9347e+006
## 12 0.7435 6.8850e+006
## 13 0.7394 6.8436e+006
## 14 0.7356 6.8037e+006
## 15 0.7322 6.7692e+006
## 16 0.7292 6.7380e+006
## 17 0.7262 6.7104e+006
## 18 0.7235 6.6837e+006
## 19 0.7212 6.6625e+006
# Calculate the predicted values
y_hat_reco <- r$predict(test_data, out_memory())
head(y_hat_reco, 10)
## [1] 4.851838 3.720369 3.274899 2.896342 3.550605 3.879312 3.637041
## [8] 3.427570 4.030577 3.193461
Matrix factorization substantially improved the RMSE.
result <- bind_rows(result,
tibble(Method = "Matrix Factorization - recosystem",
RMSE = RMSE(test_set$rating, y_hat_reco),
MSE = MSE(test_set$rating, y_hat_reco),
MAE = MAE(test_set$rating, y_hat_reco)))
result
| Method | RMSE | MSE | MAE |
|---|---|---|---|
| Project Goal | 0.8649000 | NA | NA |
| Random prediction | 1.5012445 | 2.2537350 | 1.1675974 |
| Mean | 1.0600537 | 1.1237139 | 0.8551636 |
| Mean + bi | 0.9429615 | 0.8891764 | 0.7370384 |
| Mean + bi + bu | 0.8646843 | 0.7476789 | 0.6684369 |
| Regularized bi and bu | 0.8641362 | 0.7467313 | 0.6687070 |
| Matrix Factorization - recosystem | 0.7851291 | 0.6164277 | 0.6047646 |
As the result table shows, both regularization and matrix factorization achieved the target RMSE on the test set. Finally, we train both models on the complete edx set and calculate the RMSE on the validation set. The project goal is achieved if the RMSE stays below the target.
During the training and testing phases, the linear model with regularization achieved the target RMSE by only a small margin. Here we perform the final validation on the validation set.
mu_edx <- mean(edx$rating)
# Movie effect (bi)
b_i_edx <- edx %>%
group_by(movieId) %>%
summarize(b_i = sum(rating - mu_edx)/(n()+lambda))
# User effect (bu)
b_u_edx <- edx %>%
left_join(b_i_edx, by="movieId") %>%
group_by(userId) %>%
summarize(b_u = sum(rating - b_i - mu_edx)/(n()+lambda))
# Prediction
y_hat_edx <- validation %>%
left_join(b_i_edx, by = "movieId") %>%
left_join(b_u_edx, by = "userId") %>%
mutate(pred = mu_edx + b_i + b_u) %>%
pull(pred)
# Update the results table
result <- bind_rows(result,
tibble(Method = "Final Regularization (edx vs validation)",
RMSE = RMSE(validation$rating, y_hat_edx),
MSE = MSE(validation$rating, y_hat_edx),
MAE = MAE(validation$rating, y_hat_edx)))
# Show the RMSE improvement
result
| Method | RMSE | MSE | MAE |
|---|---|---|---|
| Project Goal | 0.8649000 | NA | NA |
| Random prediction | 1.5012445 | 2.2537350 | 1.1675974 |
| Mean | 1.0600537 | 1.1237139 | 0.8551636 |
| Mean + bi | 0.9429615 | 0.8891764 | 0.7370384 |
| Mean + bi + bu | 0.8646843 | 0.7476789 | 0.6684369 |
| Regularized bi and bu | 0.8641362 | 0.7467313 | 0.6687070 |
| Matrix Factorization - recosystem | 0.7851291 | 0.6164277 | 0.6047646 |
| Final Regularization (edx vs validation) | 0.8648177 | 0.7479097 | 0.6693494 |
As expected, the RMSE calculated on the validation set (0.8648177) is lower than the target of 0.8649 and slightly higher than the RMSE of the test set (0.8641362).
Top 10 best movies by predicted rating (a title repeats when several individual predictions rank near the top):
validation %>%
left_join(b_i_edx, by = "movieId") %>%
left_join(b_u_edx, by = "userId") %>%
mutate(pred = mu_edx + b_i + b_u) %>%
arrange(-pred) %>%
group_by(title) %>%
select(title) %>%
head(10)
| title |
|---|
| Usual Suspects, The (1995) |
| Shawshank Redemption, The (1994) |
| Shawshank Redemption, The (1994) |
| Shawshank Redemption, The (1994) |
| Eternal Sunshine of the Spotless Mind (2004) |
| Star Wars: Episode IV - A New Hope (a.k.a. Star Wars) (1977) |
| Schindler’s List (1993) |
| Donnie Darko (2001) |
| Star Wars: Episode VI - Return of the Jedi (1983) |
| Schindler’s List (1993) |
Top 10 worst movies
validation %>%
left_join(b_i_edx, by = "movieId") %>%
left_join(b_u_edx, by = "userId") %>%
mutate(pred = mu_edx + b_i + b_u) %>%
arrange(pred) %>%
group_by(title) %>%
select(title) %>%
head(10)
| title |
|---|
| Battlefield Earth (2000) |
| Police Academy 4: Citizens on Patrol (1987) |
| Karate Kid Part III, The (1989) |
| Pokémon Heroes (2003) |
| Turbo: A Power Rangers Movie (1997) |
| Kazaam (1996) |
| Pokémon Heroes (2003) |
| Free Willy 3: The Rescue (1997) |
| Shanghai Surprise (1986) |
| Steel (1997) |
The initial test shows that matrix factorization gives the best RMSE. Now it’s time to validate with the entire edx and validation sets.
set.seed(1234, sample.kind = "Rounding")
# Convert 'edx' and 'validation' sets to recosystem input format
edx_reco <- with(edx, data_memory(user_index = userId,
item_index = movieId,
rating = rating))
validation_reco <- with(validation, data_memory(user_index = userId,
item_index = movieId,
rating = rating))
# Create the model object
r <- recosystem::Reco()
# Tune the parameters
opts <- r$tune(edx_reco, opts = list(dim = c(10, 20, 30),
lrate = c(0.1, 0.2),
costp_l2 = c(0.01, 0.1),
costq_l2 = c(0.01, 0.1),
nthread = 4, niter = 10))
# Train the model
r$train(edx_reco, opts = c(opts$min, nthread = 4, niter = 20))
## iter tr_rmse obj
## 0 0.9728 1.2026e+007
## 1 0.8723 9.8942e+006
## 2 0.8404 9.1864e+006
## 3 0.8187 8.7792e+006
## 4 0.8026 8.4855e+006
## 5 0.7908 8.2918e+006
## 6 0.7809 8.1354e+006
## 7 0.7725 8.0127e+006
## 8 0.7654 7.9133e+006
## 9 0.7591 7.8295e+006
## 10 0.7538 7.7614e+006
## 11 0.7491 7.7028e+006
## 12 0.7448 7.6521e+006
## 13 0.7410 7.6053e+006
## 14 0.7374 7.5660e+006
## 15 0.7343 7.5314e+006
## 16 0.7314 7.5003e+006
## 17 0.7288 7.4738e+006
## 18 0.7263 7.4475e+006
## 19 0.7242 7.4260e+006
# Calculate the prediction
y_hat_final_reco <- r$predict(validation_reco, out_memory())
# Update the result table
result <- bind_rows(result,
tibble(Method = "Final Matrix Factorization - recosystem",
RMSE = RMSE(validation$rating, y_hat_final_reco),
MSE = MSE(validation$rating, y_hat_final_reco),
MAE = MAE(validation$rating, y_hat_final_reco)))
The final RMSE with matrix factorization is 0.7826974, 9.5% better than the linear model with regularization (0.8648177).
# Show the RMSE improvement
result
| Method | RMSE | MSE | MAE |
|---|---|---|---|
| Project Goal | 0.8649000 | NA | NA |
| Random prediction | 1.5012445 | 2.2537350 | 1.1675974 |
| Mean | 1.0600537 | 1.1237139 | 0.8551636 |
| Mean + bi | 0.9429615 | 0.8891764 | 0.7370384 |
| Mean + bi + bu | 0.8646843 | 0.7476789 | 0.6684369 |
| Regularized bi and bu | 0.8641362 | 0.7467313 | 0.6687070 |
| Matrix Factorization - recosystem | 0.7851291 | 0.6164277 | 0.6047646 |
| Final Regularization (edx vs validation) | 0.8648177 | 0.7479097 | 0.6693494 |
| Final Matrix Factorization - recosystem | 0.7826974 | 0.6126152 | 0.6030689 |
Now, let’s check the best and worst movies predicted with matrix factorization.
Top 10 best movies:
tibble(title = validation$title, rating = y_hat_final_reco) %>%
arrange(-rating) %>%
group_by(title) %>%
select(title) %>%
head(10)
| title |
|---|
| Lord of the Rings: The Return of the King, The (2003) |
| Contempt (Mépris, Le) (1963) |
| Schindler’s List (1993) |
| Shawshank Redemption, The (1994) |
| Shawshank Redemption, The (1994) |
| Shawshank Redemption, The (1994) |
| Star Wars: Episode IV - A New Hope (a.k.a. Star Wars) (1977) |
| Annie Hall (1977) |
| Boys Life 2 (1997) |
| Pulp Fiction (1994) |
Top 10 worst movies:
tibble(title = validation$title, rating = y_hat_final_reco) %>%
arrange(rating) %>%
group_by(title) %>%
select(title) %>%
head(10)
| title |
|---|
| Time Walker (a.k.a. Being From Another Planet) (1982) |
| Dead Girl, The (2006) |
| Giant Gila Monster, The (1959) |
| Phish: Bittersweet Motel (2000) |
| Pokémon Heroes (2003) |
| Bug (2007) |
| Blue Kite, The (Lan feng zheng) (1993) |
| Dirty Love (2005) |
| Mass Transit (1998) |
| Kiss Them for Me (1957) |
We started by collecting and preparing the dataset for analysis, then explored the data, looking for insights that might help during model building.
Next, we created a random model that predicts ratings based on the observed probability distribution of each rating. This model gives the worst result.
The linear model started from a very simple baseline, the mean of the observed ratings. From there, we added movie and user effects, which model movie popularity and user behavior. Regularization then added a penalty for movies and users with few ratings. The linear model achieved an RMSE of 0.8648177, successfully beating the target of 0.8649.
Finally, we evaluated the recosystem package, which implements the LIBMF algorithm, and achieved an RMSE of 0.7826974.
Some machine learning algorithms are too computationally expensive to run on a commodity laptop, so we were unable to test them. Their memory requirements far exceeded what was available, even with increased virtual memory.
Only two predictors are used, the movie and user information; other features are not considered. Modern recommendation systems use many more predictors, such as genres, bookmarks, playlists, etc.
The model works only for existing users, movies and rating values, so the algorithm must be rerun whenever a new user or movie arrives, or when a rating changes. This is not an issue for a small client base and a few movies, but it may become a concern for large datasets. The model should accommodate these changes and update the predictions as the information changes.
There is no initial recommendation for a new user or for users that rarely rate movies. Algorithms that use several features as predictors can overcome this cold-start issue.
This report briefly describes simple models that predict ratings. Two other widely adopted approaches are not discussed here: content-based and collaborative filtering. The recommenderlab package implements these methods and provides an environment to build and test recommendation systems.
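As a pointer for future work, here is a minimal sketch of user-based collaborative filtering with recommenderlab. The sampling, the "UBCF" method name and the data-frame coercion follow the package documentation, but this code was not run as part of this project and serves only as an illustration.

if(!require(recommenderlab))
install.packages("recommenderlab", repos = "http://cran.us.r-project.org")
# Coerce a (user, item, rating) data frame into a sparse rating matrix
small <- edx %>%
filter(userId %in% sample(unique(userId), 1000)) %>%
select(userId, movieId, rating)
rmat <- as(as.data.frame(small), "realRatingMatrix")
# Train a user-based collaborative filtering model
rec <- Recommender(rmat, method = "UBCF")
# Predicted ratings for the first five sampled users
pred <- predict(rec, rmat[1:5, ], type = "ratings")
as(pred, "matrix")[, 1:3]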
Besides recommenderlab, other packages for building recommendation systems are available on the Comprehensive R Archive Network (CRAN) website [13].
[1] https://www.edx.org/professional-certificate/harvardx-data-science
[3] https://www.edx.org/professional-certificate/harvardx-data-science
[8] LIBMF, A Matrix-factorization Library for Recommender Systems: https://www.csie.ntu.edu.tw/~cjlin/libmf/
[9] https://cran.r-project.org/web/packages/recosystem/index.html
[11] https://cran.r-project.org/web/packages/recosystem/vignettes/introduction.html
[12] https://cran.r-project.org/web/packages/recosystem/vignettes/introduction.html
[13] https://cran.r-project.org/web/packages/available_packages_by_name.html