Process and Workflow
The following basic data science steps will be followed:
- Data Preparation: download, parse, import and prepare the data to be processed and analysed.
- Data Exploration and Visualization: explore the data to understand the features and the relationships between them and the outcome.
- Data Cleaning: the dataset may contain unnecessary information that needs to be removed.
- Data Analysis and modeling: create the model using the insights gained during exploration. Also test and validate the model.
- Communicate: create the report and publish the results.
Load the necessary Packages
library(dplyr)
library(tidyverse)
library(data.table)
library(lubridate)
library(ggplot2)
library(ggthemes)
library(caret)
library(scales)
Data Exploration
Data exploration and visualization: explore the data to understand the features and the relationships between them and the outcome.
Read the 100k movies dataset
movies <- read.csv('Project.csv')
knitr::kable(head(movies))
| userId | movieId | rating | timestamp |
|---|---|---|---|
| 1 | 31 | 2.5 | 1260759144 |
| 1 | 1029 | 3.0 | 1260759179 |
| 1 | 1061 | 3.0 | 1260759182 |
| 1 | 1129 | 2.0 | 1260759185 |
| 1 | 1172 | 4.0 | 1260759205 |
| 1 | 1263 | 2.0 | 1260759151 |
Insights into the dataset
str(movies)
## 'data.frame': 100004 obs. of 4 variables:
## $ userId : int 1 1 1 1 1 1 1 1 1 1 ...
## $ movieId : int 31 1029 1061 1129 1172 1263 1287 1293 1339 1343 ...
## $ rating : num 2.5 3 3 2 4 2 2 2 3.5 2 ...
## $ timestamp: int 1260759144 1260759179 1260759182 1260759185 1260759205 1260759151 1260759187 1260759148 1260759125 1260759131 ...
There are 4 variables, with integer and numeric datatypes:

| variable | datatype |
|---|---|
| userId | integer |
| movieId | integer |
| rating | numeric |
| timestamp | integer |
dim(movies)
## [1] 100004 4
From this first exploration, there are 100,004 rows and 4 variables (columns).
Further exploration of the dataset:
# Number of unique movies, users, and rating values
cat(n_distinct(movies$movieId), "Number of Movies\n",
    n_distinct(movies$userId), "Number of Users\n",
    n_distinct(movies$rating), "Unique Ratings")
## 9066 Number of Movies
## 671 Number of Users
## 10 Unique Ratings
Count the number of each rating:
movies %>%
group_by(rating) %>%
summarise(count=n())
## # A tibble: 10 x 2
## rating count
## <dbl> <int>
## 1 0.5 1101
## 2 1 3326
## 3 1.5 1687
## 4 2 7271
## 5 2.5 4449
## 6 3 20064
## 7 3.5 10538
## 8 4 28750
## 9 4.5 7723
## 10 5 15095
There are 671 users who gave 10 distinct rating values (0.5-5.0) to 9066 movies.
About the Date
tibble(`Start Date` = date(as_datetime(min(movies$timestamp), origin="1970-01-01")),
`End Date` = date(as_datetime(max(movies$timestamp), origin="1970-01-01"))) %>%
mutate(Period = duration(max(movies$timestamp)-min(movies$timestamp)))
## # A tibble: 1 x 3
## `Start Date` `End Date` Period
## <date> <date> <Duration>
## 1 1995-01-09 2016-10-16 686988635s (~21.77 years)
The ratings were collected over a period of almost 21 years and 9 months.
Convert the numeric value of timestamp into date
movies_df = movies %>%
mutate(date = date(as_datetime(timestamp))) %>%
mutate_at(vars(date), list(year=year, month=month, day=day))
knitr::kable(head(movies_df))
| userId | movieId | rating | timestamp | date | year | month | day |
|---|---|---|---|---|---|---|---|
| 1 | 31 | 2.5 | 1260759144 | 2009-12-14 | 2009 | 12 | 14 |
| 1 | 1029 | 3.0 | 1260759179 | 2009-12-14 | 2009 | 12 | 14 |
| 1 | 1061 | 3.0 | 1260759182 | 2009-12-14 | 2009 | 12 | 14 |
| 1 | 1129 | 2.0 | 1260759185 | 2009-12-14 | 2009 | 12 | 14 |
| 1 | 1172 | 4.0 | 1260759205 | 2009-12-14 | 2009 | 12 | 14 |
| 1 | 1263 | 2.0 | 1260759151 | 2009-12-14 | 2009 | 12 | 14 |
The following table lists the date and movieId combinations with the most ratings in a single day.
movies_df %>%
group_by(date, movieId) %>%
summarise(count = n(), .groups='drop') %>%
arrange(-count) %>%
head(10)
## # A tibble: 10 x 3
## date movieId count
## <date> <int> <int>
## 1 2005-03-22 1198 6
## 2 2005-03-22 318 5
## 3 2005-03-22 1259 5
## 4 2005-03-22 2028 5
## 5 2005-03-22 2762 5
## 6 2005-03-22 4306 5
## 7 2005-03-22 8360 5
## 8 1996-09-21 480 4
## 9 2000-11-19 2858 4
## 10 2000-11-21 608 4
movies_df %>%
ggplot(aes(x=year)) +
geom_histogram(fill="#69b3a2", color="#e9ecef", alpha=0.9) +
ggtitle("Rating Distribution Per Year") +
#stat_bin(bins = 30) +
xlab("Year") +
ylab("Number of Ratings") +
scale_y_continuous(labels = comma) +
theme_solarized()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Ratings
Count the number of each rating. Users can choose a rating value from 0.5 to 5.0, giving 10 possible values.
movies_df %>%
group_by(rating) %>%
summarise(count=n())
## # A tibble: 10 x 2
## rating count
## <dbl> <int>
## 1 0.5 1101
## 2 1 3326
## 3 1.5 1687
## 4 2 7271
## 5 2.5 4449
## 6 3 20064
## 7 3.5 10538
## 8 4 28750
## 9 4.5 7723
## 10 5 15095
movies_df %>% group_by(rating) %>%
summarise(count = n()) %>%
ggplot(aes(x=rating, y=count)) +
geom_line(color="grey") +
geom_point(shape=21, fill="#69b3a2", size=5) +
scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x),
labels = trans_format("log10", math_format(10^.x))) +
ggtitle("Rating Distribution", subtitle = "Higher ratings are prevalent.") +
xlab("Rating") +
ylab("Count") +
theme_solarized_2()
Whole-number ratings are more common than half-star ratings.
Movies
There are 9066 different movies in the movies_df set. Intuitively, some of them are rated more than others: many movies are watched by only a few users, while blockbusters tend to receive far more ratings.
n <- paste(round(100 * movies_df %>% group_by(movieId) %>%
summarise(n=n()) %>%
filter(n<=10) %>%
count() / length(unique(movies_df$movieId)),1),"\\%", sep="")
movies_df %>% group_by(movieId) %>%
summarise(n=n()) %>%
ggplot(aes(n)) +
geom_histogram(fill="#69b3a2", color="#e9ecef", alpha=0.9) +
scale_x_log10() +
ggtitle("Distribution of Movies",
subtitle = "The distribution is not symmetrical") +
xlab("Number of Ratings") +
ylab("Number of Movies") +
theme_solarized_2()
77% of movies have 10 or fewer ratings.
User Distribution
There are 671 different users in the movies_df set.
The majority of users rate few movies, while a few users rate more than a thousand movies.
n <- paste(round(100 * movies_df %>% group_by(userId) %>%
summarise(n=n()) %>%
filter(n<20) %>%
count() / length(unique(movies_df$userId)),1),"\\%", sep="")
0% of users rated fewer than 20 movies; every user rated at least 20.
movies_df %>% group_by(userId) %>%
summarise(count=n()) %>%
arrange(count) %>%
head()
## # A tibble: 6 x 2
## userId count
## <int> <int>
## 1 1 20
## 2 14 20
## 3 35 20
## 4 76 20
## 5 209 20
## 6 221 20
The user distribution is right skewed.
movies_df %>% group_by(userId) %>%
summarise(n=n()) %>%
ggplot(aes(n)) +
geom_histogram(fill="#69b3a2", color="#e9ecef", alpha=0.9) +
scale_x_log10() +
#stat_bin(bins = 30) +
ggtitle("Distribution of Users",
subtitle="The distribution is right skewed.") +
xlab("Number of Ratings") +
ylab("Number of Users") +
scale_y_continuous(labels = comma) +
theme_solarized_2()
Show the heatmap of users x movies
This user-movie matrix is sparse: the vast majority of cells are empty. Notice that a few movies collect more ratings and that one or two users are more active than the rest.
users <- sample(unique(movies_df$userId), 50)
movies_df %>% filter(userId %in% users) %>%
select(userId, movieId, rating) %>%
mutate(rating = 1) %>%
spread(movieId, rating) %>%
select(sample(ncol(.), 50)) %>%
as.matrix() %>% t(.) %>%
image(1:50, 1:50, . , xlab="Movies", ylab="Users")
# Grid lines and title are drawn after the image, not piped into it
abline(h=0:50+0.5, v=0:50+0.5, col = "grey")
title("User x Movie Matrix")
Data Preparation
The estimated rating uses only movie and user information.
We split the dataset into two parts: the training set, called movies, and the evaluation set, called validation, with 90% and 10% of the original dataset respectively.
Then we split the movies set into two parts: the train set and the test set, again with 90% and 10% of the movies set respectively. The model is created and trained on the train set and evaluated on the test set until the RMSE target is achieved; finally, we train the model again on the entire movies set and validate it on the validation set. This train/test/validation scheme is a simple hold-out form of cross-validation.
# 'Validation' set will be 10% of Movies data
set.seed(1, sample.kind="Rounding")## Warning in set.seed(1, sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
test_index <- createDataPartition(y = movies$rating,
times = 1, p = 0.1, list = FALSE)
move <- movies[-test_index,]
temp <- movies[test_index,]
# Make sure userId and movieId in 'validation' set are also in the training portion
validation <- temp %>%
  semi_join(move, by = "movieId") %>%
  semi_join(move, by = "userId")
# Add rows removed from 'validation' set back into the training portion
removed <- anti_join(temp, validation)
## Joining, by = c("userId", "movieId", "rating", "timestamp")
movies <- rbind(move, removed)
#rm(test_index, temp, move, removed)
The training set will be 90% of the movies data and the test set will be the remaining 10%.
set.seed(1, sample.kind="Rounding")
test_index <- createDataPartition(y = movies$rating, times = 1, p = 0.1, list = FALSE)
train_set <- movies[-test_index,]
temp <- movies[test_index,]
# Make sure userId and movieId in test set are also in train set
test_set <- temp %>%
semi_join(train_set, by = "movieId") %>%
semi_join(train_set, by = "userId")
# Add rows removed from test set back into train set
removed <- anti_join(temp, test_set)
train_set <- rbind(train_set, removed)
#rm(test_index, temp, removed)
train_set <- train_set %>% select(userId, movieId, rating)
test_set <- test_set %>% select(userId, movieId, rating)
Modeling
Modeling proceeds in stages:
- A random prediction used as a baseline
- A linear model with movie and user effects
- Regularization of the movie and user effects
- Matrix factorization
Random Prediction
A very simple model just randomly predicts the rating using the probability distribution observed during data exploration. For example, if we know the probability of all users giving a movie a rating of 3 is 10%, then we may guess that 10% of the ratings will be 3.
Such a prediction sets the worst error we should expect, so any other model must provide a better result.
Model Evaluation Functions
Here we define the loss functions.
# Define Mean Absolute Error (MAE)
MAE <- function(true_ratings, predicted_ratings){
mean(abs(true_ratings - predicted_ratings))
}
# Define Mean Squared Error (MSE)
MSE <- function(true_ratings, predicted_ratings){
mean((true_ratings - predicted_ratings)^2)
}
# Define Root Mean Squared Error (RMSE)
RMSE <- function(true_ratings, predicted_ratings){
sqrt(mean((true_ratings - predicted_ratings)^2))
}
Random Prediction
The first model randomly predicts the ratings using the probabilities observed in the training set. First, we calculate the probability of each rating in the training set; then we predict ratings for the test set and compare them with the actual ratings. Any model should be better than this one.
Since the training set is a sample of the entire population and we don't know the real distribution of ratings, a Monte Carlo simulation with replacement provides a good approximation of the rating distribution.
set.seed(4321, sample.kind = "Rounding")
# Create the probability of each rating
p <- function(x, y) mean(y == x)
rating <- seq(0.5,5,0.5)
# Estimate the probability of each rating with Monte Carlo simulation
B <- 10^3
M <- replicate(B, {
s <- sample(train_set$rating, 100, replace = TRUE)
sapply(rating, p, y= s)
})
prob <- sapply(1:nrow(M), function(x) mean(M[x,]))
# Predict random ratings
y_hat_random <- sample(rating, size = nrow(test_set),
replace = TRUE, prob = prob)
# Create a table with the error results
result <- tibble(Method = "Project Goal", RMSE = 0.8649, MSE = NA, MAE = NA)
result <- bind_rows(result,
tibble(Method = "Random prediction",
RMSE = RMSE(test_set$rating, y_hat_random),
MSE = MSE(test_set$rating, y_hat_random),
MAE = MAE(test_set$rating, y_hat_random)))
The RMSE of the random prediction is very high.
knitr::kable(result)
| Method | RMSE | MSE | MAE |
|---|---|---|---|
| Project Goal | 0.864900 | NA | NA |
| Random prediction | 1.489564 | 2.218802 | 1.145145 |
Linear Model
The simplest model predicts that all users will give the same rating to all movies, treating movie-to-movie variation as randomly distributed error. Although the predicted rating could be any value, statistical theory says that the mean minimizes the RMSE, so the initial prediction is just the average of all observed ratings, as described in this formula:
\[\hat Y_{u,i}=\mu+\epsilon_{i,u}\]
Where \(\hat Y\) is the predicted rating, \(\mu\) is the mean of observed data and \(\epsilon_{i,u}\) is the error distribution. Any value other than the mean increases the RMSE, so this is a good initial estimation.
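As a quick numerical check of this claim, the short sketch below (using a hypothetical toy vector of ratings, not the actual data) compares the RMSE of the mean against a grid of other constant predictions; the mean gives the lowest value.

```r
# Toy illustration: among constant predictions, the mean minimizes RMSE.
y <- c(0.5, 3, 3, 4, 4.5, 5)                 # hypothetical ratings
candidates <- seq(0.5, 5, 0.1)               # constant predictions to try
rmses <- sapply(candidates, function(m) sqrt(mean((y - m)^2)))
candidates[which.min(rmses)]                 # 3.3, the grid point closest to mean(y)
mean(y)                                      # 3.333...
```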
Part of the movie-to-movie variability can be explained by the fact that different movies have different rating distributions. This is easy to understand, since some movies are more popular than others and public preference varies. This is called the movie effect, or movie bias, and is expressed as \(b_i\) in this formula:
\[\hat Y_{u,i}=\mu + b_i + \epsilon_{i,u}\]
The movie effect can be calculated as the mean of the difference between the observed rating \(y\) and the mean \(\mu\).
\[\hat b_i=\frac{1}{n_i}\sum_{u=1}^{n_i}(y_{u,i}-\hat \mu)\]
Similar to the movie effect, different users have different rating patterns. For example, some users like most movies and consistently rate them 4 or 5, while others dislike most movies and rate them 1 or 2. This is called the user effect, or user bias, and is expressed in this formula:
\[\hat b_u=\frac{1}{n_u}\sum_{i=1}^{n_u}(y_{u,i}-\hat b_i-\hat \mu)\]
The prediction model that includes the user effect becomes:
\[\hat Y_{u,i}=\mu+b_i+b_u+\epsilon_{u,i}\]
Building the linear model based on the formula:
\[\hat Y_{u,i} = \mu + b_i + b_u + \epsilon_{u,i}\]
Initial Prediction
The initial prediction is just the mean of the ratings, \(\mu\).
\[\hat Y_{u,i} = \mu + \epsilon_{u,i}\]
# Mean of observed values
mu <- mean(train_set$rating)
# Update the error table
result <- bind_rows(result,
tibble(Method = "Mean",
RMSE = RMSE(test_set$rating, mu),
MSE = MSE(test_set$rating, mu),
MAE = MAE(test_set$rating, mu)))
# Show the RMSE improvement
knitr::kable(result)
| Method | RMSE | MSE | MAE |
|---|---|---|---|
| Project Goal | 0.864900 | NA | NA |
| Random prediction | 1.489564 | 2.218802 | 1.1451446 |
| Mean | 1.054766 | 1.112532 | 0.8472556 |
Include Movie Effect (\(b_i\))
\(b_i\) is the movie effect (bias) for movie \(i\).
\[\hat Y_{u,i} = \mu + b_i + \epsilon_{u,i}\]
# Movie effects (bi)
bi <- train_set %>%
group_by(movieId) %>%
summarize(b_i = mean(rating - mu))
head(bi)
## # A tibble: 6 x 2
## movieId b_i
## <int> <dbl>
## 1 1 0.306
## 2 2 -0.171
## 3 3 -0.392
## 4 4 -1.27
## 5 5 -0.270
## 6 6 0.319
The movie effect distribution is skewed rather than symmetric.
bi %>% ggplot(aes(x = b_i)) +
geom_histogram(bins=20, fill="#69b3a2", color="#e9ecef", alpha=0.9) +
ggtitle("Movie Effect Distribution") +
xlab("Movie effect") +
ylab("Count") +
scale_y_continuous(labels = comma) +
theme_solarized_2()
Distribution of movie effects.
# Predict the rating with mean + bi
y_hat_bi <- mu + test_set %>%
left_join(bi, by = "movieId") %>%
.$b_i
# Calculate the RMSE
result <- bind_rows(result,
tibble(Method = "Mean + bi",
RMSE = RMSE(test_set$rating, y_hat_bi),
MSE = MSE(test_set$rating, y_hat_bi),
MAE = MAE(test_set$rating, y_hat_bi)))
# Show the RMSE improvement
knitr::kable(result)
| Method | RMSE | MSE | MAE |
|---|---|---|---|
| Project Goal | 0.8649000 | NA | NA |
| Random prediction | 1.4895642 | 2.2188017 | 1.1451446 |
| Mean | 1.0547663 | 1.1125319 | 0.8472556 |
| Mean + bi | 0.9860457 | 0.9722861 | 0.7661784 |
Include User Effect (\(b_u\))
\(b_u\) is the user effect (bias) for user \(u\).
\[\hat Y_{u,i} = \mu + b_i + b_u + \epsilon_{u,i}\]
Predict the rating with mean + bi + bu
# User effect (bu)
bu <- train_set %>%
left_join(bi, by = 'movieId') %>%
group_by(userId) %>%
summarize(b_u = mean(rating - mu - b_i))
# Prediction
y_hat_bi_bu <- test_set %>%
left_join(bi, by='movieId') %>%
left_join(bu, by='userId') %>%
mutate(pred = mu + b_i + b_u) %>%
.$pred
# Update the results table
result <- bind_rows(result,
tibble(Method = "Mean + bi + bu",
RMSE = RMSE(test_set$rating, y_hat_bi_bu),
MSE = MSE(test_set$rating, y_hat_bi_bu),
MAE = MAE(test_set$rating, y_hat_bi_bu)))
# Show the RMSE improvement
knitr::kable(result)
| Method | RMSE | MSE | MAE |
|---|---|---|---|
| Project Goal | 0.8649000 | NA | NA |
| Random prediction | 1.4895642 | 2.2188017 | 1.1451446 |
| Mean | 1.0547663 | 1.1125319 | 0.8472556 |
| Mean + bi | 0.9860457 | 0.9722861 | 0.7661784 |
| Mean + bi + bu | 0.9056371 | 0.8201786 | 0.6961587 |
The user effect is approximately normally distributed.
train_set %>%
  group_by(userId) %>%
  summarize(b_u = mean(rating), n = n()) %>%
  filter(n >= 100) %>%   # keep only users with at least 100 ratings
  ggplot(aes(b_u)) +
  geom_histogram(fill="#e2639c", color="#e9ecef", alpha=0.9) +
  ggtitle("User Effect Distribution") +
  xlab("User Bias") +
  ylab("Count") +
  scale_y_continuous(labels = comma) +
  theme_solarized_2()
Evaluating the model result
The RMSE improved over the initial estimate based on the mean. However, we still need to check whether the model makes good rating predictions.
Check the 10 largest residuals:
train_set %>%
left_join(bi, by='movieId') %>%
mutate(residual = rating - (mu + b_i)) %>%
arrange(desc(abs(residual))) %>%
slice(1:10)
## userId movieId rating b_i residual
## 1 174 1252 0.5 0.8159500 -3.858209
## 2 578 50 0.5 0.8028497 -3.845109
## 3 304 527 0.5 0.7404513 -3.782710
## 4 394 527 0.5 0.7404513 -3.782710
## 5 410 527 0.5 0.7404513 -3.782710
## 6 20 296 0.5 0.7303163 -3.772575
## 7 457 1196 0.5 0.7258570 -3.768116
## 8 578 1196 0.5 0.7258570 -3.768116
## 9 310 1276 0.5 0.7196458 -3.761905
## 10 133 111 0.5 0.7054474 -3.747706
Top 10 best movies (ranked by \(b_i\)). These are obscure titles, unknown to most viewers:
bi %>%
  # (no titles table in this dataset, so only movieId is shown)
  arrange(-b_i) %>%
  select(movieId) %>%
  head(10)
## # A tibble: 10 x 1
## movieId
## <int>
## 1 53
## 2 183
## 3 301
## 4 309
## 5 559
## 6 702
## 7 759
## 8 764
## 9 820
## 10 845
Top 10 worst movies (ranked by \(b_i\)), also obscure titles:
bi %>%
#inner_join(titles, by = "movieId") %>%
arrange(b_i) %>%
select(movieId) %>%
head(10)
## # A tibble: 10 x 1
## movieId
## <int>
## 1 1311
## 2 2191
## 3 2845
## 4 3883
## 5 3933
## 6 4051
## 7 4445
## 8 4471
## 9 4684
## 10 4753
Number of ratings per movie for the first 10 movieIds (the summarise below drops the \(b_i\) ordering; the next chunk retrieves the counts for the actual top 10 movies by \(b_i\)):
train_set %>%
left_join(bi, by = "movieId") %>%
arrange(desc(b_i)) %>%
group_by(movieId) %>%
summarise(count = n()) %>%
slice(1:10)
## # A tibble: 10 x 2
## movieId count
## <int> <int>
## 1 1 221
## 2 2 97
## 3 3 50
## 4 4 11
## 5 5 55
## 6 6 94
## 7 7 48
## 8 8 5
## 9 9 17
## 10 10 113
The 10 movies with the highest \(b_i\) have very few ratings each:
train_set %>% count(movieId) %>%
left_join(bi, by="movieId") %>%
arrange(desc(b_i)) %>%
slice(1:10) %>%
pull(n)
## [1] 1 1 1 3 1 1 1 1 1 1
Regularization
Now we regularize the user and movie effects by adding a penalty factor \(\lambda\), which is a tuning parameter. We define a range of values for \(\lambda\) and use a regularization function to pick the value that minimizes the RMSE.
The linear model provides a good estimate of the ratings, but it doesn't account for the fact that many movies have very few ratings and some users rate very few movies. The sample size for these movies and users is very small, which statistically leads to a large estimation error.
The estimates can be improved by adding a factor that penalizes small sample sizes and has little or no impact otherwise. The regularized movie and user effects can be calculated with these formulas:
\[\hat b_i=\frac{1}{n_i+\lambda}\sum_{u=1}^{n_i}(y_{u,i}-\hat \mu)\]
\[\hat b_u=\frac{1}{n_u+\lambda}\sum_{i=1}^{n_u}( y_{u,i}-\hat b_i-\hat \mu)\]
For sample sizes \(n_i\) (or \(n_u\)) smaller than or similar to \(\lambda\), \(\hat b_i\) and \(\hat b_u\) are shrunk toward zero, whereas for sample sizes much larger than \(\lambda\) they change very little.
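A small numeric illustration of this shrinkage may help (the numbers below are hypothetical, not taken from the data): with \(\lambda = 5\), a movie rated twice is pulled strongly toward zero, while a movie rated 200 times is barely affected.

```r
# Hypothetical shrinkage example: both movies average 1 point above mu.
lambda <- 5
sum_dev_small <- 2;   n_small <- 2     # 2 ratings, each 1 above the mean
sum_dev_large <- 200; n_large <- 200   # 200 ratings, each 1 above the mean
sum_dev_small / (n_small + lambda)     # 2/7     ~ 0.29: strong shrinkage
sum_dev_large / (n_large + lambda)     # 200/205 ~ 0.98: almost unchanged
```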
An effective method to choose \(\lambda\) that minimizes the RMSE is running simulations with several values of \(\lambda\).
regularization <- function(lambda, trainset, testset){
# Mean
mu <- mean(trainset$rating)
# Movie effect (bi)
b_i <- trainset %>%
group_by(movieId) %>%
summarize(b_i = sum(rating - mu)/(n()+lambda))
# User effect (bu)
b_u <- trainset %>%
left_join(b_i, by="movieId") %>%
filter(!is.na(b_i)) %>%
group_by(userId) %>%
summarize(b_u = sum(rating - b_i - mu)/(n()+lambda))
# Prediction: mu + bi + bu
predicted_ratings <- testset %>%
left_join(b_i, by = "movieId") %>%
left_join(b_u, by = "userId") %>%
filter(!is.na(b_i), !is.na(b_u)) %>%
mutate(pred = mu + b_i + b_u) %>%
pull(pred)
return(RMSE(predicted_ratings, testset$rating))
}
# Define a set of lambdas to tune
lambdas <- seq(0, 10, 0.25)
# Update RMSES table
rmses <- sapply(lambdas,
regularization,
trainset = train_set,
testset = test_set)
# Plot the lambda x RMSE
tibble(Lambda = lambdas, RMSE = rmses) %>%
ggplot(aes(x = Lambda, y = RMSE)) +
geom_point(size=2) +
ggtitle("Regularization",
subtitle = "Pick the penalization that gives the lowest RMSE.") +
theme_solarized_2()
Choose the lambda that minimizes the RMSE.
Next, we apply the best \(\lambda\) to the linear model.
# We pick the lambda that returns the lowest RMSE.
lambda <- lambdas[which.min(rmses)]
# Then, we calculate the predicted rating using the best parameters
# achieved from regularization.
mu <- mean(train_set$rating)
# Movie effect (bi)
b_i <- train_set %>%
group_by(movieId) %>%
summarize(b_i = sum(rating - mu)/(n()+lambda))
# User effect (bu)
b_u <- train_set %>%
left_join(b_i, by="movieId") %>%
group_by(userId) %>%
summarize(b_u = sum(rating - b_i - mu)/(n()+lambda))
# Prediction
y_hat_reg <- test_set %>%
left_join(b_i, by = "movieId") %>%
left_join(b_u, by = "userId") %>%
mutate(pred = mu + b_i + b_u) %>%
pull(pred)
# Update the result table
result <- bind_rows(result,
tibble(Method = "Regularized bi and bu",
RMSE = RMSE(test_set$rating, y_hat_reg),
MSE = MSE(test_set$rating, y_hat_reg),
MAE = MAE(test_set$rating, y_hat_reg)))
# Regularization made a small improvement in RMSE.
result
## # A tibble: 6 x 4
## Method RMSE MSE MAE
## <chr> <dbl> <dbl> <dbl>
## 1 Project Goal 0.865 NA NA
## 2 Random prediction 1.49 2.22 1.15
## 3 Mean 1.05 1.11 0.847
## 4 Mean + bi 0.986 0.972 0.766
## 5 Mean + bi + bu 0.906 0.820 0.696
## 6 Regularized bi and bu 0.888 0.789 0.684
Matrix Factorization
Matrix factorization approximates a large user-movie matrix by the product of two lower-dimensional matrices. Information in the train set is stored in tidy format, with one observation per row, so it must be converted to the user-movie matrix format; the code further below performs this transformation.
Matrix factorization is a widely used machine learning tool for predicting ratings in recommendation systems. The method became widely known during the Netflix Prize challenge.
The data can be converted into a matrix in which each user is a row, each movie is a column, and the rating is in the cell; the algorithm then attempts to fill in the missing values. The table below provides a simple example of a \(4\times 5\) matrix.
|  | movie 1 | movie 2 | movie 3 | movie 4 | movie 5 |
|---|---|---|---|---|---|
| user 1 | ? | ? | 4 | ? | 3 |
| user 2 | 2 | ? | ? | 4 | ? |
| user 3 | ? | 3 | ? | ? | 5 |
| user 4 | 3 | ? | 2 | ? | ? |
The concept is to approximate a large rating matrix \(R_{m\times n}\) by the product of two lower-dimensional matrices \(P_{k\times m}\) and \(Q_{k\times n}\), such that
\[R\approx {P}' Q\]
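As a minimal illustration of this decomposition (with made-up factor matrices, not the fitted ones), two rank-2 factors reproduce the shape of a \(4\times 5\) rating matrix:

```r
# Toy factorization: R (4 users x 5 movies) approximated by t(P) %*% Q,
# where P is k x m and Q is k x n, with k = 2 latent factors.
set.seed(42)
k <- 2; m <- 4; n <- 5
P <- matrix(runif(k * m), nrow = k)   # user factors (k x m)
Q <- matrix(runif(k * n), nrow = k)   # movie factors (k x n)
R_approx <- t(P) %*% Q                # 4 x 5 approximation of R
dim(R_approx)
```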
train_data <- train_set %>%
select(userId, movieId, rating) %>%
spread(movieId, rating) %>%
as.matrix()
The code above uses more memory than a commodity laptop can handle, so I used an alternative: the recosystem package, which provides a complete solution for a recommendation system based on matrix factorization.
The package vignette describes how to use recosystem:
Usage of recosystem
The usage of recosystem is quite simple, mainly consisting of the following steps:
- Create a model object (a Reference Class object in R) by calling Reco().
- (Optionally) call the $tune() method to select the best tuning parameters from a set of candidate values.
- Train the model by calling the $train() method. A number of parameters can be set inside the function, possibly coming from the result of $tune().
- (Optionally) export the model via $output(), i.e. write the factorization matrices \(P\) and \(Q\) into files or return them as R objects.
- Use the $predict() method to compute predicted values.
if(!require(recosystem))
install.packages("recosystem", repos = "http://cran.us.r-project.org")
set.seed(123, sample.kind = "Rounding") # This is a randomized algorithm
library("recosystem")
# Convert the train and test sets into recosystem input format
train_data <- with(train_set, data_memory(user_index = userId,
item_index = movieId,
rating = rating))
test_data <- with(test_set, data_memory(user_index = userId,
item_index = movieId,
rating = rating))
# Create the model object
r <- recosystem::Reco()
# Select the best tuning parameters
opts <- r$tune(train_data, opts = list(dim = c(10, 20, 30),
lrate = c(0.1, 0.2),
costp_l2 = c(0.01, 0.1),
costq_l2 = c(0.01, 0.1),
nthread = 4, niter = 10))
# Train the algorithm
r$train(train_data, opts = c(opts$min, nthread = 4, niter = 20))
## iter tr_rmse obj
## 0 1.3298 2.3151e+05
## 1 0.9296 1.4654e+05
## 2 0.8975 1.3888e+05
## 3 0.8722 1.3365e+05
## 4 0.8483 1.2949e+05
## 5 0.8215 1.2541e+05
## 6 0.7931 1.2143e+05
## 7 0.7646 1.1771e+05
## 8 0.7393 1.1455e+05
## 9 0.7154 1.1164e+05
## 10 0.6953 1.0942e+05
## 11 0.6777 1.0769e+05
## 12 0.6609 1.0597e+05
## 13 0.6469 1.0458e+05
## 14 0.6337 1.0331e+05
## 15 0.6230 1.0240e+05
## 16 0.6132 1.0148e+05
## 17 0.6032 1.0070e+05
## 18 0.5945 9.9808e+04
## 19 0.5875 9.9332e+04
# Calculate the predicted values
y_hat_reco <- r$predict(test_data, out_memory())
knitr::kable(head(y_hat_reco, 10))
| x |
|---|
| 3.560642 |
| 3.778561 |
| 3.333235 |
| 2.905037 |
| 2.613510 |
| 3.758197 |
| 3.089734 |
| 4.276941 |
| 3.981087 |
| 3.219093 |
On this test split, matrix factorization did not improve on the regularized linear model (RMSE 0.9217 vs 0.8880).
result <- bind_rows(result,
tibble(Method = "Matrix Factorization - recosystem",
RMSE = RMSE(test_set$rating, y_hat_reco),
MSE = MSE(test_set$rating, y_hat_reco),
MAE = MAE(test_set$rating, y_hat_reco)))
result
## # A tibble: 7 x 4
## Method RMSE MSE MAE
## <chr> <dbl> <dbl> <dbl>
## 1 Project Goal 0.865 NA NA
## 2 Random prediction 1.49 2.22 1.15
## 3 Mean 1.05 1.11 0.847
## 4 Mean + bi 0.986 0.972 0.766
## 5 Mean + bi + bu 0.906 0.820 0.696
## 6 Regularized bi and bu 0.888 0.789 0.684
## 7 Matrix Factorization - recosystem 0.922 0.850 0.717
Final Validation
As we can see from the result table, neither model reached the target RMSE on the test set, with the regularized linear model coming closest. So, finally, we train both models on the complete movies set and calculate the RMSE on the validation set. The project goal is achieved if the RMSE stays below the target.
Linear Model With Regularization
During the training and testing phase, the linear model with regularization came close to the target RMSE but did not reach it. Here we do the final validation with the validation set.
mu_move <- mean(movies$rating)
# Movie effect (bi)
b_i_mov <- movies %>%
group_by(movieId) %>%
summarize(b_i = sum(rating - mu_move)/(n()+lambda))
# User effect (bu)
b_u_mov <- movies %>%
left_join(b_i_mov, by="movieId") %>%
group_by(userId) %>%
summarize(b_u = sum(rating - b_i - mu_move)/(n()+lambda))
# Prediction
y_hat_mov <- validation %>%
left_join(b_i_mov, by = "movieId") %>%
left_join(b_u_mov, by = "userId") %>%
mutate(pred = mu_move + b_i + b_u) %>%
pull(pred)
# Update the results table
result <- bind_rows(result,
tibble(Method = "Final Regularization (edx vs validation)",
RMSE = RMSE(validation$rating, y_hat_mov),
MSE = MSE(validation$rating, y_hat_mov),
MAE = MAE(validation$rating, y_hat_mov)))
# Show the RMSE improvement
knitr::kable(result)
| Method | RMSE | MSE | MAE |
|---|---|---|---|
| Project Goal | 0.8649000 | NA | NA |
| Random prediction | 1.4895642 | 2.2188017 | 1.1451446 |
| Mean | 1.0547663 | 1.1125319 | 0.8472556 |
| Mean + bi | 0.9860457 | 0.9722861 | 0.7661784 |
| Mean + bi + bu | 0.9056371 | 0.8201786 | 0.6961587 |
| Regularized bi and bu | 0.8879858 | 0.7885188 | 0.6839101 |
| Matrix Factorization - recosystem | 0.9217162 | 0.8495607 | 0.7169659 |
| Final Regularization (edx vs validation) | 0.8404891 | 0.7064220 | 0.6470983 |
The RMSE calculated on the validation set (0.8404891) is below the target of 0.8649 and lower than the RMSE obtained on the test set (0.8879858).
Top 10 best movies
validation %>%
left_join(b_i_mov, by = "movieId") %>%
left_join(b_u_mov, by = "userId") %>%
mutate(pred = mu_move + b_i + b_u) %>%
arrange(-pred) %>%
group_by(movieId) %>%
select(movieId) %>%
head(10)
## # A tibble: 10 x 1
## # Groups: movieId [9]
## movieId
## <int>
## 1 58559
## 2 98124
## 3 73
## 4 318
## 5 59018
## 6 80463
## 7 318
## 8 3868
## 9 1203
## 10 608
Top 10 worst movies
validation %>%
left_join(b_i_mov, by = "movieId") %>%
left_join(b_u_mov, by = "userId") %>%
mutate(pred = mu_move + b_i + b_u) %>%
arrange(pred) %>%
group_by(movieId) %>%
select(movieId) %>%
head(10)
## # A tibble: 10 x 1
## # Groups: movieId [10]
## movieId
## <int>
## 1 65802
## 2 3593
## 3 1902
## 4 673
## 5 2191
## 6 4753
## 7 32153
## 8 46574
## 9 6503
## 10 546
Matrix Factorization
Although matrix factorization did not beat the regularized linear model on the test split, we now retrain it on the entire movies set and validate it on the validation set.
set.seed(1234, sample.kind = "Rounding")
# Convert 'movies' and 'validation' sets to recosystem input format
mov_reco <- with(movies, data_memory(user_index = userId,
item_index = movieId,
rating = rating))
validation_reco <- with(validation, data_memory(user_index = userId,
item_index = movieId,
rating = rating))
# Create the model object
r <- recosystem::Reco()
# Tune the parameters
opts <- r$tune(mov_reco, opts = list(dim = c(10, 20, 30),
lrate = c(0.1, 0.2),
costp_l2 = c(0.01, 0.1),
costq_l2 = c(0.01, 0.1),
nthread = 4, niter = 10))
# Train the model
r$train(mov_reco, opts = c(opts$min, nthread = 4, niter = 20))
## iter tr_rmse obj
## 0 1.2931 2.3886e+05
## 1 0.9120 1.5103e+05
## 2 0.8869 1.4461e+05
## 3 0.8693 1.4043e+05
## 4 0.8516 1.3649e+05
## 5 0.8326 1.3291e+05
## 6 0.8130 1.2990e+05
## 7 0.7920 1.2652e+05
## 8 0.7715 1.2374e+05
## 9 0.7526 1.2098e+05
## 10 0.7352 1.1878e+05
## 11 0.7205 1.1696e+05
## 12 0.7073 1.1532e+05
## 13 0.6947 1.1384e+05
## 14 0.6834 1.1258e+05
## 15 0.6738 1.1146e+05
## 16 0.6641 1.1031e+05
## 17 0.6560 1.0955e+05
## 18 0.6485 1.0879e+05
## 19 0.6411 1.0799e+05
# Calculate the prediction
y_hat_final_reco <- r$predict(validation_reco, out_memory())
# Update the result table
result <- bind_rows(result,
tibble(Method = "Final Matrix Factorization - recosystem",
RMSE = RMSE(validation$rating, y_hat_final_reco),
MSE = MSE(validation$rating, y_hat_final_reco),
MAE = MAE(validation$rating, y_hat_final_reco)))
The final RMSE with matrix factorization is 0.6244672, 25.7% better than the linear model with regularization (0.8404891).
# Show the RMSE improvement
knitr::kable(result)
| Method | RMSE | MSE | MAE |
|---|---|---|---|
| Project Goal | 0.8649000 | NA | NA |
| Random prediction | 1.4895642 | 2.2188017 | 1.1451446 |
| Mean | 1.0547663 | 1.1125319 | 0.8472556 |
| Mean + bi | 0.9860457 | 0.9722861 | 0.7661784 |
| Mean + bi + bu | 0.9056371 | 0.8201786 | 0.6961587 |
| Regularized bi and bu | 0.8879858 | 0.7885188 | 0.6839101 |
| Matrix Factorization - recosystem | 0.9217162 | 0.8495607 | 0.7169659 |
| Final Regularization (edx vs validation) | 0.8404891 | 0.7064220 | 0.6470983 |
| Final Matrix Factorization - recosystem | 0.6244672 | 0.3899593 | 0.4655513 |
Now, let’s check the best and worst movies predicted with matrix factorization.
Top 10 best movies:
tibble(movieId = validation$movieId, rating = y_hat_final_reco) %>%
arrange(-rating) %>%
group_by(movieId) %>%
select(movieId) %>%
head(10)
## # A tibble: 10 x 1
## # Groups: movieId [9]
## movieId
## <int>
## 1 2064
## 2 8970
## 3 106489
## 4 58559
## 5 1291
## 6 318
## 7 523
## 8 527
## 9 318
## 10 73
Top 10 worst movies:
tibble(movieId = validation$movieId, rating = y_hat_final_reco) %>%
arrange(rating) %>%
group_by(movieId) %>%
select(movieId) %>%
head(10)
## # A tibble: 10 x 1
## # Groups: movieId [10]
## movieId
## <int>
## 1 1325
## 2 6252
## 3 130448
## 4 51304
## 5 65802
## 6 4753
## 7 7448
## 8 90870
## 9 95443
## 10 32153
Conclusion
We started by collecting and preparing the dataset for analysis, then explored it in search of insights that could help during model building.
Next, we created a random model that predicts ratings based on the observed probability distribution of each rating. This model gives the worst result.
The linear model began with the simplest possible estimate, the mean of the observed ratings. From there we added movie and user effects, which model movie rating distributions and user behavior. With regularization we added a penalty for movies and users with few ratings. The linear model achieved an RMSE of 0.8404891, successfully passing the target of 0.8649.
Finally, we evaluated the recosystem package, which implements the LIBMF matrix factorization algorithm, and achieved an RMSE of 0.6244672.
Limitations
Only two predictors are used, the movie and the user, without considering other features. Modern recommendation systems use many more predictors, such as genres, bookmarks, and playlists.
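For instance, if a genres column were available in this dataset (it is not), a genre effect \(b_g\) could be added to the linear model in the same way as the movie and user effects. A minimal sketch, assuming a hypothetical genres column in train_set:

```r
# Hypothetical extension: a genre effect (b_g) on top of bi and bu.
# Assumes train_set has a 'genres' column, which this dataset lacks.
b_g <- train_set %>%
  left_join(b_i, by = "movieId") %>%
  left_join(b_u, by = "userId") %>%
  group_by(genres) %>%
  summarize(b_g = mean(rating - mu - b_i - b_u))
```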
Thank you!