Process and Workflow

The following basic data science steps will be followed:

  1. Data Preparation: download, parse, import and prepare the data to be processed and analysed.
  2. Data Exploration and visualization: explore data to understand the features and the relationship between the features and predictors.
  3. Data Cleaning: eventually the dataset contains unnecessary information that needs to be removed.
  4. Data Analysis and modeling: create the model using the insights gained during exploration. Also test and validate the model.
  5. Communicate: create the report and publish the results.

Load the necessary Packages

  • library(dplyr)
  • library(tidyverse)
  • library(data.table)
  • library(lubridate)
  • library(ggplot2)
  • library(ggthemes)
  • library(caret)
  • library(scales)

Data Exploration

Data exploration and visualization: explore data to understand the features and the relationship between the features and predictors.

Read the 100k movies dataset

movies <- read.csv('Project.csv') 
knitr::kable(head(movies))
userId movieId rating timestamp
1 31 2.5 1260759144
1 1029 3.0 1260759179
1 1061 3.0 1260759182
1 1129 2.0 1260759185
1 1172 4.0 1260759205
1 1263 2.0 1260759151

Insight on the dataset

str(movies)
## 'data.frame':    100004 obs. of  4 variables:
##  $ userId   : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ movieId  : int  31 1029 1061 1129 1172 1263 1287 1293 1339 1343 ...
##  $ rating   : num  2.5 3 3 2 4 2 2 2 3.5 2 ...
##  $ timestamp: int  1260759144 1260759179 1260759182 1260759185 1260759205 1260759151 1260759187 1260759148 1260759125 1260759131 ...

There are 4 variables and with integer and numeric datatypes | variables | datatype | |———–|———–| | movieId | integer | | userId | integer |
| rating | numeric |
| timestamp | integer |

dim(movies)
## [1] 100004      4

From the first data exploration. There are 100,004 rows and 4 variables (columns)

Further exploration on the dataset.

# number of unique users
cat(n_distinct(movies$movieId), "Number of Movies\n",
    n_distinct(movies$userId), "Number of Users\n",
    n_distinct(movies$rating), "Unique Ratings")
## 9066 Number of Movies
##  671 Number of Users
##  10 Unique Ratings

Count the number of each ratings:

movies %>% 
  group_by(rating) %>% 
  summarise(count=n()) 
## # A tibble: 10 x 2
##    rating count
##     <dbl> <int>
##  1    0.5  1101
##  2    1    3326
##  3    1.5  1687
##  4    2    7271
##  5    2.5  4449
##  6    3   20064
##  7    3.5 10538
##  8    4   28750
##  9    4.5  7723
## 10    5   15095

There are 671 users who gave 10 types of ratings (0.5 - 5.0) on 9066 Movies.

About the Date

tibble(`Start Date` = date(as_datetime(min(movies$timestamp), origin="1970-01-01")),
       `End Date` = date(as_datetime(max(movies$timestamp), origin="1970-01-01"))) %>%
  mutate(Period = duration(max(movies$timestamp)-min(movies$timestamp)))
## # A tibble: 1 x 3
##   `Start Date` `End Date` Period                   
##   <date>       <date>     <Duration>               
## 1 1995-01-09   2016-10-16 686988635s (~21.77 years)

The rating period was collected over almost 21 years and 9 months.

Convert the numeric value of timestamp into date

movies_df = movies %>% 
  mutate(date = date(as_datetime(timestamp))) %>% 
  mutate_at(vars(date), list(year=year, month=month, day=day))
  
knitr::kable(head(movies_df))
userId movieId rating timestamp date year month day
1 31 2.5 1260759144 2009-12-14 2009 12 14
1 1029 3.0 1260759179 2009-12-14 2009 12 14
1 1061 3.0 1260759182 2009-12-14 2009 12 14
1 1129 2.0 1260759185 2009-12-14 2009 12 14
1 1172 4.0 1260759205 2009-12-14 2009 12 14
1 1263 2.0 1260759151 2009-12-14 2009 12 14

The following table lists the days with more ratings with the movieId.

movies_df %>%
  group_by(date, movieId) %>%
  summarise(count = n(), .groups='drop') %>%
  arrange(-count) %>%
  head(10)
## # A tibble: 10 x 3
##    date       movieId count
##    <date>       <int> <int>
##  1 2005-03-22    1198     6
##  2 2005-03-22     318     5
##  3 2005-03-22    1259     5
##  4 2005-03-22    2028     5
##  5 2005-03-22    2762     5
##  6 2005-03-22    4306     5
##  7 2005-03-22    8360     5
##  8 1996-09-21     480     4
##  9 2000-11-19    2858     4
## 10 2000-11-21     608     4
movies_df %>% 
  ggplot(aes(x=year)) + 
  geom_histogram(fill="#69b3a2", color="#e9ecef", alpha=0.9) +
  ggtitle("Rating Distribution Per Year") + 
  #stat_bin(bins = 30) +
  xlab("Year") +
  ylab("Number of Ratings") +
  scale_y_continuous(labels = comma) + 
  theme_solarized()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Ratings

Count the number of each ratings: Users have the option to choose a rating value from 0.5 to 5.0, totaling 10 possible values.

movies_df %>% 
  group_by(rating) %>% 
  summarise(count=n())
## # A tibble: 10 x 2
##    rating count
##     <dbl> <int>
##  1    0.5  1101
##  2    1    3326
##  3    1.5  1687
##  4    2    7271
##  5    2.5  4449
##  6    3   20064
##  7    3.5 10538
##  8    4   28750
##  9    4.5  7723
## 10    5   15095
  #arrange()
movies_df %>% group_by(rating) %>% 
  summarise(count = n()) %>%
  ggplot(aes(x=rating, y=count)) + 
    geom_line(color="grey") +
    geom_point(shape=21, fill="#69b3a2", size=5) +
    scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x),
                labels = trans_format("log10", math_format(10^.x))) +
    ggtitle("Rating Distribution", subtitle = "Higher ratings are prevalent.") + 
    xlab("Rating") +
    ylab("Count") +
    theme_solarized_2()
Round values receive more ratings than decimals.

Round values receive more ratings than decimals.

Movies

There are 9066 different movies in the movies_df set. We know from intuition that some of them are rated more than others, since many movies are watched by few users and blockbusters tend to have more ratings.

n <- paste(round(100 * movies_df %>% group_by(movieId) %>%
  summarise(n=n()) %>%
  filter(n<=10) %>%
  count() / length(unique(movies_df$movieId)),1),"\\%", sep="")
movies_df %>% group_by(movieId) %>%
  summarise(n=n()) %>%
  ggplot(aes(n)) +
    geom_histogram(fill="#69b3a2", color="#e9ecef", alpha=0.9) +
    scale_x_log10() + 
    ggtitle("Distribution of Movies", 
            subtitle = "The distribution is not symmetrical") +
    xlab("Number of Ratings") +
    ylab("Number of Movies") + 
    theme_solarized_2()
77\%  of movies have less than 10 ratings.

77% of movies have less than 10 ratings.

User Distribution

There are 671 different users are in the movies_df set.

The majority of users rate few movies, while a few users rate more than a thousand movies.

n <- paste(round(100 * movies_df %>% group_by(userId) %>%
  summarise(n=n()) %>%
  filter(n<20) %>%
  count() / length(unique(movies_df$userId)),1),"\\%", sep="")

0% users rated less than 20 movies.

movies_df %>% group_by(userId) %>%
  summarise(count=n()) %>%
  arrange(count) %>%
  head()
## # A tibble: 6 x 2
##   userId count
##    <int> <int>
## 1      1    20
## 2     14    20
## 3     35    20
## 4     76    20
## 5    209    20
## 6    221    20

The user distribution is right skewed.

movies_df %>% group_by(userId) %>%
  summarise(n=n()) %>%
  ggplot(aes(n)) +
    geom_histogram(fill="#69b3a2", color="#e9ecef", alpha=0.9) +
    scale_x_log10() + 
    #stat_bin(bins = 30) +
    ggtitle("Distribution of Users", 
            subtitle="The distribution is right skewed.") +
    xlab("Number of Ratings") +
    ylab("Number of Users") + 
    scale_y_continuous(labels = comma) + 
    theme_solarized_2()
0\%  of users rated less than 20 movies.

0% of users rated less than 20 movies.

Show the heatmap of users x movies

This user-movie matrix is sparse, with the vast majority of empty cells. Notice that four movies have more ratings, and one or two users are more active.

users <- sample(unique(movies_df$userId), 50)
movies_df %>% filter(userId %in% users) %>%
  select(userId, movieId, rating) %>%
  mutate(rating = 1) %>%
  spread(movieId, rating) %>% 
  select(sample(ncol(.), 50)) %>% 
  as.matrix() %>% t(.) %>%
  image(1:50, 1:50,. , xlab="Movies", ylab="Users") %>% 
  abline(h=0:50+0.5, v=0:50+0.5, col = "grey") %>% 
  title("User x Movie Matrix")

Data Preparation

The estimated rating uses only movie and user information

We split the dataset in two parts, the training set called movies and the evaluation set called validation with 90% and 10% of the original dataset respectively.

Then, we split the movies set in two parts, the train set and test set with 90% and 10% of movies set respectively. The model is created and trained in the train set and tested in the test set until the RMSE target is achieved, then finally we train the model again in the entire movies set and validate in the validation set. The name of this method is cross-validation.

# 'Validation' set will be 10% of Movies data
set.seed(1, sample.kind="Rounding")
## Warning in set.seed(1, sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
test_index <- createDataPartition(y = movies$rating, 
                                  times = 1, p = 0.1, list = FALSE)
move <- movies[-test_index,]
temp <- movies[test_index,]

# Make sure userId and movieId in 'validation' set are also in 'movies' set
validation <- temp %>% 
  semi_join(movies, by = "movieId") %>%
  semi_join(movies, by = "userId")

# Add rows removed from 'validation' set back into 'movies' set
removed <- anti_join(temp, validation)
## Joining, by = c("userId", "movieId", "rating", "timestamp")
movies <- rbind(movies, removed)

#(movies, test_index, temp, move, removed)

The training set will be 90% of movies data and the test set will be the remaining 10%.

set.seed(1, sample.kind="Rounding")
test_index <- createDataPartition(y = movies$rating, times = 1, p = 0.1, list = FALSE)
train_set <- movies[-test_index,]
temp <- movies[test_index,]

# Make sure userId and movieId in test set are also in train set
test_set <- temp %>% 
  semi_join(train_set, by = "movieId") %>%
  semi_join(train_set, by = "userId")

# Add rows removed from test set back into train set
removed <- anti_join(temp, test_set)
train_set <- rbind(train_set, removed)

#rm(test_index, temp, removed)
train_set <- train_set %>% select(userId, movieId, rating)
test_set  <- test_set  %>% select(userId, movieId, rating)

Modeling

This is conducted in three steps:

  1. Selecting useful data
  2. Normalizing data
  3. Binarizing the data

Random Prediction

A very simple model is just randomly predict the rating using the probability distribution observed during the data exploration. For example, if we know the probability of all users giving a movie a rating of 3 is 10%, then we may guess that 10% of the ratings will have a rating of 3.

Such prediction sets the worst error we may get, so any other model should provide better result.

Model Evaluation Functions

Here we define the loss functions.

# Define Mean Absolute Error (MAE)
MAE <- function(true_ratings, predicted_ratings){
  mean(abs(true_ratings - predicted_ratings))
}

# Define Mean Squared Error (MAE)
MSE <- function(true_ratings, predicted_ratings){
  mean((true_ratings - predicted_ratings)^2)
}

# Define Root Mean Squared Error (RMSE)
RMSE <- function(true_ratings, predicted_ratings){
  sqrt(mean((true_ratings - predicted_ratings)^2))
}

Random Prediction

The first model randomly predicts the ratings using the observed probabilities in the training set. First, we calculate the probability of each rating in the training set, then we predict the rating for the test set and compare with actual rating. Any model should be better than this one.

Since the training set is a sample of the entire population and we don’t know the real distribution of ratings, the Monte Carlo simulation with replacement provides a good approximation of the rating distribution.

set.seed(4321, sample.kind = "Rounding")

# Create the probability of each rating
p <- function(x, y) mean(y == x)
rating <- seq(0.5,5,0.5)

# Estimate the probability of each rating with Monte Carlo simulation
B <- 10^3
M <- replicate(B, {
  s <- sample(train_set$rating, 100, replace = TRUE)
  sapply(rating, p, y= s)
})
prob <- sapply(1:nrow(M), function(x) mean(M[x,]))

# Predict random ratings
y_hat_random <- sample(rating, size = nrow(test_set), 
                       replace = TRUE, prob = prob)

# Create a table with the error results
result <- tibble(Method = "Project Goal", RMSE = 0.8649, MSE = NA, MAE = NA)
result <- bind_rows(result, 
                    tibble(Method = "Random prediction", 
                           RMSE = RMSE(test_set$rating, y_hat_random),
                           MSE  = MSE(test_set$rating, y_hat_random),
                           MAE  = MAE(test_set$rating, y_hat_random)))

The RMSE of random prediction is very high.

knitr::kable(result)
Method RMSE MSE MAE
Project Goal 0.864900 NA NA
Random prediction 1.489564 2.218802 1.145145

Linear Model

The simplest model predicts all users will give the same rating to all movies and assumes the movie to movie variation is the randomly distributed error. Although the predicted rating can be any value, statistics theory says that the average minimizes the RMSE, so the initial prediction is just the average of all observed ratings, as described in this formula:

\[\hat Y_{u,i}=\mu+\epsilon_{i,u}\]

Where \(\hat Y\) is the predicted rating, \(\mu\) is the mean of observed data and \(\epsilon_{i,u}\) is the error distribution. Any value other than the mean increases the RMSE, so this is a good initial estimation.

Part of the movie to movie variability can be explained by the fact that different movies have different rating distribution. This is easy to understand, since some movies are more popular than others and the public preference varies. This is called movie effect or movie bias, and is expressed as \(b_i\) in this formula:

\[\hat Y_{u,i}=\mu + b_i + \epsilon_{i,u}\]

The movie effect can be calculated as the mean of the difference between the observed rating \(y\) and the mean \(\mu\).

\[\hat b_i=\frac{1}{N}\sum_{i=1}^{N}(y_i-\hat \mu)\] Similar to the movie effect, different users have different rating pattern or distribution. For example, some users like most movies and consistently rate 4 or 5, while other users dislike most movies rating 1 or 2. This is called user effect or user bias and is expressed in this formula:

\[\hat b_u=\frac{1}{N}\sum_{i=1}^{N}(y_{u,i}-\hat b_i-\hat \mu)\] The prediction model that includes the user effect becomes:

\[\hat Y_{u,i}=\mu+b_i+b_u+\epsilon_{u,i}\]

Building the linear model based on the formula:
\[\hat y = \mu + b_i + b_u + \epsilon_{u,i}\]

Initial Prediction

The initial prediction is just the mean of the ratings, \(\mu\).
\[\hat y = \mu + \epsilon_{u,i}\]

# Mean of observed values
mu <- mean(train_set$rating)

# Update the error table  
result <- bind_rows(result, 
                    tibble(Method = "Mean", 
                           RMSE = RMSE(test_set$rating, mu),
                           MSE  = MSE(test_set$rating, mu),
                           MAE  = MAE(test_set$rating, mu)))
# Show the RMSE improvement  
knitr::kable(result)
Method RMSE MSE MAE
Project Goal 0.864900 NA NA
Random prediction 1.489564 2.218802 1.1451446
Mean 1.054766 1.112532 0.8472556

Include Movie Effect (\(b_i\))

\(b_i\) is the movie effect (bias) for movie \(i\).
\[\hat y = \mu + b_i + \epsilon_{u,i}\]

# Movie effects (bi)
bi <- train_set %>% 
  group_by(movieId) %>% 
  summarize(b_i = mean(rating - mu))
head(bi)
## # A tibble: 6 x 2
##   movieId    b_i
##     <int>  <dbl>
## 1       1  0.306
## 2       2 -0.171
## 3       3 -0.392
## 4       4 -1.27 
## 5       5 -0.270
## 6       6  0.319

The movie effect is normally right skewed distributed.

bi %>% ggplot(aes(x = b_i)) + 
  geom_histogram(bins=20, fill="#69b3a2", color="#e9ecef", alpha=0.9) +
  ggtitle("Movie Effect Distribution") +
  xlab("Movie effect") +
  ylab("Count") +
  scale_y_continuous(labels = comma) + 
  theme_solarized_2()
Distribution of movie effects.

Distribution of movie effects.

# Predict the rating with mean + bi  
y_hat_bi <- mu + test_set %>% 
  left_join(bi, by = "movieId") %>% 
  .$b_i

# Calculate the RMSE  
result <- bind_rows(result, 
                    tibble(Method = "Mean + bi", 
                           RMSE = RMSE(test_set$rating, y_hat_bi),
                           MSE  = MSE(test_set$rating, y_hat_bi),
                           MAE  = MAE(test_set$rating, y_hat_bi)))

# Show the RMSE improvement  
knitr::kable(result)
Method RMSE MSE MAE
Project Goal 0.8649000 NA NA
Random prediction 1.4895642 2.2188017 1.1451446
Mean 1.0547663 1.1125319 0.8472556
Mean + bi 0.9860457 0.9722861 0.7661784

Include User Effect (\(b_u\))

\(b_u\) is the user effect (bias) for user \(u\).
\[\hat y_{u,i} = \mu + b_i + b_u + \epsilon_{u,i}\]

Predict the rating with mean + bi + bu

# User effect (bu)
bu <- train_set %>% 
  left_join(bi, by = 'movieId') %>%
  group_by(userId) %>%
  summarize(b_u = mean(rating - mu - b_i))

# Prediction
y_hat_bi_bu <- test_set %>% 
  left_join(bi, by='movieId') %>%
  left_join(bu, by='userId') %>%
  mutate(pred = mu + b_i + b_u) %>%
  .$pred

# Update the results table
result <- bind_rows(result, 
                    tibble(Method = "Mean + bi + bu", 
                           RMSE = RMSE(test_set$rating, y_hat_bi_bu),
                           MSE  = MSE(test_set$rating, y_hat_bi_bu),
                           MAE  = MAE(test_set$rating, y_hat_bi_bu)))

# Show the RMSE improvement  
knitr::kable(result)
Method RMSE MSE MAE
Project Goal 0.8649000 NA NA
Random prediction 1.4895642 2.2188017 1.1451446
Mean 1.0547663 1.1125319 0.8472556
Mean + bi 0.9860457 0.9722861 0.7661784
Mean + bi + bu 0.9056371 0.8201786 0.6961587

The user effect is normally distributed.

train_set %>% 
  group_by(userId) %>% 
  summarize(b_u = mean(rating)) %>% 
  filter(n()>=100) %>%
  ggplot(aes(b_u)) + 
    geom_histogram(fill="#e2639c", color="#e9ecef", alpha=0.9) + 
    ggtitle("User Effect Distribution") +
    xlab("User Bias") +
    ylab("Count") +
    scale_y_continuous(labels = comma) + 
    theme_solarized_2()
User effect is normally distributed.

User effect is normally distributed.

Evaluating the model result

The RMSE improved from the initial estimation based on the mean. However, we still need to check if the model makes good ratings predictions.

Check the 10 largest residual differences

train_set %>% 
  left_join(bi, by='movieId') %>%
  mutate(residual = rating - (mu + b_i)) %>%
  arrange(desc(abs(residual))) %>%  
  slice(1:10)
##    userId movieId rating       b_i  residual
## 1     174    1252    0.5 0.8159500 -3.858209
## 2     578      50    0.5 0.8028497 -3.845109
## 3     304     527    0.5 0.7404513 -3.782710
## 4     394     527    0.5 0.7404513 -3.782710
## 5     410     527    0.5 0.7404513 -3.782710
## 6      20     296    0.5 0.7303163 -3.772575
## 7     457    1196    0.5 0.7258570 -3.768116
## 8     578    1196    0.5 0.7258570 -3.768116
## 9     310    1276    0.5 0.7196458 -3.761905
## 10    133     111    0.5 0.7054474 -3.747706

Top 10 best movies (ranked by \(b_i\)).
These are unknown movies

bi %>% 
  #inner_join(by = "movieId") %>% 
  arrange(-b_i) %>% 
  select(movieId) %>%
  head(10)
## # A tibble: 10 x 1
##    movieId
##      <int>
##  1      53
##  2     183
##  3     301
##  4     309
##  5     559
##  6     702
##  7     759
##  8     764
##  9     820
## 10     845

Top 10 worst movies (ranked by \(b_i\)), also unknown movies:

bi %>% 
  #inner_join(titles, by = "movieId") %>% 
  arrange(b_i) %>% 
  select(movieId) %>%
  head(10) 
## # A tibble: 10 x 1
##    movieId
##      <int>
##  1    1311
##  2    2191
##  3    2845
##  4    3883
##  5    3933
##  6    4051
##  7    4445
##  8    4471
##  9    4684
## 10    4753

Number of ratings for 10 best movies:

train_set %>% 
  left_join(bi, by = "movieId") %>%
  arrange(desc(b_i)) %>% 
  group_by(movieId) %>% 
  summarise(count = n()) %>% 
  slice(1:10)
## # A tibble: 10 x 2
##    movieId count
##      <int> <int>
##  1       1   221
##  2       2    97
##  3       3    50
##  4       4    11
##  5       5    55
##  6       6    94
##  7       7    48
##  8       8     5
##  9       9    17
## 10      10   113
train_set %>% count(movieId) %>% 
  left_join(bi, by="movieId") %>% 
  arrange(desc(b_i)) %>% 
  slice(1:10) %>% 
  pull(n)
##  [1] 1 1 1 3 1 1 1 1 1 1

Regularization

Now, we regularize the user and movie effects adding a penalty factor \(\lambda\), which is a tuning parameter. We define a number of values for \(\lambda\) and use the regularization function to pick the best value that minimizes the RMSE.

The linear model provides a good estimation for the ratings, but doesn’t consider that many movies have very few number of ratings, and some users rate very few movies. This means that the sample size is very small for these movies and these users. Statistically, this leads to large estimated error.

The estimated value can be improved adding a factor that penalizes small sample sizes and have have little or no impact otherwise. Thus, estimated movie and user effects can be calculated with these formulas:

\[\hat b_i=\frac{1}{n_i+\lambda}\sum_{u=1}^{n_i}(y_{u,i}-\hat \mu)\]

\[\hat b_u=\frac{1}{n_u+\lambda}\sum_{i=1}^{n_u}( y_{u,i}-\hat b_i-\hat \mu)\]

For values of \(N\) smaller than or similar to \(\lambda\), \(\hat b_i\) and \(\hat b_u\) is smaller than the original values, whereas for values of \(N\) much larger than \(\lambda\), \(\hat b_i\) and \(\hat b_u\) change very little.

An effective method to choose \(\lambda\) that minimizes the RMSE is running simulations with several values of \(\lambda\).

regularization <- function(lambda, trainset, testset){

  # Mean
  mu <- mean(trainset$rating)

  # Movie effect (bi)
  b_i <- trainset %>% 
    group_by(movieId) %>%
    summarize(b_i = sum(rating - mu)/(n()+lambda))

  # User effect (bu)  
  b_u <- trainset %>% 
    left_join(b_i, by="movieId") %>%
    filter(!is.na(b_i)) %>%
    group_by(userId) %>%
    summarize(b_u = sum(rating - b_i - mu)/(n()+lambda))

  # Prediction: mu + bi + bu  
  predicted_ratings <- testset %>% 
    left_join(b_i, by = "movieId") %>%
    left_join(b_u, by = "userId") %>%
    filter(!is.na(b_i), !is.na(b_u)) %>%
    mutate(pred = mu + b_i + b_u) %>%
    pull(pred)
  
  return(RMSE(predicted_ratings, testset$rating))
}

# Define a set of lambdas to tune
lambdas <- seq(0, 10, 0.25)

# Update RMSES table
rmses <- sapply(lambdas, 
                regularization, 
                trainset = train_set, 
                testset = test_set)

# Plot the lambda x RMSE
tibble(Lambda = lambdas, RMSE = rmses) %>%
  ggplot(aes(x = Lambda, y = RMSE)) +
    geom_point(size=2) +
    ggtitle("Regularization", 
            subtitle = "Pick the penalization that gives the lowest RMSE.") +
    theme_solarized_2()
Choose the lambda that minimizes RMSE.

Choose the lambda that minimizes RMSE.

Next, we apply the best \(\lambda\) to the linear model.

# We pick the lambda that returns the lowest RMSE.
lambda <- lambdas[which.min(rmses)]

# Then, we calculate the predicted rating using the best parameters 
# achieved from regularization.  
mu <- mean(train_set$rating)

# Movie effect (bi)
b_i <- train_set %>% 
  group_by(movieId) %>%
  summarize(b_i = sum(rating - mu)/(n()+lambda))

# User effect (bu)
b_u <- train_set %>% 
  left_join(b_i, by="movieId") %>%
  group_by(userId) %>%
  summarize(b_u = sum(rating - b_i - mu)/(n()+lambda))

# Prediction
y_hat_reg <- test_set %>% 
  left_join(b_i, by = "movieId") %>%
  left_join(b_u, by = "userId") %>%
  mutate(pred = mu + b_i + b_u) %>%
  pull(pred)

# Update the result table
result <- bind_rows(result, 
                    tibble(Method = "Regularized bi and bu", 
                           RMSE = RMSE(test_set$rating, y_hat_reg),
                           MSE  = MSE(test_set$rating, y_hat_reg),
                           MAE  = MAE(test_set$rating, y_hat_reg)))

# Regularization made a small improvement in RMSE.  
result
## # A tibble: 6 x 4
##   Method                 RMSE    MSE    MAE
##   <chr>                 <dbl>  <dbl>  <dbl>
## 1 Project Goal          0.865 NA     NA    
## 2 Random prediction     1.49   2.22   1.15 
## 3 Mean                  1.05   1.11   0.847
## 4 Mean + bi             0.986  0.972  0.766
## 5 Mean + bi + bu        0.906  0.820  0.696
## 6 Regularized bi and bu 0.888  0.789  0.684

Matrix Factorization

Matrix factorization approximates a large user-movie matrix into the product of two smaller dimension matrices. Information in the train set is stored in tidy format, with one observation per row, so it needs to be converted to the user-movie matrix before using matrix factorization. This code executes this transformation.

Matrix factorization is widely used machine learning tool for predicting ratings in recommendation systems. This method became widely known during the Netflix Prize challenge1.

The data can be converted into a matrix such that each user is in a row, each movie is in a column and the rating is in the cell, then the algorithm attempts to fill in the missing values. The table below provides a simple example of a \(4\times 5\) matrix.

movie 1 movie 2 movie 3 movie 4 movie 5
user 1 ? ? 4 ? 3
user 2 2 ? ? 4 ?
user 3 ? 3 ? ? 5
user 4 3 ? 2 ? ?

The concept is to approximate a large rating matrix \(R_{m\times n}\) into the product of two lower dimension matrices \(P_{k\times m}\) and \(Q_{k\times n}\), such that

\[R\approx {P}' Q\]

train_data <- train_set %>% 
  select(userId, movieId, rating) %>% 
  spread(movieId, rating) %>% 
  as.matrix()

The code above uses more memory than a commodity laptop is able to process, so I used an alternative method: the recosystem package, which provides the complete solution for a recommendation system using matrix factorization.

The package vignette2 describes how to use recosystem:

Usage of recosystem

The usage of recosystem is quite simple, mainly consisting of the following steps:

  1. Create a model object (a Reference Class object in R) by calling Reco().
  2. (Optionally) call the $tune() method to select best tuning parameters along a set of candidate values.
  3. Train the model by calling the $train() method. A number of parameters can be set inside the function, possibly coming from the result of $tune().
  4. (Optionally) export the model via $output(), i.e. write the factorization matrices \(P\) and \(Q\) into files or return them as \(R\) objects.
  5. Use the $predict() method to compute predicted values.
if(!require(recosystem)) 
  install.packages("recosystem", repos = "http://cran.us.r-project.org")
set.seed(123, sample.kind = "Rounding") # This is a randomized algorithm
library("recosystem")
# Convert the train and test sets into recosystem input format
train_data <-  with(train_set, data_memory(user_index = userId, 
                                           item_index = movieId, 
                                           rating     = rating))
test_data  <-  with(test_set,  data_memory(user_index = userId, 
                                           item_index = movieId, 
                                           rating     = rating))

# Create the model object
r <-  recosystem::Reco()

# Select the best tuning parameters
opts <- r$tune(train_data, opts = list(dim = c(10, 20, 30), 
                                       lrate = c(0.1, 0.2),
                                       costp_l2 = c(0.01, 0.1), 
                                       costq_l2 = c(0.01, 0.1),
                                       nthread  = 4, niter = 10))

 # Train the algorithm  
r$train(train_data, opts = c(opts$min, nthread = 4, niter = 20))
## iter      tr_rmse          obj
##    0       1.3298   2.3151e+05
##    1       0.9296   1.4654e+05
##    2       0.8975   1.3888e+05
##    3       0.8722   1.3365e+05
##    4       0.8483   1.2949e+05
##    5       0.8215   1.2541e+05
##    6       0.7931   1.2143e+05
##    7       0.7646   1.1771e+05
##    8       0.7393   1.1455e+05
##    9       0.7154   1.1164e+05
##   10       0.6953   1.0942e+05
##   11       0.6777   1.0769e+05
##   12       0.6609   1.0597e+05
##   13       0.6469   1.0458e+05
##   14       0.6337   1.0331e+05
##   15       0.6230   1.0240e+05
##   16       0.6132   1.0148e+05
##   17       0.6032   1.0070e+05
##   18       0.5945   9.9808e+04
##   19       0.5875   9.9332e+04
# Calculate the predicted values  
y_hat_reco <-  r$predict(test_data, out_memory())
knitr::kable(head(y_hat_reco, 10))
x
3.560642
3.778561
3.333235
2.905037
2.613510
3.758197
3.089734
4.276941
3.981087
3.219093

Matrix factorization improved substantially the RMSE.

result <- bind_rows(result, 
                    tibble(Method = "Matrix Factorization - recosystem", 
                           RMSE = RMSE(test_set$rating, y_hat_reco),
                           MSE  = MSE(test_set$rating, y_hat_reco),
                           MAE  = MAE(test_set$rating, y_hat_reco)))
result
## # A tibble: 7 x 4
##   Method                             RMSE    MSE    MAE
##   <chr>                             <dbl>  <dbl>  <dbl>
## 1 Project Goal                      0.865 NA     NA    
## 2 Random prediction                 1.49   2.22   1.15 
## 3 Mean                              1.05   1.11   0.847
## 4 Mean + bi                         0.986  0.972  0.766
## 5 Mean + bi + bu                    0.906  0.820  0.696
## 6 Regularized bi and bu             0.888  0.789  0.684
## 7 Matrix Factorization - recosystem 0.922  0.850  0.717

Final Validation

As we can see from the result table, regularization and matrix factorization achieved the target RMSE. So, finally we train the complete movies set with both models and calculate the RMSE in the validation set. The project goal is achieved if the RMSE stays below the target.

Linear Model With Regularization

During the training and testing phase, the linear model with regularization achieved the target RMSE with a small margin. Here we do the final validation with the validation set.

mu_move <- mean(movies$rating)

# Movie effect (bi)
b_i_mov <- movies %>% 
  group_by(movieId) %>%
  summarize(b_i = sum(rating - mu_move)/(n()+lambda))

# User effect (bu)
b_u_mov <- movies %>% 
  left_join(b_i_mov, by="movieId") %>%
  group_by(userId) %>%
  summarize(b_u = sum(rating - b_i - mu_move)/(n()+lambda))

# Prediction
y_hat_mov <- validation %>% 
  left_join(b_i_mov, by = "movieId") %>%
  left_join(b_u_mov, by = "userId") %>%
  mutate(pred = mu_move + b_i + b_u) %>%
  pull(pred)

# Update the results table
result <- bind_rows(result, 
                    tibble(Method = "Final Regularization (edx vs validation)", 
                           RMSE = RMSE(validation$rating, y_hat_mov),
                           MSE  = MSE(validation$rating, y_hat_mov),
                           MAE  = MAE(validation$rating, y_hat_mov)))

# Show the RMSE improvement
knitr::kable(result)
Method RMSE MSE MAE
Project Goal 0.8649000 NA NA
Random prediction 1.4895642 2.2188017 1.1451446
Mean 1.0547663 1.1125319 0.8472556
Mean + bi 0.9860457 0.9722861 0.7661784
Mean + bi + bu 0.9056371 0.8201786 0.6961587
Regularized bi and bu 0.8879858 0.7885188 0.6839101
Matrix Factorization - recosystem 0.9217162 0.8495607 0.7169659
Final Regularization (edx vs validation) 0.8404891 0.7064220 0.6470983

As expected, the RMSE calculated on the validation set (0.8404891) is lower than the target of 0.8649 and slightly higher than the RMSE of the test set (0.8879858).

Top 10 best movies

validation %>% 
  left_join(b_i_mov, by = "movieId") %>%
  left_join(b_u_mov, by = "userId") %>% 
  mutate(pred = mu_move + b_i + b_u) %>% 
  arrange(-pred) %>% 
  group_by(movieId) %>% 
  select(movieId) %>%
  head(10)
## # A tibble: 10 x 1
## # Groups:   movieId [9]
##    movieId
##      <int>
##  1   58559
##  2   98124
##  3      73
##  4     318
##  5   59018
##  6   80463
##  7     318
##  8    3868
##  9    1203
## 10     608

Top 10 worst movies

validation %>% 
  left_join(b_i_mov, by = "movieId") %>%
  left_join(b_u_mov, by = "userId") %>% 
  mutate(pred = mu_move + b_i + b_u) %>% 
  arrange(pred) %>% 
  group_by(movieId) %>% 
  select(movieId) %>%
  head(10)
## # A tibble: 10 x 1
## # Groups:   movieId [10]
##    movieId
##      <int>
##  1   65802
##  2    3593
##  3    1902
##  4     673
##  5    2191
##  6    4753
##  7   32153
##  8   46574
##  9    6503
## 10     546

Matrix Factorization

The initial test shows that matrix factorization gives the best RMSE. Now it’s time to validate with the entire movies and validation sets.

set.seed(1234, sample.kind = "Rounding")

# Convert 'movies' and 'validation' sets to recosystem input format
mov_reco <-  with(movies, data_memory(user_index = userId, 
                                   item_index = movieId, 
                                   rating = rating))
validation_reco  <-  with(validation, data_memory(user_index = userId, 
                                                  item_index = movieId, 
                                                  rating = rating))

# Create the model object
r <-  recosystem::Reco()

# Tune the parameters
opts <-  r$tune(mov_reco, opts = list(dim = c(10, 20, 30), 
                                     lrate = c(0.1, 0.2),
                                     costp_l2 = c(0.01, 0.1), 
                                     costq_l2 = c(0.01, 0.1),
                                     nthread  = 4, niter = 10))

# Train the model
r$train(mov_reco, opts = c(opts$min, nthread = 4, niter = 20))
## iter      tr_rmse          obj
##    0       1.2931   2.3886e+05
##    1       0.9120   1.5103e+05
##    2       0.8869   1.4461e+05
##    3       0.8693   1.4043e+05
##    4       0.8516   1.3649e+05
##    5       0.8326   1.3291e+05
##    6       0.8130   1.2990e+05
##    7       0.7920   1.2652e+05
##    8       0.7715   1.2374e+05
##    9       0.7526   1.2098e+05
##   10       0.7352   1.1878e+05
##   11       0.7205   1.1696e+05
##   12       0.7073   1.1532e+05
##   13       0.6947   1.1384e+05
##   14       0.6834   1.1258e+05
##   15       0.6738   1.1146e+05
##   16       0.6641   1.1031e+05
##   17       0.6560   1.0955e+05
##   18       0.6485   1.0879e+05
##   19       0.6411   1.0799e+05
# Calculate the prediction
y_hat_final_reco <-  r$predict(validation_reco, out_memory())

# Update the result table
result <- bind_rows(result, 
                    tibble(Method = "Final Matrix Factorization - recosystem", 
                           RMSE = RMSE(validation$rating, y_hat_final_reco),
                           MSE  = MSE(validation$rating, y_hat_final_reco),
                           MAE  = MAE(validation$rating, y_hat_final_reco)))

The final RMSE with matrix factorization is 0.6244672, 25.7% better than the linear model with regularization (0.8404891).

# Show the RMSE improvement
knitr::kable(result )
Method RMSE MSE MAE
Project Goal 0.8649000 NA NA
Random prediction 1.4895642 2.2188017 1.1451446
Mean 1.0547663 1.1125319 0.8472556
Mean + bi 0.9860457 0.9722861 0.7661784
Mean + bi + bu 0.9056371 0.8201786 0.6961587
Regularized bi and bu 0.8879858 0.7885188 0.6839101
Matrix Factorization - recosystem 0.9217162 0.8495607 0.7169659
Final Regularization (edx vs validation) 0.8404891 0.7064220 0.6470983
Final Matrix Factorization - recosystem 0.6244672 0.3899593 0.4655513

Now, let’s check the best and worst movies predicted with matrix factorization.

Top 10 best movies:

tibble(movieId = validation$movieId, rating = y_hat_final_reco) %>%
  arrange(-rating) %>% 
  group_by(movieId) %>% 
  select(movieId) %>%
  head(10)
## # A tibble: 10 x 1
## # Groups:   movieId [9]
##    movieId
##      <int>
##  1    2064
##  2    8970
##  3  106489
##  4   58559
##  5    1291
##  6     318
##  7     523
##  8     527
##  9     318
## 10      73

Top 10 worst movies:

tibble(movieId = validation$movieId, rating = y_hat_final_reco) %>%
  arrange(rating) %>% 
  group_by(movieId) %>% 
  select(movieId) %>%
  head(10)
## # A tibble: 10 x 1
## # Groups:   movieId [10]
##    movieId
##      <int>
##  1    1325
##  2    6252
##  3  130448
##  4   51304
##  5   65802
##  6    4753
##  7    7448
##  8   90870
##  9   95443
## 10   32153

Conclusion

We started collecting and preparing the dataset for analysis, then we explored the information seeking for insights that might help during the model building.

Next, we created a random model that predicts the rating based on the probability distribution of each rating. This model gives the worst result.

We started the linear model with a very simple model which is just the mean of the observed ratings. From there, we added movie and user effects, that models the user behavior and movie distribution. With regularization we added a penalty value for the movies and users with few number of ratings. The linear model achieved the RMSE of 0.8404891, successfully passing the target of 0.8649.

Finally, we evaluated the recosystem package that implements the LIBMF algorithm, and achieved the RMSE of 0.6244672.

4.1 Limitations

Only two predictors are used, the movie and user information, not considering other features. Modern recommendation system models use many predictors, such as genres, bookmarks, playlists, etc.

:+1: T H A N K Y O U :sparkles: :camel:


  1. https://www.netflixprize.com/↩︎

  2. https://cran.r-project.org/web/packages/recosystem/vignettes/introduction.html↩︎