A recommendation system is an information filtering system that provides suggestions to users on specific courses of action to take. In the same vein, a movie recommendation system recommends movies to users, on the assumption that the user is likely to rate the recommended titles highly.
A movie recommendation system ‘learns’ what to suggest to a specific user by being ‘fed’ past, actual ratings data. In this capstone, I build one such recommendation system from a data set of 10 million ratings provided by GroupLens Research (https://grouplens.org/datasets/movielens/). Each rating comes with additional data: the movie title and release year, the movie ID, the anonymized ID of the user who provided the rating, and the date when the rating was given.
This system is constructed as follows: (1) data cleaning is performed to capture and remove NAs and duplicate rows, (2) data exploration and visualization are conducted to tease out insights on the most useful predictors, (3) a series of recommendation systems (i.e., algorithms) is generated and tested to identify the best, and (4) conclusions are drawn, including limitations and possibilities for future work.
Each algorithm is tested by having it predict ratings on a hold-out data set (the “validation set”), and I target a final Root Mean Squared Error (RMSE) below 0.86490. The RMSE is a performance metric that gauges a recommendation system’s predictive quality by (essentially) aggregating how far the system’s predictions fall from the actual ratings in a test set.
The methods section is split into three parts: (1) data observation, (2) cleaning, and (3) exploration. Note that the code used to produce the quiz results is in the submitted R script and is not covered here.
I first observe the data to make preliminary sense of it. This is done through the as_tibble() command.
edx %>% as_tibble()
## # A tibble: 9,000,055 × 6
## userId movieId rating timestamp title genres
## <int> <int> <dbl> <int> <chr> <chr>
## 1 1 122 5 838985046 Boomerang (1992) Comed…
## 2 1 185 5 838983525 Net, The (1995) Actio…
## 3 1 292 5 838983421 Outbreak (1995) Actio…
## 4 1 316 5 838983392 Stargate (1994) Actio…
## 5 1 329 5 838983392 Star Trek: Generations (1994) Actio…
## 6 1 355 5 838984474 Flintstones, The (1994) Child…
## 7 1 356 5 838983653 Forrest Gump (1994) Comed…
## 8 1 362 5 838984885 Jungle Book, The (1994) Adven…
## 9 1 364 5 838983707 Lion King, The (1994) Adven…
## 10 1 370 5 838984596 Naked Gun 33 1/3: The Final Insult (1… Actio…
## # … with 9,000,045 more rows
The data comprises 6 columns and 9,000,055 rows: the user ID, the movie ID, the rating allocated, the rating timestamp, the movie title, and the genre(s) the movie belongs to. With each observation in its own row and each variable in its own column, the data looks tidy. Next, we inspect the characteristics of each column and check for NAs.
summary(edx)
## userId movieId rating timestamp
## Min. : 1 Min. : 1 Min. :0.500 Min. :7.897e+08
## 1st Qu.:18124 1st Qu.: 648 1st Qu.:3.000 1st Qu.:9.468e+08
## Median :35738 Median : 1834 Median :4.000 Median :1.035e+09
## Mean :35870 Mean : 4122 Mean :3.512 Mean :1.033e+09
## 3rd Qu.:53607 3rd Qu.: 3626 3rd Qu.:4.000 3rd Qu.:1.127e+09
## Max. :71567 Max. :65133 Max. :5.000 Max. :1.231e+09
## title genres
## Length:9000055 Length:9000055
## Class :character Class :character
## Mode :character Mode :character
sum(duplicated(edx))
## [1] 0
The output suggests that there are no NAs, and no duplicate rows are spotted. The data therefore does not require further cleaning at this point.
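Because summary() does not report NAs for the character columns (title and genres), a quick explicit count per column (a minimal sketch, reusing the edx object already in memory) can confirm this:
sapply(edx, function(x) sum(is.na(x)))
This should return zero for every column if, as the summary suggests, the data is complete.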
As framed in our previous machine learning course, think of the task as follows: not every user rated every movie, so how would they have rated the movies they did not watch? Each rating y has a different set of predictors. To predict the rating for movie i by user u, we need to consider all ratings for movie i (and for movies similar to i), and all ratings by user u (and by similar users). In effect, the entire training set serves as predictors.
With this in mind, we first look at how ratings are distributed across movies and across users.
#count number of ratings per movie (x axis = # of ratings a movie received, y axis = number of movies)
edx %>%
  dplyr::count(movieId) %>%
  ggplot(aes(n)) +
  geom_histogram(bins = 30, color = "black") +
  scale_x_log10() +
  ggtitle("Number of ratings per movie") +
  labs(x = "# of ratings a movie received", y = "number of instances")
In the first plot, on the left of the x-axis there are ~100 movies (y) that received only 1 rating. Towards the middle, there are ~700 movies (y) that received ~100 ratings, and on the right of the x-axis there are ~10 movies (y) that received ~50,000 ratings. In other words, a typical movie receives on the order of 100 ratings, with extreme cases, e.g. blockbuster movies, receiving tens of thousands. Clearly, some movies get rated far more than others.
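To put a number on the typical case, the per-movie counts can be summarized directly (a small sketch on the same edx object; results not shown here):
edx %>% dplyr::count(movieId) %>% summarize(median_n = median(n), max_n = max(n))
The median should sit near the ~100-rating mode visible in the histogram.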
##count number of ratings given per user (x axis = number of ratings given by a user, y axis = number of users)
edx %>%
  dplyr::count(userId) %>%
  ggplot(aes(n)) +
  geom_histogram(bins = 30, color = "black") +
  scale_x_log10() +
  ggtitle("Users") +
  labs(x = "# of ratings given by a user", y = "number of instances")
In the second plot, on the left of the x-axis there are ~100 users (y) who gave ~20 ratings (x). In the middle, there are ~6,800 users who gave ~50 ratings, and at the other extreme of the x-axis there are very few users with more than 1,000 ratings. In other words, a typical user gives around 50 ratings, with extreme cases of ~1,000 or more. Clearly, some users give more ratings than others.
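The analogous check per user (again a sketch, results not shown):
edx %>% dplyr::count(userId) %>% summarize(median_n = median(n), max_n = max(n))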
What else can impact movie ratings? I believe the ‘time gap’ between a movie’s release date and its rating date may play a part. People may watch movies released long ago precisely because they are classics or critically acclaimed, so it is possible that the longer the gap between release and rating, the higher the rating. Let us examine this hypothesis.
##from our training set edx, extract the rating year and the release year, and compute the gap
edx2 <- edx %>%
  mutate(rating_year = as.numeric(year(as_datetime(timestamp))),
         year_released = as.numeric(str_sub(title, start = -5, end = -2)),
         gap = rating_year - year_released)
#Does watching older movies mean that we tend to give better ratings?
#we need a plot of y (mean rating that a movie receives) vs x (time gap between movie release and actual watch date)
#Thus, for each x-value, I need to calculate the mean rating.
plottable <- edx2 %>% group_by(gap) %>% summarize(mean=mean(rating))
#Then, plot a trendline to see if there are any relationships.
plottable %>% filter(gap >= 0) %>%
  ggplot(aes(x = gap, y = mean)) +
  geom_point() +
  geom_smooth(method = "loess") +
  ggtitle("Mean Rating vs gap between release date and rating date") +
  labs(y = "Mean Movie Rating", x = "# of years between release and rating dates") +
  theme(plot.title = element_text(hjust = 0.5))
## `geom_smooth()` using formula = 'y ~ x'
This plot shows that the larger the time gap, the higher the mean rating (a difference of ~0.4 points). While the trend levels off at a gap of 50 years and plummets at ~60+, it still suggests a ‘time gap effect’, which we can build into our model to see whether it decreases the RMSE.
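One simple way to back this up numerically (a sketch using the plottable summary created above) is the correlation between the gap and the mean rating; a clearly positive value would support the hypothesis:
plottable %>% filter(gap >= 0) %>% summarize(correlation = cor(gap, mean))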
I take an iterative approach to modelling: start with a simple predictive model, then add predictors (movie, user, and gap effects) step-wise, checking whether each enhances the predictive capability (i.e., reduces the RMSE). Depending on whether I clear the target RMSE, I will further regularize these predictors (a concept discussed later).
We first create a loss function that represents the Root Mean Squared Error. Writing \(y_{u,i}\) for the rating of movie i by user u and \(\hat{y}_{u,i}\) for its prediction, over the N ratings in the test set: \[RMSE = \sqrt{\frac{1}{N} \sum_{u,i} \left(\hat{y}_{u,i} - y_{u,i}\right)^2}\]
Its R equivalent is coded below. Recall that our goal is to build a predictive model that yields an RMSE below 0.86490.
RMSE <- function(true_ratings, predicted_ratings){
sqrt(mean((true_ratings - predicted_ratings)^2))
}
With the ‘yardstick’ defined, we now establish a simple baseline model that predicts the rating for any movie i by an arbitrary user u. The model is naive: all ratings are assumed to take on one single value, with the differences explained by random variation.
\[Y_{u,i} = \mu + \epsilon_{u,i}\]
where \(\mu\) is the true rating for all movies and \(\epsilon_{u,i}\) is the error term.
Therefore, the estimate for any rating \(Y_{u,i}\) is \(\hat{\mu}\), the estimated rating for all movies. The estimate that minimizes the RMSE is the average rating in the training dataset, computed with a simple ‘mean’ call in R.
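Formally (a standard least-squares fact, stated here for completeness), the average is the value of \(\mu\) that minimizes the sum of squared residuals: \[\hat{\mu} = \operatorname*{arg\,min}_{\mu} \sum_{u,i} \left(y_{u,i} - \mu\right)^2 = \frac{1}{N} \sum_{u,i} y_{u,i}\]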
mu <- mean(edx$rating)
naive_rmse <- RMSE(test$rating,mu)
naive_rmse
## [1] 1.061202
We see that the RMSE is 1.06, which is far off from the intended target.
#Tabulate Score
RMSECompiled <- tibble(Model="Naive", RMSE=naive_rmse)
RMSECompiled
## # A tibble: 1 × 2
## Model RMSE
## <chr> <dbl>
## 1 Naive 1.06
There is room to create a more accurate predictive model. Considering all other ratings for movie i and all other ratings provided by user u should yield a more accurate prediction, since it is likely that (1) movies rated to be of a certain caliber will continue to be rated as such, and (2) a user’s rating pattern (whether strict or lenient) will remain consistent across movies. These considerations take the form of additional terms that raise or lower the predicted rating, referred to below as the “movie” and “user” effects.
Adding in the movie effect term \(b_{i}\), the predictive model becomes: \[Y_{u,i} = \mu + b_{i} + \epsilon_{u,i}\] We now estimate \(b_{i}\) from the training set: for each movie, it is the average of that movie’s ratings minus the overall average \(\hat{\mu}\) calculated earlier.
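In symbols, with \(n_{i}\) the number of training ratings for movie i (this matches the mean(rating - mu) computation in the code below): \[\hat{b}_{i} = \frac{1}{n_{i}} \sum_{u=1}^{n_{i}} \left(y_{u,i} - \hat{\mu}\right)\]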
#3.2. Try our second model, i.e., calculate movie effects
b_i <- edx %>% group_by(movieId) %>% summarize(b_i = mean(rating - mu))
#3.2.1. RMSE with movie effects
movieonlyrmse <- test %>% left_join(b_i, by = "movieId") %>%
  mutate(pred = mu + b_i) %>%
  summarize(rmse = RMSE(rating, pred))
movieonlyrmse
## rmse
## 1 0.9439087
#Tabulate Score
RMSECompiled <- RMSECompiled %>% add_row(Model="Movie Effect",RMSE=movieonlyrmse[1,1])
as.data.frame(RMSECompiled)
## Model RMSE
## 1 Naive 1.0612018
## 2 Movie Effect 0.9439087
There is a marked improvement; the RMSE drops to 0.9439. Let us factor in the user effect as well.
With the user effect included on top of the movie effect, the predictive model becomes: \[Y_{u,i} = \mu + b_{i} + b_{u} + \epsilon_{u,i}\] The user effect is estimated by first joining in the movie effect (from 3.2), then averaging, for each user, the residual of the rating minus \(\hat{\mu}\) minus \(\hat{b}_{i}\).
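In symbols, with \(n_{u}\) the number of training ratings by user u (matching the mean(rating - mu - b_i) computation below): \[\hat{b}_{u} = \frac{1}{n_{u}} \sum_{i=1}^{n_{u}} \left(y_{u,i} - \hat{\mu} - \hat{b}_{i}\right)\]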
#3.3. calculate user effects
b_u <- edx %>% left_join(b_i, by = "movieId") %>% group_by(userId) %>%
  summarize(b_u = mean(rating - mu - b_i))
#3.3.1. RMSE with movie + user effects
movieanduserrmse <- test %>% left_join(b_i, by = "movieId") %>%
  left_join(b_u, by = "userId") %>%
  mutate(pred = mu + b_i + b_u) %>%
  summarize(rmse = RMSE(rating, pred))
movieanduserrmse
## rmse
## 1 0.8653488
movieanduserrmsereport <- movieanduserrmse[1,1]
#Tabulate Score
RMSECompiled <- RMSECompiled %>% add_row(Model="Movie + User Effect",RMSE=movieanduserrmse[1,1])
as.data.frame(RMSECompiled)
## Model RMSE
## 1 Naive 1.0612018
## 2 Movie Effect 0.9439087
## 3 Movie + User Effect 0.8653488
The RMSE is 0.86535, which is a marked improvement but still short of the target 0.86490.
We saw earlier that there is a basis for the time-gap effect. Let us add this term \(b_{g}\) to the predictive model: \[Y_{u,i} = \mu + b_{i} + b_{u} + b_{g} + \epsilon_{u,i}\] The time gap term is calculated with the code below. Note that I use the ‘transformed’ edx2 set generated earlier in the exploration section, which includes the additional columns for rating year, release year, and the time gap between them.
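In symbols, with \(n_{g}\) the number of training ratings whose time gap equals g: \[\hat{b}_{g} = \frac{1}{n_{g}} \sum_{u,i:\; g_{u,i} = g} \left(y_{u,i} - \hat{\mu} - \hat{b}_{i} - \hat{b}_{u}\right)\]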
#3.4. RMSE with movie + user + gap effects
b_g <- edx2 %>% left_join(b_i, by = "movieId") %>% left_join(b_u, by = "userId") %>%
  group_by(gap) %>% summarize(b_g = mean(rating - mu - b_i - b_u))
#3.4.1. first, mutate the test set to include gap data as well
test2 <- test %>%
  mutate(rating_year = as.numeric(year(as_datetime(timestamp))),
         year_released = as.numeric(str_sub(title, start = -5, end = -2)),
         gap = rating_year - year_released)
#then, join b_i, b_u, and b_g and calculate the RMSE
movieuserandgaprmse <- test2 %>% left_join(b_i, by = "movieId") %>%
  left_join(b_u, by = "userId") %>%
  left_join(b_g, by = "gap") %>%
  mutate(pred = mu + b_i + b_u + b_g) %>%
  summarize(rmse = RMSE(rating, pred))
movieuserandgaprmse
## rmse
## 1 0.8649038
#Tabulate Score
RMSECompiled <- RMSECompiled %>% add_row(Model="Movie + User + Gap Effect",RMSE=movieuserandgaprmse[1,1])
as.data.frame(RMSECompiled)
## Model RMSE
## 1 Naive 1.0612018
## 2 Movie Effect 0.9439087
## 3 Movie + User Effect 0.8653488
## 4 Movie + User + Gap Effect 0.8649038
Factoring in the gap effect, the RMSE has dropped even further, but it remains slightly above our 0.86490 yardstick.
To regularize, in simple terms, means to alter the equation being minimized by adding penalty terms. These penalties shrink large estimates that arise from small sample sizes (e.g., few ratings for a particular movie, or few ratings provided by a particular user). The penalty involves an unknown tuning parameter \(\lambda\), which we will identify using cross-validation. See http://rafalab.dfci.harvard.edu/dsbook/large-datasets.html#regularization for an elaboration of the mathematical implications.
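Concretely, for the movie effect the regularized estimate becomes (mirroring the sum(rating - mu)/(n() + l) computation in the code below, and the formulation in the reference above): \[\hat{b}_{i}(\lambda) = \frac{1}{\lambda + n_{i}} \sum_{u=1}^{n_{i}} \left(y_{u,i} - \hat{\mu}\right)\] When \(n_{i}\) is large the penalty barely matters; when \(n_{i}\) is small, the estimate is shrunk towards zero.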
First, we will build a predictive model with regularization of the movie and user effects. This model tries a series of lambda values, calculates the regularized movie and user effects, then the RMSE, then identifies which lambda minimizes the RMSE.
#3.5. trying regularized (movie + user effect)
lambdas <- seq(0, 10, 0.25)
rmses <- sapply(lambdas, function(l){
  mu <- mean(edx$rating)
  # regularized movie effect: dividing by (n() + l) shrinks estimates with few ratings
  b_i <- edx %>%
    group_by(movieId) %>%
    dplyr::summarize(b_i = sum(rating - mu)/(n() + l))
  # regularized user effect, computed on the residuals after the movie effect
  b_u <- edx %>%
    left_join(b_i, by = "movieId") %>%
    group_by(userId) %>%
    dplyr::summarize(b_u = sum(rating - b_i - mu)/(n() + l))
  predicted_ratings <- test %>%
    left_join(b_i, by = "movieId") %>%
    left_join(b_u, by = "userId") %>%
    mutate(pred = mu + b_i + b_u) %>%
    .$pred
  return(RMSE(predicted_ratings, test$rating))
})
qplot(lambdas, rmses)
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
lambda <- lambdas[which.min(rmses)]
lambda
## [1] 5.25
The plot shows that the best lambda value is 5.25, and the RMSE drops below 0.86490.
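Incidentally, the warning above flags that qplot() is deprecated in recent ggplot2 releases; an equivalent plot (a sketch producing the same point plot) can be drawn with:
ggplot(data.frame(lambda = lambdas, rmse = rmses), aes(lambda, rmse)) + geom_point()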
#Tabulate Score
RMSECompiled <- RMSECompiled %>% add_row(Model="Regularized (Movie + User Effect)",RMSE=min(rmses))
as.data.frame(RMSECompiled)
## Model RMSE
## 1 Naive 1.0612018
## 2 Movie Effect 0.9439087
## 3 Movie + User Effect 0.8653488
## 4 Movie + User + Gap Effect 0.8649038
## 5 Regularized (Movie + User Effect) 0.8648170
While we have achieved our desired RMSE of below 0.86490, let us see if regularizing the gap effect helps to reduce the RMSE even further.
#3.6. trying regularized (movie + user + gap effect)
lambdas <- seq(0, 10, 0.25)
rmses2 <- sapply(lambdas, function(l){
  mu <- mean(edx$rating)
  b_i <- edx %>%
    group_by(movieId) %>%
    dplyr::summarize(b_i = sum(rating - mu)/(n() + l))
  b_u <- edx %>%
    left_join(b_i, by = "movieId") %>%
    group_by(userId) %>%
    dplyr::summarize(b_u = sum(rating - b_i - mu)/(n() + l))
  # regularized gap effect, using the edx2 set that carries the gap column
  b_g <- edx2 %>%
    left_join(b_i, by = "movieId") %>%
    left_join(b_u, by = "userId") %>%
    group_by(gap) %>%
    dplyr::summarize(b_g = sum(rating - b_i - b_u - mu)/(n() + l))
  predicted_ratings <- test2 %>%
    left_join(b_i, by = "movieId") %>%
    left_join(b_u, by = "userId") %>%
    left_join(b_g, by = "gap") %>%
    mutate(pred = mu + b_i + b_u + b_g) %>%
    .$pred
  # test2 holds the same ratings as test, plus the gap columns
  return(RMSE(predicted_ratings, test2$rating))
})
qplot(lambdas, rmses2)
lambda2 <- lambdas[which.min(rmses2)]
lambda2
## [1] 5.5
The optimal lambda value has shifted slightly, and as shown below the RMSE drops even further. This is clearly our best model. Let us review all the models produced so far, with their corresponding RMSEs.
#Tabulate Score
RMSECompiled <- RMSECompiled %>% add_row(Model="Regularized (Movie + User + Gap Effect)",RMSE=min(rmses2))
as.data.frame(RMSECompiled)
## Model RMSE
## 1 Naive 1.0612018
## 2 Movie Effect 0.9439087
## 3 Movie + User Effect 0.8653488
## 4 Movie + User + Gap Effect 0.8649038
## 5 Regularized (Movie + User Effect) 0.8648170
## 6 Regularized (Movie + User + Gap Effect) 0.8643430
The best performing predictive algorithm is the regularized movie + user + time gap model, with an RMSE of 0.86434, below the target threshold of 0.86490. While this is a satisfactory result, computational limitations on my PC prevented me from using more intensive algorithms. One example is matrix factorization, which exploits the fact that groups of movies can share similar rating patterns, and groups of users can rate in a similar fashion; these patterns can be converted into factors built into the predictive model. Other possibilities for future work include model-based algorithms such as Random Forests and K-Means clustering, as well as item-based filtering approaches such as cosine similarity. Source: https://www.jetir.org/download1.php?file=JETIR1907B10.pdf.
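As a pointer for that future work, below is a minimal sketch of how matrix factorization could be attempted with the recosystem package (my assumption; this package, which wraps the LIBMF library, is not used anywhere in this report, and the dim/cost/niter settings are illustrative placeholders rather than tuned values).
library(recosystem)
# reshape the training and test ratings into recosystem's input format
train_data <- data_memory(user_index = edx$userId, item_index = edx$movieId,
                          rating = edx$rating)
test_data <- data_memory(user_index = test$userId, item_index = test$movieId)
r <- Reco()
# train with illustrative, untuned options: 20 latent factors, L2 penalties
r$train(train_data, opts = list(dim = 20, costp_l2 = 0.1, costq_l2 = 0.1,
                                niter = 10, verbose = FALSE))
predicted_ratings <- r$predict(test_data, out_memory())
RMSE(test$rating, predicted_ratings)
Whether such a model would beat the regularized effects model above would, of course, need to be verified empirically.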