A recommendation system is an information filtering system that provides suggestions to users on specific courses of action to take. In the same vein, a movie recommendation system recommends movies to users, on the assumption that the user is likely to rate the recommended titles highly.
A movie recommendation system ‘learns’ what to suggest to a specific user by being ‘fed’ past, actual ratings data. In this capstone, I build one such recommendation system from a data set of 10 million ratings provided by GroupLens Research (https://grouplens.org/datasets/movielens/). Each rating comes with additional data: the movie title and release year, the movie ID, the anonymized ID of the user who provided the rating, and the date when the rating was given.
This system is constructed as follows: (1) data cleaning is performed to capture and remove NAs and duplicate rows, (2) data exploration and visualization are conducted to tease out insights on the most useful predictors, (3) a series of recommendation systems (i.e., algorithms) is generated and tested to identify the best, and (4) conclusions are drawn, including limitations and possibilities for future work.
Each algorithm is tested by having it predict ratings on a hold-out data set (the “validation set”), and I target a final Root Mean Squared Error (RMSE) below 0.86490. The RMSE is a performance metric that gauges a recommendation system’s predictive quality by (essentially) aggregating how far the system’s predictions fall from the actual ratings in a test set.
The methods section is split into three parts: (1) data observation, (2) cleaning, and (3) exploration. Note that the code used to produce the quiz results is in the submitted R script and is not covered here.
I first observe the data to make preliminary sense of it. This is done through the as_tibble() command.
edx %>% as_tibble()
## # A tibble: 9,000,055 × 6
## userId movieId rating timestamp title genres
## <int> <int> <dbl> <int> <chr> <chr>
## 1 1 122 5 838985046 Boomerang (1992) Comed…
## 2 1 185 5 838983525 Net, The (1995) Actio…
## 3 1 292 5 838983421 Outbreak (1995) Actio…
## 4 1 316 5 838983392 Stargate (1994) Actio…
## 5 1 329 5 838983392 Star Trek: Generations (1994) Actio…
## 6 1 355 5 838984474 Flintstones, The (1994) Child…
## 7 1 356 5 838983653 Forrest Gump (1994) Comed…
## 8 1 362 5 838984885 Jungle Book, The (1994) Adven…
## 9 1 364 5 838983707 Lion King, The (1994) Adven…
## 10 1 370 5 838984596 Naked Gun 33 1/3: The Final Insult (1… Actio…
## # … with 9,000,045 more rows
The data comprises 6 columns and 9,000,055 rows: the user ID, the movie ID, the rating allocated, the rating timestamp, the movie title, and the genre(s) the movie belongs to. With each observation in its own row and each variable in its own column, the data looks tidy. Next, we inspect the characteristics of each column and check for NAs.
summary(edx)
## userId movieId rating timestamp
## Min. : 1 Min. : 1 Min. :0.500 Min. :7.897e+08
## 1st Qu.:18124 1st Qu.: 648 1st Qu.:3.000 1st Qu.:9.468e+08
## Median :35738 Median : 1834 Median :4.000 Median :1.035e+09
## Mean :35870 Mean : 4122 Mean :3.512 Mean :1.033e+09
## 3rd Qu.:53607 3rd Qu.: 3626 3rd Qu.:4.000 3rd Qu.:1.127e+09
## Max. :71567 Max. :65133 Max. :5.000 Max. :1.231e+09
## title genres
## Length:9000055 Length:9000055
## Class :character Class :character
## Mode :character Mode :character
sum(duplicated(edx))
## [1] 0
The output suggests that there are no NAs, and no duplicate rows are spotted. The data therefore does not require further cleaning at this point.
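Because summary() does not report NAs for the character columns (title and genres), a quick explicit count per column (a minimal sketch, reusing the edx object already in memory) can confirm this:
sapply(edx, function(x) sum(is.na(x)))
This should return zero for every column if, as the summary suggests, the data is complete.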
As framed in our previous machine learning course, think of the task as follows: not every user rated every movie, so how would they have rated the movies they did not watch? Each rating y has a different set of predictors. To predict the rating for movie i by user u, we need to consider all ratings for movie i (and for movies similar to i), and all ratings by user u (and by similar users). In effect, the entire training set serves as predictors.
With this in mind, we first look at how ratings are distributed across movies and across users.
#count number of ratings per movie (x axis = # of ratings a movie received, y axis = number of movies)
edx %>%
  dplyr::count(movieId) %>%
  ggplot(aes(n)) +
  geom_histogram(bins = 30, color = "black") +
  scale_x_log10() +
  ggtitle("Number of ratings per movie") +
  labs(x = "# of ratings a movie received", y = "number of instances")
In the first plot, on the left of the x-axis there are ~100 movies (y) that received only 1 rating. Towards the middle, there are ~700 movies (y) that received ~100 ratings, and on the right of the x-axis there are ~10 movies (y) that received ~50,000 ratings. In other words, a typical movie receives on the order of 100 ratings, with extreme cases, e.g. blockbuster movies, receiving tens of thousands. Clearly, some movies get rated far more than others.
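To put a number on the typical case, the per-movie counts can be summarized directly (a small sketch on the same edx object; results not shown here):
edx %>% dplyr::count(movieId) %>% summarize(median_n = median(n), max_n = max(n))
The median should sit near the ~100-rating mode visible in the histogram.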
##count number of ratings given per user (x axis = number of ratings given by a user, y axis = number of users)
edx %>%
  dplyr::count(userId) %>%
  ggplot(aes(n)) +
  geom_histogram(bins = 30, color = "black") +
  scale_x_log10() +
  ggtitle("Users") +
  labs(x = "# of ratings given by a user", y = "number of instances")
In the second plot, on the left of the x-axis there are ~100 users (y) who gave ~20 ratings (x). In the middle, there are ~6,800 users who gave ~50 ratings, and at the other extreme of the x-axis there are very few users with more than 1,000 ratings. In other words, a typical user gives around 50 ratings, with extreme cases of ~1,000 or more. Clearly, some users give more ratings than others.
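The analogous check per user (again a sketch, results not shown):
edx %>% dplyr::count(userId) %>% summarize(median_n = median(n), max_n = max(n))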
What else can impact movie ratings? I believe the ‘time gap’ between a movie’s release date and its rating date may play a part. People may watch movies released long ago precisely because they are classics or critically acclaimed, so it is possible that the longer the gap between release and rating, the higher the rating. Let us examine this hypothesis.
##from our training set edx, extract the rating year and the release year, and compute the gap
edx2 <- edx %>%
  mutate(rating_year = as.numeric(year(as_datetime(timestamp))),
         year_released = as.numeric(str_sub(title, start = -5, end = -2)),
         gap = rating_year - year_released)
#Does watching older movies mean that we tend to give better ratings?
#we need a plot of y (mean rating that a movie receives) vs x (time gap between movie release and actual watch date)
#Thus, for each x-value, I need to calculate the mean rating.
plottable <- edx2 %>% group_by(gap) %>% summarize(mean=mean(rating))
#Then, plot a trendline to see if there are any relationships.
plottable %>% filter(gap >= 0) %>%
  ggplot(aes(x = gap, y = mean)) +
  geom_point() +
  geom_smooth(method = "loess") +
  ggtitle("Mean Rating vs gap between release date and rating date") +
  labs(y = "Mean Movie Rating", x = "# of years between release and rating dates") +
  theme(plot.title = element_text(hjust = 0.5))
## `geom_smooth()` using formula = 'y ~ x'
This plot shows that the larger the time gap, the higher the mean rating (a difference of ~0.4 points). While the trend levels off at a gap of 50 years and plummets at ~60+, it still suggests a ‘time gap effect’, which we can build into our model to see whether it decreases the RMSE.
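One simple way to back this up numerically (a sketch using the plottable summary created above) is the correlation between the gap and the mean rating; a clearly positive value would support the hypothesis:
plottable %>% filter(gap >= 0) %>% summarize(correlation = cor(gap, mean))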
I take an iterative approach to modelling: start with a simple predictive model, then add predictors (movie, user, and gap effects) step-wise, checking whether each enhances the predictive capability (i.e., reduces the RMSE). Depending on whether I clear the target RMSE, I will further regularize these predictors (a concept discussed later).
We first create a loss function that represents the Root Mean Squared Error. Writing \(y_{u,i}\) for the rating of movie i by user u and \(\hat{y}_{u,i}\) for its prediction, over the N ratings in the test set: \[RMSE = \sqrt{\frac{1}{N} \sum_{u,i} \left(\hat{y}_{u,i} - y_{u,i}\right)^2}\]
Its R equivalent is coded below. Recall that our goal is to build a predictive model that yields an RMSE below 0.86490.
RMSE <- function(true_ratings, predicted_ratings){
sqrt(mean((true_ratings - predicted_ratings)^2))
}
With the ‘yardstick’ defined, we now establish a simple baseline model that predicts the rating for any movie i by an arbitrary user u. The model is naive: all ratings are assumed to take on one single value, with the differences explained by random variation.
\[Y_{u,i} = \mu + \epsilon_{u,i}\]
where \(\mu\) is the true rating for all movies and \(\epsilon_{u,i}\) is the error term.
Therefore, the estimate for any rating \(Y_{u,i}\) is \(\hat{\mu}\), the estimated rating for all movies. The estimate that minimizes the RMSE is the average rating in the training dataset, computed with a simple ‘mean’ call in R.
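Formally (a standard least-squares fact, stated here for completeness), the average is the value of \(\mu\) that minimizes the sum of squared residuals: \[\hat{\mu} = \operatorname*{arg\,min}_{\mu} \sum_{u,i} \left(y_{u,i} - \mu\right)^2 = \frac{1}{N} \sum_{u,i} y_{u,i}\]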
mu <- mean(edx$rating)
naive_rmse <- RMSE(test$rating,mu)
naive_rmse
## [1] 1.061202
We see that the RMSE is 1.06, which is far off from the intended target.
#Tabulate Score
RMSECompiled <- tibble(Model="Naive", RMSE=naive_rmse)
RMSECompiled
## # A tibble: 1 × 2
## Model RMSE
## <chr> <dbl>
## 1 Naive 1.06
There is room to create a more accurate predictive model. Considering all other ratings for movie i and all other ratings provided by user u should yield a more accurate prediction, since it is likely that (1) movies rated to be of a certain caliber will continue to be rated as such, and (2) a user’s rating pattern (whether strict or lenient) will remain consistent across movies. These considerations take the form of additional terms that raise or lower the predicted rating, referred to below as the “movie” and “user” effects.
Adding in the movie effect term \(b_{i}\), the predictive model becomes: \[Y_{u,i} = \mu + b_{i} + \epsilon_{u,i}\] We now estimate \(b_{i}\) from the training set: for each movie, it is the average of that movie’s ratings minus the overall average \(\hat{\mu}\) calculated earlier.
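In symbols, with \(n_{i}\) the number of training ratings for movie i (this matches the mean(rating - mu) computation in the code below): \[\hat{b}_{i} = \frac{1}{n_{i}} \sum_{u=1}^{n_{i}} \left(y_{u,i} - \hat{\mu}\right)\]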
#3.2. Try our second model, i.e., calculate movie effects
b_i <- edx %>% group_by(movieId) %>% summarize(b_i = mean(rating - mu))
#3.2.1. RMSE with movie effects
movieonlyrmse <- test %>% left_join(b_i, by = "movieId") %>%
  mutate(pred = mu + b_i) %>%
  summarize(rmse = RMSE(rating, pred))
movieonlyrmse
## rmse
## 1 0.9439087
#Tabulate Score
RMSECompiled <- RMSECompiled %>% add_row(Model="Movie Effect",RMSE=movieonlyrmse[1,1])
as.data.frame(RMSECompiled)
## Model RMSE
## 1 Naive 1.0612018
## 2 Movie Effect 0.9439087
There is a marked improvement; the RMSE drops to 0.9439. Let us factor in the user effect as well.
With the user effect included on top of the movie effect, the predictive model becomes: \[Y_{u,i} = \mu + b_{i} + b_{u} + \epsilon_{u,i}\] The user effect is estimated by first joining in the movie effect (from 3.2), then averaging, for each user, the residual of the rating minus \(\hat{\mu}\) minus \(\hat{b}_{i}\).
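In symbols, with \(n_{u}\) the number of training ratings by user u (matching the mean(rating - mu - b_i) computation below): \[\hat{b}_{u} = \frac{1}{n_{u}} \sum_{i=1}^{n_{u}} \left(y_{u,i} - \hat{\mu} - \hat{b}_{i}\right)\]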
#3.3. calculate user effects
b_u <- edx %>% left_join(b_i, by = "movieId") %>% group_by(userId) %>%
  summarize(b_u = mean(rating - mu - b_i))
#3.3.1. RMSE with movie + user effects
movieanduserrmse <- test %>% left_join(b_i, by = "movieId") %>%
  left_join(b_u, by = "userId") %>%
  mutate(pred = mu + b_i + b_u) %>%
  summarize(rmse = RMSE(rating, pred))
movieanduserrmse
## rmse
## 1 0.8653488
movieanduserrmsereport <- movieanduserrmse[1,1]
#Tabulate Score
RMSECompiled <- RMSECompiled %>% add_row(Model="Movie + User Effect",RMSE=movieanduserrmse[1,1])
as.data.frame(RMSECompiled)
## Model RMSE
## 1 Naive 1.0612018
## 2 Movie Effect 0.9439087
## 3 Movie + User Effect 0.8653488
The RMSE is 0.86535, which is a marked improvement but still short of the target 0.86490.
We saw earlier that there is a basis for the time-gap effect. Let us add this term \(b_{g}\) to the predictive model: \[Y_{u,i} = \mu + b_{i} + b_{u} + b_{g} + \epsilon_{u,i}\] The time gap term is calculated with the code below. Note that I use the ‘transformed’ edx2 set generated earlier in the exploration section, which includes the additional columns for rating year, release year, and the time gap between them.
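In symbols, with \(n_{g}\) the number of training ratings whose time gap equals g: \[\hat{b}_{g} = \frac{1}{n_{g}} \sum_{u,i:\; g_{u,i} = g} \left(y_{u,i} - \hat{\mu} - \hat{b}_{i} - \hat{b}_{u}\right)\]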
#3.4. RMSE with movie + user + gap effects
b_g <- edx2 %>% left_join(b_i, by = "movieId") %>% left_join(b_u, by = "userId") %>%
  group_by(gap) %>% summarize(b_g = mean(rating - mu - b_i - b_u))
#3.4.1. first, mutate the test set to include gap data as well
test2 <- test %>%
  mutate(rating_year = as.numeric(year(as_datetime(timestamp))),
         year_released = as.numeric(str_sub(title, start = -5, end = -2)),
         gap = rating_year - year_released)
#then, join b_i, b_u, and b_g and calculate the RMSE
movieuserandgaprmse <- test2 %>% left_join(b_i, by = "movieId") %>%
  left_join(b_u, by = "userId") %>%
  left_join(b_g, by = "gap") %>%
  mutate(pred = mu + b_i + b_u + b_g) %>%
  summarize(rmse = RMSE(rating, pred))
movieuserandgaprmse
## rmse
## 1 0.8649038
#Tabulate Score
RMSECompiled <- RMSECompiled %>% add_row(Model="Movie + User + Gap Effect",RMSE=movieuserandgaprmse[1,1])
as.data.frame(RMSECompiled)
## Model RMSE
## 1 Naive 1.0612018
## 2 Movie Effect 0.9439087
## 3 Movie + User Effect 0.8653488
## 4 Movie + User + Gap Effect 0.8649038
Factoring in the gap effect, the RMSE has dropped even further, but it remains slightly above our 0.86490 yardstick.
To regularize, in simple terms, means to alter the equation being minimized by adding penalty terms. These penalties shrink large estimates that arise from small sample sizes (e.g., few ratings for a particular movie, or few ratings provided by a particular user). The penalty involves an unknown tuning parameter \(\lambda\), which we will identify using cross-validation. See http://rafalab.dfci.harvard.edu/dsbook/large-datasets.html#regularization for an elaboration of the mathematical implications.
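Concretely, for the movie effect the regularized estimate becomes (mirroring the sum(rating - mu)/(n() + l) computation in the code below, and the formulation in the reference above): \[\hat{b}_{i}(\lambda) = \frac{1}{\lambda + n_{i}} \sum_{u=1}^{n_{i}} \left(y_{u,i} - \hat{\mu}\right)\] When \(n_{i}\) is large the penalty barely matters; when \(n_{i}\) is small, the estimate is shrunk towards zero.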
First, we will build a predictive model with regularization of the movie and user effects. This model tries a series of lambda values, calculates the regularized movie and user effects, then the RMSE, then identifies which lambda minimizes the RMSE.
#3.5. trying regularized (movie + user effect)
lambdas <- seq(0, 10, 0.25)
rmses <- sapply(lambdas, function(l){
  mu <- mean(edx$rating)
  # regularized movie effect: dividing by (n() + l) shrinks estimates with few ratings
  b_i <- edx %>%
    group_by(movieId) %>%
    dplyr::summarize(b_i = sum(rating - mu)/(n() + l))
  # regularized user effect, computed on the residuals after the movie effect
  b_u <- edx %>%
    left_join(b_i, by = "movieId") %>%
    group_by(userId) %>%
    dplyr::summarize(b_u = sum(rating - b_i - mu)/(n() + l))
  predicted_ratings <- test %>%
    left_join(b_i, by = "movieId") %>%
    left_join(b_u, by = "userId") %>%
    mutate(pred = mu + b_i + b_u) %>%
    .$pred
  return(RMSE(predicted_ratings, test$rating))
})
qplot(lambdas, rmses)
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
lambda <- lambdas[which.min(rmses)]
lambda
## [1] 5.25
The plot shows that the best lambda value is 5.25, and the RMSE drops below 0.86490.
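Incidentally, the warning above flags that qplot() is deprecated in recent ggplot2 releases; an equivalent plot (a sketch producing the same point plot) can be drawn with:
ggplot(data.frame(lambda = lambdas, rmse = rmses), aes(lambda, rmse)) + geom_point()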
#Tabulate Score
RMSECompiled <- RMSECompiled %>% add_row(Model="Regularized (Movie + User Effect)",RMSE=min(rmses))
as.data.frame(RMSECompiled)
## Model RMSE
## 1 Naive 1.0612018
## 2 Movie Effect 0.9439087
## 3 Movie + User Effect 0.8653488
## 4 Movie + User + Gap Effect 0.8649038
## 5 Regularized (Movie + User Effect) 0.8648170
While we have achieved our desired RMSE of below 0.86490, let us see if regularizing the gap effect helps to reduce the RMSE even further.
#3.6. trying regularized (movie + user + gap effect)
lambdas <- seq(0, 10, 0.25)
rmses2 <- sapply(lambdas, function(l){
  mu <- mean(edx$rating)
  b_i <- edx %>%
    group_by(movieId) %>%
    dplyr::summarize(b_i = sum(rating - mu)/(n() + l))
  b_u <- edx %>%
    left_join(b_i, by = "movieId") %>%
    group_by(userId) %>%
    dplyr::summarize(b_u = sum(rating - b_i - mu)/(n() + l))
  # regularized gap effect, using the edx2 set that carries the gap column
  b_g <- edx2 %>%
    left_join(b_i, by = "movieId") %>%
    left_join(b_u, by = "userId") %>%
    group_by(gap) %>%
    dplyr::summarize(b_g = sum(rating - b_i - b_u - mu)/(n() + l))
  predicted_ratings <- test2 %>%
    left_join(b_i, by = "movieId") %>%
    left_join(b_u, by = "userId") %>%
    left_join(b_g, by = "gap") %>%
    mutate(pred = mu + b_i + b_u + b_g) %>%
    .$pred
  # test2 holds the same ratings as test, plus the gap columns
  return(RMSE(predicted_ratings, test2$rating))
})
qplot(lambdas, rmses2)
lambda2 <- lambdas[which.min(rmses2)]
lambda2
## [1] 5.5
The optimal lambda value has shifted slightly, and as shown below the RMSE drops even further. This is clearly our best model. Let us review all the models produced so far, with their corresponding RMSEs.
#Tabulate Score
RMSECompiled <- RMSECompiled %>% add_row(Model="Regularized (Movie + User + Gap Effect)",RMSE=min(rmses2))
as.data.frame(RMSECompiled)
## Model RMSE
## 1 Naive 1.0612018
## 2 Movie Effect 0.9439087
## 3 Movie + User Effect 0.8653488
## 4 Movie + User + Gap Effect 0.8649038
## 5 Regularized (Movie + User Effect) 0.8648170
## 6 Regularized (Movie + User + Gap Effect) 0.8643430
The best performing predictive algorithm is the regularized movie + user + time gap model, with an RMSE of 0.86434, below the target threshold of 0.86490. While this is a satisfactory result, computational limitations on my PC prevented me from using more intensive algorithms. One example is matrix factorization, which exploits the fact that groups of movies can share similar rating patterns, and groups of users can rate in a similar fashion; these patterns can be converted into factors built into the predictive model. Other possibilities for future work include model-based algorithms such as Random Forests and K-Means clustering, as well as item-based filtering approaches such as cosine similarity. Source: https://www.jetir.org/download1.php?file=JETIR1907B10.pdf.
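As a pointer for that future work, below is a minimal sketch of how matrix factorization could be attempted with the recosystem package (my assumption; this package, which wraps the LIBMF library, is not used anywhere in this report, and the dim/cost/niter settings are illustrative placeholders rather than tuned values).
library(recosystem)
# reshape the training and test ratings into recosystem's input format
train_data <- data_memory(user_index = edx$userId, item_index = edx$movieId,
                          rating = edx$rating)
test_data <- data_memory(user_index = test$userId, item_index = test$movieId)
r <- Reco()
# train with illustrative, untuned options: 20 latent factors, L2 penalties
r$train(train_data, opts = list(dim = 20, costp_l2 = 0.1, costq_l2 = 0.1,
                                niter = 10, verbose = FALSE))
predicted_ratings <- r$predict(test_data, out_memory())
RMSE(test$rating, predicted_ratings)
Whether such a model would beat the regularized effects model above would, of course, need to be verified empirically.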