Recommendation systems are an area of machine learning in which algorithms are built from large datasets to predict preferences and recommend products such as movies and books (Irizarry, 2019; NVIDIA, 2024). For example, an algorithm can be developed to predict the ratings people give to different movies. The predicted ratings can then be used to recommend similar movies to different individuals.
In this project, the goal was to use a portion of the movielens dataset to develop an algorithm that, when tested on a new portion of the movielens dataset, has an RMSE (root mean squared error) < 0.86490. The RMSE function is described in the Modeling approach section below.
Code was provided to generate two datasets. The first dataset is named “edx” and is the dataset that was used to develop the algorithm. The second dataset is named “final_holdout_test” (but was shortened to “fht” for convenience). The second dataset was used to evaluate the algorithm developed with the “edx” dataset by applying the algorithm to the dataset to obtain the RMSE.
The initial dataset that was created was called “movielens”. A brief inspection of these data revealed that there were 10,000,054 rows and 6 columns. The dataset was a dataframe. The 6 column names and their classes are “userId” (integer), “movieId” (integer), “rating” (numeric), “timestamp” (integer), “title” (character), and “genres” (character).
The edx and final_holdout_test datasets were created by partitioning the movielens dataset so that the final_holdout_test dataset was 10% of the movielens dataset. After generating these datasets, they were inspected, and the results were as expected. Each dataset was a dataframe with 6 columns. The columns had the same names and structure as movielens. The edx dataset had 9,000,055 rows and the final_holdout_test dataset had 999,999 rows.
The steps undertaken to complete this project followed a generic process described for building a recommendation system (Brownlee, 2016; Le, 2019). The general procedure is described below.
These steps will be expanded and described in the Methods section.
The methods section describes the approach taken for initially exploring and understanding the data, and then the modeling approach adopted to derive the algorithm with the lowest RMSE.
The names and data structure of the edx dataset were described in the previous section. Inspecting the first 10 rows of the dataset confirmed that edx was a tidy dataset with one observation per row.
It was considered that the year in which the rating was recorded, as well as the year in which the movie was released, might provide important contributions to an accurate recommendation system. The rating year, therefore, was extracted from the timestamp data and the timestamp column was deleted. Also, the title column was split so that title and year of release were separate columns. The rating year and year of release columns were converted to the class numeric. The columns were reordered to create what seemed to be a more logical organization of the dataset. The same procedures were applied to the fht dataset to prepare that dataset for application of the final algorithm to obtain the desired RMSE.
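The wrangling steps described above can be illustrated with a minimal Python sketch (the project itself used R). The sample row, the `wrangle` helper, and the new column names `rating_year` and `release_year` are hypothetical, and the sketch assumes that the timestamp is in seconds since 1 January 1970 (UTC) and that each title ends with the release year in parentheses.

```python
import re
from datetime import datetime, timezone

# Illustrative row in the six-column layout described in the report.
row = {"userId": 1, "movieId": 122, "rating": 5.0,
       "timestamp": 838985046, "title": "Boomerang (1992)",
       "genres": "Comedy|Romance"}

# Assumption: titles end with the four-digit release year in parentheses.
TITLE_YEAR = re.compile(r"^(?P<title>.*)\s\((?P<year>\d{4})\)\s*$")

def wrangle(row):
    """Return a copy of row with rating_year and release_year columns."""
    out = dict(row)
    ts = out.pop("timestamp")  # the timestamp column is dropped
    out["rating_year"] = datetime.fromtimestamp(ts, tz=timezone.utc).year
    match = TITLE_YEAR.match(out["title"])
    if match:
        out["title"] = match.group("title")
        out["release_year"] = int(match.group("year"))
    else:
        out["release_year"] = None  # title did not follow the assumed pattern
    return out

clean = wrangle(row)
```

Applying the same function to both datasets would keep the edx and fht columns aligned, which matters when the final model is applied to the holdout data.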
A summary of the edx and fht datasets and a check for missing values indicated that there were no missing values and all variables were within their expected ranges. Differences between the mean and median for some of the variables indicated that the data were likely to be skewed.
| Raters | Movies | Genres |
|---|---|---|
| 69878 | 10677 | 797 |
A histogram of the frequency of each rating value was considered useful. To determine an appropriate scale for the y-axis, the frequency of counts for each rating value was obtained.
The histogram indicates that most ratings were 3.0 or above.
It was also of interest to examine the distribution of the average ratings provided by each rater. The average ratings were calculated and a histogram was produced.
The distribution of average ratings appears to be approximately normal, with its centre shifted towards the upper end of the rating scale. This is confirmed with a median = 3.635 and a mean = 3.614.
To inspect the distribution of the number of ratings provided by each rater, the log of the number of ratings was used on the x-axis. The distribution is positively skewed with a small number of raters providing many ratings.
| n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 69878 | 128.797 | 195.06 | 62 | 86.864 | 54.856 | 10 | 6616 | 6606 | 5.717 | 73.153 | 0.738 |
| n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 10677 | 842.939 | 2238.481 | 122 | 324.588 | 164.569 | 1 | 31362 | 31361 | 5.809 | 45.625 | 21.664 |
| n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 797 | 11292.42 | 45244.96 | 1459 | 3909.718 | 2120.118 | 2 | 733296 | 733294 | 11.3 | 156.986 | 1602.659 |
The skewness indicated in the summary statistics is confirmed with the histogram.
Examining the number of ratings per year indicated that ratings were provided between 1995 and 2009 although in 1995 only 2 ratings were provided.
There was a strong negative relationship between a movie’s year of release and the average yearly rating, with older movies tending to receive higher average ratings. A plot of the data indicates some nonlinearity.
| Correlation |
|---|
| -0.582716 |
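A correlation of this kind can be sketched in Python as follows (the project itself used R). The `pearson` helper and the small `release_years` and `avg_ratings` lists are made up for illustration and are not the project data.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / sqrt(sxx * syy)

# Made-up yearly averages: older release years with higher ratings.
release_years = [1950, 1970, 1990, 2005]
avg_ratings = [3.9, 3.8, 3.5, 3.4]
r = pearson(release_years, avg_ratings)  # negative for these values
```

Because the Pearson coefficient measures only linear association, the nonlinearity noted in the plot means it may understate the strength of the relationship.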
The data were also examined for the average movie ratings of individual movie genre categories. Wrangling these data enabled descriptive statistics, boxplots, and a treemap to be produced.
| Genres | Total Rating Score | Average Rating |
|---|---|---|
| (no genres listed) | 25.5 | 3.64 |
| Action | 8760660.5 | 3.42 |
| Adventure | 6668798.0 | 3.49 |
| Animation | 1682105.5 | 3.60 |
| Children | 2522991.5 | 3.42 |
| Comedy | 12169851.0 | 3.44 |
| Crime | 4867304.0 | 3.67 |
| Documentary | 352114.0 | 3.78 |
| Drama | 14362407.5 | 3.67 |
| Fantasy | 3241531.0 | 3.50 |
| Film-Noir | 475542.0 | 4.01 |
| Horror | 2261028.0 | 3.27 |
| IMAX | 30823.5 | 3.77 |
| Musical | 1543196.0 | 3.56 |
| Mystery | 2089757.5 | 3.68 |
| Romance | 6084484.0 | 3.55 |
| Sci-Fi | 4554313.0 | 3.40 |
| Thriller | 8158499.5 | 3.51 |
| War | 1932551.0 | 3.78 |
| Western | 673469.5 | 3.56 |
| | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Total Rating | 20 | 4121572.62 | 4103759.40 | 2392009.75 | 3491771.56 | 3115090.12 | 25.50 | 14362407.50 | 14362382.00 | 1.04 | 0.03 | 917628.50 |
| Average Rating | 20 | 3.59 | 0.17 | 3.56 | 3.58 | 0.17 | 3.27 | 4.01 | 0.74 | 0.48 | -0.04 | 0.04 |
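The genre wrangling behind these tables can be sketched in Python (the project itself used R). The sketch assumes that the genres column stores combined genres separated by a pipe character, as in "Comedy|Romance", so that a rating counts towards every genre listed for its movie. The `genre_summary` helper and the three sample rows are hypothetical.

```python
from collections import defaultdict

# Made-up (genres, rating) pairs; a movie can belong to several genres.
ratings = [
    ("Comedy|Romance", 5.0),
    ("Action|Crime|Thriller", 3.0),
    ("Comedy", 4.0),
]

def genre_summary(rows):
    """Map each individual genre to (total rating score, average rating)."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for genres, rating in rows:
        for genre in genres.split("|"):  # assumed pipe-delimited
            totals[genre] += rating
            counts[genre] += 1
    return {g: (totals[g], totals[g] / counts[g]) for g in totals}

summary = genre_summary(ratings)
```

Splitting the combinations in this way is what reduces the 797 genre combinations to the 20 individual categories shown in the table.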
The boxplots indicate that the medians and distributions for each of the genres were similar. Generally, all genres were rated positively with Film-Noir having the highest interquartile range and Horror having the lowest.
The treemap indicates that genres such as Drama, Comedy, Action, and Thriller had large rating totals, while Film-Noir, Documentary, and IMAX had much smaller totals even though the average ratings for these genres were high compared with other genres.
The data exploration phase provided important insights to assist with building a suitable algorithm. The variability apparent in ratings for individual movies and ratings given by individual raters will be useful to leverage when building a predictive model. It appears that, generally, people rate movies that they like and do not tend to rate movies they don’t like. This tendency explains the high average rating for the movies. There also appears to be a tendency for older movies to be rated more highly than newer movies. The variability in ratings for different genres and the year in which the movie was released should also enhance a predictive model.
Recommendation systems can be used for prediction or classification tasks. The current project is concerned with a prediction task, and that informed the approach taken. The modeling approach adopted for this project follows the approach outlined in Irizarry Section 33.7 (2019, pp. 638-44). It began with the simplest model possible and added elements to the model while assessing the impact on the RMSE.
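The RMSE function to be used is provided by Irizarry (2019, pp. 640-1) and is defined as:
$$\text{RMSE} = \sqrt{\frac{1}{N} \sum_{u,i} \left( \hat{y}_{u,i} - y_{u,i} \right)^2}$$
where $y_{u,i}$ is the rating given by rater $u$ to movie $i$, $\hat{y}_{u,i}$ is the corresponding predicted rating, and $N$ is the number of rater and movie combinations being evaluated.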
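As a minimal Python sketch of this error measure (the project itself used the R function from Irizarry), the hypothetical `rmse` helper below takes the square root of the mean squared difference between observed and predicted ratings. The three example ratings are made up.

```python
from math import sqrt

def rmse(true_ratings, predicted_ratings):
    """Root mean squared error between observed and predicted ratings."""
    if len(true_ratings) != len(predicted_ratings):
        raise ValueError("inputs must have the same length")
    squared_errors = [(t - p) ** 2
                      for t, p in zip(true_ratings, predicted_ratings)]
    return sqrt(sum(squared_errors) / len(squared_errors))

# Errors of 0.5, 0.0 and 1.0 stars give sqrt(1.25 / 3), roughly 0.645.
example = rmse([4.0, 3.0, 5.0], [3.5, 3.0, 4.0])
```

Because the errors are squared before averaging, a few large misses raise the RMSE more than many small ones.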
For the purposes of training and testing, the edx dataset was partitioned into a training dataset and a testing dataset with the testing dataset being 20% of the edx dataset.
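A random 80/20 partition of this kind can be sketched in Python as follows (the project itself used R). The `split_rows` helper, the seed, and the integer stand-ins for rows are hypothetical. The approach in Irizarry (2019) additionally removes test rows whose rater or movie does not appear in the training data; whether that step was applied here is not stated, so the sketch omits it.

```python
import random

def split_rows(rows, test_fraction=0.2, seed=1):
    """Shuffle rows reproducibly and return (train_set, test_set)."""
    rng = random.Random(seed)  # local generator, so the split is repeatable
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(100))  # stand-in for the rows of the edx dataset
train_set, test_set = split_rows(rows)
```

Fixing the seed keeps the partition identical between runs, so RMSE values from successive models remain comparable.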
To begin with, a naïve benchmark was calculated using the overall mean rating. Using the mean = 3.512364 to predict ratings in the test_set data produced an RMSE = 1.06. An RMSE of this magnitude means that predictions are typically off by about one star, which is unhelpful for prediction purposes.
| Model | RMSE |
|---|---|
| Overall average (A) | 1.0599 |
Given the large number of movies, their average ratings were added to the model as a first step in improving the RMSE.
| Model | RMSE |
|---|---|
| Overall average (A) | 1.0599 |
| A + Movie Avg (M) | 0.9437 |
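The first two models can be sketched in Python on made-up data (the project itself used R). The sketch assumes that the movie term is the average deviation of a movie's ratings from the overall mean, following Irizarry (2019). The names `mu`, `b_i`, and the tiny `train` and `test` lists are hypothetical.

```python
from collections import defaultdict
from math import sqrt

# Made-up (userId, movieId, rating) rows.
train = [(1, 10, 5.0), (2, 10, 4.0), (1, 20, 3.0), (3, 20, 2.0), (2, 30, 4.0)]
test = [(3, 10, 4.0), (2, 20, 2.5)]

def rmse(actual, predicted):
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

# Model A: predict the overall training mean for every rating.
mu = sum(rating for _, _, rating in train) / len(train)

# Model A + M: add each movie's average deviation from the overall mean.
deviations = defaultdict(list)
for _, movie, rating in train:
    deviations[movie].append(rating - mu)
b_i = {movie: sum(d) / len(d) for movie, d in deviations.items()}

actual = [rating for _, _, rating in test]
baseline_rmse = rmse(actual, [mu] * len(test))
# Movies unseen in training fall back to the overall mean (effect 0.0).
movie_rmse = rmse(actual, [mu + b_i.get(movie, 0.0) for _, movie, _ in test])
```

On this toy data the movie term lowers the error, mirroring the direction of the improvement reported in the table.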
Including the average ratings of movies reduced the RMSE to less than 1.0; however, further improvements were required. Next, the average ratings for individual raters were added to the model.
| Model | RMSE |
|---|---|
| Overall average (A) | 1.0599 |
| A + Movie Avg (M) | 0.9437 |
| A + M + Rater Avg (R) | 0.8659 |
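Adding a rater term can be sketched in Python on made-up data (the project itself used R). The sketch assumes, following Irizarry (2019), that the rater term is the average of what remains of a rater's ratings after removing the overall mean and the movie term. The `mean_by_key` helper and all data are hypothetical.

```python
from collections import defaultdict
from math import sqrt

# Made-up (userId, movieId, rating) rows.
train = [(1, 10, 5.0), (2, 10, 4.0), (1, 20, 3.0), (3, 20, 2.0), (2, 30, 4.0)]
test = [(3, 10, 4.0), (2, 20, 2.5)]

def rmse(actual, predicted):
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mean_by_key(pairs):
    """Average the values of (key, value) pairs for each key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: sum(v) / len(v) for key, v in groups.items()}

mu = sum(r for _, _, r in train) / len(train)
b_i = mean_by_key((m, r - mu) for _, m, r in train)           # movie term
b_u = mean_by_key((u, r - mu - b_i[m]) for u, m, r in train)  # rater term

actual = [r for _, _, r in test]
movie_rmse = rmse(actual, [mu + b_i.get(m, 0.0) for _, m, _ in test])
rater_rmse = rmse(actual, [mu + b_i.get(m, 0.0) + b_u.get(u, 0.0)
                           for u, m, _ in test])
```

Estimating the rater term from residuals, rather than from raw averages, avoids counting the movie term twice.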
Adding the average ratings for individual raters further improved the RMSE. The RMSE with movies and raters added to the model approached the goal of < 0.86490.
Given the large number of movie genres, their average ratings were added to the model. Further improvement in the RMSE was obtained; however, the improvement was small.
| Model | RMSE |
|---|---|
| Overall average (A) | 1.0599 |
| A + Movie Avg (M) | 0.9437 |
| A + M + Rater Avg (R) | 0.8659 |
| A + M + R + Genre Avg (G) | 0.8656 |
Since there was a strong relationship between the year of movie release and the average yearly rating, release year was added as a fourth feature of the model in addition to the benchmark average. Further improvement to the RMSE was obtained.
| Model | RMSE |
|---|---|
| Overall average (A) | 1.0599 |
| A + Movie Avg (M) | 0.9437 |
| A + M + Rater Avg (R) | 0.8659 |
| A + M + R + Genre Avg (G) | 0.8656 |
| A + M + R + G + Release Year (Y) | 0.8654 |
The model now includes the overall average rating along with average ratings for: individual movies; individual raters; genres; and the year of movie release. Since some movies and raters had very few ratings, their averages are noisy estimates. Regularization, which shrinks such estimates towards zero by means of a tuning parameter lambda, could therefore help to reduce the RMSE.
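As an illustration of how lambda acts, and assuming the regularization follows the penalized least squares form described by Irizarry (2019), the regularized movie term can be written as below. The symbols are introduced here for illustration: $y_{u,i}$ is the rating given by rater $u$ to movie $i$, $\hat{\mu}$ is the overall average rating, and $n_i$ is the number of ratings for movie $i$.

$$\hat{b}_i(\lambda) = \frac{1}{\lambda + n_i} \sum_{u=1}^{n_i} \left( y_{u,i} - \hat{\mu} \right)$$

When $n_i$ is large, $\lambda$ has little influence and the estimate is close to the ordinary average deviation. When $n_i$ is small, the estimate is shrunk towards zero, so a movie with only one or two ratings cannot move the prediction far from the overall average.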
Code from Irizarry (2019, p. 651) was adapted to obtain the ideal lambda value using cross-validation. A plot of the lambda values indicated the ideal lambda was between 4.0 and 5.0. This was confirmed as 4.5.
| Minimum Lambda |
|---|
| 4.5 |
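The search for lambda can be sketched in Python on made-up data (the project itself adapted R code from Irizarry). For brevity the sketch regularizes only the movie and rater terms and scores each lambda on a single validation split rather than with full cross-validation. The grid from 0 to 10 in steps of 0.25, the `fit_predict` helper, and all data are assumptions for illustration.

```python
from collections import defaultdict
from math import sqrt

# Made-up (userId, movieId, rating) rows.
train = [(1, 10, 5.0), (2, 10, 4.0), (1, 20, 3.0), (3, 20, 2.0),
         (2, 30, 4.0), (4, 40, 1.0)]
validation = [(3, 10, 4.0), (2, 20, 2.5), (1, 40, 3.5)]

def rmse(actual, predicted):
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def fit_predict(train_rows, new_rows, lam):
    """Fit penalized movie and rater terms, then predict new_rows."""
    mu = sum(r for _, _, r in train_rows) / len(train_rows)
    movie_sum, movie_n = defaultdict(float), defaultdict(int)
    for _, movie, rating in train_rows:
        movie_sum[movie] += rating - mu
        movie_n[movie] += 1
    # Dividing by (n + lam) shrinks terms based on few ratings towards zero.
    b_i = {m: movie_sum[m] / (movie_n[m] + lam) for m in movie_sum}
    user_sum, user_n = defaultdict(float), defaultdict(int)
    for user, movie, rating in train_rows:
        user_sum[user] += rating - mu - b_i[movie]
        user_n[user] += 1
    b_u = {u: user_sum[u] / (user_n[u] + lam) for u in user_sum}
    return [mu + b_i.get(m, 0.0) + b_u.get(u, 0.0) for u, m, _ in new_rows]

lambdas = [k * 0.25 for k in range(41)]  # 0, 0.25, ..., 10
actual = [r for _, _, r in validation]
scores = [rmse(actual, fit_predict(train, validation, lam)) for lam in lambdas]
best_lambda = lambdas[scores.index(min(scores))]
```

Lambda should be chosen using only the training portion of the data, so that the holdout set remains untouched until the final evaluation.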
Using regularization with a lambda = 4.5 further improved the RMSE.
| Model | RMSE |
|---|---|
| Overall average (A) | 1.0599 |
| A + Movie Avg (M) | 0.9437 |
| A + M + Rater Avg (R) | 0.8659 |
| A + M + R + Genre Avg (G) | 0.8656 |
| A + M + R + G + Release Year (Y) | 0.8654 |
| A + M + R + G + Y + Regularization | 0.8648 |
The goal of an RMSE < 0.86490 had been achieved so this algorithm was applied to the fht dataset.
An algorithm using the overall rating average and adding the average ratings for movies, raters, genres, and movie release year produced an RMSE of 0.8654. With regularization, the RMSE further improved to 0.8648.
This model was applied to the fht (final_holdout_test) dataset. This was the first time the fht dataset was used. As expected, the RMSE improved further when the algorithm was used with the fht dataset.
| Model | RMSE |
|---|---|
| Overall average (A) | 1.0599 |
| A + Movie Avg (M) | 0.9437 |
| A + M + Rater Avg (R) | 0.8659 |
| A + M + R + Genre Avg (G) | 0.8656 |
| A + M + R + G + Release Year (Y) | 0.8654 |
| A + M + R + G + Y + Regularization | 0.8648 |
| Final_Holdout_Test | 0.8383 |
Using the movielens dataset, a machine learning algorithm was developed to predict movie ratings. The final algorithm included the overall average rating, as well as average ratings from each movie, each rater, movie genres, and each movie’s year of release. The project goal of an RMSE < 0.86490 was achieved with regularization of the algorithm.
Given the nonlinearity and skewed nature of the data, the use of regression methods to build the algorithm may be a limitation of this project. Other methods such as K-Nearest-Neighbors and Random Forest may have produced a more accurate algorithm; however, the size of the dataset and hardware limitations prevented the use of alternative methods.
Even though the goal of this project was achieved, further research could explore the use of other methods such as those already mentioned as well as alternatives such as matrix factorization.
Brownlee, J. (2016). Machine learning mastery with R. v1.12. Accessed on 3 June 2024 from https://machinelearningmastery.com/machine-learning-with-r/.
Irizarry, R. A. (2019). Introduction to data science: data analysis and prediction algorithms with R. Accessed on 14 May 2024 from https://leanpub.com/datasciencebook.
Le, J. (2019). Recommendation system series Part 1: An executive guide to building recommendation system. Accessed on 3 June 2024 from https://towardsdatascience.com/recommendation-system-series-part-1-an-executive-guide-to-building-recommendation-system-608f83e2630a.
NVIDIA. (2024). Recommendation system. Accessed on 14 May 2024 from https://www.nvidia.com/en-us/glossary/recommendation-system/.