Introduction

Recommendation systems are an area of machine learning concerned with building algorithms from large datasets to predict preferences and make recommendations about products such as movies and books (Irizarry, 2019; NVIDIA, 2024). For example, an algorithm can be developed to predict the ratings people give to different movies. Using the predicted ratings, similar movies can then be recommended to different individuals.

Project goal

In this project, the goal was to use a portion of the movielens dataset to develop an algorithm that, when tested on a new portion of the movielens dataset, has an RMSE (root mean squared error) < 0.86490. The RMSE function is described in the Modeling approach section below.

Description of the datasets

Code was provided to generate two datasets. The first dataset is named “edx” and is the dataset that was used to develop the algorithm. The second dataset is named “final_holdout_test” (but was shortened to “fht” for convenience). The second dataset was used to evaluate the algorithm developed with the “edx” dataset by applying the algorithm to the dataset to obtain the RMSE.

The initial dataset that was created was called “movielens”. A brief inspection of these data revealed that there were 10,000,054 rows and 6 columns. The dataset was a data frame. The 6 column names and their classes were “userId” (integer), “movieId” (integer), “rating” (numeric), “timestamp” (integer), “title” (character), and “genres” (character).

The edx and final_holdout_test datasets were created by partitioning the movielens dataset so that the final_holdout_test dataset was 10% of the movielens dataset. After generating these datasets, they were inspected, and the results were as expected. Each dataset was a dataframe with 6 columns. The columns had the same names and structure as movielens. The edx dataset had 9,000,055 rows and the final_holdout_test dataset had 999,999 rows.

Key steps

The steps undertaken to complete this project followed a generic process described for building a recommendation system (Brownlee, 2016; Le, 2019). The general procedure is described below.

1. Define the problem. In this case, the problem was provided in the Capstone Project course.
2. Obtain the data. The data were provided and code for some preliminary data wrangling was also available.
3. Prepare and explore the data. Use descriptive statistics and data visualization to understand the data, including relationships between features and between features and the outcome.
4. Evaluate algorithms. Partition the data into a training set and a test set and evaluate the performance of different algorithms.
5. Select and test the final algorithm. Use a new dataset to test the best performing algorithm.
6. Report results and conclusions. Provide a summary of the process and report the final RMSE.

These steps will be expanded and described in the Methods section.

Methods

The methods section describes the approach taken for initially exploring and understanding the data, and then the modeling approach adopted to derive the algorithm with the lowest RMSE.

Data cleaning

The names and data structure of the edx dataset were described in the previous section. Inspecting the first 10 rows of the dataset confirmed that edx was a tidy dataset with one observation per row.

It was considered that the year in which the rating was recorded, as well as the year in which the movie was released, might provide important contributions to an accurate recommendation system. The rating year, therefore, was extracted from the timestamp data and the timestamp column was deleted. Also, the title column was split so that title and year of release were separate columns. The rating year and year of release columns were converted to the class numeric. The columns were reordered to create what seemed to be a more logical organization of the dataset. The same procedures were applied to the fht dataset to prepare that dataset for application of the final algorithm to obtain the desired RMSE.
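The two year variables described above could be derived along the following lines. This is a minimal base R sketch on a made-up three-row data frame, not the project code. It assumes that timestamp holds seconds since 1970-01-01 UTC and that every title ends with the release year in parentheses, as in "Toy Story (1995)". The names edx_toy, rating_year, and release_year are illustrative.

```r
# Toy stand-in for edx (values are illustrative only)
edx_toy <- data.frame(
  userId    = c(1L, 1L, 2L),
  movieId   = c(1L, 6L, 1L),
  rating    = c(5, 4, 3.5),
  timestamp = c(838985046L, 838983525L, 1141415820L),
  title     = c("Toy Story (1995)", "Heat (1995)", "Toy Story (1995)"),
  stringsAsFactors = FALSE
)

# Rating year: convert seconds since the Unix epoch to a date-time, keep the year
rating_time <- as.POSIXct(edx_toy$timestamp, origin = "1970-01-01", tz = "UTC")
edx_toy$rating_year <- as.numeric(format(rating_time, "%Y"))

# Release year: the four digits inside the final pair of parentheses
pattern <- "^(.*) \\(([0-9]{4})\\)$"
edx_toy$release_year <- as.numeric(sub(pattern, "\\2", edx_toy$title))
edx_toy$title        <- sub(pattern, "\\1", edx_toy$title)

# Drop the raw timestamp and reorder the columns
edx_toy <- edx_toy[, c("userId", "movieId", "title", "release_year",
                       "rating", "rating_year")]
edx_toy
```

Applying the same steps to the holdout data keeps the two datasets structurally identical, which is what allows the final model to be applied to it unchanged.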

A summary of the edx and fht datasets and a check for missing values indicated that there were no missing values and all variables were within their expected ranges. Differences between the mean and median for some of the variables indicated that the data were likely to be skewed.

Data exploration and visualization

With the datasets cleaned and modified, exploratory data analysis (EDA) and visualization were conducted with the edx dataset to better understand the data. Understanding the data helps with the selection of methods for building an accurate algorithm. Code from Irizarry (2019, p. 639) was used to identify the number of distinct individuals providing ratings, the number of distinct movies, and the number of distinct genres. Results indicated that there were 69,878 distinct individuals providing ratings, 10,677 distinct movies, and 797 distinct genres.

Numbers of Unique Values for Raters, Movies, and Genres
Raters	Movies	Genres
69878	10677	797
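Counts of this kind can be obtained as sketched below in base R. The data frame is a made-up stand-in for edx, and the code is not the code from Irizarry (2019). It treats each pipe-separated genres string as one value, which is an assumption, although it seems consistent with 797 distinct values being reported against the 20 individual genre categories tabulated later.

```r
# Toy stand-in for edx (values are illustrative only)
edx_toy <- data.frame(
  userId  = c(1L, 1L, 2L, 3L, 3L),
  movieId = c(10L, 20L, 10L, 20L, 30L),
  genres  = c("Comedy", "Action|Thriller", "Comedy", "Action|Thriller", "Drama"),
  stringsAsFactors = FALSE
)

# Count the distinct values in each column of interest
distinct_counts <- data.frame(
  Raters = length(unique(edx_toy$userId)),
  Movies = length(unique(edx_toy$movieId)),
  Genres = length(unique(edx_toy$genres))
)
distinct_counts
```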

A histogram of the frequency of each rating value was considered useful. To determine an appropriate scale for the y-axis, the frequency of counts for each rating value was obtained.

The histogram indicates that most ratings were 3.0 or above.

It was also of interest to examine the distribution of the average ratings provided by each rater. The average ratings were calculated and a histogram was produced.

The distribution of average ratings appears to be approximately normal, with its centre shifted towards the upper end of the rating scale. This is confirmed with a median = 3.635 and a mean = 3.614.

To inspect the distribution of the number of ratings provided by each rater, the log of the number of ratings was used on the x-axis. The distribution is positively skewed with a small number of raters providing many ratings.

The descriptive statistics indicate that the maximum number of ratings = 6,616 and the minimum = 10 with a mean = 128.8 and a median = 62.

Summary Statistics for Raters’ Ratings
n	mean	sd	median	trimmed	mad	min	max	range	skew	kurtosis	se
69878	128.797	195.06	62	86.864	54.856	10	6616	6606	5.717	73.153	0.738
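The per-rater counts and the log-scale histogram can be sketched as follows. The user ids are simulated so that a few raters contribute most of the rows; the sampling weights, the seed, and the name toy_user_ids are arbitrary choices made for illustration, and the summary statistics will not match the table above.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

# Simulated userId column: low-numbered raters are sampled far more often
toy_user_ids <- sample(1:50, size = 2000, replace = TRUE, prob = (50:1)^3)

# Number of ratings contributed by each rater
ratings_per_rater <- as.vector(table(toy_user_ids))
summary(ratings_per_rater)

# Bin the counts on a log10 scale so the long right tail is compressed;
# call plot(h) to draw the histogram
h <- hist(log10(ratings_per_rater), plot = FALSE)
```

A mean well above the median in the summary output is the same signature of positive skew seen in the table for the real data.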

The descriptive statistics and histogram for the number of ratings per movie indicated these data were also somewhat skewed.

Summary Statistics for Ratings per Movie
n	mean	sd	median	trimmed	mad	min	max	range	skew	kurtosis	se
10677	842.939	2238.481	122	324.588	164.569	1	31362	31361	5.809	45.625	21.664

Once again, a log transformation was used for the x-axis.

The same pattern occurred when examining the number of ratings for the 797 distinct movie genres.

Summary Statistics for Ratings per Genre
n	mean	sd	median	trimmed	mad	min	max	range	skew	kurtosis	se
797	11292.42	45244.96	1459	3909.718	2120.118	2	733296	733294	11.3	156.986	1602.659

The skewness indicated in the summary statistics is confirmed with the histogram.

Examining the number of ratings per year indicated that ratings were provided between 1995 and 2009 although in 1995 only 2 ratings were provided.

There was a strong relationship between a movie’s year of release and the average yearly rating. A plot of the data indicates some nonlinearity.

The correlation coefficient (-0.583) suggests that older movies (earlier year of release) had higher average yearly ratings than movies that were released more recently.

Correlation Between the Average Yearly Rating and the Movie’s Year of Release
Correlation
-0.582716
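One way to compute a correlation of this kind is sketched below in base R. It assumes that "average yearly rating" means the mean rating of all movies sharing a release year, which the text does not state explicitly. The data and the column name release_year are made up for illustration.

```r
# Toy ratings with their movies' release years (values are illustrative only)
toy <- data.frame(
  release_year = c(1950, 1950, 1970, 1970, 1990, 1990, 2005, 2005),
  rating       = c(4.5, 4.0, 4.0, 3.5, 3.5, 3.5, 3.0, 3.5)
)

# Average rating for each release year
yearly_avg <- aggregate(rating ~ release_year, data = toy, FUN = mean)

# Pearson correlation between release year and the yearly average
r <- cor(yearly_avg$release_year, yearly_avg$rating)
r
```

Because Pearson's coefficient only measures linear association, a nonlinear pattern like the one noted in the plot can make it understate the strength of the relationship.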

The data were also examined for the average movie ratings of individual movie genre categories. Wrangling these data enabled descriptive statistics, boxplots, and a treemap to be produced.

Total and Average Rating Scores for Genres
Genres	Total Rating Score	Average Rating
(no genres listed)	25.5	3.64
Action	8760660.5	3.42
Adventure	6668798.0	3.49
Animation	1682105.5	3.60
Children	2522991.5	3.42
Comedy	12169851.0	3.44
Crime	4867304.0	3.67
Documentary	352114.0	3.78
Drama	14362407.5	3.67
Fantasy	3241531.0	3.50
Film-Noir	475542.0	4.01
Horror	2261028.0	3.27
IMAX	30823.5	3.77
Musical	1543196.0	3.56
Mystery	2089757.5	3.68
Romance	6084484.0	3.55
Sci-Fi	4554313.0	3.40
Thriller	8158499.5	3.51
War	1932551.0	3.78
Western	673469.5	3.56

For these data there were more rows than in the original edx dataset because many movies were assigned more than one genre, and each rating was counted once for every genre assigned to the movie.
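The wrangling step that turns combined genre strings into one row per genre can be sketched in base R as follows. It assumes the genres are separated by a pipe character, as in the MovieLens data. The toy data frame and the names toy_long and genre_summary are illustrative, and this is not necessarily how the project code performed the split.

```r
# Toy ratings with combined genre strings (values are illustrative only)
toy <- data.frame(
  movieId = c(1L, 2L, 3L),
  rating  = c(4, 3, 5),
  genres  = c("Action|Thriller", "Comedy", "Comedy|Drama|Romance"),
  stringsAsFactors = FALSE
)

# Split each pipe-separated string into its individual genres
genre_list <- strsplit(toy$genres, "|", fixed = TRUE)

# Repeat every row once per genre it carries, then attach the genre
row_index <- rep(seq_len(nrow(toy)), lengths(genre_list))
toy_long <- toy[row_index, c("movieId", "rating")]
toy_long$genre <- unlist(genre_list)
rownames(toy_long) <- NULL

# Total and average rating per individual genre
total   <- tapply(toy_long$rating, toy_long$genre, sum)
average <- tapply(toy_long$rating, toy_long$genre, mean)
genre_summary <- data.frame(
  genre   = names(total),
  total   = as.vector(total),
  average = as.vector(average)
)
genre_summary
```

Three input rows become six rows here, which illustrates why the long-format data have more rows than the dataset they came from.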

Summary Statistics for Total and Average Ratings of Genres
	n	mean	sd	median	trimmed	mad	min	max	range	skew	kurtosis	se
Total Rating	20	4121572.62	4103759.40	2392009.75	3491771.56	3115090.12	25.50	14362407.50	14362382.00	1.04	0.03	917628.50
Average Rating	20	3.59	0.17	3.56	3.58	0.17	3.27	4.01	0.74	0.48	-0.04	0.04

The boxplots indicate that the medians and distributions for each of the genres were similar. Generally, all genres were rated positively with Film-Noir having the highest interquartile range and Horror having the lowest.

The treemap indicates that genres such as Drama, Comedy, Action, and Thriller had large rating totals, with Film-Noir, Documentary, and IMAX having much smaller totals even though the average ratings for these three genres were high compared to other genres.

Insights gained

The data exploration phase provided important insights to assist with building a suitable algorithm. The variability apparent in ratings for individual movies and ratings given by individual raters will be useful to leverage when building a predictive model. It appears that, generally, people rate movies that they like and do not tend to rate movies they don’t like. This tendency explains the high average rating for the movies. There also appears to be a tendency for older movies to be rated more highly than newer movies. The variability in ratings for different genres and the year in which the movie was released should also enhance a predictive model.

Modeling approach

Recommendation systems can be used for prediction or classification tasks. The current project is concerned with a prediction task, and that informed the approach taken. The modeling approach adopted for this project follows the approach outlined in Irizarry (2019, Section 33.7, pp. 638-44). It began with the simplest model possible and added elements to the model while assessing the impact on the RMSE.

The RMSE function to be used is provided by Irizarry (2019, pp. 640-1) and is defined as:

RMSE <- function(true_ratings, predicted_ratings){ sqrt(mean((true_ratings - predicted_ratings)^2)) }
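As a quick check on the definition, the function can be applied to a handful of made-up ratings for which the answer is easy to verify by hand. The function is restated with base R's lowercase sqrt() so that the block runs on its own; the two vectors are illustrative values only.

```r
RMSE <- function(true_ratings, predicted_ratings) {
  sqrt(mean((true_ratings - predicted_ratings)^2))
}

true_ratings      <- c(4, 3, 5, 2)
predicted_ratings <- c(3.5, 3, 4, 3)

# Squared errors are 0.25, 0, 1, 1; their mean is 0.5625; the square root is 0.75
RMSE(true_ratings, predicted_ratings)
```

An RMSE of 0.75 means the predictions are off by about three quarters of a star in a root-mean-square sense, so larger errors are penalized more heavily than they would be by a simple mean absolute error.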

For the purposes of training and testing, the edx dataset was partitioned into a training dataset and a testing dataset with the testing dataset being 20% of the edx dataset.
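A partition of this kind can be sketched in base R as shown below. The toy data, the seed, and the object names are illustrative. The filtering step, which keeps only test rows whose rater and movie also occur in the training set, is an assumption modeled on common practice for this type of model; the text does not say whether the project code did this.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

# Toy stand-in for edx: 200 ratings from 20 raters on 10 movies
edx_toy <- data.frame(
  userId  = sample(1:20, 200, replace = TRUE),
  movieId = sample(1:10, 200, replace = TRUE),
  rating  = sample(seq(0.5, 5, by = 0.5), 200, replace = TRUE)
)

# Hold out 20% of the rows at random
test_index <- sample(seq_len(nrow(edx_toy)), size = round(0.2 * nrow(edx_toy)))
train_set  <- edx_toy[-test_index, ]
temp       <- edx_toy[test_index, ]

# Keep only test rows whose rater and movie also occur in the training set,
# so that every test row can receive a prediction
keep     <- temp$userId %in% train_set$userId &
            temp$movieId %in% train_set$movieId
test_set <- temp[keep, ]

# Return the dropped rows to the training set so that no data are lost
train_set <- rbind(train_set, temp[!keep, ])
```

With this filtering the test set can end up slightly smaller than the nominal 20%.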

To begin with, a naïve benchmark was calculated using the overall mean rating. Using the mean = 3.512364 to predict ratings in the test_set data produced an RMSE = 1.06. An RMSE of this magnitude is unhelpful for prediction purposes.

Model Building Results
Model	RMSE
Overall average (A)	1.0599

Given the large number of movies, their average ratings were added to the model as a first step in improving the RMSE.

Model Building Results
Model	RMSE
Overall average (A)	1.0599
A + Movie Avg (M)	0.9437

Including the average ratings of movies reduced the RMSE to less than 1.0; however, further improvements were required. Next, the average ratings for individual raters were added to the model.

Model Building Results
Model	RMSE
Overall average (A)	1.0599
A + Movie Avg (M)	0.9437
A + M + Rater Avg (R)	0.8659
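The first three rows of the table can be illustrated with a small worked example. The sketch below follows the general approach of estimating each effect as the average residual left after the previous terms, which is an assumption about how the averages were combined. The data are made up, and the helper lookup() and the names b_i and b_u are introduced here for illustration.

```r
RMSE <- function(true_ratings, predicted_ratings) {
  sqrt(mean((true_ratings - predicted_ratings)^2))
}

# Toy training and test sets (values are illustrative only)
train_set <- data.frame(
  userId  = c(1, 1, 2, 2, 3, 3),
  movieId = c(1, 2, 1, 3, 2, 3),
  rating  = c(5, 3, 4, 2, 4, 3)
)
test_set <- data.frame(
  userId  = c(1, 2, 3),
  movieId = c(3, 2, 1),
  rating  = c(3, 3, 5)
)

# Hypothetical helper: look up an effect by id and return a plain vector
lookup <- function(effects, ids) as.vector(effects[as.character(ids)])

# (A) overall average
mu <- mean(train_set$rating)

# (M) movie effect: average residual per movie after removing mu
b_i <- tapply(train_set$rating - mu, train_set$movieId, mean)

# (R) rater effect: average residual per rater after removing mu and b_i
resid_u <- train_set$rating - mu - lookup(b_i, train_set$movieId)
b_u <- tapply(resid_u, train_set$userId, mean)

# Predictions for each stage of the model
pred_a   <- rep(mu, nrow(test_set))
pred_am  <- mu + lookup(b_i, test_set$movieId)
pred_amr <- pred_am + lookup(b_u, test_set$userId)

rmse_results <- data.frame(
  Model = c("A", "A + M", "A + M + R"),
  RMSE  = c(RMSE(test_set$rating, pred_a),
            RMSE(test_set$rating, pred_am),
            RMSE(test_set$rating, pred_amr))
)
rmse_results
```

In this toy example the RMSE falls at each stage, mirroring the pattern in the table, although the values themselves are not comparable with the project results.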

Adding the average ratings for individual raters further improved the RMSE. The RMSE with movies and raters added to the model approached the goal of < 0.86490.

Given the large number of movie genres, their average ratings were added to the model. Further improvement in the RMSE was obtained; however, the improvement was small.

Model Building Results
Model	RMSE
Overall average (A)	1.0599
A + Movie Avg (M)	0.9437
A + M + Rater Avg (R)	0.8659
A + M + R + Genre Avg (G)	0.8656

Since there was a strong relationship between the year of movie release and the average yearly rating, release year was added as a fourth feature of the model in addition to the benchmark average. Further improvement to the RMSE was obtained.

Model Building Results
Model	RMSE
Overall average (A)	1.0599
A + Movie Avg (M)	0.9437
A + M + Rater Avg (R)	0.8659
A + M + R + Genre Avg (G)	0.8656
A + M + R + G + Release Year (Y)	0.8654
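In the style of the notation used by Irizarry (2019), the model at this stage can be written roughly as shown below. This is a sketch under assumptions: the subscripts g(i) and y(i), denoting the genre combination and release year of movie i, are labels introduced here for illustration, and the release year is assumed to enter as a per-year average rather than as a linear trend.

```latex
Y_{u,i} = \mu + b_i + b_u + b_{g(i)} + b_{y(i)} + \varepsilon_{u,i}
```

Here Y_{u,i} is the rating given by rater u to movie i, \mu is the overall average (A), b_i is the movie effect (M), b_u is the rater effect (R), b_{g(i)} is the genre effect (G), b_{y(i)} is the release year effect (Y), and \varepsilon_{u,i} is the residual error.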

The model now includes the overall average rating along with average ratings for individual movies, individual raters, genres, and the year of movie release. Since some movies and raters had very few ratings, their estimated averages are noisy. Regularization, which shrinks estimates based on few ratings towards the overall average by means of a tuning parameter lambda, could therefore help to reduce the RMSE.

Code from Irizarry (2019, p. 651) was adapted to obtain the ideal lambda value using cross-validation. A plot of the lambda values indicated the ideal lambda was between 4.0 and 5.0. This was confirmed as 4.5.

Lambda Value that Minimizes RMSE
Minimum Lambda
4.5

Using regularization with a lambda = 4.5 further improved the RMSE.
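The idea behind the lambda search can be sketched with a single regularized effect. The example below is a simplified illustration, not the adapted code from Irizarry (2019): it regularizes only a movie effect, tunes lambda on one made-up validation set rather than by cross-validation, and uses an arbitrary grid of candidate values. The names validation_set and regularized_rmse are hypothetical.

```r
RMSE <- function(true_ratings, predicted_ratings) {
  sqrt(mean((true_ratings - predicted_ratings)^2))
}

# Toy data: movie 2 has a single, unusually high rating
train_set <- data.frame(
  movieId = c(1, 1, 1, 1, 2, 3, 3),
  rating  = c(4, 4, 5, 3, 5, 2, 3)
)
validation_set <- data.frame(movieId = c(1, 2, 3), rating = c(4, 3.5, 3))

mu <- mean(train_set$rating)

# Regularized movie effect: sum of residuals divided by (n_i + lambda),
# which shrinks effects estimated from few ratings towards zero
regularized_rmse <- function(lambda) {
  resid_sum <- tapply(train_set$rating - mu, train_set$movieId, sum)
  n_i       <- tapply(train_set$rating, train_set$movieId, length)
  b_i       <- resid_sum / (n_i + lambda)
  predicted <- mu + as.vector(b_i[as.character(validation_set$movieId)])
  RMSE(validation_set$rating, predicted)
}

# Evaluate a grid of candidate lambdas and keep the one with the lowest RMSE
lambdas     <- seq(0, 10, by = 0.25)
rmses       <- sapply(lambdas, regularized_rmse)
best_lambda <- lambdas[which.min(rmses)]
best_lambda
```

With lambda = 0 the single rating for movie 2 is taken at face value, which is why some shrinkage lowers the RMSE in this example. Plotting rmses against lambdas gives the kind of curve from which a minimizing value can be read.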

Results of Model Building
Model	RMSE
Overall average (A)	1.0599
A + Movie Avg (M)	0.9437
A + M + Rater Avg (R)	0.8659
A + M + R + Genre Avg (G)	0.8656
A + M + R + G + Release Year (Y)	0.8654
A + M + R + G + Y + Regularization	0.8648

The goal of an RMSE < 0.86490 had been achieved so this algorithm was applied to the fht dataset.

Results

An algorithm using the overall rating average and adding the average ratings for movies, raters, genres, and movie release year produced an RMSE of 0.8654. With regularization, the RMSE further improved to 0.8648.

This model was applied to the fht (final_holdout_test) dataset. This was the first time the fht dataset had been used. The RMSE improved further when the algorithm was used with the fht dataset.

Results of Model Building
Model	RMSE
Overall average (A)	1.0599
A + Movie Avg (M)	0.9437
A + M + Rater Avg (R)	0.8659
A + M + R + Genre Avg (G)	0.8656
A + M + R + G + Release Year (Y)	0.8654
A + M + R + G + Y + Regularization	0.8648
Final_Holdout_Test	0.8383

Conclusion

Using the movielens dataset, a machine learning algorithm was developed to predict movie ratings. The final algorithm included the overall average rating, as well as average ratings from each movie, each rater, movie genres, and each movie’s year of release. The project goal of an RMSE < 0.86490 was achieved with regularization of the algorithm.

Given the nonlinearity and skewed nature of the data, the use of regression methods to build the algorithm may be a limitation of this project. Other methods, such as K-Nearest-Neighbors and Random Forest, may have produced a more accurate algorithm; however, the size of the dataset and hardware limitations prevented the use of alternative methods.

Even though the goal of this project was achieved, further research could explore the use of other methods such as those already mentioned as well as alternatives such as matrix factorization.

References

Brownlee, J. (2016). Machine learning mastery with R. v1.12. Accessed on 3 June 2024 from https://machinelearningmastery.com/machine-learning-with-r/.

Irizarry, R. A. (2019). Introduction to data science: data analysis and prediction algorithms with R. Accessed on 14 May 2024 from https://leanpub.com/datasciencebook.

Le, J. (2019). Recommendation system series Part 1: An executive guide to building recommendation system. Accessed on 3 June 2024 from https://towardsdatascience.com/recommendation-system-series-part-1-an-executive-guide-to-building-recommendation-system-608f83e2630a.

NVIDIA. (2024). Recommendation system. Accessed on 14 May 2024 from https://www.nvidia.com/en-us/glossary/recommendation-system/.

MovieLens Capstone Project

Timothy A Carey

2024-06-03