In this Capstone Project for the HarvardX Professional Certificate in Data Science (PH125.9x), we explore and visually examine the MovieLens dataset from GroupLens Research in order to develop a machine-learning model. Using training and test sets, the model should predict movie ratings on a validation set with an RMSE of less than 0.86490. Four models will be developed and compared based on their resulting RMSE.
The RMSE measures the average difference between a statistical model’s predicted values and the actual values. Mathematically, it is the standard deviation of the residuals, which represent the distances between the data points and the regression line. RMSE quantifies how dispersed these residuals are, revealing how tightly the observed data clusters around the predicted values.
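For example, predictions of 3.5 and 4 against actual ratings of 3 and 5 leave residuals of 0.5 and \(-1\), giving: \[RMSE = \sqrt{\frac{0.5^{2} + (-1)^{2}}{2}} \approx 0.79\]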
The MovieLens dataset is downloaded from: https://grouplens.org/datasets/movielens/
dl <- "ml-10M100K.zip"
if (!file.exists(dl))
  download.file("https://files.grouplens.org/datasets/movielens/ml-10m.zip", dl)
ratings_file <- "ml-10M100K/ratings.dat"
if (!file.exists(ratings_file))
  unzip(dl, ratings_file)
movies_file <- "ml-10M100K/movies.dat"
if (!file.exists(movies_file))
  unzip(dl, movies_file)
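The code in this report relies on several packages. The exact preamble is not reproduced here; the following minimal sketch lists the packages implied by the functions called throughout (an assumption based on those calls, not the verbatim course setup):
# Packages assumed by the code in this report
library(tidyverse)   # dplyr, ggplot2, stringr, readr: wrangling, plotting, strings
library(caret)       # createDataPartition() for the train/holdout split
library(data.table)  # as.data.table()
library(lubridate)   # as_datetime(), make_date(), year(), month()
library(knitr)       # kable() tables
library(kableExtra)  # kable_styling()
library(vtable)      # sumtable() summary statistics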
The initial wrangling code is provided by the Course:
ratings <- as.data.frame(str_split(read_lines(ratings_file), fixed("::"), simplify = TRUE),
                         stringsAsFactors = FALSE)
colnames(ratings) <- c("userId", "movieId", "rating", "timestamp")
ratings <- ratings %>%
  mutate(userId = as.integer(userId),
         movieId = as.integer(movieId),
         rating = as.numeric(rating),
         timestamp = as.integer(timestamp))
movies <- as.data.frame(str_split(read_lines(movies_file), fixed("::"), simplify = TRUE),
                        stringsAsFactors = FALSE)
colnames(movies) <- c("movieId", "title", "genres")
movies <- movies %>%
  mutate(movieId = as.integer(movieId))
movielens <- left_join(ratings, movies, by = "movieId")
The MovieLens dataset will be divided into two subsets:
“edx,” serving for training, developing, then selecting the best algorithm;
“final_holdout_test” that will be used for evaluating the RMSE of the selected algorithm.
# Final hold-out test set will be 10% of MovieLens data
set.seed(1, sample.kind = "Rounding")
test_index <- createDataPartition(y = movielens$rating, times = 1, p = 0.1, list = FALSE)
edx <- movielens[-test_index,]
temp <- movielens[test_index,]
# Make sure userId and movieId in final hold-out test set are also in edx set
final_holdout_test <- temp %>%
  semi_join(edx, by = "movieId") %>%
  semi_join(edx, by = "userId")
# Add rows removed from final hold-out test set back into edx set
removed <- anti_join(temp, final_holdout_test)
edx <- rbind(edx, removed)
rm(dl, ratings, movies, test_index, temp, movielens, removed)
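As a quick sanity check on the split (a sketch; the exact holdout count is not reported here, but it should be close to 10% of the ~10M ratings):
nrow(edx)                 # 9,000,055 ratings remain for training
nrow(final_holdout_test)  # roughly 1M ratings held out for final evaluation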
The objective of this EDA is to:
Identify potential errors, better understand patterns within the data, detect outliers or anomalous events, find interesting relations among the variables;
Help define the most relevant strategy to develop our prediction models.
According to the GroupLens website, hosting the data sets we will use for this project, the MovieLens 10M Dataset, released in January 2009 “contains 10,000,054 ratings and 95,580 tags applied to 10,681 movies by 71,567 users of the online movie recommender service MovieLens.”
We will first check the edx dataset, looking at its first 5 rows as well as its structure:
kable(head(edx, 5), booktabs = TRUE, caption = "Head Rows") %>%
  kable_styling(position = "center", latex_options = c("striped", "scale_down"))
|   | userId | movieId | rating | timestamp | title | genres |
|---|---|---|---|---|---|---|
| 1 | 1 | 122 | 5 | 838985046 | Boomerang (1992) | Comedy\|Romance |
| 2 | 1 | 185 | 5 | 838983525 | Net, The (1995) | Action\|Crime\|Thriller |
| 4 | 1 | 292 | 5 | 838983421 | Outbreak (1995) | Action\|Drama\|Sci-Fi\|Thriller |
| 5 | 1 | 316 | 5 | 838983392 | Stargate (1994) | Action\|Adventure\|Sci-Fi |
| 6 | 1 | 329 | 5 | 838983392 | Star Trek: Generations (1994) | Action\|Adventure\|Drama\|Sci-Fi |
EDX Structure
str(edx)
## 'data.frame': 9000055 obs. of 6 variables:
## $ userId : int 1 1 1 1 1 1 1 1 1 1 ...
## $ movieId : int 122 185 292 316 329 355 356 362 364 370 ...
## $ rating : num 5 5 5 5 5 5 5 5 5 5 ...
## $ timestamp: int 838985046 838983525 838983421 838983392 838983392 838984474 838983653 838984885 838983707 838984596 ...
## $ title : chr "Boomerang (1992)" "Net, The (1995)" "Outbreak (1995)" "Stargate (1994)" ...
## $ genres : chr "Comedy|Romance" "Action|Crime|Thriller" "Action|Drama|Sci-Fi|Thriller" "Action|Adventure|Sci-Fi" ...
There are 6 Variables in this dataset:
userId: Integer. Movielens users that were selected at random for inclusion. Their ids have been anonymised.
movieId: Integer. The real MovieLens movie id.
rating: Numeric. Rating of one movie by one user. Ratings are made on a 5-star scale, with half-star increments.
timestamp: Integer. When the rating was made, expressed in seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.
title: Character. The movie’s title, followed by its year of release in parentheses.
genres: Character. Genres are a pipe-separated list, and are selected from the following: Action, Adventure, Animation, Children’s, Comedy, Crime, Documentary, Drama, Fantasy, Film-Noir, Horror, Musical, Mystery, Romance, Sci-Fi, Thriller, War, Western, “no genre listed”.
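Because genres are stored as pipe-separated combinations, they can be unnested when a genre-level view is needed; a minimal sketch using tidyr (not used in the models below):
edx %>%
  separate_rows(genres, sep = "\\|") %>%   # one row per individual genre
  distinct(genres) %>%                     # the base genres listed above
  arrange(genres)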
summary_edx <- summary(edx)
kable(summary_edx, booktabs = TRUE, caption = "Summary Statistics") %>%
  kable_styling(position = "center", latex_options = c("scale_down", "striped"))
| userId | movieId | rating | timestamp | title | genres |
|---|---|---|---|---|---|
| Min. : 1 | Min. : 1 | Min. :0.500 | Min. :7.897e+08 | Length:9000055 | Length:9000055 |
| 1st Qu.:18124 | 1st Qu.: 648 | 1st Qu.:3.000 | 1st Qu.:9.468e+08 | Class :character | Class :character |
| Median :35738 | Median : 1834 | Median :4.000 | Median :1.035e+09 | Mode :character | Mode :character |
| Mean :35870 | Mean : 4122 | Mean :3.512 | Mean :1.033e+09 |  |  |
| 3rd Qu.:53607 | 3rd Qu.: 3626 | 3rd Qu.:4.000 | 3rd Qu.:1.127e+09 |  |  |
| Max. :71567 | Max. :65133 | Max. :5.000 | Max. :1.231e+09 |  |  |
The dataset is clean with no missing/NA values.
We will now check the number of unique values for the “users”, “movies” and “genres” variables.
edx %>%
  summarize('Nb Unique Users' = n_distinct(userId),
            'Nb Unique Movies' = n_distinct(movieId),
            'Nb Unique Genres' = n_distinct(genres)) %>%
  kable(caption = "Unique Values") %>%
  kable_styling(position = "center", latex_options = c("striped", "scale_down"))
| Nb Unique Users | Nb Unique Movies | Nb Unique Genres |
|---|---|---|
| 69878 | 10677 | 797 |
The data contains 9,000,055 ratings applied to 10,677 different movies, across 797 unique genres or genre combinations, from 69,878 distinct users, between 1995 and 2009.
To better perform our analysis and build our models, we will further wrangle the datasets as follows:
Convert the “timestamp” column to “datetime” format;
Extract the “Release Year” observation from the “Title” column.
Use the new “Release Year” column to add a “MovieAge” column (using year 2023 to calculate the age).
#Convert the "timestamp" column to "datetime" format
edx <- edx %>%
  mutate(timestamp = as_datetime(timestamp),
         rating_date = make_date(year(timestamp), month(timestamp))) %>%
  select(-timestamp) %>%
  relocate(rating_date, .after = rating) %>%
  as.data.table()
final_holdout_test <- final_holdout_test %>%
  mutate(timestamp = as_datetime(timestamp),
         rating_date = make_date(year(timestamp), month(timestamp))) %>%
  select(-timestamp) %>%
  relocate(rating_date, .after = rating) %>%
  as.data.table()
#Extract the "Release Year" observation from the "Title" column
rel_year <- "(?<=\\()\\d{4}(?=\\))"
edx <- edx %>%
  mutate(release_year = str_extract(title, rel_year) %>% as.integer()) %>%
  relocate(release_year, .after = title)
final_holdout_test <- final_holdout_test %>%
  mutate(release_year = str_extract(title, rel_year) %>% as.integer()) %>%
  relocate(release_year, .after = title)
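As a quick check, the lookaround pattern captures exactly the four digits enclosed in parentheses at the end of a title:
str_extract("Boomerang (1992)", rel_year)
## [1] "1992"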
#Use the new "Release Year" column to add a "MovieAge" column
edx <- edx %>%
  mutate(movie_age = 2023 - release_year) %>%
  relocate(movie_age, .after = release_year)
final_holdout_test <- final_holdout_test %>%
  mutate(movie_age = 2023 - release_year) %>%
  relocate(movie_age, .after = release_year)
We will now move forward with our EDA, looking deeper into potentially interesting patterns and/or biases that will help define our modelling strategy.
From the “User” perspective, we will explore the number of ratings made by users.
ratings_per_user <- edx %>%
  group_by(userId) %>%
  summarize(nb_ratings = n()) %>%
  ungroup()
Summary Statistics
sumtable(ratings_per_user, out = 'return')
## Variable N Mean Std. Dev. Min Pctl. 25 Pctl. 75 Max
## 1 userId 69878 35782 20614 1 17943 53620 71567
## 2 nb_ratings 69878 129 195 10 32 141 6616
Contrary to the information provided by GroupLens about the dataset, it appears that a number of users actually rated 20 movies or fewer:
length(which(ratings_per_user$nb_ratings <= 20))
## [1] 4966
It is confirmed that 4,966 users rated \(\leqslant\) 20 movies. This must be taken into account as a potential bias affecting our prediction models.
We will now visualise the distribution of the number of ratings per user:
ggplot(ratings_per_user, aes(x = nb_ratings)) +
  geom_histogram(bins = 30, fill = "navajowhite4", color = "darkgray") +
  scale_x_log10() +
  labs(title = "Log Transformed Distribution of Number of Ratings per User",
       x = "Log Number of Ratings",
       y = "Number of Users") +
  theme_linedraw()
The log-transformed distribution is right-skewed, with a median of 62 and a mean of 129 ratings per user. The majority of users rated between 25 and 100 movies, which seems rather low: this should be taken into account as a potential bias when developing our prediction models, possibly by adding a “user penalty term”.
We will now focus on the relation between movies and their related number of ratings.
ratings_movie <- edx %>%
  group_by(movieId) %>%
  summarize(nb_ratings = n()) %>%
  ungroup()
Summary Statistics
sumtable(ratings_movie, out = 'return')
## Variable N Mean Std. Dev. Min Pctl. 25 Pctl. 75 Max
## 1 movieId 10677 13105 17792 1 2754 8710 65133
## 2 nb_ratings 10677 843 2238 1 30 565 31362
Visualisation of Movies vs. Ratings:
ggplot(ratings_movie, aes(x = nb_ratings)) +
  geom_histogram(bins = 40, fill = "navajowhite4", color = "darkgray") +
  scale_x_log10() +
  labs(title = "Log Transformed Distribution of Number of Ratings vs. Movies",
       x = "Log Number of Ratings",
       y = "Number of Movies") +
  theme_linedraw()
The distribution is not far from symmetric, which suggests that popular movies are rated more frequently than less popular ones. The presence of a number of films with very few ratings implies potential biases that may affect our recommendation modelling.
Let’s now explore the “Rating” perspective, starting with the number of ratings made on a monthly basis:
edx_ratings <- edx %>%
  group_by(rating_date) %>%
  summarize(nb_rating = n())
Summary Statistics
sumtable(edx_ratings, out = 'return')
## Variable N Mean Std. Dev. Min Pctl. 25 Pctl. 75 Max
## 1 nb_rating 157 57325 41199 2 38148 64464 264856
Let’s visualise the distribution of ratings made on a monthly basis:
ggplot(edx_ratings, aes(x = rating_date, y = nb_rating)) +
  geom_point(color = "navajowhite4") +
  scale_x_date(date_breaks = "2 years", date_labels = "%Y") +
  scale_y_continuous(breaks = seq(0, 300000, 50000), labels = scales::comma) +
  labs(title = "Distribution of Monthly Ratings",
       x = "Rating Date",
       y = "Number of Ratings") +
  geom_rug(color = "navajowhite4") +
  geom_smooth(color = "black") +
  theme_linedraw()
We notice an average of approx. 57k movie ratings per month since 1995. This number seems to have stabilised after 2000. We can also notice a number of outliers, both high and low, before 2000, then high spikes around 2000 (up to 264,856 ratings in a single month!) and around 2005. These should be taken into account when developing our prediction models.
Let’s have a look at the rating values and its distribution:
ggplot(edx, aes(rating)) +
  geom_histogram(binwidth = 0.25, fill = "navajowhite4", color = "black") +
  scale_x_continuous(breaks = seq(0.5, 5, 0.5)) +
  scale_y_continuous(breaks = seq(0, 3000000, 500000)) +
  labs(title = "Distribution of Ratings Values",
       x = "Rating Value",
       y = "Number of Ratings") +
  theme_linedraw()
There are 10 different ratings users can award a movie: 0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5 and 5. A rating of “4” is the most popular rating followed by “3”, “5”, “3.5”, etc. while “0.5” is awarded the least frequently.
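This ordering can be verified directly by tabulating the rating values (a sketch; the counts themselves are not reproduced in this report):
edx %>%
  count(rating) %>%    # number of ratings per value
  arrange(desc(n))     # most frequent value first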
Finally, we will explore the distribution of the mean rating given by each user.
avg_ratings_per_user <- edx %>%
  group_by(userId) %>%
  summarize(average_rating = mean(rating)) %>%
  ungroup()
Summary Statistics
sumtable(avg_ratings_per_user, out = 'return')
## Variable N Mean Std. Dev. Min Pctl. 25 Pctl. 75 Max
## 1 userId 69878 35782 20614 1 17943 53620 71567
## 2 average_rating 69878 3.6 0.43 0.5 3.4 3.9 5
ggplot(avg_ratings_per_user, aes(x = average_rating)) +
  geom_histogram(bins = 30, fill = "navajowhite4", color = "darkgray") +
  labs(title = "Distribution of Average Ratings per User",
       x = "Average Rating",
       y = "Number of Users") +
  theme_linedraw()
The distribution is roughly symmetric and, consistent with the overall mean rating seen previously, centred around 3.6, which shows a tendency for users to rate movies they appreciated rather than ones they disliked. We will also examine how the average rating relates to movie age.
ratings_avg_age <- edx %>%
  group_by(movie_age) %>%
  summarize(average_rating = mean(rating)) %>%
  ungroup()
Summary Statistics
sumtable(ratings_avg_age, out = 'return')
## Variable N Mean Std. Dev. Min Pctl. 25 Pctl. 75 Max
## 1 movie_age 94 62 27 15 38 85 108
## 2 average_rating 94 3.7 0.21 3.3 3.5 3.9 4.1
ggplot(ratings_avg_age, aes(x = movie_age, y = average_rating)) +
  geom_point(color = "navajowhite4") +
  labs(title = "Distribution of Avg Ratings vs. Movie Age",
       x = "Movie Age",
       y = "Avg Rating") +
  geom_rug(color = "navajowhite4") +
  geom_smooth(color = "black") +
  theme_linedraw()
The correlation between the two features seems quite clear, with users tending to award higher ratings to older movies than to newer releases. While probably interesting, this effect/bias will only be taken into account if we don’t reach our target RMSE through simpler models.
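If needed, the strength of this relationship can be quantified with a simple Pearson correlation on the aggregated data (a sketch; the coefficient is not reported in this study):
cor(ratings_avg_age$movie_age, ratings_avg_age$average_rating)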
Following the EDA we conducted, we will focus on the effects and/or biases related to the “users” and “movies” dimensions we explored.
First, we will save the following RMSE/Loss function in our environment: \[ RMSE = \sqrt{\frac{1}{N}\displaystyle\sum_{u,i} (\hat{y}_{u,i}-y_{u,i})^{2}} \]
RMSE <- function(true_ratings, predicted_ratings){
  sqrt(mean((true_ratings - predicted_ratings)^2))
}
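As a quick illustration on toy values, two predictions each off by half a star yield an RMSE of 0.5:
RMSE(true_ratings = c(3, 4), predicted_ratings = c(3.5, 4.5))
## [1] 0.5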
Our benchmark model naively predicts all ratings as the average rating of the training set.
The formula we will use is defined as follows, with \(Y_{u, i}\) = predicted rating, \(\mu\) = average rating and \(\epsilon_{u, i}\) = independent errors centred at 0: \[Y_{u, i} = \mu + \epsilon_{u, i}\]
edx_mu <- mean(edx$rating)
RMSE_1 <- RMSE(edx$rating, edx_mu)
result1_table <- tibble(Model = "Benchmark", RMSE = RMSE_1)
result1_table %>%
knitr::kable()
| Model | RMSE |
|---|---|
| Benchmark | 1.060331 |
The RMSE for this model is 1.0603, which is unsurprisingly high. Note that the overall mean is the constant prediction that minimises the RMSE, so this is the best any single-value model can do. It will nevertheless be used as a benchmark for comparison with the different models we will now design.
To improve our model, we will take into account the idea that some movies are subjectively rated higher than others: higher ratings are mostly linked to movies that are popular among users, and the opposite is true for unpopular movies. We compute the estimated deviation of each movie’s mean rating from the overall mean of all movies \(\mu\). The resulting variable is called \(b_{m}\), with “b” for bias and “m” for movie. The formula we will use is defined as: \[Y_{u, i} = \mu + b_{m} + \epsilon_{u, i}\] with \(Y_{u, i}\) = predicted rating, \(\mu\) = average rating, \(b_{m}\) = the “movie bias” variable and \(\epsilon_{u, i}\) = independent errors centred at 0.
Let’s first visualise the \(b_{m}\) distribution:
movie_bias <- edx %>%
  group_by(movieId) %>%
  summarize(bm = mean(rating - edx_mu))
ggplot(movie_bias, aes(x = bm)) +
  geom_histogram(bins = 30, fill = "navajowhite4", color = "darkgray") +
  labs(title = "Distribution of Movie Bias",
       x = "bm",
       y = "Number of Movies") +
  theme_linedraw()
We will now check the prediction against the edx set to determine the related RMSE:
pred_bm <- edx_mu + edx %>%
  left_join(movie_bias, by = "movieId") %>%
  .$bm
RMSE_2 <- RMSE(edx$rating, pred_bm)
result2_table <- tibble(Model = "Movie Bias", RMSE = RMSE_2)
result2_table %>%
knitr::kable()
| Model | RMSE |
|---|---|
| Movie Bias | 0.9423475 |
The RMSE for this model is 0.9423, which is quite an improvement over our benchmark measure but still far from our initial goal.
We have predicted movie ratings based on the fact that movies are rated differently, by adding the computed \(b_{m}\) to \(\mu\). If an individual movie is on average rated worse than the overall average \(\mu\), we predict that it will be rated lower than \(\mu\) by \(b_{m}\), the difference between the individual movie’s average and the overall average.
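For instance, with \(\mu \approx 3.512\), a movie whose mean rating is 3.0 gets \(b_{m} = 3.0 - 3.512 = -0.512\), and its predicted rating is \(\mu + b_{m} = 3.0\).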
This model introduces a “User Effect / Bias” term that reflects the fact that individual users tend to rate films according to their own standards, which vary widely from user to user. The formula can be defined as follows, with \(Y_{u,i}\) = predicted rating, \(\mu\) = the average rating, \(b_{m}\) = the movie effects, \(b_{u}\) = the user effects and \(\epsilon_{u,i}\) = independent errors centred at 0.
\[Y_{u,i} = \mu + b_{m} + b_{u} + \epsilon_{u, i}\] where \(b_{u}\) is the “User Bias” variable.
Let’s first visualise the \(b_{u}\) distribution:
user_bias <- edx %>%
  left_join(movie_bias, by = "movieId") %>%
  group_by(userId) %>%
  summarize(bu = mean(rating - edx_mu - bm))
ggplot(user_bias, aes(x = bu)) +
  geom_histogram(bins = 30, fill = "navajowhite4", color = "darkgray") +
  labs(title = "Distribution of User Bias",
       x = "bu",
       y = "Number of Users") +
  theme_linedraw()
We will now check the prediction of this model against the edx set to determine the related RMSE:
pred_bu <- edx %>%
  left_join(movie_bias, by = "movieId") %>%
  left_join(user_bias, by = "userId") %>%
  mutate(pred = edx_mu + bm + bu) %>%
  .$pred
RMSE_3 <- RMSE(edx$rating, pred_bu)
result3_table <- tibble(Model = "Movie & User Biases", RMSE = RMSE_3)
result3_table %>%
knitr::kable()
| Model | RMSE |
|---|---|
| Movie & User Biases | 0.8567039 |
The RMSE for this model is 0.8567, quite an improvement over our previous model, but we are not there yet!
We will now introduce the concept of regularisation: the idea is to add a tuning parameter \(\lambda\) that “penalises” movie and user effects estimated from few ratings, shrinking these outliers in the Movie Bias and User Bias sets, which should optimise the recommendation system and further reduce the RMSE.
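Formally, each effect is estimated with a penalised denominator, so that movies or users with few ratings are shrunk toward zero (the notation mirrors the code below): \[\hat{b}_{m} = \frac{1}{n_{m} + \lambda}\sum_{i=1}^{n_{m}} (y_{u,i} - \hat{\mu}), \qquad \hat{b}_{u} = \frac{1}{n_{u} + \lambda}\sum_{i=1}^{n_{u}} (y_{u,i} - \hat{\mu} - \hat{b}_{m})\] where \(n_{m}\) and \(n_{u}\) are the numbers of ratings for movie \(m\) and by user \(u\).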
Define \(\lambda\)
lambdasReg <- seq(0, 10, 0.25)
RMSEreg <- sapply(lambdasReg, function(l){
  edx_mu <- mean(edx$rating)
  bm <- edx %>%
    group_by(movieId) %>%
    summarize(bm = sum(rating - edx_mu)/(n() + l))
  bu <- edx %>%
    left_join(bm, by = 'movieId') %>%
    group_by(userId) %>%
    summarize(bu = sum(rating - bm - edx_mu)/(n() + l))
  predicted_ratings <- edx %>%
    left_join(bm, by = "movieId") %>%
    left_join(bu, by = "userId") %>%
    mutate(pred = edx_mu + bm + bu) %>%
    .$pred
  return(RMSE(predicted_ratings, edx$rating))
})
Let’s now determine which \(\lambda\) is best at reducing the RMSE:
lambda <- lambdasReg[which.min(RMSEreg)]
lambda
## [1] 0.5
ggplot(mapping = aes(x = lambdasReg, y = RMSEreg)) +
  geom_point(color = "navajowhite4") +
  labs(title = "Distribution of Lambdas",
       x = "Lambda",
       y = "RMSE") +
  theme_linedraw()
We will therefore use 0.5 as the tuning parameter \(\lambda\) to further reduce the RMSE of the following “Regularised Movie & User Biases” model:
bm <- edx %>%
  group_by(movieId) %>%
  summarize(bm = sum(rating - edx_mu)/(n() + lambda))
bu <- edx %>%
  left_join(bm, by = "movieId") %>%
  group_by(userId) %>%
  summarize(bu = sum(rating - bm - edx_mu)/(n() + lambda))
pred_reg <- edx %>%
  left_join(bm, by = "movieId") %>%
  left_join(bu, by = "userId") %>%
  mutate(predictions = edx_mu + bm + bu) %>%
  .$predictions
RMSE_4 <- RMSE(edx$rating, pred_reg)
result4_table <- tibble(Model = "Regularised Movie & User Biases", RMSE = RMSE_4)
result4_table %>%
knitr::kable()
| Model | RMSE |
|---|---|
| Regularised Movie & User Biases | 0.8566952 |
This model shows a marginal improvement over the previous one, with an RMSE of 0.85669.
Let’s summarize the RMSEs of the different models we developed throughout this project:
results_table <- tibble(Model = c("Benchmark", "Movie Bias", "Movie & User Biases", "Regularised Movie & User Biases"), RMSE = c(RMSE_1, RMSE_2, RMSE_3, RMSE_4))
results_table %>%
knitr::kable()
| Model | RMSE |
|---|---|
| Benchmark | 1.0603313 |
| Movie Bias | 0.9423475 |
| Movie & User Biases | 0.8567039 |
| Regularised Movie & User Biases | 0.8566952 |
The “Regularised Movie & User Biases” model being the one with the lowest RMSE, we will now apply it to the “final_holdout_test” dataset. Note that the “final_holdout_test” dataset comprises only about 10% as many observations as the edx dataset, which may further reduce the resulting RMSE.
The model: \[Y_{u,i} = \mu + b_{m} + b_{u} + \epsilon_{u, i}\] With \(Y_{u,i}\) = predicted rating, \(\mu\) = the average rating, \(b_{m}\) = the movie effects, \(b_{u}\) = the user effects and \(\epsilon_{u,i}\) = independent errors centred at 0.
#Defining "lambda" for "final_holdout_test" dataset
lambdasReg2 <- seq(0, 10, 0.25)
RMSEreg2 <- sapply(lambdasReg2, function(l){
  edx_mu2 <- mean(final_holdout_test$rating)
  bm <- final_holdout_test %>%
    group_by(movieId) %>%
    summarize(bm = sum(rating - edx_mu2)/(n() + l))
  bu <- final_holdout_test %>%
    left_join(bm, by = 'movieId') %>%
    group_by(userId) %>%
    summarize(bu = sum(rating - bm - edx_mu2)/(n() + l))
  predicted_ratings <- final_holdout_test %>%
    left_join(bm, by = "movieId") %>%
    left_join(bu, by = "userId") %>%
    mutate(pred = edx_mu2 + bm + bu) %>%
    .$pred
  return(RMSE(predicted_ratings, final_holdout_test$rating))
})
#Applying the model to "final_holdout_test"
bm <- final_holdout_test %>%
  group_by(movieId) %>%
  summarize(bm = sum(rating - edx_mu)/(n() + lambda))
bu <- final_holdout_test %>%
  left_join(bm, by = "movieId") %>%
  group_by(userId) %>%
  summarize(bu = sum(rating - bm - edx_mu)/(n() + lambda))
pred_reg <- final_holdout_test %>%
  left_join(bm, by = "movieId") %>%
  left_join(bu, by = "userId") %>%
  mutate(predictions = edx_mu + bm + bu) %>%
  .$predictions
# Computing RMSE
RMSE_final2 <- RMSE(final_holdout_test$rating, pred_reg)
resultfinal2_table <- tibble(Model = "Regularised Movie & User Biases", RMSE = RMSE_final2)
resultfinal2_table %>%
knitr::kable()
| Model | RMSE |
|---|---|
| Regularised Movie & User Biases | 0.8258487 |
With an RMSE of 0.8258, the objective of the exercise has been achieved!
While we built a machine-learning model to predict movie ratings in the MovieLens dataset that meets the requirements of this exercise, it is clear that we could have built a stronger model by taking into account other perspectives such as “Movie Age”, “Genre”, etc. We could also have developed more complex models based on, for example, Distributed Random Forests, among others. We focused on the most basic model development and achieved our goal.