1 Introduction

In this Capstone Project for the HarvardX Professional Certificate in Data Science (PH125.9x), we will explore and visually examine the MovieLens data set of GroupLens Research which features in order to to develop a machine-learning model by producing training and test sets that shall predict movie ratings on a validation set with a RMSE of less than 0.86490. Four models will be developed and compared based on their resulting RMSE.

The RMSE measures the average difference between a statistical model’s predicted values and the actual values. Mathematically, it is the standard deviation of the residuals which represent the distance between the regression line and the data points. RMSE quantifies how dispersed these residuals are, revealing how tightly the observed data clusters around the predicted values.

1.1 Loading the Data

The MovieLens dataset is downloaded from: https://grouplens.org/datasets/movielens/

dl <- "ml-10M100K.zip"
if (!file.exists(dl))
  download.file("https://files.grouplens.org/datasets/movielens/ml-10m.zip", dl)

ratings_file <- "ml-10M100K/ratings.dat"
if (!file.exists(ratings_file))
  unzip(dl, ratings_file)

movies_file <- "ml-10M100K/movies.dat"
if (!file.exists(movies_file))
  unzip(dl, movies_file)

1.2 Initial Data Wrangling

The initial wrangling code is provided by the Course:

ratings <- as.data.frame(str_split(read_lines(ratings_file), fixed("::"), simplify = TRUE), stringsAsFactors = FALSE)

colnames(ratings) <- c("userId", "movieId", "rating", "timestamp")
ratings <- ratings %>%
  mutate(userId = as.integer(userId),
         movieId = as.integer(movieId),
         rating = as.numeric(rating),
         timestamp = as.integer(timestamp))

movies <- as.data.frame(str_split(read_lines(movies_file), fixed("::"), simplify = TRUE),
                        stringsAsFactors = FALSE)
colnames(movies) <- c("movieId", "title", "genres")
movies <- movies %>%
  mutate(movieId = as.integer(movieId))

movielens <- left_join(ratings, movies, by = "movieId")

1.3 Creating two Data Subsets

The MovieLens dataset will be divided into two subsets:

“edx,” serving for training, developing, then selecting the best algorithm;
“final_holdout_test” that will be used for evaluating the RMSE of the selected algorithm.

# Final hold-out test set will be 10% of MovieLens data
set.seed(1, sample.kind = "Rounding") 
test_index <- createDataPartition(y = movielens$rating, times = 1, p = 0.1, list = FALSE)
edx <- movielens[-test_index,]
temp <- movielens[test_index,]

# Make sure userId and movieId in final hold-out test set are also in edx set
final_holdout_test <- temp %>% 
  semi_join(edx, by = "movieId") %>%
  semi_join(edx, by = "userId")

# Add rows removed from final hold-out test set back into edx set
removed <- anti_join(temp, final_holdout_test)
edx <- rbind(edx, removed)

rm(dl, ratings, movies, test_index, temp, movielens, removed)

2 Exploratory Data Analysis

The objective of this EDA is to:

Identify potential errors, better understand patterns within the data, detect outliers or anomalous events, find interesting relations among the variables;
Help defining the most relevant strategy to develop our prediction models.

According to the GroupLens website, hosting the data sets we will use for this project, the MovieLens 10M Dataset, released in January 2009 “contains 10,000,054 ratings and 95,580 tags applied to 10,681 movies by 71,567 users of the online movie recommender service MovieLens.”

2.1 DataSet Structure

We will first check the EDX dataset, its 5 first head rows as well as it structure info:

kable(head(edx, 5), booktabs = TRUE, caption = "Head Rows") %>%
  kable_styling(position = "center", latex_options = c("striped", "scale_down"))

Head Rows
	userId	movieId	rating	timestamp	title	genres
1	1	122	5	838985046	Boomerang (1992)	Comedy\|Romance
2	1	185	5	838983525	Net, The (1995)	Action\|Crime\|Thriller
4	1	292	5	838983421	Outbreak (1995)	Action\|Drama\|Sci-Fi\|Thriller
5	1	316	5	838983392	Stargate (1994)	Action\|Adventure\|Sci-Fi
6	1	329	5	838983392	Star Trek: Generations (1994)	Action\|Adventure\|Drama\|Sci-Fi

EDX Structure

str(edx)

## 'data.frame':    9000055 obs. of  6 variables:
##  $ userId   : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ movieId  : int  122 185 292 316 329 355 356 362 364 370 ...
##  $ rating   : num  5 5 5 5 5 5 5 5 5 5 ...
##  $ timestamp: int  838985046 838983525 838983421 838983392 838983392 838984474 838983653 838984885 838983707 838984596 ...
##  $ title    : chr  "Boomerang (1992)" "Net, The (1995)" "Outbreak (1995)" "Stargate (1994)" ...
##  $ genres   : chr  "Comedy|Romance" "Action|Crime|Thriller" "Action|Drama|Sci-Fi|Thriller" "Action|Adventure|Sci-Fi" ...

There are 6 Variables in this dataset:

userId: Integer. Movielens users that were selected at random for inclusion. Their ids have been anonymised.
movieId: Integer. MovieID is the real MovieLens id.
rating: Numeric. Rating of one movie by one user. Ratings are made on a 5-star scale, with half-star increments.
timestamp: Integer. When the rating was made, expressed in seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.
title: Character. Movies’ titles + year of its release.
genres: Character. Genres are a pipe-separated list, and are selected from the following: Action, Adventure, Animation, Children’s, Comedy, Crime, Documentary, Drama, Fantasy, Film-Noir, Horror, Musical, Mystery, Romance, Sci-Fi, Thriller, War, Western, “no genre listed”.

2.2 Summary Statistics

summary_edx <- summary(edx)

kable(summary_edx, booktabs = TRUE, caption = "Summary Statistics") %>%
  kable_styling(position = "center", latex_options = c("scale_down", "striped"))

Summary Statistics
userId	movieId	rating	timestamp	title	genres
Min. : 1	Min. : 1	Min. :0.500	Min. :7.897e+08	Length:9000055	Length:9000055
1st Qu.:18124	1st Qu.: 648	1st Qu.:3.000	1st Qu.:9.468e+08	Class :character	Class :character
Median :35738	Median : 1834	Median :4.000	Median :1.035e+09	Mode :character	Mode :character
Mean :35870	Mean : 4122	Mean :3.512	Mean :1.033e+09	NA	NA
3rd Qu.:53607	3rd Qu.: 3626	3rd Qu.:4.000	3rd Qu.:1.127e+09	NA	NA
Max. :71567	Max. :65133	Max. :5.000	Max. :1.231e+09	NA	NA

The dataset is clean with no missing/NA values.

We will now check the number of unique values for the “users”, “movies” and “genre” variables.

edx %>%
  summarize('Nb Unique Users' = n_distinct(userId), 
            'Nb Unique Movies' = n_distinct(movieId),
            'Nb Unique Genres' = n_distinct(genres)) %>%
  kable(caption = "Unique Values") %>%
  kable_styling(position = "center", latex_options = c("scale_down", "scale_down"))

Unique Values
Nb Unique Users	Nb Unique Movies	Nb Unique Genres
69878	10677	797

The data contains 9,000,055 ratings applied to 10,677 different movies of 797 different unique or combination of genre from 69,878 single users between 1995 and 2009.

To better perform our analysis and build our models , we will further wrangle the datasets as following:

Convert the “timestamp” column to “datetime” format;
Extract the “Release Year” observation from the “Title” column.
Use the new “Release Year” column to add a “MovieAge” column (using year 2023 to calculate the age).

#Convert the "timestamp" column to "datetime" format
edx <- edx %>%
  mutate(timestamp = as_datetime(timestamp), 
         rating_date = make_date(year(timestamp), month(timestamp))) %>% 
  select(-timestamp) %>% 
  relocate(rating_date, .after = rating) %>%
  as.data.table()

final_holdout_test <- final_holdout_test %>% 
  mutate(timestamp = as_datetime(timestamp), rating_date = make_date(year(timestamp), month(timestamp))) %>% 
  select(-timestamp) %>% 
  relocate(rating_date, .after = rating) %>% 
  as.data.table()

#Extract the "Release Year" observation from the "Title" column
rel_year <- "(?<=\\()\\d{4}(?=\\))"

edx <- edx %>%
  mutate(release_year = str_extract(title, rel_year) %>%
  as.integer()) %>%
  relocate(release_year, .after = title)

final_holdout_test <- final_holdout_test %>%
  mutate(release_year = str_extract(title, rel_year) %>%
  as.integer()) %>%
  relocate(release_year, .after = title)

#Use the new “Release Year” column to add a “MovieAge” column
edx <- edx %>% 
  mutate(movie_age = 2023 - release_year) %>% 
  relocate(movie_age, .after = release_year)

final_holdout_test <- final_holdout_test %>%
  mutate(movie_age = 2023 - release_year) %>% 
  relocate(movie_age, .after = release_year)

We will now move forward with our EDA, looking deeper into some potentially interesting patterns and/or biases that will help at defining our modelling strategy.

2.3 Users

2.3.1 Number of Ratings per User

From the “User” perspective, we will explore the number of ratings made by users.

ratings_per_user <- edx %>%
                    group_by(userId) %>%
                    summarize(nb_ratings = n()) %>%
                    ungroup()

Summary Statistics

sumtable(ratings_per_user, out = 'return')

##     Variable     N  Mean Std. Dev. Min Pctl. 25 Pctl. 75   Max
## 1     userId 69878 35782     20614   1    17943    53620 71567
## 2 nb_ratings 69878   129       195  10       32      141  6616

Contrary to the information provided by GroupLens about the dataset, it is apparent that a number of users actually rated less than 20 movies:

length(which(ratings_per_user$nb_ratings <= 20))

## [1] 4966

It is confirmed that 4,966 users rated \(\leqslant\) 20 movies. This shall be taken in account as a potential bias affecting our prediction models.

We will now visualise the distribution of the number of ratings per user:

ggplot(ratings_per_user, aes(x = nb_ratings)) +
  geom_histogram(bins = 30, fill = "navajowhite4", color = "darkgray") +
  scale_x_log10() +
  labs(title = "Log Transformed Distribution of Number of Ratings per User",
         x = "Log Number of Ratings",
         y = "Number of Users") +
  theme_linedraw()

Right skewed log transformed distribution with median at 62 and mean at 129. It appears that the majority of users rated a number of movies that sits between 25 and 100 which seems somehow low: to be taken in account as a “bias” when developing our prediction models with potentially adding a “user penalty term”.

2.4 Movies

2.4.1 Nb of Movies vs. Nb of Ratings

We will now focus on the relation between movies and their related number of ratings.

ratings_movie <- edx %>%
                    group_by(movieId) %>%
                    summarize(nb_ratings = n()) %>%
                    ungroup()

Summary Statistics

sumtable(ratings_movie, out = 'return')

##     Variable     N  Mean Std. Dev. Min Pctl. 25 Pctl. 75   Max
## 1    movieId 10677 13105     17792   1     2754     8710 65133
## 2 nb_ratings 10677   843      2238   1       30      565 31362

Visualisation of Movies vs. Ratings:

ggplot(ratings_movie, aes(x = nb_ratings)) +
  geom_histogram(bins = 40, fill = "navajowhite4", color = "darkgray") +
  scale_x_log10() +
  labs(title = "Log Transformed Distribution of Number of Ratings vs. Movies",
         x = "Log Number of Ratings",
         y = "Number of Movies") +
  theme_linedraw()

The distribution is not far from being symmetric which tends to show that popular movies are rated more frequently than less popular ones. The fact that there are a number of films with fewer ratings implies potential biases that may affect our recommendation modelling.

2.5 Ratings

2.5.1 Number of Monthly Ratings

Let’s now explore the “Rating” perspective and start with the number of ratings made on a monthly basis;

edx_ratings <- edx %>%
  group_by(rating_date) %>%
  summarize(nb_rating = n())

Summary Statistics

sumtable(edx_ratings, out = 'return')

##    Variable   N  Mean Std. Dev. Min Pctl. 25 Pctl. 75    Max
## 1 nb_rating 157 57325     41199   2    38148    64464 264856

Let’s visualise the distribution of ratings made on a monthly basis:

ggplot(edx_ratings, aes(x = rating_date, y = nb_rating)) +
  geom_point(color = "navajowhite4") +
  scale_x_date(date_breaks = "2 years", date_labels = "%Y") +
  scale_y_continuous(breaks = seq(0, 300000, 50000), labels = scales::comma) +
  labs(title = "Distribution of Monthly Ratings",
         x = "Rating Date",
         y = "Number of Ratings") +
  geom_rug(color = "navajowhite4") +
  geom_smooth(color = "black") +
  theme_linedraw()

We notice an average of approx. 60k movie ratings made on a monthly basis since 1990. This number seems to have stabilized after 2000. We can notice a number of outliers, high and low before 2000, then high around 2000 (up to 264,856 ratings !) and 2005. To be taken in account when developing our prediction models.

2.5.2 Rating Values

Let’s have a look at the rating values and its distribution:

ggplot(edx, aes(rating)) +
  geom_histogram(binwidth = 0.25, fill = "navajowhite4", color = "black") +
  scale_x_continuous(breaks = seq(0.5, 5, 0.5)) +
  scale_y_continuous(breaks = seq(0, 3000000, 500000)) +
  labs(title = "Distribution of Ratings Values",
         x = "Rating Value",
         y = "Number of Ratings") +
  theme_linedraw()

There are 10 different ratings users can award a movie: 0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5 and 5. A rating of “4” is the most popular rating followed by “3”, “5”, “3.5”, etc. while “0.5” is awarded the least frequently.

2.5.3 Average Rating per User

Finally, we will explore the distribution of mean rating made by users.

avg_ratings_per_user <- edx %>%
                        group_by(userId) %>%
                        summarize(average_rating = mean(rating)) %>%
                        ungroup()

Summary Statistics

sumtable(avg_ratings_per_user, out = 'return')

##         Variable     N  Mean Std. Dev. Min Pctl. 25 Pctl. 75   Max
## 1         userId 69878 35782     20614   1    17943    53620 71567
## 2 average_rating 69878   3.6      0.43 0.5      3.4      3.9     5

ggplot(avg_ratings_per_user, aes(x = average_rating)) +
    geom_histogram(bins = 30, fill = "navajowhite4", color = "darkgray") +
    labs(title = "Distribution of Average Ratings per User",
         x = "Average Rating",
         y = "Number of Users") +
  theme_linedraw()

The distribution is symmetric and, as seen previously, centred around 3.5 which shows a tendency for users to favour rating movies they appreciated rather than the one they disliked.

2.6 Movie Age

2.6.1 Mean Rating vs. Movie Age

ratings_avg_age <- edx %>%
                        group_by(movie_age) %>%
                        summarize(average_rating = mean(rating)) %>%
                        ungroup()

Summary Statistics

sumtable(ratings_avg_age, out = 'return')

##         Variable  N Mean Std. Dev. Min Pctl. 25 Pctl. 75 Max
## 1      movie_age 94   62        27  15       38       85 108
## 2 average_rating 94  3.7      0.21 3.3      3.5      3.9 4.1

ggplot(ratings_avg_age, aes(x = movie_age, y = average_rating)) +
  geom_point(color = "navajowhite4") +
  labs(title = "Distribution of Avg Ratings vs. Movie Age",
         x = "Movie Age",
         y = "Avg Rating") +
  geom_rug(color = "navajowhite4") +
  geom_smooth(color = "black") +
  theme_linedraw()

The correlation between the two features seems quite clear with users tending to award higher ratings to older movies rather than to newer releases. While probably interesting, this effect / bias will be taken in account if we don’t reach our RMSE targeted value through simpler models

3 Modelling

Following the EDA we conducted, we will focus on the the effects and/or biases in relation with the “users” and “movies” sections we went through.

First, we will save the following RMSE/Loss function in our environment: \[ RMSE = \sqrt{\frac{1}{N}\displaystyle\sum_{u,i} (\hat{y}_{u,i}-y_{u,i})^{2}} \]

RMSE <- function(true_ratings, predicted_ratings){
  sqrt(mean((true_ratings - predicted_ratings)^2))
}

3.1 “Benchmark” Model

The benchmark model we will use is to naively predict all ratings as the average rating of the training set.

The formula we will use is defined as follow: \(Y_{u, i}\) = predicted rating, \(\mu\) = average rating and \(\epsilon_{u, i}\) = independent errors centred at 0. \[Y_{u, i} = \mu + \epsilon_{u, i}\]

edx_mu <- mean(edx$rating)

RMSE_1 <- RMSE(edx$rating, edx_mu)

result1_table <- tibble(Model = "Benchmark", RMSE = RMSE_1)

result1_table %>% 
  knitr::kable()

Model	RMSE
Benchmark	1.060331

The RMSE for this model is 1.0603, which is unsurprisingly high.This will nevertheless be used as a benchmark for comparison with the different models we will now design

3.2 “Movie Bias” Model

To improve our model we will take in account the idea that some movies are subjectively rated higher than others. Higher ratings are mostly linked to popular movies among users and the opposite is true for unpopular movies. We compute the estimated deviation of each movie’s mean rating from the total mean of all movies \(\mu\). The resulting variable is called \(b_{m}\) with “b” for bias and “m” for movie. The formula we will use is defined as: \[Y_{u, i} = \mu +b_{m}+ \epsilon_{u, i}\] With \(Y_{u, i}\) = predicted rating, \(\mu\) = average rating, \(b_{m}\) = “movie bias” variable and \(\epsilon_{u, i}\) = independent errors centred at 0.

Let’s first visualise the \(b_{m}\) distribution:

movie_bias <- edx %>%
  group_by(movieId) %>%
  summarize(bm = mean(rating - edx_mu))

ggplot(movie_bias, aes(x = bm)) +
    geom_histogram(bins = 30, fill = "navajowhite4", color = "darkgray") +
    labs(title = "Distribution of Movie Bias",
         x = "bm",
         y = "Number of Movies") +
  theme_linedraw()

We will now check the prediction against the edx set to determine the related RMSE:

pred_bm <- edx_mu + edx %>% 
  left_join(movie_bias, by = "movieId") %>% 
  .$bm

RMSE_2 <- RMSE(edx$rating, pred_bm)

result2_table <- tibble(Model = "Movie Bias", RMSE = RMSE_2)

result2_table %>% 
  knitr::kable()

Model	RMSE
Movie Bias	0.9423475

The RMSE for this model is .9423, which is quite an improvement from our benchmark measure but still far from our initial goal.

We have predicted movie rating based on the fact that movies are rated differently by adding the computed \(b_{m}\) to \(\mu\). If an individual movie is on average rated worse than the average rating of all movies \(\mu\) , we predict that it will rated lower that \(\mu\) by \(b_{m}\), the difference of the individual movie average from the total average.

3.3 “Movie & User Biases” Model

This model introduces “User Effect / Bias” that reflects the fact that individual users tend to rate films according to their own standards (which vary widely in distribution). The formula can be defined as follows with \(Y_{u,i}\) = predicted rating, \(\mu\) = the average rating, \(b_{m}\) = the movie effects, \(b_{u}\) = the user effects and \(\epsilon_{u,i}\) = independent errors centred at 0.

\[Y_{u,i} = \mu + b_{m} + b_{u} + \epsilon_{u, i}\] where \(b_{u}\) is the “User Bias” variable.

Let’s first visualise the \(b_{u}\) distribution:

user_bias <- edx %>%
  left_join(movie_bias, by = "movieId") %>%
  group_by(userId) %>% 
  summarize(bu = mean(rating - edx_mu - bm))

ggplot(user_bias, aes(x = bu)) +
    geom_histogram(bins = 30, fill = "navajowhite4", color = "darkgray") +
    labs(title = "Distribution of User Bias",
         x = "bu",
         y = "Number of Movies") +
  theme_linedraw()

We will now check the prediction of this model against the test set to determine the related RMSE:

pred_bu <- edx %>% 
  left_join(movie_bias, by = "movieId") %>% 
  left_join(user_bias, by = "userId") %>% 
  mutate(pred = edx_mu + bm + bu) %>% 
  .$pred

RMSE_3 <- RMSE(edx$rating, pred_bu)

result3_table <- tibble(Model = "Movie & User Biases", RMSE = RMSE_3)

result3_table %>% 
  knitr::kable()

Model	RMSE
Movie & User Biases	0.8567039

The RMSE for this model is 0.8567. Quite an improvement from our latest model but we are not there yet !

3.4 “Regularised Movie & User Biases” Model

We will now introduce the concept of regularisation: the idea is to add a tuning parameter \(\lambda\) to further reduce the RMSE. The idea is to “penalise” outliers from the Movie Bias and User Bias sets which shall optimise the recommendation system.

Define \(\lambda\)

lambdasReg <- seq(0, 10, 0.25)

RMSEreg <- sapply(lambdasReg, function(l){
  
  edx_mu <- mean(edx$rating)
  
  bm <- edx %>%
    group_by(movieId) %>%
    summarize(bm = sum(rating - edx_mu)/(n() + l))
  
  bu <- edx %>%
    left_join(bm, by = 'movieId') %>% 
    group_by(userId) %>%
    summarize(bu = sum(rating - bm - edx_mu)/(n() + l))
  
  predicted_ratings <- edx %>%
    left_join(bm, by = "movieId") %>%
    left_join(bu, by = "userId") %>%
    mutate(pred = edx_mu + bm + bu) %>%
    .$pred
  
return(RMSE(predicted_ratings, edx$rating))
})

Let’s now determine which \(\lambda\) will be best at reducing the RMSE

lambda <- lambdasReg[which.min(RMSEreg)]

lambda

## [1] 0.5

ggplot(mapping = aes(x = lambdasReg, y = RMSEreg)) +
  geom_point(color = "navajowhite4") +
  labs(title = "Distribution of Lambdas",
         x = "Lambda",
         y = "RMSE") +
  theme_linedraw()

We will therefore use 0.5 as tuning parameter \(\lambda\) to further reduce our RMSE of the following “Regularised Movie & User Biases” model:

bm <- edx %>% 
  group_by(movieId) %>%
  summarize(bm = sum(rating - edx_mu)/(n() + lambda))

bu <- edx %>% 
  left_join(bm, by = "movieId") %>%
  group_by(userId) %>%
  summarize(bu = sum(rating - bm - edx_mu)/(n() + lambda))

pred_reg <- edx %>% 
  left_join(bm, by = "movieId") %>%
  left_join(bu, by = "userId") %>%
  mutate(predictions = edx_mu + bm + bu) %>%
  .$predictions

RMSE_4 <- RMSE(edx$rating, pred_reg)

result4_table <- tibble(Model = "Regularised Movie & User Biases", RMSE = RMSE_4)

result4_table %>% 
  knitr::kable()

Model	RMSE
Regularised Movie & User Biases	0.8566952

This model shows a marginal improvement of from the previous one with a RMSE at 0.85669.

Let’s summarize the RMSEs of the different models we developed along this project:

results_table <- tibble(Model = c("Benchmark", "Movie Bias", "Movie & User Biases", "Regularised Movie & User Biases"), RMSE = c(RMSE_1, RMSE_2, RMSE_3, RMSE_4))

results_table %>% 
  knitr::kable()

Model	RMSE
Benchmark	1.0603313
Movie Bias	0.9423475
Movie & User Biases	0.8567039
Regularised Movie & User Biases	0.8566952

4 Results

The “Regularised Movie and User Biases” Model being the one with the lowest RMSE, we will now apply it to the “final_holdout_test” dataset. The fact that the “final_holdout_test” dataset comprises about 10% of the edx dataset observations shall further reduce its RMSE.

The model: \[Y_{u,i} = \mu + b_{m} + b_{u} + \epsilon_{u, i}\] With \(Y_{u,i}\) = predicted rating, \(\mu\) = the average rating, \(b_{m}\) = the movie effects, \(b_{u}\) = the user effects and \(\epsilon_{u,i}\) = independent errors centred at 0.

#Defining "lambda" for "final_holdout_test" dataset

lambdasReg2 <- seq(0, 10, 0.25)

RMSEreg2 <- sapply(lambdasReg2, function(l){
  
  edx_mu2 <- mean(final_holdout_test$rating)
  
  bm <- final_holdout_test %>%
    group_by(movieId) %>%
    summarize(bm = sum(rating - edx_mu2)/(n() + l))
  
  bu <- final_holdout_test %>%
    left_join(bm, by = 'movieId') %>% 
    group_by(userId) %>%
    summarize(bu = sum(rating - bm - edx_mu2)/(n() + l))
  
  predicted_ratings <- final_holdout_test %>%
    left_join(bm, by = "movieId") %>%
    left_join(bu, by = "userId") %>%
    mutate(pred = edx_mu2 + bm + bu) %>%
    .$pred
  
return(RMSE(predicted_ratings, final_holdout_test$rating))
})

#Applying model to "final_holdout_test

bm <- final_holdout_test %>% 
  group_by(movieId) %>%
  summarize(bm = sum(rating - edx_mu)/(n() + lambda))

bu <- final_holdout_test %>% 
  left_join(bm, by = "movieId") %>%
  group_by(userId) %>%
  summarize(bu = sum(rating - bm - edx_mu)/(n() + lambda))

pred_reg <- final_holdout_test %>% 
  left_join(bm, by = "movieId") %>%
  left_join(bu, by = "userId") %>%
  mutate(predictions = edx_mu + bm + bu) %>%
  .$predictions

# Computing RMSE

RMSE_final2 <- RMSE(final_holdout_test$rating, pred_reg)

resultfinal2_table <- tibble(Model = "Regularised Movie & User Biases", RMSE = RMSE_final2)

resultfinal2_table %>% 
  knitr::kable()

Model	RMSE
Regularised Movie & User Biases	0.8258487

With a RMSE at 0.8258, the objective of the exercise has been achieved !

5 Conclusion

While we built a machine learning model to predict movie ratings in the MovieLens Dataset that matches the requirements of this exercise, it is clear that we could have gone further at building a stronger model by taking in account other perspective such as “Movie Age”, “Genre”, etc. We could have as well developped more complex models based on, for example, Distributed Random Forest or many more. We focused on the most basic model development: we achieved our goal.

HarvardX PH125.9x | Data Science
MovieLens Capstone Project

Thierry Laurent-Badin

February 20204

1 Introduction

1.1 Loading the Data

1.2 Initial Data Wrangling

1.3 Creating two Data Subsets

2 Exploratory Data Analysis

2.1 DataSet Structure

2.2 Summary Statistics

2.3 Users

2.3.1 Number of Ratings per User

2.4 Movies

2.4.1 Nb of Movies vs. Nb of Ratings

2.5 Ratings

2.5.1 Number of Monthly Ratings

2.5.2 Rating Values

2.5.3 Average Rating per User

2.6 Movie Age

2.6.1 Mean Rating vs. Movie Age

3 Modelling

3.1 “Benchmark” Model

3.2 “Movie Bias” Model

3.3 “Movie & User Biases” Model

3.4 “Regularised Movie & User Biases” Model

4 Results

5 Conclusion

HarvardX PH125.9x | Data Science MovieLens Capstone Project

Thierry Laurent-Badin

February 20204

1 Introduction

1.1 Loading the Data

1.2 Initial Data Wrangling

1.3 Creating two Data Subsets

2 Exploratory Data Analysis

2.1 DataSet Structure

2.2 Summary Statistics

2.3 Users

2.3.1 Number of Ratings per User

2.4 Movies

2.4.1 Nb of Movies vs. Nb of Ratings

2.5 Ratings

2.5.1 Number of Monthly Ratings

2.5.2 Rating Values

2.5.3 Average Rating per User

2.6 Movie Age

2.6.1 Mean Rating vs. Movie Age

3 Modelling

3.1 “Benchmark” Model

3.2 “Movie Bias” Model

3.3 “Movie & User Biases” Model

3.4 “Regularised Movie & User Biases” Model

4 Results

5 Conclusion

HarvardX PH125.9x | Data Science
MovieLens Capstone Project