The success of the Netflix challenge to beat its in-house recommendation system has created huge interest, with more and more online sellers and content providers using recommendation systems to improve their business performance.
In this paper we build a similar system using publicly available data to come up with a simple recommendation model. We use RMSE as our key metric of success and target a value lower than 0.8649, using all the features available in the dataset to reach this goal.
Our final model reaches an RMSE of 0.8646, slightly better than the target. We then regularise the data and arrive at a tuning parameter (lambda) of 5.25, which reduces the RMSE further to 0.8643.
In the end we also discuss whether there is any further scope for reducing it.
Recommendation systems deal with recommending a product or assigning a rating to an item. They act like information filtering systems, removing unnecessary information from the data stream before it reaches a human. They are widely used to generate playlists for audiences by companies such as YouTube, Spotify, and Netflix. Companies like Amazon also use them to recommend products to their users. Most of these systems study users by allowing them to rate products. The data thus collected can be used to predict what rating a particular user will give a specific product. Products with a high predicted rating are then recommended to that user, resulting in higher sales and therefore revenue. Netflix ran a huge contest from 2006 to 2009 asking people to design an algorithm that could improve its famous in-house recommendation system ‘Cinematch’ by 10%. Whoever gave the best improvement would be awarded $1 million. For more details on the contest please refer to https://bits.blogs.nytimes.com/2009/09/21/netflix-awards-1-million-prize-and-starts-a-new-contest/.
In this report we attempt to create a simple recommendation model. While the Netflix data is proprietary in nature and not publicly available, this report uses a similar dataset from MovieLens, available here: http://files.grouplens.org/datasets/movielens/ml-10m.zip. MovieLens itself is a research site run by the GroupLens Research group at the University of Minnesota (https://grouplens.org/).
The primary objective is to keep the predicted values as close to the actual values as possible. In order to evaluate how well the model fits the dataset we will calculate the difference between the actual and predicted values. In an ideal situation the difference (error) would be nil, meaning the model predicts all values correctly (this will never happen). We will use the root mean square error (RMSE), a metric that tells us the typical distance (the standard deviation of the errors) between the values predicted by the model and the actual values in the dataset. The lower the RMSE, the better a given model is able to “fit” a dataset. The formula for the RMSE is:
$$RMSE = \sqrt{\frac{\sum_{i=1}^{n}(P_i - O_i)^2}{n}}$$
where $P_i$ is the predicted value for the $i$th observation in the dataset, $O_i$ is the observed value for the $i$th observation, and $n$ is the sample size. We will target an RMSE of 0.8647.
We can define the function as:
RMSE <- function(true_ratings, predicted_ratings){
sqrt(mean((true_ratings - predicted_ratings)^2))
}
Most of the packages that were used come from the tidyverse, a collection of packages that share a common philosophy of tidy data. The other packages used are as follows:-
1) caret :- Short for Classification And Regression Training, this package contains functions to streamline the model training process for complex regression and classification problems. It utilises a number of R packages and loads them as and when needed.
2) data.table :- This package is used to read regular delimited files faster and more conveniently. We will be using ‘fread’ from this package to read the files in the dataset.
3) lubridate :- This package facilitates parsing and manipulation of dates.
4) knitr :- This package is an engine for dynamic reporting and will be used to create the output summaries as well as the HTML report.
As mentioned earlier we will be using the MovieLens dataset, which can be downloaded to a temporary file as shown below.
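A minimal sketch of the download step (the temporary file handle `dl` matches the object we remove later):

dl <- tempfile() # temporary file to hold the downloaded zip archive
download.file("http://files.grouplens.org/datasets/movielens/ml-10m.zip", dl)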
The dataset is a zip file. The contents of the zip file are displayed in the screenshot below:-
The README file provides more details on the dataset; please refer to the screenshot below.
The database contains 10000054 ratings applied to 10681 movies by 71567 users (users were selected at random for inclusion and had rated at least 20 movies).
There are 2 files of particular interest:-
1) The ratings.dat file contains all the ratings, with each line representing one rating of one movie by one user, in the following format: UserID::MovieID::Rating::Timestamp
2) The movies.dat file contains the details of the movies in the following format: MovieID::Title::Genres, with Genres being a pipe-separated list. Both files contain MovieID, which can be used as a common key to combine them.
We will read both files from the zip archive into objects with the right headers, calling them ratings and movies based on their content.
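A sketch of how the two files could be read, assuming the usual `ml-10M100K/` paths inside the archive (the "::" separators are replaced before parsing):

# ratings.dat: UserID::MovieID::Rating::Timestamp
ratings <- fread(text = gsub("::", "\t", readLines(unzip(dl, "ml-10M100K/ratings.dat"))),
                 col.names = c("userId", "movieId", "rating", "timestamp"))
# movies.dat: MovieID::Title::Genres
movies <- str_split_fixed(readLines(unzip(dl, "ml-10M100K/movies.dat")), "\\::", 3)
colnames(movies) <- c("movieId", "title", "genres")
head(movies)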
## movieId title
## [1,] "1" "Toy Story (1995)"
## [2,] "2" "Jumanji (1995)"
## [3,] "3" "Grumpier Old Men (1995)"
## [4,] "4" "Waiting to Exhale (1995)"
## [5,] "5" "Father of the Bride Part II (1995)"
## [6,] "6" "Heat (1995)"
## genres
## [1,] "Adventure|Animation|Children|Comedy|Fantasy"
## [2,] "Adventure|Children|Fantasy"
## [3,] "Comedy|Romance"
## [4,] "Comedy|Drama|Romance"
## [5,] "Comedy"
## [6,] "Action|Crime|Thriller"
The movies object is still a character matrix and needs to be converted to a data frame with the correct column types. Further, the title column has the release year embedded within it; this needs to be extracted.
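One way this could be done (a sketch; the exact regular expressions are an assumption):

# convert the character matrix to a data frame and pull the 4-digit release year out of the title
movies <- as.data.frame(movies, stringsAsFactors = FALSE) %>%
  mutate(movieId = as.numeric(movieId),
         releaseyear = as.numeric(str_sub(str_extract(title, "\\(\\d{4}\\)$"), 2, 5)),
         title = str_remove(title, "\\(\\d{4}\\)$"))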
We can now combine both files into a new object. However, we only need the details of movies that have actually been rated, so we will use ratings as the master table and pull in the movie data accordingly.
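A one-line sketch of the join, keeping ratings as the master table:

movielens <- left_join(ratings, movies, by = "movieId") # only movies that appear in ratings are kept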
Let us do some basic checks on number of distinct movies , users and overall dimensions.
## [1] "The total number of distinct users are 69878"
## [1] "The total number of distinct movies are 10677"
## [1] "The total numbers of rows and columns in the data set are 10000054"
## [2] "The total numbers of rows and columns in the data set are 7"
In the above process we have lost 1689 (71567 - 69878) users and 4 (10681 - 10677) movies. However, the total number of ratings is still 10000054, so these were users and movies without a valid rating. Since we have 69878 users and 10677 movies we could in principle have around 746 million (69878 x 10677) ratings. However, we only have around 10 million (roughly 1.3%), so most users have not rated most of the movies. We can visualise the gaps for a sample of 100 users. Recommendation systems are built to fill in these gaps.
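One way such a gap plot could be sketched (an illustrative example; the `users` object is the one we remove later, and sampling 100 movies is an assumption):

users <- sample(unique(movielens$userId), 100) # 100 users at random
movielens %>%
  filter(userId %in% users) %>%
  filter(movieId %in% sample(unique(movieId), 100)) %>%   # 100 of the movies they rated
  ggplot(aes(factor(movieId), factor(userId))) +
  geom_point(size = 0.5) +                                # a dot marks an existing rating
  theme(axis.text = element_blank(), axis.ticks = element_blank()) +
  labs(x = "Movies", y = "Users")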
Even if, instead of a random sample of 100, we take the 100 most active users and the 100 most rated movies, there are fewer gaps but still plenty of combinations that have not been rated.
Let us see whether all users have rated the top 5 movies.
| userId | Forrest Gump | Jurassic Park | Pulp Fiction | Shawshank Redemption, The | Silence of the Lambs, The |
|---|---|---|---|---|---|
| 1 | 5 | 5 | NA | NA | NA |
| 4 | NA | 5 | NA | NA | NA |
| 5 | NA | NA | NA | NA | 4 |
| 7 | NA | NA | NA | NA | 3 |
| 8 | NA | 3 | NA | NA | 4 |
| 10 | 3 | NA | 2 | NA | 3 |
We can clearly see that there are gaps in these movies too.
Now that we have the base file ready, we can split it into a training set (around 90%) and a test set (the remaining 10%) which will act as a validation set for our models.
set.seed(1, sample.kind="Rounding") # if using R 3.5 or earlier, use `set.seed(1)`
test_index <- createDataPartition(y = movielens$rating, times = 1, p = 0.1, list = FALSE) # index for a 10% test partition
test <- movielens[-test_index,] # training set (named "test" here) with the remaining 90%
temp <- movielens[test_index,]  # temporary test partition; it needs further processing below
As we have randomly selected the partitions, we need to ensure that every userId and movieId in the validation set also appears in the training set; otherwise we will have a mismatch when joining.
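A sketch of the usual way this is handled (the `removed` object is the one we delete below): rows of the test partition whose movie or user is missing from the training set are moved back into the training set.

# keep only validation rows whose movieId and userId also appear in the training set
validation <- temp %>%
  semi_join(test, by = "movieId") %>%
  semi_join(test, by = "userId")
# add the rows removed from the validation set back into the training set
removed <- anti_join(temp, validation)
test <- rbind(test, removed)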
Let us get rid of the files we don’t need.
rm(movielens,movies,ratings,removed,temp,test_index,dl,users)
Now we should have only two objects: "test" (the training set) and "validation" (the hold-out set used to validate), plus the RMSE function we defined earlier. Let us explore the contents of the training set.
## Rows: 9,000,055
## Columns: 7
## $ userId <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2~
## $ movieId <dbl> 122, 185, 292, 316, 329, 355, 356, 362, 364, 370, 377, 420~
## $ rating <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5~
## $ timestamp <int> 838985046, 838983525, 838983421, 838983392, 838983392, 838~
## $ title <chr> "Boomerang ", "Net, The ", "Outbreak ", "Stargate ", "Star~
## $ genres <chr> "Comedy|Romance", "Action|Crime|Thriller", "Action|Drama|S~
## $ releaseyear <dbl> 1992, 1995, 1995, 1994, 1994, 1994, 1994, 1994, 1994, 1994~
Let us validate the training set and see whether its basic parameters match the expected results.
| Sr.No | Test | Actual | Expected | Match |
|---|---|---|---|---|
| 1 | Number of Rows | 9000055 | 9000055 | TRUE |
| 2 | Number of Columns in Test | 7 | 7 | TRUE |
| 3 | Number of Zero Ratings in Test | 0 | 0 | TRUE |
| 4 | Number of Three Ratings in Test | 2121240 | 2121240 | TRUE |
| 5 | Number of Distinct Movies | 10677 | 10677 | TRUE |
| 6 | Number of Distinct Users | 69878 | 69878 | TRUE |
| 7 | Number of Ratings in the Drama Genre | 3910127 | 3910127 | TRUE |
| 8 | Number of Ratings in the Comedy Genre | 3540930 | 3540930 | TRUE |
| 9 | Number of Ratings in the Thriller Genre | 2325899 | 2325899 | TRUE |
| 10 | Number of Ratings in the Romance Genre | 1712100 | 1712100 | TRUE |
| 11 | Movie with the Greatest Number of Ratings | Pulp Fiction | Pulp Fiction | TRUE |
| 12 | Most Given Rating | 4 | 4 | TRUE |
| 13 | Second Most Given Rating | 3 | 3 | TRUE |
| 14 | Third Most Given Rating | 5 | 5 | TRUE |
| 15 | Fourth Most Given Rating | 3.5 | 3.5 | TRUE |
| 16 | Fifth Most Given Rating | 2 | 2 | TRUE |
Please ensure that the Match column shows TRUE in every row before progressing; if it does not, you may have to rerun the code.
In order to reach our target we will build models starting from the simplest and progressively bring in all the available features, testing as we go until we reach our target RMSE.
Let us start with the simplest assumption of all: we fill every gap by predicting the overall mean rating for all movies, regardless of user, with the differences explained by random error. The model looks like this:
$$Y_{u,i} = \mu + \varepsilon_{u,i}$$
with $\varepsilon_{u,i}$ random errors sampled from the same distribution centred at 0 and $\mu$ the "true" rating for all movies. We know that the estimate that minimises the RMSE is the least squares estimate of $\mu$, which in this case is the average of all ratings:
mu <- mean(test$rating) # store the overall mean rating in mu
naive_rmse <- RMSE(validation$rating, mu) # RMSE when we predict mu for every rating
rmse_results <- tibble(method = "Just the average", RMSE = naive_rmse) # start the RMSE results table (used as the final output)
rmse_results
We get an RMSE of 1.06, which is very high. If we were to use any number other than the mean, the RMSE would only go up. Let us now analyse whether we can reduce it. We saw earlier that the most frequent ratings given by users were 4, 3, 5, 3.5 and 2, in descending order; the large numbers of 4 and 3 ratings are probably the main contributors to the mean rating of 3.52. Let us see how this looks in a graph.
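One way such a comparison could be drawn (an illustrative sketch, not necessarily the exact plot used here):

# distribution of the actual ratings versus the single value (mu) we predict for everything
validation %>%
  ggplot(aes(x = rating)) +
  geom_bar(fill = "steelblue") +
  geom_vline(xintercept = mu, colour = "red", linetype = "dashed") +
  labs(x = "Rating", y = "Count", title = "Actual ratings vs the single predicted value (red line)")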
We can clearly see from the graph above that we need to align the ratings as per our model in order to reduce the RMSE.
If we had got all our predictions right (not practically possible) the graph would have looked like this.
Let us see how close we can come to this graph as we move along.
In our initial exploration we saw that not all movies are rated the same and there seem to be strong preferences towards some movies. Let us try to improve our model by bringing in each movie's deviation from the overall mean, which we will call $b_i$. The new model looks like this:
$$Y_{u,i} = \mu + b_i + \varepsilon_{u,i}$$
We can compute each movie's deviation from the overall mean like this:
movie_avgs <- test %>% group_by(movieId) %>% summarize(b_i = mean(rating - mu)) # for each movie, the average of (rating - mu)
Let us see if there is enough variability which we can use in our model.
Note:- a value of +1.5 in the graph above corresponds to a rating of 5, since the overall mean is about 3.5.
We can clearly see that there is a lot of variation and we can use this to improve our model.
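A sketch of how the movie effect model could be scored on the validation set and added to the results table:

predicted_ratings <- validation %>%
  left_join(movie_avgs, by = "movieId") %>%
  mutate(pred = mu + b_i) %>%      # overall mean plus the movie effect
  pull(pred)
model_1_rmse <- RMSE(predicted_ratings, validation$rating)
rmse_results <- bind_rows(rmse_results, tibble(method = "Movie Effect Model", RMSE = model_1_rmse))
rmse_results %>% kable()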
| method | RMSE |
|---|---|
| Just the average | 1.06120 |
| Movie Effect Model | 0.94391 |
We already see a considerable improvement and the RMSE is now below 1. Let us see whether we have been able to substantially spread our rating.
We can see the ratings from our model moving towards the actual ratings, which is reflected in the lower RMSE. Let us now see if we can improve this further by analysing the movies with the biggest gap between the predicted and the actual rating.
validation %>%
  left_join(movie_avgs, by = "movieId") %>%
  mutate(prediction = predicted_ratings, mu = mu, residual = predicted_ratings - rating) %>%
  arrange(desc(abs(residual))) %>%
  select(movieId, title, b_i, mu, prediction, rating, residual) %>%
  slice(1:10)
Let us have a look at what is happening with Pokémon Heroes, which seems to be a relatively obscure title.
There are only 21 ratings, which means this movie was not a blockbuster, and 2 users have given it a perfect score of 5. It seems those users were in a very good mood while watching the movie.
If we look at the next movie, it is one of the more popular movies, with more than 28k ratings.
We can clearly see a user bias, with around 2500 users giving it a rating below 3.5. Some cranky users may rate a good movie lower, while some very generous users just don't care much about the assessment. Let us see if we can bring this into our model.
Let us try to improve our model further by bringing in each user's deviation from the mean, which we will call $b_u$. The new model looks like this:
$$Y_{u,i} = \mu + b_i + b_u + \varepsilon_{u,i}$$
We can compute each user's deviation from the overall mean, after removing the movie effect, like this:
user_avgs <- test %>%left_join(movie_avgs, by='movieId') %>% group_by(userId) %>%
summarize(b_u = mean(rating - mu - b_i))
Let us see if there is enough variability which we can use in our model.
There seems to be considerable variability which we can use. The histogram suggests that positive deviations outnumber negative ones (the bar just after zero is almost as tall as the one at zero). We can use this variability as below.
predicted_ratings <- validation %>%
  left_join(movie_avgs, by = "movieId") %>%
  left_join(user_avgs, by = "userId") %>%
  mutate(pred = mu + b_i + b_u) %>%
  pull(pred)
model_2_rmse <- RMSE(predicted_ratings, validation$rating)
rmse_results <- bind_rows(rmse_results, tibble(method = "Movie plus User Effect Model", RMSE = model_2_rmse))
rmse_results %>%kable()
| method | RMSE |
|---|---|
| Just the average | 1.06120 |
| Movie Effect Model | 0.94391 |
| Movie plus User Effect Model | 0.86535 |
Let us now see how successful we have been in aligning our predicted ratings.
We can see from the graph above that the predicted ratings have come very close to the actual ratings, although some of the predictions have moved out of the valid range (0.5 to 5). Let us see if we can improve the RMSE by limiting the predictions to this range.
predicted_ratings <- replace(predicted_ratings, predicted_ratings < 0.5, 0.5) # floor predictions at 0.5
predicted_ratings <- replace(predicted_ratings, predicted_ratings > 5, 5)     # cap predictions at 5
model_2_rmse <- RMSE(predicted_ratings, validation$rating)
rmse_results <- bind_rows(rmse_results, tibble(method = "Movie plus User Effect Model (Cleaned)", RMSE = model_2_rmse))
rmse_results %>%kable()
| method | RMSE |
|---|---|
| Just the average | 1.06120 |
| Movie Effect Model | 0.94391 |
| Movie plus User Effect Model | 0.86535 |
| Movie plus User Effect Model (Cleaned) | 0.86516 |
Let us now take a look at the movies with the largest difference between actual and prediction.
We have a completely different set of movies. However, we have still managed to predict more than a perfect score for the first movie (the sum of mu, b_i and b_u is more than 5). Let us have a look at what is happening.
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## rating 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5
## 2.5 NA NA NA NA NA 1 NA NA NA NA
## 3 NA 1 NA 4 NA 4 1.0 3 NA 1
## 3.5 NA NA NA 6 2.0 22 12.0 28 8.0 13
## 4 2.0 4 1.0 9 6.0 46 34.0 115 21.0 96
## 4.5 2.0 4 1.0 5 NA 22 3.0 51 20.0 131
## 5 2.0 NA NA NA NA NA NA 5 1.0 12
It looks like our predicted ratings are still too generous and need to be shifted to the left in the table above: for example, there are 20 correctly predicted ratings of 4.5, but for the same actual rating we have predicted 131 ratings as 5. This will be discussed further in the section on regularisation. Have user preferences changed over time? Let us explore that in the next section.
We earlier separated the year a movie was released from the title. Let us see if we can improve our prediction using this feature. Let us first see if there is enough variability for us to use.
We see that, on average, movies that came out after 1993 get more ratings. We also see that with newer movies, starting in 1993, the number of ratings decreases with year: the more recent a movie is, the less time users have had to rate it.
Among movies that came out in 1993 or later, let us look at the top 5 with the highest average number of ratings per year (n/year), and calculate the average rating of each of them.
From the above table, we can see that the most frequently rated movies tend to have above average ratings. This is not surprising: more people watch popular movies. To confirm this, let us stratify the post-1993 movies by ratings per year and compute their average ratings.
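A sketch of how this stratification could be computed, assuming the last rating year in this dataset is 2009:

test %>%
  filter(releaseyear >= 1993) %>%
  group_by(movieId) %>%
  summarize(n = n(),
            years = 2009 - first(releaseyear),   # assumed span over which ratings could accumulate
            rating = mean(rating)) %>%
  mutate(rate = n / pmax(years, 1)) %>%          # ratings per year
  ggplot(aes(rate, rating)) +
  geom_point(alpha = 0.2) +
  geom_smooth()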
We can clearly see the trend: the more often a movie is rated, the higher its average rating. We will not have the same rates available in the validation file, so let us stratify this rate into more usable groups and see if we can make an improvement.
This looks much better. Let us see if there is enough variability to use in our model.
We can see from the graph that the variability explained by the year released is not that large.
predicted_ratings <- validation %>%
  left_join(movie_avgs, by = "movieId") %>%
  left_join(user_avgs, by = "userId") %>%
  left_join(b_t, by = "movieId") %>%
  mutate(pred = mu + b_i + b_u + r) %>%
  pull(pred)
predicted_ratings <- replace(predicted_ratings, predicted_ratings < 0.5, 0.5)
predicted_ratings <- replace(predicted_ratings, predicted_ratings > 5, 5)
print(paste("RMSE =",RMSE(validation$rating,predicted_ratings)))
## [1] "RMSE = 0.865135980113562"
Not much improvement in the RMSE. However, let us see whether we have been able to align more ratings.
We see that the graph has deteriorated, so we can drop this model. We still have the genres and the timestamp of each rating as features; let us look at the timestamp first.
We can break the timestamp into the year the rating was given, as some movies could become popular in a particular year; the week, as more movies are rated during holiday periods; and the weekday, as blockbusters are usually released on the days just before holidays (Fridays) and many users like to go to the first show. Let us see if using these helps reduce the RMSE.
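Before doing so, the timestamp needs to be converted into these features. A sketch (here `test_1` is assumed to be the training set with the extra date columns):

test_1 <- test %>%
  mutate(date = as_datetime(timestamp),  # convert the epoch timestamp to a date-time
         weekday = weekdays(date),
         week = week(date),
         year = year(date))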
Let us start with the weekday
There is some variability which we can use. Let us see if there is an improvement.
week_avgs <- test_1 %>%
  left_join(movie_avgs, by = "movieId") %>%
  left_join(user_avgs, by = "userId") %>%
  group_by(weekday) %>%
  summarize(r = mean(rating - mu - b_i - b_u))
predicted_ratings <- validation %>%
  mutate(date = as_datetime(timestamp), weekday = weekdays(date)) %>%
  left_join(movie_avgs, by = "movieId") %>%
  left_join(user_avgs, by = "userId") %>%
  left_join(week_avgs, by = "weekday") %>%
  mutate(pred = mu + b_i + b_u + r) %>%
  pull(pred)
predicted_ratings <- replace(predicted_ratings, predicted_ratings < 0.5, 0.5)
predicted_ratings <- replace(predicted_ratings, predicted_ratings > 5, 5)
print(paste("RMSE =",RMSE(validation$rating,predicted_ratings)))
## [1] "RMSE = 0.865160186872465"
Not much improvement here. Let us see if we can use the week.
week_avgs <- test_1 %>%
  left_join(movie_avgs, by = "movieId") %>%
  left_join(user_avgs, by = "userId") %>%
  group_by(week) %>%
  summarize(r = mean(rating - mu - b_i - b_u))
predicted_ratings <- validation %>%
  mutate(date = as_datetime(timestamp), week = week(date)) %>%
  left_join(movie_avgs, by = "movieId") %>%
  left_join(user_avgs, by = "userId") %>%
  left_join(week_avgs, by = "week") %>%
  mutate(pred = mu + b_i + b_u + r) %>%
  pull(pred)
predicted_ratings <- replace(predicted_ratings, predicted_ratings < 0.5, 0.5)
predicted_ratings <- replace(predicted_ratings, predicted_ratings > 5, 5)
print(paste("RMSE =",RMSE(validation$rating,predicted_ratings)))
## [1] "RMSE = 0.865156570822928"
Nothing here either. Let us see if the year in which the rating was given helps.
year_avgs <- test_1 %>%
  left_join(movie_avgs, by = "movieId") %>%
  left_join(user_avgs, by = "userId") %>%
  group_by(year) %>%
  summarize(r = mean(rating - mu - b_i - b_u))
predicted_ratings <- validation %>%
  mutate(date = as_datetime(timestamp), year = year(date)) %>%
  left_join(movie_avgs, by = "movieId") %>%
  left_join(user_avgs, by = "userId") %>%
  left_join(year_avgs, by = "year") %>%
  mutate(pred = mu + b_i + b_u + r) %>%
  pull(pred)
predicted_ratings <- replace(predicted_ratings, predicted_ratings < 0.5, 0.5)
predicted_ratings <- replace(predicted_ratings, predicted_ratings > 5, 5)
print(paste("RMSE =",RMSE(validation$rating,predicted_ratings)))
## [1] "RMSE = 0.865149629170259"
No major improvement here either. We also have the genres; let us see if we can use them to improve the RMSE.
We need to separate out the genres before we proceed.
test_1 <-test %>%select(movieId,userId,genres,releaseyear,rating)%>%separate_rows(genres, sep = "\\|")
Let us first inspect the data for genres with a large number of ratings (say, at least 1000).
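A sketch of such an inspection, plotting the average rating (with error bars of two standard errors) for every genre with at least 1000 ratings:

test_1 %>%
  group_by(genres) %>%
  summarize(n = n(), avg = mean(rating), se = sd(rating) / sqrt(n())) %>%
  filter(n >= 1000) %>%
  mutate(genres = reorder(genres, avg)) %>%
  ggplot(aes(x = genres, y = avg, ymin = avg - 2 * se, ymax = avg + 2 * se)) +
  geom_point() +
  geom_errorbar() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(x = "Genre", y = "Average rating")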
There seems to be strong evidence of a genre effect. Let us see if we can use it in our model.
genre_avgs <- test_1 %>%
  left_join(movie_avgs, by = "movieId") %>%
  left_join(user_avgs, by = "userId") %>%
  group_by(genres) %>%
  summarize(b_g = mean(rating - mu - b_i - b_u))
movie_avgs <- test_1 %>%
  left_join(genre_avgs, by = "genres") %>%
  left_join(movie_avgs, by = "movieId") %>%
  group_by(movieId) %>%
  summarise(b_i = mean(b_i), b_g = mean(b_g))
Let us see if there is enough variability which we can use in our model.
We can see from the graph that there is some variability we can use.
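A sketch of how the genre effect (averaged per movie in `movie_avgs` above) could be added to the prediction and scored:

predicted_ratings <- validation %>%
  left_join(movie_avgs, by = "movieId") %>%
  left_join(user_avgs, by = "userId") %>%
  mutate(pred = mu + b_i + b_u + b_g) %>%
  pull(pred)
predicted_ratings <- replace(predicted_ratings, predicted_ratings < 0.5, 0.5)
predicted_ratings <- replace(predicted_ratings, predicted_ratings > 5, 5)
model_3_rmse <- RMSE(predicted_ratings, validation$rating)
rmse_results <- bind_rows(rmse_results, tibble(method = "Movie, User plus Genre Effect Model", RMSE = model_3_rmse))
rmse_results %>% kable()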
| method | RMSE |
|---|---|
| Just the average | 1.06120 |
| Movie Effect Model | 0.94391 |
| Movie plus User Effect Model | 0.86535 |
| Movie plus User Effect Model (Cleaned) | 0.86516 |
| Movie, User plus Genre Effect Model | 0.86504 |
We are almost there. Among the additional features we tried, the release year gave the biggest improvement. Let us combine the release year with the genres and see whether user preferences have changed over time.
We can clearly see that the means have not remained constant throughout the years. Let us see if we can improve the RMSE by bringing in this variability.
| method | RMSE |
|---|---|
| Just the average | 1.06120 |
| Movie Effect Model | 0.94391 |
| Movie plus User Effect Model | 0.86535 |
| Movie plus User Effect Model (Cleaned) | 0.86516 |
| Movie, User plus Genre Effect Model | 0.86504 |
| Movie, User, Genre + Year Effect Model | 0.86464 |
We already see a considerable improvement and the RMSE has now gone below our target.
Our final model looks like this: $$Y_{u,i} = \mu + b_i + b_u + b_{g,y} + \varepsilon_{u,i}$$ with the predicted values limited to the range (0.5, 5). The resulting RMSE is 0.86464, which meets our target. Let us see whether we have been able to substantially spread our ratings.
Let us see how our extreme cases look now.
These are the same movies we saw earlier when we had only included movieId and userId: movies for which we predicted a near perfect score whereas the actual rating is the lowest possible. The table above highlights that the movie averages are very high (b_i for most of them is above 0.5, which by itself would take the predicted rating to around 4). We saw earlier, in our analysis of "Christmas Story, A", that 2 users had rated the movie 0.5, and that is what is causing these extremes. Is it because we do not have a large enough sample size and have therefore ended up skewing the estimate? Let us explore that in the next section.
We noticed in our data exploration that some users participate much more actively in movie reviewing, while others have rated very few movies. On the other hand, some movies are rated very few times. These give noisy estimates that we should not trust. Additionally, the RMSE is sensitive to large errors, which can inflate our residual mean squared error, so we should add a penalty term to give less importance to such effects. Regularisation permits us to penalise large estimates that are formed using small sample sizes. Choosing the penalty term (lambda), however, is an iterative process, described below.
Note:- since using the genres in this analysis is very resource intensive, we will ignore them for now.
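A sketch of the tuning loop, regularising the movie and user effects over a grid of lambda values and keeping the one that minimises the RMSE (the grid itself is an assumption):

lambdas <- seq(0, 10, 0.25)
rmses <- sapply(lambdas, function(l) {
  b_i <- test %>%                                  # regularised movie effects
    group_by(movieId) %>%
    summarize(b_i = sum(rating - mu) / (n() + l))
  b_u <- test %>%                                  # regularised user effects
    left_join(b_i, by = "movieId") %>%
    group_by(userId) %>%
    summarize(b_u = sum(rating - b_i - mu) / (n() + l))
  preds <- validation %>%
    left_join(b_i, by = "movieId") %>%
    left_join(b_u, by = "userId") %>%
    mutate(pred = mu + b_i + b_u) %>%
    pull(pred)
  RMSE(preds, validation$rating)
})
lambda <- lambdas[which.min(rmses)]
print(paste("Lambda =", lambda))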
## [1] "Lambda = 5.25"
We see that a lambda of 5.25 results in the lowest RMSE. Let us use that in our final model and see if the RMSE reduces further.
| method | RMSE |
|---|---|
| Just the average | 1.06120 |
| Movie Effect Model | 0.94391 |
| Movie plus User Effect Model | 0.86535 |
| Movie plus User Effect Model (Cleaned) | 0.86516 |
| Movie, User plus Genre Effect Model | 0.86504 |
| Movie, User, Genre + Year Effect Model | 0.86464 |
| Regularised Movie + User Effects Model | 0.86426 |
To see how the estimates shrink, let’s make a plot of the regularized estimates versus the least squares estimates
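A sketch of such a plot (the object names here are illustrative, not the report's own):

movie_reg_avgs <- test %>%                          # regularised movie effects with the chosen lambda
  group_by(movieId) %>%
  summarize(b_i_reg = sum(rating - mu) / (n() + lambda), n_i = n())
test %>%                                            # unregularised least squares estimates
  group_by(movieId) %>%
  summarize(b_i = mean(rating - mu)) %>%
  left_join(movie_reg_avgs, by = "movieId") %>%
  ggplot(aes(b_i, b_i_reg, size = sqrt(n_i))) +
  geom_point(shape = 1, alpha = 0.5) +
  labs(x = "Least squares estimate", y = "Regularised estimate")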
Looks like we have been able to move quite a few ratings by regularisation.
Other approaches based on recommenderlab, k-means and Random Forest were tried on a small subset of the data; however, due to resource constraints they could not be extended to the full set. Parallel processing with the help of 'foreach' and 'doParallel' also did not get us to a result. Some algorithms, such as PCA and KNN, produced an RMSE worse than our final model and therefore did not make sense to pursue.
After testing a couple of different methods, the target was achieved using only four features. The biggest challenge turned out to be the size of the dataset rather than the accuracy of the prediction (which is counter-intuitive). We reached an RMSE of 0.86464 with our final model $Y_{u,i} = \mu + b_i + b_u + b_{g,y}$, with the predictions limited to the range (0.5, 5), before regularisation.
We could explore other algorithms such as xgboost, TensorFlow or Random Forest, and create an ensemble (taking the best match across the models) to improve the RMSE further. However, this would require at least 32 GB of RAM.