Irizarry, R. A. (2022, July 7). Introduction to Data Science: Data Analysis and Prediction Algorithms with R. Retrieved June 8, 2022, from https://rafalab.github.io/dsbook/. This project drew on this textbook by our course instructor, Rafael A. Irizarry.
RStudio. (n.d.). R Packages. Retrieved May 18, 2022, from https://www.rstudio.com/products/rpackages/
Definition and History of Recommender Systems. (n.d.). Computer Science, BINUS University. https://international.binus.ac.id/computer-science/2020/11/03/definition-and-history-of-recommender-systems/
Harper, F. M., & Konstan, J. A. (2015). The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS), 5(4), Article 19. https://doi.org/10.1145/2827872
PBS [Crash Course]. (2019, November 22). How YouTube knows what you should watch: Crash Course AI #15 [Video]. YouTube. https://www.youtube.com/watch?v=kiInh5STnyQ&t=3s
PBS [Crash Course]. (2019, November 29). Let's make a movie recommendation system: Crash Course AI #16 [Video]. YouTube. https://www.youtube.com/watch?v=iaIW3CO4rcY&t=661s
GroupLens. (2009, January). MovieLens 10M/100K Data Set README. Retrieved June 1, 2022, from https://files.grouplens.org/datasets/movielens/ml-10m-README.html
Introduction to Loss Functions. (n.d.). DataRobot AI Cloud. https://www.datarobot.com/blog/introduction-to-loss-functions/
Note: Portions of the code within this project were provided by HarvardX and the references listed above. This process could take a couple of minutes:
if(!require(tidyverse)) install.packages(
"tidyverse", repos = "http://cran.us.r-project.org")
if(!require(caret)) install.packages(
"caret", repos = "http://cran.us.r-project.org")
if(!require(data.table)) install.packages(
"data.table", repos = "http://cran.us.r-project.org")
if(!require(RColorBrewer)) install.packages(
"RColorBrewer", repos = "http://cran.us.r-project.org")
if(!require(rmarkdown)) install.packages(
"rmarkdown", repos = "http://cran.us.r-project.org")
if(!require(dslabs)) install.packages(
"dslabs", repos = "http://cran.us.r-project.org")
Note: Due to known loading issues, install the following package(s) if needed:
install.packages("hms", dependencies = TRUE)
install.packages("gtable", dependencies = TRUE)
install.packages("hexbin", dependencies = TRUE)
install.packages("readr", dependencies=TRUE, INSTALL_opts = c('--no-lock'))
install.packages("caret", dependencies = TRUE)
install.packages("data.table", dependencies = TRUE)
install.packages("tidyverse", dependencies = TRUE)
install.packages("Rcolorbrewer", dependencies = TRUE)
install.packages("gt", dependencies = TRUE)
install.packages("curl", dependencies = TRUE)
install.packages("ggpmisc", dependencies = TRUE)
install.packages("knitr", dependencies = TRUE)
install.packages("dplyr", dependencies = TRUE)
Note: this process could take a couple of minutes. Load the following libraries:
library(tidyverse)
library(caret)
library(data.table)
library(RColorBrewer)
library(rmarkdown)
library(dslabs)
library(gtable)
library(hexbin)
library(gt)
library(curl)
library(dplyr)
library(ggpmisc)
library(gridExtra)
library(knitr)
library(lubridate)
library(stringr)
For lower bandwidth or limited RAM, we recommend adjusting the timeout setting:
options(timeout = 320)
Depending on your RAM, we recommend freeing up unused memory with:
gc()
To complete the next steps, verify which version of R your system is utilizing:
sessionInfo()
MovieLens 10M dataset: (https://grouplens.org/datasets/movielens/10m/) (http://files.grouplens.org/datasets/movielens/ml-10m.zip)
dl <- tempfile()
download.file("https://files.grouplens.org/datasets/movielens/ml-10m.zip", dl)
Note: If you receive a "Peer certificate cannot be authenticated" error, we recommend the following:
dl <- tempfile()
options(download.file.method="curl", download.file.extra="-k -L")
download.file("https://files.grouplens.org/datasets/movielens/ml-10m.zip", dl)
ratings <- fread(text = gsub("::", "\t",
readLines(unzip(dl, "ml-10M100K/ratings.dat"))),
col.names = c("userId", "movieId", "rating", "timestamp"))
movies <- str_split_fixed(
readLines(unzip(dl, "ml-10M100K/movies.dat")), "\\::", 3)
colnames(movies) <- c("movieId", "title", "genres")
if using R 3.6 or earlier:
movies <- as.data.frame(
movies) %>% mutate(movieId = as.numeric(levels(movieId))[movieId],
title = as.character(title),
genres = as.character(genres))
if using R 4.0 or later:
movies <- as.data.frame(movies) %>% mutate(movieId = as.numeric(movieId),
title = as.character(title),
genres = as.character(genres))
Join the data below
movielens <- left_join(ratings, movies, by = "movieId")
The validation set will be 10% of the MovieLens data. If using R 3.6 or later, use set.seed(1, sample.kind = "Rounding");
if using R 3.5 or earlier, use set.seed(1)
set.seed(1)
test_index <- createDataPartition(y = movielens$rating, times = 1, p = 0.1, list = FALSE)
edx <- movielens[-test_index,]
Trial_Set <- movielens[test_index,]
Make sure userId and movieId in validation set are also in edx set
validation <- Trial_Set %>% semi_join(
edx, by = "movieId") %>% semi_join(edx, by = "userId")
Add rows removed from validation set back into edx set
removed <- anti_join(Trial_Set, validation)
edx <- rbind(edx, removed)
rm(dl, ratings, movies, test_index, movielens, removed)
Since the late 1970s, corporations have used machine learning recommendation systems to understand user selections, user trends, and consumer demands. In 2022, machine learning algorithms can predict future interests, engagement, current tastes, openness to new products, and more.
For this project I will use the MovieLens 10M dataset from GroupLens, a research lab at the University of Minnesota. GroupLens selected roughly 72,000 users at random, each of whom had rated at least 20 movies, for a combined 10 million ratings (see the Reference Section for more information).
I will clean the data, investigate any NAs, and examine outliers that may skew the data needed to achieve the RMSE goal of 0.86490.
Create a series of visualizations and examine each chart to understand what steps we need to complete to reach our RMSE goal.
Create a trial recommender model to understand the baseline RMSE.
Create a recommendation system based on the code from the Reference Section, utilizing the loss function (RMSE), user effects, and regularization.
Finalize the machine learning algorithm to achieve the RMSE goal.
Let's clean the data and view it in a table
head(edx, 10)
## userId movieId rating timestamp title
## <int> <num> <num> <int> <char>
## 1: 1 122 5 838985046 Boomerang (1992)
## 2: 1 185 5 838983525 Net, The (1995)
## 3: 1 231 5 838983392 Dumb & Dumber (1994)
## 4: 1 292 5 838983421 Outbreak (1995)
## 5: 1 316 5 838983392 Stargate (1994)
## 6: 1 329 5 838983392 Star Trek: Generations (1994)
## 7: 1 355 5 838984474 Flintstones, The (1994)
## 8: 1 356 5 838983653 Forrest Gump (1994)
## 9: 1 362 5 838984885 Jungle Book, The (1994)
## 10: 1 364 5 838983707 Lion King, The (1994)
## genres
## <char>
## 1: Comedy|Romance
## 2: Action|Crime|Thriller
## 3: Comedy
## 4: Action|Drama|Sci-Fi|Thriller
## 5: Action|Adventure|Sci-Fi
## 6: Action|Adventure|Drama|Sci-Fi
## 7: Children|Comedy|Fantasy
## 8: Comedy|Drama|Romance|War
## 9: Adventure|Children|Romance
## 10: Adventure|Animation|Children|Drama|Musical
For comparison purposes later, let's compute the average movie rating
avg_rating <- mean(edx$rating)
avg_rating
## [1] 3.512464
Verify if there are any inconsistencies or NAs
any(is.na(edx))
## [1] FALSE
sum(is.na(edx))
## [1] 0
which(is.na(edx), arr.ind=TRUE)
## row col
Create a Summary Table
MovieLens Recommendation System Overview
HarvardX Capstone Project 2022

userId | movieId | rating | timestamp | title | genres
---|---|---|---|---|---
1 | 122 | 5 | 1996-08-02 11:24:06 | Boomerang (1992) | Comedy|Romance
1 | 185 | 5 | 1996-08-02 10:58:45 | Net, The (1995) | Action|Crime|Thriller
1 | 231 | 5 | 1996-08-02 10:56:32 | Dumb & Dumber (1994) | Comedy
1 | 292 | 5 | 1996-08-02 10:57:01 | Outbreak (1995) | Action|Drama|Sci-Fi|Thriller
1 | 316 | 5 | 1996-08-02 10:56:32 | Stargate (1994) | Action|Adventure|Sci-Fi
1 | 329 | 5 | 1996-08-02 10:56:32 | Star Trek: Generations (1994) | Action|Adventure|Drama|Sci-Fi
1 | 355 | 5 | 1996-08-02 11:14:34 | Flintstones, The (1994) | Children|Comedy|Fantasy
1 | 356 | 5 | 1996-08-02 11:00:53 | Forrest Gump (1994) | Comedy|Drama|Romance|War
1 | 362 | 5 | 1996-08-02 11:21:25 | Jungle Book, The (1994) | Adventure|Children|Romance
1 | 364 | 5 | 1996-08-02 11:01:47 | Lion King, The (1994) | Adventure|Animation|Children|Drama|Musical

Portions of this data are from the Reference Section.
The MovieLens dataset consists of over 10M movie ratings.
Let's take a look at each column and create a summary table for movies, titles, users, and genres
Summary of MovieLens Data Totals
HarvardX Capstone Project 2022

total_movies | total_users | total_genres | total_titles
---|---|---|---
10677 | 69878 | 797 | 10676

Portions of this data are from the Reference Section.
Timestamps, ratings, and years excluded for summary purposes.
Rating scores run from 0.5 through 5. Let's see how the ratings data are distributed
RatingPer<- table(edx$rating)
RatingPer
##
## 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5
## 85420 345935 106379 710998 332783 2121638 792037 2588021 526309 1390541
Looking at the data, ratings of 3, 4, and 5 were given most often, which is consistent with our average of about 3.5.
Each user included in the dataset was supposed to have rated at least 20 movies, but did every user actually do so? Are there any users with fewer than 20 ratings, or more than 20?
table(edx$userId <=20)
##
## FALSE TRUE
## 8997881 2180
sum(edx$userId)
## [1] 322822329510
This first check is not quite what we want: it flags the 2,180 rows where the userId value itself is 20 or less, and the totals confirm that the training set contains 9,000,061 rows. So let us use a different approach to narrow the data down.
Remember the average movie rating is 3.512464, so if we multiply it by the minimum number of movies rated per unique user (20), we get the minimum expected total rating score per userId.
p = 3.512464*20
p
## [1] 70.24928
Userdata_sum1_Lessthan20 <- aggregate(rating ~ userId, data = edx, sum)<=p
table(Userdata_sum1_Lessthan20)
## Userdata_sum1_Lessthan20
## FALSE TRUE
## 135708 4048
R scanned the aggregated data and flagged 4,048 values below the 70.24928 threshold.
x = 4048/9000061
x
## [1] 0.0004497747
Luckily, that amounts to only about 0.045% of the rows in our training set. This is negligible, so we will not take further action to remove the 4,048 userIds.
Average User Ratings via Histogram
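The code that produced this histogram is not reproduced above; below is a minimal sketch of how such a chart could be built with dplyr and ggplot2 (loaded via tidyverse). The bin width of 0.25 and the colors are my assumptions, not values taken from the original figure.
# Sketch: histogram of each user's average rating (assumes edx and avg_rating exist)
edx %>%
  group_by(userId) %>%
  summarise(user_avg = mean(rating)) %>%
  ggplot(aes(user_avg)) +
  geom_histogram(binwidth = 0.25, fill = "steelblue", color = "white") +
  geom_vline(xintercept = avg_rating, linetype = "dashed") +
  labs(x = "Average rating per user", y = "Number of users")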
Let us plot a box plot to see if we have any outliers that the histogram could not display.
Boxplot of Ratings Distribution with the Average Movie Rating
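The boxplot code is likewise omitted; one possible sketch, assuming ggplot2 and the existing edx and avg_rating objects (the styling is my choice):
# Sketch: boxplot of all ratings, with the overall average marked as a dashed line
edx %>%
  ggplot(aes(x = "", y = rating)) +
  geom_boxplot(fill = "lightgray") +
  geom_hline(yintercept = avg_rating, linetype = "dashed", color = "red") +
  labs(x = NULL, y = "Rating")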
As we can see, several outliers fall among the least favorable ratings. Let us plot the ten worst movies based on total ratings and examine the results.
BarPlot of Top 10 Worst Movies by Least Amount of Total Ratings
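The bar-chart code is not shown either; here is a minimal sketch of how the ten lowest-rated, least-rated titles might be pulled and plotted. The object name worst_by_count is mine, and the companion "Top 10 by Highest Total Ratings" chart below follows the same pattern with the ordering reversed.
# Sketch: ten movies with the lowest average rating and fewest total ratings
worst_by_count <- edx %>%
  group_by(movieId, title) %>%
  summarise(n = n(), avg = mean(rating), .groups = "drop") %>%
  arrange(avg, n) %>%
  slice_head(n = 10)
worst_by_count %>%
  ggplot(aes(x = reorder(title, n), y = n)) +
  geom_col(fill = "darkred") +
  coord_flip() +
  labs(x = NULL, y = "Total number of ratings")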
As we can see, the movies listed had only one rating of 0.5 or 1 (the lowest eligible ratings). A movie with only one rating out of thousands of possible ratings from unique users is an outlier in this dataset. Remember, each unique user rated at least 20 movies.
BarPlot of Top 10 Movies by the Highest Total Ratings
Ratio of Genres to Ratings
Note: this process could take a couple of minutes:
GR <- edx %>% group_by(genres) %>% mutate(ratio = mean(rating))
Notice that the majority of the genre-level average ratings fall between 3 and 4.
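The genre figure itself is not reproduced above; here is a sketch of one way the GR object could be visualized with ggplot2 (the bin width is my assumption):
# Sketch: distribution of the average rating per genre combination
GR %>%
  distinct(genres, ratio) %>%
  ggplot(aes(ratio)) +
  geom_histogram(binwidth = 0.1, fill = "darkgreen", color = "white") +
  labs(x = "Average rating per genre combination", y = "Number of genre combinations")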
After cleaning and visualizing the data, we now understand that outliers and data inconsistencies can affect the machine learning algorithm behind a recommender system.
Having an average rating of 3.51 is not enough to judge the accuracy of a recommender system. We will use a loss function, the residual mean squared error (RMSE), to evaluate our model on the train set (edx) and test set (Trial_Set). Per DataRobot (Introduction to Loss Functions, n.d.), for RMSE we take the difference between the predictions and the actual data, square it, average it across the entire dataset, and then take the square root. The formula is
\[RMSE = \sqrt{\frac{1}{N}\sum_{u,i}\left(\hat{y}_{u,i}-y_{u,i}\right)^2}\]
The following explains each symbol and its associated object:
Train set = edx
Test set = Trial_Set
u = user
i = movie
\({y}_{u,i}\) = the actual rating user u gave movie i
\(\hat{y}_{u,i}\) = the predicted rating for user u and movie i
mu = avg_rating
b = bias
b_i = movie_bias
b_u = User_Effect
N = the number of user/movie combinations
\(\sum_{u,i}\) = the sum over all user/movie combinations
\((\hat{y}_{u,i}-{y}_{u,i})^2\) = the squared difference between the prediction and the actual rating
Create a loss function
RMSE <- function(true_ratings, predicted_ratings){
sqrt(mean((true_ratings - predicted_ratings)^2))
}
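To make the loss function concrete, here is a tiny worked example with made-up ratings (the numbers are illustrative only, not from the dataset):
# Sketch: RMSE on a toy example
true_toy <- c(5, 3, 4, 2.5)   # hypothetical true ratings
pred_toy <- c(4.5, 3.5, 4, 3) # hypothetical predictions
RMSE(true_toy, pred_toy)      # sqrt(mean(c(0.5, -0.5, 0, -0.5)^2)) = sqrt(0.1875), about 0.433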
Per the course instructions, we want an RMSE of about 0.86490. Let's adjust and plug in numbers to see the effects by creating a trial-run model.
Execute Trial_Set model
True_NaiveB_RMSE <- RMSE(Trial_Set$rating, avg_rating)
True_NaiveB_RMSE
Trial_One <- rep(1, nrow(Trial_Set))
Trial_One_RMSE<- RMSE(Trial_Set$rating, Trial_One)
Trial_Two <- rep(2, nrow(Trial_Set))
Trial_Two_RMSE<- RMSE(Trial_Set$rating, Trial_Two)
Trial_Three <- rep(3, nrow(Trial_Set))
Trial_Three_RMSE <- RMSE(Trial_Set$rating, Trial_Three)
Trial_Run_Summary <- tibble(
Trial_One_RMSE,
Trial_Two_RMSE,
Trial_Three_RMSE,
avg_rating,
True_NaiveB_RMSE)
Let's put our trial data in a table
Let's compare it to the RMSE goal of 0.86490
Goal_RMSE <- .86490
Goal_RMSE
Trial_Name <- c("Trial_One_RMSE","Trial_Two_RMSE", "Trial_Three_RMSE",
"True_NaiveB_RMSE")
RMSE_Value <- c(2.75547, 1.86982, 1.183146, 1.060651)
Max_RMSE <- c(0, 1, 2, 3)
Trial_Compare<- data.frame(Trials = c("Trial_One_RMSE","Trial_Two_RMSE",
"Trial_Three_RMSE", "True_NaiveB_RMSE"),
RMSE_Value = c(2.75547, 1.86982, 1.183146, 1.060651),
Max_RMSE = c(0, 1, 2, 3))
Trial_Compare_Legend<- data.frame(Trials = c("Trial_One_RMSE","Trial_Two_RMSE",
"Trial_Three_RMSE", "True_NaiveB_RMSE"),
RMSE_Value = c(2.75547, 1.86982, 1.183146, 1.060651))
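The comparison chart referenced in the next paragraph is not reproduced in the code above; a minimal sketch of how it could be drawn from Trial_Compare and Goal_RMSE (the bar geometry, colors, and flipped axes are my assumptions):
# Sketch: trial RMSE values compared against the RMSE goal (dotted reference line)
Trial_Compare %>%
  ggplot(aes(x = Trials, y = RMSE_Value)) +
  geom_col(fill = "steelblue") +
  geom_hline(yintercept = Goal_RMSE, linetype = "dotted", color = "brown") +
  coord_flip() +
  labs(x = NULL, y = "RMSE")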
As you can see, my trial data is far from the brown dotted line, which represents our RMSE goal. Moving forward, we will rely on our train set (edx). By refining the code, I will get closer to the RMSE needed to show that this recommendation system is valuable for our train set (edx) rather than the Trial_Set.
NaiveB_RMSE = bind_rows(method = "The Naive RMSE Mean", RMSE = True_NaiveB_RMSE)
NaiveB_RMSE
## # A tibble: 1 × 2
## method RMSE
## <chr> <dbl>
## 1 The Naive RMSE Mean 1.06
Some movies receive more ratings over time than others. Higher-rated movies could stem from users falling in love with the movie, or certain movies may leave such a lasting effect on a population or culture that they become household names. Being a household name, some users may never have watched the movie; however, since everyone says it is an excellent movie, people may automatically give it a good rating for being a classic. That effect is called movie bias, \({b_i}\), and we will account for it in the algorithm: \({b}\) denotes the bias and \({i}\) indexes the movie. To calculate movie bias we will use the formula: \[Y_{u,i} ={\mu}+{b_i}+\epsilon_{u,i}\]
Plot the average movie bias
Average_Movie_Bias<- edx %>%
group_by(movieId) %>%
summarise(movie_bias = mean(rating - avg_rating))
MA_df <- data.frame(Average_Movie_Bias)
Now let's see if we have any improvement since updating the code.
Clean_predicted_ratings_true_1 <- avg_rating + edx %>%
left_join(Average_Movie_Bias, by = "movieId") %>%
pull(movie_bias)
Clean_predicted_ratings_true_1
RMSE(Clean_predicted_ratings_true_1, edx$rating)
## [1] 0.942368
As we can see, we are getting closer to our goal of 0.86490.
Movie_Bias_Effects_RMSE <- RMSE(Clean_predicted_ratings_true_1, edx$rating)
Movie_Bias_Effects_RMSE
NaiveB_RMSE = bind_rows(NaiveB_RMSE,
data_frame(method = "Movie Bias Effects", RMSE =
Movie_Bias_Effects_RMSE))
NaiveB_RMSE
## # A tibble: 2 × 2
## method RMSE
## <chr> <dbl>
## 1 The Naive RMSE Mean 1.06
## 2 Movie Bias Effects 0.942
As an avid Marvel movie watcher, I am biased toward the Marvel Cinematic Universe. Additionally, I am not too fond of romance movies; if I were forced to rate one, I am sure I would give it a 0.5 rating without watching a single minute of the movie. As the creators of this algorithm, we must account for the user effect across our user base. To formulate this we will use \({b_u}\) to account for user effects, where the user effect is the mean of the ratings per user. We can add this to our model:
\[Y_{u,i} ={\mu}+{b_i}+{b_u}+\epsilon_{u,i}\]
and then update it to reflect each user's tendency toward high or low ratings:
\[\hat{b}_u = \text{mean}\left(Y_{u,i} - \hat{\mu} - \hat{b}_i\right)\]
Average_User_Rating <- edx %>%
left_join(Average_Movie_Bias, by="movieId") %>%
group_by(userId) %>%
summarise(User_Effect = mean(rating - avg_rating - movie_bias))
Clean_predicted_ratings_true_2 <- edx %>%
left_join(Average_Movie_Bias, by='movieId') %>%
left_join(Average_User_Rating, by='userId') %>%
mutate(pred = avg_rating + movie_bias + User_Effect) %>%
pull(pred)
RMSE(Clean_predicted_ratings_true_2, edx$rating)
## [1] 0.8566699
Check the progress
User_Bias_Effects_RMSE <- RMSE(Clean_predicted_ratings_true_2, edx$rating)
User_Bias_Effects_RMSE
NaiveB_RMSE = bind_rows(NaiveB_RMSE,
data_frame(method = "User Bias Effects",
RMSE = User_Bias_Effects_RMSE))
NaiveB_RMSE
## # A tibble: 3 × 2
## method RMSE
## <chr> <dbl>
## 1 The Naive RMSE Mean 1.06
## 2 Movie Bias Effects 0.942
## 3 User Bias Effects 0.857
We are getting closer to our RMSE goal. So far, we have recognized that movie bias and user effects can significantly change our algorithm. To shrink the errors and deviations, we will use regularization, which refines the algorithm by adding penalty terms.
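As a quick illustration of what the penalty does (toy numbers of my own, not taken from the dataset): dividing by n + lambda instead of n pulls the estimated effect of a movie with only a handful of ratings toward zero, while a movie with thousands of ratings is barely affected.
# Sketch: effect of the penalty term on a movie with few vs. many ratings
lambda_toy <- 5
mu_toy     <- 3.5
few  <- c(0.5, 0.5, 1)          # 3 hypothetical ratings, far below the mean
many <- rep(c(3.4, 3.6), 2500)  # 5000 hypothetical ratings near the mean
mean(few - mu_toy)                                 # plain movie effect: about -2.83
sum(few - mu_toy) / (length(few) + lambda_toy)     # regularized effect: about -1.06, shrunk toward 0
mean(many - mu_toy)                                # plain movie effect: 0
sum(many - mu_toy) / (length(many) + lambda_toy)   # regularized effect: still about 0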
Earlier, we determined the ten best and worst movies based on total ratings. In both visualizations, we did not consider user effects or movie bias. Let us implement both to see if the best and worst movies change. We will also see how many ratings each film received in our new lists of the 10 best and 10 worst; a sketch of one possible construction is shown below.
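The code that builds Top_10_Best_Movies and Top_10_Worst_Movies is not reproduced above; the sketch below (with my own object name, bias_ranked) ranks titles by the movie-bias estimate alone and attaches each title's rating count. The original construction may differ.
# Sketch: rank titles by estimated movie bias, keeping each title's rating count
bias_ranked <- Average_Movie_Bias %>%
  left_join(edx %>% count(movieId, title, name = "n_ratings"), by = "movieId") %>%
  arrange(desc(movie_bias))
head(bias_ranked, 10)  # candidate "10 best" by movie bias
tail(bias_ranked, 10)  # candidate "10 worst" by movie bias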
table(Top_10_Best_Movies$`... %>% pull(title)`)
##
## Blue Light, The (Das Blaue Licht) (1932)
## 1
## Constantine's Sword (2007)
## 1
## Fighting Elegy (Kenka erejii) (1966)
## 1
## Hellhounds on My Trail (1999)
## 1
## Human Condition II, The (Ningen no joken II) (1959)
## 2
## Satan's Tango (Sátántangó) (1994)
## 1
## Shadows of Forgotten Ancestors (1964)
## 1
## Sun Alley (Sonnenallee) (1999)
## 1
## Who's Singin' Over There? (a.k.a. Who Sings Over There) (Ko to tamo peva) (1980)
## 2
table(Top_10_Worst_Movies)
## ... %>% pull(title)
## Accused (Anklaget) (2005)
## 1
## Besotted (2001)
## 1
## Disaster Movie (2008)
## 6
## From Justin to Kelly (2003)
## 8
## Grief (1993)
## 1
## Hi-Line, The (1999)
## 1
## Hip Hop Witch, Da (2000)
## 3
## Stacy's Knights (1982)
## 1
## SuperBabies: Baby Geniuses 2 (2004)
## 7
## War of the Worlds 2: The Next Wave (2008)
## 2
As you can see, the number of ratings for these movies is too small to train an algorithm reliably. We could simply guess the user rating selections, but the margin of error would be too high. Instead, we will use penalized regression to optimize the algorithm parameters and control the variability of the movie effects. I will test the regularized estimates using the object lambdas.
Lambda ( \(\lambda\) ) is the eleventh letter of the Greek alphabet, and its symbol is commonly used for wavelength. In the machine learning community, lambda is the common term for the penalty that balances fitting the training data against keeping the model simple (the name also appears elsewhere in computing, such as Amazon's AWS Lambda compute service, but that is unrelated to the tuning parameter used here). I will test \(\lambda\) = 1 and \(\lambda\) = 5 to see which value is a better tuning choice. We will also visualize each lambda to see which one pulls the estimates more strongly toward zero.
Formula for Lambda One:
\(\lambda\) = 1
\[\hat{b}_i(\lambda) = \frac{1}{\lambda + n_i}\sum_{u=1}^{n_i}\left({Y}_{u,i}-\hat{\mu}\right)\]
Formula for Lambda Five:
\(\lambda\) = 5
\[\hat{b}_i(\lambda) = \frac{1}{\lambda + n_i}\sum_{u=1}^{n_i}\left({Y}_{u,i}-\hat{\mu}\right)\]
lambda_one <- 1
avg_movies <- mean(edx$rating)
Average_Regularized_Mov <- edx %>%
group_by(movieId) %>%
summarize(movie_bias = sum(rating - avg_movies)/(n()+lambda_one), n_i = n())
Penalty_Plot <- tibble(original = Average_Movie_Bias$movie_bias,
regularlized = Average_Regularized_Mov$movie_bias,
n = Average_Regularized_Mov$n_i)
lambda_five <- 5
avg_movies <- mean(edx$rating)
Average_Regularized_Mov <- edx %>%
group_by(movieId) %>%
summarize(movie_bias = sum(rating - avg_movies)/(n()+lambda_five), n_i = n())
Comp_Penalty <- tibble(original = Average_Movie_Bias$movie_bias,
regularlized = Average_Regularized_Mov$movie_bias,
n = Average_Regularized_Mov$n_i)
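The side-by-side figure itself does not appear in the code above; here is a minimal sketch of how the two tibbles could be compared using gridExtra (loaded earlier). The point styling is my assumption, and the column name regularlized matches the spelling used in the tibbles above.
# Sketch: original vs. regularized movie-bias estimates for lambda = 1 and lambda = 5
p1 <- Penalty_Plot %>%
  ggplot(aes(original, regularlized, size = sqrt(n))) +
  geom_point(shape = 1, alpha = 0.3) +
  ggtitle("Lambda = 1")
p5 <- Comp_Penalty %>%
  ggplot(aes(original, regularlized, size = sqrt(n))) +
  geom_point(shape = 1, alpha = 0.3) +
  ggtitle("Lambda = 5")
gridExtra::grid.arrange(p1, p5, ncol = 2)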
Observe the side-by-side comparison. Notice that with lambda five the estimates shrink considerably and the weight of the plot sits closer to zero. The lambda-five spread also leaves more room for simplicity: if we pick lambda one, we risk training a more complex algorithm that generalizes less well. We will stick with lambda five. Let's re-evaluate the top-ten lists using the penalized estimates.
Top 10 Best Movies using Penalized Estimates
## ... %>% pull(title)
## Casablanca (1942)
## 10
## Double Indemnity (1944)
## 10
## Godfather, The (1972)
## 10
## Rear Window (1954)
## 10
## Schindler's List (1993)
## 10
## Seven Samurai (Shichinin no samurai) (1954)
## 10
## Shawshank Redemption, The (1994)
## 10
## Sunset Blvd. (a.k.a. Sunset Boulevard) (1950)
## 10
## Usual Suspects, The (1995)
## 10
Top 10 Worst Movies using Penalized Estimates
## ... %>% pull(title)
## Barney's Great Adventure (1998)
## 7
## Carnosaur 3: Primal Species (1996)
## 7
## Disaster Movie (2008)
## 6
## Faces of Death 6 (1996)
## 8
## From Justin to Kelly (2003)
## 8
## Gigli (2003)
## 9
## Glitter (2001)
## 9
## Pokemon 4 Ever (a.k.a. Pokémon 4: The Movie) (2002)
## 9
## Pokémon Heroes (2003)
## 9
## SuperBabies: Baby Geniuses 2 (2004)
## 7
As you can see, the number of ratings per movie changed considerably. Utilizing penalized regression to optimize the algorithm parameters and control the variability of the movie effects has given us better estimates to move us toward our goal RMSE. Let us finish tidying up the code and add the results to our RMSE table.
L5RP <-mean(Comp_Penalty$regularlized)
Clean_predicted_ratings_3 <- edx %>%
left_join(Average_Regularized_Mov, by = "movieId") %>%
mutate(pred = avg_movies + movie_bias + L5RP) %>%
pull(pred)
Clean_predicted_ratings_3
Check the progress and add to table
Regularized_Movie_Effect <- RMSE(Clean_predicted_ratings_3, edx$rating)
Regularized_Movie_Effect
NaiveB_RMSE = bind_rows(NaiveB_RMSE,
data_frame(method = "Regularized Movie Effect",
RMSE = Regularized_Movie_Effect))
NaiveB_RMSE
## # A tibble: 4 × 2
## method RMSE
## <chr> <dbl>
## 1 The Naive RMSE Mean 1.06
## 2 Movie Bias Effects 0.942
## 3 User Bias Effects 0.857
## 4 Regularized Movie Effect 0.980
From the ratings average to the regularized movie effect model, I noticed a considerable change in my machine learning model when I added the penalized (regularized) terms. Here, I will verify my decision to choose lambda five by tuning over a sequence of lambdas to see if I can obtain the goal RMSE.
lambdas <- seq(0, 10, 0.25)
rmses <- sapply(lambdas, function(l){
avg_movies <- mean(edx$rating)
movie_bias <- edx %>%
group_by(movieId) %>%
summarize(movie_bias = sum(rating - avg_movies)/(n()+l))
User_Effects <- edx %>%
left_join(movie_bias, by="movieId") %>%
group_by(userId) %>%
summarize(User_Effects = sum(rating - movie_bias - avg_movies)/(n()+l))
Clean_predicted_ratings_4 <-
validation %>%
left_join(movie_bias, by = "movieId") %>%
left_join(User_Effects, by = "userId") %>%
mutate(pred = avg_movies + movie_bias + User_Effects) %>%
pull(pred)
return(RMSE(validation$rating, Clean_predicted_ratings_4))
})
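The tuning results are easier to judge visually; the original script does not include a plot, but a quick sketch (assuming ggplot2) of the cross-validated RMSE across the candidate lambdas could look like this:
# Sketch: cross-validated RMSE for each candidate lambda, with the minimum marked
tibble(lambda = lambdas, RMSE = rmses) %>%
  ggplot(aes(lambda, RMSE)) +
  geom_point() +
  geom_vline(xintercept = lambdas[which.min(rmses)], linetype = "dashed")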
lambda <- lambdas[which.min(rmses)]
lambda
## [1] 5.5
Final_Movie_Model <- min(rmses)
Final_Movie_Model
NaiveB_RMSE = bind_rows(NaiveB_RMSE,
data_frame(
method = "Finalized Movie Model",
RMSE = Final_Movie_Model))
NaiveB_RMSE
## # A tibble: 5 × 2
## method RMSE
## <chr> <dbl>
## 1 The Naive RMSE Mean 1.06
## 2 Movie Bias Effects 0.942
## 3 User Bias Effects 0.857
## 4 Regularized Movie Effect 0.980
## 5 Finalized Movie Model 0.865
Final results are posted in the table below
MovieLens Recommendation System Final RMSE Model Results
Summary of Each Recommender Prediction Model

Model_Name | RMSE
---|---
The Naive RMSE Mean | 1.0606537
Movie Bias Effects | 0.9423680
User Bias Effects | 0.8566699
Regularized Movie Effect | 0.9797166
Finalized Movie Model | 0.8649857

Portions of this data are from the Reference Section.
We finally met our goals.
We cleaned the data, determined which outliers were statistically significant, and created a series of visualizations to better understand our data.
We created a trial recommender model to understand: