Machine Learning - MovieLens Project
1. Introduction
Recommendation systems rely on the notion of similarity between objects. Nowadays it has been applied to recommend movies, books, hotels, restaurants, doctors, etc. According to Forte and Miller (2017) customers can be considered like each other if they share most of the products that they have purchased, and items can be considered similar to each other if they share a large number of customers who purchased them.
In the real-world, Recommendation systems have been used by e-commerce websites such as Amazon.com, Netflix, iTunes to improve revenue. Customers are guided to the right product and receive recommendations on products with similar characteristics.
This project is a requirement for the HarvardX Professional Certificate Data Science Program which aims to predict movies ratings using “MovieLens 10M Dataset” https://grouplens.org/datasets/movielens/10m/, http://files.grouplens.org/datasets/movielens/ml-10m.zip.
github: https://github.com/silpai/Machine-Learning---MovieLens-
2. Methodology and Analysis
Several data transformations were implemented to create new categories for:
- movie rating decade - transform timestamp into year and after categorizing the ratings in the 1990s and 2000s
- release decade: split the title and retrieve the release year, then categorize into decades from 1910s to 2000s
- rating group: if the star rating was half or whole star
- single genre: remove all the genres combination
- date: timestamp was also transformed into YYYY-MM-DD
- Validations were preformed to verify if the data wrangling was implemented correctly
Exploratory Analysis (EDA) were developed to get more insights for the recommendation system and have a better understanding of the data.
Modeling was used to create the predictions. Models were built from the training data (edx further split into edx_train and edx_test), and then assessed to the test data (validation):
- Random prediction,
- Linear Models: Linear regression using edx, followed by predictions using train_set to create the model for Only mean of the ratings, Adding Movie effects (bi), Adding Movie + User Effect (bu), Adding Movie + User + time Effect (bt) . Both test_set and validation were used to predict and calculate the RMSE
- Matrix factorization (recosystem)
Final metric is the root mean square estimate (RMSE). A lower RMSE means a better, more accurate prediction. According to Perrier (2017) RMSE is defined as the sum of the squares of the difference between the real values and the predicted values. The closer the predictions are to the real values, the lower the RMSE is. A lower RMSE means a better, more accurate prediction. The aim for successful metrics is to reach RMSE Capstone target < 0.8649.
2.1 Data Transformation
During the process of data transformation, a series of validations occurred to ensure retrieving the correct data. For example, after splitting the title to get the release year, it was found that the generated premier_date retrieve values were between 1000 and 9000. It was clear that the split numbers related to the title were retrieved as release year. Re-coding was necessary to fix this issue.
#1.Transformed edx
edx_transf<-edx %>%
mutate(year_rated = year(as_datetime(timestamp)), # transform timestamp to year
premier_date=as.numeric(stri_extract(title, regex = "(\\d{4})", comments = TRUE))) # extract date from title
#2. Validated transformation (split of the title into premier_date)
start_premier_date=min(edx_transf$premier_date) # movies release start year
end_premier_date=max(edx_transf$premier_date) # movies release end year
#3.Identified release year issues
premier_year_tofix <- edx_transf %>% group_by(movieId) %>%
filter (premier_date<1910|premier_date>2009) %>%
select(movieId, title, premier_date)
unique(premier_year_tofix)## # A tibble: 17 x 3
## # Groups: movieId [17]
## movieId title premier_date
## <dbl> <chr> <dbl>
## 1 6290 House of 1000 Corpses (2003) 1000
## 2 1422 Murder at 1600 (1997) 1600
## 3 671 Mystery Science Theater 3000: The Movie (1996) 3000
## 4 8198 1000 Eyes of Dr. Mabuse, The (Tausend Augen des Dr. Mab~ 1000
## 5 2311 2010: The Year We Make Contact (1984) 2010
## 6 6645 THX 1138 (1971) 1138
## 7 2691 Legend of 1900, The (a.k.a. The Legend of the Pianist o~ 1900
## 8 5310 Transylvania 6-5000 (1985) 5000
## 9 4159 3000 Miles to Graceland (2001) 3000
## 10 8905 1492: Conquest of Paradise (1992) 1492
## 11 27266 2046 (2004) 2046
## 12 8864 Mr. 3000 (2004) 3000
## 13 53953 1408 (2007) 1408
## 14 5472 1776 (1972) 1776
## 15 2308 Detroit 9000 (1973) 9000
## 16 26359 1900 (Novecento) (1976) 1900
## 17 4311 Bloody Angels (1732 Høtten: Marerittet Har et Postnumm~ 1732
#4. Recoded the 17 release years between 1000 and 9000
edx_transf[edx_transf$movieId == "2308", "premier_date"] <- 1973
edx_transf[edx_transf$movieId == "5310", "premier_date"] <- 1985
edx_transf[edx_transf$movieId == "671", "premier_date"] <- 1996
edx_transf[edx_transf$movieId == "4159", "premier_date"] <- 2001
edx_transf[edx_transf$movieId == "27266", "premier_date"] <- 2004
edx_transf[edx_transf$movieId == "8864", "premier_date"] <- 2004
edx_transf[edx_transf$movieId == "2311", "premier_date"] <- 1984
edx_transf[edx_transf$movieId == "5472", "premier_date"] <- 1972
edx_transf[edx_transf$movieId == "4311", "premier_date"] <- 1998
edx_transf[edx_transf$movieId == "1422", "premier_date"] <- 1997
edx_transf[edx_transf$movieId == "8905", "premier_date"] <- 1992
edx_transf[edx_transf$movieId == "53953", "premier_date"] <- 2007
edx_transf[edx_transf$movieId == "6645", "premier_date"] <- 1971
edx_transf[edx_transf$movieId == "6290", "premier_date"] <- 2003
edx_transf[edx_transf$movieId == "8198", "premier_date"] <- 1960
edx_transf[edx_transf$movieId == "2691", "premier_date"] <- 1998
edx_transf[edx_transf$movieId == "26359", "premier_date"] <- 1976
#5. Implemented final transformations
edx <-edx_transf %>%
mutate(date = round_date(as_datetime(timestamp), unit = "month"), # transform timestamp into YYYY-MM-DD
movie_rated_era=ifelse (year_rated<=1999, "Ratings during 1990s","Ratings during 2000s"), # define the decades of the rated movies: 1990s (1990-1999) 2000s (2000-2009)
premier_year_era=ifelse (premier_date >=1910 & premier_date <=1919, "1910s", # define the decades of the premier date
ifelse(premier_date >=1920 & premier_date <=1929, "1920s",
ifelse(premier_date >=1930 & premier_date <=1939, "1930s",
ifelse(premier_date >=1940 & premier_date <=1949, "1940s",
ifelse(premier_date >=1950 & premier_date <=1959, "1950s",
ifelse(premier_date >=1960 & premier_date <=1969, "1960s",
ifelse(premier_date >=1970 & premier_date <=1979, "1970s",
ifelse(premier_date >=1980 & premier_date <=1989, "1980s",
ifelse(premier_date >=1990 & premier_date <=1999, "1990s",
ifelse(premier_date >=2000 & premier_date <=2009, "2000s","2010s")))))))))),
rating_group= ifelse((rating == 1 |rating == 2 | rating == 3 | rating == 4 | rating == 5), "whole star", "half star"), # Define if rating are whole or half starts
Orig_genres=genres) %>% # keep the original genres column
separate(genres,c("single_genres"),sep = "\\|") # split combination genres into single genres| userId | movieId | rating | timestamp | title | single_genres | year_rated | premier_date | date | movie_rated_era | premier_year_era | rating_group | Orig_genres |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 122 | 5 | 838985046 | Boomerang (1992) | Comedy | 1996 | 1992 | 1996-08-01 | Ratings during 1990s | 1990s | whole star | Comedy|Romance |
| 1 | 185 | 5 | 838983525 | Net, The (1995) | Action | 1996 | 1995 | 1996-08-01 | Ratings during 1990s | 1990s | whole star | Action|Crime|Thriller |
| 1 | 292 | 5 | 838983421 | Outbreak (1995) | Action | 1996 | 1995 | 1996-08-01 | Ratings during 1990s | 1990s | whole star | Action|Drama|Sci-Fi|Thriller |
| 1 | 316 | 5 | 838983392 | Stargate (1994) | Action | 1996 | 1994 | 1996-08-01 | Ratings during 1990s | 1990s | whole star | Action|Adventure|Sci-Fi |
| 1 | 329 | 5 | 838983392 | Star Trek: Generations (1994) | Action | 1996 | 1994 | 1996-08-01 | Ratings during 1990s | 1990s | whole star | Action|Adventure|Drama|Sci-Fi |
2.2 Exploratory Data Analysis (EDA) & Visualizations
MovieLens is comprised by 2 datasets: movies.dat containing Movie ID, title and genres and ratings.dat containing User ID, Movie ID, rating, and timestamp.
EDA indicated that edx contains 9,000,055 ratings applied to 106,770 movies from 797 genres combinations. Movies were rated by 69,878 users from 1995 to 2009. Movie’s premier took place between 1915 and 2008.
## ratings_n unique_movies unique_genres users_n start_year_rate end_year_rate
## 1 9000055 10677 797 69878 1995 2009
## start_premier_date end_premier_date
## 1 1915 2008
| ratings_n | unique_movies | unique_genres | users_n | start_year_rate | end_year_rate | start_premier_date | end_premier_date |
|---|---|---|---|---|---|---|---|
| 9000055 | 10677 | 797 | 69878 | 1995 | 2009 | 1915 | 2008 |
Genres
After splitting the genre combinations (such as Action from “Action|Crime|Thriller” or Comedy from “Comedy|Romance”) into single genres, 20 unique single genres were found.
## `summarise()` ungrouping output (override with `.groups` argument)
| single_genres | count |
|---|---|
| Action | 2560545 |
| Comedy | 2437260 |
| Drama | 1741668 |
| Adventure | 753650 |
| Crime | 529521 |
| Horror | 233074 |
It was possible to identify that the top 3 single genres were Action, Comedy and Drama. Having Action and Comedy similar ratings. The lowest rated single genres were Romance, War and IMAX.
single_genres_sum<-edx %>% group_by(single_genres,movieId) %>%
summarize(n_rating_of_movie = n(),
mu_movie = mean(rating),
sd_movie = sd(rating))
single_genres_sum %>%
ggplot( aes(x=single_genres, y=n_rating_of_movie)) +
geom_segment( aes(xend=single_genres, yend=0)) +
geom_point( size=4, color="orange") +
coord_flip() + ggtitle("Movies per single genre") +
labs(fill = "number of ratings",
subtitle="How is each genre distributed?",
caption ="Note: genre combinations were removed") +
mytheme + theme(axis.text.x = element_text(size = 10),
axis.text.y = element_text(size = 10),
legend.position = "none")Release Decade
The highest ratings were given to the 1980s and 1990s movies (3 and 4 rating stars). Movies from the 1910s to 1970s also received good ratings, the majority were above 3 stars, having their highest number of ratings with 4 stars.
edx %>% mutate(premier_year_era=premier_year_era)%>% group_by (rating,premier_year_era) %>%
count(premier_year_era) %>%
ggplot(aes(x = premier_year_era, y = n, fill= n)) + # Plot with values on top
geom_bar(stat = "identity") +
ggtitle("Release Decade") +
labs(fill = "number of ratings", subtitle="How are movie release decades distributed over the star ratings?", caption ="Note: release decade category derived from the split of title") +
mytheme +
theme(axis.text.y = element_blank(), axis.text.x =element_text(size = rel(0.75),angle = 90), legend.position = "bottom", legend.spacing.x = unit(1.0, 'cm'),legend.text = element_text(margin = margin(t = 10)))+
guides(fill = guide_colorbar(title = "number of ratings", # edit the legend bar
label.position = "bottom", label.hjust = 1,
title.position = "left", title.vjust = 0.75,
frame.colour = "black", # draw border around the legend
barwidth = 13,
barheight = 0.75)) +
scale_fill_distiller(palette = "Spectral") +
facet_wrap(~ rating,ncol=3)The analysis of the trend: “average ratings vs. date” shows that there is some evidence of a time effect. The movies from the 1970s and earlier received on average higher star rate score than movies which premier took place after the 1980s.
It can be noticed a decrease on the average star rating after year 2000 for the 1900s and 2000s movies (from an average of about 4 to 3.5 stars), whereas the 1980s movies maintained the 3.5 star rating over the period of time.
release_decade<-edx %>%
mutate(premier_year_era_group=ifelse (premier_date >=1910 & premier_date <=1979, "1970s or earlier", # define the decades of the premier date : prior 1980s, 1980s,1990s,2000s
ifelse(premier_date >=1980 & premier_date <=1989, "1980s",
ifelse(premier_date >=1990 & premier_date <=1999, "1990s",
ifelse(premier_date >=2000 & premier_date <=2009, "2000s","2010s"))))) %>%
group_by(date, premier_year_era_group) %>%
summarize(avgrating = mean(rating))
release_decade%>%
ggplot(aes(date, avgrating, colour = premier_year_era_group)) +
geom_point() +
geom_smooth() +
theme_bw() + theme(panel.grid = element_blank(),axis.title = element_blank()) +
labs(colour = "Release decades",
title="Timestamp (unit in month)",
subtitle="Do movies prior 1980s recieve more star ratings than recent movies?")2.3 Modelling
In this step edx was split into train_set (90% of the data) and test_set (10% of the data). Using the original code as template. Dimensions were: edx train_set = 8100048 x 12; edx test_set = 899990 X 12.
2.3.1 - Random Prediction
As the initial insight benchmark, probability distribution of the variables was calculated. This means that any model should be better than the values found. EDA analysis identified that there were more movies being scored with 3- and 4-star ratings (23% and 28% respectively). This represents an overall 51% of the ratings given to just two score stars.
With the random prediction, ratings use the observed probabilities. Initially the probability of each individual rating is calculated on the train_set. Then the test_set is used to predict the rating and compare with the actual rating. The final table results show the probabilities of the edx, train_set, monte carlo using train_set, test_set and RMSE using the test_set.
prop_edx_ratings<- (prop.table(table(edx$rating))) # Proportion of each edx ratings
prop_train_ratings<- (prop.table(table(train_set$rating))) # Proportion of each train_set ratings
prop_test_ratings<- (prop.table(table(test_set$rating))) # Proportion of each test_set ratings
set.seed(1, sample.kind = "Rounding")
# Create the probability of each rating - seq (min,max,interval)
p <- function(x, y) mean(y == x)
rating <- seq(0.5,5,0.5)
# Estimate the probability of each rating with Monte Carlo simulation
B <- 10000
M <- replicate(B, {
s <- sample(train_set$rating, 100, replace = TRUE)
sapply(rating, p, y= s)
})
prob <- sapply(1:nrow(M), function(x) mean(M[x,])) # Proportion of each train_set ratings with Monte Carlo simulation
# Predict random ratings
y_hat_random <- sample(rating, size = nrow(test_set),
replace = TRUE, prob = prob)
RMSE_rating<- RMSE(test_set$rating, y_hat_random)
#calculation of the RMSE for each rating.
#It is the same as sqrt(mean((true_ratings - predicted_ratings)^2))
individual_RMSE_rating <- data.frame (c(RMSE(test_set$rating==0.5, y_hat_random==0.5), # Proportion of each test_set ratings
RMSE(test_set$rating==1, y_hat_random==1),
RMSE(test_set$rating==1.5, y_hat_random==1.5),
RMSE(test_set$rating==2, y_hat_random==2),
RMSE(test_set$rating==2.5, y_hat_random==2.5),
RMSE(test_set$rating==3, y_hat_random==3),
RMSE(test_set$rating==3.5, y_hat_random==3.5),
RMSE(test_set$rating==4, y_hat_random==4),
RMSE(test_set$rating==4.5, y_hat_random==4.5),
RMSE(test_set$rating==5, y_hat_random==5)))
# validated by sqrt(mean(((test_set$rating==5) - (y_hat_random==5))^2))
### create each rating proportion & RMSE final table
rating_prop<-data.frame (rating= c(0.5,1,1.5,2,2.5,3,3.5,4,4.5,5),
prop_edx_ratings, prop_train_ratings, prob, prop_test_ratings, individual_RMSE_rating) %>% # create final table
select (rating, Freq, Freq.1, prob, Freq.2, c.RMSE.test_set.rating....0.5..y_hat_random....0.5...RMSE.test_set.rating.... )
names(rating_prop)<-c("rating","prob_edx", "prob_train_set", "prob_MonteCarlo_train_set", "prob_test_set", "RMSE test_set")| rating | prob_edx | prob_train_set | prob_MonteCarlo_train_set | prob_test_set | RMSE test_set |
|---|---|---|---|---|---|
| 0.5 | 0.0094859 | 0.0094877 | 0.009592 | 0.0094701 | 0.1372839 |
| 1.0 | 0.0384085 | 0.0384012 | 0.038171 | 0.0384749 | 0.2711021 |
| 1.5 | 0.0118250 | 0.0118326 | 0.011777 | 0.0117568 | 0.1523601 |
| 2.0 | 0.0790464 | 0.0790855 | 0.078600 | 0.0786942 | 0.3808550 |
| 2.5 | 0.0370009 | 0.0369727 | 0.036984 | 0.0372549 | 0.2671822 |
| 3.0 | 0.2356919 | 0.2356793 | 0.236294 | 0.2358048 | 0.6009064 |
| 3.5 | 0.0879577 | 0.0879240 | 0.088360 | 0.0882610 | 0.4011090 |
| 4.0 | 0.2876016 | 0.2876351 | 0.287167 | 0.2872999 | 0.6399185 |
| 4.5 | 0.0585259 | 0.0585256 | 0.058362 | 0.0585284 | 0.3319356 |
| 5.0 | 0.1544562 | 0.1544563 | 0.154693 | 0.1544550 | 0.5112290 |
Results of the RMSE test_set for individual star rating using the Random prediction model were below the RMSE target.
### create RMSE for rating
rating_RMSE_1 <- data.frame (Method = c("RMSE Target", " ", "Random prediction"), RMSE = c(0.8649," ",RMSE_rating))
#rating_RMSE_1| Method | RMSE |
|---|---|
| RMSE Target | 0.8649 |
| Random prediction | 1.49979655801755 |
The RMSE of random prediction for the overall rating was very high (1.499), above the RMSE Capstone Target of 0.8649. RMSE of each individual rating met the RMSE Capstone Target.
2.3.2 Linear Models
P-value of the F-statistic is < 2.2e-16, which is highly significant y=mx+b. The only predictor that was significantly related to the rating was single_genresIMAX. All the other genres had Pr(>|t|) > 0.05. This means that changes in single_genresIMAX ratings are significantly associated to changes in ratings, while changes in all remaining individual single genres seem not to be significantly associated with ratings.
single_genresIMAX coefficient suggests that for every 1 rating increase, it can be expected a decrease of -1.3214286 *1 = 1.32 star units, on average. Meaning the maximum score would be 5 stars - 1.32 = 3.68 stars (since the slope is negative, the decrease ratio is 1.32 per unit)
R-squared represents the correlation coefficient. Strong correlation means that R-squared is close to 1, positively or negatively. In the rating ~ single_genres linear regression model, the multiple R-squared was 0.0221, demonstrating that there was no correlation. Only 2.21% of the variance in the measure of rating can be predicted by single genres. Residual standard error (RSE) is the measure of error of prediction. The lower the RSE is, the more accurate is the model.
rating ~ single_genres
set.seed(1, sample.kind = "Rounding")
fit_genres<-lm(rating ~ single_genres, data = edx)
#fit_genres
summary(fit_genres)##
## Call:
## lm(formula = rating ~ single_genres, data = edx)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.6459 -0.5492 0.1288 0.5786 1.9665
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.6428572 0.3963158 9.192 < 0.0000000000000002 ***
## single_genresAction -0.2214526 0.3963163 -0.559 0.57631
## single_genresAdventure -0.0782987 0.3963176 -0.198 0.84338
## single_genresAnimation -0.0936693 0.3963221 -0.236 0.81316
## single_genresChildren -0.3953004 0.3963234 -0.997 0.31856
## single_genresComedy -0.1890340 0.3963164 -0.477 0.63338
## single_genresCrime 0.2282971 0.3963184 0.576 0.56458
## single_genresDocumentary 0.1443992 0.3963329 0.364 0.71561
## single_genresDrama 0.0236177 0.3963166 0.060 0.95248
## single_genresFantasy -0.3328879 0.3963690 -0.840 0.40100
## single_genresFilm-Noir 0.5030855 0.3964035 1.269 0.20440
## single_genresHorror -0.6093571 0.3963217 -1.538 0.12416
## single_genresIMAX -1.3214286 0.4853857 -2.722 0.00648 **
## single_genresMusical -0.0008257 0.3964011 -0.002 0.99834
## single_genresMystery 0.0579222 0.3963612 0.146 0.88381
## single_genresRomance -0.3624834 0.3964247 -0.914 0.36052
## single_genresSci-Fi -0.2092698 0.3963434 -0.528 0.59750
## single_genresThriller -0.1128840 0.3963304 -0.285 0.77578
## single_genresWar 0.0295715 0.3969148 0.075 0.94061
## single_genresWestern -0.1069422 0.3964064 -0.270 0.78733
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.049 on 9000035 degrees of freedom
## Multiple R-squared: 0.0221, Adjusted R-squared: 0.02209
## F-statistic: 1.07e+04 on 19 and 9000035 DF, p-value: < 0.00000000000000022
## 2.5 % 97.5 %
## (Intercept) 2.8660924 4.4196220
## single_genresAction -0.9982184 0.5553133
## single_genresAdventure -0.8550671 0.6984697
## single_genresAnimation -0.8704466 0.6831079
## single_genresChildren -1.1720802 0.3814794
## single_genresComedy -0.9657999 0.5877318
## single_genresCrime -0.5484728 1.0050670
## single_genresDocumentary -0.6323992 0.9211975
## single_genresDrama -0.7531486 0.8003840
## single_genresFantasy -1.1097569 0.4439811
## single_genresFilm-Noir -0.2738512 1.2800222
## single_genresHorror -1.3861335 0.1674193
## single_genresIMAX -2.2727673 -0.3700900
## single_genresMusical -0.7777576 0.7761062
## single_genresMystery -0.7189316 0.8347760
## single_genresRomance -1.1394616 0.4144949
## single_genresSci-Fi -0.9860887 0.5675490
## single_genresThriller -0.8896775 0.6639095
## single_genresWar -0.7483673 0.8075103
## single_genresWestern -0.8838846 0.6700003
rating ~ movie_rated_era + premier_year_era There is no stronger correlation between Release decades (1980s, 1900s and 2000s) and related average ratings, although p-value was significant. P-value of the F-statistic is < 2.2e-16, which is also highly significant. The only predictors that were not significantly related to the rating were release decades 1980s, 1990s and 2000s. All the other decades prior 1980s had highly significant association with ratings.
The coefficients suggesting a decrease in units were for the release decade 1990s and movies rated during 2000s (with-0.0587 and -0.1184530 star units respectively, on average). As a result, the maximum score would be 4.95 and 4.89 stars respectively. In the rating ~ movie_rated_era + premier_year_era Multiple linear regression (MLR), the multiple R-squared was 0.01827, demonstrating that there was also no correlation. Only 1.8% of the variance in the measurement of rating can be predicted by release decade and rating during 2000.
## Warning in set.seed(1, sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
fit_decade<-lm(rating ~ movie_rated_era + premier_year_era, data = edx)
#fit_decade
summary(fit_decade)##
## Call:
## lm(formula = rating ~ movie_rated_era + premier_year_era, data = edx)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4194 -0.5098 0.1087 0.6087 1.6087
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 3.5684925 0.0457729 77.961
## movie_rated_eraRatings during 2000s -0.1184530 0.0008601 -137.721
## premier_year_era1920s 0.4212917 0.0467081 9.020
## premier_year_era1930s 0.4019297 0.0459054 8.756
## premier_year_era1940s 0.4694075 0.0458557 10.237
## premier_year_era1950s 0.4092166 0.0458212 8.931
## premier_year_era1960s 0.3425602 0.0458048 7.479
## premier_year_era1970s 0.2983833 0.0457878 6.517
## premier_year_era1980s 0.0601428 0.0457735 1.314
## premier_year_era1990s -0.0587000 0.0457688 -1.283
## premier_year_era2000s 0.0096627 0.0457720 0.211
## Pr(>|t|)
## (Intercept) < 0.0000000000000002 ***
## movie_rated_eraRatings during 2000s < 0.0000000000000002 ***
## premier_year_era1920s < 0.0000000000000002 ***
## premier_year_era1930s < 0.0000000000000002 ***
## premier_year_era1940s < 0.0000000000000002 ***
## premier_year_era1950s < 0.0000000000000002 ***
## premier_year_era1960s 0.0000000000000751 ***
## premier_year_era1970s 0.0000000000718931 ***
## premier_year_era1980s 0.189
## premier_year_era1990s 0.200
## premier_year_era2000s 0.833
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.051 on 9000044 degrees of freedom
## Multiple R-squared: 0.01827, Adjusted R-squared: 0.01826
## F-statistic: 1.674e+04 on 10 and 9000044 DF, p-value: < 0.00000000000000022
## 2.5 % 97.5 %
## (Intercept) 3.47877931 3.65820575
## movie_rated_eraRatings during 2000s -0.12013876 -0.11676725
## premier_year_era1920s 0.32974555 0.51283782
## premier_year_era1930s 0.31195677 0.49190257
## premier_year_era1940s 0.37953198 0.55928304
## premier_year_era1950s 0.31940880 0.49902446
## premier_year_era1960s 0.25278448 0.43233595
## premier_year_era1970s 0.20864092 0.38812563
## premier_year_era1980s -0.02957157 0.14985711
## premier_year_era1990s -0.14840518 0.03100517
## premier_year_era2000s -0.08004882 0.09937413
- Prediction 1 - Only mean of the ratings The initial prediction is test that all users will give the same rating to all movies and each rating is randomly distributed. formula is y_hat=mu+error
mu <- mean(train_set$rating) # Mean of observed values
RMSE_mu <- RMSE(test_set$rating, mu)
v_RMSE_mu <- RMSE(validation$rating, mu)- Prediction 2 - Adding Movie Effect (bi)
Movie variability may be related to the fact that different movies will have different rating distribution. Some maybe bias by the producer, cast, topic this movie bias is expressed as bi in this formula: y_hat=mu+bi+error (mean of the difference between the observed rating y and the mean mu).
## `summarise()` ungrouping output (override with `.groups` argument)
#bi distribution
bi %>% ggplot(aes(x = b_i)) +
geom_histogram(bins=10, col = I("black")) +
ggtitle("Movie Effect Distribution (bi) ") +
theme_minimal()+ theme(panel.grid = element_blank(),axis.title.y = element_blank())# Predict the rating with mean + bi with test_set
y_hat_bi <- mu + test_set %>%
left_join(bi, by = "movieId") %>%
.$b_i
RMSE_bi<- RMSE(test_set$rating, y_hat_bi)
# Predict the rating with mean + bi with validation
v_y_hat_bi <- mu + validation %>%
left_join(bi, by = "movieId") %>%
.$b_i
v_RMSE_bi<- RMSE(validation$rating, v_y_hat_bi)- Prediction 3 - Adding User Effect (bu)
Different users have different rating distribution, this can also be biased by their own movie preference, physiological effect when they were watching and rating the movies. Users can also be influenced by the movie rating itself. The formula is y_hat=mu+bi+bu+error
# User effect (bu)
bu <- train_set %>%
left_join(bi, by = 'movieId') %>%
group_by(userId) %>%
summarize(b_u = mean(rating - mu - b_i))## `summarise()` ungrouping output (override with `.groups` argument)
#User effect distribution shows to be more normally distributed
bu %>% ggplot(aes(x = b_u)) +
geom_histogram(bins=10, col = I("black")) +
ggtitle("User Effect Distribution (bu)") +
theme_minimal()+ theme(panel.grid = element_blank(),axis.title.y = element_blank())# Predict the rating with mean + bi+ bu with test_set
y_hat_bi_bu <- test_set %>%
left_join(bi, by='movieId') %>%
left_join(bu, by='userId') %>%
mutate(pred = mu + b_i + b_u) %>%
.$pred
RMSE_bi_bu <- RMSE(test_set$rating, y_hat_bi_bu)
# Predict the rating with mean + bi+ bu with validation
v_y_hat_bi_bu <- validation %>%
left_join(bi, by='movieId') %>%
left_join(bu, by='userId') %>%
mutate(pred = mu + b_i + b_u) %>%
.$pred
v_RMSE_bi_bu <- RMSE(validation$rating, v_y_hat_bi_bu)- Prediction 4 - Adding Time Effect (t), the year of the rating
Time can also be biased by the season. For example during valentine’s day and fall season people may watch more romance movies, whereas during Halloween more horror movies. Also people can be influenced by critical events that occurred during a given period of time. The formula is y_hat=mu+bi+bu+error
# Time effect (t)
t <- train_set %>%
mutate(date = round_date(as_datetime(timestamp), unit = "month")) %>% #Timestamp in week
left_join(bi, by = 'movieId') %>%
left_join(bu, by = 'userId') %>%
group_by(date) %>%
summarize(b_t = mean(rating - mu - b_i - b_u))## `summarise()` ungrouping output (override with `.groups` argument)
#User effect distribution shows to be more normally distributed
t %>% ggplot(aes(x = b_t)) +
geom_histogram(bins=10, col = I("black")) +
ggtitle("Time Effect Distribution (bt) based on rating year") +
theme_minimal()+ theme(panel.grid = element_blank(),axis.title.y = element_blank())# Predict the rating with mean + bi+ bu = bt with test_set
y_hat_bi_bu_bt <- test_set %>%
left_join(bi, by='movieId') %>%
left_join(bu, by='userId') %>%
left_join(t, by='date') %>%
mutate(pred = mu + b_i + b_u + b_t) %>%
.$pred
RMSE_bi_bu_bt <- RMSE(test_set$rating, y_hat_bi_bu_bt)
# Predict the rating with mean + bi+ bu = bt with validation
v_y_hat_bi_bu_bt <- validation %>%
left_join(bi, by='movieId') %>%
left_join(bu, by='userId') %>%
left_join(t, by='date') %>%
mutate(pred = mu + b_i + b_u +b_t) %>%
.$pred
v_RMSE_bi_bu_bt <- RMSE(validation$rating, v_y_hat_bi_bu_bt)RMSE comparison table
### create RMSE comparison table
rating_RMSE2 <- data.frame (Method = c("RMSE Target"," ", "Random prediction", "Only mean of the ratings", "Adding Movie effects (bi)", "Adding Movie + User Effect (bu)", "Adding Movie + User + time Effect (bt)"), RMSE_test_set= c(0.8649," ", RMSE_rating,RMSE_mu,RMSE_bi, RMSE_bi_bu, RMSE_bi_bu_bt), RMSE_validation= c(0.8649," " , " ", v_RMSE_mu,v_RMSE_bi, v_RMSE_bi_bu, v_RMSE_bi_bu_bt))
#rating_RMSE2| Method | RMSE_test_set | RMSE_validation |
|---|---|---|
| RMSE Target | 0.8649 | 0.8649 |
| Random prediction | 1.49979655801755 | |
| Only mean of the ratings | 1.06005370222409 | 1.0612018064374 |
| Adding Movie effects (bi) | 0.942961498004501 | 0.943972915845845 |
| Adding Movie + User Effect (bu) | 0.86468429490229 | 0.865852823367063 |
| Adding Movie + User + time Effect (bt) | 0.864663712815486 | 0.865818869174341 |
The time effect seems not make much difference at the RMSE validation results. It went from RMSE of 0.86585(adding movie and user effect) to RMSE of 0.86581 (adding movie + user + time effect).
2.3.3 Matrix factorization
Trying to achieve the RMSE Target < 0.8649 and tired to fight the memory limit issues due to the large dataset. I decide to try the recosystem package to create matrix factorization. According to the package documentation https://cran.r-project.org/web/packages/recosystem/recosystem.pdf the default parameters are: - costp_l2 Tuning parameter, the L2 regularization cost for user factors. Can be specified as a numeric vector, with default value c(0.01,0.1). - costq_l2 Tuning parameter, the L2 regularization cost for item factors. Can be specified as a numeric vector, with default value c(0.01,0.1). - lrate Tuning parameter, the learning rate, which can be thought of as the step size in gradient descent. Can be specified as a numeric vector, with default value c(0.01,0.1). - niter Integer, the number of iterations. Default is 20. - nthread Integer, the number of threads for parallel computing. Default is 1.
set.seed(123, sample.kind = "Rounding") # This is a randomized algorithm
# Convert the train and test sets into recosystem input format
train_data <- with(train_set, data_memory(user_index = userId,
item_index = movieId,
rating = rating,
date = date))
test_data <- with(test_set, data_memory(user_index = userId,
item_index = movieId,
rating = rating,
date = date))
validation_data <- with(validation, data_memory(user_index = userId,
item_index = movieId,
rating = rating,
date = date))
# Create the model object
r <- recosystem::Reco()
# Select the best tuning parameters. I used the parameters that people has been using wirh Reco() examples such as Qiu(2020)
opts <- r$tune(train_data,
opts = list(dim = c(10, 20, 30), # dim is number of factors
lrate = c(0.1, 0.2), # learning rate
costp_l2 = c(0.01, 0.1), #regularization for P factors
costq_l2 = c(0.01, 0.1), # regularization for Q factors
nthread = 4, niter = 10)) #convergence can be controlled by a number of iterations (niter) and learning rate (lrate)
# Train the algorithm
r$train(train_data, opts = c(opts$min, nthread = 4, niter = 20))## iter tr_rmse obj
## 0 0.9824 1.1054e+07
## 1 0.8746 8.9764e+06
## 2 0.8424 8.3428e+06
## 3 0.8213 7.9661e+06
## 4 0.8048 7.6993e+06
## 5 0.7919 7.5004e+06
## 6 0.7813 7.3546e+06
## 7 0.7725 7.2346e+06
## 8 0.7651 7.1372e+06
## 9 0.7588 7.0624e+06
## 10 0.7532 6.9931e+06
## 11 0.7483 6.9364e+06
## 12 0.7440 6.8892e+06
## 13 0.7399 6.8461e+06
## 14 0.7363 6.8078e+06
## 15 0.7330 6.7742e+06
## 16 0.7298 6.7435e+06
## 17 0.7269 6.7147e+06
## 18 0.7244 6.6916e+06
## 19 0.7220 6.6692e+06
# Calculate the predicted values using Reco test_data
y_hat_reco <- r$predict(test_data, out_memory()) #out_memory(): Result should be returned as R objects
RMSE_reco <- RMSE(test_set$rating, y_hat_reco)
# Calculate the predicted values using Reco validation_data
v_y_hat_reco <- r$predict(validation_data, out_memory()) #out_memory(): Result should be returned as R objects
head(v_y_hat_reco, 10)## [1] 4.248147 4.962215 4.849316 3.052525 3.849292 2.695510 4.077660 4.323699
## [9] 4.435445 3.353303
3. Results
This the final RMSE comparison table. It contains all the models used to predict ratings. As can been seen from the comparison table, the target RMSE of 0.8649 was almost reached after the Regularization qith Movie, user and time effect (RMSE of 0.8658). The best results of the RMSE were achieved when matrix factorization was implemented 0.78581 (around 9% below the RMSE target of 0.8649).
### create RMSE comparison table
rating_RMSE3 <- data.frame (Method = c("RMSE Target"," ", "Random prediction",
"Only mean of the ratings", "Adding Movie effects (bi)",
"Adding Movie + User Effect (bu)",
"Adding Movie + User + time Effect (bt)",
"Matrix Factorization (recosystem)"),
RMSE_test_set= c(0.8649," ", RMSE_rating,RMSE_mu,RMSE_bi,
RMSE_bi_bu, RMSE_bi_bu_bt, RMSE_reco), RMSE_validation= c(0.8649," " , " ", v_RMSE_mu,v_RMSE_bi, v_RMSE_bi_bu, v_RMSE_bi_bu_bt,v_RMSE_reco))
#rating_RMSE3| Method | RMSE_test_set | RMSE_validation |
|---|---|---|
| RMSE Target | 0.8649 | 0.8649 |
| Random prediction | 1.49979655801755 | |
| Only mean of the ratings | 1.06005370222409 | 1.0612018064374 |
| Adding Movie effects (bi) | 0.942961498004501 | 0.943972915845845 |
| Adding Movie + User Effect (bu) | 0.86468429490229 | 0.865852823367063 |
| Adding Movie + User + time Effect (bt) | 0.864663712815486 | 0.865818869174341 |
| Matrix Factorization (recosystem) | 0.785780020455386 | 0.786269710030586 |
4. Conclusion
This project was a great wrap-up of the certification, it gave us freedom to build our own analysis while we all had in mind a common target. It showcased the importance of validate the data transformation as well gave me different perspectives of data insights.
After running my recommender system algorithms using random prediction, Linear model with regularized effects on movies +users + time effect I reached the RMSE of 0.86585 using the validation and 0.8646 using the test_set which was below the Capstone RMSE Target of 0.8649. Since I was not fully satisfied with my RMSE results, I decided to try Matrix Factorization method (recosystem), as some of the discussion highly recommended. I was very impressed to reach RMSE of 0.78581 using the validation (9% less than the Capstone RMSE Target).
Limitations were mostly related with the dataset size and the memory limit issues that I encounter over several chunks of codes. After some research, I used gc to clean unused memory and I was increased the memory limit. Errors disappears, but the time to run the codes were in between 1h40min and 2h30min. This means that I spent probably 60% of my time waiting for the code to run, instead of been developing/improving my codes.
As future work, I would be curious to see how smaller datasets and different combinations of train_set and test_set would impact the RMSE results.
5.Reference
Perrier, A. Effective Amazon Machine Learning . Published by Packt Publishing, 2017.
Forte, R and Miller, J. Mastering Predictive Analytics with R - Second Edition. Published by Packt Publishing, 2017.
Qiu, Y. recosystem: Recommender System Using Parallel Matrix Factorization https://cran.r-project.org/web/packages/recosystem/vignettes/introduction.html, 2020.