PROJECT 1
Briefly describe the recommender system that you’re going to build out from a business perspective.
This system recommends movies to movie viewers.
Find a dataset, or build out your own toy dataset. As a minimum requirement for complexity, please include numeric ratings for at least five users, across at least five items, with some missing data.
The dataset files contain metadata for all 45,000 movies listed in the Full MovieLens Dataset, which covers movies released on or before July 2017. For this project, I pick the top seven movie viewers and randomly select movies that they reviewed.
Load your data into (for example) an R or pandas dataframe, a Python dictionary or list of lists, (or another data structure of your choosing). From there, create a user-item matrix.
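library(knitr)       # kable()
library(kableExtra)  # kable_styling() and the %>% pipe
# Load the ratings: one row per user, one column per movie
df <- read.csv('https://raw.githubusercontent.com/msdsrep4/DATA-RS/master/ratings_user_item.csv')
df_ratings <- as.data.frame(df)
# Render the user-item matrix as a striped table
kable(df_ratings) %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)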
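| User ID | TheMatrix | StarWars4.ANew.Hope | PulpFiction | ShawshankRedemption | ForrestGump | JurassicPark | SilenceoftheLambs |
|---|---|---|---|---|---|---|---|
| 249 | 5.0 | 5 | 4.0 | 4.5 | 4.5 | 4 | NA |
| 318 | 3.5 | 3 | 4.5 | 4.0 | 4.5 | 4 | 4.0 |
| 380 | 4.5 | 5 | 5.0 | 3.0 | 5.0 | 5 | 5.0 |
| 414 | 5.0 | 5 | 5.0 | 5.0 | 5.0 | 4 | 4.0 |
| 448 | 2.0 | 5 | 5.0 | NA | 3.0 | 3 | 5.0 |
| 599 | 5.0 | 5 | 5.0 | 4.0 | 3.5 | 4 | 3.0 |
| 610 | 5.0 | 5 | 5.0 | 3.0 | 3.0 | 5 | 4.5 |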
Break your ratings into separate training and test datasets.
## [1] "Train:"
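| Row | User ID | TheMatrix | StarWars4.ANew.Hope | PulpFiction | ShawshankRedemption | ForrestGump | JurassicPark | SilenceoftheLambs |
|---|---|---|---|---|---|---|---|---|
| 3 | 380 | 4.5 | 5 | 5 | 3 | 5 | 5 | 5.0 |
| 5 | 448 | 2.0 | 5 | 5 | NA | 3 | 3 | 5.0 |
| 7 | 610 | 5.0 | 5 | 5 | 3 | 3 | 5 | 4.5 |
| 4 | 414 | 5.0 | 5 | 5 | 5 | 5 | 4 | 4.0 |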
## [1] "Test:"
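| Row | User ID | TheMatrix | StarWars4.ANew.Hope | PulpFiction | ShawshankRedemption | ForrestGump | JurassicPark | SilenceoftheLambs |
|---|---|---|---|---|---|---|---|---|
| 1 | 249 | 5.0 | 5 | 4.0 | 4.5 | 4.5 | 4 | NA |
| 2 | 318 | 3.5 | 3 | 4.5 | 4.0 | 4.5 | 4 | 4 |
| 6 | 599 | 5.0 | 5 | 5.0 | 4.0 | 3.5 | 4 | 3 |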
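The code that produced this split is not shown, so the following is only a minimal sketch of one way to do it in R. The column name `userId` and the seed are assumptions made for illustration; the ratings are copied from the user-item table.

```r
# Ratings copied from the user-item table; NA marks a missing rating.
# The column name `userId` is an assumption (the CSV header is not shown).
ratings <- data.frame(
  userId              = c(249, 318, 380, 414, 448, 599, 610),
  TheMatrix           = c(5.0, 3.5, 4.5, 5.0, 2.0, 5.0, 5.0),
  StarWars4.ANew.Hope = c(5, 3, 5, 5, 5, 5, 5),
  PulpFiction         = c(4.0, 4.5, 5.0, 5.0, 5.0, 5.0, 5.0),
  ShawshankRedemption = c(4.5, 4.0, 3.0, 5.0, NA, 4.0, 3.0),
  ForrestGump         = c(4.5, 4.5, 5.0, 5.0, 3.0, 3.5, 3.0),
  JurassicPark        = c(4, 4, 5, 4, 3, 4, 5),
  SilenceoftheLambs   = c(NA, 4.0, 5.0, 4.0, 5.0, 3.0, 4.5)
)

set.seed(42)                                   # hypothetical seed, for repeatability
train_rows <- sample(nrow(ratings), size = 4)  # 4 of the 7 users go to training
train <- ratings[train_rows, ]                 # training users
test  <- ratings[-train_rows, ]                # remaining 3 users form the test set
```

Replacing the sampled indices with `c(3, 5, 7, 4)` gives the same training rows as the split shown here.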
Using your training data, calculate the raw average (mean) rating for every user-item combination.
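library(tidyr)  # gather()
# Average each user's ratings; drop the first column (the user ID) so it is not averaged in with the ratings
user_means <- rowMeans(train[, -1], na.rm = TRUE)
user_means_df <- data.frame(as.list(user_means))
user_means_df <- gather(user_means_df, "user")
user_means_df
## user value
## 1 X3 4.642857
## 2 X5 3.833333
## 3 X7 4.357143
## 4 X4 4.714286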
## item value
## 1 TheMatrix 4.125000
## 2 StarWars4.ANew.Hope 5.000000
## 3 PulpFiction 5.000000
## 4 ShawshankRedemption 3.666667
## 5 ForrestGump 4.000000
## 6 JurassicPark 4.250000
## 7 SilenceoftheLambs 4.625000
## TheMatrix StarWars4.ANew.Hope PulpFiction ShawshankRedemption
## 3 4.5 5 5 3
## 5 2.0 5 5 NA
## 7 5.0 5 5 3
## 4 5.0 5 5 5
## ForrestGump JurassicPark SilenceoftheLambs
## 3 5 5 5.0
## 5 3 3 5.0
## 7 3 5 4.5
## 4 5 4 4.0
## TheMatrix StarWars4.ANew.Hope PulpFiction ShawshankRedemption
## 1 5.0 5 4.0 4.5
## 2 3.5 3 4.5 4.0
## 6 5.0 5 5.0 4.0
## ForrestGump JurassicPark SilenceoftheLambs
## 1 4.5 4 NA
## 2 4.5 4 4
## 6 3.5 4 3
## [1] "Training dataset raw average is 4.4074"
## [1] "Test dataset raw average is 4.2"
Calculate the RMSE for raw average for both your training data and your test data.
## [1] "Training RMSE is 0.9031"
## [1] "Test RMSE is 0.6205"
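The RMSE code is not shown, so here is a minimal sketch of the calculation, with the ratings copied from the training and test tables. The helper name `rmse` is my own. The reported figures are consistent with each set being scored against its own raw average; scoring the test set against the training average, which is the more usual convention, gives a somewhat higher value.

```r
# Ratings as matrices (rows = users, columns = movies), copied from the tables.
items <- c("TheMatrix", "StarWars4.ANew.Hope", "PulpFiction", "ShawshankRedemption",
           "ForrestGump", "JurassicPark", "SilenceoftheLambs")
train <- matrix(c(4.5, 5, 5, 3,  5, 5, 5.0,
                  2.0, 5, 5, NA, 3, 3, 5.0,
                  5.0, 5, 5, 3,  3, 5, 4.5,
                  5.0, 5, 5, 5,  5, 4, 4.0),
                nrow = 4, byrow = TRUE,
                dimnames = list(c("380", "448", "610", "414"), items))
test <- matrix(c(5.0, 5, 4.0, 4.5, 4.5, 4, NA,
                 3.5, 3, 4.5, 4.0, 4.5, 4, 4,
                 5.0, 5, 5.0, 4.0, 3.5, 4, 3),
               nrow = 3, byrow = TRUE,
               dimnames = list(c("249", "318", "599"), items))

# Root mean squared error over the observed (non-NA) ratings only.
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2, na.rm = TRUE))
}

train_avg <- mean(train, na.rm = TRUE)  # 4.4074
test_avg  <- mean(test, na.rm = TRUE)   # 4.2

round(rmse(train, train_avg), 4)  # 0.9031
round(rmse(test, test_avg), 4)    # 0.6205 (test set against its own average)
round(rmse(test, train_avg), 4)   # 0.6542 (test set against the training average)
```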
Using your training data, calculate the bias for each user and each item.
User biases:

| User ID | Bias |
|---|---|
| 380 | 0.2354 |
| 448 | -0.5741 |
| 610 | -0.05026 |
| 414 | 0.3069 |

Item biases:

| Row | Item | Bias |
|---|---|---|
| 2 | TheMatrix | -0.2824 |
| 3 | StarWars4.ANew.Hope | 0.5926 |
| 4 | PulpFiction | 0.5926 |
| 5 | ShawshankRedemption | -0.7407 |
| 6 | ForrestGump | -0.4074 |
| 7 | JurassicPark | -0.1574 |
| 8 | SilenceoftheLambs | 0.2176 |
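Each bias is the user's (or item's) mean training rating minus the training raw average of 4.4074. The sketch below shows one way to compute them; the variable names are my own, and the ratings are copied from the training table.

```r
# Training ratings (rows = users, columns = movies), copied from the training table.
items <- c("TheMatrix", "StarWars4.ANew.Hope", "PulpFiction", "ShawshankRedemption",
           "ForrestGump", "JurassicPark", "SilenceoftheLambs")
train <- matrix(c(4.5, 5, 5, 3,  5, 5, 5.0,
                  2.0, 5, 5, NA, 3, 3, 5.0,
                  5.0, 5, 5, 3,  3, 5, 4.5,
                  5.0, 5, 5, 5,  5, 4, 4.0),
                nrow = 4, byrow = TRUE,
                dimnames = list(c("380", "448", "610", "414"), items))

raw_avg   <- mean(train, na.rm = TRUE)                # mean of all observed ratings
user_bias <- rowMeans(train, na.rm = TRUE) - raw_avg  # one value per user
item_bias <- colMeans(train, na.rm = TRUE) - raw_avg  # one value per movie

round(user_bias, 4)
round(item_bias, 4)
```

For example, user 380 has a mean rating of about 4.6429, so the bias is 4.6429 - 4.4074 = 0.2354, matching the table.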
From the raw average, and the appropriate user and item biases, calculate the baseline predictors for every user-item combination.
| Row | User ID | TheMatrix | StarWars4.ANew.Hope | PulpFiction | ShawshankRedemption | ForrestGump | JurassicPark | SilenceoftheLambs |
|---|---|---|---|---|---|---|---|---|
| 3 | 380 | 4.4 | 5.2 | 5.2 | 3.9 | 4.2 | 4.5 | 4.9 |
| 5 | 448 | 3.6 | 4.4 | 4.4 | 3.1 | 3.4 | 3.7 | 4.1 |
| 7 | 610 | 4.1 | 4.9 | 4.9 | 3.6 | 3.9 | 4.2 | 4.6 |
| 4 | 414 | 4.4 | 5.3 | 5.3 | 4.0 | 4.3 | 4.6 | 4.9 |
| 1 | 249 | 4.2 | 5.0 | 5.0 | 3.7 | 4.0 | 4.3 | 4.7 |
| 2 | 318 | 3.3 | 4.2 | 4.2 | 2.9 | 3.2 | 3.5 | 3.8 |
| 6 | 599 | 3.9 | 4.7 | 4.7 | 3.4 | 3.7 | 4.0 | 4.4 |
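For the four training users, each baseline predictor is the raw average plus the user bias plus the item bias. The sketch below reproduces those rows under that assumption. It does not cover the three test users, because how their biases were obtained is not shown. Variable names are my own.

```r
# Training ratings (rows = users, columns = movies), copied from the training table.
items <- c("TheMatrix", "StarWars4.ANew.Hope", "PulpFiction", "ShawshankRedemption",
           "ForrestGump", "JurassicPark", "SilenceoftheLambs")
train <- matrix(c(4.5, 5, 5, 3,  5, 5, 5.0,
                  2.0, 5, 5, NA, 3, 3, 5.0,
                  5.0, 5, 5, 3,  3, 5, 4.5,
                  5.0, 5, 5, 5,  5, 4, 4.0),
                nrow = 4, byrow = TRUE,
                dimnames = list(c("380", "448", "610", "414"), items))

raw_avg   <- mean(train, na.rm = TRUE)
user_bias <- rowMeans(train, na.rm = TRUE) - raw_avg
item_bias <- colMeans(train, na.rm = TRUE) - raw_avg

# baseline[u, i] = raw_avg + user_bias[u] + item_bias[i]
# outer() builds the full user-by-item grid of bias sums in one call.
baseline <- raw_avg + outer(user_bias, item_bias, "+")
round(baseline, 1)
```

Some predictions exceed the maximum rating of 5 (for example 5.2 for user 380 on PulpFiction). If predictions need to stay on the rating scale, `pmin(baseline, 5)` caps them.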
Calculate the RMSE for the baseline predictors for both your training data and your test data.
## [1] "Training Baseline Predictor RMSE is 0.5794"
## [1] "Test Baseline Predictor RMSE is 0.5992"
Summarize your results.
In a well-fitted model, the RMSE on the test data should be close to the RMSE on the training data. A test RMSE that is much higher than the training RMSE suggests overfitting: the model performs worse on unseen data than on the data it was fit to. Using the raw average, the RMSE is 0.9031 for training and 0.6205 for testing. The baseline predictor lowers both values (training 0.5794, testing 0.5992) and brings them much closer together, with a gap of about 0.02 compared with about 0.28 for the raw average.
Reference: https://www.datacamp.com/community/tutorials/recommender-systems-python
Citation: F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4: 19:1–19:19. https://doi.org/10.1145/2827872