PROJECT 1

Briefly describe the recommender system that you’re going to build out from a business perspective.

This system recommends movies to movie viewers.

Find a dataset, or build out your own toy dataset. As a minimum requirement for complexity, please include numeric ratings for at least five users, across at least five items, with some missing data.

The dataset files contain metadata for all 45,000 movies listed in the Full MovieLens Dataset, which covers movies released on or before July 2017. For this project, I picked the top seven movie viewers and a random selection of the movies they reviewed, which gives a 7 x 7 user-item table with two missing ratings.

Load your data into (for example) an R or pandas dataframe, a Python dictionary or list of lists, (or another data structure of your choosing). From there, create a user-item matrix.

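library(knitr)       # kable()
library(kableExtra)  # kable_styling() and %>%

# One row per user, one column per movie; missing ratings are read in as NA
df <- read.csv('https://raw.githubusercontent.com/msdsrep4/DATA-RS/master/ratings_user_item.csv')

df_ratings <- as.data.frame(df)

kable(df_ratings) %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)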
UserId TheMatrix StarWars4.ANew.Hope PulpFiction ShawshankRedemption ForrestGump JurassicPark SilenceoftheLambs
249 5.0 5 4.0 4.5 4.5 4 NA
318 3.5 3 4.5 4.0 4.5 4 4.0
380 4.5 5 5.0 3.0 5.0 5 5.0
414 5.0 5 5.0 5.0 5.0 4 4.0
448 2.0 5 5.0 NA 3.0 3 5.0
599 5.0 5 5.0 4.0 3.5 4 3.0
610 5.0 5 5.0 3.0 3.0 5 4.5

Break your ratings into separate training and test datasets.

## [1] "Train:"
UserId TheMatrix StarWars4.ANew.Hope PulpFiction ShawshankRedemption ForrestGump JurassicPark SilenceoftheLambs
3 380 4.5 5 5 3 5 5 5.0
5 448 2.0 5 5 NA 3 3 5.0
7 610 5.0 5 5 3 3 5 4.5
4 414 5.0 5 5 5 5 4 4.0
## [1] "Test:"
UserId TheMatrix StarWars4.ANew.Hope PulpFiction ShawshankRedemption ForrestGump JurassicPark SilenceoftheLambs
1 249 5.0 5 4.0 4.5 4.5 4 NA
2 318 3.5 3 4.5 4.0 4.5 4 4
6 599 5.0 5 5.0 4.0 3.5 4 3
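
The code that produced the split is not shown above. A minimal sketch of one way to obtain such a split, assuming whole users (rows) are sampled at random: the seed, the two-movie toy data frame, and the 4/3 split sizes (read off the printed tables) are illustrative assumptions, so the rows it selects will generally differ from the ones printed above.

```r
# Toy stand-in for df_ratings: the seven users and two of the movie columns.
ratings <- data.frame(
  UserId      = c(249, 318, 380, 414, 448, 599, 610),
  TheMatrix   = c(5.0, 3.5, 4.5, 5.0, 2.0, 5.0, 5.0),
  PulpFiction = c(4.0, 4.5, 5.0, 5.0, 5.0, 5.0, 5.0)
)

set.seed(123)                                 # arbitrary seed, for repeatability only
train_idx <- sample(nrow(ratings), size = 4)  # pick 4 of the 7 rows at random

train <- ratings[train_idx, ]   # sampled users, in sampled order
test  <- ratings[-train_idx, ]  # the remaining 3 users

print("Train:"); print(train)
print("Test:");  print(test)
```

Splitting by user keeps each user's ratings together, which matches the tables above, but it also means the test users have no ratings in the training data.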

Using your training data, calculate the raw average (mean) rating for every user-item combination.

library(tidyr)  # gather()

# Drop the UserId column so it is not averaged in with the ratings
user_means <- rowMeans(train[, -1], na.rm = TRUE)
user_means_df <- data.frame(as.list(user_means))
user_means_df <- gather(user_means_df, "user")
user_means_df
##   user    value
## 1   X3 4.642857
## 2   X5 3.833333
## 3   X7 4.357143
## 4   X4 4.714286
##                  item    value
## 1           TheMatrix 4.125000
## 2 StarWars4.ANew.Hope 5.000000
## 3         PulpFiction 5.000000
## 4 ShawshankRedemption 3.666667
## 5         ForrestGump 4.000000
## 6        JurassicPark 4.250000
## 7   SilenceoftheLambs 4.625000
##   TheMatrix StarWars4.ANew.Hope PulpFiction ShawshankRedemption
## 3       4.5                   5           5                   3
## 5       2.0                   5           5                  NA
## 7       5.0                   5           5                   3
## 4       5.0                   5           5                   5
##   ForrestGump JurassicPark SilenceoftheLambs
## 3           5            5               5.0
## 5           3            3               5.0
## 7           3            5               4.5
## 4           5            4               4.0
##   TheMatrix StarWars4.ANew.Hope PulpFiction ShawshankRedemption
## 1       5.0                   5         4.0                 4.5
## 2       3.5                   3         4.5                 4.0
## 6       5.0                   5         5.0                 4.0
##   ForrestGump JurassicPark SilenceoftheLambs
## 1         4.5            4                NA
## 2         4.5            4                 4
## 6         3.5            4                 3
## [1] "Training dataset raw average is 4.4074"
## [1] "Test dataset raw average is 4.2"

Calculate the RMSE for raw average for both your training data and your test data.

## [1] "Training RMSE is  0.9031"
## [1] "Test RMSE is  0.6205"
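
The raw-average and RMSE code is not shown above, so the following is a sketch of one way to compute them rather than the original code. The matrices are typed in from the Train and Test tables with the UserId column dropped, and `rmse` is a helper name introduced here. The printed test figure is consistent with scoring the test ratings against the test set's own average (4.2). The sketch also scores them against the training average, which is the more usual held-out evaluation.

```r
# Ratings copied from the Train/Test tables (rows = users, columns = movies).
train <- matrix(c(4.5, 5, 5, 3,  5, 5, 5.0,
                  2.0, 5, 5, NA, 3, 3, 5.0,
                  5.0, 5, 5, 3,  3, 5, 4.5,
                  5.0, 5, 5, 5,  5, 4, 4.0), nrow = 4, byrow = TRUE)
test  <- matrix(c(5.0, 5, 4.0, 4.5, 4.5, 4, NA,
                  3.5, 3, 4.5, 4.0, 4.5, 4, 4,
                  5.0, 5, 5.0, 4.0, 3.5, 4, 3), nrow = 3, byrow = TRUE)

# Root mean squared error over the observed (non-NA) ratings.
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2, na.rm = TRUE))
}

train_avg <- mean(train, na.rm = TRUE)  # raw average of all training ratings
test_avg  <- mean(test,  na.rm = TRUE)  # raw average of all test ratings

train_rmse          <- rmse(train, train_avg)
test_rmse_own_avg   <- rmse(test, test_avg)   # test ratings vs. test average
test_rmse_train_avg <- rmse(test, train_avg)  # test ratings vs. training average

round(c(train_avg = train_avg, test_avg = test_avg,
        train_rmse = train_rmse,
        test_rmse_own_avg = test_rmse_own_avg,
        test_rmse_train_avg = test_rmse_train_avg), 4)
```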

Using your training data, calculate the bias for each user and each item.

userId bias
380 0.2354
448 -0.5741
610 -0.05026
414 0.3069
  movie bias
2 TheMatrix -0.2824
3 StarWars4.ANew.Hope 0.5926
4 PulpFiction 0.5926
5 ShawshankRedemption -0.7407
6 ForrestGump -0.4074
7 JurassicPark -0.1574
8 SilenceoftheLambs 0.2176
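
The bias code is not shown above. A minimal sketch, assuming each bias is the user's (or movie's) mean training rating minus the raw training average, which is consistent with the values printed above. The matrix is typed in from the Train table with the UserId column dropped.

```r
# Training ratings copied from the Train table (rows = users, columns = movies).
train <- matrix(c(4.5, 5, 5, 3,  5, 5, 5.0,
                  2.0, 5, 5, NA, 3, 3, 5.0,
                  5.0, 5, 5, 3,  3, 5, 4.5,
                  5.0, 5, 5, 5,  5, 4, 4.0),
                nrow = 4, byrow = TRUE,
                dimnames = list(c("380", "448", "610", "414"),
                                c("TheMatrix", "StarWars4.ANew.Hope", "PulpFiction",
                                  "ShawshankRedemption", "ForrestGump",
                                  "JurassicPark", "SilenceoftheLambs")))

mu <- mean(train, na.rm = TRUE)                  # raw training average

user_bias <- rowMeans(train, na.rm = TRUE) - mu  # how far each user rates above/below mu
item_bias <- colMeans(train, na.rm = TRUE) - mu  # how far each movie is rated above/below mu

round(user_bias, 4)
round(item_bias, 4)
```

A positive bias means the user rates (or the movie is rated) above the overall average; a negative bias means below it.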

From the raw average, and the appropriate user and item biases, calculate the baseline predictors for every user-item combination.

UserId TheMatrix StarWars4.ANew.Hope PulpFiction ShawshankRedemption ForrestGump JurassicPark SilenceoftheLambs
3 380 4.4 5.2 5.2 3.9 4.2 4.5 4.9
5 448 3.6 4.4 4.4 3.1 3.4 3.7 4.1
7 610 4.1 4.9 4.9 3.6 3.9 4.2 4.6
4 414 4.4 5.3 5.3 4.0 4.3 4.6 4.9
UserId TheMatrix StarWars4.ANew.Hope PulpFiction ShawshankRedemption ForrestGump JurassicPark SilenceoftheLambs
1 249 4.2 5.0 5.0 3.7 4.0 4.3 4.7
2 318 3.3 4.2 4.2 2.9 3.2 3.5 3.8
6 599 3.9 4.7 4.7 3.4 3.7 4.0 4.4
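
A sketch of how the training-user table can be formed, assuming the baseline predictor is raw average + user bias + item bias. The data are typed in from the Train table. Predictions above 5 are left unclipped, as in the table above. The test users do not appear in the training data, and the source does not show how their biases were obtained, so the sketch covers the training users only.

```r
# Training ratings copied from the Train table (rows = users, columns = movies).
train <- matrix(c(4.5, 5, 5, 3,  5, 5, 5.0,
                  2.0, 5, 5, NA, 3, 3, 5.0,
                  5.0, 5, 5, 3,  3, 5, 4.5,
                  5.0, 5, 5, 5,  5, 4, 4.0),
                nrow = 4, byrow = TRUE,
                dimnames = list(c("380", "448", "610", "414"),
                                c("TheMatrix", "StarWars4.ANew.Hope", "PulpFiction",
                                  "ShawshankRedemption", "ForrestGump",
                                  "JurassicPark", "SilenceoftheLambs")))

mu        <- mean(train, na.rm = TRUE)
user_bias <- rowMeans(train, na.rm = TRUE) - mu
item_bias <- colMeans(train, na.rm = TRUE) - mu

# outer() adds every user bias to every item bias, giving a users x movies
# matrix; adding mu turns it into the baseline prediction for each cell.
baseline <- mu + outer(user_bias, item_bias, "+")

round(baseline, 1)
```

Because the predictor is defined for every user-item pair, it also yields a value for the cell user 448 left unrated (ShawshankRedemption).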

Calculate the RMSE for the baseline predictors for both your training data and your test data.

## [1] "Training Baseline Predictor RMSE is  0.5794"
## [1] "Test Baseline Predictor RMSE is  0.5992"
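
The RMSE code for the baseline predictor is not shown above. The sketch below is one plausible reading for the training data only: it compares each observed training rating with raw average + user bias + item bias, skipping missing cells. `rmse` is a helper name introduced here, and the output is not claimed to reproduce the figures printed above, which depend on details the source does not show.

```r
# Training ratings copied from the Train table (rows = users, columns = movies).
train <- matrix(c(4.5, 5, 5, 3,  5, 5, 5.0,
                  2.0, 5, 5, NA, 3, 3, 5.0,
                  5.0, 5, 5, 3,  3, 5, 4.5,
                  5.0, 5, 5, 5,  5, 4, 4.0), nrow = 4, byrow = TRUE)

# Root mean squared error over the observed (non-NA) ratings.
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2, na.rm = TRUE))
}

mu        <- mean(train, na.rm = TRUE)
user_bias <- rowMeans(train, na.rm = TRUE) - mu
item_bias <- colMeans(train, na.rm = TRUE) - mu
baseline  <- mu + outer(user_bias, item_bias, "+")

raw_rmse      <- rmse(train, mu)        # every cell predicted by the raw average
baseline_rmse <- rmse(train, baseline)  # every cell predicted by mu + biases

c(raw = raw_rmse, baseline = baseline_rmse)
```

Scoring the test users the same way requires a choice for their user biases, since none of them appear in the training data.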

Summarize your results.

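A model that generalizes well should produce similar RMSE values on the training data and the test data. A test RMSE that is much higher than the training RMSE suggests overfitting: the model performs worse on unseen data than on the data it was fit to. Using the raw average, the RMSE is 0.9031 for training and 0.6205 for testing. The baseline predictor lowers both values (training: 0.5794; testing: 0.5992) and narrows the gap between them from about 0.28 to about 0.02. The improvement is large on the training data and small on the test data, and the test RMSE ends up only slightly above the training RMSE, so these results show no strong sign of overfitting. With only four training users and three test users, the figures should be read as rough estimates.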

Reference: https://www.datacamp.com/community/tutorials/recommender-systems-python

Citation: F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4: 19:1–19:19. https://doi.org/10.1145/2827872