class: center, middle, title-slide
background-image: url("https://github.com/alexandersimon1/Data607/blob/main/Project_Final/background-020.jpg?raw=true")

## DATA607 Final Project

### .gold[Creation and comparison of movie recommender models with the recommenderlab R package]

.gold[Alexander Simon]

.gold[2024-05-08]

---

## Data source

.pull-left[
- MovieLens movie ratings dataset (2018 education & development version)
  - ratings.csv (user ratings + movie IDs)
  - movies.csv (movie IDs + titles + genres)
- ~100,000 ratings from ~600 users
- ~10,000 movies
]

.pull-right[
<img src="https://github.com/alexandersimon1/Data607/blob/main/Project_Final/movielens.png?raw=true" />
]

---

<!-- Note: The code below is the minimal code to reproduce the plots shown in the presentation. Please see project Rmarkdown file for full code. -->

## Tidying the data

```
## Rows: 9,742
## Columns: 3
## $ movieId <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,…
## $ title   <chr> "Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)…
*## $ genres  <chr> "Adventure|Animation|Children|Comedy|Fantasy", "Adventure|Chil…
```

```r
# Add a new column for each genre, initialize with 0
movie_titles[genres_all] <- 0

# For each movie, populate the genre columns
# Check whether the column name is found in a movie's genres
movie_titles <- movie_titles %>%
  rowwise() %>%
  mutate(
    across(
      .cols = c(first(genres_all) : last(genres_all)),
      .fns = ~ sum(grepl(cur_column(), genres))
    )
  ) %>%
  select(-genres)
```
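
A vectorized alternative to the `rowwise()` approach above (not the method used in this project, just a sketch assuming `movie_titles` still has its pipe-delimited `genres` column; `movie_genres_wide` is a hypothetical name):

```r
library(dplyr)
library(tidyr)

# Sketch: split each movie into one row per genre, then spread into 0/1 columns
movie_genres_wide <- movie_titles %>%
  separate_rows(genres, sep = "\\|") %>%
  mutate(present = 1L) %>%
  pivot_wider(names_from = genres, values_from = present, values_fill = 0L)
```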

---

## Tidying the data (cont'd)

```
## # A tibble: 10 × 22
## # Rowwise: 
*##    movieId title  year Adventure Animation Children Comedy Fantasy Romance Drama
##      <dbl> <chr> <dbl>     <int>     <int>    <int>  <int>   <int>   <int> <int>
##  1       1 Toy …  1995         1         1        1      1       1       0     0
##  2       2 Juma…  1995         1         0        1      0       1       0     0
##  3       3 Grum…  1995         0         0        0      1       0       1     0
##  4       4 Wait…  1995         0         0        0      1       0       1     1
##  5       5 Fath…  1995         0         0        0      1       0       0     0
##  6       6 Heat   1995         0         0        0      0       0       0     0
##  7       7 Sabr…  1995         0         0        0      1       0       1     0
##  8       8 Tom …  1995         1         0        1      0       0       0     0
##  9       9 Sudd…  1995         0         0        0      0       0       0     0
## 10      10 Gold…  1995         1         0        0      0       0       0     0
## # ℹ 12 more variables: Action <int>, Crime <int>, Thriller <int>, Horror <int>,
## #   Mystery <int>, `Sci-Fi` <int>, War <int>, Musical <int>, Documentary <int>,
## #   IMAX <int>, Western <int>, `Film-Noir` <int>
```

---

## Exploratory data analysis

.pull-left[
.green[Number of movies rated per user]

```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    20.0    35.0    70.5   165.3   168.0  2698.0
```

<!-- -->
]

.pull-right[
.green[Number of users per movie]

```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    1.00    3.00   10.37    9.00  329.00
```

<!-- -->
]

---

## Exploratory data analysis (cont'd)

.pull-left[
.center[.blue[Distribution of ratings]]

<img src="DATA607_Final_Project_Presentation_Alexander_Simon_files/figure-html/ratings-barplot-1.png" height="400px" />

.center[.green[Global mean rating = 3.5]]
]

.pull-right[
.blue[Difference of average rating from global mean by genre]

<img src="DATA607_Final_Project_Presentation_Alexander_Simon_files/figure-html/avg-rating-by-genre-barplot-1.png" width="350px" height="400px" />
]

---

## Recommender models in recommenderlab

- .term[Item-based collaborative filtering (IBCF)] - uses similarity between items, computed from users' ratings, to find items that are similar to those the active user likes

- .term[User-based collaborative filtering (UBCF)] - predicts ratings by aggregating the ratings of users whose rating history is similar to the active user's

- .term[Singular value decomposition (SVD)] - a matrix-factorization method that decomposes the rating matrix into latent factors to infer users with similar rating patterns

- .term[Popular] - non-personalized algorithm that recommends the most popular items that the active user has not yet rated

- .term[Random] - recommends random items; used as a baseline for evaluating model performance

---

## Build recommender model

recommenderlab normalizes the data automatically, so the only preparation needed is to create a rating matrix, which has user IDs as rows, movie IDs (ie, items) as columns, and movie ratings as values

```r
rating_matrix <- as.matrix(ratings_wide)
rownames(rating_matrix) <- users_vec
*ratings_rrm <- as(rating_matrix, "realRatingMatrix")
ratings_rrm
```

```
## 305 x 4980 rating matrix of class 'realRatingMatrix' with 1518900 ratings.
```

--

```r
# Define the evaluation scheme
eval_scheme <- evaluationScheme(ratings_rrm, method = "split", train = 0.8,
                                given = 5, goodRating = 4)

# Define training and test sets
*eval_train <- getData(eval_scheme, "train")
eval_known <- getData(eval_scheme, "known")
eval_unknown <- getData(eval_scheme, "unknown")

# Build the model from the training set
*IBCF_train <- Recommender(eval_train, "IBCF")

# Make predictions using the known set
IBCF_predictions <- predict(IBCF_train, eval_known, type = "ratings")
```
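
The table on the next slide shows top recommendations for one user. A minimal sketch of how such a top-N list can be obtained (the user index and n are illustrative; assumes recommenderlab is loaded and the objects defined above exist):

```r
# Sketch: top-10 recommended movie IDs for one user in the "known" set
IBCF_top10 <- predict(IBCF_train, eval_known[1], n = 10, type = "topNList")
as(IBCF_top10, "list")  # coerce the topNList to a plain list of movie IDs
```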
<td style="text-align:right;"> 349 </td> <td style="text-align:left;"> Jason's Lyric (1994) </td> <td style="text-align:left;"> Crime, Drama </td> </tr> <tr> <td style="text-align:right;"> 504 </td> <td style="text-align:left;"> Brady Bunch Movie, The (1995) </td> <td style="text-align:left;"> Comedy </td> </tr> <tr> <td style="text-align:right;"> 566 </td> <td style="text-align:left;"> Operation Dumbo Drop (1995) </td> <td style="text-align:left;"> Action, Adventure, Comedy, War </td> </tr> <tr> <td style="text-align:right;"> 653 </td> <td style="text-align:left;"> House Arrest (1996) </td> <td style="text-align:left;"> Children, Comedy </td> </tr> </tbody> </table> --- ## Comparing models: evaluation of rating predictions .pull-left[ - .term[Root mean square error (RMSE)]: Standard deviation of the difference between actual and predicted ratings. Magnifies outliers. - .term[Mean squared error (MSE)]: RMSE squared - .term[Mean absolute error (MAE)]: Mean of the absolute difference between actual and predicted ratings. Weights all predictions equally. - .green[Smaller values are better, but how small is "good"?] ] .pull-right[ ```r all_models_accuracy <- rbind( IBCF = calcPredictionAccuracy(IBCF_predictions, eval_unknown), UBCF = calcPredictionAccuracy(UBCF_predictions, eval_unknown), SVD = calcPredictionAccuracy(SVD_predictions, eval_unknown), POPULAR = calcPredictionAccuracy(POPULAR_predictions, eval_unknown), RANDOM = calcPredictionAccuracy(RANDOM_predictions, eval_unknown) ) round(all_models_accuracy, digits = 3) %>% knitr::kable(format = "html") %>% kableExtra::kable_styling(font_size = 10) ``` <table class="table" style="font-size: 10px; color: black; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> RMSE </th> <th style="text-align:right;"> MSE </th> <th style="text-align:right;"> MAE </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> IBCF </td> <td style="text-align:right;"> 1.101 </td> <td style="text-align:right;"> 1.212 </td> <td style="text-align:right;"> 0.361 </td> </tr> <tr> <td style="text-align:left;"> UBCF </td> <td style="text-align:right;"> 0.925 </td> <td style="text-align:right;"> 0.856 </td> <td style="text-align:right;"> 0.512 </td> </tr> <tr> <td style="text-align:left;"> SVD </td> <td style="text-align:right;"> 0.922 </td> <td style="text-align:right;"> 0.851 </td> <td style="text-align:right;"> 0.514 </td> </tr> <tr> <td style="text-align:left;"> POPULAR </td> <td style="text-align:right;"> 0.923 </td> <td style="text-align:right;"> 0.851 </td> <td style="text-align:right;"> 0.514 </td> </tr> <tr> <td style="text-align:left;"> RANDOM </td> <td style="text-align:right;"> 2.838 </td> <td style="text-align:right;"> 8.053 </td> <td style="text-align:right;"> 2.445 </td> </tr> </tbody> </table> ] --- ## Comparing models: evaluation of top recommendations .pull-left[ .center[ .term[Receiver-operator characteristic] ] <!-- --> ] .pull-right[ .center[ .term[Precision-recall] ] <!-- --> ] - .term[Precision] is the proportion of correctly recommended items among all recommended items - .term[Recall] is the proportion of correctly recommended items among all useful recommendations --- ## Conclusions - These analyses show that the SVD and POPULAR recommendation models had the best overall performance for the movie dataset - I was surprised that the collaborative filtering algorithms didn't perform better than the POPULAR model, but this may be because the MovieLens educational dataset is 
"idealized" and does not reflect real-world user ratings - Using recommenderlab was a good learning experience, but it feels a little clunky (eg, plots aren't ggplot quality) and some functions are deprecated or did not work --- ## Endnotes .green[This presentation was created using RMarkdown and the 'xaringan' and 'xaringanthemer' packages.] https://cran.r-project.org/web/packages/xaringan/ https://pkg.garrickadenbuie.com/xaringanthemer/ Hill A. Meet xaringan: Making slides in R Markdown. 2019-01-16. https://arm.rbind.io/slides/xaringan.html#1 Xie Y. Presentation Ninja with xaringan. 2016-12-12 (updated 2021-05-12). https://slides.yihui.org/xaringan/#1 .green[The presentation is available at https://rpubs.com/alexandersimon1/final_project_presentation]