DATA607 Final Project

class: center, middle, title-slide
background-image: url("data:image/png;base64,#https://github.com/alexandersimon1/Data607/blob/main/Project_Final/background-020.jpg?raw=true")

## DATA607 Final Project

### .gold[Creation and comparison of movie recommender models with the recommenderlab R package]

.gold[Alexander Simon]

.gold[2024-05-08]

---

## Data source

.pull-left[
- MovieLens movie ratings dataset (2018 education & development version)
  - ratings.csv (user ratings + movie IDs)
  - movies.csv (movie IDs + titles + genres)
  
- 100,000 users

- 10,000 movies
]

.pull-right[
<img src = "data:image/png;base64,#https://github.com/alexandersimon1/Data607/blob/main/Project_Final/movielens.png?raw=true" />

]

---

## Tidying the data

```
## Rows: 9,742
## Columns: 3
## $ movieId <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,…
## $ title   <chr> "Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)…
*## $ genres  <chr> "Adventure|Animation|Children|Comedy|Fantasy", "Adventure|Chil…
```

```r
# Add a new column for each genre, initialize with 0
movie_titles[genres_all] <- 0

# For each movie, populate the genre columns
# Check whether the column name is found in a movie's genres
movie_titles <- movie_titles %>%
  rowwise() %>%
  mutate(
    across(
      .cols = c(first(genres_all) : last(genres_all)),
      .fns = ~ sum(grepl(cur_column(), genres))
    )
  ) %>%
  select(-genres)
```

---

## Tidying the data (cont'd)

```
## # A tibble: 10 × 22
## # Rowwise: 
*##    movieId title  year Adventure Animation Children Comedy Fantasy Romance Drama
##      <dbl> <chr> <dbl>     <int>     <int>    <int>  <int>   <int>   <int> <int>
##  1       1 Toy …  1995         1         1        1      1       1       0     0
##  2       2 Juma…  1995         1         0        1      0       1       0     0
##  3       3 Grum…  1995         0         0        0      1       0       1     0
##  4       4 Wait…  1995         0         0        0      1       0       1     1
##  5       5 Fath…  1995         0         0        0      1       0       0     0
##  6       6 Heat   1995         0         0        0      0       0       0     0
##  7       7 Sabr…  1995         0         0        0      1       0       1     0
##  8       8 Tom …  1995         1         0        1      0       0       0     0
##  9       9 Sudd…  1995         0         0        0      0       0       0     0
## 10      10 Gold…  1995         1         0        0      0       0       0     0
## # ℹ 12 more variables: Action <int>, Crime <int>, Thriller <int>, Horror <int>,
## #   Mystery <int>, `Sci-Fi` <int>, War <int>, Musical <int>, Documentary <int>,
## #   IMAX <int>, Western <int>, `Film-Noir` <int>
```

---

## Exploratory data analysis

.pull-left[
.green[Number of movies rated per user]

```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    20.0    35.0    70.5   165.3   168.0  2698.0
```

![](data:image/png;base64,#DATA607_Final_Project_Presentation_Alexander_Simon_files/figure-html/movies-rated-per-user-histogram-1.png)
]

.pull-right[
.green[Number of users per movie]

```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    1.00    3.00   10.37    9.00  329.00
```

![](data:image/png;base64,#DATA607_Final_Project_Presentation_Alexander_Simon_files/figure-html/users-per-movie-histogram-1.png)
]

---

## Exploratory data analysis (cont'd)

.pull-left[
.center[.blue[Distribution of ratings]]
<img src="data:image/png;base64,#DATA607_Final_Project_Presentation_Alexander_Simon_files/figure-html/ratings-barplot-1.png" height="400 px" />

.center[.green[Global mean rating = 3.5]]
]

.pull-right[
.blue[Difference of average rating from global mean by genre]
<img src="data:image/png;base64,#DATA607_Final_Project_Presentation_Alexander_Simon_files/figure-html/avg-rating-by-genre-barplot-1.png" width="350 px" height="400 px" />
]

---

## Recommender models in recommenderlab

- .term[Item-based collaborative filtering (IBCF)] - uses similarity between items based on user ratings to find items that are similar to items that the active user likes

- .term[User-based collaborative filtering (UBCF)] - predicts ratings by aggregating ratings of users who have a similar rating history as the active user

- .term[Singular value decomposition (SVD)] - a mathematical method of transforming the rating matrix to infer users with similar ratings

- .term[Popular] - non-personalized algorithm that recommends the most popular items that users that not yet rated

- .term[Random] - recommends random items, used as a baseline for evaluating model performance

---

## Build recommender model

recommenderlab automatically normalizes the data, so the first step is to create a rating matrix, which has user IDs as rows, movie IDs (ie, items) as columns, and movie ratings as values

```r
rating_matrix <- as.matrix(ratings_wide)
rownames(rating_matrix) <- users_vec
*ratings_rrm <- as(rating_matrix, "realRatingMatrix")
ratings_rrm
```

```
## 305 x 4980 rating matrix of class 'realRatingMatrix' with 1518900 ratings.
```

```r
# Define the evaluation scheme
eval_scheme <- evaluationScheme(ratings_rrm, method = "split", train = 0.8, given = 5, goodRating = 4)

# Define training and test sets
*eval_train <- getData(eval_scheme, "train")
eval_known <- getData(eval_scheme, "known")
eval_unknown <- getData(eval_scheme, "unknown")

# Build the model from the training set
*IBCF_train <- Recommender(eval_train, "IBCF")

# Make predictions using the known set
IBCF_predictions <- predict(IBCF_train, eval_known, type = "ratings")
```

---

## Top recommended movies for a user

<table class="table" style="font-size: 16px; color: black; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:right;"> movie_ID </th>
   <th style="text-align:left;"> title </th>
   <th style="text-align:left;"> genres </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;"> 112 </td>
   <td style="text-align:left;"> Pie in the Sky (1996) </td>
   <td style="text-align:left;"> Comedy, Romance </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 194 </td>
   <td style="text-align:left;"> Drop Zone (1994) </td>
   <td style="text-align:left;"> Action, Thriller </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 220 </td>
   <td style="text-align:left;"> Jerky Boys, The (1995) </td>
   <td style="text-align:left;"> Comedy </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 234 </td>
   <td style="text-align:left;"> Losing Isaiah (1995) </td>
   <td style="text-align:left;"> Drama </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 250 </td>
   <td style="text-align:left;"> Natural Born Killers (1994) </td>
   <td style="text-align:left;"> Action, Crime, Thriller </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 326 </td>
   <td style="text-align:left;"> Mask, The (1994) </td>
   <td style="text-align:left;"> Action, Comedy, Crime, Fantasy </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 349 </td>
   <td style="text-align:left;"> Jason's Lyric (1994) </td>
   <td style="text-align:left;"> Crime, Drama </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 504 </td>
   <td style="text-align:left;"> Brady Bunch Movie, The (1995) </td>
   <td style="text-align:left;"> Comedy </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 566 </td>
   <td style="text-align:left;"> Operation Dumbo Drop (1995) </td>
   <td style="text-align:left;"> Action, Adventure, Comedy, War </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 653 </td>
   <td style="text-align:left;"> House Arrest (1996) </td>
   <td style="text-align:left;"> Children, Comedy </td>
  </tr>
</tbody>
</table>

---

## Comparing models: evaluation of rating predictions

.pull-left[
- .term[Root mean square error (RMSE)]: Standard deviation of the difference between actual and predicted ratings. Magnifies outliers.

- .term[Mean squared error (MSE)]: RMSE squared

- .term[Mean absolute error (MAE)]: Mean of the absolute difference between actual and predicted ratings. Weights all predictions equally.

- .green[Smaller values are better, but how small is "good"?]
]

.pull-right[

```r
all_models_accuracy <- rbind(
  IBCF = calcPredictionAccuracy(IBCF_predictions, eval_unknown),
  UBCF = calcPredictionAccuracy(UBCF_predictions, eval_unknown),
  SVD = calcPredictionAccuracy(SVD_predictions, eval_unknown),  
  POPULAR = calcPredictionAccuracy(POPULAR_predictions, eval_unknown),
  RANDOM = calcPredictionAccuracy(RANDOM_predictions, eval_unknown)  
)

round(all_models_accuracy, digits = 3) %>%
  knitr::kable(format = "html") %>%
  kableExtra::kable_styling(font_size = 10)
```

<table class="table" style="font-size: 10px; color: black; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;">   </th>
   <th style="text-align:right;"> RMSE </th>
   <th style="text-align:right;"> MSE </th>
   <th style="text-align:right;"> MAE </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> IBCF </td>
   <td style="text-align:right;"> 1.101 </td>
   <td style="text-align:right;"> 1.212 </td>
   <td style="text-align:right;"> 0.361 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> UBCF </td>
   <td style="text-align:right;"> 0.925 </td>
   <td style="text-align:right;"> 0.856 </td>
   <td style="text-align:right;"> 0.512 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> SVD </td>
   <td style="text-align:right;"> 0.922 </td>
   <td style="text-align:right;"> 0.851 </td>
   <td style="text-align:right;"> 0.514 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> POPULAR </td>
   <td style="text-align:right;"> 0.923 </td>
   <td style="text-align:right;"> 0.851 </td>
   <td style="text-align:right;"> 0.514 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> RANDOM </td>
   <td style="text-align:right;"> 2.838 </td>
   <td style="text-align:right;"> 8.053 </td>
   <td style="text-align:right;"> 2.445 </td>
  </tr>
</tbody>
</table>
]

---

## Comparing models: evaluation of top recommendations

.pull-left[
.center[
.term[Receiver-operator characteristic]
]
![](data:image/png;base64,#DATA607_Final_Project_Presentation_Alexander_Simon_files/figure-html/plot-roc-curve-1.png)
]

.pull-right[
.center[
.term[Precision-recall]
]
![](data:image/png;base64,#DATA607_Final_Project_Presentation_Alexander_Simon_files/figure-html/plot-precision-recall-1.png)
]

- .term[Precision] is the proportion of correctly recommended items among all recommended items
- .term[Recall] is the proportion of correctly recommended items among all useful recommendations

---

## Conclusions

- These analyses show that the SVD and POPULAR recommendation models had the best overall performance for the movie dataset

- I was surprised that the collaborative filtering algorithms didn't perform better than the POPULAR model, but this may be because the MovieLens educational dataset is "idealized" and does not reflect real-world user ratings

- Using recommenderlab was a good learning experience, but it feels a little clunky (eg, plots aren't ggplot quality) and some functions are deprecated or did not work

---

## Endnotes

.green[This presentation was created using RMarkdown and the 'xaringan' and 'xaringanthemer' packages.]

https://cran.r-project.org/web/packages/xaringan/

https://pkg.garrickadenbuie.com/xaringanthemer/

Hill A. Meet xaringan: Making slides in R Markdown. 2019-01-16. https://arm.rbind.io/slides/xaringan.html#1

Xie Y. Presentation Ninja with xaringan. 2016-12-12 (updated 2021-05-12). https://slides.yihui.org/xaringan/#1

.green[The presentation is available at https://rpubs.com/alexandersimon1/final_project_presentation]