In assignment 3A, I created a simple movie recommender system. I’ve iterated on this system, creating a content-based filtering system that recommends similar movies with high average ratings.
I augmented the data by adding genre information about each movie. I put together my own algorithm, then used recommenderlab. Finally, I compared the recommendations made with an actual model to those made using simple math.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.6
✔ forcats 1.0.1 ✔ stringr 1.5.2
✔ ggplot2 4.0.1 ✔ tibble 3.3.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.1.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(recommenderlab)
Loading required package: Matrix
Attaching package: 'Matrix'
The following objects are masked from 'package:tidyr':
expand, pack, unpack
Loading required package: arules
Attaching package: 'arules'
The following object is masked from 'package:dplyr':
recode
The following objects are masked from 'package:base':
abbreviate, write
Loading required package: proxy
Attaching package: 'proxy'
The following object is masked from 'package:Matrix':
as.matrix
The following objects are masked from 'package:stats':
as.dist, dist
The following object is masked from 'package:base':
as.matrix
Reading the original data set:
sql_url <- "https://raw.githubusercontent.com/samanthabarbaro/data607/refs/heads/main/moviedata_CSV.csv"
# read and convert blanks and NULL to NA
movie_data <- read.csv(sql_url, na.strings = c("", "NA", "null", "NULL"))
includes_genres <- left_join(new_ratings, movie_list, by = "movie")

genre_average <- includes_genres |>
  group_by(viewer, genre) |>
  mutate(avg_genre = na_if(round(mean(rating, na.rm = TRUE), 1), NaN)) |>
  ungroup()

# creating an index of favorite genres
genre_average_only <- includes_genres |>
  group_by(viewer, genre) |>
  summarise(gen_avg = (round(mean(rating, na.rm = TRUE), 1))) |>
  arrange(viewer, desc(gen_avg))
`summarise()` has grouped output by 'viewer'. You can override using the
`.groups` argument.
genre_average_only
# A tibble: 18 × 3
# Groups: viewer [6]
viewer genre gen_avg
<chr> <chr> <dbl>
1 anne dc 4.5
2 anne marvel 2
3 anne holiday 1.2
4 cat dc 4.5
5 cat holiday 3.3
6 cat marvel NaN
7 greta marvel 3.5
8 greta dc NaN
9 greta holiday NaN
10 james marvel 3.5
11 james holiday 3.3
12 james dc 3
13 jeff holiday 5
14 jeff marvel 4
15 jeff dc 1
16 nico dc 3.7
17 nico marvel 3.5
18 nico holiday NaN
Sorting by predicted rating and preference
Find the users’ genre order of preference and recommend the next movie in that genre they should watch based on genre and predicted rating.
# A tibble: 6 × 6
# Groups: viewer [6]
viewer movie rating predicted_rating genre avg_genre
<chr> <chr> <int> <dbl> <chr> <dbl>
1 anne super_man NA 1.5 dc 4.5
2 cat the_suicide_squad NA 4.6 dc 4.5
3 greta iron_man NA 3 marvel 3.5
4 james avengers NA 3.1 marvel 3.5
5 jeff the_holiday NA 2.9 holiday 5
6 nico captain_america NA 4.1 marvel 3.5
The result is a simple content-based rating system. It’s similar to project 3A, but it recommends the movie with the highest predicted rating in the genre the user is most interested in. To make the system purely item-based, I could have used each movie’s average rating instead of the predicted rating (the result would have been the same).
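The selection logic described above can be sketched in base R on made-up data. The data frame `toy`, the helper `recommend_one`, and every rating in it are hypothetical and only illustrate the idea: rank each viewer's unseen movies by the average rating of their genre, then by predicted rating, and keep the top one. It is not necessarily how the original chunk was written.

```r
# Made-up toy data (not the assignment's ratings); NA rating = not yet seen
toy <- data.frame(
  viewer = c("a", "a", "a", "a", "b", "b", "b", "b"),
  movie  = c("m1", "m2", "m3", "m4", "m1", "m2", "m3", "m4"),
  genre  = c("dc", "dc", "marvel", "marvel", "dc", "dc", "marvel", "marvel"),
  rating = c(5, NA, 2, NA, NA, 1, 4, NA),
  predicted_rating = c(NA, 4.2, NA, 3.9, 2.5, NA, NA, 3.0),
  stringsAsFactors = FALSE
)

# Hypothetical helper: pick one recommendation for a single viewer
recommend_one <- function(df) {
  # Average actual rating per genre (a genre with no ratings gives NaN)
  genre_avg <- tapply(df$rating, df$genre, mean, na.rm = TRUE)

  # Keep only the movies this viewer has not rated
  unseen <- df[is.na(df$rating), ]
  unseen$avg_genre <- as.numeric(genre_avg[unseen$genre])

  # Favorite genre first, then highest predicted rating within it
  unseen <- unseen[order(-unseen$avg_genre, -unseen$predicted_rating), ]
  unseen[1, c("viewer", "movie", "genre", "avg_genre", "predicted_rating")]
}

top_picks <- do.call(rbind, lapply(split(toy, toy$viewer), recommend_one))
print(top_picks)
```

In this toy set, viewer "a" prefers dc and viewer "b" prefers marvel, so each gets the unseen movie from their own favorite genre.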
Using the recommenderlab package
After several turns with Claude, here is a version of the recommenderlab code that actually functions (I hesitate to use the word “works”). Below, I attempt to explain what’s happening in the code.
Step 1: This turns the rating into a double (not necessary) and fills in one value for “Best rating.”
Step 2: Creates a matrix of full ratings for each user and pivots wider. The distinct function also isn’t necessary, but may be for a larger or messier data set.
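Step 3: Cosine similarity is a common way to measure similarity in recommender systems. Here it is computed on a wide genre matrix in which 1 marks the genre a movie belongs to.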
# ── 3. Genre similarity matrix ─────────────────────────────────────────────────
cosine_sim <- function(mat) {
  sim <- mat %*% t(mat)
  norms <- sqrt(rowSums(mat^2))
  sim / outer(norms, norms)
}

genre_matrix <- predictions_list |>
  distinct(movie, genre) |>
  mutate(value = 1) |>
  pivot_wider(names_from = genre, values_from = value, values_fill = 0) |>
  column_to_rownames("movie") |>
  as.matrix()
# A wide DF where 1 denotes whether something belongs to a genre or not

genre_sim <- cosine_sim(genre_matrix)
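To see what the cosine step produces, here is a small worked example that reuses the same `cosine_sim` definition on a hand-built one-hot matrix. The matrix `toy_genres` is an illustration only; its three movies and their genres are taken from the tables above, but it is not the full `genre_matrix`.

```r
cosine_sim <- function(mat) {
  sim <- mat %*% t(mat)            # dot product between every pair of rows
  norms <- sqrt(rowSums(mat^2))    # length of each row vector
  sim / outer(norms, norms)        # divide each dot product by both lengths
}

# One-hot genre matrix: rows = movies, columns = genres
toy_genres <- matrix(
  c(1, 0,
    1, 0,
    0, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    c("super_man", "the_suicide_squad", "iron_man"),
    c("dc", "marvel")
  )
)

toy_sim <- cosine_sim(toy_genres)
print(toy_sim)
```

Because each movie here belongs to exactly one genre, the genre similarity can only be 1 (same genre) or 0 (different genre), which is why the ratings-based similarity is needed to tell movies within a genre apart.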
Step 4: Creates a rating similarity matrix that shows how correlated the ratings of different movies are. For example, Superman and Birds of Prey have a correlation of -0.9364, based on users who rated both movies: someone who rated Superman well would tend to rate Birds of Prey poorly.
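The code for this step is not shown, so the following is only a sketch of one way to build such a matrix, assuming a Pearson correlation computed with base R's `cor()` over the viewers who rated both movies. The matrix `toy_ratings` and its values are made up.

```r
# Made-up ratings: rows = viewers, columns = movies, NA = not rated
toy_ratings <- matrix(
  c( 5,  1,  4,
     4,  3, NA,
     1,  5,  2,
    NA,  4,  3),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("viewer_", 1:4), c("movie_a", "movie_b", "movie_c"))
)

# Each pair of movies is correlated using only the viewers who rated both
rating_sim_toy <- cor(toy_ratings, use = "pairwise.complete.obs")
print(round(rating_sim_toy, 3))
```

Note that movie_a and movie_c share only two raters here, so their correlation is forced to the extreme value of 1. With a data set this small, strong correlations like the one quoted above should be read with caution.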
Step 5: Finds the movies that exist in both the genre and rating matrices (in theory, there could be unrated movies that have a genre listed, though that’s not true for this data).
blended_sim weights how important rating and genre are in defining item similarity for the rating system. The original split was 0.7/0.3, but I changed it to 0.4/0.6, with the heavier weight on genre.
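Since the blending code itself is not shown, here is a minimal sketch of what Step 5 and the blend could look like. It assumes the blend is a plain weighted sum with 0.4 on the rating similarity and 0.6 on the genre similarity; the names `shared_toy`, `w_rating`, `w_genre`, and both toy matrices are hypothetical.

```r
genre_movies  <- c("movie_a", "movie_b", "movie_c")
rating_movies <- c("movie_a", "movie_b", "movie_c", "movie_d")

# Made-up genre similarity (movie_a and movie_b share a genre)
genre_sim_toy <- matrix(
  c(1, 1, 0,
    1, 1, 0,
    0, 0, 1),
  nrow = 3, byrow = TRUE, dimnames = list(genre_movies, genre_movies)
)

# Made-up rating similarity; movie_d has ratings but no genre entry
rating_sim_toy <- matrix(
  c( 1.0, -0.9,  0.5, 0.1,
    -0.9,  1.0, -0.2, 0.3,
     0.5, -0.2,  1.0, 0.0,
     0.1,  0.3,  0.0, 1.0),
  nrow = 4, byrow = TRUE, dimnames = list(rating_movies, rating_movies)
)

# Step 5: keep only movies present in both matrices
shared_toy <- intersect(rownames(genre_sim_toy), rownames(rating_sim_toy))

# Blend: assumed weights, heavier on genre
w_rating <- 0.4
w_genre  <- 0.6
blended_toy <- w_rating * rating_sim_toy[shared_toy, shared_toy] +
  w_genre * genre_sim_toy[shared_toy, shared_toy]
print(blended_toy)
```

With these weights, movie_a and movie_b end up mildly similar (0.4 * -0.9 + 0.6 * 1 = 0.24) even though their ratings are strongly negatively correlated, which shows how much the genre weight can override the rating signal.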
First two assignments: viewer_ratings pulls all of the viewer’s ratings (actual and predicted); original_ratings pulls only the actual ratings.
scores - scores each movie the user hasn’t actually rated.
sims - finds the similarities between the target movie and all other movies.
rated_mask is a logical vector marking which movies the user has rated; if the user has rated none, no prediction is possible.
denom is the denominator of the weighted average: the sum of the absolute similarities. Dividing the similarity-weighted sum of the user’s ratings by it estimates how the user would rate a movie based on how they’ve rated other movies.
# ── 6. IBCF scoring function - bypasses recommenderlab's broken list export ────
predict_ibcf <- function(viewer_name, rating_mat, sim_mat, n = 10) {
  viewer_ratings <- rating_mat[viewer_name, shared_movies]

  # Original (non-coalesced) ratings: NA = truly unrated
  original_ratings <- predictions_list |>
    filter(viewer == viewer_name) |>
    select(movie, rating) |>
    deframe()

  # Score every movie for this viewer
  scores <- map_dbl(shared_movies, function(target_movie) {
    # Only recommend movies the viewer actually hasn't rated
    if (!is.na(original_ratings[target_movie])) return(NA_real_)

    # Similarities between target movie and all other movies
    sims <- sim_mat[target_movie, shared_movies]

    # Only use movies this viewer HAS rated as neighbors
    rated_mask <- !is.na(viewer_ratings)
    if (sum(rated_mask) == 0) return(NA_real_)

    neighbor_sims <- sims[rated_mask]
    neighbor_ratings <- viewer_ratings[rated_mask]

    # Weighted average: high-similarity movies pull harder on the prediction
    denom <- sum(abs(neighbor_sims))
    if (denom == 0) return(NA_real_)
    sum(neighbor_sims * neighbor_ratings) / denom
  })
  names(scores) <- shared_movies

  # Return top N unrated movies sorted by predicted score
  scores |>
    na.omit() |>
    sort(decreasing = TRUE) |>
    head(n) |>
    enframe(name = "movie", value = "ibcf_score") |>
    mutate(viewer = viewer_name)
}
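The weighted-average step inside the scoring function can be checked by hand on a few numbers. The similarities and ratings below are made up for illustration; only the formula is taken from the function above.

```r
# Made-up similarities between a target movie and three rated neighbors
neighbor_sims    <- c(movie_a = 0.8, movie_b = 0.2, movie_c = -0.5)
# Made-up ratings the viewer gave those neighbors
neighbor_ratings <- c(movie_a = 5,   movie_b = 3,   movie_c = 1)

# Denominator: total absolute similarity (0.8 + 0.2 + 0.5 = 1.5)
denom <- sum(abs(neighbor_sims))

# Numerator: 0.8 * 5 + 0.2 * 3 + (-0.5) * 1 = 4.1
score <- sum(neighbor_sims * neighbor_ratings) / denom
print(round(score, 2))
```

One thing this makes visible: a negative similarity subtracts from the numerator while still adding to the denominator, so a viewer whose rated movies are mostly negatively similar to the target can get a score below the bottom of the rating scale.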
all_viewers - list of all viewers, taken from the row names of the rating matrix.
recs_df - the predicted ratings of unseen movies, based on the algorithm.
eval_df tests the model. It compares the IBCF scores against the original set of predicted ratings, which seems incorrect: it should test predictions against the viewer’s actual scores.
# ── 7. Generate recommendations for all viewers ────────────────────────────────
all_viewers <- rownames(rating_matrix)

recs_df <- map_dfr(all_viewers, ~ predict_ibcf(
  viewer_name = .x,
  rating_mat = rating_matrix,
  sim_mat = blended_sim,
  n = 10
)) |>
  select(viewer, movie, ibcf_score)

cat("=== Raw recommendations ===\n")
=== Raw recommendations ===
print(recs_df)
# A tibble: 28 × 3
viewer movie ibcf_score
<chr> <chr> <dbl>
1 anne super_man 2.47
2 cat the_suicide_squad 4.34
3 cat iron_man 3.76
4 cat avengers 3.75
5 cat love_actually 3.66
6 cat captain_america 3.63
7 greta the_suicide_squad 3.69
8 greta super_man 3.46
9 greta iron_man 3.31
10 greta rudolph 3.30
# ℹ 18 more rows
# ── 8. Evaluate: IBCF score vs predicted_rating on originally-unrated cells ────
eval_df <- recs_df |>
  left_join(
    predictions_list |> select(viewer, movie, predicted_rating),
    by = c("viewer", "movie")
  ) |>
  filter(!is.na(predicted_rating)) # only rows where we have ground truth

rmse <- sqrt(mean((eval_df$ibcf_score - eval_df$predicted_rating)^2))
mae <- mean(abs(eval_df$ibcf_score - eval_df$predicted_rating))

cat("\nRMSE:", round(rmse, 3), "\n")
RMSE: 0.965
cat("MAE: ", round(mae, 3), "\n")
MAE: 0.766
print(eval_df)
# A tibble: 28 × 4
viewer movie ibcf_score predicted_rating
<chr> <chr> <dbl> <dbl>
1 anne super_man 2.47 1.5
2 cat the_suicide_squad 4.34 4.6
3 cat iron_man 3.76 3.3
4 cat avengers 3.75 3.6
5 cat love_actually 3.66 4.6
6 cat captain_america 3.63 4.3
7 greta the_suicide_squad 3.69 4.3
8 greta super_man 3.46 2.8
9 greta iron_man 3.31 3
10 greta rudolph 3.30 3.6
# ℹ 18 more rows
# ── 9. Join full context back in ───────────────────────────────────────────────
final_recommendations <- recs_df |>
  left_join(
    predictions_list |>
      distinct(viewer, movie, .keep_all = TRUE) |>
      select(viewer, movie, predicted_rating, genre, avg_genre),
    by = c("viewer", "movie")
  ) |>
  arrange(viewer, desc(avg_genre), desc(ibcf_score))

print(final_recommendations)
# A tibble: 28 × 6
viewer movie ibcf_score predicted_rating genre avg_genre
<chr> <chr> <dbl> <dbl> <chr> <dbl>
1 anne super_man 2.47 1.5 dc 4.5
2 cat the_suicide_squad 4.34 4.6 dc 4.5
3 cat love_actually 3.66 4.6 holiday 3.3
4 cat iron_man 3.76 3.3 marvel NA
5 cat avengers 3.75 3.6 marvel NA
6 cat captain_america 3.63 4.3 marvel NA
7 greta iron_man 3.31 3 marvel 3.5
8 greta the_suicide_squad 3.69 4.3 dc NA
9 greta super_man 3.46 2.8 dc NA
10 greta rudolph 3.30 3.6 holiday NA
# ℹ 18 more rows
rmse
[1] 0.965303
mae
[1] 0.7662678
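To make the two error metrics concrete, the sketch below recomputes them by hand on just the first three rows of the eval_df printout above (anne, then cat's first two rows). The variable names `errors`, `rmse_toy`, and `mae_toy` are mine, and because this uses 3 of the 28 rows, the results will not match the RMSE and MAE reported for the full table.

```r
# First three rows of the eval_df printout
ibcf_score       <- c(2.47, 4.34, 3.76)
predicted_rating <- c(1.5, 4.6, 3.3)

errors <- ibcf_score - predicted_rating

# RMSE squares each error before averaging, so large misses count more
rmse_toy <- sqrt(mean(errors^2))

# MAE averages the absolute errors, treating all misses proportionally
mae_toy <- mean(abs(errors))

print(c(rmse = rmse_toy, mae = mae_toy))
```

RMSE can never be smaller than MAE on the same errors, and the gap between them grows when a few errors are much larger than the rest, as with anne's row here.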
The RMSE is really high for a 5-point rating system. This is expected: it’s a really small data set. Let’s compare against the predictions from my simple approach.
library(gt) # provides gt(), cols_label(), and tab_header()

# my simple data frame
genre_predictions |>
  select(viewer, movie, predicted_rating) |>
  as_tibble() |>
  gt(rowname_col = NULL) |>
  cols_label(
    viewer = "Viewer",
    movie = "Movie",
    predicted_rating = "Predicted Rating"
  ) |>
  tab_header(title = "Predictions Using Math")
Predictions Using Math
Viewer  Movie              Predicted Rating
anne    super_man          1.5
cat     the_suicide_squad  4.6
greta   iron_man           3.0
james   avengers           3.1
jeff    the_holiday        2.9
nico    captain_america    4.1
Overall, the predicted ratings are somewhat different, probably due to the weighted averages. All the recommendations are the same except for Jeff’s: my simple algorithm recommends The Holiday instead of The Grinch, which is kind of puzzling because they are both holiday movies.
Claude Sonnet 4.6. (2026). [Large language model]. https://gemini.google.com. Accessed April 23 & 24, 2026.