Project 3 A

Author

Samantha Barbaro

Approach

I will use my movie data set from the last project to create a recommendation system. To do this, I will:

  • find the average rating for all movies

  • Find out how the average of every individual movie compares to the overall average

  • Find out how a user’s average rating compares to the overall average

  • Using these numbers, predict ratings for movies users haven’t seen

  • Provide a recommendation by recommending the movie they are likely to rate the highest

Data set:

https://raw.githubusercontent.com/samanthabarbaro/data607/refs/heads/main/moviedata_CSV.csv

Loading the data set

I loaded the data set and kept some cleaning methods from last week.

library(tidyverse)  
Warning: package 'tidyverse' was built under R version 4.5.2
Warning: package 'ggplot2' was built under R version 4.5.2
Warning: package 'tibble' was built under R version 4.5.2
Warning: package 'readr' was built under R version 4.5.2
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.5.2
✔ ggplot2   4.0.1     ✔ tibble    3.3.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
sql_url <- "https://raw.githubusercontent.com/samanthabarbaro/data607/refs/heads/main/moviedata_CSV.csv"  
#read and convert blanks and NULL to NA  
movie_data <- read.csv(sql_url, na = c("", "NA", "null", "NULL")) 

Converting to a tibble

#converting to tibble  
movie_data <- as_tibble(movie_data)   

print(movie_data) 
# A tibble: 6 × 11
  viewer iron_man captain_america avengers super_man birds_of_prey
  <chr>     <int>           <int>    <int>     <int>         <int>
1 anne          1               2        3        NA             5
2 cat          NA              NA       NA         5             4
3 greta        NA               4        3        NA            NA
4 james         2               5       NA         2            NA
5 jeff         NA              NA        4         1            NA
6 nico          5              NA        2         2             5
# ℹ 5 more variables: the_suicide_squad <int>, the_grinch <int>,
#   the_holiday <int>, love_actually <int>, rudolph <int>

Pivot longer

long_movie_data <- movie_data |> pivot_longer(   -viewer,   names_to = "movie",   values_to = "rating" )  

#descriptive statistics 
summary(long_movie_data) 
    viewer             movie               rating     
 Length:60          Length:60          Min.   :1.000  
 Class :character   Class :character   1st Qu.:2.000  
 Mode  :character   Mode  :character   Median :3.500  
                                       Mean   :3.188  
                                       3rd Qu.:4.250  
                                       Max.   :5.000  
                                       NA's   :28     

Adding averages and other calculations

#calculate the average movie score for the entire dataset, and give that its own column

long_movie_data |> summarize(round(mean(rating, na.rm = TRUE), 1))
# A tibble: 1 × 1
  `round(mean(rating, na.rm = TRUE), 1)`
                                   <dbl>
1                                    3.2
long_movie_data |> mutate(avg_overall = 3.2)
# A tibble: 60 × 4
   viewer movie             rating avg_overall
   <chr>  <chr>              <int>       <dbl>
 1 anne   iron_man               1         3.2
 2 anne   captain_america        2         3.2
 3 anne   avengers               3         3.2
 4 anne   super_man             NA         3.2
 5 anne   birds_of_prey          5         3.2
 6 anne   the_suicide_squad      4         3.2
 7 anne   the_grinch             1         3.2
 8 anne   the_holiday            1         3.2
 9 anne   love_actually          2         3.2
10 anne   rudolph                1         3.2
# ℹ 50 more rows
#I will add the average movie rating manually
#one way to do this -- get every movie's average rating and index with a case satement
avg_movie_rating <- long_movie_data |>  group_by(movie) |> 
    summarize(avg_rating = round(mean(rating, na.rm = TRUE), 1))

print(avg_movie_rating)
# A tibble: 10 × 2
   movie             avg_rating
   <chr>                  <dbl>
 1 avengers                 3  
 2 birds_of_prey            4.7
 3 captain_america          3.7
 4 iron_man                 2.7
 5 love_actually            4  
 6 rudolph                  3.3
 7 super_man                2.5
 8 the_grinch               2  
 9 the_holiday              2.3
10 the_suicide_squad        4  
movie_averages <- long_movie_data |> mutate(avg_overall = 3.2) |>
    mutate(movie_average = case_when(
        movie == "iron_man" ~ 2.7,
        movie == "avengers" ~ 3.0,
        movie == "birds_of_prey" ~ 4.7,
        movie == "captain_america" ~ 3.7,
        movie == "love_actually" ~ 4.0,
        movie == "rudolph" ~ 3.3,
        movie == "super_man" ~ 2.5,
        movie == "the_grinch" ~ 2.0,
        movie == "the_holiday" ~ 2.3,
        movie == "the_suicide_squad" ~ 4.0,
        TRUE ~ NA_real_  
    ))

Trying something different

At this point, I’ve introduced a lot of typing and the possibility of human error with my methods. The ratings in the code above are also static and don’t update if new ratings are added.

Before I keep going, is there a way to add a column like index/match or hlookup in excel to join these two tables? Or is it possible to just calculate the averages for each movie and the general average for all the movies using mutate and skip the step of creating a new table and indexing?

#I added the average in a single column, which worked

new_ratings <- long_movie_data |> mutate(avg_rating = round(mean(rating, na.rm = TRUE), 1))

#And the average for each movie (apparently one can simply ungroup after grouping by ungroup() or .groups = "drop")

new_ratings <- long_movie_data |> 
    mutate(avg_rating_global = round(mean(rating, na.rm = TRUE), 1)) |>
    group_by(movie) |> 
    mutate(avg_rating_movie = round(mean(rating, na.rm = TRUE), 1)) |> 
    ungroup()


#here it is again with two new columns (user average and the difference between the user's average and the overall average). 
#At this point, I am running out of easily understanable column names.

new_ratings <- long_movie_data |> 
    mutate(avg_rating_global = round(mean(rating, na.rm = TRUE), 1)) |>
    group_by(movie) |>
    mutate(avg_rating_movie = round(mean(rating, na.rm = TRUE), 1)) |> 
    ungroup() |> 
    group_by(viewer) |>
    mutate(viewer_avg = round(mean(rating, na.rm = TRUE), 1)) |> 
    ungroup() |>
    mutate(user_diff = viewer_avg - avg_rating_global)


print(new_ratings)
# A tibble: 60 × 7
   viewer movie   rating avg_rating_global avg_rating_movie viewer_avg user_diff
   <chr>  <chr>    <int>             <dbl>            <dbl>      <dbl>     <dbl>
 1 anne   iron_m…      1               3.2              2.7        2.2        -1
 2 anne   captai…      2               3.2              3.7        2.2        -1
 3 anne   avenge…      3               3.2              3          2.2        -1
 4 anne   super_…     NA               3.2              2.5        2.2        -1
 5 anne   birds_…      5               3.2              4.7        2.2        -1
 6 anne   the_su…      4               3.2              4          2.2        -1
 7 anne   the_gr…      1               3.2              2          2.2        -1
 8 anne   the_ho…      1               3.2              2.3        2.2        -1
 9 anne   love_a…      2               3.2              4          2.2        -1
10 anne   rudolph      1               3.2              3.3        2.2        -1
# ℹ 50 more rows

Creating a new predicted rating table

This table adds the user_diff column (difference from average rating) to the avg_rating_movie (average of each movie) column if the original rating column shows NA. If the viewer has already seen the movie, the predicted rating is NA.

predicted_ratings <- new_ratings |> mutate(predicted_rating = (ifelse(is.na(rating), user_diff + avg_rating_movie, NA)))

glimpse(predicted_ratings)
Rows: 60
Columns: 8
$ viewer            <chr> "anne", "anne", "anne", "anne", "anne", "anne", "ann…
$ movie             <chr> "iron_man", "captain_america", "avengers", "super_ma…
$ rating            <int> 1, 2, 3, NA, 5, 4, 1, 1, 2, 1, NA, NA, NA, 5, 4, NA,…
$ avg_rating_global <dbl> 3.2, 3.2, 3.2, 3.2, 3.2, 3.2, 3.2, 3.2, 3.2, 3.2, 3.…
$ avg_rating_movie  <dbl> 2.7, 3.7, 3.0, 2.5, 4.7, 4.0, 2.0, 2.3, 4.0, 3.3, 2.…
$ viewer_avg        <dbl> 2.2, 2.2, 2.2, 2.2, 2.2, 2.2, 2.2, 2.2, 2.2, 2.2, 3.…
$ user_diff         <dbl> -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0…
$ predicted_rating  <dbl> NA, NA, NA, 1.5, NA, NA, NA, NA, NA, NA, 3.3, 4.3, 3…
print(predicted_ratings |> select(viewer, movie, rating, avg_rating_movie, user_diff, predicted_rating))
# A tibble: 60 × 6
   viewer movie             rating avg_rating_movie user_diff predicted_rating
   <chr>  <chr>              <int>            <dbl>     <dbl>            <dbl>
 1 anne   iron_man               1              2.7        -1             NA  
 2 anne   captain_america        2              3.7        -1             NA  
 3 anne   avengers               3              3          -1             NA  
 4 anne   super_man             NA              2.5        -1              1.5
 5 anne   birds_of_prey          5              4.7        -1             NA  
 6 anne   the_suicide_squad      4              4          -1             NA  
 7 anne   the_grinch             1              2          -1             NA  
 8 anne   the_holiday            1              2.3        -1             NA  
 9 anne   love_actually          2              4          -1             NA  
10 anne   rudolph                1              3.3        -1             NA  
# ℹ 50 more rows

Finally, the predictions

#predicting everyone's next movie by slicing the max for each viewer and keeping the movie column

next_movie <- predicted_ratings |> group_by(viewer) |> slice_max(predicted_rating, n = 1, with_ties = FALSE) |> ungroup() |> select(viewer, movie, predicted_rating)


print(next_movie)
# A tibble: 6 × 3
  viewer movie             predicted_rating
  <chr>  <chr>                        <dbl>
1 anne   super_man                      1.5
2 cat    the_suicide_squad              4.6
3 greta  birds_of_prey                  5  
4 james  birds_of_prey                  4.8
5 jeff   birds_of_prey                  5.3
6 nico   love_actually                  4.4
#According to this table, someone will give Birds of Prey more than 5 stars

#After seeing this, I thought about whether setting a max for the predicted rating column makes sense and concluded that it does not.
#Someone might be a high rater, and you want recommend the movie they'd likely give the absolute highest rating to, even if that rating is out of bounds.

Google Gemini. (2026). Gemini 3 Flash [Large language model].
https://gemini.google.com. Accessed Feb 13, 2026.