Loading libraries

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(knitr)  # provides kable(); dplyr is already attached by library(tidyverse)

Loading movie ratings dataset

ratings_df <- read.csv("https://raw.githubusercontent.com/farhodibr/CUNY-SPS-MSDS/refs/heads/main/DATA607/LAB11/movie_reviews%20-%20Form%20Responses%201.csv")

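This dataset contains movie ratings collected through a survey. Each row is one respondent, and each movie column holds that respondent's rating (1 to 5 in this data), with NA where the movie was not rated: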

kable(ratings_df)
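| Timestamp   | Name                | Gladiator2 | Wolfs | The_Substance | Bad_Boys4 | The_Beekeeper | Rebel_Ridge |
|:------------|:--------------------|-----------:|------:|--------------:|----------:|--------------:|------------:|
| 5:34:20 PM  | Aiuna               | 1 | NA | 5  | 2  | 2  | 1  |
| 6:12:59 PM  | Ilya                | 3 | 3  | 3  | 3  | NA | 3  |
| 6:52:31 PM  | Vadim               | 1 | NA | 3  | 1  | 3  | 3  |
| 10:16:18 PM | Vladimir Storchevoy | 5 | 5  | 5  | 5  | 4  | 4  |
| 1:44:22 AM  | Timur               | 5 | 5  | 4  | 5  | 3  | 3  |
| 1:56:52 AM  | Gosha               | 4 | 5  | 3  | 5  | 4  | 4  |
| 10:30:44 AM | Bob                 | 3 | 5  | 4  | NA | 3  | NA |
| 9:40:22 PM  | Kirill              | 4 | 4  | NA | 5  | 3  | 3  |
| 9:41:38 PM  | James               | 3 | 4  | 4  | 5  | 3  | 3  |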

Tidying dataset into long format

ratings_df_long <- ratings_df |>
  select(-Timestamp) |>
  pivot_longer(
    cols = -Name,
    names_to = "movie",
    values_to = "rating"
  )
kable(head(ratings_df_long, 10))
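| Name  | movie         | rating |
|:------|:--------------|-------:|
| Aiuna | Gladiator2    | 1  |
| Aiuna | Wolfs         | NA |
| Aiuna | The_Substance | 5  |
| Aiuna | Bad_Boys4     | 2  |
| Aiuna | The_Beekeeper | 2  |
| Aiuna | Rebel_Ridge   | 1  |
| Ilya  | Gladiator2    | 3  |
| Ilya  | Wolfs         | 3  |
| Ilya  | The_Substance | 3  |
| Ilya  | Bad_Boys4     | 3  |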
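
One thing the long format makes easy is a per-movie summary of how many ratings are present and how many are missing, since pivot_longer() keeps the NA rows by default. The sketch below is only an illustration: ratings_subset is a hypothetical three-row subset retyped from the survey table (Timestamp omitted) so that it runs without downloading the CSV.

```r
library(tibble)
library(dplyr)
library(tidyr)

# Hypothetical subset: three respondents retyped from the survey table.
ratings_subset <- tribble(
  ~Name,   ~Gladiator2, ~Wolfs, ~The_Substance, ~Bad_Boys4, ~The_Beekeeper, ~Rebel_Ridge,
  "Aiuna", 1,           NA,     5,              2,          2,              1,
  "Ilya",  3,           3,      3,              3,          NA,             3,
  "Bob",   3,           5,      4,              NA,         3,              NA
)

# Same reshaping as above, then count rated vs. missing cells per movie.
missing_by_movie <- ratings_subset |>
  pivot_longer(cols = -Name, names_to = "movie", values_to = "rating") |>
  group_by(movie) |>
  summarise(
    n_rated   = sum(!is.na(rating)),
    n_missing = sum(is.na(rating))
  )

print(missing_by_movie)
```

The rows with n_missing greater than zero are the user and movie pairs that the recommender later has to predict.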

Global baseline estimate recommender formula

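The global baseline estimate recommender uses the following formula:

\[ \hat{r}_{um} = \mu + b_u + b_m \]

Where:

- \(\hat{r}_{um}\) is the predicted rating of user \(u\) for movie \(m\)
- \(\mu\) is the global average rating
- \(b_u\) is the bias of user \(u\)
- \(b_m\) is the bias of movie \(m\)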
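
As a worked instance of the formula, take Aiuna's missing rating for Wolfs, plugging in the global average, user bias, and movie bias values that are computed in the code cells further down (the movie bias is shown rounded to four decimals):

\[ \hat{r}_{\text{Aiuna},\,\text{Wolfs}} = 3.5625 + (-1.3625) + 0.5405 = 2.7405 \approx 3 \]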

This code cell calculates \(\mu\), the global average of all non-missing ratings:

mu <- mean(ratings_df_long$rating, na.rm = TRUE)
kable(mu)
x
3.5625
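
As a check on this value: the survey table has 54 rating cells (9 respondents by 6 movies), 6 of them are NA, and the remaining 48 ratings sum to 171, so

\[ \mu = \frac{171}{48} = 3.5625 \]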

This code cell creates the user_bias data frame, which holds the bias \(b_u\) for each user, that is, how far the user's average rating sits above or below \(\mu\):

user_bias <- ratings_df_long |>
  group_by(Name) |>
  summarise(bias_user = mean(rating - mu, na.rm = TRUE))
kable(user_bias)
| Name                | bias_user  |
|:--------------------|-----------:|
| Aiuna               | -1.3625000 |
| Bob                 | 0.1875000  |
| Gosha               | 0.6041667  |
| Ilya                | -0.5625000 |
| James               | 0.1041667  |
| Kirill              | 0.2375000  |
| Timur               | 0.6041667  |
| Vadim               | -1.3625000 |
| Vladimir Storchevoy | 1.1041667  |
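
Because the global average is a constant, averaging rating - mu gives the same result as subtracting mu from the user's own average rating. A minimal sketch for a single user, where aiuna_ratings is retyped by hand from the survey table and mu is hard-coded to the value reported above (both are copied in for illustration only):

```r
# Global average as reported above (hard-coded for this sketch).
mu <- 3.5625

# Aiuna's six ratings in table order; Wolfs was not rated.
aiuna_ratings <- c(1, NA, 5, 2, 2, 1)

# Form used in the code cell above: average of the centred ratings.
bias_centred <- mean(aiuna_ratings - mu, na.rm = TRUE)

# Equivalent form: the user's average rating minus the global average.
bias_from_mean <- mean(aiuna_ratings, na.rm = TRUE) - mu

print(c(bias_centred, bias_from_mean))  # both are -1.3625
```

Read this way, a negative bias_user marks a respondent who rates lower than the group on average, and a positive one marks a more generous rater.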

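This code cell creates the movie_bias data frame, which holds the bias \(b_m\) for each movie. The bias is the average of what remains of each rating after subtracting \(\mu\) and the rater's \(b_u\):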

movie_bias <- ratings_df_long |>
  left_join(user_bias, by = "Name") |>
  group_by(movie) |>
  summarise(bias_movie = mean(rating - mu - bias_user, na.rm = TRUE))
kable(movie_bias)
| movie         | bias_movie |
|:--------------|-----------:|
| Bad_Boys4     | 0.3916667  |
| Gladiator2    | -0.2907407 |
| Rebel_Ridge   | -0.4833333 |
| The_Beekeeper | -0.4520833 |
| The_Substance | 0.3979167  |
| Wolfs         | 0.5404762  |
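
Note that the movie bias here is computed from residuals (rating minus mu minus the rater's bias), which is not the same as the simpler "movie average minus mu". The sketch below contrasts the two for Wolfs. The ratings and biases are retyped by hand from the tables above, so the biases carry the rounding shown there. The names wolfs_ratings, rater_bias, bias_residual, and bias_simple are hypothetical, and the simple variant appears only for comparison, not as something this analysis uses.

```r
# Global average as reported above (hard-coded for this sketch).
mu <- 3.5625

# The seven respondents who rated Wolfs, with their ratings.
wolfs_ratings <- c(Ilya = 3, "Vladimir Storchevoy" = 5, Timur = 5,
                   Gosha = 5, Bob = 5, Kirill = 4, James = 4)

# Their user biases, copied from the user_bias table (rounded to 7 decimals).
rater_bias <- c(Ilya = -0.5625, "Vladimir Storchevoy" = 1.1041667,
                Timur = 0.6041667, Gosha = 0.6041667, Bob = 0.1875,
                Kirill = 0.2375, James = 0.1041667)

# Residual-based bias, as in the code cell above.
bias_residual <- mean(wolfs_ratings - mu - rater_bias)

# Simpler alternative (not used here): movie average minus global average.
bias_simple <- mean(wolfs_ratings) - mu

print(round(c(residual = bias_residual, simple = bias_simple), 4))
# residual 0.5405, simple 0.8661
```

The two differ because the two lowest-rating respondents (Aiuna and Vadim) did not rate Wolfs, so its raw average is lifted by a more generous set of raters. Subtracting each rater's own bias first removes part of that effect.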

This code cell creates the predicted_ratings data frame, which predicts a rating for every unrated movie (NA) using the global baseline estimate formula and rounds it to the nearest whole number:

predicted_ratings <- ratings_df_long |>
  filter(is.na(rating)) |>
  left_join(user_bias, by = "Name") |>
  left_join(movie_bias, by = "movie") |>
  mutate(predicted = round(mu + bias_user + bias_movie))

kable(predicted_ratings)
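| Name   | movie         | rating | bias_user | bias_movie | predicted |
|:-------|:--------------|-------:|----------:|-----------:|----------:|
| Aiuna  | Wolfs         | NA | -1.3625 | 0.5404762  | 3 |
| Ilya   | The_Beekeeper | NA | -0.5625 | -0.4520833 | 3 |
| Vadim  | Wolfs         | NA | -1.3625 | 0.5404762  | 3 |
| Bob    | Bad_Boys4     | NA | 0.1875  | 0.3916667  | 4 |
| Bob    | Rebel_Ridge   | NA | 0.1875  | -0.4833333 | 3 |
| Kirill | The_Substance | NA | 0.2375  | 0.3979167  | 4 |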
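
A possible follow-up step, sketched here under stated assumptions, is to write the predictions back into the original wide layout so that every respondent has a value for every movie. The sketch repeats the pipeline above on ratings_demo, a hand-retyped copy of the survey table (Timestamp dropped) so that it runs offline. The clamp to the 1 to 5 range and the names ratings_demo, long, and filled_wide are additions for illustration and are not part of the analysis above.

```r
library(tibble)
library(dplyr)
library(tidyr)

# Hand-retyped copy of the survey table (assumption: matches the CSV).
ratings_demo <- tribble(
  ~Name,                 ~Gladiator2, ~Wolfs, ~The_Substance, ~Bad_Boys4, ~The_Beekeeper, ~Rebel_Ridge,
  "Aiuna",               1, NA, 5,  2,  2,  1,
  "Ilya",                3, 3,  3,  3,  NA, 3,
  "Vadim",               1, NA, 3,  1,  3,  3,
  "Vladimir Storchevoy", 5, 5,  5,  5,  4,  4,
  "Timur",               5, 5,  4,  5,  3,  3,
  "Gosha",               4, 5,  3,  5,  4,  4,
  "Bob",                 3, 5,  4,  NA, 3,  NA,
  "Kirill",              4, 4,  NA, 5,  3,  3,
  "James",               3, 4,  4,  5,  3,  3
)

long <- ratings_demo |>
  pivot_longer(cols = -Name, names_to = "movie", values_to = "rating")

# Same three quantities as in the cells above.
mu <- mean(long$rating, na.rm = TRUE)

user_bias <- long |>
  group_by(Name) |>
  summarise(bias_user = mean(rating - mu, na.rm = TRUE))

movie_bias <- long |>
  left_join(user_bias, by = "Name") |>
  group_by(movie) |>
  summarise(bias_movie = mean(rating - mu - bias_user, na.rm = TRUE))

# Illustrative post-processing: clamp predictions to 1..5, keep observed
# ratings where they exist, and fill only the NA cells.
filled_wide <- long |>
  left_join(user_bias, by = "Name") |>
  left_join(movie_bias, by = "movie") |>
  mutate(
    predicted     = pmin(pmax(round(mu + bias_user + bias_movie), 1), 5),
    rating_filled = coalesce(rating, predicted)
  ) |>
  select(Name, movie, rating_filled) |>
  pivot_wider(names_from = movie, values_from = rating_filled)

print(filled_wide)
```

With this data the clamp never triggers, since all six predictions already fall between 3 and 4. It only matters if a very harsh or very generous rater is paired with a movie biased in the same direction.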