DATA607_WEEK11

Loading libraries

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(dplyr)
library(knitr)

Loading movie ratings dataset

ratings_df <- read.csv("https://raw.githubusercontent.com/farhodibr/CUNY-SPS-MSDS/refs/heads/main/DATA607/LAB11/movie_reviews%20-%20Form%20Responses%201.csv")

This dataset contains survey responses in which each respondent rated up to six movies. An NA marks a movie the respondent did not rate.

kable(ratings_df)

Timestamp	Name	Gladiator2	Wolfs	The_Substance	Bad_Boys4	The_Beekeeper	Rebel_Ridge
5:34:20 PM	Aiuna	1	NA	5	2	2	1
6:12:59 PM	Ilya	3	3	3	3	NA	3
6:52:31 PM	Vadim	1	NA	3	1	3	3
10:16:18 PM	Vladimir Storchevoy	5	5	5	5	4	4
1:44:22 AM	Timur	5	5	4	5	3	3
1:56:52 AM	Gosha	4	5	3	5	4	4
10:30:44 AM	Bob	3	5	4	NA	3	NA
9:40:22 PM	Kirill	4	4	NA	5	3	3
9:41:38 PM	James	3	4	4	5	3	3

Tidying dataset into long format

ratings_df_long <- ratings_df |>
  select(-Timestamp) |>
  pivot_longer(
    cols = -Name,
    names_to = "movie",
    values_to = "rating"
  )
kable(head(ratings_df_long, 10))

Name	movie	rating
Aiuna	Gladiator2	1
Aiuna	Wolfs	NA
Aiuna	The_Substance	5
Aiuna	Bad_Boys4	2
Aiuna	The_Beekeeper	2
Aiuna	Rebel_Ridge	1
Ilya	Gladiator2	3
Ilya	Wolfs	3
Ilya	The_Substance	3
Ilya	Bad_Boys4	3
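A quick sanity check on the reshaping step can be sketched as follows. This is a minimal illustration, not part of the original analysis: it rebuilds only the first two respondents from the survey table above (the name ratings_small is hypothetical) and confirms that the long format has one row per user and movie pair, with unrated movies kept as explicit NA rows.

```r
library(tidyr)
library(tibble)

# First two respondents, copied from the survey table above
ratings_small <- tribble(
  ~Name,   ~Gladiator2, ~Wolfs, ~The_Substance, ~Bad_Boys4, ~The_Beekeeper, ~Rebel_Ridge,
  "Aiuna", 1,           NA,     5,              2,          2,              1,
  "Ilya",  3,           3,      3,              3,          NA,             3
)

ratings_small_long <- ratings_small |>
  pivot_longer(cols = -Name, names_to = "movie", values_to = "rating")

# One row per (user, movie) pair: 2 users x 6 movies
n_pairs <- nrow(ratings_small_long)

# Unrated pairs stay as explicit NA rows; these are the pairs to predict later
n_missing <- sum(is.na(ratings_small_long$rating))
```

Keeping the NA rows matters here, because the prediction step later filters on is.na(rating) to find the movies that still need an estimate.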

Global baseline estimate recommender formula

The global baseline estimate recommender uses the following formula:

\[ \hat{r}_{um} = \mu + b_u + b_m \]

Where:

\(\hat{r}_{um}\) - predicted rating of movie \(m\) by user \(u\)
\(\mu\) - global average rating
\(b_u\) - user bias (how a user tends to rate compared to the global average).
It is calculated by the formula:

\[ b_u = \frac{1}{N_u} \sum_{m} (r_{um} - \mu) \]

Where:
\(r_{um}\) is the rating user \(u\) gave to movie \(m\)
\(N_u\) is the number of ratings by the user
\(b_m\) - movie bias (how a movie is rated compared to the global average, after removing each rater's user bias).
It is calculated by the formula:

\[ b_m = \frac{1}{N_m} \sum_{u} (r_{um} - \mu - b_u) \]

Where:
\(N_m\) is the number of ratings for that movie
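The three terms can be illustrated on a small example before applying them to the survey data. The sketch below uses a made-up 3 x 3 ratings matrix (the users u1 to u3, the movies m1 to m3, and all values are hypothetical) and base R only; it is meant to show the order of the calculations, not to replace the dplyr code in this report.

```r
# Hypothetical toy ratings: rows are users, columns are movies, NA = not rated
ratings <- matrix(
  c(5,  3, NA,
    4, NA,  2,
    1,  2,  3),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("u1", "u2", "u3"), c("m1", "m2", "m3"))
)

# Global average over all observed ratings
mu <- mean(ratings, na.rm = TRUE)

# User bias: average deviation of a user's ratings from mu
b_u <- rowMeans(ratings - mu, na.rm = TRUE)

# Movie bias: average residual after removing mu and each rater's bias
# (sweep() subtracts b_u from every row by default)
b_m <- colMeans(sweep(ratings - mu, 1, b_u), na.rm = TRUE)

# Baseline prediction for every (user, movie) pair
pred <- outer(mu + b_u, b_m, "+")

pred["u1", "m3"]  # u1 has not rated m3
pred["u2", "m2"]  # u2 has not rated m2
```

Because mu + b_u is simply the user's own mean rating, the prediction can be read as "the user's usual rating, shifted by how the movie fares among its raters".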

This code cell calculates \(\mu\) (global average rating)

mu <- mean(ratings_df_long$rating, na.rm = TRUE)
kable(mu)

x
3.5625

This code cell creates the user_bias data frame, which holds the bias \(b_u\) for each user:

user_bias <- ratings_df_long |>
  group_by(Name) |>
  summarise(bias_user = mean(rating - mu, na.rm = TRUE))
kable(user_bias)

Name	bias_user
Aiuna	-1.3625000
Bob	0.1875000
Gosha	0.6041667
Ilya	-0.5625000
James	0.1041667
Kirill	0.2375000
Timur	0.6041667
Vadim	-1.3625000
Vladimir Storchevoy	1.1041667

This code cell creates the movie_bias data frame, which holds the bias \(b_m\) for each movie:

movie_bias <- ratings_df_long |>
  left_join(user_bias, by =  "Name") |>
  group_by(movie) |>
  summarise(bias_movie = mean(rating - mu - bias_user, na.rm = TRUE))
kable(movie_bias)

movie	bias_movie
Bad_Boys4	0.3916667
Gladiator2	-0.2907407
Rebel_Ridge	-0.4833333
The_Beekeeper	-0.4520833
The_Substance	0.3979167
Wolfs	0.5404762
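The movie bias for Wolfs can be checked by hand from the tables above. The sketch below copies the seven Wolfs ratings and the matching user biases into a small data frame (the names wolfs, bias_wolfs, and naive_bias_wolfs are illustrative) and also computes, for comparison only, a simpler bias that ignores who rated the movie.

```r
# Wolfs ratings and each rater's user bias, copied from the tables above
wolfs <- data.frame(
  Name = c("Ilya", "Vladimir Storchevoy", "Timur", "Gosha",
           "Bob", "Kirill", "James"),
  rating = c(3, 5, 5, 5, 5, 4, 4),
  bias_user = c(-0.5625, 1.1041667, 0.6041667, 0.6041667,
                0.1875, 0.2375, 0.1041667)
)
mu <- 3.5625

# Residual left after removing the global mean and the rater's own bias
wolfs$residual <- wolfs$rating - mu - wolfs$bias_user

# Movie bias: average residual over the movie's raters
bias_wolfs <- mean(wolfs$residual)

# For comparison only: movie mean minus mu, ignoring who rated the movie
naive_bias_wolfs <- mean(wolfs$rating) - mu
```

Here bias_wolfs agrees with the 0.5404762 shown in the table, while the comparison value is larger. The gap reflects that most Wolfs raters have a positive user bias, which the residual-based formula removes before averaging.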

This code cell creates the predicted_ratings data frame, which predicts ratings for the unrated movies (NAs) using the global baseline estimate formula and rounds each prediction to the nearest whole rating:

predicted_ratings <- ratings_df_long |>
  filter(is.na(rating)) |>
  left_join(user_bias, by ="Name") |>
  left_join(movie_bias, by = "movie") |>
  mutate(predicted = round(mu + bias_user + bias_movie))

kable(predicted_ratings)

Name	movie	rating	bias_user	bias_movie	predicted
Aiuna	Wolfs	NA	-1.3625	0.5404762	3
Ilya	The_Beekeeper	NA	-0.5625	-0.4520833	3
Vadim	Wolfs	NA	-1.3625	0.5404762	3
Bob	Bad_Boys4	NA	0.1875	0.3916667	4
Bob	Rebel_Ridge	NA	0.1875	-0.4833333	3
Kirill	The_Substance	NA	0.2375	0.3979167	4
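The first row of the table can be reproduced directly from the formula, and the same arithmetic shows that an unrounded baseline estimate is not guaranteed to stay inside the rating scale. The sketch below uses values copied from the tables above; the helper clamp_rating is hypothetical and is not part of the original code, and the 1 to 5 scale is an assumption based on the ratings observed in the survey.

```r
# Values copied from the tables above
mu <- 3.5625
bias_user <- c(Aiuna = -1.3625, "Vladimir Storchevoy" = 1.1041667)
bias_movie <- c(Wolfs = 0.5404762)

# Hypothetical helper (not in the original code): keep a prediction inside
# the rating scale, assumed here to run from 1 to 5
clamp_rating <- function(x, lower = 1, upper = 5) pmin(pmax(x, lower), upper)

# Aiuna has not rated Wolfs, so this reproduces the first row of the table
raw_aiuna <- mu + bias_user[["Aiuna"]] + bias_movie[["Wolfs"]]
pred_aiuna <- clamp_rating(round(raw_aiuna))

# Vladimir Storchevoy already rated Wolfs; he is used only to show that the
# raw estimate can exceed the scale for a high-bias user and a high-bias movie
raw_vladimir <- mu + bias_user[["Vladimir Storchevoy"]] + bias_movie[["Wolfs"]]
pred_vladimir <- clamp_rating(raw_vladimir)
```

None of the six predictions in the table needs clamping, so this step would only matter if the model were applied to more extreme combinations of user and movie bias.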

Farhod Ibragimov

2025-04-12