assignment_3a

Global Baseline Estimates

Using the information you collected on movie ratings, implement a Global Baseline Estimate recommendation system in R. The attached spreadsheet provides the implementation algorithm.

Most recommender systems use personalized algorithms like “content management” and “item-item collaborative filtering.” Sometimes non-personalized recommenders are also useful or necessary. One of the best non-personalized recommender system algorithms is the “Global Baseline Estimate.

The job here is to use the survey data collected and write the R code that makes a movie recommendation using the Global Baseline Estimate algorithm. Please see the attached spreadsheet for implementation details.

Movie Ratings XLSX”

Approach

Review

I’ll start by reviewing the excel.

There’s 4 sheets:

MovieRatings

Survey of the list of movies

Problem Statement

Just seems like survey
No movie title keys
Keys for names of people taking the survey

MeanCenteredMovieRatings

First table takes the mean rating per person based on the movies they rated
small subset of all the critics
Second table takes the deviation from the mean per rating per person

Global Baseline

user average
- average(movie rating) per row
- last row is the average of all movies
user average - mean
- user average - total movie average
total movie average
- Takes the average per row, ignores NA
Movie avg
- average rating per movie
movie avg - mean movie
- movie average - total average
How would Param rate Pitch Perfect 2?
- Global Baseline Estimate =
  - Mean movie rating +
  - Pitch Perfect 2’s rating relative to average +
  - Param’s rating relative to average

So, outside of the movie references, the Global Baseline Estimates are:

Expected value = Grand Mean + Row Effect + Column Effect
- Expected value = F10
  - Value we are trying to predict.
- Grand Mean = H18
  - Overall effect
- Row effect = I10
  - Group A effect
- Column effect = F19
  - Group B effect

It’s pretty interesting, apparently it’s just variance decomposition. It’s pretty intuitive, you are predicting a cell, so you take the variance from the row, column, and the entire table to inform that prediction. The model is applied all over the place, because it’s a general pattern structure that separates systematic structure (predictable patterns) from randomness.

ANOVA \[ SS_{Total} = SS_{Rows} + SS_{Columns} + SS_{Residual} \] —

Implementation

Import rating data from PGSQL as df

library(DBI)
library(RPostgres)
library(tidyverse)
library(dotenv)

load_dot_env()

con <- dbConnect(
  RPostgres::Postgres(),
  dbname = Sys.getenv("DB_NAME"),
  host = Sys.getenv("DB_HOST"),
  port = Sys.getenv("DB_PORT"),
  user = Sys.getenv("DB_USER"),
  password = Sys.getenv("DB_PASSWORD")
)

Connection Test

dbGetQuery(con, "SELECT version();")

                                                                  version
1 PostgreSQL 17.6 on x86_64-windows, compiled by msvc-19.44.35213, 64-bit

Creating csv from df

query <- "SELECT * FROM popular_movies.v_ratings_raw"
df <- dbGetQuery(con, query) |>
  as_tibble()

df |> select(name, title, rating)

# A tibble: 30 × 3
   name  title                    rating
   <chr> <chr>                     <int>
 1 Alex  One Battle After Another      5
 2 Alex  Begonia                       4
 3 Alex  Wicked for Good               4
 4 Alex  The Materialist               3
 5 Alex  Sinners                      NA
 6 Bri   One Battle After Another      4
 7 Bri   Begonia                       3
 8 Bri   Wicked for Good               5
 9 Bri   The Materialist              NA
10 Bri   Sinners                       4
# ℹ 20 more rows

write.csv(df, "movie_ratings.csv", row.names = FALSE)

So the data now lives in the folder. I’ll just clean up the df.

df <- read.csv("movie_ratings.csv")
df <- df |> select(name, title, rating)
df

   name                    title rating
1  Alex One Battle After Another      5
2  Alex                  Begonia      4
3  Alex          Wicked for Good      4
4  Alex          The Materialist      3
5  Alex                  Sinners     NA
6   Bri One Battle After Another      4
7   Bri                  Begonia      3
8   Bri          Wicked for Good      5
9   Bri          The Materialist     NA
10  Bri                  Sinners      4
11 Chen One Battle After Another     NA
12 Chen                  Begonia      5
13 Chen          Wicked for Good      4
14 Chen          The Materialist      3
15 Chen                  Sinners      4
16 Devi One Battle After Another      3
17 Devi                  Begonia     NA
18 Devi          Wicked for Good      3
19 Devi          The Materialist      4
20 Devi                  Sinners      5
21  Eli One Battle After Another      4
22  Eli                  Begonia      3
23  Eli          Wicked for Good     NA
24  Eli          The Materialist      5
25  Eli                  Sinners      3
26 Fran One Battle After Another      5
27 Fran                  Begonia      3
28 Fran          Wicked for Good      4
29 Fran          The Materialist      4
30 Fran                  Sinners     NA

So I have name, title, and rating. I want to create a function called global_baseline_estimate

#global_baseline_estimate() <- function(df){}

So, it would need to do the following:

create summarization by name (df_name), get mean rating (na.rm = TRUE) per name (n_mean)
create summarization by title (df_title), get mean rating (na.rm = TRUE) per title (t_mean)
create variable for the mean rating of all titles (x)
mutate name summarization (df_name) to calculate effect (n_effect) = (n_mean - x)
mutate title summarization (df_title) to calculate effect (t_effect) = (t_mean - x)
join df_name$n_effect by name
join df_title$t_effect by title
mutate df (gbe) by rating: if na then x + n_effect + t_effect else rating

That should get me a completed dataset where na values are filled with ratings from a global baseline estimate. Pretty neat.

Codebase

df <- read.csv("movie_ratings.csv")
df <- df |> select(name, title, rating)
df

   name                    title rating
1  Alex One Battle After Another      5
2  Alex                  Begonia      4
3  Alex          Wicked for Good      4
4  Alex          The Materialist      3
5  Alex                  Sinners     NA
6   Bri One Battle After Another      4
7   Bri                  Begonia      3
8   Bri          Wicked for Good      5
9   Bri          The Materialist     NA
10  Bri                  Sinners      4
11 Chen One Battle After Another     NA
12 Chen                  Begonia      5
13 Chen          Wicked for Good      4
14 Chen          The Materialist      3
15 Chen                  Sinners      4
16 Devi One Battle After Another      3
17 Devi                  Begonia     NA
18 Devi          Wicked for Good      3
19 Devi          The Materialist      4
20 Devi                  Sinners      5
21  Eli One Battle After Another      4
22  Eli                  Begonia      3
23  Eli          Wicked for Good     NA
24  Eli          The Materialist      5
25  Eli                  Sinners      3
26 Fran One Battle After Another      5
27 Fran                  Begonia      3
28 Fran          Wicked for Good      4
29 Fran          The Materialist      4
30 Fran                  Sinners     NA

# rater mean

s_name <- df |>
  summarize(rater_mean = mean(rating, na.rm = TRUE), .by = name)

# item mean

s_title <- df |>
  summarize(item_mean = mean(rating, na.rm = TRUE), .by = title)

# global mean

s_global <- df |>
  summarize(mean = mean(rating, na.rm = TRUE))

global_mean <- s_global$mean
global_mean

[1] 3.916667

# rater effect

s_name <- s_name |> mutate(rater_effect = rater_mean - global_mean)
s_title <- s_title |> mutate(item_effect = item_mean - global_mean)

df2 <- df
df2 <- df2 |> left_join(s_name, join_by(name))|>
  left_join(s_title, join_by(title)) |>
  mutate(rating = if_else(is.na(rating), global_mean + rater_effect + item_effect, rating),
         rating = round(rating))

head(df2)

  name                    title rating rater_mean rater_effect item_mean
1 Alex One Battle After Another      5          4   0.08333333       4.2
2 Alex                  Begonia      4          4   0.08333333       3.6
3 Alex          Wicked for Good      4          4   0.08333333       4.0
4 Alex          The Materialist      3          4   0.08333333       3.8
5 Alex                  Sinners      4          4   0.08333333       4.0
6  Bri One Battle After Another      4          4   0.08333333       4.2
  item_effect
1  0.28333333
2 -0.31666667
3  0.08333333
4 -0.11666667
5  0.08333333
6  0.28333333

We calculated the rating for individuals with NA ratings. Let’s say we were to add 1 person to the dataframe with ratings of NA per movie?

df3 <- df2[1:5,] |> mutate(rating = NA, name = "Shawn", rater_mean = NA, rater_effect = 0)
df3

   name                    title rating rater_mean rater_effect item_mean
1 Shawn One Battle After Another     NA         NA            0       4.2
2 Shawn                  Begonia     NA         NA            0       3.6
3 Shawn          Wicked for Good     NA         NA            0       4.0
4 Shawn          The Materialist     NA         NA            0       3.8
5 Shawn                  Sinners     NA         NA            0       4.0
  item_effect
1  0.28333333
2 -0.31666667
3  0.08333333
4 -0.11666667
5  0.08333333

# Oh, look. Shawn didn't watch any movies!
# When rater is NA, it's assumed that rater_effect is zero.
# Meaning the deviation from the global mean is zero.

# We calculate the ratings using the global baseline estimate!
df3 <- df3 |> mutate(rating = (global_mean + item_effect))
df3|> select(name, title, rating) |> as_tibble()

# A tibble: 5 × 3
  name  title                    rating
  <chr> <chr>                     <dbl>
1 Shawn One Battle After Another    4.2
2 Shawn Begonia                     3.6
3 Shawn Wicked for Good             4  
4 Shawn The Materialist             3.8
5 Shawn Sinners                     4