The goal of this assignment is to build a Global Baseline Estimate (GBE) recommendation system in R and see how user-specific and item-specific adjustments change the predictions. Using the movie rating data collected in the previous assignment (Assignment_2A_Movie ratings), I aim to predict ratings for movies that reviewers have not yet seen, using the GBE formula. The data set has missing values (NA/NULL), and these are the targets for the prediction algorithm. The model accounts for each reviewer's rating tendency (user bias) and each movie's overall popularity (item bias).
The data set contains ratings from 6 reviewers (Liz, Jed, Brenda, Jamie, Justice, Theresa) for 6 horror/thriller movies (The Substance, Nosferatu, Frankenstein, 28 Years Later, Sinners, The Conjuring 4). I am going to calculate Jamie's predicted rating for 28 Years Later, a movie she has not seen.
Each term in this formula will be calculated:
predicted rating = global average + item bias + user bias
\[\hat{r}_{ui} = \mu + b_i + b_u\]
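As a reading aid, the two bias terms can be written out as deviations from the global mean, which is how they are computed later in this report. The symbols \(\bar{r}_i\) (mean rating of movie \(i\)) and \(\bar{r}_u\) (mean rating given by reviewer \(u\)) are notation introduced here and are not part of the original formula:
\[b_i = \bar{r}_i - \mu, \qquad b_u = \bar{r}_u - \mu\]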
Sources: data set from a local PostgreSQL database populated via Assignment2A_movie_rating.sql; Gemini Pro for information on and definitions of each part of the Global Baseline Estimate formula; and R for Data Science (2e) for tidyverse arguments and functions.
Connecting/Loading Data
Connect to the PostgreSQL database with DBI and RPostgres, import the raw ratings table into an R data frame, and convert SQL NULL values into R NA values for calculation.
Bias Calculation and GBE Formula
With dplyr, use the group_by() function and calculate the global average, the item bias, and the user bias.
Comparison and Data Visualization
Compare raw averages with GBE predictions and look at the user bias distribution. Find the predicted 28 Years Later score for reviewer Jamie.
Challenges: database reproducibility, finding the correct functions to build the formula, and finding out what Jamie would rate 28 Years Later (NA).
# Load essential packages to use for data manipulation
library(DBI)
library(RPostgres)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.1 ✔ tibble 3.3.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Previously, opening this file prompted users to create the database and enter a password.
# To make the report easier to reproduce, the SQL query result was exported to a CSV file this time.
df <- read.csv("movie_ratings.csv")
# Confirm that the table loaded correctly
print(df)
## title name rating
## 1 The Substance Liz 5
## 2 Nosferatu Liz 5
## 3 28 Years Later Liz 5
## 4 Sinners Jed 5
## 5 The Substance Jed 3
## 6 28 Years Later Brenda 2
## 7 The Conjuring: Last Rites Brenda 1
## 8 Nosferatu Brenda 4
## 9 The Substance Jamie 4
## 10 Frankenstein Jamie 5
## 11 The Substance Justice 5
## 12 Nosferatu Justice 5
## 13 Frankenstein Justice 5
## 14 28 Years Later Justice 5
## 15 Sinners Justice 5
## 16 The Conjuring: Last Rites Justice 5
# Global average of all ratings; na.rm = TRUE guards against missing ratings
global_mu <- mean(df$rating, na.rm = TRUE)
print(paste("Global_Average (Mu):", round(global_mu, 4)))
## [1] "Global_Average (Mu): 4.3125"
movie_avg <- df %>%
  filter(title == "28 Years Later") %>%                  # keep only this movie
  summarize(mean_rating = mean(rating, na.rm = TRUE)) %>% # average rating of the movie
  pull(mean_rating)                                      # extract the column as a numeric value
# Item bias: how far the movie's average sits from the global average
b_i <- movie_avg - global_mu
user_avg <- df %>%
  filter(name == "Jamie") %>%                            # keep only Jamie's ratings
  summarize(mean_rating = mean(rating, na.rm = TRUE)) %>% # Jamie's average rating
  pull(mean_rating)                                      # extract the column as a numeric value
# User bias: how far Jamie's average sits from the global average
b_u <- user_avg - global_mu
# GBE prediction: global average + item bias + user bias
prediction <- global_mu + b_i + b_u
print(paste("Global Average:", round(global_mu, 4))) # round to 4 decimal places
## [1] "Global Average: 4.3125"
print(paste("Movie Bias (28 Years Later):", round(b_i, 4)))
## [1] "Movie Bias (28 Years Later): -0.3125"
print(paste("User Bias (Jamie):", round(b_u, 4)))
## [1] "User Bias (Jamie): 0.1875"
print(paste("FINAL JAMIE'S PREDICTION:", round(prediction, 4)))
## [1] "FINAL JAMIE'S PREDICTION: 4.1875"
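The steps above predict a single reviewer/movie pair. A minimal sketch of the same calculation for every unrated pair, using group_by() as the outline describes, might look like the following. The object names (ratings, item_bias, user_bias, predictions) are illustrative and not from the original script, and the data frame is retyped from the table printed earlier so the block runs on its own.

```r
library(dplyr)

# Ratings retyped from the table printed earlier in this report (16 rows).
ratings <- data.frame(
  title = c("The Substance", "Nosferatu", "28 Years Later",
            "Sinners", "The Substance",
            "28 Years Later", "The Conjuring: Last Rites", "Nosferatu",
            "The Substance", "Frankenstein",
            "The Substance", "Nosferatu", "Frankenstein",
            "28 Years Later", "Sinners", "The Conjuring: Last Rites"),
  name = c(rep("Liz", 3), rep("Jed", 2), rep("Brenda", 3),
           rep("Jamie", 2), rep("Justice", 6)),
  rating = c(5, 5, 5, 5, 3, 2, 1, 4, 4, 5, 5, 5, 5, 5, 5, 5),
  stringsAsFactors = FALSE
)

# Global average over all observed ratings.
mu <- mean(ratings$rating, na.rm = TRUE)

# Item bias: each movie's mean rating minus the global average.
item_bias <- ratings %>%
  group_by(title) %>%
  summarize(b_i = mean(rating, na.rm = TRUE) - mu, .groups = "drop")

# User bias: each reviewer's mean rating minus the global average.
user_bias <- ratings %>%
  group_by(name) %>%
  summarize(b_u = mean(rating, na.rm = TRUE) - mu, .groups = "drop")

# All reviewer/movie combinations, minus the pairs that already have a rating.
all_pairs <- expand.grid(
  name = unique(ratings$name),
  title = unique(ratings$title),
  stringsAsFactors = FALSE
)

predictions <- all_pairs %>%
  anti_join(ratings, by = c("name", "title")) %>%
  left_join(item_bias, by = "title") %>%
  left_join(user_bias, by = "name") %>%
  mutate(predicted = mu + b_i + b_u) %>%   # GBE formula, not clipped to the rating scale
  arrange(name, title)

print(predictions)
```

The row for Jamie and 28 Years Later reproduces the 4.1875 computed above. The values are not clipped, so a pair such as Liz and Frankenstein comes out above 5 (5.6875); whether to cap predictions at the top of the rating scale is a separate design choice that this report does not make.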
The global average was 4.31, taken across all 16 ratings in the data set. Either the reviewers in this group are generous with their ratings or the selected horror/thriller movies were well liked. Because the baseline is so high, a movie rated below 4.31 is below average for this data set, even though 4.0 would normally be a good score.
Item bias: 28 Years Later has an average rating of 4.0, giving it a negative bias of -0.31. This movie is rated slightly below the overall average in this data set.
User bias: Jamie's average rating is 4.5, giving her a positive bias of +0.19. This shows that Jamie is a slightly more generous reviewer than the average in this group.
The Global Baseline Estimate formula predicts that Jamie would rate 28 Years Later a 4.19. This predicted score suggests that Jamie would likely enjoy 28 Years Later, so the system would recommend this movie to her. The formula gives a personalized prediction: it adjusts the raw movie average for how generously this reviewer rates, rather than relying on the raw movie average alone.
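As a quick check of the arithmetic, the three values printed in the output above can be substituted into the formula:
\[\hat{r}_{\text{Jamie},\,\text{28 Years Later}} = 4.3125 + (-0.3125) + 0.1875 = 4.1875 \approx 4.19\]
Equivalently, since each bias is an average minus \(\mu\), the prediction is the movie average plus the reviewer average minus the global average: \(4.0 + 4.5 - 4.3125 = 4.1875\).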