In this lab, we use the Global Baseline Estimate (GBE) algorithm to predict how a viewer would rate a movie based on (a) how that specific viewer has rated other movies and (b) how all movies in the dataset have been rated by all viewers. The GBE predicted value is calculated as:
GBE = “Mean movie rating overall” + (“Specific movie rating relative to average”) + (“User rating relative to average”)
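As a quick illustration of the formula, here is a minimal sketch in R. The helper name `gbe_estimate` and the numbers passed to it are hypothetical, chosen only to show the arithmetic; they are not taken from the survey data.

```r
# Hypothetical helper: combine the three GBE terms for one viewer/movie pair.
gbe_estimate <- function(overall_mean, movie_mean, viewer_mean) {
  movie_adj  <- movie_mean - overall_mean   # specific movie rating relative to average
  viewer_adj <- viewer_mean - overall_mean  # user rating relative to average
  overall_mean + movie_adj + viewer_adj
}

# Illustrative values only (not from the survey data):
# a well-liked movie (+0.5) rated by a slightly harsh viewer (-0.5)
example_gbe <- gbe_estimate(overall_mean = 3.5, movie_mean = 4.0, viewer_mean = 3.0)
print(example_gbe)
```

In this made-up case the two adjustments cancel out, so the estimate falls back to the overall mean.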
First, we load our movie survey data:
# Load the packages used below (RMySQL for the connection, dplyr for wrangling)
library(RMySQL)
library(dplyr)

# Read data from db
source("login_credentials.R")
mydb <- dbConnect(MySQL(), user = db_user, password = db_password,
                  dbname = db_name, host = db_host, port = db_port)
query <- "SELECT r.response_id, p.FirstName, m.title, r.rating FROM survey_movie_ratings AS r LEFT JOIN survey_movies AS m ON m.movie_id = r.movie_id LEFT JOIN survey_participants AS p ON p.participant_id = r.participant_id"
rs <- dbSendQuery(mydb, query)
# Fetch every row and rename FirstName to viewer
df <- fetch(rs, n = -1) |>
  rename(viewer = FirstName)
dbDisconnect(mydb)
## Warning: Closing open result sets
## [1] TRUE
head(df, 10)
##    response_id viewer                      title rating
## 1            1  Nadia                         Up      5
## 2            2  Nadia                      Moana     NA
## 3            3  Nadia                 Inside Out      5
## 4            4  Nadia Nightmare Before Christmas      4
## 5            5  Nadia                Beetlejuice      3
## 6            6  Nadia                 Home Alone      2
## 7            7   Luna                         Up      5
## 8            8   Luna                      Moana      5
## 9            9   Luna                 Inside Out      5
## 10          10   Luna Nightmare Before Christmas     NA
To calculate the GBE, we first compute the mean rating across all movies. Next, we compute each movie's rating relative to that average: we group the data by the title column, then use mutate to create a new field called movie_avg_adj, which equals the movie's average rating minus the mean rating for all movies. This tells us how much higher or lower the movie rates compared with movies overall. We use the same approach for each viewer's rating relative to average: we group by the viewer field and create a new column called viewer_avg_adj, which equals the viewer's average rating minus the mean rating for all movies. This tells us how different a viewer’s typical ratings are from the average rating.
Finally, we keep only the rows with a missing rating, compute the GBE for each one, and round the result to match the integer 1-5 scale of the original ratings.
# calculate mean rating for all movies
df <- df |>
  mutate(
    mean_movie_rating = mean(rating, na.rm = TRUE)
  )
movie_calculated_rating <- df |>
  group_by(title) |>
  # movie adjustment: movie avg - mean movie rating
  mutate(
    movie_avg_adj = mean(rating, na.rm = TRUE) - mean_movie_rating
  ) |>
  ungroup() |>
  group_by(viewer) |>
  # viewer adjustment: viewer avg - mean movie rating
  mutate(
    viewer_avg_adj = mean(rating, na.rm = TRUE) - mean_movie_rating
  ) |>
  ungroup() |>
  # keep only the missing ratings, which are the ones we want to predict
  filter(is.na(rating)) |>
  # compute the global baseline estimate for each missing rating
  mutate(gbe_rating = mean_movie_rating + movie_avg_adj + viewer_avg_adj)
glimpse(movie_calculated_rating)
## Rows: 9
## Columns: 8
## $ response_id <int> 2, 10, 11, 16, 17, 20, 25, 26, 27
## $ viewer <chr> "Nadia", "Luna", "Luna", "Natalia", "Natalia", "Ian"…
## $ title <chr> "Moana", "Nightmare Before Christmas", "Beetlejuice"…
## $ rating <int> NA, NA, NA, NA, NA, NA, NA, NA, NA
## $ mean_movie_rating <dbl> 3.777778, 3.777778, 3.777778, 3.777778, 3.777778, 3.…
## $ movie_avg_adj <dbl> 0.5555556, 0.4722222, -0.2777778, 0.4722222, -0.2777…
## $ viewer_avg_adj <dbl> 0.02222222, 0.97222222, 0.97222222, 0.47222222, 0.47…
## $ gbe_rating <dbl> 4.355556, 5.222222, 4.472222, 4.722222, 3.972222, 3.…
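As a sanity check on the output above, the first row (Nadia, Moana) can be recomputed by hand from its three components. This is only a sketch: the values are copied from the truncated glimpse() display, so they carry its rounding, and the variable names are chosen here for illustration.

```r
# Values copied from the first row of the glimpse() output (Nadia, Moana)
row_mean_movie_rating <- 3.777778
row_movie_avg_adj     <- 0.5555556
row_viewer_avg_adj    <- 0.02222222

# GBE = overall mean + movie adjustment + viewer adjustment
check_gbe <- row_mean_movie_rating + row_movie_avg_adj + row_viewer_avg_adj
print(round(check_gbe, 2))
```

The sum agrees with the gbe_rating reported for that row (4.355556) up to display precision.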
# round ratings
movie_gbe_rating <- movie_calculated_rating |>
  select(viewer, title, gbe_rating) |>
  mutate(rounded_gbe = round(gbe_rating))
movie_gbe_rating
## # A tibble: 9 × 4
##   viewer  title                      gbe_rating rounded_gbe
##   <chr>   <chr>                           <dbl>       <dbl>
## 1 Nadia   Moana                            4.36           4
## 2 Luna    Nightmare Before Christmas       5.22           5
## 3 Luna    Beetlejuice                      4.47           4
## 4 Natalia Nightmare Before Christmas       4.72           5
## 5 Natalia Beetlejuice                      3.97           4
## 6 Ian     Moana                            3.76           4
## 7 Kris    Up                               3.49           3
## 8 Kris    Moana                            3.22           3
## 9 Kris    Inside Out                       3.29           3
The Global Baseline Estimate allows us to predict how a user would rate a particular movie or item based on their past ratings and the average rating for that movie or item. This simple method gives us a reasonable baseline to work with but may need fine-tuning to improve results. Even in this sample dataset, we can see that the GBE has predicted a value above our rating ceiling of 5 (5.22 for Luna and Nightmare Before Christmas). While we adjusted for this by rounding the predicted values, rounding may not be the most accurate way to resolve the issue: values above 5.5 would still be rounded out of range (to 6), and flooring instead may lose some nuance, as in the case of the 3.97 prediction, which would become a 3 instead of a 4. Beyond these simple observations, a more robust methodology may be to experiment with other algorithms that compare each viewer whose ratings we want to predict with the viewers who have the most preferences in common.
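One possible way to handle out-of-range predictions, sketched here as an assumption rather than as part of the lab's method, is to clamp each estimate to the 1-5 scale before rounding. The vector below copies the displayed gbe_rating values from the results table; the names `clamped` and `rounded_clamped` are hypothetical.

```r
# Displayed GBE predictions copied from the results table
gbe_rating <- c(4.36, 5.22, 4.47, 4.72, 3.97, 3.76, 3.49, 3.22, 3.29)

# Clamp to the valid rating scale first, then round to the nearest integer
clamped <- pmin(pmax(gbe_rating, 1), 5)
rounded_clamped <- round(clamped)
print(rounded_clamped)
```

For this dataset the clamped and rounded values match the rounded_gbe column, because the only out-of-range prediction (5.22) already rounds to 5. The difference would only show up for predictions above 5.5 or below 0.5.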