The Global Baseline Estimate (GBE) is a non-personalized prediction algorithm that starts from a global average rating across all users and items. Bias is then accounted for by calculating how far each user's and each item's average rating deviates from that global average.
Recommendations are then determined as follows: the predicted rating by user u of item i = the global average rating + the user bias + the item bias.
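As a minimal sketch, this formula can be expressed as a small helper function; the function name and the numbers below are hypothetical illustrations, not values from the survey data.
predict_gbe <- function(global_avg, user_bias, item_bias) {
  # GBE prediction: the global average adjusted by the user and item deviations
  global_avg + user_bias + item_bias
}
predict_gbe(4.0, 0.5, -0.7)  # hypothetical inputs: returns 3.8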
Below, the GBE is applied to sample movie ratings to make movie recommendations. First, the Excel survey data is imported.
library(tidyverse)
library(readxl)
ratings <- read_excel("MovieRatings.xlsx")
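The rest of the walkthrough assumes the workbook holds one Critic column followed by six movie columns (referenced as ratings[2:7] below), with NA wherever a critic skipped a movie; a quick structural check can confirm this.
glimpse(ratings)  # one row per critic, one column per movie (output omitted)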
Next, the average ratings for each Critic row (user_biases) and movie column (item_biases) are calculated.
# Average rating per critic across the six movie columns
user_biases <- data.frame(
  Critic = ratings$Critic,
  user_avg = round(rowMeans(ratings[2:7], na.rm = TRUE), digits = 2)
)
# Global average for users: the mean of the per-critic averages
user_global_avg <- round(mean(user_biases$user_avg), digits = 2)
The global average for users is 4.03, and each critic's bias is their average rating minus this value.
user_biases <- user_biases |>
  mutate(user_bias = user_avg - user_global_avg)
user_biases
## Critic user_avg user_bias
## 1 Burton 4.00 -0.03
## 2 Charley 3.50 -0.53
## 3 Dan 5.00 0.97
## 4 Dieudonne 4.67 0.64
## 5 Matt 3.25 -0.78
## 6 Mauricio 3.50 -0.53
## 7 Max 3.33 -0.70
## 8 Nathan 4.00 -0.03
## 9 Param 3.50 -0.53
## 10 Parshu 3.67 -0.36
## 11 Prashanth 4.80 0.77
## 12 Shipra 4.00 -0.03
## 13 Sreejaya 4.67 0.64
## 14 Steve 4.00 -0.03
## 15 Vuthy 3.60 -0.43
## 16 Xingjia 5.00 0.97
The process for calculating the item bias is similar, except the ratings table is first pivoted longer to tidy the data; the resulting global average for items is 3.87.
ratings_tidy <- ratings |>
  pivot_longer(!Critic, names_to = "movies", values_to = "rating")
# Average rating per movie, then the mean of those averages
item_biases <- ratings_tidy |>
  group_by(movies) |>
  summarize(item_avg = round(mean(rating, na.rm = TRUE), digits = 2))
item_global_avg <- round(mean(item_biases$item_avg), digits = 2)
item_biases <- item_biases |>
  mutate(item_bias = item_avg - item_global_avg)
item_biases
## # A tibble: 6 × 3
## movies item_avg item_bias
## <chr> <dbl> <dbl>
## 1 CaptainAmerica 4.27 0.400
## 2 Deadpool 4.44 0.57
## 3 Frozen 3.73 -0.140
## 4 JungleBook 3.9 0.0300
## 5 PitchPerfect2 2.71 -1.16
## 6 StarWarsForce 4.15 0.280
These biases are joined onto the unrated user-movie pairs in the tidy dataset, so each prediction reduces to a simple column calculation.
gbe_ratings <- ratings_tidy |>
  # Keep only the pairs a critic has not rated, then attach both biases
  filter(is.na(rating)) |>
  left_join(user_biases, by = "Critic") |>
  left_join(item_biases, by = "movies") |>
  mutate(predicted_rating = user_global_avg + user_bias + item_bias)
gbe_ratings
## # A tibble: 35 × 8
## Critic movies rating user_avg user_bias item_avg item_bias predicted_rating
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Burton Capta… NA 4 -0.0300 4.27 0.400 4.4
## 2 Burton Deadp… NA 4 -0.0300 4.44 0.57 4.57
## 3 Burton Frozen NA 4 -0.0300 3.73 -0.140 3.86
## 4 Burton Pitch… NA 4 -0.0300 2.71 -1.16 2.84
## 5 Dan Capta… NA 5 0.97 4.27 0.400 5.4
## 6 Dan Frozen NA 5 0.97 3.73 -0.140 4.86
## 7 Dan Jungl… NA 5 0.97 3.9 0.0300 5.03
## 8 Dan Pitch… NA 5 0.97 2.71 -1.16 3.84
## 9 Dieudon… Frozen NA 4.67 0.64 3.73 -0.140 4.53
## 10 Dieudon… Jungl… NA 4.67 0.64 3.9 0.0300 4.7
## # ℹ 25 more rows
The predicted_rating column of the new gbe_ratings dataframe should give reasonable predictions for each user-item pair. As a check, Burton's prediction for CaptainAmerica is 4.03 + (-0.03) + 0.40 = 4.40, matching the first row above. Each Critic could then be given a list of the movies they have not rated, filtered to any prediction over 4.
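A sketch of that last step, assuming the cutoff of 4 suggested above (recommendations is a name introduced here for illustration):
# Keep only predictions above 4 and list them per critic
recommendations <- gbe_ratings |>
  filter(predicted_rating > 4) |>
  select(Critic, movies, predicted_rating) |>
  arrange(Critic, desc(predicted_rating))
recommendations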