After reading the assignment requirements, I learned that the Global Baseline Estimate is one of the best non-personalized recommender system algorithms. My plan is to set up a connection to my PostgreSQL database, load the movie rating data into RStudio, implement the Global Baseline Estimate algorithm, and generate the top recommendations for each user.
Load and Explore Movie Ratings Data
Establish a connection to the PostgreSQL database and load the data. For a movie recommender we typically do not need to impute missing ratings, since the goal is to predict a score for the movies a user has not seen. The result below shows that the data were loaded successfully from the database into R, with the columns user_id, name, movie_id, title, genre, and rating. We are ready for the next step: calculating the Global Baseline Estimates.
library(DBI)
library(RPostgres)
library(dplyr)
library(tidyr)
library(ggplot2)

# create connection without revealing password
db_password <- Sys.getenv("DB_PASSWORD")
db_user <- "postgres"
db_name <- "xmdb"
db_host <- "localhost"
db_port <- 5432

con <- dbConnect(
  Postgres(),
  dbname = db_name,
  host = db_host,
  port = db_port,
  user = db_user,
  password = db_password
)

movie_ratings <- dbGetQuery(con, "
  SELECT u.user_id, u.name, m.movie_id, m.title, m.genre, r.rating
  FROM ratings r
  JOIN users u ON r.user_id = u.user_id
  JOIN movies m ON r.movie_id = m.movie_id
")

# test connection - View structure and first few rows
str(movie_ratings)
For the Global Baseline Estimate we need three parts: the overall mean rating, the user bias, and the movie bias. The formula is b_ui = μ + b_u + b_i, where μ is the overall average rating, b_u is the user bias (the user's average deviation from the global mean), and b_i is the movie bias (the movie's average deviation from the global mean). For example, with μ = 3.5, a user who rates 0.5 above average, and a movie rated 1.0 below average, the predicted rating is 3.5 + 0.5 - 1.0 = 3.0.
First we compute the global mean of all ratings, ignoring NA values; this is the baseline for every prediction. Next we compute each user's average rating and subtract the global mean to get the user bias, which can be positive or negative. In the same way, each movie's average rating minus the global mean gives the movie bias, also positive or negative. With these three pieces we are ready to make predictions using the formula.
global_mean <- mean(movie_ratings$rating, na.rm = TRUE)
cat("Global Mean Rating (μ):", global_mean, "\n")
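The bias calculation described above is not shown in the rendered output, so here is a minimal sketch of how it could look with dplyr. It runs on a small made-up data frame (`toy_ratings`, hypothetical) instead of the database table, and the column names `user_bias` and `movie_bias` are my assumptions rather than names confirmed by the original code.

```r
library(dplyr)

# Hypothetical toy data standing in for movie_ratings (not the real database rows)
toy_ratings <- data.frame(
  user_id  = c(1, 1, 2, 2, 3, 3),
  movie_id = c(1, 2, 1, 3, 2, 3),
  rating   = c(5, 3, 4, 2, 4, 3)
)

# Overall mean rating (mu), ignoring missing ratings
global_mean <- mean(toy_ratings$rating, na.rm = TRUE)

# User bias: each user's mean rating minus the global mean
user_biases <- toy_ratings %>%
  group_by(user_id) %>%
  summarise(user_bias = mean(rating, na.rm = TRUE) - global_mean, .groups = "drop")

# Movie bias: each movie's mean rating minus the global mean
movie_biases <- toy_ratings %>%
  group_by(movie_id) %>%
  summarise(movie_bias = mean(rating, na.rm = TRUE) - global_mean, .groups = "drop")

print(user_biases)
print(movie_biases)
```

On the real data the same two pipelines would be applied to movie_ratings in place of the toy data frame.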
To apply the Global Baseline Estimate, we pass five arguments to the prediction function: user_id, movie_id, global_mean, user_biases, and movie_biases. The function looks up the user's bias and the movie's bias in the tables prepared above and adds them to the global mean.
The recommendation function takes a target user and generates the top recommendations by predicting a rating for every movie the user has not seen yet, then sorting the results by predicted rating.
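A prediction function with those five arguments could be sketched as follows. This is an illustration, not the exact function used for the results in this report: the name `predict_rating`, the bias column names, and the fallback to a bias of 0 for an unknown user or movie are all my assumptions, and the bias tables are hard-coded toy values.

```r
# Hypothetical bias tables (toy values, not taken from the database)
global_mean  <- 3.5
user_biases  <- data.frame(user_id = c(1, 2, 3), user_bias = c(0.5, -0.5, 0))
movie_biases <- data.frame(movie_id = c(1, 2, 3), movie_bias = c(1, 0, -1))

# Global Baseline Estimate: b_ui = mu + b_u + b_i
predict_rating <- function(user_id, movie_id, global_mean, user_biases, movie_biases) {
  b_u <- user_biases$user_bias[user_biases$user_id == user_id]
  b_i <- movie_biases$movie_bias[movie_biases$movie_id == movie_id]
  # Assumption: an unknown user or movie contributes no bias
  if (length(b_u) == 0) b_u <- 0
  if (length(b_i) == 0) b_i <- 0
  global_mean + b_u + b_i
}

# User 1 (bias +0.5) on movie 3 (bias -1): 3.5 + 0.5 - 1 = 3
print(predict_rating(1, 3, global_mean, user_biases, movie_biases))
```

The result is not clamped to the rating scale here, so predictions for very generous users on very popular movies can exceed the maximum rating.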
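The recommendation step can be sketched in base R as below. The function name `recommend_movies`, its `top_n` argument, and the toy ratings are hypothetical; the sketch also assumes that a movie with an NA rating counts as unseen.

```r
# Hypothetical toy ratings (not the real database rows)
ratings <- data.frame(
  user_id  = c(1, 1, 2, 2, 2, 3, 3, 3),
  movie_id = c(1, 2, 1, 3, 4, 2, 3, 4),
  rating   = c(5, 3, 4, 2, 4, 4, 3, 3)
)

global_mean <- mean(ratings$rating, na.rm = TRUE)
user_means  <- tapply(ratings$rating, ratings$user_id, mean, na.rm = TRUE)
movie_means <- tapply(ratings$rating, ratings$movie_id, mean, na.rm = TRUE)
user_biases <- data.frame(user_id = as.numeric(names(user_means)),
                          user_bias = as.numeric(user_means) - global_mean)
movie_biases <- data.frame(movie_id = as.numeric(names(movie_means)),
                           movie_bias = as.numeric(movie_means) - global_mean)

recommend_movies <- function(target_user, ratings, global_mean,
                             user_biases, movie_biases, top_n = 2) {
  # Movies the user has already rated (NA ratings are treated as unseen)
  seen <- ratings$movie_id[ratings$user_id == target_user & !is.na(ratings$rating)]
  unseen <- setdiff(unique(ratings$movie_id), seen)
  if (length(unseen) == 0) {
    return(data.frame(movie_id = numeric(0), predicted_rating = numeric(0)))
  }
  # Look up biases; unknown users or movies fall back to a bias of 0
  b_u <- user_biases$user_bias[match(target_user, user_biases$user_id)]
  b_i <- movie_biases$movie_bias[match(unseen, movie_biases$movie_id)]
  if (is.na(b_u)) b_u <- 0
  b_i[is.na(b_i)] <- 0
  preds <- data.frame(movie_id = unseen,
                      predicted_rating = global_mean + b_u + b_i)
  # Highest predicted rating first, then keep the top_n rows
  preds <- preds[order(-preds$predicted_rating), , drop = FALSE]
  head(preds, top_n)
}

rec <- recommend_movies(1, ratings, global_mean, user_biases, movie_biases)
print(rec)
```

Because order() is stable, movies with equal predicted ratings keep their original order, which is one way ties in a top-2 list can end up resolved.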
----------------------------------------
USER 1 RECOMMENDATIONS
----------------------------------------
1. Movie 4 (predicted rating: 4.70)
2. Movie 5 (predicted rating: 4.20)
----------------------------------------
USER 2 RECOMMENDATIONS
----------------------------------------
1. Movie 3 (predicted rating: 3.90)
2. Movie 5 (predicted rating: 3.90)
----------------------------------------
USER 3 RECOMMENDATIONS
----------------------------------------
1. Movie 4 (predicted rating: 4.70)
2. Movie 1 (predicted rating: 4.20)
----------------------------------------
USER 4 RECOMMENDATIONS
----------------------------------------
1. Movie 5 (predicted rating: 4.20)
2. Movie 2 (predicted rating: 3.90)
----------------------------------------
USER 5 RECOMMENDATIONS
----------------------------------------
1. Movie 4 (predicted rating: 4.00)
2. Movie 1 (predicted rating: 3.50)
dbDisconnect(con)
Conclusion
In summary, we used the Global Baseline Estimate to predict user ratings and provide movie recommendations for each user. The implementation combines the global mean with the user bias and the movie bias, and a movie is recommended to a user only if that user has not already rated it.
As an afterthought, we could evaluate prediction performance by withholding some already rated movies, predicting their ratings, and comparing the results with the actual ratings given by the users. I have not done this yet for this project because the data set is relatively small and is probably insufficient to reflect the real performance of the model.
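For reference, a hold-out evaluation of that kind could be sketched as below. Nothing here was run on the project data: the ratings are made up, the two held-out rows are picked by hand, and RMSE is my choice of error metric rather than something specified in the assignment.

```r
# Hypothetical toy ratings (not the real database rows)
ratings <- data.frame(
  user_id  = c(1, 1, 2, 2, 2, 3, 3, 3),
  movie_id = c(1, 2, 1, 3, 4, 2, 3, 4),
  rating   = c(5, 3, 4, 2, 4, 4, 3, 3)
)

# Withhold two known ratings (rows chosen by hand so the example is reproducible)
test_rows <- c(2, 8)
train <- ratings[-test_rows, ]
test  <- ratings[test_rows, ]

# Fit the baseline on the training rows only
mu         <- mean(train$rating, na.rm = TRUE)
user_bias  <- tapply(train$rating, train$user_id, mean, na.rm = TRUE) - mu
movie_bias <- tapply(train$rating, train$movie_id, mean, na.rm = TRUE) - mu

# Look up biases for the withheld pairs; unseen users or movies get a bias of 0
b_u <- as.numeric(user_bias)[match(test$user_id, names(user_bias))]
b_i <- as.numeric(movie_bias)[match(test$movie_id, names(movie_bias))]
b_u[is.na(b_u)] <- 0
b_i[is.na(b_i)] <- 0

predicted <- mu + b_u + b_i

# Root mean squared error between predicted and actual withheld ratings
rmse <- sqrt(mean((predicted - test$rating)^2))
cat("RMSE on withheld ratings:", rmse, "\n")
```

With so few ratings per user, removing even one row shifts the biases noticeably, which illustrates why the result on a data set this small would say little about real performance.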