3A Global Baseline Estimate

Planned Approach

I will complete this project in four main steps. First, I will review and clean the dataset to make sure the user names, movie titles, and ratings are formatted correctly. I will check for missing values and confirm that ratings are between 1 and 5. Second, I will calculate the overall average rating across all users and movies. This value serves as the starting point for all predictions. Third, I will calculate user bias and movie bias. User bias measures whether a person tends to rate movies higher or lower than average. Movie bias measures whether a movie generally receives higher or lower ratings than average. Finally, I will use these values to predict ratings for movies that a user has not yet rated. The movies with the highest predicted ratings will be recommended.

To improve the model, I could test different regularization values to stabilize bias estimates. I could also compare the results to a simple average-rating recommendation method. In the future, this model could be expanded into a more advanced collaborative filtering system with more users and more ratings.
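
The prediction step can be written as a single formula. The notation below is chosen here for illustration and is not taken from the assignment: \mu is the overall average rating, b_u is the bias of user u, b_i is the bias of movie i, and \bar{r}_u and \bar{r}_i are that user's and that movie's average ratings.

```latex
\hat{r}_{ui} = \mu + b_u + b_i,
\qquad b_u = \bar{r}_u - \mu,
\qquad b_i = \bar{r}_i - \mu
```

As a made-up example, with \mu = 3.5, a generous user with b_u = +0.5, and a weaker movie with b_i = -0.3, the predicted rating would be 3.5 + 0.5 - 0.3 = 3.7.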

Anticipated Data Challenges

One challenge is that not every user rated every movie, which creates missing data. This can make predictions less stable, especially if a movie has very few ratings. Another challenge is the small size of the dataset. With limited ratings, one extreme score can strongly influence the results. There may also be differences in how users interpret the rating scale. Some users may rate generously, while others are stricter. The user bias calculation helps adjust for this.
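
One common way to limit the influence of a single extreme score is to shrink each bias toward zero when it rests on only a few ratings. The sketch below uses made-up toy ratings, and the shrinkage strength `lambda` is an assumed value that would need tuning. It illustrates the idea and is not the method used in this report.

```r
library(dplyr)

# Toy ratings (made-up values, not the project data)
toy <- data.frame(
  Name   = c("A", "B", "C", "A", "B"),
  Title  = c("Film X", "Film X", "Film X", "Film Y", "Film Z"),
  Rating = c(4, 5, 3, 1, 4)
)

mu_toy <- mean(toy$Rating)  # global mean of the toy data
lambda <- 2                 # assumed shrinkage strength; tune in practice

movie_shrunk <- toy %>%
  group_by(Title) %>%
  summarise(
    n_ratings  = n(),
    # unregularized effect: movie average minus global mean
    raw_effect = mean(Rating) - mu_toy,
    # dividing by (n + lambda) pulls sparsely rated movies toward zero
    reg_effect = sum(Rating - mu_toy) / (n_ratings + lambda),
    .groups = "drop"
  )

movie_shrunk
```

In this toy data, Film Y has a single low rating, so its effect shrinks the most, while Film X, with three ratings, moves proportionally less.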

Introduction

For this assignment, I used the movie ratings dataset collected in last week’s SQL and R project. The dataset includes user ratings for multiple movies and was originally stored in a relational database before being analyzed in R. The data can be accessed here:

https://raw.githubusercontent.com/bb2955/607-week2/main/Week_2_SQL_and_R.csv

# Install tidyverse only if it is missing, so the report does not reinstall it on every run
if (!requireNamespace("tidyverse", quietly = TRUE)) install.packages("tidyverse")

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.1     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
## ✔ purrr     1.2.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

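# Raw CSV exported from last week's SQL and R project
data_url <- "https://raw.githubusercontent.com/bb2955/607-week2/main/Week_2_SQL_and_R.csv"

# Read the ratings (columns used below: Name, Title, Rating)
ratings <- readr::read_csv(data_url, show_col_types = FALSE)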
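
The cleaning step from the plan (checking for missing values and confirming that ratings fall between 1 and 5) could be sketched as below. `check_ratings()` is a hypothetical helper written for this illustration, and the toy data frame uses made-up values. The only assumption about the real data is that it has the columns `Name`, `Title`, and `Rating`.

```r
library(dplyr)

# Hypothetical helper: count basic data-quality problems in a ratings table
# that has the columns Name, Title, and Rating.
check_ratings <- function(df, min_rating = 1, max_rating = 5) {
  df %>%
    summarise(
      n_rows         = n(),
      missing_name   = sum(is.na(Name)),
      missing_title  = sum(is.na(Title)),
      missing_rating = sum(is.na(Rating)),
      # NA ratings are counted above, so they are ignored here
      out_of_range   = sum(Rating < min_rating | Rating > max_rating, na.rm = TRUE)
    )
}

# Toy input (made-up values) with one missing and one out-of-range rating
toy <- data.frame(
  Name   = c("A", "A", "B", "B"),
  Title  = c("Film X", "Film Y", "Film X", "Film Y"),
  Rating = c(5, NA, 7, 3)
)

check_result <- check_ratings(toy)
check_result
```

On the project data the same helper could be called as `check_ratings(ratings)`. Any nonzero count in `out_of_range` would point to rows worth inspecting before computing averages.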

# Global mean rating: the starting point for every prediction
mu <- mean(ratings$Rating, na.rm = TRUE)
mu

## [1] 4.307692

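# User effect: how far each user's average rating sits above or below the global mean
user_effects <- ratings %>%
  group_by(Name) %>%
  summarise(
    user_avg = mean(Rating, na.rm = TRUE),
    user_effect = user_avg - mu,
    .groups = "drop"
  )

# Movie effect: how far each movie's average rating sits above or below the global mean
movie_effects <- ratings %>%
  group_by(Title) %>%
  summarise(
    movie_avg = mean(Rating, na.rm = TRUE),
    movie_effect = movie_avg - mu,
    .groups = "drop"
  )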

# Build every user-movie combination, including pairs with no rating yet
all_users  <- unique(ratings$Name)
all_movies <- unique(ratings$Title)

full_grid <- expand_grid(
  Name = all_users,
  Title = all_movies
)

# Attach both effects and compute the baseline prediction for every pair
predictions_full <- full_grid %>%
  left_join(user_effects %>% select(Name, user_effect), by = "Name") %>%
  left_join(movie_effects %>% select(Title, movie_effect), by = "Title") %>%
  mutate(
    gbe_prediction = mu + user_effect + movie_effect
  )

# Add the observed ratings; Rating stays NA for pairs the user has not rated
predictions_full <- predictions_full %>%
  left_join(
    ratings %>% select(Name, Title, Rating),
    by = c("Name", "Title")
  )

# Return the top n unrated movies for one user, ranked by predicted rating
recommend_for_user <- function(user_name, n = 3) {
  predictions_full %>%
    filter(Name == user_name) %>%
    filter(is.na(Rating)) %>%      # keep only movies this user has not rated
    arrange(desc(gbe_prediction)) %>%
    select(Name, Title, gbe_prediction) %>%
    slice_head(n = n)
}

# Collect up to three recommendations per user into one table
# (all_users was defined above when building the grid)
all_recommendations <- map_dfr(all_users, ~recommend_for_user(.x, 3))

all_recommendations

## # A tibble: 12 × 3
##    Name   Title                       gbe_prediction
##    <chr>  <chr>                                <dbl>
##  1 Alex   Taylor Swift: The Eras Tour           5.03
##  2 Alex   Iron Lung                             3.36
##  3 Jordan Barbie                                4.36
##  4 Jordan The Muppets Movie                     4.03
##  5 Sam    Taylor Swift: The Eras Tour           4.86
##  6 Sam    Zootopia 2                            4.19
##  7 Sam    Iron Lung                             3.19
##  8 Taylor Barbie                                3.86
##  9 Taylor The Muppets Movie                     3.53
## 10 Taylor Iron Lung                             2.19
## 11 Chris  Zootopia 2                            4.36
## 12 Chris  Iron Lung                             3.36
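
The first row above (5.03) exceeds the 1 to 5 rating scale, which an additive baseline allows. If bounded values are wanted for display, one option is to clip the predictions. `clamp_rating()` below is a hypothetical helper written for this illustration, not part of the analysis above.

```r
# Hypothetical helper: clip predictions to the valid rating scale
clamp_rating <- function(x, lower = 1, upper = 5) {
  pmin(pmax(x, lower), upper)
}

# Two values from the table above plus a made-up low value
clamped <- clamp_rating(c(5.03, 3.36, 0.8))
clamped
```

Clipping can create ties at the ends of the scale, so it is probably safer to rank on the unclipped `gbe_prediction` and clip only the values shown to the reader.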

Conclusion

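This project implemented a Global Baseline Estimate recommender system that combines the overall average rating (about 4.31) with each user's rating tendency and each movie's average deviation from that mean. The model produced plausible recommendations by adjusting predictions for these systematic rating patterns. Because the estimate is a simple sum, some predictions fall outside the 1 to 5 scale (for example, 5.03 for Alex and Taylor Swift: The Eras Tour), so the values are best read as a ranking rather than as literal ratings. Future improvements could include adding regularization, testing prediction accuracy with RMSE, or expanding the dataset to improve the stability and reliability of the recommendations.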
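
As a rough sketch of the RMSE check mentioned above, the code below scores the baseline against a predict-the-global-mean comparison. It uses made-up toy ratings and evaluates on the same rows used to fit the effects, so it only demonstrates the calculation. A fair accuracy test would hold out some ratings before fitting. The `rmse()` helper and all variable names are illustrative.

```r
library(dplyr)

# Toy ratings (made-up values); the project data has the same three columns
toy <- data.frame(
  Name   = c("A", "A", "B", "B", "C"),
  Title  = c("Film X", "Film Y", "Film X", "Film Z", "Film Y"),
  Rating = c(5, 3, 4, 2, 4)
)

mu_toy <- mean(toy$Rating)  # global mean of the toy data

scored <- toy %>%
  group_by(Name) %>%
  mutate(user_effect = mean(Rating) - mu_toy) %>%    # per-user deviation
  group_by(Title) %>%
  mutate(movie_effect = mean(Rating) - mu_toy) %>%   # per-movie deviation
  ungroup() %>%
  mutate(
    gbe_prediction  = mu_toy + user_effect + movie_effect,
    mean_prediction = mu_toy   # comparison model: always predict the global mean
  )

# Root mean squared error between observed and predicted ratings
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

rmse_gbe  <- rmse(scored$Rating, scored$gbe_prediction)
rmse_mean <- rmse(scored$Rating, scored$mean_prediction)

c(gbe = rmse_gbe, global_mean = rmse_mean)
```

On this toy data the baseline has the lower in-sample error, which is expected because the effects were fit to the same ratings.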