Approach

For this assignment, I will use the MovieLens 100k dataset, which is straightforward to manipulate and provides an excellent opportunity to practice data handling and analysis. The dataset will be downloaded and stored in my GitHub repository, allowing me to access the raw data directly.

The workflow for this analysis is as follows:

  1. Data Loading and Preparation:
    I will load the dataset into R with the readr library, then use dplyr (and tidyr where needed) to reformat the data to match the requirements of this assignment.

  2. Global Baseline Computation:

    • Calculate the global mean rating.
    • Compute user bias and movie bias.

  3. Prediction of Ratings:
    Using the global baseline, I will predict missing ratings for recommendation purposes.

library(dplyr)
library(tidyr)
library(readr)

# Load the dataset from GitHub
url <- "https://raw.githubusercontent.com/japhet125/global-baseline-assign/refs/heads/main/u.data"

ratings <- read_delim(
  url,
  delim = "\t",
  col_names = c("userId", "movieId", "rating", "timestamp")
)

# Keep only the relevant columns
ratings <- ratings[, c("userId", "movieId", "rating")]

# Preview the dataset
head(ratings)

Development

The workflow for this analysis consists of the following steps:

  1. Data Loading and Preparation

The MovieLens 100K dataset is loaded directly from GitHub using the readr package. The dataset includes user IDs, movie IDs, ratings, and timestamps. Since the timestamp is not required for the Global Baseline model, it is removed during preprocessing.

  2. Global Baseline Computation

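# Global mean rating across all users and movies
gm <- mean(ratings$rating, na.rm = TRUE)
gm

# User bias: each user's average rating minus the global mean
users_bias <- ratings %>%
  group_by(userId) %>%
  summarise(
    user_avg = mean(rating),
    u_b = user_avg - gm
  )
users_bias

# Movie bias: each movie's average rating minus the global mean
movies_bias <- ratings %>%
  group_by(movieId) %>%
  summarise(
    movie_avg = mean(rating),
    m_b = movie_avg - gm
  )
movies_bias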

# Attach both biases to every observed rating and compute the baseline prediction
ratings_pred <- ratings %>%
  left_join(users_bias, by = "userId") %>%
  left_join(movies_bias, by = "movieId") %>%
  mutate(
    predicted_rating = gm + u_b + m_b
  )

ratings_pred

all_movies <- unique(ratings$movieId)

user_id <- 1

rated_movies <- ratings %>%
  filter(userId == user_id) %>%
  pull(movieId)

unrated_movies <- setdiff(all_movies, rated_movies)

unrated_movies

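# Predict ratings for the movies this user has not rated yet.
# cross_join() attaches the single user-bias row to every candidate movie;
# it requires dplyr >= 1.1.0 and replaces the deprecated by = character().
recommendation <- movies_bias %>%
  filter(movieId %in% unrated_movies) %>%
  cross_join(users_bias %>% filter(userId == user_id)) %>%
  mutate(
    predicted_rating = gm + u_b + m_b
  ) %>%
  arrange(desc(predicted_rating))

# Top 5 recommendations
head(recommendation, 5)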

The Global Baseline Estimate is computed using:

    r = gm + u_b + m_b

where:

    • gm is the global mean rating
    • u_b is the user's bias (the user's average rating minus gm)
    • m_b is the movie's bias (the movie's average rating minus gm)

This step consists of:

    • Calculating the global mean rating
    • Computing user-specific rating deviations (user bias)
    • Computing movie-specific rating deviations (movie bias)
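
To make the arithmetic concrete, here is a minimal sketch on a made-up five-row data set. The values and the `toy_*` names are purely illustrative assumptions and do not come from MovieLens; the point is only to show how the three terms combine for a user/movie pair that has no observed rating.

```r
library(dplyr)

# Toy ratings (invented values, not taken from MovieLens)
toy <- tibble(
  userId  = c(1, 1, 2, 2, 3),
  movieId = c(10, 20, 10, 30, 20),
  rating  = c(5, 3, 4, 2, 1)
)

# Global mean: (5 + 3 + 4 + 2 + 1) / 5 = 3
toy_gm <- mean(toy$rating)

# Per-user and per-movie deviations from the global mean
toy_users <- toy %>%
  group_by(userId) %>%
  summarise(u_b = mean(rating) - toy_gm)

toy_movies <- toy %>%
  group_by(movieId) %>%
  summarise(m_b = mean(rating) - toy_gm)

# Predict user 3's rating for movie 10, a pair absent from the toy data
toy_u_b <- toy_users$u_b[toy_users$userId == 3]       # 1 - 3 = -2
toy_m_b <- toy_movies$m_b[toy_movies$movieId == 10]   # 4.5 - 3 = 1.5
toy_pred <- toy_gm + toy_u_b + toy_m_b                # 3 - 2 + 1.5 = 2.5
toy_pred
```

User 3 rates well below average and movie 10 is rated well above average, so the two biases partly cancel and the estimate lands at 2.5.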

  3. Rating Prediction

Predicted ratings are generated by combining the global mean with user and movie biases. These predictions can be used to recommend unseen movies to users.
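
As one possible way to package this step, the sketch below wraps the recommendation logic in a reusable function. The name `recommend_top_n` and its arguments are hypothetical, and the clamping of predictions to the 1 to 5 rating scale is an added assumption rather than part of the analysis above: the raw sum gm + u_b + m_b can fall outside the scale for users or movies with extreme averages.

```r
library(dplyr)

# Hypothetical helper: top-n unseen movies for one user under the baseline model.
# Expects a data frame with columns userId, movieId and rating.
recommend_top_n <- function(ratings, user_id, n = 5) {
  gm <- mean(ratings$rating)

  # Scalar bias for the requested user
  u_b <- mean(ratings$rating[ratings$userId == user_id]) - gm

  # Movies the user has already rated
  seen <- ratings$movieId[ratings$userId == user_id]

  ratings %>%
    group_by(movieId) %>%
    summarise(m_b = mean(rating) - gm) %>%
    filter(!movieId %in% seen) %>%
    mutate(
      # Assumption: clamp to the 1-5 scale, since the raw sum can leave it
      predicted_rating = pmin(pmax(gm + u_b + m_b, 1), 5)
    ) %>%
    arrange(desc(predicted_rating)) %>%
    head(n)
}

# Usage on invented toy data (not MovieLens values)
toy <- tibble(
  userId  = c(1, 1, 2, 2, 3),
  movieId = c(10, 20, 10, 30, 20),
  rating  = c(5, 3, 4, 2, 1)
)
top_for_user_3 <- recommend_top_n(toy, user_id = 3, n = 2)
top_for_user_3
```

Because the user bias is a single constant added to every movie, it shifts all predictions for that user equally and does not change their order. A user with no ratings at all would yield an undefined bias, so a real implementation would need a fallback such as treating the bias as zero.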

Conclusion

This assignment provided valuable insight into the foundations of recommender systems. The Global Baseline Estimate shows that reasonable rating predictions can be obtained from simple statistical adjustments: a global mean plus one offset per user and one per movie.

Key takeaways include:

    • Understanding the importance of the global mean rating
    • Recognizing how user bias reflects individual rating tendencies
    • Observing how movie bias captures overall popularity or quality perception
    • Applying these components to generate rating predictions

Although the model is non-personalized and relatively simple compared to collaborative filtering or matrix factorization methods, it serves as a strong baseline and foundational step in recommendation system development.