For this assignment, I will use the MovieLens 100k dataset, which is straightforward to manipulate and provides an excellent opportunity to practice data handling and analysis. The dataset will be downloaded and stored in my GitHub repository, allowing me to access the raw data directly.
The workflow for this analysis is as follows:
Data Loading and Preparation:
I will load the dataset into R with readr, then use the dplyr and
tidyr libraries to reformat the data to
match the requirements of this assignment.
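Global Baseline Computation:
I will compute the global mean rating, then each user's and each
movie's deviation from it (the user and movie biases).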
Prediction of Ratings:
Using the global baseline, I will predict missing ratings for
recommendation purposes.
library(dplyr)
library(tidyr)
library(readr)
# Load the dataset from GitHub
url <- "https://raw.githubusercontent.com/japhet125/global-baseline-assign/refs/heads/main/u.data"
ratings <- read_delim(
  url,
  delim = "\t",
  col_names = c("userId", "movieId", "rating", "timestamp")
)
# Keep only the relevant columns
ratings <- ratings[, c("userId", "movieId", "rating")]
# Preview the dataset
head(ratings)
Development
The workflow for this analysis consists of the following steps:
The MovieLens 100K dataset is loaded directly from GitHub using the readr package. The dataset includes user IDs, movie IDs, ratings, and timestamps. Since the timestamp is not required for the Global Baseline model, it is removed during preprocessing.
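As a minimal sketch of this loading step, the snippet below parses a few made-up tab-separated rows in the same four-column layout instead of downloading the file. The sample rows and the name `sample_ratings` are assumptions for illustration only. Wrapping the string in `I()` tells readr to treat it as literal data, which assumes readr 2.0 or later.

```r
library(readr)
library(dplyr)

# Three made-up rows in the u.data layout: user, movie, rating, timestamp
raw_text <- "1\t10\t4\t881250949\n1\t20\t2\t891717742\n2\t10\t5\t878887116\n"

sample_ratings <- read_delim(
  I(raw_text),                 # literal data, not a file path
  delim = "\t",
  col_names = c("userId", "movieId", "rating", "timestamp"),
  col_types = "iidi"           # integer ids, double rating, integer timestamp
) %>%
  select(-timestamp)           # the timestamp is not needed for the baseline model

sample_ratings
```

Declaring `col_types` up front is optional, but it avoids the column-guessing message and makes a malformed file fail early.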
# Global mean rating across all users and movies
gm <- mean(ratings$rating, na.rm = TRUE)
gm
# User bias: how far each user's average rating sits from the global mean
users_bias <- ratings %>%
  group_by(userId) %>%
  summarise(
    user_avg = mean(rating),
    u_b = user_avg - gm
  )
users_bias
# Movie bias: how far each movie's average rating sits from the global mean
movies_bias <- ratings %>%
  group_by(movieId) %>%
  summarise(
    movie_avg = mean(rating),
    m_b = movie_avg - gm
  )
movies_bias
# Attach both biases to every observed rating and compute the baseline prediction
ratings_pred <- ratings %>%
  left_join(users_bias, by = "userId") %>%
  left_join(movies_bias, by = "movieId") %>%
  mutate(
    predicted_rating = gm + u_b + m_b
  )
ratings_pred
# Find the movies that the chosen user has not rated yet
all_movies <- unique(ratings$movieId)
user_id <- 1
rated_movies <- ratings %>%
  filter(userId == user_id) %>%
  pull(movieId)
unrated_movies <- setdiff(all_movies, rated_movies)
unrated_movies
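# Predict ratings for the unrated movies and rank them.
# cross_join() attaches the single user-bias row to every movie row; it replaces
# the deprecated left_join(by = character()) and requires dplyr 1.1.0 or later.
recommendation <- movies_bias %>%
  filter(movieId %in% unrated_movies) %>%
  cross_join(users_bias %>% filter(userId == user_id)) %>%
  mutate(
    predicted_rating = gm + u_b + m_b
  ) %>%
  arrange(desc(predicted_rating))
# Top five recommendations for this user
head(recommendation, 5)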
The Global Baseline Estimate is computed using:
r = gm + u_b + m_b
where:
gm is the global mean rating
u_b is the user bias (the user's average rating minus gm)
m_b is the movie bias (the movie's average rating minus gm)
Computing the estimate involves three steps:
Calculating the global mean rating
Computing user-specific rating deviations (user bias)
Computing movie-specific rating deviations (movie bias)
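To make the formula concrete, here is a small worked sketch on a made-up set of five ratings. The data and the `toy` names are assumptions and are not taken from MovieLens. It computes the global mean and both biases as averages minus the global mean, then predicts the one rating that is missing: user 2 on movie 30.

```r
library(dplyr)

# Made-up ratings: user 1 rated three movies, user 2 rated two
toy <- tibble(
  userId  = c(1, 1, 1, 2, 2),
  movieId = c(10, 20, 30, 10, 20),
  rating  = c(5, 3, 4, 4, 2)
)

toy_gm <- mean(toy$rating)                  # 18 / 5 = 3.6

toy_users <- toy %>%
  group_by(userId) %>%
  summarise(u_b = mean(rating) - toy_gm)    # user 1: 0.4, user 2: -0.6

toy_movies <- toy %>%
  group_by(movieId) %>%
  summarise(m_b = mean(rating) - toy_gm)    # movie 10: 0.9, 20: -1.1, 30: 0.4

u_b_2  <- toy_users$u_b[toy_users$userId == 2]
m_b_30 <- toy_movies$m_b[toy_movies$movieId == 30]

# Baseline estimate for user 2 on movie 30: 3.6 - 0.6 + 0.4 = 3.4
toy_pred <- toy_gm + u_b_2 + m_b_30
toy_pred
```

User 2 rates below average overall, while movie 30 is rated above average, so the two adjustments partly cancel and the estimate lands slightly below the global mean.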
Predicted ratings are generated by combining the global mean with user and movie biases. These predictions can be used to recommend unseen movies to users.
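One way to package that recommendation step is sketched below. The helper `recommend_top_n` and its arguments are hypothetical names, not part of the original analysis. It expects a ratings table with `userId`, `movieId`, and `rating` columns. Because the raw sum can fall outside the 1 to 5 rating scale, the sketch also clamps predictions to that range, which is an added assumption rather than something the analysis above does.

```r
library(dplyr)

# Hypothetical helper: top-n unseen movies for one user under the global baseline
recommend_top_n <- function(ratings, user_id, n = 5) {
  gm  <- mean(ratings$rating)
  u_b <- mean(ratings$rating[ratings$userId == user_id]) - gm

  # Movies this user has already rated
  seen <- ratings %>%
    filter(userId == user_id) %>%
    distinct(movieId)

  ratings %>%
    group_by(movieId) %>%
    summarise(m_b = mean(rating) - gm) %>%
    anti_join(seen, by = "movieId") %>%   # drop movies the user already rated
    arrange(desc(m_b)) %>%                # rank on the bias so clamping cannot create ties
    mutate(predicted_rating = pmin(pmax(gm + u_b + m_b, 1), 5)) %>%
    head(n)
}

# Usage on made-up data: user 2 has not rated movie 30
toy <- tibble(
  userId  = c(1, 1, 1, 2, 2),
  movieId = c(10, 20, 30, 10, 20),
  rating  = c(5, 3, 4, 4, 2)
)
recs <- recommend_top_n(toy, user_id = 2)
recs
```

Using `anti_join()` keeps the filtering inside the pipeline, and ranking on `m_b` before clamping preserves the order of movies whose raw predictions exceed 5.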
This assignment provided valuable insight into the foundations of recommender systems. The Global Baseline Estimate demonstrates how predictive power can be achieved using relatively simple statistical adjustments.
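One way to check that claim is to compare the root mean squared error (RMSE) of the baseline against predicting the global mean alone. The sketch below does so on made-up ratings. The `rmse` helper and the toy data are assumptions. For brevity it scores the same ratings used to fit the biases; a fair evaluation on MovieLens would fit on a training split and score a held-out split.

```r
library(dplyr)

# Hypothetical helper: root mean squared error between actual and predicted values
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Made-up ratings, not taken from MovieLens
toy <- tibble(
  userId  = c(1, 1, 1, 2, 2),
  movieId = c(10, 20, 30, 10, 20),
  rating  = c(5, 3, 4, 4, 2)
)

gm <- mean(toy$rating)

scored <- toy %>%
  group_by(userId) %>%
  mutate(u_b = mean(rating) - gm) %>%    # user bias attached to each row
  group_by(movieId) %>%
  mutate(m_b = mean(rating) - gm) %>%    # movie bias attached to each row
  ungroup() %>%
  mutate(baseline = gm + u_b + m_b)

rmse_mean_only <- rmse(scored$rating, gm)
rmse_baseline  <- rmse(scored$rating, scored$baseline)
c(mean_only = rmse_mean_only, baseline = rmse_baseline)
```

On this toy data the baseline error is lower than the mean-only error, but the size of the gap on real data depends on the split and should be measured rather than assumed.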
Key takeaways include:
Understanding the importance of the global mean rating
Recognizing how user bias reflects individual rating tendencies
Observing how movie bias captures overall popularity or quality perception
Applying these components to generate rating predictions
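Although the model is simple compared to collaborative filtering or matrix factorization methods, and its recommendations are effectively non-personalized (the user bias shifts all of a user's predictions by the same amount, so every user sees the same movie ranking), it serves as a strong baseline and a foundational step in recommendation system development.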