Creating A Dataset
The dataset for this project is a toy dataset of product ratings: ten users rate five products, labeled “A” through “E”. Users and products each have varying numbers of missing reviews, and the rating values themselves are designed to reflect the tendency of people to avoid the bottom range of a rating scale. The scale runs from 0 to 5 in half-step increments. We will use this data to build a simple recommender that generates product recommendations for users.
# Defining a tibble of sample data
df <- tribble(
  ~Name,    ~A,  ~B,  ~C,  ~D,  ~E,
  "Alice",  4.5, 5.0, NA,  4.5, 4.5,
  "Bob",    4.5, 4.5, NA,  3.0, 4.5,
  "Carol",  3.5, NA,  3.5, 2.0, 3.5,
  "David",  NA,  4.0, 3.5, 2.5, 3.5,
  "Emma",   4.5, 5.0, NA,  3.0, 4.0,
  "Frank",  3.5, 3.5, 3.0, NA,  3.0,
  "Grace",  4.5, 4.0, 3.5, NA,  4.5,
  "Henry",  NA,  3.5, 2.5, 2.5, 3.5,
  "Iris",   4.0, NA,  4.0, 3.5, 4.5,
  "Jack",   3.5, 4.0, 3.5, 3.0, 3.5
)
df |>
  kbl(caption = "User Product Ratings") |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
User Product Ratings

| Name  |   A |   B |   C |   D |   E |
|-------|----:|----:|----:|----:|----:|
| Alice | 4.5 | 5.0 |  NA | 4.5 | 4.5 |
| Bob   | 4.5 | 4.5 |  NA | 3.0 | 4.5 |
| Carol | 3.5 |  NA | 3.5 | 2.0 | 3.5 |
| David |  NA | 4.0 | 3.5 | 2.5 | 3.5 |
| Emma  | 4.5 | 5.0 |  NA | 3.0 | 4.0 |
| Frank | 3.5 | 3.5 | 3.0 |  NA | 3.0 |
| Grace | 4.5 | 4.0 | 3.5 |  NA | 4.5 |
| Henry |  NA | 3.5 | 2.5 | 2.5 | 3.5 |
| Iris  | 4.0 |  NA | 4.0 | 3.5 | 4.5 |
| Jack  | 3.5 | 4.0 | 3.5 | 3.0 | 3.5 |
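As a quick sanity check on the design described above, we can summarise the rating range and the number of missing reviews per product. The code below is a sketch that assumes the tidyverse packages used throughout this document are already loaded.

# Sketch: confirm the skewed rating range and per-product missingness
df |>
  pivot_longer(cols = A:E, names_to = "Item", values_to = "Rating") |>
  group_by(Item) |>
  summarise(
    n_missing  = sum(is.na(Rating)),        # missing reviews for this product
    min_rating = min(Rating, na.rm = TRUE), # lowest observed rating
    max_rating = max(Rating, na.rm = TRUE), # highest observed rating
    .groups = "drop"
  )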
When breaking the data into training and testing sets we want to treat each user-item pair as its own row. Splitting at the pair level makes it likely, though not guaranteed, that the split contains data for every user and item; we verify this below after making the split. We exclude the missing values, as they cannot be scored in the test set and provide no predictive value in the training set.
set.seed(8675309)
# Pivoting dataframe longer into person-item pairings to allow for granular sampling
long_df <- df |>
  pivot_longer(cols = A:E,
               names_to = "Item",
               values_to = "Rating") |>
  filter(!is.na(Rating)) # No NAs, no imputation needed for project
# Breaking data into train-test split
train_index <- createDataPartition(y = 1:nrow(long_df),
                                   p = 0.8,
                                   list = FALSE)
train_pairs <- long_df[train_index, ]
test_pairs <- long_df[-train_index, ]
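Because the split is random at the pair level, a quick check (a sketch using base R's setdiff()) confirms that no user or item was left out of the training data:

# Sketch: verify that every user and item appears in the training split
setdiff(unique(long_df$Name), unique(train_pairs$Name)) # character(0) means all users are covered
setdiff(unique(long_df$Item), unique(train_pairs$Item)) # character(0) means all items are covered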
Calculating Averages
The tables below show each user's average rating and each product's average rating. These averages are computed from the training set only.
# Calculating product averages
avg_product <- train_pairs |>
  group_by(Item) |>
  summarise(average = mean(Rating))
# Calculating user averages
avg_user <- train_pairs |>
  group_by(Name) |>
  summarise(average = mean(Rating))
avg_user |>
  kbl(caption = "Average Rating by User") |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Average Rating by User

| Name  |  average |
|-------|---------:|
| Alice | 4.500000 |
| Bob   | 4.125000 |
| Carol | 3.500000 |
| David | 3.333333 |
| Emma  | 4.166667 |
| Frank | 3.166667 |
| Grace | 4.125000 |
| Henry | 2.833333 |
| Iris  | 3.833333 |
| Jack  | 3.500000 |
avg_product |>
  kbl(caption = "Average Rating by Product") |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Average Rating by Product

| Item |  average |
|------|---------:|
| A    | 4.062500 |
| B    | 4.300000 |
| C    | 3.333333 |
| D    | 3.142857 |
| E    | 3.857143 |
Using the raw training-set average as a blanket prediction, we can calculate the RMSE on both the training and test sets to set a reference point for model accuracy. Any model we develop should have a smaller RMSE than this value in order to be useful. Neither RMSE is particularly good: the predictions are off by between half a step and a full step on the rating scale, and the test set performs worse, as would be expected. Given that the dataset was intentionally constructed with an uneven distribution of ratings and no scores below 2.0, a compressed range that a single average should cover fairly easily, the performance is even worse than one would initially assume.
# Calculating RMSE of average score estimates
baseline_rmse_test <- sqrt(mean((test_pairs$Rating - mean(train_pairs$Rating))^2))
baseline_rmse_train <- sqrt(mean((train_pairs$Rating - mean(train_pairs$Rating))^2))
# Creating a dataframe of the RMSE for later comparison
rmse_df <- tribble(
  ~Prediction, ~RMSE,
  "Raw Average Training", baseline_rmse_train,
  "Raw Average Test", baseline_rmse_test
)
rmse_df |>
  kbl(caption = "RMSE of Predicted Ratings") |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
RMSE of Predicted Ratings

| Prediction           |      RMSE |
|----------------------|----------:|
| Raw Average Training | 0.6752206 |
| Raw Average Test     | 0.8277534 |
Incorporating User and Item Bias
To improve performance we can calculate the bias of each user and each item and use them to construct baseline predictors for the different user-item pairs. A bias is how much a user's or item's average rating differs from the average of the entire training set; adding these offsets to the global average adjusts our estimate for each user-item pair.
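In standard recommender notation, with $\mu$ as the training-set mean, the baseline predictor built in the code below can be written as:

$$\hat{r}_{ui} = \mu + b_u + b_i, \qquad b_u = \bar{r}_u - \mu, \qquad b_i = \bar{r}_i - \mu$$

where $\bar{r}_u$ and $\bar{r}_i$ are the average training-set ratings of user u and item i.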
# Calculating user bias scores
user_bias <- avg_user |>
  mutate(user_bias = average - mean(train_pairs$Rating)) |>
  select(Name, user_bias)
user_bias |>
  kbl(caption = "User Bias Values") |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
User Bias Values

| Name  |  user_bias |
|-------|-----------:|
| Alice |  0.7727273 |
| Bob   |  0.3977273 |
| Carol | -0.2272727 |
| David | -0.3939394 |
| Emma  |  0.4393939 |
| Frank | -0.5606061 |
| Grace |  0.3977273 |
| Henry | -0.8939394 |
| Iris  |  0.1060606 |
| Jack  | -0.2272727 |
# Calculating item bias scores
item_bias <- avg_product |>
  mutate(item_bias = average - mean(train_pairs$Rating)) |>
  select(Item, item_bias)
item_bias |>
  kbl(caption = "Item Bias Values") |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Item Bias Values

| Item |  item_bias |
|------|-----------:|
| A    |  0.3352273 |
| B    |  0.5727273 |
| C    | -0.3939394 |
| D    | -0.5844156 |
| E    |  0.1298701 |
We can now generate baseline predictions by adding the item and user bias values to the training-set average for each user-item pair. These baseline estimators are a much better starting point for a recommender, as they capture more of the variance in user and item ratings. Including the user and item bias values in the baseline predictors improves the RMSE of our predictions significantly: on both the training and test sets the RMSE is roughly half that of the raw-average predictions.
# Predicting ratings in the training set using baseline estimators
train_predictions <- train_pairs |>
  left_join(user_bias, by = "Name") |>
  left_join(item_bias, by = "Item") |>
  mutate(predicted_rating = mean(train_pairs$Rating) + user_bias + item_bias)
# Predicting ratings in the testing set using baseline estimators
test_predictions <- test_pairs |>
  left_join(user_bias, by = "Name") |>
  left_join(item_bias, by = "Item") |>
  mutate(predicted_rating = mean(train_pairs$Rating) + user_bias + item_bias)
# Calculating RMSE of the baseline estimator predictions
bias_rmse_train <- sqrt(mean((train_predictions$Rating - train_predictions$predicted_rating)^2))
bias_rmse_test <- sqrt(mean((test_predictions$Rating - test_predictions$predicted_rating)^2))
rmse_df |>
  add_row(Prediction = "Training Baseline Predictor", RMSE = bias_rmse_train) |>
  add_row(Prediction = "Testing Baseline Predictor", RMSE = bias_rmse_test) |>
  kbl(caption = "RMSE of Predicted Ratings") |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
RMSE of Predicted Ratings

| Prediction                  |      RMSE |
|-----------------------------|----------:|
| Raw Average Training        | 0.6752206 |
| Raw Average Test            | 0.8277534 |
| Training Baseline Predictor | 0.3295519 |
| Testing Baseline Predictor  | 0.4246094 |
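Although the focus so far has been prediction accuracy, these same baseline estimates are already enough to generate recommendations: predict each user's unrated products and surface the highest-scoring one. The sketch below is illustrative code, not part of the original pipeline; it reuses the df, user_bias, item_bias, and train_pairs objects defined above.

# Sketch: recommend each user's highest-predicted unrated product using the baseline estimators
recommendations <- df |>
  pivot_longer(cols = A:E, names_to = "Item", values_to = "Rating") |>
  filter(is.na(Rating)) |>              # keep only the unrated user-item pairs
  left_join(user_bias, by = "Name") |>
  left_join(item_bias, by = "Item") |>
  mutate(predicted_rating = mean(train_pairs$Rating) + user_bias + item_bias) |>
  group_by(Name) |>
  slice_max(predicted_rating, n = 1) |> # top predicted product per user
  ungroup() |>
  select(Name, Recommended = Item, predicted_rating)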
Incorporating Regularization
One of the methods we can use to improve the performance of the baseline predictors is regularization. By adding a penalty term when calculating the user and item bias parameters we can reduce overfitting and stabilize the bias values. It is important to allow the user and item regularization values to vary independently of each other when tuning, as one set of biases may require more regularization than the other.
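Concretely, the alternating updates implemented in the function below shrink each bias toward zero by adding the penalty term to the denominator:

$$b_u = \frac{\sum_{i \in I_u} (r_{ui} - \mu - b_i)}{|I_u| + \lambda_{\text{user}}}, \qquad b_i = \frac{\sum_{u \in U_i} (r_{ui} - \mu - b_u)}{|U_i| + \lambda_{\text{item}}}$$

where $I_u$ is the set of items rated by user u and $U_i$ is the set of users who rated item i. With the penalties set to zero the updates simply average the residuals; larger penalties pull biases supported by only a few ratings toward zero.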
train_regularized_bias <- function(train_data, lambda_user = 0.1, lambda_item = 0.1,
                                   max_iter = 50, tol = 1e-6) {
  global_mean <- mean(train_data$Rating)
  # Initialize biases to zero
  user_biases <- train_data |>
    distinct(Name) |>
    mutate(user_bias = 0)
  item_biases <- train_data |>
    distinct(Item) |>
    mutate(item_bias = 0)
  # Iterative updates
  for (iter in 1:max_iter) {
    old_user_biases <- user_biases$user_bias
    old_item_biases <- item_biases$item_bias
    # Update user biases
    user_biases <- train_data |>
      left_join(item_biases, by = "Item") |>
      mutate(residual = Rating - global_mean - item_bias) |>
      group_by(Name) |>
      summarise(
        n_ratings = n(),
        sum_residuals = sum(residual),
        .groups = "drop"
      ) |>
      mutate(user_bias = sum_residuals / (n_ratings + lambda_user))
    # Update item biases
    item_biases <- train_data |>
      left_join(user_biases, by = "Name") |>
      mutate(residual = Rating - global_mean - user_bias) |>
      group_by(Item) |>
      summarise(
        n_ratings = n(),
        sum_residuals = sum(residual),
        .groups = "drop"
      ) |>
      mutate(item_bias = sum_residuals / (n_ratings + lambda_item))
    # Check convergence
    user_change <- sqrt(mean((user_biases$user_bias - old_user_biases)^2, na.rm = TRUE))
    item_change <- sqrt(mean((item_biases$item_bias - old_item_biases)^2, na.rm = TRUE))
    if (user_change < tol && item_change < tol) {
      break
    }
  }
  list(
    global_mean = global_mean,
    user_biases = user_biases,
    item_biases = item_biases,
    lambda_user = lambda_user,
    lambda_item = lambda_item,
    iterations = iter
  )
}
# Function to make predictions
predict_ratings <- function(model, data) {
  data |>
    left_join(model$user_biases, by = "Name") |>
    left_join(model$item_biases, by = "Item") |>
    mutate(
      user_bias = coalesce(user_bias, 0),
      item_bias = coalesce(item_bias, 0),
      predicted_rating = model$global_mean + user_bias + item_bias,
      predicted_rating = pmax(1, pmin(5, predicted_rating)) # Adding floor and ceiling to ratings
    )
}
# Function to calculate RMSE for both training and testing predictions
calculate_rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2, na.rm = TRUE))
}
# A grid of regularization parameters to try
lambda_grid <- expand_grid(
  lambda_user = c(0, 0.01, 0.1, 0.5, 1.0),
  lambda_item = c(0, 0.01, 0.1, 0.5, 1.0)
)
results <- lambda_grid |>
  mutate(
    model = map2(lambda_user, lambda_item, ~{
      train_regularized_bias(train_pairs, .x, .y)
    })
  ) |>
  mutate(
    # Get predictions for train and test
    train_pred = map(model, ~predict_ratings(.x, train_pairs)),
    test_pred = map(model, ~predict_ratings(.x, test_pairs)),
    # Calculate RMSE
    train_rmse = map_dbl(train_pred, ~calculate_rmse(.x$Rating, .x$predicted_rating)),
    test_rmse = map_dbl(test_pred, ~calculate_rmse(.x$Rating, .x$predicted_rating))
  )
Looking at the results of the regularization process, we can see that our predictors actually perform best without any regularization. This is not surprising: the toy dataset constructed for this project is not very sparse and has intentional patterns built in. If a larger or sparser dataset were used, we would expect regularization to have a more meaningful impact on model performance.
results |>
  select(lambda_user, lambda_item, train_rmse, test_rmse) |>
  arrange(test_rmse) |>
  head(10) |>
  kbl(caption = "Top Performing Regularization Values") |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Top Performing Regularization Values

| lambda_user | lambda_item | train_rmse | test_rmse |
|------------:|------------:|-----------:|----------:|
|        0.00 |        0.00 |  0.2715822 | 0.3166312 |
|        0.00 |        0.01 |  0.2715830 | 0.3168834 |
|        0.01 |        0.00 |  0.2715868 | 0.3171637 |
|        0.01 |        0.01 |  0.2715875 | 0.3174135 |
|        0.00 |        0.10 |  0.2716607 | 0.3192452 |
|        0.01 |        0.10 |  0.2716645 | 0.3197502 |
|        0.10 |        0.00 |  0.2720140 | 0.3221532 |
|        0.10 |        0.01 |  0.2720140 | 0.3223800 |
|        0.10 |        0.10 |  0.2720837 | 0.3244982 |
|        0.00 |        0.50 |  0.2732845 | 0.3306648 |
To demonstrate this, we can construct just such a large and sparse dataset. The results in the table below show that regularization now provides a meaningful improvement: every one of the top 10 best-performing parameter pairings uses a non-zero penalty on at least one side. The proper regularization parameters prevent the model from overfitting, but the large jump in RMSE between the training and test sets shows that baseline predictors still tend to perform poorly on sparse datasets. We could continue to address this by adding cross-validation and training across multiple folds of the data; however, with regularization and cross-validation added we would have moved into the territory of linear regression models and would no longer be using global baseline predictors.
# Creates a large and sparse dataset
set.seed(8675309)
create_overfitting_data <- function() {
  set.seed(123)
  n_users <- 80
  n_items <- 60
  # Create users with extreme biases
  user_biases <- rnorm(n_users, 0, 1.2) # Wide range of user biases
  item_biases <- rnorm(n_items, 0, 0.8) # Moderate item biases
  ratings_list <- list()
  for (i in 1:n_users) {
    n_user_ratings <- sample(2:4, 1) # Very few ratings per user
    user_items <- sample(1:n_items, n_user_ratings)
    for (j in user_items) {
      true_rating <- 3.5 + user_biases[i] + item_biases[j] + rnorm(1, 0, 0.4)
      observed_rating <- pmax(1, pmin(5, round(true_rating * 2) / 2))
      ratings_list <- append(ratings_list, list(tibble(
        Name = paste0("User_", i),
        Item = paste0("Item_", j),
        Rating = observed_rating
      )))
    }
  }
  bind_rows(ratings_list)
}
# Call creation function
overfit_df <- create_overfitting_data()
# Make a train-test split
overfit_train_index <- createDataPartition(y = 1:nrow(overfit_df),
                                           p = 0.8,
                                           list = FALSE)
overfit_train_pairs <- overfit_df[overfit_train_index, ]
overfit_test_pairs <- overfit_df[-overfit_train_index, ]
# Train on the overfitting-prone dataset
overfit_results <- lambda_grid |>
  mutate(
    model = map2(lambda_user, lambda_item, ~{
      train_regularized_bias(overfit_train_pairs, .x, .y)
    })
  ) |>
  mutate(
    # Get predictions for train and test
    train_pred = map(model, ~predict_ratings(.x, overfit_train_pairs)),
    test_pred = map(model, ~predict_ratings(.x, overfit_test_pairs)),
    # Calculate RMSE
    train_rmse = map_dbl(train_pred, ~calculate_rmse(.x$Rating, .x$predicted_rating)),
    test_rmse = map_dbl(test_pred, ~calculate_rmse(.x$Rating, .x$predicted_rating))
  )
overfit_results |>
  select(lambda_user, lambda_item, train_rmse, test_rmse) |>
  arrange(test_rmse) |>
  head(10) |>
  kbl(caption = "Top Performing Regularization Values on the Overfitting-Prone Dataset") |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Top Performing Regularization Values on the Overfitting-Prone Dataset

| lambda_user | lambda_item | train_rmse | test_rmse |
|------------:|------------:|-----------:|----------:|
|        0.10 |        0.50 |  0.2442995 | 0.6653811 |
|        0.50 |        0.50 |  0.2969028 | 0.6802588 |
|        0.10 |        0.10 |  0.2080561 | 0.6808387 |
|        0.01 |        0.50 |  0.2378117 | 0.6875802 |
|        0.00 |        0.50 |  0.2374046 | 0.6913835 |
|        0.10 |        1.00 |  0.2826375 | 0.6980052 |
|        0.50 |        0.10 |  0.2753319 | 0.7007383 |
|        0.50 |        1.00 |  0.3251385 | 0.7011162 |
|        0.01 |        1.00 |  0.2783954 | 0.7165237 |
|        0.10 |        0.01 |  0.2022665 | 0.7172480 |
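If we did want to carry the tuned values forward, one simple option is sketched below: extract the best-performing lambda pair from the grid search and refit a single model on the sparse training data. This reuses the overfit_results grid and the functions defined above, and note that selecting lambdas on the same test set used for evaluation is optimistic.

# Sketch: pull the best lambda pair from the grid search and refit one model with it
best_lambdas <- overfit_results |>
  slice_min(test_rmse, n = 1)
final_model <- train_regularized_bias(overfit_train_pairs,
                                      lambda_user = best_lambdas$lambda_user,
                                      lambda_item = best_lambdas$lambda_item)
final_predictions <- predict_ratings(final_model, overfit_test_pairs)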