Creating A Dataset

The dataset for this project will be a toy dataset of product reviews from ten users for five different products labeled “A” through “E”. The users and products have varying numbers of missing reviews, and the rating values themselves are designed to reflect the tendency of people to avoid the bottom of a rating scale. The scale runs from 0 to 5 and allows half steps. We will use this data to build a simple recommender that generates product recommendations for users.

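The code in this section assumes the tidyverse packages, kableExtra, and caret have been attached in an earlier setup chunk (not shown here); a minimal sketch of that setup would be:

# Packages assumed to be attached earlier in the document
library(tidyverse)   # tribble(), pivot_longer(), dplyr verbs, purrr map functions, expand_grid()
library(kableExtra)  # kbl() and kable_styling() for the formatted tables
library(caret)       # createDataPartition() for the train-test splits
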
# Defining a tibble of sample data
df <- tribble(
  ~Name, ~A, ~B, ~C, ~D, ~E,
  "Alice", 4.5, 5.0, NA, 4.5, 4.5,
  "Bob", 4.5, 4.5, NA, 3.0, 4.5,
  "Carol", 3.5, NA, 3.5, 2.0, 3.5,
  "David", NA, 4.0, 3.5, 2.5, 3.5,
  "Emma", 4.5, 5.0, NA, 3.0, 4.0,
  "Frank", 3.5, 3.5, 3.0, NA, 3.0,
  "Grace", 4.5, 4.0, 3.5, NA, 4.5,
  "Henry", NA, 3.5, 2.5, 2.5, 3.5,
  "Iris", 4.0, NA, 4.0, 3.5, 4.5,
  "Jack", 3.5, 4.0,  3.5,  3.0, 3.5
)

df |>
  kbl(caption = "User Product Ratings") |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
User Product Ratings
Name A B C D E
Alice 4.5 5.0 NA 4.5 4.5
Bob 4.5 4.5 NA 3.0 4.5
Carol 3.5 NA 3.5 2.0 3.5
David NA 4.0 3.5 2.5 3.5
Emma 4.5 5.0 NA 3.0 4.0
Frank 3.5 3.5 3.0 NA 3.0
Grace 4.5 4.0 3.5 NA 4.5
Henry NA 3.5 2.5 2.5 3.5
Iris 4.0 NA 4.0 3.5 4.5
Jack 3.5 4.0 3.5 3.0 3.5

When breaking the data into training and testing sets, we treat each user-item pair as its own row, so the split is taken over individual ratings rather than over whole users or items; this makes it very likely that both sets contain data for every user and item (a quick check is sketched after the split below). We exclude the missing values, as they make no sense in the test set and provide no predictive value in the training set.

set.seed(8675309)

# Pivoting dataframe longer into person-item pairings to allow for granular sampling
long_df <- df |>
  pivot_longer(cols = A:E,
               names_to = "Item",
               values_to = "Rating") |>
  filter(!is.na(Rating)) # No NAs, no imputation needed for project

# Breaking data into train-test split
train_index <- createDataPartition(y = 1:nrow(long_df),
                                   p = 0.8,
                                   list = FALSE)

train_pairs <- long_df[train_index,]
test_pairs <- long_df[-train_index,]

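Because createDataPartition() here samples individual rating pairs rather than stratifying by user or item, it is worth confirming that no user or item dropped out of the training data entirely. A quick check along these lines (not part of the original code) would be:

# Sanity check: every user and item should appear at least once in the training pairs
setdiff(long_df$Name, train_pairs$Name) # character(0) means all users are covered
setdiff(long_df$Item, train_pairs$Item) # character(0) means all items are covered
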
Calculating Averages

We can see each user's average rating as well as each item's average rating in the tables below. These averages are calculated from the training set only.

# Calculating product averages
avg_product <- train_pairs |>
  group_by(Item) |>
  summarise(average = mean(Rating))

# Calculating user averages
avg_user <- train_pairs |>
  group_by(Name) |>
  summarise(average = mean(Rating))

avg_user |>
  kbl(caption = "Average Rating by User") |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Average Rating by User
Name average
Alice 4.500000
Bob 4.125000
Carol 3.500000
David 3.333333
Emma 4.166667
Frank 3.166667
Grace 4.125000
Henry 2.833333
Iris 3.833333
Jack 3.500000

avg_product |>
  kbl(caption = "Average Rating by Product") |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Average Rating by Product
Item average
A 4.062500
B 4.300000
C 3.333333
D 3.142857
E 3.857143

Using the raw training-set average as a blanket prediction, we can calculate the RMSE on both the training and test sets to set a reference point for model accuracy. Any model we develop should have a smaller RMSE than this value in order to be useful. The RMSE on both sets is not particularly good, off by between a half step and a full step on the rating scale, and the test set performs worse, as expected. Given that the dataset was intentionally constructed with an uneven distribution of ratings and no scores below 2.0, the performance is even worse than it first appears: an error of this size is a sizeable fraction of the 2.0-5.0 range the ratings actually occupy.

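For reference, the error metric used throughout is the root mean squared error over the rated pairs:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{(u,i)}\left(r_{ui} - \hat{r}_{ui}\right)^{2}}$$

where r_ui is the observed rating user u gave product i, the hatted term is the corresponding prediction (here, simply the training-set mean), and N is the number of rated pairs in the set being evaluated.
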
# Calculating RMSE of average score estimates
baseline_rmse_test <- sqrt(mean((test_pairs$Rating - mean(train_pairs$Rating))^2))
baseline_rmse_train <- sqrt(mean((train_pairs$Rating - mean(train_pairs$Rating))^2))

# Creating a dataframe of the RMSE for later comparison
rmse_df <- tribble(
  ~Prediction, ~RMSE,
  "Raw Average Training", baseline_rmse_train,
  "Raw Average Test", baseline_rmse_test
)

rmse_df |>
  kbl(caption = "RMSE of Predicted Ratings") |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
RMSE of Predicted Ratings
Prediction RMSE
Raw Average Training 0.6752206
Raw Average Test 0.8277534

Incorporating User and Item Bias

To improve performance, we can calculate the bias of each user and item and use these values to construct baseline predictors for the different user-item pairs. A bias is simply how much a user's or item's average rating differs from the average of the entire training set, and we can use it to adjust the estimate for each user-item pair.

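In equation form, with μ denoting the mean rating of the training set, the biases computed below are

$$b_{u} = \bar{r}_{u} - \mu, \qquad b_{i} = \bar{r}_{i} - \mu$$

where the first term in each difference is the user's or item's average rating in the training data.
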
# Calculating user bias scores
user_bias <- avg_user |>
  mutate(user_bias = average - mean(train_pairs$Rating)) |>
  select(Name, user_bias)

user_bias |>
  kbl(caption = "User Bias Values") |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
User Bias Values
Name user_bias
Alice 0.7727273
Bob 0.3977273
Carol -0.2272727
David -0.3939394
Emma 0.4393939
Frank -0.5606061
Grace 0.3977273
Henry -0.8939394
Iris 0.1060606
Jack -0.2272727

# Calculating item bias scores
item_bias <- avg_product |>
  mutate(item_bias = average - mean(train_pairs$Rating)) |>
  select(Item, item_bias)

item_bias |>
  kbl(caption = "Item Bias Values") |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Item Bias Values
Item item_bias
A 0.3352273
B 0.5727273
C -0.3939394
D -0.5844156
E 0.1298701

We can now generate baseline predictions by adding the item and user bias values to the training-set average for each user-item pair. These baseline estimators are a much better starting point for a recommender, as they capture more of the variance in user and item ratings. Including the user and item bias values roughly halves the RMSE on both the training and test sets compared to the raw-average predictions.

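Each prediction is simply the training-set mean plus the user bias plus the item bias. As a worked illustration using the tables above, the training mean implied by those values is roughly 3.73, so the baseline prediction for Alice on product A is approximately 3.73 + 0.77 + 0.34 ≈ 4.84, slightly above her observed rating of 4.5. The same calculation in code (for illustration only):

# Worked example: baseline prediction for a single user-item pair (Alice and product A)
mu <- mean(train_pairs$Rating)                            # global training mean, ~3.73
b_alice <- user_bias$user_bias[user_bias$Name == "Alice"] # ~0.77
b_item_a <- item_bias$item_bias[item_bias$Item == "A"]    # ~0.34
mu + b_alice + b_item_a                                   # ~4.84, versus an observed 4.5
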
# Predicting ratings in the training set using baseline estimators
train_predictions <- train_pairs |>
  left_join(user_bias, by = "Name") |>
  left_join(item_bias, by = "Item") |>
  mutate(predicted_rating = mean(train_pairs$Rating) + user_bias + item_bias)

# Predicting ratings in the testing set using baseline estimators
test_predictions <- test_pairs |>
  left_join(user_bias, by = "Name") |>
  left_join(item_bias, by = "Item") |>
  mutate(predicted_rating = mean(train_pairs$Rating) + user_bias + item_bias)
# Calculating RMSE of the baseline estimator predictions
bias_rmse_train <-  sqrt(mean((train_predictions$Rating - train_predictions$predicted_rating)^2))
bias_rmse_test <- sqrt(mean((test_predictions$Rating - test_predictions$predicted_rating)^2))

rmse_df |>
  add_row(Prediction = "Training Baseline Predictor", RMSE = bias_rmse_train) |>
  add_row(Prediction = "Testing Baseline Predictor", RMSE = bias_rmse_test) |>
  kbl(caption = "RMSE of Predicted Ratings") |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
RMSE of Predicted Ratings
Prediction RMSE
Raw Average Training 0.6752206
Raw Average Test 0.8277534
Training Baseline Predictor 0.3295519
Testing Baseline Predictor 0.4246094

Analyzing Baseline Predictor Performance

While the baseline predictors deliver a significant improvement in accuracy as measured by RMSE, performance on the testing set is still not stellar at almost a half step on the rating scale. Although baseline predictors are an improvement on simple averages, they have several shortcomings. They are vulnerable to missingness, so different samples of the data can produce very different bias values. The tables below demonstrate this phenomenon by comparing the bias values from the original sample to those from a new sample of the data; as the differences for Carol and Item E show, the swings due to sampling and missingness can be extreme. Additionally, simple bias adjustments do not capture more nuanced relationships in the data, such as similarities between users or items, which could improve estimates in situations where data is scarce.

set.seed(25637)

# Breaking data into train-test split for a second time
train_index_repeat <- createDataPartition(y = 1:nrow(long_df),
                                   p = 0.8,
                                   list = FALSE)

train_pairs_repeat <- long_df[train_index_repeat,]
test_pairs_repeat <- long_df[-train_index_repeat,]

avg_user_repeat <- train_pairs_repeat |>
  group_by(Name) |>
  summarise(average = mean(Rating))

avg_product_repeat <- train_pairs_repeat |>
  group_by(Item) |>
  summarise(average = mean(Rating))

user_bias_repeat <- avg_user_repeat |>
  mutate(user_bias_repeat = average - mean(train_pairs_repeat$Rating)) |>
  select(Name, user_bias_repeat)

item_bias_repeat <- avg_product_repeat |>
  mutate(item_bias_repeat = average - mean(train_pairs_repeat$Rating)) |>
  select(Item, item_bias_repeat)

user_bias |>
  left_join(user_bias_repeat, join_by(Name)) |>
  mutate(`Absolute Difference` = abs(user_bias_repeat - user_bias),
         `Percent Difference` = abs(((user_bias_repeat - user_bias) / user_bias)) * 100) |>
  kbl(caption = "Comparison of Different Sampling on User Bias") |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Comparison of Different Sampling on User Bias
Name user_bias user_bias_repeat Absolute Difference Percent Difference
Alice 0.7727273 1.0000000 0.2272727 29.41176
Bob 0.3977273 0.3333333 0.0643939 16.19048
Carol -0.2272727 -0.6666667 0.4393939 193.33333
David -0.3939394 -0.2916667 0.1022727 25.96154
Emma 0.4393939 0.3333333 0.1060606 24.13793
Frank -0.5606061 -0.4166667 0.1439394 25.67568
Grace 0.3977273 0.5000000 0.1022727 25.71429
Henry -0.8939394 -0.6666667 0.2272727 25.42373
Iris 0.1060606 0.1666667 0.0606061 57.14286
Jack -0.2272727 -0.1666667 0.0606061 26.66667
item_bias |>
  left_join(item_bias_repeat, join_by(Item)) |>
  mutate(`Absolute Difference` = abs(item_bias_repeat - item_bias),
         `Percent Difference` = abs(((item_bias_repeat - item_bias) / item_bias)) * 100) |>
  kbl(caption = "Comparison of Different Sampling on Item Bias") |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Comparison of Different Sampling on Item Bias
Item item_bias item_bias_repeat Absolute Difference Percent Difference
A 0.3352273 0.2500000 0.0852273 25.423729
B 0.5727273 0.5476190 0.0251082 4.383976
C -0.3939394 -0.3095238 0.0844156 21.428571
D -0.5844156 -0.8333333 0.2489177 42.592593
E 0.1298701 0.2619048 0.1320346 101.666667

Incorporating Regularization

One of the methods we can use to improve the performance of the baseline predictors is regularization. By adding a penalty term when calculating the user and item bias parameters, we can prevent overfitting and stabilize the bias values. It is important to let the user and item regularization parameters vary independently when tuning, as one set of biases may require more regularization than the other.

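In equation form, the alternating updates implemented below compute

$$b_{u} = \frac{\sum_{i \in I_{u}}\left(r_{ui} - \mu - b_{i}\right)}{\left|I_{u}\right| + \lambda_{user}}, \qquad b_{i} = \frac{\sum_{u \in U_{i}}\left(r_{ui} - \mu - b_{u}\right)}{\left|U_{i}\right| + \lambda_{item}}$$

where I_u is the set of items rated by user u and U_i is the set of users who rated item i. Larger penalty values shrink the biases toward zero, which has the greatest effect on users and items with few ratings.
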
train_regularized_bias <- function(train_data, lambda_user = 0.1, lambda_item = 0.1, 
                                       max_iter = 50, tol = 1e-6) {
  global_mean <- mean(train_data$Rating)
  
  # Initialize biases to zero
  user_biases <- train_data |>
    distinct(Name) |>
    mutate(user_bias = 0)
  
  item_biases <- train_data |>
    distinct(Item) |>
    mutate(item_bias = 0)
  
  # Iterative updates
  for (iter in 1:max_iter) {
    old_user_biases <- user_biases$user_bias
    old_item_biases <- item_biases$item_bias
    
    # Update user biases
    user_biases <- train_data |>
      left_join(item_biases, by = "Item") |>
      mutate(residual = Rating - global_mean - item_bias) |>
      group_by(Name) |>
      summarise(
        n_ratings = n(),
        sum_residuals = sum(residual),
        .groups = "drop"
      ) |>
      mutate(user_bias = sum_residuals / (n_ratings + lambda_user))
    
    # Update item biases  
    item_biases <- train_data |>
      left_join(user_biases, by = "Name") |>
      mutate(residual = Rating - global_mean - user_bias) |>
      group_by(Item) |>
      summarise(
        n_ratings = n(),
        sum_residuals = sum(residual),
        .groups = "drop"
      ) |>
      mutate(item_bias = sum_residuals / (n_ratings + lambda_item))
    
    # Check convergence
    user_change <- sqrt(mean((user_biases$user_bias - old_user_biases)^2, na.rm = TRUE))
    item_change <- sqrt(mean((item_biases$item_bias - old_item_biases)^2, na.rm = TRUE))
    
    if (user_change < tol && item_change < tol) {
      break
    }
  }
  
  list(
    global_mean = global_mean,
    user_biases = user_biases,
    item_biases = item_biases,
    lambda_user = lambda_user,
    lambda_item = lambda_item,
    iterations = iter
  )
}

# Function to make predictions
predict_ratings <- function(model, data) {
  data |>
    left_join(model$user_biases, by = "Name") |>
    left_join(model$item_biases, by = "Item") |>
    mutate(
      user_bias = coalesce(user_bias, 0),   # Users not seen in training fall back to zero bias
      item_bias = coalesce(item_bias, 0),   # Items not seen in training fall back to zero bias
      predicted_rating = model$global_mean + user_bias + item_bias,
      predicted_rating = pmax(1, pmin(5, predicted_rating)) # Clamp predictions to the 1-5 rating range
    )
}

# Helper to calculate RMSE between actual and predicted ratings
calculate_rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2, na.rm = TRUE))
}
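Before sweeping a grid of penalty values, a single fit can be checked end to end; the lambda values below are arbitrary and purely illustrative:

# Illustrative single fit and evaluation with arbitrary penalty values
single_fit <- train_regularized_bias(train_pairs, lambda_user = 0.5, lambda_item = 0.5)
single_pred <- predict_ratings(single_fit, test_pairs)
calculate_rmse(single_pred$Rating, single_pred$predicted_rating)
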
# A grid of regularization parameters to try
lambda_grid <- expand_grid(
  lambda_user = c(0, 0.01, 0.1, 0.5, 1.0),
  lambda_item = c(0, 0.01, 0.1, 0.5, 1.0)
)

results <- lambda_grid |>
  mutate(
    model = map2(lambda_user, lambda_item, ~{
      train_regularized_bias(train_pairs, .x, .y)
    })
  ) |>
  mutate(
    # Get predictions for train and test
    train_pred = map(model, ~predict_ratings(.x, train_pairs)),
    test_pred = map(model, ~predict_ratings(.x, test_pairs)),
    
    # Calculate RMSE
    train_rmse = map_dbl(train_pred, ~calculate_rmse(.x$Rating, .x$predicted_rating)),
    test_rmse = map_dbl(test_pred, ~calculate_rmse(.x$Rating, .x$predicted_rating))
  )

Looking at the results of the regularization process, we can see that our predictors actually perform best without any regularization. This is not surprising, as the toy dataset constructed for this project is not very sparse and has intentional patterns built in. If a larger or sparser dataset were used, we would expect regularization to have a more meaningful impact on model performance.

results |>
  select(lambda_user, lambda_item, train_rmse, test_rmse) |>
  arrange(test_rmse) |>
  head(10) |>
  kbl(caption = "Top Performing Regularization Values") |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Top Performing Regularization Values
lambda_user lambda_item train_rmse test_rmse
0.00 0.00 0.2715822 0.3166312
0.00 0.01 0.2715830 0.3168834
0.01 0.00 0.2715868 0.3171637
0.01 0.01 0.2715875 0.3174135
0.00 0.10 0.2716607 0.3192452
0.01 0.10 0.2716645 0.3197502
0.10 0.00 0.2720140 0.3221532
0.10 0.01 0.2720140 0.3223800
0.10 0.10 0.2720837 0.3244982
0.00 0.50 0.2732845 0.3306648

To demonstrate this, we can construct just such a large, sparse dataset. Looking at the results in the table below, we can see meaningful improvement from regularization, with every one of the ten best-performing parameter pairings including at least one non-zero penalty. The proper regularization parameters prevent the model from overfitting, but the large jump in RMSE from the training set to the test set shows that baseline predictors still tend to perform poorly on sparse data. We could continue to address this by adding cross-validation and training across multiple folds of the data; however, with regularization and cross-validation added we would have moved into the territory of linear regression models and would no longer be using global baseline predictors.

# Creates a large and sparse dataset
set.seed(8675309)
create_overfitting_data <- function() {
  set.seed(123)
  
  n_users <- 80
  n_items <- 60
  
  # Create users with extreme biases
  user_biases <- rnorm(n_users, 0, 1.2)  # Wide range of user biases
  item_biases <- rnorm(n_items, 0, 0.8)  # Moderate item biases
  
  ratings_list <- list()
  
  for (i in 1:n_users) {
    n_user_ratings <- sample(2:4, 1)  # Very few ratings per user
    user_items <- sample(1:n_items, n_user_ratings)
    
    for (j in user_items) {
      true_rating <- 3.5 + user_biases[i] + item_biases[j] + rnorm(1, 0, 0.4)
      observed_rating <- pmax(1, pmin(5, round(true_rating * 2) / 2))
      
      ratings_list <- append(ratings_list, list(tibble(
        Name = paste0("User_", i),
        Item = paste0("Item_", j),
        Rating = observed_rating
      )))
    }
  }
  
  bind_rows(ratings_list)
}

# Call creation function
overfit_df <- create_overfitting_data()

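Since each of the 80 users rates only two to four of the 60 items, the resulting rating matrix is only about 3-7% filled. A quick way to confirm the density of the generated data (an illustrative check, not part of the original code):

# Fraction of the user-item matrix that actually contains a rating
nrow(overfit_df) / (n_distinct(overfit_df$Name) * n_distinct(overfit_df$Item))
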
# Make a train-test split
overfit_train_index <- createDataPartition(y = 1:nrow(overfit_df),
                                           p = 0.8,
                                           list = FALSE)

overfit_train_pairs <- overfit_df[overfit_train_index,]
overfit_test_pairs <- overfit_df[-overfit_train_index,]

# Train on the overfitting-prone dataset
overfit_results <- lambda_grid |>
  mutate(
    model = map2(lambda_user, lambda_item, ~{
      train_regularized_bias(overfit_train_pairs, .x, .y)
    })
  ) |>
  mutate(
    # Get predictions for train and test
    train_pred = map(model, ~predict_ratings(.x, overfit_train_pairs)),
    test_pred = map(model, ~predict_ratings(.x, overfit_test_pairs)),
    
    # Calculate RMSE
    train_rmse = map_dbl(train_pred, ~calculate_rmse(.x$Rating, .x$predicted_rating)),
    test_rmse = map_dbl(test_pred, ~calculate_rmse(.x$Rating, .x$predicted_rating))
  )

overfit_results |>
  select(lambda_user, lambda_item, train_rmse, test_rmse) |>
  arrange(test_rmse) |>
  head(10) |>
  kbl(caption = "Top Performing Regularization Values on the Overfitting-Prone Dataset") |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Top Performing Regularization Values on the Overfitting-Prone Dataset
lambda_user lambda_item train_rmse test_rmse
0.10 0.50 0.2442995 0.6653811
0.50 0.50 0.2969028 0.6802588
0.10 0.10 0.2080561 0.6808387
0.01 0.50 0.2378117 0.6875802
0.00 0.50 0.2374046 0.6913835
0.10 1.00 0.2826375 0.6980052
0.50 0.10 0.2753319 0.7007383
0.50 1.00 0.3251385 0.7011162
0.01 1.00 0.2783954 0.7165237
0.10 0.01 0.2022665 0.7172480
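To quantify how much the penalty terms help here, the unregularized fit can be pulled from the same results object and set against the best row above (a sketch; output not shown):

# Compare the unregularized fit to the best regularized fit on the sparse dataset
overfit_results |>
  select(lambda_user, lambda_item, train_rmse, test_rmse) |>
  filter((lambda_user == 0 & lambda_item == 0) | test_rmse == min(test_rmse)) |>
  kbl(caption = "Unregularized vs. Best Regularized Baseline Predictors") |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))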