607_3A

Author

Desiree Thomas

Approach

For this assignment, I reuse the synthetic ratings from Assignment 2. All code follows Tidyverse conventions. I used an LLM to help generate the top recommended movie for each user and to address a NaN that appeared in the output. The NaN was caused by a “Cold Start” user profile, that is, a user with no ratings.

Introduction

This codebase contains a Global Baseline Estimate (GBE) recommendation system. The ratings were generated during a previous assignment; in my case, I chose synthetic data generated by the LLM Gemini Pro. The goal is to use the Predicted Rating formula to predict how each user would rate the movies they have not yet rated, and then to recommend the movie with the highest predicted rating. The formula is Predicted Rating (aka rui) = Overall Mean + Bi (Item Bias) + Bu (User Bias). Bi is a movie’s average rating minus the overall mean, and Bu is a user’s average rating minus the overall mean; both are calculated from the observed ratings.
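To make the formula concrete, here is a minimal sketch on a hypothetical toy dataset. The `toy` table, its users, movies, and ratings are invented for illustration and are not the assignment data; only the formula itself comes from the text above.

```r
library(dplyr)

# Hypothetical toy ratings (not the assignment data); NA marks an unrated movie
toy <- tribble(
  ~user, ~movie, ~rating,
  "A",   "M1",   5,
  "A",   "M2",   4,
  "B",   "M1",   3,
  "B",   "M2",   NA,
  "C",   "M1",   1,
  "C",   "M2",   3
)

# Overall mean of the observed ratings: (5 + 4 + 3 + 1 + 3) / 5 = 3.2
mu <- mean(toy$rating, na.rm = TRUE)

# Item bias: each movie's mean rating minus the overall mean
item_bias <- toy %>%
  group_by(movie) %>%
  summarize(bi = mean(rating, na.rm = TRUE) - mu)

# User bias: each user's mean rating minus the overall mean
user_bias_toy <- toy %>%
  group_by(user) %>%
  summarize(bu = mean(rating, na.rm = TRUE) - mu)

# Predict only the unrated (user, movie) pairs
toy_pred <- toy %>%
  filter(is.na(rating)) %>%
  left_join(item_bias, by = "movie") %>%
  left_join(user_bias_toy, by = "user") %>%
  mutate(predicted_rating = mu + bi + bu)

# User B on movie M2: 3.2 + 0.3 + (-0.2) = 3.3
toy_pred$predicted_rating
```

Here M2 is rated slightly above average (Bi = 0.3) and user B rates slightly below average (Bu = -0.2), so the two adjustments partly cancel.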

Analysis

I started off by calculating the Overall mean, movie averages and user averages in an Excel PivotTable. This was to ensure that I could “fact-check”, so-to-speak, the numbers that my code would generate. I did this PivotTable before I wrote even a line of code and it is in the .xlsx file in Sheet1.

In R, I used Tidyverse syntax and started by calculating the overall mean. I then calculated the movie and user biases and used the formula to calculate the Predicted Rating. As a quick math check, I manually calculated the prediction for one user, Carol Johnson, from my PivotTable values and confirmed that it matched the code’s output before proceeding.
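As an illustration of that check, the arithmetic for Carol Johnson and Inside Out 2 can be written out using the rounded values that appear in the code output (overall mean 3.04, Bi = -0.0400, Bu = -0.707):

$$
r_{ui} = 3.04 + (-0.0400) + (-0.707) = 2.293 \approx 2.29
$$

Because the biases are displayed in rounded form, a hand calculation from these values may differ from the code’s result in the last digit.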

Next, I asked Gemini to guide me and explain how to generate the top movie recommendation for every user. I have cited this code in the Citation section. While working through the code and viewing the output, I noticed that one of the users had NaN listed as the predicted rating. Gemini suggested that I check whether this user had ever rated a movie; if he had not, he would be an example of a “Cold Start” user. It suggested that I convert the NaNs in the bias tables to 0, so that a Cold Start user is treated as having an average rating of 3.04 (a user bias of 0) and can still receive a movie recommendation. It wrote the code along with a detailed explanation. For this assignment, I did not use the account_id column and instead used profile_name to identify users. Additionally, since the GBE is not meant for personalization, I did not go further into creating profile_ids, etc., as planned in the previous assignment. I now have a better understanding of the strengths and limitations of this algorithm.
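The mechanics behind the NaN can be reproduced in a few lines. This is a minimal sketch, assuming a hypothetical user whose ratings are all missing; the names `demo_ratings`, `demo_mean`, `demo_bu`, and `demo_bu_fixed` are invented for illustration, and 0.266 is the Barbie item bias from the output.

```r
library(tidyr)

# Hypothetical cold-start user: every rating is missing
demo_ratings <- c(NA_real_, NA_real_, NA_real_)

# na.rm = TRUE drops every value, so mean() is taken over an empty vector
user_mean <- mean(demo_ratings, na.rm = TRUE)
is.nan(user_mean)                     # TRUE

# NaN propagates through subtraction and addition
demo_mean <- 3.04                     # overall mean reported in this document
demo_bu <- user_mean - demo_mean
is.nan(demo_mean + 0.266 + demo_bu)   # TRUE

# Treat the unknown user as average: a bias of 0
demo_bu_fixed <- replace_na(demo_bu, 0)
demo_mean + 0.266 + demo_bu_fixed     # 3.306
```

With the user bias set to 0, the prediction depends only on the overall mean and the movie’s bias, so the movie with the highest item bias becomes that user’s top recommendation.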

Conclusion

The Global Baseline Estimate recommendation was a success. I was able to correctly calculate the averages and predicted ratings, and I even encountered an example of a Cold Start user. Being able to fall back on the overall mean for such users is one of the strengths of a Global Baseline Predictor.

#Load Libraries and Dataset

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.2.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(dplyr)
library(readr)

raw_data <- read_csv("https://raw.githubusercontent.com/desithomas/DATA607_3A/refs/heads/main/excel_version_survey_results2.csv")

Rows: 450 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): profile_name, movie_title
dbl (2): account_id, rating

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

#Calculate the overall mean

overall_mean <- raw_data %>% summarize(mean_value = mean(rating, na.rm = TRUE)) %>%
  pull(mean_value)

#output shows as 3.04 which was the same as the manual calculation in the .xlsx version of the file
overall_mean

[1] 3.04

#Calculate the item bias 

movie_bias <- raw_data %>% 
  group_by(movie_title) %>%
  summarize(bi = mean(rating, na.rm = TRUE) - overall_mean)
  
movie_bias

# A tibble: 6 × 2
  movie_title                              bi
  <chr>                                 <dbl>
1 Barbie                               0.266 
2 Deadpool & Wolverine                -0.128 
3 Dune: Part Two                      -0.0855
4 Inside Out 2                        -0.0400
5 Oppenheimer                         -0.104 
6 Spider-Man: Across the Spider-Verse  0.164

#Calculate the user bias 

user_bias <- raw_data %>% 
  group_by(profile_name) %>%
  summarize(bu = mean(rating, na.rm = TRUE) - overall_mean)

#Adam Stevens output matches manual calculation found in the .xlsx file Sheet 1
user_bias

# A tibble: 75 × 2
   profile_name           bu
   <chr>               <dbl>
 1 Aaron Powell       0.46  
 2 Adam Gutierrez     0.56  
 3 Adam Howard        0.96  
 4 Adam Stevens      -0.0400
 5 Alan Chavez        0.96  
 6 Alan Reyes        -0.79  
 7 Albert Richardson -0.29  
 8 Amanda Wright      1.63  
 9 Andrew Adams      -0.373 
10 Anthony Smith      1.63  
# ℹ 65 more rows

# calculate the predicted rating aka rui

final_predictions <- raw_data %>% 
  filter(is.na(rating)) %>%
  left_join(movie_bias, by ="movie_title") %>%
  left_join(user_bias, by = "profile_name") %>%
  #this runs the formula provided in the slides for the Predicted Rating 
  mutate(predicted_rating = overall_mean + bi + bu)

final_predictions

# A tibble: 175 × 7
   account_id profile_name   movie_title rating      bi      bu predicted_rating
        <dbl> <chr>          <chr>        <dbl>   <dbl>   <dbl>            <dbl>
 1       6000 Carol Johnson  Deadpool &…     NA -0.128  -0.707              2.21
 2       6000 Carol Johnson  Inside Out…     NA -0.0400 -0.707              2.29
 3       6000 Carol Johnson  Oppenheimer     NA -0.104  -0.707              2.23
 4       6001 Edward Perez   Oppenheimer     NA -0.104  -0.240              2.70
 5       6002 Amanda Wright  Inside Out…     NA -0.0400  1.63               4.63
 6       6002 Amanda Wright  Barbie          NA  0.266   1.63               4.93
 7       6002 Amanda Wright  Spider-Man…     NA  0.164   1.63               4.83
 8       6003 Mark Taylor    Barbie          NA  0.266   0.76               4.07
 9       6004 Donna Harris   Barbie          NA  0.266  -0.240              3.07
10       6005 Richard Willi… Dune: Part…     NA -0.0855 -0.0400             2.91
# ℹ 165 more rows

#check a specific user and movie

example_prediction <- final_predictions %>% 
  filter(profile_name == "Carol Johnson", movie_title == "Inside Out 2") %>%
  select(profile_name, movie_title, bi, bu, predicted_rating)

example_prediction

# A tibble: 1 × 5
  profile_name  movie_title       bi     bu predicted_rating
  <chr>         <chr>          <dbl>  <dbl>            <dbl>
1 Carol Johnson Inside Out 2 -0.0400 -0.707             2.29

# Generate the Top 1 Recommendation for EVERY User
top_recommendations <- final_predictions %>%
  # 1. Group the data by user
  group_by(profile_name) %>%
  
  # 2. Within each group, grab the row with the maximum score
  slice_max(order_by = predicted_rating, n = 1, with_ties = FALSE) %>%
  
  # 3. Clean up the grouping metadata
  ungroup() %>%
  
  # 4. Display the most important columns
  select(profile_name, movie_title, predicted_rating)

# View the result
top_recommendations

# A tibble: 68 × 3
   profile_name      movie_title    predicted_rating
   <chr>             <chr>                     <dbl>
 1 Adam Gutierrez    Dune: Part Two             3.51
 2 Adam Howard       Barbie                     4.27
 3 Adam Stevens      Barbie                     3.27
 4 Alan Chavez       Barbie                     4.27
 5 Alan Reyes        Barbie                     2.52
 6 Albert Richardson Dune: Part Two             2.66
 7 Amanda Wright     Barbie                     4.93
 8 Andrew Adams      Barbie                     2.93
 9 Anthony Smith     Dune: Part Two             4.58
10 Ashley Allen      Barbie                     2.60
# ℹ 58 more rows
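The `with_ties = FALSE` argument in the top-recommendation code matters when a user has two movies with the same predicted rating. A minimal sketch on hypothetical data (the `demo` table and its values are invented for illustration):

```r
library(dplyr)

# Hypothetical predictions with a tie for user "A"
demo <- tribble(
  ~profile_name, ~movie_title, ~predicted_rating,
  "A",           "M1",         4.8,
  "A",           "M2",         4.8,
  "B",           "M1",         3.1,
  "B",           "M2",         2.9
)

# Default behaviour (with_ties = TRUE): tied rows are all returned
with_ties_default <- demo %>%
  group_by(profile_name) %>%
  slice_max(order_by = predicted_rating, n = 1) %>%
  ungroup()

# with_ties = FALSE: at most n rows per group, even when scores tie
one_per_user <- demo %>%
  group_by(profile_name) %>%
  slice_max(order_by = predicted_rating, n = 1, with_ties = FALSE) %>%
  ungroup()

nrow(with_ties_default)  # 3: both tied rows for "A" are kept
nrow(one_per_user)       # 2: exactly one row per user
```

Because the Global Baseline Estimate adds the same user bias to every movie, ties occur only when two movies have identical item biases, but the argument still guarantees one row per user.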

#Confirm whether Joshua Marshall is a cold start

# Check for Joshua's active ratings
raw_data %>% 
  filter(profile_name == "Joshua Marshall") %>% 
  summarize(active_ratings = sum(!is.na(rating)))

# A tibble: 1 × 1
  active_ratings
           <int>
1              0

# Check for the movie's active ratings
raw_data %>% 
  filter(movie_title == "Deadpool & Wolverine") %>% 
  summarize(active_ratings = sum(!is.na(rating)))

# A tibble: 1 × 1
  active_ratings
           <int>
1             57

#also visually checked and confirmed that he has not rated any movies

# Add replace_na to ensure NaN becomes 0
movie_bias <- raw_data %>% 
  group_by(movie_title) %>% 
  summarize(bi = mean(rating, na.rm = TRUE) - overall_mean) %>% 
  mutate(bi = replace_na(bi, 0)) # If bi is NaN, make it 0

user_bias <- raw_data %>% 
  group_by(profile_name) %>% 
  summarize(bu = mean(rating, na.rm = TRUE) - overall_mean) %>% 
  mutate(bu = replace_na(bu, 0)) # If bu is NaN, make it 0
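If the prediction and top-recommendation steps ran before the bias tables were updated, they would need to be re-run to pick up the zero-bias fallback. The following is a minimal sketch of the whole pipeline on hypothetical toy data; the `toy_*` names, users, movies, and ratings are invented for illustration and are not the assignment data.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: user "New" has no ratings at all (cold start)
toy_ratings <- tribble(
  ~profile_name, ~movie_title, ~rating,
  "Ann",         "M1",         5,
  "Ann",         "M2",         3,
  "Bob",         "M1",         4,
  "Bob",         "M2",         NA,
  "New",         "M1",         NA,
  "New",         "M2",         NA
)

# Overall mean of the observed ratings: (5 + 3 + 4) / 3 = 4
toy_mean <- mean(toy_ratings$rating, na.rm = TRUE)

# Bias tables with the NaN-to-0 fallback
toy_movie_bias <- toy_ratings %>%
  group_by(movie_title) %>%
  summarize(bi = mean(rating, na.rm = TRUE) - toy_mean) %>%
  mutate(bi = replace_na(bi, 0))

toy_user_bias <- toy_ratings %>%
  group_by(profile_name) %>%
  summarize(bu = mean(rating, na.rm = TRUE) - toy_mean) %>%
  mutate(bu = replace_na(bu, 0))   # "New" gets bu = 0 instead of NaN

# Re-run the predictions and the top-1 selection with the fixed tables
toy_top <- toy_ratings %>%
  filter(is.na(rating)) %>%
  left_join(toy_movie_bias, by = "movie_title") %>%
  left_join(toy_user_bias, by = "profile_name") %>%
  mutate(predicted_rating = toy_mean + bi + bu) %>%
  group_by(profile_name) %>%
  slice_max(order_by = predicted_rating, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(profile_name, movie_title, predicted_rating)

toy_top
```

In this toy example the cold-start user "New" is recommended M1 with a predicted rating of 4.5, which is the overall mean (4) plus M1’s item bias (0.5).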

LLM transcript

User: Walk me through showing the top recommendation for every user. include explanations for syntax, reasoning and the why’s

LLM: To identify the top recommendation for every user, you move from individual predictions to a ranked “Leaderboard” logic. This is the goal of any recommendation engine: providing an actionable “Next Watch” for every person in the database.

The Recommendation Pipeline

To achieve this in R, you use a Grouped Slice strategy. This tells R to look at the data not as one big list, but as individual “buckets”—one for each person—and then pull the winner from each bucket.

# Generate the Top 1 Recommendation for EVERY User
top_recommendations <- final_predictions %>%
  # 1. Group the data by user
  group_by(profile_name) %>%
  
  # 2. Within each group, grab the row with the maximum score
  slice_max(order_by = predicted_rating, n = 1, with_ties = FALSE) %>%
  
  # 3. Clean up the grouping metadata
  ungroup() %>%
  
  # 4. Display the most important columns
  select(profile_name, movie_title, predicted_rating)

# View the result
top_recommendations

Why This Logic Works

In your current .qmd file, final_predictions contains thousands of rows. Without grouping, a function like slice_max would simply find the single highest score in the entire dataset.

1. group_by(profile_name): This is the “Split” phase of the Split-Apply-Combine strategy. It creates virtual partitions in the data. Any function that follows (like slice_max) is now applied to each person’s unique list of movies rather than the whole table.
2. slice_max(..., n = 1): This is the modern replacement for the older top_n() function, which has been superseded because its naming was often confusing. It is the most “interview-ready” way to find extremes within groups.
3. with_ties = FALSE: This is a critical logical parameter. If a user has two movies with the exact same predicted score (e.g., both 4.8), slice_max will return both by default. Setting this to FALSE ensures your output has exactly one row per user, keeping the final report clean.
4. ungroup(): This is a professional best practice. If you forget this, any future code you run on top_recommendations (like a summary or a plot) will still be performed “by user,” which can cause subtle, difficult-to-track bugs in your analysis.

Professional Insight

In a production system, this “Top 1” list is what gets fed into the user interface (the “What to Watch Next” banner). Because you are now using predicted_rating as the order_by variable, the ranking is strictly based on your calculated global baseline formula.

Next Step: Would you like to see how to format this final table into a professional-looking HTML table using the kable() function for your final report?


User: interestingly enough this user ended up with NaN : Joshua Marshall Deadpool & Wolverine NaN

LLM: The appearance of NaN (Not a Number) in a recommendation system is the technical signature of the “Cold Start” problem. It occurs when the algorithm encounters a user or an item it has never seen “active” before.

In R, this happens because of how the mean() function handles empty sets.

The Technical Cause: 0 ÷ 0

When you calculate movie_bias or user_bias, you use mean(rating, na.rm = TRUE).

If Joshua Marshall is in your dataset but has zero actual ratings (every single one of his rows is NA), the na.rm = TRUE argument removes every value.
R is then left with an empty set. The mean of an empty set is mathematically undefined (a sum of 0 divided by a count of 0), so R returns NaN.

Once a single NaN enters your addition formula ($r_{ui} = \text{Overall Mean} + b_i + b_u$), it acts like a virus. Any number added to NaN results in NaN, which is why the final prediction for that specific movie/user combination fails.

How to Verify the “Cold Start”

You can confirm this by checking how many ratings actually exist for that user or movie:

# Check for Joshua's active ratings
raw_data %>% 
  filter(profile_name == "Joshua Marshall") %>% 
  summarize(active_ratings = sum(!is.na(rating)))

# Check for the movie's active ratings
raw_data %>% 
  filter(movie_title == "Deadpool & Wolverine") %>% 
  summarize(active_ratings = sum(!is.na(rating)))

If either result is 0, you have found the source of the NaN.

The “Interview-Ready” Fix: Imputation

In a professional recommendation engine, we handle this by assuming that if we don’t know a user’s bias, we assume they are “average” (a bias of 0). You can fix your bias tables by using the replace_na() function from the tidyr package.

Update your bias chunks like this:

# Add replace_na to ensure NaN becomes 0
movie_bias <- raw_data %>% 
  group_by(movie_title) %>% 
  summarize(bi = mean(rating, na.rm = TRUE) - overall_mean) %>% 
  mutate(bi = replace_na(bi, 0)) # If bi is NaN, make it 0

user_bias <- raw_data %>% 
  group_by(profile_name) %>% 
  summarize(bu = mean(rating, na.rm = TRUE) - overall_mean) %>% 
  mutate(bu = replace_na(bu, 0)) # If bu is NaN, make it 0

Why this is the correct solution:

By replacing NaN with 0, your formula for Joshua Marshall becomes:

$$
r_{ui} = 3.04 + b_i + 0
$$

Instead of failing, the system falls back to the Global Average (3.04) for him, adjusted only by the movie’s own bias. This is the safest fallback when you have no data about the user to go on.

Does your raw_data show that Joshua Marshall had any non-NA ratings, or was he a completely “new” user in your list?

Citation

Google DeepMind. (2026). Gemini 3 Thinking [Large Language Model]. https://gemini.google.com. Accessed February 14th, 2026.