For this assignment, I am reusing the synthetic ratings I generated for Assignment 2. All code follows the Tidyverse style. An LLM was used to help generate the top recommended movie for each user and to address a NaN that appeared; the NaN was caused by a “Cold Start” user profile.
Introduction
This codebase contains a Global Baseline Estimate Recommendation System. The ratings were generated during a previous assignment; in my case, I chose synthetic data generated by the LLM Gemini Pro. The goal is to use the Predicted Rating formula to predict the movie a user is most likely to watch. The formula is Predicted Rating (rui) = Overall Mean + Bi (item bias) + Bu (user bias), where the item bias Bi and user bias Bu are calculated from each movie's and each user's observed ratings.
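To make the formula concrete, here is a minimal worked instance in R, using the Carol Johnson / Inside Out 2 values that appear in the output further below (the exact numbers depend on the synthetic data):

# Worked instance of the Global Baseline Estimate for one user/movie pair
overall_mean <- 3.04    # grand mean of all observed ratings
bi <- -0.0400           # item bias for Inside Out 2 (from the movie_bias output below)
bu <- -0.707            # user bias for Carol Johnson (from the user_bias output below)

predicted_rating <- overall_mean + bi + bu
predicted_rating        # roughly 2.29, matching the final_predictions output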
Analysis
I started by calculating the overall mean, movie averages, and user averages in an Excel PivotTable so that I could “fact-check” the numbers my code would generate. I built this PivotTable before writing any code; it is in Sheet1 of the .xlsx file.
In R, I used Tidyverse syntax and began by calculating the overall mean, then the movie and user biases, and finally the Predicted Rating using the formula above. As a quick validity check, I manually verified the calculation for one user, Carol Johnson, against my PivotTable and confirmed it was correct before proceeding.
Next, I asked Gemini to guide me through generating the top movie recommendation for every user; that code is cited under Citation. While reviewing the output, I noticed that one user had NaN listed as his prediction. Gemini suggested that I check whether this user had ever rated a movie; if not, he would be an example of a “Cold Start” user. It recommended converting the NaN biases in this dataset to 0 so that cold-start users fall back to the overall mean of 3.04 and still receive a movie recommendation, and it wrote the code along with a detailed explanation. For this assignment, I did not use the account_id column and instead identified users by profile_name. Since the Global Baseline Estimate is not meant for personalization, I did not go further into creating profile_ids, etc., as planned in the previous assignment. I now have a better understanding of the strengths and limitations of this algorithm.
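As a quick illustration of why the NaN shows up (a minimal sketch, not part of the graded code): when every rating for a user is NA, na.rm = TRUE leaves mean() with an empty set, which returns NaN, and tidyr's replace_na() is what maps that missing bias back to 0.

library(tidyr)

# A hypothetical cold-start user: all of their ratings are NA
cold_start_ratings <- c(NA_real_, NA_real_, NA_real_)

raw_bu <- mean(cold_start_ratings, na.rm = TRUE) - 3.04   # mean of an empty set -> NaN
raw_bu                                                     # NaN

replace_na(raw_bu, 0)                                      # the fix: treat the unknown bias as 0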
Conclusion
The Global Baseline Estimate recommendation was a success. I was able to correctly calculate the averages and predicted ratings, and I even encountered a Cold Start case; handling cold-start users gracefully is one of the strengths of a Global Baseline predictor.
# Load Libraries and Dataset
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.6
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.2 ✔ tibble 3.3.0
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.2.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Rows: 450 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): profile_name, movie_title
dbl (2): account_id, rating
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Calculate the overall mean
overall_mean <- raw_data %>%
  summarize(mean_value = mean(rating, na.rm = TRUE)) %>%
  pull(mean_value)

# Output shows as 3.04, which matches the manual calculation in the .xlsx version of the file
overall_mean

# Calculate the movie bias
movie_bias <- raw_data %>%
  group_by(movie_title) %>%
  summarize(bi = mean(rating, na.rm = TRUE) - overall_mean)
movie_bias
# A tibble: 6 × 2
movie_title bi
<chr> <dbl>
1 Barbie 0.266
2 Deadpool & Wolverine -0.128
3 Dune: Part Two -0.0855
4 Inside Out 2 -0.0400
5 Oppenheimer -0.104
6 Spider-Man: Across the Spider-Verse 0.164
# Calculate the user bias
user_bias <- raw_data %>%
  group_by(profile_name) %>%
  summarize(bu = mean(rating, na.rm = TRUE) - overall_mean)

# Adam Stevens' output matches the manual calculation found in the .xlsx file, Sheet 1
user_bias
# A tibble: 75 × 2
profile_name bu
<chr> <dbl>
1 Aaron Powell 0.46
2 Adam Gutierrez 0.56
3 Adam Howard 0.96
4 Adam Stevens -0.0400
5 Alan Chavez 0.96
6 Alan Reyes -0.79
7 Albert Richardson -0.29
8 Amanda Wright 1.63
9 Andrew Adams -0.373
10 Anthony Smith 1.63
# ℹ 65 more rows
# Calculate the predicted rating (rui)
final_predictions <- raw_data %>%
  filter(is.na(rating)) %>%
  left_join(movie_bias, by = "movie_title") %>%
  left_join(user_bias, by = "profile_name") %>%
  # This runs the formula provided in the slides for the Predicted Rating
  mutate(predicted_rating = overall_mean + bi + bu)

final_predictions
# A tibble: 175 × 7
account_id profile_name movie_title rating bi bu predicted_rating
<dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 6000 Carol Johnson Deadpool &… NA -0.128 -0.707 2.21
2 6000 Carol Johnson Inside Out… NA -0.0400 -0.707 2.29
3 6000 Carol Johnson Oppenheimer NA -0.104 -0.707 2.23
4 6001 Edward Perez Oppenheimer NA -0.104 -0.240 2.70
5 6002 Amanda Wright Inside Out… NA -0.0400 1.63 4.63
6 6002 Amanda Wright Barbie NA 0.266 1.63 4.93
7 6002 Amanda Wright Spider-Man… NA 0.164 1.63 4.83
8 6003 Mark Taylor Barbie NA 0.266 0.76 4.07
9 6004 Donna Harris Barbie NA 0.266 -0.240 3.07
10 6005 Richard Willi… Dune: Part… NA -0.0855 -0.0400 2.91
# ℹ 165 more rows
# Check a specific user and movie
example_prediction <- final_predictions %>%
  filter(profile_name == "Carol Johnson", movie_title == "Inside Out 2") %>%
  select(profile_name, movie_title, bi, bu, predicted_rating)

example_prediction
# A tibble: 1 × 5
profile_name movie_title bi bu predicted_rating
<chr> <chr> <dbl> <dbl> <dbl>
1 Carol Johnson Inside Out 2 -0.0400 -0.707 2.29
# Generate the Top 1 Recommendation for EVERY User
top_recommendations <- final_predictions %>%
  # 1. Group the data by user
  group_by(profile_name) %>%
  # 2. Within each group, grab the row with the maximum score
  slice_max(order_by = predicted_rating, n = 1, with_ties = FALSE) %>%
  # 3. Clean up the grouping metadata
  ungroup() %>%
  # 4. Display the most important columns
  select(profile_name, movie_title, predicted_rating)

# View the result
top_recommendations
# A tibble: 68 × 3
profile_name movie_title predicted_rating
<chr> <chr> <dbl>
1 Adam Gutierrez Dune: Part Two 3.51
2 Adam Howard Barbie 4.27
3 Adam Stevens Barbie 3.27
4 Alan Chavez Barbie 4.27
5 Alan Reyes Barbie 2.52
6 Albert Richardson Dune: Part Two 2.66
7 Amanda Wright Barbie 4.93
8 Andrew Adams Barbie 2.93
9 Anthony Smith Dune: Part Two 4.58
10 Ashley Allen Barbie 2.60
# ℹ 58 more rows
# Confirm whether Joshua Marshall is a cold start
# Check for Joshua's active ratings
raw_data %>%
  filter(profile_name == "Joshua Marshall") %>%
  summarize(active_ratings = sum(!is.na(rating)))
# A tibble: 1 × 1
active_ratings
<int>
1 0
# Check for the movie's active ratings
raw_data %>%
  filter(movie_title == "Deadpool & Wolverine") %>%
  summarize(active_ratings = sum(!is.na(rating)))
# A tibble: 1 × 1
active_ratings
<int>
1 57
# Also visually checked and confirmed that he has not rated any movies
# Add replace_na to ensure NaN becomes 0
movie_bias <- raw_data %>%
  group_by(movie_title) %>%
  summarize(bi = mean(rating, na.rm = TRUE) - overall_mean) %>%
  mutate(bi = replace_na(bi, 0))  # If bi is NaN, make it 0

user_bias <- raw_data %>%
  group_by(profile_name) %>%
  summarize(bu = mean(rating, na.rm = TRUE) - overall_mean) %>%
  mutate(bu = replace_na(bu, 0))  # If bu is NaN, make it 0
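With the corrected bias tables in place, re-running the prediction join should give the cold-start user a finite value. This is a sketch of that check (the object name final_predictions_fixed is mine; the rebuild step simply mirrors the final_predictions chunk above):

# Rebuild the predictions using the NaN-free bias tables (same pipeline as above)
final_predictions_fixed <- raw_data %>%
  filter(is.na(rating)) %>%
  left_join(movie_bias, by = "movie_title") %>%
  left_join(user_bias, by = "profile_name") %>%
  mutate(predicted_rating = overall_mean + bi + bu)

# Joshua Marshall should now get a numeric prediction of roughly overall_mean + bi,
# since his user bias has been imputed to 0
final_predictions_fixed %>%
  filter(profile_name == "Joshua Marshall") %>%
  select(profile_name, movie_title, predicted_rating)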
LLM transcript
User: Walk me through showing the top recommendation for every user. include explanations for syntax, reasoning and the why’s
LLM: To identify the top recommendation for every user, you move from individual predictions to a ranked “Leaderboard” logic. This is the goal of any recommendation engine: providing an actionable “Next Watch” for every person in the database.
The Recommendation Pipeline
To achieve this in R, you use a Grouped Slice strategy. This tells R to look at the data not as one big list, but as individual “buckets”—one for each person—and then pull the winner from each bucket.
# Generate the Top 1 Recommendation for EVERY User
top_recommendations <- final_predictions %>%
  # 1. Group the data by user
  group_by(profile_name) %>%
  # 2. Within each group, grab the row with the maximum score
  slice_max(order_by = predicted_rating, n = 1, with_ties = FALSE) %>%
  # 3. Clean up the grouping metadata
  ungroup() %>%
  # 4. Display the most important columns
  select(profile_name, movie_title, predicted_rating)

# View the result
top_recommendations
Why This Logic Works
In your current .qmd file, final_predictions contains thousands of rows. Without grouping, a function like slice_max would simply find the single highest score in the entire dataset.
1. group_by(profile_name): This is the “Split” phase of the Split-Apply-Combine strategy. It creates virtual partitions in the data. Any function that follows (like slice_max) is now applied to each person’s unique list of movies rather than the whole table.
2. slice_max(..., n = 1): This is the modern replacement for the older top_n() function, which has been superseded because its naming was often confusing. It is the most “interview-ready” way to find extremes within groups.
3. with_ties = FALSE: This is a critical logical parameter. If a user has two movies with the exact same predicted score (e.g., both 4.8), slice_max will return both by default. Setting this to FALSE ensures your output has exactly one row per user, keeping the final report clean.
4. ungroup(): This is a professional best practice. If you forget this, any future code you run on top_recommendations (like a summary or a plot) will still be performed “by user,” which can cause subtle, difficult-to-track bugs in your analysis.
Professional Insight
In a production system, this “Top 1” list is what gets fed into the user interface (the “What to Watch Next” banner). Because you are now using predicted_rating as the order_by variable, the ranking is strictly based on your calculated global baseline formula.
Next Step: Would you like to see how to format this final table into a professional-looking HTML table using the kable() function for your final report?
User: interestingly enough this user ended up with NaN : Joshua Marshall Deadpool & Wolverine NaN
LLM: The appearance of NaN (Not a Number) in a recommendation system is the technical signature of the “Cold Start” problem. It occurs when the algorithm encounters a user or an item it has never seen “active” before.
In R, this happens because of how the mean() function handles empty sets.
The Technical Cause: 0 ÷ 0
When you calculate movie_bias or user_bias, you use mean(rating, na.rm = TRUE).
If Joshua Marshall is in your dataset but has zero actual ratings (every single one of his rows is NA), the na.rm = TRUE argument removes every value.
R is then left with an empty set. The mean of an empty set is mathematically undefined (a sum of 0 divided by a count of 0), so R returns NaN.
Once a single NaN enters your addition formula (overall_mean + bi + bu), it acts like a virus. Any number added to NaN results in NaN, which is why the final prediction for that specific movie/user combination fails.
How to Verify the “Cold Start”
You can confirm this by checking how many ratings actually exist for that user or movie:
# Check for Joshua's active ratings
raw_data %>%
  filter(profile_name == "Joshua Marshall") %>%
  summarize(active_ratings = sum(!is.na(rating)))

# Check for the movie's active ratings
raw_data %>%
  filter(movie_title == "Deadpool & Wolverine") %>%
  summarize(active_ratings = sum(!is.na(rating)))
If either result is 0, you have found the source of the NaN.
The “Interview-Ready” Fix: Imputation
In a professional recommendation engine, we handle this by assuming that a user whose bias we do not know is “average” (a bias of 0). You can fix your bias tables by using the replace_na() function from the tidyr package.
Update your bias chunks like this:
# Add replace_na to ensure NaN becomes 0
movie_bias <- raw_data %>%
  group_by(movie_title) %>%
  summarize(bi = mean(rating, na.rm = TRUE) - overall_mean) %>%
  mutate(bi = replace_na(bi, 0))  # If bi is NaN, make it 0

user_bias <- raw_data %>%
  group_by(profile_name) %>%
  summarize(bu = mean(rating, na.rm = TRUE) - overall_mean) %>%
  mutate(bu = replace_na(bu, 0))  # If bu is NaN, make it 0
Why this is the correct solution:
By replacing NaN with 0, your formula for Joshua Marshall becomes:
predicted_rating = overall_mean + bi + 0
Instead of failing, the system will simply predict approximately the Global Average (3.04) for him. This is the safest fallback when you have no data to go on.
Does your raw_data show that Joshua Marshall had any non-NA ratings, or was he a completely “new” user in your list?
Citation
Google DeepMind. (2026). Gemini 3 Thinking [Large Language Model]. https://gemini.google.com. Accessed February 14th, 2026.