week3a

Author

Zihao Yu

1. How will I tackle the problem?

I will follow the instructions and use the baseline formula (global mean plus user bias plus movie bias) to fill in the missing ratings. For each user, I will then recommend the unrated movie with the highest predicted score.
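Written out, the formula used in the calculations of this report is the following. The symbols match the variable names in the code (`mu`, `b_u`, `b_i`, `r_hat`); the bar notation for the per-critic and per-movie averages is added here only for readability.

```latex
\hat{r}_{ui} = \mu + b_u + b_i,
\qquad b_u = \bar{r}_u - \mu,
\qquad b_i = \bar{r}_i - \mu
```

Here $\mu$ is the mean of all observed ratings, $\bar{r}_u$ is critic $u$'s average rating, and $\bar{r}_i$ is movie $i$'s average rating.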

2. What data challenges do I anticipate?

Since most users have rated only two movies, each user bias is estimated from very few ratings and is therefore unstable. To improve accuracy, I might also collect ratings for films the users haven’t watched but have some knowledge of, which would give each bias estimate more data to rest on.

3. Read the data

I created the ratings table as an xlsx file and exported it as a CSV.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.1     ✔ stringr   1.5.2
✔ ggplot2   4.0.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readr)
library(janitor)

Attaching package: 'janitor'

The following objects are masked from 'package:stats':

    chisq.test, fisher.test
movies_rating <- read.csv("https://raw.githubusercontent.com/XxY-coder/data607-week3a/refs/heads/main/movies_ratings.csv") |>
  clean_names()

glimpse(movies_rating)
Rows: 7
Columns: 7
$ critics                                      <chr> "M", "J", "W", "HP", "ZH"…
$ zootopia_2                                   <int> 4, NA, 5, NA, NA, 4, NA
$ stranger_things_5                            <int> 5, 5, 3, 1, NA, NA, 4
$ captain_america_brave_new_world              <int> 3, NA, NA, 1, NA, NA, NA
$ now_you_see_me_now_you_don_t                 <int> 4, NA, NA, NA, 4, NA, 2
$ disney_s_snow_white                          <dbl> 1.0, NA, NA, NA, 3.5, NA,…
$ the_sponge_bob_movie_search_for_square_pants <int> 4, 4, NA, NA, NA, 3, NA
rating_long <- 
  movies_rating |>
  pivot_longer(
    cols = -critics,
    names_to = "movies",
    values_to = "rating"
)

glimpse(rating_long)
Rows: 42
Columns: 3
$ critics <chr> "M", "M", "M", "M", "M", "M", "J", "J", "J", "J", "J", "J", "W…
$ movies  <chr> "zootopia_2", "stranger_things_5", "captain_america_brave_new_…
$ rating  <dbl> 4, 5, 3, 4, 1, 4, NA, 5, NA, NA, NA, 4, 5, 3, NA, NA, NA, NA, …
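
Once the data is in long format, the sparsity concern from section 2 can be checked directly by counting the non-missing ratings per critic. The sketch below uses a small made-up table (`toy_long` is hypothetical, not the assignment data); the same `summarise()` call should work on `rating_long`.

```r
library(dplyr)

# Hypothetical long-format ratings (NOT the assignment data); NA = not rated
toy_long <- tibble::tibble(
  critics = c("A", "A", "A", "B", "B", "B"),
  movies  = rep(c("m1", "m2", "m3"), times = 2),
  rating  = c(4, NA, 5, NA, 2, NA)
)

# Count how many movies each critic actually rated
n_rated <- toy_long |>
  group_by(critics) |>
  summarise(n_rated = sum(!is.na(rating)), .groups = "drop")

n_rated
```

Critics with only one or two ratings are the ones whose bias estimate is least reliable.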

4. Calculations

Compute the global mean rating (mu), each critic's bias (b_u), and each movie's bias (b_i).

mu <- mean(rating_long$rating, na.rm = TRUE)
mu
[1] 3.361111
user_bias <-
  rating_long |>
  group_by(critics) |>
  summarise(
    user_avg = mean(rating, na.rm = TRUE),
    .groups = "drop"
) |>
  mutate(b_u = user_avg -  mu)

(user_bias)
# A tibble: 7 × 3
  critics user_avg    b_u
  <chr>      <dbl>  <dbl>
1 CO          3.5   0.139
2 HP          1    -2.36 
3 J           4.5   1.14 
4 M           3.5   0.139
5 W           4     0.639
6 ZH          3.75  0.389
7 ZY          3    -0.361
movies_bias <-
  rating_long |>
  group_by(movies) |>
  summarise(
    movies_avg = mean(rating, na.rm = TRUE),
    .groups = "drop"
) |>
  mutate(b_i = movies_avg - mu)

(movies_bias)
# A tibble: 6 × 3
  movies                                       movies_avg     b_i
  <chr>                                             <dbl>   <dbl>
1 captain_america_brave_new_world                    2    -1.36  
2 disney_s_snow_white                                2.25 -1.11  
3 now_you_see_me_now_you_don_t                       3.33 -0.0278
4 stranger_things_5                                  3.6   0.239 
5 the_sponge_bob_movie_search_for_square_pants       3.67  0.306 
6 zootopia_2                                         4.33  0.972 
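
As a hand check, a single prediction can be rebuilt from the numbers printed above. The sketch below does this for critic J and `zootopia_2`, using the rounded values from the output. Capping the result to a 1 to 5 scale is an assumption made here, based on the range of the observed ratings.

```r
mu  <- 3.361111   # global mean, as printed above
b_u <- 4.5 - mu   # J's bias: user_avg - mu
b_i <- 0.972      # zootopia_2 bias, as printed above (rounded)

# Baseline prediction: global mean + user bias + movie bias
r_hat <- mu + b_u + b_i

# Assumption: ratings live on a 1 to 5 scale, so clamp the prediction
r_hat_capped <- min(max(r_hat, 1), 5)

round(r_hat, 2)
r_hat_capped
```

The uncapped value is above the largest rating in the data, which is why a cap is worth considering.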

5. Calculate the missing ratings

pred <- 
  rating_long |>
  left_join(user_bias, by = "critics") |>
  left_join(movies_bias, by = "movies") |>
  mutate(r_hat = mu + b_u + b_i)

view(pred)
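
Instead of reading the top pick for each critic off the viewer, it can be extracted in code. The following is a minimal sketch on a made-up table (`toy` and its movie names are hypothetical, not the assignment data). It repeats the same baseline calculation, caps predictions to an assumed 1 to 5 scale, and keeps the highest-scoring unrated movie per critic.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy ratings (NOT the assignment data); NA = not yet rated
toy <- tibble::tribble(
  ~critics, ~movie_a, ~movie_b, ~movie_c,
  "A",      5,        NA,       4,
  "B",      NA,       2,        NA,
  "C",      4,        3,        NA
)

long <- toy |>
  pivot_longer(-critics, names_to = "movies", values_to = "rating")

mu <- mean(long$rating, na.rm = TRUE)

pred <- long |>
  group_by(critics) |>
  mutate(b_u = mean(rating, na.rm = TRUE) - mu) |>  # critic bias
  group_by(movies) |>
  mutate(b_i = mean(rating, na.rm = TRUE) - mu) |>  # movie bias
  ungroup() |>
  # Assumption: cap predictions to a 1 to 5 rating scale
  mutate(r_hat = pmin(pmax(mu + b_u + b_i, 1), 5))

# Keep only unrated movies, then take the best prediction per critic
top_pick <- pred |>
  filter(is.na(rating)) |>
  group_by(critics) |>
  slice_max(r_hat, n = 1, with_ties = FALSE) |>
  ungroup() |>
  select(critics, movies, r_hat)

top_pick
```

A critic who has rated every movie has no rows left after the `filter()` step, so no recommendation is produced for them.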

6. Conclusion

M has rated every movie, so there is no recommendation for M. For each of the other critics, the top recommendation is the unrated movie with the highest predicted rating:

- J: Zootopia 2 (about 5.47, treated as 5)
- W: The SpongeBob Movie (about 4.31)
- HP: Zootopia 2 (about 1.97)
- ZH: Zootopia 2 (about 4.72)
- CO: Stranger Things 5 (about 3.74)
- ZY: Zootopia 2 (about 3.97)

OpenAI. (2025). ChatGPT (Version 5.2) [Large language model]. https://chat.openai.com.