week3a

Author

Zihao Yu

1. How will I tackle the problem?

I will follow the instructions: use the baseline formula (global mean rating plus a user bias plus a movie bias) to fill in the missing ratings, then recommend to each user the unrated movie with the highest predicted score.
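For reference, the formula appears to be the global baseline estimate, matching the `mutate(r_hat = mu + b_u + b_i)` step used later in this report. The bar notation for the averages is introduced here for illustration only and does not come from the assignment text.

```latex
\hat{r}_{ui} = \mu + b_u + b_i,
\qquad b_u = \bar{r}_u - \mu,
\qquad b_i = \bar{r}_i - \mu
```

Here $\mu$ is the mean of all observed ratings, $\bar{r}_u$ is the mean rating given by user $u$, and $\bar{r}_i$ is the mean rating received by movie $i$.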

2. What data challenges do I anticipate?

Since most users have rated only two movies, each user's estimated bias rests on very little data and is unstable. To improve accuracy, I might also collect ratings for films they haven't watched but have some knowledge of, so that each bias estimate is based on more ratings.

3. Read the data

I created the ratings in an xlsx spreadsheet and exported it as a CSV file.

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.1     ✔ stringr   1.5.2
✔ ggplot2   4.0.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(readr)
library(janitor)


Attaching package: 'janitor'

The following objects are masked from 'package:stats':

    chisq.test, fisher.test

movies_rating <- read.csv("https://raw.githubusercontent.com/XxY-coder/data607-week3a/refs/heads/main/movies_ratings.csv") |>
  clean_names()

glimpse(movies_rating)

Rows: 7
Columns: 7
$ critics                                      <chr> "M", "J", "W", "HP", "ZH"…
$ zootopia_2                                   <int> 4, NA, 5, NA, NA, 4, NA
$ stranger_things_5                            <int> 5, 5, 3, 1, NA, NA, 4
$ captain_america_brave_new_world              <int> 3, NA, NA, 1, NA, NA, NA
$ now_you_see_me_now_you_don_t                 <int> 4, NA, NA, NA, 4, NA, 2
$ disney_s_snow_white                          <dbl> 1.0, NA, NA, NA, 3.5, NA,…
$ the_sponge_bob_movie_search_for_square_pants <int> 4, 4, NA, NA, NA, 3, NA

# Reshape from wide (one column per movie) to long (one row per critic-movie pair)
rating_long <- 
  movies_rating |>
  pivot_longer(
    cols = -critics,
    names_to = "movies",
    values_to = "rating"
  )

glimpse(rating_long)

Rows: 42
Columns: 3
$ critics <chr> "M", "M", "M", "M", "M", "M", "J", "J", "J", "J", "J", "J", "W…
$ movies  <chr> "zootopia_2", "stranger_things_5", "captain_america_brave_new_…
$ rating  <dbl> 4, 5, 3, 4, 1, 4, NA, 5, NA, NA, NA, 4, 5, 3, NA, NA, NA, NA, …
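The sparsity concern raised earlier can be checked directly by counting the non-missing ratings per critic. The sketch below is a minimal example on a made-up long table; `toy_long`, `rating_counts`, and the critic and movie labels are hypothetical, and the same pipeline should work on `rating_long`.

```r
library(dplyr)

# Hypothetical toy data in the same long layout as rating_long
# (the critic and movie labels are illustrative, not the real file).
toy_long <- data.frame(
  critics = c("A", "A", "A", "B", "B", "B"),
  movies  = c("m1", "m2", "m3", "m1", "m2", "m3"),
  rating  = c(4, 5, 3, NA, 2, NA)
)

# Count how many non-missing ratings each critic contributed,
# listing the sparsest critics first.
rating_counts <- toy_long |>
  group_by(critics) |>
  summarise(n_rated = sum(!is.na(rating)), .groups = "drop") |>
  arrange(n_rated)

rating_counts
```

Critics with a very small `n_rated` are the ones whose bias estimates should be read with the most caution.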

4. Calculations

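Compute the three quantities the formula needs: the global mean rating (mu), each critic's bias (b_u), and each movie's bias (b_i).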

mu <- mean(rating_long$rating, na.rm = TRUE)
mu

[1] 3.361111

# User bias: each critic's average rating minus the global mean
user_bias <-
  rating_long |>
  group_by(critics) |>
  summarise(
    user_avg = mean(rating, na.rm = TRUE),
    .groups = "drop"
  ) |>
  mutate(b_u = user_avg - mu)

(user_bias)

# A tibble: 7 × 3
  critics user_avg    b_u
  <chr>      <dbl>  <dbl>
1 CO          3.5   0.139
2 HP          1    -2.36 
3 J           4.5   1.14 
4 M           3.5   0.139
5 W           4     0.639
6 ZH          3.75  0.389
7 ZY          3    -0.361

# Movie bias: each movie's average rating minus the global mean
movies_bias <-
  rating_long |>
  group_by(movies) |>
  summarise(
    movies_avg = mean(rating, na.rm = TRUE),
    .groups = "drop"
  ) |>
  mutate(b_i = movies_avg - mu)

(movies_bias)

# A tibble: 6 × 3
  movies                                       movies_avg     b_i
  <chr>                                             <dbl>   <dbl>
1 captain_america_brave_new_world                    2    -1.36  
2 disney_s_snow_white                                2.25 -1.11  
3 now_you_see_me_now_you_don_t                       3.33 -0.0278
4 stranger_things_5                                  3.6   0.239 
5 the_sponge_bob_movie_search_for_square_pants       3.67  0.306 
6 zootopia_2                                         4.33  0.972

5. Calculate the missing ratings

# Attach both biases to every critic-movie pair and compute the baseline prediction
pred <- 
  rating_long |>
  left_join(user_bias, by = "critics") |>
  left_join(movies_bias, by = "movies") |>
  mutate(r_hat = mu + b_u + b_i)

view(pred)  # opens the data viewer in an interactive session; no effect when rendering
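The recommendation step itself (keep only unrated movies, cap the prediction, take the best one per critic) can be sketched as follows. This is one possible approach on made-up numbers, not necessarily how the results in the conclusion were read off. `toy_pred` and `recommendations` are hypothetical names, and the 1 to 5 rating scale is an assumption.

```r
library(dplyr)

# Hypothetical toy version of `pred`: one row per critic-movie pair, with the
# observed rating (NA if unrated) and the baseline prediction r_hat.
toy_pred <- data.frame(
  critics = c("A", "A", "A", "B", "B", "B"),
  movies  = c("m1", "m2", "m3", "m1", "m2", "m3"),
  rating  = c(NA, 5, NA, 2, NA, NA),
  r_hat   = c(5.47, 4.90, 3.10, 2.00, 1.50, 0.80)
)

recommendations <- toy_pred |>
  filter(is.na(rating)) |>                       # keep only unrated movies
  mutate(r_hat = pmin(pmax(r_hat, 1), 5)) |>     # clamp to the assumed 1-5 scale
  group_by(critics) |>
  slice_max(r_hat, n = 1, with_ties = FALSE) |>  # best unrated movie per critic
  ungroup()

recommendations
```

Clamping before ranking can create ties at 5. With `with_ties = FALSE`, only the first tied row is kept, so ranking on the unclamped value and clamping afterwards may be preferable when ties matter. A critic who has rated every movie has no rows left after the filter and therefore gets no recommendation.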

6. Conclusion

M rated every movie, so there is nothing to recommend for M. For each of the other critics, the top recommendation is the unrated movie with the highest predicted rating:

- J: zootopia_2 (predicted about 5.47, treated as 5)
- W: the_sponge_bob_movie_search_for_square_pants (about 4.31)
- HP: zootopia_2 (about 1.97)
- ZH: zootopia_2 (about 4.72)
- CO: stranger_things_5 (about 3.74)
- ZY: zootopia_2 (about 3.97)

OpenAI. (2025). ChatGPT (Version 5.2) [Large language model]. https://chat.openai.com.