week2a

Author

Zihao Yu

1. How will I tackle the problem?

I will select 6 movies (TV shows, books, songs, etc.), then find at least 5 people to rate them on a scale of 1–5; items not viewed will be left blank and treated as missing values. Next, I will create a data table in an SQL database to record the data, then import it into R for analysis using dplyr, similar to the first assignment.

2. What data challenges do I anticipate?

Not everyone will have seen them all, resulting in numerous missing ratings. In extreme cases, someone might have seen only one of them, so I’ll need to find new participants or switch to more popular items. Additionally, this is my first time working with SQL, which may extend the time needed to complete the assignment.

3. Connect to SQL

library(DBI)
library(RPostgres)
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.1     ✔ stringr   1.5.2
✔ ggplot2   4.0.1     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

con <- dbConnect(
  RPostgres::Postgres(),
  dbname = "data_607",
  host   = "localhost",
  port   = 5432,
  user   = "postgres"
)

ratings_df <- dbGetQuery(con, "
  SELECT u.user_id, u.name, m.movie_id, m.title, r.rating
  FROM ratings r
  JOIN users u  ON r.user_id = u.user_id
  JOIN movies m ON r.movie_id = m.movie_id
  ORDER BY u.name, m.movie_id;
")

users_df  <- dbGetQuery(con, "SELECT user_id, name  FROM users  ORDER BY user_id;")
movies_df <- dbGetQuery(con, "SELECT movie_id, title FROM movies ORDER BY movie_id;")

dbDisconnect(con)

4. Clean the data

glimpse(ratings_df)

Rows: 18
Columns: 5
$ user_id  <int> 6, 6, 4, 4, 2, 2, 1, 1, 1, 1, 1, 1, 3, 3, 5, 5, 7, 7
$ name     <chr> "CO", "CO", "HP", "HP", "J", "J", "M", "M", "M", "M", "M", "M…
$ movie_id <int> 1, 6, 2, 3, 2, 6, 1, 2, 3, 4, 5, 6, 1, 2, 4, 5, 2, 4
$ title    <chr> "Zootopia 2", "The SpongeBob Movie: Search for SquarePants", …
$ rating   <dbl> 4.0, 3.0, 1.0, 1.0, 5.0, 4.0, 4.0, 5.0, 3.0, 4.0, 1.0, 4.0, 5…

full_grid <- users_df |>
  crossing(movies_df) |>
  left_join(ratings_df, by = c("user_id","name","movie_id","title"))

full_grid |>
  summarise(missing_ratings = sum(is.na(rating)))

# A tibble: 1 × 1
  missing_ratings
            <int>
1              24

5. Summary the rating from users

count(ratings_df, name)

  name n
1   CO 2
2   HP 2
3    J 2
4    M 6
5    W 2
6   ZH 2
7   ZY 2

count(ratings_df, title)

                                        title n
1            Captain America: Brave New World 2
2                        Disney's Snow White  2
3               Now You See Me: Now You Don’t 3
4                           Stranger Things 5 5
5 The SpongeBob Movie: Search for SquarePants 3
6                                  Zootopia 2 3

summary(ratings_df$rating)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   3.000   4.000   3.361   4.000   5.000

6. Finding the average.

user_summary <- full_grid |>
  group_by(title) |>
  summarise(
    n_ratings  = sum(!is.na(rating)),
    avg_rating = mean(rating, na.rm = TRUE),
    .groups = "drop"
  ) |>
  arrange(desc(avg_rating))

user_summary

# A tibble: 6 × 3
  title                                         n_ratings avg_rating
  <chr>                                             <int>      <dbl>
1 "Zootopia 2"                                          3       4.33
2 "The SpongeBob Movie: Search for SquarePants"         3       3.67
3 "Stranger Things 5"                                   5       3.6 
4 "Now You See Me: Now You Don’t"                       3       3.33
5 "Disney's Snow White "                                2       2.25
6 "Captain America: Brave New World"                    2       2

7. Conclusion

This project collected ratings from 1 to 5 for 6 titles and 7 users. Most people only watched about two objects, leaving 24 blank entries. Based on the code: —Zootopia 2 received 3 ratings with an average of 4.3
—The SpongeBob Movie: Search for SquarePants received 3 ratings with an average of 3.67
—Stranger Things Season 5 received 5 ratings with an average of 3.6
—Now You See Me: Now You Don’t received 3 ratings with an average of 3.3
—Disney’s Snow White received 2 ratings with an average of 2.25
—Captain America: Brave New World received 2 ratings with an average of 2.0