Week 2 CodeBase: SQL and R – Movie Ratings

Author

Muhammad Suffyan Khan

Published

February 8, 2026

Objective

The goal of this assignment is to collect a small user–item ratings dataset (people × movies), store it in a SQL database (PostgreSQL), and analyze it in R. This project demonstrates an end-to-end workflow across data collection, relational storage, querying, loading into R, while also showing a reasonable strategy for missing ratings.

Data Collection Plan

  • Items (Movies): I will select six recent popular movies (or similar media items) to increase the chance that participants have seen at least some of them.
  • Users (Participants): I will survey at least five people.
  • Rating Scale: Participants will rate any movie they have seen on a 1–5 integer scale.
  • Missing Ratings: If a participant has not seen a movie, the rating will be recorded as missing (NULL) instead of forcing a guess.

Responses will be collected in a simple format (e.g., a small table) and then inserted into PostgreSQL.

Tools / Environment

  • Database: PostgreSQL (local installation)
  • GUI: pgAdmin 4 (for running SQL scripts and verifying tables)
  • R Packages (planned): DBI, RPostgres, tidyverse

Security Note: Database passwords will not be included in the code. Connection credentials will be stored using environment variables or masked placeholders.

Database Design (Normalized Schema)

To keep the solution professional and interview-ready, I will use a normalized relational schema to represent the many-to-many relationship between users and movies.

Planned tables:

  • users(user_id, name)
  • movies(movie_id, title, release_year)
  • ratings(rating_id, user_id, movie_id, rating)

Keys and Constraints (Planned)

  • users.user_id and movies.movie_id will be primary keys
  • ratings.user_id and ratings.movie_id will be foreign keys
  • Ratings will be constrained to the range 1–5
  • A unique constraint on (user_id, movie_id) will prevent duplicate ratings

This structure mirrors how real recommendation datasets are stored: users and movies are separate entities, and ratings are stored in a junction table.

Missing Data Strategy

Missing ratings are expected because participants may not have seen every movie.

I will handle missing data in two ways:

  1. In SQL: Missing ratings will be represented as NULL (or omitted rows if a participant did not rate an item).
  2. In R: When analyzing results, missing ratings will appear as NA. Summary statistics (means, medians) will be computed with na.rm = TRUE to avoid bias from missingness.

To document missingness, I will report:

  • ratings count per movie (coverage)
  • ratings count per user
  • how many missing ratings exist overall

Analysis Plan in R

After populating the database, I will load the data from PostgreSQL into R using DBI + RPostgres and:

  • Join users, movies, and ratings into one tidy dataframe
  • Compute simple summaries:
    • average rating per movie (with number of ratings)
    • average rating per user
    • overall rating distribution (1–5)
  • Optionally reshape into a user–item matrix (wide format) to illustrate how the dataset resembles input for collaborative filtering (no advanced recommender model required).

Reproducibility Plan

Even if I use pgAdmin to run queries, I will include the full SQL scripts required to:

  • create the tables (CREATE TABLE)
  • populate them (INSERT INTO)
  • query them (SELECT ... JOIN ...)

All code (SQL + R + Quarto) will be stored in a GitHub repository for submission, with sensitive credentials removed or masked.

Loading Libraries

library(DBI)
library(RPostgres)
library(dplyr)
library(tidyr)

Connect to PosgreSQL

con <- dbConnect(
  RPostgres::Postgres(),
  host = "localhost",
  port = 5432,
  dbname = "Movie _Ratings",
  user = "postgres",
  password = Sys.getenv("PGPASSWORD")
)

dbListTables(con) 
[1] "movies"  "ratings" "users"  

Load table in R

users_df  <- dbGetQuery(con, "SELECT * FROM users;")
movies_df <- dbGetQuery(con, "SELECT * FROM movies;")
ratings_df <- dbGetQuery(con, "SELECT * FROM ratings;")

Join tables in R

ratings_joined <- ratings_df %>%
  left_join(users_df, by = "user_id") %>%
  left_join(movies_df, by = "movie_id") %>%
  select(name, title, release_year, rating) %>%
  arrange(name, title)

ratings_joined
       name                               title release_year rating
1     Abdul                      Dune: Part Two         2024      4
2     Abdul Spider-Man: Across the Spider-Verse         2023      4
3     Abdul                          The Batman         2022      3
4     Abdul                   Top Gun: Maverick         2022      4
5       Ali                              Barbie         2023      3
6       Ali                         Oppenheimer         2023      4
7       Ali                   Top Gun: Maverick         2022      5
8   Daniyal                      Dune: Part Two         2024      3
9   Daniyal                         Oppenheimer         2023      5
10  Daniyal                          The Batman         2022      4
11  Daniyal                   Top Gun: Maverick         2022      5
12 Muhammad                      Dune: Part Two         2024      5
13 Muhammad                         Oppenheimer         2023      5
14 Muhammad Spider-Man: Across the Spider-Verse         2023      4
15    Samir                              Barbie         2023      4
16    Samir                         Oppenheimer         2023      5
17    Samir Spider-Man: Across the Spider-Verse         2023      5

Demonstrate missing ratings handling (Part 1)

ratings_complete <- users_df %>%
  crossing(movies_df) %>%
  left_join(ratings_df, by = c("user_id", "movie_id"))

ratings_complete
# A tibble: 30 × 7
   user_id name     movie_id title                 release_year rating_id rating
     <int> <chr>       <int> <chr>                        <int>     <int>  <int>
 1       1 Muhammad        1 Dune: Part Two                2024         1      5
 2       1 Muhammad        2 Oppenheimer                   2023         2      5
 3       1 Muhammad        3 Barbie                        2023        NA     NA
 4       1 Muhammad        4 The Batman                    2022        NA     NA
 5       1 Muhammad        5 Spider-Man: Across t…         2023         3      4
 6       1 Muhammad        6 Top Gun: Maverick             2022        NA     NA
 7       2 Ali             1 Dune: Part Two                2024        NA     NA
 8       2 Ali             2 Oppenheimer                   2023         4      4
 9       2 Ali             3 Barbie                        2023         5      3
10       2 Ali             4 The Batman                    2022        NA     NA
# ℹ 20 more rows

Demonstrate missing ratings handling (Part 2)

ratings_complete %>%
  summarise(
    total_possible = n(),
    missing_ratings = sum(is.na(rating)),
    observed_ratings = sum(!is.na(rating))
  )
# A tibble: 1 × 3
  total_possible missing_ratings observed_ratings
           <int>           <int>            <int>
1             30              13               17

Simple summaries

ratings_complete %>%
  group_by(title) %>%
  summarise(avg_rating = mean(rating, na.rm = TRUE),
            n_ratings = sum(!is.na(rating)))
# A tibble: 6 × 3
  title                               avg_rating n_ratings
  <chr>                                    <dbl>     <int>
1 Barbie                                    3.5          2
2 Dune: Part Two                            4            3
3 Oppenheimer                               4.75         4
4 Spider-Man: Across the Spider-Verse       4.33         3
5 The Batman                                3.5          2
6 Top Gun: Maverick                         4.67         3

Disconnecting Database

dbDisconnect(con)