library(DBI)
library(RPostgres)
library(dplyr)
library(tidyr)Week 2 CodeBase: SQL and R – Movie Ratings
Objective
The goal of this assignment is to collect a small user–item ratings dataset (people × movies), store it in a SQL database (PostgreSQL), and analyze it in R. This project demonstrates an end-to-end workflow across data collection, relational storage, querying, loading into R, while also showing a reasonable strategy for missing ratings.
Data Collection Plan
- Items (Movies): I will select six recent popular movies (or similar media items) to increase the chance that participants have seen at least some of them.
- Users (Participants): I will survey at least five people.
- Rating Scale: Participants will rate any movie they have seen on a 1–5 integer scale.
- Missing Ratings: If a participant has not seen a movie, the rating will be recorded as missing (NULL) instead of forcing a guess.
Responses will be collected in a simple format (e.g., a small table) and then inserted into PostgreSQL.
Tools / Environment
- Database: PostgreSQL (local installation)
- GUI: pgAdmin 4 (for running SQL scripts and verifying tables)
- R Packages (planned):
DBI,RPostgres,tidyverse
Security Note: Database passwords will not be included in the code. Connection credentials will be stored using environment variables or masked placeholders.
Database Design (Normalized Schema)
To keep the solution professional and interview-ready, I will use a normalized relational schema to represent the many-to-many relationship between users and movies.
Planned tables:
users(user_id, name)movies(movie_id, title, release_year)ratings(rating_id, user_id, movie_id, rating)
Keys and Constraints (Planned)
users.user_idandmovies.movie_idwill be primary keysratings.user_idandratings.movie_idwill be foreign keys- Ratings will be constrained to the range 1–5
- A unique constraint on
(user_id, movie_id)will prevent duplicate ratings
This structure mirrors how real recommendation datasets are stored: users and movies are separate entities, and ratings are stored in a junction table.
Missing Data Strategy
Missing ratings are expected because participants may not have seen every movie.
I will handle missing data in two ways:
- In SQL: Missing ratings will be represented as
NULL(or omitted rows if a participant did not rate an item). - In R: When analyzing results, missing ratings will appear as
NA. Summary statistics (means, medians) will be computed withna.rm = TRUEto avoid bias from missingness.
To document missingness, I will report:
- ratings count per movie (coverage)
- ratings count per user
- how many missing ratings exist overall
Analysis Plan in R
After populating the database, I will load the data from PostgreSQL into R using DBI + RPostgres and:
- Join users, movies, and ratings into one tidy dataframe
- Compute simple summaries:
- average rating per movie (with number of ratings)
- average rating per user
- overall rating distribution (1–5)
- Optionally reshape into a user–item matrix (wide format) to illustrate how the dataset resembles input for collaborative filtering (no advanced recommender model required).
Reproducibility Plan
Even if I use pgAdmin to run queries, I will include the full SQL scripts required to:
- create the tables (
CREATE TABLE) - populate them (
INSERT INTO) - query them (
SELECT ... JOIN ...)
All code (SQL + R + Quarto) will be stored in a GitHub repository for submission, with sensitive credentials removed or masked.
Loading Libraries
Connect to PosgreSQL
con <- dbConnect(
RPostgres::Postgres(),
host = "localhost",
port = 5432,
dbname = "Movie _Ratings",
user = "postgres",
password = Sys.getenv("PGPASSWORD")
)
dbListTables(con) [1] "movies" "ratings" "users"
Load table in R
users_df <- dbGetQuery(con, "SELECT * FROM users;")
movies_df <- dbGetQuery(con, "SELECT * FROM movies;")
ratings_df <- dbGetQuery(con, "SELECT * FROM ratings;")Join tables in R
ratings_joined <- ratings_df %>%
left_join(users_df, by = "user_id") %>%
left_join(movies_df, by = "movie_id") %>%
select(name, title, release_year, rating) %>%
arrange(name, title)
ratings_joined name title release_year rating
1 Abdul Dune: Part Two 2024 4
2 Abdul Spider-Man: Across the Spider-Verse 2023 4
3 Abdul The Batman 2022 3
4 Abdul Top Gun: Maverick 2022 4
5 Ali Barbie 2023 3
6 Ali Oppenheimer 2023 4
7 Ali Top Gun: Maverick 2022 5
8 Daniyal Dune: Part Two 2024 3
9 Daniyal Oppenheimer 2023 5
10 Daniyal The Batman 2022 4
11 Daniyal Top Gun: Maverick 2022 5
12 Muhammad Dune: Part Two 2024 5
13 Muhammad Oppenheimer 2023 5
14 Muhammad Spider-Man: Across the Spider-Verse 2023 4
15 Samir Barbie 2023 4
16 Samir Oppenheimer 2023 5
17 Samir Spider-Man: Across the Spider-Verse 2023 5
Demonstrate missing ratings handling (Part 1)
ratings_complete <- users_df %>%
crossing(movies_df) %>%
left_join(ratings_df, by = c("user_id", "movie_id"))
ratings_complete# A tibble: 30 × 7
user_id name movie_id title release_year rating_id rating
<int> <chr> <int> <chr> <int> <int> <int>
1 1 Muhammad 1 Dune: Part Two 2024 1 5
2 1 Muhammad 2 Oppenheimer 2023 2 5
3 1 Muhammad 3 Barbie 2023 NA NA
4 1 Muhammad 4 The Batman 2022 NA NA
5 1 Muhammad 5 Spider-Man: Across t… 2023 3 4
6 1 Muhammad 6 Top Gun: Maverick 2022 NA NA
7 2 Ali 1 Dune: Part Two 2024 NA NA
8 2 Ali 2 Oppenheimer 2023 4 4
9 2 Ali 3 Barbie 2023 5 3
10 2 Ali 4 The Batman 2022 NA NA
# ℹ 20 more rows
Demonstrate missing ratings handling (Part 2)
ratings_complete %>%
summarise(
total_possible = n(),
missing_ratings = sum(is.na(rating)),
observed_ratings = sum(!is.na(rating))
)# A tibble: 1 × 3
total_possible missing_ratings observed_ratings
<int> <int> <int>
1 30 13 17
Simple summaries
ratings_complete %>%
group_by(title) %>%
summarise(avg_rating = mean(rating, na.rm = TRUE),
n_ratings = sum(!is.na(rating)))# A tibble: 6 × 3
title avg_rating n_ratings
<chr> <dbl> <int>
1 Barbie 3.5 2
2 Dune: Part Two 4 3
3 Oppenheimer 4.75 4
4 Spider-Man: Across the Spider-Verse 4.33 3
5 The Batman 3.5 2
6 Top Gun: Maverick 4.67 3
Disconnecting Database
dbDisconnect(con)