2A SQL & R Deliverable

Author

Desiree Thomas

Published

February 8, 2026

Data is a highly value asset and used by virtually all companies. This data is often used by streaming services such as Netflix, YouTube, HBO Max, Paramount+, and many more. Compare your experiences with these streaming services vs. your experience with cable TV. I am going to develop my business approach/question further: https://arxiv.org/pdf/1609.02451.

We are being asked to create the dataset that will later be used for a recommender system. I chose these 6 recent popular movies: ‘Deadpool & Wolverine’(2024), ‘Dune: Part 2’ (2024), ‘Barbie’ (2023), ‘Oppenheimer’ (2023), ’Spider-Man: Across the Spider-Verse (2023). For this assignment, I chose to generate synthetic data and in a higher number than required. The guidelines for this required only 5 users, however, since this setup will be used for a future recommender system, I realized I would need more data if I were to find relevant patterns.

I also generated an Account and Profile ID’s due to personal experience of using YouTube and Netflix, etc. There one Account may have multiple users. I know that they need ways to track their users in order to suggest recommendations and tailor their users experience. This turned out to be related to Normalization, which was an optional feature of the homework assignment.

I ensured that movies whose ratings were blank, were not deleted. Instead they were changed to NULLS/NA. Theoretically, there would be nothing to recommend if the data only contained movies that they had already seen. Perhaps, it would default to the movies that users rated the highest to encourage them to ‘Watch Again’?

I connected RStudio/R to the database using DBI and RPostgres to ensure that the database is secure and password protected. I used LEFT JOINS in SQL within the queries so I could keep the NULL values. I also used the tidyverse library, to calculate the user and movie metrics.

##Package Installation Install these packages if you do not already have them: “DBI”, “RPostgres”, “tidyverse”, “rstudioapi”

##Load Packages

library(DBI)
library(RPostgres)
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.2.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(rstudioapi)

##Set Up Connection

con <- dbConnect(RPostgres::Postgres(),
                 dbname = "movie_ratings",
                 host = "localhost", 
                 port = 5432,
                 user = "postgres",
                 password = "pijza3-havkud-dewTef") #This is a false password

#Getting the Data

SQL_query <- "
  SELECT  p.profile_name, m.title,r.rating
  FROM ratings r
  LEFT JOIN profiles p ON r.profile_id = p.profile_id
  LEFT JOIN movies m ON r.movie_id = m.movie_id;
"
df <- dbGetQuery(con, SQL_query)

print(head(df))
    profile_name                title rating
1   Nancy Nelson Deadpool & Wolverine      4
2   Harry Cooper               Barbie     NA
3   Nancy Nelson               Barbie      3
4   Thomas Allen          Oppenheimer     NA
5 Nicholas Olson          Oppenheimer      3
6 Adam Gutierrez         Inside Out 2      4

##Analyzing the Data

#Homework Analysis chunk


movie_watchers_analysis <- df %>% group_by(profile_name) %>% 
  summarise(
    movies_rated = sum(!is.na(rating)), 
    movies_not_watched = sum(is.na(rating)),
    total_watched_and_not_watched = n()
) %>% 
  arrange(desc(movies_rated))

print("Ranking The Users")
[1] "Ranking The Users"
print(movie_watchers_analysis)
# A tibble: 75 × 4
   profile_name    movies_rated movies_not_watched total_watched_and_not_watched
   <chr>                  <int>              <int>                         <int>
 1 Aaron Powell               6                  0                             6
 2 Bruce Stewart              6                  0                             6
 3 Jack Phillips              6                  0                             6
 4 Joshua Castillo            6                  0                             6
 5 Mary Jackson               6                  0                             6
 6 Phillip Snyder             6                  0                             6
 7 Walter Marshall            6                  0                             6
 8 Adam Gutierrez             5                  1                             6
 9 Alan Chavez                5                  1                             6
10 Bruce Harrison             5                  1                             6
# ℹ 65 more rows
movie_stats <- df %>% 
  group_by(title) %>%
  summarise(
    average_score = mean(rating, na.rm = TRUE), 
    count = sum(!is.na(rating))
  ) %>% 
  arrange(desc(average_score))

print("Ranking The Movies")
[1] "Ranking The Movies"
print(movie_stats)
# A tibble: 6 × 3
  title                               average_score count
  <chr>                                       <dbl> <int>
1 Barbie                                       3.31    36
2 Spider-Man: Across the Spider-Verse          3.20    49
3 Inside Out 2                                 3       42
4 Dune: Part Two                               2.95    44
5 Oppenheimer                                  2.94    47
6 Deadpool & Wolverine                         2.91    57
#Count the average ratings per user and movie AND computer average ratings

#Users 
count_users <- df %>% 
  group_by(profile_name)%>%
  summarise(
    movies_rated = sum(!is.na(rating))
  ) %>% 
  arrange(desc(movies_rated))

print("Ratings Per User")
[1] "Ratings Per User"
print(count_users)
# A tibble: 75 × 2
   profile_name    movies_rated
   <chr>                  <int>
 1 Aaron Powell               6
 2 Bruce Stewart              6
 3 Jack Phillips              6
 4 Joshua Castillo            6
 5 Mary Jackson               6
 6 Phillip Snyder             6
 7 Walter Marshall            6
 8 Adam Gutierrez             5
 9 Alan Chavez                5
10 Bruce Harrison             5
# ℹ 65 more rows
dbDisconnect(con)

#LLM Transcript

Results

The data contained a total of 450 rows.

Barbie had the highest average score but had the lowest count of ratings with only 36 in comparison to Deadpool & Wolverine which had the highest number of ratings while holding the lowest average score.

#LLM Prompt

User: I need you to set up the PostGres connection.

LLM: Copy this into your script. This loads your tools and opens the secure tunnel to Postgres.

Why this matters: rstudioapi keeps your password out of the code. This is the “Portfolio Standard.”

R 1. SETUP library(DBI) library(RPostgres) library(tidyverse) library(rstudioapi)

CONNECT When you run this, look for the popup box to type your password. con <- dbConnect(RPostgres::Postgres(), dbname = “movie_ratings”, host = “localhost”, port = 5432, user = “postgres”, password = rstudioapi::askForPassword(“Database Password”)) Run Block 1. (Ctrl+Enter).

Checkpoint: Did the popup appear? Did it run without error?

LLM’s explanation of code: Think of RStudio and Postgres as two separate buildings. Right now, the doors are locked. We need to build a secure tunnel between them.

We use the DBI library (Database Interface) as the blueprint for the tunnel, and RPostgres as the specific materials for a Postgres tunnel.

The Syntax: To open the tunnel, we use a function called dbConnect(). It needs 5 pieces of ID:

Driver: What kind of database is it? (RPostgres::Postgres())

DB Name: dbname

Host: Where is it? (Usually localhost)

User: Who are you? (postgres)

Password: The key.

Your Challenge: I want you to write the connection code.

Security Tip: Instead of typing your password in quotes (which is unsafe), use this function: rstudioapi::askForPassword(“Enter Password”). This forces R to pop up a secure window.

Try writing the con <- dbConnect(…) block in your script. Paste what you have here so I can check it.

LLM Citation Google DeepMind. (2025). Gemini 3 Pro. [Large Language model]. https://gemini.google.com. Acessed February 7, 2026.