week2a_approach

Author

Sinem K Moschos

TV Show Ratings Assignment

Introduction

This assignment is for collecting rating data, store it in a database, and look at it using R. I want to understand the full process from start to finish. This includes collecting the data, saving it in a database, and doing some basic analysis. This assignment is not about building complex systems, but about learning how the tools work together.

Data Collection

For this project, I will create a small ratings dataset like the ones used on streaming apps. The people I ask the users, the television series will be the items, and the ratings will be numbers from one to five. Each person will rate the series they have watched. If someone has not seen a series, they will not give a rating. This creates missing data, which is common in real life. I will collect ratings from five people and include five series so that not everyone rates everything.

Database and Storage

I will use PostgreSQL to store the data. The database will have three tables. One table will store the people, one table will store the series, and one table will store the ratings. The ratings table will connect people and series and store the score each person gives. This setup allows one person to rate many series and one series to be rated by many people. Each table will have an id to keep the data organized.

Missing Data Handle

Missing ratings will happen when a person has not watched a series. My plan is to not save a rating when it does not exist. Later, when the data is loaded into R, these missing ratings will appear as NA. This makes it easy to see which ratings are missing.

Data Loading and R Analysis

After the data is saved in PostgreSQL, I will connect to the database from R. In R, I will load the data into a dataframe and check that it looks correct. I will look at the first few rows and the structure of the data. I will do simple checks, like finding the average rating for each series and counting how many ratings exist.

Tools and Workflow

PostgreSQL will be used to store the data. SQL will be used to create tables and add data. PgAdmin 4 will be used to help manage the database. R will be used to load the data and do the analysis. This Quarto file will be the report for the assignment.

Expected Outcome

At the end of this assignment, I will have a small ratings dataset stored in a database and analyzed in R. This shows that I understand the basic ideas behind databases, SQL, and simple data analysis.

SQL Database Setup

People: Kostas, Summer, Adrian, Jack, Oscar TV Series: Escobar, Breaking Bad, Suits, Peaky Blinders, House of Cards

Create tables

First, I created a users table to store the people.

CREATE TABLE users ( user_id SERIAL PRIMARY KEY, name TEXT NOT NULL UNIQUE );

Second of all, I created series table for the series titles.

CREATE TABLE series ( series_id SERIAL PRIMARY KEY, title TEXT NOT NULL UNIQUE );

Lastly, I created ratings table for the rating value and connection betweeb users & series.

CREATE TABLE ratings ( user_id INT NOT NULL REFERENCES users(user_id), series_id INT NOT NULL REFERENCES series(series_id), rating INT CHECK (rating BETWEEN 1 AND 5), PRIMARY KEY (user_id, series_id) );

-inserted the five people who provided ratings.

INSERT INTO users (name) VALUES ('Kostas'), ('Summer'), ('Adrian'), ('Jack'), ('Oscar');

-inserted the series titles that were rated.

INSERT INTO series (title) VALUES ('Escobar'), ('Breaking Bad'), ('Suits'), ('Peaky Blinders'), ('House of Cards');

-Inserted the ratings into the ratings table. Missing ratings are handled by leaving those user and series pairs out. This part is kind of copy paste and adjusting ratings and names.

INSERT INTO ratings (user_id, series_id, rating) SELECT u.user_id, s.series_id, v.rating FROM (VALUES ('Kostas', 'Escobar', 4), ('Kostas', 'Breaking Bad', 5), ('Kostas', 'Peaky Blinders', 5), ('Kostas', 'House of Cards', 4),

('Summer', 'Breaking Bad', 4), ('Summer', 'Suits', 5), ('Summer', 'Peaky Blinders', 4),

('Adrian', 'Escobar', 3), ('Adrian', 'Breaking Bad', 5), ('Adrian', 'Suits', 4), ('Adrian', 'House of Cards', 4),

('Jack', 'Escobar', 4), ('Jack', 'Suits', 3), ('Jack', 'Peaky Blinders', 5), ('Jack', 'House of Cards', 5),

('Oscar', 'Breaking Bad', 4), ('Oscar', 'Suits', 4), ('Oscar', 'Peaky Blinders', 4), ('Oscar', 'House of Cards', 3) ) AS v(user_name, series_title, rating) JOIN users u ON u.name = v.user_name JOIN series s ON s.title = v.series_title;

Loading Data to into R

connected to PostgreSQL from R and load the ratings data into a dataframe. I stored my database credentials in a local .Renviron file so my password is not written in my code. ** I learned this part (Password hiding) via AI Compiler.

library(DBI)
library(RPostgres)
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(tidyr)

con <- dbConnect(
  RPostgres::Postgres(),
  host = Sys.getenv("PGHOST"),
  port = as.integer(Sys.getenv("PGPORT")),
  dbname = Sys.getenv("PGDATABASE"),
  user = Sys.getenv("PGUSER"),
  password = Sys.getenv("PGPASSWORD")
)

dbListTables(con)
[1] "ratings" "series"  "users"  
ratings_df <- dbGetQuery(con, "
  SELECT u.name AS user,
         s.title AS series,
         r.rating
  FROM ratings r
  JOIN users u ON r.user_id = u.user_id
  JOIN series s ON r.series_id = s.series_id
  ORDER BY u.name, s.title;
")

ratings_df
     user         series rating
1  Adrian   Breaking Bad      5
2  Adrian        Escobar      3
3  Adrian House of Cards      4
4  Adrian          Suits      4
5    Jack        Escobar      4
6    Jack House of Cards      5
7    Jack Peaky Blinders      5
8    Jack          Suits      3
9  Kostas   Breaking Bad      5
10 Kostas        Escobar      4
11 Kostas House of Cards      4
12 Kostas Peaky Blinders      5
13  Oscar   Breaking Bad      4
14  Oscar House of Cards      3
15  Oscar Peaky Blinders      4
16  Oscar          Suits      4
17 Summer   Breaking Bad      4
18 Summer Peaky Blinders      4
19 Summer          Suits      5
full_grid <- expand_grid(
  user = unique(ratings_df$user),
  series = unique(ratings_df$series)
)

ratings_full <- full_grid %>%
  left_join(ratings_df, by = c("user", "series")) %>%
  arrange(user, series)

ratings_full
# A tibble: 25 × 3
   user   series         rating
   <chr>  <chr>           <int>
 1 Adrian Breaking Bad        5
 2 Adrian Escobar             3
 3 Adrian House of Cards      4
 4 Adrian Peaky Blinders     NA
 5 Adrian Suits               4
 6 Jack   Breaking Bad       NA
 7 Jack   Escobar             4
 8 Jack   House of Cards      5
 9 Jack   Peaky Blinders      5
10 Jack   Suits               3
# ℹ 15 more rows
avg_by_series <- ratings_full %>%
  group_by(series) %>%
  summarise(
    avg_rating = mean(rating, na.rm = TRUE),
    num_ratings = sum(!is.na(rating))
  ) %>%
  arrange(desc(avg_rating))

avg_by_series
# A tibble: 5 × 3
  series         avg_rating num_ratings
  <chr>               <dbl>       <int>
1 Breaking Bad         4.5            4
2 Peaky Blinders       4.5            4
3 House of Cards       4              4
4 Suits                4              4
5 Escobar              3.67           3
dbDisconnect(con)

Result

In R, I successfully pulled the ratings data from PostgreSQL into dataframe. Then created a full user and series grid so missing ratings showed up as NA. After that, I calculated the average rating and the number of ratings for each series.