Assignment Two Overview

For this assignment, I need to survey friends about six recent movies, store the results in a SQL database, load that table into an R dataframe, and deal with the missing ratings along the way.

PGS Connection & Data Import

Table Creation in PGS

My survey of six friends about 6 recent movies yielded a table with six rows and seven columns, 42 cells in total. An extremely small dataset in the grand scheme of the universe. I set up pgAdmin 4 (PGA-4) on my MacBook Pro and just used the GUI to create a database called “data_607”, then a schema and a table.

This is the SQL to create the table:

-- Table: movie_survey_raw_results.raw_results

-- DROP TABLE IF EXISTS movie_survey_raw_results.raw_results;

CREATE TABLE IF NOT EXISTS movie_survey_raw_results.raw_results
(
    name text COLLATE pg_catalog."default" NOT NULL,
    deadpool_wolverine integer,
    dune_part_two integer,
    twisters integer,
    godzilla_kong_the_new_empire integer,
    quiet_place_day_one integer,
    inside_out_two integer,
    CONSTRAINT raw_results_pkey PRIMARY KEY (name)
)

TABLESPACE pg_default;

ALTER TABLE IF EXISTS movie_survey_raw_results.raw_results
    OWNER to postgres;

This is the SQL to import the CSV from my local computer. I removed the sensitive parts of the file path; you would need to make sure it points to wherever you store the file.

COPY movie_survey_raw_results.raw_results (name, deadpool_wolverine, dune_part_two, twisters,
    godzilla_kong_the_new_empire, quiet_place_day_one, inside_out_two)
FROM '[your_computer_folder]/607_homeworktwo_moviesurvey.csv'
DELIMITER ','
CSV HEADER;

I thought about making a connection to GCP to professionalize the setup a bit more, but I decided against it since managed database instances charge by the hour, even when idle. Unless you’re diligent, the costs can escalate quickly.

Getting Data Ready

While I’m new to data science and to many facets of wrangling unstructured data, I’m not new to relational databases and how data should be stored in them. For instance, I saved myself some future pain by simply entering the movie titles into my file as, for example, “deadpool_wolverine” instead of “Deadpool & Wolverine”. This was a small survey, so changing the values manually was easy enough. Column names in a relational database should be lowercase and free of special characters; otherwise you’ll have a bad time later on.
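
To make the same idea concrete, here is a small sketch (with made-up title strings, base R only) of how raw movie titles could be turned into database-safe column names:

# Hypothetical example: convert display titles into lowercase, underscore-only column names
titles <- c("Deadpool & Wolverine", "Dune: Part Two")

safe_names <- tolower(titles)                      # lowercase everything
safe_names <- gsub("[^a-z0-9]+", "_", safe_names)  # replace runs of other characters with "_"
safe_names <- gsub("^_+|_+$", "", safe_names)      # trim any leading/trailing underscores

safe_names
## [1] "deadpool_wolverine" "dune_part_two"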

As has been my experience working with programmatic connections to Amazon Web Services or Google Cloud Platform, the most difficult part of the connection and import was configuring the settings within PGS so that RStudio could access it. I also wanted to ensure the connection used global variables in RStudio instead of the raw values.
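
One way to keep the raw values out of the script (a sketch, assuming the credentials are saved in the project’s .Renviron file under the hypothetical names PGS_USERNAME and PGS_PW) is to read them with Sys.getenv():

# .Renviron would contain lines such as:
#   PGS_USERNAME=postgres
#   PGS_PW=my_actual_password
pgs_username <- Sys.getenv("PGS_USERNAME")
pgs_pw       <- Sys.getenv("PGS_PW")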

This is the SQL I ran in PGA-4 to make the schema where the results live the default. I did this after successfully establishing a connection to PGS but finding that nothing was returned.

ALTER ROLE postgres SET search_path TO movie_survey_raw_results;
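
An alternative, had I not wanted to change the role’s default, would have been to set the search path for the session from R after connecting (a sketch, reusing the pgs_db connection object created in the next section):

# set the schema search path for this connection only
dbExecute(pgs_db, "SET search_path TO movie_survey_raw_results")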

Connecting to PGS

I had to install the RPostgres and DBI packages as the foundation of the connection. I created global variables for the PGS username and password that I need to pass in the connection request. After connecting, I tested it by looking at the value returned by dbListTables(), which should be “raw_results”.

Hidden from your view (you, the reader) is the chunk I used to make the connection to PGS, because it contains my username and password. Here’s the code I ran, with the user and password supplied through variables I set locally.

pgs_db <- dbConnect(
  RPostgres::Postgres(),
  host = "localhost",           
  port = 5432,                  
  dbname = "data_607", 
  user = pgs_username,       
  password = pgs_pw     
)
library(RPostgres)
library(DBI)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(purrr)
library(knitr)
library(kableExtra)
## 
## Attaching package: 'kableExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     group_rows
dbListTables(pgs_db)
## [1] "raw_results"

Turning SQL Table Into Dataframe

This is a clean line of code that imports all the data and assigns it to a variable. Normally running SELECT * from a table without some sort of limiter is very bad form. However, this dataset is very small and I need all the data anyway.

movie_results_raw_df <- dbGetQuery(pgs_db, "SELECT * FROM movie_survey_raw_results.raw_results")

kable(movie_results_raw_df) %>%
  kable_styling(full_width = FALSE, bootstrap_options = "striped")
name     deadpool_wolverine   dune_part_two   twisters   godzilla_kong_the_new_empire   quiet_place_day_one   inside_out_two
MLS                      NA               2          3                             NA                    NA                5
Zach                      4               2         NA                              3                    NA               NA
Ken                       1              NA          1                             NA                     1               NA
Chris                    NA              NA         NA                              5                     2               NA
Kellie                   NA              NA         NA                             NA                    NA               NA
Samuel                    3              NA          4                             NA                    NA                4
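
As an aside, if this table were larger, the same pattern works with a limiter bolted onto the query; a sketch:

# pull only a handful of rows instead of the whole table
movie_results_sample_df <- dbGetQuery(
  pgs_db,
  "SELECT * FROM movie_survey_raw_results.raw_results LIMIT 3"
)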

We are now ready to clean up the nulls and discuss next steps.

Lies, Damn Lies, and Nulls

There are a lot of options for dealing with null or missing values, and the right choice depends heavily on the data you have: you can drop incomplete rows, leave the NAs in place, or impute a replacement value. Given how sparse this survey is, I chose imputation, filling each missing rating with the rounded mean of the ratings that movie did receive:

movie_results_mn <- movie_results_raw_df %>%
  # replace each NA with the rounded mean of that column (numeric columns only)
  mutate(across(where(is.numeric), ~ ifelse(is.na(.), round(mean(., na.rm = TRUE)), .)))

kable(movie_results_mn) %>%
  kable_styling(full_width = FALSE, bootstrap_options = "striped")
name     deadpool_wolverine   dune_part_two   twisters   godzilla_kong_the_new_empire   quiet_place_day_one   inside_out_two
MLS                       3               2          3                              4                     2                5
Zach                      4               2          3                              3                     2                4
Ken                       1               2          1                              4                     1                4
Chris                     3               2          3                              5                     2                4
Kellie                    3               2          3                              4                     2                4
Samuel                    3               2          4                              4                     2                4

We now have a dataset with no missing values that could be used in further analysis.
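
For comparison, the strictest of the other options (keeping only respondents who rated every movie) is a one-liner with tidyr, though here it would leave an empty dataframe since every person skipped at least one film, which is why I went with imputation instead. A sketch:

# alternative: drop any respondent with a missing rating
# (on this survey that removes everyone, since each person skipped at least one movie)
movie_results_complete_df <- movie_results_raw_df %>%
  drop_na()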

Closing Thoughts

For time, I had to cut my plan to “create a normalized set of tables that corresponds to the relationship between your movie viewing friends and the movies being rated.” It’s very rare that all the data you need lives in one table, and when it does, there are usually major performance issues. A company like Meta or Google generates petabytes of data a day, and that has to be spread across multiple data models, each with many different components.
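
Had I gotten to it, a natural first step toward that normalized design would have been to pivot the wide survey table into a long person-movie-rating table, which maps directly onto a ratings junction table between a friends table and a movies table. A sketch with tidyr:

# reshape to one row per (person, movie, rating), dropping movies a person did not rate
movie_ratings_long_df <- movie_results_raw_df %>%
  pivot_longer(
    cols = -name,            # every column except the respondent's name
    names_to = "movie",
    values_to = "rating"
  ) %>%
  filter(!is.na(rating))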