Academy Awards

Project: Academy Awards

Data Sources and The Process of Data Normalization

Presented by Shazia Khan

Data Design Process

Determine the purpose of the database
Find and organize the information required
Apply the normalization rules
Divide the information into tables
Turn information items into columns
Specify primary keys
Set up the table relationships
Upload the data

The Purpose of the database

… is to determine whether Best Film Editing is the best predictor of Best Picture.

Award

Find and organize the information required

The Source of the data:

https://www.aggdata.com/awards/oscar

The file :

academy_awards.csv

The Fields:

Year	Category	Nominee	Additional Info	won?
2010 (83rd)	Actor – Leading Role	Javier Bardem	Biutiful {'Uxbal'}	NO

Step 1

R Script Code: Step_1_R_Code_To_Cleanup.txt

Input :

Academy_awards.csv

Output:

Category.csv

Movie.csv

Academy_Awards_Temp.csv

What does this Step 1 R Code do?

Uses : library(dplyr) and library(tidyr)

Removes white spaces

Isolates Year number from String format in parenthesis and adds the century for four digit year

Extracts Movie name from Additional Info field

Exchanges Category and Nominee data where it is interchanged (the categories are identified)

Removes specical characters like {}, ()

What does this Step 1 R Code do? (cont'd)

Changes data for Won? from yes/ no to binary 0/ 1

Removes categories that are not associated with movies for ex. Honorary Award, Scientific and Technical, Foreign Language

Deletes information that is not required ex. empty columns, Additional Information after extracting relevant information.

Trims and changes data to uppercase

Writes to CSV file Movie.csv, Category.csv, and Academy_Awards_Temp.csv

What does this Step 1 R Code do? (cont'd)

Movie.csv

Movie_Name	year
BIUTIFUL	2010

Category.csv

category_name
ACTOR – LEADING ROLE

Nominations.csv

movie_id	category_id	nomination_number	won
1	1	1	0

Step 2 Postgresql Code

Postgresql Code: Step_2_Postgresql_Code.txt

Input :

Category.csv

Movie.csv

Academy_Awards_Temp.csv (Nominations.csv)

Output:

Awards_File.csv

What does this Step 2 Postgresql Code do?

Reads the 3 files Output from Step 1 R Code

Creates Award Database

Creates 3 tables (Movie, Category and Nominations)

Joins 3 tables in one select statement to output Awards.csv file for analysis

Entity Relationship Diagram

ERD

Table 1 - Movie Table

MOVIE_ID SERIAL CONSTRAINT MOVIE_PK1 PRIMARY KEY

MOVIE_NAME VARCHAR(300) NOT NULL

YEAR INTEGER NOT NULL

Table 2 - Category Table

CATEGORY_ID SERIAL CONSTRAINT CATEGORY_PK1 PRIMARY KEY

CATEGORY_NAME VARCHAR(300) NOT NULL

Table 3 - Nomination Table

MOVIE_ID INTEGER NOT NULL

CATEGORY_ID INTEGER NOT NULL

NOMINATION_NUMBER INTEGER NOT NULL

WON INTEGER NOT NULL CONSTRAINT NOMINATIONS_CHK1 CHECK (WON IN (0,1))

CONSTRAINT NOMINATIONS_PK PRIMARY KEY (MOVIE_ID, CATEGORY_ID, NOMINATION_NUMBER)

CONSTRAINT NOMINATIONS_FK1 FOREIGN KEY (MOVIE_ID) REFERENCES MOVIE(MOVIE_ID)

CONSTRAINT NOMINATIONS_FK2 FOREIGN KEY (CATEGORY_ID) REFERENCES CATEGORY(CATEGORY_ID)

Start to Finish

Start - Academy_awards.csv

Year	Category	Nominee	Additional Info	won?
2010 (83rd)	Actor – Leading Role	Javier Bardem	Biutiful {'Uxbal'}	NO

Finish - Awards.csv

movie_id	movie_name	year	category_id	category_name	won
1	BIUTIFUL	2010	1	ACTOR – LEADING ROLE	0