Academy Awards

Project: Academy Awards

Data Sources and The Process of Data Normalization

Presented by Shazia Khan

Data Design Process

  1. Determine the purpose of the database
  2. Find and organize the information required
  3. Apply the normalization rules
  4. Divide the information into tables
  5. Turn information items into columns
  6. Specify primary keys
  7. Set up the table relationships
  8. Upload the data

The Purpose of the database

is to determine whether Best Film Editing is the best predictor of Best Picture.

Award

Find and organize the information required

The Source of the data:

The file :

  • academy_awards.csv

The Fields:

Year Category Nominee Additional Info won?
2010 (83rd) Actor – Leading Role Javier Bardem Biutiful {'Uxbal'} NO

Step 1

  • R Script Code: Step_1_R_Code_To_Cleanup.txt

  • Input :

    1. Academy_awards.csv
  • Output:

    1. Category.csv
    2. Movie.csv
    3. Academy_Awards_Temp.csv

What does this Step 1 R Code do?

  • Uses : library(dplyr) and library(tidyr)
  • Removes white spaces
  • Isolates Year number from String format in parenthesis and adds the century for four digit year
  • Extracts Movie name from Additional Info field
  • Exchanges Category and Nominee data where it is interchanged (the categories are identified)
  • Removes specical characters like {}, ()

What does this Step 1 R Code do? (cont'd)

  • Changes data for Won? from yes/ no to binary 0/ 1
  • Removes categories that are not associated with movies for ex. Honorary Award, Scientific and Technical, Foreign Language
  • Deletes information that is not required ex. empty columns, Additional Information after extracting relevant information.
  • Trims and changes data to uppercase
  • Writes to CSV file Movie.csv, Category.csv, and Academy_Awards_Temp.csv

What does this Step 1 R Code do? (cont'd)

Movie.csv

Movie_Name year
BIUTIFUL 2010

Category.csv

category_name
ACTOR – LEADING ROLE

Nominations.csv

movie_id category_id nomination_number won
1 1 1 0

Step 2 Postgresql Code

  • Postgresql Code: Step_2_Postgresql_Code.txt

  • Input :

    1. Category.csv
    2. Movie.csv
    3. Academy_Awards_Temp.csv (Nominations.csv)
  • Output:

    1. Awards_File.csv

What does this Step 2 Postgresql Code do?

  • Reads the 3 files Output from Step 1 R Code
  • Creates Award Database
  • Creates 3 tables (Movie, Category and Nominations)
  • Joins 3 tables in one select statement to output Awards.csv file for analysis

Entity Relationship Diagram

ERD

Table 1 - Movie Table

  • MOVIE_ID SERIAL CONSTRAINT MOVIE_PK1 PRIMARY KEY
  • MOVIE_NAME VARCHAR(300) NOT NULL
  • YEAR INTEGER NOT NULL

Table 2 - Category Table

  • CATEGORY_ID SERIAL CONSTRAINT CATEGORY_PK1 PRIMARY KEY
  • CATEGORY_NAME VARCHAR(300) NOT NULL

Table 3 - Nomination Table

  • MOVIE_ID INTEGER NOT NULL
  • CATEGORY_ID INTEGER NOT NULL
  • NOMINATION_NUMBER INTEGER NOT NULL
  • WON INTEGER NOT NULL CONSTRAINT NOMINATIONS_CHK1 CHECK (WON IN (0,1))
  • CONSTRAINT NOMINATIONS_PK PRIMARY KEY (MOVIE_ID, CATEGORY_ID, NOMINATION_NUMBER)
  • CONSTRAINT NOMINATIONS_FK1 FOREIGN KEY (MOVIE_ID) REFERENCES MOVIE(MOVIE_ID)
  • CONSTRAINT NOMINATIONS_FK2 FOREIGN KEY (CATEGORY_ID) REFERENCES CATEGORY(CATEGORY_ID)

Start to Finish

Start - Academy_awards.csv

Year Category Nominee Additional Info won?
2010 (83rd) Actor – Leading Role Javier Bardem Biutiful {'Uxbal'} NO

Finish - Awards.csv

movie_id movie_name year category_id category_name won
1 BIUTIFUL 2010 1 ACTOR – LEADING ROLE 0