Data Sources and The Process of Data Normalization
Presented by Shazia Khan
The Source of the data:
The file :
- academy_awards.csv
The Fields:
| Year | Category | Nominee | Additional Info | won? |
|---|---|---|---|---|
| 2010 (83rd) | Actor – Leading Role | Javier Bardem | Biutiful {'Uxbal'} | NO |
R Script Code: Step_1_R_Code_To_Cleanup.txt
Input :
- Academy_awards.csv
Output:
- Category.csv
- Movie.csv
- Academy_Awards_Temp.csv
- Uses : library(dplyr) and library(tidyr)
- Removes white spaces
- Isolates Year number from String format in parenthesis and adds the century for four digit year
- Extracts Movie name from Additional Info field
- Exchanges Category and Nominee data where it is interchanged (the categories are identified)
- Removes specical characters like {}, ()
- Changes data for Won? from yes/ no to binary 0/ 1
- Removes categories that are not associated with movies for ex. Honorary Award, Scientific and Technical, Foreign Language
- Deletes information that is not required ex. empty columns, Additional Information after extracting relevant information.
- Trims and changes data to uppercase
- Writes to CSV file Movie.csv, Category.csv, and Academy_Awards_Temp.csv
Movie.csv
| Movie_Name | year |
|---|---|
| BIUTIFUL | 2010 |
Category.csv
| category_name |
|---|
| ACTOR – LEADING ROLE |
Nominations.csv
| movie_id | category_id | nomination_number | won |
|---|---|---|---|
| 1 | 1 | 1 | 0 |
Postgresql Code: Step_2_Postgresql_Code.txt
Input :
- Category.csv
- Movie.csv
- Academy_Awards_Temp.csv (Nominations.csv)
Output:
- Awards_File.csv
- Reads the 3 files Output from Step 1 R Code
- Creates Award Database
- Creates 3 tables (Movie, Category and Nominations)
- Joins 3 tables in one select statement to output Awards.csv file for analysis
Table 1 - Movie Table
- MOVIE_ID SERIAL CONSTRAINT MOVIE_PK1 PRIMARY KEY
- MOVIE_NAME VARCHAR(300) NOT NULL
- YEAR INTEGER NOT NULL
Table 2 - Category Table
- CATEGORY_ID SERIAL CONSTRAINT CATEGORY_PK1 PRIMARY KEY
- CATEGORY_NAME VARCHAR(300) NOT NULL
Table 3 - Nomination Table
- MOVIE_ID INTEGER NOT NULL
- CATEGORY_ID INTEGER NOT NULL
- NOMINATION_NUMBER INTEGER NOT NULL
- WON INTEGER NOT NULL CONSTRAINT NOMINATIONS_CHK1 CHECK (WON IN (0,1))
- CONSTRAINT NOMINATIONS_PK PRIMARY KEY (MOVIE_ID, CATEGORY_ID, NOMINATION_NUMBER)
- CONSTRAINT NOMINATIONS_FK1 FOREIGN KEY (MOVIE_ID) REFERENCES MOVIE(MOVIE_ID)
- CONSTRAINT NOMINATIONS_FK2 FOREIGN KEY (CATEGORY_ID) REFERENCES CATEGORY(CATEGORY_ID)
Start - Academy_awards.csv
| Year | Category | Nominee | Additional Info | won? |
|---|---|---|---|---|
| 2010 (83rd) | Actor – Leading Role | Javier Bardem | Biutiful {'Uxbal'} | NO |
Finish - Awards.csv
| movie_id | movie_name | year | category_id | category_name | won |
|---|---|---|---|---|---|
| 1 | BIUTIFUL | 2010 | 1 | ACTOR – LEADING ROLE | 0 |