You are a data engineer at ABC, responsible for designing, developing, constructing, and maintaining the data architecture for ABC's employment-related data to support policy analysis.
Your colleagues from the policy department have approached you for assistance, as they would like to know:
For both (i) and (ii), they have indicated that they would also be interested in the answers to these questions for subsequent time periods, as the data for those periods comes in.
At your disposal are 5 raw data tables, covering the following:
How will you clean the available data, declare data types for every data field, and design a database schema that allows you to address your policy colleagues' questions?
Additionally, knowing that the raw data for later time periods will periodically be transferred to you in the future, how will you schedule regular batch job processing for the 5 data tables, in a way that will allow you to continue answering variants of your policy colleagues’ questions for later time periods?
Please detail your methodology and ETL/ELT process.
The most common model for database management is the relational model, invented in 1970 by Edgar F. Codd, an English computer scientist working for IBM. It remains the leading database model for commercial applications. The most common Relational Database Management Systems (RDBMSs) include Oracle DB, IBM DB2, Microsoft SQL Server, MySQL, and PostgreSQL. An RDBMS is chosen based on cost, operating system support, and other features, and is queried using Structured Query Language (SQL).
A relational database organises data into tables, also known as relations. A database table is similar to a spreadsheet, consisting of columns and rows. Each column, also known as a field or attribute, holds one specific type of information about the entity in question. A particular attribute, or combination of attributes, is chosen as the primary key; when it is referenced in other tables, it is known as a foreign key. Each row, also known as a record or tuple, holds the data for a specific instance of the entity in question.
Different types of relationships can be created among tables, including one-to-one, one-to-many, and many-to-many relationships. This enables a relational database to store and retrieve large amounts of data effectively. Tables can be normalised to make the database flexible, adaptable, and scalable. When normalised, each piece of data is atomic, broken down into its smallest useful size.
A Well-Designed Database:
Phases of the Database Design Process:
We need to gather information to understand the purpose of our database. This will help us to choose the types of data and entities we want to store. This information becomes our data dictionary, which documents the tables, columns, and data types used in our database. We can consider different perspectives for our Database by:
For our Database, its purpose is to determine:
Once we have decided on the purpose of our database, we need to gather, clean, and organise our data into subject-based tables, each consisting of columns and rows, like a spreadsheet. Start by creating a table for each type of entity in our data. Each row, also known as a record or tuple, includes data about something or someone. Each column, also known as a field or attribute, contains a single type of information that appears in each row.
The appropriate data type is assigned to each column to keep the data consistent from one row to the next. Common Data Types include:
To create a visual overview of our database, known as an Entity-Relationship (ER) Diagram, each table will be represented by a box in the diagram. The title of each box will indicate the data it describes, and its columns will be listed within it.
For our Database, we need to gather additional information in order to answer our questions.
For our Database, we will use some of the most common data cleaning techniques in SQL. The most common data types we have are numbers (numeric) and text (string).
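As an illustration, a few of these cleaning steps might look like the following sketch in PostgreSQL-flavoured SQL. The table and column names (raw_employment_records, employer_id, company_uen, salary, record_month) are assumptions for illustration only, since the raw table layouts are not reproduced here, and the salary field is assumed to arrive as free text.

    -- Trim stray whitespace and standardise the case of text identifiers
    UPDATE raw_employment_records
    SET employer_id = UPPER(TRIM(employer_id)),
        company_uen = UPPER(TRIM(company_uen));

    -- Treat empty strings and placeholder text as missing values
    UPDATE raw_employment_records
    SET salary = NULL
    WHERE TRIM(salary) IN ('', 'NA', 'N/A');

    -- Remove exact duplicate rows, keeping one copy of each
    DELETE FROM raw_employment_records a
    USING raw_employment_records b
    WHERE a.ctid > b.ctid
      AND a.employer_id  = b.employer_id
      AND a.company_uen  = b.company_uen
      AND a.record_month = b.record_month;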
For our Database, we will assign appropriate data types for each column to ensure data consistency.
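For example, a cleaned employment-records table might be declared with explicit types roughly as follows; the column names and types are illustrative assumptions in PostgreSQL syntax, not the actual raw fields.

    CREATE TABLE employment_records (
        employer_id   VARCHAR(12)  NOT NULL,  -- FDW employer identification number
        company_uen   VARCHAR(10)  NOT NULL,  -- company Unique Entity Number (UEN)
        record_month  DATE         NOT NULL,  -- reporting period the record refers to
        salary        NUMERIC(10,2),          -- monthly salary, where provided
        is_active     BOOLEAN      DEFAULT TRUE
    );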
In our relational model, a table cannot contain duplicate rows, as this will create ambiguities during retrieval. To ensure a unique identity, we need to decide which columns will serve as the primary key for each table. A primary key is a unique identifier for a given entity.
A Primary Key should have the following Characteristics:
Therefore, order numbers and user names make good primary keys. A primary key is known as a simple key if it consists of a single column. If it consists of multiple columns, it is known as a composite key.
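As a sketch, a simple key and a composite key for the illustrative tables above could be declared as follows.

    -- Simple key: a single column identifies each company
    CREATE TABLE company (
        company_uen  VARCHAR(10) PRIMARY KEY,
        company_name TEXT NOT NULL
    );

    -- Composite key: an employment record is identified by the combination
    -- of employer, company, and reporting period
    ALTER TABLE employment_records
        ADD PRIMARY KEY (employer_id, company_uen, record_month);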
For our Database, the main reference table is “Workforce Employment Records” as it contains both the Identification Number of FDW Employers and the Unique Entity Number (UEN) of Companies. This allows it to have many-to-many relationships with other tables. Essentially, the Primary Keys for each table are as follows:
The power of relational databases lies in the relationships among tables. Each entity can potentially have a relationship with another entity, and these relationships usually fall into one of three types:
When there is only one instance of Entity A for every instance of Entity B, they are said to have a one-to-one (1:1) relationship. We can represent this in an ER Diagram with a line and a dash on both ends. It is usually better to combine the data from two tables in a 1:1 relationship into a single table. However, we might want to keep separate tables in a 1:1 relationship under certain circumstances: a 1:1 relationship is useful for storing sensitive data in a secure table, with the non-sensitive data stored in the main table.
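A minimal sketch of that pattern, assuming a hypothetical employer table whose sensitive contact details are split into a separate, access-restricted table:

    -- Main table holds the non-sensitive attributes
    CREATE TABLE employer (
        employer_id  VARCHAR(12) PRIMARY KEY,
        date_joined  DATE
    );

    -- 1:1 side table holds the sensitive attributes; it shares the same
    -- primary key, so each employer has at most one row here
    CREATE TABLE employer_sensitive (
        employer_id         VARCHAR(12) PRIMARY KEY REFERENCES employer (employer_id),
        residential_address TEXT,
        contact_number      TEXT
    );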
A one-to-many (1:M) relationship occurs when a row in one table is associated with multiple rows in another table. These relationships are represented with crow's foot notation in an ER Diagram. To create a 1:M relationship, add the primary key from the table on the “one” side of the relationship as a column in the table on the “many” side. When a primary key is listed in another table in this way, it is known as a foreign key.
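Continuing the illustrative schema, one company can appear in many employment records, so the company primary key is added to the records table as a foreign key:

    -- One company (parent) to many employment records (child)
    ALTER TABLE employment_records
        ADD CONSTRAINT fk_employment_company
        FOREIGN KEY (company_uen) REFERENCES company (company_uen);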
When multiple entities in one table can be associated with multiple entities in another table, they are said to have a many-to-many (M:N) relationship. These relationships are represented with crow's foot notation at both ends in an ER Diagram. In practice, an M:N relationship cannot be implemented directly in a relational database. We have to break it up into two 1:M relationships by introducing a go-between entity known as a link table or junction table. Each row in the link table matches an entity from one table with an entity from the other, and can also carry supplemental information.
For our Database, we would like to create multiple M:N relationships among our tables. As mentioned above, our main reference table is “Workforce Employment Records”.
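In the illustrative schema, the records table itself plays the role of the junction table: with its composite primary key, the company foreign key added above, and an employer foreign key added below, each row pairs one employer with one company for a reporting period and carries the supplemental columns, resolving the M:N relationship into two 1:M relationships.

    -- Second leg of the junction: many records per employer
    ALTER TABLE employment_records
        ADD CONSTRAINT fk_employment_employer
        FOREIGN KEY (employer_id) REFERENCES employer (employer_id);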
We need to apply normalisation rules, the industry standard for making sure our database is structured correctly and optimally. Online Transaction Processing (OLTP) databases should be normalised, as users are concerned with creating, reading, updating, and deleting records. Online Analytical Processing (OLAP) databases, which favour analysis and reporting, might fare better with a degree of de-normalisation, as the emphasis is on calculation speed. Each level of normalisation includes the rules associated with the lower levels.
First Normal Form (1NF): A table is in First Normal Form (1NF) if each cell contains a single value rather than a list of values. A table with groups of repeated or closely related columns does not meet 1NF. First, we split our data into multiple tables until each cell holds only one value and there are no redundant columns. At this point, our data is atomic, broken down into its smallest useful size.
Second Normal Form (2NF): A table is in Second Normal Form (2NF) if it is in 1NF and every non-key column is fully dependent on the primary key. If the primary key is made up of multiple columns, every non-key column has to depend on the entire set of key columns, not just part of it. Otherwise, the table violates 2NF.
Third Normal Form (3NF): A table is in Third Normal Form (3NF) if it is in 2NF and every non-key column is independent of every other non-key column. Essentially, the non-key columns depend only on the primary key and nothing else. Otherwise, the table violates 3NF.
Higher Normal Forms: Higher normal forms include the Boyce-Codd Normal Form, the Fourth through Sixth Normal Forms, and the Domain-Key Normal Form. However, the first three normal forms are the most common. While these three forms lay out the best practices to follow, the appropriate degree of normalisation largely depends on the database.
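As a brief, hypothetical illustration of 3NF, suppose an address table stored employer_id, postal_code, and town. The town depends on the postal code rather than on the key, so it is moved into its own table:

    -- 3NF fix: move the transitive dependency (postal_code -> town)
    -- out of the address table and into a lookup table
    CREATE TABLE postal_area (
        postal_code VARCHAR(6) PRIMARY KEY,
        town        TEXT NOT NULL
    );

    CREATE TABLE employer_address (
        employer_id VARCHAR(12) PRIMARY KEY REFERENCES employer (employer_id),
        postal_code VARCHAR(6)  NOT NULL REFERENCES postal_area (postal_code)
    );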
We need to apply Data Integrity Rules to validate our data.
Entity Integrity Rule: The Entity Integrity Rule states that the primary key cannot be NULL. If a primary key is made up of multiple columns, none of them can be NULL. Otherwise, it cannot identify a unique row. Most RDBMSs check and enforce this rule.
Referential Integrity Rule: The Referential Integrity Rule requires each foreign key to be matched with a primary key in the parent table. We can insert a row with a foreign key in the child table only if the value exists in the parent table. If a primary key value changes in the parent table, the change must be propagated wherever that key is referenced in the database (for example, via an ON UPDATE CASCADE rule). Most RDBMSs check and enforce this rule.
Business Logic Integrity Rule: The Business Logic Integrity Rule ensures that our data fits prevalent logical parameters. This is carried out by validation rules (such as CHECK constraints) or programming logic; unlike the first two rules, the RDBMS enforces it only to the extent that we define such constraints.
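In the illustrative schema above, the NOT NULL and PRIMARY KEY declarations enforce entity integrity and the FOREIGN KEY constraints enforce referential integrity. Business-logic rules can then be expressed as CHECK constraints, for example:

    -- Salaries, where present, cannot be negative
    ALTER TABLE employment_records
        ADD CONSTRAINT chk_salary_non_negative
        CHECK (salary IS NULL OR salary >= 0);

    -- A UEN is expected to be 9 or 10 characters long
    ALTER TABLE employment_records
        ADD CONSTRAINT chk_uen_length
        CHECK (char_length(company_uen) IN (9, 10));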
We can create an index for selected columns to facilitate our data search and retrieval. An index is a sorted copy of one or more columns, with the values either in ascending or descending order. Instead of re-sorting for each query, data can be accessed in the order specified by the index, allowing for a quicker search.
Even though an index can speed up data retrieval, it can also slow down insertions, deletions, and updates, because the index has to be updated whenever a record changes. Most RDBMSs automatically build an index on the primary key.
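For the time-period questions, an index on the reporting period (and on company-within-period look-ups) would be a natural starting point in the illustrative schema:

    -- Speed up queries that filter on a given reporting period
    CREATE INDEX idx_records_month
        ON employment_records (record_month);

    -- Speed up look-ups by company within a period
    CREATE INDEX idx_records_company_month
        ON employment_records (company_uen, record_month);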
The Extract, Transform, Load (ETL) Process is a data analytics pipeline that extracts information from raw data, transforms it into a format for analysis, and loads it into a target system such as a data warehouse. The ETL Process summarises data to reduce its size and improve performance for different types of analysis.
The Extract, Load, Transform (ELT) Process is an alternative to the ETL Process where data is extracted and loaded into a target system immediately. Today’s cloud-based data warehouse and data lake infrastructure supports large storage and scalable computation. The ELT Process is a data analytics pipeline that processes and maintains massive data pools in the cloud at low cost.
Here, we present two ways to build an ETL Process and one way to build an ELT Process.
A traditional ETL Process involves transferring and processing data in batches from source databases to a data warehouse. Building an enterprise ETL pipeline from scratch can be challenging, so tools such as Stitch or Blendo are often used to simplify and automate the process. In general, building an ETL Process with Batch Processing involves:
Reference Data: We need to create a dataset that defines permissible values for our data.
Extract from Data Sources: Subsequent ETL steps can only succeed if we extract our data correctly. Most ETL Processes combine data from multiple sources, each with its own format and organisation, including relational and non-relational databases. Successful extraction converts the data into a single format for standardised processing.
Data Validation: An automated process confirms whether the data pulled from each source has the expected values. The validation engine rejects data that fails the validation rules. Rejected records are analysed to identify the errors and the corrections or modifications needed, which helps resolve potential problems in future batches.
Transform Data: We need to program and test a series of rules and functions in order to achieve appropriate transformations for our data. We can then run this program to clean our data, apply business rules, check for data integrity, and create aggregates as necessary.
Stage: Data that has been transformed will be loaded into a staging database. This makes it easier to revert if something goes wrong. We can also generate audit reports for regulatory compliance, or diagnose and repair problems with our data.
Publish to Data Warehouse: Here, we load our data into the target tables. Some data warehouses overwrite existing information every time an ETL Process loads a new batch of data; this may happen on a daily, weekly, or monthly basis. In other cases, the ETL Process can append new data without overwriting, adding a timestamp to indicate which rows are new. This has to be done carefully to avoid running into disk-space limits and performance issues in the data warehouse.
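As a sketch of the append-without-overwriting approach, assuming each incoming batch is first loaded into a staging table (staging_employment_records is an illustrative name), the publish step can be an idempotent insert keyed on the natural key, with an assumed audit column recording when each row arrived. A scheduler such as cron or Apache Airflow would then run one such step per raw table whenever a new batch of the 5 tables is transferred.

    -- Audit column (assumed to be added once) records when each row was loaded
    ALTER TABLE employment_records
        ADD COLUMN IF NOT EXISTS loaded_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP;

    -- Append only rows that were not loaded by an earlier batch
    INSERT INTO employment_records (employer_id, company_uen, record_month, salary)
    SELECT employer_id, company_uen, record_month, salary
    FROM   staging_employment_records
    ON CONFLICT (employer_id, company_uen, record_month) DO NOTHING;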
Modern data analytics often involves real-time data. In these cases, we cannot extract and transform data in large batches; instead, we need to perform the ETL Process on data streams, where data is transformed and delivered to a target system as it arrives. Tools such as Apache Samza, Apache Storm, and Apache Kafka support this. Building an ETL Process with Stream Processing based on Apache Kafka involves:
Extract Data into Kafka Topics: On the Confluent Platform, the Confluent JDBC Connector reads rows from a source table and writes each one as a key-value pair to a Kafka topic. As applications add rows to the source table, Kafka automatically writes them as new messages to the topic, providing a real-time data stream.
Retrieve Data from Kafka Topics: The ETL application extracts messages from the Kafka topic as Avro records, creates an Avro schema file, and deserialises them, producing KStream objects from these messages.
Transform Data in a KStream Object: With the Kafka Streams API, a stream processor receives one record at a time, processes it, and produces output records for downstream processors. These processors transform messages one at a time and perform data operations on them.
Load Data to Other Systems: ETL applications can stream data to target systems such as a data warehouse or data lake. For the Confluent Platform, the S3 Sink Connector can stream data to Amazon S3. This can be integrated with other systems such as a Redshift data warehouse using Amazon Kinesis.
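The Kafka Streams transformations described above are typically written in Java. To stay within SQL, a roughly equivalent sketch can be written in ksqlDB, Confluent's SQL layer over Kafka Streams; the topic names, columns, and JSON format below are assumptions for illustration.

    -- Declare a stream over the topic populated by the JDBC source connector
    CREATE STREAM employment_raw (
        employer_id  VARCHAR,
        company_uen  VARCHAR,
        record_month VARCHAR,
        salary       DOUBLE
    ) WITH (
        KAFKA_TOPIC  = 'employment_records_raw',
        VALUE_FORMAT = 'JSON'
    );

    -- Transform records one at a time and write them to a cleaned topic, which
    -- a sink connector (such as the S3 Sink Connector) can deliver downstream
    CREATE STREAM employment_clean WITH (KAFKA_TOPIC = 'employment_records_clean') AS
        SELECT UCASE(TRIM(employer_id)) AS employer_id,
               UCASE(TRIM(company_uen)) AS company_uen,
               record_month,
               salary
        FROM employment_raw
        WHERE employer_id IS NOT NULL;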
An Automated Cloud-Based Data Warehouse has end-to-end data management built in. It uses a self-optimising architecture with machine learning and natural language processing to automatically extract and transform data. It is integrated with data sources such as analytics systems, Business Intelligence (BI) tools, databases, social and advertising platforms. In general, building an ELT Process with an Automated Cloud-Based Data Warehouse involves:
Select Data Sources and Import Data: Select data sources from a list, enter credentials, and define destination tables. The Cloud-Based Data Warehouse will automatically import the data, clean it, and select schemas.
Run Transformation Queries: Select a table and run a SQL query against the raw data. We can save the query as a transformation or export the resulting table to a target system. We can keep running transformation queries until we achieve a format that suits our data analysis.
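For example, a transformation query for the per-period questions might aggregate the records by reporting month (column names follow the illustrative schema used earlier):

    -- Distinct employers and companies with records, per reporting month
    SELECT record_month,
           COUNT(DISTINCT employer_id) AS num_employers,
           COUNT(DISTINCT company_uen) AS num_companies
    FROM   employment_records
    GROUP BY record_month
    ORDER BY record_month;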
Data Analysis with Business Intelligence (BI) Tools: We can connect our Automated Cloud-Based Data Warehouse with any Business Intelligence (BI) tool such as Tableau or Looker to explore our transformed data.
In summary, we have shown two ways to build an ETL Process and one way to build an ELT Process.
ETL Process with Batch Processing: A traditional way of preparing and transforming data in batches, from source databases to a data warehouse.
ETL Process with Stream Processing: Using a modern stream processing framework like Kafka, we can retrieve data in real time, manipulate it, and load it to a target system such as Amazon Redshift.
ELT Process with an Automated Cloud-Based Data Warehouse: A next-generation data pipeline that can retrieve data from multiple sources and automatically prepare data for analysis with BI tools.