Background

You have been tasked with developing a data model and architecture to support an operational data store of individual-level data for policy analysis and service delivery. Table 1 describes the data items that are in scope for the data store.

The data store will cover all ABC citizens and permanent residents (PRs) and will need to be refreshed on a periodic basis. The operational data store is expected to provide at least five years' worth of historical data. The information highlighted in Table 1 is considered highly sensitive and should be handled accordingly.

There will be several different types of users for the operational data store. Some users will only look at policy analysis and will not need to modify the data. Other users will need to modify the data as part of their service delivery.

Data Modelling

  1. Develop a data model that describes how the data should be organised and stored, taking into account the sensitivity of certain data items.

  2. Provide a visual representation of this data model. Your representation should be understandable by a non-technical person.

  3. Set out a high-level set of data quality standards for agencies to comply with. These standards should cover both syntactic and semantic standards.

  4. Ask three key questions which you believe are necessary to refine the data model.

Data Architecture

  1. Develop a draft logical architecture to describe how you will obtain data from the agencies, clean the data to meet the data quality standards, store the data and distribute data to agencies. NB: This architecture should define what types of systems are necessary but should not describe the specific technical details of what kinds of solutions are necessary (i.e. you should not specify an Oracle Database with an Informatica ETL Solution).

  2. Describe what types of access controls should be in place at each system for the users.

  3. Ask three key questions which you believe are necessary to develop the architecture.

  4. In less than 500 words, discuss whether a centralised or decentralised data architecture best supports the public service of ABC.

My Responses to the Questions

Data Modelling

1) Develop a data model that describes how the data should be organised and stored, taking into account the sensitivity of certain data items.

The most common model for database management is the relational model, invented by Edgar F. Codd, an English computer scientist, while working for IBM in 1970. It remains the leading database model for commercial applications. The most common Relational Database Management Systems (RDBMSs) include Oracle DB, IBM DB2, Microsoft SQL Server, MySQL, and PostgreSQL. An RDBMS is chosen based on cost, supported operating systems, and other features. Data in an RDBMS is typically defined and queried using Structured Query Language (SQL).

A relational database organises data into tables, also known as relations. A database table is similar to a spreadsheet, consisting of columns and rows. Each column, also known as a field or attribute, holds one specific type of information about the entity in question. A particular attribute, or combination of attributes, is chosen as the primary key; when it is referenced from another table, it is called a foreign key there. Each row, also known as a record or tuple, holds the data for one specific instance of the entity.

Different types of relationships can be created among tables, including one-to-one, one-to-many, and many-to-many relationships. This enables a relational database to store and retrieve large amounts of data effectively. Tables can be normalised to make the database flexible, adaptable, and scalable. When normalised, each piece of data is atomic, broken down into its smallest useful unit.

For this assignment, we can employ an RDBMS to organise and store our data, grouped primarily by “Data Source”. Certain items unique to each Data Source are highly sensitive; examples include the information highlighted in Table 1, which will continue to be owned by the respective agencies. There is therefore a need to implement and maintain a set of high-level data quality standards for the agencies to comply with.
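
As a minimal sketch of how such tables could be defined in SQL (the table and column names below are illustrative assumptions rather than the actual items in Table 1, and NRIC is assumed to be a fixed-length nine-character identifier), the non-sensitive attributes are keyed by NRIC, while the highly sensitive field is isolated in its own table so that access can be restricted to the owning agency:

    -- Illustrative sketch only: table and column names are assumptions.
    -- Core person table shared across agencies, keyed by NRIC.
    CREATE TABLE person (
        nric          CHAR(9)      PRIMARY KEY,
        full_name     VARCHAR(100) NOT NULL,
        date_of_birth DATE         NOT NULL
    );

    -- One table per Data Source (agency), also keyed by NRIC.
    CREATE TABLE agency_a_record (
        nric        CHAR(9) PRIMARY KEY REFERENCES person (nric),
        record_date DATE    NOT NULL
    );

    -- Highly sensitive items live in a separate table so that access
    -- can be granted only to the owning agency.
    CREATE TABLE agency_a_sensitive (
        nric            CHAR(9) PRIMARY KEY REFERENCES agency_a_record (nric),
        sensitive_field VARCHAR(200)  -- stands in for a highlighted item in Table 1
    );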

2) Provide a visual representation of this data model. Your representation should be understandable by a non-technical person.

The Entity Relationship (ER) Diagram for our Relational Database is illustrated separately.

3) Set out a high-level set of data quality standards for agencies to comply with. These standards should cover both syntactic and semantic standards.

In general, data is considered to be of high quality if it correctly represents the real-world construct to which it refers. It should be fit for its intended purposes in operations, decision making, and planning. As the number of data sources increases, internal data consistency becomes significant. Data governance can be used to agree on definitions and standards for data quality.
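
For example, and purely as a hedged sketch reusing the assumed person table from Question 1, a syntactic standard could require every NRIC to have the agreed fixed length, while a semantic standard could require every date of birth to be plausible:

    -- Syntactic standard: every NRIC is present and exactly nine characters long.
    SELECT nric
    FROM   person
    WHERE  nric IS NULL
       OR  LENGTH(nric) <> 9;

    -- Semantic standard: dates of birth must be plausible
    -- (not in the future and not unreasonably far in the past).
    SELECT nric, date_of_birth
    FROM   person
    WHERE  date_of_birth > CURRENT_DATE
       OR  date_of_birth < DATE '1900-01-01';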

Essentially, a well-designed database aims to eliminate data redundancy, ensure data accuracy and integrity, and provide useful ways to access data. There are a few phases in the database design process that can help to ensure this:

Step 1: Define the Purpose of our Database (Requirement Analysis)

We need to gather information to understand the purpose of our database. This will help us to choose the types of data and entities we want to store. This information becomes our data dictionary, which outlines the tables used in our database. We can consider different perspectives for our database by interviewing the end users, analysing business forms such as invoices, timesheets, and surveys, and analysing existing data systems containing physical and digital files.

Step 2: Gather, Clean, Organise Data into Tables and Specify Primary Keys

Once we have decided on the purpose of our database, we need to gather, clean, and organise our data into subject-based tables, each consisting of columns and rows, like a spreadsheet. We start by creating a table for each type of entity in our data. Each row, also known as a record or tuple, holds data about something or someone. Each column, also known as a field or attribute, contains a single type of information that appears in every row.

For our database, we will use some of the most common data cleaning techniques in SQL. The most common data types we have are numbers (numeric) and text (string). Common issues with numeric values include type mismatches and NULLs, which we can resolve with type conversions and NULL-handling functions before aggregating or joining tables. Common issues with string values are different spellings or formats that mean the same thing, which we can resolve with string functions and operators that examine and standardise the values.
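
A minimal sketch of these techniques, using an assumed staging table (agency_a_staging) and hypothetical fields (monthly_income and residential_town are illustrative, not items from Table 1):

    -- Numeric clean-up: cast values stored as text back to numbers and
    -- substitute a default where the value is NULL.
    SELECT nric,
           COALESCE(CAST(monthly_income AS DECIMAL(10, 2)), 0) AS monthly_income
    FROM   agency_a_staging;

    -- String clean-up: trim whitespace and standardise case so that values
    -- that mean the same thing compare as equal.
    SELECT nric,
           UPPER(TRIM(residential_town)) AS residential_town
    FROM   agency_a_staging;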

In our relational model, duplicate rows can create ambiguities during retrieval. To ensure a unique identity, we need to decide which columns will serve as the primary key for each table. A primary key is a unique identifier for a given entity. As shown in the Entity Relationship (ER) Diagram in Question 2, I have set “NRIC” as the primary key for every table except the table containing the “Sensitive Data Field”, which should be accessible only to its owning agency.

Step 3: Create and Analyse Relationships among Tables

The power of relational databases lies in the relationships among tables. Each entity can potentially have a relationship with another entity, usually one of three types: One-to-One (1:1), One-to-Many (1:M), and Many-to-Many (M:N). When multiple entities in one table can be associated with multiple entities in another table, they have an M:N relationship, drawn with a crow's foot at both ends in our ER Diagram (“Crow's Foot Notation”).

In practice, an M:N relationship cannot be implemented directly in a relational database. We have to break it up into two 1:M relationships and introduce a go-between entity known as a link table or junction table. In a link table, each row pairs an entity from one table with an entity from the other, along with any supplementary information. Our database contains several such M:N relationships.
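
For instance (a sketch with assumed table names), an M:N relationship between persons and benefit schemes could be resolved by a junction table that holds one row per person-scheme pair, plus supplementary information such as the enrolment date:

    -- Assumed lookup table of schemes.
    CREATE TABLE scheme (
        scheme_id   INTEGER      PRIMARY KEY,
        scheme_name VARCHAR(100) NOT NULL
    );

    -- Junction (link) table resolving the assumed M:N relationship
    -- between persons and schemes.
    CREATE TABLE person_scheme (
        nric        CHAR(9) REFERENCES person (nric),
        scheme_id   INTEGER REFERENCES scheme (scheme_id),
        enrolled_on DATE NOT NULL,        -- supplementary information
        PRIMARY KEY (nric, scheme_id)     -- one row per person-scheme pair
    );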

Step 4: Normalise, Refine, and Standardise Tables

We need to apply normalisation rules, the industry standard for making sure our database is structured correctly and efficiently. Online Transaction Processing (OLTP) databases should be normalised, as users are concerned with creating, reading, updating, and deleting records. Online Analytical Processing (OLAP) databases, which favour analysis and reporting, might fare better with a degree of de-normalisation, as the emphasis is on calculation speed. Each level of normalisation includes the rules of the lower levels.

We also need to apply data integrity rules such as the entity integrity rule, the referential integrity rule, and the business logic integrity rule, in order to validate our data.

Furthermore, we can create an index on selected columns to speed up data search and retrieval. An index is a sorted copy of one or more columns, with the values in ascending or descending order. Instead of re-sorting for each query, data can be accessed in the order specified by the index, allowing for a quicker search. Even though an index speeds up retrieval, it can slow down insertions, deletions, and updates, because the index has to be updated whenever a record changes. Most RDBMSs automatically build an index on the primary key.
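
A brief sketch of both the integrity rules and the index, continuing with the assumed tables from earlier (the business rule in the CHECK constraint is a hypothetical example):

    -- Business logic integrity: a CHECK constraint encodes an assumed rule
    -- that record dates cannot pre-date the store's retention window.
    -- (Entity integrity is enforced by the primary keys above, and
    -- referential integrity by the foreign keys.)
    ALTER TABLE agency_a_record
        ADD CONSTRAINT chk_record_date CHECK (record_date >= DATE '1900-01-01');

    -- An index on a frequently searched column speeds up retrieval, at the
    -- cost of extra work on every insert, update, and delete.
    CREATE INDEX idx_agency_a_record_date ON agency_a_record (record_date);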

4) Ask three key questions which you believe are necessary to refine the data model.

  1. How should regular batch refreshes be scheduled for the tables in our data model?

  2. How will our data model be used to answer colleagues' questions covering different time periods?

  3. What potential issues might our data model face, and how can we overcome them?

Data Architecture

1) Develop a draft logical architecture to describe how you will obtain data from the agencies, clean the data to meet the data quality standards, store the data and distribute data to agencies. NB: This architecture should define what types of systems are necessary but should not describe the specific technical details of what kinds of solutions are necessary (i.e. you should not specify an Oracle Database with an Informatica ETL Solution).

A data architecture is an overarching strategy made up of the models, policies, rules, and standards that govern data collection, storage, integration, and use in data systems. It is one of several architecture domains that form the pillars of an enterprise architecture. It is therefore important to lay out an enterprise's data architecture logically before selecting specific technologies.

Essentially, a data warehouse is a central repository for collecting and processing data. It integrates current and historical data from one or more disparate sources, and stores it for reporting and analysis. It is a key component of Business Intelligence (BI). Extract, Transform, Load (ETL) and Extract, Load, Transform (ELT) are the two main approaches used to build a data warehouse.

The ETL Process is a data analytics pipeline that extracts information from raw data, transforms it into a format suitable for analysis, and loads it into a target system such as a data warehouse. The ETL Process often summarises data to reduce its size and improve performance for analysis.

The ELT Process is an alternative to the ETL Process in which data is extracted and loaded into the target system immediately, and transformed there. Today's cloud-based data warehouses and data lakes provide large-scale storage and scalable computation, which allows an ELT pipeline to process and maintain massive data pools in the cloud at low cost.

Here, we present two ways to build an ETL Process and one way to build an ELT Process.

1. Building an ETL Process with Batch Processing

A traditional ETL Process involves transferring and processing of data in batches from source databases to a data warehouse. It can be challenging to build an Enterprise ETL Process from scratch. In general, building an ETL Process with Batch Processing involves:

Reference Data: We need to create a dataset that defines permissible values for our data.

Extract from Data Sources: Subsequent ETL steps can only succeed if we extract our data correctly. Most ETL Processes combine data from multiple sources, each with its own format and organisation; these include relational and non-relational databases. Successful extraction converts the data into a single format for standardised processing.

Validate Data: An automated process confirms whether the data pulled from each source has the expected values, and the validation engine rejects records that fail the validation rules. Rejected records need to be analysed to identify errors for correction, which helps to resolve potential problems in future batches (a minimal sketch follows these steps).

Transform Data: We need to program and test a series of rules and functions in order to achieve appropriate transformations for our data. We can then run this program to clean our data, apply business rules, check for data integrity, and create aggregates as necessary.

Stage: Data that has been transformed will be loaded into a staging database. This makes it easier to revert if something goes wrong. We can also generate audit reports for regulatory compliance, or diagnose and repair problems with our data.

Publish to Data Warehouse: Here, we load our data into the target tables. Some data warehouses overwrite existing information every time an ETL Process loads a new batch of data; this may happen on a daily, weekly, or monthly basis. In other cases, an ETL Process can add new data without overwriting, using a timestamp to mark each new batch. This has to be done carefully so that the data warehouse does not run out of disk space or suffer performance problems.
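
As a rough sketch of the validation and publishing steps above (the staging, reject, and warehouse table names are assumptions, and the rules mirror the data quality checks sketched earlier), rejected rows are set aside for analysis while new rows are loaded incrementally with a load timestamp:

    -- Validation: copy rows that fail the agreed rules into a reject table
    -- so they can be analysed and corrected before the next batch.
    INSERT INTO agency_a_rejects
    SELECT *
    FROM   agency_a_staging
    WHERE  nric IS NULL
       OR  LENGTH(nric) <> 9
       OR  record_date > CURRENT_DATE;

    -- Publish: load only rows newer than the last successful load,
    -- stamping each batch so existing history is not overwritten.
    INSERT INTO warehouse_agency_a (nric, record_date, loaded_at)
    SELECT nric, record_date, CURRENT_TIMESTAMP
    FROM   agency_a_staging
    WHERE  record_date > COALESCE(
               (SELECT MAX(record_date) FROM warehouse_agency_a),
               DATE '1900-01-01');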

2. Building an ETL Process with Stream Processing

Modern data analytics involve real time data. In these cases, we cannot extract and transform data in large batches. Therefore, we need to perform the ETL Process on data streams, where data is transformed and saved immediately to a target system. We will need ETL applications for this. In general, building an ETL Process with Stream Processing involves:

Extract Data into Topics: ETL applications write each key-value pair into a topic. As new rows are added to the source tables, they are automatically written as new messages to the topic, producing a real-time stream.

Retrieve Data from Topics: ETL applications extract messages from a topic as records, de-serialising them against a schema to create objects from the messages.

Transform Data into Objects: ETL applications receive one record at a time, process it, and produce output for downstream processors. These processors transform messages one at a time and perform data operations on them.

Load Data to Other Systems: ETL applications can stream data to target systems such as a data warehouse or data lake.

3. Building an ELT Process with an Automated Cloud-Based Data Warehouse

An automated cloud-based data warehouse has end-to-end data management built in. It uses a self-optimising data architecture with machine learning and natural language processing to automatically extract and transform data. It is integrated with data sources such as analytics systems, BI tools, other databases, social and advertising platforms. In general, building an ELT Process with an Automated Cloud-Based Data Warehouse involves:

Select Data Sources and Import Data: Select data sources from a list, enter credentials, and define destination tables. The cloud-based data warehouse will automatically import the data, clean it, and select database schemas.

Run Transformation Queries: Select a table and run a SQL query against the raw data. We can save the query as a transformation or export the resulting table to a target system, and we can keep running transformation queries until we achieve a format that suits our data analysis (a minimal sketch follows these steps).

Data Analysis with BI Tools: We can connect our automated cloud-based data warehouse with any BI tools such as Tableau to explore our transformed data.
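
To make the transformation step above concrete, a query such as the following (an illustrative sketch; the yearly_scheme_summary table and the person_scheme source are assumptions carried over from the data model sketches) could aggregate raw records into an analysis-ready table:

    -- Illustrative transformation query: aggregate raw enrolment records
    -- into a yearly summary suitable for policy analysis.
    CREATE TABLE yearly_scheme_summary AS
    SELECT EXTRACT(YEAR FROM enrolled_on) AS enrolment_year,
           scheme_id,
           COUNT(*)                       AS persons_enrolled
    FROM   person_scheme
    GROUP  BY EXTRACT(YEAR FROM enrolled_on), scheme_id;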

2) Describe what types of access controls should be in place at each system for the users.

Access control is a security technique used to control who or what can view or use resources in a computing environment. It is a fundamental component of security compliance in any organisation. There are two types of access control: physical and logical. Physical access control limits access to buildings, rooms, and physical IT assets. Logical access control limits access to computer databases, networks, and data itself.

To secure a data centre, we can assign employees, executives, freelancers, and vendors to different groups and access levels. We can use access control systems that rely on user credentials, access card readers, and auditing reports to track employee access to various rooms and buildings. Different users will have access to different rooms and buildings. We can restrict entry to certain areas with alarms and lockdown capabilities to prevent unauthorised access and operations.

Access control systems perform identification, authentication, and authorisation of users by evaluating login credentials such as passwords, Personal Identification Numbers (PINs), biometric scans, and security tokens. Multi-Factor Authentication (MFA), which requires two or more authentication factors, provides a layered defence in our access control systems.

Access control systems aim to identify an individual or entity, verify the person or application, and authorise the access level and set of actions allowed for that user or Internet Protocol (IP) address. Examples of directory services and protocols include the Lightweight Directory Access Protocol (LDAP) and the Security Assertion Markup Language (SAML).

Essentially, the goal is to minimise the security risk of unauthorised access to both physical and logical systems. Access control systems provide the security infrastructure and procedures needed to protect our data architecture. They safeguard privileged data, confidential information, and sensitive data such as Personally Identifiable Information (PII) and intellectual property.

Organisations need to comply with data rules and regulations, and cybersecurity standards. They use different access control systems to meet their compliance requirements and IT security levels. The main security models of access control include:

1. Mandatory Access Control (MAC)

This security model grants access rights through a central authority based on multiple levels of security clearance for users and devices. A hardware or software component is used to grant access rights. This can be in the form of a password or keypad. It is often used in government agencies and military organisations and is the opposite of DAC.

2. Discretionary Access Control (DAC)

This security model allows system administrators to have direct control over a target system, resource or data. They can set policies that define who or what is authorised to access it. This enables system administrators to limit the propagation of access rights. However, it lacks centralised control and one method of access can open all doors to the target system.

3. Role Based Access Control (RBAC)

This security model is commonly used to grant access rights based on the defined business functions of individuals or groups rather than their actual identities. It relies on a structure of role assignments, role authorisations, and role permissions, developed using role engineering, to control employee access to systems. RBAC is user-friendly because system administrators can manage roles centrally, and it is often used to enforce MAC and DAC frameworks (a minimal sketch of database-level roles follows this list).

4. Attribute Based Access Control (ABAC)

This security model grants access rights by evaluating a set of rules, policies, and relationships derived from the attributes of users, systems, and environmental conditions.

5. Rule Based Access Control

This security model allows system administrators to define the rules that grant access rights to users and devices. These rules are based on conditions such as the time of day or location. It is commonly used to enforce RBAC frameworks.
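
As a minimal sketch of how the RBAC model above could be applied at the database layer (the role names are assumptions and the exact syntax varies between RDBMSs), policy-analysis users would receive read-only roles, service-delivery users read-write roles, and only the owning agency's role would see the sensitive table:

    -- Read-only role for policy analysis users.
    CREATE ROLE policy_analyst;
    GRANT SELECT ON person TO policy_analyst;
    GRANT SELECT ON agency_a_record TO policy_analyst;

    -- Read-write role for service delivery users.
    CREATE ROLE service_delivery;
    GRANT SELECT, INSERT, UPDATE ON agency_a_record TO service_delivery;

    -- Sensitive items are visible only to the owning agency's role.
    CREATE ROLE agency_a_owner;
    GRANT SELECT, INSERT, UPDATE ON agency_a_sensitive TO agency_a_owner;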

3) Ask three key questions which you believe are necessary to develop the architecture.

  1. What are some ways we can improve our existing data architecture?

  2. What potential issues might our data architecture face, and how can we overcome them?

  3. What are the main challenges of access control, and how can we overcome them?

4) In less than 500 words, discuss whether a centralised or decentralised data architecture best supports the public service of ABC.

First, we need to understand and compare the different aspects of these data architectures.

1. Centralised Data Architecture

It is a client-server architecture where one or more client nodes are connected to a central server. This is the arrangement used in most organisations today.

Advantages include security, as it is straightforward to service the server and client nodes given their close proximity. Dedicated resources (memory, CPU cores) allow the system to be tailored to custom needs. Updates can be rolled out quickly, and client nodes can easily be detached from the system.

Disadvantages include a high dependency on network connectivity: the system can fail suddenly if the nodes lose connectivity. There are fewer opportunities for data backup, and the server node is a single point of failure. Difficulties in server maintenance can inconvenience users.

2. Decentralised Data Architecture

It is a peer-to-peer or master-slave architecture where every node makes its own decisions. The final output is the aggregate of decisions made by individual nodes. For peer-to-peer architecture, all nodes are peers of one another and no one node has supremacy over others. For master-slave architecture, one node becomes the master by voting and coordinates a part of the system.

Advantages include lower incidence of performance bottlenecks as the entire load is balanced over all nodes. There is high availability as some nodes (computers, mobile phones, servers) are always online. There is more autonomy over resources as each node controls its own behaviour.

Disadvantages include difficulty coordinating large, global tasks, as there is no chain of command to direct them. There is no central oversight, and troubleshooting can be difficult when a node fails: each node has to be pinged to check availability, and the work has to be partitioned across nodes.

3. Distributed Data Architecture

It is a form of decentralised architecture where there can be peer-to-peer, client-server or n-tier architecture. For peer-to-peer architecture, all nodes are peers of one another working towards a common goal. For client-server architecture, some nodes become server nodes acting as coordinators. For n-tier architecture, different parts of an application are distributed in different nodes but they work together to function as an application.

Advantages include low latency and faster response times due to the greater geographical spread of nodes.

Disadvantages include difficulty in achieving consensus, and it is not possible to use conventional methods of logging events by absolute time.

Conclusion

Essentially, I feel that a decentralised or distributed data architecture can best support ABC's plans and its key principles, because the advantages of such an architecture, based on the comparisons above, align most closely with our objectives.