Introduction to Data engineering

Understanding DS Teams

NOTE I: Over the last few years data science has evolved into a multidisciplinary field with specific specialist roles becoming more important.

With this shift, companies are more and more looking to hire “T”-shaped individuals to join their analytics team.

Cross-discipline looks to ensure that theteam speaks the same language whenattempting something new or solving aproblem..
Specialization helps to resolve diffi cultproblems that cause bottlenecks in thepipeline as fast as possible.

Analytical process

As a data science team, task will most likely be divided up into two main components:

Immediate need of client/manager.
The things we would like to be doing.

Deciding how to divide up the work is known as demand-leveling and the traditional approach is to balance resourceand demand based on availability of the team.

As we can see by the graph, a new task canonly be assigned to a team member, oncethat task has been completed.

As a data science team, tasks often fall into one of the following two categories:

Tasks that serve the immediate needs of clients/managers.
The things we would like to be doing.

Deciding how to divide up the work is known as demand-leveling and the traditional approach is to balance resources and demand based on availability of the team.

In I-shaped teams, specialization results ina silo-ing of tasks required to solveproblems.
- Can lead to task backlogs.
T-shaped people effectively increase the number of available resources.
- Affords a more adaptable responseunderpinned by the prioritization oftasks.

Advantages of T-shaped Teams

NOTE I: Use experts to solve bottlenecks and non-experts to provide additional support.

By having a data science team of individuals that are T-shaped, experts can offl oad menial tasks to non-experts thathave the required basic knowledge.

This frees up the expert to solve the bottleneck much quicker.
Even partial knowledge on how to solve a bottleneck is more valuable than having an expert work on a non-bottleneck problem.
The combined effort of all individuals working on tasks together accomplishes more than a silo-ed approach totask management.

So, besides effi ciency what other advantages are there in T-Shaped teams?

Expnand to see additional advantages

Communication among team members is more effi cient

In the process up picking up a skill, you will also start to build domain-specifi c language. This ensures that theteam understands different perspectives.

Team members stay interested in what they are doing.

What’s boring for one person on the team might be an exciting challenge for another member.
A multidisciplinary approach enables learning and a growth mentality among the team.
- Essential to avoid the paradox of expertise

You’re more attractive to employers.

We are building careers in a very dynamic sector where new roles and responsibilities emerge all the time. Cross-competencies make for an adaptable resource, a chracteristic that is very valuable to employers.

Moving to a T-shaped DS Team

Note II Unfortunately, there isn’t a universal cheatcheat…

Assess what skills and knowledge that each individual already has.

Rate their ability within these different areas as well as the skills that the team member wants to improve on.

Ensure that work is broken down into incremental outcomes across the work streams.

This approach ensures that tasks usually require multiple skills.

Example: creation of a new indicator. This would entail writing SQLin a programmatic language (Data scientist), compiling the software bundle (Developer) and putting the new version into production (DevOps).

This approach also encourages pair-coding which greatly increases knowledge sharing..

What are these expert roles?

Data science hieracrhy of needs.

What are these expert roles?

The journey might look a bit different for each individual, but the general career path looks like this:

Analyst (Business Analyst)

Note III. The typical data analyst role is consulting-centric….

A data analyst typically works as part of an interdisciplinary team to determine and execute the organization’s goals. The analyst mostly gathers data to identify trends that help business leaders make strategic decisions.

Analysts usually have a strong economic or business background that is supported by his/her knowledge of statistics and mathematics. As opposed to Data Scientist, analysts are more generalist and tend to bemore flexible in the job market.

Knowledge of

Excel

SQL

Power BI

Responsibilities

Pull data from databases.

Analyse and forecast trends.

Create dashboards using Power BI or Tableau.

Descriptive, diagnostic,predictive or prescriptive analytics.

Data Scientist

Note IV. A data scientist is someone who is better at statistics than any software engineer, and better at softwareengineering than any statistician.

Job Description

Although it is important for Data Scientists to have a good understanding of business processes, the majority of their work will involve solving complex problems by developing new tools, methods or procedures. While the analyst is closely involved in answering an organization’s business questions on a day to day basis, the DS focuses on a macro level to develop ways to meet those business needs. It is important to develop analyses in a structured way so that they can be automated and scaled if the business requires them on a regular basis.

This is a much more specialized role and organisations that fully utilize the data scientist skillset can be hard to find.

Knowledge of

Python

Java

Scala

Responsibilities

Data cleaning and wrangling(80%).

Building APIs and ETL pipelines

Statistical analysis using ML.

Automate processes.

Database Administrator

Note V. Contingency is the name of the game..

Job Description

A database administrator, commonly abbreviated as DBA, maintains the integrity and functioning of a database. This position entails running regular diagnostic tests to ensure data is not corrupt and combing for bugs or glitches within the system. Safely storing and backing-up datain case of system failure or memory loss and creating plans for addressing large-scale errors are also important responsibilities of a DBA.

It is important that DBAs works closely with SysAdmins to ensure high availability of servers supporting clusters.

Knowledge of

Linux

SQL

Java

Python

Responsibilities

Manage backups

Capacity planning

Disaster recovery processesand procedures

Security

Index maintenance

Data Engineer

Note VI. Overseeing the technical part of data…

Job Description

Data Engineers are usually senior individuals in the organisation with extensive knowledge of data models, databases, IT infrastructure and software engineering.

Your data engineers are responsible for building optimized data flows that can be relied on in every day decision making and operations. To accomplish this, data engineers need experience in building database architecture by allocating data storage, establishing rules for data flow and most importantly, choosing the correct technology stack to run the data pipelines.

Knowledge of

Linux

Distributed database systems

Python,R

Java

C++

Scala

Airflow

SysAdmin

Note VII. When its good, its good, but when its bad, hell hath no fury like a SysAdmin scorned…

Job Description

System administrators (SysAdmin) are benevolent creatures with endless power who make sure your computer and network remain in good working order no matter what the silly data engineers/scientist/devops do.

By far the most important role if a organisation is to succeed, as the SysAdmin is responsible for keeping the system running in a secure manner no matter what the workload is. SysAdmins are also responsible for configuration management tools so that a system can be restored procedurely if it goes down.

Knowledge of

Linux CLI tools (awk, sed, jq)

Perl/Python

Ansible/ Jenkins etc

Responsibilities

Be ready for 2am calls. ;-)

DevOps

Note VIII. DevOps transforms the delivery capability of development and software teams….

Job Description

Development & Operations (DevOps) is a series of practices and processes that are intended to speed up and automate the developing,testing, and releasing of software to allow for the continuous delivery of software and software updates. DevOps are the team that is responsible for putting models, applications, dashboards and APIs intoproduction and orchestrating how each of the pieces of software interact with one another and the public.

Knowledge of

Linux

Kubernetes

Docker

Python

CI/CD tools

Ansible/ Jenkins etc

Data engineering and its role in data lifecyle

The data engineering lifecycle comprises stages that turn raw data ingredients into a useful end product, ready for consumption by analysts, data scientists, ML engineers, and others

We divide the data engineering lifecycle into five stages

Generation

We cannot engineer data we do not have, and that reason why data generation is the first step in the data lifecycle, data generated in different sources like data collection tools,web scraping the electronic device like PC, smartphones ,sensors, and so on …

Ingestion

Moves data from multiple sources SQL and NoSQL databases, IoT devices, websites, streaming services, etc.to a target system to be transformed for further analysis. Data comes in various forms and can be both structured and unstructured.

Data transformation

Adjusts disparate data to the needs of end users. It involves removing errors and duplicates from data, normalizing it, and converting it into the needed format.

Data serving

Delivers transformed data to end users — a BI platform, dashboard, or data science team

Note I. we have many types of data pipelines like ETL,ELT but ETL is most commonly used.

ETL data pipeline

Extract retrieving data. At the start of the pipeline, we’re dealing with raw data from numerous sources like databases, APIs, files, etc.

Transform standardizing data. Having data extracted, scripts transform it to meet the format requirements. Data transformation significantly improves data discoverability and usability.

Load saving data to a new destination. After bringing data into a usable state, engineers can load it to the destination, typically a database management system (DBMS) or data warehouse.

Data engineering pipeline

Note II. The mechanism that automates ingestion, transformation, and serving steps of the data engineering process is known as a data pipeline.

Commonly, ETL pipelines are used for

Data migration between systems or environments (from on-premises to cloud databases);

Data wrangling or converting raw data into a usable format for analytics, BI, and machine learning projects;

Data integration from various systems and IoT devices; and copying tables from one database to another.

major pitfalls in building data pipelines:

lacking relevant metrics

underestimating data load

Note II. Besides a pipeline, a data warehouse must be built to support and facilitate data science activities. Let’s see how it works.

Introduction to data warehousing and data lakes

Data warehouse

DW is a central repository storing data in queryable forms. From a technical standpoint, a data warehouse is a relational database optimized for reading, aggregating, and querying large volumes of data. Traditionally, DWs only contained structured data or data that can be arranged in tables. However, modern DWs can also support unstructured data (such as images, pdf files, and audio formats).

Setting up secure and reliable data flow is challenging. Many things can go wrong during data transportation: Data can be corrupted, hit bottlenecks causing latency, or data sources may conflict, generating duplicate or incorrect data. Getting data into one place requires careful planning and testing to filter out junk data, eliminating duplicates and incompatible data types to obfuscate sensitive information while not missing critical data.

Relational Databases and SQL Fundamentals

Assignment I

Click on button to expand and ready instructions and download materials

Postgresql download windows and instalations

https://drive.google.com/drive/folders/1vVU17PaTe9yUfuA4SQ_7xcpe1WjmSlgI?usp=sharing

navigate in folder colled setup and download postgresql and dbeaver

Install postgresql

NB do you have to remember the passowrd you set

Install dbeaver

Set up connection to database

if every thing is good, you have to check is Postgresql installed well

Check via SQLshell(psql)

Enter your password

If we get same output now we are save,postgresql is installed well

check if we con connect via dbeaver now, we are ready to go

Uploading data from CSV to database

Creating database

we have two way to create database

Creating database via SQL shell

CREATE DATABASE training;

Via any Grapgical application used to interact with database

After creating database, now we have to create table with the flowing sql command

drop table if exists property;

create table property(
    property_type VARCHAR(255),
    addresslocality VARCHAR(255),
    bedrooms INT,
    bathrooms INT,
    derived_lcy DOUBLE precision not null
);

Importing data postgressql table

All remaining query we run using them using dbvear edit query

check if data loaded cullectly

viewing all data in propery table

select
    property_type,
    addresslocality,
    bedrooms,
    bathrooms,
    derived_lcy
from
    public.property;

select
    *
from
    public.property;

SQL Commands

Data Definition Language(DDL)

Create a table

In order to upload our synthetic transactions data set, we need to create the table structure into which the data must go. These structures can become quite complex, but for now we only going to create a fact and dimension table (transactions and customer information respectively). We allocate our PRIMARY KEY as the date or” transaction, the id and the customer who performed the transaction.
```
   create table customers(
customer VARCHAR(20),
gender VARCHAR(100),
age int,

primary key(customer));
```

Note I Keys are an important feature which can optimize looking up a transaction and also ensuring performance while maintaining data integrity. We don’t cover keys in this course, but its something to be aware of, especially. when we start learning about joins

Now we gonna load the customer information. This table contains the demographic information associated with each customer.


 create table transactions(

transaction_date DATE,

transaction_id VARCHAR(20),

customer VARCHAR(20),

sku VARCHAR(100),

amount decimal,

primary key(transaction_date,
transaction_id,
customer),
foreign key(customer) references customers(customer)
);

From the two CREATE TABLE statements you can see that the two tables are linked via the customer column. We will be learning JOINS near the end of today, which willjoin the tables together so that we can get demographic information on transactions.

Our final table structures

Loading data from google drive folder (Transaction.csv,customer.csv)

Data Query Language(DQL)

Note I The KING of all statements: SELECT

Previously I decided I wanted to return all the columns (*), but what if l only want to return one or two of columns?

SELECT customer, amount FROM transactions LIMIT 10;

Lets build a bigger SELECT statement( I like 3 tab indentation ):

select
    transaction_date,
    customer,
    sku,
    amount
from
    transactions
limit 10;

SELECT but with filter criteria

What happens if we only want to return transactions of a certain type?

Well, then we can employ the WHERE statement We are going to collect the same columns as previously, but now we will specify the WHERE criteria on sku column where equal to airtime:

select
    transaction_date,
    customer,
    sku,
    amount
from
    transactions
where
    sku = 'airtime'
limit 10;

exercise

In your notebook, write the code to bring back 100 examples where the sku is p2p and transaction_date is 2020-08-19.

SELECT but with filter criteria and order In certain circumstances, it is necessary to order your date to get the correct output For instance if we want to get the top 10 largest amounts:


select
    transaction_date,
    customer,
    sku,
    amount
from
    transactions
order by
    amount desc
limit 10;

Aggregations (Pivoting) in SQL

Pivoting forms part ofthe aggregation function of SQL. This helps us answer questions like:

- What is the average amount of spent per gender?
- Total value and volume per date?
- Total volume and value dissagregated by gender and age?

As you can see, aggregations or ROUP BY clase gets used OFTEN, so learn it well and get comfortable with it.

Lets illustrate a basic example before moving onto complex queries. What is the total value per date?

select
    transaction_date,
ROUND(SUM(amount))
from
    transactions
group by
    transaction_date
order by
    transaction_date desc
limit 10;

What is the total value, volume and distinct customers doing p2p per day?

select
    transaction_date,
    ROUND(SUM(amount)) value,
    COUNT(*) volume,
    COUNT(distinct customer) distinct_cust
from
    transactions
where
    sku = 'p2p'
group by
    transaction_date
order by
    transaction_date desc
limit 10;

Notice how l ALIAS my aggregations as {aggregation} then {name}. This will make your life a lot easier and in some case it is mandatory… as in joins.

Last but not least: JOINS

By now you are asking yourself, if we designed our database in the beautiful star schema that we talked about earlier, how do we join all the information together again? This is where JOINS come in and there are a multitude ofthem. Mostimportant one is LEFT JOIN and INNER JOIN:

Lets attempt a basic join before we combine joins with aggregations. To start off we will JOIN the customers table onto the transactions table:


select
    *
from
    transactions as trans
left join customers as cust

on
    trans.customer = cust.customer
limit 10;

There are ways to optimize your joins to be extremely fast.

One is keys (which is why we used primary keys in our tables).

Notice how I use the term USING and not ON. If your columns are named the same in both tables this is a much cleaner way to join.

select
    transaction_date,
    gender,
    age,
    COUNT(*) volume,
    ROUND(SUM(amount)) value,
    COUNT(distinct customer) distinCt_customers
from
    transactions as trans
left join customers as cust
        using(customer)
where
    sku = 'p2p'
group by
    transaction_date,
    gender,
    age
limit 10;

Data analysis using SQL

Assignment II

Flow instruction and steps and then Answer provided questions from database you have created

Question 1.

1. How many transactions are there in the transactions table?

1. How many customers are there in the customer database?

1. Are there the same number of customers in the transactions and customer databases? Why is it important that these figures are the same?

1. How many years and months does this data cover?

1. What are all the different transaction types present in the transactions data?

1. What is the most common transaction type?

1. Which transaction type has the highest total amount (value) transacted? What is the average transaction size for each transaction type?. and explain why is this type in rwanda context?

1. Which transaction is done by the highest number of unique customers?

1. What percentage of customers are female?

1. How many customers are 25 years old? How many customers are 60 years old? What does this say about financial inclusion?

1. What is the total value transacted by customers 18 years or younger? How many distinct customers are there 18 years or younger? Provide your answer in one table.

Question 2.

step 1
- Create database colled AssignmentII
step 2
- we have files colled, fact_price.csv, dim_property_type.csv, dim_property_type_join.csv and dim_property_meta.csv,
step3 Create flowing table in AssignmentII database and import all file we have into corresponding table
- fact_price
- dim_property_type
- dim_property_type_join
- dim_property_meta
B

Produce ER diagrame of database created

NOSQL databases

Four V’s of Big Data

SQL VS NoSQL

Common Characteristics of NoSQL database

Non relational
- They don’t use the relational data model and thus don’t use the SQL language
Cluster friendly
- They tend to be designed to run on a cluster
Open Source
Schema less
- They don’t have a fixed schema
- They allow you to store any data in any record

NoSQL Originators

Google ( BigTable , LevelDB
LinkedIn (Voldemort)
Facebook (Cassandra)
Twitter ( Hadoop Hbase , FlockDB , Cassandra)
Netflix ( SimpleDB , Hadoop HBase ,Cassandra)

Different Types of NoSQL

Key Value Store
- A key that refers to a payload (actual content / data)
- MemcacheDB , Azure Table Storage, Redis
Column Store
- Column data is saved together, as opposed to row data
- Super useful for data analytics
- Hadoop , Cassandra, Hypertable
Document / XML / Object Store
- Key (and possibly other indexes) point at a serialized object
- DB can operate against values in document
- MongoDB , CouchDB , RavenDB
Graph Store
- Nodes are stored independently, and the relationship between nodes (edges) are stored with data
- Neo4j

Mongodb

why Document-based

Handles Schema Changes Well (easy development)
Solves Impedance Mismatch problem
Usually in JSON

Not really schema-less
- Implicit schema to retrieve specific values
- E.g.: I want a price of an order!

Mongodb instration guide and evirenement setup

Mongodb instration Guid

step1
step 2
- set up connection

Mobgodb Commands

Creating database

use bookdb;

use can be used to switch to database or create it if does not exist

Insert a document without an _id field example

db.books.insertOne(
{
    itle: 'MongoDB insertOne',
    isbn: '0-7617-6154-3'
}
)

Check if data is inserted correctly

db.books.find()

Output

{
    "_id" : ObjectId("64855cf6acadb7ef84350738"),
    "itle" : "MongoDB insertOne",
    "isbn" : "0-7617-6154-3"
}

Insert a document with an _id field example

_id: 1,
   title: "Mastering Big Data",
   isbn: "0-9270-4986-4"

Output

{
    "_id" : ObjectId("64855cf6acadb7ef84350738"),
    "itle" : "MongoDB insertOne",
    "isbn" : "0-7617-6154-3"
}

{
    "_id" : NumberInt(1),
    "title" : "Mastering Big Data",
    "isbn" : "0-9270-4986-4"
}

Using MongoDB insertMany()

db.books.insertMany(
[
{ title:  "NoSQL Distilled", isbn: "0-4696-7030-4",autor:"sample name"},
   { title:  "NoSQL in 7 Days", isbn: "0-4086-6859-8"},
   { title:  "NoSQL Database", isbn: "0-2504-6932-4"},
]
)

MongoDB findOne()

use productdb
db.products.insertMany([
    { "_id" : 1, "name" : "xPhone", "price" : 799, "releaseDate": ISODate("2011-05-14"), "spec" : { "ram" : 4, "screen" : 6.5, "cpu" : 2.66 },"color":["white","black"],"storage":[64,128,256]},
    { "_id" : 2, "name" : "xTablet", "price" : 899, "releaseDate": ISODate("2011-09-01") , "spec" : { "ram" : 16, "screen" : 9.5, "cpu" : 3.66 },"color":["white","black","purple"],"storage":[128,256,512]},
    { "_id" : 3, "name" : "SmartTablet", "price" : 899, "releaseDate": ISODate("2015-01-14"), "spec" : { "ram" : 12, "screen" : 9.7, "cpu" : 3.66 },"color":["blue"],"storage":[16,64,128]},
    { "_id" : 4, "name" : "SmartPad", "price" : 699, "releaseDate": ISODate("2020-05-14"),"spec" : { "ram" : 8, "screen" : 9.7, "cpu" : 1.66 },"color":["white","orange","gold","gray"],"storage":[128,256,1024]},
    { "_id" : 5, "name" : "SmartPhone", "price" : 599,"releaseDate": ISODate("2022-09-14"), "spec" : { "ram" : 4, "screen" : 5.7, "cpu" : 1.66 },"color":["white","orange","gold","gray"],"storage":[128,256]}
 ])

 db.products.findOne(
 {
     _id:2
 }
 )

Using MongoDB findOne() method to select some fields

db.collection.findOne(query, projection);

db.products.findOne({})

Example 1

db.products.findOne(
{_id:5},
{name:1,color:1},

)

MongoDB find()

db.collection.find(query, projection)

use bookdb

db.books.insertMany([
    { "_id" : 2, "title" : "Android in Action, Second Edition", "isbn" : "1935182722", "categories" : [ "Java" ] },
    { "_id" : 3, "title" : "Specification by Example", "isbn" : "1617290084", "categories" : [ "Software Engineering" ] },
    { "_id" : 4, "title" : "Flex 3 in Action", "isbn" : "1933988746", "categories" : [ "Internet" ] },
    { "_id" : 5, "title" : "Flex 4 in Action", "isbn" : "1935182420", "categories" : [ "Internet" ] },
    { "_id" : 6, "title" : "Collective Intelligence in Action", "isbn" : "1933988312", "categories" : [ "Internet" ] },
    { "_id" : 7, "title" : "Zend Framework in Action", "isbn" : "1933988320", "categories" : [ "Web Development" ] },
    { "_id" : 8, "title" : "Flex on Java", "isbn" : "1933988797", "categories" : [ "Internet" ] },
    { "_id" : 9, "title" : "Griffon in Action", "isbn" : "1935182234", "categories" : [ "Java" ] },
    { "_id" : 10, "title" : "OSGi in Depth", "isbn" : "193518217X", "categories" : [ "Java" ] },
    { "_id" : 11, "title" : "Flexible Rails", "isbn" : "1933988509", "categories" : [ "Web Development" ] },
    { "_id" : 13, "title" : "Hello! Flex 4", "isbn" : "1933988762", "categories" : [ "Internet" ] },
    { "_id" : 14, "title" : "Coffeehouse", "isbn" : "1884777384", "categories" : [ "Miscellaneous" ] },
    { "_id" : 15, "title" : "Team Foundation Server 2008 in Action", "isbn" : "1933988592", "categories" : [ "Microsoft .NET" ] },
    { "_id" : 16, "title" : "Brownfield Application Development in .NET", "isbn" : "1933988711", "categories" : [ "Microsoft" ] },
    { "_id" : 17, "title" : "MongoDB in Action", "isbn" : "1935182870", "categories" : [ "Next Generation Databases" ] },
    { "_id" : 18, "title" : "Distributed Application Development with PowerBuilder 6.0", "isbn" : "1884777686", "categories" : [ "PowerBuilder" ] },
    { "_id" : 19, "title" : "Jaguar Development with PowerBuilder 7", "isbn" : "1884777864", "categories" : [ "PowerBuilder", "Client-Server" ] },
    { "_id" : 20, "title" : "Taming Jaguar", "isbn" : "1884777686", "categories" : [ "PowerBuilder" ] },
    { "_id" : 21, "title" : "3D User Interfaces with Java 3D", "isbn" : "1884777902", "categories" : [ "Java", "Computer Graphics" ] },
    { "_id" : 22, "title" : "Hibernate in Action", "isbn" : "193239415X", "categories" : [ "Java" ] },
    { "_id" : 23, "title" : "Hibernate in Action (Chinese Edition)", "categories" : [ "Java" ] },
    { "_id" : 24, "title" : "Java Persistence with Hibernate", "isbn" : "1932394885", "categories" : [ "Java" ] },
    { "_id" : 25, "title" : "JSTL in Action", "isbn" : "1930110529", "categories" : [ "Internet" ] },
    { "_id" : 26, "title" : "iBATIS in Action", "isbn" : "1932394826", "categories" : [ "Web Development" ] },
    { "_id" : 27, "title" : "Designing Hard Software", "isbn" : "133046192", "categories" : [ "Object-Oriented Programming", "S" ] },
    { "_id" : 28, "title" : "Hibernate Search in Action", "isbn" : "1933988649", "categories" : [ "Java" ] },
    { "_id" : 29, "title" : "jQuery in Action", "isbn" : "1933988355", "categories" : [ "Web Development" ] },
    { "_id" : 30, "title" : "jQuery in Action, Second Edition", "isbn" : "1935182323", "categories" : [ "Java" ] }
]);

Example 1

db.books.find({ categories: 'Java'}, { title: 1,isbn: 1})

Example 2

db.books.find({ categories: 'Java'}, { title: 1,isbn: 1}).limit(3)

MongoDB projection

Example 1

db.products.find({price: 899})

Example 2

db.products.find({}, {
    name: 1,
    price: 1
});

Example 3

db.products.find({}, {
    name: 1,
    price: 1
,
    _id: 0
});

Example

db.products.find({_id:1}, {
    releaseDate: 0,
    spec: 0,
    storage: 0
})

Comparison query operations

Rwanda ICT Chamber programme

Data Engineering

2023-05-10

Introduction to Data engineering

Understanding DS Teams

Analytical process

Advantages of T-shaped Teams

Moving to a T-shaped DS Team

What are these expert roles?

Data science hieracrhy of needs.

What are these expert roles?

Analyst (Business Analyst)

Data Scientist

Database Administrator

Data Engineer

SysAdmin

DevOps

Data engineering and its role in data lifecyle

ETL data pipeline

Data engineering pipeline

major pitfalls in building data pipelines:

Introduction to data warehousing and data lakes

Data warehouse

Relational Databases and SQL Fundamentals

Assignment I

Postgresql download windows and instalations

Uploading data from CSV to database

Creating database

Importing data postgressql table

check if data loaded cullectly

SQL Commands

Data Definition Language(DDL)

Data Query Language(DQL)

Last but not least: JOINS

Data analysis using SQL

Assignment II

Question 1.

Question 2.

NOSQL databases

Four V’s of Big Data

SQL VS NoSQL

Common Characteristics of NoSQL database

NoSQL Originators

Different Types of NoSQL

Mongodb

why Document-based

Mongodb instration guide and evirenement setup

Mongodb instration Guid

Mobgodb Commands