NOTE I: Over the last few years, data science has evolved into a multidisciplinary field in which specialist roles have become more important.
With this shift, companies are increasingly looking to hire “T”-shaped individuals to join their analytics teams.
Tasks that serve the immediate needs of clients/managers.
Deciding how to divide up the work is known as demand-leveling and the traditional approach is to balance resources and demand based on availability of the team.
NOTE I: Use experts to solve bottlenecks and non-experts to provide additional support.
By having a data science team of individuals that are T-shaped, experts can offload menial tasks to non-experts that have the required basic knowledge.
So, besides efficiency, what other advantages are there in T-shaped teams?
Essential to avoid the paradox of expertise
Assess what skills and knowledge each individual already has.
Note II. Unfortunately, there isn’t a universal cheat sheet…
Ensure that work is broken down into incremental outcomes across the work streams.
The journey might look a bit different for each individual, but the general career path looks like this:
Note III. The typical data analyst role is consulting-centric….
Analysts usually have a strong economics or business background supported by their knowledge of statistics and mathematics. As opposed to data scientists, analysts are more generalist and tend to be more flexible in the job market.
Knowledge of
Note IV. A data scientist is someone who is better at statistics than any software engineer, and better at software engineering than any statistician.
Job Description
Although it is important for data scientists to have a good understanding of business processes, the majority of their work involves solving complex problems by developing new tools, methods, or procedures. While the analyst is closely involved in answering an organization’s business questions on a day-to-day basis, the data scientist works at a macro level to develop ways to meet those business needs. It is important to develop analyses in a structured way so that they can be automated and scaled if the business requires them on a regular basis.
This is a much more specialized role and organisations that fully utilize the data scientist skillset can be hard to find.
Knowledge of
Note V. Contingency is the name of the game…
Job Description
A database administrator, commonly abbreviated as DBA, maintains the integrity and functioning of a database. This position entails running regular diagnostic tests to ensure data is not corrupt and combing for bugs or glitches within the system. Safely storing and backing up data in case of system failure or memory loss, and creating plans for addressing large-scale errors, are also important responsibilities of a DBA.
It is important that DBAs work closely with SysAdmins to ensure high availability of the servers supporting clusters.
Knowledge of
Note VI. Overseeing the technical part of data…
Job Description
Data Engineers are usually senior individuals in the organisation with extensive knowledge of data models, databases, IT infrastructure and software engineering.
Your data engineers are responsible for building optimized data flows that can be relied on in everyday decision-making and operations. To accomplish this, data engineers need experience in building database architecture: allocating data storage, establishing rules for data flow and, most importantly, choosing the correct technology stack to run the data pipelines.
Knowledge of
Note VII. When it’s good, it’s good, but when it’s bad, hell hath no fury like a SysAdmin scorned…
Job Description
System administrators (SysAdmins) are benevolent creatures with endless power who make sure your computer and network remain in good working order no matter what the silly data engineers/scientists/devops do.
This is by far the most important role if an organisation is to succeed, as the SysAdmin is responsible for keeping the system running securely no matter what the workload is. SysAdmins are also responsible for configuration management tools so that a system can be restored procedurally if it goes down.
Knowledge of
Note VIII. DevOps transforms the delivery capability of development and software teams….
Job Description
Development & Operations (DevOps) is a set of practices and processes intended to speed up and automate the developing, testing, and releasing of software, allowing for the continuous delivery of software and software updates. DevOps is the team responsible for putting models, applications, dashboards, and APIs into production and orchestrating how each piece of software interacts with the others and with the public.
Knowledge of
The data engineering lifecycle comprises the stages that turn raw data ingredients into a useful end product, ready for consumption by analysts, data scientists, ML engineers, and others.
We divide the data engineering lifecycle into five stages: generation, storage, ingestion, transformation, and serving.
We cannot engineer data we do not have, which is why data generation is the first stage in the data lifecycle. Data is generated by many different sources: data collection tools, web scraping, electronic devices such as PCs, smartphones, and sensors, and so on.
Ingestion moves data from multiple sources (SQL and NoSQL databases, IoT devices, websites, streaming services, etc.) to a target system to be transformed for further analysis. Data comes in various forms and can be both structured and unstructured.
Transformation adjusts disparate data to the needs of end users. It involves removing errors and duplicates from data, normalizing it, and converting it into the needed format.
Serving delivers transformed data to end users, such as a BI platform, a dashboard, or a data science team.
Note I. There are many types of data pipelines, such as ETL and ELT, but ETL is the most commonly used.
Note II. The mechanism that automates the ingestion, transformation, and serving steps of the data engineering process is known as a data pipeline.
Commonly, ETL pipelines are used for
Note II. Besides a pipeline, a data warehouse must be built to support and facilitate data science activities. Let’s see how it works.
A data warehouse (DW) is a central repository storing data in queryable form. From a technical standpoint, a data warehouse is a relational database optimized for reading, aggregating, and querying large volumes of data. Traditionally, DWs only contained structured data, or data that can be arranged in tables. However, modern DWs can also support unstructured data (such as images, PDF files, and audio formats).
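To make the pipeline and warehouse ideas concrete, here is a toy ETL sketch in plain SQL. All table names, column names, and file paths below are hypothetical illustrations, not part of this course's dataset:
-- Extract/ingest: land raw data in a staging table
COPY staging_sales FROM '/path/to/raw_sales.csv' WITH (FORMAT csv, HEADER true);
-- Transform: clean the data (drop missing amounts, remove duplicates)
-- and move it into the warehouse fact table
INSERT INTO fact_sales (sale_date, customer_id, amount)
SELECT DISTINCT sale_date, customer_id, amount
FROM staging_sales
WHERE amount IS NOT NULL;
-- Serve: expose an aggregated view for BI tools and analysts
CREATE VIEW daily_sales AS
SELECT sale_date, SUM(amount) AS total_amount
FROM fact_sales
GROUP BY sale_date;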
Setting up a secure and reliable data flow is challenging. Many things can go wrong during data transportation: data can be corrupted, bottlenecks can cause latency, or data sources may conflict, generating duplicate or incorrect data. Getting data into one place requires careful planning and testing to filter out junk data, eliminate duplicates and incompatible data types, and obfuscate sensitive information, all without dropping critical data.
Read the instructions below and download the materials:
https://drive.google.com/drive/folders/1vVU17PaTe9yUfuA4SQ_7xcpe1WjmSlgI?usp=sharing
Navigate to the folder called setup and download PostgreSQL and DBeaver.
Install PostgreSQL.
NB: you have to remember the password you set.
Install DBeaver.
Set up the connection to the database.
If everything is good, you now have to check whether PostgreSQL installed correctly. Enter your password. If we get the same output, we are safe: PostgreSQL is installed correctly.
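A quick sanity check is to run a query against the server, for example in the DBeaver SQL editor (the exact output will vary with your PostgreSQL version):
-- returns the server version string if the connection works
SELECT version();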
We have two ways to create a database. One is the SQL command:
CREATE DATABASE training;
The other is through the DBeaver interface.
After creating the database, we now have to create a table with the following SQL command:
drop table if exists property;
create table property(
property_type VARCHAR(255),
addresslocality VARCHAR(255),
bedrooms INT,
bathrooms INT,
derived_lcy DOUBLE precision not null
);
We run all remaining queries using the DBeaver SQL editor:
select
property_type,
addresslocality,
bedrooms,
bathrooms,
derived_lcy
from
public.property;
or
select
*
from
public.property;
Create a table
In order to upload our synthetic transactions data set, we need to create the table structure into which the data must go. These structures can become quite complex, but for now we are only going to create a fact table and a dimension table (transactions and customer information, respectively). We allocate our PRIMARY KEY as the date of the transaction, the id, and the customer who performed the transaction.
create table customers(
customer VARCHAR(20),
gender VARCHAR(100),
age int,
primary key(customer));
Note I. Keys are an important feature which can optimize looking up a transaction and ensure performance while maintaining data integrity. We don’t cover keys in this course, but it’s something to be aware of, especially when we start learning about joins.
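As a quick illustration of the integrity part, the primary key on customers rejects duplicate rows. The values here are made up for the example:
-- the first insert succeeds; the second fails with a duplicate-key error
INSERT INTO customers VALUES ('C001', 'F', 31);
INSERT INTO customers VALUES ('C001', 'F', 31);
-- clean up the example row afterwards
DELETE FROM customers WHERE customer = 'C001';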
Now we are going to load the customer information. This table contains the demographic information associated with each customer.
create table transactions(
transaction_date DATE,
transaction_id VARCHAR(20),
customer VARCHAR(20),
sku VARCHAR(100),
amount decimal,
primary key(transaction_date,
transaction_id,
customer),
foreign key(customer) references customers(customer)
);
From the two CREATE TABLE statements you can see that the two tables are linked via the customer column. We will be learning JOINS near the end of today, which will join the tables together so that we can get demographic information on transactions.
Our final table structures
Loading data from the Google Drive folder (Transaction.csv, customer.csv)
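DBeaver's import wizard works well for this, but if you prefer SQL, COPY can load the files too. This is a sketch: the paths are placeholders for wherever you saved the CSVs, it assumes the CSV columns match the table definitions, and server-side COPY requires the server to be able to read the files (psql's \copy is the client-side variant):
-- load the downloaded CSVs into the tables created above
COPY customers FROM '/path/to/customer.csv' WITH (FORMAT csv, HEADER true);
COPY transactions FROM '/path/to/Transaction.csv' WITH (FORMAT csv, HEADER true);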
Note I. The KING of all statements: SELECT
Previously I decided I wanted to return all the columns (*), but what if I only want to return one or two columns?
SELECT customer, amount FROM transactions LIMIT 10;
Let’s build a bigger SELECT statement (I like 3-tab indentation):
select
transaction_date,
customer,
sku,
amount
from
transactions
limit 10;
SELECT but with filter criteria
What happens if we only want to return transactions of a certain type?
Well, then we can employ the WHERE clause. We are going to select the same columns as previously, but now we will specify the WHERE criteria on the sku column, keeping rows where it equals 'airtime':
select
transaction_date,
customer,
sku,
amount
from
transactions
where
sku = 'airtime'
limit 10;
Exercise
In your notebook, write the code to bring back 100 examples where the sku is p2p and transaction_date is 2020-08-19.
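One possible answer, if you want to check your attempt (try it yourself first):
select
transaction_date,
customer,
sku,
amount
from
transactions
where
sku = 'p2p'
and transaction_date = '2020-08-19'
limit 100;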
SELECT but with filter criteria and order
In certain circumstances, it is necessary to order your data to get the correct output. For instance, if we want to get the top 10 largest amounts:
select
transaction_date,
customer,
sku,
amount
from
transactions
order by
amount desc
limit 10;
Aggregations (Pivoting) in SQL
Pivoting forms part of the aggregation functionality of SQL. This helps us answer questions like:
- What is the average amount spent per gender?
- Total value and volume per date?
- Total volume and value disaggregated by gender and age?
As you can see, aggregations with the GROUP BY clause get used OFTEN, so learn it well and get comfortable with it.
Let’s illustrate a basic example before moving on to complex queries. What is the total value per date?
select
transaction_date,
ROUND(SUM(amount))
from
transactions
group by
transaction_date
order by
transaction_date desc
limit 10;
What is the total value, volume, and distinct customers doing p2p per day?
select
transaction_date,
ROUND(SUM(amount)) value,
COUNT(*) volume,
COUNT(distinct customer) distinct_cust
from
transactions
where
sku = 'p2p'
group by
transaction_date
order by
transaction_date desc
limit 10;
Notice how I ALIAS my aggregations as {aggregation} then {name}. This will make your life a lot easier, and in some cases it is mandatory… as in joins.
By now you are asking yourself: if we designed our database in the beautiful star schema that we talked about earlier, how do we join all the information together again? This is where JOINS come in, and there are a multitude of them. The most important ones are LEFT JOIN and INNER JOIN:
Let’s attempt a basic join before we combine joins with aggregations. To start off, we will JOIN the customers table onto the transactions table:
select
*
from
transactions as trans
left join customers as cust
on
trans.customer = cust.customer
limit 10;
There are ways to optimize your joins to be extremely fast.
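One common technique is to index the join key; whether it actually helps depends on table sizes and the query planner, so treat this as a sketch:
-- index the join key on the big table
-- (PostgreSQL indexes primary keys automatically, but customer is not
-- the leading column of the transactions primary key)
CREATE INDEX idx_transactions_customer ON transactions (customer);
-- EXPLAIN shows the query plan, so you can see whether the index is used
EXPLAIN
select
*
from
transactions as trans
left join customers as cust
on
trans.customer = cust.customer;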
Notice how I use the term USING and not ON. If your columns are named the same in both tables, this is a much cleaner way to join.
select
transaction_date,
gender,
age,
COUNT(*) volume,
ROUND(SUM(amount)) value,
COUNT(distinct customer) distinct_customers
from
transactions as trans
left join customers as cust
using(customer)
where
sku = 'p2p'
group by
transaction_date,
gender,
age
limit 10;
Follow the instructions and steps below, and then answer the provided questions from the database you have created.
A
Step 1
Step 2
Step 3: Create the following tables in the AssignmentII database and import each file we have into the corresponding table.
B
Produce an ER diagram of the database created.
Non-relational
Cluster-friendly
Open source
Schema-less
They don’t have a fixed schema
They allow you to store any data in any record
Google (BigTable, LevelDB)
LinkedIn (Voldemort)
Facebook (Cassandra)
Twitter (Hadoop HBase, FlockDB, Cassandra)
Netflix (SimpleDB, Hadoop HBase, Cassandra)
Key Value Store
A key that refers to a payload (actual content / data)
MemcacheDB, Azure Table Storage, Redis
Column Store
Column data is saved together, as opposed to row data
Super useful for data analytics
Hadoop, Cassandra, Hypertable
Document / XML / Object Store
Key (and possibly other indexes) point at a serialized object
DB can operate against values in document
MongoDB, CouchDB, RavenDB
Graph Store
Nodes are stored independently, and the relationship between nodes (edges) are stored with data
Neo4j
Handles Schema Changes Well (easy development)
Solves Impedance Mismatch problem
Usually in JSON
Not really schema-less
Implicit schema to retrieve specific values
E.g.: I want the price of an order!
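For instance, reading the price of an order assumes the application knows that order documents carry a price field. A minimal sketch, using a hypothetical orders collection:
// nothing in MongoDB enforces that a price field exists;
// the schema lives implicitly in the application's expectations
db.orders.findOne({ _id: 1 }, { price: 1 })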
use bookdb
use can be used to switch to a database, or to create it if it does not exist.
db.books.insertOne(
{
title: 'MongoDB insertOne',
isbn: '0-7617-6154-3'
}
)
Check if the data was inserted correctly:
db.books.find()
Output
{
"_id" : ObjectId("64855cf6acadb7ef84350738"),
"itle" : "MongoDB insertOne",
"isbn" : "0-7617-6154-3"
}
db.books.insertOne(
{
_id: 1,
title: "Mastering Big Data",
isbn: "0-9270-4986-4"
}
)
Output
{
"_id" : ObjectId("64855cf6acadb7ef84350738"),
"itle" : "MongoDB insertOne",
"isbn" : "0-7617-6154-3"
}
{
"_id" : NumberInt(1),
"title" : "Mastering Big Data",
"isbn" : "0-9270-4986-4"
}
db.books.insertMany(
[
{ title: "NoSQL Distilled", isbn: "0-4696-7030-4",autor:"sample name"},
{ title: "NoSQL in 7 Days", isbn: "0-4086-6859-8"},
{ title: "NoSQL Database", isbn: "0-2504-6932-4"},
]
)
use productdb
db.products.insertMany([
{ "_id" : 1, "name" : "xPhone", "price" : 799, "releaseDate": ISODate("2011-05-14"), "spec" : { "ram" : 4, "screen" : 6.5, "cpu" : 2.66 },"color":["white","black"],"storage":[64,128,256]},
{ "_id" : 2, "name" : "xTablet", "price" : 899, "releaseDate": ISODate("2011-09-01") , "spec" : { "ram" : 16, "screen" : 9.5, "cpu" : 3.66 },"color":["white","black","purple"],"storage":[128,256,512]},
{ "_id" : 3, "name" : "SmartTablet", "price" : 899, "releaseDate": ISODate("2015-01-14"), "spec" : { "ram" : 12, "screen" : 9.7, "cpu" : 3.66 },"color":["blue"],"storage":[16,64,128]},
{ "_id" : 4, "name" : "SmartPad", "price" : 699, "releaseDate": ISODate("2020-05-14"),"spec" : { "ram" : 8, "screen" : 9.7, "cpu" : 1.66 },"color":["white","orange","gold","gray"],"storage":[128,256,1024]},
{ "_id" : 5, "name" : "SmartPhone", "price" : 599,"releaseDate": ISODate("2022-09-14"), "spec" : { "ram" : 4, "screen" : 5.7, "cpu" : 1.66 },"color":["white","orange","gold","gray"],"storage":[128,256]}
])
db.products.findOne(
{
_id:2
}
)
Syntax: db.collection.findOne(query, projection)
db.products.findOne({})
Example 1
db.products.findOne(
{_id:5},
{ name: 1, color: 1 }
)
Syntax: db.collection.find(query, projection)
use bookdb
db.books.insertMany([
{ "_id" : 2, "title" : "Android in Action, Second Edition", "isbn" : "1935182722", "categories" : [ "Java" ] },
{ "_id" : 3, "title" : "Specification by Example", "isbn" : "1617290084", "categories" : [ "Software Engineering" ] },
{ "_id" : 4, "title" : "Flex 3 in Action", "isbn" : "1933988746", "categories" : [ "Internet" ] },
{ "_id" : 5, "title" : "Flex 4 in Action", "isbn" : "1935182420", "categories" : [ "Internet" ] },
{ "_id" : 6, "title" : "Collective Intelligence in Action", "isbn" : "1933988312", "categories" : [ "Internet" ] },
{ "_id" : 7, "title" : "Zend Framework in Action", "isbn" : "1933988320", "categories" : [ "Web Development" ] },
{ "_id" : 8, "title" : "Flex on Java", "isbn" : "1933988797", "categories" : [ "Internet" ] },
{ "_id" : 9, "title" : "Griffon in Action", "isbn" : "1935182234", "categories" : [ "Java" ] },
{ "_id" : 10, "title" : "OSGi in Depth", "isbn" : "193518217X", "categories" : [ "Java" ] },
{ "_id" : 11, "title" : "Flexible Rails", "isbn" : "1933988509", "categories" : [ "Web Development" ] },
{ "_id" : 13, "title" : "Hello! Flex 4", "isbn" : "1933988762", "categories" : [ "Internet" ] },
{ "_id" : 14, "title" : "Coffeehouse", "isbn" : "1884777384", "categories" : [ "Miscellaneous" ] },
{ "_id" : 15, "title" : "Team Foundation Server 2008 in Action", "isbn" : "1933988592", "categories" : [ "Microsoft .NET" ] },
{ "_id" : 16, "title" : "Brownfield Application Development in .NET", "isbn" : "1933988711", "categories" : [ "Microsoft" ] },
{ "_id" : 17, "title" : "MongoDB in Action", "isbn" : "1935182870", "categories" : [ "Next Generation Databases" ] },
{ "_id" : 18, "title" : "Distributed Application Development with PowerBuilder 6.0", "isbn" : "1884777686", "categories" : [ "PowerBuilder" ] },
{ "_id" : 19, "title" : "Jaguar Development with PowerBuilder 7", "isbn" : "1884777864", "categories" : [ "PowerBuilder", "Client-Server" ] },
{ "_id" : 20, "title" : "Taming Jaguar", "isbn" : "1884777686", "categories" : [ "PowerBuilder" ] },
{ "_id" : 21, "title" : "3D User Interfaces with Java 3D", "isbn" : "1884777902", "categories" : [ "Java", "Computer Graphics" ] },
{ "_id" : 22, "title" : "Hibernate in Action", "isbn" : "193239415X", "categories" : [ "Java" ] },
{ "_id" : 23, "title" : "Hibernate in Action (Chinese Edition)", "categories" : [ "Java" ] },
{ "_id" : 24, "title" : "Java Persistence with Hibernate", "isbn" : "1932394885", "categories" : [ "Java" ] },
{ "_id" : 25, "title" : "JSTL in Action", "isbn" : "1930110529", "categories" : [ "Internet" ] },
{ "_id" : 26, "title" : "iBATIS in Action", "isbn" : "1932394826", "categories" : [ "Web Development" ] },
{ "_id" : 27, "title" : "Designing Hard Software", "isbn" : "133046192", "categories" : [ "Object-Oriented Programming", "S" ] },
{ "_id" : 28, "title" : "Hibernate Search in Action", "isbn" : "1933988649", "categories" : [ "Java" ] },
{ "_id" : 29, "title" : "jQuery in Action", "isbn" : "1933988355", "categories" : [ "Web Development" ] },
{ "_id" : 30, "title" : "jQuery in Action, Second Edition", "isbn" : "1935182323", "categories" : [ "Java" ] }
]);
Example 1
db.books.find({ categories: 'Java' }, { title: 1, isbn: 1 })
Example 2
db.books.find({ categories: 'Java' }, { title: 1, isbn: 1 }).limit(3)
Example 1
db.products.find({ price: 899 })
Example 2
db.products.find({}, {
name: 1,
price: 1
});
Example 3
db.products.find({}, {
name: 1,
price: 1,
_id: 0
});
Example
db.products.find({_id:1}, {
releaseDate: 0,
spec: 0,
storage: 0
})
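Combining the two ideas above, a filter and a projection can also be used together; a small sketch against the products collection we created:
// return only name and price for products that cost 899
db.products.find(
{ price: 899 },
{ name: 1, price: 1, _id: 0 }
)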