Data Analytics Certification Notes
Introduction
These are notes prepared while studying for the AWS Certified Data Analytics Specialty 2022 exam. They cover important concepts for the following domains of Data Analytics:
- Collection
- Storage
- Processing
- Analysis
- Visualization
- Domain & Security
Collection
- Kinesis (Data Streams,Producers,Consumers,Enhanced Fan Out,Scaling,Security,Data Firehose)
- SQS
- IoT
- Database Migration Service (DMS)
- Direct Connect
- Snow Family
- MSK (Managed Streaming, Connect, Serverless)
Storage
- S3 (Storage,Lifecycle Rules,Versioning,Replication,Performance,Security,Event Notifications)
- DynamoDB (Basics,APIs,Indexes,PartiQL,DAX,Streams,TTL,Patterns,Security)
- ElastiCache (Fundamental)
Processing
- Lambda
- Glue, Hive, ETL (Catalog, end-points, Costs)
- Glue Studio & DataBrew
- Glue Elastic Views
- Lake Formation
- Infrastructure (EMR, Hadoop, Serverless, Apache Spark)
- Spark integration with Kinesis & Redshift
- Applications on EMR (Hive,Pig,HBase,Presto,Zeppelin,Hue,Splunk,Flume)
- Data Pipeline
- Step Functions
Analysis
- Kinesis Analytics
- OpenSearch
- Athena
- Redshift
Visualization
- Quicksight (Pricing,Dashboards, ML Insights)
Domain & Security
- S3 Encryption
- KMS (Basics,Key Rotation)
- Cloud HSM
- STS & Cross Account
- Identity Federation
- Policies
- CloudTrail
- VPC Endpoints
Other Topics
- EC2 for Big Data
- AWS AppSync and Kendra
- AWS Data Exchange
- AWS AppFlow
- AWS Cleanup
- Sagemaker
Collection
There are multiple ways to collect data in AWS.
- Real-time collection - where we can take immediate action on our data
- KDS, SQS, IoT - these services help you react in real time to events or data happening in your infrastructure
- Near-real-time - reactive actions
- Firehose, DMS
- Batch - historical analysis - when we want to move large amounts of data to perform data analysis
- Snowball, Data Pipeline
Kinesis Data Streams
Overview
- A way to stream big data into your systems. A stream is made of multiple shards, which must be provisioned ahead of time
- Shards - data is split across all the shards; they define the stream capacity in terms of ingestion and consumption rates
- Producers - send data (produce records) into KDS and can be many things, e.g. applications, clients, SDK, KPL, Kinesis Agent. Data can be sent at a rate of 1 MB/sec or 1000 msg/sec per shard
- Records - made of a partition key and a data blob (up to 1 MB); a put_record sketch appears at the end of this overview
- Partition key - determines which shard the record will go to
- Data Blob - the value itself
- Consumers - applications, Lambda functions, Firehose, Kinesis Data Analytics
- When a consumer receives a record, it receives the partition key, a sequence number (where the record was in the shard), and the data blob
- 2 MB/sec (shared) per shard all consumers
- 2 MB/sec (enhanced) per shard per consumer
- Properties:
- Retention period - 1 day to 1 year
- Ability to reprocess data
- Once data is inserted, it can’t be deleted (immutable)
- Data that shares same partition, goes to same shard
- Capacity Modes:
- Provisioned Mode
- Choose # of shards and scale manually or through API
- Throughput per shard: 1 MB/sec in & 2 MB/sec out (both classic and enhanced fan-out consumers)
- Pay per shard provisioned per hour
- On-demand Mode
- No provision needed
- Default capacity provisioned (4 MB/sec)
- Automatic scaling (observed throughput peak during last 30 days)
- Pay per stream per hour & data in/out per GB
- Security:
- It is within region (where we have shards)
- IAM policies for shards
- Encryption
- in-flight: HTTPS endpoints
- at-rest: KMS
- client side
- VPC Endpoints available for Kinesis
- Monitor API calls through CloudTrail
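As a concrete illustration, a minimal boto3 sketch of a producer sending one record; the stream name and payload are made-up examples, not part of the original notes.

```python
# Minimal sketch (boto3): send one record to a Kinesis Data Stream.
# "my-stream" and the payload are hypothetical examples.
import json
import boto3

kinesis = boto3.client("kinesis")

response = kinesis.put_record(
    StreamName="my-stream",
    Data=json.dumps({"sensor_id": "s1", "temperature": 21.5}).encode("utf-8"),
    PartitionKey="s1",  # records sharing this key land on the same shard
)
print(response["ShardId"], response["SequenceNumber"])
```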
Producers
How data is ingested into Kinesis Data Streams.
- SDK - allows you to write code or use the CLI to send data directly into Kinesis Data Streams
- PutRecord(s) API
- API to send one or more records
- Uses batching to increase throughput (fewer HTTP requests, since many records are sent as part of one HTTP request)
- If we exceed the throughput limits, we get a ProvisionedThroughputExceeded error
- The SDK can be used in many different ways, e.g. the Mobile SDK (Android, iOS)
- Use case - low throughput, when higher latency is acceptable, a simple API suffices, or when working directly from Lambda
- AWS Managed sources (uses SDK) for KDS - CW logs, IoT, Kinesis Data Analytics
- ProvisionedThroughputExceeded
- Happens when we send more data than provisioned, i.e. exceed the MB/sec or TPS limit for a shard
- Often due to a hot shard (one partition key is over-represented, so excess data lands in that shard). Distribute the partition key as evenly as possible
- (+) Retries with backoff (see the put_records sketch at the end of this section)
- (+) Increase shards (scaling)
- (+) Ensure the partition key is well distributed
- Kinesis Producer Library (KPL) - more advanced library for writing better producer code, with good features for enhanced throughput
- Easy to use and highly configurable C++/Java library
- Used for building high performance, long running producers
- Automated + configurable retry mechanism (automatically deals with issues you would otherwise handle yourself with the SDK)
- 2 types of APIs:
- Synchronous: same as the SDK
- Asynchronous: better performance for asynchronous processing
- Submits metrics to CW for monitoring
- Supports batching - increased throughput + decreased cost (ON by default)
- Collect: records are collected and written to multiple shards in the same PutRecords API call
- Aggregate: stores multiple records in one record (adds some latency) to increase payload size and improve throughput
- Compression (must be implemented by the user) - makes records smaller
- To read KPL records, we need the KCL or a special helper library (can’t use the CLI)
- Batching
- Say we are sending a 2 KB record to Kinesis Data Streams using the KPL
- It won’t be sent right away; the KPL waits for the next records that might be coming
- At some point, the KPL aggregates several records into one record, and it can do this multiple times
- To make it even more efficient, it then collects all aggregated records into a single PutRecords API call
- RecordMaxBufferedTime introduces some delay while waiting for records to go together in one API call (default is 100 ms)
- WHEN NOT TO USE KPL - an application that can’t tolerate this additional delay is not a good use case (use the SDK directly instead)
- Kinesis Agent - Linux program that runs on servers to fetch log files and send them reliably to Kinesis Data Streams
- Java-based agent, built upon the KPL
- Only for Linux-based systems
- Features:
- Write from multiple directories and write to multiple streams
- Routing feature based on directory / log file
- Pre-process data before sending to streams
- Handles log file rotation, checkpointing, and retry upon failure
- Emits metrics to CW for monitoring
- Third party libraries - Spark, Flume, log4j, Kafka Connect, NiFi
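A minimal boto3 sketch of the batching + retry-with-backoff advice above; the stream name, payload shape, and helper function name are assumptions for illustration, not part of the original notes.

```python
# Minimal sketch (boto3): batch records with PutRecords and retry failed
# records with exponential backoff (e.g. on ProvisionedThroughputExceeded).
# Stream name and payloads are hypothetical.
import json
import time
import boto3

kinesis = boto3.client("kinesis")

def put_records_with_backoff(stream_name, records, max_attempts=5):
    entries = [
        {"Data": json.dumps(r).encode("utf-8"), "PartitionKey": str(r["key"])}
        for r in records
    ]
    for attempt in range(max_attempts):
        resp = kinesis.put_records(StreamName=stream_name, Records=entries)
        if resp["FailedRecordCount"] == 0:
            return
        # Keep only the entries that failed and retry after a backoff delay
        entries = [e for e, r in zip(entries, resp["Records"]) if "ErrorCode" in r]
        time.sleep(2 ** attempt * 0.1)
    raise RuntimeError(f"{len(entries)} records still failing after retries")

put_records_with_backoff("my-stream", [{"key": i, "value": i * 2} for i in range(100)])
```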
Consumers
How data is consumed from Kinesis Data Streams.
- SDK
- GetRecord(s) API
- Records are polled by consumers from a shard
- Each shard has 2 MB/sec total aggregate throughput, i.e. 3 shards means 6 MB/sec total
- GetRecords returns up to 10 MB of data (then throttles for 5 seconds) or up to 10,000 records
- Limit - max of 5 GetRecords() calls per shard per second, i.e. ~200 ms latency
- More consumers = less throughput per consumer (see the polling sketch at the end of this section)
- Kinesis Client Library (KCL) - just as we produce with the KPL, we consume data with the KCL
- Java-first library but exists for other languages too (Golang, Python, Ruby, Node, .Net)
- Read records from Kinesis produced with KPL (de-aggregation)
- Multiple consumers with multiple shards in one group - Shard Discovery
- Checkpointing - feature to resume progress
- Uses DynamoDB for coordination and checkpointing
- Provision DynamoDB (WCU/RCU)
- Or use On-demand
- If we get ExpiredIteratorException, we should increase WCU
- Record processors will process the data
- Example below
- We have a Kinesis Streams with 4 shards
- We have a DynamoDB for Checkpointing and Coordination
- We have 2 KCL applications running on two different EC2 in same group
- With the help of shard discovery mechanism, KCL1 will be reading from Shard 1&2 and KCL2 will be reading from Shard 3&4
- Then KCL applications will be checkpointing with DynamoDB
- Kinesis Connector Library - older Java library (2016); uses the KCL under the hood and runs on EC2 to write data to different destinations
- Its sole purpose is to take data from Kinesis Data Streams and write it to S3, DynamoDB, Redshift, ElasticSearch
- This service has been replaced by Firehose and Lambda together
- Third party libraries - Apache Spark, Log4j, Appenders, Flume, Kafka Connect
- Kinesis Firehose
- AWS Lambda
- Can source records from Kinesis Data Streams
- Has a library to de-aggregate records produced with the KPL
- Can be used to run lightweight ETL to S3, DynamoDB, Redshift, ElasticSearch
- Reads in real time from Kinesis Data Streams and can trigger notifications (with a configurable batch size)
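A minimal boto3 sketch of SDK-based polling with GetShardIterator/GetRecords (this is what the KCL automates for you); the stream name is a made-up example.

```python
# Minimal sketch (boto3): poll a single shard with the GetRecords API.
# In practice the KCL handles shard discovery and checkpointing for you.
import time
import boto3

kinesis = boto3.client("kinesis")
stream = "my-stream"  # hypothetical stream name

shard_id = kinesis.list_shards(StreamName=stream)["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName=stream,
    ShardId=shard_id,
    ShardIteratorType="TRIM_HORIZON",  # read from the oldest available record
)["ShardIterator"]

while iterator:
    resp = kinesis.get_records(ShardIterator=iterator, Limit=1000)
    for record in resp["Records"]:
        print(record["PartitionKey"], record["SequenceNumber"], record["Data"])
    iterator = resp["NextShardIterator"]
    time.sleep(0.2)  # stay under the 5 GetRecords calls/sec/shard limit
```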
Enhanced Fan Out
- Works with KCL 2.0 & AWS Lambda
- Each consumer gets 2 MB/sec of provisioned throughput per shard, i.e. data is pushed at 2 MB/sec once the consumer calls SubscribeToShard()
- Enhanced Fan Out - data is pushed to consumers over HTTP/2
- Supports many more consumers + reduced latency (~70 ms)
- Standard Consumers vs Enhanced Fan out
- SC - Low # of consuming applications, tolerate ~200 ms latency, low cost
- EF - Multiple consumer applications, low latency ~70 ms and higher costs
Scaling
- Operations
- Add Shards - Shard splitting, inc stream capacity (1 MB/s data in per shard), divide a hot shard, old shard is closed and will be deleted once the data is expired, helps to improve throughput
- Merge Shards - dec stream capacity + save cost, group 2 shards with low traffic, old shards are closed and deleted based on data expiration
- Out-of-order records :
- Reason is Resharding
- Read from child shards
- Data that hasn’t been read yet is still in the parent shard
- After resharding, read entirely from the parent until there are no new records, then move to the child shards
- KCL already has logic built-in even after resharding
- AutoScaling: not a native Kinesis feature; call the UpdateShardCount API to change the shard count and implement autoscaling with Lambda (see the sketch after this section)
- Limitations
- Resharding can’t be done in parallel (plan capacity in advance)
- Perform one resharding operation at a time
- For 1000 shards, it takes 30,000 seconds (~8 hrs) to double to 2000 shards
- Can’t scale:
- more than 10x for each rolling 24-hour period per stream
- up to more than double your current shard count
- down below half your current shard count
- up to more than 500 shards in a stream
- a stream with more than 500 shards down, unless the result is fewer than 500 shards
- up to more than the shard limit for your account
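A minimal boto3 sketch of the UpdateShardCount call such a scaling Lambda might make; the stream name is a made-up example, and the doubling target reflects the 2x-per-operation limit above.

```python
# Minimal sketch (boto3): double the shard count of a stream.
# In a single operation you can request at most 2x (and at least 0.5x)
# the current shard count.
import boto3

kinesis = boto3.client("kinesis")

current = len(kinesis.list_shards(StreamName="my-stream")["Shards"])  # simplistic count
kinesis.update_shard_count(
    StreamName="my-stream",
    TargetShardCount=current * 2,
    ScalingType="UNIFORM_SCALING",
)
```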
Handling Duplicate Records
- Producers:
- Due to network timeouts, duplicates can be created because the acknowledgement may not reach the producer
- Embed a unique record ID in the data so that the consumer can deduplicate
- Consumers:
- Retries can make application read data twice
- Retries happen when:
- worker terminates unexpectedly
- worker instances are added/removed
- shards are merged or split
- application is deployed
- Fixes:
- make the consumer application idempotent
- if final destination can handle duplicates, it is recommended to do it there
Security
- Control access/authorization using IAM
- Encryption in flight using HTTPS
- Encryption at rest using KMS
- Client side encryption
- VPC Endpoints available for Kinesis to access within VPC
Kinesis Firehose
- Kinesis Data Firehose is used to deliver data into target destinations (a minimal producer sketch appears at the end of this section)
- Near-real-time service, i.e. it accumulates data into big batches before writing to the target destinations (batch writes) - 60 sec minimum latency for non-full batches
- Diagram: producer sources and destination targets
- Fully managed AWS service, no administration required
- Loads data into S3, Redshift, ElasticSearch, Splunk
- Automatic scaling
- Supports many data formats
- Data conversion from JSON to Parquet/ORC (S3 only)
- Data transformation through Lambda (e.g. CSV to JSON)
- Supports compression when the target is S3 (GZIP, ZIP, SNAPPY)
- If the data is further loaded into Redshift, only GZIP is supported
- Pay only for the amount of data going through Firehose
- Spark/KCL do not read from KDF
Delivery Diagram
- Buffer Sizing
- Firehose accumulates records in a buffer
- The buffer is flushed when either the time rule (minimum 1 min) or the size rule (a few MB) reaches its configured value
- Firehose can automatically increase the buffer size to increase throughput
- High throughput means the buffer size limit will be hit first
- Low throughput means the buffer time limit will be hit first
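A minimal boto3 sketch of a producer writing to a Firehose delivery stream; delivery to the destination then follows the buffer rules above. The delivery stream name and payload are made-up examples.

```python
# Minimal sketch (boto3): send a record to a Firehose delivery stream and let
# the buffering rules decide when it is flushed to the destination (e.g. S3).
import json
import boto3

firehose = boto3.client("firehose")

firehose.put_record(
    DeliveryStreamName="my-delivery-stream",  # hypothetical name
    Record={"Data": (json.dumps({"event": "click", "user": "u1"}) + "\n").encode("utf-8")},
)
```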
Streams vs Firehose
- Streams:
- Write custom code (producer/consumer)
- Real time (~200 ms latency for classic, ~70 ms for enhanced fan out)
- Must manage scaling (shard splitting / merging)
- Data storage (1 to 365 days), replay capability, multi-consumers
- Use with Lambda to insert data in real time into ElasticSearch
- Firehose:
- Fully managed; data can be sent to S3, Splunk, Redshift, ElasticSearch
- Serverless data transformations with Lambda
- Near real time (lowest buffer time is 1 minute)
- Automated Scaling
- No data storage
SQS
- Producers will send message to SQS queues and consumers will poll messages from SQS
- Standard:
- Fully managed and auto-scaling (from 1 message per sec to 10k messages per second)
- Message retention: 4 days by default, configurable up to 14 days
- No limit of how many messages can be in a queue
- Low latency (< 10 ms on publish and receive)
- Horizontal scaling
- Can have duplicate messages (at-least-once delivery)
- Can have out-of-order messages (no FIFO guarantee)
- Message size limitation - 256 KB
- Producing Messages:
- Define the message body
- Add message attributes (metadata - optional)
- Provide delay delivery (optional)
- In return, we get a message identifier + MD5 hash of the body
- Consuming Messages:
- Poll SQS for messages (up to 10 messages at a time)
- Process messages within the visibility timeout
- Delete the message using the message ID and receipt handle (see the send/receive/delete sketch at the end of this section)
- SQS FIFO (First-in-First-Out)
- Queue name must end with the .fifo suffix
- Lower throughput
- Messages are processed in order
- Messages are sent exactly once
- 5-minute interval de-duplication using a ‘Deduplication ID’
- Not compatible with S3 event notifications
- SQS Extended Client
- Java library to send large messages (> 256 KB)
- Use cases:
- Decouple applications
- Buffer writes to db
- Handle large loads of messages coming in
- Can be integrated with Auto-Scaling using CW
- Limits:
- Max of 120k in-flight messages being processed by consumers
- Batch request has max of 10 messages
- Message content is limited to XML, JSON, unformatted text
- Standard SQS has unlimited TPS (transactions per second)
- FIFO supports up to 3000 messages per second (using batching)
- Max size is 256 KB
- Data retention from 1 minute to 14 days
- Pricing - per API request and network usage
- Security:
- Encryption in flight using HTTPS endpoint
- Can enable SSE using KMS (encrypts only the body, not the metadata)
- IAM policy to allow usage of SQS
- SQS queue access policy
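A minimal boto3 sketch of the produce/consume/delete cycle described above; the queue URL and message body are made-up examples.

```python
# Minimal sketch (boto3): send, receive, and delete an SQS message.
import boto3

sqs = boto3.client("sqs")
queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"  # hypothetical

# Produce: returns a MessageId and an MD5 hash of the body
sqs.send_message(QueueUrl=queue_url, MessageBody="order-1234", DelaySeconds=0)

# Consume: poll up to 10 messages, process, then delete within the visibility timeout
resp = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=10, WaitTimeSeconds=5)
for msg in resp.get("Messages", []):
    print(msg["Body"])
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
```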
Kinesis Vs SQS
| Kinesis Data Streams | SQS |
|---|---|
| Data can be consumed many times | Queue, decouple applications |
| Data is deleted after retention period | One application per queue |
| Ordering of records is preserved | Records are deleted after consumption |
| Build multiple applications reading from same stream (Pub/Sub) | Messages are processed independently |
| Streaming MapReduce quering capability | Ordering for FIFO queues |
| Checkpoint needed to track consumption progress | Capability to delay messages |
| Provisioned mode or on-demand mode | Dynamic scaling of load |
| Use cases: Fast log & event data collection/processing, Real-time metrics/reporting, mobile data capture, Real-time DA, Gaming data feed, Complex Stream Processing, Data feed from IoT | Use cases: Order, image processing, Auto-scaling queues, Buffer/batch messages for future processing, Request Offloading |
Other Collections
This section includes IoT, Database Migration Service (DMS), Direct Connect, Snow Family, Managed Streaming for Kafka (MSK)
IoT
DMS
Direct Connect
Snow Family
MSK
Kinesis vs MSK
Storage (S3)
This section includes a detailed overview of S3.
S3
| S3 | Description |
|---|---|
| Overview | * Store objects (files) in buckets (directories) * Globally unique bucket name * Defined at the regional level * Naming convention - no uppercase, no underscore, 3-63 characters long, not an IP, must start with a lowercase letter or number * Objects (files) have a key, where the key is the FULL path * Object values - max size is 5 TB; for uploads > 5 GB we must use multi-part upload; can have a Version ID * Metadata - list of text key/value pairs (system or user metadata) * Tags - up to 10 key-value pairs for security/lifecycle |
| Consistency | * Strong consistency * Whenever we successfully write (PUT) or overwrite an existing object, subsequent read request immediately receives latest version of the object * Same case when you list the new object |
| Durability | * Measures how often an object in S3 may be lost * Very high durability (99.999999999%, 11 9's) across multiple AZs * Same for all storage classes |
| Availability | * Measures how readily available a service is * Depends upon storage class * Standard Class has 99.99% availability (i.e. not available 53 min a year) |
| Storage Classes | * Standard - General Purpose * Standard-Infrequent Access (IA) * One Zone-Infrequent Access * Glacier Instant Retrieval, Flexible Retrieval, Deep Archive * Intelligent Tiering |
S3 Storage Classes
| Standard | StandardIA | OneZoneIA | Glacier | IT |
|---|---|---|---|---|
| * 99.99% Availability * Used for frequently accessed data * Low latency and high throughput * Sustains 2 concurrent facility failures * Use cases: Big Data analytics, mobile & gaming, content distribution | * 99.9% Availability * Less frequently accessed data that requires rapid access when needed * Lower cost than S3 Standard * Use case: Disaster recovery and backups | * High durability (99.999999999%) in a single AZ * Data is lost when the AZ is destroyed * 99.5% Availability * Use case: Store secondary backup copies of on-premises data or data you can recreate | * Low cost object storage meant for archiving or backup * Pricing: price for storage + object retrieval cost * Glacier IR: ms retrieval, great for data accessed once a quarter, minimum storage of 90 days * Glacier FR: Expedited (1-5 min), Standard (3-5 hours), Bulk (5-12 hours, free), minimum storage of 90 days * Glacier Deep Archive: Standard (12h), Bulk (48h), minimum storage of 180 days | * Small monthly monitoring & auto-tiering fee * Moves objects automatically between access tiers based on usage * No retrieval charges * Frequent (default), Infrequent (30 days), Archive Instant (90 days), Archive Access (90 to 700+ days), Deep Archive Access (180 to 700+ days) |
S3 Lifecycle Rules
Following diagram illustrates the transition of objects in S3 across different S3 Storage Classes:
Transition Actions: define when objects are transitioned to another storage class
- Move objects to Standard IA 60 days after creation
- Move to glacier for archiving after 6 months
Expiration Actions: configure objects to expire after some time
- Access log files can be set to delete after 365 days
- Can be used to delete old version of files (if versioning is enabled)
- Can be used to delete incomplete multi-part uploads
Rules can be created for a certain prefix and/or certain object tags (see the lifecycle sketch below)
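A minimal boto3 sketch of a lifecycle configuration matching the transitions above, scoped to a hypothetical logs/ prefix; the bucket name and day counts are examples only.

```python
# Minimal sketch (boto3): transition objects to Standard-IA after 60 days,
# to Glacier after 180 days, and expire them after 365 days.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-bucket",  # hypothetical bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-logs",
                "Status": "Enabled",
                "Filter": {"Prefix": "logs/"},
                "Transitions": [
                    {"Days": 60, "StorageClass": "STANDARD_IA"},
                    {"Days": 180, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": 365},
            }
        ]
    },
)
```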
S3 Performance
- Baseline:
- Automatically scales to high request rates, latency of 100-200 ms
- An application can achieve at least 3,500 PUT/COPY/POST/DELETE and 5,500 GET/HEAD requests per second per prefix in a bucket
- KMS Limits
- If you use SSE-KMS, you may be impacted by KMS limits
- Upload a file -> GenerateDataKey KMS API
- Download a file -> Decrypt KMS API
- KMS Quota - these calls count toward the KMS requests-per-second quota (5,500, 10,000, or 30,000 req/sec depending on the region)
- Optimization
- Multi-part upload - recommended for files > 100 MB and a must for files > 5 GB; helps parallelize uploads
- S3 Transfer Acceleration - Increase transfer speed by transferring file to AWS Edge location which will forward data to S3 bucket in target region
- Reading a file in the most efficient way
- S3 Byte-Range Fetches - parallelize GETs by requesting specific byte ranges; better resilience in case of failures
- Use cases: speed up downloads, retrieve only partial data (see the sketch below)
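A minimal boto3 sketch of a byte-range fetch; bucket, key, and range are made-up examples, and in practice several ranges could be fetched in parallel to speed up a download.

```python
# Minimal sketch (boto3): fetch only the first 1 MB of an object with a
# byte-range GET.
import boto3

s3 = boto3.client("s3")

resp = s3.get_object(
    Bucket="my-bucket",        # hypothetical bucket
    Key="big-file.csv",        # hypothetical key
    Range="bytes=0-1048575",   # first 1 MB only
)
head = resp["Body"].read()
print(len(head))
```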
S3 Encryption
- Encryption
- SSE-S3:
- encrypts S3 objects using keys handled and managed by AWS
- Object is encrypted server side
- AES-256 encryption type
- Must set header: “x-amz-server-side-encryption”:“AES256”
- SSE-KMS:
- Leverage AWS KMS to manage encryption keys (see the put_object sketch at the end of this section)
- User control + Audit trail
- Object is encrypted server side
- Must set header: “x-amz-server-side-encryption”:“aws:kms”
- SSE-C:
- When you want to manage your own encryption keys
- S3 doesn’t store encryption keys
- HTTPS must be used and encryption is done at server side
- Keys need to be provided in HTTP headers for every HTTP request made
- Client Side Encryption
- When we encrypt the file on the client side before uploading it to S3
- Client is responsible to decrypt as well when retrieving from S3
- Client manages keys and encryption cycle
- Encryption in transit (SSL/TLS)
- S3 exposes HTTP endpoint (non encrypted) and HTTPS endpoint (encryption in flight)
- HTTPS is mandatory for SSE-C
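A minimal boto3 sketch of an SSE-KMS upload (boto3 sets the x-amz-server-side-encryption header for you); the bucket, key, and KMS key alias are made-up examples.

```python
# Minimal sketch (boto3): upload an object encrypted server-side with SSE-KMS.
import boto3

s3 = boto3.client("s3")

s3.put_object(
    Bucket="my-bucket",                    # hypothetical bucket
    Key="reports/2022-01.csv",
    Body=b"col1,col2\n1,2\n",
    ServerSideEncryption="aws:kms",
    SSEKMSKeyId="alias/my-data-key",       # omit to use the default aws/s3 key
)
```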
S3 Access Points
- Concept
- S3 bucket has a lot of different types of data i.e. finance, marketing, sales etc.
- We have different sets of users/groups that need access to their respective data based on their clearance
- We can create an access point per dataset, each with its own access point policy (similar to a general S3 bucket policy)
- The advantage is a simple bucket policy and simplified security management for S3 buckets
- Each access point has its own DNS name (VPC Origin) and access point policy
- S3 Access points: VPC Origin
- Define access point within VPC
- Ex. to access S3 from isolated EC2, we must create VPC endpoint (Gateway/Interface)
- VPC Endpoint policy must allow access to target bucket and Access point
- S3 Object Lambda
- Use AWS Lambda to change object before it is retrieved by caller application
- Example:
- We have 2 teams, one wants original data and one wants redacted data.
- The team that wants redacted data can use S3 Access points pointing to Lambda function that enables S3 object Lambda Access Point
- Use cases:
- Redacting PII for analytics or non-prod environments.
- Converting across data formats such as XML to JSON.
- Resizing and watermarking images on the fly using caller-specific details.
S3 Other features
S3 Select & Glacier Select
- Retrieve less data using SQL by performing server side filtering
- Can filter by rows and columns
- less network transfer, less CPU cost client-side
- Glacier Select can only query uncompressed CSV files (see the select_object_content sketch below)
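A minimal boto3 sketch of S3 Select filtering a CSV object server-side; the bucket, key, and column names are made-up examples.

```python
# Minimal sketch (boto3): server-side filtering of a CSV object with S3 Select.
import boto3

s3 = boto3.client("s3")

resp = s3.select_object_content(
    Bucket="my-bucket",          # hypothetical bucket
    Key="sales/2022.csv",        # hypothetical key
    ExpressionType="SQL",
    Expression="SELECT s.order_id, s.amount FROM S3Object s WHERE s.country = 'US'",
    InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
    OutputSerialization={"JSON": {}},
)
# The response payload is an event stream; print the matching records
for event in resp["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode("utf-8"), end="")
```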
S3 Event Notifications
- ObjectCreated, ObjectRemoved, ObjectRestore, Replication
- Automatically react to objects in S3
- Object name filtering possible (*.jpg)
- Use Case: Generate thumbnails of images uploaded to S3
- Destinations: SNS, SQS, Lambda
- EventBridge Integration:
- All events can be sent to Amazon EventBridge, where rules route them to 18+ AWS service destinations
- Advanced filtering options with JSON
- Multiple destinations
Versioning
- S3 can version files; versioning is enabled at the bucket level
- Same key overwrite will increment the version i.e. 1,2,3 …
- Recommended:
- Protect against unintended deletes
- Easy roll back to previous version
- Any file uploaded before versioning was enabled will have version “null”
- Suspending versioning doesn’t delete previous versions
- Deleting a file adds a delete marker and keeps previous versions intact, so the object can be restored if needed
- Deleting a file and its delete marker deletes the file permanently (see the versioning sketch below)
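A minimal boto3 sketch of enabling versioning on a bucket; the bucket name is a made-up example.

```python
# Minimal sketch (boto3): turn on versioning for a bucket
# (use "Suspended" to pause it later without deleting old versions).
import boto3

s3 = boto3.client("s3")

s3.put_bucket_versioning(
    Bucket="my-bucket",  # hypothetical bucket
    VersioningConfiguration={"Status": "Enabled"},
)
```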
Replication
- CRR - Cross-Region Replication - copies files from a bucket in region 1 to a bucket in region 2 asynchronously
- Use cases: compliance, lower-latency access, replication across accounts
- SRR - Same-Region Replication - copies files from a bucket to another bucket in the same region asynchronously
- Use cases: log aggregation, live replication between production and test accounts
- Notes:
- must enable versioning in source and destination buckets
- buckets can be in different accounts as well
- After activating, only new objects are replicated
- Replicate existing objects using S3 Batch Replication
- Can replicate DELETE markers from source to target
- Deletions with Version ID are NOT replicated
- No chaining of replication
Security
- User/Role based: IAM policies - which API calls should be allowed for a specific user from IAM console
- Resource based: Bucket policies - bucket-wide rules set from the S3 console - allows cross-account access
- Object ACL - finer grain
- Bucket ACL - less common
- Networking: Supports VPC Endpoints
- Logging & Audit: S3 access logs, API calls can be logged in CloudTrail
- MFA Delete: MFA can be required in versioned buckets to delete objects
- Pre-Signed URLs: URLs that are valid only for a limited time
Bucket Policies
- JSON based policies
- Resource: Buckets and objects
- Actions: Set of API to Allow or Deny
- Effect: Allow/Deny
- Principal: the account/user to apply the policy to (see the bucket policy sketch at the end of this section)
- Block Public Access
- Blocks objects from being made publicly accessible
- New ACLs, any ACLs, new public bucket or access point policies
- Can be set at account level
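A minimal boto3 sketch of attaching a bucket policy; the bucket name and the deny-non-HTTPS statement are illustrative examples of the Resource/Action/Effect/Principal structure above.

```python
# Minimal sketch (boto3): attach a bucket policy that denies non-HTTPS access.
import json
import boto3

s3 = boto3.client("s3")

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyInsecureTransport",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": ["arn:aws:s3:::my-bucket", "arn:aws:s3:::my-bucket/*"],
            "Condition": {"Bool": {"aws:SecureTransport": "false"}},
        }
    ],
}
s3.put_bucket_policy(Bucket="my-bucket", Policy=json.dumps(policy))
```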
Storage (DynamoDB)
This section includes a detailed overview of DynamoDB.
DynamoDB
DynamoDB is a NoSQL serverless database.
- Traditional Architecture
- Generally, clients use an application layer comprising load balancers and EC2 instances (with auto scaling) attached to a database layer, i.e. RDS
- Traditional applications leverage RDBMS databases
- These databases have the SQL query language
- Strong requirements about how the data should be modeled
- Ability to do query joins, aggregations, complex computations
- Vertical scaling (getting a more powerful CPU/RAM/IO)
- Limited Horizontal scaling (increasing reading capability by adding EC2/RDS Read Replicas)
- NoSQL databases
- Non-relational databases and are distributed (i.e. gives some horizontal scalability)
- includes some famous tech like MongoDB, DynamoDB etc.
- don’t support query joins
- All the data that is needed for a query is present in one row
- don’t perform aggregations such as “SUM”, “AVG” etc.
- scale horizontally
- DynamoDB Overview
- Fully managed, highly available with replication across multiple AZs
- NoSQL database
- Scales to massive workloads, distributed database
- Millions of requests per second, trillions of rows, 100s of TB of storage
- Fast & consistent in performance (low latency)
- IAM integration
- Enables event driven with DynamoDB Streams
- Low cost & autoscaling capabilities (a minimal table-access sketch follows)
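A minimal boto3 sketch of the single-table access pattern described above (all data needed for a query lives in one item, no joins); the table and attribute names are made-up examples.

```python
# Minimal sketch (boto3): write one item and query by partition key.
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("Users")  # hypothetical table

table.put_item(
    Item={
        "user_id": "u1",             # partition key
        "created_at": "2022-01-01",  # sort key
        "name": "Alice",
        "plan": "premium",
    }
)

# Retrieve every item for this partition key in one query, no joins needed
resp = table.query(KeyConditionExpression=Key("user_id").eq("u1"))
print(resp["Items"])
```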