Data Analytics Certification Notes
Introduction
These are notes prepared while studying for the AWS Certified Data Analytics Specialty 2022 exam. They cover important concepts for the following domains of Data Analytics:
- Collection
- Storage
- Processing
- Analysis
- Visualization
- Domain & Security
Collection
- Kinesis (Data Streams,Producers,Consumers,Enhanced Fan Out,Scaling,Security,Data Firehose)
- SQS
- IoT
- Database Migration Service (DMS)
- Direct Connect
- Snow Family
- MSK (Managed Streaming, Connect, Serverless)
Storage
- S3 (Storage,Lifecycle Rules,Versioning,Replication,Performance,Security,Event Notifications)
- DynamoDB (Basics,APIs,Indexes,PartiQL,DAX,Streams,TTL,Patterns,Security)
- ElastiCache (Fundamental)
Processing
- Lambda
- Glue, Hive, ETL (Catalog, end-points, Costs)
- Glue Studio & DataBrew
- Glue Elastic Views
- Lake Formation
- Infrastructure (EMR, Hadoop, Serverless, Apache Spark)
- Spark integration with Kinesis & Redshift
- Applications on EMR (Hive,Pig,HBase,Presto,Zeppelin,Hue,Splunk,Flume)
- Data Pipeline
- Step Functions
Analysis
- Kinesis Analytics
- OpenSearch
- Athena
- Redshift
Visualization
- Quicksight (Pricing,Dashboards, ML Insights)
Domain & Security
- S3 Encryption
- KMS (Basics,Key Rotation)
- Cloud HSM
- STS & Cross Account
- Identity Federation
- Policies
- CloudTrail
- VPC Endpoints
Other Topics
- EC2 for Big Data
- AWS AppSync and Kendra
- AWS Data Exchange
- AWS AppFlow
- AWS Cleanup
- Sagemaker
Collection
There are multiple ways to collect data in AWS.
- Real-time collection - where we can take immediate action on our data
- KDS, SQS, IoT - these services help you react in real time to events or data happening in your infrastructure
- Near-real-time - reactive actions
- Firehose, DMS
- Batch - historical analysis - when we want to move large amounts of data to perform data analysis
- Snowball, Data Pipeline
Kinesis Data Streams
Overview
- A way to stream big data into your systems. A stream is made of multiple shards, which must be provisioned ahead of time
- Shards - data is split across all the shards; they define the stream capacity in terms of ingestion and consumption rates
- Producers - send data (produce records) into KDS and can be many things, e.g. applications, clients, SDK, KPL, Kinesis Agent. Data can be sent at a rate of 1 MB/sec or 1000 msg/sec per shard
- Records - made of a partition key and a data blob (up to 1 MB); a put_record sketch appears at the end of this overview
- Partition key - determines which shard the record will go to
- Data Blob - the value itself
- Consumers - applications, Lambda functions, Firehose, Kinesis Data Analytics
- When a consumer receives a record, it receives the partition key, a sequence number (where the record was in the shard), and the data blob
- 2 MB/sec (shared) per shard all consumers
- 2 MB/sec (enhanced) per shard per consumer
- Properties:
- Retention period - 1 day to 1 year
- Ability to reprocess data
- Once data is inserted, it can’t be deleted (immutable)
- Data that shares same partition, goes to same shard
- Capacity Modes:
- Provisioned Mode
- Choose # of shards and scale manually or through API
- Throughput per shard: 1 MB/sec in & 2 MB/sec out (both classic and enhanced fan-out consumers)
- Pay per shard provisioned per hour
- On-demand Mode
- No provision needed
- Default capacity provisioned (4 MB/sec)
- Automatic scaling (observed throughput peak during last 30 days)
- Pay per stream per hour & data in/out per GB
- Security:
- It is within region (where we have shards)
- IAM policies for shards
- Encryption
- in-flight: HTTPS endpoints
- at-rest: KMS
- client side
- VPC Endpoints available for Kinesis
- Monitor API calls through CloudTrail
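As a concrete illustration, a minimal boto3 sketch of a producer sending one record; the stream name and payload are made-up examples, not part of the original notes.

```python
# Minimal sketch (boto3): send one record to a Kinesis Data Stream.
# "my-stream" and the payload are hypothetical examples.
import json
import boto3

kinesis = boto3.client("kinesis")

response = kinesis.put_record(
    StreamName="my-stream",
    Data=json.dumps({"sensor_id": "s1", "temperature": 21.5}).encode("utf-8"),
    PartitionKey="s1",  # records sharing this key land on the same shard
)
print(response["ShardId"], response["SequenceNumber"])
```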
Producers
How data is ingested into Kinesis Data Streams.
- SDK - allows you to write code or use the CLI to send data directly into Kinesis Data Streams
- PutRecord(s) API
- API to send one or more records
- Uses batching to increase throughput (fewer HTTP requests, since many records are sent as part of one HTTP request)
- If we exceed the throughput limits, we get a ProvisionedThroughputExceeded error
- The SDK can be used in many different ways, e.g. the Mobile SDK (Android, iOS)
- Use case - low throughput, when higher latency is acceptable, a simple API suffices, or when working directly from Lambda
- AWS Managed sources (uses SDK) for KDS - CW logs, IoT, Kinesis Data Analytics
- ProvisionedThroughputExceeded
- Happens when we send more data than provisioned, i.e. exceed the MB/sec or TPS limit for a shard
- Often due to a hot shard (one partition key is over-represented, so excess data lands in that shard). Distribute the partition key as evenly as possible
- (+) Retries with backoff (see the put_records sketch at the end of this section)
- (+) Increase shards (scaling)
- (+) Ensure the partition key is well distributed
- Kinesis Producer Library (KPL) - more advanced library for writing better producer code, with good features for enhanced throughput
- Easy to use and highly configurable C++/Java library
- Used for building high performance, long running producers
- Automated + configurable retry mechanism (automatically deals with issues you would otherwise handle yourself with the SDK)
- 2 types of APIs:
- Synchronous: same as the SDK
- Asynchronous: better performance for asynchronous processing
- Submits metrics to CW for monitoring
- Supports batching - increased throughput + decreased cost (ON by default)
- Collect: records are collected and written to multiple shards in the same PutRecords API call
- Aggregate: stores multiple records in one record (adds some latency) to increase payload size and improve throughput
- Compression (must be implemented by the user) - makes records smaller
- To read KPL records, we need the KCL or a special helper library (can’t use the CLI)
- Batching
- Say we are sending a 2 KB record to Kinesis Data Streams using the KPL
- It won’t be sent right away; the KPL waits for the next records that might be coming
- At some point, the KPL aggregates several records into one record, and it can do this multiple times
- To make it even more efficient, it then collects all aggregated records into a single PutRecords API call
- RecordMaxBufferedTime introduces some delay while waiting for records to go together in one API call (default is 100 ms)
- WHEN NOT TO USE KPL - an application that can’t tolerate this additional delay is not a good use case (use the SDK directly instead)
- Kinesis Agent - Linux program that runs on servers to fetch log files and send them reliably to Kinesis Data Streams
- Java-based agent, built upon the KPL
- Only for Linux-based systems
- Features:
- Write from multiple directories and write to multiple streams
- Routing feature based on directory / log file
- Pre-process data before sending to streams
- Handles log file rotation, checkpointing, and retry upon failure
- Emits metrics to CW for monitoring
- Third party libraries - Spark, Flume, log4j, Kafka Connect, NiFi
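A minimal boto3 sketch of the batching + retry-with-backoff advice above; the stream name, payload shape, and helper function name are assumptions for illustration, not part of the original notes.

```python
# Minimal sketch (boto3): batch records with PutRecords and retry failed
# records with exponential backoff (e.g. on ProvisionedThroughputExceeded).
# Stream name and payloads are hypothetical.
import json
import time
import boto3

kinesis = boto3.client("kinesis")

def put_records_with_backoff(stream_name, records, max_attempts=5):
    entries = [
        {"Data": json.dumps(r).encode("utf-8"), "PartitionKey": str(r["key"])}
        for r in records
    ]
    for attempt in range(max_attempts):
        resp = kinesis.put_records(StreamName=stream_name, Records=entries)
        if resp["FailedRecordCount"] == 0:
            return
        # Keep only the entries that failed and retry after a backoff delay
        entries = [e for e, r in zip(entries, resp["Records"]) if "ErrorCode" in r]
        time.sleep(2 ** attempt * 0.1)
    raise RuntimeError(f"{len(entries)} records still failing after retries")

put_records_with_backoff("my-stream", [{"key": i, "value": i * 2} for i in range(100)])
```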
Consumers
How data is consumed from Kinesis Data Streams.
- SDK
- GetRecord(s) API
- Records are polled by consumers from a shard
- Each shard has 2 MB/sec total aggregate throughput, i.e. 3 shards means 6 MB/sec total
- GetRecords returns up to 10 MB of data (then throttles for 5 seconds) or up to 10,000 records
- Limit - max of 5 GetRecords() calls per shard per second, i.e. ~200 ms latency
- More consumers = less throughput per consumer (see the polling sketch at the end of this section)
- Kinesis Client Library (KCL) - just as we produce with the KPL, we consume data with the KCL
- Java-first library but exists for other languages too (Golang, Python, Ruby, Node, .Net)
- Read records from Kinesis produced with KPL (de-aggregation)
- Multiple consumers with multiple shards in one group - Shard Discovery
- Checkpointing - feature to resume progress
- Uses DynamoDB for coordination and checkpointing
- Provision DynamoDB (WCU/RCU)
- Or use On-demand
- If we get ExpiredIteratorException, we should increase WCU
- Record processors will process the data
- Example below
- We have a Kinesis Streams with 4 shards
- We have a DynamoDB for Checkpointing and Coordination
- We have 2 KCL applications running on two different EC2 in same group
- With the help of shard discovery mechanism, KCL1 will be reading from Shard 1&2 and KCL2 will be reading from Shard 3&4
- Then KCL applications will be checkpointing with DynamoDB
- Kinesis Connector Library - older Java library (2016); uses the KCL under the hood and runs on EC2 to write data to different destinations
- Its sole purpose is to take data from Kinesis Data Streams and write it to S3, DynamoDB, Redshift, ElasticSearch
- This service has been replaced by Firehose and Lambda together
- Third party libraries - Apache Spark, Log4j, Appenders, Flume, Kafka Connect
- Kinesis Firehose
- AWS Lambda
- Can source records from Kinesis Data Streams
- Has a library to de-aggregate records produced with the KPL
- Can be used to run lightweight ETL to S3, DynamoDB, Redshift, ElasticSearch
- Reads in real time from Kinesis Data Streams and can trigger notifications (with a configurable batch size)
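A minimal boto3 sketch of SDK-based polling with GetShardIterator/GetRecords (this is what the KCL automates for you); the stream name is a made-up example.

```python
# Minimal sketch (boto3): poll a single shard with the GetRecords API.
# In practice the KCL handles shard discovery and checkpointing for you.
import time
import boto3

kinesis = boto3.client("kinesis")
stream = "my-stream"  # hypothetical stream name

shard_id = kinesis.list_shards(StreamName=stream)["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName=stream,
    ShardId=shard_id,
    ShardIteratorType="TRIM_HORIZON",  # read from the oldest available record
)["ShardIterator"]

while iterator:
    resp = kinesis.get_records(ShardIterator=iterator, Limit=1000)
    for record in resp["Records"]:
        print(record["PartitionKey"], record["SequenceNumber"], record["Data"])
    iterator = resp["NextShardIterator"]
    time.sleep(0.2)  # stay under the 5 GetRecords calls/sec/shard limit
```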
Enhanced Fan Out
- Works with KCL 2.0 & AWS Lambda
- Each consumer gets 2 MB/sec of provisioned throughput per shard, i.e. data is pushed at 2 MB/sec once the consumer calls SubscribeToShard()
- Enhanced Fan Out - data is pushed to consumers over HTTP/2
- Supports many more consumers + reduced latency (~70 ms)
- Standard Consumers vs Enhanced Fan out
- SC - Low # of consuming applications, tolerate ~200 ms latency, low cost
- EF - Multiple consumer applications, low latency ~70 ms and higher costs
Scaling
- Operations
- Add Shards - Shard splitting, inc stream capacity (1 MB/s data in per shard), divide a hot shard, old shard is closed and will be deleted once the data is expired, helps to improve throughput
- Merge Shards - dec stream capacity + save cost, group 2 shards with low traffic, old shards are closed and deleted based on data expiration
- Out-of-order records :
- Reason is Resharding
- Read from child shards
- Data that hasn’t been read yet is still in the parent shard
- After resharding, read entirely from the parent until there are no new records, then move to the child shards
- KCL already has logic built-in even after resharding
- AutoScaling: not a native Kinesis feature; call the UpdateShardCount API to change the shard count and implement autoscaling with Lambda (see the sketch after this section)
- Limitations
- Resharding can’t be done in parallel (plan capacity in advance)
- Perform one resharding operation at a time
- For 1000 shards, it takes 30,000 seconds (~8 hrs) to double to 2000 shards
- Can’t scale:
- more than 10x for each rolling 24-hour period per stream
- up to more than double your current shard count
- down below half your current shard count
- up to more than 500 shards in a stream
- a stream with more than 500 shards down, unless the result is fewer than 500 shards
- up to more than the shard limit for your account
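A minimal boto3 sketch of the UpdateShardCount call such a scaling Lambda might make; the stream name is a made-up example, and the doubling target reflects the 2x-per-operation limit above.

```python
# Minimal sketch (boto3): double the shard count of a stream.
# In a single operation you can request at most 2x (and at least 0.5x)
# the current shard count.
import boto3

kinesis = boto3.client("kinesis")

current = len(kinesis.list_shards(StreamName="my-stream")["Shards"])  # simplistic count
kinesis.update_shard_count(
    StreamName="my-stream",
    TargetShardCount=current * 2,
    ScalingType="UNIFORM_SCALING",
)
```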
Handling Duplicate Records
- Producers:
- Due to network timeouts, duplicates can be created because the acknowledgement may not reach the producer
- Embed a unique record ID in the data so that the consumer can deduplicate
- Consumers:
- Retries can make application read data twice
- Retries happen when:
- worker terminates unexpectedly
- worker instances are added/removed
- shards are merged or split
- application is deployed
- Fixes:
- make the consumer application idempotent
- if final destination can handle duplicates, it is recommended to do it there
Security
- Control access/authorization using IAM
- Encryption in flight using HTTPS
- Encryption at rest using KMS
- Client side encryption
- VPC Endpoints available for Kinesis to access within VPC
Kinesis Firehose
- Kinesis Data Firehose is used to deliver data into target destinations (a minimal producer sketch appears at the end of this section)
- Near-real-time service, i.e. it accumulates data into big batches before writing to the target destinations (batch writes) - 60 sec minimum latency for non-full batches
- Diagram: producer sources and destination targets
- Fully managed AWS service, no administration required
- Loads data into S3, Redshift, ElasticSearch, Splunk
- Automatic scaling
- Supports many data formats
- Data conversion from JSON to Parquet/ORC (S3 only)
- Data transformation through Lambda (e.g. CSV to JSON)
- Supports compression when the target is S3 (GZIP, ZIP, SNAPPY)
- If the data is further loaded into Redshift, only GZIP is supported
- Pay only for the amount of data going through Firehose
- Spark/KCL do not read from KDF
Delivery Diagram
- Buffer Sizing
- Firehose accumulates records in a buffer
- The buffer is flushed when either the time rule (minimum 1 min) or the size rule (a few MB) reaches its configured value
- Firehose can automatically increase the buffer size to increase throughput
- High throughput means the buffer size limit will be hit first
- Low throughput means the buffer time limit will be hit first
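A minimal boto3 sketch of a producer writing to a Firehose delivery stream; delivery to the destination then follows the buffer rules above. The delivery stream name and payload are made-up examples.

```python
# Minimal sketch (boto3): send a record to a Firehose delivery stream and let
# the buffering rules decide when it is flushed to the destination (e.g. S3).
import json
import boto3

firehose = boto3.client("firehose")

firehose.put_record(
    DeliveryStreamName="my-delivery-stream",  # hypothetical name
    Record={"Data": (json.dumps({"event": "click", "user": "u1"}) + "\n").encode("utf-8")},
)
```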
Streams vs Firehose
- Streams:
- Write custom code (producer/consumer)
- Real time (~200 ms latency for classic, ~70 ms for enhanced fan out)
- Must manage scaling (shard splitting / merging)
- Data storage (1 to 365 days), replay capability, multi-consumers
- Use with Lambda to insert data in real time into ElasticSearch
- Firehose:
- Fully managed; data can be sent to S3, Splunk, Redshift, ElasticSearch
- Serverless data transformations with Lambda
- Near real time (lowest buffer time is 1 minute)
- Automated Scaling
- No data storage
SQS
- Producers will send message to SQS queues and consumers will poll messages from SQS
- Standard:
- Fully managed and auto-scaling (from 1 message per sec to 10k messages per second)
- Message retention: 4 days by default, configurable up to 14 days
- No limit of how many messages can be in a queue
- Low latency (< 10 ms on publish and receive)
- Horizontal scaling
- Can have duplicate messages (at-least-once delivery)
- Can have out-of-order messages (no FIFO guarantee)
- Message size limitation - 256 KB
- Producing Messages:
- Define the message body
- Add message attributes (metadata - optional)
- Provide delay delivery (optional)
- In return, we get a message identifier + MD5 hash of the body
- Consuming Messages:
- Poll SQS for messages (up to 10 messages at a time)
- Process messages within the visibility timeout
- Delete the message using the message ID and receipt handle (see the send/receive/delete sketch at the end of this section)
- SQS FIFO (First-in-First-Out)
- Queue name must end with the .fifo suffix
- Lower throughput
- Messages are processed in order
- Messages are sent exactly once
- 5-minute interval de-duplication using a ‘Deduplication ID’
- Not compatible with S3 event notifications
- SQS Extended Client
- Java library to send large messages (> 256 KB)
- Use cases:
- Decouple applications
- Buffer writes to db
- Handle large loads of messages coming in
- Can be integrated with Auto-Scaling using CW
- Limits:
- Max of 120k in-flight messages being processed by consumers
- Batch request has max of 10 messages
- Message content is limited to XML, JSON, unformatted text
- Standard SQS has unlimited TPS (transactions per second)
- FIFO supports up to 3000 messages per second (using batching)
- Max size is 256 KB
- Data retention from 1 minute to 14 days
- Pricing - per API request and network usage
- Security:
- Encryption in flight using HTTPS endpoint
- Can enable SSE using KMS (encrypts only the body, not the metadata)
- IAM policy to allow usage of SQS
- SQS queue access policy
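A minimal boto3 sketch of the produce/consume/delete cycle described above; the queue URL and message body are made-up examples.

```python
# Minimal sketch (boto3): send, receive, and delete an SQS message.
import boto3

sqs = boto3.client("sqs")
queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"  # hypothetical

# Produce: returns a MessageId and an MD5 hash of the body
sqs.send_message(QueueUrl=queue_url, MessageBody="order-1234", DelaySeconds=0)

# Consume: poll up to 10 messages, process, then delete within the visibility timeout
resp = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=10, WaitTimeSeconds=5)
for msg in resp.get("Messages", []):
    print(msg["Body"])
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
```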
Kinesis Vs SQS
| Kinesis Data Streams | SQS |
|---|---|
| Data can be consumed many times | Queue, decouple applications |
| Data is deleted after retention period | One application per queue |
| Ordering of records is preserved | Records are deleted after consumption |
| Build multiple applications reading from same stream (Pub/Sub) | Messages are processed independently |
| Streaming MapReduce quering capability | Ordering for FIFO queues |
| Checkpoint needed to track consumption progress | Capability to delay messages |
| Provisioned mode or on-demand mode | Dynamic scaling of load |
| Use cases: Fast log & event data collection/processing, Real-time metrics/reporting, mobile data capture, Real-time DA, Gaming data feed, Complex Stream Processing, Data feed from IoT | Use cases: Order, image processing, Auto-scaling queues, Buffer/batch messages for future processing, Request Offloading |
Other Collections
This section includes IoT, Database Migration Service (DMS), Direct Connect, Snow Family, Managed Streaming for Kafka (MSK)
IoT
DMS
Direct Connect
Snow Family
MSK
Kinesis vs MSK
Storage (S3)
This section includes a detailed overview of S3.
S3
| S3 | Description |
|---|---|
| Overview | * Store objects (files) in buckets (directories) * Globally unique bucket name * Defined at the regional level * Naming convention - no uppercase, no underscore, 3-63 characters long, not an IP, must start with a lowercase letter or number * Objects (files) have a key, where the key is the FULL path * Object values - max size is 5 TB; for uploads > 5 GB we must use multi-part upload; can have a Version ID * Metadata - list of text key/value pairs (system or user metadata) * Tags - up to 10 key-value pairs for security/lifecycle |
| Consistency | * Strong consistency * Whenever we successfully write (PUT) or overwrite an existing object, subsequent read request immediately receives latest version of the object * Same case when you list the new object |
| Durability | * Measures how often an object in S3 may be lost * Very high durability (99.999999999%, 11 9's) across multiple AZs * Same for all storage classes |
| Availability | * Measures how readily available a service is * Depends upon storage class * Standard Class has 99.99% availability (i.e. not available 53 min a year) |
| Storage Classes | * Standard - General Purpose * Standard-Infrequent Access (IA) * One Zone-Infrequent Access * Glacier Instant Retrieval, Flexible Retrieval, Deep Archive * Intelligent Tiering |
S3 Storage Classes
| Standard | StandardIA | OneZoneIA | Glacier | IT |
|---|---|---|---|---|
| * 99.99% Availability * Used for frequently accessed data * Low latency and high throughput * Sustains 2 concurrent facility failures * Use cases: Big Data analytics, mobile & gaming, content distribution | * 99.9% Availability * Less frequently accessed data that requires rapid access when needed * Lower cost than S3 Standard * Use case: Disaster recovery and backups | * High durability (99.999999999%) in a single AZ * Data is lost when the AZ is destroyed * 99.5% Availability * Use case: Store secondary backup copies of on-premises data or data you can recreate | * Low cost object storage meant for archiving or backup * Pricing: price for storage + object retrieval cost * Glacier IR: ms retrieval, great for data accessed once a quarter, minimum storage of 90 days * Glacier FR: Expedited (1-5 min), Standard (3-5 hours), Bulk (5-12 hours, free), minimum storage of 90 days * Glacier Deep Archive: Standard (12h), Bulk (48h), minimum storage of 180 days | * Small monthly monitoring & auto-tiering fee * Moves objects automatically between access tiers based on usage * No retrieval charges * Frequent (default), Infrequent (30 days), Archive Instant (90 days), Archive Access (90 to 700+ days), Deep Archive Access (180 to 700+ days) |
S3 Lifecycle Rules
Following diagram illustrates the transition of objects in S3 across different S3 Storage Classes:
Transition Actions: define when objects are transitioned to another storage class
- Move objects to Standard IA 60 days after creation
- Move to glacier for archiving after 6 months
Expiration Actions: configure objects to expire after some time
- Access log files can be set to delete after 365 days
- Can be used to delete old version of files (if versioning is enabled)
- Can be used to delete incomplete multi-part uploads
Rules can be created for a certain prefix and/or certain object tags (see the lifecycle sketch below)
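A minimal boto3 sketch of a lifecycle configuration matching the transitions above, scoped to a hypothetical logs/ prefix; the bucket name and day counts are examples only.

```python
# Minimal sketch (boto3): transition objects to Standard-IA after 60 days,
# to Glacier after 180 days, and expire them after 365 days.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-bucket",  # hypothetical bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-logs",
                "Status": "Enabled",
                "Filter": {"Prefix": "logs/"},
                "Transitions": [
                    {"Days": 60, "StorageClass": "STANDARD_IA"},
                    {"Days": 180, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": 365},
            }
        ]
    },
)
```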
S3 Performance
- Baseline:
- Automatically scales to high request rates, latency of 100-200 ms
- An application can achieve at least 3,500 PUT/COPY/POST/DELETE and 5,500 GET/HEAD requests per second per prefix in a bucket
- KMS Limits
- If you use SSE-KMS, you may be impacted by KMS limits
- Upload a file -> GenerateDataKey KMS API
- Download a file -> Decrypt KMS API
- KMS Quota - these calls count toward the KMS requests-per-second quota (5,500, 10,000, or 30,000 req/sec depending on the region)
- Optimization
- Multi-part upload - recommended for files > 100 MB and a must for files > 5 GB; helps parallelize uploads
- S3 Transfer Acceleration - Increase transfer speed by transferring file to AWS Edge location which will forward data to S3 bucket in target region
- Reading a file in the most efficient way
- S3 Byte-Range Fetches - parallelize GETs by requesting specific byte ranges; better resilience in case of failures
- Use cases: speed up downloads, retrieve only partial data (see the sketch below)
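A minimal boto3 sketch of a byte-range fetch; bucket, key, and range are made-up examples, and in practice several ranges could be fetched in parallel to speed up a download.

```python
# Minimal sketch (boto3): fetch only the first 1 MB of an object with a
# byte-range GET.
import boto3

s3 = boto3.client("s3")

resp = s3.get_object(
    Bucket="my-bucket",        # hypothetical bucket
    Key="big-file.csv",        # hypothetical key
    Range="bytes=0-1048575",   # first 1 MB only
)
head = resp["Body"].read()
print(len(head))
```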
S3 Encryption
- Encryption
- SSE-S3:
- encrypts S3 objects using keys handled and managed by AWS
- Object is encrypted server side
- AES-256 encryption type
- Must set header: “x-amz-server-side-encryption”:“AES256”
- SSE-KMS:
- Leverage AWS KMS to manage encryption keys (see the put_object sketch at the end of this section)
- User control + Audit trail
- Object is encrypted server side
- Must set header: “x-amz-server-side-encryption”:“aws:kms”
- SSE-C:
- When you want to manage your own encryption keys
- S3 doesn’t store encryption keys
- HTTPS must be used and encryption is done at server side
- Keys need to be provided in HTTP headers for every HTTP request made
- Client Side Encryption
- When we encrypt the file on the client side before uploading it to S3
- Client is responsible to decrypt as well when retrieving from S3
- Client manages keys and encryption cycle
- Encryption in transit (SSL/TLS)
- S3 exposes HTTP endpoint (non encrypted) and HTTPS endpoint (encryption in flight)
- HTTPS is mandatory for SSE-C
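A minimal boto3 sketch of an SSE-KMS upload (boto3 sets the x-amz-server-side-encryption header for you); the bucket, key, and KMS key alias are made-up examples.

```python
# Minimal sketch (boto3): upload an object encrypted server-side with SSE-KMS.
import boto3

s3 = boto3.client("s3")

s3.put_object(
    Bucket="my-bucket",                    # hypothetical bucket
    Key="reports/2022-01.csv",
    Body=b"col1,col2\n1,2\n",
    ServerSideEncryption="aws:kms",
    SSEKMSKeyId="alias/my-data-key",       # omit to use the default aws/s3 key
)
```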
S3 Access Points
- Concept
- S3 bucket has a lot of different types of data i.e. finance, marketing, sales etc.
- We have different sets of users/groups that need access to their respective data based on their clearance
- We can create an access point per dataset, each with its own access point policy (similar to a general S3 bucket policy)
- The advantage is a simple bucket policy and simplified security management for S3 buckets
- Each access point has its own DNS name (VPC Origin) and access point policy
- S3 Access points: VPC Origin
- Define access point within VPC
- Ex. to access S3 from isolated EC2, we must create VPC endpoint (Gateway/Interface)
- VPC Endpoint policy must allow access to target bucket and Access point
- S3 Object Lambda
- Use AWS Lambda to change object before it is retrieved by caller application
- Example:
- We have 2 teams, one wants original data and one wants redacted data.
- The team that wants redacted data can use S3 Access points pointing to Lambda function that enables S3 object Lambda Access Point
- Use cases:
- Redacting PII for analytics or non-prod environments.
- Converting across data formats such as XML to JSON.
- Resizing and watermarking images on the fly using caller-specific details.
S3 Other features
S3 Select & Glacier Select
- Retrieve less data using SQL by performing server side filtering
- Can filter by rows and columns
- less network transfer, less CPU cost client-side
- Glacier Select can only query uncompressed CSV files (see the select_object_content sketch below)
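A minimal boto3 sketch of S3 Select filtering a CSV object server-side; the bucket, key, and column names are made-up examples.

```python
# Minimal sketch (boto3): server-side filtering of a CSV object with S3 Select.
import boto3

s3 = boto3.client("s3")

resp = s3.select_object_content(
    Bucket="my-bucket",          # hypothetical bucket
    Key="sales/2022.csv",        # hypothetical key
    ExpressionType="SQL",
    Expression="SELECT s.order_id, s.amount FROM S3Object s WHERE s.country = 'US'",
    InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
    OutputSerialization={"JSON": {}},
)
# The response payload is an event stream; print the matching records
for event in resp["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode("utf-8"), end="")
```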
S3 Event Notifications
- ObjectCreated, ObjectRemoved, ObjectRestore, Replication
- Automatically react to objects in S3
- Object name filtering possible (*.jpg)
- Use Case: Generate thumbnails of images uploaded to S3
- Destinations: SNS, SQS, Lambda
- EventBridge Integration:
- All events can be sent to Amazon EventBridge, where rules route them to 18+ AWS service destinations
- Advanced filtering options with JSON
- Multiple destinations
Versioning
- S3 can version files; versioning is enabled at the bucket level
- Same key overwrite will increment the version i.e. 1,2,3 …
- Recommended:
- Protect against unintended deletes
- Easy roll back to previous version
- Any file uploaded before versioning was enabled will have version “null”
- Suspending versioning doesn’t delete previous versions
- Deleting a file adds a delete marker and keeps previous versions intact, so the object can be restored if needed
- Deleting a file and its delete marker deletes the file permanently (see the versioning sketch below)
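A minimal boto3 sketch of enabling versioning on a bucket; the bucket name is a made-up example.

```python
# Minimal sketch (boto3): turn on versioning for a bucket
# (use "Suspended" to pause it later without deleting old versions).
import boto3

s3 = boto3.client("s3")

s3.put_bucket_versioning(
    Bucket="my-bucket",  # hypothetical bucket
    VersioningConfiguration={"Status": "Enabled"},
)
```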
Replication
- CRR - Cross-Region Replication - copies files from a bucket in region 1 to a bucket in region 2 asynchronously
- Use cases: compliance, lower-latency access, replication across accounts
- SRR - Same-Region Replication - copies files from a bucket to another bucket in the same region asynchronously
- Use cases: log aggregation, live replication between production and test accounts
- Notes:
- must enable versioning in source and destination buckets
- buckets can be in different accounts as well
- After activating, only new objects are replicated
- Replicate existing objects using S3 Batch Replication
- Can replicate DELETE markers from source to target
- Deletions with Version ID are NOT replicated
- No chaining of replication
Security
- User/Role based: IAM policies - which API calls should be allowed for a specific user from IAM console
- Resource based: Bucket policies - bucket-wide rules set from the S3 console - allows cross-account access
- Object ACL - finer grain
- Bucket ACL - less common
- Networking: Supports VPC Endpoints
- Logging & Audit: S3 access logs, API calls can be logged in CloudTrail
- MFA Delete: MFA can be required in versioned buckets to delete objects
- Pre-Signed URLs: URLs that are valid only for a limited time
Bucket Policies
- JSON based policies
- Resource: Buckets and objects
- Actions: Set of API to Allow or Deny
- Effect: Allow/Deny
- Principal: the account/user to apply the policy to (see the bucket policy sketch at the end of this section)
- Block Public Access
- Blocks objects from being made publicly accessible
- New ACLs, any ACLs, new public bucket or access point policies
- Can be set at account level
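A minimal boto3 sketch of attaching a bucket policy; the bucket name and the deny-non-HTTPS statement are illustrative examples of the Resource/Action/Effect/Principal structure above.

```python
# Minimal sketch (boto3): attach a bucket policy that denies non-HTTPS access.
import json
import boto3

s3 = boto3.client("s3")

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyInsecureTransport",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": ["arn:aws:s3:::my-bucket", "arn:aws:s3:::my-bucket/*"],
            "Condition": {"Bool": {"aws:SecureTransport": "false"}},
        }
    ],
}
s3.put_bucket_policy(Bucket="my-bucket", Policy=json.dumps(policy))
```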
Storage (DynamoDB)
This section includes a detailed overview of DynamoDB.
DynamoDB
DynamoDB is a NoSQL serverless database.
- Traditional Architecture
- Generally, clients use an application layer comprising load balancers and EC2 instances (with auto scaling) attached to a database layer, i.e. RDS
- Traditional applications leverage RDBMS databases
- These databases have the SQL query language
- Strong requirements about how the data should be modeled
- Ability to do query joins, aggregations, complex computations
- Vertical scaling (getting a more powerful CPU/RAM/IO)
- Limited Horizontal scaling (increasing reading capability by adding EC2/RDS Read Replicas)
- NoSQL databases
- Non-relational databases and are distributed (i.e. gives some horizontal scalability)
- includes some famous tech like MongoDB, DynamoDB etc.
- don’t support query joins
- All the data that is needed for a query is present in one row
- don’t perform aggregations such as “SUM”, “AVG” etc.
- scale horizontally
- DynamoDB Overview
- Fully managed, highly available with replication across multiple AZs
- NoSQL database
- Scales to massive workloads, distributed database
- Millions of requests per second, trillions of rows, 100s of TB of storage
- Fast & consistent in performance (low latency)
- IAM integration
- Enables event driven with DynamoDB Streams
- Low cost & autoscaling capabilities (a minimal table-access sketch follows)
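A minimal boto3 sketch of the single-table access pattern described above (all data needed for a query lives in one item, no joins); the table and attribute names are made-up examples.

```python
# Minimal sketch (boto3): write one item and query by partition key.
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("Users")  # hypothetical table

table.put_item(
    Item={
        "user_id": "u1",             # partition key
        "created_at": "2022-01-01",  # sort key
        "name": "Alice",
        "plan": "premium",
    }
)

# Retrieve every item for this partition key in one query, no joins needed
resp = table.query(KeyConditionExpression=Key("user_id").eq("u1"))
print(resp["Items"])
```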