Collect - raw data
Store - scalable and secure
Process & Analyse - transform
Consume & View - value
Descriptive Analysis - what happened and why?
Predictive Analysis - what could happen?
Prescriptive Analysis - if it happens, what should I do?
Collect - Direct Connect (dedicated network connection between your infrastructure and AWS - high transfer rate), Snowball (physical data transfer), Kinesis & Kinesis Firehose (data streaming)
Store - S3 (file storage - data lake), Glacier (archival storage - slow retrieval), RDS (relational), Aurora, DynamoDB (non-relational), Redshift (data warehouse - analytical queries), CloudSearch (search service), Elasticsearch
Analyse - EMR (Elastic MapReduce), Machine Learning, QuickSight (charts, visual dashboards), Kinesis Analytics (query streaming data and extract information), Athena (SQL queries)
Create bucket
Access S3: https://s3.console.aws.amazon.com/s3/
Create DataLake structure
S3 -> Create bucket -> [bucket name] -> [region] -> privacy settings
Create the folder structure in the bucket - Create folder - folders: data, output, temp (a boto3 sketch of these steps follows below)
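These console steps can also be scripted with boto3; a minimal sketch, assuming a placeholder bucket name my-datalake-bucket and the us-west-2 region used in mrjob.conf below:

import boto3

s3 = boto3.client("s3", region_name="us-west-2")

# buckets outside us-east-1 need an explicit LocationConstraint
s3.create_bucket(
    Bucket="my-datalake-bucket",
    CreateBucketConfiguration={"LocationConstraint": "us-west-2"},
)

# S3 has no real folders: zero-byte keys ending in "/" show up as folders in the console
for prefix in ("data/", "output/", "temp/"):
    s3.put_object(Bucket="my-datalake-bucket", Key=prefix)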
Create ssh keys to access instances
EC2 -> Network & Security -> Key Pairs -> Download pem or ppk
Create access keys: your user -> My Security Credentials -> Access Keys -> Download csv file (a boto3 sketch of the key pair step follows below)
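The key pair can also be created with boto3 instead of the console; a minimal sketch, assuming a placeholder key name emr-key (the access keys from the downloaded csv still go into mrjob.conf below):

import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")

# create the key pair and save the private key for SSH access to the cluster nodes
resp = ec2.create_key_pair(KeyName="emr-key")
with open("emr-key.pem", "w") as f:
    f.write(resp["KeyMaterial"])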
Upload the file that will be analysed to the data folder of the S3 bucket
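The upload can be done from the console or with boto3; a minimal sketch, assuming a placeholder local file name access_logs.txt:

import boto3

s3 = boto3.client("s3")
# put the input file under the data/ prefix of the data lake bucket
s3.upload_file("access_logs.txt", "my-datalake-bucket", "data/access_logs.txt")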
Edit mrjob.conf (mrjob reads ~/.mrjob.conf by default)
runners:
  emr:
    aws_access_key_id: {your_key_id}
    aws_secret_access_key: {your_secret_access_key}
    ec2_key_pair: {KEY}
    ec2_key_pair_file: ~/.ssh/{KEY}.pem
    region: us-west-2
    ssh_tunnel: true
    instance_type: m5.xlarge
    num_core_instances: 3
Install the Python libraries needed for the job: boto3 and mrjob (e.g. pip install boto3 mrjob).
Configure the SSH key
nano ~/.ssh/{KEY}.pem
# paste the private key contents
chmod 400 ~/.ssh/{KEY}.pem  # ssh refuses keys with open permissions
python3 script.py -r emr --output-dir=s3://{your_s3_bucket_name}/output/logs1 --cloud-tmp-dir=s3://{your_s3_bucket_name}/temp/ s3://{your_s3_bucket_name}/data/{your_input_file}  # copy the S3 URIs from the console
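script.py itself is not included in these notes; as a rough sketch of what it could look like, a minimal MRJob word-count job (the actual job logic is an assumption):

from mrjob.job import MRJob

class MRWordCount(MRJob):
    # map step: emit (word, 1) for every word of every input line
    def mapper(self, _, line):
        for word in line.split():
            yield word.lower(), 1

    # reduce step: sum the counts for each word
    def reducer(self, word, counts):
        yield word, sum(counts)

if __name__ == "__main__":
    MRWordCount.run()

Run with -r emr as above, mrjob uploads the script, launches the cluster described in mrjob.conf, and writes the results to --output-dir.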