Collect - raw data
Store - scalable and secure
Process & Analyse - transform
Consume & View - value
Descriptive Analysis - what happened and why?
Predictive Analysis - what could happen?
Prescriptive Analysis - if it happens, what should I do?
Collect - Direct Connect (dedicated network connection between your infrastructure and AWS - high transfer rate), Snowball (physical data transfer), Kinesis & Kinesis Firehose (data streaming)
Store - S3 (file storage - data lake), Glacier (archival storage - slow retrieval), RDS (relational), Aurora, DynamoDB (non-relational), Redshift (data warehouse - analytical queries), CloudSearch (search service), Elasticsearch
Analyse - EMR (Elastic MapReduce), Machine Learning, QuickSight (charts, visual dashboards), Kinesis Analytics (query streaming data and extract information), Athena (SQL queries)
Create bucket
Access S3: https://s3.console.aws.amazon.com/s3/
Create DataLake structure
S3 -> Create bucket -> [bucket name] -> [region] -> privacy settings
Create the folder structure in the bucket - Create folder - folders: data, output, temp (a boto3 sketch of these steps follows below)
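These console steps can also be scripted with boto3; a minimal sketch, assuming a placeholder bucket name my-datalake-bucket and the us-west-2 region used in mrjob.conf below:

import boto3

s3 = boto3.client("s3", region_name="us-west-2")

# buckets outside us-east-1 need an explicit LocationConstraint
s3.create_bucket(
    Bucket="my-datalake-bucket",
    CreateBucketConfiguration={"LocationConstraint": "us-west-2"},
)

# S3 has no real folders: zero-byte keys ending in "/" show up as folders in the console
for prefix in ("data/", "output/", "temp/"):
    s3.put_object(Bucket="my-datalake-bucket", Key=prefix)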
Create ssh keys to access instances
EC2 -> Network & Security -> Key Pairs -> Download pem or ppk
Create access keys: your user -> My Security Credentials -> Access Keys -> Download csv file (a boto3 sketch of the key pair step follows below)
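The key pair can also be created with boto3 instead of the console; a minimal sketch, assuming a placeholder key name emr-key (the access keys from the downloaded csv still go into mrjob.conf below):

import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")

# create the key pair and save the private key for SSH access to the cluster nodes
resp = ec2.create_key_pair(KeyName="emr-key")
with open("emr-key.pem", "w") as f:
    f.write(resp["KeyMaterial"])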
Upload the file that will be analysed to the data folder of the S3 bucket
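The upload can be done from the console or with boto3; a minimal sketch, assuming a placeholder local file name access_logs.txt:

import boto3

s3 = boto3.client("s3")
# put the input file under the data/ prefix of the data lake bucket
s3.upload_file("access_logs.txt", "my-datalake-bucket", "data/access_logs.txt")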
Edit mrjob.conf (mrjob reads ~/.mrjob.conf by default)
runners:
  emr:
    aws_access_key_id: {your_key_id}
    aws_secret_access_key: {your_secret_access_key}
    ec2_key_pair: {KEY}
    ec2_key_pair_file: ~/.ssh/{KEY}.pem
    region: us-west-2
    ssh_tunnel: true
    instance_type: m5.xlarge
    num_core_instances: 3
Install the Python libraries needed for the job: boto3 and mrjob (e.g. pip install boto3 mrjob).
Configure the SSH key
nano ~/.ssh/{KEY}.pem
# paste the private key contents
chmod 400 ~/.ssh/{KEY}.pem  # ssh refuses keys with open permissions
python3 script.py -r emr --output-dir=s3://{your_s3_bucket_name}/output/logs1 --cloud-tmp-dir=s3://{your_s3_bucket_name}/temp/ s3://{your_s3_bucket_name}/data/{your_input_file}  # copy the S3 URIs from the console
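script.py itself is not included in these notes; as a rough sketch of what it could look like, a minimal MRJob word-count job (the actual job logic is an assumption):

from mrjob.job import MRJob

class MRWordCount(MRJob):
    # map step: emit (word, 1) for every word of every input line
    def mapper(self, _, line):
        for word in line.split():
            yield word.lower(), 1

    # reduce step: sum the counts for each word
    def reducer(self, word, counts):
        yield word, sum(counts)

if __name__ == "__main__":
    MRWordCount.run()

Run with -r emr as above, mrjob uploads the script, launches the cluster described in mrjob.conf, and writes the results to --output-dir.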