Big Data

Big Data Tools on AWS

Amazon EMR - Hadoop MapReduce

S3 -> Create bucket -> [bucket name] -> [region] -> privacy settings

  • Create the folder structure in the bucket (Create folder): data, output, temp
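
A sketch of the same setup with boto3 (the bucket name stays a placeholder; the region matches the mrjob.conf below):

import boto3

s3 = boto3.client("s3", region_name="us-west-2")

# Create the bucket; outside us-east-1 the location constraint is required
s3.create_bucket(
    Bucket="{your_s3_bucket_name}",
    CreateBucketConfiguration={"LocationConstraint": "us-west-2"},
)

# S3 has no real folders: zero-byte keys ending in "/" appear as folders in the console
for folder in ("data/", "output/", "temp/"):
    s3.put_object(Bucket="{your_s3_bucket_name}", Key=folder)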

  • Create SSH keys to access the instances

EC2 -> Network & Security -> Key Pairs -> Download the .pem or .ppk file
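
The same key pair can be created with boto3 instead of the console ({KEY} is the placeholder used in mrjob.conf below):

import os
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")

# Create the key pair and save the private key where mrjob.conf expects it
resp = ec2.create_key_pair(KeyName="{KEY}")
pem_path = os.path.expanduser("~/.ssh/{KEY}.pem")
with open(pem_path, "w") as f:
    f.write(resp["KeyMaterial"])
os.chmod(pem_path, 0o400)  # SSH refuses private keys with open permissions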

  • Create credentials with your AWS access key ID and secret access key so mrjob can manage the services and launch the cluster

Your user -> My Security Credentials -> Access Keys -> Download the .csv file
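
A quick boto3 check that the downloaded keys actually work (the key values are placeholders):

import boto3

session = boto3.Session(
    aws_access_key_id="{your_key_id}",
    aws_secret_access_key="{your_secret_access_key}",
)
# Prints the ARN of the user the keys belong to; fails if the keys are wrong
print(session.client("sts").get_caller_identity()["Arn"])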

  • Upload the file to be analysed to the S3 bucket
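
With boto3 the upload is one call (the local file name is illustrative):

import boto3

s3 = boto3.client("s3")
# Put the input file into the data/ folder created earlier
s3.upload_file("input.txt", "{your_s3_bucket_name}", "data/input.txt")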

  • Edit mrjob.conf (mrjob also looks for it at ~/.mrjob.conf):

runners:
  emr:
    aws_access_key_id: {your_key_id}
    aws_secret_access_key: {your_secret_access_key}
    ec2_key_pair: {KEY}
    ec2_key_pair_file: ~/.ssh/{KEY}.pem
    region: us-west-2
    ssh_tunnel: true
    instance_type: m5.xlarge
    num_core_instances: 3

  • Install the Python libraries the job needs: pip install boto3 mrjob

  • Configure the SSH key (mrjob uses it for the ssh_tunnel to the cluster)

nano ~/.ssh/key.pem
# paste the private key
chmod 400 ~/.ssh/key.pem  # SSH refuses keys with open permissions

  • Run the job:
python script.py -r emr --python-bin python3 \
  --output-dir=s3://{your_s3_bucket_name}/output/logs1 \
  --cloud-tmp-dir=s3://{your_s3_bucket_name}/temp/ \
  s3://{your_s3_bucket_name}/data/{input_file}  # copy the S3 URIs from the console
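
Here script.py is whatever MRJob subclass you want to run; as a minimal sketch (the word-count job below is illustrative, not part of the original notes):

from mrjob.job import MRJob

class MRWordCount(MRJob):
    def mapper(self, _, line):
        # Emit (word, 1) for every word; equal keys are shuffled to one reducer
        for word in line.split():
            yield word.lower(), 1

    def reducer(self, word, counts):
        yield word, sum(counts)

if __name__ == "__main__":
    MRWordCount.run()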