My notes from a course by DIO.

Log data will be uploaded to a Kinesis Data Firehose delivery stream.

Bucket -> Create New -> Create S3 Bucket

EC2 - AWS service for launching and managing virtual server instances.

Launch Instance -> Linux -> Create -> Instance Type

Create New SSH Key -> Download -> Launch Instance

# Convert the .pem key to PuTTY's .ppk format
puttygen keyfile.pem -O private -o key.ppk

# To print the public key instead
puttygen keyfile.pem -L

Get the instance's public DNS (or public IPv4 address) from the AWS console

Connect -> SSH Client -> Copy public DNS -> Paste into PuTTY's Host Name field

SSH -> Auth -> Browse -> Select key.ppk

Default login: ec2-user

On the AWS instance:

sudo yum install -y aws-kinesis-agent
sudo yum install -y git

# Clone the repository with the dataset and scripts
git clone https://github.com/cassianobrexbit/dio-live-aws-bigdata-2.git
unzip dataset
# The Python script processes the dataset line by line and generates logs

# Make the script executable
chmod a+x loggeneratorscript.py
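As a rough illustration of what a line-by-line log generator does, here is a minimal sketch. It is not the course's loggeneratorscript.py; the function name `generate_logs`, the assumption that the dataset is a CSV with a header row, and the choice to cycle through the rows when more logs are requested than the dataset contains are all hypothetical.

```python
import csv
import itertools


def generate_logs(dataset_path, log_path, count):
    """Append `count` dataset rows to `log_path`, cycling through the dataset if needed."""
    with open(dataset_path, newline="") as src:
        reader = csv.reader(src)
        next(reader, None)  # skip the header row (assumed to exist)
        rows = list(reader)
    if not rows:
        return 0  # nothing to write for an empty dataset
    with open(log_path, "a", newline="") as dst:
        writer = csv.writer(dst, lineterminator="\n")
        # Repeat the dataset as often as needed to reach `count` lines
        for row in itertools.islice(itertools.cycle(rows), count):
            writer.writerow(row)
    return count
```

A call such as `generate_logs("dataset.csv", "/var/log/logdir/sample.log", 500000)` would then append 500000 lines that the agent can pick up; both file names here are placeholders.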

# Create the directory that will store the logs
sudo mkdir /var/log/logdir

# Go to the Kinesis agent configuration directory
cd /etc/aws-kinesis

# Open agent.json to configure the agent
sudo nano agent.json

File agent.json

## Pay attention to the region - it is shown in the Kinesis Firehose delivery stream details

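"flows" - list of log sources; each entry maps a log file pattern to a destination stream

"filePattern": "/var/log/logdir/*.log"

"deliveryStream": copy the Firehose delivery stream name (the agent takes the name, not the full ARN)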
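The fields above can be put together as a minimal agent.json. The sketch below builds it in Python so the region only has to be typed once; the helper name `build_agent_config` and the region `us-east-1` are assumptions for illustration, and `FirehoseName` is a placeholder for your own delivery stream name.

```python
import json


def build_agent_config(region, file_pattern, delivery_stream):
    """Return a minimal agent.json structure with a single Firehose flow."""
    return {
        # The endpoint must match the region of the delivery stream
        "firehose.endpoint": f"firehose.{region}.amazonaws.com",
        "flows": [
            {"filePattern": file_pattern, "deliveryStream": delivery_stream}
        ],
    }


# Assumed example values; replace with your own region and stream name
config = build_agent_config("us-east-1", "/var/log/logdir/*.log", "FirehoseName")
print(json.dumps(config, indent=2))
```

The printed JSON can be pasted into /etc/aws-kinesis/agent.json.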

Instances -> Select -> Actions -> Security -> Modify IAM role

Create New IAM role -> Create role -> Select role to allow access to services
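The notes do not record which permissions the role had. As a hedged sketch, a policy along the following lines should cover what the agent does with the model configuration (writing to a data stream, writing to a delivery stream, and emitting CloudWatch metrics). The helper name `build_agent_policy` and the ARNs are placeholders, and the role used in the course may have been broader.

```python
import json


def build_agent_policy(stream_arn, delivery_stream_arn):
    """Return an IAM policy document with the write permissions the agent needs."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            # Send records to the Kinesis data stream
            {"Effect": "Allow", "Action": "kinesis:PutRecords", "Resource": stream_arn},
            # Send records to the Firehose delivery stream
            {"Effect": "Allow", "Action": "firehose:PutRecordBatch", "Resource": delivery_stream_arn},
            # Needed because "cloudwatch.emitMetrics" is true
            {"Effect": "Allow", "Action": "cloudwatch:PutMetricData", "Resource": "*"},
        ],
    }


# Placeholder ARNs; fill in your own region and account ID
policy = build_agent_policy(
    "arn:aws:kinesis:<region>:<account-id>:stream/DataStreamName",
    "arn:aws:firehose:<region>:<account-id>:deliverystream/FirehoseName",
)
print(json.dumps(policy, indent=2))
```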

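# Start the agent and enable it at boot
sudo service aws-kinesis-agent start
sudo chkconfig aws-kinesis-agent on
# Generate logs (the argument is the number of logs to generate)
sudo ./loggeneratorscript.py 500000
# Follow the agent log to confirm that records are being sent
tail -f /var/log/aws-kinesis-agent/aws-kinesis-agent.log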

Create Data Stream -> Set the number of shards
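To pick a number of shards, a rough estimate can be made from the expected write load. The sketch below assumes provisioned capacity mode and the per-shard write limits of 1 MiB per second and 1,000 records per second; the function name and the example figures are hypothetical.

```python
import math

# Assumed per-shard write limits for a provisioned Kinesis data stream
MAX_BYTES_PER_SHARD = 1024 * 1024
MAX_RECORDS_PER_SHARD = 1000


def estimate_shards(records_per_sec, avg_record_bytes):
    """Return the smallest shard count that satisfies both write limits."""
    by_records = math.ceil(records_per_sec / MAX_RECORDS_PER_SHARD)
    by_bytes = math.ceil(records_per_sec * avg_record_bytes / MAX_BYTES_PER_SHARD)
    return max(1, by_records, by_bytes)


print(estimate_shards(2500, 200))  # limited by record count
print(estimate_shards(500, 4096))  # limited by throughput in bytes
```

For a small course exercise like this one, a single shard is usually enough.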

sudo service aws-kinesis-agent restart

The data can be visualized with AWS Glue DataBrew.

Model agent.json file - content provided by the course

{
  "cloudwatch.emitMetrics": true,
  "kinesis.endpoint": "kinesis.<region>.amazonaws.com",
  "firehose.endpoint": "firehose.<region>.amazonaws.com",
  "flows" : [
    {
      "filePattern": "/var/log/logdir/*.log",
      "kinesisStream": "DataStreamName",
      "partitionKeyOption": "RANDOM",
      "dataProcessingOptions": [
        {
          "optionName": "CSVTOJSON",
          "customFieldNames": ["country", "iso_code", "total_vaccinations", "people_fully_vaccinated", "total_vaccinations_per_hundred", "vaccines", "source_name", "source_website"]
        }
      ]
    },
    {
      "filePattern": "/var/log/logdir/*.log",
      "deliveryStream": "FirehoseName"
    }
  ]
}
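The CSVTOJSON option in the first flow turns each CSV log line into a JSON record keyed by customFieldNames. The sketch below only approximates that transformation to show the resulting shape; the agent's real handling of quoting and value types may differ, and the sample line is made up.

```python
import csv
import json

# Field names copied from the model agent.json
FIELD_NAMES = [
    "country", "iso_code", "total_vaccinations", "people_fully_vaccinated",
    "total_vaccinations_per_hundred", "vaccines", "source_name", "source_website",
]


def csv_line_to_json(line, field_names=FIELD_NAMES, delimiter=","):
    """Map the values of one CSV line onto the field names and return a JSON string."""
    values = next(csv.reader([line], delimiter=delimiter))
    return json.dumps(dict(zip(field_names, values)))


# Made-up sample line, for illustration only
sample = "Brazil,BRA,1000,400,0.5,VaccineA,Example Source,https://example.com"
print(csv_line_to_json(sample))
```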