Amazon Web Services
This tutorial is intended to clarify how Amazon Web Services (AWS) works and the steps you need to follow in this last project.
We will divide the project into four steps:
- 1. Defining the goal: what is the purpose of this task?
- 2. Setting up the computing environment
- 3. Gathering the data
- 4. Developing models to predict sentiment
1. DEFINING THE GOAL: WHAT IS THE PURPOSE OF THIS TASK?
You work for Alert Analytics. Your client is Helio, a smart phone and tablet app developer. This company is working with a government health agency to create an app that will enable aid workers to manage local health conditions by facilitating communication with medical professionals located elsewhere.
Helio has created a short list of five devices that are all capable of executing the app suite’s functions. They have asked us to examine the prevalence of positive and negative attitudes toward two specific devices on the web (to help them narrow their list): the iPhone and the Galaxy.
- Our goal is to provide our client with a report that contains an analysis of sentiment toward the target devices based on people’s opinions, as well as a description of the methods and processes we used to arrive at our conclusions.
- Our approach will be to look for positive, negative, and neutral words immediately adjacent to terms that refer to the camera, display, performance, and operating system.
- We’re going to analyze data from Common Crawl (an open repository of web crawl data, over 5 billion pages so far, stored in Amazon’s Public Data Sets) and we’re going to use the cloud computing platform provided by Amazon Web Services (AWS) to conduct the analysis. Here you have an example of the dataset that you will extract from the web.
2. SETTING UP THE COMPUTING ENVIRONMENT
2.1. Install Python
- If you already have Python installed, you can install NumPy and pandas individually on your own.
- If not, you can download and install Anaconda (the Python 3 version). It’s a Python distribution that makes it easy to install Python plus a number of its most commonly used libraries.
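As a quick sanity check, you can confirm from a Python shell that both libraries are available (a minimal sketch; nothing here is specific to this project):

# Verify that NumPy and pandas can be imported, and print their versions
import numpy as np
import pandas as pd

print("NumPy:", np.__version__)
print("pandas:", pd.__version__)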
2.2. Set up AWS
First, you need to create an AWS account. AWS is a platform for cloud applications and services. The main tools you’ll need are:
- EC2: this is the compute service; it lets you rent computers, so you will have virtual computing environments, or instances
- S3: you can use it to store and retrieve any amount of data, at any time, from anywhere on the web
- IAM: you can create users with different roles and permissions
- EMR: Amazon Elastic MapReduce is where you’re going to create your clusters for processing large amounts of data. Here you will specify your inputs, outputs, security settings, and so on.
Then, you need to install the AWS CLI. The AWS Command Line Interface (AWS CLI) is an open-source tool that enables you to interact with AWS services using commands in your command-line shell. Here you have the instructions for Linux, Windows (you can just download an installer), and macOS.
- Test the install by running “aws help” in your command shell. If you receive a warning, you will need to follow the instructions for adding the AWS CLI to your PATH.
- Create a user called “ubiqum” from the IAM console.
- Set the following permissions: AmazonEC2FullAccess, AmazonS3FullAccess, DataScientist
- Download the CSV with the Access key ID and the Secret access key (you will need them for configuring the AWS CLI)
- Run “aws configure” in your command shell
- You will need to enter four pieces of information:
  - Access Key ID (you downloaded this when you created the ubiqum user)
  - Secret Access Key (you downloaded this when you created the ubiqum user)
  - Default Region Name: us-east-1
  - Default Output Format: json
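If you want to double-check what “aws configure” stored, the CLI writes your keys to ~/.aws/credentials and the region/output settings to ~/.aws/config, both in INI format. A minimal Python sketch to confirm the default profile is in place:

# check_aws_config.py -- read the files written by "aws configure"
import configparser
import os

creds = configparser.ConfigParser()
creds.read(os.path.expanduser("~/.aws/credentials"))

config = configparser.ConfigParser()
config.read(os.path.expanduser("~/.aws/config"))

print("Default credentials found:", creds.has_section("default"))
print("Region:", config.get("default", "region", fallback="not set"))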
2.3. Additional tools
For gathering the data you also need to install:
- Cyberduck. It’s a special type of browser that allows you to view and access files stored on Amazon S3. This will enable you to easily explore S3 buckets and upload/download multiple files at a time. So, try connecting it to S3.
- A text editor. It is highly recommended that you install a third-party text editor; the text editors that come preinstalled on macOS and Windows machines are not appropriate for this level of work. We recommend Sublime.
3. GATHERING THE DATA
Now you’re ready to use EMR to collect the web data you will analyze to provide the general sentiment toward the smart phones. The data you will need to collect is in the web database compiled by Common Crawl.
The work you will do in EMR involves running scripts that will access and compile web pages from the Common Crawl that are relevant to smart phones (the most helpful web pages are those that contain a smart phone review), count the number of positive, negative, or uncertain sentiments expressed, and then store this data in an S3 repository folder.
As you can imagine, accessing and compiling all this data will take quite a while; but the good news is that there’s no need to write the scripts to do it: the previous analyst who was working on this project, Amy Gorman, has already written and tested the three Python programs that will help you accomplish your task.
- The mapper script (Mapper.py): examines and counts data from portions of the Common Crawl data
- The reducer script (Reducer.py): accumulates the analysis from the individual mapper jobs
- The aggregation script (Concatenatepv3.py): helps stitch together the raw output from the multiple job flows you will need to initiate to analyze all the necessary data.
The mapper scans and records instances of words related to the features of the phones, such as camera, display, or performance, that also have a positive, negative, or uncertain word within a window of 5-14 words. When sentiment toward a specific phone feature is found, the mapper emits a count for each of the instances it observes. The mapper sends this information to the reducer, which runs on the master node. The reducer accumulates the information it receives from the mappers and writes it to the output file on S3.
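The actual Mapper.py and Reducer.py are provided by Amy Gorman, so you do not need to write them yourself. Still, to make the idea concrete, here is a minimal sketch of how a Hadoop streaming mapper of this kind could work; the feature terms, sentiment word lists, proximity window, and output format below are illustrative assumptions, not the contents of the provided scripts:

# mapper_sketch.py -- illustrative Hadoop streaming mapper (not the provided Mapper.py)
import sys

FEATURES = {"camera", "display", "performance", "os"}   # example feature terms
POSITIVE = {"great", "excellent", "love"}               # example positive words
NEGATIVE = {"bad", "terrible", "hate"}                  # example negative words
WINDOW = 10                                             # example proximity window (in words)

for line in sys.stdin:
    words = line.lower().split()
    for i, word in enumerate(words):
        if word in FEATURES:
            nearby = words[max(0, i - WINDOW): i + WINDOW + 1]
            if any(w in POSITIVE for w in nearby):
                print(word + "\tpositive\t1")
            if any(w in NEGATIVE for w in nearby):
                print(word + "\tnegative\t1")

A reducer in this pattern simply reads those tab-separated counts from standard input, sums them per feature/sentiment key, and writes the totals, which EMR then stores in your S3 output folder.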
3.1. Getting and preparing our input addresses from the Common Crawl blog
Common Crawl is a non-profit organization that crawls and archives the entire readable Internet once per month. The archived files are stored on Amazon Web Services S3 in the N. Virginia (us-east-1) region.
- Download the wet.paths file for the latest monthly crawl.
- The .gz file is a compressed file and cannot be opened by Sublime or any other text editor. It needs to be uncompressed (unzipped) first.
- Open your wet.paths file with Sublime or any text editor. You will find tens of thousands of file addresses.
- You will need to add “s3://commoncrawl/” to the beginning of any file address you intend to use as input (tip: you can use regular expressions; a sketch of how to script this follows this list).
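If you prefer to script this instead of editing tens of thousands of lines by hand, a minimal Python sketch using a regular expression could look like this (the input and output file names are just examples):

# prefix_paths.py -- prepend "s3://commoncrawl/" to every address in the unzipped wet.paths file
import re

with open("wet.paths") as src, open("wet.paths.s3", "w") as dst:
    for line in src:
        address = line.strip()
        if address:
            # "^" anchors the substitution at the start of the address
            dst.write(re.sub(r"^", "s3://commoncrawl/", address) + "\n")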
3.2. Let’s prepare our first cluster!
1. Identify an input address for a Common Crawl WET file.
2. Create three S3 buckets:
- One for the mapper and reducer scripts
- A second bucket for your output
- A third bucket for debugging logs
3. Upload the mapper and reducer scripts to an S3 bucket. You can use Cyberduck or the AWS CLI.
4. Create a small EMR cluster using the EMR console: Part 1. Software and Steps.
- A. Go to “Advanced options”. In the “Add steps” section, ensure you check Auto-terminate; if not, your cluster will continue to run and you will continue to be charged.
- B. In the “Add steps” section, select “Streaming program”. Here you will specify that you’re running your own application and that it’s a streaming job type. Click on Configure and fill in:
  - The mapper location (an S3 bucket)
  - The reducer location (an S3 bucket)
  - The input location. This field is asking for the URL of the Common Crawl WET file you want to access, so select and copy only one address.
  - The output location (an S3 bucket). NOTE: you will need to add the name of a folder to the end of the address.
5. Create a small EMR cluster using the EMR console: Part 2. Hardware Configuration.
Most of these options can be left at their default settings, although if you have problems with the m3.large instance type, you can try the m4.large type.
6. Create a small EMR cluster using the EMR console: Part 3. Security.
Most of these options can be left at their default settings. Just make sure to proceed without an EC2 key pair and ensure that the cluster is visible to all IAM users in the account.
7. Run the cluster!
3.4. Let’s prepare the big cluster!
The EMR console provides an easy-to-use graphical interface for launching and monitoring your job flows directly from a web browser, but you may find that it is not efficient enough when running large jobs.
In this case, you will need to use the command line interface you installed back in task one. This interface is accessed from the Terminal on Mac OS X and the Command Prompt on a Windows machine. Prior to running this job, let’s prepare the WET files.
1. Select and prepare 200 WET file addresses
- Open the WET paths file with a text editor and choose 200 addresses (remember that you need to add s3://commoncrawl/ to the beginning of each of your addresses).
- Copy your 200 addresses and paste them into a new tab of your text editor.
- Save this new tab as a .bdf file (a sketch of how to script this selection follows this list).
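If you scripted the prefixing step in section 3.1, you can automate this selection too. A minimal sketch, assuming the prefixed file produced by the earlier example (the slice of 200 and the file names are placeholders):

# make_bdf.py -- take 200 prefixed WET addresses and save them as a .bdf file
with open("wet.paths.s3") as src:
    addresses = [line.strip() for line in src if line.strip()]

selection = addresses[:200]   # the first 200 addresses; any 200 will do

with open("wet_selection.bdf", "w") as dst:
    dst.write("\n".join(selection) + "\n")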
2. Personalize the CreateJsonFiles.py script. Copy the CreateJsonFiles.py Python script into the folder with the .bdf file and personalize it:
- Update the Python script to have the correct S3 locations for your Mapper.py and Reducer.py files
- Update the Python script to have the correct S3 address for your output bucket
- Review all lines of the script to ensure that the mapper and reducer file names are correct
- Save the file
3. Run the CreateJsonFiles.py script
- In your command shell, use the change directory command to point to the folder containing your .bdf and CreateJsonFiles files.
- Run the CreateJsonFiles script from the command line (“python createJsonFilesPv3.py”).
- You will be asked for the input .bdf file’s name and a name for the output file.
At this point, you should have a .json file containing all of the appropriate markup needed by AWS EMR. The .json file will contain one step per WET address from the .bdf file.
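To give you an idea of what that markup looks like, here is a hedged sketch of a single streaming step built as a Python dictionary and dumped to JSON. The field names follow the generic format the AWS CLI accepts for EMR streaming steps; the exact structure generated by the provided CreateJsonFiles.py may differ, and every bucket name and path below is a placeholder:

# step_sketch.py -- illustrative EMR streaming step, one per WET address
import json

step = {
    "Name": "Sentiment analysis step",
    "Type": "STREAMING",
    "ActionOnFailure": "CONTINUE",
    "Args": [
        "-files", "s3://your-script-bucket/Mapper.py,s3://your-script-bucket/Reducer.py",
        "-mapper", "Mapper.py",
        "-reducer", "Reducer.py",
        "-input", "s3://commoncrawl/...one WET address...",
        "-output", "s3://your-output-bucket/run1/step1",
    ],
}

# The full .json file is a list containing one such step per address in the .bdf file
print(json.dumps([step], indent=2))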
4. Check the validity of your .json file. In order for the CLI to correctly process a .json file, it has to be formatted and structured in the correct manner. Thankfully, there are numerous online validation tools that we can use to check the validity of JSON files.
- In order to check the structure of your .json file, open it in Sublime and copy/paste the text into JSONLint.
- If your file structure is valid, the site will inform you of such; if not, you might need to modify the structure of your .json file so the CLI will process it correctly.
- After you make any necessary modifications to your file structure, copy the text out of JSONLint, paste it back into Sublime, and save it as a .json file (or validate it locally, as sketched below).
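If you prefer to validate locally rather than pasting into JSONLint, Python’s built-in json module performs the same structural check (the file name is just an example):

# validate_json.py -- parse the generated steps file and report whether it is valid JSON
import json

with open("steps.json") as f:
    try:
        steps = json.load(f)
        print("Valid JSON containing", len(steps), "top-level entries")
    except ValueError as err:
        print("Invalid JSON:", err)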
5. Run the .json files from the CLI to create an EMR cluster. Running a .json file from the CLI will initiate the creation of a new AWS EMR cluster that will search for the desired sentiment data. You can monitor the progress of your cluster from the EMR Console. You will need to provide:
- A cluster name
- subnetID: the SubnetId can be found by visiting one of your recent, successful clusters in the Console. Look under Network and Hardware. You should see something with a structure similar to: subnet-61e40522
- log-uri: this S3 address should point to your debugging bucket
- The .json file name
Your cluster should be running! Check it from the EMR Console
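If you would rather check on your clusters from Python instead of the Console, the boto3 library (an optional extra, not required by this project) can list the active ones using the same credentials you configured for the CLI:

# list_clusters.py -- optional: monitor EMR clusters with boto3
import boto3

emr = boto3.client("emr", region_name="us-east-1")
response = emr.list_clusters(ClusterStates=["STARTING", "RUNNING", "WAITING"])

for cluster in response["Clusters"]:
    print(cluster["Id"], cluster["Name"], cluster["Status"]["State"])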
4. DEVELOPING MODELS TO PREDICT SENTIMENT
As you can imagine, using the opinions about the camera, the software, or the sound of a device to predict the sentiment toward that same device sounds like a fairly easy challenge. What if we increase the difficulty? Would you be able to predict the sentiment toward one phone based on the opinions about the other phone? (A modeling sketch follows the list below.)
- Try to predict the sentiment toward the iPhone based on the Galaxy columns
- Try to predict the sentiment toward the Galaxy based on the iPhone columns
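Once you have used the aggregation script to build your final sentiment matrix, a hedged pandas/scikit-learn sketch of the harder exercise could look like the following. The file name and the column naming convention (“galaxy_...” feature counts, an “iphone_sentiment” label) are placeholders for whatever your actual matrix contains:

# predict_iphone_from_galaxy.py -- illustrative cross-device sentiment model
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

matrix = pd.read_csv("sentiment_matrix.csv")              # placeholder file name

galaxy_cols = [c for c in matrix.columns if c.startswith("galaxy_")]
X = matrix[galaxy_cols]                                   # predictors: Galaxy columns only
y = matrix["iphone_sentiment"]                            # target: sentiment toward the iPhone

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))

Swapping the roles of the two sets of columns gives you the second exercise.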