Collect data from Common Crawl and use Elastic MapReduce (EMR) to process it on Amazon Web Services

Overview

Background Info

  • We are working with a government health agency to create a suite of smartphone medical apps that facilitate communication between medical professionals and aid workers in developing countries. The agency will provide aid workers with technical support services, but it needs to limit that support to a single model of smartphone and operating system. Standardizing on one device will also limit purchase costs and ensure uniformity when training aid workers to use it.

Objective

  • We were given a short list of devices, all capable of running the app suite, and asked to analyze positive and negative attitudes toward these smartphones online in order to narrow the list down to a single device. To gain insight into these attitudes, we perform an extensive web sentiment analysis.

  • The sentiment analysis consists of two parts, and this post covers the first. We use the Amazon Elastic MapReduce (EMR) platform to run a series of Hadoop Streaming jobs that collect a large number of smartphone-related web pages from the Common Crawl, a massive repository of web data. The gathered data is saved as hundreds of compressed (.gz) files, which must first be unzipped. We then compile them into one large data matrix (.csv file) using the command line interface, for use in the sentiment analysis to come.

Data Collection and Preparation

Implementation of Sentiment Analysis

  • Sentiment analysis is a method for gauging the opinions of individuals or groups, such as a segment of a brand’s audience or an individual customer in communication with a customer support representative. Based on a scoring mechanism, sentiment analysis monitors reviews and conversations and evaluates language to quantify attitudes, opinions, and emotions related to a business, product, service, or topic.

  • In our project, we first counted the frequency of words associated with sentiment or reviews of smartphone devices in the uncompressed web pages. We then incorporated machine learning algorithms to look for patterns, which let us label each review with scores representing the levels of positive, negative, or neutral attitude toward these devices. Finally, all streamed data was stored in the AWS Simple Storage Service (S3).

  • Because of the scale of the data being processed, AWS Elastic Compute Cloud (EC2) instances were used to conduct the analysis. The data we analyzed came from the Common Crawl, an open repository of web crawl data that is accessible to the general public.

The following Python scripts were provided:

  • Mapper.py examines and counts data from portions of the Common Crawl data.
  • Reducer.py accumulates the analysis from the individual mapper jobs.
  • Concatenatepv3.py aggregates the results from multiple streaming jobs.
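
The provided scripts are not reproduced in this post. As a rough illustration of how a Hadoop Streaming mapper/reducer pair of this kind works, here is a minimal sketch; the keyword list and the "keyword<TAB>count" output format are assumptions, not the actual Mapper.py and Reducer.py:

#!/usr/bin/env python
# mapper_sketch.py - illustrative only, not the provided Mapper.py.
# Hadoop Streaming feeds WET text to stdin; we emit "keyword<TAB>count" lines.
import sys

KEYWORDS = ["iphone", "samsunggalaxy", "sonyxperia", "nokialumina", "htcphone"]

for line in sys.stdin:
    words = line.lower().split()
    for kw in KEYWORDS:
        n = words.count(kw)
        if n:
            print("%s\t%d" % (kw, n))

#!/usr/bin/env python
# reducer_sketch.py - illustrative only, not the provided Reducer.py.
# Hadoop Streaming sorts mapper output by key before the reduce phase, so
# equal keys arrive together and we only need to total consecutive runs.
import sys

current, total = None, 0
for line in sys.stdin:
    key, _, value = line.strip().partition("\t")
    if not key:
        continue  # skip blank lines
    if key == current:
        total += int(value)
    else:
        if current is not None:
            print("%s\t%d" % (current, total))
        current, total = key, int(value)
if current is not None:
    print("%s\t%d" % (current, total))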

Identify the Data Source: Common Crawl

  • Common Crawl is a non-profit organization that crawls and archives the readable Internet roughly once per month. The archived files are stored in Amazon Web Services S3 in the N. Virginia (us-east-1) region. Each crawl is split into thousands of roughly similarly sized files, saved in the WARC (Web ARChive) format and gzipped. Each of these files has its own specific address, and we use these addresses as input to Amazon Web Services.
  • Because we are interested in sentiment mining, we focus on the subset of the WARC files that contains only text: the WET files. As a first step toward getting our input addresses, we download the wet.paths file for the most recent month from the Common Crawl blog.

    • The wet.paths file consists of tens of thousands of relative addresses (paths of the form crawl-data/CC-MAIN-…/wet/….warc.wet.gz), which we saved as a .bdf file.
    • We added “s3://commoncrawl/” to the beginning of any file address we intend to use as input, so that EMR can recognize these addresses, as sketched below.
    • We set up three S3 buckets via the web console: one for the mapper and reducer scripts, one for the output, and one for debugging logs. The aws-logs-598119493318-us-east-1 bucket was created automatically by Amazon S3.
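
A minimal sketch of the prefixing step (the file names wet.paths and wet_s3.bdf are assumptions):

# prefix_paths_sketch.py - illustrative only; file names are assumptions.
# Prepend "s3://commoncrawl/" to each relative WET path so EMR can read it.
with open("wet.paths") as src, open("wet_s3.bdf", "w") as dst:
    for path in src:
        path = path.strip()
        if path:  # skip blank lines
            dst.write("s3://commoncrawl/" + path + "\n")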

Use the Command Line Interface to run EMR job flows

  • Although the EMR console on AWS provides an easy-to-use graphical interface, it is not efficient for running large jobs. Instead, we use the AWS command line interface (CLI), which provides the ability to programmatically launch and monitor running job flows. This interface is accessed from Terminal on Mac OS X and from the Command Prompt on a Windows machine.
  • We are provided with a Python script, createJsonFiles.py. We edit this script to point at our .bdf file, and it generates the JSON job definitions.
  • Steps to take (a minimal sketch of such a script follows this list):
    • Select WET file addresses starting with s3://commoncrawl/.
    • Update the Python script with the correct S3 locations of the Mapper.py and Reducer.py files.
    • Update the Python script with the correct S3 address of the output bucket.
    • Run python createJsonFiles.py from the command shell to generate the .json file.

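The provided createJsonFiles.py is not reproduced here. The following minimal sketch shows the general idea, assuming hypothetical bucket names, the wet_s3.bdf file from the earlier step, and the step-file JSON layout that the AWS CLI accepts for Hadoop Streaming steps:

# create_json_sketch.py - illustrative only, not the provided createJsonFiles.py.
# Bucket names and file names below are assumptions.
import json

MAPPER = "s3://my-script-bucket/Mapper.py"    # hypothetical script bucket
REDUCER = "s3://my-script-bucket/Reducer.py"
OUTPUT = "s3://my-output-bucket/job1"         # hypothetical output bucket

# Read the WET addresses (already prefixed with s3://commoncrawl/).
with open("wet_s3.bdf") as f:
    inputs = [line.strip() for line in f if line.strip()]

steps = [{
    "Name": "CommonCrawlStreaming",
    "ActionOnFailure": "CONTINUE",
    "Type": "STREAMING",
    "Args": ["-files", "%s,%s" % (MAPPER, REDUCER),
             "-mapper", "Mapper.py",
             "-reducer", "Reducer.py",
             "-output", OUTPUT]
            + [arg for path in inputs for arg in ("-input", path)],
}]

with open("job1.json", "w") as f:
    json.dump(steps, f, indent=2)  # indent makes the file easy to validate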

  • The generated .json output is a list of Hadoop Streaming step definitions, like the one written by the sketch above.
  • Checking the validity of the .json file: for the CLI to process it correctly, the JSON file has to be formatted and structured correctly. Here, we copied the code from the Sublime Text editor and pasted it into JSONLint to validate it (a command-line alternative is shown after this list). After a small formatting adjustment near the end, we were ready for the next step.
  • Run the .json files from the CLI to create an EMR cluster.
    • A representative AWS command to run the JSON file is shown below.
    • Monitor the cluster from the AWS console.
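
A quick command-line alternative to JSONLint is Python’s built-in json.tool module, which pretty-prints the file or reports where parsing fails:

python -m json.tool job1.json

A representative create-cluster invocation, assuming the hypothetical job1.json step file from the sketch above and illustrative cluster settings (name, release label, instance type and count), might look like:

aws emr create-cluster --name "CommonCrawlJob1" \
    --release-label emr-5.8.0 --applications Name=Hadoop \
    --use-default-roles --instance-type m4.large --instance-count 3 \
    --log-uri s3://aws-logs-598119493318-us-east-1/elasticmapreduce/ \
    --steps file://./job1.json --auto-terminate

The --auto-terminate flag shuts the cluster down once all steps finish, so EC2 charges stop accumulating.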

Consolidate the results of EMR job flows

  • Download the EMR output from the S3 output bucket. We use Cyberduck to download all of the individual output folders into a single folder located on the local machine.
  • Put the concatenatepv3.py file in the same folder where we saved all the EMR output folders. Concatenatepv3.py will open each of the EMR output folders and aggregate all of the part files into two .csv files (a rough sketch of this aggregation follows this list).
  • Open a command prompt and run python concatenatepv3.py.
  • Rename ‘concatenated_factors.csv’ to LargeMatrix.csv.
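
The provided concatenatepv3.py is not reproduced here. This minimal sketch shows the general aggregation pattern under assumed folder and file names; it writes a single combined file, whereas the actual script produces two:

# concatenate_sketch.py - illustrative only, not the provided concatenatepv3.py.
# Walk each EMR output folder and append every "part-*" record to one CSV.
import csv, glob, os

rows = []
for part in sorted(glob.glob(os.path.join("emr_output", "*", "part-*"))):
    with open(part) as f:
        for line in f:
            fields = line.strip().split("\t")  # streaming output is tab-separated
            if fields[0]:
                rows.append(fields)

with open("concatenated_factors.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)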

Final Dataset

library(readr)
#### Import Final Dataset 
LargeMatrix <- read_csv("LargeMatrix.csv")

Let’s take a peek at the first few rows of the final dataset; we will use this dataset for Part 2 of this sentiment analysis.

head(LargeMatrix)
## # A tibble: 6 x 59
##      id iphone samsunggalaxy sonyxperia nokialumina htcphone   ios googleandroid
##   <dbl>  <dbl>         <dbl>      <dbl>       <dbl>    <dbl> <dbl>         <dbl>
## 1     0      2             0          0           0        0     2             6
## 2     1      1             0          0           0        0     0             0
## 3     2      1             0          0           0        0     0             0
## 4     3      1             0          0           0        0     0             0
## 5     4      1             0          0           0        0     0             0
## 6     5      0             0          1           0        0     0             0
## # ... with 51 more variables: iphonecampos <dbl>, samsungcampos <dbl>,
## #   sonycampos <dbl>, nokiacampos <dbl>, htccampos <dbl>, iphonecamneg <dbl>,
## #   samsungcamneg <dbl>, sonycamneg <dbl>, nokiacamneg <dbl>, htccamneg <dbl>,
## #   iphonecamunc <dbl>, samsungcamunc <dbl>, sonycamunc <dbl>,
## #   nokiacamunc <dbl>, htccamunc <dbl>, iphonedispos <dbl>,
## #   samsungdispos <dbl>, sonydispos <dbl>, nokiadispos <dbl>, htcdispos <dbl>,
## #   iphonedisneg <dbl>, samsungdisneg <dbl>, sonydisneg <dbl>,
## #   nokiadisneg <dbl>, htcdisneg <dbl>, iphonedisunc <dbl>,
## #   samsungdisunc <dbl>, sonydisunc <dbl>, nokiadisunc <dbl>, htcdisunc <dbl>,
## #   iphoneperpos <dbl>, samsungperpos <dbl>, sonyperpos <dbl>,
## #   nokiaperpos <dbl>, htcperpos <dbl>, iphoneperneg <dbl>,
## #   samsungperneg <dbl>, sonyperneg <dbl>, nokiaperneg <dbl>, htcperneg <dbl>,
## #   iphoneperunc <dbl>, samsungperunc <dbl>, sonyperunc <dbl>,
## #   nokiaperunc <dbl>, htcperunc <dbl>, iosperpos <dbl>, googleperpos <dbl>,
## #   iosperneg <dbl>, googleperneg <dbl>, iosperunc <dbl>, googleperunc <dbl>