We were given a short list of smart phones, each capable of running the app suite’s functions, and were asked to analyze the positive and negative attitudes toward them online in order to narrow the list down to a single device. To gain insight into those attitudes, we perform an extensive web sentiment analysis.
The sentiment analysis has two parts, and this is the first. We use the Amazon Elastic MapReduce (EMR) platform to run a series of Hadoop Streaming jobs that collect a large number of smart phone-related web pages from the Common Crawl, a massive repository of web data. The gathered data is saved as hundreds of compressed (.gz) files, which need to be unzipped first. We then compile them from the command line into one large data matrix, a .csv file, for the sentiment analysis that follows.
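The unzip-and-compile step can also be scripted rather than done by hand. The following is a minimal Python sketch, not the command-line procedure used in this project. It assumes every .gz part holds comma-separated rows without a header line. The function name combine_gzipped_csv and the file names are hypothetical.

```python
import csv
import gzip
from pathlib import Path


def combine_gzipped_csv(input_dir, output_csv):
    """Concatenate the rows of every .gz file in input_dir into one CSV.

    Assumes each compressed part contains comma-separated rows and no header.
    Returns the number of rows written.
    """
    rows_written = 0
    with open(output_csv, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        # Sort so the output order is reproducible across runs.
        for part in sorted(Path(input_dir).glob("*.gz")):
            # Text mode ("rt") decompresses the part on the fly.
            with gzip.open(part, "rt", encoding="utf-8", newline="") as src:
                for row in csv.reader(src):
                    if row:  # skip blank lines
                        writer.writerow(row)
                        rows_written += 1
    return rows_written
```

Calling combine_gzipped_csv("output_parts", "LargeMatrix.csv") would then produce a single file that read_csv can load, provided the parts really share one column layout.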
Sentiment analysis is a method for gauging the opinions of individuals or groups, such as a segment of a brand’s audience or an individual customer in communication with a customer support representative. Using a scoring mechanism, it monitors reviews and conversations and evaluates language and voice inflections to quantify attitudes, opinions, and emotions related to a business, product, service, or topic.
In our project, we first counted the frequency of words associated with sentiment or reviews of smart phone devices in the uncompressed web pages. Later, we incorporated machine learning algorithms to look for patterns, which enabled us to label each review with a score representing the level of positive, negative, or neutral sentiment toward these devices. Finally, all streamed data was stored in AWS Simple Storage Service (S3).
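To make the counting step concrete, here is a minimal Python sketch of keyword frequency counting on one page of text. It is an illustration only, not the mapper used in the project. The keyword list is a hypothetical subset, and matching on whole words is an assumption about how the counts were defined.

```python
import re
from collections import Counter

# Hypothetical keyword subset; the real matrix has many more count columns.
KEYWORDS = ["iphone", "samsung galaxy", "sony xperia", "ios"]


def count_keywords(text, keywords=KEYWORDS):
    """Return how often each keyword occurs in text, ignoring case."""
    lowered = text.lower()
    counts = Counter()
    for kw in keywords:
        # \b keeps "ios" from matching inside longer words like "curiosity".
        pattern = r"\b" + re.escape(kw) + r"\b"
        counts[kw] = len(re.findall(pattern, lowered))
    return counts
```

Each page then yields one row of counts, which is the shape of the rows in the final data matrix.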
Because of the sheer volume of data to process, we ran the analysis on AWS Elastic Compute Cloud (EC2). The data we analyzed came from the Common Crawl, an open repository of web crawl data that is accessible to the general public.
The following Python scripts were provided:
Because we are interested in sentiment mining, we focus on the subset of the WARC files that contains only extracted text: the WET files. As a first step toward getting our input addresses, we download last month’s WET paths file from the Common Crawl blog.
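Each line of the WET paths file is a relative key, so it needs a bucket prefix before it can serve as a job input address. Here is a minimal Python sketch of that conversion. It assumes the public s3://commoncrawl/ bucket as the prefix. The function name read_wet_paths and the limit parameter are hypothetical.

```python
import gzip

# Assumption: inputs are read from Common Crawl's public S3 bucket.
S3_PREFIX = "s3://commoncrawl/"


def read_wet_paths(paths_file, limit=None):
    """Return S3 URIs for the WET files listed in a paths file.

    Accepts a plain-text file or a gzip-compressed one (name ends in .gz).
    If limit is given, at most that many URIs are returned.
    """
    opener = gzip.open if str(paths_file).endswith(".gz") else open
    uris = []
    with opener(paths_file, "rt", encoding="utf-8") as fh:
        for line in fh:
            if limit is not None and len(uris) >= limit:
                break
            key = line.strip()
            if key:  # ignore blank lines
                uris.append(S3_PREFIX + key)
    return uris
```

The limit parameter makes it easy to test a job on a handful of files before scaling up to the full list.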
Run python createJsonFiles.py from the command shell to generate your JSON file.
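The contents of createJsonFiles.py are not reproduced here, so the following is only a sketch of what such a generator might look like. It writes a list of Hadoop Streaming step descriptions in the JSON layout that the AWS CLI accepts for aws emr add-steps. The step names, the mapper and reducer file names, and the bucket prefixes are all hypothetical.

```python
import json


def build_streaming_steps(input_uris, output_prefix, script_prefix,
                          mapper="mapper.py", reducer="reducer.py"):
    """Build one Hadoop Streaming step description per input URI."""
    steps = []
    for i, uri in enumerate(input_uris):
        steps.append({
            "Name": f"wet-step-{i}",          # hypothetical naming scheme
            "Type": "STREAMING",
            "ActionOnFailure": "CONTINUE",    # keep going if one step fails
            "Args": [
                "-files", f"{script_prefix}/{mapper},{script_prefix}/{reducer}",
                "-mapper", mapper,
                "-reducer", reducer,
                "-input", uri,
                "-output", f"{output_prefix}/step-{i}",
            ],
        })
    return steps


def write_steps(steps, path):
    """Save the step list as pretty-printed JSON."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(steps, fh, indent=2)
```

Writing one step per input file gives every step its own output folder, which matters because a Hadoop job fails if its output directory already exists.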
library(readr)
#### Import Final Dataset
LargeMatrix <- read_csv("LargeMatrix.csv")
Let’s take a peek at the first few rows of the final dataset, which we will use in Part 2 of this sentiment analysis.
head(LargeMatrix)
## # A tibble: 6 x 59
##      id iphone samsunggalaxy sonyxperia nokialumina htcphone   ios googleandroid
##   <dbl>  <dbl>         <dbl>      <dbl>       <dbl>    <dbl> <dbl>         <dbl>
## 1     0      2             0          0           0        0     2             6
## 2     1      1             0          0           0        0     0             0
## 3     2      1             0          0           0        0     0             0
## 4     3      1             0          0           0        0     0             0
## 5     4      1             0          0           0        0     0             0
## 6     5      0             0          1           0        0     0             0
## # ... with 51 more variables: iphonecampos <dbl>, samsungcampos <dbl>,
## # sonycampos <dbl>, nokiacampos <dbl>, htccampos <dbl>, iphonecamneg <dbl>,
## # samsungcamneg <dbl>, sonycamneg <dbl>, nokiacamneg <dbl>, htccamneg <dbl>,
## # iphonecamunc <dbl>, samsungcamunc <dbl>, sonycamunc <dbl>,
## # nokiacamunc <dbl>, htccamunc <dbl>, iphonedispos <dbl>,
## # samsungdispos <dbl>, sonydispos <dbl>, nokiadispos <dbl>, htcdispos <dbl>,
## # iphonedisneg <dbl>, samsungdisneg <dbl>, sonydisneg <dbl>,
## # nokiadisneg <dbl>, htcdisneg <dbl>, iphonedisunc <dbl>,
## # samsungdisunc <dbl>, sonydisunc <dbl>, nokiadisunc <dbl>, htcdisunc <dbl>,
## # iphoneperpos <dbl>, samsungperpos <dbl>, sonyperpos <dbl>,
## # nokiaperpos <dbl>, htcperpos <dbl>, iphoneperneg <dbl>,
## # samsungperneg <dbl>, sonyperneg <dbl>, nokiaperneg <dbl>, htcperneg <dbl>,
## # iphoneperunc <dbl>, samsungperunc <dbl>, sonyperunc <dbl>,
## # nokiaperunc <dbl>, htcperunc <dbl>, iosperpos <dbl>, googleperpos <dbl>,
## # iosperneg <dbl>, googleperneg <dbl>, iosperunc <dbl>, googleperunc <dbl>