Preliminary Project Overview

Project Title:

Satire and Data Science: An Exploration into One of the Current Final Frontiers

Project Participants:

Joy Payton and Karen Weigandt

Motivation and Inspiration:

This project is inspired by the Hutzler 571 Banana Slicer and other products that, hard as it is to believe, really do exist (like Bic Cristal For Her pens and theft-deterrent moldy sandwich bags). Joy Payton is the originator of the idea for the analysis, and Karen Weigandt is a collector of kitchen gadgets of the incredibly useless variety.

This project is not only fun and entertaining; it is also grounded in a real social need. Satire is used as a tool to inspire change through humor and wit. It allows us to look at situations with a critical eye and, ideally, to grow through the introspection that naturally follows. For many in the Asperger's and autism community, this understanding does not come naturally, even though the capability and intelligence lie within. To teach a computer to recognize satire, we must dissect it into quantifiable differences, and those differences can in turn become teachable moments. This process can be useful in many situations, as our society and interactions grow ever more complex.

This study has been undertaken in partial fulfillment of the Master's degree in Data Analytics at The City University of New York, in the IS 607 course.

Project Goals:

In this project, we plan to examine satire in Amazon reviews and attempt to teach a computer to recognize what many humans cannot. Along the way, we hope to come up with some hermeneutic “rules of thumb” to help people figure out when a written text is meant to make fun of someone or something. The craft of satire, of poking fun at a situation, product, or person, is intrinsically satisfying to satirists, sometimes especially when those being made fun of are unable to figure it out on their own.

The desired outcome is that we gain knowledge of data science concepts, both within and beyond those encountered in the IS607 curriculum. We hope to execute a data science workflow that includes obtaining data through webscraping and/or using tools like APIs and existing public databases, scrubbing, exploring and modeling the data using a variety of packages available in R, and finally interpreting the data using statistical and visualization tools to present our findings and conclusions.

Data Science Workflow

For our workflow, we have chosen to use the OSEMN model: we will

  • Obtain
  • Scrub
  • Explore
  • Model
  • iNterpret

Amazon product reviews, with an eye toward developing a satire detection model.

A Few Caveats vis-a-vis Obtaining Data

After triggering the Amazon “robot check” several times (and being confused as to why data extraction was failing!), we have placed Sys.sleep() commands in a number of places to keep this from happening and have made duplicate data available in a GitHub repository.

The Sys.sleep() commands as well as the memory-intensive data mining tasks we undertake cause the script to take a surprisingly long time to run. This is a known issue and a condition of data gathering.

Another challenge was the offensive word list we used, which is comprehensive to the point of being overly sensitive; we examine the resulting false positives in the exploration section below.

Obtaining and Scrubbing Data

Introduction to the Data

The URL for the first page of reviews for the Hutzler Banana Slicer is http://www.amazon.com/Hutzler-571-Banana-Slicer/product-reviews/B0047E0EII/?sortBy=byRankDescending&pageNumber=1. Additional pages can be found by changing the last GET parameter to pageNumber=2, 3, and so on.
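
As a minimal sketch, a page URL can be assembled programmatically; only the trailing pageNumber value changes. This mirrors the paste() call used in the download loop later in this document.

base_url <- "http://www.amazon.com/Hutzler-571-Banana-Slicer/product-reviews/B0047E0EII/?sortBy=byRankDescending&pageNumber="
# Sketch: build the review URL for an arbitrary page number.
page_url <- function(page) paste(base_url, page, sep = "")
page_url(2)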

By using Chrome's element inspector, we can peek into the html structure and discover that each review is structured in this way (a short rvest sketch using these selectors follows the list):

  • div with class “reviews” holds all reviews
  • div with class “review” holds a single review
    • div with class “helpful-votes-count” has text describing how helpful the review has been to other users
    • span with class “a-icon-alt” has text describing the number of stars
    • <a> with class “review-title” has title of the review
    • <a> with class “author” has the username of the reviewer
    • span with class “review-date” has text containing the date of the review
    • span with class “review-text” contains the review text
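
As a quick preview, these classes translate directly into the CSS selectors rvest expects (classes are prefixed with a period). This is only a sketch, assuming a copy of the first review page has been saved locally as “Page-1.html”; the full extraction logic is built into the getReviews() function below.

library(rvest)

# Minimal sketch: pull the review titles out of one locally saved page.
page   <- read_html("Page-1.html")
titles <- page %>% html_node(".reviews") %>% html_nodes(".review-title") %>% html_text()
head(titles)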

Tool Preparation

First, we need to install and load rvest, dplyr, and stringi. We'll also load a few packages that we're not using just yet; they'll come in handy later. Note that you may have to install any packages that are not already present in your own R environment.

library(rvest)
## Warning: package 'rvest' was built under R version 3.2.2
## Loading required package: xml2
library(plyr)
library(stringi)
library(dplyr)
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:plyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(stringr)
library(RCurl)
## Loading required package: bitops
library(class)
library(ggplot2)
library(wordcloud)
## Loading required package: RColorBrewer

Getting the Amazon Reviews

First, we’ll set up the local directories where the satire and serious reviews will live:

dir.create("amazon_satire", showWarnings = FALSE)
dir.create("amazon_serious", showWarnings = FALSE)

Note: the code block below demonstrates how to obtain fresh reviews from Amazon. Instead of issuing repeated calls to Amazon’s servers, we have opted to save a copy of the html as it appeared on December 9, 2015 and place it in our GitHub repository. This html will be removed at the end of this semester in deference to the intellectual property rights of Amazon over this content.

getAmazonHTML<- function(base_url, num_pages=20, corpus_identifier){
  
# We obtain the html once only.  This way, if we have to test various ways to 
# manipulate it, we can do so on our local copies and not keep hitting Amazon's servers

# Create a directory for this product's reviews, using the product name as the directory name.
  
directory_name<-paste(corpus_identifier, "/", str_match(base_url, "amazon.com/(.+)/product-reviews")[2], sep="")
dir.create(directory_name, showWarnings = FALSE) # don't bother telling us if the directory already exists

for (i in 1:num_pages) {
  current_url<-paste(base_url,i,sep="")
  filename<-paste(directory_name, "/Page-", i, ".html", sep="")
  download.file(current_url, filename)
  Sys.sleep(4)  # Be a good citizen and don't hammer the server
}
return(directory_name)

}

Note: the code block below shows how we can instead populate our local file system with copies taken from GitHub, sparing Amazon the extra network traffic.

getAmazonHTML<- function(base_url, num_pages=20, corpus_identifier){

# We obtain the html from GitHub, which represents a copy of what was on Amazon
# on Dec. 9, 2015.

# Directory names in the repo reflect the product name

remote_directory_name<-str_match(base_url, "amazon.com/(.+)/product-reviews")[2]
local_directory_name<-paste(corpus_identifier, "/", str_match(base_url, "amazon.com/(.+)/product-reviews")[2], sep="")
dir.create(local_directory_name, showWarnings = FALSE) # don't bother telling us if the directory already exists

for (i in 1:num_pages) {
  current_url<-paste("https://raw.githubusercontent.com/pm0kjp/IS607_Final_Project/master/", remote_directory_name, "/Page-", i, ".html", sep="")
  filename<-paste(local_directory_name, "/Page-", i, ".html", sep="")
  download.file(current_url, filename, method="libcurl")
}
return(local_directory_name)
}

In order to create a data frame with review text and related metadata, we will create a function that will accomplish several tasks:

  • It will download the first n pages of reviews (default is n=20) and store the html in the file system of the local computer.
  • It will then use rvest to extract only the reviews from the html files saved to the file system.
  • Finally, it will clean the data and return a data frame.

Note that this function, for the purposes of this project, provides both obtaining and scrubbing functions in our data science model.

getReviews <-function(base_url, num_pages=20, corpus_identifier) {

# Get the HTML using the function we defined (either grabbing it directly from Amazon, or from our project GitHub.)

directory_name<-getAmazonHTML(base_url,num_pages, corpus_identifier)
  
# We'll load the html files and choose just the "reviews" section.  
# Note that classes are prefixed with a period, so we use ".reviews".  
# Within the "reviews" section, we have any number of nodes that we need to parse.   
# We first create a list called "reviews" that's empty, then fill it with reviews from each html file.
  
reviews<-list()
for (i in 1:length(list.files(directory_name))) {
  amazon_html <- read_html(paste(directory_name, "/Page-", i, ".html", sep=""))
  review_section<-amazon_html %>% html_node(".reviews")
  reviews<-c(reviews, review_section %>% html_nodes(".review"))
}

# Now we create vectors for each element we're going to pull out of each review.

attr(reviews, "class")<-"xml_nodeset"
helpful_votes<-reviews %>% html_nodes(".helpful-votes-count") %>% html_text()
stars<-reviews %>% html_nodes(".a-icon-alt") %>% html_text()
title<-reviews %>% html_nodes(".review-title") %>% html_text()
author<-reviews %>% html_nodes(".author") %>% html_text()
date<-reviews %>% html_nodes(".review-date") %>% html_text()
text<-reviews %>% html_nodes(".review-text") %>% html_text()

# Now we column bind those vectors:
amazon_reviews<-data.frame(cbind(date, title, author, text, stars, helpful_votes), stringsAsFactors = FALSE)
head(amazon_reviews)

# We still need to do some cleanup: date, stars, and helpful_votes need to have extraneous text removed.
amazon_reviews$date<-as.Date(amazon_reviews$date, "on %B %d, %Y")
amazon_reviews$stars<-as.numeric(gsub(" out.+", "", amazon_reviews$stars))
amazon_reviews$helpful_votes<-gsub(" of.+", "", amazon_reviews$helpful_votes)
amazon_reviews$helpful_votes<-as.numeric(gsub(",", "", amazon_reviews$helpful_votes))
return(amazon_reviews)
}

Now we can use that function to get as many pages of reviews as we want from as many products as we want. Besides the data for the Banana Slicer, we’re also interested in reviews on the Bic Cristal For Her pen, the Denon AKDL1 Dedicated Link Cable (Discontinued by Manufacturer), Uranium Ore, and the Samsung UN85S9 Framed 85-Inch 4K Ultra HD 3D Smart LED TV, all of which are characterized by (seemingly 100%) satirical reviews.

We load the satire reviews into data frames, one per product, each with 200 rows representing individual reviews:

banana_slicer_reviews<-getReviews("http://www.amazon.com/Hutzler-571-Banana-Slicer/product-reviews/B0047E0EII/?sortBy=byRankDescending&pageNumber=", corpus_identifier="amazon_satire")
bic_reviews<-getReviews("http://www.amazon.com/BIC-Cristal-1-0mm-Black-MSLP16-Blk/product-reviews/B004F9QBE6/?sortBy=bySubmissionDateDescending&pageNumber=", corpus_identifier="amazon_satire")
cable_reviews<-getReviews("http://www.amazon.com/Denon-AKDL1-Dedicated-Discontinued-Manufacturer/product-reviews/B000I1X6PM/?sortBy=bySubmissionDateDescending&pageNumber=", corpus_identifier="amazon_satire")
uranium_reviews<-getReviews("http://www.amazon.com/Images-SI-Inc-Uranium-Ore/product-reviews/B000796XXM/?sortBy=bySubmissionDateDescending&pageNumber=", corpus_identifier="amazon_satire")
tv_reviews<-getReviews("http://www.amazon.com/Samsung-UN85S9-Framed-85-Inch-Ultra/product-reviews/B00CMEN95U/?sortBy=bySubmissionDateDescending&pageNumber=", corpus_identifier="amazon_satire")
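
As a quick sanity check (not part of the original output), each of these satire data frames should contain 200 rows, i.e. 20 pages of 10 reviews each:

# Quick check: each product should yield 200 reviews (20 pages x 10 reviews per page).
sapply(list(banana_slicer = banana_slicer_reviews,
            bic           = bic_reviews,
            cable         = cable_reviews,
            uranium       = uranium_reviews,
            tv            = tv_reviews),
       nrow)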

We do the same for our serious reviews:

apple_slicer_reviews<-getReviews("http://www.amazon.com/OXO-Grips-Apple-Corer-Divider/product-reviews/B00004OCKT/?sortBy=byRankDescending&pageNumber=", corpus_identifier="amazon_serious")
## Warning in cbind(date, title, author, text, stars, helpful_votes): number
## of rows of result is not a multiple of vector length (arg 3)
pen_reviews<-getReviews("http://www.amazon.com/BIC-Velocity-1-6mm-Black-VLGB11-Blk/product-reviews/B004F9QBDC/?sortBy=byRankDescending&pageNumber=", corpus_identifier="amazon_serious")
hdmi_cable_reviews<-getReviews("http://www.amazon.com/Mediabridge-ULTRA-HDMI-Cable-25-Feet/product-reviews/B0031TRZX2/?sortBy=byRankDescending&pageNumber=", corpus_identifier="amazon_serious")
gemstone_reviews<-getReviews("http://www.amazon.com/Madagascar-gemstones-labradorite-septarian-chrysocolla/product-reviews/B003KQZY2K/?sortBy=byRankDescending&pageNumber=", corpus_identifier="amazon_serious")
normal_tv_reviews<-getReviews("http://www.amazon.com/Samsung-UN40J5200-40-Inch-1080p-Smart/product-reviews/B00WR292JE/?sortBy=byRankDescending&pageNumber=", corpus_identifier="amazon_serious")

Exploring Data

Basic stats

Here we can compute basic statistics that don't require text mining: the number of words per review, the number of words per title, the number of words in all caps, and so on. Again, we'll create a function so that the analysis can be applied to any data frame.

We’ll also count offensive terms, which we’ve obtained from http://www.cs.cmu.edu/~biglou/resources/bad-words.txt

offensive<-unlist(read.table("http://www.cs.cmu.edu/~biglou/resources/bad-words.txt"))
# Add spaces on either side to tokenize the offensive terms.
# The reason is that we don't want to match on "ho", let's say, if the word
# is "holiday", but only when "ho" is used as a word in and of itself.
offensive<-paste(" ",offensive," ",sep="")

Now that we have the offensive terms in an R object, we can create a counting function, which will add new columns to the data frame passed to it.

countWords<-function (df) {
# Count words
df<-mutate(df, num_words_title = stri_count(title,regex="\\S+"))
df<-mutate(df, num_words_review = stri_count(text,regex="\\S+"))
# Count words in all caps that aren't single letter words like I, A, etc.
df<-mutate(df, num_caps_title = stri_count(title,regex="[A-Z]{2,}"))
df<-mutate(df, num_caps_review = stri_count(text,regex="[A-Z]{2,}"))
# Count number of exclamation points
df<-mutate(df, num_excl_title = stri_count(title,regex="!"))
df<-mutate(df, num_excl_review = stri_count(text,regex="!"))
# Count number of question marks
df<-mutate(df, num_quest_title = stri_count(title,regex="\\?"))
df<-mutate(df, num_quest_review = stri_count(text,regex="\\?"))
# Count offensive
df<-mutate(df, num_offensive_title = stri_count(title, regex=paste(offensive, collapse = "|")))
df<-mutate(df, num_offensive_review = stri_count(text, regex=paste(offensive, collapse = "|")))

return(df)
}

In each data frame, let’s count the elements we suspect might be indicative of satire: exclamation points, words in all capital letters, question marks, and offensive terms.

banana_slicer_reviews<-countWords(banana_slicer_reviews)
bic_reviews<-countWords(bic_reviews)
cable_reviews<-countWords(cable_reviews)
uranium_reviews<-countWords(uranium_reviews)
tv_reviews<-countWords(tv_reviews)

apple_slicer_reviews<-countWords(apple_slicer_reviews)
pen_reviews<-countWords(pen_reviews)
hdmi_cable_reviews<-countWords(hdmi_cable_reviews)
gemstone_reviews<-countWords(gemstone_reviews)
normal_tv_reviews<-countWords(normal_tv_reviews) 

It makes sense to combine all of the rows from the data frames above into one large data frame, with an additional column indicating the serious / satire category.

satire_reviews<-rbind(banana_slicer_reviews,bic_reviews,cable_reviews,uranium_reviews,tv_reviews)
satire_reviews$category<-"Satire"
serious_reviews<-rbind(apple_slicer_reviews,pen_reviews,hdmi_cable_reviews,gemstone_reviews,normal_tv_reviews)
serious_reviews$category<-"Serious"
all_reviews<-rbind(satire_reviews,serious_reviews)

Now we can do a bit of data exploration with data visualization!

# Boxplot of offensive terms by category
ggplot(all_reviews, aes(x = category, y= num_offensive_review)) + geom_boxplot() + ggtitle("Boxplot: Number of offensive terms by category")

# We can see how the number of offensive words compares to the number of exclamation points in the reviews, and if this might differ for satire and serious reviews
qplot(num_offensive_review, num_excl_review, data = all_reviews, color = category) + ggtitle("Number of Offensive Words vs. Number of Exclamation Points")

# We can also see whether the use of exclamation points is related to the length of the reviews
qplot(num_excl_review, num_words_review, data = all_reviews, color = category) + ggtitle("Exclamation vs. Number of Words")

# ...and whether capitalized words relate to the length of the review, by category
qplot(num_caps_review, num_words_review, data = all_reviews, color = category) + ggtitle("Capitalized Words  vs. Number of Words")

# Another thing we can look at is whether the aggregation of these indicators shows any kind of pattern with respect to the category, when compared to the length of the reviews
# Combine the indicators (exclamation points, question marks, capitals, and offensive words) into one value for each review
# Keep only the review-level indicator counts (capitals, exclamation points,
# question marks, and offensive words) from the columns added by countWords() above.
sub_counts_df <- all_reviews[, c("num_caps_review", "num_excl_review",
                                 "num_quest_review", "num_offensive_review")]
sub_counts_df$indicate_tot <- rowSums(sub_counts_df)

indicate_comp_df <- cbind.data.frame(sub_counts_df$indicate_tot, all_reviews$num_words_review, all_reviews$category)

qplot(sub_counts_df$indicate_tot, all_reviews$num_words_review, data = indicate_comp_df, color = all_reviews$category) + ggtitle("Number of Indicators vs. Total Number of Words")

If we want to check whether the offensive-word matching is legitimate (perhaps we are hitting on words that aren't really offensive, at least in our opinion), we can do something like the following:

bad_apples<-apple_slicer_reviews %>% filter (num_offensive_review > 0)
str_match_all(bad_apples$text, paste(offensive, collapse = "|"))
## [[1]]
##      [,1]     
## [1,] " hole " 
## [2,] " knife "
## 
## [[2]]
##      [,1]    
## [1,] " hole "
## 
## [[3]]
##      [,1]      
## [1,] " harder "
## [2,] " bigger "
## 
## [[4]]
##      [,1]     
## [1,] " knife "
## 
## [[5]]
##      [,1]     
## [1,] " knife "
## 
## [[6]]
##      [,1]     
## [1,] " knife "
## 
## [[7]]
##      [,1]      
## [1,] " stupid "
## 
## [[8]]
##      [,1]       
## [1,] " banging "
## 
## [[9]]
##      [,1]     
## [1,] " nasty "
## 
## [[10]]
##      [,1]     
## [1,] " knife "
## 
## [[11]]
##      [,1]     
## [1,] " knife "
## 
## [[12]]
##      [,1]     
## [1,] " knife "
## 
## [[13]]
##      [,1]     
## [1,] " color "
## 
## [[14]]
##      [,1]     
## [1,] " knife "
## 
## [[15]]
##      [,1]      
## [1,] " bigger "
## 
## [[16]]
##      [,1]      
## [1,] " abuse " 
## [2,] " bigger "
## 
## [[17]]
##      [,1]     
## [1,] " dirty "
## 
## [[18]]
##      [,1]     
## [1,] " black "
## [2,] " hole " 
## [3,] " hole " 
## 
## [[19]]
##      [,1]     
## [1,] " knife "
## 
## [[20]]
##      [,1]     
## [1,] " knife "
## 
## [[21]]
##      [,1]      
## [1,] " harder "
## 
## [[22]]
##      [,1]     
## [1,] " knife "
## 
## [[23]]
##      [,1]     
## [1,] " knife "
## 
## [[24]]
##      [,1]      
## [1,] " harder "
## 
## [[25]]
##      [,1]     
## [1,] " knife "
## 
## [[26]]
##      [,1]     
## [1,] " knife "
## 
## [[27]]
##      [,1]     
## [1,] " knife "

In this case, we're hitting on words that aren't really offensive, or that would only be offensive in context. Since these terms seem to be equitably distributed between the satire and serious reviews, we leave them be for now.
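
To back up that claim, a quick summary of the counts we computed earlier (a sketch using the dplyr verbs loaded above, not part of the original output) shows how the flagged terms are distributed across the two categories:

# Sketch: distribution of "offensive" hits per review, by category.
# Similar rates in both groups suggest the over-sensitive word list is not biasing one class.
all_reviews %>%
  dplyr::group_by(category) %>%
  dplyr::summarise(reviews           = dplyr::n(),
                   mean_offensive    = mean(num_offensive_review),
                   reviews_with_hits = sum(num_offensive_review > 0))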

Creating Corpora and TDM

For preprocessing purposes, we will create one corpus for each kind of review (one “satire” corpus with its TDM, and one “serious” corpus with its TDM). In each corpus, we'll treat the product names as stopwords (so that, for example, the satire corpus does not treat the term “banana”, which appears in a product name, as a bellwether word).

library(tm)
## Loading required package: NLP
## 
## Attaching package: 'NLP'
## 
## The following object is masked from 'package:ggplot2':
## 
##     annotate
library(SnowballC)

We create a function that will take any of the data frames and create text files of the title and review text from each row, which will contribute to corpus creation.

makeTextFiles <-function(corpus_name, df) {
dir.create(corpus_name, showWarnings = FALSE) # don't bother telling us if the directory already exists
for (i in 1:nrow(df)){
file_name<-paste(corpus_name, "/", deparse(substitute(df)), i, sep="")
write(paste(df$title[i],df$text[i]), file_name)
}
}

We’ll now use the function we created to write text files representing each row of each data frame. While we could do that via a combined data frame, here we’ll do it data frame by data frame, which preserves the name of the product in the file name, in case we need to do closer examination of any given text file. We create two corpora, one serious, and one satire.

makeTextFiles("amazon_satire/corpus", bic_reviews)
makeTextFiles("amazon_satire/corpus", banana_slicer_reviews)
makeTextFiles("amazon_satire/corpus", cable_reviews)
makeTextFiles("amazon_satire/corpus", tv_reviews)
makeTextFiles("amazon_satire/corpus", uranium_reviews)

makeTextFiles("amazon_serious/corpus", pen_reviews)
makeTextFiles("amazon_serious/corpus", apple_slicer_reviews)
makeTextFiles("amazon_serious/corpus", hdmi_cable_reviews)
makeTextFiles("amazon_serious/corpus", normal_tv_reviews)
makeTextFiles("amazon_serious/corpus", gemstone_reviews)

As is typical in data science, our methodology is iterative, and we're back to the “scrubbing” task: we need to preprocess our corpora. We'll try a standard bag-of-words approach, removing punctuation, converting everything to lower case, and stemming words.

We create a function to handle this preprocessing. The function takes as input a directory name where the corpus is found and an optional string for providing additional stopwords (for example, removing words that are in the name of the product itself).

createCleanCorpus <- function(dir_name, special_stopwords="") {
  corpus<-Corpus(DirSource(dir_name))  
  corpus <- tm_map(corpus, removePunctuation)   
  corpus <- tm_map(corpus, removeNumbers)   
  corpus <- tm_map(corpus, content_transformer(tolower))   
  corpus <- tm_map(corpus, removeWords, stopwords("english")) 
  stopwords_vector<-strsplit(special_stopwords, " ")[[1]]
  corpus <- tm_map(corpus, removeWords, stopwords_vector) # remove special stopwords both before and ...
  corpus <- tm_map(corpus, stemDocument) 
  corpus <- tm_map(corpus, removeWords, stopwords_vector) # ... after stemming.
  return(corpus)
}

Using the function we just defined, we’ll build two corpora and obtain the term document matrix for each:

satire_corpus<-createCleanCorpus("amazon_satire/corpus", "hutzler banana slice slicer bic cristal pen for her woman women  man men female feminin denon akdl dedicated link cable discontinued by manufacturer samsung uns framed inch k ultra hd d smart led tv uranium ore tvs hdmi ")

satire_tdm <- TermDocumentMatrix(satire_corpus)   

serious_corpus<-createCleanCorpus("amazon_serious/corpus", "oxo apple grip core corer divide divider slice slicer bic pen velocity mediabridge hdmi black ultra cable feet madagascar gemstone gemstones stone stones labradorite septarian chrysocolla samsung inch p unj smart tv televis tvs display")

serious_tdm <- TermDocumentMatrix(serious_corpus)   

Now that we have our TDMs, we want to attach the category (“satire” or “serious”) to the Term Document Matrix. We’ll transpose the matrix as well.

bindCategoryTDM<- function(tdm, category){
  transposed_matrix<-t(data.matrix(tdm))
  df<-as.data.frame(transposed_matrix, stringsAsFactors = FALSE)
  df<-cbind(df, rep(category, nrow(df)))
  colnames(df)[ncol(df)]<-"review_category"
  return(df)
}

satire_df<-bindCategoryTDM(satire_tdm, "satire")
serious_df<-bindCategoryTDM(serious_tdm, "serious")

We'll stack the two data frames, replacing any NA with 0 and moving the category variable to the front:

df_for_model<-rbind.fill(satire_df,serious_df)
df_for_model[is.na(df_for_model)]<-0
df_for_model<-df_for_model %>% dplyr::select(review_category, everything())

Let's take a look at what we have: the first five columns and the last column.

head(df_for_model[, c(1:5, ncol(df_for_model))])
##   review_category abandon abil abject abl yummi
## 1          satire       0    0      0   1     0
## 2          satire       0    0      0   0     0
## 3          satire       0    0      0   0     0
## 4          satire       0    0      0   0     0
## 5          satire       0    0      0   0     0
## 6          satire       0    0      0   0     0
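
Before modeling, we can also confirm the shape of this combined data frame; we expect 2,000 rows (one per review, 1,000 satire and 1,000 serious) and one column per stemmed term plus the category column. This quick check is a sketch, not part of the original output:

# Quick check: rows = documents (reviews), columns = stemmed terms + review_category.
dim(df_for_model)
table(df_for_model$review_category)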

Modeling Data

Note: we initially tried to use the klaR implementation of Naive Bayes, but found that when a predictor has zero variance, klaR throws an error (see http://stats.stackexchange.com/questions/35694/naive-bayes-fails-with-a-perfect-predictor). For this reason, we moved to the e1071 package.

Note that we’re dealing with 2000 cases, so predict() will churn for a while before completing. For this reason, it is added in a separate code block with eval set to FALSE.

library(e1071)

# set up a training sample (70% of the data frame)
df_train <- sample(1:nrow(df_for_model), ceiling(nrow(df_for_model)* 0.7), replace=FALSE)

# and a testing sample that excludes the above:
df_test <- (1:nrow(df_for_model))[-df_train]
# Construct the model, using the training sample, and apply it across the board to see its fit:

model<-naiveBayes(review_category ~ ., data=df_for_model[df_train,])
prediction<-predict(model, newdata=df_for_model[,-1])
table(prediction, df_for_model$review_category)

# Apply the model to just the test data (not used in training):
prediction<-predict(model, newdata=df_for_model[df_test,-1])
table(prediction, df_for_model[df_test,"review_category"] )

# Unfortunately this is not a good model!  Everything shows up as "serious", except for one 
# correctly classified satire text.

The results from our model’s prediction are discouraging. They are reproduced here in case one does not wish to spend the CPU cycles and/or time required to execute the prediction:

## prediction satire serious
##    satire       0       1
##    serious   1000     999

As we can see, this initial attempt at Naive Bayes failed, and it took a very long time to arrive at its predictions. What if we selected only the words that are most distinctive of the satire corpus and the words that are most distinctive of the serious corpus?

# First, find the terms that appear in each TDM > 30X
satire_terms<-findFreqTerms(satire_tdm, 30)
serious_terms<-findFreqTerms(serious_tdm, 30)

# Then, find the intersection (where the terms are the same in both corpora), and remove that from each term list
# so that we are left with *distinctive* terms.

satire_distinctive_terms<-setdiff(satire_terms, serious_terms)
serious_distinctive_terms<-setdiff(serious_terms, satire_terms)

distinctive_terms<-c(satire_distinctive_terms, serious_distinctive_terms)
distinctive_terms
##   [1] "allow"      "anyth"      "ask"        "bad"        "becom"     
##   [6] "began"      "believ"     "black"      "blue"       "boy"       
##  [11] "bring"      "call"       "care"       "cat"        "caus"      
##  [16] "cereal"     "chang"      "christma"   "close"      "complet"   
##  [21] "couldnt"    "curv"       "decid"      "design"     "direct"    
##  [26] "dog"        "door"       "enjoy"      "experi"     "eye"       
##  [31] "face"       "fact"       "felt"       "figur"      "final"     
##  [36] "food"       "free"       "friend"     "fruit"      "futur"     
##  [41] "gave"       "gift"       "god"        "guy"        "hair"      
##  [46] "half"       "happen"     "head"       "hear"       "heard"     
##  [51] "help"       "hope"       "hour"       "hous"       "husband"   
##  [56] "idea"       "ill"        "imagin"     "immedi"     "instead"   
##  [61] "kid"        "kind"       "kitchen"    "knew"       "knife"     
##  [66] "ladi"       "later"      "least"      "leav"       "left"      
##  [71] "let"        "life"       "live"       "lost"       "mayb"      
##  [76] "might"      "mine"       "month"      "morn"       "mother"    
##  [81] "music"      "name"       "near"       "next"       "night"     
##  [86] "noth"       "notic"      "offer"      "part"       "pay"       
##  [91] "peel"       "person"     "pick"       "place"      "plan"      
##  [96] "power"      "real"       "realiz"     "rememb"     "result"    
## [101] "said"       "saw"        "seen"       "sell"       "simpli"    
## [106] "someon"     "soon"       "spend"      "state"      "stop"      
## [111] "stuff"      "tell"       "three"      "togeth"     "told"      
## [116] "travel"     "understand" "unfortun"   "victorio"   "warn"      
## [121] "wasnt"      "whole"      "wife"       "wish"       "wonder"    
## [126] "wont"       "word"       "world"      "wrong"      "your"      
## [131] "absolut"    "app"        "appl"       "audio"      "awesom"    
## [136] "bag"        "beauti"     "blade"      "bold"       "bottom"    
## [141] "brand"      "button"     "cheap"      "comfort"    "compani"   
## [146] "compar"     "compon"     "cost"       "coupl"      "custom"    
## [151] "definit"    "easi"       "easier"     "easili"     "etc"       
## [156] "exact"      "excel"      "expens"     "extra"      "fast"      
## [161] "favorit"    "featur"     "foot"       "function"   "game"      
## [166] "handl"      "hard"       "hdtv"       "hook"       "hub"       
## [171] "impress"    "includ"     "ink"        "instal"     "internet"  
## [176] "issu"       "laptop"     "larg"       "length"     "local"     
## [181] "mix"        "mount"      "movi"       "netflix"    "other"     
## [186] "overal"     "perform"    "plastic"    "player"     "polish"    
## [191] "port"       "press"      "push"       "quit"       "rate"      
## [196] "reason"     "recommend"  "remot"      "replac"     "requir"    
## [201] "rock"       "satisfi"    "screen"     "select"     "servic"    
## [206] "sharp"      "signal"     "slow"       "smooth"     "sometim"   
## [211] "speaker"    "stand"      "stream"     "support"    "surpris"   
## [216] "system"     "theyr"      "thick"      "tip"        "tool"      
## [221] "top"        "tumbl"      "tumbler"    "type"       "unit"      
## [226] "updat"      "valu"       "video"      "wifi"
# Note that when we look at these terms, there seems to be overfitting: we are capturing words that 
# are specific to the products being reviewed.  We might consider going back and adding additional
# product-specific keywords if we were to expand the scope of this project.

# Now let's try Naive Bayes with this limited bank of words:

model<-naiveBayes(review_category ~ ., data=df_for_model[df_train, c("review_category", distinctive_terms)])
prediction<-predict(model, newdata=df_for_model[,-1])
table(prediction, df_for_model$review_category)
##           
## prediction satire serious
##    satire     678      93
##    serious    322     907
# Apply the model to just the test data (not used in training):
prediction<-predict(model, newdata=df_for_model[df_test,c(distinctive_terms)])
table(prediction, df_for_model[df_test,"review_category"] )
##           
## prediction satire serious
##    satire     192      43
##    serious    109     256

As we can see, limiting the scope of variables has helped our Naive Bayes model, both in effectiveness and speed.
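
To put a number on that improvement, we can compute the overall accuracy from the held-out confusion matrix above, roughly (192 + 256) / 600, or about 74.7%. This small sketch mirrors the accuracy calculation used for KNN below:

# Sketch: overall test-set accuracy of the reduced-vocabulary Naive Bayes model,
# computed from the confusion matrix printed above.
nb_confusion <- table(prediction, df_for_model[df_test, "review_category"])
nb_accuracy  <- sum(diag(nb_confusion)) / sum(nb_confusion) * 100
nb_accuracy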

What if we attempted a K Nearest Neighbor algorithm instead? We’ll try it with all the terms, not the limited pool of most distinctive words, to see what we come up with.

# We'll try KNN.

knn_model<-knn(df_for_model[df_train, -1], df_for_model[df_test, -1], df_for_model[df_train, 1])
confusion_matrix<-table("Predictions" = knn_model, "Actual" = df_for_model[df_test,1])
confusion_matrix
##            Actual
## Predictions satire serious
##     satire     181      80
##     serious    120     219
accuracy<-sum(diag(confusion_matrix) / length(df_test)) * 100
accuracy
## [1] 66.66667

Interpreting Data

We can see from the visualizations above that indicators like punctuation, capitalization, and review length can signal that a review warrants closer inspection for satirical content, but they are by no means definitive. Even combining these indicative elements is no guarantee that satire will be recognized.

A bag-of-words approach is an interesting exercise, but also by no means comprehensive enough for definitive classification. Let’s look at some visualizations of the vocabulary used in satirical vs. serious reviews.

install.packages("wordcloud", dependencies = TRUE, repos = "http://cran.mirrors.hoobly.com/") # word-cloud generator 
install.packages("RColorBrewer", dependencies = TRUE, repos = "http://cran.mirrors.hoobly.com/") # color palettes

library(ggplot2)
library(wordcloud)
library(RColorBrewer)

We can create word clouds and bar plots to show the most frequently used words in satirical reviews and in serious reviews.

# create a data frame for words and their frequency - satire

sat_m <- as.matrix(satire_tdm) # convert tdm to matrix

sat_m_sort <- sort(rowSums(sat_m), decreasing=TRUE) # sort by frequency, highest first

sat_m_sort_df <- as.data.frame(sat_m_sort) # save the frequency values 

satire_freq_df <- data.frame(words = row.names(sat_m_sort_df), freq = sat_m_sort_df$sat_m_sort) # create a data frame containing the words and their frequencies

# Draw a word cloud
set.seed(2112) # Set seed for reproducibility
wordcloud(words = satire_freq_df$words, freq = satire_freq_df$freq, min.freq = 20,
          max.words=100, random.order=TRUE, rot.per=0.30, 
          colors=brewer.pal(8, "Set1"))

ggplot(satire_freq_df[1:25,], aes(x = words, y = freq)) + geom_bar(stat = "identity", fill="#FF9999", colour="black") + labs(title="Satire Word Frequencies")

# create a data frame for words and their frequency - serious

ser_m <- as.matrix(serious_tdm) # convert tdm to matrix

ser_m_sort <- sort(rowSums(ser_m), decreasing=TRUE) # sort by frequency, highest first

ser_m_sort_df <- as.data.frame(ser_m_sort) # save the frequency values 

serious_freq_df <- data.frame(words = row.names(ser_m_sort_df), freq = ser_m_sort_df$ser_m_sort) # create a data frame containing the words and their frequencies

# Draw a word cloud
set.seed(2112)
wordcloud(words = serious_freq_df$words, freq = serious_freq_df$freq, min.freq = 20,
          max.words=100, random.order=FALSE, rot.per=0.35, 
          colors=brewer.pal(8, "Set1"))

ggplot(serious_freq_df[1:25,], aes(x = words, y = freq)) + geom_bar(stat = "identity", fill="#56B4E9", colour="black") + labs(title="Serious Word Frequencies")

From the visualizations above, we can see that the words used extensively in satire are similar to those used in serious reviews. This is one reason a bag-of-words approach alone is not a workable model, even when supplemented with sentiment analysis.

The key insight gained from this project is that while satire is hard for humans to detect, it becomes even harder when we try to break down what separates it from standard language. Selecting and scraping the data (product reviews from amazon.com), using a list of offensive words, and exploring some of the fundamental indicators associated with satire in this setting gave us some insight into the difficulties facing the data science community as progress is made in this challenging area. The elements considered and the model chosen have an enormous effect on the results of the analysis. Exploratory analysis and visualization are essential for understanding the influence of the different variables. Machine learning offers a way for the computer to gain its own “life” experience with satire. We chose a simple bag-of-words analysis for this project and tested a couple of different algorithms. Since the first model did not give viable results, we retested the algorithm with a refined, more distinctive bag of words.

One aspect of satire that is difficult to account for is the manner in which it references cultural and topical issues. Until this can be broken down into logical components (perhaps using some type of graph database) and a practical way to implement them is found, we fear that computers, like humans, will find that the assessment of satire remains more of an art than a science.