IS 607 Final Project Proposal

Introduction
Inspiration
Proposed Sources
Proposed Methodology

This project is being undertaken in partial fulfillment of the requirements of IS 607, a course in the CUNY MS in Data Analytics program.

Introduction

The detection of satire and related devices (sarcasm, irony, hyperbole, etc.) is an interesting text classification problem. Is it possible for a machine learning algorithm implemented in R to successfully distinguish satirical texts from non-satirical texts?

Inspiration

Thanks to the reviewers of the Hutzler Banana Slicer, whose satirical reviews led me down the path of asking whether I could create an R algorithm to accomplish this task.

Proposed Sources

Both short-form and longer-form news websites are selected so that the satirical and non-satirical corpora are matched for complexity. Satirical news is one of the more popular forms of satire available online, and its topics often match those of real news sites (politics, health, etc.). This allows for a challenging, topic-matched pair of corpora.

Satire News Sites

http://www.huffingtonpost.com/news/satire/

http://www.theonion.com

http://www.newyorker.com/humor/borowitz-report

Non-Satire (Real) News Sites

http://www.nytimes.com

http://www.cnn.com

http://huffingtonpost.com/news (with the exception of /satire)

Proposed Methodology

The proposed methodology is supervised machine learning with an ensemble approach for increased accuracy. The initial approach is based on the bag-of-words paradigm.

Screen scrape a number of stories from satire and real news sites, separating them in a file system according to their source
Create a training subset from a random sample of satire and real news stories
Use R’s tm package to create corpora and apply corpus cleaning and other required preprocessing
Create a document-term matrix
Attempt various combinations of classification algorithms using RTextTools
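
The sampling, cleaning, and term-matrix steps above can be sketched in base R as follows. This is only an illustration under stated assumptions: the four headlines are made-up stand-ins for scraped stories, and the helper names tokenize and build_dtm are hypothetical. In the actual project, the tm package (for example its DocumentTermMatrix function) would be expected to do this work.

```r
# Toy stand-ins for scraped stories (hypothetical text, not real data).
stories <- data.frame(
  text = c("Area Man Wins Argument With Himself!",
           "Senate passes budget bill after long debate.",
           "Nation's Cats Demand Second Breakfast",
           "City council approves new transit budget."),
  label = c("satire", "real", "satire", "real"),
  stringsAsFactors = FALSE
)

# Basic cleaning: lowercase, replace anything that is not a letter or a
# space with a space, then split on runs of spaces.
tokenize <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z ]", " ", x)
  tokens <- strsplit(x, " +")[[1]]
  tokens[nchar(tokens) > 0]
}

# Bag-of-words document-term matrix: one row per story, one column per
# term, each cell holding the count of that term in that story.
build_dtm <- function(texts) {
  token_list <- lapply(texts, tokenize)
  vocab <- sort(unique(unlist(token_list)))
  counts <- vapply(
    token_list,
    function(tokens) as.integer(table(factor(tokens, levels = vocab))),
    integer(length(vocab))
  )
  dtm <- t(counts)
  colnames(dtm) <- vocab
  dtm
}

dtm <- build_dtm(stories$text)

# Random training subset; the seed and the size of 3 are arbitrary choices.
set.seed(607)
train_idx <- sample(nrow(stories), size = 3)
train_dtm <- dtm[train_idx, , drop = FALSE]
train_labels <- stories$label[train_idx]
```

The resulting matrix and label vector are the kind of input a classifier would be trained on; word order is discarded, which is the defining simplification of the bag-of-words approach.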

If a bag-of-words approach proves inadequate as a classification method, we can consider adding the following classification features:

Use of non-standard punctuation or capitalization as data
Term proximity evaluation
Advanced tokenization based on ontology or phrase extraction
Semantic or sentiment analysis
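
As a rough illustration of the first idea in this list, punctuation and capitalization can be turned into numeric features that sit alongside the term counts. The sketch below uses base R only; the function name surface_features and the particular four features are assumptions chosen for the example, not a settled design.

```r
# Hypothetical helper: summarize punctuation and capitalization in one text.
surface_features <- function(text) {
  chars <- strsplit(text, "")[[1]]
  words <- strsplit(text, "[[:space:]]+")[[1]]
  words <- words[nchar(words) > 0]
  # Strip non-letters so "BREAKING:" is compared as "BREAKING".
  letters_only <- gsub("[^A-Za-z]", "", words)
  c(
    exclamations = sum(chars == "!"),
    questions = sum(chars == "?"),
    # Words of two or more letters written entirely in upper case.
    all_caps_words = sum(nchar(letters_only) > 1 &
                           letters_only == toupper(letters_only)),
    # Share of words whose first character is an upper-case letter.
    capitalized_share = mean(grepl("^[A-Z]", words))
  )
}

features <- surface_features("BREAKING: Man Eats lunch!!")
```

Each story would contribute one such feature vector, which could then be bound as extra columns to the document-term matrix before training.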

Joy Payton

November 5, 2015
