This project is being undertaken in partial fulfillment of the requirements of IS 607, a course in the CUNY MS in Data Analytics program.

Introduction

The detection of satire and related literary devices (sarcasm, irony, hyperbole, etc.) is an interesting text classification problem. Can a machine learning algorithm implemented in R successfully distinguish satirical texts from non-satirical ones?

Inspiration

Thanks to the reviewers of the Hutzler Banana Slicer, whose satirical reviews led me down the path of asking whether I could create an R algorithm to accomplish this task.

Proposed Sources

Both short-form and longer-form news websites are selected so that the satirical and non-satirical corpora are matched for complexity. Satirical news is one of the more popular forms of satire available online, and its topics often match those of real news sites (politics, health, etc.). This allows for a challenging, topic-matched pair of corpora.

Satire News Sites

http://www.huffingtonpost.com/news/satire/

http://www.theonion.com

http://www.newyorker.com/humor/borowitz-report

Non-Satire (Real) News Sites

http://www.nytimes.com

http://www.cnn.com

http://huffingtonpost.com/news (excluding the /news/satire section)

Proposed Methodology

The proposed methodology is supervised machine learning, using an ensemble approach to improve accuracy. The initial feature representation is based on the bag-of-words model, in which each document is reduced to its word counts and word order is ignored.
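
As a rough illustration of the bag-of-words representation, the sketch below builds a small document-term matrix in base R. The three headlines are invented for illustration and are not drawn from the proposed sources; the helper name tokenize and the variable names are likewise assumptions, not part of the project's code. A real pipeline would probably rely on a text-mining package and add steps such as stop-word removal.

```r
# Toy documents, invented for illustration only (not scraped from any source).
docs <- c(
  "Senate passes budget bill",
  "Local man heroically passes salt",
  "Budget bill heads to the Senate"
)

# Hypothetical helper: lowercase the text, replace anything that is not
# a letter or a space with a space, then split on runs of spaces.
tokenize <- function(text) {
  cleaned <- gsub("[^a-z ]", " ", tolower(text))
  tokens <- strsplit(cleaned, " +")[[1]]
  tokens[tokens != ""]
}

tokens <- lapply(docs, tokenize)

# The vocabulary is the sorted set of distinct tokens across all documents.
vocab <- sort(unique(unlist(tokens)))

# Count how often each vocabulary term occurs in each document.
# vapply returns a terms-by-documents matrix, so transpose it to get
# one row per document and one column per term.
counts <- vapply(
  tokens,
  function(tok) tabulate(match(tok, vocab), nbins = length(vocab)),
  integer(length(vocab))
)
dtm <- t(counts)
dimnames(dtm) <- list(paste0("doc", seq_along(docs)), vocab)

print(dtm)
```

Each row of dtm is a feature vector that a supervised classifier could be trained on, with a satire or non-satire label attached to each document. Word order is discarded, which is the main limitation the bag-of-words model trades for simplicity.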

If the bag-of-words approach proves inadequate as a classification method, we can consider adding the following classification factors: