This project is being undertaken in partial fulfillment of the requirements of IS 607, a course in the CUNY MS in Data Analytics.
The detection of satire and related literary devices (sarcasm, irony, hyperbole, etc.) is an interesting text classification problem. Is it possible for a machine learning algorithm implemented in R to successfully distinguish satirical texts from non-satirical texts?
Thanks to the reviewers of the Hutzler Banana Slicer, whose satirical reviews led me down the path of asking whether I could create an R algorithm to accomplish this task.
Both short-form and longer-form news websites are selected so that the satirical and non-satirical corpora are matched for complexity. News sites are one of the more popular types of satire available online, and their topics often match those of real news sites (politics, health, etc.). This allows for a challenging topic-matched pair of corpora.
Satire News Sites
http://www.huffingtonpost.com/news/satire/
http://www.newyorker.com/humor/borowitz-report
Non-Satire (Real) News Sites
http://huffingtonpost.com/news (with the exception of /satire)
The proposed methodology is supervised machine learning, using an ensemble of classifiers for increased accuracy. The initial approach is based on the bag-of-words model, in which each text is represented by its word counts without regard to word order.
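As a rough illustration of what the bag-of-words representation and an ensemble vote might look like, here is a minimal sketch in base R. The three toy headlines are invented for illustration, and the names tokenize and majority_vote are hypothetical helpers, not part of any package or of the project's final code. A real implementation would likely use a text-mining package and trained classifiers rather than these hand-rolled pieces.

```r
# Toy documents standing in for scraped articles (invented for illustration).
docs <- c(
  satire1 = "Area man wins argument with toaster",
  real1   = "Senate passes budget bill after long debate",
  satire2 = "Toaster wins election, promises warm bread for all"
)

# Hypothetical helper: lowercase, replace non-letters with spaces,
# and split on runs of spaces.
tokenize <- function(text) {
  text <- tolower(text)
  text <- gsub("[^a-z ]", " ", text)
  tokens <- strsplit(text, " +")[[1]]
  tokens[tokens != ""]
}

tokens <- lapply(docs, tokenize)
vocab <- sort(unique(unlist(tokens)))

# Document-term matrix: one row per document, one column per vocabulary
# word, each cell holding the count of that word in that document.
dtm <- t(vapply(
  tokens,
  function(tk) tabulate(match(tk, vocab), nbins = length(vocab)),
  integer(length(vocab))
))
colnames(dtm) <- vocab

# Hypothetical helper: simple ensemble by majority vote over the labels
# predicted by several classifiers for one document.
majority_vote <- function(votes) {
  names(which.max(table(votes)))
}
```

The rows of dtm, paired with known satire/real labels, would be the training input for each supervised classifier; majority_vote(c("satire", "real", "satire")) shows how the individual predictions for a single document could then be combined.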
If the bag-of-words approach proves inadequate as a classification method, we can consider adding the following classification features: