Swear Word Analysis Using Google NGram Data

Amy Su Jiang
December 20th 2014

Project Objective

  • Problem Statement:

With the proliferation of reading materials, parents and teachers need a way to determine the appropriateness of any book or magazine.

  • Objective:

Design a standardized book rating system that helps parents and teachers easily understand the maturity level of any book.

Data Description

# Load the offensive-word lexicon: one row per key word, with its offensive level and targets
data <- read.csv("BadLanguage.csv")
summary(data)
       Offensive.Level   Target.Level1     Target.Level2     Key.word  
 1 - racism    :20     person   :128   men idiot  :24    anus    :  1  
 2 - homophobia:21     situation: 22   idiot      :17    arse    :  1  
 3 - insult    :87                     homosexual :16    arsehole:  1  
 4 - others    :22                     men lowlife:14    ass     :  1  
                                       swearing   :10    assbite :  1  
                                       mess up    : 8    asses   :  1  
                                       (Other)    :61    (Other) :144  
head(data)
  Offensive.Level Target.Level1 Target.Level2    Key.word
1      1 - racism        person       arabian  sandnigger
2      1 - racism        person         asian        gook
3      1 - racism        person         black        coon
4      1 - racism        person         black     jigaboo
5      1 - racism        person         black junglebunny
6      1 - racism        person         black       negro
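
The Offensive.Level labels combine a numeric level and a category name, so it can help to split them before further analysis. The following is a minimal sketch, not the author's code: it uses a toy data frame that copies only three columns of the six rows shown by head(data), and the names lexicon, level_num, level_label and target_counts are hypothetical.

```r
# Toy stand-in for BadLanguage.csv: three columns of the six rows printed by head(data).
lexicon <- data.frame(
  Offensive.Level = rep("1 - racism", 6),
  Target.Level1   = rep("person", 6),
  Target.Level2   = c("arabian", "asian", "black", "black", "black", "black")
)

# Split a label such as "1 - racism" into a numeric level and a text label
# (level_num and level_label are hypothetical helper columns).
lexicon$level_num   <- as.integer(sub(" - .*$", "", lexicon$Offensive.Level))
lexicon$level_label <- sub("^[0-9]+ - ", "", lexicon$Offensive.Level)

# Count lexicon entries per fine-grained target.
target_counts <- table(lexicon$Target.Level2)
print(target_counts)
```

On the full file, the same two sub() calls should apply to all four levels, assuming every label follows the "number - name" pattern seen in the summary.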

Data Analysis

[Figure: plot of chunk unnamed-chunk-2]

D3 Illustration

Conclusion

* Creation of book rating metrics with insulting language features:
A logical classification system for rating books by their insulting language features was developed.

* Development of time based classification method:
Because the oral language in American literature has changed dramatically over time, books from each historical period deserve their own rating metrics, built on the insulting language features of that period.

* Next Step:
To complete the book rating metrics with a context-based language detection feature, further study should use the Google Books 2-gram and 3-gram data.
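
One way to picture the time-based classification is to compare how often a key word appears in each period relative to all words printed in that period. The sketch below is an illustration under stated assumptions, not the method used in this project: the rows mimic the ngram / year / match_count layout of the Google Books n-gram files, every count is invented, and the names ngrams, totals, by_decade and per_million are hypothetical.

```r
# Toy rows in the ngram / year / match_count layout; all counts are invented.
ngrams <- data.frame(
  ngram       = c("idiot", "idiot", "idiot", "idiot"),
  year        = c(1901, 1905, 1951, 1958),
  match_count = c(10, 30, 50, 150)
)

# Hypothetical total word counts per year, also invented.
totals <- data.frame(
  year        = c(1901, 1905, 1951, 1958),
  total_count = c(1000, 3000, 10000, 30000)
)

# Attach yearly totals and assign each year to its decade.
merged <- merge(ngrams, totals, by = "year")
merged$decade <- (merged$year %/% 10) * 10

# Sum counts within each decade, then express usage per million words.
by_decade <- aggregate(cbind(match_count, total_count) ~ ngram + decade,
                       data = merged, FUN = sum)
by_decade$per_million <- by_decade$match_count / by_decade$total_count * 1e6
print(by_decade)
```

Normalizing by the period's total word count keeps decades with more printed text from dominating, so a rating threshold could be set separately for each period.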