This notebook goes over a basic implementation of a naive Bayes classifier, which I use here to identify toxic (unconstructive) message board comments on Wikipedia.
The data are the full text of comments on Wikipedia discussion boards that have been manually rated by several workers as hostile/unconstructive (toxic) or good/constructive on a scale of -5 to 5. For this analysis, I sum the workers' scores for each message and label everything with a negative total as toxic and everything with a non-negative total as nontoxic.
| comment | split | score (0 = nontoxic, 1 = toxic) |
|---|---|---|
| This One can make an analogy in mathematical terms by envisioning the distribution of opinions in a population as a Gaussian curve We would then say that the consensus would be a statement that represents the range of opinions within perhaps three standard deviations of the mean opinion sounds arbitrary and ad hoc Does it really belong in n encyclopedia article I don’t see that it adds anything useful The paragraph that follows seems much more useful Are there any political theorists out there who can clarify the issues It seems to me that this is an issue that Locke Rousseau de Toqueville and others must have debated SR | train | 0 |
| Clarification for you and Zundark’s right i should have checked the Wikipedia bugs page first This is a bug in the code that makes wikipedia work it just means that there is a line of code that may have an error as small as an extra space It’s analogous in a VERY simplified way to trying to make something bold in HTML and forgetting to put the at the end so you’d see something like this words in bold Instead of this words in bold It’s not like a virus that is code somebody deliberately wrote in order to infect your computer and damage files so it won’t go around JHK | train | 0 |
| Elected or Electoral JHK | test | 0 |
| This is such a fun entry Devotchka I once had a coworker from Korea and not only couldn’t she tell the difference between USA English and British English she had trouble telling the difference between different European languages Kind of keeps things in perspective eh Not suprising While I can easily tell the difference between French German Italian Spanish Dutch etc put me in a room with a Chinese Japanese Korean Vietnamese and a Thai speaker and I probably couldn’t tell the difference If I saw it written I’d probably have somewhat more luck though SJK Vietnamese has more syllable final consonants than Japanese I think you can tell them apart that way maybe Is this right Juuitchan Someone suggested Heath Robinson and Rube Goldberg as a vocabulary difference It’s certainly an interesting parallel but I don’t think it really belongs here They were both artists with their own style and both are known on both sides of the pond although their use as descriptive adjectives is split as suggested At any rate they can’t quite be considered translations because as an adjective Rube Goldberg is more specific describing an overly complex mechanical device or a complex series of interdependent actions Heath Robinson in contrast is more surrealistic or fantasy oriented LDC As an American I would like to say that to me a bum is a homeless person as much as the butt a flat is an apartment and rubbish certainly is trash Granted I agree that a fag is not a cigarette and underground is not a subway I may do some actual research and come back and fiddle with that list Eean I think Americans certainly understand the use of bum for butt rubbish for trash and to a lesser degree flat for apartment But we don’t use those terms much Point to a container for discarded things and an American will say that’s a trash can a Brit will say that’s a rubbish bin Americans are more likely to use rubbish in the sense of bullshit LDC I deleted the following pair limited Ltd and incorporated since they actually mean different things Incorporated means a corporation limited means a limited liability corporation you can also have unlimited liability corporations and no liability corporations British and Australian also Ltd is roughly equivalent to American LLC SJK I would say ‘torch’ was much more common than ‘pocket lamp’ which sounds quite old fashioned ‘Flashlight’ would be more easily recognised than the latter Yes I’d call it a torch and it would probably be labeled as a flashlight in its manufacturer’s packaging IMO ‘torch’ is colloquial British English The Anome Oh so flashlight is correct British usage My dictionary said Am and the Oxford English Dictionary carried flashlight only in the meaning of photography Then I’ll remove the entry again AxelBoldt | train | 0 |
| Please relate the ozone hole to increases in cancer and provide figures Otherwise this article will be biased toward the environmentalist anti CFC point of view instead of being neutral Ed Poor | test | 0 |
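To make the labelling rule concrete, here is a minimal sketch of the aggregation step. The notebook itself does this in R; the data frame and column names below are made up for the example, not the actual schema.

```python
import pandas as pd

# Hypothetical schema: one row per (comment, worker) rating on the -5 to 5 scale.
ratings = pd.DataFrame({
    "comment_id":   [101, 101, 101, 102, 102],
    "worker_score": [-3, -2, 1, 2, 0],
})

# Sum the worker scores for each comment, then label:
# negative total -> toxic (1), non-negative total -> nontoxic (0).
totals = ratings.groupby("comment_id")["worker_score"].sum()
labels = (totals < 0).astype(int)   # 101 -> 1 (sum -4), 102 -> 0 (sum 2)
```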
The following cleaning steps are applied to the messages: punctuation and other non-alphabetic characters are stripped out with a regex, stop words are removed, and the remaining words are lemmatized.
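A minimal sketch of such a pipeline in Python is shown below. The notebook's actual cleaning is done in R, and the use of NLTK's stop word list and WordNet lemmatizer here is an assumption, so the exact tokens it produces will differ slightly from those shown in the table that follows.

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def clean(message):
    # Replace anything that is not a letter with whitespace, then tokenize.
    tokens = re.sub(r"[^A-Za-z]+", " ", message).split()
    # Drop stop words and lemmatize what remains.
    return [lemmatizer.lemmatize(t.lower())
            for t in tokens if t.lower() not in stop_words]

clean("Please relate the ozone hole to increases in cancer")
# -> ['please', 'relate', 'ozone', 'hole', 'increase', 'cancer']
```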
The resulting documents are significantly stripped down. If it looks like relevant information has been lost, one can trim the stop word list or adjust the regex pattern.
| token |
|---|
| analogy mathematical term envision distribution opinion population curve consensus statement represent range opinion standard deviation opinion sound arbitrary ad hoc belong encyclopedia article add paragraph political theorist clarify issue issue locke Rousseau de debate sr |
| clarification check wikipedia bug page bug code make wikipedia mean line code error extra space analogous simplify bold html forget word bold word bold virus code deliberately write infect computer damage file |
| elect electoral |
| fun entry coworker korea difference usa English British English trouble tell difference European language perspective eh easily difference French german Italian Spanish dutch Chinese Japanese Korean Vietnamese Thai speaker difference write luck Vietnamese syllable final consonant Japanese suggest heath robinson rube goldberg vocabulary difference parallel belong artist style pond descriptive adjective split suggest rate consider translation adjective rube goldberg specific describe overly complex mechanical device complex series action heath robinson contrast fantasy oriented American bum homeless person butt flat apartment rubbish trash grant agree fag cigarette underground subway actual research fiddle list American understand bum butt rubbish trash lesser degree flat apartment term container discard American trash brit rubbish bin American rubbish sense bullshit delete pair limit incorporate incorporate mean corporation limit mean limit liability corporation unlimited liability corporation liability corporation British Australian roughly equivalent American llc torch common pocket lamp sound fashion flashlight easily recognise call torch label flashlight manufacturer packaging imo torch colloquial British English anome flashlight correct British usage dictionary oxford English dictionary carry flashlight mean photography remove entry |
| relate ozone hole increase cancer provide figure article bias environmentalist anti cfc view neutral editor poor |
As the name suggests, our naive Bayes classifier is built from Bayes' rule and a naive assumption of independence between occurrences of words in a document. Specifically, we model each message as an observation from a multinomial distribution, where
\[P(\textrm{toxic}|\textrm{message}) = \frac{P(\textrm{message}|\textrm{toxic})P(\textrm{toxic})}{P(\textrm{message})} \propto P(\textrm{message}|\textrm{toxic})P(\textrm{toxic}) = P(\textrm{toxic})\prod_{i=1}^nP(\textrm{word}_i|\textrm{toxic})\]
…and similarly for non-toxic (good) messages. The classification rule is simple: tag a message as toxic if, conditional on its contents, the probability that it is toxic exceeds the probability that it is non-toxic, and as nontoxic otherwise.
\[\textrm{if } P(\textrm{toxic}|\textrm{message}) > P(\textrm{good}|\textrm{message}) \rightarrow \textrm{tag as toxic, else tag as good}\]
Equivalently, since the denominator \(P(\textrm{message})\) cancels in the ratio, \[\textrm{if }\frac{P(\textrm{toxic}|\textrm{message})}{P(\textrm{good}|\textrm{message})} = \frac{P(\textrm{toxic})\prod_{i=1}^nP(\textrm{word}_i|\textrm{toxic})}{P(\textrm{good})\prod_{i=1}^nP(\textrm{word}_i|\textrm{good})} > 1 \rightarrow \textrm{tag as toxic, else tag as good}\]
To prevent arithmetic underflow, we instead calculate the sum of the logs of the probabilities:
\[\textrm{if }\log P(\textrm{toxic}|\textrm{message})-\log P(\textrm{good}|\textrm{message}) = \left[\log P(\textrm{toxic})+\sum_{i=1}^n\log P(\textrm{word}_i|\textrm{toxic})\right]-\left[\log P(\textrm{good})+\sum_{i=1}^n\log P(\textrm{word}_i|\textrm{good})\right] > 0 \rightarrow \textrm{tag as toxic, else tag as good } \textbf{(1)}\] It remains to decide how to estimate the probabilities. For this implementation, we simply use the maximum likelihood estimator, the fraction of observed counts within the toxic and good messages. Specifically: \[\frac{n_i}{N}, \textrm{ where } n_i \textrm{ is the observed count of a particular word across all messages of the same class (toxic or good), and } N \textrm{ is the total word count within that class.}\]
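The following is a compact sketch of this estimator and decision rule. Python is used for illustration only; the notebook's implementation is in R, and the function and variable names here are invented. Note that a pure MLE gives zero probability to words never observed in a class, so this sketch simply restricts scoring to words seen in both classes, which is one simple workaround and not necessarily the notebook's choice.

```python
import math
from collections import Counter

def train(token_lists, labels):
    """Log-priors and MLE word log-probabilities for classes 0 (good) and 1 (toxic)."""
    word_counts = {0: Counter(), 1: Counter()}
    for tokens, label in zip(token_lists, labels):
        word_counts[label].update(tokens)
    doc_counts = Counter(labels)
    log_prior = {c: math.log(doc_counts[c] / len(labels)) for c in (0, 1)}
    log_prob = {}
    for c in (0, 1):
        total = sum(word_counts[c].values())           # N: total word count in class c
        log_prob[c] = {w: math.log(n / total)          # log(n_i / N) for each word
                       for w, n in word_counts[c].items()}
    return log_prior, log_prob

def classify(tokens, log_prior, log_prob):
    """Rule (1): tag as toxic when the difference of log scores is positive."""
    # Only score words observed in both classes; the MLE would otherwise assign
    # zero probability (log of -inf) to words unseen in one class.
    shared = [w for w in tokens if w in log_prob[0] and w in log_prob[1]]
    score = {c: log_prior[c] + sum(log_prob[c][w] for w in shared) for c in (0, 1)}
    return 1 if score[1] - score[0] > 0 else 0
```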
For each class, we create a table containing each word and its associated log probability. From these tables we can compute the quantities in (1) and obtain a prediction for a given message. We can also measure the “toxicity” of a word as the ratio of its log probabilities under the two classes, with the toxic class in the numerator: \(\frac{\log{P(\textrm{word}|\textrm{toxic})}}{\log{P(\textrm{word}|\textrm{good})}}\)
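Continuing the sketch above, a word's toxicity ratio can be read off the two log-probability tables; it is only defined for words observed in both classes. This is an illustration, not the notebook's actual code.

```python
def toxicity(word, log_prob):
    # Ratio of log probabilities, toxic class (1) in the numerator.
    if word in log_prob[0] and word in log_prob[1]:
        return log_prob[1][word] / log_prob[0][word]
    return None  # word unseen in one of the classes
```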
Finally, we obtain predictions for each message and check performance. The prior probabilities for toxicity and non-toxicity, \(P(\textrm{toxic})\) and \(P(\textrm{good})\), are simply the fractions of all documents that belong to each class: \(\frac{\textrm{number of documents of a given class}}{\textrm{total number of documents}}\)
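Putting the pieces together, performance on the held-out split can be summarized in a confusion matrix. The snippet below continues the earlier sketch; the tiny token lists stand in for the real cleaned train/test data and are purely illustrative.

```python
import pandas as pd

# Toy stand-ins for the cleaned train/test splits (hypothetical data).
train_tokens = [["idiot", "stupid", "edit"], ["thanks", "helpful", "edit"], ["nice", "article", "edit"]]
train_labels = [1, 0, 0]
test_tokens  = [["helpful", "edit"], ["stupid", "edit"]]
test_labels  = [0, 1]

log_prior, log_prob = train(train_tokens, train_labels)   # from the sketch above
predicted = [classify(t, log_prior, log_prob) for t in test_tokens]

# Cross-tabulate actual labels (rows) against predicted labels (columns).
print(pd.crosstab(pd.Series(test_labels, name="actual"),
                  pd.Series(predicted, name="predicted")))
```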
| actual \ predicted | 0 (good) | 1 (toxic) |
|---|---|---|
| 0 (good) | 24729 | 2104 |
| 1 (toxic) | 1312 | 3544 |
The accuracy here is not great: we correctly flag toxic messages only about 76% of the time, and some supposedly innocent comments are flagged as well. The misclassified comments tend to be relatively benign; their average summed score (before simplification to 0/1) is closer to 0 than that of the correctly classified comments.
Finally, we attempt to improve our accuracy using tf-idf vectorization in Python. The data are passed between R and Python using the .feather file format, entirely within this notebook. The results are shown below.
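On the Python side, the tf-idf step might look roughly like the sketch below. The feather file name, the column names, and the choice of a multinomial naive Bayes model on top of the tf-idf features are assumptions rather than the notebook's exact code.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix

# Hypothetical file and column names for the cleaned data exported from R.
df = pd.read_feather("comments_clean.feather")
train_df = df[df["split"] == "train"]
test_df = df[df["split"] == "test"]

# Turn each cleaned document (a space-joined string of tokens) into a tf-idf vector.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_df["tokens"])
X_test = vectorizer.transform(test_df["tokens"])

model = MultinomialNB().fit(X_train, train_df["toxic"])
predicted = model.predict(X_test)

# Rows: actual label, columns: predicted label.
print(confusion_matrix(test_df["toxic"], predicted))
```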
| actual \ predicted | 0 (good) | 1 (toxic) |
|---|---|---|
| 0 (good) | 26472 | 2269 |
| 1 (toxic) | 396 | 2556 |
Our accuracy improves appreciably, to roughly 92% overall, with about 87% of toxic-rated comments classified correctly.
Sun, T. (2009). Spam Filtering based on Naive Bayes Classification. http://www.cs.ubbcluj.ro/~gabis/DocDiplome/Bayesian/000539771r.pdf
Silge, J. and Robinson, D. (2016). tidytext: Text Mining and Analysis Using Tidy Data Principles in R. JOSS, 1(3). doi: 10.21105/joss.00037. http://doi.org/10.21105/joss.00037
Pedregosa, F. et al. (2011). Scikit-learn: Machine Learning in Python. JMLR, 12, pp. 2825-2830.
Lang, D. (2016). wordcloud2: Create Word Cloud by htmlWidget. R package version 0.2.0. https://CRAN.R-project.org/package=wordcloud2
R Core Team (2017). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/
Python Software Foundation. Python Language Reference, version 3. http://www.python.org