Naive Bayes (NB) is a classification algorithm which uses data about prior events to estimate the probability of future events. It is typically best applied to problems in which the information from numerous attributes should be considered simultaneously in order to estimate the probability of an outcome. While many algorithms ignore features with weak effects, this technique uses all available info to subtly change predictions.
We’ll spend a bit of time on the theory.
Often we’re interested in monitoring several non-mutually exclusive events for the same trial. Thinking about ham and spam email:
We would want to calculate P(spam ∩ Viagra). But how?
It depends on the joint probability of the two events, i.e. how the probability of one event is related to the probability of the other. Dependent events are the basis for predictive modelling. Suppose Viagra and spam were independent events: it would then be easy to calculate P(Viagra ∩ spam) as simply P(Viagra) x P(spam) = 5% x 20% = 1%. But in reality the two events are likely highly dependent.
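As a quick sanity check of that arithmetic, here is a minimal sketch in R. The variable names are our own, and the "dependent" figure borrows the 4-in-20 count from the frequency table that appears later in this section.

```r
p_viagra <- 0.05  # P(Viagra): 5% of all messages contain the word
p_spam   <- 0.20  # P(spam): 20% of all messages are spam

# If the two events were independent, the joint is just the product
p_joint_independent <- p_viagra * p_spam

# In general the joint needs a conditional: P(A and B) = P(A | B) * P(B).
# 4 of the 20 spam messages contain "Viagra" (from the frequency table).
p_viagra_given_spam <- 4 / 20
p_joint_dependent   <- p_viagra_given_spam * p_spam

c(independent = p_joint_independent, dependent = p_joint_dependent)
```

With these numbers the dependent estimate (4%) is four times the independent one (1%), which is exactly why the word is useful as a predictor.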
Quick reminder of prob theory from GCSE!
Some new terminology: say we were tasked with determining P(spam). With no other info, the most reasonable guess is the proportion of prior messages that were spam (in our case, 20%). This estimate is known as the prior probability, P(spam).
Next suppose we’re told that the term Viagra is in the message. The probability that Viagra was used in previous spam messages is called the likelihood, P(Viagra | spam); the probability that Viagra was used in any message at all is called the marginal likelihood, P(Viagra).
Using Bayes’ theorem, we can calculate a posterior probability, P(spam | Viagra), and this will be the clincher (if it’s > 0.5, we classify as spam). Bayes’ formula can be written in words as:
\(\text{posterior} = \dfrac{\text{likelihood} \times \text{prior}}{\text{marginal likelihood}}\)
\(P(\text{spam} \mid \text{Viagra}) = \dfrac{P(\text{Viagra} \mid \text{spam}) \, P(\text{spam})}{P(\text{Viagra})}\)
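One step worth spelling out: the marginal likelihood in the denominator can be expanded over the two classes using the law of total probability (a standard identity, written here in the same notation). This is also why the posteriors for spam and ham add up to 1:

\(P(\text{Viagra}) = P(\text{Viagra} \mid \text{spam}) \, P(\text{spam}) + P(\text{Viagra} \mid \text{ham}) \, P(\text{ham})\)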
In order to calculate the components above, we need to construct a frequency table that records the number of times Viagra appeared in both spam and ham messages:
| Frequency | Viagra: Yes | Viagra: No | Total |
|---|---|---|---|
| spam | 4 | 16 | 20 |
| ham | 1 | 79 | 80 |
| Total | 5 | 95 | 100 |
and from that, construct a likelihood table, e.g.
| Likelihood | Viagra: Yes | Viagra: No | Total |
|---|---|---|---|
| spam | 4/20 | 16/20 | 20 |
| ham | 1/80 | 79/80 | 80 |
| Total | 5/100 | 95/100 | 100 |
This tells us that P(Viagra | spam) is 4/20 = 0.2 (20%). So we can now calculate our posterior probability as P(Viagra | spam) x P(spam) / P(Viagra) = 0.2 x 0.2 / 0.05 = 0.8 (80%). As this is above 0.5, such messages should be ditched.
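The whole calculation can be reproduced in a few lines of base R. This is only a sketch built from the counts in the frequency table; the object names (`freq`, `likelihood`, `prior`, `marginal`, `posterior`) are our own choices.

```r
# Frequency table: rows are message types, columns say whether "Viagra" appears
freq <- matrix(c(4, 16,
                 1, 79),
               nrow = 2, byrow = TRUE,
               dimnames = list(type = c("spam", "ham"),
                               viagra = c("Yes", "No")))

# Likelihood table: row-wise proportions, i.e. P(Viagra = Yes/No | type)
likelihood <- prop.table(freq, margin = 1)

prior    <- rowSums(freq) / sum(freq)  # P(spam), P(ham)
marginal <- colSums(freq) / sum(freq)  # P(Viagra = Yes), P(Viagra = No)

# Bayes' theorem for both classes, given that the word is present
posterior <- likelihood[, "Yes"] * prior / marginal[["Yes"]]
posterior
```

The spam entry comes out at 0.8 and the ham entry at 0.2, matching the hand calculation.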
Naive Bayes is now the de facto standard for text classification. Its primary weakness is that it’s not well suited to datasets with lots of numeric features (these have to be bucketed into factors). Another weakness is that its estimated probabilities aren’t as reliable as its classification results (though they are easy to obtain for a prediction). However, it is simple, fast and effective, does well with noisy or missing data, and requires relatively few examples for training.
It’s called ‘naive’ because of the assumptions it makes: it assumes that all of the features in the dataset are equally important and independent of one another, which is rarely true in real-world applications. E.g. if you really were trying to detect spam, the sender of the email is likely more important than the text contained. That said, it still performs well when these assumptions are violated.
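To see what the independence assumption buys us, here is a small sketch with two words. The Viagra likelihoods come from the table above; the figures for a second word, "money", are invented purely for illustration. Under the naive assumption the per-word likelihoods are simply multiplied together within each class, and the results are then normalised so they sum to 1.

```r
prior <- c(spam = 0.20, ham = 0.80)

# P(word present | class)
lik_viagra <- c(spam = 4 / 20,  ham = 1 / 80)   # from the likelihood table
lik_money  <- c(spam = 10 / 20, ham = 14 / 80)  # hypothetical numbers

# Naive assumption: features are independent given the class,
# so the joint likelihood is the product of the individual ones
unnormalised <- prior * lik_viagra * lik_money

# Dividing by the total plays the role of the marginal likelihood
posterior <- unnormalised / sum(unnormalised)
round(posterior, 3)
```

With real data there would be one such likelihood per word in the vocabulary, which is why the method scales so easily to thousands of features.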
So we’ll be using some SMS data to build a bayes classifier from this link, with some info about it available here.
sms = read.csv("D://dev//R//mlwr/chap4-naivebayes/sms_spam.csv", stringsAsFactors=FALSE)
The data is simply 5574 records consisting of two fields, type (ham or spam, mostly ham) and the text.
str(sms)
## 'data.frame': 5574 obs. of 2 variables:
## $ type: chr "ham" "ham" "spam" "ham" ...
## $ text: chr "Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..." "Ok lar... Joking wif u oni..." "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C"| __truncated__ "U dun say so early hor... U c already then say..." ...
round(prop.table(table(sms$type))*100, digits = 1)
##
## ham spam
## 86.6 13.4
We should fix the ‘type’ feature as it’s really a factor:
sms$type = factor(sms$type)
We’ll leave the text alone as we’re about to learn some fun text processing tools now.
R has a cool library for text mining called tm - this does all sorts of clever things such as handling punctuation and numbers and eliminating frequent words (but, and, or etc).
#install.packages("tm")
library(tm)
## Loading required package: NLP
We need to build a corpus, which is a collection of documents, from the texts.
sms_corpus = Corpus(VectorSource(sms$text))
print(sms_corpus)
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 5574
inspect(sms_corpus[1:3])
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 3
##
## [[1]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 111
##
## [[2]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 29
##
## [[3]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 155
It’s worth noting that tm can also build a corpus from PDFs and Word docs etc.; run this for more:
#print(vignette("tm"))
We need to clean the data of the aforementioned fluff; we’ll use tm_map():
corpus_clean = tm_map(sms_corpus, tolower) # convert to lower case
corpus_clean = tm_map(corpus_clean, removeNumbers) # remove digits
corpus_clean = tm_map(corpus_clean, removeWords, stopwords()) # and but or you etc
corpus_clean = tm_map(corpus_clean, removePunctuation) # you'll never guess!
corpus_clean = tm_map(corpus_clean, stripWhitespace) # reduces w/s to 1
inspect(corpus_clean[1:3])
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 3
##
## [[1]]
## [1] go jurong point crazy available bugis n great world la e buffet cine got amore wat
##
## [[2]]
## [1] ok lar joking wif u oni
##
## [[3]]
## [1] free entry wkly comp win fa cup final tkts st may text fa receive entry questionstd txt ratetcs apply s
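To make the individual cleaning steps concrete, here is a rough base-R sketch of the same sequence applied to one made-up string. It doesn't use tm at all: the three-word stop list and the regular expressions are our own stand-ins (assumptions), not what tm does internally.

```r
msg <- "Free entry in 2 a wkly comp!!  Text FA to 87121"

stop_words <- c("in", "a", "to")  # tiny hypothetical stop list

clean <- tolower(msg)                                  # lower case
clean <- gsub("[[:digit:]]+", "", clean)               # remove digits
clean <- gsub(paste0("\\b(", paste(stop_words, collapse = "|"), ")\\b"),
              "", clean, perl = TRUE)                  # remove stop words
clean <- gsub("[[:punct:]]+", "", clean)               # remove punctuation
clean <- gsub("\\s+", " ", trimws(clean), perl = TRUE) # collapse whitespace

clean
# "free entry wkly comp text fa"
```

Note that the order matters: stop words are removed before punctuation, just as in the tm_map() calls above.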
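We will now tokenize each message into words to build the key structure for the analysis: a sparse matrix in which each row is a message, each column is a word, and each cell counts how often that word occurs in that message.
This will obviously be mostly 0-filled (whence sparse). We use DocumentTermMatrix():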
corpus_clean = tm_map(corpus_clean, PlainTextDocument) # this is a tm API necessity
dtm = DocumentTermMatrix(corpus_clean)
Note that the dtm has quite a few columns!
str(dtm)
## List of 6
## $ i : int [1:43114] 1 1 1 1 1 1 1 1 1 1 ...
## $ j : int [1:43114] 236 472 911 913 1233 1506 2794 2832 3549 5139 ...
## $ v : num [1:43114] 1 1 1 1 1 1 1 1 1 1 ...
## $ nrow : int 5574
## $ ncol : int 7957
## $ dimnames:List of 2
## ..$ Docs : chr [1:5574] "character(0)" "character(0)" "character(0)" "character(0)" ...
## ..$ Terms: chr [1:7957] "â<U+0093>harry""| __truncated__ "â<U+0080><U+0093>""| __truncated__ "â<U+0080><U+009C>""| __truncated__ "â<U+0080><U+009C>harry""| __truncated__ ...
## - attr(*, "class")= chr [1:2] "DocumentTermMatrix" "simple_triplet_matrix"
## - attr(*, "weighting")= chr [1:2] "term frequency" "tf"
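If the structure of a document-term matrix feels abstract, this toy version in base R may help. It's a sketch on three invented messages and stores the result as an ordinary dense matrix, whereas tm keeps only the non-zero cells (the `i`, `j` and `v` vectors in the str() output above).

```r
# Three tiny, already-cleaned "messages" (invented for illustration)
docs <- c("free entry win", "ok lar joking", "win free win")

tokens <- strsplit(docs, " ", fixed = TRUE)  # one character vector per message
terms  <- sort(unique(unlist(tokens)))       # the vocabulary becomes the columns

# Count each vocabulary term in each message; vapply yields terms x docs
counts <- vapply(tokens,
                 function(tk) as.integer(table(factor(tk, levels = terms))),
                 integer(length(terms)))

dtm_toy <- t(counts)  # transpose so that rows are documents
dimnames(dtm_toy) <- list(Docs = paste0("doc", seq_along(docs)), Terms = terms)

dtm_toy
mean(dtm_toy == 0)  # share of cells that are zero
```

Even with three messages and six words, more than half of the cells are zero; with thousands of messages and words the proportion gets far higher.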
Let’s now create our training and testing sets. The data is randomly ordered, so we can just split the data on an arbitrary line at 75%/25%:
# split the raw data:
sms.train = sms[1:4200, ] # about 75%
sms.test = sms[4201:5574, ] # the rest
# then split the document-term matrix
dtm.train = dtm[1:4200, ]
dtm.test = dtm[4201:5574, ]
# and finally the corpus
corpus.train = corpus_clean[1:4200]
corpus.test = corpus_clean[4201:5574]
# let's just assert that our split is reasonable: raw data should have about 87% ham
# in both training and test sets:
round(prop.table(table(sms.train$type))*100)
##
## ham spam
## 86 14
round(prop.table(table(sms.test$type))*100)
##
## ham spam
## 87 13
# LGTM
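The positional split only works because the file is already in random order. If that couldn't be relied on, one alternative would be to draw a random set of row indices instead. The sketch below uses a made-up data frame (`toy`) in place of the real data, and the seed value is arbitrary; the same index vector could then be applied to the raw data, the DTM and the corpus alike.

```r
set.seed(42)  # arbitrary seed, only so the split is reproducible

# Stand-in for the real data: 100 rows, sorted by type on purpose
toy <- data.frame(id   = 1:100,
                  type = factor(rep(c("ham", "spam"), times = c(87, 13))))

# Draw 75% of the row numbers at random, without replacement
train_idx <- sample(nrow(toy), size = round(0.75 * nrow(toy)))

toy.train <- toy[train_idx, ]
toy.test  <- toy[-train_idx, ]

c(train = nrow(toy.train), test = nrow(toy.test))
```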
Wordclouds show words in larger fonts if they’re more frequent. Let’s build one for ham and one for spam to see if we can tell whether or not our NB classifier is likely to be successful. Predictably, there’s an R package for that! The wordcloud package works with tm corpuses.
#install.packages("wordcloud")
library(wordcloud)
## Loading required package: RColorBrewer
wordcloud(corpus.train,
          min.freq=40,           # roughly 1% of the docs in the training corpus
          random.order = FALSE)  # biggest words are nearer the centre
We didn’t create separate corpuses for spam and ham, so the above is type-neutral. But wordcloud can also work with raw-text (which you’d expect):
spam = subset(sms.train, type == "spam")
ham = subset(sms.train, type == "ham")
wordcloud(spam$text,
          max.words=40,     # look at the 40 most common words
          scale=c(3, 0.5))  # adjust max and min font sizes for words shown
wordcloud(ham$text,
          max.words=40,     # look at the 40 most common words
          scale=c(3, 0.5))  # adjust max and min font sizes for words shown
So looks like the biggest words don’t overlap much between ham and spam - this suggests NB has a fighting chance.
As shown earlier, the DTM has more than 7000 columns - that’s way too many, so let’s shrink it down: eliminate words which occur fewer than 5 times in the training messages (about 0.1% of the number of messages). This should reduce the feature-set to a far more manageable number. We’ll use tm’s findFreqTerms() function.
freq_terms = findFreqTerms(dtm.train, 5)
reduced_dtm.train = DocumentTermMatrix(corpus.train, list(dictionary=freq_terms))
reduced_dtm.test = DocumentTermMatrix(corpus.test, list(dictionary=freq_terms))
# have we reduced the number of features?
ncol(reduced_dtm.train)
## [1] 1231
ncol(reduced_dtm.test)
## [1] 1231
# yay
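For intuition, the filtering step can be mimicked on a small dense matrix in base R. This is a sketch with invented counts, and it assumes that the threshold is applied to each term's total count across all documents.

```r
# Toy term counts: 3 documents x 3 terms (invented numbers)
counts <- matrix(c(2, 0, 3,
                   3, 0, 1,
                   1, 1, 1),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(Docs  = paste0("doc", 1:3),
                                 Terms = c("free", "lar", "win")))

min_count <- 5                          # same threshold as above
term_totals <- colSums(counts)          # total occurrences of each term
keep <- names(term_totals)[term_totals >= min_count]

reduced <- counts[, keep, drop = FALSE] # keep only the frequent terms
keep
ncol(reduced)
```

Here "lar" occurs only once in total, so it is dropped, while "free" (6) and "win" (5) survive.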
Almost there. Remember that NB works on factors, but our DTM only has numerics. Let’s define a function which converts counts to yes/no factor, and apply it to our reduced matrices.
convert_counts = function(x) {
  x = ifelse(x > 0, 1, 0)
  x = factor(x, levels = c(0, 1), labels=c("No", "Yes"))
  return (x)
}
# apply() allows us to work either with rows or columns of a matrix.
# MARGIN = 1 is for rows, and 2 for columns
reduced_dtm.train = apply(reduced_dtm.train, MARGIN=2, convert_counts)
reduced_dtm.test = apply(reduced_dtm.test, MARGIN=2, convert_counts)
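Here is what that conversion does on a tiny made-up matrix (a sketch; `toy_counts` and `toy_flags` are our own names). One thing worth knowing is that apply() hands back a character matrix of "Yes"/"No" strings rather than a matrix of factors.

```r
convert_counts <- function(x) {
  x <- ifelse(x > 0, 1, 0)                              # any count > 0 becomes 1
  factor(x, levels = c(0, 1), labels = c("No", "Yes"))  # then label it
}

# 2 documents x 2 terms, filled column by column (invented counts)
toy_counts <- matrix(c(2, 0, 0, 1), nrow = 2,
                     dimnames = list(c("doc1", "doc2"), c("free", "win")))

# MARGIN = 2 runs the function once per column (i.e. per term)
toy_flags <- apply(toy_counts, MARGIN = 2, convert_counts)

toy_flags
class(toy_flags[, "free"])
```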
We’ll use an NB implementation from package e1071; there’s another in klaR too, fyi.
Training the model and using it for classification is a 2-stage jobby (unlike kNN). We call naiveBayes() which returns a model, then predict().
#install.packages("e1071")
library(e1071)
# store our model in sms_classifier
sms_classifier = naiveBayes(reduced_dtm.train, sms.train$type)
sms_test.predicted = predict(sms_classifier,
                             reduced_dtm.test)
# once again we'll use CrossTable() from gmodels
#install.packages("gmodels")
library(gmodels)
CrossTable(sms_test.predicted,
           sms.test$type,
           prop.chisq = FALSE,              # as before
           prop.t = FALSE,                  # eliminate cell proportions
           dnn = c("predicted", "actual"))  # relabels rows+cols
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 1374
##
##
## | actual
## predicted | ham | spam | Row Total |
## -------------|-----------|-----------|-----------|
## ham | 1190 | 28 | 1218 |
## | 0.977 | 0.023 | 0.886 |
## | 0.995 | 0.157 | |
## -------------|-----------|-----------|-----------|
## spam | 6 | 150 | 156 |
## | 0.038 | 0.962 | 0.114 |
## | 0.005 | 0.843 | |
## -------------|-----------|-----------|-----------|
## Column Total | 1196 | 178 | 1374 |
## | 0.870 | 0.130 | |
## -------------|-----------|-----------|-----------|
##
##
Once again, we get an impressive out-of-the-box result: 97.5% correct! Sadly, however, 6 of the 1196 ham messages were wrongly flagged as spam, and 28 of the 178 spam messages slipped through as ham.
The erroneously dropped ham should really be fixed, as an important message might well have been missed (meeting / emergency etc), and such a product would be shelved.
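The headline figures can be recomputed straight from the four cell counts in the CrossTable() output. This is a small base-R sketch; the metric names are our own labels for the usual ratios.

```r
# Confusion matrix copied from the CrossTable() output above
cm <- matrix(c(1190, 28,
                  6, 150),
             nrow = 2, byrow = TRUE,
             dimnames = list(predicted = c("ham", "spam"),
                             actual    = c("ham", "spam")))

accuracy       <- sum(diag(cm)) / sum(cm)                   # correct / all
ham_lost_rate  <- cm["spam", "ham"]  / sum(cm[, "ham"])     # ham flagged as spam
spam_recall    <- cm["spam", "spam"] / sum(cm[, "spam"])    # spam actually caught
spam_precision <- cm["spam", "spam"] / sum(cm["spam", ])    # flagged and truly spam

round(c(accuracy = accuracy, ham_lost_rate = ham_lost_rate,
        spam_recall = spam_recall, spam_precision = spam_precision), 3)
```

The rounded values line up with the row and column proportions that CrossTable() prints (0.005, 0.843 and 0.962).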
There’s an optional Laplace estimator arg to naiveBayes() which we didn’t use. If a word never appears alongside one of the classes in the training data, its estimated likelihood for that class is 0, and a single 0 in the chain of multiplications (owing to absence of a word) forces the whole probability to 0, which may not be quite fair. Let’s try setting it to 1:
sms_classifier2 = naiveBayes(reduced_dtm.train,
                             sms.train$type,
                             laplace = 1)
sms_test.predicted2 = predict(sms_classifier2,
                              reduced_dtm.test)
CrossTable(sms_test.predicted2,
           sms.test$type,
           prop.chisq = FALSE,              # as before
           prop.t = FALSE,                  # eliminate cell proportions
           dnn = c("predicted", "actual"))  # relabels rows+cols
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 1374
##
##
## | actual
## predicted | ham | spam | Row Total |
## -------------|-----------|-----------|-----------|
## ham | 1192 | 30 | 1222 |
## | 0.975 | 0.025 | 0.889 |
## | 0.997 | 0.169 | |
## -------------|-----------|-----------|-----------|
## spam | 4 | 148 | 152 |
## | 0.026 | 0.974 | 0.111 |
## | 0.003 | 0.831 | |
## -------------|-----------|-----------|-----------|
## Column Total | 1196 | 178 | 1374 |
## | 0.870 | 0.130 | |
## -------------|-----------|-----------|-----------|
##
##
Ok, so we got a small improvement out of that for dropped ham (6 down to 4), but at the expense of more spam getting through (28 up to 30); overall accuracy is unchanged at 97.5%.
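To see why the Laplace estimator nudges the predictions at all, here is a sketch of additive smoothing in its usual textbook form. The helper `laplace_estimate()` is hypothetical, the counts are invented, and we are assuming (not quoting) that naiveBayes() follows this general pattern for Yes/No features.

```r
# Smoothed estimate of P(feature value | class):
# add `laplace` to the count, and `laplace` per possible value to the total
laplace_estimate <- function(count, class_total, laplace = 0, n_levels = 2) {
  (count + laplace) / (class_total + laplace * n_levels)
}

# A word never seen in any of 20 spam messages (hypothetical counts)
unsmoothed <- laplace_estimate(0, 20)               # exactly 0: vetoes spam
smoothed   <- laplace_estimate(0, 20, laplace = 1)  # 1/22: small but non-zero

# A word seen in 4 of 20 spam messages barely moves
c(unsmoothed = unsmoothed, smoothed = smoothed,
  seen_word = laplace_estimate(4, 20, laplace = 1))
```

The smoothing mostly affects rare words, which fits with the handful of messages that changed sides between the two tables.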