Naive Bayes for SMS spam

The idea

Naive Bayes (NB) is a classification algorithm which uses data about prior events to estimate the probability of future events. It’s typically best applied to problems in which the information from numerous attributes should be considered simultaneously in order to estimate the probability of an outcome. While many algorithms ignore features with weak effects, this technique uses all available info to subtly shift its predictions.

We’ll spend a bit of time on the theory.

Often we’re interested in monitoring several non-mutually exclusive events for the same trial. Thinking about ham and spam email:

spam is mutually exclusive from ham (say P(spam) = 20%, then P(ham) = 80%)
now say the word Viagra appears - the email could well be either spam or ham
say P(Viagra) = 5%; per above, it is a non-mutually exclusive event from ham/spam

We would want to calculate P(spam ∩ Viagra). But how?

It depends on the joint probability of the two events, i.e. how the probability of one event is related to the probability of the other. Dependent events are the basis for predictive modelling. Suppose Viagra and spam were independent events: it would be easy to calculate P(Viagra ∩ spam) as simply P(Viagra) x P(spam) = 5% x 20% = 1%. But in reality the two are likely highly dependent.

Quick reminder of prob theory from GCSE!

For independent events A and B, P(A ∩ B) = P(A) P(B)
For dependent events, Bayes says \(P(A|B) = P(B|A) P(A) / P(B) = P(A \cap B) / P(B)\)

Some new terminology: say we were tasked with determining P(spam). With no other info, the most reasonable guess is the probability that any prior message was spam (in our case, 20%). This estimate is known as the prior probability, P(spam).

Next suppose we’re told that the term Viagra is in the email. The probability that Viagra was used in previous spam messages is called the likelihood, P(Viagra | spam); the probability that Viagra was used in any message at all is called the marginal likelihood, P(Viagra).

Using Bayes’ theorem, we can calculate a posterior probability, P(spam | Viagra), and this will be the clincher (if it’s > 0.5, we classify as spam). Bayes’ formula can be re-written in words as:

\(\text{posterior probability} = \text{likelihood} \times \text{prior} / \text{marginal likelihood}\)

\(P(Spam | Viagra) = P(Viagra | Spam) P(Spam) / P(Viagra)\)

In order to calculate the components above, we need to construct a frequency table that records the number of times Viagra appeared in both spam and ham messages:

Frequency	Viagra: Yes	Viagra: No	Total
spam	4	16	20
ham	1	79	80
Total	5	95	100

and from that, construct a likelihood table, e.g.

Likelihood	Viagra: Yes	Viagra: No	Total
spam	4/20	16/20	20
ham	1/80	79/80	80
Total	5/100	95/100	100

This tells us that P(Viagra | spam) is 4/20 = 0.2 (20%). With the prior P(spam) = 20/100 = 0.2 and the marginal likelihood P(Viagra) = 5/100 = 0.05, we can now calculate our posterior probability as 0.2 x 0.2 / 0.05 = 0.8 (80%), so such messages should be ditched.
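The arithmetic can be reproduced in a few lines of R. This is only a sketch: the counts are the ones from the frequency table, and the variable names (freq, prior, likelihood, marginal) are made up for illustration.

```r
# Frequency table: rows = message type, columns = does "Viagra" appear?
freq = matrix(c(4, 16,
                1, 79),
              nrow = 2, byrow = TRUE,
              dimnames = list(type   = c("spam", "ham"),
                              viagra = c("Yes", "No")))

prior      = sum(freq["spam", ]) / sum(freq)            # P(spam)          = 20/100
likelihood = freq["spam", "Yes"] / sum(freq["spam", ])  # P(Viagra | spam) = 4/20
marginal   = sum(freq[, "Yes"]) / sum(freq)             # P(Viagra)        = 5/100

# Bayes: posterior = likelihood * prior / marginal likelihood
posterior = likelihood * prior / marginal               # P(spam | Viagra), about 0.8
print(posterior)

# If the two events were independent the joint probability would be
# P(Viagra) x P(spam) = 1%, but the table puts it at 4/100 = 4%
independent_joint = marginal * prior
actual_joint      = freq["spam", "Yes"] / sum(freq)
```

The gap between the 1% you’d get under independence and the 4% actually observed is exactly the dependence that makes the word useful for prediction.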

The pros and cons

Naive Bayes is a de facto standard for text classification. Its primary weakness is that it’s not well suited to lots of numeric data (numeric features have to be bucketed into factors). Another weakness is that its estimated probabilities are less reliable than its classifications (though they are easy to obtain for a prediction). However, it is simple, fast and effective, does well with noisy or missing data, and requires relatively few examples for training.

It’s called ‘naive’ because of the assumptions it makes: it assumes that all of the features in the dataset are equally important and independent - this is rarely true in real-world applications. E.g. if you really were trying to detect spam, the sender of the email is likely more important than the text contained. That said, it still performs well when these assumptions are violated.
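To make the independence assumption concrete, here’s a rough sketch of how NB would score a message containing two words. The priors and the Viagra likelihoods are the ones from the worked example; the second word ("money") and its likelihoods are entirely made up for illustration.

```r
# Priors and Viagra likelihoods come from the worked example;
# the "money" likelihoods are hypothetical.
prior    = c(spam = 0.20,  ham = 0.80)
p_viagra = c(spam = 4/20,  ham = 1/80)   # P(Viagra | class)
p_money  = c(spam = 10/20, ham = 14/80)  # P(money | class), hypothetical

# The naive bit: within each class the words are treated as independent,
# so the per-word likelihoods simply multiply together
score = prior * p_viagra * p_money

# Normalise so the two classes sum to 1
posterior = score / sum(score)
print(round(posterior, 3))
```

With these numbers the message comes out at roughly 92% spam. No joint table over both words was ever needed, which is what keeps the method cheap when there are thousands of words.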

The Data

So we’ll be using some SMS data to build a Naive Bayes classifier from this link, with some info about it available here.

sms = read.csv("D:/dev/R/mlwr/chap4-naivebayes/sms_spam.csv", stringsAsFactors=FALSE)

The data is simply 5574 records consisting of two fields, type (ham or spam, mostly ham) and the text.

str(sms)

## 'data.frame':    5574 obs. of  2 variables:
##  $ type: chr  "ham" "ham" "spam" "ham" ...
##  $ text: chr  "Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..." "Ok lar... Joking wif u oni..." "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C"| __truncated__ "U dun say so early hor... U c already then say..." ...

round(prop.table(table(sms$type))*100, digits = 1)

## 
##  ham spam 
## 86.6 13.4

We should fix the ‘type’ feature as it’s really a factor:

sms$type = factor(sms$type)

We’ll leave the text alone as we’re about to learn some fun text processing tools now.

Data preparation and text processing

R has a cool library for text mining called tm - this does all sorts of clever things to handle punctuation and numbers, and to eliminate frequent words (but, and, or etc).

#install.packages("tm")
library(tm)

## Loading required package: NLP

We need to build a corpus, which is a collection of documents, from the texts.

sms_corpus = Corpus(VectorSource(sms$text))
print(sms_corpus)

## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 5574

inspect(sms_corpus[1:3])

## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 3
## 
## [[1]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 111
## 
## [[2]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 29
## 
## [[3]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 155

It’s worth noting that a corpus can also be built from PDFs, Word docs etc.; run this for more:

#vignette("tm")

We need to clean up the data to remove the aforementioned fluff; we’ll use tm_map():

corpus_clean = tm_map(sms_corpus, tolower)                    # convert to lower case
corpus_clean = tm_map(corpus_clean, removeNumbers)            # remove digits
corpus_clean = tm_map(corpus_clean, removeWords, stopwords()) # and but or you etc
corpus_clean = tm_map(corpus_clean, removePunctuation)        # strip punctuation
corpus_clean = tm_map(corpus_clean, stripWhitespace)          # collapse runs of whitespace to one space
inspect(corpus_clean[1:3])

## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 3
## 
## [[1]]
## [1] go jurong point crazy available bugis n great world la e buffet cine got amore wat
## 
## [[2]]
## [1] ok lar joking wif u oni
## 
## [[3]]
## [1] free entry wkly comp win fa cup final tkts st may text fa receive entry questionstd txt ratetcs apply s
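To get a feel for what those cleaning steps do, here’s a base-R sketch run on a single message. It’s not how tm does it internally: the regular expressions are approximations, and my_stopwords is a tiny hypothetical stand-in for the much longer list returned by stopwords().

```r
msg = "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005."

# tiny stand-in for tm's stopwords() list (hypothetical)
my_stopwords = c("in", "a", "to")
stop_pattern = paste0("\\b(", paste(my_stopwords, collapse = "|"), ")\\b")

clean = tolower(msg)                           # convert to lower case
clean = gsub("[[:digit:]]+", "", clean)        # remove digits
clean = gsub(stop_pattern, "", clean, perl = TRUE)  # drop whole-word stop words
clean = gsub("[[:punct:]]+", "", clean)        # strip punctuation
clean = gsub("[[:space:]]+", " ", clean)       # collapse runs of whitespace
clean = trimws(clean)                          # drop leading/trailing space
print(clean)
```

Note the word-boundary markers in stop_pattern: without them, removing "in" would also chew a hole in "win".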

We will now tokenize each message into words to build the key structure for the analysis, a sparse matrix comprising:

the columns are the union of words in our corpus
the rows correspond to each text message
the cells are the number of times each word is seen

This will obviously be mostly 0-filled (hence sparse). We use DocumentTermMatrix():

corpus_clean = tm_map(corpus_clean, PlainTextDocument) # tolower left plain character vectors; wrap them as documents again for DocumentTermMatrix()
dtm = DocumentTermMatrix(corpus_clean)

Note that the dtm has quite a few columns!

str(dtm)

## List of 6
##  $ i       : int [1:43114] 1 1 1 1 1 1 1 1 1 1 ...
##  $ j       : int [1:43114] 236 472 911 913 1233 1506 2794 2832 3549 5139 ...
##  $ v       : num [1:43114] 1 1 1 1 1 1 1 1 1 1 ...
##  $ nrow    : int 5574
##  $ ncol    : int 7957
##  $ dimnames:List of 2
##   ..$ Docs : chr [1:5574] "character(0)" "character(0)" "character(0)" "character(0)" ...
##   ..$ Terms: chr [1:7957] "â<U+0093>harry""| __truncated__ "â<U+0080><U+0093>""| __truncated__ "â<U+0080><U+009C>""| __truncated__ "â<U+0080><U+009C>harry""| __truncated__ ...
##  - attr(*, "class")= chr [1:2] "DocumentTermMatrix" "simple_triplet_matrix"
##  - attr(*, "weighting")= chr [1:2] "term frequency" "tf"
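If the structure of a document-term matrix feels abstract, it can be built by hand for a toy corpus. The sketch below uses base R only and two made-up messages; it produces a dense matrix, whereas tm stores the same information sparsely (the i, j and v vectors in the str() output are the row, column and value of each non-zero cell).

```r
# two tiny, already-cleaned messages (made up for illustration)
docs = c("ok lar joking", "free entry win win")

# tokenize on single spaces
tokens = strsplit(docs, " ", fixed = TRUE)

# the columns are the union of all words seen
vocab = sort(unique(unlist(tokens)))

# count each vocabulary word per message; vapply gives one column per message...
counts = vapply(tokens,
                function(tok) as.integer(table(factor(tok, levels = vocab))),
                integer(length(vocab)))

# ...so transpose to get one row per message and one column per word
toy_dtm = t(counts)
dimnames(toy_dtm) = list(Docs = as.character(seq_along(docs)), Terms = vocab)
print(toy_dtm)
```

Even with two messages half of the cells are zero; with 5574 messages and 7957 terms almost all of them are.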

Split out training and test sets

Let’s now create our training and testing sets. The data is randomly ordered, so we can just split the data on an arbitrary line at 75%/25%:

# split the raw data:
sms.train = sms[1:4200, ] # about 75%
sms.test  = sms[4201:5574, ] # the rest

# then split the document-term matrix
dtm.train = dtm[1:4200, ]
dtm.test  = dtm[4201:5574, ]

# and finally the corpus
corpus.train = corpus_clean[1:4200]
corpus.test  = corpus_clean[4201:5574]

# let's just assert that our split is reasonable: raw data should have about 87% ham
# in both training and test sets:
round(prop.table(table(sms.train$type))*100)

## 
##  ham spam 
##   86   14

round(prop.table(table(sms.test$type))*100)

## 
##  ham spam 
##   87   13

# LGTM
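Splitting on a fixed row number only works because this data is already shuffled. If it weren’t, a random split would be the safer bet; here’s a minimal sketch on a made-up data frame (toy, the seed and the 75% fraction are all assumptions for illustration).

```r
set.seed(42)  # arbitrary seed, only so the split is repeatable

# made-up stand-in for the sms data frame: 87 ham followed by 13 spam
toy = data.frame(type = rep(c("ham", "spam"), times = c(87, 13)),
                 text = paste("message", 1:100),
                 stringsAsFactors = FALSE)

# draw 75% of the row numbers at random, without replacement
train_idx = sample(nrow(toy), size = floor(0.75 * nrow(toy)))

toy.train = toy[train_idx, ]
toy.test  = toy[-train_idx, ]

# check the class balance survived the split
print(round(prop.table(table(toy.train$type)) * 100))
```

The same train_idx would then have to be used to subset the document-term matrix and the corpus, so that all three stay aligned row for row.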

Slight aside: Visualising text data with wordclouds

Wordclouds show words in larger fonts if they’re more frequent. Let’s build one for ham and one for spam to see if we can tell whether or not our NB classifier is likely to be successful. Predictably, there’s an R package for that! The wordcloud package works with tm corpuses.

#install.packages("wordcloud")
library(wordcloud)

## Loading required package: RColorBrewer

wordcloud(corpus.train,
          min.freq=40,          # a word must appear at least 40 times (about 1% of the 4200 docs)
          random.order = FALSE) # biggest words are nearer the centre

We didn’t create separate corpuses for spam and ham, so the above is type-neutral. But wordcloud can also work with raw-text (which you’d expect):

spam = subset(sms.train, type == "spam")
ham  = subset(sms.train, type == "ham")

wordcloud(spam$text,
          max.words=40,     # look at the 40 most common words
          scale=c(3, 0.5)) # adjust max and min font sizes for words shown

wordcloud(ham$text,
          max.words=40,     # look at the 40 most common words
          scale=c(3, 0.5)) # adjust max and min font sizes for words shown

So looks like the biggest words don’t overlap much between ham and spam - this suggests NB has a fighting chance.

Creating indicator features for frequent words

As shown earlier, the DTM has more than 7000 columns - that’s way too many, so let’s shrink it down: eliminate words which appear fewer than 5 times in the training messages (5 messages would be about 0.1% of them). This should reduce the feature-set to a far more manageable number. We’ll use tm’s findFreqTerms() function.

freq_terms = findFreqTerms(dtm.train, 5)
reduced_dtm.train = DocumentTermMatrix(corpus.train, list(dictionary=freq_terms))
reduced_dtm.test =  DocumentTermMatrix(corpus.test, list(dictionary=freq_terms))

# have we reduced the number of features?
ncol(reduced_dtm.train)

## [1] 1231

ncol(reduced_dtm.test)

## [1] 1231

# yay
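For intuition, the same kind of filtering can be done by hand on a small dense matrix: total up each column and keep the terms that reach the threshold. This is a base-R sketch on made-up counts, not what tm does internally.

```r
# made-up counts: 3 messages x 3 terms
toy_counts = matrix(c(3, 0, 1,
                      2, 1, 0,
                      0, 0, 1),
                    nrow = 3, byrow = TRUE,
                    dimnames = list(Docs  = c("1", "2", "3"),
                                    Terms = c("free", "ok", "lar")))

# total number of times each term occurs across all messages
term_totals = colSums(toy_counts)

# keep only the terms seen at least 5 times
frequent = names(term_totals)[term_totals >= 5]

# restrict the matrix to those columns (drop = FALSE keeps it a matrix)
reduced = toy_counts[, frequent, drop = FALSE]
print(reduced)
```

Here only "free" survives (3 + 2 = 5 occurrences); "ok" and "lar" are too rare to keep.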

Almost there. Remember that NB works on factors, but our DTM only has numeric counts. Let’s define a function which converts counts to a Yes/No factor, and apply it to our reduced matrices.

convert_counts = function(x) {
  x = ifelse(x > 0, 1, 0)
  x = factor(x, levels = c(0, 1), labels=c("No", "Yes"))
  return (x)
}

# apply() allows us to work either with rows or columns of a matrix.
# MARGIN = 1 is for rows, and 2 for columns
reduced_dtm.train = apply(reduced_dtm.train, MARGIN=2, convert_counts)
reduced_dtm.test  = apply(reduced_dtm.test, MARGIN=2, convert_counts)
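Here’s what that conversion does on a tiny made-up matrix. The function is repeated so the snippet runs on its own; toy is hypothetical.

```r
convert_counts = function(x) {
  x = ifelse(x > 0, 1, 0)
  x = factor(x, levels = c(0, 1), labels = c("No", "Yes"))
  return (x)
}

# made-up counts: 2 messages x 2 terms
toy = matrix(c(0, 2,
               1, 0),
             nrow = 2,
             dimnames = list(Docs = c("1", "2"), Terms = c("free", "ok")))

# convert each column (MARGIN = 2) from counts to "No"/"Yes"
toy_flags = apply(toy, MARGIN = 2, convert_counts)
print(toy_flags)
```

One thing to be aware of: apply() hands back a character matrix of "No"/"Yes" strings rather than a matrix of factors, so a count of 2 and a count of 1 both just become "Yes".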

Training and evaluating our model

We’ll use an NB implementation from package e1071; there’s another in klaR too, fyi.

Training the model and using it for classification is a 2-stage jobby (unlike kNN). We call naiveBayes() which returns a model, then predict().

#install.packages("e1071")
library(e1071)
# store our model in sms_classifier
sms_classifier = naiveBayes(reduced_dtm.train, sms.train$type)
sms_test.predicted = predict(sms_classifier,
                             reduced_dtm.test)

# once again we'll use CrossTable() from gmodels
#install.packages("gmodels")
library(gmodels)
CrossTable(sms_test.predicted,
           sms.test$type,
           prop.chisq = FALSE, # as before
           prop.t     = FALSE, # eliminate cell proportions
           dnn        = c("predicted", "actual")) # relabels rows+cols

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  1374 
## 
##  
##              | actual 
##    predicted |       ham |      spam | Row Total | 
## -------------|-----------|-----------|-----------|
##          ham |      1190 |        28 |      1218 | 
##              |     0.977 |     0.023 |     0.886 | 
##              |     0.995 |     0.157 |           | 
## -------------|-----------|-----------|-----------|
##         spam |         6 |       150 |       156 | 
##              |     0.038 |     0.962 |     0.114 | 
##              |     0.005 |     0.843 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |      1196 |       178 |      1374 | 
##              |     0.870 |     0.130 |           | 
## -------------|-----------|-----------|-----------|
## 
##

Once again, we get an impressive out-of-the-box result: 97.5% correct! Sadly, however:

28 spam messages avoided detection (15.7%, not great)
6 ham messages got trashed (0.5%, not bad, but crucial)

The erroneously dropped ham should really be fixed, as an important message might well have been missed (meeting / emergency etc), and such a product would be shelved.
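Those percentages can be recomputed directly from the four counts in the table. A small sketch, with the counts copied from the CrossTable() output of the first model and variable names made up for illustration:

```r
# rows = predicted, columns = actual (matrix() fills column by column)
cm = matrix(c(1190,   6,    # actual ham:  predicted ham, predicted spam
                28, 150),   # actual spam: predicted ham, predicted spam
            nrow = 2,
            dimnames = list(predicted = c("ham", "spam"),
                            actual    = c("ham", "spam")))

accuracy    = sum(diag(cm)) / sum(cm)                # share of all messages classified correctly
missed_spam = cm["ham", "spam"] / sum(cm[, "spam"])  # share of spam that slipped through
lost_ham    = cm["spam", "ham"] / sum(cm[, "ham"])   # share of ham that got trashed

print(round(c(accuracy = accuracy,
              missed_spam = missed_spam,
              lost_ham = lost_ham) * 100, 1))
```

Looking at the two error rates separately matters here because the classes are so unbalanced: a model that labelled everything ham would still score about 87% accuracy.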

Improving model performance

There’s an optional Laplace estimator argument to naiveBayes() which we didn’t use. Recall that if there’s a single 0 in the multiplications (because a word never appeared with a class in training) we get a probability of 0 for that class, which may not be quite fair. Let’s try setting it to 1:

sms_classifier2 = naiveBayes(reduced_dtm.train,
                             sms.train$type,
                             laplace = 1)
sms_test.predicted2 = predict(sms_classifier2,
                              reduced_dtm.test)
CrossTable(sms_test.predicted2,
           sms.test$type,
           prop.chisq = FALSE, # as before
           prop.t     = FALSE, # eliminate cell proportions
           dnn        = c("predicted", "actual")) # relabels rows+cols

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  1374 
## 
##  
##              | actual 
##    predicted |       ham |      spam | Row Total | 
## -------------|-----------|-----------|-----------|
##          ham |      1192 |        30 |      1222 | 
##              |     0.975 |     0.025 |     0.889 | 
##              |     0.997 |     0.169 |           | 
## -------------|-----------|-----------|-----------|
##         spam |         4 |       148 |       152 | 
##              |     0.026 |     0.974 |     0.111 | 
##              |     0.003 |     0.831 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |      1196 |       178 |      1374 | 
##              |     0.870 |     0.130 |           | 
## -------------|-----------|-----------|-----------|
## 
##

Ok, so we got a small improvement out of that for dropped ham (down from 6 to 4), but at the expense of more spam getting through (up from 28 to 30); overall accuracy is unchanged at 97.5%.
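To see why a single zero matters so much, here’s a toy calculation. The counts are hypothetical, and the smoothing shown is plain add-one smoothing over the two levels (Yes/No) of each word, which is my assumption about the general idea rather than a description of e1071’s internals.

```r
n_spam = 20  # number of spam messages in a made-up training set

# how many of those spam messages contain each word (hypothetical);
# "groceries" was never seen in spam
yes_counts = c(viagra = 4, groceries = 0)

# raw likelihoods P(word | spam): the zero wipes out the whole product
lik_raw = yes_counts / n_spam
print(prod(lik_raw))

# add-one smoothing: pretend every level (Yes and No) was seen once more
laplace    = 1
n_levels   = 2
lik_smooth = (yes_counts + laplace) / (n_spam + laplace * n_levels)
print(prod(lik_smooth))
```

Without smoothing, a message containing both words can never be classed as spam, however strong the evidence from "viagra"; with it, the unseen word merely counts against spam rather than vetoing it.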