In this R tutorial, we will show how to implement a popular classification algorithm, Naive Bayes classification, to classify emails as either junk or not junk. Naive Bayes is a probabilistic model that estimates the probability of an outcome given the available evidence.
Bayesian classifier methods use training data to calculate the observed probability of each class given a set of feature values. When the classifier is applied to unlabeled data, it uses these observed probabilities to predict the most likely class for the new features. Bayesian classifiers are best suited to problems in which the information from multiple attributes should be considered simultaneously in order to estimate the probability of an outcome.
Bayesian probability is rooted in the idea that estimated probabilities should be based on the evidence at hand. Conditional probability (the probability that an event will occur given that one or more other events have occurred) is at the heart of Bayes' Theorem. Bayes' Theorem describes the relationship between two events, which can be expressed as
\[ \begin{aligned} P(A \vert B) = \frac{P(B \vert A)P(A)}{P(B)} = \frac{P(A \cap B)}{P(B)} &&&& (1) \end{aligned} \]
where \(P(A \vert B)\) is the probability of event \(A\) given that event \(B\) has occurred, \(P(B \vert A)\) is the probability of \(B\) given \(A\), \(P(A)\) and \(P(B)\) are the unconditional (prior) probabilities of \(A\) and \(B\), and \(P(A \cap B)\) is the probability of \(A\) and \(B\) occurring together.
To illustrate how one might use Bayes' Theorem to calculate the probability of an event, suppose we wanted to estimate the probability that an email is junk mail given that the word “viagra” appears in it. Mathematically, we can express this conditional probability as
\[ \begin{aligned} P(\text{junk} \vert \text{viagra}) = \frac{P(\text{viagra} \vert \text{junk})P(\text{junk})}{P(\text{viagra})} \end{aligned} \]
Suppose that, from a frequency table of our training emails, we have \(P(\text{viagra} \vert \text{junk}) = 0.24\), \(P(\text{junk}) = 0.21\), and \(P(\text{viagra}) = 0.07\). Plugging these values into Bayes' Theorem gives
\[ \begin{aligned} P(\text{junk} \vert \text{viagra}) = \frac{0.24 \times 0.21}{0.07} = 0.72 \end{aligned} \]
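As a quick sanity check, we can reproduce this arithmetic in R (the probabilities below are simply the illustrative values used above):
p.viagra.given.junk <- 0.24   # P(viagra | junk)
p.junk <- 0.21                # P(junk)
p.viagra <- 0.07              # P(viagra)
# Bayes' Theorem: P(junk | viagra)
(p.viagra.given.junk * p.junk) / p.viagra
## [1] 0.72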
Of course, in the above example, we are only estimating the probability that an email is junk based on a single feature (the word “viagra”). The Naive Bayes classifier can be extended to include multiple features. If we let \(C_{L}\) denote the class level and \(W_{i}\) denote the words that appear in an email, we can express the conditional probability given two or more features as
\[ \begin{aligned} P(C_{L} \vert W_{1} \cap W_{2} \cap \ldots \cap W_{n}) = \frac{P(W_{1} \cap W_{2} \cap \ldots \cap W_{n} \vert C_{L})\,P(C_{L})}{P(W_{1} \cap W_{2} \cap \ldots \cap W_{n})} \end{aligned} \]
Essentially, the probability of level \(L\) for class \(C\), given the evidence provided by features \(F_{1}, \ldots, F_{n}\), is equal to the product of the probabilities of each piece of evidence conditioned on the class level, multiplied by the prior probability of the class level and a scaling factor \(1/Z\), which converts the result to a proper probability:
\[ \begin{aligned} P(C_{L} \vert F_{1}, \ldots, F_{n}) = \frac{1}{Z} p(C_{L}) \prod_{i=1}^{n}p(F_{i} \vert C_{L}) &&&& (2) \end{aligned} \]
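To make the naive independence assumption and the role of \(1/Z\) concrete, here is a minimal sketch in R using hypothetical, made-up conditional probabilities for two words; \(1/Z\) simply renormalizes the products so that the class posteriors sum to one:
# Hypothetical prior and per-word conditional probabilities for each class level
prior <- c(junk = 0.21, not.junk = 0.79)
p.w1 <- c(junk = 0.24, not.junk = 0.02)   # P(W1 | C_L), made-up values
p.w2 <- c(junk = 0.30, not.junk = 0.10)   # P(W2 | C_L), made-up values
# Numerator of equation (2): prior times the product of the conditionals
unnormalized <- prior * p.w1 * p.w2
# The 1/Z factor is just a normalization over the class levels
unnormalized / sum(unnormalized)
## junk is roughly 0.905, not.junk roughly 0.095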
Now that we have briefly gone over Bayes' Theorem and how to use it for classification, let’s apply it to a data set of emails, which we will classify as either junk or not junk given the words that appear in each email.
library(knitr)
library(kableExtra)
emails.raw <- read.csv('/Users/czar.yobero/Documents/emails.csv', stringsAsFactors = F)
emails.raw$type <- factor(emails.raw$type)
kable(head(emails.raw, n = 15), format = 'html') %>%
  kable_styling(bootstrap_options = c('striped', 'hover'))
| type | text |
|---|---|
| not junk | Go until jurong point, crazy.. Available only in bugis n great world la e buffet… Cine there got amore wat… |
| not junk | Ok lar… Joking wif u oni… |
| junk | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C’s apply 08452810075over18’s |
| not junk | U dun say so early hor… U c already then say… |
| not junk | Nah I don’t think he goes to usf, he lives around here though |
| junk | FreeMsg Hey there darling it’s been 3 week’s now and no word back! I’d like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv |
| not junk | Even my brother is not like to speak with me. They treat me like aids patent. |
| not junk | As per your request ‘Melle Melle (Oru Minnaminunginte Nurungu Vettam)’ has been set as your callertune for all Callers. Press *9 to copy your friends Callertune |
| junk | WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only. |
| junk | Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030 |
| not junk | I’m gonna be home soon and i don’t want to talk about this stuff anymore tonight, k? I’ve cried enough today. |
| junk | SIX chances to win CASH! From 100 to 20,000 pounds txt> CSH11 and send to 87575. Cost 150p/day, 6days, 16+ TsandCs apply Reply HL 4 info |
| junk | URGENT! You have won a 1 week FREE membership in our £100,000 Prize Jackpot! Txt the word: CLAIM to No: 81010 T&C www.dbuk.net LCCLTD POBOX 4403LDNW1A7RW18 |
| not junk | I’ve been searching for the right words to thank you for this breather. I promise i wont take your help for granted and will fulfil my promise. You have been wonderful and a blessing at all times. |
| not junk | I HAVE A DATE ON SUNDAY WITH WILL!! |
Because we are dealing with unstructured text data, we will need to tokenize it and make a few transformations to normalize it.
We first need to build a text corpus. Then, we normalize the corpus by converting all text to lower case, removing numbers, removing stop words (e.g., the, I, for, etc.), stripping extra white space, and removing punctuation.
library(tm)
emails.corpus <- Corpus(VectorSource(emails.raw$text))
# Cleaning the Corpus
emails.corpus.clean <- tm_map(emails.corpus, tolower)
emails.corpus.clean <- tm_map(emails.corpus.clean, removeNumbers)
emails.corpus.clean <- tm_map(emails.corpus.clean, removeWords, stopwords('english'))
emails.corpus.clean <- tm_map(emails.corpus.clean, stripWhitespace)
emails.corpus.clean <- tm_map(emails.corpus.clean, removePunctuation)
inspect(emails.corpus.clean[1:10])
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 10
##
## [1] go jurong point crazy available bugis n great world la e buffet cine got amore wat
## [2] ok lar joking wif u oni
## [3] free entry wkly comp win fa cup final tkts st may text fa receive entry questionstd txt ratetcs apply s
## [4] u dun say early hor u c already say
## [5] nah think goes usf lives around though
## [6] freemsg hey darling weeks now word back like fun still tb ok xxx std chgs send rcv
## [7] even brother like speak treat like aids patent
## [8] per request melle melle oru minnaminunginte nurungu vettam set callertune callers press copy friends callertune
## [9] winner valued network customer selected receivea prize reward claim call claim code kl valid hours
## [10] mobile months u r entitled update latest colour mobiles camera free call mobile update co free
Next, we need to create a document-term matrix (DTM), a matrix in which each row represents an email and each column represents a word; each cell holds the number of times that word appears in that email (we will later collapse these counts into a simple yes/no indicator). We also want to exclude words that appear fewer than three times in the training data, so that our results will hopefully be more robust. Then, we’ll divide our data into training and test sets.
emails.dtm <- DocumentTermMatrix(emails.corpus.clean)
# Split data sets
emails.raw.train <- emails.raw[1:3902, ]
emails.raw.test <- emails.raw[3903:5574, ]
emails.dtm.train <- emails.dtm[1:3902, ]
emails.dtm.test <- emails.dtm[3903:5574, ]
emails.corpus.train <- emails.corpus.clean[1:3902]
emails.corpus.test <- emails.corpus.clean[3903:5574]
# Create indicator features for frequent words and eliminate words that appear fewer than three times.
freq.terms <- findFreqTerms(emails.dtm.train, 3)
reduced.dtm.train <- DocumentTermMatrix(emails.corpus.train, list(dictionary = freq.terms))
reduced.dtm.test <- DocumentTermMatrix(emails.corpus.test, list(dictionary = freq.terms))
# The Naive Bayes classifier is typically trained on categorical features.
# To accommodate this, we define a function that converts word counts
# into a factor that simply indicates "yes" or "no" depending on
# whether the word appears at all.
convertCounts <- function(x) {
  x <- ifelse(x > 0, 1, 0)
  x <- factor(x, levels = c(0, 1), labels = c("no", "yes"))
  return(x)
}
reduced.dtm.train <- apply(reduced.dtm.train, MARGIN = 2, convertCounts)
reduced.dtm.test <- apply(reduced.dtm.test, MARGIN = 2, convertCounts)
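As a quick check, applying the helper to a small toy vector of counts shows the conversion we expect:
convertCounts(c(0, 2, 1, 0, 5))
## [1] no  yes yes no  yes
## Levels: no yes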
Now we are ready to apply the Naive Bayes classification algorithm to our data set.
library(e1071)
emails.classifier <- naiveBayes(reduced.dtm.train, emails.raw.train$type, laplace = 1)
emails.test.predict <- predict(emails.classifier, reduced.dtm.test)
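If you want the posterior class probabilities rather than hard class labels, predict() from e1071 also accepts type = 'raw'; for example:
# Posterior class probabilities (one row per test message, one column per class)
emails.test.prob <- predict(emails.classifier, reduced.dtm.test, type = 'raw')
head(emails.test.prob)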
Let’s see how accurate our model is.
library(gmodels)
CrossTable(emails.test.predict,
           emails.raw.test$type,
           prop.chisq = F,
           prop.t = F,
           prop.c = F,
           dnn = c('predicted', 'actual'))
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## |-------------------------|
##
##
## Total Observations in Table: 1672
##
##
## | actual
## predicted | junk | not junk | Row Total |
## -------------|-----------|-----------|-----------|
## junk | 74 | 128 | 202 |
## | 0.366 | 0.634 | 0.121 |
## -------------|-----------|-----------|-----------|
## not junk | 154 | 1316 | 1470 |
## | 0.105 | 0.895 | 0.879 |
## -------------|-----------|-----------|-----------|
## Column Total | 228 | 1444 | 1672 |
## -------------|-----------|-----------|-----------|
##
##
As you can see from the table, of the messages our model labeled as junk, only 36.6% were actually junk, while 89.5% of the messages it labeled as not junk were indeed not junk.
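For a single summary figure, we can also compute the overall accuracy directly from the confusion matrix:
# Overall accuracy: proportion of test messages classified correctly
conf.matrix <- table(predicted = emails.test.predict, actual = emails.raw.test$type)
sum(diag(conf.matrix)) / sum(conf.matrix)
## [1] 0.8313397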
Obviously there is plenty of room to improve on this model, but that is for another time. The point is to show how the Naive Bayes classifier can be applied to classify data given a set of (mostly) categorical features.