Document Classification: Emails





Overview




The Ham or Spam classification problem is a common and ongoing pursuit in the academic and professional world.



There are several approaches that one can explore, and there are many references available to review.



For example, Mala Deep wrote an article on Towards Data Science in which he compares several algorithms [1]. He researched the various methods and reported the following accuracies:



Algorithm                        Accuracy (proportion)
Bagging Classification           0.9736
Random Forest Classification     0.9796
Naive Bayes Classification       0.9826
Extra Tree Classification        0.9820
SVM Classification               0.8343
KNN Classification               0.8606
Decision Tree Classification     0.9730



Deepal Dsilva uses the naiveBayes() function from the e1071 R package to train her classifier. [3]



Tejan Kermali also wrote a Towards Data Science article on how to implement the TF-IDF and Bag of Words (BOW) algorithms in Python [2].



Mandy Gu also wrote an excellent Towards Data Science article that uses Python. [5]




Classification




There are many articles about classification algorithms.



I will discuss what Tejan Kermali did in his Towards Data Science article, specifically his Bag of Words algorithm in Python [6].



After documenting the classifier, I will demonstrate it in an R implementation.



Bag of Words Algorithm



The algorithm calculates the probabilities that the message is spam or ham and ultimately returns the more likely label:

\[ \text{Ham or Spam} = \begin{cases} \text{Spam} & \text{if } P(spam) \geq P(ham) \\ \text{Ham} & \text{if } P(spam) \lt P(ham) \end{cases} \]
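The decision rule can be written as a one-line comparison. This is only a sketch: the function name `pick_label` and its arguments are hypothetical and do not come from Tejan's article.

```r
# Hypothetical helper: return the label with the higher accumulated score.
# Ties go to spam, matching the >= in the rule above.
pick_label <- function(p_spam, p_ham) {
  if (p_spam >= p_ham) "Spam" else "Ham"
}

pick_label(-26.4, -35.9)   # spam score is higher (less negative)
pick_label(-61.9, -55.8)   # ham score is higher
```

Because the scores are sums of logs, they are negative numbers, and the "higher" score is the one closer to zero.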

Both probabilities are accumulated in two phases.



Phase 1 The Frequencies:



In the first phase, we loop through each word and adjust the accumulated probability based on the relative frequency of that word.

Before we begin, we should take note of what a word is.

The Words:

We may optionally decide to do the following:

  1. Remove stop words, i.e. words like “a” and “the”. They don't add meaning for us.
  2. Remove numbers
  3. Remove punctuation marks.
  4. Reduce all alphabetic characters to lower case
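The four optional steps above can be sketched in base R. The helper name `clean_words` and the tiny stop-word list are assumptions made for illustration; a real run would use a fuller list such as the one from the tm package.

```r
# Hypothetical helper: normalize one message into a vector of words.
# The default stop-word list is a tiny illustrative sample, not a full list.
clean_words <- function(msg, stop_words = c("a", "the", "and", "to", "of")) {
  msg <- tolower(msg)                          # step 4: lower case
  msg <- gsub("[0-9]", "", msg)                # step 2: remove numbers
  msg <- gsub("[[:punct:]]", "", msg)          # step 3: remove punctuation
  words <- strsplit(msg, "[[:space:]]+")[[1]]  # split on whitespace
  words <- words[words != ""]                  # drop empty tokens
  words[!(words %in% stop_words)]              # step 1: remove stop words
}

clean_words("Call the office at 5pm, and bring a pen!")
# "call" "office" "at" "pm" "bring" "pen"
```

Note that removing digits turns "5pm" into "pm", so the order and choice of steps affects which tokens survive.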



Tejan decided to combine words to improve conceptual integrity. Individual words often have less meaning than combined words.

Thus every message is parsed into a rolling series of two-word phrases. For example,



Call me when you get a chance.


is fed into the classifier as

“Call me”, “me when”, “when you”, “you get”, “get a”, “a chance”
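A minimal sketch of this rolling two-word split in base R follows. The function name `rolling_pairs` is hypothetical, and stripping punctuation first is an assumption made so that the output matches the example above.

```r
# Hypothetical helper: turn a message into rolling two-word phrases.
rolling_pairs <- function(msg) {
  msg <- gsub("[[:punct:]]", "", msg)          # drop punctuation such as the final "."
  words <- strsplit(msg, "[[:space:]]+")[[1]]  # split on whitespace
  words <- words[words != ""]                  # drop empty tokens
  if (length(words) < 2) return(character(0))  # no pairs in a one-word message
  # pair each word with the word that follows it
  paste(head(words, -1), tail(words, -1))
}

rolling_pairs("Call me when you get a chance.")
# "Call me" "me when" "when you" "you get" "get a" "a chance"
```

A message of n words yields n - 1 phrases, so very short messages contribute little evidence either way.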



Each two-word phrase is looked up in both the spam training dataset and the ham training dataset.



The Process:



When I ran the simulation, the phrase “mobile number” had a noticeably higher relative frequency in the spam training data:



Dataset   Words     “mobile number” count   Frequency
Ham       112,310   11                      .0001
Spam      32,581    13                      .0004



For each phrase w, the overall P(spam) is adjusted as follows:



\[ SpamFreq(w) = \frac{s(w)}{s} \]



\[ P(spam) \mathrel{+}= \log(SpamFreq(w)) \]

where

s(w) = number of occurrences of phrase w in the spam dataset
s = number of words in the spam dataset

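For each phrase w, the overall P(ham) is adjusted as follows:



\[ HamFreq(w) = \frac{h(w)}{h} \]



\[ P(ham) \mathrel{+}= \log(HamFreq(w)) \]

where

h(w) = number of occurrences of phrase w in the ham dataset
h = number of words in the ham dataset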



Thus “mobile number” adjusts P(ham) and P(spam) as follows:

Probability   Freq    log(Freq)   Adjustment
Ham           .0001   -9.1106     -9.1106
Spam          .0004   -7.8866     -7.8866



This makes some intuitive sense.



Since the phrase has the higher frequency in spam, its log is less negative there, so P(spam) is reduced less than P(ham). That implies the message is more likely to be spam.
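To make the direction of the adjustment concrete, here is a small sketch with made-up counts. The numbers and variable names are hypothetical and are not taken from the simulation.

```r
# Hypothetical counts, chosen only for illustration
spam_count <- 20;  spam_total <- 10000   # phrase seen 20 times in 10,000 spam words
ham_count  <- 5;   ham_total  <- 50000   # phrase seen 5 times in 50,000 ham words

spam_adj <- log(spam_count / spam_total) # amount added to P(spam)
ham_adj  <- log(ham_count / ham_total)   # amount added to P(ham)

# Both adjustments are negative; the spam one is closer to zero,
# so this phrase pushes the message toward spam.
spam_adj > ham_adj
```

With these counts the spam adjustment is roughly -6.2 and the ham adjustment roughly -9.2, a gap of about 3 in favor of spam from this one phrase.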



But what if the test data phrase isn't in the training dataset?



That happens a lot actually.



In that case we default to the following:



\[ P(spam) \mathrel{-}= \log(AllSpamWords) \]

\[ P(ham) \mathrel{-}= \log(AllHamWords) \]

These adjustments are the same no matter the word. In my simulation they were:



Probability   Words   Phrases   Sum     Log(Sum)   Adjustment
Ham           9954    8147      18101   9.803      -9.803
Spam          2981    2342      5323    8.579      -8.5797



Thus, for any word that does not appear in ham, P(ham) is adjusted down by 9.803. Not being in ham makes it suspicious, I guess.

Note: Treating the negative natural log of a large count as comparable to the natural log of a small frequency may seem inappropriate at first. In both cases, however, the magnitudes stay roughly between 5 and 10, so it is a workable way to track how often words occur in the spam dataset versus the ham dataset.
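The default penalties can be reproduced from the word and phrase totals in the table above. This is a sketch only; the variable names are illustrative.

```r
# Totals from the table above: words plus phrases in each training set
all_ham_words  <- 9954 + 8147   # 18101
all_spam_words <- 2981 + 2342   # 5323

# An unseen phrase subtracts the log of the total from the class score
ham_penalty  <- -log(all_ham_words)
spam_penalty <- -log(all_spam_words)

round(c(ham = ham_penalty, spam = spam_penalty), 2)
```

The results are approximately -9.80 for ham and -8.58 for spam. Because the ham training set is larger, an unseen phrase costs ham more than it costs spam.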







Phase 2 The Blanket Adjustment:



Each email gets a blanket adjustment as follows.

Probability   SubTotal   Total   Prob    log(Prob)
Ham           1041       1200    .8675   -0.142
Spam          159        1200    .1325   -2.021



Thus every email gives P(ham) a head start over P(spam) of almost 2 (about 1.88).

This puts the burden on P(spam): a message needs several phrases that are frequent in the spam training dataset before it is classified as spam.
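The blanket adjustment is the log of each class's share of the training messages. A short sketch using the counts from the table above (variable names are illustrative):

```r
# Message counts from the table above
n_ham  <- 1041
n_spam <- 159
n_all  <- n_ham + n_spam        # 1200

ham_prior  <- log(n_ham / n_all)    # added to P(ham)
spam_prior <- log(n_spam / n_all)   # added to P(spam)

# Head start that ham gets over spam before any word is considered
ham_prior - spam_prior
```

The two adjustments come out near -0.142 and -2.021, so the head start for ham is about 1.88.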

Demonstration in R



The following code uses the same data and demonstrates the Classification algorithm.



First read in the data.

library(dplyr)   # provides %>%, mutate(), select(), filter() and count()

py_path<-'C:\\Users\\arono\\source\\R\\Data607\\'
py_file<-'spam_test.csv'
py_csv<-paste0(py_path,py_file)   # paste0 to omit the space

py_csv_df<-read.csv(py_csv)

# paste together the columns broken up by the commas
py_csv_df<-py_csv_df %>% mutate(text=paste0(v2,X,X.1,X.2))

py_csv_df<-py_csv_df %>% select(c("v1","text"))

names(py_csv_df)<-c("type","text")



Remove all numbers and punctuation marks.

# remove numbers and many punctuation marks
py_csv_df$text<-gsub("[[:punct:]]", "", py_csv_df$text)
py_csv_df$text<-gsub("[0-9]", "", py_csv_df$text)



Tokenize it to words. Remove the most common stop words.

library(tidytext)
library(tm)

stop_df<-data.frame(stopwords())
names(stop_df)<-c("word")



py_words_df<- py_csv_df %>%
  unnest_tokens(word, text)

# remove stop words
py_words_df <- py_words_df %>%
  anti_join(stop_df, by = "word")



Calculate the percentage of spam and ham mail. We need this later for the Classify function.

total_msgs<-nrow(py_csv_df)
total_ham_msgs<-  py_csv_df %>% filter(type == 'ham')  %>%  count() 
total_spam_msgs<-  py_csv_df %>% filter(type == 'spam')  %>%  count() 
pct_ham<-total_ham_msgs/total_msgs
pct_spam<-total_spam_msgs/total_msgs

print(paste("Total Messages      ", total_msgs))
## [1] "Total Messages       1791"
print(paste("Total Spam Messages ", total_spam_msgs))
## [1] "Total Spam Messages  229"
print(paste("Total Ham Messages  ", total_ham_msgs))
## [1] "Total Ham Messages   1562"
print(paste("The pct of ham is   ", pct_ham))
## [1] "The pct of ham is    0.87213847012842"
print(paste("The pct of spam is  ", pct_spam))
## [1] "The pct of spam is   0.12786152987158"



Calculate frequencies.

# count how often each word occurs within each message type
py_word_freq_df <- py_words_df %>%
  group_by(type, word) %>%
  summarise(freq = n())

# total number of words (after stop-word removal) per message type
words_in_ham  <- py_words_df %>% filter(type == 'ham')  %>% count()
words_in_spam <- py_words_df %>% filter(type == 'spam') %>% count()

# split the frequency table by message type
py_ham_word_freq_df  <- py_word_freq_df %>% filter(type == 'ham')
py_spam_word_freq_df <- py_word_freq_df %>% filter(type == 'spam')

# relative frequency of each word within its own message type
py_ham_word_freq_df  <- py_ham_word_freq_df  %>% mutate(freq_pct = freq / words_in_ham[1, 1])
py_spam_word_freq_df <- py_spam_word_freq_df %>% mutate(freq_pct = freq / words_in_spam[1, 1])



Define the Classify function, mimicking Tejan's implementation.

Classify <- function(msg) {
  # split the message into words on single spaces
  msg_words <- strsplit(msg, " ")

  # running log scores for each class
  pSpam = 0
  pHam = 0

  # Phase 1, spam side: adjust pSpam once per word
  for (w in msg_words[[1]]) {
    w_f <- py_spam_word_freq_df %>% filter(word == w)

    if (nrow(w_f) == 1) {
      # word seen in spam: add the log of its relative frequency
      adj <- log(w_f[1, 'freq_pct'])
      pSpam = pSpam + adj
      print(paste(w, " found in Spam with frequency ", sprintf("%.4f", w_f[1, 'freq_pct']), " adjusting by ", sprintf("%.4f", adj)))
    } else {
      # word never seen in spam: subtract the log of the spam word count
      adj <- log(words_in_spam[1, 1])
      pSpam = pSpam - adj
      print(paste(w, " not Found in Spam : adjusting by ", sprintf("%.4f", -1 * adj)))
    }
  }

  # Phase 1, ham side: adjust pHam once per word
  for (w in msg_words[[1]]) {
    w_f <- py_ham_word_freq_df %>% filter(word == w)

    if (nrow(w_f) == 1) {
      # word seen in ham: add the log of its relative frequency
      adj <- log(w_f[1, 'freq_pct'])
      pHam = pHam + adj
      print(paste(w, " found in Ham with frequency ", sprintf("%.4f", w_f[1, 'freq_pct']), " adjusting by ", sprintf("%.4f", adj)))
    } else {
      # word never seen in ham: subtract the log of the ham word count
      adj <- log(words_in_ham[1, 1])
      pHam = pHam - adj
      print(paste(w, " not Found in Ham: adjusting by ", sprintf("%.4f", -1 * adj)))
    }
  }

  print(paste("Base pHam =  ", pHam))
  print(paste("Base pSpam =  ", pSpam))

  # Phase 2: the blanket adjustment is the log of each class's share of mail
  pSpam = pSpam + log(pct_spam)
  pHam = pHam   + log(pct_ham)

  print(paste("Final pHam =  ", sprintf("%.4f", pHam)))
  print(paste("Final pSpam =  ", sprintf("%.4f", pSpam)))

  # the class with the higher (less negative) score wins; ties go to ham
  if (pHam < pSpam) {
    print("The message is SPAM")
  } else {
    print("The message is ham")
  }
}



Call it with something spammy.

Classify("call now for free prize")
## [1] "call  found in Spam with frequency  0.0294  adjusting by  -3.5266"
## [1] "now  found in Spam with frequency  0.0164  adjusting by  -4.1125"
## [1] "for  not Found in Spam : adjusting by  -8.1901"
## [1] "free  found in Spam with frequency  0.0219  adjusting by  -3.8206"
## [1] "prize  found in Spam with frequency  0.0089  adjusting by  -4.7243"
## [1] "call  found in Ham with frequency  0.0058  adjusting by  -5.1453"
## [1] "now  found in Ham with frequency  0.0062  adjusting by  -5.0816"
## [1] "for  not Found in Ham: adjusting by  -9.4760"
## [1] "free  found in Ham with frequency  0.0013  adjusting by  -6.6428"
## [1] "prize  not Found in Ham: adjusting by  -9.4760"
## [1] "Base pHam =   -35.8216385128364"
## [1] "Base pSpam =   -24.3742249553107"
## [1] "Final pHam =   -35.9584"
## [1] "Final pSpam =   -26.4310"
## [1] "The message is SPAM"



Call it with something normal.

Classify("please see attached need good answer by tomorrow")
## [1] "please  found in Spam with frequency  0.0055  adjusting by  -5.1943"
## [1] "see  found in Spam with frequency  0.0008  adjusting by  -7.0915"
## [1] "attached  not Found in Spam : adjusting by  -8.1901"
## [1] "need  not Found in Spam : adjusting by  -8.1901"
## [1] "good  found in Spam with frequency  0.0014  adjusting by  -6.5806"
## [1] "answer  found in Spam with frequency  0.0003  adjusting by  -8.1901"
## [1] "by  not Found in Spam : adjusting by  -8.1901"
## [1] "tomorrow  found in Spam with frequency  0.0003  adjusting by  -8.1901"
## [1] "please  found in Ham with frequency  0.0019  adjusting by  -6.2571"
## [1] "see  found in Ham with frequency  0.0031  adjusting by  -5.7624"
## [1] "attached  not Found in Ham: adjusting by  -9.4760"
## [1] "need  found in Ham with frequency  0.0033  adjusting by  -5.7148"
## [1] "good  found in Ham with frequency  0.0053  adjusting by  -5.2419"
## [1] "answer  found in Ham with frequency  0.0005  adjusting by  -7.6842"
## [1] "by  not Found in Ham: adjusting by  -9.4760"
## [1] "tomorrow  found in Ham with frequency  0.0023  adjusting by  -6.0748"
## [1] "Base pHam =   -55.6873436002087"
## [1] "Base pSpam =   -59.8168339230962"
## [1] "Final pHam =   -55.8242"
## [1] "Final pSpam =   -61.8736"
## [1] "The message is ham"