Document Classification : Emails

Overview

The Ham or Spam classification problem is a common and ongoing pursuit in the academic and professional world.

There are several approaches that one can explore, and there are many references available to review.

For example, Mala Deep wrote an article on Towards Data Science that surveys several algorithms [1]. Mala researched the various methods and reported the following accuracies:

Algorithm	Accuracy (fraction)
Bagging Classification	.9736
Random Forest Classification	.9796
Naive Bayes Classification	.9826
Extra Tree Classification	.9820
SVM Classification	.8343
KNN Classification	.8606
Decision Tree Classification	.9730

Deepal Dsilva uses the naiveBayes() function from the e1071 R package to train her classifier. [3]

Tejan Kermali also wrote a Towards Data Science article on how to implement the TF-IDF and Bag of Words (BOW) algorithms in Python [2].

Mandy Gu also wrote an excellent Towards Data Science article that uses Python [5].

Classification

There are many articles about classification algorithms.

I will discuss what Tejan Kermali did in his Towards Data Science article, specifically his Bag of Words algorithm in Python [6].

After documenting the classifier, I will demonstrate it in an R implementation.

Bag of Words Algorithm

The algorithm calculates a score for the message being Spam and a score for it being Ham, and ultimately returns the label with the higher score.

\[ Ham \ Or \ Spam \ = \ \begin{cases} Spam & \text{if } P(spam) \geq P(ham) \\ Ham & \text{if } P(spam) \lt P(ham) \end{cases} \]
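The decision rule above can be sketched in R as a small helper. The function name label_message and its two arguments are hypothetical, introduced only for illustration; a tie goes to Spam, as in the formula.

```r
# Hypothetical helper: choose a label from the two accumulated log scores.
# A tie is labelled Spam, matching the >= in the formula.
label_message <- function(p_spam, p_ham) {
  if (p_spam >= p_ham) "Spam" else "Ham"
}

label_message(-26.4, -35.9)  # spam score is higher
label_message(-61.9, -55.8)  # ham score is higher
```

Because the scores are sums of logs, they are negative numbers, and the "higher" score is the one closer to zero.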

Both probabilities are accumulated in two phases.

Phase 1 The Frequencies:

In the first part, we loop through each word and adjust the accumulated probability based on the frequency probability of that word.

Before we begin, we should take note of what counts as a word.

The Words:

We may optionally decide to do the following:

Remove stop words, i.e. words like “a” and “the”. They don't add meaning for us.
Remove numbers.
Remove punctuation marks.
Reduce all alphabetic characters to lower case.
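As a rough illustration of these optional clean-up steps, here is a minimal base R sketch. The function clean_message and its two-word stop list are assumptions made for this example; a real stop-word list would be much longer.

```r
# Hypothetical helper illustrating the optional clean-up steps.
# The default stop-word list is a tiny stand-in, not a real lexicon.
clean_message <- function(msg, stop_words = c("a", "the")) {
  msg <- tolower(msg)                        # reduce to lower case
  msg <- gsub("[[:punct:]]", "", msg)        # remove punctuation marks
  msg <- gsub("[0-9]", "", msg)              # remove numbers
  words <- strsplit(msg, "[[:space:]]+")[[1]]
  words <- words[nzchar(words)]              # discard empty tokens
  words[!words %in% stop_words]              # remove stop words
}

clean_message("Call the office at 5, get a PRIZE!")
```

Each step is independent, so any of them can be dropped if the corresponding clean-up is not wanted.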

Tejan decided to combine words to improve conceptual integrity. Individual words often have less meaning than combined words.

Thus every message is parsed into a rolling series of 2 words, that is

Call me when you get a chance.

is fed into the classifier as

“Call me”, “me when”, “when you”, “you get”, “get a”, “a chance”
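The rolling 2-word split can be sketched in base R as follows. The helper name make_bigrams is hypothetical, and stripping punctuation before splitting is an assumption made so that the result matches the example above.

```r
# Hypothetical helper: split a message into overlapping 2-word phrases.
make_bigrams <- function(msg) {
  msg <- gsub("[[:punct:]]", "", msg)          # assumed: punctuation is dropped first
  words <- strsplit(msg, "[[:space:]]+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) < 2) return(character(0))  # nothing to pair up
  # pair every word with the word that follows it
  paste(head(words, -1), tail(words, -1))
}

make_bigrams("Call me when you get a chance.")
```

A message of n words yields n - 1 phrases, so the seven-word example produces six.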

Each 2-word phrase is looked up in both the Spam training dataset and the Ham training dataset.

The Process:

When I ran the simulation, the phrase “mobile number” was found significantly more in the Spam mail training data:

Dataset	Words	“mobile number” count	Frequency
Ham	112,310	11	.0001
Spam	32,581	13	.0004

For each phrase w the overall P(spam) is adjusted by the following :

\[SpamFreq(w) \ = \ \frac{s(w)}{s}\]

\[P(spam) \ += \ \log(SpamFreq(w))\]

Where

s(w) = number of occurrences of phrase w in the spam dataset
s = number of words in the spam dataset

For each phrase w the overall P(ham) is adjusted by the following :

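\[HamFreq(w) \ = \ \frac{h(w)}{h}\]

\[P(ham) \ += \ \log(HamFreq(w))\]

Where

h(w) = number of occurrences of phrase w in the ham dataset
h = number of words in the ham dataset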
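The accumulation can be sketched in R with toy numbers. Everything below (the counts, the totals, and the helper log_score) is invented for illustration and is not taken from the training data. Phrases missing from the counts are simply skipped in this sketch.

```r
# Toy counts, invented for illustration only.
spam_counts <- c("free prize" = 4, "call now" = 2)
ham_counts  <- c("call now" = 1, "see you" = 5)
spam_total  <- 20   # plays the role of s
ham_total   <- 40   # plays the role of h

# Hypothetical helper: sum log(count / total) over the phrases that are
# present in the counts. Unseen phrases are ignored here.
log_score <- function(phrases, counts, total) {
  seen <- phrases[phrases %in% names(counts)]
  sum(log(counts[seen] / total))
}

log_score("call now", spam_counts, spam_total)  # log(2 / 20)
log_score("call now", ham_counts, ham_total)    # log(1 / 40)
```

With these toy numbers the spam score is the less negative of the two, because "call now" makes up a larger share of the spam data than of the ham data.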

Thus “mobile number” adjusts P(ham) and P(spam) as follows :

Probability	Freq	log(Freq)	Adjustment
Ham	.0001	-9.1106	-9.1106
Spam	.0004	-7.8866	-7.8866

This makes some intuitive sense.

Since the log of the lower frequency is more negative than the log of the higher frequency, P(ham) drops further than P(spam), which implies the phrase is more likely to come from spam.

But what if the test data phrase isn't in the training dataset?

That happens a lot actually.

In that case we default to the following:

\[P(spam) \ -= \ \log(AllSpamWords)\]

\[P(ham) \ -= \ \log(AllHamWords)\]

These adjustments are the same no matter the word. In my simulation they were:

Probability	Words	Phrases	Sum	Log(Sum)	Adjustment
Ham	9954	8147	18101	9.803	-9.803
Spam	2981	2342	5323	8.579	-8.5797

Thus, for any word that does not appear in Ham, P(ham) will be adjusted down by 9.803. Not being in ham makes it suspicious, I guess.

Note: Substituting the negative natural log of a large count for the natural log of a small frequency may seem inappropriate at first, but in both cases the magnitudes fall roughly between 5 and 10, and it's an OK model for tracking the frequency of words in the spam dataset vs. the ham dataset.
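A short R sketch of this fallback, using the totals from the table above. The helper name unseen_adjustment is hypothetical. It also shows one way to read the rule: subtracting log(N) is the same as adding log(1/N), so an unseen phrase is scored as if it had occurred exactly once.

```r
# Totals taken from the table above (words + phrases per dataset).
all_ham_words  <- 18101
all_spam_words <- 5323

# Hypothetical helper: fallback adjustment for a phrase never seen in training.
unseen_adjustment <- function(total) -log(total)

unseen_adjustment(all_ham_words)   # about -9.80
unseen_adjustment(all_spam_words)  # about -8.58

# Same value as the log frequency of a phrase that occurs exactly once.
all.equal(unseen_adjustment(all_ham_words), log(1 / all_ham_words))
```

Because the ham dataset is larger, an unseen phrase costs P(ham) more than it costs P(spam).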

Phase 2 The Blanket Adjustment:

Each email gets a blanket adjustment as follows.

Probability	SubTotal	Total	Prob	log(Prob)
Ham	1041	1200	.8675	-0.142
Spam	159	1200	.1325	-2.021

Thus every email starts with P(ham) ahead of P(spam) by almost 2.

This puts the burden on P(spam): the message needs several phrases that occur with high frequency in the spam training dataset to overcome that head start.
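The blanket adjustment in the table above can be reproduced with a few lines of R. The variable names are assumptions chosen for this sketch; the counts are the ones from the table.

```r
# Message counts from the table above.
ham_msgs  <- 1041
spam_msgs <- 159
all_msgs  <- ham_msgs + spam_msgs   # 1200

# Log of each class's share of the mail.
ham_prior  <- log(ham_msgs / all_msgs)
spam_prior <- log(spam_msgs / all_msgs)

round(ham_prior, 3)
round(spam_prior, 3)

# Head start that every message gives to ham, about 1.88.
ham_prior - spam_prior
```

The gap depends only on the mix of ham and spam in the training data, not on the message being classified.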

Demonstration in R

The following code uses the same data and demonstrates the Classification algorithm.

First read in the data.

library(dplyr)   # provides %>%, mutate(), select(), filter() and count()

py_path<-'C:\\Users\\arono\\source\\R\\Data607\\'
py_file<-'spam_test.csv'
py_csv<-paste0(py_path,py_file)   # paste0 to omit the space

py_csv_df<-read.csv(py_csv)

# paste together the columns broken up by the commas
py_csv_df<-py_csv_df %>% mutate(text=paste0(v2,X,X.1,X.2))

py_csv_df<-py_csv_df %>% select(c("v1","text"))

names(py_csv_df)<-c("type","text")

Remove all numbers and punctuation marks.

# remove numbers and many punctuation marks
py_csv_df$text<-gsub("[[:punct:]]", "", py_csv_df$text)
py_csv_df$text<-gsub("[0-9]", "", py_csv_df$text)

Tokenize it to words. Remove the most common stop words.

library(tidytext)
library(tm)

# English stop words from the tm package, in a column named "word"
stop_df<-data.frame(word = stopwords())

py_words_df<- py_csv_df %>%
  unnest_tokens(word, text)

# remove stop words
py_words_df <- py_words_df %>%
  anti_join(stop_df, by = "word")

Calculate the pct of spam and ham mail. We need this later for the Classify function.

total_msgs<-nrow(py_csv_df)
# nrow() returns plain numbers, so the percentages below are plain numbers too
total_ham_msgs<-  py_csv_df %>% filter(type == 'ham')  %>%  nrow()
total_spam_msgs<-  py_csv_df %>% filter(type == 'spam')  %>%  nrow()
pct_ham<-total_ham_msgs/total_msgs
pct_spam<-total_spam_msgs/total_msgs

print(paste("Total Messages      ", total_msgs))

## [1] "Total Messages       1791"

print(paste("Total Spam Messages ", total_spam_msgs))

## [1] "Total Spam Messages  229"

print(paste("Total Ham Messages  ", total_ham_msgs))

## [1] "Total Ham Messages   1562"

print(paste("The pct of ham is   ", pct_ham))

## [1] "The pct of ham is    0.87213847012842"

print(paste("The pct of spam is  ", pct_spam))

## [1] "The pct of spam is   0.12786152987158"

Calculate frequencies.

py_word_freq_df <-py_words_df %>%
  group_by(type, word) %>%
  summarise(freq = n()  )



words_in_ham<-py_words_df %>%     filter(type == 'ham')  %>%  count() 
words_in_spam<-py_words_df %>%    filter(type == 'spam')  %>%  count()



py_ham_word_freq_df <-py_word_freq_df  %>%   filter(type == 'ham')
py_spam_word_freq_df <-py_word_freq_df  %>%   filter(type == 'spam')

py_ham_word_freq_df <- py_ham_word_freq_df %>% mutate(freq_pct=freq/words_in_ham[1,1])

py_spam_word_freq_df <- py_spam_word_freq_df %>% mutate(freq_pct=freq/words_in_spam[1,1])

Define the Classify function, mimicking Tejan's implementation.

Classify <- function(msg) {

  # split on spaces; the message itself is not filtered for stop words
  msg_words <- strsplit(msg, " ")[[1]]

  pSpam <- 0
  pHam <- 0

  # Phase 1, spam side: add the log frequency of each word
  for (w in msg_words) {
    w_f <- py_spam_word_freq_df %>% filter(word == w)

    if (nrow(w_f) == 1) {
      freq <- w_f$freq_pct[1]
      adj <- log(freq)
      pSpam <- pSpam + adj
      print(paste(w, " found in Spam with frequency ", sprintf("%.4f", freq), " adjusting by ", sprintf("%.4f", adj)))
    } else {
      # word not in the spam training data: subtract log(total spam words)
      adj <- log(words_in_spam[1, 1])
      pSpam <- pSpam - adj
      print(paste(w, " not Found in Spam : adjusting by ", sprintf("%.4f", -1 * adj)))
    }
  }

  # Phase 1, ham side: same accumulation against the ham frequencies
  for (w in msg_words) {
    w_f <- py_ham_word_freq_df %>% filter(word == w)

    if (nrow(w_f) == 1) {
      freq <- w_f$freq_pct[1]
      adj <- log(freq)
      pHam <- pHam + adj
      print(paste(w, " found in Ham with frequency ", sprintf("%.4f", freq), " adjusting by ", sprintf("%.4f", adj)))
    } else {
      # word not in the ham training data: subtract log(total ham words)
      adj <- log(words_in_ham[1, 1])
      pHam <- pHam - adj
      print(paste(w, " not Found in Ham: adjusting by ", sprintf("%.4f", -1 * adj)))
    }
  }

  print(paste("Base pHam =  ", pHam))
  print(paste("Base pSpam =  ", pSpam))

  # Phase 2: the blanket adjustment is the log of each class's share of mail
  # ([[1]] gives a plain number whether the share is a number or a 1x1 data frame)
  pSpam <- pSpam + log(pct_spam[[1]])
  pHam <- pHam + log(pct_ham[[1]])

  print(paste("Final pHam =  ", sprintf("%.4f", pHam)))
  print(paste("Final pSpam =  ", sprintf("%.4f", pSpam)))

  if (pHam < pSpam) {
    print("The message is SPAM")
  } else {
    print("The message is ham")
  }
}

Call it with something spammy.

Classify("call now for free prize")

## [1] "call  found in Spam with frequency  0.0294  adjusting by  -3.5266"
## [1] "now  found in Spam with frequency  0.0164  adjusting by  -4.1125"
## [1] "for  not Found in Spam : adjusting by  -8.1901"
## [1] "free  found in Spam with frequency  0.0219  adjusting by  -3.8206"
## [1] "prize  found in Spam with frequency  0.0089  adjusting by  -4.7243"
## [1] "call  found in Ham with frequency  0.0058  adjusting by  -5.1453"
## [1] "now  found in Ham with frequency  0.0062  adjusting by  -5.0816"
## [1] "for  not Found in Ham: adjusting by  -9.4760"
## [1] "free  found in Ham with frequency  0.0013  adjusting by  -6.6428"
## [1] "prize  not Found in Ham: adjusting by  -9.4760"
## [1] "Base pHam =   -35.8216385128364"
## [1] "Base pSpam =   -24.3742249553107"
## [1] "Final pHam =   -35.9584"
## [1] "Final pSpam =   -26.4310"
## [1] "The message is SPAM"

Call it with something normal.

Classify("please see attached need good answer by tomorrow")

## [1] "please  found in Spam with frequency  0.0055  adjusting by  -5.1943"
## [1] "see  found in Spam with frequency  0.0008  adjusting by  -7.0915"
## [1] "attached  not Found in Spam : adjusting by  -8.1901"
## [1] "need  not Found in Spam : adjusting by  -8.1901"
## [1] "good  found in Spam with frequency  0.0014  adjusting by  -6.5806"
## [1] "answer  found in Spam with frequency  0.0003  adjusting by  -8.1901"
## [1] "by  not Found in Spam : adjusting by  -8.1901"
## [1] "tomorrow  found in Spam with frequency  0.0003  adjusting by  -8.1901"
## [1] "please  found in Ham with frequency  0.0019  adjusting by  -6.2571"
## [1] "see  found in Ham with frequency  0.0031  adjusting by  -5.7624"
## [1] "attached  not Found in Ham: adjusting by  -9.4760"
## [1] "need  found in Ham with frequency  0.0033  adjusting by  -5.7148"
## [1] "good  found in Ham with frequency  0.0053  adjusting by  -5.2419"
## [1] "answer  found in Ham with frequency  0.0005  adjusting by  -7.6842"
## [1] "by  not Found in Ham: adjusting by  -9.4760"
## [1] "tomorrow  found in Ham with frequency  0.0023  adjusting by  -6.0748"
## [1] "Base pHam =   -55.6873436002087"
## [1] "Base pSpam =   -59.8168339230962"
## [1] "Final pHam =   -55.8242"
## [1] "Final pSpam =   -61.8736"
## [1] "The message is ham"