Document Classification: Emails
The Ham or Spam classification problem is a common and ongoing pursuit in both the academic and professional worlds.
There are several approaches one can explore, and many references are available to review.
For example, Mala Deep wrote an article on Towards Data Science comparing several algorithms [1]. He researched the various methods and reported the following accuracies:
| Algorithm | Accuracy (%) |
|---|---|
| Bagging Classification | 97.36 |
| Random Forest Classification | 97.96 |
| Naive Bayes Classification | 98.26 |
| Extra Tree Classification | 98.20 |
| SVM Classification | 83.43 |
| KNN Classification | 86.06 |
| Decision Tree Classification | 97.30 |
Deepal Dsilva uses the naiveBayes() function from the e1071 R package to train her classifier [3].
Tejan Kermali also wrote a Towards Data Science article on how to employ the TF-IDF and Bag of Words (BOW) algorithms in Python [2].
Mandy Gu wrote another Towards Data Science Python article which is quite excellent [5].
There are many articles about classification algorithms.
I will discuss what Tejan Kermali did in his Towards Data Science article, specifically his Bag of Words algorithm in Python [6].
After documenting the classifier, I will demonstrate it in an R implementation.
The algorithm calculates the probabilities that the message is Spam or Ham and ultimately returns the label with the higher probability:
\[
Ham \ Or \ Spam \ = \ \begin{cases}
Spam & \text{if } P(spam) \geq P(ham) \\
Ham & \text{if } P(spam) \lt P(ham) \\
\end{cases}
\]
Both probabilities are accumulated in two phases.
In the first phase, we loop through each word and adjust the accumulated probability based on that word's frequency in the training data.
Before we begin, we should take note of what a word is.
The Words:
We may optionally decide to preprocess the words. Tejan decided to combine adjacent words to improve conceptual integrity, since individual words often carry less meaning than combined words.
Thus every message is parsed into a rolling series of two-word phrases, that is:
Call me when you get a chance.
“Call me”, “me when”, “when you”, “you get”, “get a”, “a chance”
Each two-word phrase is then looked up in both the Spam and the Ham training datasets.
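As a quick illustration, here is a minimal R sketch of that rolling two-word parse. The helper name bigrams and the sample call are mine, for illustration only; this is not Tejan's code.

# minimal sketch: split a message into rolling two-word phrases
bigrams <- function(msg) {
  words <- strsplit(tolower(msg), " ")[[1]]
  if (length(words) < 2) return(character(0))
  paste(words[-length(words)], words[-1])
}
bigrams("Call me when you get a chance")
# returns: "call me" "me when" "when you" "you get" "get a" "a chance"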
The Process:
When I ran the simulation, the phrase “mobile number” was found significantly more often in the Spam training data:
| Dataset | Total Words | Count of “mobile number” | Frequency |
|---|---|---|---|
| Ham | 112,310 | 11 | .0001 |
| Spam | 32,581 | 13 | .0004 |
For each phrase w, the overall P(spam) is adjusted as follows:
\[SpamFreq(w) \ = \ \frac{s(w)}{s}\]
\[P(spam) \ += \ \log(SpamFreq(w))\]
where
s(w) = number of occurrences of phrase w in the spam dataset
s = total number of words in the spam dataset
For each phrase w, the overall P(ham) is adjusted the same way:
\[HamFreq(w) \ = \ \frac{h(w)}{h}\]
\[P(ham) \ += \ \log(HamFreq(w))\]
where h(w) and h are the corresponding counts in the ham dataset.
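To make the arithmetic concrete, here is a tiny R sketch of a single adjustment; the counts below are hypothetical, not taken from the training data.

# hypothetical counts for one phrase w
s_w <- 10      # occurrences of phrase w in the spam dataset
s   <- 30000   # total words in the spam dataset
adj <- log(s_w / s)   # the phrase's log spam frequency, about -8.01
pSpam <- 0
pSpam <- pSpam + adj  # accumulate into the running P(spam)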
Thus “mobile number” adjusts P(ham) and P(spam) as follows:
| Probability | Freq | log(Freq) | Adjustment |
|---|---|---|---|
| Ham | .0001 | -9.1106 | -9.1106 |
| Spam | .0004 | -7.8866 | -7.8866 |
This makes some intuitive sense.
Since the log of the lower (ham) frequency is more negative than the log of the higher (spam) frequency, P(ham) is penalized more, implying the message is more likely to be spam.
But what if a phrase in the test data isn't in the training dataset?
That actually happens a lot.
In that case we default to the following:
\[P(spam) \ -= \ \log(AllSpamWords)\] \[P(ham) \ -= \ \log(AllHamWords)\]
These adjustments are the same no matter the word; in my simulation they were:
| Probability | Words | Phrases | Sum | Log(Sum) | Adjustment |
|---|---|---|---|---|---|
| Ham | 9954 | 8147 | 18101 | 9.803 | -9.803 |
| Spam | 2981 | 2342 | 5323 | 8.5797 | -8.5797 |
Thus, for any word that does not appear in Ham, P(ham) will be adjusted down by 9.803. Not being in ham makes it suspicious, I guess.
Note: This method of equating the negative natural log of a large word count to the natural log of a small frequency may seem inappropriate at first, but in both cases the magnitudes stay fairly contained between 5 and 10, so it is an acceptable way to track the frequency of words in the spam dataset versus the ham dataset.
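Using the sums from the table above, the fallback amounts to the following in R:

# penalty for a phrase that does not appear in the training data;
# 18101 and 5323 are the ham and spam sums from the table above
log(18101)   # about 9.80 -> P(ham)  is adjusted by about -9.80
log(5323)    # about 8.58 -> P(spam) is adjusted by about -8.58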
Finally, each email gets a blanket adjustment based on the proportion of ham and spam messages in the training data:
| Probability | Messages | Total Messages | Prob | log(Prob) |
|---|---|---|---|---|
| Ham | 1041 | 1200 | .8675 | -0.142 |
| Spam | 159 | 1200 | .1325 | -2.021 |
Thus every email gets a P(ham) head start over P(spam) of almost 2.
This puts the burden on P(spam): to win, a message must contain several phrases that occur with high frequency within the spam training dataset.
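The blanket adjustments in the table can be checked directly in R:

# blanket prior adjustment, from the message counts in the table above
log(1041 / 1200)   # P(ham)  adjustment: about -0.142
log(159 / 1200)    # P(spam) adjustment: about -2.021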
The following R code uses the same data and demonstrates the classification algorithm.
First, read in the data.
library(dplyr)

py_path <- 'C:\\Users\\arono\\source\\R\\Data607\\'
py_file <- 'spam_test.csv'
py_csv <- paste0(py_path, py_file) # paste0 joins without a separator
py_csv_df <- read.csv(py_csv)
# the message text was split across several columns by embedded commas;
# paste the pieces back together into a single text column
py_csv_df <- py_csv_df %>% mutate(text = paste0(v2, X, X.1, X.2))
py_csv_df <- py_csv_df %>% select(c("v1", "text"))
names(py_csv_df) <- c("type", "text")
Remove all numbers and punctuation marks.
# remove numbers and many punctuation marks
py_csv_df$text<-gsub("[[:punct:]]", "", py_csv_df$text)
py_csv_df$text<-gsub("[0-9]", "", py_csv_df$text)
Tokenize the text into words and remove the most common stop words.
library(tidytext)
library(tm)

# tm's stopwords() supplies a list of common English stop words
stop_df <- data.frame(stopwords())
names(stop_df) <- c("word")

# one row per word
py_words_df <- py_csv_df %>%
  unnest_tokens(word, text)

# remove stop words
py_words_df <- py_words_df %>%
  anti_join(stop_df, by = "word")
Calculate the percentage of spam and ham mail. We need this later for the Classify function.
total_msgs <- nrow(py_csv_df)
total_ham_msgs <- py_csv_df %>% filter(type == 'ham') %>% count()
total_spam_msgs <- py_csv_df %>% filter(type == 'spam') %>% count()
# pull the scalar counts out of the count() data frames
pct_ham <- total_ham_msgs$n[1] / total_msgs
pct_spam <- total_spam_msgs$n[1] / total_msgs
print(paste("Total Messages ", total_msgs))
## [1] "Total Messages 1791"
print(paste("Total Spam Messages ", total_spam_msgs))
## [1] "Total Spam Messages 229"
print(paste("Total Ham Messages ", total_ham_msgs))
## [1] "Total Ham Messages 1562"
print(paste("The pct of ham is ", pct_ham))
## [1] "The pct of ham is 0.87213847012842"
print(paste("The pct of spam is ", pct_spam))
## [1] "The pct of spam is 0.12786152987158"
Calculate frequencies.
# count how often each word appears in ham and in spam
py_word_freq_df <- py_words_df %>%
  group_by(type, word) %>%
  summarise(freq = n())

words_in_ham <- py_words_df %>% filter(type == 'ham') %>% count()
words_in_spam <- py_words_df %>% filter(type == 'spam') %>% count()

py_ham_word_freq_df <- py_word_freq_df %>% filter(type == 'ham')
py_spam_word_freq_df <- py_word_freq_df %>% filter(type == 'spam')

# convert raw counts to frequencies within each dataset
py_ham_word_freq_df <- py_ham_word_freq_df %>% mutate(freq_pct = freq / words_in_ham$n[1])
py_spam_word_freq_df <- py_spam_word_freq_df %>% mutate(freq_pct = freq / words_in_spam$n[1])
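As a quick sanity check (not part of the original write-up), you can peek at the highest-frequency spam words:

# inspect the most frequent words in the spam dataset
py_spam_word_freq_df %>%
  arrange(desc(freq_pct)) %>%
  head(10)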
Define the Classify function, mimicking Tejan's implementation.
Classify <- function(msg) {
  msg_words <- strsplit(msg, " ")
  pSpam <- 0
  pHam <- 0
  # phase 1a: accumulate each word's log frequency in the spam dataset
  for (w in msg_words[[1]]) {
    w_f <- py_spam_word_freq_df %>% filter(word == w)
    if (nrow(w_f) == 1) {
      adj <- log(w_f$freq_pct[1])
      pSpam <- pSpam + adj
      print(paste(w, " found in Spam with frequency ", sprintf("%.4f", w_f$freq_pct[1]), " adjusting by ", sprintf("%.4f", adj)))
    } else {
      # unseen word: penalize by the log of the spam dataset size
      adj <- log(words_in_spam$n[1])
      pSpam <- pSpam - adj
      print(paste(w, " not Found in Spam : adjusting by ", sprintf("%.4f", -1 * adj)))
    }
  }
  # phase 1b: accumulate each word's log frequency in the ham dataset
  for (w in msg_words[[1]]) {
    w_f <- py_ham_word_freq_df %>% filter(word == w)
    if (nrow(w_f) == 1) {
      adj <- log(w_f$freq_pct[1])
      pHam <- pHam + adj
      print(paste(w, " found in Ham with frequency ", sprintf("%.4f", w_f$freq_pct[1]), " adjusting by ", sprintf("%.4f", adj)))
    } else {
      # unseen word: penalize by the log of the ham dataset size
      adj <- log(words_in_ham$n[1])
      pHam <- pHam - adj
      print(paste(w, " not Found in Ham: adjusting by ", sprintf("%.4f", -1 * adj)))
    }
  }
  print(paste("Base pHam = ", pHam))
  print(paste("Base pSpam = ", pSpam))
  # phase 2: the blanket adjustment is the log of each mail type's percentage
  pSpam <- pSpam + log(pct_spam)
  pHam <- pHam + log(pct_ham)
  print(paste("Final pHam = ", sprintf("%.4f", pHam)))
  print(paste("Final pSpam = ", sprintf("%.4f", pSpam)))
  if (pHam < pSpam) {
    print("The message is SPAM")
  } else {
    print("The message is ham")
  }
}
Call it with something spammy.
Classify("call now for free prize")## [1] "call found in Spam with frequency 0.0294 adjusting by -3.5266"
## [1] "now found in Spam with frequency 0.0164 adjusting by -4.1125"
## [1] "for not Found in Spam : adjusting by -8.1901"
## [1] "free found in Spam with frequency 0.0219 adjusting by -3.8206"
## [1] "prize found in Spam with frequency 0.0089 adjusting by -4.7243"
## [1] "call found in Ham with frequency 0.0058 adjusting by -5.1453"
## [1] "now found in Ham with frequency 0.0062 adjusting by -5.0816"
## [1] "for not Found in Ham: adjusting by -9.4760"
## [1] "free found in Ham with frequency 0.0013 adjusting by -6.6428"
## [1] "prize not Found in Ham: adjusting by -9.4760"
## [1] "Base pHam = -35.8216385128364"
## [1] "Base pSpam = -24.3742249553107"
## [1] "Final pHam = -35.9584"
## [1] "Final pSpam = -26.4310"
## [1] "The message is SPAM"
Call it with something normal.
Classify("please see attached need good answer by tomorrow")## [1] "please found in Spam with frequency 0.0055 adjusting by -5.1943"
## [1] "see found in Spam with frequency 0.0008 adjusting by -7.0915"
## [1] "attached not Found in Spam : adjusting by -8.1901"
## [1] "need not Found in Spam : adjusting by -8.1901"
## [1] "good found in Spam with frequency 0.0014 adjusting by -6.5806"
## [1] "answer found in Spam with frequency 0.0003 adjusting by -8.1901"
## [1] "by not Found in Spam : adjusting by -8.1901"
## [1] "tomorrow found in Spam with frequency 0.0003 adjusting by -8.1901"
## [1] "please found in Ham with frequency 0.0019 adjusting by -6.2571"
## [1] "see found in Ham with frequency 0.0031 adjusting by -5.7624"
## [1] "attached not Found in Ham: adjusting by -9.4760"
## [1] "need found in Ham with frequency 0.0033 adjusting by -5.7148"
## [1] "good found in Ham with frequency 0.0053 adjusting by -5.2419"
## [1] "answer found in Ham with frequency 0.0005 adjusting by -7.6842"
## [1] "by not Found in Ham: adjusting by -9.4760"
## [1] "tomorrow found in Ham with frequency 0.0023 adjusting by -6.0748"
## [1] "Base pHam = -55.6873436002087"
## [1] "Base pSpam = -59.8168339230962"
## [1] "Final pHam = -55.8242"
## [1] "Final pSpam = -61.8736"
## [1] "The message is ham"
References:
[1] Mala Deep, “The Ultimate Guide To SMS: Spam or Ham Classifier,” Towards Data Science.
[2] Tejan Kermali, “Spam Classifier in Python from scratch,” Towards Data Science.
[3] Deepal Dsilva, “Ham or Spam? SMS Text Classification with Machine Learning: A Naive Bayes Implementation in R.”
[4] “Building a Spam Ham Classifier.”
[5] Mandy Gu, “Spam or Ham: Introduction to Natural Language Processing,” Towards Data Science.
[6] “Ham or Spam? SMS Text Classification with Machine Learning: A Naive Bayes Implementation in R.”