This project uses Naive Bayes to classify plain text documents as either spam or ham. This work is based on what I learned from a YouTube lecture by UC Berkeley Prof. Pieter Abbeel on Machine Learning: Naive Bayes (https://www.youtube.com/watch?v=DNvwfNEiKvw).
The tm library is used to process the corpus.
Three sets of data were used. The training set was used to train the classifier. The hold off set was used to test the classifier with different parameters in order to improve the accuracy of the classification. Finally, the testing set was used to evaluate the classifier once the parameters were set.
Source of data: http://spamassassin.apache.org/old/publiccorpus/
Training set:
Hold off set:
Testing set:
library(tm)
library(dplyr)
library(ggplot2)
library(SnowballC)
library(knitr)
library(stringr)
library(DT)
Spam and ham corpus

The tm library is used to load the files. The code below loads the spam-ham training set, hold off set, and testing set.
#SETS
ham_training_folder <- "./spamham/easy_ham_training_p1"
spam_training_folder <- "./spamham/spam_2_training_p2"
ham_testing_folder <- "./spamham/easy_ham_2_TESTING"
spam_testing_folder <- "./spamham/spam_TESTING"
ham_holdoff_folder <- "./spamham/easy_ham_2_HOLDOFF"
spam_holdoff_folder <- "./spamham/spam_HOLDOFF"
hard_ham_testing_folder <- "./spamham/hard_ham_TESTING"
#LOAD FILES
ham_training <- VCorpus(DirSource(ham_training_folder))
spam_training <- VCorpus(DirSource(spam_training_folder))
ham_testing <- VCorpus(DirSource(ham_testing_folder))
spam_testing <- VCorpus(DirSource(spam_testing_folder))
ham_holdoff <- VCorpus(DirSource(ham_holdoff_folder))
spam_holdoff <- VCorpus(DirSource(spam_holdoff_folder))
hard_ham_testing <- VCorpus(DirSource(hard_ham_testing_folder))
#summary(ham_training)
#summary(spam_training)
This table lists the number of emails in the ham and spam files in each set.
| File | No. of Emails |
|---|---|
| Training Set: Easy Ham | 2551 |
| Training Set: Spam | 1396 |
| Hold Off Set: Easy Ham | 700 |
| Hold Off Set: Spam | 250 |
| Testing Set: Easy Ham | 701 |
| Testing Set: Spam | 251 |
| Testing Set: Hard Ham | 250 |
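These counts can be checked directly from the loaded corpora, since length() on a VCorpus returns the number of documents it contains:

#Number of emails in each corpus
length(ham_training)
length(spam_training)
length(ham_holdoff)
length(spam_holdoff)
length(ham_testing)
length(spam_testing)
length(hard_ham_testing)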
This simplistic approach to text classification focuses only on English alphabetic characters, and the case of the characters is ignored: text is converted to lowercase. Punctuation is also removed before the term document matrix and document term matrix are generated.
Other possible features that may improve the accuracy of the classification are not considered, such as whether the message is formatted in HTML, whether misspelled words are present, or whether the message contains words spelled in all uppercase characters, among other things.
Removing English stop words decreased the observed accuracy of the classification based on the training set, and stemming the words also decreased it.
Below is the observed classification on the hold off set for three preprocessing variants (a sketch of the corresponding tm_map() calls follows this list):

- No removal of stop words or stemming (k = 0.5)
- Removal of stop words (k = 0.5)
- Stemming of words (k = 0.5)
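These variants correspond to the two commented-out transformations in the extract.alpha() function further below. A minimal sketch of what enabling them would look like, assuming a VCorpus named docs (this is not the code used for the reported results):

#Optional transformations that were tested and then left disabled
docs <- tm_map(docs, removeWords, stopwords("english")) #remove English stop words
docs <- tm_map(docs, stemDocument, language = "english") #stem terms via SnowballC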
Characters in plain text documents are converted to lowercase.
convert.lowercase <- function(doc){
doc <- tm_map(doc, tolower)
doc <- tm_map(doc, PlainTextDocument)
return(doc)
}
#TRAINING SET
ham_training <- convert.lowercase(ham_training)
spam_training <- convert.lowercase(spam_training)
#TESTING SET
ham_testing <- convert.lowercase(ham_testing)
hard_ham_testing <- convert.lowercase(hard_ham_testing)
spam_testing <- convert.lowercase(spam_testing)
#HOLD OFF SET
ham_holdoff <- convert.lowercase(ham_holdoff)
spam_holdoff <- convert.lowercase(spam_holdoff)
#inspect(ham_training[[1]])
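Note that newer versions of the tm package expect base R functions such as tolower to be wrapped in content_transformer(), which keeps the documents as PlainTextDocuments. An equivalent version of the helper above would look roughly like this (a sketch only; the function name is illustrative and this is not the code used for the reported results):

#Lowercase conversion using content_transformer(), which preserves document structure
convert.lowercase2 <- function(doc){
doc <- tm_map(doc, content_transformer(tolower))
return(doc)
}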
This simplistic approach to text classification only looks at alphabetic characters.
extract.alpha <- function(docs){
thispattern = "[a-z]+ |[a-z]+[a-z]$|[a-z]+[\\.|,|;] "
for (j in seq(docs))
{
letters_only <- unlist(str_extract_all(docs[[j]], thispattern))
docs[[j]] <- paste(letters_only, collapse = " ")
}
#docs <- tm_map(docs, removeWords, stopwords("english"))
#docs <- tm_map(docs, stemDocument, language = "english")
docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs, PlainTextDocument)
docs <- tm_map(docs, removePunctuation)
#inspect(docs[[1]])
return(docs)
}
#TRAINING
ham_training <- extract.alpha(ham_training)
spam_training <- extract.alpha(spam_training)
#TESTING
ham_testing <- extract.alpha(ham_testing)
spam_testing <- extract.alpha(spam_testing)
hard_ham_testing <- extract.alpha(hard_ham_testing)
#HOLDOFF
ham_holdoff <- extract.alpha(ham_holdoff)
spam_holdoff <- extract.alpha(spam_holdoff)
#inspect(ham_training[[1]])
#inspect(spam_training[[1]])
#inspect(ham_testing[[1]])
#inspect(hard_ham_testing[[1]])
#inspect(spam_testing[[1]])
#inspect(ham_holdoff[[1]])
#inspect(spam_holdoff[[1]])
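To see what the extraction pattern keeps, it can be applied to a small string on its own. The sample sentence below is made up purely for illustration:

#Same pattern as in extract.alpha(): keeps runs of lowercase letters followed by a space,
#by simple punctuation plus a space, or at least two lowercase letters at the end of the string
thispattern <- "[a-z]+ |[a-z]+[a-z]$|[a-z]+[\\.|,|;] "
unlist(str_extract_all("win big money now, click here for free offers today", thispattern))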
This is the training set term document matrix. It will be used to generate the training set term frequency list.
ham_tdm <- TermDocumentMatrix(ham_training)
spam_tdm <- TermDocumentMatrix(spam_training)
This is the term frequency list for the training set. The classification depends on the ham and spam term frequency lists from the training set.
ham_freq <- rowSums(as.matrix(ham_tdm))
spam_freq <- rowSums(as.matrix(spam_tdm))
ham_termFreq <- cbind(names(ham_freq),ham_freq)
spam_termFreq <- cbind(names(spam_freq), spam_freq)
rownames(ham_termFreq) <- NULL
rownames(spam_termFreq) <- NULL
colnames(ham_termFreq) <- c("term", "frequency")
colnames(spam_termFreq) <- c("term", "frequency")
ham_termFreq <- data.frame(ham_termFreq, stringsAsFactors = FALSE)
spam_termFreq <- data.frame(spam_termFreq, stringsAsFactors = FALSE)
ham_termFreq$frequency <- as.numeric(ham_termFreq$frequency)
spam_termFreq$frequency <- as.numeric(spam_termFreq$frequency)
#head(ham_termFreq,10)
#head(spam_termFreq,10)
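For a quick look at the most frequent terms in each class, the frequency lists can be sorted, for example with dplyr (loaded above):

#Ten most frequent terms in the ham and spam training corpora
head(dplyr::arrange(ham_termFreq, desc(frequency)), 10)
head(dplyr::arrange(spam_termFreq, desc(frequency)), 10)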
Training set term frequency list

To calculate the conditional probability of the terms, Laplace smoothing was used with k = 0.5. For the training set, k values of 0.1, 0.5, 1.5, and 2 were investigated, and the classification was better (particularly for spam) when k = 0.5.
ham.UNKNOWN and spam.UNKNOWN are conditional probabilities used when the term is not found in the training set.
The marginal probabilities for ham and spam are based on the proportions of ham and spam emails in the entire training set.
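Concretely, the smoothed conditional probability computed below is

$$P(\mathrm{term} \mid \mathrm{ham}) = \frac{\mathrm{count}(\mathrm{term},\ \mathrm{ham}) + k}{N_{\mathrm{ham}} + k\,|V|}$$

where $N_{\mathrm{ham}}$ is the total count of term occurrences in the ham training set and $|V|$ is the number of distinct terms across ham and spam combined; the spam case is analogous.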
ham_prob <- length(ham_training)/ (length(ham_training) + length(spam_training))
spam_prob <- length(spam_training)/ (length(ham_training) + length(spam_training))
#Ham and spam marginal probability
ham_prob
## [1] 0.6463137
spam_prob
## [1] 0.3536863
#Laplace smoothing
k <- 0.5
#Total count of term occurrences in ham and spam
ham_N <- colSums(as.matrix(ham_termFreq$frequency))
spam_N <- colSums(as.matrix(spam_termFreq$frequency))
#Number of terms in ham
nrow(ham_termFreq)
## [1] 22238
#Number of terms in spam
nrow(spam_termFreq)
## [1] 19400
#Number of shared terms between ham and spam
nrow(as.data.frame(dplyr::intersect(ham_termFreq$term, spam_termFreq$term)))
## [1] 8073
#Number of distinct vocabulary on both ham and spam
spamham_vocabulary <- nrow(as.data.frame(dplyr::union(ham_termFreq$term, spam_termFreq$term)))
spamham_vocabulary
## [1] 33565
ham_termFreq$probability <- (ham_termFreq$frequency + k)/(ham_N + k*spamham_vocabulary)
spam_termFreq$probability <- (spam_termFreq$frequency + k)/(spam_N + k*spamham_vocabulary)
ham.UNKNOWN <- (0 + k)/(ham_N + k*spamham_vocabulary)
spam.UNKNOWN <- (0 + k)/(spam_N + k*spamham_vocabulary)
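As a sanity check, each class distribution should sum to one over the full shared vocabulary once the unseen terms are assigned the UNKNOWN probability (a quick check using the objects defined above):

#Both values should be (approximately) 1
sum(ham_termFreq$probability) + (spamham_vocabulary - nrow(ham_termFreq)) * ham.UNKNOWN
sum(spam_termFreq$probability) + (spamham_vocabulary - nrow(spam_termFreq)) * spam.UNKNOWN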
Training - ham term frequency and term conditional probability
Training - spam term frequency and term conditional probability

Hold off and testing sets

This is the document term matrix (dtm) for the hold off and testing sets.
#HOLD OFF SET
ham_holdoff_dtm <- DocumentTermMatrix(ham_holdoff)
spam_holdoff_dtm <- DocumentTermMatrix(spam_holdoff)
#TESTING SET
ham_testing_dtm <- DocumentTermMatrix(ham_testing)
spam_testing_dtm <- DocumentTermMatrix(spam_testing)
hard_ham_testing_dtm <- DocumentTermMatrix(hard_ham_testing)
#CONVERT AS MATRIX
ham_holdoff_dtm <- as.data.frame(as.matrix(ham_holdoff_dtm))
spam_holdoff_dtm <- as.data.frame(as.matrix(spam_holdoff_dtm))
ham_testing_dtm <- as.data.frame(as.matrix(ham_testing_dtm))
spam_testing_dtm <- as.data.frame(as.matrix(spam_testing_dtm))
hard_ham_testing_dtm <- as.data.frame(as.matrix(hard_ham_testing_dtm))
Ham and spam scores

The terms in each dtm will be used to generate a ham.score and a spam.score for each document. The score contribution of each term is based on the frequency of that term in the training set. If a term is not present in the training set, the conditional probabilities ham.UNKNOWN and spam.UNKNOWN are used for that term when generating the ham.score and spam.score.
##Get conditional probability of a term in the ham training set.
get.ham.probability <-function(term){
result <- ham_termFreq$probability[ham_termFreq$term == term]
if(length(result)==0){return(ham.UNKNOWN)}
return(result)
}
##Get conditional probability of a term in the spam training set.
get.spam.probability <- function(term){
result <- spam_termFreq$probability[spam_termFreq$term == term]
if(length(result)==0){return(spam.UNKNOWN)}
return(result)
}
##Return classification of ham or spam based on maximum score.
get.spamham <- function(ham.score, spam.score){
if(ham.score == spam.score){return("same score")}
if(ham.score > spam.score){return("ham")}
return("spam")
}
##Classify a document as either spam or ham.
doc.classify <- function(dtm, i){
doc <- t(dtm[i,1:ncol(dtm)]) #transpose
doc <- as.data.frame(doc)
colnames(doc) <- c("count")
doc_result <- NULL
doc_result$terms <- rownames(doc)[which(doc$count > 0)]
doc_result$count <- doc[which(doc$count > 0),1]
doc_result <- as.data.frame(doc_result)
doc_result$ham.probability.log <-
log(sapply(doc_result$terms, get.ham.probability))* doc_result$count
doc_result$spam.probability.log <-
log(sapply(doc_result$terms, get.spam.probability))* doc_result$count
doc_ham.score <- colSums(as.data.frame(doc_result$ham.probability.log)) + log(ham_prob)
doc_spam.score <- colSums(as.data.frame(doc_result$spam.probability.log)) + log(spam_prob)
doc_class <- get.spamham(doc_ham.score, doc_spam.score)
return(c(doc_ham.score, doc_spam.score, doc_class))
}
##Classify all the documents in the dtm as either spam or ham.
dtm.classify <- function(dtm){
docs_classification <- as.data.frame(NULL)
for(i in 1:nrow(dtm)){
#for(i in 1:100){
result <- doc.classify(dtm,i)
docs_classification[i,1] <- result[1]
docs_classification[i,2] <- result[2]
docs_classification[i,3] <- result[3]
}
colnames(docs_classification) <- c("ham.score", "spam.score", "classification")
return(docs_classification)
}
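As an illustrative usage example, a single email can be classified once the helpers above are defined (the index 1 here is arbitrary):

#Classify the first email in the ham hold off set;
#returns the ham score, the spam score, and the assigned class
doc.classify(ham_holdoff_dtm, 1)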
Ham or spam

For each document, a ham.score and a spam.score are calculated.
ham.score is calculated based on log(P(ham) x P(term_1|ham) x ... x P(term_n|ham)).
spam.score is calculated based on log(P(spam) x P(term_1|spam) x ... x P(term_n|spam)).
The log is used to generate the scores because multiplying the probabilities of all the terms present in a document produces a very small number that is very close to zero and would underflow.
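In symbols, for a document $d$ in which term $t$ occurs $n_t(d)$ times, the scores computed by doc.classify() are

$$\mathrm{ham.score}(d) = \log P(\mathrm{ham}) + \sum_{t \in d} n_t(d)\,\log P(t \mid \mathrm{ham})$$

with spam.score defined analogously using the spam probabilities.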
#HOLD OFF SET
holdoff_ham_classification <- dtm.classify(ham_holdoff_dtm)
holdoff_spam_classification <- dtm.classify(spam_holdoff_dtm)
#TESTING SET
testing_ham_classification <- dtm.classify(ham_testing_dtm)
testing_spam_classification <- dtm.classify(spam_testing_dtm)
testing_hard_ham_classification <- dtm.classify(hard_ham_testing_dtm)
ham.score and spam.score

This is a preview of the first 100 emails in the ham hold off set.
datatable(holdoff_ham_classification[1:100,])
This prepares a data frame that summarizes the classification results.
get.Info <- function(label, classification){
#classification <- holdoff_ham_classification
c(
label,
nrow(classification),
nrow(classification[classification$classification=="ham",]),
nrow(classification[classification$classification=="spam",]),
round(nrow(classification[classification$classification=="ham",])/
nrow(classification),2),
round(nrow(classification[classification$classification=="spam",])/
nrow(classification),2)
)
}
classification_results <-
rbind(
get.Info("Hold off - Easy Ham", holdoff_ham_classification),
get.Info("Hold off - Spam", holdoff_spam_classification),
get.Info("Testing - Easy Ham", testing_ham_classification),
get.Info("Testing - Spam", testing_spam_classification),
get.Info("Testing - Hard Ham",testing_hard_ham_classification)
)
colnames(classification_results) <-
c("Set", "Total Emails","Ham", "Spam", "Ham %", "Spam %")
For the hold off set, 99% of the ham emails were classified as ham, while only 77% of the spam emails were classified as spam.
For the testing set, 98% of the easy ham emails were classified as ham, while only 68% of the spam emails were classified as spam.
Because a hard ham set was available, I decided to run it against the classifier trained on the easy ham data. As expected, classification of the hard ham did very poorly: only 28% of the hard ham emails were classified as ham, and the rest were incorrectly classified as spam.
| Set | Total Emails | Ham | Spam | Ham % | Spam % |
|---|---|---|---|---|---|
| Hold off - Easy Ham | 700 | 696 | 4 | 0.99 | 0.01 |
| Hold off - Spam | 250 | 57 | 193 | 0.23 | 0.77 |
| Testing - Easy Ham | 701 | 690 | 11 | 0.98 | 0.02 |
| Testing - Spam | 251 | 81 | 170 | 0.32 | 0.68 |
| Testing - Hard Ham | 250 | 70 | 180 | 0.28 | 0.72 |
Spam and ham emails in the hold off set
Spam and ham emails in the testing set

The training set used the easy ham data, and the ham emails in this testing set are easy ham.
As you can see, classifying the hard ham emails with a classifier trained only on easy ham emails produced poor results: the majority of the hard ham emails were classified as spam.