This project uses Naive Bayes to classify plain text documents as either spam or ham. This work is based on what I learned from a YouTube lecture by UC Berkeley Prof. Pieter Abbeel on Machine Learning: Naive Bayes (https://www.youtube.com/watch?v=DNvwfNEiKvw).
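For context, the decision rule behind this approach is the standard multinomial Naive Bayes rule (shown here only as a reference formula): a document is assigned the class

$$\hat{c} = \arg\max_{c \,\in\, \{\text{ham},\,\text{spam}\}} P(c) \prod_{i=1}^{n} P(\text{term}_i \mid c)$$

where the class prior P(c) and the term conditional probabilities are estimated from the training set, as described below.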
The tm library is used to process the corpus.
Three sets of data were used. The training set was used to train the classifier. The hold off set was used to test the classification with different parameters in order to improve its accuracy. Finally, the testing set was used to evaluate the classification once the parameters were set.
Source of data: http://spamassassin.apache.org/old/publiccorpus/
Training set:
Hold off set:
Testing set:
library(tm)
library(dplyr)
library(ggplot2)
library(SnowballC)
library(knitr)
library(stringr)
library(DT)
spam and ham corpus

The tm library is used to load the files. The code below loads the spam-ham training set, hold off set, and testing set.
#SETS
ham_training_folder <- "./spamham/easy_ham_training_p1"
spam_training_folder <- "./spamham/spam_2_training_p2"
ham_testing_folder <- "./spamham/easy_ham_2_TESTING"
spam_testing_folder <- "./spamham/spam_TESTING"
ham_holdoff_folder <- "./spamham/easy_ham_2_HOLDOFF"
spam_holdoff_folder <- "./spamham/spam_HOLDOFF"
hard_ham_testing_folder <- "./spamham/hard_ham_TESTING"
#LOAD FILES
ham_training <- VCorpus(DirSource(ham_training_folder))
spam_training <- VCorpus(DirSource(spam_training_folder))
ham_testing <- VCorpus(DirSource(ham_testing_folder))
spam_testing <- VCorpus(DirSource(spam_testing_folder))
ham_holdoff <- VCorpus(DirSource(ham_holdoff_folder))
spam_holdoff <- VCorpus(DirSource(spam_holdoff_folder))
hard_ham_testing <- VCorpus(DirSource(hard_ham_testing_folder))
#summary(ham_training)
#summary(spam_training)
This table lists the number of emails in the ham and spam files in each set; a quick check against the loaded corpora follows the table.
File | No. of Emails |
---|---|
Training Set: Easy Ham | 2551 |
Training Set: Spam | 1396 |
Hold Off Set: Easy Ham | 700 |
Hold Off Set: Spam | 250 |
Testing Set: Easy Ham | 701 |
Testing Set: Spam | 251 |
Testing Set: Hard Ham | 250 |
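These counts can be verified directly from the loaded corpora, since length() on a VCorpus returns the number of documents it contains:

#Number of emails in each corpus
length(ham_training)
length(spam_training)
length(ham_holdoff)
length(spam_holdoff)
length(ham_testing)
length(spam_testing)
length(hard_ham_testing)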
This simplistic approach to text classification focuses only on English alphabetic characters, and the case of the characters is ignored: text is converted to lowercase. Punctuation is also removed before the term document matrix and document term matrix are generated.
Other features that might improve the accuracy of the classification are not considered, such as whether the message is formatted in HTML, whether misspelled words are present, or whether the message contains words in all uppercase characters, among other things.
Removing English stop words decreased the observed accuracy of the classification on the training set. In addition, stemming the words also decreased the observed accuracy on the training set.

Below are the observed classification results on the hold off set:

No removal of stop words or stemming (k = 0.5)
Stop words removed (k = 0.5)
Stemming of words used (k = 0.5)
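For reference, the stop word and stemming variants correspond to the two tm transformations that are left commented out in extract.alpha further below. A minimal sketch of the two variants (applied here to ham_training purely as an example, with the tm and SnowballC libraries already loaded above) would look like:

#Variant: remove English stop words (example object name is illustrative only)
nostop_example <- tm_map(ham_training, removeWords, stopwords("english"))
#Variant: stem words with the Porter stemmer via SnowballC (example object name is illustrative only)
stemmed_example <- tm_map(ham_training, stemDocument, language = "english")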
Characters in plain text documents are converted to lowercase.
##Convert the characters in each document to lowercase.
convert.lowercase <- function(doc){
doc <- tm_map(doc, tolower) #lowercase all characters
doc <- tm_map(doc, PlainTextDocument) #restore PlainTextDocument objects after the transformation
return(doc)
}
#TRAINING SET
ham_training <- convert.lowercase(ham_training)
spam_training <- convert.lowercase(spam_training)
#TESTING SET
ham_testing <- convert.lowercase(ham_testing)
hard_ham_testing <- convert.lowercase(hard_ham_testing)
spam_testing <- convert.lowercase(spam_testing)
#HOLD OFF SET
ham_holdoff <- convert.lowercase(ham_holdoff)
spam_holdoff <- convert.lowercase(spam_holdoff)
#inspect(ham_training[[1]])
This simplistic approach to text classification only looks at alphabetic characters.
##Keep only alphabetic tokens in each document.
extract.alpha <- function(docs){
#Match runs of lowercase letters, optionally followed by simple punctuation
thispattern = "[a-z]+ |[a-z]+[a-z]$|[a-z]+[\\.|,|;] "
for (j in seq(docs))
{
letters_only <- unlist(str_extract_all(docs[[j]], thispattern))
docs[[j]] <- paste(letters_only, collapse = " ")
}
#docs <- tm_map(docs, removeWords, stopwords("english"))
#docs <- tm_map(docs, stemDocument, language = "english")
docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs, PlainTextDocument)
docs <- tm_map(docs, removePunctuation)
#inspect(docs[[1]])
return(docs)
}
#TRAINING
ham_training <- extract.alpha(ham_training)
spam_training <- extract.alpha(spam_training)
#TESTING
ham_testing <- extract.alpha(ham_testing)
spam_testing <- extract.alpha(spam_testing)
hard_ham_testing <- extract.alpha(hard_ham_testing)
#HOLDOFF
ham_holdoff <- extract.alpha(ham_holdoff)
spam_holdoff <- extract.alpha(spam_holdoff)
#inspect(ham_training[[1]])
#inspect(spam_training[[1]])
#inspect(ham_testing[[1]])
#inspect(hard_ham_testing[[1]])
#inspect(spam_testing[[1]])
#inspect(ham_holdoff[[1]])
#inspect(spam_holdoff[[1]])
This is the training set term document matrix. It will be used to generate the training set term frequency list.
ham_tdm <- TermDocumentMatrix(ham_training)
spam_tdm <- TermDocumentMatrix(spam_training)
This is the term frequency list for the training set. Classification depends on the term frequency lists for ham and spam in the training set.
ham_freq <- rowSums(as.matrix(ham_tdm))
spam_freq <- rowSums(as.matrix(spam_tdm))
ham_termFreq <- cbind(names(ham_freq),ham_freq)
spam_termFreq <- cbind(names(spam_freq), spam_freq)
rownames(ham_termFreq) <- NULL
rownames(spam_termFreq) <- NULL
colnames(ham_termFreq) <- c("term", "frequency")
colnames(spam_termFreq) <- c("term", "frequency")
ham_termFreq <- data.frame(ham_termFreq, stringsAsFactors = FALSE)
spam_termFreq <- data.frame(spam_termFreq, stringsAsFactors = FALSE)
ham_termFreq$frequency <- as.numeric(ham_termFreq$frequency)
spam_termFreq$frequency <- as.numeric(spam_termFreq$frequency)
#head(ham_termFreq,10)
#head(spam_termFreq,10)
training set term frequency list

To calculate the conditional probability of the terms, Laplace smoothing was used. The value of k used is 0.5. For the training set, k values of 0.1, 0.5, 1.5, and 2 were investigated. The classification was better (particularly for spam) when k = 0.5.

ham.UNKNOWN and spam.UNKNOWN are the conditional probabilities used when a term is not found in the training set.

The marginal probabilities for ham and spam are based on the proportions of ham emails and spam emails in the entire training set.
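In symbols, the smoothed conditional probability computed in the code below is

$$P(\text{term} \mid \text{class}) = \frac{\text{count}(\text{term},\,\text{class}) + k}{N_{\text{class}} + k \cdot V}$$

where N_class is the total count of term occurrences in that class and V is the number of distinct terms in the combined ham and spam vocabulary; a count of zero gives the ham.UNKNOWN and spam.UNKNOWN values.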
ham_prob <- length(ham_training)/ (length(ham_training) + length(spam_training))
spam_prob <- length(spam_training)/ (length(ham_training) + length(spam_training))
#Ham and spam marginal probability
ham_prob
## [1] 0.6463137
spam_prob
## [1] 0.3536863
#Laplace smoothing
k <- 0.5
#Total count of term occurrences in ham and spam
ham_N <- colSums(as.matrix(ham_termFreq$frequency))
spam_N <- colSums(as.matrix(spam_termFreq$frequency))
#Number of terms in ham
nrow(ham_termFreq)
## [1] 22238
#Number of terms in spam
nrow(spam_termFreq)
## [1] 19400
#Number of shared terms between ham and spam
nrow(as.data.frame(dplyr::intersect(ham_termFreq$term, spam_termFreq$term)))
## [1] 8073
#Number of distinct vocabulary on both ham and spam
spamham_vocabulary <- nrow(as.data.frame(dplyr::union(ham_termFreq$term, spam_termFreq$term)))
spamham_vocabulary
## [1] 33565
ham_termFreq$probability <- (ham_termFreq$frequency + k)/(ham_N + k*spamham_vocabulary)
spam_termFreq$probability <- (spam_termFreq$frequency + k)/(spam_N + k*spamham_vocabulary)
ham.UNKNOWN <- (0 + k)/(ham_N + k*spamham_vocabulary)
spam.UNKNOWN <- (0 + k)/(spam_N + k*spamham_vocabulary)
training - ham term frequency and term conditional probability

training - spam term frequency and term conditional probability

hold off and testing sets

This is the document term matrix (dtm) for the hold off and testing sets.
#HOLD OFF SET
ham_holdoff_dtm <- DocumentTermMatrix(ham_holdoff)
spam_holdoff_dtm <- DocumentTermMatrix(spam_holdoff)
#TESTING SET
ham_testing_dtm <- DocumentTermMatrix(ham_testing)
spam_testing_dtm <- DocumentTermMatrix(spam_testing)
hard_ham_testing_dtm <- DocumentTermMatrix(hard_ham_testing)
#CONVERT AS MATRIX
ham_holdoff_dtm <- as.data.frame(as.matrix(ham_holdoff_dtm))
spam_holdoff_dtm <- as.data.frame(as.matrix(spam_holdoff_dtm))
ham_testing_dtm <- as.data.frame(as.matrix(ham_testing_dtm))
spam_testing_dtm <- as.data.frame(as.matrix(spam_testing_dtm))
hard_ham_testing_dtm <- as.data.frame(as.matrix(hard_ham_testing_dtm))
ham and spam scores

The terms in the dtm are used to generate a ham.score and spam.score for each document. The score for each term is based on the frequency of that term in the training set. If a term is not present in the training set, the conditional probabilities ham.UNKNOWN and spam.UNKNOWN are assigned to the term when generating the ham.score and spam.score.
##Get conditional probability of a term in the ham training set.
get.ham.probability <-function(term){
result <- ham_termFreq$probability[ham_termFreq$term == term]
if(length(result)==0){return(ham.UNKNOWN)}
return(result)
}
##Get conditional probability of a term in the spam training set.
get.spam.probability <- function(term){
result <- spam_termFreq$probability[spam_termFreq$term == term]
if(length(result)==0){return(spam.UNKNOWN)}
return(result)
}
##Return classification of ham or spam based on maximum score.
get.spamham <- function(ham.score, spam.score){
if(ham.score == spam.score){return("same score")}
if(ham.score > spam.score){return("ham")}
return("spam")
}
##Classify a document as either spam or ham.
doc.classify <- function(dtm, i){
doc <- t(dtm[i,1:ncol(dtm)]) #transpose
doc <- as.data.frame(doc)
colnames(doc) <- c("count")
doc_result <- NULL
doc_result$terms <- rownames(doc)[which(doc$count > 0)]
doc_result$count <- doc[which(doc$count > 0),1]
doc_result <- as.data.frame(doc_result)
doc_result$ham.probability.log <-
log(sapply(doc_result$terms, get.ham.probability))* doc_result$count
doc_result$spam.probability.log <-
log(sapply(doc_result$terms, get.spam.probability))* doc_result$count
doc_ham.score <- colSums(as.data.frame(doc_result$ham.probability.log)) + log(ham_prob)
doc_spam.score <- colSums(as.data.frame(doc_result$spam.probability.log)) + log(spam_prob)
doc_class <- get.spamham(doc_ham.score, doc_spam.score)
return(c(doc_ham.score, doc_spam.score, doc_class))
}
##Classify all the documents in the dtm as either spam or ham.
dtm.classify <- function(dtm){
docs_classification <- as.data.frame(NULL)
for(i in 1:nrow(dtm)){
#for(i in 1:100){
result <- doc.classify(dtm,i)
docs_classification[i,1] <- result[1]
docs_classification[i,2] <- result[2]
docs_classification[i,3] <- result[3]
}
colnames(docs_classification) <- c("ham.score", "spam.score", "classification")
return(docs_classification)
}
ham or spam

For each document, a ham.score and spam.score are calculated.

ham.score is calculated as log(P(ham) x P(term_1|ham) x ... x P(term_n|ham)).

spam.score is calculated as log(P(spam) x P(term_1|spam) x ... x P(term_n|spam)).

The log is used to generate the scores because multiplying the probabilities of all the terms present in a document produces a very small number that is very close to zero.
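Taking the log turns the product into a sum, which is what doc.classify actually computes: with n_i the count of term i in the document,

$$\text{score}(\text{class}) = \log P(\text{class}) + \sum_{i} n_i \log P(\text{term}_i \mid \text{class})$$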
#HOLD OFF SET
holdoff_ham_classification <- dtm.classify(ham_holdoff_dtm)
holdoff_spam_classification <- dtm.classify(spam_holdoff_dtm)
#TESTING SET
testing_ham_classification <- dtm.classify(ham_testing_dtm)
testing_spam_classification <- dtm.classify(spam_testing_dtm)
testing_hard_ham_classification <- dtm.classify(hard_ham_testing_dtm)
ham.score and spam.score

This is a preview of the first 100 emails in the ham hold off set.
datatable(holdoff_ham_classification[1:100,])
This prepares a data frame that summarizes the classification results.
get.Info <- function(label, classification){
#classification <- holdoff_ham_classification
c(
label,
nrow(classification),
nrow(classification[classification$classification=="ham",]),
nrow(classification[classification$classification=="spam",]),
round(nrow(classification[classification$classification=="ham",])/
nrow(classification),2),
round(nrow(classification[classification$classification=="spam",])/
nrow(classification),2)
)
}
classification_results <-
rbind(
get.Info("Hold off - Easy Ham", holdoff_ham_classification),
get.Info("Hold off - Spam", holdoff_spam_classification),
get.Info("Testing - Easy Ham", testing_ham_classification),
get.Info("Testing - Spam", testing_spam_classification),
get.Info("Testing - Hard Ham",testing_hard_ham_classification)
)
colnames(classification_results) <-
c("Set", "Total Emails","Ham", "Spam", "Ham %", "Spam %")
For the hold off set, 99% of ham emails were classified as ham, while only 77% of spam emails were classified as spam.

For the testing set, 98% of the easy ham emails were classified as ham, while only 68% of spam emails were classified as spam.

Because a hard ham set is available, I also classified it with the model trained on the easy ham data. As expected, the classification of the hard ham did really poorly: only 28% of the hard ham emails were classified as ham, and the rest were incorrectly classified as spam.
Set | Total Emails | Ham | Spam | Ham % | Spam % |
---|---|---|---|---|---|
Hold off - Easy Ham | 700 | 696 | 4 | 0.99 | 0.01 |
Hold off - Spam | 250 | 57 | 193 | 0.23 | 0.77 |
Testing - Easy Ham | 701 | 690 | 11 | 0.98 | 0.02 |
Testing - Spam | 251 | 81 | 170 | 0.32 | 0.68 |
Testing - Hard Ham | 250 | 70 | 180 | 0.28 | 0.72 |
spam and ham emails in the hold off set

spam and ham emails in the testing set

The training set used the easy ham data. The ham emails in this testing set are classified as easy ham.

The training set used easy ham data. As you can see, classifying the hard ham emails with a training set that only included easy ham emails produced poor results. The majority of the hard ham emails were classified as spam.