Main Objective:
The main objective is to create an email text classifier using the ham and spam data from https://spamassassin.apache.org/old/publiccorpus/.
We are also expected to unzip the data both manually and programmatically.
Predict the class of new documents withheld from the example corpus. Then come up with a different set of documents to test.
Use the dictionary of common words.
Separate the message header from the message body.
Data
The first set of data pairs the 20030228_easy_ham_2 dataset with the 20050311_spam_2 dataset. Both were downloaded and unzipped manually on the local machine.
ham_files <- list.files(path='./20030228_easy_ham_2/easy_ham_2/',full.names = T)
spam_files <- list.files(path='./20050311_spam_2/spam_2/',full.names = T)
Cleaning the text and storing texts in a dataframe
The cleaning process happens twice in this application. Here I try to remove all HTML tags, punctuation, numbers and line breaks. I also try to remove the header by using the first blank line as the marker for the end of the header and the beginning of the body.
After processing each document line by line, I store the lines in two temporary files, one for the header data and the other for the body data. Once the document has been parsed, both the body and header data are added to a dataframe, with a ham or spam label to identify each document.
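library(readr)    # guess_encoding(), read_file()
library(stringr)  # str_replace_all()
# Replace anything that looks like an HTML tag with a space.
cleanFun <- function(htmlString) {
return(gsub("<.*?>", " ", htmlString))
}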
body_mx <-setNames(data.frame(matrix(ncol = 4, nrow = 0)),
c("doc_id","text","header","ham_spam"))
for(i in 1:length(ham_files)){
file.create('headers_file.txt')
file.create('body_file.txt')
enc <- guess_encoding(ham_files[i], n_max = -1, threshold = 0.2)
con = file(ham_files[i],encoding = enc$encoding[1])
empty_count <- 0
tmp_doc <- readLines(con, warn = FALSE)
tmp_doc <- gsub("<.*?>", "", tmp_doc)
for(line in 1:length(tmp_doc)) {
if(nchar(tmp_doc[line]) == 0){
empty_count <- empty_count + 1
}
if(empty_count == 0){
clean <-cleanFun(tmp_doc[line])
clean <- str_replace_all(clean,
"[[:punct:]]|[[:digit:]]|http\\S+\\s*|\\n|<|>|=|_|-|#|\\$|\\|"," ")
clean <- str_replace_all(clean,
"[\r\n]"," ")
clean1 <- gsub("\\s+"," ",clean)
write(clean1,file='headers_file.txt',append=TRUE)
}else{
clean <-cleanFun(tmp_doc[line])
clean <- str_replace_all(clean,
"[[:punct:]]|[[:digit:]]|http\\S+\\s*|\\n|<|>|=|_|-|#|\\$|\\|"," ")
clean <- str_replace_all(clean,
"[\r\n]"," ")
clean1 <- gsub("\\s+"," ",clean)
write(clean1,file='body_file.txt',append=TRUE)
}
}
headers_txt <- read_file('headers_file.txt')
body_txt <- read_file('body_file.txt')
body_mx[nrow(body_mx) + 1,] = c(ham_files[i], body_txt,headers_txt, 'ham')
file.remove('headers_file.txt')
file.remove('body_file.txt')
close(con)
}
for(i in 1:length(spam_files)){
file.create('headers_file.txt')
file.create('body_file.txt')
enc <- guess_encoding(spam_files[i], n_max = -1, threshold = 0.2)
con = file(spam_files[i],encoding = enc$encoding[1])
empty_count <- 0
tmp_doc <- readLines(con, warn = FALSE)
tmp_doc <- gsub("<.*?>", "", tmp_doc)
for(line in 1:length(tmp_doc)) {
if(nchar(tmp_doc[line]) == 0){
empty_count <- empty_count + 1
}
if(empty_count == 0){
clean <-cleanFun(tmp_doc[line])
clean <- str_replace_all(clean,
"[[:punct:]]|[[:digit:]]|http\\S+\\s*|\\n|<|>|=|_|-|#|\\$|\\|"," ")
clean <- str_replace_all(clean,
"[\r\n]"," ")
clean1 <- gsub("\\s+"," ",clean)
write(clean1,file='headers_file.txt',append=TRUE)
}else{
clean <-cleanFun(tmp_doc[line])
clean <- str_replace_all(clean,
"[[:punct:]]|[[:digit:]]|http\\S+\\s*|\\n|<|>|=|_|-|#|\\$|\\|"," ")
clean <- str_replace_all(clean,
"[\r\n]"," ")
clean1 <- gsub("\\s+"," ",clean)
write(clean1,file='body_file.txt',append=TRUE)
}
}
headers_txt <- read_file('headers_file.txt')
body_txt <- read_file('body_file.txt')
body_mx[nrow(body_mx) + 1,] = c(spam_files[i], body_txt,headers_txt,'spam')
file.remove('headers_file.txt')
file.remove('body_file.txt')
close(con)
}
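The ham and spam loops above differ only in the file list and the label, so they could share one helper. The following is a minimal sketch of that idea, not the code used for the results in this write-up: clean_lines(), split_email() and read_corpus() are hypothetical names, the split and cleaning are done in memory with base R instead of temporary files, and the encoding detection step is left out for brevity.

```r
# Hypothetical helpers (not part of the original analysis).

# Strip HTML tags, URLs, punctuation and digits, then collapse whitespace.
clean_lines <- function(x) {
  x <- gsub("<.*?>", " ", x)                 # HTML tags
  x <- gsub("http\\S+", " ", x)              # URLs, before punctuation is removed
  x <- gsub("[[:punct:][:digit:]]", " ", x)  # punctuation and digits
  trimws(gsub("\\s+", " ", paste(x, collapse = " ")))
}

# Treat everything before the first blank line as header, the rest as body.
split_email <- function(lines) {
  blank <- which(!nzchar(lines))
  first_blank <- if (length(blank) > 0) blank[1] else length(lines) + 1
  in_header <- seq_along(lines) < first_blank
  list(header = clean_lines(lines[in_header]),
       body   = clean_lines(lines[!in_header]))
}

# Build one data frame for a set of files that share a label.
read_corpus <- function(files, label) {
  rows <- lapply(files, function(f) {
    parts <- split_email(readLines(f, warn = FALSE))
    data.frame(doc_id = f, text = parts$body, header = parts$header,
               ham_spam = label, stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

# Small in-memory example.
msg <- c("Subject: Hello 123", "From: <a@b.com>", "",
         "Visit http://x.com now!", "<b>Bye</b>")
parts <- split_email(msg)
parts$header
parts$body
```

With helpers like these, body_mx could be built as rbind(read_corpus(ham_files, "ham"), read_corpus(spam_files, "spam")).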
Quanteda
To turn the dataframe into a corpus I use the quanteda package. The documents are tokenized, cleaned further, stripped of English stopwords and stemmed. The tokens are then turned into a document feature matrix, which is cleaned once more. A random sample of 80% of the documents is drawn as the training set, the document feature matrix is subset into training and testing groups accordingly, and the training group is put into a naive Bayes text model.
Looking at the model summary we can see that the table shows a row of feature scores for ham (top) and for spam (bottom). Most of the values look negligible, being far below one, but when comparing the hams to the spams we can see that some features score \(10^2\) to \(10^3\) times higher in one class than in the other, for example exmh (1.666e-03 for ham against 2.664e-06 for spam). Differences of that size are what allow the model to separate the two classes.
library(quanteda)
library(quanteda.textmodels)
library(quanteda.textstats)
hammy_spammy <- corpus(body_mx, text_field = "text")
hammy_spammy$id_numeric <- 1:ndoc(hammy_spammy)
hs_tokens <- tokens(hammy_spammy,remove_punct = TRUE,remove_symbols = TRUE,
remove_numbers = TRUE,remove_url = TRUE)%>%
tokens_remove(pattern = stopwords("en"))%>%
tokens_wordstem()
set.seed(300)
# Sample 80% of the document ids for training (the corpus holds 2798 documents).
id_train <- sample(seq_len(ndoc(hammy_spammy)),.8*ndoc(hammy_spammy), replace = FALSE)
hs_dfm <- dfm(hs_tokens)
hs_dfm <- dfm_remove(hs_dfm, "\\b[a-zA-Z]\\b|nbsp|font", valuetype="regex")
hsdf_training <- dfm_subset(hs_dfm,id_numeric %in% id_train)
hsdf_testing <- dfm_subset(hs_dfm,!id_numeric %in% id_train)
hs_nb_model <- textmodel_nb(x = hsdf_training,y=hsdf_training$ham_spam)
summary(hs_nb_model)
##
## Call:
## textmodel_nb.dfm(x = hsdf_training, y = hsdf_training$ham_spam)
##
## Class Priors:
## (showing first 2 elements)
##  ham spam
##  0.5  0.5
##
## Estimated Feature Scores:
##           date       tue       aug     chris   garrigu   messag        id      hope    peopl     addit
## ham  0.0007501 5.120e-04 1.298e-03 4.042e-04 1.842e-04 0.002785 0.0007276 0.0003998 0.001936 0.0002695
## spam 0.0003436 2.664e-06 7.991e-06 5.327e-06 2.664e-06 0.001521 0.0002877 0.0001891 0.001659 0.0002051
##        sequenc     notic      pure    cosmet     chang      well     first      exmh    latest      one
## ham  5.704e-04 0.0003189 7.187e-05 1.797e-05 0.0018550 0.0014508 0.0009208 1.666e-03 1.213e-04 0.003539
## spam 1.598e-05 0.0004022 3.729e-05 1.065e-05 0.0005993 0.0004715 0.0010388 2.664e-06 5.594e-05 0.002456
##          start      get      can      read     flist totalcount    unseen   element     array    execut
## ham  0.0008175 0.003405 0.004788 0.0007456 7.636e-05  3.144e-05 3.099e-04 5.839e-05 1.078e-04 0.0003369
## spam 0.0009376 0.002951 0.003748 0.0005540 2.664e-06  2.664e-06 2.664e-06 2.664e-06 1.598e-05 0.0001172
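One way to make the ham versus spam comparison concrete is to divide each ham score by the matching spam score. The sketch below does this for four features, with the numbers copied by hand from the summary output above; the variable names are illustrative and not part of the original analysis.

```r
# Scores copied from the summary output above (illustrative subset).
ham  <- c(tue = 5.120e-04, exmh = 1.666e-03, one = 0.003539, get = 0.003405)
spam <- c(tue = 2.664e-06, exmh = 2.664e-06, one = 0.002456, get = 0.002951)

ratio <- ham / spam      # how many times higher the score is for ham
round(ratio, 1)
round(log10(ratio), 2)   # difference expressed in orders of magnitude
```

Features such as tue and exmh come out hundreds of times higher for ham, while one and get have ratios close to 1 and so say little about the class.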
Classification
Running the naive Bayes prediction on the test set shows that the model is quite accurate: 276 hams were correctly labeled ham and only one ham was incorrectly labeled spam. Of the spam messages, 8 were classified as ham and 275 were correctly classified as spam.
hs_nb_model_matched <- dfm_match(hsdf_testing,features=featnames(hsdf_training))
actual_class <-hs_nb_model_matched$ham_spam
predicted_class <- predict(hs_nb_model,newdata = hs_nb_model_matched)
tab_class <- table(actual_class,predicted_class)
tab_class
##             predicted_class
## actual_class ham spam
##   ham        276    1
##   spam         8  275
Confusion Matrix
The confusion matrix is used to assess the performance of a classification model. We can see that it reports an accuracy of 98.39% and a 95% confidence interval of 96.97% to 99.26%.
With ham as the positive class, the pos pred value is the share of messages predicted to be ham that really are ham, while the neg pred value is the share of messages predicted to be spam that really are spam.
Precision is the number of true positives divided by the sum of the true positives and the false positives.
Recall is similar to precision, except that the denominator uses the false negatives: it is the true positives divided by the sum of the true positives and the false negatives.
The F1 score is precision multiplied by recall, divided by the sum of precision and recall, all multiplied by 2. In other words, it is the harmonic mean of precision and recall.
Because the F1 score combines precision and recall in a single number, it is a useful companion to accuracy. Here it is about 98.4%, essentially the same as the accuracy.
# confusionMatrix() comes from the caret package, which was not loaded in the original run.
# It expects predictions in rows and actual classes in columns, so tab_class is transposed.
caret::confusionMatrix(t(tab_class),mode='everything')
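The definitions of precision, recall and F1 can be checked by hand against the counts in the tab_class output. The following is a small sketch under that assumption: the counts are retyped from the table (rows are actual classes, columns are predicted classes), and class_metrics() is a hypothetical helper, not a function from any package.

```r
# Counts copied from the tab_class output (rows = actual, columns = predicted).
tab <- matrix(c(276,   1,
                  8, 275),
              nrow = 2, byrow = TRUE,
              dimnames = list(actual = c("ham", "spam"),
                              predicted = c("ham", "spam")))

# Hypothetical helper: metrics for one positive class.
class_metrics <- function(tab, positive = "ham") {
  tp <- tab[positive, positive]
  fp <- sum(tab[, positive]) - tp   # predicted positive, actually another class
  fn <- sum(tab[positive, ]) - tp   # actually positive, predicted another class
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  c(accuracy  = sum(diag(tab)) / sum(tab),
    precision = precision,
    recall    = recall,
    f1        = 2 * precision * recall / (precision + recall))
}

m <- class_metrics(tab)
round(m, 4)
```

Calling class_metrics(tab, positive = "spam") gives the same accuracy but the precision, recall and F1 for the spam class.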
Downloading the Second dataset programmatically
To download and extract the archives from the spamassassin.apache.org website programmatically, I used the bunzip2 function from the R.utils package together with the untar function.
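The download code itself is not shown in this write-up. A minimal sketch of the approach described, assuming R.utils is installed, could look like the following. The helper names, the destination directory and the archive file name in the final comment are assumptions for illustration, not taken from the original.

```r
# Hypothetical helpers; R.utils must be installed for bunzip2().
corpus_url <- function(archive,
                       base_url = "https://spamassassin.apache.org/old/publiccorpus/") {
  paste0(base_url, archive)
}

fetch_corpus <- function(archive, dest_dir = "data") {
  dir.create(dest_dir, showWarnings = FALSE, recursive = TRUE)
  bz2_path <- file.path(dest_dir, archive)
  tar_path <- sub("[.]bz2$", "", bz2_path)
  # mode = "wb" keeps the binary archive intact on Windows.
  download.file(corpus_url(archive), destfile = bz2_path, mode = "wb")
  # Decompress the .bz2 into a plain .tar, then unpack the .tar.
  R.utils::bunzip2(bz2_path, destname = tar_path, remove = FALSE, overwrite = TRUE)
  untar(tar_path, exdir = dest_dir)
  invisible(dest_dir)
}

# Example call (archive name assumed; check the corpus index for the exact file):
# fetch_corpus("20030228_hard_ham.tar.bz2")
```

The extracted files would then presumably be parsed into the hard_body_mx dataframe used below, with the same cleaning steps as for body_mx.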
hard_hammy_spammy <- corpus(hard_body_mx, text_field = "text")
hard_hammy_spammy$id_numeric <- 1:ndoc(hard_hammy_spammy)
hard_hs_tokens <- tokens(hard_hammy_spammy,remove_punct = TRUE,remove_symbols = TRUE,
remove_numbers = TRUE,remove_url = TRUE)%>%
tokens_remove(pattern = stopwords("en"))%>%
tokens_wordstem()
set.seed(34823947)
# Sample 80% of the document ids for training (the corpus holds 1649 documents).
hard_id_train <- sample(seq_len(ndoc(hard_hammy_spammy)),.8*ndoc(hard_hammy_spammy), replace = FALSE)
hard_hs_dfm <- dfm(hard_hs_tokens)
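# Drop single letters and leftover HTML vocabulary. Note that the unanchored patterns
# match anywhere inside a feature, so "tr" or "com" also remove longer words containing them.
hard_hs_dfm <- dfm_remove(hard_hs_dfm, "\\b[a-zA-Z]\\b|nbsp|font|size|width|color|height|face|src|img|border|href|com|arial|mail|email|td|br|tr|align|tabl|center|san|serif", valuetype="regex")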
Eyeball of the textmodel summary
Looking at this summary we can see why this is the harder dataset to analyze. Whereas in the first dataset the ham and spam scores differed by factors of \(10^2\) to \(10^3\), here the two classes are much closer together: most ham scores are of the same order of magnitude as their spam counterparts. This makes the two classes much harder to tell apart.
hard_hsdf_training <-dfm_subset(hard_hs_dfm,id_numeric %in% hard_id_train)
hard_hsdf_testing <- dfm_subset(hard_hs_dfm,!id_numeric %in% hard_id_train)
hard_hs_nb_model <- textmodel_nb(x = hard_hsdf_training,y=hard_hsdf_training$ham_spam)
summary(hard_hs_nb_model)
##
## Call:
## textmodel_nb.dfm(x = hard_hsdf_training, y = hard_hsdf_training$ham_spam)
##
## Class Priors:
## (showing first 2 elements)
##  ham spam
##  0.5  0.5
##
## Estimated Feature Scores:
##         motlei      fool     tired   getting      mani    credit      card    offers       don      want
## ham  7.058e-06 1.200e-04 7.058e-06 2.117e-05 0.0005787 0.0003246 0.0006140 7.058e-06 0.0009175 0.0009457
## spam 4.674e-06 2.804e-05 9.347e-06 2.804e-05 0.0009254 0.0013975 0.0008787 4.674e-06 0.0012152 0.0014769
##       offering      new      ones     three      main   bureaus     unite     state        ve      agre
## ham  7.058e-06 0.003543 7.058e-06 0.0003952 0.0001200 7.058e-06 0.0001835 0.0003176 0.0009245 0.0001694
## spam 1.402e-05 0.002248 9.347e-06 0.0003085 0.0002243 2.804e-05 0.0004206 0.0013694 0.0007992 0.0001355
##         someon   contact      one       ask       let     year   resolut      most  nightmar      hang
## ham  0.0002329 0.0004305 0.002054 0.0005505 0.0005787 0.001157 4.235e-05 3.529e-05 4.235e-05 3.529e-05
## spam 0.0002898 0.0008366 0.002902 0.0003038 0.0005141 0.001566 3.272e-05 2.337e-05 4.674e-06 5.608e-05
Harder model predictions
From the output we can see that this model is not nearly as accurate as the previous one. The model still labels almost every ham as ham (44 of 45), but it also classifies a lot of spam as ham: 51 of the 285 spam messages in the test set. So, if this were used in a real company, a lot of spam would still reach the inbox.
hard_hs_nb_model_matched <- dfm_match(hard_hsdf_testing,features=featnames(hard_hsdf_training))
hard_actual_class <-hard_hs_nb_model_matched$ham_spam
hard_predicted_class <- predict(hard_hs_nb_model,newdata = hard_hs_nb_model_matched)
hard_tab_class <- table(hard_actual_class,hard_predicted_class)
hard_tab_class
##                  hard_predicted_class
## hard_actual_class ham spam
##   ham              44    1
##   spam             51  234
Hard ham spam Confusion Matrix
At first the confusion matrix output doesn’t look that bad: from the table above, 278 of the 330 test messages are classified correctly, an accuracy of about 84%. On further inspection, though, the precision for ham is poor at about 46%, which means that of the 95 messages the model labeled ham, only 44 really were ham. More than half of the mail let through by this application would therefore be spam. Recall for ham is high at about 98% (44 of 45), but the low precision drags the F1 score down to about 63%.
# confusionMatrix() comes from the caret package, which was not loaded in the original run.
# It expects predictions in rows and actual classes in columns, so hard_tab_class is transposed.
caret::confusionMatrix(t(hard_tab_class),mode='everything')
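To see where the errors fall, the counts can be turned into row and column proportions. Below is a small sketch with the counts copied from the hard_tab_class output above; the object names are illustrative and not part of the original analysis.

```r
# Counts copied from the hard_tab_class output (rows = actual, columns = predicted).
hard_tab <- matrix(c(44,   1,
                     51, 234),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(actual = c("ham", "spam"),
                                   predicted = c("ham", "spam")))

by_actual    <- prop.table(hard_tab, margin = 1)  # each row sums to 1
by_predicted <- prop.table(hard_tab, margin = 2)  # each column sums to 1

round(by_actual, 3)     # ["spam", "ham"]: share of spam that slips through as ham
round(by_predicted, 3)  # ["ham", "ham"]: share of predicted ham that really is ham
```

Read row-wise, about 18% of the spam is let through as ham. Read column-wise, only about 46% of the messages labeled ham are actually ham.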
Conclusion
This major difference in outcomes tells me that the harder emails need to be inspected more thoroughly. From the look of the harder dataset, many of its messages were HTML documents, which could have had a major effect on the outcome.
Another option would be to run a classification on the header text as well.
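As a rough sketch of that last idea, the header column could be used as the document text in place of the body. The toy data frame below stands in for body_mx: it includes only the columns needed here and its contents are made up, so the output says nothing about real performance. The object names are assumptions for illustration.

```r
library(quanteda)
library(quanteda.textmodels)

# Toy stand-in for body_mx (made-up headers, four messages).
toy <- data.frame(
  doc_id   = paste0("msg", 1:4),
  header   = c("from list admin subject weekly meeting notes",
               "from list user subject re build error",
               "from promo subject free money offer now",
               "from promo subject win free prize now"),
  ham_spam = c("ham", "ham", "spam", "spam"),
  stringsAsFactors = FALSE
)

# Use the header column, not the body, as the text of each document.
header_corpus <- corpus(toy, text_field = "header")
header_dfm    <- dfm(tokens(header_corpus))

header_model <- textmodel_nb(x = header_dfm, y = header_dfm$ham_spam)
header_pred  <- predict(header_model, newdata = header_dfm)
table(actual = header_dfm$ham_spam, predicted = header_pred)
```

On the real data the same train and test split used for the body text would apply, with dfm_match() aligning the test features to the training features before predicting.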