Main Objective:

The main objective is to create an email text classifier using the ham and spam data from https://spamassassin.apache.org/old/publiccorpus/.

We are also expected to unzip the data both manually and programmatically.

Predict the class of new documents withheld from the example corpus. Then come up with a different set of documents to test.

Use the dictionary of common words.

Separate the message header from the message body.

Data

The first set of data is the 20030228_easy_ham_2 dataset, paired with the 20050311_spam_2 spam set. These were downloaded and unzipped manually on the local machine.

ham_files <- list.files(path='./20030228_easy_ham_2/easy_ham_2/',full.names = T)
spam_files <- list.files(path='./20050311_spam_2/spam_2/',full.names = T)

Cleaning the text and storing texts in a dataframe

The cleaning process happens twice in this application. Here I try to remove all HTML tags, punctuation, numbers and line breaks. I also try to separate the header from the body by using the first blank line as the marker for the end of the header and the beginning of the body.

After processing a document line by line I store the lines in two temporary files, one for the header data and the other for the body data. Once the document has been parsed, both the body and header data are added to a dataframe, along with a ham or spam label to identify each document.

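library(readr)    # guess_encoding(), read_file()
library(stringr)  # str_replace_all()

# Replace anything that looks like an HTML tag with a space
cleanFun <- function(htmlString) {
  return(gsub("<.*?>", " ", htmlString))
}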

body_mx <-setNames(data.frame(matrix(ncol = 4, nrow = 0)),
                   c("doc_id","text","header","ham_spam"))

# Parse every ham file: split header from body at the first blank line,
# clean each line, and append one row per message to body_mx
for(i in seq_along(ham_files)){

  file.create('headers_file.txt')
  file.create('body_file.txt')
  # Guess the file encoding so readLines() can decode the message
  enc <- guess_encoding(ham_files[i], n_max = -1, threshold = 0.2)
  con = file(ham_files[i],encoding = enc$encoding[1])

  # Number of blank lines seen so far; 0 means we are still in the header
  empty_count <- 0

  tmp_doc <- readLines(con, warn = FALSE)
  tmp_doc <- gsub("<.*?>", "", tmp_doc)
  for(line in seq_along(tmp_doc)) {

    if(nchar(tmp_doc[line]) == 0){
      empty_count <- empty_count + 1
      }
    
    if(empty_count == 0){
      # Header line: strip tags, then punctuation, digits, URLs and symbols
      clean <-cleanFun(tmp_doc[line])
      clean <- str_replace_all(clean,
                               "[[:punct:]]|[[:digit:]]|http\\S+\\s*|\\n|<|>|=|_|-|#|\\$|\\|"," ")
      clean <- str_replace_all(clean,
                               "[\r\n]"," ")
      clean1 <- gsub("\\s+"," ",clean)
      write(clean1,file='headers_file.txt',append=TRUE)
      
    }else{
      # Body line: same cleaning, written to the body file
      clean <-cleanFun(tmp_doc[line])
      clean <- str_replace_all(clean,
                               "[[:punct:]]|[[:digit:]]|http\\S+\\s*|\\n|<|>|=|_|-|#|\\$|\\|"," ")
      clean <- str_replace_all(clean,
                               "[\r\n]"," ")
      clean1 <- gsub("\\s+"," ",clean)
      write(clean1,file='body_file.txt',append=TRUE)
    }
  }
  # Read the cleaned header and body back in and store them as one row
  headers_txt <- read_file('headers_file.txt')
  body_txt <- read_file('body_file.txt')
  body_mx[nrow(body_mx) + 1,] = c(ham_files[i], body_txt,headers_txt, 'ham')
  file.remove('headers_file.txt')
  file.remove('body_file.txt')
  close(con)
}

# Parse every spam file the same way and label each row 'spam'
for(i in seq_along(spam_files)){

  file.create('headers_file.txt')
  file.create('body_file.txt')
  # Guess the file encoding so readLines() can decode the message
  enc <- guess_encoding(spam_files[i], n_max = -1, threshold = 0.2)
  con = file(spam_files[i],encoding = enc$encoding[1])

  # Number of blank lines seen so far; 0 means we are still in the header
  empty_count <- 0

  tmp_doc <- readLines(con, warn = FALSE)
  tmp_doc <- gsub("<.*?>", "", tmp_doc)
  for(line in seq_along(tmp_doc)) {

    if(nchar(tmp_doc[line]) == 0){
      empty_count <- empty_count + 1
      }
    
    if(empty_count == 0){
      # Header line: strip tags, then punctuation, digits, URLs and symbols
      clean <-cleanFun(tmp_doc[line])
      clean <- str_replace_all(clean,
                               "[[:punct:]]|[[:digit:]]|http\\S+\\s*|\\n|<|>|=|_|-|#|\\$|\\|"," ")
      clean <- str_replace_all(clean,
                               "[\r\n]"," ")
      clean1 <- gsub("\\s+"," ",clean)
      write(clean1,file='headers_file.txt',append=TRUE)
      
    }else{
      # Body line: same cleaning, written to the body file
      clean <-cleanFun(tmp_doc[line])
      clean <- str_replace_all(clean,
                               "[[:punct:]]|[[:digit:]]|http\\S+\\s*|\\n|<|>|=|_|-|#|\\$|\\|"," ")
      clean <- str_replace_all(clean,
                               "[\r\n]"," ")
      clean1 <- gsub("\\s+"," ",clean)
      write(clean1,file='body_file.txt',append=TRUE)
    }
  }
  # Read the cleaned header and body back in and store them as one row
  headers_txt <- read_file('headers_file.txt')
  body_txt <- read_file('body_file.txt')
  body_mx[nrow(body_mx) + 1,] = c(spam_files[i], body_txt,headers_txt,'spam')
  file.remove('headers_file.txt')
  file.remove('body_file.txt')
  close(con)
}
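
The ham and spam loops differ only in the file list and the label, and the split itself does not need temporary files. The following is a minimal base R sketch of the same idea, not the code used for the results in this report. The helper name split_and_clean is hypothetical, and its cleaning rules are a simplified stand-in for the regular expressions above.

```r
# Hypothetical helper (not part of the original analysis): split one message
# into header and body at the first blank line, then clean both parts in memory.
split_and_clean <- function(lines) {
  # Position of the first empty line; everything before it is the header.
  blank <- which(nchar(lines) == 0)
  first_blank <- if (length(blank) > 0) blank[1] else length(lines) + 1

  clean <- function(x) {
    x <- gsub("<[^>]*>", " ", x)                      # strip HTML tags
    x <- gsub("http\\S+", " ", x)                     # strip URLs before punctuation
    x <- gsub("[[:punct:][:digit:]]", " ", x)         # strip punctuation and digits
    x <- gsub("\\s+", " ", paste(x, collapse = " "))  # join lines, collapse whitespace
    trimws(x)
  }

  list(
    header = clean(lines[seq_len(first_blank - 1)]),
    body   = clean(lines[-seq_len(first_blank)])
  )
}

# Example with a made-up message
msg <- c("From: bob@example.com", "Subject: Hi 123", "",
         "<b>Hello</b> world!", "", "Visit http://x.y/z now")
parts <- split_and_clean(msg)
parts$header
parts$body
```

A function like this could be called once per file inside a single loop, with the label passed in as an argument, which would avoid maintaining two copies of the same code.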

Quanteda

To turn the dataframe into a corpus I use the quanteda package. After tokenizing the documents, they are further cleaned, English stopwords are removed, and the remaining words are stemmed. The tokens are then turned into a document feature matrix, which is cleaned further by dropping single letters and HTML leftovers such as nbsp and font. A random sample of 80% of the documents is drawn for the training set, the document feature matrix is subsetted into training and testing groups, and the training group is put into a naive Bayes text model.

Looking at the model summary, we can see that the table shows a value for each word stem, with ham on top and spam on the bottom. Most of the values are small, well below one, but when comparing ham to spam we can see that some values are \(10^2\) or \(10^3\) times bigger in one class than in the other, which is a large difference.

library(quanteda)
library(quanteda.textmodels)
library(quanteda.textstats)


hammy_spammy <- corpus(body_mx, text_field = "text")
hammy_spammy$id_numeric <- 1:ndoc(hammy_spammy)

hs_tokens <- tokens(hammy_spammy,remove_punct = T,remove_symbols = T,
                    remove_numbers = T,remove_url =T)%>%
  tokens_remove(pattern = stopwords("en"))%>%
  tokens_wordstem()

set.seed(300)
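# Draw 80% of the document ids for training (2798 documents in this corpus)
id_train <- sample(seq_len(ndoc(hammy_spammy)), floor(.8*ndoc(hammy_spammy)), replace = FALSE)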

hs_dfm <- dfm(hs_tokens)
hs_dfm <- dfm_remove(hs_dfm, "\\b[a-zA-Z]\\b|nbsp|font", valuetype="regex")

hsdf_training <- dfm_subset(hs_dfm,id_numeric %in% id_train)
hsdf_testing <- dfm_subset(hs_dfm,!id_numeric %in% id_train)
hs_nb_model <- textmodel_nb(x = hsdf_training,y=hsdf_training$ham_spam)

summary(hs_nb_model)
## 
## Call:
## textmodel_nb.dfm(x = hsdf_training, y = hsdf_training$ham_spam)
## 
## Class Priors:
## (showing first 2 elements)
##  ham spam 
##  0.5  0.5 
## 
## Estimated Feature Scores:
##           date       tue       aug     chris   garrigu   messag        id      hope    peopl     addit
## ham  0.0007501 5.120e-04 1.298e-03 4.042e-04 1.842e-04 0.002785 0.0007276 0.0003998 0.001936 0.0002695
## spam 0.0003436 2.664e-06 7.991e-06 5.327e-06 2.664e-06 0.001521 0.0002877 0.0001891 0.001659 0.0002051
##        sequenc     notic      pure    cosmet     chang      well     first      exmh    latest      one
## ham  5.704e-04 0.0003189 7.187e-05 1.797e-05 0.0018550 0.0014508 0.0009208 1.666e-03 1.213e-04 0.003539
## spam 1.598e-05 0.0004022 3.729e-05 1.065e-05 0.0005993 0.0004715 0.0010388 2.664e-06 5.594e-05 0.002456
##          start      get      can      read     flist totalcount    unseen   element     array    execut
## ham  0.0008175 0.003405 0.004788 0.0007456 7.636e-05  3.144e-05 3.099e-04 5.839e-05 1.078e-04 0.0003369
## spam 0.0009376 0.002951 0.003748 0.0005540 2.664e-06  2.664e-06 2.664e-06 2.664e-06 1.598e-05 0.0001172
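
To make the size of these differences concrete, the sketch below takes four of the feature scores printed above and computes the ham to spam ratio on a log10 scale. Only the numbers are taken from the summary; the choice of stems and the object names are mine.

```r
# Feature scores copied from the summary output above
scores <- rbind(
  ham  = c(date = 0.0007501, tue = 5.120e-04, exmh = 1.666e-03, get = 0.003405),
  spam = c(date = 0.0003436, tue = 2.664e-06, exmh = 2.664e-06, get = 0.002951)
)

# A ratio above 1 means the stem is more typical of ham than of spam
ratio <- scores["ham", ] / scores["spam", ]

# log10 of the ratio gives the difference in orders of magnitude
round(log10(ratio), 1)
```

Stems such as tue and exmh differ by more than two orders of magnitude, while common words such as date and get are within one order of magnitude of each other.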

Classification

Running the naive Bayes prediction on the test set shows that it is pretty accurate: 276 hams are correctly labeled ham and only one ham is incorrectly labeled spam. For the spam, 8 were classified as ham and 275 were correctly classified as spam.

hs_nb_model_matched <- dfm_match(hsdf_testing,features=featnames(hsdf_training))

actual_class <-hs_nb_model_matched$ham_spam
predicted_class <- predict(hs_nb_model,newdata = hs_nb_model_matched)
tab_class <- table(actual_class,predicted_class)
tab_class
##             predicted_class
## actual_class ham spam
##         ham  276    1
##         spam   8  275

Confusion Matrix

The confusion matrix is used to assess the performance of a classification model. It reports an accuracy of 98.39% with a 95% confidence interval of 96.97% to 99.26%.

The pos pred value is the share of messages predicted as the positive class, here ham, that really are ham, while the neg pred value is the same share for spam. Note that confusionMatrix reads the rows of a table as predictions and the columns as the reference, so a table built as table(actual, predicted) swaps these two values with sensitivity and specificity.

Precision is the true positives divided by the sum of the true positives and the false positives.

Recall is similar to precision, except that it is the true positives divided by the sum of the true positives and the false negatives.

The F1 score is two times precision multiplied by recall, divided by the sum of precision and recall.

The F1 score is the harmonic mean of precision and recall, and is often a more informative summary than accuracy when the classes are imbalanced. Here it comes to about 98.4%, in line with the accuracy.
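
As a check on these definitions, the sketch below recomputes the metrics by hand from the counts in the table in the Classification section, taking ham as the positive class. The variable names are mine; only the four counts come from the output.

```r
# Counts from the classification table, with ham as the positive class
tp <- 276  # ham predicted as ham
fn <- 1    # ham predicted as spam
fp <- 8    # spam predicted as ham
tn <- 275  # spam predicted as spam

precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)
accuracy  <- (tp + tn) / (tp + fn + fp + tn)

round(c(precision = precision, recall = recall, f1 = f1, accuracy = accuracy), 4)
```

Accuracy and F1 do not change if the table is transposed, but precision and recall trade places, so it matters which dimension holds the predictions.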

library(caret)  # provides confusionMatrix()
confusionMatrix(tab_class,mode='everything')

Downloading the Second dataset programmatically

To download and extract the archives from the spamassassin.apache.org website programmatically, I used the R.utils package for its bunzip2 function. With the bunzip2 and untar functions I was able to decompress and extract the downloaded files. The code below assumes the extracted messages have already been parsed into a dataframe named hard_body_mx with the same columns as body_mx.
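
The download step itself is not shown in this report. The following is a minimal sketch of how it could look with download.file, R.utils::bunzip2 and untar. The helper names archive_paths and fetch_and_extract, the data directory, and the archive name in the example call are assumptions for illustration, not the exact files or code used here.

```r
# Hypothetical helper: work out the local paths for a .tar.bz2 archive
archive_paths <- function(url, dest_dir) {
  bz2 <- file.path(dest_dir, basename(url))
  list(bz2 = bz2, tar = sub("\\.bz2$", "", bz2))
}

# Hypothetical helper: download, decompress and extract one archive
fetch_and_extract <- function(url, dest_dir = "data") {
  dir.create(dest_dir, showWarnings = FALSE, recursive = TRUE)
  p <- archive_paths(url, dest_dir)

  # Binary mode keeps the compressed archive intact on every platform
  download.file(url, destfile = p$bz2, mode = "wb")

  # Decompress the .bz2 layer, keeping the original archive on disk
  R.utils::bunzip2(p$bz2, destname = p$tar, overwrite = TRUE, remove = FALSE)

  # Unpack the tar file into the destination directory
  untar(p$tar, exdir = dest_dir)
  invisible(dest_dir)
}

# Example call, not run here: it needs network access and the R.utils package.
# The archive name is an assumption; substitute whichever archives you need.
# fetch_and_extract("https://spamassassin.apache.org/old/publiccorpus/20030228_hard_ham.tar.bz2")
```

After extraction, list.files can be pointed at the new directory in the same way as for the manually unzipped data.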

hard_hammy_spammy <- corpus(hard_body_mx, text_field = "text")
hard_hammy_spammy$id_numeric <- 1:ndoc(hard_hammy_spammy)

hard_hs_tokens <- tokens(hard_hammy_spammy,remove_punct = T,remove_symbols = T,
                    remove_numbers = T,remove_url =T)%>%
  tokens_remove(pattern = stopwords("en"))%>%
  tokens_wordstem()

set.seed(34823947)

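# Draw 80% of the document ids for training (1649 documents in this corpus)
hard_id_train <- sample(seq_len(ndoc(hard_hammy_spammy)), floor(.8*ndoc(hard_hammy_spammy)), replace = FALSE)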

hard_hs_dfm <- dfm(hard_hs_tokens)
hard_hs_dfm <- dfm_remove(hard_hs_dfm, "\\b[a-zA-Z]\\b|nbsp|font|size|width|color|height|face|src|img|border|href|com|arial|mail|email|td|br|tr|align|tabl|center|san|serif", valuetype="regex")

Eyeball of the textmodel summary

Looking at this summary we can see why this is the harder dataset to analyze. Whereas in the other set of data the ham and spam values differed by factors of \(10^2\) and \(10^3\), here they are much closer. Most values are on the same order of magnitude as their counterparts, which makes the two classes much harder to tell apart.

hard_hsdf_training <-dfm_subset(hard_hs_dfm,id_numeric %in% hard_id_train)
hard_hsdf_testing <- dfm_subset(hard_hs_dfm,!id_numeric %in% hard_id_train)
hard_hs_nb_model <- textmodel_nb(x = hard_hsdf_training,y=hard_hsdf_training$ham_spam)
summary(hard_hs_nb_model)
## 
## Call:
## textmodel_nb.dfm(x = hard_hsdf_training, y = hard_hsdf_training$ham_spam)
## 
## Class Priors:
## (showing first 2 elements)
##  ham spam 
##  0.5  0.5 
## 
## Estimated Feature Scores:
##         motlei      fool     tired   getting      mani    credit      card    offers       don      want
## ham  7.058e-06 1.200e-04 7.058e-06 2.117e-05 0.0005787 0.0003246 0.0006140 7.058e-06 0.0009175 0.0009457
## spam 4.674e-06 2.804e-05 9.347e-06 2.804e-05 0.0009254 0.0013975 0.0008787 4.674e-06 0.0012152 0.0014769
##       offering      new      ones     three      main   bureaus     unite     state        ve      agre
## ham  7.058e-06 0.003543 7.058e-06 0.0003952 0.0001200 7.058e-06 0.0001835 0.0003176 0.0009245 0.0001694
## spam 1.402e-05 0.002248 9.347e-06 0.0003085 0.0002243 2.804e-05 0.0004206 0.0013694 0.0007992 0.0001355
##         someon   contact      one       ask       let     year   resolut      most  nightmar      hang
## ham  0.0002329 0.0004305 0.002054 0.0005505 0.0005787 0.001157 4.235e-05 3.529e-05 4.235e-05 3.529e-05
## spam 0.0002898 0.0008366 0.002902 0.0003038 0.0005141 0.001566 3.272e-05 2.337e-05 4.674e-06 5.608e-05

Harder model predictions

From the output we can see that this model is not nearly as accurate as the previous one. The model still labels nearly every ham as ham (44 of 45), but it also classifies a lot of spam as ham (51 of 285). So, if this were used in a real company, a lot of spam would still reach people's inboxes.

hard_hs_nb_model_matched <- dfm_match(hard_hsdf_testing,features=featnames(hard_hsdf_training))

hard_actual_class <-hard_hs_nb_model_matched$ham_spam
hard_predicted_class <- predict(hard_hs_nb_model,newdata = hard_hs_nb_model_matched)
hard_tab_class <- table(hard_actual_class,hard_predicted_class)
hard_tab_class
##                  hard_predicted_class
## hard_actual_class ham spam
##              ham   44    1
##              spam  51  234
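
To read this table in both directions, the sketch below rebuilds it by hand and uses prop.table to express the counts as shares of each actual class (rows) and of each predicted class (columns). The object names are mine; the counts are copied from the output above.

```r
# Rebuild the table above; matrix() fills column by column
hard_tab <- matrix(c(44, 51, 1, 234), nrow = 2,
                   dimnames = list(actual = c("ham", "spam"),
                                   predicted = c("ham", "spam")))

# Share of each actual class that ended up in each predicted class
by_actual <- prop.table(hard_tab, margin = 1)

# Share of each predicted class that came from each actual class
by_predicted <- prop.table(hard_tab, margin = 2)

# Overall accuracy: correct predictions over all test documents
hard_accuracy <- sum(diag(hard_tab)) / sum(hard_tab)

round(by_actual, 3)
round(by_predicted, 3)
round(hard_accuracy, 3)
```

Read by row, almost all ham is kept and a bit under a fifth of the spam is let through. Read by column, fewer than half of the messages labeled ham are actually ham.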

Hard ham spam Confusion Matrix

At first the confusion matrix output doesn't look that bad, since its accuracy is 86%, but on closer inspection the value reported as recall is pretty bad at 48%. Because hard_tab_class was built as table(actual, predicted) and confusionMatrix reads rows as predictions, that value is really the share of messages labeled ham that are truly ham: in the table above, 44 of the 95 messages in the ham column. Roughly half of what this filter lets through as ham is therefore spam, so its users would find a lot of spam in their inboxes. Due to this low value the F1 score is also low, at 64%.

library(caret)  # provides confusionMatrix()
confusionMatrix(hard_tab_class,mode='everything')

Conclusion

This major difference in outcomes tells me that the harder emails need to be inspected more thoroughly. By the looks of the harder dataset, many of the messages were HTML documents, which could have had a major effect on the outcome.

Another thing that could have been tried is running a separate classification on the header text, which is already stored in the header column of the dataframe.