Introduction

The purpose of this project is to get our feet wet in document classification. One application of document classification is identifying “spam” and “ham”. Spam is “any kind of unwanted, unsolicited digital communication, often an email, that gets sent out in bulk.”¹ Ham would be the opposite of spam and represent necessary and/or wanted digital communications. Spam can potentially contain malicious code and consume space on email servers.

This project will employ a spam/ham dataset to train a model. This model will then be run to make predictions of a new dataset to determine spam/ham.

These will be the libraries used for this project. I wanted to use quanteda package for this project. It is package for analyzing text documents and I was interested in it since Project 3. More information can be found here: https://quanteda.io/.

library(readtext)
library(RColorBrewer)
library(ggplot2)
library(rvest)
library(stringr)
library(summarytools)
library(tidytext)
library(tidyverse)
library(quanteda)
library(quanteda.textmodels)
library(caret)

# Needed for the Confusion Matrix
library(e1071)

# Random Forest method
library(randomForest)

library(tm)

Import and Tidy the Data

The data to train the model is from https://spamassassin.apache.org/old/publiccorpus/. The files 20030228_spam.tar.bz2, 20030228_spam_2.tar.bz2, and 20050311_spam_2.tar.bz2 will be used. These files were uncompressed and the contents were moved to a single file. In addition, the cmds file located in each archive file was removed. The complete file with the combined emails is here: https://github.com/logicalschema/DATA607/raw/master/Project%204/spamemails.tgz. Another file https://github.com/logicalschema/DATA607/raw/master/Project%204/hamemails.tgz contains the emails from 20030228_easy_ham.tar.bz2, 20030228_easy_ham_2.tar.bz2, and 20030228_hard_ham.tar.bz2 with the cmds files removed.

This is an import of the tar files.

# This function will strip html tags from text. This uses the rvest package
# https://stackoverflow.com/questions/17227294/removing-html-tags-from-a-string-in-r
strip_html <- function(x) {
  return(gsub("<.*?>", " ", x))
}

# Function to clean troublesome strings from text
clean_text <- function(x) {
  
  # Strip out html tags
  x <- strip_html(x)

  # Remove hostnames: https://stackoverflow.com/questions/3809401/what-is-a-good-regular-expression-to-match-a-url
  # https?:\\/\\/(www\\.)?[-a-zA-Z0-9@:%._\\+~#=]{1,256}\\.[a-zA-Z0-9()]{1,6}\\b([-a-zA-Z0-9()@:%_\\+.~#?&//=]*)
  # The tokens function has an argument 'remove_url = TRUE' but this did not work so I stripped it here
  x <- str_replace_all(x, "https?:\\/\\/(www\\.)?[-a-zA-Z0-9@:%._\\+~#=]{1,256}\\.[a-zA-Z0-9()]{1,6}\\b([-a-zA-Z0-9()@:%_\\+.~#?&//=]*)", " ")

  # Remove Email addresses
  x <- str_replace_all(x, "\\S*@\\S*\\s?", " ")

  # Remove \n, \t, and &nbsp;
  x <- str_replace_all(x, "\n", " ")
  x <- str_replace_all(x, "\t", " ")
  x <- str_replace_all(x, "&nbsp;", " ")
  
  return(x)
  
}

# Clean Tokens: returns a DFM by tokenizing a corpus
clean_corpus <- function(x) {

# Tokenize the corpora and remove punctuation and symbols: https://quanteda.io/reference/tokens.html
temp_tokens <- tokens(x, remove_punct = TRUE, remove_symbols = TRUE, remove_numbers = TRUE, remove_separators = TRUE)

# Convert the tokens to lowercase
temp_tokens <- tokens_tolower(temp_tokens)

# Create the DFM: document-feature matrix: https://quanteda.io/reference/dfm.html
temp_dfm <- dfm(temp_tokens, remove = stopwords("english"))

return(temp_dfm)
}

# Clean Tokens: returns a DTM by tokenizing a corpus
clean_corpus_dtm <- function(x) {

corpus <- Corpus(VectorSource(x))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)

dtm <- DocumentTermMatrix(corpus)

dtm <- removeSparseTerms(dtm, 0.95)

return(dtm)
}

# The function importFiles imports a url of a zipped tar file into a variable. Creates a directory called temp
importFiles <- function(remoteURL = NULL){
  download.file(remoteURL, "temp.tgz")
  
  if (dir.exists("temp")) unlink("temp", recursive = TRUE)
  dir.create("temp")
  
  # Unzips and expands the archive of the file
  untar("temp.tgz", exdir = "temp")
  file.remove("temp.tgz")
  
  temp_data <- readtext("temp/complete/*")
  unlink("temp", recursive = TRUE)
  return(temp_data)
}

# Import the spam emails
url <- "https://github.com/logicalschema/DATA607/raw/master/Project%204/spamemails.tgz"

raw_spam_data <- importFiles(url)
raw_spam_data <- cbind(raw_spam_data, type = "spam")

# Import the ham emails
url <- "https://github.com/logicalschema/DATA607/raw/master/Project%204/hamemails.tgz"

raw_ham_data <- importFiles(url)
raw_ham_data <- cbind(raw_ham_data, type = "ham")

Let’s take a look at the data imported.

head(raw_spam_data)

head(raw_ham_data)

Before we continue, we will combine the email types and randomly choose a sample from the collection. This is also necessary for the classifier that will be used later.²

# Create a corpus for all the documents
combined <- rbind(raw_spam_data, raw_ham_data)
combined$text <- clean_text(combined$text) # Cleans the text



# Create a sample of 3000 documents
set.seed(2543)
sample <- combined[sample(nrow(combined), 3000), ] # Sample without replacement
sample_raw_spam_data <- subset(sample, type == "spam")
sample_raw_ham_data <- subset(sample, type == "ham")


# Test
test_raw_data <- setdiff(combined, sample)

There is not a consistent format for each of the files from SpamAssassin. There are different headers for the emails in different orders across the collection. In one email, the Content-Type might be declared and in another the email client used would be listed. Let’s convert documents we have to corpora.

# Store the variables as corpora
spam_corpus <- corpus(sample_raw_spam_data)
ham_corpus <- corpus(sample_raw_ham_data)
training_corpus <- corpus(sample)
testing_corpus <- corpus(test_raw_data)

Here are snippets from each of the corpora.

texts(spam_corpus)[10]

##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             00359.4ab70de20a198b736ed01940c9745384 
## "From   Wed Sep 18 11:50:27 2002 Return-Path:   Delivered-To:  Received: from localhost (jalapeno [127.0.0.1])  by zzzzason.org (Postfix) with ESMTP id 714A516F16  for  ; Wed, 18 Sep 2002 11:50:26 +0100 (IST) Received: from jalapeno [127.0.0.1]  by localhost with IMAP (fetchmail-5.9.0)  for  (single-drop); Wed, 18 Sep 2002 11:50:26 +0100 (IST) Received: from lugh.tuatha.org  [194.125.145.45]) by     dogma.slashnull.org (8.11.6/8.11.6) with ESMTP id g8I0C0C02158 for      ; Wed, 18 Sep 2002 01:12:00 +0100 Received: from lugh.tuatha.org  [127.0.0.1]) by     lugh.tuatha.org (8.9.3/8.9.3) with ESMTP id BAA32179; Wed, 18 Sep 2002     01:12:09 +0100 Received: from relay.dub-t3-1.nwcgroup.com      [195.129.80.16]) by lugh.tuatha.org     (8.9.3/8.9.3) with ESMTP id BAA32146 for  ; Wed,     18 Sep 2002 01:12:01 +0100 Received: from glb03 (unknown [81.18.33.83]) by     relay.dub-t3-1.nwcgroup.com (Postfix) with SMTP id 9F3D770044 for      ; Wed, 18 Sep 2002 01:11:58 +0100 (IST) From: \"Femi Daniel\"   To:   MIME-Version: 1.0 Content-Type: text/plain; charset=\"iso-8859-1\" Message-Id:   Subject: [ILUG-Social] HELLO Sender:  Errors-To:  X-Beenthere:  X-Mailman-Version: 2.0.11 Precedence: bulk List-Help:   List-Post:   List-Subscribe:  ,       List-Id: Irish Linux Users' Group social events   List-Unsubscribe:  ,       List-Archive:   X-Original-Date: Wed, 18 Sep 2002 01:11:52 Date: Wed, 18 Sep 2002 01:11:52  OFFICE OF:EGNR. FEMI DANIEL FEDERAL MINISTRY OF WORKS AND HOUSING FEDERAL SECRETARIAT OFFICE COMPLEX IKOYI-LAGOS.  ATTN:,                    First, I must solicit your strictest confidence in this transaction, this is by virtue of it's nature as being utterly confidential and top secret as you were introduced to me in confidence through the Nigeria Chamber of Commerce and Industries.  We are top officials from the Federal Ministry of Works and Housing,(FMW&H), Federal Ministry of Finance and the Presidency, making up the Contract Review Panel(CRP) set up by the Federal Government of Nigeria to review contracts awarded by the past military administration.   In the course of our work on the CRP, we discovered this fund which resulted from grossly over - invoiced contracts which were executed for the FMW&H during the last administration.The companies that executed the contracts have been duly paid and the contracts commissioned leaving the sum of US$21.4 Million floating in the escrow account of the Central Bank of Nigeria ready for payment.  I have therefore  been mandated as a matter of trust by my colleagues in the panel to look for an overseas partner to whom we could transfer the sum of US21.4 legally subcontracting the entitlement to your company.This is bearing in mind that our civil service code of conduct forbids us from owning foreign companies or operating foreign accounts while in government service, hence the need for an overseas partner.  We have agreed that the funds will be shared thus after it has been paid into your account. (1) 30% of the money will go to you for acting as the beneficiary of the fund. (2) 10% has been set aside as an abstract projection for reimbursement to both parties for incidental expences that may be incurred in the course of the transaction. (3) 60% to us the government officials (with which we intend to commence an importation business in conjunction with you)  All logistics are in place and all modalities worked out for the smooth conclusion of the transaction within ten to fourteen days of commencement after receipt of the following information; your full names, company's name, address,details & activities, telephone & fax numbers.These information will enable us make the applications and lodge claims to the concerned ministries & agencies in favour of your company and it is pertinent to state here that this transaction is entirely based on trust as the solar bank draft or certified cheque drawable in any of the Central Bank of Nigeria correspondent bankers around the world is going to be made in your name.  Please acknowledge the reciept of this letter using the above e-mail or the Alternative:  to reply me.   Yours faithfully,  Egnr. Femi Daniel.  NB:Bank Account Details not necessary as preferred mode of payment is by draft or cheque.    -   Irish Linux Users' Group Social Events:    for (un)subscription information. List maintainer:   "

texts(ham_corpus)[10]

##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 02361.68fa77659ad7611f8bcad48d030a3039 
## "From   Tue Oct  8 10:56:02 2002 Return-Path:   Delivered-To:  Received: from localhost (jalapeno [127.0.0.1])  by jmason.org (Postfix) with ESMTP id DB5DE16F16  for  ; Tue,  8 Oct 2002 10:56:01 +0100 (IST) Received: from jalapeno [127.0.0.1]  by localhost with IMAP (fetchmail-5.9.0)  for  (single-drop); Tue, 08 Oct 2002 10:56:01 +0100 (IST) Received: from dogma.slashnull.org (localhost [127.0.0.1]) by     dogma.slashnull.org (8.11.6/8.11.6) with ESMTP id g9880VK06078 for      ; Tue, 8 Oct 2002 09:00:31 +0100 Message-Id:   To:  From: boingboing   Subject: A&L Daily to be auctioned in bankruptcy Date: Tue, 08 Oct 2002 08:00:31 -0000 Content-Type: text/plain; encoding=utf-8  URL:   Date: Not supplied  Arts and Letters Daily, a wonderful and dense blog, has folded up its tent due  to the bankruptcy of its parent company. A&L Daily will be auctioned off by the  receivers. Link[1] Discuss[2] (_Thanks, Misha!_)  [1]   [2]    "

The files are imported into the corpus format but we need to do some tidying.

First, let’s clean text in our data and convert to DFMs.

# Used for the Word Cloud
spam_dfm <- clean_corpus(spam_corpus)
ham_dfm <- clean_corpus(ham_corpus)


# Used for the Bayes Classifier
training_dfm <- clean_corpus(training_corpus)
testing_dfm <- clean_corpus(testing_corpus)

# Used for the Random Forest Classifier
training_dtm <- clean_corpus_dtm(sample$text)
testing_dtm <- clean_corpus_dtm(test_raw_data$text)

Analysis

This section will go analyze the data we have imported and tidied.

Bayes Theorem

This page https://www.r-bloggers.com/text-message-classification/ has code for using a Naïve Bayes classifier for classifying messages in addition to the Quanteda page linked earlier. We will train the model and look at a summary of the model. This employs both the caret and quanteda packages.

# Train the Bayes classifier
tmodel_nb <- textmodel_nb(training_dfm, training_dfm$type)

summary(tmodel_nb)

## 
## Call:
## textmodel_nb.dfm(x = training_dfm, y = training_dfm$type)
## 
## Class Priors:
## (showing first 2 elements)
## spam  ham 
##  0.5  0.5 
## 
## Estimated Feature Scores:
##           tue      aug return-path delivered-to received localhost
## spam 0.002672 0.003265    0.002190     0.001614  0.01125  0.003918
## ham  0.004671 0.009379    0.003317     0.004376  0.01954  0.010085
##      phobos.labs.netnoteinc.com  postfix   esmtp       id cc1c54411f    -0400
## spam                   0.000605 0.001996 0.00545 0.008167  2.420e-06 0.002367
## ham                    0.001546 0.006035 0.01183 0.014453  3.278e-06 0.004057
##           edt   phobos      imap fetchmail-5.9.0 single-drop      ist
## spam 0.001321 0.000334 0.0009462        0.001290    0.001290 0.001738
## ham  0.002454 0.001224 0.0030814        0.003047    0.003047 0.004704
##      usw-sf-list2 sourceforge.net usw-sf-fw2 dogma.slashnull.org g72ml1v14021
## spam    7.502e-05       0.0001984  3.872e-05            0.001963    2.420e-06
## ham     7.212e-04       0.0018636  3.606e-04            0.004003    3.278e-06
##           fri usw-sf-list1-b.sourceforge.net      helo usw-sf-list1      exim
## spam 0.001861                      3.872e-05 0.0003533    7.502e-05 0.0001404
## ham  0.003817                      3.590e-04 0.0012965    7.163e-04 0.0012162
##      3.31-va-mm2       #1
## spam   7.986e-05 0.000317
## ham    7.212e-04 0.001152

“Naïve Bayes can only take features into consideration that occur both in the training set and the test set.”³ The function dfm_match() can use the training_dfm as a pattern. The cross-table matrix shows how the classifier did.

matching_dfm <- dfm_match(testing_dfm, features = featnames(training_dfm))

# Evaluating how the classifier did
actual_class <- matching_dfm$type
predicted_class <- predict(tmodel_nb, newdata = matching_dfm)
tab_class <- table(actual_class, predicted_class)
tab_class

##             predicted_class
## actual_class spam  ham
##         spam  819   85
##         ham    13 2130

For further analysis, we can construct a Confusion Matrix⁴ using the caret package. A Confusion Matrix is basically an error matrix.

confusionMatrix(tab_class, mode = "everything")

## Confusion Matrix and Statistics
## 
##             predicted_class
## actual_class spam  ham
##         spam  819   85
##         ham    13 2130
##                                           
##                Accuracy : 0.9678          
##                  95% CI : (0.9609, 0.9738)
##     No Information Rate : 0.7269          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9211          
##                                           
##  Mcnemar's Test P-Value : 7.387e-13       
##                                           
##             Sensitivity : 0.9844          
##             Specificity : 0.9616          
##          Pos Pred Value : 0.9060          
##          Neg Pred Value : 0.9939          
##               Precision : 0.9060          
##                  Recall : 0.9844          
##                      F1 : 0.9435          
##              Prevalence : 0.2731          
##          Detection Rate : 0.2688          
##    Detection Prevalence : 0.2967          
##       Balanced Accuracy : 0.9730          
##                                           
##        'Positive' Class : spam            
##

Random Forest

John DeBlase’s Rpub https://rpubs.com/bsnacks000/125281 and Chris Marshall’s article⁵ were the basis for this. We use the sample data from before to develop a Random Forest classifier. For more informatio about the Random Forest, you can see this article: https://en.wikipedia.org/wiki/Random_forest.

# Convert the training_dtm and testing_dtm to data frames
rf_df <- as.data.frame(as.matrix(training_dtm))
colnames(rf_df) <- make.names(colnames(rf_df))

rf_test <- as.data.frame(as.matrix(testing_dtm))
colnames(rf_test) <- make.names(colnames(rf_test))


# Add the type variable to the data frame
rf_df$rf_variable_spam <- sample$type
rf_test$rf_variable_spam <- test_raw_data$type


# Train the Random Forest Model
rf_model <- randomForest(rf_variable_spam~., data = rf_df)
summary(rf_model)

##                 Length Class  Mode     
## call               3   -none- call     
## type               1   -none- character
## predicted       3000   factor numeric  
## err.rate        1500   -none- numeric  
## confusion          6   -none- numeric  
## votes           6000   matrix numeric  
## oob.times       3000   -none- numeric  
## classes            2   -none- character
## importance       440   -none- numeric  
## importanceSD       0   -none- NULL     
## localImportance    0   -none- NULL     
## proximity          0   -none- NULL     
## ntree              1   -none- numeric  
## mtry               1   -none- numeric  
## forest            14   -none- list     
## y               3000   factor numeric  
## test               0   -none- NULL     
## inbag              0   -none- NULL     
## terms              3   terms  call

# Evaluating how the classifier did
prediction_rf_class <- predict(rf_model, type="prob")[,2] 


# Evaluate the performance of the random forest model on the training set
rf_tab <- table(rf_df$rf_variable_spam, prediction_rf_class > 0.5)
colnames(rf_tab) <- c("spam", "ham")
rf_tab

##       
##        spam  ham
##   spam  963   30
##   ham    36 1971

Word Clouds

Below is a look at the word clouds for the spam and ham data. Notice how localhost and domain names appear. In spam classifying, domains can be filtered out.

# Create a word cloud: https://quanteda.io/reference/textplot_wordcloud.html
spam.colors <- brewer.pal(9, "Paired")  

textplot_wordcloud(spam_dfm, min_count = 200, color = spam.colors)  
title("Spam Wordcloud", col.main = "black")

# Create a word cloud: https://quanteda.io/reference/textplot_wordcloud.html
ham.colors <- brewer.pal(9, "Paired")  

textplot_wordcloud(ham_dfm, min_count = 200, color = ham.colors)  
title("Ham Wordcloud", col.main = "black")

Conclusion

Witha cursory entry into classifiers, I found that it was easier to use the quanteda and caret packages to use the Naïve Bayes classifier for spam. However, the Random Forest Classifier can be tweaked to obtain a higher accuracy. Sophisticated spam filters can be employed to examine the source address, MAC addresses, and domains to quickly remove spam.

Data 607 Project 4

Sung Lee

4/24/2020