Assignment on RPubs
Rmd on Github
The purpose of this project is to get our feet wet in document classification. One application of document classification is identifying “spam” and “ham”. Spam is “any kind of unwanted, unsolicited digital communication, often an email, that gets sent out in bulk.”1 Ham would be the opposite of spam and represent necessary and/or wanted digital communications. Spam can potentially contain malicious code and consume space on email servers.
This project will employ a spam/ham dataset to train a model. This model will then be run to make predictions of a new dataset to determine spam/ham.
These will be the libraries used for this project. I wanted to use quanteda package for this project. It is package for analyzing text documents and I was interested in it since Project 3. More information can be found here: https://quanteda.io/.
library(readtext)
library(RColorBrewer)
library(ggplot2)
library(rvest)
library(stringr)
library(summarytools)
library(tidytext)
library(tidyverse)
library(quanteda)
library(quanteda.textmodels)
library(caret)
# Needed for the Confusion Matrix
library(e1071)
# Random Forest method
library(randomForest)
library(tm)
The data to train the model is from https://spamassassin.apache.org/old/publiccorpus/. The files 20030228_spam.tar.bz2, 20030228_spam_2.tar.bz2, and 20050311_spam_2.tar.bz2 will be used. These files were uncompressed and the contents were moved to a single file. In addition, the cmds file located in each archive file was removed. The complete file with the combined emails is here: https://github.com/logicalschema/DATA607/raw/master/Project%204/spamemails.tgz. Another file https://github.com/logicalschema/DATA607/raw/master/Project%204/hamemails.tgz contains the emails from 20030228_easy_ham.tar.bz2, 20030228_easy_ham_2.tar.bz2, and 20030228_hard_ham.tar.bz2 with the cmds files removed.
This is an import of the tar files.
# This function will strip html tags from text. This uses the rvest package
# https://stackoverflow.com/questions/17227294/removing-html-tags-from-a-string-in-r
strip_html <- function(x) {
return(gsub("<.*?>", " ", x))
}
# Function to clean troublesome strings from text
clean_text <- function(x) {
# Strip out html tags
x <- strip_html(x)
# Remove hostnames: https://stackoverflow.com/questions/3809401/what-is-a-good-regular-expression-to-match-a-url
# https?:\\/\\/(www\\.)?[-a-zA-Z0-9@:%._\\+~#=]{1,256}\\.[a-zA-Z0-9()]{1,6}\\b([-a-zA-Z0-9()@:%_\\+.~#?&//=]*)
# The tokens function has an argument 'remove_url = TRUE' but this did not work so I stripped it here
x <- str_replace_all(x, "https?:\\/\\/(www\\.)?[-a-zA-Z0-9@:%._\\+~#=]{1,256}\\.[a-zA-Z0-9()]{1,6}\\b([-a-zA-Z0-9()@:%_\\+.~#?&//=]*)", " ")
# Remove Email addresses
x <- str_replace_all(x, "\\S*@\\S*\\s?", " ")
# Remove \n, \t, and
x <- str_replace_all(x, "\n", " ")
x <- str_replace_all(x, "\t", " ")
x <- str_replace_all(x, " ", " ")
return(x)
}
# Clean Tokens: returns a DFM by tokenizing a corpus
clean_corpus <- function(x) {
# Tokenize the corpora and remove punctuation and symbols: https://quanteda.io/reference/tokens.html
temp_tokens <- tokens(x, remove_punct = TRUE, remove_symbols = TRUE, remove_numbers = TRUE, remove_separators = TRUE)
# Convert the tokens to lowercase
temp_tokens <- tokens_tolower(temp_tokens)
# Create the DFM: document-feature matrix: https://quanteda.io/reference/dfm.html
temp_dfm <- dfm(temp_tokens, remove = stopwords("english"))
return(temp_dfm)
}
# Clean Tokens: returns a DTM by tokenizing a corpus
clean_corpus_dtm <- function(x) {
corpus <- Corpus(VectorSource(x))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)
dtm <- DocumentTermMatrix(corpus)
dtm <- removeSparseTerms(dtm, 0.95)
return(dtm)
}
# The function importFiles imports a url of a zipped tar file into a variable. Creates a directory called temp
importFiles <- function(remoteURL = NULL){
download.file(remoteURL, "temp.tgz")
if (dir.exists("temp")) unlink("temp", recursive = TRUE)
dir.create("temp")
# Unzips and expands the archive of the file
untar("temp.tgz", exdir = "temp")
file.remove("temp.tgz")
temp_data <- readtext("temp/complete/*")
unlink("temp", recursive = TRUE)
return(temp_data)
}
# Import the spam emails
url <- "https://github.com/logicalschema/DATA607/raw/master/Project%204/spamemails.tgz"
raw_spam_data <- importFiles(url)
raw_spam_data <- cbind(raw_spam_data, type = "spam")
# Import the ham emails
url <- "https://github.com/logicalschema/DATA607/raw/master/Project%204/hamemails.tgz"
raw_ham_data <- importFiles(url)
raw_ham_data <- cbind(raw_ham_data, type = "ham")
Let’s take a look at the data imported.
head(raw_spam_data)
head(raw_ham_data)
Before we continue, we will combine the email types and randomly choose a sample from the collection. This is also necessary for the classifier that will be used later.2
# Create a corpus for all the documents
combined <- rbind(raw_spam_data, raw_ham_data)
combined$text <- clean_text(combined$text) # Cleans the text
# Create a sample of 3000 documents
set.seed(2543)
sample <- combined[sample(nrow(combined), 3000), ] # Sample without replacement
sample_raw_spam_data <- subset(sample, type == "spam")
sample_raw_ham_data <- subset(sample, type == "ham")
# Test
test_raw_data <- setdiff(combined, sample)
There is not a consistent format for each of the files from SpamAssassin. There are different headers for the emails in different orders across the collection. In one email, the Content-Type might be declared and in another the email client used would be listed. Let’s convert documents we have to corpora.
# Store the variables as corpora
spam_corpus <- corpus(sample_raw_spam_data)
ham_corpus <- corpus(sample_raw_ham_data)
training_corpus <- corpus(sample)
testing_corpus <- corpus(test_raw_data)
Here are snippets from each of the corpora.
texts(spam_corpus)[10]
## 00359.4ab70de20a198b736ed01940c9745384
## "From Wed Sep 18 11:50:27 2002 Return-Path: Delivered-To: Received: from localhost (jalapeno [127.0.0.1]) by zzzzason.org (Postfix) with ESMTP id 714A516F16 for ; Wed, 18 Sep 2002 11:50:26 +0100 (IST) Received: from jalapeno [127.0.0.1] by localhost with IMAP (fetchmail-5.9.0) for (single-drop); Wed, 18 Sep 2002 11:50:26 +0100 (IST) Received: from lugh.tuatha.org [194.125.145.45]) by dogma.slashnull.org (8.11.6/8.11.6) with ESMTP id g8I0C0C02158 for ; Wed, 18 Sep 2002 01:12:00 +0100 Received: from lugh.tuatha.org [127.0.0.1]) by lugh.tuatha.org (8.9.3/8.9.3) with ESMTP id BAA32179; Wed, 18 Sep 2002 01:12:09 +0100 Received: from relay.dub-t3-1.nwcgroup.com [195.129.80.16]) by lugh.tuatha.org (8.9.3/8.9.3) with ESMTP id BAA32146 for ; Wed, 18 Sep 2002 01:12:01 +0100 Received: from glb03 (unknown [81.18.33.83]) by relay.dub-t3-1.nwcgroup.com (Postfix) with SMTP id 9F3D770044 for ; Wed, 18 Sep 2002 01:11:58 +0100 (IST) From: \"Femi Daniel\" To: MIME-Version: 1.0 Content-Type: text/plain; charset=\"iso-8859-1\" Message-Id: Subject: [ILUG-Social] HELLO Sender: Errors-To: X-Beenthere: X-Mailman-Version: 2.0.11 Precedence: bulk List-Help: List-Post: List-Subscribe: , List-Id: Irish Linux Users' Group social events List-Unsubscribe: , List-Archive: X-Original-Date: Wed, 18 Sep 2002 01:11:52 Date: Wed, 18 Sep 2002 01:11:52 OFFICE OF:EGNR. FEMI DANIEL FEDERAL MINISTRY OF WORKS AND HOUSING FEDERAL SECRETARIAT OFFICE COMPLEX IKOYI-LAGOS. ATTN:, First, I must solicit your strictest confidence in this transaction, this is by virtue of it's nature as being utterly confidential and top secret as you were introduced to me in confidence through the Nigeria Chamber of Commerce and Industries. We are top officials from the Federal Ministry of Works and Housing,(FMW&H), Federal Ministry of Finance and the Presidency, making up the Contract Review Panel(CRP) set up by the Federal Government of Nigeria to review contracts awarded by the past military administration. In the course of our work on the CRP, we discovered this fund which resulted from grossly over - invoiced contracts which were executed for the FMW&H during the last administration.The companies that executed the contracts have been duly paid and the contracts commissioned leaving the sum of US$21.4 Million floating in the escrow account of the Central Bank of Nigeria ready for payment. I have therefore been mandated as a matter of trust by my colleagues in the panel to look for an overseas partner to whom we could transfer the sum of US21.4 legally subcontracting the entitlement to your company.This is bearing in mind that our civil service code of conduct forbids us from owning foreign companies or operating foreign accounts while in government service, hence the need for an overseas partner. We have agreed that the funds will be shared thus after it has been paid into your account. (1) 30% of the money will go to you for acting as the beneficiary of the fund. (2) 10% has been set aside as an abstract projection for reimbursement to both parties for incidental expences that may be incurred in the course of the transaction. (3) 60% to us the government officials (with which we intend to commence an importation business in conjunction with you) All logistics are in place and all modalities worked out for the smooth conclusion of the transaction within ten to fourteen days of commencement after receipt of the following information; your full names, company's name, address,details & activities, telephone & fax numbers.These information will enable us make the applications and lodge claims to the concerned ministries & agencies in favour of your company and it is pertinent to state here that this transaction is entirely based on trust as the solar bank draft or certified cheque drawable in any of the Central Bank of Nigeria correspondent bankers around the world is going to be made in your name. Please acknowledge the reciept of this letter using the above e-mail or the Alternative: to reply me. Yours faithfully, Egnr. Femi Daniel. NB:Bank Account Details not necessary as preferred mode of payment is by draft or cheque. - Irish Linux Users' Group Social Events: for (un)subscription information. List maintainer: "
texts(ham_corpus)[10]
## 02361.68fa77659ad7611f8bcad48d030a3039
## "From Tue Oct 8 10:56:02 2002 Return-Path: Delivered-To: Received: from localhost (jalapeno [127.0.0.1]) by jmason.org (Postfix) with ESMTP id DB5DE16F16 for ; Tue, 8 Oct 2002 10:56:01 +0100 (IST) Received: from jalapeno [127.0.0.1] by localhost with IMAP (fetchmail-5.9.0) for (single-drop); Tue, 08 Oct 2002 10:56:01 +0100 (IST) Received: from dogma.slashnull.org (localhost [127.0.0.1]) by dogma.slashnull.org (8.11.6/8.11.6) with ESMTP id g9880VK06078 for ; Tue, 8 Oct 2002 09:00:31 +0100 Message-Id: To: From: boingboing Subject: A&L Daily to be auctioned in bankruptcy Date: Tue, 08 Oct 2002 08:00:31 -0000 Content-Type: text/plain; encoding=utf-8 URL: Date: Not supplied Arts and Letters Daily, a wonderful and dense blog, has folded up its tent due to the bankruptcy of its parent company. A&L Daily will be auctioned off by the receivers. Link[1] Discuss[2] (_Thanks, Misha!_) [1] [2] "
The files are imported into the corpus format but we need to do some tidying.
First, let’s clean text in our data and convert to DFMs.
# Used for the Word Cloud
spam_dfm <- clean_corpus(spam_corpus)
ham_dfm <- clean_corpus(ham_corpus)
# Used for the Bayes Classifier
training_dfm <- clean_corpus(training_corpus)
testing_dfm <- clean_corpus(testing_corpus)
# Used for the Random Forest Classifier
training_dtm <- clean_corpus_dtm(sample$text)
testing_dtm <- clean_corpus_dtm(test_raw_data$text)
This section will go analyze the data we have imported and tidied.
This page https://www.r-bloggers.com/text-message-classification/ has code for using a Naïve Bayes classifier for classifying messages in addition to the Quanteda page linked earlier. We will train the model and look at a summary of the model. This employs both the caret and quanteda packages.
# Train the Bayes classifier
tmodel_nb <- textmodel_nb(training_dfm, training_dfm$type)
summary(tmodel_nb)
##
## Call:
## textmodel_nb.dfm(x = training_dfm, y = training_dfm$type)
##
## Class Priors:
## (showing first 2 elements)
## spam ham
## 0.5 0.5
##
## Estimated Feature Scores:
## tue aug return-path delivered-to received localhost
## spam 0.002672 0.003265 0.002190 0.001614 0.01125 0.003918
## ham 0.004671 0.009379 0.003317 0.004376 0.01954 0.010085
## phobos.labs.netnoteinc.com postfix esmtp id cc1c54411f -0400
## spam 0.000605 0.001996 0.00545 0.008167 2.420e-06 0.002367
## ham 0.001546 0.006035 0.01183 0.014453 3.278e-06 0.004057
## edt phobos imap fetchmail-5.9.0 single-drop ist
## spam 0.001321 0.000334 0.0009462 0.001290 0.001290 0.001738
## ham 0.002454 0.001224 0.0030814 0.003047 0.003047 0.004704
## usw-sf-list2 sourceforge.net usw-sf-fw2 dogma.slashnull.org g72ml1v14021
## spam 7.502e-05 0.0001984 3.872e-05 0.001963 2.420e-06
## ham 7.212e-04 0.0018636 3.606e-04 0.004003 3.278e-06
## fri usw-sf-list1-b.sourceforge.net helo usw-sf-list1 exim
## spam 0.001861 3.872e-05 0.0003533 7.502e-05 0.0001404
## ham 0.003817 3.590e-04 0.0012965 7.163e-04 0.0012162
## 3.31-va-mm2 #1
## spam 7.986e-05 0.000317
## ham 7.212e-04 0.001152
“Naïve Bayes can only take features into consideration that occur both in the training set and the test set.”3 The function dfm_match() can use the training_dfm as a pattern. The cross-table matrix shows how the classifier did.
matching_dfm <- dfm_match(testing_dfm, features = featnames(training_dfm))
# Evaluating how the classifier did
actual_class <- matching_dfm$type
predicted_class <- predict(tmodel_nb, newdata = matching_dfm)
tab_class <- table(actual_class, predicted_class)
tab_class
## predicted_class
## actual_class spam ham
## spam 819 85
## ham 13 2130
For further analysis, we can construct a Confusion Matrix4 using the caret package. A Confusion Matrix is basically an error matrix.
confusionMatrix(tab_class, mode = "everything")
## Confusion Matrix and Statistics
##
## predicted_class
## actual_class spam ham
## spam 819 85
## ham 13 2130
##
## Accuracy : 0.9678
## 95% CI : (0.9609, 0.9738)
## No Information Rate : 0.7269
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9211
##
## Mcnemar's Test P-Value : 7.387e-13
##
## Sensitivity : 0.9844
## Specificity : 0.9616
## Pos Pred Value : 0.9060
## Neg Pred Value : 0.9939
## Precision : 0.9060
## Recall : 0.9844
## F1 : 0.9435
## Prevalence : 0.2731
## Detection Rate : 0.2688
## Detection Prevalence : 0.2967
## Balanced Accuracy : 0.9730
##
## 'Positive' Class : spam
##
John DeBlase’s Rpub https://rpubs.com/bsnacks000/125281 and Chris Marshall’s article5 were the basis for this. We use the sample data from before to develop a Random Forest classifier. For more informatio about the Random Forest, you can see this article: https://en.wikipedia.org/wiki/Random_forest.
# Convert the training_dtm and testing_dtm to data frames
rf_df <- as.data.frame(as.matrix(training_dtm))
colnames(rf_df) <- make.names(colnames(rf_df))
rf_test <- as.data.frame(as.matrix(testing_dtm))
colnames(rf_test) <- make.names(colnames(rf_test))
# Add the type variable to the data frame
rf_df$rf_variable_spam <- sample$type
rf_test$rf_variable_spam <- test_raw_data$type
# Train the Random Forest Model
rf_model <- randomForest(rf_variable_spam~., data = rf_df)
summary(rf_model)
## Length Class Mode
## call 3 -none- call
## type 1 -none- character
## predicted 3000 factor numeric
## err.rate 1500 -none- numeric
## confusion 6 -none- numeric
## votes 6000 matrix numeric
## oob.times 3000 -none- numeric
## classes 2 -none- character
## importance 440 -none- numeric
## importanceSD 0 -none- NULL
## localImportance 0 -none- NULL
## proximity 0 -none- NULL
## ntree 1 -none- numeric
## mtry 1 -none- numeric
## forest 14 -none- list
## y 3000 factor numeric
## test 0 -none- NULL
## inbag 0 -none- NULL
## terms 3 terms call
# Evaluating how the classifier did
prediction_rf_class <- predict(rf_model, type="prob")[,2]
# Evaluate the performance of the random forest model on the training set
rf_tab <- table(rf_df$rf_variable_spam, prediction_rf_class > 0.5)
colnames(rf_tab) <- c("spam", "ham")
rf_tab
##
## spam ham
## spam 963 30
## ham 36 1971
Below is a look at the word clouds for the spam and ham data. Notice how localhost and domain names appear. In spam classifying, domains can be filtered out.
# Create a word cloud: https://quanteda.io/reference/textplot_wordcloud.html
spam.colors <- brewer.pal(9, "Paired")
textplot_wordcloud(spam_dfm, min_count = 200, color = spam.colors)
title("Spam Wordcloud", col.main = "black")
# Create a word cloud: https://quanteda.io/reference/textplot_wordcloud.html
ham.colors <- brewer.pal(9, "Paired")
textplot_wordcloud(ham_dfm, min_count = 200, color = ham.colors)
title("Ham Wordcloud", col.main = "black")
Witha cursory entry into classifiers, I found that it was easier to use the quanteda and caret packages to use the Naïve Bayes classifier for spam. However, the Random Forest Classifier can be tweaked to obtain a higher accuracy. Sophisticated spam filters can be employed to examine the source address, MAC addresses, and domains to quickly remove spam.