Project 4 - Document Classification

Overview

For this project, we will investigate a corpus of emails harvested from the Apache SpamAssassin public corpus, build a manual classifier to identify each message as either “spam” or “ham”, and then compare this to an algorithmic approach using a Random Forest classifier.

Data Import

There are five SpamAssassin datasets (two “spam” and three “ham”), downloaded to our ‘data’ folder as bzip2-compressed tar archives. The following code (which should only be run once) extracts and decompresses these archives into a ‘data/unzip’ subfolder:

data_dir = 'data' # compressed archives
dest_dir = 'data/unzip' # destination for unzipped files
# unzip automatic - run first time only
for (file in list.files(data_dir, pattern='.*\\.bz2', full.names=TRUE)){
  untar(file,exdir=dest_dir)
}

Next, we’ll loop through the decompressed spam and ham emails and load them into a dataframe with appropriate labels, reading each file with readr’s read_file(). (grep() is used to skip some non-email files named “cmd” that were included in the original archives.)

# load data automatically
# (growing the dataframe row-by-row is slow but simple enough for ~6,000 emails)
df <- data.frame(source=character(), label=character(), email=character()) # init

# load spam
for (path in list.files(dest_dir, pattern='.*spam.*', full.names=TRUE)){
  for (f in grep(list.files(path,full.names=TRUE), pattern='.*cmd.*', invert=TRUE, value=TRUE)){
    df[nrow(df)+1,] = c(path,'spam',read_file(f))
  }
}

# load ham
for (path in list.files(dest_dir, pattern='.*ham.*', full.names=TRUE)){
  for (f in grep(list.files(path,full.names=TRUE), pattern='.*cmd.*', invert=TRUE, value=TRUE)){
    df[nrow(df)+1,] = c(path,'ham',read_file(f))
  }
}

The resulting dataframe has 6046 rows, each representing a single email from the corpus, with the full original content (‘email’), the name of the source archive (‘source’), and a label of either ‘ham’ or ‘spam’:

source label email
data/unzip/spam spam

From Thu Aug 22 13:17:22 2002 Return-Path: <> Delivered-To: Received: from localhost (localhost [127.0.0.1]) by phobos.labs.spamassassin.taint.org (Postfix) with ESMTP id 136B943C32 for <>; Thu, 22 Aug 2002 08:17:21 -0400 (EDT) Received: from mail.webnote.net [193.120.211.219] by localhost with POP3 (fetchmail-5.9.0) for (single-drop); Thu, 22 Aug 2002 13:17:21 +0100 (IST) Received: from dd_it7 ([210.97.77.167]) by webnote.net (8.9.3/8.9.3) with ESMTP id NAA04623 for <>; Thu, 22 Aug 2002 13:09:41 +0100 From: Received: from r-smtp.korea.com - 203.122.2.197 by dd_it7 with Microsoft SMTPSVC(5.5.1775.675.6); Sat, 24 Aug 2002 09:42:10 +0900 To: <> Subject: Life Insurance - Why Pay More? Date: Wed, 21 Aug 2002 20:31:57 -1600 MIME-Version: 1.0 Message-ID: <_it7> Content-Type: text/html; charset=“iso-8859-1” Content-Transfer-Encoding: quoted-printable

<!DOCTYPE HTML PUBLIC “-//W3C//DTD HTML 4.0 Transitional//EN”> <HTML><HEAD> <META content=3D”text/html; charset=3Dwindows-1252” http-equiv=3DContent-T= ype> <META content=3D”MSHTML 5.00.2314.1000” name=3DGENERATOR></HEAD> <BODY><!– Inserted by Calypso –> <TABLE border=3D0 cellPadding=3D0 cellSpacing=3D2 id=3D_CalyPrintHeader_ r= ules=3Dnone style=3D”COLOR: black; DISPLAY: none” width=3D”100%“> <TBODY> <TR> <TD colSpan=3D3> <HR color=3Dblack noShade SIZE=3D1> </TD></TR></TD></TR> <TR> <TD colSpan=3D3> <HR color=3Dblack noShade SIZE=3D1> </TD></TR></TBODY></TABLE><!– End Calypso –><!– Inserted by Calypso= –><FONT color=3D#000000 face=3DVERDANA,ARIAL,HELVETICA size=3D-2><BR></FONT></TD><= /TR></TABLE><!– End Calypso –><FONT color=3D#ff0000 face=3D”Copperplate Gothic Bold” size=3D5 PTSIZE=3D”10”> <CENTER>Save up to 70% on Life Insurance.</CENTER></FONT><FONT color=3D#ff= 0000 face=3D”Copperplate Gothic Bold” size=3D5 PTSIZE=3D”10”> <CENTER>Why Spend More Than You Have To? <CENTER><FONT color=3D#ff0000 face=3D”Copperplate Gothic Bold” size=3D5 PT= SIZE=3D”10”> <CENTER>Life Quote Savings <CENTER> <P align=3Dleft></P> <P align=3Dleft></P></FONT></U></I></B><BR></FONT></U></B></U></I> <P></P> <CENTER> <TABLE border=3D0 borderColor=3D#111111 cellPadding=3D0 cellSpacing=3D0 wi= dth=3D650> <TBODY></TBODY></TABLE> <TABLE border=3D0 borderColor=3D#111111 cellPadding=3D5 cellSpacing=3D0 wi= dth=3D650> <TBODY> <TR> <TD colSpan=3D2 width=3D”35%“><B><FONT face=3DVerdana size=3D4>Ensurin= g your family’s financial security is very important. Life Quote Savings ma= kes buying life insurance simple and affordable. We Provide FREE Access = to The Very Best Companies and The Lowest Rates.</FONT></B></TD></TR> <TR> <TD align=3Dmiddle vAlign=3Dtop width=3D”18%“> <TABLE borderColor=3D#111111 width=3D”100%“> <TBODY> <TR> <TD style=3D”PADDING-LEFT: 5px; PADDING-RIGHT: 5px” width=3D”100= %“><FONT face=3DVerdana size=3D4><B>Life Quote Savings</B> is FAST, EAS= Y and SAVES you money! Let us help you get started with the best val= ues in the country on new coverage. You can SAVE hundreds or even tho= usands of dollars by requesting a FREE quote from Lifequote Savings. = Our service will take you less than 5 minutes to complete. Shop an= d compare. SAVE up to 70% on all types of Life insurance! </FONT></TD></TR> <TR><BR><BR> <TD height=3D50 style=3D”PADDING-LEFT: 5px; PADDING-RIGHT: 5px” width=3D”100%“> <P align=3Dcenter><B><FONT face=3DVerdana size=3D5><A href=3D”http://website.e365.cc/savequote/">Click Here For Your=

        Free Quote!</A></FONT></B></P></TD>
      <P><FONT face=3DVerdana size=3D4><STRONG>
      <CENTER>Protecting your family is the best investment you'll eve=
r make!<BR></B></TD></TR> <TR><BR><BR></STRONG></FONT></TD></TR></TD></TR> <TR></TR></TBODY></TABLE> <P align=3Dleft><FONT face=3D”Arial, Helvetica, sans-serif” size=3D2= ></FONT></P> <P></P> <CENTER><BR><BR><BR> <P></P> <P align=3Dleft><BR></B><BR><BR><BR><BR></P> <P align=3Dcenter><BR></P> <P align=3Dleft><BR></B><BR><BR></FONT>If you are in receipt of this= email in error and/or wish to be removed from our list, <A href=3D”mailto:coins@btamail.net.cn">PLEASE CLICK HERE</A> AND TYPE = REMOVE. If you reside in any state which prohibits e-mail solicitations for insuran= ce, please disregard this email.<BR></FONT><BR><BR><BR><BR><BR><BR><BR><BR><BR><BR><BR><BR><BR= ><BR><BR><BR></FONT></P></CENTER></CENTER></TR></TBODY></TABLE></CENTER></= CENTER></CENTER></CENTER></CENTER></BODY></HTML>
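
As a quick sanity check on the loaded corpus (a sketch, assuming dplyr is loaded), we can confirm how those 6046 rows break down by source archive and label:

# emails per source archive and label - the totals should sum to 6046
df %>% count(source, label)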

Data Transformations

Before beginning the analysis, we need to deal with the widely-varying content of these emails, many of which contain HTML tags and non-standard characters.

  • Encoding: After some trial and error, converting all of the ‘email’ content to latin1 works for every row.
  • Separate Header and Body: A combination of str_sub() and str_locate() splits the content into ‘header’ and ‘body’ at the first double line break.
  • HTML Tags: A short function strips HTML tags and line-break characters; sapply() runs it against the contents of df$body.
# convert to latin1
df$email <- str_conv(df$email, 'latin1')

# locate the first double line break in each email (start/end matrix)
split_pos <- str_locate(df$email, '\n\n')

# separate header: everything up to the double line break
# (str_sub() is vectorized over its start/end positions)
df$header <- str_sub(df$email, 1, split_pos[,1])

# separate body: everything from the double line break onward
df$body <- str_sub(df$email, split_pos[,2], str_length(df$email))

# strip html tags, line breaks, and stray '=' characters (quoted-printable soft breaks)
strip_html <- function(x){gsub('</?[^>]+>|\\n|=','',x)}
df$body <- sapply(df$body, strip_html)

df <- df %>% select(!email) # drop the original email column
source label header body
data/unzip/spam spam From Thu Aug 22 13:17:22 2002 Return-Path: <> Delivered-To: Received: from localhost (localhost [127.0.0.1]) by phobos.labs.spamassassin.taint.org (Postfix) with ESMTP id 136B943C32 for <>; Thu, 22 Aug 2002 08:17:21 -0400 (EDT) Received: from mail.webnote.net [193.120.211.219] by localhost with POP3 (fetchmail-5.9.0) for (single-drop); Thu, 22 Aug 2002 13:17:21 +0100 (IST) Received: from dd_it7 ([210.97.77.167]) by webnote.net (8.9.3/8.9.3) with ESMTP id NAA04623 for <>; Thu, 22 Aug 2002 13:09:41 +0100 From: Received: from r-smtp.korea.com - 203.122.2.197 by dd_it7 with Microsoft SMTPSVC(5.5.1775.675.6); Sat, 24 Aug 2002 09:42:10 +0900 To: <> Subject: Life Insurance - Why Pay More? Date: Wed, 21 Aug 2002 20:31:57 -1600 MIME-Version: 1.0 Message-ID: <_it7> Content-Type: text/html; charset=“iso-8859-1” Content-Transfer-Encoding: quoted-printable Save up to 70% on Life Insurance.Why Spend More Than You Have To?Life Quote Savings Ensuring your family’s financial security is very important. Life Quote Savings makes buying life insurance simple and affordable. We Provide FREE Access to The Very Best Companies and The Lowest Rates. Life Quote Savings is FAST, EASY and SAVES you money! Let us help you get started with the best values in the country on new coverage. You can SAVE hundreds or even thousands of dollars by requesting a FREE quote from Lifequote Savings. Our service will take you less than 5 minutes to complete. Shop and compare. SAVE up to 70% on all types of Life insurance! Click Here For Your Free Quote! Protecting your family is the best investment you’ll ever make! If you are in receipt of this email in error and/or wish to be removed from our list, PLEASE CLICK HERE AND TYPE REMOVE. If you reside in any state which prohibits e-mail solicitations for insurance, please disregard this email.

Split Training and Test datasets

Focusing on the body text only for this analysis, we randomly split the rows into training and test datasets - 50/50, for argument’s sake. I’ve set a random seed in the setup chunk of this document for reproducibility.
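
For reference, the setup chunk mentioned above would contain something along these lines (a sketch only - the particular seed value and library list are assumptions, not copied from the original document):

# setup chunk - run once at the top of the document
library(readr)    # read_file()
library(dplyr)    # data manipulation verbs
library(stringr)  # str_conv(), str_locate(), str_sub(), str_length()
library(tidytext) # unnest_tokens(), stop_words

set.seed(1234) # hypothetical seed, so the rbinom() split below is reproducible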

# add an email index and a random train/test assignment
df <- df %>% 
  mutate(email_id = row_number()) %>%
  mutate(segment = rbinom(n(), 1, 0.5)) # 0 = train, 1 = test

# let's consider body text only
df_train <- df %>%
  filter(segment==0) %>%
  select(email_id,body,label)

df_test <- df %>%
  filter(segment==1) %>% 
  select(email_id,body,label)

Manual Classification

For a simple manual approach, we’ll tally up the individual words most commonly associated with ‘ham’ and ‘spam’ email body content in our training dataset, and then classify the emails in our test dataset depending on the frequency of these words.

With the help of the tidytext library, we’ll un-nest the individual words (tokens), group them by their ham/spam label, and get a count of their frequency:

# tokenize
df_train_words <- df_train %>%
  unnest_tokens(word,body) %>%
  group_by(label, word) %>%
  summarize(count=n())
label word count
ham __ 36
ham ___ 14
ham ____ 3
ham _____ 8
ham ______ 1
ham ___________________________ 2

Right off the bat, we can see that a large number of tokens consist of special characters, numbers, and other ‘non-alpha’ content. While this information could be useful in a model-based approach, it’s probably too much noise for this simple analysis.

This next step filters out common stopwords, non-alpha and special characters, short terms (1-2 characters) and low-frequency terms.

# clean up
df_train_words_tidy <- df_train_words %>%
  anti_join(stop_words, by='word') %>% # filter out stop words
  filter(grepl('^[[:alpha:]]',word)) %>% # keep only terms that start with a letter
  filter(count>5) %>% # filter out low-frequency terms
  filter(str_length(word)>2) %>% # filter out 1-2 character terms
  arrange(desc(count),word)
label word count
ham http 5842
ham listinfo 1222
ham list 1219
ham linux 907
ham people 710
spam list 618

And for the final step, compose lists of words that appear exclusively in either ‘ham’ or ‘spam’ emails:

# get top words for spam/ham
df_spam_words <- df_train_words_tidy %>%
  filter(label=='spam')

df_ham_words <- df_train_words_tidy %>%
  filter(label=='ham')

# spam words only
df_spam_words_unique <- df_spam_words %>%
  anti_join(df_ham_words, by=c('word') )

# ham words only
df_ham_words_unique <- df_ham_words %>%
  anti_join(df_spam_words, by=c('word') )
label word count
ham pgp 288
ham perl 281
ham aug 262
ham apt 218
ham lists.freshrpms.net 216
ham log 208
label word count
spam aging 74
spam loan 71
spam herba 66
spam les 63
spam align3d22center22 60
spam loans 59

Now we’ll score the emails in the Test dataset. Start by tokenizing the words in the test emails, grouping them by email_id, and cleaning up the non-alpha and special characters (similar to the previous step, but keeping the low-frequency terms).

# tokenize
df_test_words <- df_test %>%
  select(!label) %>%
  unnest_tokens(word,body) %>%
  group_by(word,email_id) %>%
  summarize(count=n())

# clean up
df_test_words_tidy <- df_test_words %>%
  anti_join(stop_words, by='word') %>% 
  filter(grepl('^[[:alpha:]]',word)) %>% 
  filter(str_length(word)>2)
word email_id count
a___ 4599 1
a_______________________________________________exmh 2847 1
a_______________________________________________exmh 3065 1
a_______________________________________________webdev 2048 1
a_maximum_ 4982 1
a:active 847 1

And now we count up the frequency of unique ham/spam words per email.

# join with SPAM words list and get count
df_test_words_spam <- df_test_words_tidy %>%
  inner_join(y = df_spam_words_unique, by=c('word')) %>%
  select(word,email_id,count.x) %>%
  rename(spam = count.x)

# join with HAM words list and get count
df_test_words_ham <- df_test_words_tidy %>%
  inner_join(y = df_ham_words_unique, by=c('word')) %>%
  select(word,email_id,count.x) %>%
  rename(ham = count.x)

# total up SPAM score per email
df_test_spam_scores <- df_test_words_spam %>%
  group_by(email_id) %>%
  summarize(spam = sum(spam))

# total up HAM score per email
df_test_ham_scores <- df_test_words_ham %>%
  group_by(email_id) %>%
  summarize(ham = sum(ham))
email_id ham
1 1
2 1
3 1
8 1
9 2
13 1
email_id spam
2 6
3 6
4 9
8 6
9 1
10 1

And finally, join these ham/spam scores back to the test dataset and score each email: if a given email’s ‘spam’ score is greater than its ‘ham’ score, flag it as spam. To assess performance, ‘success’ is set to 1 whenever the predicted flag matches the original label.

# join back to df
df_results <- df %>%
  filter(segment==1) %>% # just the test segment
  left_join(df_test_spam_scores, by=c('email_id')) %>%
  left_join(df_test_ham_scores, by=c('email_id'))

# fix NAs
df_results[is.na(df_results)] <- 0 # so we can calculate without NA
  
# if spam score > ham score, flag the email as 'spam'. success=1 if predicted correctly.
df_results <- df_results %>%
  mutate(flag=if_else(spam>ham,'spam','ham')) %>%
  mutate(success=if_else(label==flag,1,0))
source label header body email_id segment spam ham flag success
data/unzip/spam spam From Thu Aug 22 13:17:22 2002 Return-Path: <> Delivered-To: Received: from localhost (localhost [127.0.0.1]) by phobos.labs.spamassassin.taint.org (Postfix) with ESMTP id 136B943C32 for <>; Thu, 22 Aug 2002 08:17:21 -0400 (EDT) Received: from mail.webnote.net [193.120.211.219] by localhost with POP3 (fetchmail-5.9.0) for (single-drop); Thu, 22 Aug 2002 13:17:21 +0100 (IST) Received: from dd_it7 ([210.97.77.167]) by webnote.net (8.9.3/8.9.3) with ESMTP id NAA04623 for <>; Thu, 22 Aug 2002 13:09:41 +0100 From: Received: from r-smtp.korea.com - 203.122.2.197 by dd_it7 with Microsoft SMTPSVC(5.5.1775.675.6); Sat, 24 Aug 2002 09:42:10 +0900 To: <> Subject: Life Insurance - Why Pay More? Date: Wed, 21 Aug 2002 20:31:57 -1600 MIME-Version: 1.0 Message-ID: <_it7> Content-Type: text/html; charset=“iso-8859-1” Content-Transfer-Encoding: quoted-printable Save up to 70% on Life Insurance.Why Spend More Than You Have To?Life Quote Savings Ensuring your family’s financial security is very important. Life Quote Savings makes buying life insurance simple and affordable. We Provide FREE Access to The Very Best Companies and The Lowest Rates. Life Quote Savings is FAST, EASY and SAVES you money! Let us help you get started with the best values in the country on new coverage. You can SAVE hundreds or even thousands of dollars by requesting a FREE quote from Lifequote Savings. Our service will take you less than 5 minutes to complete. Shop and compare. SAVE up to 70% on all types of Life insurance! Click Here For Your Free Quote! Protecting your family is the best investment you’ll ever make! If you are in receipt of this email in error and/or wish to be removed from our list, PLEASE CLICK HERE AND TYPE REMOVE. If you reside in any state which prohibits e-mail solicitations for insurance, please disregard this email. 1 1 0 1 ham 0
data/unzip/spam spam From Thu Aug 22 13:27:39 2002 Return-Path: <> Delivered-To: Received: from localhost (localhost [127.0.0.1]) by phobos.labs.spamassassin.taint.org (Postfix) with ESMTP id A7FD7454F6 for <>; Thu, 22 Aug 2002 08:27:38 -0400 (EDT) Received: from phobos [127.0.0.1] by localhost with IMAP (fetchmail-5.9.0) for (single-drop); Thu, 22 Aug 2002 13:27:38 +0100 (IST) Received: from lugh.tuatha.org ( [194.125.145.45]) by dogma.slashnull.org (8.11.6/8.11.6) with ESMTP id g7MCJiZ06043 for <>; Thu, 22 Aug 2002 13:19:44 +0100 Received: from lugh ( [127.0.0.1]) by lugh.tuatha.org (8.9.3/8.9.3) with ESMTP id NAA29323; Thu, 22 Aug 2002 13:18:52 +0100 Received: from email.qves.com ([67.104.83.251]) by lugh.tuatha.org (8.9.3/8.9.3) with ESMTP id NAA29282 for <>; Thu, 22 Aug 2002 13:18:37 +0100 X-Authentication-Warning: lugh.tuatha.org: Host [67.104.83.251] claimed to be email.qves.com Received: from qvp0091 ([169.254.6.22]) by email.qves.com with Microsoft SMTPSVC(5.0.2195.2966); Thu, 22 Aug 2002 06:18:18 -0600 From: “Slim Down” <> To: <> Date: Thu, 22 Aug 2002 06:18:18 -0600 Message-Id: <> MIME-Version: 1.0 Content-Type: text/plain; charset=“iso-8859-1” Content-Transfer-Encoding: 7bit X-Mailer: Microsoft CDO for Windows 2000 Thread-Index: AcJJ1f+3FWdz11AmR6uWbmQN5gGxxw== Content-Class: urn:content-classes:message X-Mimeole: Produced By Microsoft MimeOLE V6.00.2462.0000 X-Originalarrivaltime: 22 Aug 2002 12:18:18.0699 (UTC) FILETIME=[FFB949B0:01C249D5] Subject: [ILUG] Guaranteed to lose 10-12 lbs in 30 days 10.206 Sender: Errors-To: X-Mailman-Version: 1.1 Precedence: bulk List-Id: Irish Linux Users’ Group <ilug.linux.ie> X-Beenthere:
  1. Fight The Risk of Cancer!http://www.adclick.ws/p.cfm?o315&spk0072) Slim Down - Guaranteed to lose 10-12 lbs in 30 dayshttp://www.adclick.ws/p.cfm?o249&spk0073) Get the Child Support You Deserve - Free Legal Advicehttp://www.adclick.ws/p.cfm?o245&spk0024) Join the Web’s Fastest Growing Singles Communityhttp://www.adclick.ws/p.cfm?o259&spk0075) Start Your Private Photo Album Online!http://www.adclick.ws/p.cfm?o283&spk007Have a Wonderful Day,Offer ManagerPrizeMamaIf you wish to leave this list please use the link below.http://%7C17%7C114258– Irish Linux Users’ Group: ://www.linux.ie/mailman/listinfo/ilug for (un)subscription information.List maintainer:
2 1 6 1 spam 1

And now for some basic diagnostics. This manual approach correctly classified emails about 80% of the time!

# accuracy level
accuracy <- df_results %>%
  filter(segment==1) %>%
  select(success) %>%
  summarize(accuracy=sum(success)/length(success)) %>%
  pull()
x
0.8024894

Unfortunately, 19 legitimate emails were incorrectly labeled as spam; with 2058 ham emails in the test set, that works out to a false positive rate of roughly 0.9%. The larger weakness is in the other direction: 584 spam emails (well over half of the spam in the test set) slipped through as ham.

# confusion matrix
conf_matrix <- df_results %>%
  filter(segment==1) %>%
  select(label,success) %>% 
  group_by(label,success) %>%
  summarize(count=n())
label success count
ham 0 19
ham 1 2039
spam 0 584
spam 1 411
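
The false positive rate quoted above can be derived directly from these counts (a sketch, using the df_results dataframe built earlier):

# false positives: ham emails flagged as spam, as a share of all ham emails
fp_rate <- df_results %>%
  filter(label=='ham') %>%
  summarize(fp_rate = sum(flag=='spam')/n()) %>%
  pull()
fp_rate # 19 / 2058, roughly 0.009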

Modeled Approach - Random Forest

For comparison, training and testing a classifier on the same datasets with the randomForest library results in 100% accuracy - clearly too good to be true. Rather than classic overfitting, this looks like data leakage: the formula label ~ . includes email_id, and because the spam and ham emails were loaded in separate blocks, the row index alone perfectly separates the classes.

library(randomForest)

df_train_rf <- df_train %>%
  mutate(label=as.factor(label)) # convert to factor for classification

# fit a random forest on all remaining columns (email_id and body)
rf <- randomForest(label ~ ., data = df_train_rf, importance=TRUE, 
                   proximity=TRUE)

# predict labels for the held-out test emails
rf_predict <- predict(object=rf, newdata=df_test)

# compare predictions to the true labels
rf_results <- df_test %>%
  cbind(flag=rf_predict) %>%
  mutate(success=if_else(label==flag,1,0)) %>%
  select(!body)
email_id label flag success
1 spam spam 1
2 spam spam 1
3 spam spam 1
4 spam spam 1
8 spam spam 1
9 spam spam 1
# accuracy level
rf_accuracy <- rf_results %>%
  select(success) %>%
  summarize(accuracy=sum(success)/length(success)) %>%
  pull()
x
1
# confusion matrix
rf_conf_matrix <- rf_results %>%
  select(label,success) %>% 
  group_by(label,success) %>%
  summarize(count=n())
label success count
ham 1 2058
spam 1 995
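
Given that leakage, a fairer comparison would train the forest on word features only. Below is a sketch of one way to do that with the objects defined above; the vocabulary choice, the w_ column prefix, and the overall approach are my own illustrative choices rather than part of the original analysis:

# build a document-term matrix: one row per training email, one column per vocabulary word
library(tidyr) # pivot_wider()

vocab <- unique(df_train_words_tidy$word) # reuse the cleaned training vocabulary

dtm_train <- df_train %>%
  unnest_tokens(word, body) %>%
  filter(word %in% vocab) %>%
  count(email_id, word) %>%
  pivot_wider(names_from = word, values_from = n,
              values_fill = 0, names_prefix = 'w_') %>% # prefix avoids name collisions
  left_join(select(df_train, email_id, label), by = 'email_id') %>%
  mutate(label = as.factor(label))
# note: training emails containing none of the vocabulary words drop out here

# fit on word counts only - email_id is excluded, so nothing leaks the label
rf_words <- randomForest(x = select(dtm_train, -email_id, -label),
                         y = dtm_train$label)

# the same document-term transformation (with the same columns) would be applied
# to df_test before calling predict(rf_words, ...)

Because this version cannot memorize the row ordering, its accuracy should give a more honest estimate of how well a random forest separates spam from ham on word content alone.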