Project 4 - Document Classification

Overview

For this project, we will investigate a corpus of emails harvested from the Apache SpamAssassin public corpus, build a manual classifier to identify each message as either “spam” or “ham”, and then compare this to an algorithmic approach using a Random Forest classifier.

Data Import

There are five SpamAssassin datasets (two “spam” and three “ham”), downloaded to our ‘data’ folder as bzip2-compressed tar archives. The following code (which should only be run once) extracts and decompresses these archives into a ‘data/unzip’ subfolder:

data_dir = 'data' # compressed archives
dest_dir = 'data/unzip' # destination for unzipped files
# unzip automatic - run first time only
for (file in list.files(data_dir, pattern='.*\\.bz2', full.names=TRUE)){
  untar(file,exdir=dest_dir)
}

Next, we’ll loop through the decompressed spam and ham emails and load them into a dataframe with appropriate labels, reading each file with readr’s read_file(). (grep() is used to skip some non-email files named “cmd” that were included in the original archives.)

# load data automatically
# (growing the dataframe row-by-row is slow but simple enough for ~6,000 emails)
df <- data.frame(source=character(), label=character(), email=character()) # init

# load spam
for (path in list.files(dest_dir, pattern='.*spam.*', full.names=TRUE)){
  for (f in grep(list.files(path,full.names=TRUE), pattern='.*cmd.*', invert=TRUE, value=TRUE)){
    df[nrow(df)+1,] = c(path,'spam',read_file(f))
  }
}

# load ham
for (path in list.files(dest_dir, pattern='.*ham.*', full.names=TRUE)){
  for (f in grep(list.files(path,full.names=TRUE), pattern='.*cmd.*', invert=TRUE, value=TRUE)){
    df[nrow(df)+1,] = c(path,'ham',read_file(f))
  }
}

The resulting dataframe has 6046 rows, each representing a single email from the corpus, with the full original content (‘email’), the name of the source archive (‘source’), and a label of either ‘ham’ or ‘spam’:

source label email
data/unzip/spam spam

From Thu Aug 22 13:17:22 2002 Return-Path: <> Delivered-To: Received: from localhost (localhost [127.0.0.1]) by phobos.labs.spamassassin.taint.org (Postfix) with ESMTP id 136B943C32 for <>; Thu, 22 Aug 2002 08:17:21 -0400 (EDT) Received: from mail.webnote.net [193.120.211.219] by localhost with POP3 (fetchmail-5.9.0) for (single-drop); Thu, 22 Aug 2002 13:17:21 +0100 (IST) Received: from dd_it7 ([210.97.77.167]) by webnote.net (8.9.3/8.9.3) with ESMTP id NAA04623 for <>; Thu, 22 Aug 2002 13:09:41 +0100 From: Received: from r-smtp.korea.com - 203.122.2.197 by dd_it7 with Microsoft SMTPSVC(5.5.1775.675.6); Sat, 24 Aug 2002 09:42:10 +0900 To: <> Subject: Life Insurance - Why Pay More? Date: Wed, 21 Aug 2002 20:31:57 -1600 MIME-Version: 1.0 Message-ID: <_it7> Content-Type: text/html; charset=“iso-8859-1” Content-Transfer-Encoding: quoted-printable

<!DOCTYPE HTML PUBLIC “-//W3C//DTD HTML 4.0 Transitional//EN”> <HTML><HEAD> <META content=3D”text/html; charset=3Dwindows-1252” http-equiv=3DContent-T= ype> <META content=3D”MSHTML 5.00.2314.1000” name=3DGENERATOR></HEAD> <BODY><!– Inserted by Calypso –> <TABLE border=3D0 cellPadding=3D0 cellSpacing=3D2 id=3D_CalyPrintHeader_ r= ules=3Dnone style=3D”COLOR: black; DISPLAY: none” width=3D”100%“> <TBODY> <TR> <TD colSpan=3D3> <HR color=3Dblack noShade SIZE=3D1> </TD></TR></TD></TR> <TR> <TD colSpan=3D3> <HR color=3Dblack noShade SIZE=3D1> </TD></TR></TBODY></TABLE><!– End Calypso –><!– Inserted by Calypso= –><FONT color=3D#000000 face=3DVERDANA,ARIAL,HELVETICA size=3D-2><BR></FONT></TD><= /TR></TABLE><!– End Calypso –><FONT color=3D#ff0000 face=3D”Copperplate Gothic Bold” size=3D5 PTSIZE=3D”10”> <CENTER>Save up to 70% on Life Insurance.</CENTER></FONT><FONT color=3D#ff= 0000 face=3D”Copperplate Gothic Bold” size=3D5 PTSIZE=3D”10”> <CENTER>Why Spend More Than You Have To? <CENTER><FONT color=3D#ff0000 face=3D”Copperplate Gothic Bold” size=3D5 PT= SIZE=3D”10”> <CENTER>Life Quote Savings <CENTER> <P align=3Dleft></P> <P align=3Dleft></P></FONT></U></I></B><BR></FONT></U></B></U></I> <P></P> <CENTER> <TABLE border=3D0 borderColor=3D#111111 cellPadding=3D0 cellSpacing=3D0 wi= dth=3D650> <TBODY></TBODY></TABLE> <TABLE border=3D0 borderColor=3D#111111 cellPadding=3D5 cellSpacing=3D0 wi= dth=3D650> <TBODY> <TR> <TD colSpan=3D2 width=3D”35%“><B><FONT face=3DVerdana size=3D4>Ensurin= g your family’s financial security is very important. Life Quote Savings ma= kes buying life insurance simple and affordable. We Provide FREE Access = to The Very Best Companies and The Lowest Rates.</FONT></B></TD></TR> <TR> <TD align=3Dmiddle vAlign=3Dtop width=3D”18%“> <TABLE borderColor=3D#111111 width=3D”100%“> <TBODY> <TR> <TD style=3D”PADDING-LEFT: 5px; PADDING-RIGHT: 5px” width=3D”100= %“><FONT face=3DVerdana size=3D4><B>Life Quote Savings</B> is FAST, EAS= Y and SAVES you money! Let us help you get started with the best val= ues in the country on new coverage. You can SAVE hundreds or even tho= usands of dollars by requesting a FREE quote from Lifequote Savings. = Our service will take you less than 5 minutes to complete. Shop an= d compare. SAVE up to 70% on all types of Life insurance! </FONT></TD></TR> <TR><BR><BR> <TD height=3D50 style=3D”PADDING-LEFT: 5px; PADDING-RIGHT: 5px” width=3D”100%“> <P align=3Dcenter><B><FONT face=3DVerdana size=3D5><A href=3D”http://website.e365.cc/savequote/">Click Here For Your=

        Free Quote!</A></FONT></B></P></TD>
      <P><FONT face=3DVerdana size=3D4><STRONG>
      <CENTER>Protecting your family is the best investment you'll eve=
r make!<BR></B></TD></TR> <TR><BR><BR></STRONG></FONT></TD></TR></TD></TR> <TR></TR></TBODY></TABLE> <P align=3Dleft><FONT face=3D”Arial, Helvetica, sans-serif” size=3D2= ></FONT></P> <P></P> <CENTER><BR><BR><BR> <P></P> <P align=3Dleft><BR></B><BR><BR><BR><BR></P> <P align=3Dcenter><BR></P> <P align=3Dleft><BR></B><BR><BR></FONT>If you are in receipt of this= email in error and/or wish to be removed from our list, <A href=3D”mailto:coins@btamail.net.cn">PLEASE CLICK HERE</A> AND TYPE = REMOVE. If you reside in any state which prohibits e-mail solicitations for insuran= ce, please disregard this email.<BR></FONT><BR><BR><BR><BR><BR><BR><BR><BR><BR><BR><BR><BR><BR= ><BR><BR><BR></FONT></P></CENTER></CENTER></TR></TBODY></TABLE></CENTER></= CENTER></CENTER></CENTER></CENTER></BODY></HTML>
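
As a quick sanity check on the loaded corpus (a sketch, assuming dplyr is loaded), we can confirm how those 6046 rows break down by source archive and label:

# emails per source archive and label - the totals should sum to 6046
df %>% count(source, label)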

Data Transformations

Before beginning the analysis, we need to deal with the widely-varying content of these emails, many of which contain HTML tags and non-standard characters.

  • Encoding: After some trial and error, converting all of the ‘email’ content to latin1 works for every row.
  • Separate Header and Body: A combination of str_sub() and str_locate() splits the content into ‘header’ and ‘body’ at the first double line break.
  • HTML Tags: A short function strips HTML tags and line-break characters; sapply() runs it against the contents of df$body.
# convert to latin1
df$email <- str_conv(df$email, 'latin1')

# locate the first double line break in each email (start/end matrix)
split_pos <- str_locate(df$email, '\n\n')

# separate header: everything up to the double line break
# (str_sub() is vectorized over its start/end positions)
df$header <- str_sub(df$email, 1, split_pos[,1])

# separate body: everything from the double line break onward
df$body <- str_sub(df$email, split_pos[,2], str_length(df$email))

# strip html tags, line breaks, and stray '=' characters (quoted-printable soft breaks)
strip_html <- function(x){gsub('</?[^>]+>|\\n|=','',x)}
df$body <- sapply(df$body, strip_html)

df <- df %>% select(!email) # drop the original email column
source label header body
data/unzip/spam spam From Thu Aug 22 13:17:22 2002 Return-Path: <> Delivered-To: Received: from localhost (localhost [127.0.0.1]) by phobos.labs.spamassassin.taint.org (Postfix) with ESMTP id 136B943C32 for <>; Thu, 22 Aug 2002 08:17:21 -0400 (EDT) Received: from mail.webnote.net [193.120.211.219] by localhost with POP3 (fetchmail-5.9.0) for (single-drop); Thu, 22 Aug 2002 13:17:21 +0100 (IST) Received: from dd_it7 ([210.97.77.167]) by webnote.net (8.9.3/8.9.3) with ESMTP id NAA04623 for <>; Thu, 22 Aug 2002 13:09:41 +0100 From: Received: from r-smtp.korea.com - 203.122.2.197 by dd_it7 with Microsoft SMTPSVC(5.5.1775.675.6); Sat, 24 Aug 2002 09:42:10 +0900 To: <> Subject: Life Insurance - Why Pay More? Date: Wed, 21 Aug 2002 20:31:57 -1600 MIME-Version: 1.0 Message-ID: <_it7> Content-Type: text/html; charset=“iso-8859-1” Content-Transfer-Encoding: quoted-printable Save up to 70% on Life Insurance.Why Spend More Than You Have To?Life Quote Savings Ensuring your family’s financial security is very important. Life Quote Savings makes buying life insurance simple and affordable. We Provide FREE Access to The Very Best Companies and The Lowest Rates. Life Quote Savings is FAST, EASY and SAVES you money! Let us help you get started with the best values in the country on new coverage. You can SAVE hundreds or even thousands of dollars by requesting a FREE quote from Lifequote Savings. Our service will take you less than 5 minutes to complete. Shop and compare. SAVE up to 70% on all types of Life insurance! Click Here For Your Free Quote! Protecting your family is the best investment you’ll ever make! If you are in receipt of this email in error and/or wish to be removed from our list, PLEASE CLICK HERE AND TYPE REMOVE. If you reside in any state which prohibits e-mail solicitations for insurance, please disregard this email.

Split Training and Test datasets

Focusing on the body text only for this analysis, we randomly split the rows into training and test datasets - 50/50, for argument’s sake. I’ve set a random seed in the setup chunk of this document for reproducibility.
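
For reference, the setup chunk mentioned above would contain something along these lines (a sketch only - the particular seed value and library list are assumptions, not copied from the original document):

# setup chunk - run once at the top of the document
library(readr)    # read_file()
library(dplyr)    # data manipulation verbs
library(stringr)  # str_conv(), str_locate(), str_sub(), str_length()
library(tidytext) # unnest_tokens(), stop_words

set.seed(1234) # hypothetical seed, so the rbinom() split below is reproducible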

# add an email index and a random train/test assignment
df <- df %>% 
  mutate(email_id = row_number()) %>%
  mutate(segment = rbinom(n(), 1, 0.5)) # 0 = train, 1 = test

# let's consider body text only
df_train <- df %>%
  filter(segment==0) %>%
  select(email_id,body,label)

df_test <- df %>%
  filter(segment==1) %>% 
  select(email_id,body,label)

Manual Classification

For a simple manual approach, we’ll tally up the individual words most commonly associated with ‘ham’ and ‘spam’ email body content in our training dataset, and then classify the emails in our test dataset depending on the frequency of these words.

With the help of the tidytext library, we’ll un-nest the individual words (tokens), group them by their ham/spam label, and get a count of their frequency:

# tokenize
df_train_words <- df_train %>%
  unnest_tokens(word,body) %>%
  group_by(label, word) %>%
  summarize(count=n())
label word count
ham __ 36
ham ___ 14
ham ____ 3
ham _____ 8
ham ______ 1
ham ___________________________ 2

Right off the bat, we can see that a large number of tokens consist of special characters, numbers, and other ‘non-alpha’ content. While this information could be useful in a model-based approach, it’s probably too much noise for this simple analysis.

This next step filters out common stopwords, non-alpha and special characters, short terms (1-2 characters) and low-frequency terms.

# clean up
df_train_words_tidy <- df_train_words %>%
  anti_join(stop_words, by='word') %>% # filter out stop words
  filter(grepl('^[[:alpha:]]',word)) %>% # keep only terms that start with a letter
  filter(count>5) %>% # filter out low-frequency terms
  filter(str_length(word)>2) %>% # filter out 1-2 character terms
  arrange(desc(count),word)
label word count
ham http 5842
ham listinfo 1222
ham list 1219
ham linux 907
ham people 710
spam list 618

And for the final step, compose lists of words that appear exclusively in either ‘ham’ or ‘spam’ emails:

# get top words for spam/ham
df_spam_words <- df_train_words_tidy %>%
  filter(label=='spam')

df_ham_words <- df_train_words_tidy %>%
  filter(label=='ham')

# spam words only
df_spam_words_unique <- df_spam_words %>%
  anti_join(df_ham_words, by=c('word') )

# ham words only
df_ham_words_unique <- df_ham_words %>%
  anti_join(df_spam_words, by=c('word') )
label word count
ham pgp 288
ham perl 281
ham aug 262
ham apt 218
ham lists.freshrpms.net 216
ham log 208
label word count
spam aging 74
spam loan 71
spam herba 66
spam les 63
spam align3d22center22 60
spam loans 59

Now we’ll score the emails in the Test dataset. Start by tokenizing the words in the test emails, grouping them by email_id, and cleaning up the non-alpha and special characters (similar to the previous step, but keeping the low-frequency terms).

# tokenize
df_test_words <- df_test %>%
  select(!label) %>%
  unnest_tokens(word,body) %>%
  group_by(word,email_id) %>%
  summarize(count=n())

# clean up
df_test_words_tidy <- df_test_words %>%
  anti_join(stop_words, by='word') %>% 
  filter(grepl('^[[:alpha:]]',word)) %>% 
  filter(str_length(word)>2)
word email_id count
a___ 4599 1
a_______________________________________________exmh 2847 1
a_______________________________________________exmh 3065 1
a_______________________________________________webdev 2048 1
a_maximum_ 4982 1
a:active 847 1

And now we count up the frequency of unique ham/spam words per email.

# join with SPAM words list and get count
df_test_words_spam <- df_test_words_tidy %>%
  inner_join(y = df_spam_words_unique, by=c('word')) %>%
  select(word,email_id,count.x) %>%
  rename(spam = count.x)

# join with HAM words list and get count
df_test_words_ham <- df_test_words_tidy %>%
  inner_join(y = df_ham_words_unique, by=c('word')) %>%
  select(word,email_id,count.x) %>%
  rename(ham = count.x)

# total up SPAM score per email
df_test_spam_scores <- df_test_words_spam %>%
  group_by(email_id) %>%
  summarize(spam = sum(spam))

# total up HAM score per email
df_test_ham_scores <- df_test_words_ham %>%
  group_by(email_id) %>%
  summarize(ham = sum(ham))
email_id ham
1 1
2 1
3 1
8 1
9 2
13 1
email_id spam
2 6
3 6
4 9
8 6
9 1
10 1

And finally, join these ham/spam scores back to the test dataset and score each email: if a given email’s ‘spam’ score is greater than its ‘ham’ score, flag it as spam. To assess performance, ‘success’ is set to 1 whenever the predicted flag matches the original label.

# join back to df
df_results <- df %>%
  filter(segment==1) %>% # just the test segment
  left_join(df_test_spam_scores, by=c('email_id')) %>%
  left_join(df_test_ham_scores, by=c('email_id'))

# fix NAs
df_results[is.na(df_results)] <- 0 # so we can calculate without NA
  
# if spam score > ham score, flag the email as 'spam'. success=1 if predicted correctly.
df_results <- df_results %>%
  mutate(flag=if_else(spam>ham,'spam','ham')) %>%
  mutate(success=if_else(label==flag,1,0))
source label header body email_id segment spam ham flag success
data/unzip/spam spam From Thu Aug 22 13:17:22 2002 Return-Path: <> Delivered-To: Received: from localhost (localhost [127.0.0.1]) by phobos.labs.spamassassin.taint.org (Postfix) with ESMTP id 136B943C32 for <>; Thu, 22 Aug 2002 08:17:21 -0400 (EDT) Received: from mail.webnote.net [193.120.211.219] by localhost with POP3 (fetchmail-5.9.0) for (single-drop); Thu, 22 Aug 2002 13:17:21 +0100 (IST) Received: from dd_it7 ([210.97.77.167]) by webnote.net (8.9.3/8.9.3) with ESMTP id NAA04623 for <>; Thu, 22 Aug 2002 13:09:41 +0100 From: Received: from r-smtp.korea.com - 203.122.2.197 by dd_it7 with Microsoft SMTPSVC(5.5.1775.675.6); Sat, 24 Aug 2002 09:42:10 +0900 To: <> Subject: Life Insurance - Why Pay More? Date: Wed, 21 Aug 2002 20:31:57 -1600 MIME-Version: 1.0 Message-ID: <_it7> Content-Type: text/html; charset=“iso-8859-1” Content-Transfer-Encoding: quoted-printable Save up to 70% on Life Insurance.Why Spend More Than You Have To?Life Quote Savings Ensuring your family’s financial security is very important. Life Quote Savings makes buying life insurance simple and affordable. We Provide FREE Access to The Very Best Companies and The Lowest Rates. Life Quote Savings is FAST, EASY and SAVES you money! Let us help you get started with the best values in the country on new coverage. You can SAVE hundreds or even thousands of dollars by requesting a FREE quote from Lifequote Savings. Our service will take you less than 5 minutes to complete. Shop and compare. SAVE up to 70% on all types of Life insurance! Click Here For Your Free Quote! Protecting your family is the best investment you’ll ever make! If you are in receipt of this email in error and/or wish to be removed from our list, PLEASE CLICK HERE AND TYPE REMOVE. If you reside in any state which prohibits e-mail solicitations for insurance, please disregard this email. 1 1 0 1 ham 0
data/unzip/spam spam From Thu Aug 22 13:27:39 2002 Return-Path: <> Delivered-To: Received: from localhost (localhost [127.0.0.1]) by phobos.labs.spamassassin.taint.org (Postfix) with ESMTP id A7FD7454F6 for <>; Thu, 22 Aug 2002 08:27:38 -0400 (EDT) Received: from phobos [127.0.0.1] by localhost with IMAP (fetchmail-5.9.0) for (single-drop); Thu, 22 Aug 2002 13:27:38 +0100 (IST) Received: from lugh.tuatha.org ( [194.125.145.45]) by dogma.slashnull.org (8.11.6/8.11.6) with ESMTP id g7MCJiZ06043 for <>; Thu, 22 Aug 2002 13:19:44 +0100 Received: from lugh ( [127.0.0.1]) by lugh.tuatha.org (8.9.3/8.9.3) with ESMTP id NAA29323; Thu, 22 Aug 2002 13:18:52 +0100 Received: from email.qves.com ([67.104.83.251]) by lugh.tuatha.org (8.9.3/8.9.3) with ESMTP id NAA29282 for <>; Thu, 22 Aug 2002 13:18:37 +0100 X-Authentication-Warning: lugh.tuatha.org: Host [67.104.83.251] claimed to be email.qves.com Received: from qvp0091 ([169.254.6.22]) by email.qves.com with Microsoft SMTPSVC(5.0.2195.2966); Thu, 22 Aug 2002 06:18:18 -0600 From: “Slim Down” <> To: <> Date: Thu, 22 Aug 2002 06:18:18 -0600 Message-Id: <> MIME-Version: 1.0 Content-Type: text/plain; charset=“iso-8859-1” Content-Transfer-Encoding: 7bit X-Mailer: Microsoft CDO for Windows 2000 Thread-Index: AcJJ1f+3FWdz11AmR6uWbmQN5gGxxw== Content-Class: urn:content-classes:message X-Mimeole: Produced By Microsoft MimeOLE V6.00.2462.0000 X-Originalarrivaltime: 22 Aug 2002 12:18:18.0699 (UTC) FILETIME=[FFB949B0:01C249D5] Subject: [ILUG] Guaranteed to lose 10-12 lbs in 30 days 10.206 Sender: Errors-To: X-Mailman-Version: 1.1 Precedence: bulk List-Id: Irish Linux Users’ Group <ilug.linux.ie> X-Beenthere:
  1. Fight The Risk of Cancer!http://www.adclick.ws/p.cfm?o315&spk0072) Slim Down - Guaranteed to lose 10-12 lbs in 30 dayshttp://www.adclick.ws/p.cfm?o249&spk0073) Get the Child Support You Deserve - Free Legal Advicehttp://www.adclick.ws/p.cfm?o245&spk0024) Join the Web’s Fastest Growing Singles Communityhttp://www.adclick.ws/p.cfm?o259&spk0075) Start Your Private Photo Album Online!http://www.adclick.ws/p.cfm?o283&spk007Have a Wonderful Day,Offer ManagerPrizeMamaIf you wish to leave this list please use the link below.http://%7C17%7C114258– Irish Linux Users’ Group: ://www.linux.ie/mailman/listinfo/ilug for (un)subscription information.List maintainer:
2 1 6 1 spam 1

And now for some basic diagnostics. This manual approach correctly classified emails about 80% of the time!

# accuracy level
accuracy <- df_results %>%
  filter(segment==1) %>%
  select(success) %>%
  summarize(accuracy=sum(success)/length(success)) %>%
  pull()
x
0.8024894

Unfortunately, 19 legitimate emails were incorrectly labeled as spam; with 2058 ham emails in the test set, that works out to a false positive rate of roughly 0.9%. The larger weakness is in the other direction: 584 spam emails (well over half of the spam in the test set) slipped through as ham.

# confusion matrix
conf_matrix <- df_results %>%
  filter(segment==1) %>%
  select(label,success) %>% 
  group_by(label,success) %>%
  summarize(count=n())
label success count
ham 0 19
ham 1 2039
spam 0 584
spam 1 411
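
The false positive rate quoted above can be derived directly from these counts (a sketch, using the df_results dataframe built earlier):

# false positives: ham emails flagged as spam, as a share of all ham emails
fp_rate <- df_results %>%
  filter(label=='ham') %>%
  summarize(fp_rate = sum(flag=='spam')/n()) %>%
  pull()
fp_rate # 19 / 2058, roughly 0.009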

Modeled Approach - Random Forest

For comparison, training and testing a classifier on the same datasets with the randomForest library results in 100% accuracy - clearly too good to be true. Rather than classic overfitting, this looks like data leakage: the formula label ~ . includes email_id, and because the spam and ham emails were loaded in separate blocks, the row index alone perfectly separates the classes.

library(randomForest)

df_train_rf <- df_train %>%
  mutate(label=as.factor(label)) # convert to factor for classification

# fit a random forest on all remaining columns (email_id and body)
rf <- randomForest(label ~ ., data = df_train_rf, importance=TRUE, 
                   proximity=TRUE)

# predict labels for the held-out test emails
rf_predict <- predict(object=rf, newdata=df_test)

# compare predictions to the true labels
rf_results <- df_test %>%
  cbind(flag=rf_predict) %>%
  mutate(success=if_else(label==flag,1,0)) %>%
  select(!body)
email_id label flag success
1 spam spam 1
2 spam spam 1
3 spam spam 1
4 spam spam 1
8 spam spam 1
9 spam spam 1
# accuracy level
rf_accuracy <- rf_results %>%
  select(success) %>%
  summarize(accuracy=sum(success)/length(success)) %>%
  pull()
x
1
# confusion matrix
rf_conf_matrix <- rf_results %>%
  select(label,success) %>% 
  group_by(label,success) %>%
  summarize(count=n())
label success count
ham 1 2058
spam 1 995
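
Given that leakage, a fairer comparison would train the forest on word features only. Below is a sketch of one way to do that with the objects defined above; the vocabulary choice, the w_ column prefix, and the overall approach are my own illustrative choices rather than part of the original analysis:

# build a document-term matrix: one row per training email, one column per vocabulary word
library(tidyr) # pivot_wider()

vocab <- unique(df_train_words_tidy$word) # reuse the cleaned training vocabulary

dtm_train <- df_train %>%
  unnest_tokens(word, body) %>%
  filter(word %in% vocab) %>%
  count(email_id, word) %>%
  pivot_wider(names_from = word, values_from = n,
              values_fill = 0, names_prefix = 'w_') %>% # prefix avoids name collisions
  left_join(select(df_train, email_id, label), by = 'email_id') %>%
  mutate(label = as.factor(label))
# note: training emails containing none of the vocabulary words drop out here

# fit on word counts only - email_id is excluded, so nothing leaks the label
rf_words <- randomForest(x = select(dtm_train, -email_id, -label),
                         y = dtm_train$label)

# the same document-term transformation (with the same columns) would be applied
# to df_test before calling predict(rf_words, ...)

Because this version cannot memorize the row ordering, its accuracy should give a more honest estimate of how well a random forest separates spam from ham on word content alone.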