Introduction

Using a data set of labeled spam and ham (non-spam) e-mails, this project builds a classifier that predicts whether a new document is spam. It is an exercise in document classification, a technique that can boost productivity and help organize information.

Load libraries

library(utils)
library(stringr)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ purrr     1.0.2
## ✔ forcats   1.0.0     ✔ readr     2.1.4
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tm)
## Loading required package: NLP
## 
## Attaching package: 'NLP'
## 
## The following object is masked from 'package:ggplot2':
## 
##     annotate
library(e1071)

Import data

While I prefer to pull data through GitHub so the analysis is reproducible on any device, I could not find a way to do that (or a substitute) for folders that hold this many files. Instead, the data are read from file paths on the computer where this RMD was written, so the paths below must be changed to run it elsewhere.
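One possible substitute, sketched here and not used in the rest of this report, is to download a compressed archive and unpack it inside the script. The helper name `fetch_corpus` is hypothetical, and the URL in the usage comment assumes the folders came from the Apache SpamAssassin public corpus; it should be checked before use.

```r
# Hypothetical helper: download a tar archive, unpack it, and list the files.
fetch_corpus <- function(archive_url, dest_dir) {
  dir.create(dest_dir, recursive = TRUE, showWarnings = FALSE)
  archive_file <- file.path(dest_dir, basename(archive_url))
  # mode = "wb" keeps the compressed archive intact on every platform
  download.file(archive_url, archive_file, mode = "wb", quiet = TRUE)
  untar(archive_file, exdir = dest_dir)
  # remove the archive so only the unpacked e-mails remain
  unlink(archive_file)
  list.files(dest_dir, full.names = TRUE, recursive = TRUE)
}

# Example usage (needs network access; the URL is an assumption):
# ham_files <- fetch_corpus(
#   "https://spamassassin.apache.org/old/publiccorpus/20030228_easy_ham.tar.bz2",
#   file.path(tempdir(), "ham")
# )
```

The returned vector of file paths could then be passed to `readLines` in the same way as the local folders.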

#save paths that the folders containing the emails are saved under on my device
ham_path <- "/Users/kaylieevans/Documents/DATA607/Project04/SpamHam/easy_ham"
spam_path <- "/Users/kaylieevans/Documents/DATA607/Project04/SpamHam/spam_2"

#create a function that reads every file in a folder and returns the contents as a list
import_raw_emails <- function(folder_path) {
  raw_files <- list.files(folder_path, full.names = TRUE)
  raw_emails <- lapply(raw_files, function(file) {
    #read the file line by line, then rejoin the lines into a single string
    email_lines <- readLines(file, encoding = "latin1")
    paste(email_lines, collapse = "\n")
  })
  return(raw_emails)
}

#save spam and ham emails with this function
spam_raw <- import_raw_emails(spam_path)
ham_raw <- import_raw_emails(ham_path)
#create a combined data frame with email contents and a flag for spam or ham
spam_emails <- map_df(spam_raw, ~ data.frame(email_content = .x, spam_or_ham_fg = "spam", stringsAsFactors = FALSE))
ham_emails <- map_df(ham_raw, ~ data.frame(email_content = .x, spam_or_ham_fg = "ham", stringsAsFactors = FALSE))
emails <- bind_rows(spam_emails, ham_emails)
#how many spam and ham emails 
table(emails$spam_or_ham_fg)
## 
##  ham spam 
## 2501 1397
#what does a single observation look like
emails$email_content[1400]
## [1] "From timc@2ubh.com  Thu Aug 22 13:52:59 2002\nReturn-Path: <timc@2ubh.com>\nDelivered-To: zzzz@localhost.netnoteinc.com\nReceived: from localhost (localhost [127.0.0.1])\n\tby phobos.labs.netnoteinc.com (Postfix) with ESMTP id 0314547C66\n\tfor <zzzz@localhost>; Thu, 22 Aug 2002 08:52:58 -0400 (EDT)\nReceived: from phobos [127.0.0.1]\n\tby localhost with IMAP (fetchmail-5.9.0)\n\tfor zzzz@localhost (single-drop); Thu, 22 Aug 2002 13:52:59 +0100 (IST)\nReceived: from n16.grp.scd.yahoo.com (n16.grp.scd.yahoo.com\n    [66.218.66.71]) by dogma.slashnull.org (8.11.6/8.11.6) with SMTP id\n    g7MCrdZ07070 for <zzzz@spamassassin.taint.org>; Thu, 22 Aug 2002 13:53:39 +0100\nX-Egroups-Return: sentto-2242572-52733-1030020820-zzzz=spamassassin.taint.org@returns.groups.yahoo.com\nReceived: from [66.218.67.198] by n16.grp.scd.yahoo.com with NNFMP;\n    22 Aug 2002 12:53:40 -0000\nX-Sender: timc@2ubh.com\nX-Apparently-To: zzzzteana@yahoogroups.com\nReceived: (EGP: mail-8_1_0_1); 22 Aug 2002 12:53:39 -0000\nReceived: (qmail 76099 invoked from network); 22 Aug 2002 12:53:39 -0000\nReceived: from unknown (66.218.66.218) by m5.grp.scd.yahoo.com with QMQP;\n    22 Aug 2002 12:53:39 -0000\nReceived: from unknown (HELO rhenium.btinternet.com) (194.73.73.93) by\n    mta3.grp.scd.yahoo.com with SMTP; 22 Aug 2002 12:53:39 -0000\nReceived: from host217-36-23-185.in-addr.btopenworld.com ([217.36.23.185])\n    by rhenium.btinternet.com with esmtp (Exim 3.22 #8) id 17hrT0-0004gj-00\n    for forteana@yahoogroups.com; Thu, 22 Aug 2002 13:53:38 +0100\nX-Mailer: Microsoft Outlook Express Macintosh Edition - 4.5 (0410)\nTo: zzzzteana <zzzzteana@yahoogroups.com>\nX-Priority: 3\nMessage-Id: <E17hrT0-0004gj-00@rhenium.btinternet.com>\nFrom: \"Tim Chapman\" <timc@2ubh.com>\nX-Yahoo-Profile: tim2ubh\nMIME-Version: 1.0\nMailing-List: list zzzzteana@yahoogroups.com; contact\n    forteana-owner@yahoogroups.com\nDelivered-To: mailing list zzzzteana@yahoogroups.com\nPrecedence: 
bulk\nList-Unsubscribe: <mailto:zzzzteana-unsubscribe@yahoogroups.com>\nDate: Thu, 22 Aug 2002 13:52:38 +0100\nSubject: [zzzzteana] Moscow bomber\nReply-To: zzzzteana@yahoogroups.com\nContent-Type: text/plain; charset=US-ASCII\nContent-Transfer-Encoding: 7bit\n\nMan Threatens Explosion In Moscow \n\nThursday August 22, 2002 1:40 PM\nMOSCOW (AP) - Security officers on Thursday seized an unidentified man who\nsaid he was armed with explosives and threatened to blow up his truck in\nfront of Russia's Federal Security Services headquarters in Moscow, NTV\ntelevision reported.\nThe officers seized an automatic rifle the man was carrying, then the man\ngot out of the truck and was taken into custody, NTV said. No other details\nwere immediately available.\nThe man had demanded talks with high government officials, the Interfax and\nITAR-Tass news agencies said. Ekho Moskvy radio reported that he wanted to\ntalk with Russian President Vladimir Putin.\nPolice and security forces rushed to the Security Service building, within\nblocks of the Kremlin, Red Square and the Bolshoi Ballet, and surrounded the\nman, who claimed to have one and a half tons of explosives, the news\nagencies said. Negotiations continued for about one and a half hours outside\nthe building, ITAR-Tass and Interfax reported, citing witnesses.\nThe man later drove away from the building, under police escort, and drove\nto a street near Moscow's Olympic Penta Hotel, where authorities held\nfurther negotiations with him, the Moscow police press service said. The\nmove appeared to be an attempt by security services to get him to a more\nsecure location. \n\n------------------------ Yahoo! Groups Sponsor ---------------------~-->\n4 DVDs Free +s&p Join Now\nhttp://us.click.yahoo.com/pt6YBB/NXiEAA/mG3HAA/7gSolB/TM\n---------------------------------------------------------------------~->\n\nTo unsubscribe from this group, send an email to:\nforteana-unsubscribe@egroups.com\n\n \n\nYour use of Yahoo! 
Groups is subject to http://docs.yahoo.com/info/terms/ \n\n\n"

Clean and Transform the Data

Widen the data frame

In the email above, lines are separated by the escape sequence \n (a newline). Let’s try splitting on these line breaks and making each piece a new column.

Trying delimiter

#saving the first 30 delimited columns
df_wide <- separate(emails, email_content, into = paste0("text_", 1:30), sep = "\n")
## Warning: Expected 30 pieces. Additional pieces discarded in 3394 rows [1, 2, 3, 4, 5, 6,
## 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
## Warning: Expected 30 pieces. Missing pieces filled with `NA` in 452 rows [107, 205, 213,
## 266, 362, 474, 520, 743, 783, 996, 1150, 1193, 1234, 1367, 1443, 1538, 1540,
## 1541, 1542, 1543, ...].
#check on the data frame
head(df_wide)
##                                                       text_1
## 1         From ilug-admin@linux.ie  Tue Aug  6 11:51:02 2002
## 2         From lmrn@mailexcite.com  Mon Jun 24 17:03:24 2002
## 3     From amknight@mailexcite.com  Mon Jun 24 17:03:49 2002
## 4     From jordan23@mailexcite.com  Mon Jun 24 17:04:20 2002
## 5 From merchantsworld2001@juno.com  Tue Aug  6 11:01:33 2002
## 6       Received: from hq.pro-ns.net (localhost [127.0.0.1])
##                                                            text_2
## 1                              Return-Path: <ilug-admin@linux.ie>
## 2                        Return-Path: merchantsworld2001@juno.com
## 3                        Return-Path: merchantsworld2001@juno.com
## 4                        Return-Path: merchantsworld2001@juno.com
## 5                      Return-Path: <merchantsworld2001@juno.com>
## 6 \tby hq.pro-ns.net (8.12.5/8.12.5) with ESMTP id g6NLtshY000264
##                                                                   text_3
## 1                            Delivered-To: yyyy@localhost.netnoteinc.com
## 2                                Delivery-Date: Mon May 13 04:46:13 2002
## 3                                Delivery-Date: Wed May 15 08:58:23 2002
## 4                                Delivery-Date: Thu May 16 11:03:55 2002
## 5                            Delivered-To: yyyy@localhost.netnoteinc.com
## 6 \t(version=TLSv1/SSLv3 cipher=EDH-DSS-DES-CBC3-SHA bits=168 verify=NO)
##                                                                             text_4
## 1                                 Received: from localhost (localhost [127.0.0.1])
## 2                Received: from mandark.labs.netnoteinc.com ([213.105.180.140]) by
## 3                Received: from mandark.labs.netnoteinc.com ([213.105.180.140]) by
## 4                Received: from mandark.labs.netnoteinc.com ([213.105.180.140]) by
## 5                                 Received: from localhost (localhost [127.0.0.1])
## 6 \tfor <cypherpunks-forward@ds.pro-ns.net>; Tue, 23 Jul 2002 16:55:55 -0500 (CDT)
##                                                                   text_5
## 1     \tby phobos.labs.netnoteinc.com (Postfix) with ESMTP id 9E1F5441DD
## 2     dogma.slashnull.org (8.11.6/8.11.6) with ESMTP id g4D3kCe15097 for
## 3     dogma.slashnull.org (8.11.6/8.11.6) with ESMTP id g4F7wIe23864 for
## 4     dogma.slashnull.org (8.11.6/8.11.6) with ESMTP id g4GA3qe29480 for
## 5     \tby phobos.labs.netnoteinc.com (Postfix) with ESMTP id 8399C44126
## 6                                 \t(envelope-from cpunks@hq.pro-ns.net)
##                                                        text_6
## 1 \tfor <jm@localhost>; Tue,  6 Aug 2002 06:48:09 -0400 (EDT)
## 2            <jm@jmason.org>; Mon, 13 May 2002 04:46:12 +0100
## 3            <jm@jmason.org>; Wed, 15 May 2002 08:58:18 +0100
## 4            <jm@jmason.org>; Thu, 16 May 2002 11:03:52 +0100
## 5 \tfor <jm@localhost>; Tue,  6 Aug 2002 05:55:17 -0400 (EDT)
## 6                           Received: (from cpunks@localhost)
##                                                                       text_7
## 1                                          Received: from phobos [127.0.0.1]
## 2 Received: from 203.129.205.5.205.129.203.in-addr.arpa ([203.129.205.5]) by
## 3      Received: from webcust2.hightowertech.com (webcust2.hightowertech.com
## 4         Received: from webnote.net (mail.webnote.net [193.120.211.219]) by
## 5                          Received: from mail.webnote.net [193.120.211.219]
## 6                \tby hq.pro-ns.net (8.12.5/8.12.5/Submit) id g6NLtsB8000241
##                                                                           text_8
## 1                                     \tby localhost with IMAP (fetchmail-5.9.0)
## 2      mandark.labs.netnoteinc.com (8.11.2/8.11.2) with SMTP id g4D3k2D12605 for
## 3          [216.41.166.100]) by mandark.labs.netnoteinc.com (8.11.2/8.11.2) with
## 4     mandark.labs.netnoteinc.com (8.11.2/8.11.2) with ESMTP id g4GA3oD28650 for
## 5                                     \tby localhost with POP3 (fetchmail-5.9.0)
## 6 \tfor cypherpunks-forward@ds.pro-ns.net; Tue, 23 Jul 2002 16:55:54 -0500 (CDT)
##                                                                         text_9
## 1      \tfor jm@localhost (single-drop); Tue, 06 Aug 2002 11:48:09 +0100 (IST)
## 2                         <jm@netnoteinc.com>; Mon, 13 May 2002 04:46:04 +0100
## 3     ESMTP id g4F7wGD24120 for <jm@netnoteinc.com>; Wed, 15 May 2002 08:58:17
## 4                         <jm@netnoteinc.com>; Thu, 16 May 2002 11:03:51 +0100
## 5      \tfor jm@localhost (single-drop); Tue, 06 Aug 2002 10:55:17 +0100 (IST)
## 6           Received: from locust.minder.net (locust.minder.net [66.92.53.74])
##                                                                              text_10
## 1          Received: from lugh.tuatha.org (root@lugh.tuatha.org [194.125.145.45]) by
## 2                                Received: from html (unverified [207.95.174.49]) by
## 3                                                                              +0100
## 4              Received: from webcust2.hightowertech.com (webcust2.hightowertech.com
## 5 Received: from ns1.snaapp.com (066.dsl6660167.bstatic.surewest.net [66.60.167.66])
## 6                    \tby hq.pro-ns.net (8.12.5/8.12.5) with ESMTP id g6NLtlhY000182
##                                                                         text_11
## 1            dogma.slashnull.org (8.11.6/8.11.6) with ESMTP id g72LqWv13294 for
## 2       203.129.205.5.205.129.203.in-addr.arpa (EMWAC SMTPRS 0.83) with SMTP id
## 3    Received: from html ([206.216.197.214]) by webcust2.hightowertech.com with
## 4     [216.41.166.100]) by webnote.net (8.9.3/8.9.3) with ESMTP id BAA11067 for
## 5                         \tby webnote.net (8.9.3/8.9.3) with ESMTP id BAA09623
## 6        \t(version=TLSv1/SSLv3 cipher=EDH-DSS-DES-CBC3-SHA bits=168 verify=NO)
##                                                                      text_12
## 1                       <jm-ilug@jmason.org>; Fri, 2 Aug 2002 22:52:32 +0100
## 2     <B0000178595@203.129.205.5.205.129.203.in-addr.arpa>; Mon, 13 May 2002
## 3        Microsoft SMTPSVC(5.5.1877.197.19); Wed, 15 May 2002 00:55:53 -0700
## 4                       <jm@netnoteinc.com>; Thu, 16 May 2002 01:58:00 +0100
## 5                  \tfor <jm@netnoteinc.com>; Sun, 4 Aug 2002 01:37:55 +0100
## 6   \tfor <cypherpunks@ds.pro-ns.net>; Tue, 23 Jul 2002 16:55:50 -0500 (CDT)
##                                                                     text_13
## 1       Received: from lugh (root@localhost [127.0.0.1]) by lugh.tuatha.org
## 2                                                            09:04:46 +0530
## 3                                             From: amknight@mailexcite.com
## 4 Received: from html ([199.35.236.73]) by webcust2.hightowertech.com  with
## 5                           Message-Id: <200208040037.BAA09623@webnote.net>
## 6                                 \t(envelope-from cpunks@waste.minder.net)
##                                                                    text_14
## 1     (8.9.3/8.9.3) with ESMTP id WAA31224; Fri, 2 Aug 2002 22:50:17 +0100
## 2         Message-Id: <B0000178595@203.129.205.5.205.129.203.in-addr.arpa>
## 3                                                    To: cbmark@cbmark.com
## 4      Microsoft SMTPSVC(5.5.1877.197.19); Wed, 15 May 2002 13:50:57 -0700
## 5                 Received: from html ([199.35.244.221]) by ns1.snaapp.com
## 6             Received: from waste.minder.net (daemon@waste [66.92.53.73])
##                                                                         text_15
## 1          Received: from bettyjagessar.com (w142.z064000057.nyc-ny.dsl.cnc.net
## 2                                                     From: lmrn@mailexcite.com
## 3 Subject: New Improved Fat Burners, Now With TV Fat Absorbers! Time:6:25:49 PM
## 4                                                 From: jordan23@mailexcite.com
## 5                                  (Post.Office MTA v3.1.2 release (PO205-101c)
## 6             \tby locust.minder.net (8.11.6/8.11.6) with ESMTP id g6NLtjJ48674
##                                                                          text_16
## 1     [64.0.57.142]) by lugh.tuatha.org (8.9.3/8.9.3) with ESMTP id WAA31201 for
## 2                                                     To: ranmoore@cybertime.net
## 3                                                Date: Wed, 30 Jul 1980 18:25:49
## 4                                                        To: ranmoore@swbell.net
## 5                                    ID# 0-47762U100L2S100) with SMTP id ABD354;
## 6       \tfor <cypherpunks@ds.pro-ns.net>; Tue, 23 Jul 2002 17:55:45 -0400 (EDT)
##                                                                         text_17
## 1                               <ilug@linux.ie>; Fri, 2 Aug 2002 22:50:11 +0100
## 2          Subject: Real Protection, Stun Guns!  Free Shipping! Time:2:01:35 PM
## 3                                                             MIME-Version: 1.0
## 4 Subject: New Improved Fat Burners, Now With TV Fat Absorbers! Time:7:20:54 AM
## 5                                                Sat, 3 Aug 2002 17:32:59 -0700
## 6                                     \t(envelope-from cpunks@waste.minder.net)
##                                                                              text_18
## 1 X-Authentication-Warning: lugh.tuatha.org: Host w142.z064000057.nyc-ny.dsl.cnc.net
## 2                                                    Date: Mon, 28 Jul 1980 14:01:35
## 3                   Message-Id: <0845b5355070f52WEBCUST2@webcust2.hightowertech.com>
## 4                                                    Date: Thu, 31 Jul 1980 07:20:54
## 5                                                            From: yyyy@pluriproj.pt
## 6                                                  Received: (from cpunks@localhost)
##                                                 text_19
## 1         [64.0.57.142] claimed to be bettyjagessar.com
## 2                                     MIME-Version: 1.0
## 3                                          X-Keywords: 
## 4                                     MIME-Version: 1.0
## 5                 Reply-To: merchantsworld2001@juno.com
## 6 \tby waste.minder.net (8.11.6/8.11.6) id g6NLtj014163
##                                                            text_20
## 1  Received: from 64.0.57.142 [202.63.165.34] by bettyjagessar.com
## 2                                                     X-Keywords: 
## 3                       Content-Type: text/html; charset="DEFAULT"
## 4 Message-Id: <0925c5750200f52WEBCUST2@webcust2.hightowertech.com>
## 5                                            To: yyyy@pluriproj.pt
## 6 \tfor cypherpunks@ds.pro-ns.net; Tue, 23 Jul 2002 17:55:45 -0400
##                                                                                   text_21
## 1                     (SMTPD32-7.06 EVAL) id A42A7FC01F2; Fri, 02 Aug 2002 02:18:18 -0400
## 2                                              Content-Type: text/html; charset="DEFAULT"
## 3                                                                                        
## 4                                                                            X-Keywords: 
## 5                      Subject: Never Repay Cash Grants, $500 - $50,000, Secret Revealed!
## 6 Received: from huffmanoil.net (216-166-208-195.clec.madisonriver.net [216.166.208.195])
##                                                            text_22
## 1                            Message-Id: <1028311679.886@0.57.142>
## 2                                                                 
## 3                                                           <html>
## 4                       Content-Type: text/html; charset="DEFAULT"
## 5                                  Date: Sun, 19 Oct 1980 10:55:16
## 6 \tby waste.minder.net (8.11.6/8.11.6) with ESMTP id g6NLtfR14140
##                                                            text_23
## 1                             Date: Fri, 02 Aug 2002 23:37:59 0530
## 2                                                           <html>
## 3                                                           <body>
## 4                                                                 
## 5                                                Mime-Version: 1.0
## 6 \tfor <cpunks@waste.minder.net>; Tue, 23 Jul 2002 17:55:41 -0400
##                                                text_24
## 1                                    To: ilug@linux.ie
## 2                                               <body>
## 3                                             <center>
## 4                                               <html>
## 5           Content-Type: text/html; charset="DEFAULT"
## 6 Received: from html [61.230.8.153] by huffmanoil.net
##                                                           text_25
## 1                    From: "Start Now" <startnow2002@hotmail.com>
## 2                                                        <center>
## 3                                                             <b>
## 4                                                          <body>
## 5                                                                
## 6   (SMTPD32-7.10) id A2E8640144; Tue, 23 Jul 2002 05:33:28 -0400
##                       text_26
## 1           MIME-Version: 1.0
## 2                        <h3>
## 3         <font color="blue">
## 4                    <center>
## 5               <html><xbody>
## 6 From: 3b3fke@ms10.hinet.net
##                                                                                                      text_27
## 1                                                Content-Type: text/plain; charset="US-ASCII"; format=flowed
## 2                                                                                        <font color="blue">
## 3 *****Bonus Fat Absorbers As Seen On TV, Included Free With Purchase Of 2 Or More Bottle, $24.95 Value*****
## 4                                                                                                        <b>
## 5                                                                                        <hr width = "100%">
## 6                                                                                To: cpunks@waste.minder.net
##                                                text_28
## 1                Subject: [ILUG] STOP THE MLM INSANITY
## 2                                                  <b>
## 3                                              </font>
## 4                                  <font color="blue">
## 5                            <center><h3><font color =
## 6 Subject: ÁÙ¦b¥Î20%ªº«H¥Î¥d´`Àô¶Ü??? Time:PM 05:36:34
##                                                                                                      text_29
## 1                                                                                Sender: ilug-admin@linux.ie
## 2                             The Need For Safety Is Real In 2002, You Might Only Get One Chance - Be Ready!
## 3                                                                                                       <br>
## 4 *****Bonus Fat Absorbers As Seen On TV, Included Free With Purchase Of 2 Or More Bottle, $24.95 Value*****
## 5                                                                 "#44C300"><b>Government Grants E-Book 2002
## 6                                                                            Date: Fri, 23 Jul 1993 17:36:34
##                                                                                     text_30
## 1                                                            Errors-To: ilug-admin@linux.ie
## 2                                                                                       <p>
## 3                                                                                      <br>
## 4                                                                                   </font>
## 5 edition, Just $15.95. Summer Sale, Good Until August 10, 2002!  Was $49.95.</font></b><p>
## 6                                                                         Mime-Version: 1.0
##   spam_or_ham_fg
## 1           spam
## 2           spam
## 3           spam
## 4           spam
## 5           spam
## 6           spam

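Splitting on the newline does not line the columns up: the same column holds different fields in different emails. This may be because the emails do not share a fixed layout. A field that is missing from one email leaves no empty line behind, and long headers wrap onto extra lines, so a given field lands in a different position from one email to the next. For the rest of this project, let’s go back to the data frame as it was before the widening.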
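An alternative to fixed-position columns, sketched here and not used in the rest of this project, is to split each email at its first blank line, which is where the headers end and the body begins in standard email formatting. The function name `split_email` and the sample message are hypothetical.

```r
# Hypothetical helper: split one raw email into header and body
# at the first blank line (two consecutive newlines).
split_email <- function(raw_email) {
  split_at <- regexpr("\n\n", raw_email, fixed = TRUE)
  if (split_at == -1) {
    # no blank line found: treat the whole text as header
    return(list(header = raw_email, body = ""))
  }
  list(
    header = substr(raw_email, 1, split_at - 1),
    body = substr(raw_email, split_at + 2, nchar(raw_email))
  )
}

# made-up message, for illustration only
example_email <- "From: a@example.com\nSubject: Hello\n\nFirst body line\nSecond body line"
parts <- split_email(example_email)
parts$header
parts$body
```

This keeps the variable-length header block in one piece instead of spreading it over a fixed number of columns.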

Create Corpus

The corpus is built with the VCorpus function from the tm package.

#save corpus as email_corpus
email_corpus <- VCorpus(VectorSource(emails$email_content))

#check the corpus
writeLines(head(strwrap(email_corpus[[1]]), 3))
## From ilug-admin@linux.ie Tue Aug 6 11:51:02 2002 Return-Path:
## <ilug-admin@linux.ie> Delivered-To: yyyy@localhost.netnoteinc.com
## Received: from localhost (localhost [127.0.0.1]) by
#removing punctuation with tm_map
email_corpus <- tm_map(email_corpus, removePunctuation)

#removing numbers with tm_map
email_corpus <- tm_map(email_corpus, removeNumbers)

#removing white space with tm_map
email_corpus <- tm_map(email_corpus, stripWhitespace)

#removing english stop words with tm_map
email_corpus <- tm_map(email_corpus, removeWords, stopwords("english"))

#stem with tm_map
email_corpus <- tm_map(email_corpus, stemDocument)

#convert to lowercase with tm_map so the result stays a corpus
email_corpus <- tm_map(email_corpus, content_transformer(tolower))
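The order of these cleaning steps can matter. The short sketch below, on a made-up one-sentence corpus, suggests that `removeWords` matches case-sensitively, so stop words that still contain capitals survive unless the text is lowercased first. The object names are illustrative only.

```r
library(tm)

# hypothetical one-document corpus, for illustration only
toy_corpus <- VCorpus(VectorSource("The FREE offer IS waiting"))

# stop words removed before lowercasing: "The" and "IS" are not matched
stop_first <- tm_map(toy_corpus, removeWords, stopwords("english"))

# lowercasing first lets the lowercase stop word list match
lower_first <- tm_map(toy_corpus, content_transformer(tolower))
lower_first <- tm_map(lower_first, removeWords, stopwords("english"))

stop_first_text <- content(stop_first[[1]])
lower_first_text <- content(lower_first[[1]])
stop_first_text
lower_first_text
```

Under this assumption, moving the lowercase step ahead of the stop word step would remove more stop words from the emails.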

Create Document Term Matrix

#create dtm as email_dtm
email_dtm <- DocumentTermMatrix(email_corpus)

#check the contents
tm::inspect(email_dtm)
## <<DocumentTermMatrix (documents: 3898, terms: 94632)>>
## Non-/sparse entries: 653624/368221912
## Sparsity           : 100%
## Maximal term length: 868
## Weighting          : term frequency (tf)
## Sample             :
##       Terms
## Docs   aug esmtp from localhost mon oct postfix receiv sep wed
##   1079   0     4    3         1   0   0       0      9   0   0
##   1662   0     5    2         4   0   8       3      6   0   4
##   1967   0     5    2         5   0   0       3      7   8   1
##   2067   0     5    2         4   0   0       3      8   8   0
##   2074   0     4    3         3   0   0       3      8   9   0
##   2162   0     5    3         3   0   0       3      6   8   1
##   2339   0     4    2         3   0   9       3      6   0   4
##   28     0     2    3         0   4   0       2      8   0   0
##   3898   0     0    0         0   0   0       0      0   0   0
##   51     0     2    4         0   0   0       2      8   0   0
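
With 94632 terms and a sparsity that rounds to 100%, most terms appear in very few emails. One option for shrinking the matrix, sketched below on a hypothetical toy corpus and not on the project data, is `removeSparseTerms` from tm. The threshold of 0.6 is chosen only so that the toy example has something to drop.

```r
library(tm)

# hypothetical toy documents, for illustration only
toy_docs <- c("free money now", "free offer now",
              "meeting notes attached", "free lunch meeting")
toy_dtm <- DocumentTermMatrix(VCorpus(VectorSource(toy_docs)))

# drop terms that are absent from 60% or more of the documents
small_dtm <- removeSparseTerms(toy_dtm, 0.6)

# compare the number of terms before and after
ncol(toy_dtm)
Terms(small_dtm)
```

On the real matrix a much higher threshold, such as 0.99, would be a more cautious starting point, since it only removes terms found in about 1% of emails or fewer.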

Machine Learning || Naïve Bayes Classifier

For this section, the goal is to use R libraries to train a machine learning (ML) model that categorizes an email as accurately as possible. The model is trained on one portion of the available data and tested on the remaining, unused portion.

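In ML, it is common to use an 80/20 split, where 80% of the data is used to train the model and 20% to check it. Not all of the data is used for training because a model that is evaluated on the same emails it learned from can look accurate simply by overfitting; the held-out 20% gives an estimate of how it performs on unseen emails.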

The classifier used here is the Naïve Bayes classifier, “a probabilistic approach based on Bayes’ theorem with the assumption of independence between features” (GeeksforGeeks.org). It is often used for text classification tasks such as spam filtering and sentiment analysis, which makes it a natural fit for this project.
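
As a hedged summary of the idea, in standard notation that is not taken from the quoted source: for an email containing the words w_1, ..., w_n, the classifier scores each class c as shown below and predicts the class with the larger score. The independence assumption is what allows the likelihood to be written as a simple product over words.

```latex
P(c \mid w_1, \dots, w_n) \;\propto\; P(c) \prod_{i=1}^{n} P(w_i \mid c),
\qquad c \in \{\text{spam}, \text{ham}\}
```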

#splitting the data into training and testing sets
set.seed(64)
train_index <- sample(1:nrow(emails), 0.8 * nrow(emails))
train_data <- emails[train_index, ]
test_data <- emails[-train_index, ]

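#train the Naïve Bayes model
#note: the only predictor in train_data is the raw email_content column; the document term matrix built above is not used here
model <- naiveBayes(spam_or_ham_fg ~ ., data = train_data)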

#predict on test data with predict function
predictions <- predict(model, newdata = test_data)

#check accuracy
accuracy <- mean(predictions == test_data$spam_or_ham_fg)
accuracy
## [1] 0.6358974
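
A useful reference point for this number is the accuracy of always predicting the most common class. The sketch below computes that baseline from the class counts printed earlier in this report. It uses the counts for the full data set, so the share of ham in the 20% test set may differ slightly.

```r
# class counts copied from the table() output earlier in this report
class_counts <- c(ham = 2501, spam = 1397)

# accuracy of a rule that labels every email with the most common class
baseline_accuracy <- max(class_counts) / sum(class_counts)

# accuracy reported for the model above
model_accuracy <- 0.6358974

round(baseline_accuracy, 4)
model_accuracy > baseline_accuracy
```

If the second line returns FALSE, the model is doing no better than labeling every email as ham.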

Conclusion

In conclusion, the Naïve Bayes classifier achieved an accuracy of about 64% (0.636) in categorizing emails as spam or ham. While this beats a coin flip, it does not beat the simplest baseline: 2501 of the 3898 emails (about 64%) are ham, so always predicting ham would score roughly the same. One likely reason is that the model was fit on the raw email text as a single predictor rather than on the cleaned document term matrix. Further improvements can be made by training on the document term matrix, exploring different preprocessing techniques, feature engineering, and trying alternative machine learning algorithms. Overall, this project lays the groundwork for future learning in document classification.
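
As one possible starting point for the feature work mentioned above, the sketch below fits `naiveBayes` on term features from a document term matrix instead of on raw text. The messages, labels, and object names are hypothetical toy values, and recoding counts as present/absent factors is one common choice, not the only one.

```r
library(tm)
library(e1071)

# hypothetical toy messages and labels, for illustration only
toy_text <- c("free money offer", "free prize money", "win free prize",
              "project meeting notes", "meeting agenda attached",
              "notes from project")
toy_label <- factor(c("spam", "spam", "spam", "ham", "ham", "ham"))

# one row per message, one column per term
toy_dtm <- DocumentTermMatrix(VCorpus(VectorSource(toy_text)))
toy_counts <- as.data.frame(as.matrix(toy_dtm))

# recode each term count as a present/absent factor
to_presence <- function(term_count) {
  factor(term_count > 0, levels = c(FALSE, TRUE), labels = c("no", "yes"))
}
toy_features <- as.data.frame(lapply(toy_counts, to_presence))

# laplace = 1 avoids zero probabilities for terms unseen in one class
toy_model <- naiveBayes(toy_features, toy_label, laplace = 1)
toy_pred <- predict(toy_model, toy_features)
table(predicted = toy_pred, actual = toy_label)
```

On the real data, the same steps would be applied to the rows of the email document term matrix, with the train and test indices used to split both the features and the labels.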