Overview

In this project we use a set of spam and legitimate (ham) emails to predict whether an email is spam. The example emails were downloaded from the SpamAssassin public corpus: https://spamassassin.apache.org/old/publiccorpus/.
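As a rough sketch of how the downloaded emails could be loaded, the base R helper below reads every file in a folder into a data frame with one row per email. The function name `read_emails`, the column names, and the folder names in the usage comment are assumptions for illustration; the report does not show its own loading code.

```r
# Read every file in a directory into a data frame with one row per email.
# `read_emails` and its columns are illustrative names, not the report's own.
read_emails <- function(dir, label) {
  files <- list.files(dir, full.names = TRUE)
  text <- vapply(files, function(f) {
    # Corpus files are not guaranteed to be UTF-8, so read them permissively.
    lines <- readLines(f, warn = FALSE, encoding = "latin1")
    paste(lines, collapse = "\n")
  }, character(1), USE.NAMES = FALSE)
  data.frame(file = basename(files),
             text = text,
             spam = rep(label, length(files)),  # 1 = spam, 0 = ham
             stringsAsFactors = FALSE)
}

# Hypothetical folder names after extracting the archives:
# emails <- rbind(read_emails("spam", 1), read_emails("easy_ham", 0))
```

Keeping the label next to the text from the start means the later counting and modelling steps can work from a single table.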

Libraries Needed:

  • tm
  • dplyr
  • tidyverse
  • stringr
  • tidytext
  • caTools
  • e1071
  • wordcloud
  • RColorBrewer
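
One possible way to check for and attach the packages listed above is sketched here. It only reports which packages are missing and does not install anything; the variable names are illustrative.

```r
# Packages named in the list above.
pkgs <- c("tm", "dplyr", "tidyverse", "stringr", "tidytext",
          "caTools", "e1071", "wordcloud", "RColorBrewer")

# Find packages that are not installed yet (nothing is attached at this point).
missing_pkgs <- pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]

if (length(missing_pkgs) > 0) {
  message("Install first: ", paste(missing_pkgs, collapse = ", "))
} else {
  # Attach each package once all of them are available.
  invisible(lapply(pkgs, library, character.only = TRUE))
}
```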

A cleaning function is used to remove much of the text that relates to sending an email, such as headers and routing information. We don’t want to analyze this text because it is not part of the actual message.
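
The report's own cleaning function is not shown, so the following is only a minimal sketch of one common approach: treat everything before the first blank line as the header block and keep what follows. The name `strip_headers` and the sample message are made up for illustration.

```r
# Drop the header block: everything up to and including the first blank line.
# `strip_headers` is a hypothetical name, not the report's own function.
strip_headers <- function(raw) {
  lines <- strsplit(raw, "\r?\n")[[1]]
  blank <- which(lines == "")
  if (length(blank) == 0) return("")  # no blank line, so no body was found
  body <- lines[seq_along(lines) > blank[1]]
  paste(body, collapse = "\n")
}

msg <- "From: a@example.com\nSubject: Offer\n\nClick here for free money"
strip_headers(msg)
# [1] "Click here for free money"
```

This simple rule only removes the leading header block. Transport text that appears elsewhere in a message, for example in quoted or forwarded mail, would need extra handling.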

Spam Word Cloud

Here we generate a word cloud from the words in the spam emails. Frequent spam words include:

  • esmtp
  • email
  • localhost
  • click
  • mailing
  • money
  • free
  • receive

Some of these words, such as esmtp and localhost, appear to come from the mail-delivery text rather than the message itself, but others make sense: spam emails try to get users to click and give money.

## # A tibble: 14,014 x 2
##    term      term_count
##    <chr>          <dbl>
##  1 =               4317
##  2 esmtp           3138
##  3 email           1827
##  4 localhost       1745
##  5 mv              1398
##  6 smtp            1283
##  7 free            1195
##  8 postfix          966
##  9 list             875
## 10 na               861
## # … with 14,004 more rows
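
A term-count table like the one above could be produced along these lines. This is a base R stand-in using three invented sentences, not the tm/tidytext pipeline the report presumably used, so the tokens will not match the real output. The word cloud call is guarded so the sketch still runs if the plotting packages are missing.

```r
# Toy stand-ins for cleaned spam bodies; the real input is the corpus text.
spam_text <- c("Click here for free money",
               "Free email offer, click now",
               "Receive free money by email")

# Lowercase, split on anything that is not a letter, and drop empty tokens.
tokens <- unlist(strsplit(tolower(spam_text), "[^a-z]+"))
tokens <- tokens[tokens != ""]

# Count each term and sort from most to least frequent.
counts <- sort(table(tokens), decreasing = TRUE)
term_counts <- data.frame(term = names(counts),
                          term_count = as.numeric(counts),
                          stringsAsFactors = FALSE)
head(term_counts, 3)

# Draw the cloud only if the plotting packages are installed.
if (requireNamespace("wordcloud", quietly = TRUE) &&
    requireNamespace("RColorBrewer", quietly = TRUE)) {
  wordcloud::wordcloud(words = term_counts$term,
                       freq = term_counts$term_count,
                       min.freq = 1, max.words = 50,
                       colors = RColorBrewer::brewer.pal(8, "Dark2"))
}
```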

Prediction

About 60% of the emails are not spam and about 40% are spam. We train a Naive Bayes classifier to predict whether an email is spam. With more time and knowledge, I would have explored other classifiers.
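
The general shape of this step might look like the sketch below, which uses `sample.split()` from caTools and `naiveBayes()` from e1071 (both in the library list). The feature table, the 70/30 split ratio, and the variable names are assumptions made up for illustration; the report does not state which features or split it used.

```r
library(e1071)    # naiveBayes()
library(caTools)  # sample.split()

set.seed(42)

# Toy feature table: counts of two words per email plus the spam label.
# These numbers are invented purely to make the sketch runnable.
emails <- data.frame(
  free  = c(3, 2, 4, 0, 0, 1, 0, 3, 0, 2),
  click = c(2, 3, 1, 0, 1, 0, 0, 2, 0, 3),
  spam  = factor(c(1, 1, 1, 0, 0, 0, 0, 1, 0, 1))
)

# Class balance as percentages.
prop.table(table(emails$spam)) * 100

# Hold out a test set, keeping the class ratio similar in both parts.
in_train <- sample.split(emails$spam, SplitRatio = 0.7)
train <- emails[in_train, ]
test  <- emails[!in_train, ]

# Fit the classifier and tabulate predictions against the true labels.
model     <- naiveBayes(spam ~ ., data = train)
predicted <- predict(model, newdata = test)
table(predicted, actual = test$spam)
```

Because the toy data is invented, its output is unrelated to the results reported for the real corpus.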

## 
##       0       1 
## 59.9231 40.0769

##        
## predict     0     1
##       0 36996    12
##       1     0 25002
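
To put the confusion matrix above into single numbers, the sketch below computes accuracy, precision, and recall for the spam class. It assumes the unlabeled columns are the true labels (rows being the predictions), which the output does not state explicitly.

```r
# Confusion matrix copied from the output above.
# Assumption: rows are predictions, columns are the true labels.
cm <- matrix(c(36996, 0, 12, 25002), nrow = 2,
             dimnames = list(predict = c("0", "1"), actual = c("0", "1")))

accuracy  <- sum(diag(cm)) / sum(cm)           # share of correct predictions
precision <- cm["1", "1"] / sum(cm["1", ])     # predicted spam that is spam
recall    <- cm["1", "1"] / sum(cm[, "1"])     # actual spam that was caught

round(c(accuracy = accuracy, precision = precision, recall = recall), 4)
#  accuracy precision    recall 
#    0.9998    1.0000    0.9995
```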