Our Approach

We used a support vector machine (SVM) model in this Project 4 code. Our approach for this project follows:

  1. Load Required Libraries
  2. Get Data from the SpamAssassin Website
  3. Build a Document Corpus
  4. Plot Sentiment Analysis and Wordcloud of Corpus
  5. Create Document-Term Matrix
  6. Clean Up and Normalize Data
  7. Create Training Set
  8. Build/Train SVM
  9. Review Results - Confusion Matrix Statistics for Radial and Linear Kernel Models

## -- Attaching packages --------------------------------------- tidyverse 1.2.1 --
## v tibble  2.1.3     v purrr   0.3.2
## v tidyr   0.8.3     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x NLP::annotate() masks ggplot2::annotate()
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## x purrr::lift()   masks caret::lift()

Get Data

The data for this project was obtained from:

https://spamassassin.apache.org/old/publiccorpus/

Ham and spam files were extracted and stored in a data folder on a local drive.
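As a minimal sketch of this step, the extracted files could be read into one data frame per class. The helper name `read_messages` and the folder layout are hypothetical (the original folder names are not given), so the demo below writes two made-up messages to a temporary folder instead of touching the real corpus.

```r
# Read every file in a folder into a data frame with one row per message.
# `read_messages` is a hypothetical helper, not the original code.
read_messages <- function(dir, label) {
  paths <- list.files(dir, full.names = TRUE)
  text <- vapply(
    paths,
    function(p) paste(readLines(p, warn = FALSE), collapse = "\n"),
    character(1),
    USE.NAMES = FALSE
  )
  data.frame(
    file  = basename(paths),
    text  = text,
    label = rep(label, length(paths)),
    stringsAsFactors = FALSE
  )
}

# Demo on a temporary folder (made-up messages) so the sketch is runnable.
demo_dir <- file.path(tempdir(), "ham_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("From: alice@example.com", "", "Hello there"),
           file.path(demo_dir, "0001.txt"))
writeLines(c("From: bob@example.com", "", "Lunch at noon?"),
           file.path(demo_dir, "0002.txt"))

ham <- read_messages(demo_dir, "ham")
```

Calling the same helper on the spam folder with `label = "spam"` and combining the two results with `rbind()` would give a single labelled table.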

Analysing the Ham Files

Downloading the Dataset for Ham

Visualizing the Length of Different Senders’ Emails

Example of a Ham File

Visualizing the Length of All Emails

Body of the Email

Extracting words in the Bodies of All Emails

Creating a Data Frame Containing the Words

Organizing the Data Frame and Adding the Term Frequency (tf), the Inverse Document Frequency of a Term (idf), and Their Product (tf_idf)

## Warning in bind_tf_idf.data.frame(., word, files, n): A value for tf_idf is negative:
## Input should have exactly one row per document-term combination.
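The three quantities can be sketched in base R on a toy table. This is an illustrative stand-in, not the original code: the counts are made up, and the column names `files`, `word`, and `n` are borrowed from the warning above. The sketch uses the usual definitions (tf is a term's share of its document's words, idf is the natural log of the number of documents divided by the number of documents containing the term). It first collapses duplicate rows so that each document-term pair appears exactly once, which is the condition the warning names.

```r
# Toy word counts (made up); document 1 lists "free" twice on purpose.
counts <- data.frame(
  files = c(1, 1, 1, 2, 2, 3),
  word  = c("free", "free", "meeting", "free", "offer", "meeting"),
  n     = c(2, 1, 1, 4, 2, 3),
  stringsAsFactors = FALSE
)

# Collapse duplicates so there is exactly one row per document-term pair.
counts <- aggregate(n ~ files + word, data = counts, FUN = sum)

# tf: share of a document's words taken up by this term.
counts$tf <- counts$n / ave(counts$n, counts$files, FUN = sum)

# idf: natural log of (number of documents / documents containing the term).
n_docs <- length(unique(counts$files))
docs_with_word <- ave(counts$files, counts$word, FUN = length)
counts$idf <- log(n_docs / docs_with_word)

# tf_idf: the product of the two.
counts$tf_idf <- counts$tf * counts$idf
```

With one row per pair, a term can appear in at most `n_docs` documents, so idf and tf_idf cannot go negative.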

Cleaning the Data Frame

We keep only words with an IDF greater than 0, and we remove words containing numbers.
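A minimal base R sketch of this filter follows. The table is a toy example: apart from the `laptop's` value, which is taken from the output in this report, the words and idf values are made up.

```r
# Toy term table (mostly made-up values).
words <- data.frame(
  word = c("laptop's", "the", "win2000", "offer", "1999"),
  idf  = c(6.46, 0, 3.2, 1.1, 4.0),
  stringsAsFactors = FALSE
)

# Flag words that contain at least one digit.
has_digit <- grepl("[0-9]", words$word)

# Keep rows whose idf is positive and whose word contains no digit.
clean <- words[words$idf > 0 & !has_digit, ]
```

A word with an idf of 0 appears in every document, so it carries no information for separating ham from spam.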

Example of the Sparsity of a Word

## # A tibble: 4 x 6
##   files word         n      tf   idf  tf_idf
##   <int> <chr>    <int>   <dbl> <dbl>   <dbl>
## 1  1795 laptop's    60 0.0167   6.46 0.108  
## 2  1300 laptop's   620 0.00161  6.46 0.0104 
## 3  1336 laptop's   645 0.00155  6.46 0.0100 
## 4  1301 laptop's   826 0.00121  6.46 0.00782

Spam Files

Example of a Spam Document

Selecting the Most Frequent Words with TF_IDF

## Warning in bind_tf_idf.data.frame(., word, block, n): A value for tf_idf is negative:
## Input should have exactly one row per document-term combination.

Creating a Spam Senders’ Email Data Frame

##    Length     Class      Mode 
##      1396 character character
## [1] "lmrn@mailexcite.com"               "amknight@mailexcite.com"          
## [3] "jordan23@mailexcite.com"           "merchantsworld2001@juno.com"      
## [5] "cypherpunks-forward@ds.pro-ns.net" "sales@outsrc-em.com"

Creating a Spam Senders’ Email Data Frame

## # A tibble: 1,396 x 2
##    email                                                len
##    <chr>                                              <int>
##  1 lmrn@mailexcite.com                                   19
##  2 amknight@mailexcite.com                               23
##  3 jordan23@mailexcite.com                               23
##  4 merchantsworld2001@juno.com                           27
##  5 cypherpunks-forward@ds.pro-ns.net                     33
##  6 sales@outsrc-em.com                                   19
##  7 ormlh@imail.ru                                        14
##  8 spamassassin-sightings-admin@lists.sourceforge.net    50
##  9 fork-admin@xent.com                                   19
## 10 bduyisj36648@Email.cz                                 21
## # ... with 1,386 more rows
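A table like the one above can be sketched in base R by pulling an address out of each message's `From:` line and measuring its length with `nchar()`. That the addresses came from the `From:` header is an assumption, the header lines below are made up around two addresses from the output, and the pattern is a simple approximation rather than a full address parser.

```r
# Toy header lines standing in for each message's "From:" line (made up).
from_lines <- c(
  "From: Some Sender <lmrn@mailexcite.com>",
  "From: sales@outsrc-em.com",
  "From: no address here"
)

# Simple address pattern; an illustrative approximation only.
pattern <- "[[:alnum:]._%+-]+@[[:alnum:].-]+\\.[[:alpha:]]{2,}"

# regmatches() keeps only the lines where the pattern matched.
matches <- regmatches(from_lines, regexpr(pattern, from_lines))

# One row per sender, with the character length of the address.
senders <- data.frame(
  email = matches,
  len   = nchar(matches),
  stringsAsFactors = FALSE
)
```

Lines without a recognizable address are dropped silently here; a real pipeline might prefer to keep them as `NA` so the row count still matches the number of messages.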

Visualizing the Length of Different Spam Senders’ Emails

Build a Document-Term Matrix

Now we use the corpus to construct a document-term matrix.
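To make the structure concrete, here is a base R stand-in that builds a small document-term matrix by hand. It is not the original code, which may well have used a text-mining package for this step. The three documents and the `tokenize` helper are hypothetical.

```r
# Three toy documents (made up).
docs <- c(
  ham1  = "Meeting moved to Monday, see agenda",
  spam1 = "Free offer! Claim your free prize now",
  ham2  = "Agenda for the Monday meeting"
)

# Lowercase, turn every run of non-letters into a space, split on spaces.
tokenize <- function(x) {
  x <- gsub("[^a-z]+", " ", tolower(x))
  tokens <- strsplit(trimws(x), " +")[[1]]
  tokens[nzchar(tokens)]
}
tokens <- lapply(docs, tokenize)

# Vocabulary: every distinct term across the corpus.
vocab <- sort(unique(unlist(tokens)))

# Rows are documents, columns are terms, cells are term counts.
dtm <- t(vapply(
  tokens,
  function(tk) as.numeric(table(factor(tk, levels = vocab))),
  numeric(length(vocab))
))
colnames(dtm) <- vocab
```

Most cells are zero even in this tiny example, which is why the next step reduces sparseness before modelling.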

Reduce Sparseness and Normalize

Create a Training Set
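One common way to do this is a random split of the rows. The proportion used in this project is not stated, so the 70/30 split, the seed, and the toy table below are all assumptions made for illustration.

```r
set.seed(42)  # arbitrary seed so the split is reproducible

# Toy table: 10 messages with a class label (made-up data).
messages <- data.frame(
  id    = 1:10,
  label = factor(rep(c("ham", "spam"), times = c(7, 3)))
)

# Sample 70% of the row indices for training; the rest form the test set.
n_train   <- round(0.7 * nrow(messages))
train_idx <- sample(nrow(messages), size = n_train)
train <- messages[train_idx, ]
test  <- messages[-train_idx, ]
```

With classes as unbalanced as ham and spam, a stratified split (sampling within each label) may be worth considering so that both sets keep a similar spam share.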

Build / Train SVM

We use the training data to build an SVM model that predicts whether a message is spam or ham.
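A minimal sketch of fitting linear and radial SVMs follows, assuming the `e1071` package (the report does not show which SVM implementation was used). The two-column feature matrix is made-up data standing in for a real document-term matrix.

```r
library(e1071)  # assumed package; provides svm()

# Toy term counts: spam rows use "free" often, ham rows use "meeting" often.
x <- rbind(
  cbind(free = c(5, 6, 4, 7, 5), meeting = c(0, 1, 0, 0, 1)),  # spam
  cbind(free = c(0, 1, 0, 0, 1), meeting = c(5, 4, 6, 5, 7))   # ham
)
y <- factor(rep(c("spam", "ham"), each = 5))

# Fit one model per kernel type.
fit_linear <- svm(x, y, kernel = "linear")
fit_radial <- svm(x, y, kernel = "radial")

# Classify two new messages with the linear model.
new_x <- rbind(c(free = 6, meeting = 0), c(free = 0, meeting = 4))
pred  <- predict(fit_linear, new_x)
```

Fitting both kernels on the same training data makes it straightforward to compare their confusion matrices afterwards.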

Review Results

Finally, we test our model to see how accurate it is.

The confusion matrix below indicates that, with n = 15, 16 of the 762 test emails were misclassified (15 + 1). This equates to approximately 98% accuracy.

       Ham  Spam
Ham    636    15
Spam     1   110
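Overall accuracy can be read off a confusion matrix as the sum of the diagonal divided by the total. The short sketch below recomputes it in base R from the four counts shown above; the variable names are illustrative.

```r
# Counts copied from the confusion matrix above.
cm <- matrix(
  c(636,  15,
      1, 110),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("Ham", "Spam"), c("Ham", "Spam"))
)

correct       <- sum(diag(cm))   # correctly classified emails
total         <- sum(cm)         # all test emails
misclassified <- total - correct # off-diagonal cells
accuracy      <- correct / total
```

Accuracy alone can hide an imbalance between the two error types, so the two off-diagonal cells are worth reporting separately as well.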