Overview

Unwanted emails, often called “spam,” are becoming increasingly frequent, problematic, and potentially dangerous, carrying fraudulent offers and deals, or even viruses. Most email accounts have automatic filtering methods to separate “ham,” the wanted emails, from spam. This project is meant to simulate that filter on a micro-scale.

I have to say this project, appropriately, is the hardest thing I’ve encountered so far.

Loading Packages

I saw the number of packages expand and shrink as I tried something, ran into a block, and had to work around or through it. Honorable mentions go to quanteda and RTextTools. I really wanted to work with the modeling features in RTextTools, but could not get them working with the way I constructed the corpus.

library(tm)
## Warning: package 'tm' was built under R version 4.0.3
## Loading required package: NLP
## Warning: package 'NLP' was built under R version 4.0.3
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2     v purrr   0.3.4
## v tibble  3.0.3     v dplyr   1.0.2
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.0
## Warning: package 'readr' was built under R version 4.0.3
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x ggplot2::annotate() masks NLP::annotate()
## x dplyr::filter()     masks stats::filter()
## x dplyr::lag()        masks stats::lag()
library(dplyr)
library(stringr)
library(tidytext)
## Warning: package 'tidytext' was built under R version 4.0.3
library(caret)
## Warning: package 'caret' was built under R version 4.0.3
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.0.3
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin

Directories

After downloading and unzipping the files, I placed both folders on my machine and stored their paths in variables to make them easier to work with.

hamDir <- "D:\\CUNY\\DATA 607 - Data Aq\\SpamHam\\easy_ham"
spamDir <- "D:\\CUNY\\DATA 607 - Data Aq\\SpamHam\\spam"
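The hard-coded Windows paths above only resolve on one machine. As a minimal sketch, file.path() builds the same kind of path without escaped backslashes, and list.files() gives a quick check that each folder holds files before a corpus is built from it. The base folder here is a temporary stand-in and the dummy file names are made up for illustration.

```r
# Stand-in for the real download folder (hypothetical location).
baseDir <- file.path(tempdir(), "SpamHam")

# file.path() joins components with "/", which R also accepts on Windows.
hamDir  <- file.path(baseDir, "easy_ham")
spamDir <- file.path(baseDir, "spam")

# Create the folders and a few empty dummy files so the sketch runs anywhere.
dir.create(hamDir,  recursive = TRUE, showWarnings = FALSE)
dir.create(spamDir, recursive = TRUE, showWarnings = FALSE)
invisible(file.create(file.path(hamDir,  c("ham_1", "ham_2"))))
invisible(file.create(file.path(spamDir, "spam_1")))

# Sanity check: both folders exist, then count the files in each.
stopifnot(dir.exists(hamDir), dir.exists(spamDir))
counts <- c(ham = length(list.files(hamDir)), spam = length(list.files(spamDir)))
counts
```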

Corpora to Data Frames

Based on some examples of text classification I found in the texts and online, I think I might have this a bit backwards. It all seems to work, so I didn’t mess with it too much. My plan here was to get the corpora into a data frame as quickly as possible.

ham_cor <- VCorpus(DirSource(hamDir), readerControl = list(language="en"))
spam_cor <- VCorpus(DirSource(spamDir), readerControl = list(language="en"))

Data Frames

From the tidytext package, tidy() automatically put the corpora into neat-ish columns. I would like to use this function more, and perhaps find a way for it to read out the other default columns like heading and author. I did add in a column for ‘isHam’, indicating if the message is classed as ham (1) or spam (0). I caught this from the intro video on how to download these files. I saw a method to attach this as metadata using the meta() function, but I couldn’t get it working.

ham_df <- ham_cor %>%
  tidy() %>%
  mutate(isHam = 1) 
spam_df <- spam_cor %>%
  tidy() %>%
  mutate(isHam = 0)

Joining the data sets

Using full_join(), I combined both the spam and ham sets and selected the columns I needed.

hamSpam <- full_join(ham_df,spam_df)
## Joining, by = c("author", "datetimestamp", "description", "heading", "id", "language", "origin", "text", "isHam")
hamSpam <- hamSpam %>%
  select(id, text, isHam)
colnames(hamSpam)
## [1] "id"    "text"  "isHam"
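With no `by` argument, full_join() matches on every shared column, including isHam. Because isHam differs between the two sets, no rows ever match and the result is just the two sets stacked. A minimal sketch of the more direct route, bind_rows(), on made-up stand-ins for ham_df and spam_df (the names ham_toy and spam_toy are hypothetical):

```r
library(dplyr)

# Toy stand-ins for ham_df and spam_df (made-up rows, same three columns).
ham_toy  <- tibble(id = c("h1", "h2"), text = c("lunch at noon?", "see notes"), isHam = 1)
spam_toy <- tibble(id = c("s1", "s2"), text = c("FREE offer", "act now"), isHam = 0)

# bind_rows() stacks the rows and lines columns up by name; no keys are compared.
combined <- bind_rows(ham_toy, spam_toy) %>%
  select(id, text, isHam)

nrow(combined)  # 4: every input row is kept
```

Stacking also avoids the "Joining, by = ..." message, since nothing is being joined.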

Proportions

Spam accounts for around 17% of the messages (501 of 3,002). I would have liked to account for this imbalance when splitting the data, to ensure that the training and test samples are themselves representative.

table(hamSpam$isHam)
## 
##    0    1 
##  501 2501
prop.table(table(hamSpam$isHam))
## 
##         0         1 
## 0.1668887 0.8331113
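One way to keep the spam share the same in both samples is to draw the split within each class. The following is a minimal base R sketch under that assumption, using toy labels with the same class counts as the table above. The variable names are illustrative and this is not the split used later in this write-up.

```r
set.seed(12345)

# Toy labels with the class counts shown above: 501 spam (0), 2501 ham (1).
isHam <- rep(c(0, 1), times = c(501, 2501))

# Sample 70% of the row indices within each class, then pool them.
idx_by_class <- split(seq_along(isHam), isHam)
train_idx <- unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), size = round(0.7 * length(idx)))]
}), use.names = FALSE)

train_y <- isHam[train_idx]
test_y  <- isHam[-train_idx]

# Both samples should carry close to the overall spam share.
prop.table(table(train_y))
prop.table(table(test_y))
```

caret's createDataPartition() documents class-wise sampling when the outcome is a factor, so passing factor(isHam) instead of the numeric column is another option worth trying.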

Document Term Matrix

I feel like I took a step backwards here, going from corpus to data frame to corpus to DTM. I couldn’t get the DTM straight from the original corpus or the joined data frames. After a lot of experimentation, I got this working, adding steps to clean the data as much as possible with the help of the tm package.

hamSpam_corp <- Corpus(VectorSource(hamSpam$text)) %>%
  tm_map(removeWords, stopwords("english")) %>%
  tm_map(removeNumbers) %>%
  tm_map(removePunctuation) %>%
  tm_map(stripWhitespace) %>%
  tm_map(PlainTextDocument) %>%
  tm_map(content_transformer(tolower))
## Warning in tm_map.SimpleCorpus(., PlainTextDocument): transformation drops
## documents
## Warning in tm_map.SimpleCorpus(., content_transformer(tolower)): transformation
## drops documents
hamSpam_dtm <- DocumentTermMatrix(hamSpam_corp)
hamSpam_dtm <- hamSpam_dtm %>%
  removeSparseTerms(sparse = .99)

inspect(hamSpam_dtm)
## <<DocumentTermMatrix (documents: 3002, terms: 2104)>>
## Non-/sparse entries: 314576/6001632
## Sparsity           : 95%
## Maximal term length: 67
## Weighting          : term frequency (tf)
## Sample             :
##       Terms
## Docs   aug esmtp from ist localhost mon oct postfix received sep
##   166    0     2    2   2         2   0   6       1        4   0
##   265    0     5    2   2         4   0   8       3        6   0
##   2683   0     2    5   1         3   7   0       2       15   7
##   2698   3     2    4   1         3   3   0       1       15   3
##   2746   2     3    9   2         2   1   0       1        5  10
##   570    0     5    2   2         5   0   0       3        6   8
##   670    0     5    2   2         4   0   0       3        7   8
##   677    0     4    3   2         3   0   0       3        7   9
##   765    0     5    3   2         3   0   0       3        6   8
##   942    0     4    2   2         3   0   9       3        6   0
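In the sample above, "from" shows up among the frequent terms even though it is on the English stopword list. One likely explanation (an assumption, not checked against the actual corpus) is ordering: removeWords() matches case-sensitively and runs before tolower, so a capitalized header word like "From" passes through and is lowercased afterwards. A small sketch on a made-up line:

```r
library(tm)

# Made-up header-like line; not taken from the corpus.
msg <- "From the desk of a friend"

# Order used above: remove stopwords first, lowercase afterwards.
# removeWords() is case-sensitive, so "From" is left in place.
stop_then_lower <- tolower(removeWords(msg, stopwords("english")))

# Alternative order: lowercase first, then remove stopwords.
lower_then_stop <- removeWords(tolower(msg), stopwords("english"))

grepl("\\bfrom\\b", stop_then_lower)  # TRUE: "from" survived
grepl("\\bfrom\\b", lower_then_stop)  # FALSE: "from" was removed
```

If that is what happened here, moving the tolower step to the front of the tm_map() chain would change which terms reach the DTM, so the term count and sample above would differ.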

DTM to Matrix

From the DTM, I created a matrix and then another data frame, adding in the “isHam” column for later use as my data class labels.

hamSpam_matrix <- hamSpam_dtm %>%
  as.matrix() %>%
  as.data.frame() %>%
  mutate(isHam = hamSpam$isHam) %>%
  select(isHam, everything())

Test/Train and Model

Setting a seed for reproducibility, I used the handy createDataPartition() function to index my data into the test and training sets. I got stuck here for a bit, because I left out the “,” after the row index when subsetting the train and test variables…

These pieces of code are all in one chunk so that they share the same set.seed() call. Originally, I had them separated, but I got inconsistent results when running them a few times, so I put all these components together. I was able to get to the train and test data sets with relatively minor hiccups. When it came to modeling, training, and testing my data, I ran into a bunch of walls. I ping-ponged between modeling packages and kept coming up with errors about using a data frame versus a corpus, or not having my data labels just right. I eventually came across a solution using just the randomForest package. I was able to feed the model my training data and have it tested rather simply compared with the others I looked at.

set.seed(12345)
hamSpamIndex <- createDataPartition(hamSpam_matrix$isHam, times = 1, p = 0.7, list = FALSE)

train <- hamSpam_matrix[hamSpamIndex, ]
test <- hamSpam_matrix[-hamSpamIndex, ]

prop.table(table(train$isHam))
## 
##         0         1 
## 0.1607992 0.8392008
prop.table(table(test$isHam))
## 
##         0         1 
## 0.1811111 0.8188889
classifier <- randomForest(x = train,
                          y = train$isHam,
                          ntree = 7)
## Warning in randomForest.default(x = train, y = train$isHam, ntree = 7):
## The response has five or fewer unique values. Are you sure you want to do
## regression?
pred <- predict(classifier, newdata = test)

conf_matrix <- table(pred>0,test$isHam)


conf_matrix
##        
##           0   1
##   FALSE 147   0
##   TRUE   16 737
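Two details in the chunk above are worth flagging. First, x = train still contains the isHam column, so the label itself is available to the model as a predictor. Second, isHam is numeric, which is why randomForest() warns about regression and why the predictions need a threshold (pred > 0). The following is a minimal sketch of a classification setup on synthetic data, not a rerun of the model above. The toy term counts, the helper to_label(), and ntree = 50 are all assumptions made for illustration.

```r
library(randomForest)

set.seed(12345)

# Synthetic stand-in for hamSpam_matrix: made-up term counts, not the real data.
n_spam <- 60
n_ham  <- 240
toy <- data.frame(
  isHam   = rep(c(0, 1), times = c(n_spam, n_ham)),
  free    = c(rpois(n_spam, 3),   rpois(n_ham, 0.3)),
  meeting = c(rpois(n_spam, 0.2), rpois(n_ham, 2)),
  click   = c(rpois(n_spam, 2),   rpois(n_ham, 0.5))
)

# Simple random 70/30 split for the toy data.
train_idx <- sample(nrow(toy), size = round(0.7 * nrow(toy)))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

# Predictors exclude the label column; the response is a factor,
# which makes randomForest() fit a classification forest.
x_cols   <- setdiff(names(toy), "isHam")
to_label <- function(v) factor(v, levels = c(0, 1), labels = c("spam", "ham"))

classifier <- randomForest(x = train[, x_cols],
                           y = to_label(train$isHam),
                           ntree = 50)

# Predictions are class labels, so no numeric threshold is needed.
pred <- predict(classifier, newdata = test[, x_cols])
conf_matrix <- table(predicted = pred, actual = to_label(test$isHam))
conf_matrix
```

With the label removed from the predictors, the accuracy on the real data may come out lower than the figure reported here, but it would reflect what the term counts alone can do.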

Results

From the confusion matrix, the model worked out okay. To validate the model, I added up the correct predictions and divided by all test observations. The model performed rather well, correctly classifying ham or spam 98.2% of the time.

val <- conf_matrix['TRUE', 2] + conf_matrix['FALSE', 1] 
accuracy <- val/nrow(test)

val
## [1] 884
accuracy
## [1] 0.9822222
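Overall accuracy hides how the errors fall on each class, which matters with data this imbalanced. As a sketch, the counts from the confusion matrix above can be re-entered by hand to get per-class rates and a naive baseline. The names spam_caught, ham_kept, and baseline are illustrative.

```r
# Counts copied from the confusion matrix above (rows: pred > 0, columns: isHam).
conf_matrix <- matrix(c(147, 16, 0, 737), nrow = 2,
                      dimnames = list(pred = c("FALSE", "TRUE"),
                                      actual = c("0", "1")))

# Accuracy: correct predictions over all test messages (884 / 900).
accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)

# Share of actual spam that was flagged (147 / 163) and of actual ham kept (737 / 737).
spam_caught <- conf_matrix["FALSE", "0"] / sum(conf_matrix[, "0"])
ham_kept    <- conf_matrix["TRUE", "1"] / sum(conf_matrix[, "1"])

# Baseline: labelling every test message as ham (737 / 900).
baseline <- sum(conf_matrix[, "1"]) / sum(conf_matrix)

round(c(accuracy = accuracy, spam_caught = spam_caught,
        ham_kept = ham_kept, baseline = baseline), 3)
```

On these counts, all 16 errors are spam messages that were let through as ham: about 90% of spam was caught, no ham was lost, and always guessing ham would already score about 82%.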