A process to identify a spam or non spam email is really essential.Because, a lot of time our mailbox is bombarded with unnecessary spam email.The purpose of this project is to classify new “test” documents using already classified “training” documents. A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam.

For this project I have started with the example corpus from blackboard.I have used the easy ham and esy spam unzipped file. After downloading the file I have extracted them using 7-zip.

I have followed Emil Hvitfeldt and Julia Silge’s Supervised Machine Learning for Text Analysis in R’s chapter 7:Classification to create my model.

library(tm)
## Loading required package: NLP
library(RTextTools)
## Loading required package: SparseM
## 
## Attaching package: 'SparseM'
## The following object is masked from 'package:base':
## 
##     backsolve
library(knitr)
library(tidyverse)
## Registered S3 method overwritten by 'cli':
##   method     from
##   print.tree tree
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3     v purrr   0.3.4
## v tibble  3.1.0     v dplyr   1.0.5
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x ggplot2::annotate() masks NLP::annotate()
## x dplyr::filter()     masks stats::filter()
## x dplyr::lag()        masks stats::lag()
library(kableExtra)
## 
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
## 
##     group_rows
library(magrittr)
## 
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
## 
##     set_names
## The following object is masked from 'package:tidyr':
## 
##     extract
library(data.table)
## 
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
## The following object is masked from 'package:purrr':
## 
##     transpose
library(e1071)
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift

Loading Data:

After loading the data in the R environment,list.files is utilized in the spam_folder object which produces a charecter vector of the names of the files in the named directory. The data is transformed into a dataframe.After specifying the column name lapply function is used.unnest() is used to handle list-columns that contain atomic vectors, lists, or data frames however not a mixture of the different types.spam_folder is also organized using the same process.

Similarly we follow this process for our spam_folder contents.

spam_folder <- "C:\\Users\\malia\\OneDrive\\Desktop\\MSDS DATA 607\\spamham\\spam_2"
ham_folder <- "C:\\Users\\malia\\OneDrive\\Desktop\\MSDS DATA 607\\spamham\\easy_ham"

length(list.files(path = spam_folder))
## [1] 1397
length(list.files(path = ham_folder))
## [1] 2551
spam_files <- list.files(path = spam_folder, full.names = TRUE)
ham_files <- list.files(path = ham_folder, full.names = TRUE)

spam <- list.files(path = spam_folder) %>%
  as.data.frame() %>%
  set_colnames("file") %>%
  mutate(text = lapply(spam_files, read_lines)) %>%
  unnest(c(text)) %>%
  mutate(class = "spam",
         spam = 1) %>%
  group_by(file) %>%
  mutate(text = paste(text, collapse = " ")) %>%
  ungroup() %>%
  distinct()
            
ham <- list.files(path = ham_folder) %>%
  as.data.frame() %>%
  set_colnames("file") %>%
  mutate(text = lapply(ham_files, read_lines)) %>%
  unnest(c(text)) %>%
  mutate(class = "ham",
         spam = 0) %>%
  group_by(file) %>%
  mutate(text = paste(text, collapse = " ")) %>%
  ungroup() %>%
  distinct()

Tidying Data:

In this step I have used rbind for the elements of ham and spam.white spaces and punctuation from the dataframe is removed.

ham_spam1 <- rbind(ham, spam) %>%
  select(class,spam,file, text)

ham_spam1$text <- ham_spam1$text %>%
  str_replace(.,"[\\r\\n\\t]+", "")

replacePunctuation <- content_transformer(function(x) {return (gsub("[[:punct:]]", " ", x))})
ham_spam1
## # A tibble: 3,948 x 4
##    class  spam file                   text                                      
##    <chr> <dbl> <chr>                  <chr>                                     
##  1 ham       0 0001.ea7e79d3153e7469~ "From exmh-workers-admin@redhat.com  Thu ~
##  2 ham       0 0002.b3120c4bcbf3101e~ "From Steve_Burt@cursor-system.com  Thu A~
##  3 ham       0 0003.acfc5ad94bbd2711~ "From timc@2ubh.com  Thu Aug 22 13:52:59 ~
##  4 ham       0 0004.e8d5727378ddde5c~ "From irregulars-admin@tb.tf  Thu Aug 22 ~
##  5 ham       0 0005.8c3b9e9c0f3f183d~ "From exmh-users-admin@redhat.com  Thu Au~
##  6 ham       0 0006.ee8b0dba12856155~ "From Stewart.Smith@ee.ed.ac.uk  Thu Aug ~
##  7 ham       0 0007.c75188382f64b090~ "From martin@srv0.ems.ed.ac.uk  Thu Aug 2~
##  8 ham       0 0008.20bc0b4ba2d99aae~ "From martin@srv0.ems.ed.ac.uk  Thu Aug 2~
##  9 ham       0 0009.435ae292d75abb1c~ "From Stewart.Smith@ee.ed.ac.uk  Thu Aug ~
## 10 ham       0 0010.4996141de3f21e85~ "From martin@srv0.ems.ed.ac.uk  Thu Aug 2~
## # ... with 3,938 more rows

Classification Model:

The initial split function from rsample is used.The strata argument ensures that the distribution of class is similar in the training set and testing set. Since the split uses random sampling,seed was set to reproduce results.

library(tidymodels)
## -- Attaching packages -------------------------------------- tidymodels 0.1.3 --
## v broom        0.7.6      v rsample      0.0.9 
## v dials        0.0.9      v tune         0.1.5 
## v infer        0.5.4      v workflows    0.2.2 
## v modeldata    0.1.0      v workflowsets 0.0.2 
## v parsnip      0.1.5      v yardstick    0.0.8 
## v recipes      0.1.16
## -- Conflicts ----------------------------------------- tidymodels_conflicts() --
## x ggplot2::annotate()      masks NLP::annotate()
## x data.table::between()    masks dplyr::between()
## x scales::discard()        masks purrr::discard()
## x magrittr::extract()      masks tidyr::extract()
## x dplyr::filter()          masks stats::filter()
## x data.table::first()      masks dplyr::first()
## x recipes::fixed()         masks stringr::fixed()
## x kableExtra::group_rows() masks dplyr::group_rows()
## x dplyr::lag()             masks stats::lag()
## x data.table::last()       masks dplyr::last()
## x caret::lift()            masks purrr::lift()
## x rsample::permutations()  masks e1071::permutations()
## x yardstick::precision()   masks caret::precision()
## x yardstick::recall()      masks caret::recall()
## x yardstick::sensitivity() masks caret::sensitivity()
## x magrittr::set_names()    masks purrr::set_names()
## x yardstick::spec()        masks readr::spec()
## x yardstick::specificity() masks caret::specificity()
## x recipes::step()          masks stats::step()
## x data.table::transpose()  masks purrr::transpose()
## x tune::tune()             masks e1071::tune()
## * Use tidymodels_prefer() to resolve common conflicts.
set.seed(1234)


email_split <- initial_split(ham_spam1, strata = class)

ham_spam_train <- training(email_split)
ham_spam_test <- testing(email_split)

Recepie package from Tidymodel is implemented to create a specification of preprocessing steps to perform.These transformations are estimated (or “trained”) on the training set so that they can be applied in the same way on the testing set or new data at prediction time, without data leakage. I have initialized a set of preprocessing transformations with the recipe() function, using a formula expression to specify the variables, outcome plus predictor, along with the data set.

ham_spam_rec<-recipe(class ~ text, data = ham_spam_train)

textrecepies is used to handle text of the ham or spam emails. email text is tokenized with step_tokenize.. By default this uses tokenizers::tokenize_words(). Before calculating tf-idf step_tokenfilter() is used to only keep the 1000 most frequent tokens, to avoid creating too many variables in the first model. To finish, step_tfidf() is used to compute tf-idf.

library(textrecipes)

ham_spam_rec <- ham_spam_rec %>%
  step_tokenize(text) %>%
  step_tokenfilter(text, max_tokens = 10) %>%
  step_tfidf(text)

tidymodels workflow() is created to bundle together modeling components.

ham_spamworkflow<-workflow() %>%
  add_recipe(ham_spam_rec)

Naive Bayes model which is available in the tidymodels package discrim is utlized. One of the main advantages of a naive Bayes model is its ability to handle a large number of features, such as those we deal with when using word count methods.

library(discrim)
## 
## Attaching package: 'discrim'
## The following object is masked from 'package:dials':
## 
##     smoothness
nb_spec <- naive_Bayes() %>%
  set_mode("classification") %>%
  set_engine("naivebayes")

nb_spec
## Naive Bayes Model Specification (classification)
## 
## Computational engine: naivebayes

Evaluation of the model:

Resampling is used to estimate the performance of the naive Bayes classification mode.I can do this using resampled data sets built from the training set. cross 10-fold cross-validation sets is created and these resampled sets is used for performance estimates.

Each of these splits contains information about how to create cross-validation folds from the original training data. In this example, 90% of the training data is included in each fold and the other 10% is held out for evaluation.For convenience, let’s again use a workflow() for our resampling estimates of performance.

library(naivebayes)
## naivebayes 0.9.7 loaded
## 
## Attaching package: 'naivebayes'
## The following object is masked from 'package:data.table':
## 
##     tables
nb_fit <- ham_spamworkflow %>%
  add_model(nb_spec) %>%
  fit(data = ham_spam_train)
set.seed(234)
ham_spam_folds <- vfold_cv(ham_spam_train)

ham_spam_folds
## #  10-fold cross-validation 
## # A tibble: 10 x 2
##    splits             id    
##    <list>             <chr> 
##  1 <split [2665/297]> Fold01
##  2 <split [2665/297]> Fold02
##  3 <split [2666/296]> Fold03
##  4 <split [2666/296]> Fold04
##  5 <split [2666/296]> Fold05
##  6 <split [2666/296]> Fold06
##  7 <split [2666/296]> Fold07
##  8 <split [2666/296]> Fold08
##  9 <split [2666/296]> Fold09
## 10 <split [2666/296]> Fold10
nb_wf <- workflow() %>%
  add_recipe(ham_spam_rec) %>%
  add_model(nb_spec)

nb_wf
## == Workflow ====================================================================
## Preprocessor: Recipe
## Model: naive_Bayes()
## 
## -- Preprocessor ----------------------------------------------------------------
## 3 Recipe Steps
## 
## * step_tokenize()
## * step_tokenfilter()
## * step_tfidf()
## 
## -- Model -----------------------------------------------------------------------
## Naive Bayes Model Specification (classification)
## 
## Computational engine: naivebayes

To estimate how well that model performs, let’s fit the model many times, once to each of these resampled folds, and then evaluate on the heldout part of each resampled fold.

nb_rs <- fit_resamples(
  nb_wf,
  ham_spam_folds,
  control = control_resamples(save_pred = TRUE)
)

The relevant information is extracted using collect_metrics() and collect_predictions().

nb_rs_metrics <- collect_metrics(nb_rs)
nb_rs_predictions <- collect_predictions(nb_rs)

The default performance parameters for binary classification are accuracy and ROC AUC (area under the receiver operator characteristic curve). For these resamples, the average accuracy is 82.4%.

nb_rs_metrics
## # A tibble: 2 x 6
##   .metric  .estimator  mean     n std_err .config             
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy binary     0.824    10 0.00816 Preprocessor1_Model1
## 2 roc_auc  binary     0.876    10 0.00517 Preprocessor1_Model1

ROC curve, shows a visualization of how well a classification model can distinguish between classes.

nb_rs_predictions %>%
  group_by(id) %>%
  roc_curve(truth = class, .pred_ham) %>%
  autoplot() +
  labs(
    color = NULL,
    title = "ROC curve for Ham or Spam email",
    subtitle = "Each resample fold is shown in a different color"
  )

The model is evaluated using confusion matrix. A confusion matrix tabulates a model’s false positives and false negatives for each class. The function conf_mat_resampled() computes a separate confusion matrix for each resample and takes the average of the cell counts.The confusion metrix rows refer the prediction and column refers the information we already had regarding the ham spam data set.The True positive rate(188.7) of identifying non spam is high and the True negative (55.4)of identifying spam email is high as well.

conf_mat_resampled(nb_rs, tidy = FALSE) %>%
  autoplot(type = "heatmap")

It can be concluded that the Naive Bayes model from the Tidymodel has performed well with an accuracy rate of 82.4 along with a high True positive value in the confusionmatrix.

References:

Silge, E. H. A. J. (2021, April 29). Chapter 7 Classification | Supervised Machine Learning for Text Analysis in R. Supervised Machine Learning Text Analysis in R. https://smltar.com/mlclassification.html#comparetolasso