Assignment

Requirements

It can be useful to be able to classify new “test” documents using already classified “training” documents. A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam.

For this project, you can start with a spam/ham dataset, then predict the class of new documents (either withheld from the training dataset or from another source such as your own spam folder). Example corpus.

Here are two short videos that you may find helpful.

The second video provides a short overview of predictive classifiers.

For more adventurous students, you are welcome (encouraged!) to come up with a different set of documents (including scraped web pages!?) that have already been classified (e.g. tagged), then analyze these documents to predict how new documents should be classified.

New! Project 4 extra credit! Students who use the relatively new tidymodels and textrecipes packages to complete their Project 4 work will automatically receive 5 extra credit points. tidymodels is a significant improvement over Max Kuhn’s older caret package. Here are some resources to help you get up to speed on tidymodels and textrecipes.

Tidy Modeling with R, Max Kuhn and Julia Silge. Julia Silge has also done a number of tidymodels screencasts, including here
github.com/tidymodels/textrecipes
DataCamp course, Modeling with TidyModels in R

Data Preparation

Load Data

URL of file and filename is set by URL_easy_ham and tar_easy_ham respectively.

#easy_ham.tar.bz2 import
URL_easy_ham = 
  "http://spamassassin.apache.org/old/publiccorpus/20021010_easy_ham.tar.bz2"
tar_easy_ham = "20021010_easy_ham.tar.bz2"
dir_easy_ham      = "unzipped_files\\easy_ham\\"

URL_spam = 
  "http://spamassassin.apache.org/old/publiccorpus/20021010_spam.tar.bz2"
tar_spam     = "20021010_spam.tar.bz2"
dir_spam     = "unzipped_files\\spam\\"

The functions download.file() and untar() is used to download the tar file. The function list.files is used with the variable ham_dir specified above to import the individual files for 20021010_easy_ham.tar.bz2.

# ham zip file download
download.file(url = URL_easy_ham, destfile = tar_easy_ham)
untar(tar_easy_ham, exdir="unzipped_files", compressed = "bzip2")
files_easy_ham = list.files(path = dir_easy_ham,full.names = TRUE)

# spam zip file download
download.file(url = URL_spam, destfile = tar_spam)
untar(tar_spam, exdir="unzipped_files", compressed = "bzip2")
files_spam = list.files(path = dir_spam,full.names = TRUE)

list.files is also used with several piping functions (%>%) to create a data frame with email_ID, text, class, and double value spam column as shown below for the easy_ham data. This step is repeated with spam data as well.

df_easy_ham <-
  list.files(path = dir_easy_ham) %>%
    as.data.frame() %>%
      set_colnames("email_ID") %>%
        mutate(text = lapply(files_easy_ham, read_lines)) %>%
          unnest(c(text)) %>%
            mutate(class = "ham",
                  spam = 0) %>%
                    group_by(email_ID) %>%
            mutate(text = paste(text, collapse = " ")) %>%
                            ungroup() %>%
                              distinct()

df_spam <-
  list.files(path = dir_spam) %>%
    as.data.frame() %>%
      set_colnames("email_ID") %>%
        mutate(text = lapply(files_spam, read_lines)) %>%
          unnest(c(text)) %>%
            mutate(class = "spam",
                  spam = 1) %>%
                    group_by(email_ID) %>%
            mutate(text = paste(text, collapse = " ")) %>%
                            ungroup() %>%
                              distinct()

Create Classification & Naives Models

Both dataframes are merged using rbind.

df_easy_ham_spam <- rbind(df_easy_ham, df_spam)

Afterwards, using the tidymodels package, we create a single binary split of our merged dataframe df_easy_ham_spam, with initial_split(), specifying strata = class, making the class column our stratified sample. This is used to separate the data into training and testing sets in conjunction with the training() and testing() functions, as shown below.

set.seed(3239)
df_split    <- initial_split(df_easy_ham_spam, strata = class)
df_train    <- training(df_split)
df_test     <- testing(df_split)

## [1] "Training dimensions are "

## [1] "2290" "4"

## [1] "Test dimensions are "

## [1] "762" "4"

Using the recipe() also a part of tidymodels we lay out the description to be applied to our data sets. This is in preparation for the processing and tokenizing of the data.

df_rec <- 
  recipe(class ~ text, data = df_train)

The words in the text column is tokenized using step_tokenize() and the term frequency-inverse document frequency is calculated for text analysis with step_tokenfilter thanks to the textrecipes package

df_rec <- df_rec %>%
  step_tokenize(text) %>%
  step_tokenfilter(text, max_tokens = 10) %>%
  step_tfidf(text)

From this point the tokenized data is used in add_recipe, to specify the terms of the workflow allow it to be added to a container with workflow(), and stored to df_wf

df_wf <- workflow() %>%
  add_recipe(df_rec)

(nb_spec <- naive_Bayes() %>%
  set_mode("classification") %>%
  set_engine("naivebayes"))

## Naive Bayes Model Specification (classification)
## 
## Computational engine: naivebayes

nb_fit <- df_wf %>%
  add_model(nb_spec) %>%
  fit(data = df_train)

Evaluation

In order to use the Bayes classifcation we start by using vfold_cv() to split the data randomly, into 10 equal sized folds, then again store it into a container with workflow(), add_recipe() and add_model()

set.seed(234)
(df_folds <- vfold_cv(df_train))

## #  10-fold cross-validation 
## # A tibble: 10 x 2
##    splits             id    
##    <list>             <chr> 
##  1 <split [2061/229]> Fold01
##  2 <split [2061/229]> Fold02
##  3 <split [2061/229]> Fold03
##  4 <split [2061/229]> Fold04
##  5 <split [2061/229]> Fold05
##  6 <split [2061/229]> Fold06
##  7 <split [2061/229]> Fold07
##  8 <split [2061/229]> Fold08
##  9 <split [2061/229]> Fold09
## 10 <split [2061/229]> Fold10

(nb_wf <- 
  workflow() %>%
  add_recipe(df_rec) %>%
  add_model(nb_spec))

## == Workflow ====================================================================
## Preprocessor: Recipe
## Model: naive_Bayes()
## 
## -- Preprocessor ----------------------------------------------------------------
## 3 Recipe Steps
## 
## * step_tokenize()
## * step_tokenfilter()
## * step_tfidf()
## 
## -- Model -----------------------------------------------------------------------
## Naive Bayes Model Specification (classification)
## 
## Computational engine: naivebayes

Performance of the models is estimated by fitting several models, once, in each re-sampled fold and then evaluating on the heldout part.

nb_rs <- 
  fit_resamples(
  nb_wf,
  df_folds,
  control =
    control_resamples(save_pred = TRUE)
)

Data is extracted with collect_metrics() which creates a .metric and .estimator column unless summarized and collect_predictions() which creates predictive value columns.

(nb_rs_metrics <- collect_metrics(nb_rs))

## # A tibble: 2 x 6
##   .metric  .estimator  mean     n std_err .config             
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy binary     0.798    10  0.0132 Preprocessor1_Model1
## 2 roc_auc  binary     0.769    10  0.0154 Preprocessor1_Model1

nb_rs_predictions <- collect_predictions(nb_rs)

Visualizations

The visualizations reflect the results provided from nb_rs_metrics which showed:

.metric	mean
accuracy	0.7982533
roc_auc	0.7694050

nb_rs_predictions %>%
  group_by(id) %>%
  roc_curve(truth = class, .pred_ham) %>%
  autoplot() +
  labs(
    color = NULL,
    title = "ROC curve for ham - or - spam",
    subtitle = "Each resample fold is shown in a different color"
  )

conf_mat_resampled(nb_rs, tidy = FALSE) %>%
  autoplot(type = "heatmap")

Conclusion

The approach in using tidymodels and textrecipe came primarily from following the layout provided by \(Chapter\ 7: Classification\ |\ Supervised\ Machine\ Learning\ for\ Text\ Analysis\ in\ R\). Although the evaluation and data manipulation was done correctly, I would attribute the lower accuracy from the the email details such as source, sender, etc. In order to expand on and test this theory, a future project with cleaner text data should be used, to see if the results differ.

DATA607 Project 4

Gabriel Campos

May 02 2021