Spam vs Ham - Tidymodels Remix

Jeff Shamp

2020-06-10

Cleaning and Data Preparation

First, we will import, clean, and process the data for classification.
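The original post does not show a setup chunk. Judging by the functions used below, a plausible set of packages would be:

# Assumed setup (not shown in the original post).
library(tidyverse)    # readr, dplyr, stringr, tidyr for reading and cleaning
library(tidymodels)   # rsample, recipes, parsnip, workflows, tune, yardstick
library(textrecipes)  # step_tokenize, step_stem, step_stopwords, step_tfidf
# DT is called below via DT::, and doParallel via doParallel::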

Datasets

I found a large library of datasets here. Below is an excerpt from the site describing the sources.

The email spam messages are collected from:
The ENRON email archive
The Apache SpamAssassin dataset
To make the work simpler, the two datasets are put into a single zip file here (107MB, contains about 60K files).
The SMS dataset is from: SMS data
For reference on class labels:
SOURCES = [
('data/spam', SPAM),
('data/spam_2', SPAM),
('data/easy_ham_2', HAM),
('data/easy_ham', HAM),
('data/hard_ham', HAM),
('data/beck-s', HAM),
('data/farmer-d', HAM),
('data/kaminski-v', HAM),
('data/kitchen-l', HAM),
('data/lokay-m', HAM),
('data/williams-w3', HAM),
('data/BG', SPAM),
('data/GP', SPAM),
('data/SH', SPAM)
]

Read Files

I divided the files into two sets, spam and ham. There are ~67K files spread across many folders; I will load them locally and read in the files. The zip files containing the data are on GitHub.

Using local files for data extraction since they are already downloaded.

spam_path = "/Users/jeffshamp/Downloads/spam_data/SPAM"
ham_path = "/Users/jeffshamp/Downloads/spam_data/HAM"
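If the data were not already local, something like this base-R sketch would fetch and unpack the zip (the URL is a placeholder for the GitHub file mentioned above):

# Hypothetical download step; swap zip_url for the actual GitHub location.
zip_url <- "https://github.com/<user>/<repo>/raw/master/spam_data.zip"
download.file(zip_url, destfile = "spam_data.zip", mode = "wb")
unzip("spam_data.zip", exdir = "spam_data")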

This function crawls through the files and extracts the messages in their raw form.

make.data.frame <- function(path, class){
  # Dig through the directories for messages
  files <- list.files(path = path,
                      full.names = TRUE,
                      recursive = TRUE)
  # Read each file once the directories are crawled
  message <- lapply(files, read_file)
  # Build a dataframe with a "message" column and add the class label
  message <- unlist(message)
  data <- as.data.frame(message)
  data$class <- class
  return(data)
}

Make SPAM and HAM dataframes and bind them.

data<-make.data.frame(spam_path, class="SPAM")
data<-rbind(data, make.data.frame(ham_path, class="HAM"))

The SMS dataset can be bound on as well. Again, please see the zip files on GitHub, or run the markdown chunk as R with sms_path in the read_lines function to download and extract the data.

Using the local file since it is already downloaded.

sms_data<- as.data.frame(
  read_lines(
  "/Users/jeffshamp/Downloads/smsspamcollection/SMSSpamCollection"
    ))
names(sms_data)<-"lines"
sms_data<-sms_data %>%
  separate(col = lines, into = c("class", "message"), sep = "\t") %>%
  mutate(class = str_to_upper(class)) %>%
  mutate(message = factor(message))

data<-rbind(data, sms_data)

Best of Text Messages

Let’s take a look at the messages and see what kind of divine prose our writers produce.

“Shall I compare thee to a summer’s day?
Thou art more lovely and more temperate:
Rough winds do shake the darling buds of May,
And summer’s lease hath all too short a date;
Sometime too hot the eye of heaven shines,
And often is his gold complexion dimm’d;
And every fair from fair sometime declines,
By chance or nature’s changing course untrimm’d;
But thy eternal summer shall not fade,
Nor lose possession of that fair thou ow’st;
Nor shall death brag thou wander’st in his shade,
When in eternal lines to time thou grow’st:
So long as men can breathe or eyes can see,
So long lives this, and this gives life to thee.”

The above was not a text message from the SMS dataset. Our most verbose texter had the following, similar take on love…

best_of<-sms_data %>%
  mutate(length = str_length(message)) %>%
  arrange(desc(length))
str_split(best_of[1,'message'], "[.]")[[1]]
##  [1] "For me the love should start with attraction"                                                               
##  [2] "i should feel that I need her every time around me"                                                         
##  [3] "she should be the first thing which comes in my thoughts"                                                   
##  [4] "I would start the day and end it with her"                                                                  
##  [5] "she should be there every time I dream"                                                                     
##  [6] "love will be then when my every breath has her name"                                                        
##  [7] "my life should happen around her"                                                                           
##  [8] "my life will be named to her"                                                                               
##  [9] "I would cry for her"                                                                                        
## [10] "will give all my happiness and take all her sorrows"                                                        
## [11] "I will be ready to fight with anyone for her"                                                               
## [12] "I will be in love when I will be doing the craziest things for her"                                         
## [13] "love will be when I don't have to proove anyone that my girl is the most beautiful lady on the whole planet"
## [14] "I will always be singing praises for her"                                                                   
## [15] "love will be when I start up making chicken curry and end up makiing sambar"                                
## [16] "life will be the most beautiful then"                                                                       
## [17] "will get every morning and thank god for the day because she is with me"                                    
## [18] "I would like to say a lot"                                                                                  
## [19] ""                                                                                                           
## [20] "will tell later"                                                                                            
## [21] ""                                                                                                           
## [22] ""

O Romeo, Romeo, wherefore art thou Romeo?

Text Clean Up

Some classifiers (like XGBoost) need the target class to be numerical; others (like Naive Bayes) do not. I'll make both so that we can try out different modeling methods.

data_spam<-data %>%
  filter(class == "SPAM") %>%
  mutate(target = 1)
data_ham<- data %>%
  filter(class == "HAM") %>%
  mutate(target = 0)
data<-rbind(data_spam, data_ham)
data$id <- 1:nrow(data)
data$target<- as.factor(data$target)
DT::datatable(data %>%
                count(class, target),
              extensions = c("FixedColumns", "FixedHeader"),
              options = list(scrollX = TRUE,
                             paging = TRUE,
                             fixedHeader = TRUE))

Tidymodels and Text Processing

We are revisiting this project to use tidymodels as a replacement for the work previously done. Hopefully this will provide a more concise approach to modeling and text analysis.

data<-data %>%
  select(-class) %>% # drop the redundant string label; keep the factor target
  mutate(message= str_remove_all(message, pattern = "<.*?>")) %>%
  mutate(message= str_remove_all(message, pattern = "[:digit:]")) %>%
  mutate(message= str_remove_all(message, pattern = "[:punct:]")) %>%
  mutate(message= str_remove_all(message, pattern = "[\n]")) %>%
  mutate(message= str_to_lower(message))
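As a quick illustration (not from the original analysis) of what that cleaning chain strips out:

# Made-up example message showing the effect of each pattern.
example <- "<p>Winner!! Claim 900 prizes now.</p>"
example %>%
  str_remove_all("<.*?>") %>%     # drop HTML tags
  str_remove_all("[:digit:]") %>% # drop numbers
  str_remove_all("[:punct:]") %>% # drop punctuation
  str_remove_all("[\n]") %>%      # drop newlines
  str_to_lower()
## [1] "winner claim  prizes now"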

Set up a recipe by which the data will be preprocessed for both the training and testing data. We can (and, for memory reasons, have to) limit the maximum number of tokens prior to calculating the tf-idf score. This is a massive improvement over using various elements from the tm and caret packages.

spam_recipe<- 
  recipe(target ~ message, data=data) %>%
  step_tokenize(message) %>%
  step_stem(message) %>%
  step_stopwords(message) %>%
  step_tokenfilter(message, max_tokens = 2000) %>%
  step_tfidf(message)
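To sanity-check what the recipe produces, you could prep a smaller version on a sample of the data and bake it. This is a sketch, with the sample size and max_tokens shrunk so it is cheap to run:

# Hypothetical inspection step: prep on a random sample and look at the
# tf-idf columns that come out.
recipe(target ~ message, data = slice_sample(data, n = 200)) %>%
  step_tokenize(message) %>%
  step_stem(message) %>%
  step_stopwords(message) %>%
  step_tokenfilter(message, max_tokens = 50) %>%
  step_tfidf(message) %>%
  prep() %>%
  bake(new_data = NULL) %>%
  glimpse()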

Train/test split. This, too, is much better than in previous R packages, though I still like the sklearn version.

set.seed(9450)
data_split<- initial_split(data, prop=.75, strata = target)
train_data<- training(data_split)
test_data<- testing(data_split)
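A quick check (not in the original) that the stratified split kept the class balance similar in both sets:

# Compare SPAM/HAM proportions in the training and testing sets.
train_data %>% count(target) %>% mutate(prop = n / sum(n))
test_data %>% count(target) %>% mutate(prop = n / sum(n))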

Modeling

We will fit the data using a 5-fold cross-validation scheme and collect the mean results for ROC AUC and accuracy.

xgb_mod<-                # create the model instance
  boost_tree() %>%
  set_engine("xgboost") %>%
  set_mode("classification")

cv_folds<- vfold_cv(train_data, 
                    strata = target, 
                    v = 5, repeats = 1)

With the model, recipe, and CV folds set up, we will now define a workflow and fit the data.

xgb_wf<-                 # create and define the pipeline
  workflow() %>%
  add_recipe(spam_recipe) %>%
  add_model(xgb_mod)
doParallel::registerDoParallel()

xgb_fit<-
  fit_resamples(
    xgb_wf,
    cv_folds, 
    metrics = metric_set(roc_auc, accuracy), 
    control = control_resamples(save_pred = TRUE)
  )
collect_metrics(xgb_fit)
## # A tibble: 2 x 5
##   .metric  .estimator  mean     n   std_err
##   <chr>    <chr>      <dbl> <int>     <dbl>
## 1 accuracy binary     0.989     5 0.000575 
## 2 roc_auc  binary     0.999     5 0.0000608
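Since save_pred = TRUE was set, the out-of-fold predictions are available; a quick sketch (not in the original post) plots one ROC curve per fold:

# One ROC curve per resample; .pred_0 is the probability of the first
# factor level, which yardstick treats as the event by default.
collect_predictions(xgb_fit) %>%
  group_by(id) %>%
  roc_curve(truth = target, .pred_0) %>%
  autoplot()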

Evaluate on the testing data. The last_fit function does one last fit on the entire training set and then predicts on the test set. If you pass last_fit the initial data split, it knows which set to use for each step. While that makes me a bit uncomfortable, it is a nice feature.

xgb_final<-
  xgb_wf %>%
  last_fit(data_split) %>%
  collect_predictions() %>%
  conf_mat(truth = target, estimate = .pred_class)
xgb_final
##           Truth
## Prediction    0    1
##          0 7033  110
##          1   59 9269

Very nice results. Only 169 wrongly classified messages, with about twice as many false negatives as false positives.
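As a final check (not in the original), yardstick can derive accuracy and related statistics straight from the confusion matrix object:

# summary() on a conf_mat yields accuracy, sensitivity, specificity, etc.
summary(xgb_final)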