Introduction

Mobile phone usage has grown explosively over the last two decades, and with that growth has come a parallel rise in phishing and other disruptive text-message spam. Unlike email spam, however, SMS-based spam can carry a direct per-message cost for the mobile user, making it a particular burden on the customer.

In this article, I will create a Deep Learning model to predict SMS spam, using R and its implementation of the H2O AI infrastructure.

The resulting model has an overall accuracy of 97.2%.

SMS Text Data

The SMS data that I will be using in this article is the SMS Spam Collection, which can be found here.

The data itself is made up of 5,574 observations and two variables, type and text:

# the tidyverse is attached here for the pipe, dplyr, and ggplot2 calls used throughout:
library(tidyverse)

data_ori <- readr::read_csv("/YOUR/PATH/TO/DATA/sms_spam.csv")
head(data_ori) %>% 
  formattable::formattable()
type text
ham Go until jurong point, crazy.. Available only in bugis n great world la e buffet… Cine there got amore wat…
ham Ok lar… Joking wif u oni…
spam Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C’s apply 08452810075over18’s
ham U dun say so early hor… U c already then say…
ham Nah I don’t think he goes to usf, he lives around here though
spam FreeMsg Hey there darling it’s been 3 week’s now and no word back! I’d like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv

The only pre-processing that I performed was to convert the ham / spam factor to a binary 1 or 0:

data_ori$type <- ifelse(data_ori$type == "spam", 1, 0)
head(data_ori) %>% 
  formattable::formattable()
type text
0 Go until jurong point, crazy.. Available only in bugis n great world la e buffet… Cine there got amore wat…
0 Ok lar… Joking wif u oni…
1 Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C’s apply 08452810075over18’s
0 U dun say so early hor… U c already then say…
0 Nah I don’t think he goes to usf, he lives around here though
1 FreeMsg Hey there darling it’s been 3 week’s now and no word back! I’d like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv

Creating a Document-Term Matrix (DTM)

Before we can model the SMS data, it must first be converted to a Document-Term Matrix (DTM) through a series of steps (vectorization, corpus creation, and so on).

This process is covered in detail in my previous articles, so for the sake of brevity I complete it here with a function I wrote called create_corpus_dtm():

dat_dtm <- data_ori %>%
  create_corpus_dtm(text)
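create_corpus_dtm() is one of my own helpers and is not reproduced here. As a rough idea of what such a function does, below is a minimal stand-in built on the tm package; the specific cleaning steps (lowercasing, removing punctuation, numbers, and stopwords) are my assumptions about a typical pipeline, not the original implementation:

library(tm)

# a hypothetical stand-in for create_corpus_dtm(); the real helper may differ:
create_corpus_dtm_sketch <- function(dat, text_col = "text") {
  corpus <- VCorpus(VectorSource(dat[[text_col]]))
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, removeWords, stopwords("en"))
  corpus <- tm_map(corpus, stripWhitespace)
  DocumentTermMatrix(corpus)
}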

Converting the DTM to an H2O Data Object

Once the DTM has been created, it then has to be converted into a format that H2O can work with. For this purpose, I created the H2O_dtm_to_df() function.

NOTE: At this point we initialize the H2O Cluster:

library(h2o)
h2o.init()

As can be seen, the function takes in a DTM object, as well as the original data, and returns an H2O data object:

dat_df <- H2O_dtm_to_df(
  dtm_in = dat_dtm,
  dat_ori_in = data_ori,
  response_name = "type"
)
dim(dat_df)
## [1] 5574 6593
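H2O_dtm_to_df() is likewise not reproduced here. Since the Train split is pushed to the cluster with as.h2o() in the next section, a plausible sketch is one that densifies the DTM, re-attaches the response as a factor, and returns a plain data frame; the real function may handle the sparse matrix more carefully:

# a hypothetical sketch of H2O_dtm_to_df(); the real helper may differ:
H2O_dtm_to_df_sketch <- function(dtm_in, dat_ori_in, response_name) {
  df <- as.data.frame(as.matrix(dtm_in))
  # make the term names syntactically valid, unique column names:
  names(df) <- make.names(names(df), unique = TRUE)
  # re-attach the response as a factor (H2O expects a factor for classification):
  df[[response_name]] <- as.factor(dat_ori_in[[response_name]])
  df
}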

Splitting the Data

In order to create and validate the model, I first split the data into Train and Test sets, and then I sub-split that Train data into three parts: a sub-train, a sub-validation, and a sub-test set.

As you can see below, the sub-split is done using a function that I wrote called H2O_split_frame():

# the response column name, referenced by the helpers and the grid below:
response_name <- "type"

# main Train:
data_train <- dat_df[1:4169, ]

# main Test:
data_test <- dat_df[4170:5574, ]

# subset Train into 3 parts:
data_train <- as.h2o(data_train)

data_sub_split <- data_train %>% 
  H2O_split_frame(response_name)

data_sub_train <- data_sub_split$"dtrain"
data_sub_valid <- data_sub_split$"dvalid"
data_sub_test <- data_sub_split$"dtest"

rm(data_sub_split)
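H2O_split_frame() is also one of my helpers. A plausible stand-in built on H2O's own h2o.splitFrame() is sketched below; the 60/20/20 ratios and the seed are my assumptions, and note that h2o.splitFrame() itself does approximate, non-stratified random splitting:

# a hypothetical stand-in for H2O_split_frame(); ratios are assumed:
H2O_split_frame_sketch <- function(data, response_name, seed = 1) {
  splits <- h2o.splitFrame(data, ratios = c(0.6, 0.2), seed = seed)
  list(dtrain = splits[[1]], dvalid = splits[[2]], dtest = splits[[3]])
}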

The Deep Learning Grid

In order to maximize the robustness of my H2O model, I create a Deep Learning Model Grid using the code below.

I use five hidden layers, the last of which is a two-unit layer to match the binary response. The hyper_params argument contains the grid parameters, and the candidate values for each parameter can be seen in the R call below. The random search trains models for up to 10 minutes (max_runtime_secs = 600).

Note the use of the data_sub_train and data_sub_valid objects from the previous step:

# model:
dl_grid <- h2o.grid(
  algorithm = "deeplearning",
  x = setdiff(names(data_sub_train), response_name),
  y = response_name,
  training_frame = data_sub_train,
  validation_frame = data_sub_valid,
  seed = 1,
  hidden = c(200, 100, 50, 25, 2),
  hyper_params = list(
    epochs = c(50, 100),
    activation = c("Tanh", "TanhWithDropout", "Maxout"),
    l1 = c(0, 0.00001, 0.0001, 0.001, 0.01),
    l2 = c(0, 0.00001, 0.0001, 0.001, 0.01),
    loss = c("Automatic", "CrossEntropy"),
    distribution = c("AUTO", "bernoulli"),
    input_dropout_ratio = c(0.1, 0.2, 0.3),
    balance_classes = c(TRUE, FALSE)
  ),
  search_criteria = list(
    strategy = "RandomDiscrete",
    max_runtime_secs = 600
  )
)

# grid:
dl_grid_perf <- h2o.getGrid(
  grid_id = dl_grid@grid_id,
  sort_by = "accuracy",
  decreasing = TRUE
)

The best model extracted from the DL grid:

dl_grid_perf@summary_table[1, c("activation", "distribution", "epochs", "input_dropout_ratio", "l1", "l2", "loss", "accuracy")]
##   activation distribution epochs input_dropout_ratio    l1  l2         loss
## 1       Tanh         AUTO   50.0                 0.1 0.001 0.0 CrossEntropy
##             accuracy
## 1 0.9852216748768473
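The threshold step below references h2o_mod, the winning model itself. Extracting it from the accuracy-sorted grid is the one step not shown above:

# pull the top model (first entry of the accuracy-sorted grid):
h2o_mod <- h2o.getModel(dl_grid_perf@model_ids[[1]])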

Binary Thresholds

Because the goal here is to predict a binary ham or spam response, we have to convert the predicted probabilities produced by the deep learning model into a binary response using a defined threshold.

To this end, I created a function called H2O_binary_thresh() that does just that. It allows for an intuitive way of returning thresholds from an H2O model, including the ability to add your own custom thresholds.

NOTE the use of the data_sub_test object to extract thresholds, as described above:

thresh_ret <- H2O_binary_thresh(
  h2o_model = h2o_mod,
  new_data = data_sub_test,
  # adding two custom thresholds:
  custom_thresh = c(0.5, (1/8))
)
thresh_ret %>% 
  mutate(metric = forcats::fct_reorder(metric, threshold)) %>% 
  ggplot(aes(x = threshold, y = metric)) +
  geom_point() +
  ggtheme1()
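H2O_binary_thresh() itself is not reproduced here. H2O already computes the optimal threshold for each metric on a performance object, so a minimal sketch, assuming that is where the thresholds come from, might look like this (the tibble layout is my guess at the return format; the custom_thresh_N labels mirror the output tables below):

# a hypothetical sketch of H2O_binary_thresh(); the real helper may differ:
H2O_binary_thresh_sketch <- function(h2o_model, new_data, custom_thresh = NULL) {
  perf <- h2o.performance(h2o_model, newdata = new_data)
  max_scores <- perf@metrics$max_criteria_and_metric_scores
  out <- tibble::tibble(
    metric = max_scores$metric,
    threshold = max_scores$threshold
  )
  if (!is.null(custom_thresh)) {
    out <- dplyr::bind_rows(
      out,
      tibble::tibble(
        metric = paste0("custom_thresh_", seq_along(custom_thresh)),
        threshold = custom_thresh
      )
    )
  }
  out
}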

Predicting Spam

After threshold extraction, the last step is to use the Test data to evaluate the accuracy of the model.

The following is a comparison of prediction accuracy metrics, computed at each of the thresholds extracted above:
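The pred_comp_full table is assembled outside of the snippets shown so far. As a rough idea of how it could be built, here is a hedged sketch that scores the Test set once and computes confusion-matrix metrics at every extracted threshold; metrics_at and pred_comp_sketch are my own names, and the original likely computes the remaining columns (FNR, FPR, FDR, FOR, TS) in the same way:

# score the Test set once (mirroring the as.h2o() treatment of data_train above):
test_h2o  <- as.h2o(data_test)
pred_prob <- as.data.frame(h2o.predict(h2o_mod, test_h2o))$p1
actual    <- as.vector(test_h2o$type) == "1"

# confusion-matrix metrics at a single threshold:
metrics_at <- function(thresh, label) {
  pred <- pred_prob >= thresh
  tp <- sum(pred & actual);  fn <- sum(!pred & actual)
  fp <- sum(pred & !actual); tn <- sum(!pred & !actual)
  tibble::tibble(
    TPR = tp / (tp + fn), TNR = tn / (tn + fp),
    PPV = tp / (tp + fp), NPV = tn / (tn + fn),
    ACC = (tp + tn) / length(actual),
    JSTAT = TPR + TNR - 1,  # Youden's J
    AT_THRESH = label, threshold = thresh
  )
}

# one row per threshold returned by H2O_binary_thresh():
pred_comp_sketch <- purrr::map2_dfr(
  thresh_ret$threshold, thresh_ret$metric, metrics_at
)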

pred_comp_full %>% 
  formattable::formattable()
TPR TNR PPV NPV FNR FPR FDR FOR ACC TS JSTAT AT_THRESH threshold
0.945 0.954 0.755 0.991 0.055 0.046 0.245 0.009 0.953 0.724 0.900 max_min_per_class_accuracy 0.0287
0.913 0.981 0.879 0.987 0.087 0.019 0.121 0.013 0.972 0.811 0.894 custom_thresh_2 0.1250
0.852 0.989 0.918 0.978 0.148 0.011 0.082 0.022 0.971 0.792 0.841 max_f1 0.3213
0.852 0.989 0.918 0.978 0.148 0.011 0.082 0.022 0.971 0.792 0.841 max_f2 0.3213
0.852 0.989 0.918 0.978 0.148 0.011 0.082 0.022 0.971 0.792 0.841 max_accuracy 0.3213
0.852 0.989 0.918 0.978 0.148 0.011 0.082 0.022 0.971 0.792 0.841 max_absolute_mcc 0.3213
0.852 0.989 0.918 0.978 0.148 0.011 0.082 0.022 0.971 0.792 0.841 max_mean_per_class_accuracy 0.3213
0.847 0.990 0.928 0.977 0.153 0.010 0.072 0.023 0.972 0.795 0.837 custom_thresh_1 0.5000
0.803 0.995 0.961 0.971 0.197 0.005 0.039 0.029 0.970 0.778 0.798 max_f0point5 0.7877
0.984 0.609 0.274 0.996 0.016 0.391 0.726 0.004 0.658 0.272 0.592 max_recall 0.0042
0.984 0.609 0.274 0.996 0.016 0.391 0.726 0.004 0.658 0.272 0.592 max_tps 0.0042
0.984 0.609 0.274 0.996 0.016 0.391 0.726 0.004 0.658 0.272 0.592 max_tpr 0.0042
0.038 1.000 1.000 0.874 0.962 0.000 0.000 0.126 0.875 0.038 0.038 max_precision 0.9982
0.038 1.000 1.000 0.874 0.962 0.000 0.000 0.126 0.875 0.038 0.038 max_specificity 0.9982
0.038 1.000 1.000 0.874 0.962 0.000 0.000 0.126 0.875 0.038 0.038 max_tns 0.9982
0.038 1.000 1.000 0.874 0.962 0.000 0.000 0.126 0.875 0.038 0.038 max_fns 0.9982
0.038 1.000 1.000 0.874 0.962 0.000 0.000 0.126 0.875 0.038 0.038 max_tnr 0.9982
0.038 1.000 1.000 0.874 0.962 0.000 0.000 0.126 0.875 0.038 0.038 max_fnr 0.9982
1.000 0.002 0.130 1.000 0.000 0.998 0.870 0.000 0.132 0.130 0.002 max_fps 0.0004
1.000 0.002 0.130 1.000 0.000 0.998 0.870 0.000 0.132 0.130 0.002 max_fpr 0.0004

Model Accuracy

When evaluating a binary response model, I tend to lean on Youden’s J statistic (J = sensitivity + specificity − 1) as a reliable metric of overall accuracy, as it takes both sensitivity and specificity into account.

pred_comp_full %>% 
  select(threshold = AT_THRESH, JSTAT) %>% 
  mutate(threshold = forcats::fct_reorder(threshold, JSTAT)) %>% 
  ggplot(aes(x = threshold, y = JSTAT)) +
  geom_point() + 
  ggtheme1()

However, because we have all metrics across all thresholds, it is possible to evaluate any metric:

# Sensitivity (i.e. True Positive Rate, TPR):
pred_comp_full %>% 
  select(threshold = AT_THRESH, TPR) %>% 
  mutate(threshold = forcats::fct_reorder(threshold, TPR)) %>% 
  ggplot(aes(x = threshold, y = TPR)) +
  geom_point() + 
  ggtheme1()

The Best Performer

With an overall accuracy of 97.2%, the Deep Learning model using my second custom threshold (0.125) is the best performer of the bunch:

pred_comp_full %>% 
  arrange(desc(ACC)) %>% 
  head(1) %>% 
  formattable::formattable()
TPR TNR PPV NPV FNR FPR FDR FOR ACC TS JSTAT AT_THRESH threshold
0.913 0.981 0.879 0.987 0.087 0.019 0.121 0.013 0.972 0.811 0.894 custom_thresh_2 0.125
pred_comp_full %>% 
  arrange(desc(ACC)) %>% 
  head(1) %>% 
  select(TPR:JSTAT) %>% 
  tidyr::gather() %>% 
  mutate(key = forcats::fct_reorder(key, value)) %>% 
  ggplot(aes(x = key, y = value)) +
  geom_col() +
  ggtheme1()

As usual, thank you for reading.