A Deep Learning Model that Predicts SMS Text Spam with Over 97% Accuracy, using R and H2O

Introduction

Mobile phone usage has seen an explosive growth in the last two decades, and with that growth there has been a parallel increase in the occurrences of phishing and overall disruptive text message spam. Unlike email spam, however, SMS-based spam is directly associated with a per-message cost to the mobile user, so it is a particular burden on the customer.

In this article, I will create a Deep Learning model to predict SMS spam, using R and its implementation of the H2O AI infrastrcture.

The resulting model has an overall accuracy of 97.2%.

SMS Text Data

The SMS data that I will be using in this article is the SMS Spam Collection, which can be found here.

The data itself is made up of 5,574 observations and two variables:

text: the SMS text
type: the ham or spam catagory of each SMS

data_ori <- readr::read_csv("/YOUR/PATH/TO/DATA/sms_spam.csv")

head(data_ori) %>% 
  formattable::formattable()

type	text
ham	Go until jurong point, crazy.. Available only in bugis n great world la e buffet… Cine there got amore wat…
ham	Ok lar… Joking wif u oni…
spam	Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C’s apply 08452810075over18’s
ham	U dun say so early hor… U c already then say…
ham	Nah I don’t think he goes to usf, he lives around here though
spam	FreeMsg Hey there darling it’s been 3 week’s now and no word back! I’d like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv

The only pre-processing that I performed was to conver the ham / spam factor to a binary 1 or 0:

data_ori$'type' <- ifelse(data_ori$'type' == "spam", 1, 0)

head(data_ori) %>% 
  formattable::formattable()

type	text
0	Go until jurong point, crazy.. Available only in bugis n great world la e buffet… Cine there got amore wat…
0	Ok lar… Joking wif u oni…
1	Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C’s apply 08452810075over18’s
0	U dun say so early hor… U c already then say…
0	Nah I don’t think he goes to usf, he lives around here though
1	FreeMsg Hey there darling it’s been 3 week’s now and no word back! I’d like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv

Creating a Document-Term Matrix (DTM)

Before we can model the SMS data it has to first be converted to a Document-Term Matrix (DTM) through a series of steps (vectorization, creating a corpus, … etc).

This process is expanded upon in my previous articles, so for the sake of brevity here I complete the process using the function I wrote called create_corpus_dtm():

dat_dtm <- data_ori %>%
  create_corpus_dtm(text)

Converting the DTM to an H2O Data Object

Once the DTM has been created it has to then be converted to an H2O data.frame. For this purpose, I have created the H2O_dtm_to_df() function.

NOTE: At this point we initialize the H2O Cluster:

h2o.init()

As can be seen, the function takes in a DTM object, as well as the original data, and returns am H2O data object:

dat_df <- H2O_dtm_to_df(
  dtm_in = dat_dtm,
  dat_ori_in = data_ori,
  response_name = "type"
)

dim(dat_df)

## [1] 5574 6593

Splitting the Data

In order to create and validate the model, I first split the data into Train and Test sets, and then I sub-split that Train data into the following:

data_sub_train = used in model call
data_sub_valid = used in model call
data_sub_test = used in for creating threshold cutoffs

As you can see below, the sub-split is done using a function that I wrote called H2O_split_frame():

# main Train:
data_train <- dat_df[1:4169, ]

# main Test:
data_test <- dat_df[4170:5574, ]

# subset Train into 3 parts:
data_train <- as.h2o(data_train)

data_sub_split <- data_train %>% 
  H2O_split_frame(response_name)

data_sub_train <- data_sub_split$"dtrain"
data_sub_valid <- data_sub_split$"dvalid"
data_sub_test <- data_sub_split$"dtest"

rm(data_sub_split)

The Deep Learning Grid

In order to maximize the robustness of my H2O model, I create a Deep Learning Model Grid using the code below.

I use five hidden layers, the last of which being a binary layer. The hyper_params arg contains the grid parameters, and the subsequent values of each parameter can be seen in the R call below. The deep learning algorithm will train itself for 10 minutes.

Note the use of the data_sub_train and data_sub_valid objects fron the previous step:

# model:
dl_grid <- h2o.grid(
  algorithm = "deeplearning",
  x = setdiff(names(data_sub_train), response_name),
  y = response_name,
  training_frame = data_sub_train,
  validation_frame = data_sub_valid,
  seed = 1,
  hidden = c(200, 100, 50, 25, 2),
  hyper_params = list(
    epochs = c(50, 100),
    activation = c("Tanh", "TanhWithDropout", "Maxout"),
    l1 = c(0, 0.00001, 0.0001, 0.001, 0.01),
    l2 = c(0, 0.00001, 0.0001, 0.001, 0.01),
    loss = c("Automatic", "CrossEntropy"),
    distribution = c("AUTO", "bernoulli"),
    input_dropout_ratio = c(0.1, 0.2, 0.3),
    balance_classes = c(TRUE, FALSE)
  ),
  search_criteria = list(
    strategy = "RandomDiscrete",
    max_runtime_secs = 600
  )
)

# grid:
dl_grid_perf <- h2o.getGrid(
  grid_id = dl_grid@grid_id,
  sort_by = "accuracy",
  decreasing = TRUE
)

The best model extracted from the DL grid:

dl_grid_perf@summary_table[1, c("activation", "distribution", "epochs", "input_dropout_ratio", "l1", "l2", "loss", "accuracy")]

##   activation distribution epochs input_dropout_ratio    l1  l2         loss
## 1       Tanh         AUTO   50.0                 0.1 0.001 0.0 CrossEntropy
##             accuracy
## 1 0.9852216748768473

Binary Thresholds

Because the goal here is to predict a binary ham or spam response, we have to convert the probability distribution created by the deep learning model into a binary response using a defined threshold.

To this end, I created a function called H2O_binary_thresh() that does just that. It allows for an intuitive way of returning thresholds from an H2O model, including the ability to add your own custom thresholds.

NOTE the use of the data_sub_test object to extract thresholds, as described above:

thresh_ret <- H2O_binary_thresh(
  h2o_model = h2o_mod,
  new_data = data_sub_test,
  # adding two custom thresholds:
  custom_thresh = c(0.5, (1/8))
)

thresh_ret %>% 
  mutate(metric = forcats::fct_reorder(metric, threshold)) %>% 
  ggplot(aes(x = threshold, y = metric)) +
  geom_point() +
  ggtheme1()

Predicting Spam

After threshold extraction, the last step is to use the Test data and evaluate the accuracy of our model.

The following is a comparison of prediction accuracy metrics, each obtained at all of the thresholds from above:

pred_comp_full %>% 
  formattable::formattable()

TPR	TNR	PPV	NPV	FNR	FPR	FDR	FOR	ACC	TS	JSTAT	AT_THRESH	threshold
0.945	0.954	0.755	0.991	0.055	0.046	0.245	0.009	0.953	0.724	0.900	max_min_per_class_accuracy	0.0287
0.913	0.981	0.879	0.987	0.087	0.019	0.121	0.013	0.972	0.811	0.894	custom_thresh_2	0.1250
0.852	0.989	0.918	0.978	0.148	0.011	0.082	0.022	0.971	0.792	0.841	max_f1	0.3213
0.852	0.989	0.918	0.978	0.148	0.011	0.082	0.022	0.971	0.792	0.841	max_f2	0.3213
0.852	0.989	0.918	0.978	0.148	0.011	0.082	0.022	0.971	0.792	0.841	max_accuracy	0.3213
0.852	0.989	0.918	0.978	0.148	0.011	0.082	0.022	0.971	0.792	0.841	max_absolute_mcc	0.3213
0.852	0.989	0.918	0.978	0.148	0.011	0.082	0.022	0.971	0.792	0.841	max_mean_per_class_accuracy	0.3213
0.847	0.990	0.928	0.977	0.153	0.010	0.072	0.023	0.972	0.795	0.837	custom_thresh_1	0.5000
0.803	0.995	0.961	0.971	0.197	0.005	0.039	0.029	0.970	0.778	0.798	max_f0point5	0.7877
0.984	0.609	0.274	0.996	0.016	0.391	0.726	0.004	0.658	0.272	0.592	max_recall	0.0042
0.984	0.609	0.274	0.996	0.016	0.391	0.726	0.004	0.658	0.272	0.592	max_tps	0.0042
0.984	0.609	0.274	0.996	0.016	0.391	0.726	0.004	0.658	0.272	0.592	max_tpr	0.0042
0.038	1.000	1.000	0.874	0.962	0.000	0.000	0.126	0.875	0.038	0.038	max_precision	0.9982
0.038	1.000	1.000	0.874	0.962	0.000	0.000	0.126	0.875	0.038	0.038	max_specificity	0.9982
0.038	1.000	1.000	0.874	0.962	0.000	0.000	0.126	0.875	0.038	0.038	max_tns	0.9982
0.038	1.000	1.000	0.874	0.962	0.000	0.000	0.126	0.875	0.038	0.038	max_fns	0.9982
0.038	1.000	1.000	0.874	0.962	0.000	0.000	0.126	0.875	0.038	0.038	max_tnr	0.9982
0.038	1.000	1.000	0.874	0.962	0.000	0.000	0.126	0.875	0.038	0.038	max_fnr	0.9982
1.000	0.002	0.130	1.000	0.000	0.998	0.870	0.000	0.132	0.130	0.002	max_fps	0.0004
1.000	0.002	0.130	1.000	0.000	0.998	0.870	0.000	0.132	0.130	0.002	max_fpr	0.0004

Model Accuracy

When evaluating a binary response model, I tend to lean on the Youden’s J statistic as a reliable metric of overall accuracy, as it takes into account both sensitivity and specificity.

pred_comp_full %>% 
  select(threshold = AT_THRESH, JSTAT) %>% 
  mutate(threshold = forcats::fct_reorder(threshold, JSTAT)) %>% 
  ggplot(aes(x = threshold, y = JSTAT)) +
  geom_point() + 
  ggtheme1()

However, because we have all metrics across all thresholds, it is possible to evaluate any metric:

# Sensitivity (ie True Positive Rate, TPR):
pred_comp_full %>% 
  select(threshold = AT_THRESH, TPR) %>% 
  mutate(threshold = forcats::fct_reorder(threshold, TPR)) %>% 
  ggplot(aes(x = threshold, y = TPR)) +
  geom_point() + 
  ggtheme1()

The Best Performer

With an overall accuracy of 97.2%, the Deep Learning model utilizing my second custom threshold is the best performer of the bunch:

pred_comp_full %>% 
  arrange(desc(ACC)) %>% 
  head(1) %>% 
  formattable::formattable()

TPR	TNR	PPV	NPV	FNR	FPR	FDR	FOR	ACC	TS	JSTAT	AT_THRESH	threshold
0.913	0.981	0.879	0.987	0.087	0.019	0.121	0.013	0.972	0.811	0.894	custom_thresh_2	0.125

pred_comp_full %>% 
  arrange(desc(ACC)) %>% 
  head(1) %>% 
  select(TPR:JSTAT) %>% 
  tidyr::gather() %>% 
  mutate(key = forcats::fct_reorder(key, value)) %>% 
  ggplot(aes(x = key, y = value)) +
  geom_col() +
  ggtheme1()

As usual, thank you for reading.