Mobile phone usage has seen an explosive growth in the last two decades, and with that growth there has been a parallel increase in the occurrences of phishing and overall disruptive text message spam. Unlike email spam, however, SMS-based spam is directly associated with a per-message cost to the mobile user, so it is a particular burden on the customer.
In this article, I will create a Deep Learning model to predict SMS spam, using R and its implementation of the H2O AI infrastrcture.
The resulting model has an overall accuracy of 97.2%.
The SMS data that I will be using in this article is the SMS Spam Collection, which can be found here.
The data itself is made up of 5,574 observations and two variables:
data_ori <- readr::read_csv("/YOUR/PATH/TO/DATA/sms_spam.csv")
head(data_ori) %>%
formattable::formattable()
| type | text |
|---|---|
| ham | Go until jurong point, crazy.. Available only in bugis n great world la e buffet… Cine there got amore wat… |
| ham | Ok lar… Joking wif u oni… |
| spam | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C’s apply 08452810075over18’s |
| ham | U dun say so early hor… U c already then say… |
| ham | Nah I don’t think he goes to usf, he lives around here though |
| spam | FreeMsg Hey there darling it’s been 3 week’s now and no word back! I’d like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv |
The only pre-processing that I performed was to conver the ham / spam factor to a binary 1 or 0:
data_ori$'type' <- ifelse(data_ori$'type' == "spam", 1, 0)
head(data_ori) %>%
formattable::formattable()
| type | text |
|---|---|
| 0 | Go until jurong point, crazy.. Available only in bugis n great world la e buffet… Cine there got amore wat… |
| 0 | Ok lar… Joking wif u oni… |
| 1 | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C’s apply 08452810075over18’s |
| 0 | U dun say so early hor… U c already then say… |
| 0 | Nah I don’t think he goes to usf, he lives around here though |
| 1 | FreeMsg Hey there darling it’s been 3 week’s now and no word back! I’d like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv |
Before we can model the SMS data it has to first be converted to a Document-Term Matrix (DTM) through a series of steps (vectorization, creating a corpus, … etc).
This process is expanded upon in my previous articles, so for the sake of brevity here I complete the process using the function I wrote called create_corpus_dtm():
dat_dtm <- data_ori %>%
create_corpus_dtm(text)
Once the DTM has been created it has to then be converted to an H2O data.frame. For this purpose, I have created the H2O_dtm_to_df() function.
NOTE: At this point we initialize the H2O Cluster:
h2o.init()
As can be seen, the function takes in a DTM object, as well as the original data, and returns am H2O data object:
dat_df <- H2O_dtm_to_df(
dtm_in = dat_dtm,
dat_ori_in = data_ori,
response_name = "type"
)
dim(dat_df)
## [1] 5574 6593
In order to create and validate the model, I first split the data into Train and Test sets, and then I sub-split that Train data into the following:
As you can see below, the sub-split is done using a function that I wrote called H2O_split_frame():
# main Train:
data_train <- dat_df[1:4169, ]
# main Test:
data_test <- dat_df[4170:5574, ]
# subset Train into 3 parts:
data_train <- as.h2o(data_train)
data_sub_split <- data_train %>%
H2O_split_frame(response_name)
data_sub_train <- data_sub_split$"dtrain"
data_sub_valid <- data_sub_split$"dvalid"
data_sub_test <- data_sub_split$"dtest"
rm(data_sub_split)
In order to maximize the robustness of my H2O model, I create a Deep Learning Model Grid using the code below.
I use five hidden layers, the last of which being a binary layer. The hyper_params arg contains the grid parameters, and the subsequent values of each parameter can be seen in the R call below. The deep learning algorithm will train itself for 10 minutes.
Note the use of the data_sub_train and data_sub_valid objects fron the previous step:
# model:
dl_grid <- h2o.grid(
algorithm = "deeplearning",
x = setdiff(names(data_sub_train), response_name),
y = response_name,
training_frame = data_sub_train,
validation_frame = data_sub_valid,
seed = 1,
hidden = c(200, 100, 50, 25, 2),
hyper_params = list(
epochs = c(50, 100),
activation = c("Tanh", "TanhWithDropout", "Maxout"),
l1 = c(0, 0.00001, 0.0001, 0.001, 0.01),
l2 = c(0, 0.00001, 0.0001, 0.001, 0.01),
loss = c("Automatic", "CrossEntropy"),
distribution = c("AUTO", "bernoulli"),
input_dropout_ratio = c(0.1, 0.2, 0.3),
balance_classes = c(TRUE, FALSE)
),
search_criteria = list(
strategy = "RandomDiscrete",
max_runtime_secs = 600
)
)
# grid:
dl_grid_perf <- h2o.getGrid(
grid_id = dl_grid@grid_id,
sort_by = "accuracy",
decreasing = TRUE
)
The best model extracted from the DL grid:
dl_grid_perf@summary_table[1, c("activation", "distribution", "epochs", "input_dropout_ratio", "l1", "l2", "loss", "accuracy")]
## activation distribution epochs input_dropout_ratio l1 l2 loss
## 1 Tanh AUTO 50.0 0.1 0.001 0.0 CrossEntropy
## accuracy
## 1 0.9852216748768473
Because the goal here is to predict a binary ham or spam response, we have to convert the probability distribution created by the deep learning model into a binary response using a defined threshold.
To this end, I created a function called H2O_binary_thresh() that does just that. It allows for an intuitive way of returning thresholds from an H2O model, including the ability to add your own custom thresholds.
NOTE the use of the data_sub_test object to extract thresholds, as described above:
thresh_ret <- H2O_binary_thresh(
h2o_model = h2o_mod,
new_data = data_sub_test,
# adding two custom thresholds:
custom_thresh = c(0.5, (1/8))
)
thresh_ret %>%
mutate(metric = forcats::fct_reorder(metric, threshold)) %>%
ggplot(aes(x = threshold, y = metric)) +
geom_point() +
ggtheme1()
After threshold extraction, the last step is to use the Test data and evaluate the accuracy of our model.
The following is a comparison of prediction accuracy metrics, each obtained at all of the thresholds from above:
pred_comp_full %>%
formattable::formattable()
| TPR | TNR | PPV | NPV | FNR | FPR | FDR | FOR | ACC | TS | JSTAT | AT_THRESH | threshold |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.945 | 0.954 | 0.755 | 0.991 | 0.055 | 0.046 | 0.245 | 0.009 | 0.953 | 0.724 | 0.900 | max_min_per_class_accuracy | 0.0287 |
| 0.913 | 0.981 | 0.879 | 0.987 | 0.087 | 0.019 | 0.121 | 0.013 | 0.972 | 0.811 | 0.894 | custom_thresh_2 | 0.1250 |
| 0.852 | 0.989 | 0.918 | 0.978 | 0.148 | 0.011 | 0.082 | 0.022 | 0.971 | 0.792 | 0.841 | max_f1 | 0.3213 |
| 0.852 | 0.989 | 0.918 | 0.978 | 0.148 | 0.011 | 0.082 | 0.022 | 0.971 | 0.792 | 0.841 | max_f2 | 0.3213 |
| 0.852 | 0.989 | 0.918 | 0.978 | 0.148 | 0.011 | 0.082 | 0.022 | 0.971 | 0.792 | 0.841 | max_accuracy | 0.3213 |
| 0.852 | 0.989 | 0.918 | 0.978 | 0.148 | 0.011 | 0.082 | 0.022 | 0.971 | 0.792 | 0.841 | max_absolute_mcc | 0.3213 |
| 0.852 | 0.989 | 0.918 | 0.978 | 0.148 | 0.011 | 0.082 | 0.022 | 0.971 | 0.792 | 0.841 | max_mean_per_class_accuracy | 0.3213 |
| 0.847 | 0.990 | 0.928 | 0.977 | 0.153 | 0.010 | 0.072 | 0.023 | 0.972 | 0.795 | 0.837 | custom_thresh_1 | 0.5000 |
| 0.803 | 0.995 | 0.961 | 0.971 | 0.197 | 0.005 | 0.039 | 0.029 | 0.970 | 0.778 | 0.798 | max_f0point5 | 0.7877 |
| 0.984 | 0.609 | 0.274 | 0.996 | 0.016 | 0.391 | 0.726 | 0.004 | 0.658 | 0.272 | 0.592 | max_recall | 0.0042 |
| 0.984 | 0.609 | 0.274 | 0.996 | 0.016 | 0.391 | 0.726 | 0.004 | 0.658 | 0.272 | 0.592 | max_tps | 0.0042 |
| 0.984 | 0.609 | 0.274 | 0.996 | 0.016 | 0.391 | 0.726 | 0.004 | 0.658 | 0.272 | 0.592 | max_tpr | 0.0042 |
| 0.038 | 1.000 | 1.000 | 0.874 | 0.962 | 0.000 | 0.000 | 0.126 | 0.875 | 0.038 | 0.038 | max_precision | 0.9982 |
| 0.038 | 1.000 | 1.000 | 0.874 | 0.962 | 0.000 | 0.000 | 0.126 | 0.875 | 0.038 | 0.038 | max_specificity | 0.9982 |
| 0.038 | 1.000 | 1.000 | 0.874 | 0.962 | 0.000 | 0.000 | 0.126 | 0.875 | 0.038 | 0.038 | max_tns | 0.9982 |
| 0.038 | 1.000 | 1.000 | 0.874 | 0.962 | 0.000 | 0.000 | 0.126 | 0.875 | 0.038 | 0.038 | max_fns | 0.9982 |
| 0.038 | 1.000 | 1.000 | 0.874 | 0.962 | 0.000 | 0.000 | 0.126 | 0.875 | 0.038 | 0.038 | max_tnr | 0.9982 |
| 0.038 | 1.000 | 1.000 | 0.874 | 0.962 | 0.000 | 0.000 | 0.126 | 0.875 | 0.038 | 0.038 | max_fnr | 0.9982 |
| 1.000 | 0.002 | 0.130 | 1.000 | 0.000 | 0.998 | 0.870 | 0.000 | 0.132 | 0.130 | 0.002 | max_fps | 0.0004 |
| 1.000 | 0.002 | 0.130 | 1.000 | 0.000 | 0.998 | 0.870 | 0.000 | 0.132 | 0.130 | 0.002 | max_fpr | 0.0004 |
When evaluating a binary response model, I tend to lean on the Youden’s J statistic as a reliable metric of overall accuracy, as it takes into account both sensitivity and specificity.
pred_comp_full %>%
select(threshold = AT_THRESH, JSTAT) %>%
mutate(threshold = forcats::fct_reorder(threshold, JSTAT)) %>%
ggplot(aes(x = threshold, y = JSTAT)) +
geom_point() +
ggtheme1()
However, because we have all metrics across all thresholds, it is possible to evaluate any metric:
# Sensitivity (ie True Positive Rate, TPR):
pred_comp_full %>%
select(threshold = AT_THRESH, TPR) %>%
mutate(threshold = forcats::fct_reorder(threshold, TPR)) %>%
ggplot(aes(x = threshold, y = TPR)) +
geom_point() +
ggtheme1()
With an overall accuracy of 97.2%, the Deep Learning model utilizing my second custom threshold is the best performer of the bunch:
pred_comp_full %>%
arrange(desc(ACC)) %>%
head(1) %>%
formattable::formattable()
| TPR | TNR | PPV | NPV | FNR | FPR | FDR | FOR | ACC | TS | JSTAT | AT_THRESH | threshold |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.913 | 0.981 | 0.879 | 0.987 | 0.087 | 0.019 | 0.121 | 0.013 | 0.972 | 0.811 | 0.894 | custom_thresh_2 | 0.125 |
pred_comp_full %>%
arrange(desc(ACC)) %>%
head(1) %>%
select(TPR:JSTAT) %>%
tidyr::gather() %>%
mutate(key = forcats::fct_reorder(key, value)) %>%
ggplot(aes(x = key, y = value)) +
geom_col() +
ggtheme1()
As usual, thank you for reading.