library(naivebayes)
library(glmnet)
library(dplyr)
library(tidyr)
library(rvest)
library(ggplot2)
library(tidytext)
library(knitr)
For this project we import a public dataset of several thousand SMS texts, each labeled “spam” or “ham” (i.e., not spam). Our goal is to build a model capable of predicting whether new texts are spam or ham. We will compare two different models to see how they perform.
We start by importing the data from GitHub (originally procured from engineeringbigdata.com) and making sure it is initially tidy.
raw_df <- read.csv("https://raw.githubusercontent.com/mehtablocker/cuny_607/master/project_4/sms_spam_ham.csv", stringsAsFactors = F)
raw_df %>% head() %>% kable()
| type | text |
|---|---|
| ham | Hope you are having a good week. Just checking in |
| ham | K..give back my thanks. |
| ham | Am also doing in cbe only. But have to pay. |
| spam | complimentary 4 STAR Ibiza Holiday or £10,000 cash needs your URGENT collection. 09066364349 NOW from Landline not to lose out! Box434SK38WP150PPM18+ |
| spam | okmail: Dear Dave this is your final notice to collect your 4* Tenerife Holiday or #5000 CASH award! Call 09061743806 from landline. TCs SAE Box326 CW25WX 150ppm |
| ham | Aiya we discuss later lar… Pick u up at 4 is it? |
We withhold a portion of the data (20 percent) from our training set so that we can later use it to test the model.
sms_df <- raw_df %>% as_tibble() %>% mutate(ind=1:nrow(raw_df)) %>% rename(category=1)
train_df <- sms_df %>% sample_frac(0.8)
test_df <- sms_df %>% anti_join(train_df, by="ind")
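Note that sample_frac() draws a random sample, so the exact 80/20 split (and the numbers reported below) will vary from run to run. A minimal sketch of making the split reproducible, with an arbitrary seed value:
# Arbitrary seed for a reproducible train/test split
set.seed(607)
train_df <- sms_df %>% sample_frac(0.8)
test_df <- sms_df %>% anti_join(train_df, by="ind")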
We use the tidytext package to break down all of the SMS texts into single words.
train_words_df <- train_df %>% unnest_tokens(word, text)
train_words_df %>% arrange(ind) %>% head() %>% kable()
| category | ind | word |
|---|---|---|
| ham | 1 | hope |
| ham | 1 | you |
| ham | 1 | are |
| ham | 1 | having |
| ham | 1 | a |
| ham | 1 | good |
We are interested in how often each word is associated with spam versus ham. So we group the tokenized data frame by word and calculate, for each word, the percentage of its occurrences that come from spam texts.
word_spam_pct <- train_words_df %>%
group_by(word) %>%
summarise(n = length(category), spam_pct = mean(category=="spam")) %>%
ungroup()
word_spam_pct %>% arrange(desc(n)) %>% head() %>% kable()
| word | n | spam_pct |
|---|---|---|
| to | 1790 | 0.3050279 |
| i | 1789 | 0.0212409 |
| you | 1703 | 0.1332942 |
| a | 1124 | 0.2624555 |
| u | 1077 | 0.1234912 |
| the | 1067 | 0.1433927 |
If there were no link between any specific words and spam, we would expect every word’s spam percentage to equal the base rate spam percentage, plus or minus some statistical error. We do not believe this to be the case, so we will calculate every word’s “delta spam percentage”: the amount that each word “adds” or “subtracts” to spam over and above the base rate.
To do this we could simply take each word’s empirical spam percentage and subtract the base rate spam percentage. However, some words have very high or very low empirical percentages simply due to small sample size. Rather than filter on some minimum sample size, we will instead regress each value toward the mean by averaging 100 fictitious instances of the base rate into every word’s empirical spam percentage.
(Note: this is equivalent to using a Bayesian prior. We could be more exact by building a Beta distributed prior with more precise estimates for the shape parameters, but a shortcut will suffice here. We’ll somewhat arbitrarily choose 100 for the regression sample size and the base rate spam percentage for the regression point.)
After regressing every word’s spam percentage toward the mean, we then subtract the base rate to get a delta spam percentage for each word.
# Base rate
prior_spam_freq <- table(train_df$category)/length(train_df$category)
prior_spam_freq
##
## ham spam
## 0.8662019 0.1337981
prior_n <- 100
# Regress toward the mean and subtract the base rate
word_spam_pct <- word_spam_pct %>%
mutate(spam_pct_regr = (prior_spam_freq["spam"]*prior_n + spam_pct*n)/(prior_n+n),
delta_spam_pct = spam_pct_regr - prior_spam_freq["spam"])
word_spam_pct %>% arrange(desc(delta_spam_pct)) %>% head() %>% kable()
| word | n | spam_pct | spam_pct_regr | delta_spam_pct |
|---|---|---|---|---|
| â | 275 | 0.9090909 | 0.7023462 | 0.5685481 |
| free | 226 | 0.8008850 | 0.5962571 | 0.4624590 |
| txt | 138 | 0.9202899 | 0.5898311 | 0.4560331 |
| claim | 87 | 1.0000000 | 0.5367904 | 0.4029923 |
| mobile | 110 | 0.9000000 | 0.5351419 | 0.4013439 |
| call | 472 | 0.6038136 | 0.5216430 | 0.3878450 |
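As an aside, this shrinkage is exactly the posterior mean of a Beta-Binomial model whose Beta prior has shape parameters alpha = prior_n × base rate and beta = prior_n × (1 − base rate). A minimal sketch verifying the equivalence for a hypothetical word seen 20 times, 15 of them in spam:
# Hypothetical word: 20 occurrences, 15 of them in spam texts
p0 <- prior_spam_freq["spam"]        # base rate
alpha0 <- prior_n * p0               # prior pseudo-counts of spam
beta0 <- prior_n * (1 - p0)          # prior pseudo-counts of ham
# Beta-Binomial posterior mean...
(alpha0 + 15) / (alpha0 + beta0 + 20)
# ...equals the regression-toward-the-mean formula used above
(p0 * prior_n + (15/20) * 20) / (prior_n + 20)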
Now that we have a value for how much each word adds or subtracts to spam, we can sum those values to get a total score for each SMS text and visualize the distributions by actual spam and ham.
train_dsp <- train_words_df %>%
left_join(word_spam_pct %>% select(word, delta_spam_pct), by="word") %>%
group_by(ind) %>% summarise(total_dsp = sum(delta_spam_pct, na.rm=T)) %>% ungroup()
train_dsp <- train_dsp %>% left_join(train_df, by="ind")
train_dsp %>%
ggplot(aes(x=category, y=total_dsp, fill=category)) +
geom_boxplot() +
labs(title="SMS Total Delta Spam Percentage by Category", x="Category", y="Total Delta Spam%")
train_dsp %>%
ggplot(aes(x=total_dsp, fill=category)) +
geom_density(alpha=0.8, color="black") +
labs(title="SMS Total Delta Spam Percentage by Category", x="Total Delta Spam%", y="Density")
While there is some overlap, there is also pretty clear segmentation between the two categories.
Armed with a spam “score” for each SMS text as well as a response variable (i.e., whether the text was actually spam or ham), we can fit a supervised learning model to allow us to make future predictions. Here we will use the Naive Bayes algorithm from the naivebayes package.
train_model <- naive_bayes(category ~ total_dsp, data = train_dsp, usekernel = T)
train_predict <- predict(train_model, train_dsp)
train_conf_mat <- table(train_predict, train_dsp$category)
# Confusion matrix
train_conf_mat
##
## train_predict ham spam
## ham 3827 64
## spam 23 531
# Percent accuracy
sum(diag(train_conf_mat))/sum(train_conf_mat)
## [1] 0.9804274
Above is the confusion matrix and percent accuracy for the training data. Now we use the model on the test data that was previously withheld.
test_words_df <- test_df %>% unnest_tokens(word, text)
test_dsp <- test_words_df %>%
left_join(word_spam_pct %>% select(word, delta_spam_pct), by="word") %>%
group_by(ind) %>% summarise(total_dsp = sum(delta_spam_pct, na.rm=T)) %>% ungroup()
test_dsp <- test_dsp %>% left_join(test_df, by="ind")
# Apply the model fit on the training data to the withheld test data
test_predict <- predict(train_model, test_dsp)
test_conf_mat <- table(test_predict, test_dsp$category)
# Confusion matrix
test_conf_mat
##
## test_predict ham spam
## ham 954 26
## spam 6 126
# Percent accuracy
sum(diag(test_conf_mat))/sum(test_conf_mat)
## [1] 0.971223
Even on the test data, our accuracy of 97.1 percent is quite a bit better than the 86.6 percent we could have achieved by naively predicting “ham” for every text (i.e., the base rate).
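For reference, that baseline can be checked directly from the test set itself (its ham frequency may differ slightly from the training base rate):
# Accuracy of naively predicting "ham" for every test text
mean(test_df$category == "ham")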
Lastly, we build a second model for the sake of comparison. For this we use a methodology similar to what is described by Julia Silge here: https://juliasilge.com/blog/tidy-text-classification/
We create a document-term matrix (keeping only words that appear more than five times) and fit a logistic regression model with LASSO regularization using the glmnet package, letting cross-validation choose the penalty.
sparse_words_mat <- train_words_df %>%
group_by(word) %>% filter(n() > 5) %>% ungroup() %>%
count(ind, word) %>% cast_sparse(ind, word, n)
response_df <- tibble(ind = as.integer(rownames(sparse_words_mat))) %>%
left_join(train_df %>% select(ind, category), by="ind") %>%
mutate(is_spam=category=="spam")
glm_model <- cv.glmnet(sparse_words_mat, response_df$is_spam, family = "binomial")
# Note: tidy() for glmnet objects comes from the broom package
glm_coefs <- glm_model$glmnet.fit %>%
  broom::tidy() %>%
  filter(lambda == glm_model$lambda.1se)
glm_intercept <- glm_coefs %>%
filter(term == "(Intercept)") %>%
pull(estimate)
Using the fitted model on the test data gives us a probability of spam for each SMS text. To convert the probabilities into binary predictions of spam/ham we set a threshold of 0.5, which also yields an overall spam percentage similar to the base rate. We then check the accuracy of the predictions.
test_classifications <- test_words_df %>%
inner_join(glm_coefs, by = c("word" = "term")) %>%
group_by(ind) %>%
summarize(score = sum(estimate)) %>%
mutate(probability = plogis(glm_intercept + score))
# Threshold of 0.5 yields similar base rate
mean(test_classifications$probability>0.5)
## [1] 0.1354167
test_classifications <- test_classifications %>%
left_join(test_df, by="ind") %>%
mutate(prediction = ifelse(probability>0.5, "spam", "ham"))
glm_conf_mat <- table(test_classifications$prediction, test_classifications$category)
# Confusion matrix
glm_conf_mat
##
## ham spam
## ham 805 25
## spam 4 126
# Percent accuracy
sum(diag(glm_conf_mat))/sum(glm_conf_mat)
## [1] 0.9697917
From the above confusion matrix and accuracy percentage we see that this model performs fairly similarly to the Delta Spam Percentage / Naive Bayes model.
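As a final illustration, here is a minimal sketch of scoring a brand-new message with the fitted LASSO coefficients (the SMS text below is hypothetical):
# Score a hypothetical new SMS with the fitted LASSO model
new_sms <- tibble(ind = 1L, text = "URGENT! Call now to claim your free prize")
new_sms %>%
  unnest_tokens(word, text) %>%
  inner_join(glm_coefs, by = c("word" = "term")) %>%
  summarise(probability = plogis(glm_intercept + sum(estimate)))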
We imported a dataset of SMS texts labeled “spam” or “ham”, which we then decomposed into single words. We created two different models to try to predict spam in our test data: one using a “homegrown” metric with a Naive Bayes classifier, and another using a document-term matrix with LASSO logistic regression. Both models performed similarly in terms of accuracy, and both outperformed the base rate by a significant margin.