Introduction

The objective of Project 4 is to create a spam filter using document classification.

Loading Libraries

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3     v purrr   0.3.4
## v tibble  3.1.0     v dplyr   1.0.5
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(kableExtra)
## 
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
## 
##     group_rows
library(tidymodels)
## -- Attaching packages -------------------------------------- tidymodels 0.1.3 --
## v broom        0.7.6      v rsample      0.0.9 
## v dials        0.0.9      v tune         0.1.5 
## v infer        0.5.4      v workflows    0.2.2 
## v modeldata    0.1.0      v workflowsets 0.0.2 
## v parsnip      0.1.5      v yardstick    0.0.8 
## v recipes      0.1.16
## -- Conflicts ----------------------------------------- tidymodels_conflicts() --
## x scales::discard()        masks purrr::discard()
## x dplyr::filter()          masks stats::filter()
## x recipes::fixed()         masks stringr::fixed()
## x kableExtra::group_rows() masks dplyr::group_rows()
## x dplyr::lag()             masks stats::lag()
## x yardstick::spec()        masks readr::spec()
## x recipes::step()          masks stats::step()
## * Use tidymodels_prefer() to resolve common conflicts.
library(tidytext)
library(textrecipes)
library(discrim)
## 
## Attaching package: 'discrim'
## The following object is masked from 'package:dials':
## 
##     smoothness

Obtaining the Data

For this project, I used the SMS Spam Collection Data Set from the UCI Machine Learning Repository. The first step was to download the file from UCI’s website, save it as a CSV, upload it to GitHub, and then read it into R from the raw GitHub URL.

I’m also going to rename the columns. Because the file was saved without a header row, read.csv treats the first message as column names, so the auto-generated names need to be replaced.

# The raw CSV has no header row, so read.csv() consumes the first
# message as column names (that record drops out of the data).
spam <- read.csv("https://raw.githubusercontent.com/carlisleferguson/DATA607/main/SMSSpamCollection.csv")

# Replace the mangled auto-generated names with sensible ones.
spam <- spam %>% rename(
  type = ham,
  text = Go.until.jurong.point..crazy...Available.only.in.bugis.n.great.world.la.e.buffet....Cine.there.got.amore.wat...
)
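
A hedged alternative (not used below, so the row counts in the later output stay as shown): since the raw file appears to lack a header row, read.csv can be told so directly, which keeps the first message (“Go until jurong point…”) as data instead of consuming it as column names.

spam_alt <- read.csv(
  "https://raw.githubusercontent.com/carlisleferguson/DATA607/main/SMSSpamCollection.csv",
  header = FALSE,                # the raw file has no header row (assumption)
  col.names = c("type", "text")  # name the columns directly
)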

head(spam) %>% kbl(caption = "Head of the SMS Data Set") %>%
  kable_styling(bootstrap_options = "striped")
Head of the SMS Data Set

type  text
ham   Ok lar… Joking wif u oni…
spam  Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C’s apply 08452810075over18’s
ham   U dun say so early hor… U c already then say…
ham   Nah I don’t think he goes to usf, he lives around here though
spam  FreeMsg Hey there darling it’s been 3 week’s now and no word back! I’d like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv
ham   Even my brother is not like to speak with me. They treat me like aids patent.

Preparing the Data

The next step is to partition the data into training and testing sets, using the initial_split, training, and testing functions from the rsample package. A random seed is set before splitting so that the same partition is produced each time the program is run, and the split is stratified on type so that both sets contain a similar proportion of spam.

spam <- as_tibble(spam)

set.seed(12345)

spamham_split <- initial_split(spam, strata = type, prop = .8)
spamham_split
## <Analysis/Assess/Total>
## <4459/1114/5573>
training <- training(spamham_split)
testing <- testing(spamham_split)
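
As a quick sanity check (a sketch using dplyr, which the tidyverse already loads), stratifying on type should keep the ham/spam proportions roughly equal across the two partitions:

training %>% count(type) %>% mutate(prop = n / sum(n))  # class balance in training
testing %>% count(type) %>% mutate(prop = n / sum(n))   # class balance in testing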

The next step in preparing the data is to build a preprocessing recipe with textrecipes: tokenize each message, remove stop words, keep only the 50 most frequent tokens, and weight those tokens by tf-idf.

spamham_recipe <- recipe(type ~ text, data = training)

spamham_recipe <- spamham_recipe %>%
  step_tokenize(text) %>%                      # split messages into word tokens
  step_stopwords(text) %>%                     # drop common stop words
  step_tokenfilter(text, max_tokens = 50) %>%  # keep the 50 most frequent tokens
  step_tfidf(text)                             # convert tokens to tf-idf features
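
To see what the model will actually receive, the recipe can be estimated and applied to the training set. This is a hedged sketch using prep() and bake() from recipes:

spamham_recipe %>%
  prep(training = training) %>%  # estimate the recipe (token filter, idf weights)
  bake(new_data = NULL) %>%      # return the processed training set
  glimpse()                      # 50 tfidf_text_* columns plus type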

Preparing the Model

For my model, I chose Naive Bayes, a simple classification method that represents a document as a bag of words: it assumes that the order of the words doesn’t matter and that, given the class, the words occur independently of one another.

The equation for Naive Bayes as it applies to document classification is as follows: \[ P(c|d) = \frac {P(d|c) P(c)}{P(d)} \] In this equation, c represents a class (here, ham or spam) and d represents a document. Under the independence assumption, the likelihood factors into per-word terms, \( P(d|c) = \prod_i P(w_i|c) \), which is what makes the method so cheap to fit.
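
To make the bag-of-words idea concrete, here is a toy sketch with invented probabilities (not estimated from the SMS data): it scores a two-word document against both classes by multiplying each class prior by the per-word likelihoods, then normalizes.

# Toy illustration with invented numbers: under the naive
# independence assumption, P(d|c) factors into per-word terms.
p_class <- c(ham = 0.87, spam = 0.13)    # priors P(c)
p_word <- list(
  ham  = c(free = 0.01, win = 0.02),     # P(word | ham)
  spam = c(free = 0.20, win = 0.15)      # P(word | spam)
)
doc <- c("free", "win")
# Unnormalized posterior: P(c|d) is proportional to P(c) * prod(P(w|c))
scores <- sapply(names(p_class), function(cl) p_class[[cl]] * prod(p_word[[cl]][doc]))
scores / sum(scores)  # "spam" dominates for this document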

nb <- naive_Bayes() %>%
  set_mode("classification") %>%
  set_engine("naivebayes"); nb
## Naive Bayes Model Specification (classification)
## 
## Computational engine: naivebayes
work <- workflow() %>%
  add_recipe(spamham_recipe) %>%
  add_model(nb); work
## == Workflow ====================================================================
## Preprocessor: Recipe
## Model: naive_Bayes()
## 
## -- Preprocessor ----------------------------------------------------------------
## 4 Recipe Steps
## 
## * step_tokenize()
## * step_stopwords()
## * step_tokenfilter()
## * step_tfidf()
## 
## -- Model -----------------------------------------------------------------------
## Naive Bayes Model Specification (classification)
## 
## Computational engine: naivebayes

As a note, both the stopwords and naivebayes packages need to be installed for this code to run: step_stopwords() relies on the former and the "naivebayes" engine on the latter.
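
If needed, a standard one-time CRAN install covers both:

install.packages(c("stopwords", "naivebayes"))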

trained_model <- work %>%
  fit(data = training); trained_model
## == Workflow [trained] ==========================================================
## Preprocessor: Recipe
## Model: naive_Bayes()
## 
## -- Preprocessor ----------------------------------------------------------------
## 4 Recipe Steps
## 
## * step_tokenize()
## * step_stopwords()
## * step_tokenfilter()
## * step_tfidf()
## 
## -- Model -----------------------------------------------------------------------
## 
## ================================== Naive Bayes ================================== 
##  
##  Call: 
## naive_bayes.default(x = maybe_data_frame(x), y = y, usekernel = TRUE)
## 
## --------------------------------------------------------------------------------- 
##  
## Laplace smoothing: 0
## 
## --------------------------------------------------------------------------------- 
##  
##  A priori probabilities: 
## 
##       ham      spam 
## 0.8658892 0.1341108 
## 
## --------------------------------------------------------------------------------- 
##  
##  Tables: 
## 
## --------------------------------------------------------------------------------- 
##  ::: tfidf_text_2::ham (KDE)
## --------------------------------------------------------------------------------- 
## 
## Call:
##  density.default(x = x, na.rm = TRUE)
## 
## Data: x (3861 obs.); Bandwidth 'bw' = 0.04344
## 
##        x                 y           
##  Min.   :-0.1303   Min.   :0.000000  
##  1st Qu.: 0.5926   1st Qu.:0.000554  
##  Median : 1.3155   Median :0.008788  
##  Mean   : 1.3155   Mean   :0.345125  
##  3rd Qu.: 2.0384   3rd Qu.:0.058598  
##  Max.   : 2.7613   Max.   :8.649491  
## 
## --------------------------------------------------------------------------------- 
##  ::: tfidf_text_2::spam (KDE)
## --------------------------------------------------------------------------------- 
## 
## Call:
##  density.default(x = x, na.rm = TRUE)
## 
## Data: x (598 obs.);  Bandwidth 'bw' = 0.07751
## 
##        x                 y           
##  Min.   :-0.2325   Min.   :0.000000  
##  1st Qu.: 0.5415   1st Qu.:0.002786  
## 
## ...
## and 147 more lines.

Evaluating the Model

The last step is to evaluate the model and determine how well it predicts spam vs. ham SMS messages. The model can be evaluated with a confusion matrix, which cross-tabulates predicted against actual classes and exposes the false positives and false negatives.

eval <- predict(trained_model, new_data = testing)
table(eval$.pred_class, testing$type)  # rows = predicted class, columns = actual class
##       
##        ham spam
##   ham  921   54
##   spam  44   95
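
Reading the matrix with predictions in rows and actual classes in columns: the model labels 921 + 95 = 1016 of the 1114 test messages correctly (about 91% accuracy), but catches only 95 of the 149 true spam messages (roughly 64% spam recall). As a hedged sketch, the same numbers can be computed with yardstick, which tidymodels loads; it assumes testing$type is a character column whose labels match the prediction levels.

results <- tibble(
  truth = factor(testing$type, levels = levels(eval$.pred_class)),
  estimate = eval$.pred_class
)
accuracy(results, truth = truth, estimate = estimate)  # overall accuracy, ~0.91
sens(results, truth = truth, estimate = estimate,
     event_level = "second")                           # spam recall, ~0.64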

Conclusions

The Naive Bayes model struggled to classify the SMS messages accurately. I did spend some time varying the maximum token count (max_tokens, n) and found that reducing the number of tokens increased the model’s accuracy. When n > 400, the model fails to identify any of the spam messages. I found that n = 50 gave the greatest accuracy, although the model still struggles with spam, missing 54 of the 149 spam messages in the test set.
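
Rather than editing max_tokens by hand, the search could be automated with cross-validation. This is a hedged sketch using tune and dials (both loaded with tidymodels); the grid range and fold count are illustrative choices, not what I ran above.

tune_recipe <- recipe(type ~ text, data = training) %>%
  step_tokenize(text) %>%
  step_stopwords(text) %>%
  step_tokenfilter(text, max_tokens = tune()) %>%  # mark max_tokens for tuning
  step_tfidf(text)

tune_wf <- workflow() %>%
  add_recipe(tune_recipe) %>%
  add_model(nb)

set.seed(12345)
folds <- vfold_cv(training, strata = type, v = 5)           # 5-fold CV on training
token_grid <- grid_regular(max_tokens(range = c(50, 400)),  # candidate vocabulary sizes
                           levels = 5)

tuned <- tune_grid(tune_wf, resamples = folds, grid = token_grid)
show_best(tuned, metric = "accuracy")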