The objective of Project 4 is to create a spam filter using document classification.
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3 v purrr 0.3.4
## v tibble 3.1.0 v dplyr 1.0.5
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(kableExtra)
##
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
##
## group_rows
library(tidymodels)
## -- Attaching packages -------------------------------------- tidymodels 0.1.3 --
## v broom 0.7.6 v rsample 0.0.9
## v dials 0.0.9 v tune 0.1.5
## v infer 0.5.4 v workflows 0.2.2
## v modeldata 0.1.0 v workflowsets 0.0.2
## v parsnip 0.1.5 v yardstick 0.0.8
## v recipes 0.1.16
## -- Conflicts ----------------------------------------- tidymodels_conflicts() --
## x scales::discard() masks purrr::discard()
## x dplyr::filter() masks stats::filter()
## x recipes::fixed() masks stringr::fixed()
## x kableExtra::group_rows() masks dplyr::group_rows()
## x dplyr::lag() masks stats::lag()
## x yardstick::spec() masks readr::spec()
## x recipes::step() masks stats::step()
## * Use tidymodels_prefer() to resolve common conflicts.
library(tidytext)
library(textrecipes)
library(discrim)
##
## Attaching package: 'discrim'
## The following object is masked from 'package:dials':
##
## smoothness
For the project, I decided to use the SMS Spam Collection Data Set from UCI's Machine Learning Repository. The first step was to download the file from UCI's website, save it as a CSV, upload it to GitHub, and then read it into R from the raw GitHub URL.
I'm also going to rename the columns.
# Read the data from the raw GitHub copy of the file
spam <- read.csv("https://raw.githubusercontent.com/carlisleferguson/DATA607/main/SMSSpamCollection.csv")
# The file has no header row, so read.csv() treated the first message as the
# column names; rename the columns to something sensible
spam <- spam %>% rename(
  type = ham,
  text = Go.until.jurong.point..crazy...Available.only.in.bugis.n.great.world.la.e.buffet....Cine.there.got.amore.wat...
)
head(spam) %>% kbl(caption = "Head of the SMS Data Set") %>%
kable_styling(bootstrap_options = "striped")
| type | text |
|---|---|
| ham | Ok lar… Joking wif u oni… |
| spam | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C’s apply 08452810075over18’s |
| ham | U dun say so early hor… U c already then say… |
| ham | Nah I don’t think he goes to usf, he lives around here though |
| spam | FreeMsg Hey there darling it’s been 3 week’s now and no word back! I’d like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv |
| ham | Even my brother is not like to speak with me. They treat me like aids patent. |
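As an aside, the rename is needed because read.csv() consumed the first SMS message as a header row. An alternative (a sketch, assuming the converted CSV has exactly two columns and no header row) would keep that first message as data:

# Sketch: read with header = FALSE so the first message stays in the data
spam_alt <- read.csv("https://raw.githubusercontent.com/carlisleferguson/DATA607/main/SMSSpamCollection.csv",
                     header = FALSE, col.names = c("type", "text"))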
The next step is to partition the data into training and testing sets. This is accomplished with the initial_split(), training(), and testing() functions from the rsample package. A random seed is set before splitting the data so that the same split is produced each time the program is run.
spam <- as_tibble(spam)
set.seed(12345)
spamham_split <- initial_split(spam, strata = type, prop = .8)
spamham_split
## <Analysis/Assess/Total>
## <4459/1114/5573>
# Use names that don't shadow rsample's training()/testing() functions
train_data <- training(spamham_split)
test_data <- testing(spamham_split)
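As a quick sanity check, the strata argument should leave the ham/spam proportions roughly equal in the two partitions; a short sketch to verify:

# Sketch: compare class proportions across the two partitions
train_data %>% count(type) %>% mutate(prop = n / sum(n))
test_data %>% count(type) %>% mutate(prop = n / sum(n))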
The next step in preparing the data is to build a preprocessing recipe with textrecipes: tokenize the text, remove stop words, keep only the 50 most frequent tokens, and weight the remaining tokens by tf-idf.
spamham_recipe <- recipe(type ~ text, data = train_data)
spamham_recipe <- spamham_recipe %>%
  step_tokenize(text) %>%                      # split each message into word tokens
  step_stopwords(text) %>%                     # drop common English stop words
  step_tokenfilter(text, max_tokens = 50) %>%  # keep the 50 most frequent tokens
  step_tfidf(text)                             # weight tokens by tf-idf
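To see what the recipe actually produces, it can be prepped and baked on the training data; a quick inspection sketch using the standard recipes functions:

# Sketch: materialize the recipe to inspect the resulting tf-idf columns
spamham_recipe %>%
  prep(training = train_data) %>%
  bake(new_data = NULL) %>%
  glimpse()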
For my model, I chose to use Naive Bayes. Naive Bayes is a simple classification method that represents a given document as a bag of words and works under the assumption that the order of the words in the bag doesn't matter.
The equation for Naive Bayes as it applies to document classification is as follows: \[
P(c \mid d) = \frac{P(d \mid c)\,P(c)}{P(d)}
\] In this equation, c represents a class (here, ham or spam) and d represents a document.
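Because P(d) is the same for every class, the classifier simply picks the class with the largest numerator, and the "naive" conditional-independence assumption lets the document likelihood factor into per-word probabilities over the document's words: \[
\hat{c} = \underset{c}{\operatorname{arg\,max}} \; P(c) \prod_{i=1}^{n} P(w_i \mid c)
\]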
nb <- naive_Bayes() %>%
set_mode("classification") %>%
set_engine("naivebayes"); nb
## Naive Bayes Model Specification (classification)
##
## Computational engine: naivebayes
work <- workflow() %>%
add_recipe(spamham_recipe) %>%
add_model(nb); work
## == Workflow ====================================================================
## Preprocessor: Recipe
## Model: naive_Bayes()
##
## -- Preprocessor ----------------------------------------------------------------
## 4 Recipe Steps
##
## * step_tokenize()
## * step_stopwords()
## * step_tokenfilter()
## * step_tfidf()
##
## -- Model -----------------------------------------------------------------------
## Naive Bayes Model Specification (classification)
##
## Computational engine: naivebayes
As a note, both the stopwords and naivebayes packages need to be installed in order for this code to run.
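If either is missing, a one-time install takes care of it:

# One-time setup: stopwords supplies the stop-word lists, naivebayes the engine
install.packages(c("stopwords", "naivebayes"))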
trained_model <- work %>%
  fit(data = train_data); trained_model
## == Workflow [trained] ==========================================================
## Preprocessor: Recipe
## Model: naive_Bayes()
##
## -- Preprocessor ----------------------------------------------------------------
## 4 Recipe Steps
##
## * step_tokenize()
## * step_stopwords()
## * step_tokenfilter()
## * step_tfidf()
##
## -- Model -----------------------------------------------------------------------
##
## ================================== Naive Bayes ==================================
##
## Call:
## naive_bayes.default(x = maybe_data_frame(x), y = y, usekernel = TRUE)
##
## ---------------------------------------------------------------------------------
##
## Laplace smoothing: 0
##
## ---------------------------------------------------------------------------------
##
## A priori probabilities:
##
## ham spam
## 0.8658892 0.1341108
##
## ---------------------------------------------------------------------------------
##
## Tables:
##
## ---------------------------------------------------------------------------------
## ::: tfidf_text_2::ham (KDE)
## ---------------------------------------------------------------------------------
##
## Call:
## density.default(x = x, na.rm = TRUE)
##
## Data: x (3861 obs.); Bandwidth 'bw' = 0.04344
##
## x y
## Min. :-0.1303 Min. :0.000000
## 1st Qu.: 0.5926 1st Qu.:0.000554
## Median : 1.3155 Median :0.008788
## Mean : 1.3155 Mean :0.345125
## 3rd Qu.: 2.0384 3rd Qu.:0.058598
## Max. : 2.7613 Max. :8.649491
##
## ---------------------------------------------------------------------------------
## ::: tfidf_text_2::spam (KDE)
## ---------------------------------------------------------------------------------
##
## Call:
## density.default(x = x, na.rm = TRUE)
##
## Data: x (598 obs.); Bandwidth 'bw' = 0.07751
##
## x y
## Min. :-0.2325 Min. :0.000000
## 1st Qu.: 0.5415 1st Qu.:0.002786
##
## ...
## and 147 more lines.
The last step is to evaluate the model and determine how well it can predict spam vs. ham SMS messages. The model can be evaluated using a confusion matrix, which tabulates its correct classifications alongside its false positives and false negatives.
eval <- predict(trained_model, new_data = test_data)
table(eval$.pred_class, test_data$type)
##
## ham spam
## ham 921 54
## spam 44 95
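To quantify this, yardstick (attached with tidymodels) can compute summary statistics from the same predictions; a sketch, assuming type is first converted to a factor so it matches the predicted class levels:

# Sketch: overall accuracy plus per-class statistics via yardstick
results <- test_data %>%
  mutate(type = factor(type, levels = c("ham", "spam"))) %>%
  bind_cols(eval)
accuracy(results, truth = type, estimate = .pred_class)
conf_mat(results, truth = type, estimate = .pred_class) %>% summary()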
The Naive Bayes model struggled to accurately classify the SMS messages: of the 149 spam messages in the test set, it correctly flagged only 95, while misclassifying 44 ham messages as spam (overall accuracy is roughly 91%, but much of that comes from the ham majority class). I did spend a bit of time changing max_tokens in step_tokenfilter(), and found that reducing the number of tokens (n) increased the accuracy of the model. When n > 400, the model fails to properly identify any of the spam SMS messages. I found that n = 50 gave the greatest accuracy, although the model still struggles to identify the spam messages.
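Rather than re-running the pipeline by hand for each token count, the search over max_tokens could be automated with tune (also attached with tidymodels). A sketch along those lines, with the grid values chosen purely for illustration:

# Sketch: cross-validated tuning of max_tokens
tune_recipe <- recipe(type ~ text, data = train_data) %>%
  step_tokenize(text) %>%
  step_stopwords(text) %>%
  step_tokenfilter(text, max_tokens = tune()) %>%
  step_tfidf(text)

folds <- vfold_cv(train_data, strata = type, v = 5)

tune_results <- workflow() %>%
  add_recipe(tune_recipe) %>%
  add_model(nb) %>%
  tune_grid(resamples = folds, grid = tibble(max_tokens = c(50, 100, 200, 400)))

show_best(tune_results, metric = "accuracy")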