The goal of this project is to correctly classify documents as spam or ‘ham’ (legitimate messages). To do this, we employ the Naive Bayes algorithm for predictive classification. Whenever presented with such a task, it is good practice to separate the working data set into training and testing partitions.
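With rsample, the split takes just a few lines. Below is a minimal sketch, where spam_data and its class column label are placeholder names for our actual objects:

library(rsample)

set.seed(123)  # make the partition reproducible
data_split <- initial_split(spam_data, prop = 0.8, strata = label)
train <- training(data_split)  # used to fit the model
test  <- testing(data_split)   # held out for evaluation

Stratifying on the class label keeps the ham/spam proportions roughly equal across the two partitions.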
Based on our training/testing splits, we have an uneven proportion of ham/spam instances. This is an issue for classification problems because a model trained on imbalanced data tends to perform poorly when exposed to other datasets in which the classes are more evenly represented. One solution is the Synthetic Minority Oversampling Technique (SMOTE), which creates a new dataset with a more even distribution between the target class values. This will not be done here, but might be done in a future project.
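For reference, a sketch of what that future SMOTE step could look like with the smotefamily package; train_features (numeric predictor columns only) and train_labels are assumed names:

library(smotefamily)

balanced <- SMOTE(X = train_features, target = train_labels, K = 5)
train_balanced <- balanced$data  # original plus synthetic rows; labels end up in a `class` column
table(train_balanced$class)      # check the new, more even distribution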
Loading Packages
install.packages('rsample')
Installing rsample [1.1.1] ...
OK [linked cache]
install.packages('yardstick')
Installing yardstick [1.2.0] ...
OK [linked cache]
library(rsample)
Warning: package 'rsample' was built under R version 4.2.3
install.packages('tidyverse')
Installing tidyverse [2.0.0] ...
OK [linked cache]
library(tidyverse)
Warning: package 'ggplot2' was built under R version 4.2.3
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.2 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.0
✔ ggplot2 3.4.2 ✔ tibble 3.2.1
✔ lubridate 1.9.2 ✔ tidyr 1.3.0
✔ purrr 1.0.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
install.packages("smotefamily")
Installing smotefamily [1.3.1] ...
OK [linked cache]
library(smotefamily)
Warning: package 'smotefamily' was built under R version 4.2.3
Next, we need to clean all the text associated with each instance of ham/spam. This cleaning function tidies each line of text so that the class becomes easier to recognize.
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `msg_list = string_cleaner(.$msg_list)`.
Caused by warning in `stri_replace_all_regex()`:
! argument is not an atomic vector; coercing
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `msg_list = string_cleaner(.$msg_list)`.
Caused by warning in `stri_replace_all_regex()`:
! argument is not an atomic vector; coercing
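The cleaning function itself is not reproduced here, but the warnings show it runs inside mutate() and ultimately calls stringi's stri_replace_all_regex(); the warning appearing twice suggests the cleaner was applied to the training and testing partitions separately. A minimal sketch of such a cleaner, using stringr (attached above via the tidyverse) — the exact cleaning rules are assumptions:

string_cleaner <- function(text) {
  text %>%
    str_to_lower() %>%                     # normalize case
    str_replace_all("[^a-z\\s]", " ") %>%  # drop punctuation and digits
    str_squish()                           # collapse runs of whitespace
}

train <- train %>% mutate(msg_list = string_cleaner(msg_list))
test  <- test  %>% mutate(msg_list = string_cleaner(msg_list))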
If we look at the word lists for the two groups, there is a lot of overlap between the two classes, so classification accuracy could suffer.
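The model fitting and scoring code is also not shown in this write-up. A sketch of those steps, assuming the cleaned text has been converted into per-document feature data frames train_df and test_df (placeholder names) with a factor column label, and using e1071's naiveBayes() as the classifier:

library(e1071)
library(yardstick)

nb_fit <- naiveBayes(label ~ ., data = train_df)  # class-conditional feature probabilities

preds <- test_df %>%
  mutate(.pred_class = predict(nb_fit, newdata = test_df, type = "class"))

preds %>% metrics(truth = label, estimate = .pred_class)  # accuracy and kappa, as tabled below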
# A tibble: 2 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 accuracy multiclass 0.710
2 kap multiclass 0
Discussion/Results
Our results show that our simple Naive Bayes classifier did not outperform the simpler model of labeling everything as ‘ham’; the kappa of 0 in the table above confirms the model agrees with the true labels no better than chance. This is largely because the majority of our data set consists of ‘ham’ instances, so the outcome might differ with a more balanced data set. The sub-par accuracy could also mean the text needed more cleaning, though what exactly that would involve is unclear, given the difficulty of recognizing which of the two classes an instance belongs to. Naive Bayes is empirically a good algorithm for spam classification, so our poor accuracy is a sign that the data set needs more cleaning.1
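As a quick sanity check on that baseline, the accuracy of an all-‘ham’ classifier is simply the proportion of ‘ham’ instances in the test partition (object and column names assumed as in the sketches above):

mean(test_df$label == "ham")  # majority-class baseline; ~0.710 given the table above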