Given a data set of emails, we want to build a program that can decide which are spam and which are not. A Naive Bayes model is used to solve this problem. The predictive accuracy of the model is calculated below, and further discussion can be found in the Analysis section.
The data is read from the given .csv file, and a glimpse of the data is shown.
library(dplyr)   # provides glimpse()

email.df <- read.csv("/Users/hunteregeland/Desktop/classes/vlis/data sets/completeSpamAssassin3.csv")
glimpse(email.df)
## Rows: 6,046
## Columns: 3
## $ Index <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18…
## $ Body <chr> "Save up to 70% on Life Insurance.\nWhy Spend More Than You Have…
## $ Label <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
Although this step is somewhat redundant, since the Naive Bayes function handles missing values on its own, the predictive accuracy increases if the rows with missing (NA) values are removed before the model is fit.
email.df <- na.omit(email.df)
Here, the data is split into a training set and a testing set to be used with the Naive Bayes function.
set.seed(2)
# Randomly select 75% of the row indices for the training set; the remaining 25% form the test set
sample_set <- sample(seq_len(nrow(email.df)), nrow(email.df) * 0.75)
email_train <- email.df[sample_set, ]
email_test <- email.df[-sample_set, ]
A Naive Bayes model is fit here on the training data; it will be used to predict which emails are spam.
library(e1071)
# The Index column is excluded from the predictors; laplace = 1 applies Laplace smoothing
email_mod <- naiveBayes(Label ~ . - Index, data = email_train, laplace = 1)
Here, the trained model is used to make predictions on the test set: first the raw posterior probabilities are shown, then the predicted classes.
email_pred <- predict(email_mod, email_test, type = "raw")
head(email_pred)
## 0 1
## [1,] 0.6874724 0.3125276
## [2,] 0.6874724 0.3125276
## [3,] 0.6874724 0.3125276
## [4,] 0.6874724 0.3125276
## [5,] 0.3333333 0.6666667
## [6,] 0.6874724 0.3125276
email_pred <- predict(email_mod, email_test, type = "class")
head(email_pred)
## [1] 0 0 0 0 1 0
## Levels: 0 1
A confusion matrix is created to show how many emails are ham and spam. A “0” denotes an email that is not spam, and a “1” denotes an email that is spam. The confusion matrix shows which emails were predicted to be spam and were spam, which were predicted to be spam but were not, which were predicted not to be spam but were spam, and which were predicted not to be spam and were not spam.
email_pred_table <- table(email_test$Label, email_pred)
email_pred_table
## email_pred
## 0 1
## 0 975 58
## 1 314 165
Here, the predictive accuracy is calculated, showing how accurate the Naive Bayes model truly is. The output is the accuracy of the model as a percentage.
pred_acc <- sum(diag(email_pred_table)) / nrow(email_test)
pred_acc * 100
## [1] 75.39683
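As a quick check, the same accuracy can be recomputed by hand from the confusion matrix above: the diagonal entries are the correct predictions, and their sum is divided by the total number of test emails.
# Accuracy recomputed from the printed confusion matrix counts
(975 + 165) / (975 + 58 + 314 + 165)   # = 0.7539683, matching pred_acc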
General Analysis
Overall, the Naive Bayes model had about 75% accuracy. This was slightly higher than the accuracy of the model when the missing (NA) values were not omitted from the data set (about 74%). The confusion matrix allows us to see the false positives and false negatives, as well as the correct predictions made by the model.
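As a sketch (these values were not computed in the original run), the false positives and false negatives mentioned above can be read directly off the confusion matrix by indexing it with the true label (rows) and the predicted label (columns).
# Sketch: error counts read off the confusion matrix
# rows = true labels, columns = predicted labels
false_positives <- email_pred_table["0", "1"]   # ham predicted as spam (58)
false_negatives <- email_pred_table["1", "0"]   # spam predicted as ham (314)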
Possible Improvements
There are some possible improvements that could be made to raise the predictive accuracy of the model. For example, if all text in the data set is converted to lowercase, the same word will not be counted multiple times in different forms. If “sale” and “SALE” are treated as different words, the predictive accuracy will be worse than if the uppercase version is converted to lowercase, so that the model recognizes one word instead of two. This was not implemented in this program because it is computationally expensive and does not increase the predictive accuracy enough to warrant the cost. A minimal sketch of this preprocessing step is shown below.
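As a rough sketch of the idea (not run as part of this analysis), the lowercasing could be applied to the Body column with tolower() before the data is split and the model is trained.
# Sketch only: convert the email text to lowercase before splitting and training,
# so that "sale" and "SALE" are counted as the same word
email.df$Body <- tolower(email.df$Body)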