Introduction:

Using the link below, I started with a collection of already-labeled Spam and Ham emails in order to build a classifier that predicts whether a new email is spam or ham. The approach is to obtain the data, clean it, and split it into training and testing datasets to see how well a model can predict which type of email each one is. The purpose of this assignment is to train a predictive model on the training set and evaluate it on the held-out testing set.

https://spamassassin.apache.org/old/publiccorpus/

Libraries Needed:

The following packages are needed to clean the data and use a Naive Bayes Classifier for the predictions.

library(purrr)
library(tidyverse)
library(readtext)
library(tm)
library(e1071)
library(caret)

Obtaining Data:

First, I downloaded one archive of each type, so one Spam archive and one Ham archive. The archives then need to be extracted (I used 7-Zip) so that each email becomes a separate file in a folder. The read_emails function below reads a single email file and returns a one-row tibble with the variables file, text, class, and spam; mapping it over a folder gives one row per email.

ham <- 'C:/Users/puddi/Documents/easy_ham'
spam <- 'C:/Users/puddi/Documents/spam_2'
length(list.files(path = ham))
## [1] 2551
length(list.files(path = spam))
## [1] 1397
spam_files <- list.files(path = spam, full.names = TRUE)
ham_files <- list.files(path = ham, full.names = TRUE)

read_emails <- function(file_path, class) {
  lines <- readLines(file_path)
  text <- paste(lines, collapse = " ")
  tibble(file = file_path, text = text, class = class, spam = as.numeric(class == "spam"))
}

spam_df <- map_dfr(spam_files, ~read_emails(.x, "spam"))
ham_df <- map_dfr(ham_files, ~read_emails(.x, "ham"))

emails <- bind_rows(spam_df, ham_df)



# This will read and create a dataframe for the spam files
spam_1 <- readtext(paste0(spam, "/*")) %>%
  mutate(class = "spam", spam = 1)

# This will read and create a dataframe for the ham files
ham_2 <- readtext(paste0(ham, "/*")) %>%
  mutate(class = "ham", spam = 0)

Data Cleaning:

The code below combines the Spam and Ham data frames into one and then cleans the text column by removing line breaks, tabs, punctuation, and the overall messiness of the raw emails.

#Combining the spam and ham data frames into 1 data frame
spamham <- bind_rows(ham_2, spam_1)

# This will clean the text column by removing line breaks and tabs from the different files
spamham$text <- spamham$text %>%
  str_replace_all("[\r\n\t]+", "")

#Next step is to remove punctuation from the text columns
spamham$text <- spamham$text %>%
  str_replace_all("[[:punct:]]", " ")

replacePunctuation <- content_transformer(function(x) {return (gsub("[[:punct:]]", " ", x))})
#This will create a corpus object from the data frame, which then helps clean the data by lowercasing the text and removing common English stop words, punctuation, numbers, and extra whitespace from the emails.
corpus <- Corpus(VectorSource(spamham$text)) %>%
  tm_map(content_transformer(tolower)) %>%
  tm_map(removeWords, stopwords("english")) %>%
  tm_map(replacePunctuation) %>%
  tm_map(removeNumbers) %>%
  tm_map(stripWhitespace)


#Code below creates a Document-term Matrix from the corpus object which will help represent the frequency of a term for the different documents. 
dtm <- DocumentTermMatrix(corpus)
#To remove sparse terms from the DTM: with this threshold, only terms that appear in more than roughly 10 of the documents are kept
dtm <- removeSparseTerms(dtm, 1-(10/length(corpus)))

#To Convert back into a Data Frame
email_dtm <- dtm %>%
  as.matrix() %>%
  as.data.frame() %>%
  sapply(., as.numeric) %>%
  as.data.frame() %>%
  mutate(class = spamham$class) %>%
  select(class, everything())
#Converting to a Factor to help with ML 
email_dtm$class <- as.factor(email_dtm$class)
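
To make the document-term matrix step concrete, here is a minimal base R sketch on three made-up documents. It does not use tm; it only illustrates the idea of counting terms per document and then dropping terms that appear in too few documents. The documents, the variable names, and the threshold of one document are assumptions chosen for illustration.

```r
# Three made-up, already-cleaned documents (hypothetical data)
docs <- c("free money now", "meeting notes now", "free free offer")

# Split each document into terms and collect the vocabulary
tokens <- strsplit(docs, " ", fixed = TRUE)
vocab  <- sort(unique(unlist(tokens)))

# Count how often each vocabulary term occurs in each document
counts <- vapply(tokens,
                 function(tok) as.integer(table(factor(tok, levels = vocab))),
                 integer(length(vocab)))

# Rows are documents, columns are terms
toy_dtm <- t(counts)
dimnames(toy_dtm) <- list(paste0("doc", seq_along(docs)), vocab)

# Document frequency: number of documents that contain each term
doc_freq <- colSums(toy_dtm > 0)

# Keep only terms that appear in more than one document
toy_dtm_kept <- toy_dtm[, doc_freq > 1, drop = FALSE]
print(toy_dtm_kept)
```

In the analysis above, removeSparseTerms() plays the role of this document-frequency filter.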

Training and Test Data Set:

#The training set will be using 75% of the total Rows
sample_size <- floor(0.75 * nrow(email_dtm))

set.seed(1667)
#Randomly sample the row indices of the emails that go into the training set
index <- sample(seq_len(nrow(email_dtm)), size = sample_size)

#Split the data frame into the Training and Testing sets
dtm_training <- email_dtm[index, ]
dtm_testing <-  email_dtm[-index, ]

#Training & Test Spam Count labels for Naive Bayes model
training_labels <- dtm_training$class
testing_labels <- dtm_testing$class

#This will determine the proportion between the spam and ham emails for training & test 
prop.table(table(training_labels))
## training_labels
##       ham      spam 
## 0.6460655 0.3539345
prop.table(table(testing_labels))
## testing_labels
##       ham      spam 
## 0.6464032 0.3535968
#This will convert the Occurrence Matrix to a binary matrix
dtm_training[ , 2:3816] <- ifelse(dtm_training[ , 2:3816] == 0, "No", "Yes")
dtm_testing[ , 2:3816] <- ifelse(dtm_testing[ , 2:3816] == 0, "No", "Yes")
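
The hard-coded column range above is specific to this run. As a minimal sketch of the same conversion on a toy data frame (hypothetical data and names), the term columns can be selected by name instead of by position. Storing the result as a factor with fixed levels is my own assumption here, not what the original code does.

```r
# Toy stand-in for email_dtm: a class column followed by term counts
toy <- data.frame(
  class   = factor(c("ham", "spam", "spam")),
  free    = c(0, 3, 1),
  meeting = c(2, 0, 0)
)

# Every column except the label is treated as a term column
term_cols <- setdiff(names(toy), "class")

# Recode counts as presence/absence: 0 -> "No", anything else -> "Yes"
toy[term_cols] <- lapply(toy[term_cols], function(x) {
  factor(ifelse(x == 0, "No", "Yes"), levels = c("No", "Yes"))
})

print(toy)
```

Fixing the factor levels keeps both values defined even when a term never appears in one of the splits.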

#Creating a predictive model; dtm_training is passed whole, so its first column (class) is included alongside the term columns
model_classifier <- naiveBayes(dtm_training, training_labels) 
#Applying model to the testing dataframe
test_prediction <- predict(model_classifier, dtm_testing)
#Performance of the Prediction Model.
confusionMatrix(test_prediction, testing_labels, positive = "spam", 
                dnn = c("Prediction","Actual"))
## Confusion Matrix and Statistics
## 
##           Actual
## Prediction ham spam
##       ham  628  177
##       spam  10  172
##                                           
##                Accuracy : 0.8105          
##                  95% CI : (0.7847, 0.8345)
##     No Information Rate : 0.6464          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5352          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.4928          
##             Specificity : 0.9843          
##          Pos Pred Value : 0.9451          
##          Neg Pred Value : 0.7801          
##              Prevalence : 0.3536          
##          Detection Rate : 0.1743          
##    Detection Prevalence : 0.1844          
##       Balanced Accuracy : 0.7386          
##                                           
##        'Positive' Class : spam            
## 
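
The statistics above follow directly from the four cell counts. As a check, here is a minimal base R sketch that recomputes them from the table; the variable names are illustrative and are not part of the original analysis.

```r
# Cell counts copied from the confusion matrix above ("spam" is the positive class)
tp <- 172  # predicted spam, actually spam
fp <- 10   # predicted spam, actually ham
fn <- 177  # predicted ham, actually spam
tn <- 628  # predicted ham, actually ham

accuracy    <- (tp + tn) / (tp + fp + fn + tn)
sensitivity <- tp / (tp + fn)  # share of actual spam that was caught
specificity <- tn / (tn + fp)  # share of actual ham that was left alone
ppv         <- tp / (tp + fp)  # share of predicted spam that really is spam
npv         <- tn / (tn + fn)  # share of predicted ham that really is ham

metrics <- round(c(accuracy = accuracy, sensitivity = sensitivity,
                   specificity = specificity, ppv = ppv, npv = npv), 4)
print(metrics)
```

The low sensitivity shows that the model misses about half of the spam, even though the emails it does flag are rarely ham.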

Conclusion:

Through the use of the Naive Bayes Model we were able to correctly classify 81.05% of the test emails as either Spam or Ham. The Positive Predictive Value was 94.51%, which is the proportion of emails predicted as spam that really were spam, and the Specificity was 98.43%, so very few ham emails were flagged as spam. The Sensitivity, however, was only 49.28%, meaning roughly half of the spam emails were missed. Overall it seems that the model could have been trained better with a larger data set, with the aim of raising the accuracy from 81.05% toward something like 95%. Other options for future improvement would be to try a Decision Tree, a Random Forest, or a Support Vector Machine.