Introduction

For this project, I will classify emails as ham or spam using the SpamAssassin ham and spam dataset. I will use the Naive Bayes algorithm to create a model. After creating the model, I will use it to predict labels for the test emails and check the accuracy of those predictions.

Data Loading

Before classifying the emails, I need to change the data into a usable form. This process involves reading in the emails, putting them into a data frame, creating the corpus, creating a document-term matrix, and using the data to train models for classification.

Load the required libraries.
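
The code chunk for this step is not shown in the report. A minimal sketch of what it might look like, assuming the packages are tm (for Corpus, tm_map, and DocumentTermMatrix, which appear later), e1071 (an assumed source of the Naive Bayes implementation), gmodels (assumed from the layout of the cross table in the Modeling section), and tibble (assumed from the printed data frame):

```r
# Packages this report appears to rely on; apart from tm, the list is an assumption.
pkgs <- c("tm", "e1071", "gmodels", "tibble")

# Find packages that are not installed, so the problem is reported up front.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]
if (length(missing_pkgs) > 0) {
  message("Install first: ", paste(missing_pkgs, collapse = ", "))
}

# Attach the packages that are available.
for (p in pkgs[installed]) {
  library(p, character.only = TRUE)
}
```

Checking with requireNamespace() before attaching keeps the script from stopping at the first missing package.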

Text processing and transformation

Download the archive files from https://spamassassin.apache.org/old/publiccorpus/, then extract them into the working directory. I selected the 20021010_spam.tar and 20021010_easy_ham.tar files for the classification.
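
The download and extraction were done by hand, but they could also be scripted. The sketch below uses base R only. The helper names corpus_url and fetch_corpus are hypothetical, and the ".tar.bz2" suffix on the archive names is an assumption about how the files are published:

```r
# Base URL is taken from the text; the ".tar.bz2" suffix is an assumption.
corpus_base <- "https://spamassassin.apache.org/old/publiccorpus/"

# Build the download URL for one archive (hypothetical helper).
corpus_url <- function(name) paste0(corpus_base, name, ".tar.bz2")

# Download one archive into dest_dir and extract it there (hypothetical helper).
fetch_corpus <- function(name, dest_dir = getwd()) {
  archive <- file.path(dest_dir, paste0(name, ".tar.bz2"))
  download.file(corpus_url(name), destfile = archive, mode = "wb")
  untar(archive, exdir = dest_dir)
  invisible(archive)
}

# Usage (needs network access, so it is left commented out):
# fetch_corpus("20021010_easy_ham")
# fetch_corpus("20021010_spam")
```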

Read the files from the easy_ham and spam folders. I created a user-defined function that reads each email in a folder.
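
The user-defined function itself is not shown. One way such a reader could look is sketched below. The name read_emails, the returned column names, and the latin1 encoding hint are all assumptions, and the demonstration uses two made-up emails in a temporary folder rather than the real corpus:

```r
# Read every file in a folder into one string per email (hypothetical helper).
read_emails <- function(dir_path) {
  files <- list.files(dir_path, full.names = TRUE)
  files <- files[!dir.exists(files)]  # skip sub-folders
  text <- vapply(files, function(f) {
    lines <- readLines(f, warn = FALSE, encoding = "latin1")
    paste(lines, collapse = "\n")
  }, character(1), USE.NAMES = FALSE)
  data.frame(file = basename(files), text = text, stringsAsFactors = FALSE)
}

# Small demonstration on a temporary folder with two fake emails.
demo_dir <- file.path(tempdir(), "demo_emails")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("Subject: hello", "", "See you at noon."), file.path(demo_dir, "0001.txt"))
writeLines(c("Subject: offer", "", "Win money now."), file.path(demo_dir, "0002.txt"))
demo_emails <- read_emails(demo_dir)
```

Calling the same function once on the easy_ham folder and once on the spam folder would give two data frames, one per class.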

Combine the ham and spam datasets into one data frame.

## # A tibble: 3,054 x 3
##    text                                                              type  DocID
##    <chr>                                                             <chr> <int>
##  1  <NA>                                                             ham       1
##  2 "From exmh-workers-admin@redhat.com  Thu Aug 22 12:36:23 2002\nR… ham       2
##  3 "From exmh-workers-admin@redhat.com  Thu Aug 22 12:36:23 2002\nR… ham       3
##  4 "From exmh-workers-admin@redhat.com  Thu Aug 22 12:36:23 2002\nR… ham       4
##  5 "From exmh-workers-admin@redhat.com  Thu Aug 22 12:36:23 2002\nR… ham       5
##  6 "From exmh-workers-admin@redhat.com  Thu Aug 22 12:36:23 2002\nR… ham       6
##  7 "From exmh-workers-admin@redhat.com  Thu Aug 22 12:36:23 2002\nR… ham       7
##  8 "From exmh-workers-admin@redhat.com  Thu Aug 22 12:36:23 2002\nR… ham       8
##  9 "From exmh-workers-admin@redhat.com  Thu Aug 22 12:36:23 2002\nR… ham       9
## 10 "From exmh-workers-admin@redhat.com  Thu Aug 22 12:36:23 2002\nR… ham      10
## # … with 3,044 more rows
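
A data frame with the text, type, and DocID columns printed above could be assembled along these lines. This is a sketch on toy data: ham_df and spam_df are hypothetical stand-ins for the two sets of emails read from disk, and only the column names and the name spam_ham_df come from the report:

```r
# Toy stand-ins for the ham and spam data frames read from disk.
ham_df <- data.frame(text = c("See you at noon.", "Minutes attached."),
                     stringsAsFactors = FALSE)
spam_df <- data.frame(text = "Win money now.", stringsAsFactors = FALSE)

# Label each set, then stack them and number the rows.
ham_df$type <- "ham"
spam_df$type <- "spam"
spam_ham_df <- rbind(ham_df, spam_df)
spam_ham_df$DocID <- seq_len(nrow(spam_ham_df))
```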

Train and Test data

I divided the data set into two parts, train and test. The train set contains 70% of the data and the test set contains the remaining 30%.

# number of rows to sample: 70% of the data for training, the remaining 30% for prediction
sample_df <- floor(0.70 * nrow(spam_ham_df))

# set the seed to make the result reproducible
set.seed(124)
train_ind <- sample(seq_len(nrow(spam_ham_df)), size = sample_df)

train_spam_ham <- spam_ham_df[train_ind, ]
test_spam_ham <- spam_ham_df[-train_ind, ]

# subset the spam and ham emails in the training data (nrow() of each gives the counts)
spam<-subset(train_spam_ham,train_spam_ham$type == "spam")
ham<-subset(train_spam_ham,train_spam_ham$type == "ham")

# Create corpus for training and test data
train_corpus <- Corpus(VectorSource(train_spam_ham$text))
test_corpus <- Corpus(VectorSource(test_spam_ham$text))

# Remove numbers
train_corpus <- tm_map(train_corpus ,removeNumbers)
test_corpus <- tm_map(test_corpus, removeNumbers)
# Remove punctuation
train_corpus <- tm_map(train_corpus, removePunctuation)
test_corpus <- tm_map(test_corpus, removePunctuation)
# Remove stop words
train_corpus <- tm_map(train_corpus, removeWords, stopwords())
test_corpus  <- tm_map(test_corpus, removeWords, stopwords())
# Remove extra white space
train_clean_corpus <- tm_map(train_corpus, stripWhitespace)
test_clean_corpus <- tm_map(test_corpus, stripWhitespace)
# Create document-term matrices from the cleaned train and test corpora
train_dtm <- DocumentTermMatrix(train_clean_corpus)
test_dtm <- DocumentTermMatrix(test_clean_corpus)

# convert term counts into a "No"/"Yes" indicator of whether the term is present
convert_count <- function(x) {
  y <- ifelse(x > 0, 1, 0)
  y <- factor(y, levels = c(0, 1), labels = c("No", "Yes"))
  y
}

train_sample <- apply(train_dtm, 2, convert_count)
test_sample <- apply(test_dtm, 2, convert_count)
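
To make the effect of the conversion step concrete, here is a small base R sketch on a toy count matrix. The matrix counts and its two terms are made up for illustration, and convert_count is repeated so that the block runs on its own:

```r
# Same conversion as convert_count(): any positive count becomes "Yes".
convert_count <- function(x) {
  y <- ifelse(x > 0, 1, 0)
  factor(y, levels = c(0, 1), labels = c("No", "Yes"))
}

# Toy term counts: two documents (rows) by two terms (columns).
counts <- matrix(
  c(0, 1,
    2, 0),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("doc1", "doc2"), c("free", "meeting"))
)

# apply() works column by column and returns a character matrix of "No"/"Yes".
binary <- apply(counts, 2, convert_count)
```

The classifier then sees only whether a term occurs in an email, not how often it occurs.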

Modeling

Create a model using the Naive Bayes algorithm and find its accuracy.
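
The modeling code is not shown in the report. The sketch below illustrates the general pattern on a toy data set, assuming the naiveBayes() function from the e1071 package; the report does not say which implementation was used. The helper yn, the toy features, and the choice of laplace = 1 are all illustrative assumptions:

```r
library(e1071)  # assumed package; the report does not name the implementation

# Toy presence/absence features with the same "No"/"Yes" coding as convert_count().
yn <- function(x) factor(x, levels = c("No", "Yes"))
train_x <- data.frame(
  free    = yn(c("Yes", "Yes", "Yes", "No", "No", "No")),
  meeting = yn(c("No", "No", "No", "Yes", "Yes", "Yes"))
)
train_y <- factor(c("spam", "spam", "spam", "ham", "ham", "ham"))
test_x <- data.frame(free = yn(c("Yes", "No")), meeting = yn(c("No", "Yes")))
test_y <- factor(c("spam", "ham"), levels = levels(train_y))

# Fit the classifier; laplace = 1 smooths zero counts.
nb_model <- naiveBayes(train_x, train_y, laplace = 1)

# Predict classes for the held-out rows and tabulate them against the truth.
test_pred_nb <- predict(nb_model, test_x)
confusion <- table(Predicted = test_pred_nb, Actual = test_y)
accuracy <- sum(diag(confusion)) / sum(confusion)
```

In the report's setting, train_sample and test_sample would take the place of the toy features, and the type column of the train and test data frames would supply the labels.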

##             
## test_pred_nb       ham      spam
##         ham  0.8222465 0.0000000
##         spam 0.0000000 0.1777535
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  917 
## 
##  
##              | Actual 
##    Predicted |       ham |      spam | Row Total | 
## -------------|-----------|-----------|-----------|
##          ham |       754 |         0 |       754 | 
##              |     1.000 |     0.000 |     0.822 | 
##              |     1.000 |     0.000 |           | 
## -------------|-----------|-----------|-----------|
##         spam |         0 |       163 |       163 | 
##              |     0.000 |     1.000 |     0.178 | 
##              |     0.000 |     1.000 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |       754 |       163 |       917 | 
##              |     0.822 |     0.178 |           | 
## -------------|-----------|-----------|-----------|
## 
## 

Summary

  • Analysis of word counts in the Spam and Ham emails revealed differences in the most commonly occurring words.
  • The Naive Bayes model classified all 917 test emails correctly (754 ham and 163 spam), for an accuracy of 100% on the test set.
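
The accuracy figure follows directly from the cross table in the Modeling section. As a worked check in base R, with the counts copied from that table (the variable names are illustrative):

```r
# Counts copied from the cross table in the Modeling section.
confusion <- matrix(
  c(754, 0,
    0, 163),
  nrow = 2, byrow = TRUE,
  dimnames = list(Predicted = c("ham", "spam"), Actual = c("ham", "spam"))
)

# Accuracy is the share of emails on the diagonal (correct predictions).
accuracy <- sum(diag(confusion)) / sum(confusion)

# Per-class recall: correct predictions divided by the actual class totals.
recall <- diag(confusion) / colSums(confusion)
```

Here accuracy is (754 + 163) / 917 = 1, and the recall is 1 for both ham and spam.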