It is often useful to classify new “test” documents using a set of already classified “training” documents. A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam.

For this project, you can start with a spam/ham dataset, then predict the class of new documents (either withheld from the training dataset or from another source such as your own spam folder). One example corpus: https://spamassassin.apache.org/old/publiccorpus/

Libraries

The following libraries will be used.

library(tm)
library(tidyverse)
library(tidytext)
library(naivebayes)
library(SnowballC)
library(wordcloud)
library(gmodels)
library(caret) 

Loading the dataset

The following datasets are downloaded and unzipped into a local folder:
https://spamassassin.apache.org/old/publiccorpus/20030228_easy_ham.tar.bz2
https://spamassassin.apache.org/old/publiccorpus/20050311_spam_2.tar.bz2

There are 2500 ham and 1396 spam messages.
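If you prefer to script this step rather than download and extract by hand, the archives can be fetched directly from R. This is a minimal sketch; the destination folder "Spamham" is just an example and can be any local path.

urls <- c("https://spamassassin.apache.org/old/publiccorpus/20030228_easy_ham.tar.bz2",
          "https://spamassassin.apache.org/old/publiccorpus/20050311_spam_2.tar.bz2")
dir.create("Spamham", showWarnings = FALSE)
for (u in urls) {
  dest <- file.path("Spamham", basename(u))
  download.file(u, destfile = dest, mode = "wb")  # binary mode, needed on Windows
  untar(dest, exdir = "Spamham")                  # untar() handles .tar.bz2 archives
}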

spam_dir = "C:\\Users\\tonyl\\OneDrive\\Desktop\\CUNY\\DATA 607 Data Acquisition & Management\\Project 4\\Spamham\\spam_2\\spam_2"
ham_dir = "C:\\Users\\tonyl\\OneDrive\\Desktop\\CUNY\\DATA 607 Data Acquisition & Management\\Project 4\\Spamham\\easy_ham\\easy_ham"

# The following function reads each file under the spam or ham folder.
# list.files() gets the names of the files in the folder; files named "cmds" are excluded.
# lapply() reads each file into a data frame with two columns: file and text.
# do.call() applies rbind() to the list of data frames,
# combining them into a single data frame.
# A class column is added to indicate whether the email is spam or ham.

read_files <- function(folder_path, class) {
  files <- list.files(path = folder_path, full.names = TRUE, recursive = TRUE)
  files <- files[!grepl("cmds", files)]
  df_list <- lapply(files, function(file) {
    # warn = FALSE suppresses "incomplete final line" warnings from raw email files
    data.frame(file = file, text = paste(readLines(file, warn = FALSE), collapse = "\n"))
  })
  combined_df <- do.call(rbind, df_list)
  combined_df$class <- class
  return(combined_df)
}
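Since the tidyverse is already loaded, an equivalent version can lean on purrr, which handles the row-binding internally. This is an alternative sketch, not a change to the pipeline; read_files_purrr is a hypothetical name.

# Equivalent sketch using purrr::map_dfr() instead of do.call()/rbind()
read_files_purrr <- function(folder_path, class) {
  files <- list.files(path = folder_path, full.names = TRUE, recursive = TRUE)
  files <- files[!grepl("cmds", files)]
  out <- purrr::map_dfr(files, function(file) {
    tibble(file = file, text = paste(readLines(file, warn = FALSE), collapse = "\n"))
  })
  out$class <- class
  out
}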


spam <- read_files(spam_dir, class="spam")
ham <- read_files(ham_dir, class="ham") 
df <- rbind(spam, ham)
table(df$class)
## 
##  ham spam 
## 2500 1396

Raw data cleaning

There are three columns in the data frame: file, text, and class. We clean the raw text before building the corpus.

# Remove the file column
# Remove HTML tags
# Remove digits
# Remove punctuation
# Remove newline characters ("\n")
df_clean <- df |>
              select(-file) |> 
              mutate(text = gsub("<.*?>", "", text)) |> 
              mutate(text = gsub("\\d+", "", text)) |>
              mutate(text = gsub("[[:punct:]]+", "", text)) |> 
              mutate(text = gsub("[\n]", "", text))
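To see what each substitution does, here is a toy string run through the same steps (illustrative only). Note how deleting the newline outright fuses "now" and "Only"; substituting a space instead, e.g. gsub("[\n]", " ", text), would avoid that.

example <- "<p>Act now!</p>\nOnly 99 dollars, visit us!"
example <- gsub("<.*?>", "", example)        # strip HTML tags
example <- gsub("\\d+", "", example)         # strip digits
example <- gsub("[[:punct:]]+", "", example) # strip punctuation
example <- gsub("[\n]", "", example)         # strip newlines
example
## [1] "Act nowOnly  dollars visit us"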

Building the corpus and further data processing

A corpus is created and further cleanup is performed.

df_v_corp <- Corpus(VectorSource(df_clean$text))


# Converting to lowercase
# Remove Numbers, Punctuation, stopwords, whitespace
# Applying stemming
df_v_corp <- df_v_corp |>
                      tm_map(content_transformer(tolower)) |>
                      tm_map(removeNumbers) |>                        
                      tm_map(removePunctuation) |>                    
                      tm_map(removeWords, stopwords("en")) |>         
                      tm_map(stripWhitespace) |>                    
                      tm_map(stemDocument)
# Visualizing the cleaned corpus

wordcloud(df_v_corp, max.words = 100, colors = brewer.pal(7, "Dark2"), random.order = FALSE)
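It can also be informative to compare the vocabularies of the two classes side by side. The sketch below assumes the corpus documents are still in the same order as df_clean, which holds here because the corpus was built directly from df_clean$text.

# Separate wordclouds for spam and ham documents
wordcloud(df_v_corp[df_clean$class == "spam"], max.words = 50,
          colors = brewer.pal(7, "Dark2"), random.order = FALSE)
wordcloud(df_v_corp[df_clean$class == "ham"], max.words = 50,
          colors = brewer.pal(7, "Dark2"), random.order = FALSE)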

Creating a sparse document-term matrix

df_dtm <- DocumentTermMatrix(df_v_corp)

df_dtm
## <<DocumentTermMatrix (documents: 3896, terms: 80457)>>
## Non-/sparse entries: 531325/312929147
## Sparsity           : 100%
## Maximal term length: 74230
## Weighting          : term frequency (tf)
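To peek at the raw term counts, tm's inspect() prints a small slice of the matrix; the row and column ranges below are arbitrary.

# A small slice of the document-term matrix
inspect(df_dtm[1:5, 1:8])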

Remove Sparse Terms

Removing sparse terms shrinks the matrix dramatically (here from 80,457 terms to 3,013), which makes training the Naive Bayes model faster and less memory-intensive.

df_dtm_rst <- removeSparseTerms(df_dtm, sparse = .995)

df_dtm_rst
## <<DocumentTermMatrix (documents: 3896, terms: 3013)>>
## Non-/sparse entries: 383085/11355563
## Sparsity           : 97%
## Maximal term length: 98
## Weighting          : term frequency (tf)
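As a sanity check on what survived the sparsity filter, tm's findFreqTerms() lists the terms above a frequency threshold; the cutoff of 500 here is an arbitrary choice.

# Terms that appear at least 500 times across the corpus
head(findFreqTerms(df_dtm_rst, lowfreq = 500), 20)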

Creating training and test sets

# Create training and testing datasets

set.seed(121)

sample_size <- floor(0.70 * nrow(df_dtm_rst))
train_ind <- sample(nrow(df_dtm_rst), size = sample_size)
train <- df_dtm_rst[train_ind,]
test <- df_dtm_rst[-train_ind,]
train_labels <- df[train_ind, ]$class
test_labels <- df[-train_ind, ]$class

# Proportions of the training & test labels
# Spam messages are split proportionally between the training and test datasets.
# Both datasets contain about 36% spam.

prop.table(table(train_labels))
## train_labels
##       ham      spam 
## 0.6406307 0.3593693
prop.table(table(test_labels))
## test_labels
##       ham      spam 
## 0.6441403 0.3558597
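The random split above happens to preserve the class balance, but since caret is already loaded, createDataPartition() can guarantee a stratified split. An alternative sketch; the _strat names are hypothetical:

# Stratified 70/30 split on the class label; guarantees matching
# spam/ham proportions rather than relying on chance
train_ind_strat <- createDataPartition(as.factor(df$class), p = 0.70, list = FALSE)[, 1]
train_strat <- df_dtm_rst[train_ind_strat, ]
test_strat  <- df_dtm_rst[-train_ind_strat, ]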

Naive Bayes model

# Convert counts to Yes/No strings

convert_values <- function(x) {
  ifelse(x > 0, "Yes", "No")
}

# Convert the training and test matrices

email_train <- apply(train, MARGIN = 2, convert_values)
email_test <- apply(test, MARGIN = 2, convert_values)

# Training the Naive Bayes model

email_classifier <- naive_bayes(email_train, train_labels)
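One caveat: naive_bayes() defaults to laplace = 0, so a term that never appears in one class gets a zero conditional probability for that class. Add-one (Laplace) smoothing is a common guard; a sketch with a hypothetical variable name:

# Same model with Laplace smoothing to avoid zero-probability terms
email_classifier_smooth <- naive_bayes(email_train, train_labels, laplace = 1)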

The prediction

email_test_pred <- predict(email_classifier, email_test)
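predict() can also return posterior probabilities rather than hard labels, which is useful if you want to trade sensitivity against specificity by tuning a decision threshold; a sketch:

# Posterior probabilities for each test document (one column per class)
email_test_prob <- predict(email_classifier, email_test, type = "prob")
head(email_test_prob)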

Confusion matrix and Conclusion

The accuracy of this model is 97.43%, with a sensitivity of 93.99% (spam correctly flagged) and a specificity of 99.34% (ham correctly passed).

confusionMatrix(data = as.factor(email_test_pred), reference = as.factor(test_labels),
                positive = "spam", dnn = c("Prediction", "Actual"))
## Confusion Matrix and Statistics
## 
##           Actual
## Prediction ham spam
##       ham  748   25
##       spam   5  391
##                                           
##                Accuracy : 0.9743          
##                  95% CI : (0.9636, 0.9826)
##     No Information Rate : 0.6441          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9434          
##                                           
##  Mcnemar's Test P-Value : 0.0005226       
##                                           
##             Sensitivity : 0.9399          
##             Specificity : 0.9934          
##          Pos Pred Value : 0.9874          
##          Neg Pred Value : 0.9677          
##              Prevalence : 0.3559          
##          Detection Rate : 0.3345          
##    Detection Prevalence : 0.3388          
##       Balanced Accuracy : 0.9666          
##                                           
##        'Positive' Class : spam            
##