SPS_Data607_Week12_DC

Author

David Chen

Project 4: Document Classification

It can be useful to be able to classify new “test” documents using already classified “training” documents.  A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam.  

For this project, you can start with a spam/ham dataset, then predict the class of new documents (either withheld from the training dataset or from another source such as your own spam folder).   One example corpus:   https://spamassassin.apache.org/old/publiccorpus

Approach

Introduction

This project focuses on building a classification model to distinguish between spam and ham (non-spam) documents. Using a labeled dataset, we trained a model to predict whether new, unseen documents should be classified as spam or ham.

The Naive Bayes classifier provided by the e1071 package is well suited to this prediction task, and it is the model applied in this project.
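To illustrate the interface before the full pipeline is built, here is a minimal toy sketch. The two features (contains_offer, contains_meeting) and the five rows are made up for illustration and are not part of the project data; naiveBayes() learns conditional probabilities from a feature table plus class labels, and predict() classifies a new observation.

library(e1071)

# Hypothetical toy data: two made-up binary features plus a class label
toy <- data.frame(
  contains_offer   = factor(c("Yes", "Yes", "No", "No", "No")),
  contains_meeting = factor(c("No", "No", "Yes", "Yes", "No")),
  class            = factor(c("spam", "spam", "ham", "ham", "ham"))
)

# Fit the classifier and score a new "document"
toy_model <- naiveBayes(class ~ ., data = toy)
new_doc <- data.frame(contains_offer   = factor("Yes", levels = c("No", "Yes")),
                      contains_meeting = factor("No",  levels = c("No", "Yes")))
predict(toy_model, new_doc)  # expected: spam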

Libraries

• e1071: Used to implement the Naive Bayes classifier for prediction.
• caret: Used to evaluate model performance, particularly through confusion matrices.
• tm: A text mining package used extensively for text preprocessing and corpus transformation.

#install.packages("e1071")
#install.packages("tm")
library(tidyr)
library(ggplot2)
library(readr)
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(magrittr)

Attaching package: 'magrittr'
The following object is masked from 'package:tidyr':

    extract
library(stringr)
library(tm)
Loading required package: NLP

Attaching package: 'NLP'
The following object is masked from 'package:ggplot2':

    annotate
library(e1071)

Attaching package: 'e1071'
The following object is masked from 'package:ggplot2':

    element
library(caret)
Loading required package: lattice

Loading the Data

The initial step loaded the spam and ham documents from two directories (referenced by spam_folder and ham_folder):

• list.files() retrieved the file names from each directory.
• File contents were read and stored in data frames with appropriate column names.
• lapply() was applied to efficiently process each collection of files and store its contents.
• Because the resulting data included list-columns, unnest() was used to expand each element into individual rows for easier manipulation.

This process was applied consistently to both the spam and ham datasets.

#https://spamassassin.apache.org/old/publiccorpus/20030228_easy_ham.tar.bz2
#https://spamassassin.apache.org/old/publiccorpus/20030228_spam.tar.bz2
#setwd("~/SPS_Spring2026"); both folders are located under the working directory.

if(all(file.exists("w12_dtm_spam.csv","w12_dtm_ham.csv"))) {
  print("Cache files exits\n")
  df_ham <- read.csv("w12_dtm_ham.csv")
  df_spam<- read.csv("w12_dtm_spam.csv")

}else{
  
  spam_folder <- './spam/'
  ham_folder <- './easy_ham/'
  
  length(list.files(path = spam_folder))
  length(list.files(path = ham_folder))
  
  spam_files <- list.files(path = spam_folder, full.names = TRUE)
  ham_files <- list.files(path = ham_folder, full.names = TRUE)
  
  df_spam <- as.data.frame(spam_files) %>%
    set_colnames("file") %>%
    mutate(text = lapply(spam_files, function(path) {
      # readLines is more stable for legacy email archives
      paste(readLines(path, warn = FALSE), collapse = " ")
    })) %>%
    unnest(c(text)) %>%
    mutate(class = "spam", spam = 1) %>%
    group_by(file) %>%
    mutate(text = paste(text, collapse = " ")) %>%
    ungroup() %>%
    distinct()
  write.csv(df_spam, "w12_dtm_spam.csv", row.names = FALSE)
  df_ham <- as.data.frame(ham_files) %>%
    set_colnames("file") %>%
    mutate(text = lapply(ham_files, function(path) {
      # readLines is more stable for legacy email archives
      paste(readLines(path, warn = FALSE), collapse = " ")
    })) %>%
    unnest(c(text)) %>%
    mutate(class = "ham", spam = 0) %>%
    group_by(file) %>%
    mutate(text = paste(text, collapse = " ")) %>%
    ungroup() %>%
    distinct()
  write.csv(df_ham, "w12_dtm_ham.csv", row.names = FALSE)  
}
[1] "Cache files exits\n"

Tidying Data / Building Corpus

Text preprocessing included several key steps:

• Removing unnecessary whitespace and formatting issues using functions like str_replace().
• Cleaning punctuation and replacing it with spaces using a content transformation function.
• Applying transformations across the entire corpus using tm_map() from the tm package.
• Removing numbers with removeNumbers().
• Eliminating stop words (common words with little analytical value, such as “the”, “and”, etc.).
• Normalizing text for consistency.

These steps prepared the text data for feature extraction.

df_ham_spam <- rbind(df_ham, df_spam) %>%
  select(class, spam, file, text)

df_ham_spam$text <- df_ham_spam$text %>%
  str_replace_all("[\\r\\n\\t]+", "")

replacePunctuation <- content_transformer(function(x) {
  return (gsub("[[:punct:]]", " ", x))})


corpus <- Corpus(VectorSource(df_ham_spam$text)) %>%
  tm_map(content_transformer(tolower)) %>%
  tm_map(removeWords, stopwords("english")) %>%
  tm_map(replacePunctuation) %>%
  tm_map(removeNumbers) %>%
  tm_map(stripWhitespace)
Warning in tm_map.SimpleCorpus(., content_transformer(tolower)): transformation
drops documents
Warning in tm_map.SimpleCorpus(., removeWords, stopwords("english")):
transformation drops documents
Warning in tm_map.SimpleCorpus(., replacePunctuation): transformation drops
documents
Warning in tm_map.SimpleCorpus(., removeNumbers): transformation drops
documents
Warning in tm_map.SimpleCorpus(., stripWhitespace): transformation drops
documents
dtm <- DocumentTermMatrix(corpus)

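# Drop very sparse terms: keep only terms that appear in at least ~10 documents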
dtm <- removeSparseTerms(dtm, 1-(10/length(corpus)))

inspect(dtm)
<<DocumentTermMatrix (documents: 3002, terms: 5300)>>
Non-/sparse entries: 401485/15509115
Sparsity           : 97%
Maximal term length: 21
Weighting          : term frequency (tf)
Sample             :
      Terms
Docs   com fork list localhost net org received sep spamassassin taint
  166  175    0    2         5  11  16        4   0           11    11
  2746   4    0    1         5  17  12        5  10            2     2
  2783   5    0    2         5   1   3        4   5            1     1
  2906   6    0    3         5   2   3        4   5            1     1
  2977   8    0    2         5   4   6        6   7            3     3
  570   15   15    7         8   0   9        6   8            5     5
  670   17   17    7         7   0   8        7   8            5     5
  677   21   19    8         6   0   9        7   9            6     6
  765   22   16    9         6   4   8        6   8            5     5
  942   27   16    6         6   0   7        6   0            4     4
dim(dtm)
[1] 3002 5300

Training Data

In this step, the Document-Term Matrix was converted into a data frame, and a classification column (spam vs. ham) was added as a factor variable. The data were then split into training and testing sets in an 80:20 ratio.

#A DTM is stored as a Sparse Matrix. Because it is sparse, 
#you cannot simply use as.data.frame() directly on the object. 
#You must first convert it into a standard dense matrix.
df_dtm <-as.data.frame(as.matrix(dtm)) %>%
  sapply(., as.numeric) %>%
  as.data.frame() %>%
  mutate(class = df_ham_spam$class) %>%
  select(class, everything())

df_dtm$class <- as.factor(df_dtm$class)

# 80:20 ratio
sample_size <- floor(0.8 * nrow(df_dtm))

set.seed(6549)
index <- sample(seq_len(nrow(df_dtm)), size = sample_size)
  
dtm_train <- df_dtm[index, ]
dtm_test <-  df_dtm[-index, ]

train_labels <- dtm_train$class
test_labels <- dtm_test$class


prop.table(table(train_labels))
train_labels
      ham      spam 
0.8271554 0.1728446 
prop.table(table(test_labels))
test_labels
      ham      spam 
0.8569052 0.1430948 
dim(dtm_train)
[1] 2401 5300
dim(dtm_test)
[1]  601 5300

Model Training

Using the prepared training dataset, we applied the Naive Bayes classifier (via the e1071 package) to build the predictive model. The term counts were first converted into Yes/No presence indicators, and the model was trained on these features and the corresponding class labels to learn patterns that distinguish spam from ham.

# Convert all columns into Yes/No (absent/present) values,
# except the first one, which holds the spam/ham class label.
dtm_train[ , -1] <- ifelse(dtm_train[ , -1] == 0, "No", "Yes")
dtm_test[ , -1] <- ifelse(dtm_test[ , -1] == 0, "No", "Yes")
model_classifier <- naiveBayes(dtm_train, train_labels) 

test_pred <- predict(model_classifier, dtm_test)

confusionMatrix(test_pred, test_labels, positive = "spam",
                dnn = c("Prediction","Actual"))
Confusion Matrix and Statistics

          Actual
Prediction ham spam
      ham  514    2
      spam   1   84
                                         
               Accuracy : 0.995          
                 95% CI : (0.9855, 0.999)
    No Information Rate : 0.8569         
    P-Value [Acc > NIR] : <2e-16         
                                         
                  Kappa : 0.9795         
                                         
 Mcnemar's Test P-Value : 1              
                                         
            Sensitivity : 0.9767         
            Specificity : 0.9981         
         Pos Pred Value : 0.9882         
         Neg Pred Value : 0.9961         
             Prevalence : 0.1431         
         Detection Rate : 0.1398         
   Detection Prevalence : 0.1414         
      Balanced Accuracy : 0.9874         
                                         
       'Positive' Class : spam           
                                         

Conclusion

The Naive Bayes document classification model achieved an accuracy of 99.5%, with 598 correct predictions out of 601 test documents. It correctly identified most ham emails (514) and spam emails (84), indicating strong overall classification ability. The number of errors was very small: only 2 spam messages were misclassified as ham (false negatives) and 1 ham message was misclassified as spam (false positive). This suggests the model is both precise and reliable, especially at detecting spam. However, the presence of a few false negatives means that some spam messages may still slip through, which could be a concern depending on the application. Overall, the model demonstrates excellent performance with a good balance between precision and recall.
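As a quick arithmetic check on that balance, the spam-class precision, recall, and F1 score can be recomputed directly from the confusion-matrix counts reported above (a small sketch; the first two values simply reproduce the Pos Pred Value and Sensitivity printed by confusionMatrix()).

# Counts taken from the confusion matrix above (positive class = spam)
tp <- 84   # spam correctly predicted as spam
fp <- 1    # ham incorrectly predicted as spam
fn <- 2    # spam incorrectly predicted as ham

precision <- tp / (tp + fp)                                 # ~0.988 (Pos Pred Value)
recall    <- tp / (tp + fn)                                 # ~0.977 (Sensitivity)
f1        <- 2 * precision * recall / (precision + recall)  # ~0.982

c(precision = precision, recall = recall, f1 = f1)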