Introduction

The goal of this project is to classify emails as either spam or ham (not spam) and then predict the class of new emails. We’ll use data from the SpamAssassin public corpus (https://spamassassin.apache.org/old/publiccorpus/), which includes labeled spam and ham emails, to train and test our model.

First, we’ll preprocess and clean the data to make it suitable for analysis. This involves loading the spam and ham emails from specified folders, ensuring the correct encoding, and removing any blank entries. After that, we’ll combine the spam and ham emails into a single collection and create a Document-Term Matrix (DTM) to represent the text data in a numerical format.
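
To make the DTM idea concrete, here is a minimal sketch on two toy strings (invented for illustration, not part of the corpus): each row is a document, each column a term, and each cell a term count.

Code

library(tm)

# Two toy "documents"; inspect() prints the resulting term counts
toy <- c("win free money now", "meeting agenda for monday")
toy_dtm <- DocumentTermMatrix(Corpus(VectorSource(toy)))
inspect(toy_dtm)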

We’ll then divide the dataset into two parts: 70% for training the model and 30% for testing its performance. This helps us evaluate how well the model can predict new, unseen data. We’ll use the Naive Bayes classifier for training and predictions, and afterward, we’ll evaluate the model’s accuracy and discuss factors that might have influenced its performance.
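
Before running the full pipeline, here is a toy illustration of e1071’s naiveBayes() on a tiny hand-made count matrix (invented numbers, not the project data). Note that it models numeric columns as per-class Gaussians, which is part of why raw term counts are only a rough fit for this classifier.

Code

library(e1071)

# Four tiny "documents" described by two term counts each
x <- matrix(c(3, 0,
              2, 1,
              0, 4,
              1, 3),
            ncol = 2, byrow = TRUE,
            dimnames = list(NULL, c("free", "meeting")))
y <- factor(c("spam", "spam", "ham", "ham"))

model <- naiveBayes(x, y)
predict(model, matrix(c(2, 0), ncol = 2,
                      dimnames = list(NULL, c("free", "meeting"))))
## [1] spam
## Levels: ham spam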

Now, let’s load the necessary libraries and define the functions that make up the pipeline.

Code

# Load the libraries used throughout: tm for text mining, SnowballC for
# stemming, e1071 for naiveBayes(), and caret for confusionMatrix()
library(tm)
library(SnowballC)
library(e1071)
library(caret)

# Header stripping function to extract email body
strip_headers <- function(email_content) {
  if (is.null(email_content) || length(email_content) == 0 || all(is.na(email_content))) {
    return("")
  }

  if (is.list(email_content)) {
    if (length(email_content) > 0) {
      email_content <- email_content[[1]]
    } else {
      return("")
    }
  }

  email_content <- as.character(email_content)

  tryCatch({
    if (grepl("\n\n", email_content, fixed = TRUE)) {
      parts <- strsplit(email_content, "\n\n", fixed = TRUE)[[1]]
      if (length(parts) >= 2) {
        email_body <- paste(parts[-1], collapse = "\n\n")
        return(email_body)
      }
    }
    return(email_content)
  }, error = function(e) {
    return(email_content)
  })
}
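
# Quick sanity check on a toy string (invented example): headers end at the
# first blank line, so only the body ("Body text.") should come back
strip_headers("From: a@example.com\nSubject: Hi\n\nBody text.")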

# Function to load emails from a directory with encoding handling and header stripping
load_emails <- function(folder) {
  files <- list.files(folder, full.names = TRUE, recursive = TRUE)

  emails <- sapply(files, function(file) {
    tryCatch({
      content <- tryCatch({
        readLines(file, encoding = 'UTF-8', warn = FALSE)
      }, error = function(e) {
        tryCatch({
          readLines(file, encoding = 'latin1', warn = FALSE)
        }, error = function(e) {
          readLines(file, encoding = 'ASCII', warn = FALSE)
        })
      })

      # Skip unusually long files (over 5000 lines), which are unlikely to be normal emails
      if (length(content) > 5000) {
        return("")
      }

      if (length(content) > 0) {
        content <- paste(content, collapse = "\n")
        if (!grepl("From:", content, fixed = TRUE) && 
            !grepl("Subject:", content, fixed = TRUE) && 
            !grepl("To:", content, fixed = TRUE)) {
          return("")
        }

        content <- strip_headers(content)
        return(content)
      } else {
        return("")
      }
    }, error = function(e) {
      return("")
    })
  })

  names(emails) <- NULL
  return(emails)
}
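
# Usage sketch (hypothetical path, shown commented out): returns one string per
# email file with headers stripped; unreadable or header-less files become ""
# spam_emails <- load_emails("path/to/spam")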

# Preprocessing function to clean email text
preprocess_text <- function(text) {
  text <- sapply(text, function(t) {
    if (length(t) == 0 || is.na(t)) {
      return("")
    }
    tryCatch({
      result <- iconv(t, "latin1", "ASCII", sub="")
      if (is.na(result)) return("")
      return(result)
    }, error = function(e) {
      return("")
    })
  })

  text <- text[text != ""]
  doc <- Corpus(VectorSource(text))

  doc <- tryCatch({
    doc <- tm_map(doc, content_transformer(tolower))
    doc <- tm_map(doc, removePunctuation)
    doc <- tm_map(doc, removeNumbers)
    doc <- tm_map(doc, removeWords, stopwords("english"))
    doc <- tm_map(doc, stemDocument)
    doc <- tm_map(doc, stripWhitespace)
    doc
  }, error = function(e) {
    doc
  })

  return(doc)
}
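
# Toy check (invented string): expect lowercasing, punctuation and numbers
# removed, English stopwords dropped, and the remaining words stemmed
preprocess_text("You have WON $1,000,000! Claim your prize now!!")$content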

# Function to classify new emails
classify_email <- function(email_text, classifier, dtm) {
  email_clean <- strip_headers(email_text)
  email_corpus <- preprocess_text(email_clean)

  new_dtm <- DocumentTermMatrix(email_corpus, control = list(dictionary = colnames(dtm)))
  new_matrix <- as.matrix(new_dtm)

  if (ncol(new_matrix) == 0 || all(new_matrix == 0)) {
    return("Unknown")
  }

  prediction <- predict(classifier, new_matrix)
  return(as.character(prediction))
}
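
# Usage sketch (toy email, shown commented out because it needs the trained
# model produced by run_spam_detection() below):
# classify_email("From: x@example.com\nSubject: WIN\n\nFree money now!!!",
#                results$model, results$dtm_matrix)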

# Main function to run the spam detection process with dataset balancing
run_spam_detection <- function(spam_folder, ham_folder) {
  spam_emails <- load_emails(spam_folder)
  ham_emails <- load_emails(ham_folder)

  spam_emails <- spam_emails[spam_emails != ""]
  ham_emails <- ham_emails[ham_emails != ""]

  set.seed(123)
  ham_count <- length(ham_emails)
  # Balance the classes: upsample spam (sampling with replacement) to match the ham count
  if (length(spam_emails) < ham_count) {
    spam_emails <- sample(spam_emails, ham_count, replace = TRUE)
  }

  spam_corpus <- preprocess_text(spam_emails)
  ham_corpus <- preprocess_text(ham_emails)

  # Build the labels after preprocessing: preprocess_text() drops documents
  # that become empty, so counting afterwards keeps labels aligned with the DTM rows
  labels <- factor(c(rep("spam", length(spam_corpus)), rep("ham", length(ham_corpus))))

  all_docs <- c(as.character(spam_corpus$content), as.character(ham_corpus$content))
  corpus <- Corpus(VectorSource(all_docs))

  dtm <- DocumentTermMatrix(corpus, control = list(
    wordLengths = c(3, Inf),
    bounds = list(global = c(3, Inf)),
    weighting = weightTf,
    tolower = TRUE,
    removePunctuation = TRUE,
    removeNumbers = TRUE,
    stopwords = TRUE
  ))

  # If the vocabulary is very large, drop the sparsest terms to keep the matrix tractable
  if (ncol(dtm) > 10000) {
    dtm <- removeSparseTerms(dtm, 0.995)
  }

  dtm_matrix <- as.matrix(dtm)

  # Drop constant (zero-variance) term columns
  col_variance <- apply(dtm_matrix, 2, var)
  if (any(col_variance == 0)) {
    dtm_matrix <- dtm_matrix[, col_variance > 0]
  }

  set.seed(123)
  split_ratio <- 0.7
  total_docs <- nrow(dtm_matrix)

  train_indices <- sample(1:total_docs, size = round(split_ratio * total_docs))
  test_indices <- setdiff(1:total_docs, train_indices)

  trainData <- dtm_matrix[train_indices, ]
  trainLabels <- labels[train_indices]
  testData <- dtm_matrix[test_indices, ]
  testLabels <- labels[test_indices]

  classifier <- naiveBayes(trainData, trainLabels)

  predictions <- predict(classifier, testData)

  confusion <- confusionMatrix(predictions, testLabels)

  return(list(
    model = classifier,
    confusion_matrix = confusion,
    accuracy = confusion$overall["Accuracy"],
    dtm_matrix = dtm_matrix,
    predictions = predictions,
    test_labels = testLabels
  ))
}

# Set your paths
spam_folder <- 'C:\\Users\\PATELM70\\Desktop\\CUNY\\DATA607\\Project 4\\20021010_spam\\spam'
ham_folder <- 'C:\\Users\\PATELM70\\Desktop\\CUNY\\DATA607\\Project 4\\20021010_easy_ham\\easy_ham'

# Run the spam detection
results <- run_spam_detection(spam_folder, ham_folder)

# Print results
print(results$confusion_matrix)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction ham spam
##       ham  714  519
##       spam   1  212
##                                          
##                Accuracy : 0.6404         
##                  95% CI : (0.615, 0.6652)
##     No Information Rate : 0.5055         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.2863         
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
##                                          
##             Sensitivity : 0.9986         
##             Specificity : 0.2900         
##          Pos Pred Value : 0.5791         
##          Neg Pred Value : 0.9953         
##              Prevalence : 0.4945         
##          Detection Rate : 0.4938         
##    Detection Prevalence : 0.8527         
##       Balanced Accuracy : 0.6443         
##                                          
##        'Positive' Class : ham            
## 
cat("Accuracy:", results$accuracy, "\n")
## Accuracy: 0.6403873

Conclusion

Based on the model evaluation, the classifier achieved an accuracy of 64%. The model is particularly effective at detecting ham emails, as indicated by the high sensitivity (99.86%), but it struggles with spam detection, shown by the low specificity (29%).

Several factors may contribute to this performance, including:

  1. Data Imbalance: The corpus contains far fewer spam than ham emails. Upsampling the spam emails (sampling with replacement) mitigates the imbalance, but it duplicates messages, and since the train/test split happens after upsampling, some duplicates can appear in both sets.
  2. Feature Representation: The quality of extracted features can impact classifier performance. Enhancing feature extraction methods could improve accuracy.
  3. Model Fit: Naive Bayes is simple and fast, but e1071’s naiveBayes() models numeric features as Gaussians, which is a poor match for sparse word counts; multinomial or Bernoulli variants are usually a better fit for text.

Going forward, addressing these factors can improve the classifier’s effectiveness in spam detection and ensure robust performance for real-world applications.

Suggested Improvements:

  1. Enhance Feature Extraction: Implement techniques such as TF-IDF weighting and n-grams (see the sketch after this list).
  2. Explore Advanced Models: Evaluate models such as SVM, Random Forests, and deep learning approaches.
  3. Ensure Data Balance: Continuously monitor and balance datasets for optimal model training.
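
As a minimal sketch of the first suggestion (assuming corpus is the combined corpus built inside run_spam_detection()), TF-IDF weighting is a one-line change to the DTM control list; n-grams would additionally require a custom tokenizer.

Code

# Same DTM call as above, but weighting term frequencies by inverse
# document frequency (weightTfIdf) instead of raw counts (weightTf)
dtm_tfidf <- DocumentTermMatrix(corpus, control = list(
  wordLengths = c(3, Inf),
  bounds = list(global = c(3, Inf)),
  weighting = weightTfIdf
))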

This project lays the foundation for an effective spam detection system. By iterating and refining our approach, we can achieve better accuracy and reliability in email classification.