The goal of this project is to classify emails as either spam or ham (not spam) and then predict the class of new emails. We’ll use a data from SpamAssassin (“https://spamassassin.apache.org/old/publiccorpus/”), which includes labeled spam and ham emails, to train and test our model.
First, we’ll preprocess and clean the data to make it suitable for analysis. This involves loading the spam and ham emails from specified folders, ensuring the correct encoding, and removing any blank entries. After that, we’ll combine the spam and ham emails into a single collection and create a Document-Term Matrix (DTM) to represent the text data in a numerical format.
We’ll then divide the dataset into two parts: 70% for training the model and 30% for testing its performance. This helps us evaluate how well the model can predict new, unseen data. We’ll use the Naive Bayes classifier for training and predictions, and afterward, we’ll evaluate the model’s accuracy and discuss factors that might have influenced its performance.
First, let’s load the necessary libraries.
# Header stripping function to extract email body
strip_headers <- function(email_content) {
if (is.null(email_content) || is.na(email_content) || length(email_content) == 0) {
return("")
}
if (is.list(email_content)) {
if (length(email_content) > 0) {
email_content <- email_content[[1]]
} else {
return("")
}
}
email_content <- as.character(email_content)
tryCatch({
if (grepl("\n\n", email_content, fixed = TRUE)) {
parts <- strsplit(email_content, "\n\n", fixed = TRUE)[[1]]
if (length(parts) >= 2) {
email_body <- paste(parts[-1], collapse = "\n\n")
return(email_body)
}
}
return(email_content)
}, error = function(e) {
return(email_content)
})
}
# Function to load emails from a directory with encoding handling and header stripping
load_emails <- function(folder) {
files <- list.files(folder, full.names = TRUE, recursive = TRUE)
emails <- sapply(files, function(file) {
tryCatch({
content <- tryCatch({
readLines(file, encoding = 'UTF-8', warn = FALSE)
}, error = function(e) {
tryCatch({
readLines(file, encoding = 'latin1', warn = FALSE)
}, error = function(e) {
readLines(file, encoding = 'ASCII', warn = FALSE)
})
})
if (length(content) > 5000) {
return("")
}
if (length(content) > 0) {
content <- paste(content, collapse = "\n")
if (!grepl("From:", content, fixed = TRUE) &&
!grepl("Subject:", content, fixed = TRUE) &&
!grepl("To:", content, fixed = TRUE)) {
return("")
}
content <- strip_headers(content)
return(content)
} else {
return("")
}
}, error = function(e) {
return("")
})
})
names(emails) <- NULL
return(emails)
}
# Preprocessing function to clean email text
preprocess_text <- function(text) {
text <- sapply(text, function(t) {
if (is.na(t) || length(t) == 0) {
return("")
}
tryCatch({
result <- iconv(t, "latin1", "ASCII", sub="")
if (is.na(result)) return("")
return(result)
}, error = function(e) {
return("")
})
})
text <- text[text != ""]
doc <- Corpus(VectorSource(text))
doc <- tryCatch({
doc <- tm_map(doc, content_transformer(tolower))
doc <- tm_map(doc, removePunctuation)
doc <- tm_map(doc, removeNumbers)
doc <- tm_map(doc, removeWords, stopwords("english"))
doc <- tm_map(doc, stemDocument)
doc <- tm_map(doc, stripWhitespace)
doc
}, error = function(e) {
doc
})
return(doc)
}
# Function to classify new emails
classify_email <- function(email_text, classifier, dtm) {
email_clean <- strip_headers(email_text)
email_corpus <- preprocess_text(email_clean)
new_dtm <- DocumentTermMatrix(email_corpus, control = list(dictionary = colnames(dtm)))
new_matrix <- as.matrix(new_dtm)
if (ncol(new_matrix) == 0 || all(new_matrix == 0)) {
return("Unknown")
}
prediction <- predict(classifier, new_matrix)
return(as.character(prediction))
}
# Main function to run the spam detection process with dataset balancing
run_spam_detection <- function(spam_folder, ham_folder) {
spam_emails <- load_emails(spam_folder)
ham_emails <- load_emails(ham_folder)
spam_emails <- spam_emails[spam_emails != ""]
ham_emails <- ham_emails[ham_emails != ""]
set.seed(123)
ham_count <- length(ham_emails)
if (length(spam_emails) < ham_count) {
spam_emails <- sample(spam_emails, ham_count, replace = TRUE)
}
labels <- factor(c(rep("spam", length(spam_emails)), rep("ham", length(ham_emails))))
spam_corpus <- preprocess_text(spam_emails)
ham_corpus <- preprocess_text(ham_emails)
all_docs <- c(as.character(spam_corpus$content), as.character(ham_corpus$content))
corpus <- Corpus(VectorSource(all_docs))
dtm <- DocumentTermMatrix(corpus, control = list(
wordLengths = c(3, Inf),
bounds = list(global = c(3, Inf)),
weighting = weightTf,
tolower = TRUE,
removePunctuation = TRUE,
removeNumbers = TRUE,
stopwords = TRUE
))
if (ncol(dtm) > 10000) {
dtm <- removeSparseTerms(dtm, 0.995)
}
dtm_matrix <- as.matrix(dtm)
col_variance <- apply(dtm_matrix, 2, var)
if (any(col_variance == 0)) {
dtm_matrix <- dtm_matrix[, col_variance > 0]
}
set.seed(123)
split_ratio <- 0.7
total_docs <- nrow(dtm_matrix)
train_indices <- sample(1:total_docs, size = round(split_ratio * total_docs))
test_indices <- setdiff(1:total_docs, train_indices)
trainData <- dtm_matrix[train_indices, ]
trainLabels <- labels[train_indices]
testData <- dtm_matrix[test_indices, ]
testLabels <- labels[test_indices]
classifier <- naiveBayes(trainData, trainLabels)
predictions <- predict(classifier, testData)
confusion <- confusionMatrix(predictions, testLabels)
return(list(
model = classifier,
confusion_matrix = confusion,
accuracy = confusion$overall["Accuracy"],
dtm_matrix = dtm_matrix,
predictions = predictions,
test_labels = testLabels
))
}
# Set your paths
spam_folder <- 'C:\\Users\\PATELM70\\Desktop\\CUNY\\DATA607\\Project 4\\20021010_spam\\spam'
ham_folder <- 'C:\\Users\\PATELM70\\Desktop\\CUNY\\DATA607\\Project 4\\20021010_easy_ham\\easy_ham'
# Run the spam detection
results <- run_spam_detection(spam_folder, ham_folder)
# Print results
print(results$confusion_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction ham spam
## ham 714 519
## spam 1 212
##
## Accuracy : 0.6404
## 95% CI : (0.615, 0.6652)
## No Information Rate : 0.5055
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.2863
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9986
## Specificity : 0.2900
## Pos Pred Value : 0.5791
## Neg Pred Value : 0.9953
## Prevalence : 0.4945
## Detection Rate : 0.4938
## Detection Prevalence : 0.8527
## Balanced Accuracy : 0.6443
##
## 'Positive' Class : ham
##
cat("Accuracy:", results$accuracy, "\n")
## Accuracy: 0.6403873
Based on the model evaluation, the classifier achieved an accuracy of 64%. The model is particularly effective at detecting ham emails, as indicated by the high sensitivity (99.86%), but it struggles with spam detection, shown by the low specificity (29%).
Several factors may contribute to this performance, including:
Going forward, addressing these factors can improve the classifier’s effectiveness in spam detection and ensure robust performance for real-world applications.
This project lays the foundation for an effective spam detection system. By iterating and refining our approach, we can achieve better accuracy and reliability in email classification.