library(tm)
library(caret)
library(e1071)
set.seed(123)
This is project four of the Fall 2024 edition of DATA 607 at the CUNY School of Professional Studies. The assignment states:
“It can be useful to be able to classify new “test” documents using already classified “training” documents. A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam.
For this project, you can start with a spam/ham dataset, then predict the class of new documents (either withheld from the training dataset or from another source such as your own spam folder). One example corpus: https://spamassassin.apache.org/old/publiccorpus/”
I have taken one spam and one ham dataset from the provided link.
This first chunk defines the location of the spam and ham email folders and converts their contents into usable datasets. text_sweep removes all non-ASCII characters to prevent encoding issues. That function is then passed into emailer, a function that reads each email, collapses it into a single string, uses text_sweep to remove non-ASCII characters, and returns a vector of the processed emails.
spam_folder <- "/Users/uwsthoughts/Desktop/github_sync/data_science_masters_work/2024_Fall/data_607_data_management/project_four/spam"
ham_folder <- "/Users/uwsthoughts/Desktop/github_sync/data_science_masters_work/2024_Fall/data_607_data_management/project_four/ham"
spam <- list.files(spam_folder, full.names = TRUE)
ham <- list.files(ham_folder, full.names = TRUE)
text_sweep <- function(text) {
iconv(text, from = "UTF-8", to = "ASCII", sub = "")
}
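As a quick illustration, applying text_sweep to a made-up string (the example below is purely illustrative) drops everything outside the ASCII range:
text_sweep("You’ve won a £1000 café voucher!")
# expected result: "Youve won a 1000 caf voucher!" (the curly apostrophe, £, and é are removed)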
emailer <- function(files) sapply(files, function(f) {
tryCatch({
text <- paste(readLines(f, warn = FALSE, encoding = "UTF-8"), collapse = " ")
text_sweep(text)
}, error = function(e) {
""
})
})
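As a quick sanity check (purely illustrative, and assuming the folders above contain at least a few files), emailer can be run on a small subset of the file list to confirm it returns one cleaned string per email:
sample_emails <- emailer(head(spam, 3))
length(sample_emails)  # one element per file read
nchar(sample_emails)   # length of each cleaned email string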
This chunk starts by using the functions above to read and preprocess the emails. It then turns the dataset into a structured DocumentTermMatrix by tokenizing the text and filtering out sparse terms that appear in fewer than 1% of documents. The processed matrix is converted into a dataframe with a label column that declares each document spam or ham. The data is then split into training and testing sets, after which a Naive Bayes classifier is fit on the training data and used to predict labels for the test set. Performance is evaluated with a confusion matrix.
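# Read and clean the raw emails, then combine them and attach spam/ham labels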
corpus_spamus <- emailer(spam)
corpus_hamus <- emailer(ham)
corpus_d <- c(corpus_spamus, corpus_hamus)
labels <- c(rep("spam", length(corpus_spamus)), rep("ham", length(corpus_hamus)))
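# Drop any documents that came back empty (failed reads)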
valid_indices <- corpus_d != ""
corpus_d <- corpus_d[valid_indices]
labels <- labels[valid_indices]
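# Build a tm corpus and apply standard cleaning: lowercase, punctuation, numbers, stopwords, whitespace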
corpus_w <- Corpus(VectorSource(corpus_d))
corpus_w <- tm_map(corpus_w, content_transformer(tolower))
corpus_w <- tm_map(corpus_w, removePunctuation)
corpus_w <- tm_map(corpus_w, removeNumbers)
corpus_w <- tm_map(corpus_w, removeWords, stopwords("en"))
corpus_w <- tm_map(corpus_w, stripWhitespace)
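# Tokenize into a document-term matrix and remove very sparse terms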
terminator <- DocumentTermMatrix(corpus_w)
terminator <- removeSparseTerms(terminator, 0.99)
terminator_df <- as.data.frame(as.matrix(terminator))
terminator_df$label <- factor(labels)
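# Stratified 80/20 split into training and test sets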
trainer <- createDataPartition(terminator_df$label, p = 0.8, list = FALSE)
train_x <- terminator_df[trainer, ]
test_y <- terminator_df[-trainer, ]
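# Fit a Naive Bayes classifier on the training set and evaluate predictions on the test set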
very_naive <- naiveBayes(label ~ ., data = train_x)
spam_ham <- predict(very_naive, test_y)
confusionMatrix(spam_ham, test_y$label)
## Confusion Matrix and Statistics
##
## Reference
## Prediction ham spam
## ham 24 0
## spam 486 100
##
## Accuracy : 0.2033
## 95% CI : (0.172, 0.2374)
## No Information Rate : 0.8361
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0159
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.04706
## Specificity : 1.00000
## Pos Pred Value : 1.00000
## Neg Pred Value : 0.17065
## Prevalence : 0.83607
## Detection Rate : 0.03934
## Detection Prevalence : 0.03934
## Balanced Accuracy : 0.52353
##
## 'Positive' Class : ham
##
My model showed a low overall accuracy of ~20%, which is unfortunate but also a good learning lesson. The model correctly identified all of the spam emails, but it could not effectively separate out the ham emails. The low sensitivity of 0.047 (about 4.7%) reflects the general inability to classify ham emails correctly, suggesting a heavy bias towards predicting spam.
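One likely contributor is that e1071’s naiveBayes fits a Gaussian distribution to each numeric predictor, which is a poor match for sparse term counts. A common adjustment, sketched below under the assumption that terminator_df and trainer from above are still in scope, is to recode each count as a Yes/No presence factor, so the classifier estimates categorical tables instead, and to add Laplace smoothing. This often helps on this kind of data, though results will vary.
# Recode term counts as presence/absence factors (assumed improvement sketch, not part of the original pipeline)
to_presence <- function(x) factor(ifelse(x > 0, "Yes", "No"), levels = c("No", "Yes"))
term_cols <- setdiff(names(terminator_df), "label")
presence_df <- as.data.frame(lapply(terminator_df[term_cols], to_presence))
presence_df$label <- terminator_df$label
# Reuse the same train/test partition as above
train_p <- presence_df[trainer, ]
test_p <- presence_df[-trainer, ]
# Fit Naive Bayes with Laplace smoothing and evaluate on the held-out set
nb_presence <- naiveBayes(label ~ ., data = train_p, laplace = 1)
confusionMatrix(predict(nb_presence, test_p), test_p$label)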