Introduction

Every day, our inboxes are filled with messages — some are useful, while others are unwanted spam. Being able to automatically tell the difference between “ham” (legitimate emails) and “spam” (junk emails) is an important task in data science and everyday life.

In this project, I use a collection of emails that have already been labeled as spam or ham to train a computer model. The idea is simple: by showing the model many examples of both types of messages, it can learn the patterns that distinguish them. Once trained, the model can then look at new, unseen emails and predict whether they are spam or not.

To make this work, I start by reading in the raw email files, cleaning them up (removing headers, punctuation, and common words), and turning them into a structured format that the computer can understand. I then split the data into two groups: one for training the model and one for testing how well it performs. Finally, I compare different machine learning approaches — such as Naive Bayes, Logistic Regression, and Random Forest — to see which one does the best job at spotting spam.

The goal of this project is not only to build a working spam filter but also to demonstrate the process of text classification in a way that is reproducible and easy to follow.

Configuring the CRAN mirror

options(repos = c(CRAN = "https://cran.rstudio.com"))

Install and load necessary packages

req_packages <- c("tm", "SnowballC", "wordcloud", "textclean", "stringr",
                   "e1071", "glmnet", "randomForest", "caret", 
                   "pROC", "ggplot2", "gmodels")
for (pkg in req_packages) {
  if (!require(pkg, character.only = TRUE, quietly = TRUE)) {
    message(paste("Installing package:", pkg))
    install.packages(pkg, dependencies = TRUE)
  } else {
    message(paste(pkg, "already installed."))
  }
  # attach quietly so startup messages don't clutter the report
  suppressPackageStartupMessages(library(pkg, character.only = TRUE))
}

Define and read file paths

I set the working directory to the project folder, define the spam and ham folder paths, and read each message into R. (The absolute path below is specific to my machine; adjust it to wherever the corpus folders live.)

# Define paths
setwd("/Users/paulabrown/Documents/CUNY SPS- Data 607/Project4")

spam_path <- "spam"
ham_path  <- "easy_ham"

# List files
spam_files <- list.files(spam_path, full.names = TRUE, recursive = TRUE)
ham_files  <- list.files(ham_path, full.names = TRUE, recursive = TRUE)

# Read spam files
spam_texts <- lapply(spam_files, function(f) {
  paste(readLines(f, warn = FALSE, encoding = "UTF-8"), collapse = "\n")
})

# Read ham files
ham_texts <- lapply(ham_files, function(f) {
  paste(readLines(f, warn = FALSE, encoding = "UTF-8"), collapse = "\n")
})

Combine into a data frame and label spam as 1 and ham as 0

Here I combine the spam and ham messages into a single data frame, then label spam records with a 1 and ham records with a 0 to distinguish the two classes.

# Combine into data frame
docs <- data.frame(
  text = c(unlist(spam_texts), unlist(ham_texts)),
  label = c(rep("spam", length(spam_texts)), rep("ham", length(ham_texts))),
  stringsAsFactors = FALSE
)

# Convert labels to numeric
docs$label_num <- ifelse(docs$label == "spam", 1, 0)

# Normalize encoding to UTF-8
docs$text <- iconv(docs$text, from = "", to = "UTF-8", sub = "")

Strip headers before corpus creation to prevent errors in tm_map()

strip_headers <- function(x) {
  # remove headers: everything up to the first blank line
  # ((?s) lets '.' span newlines; the non-greedy *? stops at the
  # first blank line rather than the last one)
  sub("(?s).*?\\n\\n", "", x, perl = TRUE)
}

docs$text <- vapply(docs$text, strip_headers, character(1))
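
As a quick sanity check (this snippet is my addition and was not part of the original run), the first message should now start at its body rather than at a Return-Path: or Received: header:

# Peek at the first 200 characters of the first message
cat(substr(docs$text[1], 1, 200))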

Shuffle and split

I will now shuffle and split my data into “Test” and “Train” sets. I will do a 70/30 split and set a seed for reproducibility.

set.seed(123)  # for reproducibility

# Create an index for training rows
train_idx <- sample(seq_len(nrow(docs)), size = floor(0.7 * nrow(docs)))

# Split into train/test
train_data <- docs[train_idx, ]
test_data  <- docs[-train_idx, ]

Create the Document-Term Matrix

Since classifiers can't operate on raw text, I first need to convert each message into numeric features: either a Document-Term Matrix (DTM) of raw word counts or a term frequency-inverse document frequency (TF-IDF) weighted matrix.

# Build corpus
corpus <- VCorpus(VectorSource(docs$text))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))

# Create DTM
dtm <- DocumentTermMatrix(corpus)

# Split into train/test using the same indices
train_dtm <- dtm[train_idx, ]
test_dtm  <- dtm[-train_idx, ]

train_labels <- docs$label_num[train_idx]
test_labels  <- docs$label_num[-train_idx]
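
For reference, the TF-IDF weighting mentioned above is available through tm's built-in weightTfIdf control. This sketch (my addition, not used by the models below) shows that alternative, along with an optional step for dropping very rare terms:

# Alternative representation (not used below): TF-IDF weights instead of raw counts
dtm_tfidf <- DocumentTermMatrix(corpus, control = list(weighting = weightTfIdf))

# Optional: drop terms absent from 99% of documents to shrink the matrix
dtm_small <- removeSparseTerms(dtm, 0.99)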

Train Classifiers

Next, I train classifiers so that they can predict the class of new, unseen messages. I will compare three models: Naive Bayes, Logistic Regression, and Random Forest.

Naive Bayes

Naive Bayes is a simple classification method that predicts categories by calculating probabilities. It works by looking at the data and asking “based on what I’ve seen before, what’s the most likely category for this new item?”
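
In formula form, the classifier picks the label that maximizes P(class | words) ∝ P(words | class) × P(class), under the "naive" assumption that words occur independently of one another given the class.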

nb_model <- naiveBayes(as.matrix(train_dtm), as.factor(train_labels))
nb_pred <- predict(nb_model, as.matrix(test_dtm))

table(nb_pred, test_labels)
##        test_labels
## nb_pred   0   1
##       0   0   0
##       1 746 155

Interpretation:

- Every email was classified as spam (1)

- True Positives (spam correctly identified): 155

- False Positives (ham misclassified as spam): 746

- Accuracy: 155/901 ≈ 17.2%, very poor; the model failed to identify any ham emails
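
This degenerate all-spam result is largely an artifact of passing raw counts to e1071's naiveBayes(), which models numeric columns as Gaussian variables, a poor fit for sparse word counts. A common remedy, sketched below as my addition (not run above), is to recode each term as a categorical presence/absence feature:

# Sketch of a common fix: recode counts as "Yes"/"No" so naiveBayes()
# treats each term as categorical rather than Gaussian
convert_counts <- function(x) ifelse(x > 0, "Yes", "No")
train_bin <- apply(as.matrix(train_dtm), MARGIN = 2, convert_counts)
test_bin  <- apply(as.matrix(test_dtm),  MARGIN = 2, convert_counts)

nb_model2 <- naiveBayes(train_bin, as.factor(train_labels))
nb_pred2  <- predict(nb_model2, test_bin)
table(nb_pred2, test_labels)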

Logistic Regression

Logistic Regression is a method used to predict yes/no or categorical outcomes. Despite its name, it’s used for classification rather than traditional regression. It works by finding the relationship between the input variables and the probability of a particular outcome occurring.
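
Concretely, the model estimates p(spam) = 1 / (1 + e^-(b0 + b1*x1 + ... + bk*xk)), where each x is a term count. cv.glmnet() additionally applies a lasso penalty (its default) and uses cross-validation to choose the penalty strength lambda.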

log_model <- cv.glmnet(as.matrix(train_dtm), train_labels, family = "binomial")
log_pred <- predict(log_model, as.matrix(test_dtm), type = "class")

table(log_pred, test_labels)
##         test_labels
## log_pred   0   1
##        0 740 133
##        1   6  22

Interpretation:

- True Positives: 22

- True Negatives: 740

- False Positives: 6

- False Negatives: 133

- Accuracy: (740 + 22) / (740 + 133 + 6 + 22) = 762/901 ≈ 84.6%

- Caveat: accuracy is high largely because ham dominates the test set; the model catches only 22 of 155 spam messages (recall ≈ 14.2%)
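
Since caret is already loaded, these metrics can also be computed automatically rather than by hand. A sketch (my addition, not part of the original output):

# Sketch: let caret compute accuracy, sensitivity, specificity, etc.
log_pred_f <- factor(as.vector(log_pred), levels = c("0", "1"))
test_f     <- factor(test_labels, levels = c(0, 1))
confusionMatrix(log_pred_f, test_f, positive = "1")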

Random Forest

Random Forest is a powerful classification method that works like a committee of decision-makers voting on the final answer. It creates many individual decision trees (hence “forest”), where each tree makes its own prediction based on a random subset of the data and features. The final prediction is determined by majority vote across all the trees.

rf_model <- randomForest(as.matrix(train_dtm), as.factor(train_labels))
rf_pred <- predict(rf_model, as.matrix(test_dtm))

table(rf_pred, test_labels)
##        test_labels
## rf_pred   0   1
##       0 745 134
##       1   1  21

Interpretation:

- True Positives: 21

- True Negatives: 745

- False Positives: 1

- False Negatives: 134

- Accuracy: (745 + 21) / (745 + 134 + 1 + 21) = 766/901 ≈ 85.0%

- Caveat: like logistic regression, the model is very precise (only 1 false positive) but catches just 21 of 155 spam messages (recall ≈ 13.5%)
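
One advantage of Random Forest is that it records which terms mattered most to its votes. A quick sketch (my addition) for inspecting variable importance:

# Sketch: top 10 terms by mean decrease in Gini impurity
imp <- importance(rf_model)
head(imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), , drop = FALSE], 10)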

Conclusion: