Project 4: Document Classification

Author

Ciara Bonnett

Intro

The goal of this project is to develop a supervised machine learning model that classifies emails as either spam or ham (legitimate mail). I will use the SpamAssassin Public Corpus as the labeled data set.

Approach

I plan to use the tm and tidytext packages in R to process the raw email files. I will download and decompress the SpamAssassin tarballs, then read the extracted files into a volatile (in-memory) corpus with VCorpus.
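A minimal sketch of the download and load step, under some assumptions: the helper names fetch_corpus_archive and read_email_dir are hypothetical, the "data" folder is a placeholder, and the latin1 encoding is a guess, since the raw emails may mix encodings.

```r
library(tm)

# Hypothetical helper: download one tarball (if not already present)
# and unpack it under `dest`.
fetch_corpus_archive <- function(url, dest = "data") {
  dir.create(dest, showWarnings = FALSE, recursive = TRUE)
  tarball <- file.path(dest, basename(url))
  if (!file.exists(tarball)) {
    download.file(url, tarball, mode = "wb")
  }
  untar(tarball, exdir = dest)
  invisible(dest)
}

# Hypothetical helper: read every file in a directory into a volatile
# corpus, one document per file. The encoding is an assumption.
read_email_dir <- function(path) {
  VCorpus(
    DirSource(path, encoding = "latin1"),
    readerControl = list(reader = readPlain)
  )
}
```

Usage might look like the lines below. The URL and the extracted folder name are assumptions about how the corpus is published and should be checked against the corpus site before running.

```r
# fetch_corpus_archive("https://spamassassin.apache.org/old/publiccorpus/20030228_spam.tar.bz2")
# spam_corpus <- read_email_dir(file.path("data", "spam"))
```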

To reduce noise, I will lowercase the text, remove punctuation, strip numbers, and eliminate stop words. I will then convert the cleaned corpus into a document-term matrix (DTM), with one row per email and one column per term.
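The cleaning steps can be sketched with tm's built-in transformations. The two-document toy corpus is made up for illustration, clean_corpus is a hypothetical helper name, and the English stop word list is an assumption about which list will be used.

```r
library(tm)

# Hypothetical helper: apply the four cleaning steps in order.
clean_corpus <- function(corpus) {
  corpus <- tm_map(corpus, content_transformer(tolower))        # lowercase
  corpus <- tm_map(corpus, removePunctuation)                   # drop punctuation
  corpus <- tm_map(corpus, removeNumbers)                       # drop digits
  corpus <- tm_map(corpus, removeWords, stopwords("english"))   # drop stop words
  corpus
}

# Toy corpus standing in for the real emails.
toy <- VCorpus(VectorSource(c(
  "WIN 1000 dollars NOW!!!",
  "Meeting moved to the afternoon."
)))

# One row per document, one column per remaining term.
dtm <- DocumentTermMatrix(clean_corpus(toy))
inspect(dtm)
```

Note that DocumentTermMatrix drops terms shorter than three characters by default, so very short tokens disappear even if they are not stop words.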

I will remove infrequent terms to keep the model from overfitting and to keep the matrix computationally manageable. I will also encode each term as a binary indicator (present or absent in an email) rather than a raw count, which simplifies the features.
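One way to do both steps in tm is removeSparseTerms followed by weightBin. This is a sketch on a made-up four-document corpus, and the 0.70 sparsity threshold is purely illustrative. The real corpus will likely need a much higher value, such as 0.95 or above.

```r
library(tm)

# Toy documents standing in for cleaned emails.
docs <- c(
  "free money free offer",
  "free meeting notes",
  "project meeting agenda",
  "money offer today"
)
dtm <- DocumentTermMatrix(VCorpus(VectorSource(docs)))

# Keep only terms whose share of empty documents is below 0.70.
# With four documents, that means terms appearing in at least two of them.
dtm_small <- removeSparseTerms(dtm, sparse = 0.70)

# Replace counts with 0/1 presence indicators.
dtm_bin <- weightBin(dtm_small)

inspect(dtm_bin)
```

Here "free" occurs twice in the first document, but the binary weighting records it as 1, so the model sees only whether a term is present.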