Project 4 Document Classification
Intro
The goal of this project is to build a supervised machine learning model that classifies documents as either spam or ham (legitimate mail). I will use the SpamAssassin Public Corpus as the data source.
Approach
I plan on using the tm and tidytext packages in R to handle the raw email files. I will download and decompress the SpamAssassin tarballs, then read the files into a volatile corpus using VCorpus.
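A minimal sketch of the download and read step follows. The base URL, the archive file names, and the extracted folder name `easy_ham` are assumptions about how the corpus is published and should be checked before running; `fetch_archive` and `read_email_dir` are hypothetical helper names, and the `latin1` encoding is a guess that may need adjusting for some messages.

```r
library(tm)

# ASSUMPTION: location and file names of the corpus archives; verify first.
base_url <- "https://spamassassin.apache.org/old/publiccorpus/"
archives <- c(ham = "20021010_easy_ham.tar.bz2",
              spam = "20021010_spam.tar.bz2")

# Hypothetical helper: download one archive (if missing) and unpack it.
fetch_archive <- function(archive, dest_dir = "data") {
  dir.create(dest_dir, showWarnings = FALSE, recursive = TRUE)
  tarball <- file.path(dest_dir, archive)
  if (!file.exists(tarball)) {
    download.file(paste0(base_url, archive), tarball, mode = "wb")
  }
  untar(tarball, exdir = dest_dir)
  invisible(dest_dir)
}

# Hypothetical helper: read every file in a directory into a volatile corpus.
read_email_dir <- function(dir) {
  VCorpus(DirSource(dir, encoding = "latin1", mode = "text"),
          readerControl = list(reader = readPlain))
}

# Usage (needs network access, so left commented out):
# fetch_archive(archives[["ham"]])
# ham_corpus <- read_email_dir(file.path("data", "easy_ham"))
```

Keeping spam and ham in separate corpora at this stage makes it straightforward to attach the class label later, since the label is just the folder each message came from.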
To reduce noise, I will lowercase the text, remove punctuation, strip numbers, and remove stop words. I will then convert the cleaned corpus into a document-term matrix (DTM).
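The cleaning steps could be sketched with `tm_map` as below. The two toy messages are invented for illustration, `clean_corpus` is a hypothetical helper name, and the final `stripWhitespace` call is an extra step I am assuming is useful for tidying the gaps left by removed words.

```r
library(tm)

# Hypothetical helper: apply the cleaning transformations in order.
clean_corpus <- function(corpus) {
  corpus <- tm_map(corpus, content_transformer(tolower))       # lowercase
  corpus <- tm_map(corpus, removePunctuation)                  # drop punctuation
  corpus <- tm_map(corpus, removeNumbers)                      # strip digits
  corpus <- tm_map(corpus, removeWords, stopwords("english"))  # drop stop words
  tm_map(corpus, stripWhitespace)                              # ASSUMPTION: extra tidy-up
}

# Toy messages, made up for this example only.
toy <- VCorpus(VectorSource(c(
  "WIN $1000 NOW!!! Click here.",
  "Are we meeting at 3pm tomorrow?"
)))

# Rows are documents, columns are the terms that survive cleaning.
toy_dtm <- DocumentTermMatrix(clean_corpus(toy))
inspect(toy_dtm)
```

Lowercasing runs first because the stop word list is lowercase, so a capitalized "Are" would otherwise slip through.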
I will remove infrequent terms to keep the model from overfitting and to keep the matrix computationally manageable. To simplify the features further, I will replace term counts with a binary indicator of whether each term appears in a document.
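One way to do both steps in tm is `removeSparseTerms` followed by `weightBin`, sketched here on four invented documents. The 0.6 sparsity threshold is an arbitrary assumption chosen to suit the toy data; a real corpus would likely need a much higher value, tuned by inspection.

```r
library(tm)

# Toy documents, made up for this example only.
docs <- VCorpus(VectorSource(c(
  "free money free offer",
  "free offer today",
  "meeting notes attached",
  "free lunch meeting"
)))
dtm <- DocumentTermMatrix(docs)

# Drop terms that are missing from too many documents.
# ASSUMPTION: 0.6 is illustrative; with 4 documents it keeps terms
# that occur in at least 2 of them.
dtm_small <- removeSparseTerms(dtm, sparse = 0.6)

# Replace raw counts with 0/1 presence indicators.
dtm_bin <- weightBin(dtm_small)

inspect(dtm_bin)
```

With binary weighting, a message that repeats "free" several times is treated the same as one that uses it once, which trades some signal for a simpler feature space.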