STEP 1: GET THE DATA

Download the SpamAssassin email corpus and unzip it.

STEP 2: LOAD INTO R

Read all the files in and label each one “spam” or “ham” based on the folder it came from.

HURDLE:

Some files contain odd characters or broken formatting. Wrap readLines() in tryCatch() so R skips the problem files instead of crashing.
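A minimal sketch of the loading step with the tryCatch() guard, using base R only. The helper names read_email() and load_folder() are made up for illustration, and the two temp folders stand in for the real spam and ham folders.

```r
# Hypothetical helper: read one email as a single string, or NULL if it cannot be read.
read_email <- function(path) {
  tryCatch(
    paste(readLines(path, warn = FALSE), collapse = "\n"),
    error = function(e) NULL   # skip the problem file instead of stopping
  )
}

# Hypothetical helper: read every file in a folder and tag it with a label.
load_folder <- function(dir, label) {
  paths <- list.files(dir, full.names = TRUE)
  texts <- lapply(paths, read_email)
  keep  <- !vapply(texts, is.null, logical(1))   # drop files that failed to load
  data.frame(
    text  = as.character(unlist(texts[keep])),
    label = rep(label, sum(keep)),
    stringsAsFactors = FALSE
  )
}

# Tiny stand-in corpus in a temp folder (made-up emails, not the real data).
spam_dir <- file.path(tempdir(), "corpus_demo", "spam")
ham_dir  <- file.path(tempdir(), "corpus_demo", "ham")
dir.create(spam_dir, recursive = TRUE, showWarnings = FALSE)
dir.create(ham_dir,  recursive = TRUE, showWarnings = FALSE)
writeLines("Win a free prize now", file.path(spam_dir, "0001"))
writeLines("Meeting moved to Monday", file.path(ham_dir, "0001"))

emails <- rbind(load_folder(spam_dir, "spam"), load_folder(ham_dir, "ham"))
emails$label <- factor(emails$label)
table(emails$label)
```

Only errors are caught here. Catching warnings as well would throw away files that are merely untidy, which is probably not what we want.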

STEP 3: CLEAN THE TEXT

Emails are messy, so we strip punctuation, lowercase everything, and remove filler words (stop words) like “the” and “is”, keeping only the meaningful words.
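One way to sketch the cleaning step, assuming the tm package (the notes only name removeSparseTerms(), so using tm for cleaning as well is our assumption). The two emails are made up.

```r
library(tm)

# Made-up example emails.
raw <- c("WIN a FREE prize!!! Click here now.",
         "The meeting is moved to Monday, see you there.")

corpus <- VCorpus(VectorSource(raw))
corpus <- tm_map(corpus, content_transformer(tolower))   # lowercase everything
corpus <- tm_map(corpus, removeWords, stopwords("en"))   # drop filler words such as "the", "is"
corpus <- tm_map(corpus, removePunctuation)              # strip punctuation
corpus <- tm_map(corpus, stripWhitespace)                # collapse the gaps left behind

# Pull the cleaned text back out as a plain character vector.
cleaned <- trimws(unlist(lapply(corpus, as.character), use.names = FALSE))
cleaned
```

Lowercasing comes first because the stop word list is lowercase. Stop words are removed before punctuation so that contractions in the list still match.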

STEP 4: TURN WORDS INTO NUMBERS

Build a table where each row is an email, each column is a word, and each cell counts how often that word appears in that email. This is called a Document-Term Matrix (DTM).

HURDLE:

The matrix gets huge (tens of thousands of columns), so we trim it with removeSparseTerms() to keep only the words that appear in a decent share of the emails.
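A small sketch of building and trimming the DTM with tm. The five documents are made up and already cleaned, and sparse = 0.7 is chosen only so the effect shows on such a tiny example.

```r
library(tm)

# Made-up, already-cleaned emails.
docs <- c("free prize win free",
          "meeting agenda monday",
          "win cash prize",
          "monday meeting notes",
          "free cash offer")

corpus <- VCorpus(VectorSource(docs))
dtm <- DocumentTermMatrix(corpus)   # rows = emails, columns = words, cells = counts
dim(dtm)

# Drop words that are missing from too many emails.
# With 5 emails and sparse = 0.7, a word must appear in at least 2 of them to survive.
dtm_small <- removeSparseTerms(dtm, sparse = 0.7)
dim(dtm_small)
Terms(dtm_small)
```

On a real corpus a value close to 1 is more typical. For example, sparse = 0.99 keeps words that appear in more than roughly 1% of the emails.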

STEP 5: TRAIN THE MODEL

Split the data 80% training, 20% testing. Feed the 80% to Naive Bayes and let it learn which words are typical of spam vs ham.
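A sketch of the split and training step, assuming naiveBayes() from the e1071 package. The data frame is synthetic: two made-up word-presence columns stand in for the real DTM columns.

```r
library(e1071)

set.seed(123)  # make the random split reproducible

# Synthetic stand-in for the real features (made up for illustration).
n <- 200
label <- factor(sample(c("ham", "spam"), n, replace = TRUE, prob = c(0.7, 0.3)))
p_free    <- ifelse(label == "spam", 0.8, 0.1)   # "free" shows up mostly in spam
p_meeting <- ifelse(label == "spam", 0.1, 0.6)   # "meeting" shows up mostly in ham
emails <- data.frame(
  free    = factor(ifelse(runif(n) < p_free, "Yes", "No"), levels = c("No", "Yes")),
  meeting = factor(ifelse(runif(n) < p_meeting, "Yes", "No"), levels = c("No", "Yes")),
  label   = label
)

# 80% of the rows for training, the remaining 20% for testing.
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
train <- emails[train_idx, ]
test  <- emails[-train_idx, ]

model <- naiveBayes(label ~ ., data = train)
pred  <- predict(model, test)   # predicted class for each held-out email

model$tables$free   # per-class share of emails without / with the word "free"
```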

HURDLE: way more ham than spam in the dataset

The model may get lazy and just guess ham for everything. Fix by training on equal amounts of each class, for example by downsampling ham.
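A base R sketch of balancing by downsampling, under the assumption that we only rebalance the training set and leave the test set with its natural mix. The 80/20 training frame is made up.

```r
set.seed(1)  # reproducible sampling

# Made-up imbalanced training set: 80 ham, 20 spam.
train <- data.frame(
  id    = 1:100,
  label = factor(c(rep("ham", 80), rep("spam", 20)))
)

# Size of the smallest class.
n_min <- min(table(train$label))

# From each class, keep n_min randomly chosen rows.
rows_by_class <- split(seq_len(nrow(train)), train$label)
keep <- unlist(
  lapply(rows_by_class, function(i) i[sample.int(length(i), n_min)]),
  use.names = FALSE
)

balanced <- train[keep, ]
table(balanced$label)   # 20 ham, 20 spam
```

Downsampling throws ham emails away, so it only makes sense when there is plenty of data to spare.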

STEP 6: TEST IT

Run the model on the 20% it hasn’t seen and check the results with a confusion matrix.

HURDLE: accuracy can be misleading

If the model just guesses ham every time, it still looks decent on paper, so always check precision and recall too, not just accuracy.
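A base R sketch of why accuracy misleads. The labels are made up: 90 ham and 10 spam, scored against a "lazy" model that answers ham every time. Spam is treated as the positive class.

```r
# Made-up test labels and a lazy model that always predicts ham.
actual <- factor(c(rep("spam", 10), rep("ham", 90)), levels = c("ham", "spam"))
lazy   <- factor(rep("ham", 100), levels = c("ham", "spam"))

# Confusion matrix: rows = predicted, columns = actual.
cm <- table(predicted = lazy, actual = actual)
cm

tp <- cm["spam", "spam"]   # spam correctly flagged
fp <- cm["spam", "ham"]    # ham wrongly flagged as spam
fn <- cm["ham", "spam"]    # spam that slipped through

accuracy  <- sum(diag(cm)) / sum(cm)   # 0.9, looks fine
recall    <- tp / (tp + fn)            # 0, not one spam email was caught
precision <- tp / (tp + fp)            # NaN, the model never predicted spam at all
c(accuracy = accuracy, precision = precision, recall = recall)
```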

STEP 7: PREDICT NEW EMAILS

Drop in any new email, run it through the same cleaning steps, and get a spam/ham verdict from the model.
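A rough end-to-end sketch of scoring one new email, assuming tm and e1071. The helpers clean_corpus() and to_presence() are hypothetical names, the four training emails are made up, and turning counts into No/Yes columns is our choice here because, as far as we know, naiveBayes() treats numeric columns as Gaussian rather than as word counts.

```r
library(tm)
library(e1071)

# Hypothetical helper: the same cleaning steps for training and new emails.
clean_corpus <- function(texts) {
  corpus <- VCorpus(VectorSource(texts))
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removeWords, stopwords("en"))
  corpus <- tm_map(corpus, removePunctuation)
  tm_map(corpus, stripWhitespace)
}

# Hypothetical helper: turn word counts into "No"/"Yes" factor columns.
to_presence <- function(dtm) {
  counts <- as.data.frame(as.matrix(dtm))
  as.data.frame(lapply(counts, function(x) {
    factor(x > 0, levels = c(FALSE, TRUE), labels = c("No", "Yes"))
  }))
}

# Made-up training data.
train_text  <- c("Win a free prize now", "Free cash offer, click now",
                 "Meeting moved to Monday", "Agenda for the Monday meeting")
train_label <- factor(c("spam", "spam", "ham", "ham"))

train_dtm <- DocumentTermMatrix(clean_corpus(train_text))
vocab     <- Terms(train_dtm)   # the words the model knows about
model     <- naiveBayes(to_presence(train_dtm), train_label, laplace = 1)

# New email: same cleaning, and the DTM is restricted to the training vocabulary
# so its columns line up with the columns the model was trained on.
new_email <- "Claim your FREE prize today!"
new_dtm   <- DocumentTermMatrix(clean_corpus(new_email),
                                control = list(dictionary = vocab))
verdict   <- predict(model, to_presence(new_dtm))
verdict
```

Words in the new email that are not in vocab are simply dropped by the dictionary option, so they do not reach the model at all.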

HURDLE:

New emails might contain words the model never saw for one of the classes in training. This breaks the math: a single zero probability wipes out the whole product for that class. Fix by turning on Laplace smoothing (laplace = 1) in naiveBayes().
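A worked toy calculation of the zero-probability problem in base R. All counts are made up, and the smoothing formula is the usual add-one version for a present/absent feature. Whether naiveBayes() computes exactly this internally is an assumption.

```r
# Made-up counts: training emails per class, and how many contain the word "prize".
n_class     <- c(ham = 50, spam = 50)
n_with_word <- c(ham = 0,  spam = 20)   # "prize" never appeared in ham

# Made-up likelihood of a second word in the same email, and equal priors.
p_other <- c(ham = 0.6, spam = 0.1)
prior   <- c(ham = 0.5, spam = 0.5)

# Without smoothing: P(prize | ham) is exactly 0.
p_raw <- n_with_word / n_class
score_raw <- prior * p_raw * p_other     # the ham score is 0 whatever else the email says

# With Laplace smoothing: add 1 to each count, and 2 to the total
# (one for each possible outcome, present or absent).
laplace  <- 1
p_smooth <- (n_with_word + laplace) / (n_class + 2 * laplace)
score_smooth <- prior * p_smooth * p_other   # small for ham, but no longer zero

rbind(score_raw, score_smooth)
```

With smoothing, ham can still win if the rest of the email looks strongly like ham. Without it, one unlucky word decides everything.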