Download the SpamAssassin files and unzip them
Read all the files in and label each one “spam” or “ham” based on which folder it came from
Some files have weird characters or formatting. Wrap readLines() in tryCatch() so R skips problem files instead of crashing
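A minimal sketch of the read-and-label step, in base R. The helper names read_email_safely() and read_folder() are made up for illustration, and the one-folder-per-class layout is an assumption about how the download was unzipped.

```r
# Hypothetical helper: read one file as a single string, or NA if R cannot read it.
read_email_safely <- function(path) {
  tryCatch(
    paste(readLines(path, warn = FALSE), collapse = "\n"),
    error = function(e) NA_character_   # skip the file instead of crashing
  )
}

# Hypothetical helper: read every file in a folder and attach the folder's label.
read_folder <- function(dir, label) {
  paths <- list.files(dir, full.names = TRUE)
  text  <- vapply(paths, read_email_safely, character(1), USE.NAMES = FALSE)
  keep  <- !is.na(text)                 # drop the files that failed to read
  data.frame(text  = text[keep],
             label = rep(label, sum(keep)),
             stringsAsFactors = FALSE)
}
```

Usage would look something like rbind(read_folder("spam", "spam"), read_folder("ham", "ham")), with the folder names swapped for wherever the files actually ended up.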
Emails are messy, so we have to strip punctuation, lowercase everything, and remove filler words like “the” and “is” so only the meaningful words are left
Build a table where each row = an email, each column = a word, each cell = how often that word appears in that email. This is called a Document-Term Matrix (DTM)
The matrix gets huge (tens of thousands of columns), so we trim it with removeSparseTerms() to keep only the words that show up in a decent share of the emails
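The cleaning, DTM and trimming steps could look roughly like this with the tm package. The four emails are made up, and the 0.6 sparsity cutoff is only chosen so the effect is visible on such a tiny set; on the real corpus a value much closer to 1 (say 0.99) is probably a better first guess, to be tuned.

```r
library(tm)

# Made-up emails, only so there is something to run the steps on.
emails <- c("WIN free money NOW!!!", "Free prize, claim it now.",
            "Meeting agenda for Monday", "Is lunch on Monday?")

corpus <- VCorpus(VectorSource(emails))
corpus <- tm_map(corpus, content_transformer(tolower))       # lowercase everything
corpus <- tm_map(corpus, removePunctuation)                  # strip punctuation
corpus <- tm_map(corpus, removeWords, stopwords("english"))  # drop filler words
corpus <- tm_map(corpus, stripWhitespace)                    # tidy leftover gaps

dtm <- DocumentTermMatrix(corpus)   # rows = emails, columns = words

# Drop terms that are missing from too many emails. With 0.6, a term survives
# only if it appears in more than 40% of the emails (here: at least 2 of 4).
dtm_small <- removeSparseTerms(dtm, 0.6)

dim(dtm)        # before trimming
dim(dtm_small)  # after trimming
```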
Split the data 80% training, 20% testing. Feed the 80% to Naive Bayes and let it learn which words are typical of spam vs ham
The model may get lazy and just guess ham for everything when ham heavily outnumbers spam. Fix by training on equal amounts of each class
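One way to do the split and the balancing in base R is sketched below. The 400/100 label counts are invented stand-ins for the real data, the seed is arbitrary, and downsampling the bigger class is just one option for getting equal amounts (the notes do not say which method to use). The test set is left at its natural mix so the evaluation still reflects real conditions.

```r
set.seed(42)  # arbitrary seed so the split is reproducible

# Invented labels standing in for the real data: 400 ham, 100 spam.
labels <- factor(c(rep("ham", 400), rep("spam", 100)))

# 80/20 split by row index.
n         <- length(labels)
train_idx <- sample(n, size = floor(0.8 * n))
test_idx  <- setdiff(seq_len(n), train_idx)

# Downsample the training rows so every class has as many rows as the smallest one.
train_by_class <- split(train_idx, labels[train_idx])
n_min          <- min(lengths(train_by_class))
balanced_idx   <- unlist(
  lapply(train_by_class, function(idx) idx[sample.int(length(idx), n_min)]),
  use.names = FALSE
)

table(labels[train_idx])     # unbalanced training set
table(labels[balanced_idx])  # equal counts per class
```

The rows in balanced_idx are what would be fed to the model; test_idx is held back for the evaluation step.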
Run the model on the 20% it hasn’t seen and check the results with a confusion matrix
If the model just guesses ham every time, it still looks decent on paper, so always check precision and recall too, not just accuracy
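To see why accuracy alone is misleading, here is a small base R sketch with an invented test set (90 ham, 10 spam) and a model that always answers ham. Spam is treated as the positive class, which is an assumption about how the metrics are meant.

```r
# Invented test set and an "always ham" model, for illustration only.
actual    <- factor(c(rep("ham", 90), rep("spam", 10)), levels = c("ham", "spam"))
predicted <- factor(rep("ham", 100), levels = c("ham", "spam"))

# Confusion matrix: rows = what the model said, columns = the truth.
cm <- table(predicted = predicted, actual = actual)

tp <- cm["spam", "spam"]  # spam correctly flagged
fp <- cm["spam", "ham"]   # ham wrongly flagged as spam
fn <- cm["ham", "spam"]   # spam that slipped through

accuracy  <- sum(diag(cm)) / sum(cm)  # 0.9, looks fine on paper
recall    <- tp / (tp + fn)           # 0, not a single spam was caught
precision <- tp / (tp + fp)           # NaN (0/0), the model never predicted spam
```

So 90% accuracy here goes together with zero recall, which is exactly the lazy-model case.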
Drop in any new email, run it through the same cleaning steps, and get a spam/ham verdict from the model
New emails might contain words the model never saw in training for one of the classes. This breaks the math (the probability goes to zero), so we fix it by turning on Laplace smoothing (laplace = 1) in naiveBayes()
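Putting the last two steps together, a rough end-to-end sketch with tm and e1071 might look like this. The training emails are made up, clean_corpus() and to_presence() are hypothetical helper names, and turning the counts into Yes/No factors is an extra assumption not stated in the notes: in e1071, laplace only affects categorical predictors, so raw numeric counts would not be smoothed.

```r
library(tm)
library(e1071)

# Made-up training data, for illustration only.
train_text  <- c("win free money now", "free prize claim now",
                 "meeting agenda for monday", "lunch on monday")
train_label <- factor(c("spam", "spam", "ham", "ham"))

# Hypothetical helper: the same cleaning steps for training and new emails.
clean_corpus <- function(x) {
  corpus <- VCorpus(VectorSource(x))
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeWords, stopwords("english"))
  tm_map(corpus, stripWhitespace)
}

# Hypothetical helper: turn word counts into "word present?" Yes/No factors.
to_presence <- function(dtm) {
  out <- as.data.frame(as.matrix(dtm) > 0)
  out[] <- lapply(out, factor, levels = c(FALSE, TRUE), labels = c("No", "Yes"))
  out
}

train_dtm <- DocumentTermMatrix(clean_corpus(train_text))
model     <- naiveBayes(to_presence(train_dtm), train_label, laplace = 1)

# New email with a word ("bitcoin") that never appeared in training.
new_text <- "Claim your FREE bitcoin prize!"

# Reuse the training vocabulary so the columns line up with the model.
new_dtm <- DocumentTermMatrix(clean_corpus(new_text),
                              control = list(dictionary = Terms(train_dtm)))

verdict <- predict(model, to_presence(new_dtm))
verdict
```

Words outside the training vocabulary (like "bitcoin" here) are simply dropped by the dictionary step; the smoothing matters for words such as "free" that were seen in one class but never in the other.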