MA_Data607_Project4
Approach
First, I need to read the ham and spam files into a data frame. I will combine the spam and ham data sets and then preprocess the text. I plan to use a naive Bayes model because, from my reading and research, it is a classic model for spam classification. Because this model relies on word frequencies, I need to split the text of each file into words and count how many times each word appears in the ham files and in the spam files.
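To make the word-counting idea concrete, here is a minimal sketch in Python using only the standard library. It is not the project's actual implementation: the function names (`tokenize`, `train`, `spam_log_odds`), the toy messages, and the use of add-one (Laplace) smoothing are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Assumption: lowercase the text and keep runs of letters/apostrophes as words.
    return re.findall(r"[a-z']+", text.lower())

def train(messages):
    """Count words per class. `messages` is an iterable of (label, text)
    pairs, where label is 'ham' or 'spam' (hypothetical input format)."""
    word_counts = {"ham": Counter(), "spam": Counter()}
    doc_counts = Counter()
    for label, text in messages:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    return word_counts, doc_counts

def spam_log_odds(text, word_counts, doc_counts):
    """Return log P(spam | text) - log P(ham | text) under naive Bayes
    with add-one smoothing. Positive values lean spam, negative lean ham."""
    vocab = set(word_counts["ham"]) | set(word_counts["spam"])
    v = len(vocab)
    totals = {c: sum(word_counts[c].values()) for c in ("ham", "spam")}
    # Start from the class prior ratio (share of spam vs. ham messages).
    score = math.log(doc_counts["spam"] / doc_counts["ham"])
    for word in tokenize(text):
        if word not in vocab:
            continue  # ignore words never seen in training
        p_spam = (word_counts["spam"][word] + 1) / (totals["spam"] + v)
        p_ham = (word_counts["ham"][word] + 1) / (totals["ham"] + v)
        score += math.log(p_spam / p_ham)
    return score

# Toy data, invented purely for illustration.
toy = [
    ("spam", "win free money now"),
    ("spam", "free prize claim now"),
    ("ham", "meeting at noon tomorrow"),
    ("ham", "lunch tomorrow at noon"),
]
word_counts, doc_counts = train(toy)
free_money_score = spam_log_odds("free money", word_counts, doc_counts)
noon_meeting_score = spam_log_odds("noon meeting", word_counts, doc_counts)
```

In this sketch a message is scored by summing the log ratio of each word's smoothed frequency in spam versus ham, so words that appear more often in spam push the score up and words that appear more often in ham push it down.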
Challenges
One major step that will definitely be trial and error is fine-tuning the training and testing split, along with the threshold value at which something is classified as spam or not. Preprocessing the “noise” out of the text, such as punctuation, so that it does not distort the data or the word counts, is also important.
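The three tuning points mentioned above (noise removal, the train/test split, and the spam threshold) can be sketched roughly as follows. This is an illustrative Python outline, not the project's code: the helper names, the 80/20 default split, the seed, and the example scores are all assumed values chosen for the demonstration.

```python
import random
import string

def clean(text):
    # Lowercase, drop ASCII punctuation, and collapse repeated whitespace.
    table = str.maketrans("", "", string.punctuation)
    return " ".join(text.lower().translate(table).split())

def train_test_split(items, test_fraction=0.2, seed=607):
    """Shuffle reproducibly and split into (train, test).
    The 0.2 fraction and the seed are assumptions to be tuned."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def accuracy_at(threshold, scored):
    """`scored` is a list of (true_label, spam_probability) pairs.
    A message is called spam when its probability meets the threshold."""
    correct = sum((p >= threshold) == (label == "spam") for label, p in scored)
    return correct / len(scored)

cleaned = clean("Hello, World!!  FREE $$$")
train_set, test_set = train_test_split(range(10), test_fraction=0.2)

# Invented scores, only to show how the threshold changes accuracy.
scored = [("spam", 0.9), ("ham", 0.2), ("ham", 0.7), ("spam", 0.4)]
acc_half = accuracy_at(0.5, scored)
acc_strict = accuracy_at(0.8, scored)
```

Trying several values of `test_fraction` and `threshold` and comparing the resulting accuracy is one simple way to carry out the trial and error described here.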