Spam or Ham

Today, we’ll be looking at a collection of spam and ham (non-spam) email messages. We’ll gather the data into a single dataframe, then split that dataset into two parts: one to train a model and another to test it.

The files were downloaded from here according to the project instructions found on Blackboard and decompressed on my local computer. The specific files I downloaded are listed below, but you’ll need to unpack them and update the local directory paths to make this R Markdown document work on your computer.

  • 20030228_easy_ham.tar.bz2
  • 20030228_spam.tar.bz2
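Once the archives are unpacked, the emails need to end up in one dataframe with a label column. The following is only a minimal sketch of how that could be done in base R, not necessarily how it was done here: `read_email_dir` is a hypothetical helper, the `text` and `type` column names are assumptions, and the two directory paths are placeholders to replace with your own.

```r
# Hypothetical helper: read every file in `dir_path` into one row of a data frame.
read_email_dir <- function(dir_path, label) {
  files <- list.files(dir_path, full.names = TRUE)
  files <- files[!dir.exists(files)]  # keep regular files, skip subdirectories

  # Collapse each file's lines into one string. warn = FALSE silences warnings
  # about missing final newlines; skipNul = TRUE drops embedded nul bytes.
  text <- vapply(files, function(f) {
    paste(readLines(f, warn = FALSE, skipNul = TRUE), collapse = "\n")
  }, character(1), USE.NAMES = FALSE)

  data.frame(text = text, type = rep(label, length(text)),
             stringsAsFactors = FALSE)
}

# Placeholder paths (assumptions): point these at your unpacked folders.
ham_dir  <- "path/to/easy_ham"
spam_dir <- "path/to/spam"

if (dir.exists(ham_dir) && dir.exists(spam_dir)) {
  emails <- rbind(read_email_dir(ham_dir, "ham"),
                  read_email_dir(spam_dir, "spam"))
  emails$type <- factor(emails$type)
  print(table(emails$type))
}
```

The `dir.exists()` guard just keeps the sketch from failing before the paths are filled in.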

Packages

Let’s load the packages we’ll be using for this exercise:

Split Datasets

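Here we’ll split the dataframe we created into a training set and a testing set, then check whether each has a reasonable distribution of both types. The counts below are for the training set first, then the testing set. The split works out to roughly 75% training (2,250 emails) and 25% testing (751 emails) rather than an even half.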
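A random split like this can be sketched in base R as follows. The `emails` dataframe here is a stand-in with the same class totals as the counts printed below (2,501 ham and 500 spam), and both the seed and the 0.75 training fraction are assumptions, the latter picked because it reproduces the 2,250 / 751 row counts.

```r
set.seed(42)  # arbitrary seed (assumption) so the split is reproducible

# Stand-in data: same class totals as the tables in this section
emails <- data.frame(
  id   = seq_len(3001),
  type = factor(c(rep("ham", 2501), rep("spam", 500)))
)

train_frac <- 0.75  # assumed training fraction

# Sample row indices for training; every remaining row goes to testing
train_idx <- sample(nrow(emails), size = floor(train_frac * nrow(emails)))
train <- emails[train_idx, ]
test  <- emails[-train_idx, ]

# Class counts in each set
print(table(train$type))
print(table(test$type))
```

Because the rows are drawn at random, the exact ham and spam counts in each set depend on the seed, but the set sizes are always 2,250 and 751.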

## 
##  ham spam 
## 1868  382
## 
##  ham spam 
##  633  118

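It looks like both the training and testing sets have a decent distribution of spam and ham emails: spam makes up about 17% of the training set (382 of 2,250) and about 16% of the testing set (118 of 751).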

Random Forest

We’ll train a random forest model on the training set, then use it to make predictions on the testing set and see how well they match the true labels.
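The train-then-predict workflow looks roughly like the sketch below. It assumes the `randomForest` package and uses made-up word-count features (`free`, `click`, `mailing`) on simulated data, so it illustrates the steps rather than the features or settings actually used for the results that follow.

```r
set.seed(1)  # arbitrary seed (assumption)

# Simulated data: 100 ham and 100 spam messages with made-up word counts
n    <- 200
type <- factor(rep(c("ham", "spam"), each = n / 2))
features <- data.frame(
  free    = rpois(n, ifelse(type == "spam", 3.0, 0.2)),
  click   = rpois(n, ifelse(type == "spam", 2.0, 0.3)),
  mailing = rpois(n, ifelse(type == "spam", 0.2, 2.0))
)

# Hold out 50 rows for testing
train_idx <- sample(n, 150)

if (requireNamespace("randomForest", quietly = TRUE)) {
  # Fit a classification forest on the training rows
  fit <- randomForest::randomForest(
    x = features[train_idx, ],
    y = type[train_idx],
    ntree = 100
  )

  # Predict the held-out rows and cross-tabulate against the true labels
  pred     <- predict(fit, newdata = features[-train_idx, ])
  conf_mat <- table(actual = type[-train_idx], predicted = pred)
  print(conf_mat)
}
```

Naming the two `table()` dimensions makes it obvious which axis holds the true labels and which holds the predictions.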

##       
##        ham spam
##   ham  632    1
##   spam   0  118

Results

Wow, is that right? The model correctly classified 750 of the 751 test emails, or 99.87%. That seems too good to be true. I’m wondering if I made a mistake somewhere.
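As a quick check on the arithmetic, the accuracy can be recomputed from the confusion matrix above. This sketch assumes the rows are the actual labels and the columns the predictions, which is consistent with the 633 ham and 118 spam emails in the testing set. It also computes the accuracy of always guessing "ham" as a point of comparison.

```r
# Confusion matrix from the output above (filled column by column).
# Assumption: rows = actual labels, columns = predicted labels.
conf_mat <- matrix(
  c(632, 0, 1, 118), nrow = 2,
  dimnames = list(actual = c("ham", "spam"), predicted = c("ham", "spam"))
)

# Accuracy = correct predictions (the diagonal) / all predictions
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
print(round(100 * accuracy, 2))  # 99.87

# Baseline: always predicting "ham" gets every actual ham email right
baseline <- sum(conf_mat["ham", ]) / sum(conf_mat)
print(round(100 * baseline, 2))  # 84.29
```

Under that reading, the single error is a ham message that was classified as spam.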