Today, we’ll be looking at a collection of spam and ham (not spam) email messages. We’ll gather the data into a single dataframe, and then split that dataset into two parts: one to train a model and another to test it.
The files were downloaded from here according to the project instructions found on Blackboard and decompressed onto my local computer. I’ll share the specific files I downloaded below, but you’ll need to unpack the files and update the local directories to make this R Markdown file work on your computer.
Let’s load the packages we’ll be using for this exercise:
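A minimal sketch of the loading step, assuming the packages are the ones that supply the functions used later in this post (`str_c`, `bind_rows`, `tm_map`, `stemDocument`, `randomForest`). The package list is inferred from those calls, and `SnowballC` is included on the assumption that `stemDocument` relies on it for stemming.

```r
# Assumed package list, inferred from the functions used later in this post
pkgs <- c("stringr",       # str_c()
          "dplyr",         # bind_rows(), mutate(), select(), %>%
          "tm",            # Corpus(), tm_map(), DocumentTermMatrix()
          "SnowballC",     # stemming backend assumed for stemDocument()
          "randomForest")  # randomForest()

# Report anything that still needs install.packages()
missing_pkgs <- pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]
if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
}

# Attach the packages that are available
for (p in setdiff(pkgs, missing_pkgs)) {
  suppressPackageStartupMessages(library(p, character.only = TRUE))
}
```

If the message lists any packages, install them with `install.packages()` before running the rest of the code.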
Let’s start by setting the local directories where we’ll find all the files:
# string paths in windows
ham_path <- 'C:/Downloads/docs/data607/spamham/easy_ham'
spam_path <- 'C:\\Downloads\\docs\\data607\\spamham\\spam'

Let’s loop through all the files in each folder and add them to a dataframe:
# file names in each folder
hlist <- list.files(ham_path)
slist <- list.files(spam_path)
# one element per email, filled in by the loops below
hams <- character(length(hlist))
spams <- character(length(slist))
# read each ham file and collapse its lines into a single string
for (i in seq_along(hlist)) {
  current_file <- str_c(ham_path, '/', hlist[i])
  read_file <- readLines(current_file)
  one_line <- paste(read_file, collapse = '\n')
  hams[i] <- one_line
}
# same for the spam files
for (i in seq_along(slist)) {
  current_file <- str_c(spam_path, '/', slist[i])
  read_file <- readLines(current_file)
  one_line <- paste(read_file, collapse = '\n')
  spams[i] <- one_line
}
ham <- data.frame(fn = hlist,
                  text = unlist(hams),
                  type = 'ham',
                  stringsAsFactors = FALSE)
spam <- data.frame(fn = slist,
                   text = unlist(spams),
                   type = 'spam',
                   stringsAsFactors = FALSE)
# dataframe with filename, email text, and spam type
df_all <- bind_rows(ham, spam) %>%
  mutate(type = as.factor(type))

Here we’ll take the text from each email, clean it up, and then build a dataframe of term counts with one row per email.
corpus <- Corpus(VectorSource(df_all$text))
# content_transformer() wraps base functions like tolower so the corpus structure stays intact
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords())
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, stemDocument)
dtm <- DocumentTermMatrix(corpus)
# Let's keep terms that occur in more than 1% of the emails
dtm_small <- removeSparseTerms(dtm, 0.99)
dm <- as.matrix(dtm_small)
dfc <- data.frame(dm)
# final dataframe with the first column as the email spam type
dfs <- bind_cols(select(df_all, spam_type = type), dfc)

Here we’ll split the dataframe we created into a training set and a testing set. We’ll put 75% of the emails in the training set and the remaining 25% in the testing set, then check to see if there’s a reasonable distribution of both types.
set.seed(222)
# 75% of the rows go to training, the rest to testing
ss <- floor(nrow(dfs) * 0.75)
si <- sample(seq_len(nrow(dfs)), ss)
train <- dfs[si,]
test <- dfs[-si,]
table(train$spam_type)
##
##  ham spam
## 1868  382
table(test$spam_type)
##
##  ham spam
##  633  118
It looks like both the training and testing sets have a decent distribution of spam and ham emails: roughly one in six emails is spam in each.
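Counts are hard to compare across sets of different sizes, so here is a small sketch that converts them to percentages. The counts are typed in by hand from the two tables above, and the names `train_counts` and `test_counts` are just for illustration.

```r
# Class counts copied from the two tables above
train_counts <- c(ham = 1868, spam = 382)
test_counts  <- c(ham = 633,  spam = 118)

# Share of each class within each set, as percentages
train_pct <- round(100 * prop.table(train_counts), 1)
test_pct  <- round(100 * prop.table(test_counts), 1)

train_pct  # ham 83.0, spam 17.0
test_pct   # ham 84.3, spam 15.7
```

With the data frames in memory, `prop.table(table(train$spam_type))` should give the same shares without retyping the counts.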
We’ll train a random forest model on the training set, then make predictions on the testing set and see how well it fits.
# rfm <- randomForest(train[,2:ncol(train)], train$spam_type)
rfm <- randomForest(spam_type ~ ., data = train)
test$pred <- predict(rfm, newdata = test)
table(test$spam_type, test$pred)
##
##        ham spam
##   ham  632    1
##   spam   0  118
Wow, is that right? The model correctly predicted 750 of the 751 test emails, or 99.87%. That seems too good to be true. I’m wondering if I made a mistake somewhere.
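As a quick sanity check on that number, here is a sketch that recomputes the accuracy from the confusion matrix. The matrix is typed in by hand from the output above, assuming rows are the actual type and columns the predicted type. The majority-class baseline (always guessing ham) is a comparison I’m adding for context, not part of the model.

```r
# Confusion matrix copied from the output above (rows = actual, columns = predicted)
cm <- matrix(c(632,   1,
                 0, 118),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("ham", "spam"),
                             predicted = c("ham", "spam")))

accuracy <- sum(diag(cm)) / sum(cm)     # correct predictions / all predictions
baseline <- max(rowSums(cm)) / sum(cm)  # always guessing the majority class, ham

round(100 * c(accuracy = accuracy, baseline = baseline), 2)
# accuracy 99.87, baseline 84.29
```

The baseline shows that guessing ham every time would already be right about 84% of the time, so the headline accuracy is better judged against that figure than against 50%.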