Introduction

For this project, we’ll be doing text mining and analysis on spam and ham emails. First, we’ll load the libraries we’ll need and store the file names of our spam and ham emails in variables.

library(tidytext)
library(tidyverse)
library(tm)
library(stringr)
library(caret)
library(SnowballC)

#list the raw email file names in each corpus directory
spam_tr <- list.files("/Users/chesterpoon/Desktop/Data607/Project4/spam_2")
ham_tr <- list.files("/Users/chesterpoon/Desktop/Data607/Project4/easy_ham")

Get our data into a data frame

We’ll now read our files into two data frames: one for spam and one for ham.

#Change into the directories where the files are and read them into a dataframe

#spam data frame
#change the working directory to get the right files
setwd("~/Desktop/Data607/Project4/spam_2") 
spam <- map_df(spam_tr, ~ tibble(txt = read_lines(.x)) %>%
  mutate(filename = .x, type = "spam"))

#Set an integer ID for each email
spam$ID <- cumsum(!duplicated(spam[2]))

#ham data frame
#change the working directory to get the right files
setwd("~/Desktop/Data607/Project4/easy_ham")
ham <- map_df(ham_tr, ~ tibble(txt = read_lines(.x)) %>%
  mutate(filename = .x, type = "ham"))

#Set an integer ID for each email, offset past the spam IDs so the
#two sets don't collide when combined
ham$ID <- cumsum(!duplicated(ham[2])) + max(spam$ID)

#combining into a single data frame
spam_ham_train <- bind_rows(spam, ham)

knitr::kable(head(spam_ham_train), format = "html")
txt                                                                filename                                 type   ID
From ilug-admin@linux.ie Tue Aug 6 11:51:02 2002                   00001.317e78fa8ee2f54cd4890fdc09ba8176   spam   1
Return-Path: <ilug-admin@linux.ie>                                 00001.317e78fa8ee2f54cd4890fdc09ba8176   spam   1
Delivered-To: yyyy@localhost.netnoteinc.com                        00001.317e78fa8ee2f54cd4890fdc09ba8176   spam   1
Received: from localhost (localhost [127.0.0.1])                   00001.317e78fa8ee2f54cd4890fdc09ba8176   spam   1
by phobos.labs.netnoteinc.com (Postfix) with ESMTP id 9E1F5441DD   00001.317e78fa8ee2f54cd4890fdc09ba8176   spam   1
for <jm@localhost>; Tue, 6 Aug 2002 06:48:09 -0400 (EDT)           00001.317e78fa8ee2f54cd4890fdc09ba8176   spam   1

Our data frame has four columns, and each row represents one line of an email. I’ve assigned an integer ID to each email, which will be useful when we create our document term matrix, and I’ve labeled each email as ham or spam based on how it was classified in the corpus linked for this project.

Tidy our data

Now we’re ready to tidy the data. I’ll use the tidytext library to clean up the text so that each row represents a single word from an email. We’ll take the following steps:

#get each word into a row and get rid of stop words
spam_ham_cl <- spam_ham_train %>%
  unnest_tokens(word, txt) %>%
  anti_join(stop_words) %>%
  #take out numbers  
  filter(!str_detect(word, "^[0-9]*$")) %>%
  #Get just the word stem
  mutate(word = wordStem(word)) %>%
  #add count column
  count(type,word,sort = TRUE) 
## Joining, by = "word"
#take out whitespace
spam_ham_cl$word <- gsub("\\s+","",spam_ham_cl$word)

knitr::kable(head(spam_ham_cl), format = "html")
type   word   n
spam   font   33911
spam   3d     32154
spam   br     16751
spam   td     15692
ham    list   15018
ham    id     14763

TF-IDF Analysis

We’ll now use tf-idf to analyze which words are the most important to each document type, treating all spam as one document and all ham as another. The convenient bind_tf_idf function computes the values, and we arrange them in descending order to see the top tf-idf scores.

spam_ham_tfidf <- spam_ham_cl %>%
  bind_tf_idf(word,type,n) %>%
  arrange(desc(tf_idf))

knitr::kable(head(spam_ham_tfidf), format = "html")
type   word            n      tf          idf         tf_idf
ham    example.com     7676   0.0100666   0.6931472   0.0069776
ham    exmh            4888   0.0064103   0.6931472   0.0044433
ham    freshrpms.net   3780   0.0049572   0.6931472   0.0034361
spam   helvetica       3316   0.0042733   0.6931472   0.0029620
ham    zzzlist         2879   0.0037756   0.6931472   0.0026171
spam   serif           2906   0.0037449   0.6931472   0.0025958
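
The constant idf column is no accident: with only two “documents” here (all spam and all ham), any word that appears in exactly one of them has idf = ln(2/1) ≈ 0.6931, while a word appearing in both has idf = 0 and drops out of the rankings. A quick arithmetic check of the first row, using the tf value straight from the table:

#with two document types, a word found in only one of them has idf = ln(2/1)
log(2)
## [1] 0.6931472

#tf-idf for "example.com" in ham is its tf times ln(2), matching the table
0.0100666 * log(2)
## [1] 0.006977636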

Now we’ll display the top 15 tf-idf scores for words in our spam and ham emails.

spam_tf_idf <- spam_ham_tfidf %>%
  filter(type == 'spam') %>%
  top_n(15, tf_idf)

ham_tf_idf <- spam_ham_tfidf %>%
  filter(type == 'ham') %>%
  top_n(15, tf_idf)

ggplot(spam_tf_idf, aes(reorder(word, tf_idf), tf_idf)) +
  geom_col(aes(fill = tf_idf)) +
  coord_flip() +
  xlab('Word') +
  ggtitle('Spam') +
  scale_fill_gradient() +
  theme_bw()

ggplot(ham_tf_idf, aes(reorder(word, tf_idf), tf_idf)) +
  geom_col(aes(fill = tf_idf)) +
  coord_flip() +
  xlab('Word') +
  ggtitle('Ham') +
  scale_fill_gradient() +
  theme_bw()

Creating a Document Term Matrix

We’ll now create a document term matrix that could later feed a machine learning algorithm to classify emails as spam or ham. We’ll repeat the same tidying steps as before, but this time we’ll hold off on counting until the end, so that words are counted within each email ID rather than across each type.

spam_ham_df <- spam_ham_train %>%
  #get each word into a row.
  unnest_tokens(word, txt) %>%
  #Remove numbers and any NA values
  filter(!str_detect(word, "^[0-9]*$"),
         !is.na(word)) %>%
  #Remove stop words
  anti_join(stop_words) %>%
  #get just the word stem
  mutate(word = wordStem(word)) %>%
  ungroup()

#take out whitespace
spam_ham_df$word <- gsub("\\s+","",spam_ham_df$word)

#add word counts
spam_ham_df <- spam_ham_df %>%
  count(ID, word, type) 

#create our dtm
spam_ham_dtm <- spam_ham_df %>%
  cast_dtm(document = ID, term = word, value = n,
           weighting = tm::weightTfIdf)

#remove sparse terms: drop any word appearing in fewer than 1% of emails
spam_ham_dtm <- removeSparseTerms(spam_ham_dtm, sparse = 0.99)

spam_ham_dtm
## <<DocumentTermMatrix (documents: 2551, terms: 2863)>>
## Non-/sparse entries: 437107/6866406
## Sparsity           : 94%
## Maximal term length: 65
## Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)
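
To spot-check the weighted matrix itself, tm’s inspect function can preview a small corner of it; this is just an illustrative call, and the row and column ranges are arbitrary:

#peek at the first few documents and terms
inspect(spam_ham_dtm[1:5, 1:5])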

Challenges

This project was particularly challenging, and I was unsuccessful with several things. I found the tm package difficult to work with; the tidytext package was far easier, but I was unable to figure out how to restrict the analysis to only the text in the email body or subject. I also attempted to create a predictive model without success, since I wasn’t clear on where to begin or what steps were needed. The next step for this project is to research machine learning workflows so I can do the modeling; a rough sketch of one possible starting point is below.
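
As a hedged, untested sketch only: the document term matrix built above could feed into caret (which we loaded at the start but never used). It assumes the spam_ham_dtm and spam_ham_df objects from the previous section; the helper names, the seed, and the rpart method are all just illustrative choices, and the x/y interface is used because many word columns aren’t syntactic R names:

#one ham/spam label per email, ordered to match the DTM's document IDs
doc_ids <- as.integer(spam_ham_dtm$dimnames$Docs)
labels <- spam_ham_df %>% distinct(ID, type)
y <- factor(labels$type[match(doc_ids, labels$ID)])

#the DTM as a plain data frame of tf-idf features
x <- as.data.frame(as.matrix(spam_ham_dtm))

#stratified 75/25 train/test split
set.seed(607)
in_train <- createDataPartition(y, p = 0.75, list = FALSE)

#fit a simple classification tree; any other caret method could be swapped in
fit <- train(x = x[in_train, ], y = y[in_train], method = "rpart")
preds <- predict(fit, newdata = x[-in_train, ])
confusionMatrix(preds, y[-in_train])

confusionMatrix would then report accuracy, sensitivity, and specificity on the held-out emails, which would be a natural first measure of the classifier.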