Introduction
For this project, we'll do some text mining and analysis on spam and ham emails. First we'll load the libraries we'll use for this project, then store the file names of our spam and ham emails in two variables.
library(tidytext)
library(tidyverse)
library(tm)
library(stringr)
library(caret)
library(SnowballC)
spam_tr <- list.files("/Users/chesterpoon/Desktop/Data607/Project4/spam_2")
ham_tr <- list.files("/Users/chesterpoon/Desktop/Data607/Project4/easy_ham")
Get our data into a data frame
We'll now read our files into two data frames: one for spam and one for ham.
#Change into the directories where the files are and read them into a dataframe
#spam data frame
#change the working directory to get the right files
setwd("~/Desktop/Data607/Project4/spam_2")
spam <- map_df(spam_tr, ~ tibble(txt = read_lines(.x)) %>%
  mutate(filename = .x, type = "spam"))
#Set an integer ID for each email
spam$ID <- cumsum(!duplicated(spam[2]))
#ham data frame
#change the working directory to get the right files
setwd("~/Desktop/Data607/Project4/easy_ham")
ham <- map_df(ham_tr, ~ tibble(txt = read_lines(.x)) %>%
  mutate(filename = .x, type = "ham"))
#Set an integer ID for each email
ham$ID <- cumsum(!duplicated(ham[2]))
#combining into a single data frame
spam_ham_train <- full_join(spam, ham, by = c("ID","txt","type","filename"))
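One caveat worth flagging: both ID sequences start at 1, so a spam email and a ham email can share the same ID, which would merge two different emails into a single document when we cast a document term matrix later. Below is a minimal sketch of one way to avoid this (not applied here, so the outputs that follow reflect the original IDs); bind_rows is also a more direct way to stack the two frames than a full join.
#stack the two frames and make each email's ID unique by prefixing its type,
#so "spam_1" and "ham_1" stay distinct
spam_ham_train <- bind_rows(spam, ham) %>%
  mutate(ID = paste(type, ID, sep = "_"))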
knitr::kable(head(spam_ham_train), format = "html")
| txt | filename | type | ID |
|---|---|---|---|
| From ilug-admin@linux.ie Tue Aug 6 11:51:02 2002 | 00001.317e78fa8ee2f54cd4890fdc09ba8176 | spam | 1 |
| Return-Path: <ilug-admin@linux.ie>; | 00001.317e78fa8ee2f54cd4890fdc09ba8176 | spam | 1 |
| Delivered-To: yyyy@localhost.netnoteinc.com | 00001.317e78fa8ee2f54cd4890fdc09ba8176 | spam | 1 |
| Received: from localhost (localhost [127.0.0.1]) | 00001.317e78fa8ee2f54cd4890fdc09ba8176 | spam | 1 |
| by phobos.labs.netnoteinc.com (Postfix) with ESMTP id 9E1F5441DD | 00001.317e78fa8ee2f54cd4890fdc09ba8176 | spam | 1 |
| for <jm@localhost>;; Tue, 6 Aug 2002 06:48:09 -0400 (EDT) | 00001.317e78fa8ee2f54cd4890fdc09ba8176 | spam | 1 |
Our data frame has four columns, and each row represents one line from an email. I've assigned an integer ID to each email, which will be useful later when we create our document term matrix, and I labeled each email as ham or spam based on how it was classified in the corpus provided for this project.
Tidy our data
Now our data is ready to be prepared. I'll use the tidytext library to clean up our text so that each row represents one word from an email. We'll take the following steps:
- Use unnest_tokens to create one row per word
- Remove stop words
- Take out numbers/digits
- Reduce each word to its stem
- Add a column counting each word's occurrences by email type
- Remove any whitespace
#get each word into a row and get rid of stop words
spam_ham_cl <- spam_ham_train %>%
  unnest_tokens(word, txt) %>%
  anti_join(stop_words) %>%
  #take out numbers
  filter(!str_detect(word, "^[0-9]*$")) %>%
  #get just the word stem
  mutate(word = wordStem(word)) %>%
  #count each word's occurrences by email type
  count(type, word, sort = TRUE)
## Joining, by = "word"
#take out whitespace
spam_ham_cl$word <- gsub("\\s+", "", spam_ham_cl$word)
knitr::kable(head(spam_ham_cl), format = "html")
| type | word | n |
|---|---|---|
| spam | font | 33911 |
| spam | 3d | 32154 |
| spam | br | 16751 |
| spam | td | 15692 |
| ham | list | 15018 |
| ham | id | 14763 |
TF-IDF Analysis
We'll now use tf-idf to analyze which words are most important to each document type, treating all spam as one document and all ham as another. We'll use the convenient bind_tf_idf function to compute the values and arrange them in descending order to get the top tf-idf scores. Note that with only two documents, idf can take just two values: log(2/2) = 0 for words appearing in both types, and log(2/1) ≈ 0.693 for words appearing in only one.
spam_ham_tfidf <- spam_ham_cl %>%
  bind_tf_idf(word, type, n) %>%
  arrange(desc(tf_idf))
knitr::kable(head(spam_ham_tfidf), format = "html")
| type | word | n | tf | idf | tf_idf |
|---|---|---|---|---|---|
| ham | example.com | 7676 | 0.0100666 | 0.6931472 | 0.0069776 |
| ham | exmh | 4888 | 0.0064103 | 0.6931472 | 0.0044433 |
| ham | freshrpms.net | 3780 | 0.0049572 | 0.6931472 | 0.0034361 |
| spam | helvetica | 3316 | 0.0042733 | 0.6931472 | 0.0029620 |
| ham | zzzlist | 2879 | 0.0037756 | 0.6931472 | 0.0026171 |
| spam | serif | 2906 | 0.0037449 | 0.6931472 | 0.0025958 |
Every word above has the same idf (log 2 ≈ 0.6931472) because each appears in only one of the two document types, and tf_idf is simply tf × idf (e.g., 0.0100666 × 0.6931472 ≈ 0.0069776). Now we'll display the top 15 tf-idf scores for words in our spam and ham emails.
spam_tf_idf <- spam_ham_tfidf %>%
  filter(type == 'spam') %>%
  top_n(15, tf_idf)
ham_tf_idf <- spam_ham_tfidf %>%
  filter(type == 'ham') %>%
  top_n(15, tf_idf)
ggplot(spam_tf_idf, aes(reorder(word, tf_idf), tf_idf)) +
  geom_col(aes(fill = tf_idf)) +
  coord_flip() +
  xlab('Word') +
  ggtitle('Spam') +
  scale_fill_gradient() +
  theme_bw()

ggplot(ham_tf_idf, aes(reorder(word, tf_idf), tf_idf)) +
  geom_col(aes(fill = tf_idf)) +
  coord_flip() +
  xlab('Word') +
  ggtitle('Ham') +
  scale_fill_gradient() +
  theme_bw()
Creating a Document Term Matrix
We'll now create a document term matrix that could feed a machine learning algorithm for classifying emails as spam or ham. We'll repeat the same tidying steps as earlier, but this time we'll count words per email ID rather than per type.
spam_ham_df <- spam_ham_train %>%
  #get each word into a row
  unnest_tokens(word, txt) %>%
  #remove numbers and any NA values
  filter(!str_detect(word, "^[0-9]*$"),
         !is.na(word)) %>%
  #remove stop words
  anti_join(stop_words) %>%
  #get just the word stem
  mutate(word = wordStem(word))
#take out whitespace
spam_ham_df$word <- gsub("\\s+", "", spam_ham_df$word)
#add word counts per email
spam_ham_df <- spam_ham_df %>%
  count(ID, word, type)
#create our dtm with tf-idf weighting
spam_ham_dtm <- spam_ham_df %>%
  cast_dtm(document = ID, term = word, value = n,
           weighting = tm::weightTfIdf)
#drop terms absent from more than 99% of documents
spam_ham_dtm <- removeSparseTerms(spam_ham_dtm, sparse = 0.99)
spam_ham_dtm
## <<DocumentTermMatrix (documents: 2551, terms: 2863)>>
## Non-/sparse entries: 437107/6866406
## Sparsity           : 94%
## Maximal term length: 65
## Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)
Challenges
This project was particularly challenging, and several things did not pan out. I found the tm package difficult to work with. The tidytext package was far easier, but I couldn't figure out how to restrict the analysis to just the email body or subject rather than the full message, headers included. I also attempted to build a predictive model without success; I wasn't clear on where to begin or what steps a model requires. The next step for this project is to research machine learning methods so I can do the modeling.
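For the body-text problem, one possible starting point: in a standard RFC 822 message the headers end at the first blank line, so we can keep only the lines after that point. A minimal sketch (body_only is a hypothetical name, and this ignores any corpus-specific quirks):
#keep only the lines after the first blank line in each file,
#i.e. the message body under standard RFC 822 formatting
body_only <- spam_ham_train %>%
  group_by(filename) %>%
  filter(cumsum(txt == "") > 0) %>%
  ungroup()
For the modeling, one possible starting point uses the caret package loaded earlier. This is a sketch, not a tested pipeline: it assumes the document IDs have been made unique per email as suggested above (so each row of the matrix is one email) and that the glmnet package is installed for the regularized regression caret calls here.
#one label per DTM row, matched by document ID
doc_labels <- spam_ham_df %>% distinct(ID, type)
labels <- factor(doc_labels$type[match(Docs(spam_ham_dtm), doc_labels$ID)])
#hold out 25% of emails for testing
dtm_mat <- as.matrix(spam_ham_dtm)
set.seed(607)
in_train <- createDataPartition(labels, p = 0.75, list = FALSE)
#regularized logistic regression with 5-fold cross-validation
fit <- train(x = dtm_mat[in_train, ], y = labels[in_train],
             method = "glmnet",
             trControl = trainControl(method = "cv", number = 5))
#evaluate on the held-out emails
preds <- predict(fit, dtm_mat[-in_train, ])
confusionMatrix(preds, labels[-in_train])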